Claude Context Overflow Prevention
Status: Production Ready
Overview
Long, tool-heavy conversations with Claude eventually exhaust the context window. Without preventive measures, the conversation either errors out at the API boundary or silently degrades in quality as critical context is squeezed out.
QRY runs a three-layer defense that activates progressively as conversations grow. Each layer addresses a different failure mode and a different cost / quality trade-off.
The three layers
Layer 0 — Truncation (default 8 KB threshold)
_truncate_tool_result_for_history compresses oversized tool_result entries down to metadata plus 5 sample rows, while the full result is stored in Redis under a data_ref. The user-facing conversation receives the full streamed result; only Claude's history sees the truncated form.
Why this exists: a single SQL query returning 50 MB of rows shouldn't poison the conversation's context for the next 20 turns. Truncation keeps the history compact while still preserving the semantic content: "this query returned ~5,000 rows of customer data".
Cost: zero — it's pure local string manipulation.
Trade-off: if a follow-up question genuinely needs the full row content (rare), Claude has to re-query rather than recall from history. The data_ref allows backend tools (notebooks, downloads) to retrieve the original.
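A minimal sketch of the Layer 0 mechanics. The function name and the data_ref concept come from QRY; the payload shape, Redis key scheme, and redis_client are illustrative assumptions:

```python
import json
import uuid

# Layer 0 defaults (see Configuration below).
TRUNCATION_THRESHOLD_BYTES = 8 * 1024
TRUNCATION_SAMPLE_ROWS = 5

def _truncate_tool_result_for_history(result: dict, redis_client) -> dict:
    """Compress an oversized tool_result for Claude's history only.

    Sketch: assumes tabular results shaped as {"columns": [...], "rows": [...]};
    the real payload shape and Redis plumbing are QRY internals.
    """
    raw = json.dumps(result)
    if len(raw.encode("utf-8")) <= TRUNCATION_THRESHOLD_BYTES:
        return result  # small enough to keep verbatim

    # Park the full result in Redis so backend tools (notebooks,
    # downloads) can retrieve it later via the data_ref.
    data_ref = f"data_ref/{uuid.uuid4().hex}"
    redis_client.set(data_ref, raw)

    rows = result.get("rows", [])
    return {
        "truncated": True,
        "data_ref": data_ref,
        "row_count": len(rows),
        "columns": result.get("columns", []),
        "sample_rows": rows[:TRUNCATION_SAMPLE_ROWS],
    }
```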
Layer 1 — Server-side compaction (default 120 K tokens)
Anthropic's clear_tool_uses_20250919 beta feature drops old tool_use / tool_result pairs from the conversation history server-side, keeping the user-facing messages intact. Activates when the conversation reaches the configured token threshold.
Why this exists: tool calls (SQL, Python, file reads) accumulate quickly. After 20 turns of analysis, the bulk of the context is tool_result blobs. Stripping the oldest pairs reclaims most of the room without losing the user's natural-language thread.
Cost: zero extra API calls — happens inline with the streaming request.
Trade-off: Claude can no longer recall exactly what an old tool returned. Prose summaries of those answers (in subsequent assistant messages) usually carry the load.
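In request terms, Layer 1 rides along on the normal messages call. The sketch below follows the shape of Anthropic's context-editing beta; the beta header and the trigger / keep fields are best-effort renderings of that API and may differ across SDK versions, and conversation_history is assumed:

```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    betas=["context-management-2025-06-27"],
    messages=conversation_history,  # assumed: the accumulated turn list
    context_management={
        "edits": [
            {
                "type": "clear_tool_uses_20250919",
                # Layer 1 default: fire once input reaches ~120K tokens.
                "trigger": {"type": "input_tokens", "value": 120_000},
                # Keep the most recent tool interactions intact.
                "keep": {"type": "tool_uses", "value": 3},
            }
        ]
    },
)
```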
Layer 2 — Client-side compaction (default 150 K tokens)
The Anthropic SDK's CompactionControl triggers a summarisation API call that condenses the entire conversation into a tighter representation. Activates only when Layer 1 hasn't been enough — long conversations with substantive prose, not just tool calls.
Why this exists: stripping tool_uses reclaims tokens but doesn't help if the prose itself is too long. A summarised version preserves the gist and lets the conversation continue.
Cost: one extra API call per compaction event. Compaction events are rare (most conversations never hit Layer 2).
Trade-off: the summary is necessarily lossy — fine-grained details from earlier in the conversation may be lost. The summary prompt is domain-specific and tries to preserve data_refs, notebook IDs / cells, table names, and artifacts that downstream tools need to find.
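Schematically, Layer 2 reduces to a token count, one summarisation call, and a history rewrite. This is an illustrative sketch, not the SDK's CompactionControl API; maybe_compact and SUMMARY_PROMPT are hypothetical names:

```python
LAYER2_TOKEN_THRESHOLD = 150_000

# Abridged stand-in for the domain-specific prompt described below.
SUMMARY_PROMPT = (
    "Summarise the conversation so far, preserving every data_ref, "
    "notebook ID, cell ID, table name, and artifact ID verbatim."
)

def maybe_compact(client, history: list[dict], system_prompt: str) -> list[dict]:
    """Summarise the whole conversation once it outgrows the threshold."""
    used = client.messages.count_tokens(
        model="claude-sonnet-4-5", messages=history
    ).input_tokens
    if used < LAYER2_TOKEN_THRESHOLD:
        return history  # Layer 2 not needed

    # One extra API call per compaction event. Assumes history ends on
    # an assistant turn, so appending a user message is legal.
    summary = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system=system_prompt,  # re-injected; see "SDK quirk fixes" below
        messages=history + [{"role": "user", "content": SUMMARY_PROMPT}],
    )
    condensed = summary.content[0].text
    return [{"role": "user", "content": f"[Conversation summary] {condensed}"}]
```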
Configuration
The thresholds for all three layers live in runtime_config.ContextOverflowConfig:
| Setting | Default | What it controls |
|---|---|---|
| truncation_threshold_bytes | 8 KB | Layer 0 — when to compress a tool_result |
| truncation_sample_rows | 5 | Layer 0 — how many rows to keep in the truncated form |
| layer1_token_threshold | 120,000 | Layer 1 — when server-side compaction kicks in |
| layer2_token_threshold | 150,000 | Layer 2 — when client-side summarisation kicks in |
For tenants with unusually long conversations or unusually expensive models, raise the thresholds (more context retained, less aggressive compaction). For cost-sensitive tenants on cheap models, lower them.
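A plausible shape for that config, using the field names from the table (the dataclass form and the tuning example are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ContextOverflowConfig:
    """Per-tenant thresholds; defaults match the table above."""
    truncation_threshold_bytes: int = 8 * 1024  # Layer 0
    truncation_sample_rows: int = 5             # Layer 0
    layer1_token_threshold: int = 120_000       # Layer 1
    layer2_token_threshold: int = 150_000       # Layer 2

# Cost-sensitive tenant on a cheap model: compact earlier.
budget_tenant = ContextOverflowConfig(
    layer1_token_threshold=80_000,
    layer2_token_threshold=100_000,
)
```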
SDK quirk fixes
QRYStreamingToolRunner._check_and_compact() overrides the SDK's behaviour to fix two upstream bugs:
- The SDK discards its own tool_use cleanup on the compaction API call. Without the override, calling Layer 2 would undo Layer 1's work.
- The SDK omits the system prompt on the compaction call. Without the override, the summary would lose the system-prompt context (domain context, ABAC policies, etc.).
The override re-injects the system prompt and preserves Layer 1's cleanup before invoking the SDK's compaction.
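In outline, the override looks something like the sketch below. Only _check_and_compact is a real name from QRY; the base class, request attributes, and stored fields are stand-ins:

```python
class StreamingToolRunner:
    """Stand-in for the SDK's runner; the real base class is assumed."""
    def _check_and_compact(self, request):
        ...  # SDK behaviour: issues the compaction API call

class QRYStreamingToolRunner(StreamingToolRunner):
    def _check_and_compact(self, request):
        # Bug 1: the SDK drops its own tool_use cleanup on the compaction
        # call. Re-attach Layer 1's context management so Layer 2 does not
        # undo Layer 1's work.
        request.context_management = self._layer1_context_management

        # Bug 2: the SDK omits the system prompt on the compaction call.
        # Re-inject it so the summary keeps the system-prompt context
        # (domain context, ABAC policies, etc.).
        if getattr(request, "system", None) is None:
            request.system = self._system_prompt

        return super()._check_and_compact(request)
```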
Domain-specific summary prompt
Layer 2's summary prompt is tuned for QRY's typical content. It's instructed to preserve:
- data_ref identifiers (so future turns can retrieve the original tool_result).
- Notebook IDs and cell IDs referenced in the conversation.
- Table names and column names used.
- Artifacts (created dashboards, scheduled tasks, etc.) referenced by ID.
- Schema / module bindings.
This is in contrast to a generic chat summary, which would happily flatten "the customers table has 5,000 rows" into "the user looked at customer data" — losing the row count.
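An abridged, illustrative version of such a prompt (the production wording differs):

```python
SUMMARY_PROMPT = """Summarise this analysis conversation for continuation.
Preserve VERBATIM, never paraphrase:
- every data_ref identifier
- notebook IDs and cell IDs
- table and column names that were queried
- IDs of created artifacts (dashboards, scheduled tasks, ...)
- schema / module bindings
Keep concrete figures (row counts, date ranges) exactly as stated."""
```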
Operational signals
- qry_context_layer0_truncations_total — counter; how often tool_results were compressed.
- qry_context_layer1_compactions_total — counter; how often server-side compaction fired.
- qry_context_layer2_compactions_total — counter; how often the expensive client-side summarisation fired.
- qry_context_window_exceeded_total — counter; should be near zero. If non-zero, the layers aren't keeping up — raise thresholds or investigate which conversation pattern is overflowing.
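With prometheus_client, for example, the counters could be declared as follows (a sketch; QRY's actual metrics plumbing is not shown, and prometheus_client appends the _total suffix on exposition):

```python
from prometheus_client import Counter

LAYER0_TRUNCATIONS = Counter(
    "qry_context_layer0_truncations",
    "Tool_results compressed by Layer 0 truncation.",
)
LAYER1_COMPACTIONS = Counter(
    "qry_context_layer1_compactions",
    "Server-side compaction events (Layer 1).",
)
LAYER2_COMPACTIONS = Counter(
    "qry_context_layer2_compactions",
    "Client-side summarisation events (Layer 2).",
)
WINDOW_EXCEEDED = Counter(
    "qry_context_window_exceeded",
    "Overflows despite all three layers; should stay near zero.",
)

# e.g. LAYER0_TRUNCATIONS.inc() at the truncation site in Layer 0.
```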
Why three layers, not one
The three-layer design is a deliberate trade-off:
- Layer 0 handles 95% of pressure for free.
- Layer 1 handles another 4% with no API cost.
- Layer 2 catches the remaining 1% at the cost of one extra API call per event.
A simpler design with only Layer 2 would work but cost meaningfully more on every long conversation, even when Layer 0/1 would have been enough.
See Also
- LLM providers — admin configuration including these thresholds.
- Conversation Checkpoints — separate user-facing rewind, complementary to context overflow.
Last updated: 2026-05-04