
Claude Context Overflow Prevention

Status: Production Ready

Overview

Long, tool-heavy conversations with Claude eventually exhaust the context window. Without preventive measures, the conversation either errors at the API boundary or silently degrades in quality as critical context gets squeezed out.

QRY runs a three-layer defense that activates progressively as conversations grow. Each layer addresses a different failure mode and a different cost / quality trade-off.

The three layers

Layer 0 — Truncation (default 8 KB threshold)

_truncate_tool_result_for_history compresses oversized tool_result entries down to metadata plus 5 sample rows, while the full result is stored in Redis under a data_ref key. The user-facing conversation receives the full streamed result; only Claude's history sees the truncated form.

Why this exists: a single SQL query returning 50 MB of rows shouldn't poison the conversation's context for the next 20 turns. Truncation keeps the history compact while preserving "this query returned ~5,000 rows of customer data" semantically.

Cost: zero — it's pure local string manipulation.

Trade-off: if a follow-up question genuinely needs the full row content (rare), Claude has to re-query rather than recall from history. The data_ref allows backend tools (notebooks, downloads) to retrieve the original.
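
The shape of this truncation can be sketched in a few lines. Everything below is illustrative: a plain dict stands in for Redis, and the function signature is a guess at the behaviour described above, not QRY's actual implementation.

```python
import json
import uuid

# Stand-in for Redis: any key-value store with set/get would do here.
FAKE_REDIS: dict[str, str] = {}

TRUNCATION_THRESHOLD_BYTES = 8 * 1024  # Layer 0 default
TRUNCATION_SAMPLE_ROWS = 5

def truncate_tool_result_for_history(rows: list[dict]) -> dict:
    """Compress an oversized tool_result to metadata + sample rows.

    The full payload is parked under a data_ref so backend tools
    (notebooks, downloads) can retrieve it later; only the model's
    history sees the truncated form.
    """
    payload = json.dumps(rows)
    if len(payload.encode("utf-8")) <= TRUNCATION_THRESHOLD_BYTES:
        return {"truncated": False, "rows": rows}

    data_ref = f"data_ref:{uuid.uuid4().hex}"
    FAKE_REDIS[data_ref] = payload  # full result survives out-of-band
    return {
        "truncated": True,
        "data_ref": data_ref,
        "total_rows": len(rows),
        "sample_rows": rows[:TRUNCATION_SAMPLE_ROWS],
    }
```

The truncated form keeps the "this query returned ~5,000 rows" semantics while the data_ref points back at the full payload.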

Layer 1 — Server-side compaction (default 120 K tokens)

Anthropic's clear_tool_uses_20250919 beta feature drops old tool_use / tool_result pairs from the conversation history server-side, keeping the user-facing messages intact. Activates when the conversation reaches the configured token threshold.

Why this exists: tool calls (SQL, Python, file reads) accumulate quickly. After 20 turns of analysis, the bulk of the context is tool_result blobs. Stripping the oldest pairs reclaims most of the room without losing the user's natural-language thread.

Cost: zero extra API calls — happens inline with the streaming request.

Trade-off: Claude can no longer recall exactly what an old tool returned. Prose summaries of those answers (in subsequent assistant messages) usually carry the load.
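
The real edit happens server-side inside Anthropic's API, so it can't be reproduced exactly here, but its effect on the history can be approximated locally. The sketch below shows which blocks survive: the oldest tool_use / tool_result pairs go, text blocks stay.

```python
def clear_old_tool_uses(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Approximate the server-side edit: drop all but the most recent
    `keep_last` tool_use/tool_result block pairs, leaving the
    natural-language text blocks untouched.
    """
    # Index every tool block by order of appearance.
    tool_positions = []
    for mi, msg in enumerate(messages):
        for bi, block in enumerate(msg["content"]):
            if block["type"] in ("tool_use", "tool_result"):
                tool_positions.append((mi, bi))

    # Each surviving pair is one tool_use block + one tool_result block.
    to_drop = set(tool_positions[: max(0, len(tool_positions) - 2 * keep_last)])

    cleared = []
    for mi, msg in enumerate(messages):
        kept = [b for bi, b in enumerate(msg["content"]) if (mi, bi) not in to_drop]
        if kept:  # drop messages emptied entirely by the edit
            cleared.append({**msg, "content": kept})
    return cleared
```

After clearing, the prose thread ("step 0 … step n") is intact while only the newest tool payloads remain in context.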

Layer 2 — Client-side compaction (default 150 K tokens)

The Anthropic SDK's CompactionControl triggers a summarisation API call that condenses the entire conversation into a tighter representation. It activates only when Layer 1 hasn't been enough, which in practice means long conversations dominated by substantive prose rather than tool calls.

Why this exists: stripping tool_uses reclaims tokens but doesn't help if the prose itself is too long. A summarised version preserves the gist and lets the conversation continue.

Cost: one extra API call per compaction event. Compaction events are rare (most conversations never hit Layer 2).

Trade-off: the summary is necessarily lossy — fine-grained details from earlier in the conversation may be lost. The summary prompt is domain-specific and tries to preserve data_refs, notebook IDs / cells, table names, and artifacts that downstream tools need to find.
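
The trigger-and-summarise flow can be sketched as follows. The `summarise` callable stands in for the extra API call, the token estimate is a crude characters/4 heuristic, and the data_ref pattern is hypothetical; only the threshold and the preserve-identifiers behaviour come from this doc.

```python
import re

LAYER2_TOKEN_THRESHOLD = 150_000
DATA_REF_PATTERN = re.compile(r"data_ref:[0-9a-f]+")  # assumed id shape

def estimate_tokens(messages: list[dict]) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(str(m)) for m in messages) // 4

def maybe_compact(messages, summarise):
    """If the conversation is over budget, replace it with one summary
    message. Extracted data_refs are appended verbatim so downstream
    tools can still resolve them even if the summary dropped one.
    """
    if estimate_tokens(messages) <= LAYER2_TOKEN_THRESHOLD:
        return messages, False
    refs = sorted(set(DATA_REF_PATTERN.findall(str(messages))))
    text = summarise(messages) + "\n\nPreserved identifiers: " + ", ".join(refs)
    return [{"role": "user", "content": [{"type": "text", "text": text}]}], True
```

Belt-and-braces identifier extraction like this is one way to bound the lossiness: the prose summary may blur details, but the identifiers survive mechanically.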

Configuration

All three thresholds live in runtime_config.ContextOverflowConfig:

Setting                     Default   What it controls
truncation_threshold_bytes  8 KB      Layer 0 — when to compress a tool_result
truncation_sample_rows      5         Layer 0 — how many rows to keep in the truncated form
layer1_token_threshold      120,000   Layer 1 — when server-side compaction kicks in
layer2_token_threshold      150,000   Layer 2 — when client-side summarisation kicks in

For tenants with unusually long conversations or unusually expensive models, raise the thresholds (more memory, less aggressive compaction). For cost-sensitive tenants on cheap models, lower them.
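
Assuming the table maps one-to-one onto fields, the config shape might look like the sketch below. Field names mirror the table; the real ContextOverflowConfig definition may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextOverflowConfig:
    """Sketch of the runtime_config shape; defaults from the table."""
    truncation_threshold_bytes: int = 8 * 1024
    truncation_sample_rows: int = 5
    layer1_token_threshold: int = 120_000
    layer2_token_threshold: int = 150_000

# Hypothetical per-tenant override: a long-conversation tenant gets
# more headroom before either compaction layer fires.
long_context_tenant = ContextOverflowConfig(
    layer1_token_threshold=160_000,
    layer2_token_threshold=180_000,
)
```

Keeping Layer 1's threshold below Layer 2's preserves the intended ordering: the free server-side compaction always gets a chance to fire before the paid summarisation does.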

SDK quirk fixes

QRYStreamingToolRunner._check_and_compact() overrides the SDK's behaviour to fix two upstream bugs:

  1. The SDK discards its own tool_use cleanup on the compaction API call. Without the override, calling Layer 2 would undo Layer 1's work.
  2. The SDK omits the system prompt on the compaction call. Without the override, the summary would lose the system-prompt context (domain context, ABAC policies, etc.).

The override re-injects the system prompt and preserves Layer 1's cleanup before invoking the SDK's compaction.
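
The shape of such an override can be sketched as follows. Class and method names here are illustrative placeholders, not the SDK's actual API; the two comments mark where each upstream bug is worked around.

```python
class CompactionRunnerSketch:
    """Illustrative stand-in for the streaming runner's compaction hook."""

    def __init__(self, system_prompt: str, messages: list):
        self.system_prompt = system_prompt
        self.messages = messages

    def _sdk_compact(self, messages: list, system: str) -> list:
        # Placeholder for the SDK's summarisation call.
        return [{"role": "user", "content": f"[summary of {len(messages)} messages]"}]

    def check_and_compact(self, cleaned_messages: list) -> list:
        # Fix 1: pass the Layer-1-cleaned history, not the raw one, so
        # the compaction call doesn't undo the tool_use cleanup.
        # Fix 2: forward the system prompt explicitly so the summary
        # keeps domain context and ABAC policies in view.
        self.messages = self._sdk_compact(cleaned_messages, system=self.system_prompt)
        return self.messages
```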

Domain-specific summary prompt

Layer 2's summary prompt is tuned for QRY's typical content. It's instructed to preserve:

  • data_ref identifiers (so future turns can retrieve the original tool_result).
  • Notebook IDs and cell ids referenced in the conversation.
  • Table names and column names used.
  • Artifacts (created dashboards, scheduled tasks, etc.) referenced by id.
  • Schema / module bindings.

This is in contrast to a generic chat summary, which would happily flatten "the customers table has 5,000 rows" into "the user looked at customer data" — losing the row count.
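
One way such a preservation list might be folded into the prompt is sketched below. The actual wording is internal to QRY; only the list of preserved identifier kinds comes from this doc.

```python
PRESERVE_INSTRUCTIONS = """When summarising, preserve VERBATIM:
- every data_ref identifier
- notebook IDs and cell ids
- table names and column names
- artifact ids (dashboards, scheduled tasks, etc.)
- schema / module bindings
Include exact counts and figures (e.g. row counts) rather than paraphrasing them.
"""

def build_summary_prompt(conversation_text: str) -> str:
    # Hypothetical assembly of the Layer 2 summary prompt.
    return (
        "Summarise the analysis conversation below so it can continue "
        "in a fresh context.\n"
        + PRESERVE_INSTRUCTIONS
        + "\n<conversation>\n" + conversation_text + "\n</conversation>"
    )
```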

Operational signals

  • qry_context_layer0_truncations_total — counter; how often tool_results were compressed.
  • qry_context_layer1_compactions_total — counter; how often server-side compaction fired.
  • qry_context_layer2_compactions_total — counter; how often the expensive client-side summarisation fired.
  • qry_context_window_exceeded_total — counter; should be near zero. If non-zero, the layers aren't keeping up — raise thresholds or investigate which conversation pattern is overflowing.
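
Instrumentation at each activation point can be as simple as incrementing a shared counter. The sketch below uses a stdlib Counter rather than a real metrics client; the metric names are the ones listed above.

```python
from collections import Counter

METRICS: Counter = Counter()

def record(metric: str) -> None:
    """Called at each layer's activation point."""
    METRICS[metric] += 1

def overflow_alert() -> bool:
    # The last-resort counter should stay near zero; anything else
    # means the three layers aren't keeping up.
    return METRICS["qry_context_window_exceeded_total"] > 0

# Simulated activity: two truncations, one server-side compaction.
record("qry_context_layer0_truncations_total")
record("qry_context_layer0_truncations_total")
record("qry_context_layer1_compactions_total")
```

A healthy system shows the pyramid described below: Layer 0 fires far more often than Layer 1, which fires far more often than Layer 2, and the exceeded counter stays at zero.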

Why three layers, not one

The three-layer design is a deliberate trade-off:

  • Layer 0 handles 95% of pressure for free.
  • Layer 1 handles another 4% with no API cost.
  • Layer 2 catches the remaining 1% at the cost of one extra API call per event.

A simpler design with only Layer 2 would work but cost meaningfully more on every long conversation, even when Layer 0/1 would have been enough.

Last updated: 2026-05-04

QRY, a product of IXEN.