06 / 20 Context & Autocompact (src/services/compact/ · src/utils/tokens.ts)

167K token threshold. 4 compaction layers.

Token counting uses a hybrid approach: API usage data when available, character/4 heuristic for new messages. Autocompact triggers at effectiveWindow - 13K buffer. Claude summarizes old messages in a single maxTurns=1 call with NO tools allowed. Circuit breaker after 3 consecutive failures.

200K — Default window, expandable to 1M with [1m]
13K — Buffer; autocompact triggers at window - 13K
20K — Reserved for summary output (p99.99 summary = 17.4K)
3 — Max consecutive failures before the circuit breaker trips

How Token Counting Works

Not a simple tokenizer call — a hybrid system designed for accuracy during streaming with parallel tool calls.

1. tokenCountWithEstimation(messages) — the canonical function for threshold checks. Walks messages backwards to find the last message with API usage data (input_tokens from response headers).
2. Find the first sibling (parallel tool handling) — when the model makes 10 parallel tool calls, streaming emits separate records that all share the same message.id. The function walks forward to the first sibling so that all interleaved results are included.
3. Rough estimation for new messages — roughTokenCountEstimationForMessages() uses a characters/4 ≈ tokens heuristic with 4/3× conservative padding for messages added since the last API call returned usage data.
4. Return API usage + rough estimate — final count = last known API input_tokens + estimated tokens for new messages. This is what gets compared against the autocompact threshold.
Official token counting is also available via messages.countTokens() on the Anthropic SDK — used for precise budget checks. But the hybrid estimation is faster for the hot-path threshold check.

Threshold Calculation

The math (autoCompact.ts lines 72-91)

  1. effectiveContextWindow = getContextWindowForModel(model) - reservedTokensForSummary
    reservedTokensForSummary = min(model's max_output_tokens, 20,000) — caps at 20K since p99.99 summary is 17.4K tokens
  2. autocompactThreshold = effectiveContextWindow - 13,000 (AUTOCOMPACT_BUFFER_TOKENS)
  3. Example for 200K model: 200,000 - 20,000 = 180,000 effective → 180,000 - 13,000 = 167,000 token trigger

Override: CLAUDE_AUTOCOMPACT_PCT_OVERRIDE (0-100) triggers at a percentage of the effective window instead of the fixed 13K buffer
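The same math as a small sketch. Constant names mirror the description above; the rounding behavior for the percentage override is an assumption.

```typescript
// Constants from the threshold calculation above.
const RESERVED_SUMMARY_CAP = 20_000;      // p99.99 summary is 17.4K, so 20K suffices
const AUTOCOMPACT_BUFFER_TOKENS = 13_000;

// pctOverride models CLAUDE_AUTOCOMPACT_PCT_OVERRIDE (0-100);
// exact rounding in the real code is an assumption.
function autocompactThreshold(
  contextWindow: number,
  maxOutputTokens: number,
  pctOverride?: number,
): number {
  const reserved = Math.min(maxOutputTokens, RESERVED_SUMMARY_CAP);
  const effectiveWindow = contextWindow - reserved;
  if (pctOverride !== undefined) {
    return Math.floor(effectiveWindow * (pctOverride / 100));
  }
  return effectiveWindow - AUTOCOMPACT_BUFFER_TOKENS;
}
```

For a 200K model this reproduces the worked example: 200,000 − 20,000 = 180,000 effective, minus the 13K buffer → 167,000.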

Token Budget Distribution

- System: 5-15K
- Memory: 2-10K
- Conversation: (grows)
- Tool results: (grow)
- Buffer: 13K

4 Compaction Strategies

Applied in order. Each layer is independent — they don't all fire every turn.

1. Snip — Drop old messages entirely

Feature-gated (HISTORY_SNIP). Removes messages above a boundary. The cheapest option — no API call needed. The snipTokensFreed parameter tracks the tokens freed by snipping, since the API's usage data still counts the removed messages.

2. Microcompact — Clear tool result content (3 sub-paths)

- Time-based: if the gap since the last assistant message exceeds 60 minutes (the server's cache TTL), clear all compactable tool results except the last 5. Saves ~20-100K tokens.
- Cached (ANT-only): queue cache_edits so the API deletes tool results without invalidating the cached prefix.
- Compactable tools: Read, Bash, Shell, Grep, Glob, WebSearch, WebFetch, Edit, Write.
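A sketch of the time-based sub-path under the rules just described. Function and type names here are illustrative, not from the source.

```typescript
const CACHE_TTL_MS = 60 * 60 * 1000; // server's ~60min cache TTL
const KEEP_LAST = 5;
const COMPACTABLE = new Set([
  "Read", "Bash", "Shell", "Grep", "Glob",
  "WebSearch", "WebFetch", "Edit", "Write",
]);

interface ToolResult { tool: string; content: string }

// Clear compactable tool results except the last five, but only once the
// conversation has gone stale past the cache TTL (nothing cached to lose).
function microcompact(results: ToolResult[], msSinceLastAssistant: number): ToolResult[] {
  if (msSinceLastAssistant <= CACHE_TTL_MS) return results;
  const compactable = results
    .map((r, i) => (COMPACTABLE.has(r.tool) ? i : -1))
    .filter((i) => i !== -1);
  const keep = new Set(compactable.slice(-KEEP_LAST));
  return results.map((r, i) =>
    COMPACTABLE.has(r.tool) && !keep.has(i) ? { ...r, content: "<cleared>" } : r,
  );
}
```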

3. Autocompact — Claude summarizes the conversation

Fires when tokenCountWithEstimation(messages) ≥ threshold. Calls Claude with maxTurns=1 and NO tools allowed (prevents slow streaming tool calls). Summary constrained to <analysis> (scratchpad, stripped) + <summary> blocks. Frees ~40-60% of context.

4. Context Collapse — Fold multiple messages into summaries

Feature-gated (CONTEXT_COLLAPSE). Mutually exclusive with autocompact — only one of the two systems is active at a time. An alternative approach that summarizes groups of messages rather than the whole conversation.

What Claude Sees During Compaction

The compaction prompt (prompt.ts, 302 lines) is carefully designed to preserve critical information.

Compaction Instructions — What to Preserve

  1. Primary request and intent — the original user goal
  2. Key technical concepts — architecture, patterns, decisions
  3. Files and code sections — full snippets, not just names
  4. Errors and fixes — step-by-step, with resolution
  5. Problem solving — what worked, what didn't, why
  6. All user messages — preserve tone, feedback, corrections
  7. Pending tasks — what's left to do
  8. Current work — detailed, with code context
  9. Optional next step — with direct quotes from recent messages
CRITICAL instruction: "Respond with TEXT ONLY. Do NOT call any tools. Tool calls will be REJECTED and will waste your only turn — you will fail the task." — This prevents the model from trying to use Read/Bash to gather more context during compaction (which would slow it down and waste the single allowed turn).

Compact Boundary Message

After compaction, a special marker is inserted into the conversation history.

SystemCompactBoundaryMessage

type: 'system'
subtype: 'compact_boundary'
content: 'Conversation compacted'
trigger: 'manual' | 'auto'
preTokens: context size before compaction
messagesSummarized: count of summarized messages
preservedSegment: UUIDs of head/anchor/tail (partial compact)
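The fields above, transcribed into a TypeScript interface. The preservedSegment shape is a guess beyond "UUIDs of head/anchor/tail", and is labeled as such.

```typescript
// Sketch of the boundary marker's shape, transcribed from the fields above.
interface SystemCompactBoundaryMessage {
  type: "system";
  subtype: "compact_boundary";
  content: "Conversation compacted";
  trigger: "manual" | "auto";
  preTokens: number;            // context size before compaction
  messagesSummarized: number;   // how many messages the summary replaced
  // Hypothetical shape — the source only says "UUIDs of head/anchor/tail".
  preservedSegment?: { head: string; anchor: string; tail: string };
}

const boundary: SystemCompactBoundaryMessage = {
  type: "system",
  subtype: "compact_boundary",
  content: "Conversation compacted",
  trigger: "auto",
  preTokens: 168_442,
  messagesSummarized: 312,
};
```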

Circuit Breaker

MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3 — If compaction fails 3 times in a row (e.g., a single message exceeds context limit even after compaction), the system stops attempting for the rest of the session. Prevents hammering the API with doomed requests. Counter resets to 0 on success.
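A minimal sketch of that behavior; the class and method names are hypothetical. Since a tripped breaker blocks all further attempts, the counter can never reset once it trips within a session.

```typescript
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

// Trip after three consecutive failures; any success resets the counter.
class AutocompactCircuitBreaker {
  private consecutiveFailures = 0;

  get open(): boolean {
    return this.consecutiveFailures >= MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES;
  }

  recordFailure(): void { this.consecutiveFailures++; }
  recordSuccess(): void { this.consecutiveFailures = 0; }
}
```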

User Configuration

autoCompactEnabled — toggle auto-compact on/off (default: true)
/compact — manual trigger; always available even if auto is off
DISABLE_COMPACT — disables ALL compaction (auto + manual)
DISABLE_AUTO_COMPACT — disables auto only; /compact still works
CLAUDE_CODE_AUTO_COMPACT_WINDOW — override the effective context window size
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE — trigger at X% of the window instead of the fixed 13K buffer
Custom compaction instructions — injected via settings or hooks; merged with the base prompt
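A hypothetical resolver showing how these flags could interact. The precedence (DISABLE_COMPACT beats everything; /compact survives DISABLE_AUTO_COMPACT) follows the descriptions above, but the function name and exact evaluation order are assumptions.

```typescript
// Resolve which compaction paths are enabled from env flags + setting.
function compactionModes(
  env: Record<string, string | undefined>,
  autoCompactEnabled = true,
): { auto: boolean; manual: boolean } {
  // DISABLE_COMPACT kills both auto and manual compaction.
  if (env.DISABLE_COMPACT) return { auto: false, manual: false };
  return {
    auto: autoCompactEnabled && !env.DISABLE_AUTO_COMPACT,
    manual: true, // /compact stays available unless all compaction is disabled
  };
}
```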