06 / 20 Context & Autocompact (src/services/compact/ · src/utils/tokens.ts)

167K token threshold. 4 compaction layers.

Token counting uses a hybrid approach: API usage data when available, character/4 heuristic for new messages. Autocompact triggers at effectiveWindow - 13K buffer. Claude summarizes old messages in a single maxTurns=1 call with NO tools allowed. Circuit breaker after 3 consecutive failures.

200K — Default window, expandable to 1M with [1m]
13K — Buffer; autocompact triggers at window - 13K
20K — Reserved for summary output (p99.99 summary = 17.4K)
3 — Max consecutive failures before the circuit breaker trips

How Token Counting Works

Not a simple tokenizer call — a hybrid system designed for accuracy during streaming with parallel tool calls.

1. tokenCountWithEstimation(messages) — the canonical function for threshold checks. Walks messages backwards to find the last message with API usage data (input_tokens from response headers).
2. Find the first sibling (parallel tool handling) — when the model makes 10 parallel tool calls, streaming emits separate records that all share the same message.id. The function walks forward to the first sibling so that all interleaved results are included.
3. Rough estimation for new messages — roughTokenCountEstimationForMessages() uses a characters/4 ≈ tokens heuristic with 4/3× conservative padding for messages added since the last API call returned usage data.
4. Return API usage + rough estimate — final count = last known API input_tokens + estimated tokens for new messages. This is what gets compared against the autocompact threshold.
Official token counting is also available via messages.countTokens() on the Anthropic SDK — used for precise budget checks. But the hybrid estimation is faster for the hot-path threshold check.

Threshold Calculation

The math (autoCompact.ts lines 72-91)

  1. effectiveContextWindow = getContextWindowForModel(model) - reservedTokensForSummary
    reservedTokensForSummary = min(model's max_output_tokens, 20,000) — caps at 20K since p99.99 summary is 17.4K tokens
  2. autocompactThreshold = effectiveContextWindow - 13,000 (AUTOCOMPACT_BUFFER_TOKENS)
  3. Example for 200K model: 200,000 - 20,000 = 180,000 effective → 180,000 - 13,000 = 167,000 token trigger

Override: CLAUDE_AUTOCOMPACT_PCT_OVERRIDE (0-100) triggers at a percentage of the effective window instead of the fixed 13K buffer
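The same math as a small sketch. Constant names mirror the description above; the rounding behavior for the percentage override is an assumption.

```typescript
// Constants from the threshold calculation above.
const RESERVED_SUMMARY_CAP = 20_000;      // p99.99 summary is 17.4K, so 20K suffices
const AUTOCOMPACT_BUFFER_TOKENS = 13_000;

// pctOverride models CLAUDE_AUTOCOMPACT_PCT_OVERRIDE (0-100);
// exact rounding in the real code is an assumption.
function autocompactThreshold(
  contextWindow: number,
  maxOutputTokens: number,
  pctOverride?: number,
): number {
  const reserved = Math.min(maxOutputTokens, RESERVED_SUMMARY_CAP);
  const effectiveWindow = contextWindow - reserved;
  if (pctOverride !== undefined) {
    return Math.floor(effectiveWindow * (pctOverride / 100));
  }
  return effectiveWindow - AUTOCOMPACT_BUFFER_TOKENS;
}
```

For a 200K model this reproduces the worked example: 200,000 − 20,000 = 180,000 effective, minus the 13K buffer → 167,000.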

Token Budget Distribution

- System: 5-15K
- Memory: 2-10K
- Conversation: (grows)
- Tool results: (grow)
- Buffer: 13K

4 Compaction Strategies

Applied in order. Each layer is independent — they don't all fire every turn.

1. Snip — Drop old messages entirely

Feature-gated (HISTORY_SNIP). Removes messages above a boundary. The cheapest option — no API call needed. The snipTokensFreed parameter tracks the tokens freed by snipping, since the API's usage data still counts the removed messages.

2. Microcompact — Clear tool result content (3 sub-paths)

- Time-based: if the gap since the last assistant message exceeds 60 minutes (the server's cache TTL), clear all compactable tool results except the last 5. Saves ~20-100K tokens.
- Cached (ANT-only): queue cache_edits so the API deletes tool results without invalidating the cached prefix.
- Compactable tools: Read, Bash, Shell, Grep, Glob, WebSearch, WebFetch, Edit, Write.
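A sketch of the time-based sub-path under the rules just described. Function and type names here are illustrative, not from the source.

```typescript
const CACHE_TTL_MS = 60 * 60 * 1000; // server's ~60min cache TTL
const KEEP_LAST = 5;
const COMPACTABLE = new Set([
  "Read", "Bash", "Shell", "Grep", "Glob",
  "WebSearch", "WebFetch", "Edit", "Write",
]);

interface ToolResult { tool: string; content: string }

// Clear compactable tool results except the last five, but only once the
// conversation has gone stale past the cache TTL (nothing cached to lose).
function microcompact(results: ToolResult[], msSinceLastAssistant: number): ToolResult[] {
  if (msSinceLastAssistant <= CACHE_TTL_MS) return results;
  const compactable = results
    .map((r, i) => (COMPACTABLE.has(r.tool) ? i : -1))
    .filter((i) => i !== -1);
  const keep = new Set(compactable.slice(-KEEP_LAST));
  return results.map((r, i) =>
    COMPACTABLE.has(r.tool) && !keep.has(i) ? { ...r, content: "<cleared>" } : r,
  );
}
```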

3. Autocompact — Claude summarizes the conversation

Fires when tokenCountWithEstimation(messages) ≥ threshold. Calls Claude with maxTurns=1 and NO tools allowed (prevents slow streaming tool calls). Summary constrained to <analysis> (scratchpad, stripped) + <summary> blocks. Frees ~40-60% of context.

4. Context Collapse — Fold multiple messages into summaries

Feature-gated (CONTEXT_COLLAPSE). Mutually exclusive with autocompact — only one of the two systems is active at a time. An alternative approach that summarizes groups of messages rather than the whole conversation.

What Claude Sees During Compaction

The compaction prompt (prompt.ts, 302 lines) is carefully designed to preserve critical information.

Compaction Instructions — What to Preserve

  1. Primary request and intent — the original user goal
  2. Key technical concepts — architecture, patterns, decisions
  3. Files and code sections — full snippets, not just names
  4. Errors and fixes — step-by-step, with resolution
  5. Problem solving — what worked, what didn't, why
  6. All user messages — preserve tone, feedback, corrections
  7. Pending tasks — what's left to do
  8. Current work — detailed, with code context
  9. Optional next step — with direct quotes from recent messages
CRITICAL instruction: "Respond with TEXT ONLY. Do NOT call any tools. Tool calls will be REJECTED and will waste your only turn — you will fail the task." — This prevents the model from trying to use Read/Bash to gather more context during compaction (which would slow it down and waste the single allowed turn).

Compact Boundary Message

After compaction, a special marker is inserted into the conversation history.

SystemCompactBoundaryMessage

type: 'system'
subtype: 'compact_boundary'
content: 'Conversation compacted'
trigger: 'manual' | 'auto'
preTokens: context size before compaction
messagesSummarized: count of summarized messages
preservedSegment: UUIDs of head/anchor/tail (partial compact)
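The fields above, transcribed into a TypeScript interface. The preservedSegment shape is a guess beyond "UUIDs of head/anchor/tail", and is labeled as such.

```typescript
// Sketch of the boundary marker's shape, transcribed from the fields above.
interface SystemCompactBoundaryMessage {
  type: "system";
  subtype: "compact_boundary";
  content: "Conversation compacted";
  trigger: "manual" | "auto";
  preTokens: number;            // context size before compaction
  messagesSummarized: number;   // how many messages the summary replaced
  // Hypothetical shape — the source only says "UUIDs of head/anchor/tail".
  preservedSegment?: { head: string; anchor: string; tail: string };
}

const boundary: SystemCompactBoundaryMessage = {
  type: "system",
  subtype: "compact_boundary",
  content: "Conversation compacted",
  trigger: "auto",
  preTokens: 168_442,
  messagesSummarized: 312,
};
```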

Circuit Breaker

MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3 — If compaction fails 3 times in a row (e.g., a single message exceeds context limit even after compaction), the system stops attempting for the rest of the session. Prevents hammering the API with doomed requests. Counter resets to 0 on success.
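A minimal sketch of that behavior; the class and method names are hypothetical. Since a tripped breaker blocks all further attempts, the counter can never reset once it trips within a session.

```typescript
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

// Trip after three consecutive failures; any success resets the counter.
class AutocompactCircuitBreaker {
  private consecutiveFailures = 0;

  get open(): boolean {
    return this.consecutiveFailures >= MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES;
  }

  recordFailure(): void { this.consecutiveFailures++; }
  recordSuccess(): void { this.consecutiveFailures = 0; }
}
```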

User Configuration

autoCompactEnabled — toggle auto-compact on/off (default: true)
/compact — manual trigger; always available even if auto is off
DISABLE_COMPACT — disables ALL compaction (auto + manual)
DISABLE_AUTO_COMPACT — disables auto only; /compact still works
CLAUDE_CODE_AUTO_COMPACT_WINDOW — override the effective context window size
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE — trigger at X% of the window instead of the fixed 13K buffer
Custom compaction instructions — injected via settings or hooks; merged with the base prompt
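A hypothetical resolver showing how these flags could interact. The precedence (DISABLE_COMPACT beats everything; /compact survives DISABLE_AUTO_COMPACT) follows the descriptions above, but the function name and exact evaluation order are assumptions.

```typescript
// Resolve which compaction paths are enabled from env flags + setting.
function compactionModes(
  env: Record<string, string | undefined>,
  autoCompactEnabled = true,
): { auto: boolean; manual: boolean } {
  // DISABLE_COMPACT kills both auto and manual compaction.
  if (env.DISABLE_COMPACT) return { auto: false, manual: false };
  return {
    auto: autoCompactEnabled && !env.DISABLE_AUTO_COMPACT,
    manual: true, // /compact stays available unless all compaction is disabled
  };
}
```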