Skip to main content
Prompt caching is one of the highest-ROI levers in an agent’s token budget, and it costs nothing to enable. The cache-read tier on every major provider is heavily discounted vs fresh input:
ProviderCache-read discountMinimum prefixMechanism
Anthropic Claude−90% (0.30/Mvs0.30/M vs 3/M on Sonnet)1024 tokens (2048 on Haiku)Explicit cache_control markers
OpenAI−50%≥1024 tokensAutomatic on stable prefixes
Google Gemini−75% on cached portionvariescontext_caching API
On agentic workloads with 1M+ input tokens per problem, this is the difference between roughly 3and3 and 0.50 per call.

The rule that matters

Cache after any string compression, never before. Compression that mutates the cached prefix invalidates the cache for every downstream call. ReasonBlocks’s TokenSavingMiddleware only modifies mutable tool-output messages mid-history — the system prompt and tools (the cached prefix) are never touched. If you wire your own compressor before RB, make sure it preserves the cached prefix byte-for-byte.

One-line enable (Anthropic)

The fastest path: flip prompt_caching=True on the validated D-arm preset. Internally TokenSavingMiddleware delegates wrap_model_call to LangChain’s AnthropicPromptCachingMiddleware, so you don’t add a second middleware to your list.
from reasonblocks import for_code_review

agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[bash, ...],
    middleware=for_code_review(
        fail_to_pass_tests=task.fail_to_pass,
        prompt_caching=True,           # ← single flag, single middleware
        prompt_caching_ttl="5m",       # "5m" (default) or "1h"
    ),
)
Or use the standalone middleware for non-code-review agents:
from reasonblocks import prompt_caching_middleware

agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[...],
    middleware=[prompt_caching_middleware(ttl="5m")],
)

Validated headline

On paired SWE-bench Pro (n=50 ansible instances, claude-sonnet-4-6, no other middleware) — the same agent loop with and without the flag:
armmean cost/instancecache_read shareaccuracy
baseline$3.330%4/15*
prompt_caching=True$0.5996.5% of input2/15*
Δ−82.4%flat (noise at this n)
*Preliminary partial n=15/50. Full n=50 paired result lands on this page when both arms finish. Every one of 15 paired instances saved cost (range −58.6% to −93.4%). The remaining 0.0% of input billed as fresh ≈ 683 tokens across all 15 runs — the cache effectively replaces input billing entirely after turn 1.

What RB does out of the box

ReasonBlocksMiddleware and TokenSavingMiddleware are cache-safe by design:
  • The middleware never modifies the system prompt or tool definitions across turns — the stable prefix providers cache against. ReasonBlocksMiddleware wraps your base system prompt in a cache_control: {"type": "ephemeral"} block automatically, and rides the variable [REASONBLOCKS] steering text in a separate uncached block so it can’t bust the prefix.
  • Tool-output compression replaces same-id ToolMessage instances (LangGraph’s add_messages reducer treats same-id messages as in-place replacements). Cache invalidation only affects the modified message onward; earlier turns still hit the cache.
  • Injection text (monitor firings, early-exit nudges) lands as a new HumanMessage at the tail of the message list, not in the cached prefix.
So if you mount RB with enable_token_saving=True (the default), you’re already cache-safe. The remaining work is downstream: verify the provider sees a stable prefix on every turn.

Verify cache hits in production

Every Anthropic response includes cache_read_input_tokens and cache_creation_input_tokens in usage. A healthy long-running agent writes the cache on turn 1, then reads it on every subsequent turn.
from anthropic import Anthropic

client = Anthropic()
resp = client.messages.create(
    model="claude-sonnet-4-6",
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # the breakpoint
    }],
    tools=tools,
    messages=messages,
    max_tokens=2048,
)

print(f"input_fresh:  {resp.usage.input_tokens}")
print(f"cache_read:   {resp.usage.cache_read_input_tokens}")      # should grow turn-over-turn
print(f"cache_write:  {resp.usage.cache_creation_input_tokens}")  # ~zero after turn 1
If cache_read_input_tokens is zero on turn 2+, something upstream is mutating the prefix. Common culprits:
  1. cache_control missing or mis-placed — must be on the system block (or last tool, or the message you want as the breakpoint). Anthropic supports up to 4 breakpoints.
  2. System prompt being re-templated — even a timestamp injected into the system text breaks the cache.
  3. Tools list changing across calls — a downstream wrapper sorting or re-ordering tools.
  4. Prefix below the minimum — Anthropic’s cache requires ≥1024 tokens on Sonnet/Opus, ≥2048 on Haiku. Tiny prompts can’t be cached.

Anthropic native context editing (provider-side)

Recent Claude models ship a server-side context-editing feature (behind a beta header) that clears stale tool_use rounds once the context grows past a threshold. It’s a provider feature you enable directly on the Anthropic client — ReasonBlocks does not wrap it. Configure it per the Anthropic context-editing docs; it stacks with RB because the token pools are disjoint (Anthropic evicts stale tool rounds server-side, RB head+tail-truncates the tool outputs you keep visible).

Stack the input-side levers

The full input-side token-reduction stack for a coding agent — one factory, one flag, and each layer targets a mostly disjoint token pool so the reductions compound:
from langchain_anthropic import ChatAnthropic
from langchain.agents import create_agent
from reasonblocks import for_code_review, chain_of_draft_directive

agent = create_agent(
    model=ChatAnthropic(model="claude-sonnet-4-6", max_tokens=2048),
    tools=[bash, ...],
    system_prompt=task_prompt + chain_of_draft_directive(),  # CoD on output reasoning
    middleware=for_code_review(                              # D-arm + caching, one middleware list
        fail_to_pass_tests=task.fail_to_pass,
        max_tool_calls=50,
        prompt_caching=True,                                 # ← turns on Anthropic prompt caching
    ),
)
LayerTargetsTypical cut
Prompt caching (provider)system prompt + tools (15–30% of input)up to 90% of that pool
Anthropic context editingstale tool_use rounds (15–30% of input)up to 84% of that pool (vendor-reported)
RB tool-output compressionretained tool outputs (the 30–55% middle pool)~50% via head+tail
Chain-of-Draftreasoning output text (5–25% of output)up to 92% of that pool (math workloads)
Measure on your own workload — the combined cut depends on your token mix and how orthogonal these pools actually are for your agent.

See also