Chain-of-Draft (CoD) is a prompting technique: instruct the model to write each reasoning step as a draft of at most five words instead of full prose. It is a string you append to your system prompt, with no SDK call, no middleware, and no infrastructure cost. It composes with the token-saving middleware because the two target different token pools: CoD shapes output reasoning, while compression shapes input tool outputs.

When to use it

Use Chain-of-Draft when reasoning text is a non-trivial share of your agent’s output. Math-heavy or multi-hop reasoning workloads gain the most; pure tool-call agents see smaller wins because the tool-call payload dominates the output.
The original paper (Xu et al., Zoom, Feb 2025, arXiv 2502.18600) reports CoD using as little as 7.6% of CoT tokens at comparable accuracy. That figure is a best case: on GSM8K math the reported gap is closer to ≈40 tokens vs ≈200 per response on GPT-4o / Claude 3.5 Sonnet, or about 20%. Both numbers come from benchmark reasoning tasks. On tool-calling code agents the win is smaller, so measure on your workload before quoting a savings number.
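As a rough way to see why the reasoning share matters, the sketch below estimates the net output-token reduction from two inputs: the fraction of output that is reasoning text, and the ratio of CoD to CoT reasoning tokens. The function name and the workload figures are illustrative assumptions, not measurements; the 0.2 ratio simply mirrors the approximate 40-vs-200 GSM8K figures quoted above.

```python
def estimated_output_reduction(reasoning_share: float, cod_ratio: float) -> float:
    """Fraction of total output tokens saved if only reasoning text shrinks.

    reasoning_share: fraction of output tokens that are reasoning (0..1).
    cod_ratio: CoD reasoning tokens divided by CoT reasoning tokens (0..1).
    """
    if not 0.0 <= reasoning_share <= 1.0:
        raise ValueError("reasoning_share must be between 0 and 1")
    if not 0.0 <= cod_ratio <= 1.0:
        raise ValueError("cod_ratio must be between 0 and 1")
    # Tool-call payloads and final answers are assumed to stay the same size.
    return reasoning_share * (1.0 - cod_ratio)


# Hypothetical workloads, both using a 0.2 CoD-to-CoT ratio.
math_heavy = estimated_output_reduction(reasoning_share=0.9, cod_ratio=0.2)
tool_heavy = estimated_output_reduction(reasoning_share=0.2, cod_ratio=0.2)
print(f"math-heavy: {math_heavy:.0%}, tool-heavy: {tool_heavy:.0%}")
```

This is only a back-of-envelope model. It assumes the non-reasoning part of the output is unaffected, which is worth checking against real traces.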

The directive

CoD is just text the model sees once per call (~84 tokens of overhead). Save it as a constant and concatenate it to whatever system prompt you already have:
CHAIN_OF_DRAFT = """
# Reasoning style — Chain of Draft
Keep each reasoning step to <=5 words. Drop articles, hedges, and restatements.
The final answer or tool call goes in full prose — the *thinking* is in 5-word
fragments.

Example:
  Not: "I should look up the user by id, then check their account status."
  Yes: "Need user. Look up id. Then check status."
"""
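If several code paths build prompts, a small helper can keep the directive from being appended twice. This is a minimal sketch: `with_chain_of_draft` is a hypothetical name, and the constant is abbreviated here only to keep the block self-contained.

```python
# Abbreviated stand-in for the CHAIN_OF_DRAFT constant defined above.
CHAIN_OF_DRAFT = """
# Reasoning style (Chain of Draft)
Keep each reasoning step to <=5 words. Drop articles, hedges, and restatements.
"""


def with_chain_of_draft(base_prompt: str, directive: str = CHAIN_OF_DRAFT) -> str:
    """Append the directive to base_prompt unless it is already present."""
    if directive.strip() in base_prompt:
        return base_prompt
    return base_prompt.rstrip() + "\n" + directive


prompt = with_chain_of_draft("You are a bug-fixing agent.")
# Calling the helper a second time leaves the prompt unchanged.
assert with_chain_of_draft(prompt) == prompt
```

The substring check is deliberately simple. It will not detect a directive that has been edited or reformatted after insertion.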

Use it

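from langchain.agents import create_agent

# my_base_prompt is your existing system prompt string; CHAIN_OF_DRAFT is the
# constant defined above. Plain string concatenation is all that is needed.
agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    system_prompt=my_base_prompt + CHAIN_OF_DRAFT,
)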

Compose with token saving

CoD shapes output tokens; TokenSavingMiddleware head+tail-truncates tool outputs on the input side. They stack cleanly.
from langchain.agents import create_agent
from reasonblocks import ReasonBlocks, TokenSavingMiddleware

rb = ReasonBlocks(api_key="rb_live_...")

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    system_prompt=task_prompt + CHAIN_OF_DRAFT,        # CoD on output reasoning
    middleware=[
        rb.middleware(agent_name="bugfixer"),
        TokenSavingMiddleware(),                        # compression on tool outputs
    ],
)

What CoD does not do

  • It does not reduce input cost: the model still processes the same input, plus the directive. The saving is on the model’s output (billed at the higher output-token rate), so any latency gain comes only from generating fewer tokens.
  • It does not change accuracy on its own — the paper’s headline is parity accuracy. Harder symbolic reasoning may see different deltas; measure on your tasks.
  • It is not a substitute for tool-output compression — those are two different categories of tokens. Stack both.
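The trade-off between the directive's input overhead and the output saving can be sketched as a break-even check. The prices below are placeholder values, not any provider's actual rates, and both helper names are hypothetical; the 84-token default is the overhead figure quoted earlier on this page.

```python
def net_saving_per_call(
    output_tokens_saved: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    directive_tokens: int = 84,
) -> float:
    """Net cost change per call; a positive value means CoD saves money."""
    saved = output_tokens_saved * output_price_per_mtok / 1_000_000
    overhead = directive_tokens * input_price_per_mtok / 1_000_000
    return saved - overhead


def break_even_output_tokens(
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    directive_tokens: int = 84,
) -> float:
    """Output tokens that must be saved per call to cover the directive."""
    return directive_tokens * input_price_per_mtok / output_price_per_mtok


# Placeholder prices: 3 per million input tokens, 15 per million output tokens.
threshold = break_even_output_tokens(3.0, 15.0)
print(f"CoD pays for itself above {threshold:.1f} saved output tokens per call")
```

Under these assumed prices the directive is cheap to carry, but the check only matters if reasoning text actually shrinks on your workload.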

See also