When to use it
Use Chain-of-Draft when reasoning text is a non-trivial share of your agent’s output. Math-heavy or multi-hop reasoning workloads gain the most; pure tool-call agents see smaller wins because the tool-call payload dominates the output.
The original paper (Xu et al., Zoom, Feb 2025, arXiv 2502.18600) reports CoD using as little as 7.6% of CoT tokens in its best case. On GSM8K math the figures are ≈40 tokens vs ≈200 per response on GPT-4o / Claude 3.5 Sonnet (roughly 20% of CoT tokens) at comparable accuracy. Those headline numbers are benchmark-specific. On tool-calling code agents the win is smaller, so measure on your workload before quoting a savings number.
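One way to measure on your own workload is to run the same task set with and without CoD and compare output-token counts. The sketch below is illustrative only: the function names and the sample numbers are made up, and the saving estimate assumes CoD shrinks only the reasoning text while tool-call payloads stay the same size.

```python
from statistics import mean


def output_token_ratio(baseline_tokens: list[int], cod_tokens: list[int]) -> float:
    """Mean CoD output tokens as a fraction of mean baseline output tokens."""
    if not baseline_tokens or not cod_tokens:
        raise ValueError("need at least one sample per arm")
    return mean(cod_tokens) / mean(baseline_tokens)


def estimated_output_saving(reasoning_share: float, cod_ratio: float) -> float:
    """Fraction of total output tokens saved.

    Assumes CoD shrinks only the reasoning text (to cod_ratio of its
    original size) and leaves tool-call payloads unchanged.
    """
    if not 0.0 <= reasoning_share <= 1.0:
        raise ValueError("reasoning_share must be between 0 and 1")
    return reasoning_share * (1.0 - cod_ratio)


# Illustrative, made-up per-response output token counts from the same task set.
baseline_run = [180, 220, 200]
cod_run = [45, 35, 40]
ratio = output_token_ratio(baseline_run, cod_run)

# Same compression ratio, very different overall saving depending on how
# much of the output is reasoning text in the first place.
reasoning_heavy = estimated_output_saving(0.8, ratio)
tool_call_heavy = estimated_output_saving(0.1, ratio)
```

With the same compression ratio, an agent whose output is mostly reasoning saves far more than one whose output is mostly tool-call payload, which is why the reasoning share is worth measuring first.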
The directive
CoD is just text the model sees once per call (~84 tokens of overhead). Save it as a constant and concatenate it to whatever system prompt you already have.
Use it
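A minimal sketch of the constant-plus-concatenation pattern follows. The directive wording is an assumption, loosely following the prompt style described in the Chain-of-Draft paper; it is not the exact ~84-token directive this page refers to. COD_DIRECTIVE and with_cod are hypothetical names.

```python
# Assumed wording, paraphrasing the Chain-of-Draft prompt style.
# Swap in the directive your project actually ships.
COD_DIRECTIVE = (
    "Think step by step, but keep only a minimal draft for each thinking "
    "step, at most five words per step. "
    "Return the final answer after the separator ####."
)


def with_cod(system_prompt: str, directive: str = COD_DIRECTIVE) -> str:
    """Append the directive to an existing system prompt, at most once."""
    base = system_prompt.rstrip()
    if directive in base:
        return base  # already applied; do not pay the overhead twice
    if not base:
        return directive
    return f"{base}\n\n{directive}"


system_prompt = with_cod("You are a careful coding agent.")
```

Appending rather than prepending keeps the existing system prompt prefix unchanged, and the membership check makes the helper safe to call more than once.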
Compose with token saving
CoD shapes output tokens; TokenSavingMiddleware head+tail-truncates tool outputs on the input side. They stack cleanly.
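To make the input-side half concrete, here is a standalone sketch of head+tail truncation. It illustrates the general technique only and is not TokenSavingMiddleware's implementation or API; head_tail_truncate, its character-based limits, and the marker text are all assumptions.

```python
def head_tail_truncate(text: str, head: int = 2000, tail: int = 1000) -> str:
    """Keep the first `head` and last `tail` characters of a tool output.

    The middle is replaced with a short marker so the model can tell
    that content was omitted.
    """
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    if len(text) <= head + tail:
        return text  # short enough; leave untouched
    omitted = len(text) - head - tail
    tail_part = text[-tail:] if tail else ""  # text[-0:] would be the whole string
    return f"{text[:head]}\n...[truncated {omitted} chars]...\n{tail_part}"


# Tool output is trimmed before it re-enters the context (input tokens);
# the CoD directive separately shortens what the model writes (output tokens).
tool_output = "a" * 10 + "b" * 10 + "c" * 10
trimmed = head_tail_truncate(tool_output, head=10, tail=10)
```

Because one technique acts on what the model reads and the other on what it writes, neither interferes with the other.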
What CoD does not do
- It does not reduce input-side work: the model still processes the same input. The saving is on the model’s output (billed at the higher output-token rate), so any latency gain comes only from generating fewer tokens.
- It does not change accuracy on its own — the paper’s headline is parity accuracy. Harder symbolic reasoning may see different deltas; measure on your tasks.
- It is not a substitute for tool-output compression — those are two different categories of tokens. Stack both.
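The first point can be checked with simple per-call arithmetic. The sketch below is hedged: the prices are placeholders rather than any provider's real rates, the function names are hypothetical, and the 84-token overhead is the approximate figure quoted above.

```python
COD_OVERHEAD_TOKENS = 84  # approximate directive size quoted on this page


def cost_usd(input_tokens: int, output_tokens: int,
             in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Cost of one call given per-million-token prices."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000


def net_saving_usd(input_tokens: int, baseline_output: int, cod_output: int,
                   in_price_per_mtok: float, out_price_per_mtok: float,
                   overhead: int = COD_OVERHEAD_TOKENS) -> float:
    """Per-call saving from CoD: fewer output tokens, slightly more input."""
    before = cost_usd(input_tokens, baseline_output,
                      in_price_per_mtok, out_price_per_mtok)
    after = cost_usd(input_tokens + overhead, cod_output,
                     in_price_per_mtok, out_price_per_mtok)
    return before - after


# Placeholder prices: 1.0 per million input tokens, 5.0 per million output tokens.
saving = net_saving_usd(10_000, 200, 40, 1.0, 5.0)
no_gain = net_saving_usd(10_000, 200, 200, 1.0, 5.0)
```

If CoD does not shorten the output at all, the result is slightly negative, because the directive's input overhead is still paid on every call.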
See also
- Reduce token usage — tool-output compression and early-exit
- Prompt caching — the cache-read discount on the input side

