Prompt caching
In one line: If two calls share an identical prompt prefix, the provider can reuse the work it already did on that prefix — billing you a fraction of the price and serving the first token faster. Structure prompts so the stable stuff is first and you're done.
Remember the transformer phases: prefill (process the input), then decode (generate output)? For long prompts, prefill is the expensive part. Prompt caching is the provider saying "I already processed this exact prefix two minutes ago; I'll skip that work and start where it differs." You get the same answer, and the cached tokens are billed at roughly 10–50% of the normal input price, depending on the provider.
How it works (mechanically)
When the model processes your input tokens, it computes a KV cache — a giant tensor of intermediate attention state, one slice per token. Generating the next token only requires the KV cache of all previous tokens, not re-running the whole prefill.
Providers cache that KV tensor keyed by the prefix hash. If your next request has the exact same first N tokens, they reuse the cached KV state for those N tokens and only do prefill on the new tail.
The cache lives for minutes by default, and up to hours with the longer-TTL options some providers offer (provider-specific). A new call within the TTL hits a warm cache. A new call after expiry pays full price again.
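The mechanism above can be sketched as a toy simulation. This is not any provider's real implementation: the class name `ToyPrefixCache`, the block-aligned hashing, and the "TTL is refreshed on every use" behavior are all assumptions made for illustration, and whitespace-split words stand in for real tokens.

```python
import hashlib


class ToyPrefixCache:
    """Toy model of a provider-side prefix cache (illustrative only)."""

    def __init__(self, ttl_seconds=300.0, block_size=4):
        self.ttl = ttl_seconds
        self.block_size = block_size
        self._expiry = {}  # prefix hash -> time at which the entry expires

    def _key(self, tokens):
        # Hash the exact token sequence; one different token changes the key.
        return hashlib.sha256("\x1f".join(tokens).encode("utf-8")).hexdigest()

    def process(self, tokens, now):
        """Return (cached_tokens, prefilled_tokens) for a request at time `now`."""
        boundaries = range(self.block_size, len(tokens) + 1, self.block_size)
        # Find the longest block-aligned prefix that is still cached.
        cached = 0
        for end in boundaries:
            expiry = self._expiry.get(self._key(tokens[:end]))
            if expiry is None or expiry <= now:
                break
            cached = end
        # "Prefill" the rest, then store or refresh every block-aligned prefix.
        for end in boundaries:
            self._expiry[self._key(tokens[:end])] = now + self.ttl
        return cached, len(tokens) - cached


cache = ToyPrefixCache(ttl_seconds=300, block_size=4)
system = "you are a helpful assistant for acme support".split()  # 8 "tokens"
print(cache.process(system + "how do I reset".split(), now=0))               # (0, 12)
print(cache.process(system + "what is your refund policy".split(), now=60))  # (8, 5)
```

The second request reuses the 8 shared system-prompt tokens and only prefills the 5 new ones. Sending it again after the TTL has lapsed would return a full miss.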
Provider differences (May 2026)
| Provider | How to enable | Cache TTL | Cached input price |
|---|---|---|---|
| OpenAI | Automatic; prefixes ≥1024 tokens | ~5–10 min | 50% of normal input |
| Anthropic | Explicit cache_control breakpoints | 5 min (default) or 1 hour (extended) | 10% of normal input on reads; cache writes cost 25% more (5-min) or 2× (1-hour) |
| Google Gemini | Explicit context caching API | configurable 1 min–24 hours | ~25% of normal input |
| DeepSeek / Together / etc. | Often automatic | varies | varies (often 10-20%) |
Anthropic gives the deepest discount (10× on the cached portion) but requires you to mark cache boundaries explicitly. OpenAI's caching is automatic, but the discount is only 2×. Gemini lets you pre-create caches with a long TTL for very expensive prefixes (huge docs).
Worked example: a chatbot with a 4K system prompt
Without caching, every turn:
4,000 input × $3/M = $0.012 just on system prompt
× 50K turns/day = $600/day in system-prompt tokens alone
With Anthropic caching at 10% on the cached portion:
First turn: 4,000 × $3.75/M = $0.015 (cache write costs 25% more)
Subsequent: 4,000 × $0.30/M = $0.0012 each
50K turns/day, most cache hits: ~$60/day
Same model, same prompt, 10× cheaper on the system-prompt portion. Multiplied across every turn of every conversation, it's the single biggest cost lever in production chat apps.
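The arithmetic above can be reproduced with a short calculator. This is a sketch under the same assumptions as the worked example ($3/M base input price, 25% write premium, reads at 10%); the function names and the `hit_rate` parameter are hypothetical, and every miss is assumed to pay the cache-write price.

```python
def uncached_daily_cost(prefix_tokens, turns_per_day, base_price_per_mtok):
    """Daily cost (USD) of re-sending the prefix at full price on every turn."""
    return turns_per_day * prefix_tokens * base_price_per_mtok / 1_000_000


def cached_daily_cost(prefix_tokens, turns_per_day, base_price_per_mtok,
                      hit_rate, read_multiplier=0.10, write_multiplier=1.25):
    """Daily cost (USD) of the prefix when a fraction `hit_rate` of turns hit the cache."""
    full_price = prefix_tokens * base_price_per_mtok / 1_000_000
    hits = turns_per_day * hit_rate
    misses = turns_per_day - hits
    # Hits pay the discounted read price; misses pay the write premium.
    return hits * full_price * read_multiplier + misses * full_price * write_multiplier


print(f"{uncached_daily_cost(4_000, 50_000, 3.0):.2f}")                  # 600.00
print(f"{cached_daily_cost(4_000, 50_000, 3.0, hit_rate=1.0):.2f}")      # 60.00
print(f"{cached_daily_cost(4_000, 50_000, 3.0, hit_rate=0.99):.2f}")     # 66.90
```

Even a 1% miss rate moves the total noticeably, because each miss costs 12.5× as much as a hit under these multipliers.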
How to structure prompts for cache hits
Rules:
- Stable stuff first, variable stuff last. System prompt → tool definitions → long shared context → conversation history → latest user message.
- Don't put a timestamp in the system prompt. "Current date: 2026-05-23 14:30:11" makes the prefix change on every request — zero cache hits.
- Don't shuffle tool definitions. Same tools in the same order every call.
- For Anthropic, mark cache breakpoints explicitly: add cache_control: {"type": "ephemeral"} to the last block you want cached.
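One way to apply these rules is a small request builder. The sketch below is an illustration, not a prescribed pattern: `build_request` and `stable_prefix` are hypothetical helpers, the dict shape only loosely mirrors the `system` / `tools` / `messages` fields used in the Anthropic example in this article, and sorting tools by name is just one way to guarantee a fixed order.

```python
import json
from datetime import datetime, timezone


def build_request(system_prompt, tools, shared_context, history, user_message, now=None):
    """Assemble a request with stable content first and variable content last."""
    now = now or datetime.now(timezone.utc)
    # Fixed tool order: never depend on registration or iteration order.
    stable_tools = sorted(tools, key=lambda tool: tool["name"])
    system = [
        {"type": "text", "text": system_prompt},
        # Breakpoint on the last stable block.
        {"type": "text", "text": shared_context, "cache_control": {"type": "ephemeral"}},
    ]
    # The timestamp goes in the latest user message, after the cached prefix.
    latest = {
        "role": "user",
        "content": f"{user_message}\n\n(Current time: {now.isoformat(timespec='seconds')})",
    }
    return {"tools": stable_tools, "system": system, "messages": list(history) + [latest]}


def stable_prefix(request):
    """Serialize the part of the request that should be identical across calls."""
    return json.dumps({"tools": request["tools"], "system": request["system"]}, sort_keys=True)


a = build_request("SYS", [{"name": "search"}, {"name": "calc"}], "DOC", [], "hi")
b = build_request("SYS", [{"name": "calc"}, {"name": "search"}], "DOC", [], "bye")
print(stable_prefix(a) == stable_prefix(b))  # True
```

Comparing `stable_prefix` across two consecutive requests in a unit test is a cheap way to catch an accidental timestamp or reordering before it reaches production.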
Worked example: Anthropic with explicit caching
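```python
from anthropic import Anthropic

# Placeholders: substitute your real prompt and document text.
LONG_SYSTEM_PROMPT = "..."  # ~4,000 tokens in this example
HUGE_DOCUMENT = "..."       # ~50,000 tokens in this example

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint 1: cache everything up to the end of the system prompt.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": HUGE_DOCUMENT,
                    # Breakpoint 2: cache the prefix through the document.
                    "cache_control": {"type": "ephemeral"},
                },
                # The question comes after the last breakpoint, so it can vary freely.
                {"type": "text", "text": "Summarize the section on liability."},
            ],
        }
    ],
)

print(response.usage)
# cache_creation_input_tokens: 54,000 (first call writes the cache)
# cache_read_input_tokens: 0
```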
Second call within 5 minutes with the same prefix:
```
cache_creation_input_tokens: 0
cache_read_input_tokens: 54,000
input_tokens: 8 (just the new question)
```
Same call, ~10× cheaper.
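To track whether calls like this are actually hitting the cache, the three usage fields shown above can be folded into one ratio. The helper below is a minimal sketch: `cache_hit_ratio` is a hypothetical name, it assumes the usage object exposes those three fields as attributes, and `SimpleNamespace` stands in for a real response so the example runs without an API key.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage):
    """Fraction of input tokens on this call that were served from the cache."""
    # Treat missing or None fields as zero.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Stand-ins for the two usage reports shown above.
first_call = SimpleNamespace(
    input_tokens=8, cache_creation_input_tokens=54_000, cache_read_input_tokens=0
)
second_call = SimpleNamespace(
    input_tokens=8, cache_creation_input_tokens=0, cache_read_input_tokens=54_000
)
print(cache_hit_ratio(first_call))            # 0.0
print(round(cache_hit_ratio(second_call), 4)) # 0.9999
```

Logging this number per request and alerting when the average drops is usually enough to catch a prefix that has silently started changing.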
What cache doesn't solve
- Output tokens — never cached. The model has to generate them fresh.
- Truly novel prompts — first call pays full price. The win is on repeated calls.
- Cache misses from subtle differences — even one different token in the prefix invalidates everything from that point on.
- Stale knowledge — if you cached "today is Monday" and it's now Tuesday, your model is wrong. Don't cache things that should change.
What beginners get wrong
- Putting variable content first. "Hi $username" at the top of the system prompt = zero cache hits. Move variables to the user message.
- Reordering tool definitions or context between calls. Stable order = cache hit. Random order = miss.
- Assuming all providers cache the same way. OpenAI does it automatically and silently. Anthropic requires explicit annotations. Gemini wants you to pre-create caches via a separate API.
- Not measuring cache_read_input_tokens. If you're not checking the usage stats, you don't actually know whether your cache is hitting.
- Caching tiny prompts. Below the minimum (often 1K tokens), nothing is cached. Don't try to optimize for tiny prefixes.
- Forgetting TTL. A low-traffic endpoint with 1 request every 30 minutes will never find a warm cache on a 5-minute TTL. Either accept the cost or use a longer-TTL cache (Anthropic 1-hour, Gemini explicit cache).
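The TTL point is easy to check with a few lines of simulation. This sketch assumes a single shared prefix and assumes the TTL is refreshed on every request that touches the cache; that refresh behavior varies by provider, so treat it as an assumption. `count_cache_hits` is a hypothetical helper.

```python
def count_cache_hits(request_times, ttl_seconds):
    """Count how many requests (timestamps in seconds) arrive while the cache is warm."""
    hits = 0
    expires_at = None
    for t in sorted(request_times):
        if expires_at is not None and t < expires_at:
            hits += 1
        # Written on a miss, refreshed on a hit (assumed behavior).
        expires_at = t + ttl_seconds
    return hits


every_30_min = [i * 1800 for i in range(8)]  # 8 requests over 3.5 hours
print(count_cache_hits(every_30_min, ttl_seconds=300))   # 0
print(count_cache_hits(every_30_min, ttl_seconds=3600))  # 7
```

With a 5-minute TTL every request is a miss and pays the write premium; with a 1-hour TTL every request after the first is a hit.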
When the math works
Cache pays off when:
- The savings from cache reads, roughly (stable prefix tokens) × (cache hits per TTL window) × (discount per cached token), exceed the extra cost of writing the cache.
- Rule of thumb: any chatbot or RAG endpoint with >5 calls/minute on the same prefix wins.
Cache doesn't pay off when:
- You serve one query at a time per unique prompt.
- Your prefix changes every call (avoid: timestamps, random nonces, reordered fields).
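The break-even point can be made concrete. Measuring cost in units of "one full-price pass over the prefix", n calls cost n without caching and write_multiplier + (n - 1) × read_multiplier with caching, assuming one cache write followed by hits for the rest of the window. The sketch below solves for the smallest n where caching is strictly cheaper; `break_even_calls` is a hypothetical helper, the defaults are the 1.25× write and 10% read figures from the worked example, and other multipliers are inputs you would need to check against your provider's pricing.

```python
import math


def break_even_calls(write_multiplier=1.25, read_multiplier=0.10):
    """Smallest number of calls per TTL window for which caching is strictly cheaper."""
    # Cached:   write_multiplier + (n - 1) * read_multiplier
    # Uncached: n
    # Cached < uncached  <=>  n > (write_multiplier - read_multiplier) / (1 - read_multiplier)
    threshold = (write_multiplier - read_multiplier) / (1 - read_multiplier)
    return math.floor(threshold) + 1


print(break_even_calls())                                             # 2
print(break_even_calls(write_multiplier=2.0))                         # 3
print(break_even_calls(write_multiplier=1.0, read_multiplier=0.5))    # 2
```

Under the default figures a single cache hit inside the TTL window already pays for the write, which is why the practical question is usually traffic per window rather than prefix size.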
Refactoring a production prompt to be cache-friendly is annoying. Designing for it from the start costs nothing. Put stable content first, mark caching boundaries on the providers that need them, and monitor cache_read ratios. It's the cheapest 5–10× cost win you'll ever ship.