Claude API Prompt Caching in 2026: How to Cut LLM Costs by 90%
A hands-on guide to Claude API prompt caching in 2026 — how cache_control works, what to cache, the TTL trade-offs, and the bugs that quietly void your cache.
If your Claude API bill crossed $500 a month sometime in the last year and you can’t fully explain where it went, there’s about a 90% chance the answer is the same: you’re re-sending the same system prompt, the same tool definitions, and the same long document on every single request, and paying full input token price every time. Claude API prompt caching is the feature that fixes this, and properly configured it routinely cuts input token costs by 80–90% on agent workloads.
This guide walks through how caching actually works under the hood, the two cache TTLs Anthropic now offers, what to cache and what to leave alone, and the three bugs that quietly destroy your cache hit rate without throwing any error. Code examples use the official Anthropic SDK.
TL;DR
- Prompt caching stores large prefixes of your request server-side so subsequent requests reuse them at ~10% of the normal input token price.
- You opt in per content block with cache_control: {"type": "ephemeral"}. Without it, nothing is cached.
- Anthropic offers two TTLs in 2026: 5 minutes (default) and 1 hour. The 1-hour cache costs 2× a cache write but is usually the right default for agents.
- Caches are bucketed by the exact prefix bytes, tool set, model version, and system message. Any change invalidates.
- Common killers: putting dynamic content (timestamps, user IDs) above a cache breakpoint, or mutating the tool list between requests.
Why Caching Matters So Much in 2026
Modern Claude workflows — agents, RAG pipelines, code assistants — are structurally expensive because they front-load a huge, mostly-static context on every call. A typical agent request in 2026 looks something like this:
- 8,000 tokens of system prompt defining the agent’s role, guardrails, and output format
- 12,000 tokens of tool schemas (every MCP server, every custom function)
- 4,000 tokens of conversation history
- 200 tokens of the actual new user turn
Without caching, you pay the full input rate on all 24,200 tokens every turn. With caching, you pay the write-rate once on the static 24,000 tokens, then only $0.30/MTok (10% of base) on every subsequent read, plus full price on the 200 new tokens. On a conversation of 20 turns the total input cost drops from roughly $1.45 to $0.21 — a 7× reduction, and that multiplier grows with both prompt size and turn count.
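To make that arithmetic concrete, here is a small sketch of the cost model in TypeScript. The rates are the illustrative Sonnet-class numbers used above ($3/MTok base, 1.25× write, 0.10× read), and `conversationCost` is a hypothetical helper, not part of the SDK — verify current rates against Anthropic's pricing page.

```typescript
// Illustrative 2026 Sonnet-class rates in $/MTok (assumptions, not canonical).
const BASE_INPUT = 3.0;      // uncached input
const CACHE_WRITE_5M = 3.75; // 1.25x base, 5-minute TTL write
const CACHE_READ = 0.3;      // 0.10x base

// Total input cost ($) for a conversation with a static cached prefix.
function conversationCost(opts: {
  staticTokens: number;    // cached prefix: system + tools + history
  newTokensPerTurn: number;
  turns: number;
  cached: boolean;
}): number {
  const { staticTokens, newTokensPerTurn, turns, cached } = opts;
  const perTok = (n: number, rate: number) => (n / 1_000_000) * rate;
  if (!cached) {
    // Full price on the entire prompt, every turn.
    return turns * perTok(staticTokens + newTokensPerTurn, BASE_INPUT);
  }
  return (
    perTok(staticTokens, CACHE_WRITE_5M) +           // one cache write, first turn
    (turns - 1) * perTok(staticTokens, CACHE_READ) + // cheap reads afterwards
    turns * perTok(newTokensPerTurn, BASE_INPUT)     // new tokens, full price
  );
}
```

Running the article's scenario (24,000 static tokens, 200 new tokens, 20 turns) through this helper reproduces the rough $1.45 vs ~$0.24 comparison.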
Anthropic documents the pricing structure on the prompt caching page, and the economics generally favor caching any static block over about 1,024 tokens (the minimum cacheable size for Sonnet-class models; Haiku-class has a higher 2,048-token floor for some blocks, so always check the current docs).
How Cache Breakpoints Actually Work
Caching is not automatic. You tag specific content blocks with cache_control, and Anthropic treats every tagged block as a cache breakpoint. The cached content is everything from the beginning of the prompt up through that breakpoint.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT, // ~8k tokens
      cache_control: { type: "ephemeral", ttl: "1h" },
    },
  ],
  tools: TOOLS_WITH_CACHE_ON_LAST_TOOL, // see below
  messages: [
    { role: "user", content: "What changed in the latest commit?" },
  ],
});
```
Two things to internalize:
Breakpoints are prefix-based, not random-access. If you cache block 3 of your system messages but block 2 changes between requests, the cache is invalidated all the way through block 3. The rule of thumb is: put static content at the top, dynamic content at the bottom, and place cache_control at the boundary.
You can have up to four breakpoints. This lets you cache the system prompt, the tool set, and a long user document all in one request — each can live in the cache independently at the cost of a separate write the first time.
The Tool List Is Part of the Cache Key
This is where teams routinely burn cache hits without realizing it. The entire tools array is part of the hashed cache key, and tools sit at the very top of the prompt prefix, before the system message. Add one new tool mid-conversation and you invalidate not just the breakpoint on the tools array but every cached block after it, including your system prompt.
Two practical rules: freeze your tool list for the duration of a conversation, and put the cache_control tag on the last tool definition so a single breakpoint covers the entire tool block.
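The second rule can be sketched with a hypothetical helper (the `Tool` type here is simplified; the real SDK type has more fields) that tags only the final tool so one breakpoint covers the whole array:

```typescript
// Simplified tool shape (assumption — the SDK's Tool type is richer).
type Tool = {
  name: string;
  description: string;
  input_schema: object;
  cache_control?: { type: "ephemeral"; ttl?: "5m" | "1h" };
};

// Return a copy of the tool list with a cache breakpoint on the LAST tool,
// so the entire tools block is covered by a single cache entry.
function withToolCacheBreakpoint(tools: Tool[], ttl: "5m" | "1h" = "1h"): Tool[] {
  if (tools.length === 0) return tools;
  return tools.map((tool, i) =>
    i === tools.length - 1
      ? { ...tool, cache_control: { type: "ephemeral", ttl } }
      : tool
  );
}
```

Because it returns a copy, the frozen source-of-truth tool list stays untouched between requests.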
5-Minute vs 1-Hour TTL: Which to Pick
Anthropic offers two cache lifetimes in 2026:
| TTL | Write cost multiplier | Read cost | Best for |
|---|---|---|---|
| 5 minutes | 1.25× base input | 0.10× base | Chat UIs, interactive loops where turns are ~seconds apart |
| 1 hour | 2× base input | 0.10× base | Background agents, batch pipelines, anything with gaps |
The decision hinges on one question: will your next request land within 5 minutes of the last one? For a live chat app, yes, and the cheaper 5-minute TTL is the right default. For an agent that does work, waits on a human review, then resumes, or a scheduled job that runs every 15 minutes, the 5-minute cache expires between calls and you end up paying the write cost over and over. Paying the 1-hour TTL's 2× write once is cheaper than paying the 1.25× write even twice.
If you’re not sure, instrument your request intervals first. Most teams assume they’re “interactive,” but in practice the P50 gap between turns is often longer than 5 minutes.
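The break-even logic above can be sketched with a hypothetical helper. It assumes the worst case — every cache miss forces a full re-write of the prefix — and expresses total write cost as a multiple of the base input price for the prefix:

```typescript
// Total write cost over a conversation, in multiples of the base input
// price for the cached prefix. Assumes a miss always triggers a full write.
function totalWriteMultiplier(
  ttlMinutes: number,       // cache lifetime: 5 or 60
  writeMultiplier: number,  // 1.25 for 5m TTL, 2 for 1h TTL
  gapMinutes: number,       // typical gap between requests
  turns: number
): number {
  // If the gap exceeds the TTL, every turn pays a fresh write;
  // otherwise only the first turn does.
  const writes = gapMinutes > ttlMinutes ? turns : 1;
  return writes * writeMultiplier;
}
```

For a job with 15-minute gaps over 10 turns, the 5-minute TTL pays 10 × 1.25 = 12.5 prefix-writes worth of cost, while the 1-hour TTL pays 1 × 2 = 2 — the gap-versus-TTL question dominates everything else.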
Pros & Cons
| Aspect | Pros | Cons |
|---|---|---|
| Cost | 80–90% reduction on static input tokens | First request pays 1.25–2× write penalty |
| Latency | Cached reads are materially faster (30–50% TTFT reduction on our agent workloads) | Cache miss is no faster than uncached |
| Complexity | Single cache_control field, no infrastructure | Easy to accidentally invalidate; requires discipline about prompt stability |
| Compatibility | Works across all current Claude 4.x models | Some older beta headers were deprecated in early 2026 — verify against current docs |
Who Should Use This
Use prompt caching if you are:
- Running any kind of agent loop (LangGraph, Claude Agent SDK, custom) with a system prompt over 1k tokens
- Building a RAG pipeline where the retrieved documents are reused across follow-up questions
- Running a coding assistant that re-sends the same repo tree or file contents on every turn
- Doing structured output generation where the same schema is sent on every call (see our guide on Claude structured output with Zod and streaming)
Skip it if you are:
- Doing one-shot completions with small prompts (under 1k tokens of static content)
- Sending genuinely different content on every request (e.g., classifying unrelated documents with short, different system prompts)
- Running workloads where the break-even write cost would never be recovered
The Three Bugs That Quietly Void Your Cache
1. Timestamps or IDs in the System Prompt
The most common mistake. Someone prepends a string like "Today's date is 2026-04-17T14:32:11Z" to the system prompt for grounding, and because it changes every request, no cache hit is ever possible on anything that comes after it.
Fix: put dynamic context in a user message (below any cache breakpoint) or in a separate trailing system message. The cached system prompt should be byte-identical across calls.
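One way to implement that fix, sketched with a hypothetical `buildSystem` helper: the cached block is byte-identical on every call, and the date lives in a separate, uncached trailing block after the breakpoint.

```typescript
// Build the system array: static prompt (cached) + dynamic date (uncached).
function buildSystem(staticPrompt: string, now: Date) {
  return [
    {
      type: "text" as const,
      text: staticPrompt, // identical bytes every request -> cache hit
      cache_control: { type: "ephemeral" as const, ttl: "1h" as const },
    },
    {
      type: "text" as const,
      // Dynamic content sits AFTER the breakpoint, so it never invalidates it.
      text: `Today's date is ${now.toISOString().slice(0, 10)}.`,
    },
  ];
}
```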
2. Model Version Drift
Caches are scoped to the exact model string. If half your fleet is on claude-sonnet-4-6 and half is on claude-opus-4-7, they maintain separate caches. Worse, if you alias to “the latest” and Anthropic ships a new point release, your cache is silently invalidated the moment the alias flips.
Fix: pin to an explicit model version in production, and roll new versions deliberately. See our agentic AI frameworks overview for the broader model-selection trade-offs.
3. Whitespace and JSON Formatting Jitter
The cache key hashes the exact bytes of the prefix. JSON.stringify(obj) with a different key order, a trailing newline that sometimes appears, or a template literal that interpolates a user’s locale-formatted date all produce byte-level drift.
Fix: generate the cacheable prefix through a single canonical serializer. Snapshot-test it. The usage.cache_read_input_tokens and usage.cache_creation_input_tokens fields on every response are your ground truth — log them.
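A minimal canonical serializer along those lines recursively sorts object keys so the same logical object always produces the same bytes. (This is a sketch; for production, libraries implementing RFC 8785 / JCS handle edge cases like number formatting.)

```typescript
// Deterministic JSON: recursively sort object keys so byte output is
// independent of property insertion order.
function canonicalStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value as object)
      .sort()
      .map(
        (k) =>
          `${JSON.stringify(k)}:` +
          canonicalStringify((value as Record<string, unknown>)[k])
      );
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}
```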
Monitoring Cache Effectiveness
Every response returns a usage object with four counters:
```json
{
  "usage": {
    "input_tokens": 212,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 23812,
    "output_tokens": 186
  }
}
```
Ship these four numbers to your telemetry. The metric that matters is cache read tokens / (cache read tokens + cache creation tokens + input tokens). On a mature agent this should sit above 90%. If it drops, you’ve changed your prompt, your tools, or your model.
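The metric is trivial to compute from the usage object; this hypothetical helper mirrors the formula above.

```typescript
// Shape of the usage counters returned on every Messages API response.
type Usage = {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
  output_tokens: number;
};

// Fraction of input tokens served from cache; target > 0.9 on mature agents.
function cacheHitRatio(u: Usage): number {
  const denom =
    u.cache_read_input_tokens + u.cache_creation_input_tokens + u.input_tokens;
  return denom === 0 ? 0 : u.cache_read_input_tokens / denom;
}
```

Applied to the sample response above (23,812 read, 0 written, 212 fresh), the ratio is about 0.99 — a healthy cache.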
FAQ
Does prompt caching work with streaming?
Yes. Streaming is orthogonal to caching — the cache read happens before the first token is emitted, so streamed responses benefit from the same latency improvement.
Can I cache extended thinking tokens?
As of 2026, thinking blocks in extended-thinking mode cache under the same rules as other content. The thinking itself is billed as output, not input, but the cached prefix read cost applies normally.
What’s the minimum cacheable block size?
1,024 tokens on Sonnet and Opus class models, 2,048 on Haiku for some block types. Requests below the minimum succeed but silently skip caching — check cache_creation_input_tokens to confirm.
Does this work with tool use and agent loops?
Yes, and this is where caching shines. Each turn’s tool result comes back as a small user message; the large static prefix (system + tools + prior turns) stays cached. An agent that does 10 tool calls with a 20k-token prefix pays roughly the same input cost as a single uncached call.
What about on AWS Bedrock or Google Vertex AI?
Both platforms have caching, but the API surface and pricing differ slightly from the direct Anthropic API. Verify the current docs for your provider — the core concept transfers, but the cache_control field name or TTL options may vary.
Bottom Line
Prompt caching is the single highest-leverage change you can make to a Claude API bill in 2026. Tag your system prompt and tool list with cache_control, pick the 1-hour TTL unless you’re genuinely in a sub-5-minute loop, freeze the cache prefix byte-for-byte across requests, and watch your cache-read ratio in the usage telemetry. Teams that do this consistently report 80–90% input-cost reductions on agent workloads — without changing a single line of model behavior.
If you’re building production agents, pair this with schema-locked structured output covered in our Claude structured output with Zod guide, and the agentic framework comparison for picking the orchestration layer.