
OpenAI Build Hour: Prompt Caching

February 26, 2026 · 18 min read · 3,555 words
Tags: Prompt Caching, OpenAI, API Optimization, LLM, Cost Reduction

Key insights

  • Prompt caching provides up to 90% discount on input tokens with GPT-5+ and up to 67% faster response times
  • Use prompt_cache_key to route related requests to the same server. One customer went from 60% to 87% cache hit rate
  • Warp doubled their cache hit rate by structuring caching in three levels: global, user, and task
  • There is no quality difference between responses with and without caching. The output is mathematically identical
Source: YouTube
Published February 18, 2026
OpenAI Build Hour
Hosts: Christine (Startup Marketing), Erica (Solutions Engineer)
Guest: Siraj (Technical Lead), Warp



In Brief

This Build Hour covers prompt caching, a technique that lets AI models reuse work they've already done, saving you money and delivering faster responses. Erica from OpenAI explains how it works under the hood, shows live demos with an AI styling assistant, and shares five concrete optimization tips. Siraj from Warp (a developer tool with 700,000+ users) shows how they more than doubled the percentage of requests hitting their cache.

  • 90% discount on cached tokens
  • 67% faster response times
  • 2x cache hit rate improvement

What is prompt caching?

Every time you send a message to an AI model via an API (Application Programming Interface, the way your code talks to the AI), the model has to do heavy mathematical work to understand everything you've sent: instructions, previous messages, images, and so on. Prompt caching means the model remembers this work from last time. If you send the same content again, it doesn't have to redo the job. As Erica from OpenAI explains it: "We reuse computation. When multiple requests share the same prefix (the identical part at the start of a prompt), we skip processing those tokens and only use resources on what's new" (3:15).

Analogy: Think of it like a chef making the same sauce for 50 dishes. Without caching, the chef makes the sauce from scratch for each dish. With caching, they make it once and reuse it, focusing instead on what's new on the plate. Unlike a real sauce that stays good until it runs out, the cached work expires after 5-10 minutes unless you extend it.

Ground rules

Erica walked through the basic rules for how OpenAI's prompt caching works (3:43):

  • Minimum 1,024 tokens: your prompt must be at least 1,024 tokens (roughly 750 words) before caching kicks in. Shorter prompts are not cached
  • Blocks of 128 tokens: after the first 1,024 tokens, content is cached in blocks of 128 tokens
  • Order must be identical. The model only recognizes work it's done before if everything comes in exactly the same order. Swap two paragraphs, and it has to redo everything
  • Works for everything: text, images, and audio
  • Happens automatically. You don't need to change your code. OpenAI caches automatically (this is called implicit caching) (4:14)
  • Lasts 5-10 minutes. After that, the cache is cleared from memory. With extended prompt caching you can extend this to 24 hours (4:34)
Explained simply: A token is a small piece of text that AI models use instead of whole words. A token is typically 3-4 characters. The word "prompt" is one token, while "caching" is two. When we talk about "1,024 tokens", that means roughly 750 English words.
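These rules can be made concrete with a small helper. This is a sketch of the arithmetic only, not an OpenAI function: below the 1,024-token minimum nothing is cached, and past it the cacheable prefix rounds down to whole 128-token blocks.

```python
def max_cacheable_tokens(prompt_tokens: int) -> int:
    """Upper bound on tokens that can hit the cache, per the rules above.
    Illustrative arithmetic only, not part of any OpenAI SDK."""
    if prompt_tokens < 1024:
        return 0  # below the minimum, nothing is cached
    # past 1,024 tokens, content is cached in whole 128-token blocks
    return 1024 + ((prompt_tokens - 1024) // 128) * 128

print(max_cacheable_tokens(1000))  # 0
print(max_cacheable_tokens(1500))  # 1408 (1024 + 3 full blocks of 128)
```

At 1,500 prompt tokens, at most 1,408 can be served from cache; the remaining 92 fall in a partial block and are processed normally.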


What do you save? Price and speed

Price

When the model reuses cached work, those tokens cost much less. Erica showed how discounts have increased with newer OpenAI models (5:00):

  Model family        Discount on cached tokens
  GPT-4o              50%
  GPT-4.1             75%
  GPT-5+              90%
  Realtime (audio)    ~99%

With GPT-5, you pay only 10% of the normal price for the part of the prompt that's already cached. For the audio model, it's nearly free.

What does this mean in dollars and cents?

Imagine you're building a chatbot for an online store. It handles 1,000 conversations per day, and each conversation sends the same product descriptions, instructions, and FAQ. That's about 3,000 tokens that are identical every time.

                               Without caching    With caching
  Daily cost (input tokens)    ~$8                ~$1.50
  Monthly cost                 ~$240              ~$45
  Monthly savings with caching: ~$195

That's the difference between a noticeable expense and something that barely shows on the bill. And this is just input tokens (what you send to the model). Output tokens (what the model sends back) cost the same regardless.
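The arithmetic behind a table like this can be sketched as below. The price per million tokens is a placeholder, not OpenAI's actual rate, and the function is an illustration rather than a billing formula.

```python
def daily_input_cost(conversations: int, tokens_each: int, price_per_m: float,
                     cached_fraction: float = 0.0, discount: float = 0.9) -> float:
    """Daily input-token cost. cached_fraction is the share of tokens served
    from cache; discount is the cached-token discount (0.9 for GPT-5+)."""
    full = conversations * tokens_each * price_per_m / 1_000_000
    # cached tokens cost (1 - discount) of the normal rate
    return full * (1 - cached_fraction * discount)

# 1,000 conversations/day x 3,000 identical tokens, at a placeholder
# price of $2.50 per million input tokens:
without = daily_input_cost(1000, 3000, 2.50)
with_cache = daily_input_cost(1000, 3000, 2.50, cached_fraction=0.9)
print(f"${without:.2f}/day vs ${with_cache:.2f}/day")
```

With a 90% cache hit and a 90% discount, the input bill drops to under a fifth of the uncached cost, which is the shape of the savings in the table above.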

Speed (latency)

Latency is the wait time from when you send a request to when you get the first word back, often called time-to-first-token (TTFT). Erica tested 2,300 requests of varying length (5:38) and found:

  • For short prompts: minimal difference (~7% faster with cache)
  • For long prompts: up to 67% faster with cache (6:08)

The longer the prompt, the more time caching saves. That's because caching makes the wait time depend on how much new content the model needs to process, not on how long the entire conversation history is.


Under the hood: What's actually in the cache?

To understand why the tips later work, it helps to know a bit about how AI models actually process text. Erica spends a few minutes explaining this (6:32).

Attention: how the model "reads"

The heart of a modern AI model is a mechanism called attention. It lets the model decide which parts of the text are relevant to each individual word it needs to understand.

Think of it this way: when you read the sentence "she went to the bank to go fishing", you know that "bank" means a riverbank, because you see the word "fishing" in the context. Erica uses precisely this example to show how the model does something similar, but mathematically (8:59).

For each token, the model creates three representations (8:01):

  • Query: "what do I need to know about the other words to understand myself?"
  • Key: "what do I represent? Am I a verb, a place, a color?"
  • Value: "if I'm relevant, here's the information I can contribute"

The model compares each token's query with all other tokens' keys to find out what's relevant, then retrieves information from the relevant tokens' values. This process happens in layers. A model can have 32 to 64 such layers stacked on top of each other, and the result is an increasingly rich understanding of the text.

What the cache actually stores

All this query-key-value work produces enormous amounts of numbers. Without caching, the model has to redo all this work for every API call, even if 90% of the text is identical to last time. As Erica says: what's stored in the cache is just a giant pile of floating-point numbers (decimal numbers like 0.7382). Not your words, not your text, but the finished mathematical representations that the attention process produced (9:14).
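The idea can be demonstrated with a toy single-layer attention step in numpy. This is a sketch of the mechanism, not OpenAI's implementation: the prefix's key/value projections are computed once, "cached", and reused, and the result matches recomputing everything from scratch.

```python
import numpy as np

def attention(q, K, V):
    # one query token attending over all keys/values (single head, single layer)
    scores = q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
prefix = rng.standard_normal((5, d))    # embeddings of the cached prefix tokens
new_tok = rng.standard_normal((1, d))   # the only new token in this request

# Without a cache: project keys/values for the whole sequence from scratch
X = np.vstack([prefix, new_tok])
K_full, V_full = X @ Wk, X @ Wv

# With a KV cache: the prefix's keys/values are reused; only the new token is projected
K_cached, V_cached = prefix @ Wk, prefix @ Wv        # this part is "the cache"
K_inc = np.vstack([K_cached, new_tok @ Wk])
V_inc = np.vstack([V_cached, new_tok @ Wv])

q = (new_tok @ Wq)[0]
assert np.allclose(attention(q, K_full, V_full), attention(q, K_inc, V_inc))
```

The assert holds because the cached keys and values are exactly the numbers the full computation would produce, which is also why caching cannot change output quality.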


What happens when you send an API request to OpenAI?

Erica walked through the actual flow that happens behind the scenes (9:38):

  1. Fingerprint the start: OpenAI takes the first 256 tokens of your message and creates a hash (a digital fingerprint). This is used to find the right server (10:06).
  2. Choose a server (engine): The request is routed to one of many GPUs (graphics processing units). Each GPU can handle about 15 requests per minute (10:19).
  3. Check the cache: Has this server seen your text before? It checks in blocks of 128 tokens until it finds something new.
  4. Process the new part: Everything that wasn't cached is run through the model normally.
  5. Update the cache: After the model has responded, the new results are stored so the next request can benefit (11:43).

Key insight: Caching only works if the request lands on the same server as last time. This is one of the reasons prompt_cache_key (more on that soon) is so useful: it helps OpenAI route related requests to the same place.


Five optimization tips

Erica presented five concrete tips for maximizing cache hit rate (15:05).

1. Use prompt cache key

The problem: OpenAI uses the first 256 tokens to determine which server your request is sent to. But if you send thousands of requests with the same start, traffic must be spread across multiple servers to avoid overload, and then they don't hit each other's cache (15:36).

The solution: Include a prompt_cache_key, an optional parameter that helps OpenAI group related requests on the same server. Think of it as a mailing address: without it, your packages end up at random post offices. With it, everything goes to the right place. Erica compares it to a shard key (a label that tells a database which server should store a particular piece of data) in a database (20:20).

Results: A coding customer increased their cache hit rate (the percentage of requests that hit the cache) from 60% to 87% just by adding this parameter (16:23).

Strategies for choosing a key:

  • Per user: when the same user works on the same codebase across multiple conversations
  • Per conversation: when users have completely unrelated tasks
  • Grouped: combine multiple users under one key to use server capacity more effectively (max 15 requests/min per server)
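In practice this is a single extra parameter on the request. The sketch below assembles the payload for the per-user strategy; the key scheme and field values are illustrative, and with the official openai Python SDK the dict would be passed as client.responses.create(**request):

```python
def build_request(user_id: str, system_prompt: str, user_message: str) -> dict:
    """Keyword arguments for a Responses API call using the per-user
    prompt_cache_key strategy described above. Illustrative values."""
    return {
        "model": "gpt-5",
        "instructions": system_prompt,          # identical prefix across requests
        "input": user_message,
        "prompt_cache_key": f"user-{user_id}",  # groups this user's requests together
    }

req = build_request("42", "You are a styling assistant...", "What goes with a navy blazer?")
assert req["prompt_cache_key"] == "user-42"
```

Switching the key scheme (per conversation, grouped) only changes how the f-string is built; the rest of the request stays identical.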

2. Be mindful of context engineering

Context engineering is about controlling what the model sees: removing irrelevant information, summarizing long conversations, etc. But here's a dilemma: context engineering changes the content, while caching requires that the content stays identical. Erica calls them "inherently at odds" (22:00).

Two approaches:

  • Trimming: remove older conversation history entirely. Make large cuts infrequently rather than small cuts on every call. Think of it as cleaning your desk. It's better to do a proper cleanup once in a while than to move one thing every time
  • Summarization (compaction): instead of deleting old context, use OpenAI's new compaction endpoint (responses/compact) to create a summary that replaces the full conversations

For the realtime API (voice) with its 32K token limit, Erica showed that a retention_ratio of 0.7 (keep 70% of the context, delete 30% when needed) can save 70% on 30-minute sessions (25:36), compared to trimming a little on every call.
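The trimming advice (large, infrequent cuts) can be sketched as a helper that leaves the history alone until it is over budget, then cuts deep in one go so the remaining prefix stays stable across many calls. Token counts are faked with word counts here for illustration:

```python
def trim_history(messages: list[str], budget: int, headroom: float = 0.3) -> list[str]:
    """Trim only when the history exceeds the budget, then cut past
    budget * (1 - headroom) so trims are rare and the surviving prefix
    stays stable for many calls. 'Tokens' are faked as word counts."""
    size = lambda msgs: sum(len(m.split()) for m in msgs)
    if size(messages) <= budget:
        return messages  # untouched, so the cached prefix is fully preserved
    target = int(budget * (1 - headroom))
    trimmed = list(messages)
    while len(trimmed) > 1 and size(trimmed) > target:
        trimmed.pop(0)  # drop the oldest message first
    return trimmed

history = [f"message {i} padded with a few words" for i in range(100)]  # 7 words each
assert trim_history(history, budget=1000) is history  # under budget: no change at all
assert len(trim_history(history, budget=300)) == 30   # one deep cut, then stability
```

A trimmer that shaved a little on every call would change the prefix every time and break the cache on every request; this one breaks it once per deep cut.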

3. Use Responses API with reasoning models

Reasoning models (like o3, o4-mini) think step-by-step internally before giving an answer. They generate hidden "thinking tokens" (the model's internal reasoning that the user never sees). If you use the older Chat Completions API, these thinking tokens are discarded between conversation turns. This means the model "forgets" its own reasoning, and the cache breaks.

With OpenAI's Responses API, thinking tokens are preserved. Erica showed that simply switching APIs can increase cache hit rate from 40% to 80%, plus smarter answers because the model retains its reasoning history (28:49).

4. Consider flex processing over Batch API

If you're processing large amounts of data (e.g., images overnight), you might already use Batch API (sending many requests at once instead of one by one) for a 50% discount. Flex processing gives you the same discount, but with more control (29:16):

  • Set service_tier: "flex" on individual requests instead of sending everything as one batch
  • Combine it with extended prompt caching and prompt cache key, something Batch API doesn't support

Erica tested with 10,000 requests and found 8.5% higher cache hit rate and 23% lower cost on input tokens with flex (30:23).
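Switching a bulk job from Batch API to flex is mostly a matter of setting service_tier on each request. A sketch of one per-request payload (field values are illustrative; with the openai SDK the dict would go to client.responses.create(**...)):

```python
def build_flex_request(image_url: str, shared_prompt: str) -> dict:
    """One request in a bulk job using flex processing instead of Batch API:
    the same 50% discount, but combinable with caching and a cache key."""
    return {
        "model": "gpt-5",
        "instructions": shared_prompt,            # identical prefix -> cacheable
        "input": f"Describe the product photo at {image_url}",
        "service_tier": "flex",                   # flex pricing on this single request
        "prompt_cache_key": "nightly-image-job",  # keep the whole job on warm servers
    }

req = build_flex_request("https://example.com/a.jpg", "You label product photos.")
assert req["service_tier"] == "flex"
```

Because each request is sent individually, the shared instructions get cached across the job, which Batch API cannot exploit in the same way.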

5. Use allowed_tools instead of changing the toolset

When you use OpenAI's API with tools (function calling), the tool definitions are part of the prefix that gets cached. If you remove or add a tool, the cache breaks (30:44).

The solution: define all your tools once, and use the allowed_tools parameter to control which tools the model can actually use on a given request. The tool definitions remain identical (cache is preserved), but the model only sees the tools you allow.
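A sketch of the pattern: the tools list stays byte-identical on every request, and only the allow-list varies. The tool_choice shape below follows the allowed_tools option as described in the talk and should be checked against current API docs; tool names and schemas are made up for illustration.

```python
def tool(name: str) -> dict:
    # minimal function-tool stub; real definitions carry full JSON-schema parameters
    return {"type": "function", "name": name,
            "parameters": {"type": "object", "properties": {}}}

ALL_TOOLS = [tool("search_products"), tool("check_inventory"), tool("issue_refund")]

def build_request(user_message: str, allowed: list[str]) -> dict:
    """Tool definitions never change between requests (cache preserved);
    only the allow-list in tool_choice varies per request."""
    return {
        "model": "gpt-5",
        "input": user_message,
        "tools": ALL_TOOLS,  # never add or remove entries between requests
        "tool_choice": {
            "type": "allowed_tools",
            "mode": "auto",
            "tools": [{"type": "function", "name": n} for n in allowed],
        },
    }

req = build_request("Where is my order?", allowed=["search_products"])
assert req["tools"] is ALL_TOOLS  # definitions unchanged -> prefix unchanged
```

Removing a tool from ALL_TOOLS would change the cached prefix for every user; shrinking the allow-list does not.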


Customer case: Warp

Siraj, technical lead at Warp (a developer platform where AI agents write and debug code for 700,000+ developers), shares how they think about prompt caching in practice (35:15).

How a code agent works

Siraj explains that AI agents in Warp work in loops (37:05): the user gives a task (e.g., "fix this compilation error"), the agent reads files, runs commands, thinks about the result, and repeats, step by step. With each step, the prompt grows, but most of the content is identical from the previous step. This makes code agents a perfect use case for prompt caching. Tools like Claude Code, where sub-agents (smaller AI helpers launched by a main agent) run in parallel loops, multiply this effect further.

Three levels of caching

Warp structures caching in three layers, from broad to narrow (41:30):

  1. Global level: System prompt (the instructions a developer gives to the AI, like "you are a helpful coding assistant") and tool definitions are identical for all users. This gives ~15,000 cached tokens already on the very first request (42:00), because thousands of users send the same thing.
  2. User level: The user's personal setup (rules, MCP servers for connecting to external tools, codebases) is placed in a separate message after the system prompt. When a user runs multiple agents simultaneously, they share this cache.
  3. Task level: Within a single task, the prompt grows step by step. Most content is identical from step to step, so cache reuse is highest here.

The key: Warp removed all dynamic information from the system prompt and placed it in a separate message afterwards. This way, all users share the global cache, while each user gets their own cache at level two.
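The layering can be sketched as a message-list builder. Names, roles, and structure are illustrative, not Warp's actual code:

```python
GLOBAL_SYSTEM_PROMPT = "You are a coding agent. Use the provided tools."  # shared by all users

def build_messages(user_setup: str, task_history: list[str]) -> list[dict]:
    """Level 1: global system prompt (cache shared across all users).
    Level 2: the user's own setup, in a separate message right after it.
    Level 3: the growing task history, appended in order."""
    msgs = [{"role": "system", "content": GLOBAL_SYSTEM_PROMPT}]
    msgs.append({"role": "user", "content": f"User setup:\n{user_setup}"})
    msgs += [{"role": "user", "content": step} for step in task_history]
    return msgs

turn1 = build_messages("rules, MCP servers, codebase paths", ["fix this compile error"])
turn2 = build_messages("rules, MCP servers, codebase paths",
                       ["fix this compile error", "tool output: error in main.rs:14"])
assert turn2[:len(turn1)] == turn1  # each turn only extends the previous -> cacheable prefix
```

Because the dynamic user setup lives in its own message, the first message is identical for everyone (global cache), and everything before the newest step is identical between turns (task cache).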

Golden rule: Never modify history

Siraj shares an important lesson (40:32): if a user changes their mind mid-conversation, it's tempting to go back and modify the original message. But that destroys the cache for everything that came after it. Warp's solution: add a new message at the end saying "the user has changed plans". This preserves the entire existing cache.
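The golden rule as code: append a correction instead of editing in place, so every earlier message remains a valid cache prefix (a minimal sketch; names are illustrative):

```python
def user_changed_mind(history: list[dict], new_plan: str) -> list[dict]:
    """Cache-friendly course correction: never edit earlier messages,
    just append the change of plan at the end."""
    return history + [{"role": "user",
                       "content": f"The user has changed plans: {new_plan}"}]

history = [
    {"role": "user", "content": "Rename this function everywhere"},
    {"role": "assistant", "content": "Done, updated 3 files."},
]
updated = user_changed_mind(history, "actually, revert the rename")
assert updated[:len(history)] == history  # old prefix untouched -> cache preserved
```

Editing the first message instead would invalidate the cache for every message after it, forcing the model to reprocess the whole conversation.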

Results

By introducing a task-scoped prompt cache key, Warp doubled their cache hit rate (43:41). Siraj emphasizes the balance: the key should be stable enough to give a high hit rate, but not so narrow that it spreads requests across too many servers.


No trade-offs, seriously

Erica is crystal clear: there is no quality difference between responses with and without caching. Given identical input, the mathematical representations in the cache will be exactly the same as if the model computed them from scratch. The responses are identical (52:09).

The only trade-off is architectural: if you design your entire system solely to maximize caching, you might end up giving the model too much irrelevant context (because you don't dare change it). It's about finding the balance between good context management and high cache utilization.


Checklist: Why am I not hitting the cache?

Erica wrapped up with a troubleshooting list (33:49). Experiencing lower cache hit rate than expected? Go through this:

  • Are you changing the content? Even an extra space or a timestamp in the prompt breaks the cache
  • Are you changing the toolset? Use allowed_tools instead (see tip 5)
  • Has too much time passed? Standard cache lasts 5-10 minutes. Use extended prompt caching for 24 hours
  • Too many requests? High volume spreads traffic across multiple servers. Use prompt cache key to control routing
  • Prompt too short? Prompts under 1,024 tokens are not cached. Erica showed that even with just 50% cache hit rate, you save 33% on token costs by extending your prompt past the 1,024 threshold (51:01)
  • Wrong API? Chat Completions with reasoning models gives lower cache hit. Switch to Responses API
  • Batch API with older models? Pre-GPT-5 models don't support caching in Batch API. Try flex processing
Remember: The theoretical maximum for cache hit rate is always higher than what you see in practice. Engine health, load balancing, and other factors mean some requests will always "miss", and that's completely normal.

Practical implications

Solo developers

Start with implicit caching. You don't need to change your code at all. Just use GPT-5+ and you automatically get up to 90% discount on repeated content. If your prompts are under 1,024 tokens, consider adding a detailed system prompt to push past the threshold. Even 50% cache hit rate saves 33% on token costs.

Teams building production apps

Follow the Warp pattern: structure your prompts in three layers (global system prompt, user context, task context) and use prompt_cache_key to route related requests to the same server. Move all dynamic information out of the system prompt and into a separate message. This single change can double your cache hit rate.

Cost-conscious projects

Combine prompt caching with flex processing for a 50% discount plus cache benefits. If you're currently using Batch API, test whether flex processing gives better total cost. Erica's benchmark showed 23% lower input token cost with flex compared to batch.

Test yourself

  1. Architecture trade-off: Context engineering and caching are "inherently at odds." How would you design a system that balances both, and when would you prioritize one over the other?
  2. Cost model: With a 90% discount on cached tokens in GPT-5+, at what point does it become cheaper to send too much context rather than invest in smart trimming?
  3. Server routing: Prompt cache key controls which server a request lands on. What can go wrong if the key is too broad (too many users per key) vs. too narrow?
  4. Developer behavior: Erica says caching doesn't affect quality. But can it affect developer behavior, e.g., avoiding prompt improvements because changes break the cache?
  5. Transfer learning: Warp structures caching in three levels (global, user, task). How would this strategy look for a completely different application, e.g., a medical advisory service?

Glossary

  • API (Application Programming Interface): A way for programs to talk to each other. When you build a chatbot using GPT-5, your code sends requests to OpenAI's API, like a waiter taking your order and bringing food from the kitchen.
  • Attention: The mechanism that lets an AI model figure out which words are relevant to each other. Much like when you read "bank" and use the rest of the sentence to understand whether it's about money or a riverbank.
  • Batch API: A way to send many requests to OpenAI at once (in a "batch"). Cheaper, but you don't get answers immediately. You submit the job and collect results later.
  • Cache / Caching: Storing the result of work so you don't have to redo it. Your browser does this with web pages (that's why they load faster the second time), and AI models do it with their computations.
  • Cache hit rate: The percentage of your requests that hit the cache, i.e., how often the model can reuse previous work. An 80% cache hit rate means 80% of your input tokens are free or heavily discounted.
  • Chain-of-thought: When a reasoning model (like o3) thinks step-by-step internally before answering. These "thinking steps" are usually hidden from the user.
  • Chat Completions API: OpenAI's older API for talking to AI models. Works fine, but for reasoning models you lose thinking tokens between conversation turns, which breaks the cache.
  • Compaction: Compressing a long conversation history into a shorter summary. OpenAI offers this as a dedicated service (responses/compact). Saves tokens, but breaks the cache for what was summarized.
  • Context window: The model's "working memory", meaning everything it can see at once. GPT-5 has a window of up to 1 million tokens. When the window is full, older content must be removed.
  • Engine: A single GPU (graphics card) that processes AI requests at OpenAI. Each engine can handle about 15 requests per minute.
  • Extended prompt caching: An OpenAI parameter that extends the cache from 5-10 minutes to 24 hours. The cache is moved from RAM to local GPU storage.
  • Flex processing: An alternative to Batch API where you send requests one by one (with service_tier: "flex"), but get the same 50% discount. The advantage is that you can combine it with caching and prompt cache key.
  • Function calling: When the AI model can "use tools", e.g., look up weather, search a database, or send an email. You define the tools as functions in the API call.
  • Hash: A digital fingerprint, a short code representing a larger piece of data. OpenAI hashes the first 256 tokens to quickly find the right server. Even a small change in input produces a completely different hash.
  • KV cache: Short for Key-Value cache. Stores the finished mathematical representations (key and value tensors) from the attention process, so they can be reused.
  • Latency: The wait time from when you send a request to when you get a response. Low latency = fast response. Often measured as time-to-first-token (TTFT), the time until the first word appears.
  • Prefix: Everything you send to the model that's identical from one request to the next: system prompt, tool definitions, previous messages. This is the part caching reuses.
  • Prompt: Everything you send to the AI model in a request: instructions, context, your question, images, etc. The system prompt is the part the developer writes (e.g., "you are a helpful assistant"), while the user prompt is what the end user writes.
  • Prompt cache key: An optional parameter you send with your request to help OpenAI route it to the right server. Works like an address label saying "this request belongs with the others from the same conversation/user".
  • Reasoning model: AI models (like o3, o4-mini) that think step-by-step before answering, instead of generating the response directly. Gives better answers on complex tasks, but uses more tokens.
  • Responses API: OpenAI's newer API that replaces Chat Completions. Preserves reasoning tokens between conversation turns, which gives better caching and smarter answers.
  • Token: The smallest unit an AI model works with. Roughly 3-4 characters or 3/4 of an English word. "Hello world" is two tokens. Pricing and context windows are measured in tokens.
  • Truncation: Cutting away older parts of the conversation history to stay within the model's context window. Like tearing out the first pages of a notebook when it's full.

Sources and resources