
OpenAI Build Hour: Prompt Caching

February 26, 2026 · 18 min read · 3,555 words
Tags: Prompt Caching, OpenAI, API Optimization, LLM, Cost Reduction

Key insights

  • Prompt caching provides up to 90% discount on input tokens with GPT-5+ and up to 67% faster response times
  • Use prompt_cache_key to route related requests to the same server. One customer went from 60% to 87% cache hit rate
  • Warp doubled their cache hit rate by structuring caching in three levels: global, user, and task
  • There is no quality difference between responses with and without caching. The output is mathematically identical
Source: YouTube
Published February 18, 2026
OpenAI Build Hour
Hosts: Christine (Startup Marketing), Erica (Solutions Engineer)
Guest: Siraj (Technical Lead), Warp



In Brief

This Build Hour covers prompt caching, a technique that lets AI models reuse work they've already done, saving you money and delivering faster responses. Erica from OpenAI explains how it works under the hood, shows live demos with an AI styling assistant, and shares five concrete optimization tips. Siraj from Warp (a developer tool with 700,000+ users) shows how they more than doubled the percentage of requests hitting their cache.

  • 90% discount on cached tokens
  • 67% faster response times
  • 2x cache hit rate improvement

What is prompt caching?

Every time you send a message to an AI model via an API (Application Programming Interface, the way your code talks to the AI), the model has to do heavy mathematical work to understand everything you've sent: instructions, previous messages, images, and so on. Prompt caching means the model remembers this work from last time. If you send the same content again, it doesn't have to redo the job. As Erica from OpenAI explains it: "We reuse computation. When multiple requests share the same prefix (the identical part at the start of a prompt), we skip processing those tokens and only use resources on what's new" (3:15).

Analogy: Think of it like a chef making the same sauce for 50 dishes. Without caching, the chef makes the sauce from scratch for each dish. With caching, they make it once and reuse it, focusing instead on what's new on the plate. Unlike a real sauce that stays good until it runs out, the cached work expires after 5-10 minutes unless you extend it.

Ground rules

Erica walked through the basic rules for how OpenAI's prompt caching works (3:43):

  • Minimum 1,024 tokens: your prompt must be at least 1,024 tokens (roughly 750 words) before caching kicks in. Shorter prompts are not cached
  • Blocks of 128 tokens: after the first 1,024 tokens, content is cached in blocks of 128 tokens
  • Order must be identical. The model only recognizes work it's done before if everything comes in exactly the same order. Swap two paragraphs, and it has to redo everything
  • Works for everything: text, images, and audio
  • Happens automatically. You don't need to change your code. OpenAI caches automatically (this is called implicit caching) (4:14)
  • Lasts 5-10 minutes. After that, the cache is cleared from memory. With extended prompt caching you can extend this to 24 hours (4:34)
Explained simply: A token is a small piece of text that AI models use instead of whole words. A token is typically 3-4 characters. The word "prompt" is one token, while "caching" is two. When we talk about "1,024 tokens", that means roughly 750 English words.
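These rules can be made concrete with a small helper. This is a sketch of the arithmetic only, not an OpenAI function: below the 1,024-token minimum nothing is cached, and past it the cacheable prefix rounds down to whole 128-token blocks.

```python
def max_cacheable_tokens(prompt_tokens: int) -> int:
    """Upper bound on tokens that can hit the cache, per the rules above.
    Illustrative arithmetic only, not part of any OpenAI SDK."""
    if prompt_tokens < 1024:
        return 0  # below the minimum, nothing is cached
    # past 1,024 tokens, content is cached in whole 128-token blocks
    return 1024 + ((prompt_tokens - 1024) // 128) * 128

print(max_cacheable_tokens(1000))  # 0
print(max_cacheable_tokens(1500))  # 1408 (1024 + 3 full blocks of 128)
```

At 1,500 prompt tokens, at most 1,408 can be served from cache; the remaining 92 fall in a partial block and are processed normally.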


What do you save? Price and speed

Price

When the model reuses cached work, those tokens cost much less. Erica showed how discounts have increased with newer OpenAI models (5:00):

  Model family        Discount on cached tokens
  GPT-4o              50%
  GPT-4.1             75%
  GPT-5+              90%
  Realtime (audio)    ~99%

With GPT-5, you pay only 10% of the normal price for the part of the prompt that's already cached. For the audio model, it's nearly free.

What does this mean in dollars and cents?

Imagine you're building a chatbot for an online store. It handles 1,000 conversations per day, and each conversation sends the same product descriptions, instructions, and FAQ. That's about 3,000 tokens that are identical every time.

                               Without caching    With caching
  Daily cost (input tokens)    ~$8                ~$1.50
  Monthly cost                 ~$240              ~$45
  Monthly savings with caching: ~$195

That's the difference between a noticeable expense and something that barely shows on the bill. And this is just input tokens (what you send to the model). Output tokens (what the model sends back) cost the same regardless.
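The arithmetic behind a table like this can be sketched as below. The price per million tokens is a placeholder, not OpenAI's actual rate, and the function is an illustration rather than a billing formula.

```python
def daily_input_cost(conversations: int, tokens_each: int, price_per_m: float,
                     cached_fraction: float = 0.0, discount: float = 0.9) -> float:
    """Daily input-token cost. cached_fraction is the share of tokens served
    from cache; discount is the cached-token discount (0.9 for GPT-5+)."""
    full = conversations * tokens_each * price_per_m / 1_000_000
    # cached tokens cost (1 - discount) of the normal rate
    return full * (1 - cached_fraction * discount)

# 1,000 conversations/day x 3,000 identical tokens, at a placeholder
# price of $2.50 per million input tokens:
without = daily_input_cost(1000, 3000, 2.50)
with_cache = daily_input_cost(1000, 3000, 2.50, cached_fraction=0.9)
print(f"${without:.2f}/day vs ${with_cache:.2f}/day")
```

With a 90% cache hit and a 90% discount, the input bill drops to under a fifth of the uncached cost, which is the shape of the savings in the table above.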

Speed (latency)

Latency is the wait time from when you send a request to when you get the first word back, often called time-to-first-token (TTFT). Erica tested 2,300 requests of varying length (5:38) and found:

  • For short prompts: minimal difference (~7% faster with cache)
  • For long prompts: up to 67% faster with cache (6:08)

The longer the prompt, the more time caching saves. That's because caching makes the wait time depend on how much new content the model needs to process, not on how long the entire conversation history is.


Under the hood: What's actually in the cache?

To understand why the tips later work, it helps to know a bit about how AI models actually process text. Erica spends a few minutes explaining this (6:32).

Attention: how the model "reads"

The heart of a modern AI model is a mechanism called attention. It lets the model decide which parts of the text are relevant to each individual word it needs to understand.

Think of it this way: when you read the sentence "she went to the bank to go fishing", you know that "bank" means a riverbank, because you see the word "fishing" in the context. Erica uses precisely this example to show how the model does something similar, but mathematically (8:59).

For each token, the model creates three representations (8:01):

  • Query: "what do I need to know about the other words to understand myself?"
  • Key: "what do I represent? Am I a verb, a place, a color?"
  • Value: "if I'm relevant, here's the information I can contribute"

The model compares each token's query with all other tokens' keys to find out what's relevant, then retrieves information from the relevant tokens' values. This process happens in layers. A model can have 32 to 64 such layers stacked on top of each other, and the result is an increasingly rich understanding of the text.

What the cache actually stores

All this query-key-value work produces enormous amounts of numbers. Without caching, the model has to redo all this work for every API call, even if 90% of the text is identical to last time. As Erica says: what's stored in the cache is just a giant pile of floating-point numbers (decimal numbers like 0.7382). Not your words, not your text, but the finished mathematical representations that the attention process produced (9:14).
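The idea can be demonstrated with a toy single-layer attention step in numpy. This is a sketch of the mechanism, not OpenAI's implementation: the prefix's key/value projections are computed once, "cached", and reused, and the result matches recomputing everything from scratch.

```python
import numpy as np

def attention(q, K, V):
    # one query token attending over all keys/values (single head, single layer)
    scores = q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
prefix = rng.standard_normal((5, d))    # embeddings of the cached prefix tokens
new_tok = rng.standard_normal((1, d))   # the only new token in this request

# Without a cache: project keys/values for the whole sequence from scratch
X = np.vstack([prefix, new_tok])
K_full, V_full = X @ Wk, X @ Wv

# With a KV cache: the prefix's keys/values are reused; only the new token is projected
K_cached, V_cached = prefix @ Wk, prefix @ Wv        # this part is "the cache"
K_inc = np.vstack([K_cached, new_tok @ Wk])
V_inc = np.vstack([V_cached, new_tok @ Wv])

q = (new_tok @ Wq)[0]
assert np.allclose(attention(q, K_full, V_full), attention(q, K_inc, V_inc))
```

The assert holds because the cached keys and values are exactly the numbers the full computation would produce, which is also why caching cannot change output quality.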


What happens when you send an API request to OpenAI?

Erica walked through the actual flow that happens behind the scenes (9:38):

  1. Fingerprint the start: OpenAI takes the first 256 tokens of your message and creates a hash (a digital fingerprint). This is used to find the right server (10:06).
  2. Choose a server (engine): The request is routed to one of many GPUs (graphics processing units). Each GPU can handle about 15 requests per minute (10:19).
  3. Check the cache: Has this server seen your text before? It checks in blocks of 128 tokens until it finds something new.
  4. Process the new part: Everything that wasn't cached is run through the model normally.
  5. Update the cache: After the model has responded, the new results are stored so the next request can benefit (11:43).

Key insight: Caching only works if the request lands on the same server as last time. This is one of the reasons prompt_cache_key (more on that soon) is so useful: it helps OpenAI route related requests to the same place.


Five optimization tips

Erica presented five concrete tips for maximizing cache hit rate (15:05).

1. Use prompt cache key

The problem: OpenAI uses the first 256 tokens to determine which server your request is sent to. But if you send thousands of requests with the same start, traffic must be spread across multiple servers to avoid overload, and then they don't hit each other's cache (15:36).

The solution: Include a prompt_cache_key, an optional parameter that helps OpenAI group related requests on the same server. Think of it as a mailing address: without it, your packages end up at random post offices. With it, everything goes to the right place. Erica compares it to a shard key (a label that tells a database which server should store a particular piece of data) in a database (20:20).

Results: A coding customer increased their cache hit rate (the percentage of requests that hit the cache) from 60% to 87% just by adding this parameter (16:23).

Strategies for choosing a key:

  • Per user: when the same user works on the same codebase across multiple conversations
  • Per conversation: when users have completely unrelated tasks
  • Grouped: combine multiple users under one key to use server capacity more effectively (max 15 requests/min per server)
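In practice this is a single extra parameter on the request. The sketch below assembles the payload for the per-user strategy; the key scheme and field values are illustrative, and with the official openai Python SDK the dict would be passed as client.responses.create(**request):

```python
def build_request(user_id: str, system_prompt: str, user_message: str) -> dict:
    """Keyword arguments for a Responses API call using the per-user
    prompt_cache_key strategy described above. Illustrative values."""
    return {
        "model": "gpt-5",
        "instructions": system_prompt,          # identical prefix across requests
        "input": user_message,
        "prompt_cache_key": f"user-{user_id}",  # groups this user's requests together
    }

req = build_request("42", "You are a styling assistant...", "What goes with a navy blazer?")
assert req["prompt_cache_key"] == "user-42"
```

Switching the key scheme (per conversation, grouped) only changes how the f-string is built; the rest of the request stays identical.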

2. Be mindful of context engineering

Context engineering is about controlling what the model sees: removing irrelevant information, summarizing long conversations, etc. But here's a dilemma: context engineering changes the content, while caching requires that the content stays identical. Erica calls them "inherently at odds" (22:00).

Two approaches:

  • Trimming: remove older conversation history entirely. Make large cuts infrequently rather than small cuts on every call. Think of it as cleaning your desk. It's better to do a proper cleanup once in a while than to move one thing every time
  • Summarization (compaction): instead of deleting old context, use OpenAI's new compaction endpoint (responses/compact) to create a summary that replaces the full conversations

For the realtime API (voice) with its 32K token limit, Erica showed that a retention_ratio of 0.7 (keep 70% of the context, delete 30% when needed) can save 70% on 30-minute sessions (25:36), compared to trimming a little on every call.
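The trimming advice (large, infrequent cuts) can be sketched as a helper that leaves the history alone until it is over budget, then cuts deep in one go so the remaining prefix stays stable across many calls. Token counts are faked with word counts here for illustration:

```python
def trim_history(messages: list[str], budget: int, headroom: float = 0.3) -> list[str]:
    """Trim only when the history exceeds the budget, then cut past
    budget * (1 - headroom) so trims are rare and the surviving prefix
    stays stable for many calls. 'Tokens' are faked as word counts."""
    size = lambda msgs: sum(len(m.split()) for m in msgs)
    if size(messages) <= budget:
        return messages  # untouched, so the cached prefix is fully preserved
    target = int(budget * (1 - headroom))
    trimmed = list(messages)
    while len(trimmed) > 1 and size(trimmed) > target:
        trimmed.pop(0)  # drop the oldest message first
    return trimmed

history = [f"message {i} padded with a few words" for i in range(100)]  # 7 words each
assert trim_history(history, budget=1000) is history  # under budget: no change at all
assert len(trim_history(history, budget=300)) == 30   # one deep cut, then stability
```

A trimmer that shaved a little on every call would change the prefix every time and break the cache on every request; this one breaks it once per deep cut.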

3. Use Responses API with reasoning models

Reasoning models (like o3, o4-mini) think step-by-step internally before giving an answer. They generate hidden "thinking tokens" (the model's internal reasoning that the user never sees). If you use the older Chat Completions API, these thinking tokens are discarded between conversation turns. This means the model "forgets" its own reasoning, and the cache breaks.

With OpenAI's Responses API, thinking tokens are preserved. Erica showed that simply switching APIs can increase cache hit rate from 40% to 80%, plus smarter answers because the model retains its reasoning history (28:49).

4. Consider flex processing over Batch API

If you're processing large amounts of data (e.g., images overnight), you might already use Batch API (sending many requests at once instead of one by one) for a 50% discount. Flex processing gives you the same discount, but with more control (29:16):

  • Set service_tier: "flex" on individual requests instead of sending everything as one batch
  • Combine it with extended prompt caching and prompt cache key, something Batch API doesn't support

Erica tested with 10,000 requests and found 8.5% higher cache hit rate and 23% lower cost on input tokens with flex (30:23).
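Switching a bulk job from Batch API to flex is mostly a matter of setting service_tier on each request. A sketch of one per-request payload (field values are illustrative; with the openai SDK the dict would go to client.responses.create(**...)):

```python
def build_flex_request(image_url: str, shared_prompt: str) -> dict:
    """One request in a bulk job using flex processing instead of Batch API:
    the same 50% discount, but combinable with caching and a cache key."""
    return {
        "model": "gpt-5",
        "instructions": shared_prompt,            # identical prefix -> cacheable
        "input": f"Describe the product photo at {image_url}",
        "service_tier": "flex",                   # flex pricing on this single request
        "prompt_cache_key": "nightly-image-job",  # keep the whole job on warm servers
    }

req = build_flex_request("https://example.com/a.jpg", "You label product photos.")
assert req["service_tier"] == "flex"
```

Because each request is sent individually, the shared instructions get cached across the job, which Batch API cannot exploit in the same way.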

5. Use allowed_tools instead of changing the toolset

When you use OpenAI's API with tools (function calling), the tool definitions are part of the prefix that gets cached. If you remove or add a tool, the cache breaks (30:44).

The solution: define all your tools once, and use the allowed_tools parameter to control which tools the model can actually use on a given request. The tool definitions remain identical (cache is preserved), but the model only sees the tools you allow.
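A sketch of the pattern: the tools list stays byte-identical on every request, and only the allow-list varies. The tool_choice shape below follows the allowed_tools option as described in the talk and should be checked against current API docs; tool names and schemas are made up for illustration.

```python
def tool(name: str) -> dict:
    # minimal function-tool stub; real definitions carry full JSON-schema parameters
    return {"type": "function", "name": name,
            "parameters": {"type": "object", "properties": {}}}

ALL_TOOLS = [tool("search_products"), tool("check_inventory"), tool("issue_refund")]

def build_request(user_message: str, allowed: list[str]) -> dict:
    """Tool definitions never change between requests (cache preserved);
    only the allow-list in tool_choice varies per request."""
    return {
        "model": "gpt-5",
        "input": user_message,
        "tools": ALL_TOOLS,  # never add or remove entries between requests
        "tool_choice": {
            "type": "allowed_tools",
            "mode": "auto",
            "tools": [{"type": "function", "name": n} for n in allowed],
        },
    }

req = build_request("Where is my order?", allowed=["search_products"])
assert req["tools"] is ALL_TOOLS  # definitions unchanged -> prefix unchanged
```

Removing a tool from ALL_TOOLS would change the cached prefix for every user; shrinking the allow-list does not.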


Customer case: Warp

Siraj, technical lead at Warp (a developer platform where AI agents write and debug code for 700,000+ developers), shares how they think about prompt caching in practice (35:15).

How a code agent works

Siraj explains that AI agents in Warp work in loops (37:05): the user gives a task (e.g., "fix this compilation error"), the agent reads files, runs commands, thinks about the result, and repeats, step by step. With each step, the prompt grows, but most of the content is identical from the previous step. This makes code agents a perfect use case for prompt caching. Tools like Claude Code, where sub-agents (smaller AI helpers launched by a main agent) run in parallel loops, multiply this effect further.

Three levels of caching

Warp structures caching in three layers, from broad to narrow (41:30):

  1. Global level: System prompt (the instructions a developer gives to the AI, like "you are a helpful coding assistant") and tool definitions are identical for all users. This gives ~15,000 cached tokens already on the very first request (42:00), because thousands of users send the same thing.
  2. User level: The user's personal setup (rules, MCP servers for connecting to external tools, codebases) is placed in a separate message after the system prompt. When a user runs multiple agents simultaneously, they share this cache.
  3. Task level: Within a single task, the prompt grows step by step. Most content is identical from step to step, so cache reuse is highest here.

The key: Warp removed all dynamic information from the system prompt and placed it in a separate message afterwards. This way, all users share the global cache, while each user gets their own cache at level two.
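The layering can be sketched as a message-list builder. Names, roles, and structure are illustrative, not Warp's actual code:

```python
GLOBAL_SYSTEM_PROMPT = "You are a coding agent. Use the provided tools."  # shared by all users

def build_messages(user_setup: str, task_history: list[str]) -> list[dict]:
    """Level 1: global system prompt (cache shared across all users).
    Level 2: the user's own setup, in a separate message right after it.
    Level 3: the growing task history, appended in order."""
    msgs = [{"role": "system", "content": GLOBAL_SYSTEM_PROMPT}]
    msgs.append({"role": "user", "content": f"User setup:\n{user_setup}"})
    msgs += [{"role": "user", "content": step} for step in task_history]
    return msgs

turn1 = build_messages("rules, MCP servers, codebase paths", ["fix this compile error"])
turn2 = build_messages("rules, MCP servers, codebase paths",
                       ["fix this compile error", "tool output: error in main.rs:14"])
assert turn2[:len(turn1)] == turn1  # each turn only extends the previous -> cacheable prefix
```

Because the dynamic user setup lives in its own message, the first message is identical for everyone (global cache), and everything before the newest step is identical between turns (task cache).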

Golden rule: Never modify history

Siraj shares an important lesson (40:32): if a user changes their mind mid-conversation, it's tempting to go back and modify the original message. But that destroys the cache for everything that came after it. Warp's solution: add a new message at the end saying "the user has changed plans". This preserves the entire existing cache.
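The golden rule as code: append a correction instead of editing in place, so every earlier message remains a valid cache prefix (a minimal sketch; names are illustrative):

```python
def user_changed_mind(history: list[dict], new_plan: str) -> list[dict]:
    """Cache-friendly course correction: never edit earlier messages,
    just append the change of plan at the end."""
    return history + [{"role": "user",
                       "content": f"The user has changed plans: {new_plan}"}]

history = [
    {"role": "user", "content": "Rename this function everywhere"},
    {"role": "assistant", "content": "Done, updated 3 files."},
]
updated = user_changed_mind(history, "actually, revert the rename")
assert updated[:len(history)] == history  # old prefix untouched -> cache preserved
```

Editing the first message instead would invalidate the cache for every message after it, forcing the model to reprocess the whole conversation.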

Results

By introducing a task-scoped prompt cache key, Warp doubled their cache hit rate (43:41). Siraj emphasizes the balance: the key should be stable enough to give a high hit rate, but not so narrow that it spreads requests across too many servers.


No trade-offs, seriously

Erica is crystal clear: there is no quality difference between responses with and without caching. Given identical input, the mathematical representations in the cache will be exactly the same as if the model computed them from scratch. The responses are identical (52:09).

The only trade-off is architectural: if you design your entire system solely to maximize caching, you might end up giving the model too much irrelevant context (because you don't dare change it). It's about finding the balance between good context management and high cache utilization.


Checklist: Why am I not hitting the cache?

Erica wrapped up with a troubleshooting list (33:49). Experiencing lower cache hit rate than expected? Go through this:

  • Are you changing the content? Even an extra space or a timestamp in the prompt breaks the cache
  • Are you changing the toolset? Use allowed_tools instead (see tip 5)
  • Has too much time passed? Standard cache lasts 5-10 minutes. Use extended prompt caching for 24 hours
  • Too many requests? High volume spreads traffic across multiple servers. Use prompt cache key to control routing
  • Prompt too short? Prompts under 1,024 tokens are not cached. Erica showed that even with just 50% cache hit rate, you save 33% on token costs by extending your prompt past the 1,024 threshold (51:01)
  • Wrong API? Chat Completions with reasoning models gives lower cache hit. Switch to Responses API
  • Batch API with older models? Pre-GPT-5 models don't support caching in Batch API. Try flex processing
Remember: The theoretical maximum for cache hit rate is always higher than what you see in practice. Engine health, load balancing, and other factors mean some requests will always "miss", and that's completely normal.

Practical implications

Solo developers

Start with implicit caching. You don't need to change your code at all. Just use GPT-5+ and you automatically get up to 90% discount on repeated content. If your prompts are under 1,024 tokens, consider adding a detailed system prompt to push past the threshold. Even 50% cache hit rate saves 33% on token costs.

Teams building production apps

Follow the Warp pattern: structure your prompts in three layers (global system prompt, user context, task context) and use prompt_cache_key to route related requests to the same server. Move all dynamic information out of the system prompt and into a separate message. This single change can double your cache hit rate.

Cost-conscious projects

Combine prompt caching with flex processing for a 50% discount plus cache benefits. If you're currently using Batch API, test whether flex processing gives better total cost. Erica's benchmark showed 23% lower input token cost with flex compared to batch.

Test yourself

  1. Architecture trade-off: Context engineering and caching are "inherently at odds." How would you design a system that balances both, and when would you prioritize one over the other?
  2. Cost model: With a 90% discount on cached tokens in GPT-5+, at what point does it become cheaper to send too much context rather than invest in smart trimming?
  3. Server routing: Prompt cache key controls which server a request lands on. What can go wrong if the key is too broad (too many users per key) vs. too narrow?
  4. Developer behavior: Erica says caching doesn't affect quality. But can it affect developer behavior, e.g., avoiding prompt improvements because changes break the cache?
  5. Transfer learning: Warp structures caching in three levels (global, user, task). How would this strategy look for a completely different application, e.g., a medical advisory service?

Glossary

  • API (Application Programming Interface): A way for programs to talk to each other. When you build a chatbot using GPT-5, your code sends requests to OpenAI's API, like a waiter taking your order and bringing food from the kitchen.
  • Attention: The mechanism that lets an AI model figure out which words are relevant to each other. Much like when you read "bank" and use the rest of the sentence to understand whether it's about money or a riverbank.
  • Batch API: A way to send many requests to OpenAI at once (in a "batch"). Cheaper, but you don't get answers immediately. You submit the job and collect results later.
  • Cache / Caching: Storing the result of work so you don't have to redo it. Your browser does this with web pages (that's why they load faster the second time), and AI models do it with their computations.
  • Cache hit rate: The percentage of your requests that hit the cache, i.e., how often the model can reuse previous work. An 80% cache hit rate means 80% of your input tokens are free or heavily discounted.
  • Chain-of-thought: When a reasoning model (like o3) thinks step-by-step internally before answering. These "thinking steps" are usually hidden from the user.
  • Chat Completions API: OpenAI's older API for talking to AI models. Works fine, but for reasoning models you lose thinking tokens between conversation turns, which breaks the cache.
  • Compaction: Compressing a long conversation history into a shorter summary. OpenAI offers this as a dedicated service (responses/compact). Saves tokens, but breaks the cache for what was summarized.
  • Context window: The model's "working memory", meaning everything it can see at once. GPT-5 has a window of up to 1 million tokens. When the window is full, older content must be removed.
  • Engine: A single GPU (graphics card) that processes AI requests at OpenAI. Each engine can handle about 15 requests per minute.
  • Extended prompt caching: An OpenAI parameter that extends the cache from 5-10 minutes to 24 hours. The cache is moved from RAM to local GPU storage.
  • Flex processing: An alternative to Batch API where you send requests one by one (with service_tier: "flex"), but get the same 50% discount. The advantage is that you can combine it with caching and prompt cache key.
  • Function calling: When the AI model can "use tools", e.g., look up weather, search a database, or send an email. You define the tools as functions in the API call.
  • Hash: A digital fingerprint, a short code representing a larger piece of data. OpenAI hashes the first 256 tokens to quickly find the right server. Even a small change in input produces a completely different hash.
  • KV cache: Short for Key-Value cache. Stores the finished mathematical representations (key and value tensors) from the attention process, so they can be reused.
  • Latency: The wait time from when you send a request to when you get a response. Low latency = fast response. Often measured as time-to-first-token (TTFT), the time until the first word appears.
  • Prefix: Everything you send to the model that's identical from one request to the next: system prompt, tool definitions, previous messages. This is the part caching reuses.
  • Prompt: Everything you send to the AI model in a request: instructions, context, your question, images, etc. The system prompt is the part the developer writes (e.g., "you are a helpful assistant"), while the user prompt is what the end user writes.
  • Prompt cache key: An optional parameter you send with your request to help OpenAI route it to the right server. Works like an address label saying "this request belongs with the others from the same conversation/user".
  • Reasoning model: AI models (like o3, o4-mini) that think step-by-step before answering, instead of generating the response directly. Gives better answers on complex tasks, but uses more tokens.
  • Responses API: OpenAI's newer API that replaces Chat Completions. Preserves reasoning tokens between conversation turns, which gives better caching and smarter answers.
  • Token: The smallest unit an AI model works with. Roughly 3-4 characters or 3/4 of an English word. "Hello world" is two tokens. Pricing and context windows are measured in tokens.
  • Truncation: Cutting away older parts of the conversation history to stay within the model's context window. Like tearing out the first pages of a notebook when it's full.

Sources and resources