Why are cached input tokens cheaper with AI services?

761 words, 3 minutes to read

TL;DR: the GPU doesn't have to math as hard

When you see AI model pricing pages, you usually see things broken down like this:

| Model | Context Length | Max CoT Tokens | Max Output Tokens | Input Price (Cache Hit) | Input Price (Cache Miss) | Output Price |
|---|---|---|---|---|---|---|
| deepseek-chat | 64K | - | 8K | $0.07 / 1M tokens | $0.27 / 1M tokens | $1.10 / 1M tokens |
| deepseek-reasoner | 64K | 32K | 8K | $0.14 / 1M tokens | $0.55 / 1M tokens | $2.19 / 1M tokens |

Source: DeepSeek API Docs

If you manage to have most of your input tokens be cached, you save a huge amount: in this case $0.20 per million input tokens on deepseek-chat ($0.27 down to $0.07, about 74% off). What does this mean though? What does caching do that makes you save so much, in some cases upwards of tens of kilodollars?

"Someone explain the cached vs not thing to me for how this is $10,000 worth of savings lol"

- Chimney Sweepers Local 420 FKA yburyug (@bobbby.online), June 12, 2026 at 12:39 AM
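To put rough numbers on the pricing table above, here is a small back-of-the-envelope sketch. The two prices come from the deepseek-chat row; the monthly token volume and the 90% hit rate are made-up assumptions for illustration, not anyone's real bill.

```python
# Prices in USD per 1M input tokens, copied from the deepseek-chat row above.
CACHE_HIT_PRICE = 0.07
CACHE_MISS_PRICE = 0.27


def input_cost(total_tokens: int, hit_rate: float) -> float:
    """Estimated input cost in USD for a cache hit rate between 0.0 and 1.0."""
    hit_tokens = total_tokens * hit_rate
    miss_tokens = total_tokens - hit_tokens
    return (hit_tokens * CACHE_HIT_PRICE + miss_tokens * CACHE_MISS_PRICE) / 1_000_000


# Hypothetical workload: 50 billion input tokens in a month.
tokens = 50_000_000_000
uncached = input_cost(tokens, hit_rate=0.0)
mostly_cached = input_cost(tokens, hit_rate=0.9)

print(f"no cache hits:  ${uncached:,.2f}")
print(f"90% cache hits: ${mostly_cached:,.2f}")
print(f"saved:          ${uncached - mostly_cached:,.2f}")
```

Under those assumptions the difference lands at roughly nine thousand dollars, which is the neighborhood the post quoted above is asking about.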

Warning

I'm gonna be totally honest, I barely understand the basic outline of the math involved here. Where possible I aim to not be completely wrong here, but I'm not going to emit something 1:1 accurate with the mathematical truth of large language models' inner workings. Bear with me.

When you call a large language model service, the request looks something like this:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ]
}'

That messages element is the key bit. As the conversation goes on, every message (the initial system prompt, the initial user request, AI responses, and any tool use requests/responses) gets appended to that array, so it grows bigger and bigger with every turn.

A good way to think about this is that sending a conversation to a large language model is like having a pair of people share a roll of paper on two different typewriters. Every time you finish your message, you send the roll of paper back to the AI model and it has to re-read through the entire conversation in order to start typing on the end with its response. As the conversation gets longer, this gets more and more expensive because the model has to recalculate its internal state all over again for every additional message.
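Here is a rough sketch of how that re-reading adds up. The function name and the 500-tokens-per-turn figure are made up for illustration; real conversations are lumpier than this.

```python
def tokens_read_without_cache(new_tokens_per_turn: list[int]) -> list[int]:
    """For each turn, how many tokens the model has to read from the top."""
    reads = []
    conversation_length = 0
    for new_tokens in new_tokens_per_turn:
        # The roll of paper gets longer...
        conversation_length += new_tokens
        # ...and the whole roll is read again on every turn.
        reads.append(conversation_length)
    return reads


# Hypothetical conversation: ten turns that each add 500 tokens.
per_turn = [500] * 10
reads = tokens_read_without_cache(per_turn)

print("tokens read per turn:", reads)
print("tokens added in total:", sum(per_turn))
print("tokens read in total: ", sum(reads))
```

The conversation only ever contains 5,000 tokens, but the model ends up reading 27,500 of them across the ten turns, and that gap keeps widening as the conversation grows.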

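However, large language model inference is complicated but deterministic: given the same model and the same input tokens, the math comes out to the same intermediate state every time (any randomness comes later, when the next token gets picked from the model's output). On top of that, the state for each token only depends on the tokens that came before it, not on anything after it. This means that you can use a technique called key-value caching (KV caching) to save that intermediate state and reuse it next time. Most of the time this cache is a prefix cache, because that lets you add more messages to the end of the request and still reuse everything that came before them.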

Imagine something like this:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    },
    {
      "role": "assistant",
      "content": "The sky is blue because of a phenomenon..."
    },
    {
      "role": "user",
      "content": "But I am looking outside right now and it is orange!"
    }
  ]
}'

If the model has already processed the question about the sky being blue and generated the response about Rayleigh scattering, it doesn't need to process both of those messages again to answer the user's question about sunsets. In production AI model deployments you would put that generated intermediate state into the KV cache so that the model doesn't need to do the same work twice for the same data. This saves time and effort on the side of the AI model provider, and model providers currently choose to pass those savings on to API users in the form of cheaper prices for cached input tokens.
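A toy version of that idea, using the same sky conversation. To be loud about the assumptions: ToyPrefixCache is a made-up class, the "state" is just a SHA-256 hash standing in for the real key/value tensors, and it caches whole messages where real servers work on tokens. It only shows the shape of the trick.

```python
import hashlib


class ToyPrefixCache:
    """Toy stand-in for a KV cache: maps a message prefix to a fake state."""

    def __init__(self):
        self.states = {}    # tuple of (role, content) pairs -> state string
        self.processed = 0  # how many messages actually had to be computed

    def _step(self, state, item):
        # Pretend this is the expensive GPU math for one more message.
        self.processed += 1
        role, content = item
        data = state + "\x00" + role + "\x00" + content
        return hashlib.sha256(data.encode()).hexdigest()

    def run(self, messages):
        items = [(m["role"], m["content"]) for m in messages]
        # Find the longest prefix that has already been computed.
        start, state = 0, ""
        for n in range(len(items), 0, -1):
            cached = self.states.get(tuple(items[:n]))
            if cached is not None:
                start, state = n, cached
                break
        # Only compute what comes after the cached prefix, saving as we go.
        for n in range(start, len(items)):
            state = self._step(state, items[n])
            self.states[tuple(items[:n + 1])] = state
        return state


turn_1 = [{"role": "user", "content": "why is the sky blue?"}]
turn_2 = turn_1 + [
    {"role": "assistant", "content": "The sky is blue because of a phenomenon..."},
    {"role": "user", "content": "But I am looking outside right now and it is orange!"},
]

cache = ToyPrefixCache()
cache.run(turn_1)
after_first = cache.processed
cached_state = cache.run(turn_2)
after_second = cache.processed

fresh = ToyPrefixCache()
fresh_state = fresh.run(turn_2)

print("work done for the second request:", after_second - after_first, "messages")
print("work done from a cold start:     ", fresh.processed, "messages")
print("same final state either way:     ", cached_state == fresh_state)
```

The second request only pays for the two new messages, and because every step is deterministic the final state matches what a cold start would have produced.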

As you develop an application with AI in it, try to avoid changing any inference settings or previous messages between prompts. This makes your application's queries much more likely to read from the cache, making it faster, reducing the environmental impact, and saving you(r users) money.
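One way to sanity-check this in your own code is to compare what you sent last time with what you are about to send. The sketch below is an assumption-heavy simplification: shared_prefix_len and build are hypothetical helpers, and the comparison is per message, whereas a real provider matches on tokens (so some of a changed message may still hit the cache). The timestamped system prompt is an invented example of an easy way to break the prefix.

```python
def shared_prefix_len(previous: list[dict], current: list[dict]) -> int:
    """Count how many leading messages are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


def build(system_prompt: str, history: list[dict]) -> list[dict]:
    """Put a system prompt in front of the conversation so far."""
    return [{"role": "system", "content": system_prompt}] + history


history_1 = [{"role": "user", "content": "why is the sky blue?"}]
history_2 = history_1 + [
    {"role": "assistant", "content": "The sky is blue because of a phenomenon..."},
    {"role": "user", "content": "But I am looking outside right now and it is orange!"},
]

# Same system prompt on both requests: the old request is a prefix of the new one.
stable_a = build("You are a helpful assistant.", history_1)
stable_b = build("You are a helpful assistant.", history_2)

# A timestamp in the system prompt changes the very first message every time.
stamped_a = build("You are a helpful assistant. Time: 12:00:01", history_1)
stamped_b = build("You are a helpful assistant. Time: 12:00:09", history_2)

print("stable prompt, reusable messages: ", shared_prefix_len(stable_a, stable_b))
print("stamped prompt, reusable messages:", shared_prefix_len(stamped_a, stamped_b))
```

With the stable prompt both earlier messages line up. With the timestamp, the very first message differs, so nothing after it can be reused even though the rest of the conversation is unchanged.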


