KV cache: how context length eats your VRAM

In short: the KV cache is the model's working memory of your conversation - the keys and values of every token, kept in VRAM so the model never has to recompute them for earlier tokens. It grows roughly linearly with context length, which is why a long context can add gigabytes on top of the weights.

Why does it exist?

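Transformers attend over every previous token when generating the next one. Without a cache, every new token would mean recomputing keys and values for the whole prefix, so the cost of generation would grow at least quadratically with length. Inference engines therefore cache each layer's key/value tensors and only compute the new token's entries. The price is memory: more context, more cache.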
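To make the trade concrete, here is a minimal toy sketch of single-head attention decoding with and without a cache. Everything in it is illustrative: the `project` function is a hypothetical stand-in for a learned projection, and the token vector doubles as the query. A real engine does this per layer and per head on GPU tensors.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def project(x, factor):
    """Hypothetical stand-in for a learned key/value projection (toy only)."""
    return [factor * xi for xi in x]

def decode_with_cache(tokens):
    """Append each token's key/value once, then reuse them at every later step."""
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        outputs.append(attend(x, k_cache, v_cache))
    return outputs, len(k_cache)

def decode_without_cache(tokens):
    """Same outputs, but keys/values for the whole prefix are rebuilt each step."""
    outputs = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        outputs.append(attend(tokens[t], keys, values))
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cached, cache_len = decode_with_cache(tokens)
uncached = decode_without_cache(tokens)
```

Both paths produce the same outputs. The cached path holds one key and one value entry per token seen so far, which is the memory that grows with context.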

KV cache growth, computed

Llama 3.1 8B and 70B at Q4_K_M (KV cache component and total VRAM):

Context	8B KV cache	8B total	70B KV cache	70B total
2,048 tokens	0.3 GB	5.6 GB	0.3 GB	40.5 GB
8,192 tokens	1.1 GB	6.4 GB	1.1 GB	41.2 GB
16,384 tokens	2.1 GB	7.5 GB	2.1 GB	42.3 GB
32,768 tokens	4.3 GB	9.6 GB	4.3 GB	44.5 GB
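
The 8B KV cache column can be reproduced with a short calculation. This is a sketch under stated assumptions, not necessarily how the board computes it: the layer count, KV-head count and head dimension below are assumed values for Llama 3.1 8B (the article does not state them), the cache is assumed to hold 16-bit values, and 1 GB is taken as 10^9 bytes.

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens

# Assumed architecture for Llama 3.1 8B (not given in the article).
LLAMA_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

def kv_cache_gb(context_tokens, **shape):
    """Same quantity in GB, using 1 GB = 10^9 bytes."""
    return kv_cache_bytes(context_tokens, **shape) / 1e9

for ctx in (2048, 8192, 16384, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_gb(ctx, **LLAMA_8B):.1f} GB")
```

Under these assumptions each token costs 131,072 bytes of cache, and the loop prints 0.3, 1.1, 2.1 and 4.3 GB, matching the 8B column. Doubling the context doubles the cache.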

What you can do about it

Pick the context you actually need (the board's slider shows the cost live), prefer models with grouped-query attention (fewer KV heads, which most modern models have), and consider KV-cache quantization where your runtime supports it. Models with many KV heads pay more per token of context.
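
A rough comparison shows how much the last two levers matter. The model shape here is hypothetical (32 layers, head dimension 128), chosen only to illustrate the ratios. The 8-bit figure also ignores any scaling metadata a real quantized cache format may store, so actual savings can be slightly smaller.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Cache cost of one token: keys plus values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical 32-layer model, head_dim 128; only the KV settings differ.
many_kv_heads = kv_bytes_per_token(32, 32, 128, 2)  # 32 KV heads, 16-bit cache
gqa = kv_bytes_per_token(32, 8, 128, 2)             # 8 KV heads, 16-bit cache
gqa_8bit = kv_bytes_per_token(32, 8, 128, 1)        # 8 KV heads, 8-bit cache

for label, cost in [("32 KV heads, 16-bit", many_kv_heads),
                    ("8 KV heads, 16-bit", gqa),
                    ("8 KV heads, 8-bit", gqa_8bit)]:
    print(f"{label}: {cost // 1024} KiB per token")
```

In this sketch, cutting KV heads from 32 to 8 reduces the per-token cost fourfold, and an 8-bit cache halves it again.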

Frequently asked questions

Why does my model fit at 4K context but not at 32K?

The weights take the same space at any context length, but the KV cache scales with context. A setup that needs 28 GB at 4K can need several GB more at 32K, and that can push it past your card's limit.
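
The same reasoning can be turned around to estimate the longest context that fits a given card. The sketch below takes its inputs from the 8B rows of the table above: about 5.3 GB of fixed cost (total minus KV cache) and 4.3 GB of cache at 32,768 tokens. The 8 GB and 6 GB cards are hypothetical examples, and real runtimes need some extra headroom that is not modeled here.

```python
def max_context_tokens(vram_gb, fixed_gb, kv_bytes_per_token):
    """Largest token count whose KV cache fits in the VRAM left after fixed costs."""
    free_bytes = (vram_gb - fixed_gb) * 1e9
    if free_bytes <= 0:
        return 0
    return int(free_bytes // kv_bytes_per_token)

# Derived from the 8B rows of the table (1 GB = 10^9 bytes).
FIXED_GB = 5.3                         # e.g. 5.6 GB total - 0.3 GB KV cache
KV_BYTES_PER_TOKEN = 4.3e9 / 32768     # roughly 131 KB per token

for card_gb in (6, 8):
    limit = max_context_tokens(card_gb, FIXED_GB, KV_BYTES_PER_TOKEN)
    print(f"{card_gb} GB card: about {limit} tokens of context")
```

With these numbers the hypothetical 8 GB card handles 16K of context but not 32K, which is the pattern the question describes.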

Is the KV cache affected by quantization?

Weight quantization does not shrink it: the cache is stored separately from the weights and has its own precision. Some runtimes offer KV-cache quantization (e.g. 8-bit KV), which roughly halves the cache compared with 16-bit at a minor quality cost.
