VRAMfit guide · updated 2026-06-27

KV cache: how context length eats your VRAM

In short: the KV cache is the model's working memory of your conversation - every token's keys and values, kept in VRAM so the model never re-reads the prompt. It grows roughly linearly with context length, which is why a long context can add gigabytes on top of the weights.

Why does it exist?

Transformers attend over every previous token when generating the next one. Recomputing every earlier token's keys and values at each step would make the work grow quadratically with length, so inference engines cache each layer's key/value tensors and compute only the new token's. The price is memory: more context, more cache.
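To make the "quadratic versus cached" point concrete, here is a toy count of how many token-level key/value computations one layer performs while generating a sequence. It is a back-of-the-envelope sketch, not a profile of any real engine, and both function names are made up for illustration.

```python
def kv_computations_without_cache(n_tokens: int) -> int:
    """Token-level K/V computations if every step re-processes the whole prefix."""
    # Step t has to rebuild keys and values for all t tokens seen so far.
    return sum(range(1, n_tokens + 1))


def kv_computations_with_cache(n_tokens: int) -> int:
    """Token-level K/V computations when earlier keys and values are kept in memory."""
    # Each token's keys and values are computed once and then reused.
    return n_tokens


for n in (1_000, 8_000):
    print(n, kv_computations_without_cache(n), kv_computations_with_cache(n))
```

The cached count grows in step with the number of tokens, while the uncached count grows with its square; the cache trades that compute for VRAM.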

KV cache growth, computed

Llama 3.1 8B and 70B at Q4_K_M (KV cache component and total VRAM):

Context          8B KV cache   8B total   70B KV cache   70B total
2,048 tokens     0.3 GB        5.6 GB     0.3 GB         40.5 GB
8,192 tokens     1.1 GB        6.4 GB     1.1 GB         41.2 GB
16,384 tokens    2.1 GB        7.5 GB     2.1 GB         42.3 GB
32,768 tokens    4.3 GB        9.6 GB     4.3 GB         44.5 GB
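The 8B KV cache column can be approximated with the usual sizing formula: two tensors (keys and values) per layer, times KV heads, times head dimension, times bytes per value, times context length. The sketch below assumes Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128), a 16-bit cache, and decimal gigabytes; those are assumptions about how the table was computed, not something this guide states.

```python
def kv_cache_bytes(
    context_tokens: int,
    n_layers: int = 32,        # assumed: Llama 3.1 8B
    n_kv_heads: int = 8,       # assumed: grouped-query attention with 8 KV heads
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # assumed: 16-bit cache entries
) -> int:
    """Approximate KV cache size in bytes for a single sequence."""
    # The leading 2 covers one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


for ctx in (2_048, 8_192, 16_384, 32_768):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 1e9:.1f} GB")
```

Under these assumptions each token costs 128 KiB, and the printed values round to the 0.3, 1.1, 2.1, and 4.3 GB shown in the 8B column.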

What you can do about it

Pick the context you actually need (the board's slider shows the cost live). Prefer models with grouped-query attention, which keeps fewer KV heads and is used by most modern models; models with many KV heads pay more per token of context. Consider KV-cache quantization where your runtime supports it.
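As a rough illustration of the last two levers, the sketch below compares the per-token cache cost for a hypothetical 32-layer model with head dimension 128 under three setups: 32 KV heads at 16 bits, 8 KV heads at 16 bits (grouped-query attention), and 8 KV heads at 8 bits. The model shape and the function name are illustrative assumptions, not measurements from any particular runtime.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache added by each token of context."""
    # Keys and values (the factor 2) are stored for every layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


LAYERS, HEAD_DIM = 32, 128  # hypothetical model shape

many_heads = kv_bytes_per_token(LAYERS, 32, HEAD_DIM, 2)  # one KV head per query head
gqa = kv_bytes_per_token(LAYERS, 8, HEAD_DIM, 2)          # grouped-query attention
gqa_8bit = kv_bytes_per_token(LAYERS, 8, HEAD_DIM, 1)     # plus 8-bit cache entries

for label, size in (("32 KV heads", many_heads), ("8 KV heads", gqa), ("8 KV heads, 8-bit", gqa_8bit)):
    print(f"{label}: {size // 1024} KiB per token")
```

In this idealized arithmetic, cutting KV heads from 32 to 8 divides the cost by four and 8-bit entries halve it again. Real quantized cache formats usually carry some scaling metadata, so the saving in practice tends to be a little under half.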

Frequently asked questions

Why does my model fit at 4K context but not at 32K?

The weights take the same space at any context length, but the KV cache scales with context, so a setup that needs 28 GB at 4K can need several GB more at 32K and exceed your card's VRAM.
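The same arithmetic can be turned around to ask how much context a card has room for. The sketch below is a simplified fit check under stated assumptions: a hypothetical 8 GB card, about 5.3 GB of fixed cost (roughly what the 8B rows above imply once the KV cache is subtracted from the total), and 131,072 bytes of cache per token (an assumed figure for a 16-bit cache with 32 layers, 8 KV heads, and head dimension 128). It ignores runtime overhead that varies between engines.

```python
def max_context_tokens(vram_gb: float, fixed_gb: float, kv_bytes_per_token: int) -> int:
    """Largest context whose KV cache fits in the VRAM left after fixed costs."""
    free_bytes = (vram_gb - fixed_gb) * 1e9  # decimal gigabytes
    if free_bytes <= 0:
        return 0
    return int(free_bytes // kv_bytes_per_token)


def fits(vram_gb: float, fixed_gb: float, kv_bytes_per_token: int, context_tokens: int) -> bool:
    """True if the requested context stays within the available VRAM."""
    return context_tokens <= max_context_tokens(vram_gb, fixed_gb, kv_bytes_per_token)


CARD_GB = 8.0            # hypothetical card
FIXED_GB = 5.3           # assumed: weights plus other fixed costs
PER_TOKEN = 131_072      # assumed bytes of KV cache per token

print("fits at 4K: ", fits(CARD_GB, FIXED_GB, PER_TOKEN, 4_096))
print("fits at 32K:", fits(CARD_GB, FIXED_GB, PER_TOKEN, 32_768))
print("max context:", max_context_tokens(CARD_GB, FIXED_GB, PER_TOKEN))
```

With these numbers the 4K context fits and the 32K context does not, which is the situation the question describes: nothing about the weights changed, only the cache.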

Is the KV cache affected by quantization?

Weight quantization does not shrink it: the cache is stored separately from the weights. Some runtimes offer KV-cache quantization (e.g. 8-bit KV), which roughly halves the cache relative to a 16-bit one at a minor quality cost.

Tool: Check your own GPU on the fit board