VRAMfit guide · updated 2026-06-27
How much VRAM does an LLM need?
In short: an LLM needs roughly parameters × bits-per-weight / 8 bytes of VRAM for its weights, plus a KV cache that grows with context length, plus 1-2 GB of runtime overhead. At the popular Q4_K_M quantization, an 8B model needs about 6 GB in total and a 70B model about 41 GB.
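The weights term of that rule can be checked in a few lines of Python. This is a minimal sketch, not VRAMfit's own code: the function name `weights_gb` is hypothetical, GB here means 10^9 bytes, and the 4.5 bits per weight used for Q4_K_M is an assumption inferred from the weight figures in the table on this page (Q4_K_M mixes block formats, so its effective bit width varies somewhat by model).

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    # parameters x bits-per-weight / 8 gives bytes, so a parameter count
    # expressed in billions maps directly to gigabytes.
    return params_billions * bits_per_weight / 8


# Assumption: average bits per weight for Q4_K_M, not an official constant.
Q4_K_M_BITS = 4.5

for label, params in [("8B", 8), ("70B", 70)]:
    print(f"{label}: {weights_gb(params, Q4_K_M_BITS):.1f} GB of weights")
```

Under those assumptions the sketch gives 4.5 GB and 39.4 GB of weights; the KV cache and runtime overhead account for the rest of the totals quoted above.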
What actually uses the memory?
Three things sit in VRAM when a model runs: the weights (the model itself, shrunk by quantization), the KV cache (the running memory of your conversation, which grows with context length), and runtime overhead (the inference engine, activations, and fragmentation). VRAMfit computes all three for every model in its catalog.
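As a rough illustration of how the three components add up, the sketch below sums them for one model. The `VramEstimate` class is hypothetical and is not VRAMfit's implementation. The weight and KV cache values are taken from the llama3.1:8b row of the table on this page, and the 0.8 GB overhead is simply the remainder implied by that row's total, an assumption rather than a documented constant.

```python
from dataclasses import dataclass


@dataclass
class VramEstimate:
    """Hypothetical container for the three VRAM components, in GB."""
    weights_gb: float
    kv_cache_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        # Total footprint is the plain sum of the three components.
        return self.weights_gb + self.kv_cache_gb + self.overhead_gb


# Illustrative values for an 8B model at Q4_K_M and an 8,192-token context.
est = VramEstimate(weights_gb=4.5, kv_cache_gb=1.1, overhead_gb=0.8)
print(f"total: {est.total_gb:.1f} GB")
```

Keeping the components separate makes it easy to see which one dominates: weights for short contexts, the KV cache as the context grows.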
How much VRAM do popular models need?
Computed with VRAMfit's fit math at Q4_K_M quantization and an 8,192-token context:
| Model | Params | Weights | KV cache | Total VRAM | Smallest retail GPU that fits comfortably |
|---|---|---|---|---|---|
| llama3.2:3b | 3B | 1.7 GB | 1.1 GB | 3.6 GB | NVIDIA GeForce RTX 2060 |
| llama3.1:8b | 8B | 4.5 GB | 1.1 GB | 6.4 GB | NVIDIA GeForce RTX 5060 |
| gemma3:27b | 27B | 15.2 GB | 1.1 GB | 17.1 GB | NVIDIA GeForce RTX 4090 |
| qwen3:32b | 32B | 18.0 GB | 2.1 GB | 20.9 GB | NVIDIA GeForce RTX 5090 |
| llama3.1:70b | 70B | 39.4 GB | 1.1 GB | 41.2 GB | NVIDIA RTX PRO 5000 Blackwell 72GB |
| gpt-oss:20b | 20B | 11.2 GB | 0.4 GB | 12.4 GB | NVIDIA GeForce RTX 5080 |
| gpt-oss:120b | 120B | 67.5 GB | 0.6 GB | 68.9 GB | NVIDIA RTX PRO 6000 Blackwell |
| deepseek-r1:671b | 671B | 377.4 GB | 14.3 GB | 392.6 GB | multi-GPU territory |
Numbers update as the catalog refreshes. Check any other model on the interactive fit board.
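One way to think about the "fits comfortably" column is as a headroom rule: pick the smallest card whose VRAM exceeds the total by some margin. The sketch below is a guess at such a rule, not VRAMfit's documented logic. The function name and the 85% fill threshold are assumptions; the threshold was chosen only because it is consistent with the rows above. The capacities are the standard VRAM sizes of the cards named in the table.

```python
from typing import Optional

# VRAM capacity in GB for the cards named in the table.
GPUS = {
    "NVIDIA GeForce RTX 2060": 6,
    "NVIDIA GeForce RTX 5060": 8,
    "NVIDIA GeForce RTX 5080": 16,
    "NVIDIA GeForce RTX 4090": 24,
    "NVIDIA GeForce RTX 5090": 32,
    "NVIDIA RTX PRO 5000 Blackwell 72GB": 72,
    "NVIDIA RTX PRO 6000 Blackwell": 96,
}

# Assumption: "comfortable" means the model fills at most 85% of the card.
MAX_FILL = 0.85


def smallest_comfortable_gpu(total_gb: float,
                             max_fill: float = MAX_FILL) -> Optional[str]:
    """Return the smallest listed GPU with enough headroom, or None."""
    candidates = [
        (vram, name) for name, vram in GPUS.items()
        if total_gb <= vram * max_fill
    ]
    # Tuples sort by VRAM first, so min() picks the smallest card that fits.
    return min(candidates)[1] if candidates else None


for model, total in [("llama3.1:8b", 6.4), ("qwen3:32b", 20.9)]:
    print(model, "->", smallest_comfortable_gpu(total))
```

With this threshold, a 20.9 GB model skips the 24 GB card because it would leave too little headroom, and a total larger than any single card returns `None`, the multi-GPU case.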
Frequently asked questions
Can I run a 70B model on a 24 GB GPU?
Not fully in VRAM at Q4_K_M: it needs about 41 GB, well over the card's 24 GB. You can run it with CPU offloading at reduced speed, use a lower-bit quantization, or split the model across two 24 GB cards (48 GB combined).
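To get a feel for how aggressive the quantization would have to be, the sketch below inverts the weights formula and asks how many bits per weight a given VRAM budget allows. The function name is hypothetical, and the KV cache (1.1 GB) and overhead (0.8 GB) figures are assumptions carried over from the 70B row of the table on this page.

```python
def max_bits_per_weight(params_billions: float, vram_gb: float,
                        kv_cache_gb: float, overhead_gb: float) -> float:
    """Largest average bits per weight that still fits in the given VRAM."""
    # Whatever is left after the KV cache and overhead is the weight budget.
    budget_gb = vram_gb - kv_cache_gb - overhead_gb
    # Invert weights_gb = params_billions * bits / 8.
    return budget_gb * 8 / params_billions


one_card = max_bits_per_weight(70, vram_gb=24, kv_cache_gb=1.1, overhead_gb=0.8)
two_cards = max_bits_per_weight(70, vram_gb=48, kv_cache_gb=1.1, overhead_gb=0.8)
print(f"one 24 GB card:  {one_card:.2f} bits per weight")
print(f"two 24 GB cards: {two_cards:.2f} bits per weight")
```

Under these assumptions a single 24 GB card leaves room for only about 2.5 bits per weight, which is why a 70B model needs a very low-bit quantization or offloading there, while 48 GB allows a bit over 5 bits per weight. Splitting across cards usually adds some per-card overhead that this sketch ignores.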
Does a bigger context window need more VRAM?
Yes. The KV cache grows roughly linearly with context length, so doubling the context roughly doubles that component of memory use.
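The linear growth follows from the commonly used KV cache estimate: one key and one value vector per layer, per KV head, per token. The sketch below applies that estimate to the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128) with a 16-bit cache. It is a general approximation and not necessarily the calculation VRAMfit uses; cache quantization or sliding-window attention would change the result.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes for a standard attention stack."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * bytes_per_value * context_tokens)


# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
for ctx in (8192, 16384):
    gb = kv_cache_bytes(32, 8, 128, ctx) / 1e9
    print(f"{ctx:>6} tokens: {gb:.2f} GB")
```

Because `context_tokens` is a plain multiplier, doubling the context from 8,192 to 16,384 tokens doubles the estimate, from about 1.07 GB to about 2.15 GB.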
Do MoE models save VRAM?
No. They save compute, not memory. All experts must sit in VRAM, so a mixture-of-experts model needs memory for its total parameter count, even though only the active experts run for each token.
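The split between memory cost and compute cost can be sketched as follows. The function name is hypothetical, the 4.5 bits per weight for Q4_K_M is an assumption inferred from the table on this page, and the 5B active-parameter figure is an illustrative placeholder rather than a measured value for any particular model.

```python
from typing import Tuple


def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float) -> Tuple[float, float]:
    """Return (weight memory in GB, fraction of weights used per token)."""
    # Memory is driven by the TOTAL parameter count: every expert is resident.
    weights_gb = total_params_b * bits_per_weight / 8
    # Compute per token is driven only by the ACTIVE parameters.
    active_fraction = active_params_b / total_params_b
    return weights_gb, active_fraction


# Illustrative: a 120B-parameter MoE with roughly 5B active parameters.
weights_gb, active_fraction = moe_footprint(120, 5, bits_per_weight=4.5)
print(f"weights in VRAM: {weights_gb:.1f} GB")
print(f"weights touched per token: {active_fraction:.1%}")
```

In this example the model occupies 67.5 GB of VRAM for weights, the same as a dense 120B model at that quantization, while each token touches only about 4% of them.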