How much VRAM does an LLM need?

In short: an LLM needs roughly parameters × bits-per-weight / 8 bytes of VRAM for its weights, plus a KV cache that grows with context length, plus 1-2 GB of runtime overhead. At the popular Q4_K_M quantization, an 8B model needs about 6 GB and a 70B model about 41 GB.

What actually uses the memory?

Three things sit in VRAM when a model runs: the weights (the model itself, shrunk by quantization), the KV cache (the running memory of your conversation, which grows with context length), and runtime overhead (the inference engine, activations, and fragmentation). VRAMfit computes all three for every model in its catalog.
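
The three components simply add up. Here is a minimal Python sketch of that sum, not VRAMfit's actual code. It assumes decimal gigabytes and takes the KV cache and overhead as inputs rather than deriving them. The function names are illustrative. The 4.5 bits-per-weight and 0.8 GB overhead figures are assumptions chosen to match the llama3.1:8b row in the table below (6.4 - 4.5 - 1.1 = 0.8).

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight memory in decimal GB: parameters x bits-per-weight / 8 bytes."""
    # (params_billions * 1e9 weights) * (bits / 8 bytes) / (1e9 bytes per GB)
    return params_billions * bits_per_weight / 8


def total_vram_gb(params_billions: float, bits_per_weight: float,
                  kv_cache_gb: float, overhead_gb: float) -> float:
    """Sum of the three components: weights + KV cache + runtime overhead."""
    return weights_gb(params_billions, bits_per_weight) + kv_cache_gb + overhead_gb


# Assumed inputs (not published constants): 4.5 bits/weight for Q4_K_M,
# 1.1 GB of KV cache and 0.8 GB of overhead, as implied by the 8B table row.
estimate = total_vram_gb(8, 4.5, kv_cache_gb=1.1, overhead_gb=0.8)
print(f"{estimate:.1f} GB")  # prints "6.4 GB"
```

Keeping the KV cache and overhead as explicit arguments makes it easy to see how much of the total comes from the weights alone.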

How much VRAM do popular models need?

Computed with VRAMfit's fit math at Q4_K_M quantization and an 8,192-token context:

Model	Params	Weights	KV cache	Total VRAM	Smallest retail GPU that fits comfortably
llama3.2:3b	3B	1.7 GB	1.1 GB	3.6 GB	NVIDIA GeForce RTX 2060
llama3.1:8b	8B	4.5 GB	1.1 GB	6.4 GB	NVIDIA GeForce RTX 5060
gemma3:27b	27B	15.2 GB	1.1 GB	17.1 GB	NVIDIA GeForce RTX 4090
qwen3:32b	32B	18.0 GB	2.1 GB	20.9 GB	NVIDIA GeForce RTX 5090
llama3.1:70b	70B	39.4 GB	1.1 GB	41.2 GB	NVIDIA RTX PRO 5000 Blackwell 72GB
gpt-oss:20b	20B	11.2 GB	0.4 GB	12.4 GB	NVIDIA GeForce RTX 5080
gpt-oss:120b	120B	67.5 GB	0.6 GB	68.9 GB	NVIDIA RTX PRO 6000 Blackwell
deepseek-r1:671b	671B	377.4 GB	14.3 GB	392.6 GB	multi-GPU territory

Numbers update as the catalog refreshes. Check any other model on the interactive fit board.
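
The Weights column can be reproduced from the parameter count alone. The sketch below assumes about 4.5 bits per weight for Q4_K_M and decimal gigabytes. The 4.5 figure is inferred from the table rather than stated by VRAMfit. Under that assumption, every computed figure lands within about 0.05 GB of the listed one.

```python
# (model, parameters in billions, listed weights in GB), copied from the table above
ROWS = [
    ("llama3.2:3b", 3, 1.7),
    ("llama3.1:8b", 8, 4.5),
    ("gemma3:27b", 27, 15.2),
    ("qwen3:32b", 32, 18.0),
    ("llama3.1:70b", 70, 39.4),
    ("gpt-oss:20b", 20, 11.2),
    ("gpt-oss:120b", 120, 67.5),
    ("deepseek-r1:671b", 671, 377.4),
]

BITS_PER_WEIGHT = 4.5  # assumption inferred from the table, not a published constant


def weights_gb(params_billions: float, bits_per_weight: float = BITS_PER_WEIGHT) -> float:
    """Weight memory in decimal GB: parameters x bits-per-weight / 8 bytes."""
    return params_billions * bits_per_weight / 8


for name, params_b, listed in ROWS:
    computed = weights_gb(params_b)
    print(f"{name:18s} computed {computed:7.2f} GB   listed {listed:6.1f} GB")

# Largest disagreement between the formula and the table, in GB
max_gap = max(abs(weights_gb(p) - listed) for _, p, listed in ROWS)
```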

Frequently asked questions

Can I run a 70B model on a 24 GB GPU?

Not fully in VRAM at Q4_K_M - it needs about 41 GB. You can run it with CPU offloading at reduced speed, use a lower-bit quantization, or pool two 24 GB cards.
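
To see how aggressive a "lower-bit quantization" would have to be, the weight formula can be inverted. This is a rough sketch under stated assumptions. The 1.8 GB of non-weight memory is the 70B table row's total minus its weights (41.2 - 39.4). Two pooled cards are treated as a single 48 GB pool, which ignores any per-card duplication of overhead. The function name is hypothetical.

```python
def max_bits_per_weight(vram_gb: float, params_billions: float,
                        non_weight_gb: float) -> float:
    """Largest average bits-per-weight whose weights still fit
    alongside the KV cache and runtime overhead."""
    return (vram_gb - non_weight_gb) * 8 / params_billions


# From the llama3.1:70b table row: 41.2 GB total - 39.4 GB weights
NON_WEIGHT_GB = 1.8

one_card = max_bits_per_weight(24, 70, NON_WEIGHT_GB)   # single 24 GB GPU
two_cards = max_bits_per_weight(48, 70, NON_WEIGHT_GB)  # two 24 GB GPUs, idealized pool

print(f"one 24 GB card:  up to {one_card:.2f} bits per weight")
print(f"two 24 GB cards: up to {two_cards:.2f} bits per weight")
```

Under these assumptions a single card leaves room for only about 2.5 bits per weight, while the pooled pair has headroom above the roughly 4.5 bits assumed for Q4_K_M.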

Does a bigger context window need more VRAM?

Yes. The KV cache grows roughly linearly with context length, so doubling the context roughly doubles that component of memory use.
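
The linear growth follows from how the cache is laid out: each token stores one key and one value vector per layer. The sketch below uses the usual sizing formula for a transformer with grouped-query attention and an unquantized 16-bit cache. The layer count, KV-head count, and head dimension are illustrative values rather than figures from VRAMfit, whose own KV estimate may be computed differently.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes for a single sequence."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


# Illustrative architecture values (assumptions for this example only)
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (8192, 16384, 32768):
    gb = kv_cache_bytes(ctx, **SHAPE) / 1e9
    print(f"{ctx:6d} tokens -> {gb:.2f} GB")
```

Every term except the context length is fixed by the model, so doubling the context doubles the result exactly in this formula.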

Do MoE models save VRAM?

No - they save compute. All experts must sit in VRAM, so a mixture-of-experts model needs memory for its TOTAL parameter count, while only the active experts run per token.
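
The sketch below shows the gap between what must be resident and what is used per token. It is a toy calculation for a hypothetical MoE model with 120B total and 5B active parameters. Both parameter counts and the 4.5 bits per weight are assumptions for illustration, not figures for any specific model.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float = 4.5) -> tuple[float, float]:
    """Return (resident_gb, touched_per_token_gb) for an MoE model.

    resident_gb: every expert must be loaded, so it scales with TOTAL parameters.
    touched_per_token_gb: only the routed experts are read per token,
    so it scales with ACTIVE parameters.
    """
    resident_gb = total_params_b * bits_per_weight / 8
    touched_gb = active_params_b * bits_per_weight / 8
    return resident_gb, touched_gb


# Hypothetical model: 120B total parameters, 5B active per token
resident, touched = moe_footprint(120, 5)
print(f"resident in VRAM: {resident:.1f} GB, read per token: {touched:.1f} GB")
```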

Check your own GPU on the fit board.