What GPU do I need to run Llama 3.3 70B?

In short: llama3.3:70b is a 70B dense model, so at the popular Q4_K_M quantization it needs about 41 GB of VRAM (weights + KV cache + overhead) at an 8,192-token context. A single 24 GB card therefore cannot keep it fully in VRAM; the smallest comfortable retail option is an NVIDIA RTX PRO 5000 Blackwell 72GB card, and higher quants need even more.

How much VRAM does llama3.3:70b need?

All figures are computed at an 8,192-token context. "Smallest comfortable retail GPU" excludes datacenter accelerators and unified-memory Macs; it answers what you could buy as a card:

Quantization	Weights	Total VRAM	Smallest comfortable retail GPU
Q4_K_M	39.4 GB	41.2 GB	NVIDIA RTX PRO 5000 Blackwell 72GB (72 GB)
Q5_K_M	48.1 GB	50.0 GB	NVIDIA RTX PRO 5000 Blackwell 72GB (72 GB)
Q8_0	70.0 GB	71.9 GB	NVIDIA RTX PRO 6000 Blackwell (96 GB)
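
The Weights column can be reproduced with simple arithmetic. The sketch below is a rough check, not the calculator behind the table: the bits-per-weight values (4.5, 5.5, 8.0) are assumptions back-calculated from the table's own numbers, and the function names are hypothetical. The gap between Weights and Total VRAM in the table (roughly 1.8 to 1.9 GB) is passed in as extra_gb rather than modeled.

```python
# Rough reproduction of the "Weights" column for a 70B dense model.
# ASSUMPTION: bits per weight are back-calculated from the table
# (weights_GB * 8 / 70); they are not official figures for these formats.
PARAMS_BILLIONS = 70.0
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.0}


def weights_gb(quant: str, params_billions: float = PARAMS_BILLIONS) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8.0


def total_vram_gb(quant: str, extra_gb: float) -> float:
    """Weights plus KV cache and runtime overhead, supplied as extra_gb."""
    return weights_gb(quant) + extra_gb


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: about {weights_gb(quant):.1f} GB of weights")
```

Calling total_vram_gb("Q4_K_M", 1.8) lands close to the 41.2 GB in the table; the exact KV cache and overhead split depends on the runtime and is not derived here.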

Why not a 24 GB card?

A 70B model's weights alone exceed 24 GB at 4-bit, so a single 24 GB card (RTX 4090/3090) must offload layers to system RAM, which slows generation sharply. To keep the whole model resident you want at least a 48 GB-class workstation card, two 24 GB cards pooled, or a large unified-memory machine.
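
To make the shortfall concrete, here is a minimal sketch that compares the Q4_K_M requirement from the table with a few card setups. It assumes pooled VRAM is simply the sum of the cards, which ignores per-GPU overhead and layer-split granularity; spill_gb is a hypothetical helper, not part of any tool.

```python
def spill_gb(required_gb: float, card_gb: float, num_cards: int = 1) -> float:
    """GB that cannot stay in VRAM; 0.0 means the model is fully resident.

    ASSUMPTION: pooled capacity is card_gb * num_cards, with no per-GPU
    overhead or split inefficiency.
    """
    pooled_gb = card_gb * num_cards
    return max(0.0, required_gb - pooled_gb)


REQUIRED_Q4_K_M_GB = 41.2  # Total VRAM for Q4_K_M from the table

SETUPS = [("1 x 24 GB", 24.0, 1), ("2 x 24 GB", 24.0, 2), ("1 x 48 GB", 48.0, 1)]

if __name__ == "__main__":
    for label, card_gb, count in SETUPS:
        spill = spill_gb(REQUIRED_Q4_K_M_GB, card_gb, count)
        if spill > 0:
            share = spill / REQUIRED_Q4_K_M_GB
            print(f"{label}: {spill:.1f} GB ({share:.0%}) must be offloaded")
        else:
            print(f"{label}: fits fully in VRAM")
```

On a single 24 GB card a large share of the model ends up in system RAM, which is why the slowdown is sharp rather than marginal.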

The practical pick

For full-VRAM 70B on one card, the NVIDIA RTX PRO 5000 Blackwell 72GB runs llama3.3:70b comfortably at about 19 tok/s (estimated) at Q4_K_M. Check the exact fit and speed for any card on the fit board.

Frequently asked questions

Can a single 24 GB GPU run llama3.3:70b?

Not comfortably at Q4_K_M: the model needs about 41 GB, more than a 24 GB card holds. It runs with CPU offloading at reduced speed, or you can pool two 24 GB cards.

What is the cheapest way to run a 70B model fully in VRAM?

A single 48 GB workstation card holds it at Q4_K_M; pooling two used 24 GB cards (e.g. RTX 3090) is often the cheapest route to 48 GB of VRAM, with a small multi-GPU efficiency penalty.

Does a higher quant change the GPU I need?

Yes. Q5 and Q8 grow the weights, pushing the requirement higher; the table above shows the smallest comfortable retail card per quant.
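
As an illustration of how the pick shifts with quant, the sketch below chooses the smallest card that leaves some headroom. Both the 80% headroom rule and the list of card sizes are assumptions made for this example; the page does not state how "comfortable" is defined. With these assumed values the picks happen to match the table.

```python
from typing import Optional

# ASSUMPTIONS for illustration only: a hypothetical set of retail card
# sizes and a hypothetical rule that "comfortable" means the model uses
# at most 80% of the card's VRAM.
CARD_SIZES_GB = [24, 32, 48, 72, 96]
HEADROOM = 0.80

# Total VRAM per quant, taken from the table
REQUIRED_GB = {"Q4_K_M": 41.2, "Q5_K_M": 50.0, "Q8_0": 71.9}


def smallest_comfortable_card(required_gb: float,
                              cards=CARD_SIZES_GB,
                              headroom: float = HEADROOM) -> Optional[int]:
    """Return the smallest card size (GB) that fits with headroom, else None."""
    for size in sorted(cards):
        if required_gb <= size * headroom:
            return size
    return None


if __name__ == "__main__":
    for quant, need in REQUIRED_GB.items():
        pick = smallest_comfortable_card(need)
        print(f"{quant}: needs {need} GB, smallest comfortable card: {pick} GB")
```

Under this rule a 48 GB card still fits Q4_K_M but is not counted as comfortable, which is consistent with the table listing a 72 GB card for that quant.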
