Can an RTX 4090 run 70B models?

In short: an RTX 4090 has 24 GB of VRAM, and a 70B model needs about 41 GB at Q4_K_M - so the 4090 cannot hold a 70B model comfortably. It can run one with CPU offloading at reduced speed, but where the 4090 shines is 30B-class models, which it runs comfortably and fast.

Does a 70B model fit on a 4090?

Computed at a 8,192-token context against the 4090's 24 GB:

Quantization	70B needs	RTX 4090 VRAM	Verdict
Q4_K_M	41.2 GB	24 GB	Needs CPU offload (slower)
Q5_K_M	50.0 GB	24 GB	Will not fit
Q8_0	71.9 GB	24 GB	Will not fit

Even at Q4_K_M a 70B model overflows 24 GB, so a 4090 must spill layers to system RAM. That works, but decode speed drops because the offloaded layers stream over the much slower RAM bus.

What the RTX 4090 runs comfortably

The 4090 is a superb local-LLM card for everything up to ~32B. The biggest model it runs comfortably from the catalog is deepseek-r1:32b at about 31 tok/s (estimated) at Q4_K_M - fast, all in VRAM.

If you really want full 70B

Step up to a 48 GB-class card (NVIDIA RTX PRO 5000 Blackwell 72GB) to hold a 70B model entirely in VRAM, or pool two 24 GB cards. See the exact fit for your setup on the fit board.

Frequently asked questions

Can an RTX 4090 run Llama 70B at all?

Yes, with CPU offloading - part of the model sits in system RAM. It runs, but slower than a card that holds the whole model in VRAM.

What is the best model size for a 4090?

Up to about 32B parameters at Q4_K_M run comfortably and fast on a 4090's 24 GB, leaving headroom for context.

Is two 4090s enough for 70B?

Two 24 GB cards pool to 48 GB, which holds a 70B model at Q4_K_M with room for context, at a small multi-GPU efficiency cost.

Tool Check your own GPU on the fit board