What LLMs can a 24GB GPU run?

In short: a 24 GB GPU (like the RTX 4090) comfortably runs models up to about 32B parameters at Q4_K_M, all in VRAM and fast. It can squeeze larger models tightly or with offloading, but the comfortable sweet spot is the 7B-32B range, which covers the most-used chat and coding models.

What runs comfortably on 24 GB

Computed for a 24 GB card (NVIDIA GeForce RTX 4090) at Q4_K_M / 8,192 context - biggest models first, with estimated decode speed:

Model	Params	VRAM used	Est. speed
deepseek-r1:32b	32B	19.9 GB	~31 tok/s
qwen2.5-coder:32b	32B	19.9 GB	~31 tok/s
qwen:32b	32B	19.9 GB	~31 tok/s
qwen3-vl:32b	32B	19.9 GB	~31 tok/s
qwen2.5vl:32b	32B	19.9 GB	~31 tok/s
cogito:32b	32B	19.9 GB	~31 tok/s
openthinker:32b	32B	19.9 GB	~31 tok/s
aya-expanse:32b	32B	19.9 GB	~31 tok/s
exaone-deep:32b	32B	19.9 GB	~31 tok/s
exaone3.5:32b	32B	19.9 GB	~31 tok/s

320 catalog models run comfortably and another 19 fit tightly (no headroom for long context). The largest comfortable model is deepseek-r1:32b.

The ceiling: 70B models

A 70B model needs roughly 40 GB+ at Q4_K_M, so it overflows 24 GB and must offload to system RAM. For full-VRAM 70B you want a 48 GB card or two pooled 24 GB cards.

The card

The NVIDIA GeForce RTX 4090 is the reference 24 GB card: 24 GB of VRAM at high bandwidth runs the whole 7B-32B range fast. Check the exact list for any 24 GB card on the fit board.

Frequently asked questions

What is the biggest LLM a 24 GB GPU can run?

About 32B parameters comfortably at Q4_K_M with room for context. Larger models fit tightly or need CPU offloading.

Can a 24 GB GPU run a 70B model?

Not fully in VRAM - a 70B model needs ~40 GB+ at Q4_K_M. It runs with offloading at reduced speed, or across two 24 GB cards.

How many models can a 24 GB card run?

From VRAMfit's catalog, 320 models run comfortably on a 24 GB card at Q4_K_M, covering essentially every popular model up to 32B.

Tool Check your own GPU on the fit board