How to choose a GPU for local LLMs (2026)

In short: VRAM is the spec that decides what you can run; memory bandwidth decides how fast it runs. 16 GB comfortably runs ~14B models, 24 GB runs ~32B, and 96 GB puts 120B-class MoE models on a single workstation card. Buy as much VRAM as the budget allows; everything else is secondary for local LLMs.

Why VRAM first?

A model either fits in VRAM or it does not - no amount of GPU compute rescues a model that spills to system RAM. Memory bandwidth then sets the token rate, because decoding reads every active weight from VRAM for each generated token (the whole model for dense models, only the routed experts for MoE models). Raw compute (TFLOPS) mostly affects prompt-processing speed.
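Both rules can be sketched in a few lines. This is a rough illustration, not VRAMfit's method: the 4.8 bits per weight for Q4_K_M and the 2 GB allowance for KV cache and runtime buffers are assumptions, and the function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, vram_gb: float,
         overhead_gb: float = 2.0, bits_per_weight: float = 4.8) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


def decode_ceiling_tok_s(params_billion: float, bandwidth_gb_s: float,
                         bits_per_weight: float = 4.8) -> float:
    """Upper bound on decode speed for a dense model: every weight is
    read once per token, so tokens/s <= bandwidth / weight bytes."""
    return bandwidth_gb_s / weights_gb(params_billion, bits_per_weight)


# A dense 32B model on a 24 GB card with 1008 GB/s of bandwidth.
print(fits(32, 24))                              # True
print(fits(32, 16))                              # False
print(round(decode_ceiling_tok_s(32, 1008), 1))  # 52.5
```

The ceiling is a best case; real decode speeds land below it because of kernel overhead, KV-cache reads, and imperfect bandwidth utilization.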

What each VRAM tier buys you

The biggest model that fits comfortably at Q4_K_M quantization with an 8,192-token context, along with VRAMfit's estimated decode speed on the example card:

VRAM tier	Example card	Biggest comfortable model	Est. speed
8 GB	NVIDIA GeForce RTX 4060	llama3:8b (8B)	~33 tok/s
12 GB	NVIDIA GeForce RTX 5070	deepseek-r1:14b (14B)	~47 tok/s
16 GB	NVIDIA GeForce RTX 5080	gpt-oss:20b (20B)	~47 tok/s
24 GB	NVIDIA GeForce RTX 4090	deepseek-r1:32b (32B)	~31 tok/s
32 GB	NVIDIA GeForce RTX 5090	falcon:40b (40B)	~44 tok/s
48 GB	NVIDIA RTX PRO 5000 Blackwell	deepseek-llm:67b (67B)	~20 tok/s
96 GB	NVIDIA RTX PRO 6000 Blackwell	zephyr:141b (141B)	~12 tok/s

See your exact card - including used-market and workstation options - on the fit board or the GPU comparison chart.

Frequently asked questions

Is a used RTX 3090 still a good buy for local AI?

Often, yes: 24 GB of VRAM at used prices fits the same models as a 4090, just at lower speed. The 3090 has 936 GB/s of memory bandwidth against the 4090's 1,008 GB/s, and far less compute for processing long prompts.
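Because decoding is bandwidth-bound, the bandwidth ratio gives a rough idea of the decode gap for any model that fits on both cards. A minimal sketch using only the two bandwidth figures quoted above; it ignores compute, so it says nothing about prompt-processing speed, and the helper name is hypothetical.

```python
RTX_3090_GB_S = 936
RTX_4090_GB_S = 1008


def relative_decode_speed(card_gb_s: float, reference_gb_s: float) -> float:
    """Bandwidth-bound decode speed of one card as a fraction of another."""
    return card_gb_s / reference_gb_s


ratio = relative_decode_speed(RTX_3090_GB_S, RTX_4090_GB_S)
print(f"3090 decode ceiling is about {ratio:.0%} of the 4090's")
# prints: 3090 decode ceiling is about 93% of the 4090's
```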

Do AMD and Intel GPUs work for local LLMs?

Yes - llama.cpp/Ollama support Radeon (ROCm/Vulkan) and Intel Arc (Vulkan/SYCL). Software is less turnkey than CUDA but improving quickly; the VRAM math is identical.

Can I combine two GPUs?

Yes, inference engines can split layers across cards, pooling VRAM with a modest bandwidth-efficiency penalty. VRAMfit's board has a card-count control that models this.
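To illustrate what splitting layers means, here is a toy sketch that assigns whole layers to cards in proportion to their VRAM. It is not the algorithm used by any particular engine or by VRAMfit; `split_layers` is a hypothetical helper, and it ignores per-card overhead such as the KV cache.

```python
def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign whole layers to cards in proportion to each card's VRAM."""
    total = sum(vram_gb)
    # Round each card's proportional share down to a whole layer count.
    counts = [int(n_layers * v / total) for v in vram_gb]
    # Hand out the layers lost to rounding, largest card first.
    remainder = n_layers - sum(counts)
    order = sorted(range(len(vram_gb)), key=lambda i: vram_gb[i], reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return counts


print(split_layers(64, [24, 24]))  # [32, 32]
print(split_layers(80, [24, 16]))  # [48, 32]
```

Each token still passes through every layer in sequence, so the cards take turns rather than working in parallel; pooling adds capacity, not speed.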
