VRAMfit guide · updated 2026-06-27
How to choose a GPU for local LLMs (2026)
In short: VRAM is the spec that decides what you can run; memory bandwidth decides how fast it runs. 16 GB comfortably runs ~14B models, 24 GB runs ~32B, and 96 GB puts 120B-class MoE models on a single workstation card. Buy as much VRAM as the budget allows; everything else is secondary for local LLMs.
Why VRAM first?
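A model either fits in VRAM or it does not - no amount of GPU compute rescues a model that spills to system RAM, because the spilled layers are then limited by much slower system memory. Once the model fits, memory bandwidth sets the token rate: decoding reads every active weight from VRAM for each generated token (the whole model for dense models, only the routed experts for MoE models). Raw compute (TFLOPS) mostly affects prompt-processing speed.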
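The fit-then-bandwidth reasoning above can be sketched as a back-of-the-envelope calculation. This is a minimal sketch, not VRAMfit's actual estimator: the figure of roughly 4.85 bits per weight for Q4_K_M and the 1.5 GB allowance for KV cache and runtime buffers are assumptions for illustration, and the function names are hypothetical. The decode figure is an upper bound for a dense model; real speeds land below it.

```python
# Rough fit and decode-speed estimate for a dense model.
# ASSUMPTIONS (not from the guide): ~4.85 bits per weight for Q4_K_M,
# and a flat 1.5 GB allowance for KV cache and runtime buffers.

def weights_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, vram_gb: float,
         overhead_gb: float = 1.5, bits_per_weight: float = 4.85) -> bool:
    """True if weights plus the assumed overhead fit in the given VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


def decode_ceiling_tok_s(params_billion: float, bandwidth_gb_s: float,
                         bits_per_weight: float = 4.85) -> float:
    """Upper bound on decode speed: every weight is read once per token."""
    return bandwidth_gb_s / weights_gb(params_billion, bits_per_weight)
```

For example, `fits(32, 24)` is true under these assumptions while `fits(32, 16)` is not, and `decode_ceiling_tok_s(32, 1008)` gives the bandwidth-bound ceiling for a 32B model on a 1008 GB/s card. Longer contexts need a larger `overhead_gb`, so treat the default as a placeholder.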
What each VRAM tier buys you
The biggest model that fits comfortably at Q4_K_M quantization with an 8,192-token context, alongside VRAMfit's estimated decode speed on the example card:
| VRAM tier | Example card | Biggest comfortable model | Est. speed |
|---|---|---|---|
| 8 GB | NVIDIA GeForce RTX 4060 | llama3:8b (8B) | ~33 tok/s |
| 12 GB | NVIDIA GeForce RTX 5070 | deepseek-r1:14b (14B) | ~47 tok/s |
| 16 GB | NVIDIA GeForce RTX 5080 | gpt-oss:20b (20B) | ~47 tok/s |
| 24 GB | NVIDIA GeForce RTX 4090 | deepseek-r1:32b (32B) | ~31 tok/s |
| 32 GB | NVIDIA GeForce RTX 5090 | falcon:40b (40B) | ~44 tok/s |
| 48 GB | NVIDIA RTX PRO 5000 Blackwell | deepseek-llm:67b (67B) | ~20 tok/s |
| 96 GB | NVIDIA RTX PRO 6000 Blackwell | zephyr:141b (141B) | ~12 tok/s |
See your exact card - including used-market and workstation options - on the fit board or the GPU comparison chart.
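Read in reverse, the tier table answers "how much VRAM do I need for a model of this size?". A minimal sketch of that lookup follows; the parameter counts are copied from the table, while the dictionary and function names are hypothetical. It compares raw parameter counts only, so it is a first approximation rather than a replacement for the fit board.

```python
from typing import Optional

# Biggest comfortable model (billions of parameters) per VRAM tier in GB,
# copied from the tier table (Q4_K_M, 8,192-token context).
TIER_MAX_PARAMS_B = {8: 8, 12: 14, 16: 20, 24: 32, 32: 40, 48: 67, 96: 141}


def smallest_tier_gb(params_billion: float) -> Optional[int]:
    """Return the smallest listed VRAM tier whose biggest comfortable
    model is at least as large as the requested one, or None if even
    the largest listed tier is too small."""
    for vram_gb in sorted(TIER_MAX_PARAMS_B):
        if params_billion <= TIER_MAX_PARAMS_B[vram_gb]:
            return vram_gb
    return None
```

By this table a 32B model lands on the 24 GB tier, while a 70B model skips past 48 GB (listed up to 67B) to the 96 GB tier.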
Frequently asked questions
Is a used RTX 3090 still a good buy for local AI?
Often, yes: its 24 GB of VRAM fits the same models as a 4090, at used-market prices. Decode is somewhat slower (936 vs 1008 GB/s of memory bandwidth), and long prompts process noticeably slower because the 3090 has far less compute.
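To put the bandwidth gap in numbers, the arithmetic below compares the two cards under the assumption that decoding is purely bandwidth-bound. It is a rough sketch with a hypothetical helper name, and it says nothing about prompt processing, where the compute gap matters more.

```python
def relative_decode_speed(bandwidth_gb_s: float, reference_gb_s: float) -> float:
    """Ratio of bandwidth-bound decode speed versus a reference card."""
    return bandwidth_gb_s / reference_gb_s


# RTX 3090 (936 GB/s) relative to RTX 4090 (1008 GB/s).
ratio = relative_decode_speed(936, 1008)
percent_slower = (1 - ratio) * 100
```

Here `ratio` comes out just under 0.93, so on bandwidth alone the 3090's decode ceiling is roughly 7% below the 4090's.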
Do AMD and Intel GPUs work for local LLMs?
Yes - llama.cpp/Ollama support Radeon (ROCm/Vulkan) and Intel Arc (Vulkan/SYCL). Software is less turnkey than CUDA but improving quickly; the VRAM math is identical.
Can I combine two GPUs?
Yes. Inference engines can split a model's layers across cards, which pools their VRAM at the cost of a modest bandwidth-efficiency penalty. VRAMfit's board has a card-count control that models this.
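One simple way to picture a layer split is to hand each card a share of the layers proportional to its VRAM. The sketch below assumes that proportional policy purely for illustration; it is not taken from any particular engine or from VRAMfit, and the function name is hypothetical.

```python
from typing import List


def split_layers(n_layers: int, vram_gb: List[float]) -> List[int]:
    """Assign layers to cards in proportion to each card's VRAM.

    Uses largest-remainder rounding so the counts always sum to n_layers.
    ASSUMPTION: all layers are the same size and a proportional split is
    desired; real engines may use different policies.
    """
    total = sum(vram_gb)
    exact = [n_layers * v / total for v in vram_gb]
    counts = [int(x) for x in exact]  # round each share down first
    leftover = n_layers - sum(counts)
    # Hand the remaining layers to the cards with the largest fractional part.
    order = sorted(range(len(vram_gb)),
                   key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

Two 24 GB cards split a 64-layer model evenly, while `split_layers(80, [24, 16])` gives the larger card 48 layers and the smaller one 32. The pooled capacity is simply the sum of the cards' VRAM, minus whatever per-card overhead the runtime needs.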