VRAMfit - compare GPUs & open-weight LLMs

How it works

VRAMfit helps you decide which open-weight language models you can run on your hardware. Pick a GPU (or a multi-GPU harness) on the board and it sorts every model into fit buckets, shows a transparent VRAM breakdown, estimates speed, recommends a quantization, and links to where each model lives. It is a planning aid, not a guarantee.

How "fit" is calculated

For a model at a given quantization and context length, the VRAM it needs is:

Weights	parameters x bits-per-weight / 8, in bytes. MoE models count ALL experts, because every expert sits in VRAM even though only a few are active per token.
KV cache	grows with context length, layer count, and KV-head count. It is the running memory of the conversation.
Overhead	a flat allowance for the runtime, activations, and fragmentation.
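
As a rough illustration, the three terms can be added up as in the sketch below. It is not VRAMfit's actual code: the function names, the 1.5 GB overhead, and the 16-bit KV cache are assumptions made for the example. The KV-cache line uses the standard size of a key tensor and a value tensor per layer.

```python
# Illustrative sketch only: names and default values are assumptions,
# not VRAMfit's real implementation.

ASSUMED_OVERHEAD_GB = 1.5  # assumed flat allowance (runtime, activations, fragmentation)


def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights: parameters x bits-per-weight / 8 bytes, reported in GB (1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache: one K and one V tensor per layer, per token of context."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9


def total_vram_gb(params_billions: float, bits_per_weight: float,
                  context_len: int, n_layers: int, n_kv_heads: int,
                  head_dim: int, overhead_gb: float = ASSUMED_OVERHEAD_GB) -> float:
    """Sum of weights, KV cache, and the flat overhead allowance."""
    return (weights_gb(params_billions, bits_per_weight)
            + kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim)
            + overhead_gb)


# Hypothetical 8B model at 4.5 bits per weight with an 8192-token context.
example_total = total_vram_gb(8, 4.5, context_len=8192,
                              n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{example_total:.2f} GB")
```

For a MoE model, pass the total parameter count across all experts, not the active count.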

That total is compared to your GPU's VRAM and bucketed into Comfortable, Tight, Needs offloading, or Won't fit.
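
The bucketing step could look something like the sketch below. The four bucket names come from the text above, but the cut-off ratios (85% of VRAM for Comfortable, 150% for Needs offloading) are assumed for illustration and are not published thresholds.

```python
# Illustrative sketch: the threshold ratios are assumptions, not VRAMfit's real cut-offs.

def fit_bucket(required_gb: float, vram_gb: float,
               comfortable_ratio: float = 0.85,
               offload_ratio: float = 1.5) -> str:
    """Sort a model's estimated VRAM need into one of four fit buckets."""
    if required_gb <= vram_gb * comfortable_ratio:
        return "Comfortable"       # fits with headroom to spare
    if required_gb <= vram_gb:
        return "Tight"             # fits, but with little headroom
    if required_gb <= vram_gb * offload_ratio:
        return "Needs offloading"  # part of the model must live outside VRAM
    return "Won't fit"


for need in (7.1, 23.0, 30.0, 40.0):
    print(need, "->", fit_bucket(need, vram_gb=24))
```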

Speed, quantization & confidence

Decoding is memory-bandwidth bound: generating each token means reading roughly the whole model from VRAM, so tokens/sec is approximately bandwidth x efficiency / model size in memory. Multi-GPU harnesses pool VRAM and scale bandwidth, discounted by an efficiency factor. For each model, VRAMfit recommends the highest-quality quantization that still fits. Every model is tagged verified (specs from a curated entry, a Hugging Face config, or a community fix), estimated (sized from the parameter count, so VRAM is approximate), or unknown. All numbers are estimates.
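
A minimal sketch of both ideas follows, assuming made-up efficiency factors (0.6 for a single GPU, 0.8 for multi-GPU scaling), a nominal list of bit widths, and hypothetical function names. None of these values come from VRAMfit itself.

```python
# Illustrative sketch: efficiency factors and the bit-width list are assumptions.
from typing import Optional

CANDIDATE_BITS = (16, 8, 6, 5, 4)  # nominal bits per weight, highest quality first


def tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                   efficiency: float = 0.6, n_gpus: int = 1,
                   multi_gpu_efficiency: float = 0.8) -> float:
    """Decode speed is roughly bandwidth x efficiency / model size."""
    effective_bw = bandwidth_gb_s * efficiency
    if n_gpus > 1:
        # Bandwidth scales with GPU count, discounted by an efficiency factor.
        effective_bw *= n_gpus * multi_gpu_efficiency
    return effective_bw / model_size_gb


def recommend_bits(params_billions: float, vram_gb: float,
                   extra_gb: float) -> Optional[int]:
    """Return the widest bit width whose weights plus extra_gb fit in VRAM.

    extra_gb stands in for the KV cache plus the overhead allowance.
    """
    for bits in CANDIDATE_BITS:
        weights = params_billions * bits / 8  # GB, since params are in billions
        if weights + extra_gb <= vram_gb:
            return bits
    return None  # nothing fits, even at the smallest width


print(tokens_per_sec(1000, 5))                # single GPU
print(tokens_per_sec(1000, 5, n_gpus=2))      # two GPUs, pooled
print(recommend_bits(8, vram_gb=12, extra_gb=2.5))
```

Real quantization formats carry some metadata per block of weights, so their effective bits per weight run slightly above the nominal figures used here.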

Benchmarks & data sources

Quality scores come from the Open LLM Leaderboard (independent, standardized evals). The catalog is built from the Ollama library, enriched with specs and release dates from Hugging Face. VRAMfit is not affiliated with these projects or the model creators; all trademarks belong to their owners.