VRAMfit guide · updated 2026-06-27
MoE models: why a 120B model can feel like a 5B
In short: a mixture-of-experts (MoE) model activates only a few "experts" per token, so it runs with the compute of a small model. Every expert must still sit in memory, so it needs the VRAM of its full size. A 120B MoE with 5B active parameters decodes about as fast as a 5B model, yet still needs 120B-class memory.
Active vs total parameters
Dense models use every parameter for every token. MoE models route each token through a small subset of expert networks, so the total parameter count sets the memory bill while the active count sets the per-token compute. That split is why MoE models tend to rank well on quality per unit of speed, and why their VRAM needs surprise people.
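To make the split concrete, here is a minimal sketch that works backwards from a model's total and active counts to a rough per-expert size. It assumes a simplified picture in which every parameter is either shared (attention, embeddings) or belongs to exactly one expert, and it uses mixtral:8x7b's 8 experts with 2 active per token. The function name `split_params` is made up for illustration, and the inputs are rounded, so the results are approximate.

```python
def split_params(total_b, active_b, n_experts, top_k):
    """Solve total = shared + n_experts * e and active = shared + top_k * e.

    All figures are in billions of parameters. "e" is the size of one expert
    summed over all layers; "shared" is everything every token passes through.
    """
    per_expert = (total_b - active_b) / (n_experts - top_k)
    shared = total_b - n_experts * per_expert
    return shared, per_expert


# mixtral:8x7b: 46.7B total, 12.9B active, 8 experts, 2 routed per token
shared, per_expert = split_params(total_b=46.7, active_b=12.9, n_experts=8, top_k=2)

print(f"shared parameters:  ~{shared:.1f}B")
print(f"each expert:        ~{per_expert:.1f}B")
print(f"resident in memory:  {shared + 8 * per_expert:.1f}B  (all 8 experts)")
print(f"used per token:      {shared + 2 * per_expert:.1f}B  (2 experts)")
```

The memory line always counts all experts, while the per-token line counts only the routed ones, which is the whole gap between the two columns.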
MoE models in the catalog
VRAM computed at Q4_K_M quantization with an 8,192-token context:
| Model | Total params | Active per token | VRAM needed |
|---|---|---|---|
| mixtral:8x7b | 46.7B | 12.9B | 28 GB |
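A back-of-the-envelope version of this kind of estimate can be sketched as weights plus KV cache. This is not necessarily the math behind the table: the 4.5 bits per weight for Q4_K_M is an assumed average (the real figure varies by model), the cache is assumed to be 16-bit, runtime overhead is ignored, and `estimate_vram_gb` is a hypothetical helper. The layer and head counts are Mixtral 8x7B's published architecture values.

```python
def estimate_vram_gb(total_params, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, kv_bytes=2):
    """Rough VRAM estimate in GB (10^9 bytes): quantized weights + KV cache."""
    # Every parameter is stored, active or not.
    weights = total_params * bits_per_weight / 8
    # Keys and values (factor 2) for every layer, KV head, and context position.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len
    return (weights + kv_cache) / 1e9


mixtral_gb = estimate_vram_gb(
    total_params=46.7e9,   # total, not active, parameters
    bits_per_weight=4.5,   # ASSUMED average for Q4_K_M
    n_layers=32,
    n_kv_heads=8,
    head_dim=128,
    context_len=8192,
)
print(f"mixtral:8x7b estimate: {mixtral_gb:.1f} GB before runtime overhead")
```

Under these assumptions the estimate comes out at about 27 GB, in the same ballpark as the table's 28 GB once some overhead is allowed for.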
The practical upside
If you have the memory (large unified-memory Macs, 48-96 GB cards, or multi-GPU rigs), MoE models give large-model answer quality at close to small-model speeds. The fit board marks MoE models and counts all experts in its math.
Frequently asked questions
Does an MoE model run faster than a dense model of the same size?
Yes, usually much faster per token. Only the active experts compute, so decode speed tracks the active parameter count, while answer quality draws on the full set of experts.
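As a rough illustration of that answer, the sketch below compares decode-speed ceilings for an MoE and a dense model of the same total size. It assumes single-stream decoding is limited by how fast the weights used for each token can be read from memory, ignores KV cache reads and other overhead, and uses a hypothetical 1,000 GB/s of memory bandwidth and an assumed 4.5 bits per weight. The function `decode_ceiling_tok_s` is illustrative only, and real throughput will be lower.

```python
def decode_ceiling_tok_s(params_read_per_token, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s if decoding is purely memory-bandwidth bound."""
    bytes_per_token = params_read_per_token * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 1000   # HYPOTHETICAL GPU memory bandwidth
BITS_PER_WEIGHT = 4.5   # ASSUMED average for a 4-bit quantization

# MoE: only the 12.9B active parameters are read for each token.
moe = decode_ceiling_tok_s(12.9e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
# Dense model of the same total size: all 46.7B are read for each token.
dense = decode_ceiling_tok_s(46.7e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"MoE ceiling:   {moe:.0f} tok/s")
print(f"dense ceiling: {dense:.0f} tok/s")
print(f"speedup:       {moe / dense:.1f}x")
```

The speedup is just the ratio of total to active parameters, so the bandwidth and quantization assumptions cancel out of the comparison.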
Can I load only the active experts to save VRAM?
No. Which experts fire changes from token to token, so all of them must stay resident. Some runtimes can offload rarely used experts to system RAM, at a heavy speed cost.
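A toy simulation shows why. The router below is a made-up deterministic scoring rule, not a real learned gate (real routers score experts from each token's hidden state), but it has the same shape: each token picks its top 2 of 8 experts, and the picks shift from token to token.

```python
N_EXPERTS = 8
TOP_K = 2


def toy_router(token_index):
    """Return the TOP_K expert ids with the highest toy score for this token.

    The scoring rule is arbitrary and for illustration only.
    """
    scores = [(token_index * 5 + expert * 3) % N_EXPERTS
              for expert in range(N_EXPERTS)]
    ranked = sorted(range(N_EXPERTS), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:TOP_K])


touched = set()
for t in range(8):
    chosen = toy_router(t)
    touched.update(chosen)
    print(f"token {t}: experts {chosen}, distinct experts used so far: {len(touched)}")
```

In this toy, every one of the 8 experts has been used within the first five tokens, even though each token only touches 2. Keeping just the "active" experts in VRAM would mean swapping weights in and out almost every token.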