MoE models: why a 120B model can feel like a 5B

In short: a mixture-of-experts (MoE) model activates only a few "experts" per token, so it runs with the COMPUTE of a small model - but every expert must sit in memory, so it needs the VRAM of its full size. A 120B MoE with 5B active parameters decodes about as fast as a 5B model, yet still needs 120B-class memory.

Active vs total parameters

Dense models use every parameter for every token. MoE models route each token through a small subset of expert networks: the total count sets the memory bill, the active count sets the per-token compute. That split is why MoE models top the quality-per-speed charts - and why their VRAM needs surprise people.
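To make the split concrete, here is a minimal Python sketch that backs out the shared and per-expert parameter counts from a model's total and active figures. It assumes a simplified layout: one shared block plus a set of identical experts, of which a fixed number run per token. The function name and the decomposition are illustrative only, not how any particular runtime reports sizes. The example uses Mixtral 8x7B (46.7B total, 12.9B active, 8 experts, 2 routed per token).

```python
def split_params(total_b, active_b, n_experts, top_k):
    """Back out (shared, per_expert) parameter counts, in billions.

    Assumed simplified model (not an exact architecture description):
        total  = shared + n_experts * per_expert
        active = shared + top_k     * per_expert
    """
    per_expert = (total_b - active_b) / (n_experts - top_k)
    shared = active_b - top_k * per_expert
    return shared, per_expert


# Mixtral 8x7B: 8 experts, 2 routed per token
shared, per_expert = split_params(46.7, 12.9, n_experts=8, top_k=2)
print(f"shared ~{shared:.1f}B, each expert ~{per_expert:.1f}B")
print(f"memory holds {shared + 8 * per_expert:.1f}B, "
      f"a token touches {shared + 2 * per_expert:.1f}B")
```

Under these assumptions each expert comes out at roughly 5.6B parameters and the shared layers at roughly 1.6B. All eight experts count toward memory, but only two count toward the work done for any one token.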

MoE models in the catalog

VRAM is computed for Q4_K_M quantization with an 8,192-token context:

Model	Total params	Active per token	VRAM needed
mixtral:8x7b	46.7B	12.9B	28 GB
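The catalog's exact formula is not given here, but a back-of-the-envelope estimate lands in the same range. The sketch below adds the quantized weights to the KV cache. Its inputs are assumptions: an average of about 4.5 bits per weight for Q4_K_M (an approximation, since the real figure varies by model), a 16-bit KV cache, and Mixtral 8x7B's attention shape (32 layers, 8 KV heads, head dimension 128). The function name is hypothetical.

```python
def estimate_vram_gb(total_params, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, kv_bytes=2):
    """Rough VRAM estimate in GB (1e9 bytes): all weights plus the KV cache."""
    weight_bytes = total_params * bits_per_weight / 8
    # Keys and values (factor 2) for every layer, KV head, and context position.
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len
    return (weight_bytes + kv_cache_bytes) / 1e9


# Assumed inputs: ~4.5 bits/weight for Q4_K_M (approximate), 16-bit KV cache,
# and Mixtral 8x7B's 32 layers, 8 KV heads, head dimension 128.
vram = estimate_vram_gb(46.7e9, 4.5, n_layers=32, n_kv_heads=8,
                        head_dim=128, context_len=8192)
print(f"~{vram:.1f} GB before runtime overhead")
```

With these inputs the estimate comes out a little above 27 GB, which is consistent with the 28 GB listed once some runtime overhead is added. The weight term uses the total parameter count, not the active count.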

The practical upside

If you have the memory (large unified-memory Macs, 48-96 GB cards, or multi-GPU rigs), MoE models deliver large-model answer quality at close to small-model speeds. The fit board marks MoE models and counts all experts in its math.

Frequently asked questions

Does an MoE model run faster than a dense model of the same size?

Usually much faster per token, provided the whole model fits in memory: only the active experts compute, so speed roughly tracks the active parameter count, while quality is closer to what the total count suggests.
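As a rough illustration, a common rule of thumb puts the forward-pass cost at about 2 FLOPs per parameter used per token. The sketch below applies that rule to compare Mixtral 8x7B with a hypothetical dense model of the same total size. It is an approximation that ignores attention over the context and memory-bandwidth effects, so treat the ratio as indicative rather than a benchmark.

```python
def flops_per_token(params_used):
    """Approximate forward-pass FLOPs per token: ~2 per parameter used."""
    return 2 * params_used


dense = flops_per_token(46.7e9)  # hypothetical dense model: every parameter is used
moe = flops_per_token(12.9e9)    # Mixtral 8x7B: only the active parameters are used
print(f"the dense model needs ~{dense / moe:.1f}x the compute per token")
```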

Can I load only the active experts to save VRAM?

No - which experts fire changes token by token, so all of them must be resident. Some runtimes offload cold experts to RAM at a heavy speed cost.
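A toy simulation shows why a partial load does not work. The sketch below stands in for the router with a uniform random choice of 2 experts out of 8 per token. That is an assumption for illustration: real routers are learned and choose separately in each layer. The function name is hypothetical.

```python
import random


def experts_touched(n_tokens, n_experts=8, top_k=2, seed=0):
    """Return the set of expert ids used while generating n_tokens."""
    rng = random.Random(seed)
    used = set()
    for _ in range(n_tokens):
        # Stand-in for a learned router: pick top_k distinct experts at random.
        used.update(rng.sample(range(n_experts), top_k))
    return used


for n in (1, 5, 50):
    print(f"after {n:>2} tokens: {len(experts_touched(n))} of 8 experts used")
```

A single token touches only 2 experts, but within a few dozen tokens the run has typically needed every expert at least once, so none of them can be left out of memory.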
