Guides
Written and computed by VRAMfit: the numbers come from the same math that powers the fit board. See also the buying guide for AI machines (DGX Spark, Mac Studio, Strix Halo desktops).
How much VRAM does an LLM need?
The VRAM formula for local LLMs, with computed requirements for popular open-weight models at Q4_K_M and the smallest GPU that runs each comfortably.
LLM quantization explained: Q4_K_M vs Q8_0 vs FP16
What GGUF quantization levels mean, computed VRAM needs at every level for 8B and 70B models, and how to choose.
How to choose a GPU for local LLMs (2026)
VRAM tiers from 8 GB to 96 GB with the biggest model each runs comfortably, computed from VRAMfit's catalog, plus why bandwidth beats TFLOPS for decoding.
MoE models: why a 120B model can feel like a 5B
Mixture-of-experts explained: active vs total parameters, computed VRAM needs for the catalog's MoE models, and when MoE is the right choice.
KV cache: how context length eats your VRAM
What the KV cache is, computed growth from 2K to 32K context for 8B and 70B models, and how to keep long contexts affordable.
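The quantities these guides compute can be approximated with simple arithmetic. The sketch below is a generic back-of-envelope estimate, not VRAMfit's actual formula: the function names are hypothetical, the Q4_K_M bits-per-weight figure is an assumed average, and the example configuration (32 layers, 8 KV heads, head dimension 128) is an assumed 8B-class model. It ignores activations, runtime buffers, and framework overhead, so real usage will be higher.

```python
# Back-of-envelope memory estimate for a dense decoder-only LLM.
# Generic arithmetic only; NOT VRAMfit's actual formula.

GIB = 1024 ** 3

# Approximate bits per weight. Q8_0 stores 32 int8 weights plus one FP16
# scale (8.5 bits/weight); the Q4_K_M value is an assumed average.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "FP16": 16.0}


def weight_bytes(n_params: float, quant: str) -> float:
    """Bytes needed to hold the model weights at a given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes for the KV cache of one sequence (FP16 cache by default).

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


if __name__ == "__main__":
    # Assumed 8B-class configuration (hypothetical example values).
    layers, kv_heads, head_dim = 32, 8, 128
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: weights ~{weight_bytes(8e9, quant) / GIB:.1f} GiB")
    for ctx in (2048, 32768):
        kv = kv_cache_bytes(layers, kv_heads, head_dim, ctx)
        print(f"context {ctx}: KV cache ~{kv / GIB:.2f} GiB")
```

The KV cache term grows linearly with context length, which is why long contexts can rival the weights themselves in memory use on smaller models.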
Latest AI articles
Aggregated automatically from trusted sources (Hugging Face, Ollama, NVIDIA, Google DeepMind, OpenAI, Qwen and more), kept only when they touch the models and hardware VRAMfit tracks. Headlines link to the original.
Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer
As context windows grow longer, moving large model weights efficiently becomes critical to performance. A common way to address this is quantization, an...
Quoting Dean W. Ball
This is a bad state of affairs. Consider, in particular, some industry dynamics: Frontier models are trained at an enormous cost, and a significant fraction of that cost is recouped in the few post-release months that they are broadly available. After that period elapses, the...
Quoting Timothy B. Lee
This is like saying there's no learning curve to being a manager because your employees will just do whatever you tell them to do. (Timothy B. Lee, on the idea that LLMs take no skill and have no learning curve)
Scaling AI Inference Across Multiple GPUs Using NVIDIA TensorRT with Multi-Device Inference Support
Generative AI workloads are rapidly outgrowing the memory and compute budget of single GPUs. For inference developers building media generation pipelines, the...
Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel
Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World
OpenAI and Broadcom unveil LLM-optimized inference chip
OpenAI and Broadcom introduce Jalapeño, a custom AI chip built for LLM inference to improve performance, efficiency, and scale across AI systems.
Quoting Tom MacWright
In the last few months, I've started to see [job applications] that were clearly cowritten by an LLM, link to an LLM-generated portfolio site, which then links to LLM-generated GitHub projects, with purely LLM-generated commit messages. [...] My other reaction is that I don't...
Build real agentic apps using CUGA: two dozen working examples on a lightweight harness
Experimenting with the proposed Cross-Origin Storage API in Transformers.js
Maximize AI Factory Energy Efficiency Through Full-Stack Inference and Training Optimizations
Power can account for 40% of the operating expenses (OpEx) to run an AI factory. Each watt can be spent on overhead, data ingestion, training, or generating...
Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding
As AI systems move from single-turn interactions to coordinated multiagent workflows, low-latency inference becomes increasingly important. Autoregressive LLMs...
How Telcos Build Autonomous Networks with Agentic AI
Telecom operators are adopting AI across network operations, customer care, and back-office workflows, but most are still early in the journey to autonomy. In...
PP-OCRv6 on Hugging Face: 50-Language OCR from 1.5M to 34.5M Parameters
CCCL Runtime: A Modern C++ Runtime for CUDA
The NVIDIA CUDA Core Compute Libraries (CCCL) provides delightful and efficient abstractions for CUDA developers in C++ and Python. It features: Parallel...
Quoting Sean Lynch
The real valuable capability MCP offers over skills/CLI is isolating the auth flow outside of the agent’s context window, and potentially out of the harness completely. [...] Maybe the idealized form of MCP is just an auth gateway for the API and nothing else. That’d still be a...
Beyond LoRA: Can you beat the most popular fine-tuning technique?
Is it agentic enough? Benchmarking open models on your own tooling
Using AI to help physicians diagnose rare genetic diseases affecting children
Researchers used an OpenAI reasoning model to help diagnose rare diseases, identifying 18 new diagnoses in previously unsolved cases.
From the Hugging Face Hub to robot hardware with Strands Agents and LeRobot
GLM-5.2: Built for Long-Horizon Tasks
Agentic Resource Discovery: Let agents search
Introducing LifeSciBench
Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions.
GLM-5.2 is probably the most powerful text-only open weights LLM
Chinese AI lab Z.ai released GLM-5.2 to their coding plan subscribers on June 13th, and then yesterday (June 16th) released the full open weights under an MIT license. Similar in size to their previous GLM-5 and GLM-5.1 releases this is a 753B parameter, 1.51TB monster - with 40...
NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance
NVIDIA delivered a clean sweep in MLPerf Training v6.0, the latest edition of industry-standard AI training benchmarks developed by the MLCommons consortium....
Build On-Device AI Companions with the NVIDIA ACE Game Agent SDK and Unreal Engine 5 Plugins
NVIDIA RTX technologies are deeply integrated into Unreal Engine 5 through the NVIDIA RTX Branch of Unreal Engine and the NVIDIA DLSS Unreal Engine plugin. This...
How to Optimize Transformer-Based Models for Low-Precision Training
Transformer architectures are the backbone of many modern large language and generative AI models. As these models grow in size, training runs consume more GPU...
Quoting Georgi Gerganov
I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org - nothing really impressive,...
Fine-Tuning Biological Foundation Models with LoRA Using NVIDIA BioNeMo Recipes
Foundation models are reshaping computational biology. Pretrained on massive corpora of protein or genomic sequences, models such as ESM2 (a protein language...
Boosting MoE Training Throughput with Advanced Fusion Kernels
Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable...
Pretrained to Imagine, Fine-Tuned to Act: The Rise of World-Action Models
Quick glossary for readers new to VLA/WAM terminology. VLA (Vision-Language-Action model): a robot policy that starts from a pretrained VLM backbone and adapts it...
Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation
Developers building real-time AI—such as chat assistants, copilots, and agentic workflows—are often constrained by token-by-token generation speed. This...
NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark
AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how...
Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure
As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and...
Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude
Big scoop for Maxwell Zeff at Wired: “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic said in a statement to WIRED. “We made the wrong tradeoff...
Ollama's highest performance on Apple Silicon yet with MLX
Ollama's MLX engine has been updated to deliver its highest performance on Apple Silicon yet. Models output higher quality responses, respond faster, and use less memory.
DiffusionGemma: 4x faster text generation
DiffusionGemma
Last May Google briefly released an experimental Gemini Diffusion model. I tried the preview at the time and recorded it running at 857 tokens/second. It was an exciting model, but Google made no further announcements about it. That research has returned in the...
Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech
How an Agent Built a 3D Paris Gallery by Chaining Two Hugging Face Spaces
Migrating Your GitHub CI to Hugging Face Jobs
Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT
Converting a quantized checkpoint into an NVIDIA TensorRT engine bridges the gap between model optimization and production deployment, enabling faster...
Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech
Training a speech AI model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like Acetaminophen, Amlodipine,...
Introducing Gemma 4 12B: a unified, encoder-free multimodal model
llm 0.32a3
Almost entirely written by the new Claude Fable 5; see my write-up for more details.
The Open Source Community is backing OpenEnv for Agentic RL
Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell
Pre-training frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step...
datasette-agent-edit 0.1a0
I'm planning several plugins for Datasette Agent which can make edits to existing pieces of text - things like collaborative Markdown editing, updating large SQL queries, and editing SVG files. Agentic editing of text is a little tricky to get...
Improved performance and model support with GGUF
Ollama 0.30 is now available with improved performance and GGUF model compatibility through llama.cpp. This augments Ollama's MLX engine on Apple silicon, bringing support to more models on a wider range of hardware.
Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI
NVIDIA Nemotron 3 Ultra
NVIDIA Nemotron 3 Ultra is built for high-throughput reasoning and long-running agent workflows.
NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents
Single-turn chatbots are evolving into long-running agents that can reason, maintain context, use tools, and run efficiently across many turns to complete...
Deploy Agentic-Ready AI at the Edge with Memory Efficiency in NVIDIA JetPack 7.2
As AI agents move from the digital world to the physical environment, they can readily use NVIDIA Jetson to accelerate real-world deployment with optimized...
Microsoft's new MAI models
Microsoft announced two new text LLMs this morning - MAI-Thinking-1 (reasoning, 1T parameters, 35B active, available to "select early partners") and MAI-Code-1-Flash (137B parameters, 5B active, "purpose-built for GitHub Copilot and VS Code to deliver high performance and lower...
California Brown Pelican
California Brown Pelican, in Fort Mason, CA, US. I'm at the Microsoft Build conference today, held at Fort Mason in San Francisco. There are California Brown Pelicans diving into the water directly behind the venue!
Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic
Run Local AI Agents with Faster Models and Multi-Node Clustering on NVIDIA DGX Spark
The rise of autonomous, long-running AI agents has introduced a new class of compute demand, namely tasks that maintain large context windows, spawn concurrent...
Advancing AI Infrastructure for Agentic AI with NVIDIA DOCA In-Silicon Security
The AI era is driving a new class of infrastructure: AI factories that transform data into intelligence for autonomous AI agents operating at unprecedented...
NVIDIA Vera CPU Sets a New Standard for Agentic Workloads in AI Factories
Each wave of AI has created a new scaling law. Pretraining scaled intelligence through larger datasets, more parameters, and massively parallel GPU systems....
OpenAI frontier models and Codex are now available on AWS
OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS environments, controls, and procurement workflows they already use. Customers can get started with OpenAI on AWS and move faster from...
May 2026 newsletter
I just sent out the May edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access it here. This month: AI got expensive, and Anthropic had a really good month. The model releases were a little disappointing...
DynoSim: Simulating the Pareto Frontier
Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker...
Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI
AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and...
OpenJarvis: a local-first personal AI is now available to run with Ollama
OpenJarvis v1.0 is now available: an open-source framework for building personal AI agents that run on your own hardware, with Ollama support built-in.
How Endava builds an agentic organization with Codex
Learn how Endava uses Codex to build an agentic organization, accelerating software delivery and reducing requirements analysis from weeks to hours.
NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes
The cold-start problem: In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. However,...
NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance
Large language models (LLMs) are revolutionizing the financial trading landscape by enabling sophisticated analysis of vast amounts of unstructured data to...
What’s New for Game Developers in NVIDIA RTX: DLSS 4.5 for UE5 and Multilingual AI Characters
NVIDIA RTX provides game developers with direct paths to AI-driven characters, frame generation, and ray-traced rendering. This post walks through a meaningful...
Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile
Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based...
NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates
NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in...
Run Key Genomics and Protein Folding Workloads Faster with NVIDIA RTX PRO 4500 Blackwell
Precision medicine depends on two fundamental capabilities: understanding disease at the genomic level and identifying treatments at the molecular level. ...
Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling
As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on...
Mastering Agentic Techniques: AI Agent Customization
Autonomous AI agents are taking on all types of work for businesses: routing logistics fleets, triaging support tickets, generating code, and orchestrating...
Mastering Agentic Techniques: AI Agent Evaluation
Evaluating an AI model and evaluating an AI agent are related—but they answer fundamentally different questions. A model benchmark tests the capability of a...
I/O 2026: Welcome to the agentic Gemini era
The latest from Google I/O: See how we’re helping you get more done with Gemini.
PaddleOCR 3.5: Running OCR and Document Parsing Tasks with a Transformers Backend
Gemini 3.5: frontier intelligence with action
Gemini 3.5 is built to help you execute complex, agentic workflows.
Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context — Best Sub-100M Retrieval Quality
How the NVIDIA Vera Rubin Platform is Solving Agentic AI’s Scale-Up Problem
Agentic inference has fundamentally changed the runtime dynamics of inference workloads by introducing non-deterministic trajectories—actions, observations,...
How to Eliminate Pipeline Friction in AI Model Serving
The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...
Building Blocks for Foundation Model Training and Inference on AWS
vLLM V0 to V1: Correctness Before Corrections in RL
Granite 4.1 LLMs: How They’re Built
DeepInfra on Hugging Face Inference Providers 🔥
Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents
How to Use Transformers.js in a Chrome Extension
QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
Multimodal Embedding & Reranker Models with Sentence Transformers
Gemma 4: Byte for byte, the most capable open models
Gemma 4: Our most intelligent open models to date, purpose-built for advanced reasoning and agentic workflows.
Ollama is now powered by MLX on Apple Silicon in preview
Today, we're previewing the fastest way to run Ollama on Apple silicon, powered by MLX, Apple's machine learning framework.