AI articles & guides - VRAMfit

Guides

Written and computed by VRAMfit: the numbers come from the same math that powers the fit board. See also the buying guide for AI machines (DGX Spark, Mac Studio, Strix Halo desktops).

How much VRAM does an LLM need?

The VRAM formula for local LLMs, with computed requirements for popular open-weight models at Q4_K_M and the smallest GPU that runs each comfortably.

VRAMfit guide · updated 2026-06-27

LLM quantization explained: Q4_K_M vs Q8_0 vs FP16

What GGUF quantization levels mean, computed VRAM needs at every level for 8B and 70B models, and how to choose.

VRAMfit guide · updated 2026-06-27

How to choose a GPU for local LLMs (2026)

VRAM tiers from 8 GB to 96 GB with the biggest model each runs comfortably, computed from VRAMfit's catalog, plus why bandwidth beats TFLOPS for decoding.

VRAMfit guide · updated 2026-06-27

MoE models: why a 120B model can feel like a 5B

Mixture-of-experts explained: active vs total parameters, computed VRAM needs for the catalog's MoE models, and when MoE is the right choice.

VRAMfit guide · updated 2026-06-27

KV cache: how context length eats your VRAM

What the KV cache is, computed growth from 2K to 32K context for 8B and 70B models, and how to keep long contexts affordable.

VRAMfit guide · updated 2026-06-27

Latest AI articles

Aggregated automatically from trusted sources (Hugging Face, Ollama, NVIDIA, Google DeepMind, OpenAI, Qwen and more), kept only when they touch the models and hardware VRAMfit tracks. Headlines link to the original.

Run a vLLM Server on HF Jobs in One Command

Hugging Face · 2026-06-26

llm · vllm

Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer

As context windows grow longer, moving large model weights efficiently becomes critical to performance. A common way to address this is quantization, an...

NVIDIA Developer · 2026-06-26

context window · nemotron · quantization

Quoting Dean W. Ball

This is a bad state of affairs. Consider, in particular, some industry dynamics: Frontier models are trained at an enormous cost, and a significant fraction of that cost is recouped in the few post-release months that they are broadly available. After that period elapses, the...

Simon Willison · 2026-06-26

frontier model

Quoting Timothy B. Lee

This is like saying there's no learning curve to being a manager because your employees will just do whatever you tell them to do. (Timothy B. Lee, on the idea that LLMs take no skill and have no learning curve)

Simon Willison · 2026-06-26

llm

Scaling AI Inference Across Multiple GPUs Using NVIDIA TensorRT with Multi-Device Inference Support

Generative AI workloads are rapidly outgrowing the memory and compute budget of single GPUs. For inference developers building media generation pipelines, the...

NVIDIA Developer · 2026-06-25

inference · tensorrt

Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel

Hugging Face · 2026-06-24

fine-tun · transformer

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

Hugging Face · 2026-06-24

benchmark

OpenAI and Broadcom unveil LLM-optimized inference chip

OpenAI and Broadcom introduce Jalapeño, a custom AI chip built for LLM inference to improve performance, efficiency, and scale across AI systems.

OpenAI · 2026-06-24

inference · llm

Quoting Tom MacWright

In the last few months, I've started to see [job applications] that were clearly cowritten by an LLM, link to an LLM-generated portfolio site, which then links to LLM-generated GitHub projects, with purely LLM-generated commit messages. [...] My other reaction is that I don't...

Simon Willison · 2026-06-24

llm

Build real agentic apps using CUGA: two dozen working examples on a lightweight harness

Hugging Face · 2026-06-23

agentic

Experimenting with the proposed Cross-Origin Storage API in Transformers.js

Hugging Face · 2026-06-23

transformer

Maximize AI Factory Energy Efficiency Through Full-Stack Inference and Training Optimizations

Power can account for 40% of the operating expenses (OpEx) to run an AI factory. Each watt can be spent on overhead, data ingestion, training, or generating...

NVIDIA Developer · 2026-06-23

inference

Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

As AI systems move from single-turn interactions to coordinated multiagent workflows, low-latency inference becomes increasingly important. Autoregressive LLMs...

NVIDIA Developer · 2026-06-23

inference · llm

How Telcos Build Autonomous Networks with Agentic AI

Telecom operators are adopting AI across network operations, customer care, and back-office workflows, but most are still early in the journey to autonomy. In...

NVIDIA Developer · 2026-06-23

agentic

PP-OCRv6 on Hugging Face: 50-Language OCR from 1.5M to 34.5M Parameters

Hugging Face · 2026-06-22

hugging face

CCCL Runtime: A Modern C++ Runtime for CUDA

The NVIDIA CUDA Core Compute Libraries (CCCL) provides delightful and efficient abstractions for CUDA developers in C++ and Python. It features: Parallel...

NVIDIA Developer · 2026-06-22

cuda

Quoting Sean Lynch

The real valuable capability MCP offers over skills/CLI is isolating the auth flow outside of the agent’s context window, and potentially out of the harness completely. [...] Maybe the idealized form of MCP is just an auth gateway for the API and nothing else. That’d still be a...

Simon Willison · 2026-06-19

context window

Beyond LoRA: Can you beat the most popular fine-tuning technique?

Hugging Face · 2026-06-18

fine-tun

Is it agentic enough? Benchmarking open models on your own tooling

Hugging Face · 2026-06-18

agentic · benchmark

Using AI to help physicians diagnose rare genetic diseases affecting children

Researchers used an OpenAI reasoning model to help diagnose rare diseases, identifying 18 new diagnoses in previously unsolved cases.

OpenAI · 2026-06-18

reasoning model

From the Hugging Face Hub to robot hardware with Strands Agents and LeRobot

Hugging Face · 2026-06-17

hugging face

GLM-5.2: Built for Long-Horizon Tasks

Hugging Face · 2026-06-17

glm-5 · glm-5.2

Agentic Resource Discovery: Let agents search

Hugging Face · 2026-06-17

agentic

Introducing LifeSciBench

Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions.

OpenAI · 2026-06-17

benchmark

GLM-5.2 is probably the most powerful text-only open weights LLM

Chinese AI lab Z.ai released GLM-5.2 to their coding plan subscribers on June 13th, and then yesterday (June 16th) released the full open weights under an MIT license. Similar in size to their previous GLM-5 and GLM-5.1 releases this is a 753B parameter, 1.51TB monster - with 40...

Simon Willison · 2026-06-17

glm-5 · glm-5.1 · glm-5.2

NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance

NVIDIA delivered a clean sweep in MLPerf Training v6.0, the latest edition of industry-standard AI training benchmarks developed by the MLCommons consortium....

NVIDIA Developer · 2026-06-16

benchmark

Build On-Device AI Companions with the NVIDIA ACE Game Agent SDK and Unreal Engine 5 Plugins

NVIDIA RTX technologies are deeply integrated into Unreal Engine 5 through the NVIDIA RTX Branch of Unreal Engine and the NVIDIA DLSS Unreal Engine plugin. This...

NVIDIA Developer · 2026-06-16

on-device

How to Optimize Transformer-Based Models for Low-Precision Training

Transformer architectures are the backbone of many modern large language and generative AI models. As these models grow in size, training runs consume more GPU...

NVIDIA Developer · 2026-06-16

transformer

Quoting Georgi Gerganov

I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org - nothing really impressive,...

Simon Willison · 2026-06-16

m2 ultra · qwen · qwen3

Fine-Tuning Biological Foundation Models with LoRA Using NVIDIA BioNeMo Recipes

Foundation models are reshaping computational biology. Pretrained on massive corpora of protein or genomic sequences, models such as ESM2 (a protein language...

NVIDIA Developer · 2026-06-15

fine-tun

Boosting MoE Training Throughput with Advanced Fusion Kernels

Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable...

NVIDIA Developer · 2026-06-15

mixture-of-experts · moe

Pretrained to Imagine, Fine-Tuned to Act: The Rise of World-Action Models

Quick glossary for readers new to VLA/WAM terminology. VLA (Vision-Language-Action model): a robot policy that starts from a pretrained VLM backbone and adapts it...

NVIDIA Developer · 2026-06-15

fine-tun

Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation

Developers building real-time AI—such as chat assistants, copilots, and agentic workflows—are often constrained by token-by-token generation speed. This...

NVIDIA Developer · 2026-06-12

agentic · gemma · text generation

NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark

AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how...

NVIDIA Developer · 2026-06-12

agentic · benchmark · inference

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and...

NVIDIA Developer · 2026-06-12

agentic

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

Big scoop for Maxwell Zeff at Wired: “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible.” Anthropic said in a statement to WIRED. “We made the wrong tradeoff...

Simon Willison · 2026-06-11

llm

Ollama's highest performance on Apple Silicon yet with MLX

Ollama's MLX engine has been updated to deliver its highest performance on Apple Silicon yet. Models output higher quality responses, respond faster, and use less memory.

Ollama · 2026-06-11

ollama

DiffusionGemma: 4x faster text generation

Google DeepMind · 2026-06-10

gemma · text generation

DiffusionGemma

Last May Google briefly released an experimental Gemini Diffusion model. I tried the preview at the time and recorded it running at 857 tokens/second. It was an exciting model, but Google made no further announcements about it. That research has returned in the...

Simon Willison · 2026-06-10

gemma

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Hugging Face · 2026-06-09

benchmark

How an Agent Built a 3D Paris Gallery by Chaining Two Hugging Face Spaces

Hugging Face · 2026-06-09

hugging face

Migrating Your GitHub CI to Hugging Face Jobs

Hugging Face · 2026-06-09

hugging face

Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT

Converting a quantized checkpoint into an NVIDIA TensorRT engine bridges the gap between model optimization and production deployment, enabling faster...

NVIDIA Developer · 2026-06-09

inference · quantization · quantized

Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech

Training a speech AI model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like Acetaminophen, Amlodipine,...

NVIDIA Developer · 2026-06-09

nemotron

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Google DeepMind · 2026-06-09

gemma · multimodal

llm 0.32a3

Release: llm 0.32a3. Almost entirely written by the new Claude Fable 5; see my write-up for more details.

Simon Willison · 2026-06-09

llm

The Open Source Community is backing OpenEnv for Agentic RL

Hugging Face · 2026-06-08

agentic

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

Pre-training frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step...

NVIDIA Developer · 2026-06-08

llm

datasette-agent-edit 0.1a0

Release: datasette-agent-edit 0.1a0. I'm planning several plugins for Datasette Agent which can make edits to existing pieces of text - things like collaborative Markdown editing, updating large SQL queries, and editing SVG files. Agentic editing of text is a little tricky to get...

Simon Willison · 2026-06-07

agentic

Improved performance and model support with GGUF

Ollama 0.30 is now available with improved performance and GGUF model compatibility through llama.cpp. This augments Ollama's MLX engine on Apple silicon, bringing support to more models on a wider range of hardware.

Ollama · 2026-06-05

gguf · llama.cpp · ollama

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

Hugging Face · 2026-06-04

multimodal · nemotron

NVIDIA Nemotron 3 Ultra

NVIDIA Nemotron 3 Ultra is built for high-throughput reasoning and long-running agent workflows.

Ollama · 2026-06-04

nemotron

NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents

Single-turn chatbots are evolving into long-running agents that can reason, maintain context, use tools, and run efficiently across many turns to complete...

NVIDIA Developer · 2026-06-04

nemotron

Deploy Agentic-Ready AI at the Edge with Memory Efficiency in NVIDIA JetPack 7.2

As AI agents move from the digital world to the physical environment, they can readily use NVIDIA Jetson to accelerate real-world deployment with optimized...

NVIDIA Developer · 2026-06-02

agentic

Microsoft's new MAI models

Microsoft announced two new text LLMs this morning - MAI-Thinking-1 (reasoning, 1T parameters, 35B active, available to "select early partners") and MAI-Code-1-Flash (137B Parameters, 5B active, "purpose-built for GitHub Copilot and VS Code to deliver high performance and lower...

Simon Willison · 2026-06-02

llm

California Brown Pelican

California Brown Pelican, in Fort Mason, CA, US. I'm at the Microsoft Build conference today, held at Fort Mason in San Francisco. There are California Brown Pelicans diving into the water directly behind the venue!

Simon Willison · 2026-06-02

llm

Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains

Hugging Face · 2026-06-01

mixture-of-experts

Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic

Hugging Face · 2026-06-01

llm

Run Local AI Agents with Faster Models and Multi-Node Clustering on NVIDIA DGX Spark

The rise of autonomous, long-running AI agents has introduced a new class of compute demand, namely tasks that maintain large context windows, spawn concurrent...

NVIDIA Developer · 2026-06-01

context window · local ai

Advancing AI Infrastructure for Agentic AI with NVIDIA DOCA In-Silicon Security

The AI era is driving a new class of infrastructure: AI factories that transform data into intelligence for autonomous AI agents operating at unprecedented...

NVIDIA Developer · 2026-06-01

agentic

NVIDIA Vera CPU Sets a New Standard for Agentic Workloads in AI Factories

Each wave of AI has created a new scaling law. Pretraining scaled intelligence through larger datasets, more parameters, and massively parallel GPU systems....

NVIDIA Developer · 2026-06-01

agentic

OpenAI frontier models and Codex are now available on AWS

OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS environments, controls, and procurement workflows they already use. Customers can get started with OpenAI on AWS and move faster from...

OpenAI · 2026-06-01

frontier model

May 2026 newsletter

I just sent out the May edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access it here. This month: AI got expensive, and Anthropic had a really good month; the model releases were a little disappointing...

Simon Willison · 2026-06-01

model release

DynoSim: Simulating the Pareto Frontier

Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker...

NVIDIA Developer · 2026-05-29

llm

Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI

AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and...

NVIDIA Developer · 2026-05-29

multimodal · text generation

OpenJarvis: a local-first personal AI is now available to run with Ollama

OpenJarvis v1.0 is now available: an open-source framework for building personal AI agents that run on your own hardware, with Ollama support built-in.

Ollama · 2026-05-28

ollama

How Endava builds an agentic organization with Codex

Learn how Endava uses Codex to build an agentic organization, accelerating software delivery and reducing requirements analysis from weeks to hours.

OpenAI · 2026-05-28

agentic

NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes

The cold-start problem: In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. However,...

NVIDIA Developer · 2026-05-27

inference

NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance

Large language models (LLMs) are revolutionizing the financial trading landscape by enabling sophisticated analysis of vast amounts of unstructured data to...

NVIDIA Developer · 2026-05-27

inference · language model · llm

What’s New for Game Developers in NVIDIA RTX: DLSS 4.5 for UE5 and Multilingual AI Characters

NVIDIA RTX provides game developers with direct paths to AI-driven characters, frame generation, and ray-traced rendering. This post walks through a meaningful...

NVIDIA Developer · 2026-05-27

rtx pro

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile

Developers can now use NVIDIA CUDA Tile programming within large existing C++ GPU codebases to develop highly optimized GPU kernels using tile-based...

NVIDIA Developer · 2026-05-26

cuda

NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates

NVIDIA CUDA 13.3 brings new capabilities and performance optimizations to developers across the CUDA ecosystem. The launch of NVIDIA CUDA Tile programming in...

NVIDIA Developer · 2026-05-26

cuda

Run Key Genomics and Protein Folding Workloads Faster with NVIDIA RTX PRO 4500 Blackwell

Precision medicine depends on two fundamental capabilities: understanding disease at the genomic level and identifying treatments at the molecular level. ...

NVIDIA Developer · 2026-05-26

nvidia rtx pro 4500 blackwell · rtx pro

Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling

As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on...

NVIDIA Developer · 2026-05-21

b200

Mastering Agentic Techniques: AI Agent Customization

Autonomous AI agents are taking on all types of work for businesses: routing logistics fleets, triaging support tickets, generating code, and orchestrating...

NVIDIA Developer · 2026-05-20

agentic

Mastering Agentic Techniques: AI Agent Evaluation

Evaluating an AI model and evaluating an AI agent are related—but they answer fundamentally different questions. A model benchmark tests the capability of a...

NVIDIA Developer · 2026-05-19

agentic · benchmark

I/O 2026: Welcome to the agentic Gemini era

The latest from Google I/O: See how we’re helping you get more done with Gemini.

Google AI · 2026-05-19

agentic

PaddleOCR 3.5: Running OCR and Document Parsing Tasks with a Transformers Backend

Hugging Face · 2026-05-18

transformer

Gemini 3.5: frontier intelligence with action

Gemini 3.5 is built to help you execute complex, agentic workflows.

Google DeepMind · 2026-05-15

agentic

Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context — Best Sub-100M Retrieval Quality

Hugging Face · 2026-05-14

embedding

How the NVIDIA Vera Rubin Platform is Solving Agentic AI’s Scale-Up Problem

Agentic inference has fundamentally changed the runtime dynamics of inference workloads by introducing non-deterministic trajectories—actions, observations,...

NVIDIA Developer · 2026-05-14

agentic · inference

How to Eliminate Pipeline Friction in AI Model Serving

The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...

NVIDIA Developer · 2026-05-12

fine-tun

Building Blocks for Foundation Model Training and Inference on AWS

Hugging Face · 2026-05-11

inference

vLLM V0 to V1: Correctness Before Corrections in RL

Hugging Face · 2026-05-06

llm · vllm

Granite 4.1 LLMs: How They’re Built

Hugging Face · 2026-04-29

llm

DeepInfra on Hugging Face Inference Providers 🔥

Hugging Face · 2026-04-29

hugging face · inference

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

Hugging Face · 2026-04-28

multimodal · nemotron

How to Use Transformers.js in a Chrome Extension

Hugging Face · 2026-04-23

transformer

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

Hugging Face · 2026-04-21

llm

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Hugging Face · 2026-04-16

embedding · multimodal · transformer

Multimodal Embedding & Reranker Models with Sentence Transformers

Hugging Face · 2026-04-09

embedding · multimodal · transformer

Gemma 4: Byte for byte, the most capable open models

Gemma 4: Our most intelligent open models to date, purpose-built for advanced reasoning and agentic workflows.

Google DeepMind · 2026-04-02

agentic · gemma

Ollama is now powered by MLX on Apple Silicon in preview

Today, we're previewing the fastest way to run Ollama on Apple silicon, powered by MLX, Apple's machine learning framework.

Ollama · 2026-03-30

ollama