Your Browser Can Run a 20B AI Model Now

I build browser-based PWAs. No servers, no accounts, no tracking. Until recently, "AI in the browser" meant sending your data to an API endpoint and hoping nobody was logging it. In 2026, that changed. The browser became a real ML runtime — and most developers haven't noticed yet.

I spent a week reading 23 papers, benchmarks, and framework docs. Here's where browser AI actually stands.

WebGPU Is a Real Platform

All four major browser engines — Chrome, Firefox, Safari, Edge — now ship WebGPU by default. Global coverage sits at roughly 77% of users (per Can I Use, Jan 2026). This is not an experimental flag. It is first-class compute shaders written in WGSL, mapping directly to Vulkan, Metal, and D3D12 under the hood.

The performance gap between WebGPU and the old WebGL path is not incremental. It is a generational jump:

Task	WebGPU	WebGL	Speedup
Text gen (~1B params)	18 tok/s	4 tok/s	4.5x
Image classification	12ms	45ms	3.7x
Whisper (30s audio)	2.5s	8s	3.2x

One benchmark that caught my attention: LLaMA-7B running in-browser via WebGPU is only 4.2% slower than native CPU inference. A browser tab competing with a C++ binary on the same machine. That used to be a punchline. Now it is a measurement.

Transformers.js v4

The Transformers.js v4 preview dropped in February 2026. Version 3 already supported 120+ model architectures. The v4 preview pushes that toward 200. But the headline number is this: 20-billion-parameter models running at 60 tokens per second, in-browser.

Under the hood, v4 introduces a new C++ runtime that replaces the old ONNX-JS execution path. Practical results: 4x speedup for BERT-class models. Build times dropped from 2 seconds to 200 milliseconds. The library now feels instant rather than tolerable.

The API is clean. Load a pipeline, pass it text, get structured output. No GPU setup, no driver installs, no CUDA version conflicts. The user's browser handles everything.

The Five Browser AI Frameworks

Transformers.js is not the only option. The browser ML ecosystem has consolidated around five serious frameworks, each with a different sweet spot:

Framework	Best For
Transformers.js	NLP, embeddings, multimodal pipelines
TensorFlow.js	Vision, audio, custom classifiers
WebLLM	LLM chat, offline agents
MediaPipe	Face/hand/pose tracking, AR
ONNX Runtime Web	Custom ML models, cross-framework

If you need an embedding model for semantic search inside a PWA, that is Transformers.js. If you need real-time hand tracking for a camera app, that is MediaPipe. If you want to run a full chat model offline, WebLLM handles the quantization and caching. The right tool depends on the task, not the hype. (See web.dev AI frameworks guide for detailed comparisons.)

WebLLM deserves special mention: it compiles models via MLC-LLM (arXiv:2412.15803) and Apache TVM, achieving 80% of native performance on Apple Silicon. On mid-range hardware, expect 4-7 tok/s for 8B models. It is the right choice for full chat interfaces.

Chrome's Built-In AI

Chrome now ships Gemini Nano directly inside the browser. No download, no CDN, no setup. You call window.ai.languageModel.create() and get a language model. This is part of Chrome's Built-in AI initiative. Edge ships Phi-4 mini through a similar API.

This is genuinely zero-friction. The model is already on disk when Chrome is installed. For simple tasks — summarization, rewriting, classification — it works out of the box. The limitations are real, though: constrained output length, the model can be swapped or removed by the browser vendor at any time, and it is entirely browser-specific. You cannot count on it the way you can count on a model you bundle yourself. Useful for progressive enhancement, not as a foundation.

The Cost Model

This is the part I care about most. GPU inference cost: $0. It is the user's device. You never see the compute bill because there is no bill.

The only real cost is CDN bandwidth for the initial model download. A quantized 250MB model costs roughly $0.005 per new user to serve from Cloudflare Pages or similar. After first load, IndexedDB caches the model locally (via the Cache API and IndexedDB). Repeat visits cost nothing.

4-bit quantization reduces model size by 75%. A 1B-parameter model that would be 4GB at full precision fits in ~250MB quantized. That is a single large image download. Users on modern connections barely notice.

For someone like me running free-tier infrastructure, this changes the math completely. The AI compute cost scales with the number of user devices, not with my server budget.

WebNN: The Future (Not Yet the Present)

WebNN targets dedicated neural processing units — the Apple Neural Engine, Intel NPU, Qualcomm Hexagon. The W3C published a Candidate Recommendation in January 2026 (W3C WebNN spec). The potential: up to 50x speedup over JavaScript for supported operations.

The reality: it is not production-ready. Chrome and Edge have experimental support. CPU backend only on most machines. NPU access requires specific hardware and driver versions. The performance hierarchy today looks like this:

WebNN (NPU) — fastest when it works, but rarely works
WebGPU (GPU) — ~10x faster than plain JS, production-ready now
WASM (CPU) — ~3x faster than JS, universal fallback
Plain JS — baseline, always available

WebNN is worth watching. It is not worth building on today. WebGPU is the production target.

What This Means

The browser is no longer a thin client pretending to be smart by calling APIs. It is a real ML runtime. Private — data never leaves the device. Free — no inference costs. Offline — works without a connection after first load.

I build PWAs at evey.cc — a habit tracker, a price tracker, a home inventory app, a focus timer. They are all local-first, all zero-backend. Some of them could start using on-device AI for features that currently require no intelligence at all: fuzzy search, smart categorization, text summarization, natural language input parsing. All without a single API call, all without sending a byte of user data anywhere.

The stack is here. WebGPU is the runtime. Transformers.js is the framework. 4-bit quantization is the delivery mechanism. The user's GPU is the compute. The cost is zero. The only question left is what to build with it.

In fact, I already built one: evey.cc/chat runs a language model entirely in your browser via WebLLM. Pick a model from TinyLlama 1.1B to Phi-3.5 Mini 3.8B, wait for the one-time download, and chat — no server, no API key, no data leaving your device. Try it.

References

WebGPU browser support. caniuse.com/webgpu
Transformers.js — ML for the web. github.com/huggingface/transformers.js
WebLLM: High-performance browser LLM inference, arXiv:2412.15803. arxiv.org
WebLLM project. webllm.mlc.ai
Chrome Built-in AI. developer.chrome.com
W3C WebNN specification. w3.org/TR/webnn
Google web.dev — AI frameworks for the web. web.dev