
Transformers.js v4 Puts 20-Billion-Parameter Models Inside Your Browser Tab

Hugging Face's JavaScript ML library gets a C++ WebGPU rewrite, hitting 60 tokens/sec on a 20B model with no server in sight.

Oliver Senti, Senior AI Editor
March 30, 2026

Hugging Face released a preview of Transformers.js v4 on February 9, shipping nearly a year's worth of work that started in March 2025. The headline number: a 20-billion-parameter language model running at roughly 60 tokens per second, entirely inside a browser. No API calls, no server, no cloud inference bill.

That's on an M4 Max with q4f16 quantization, so calm down. Your average laptop isn't doing that. But the fact that it works at all tells you something about where browser-based ML is headed.

The WebGPU rewrite

The core change in v4 is a complete rewrite of the WebGPU runtime in C++, built in collaboration with Microsoft's ONNX Runtime team. The previous version already supported WebGPU, but the new runtime was tested against around 200 model architectures and uses specialized ONNX Runtime operators (GroupQueryAttention, MatMulNBits, QMoE) to squeeze out performance that the generic export path couldn't touch.

For BERT-based embedding models, the Hugging Face team claims a 4x speedup from adopting the MultiHeadAttention operator alone. I'd like to see independent benchmarks confirm that, but the architectural reasoning checks out: custom operators bypass a lot of the overhead that comes from expressing attention as a graph of smaller operations.

And here's the part that matters for server-side JavaScript developers: the same WebGPU-accelerated code now runs in Node, Bun, and Deno. Not just browsers. One codebase, multiple runtimes. That's been a pain point for anyone trying to share inference logic between client and server.

What actually ships with it

The model lineup got substantially wider. V4 adds support for architectures that weren't possible before in JavaScript: Mamba state-space models, Multi-head Latent Attention, and Mixture of Experts. Specific models include GPT-OSS, Chatterbox, FalconH1, Olmo3, and a handful of others. All WebGPU-compatible, all runnable in the browser with hardware acceleration.

The library now handles models exceeding 8 billion parameters. The 20B GPT-OSS test used q4f16 quantization on Apple's highest-end laptop silicon, which is a far cry from running on a mid-range Android phone. Hugging Face's own guidance suggests sticking to models under 2B parameters for broad device compatibility. That gap between "what's technically possible" and "what works for real users" is worth keeping in mind.
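Some back-of-the-envelope arithmetic makes that gap concrete. The bytes-per-parameter figures below are rough assumptions, not official Hugging Face numbers (q4f16 packs most weights into 4 bits but keeps embeddings and norms in fp16, so the effective rate lands somewhere above 0.5 bytes per parameter):

```javascript
// Rough download-size estimate for a model at a given quantization level.
// Bytes-per-parameter values are approximations for illustration only.
const BYTES_PER_PARAM = { fp32: 4, fp16: 2, q8: 1, q4f16: 0.57 };

function estimateDownloadGB(params, dtype) {
  const bpp = BYTES_PER_PARAM[dtype];
  if (bpp === undefined) throw new Error(`unknown dtype: ${dtype}`);
  return (params * bpp) / 1e9;
}

console.log(estimateDownloadGB(20e9, 'q4f16').toFixed(1)); // "11.4" -- GB for the 20B demo
console.log(estimateDownloadGB(2e9, 'q4f16').toFixed(1));  // "1.1" -- GB at the 2B guidance
```

Eleven-plus gigabytes before the first token is generated explains why the 2B guidance exists: at that size the download drops to roughly a gigabyte, which is plausible on a decent connection.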

Full offline support landed too. After the initial model download, the weights and WASM files cache locally in the browser. Load the page once, kill the wifi, run inference again. It just works.
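Transformers.js exposes an env object for cache and model-source settings. The flags below are carried over from the v3 API and should be treated as assumptions to verify against the v4 docs; a fully-offline setup looks roughly like this:

```javascript
import { env, pipeline } from '@huggingface/transformers';

// Assumed settings from the v3 API -- verify names against the v4 docs.
env.useBrowserCache = true;      // cache downloaded weights + WASM in the browser
env.localModelPath = '/models/'; // or serve weights from your own origin
env.allowRemoteModels = false;   // once assets are cached/self-hosted, never hit the Hub

// From here on, model loads resolve from cache or the local path only.
const classify = await pipeline('sentiment-analysis');
```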

The build system nobody asked about (but should care about)

They moved from Webpack to esbuild. Build times went from 2 seconds to 200 milliseconds. Fine, a 10x improvement on something that was already fast isn't going to change your life. But the bundle size reduction might. The default web export (transformers.web.js) is 53% smaller, and the average across all builds dropped about 10%. For a library that downloads model weights and runtime code to the browser, every kilobyte of overhead in the runtime itself is kilobytes your users are waiting on before inference even starts.

The codebase also got a major restructuring. The old models.js file was over 8,000 lines. They split it into modular files, converted the repo to a pnpm monorepo, and extracted the tokenization logic into a standalone package.

That tokenizer library (@huggingface/tokenizers) is 8.8kB gzipped with zero dependencies. If all you need is tokenization for a chat interface or a prompt builder, you can skip the rest of Transformers.js entirely.

The WebGPU timing

This release lands at an interesting moment. WebGPU hit all major browsers by late 2025: Chrome has had it since version 113, Firefox 141 brought it to Windows, and Safari 26 shipped it on macOS, iOS, iPadOS, and visionOS. That convergence is what makes Transformers.js v4 viable as more than a demo. A year ago you'd have been asking users to flip browser flags. Now you're targeting maybe 70% of desktop users without any special instructions.

Mobile is still a mess, though. Chrome on Android needs recent hardware and Android 12+. Firefox on Android? Behind a flag. Safari on iOS requires iOS 26. So if your use case is "ML inference on every phone," you're still stuck with WASM fallbacks or, more realistically, a server.
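In practice that means feature-detecting WebGPU and degrading gracefully. A minimal sketch, using the standard navigator.gpu entry point (requesting an adapter is the reliable check, since the property can exist on hardware that still fails to hand one over):

```javascript
// Pick an execution device based on what the environment actually provides.
// Returns 'webgpu' only if a GPU adapter can be acquired; 'wasm' otherwise.
async function pickDevice(nav = globalThis.navigator) {
  if (nav?.gpu) {
    try {
      const adapter = await nav.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch {
      // adapter request failed; fall through to the WASM path
    }
  }
  return 'wasm';
}

// Illustrative usage: pass the result as the `device` pipeline option.
// const pipe = await pipeline('text-generation', modelId, { device: await pickDevice() });
```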

Who this is actually for

The privacy angle is real. Sensitive data that never leaves the device is a compliance argument that no amount of server-side encryption can fully match. Healthcare, finance, legal: any domain where "the data didn't go anywhere" is a regulatory advantage.

The cost angle is real too. Every inference call to a cloud API has a price. If you're running sentiment analysis on user reviews or doing client-side embeddings for search, local inference eliminates that per-request cost entirely. The tradeoff is a heavier initial page load and the assumption that your users have decent hardware.
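The break-even math is easy to sketch. Every figure below is an illustrative placeholder, not a quote from the article or any provider:

```javascript
// Requests a user must make before local inference pays for itself, given
// a one-time cost to deliver the model to them (CDN bandwidth, storage).
// Costs are in micro-dollars (USD * 1e-6) to keep the math in integers.
function breakEvenRequests(apiCostPerRequest, modelDeliveryCost) {
  if (apiCostPerRequest <= 0) return Infinity; // a free API never pays back
  return Math.ceil(modelDeliveryCost / apiCostPerRequest);
}

// E.g. 2000 micro-dollars ($0.002) per cloud embedding call vs. 80000
// micro-dollars ($0.08) of bandwidth to ship a small quantized model:
console.log(breakEvenRequests(2000, 80000)); // 40 -- requests by that one user
```

The shape of the result matters more than the numbers: for high-volume, per-user workloads like search embeddings, the delivery cost amortizes quickly; for a user who runs one inference and leaves, you paid for a model download and got nothing back.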

But let's not get carried away. The GitHub repo still shows this as a preview release, published under the next tag on NPM. Hugging Face hasn't committed to a stable release date. The API could still change. Production deployments right now are a bet that the final release won't break your integration.

What's missing

Training code. You're doing inference only. The 60 tokens/sec number came from Apple's top-tier silicon with aggressive quantization. There's no published data on how this performs on a three-year-old Windows laptop with integrated graphics. And the demo collection is still thin.

The install command is npm i @huggingface/transformers@next. The "next" tag is doing a lot of work in that sentence.

Tags: transformers.js, webgpu, hugging-face, browser-ai, javascript, machine-learning, onnx-runtime, client-side-inference, web-ml
Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.


Transformers.js v4: Browser-Based AI Hits 60 Tokens/Sec | aiHola