The teams behind TokenSpeed and Alibaba's Qwen line posted a number to the PyTorch blog on May 27: 580 tokens per second, single user, serving the 397-billion-parameter Qwen3.5 on one NVIDIA Blackwell node. They're calling it a speed record for agentic workloads. It might be. The conditions attached to that number are the part worth slowing down for.
First, the shape of the claim. The 580 figure is peak, measured at batch size one (one user, zero concurrency) on a B200 node running TP8 (tensor parallelism across eight GPUs), with the model quantized to NVFP4, NVIDIA's 4-bit floating-point format. All four parallelism setups the team tried cleared 500 tok/s at bs=1, so the headline isn't a fluke of one lucky config. But single-user peak and what you'd actually see serving a fleet at concurrency are different animals, and the post is upfront that this is the former.
What does 580 actually measure?
The agentic test behind the headline isn't real traffic. It's a simulation: 50K tokens of first-turn context, 800 tokens tacked on each following turn, 10 to 15 turns total, built with EvalScope tooling. The profile is generous to TokenSpeed in one specific way. Across the multi-turn run the engine reports a KV cache hit rate above 90 percent, which means most of that 50K prefix gets reused rather than recomputed. Real coding agents do reuse context heavily, granted. Whether your workload hits 90 percent is another matter.
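To see how much of that hit rate comes from the shape of the test itself, here is a back-of-envelope sketch. It is not the EvalScope harness or TokenSpeed's accounting. It assumes "50K" means 50,000 tokens, that each turn's prompt is the previous prompt plus 800 new tokens, that model outputs are ignored, and that the whole previous prompt is served from cache. The function name is made up for illustration.

```python
def prefix_hit_rate(first_turn: int, per_turn: int, turns: int) -> float:
    """Best-case cache hit rate if every turn reuses the whole previous prompt."""
    total_prompt = 0
    total_cached = 0
    prev_prompt = 0  # nothing is cached before the first turn
    for turn in range(turns):
        prompt = first_turn + per_turn * turn  # prompt grows by per_turn each turn
        total_prompt += prompt
        total_cached += prev_prompt  # assumption: the full previous prompt is a cache hit
        prev_prompt = prompt
    return total_cached / total_prompt


for turns in (10, 15):
    print(turns, round(prefix_hit_rate(50_000, 800, turns), 3))
```

Under those assumptions the best case lands at roughly 0.89 for 10 turns and 0.93 for 15, so a reading near 90 percent is about what this profile allows by construction. A workload with shorter prefixes or bigger per-turn additions would come in lower.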
And a big chunk of the single-user speed comes from MTP, the multi-token prediction path (speculative decoding by another name). The team's own figures put MTP's contribution at +100 to +159 percent at bs=1. Speculative decoding gains live and die by acceptance rate, which is model- and prompt-dependent. Strip MTP out and the number gets a lot less dramatic. The post doesn't dwell on that.
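A rough way to size that, as a sketch rather than anything from the post: if the +100 to +159 percent uplift applied to the 580 tok/s configuration (an assumption, since the writeup doesn't tie the two together), the implied no-MTP rate falls out by division. The second function is the textbook expectation for speculative decoding when each drafted token is accepted independently with the same probability. The draft length and acceptance rate used with it are hypothetical, not Qwen3.5 figures.

```python
def implied_baseline(with_mtp: float, uplift_pct: float) -> float:
    """Throughput without MTP, given throughput with MTP and the percent uplift."""
    return with_mtp / (1 + uplift_pct / 100)


def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of draft_len drafted tokens is accepted independently with
    probability accept_prob, and one extra token always comes from the
    verifying pass: (1 - a^(k+1)) / (1 - a).
    """
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_prob ** (draft_len + 1)) / (1 - accept_prob)


for uplift in (100, 159):
    print(f"+{uplift}% uplift -> about {implied_baseline(580, uplift):.0f} tok/s without MTP")

# Hypothetical draft length of 3; watch how quickly the payoff shrinks.
for accept in (0.9, 0.7, 0.5):
    print(accept, round(expected_tokens_per_step(accept, 3), 2))
```

On those assumptions the no-MTP figure sits somewhere between about 224 and 290 tok/s. The second loop shows why acceptance rate matters so much: tokens per step drop steeply as acceptance falls, and that count still overstates the real speedup because it ignores the cost of producing the drafts.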
One more thing. This is a vendor benchmark. The byline reads "TokenSpeed Team, Qwen Team," so it's the engine's authors measuring the engine, and there's no head-to-head table against TensorRT-LLM in the writeup itself, even though that's the comparison TokenSpeed's own design blog picks as the bar to clear. I'd want an independent run before treating 580 as gospel.
The kernel that isn't in the number
Here's the detail that caught my attention. Some framing around this release bundles in FlashAttention-4, Tri Dao's Blackwell-targeted attention kernel that shipped back in March, implying FA4 is part of how they got to 580. It isn't, at least not yet. The blog says native FA4 support for Qwen3.5 (which runs head_dim=256, a size not every kernel handles cleanly) is still under active development and will arrive in a later release. The head_dim=256 fix got merged upstream, but it isn't powering this result.
Which cuts two ways: either there's real headroom still coming, or the FA4 association is doing marketing work it hasn't earned. Probably the former. We'll know when the number moves.
Why Qwen3.5 is a genuinely hard target
What's hard to fake is the engineering behind Qwen3.5's hybrid architecture. Most of its layers are Gated Delta Network linear attention, with standard full-attention layers interleaved every so often. Those linear layers carry recurrent Mamba-style state, not a conventional key-value cache, and that breaks prefix caching: reusing a prefix means restoring the recurrent state at exactly the right boundary, not just pointing at cached pages. TokenSpeed's answer treats that state as a checkpoint owned by the cache tree, with C++ controlling when a slot joins the tree and Python controlling when tensors get copied or zeroed. Dry stuff. Also the reason that 90 percent hit rate is even achievable on this model.
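A toy model makes the boundary problem concrete. This is not TokenSpeed's code and has nothing to do with its C++ cache tree. The decay constant, the scalar state, the snapshot interval, and every name below are invented for illustration. The point is only that a recurrent layer can resume from a prefix solely where a state snapshot exists, so a shared prefix that ends between snapshots has to be partly recomputed.

```python
from dataclasses import dataclass, field

DECAY = 0.9  # toy constant, not a Qwen3.5 parameter


def step(state: float, token: int) -> float:
    """Toy recurrent update: the new state folds the token into the old state."""
    return DECAY * state + token


@dataclass
class CheckpointCache:
    interval: int  # take a state snapshot every `interval` tokens
    snapshots: dict = field(default_factory=dict)  # prefix tuple -> state

    def run(self, tokens: list) -> tuple:
        """Process tokens and return (final_state, tokens_recomputed)."""
        start, state = 0, 0.0
        # Look for the longest prefix that ends exactly on a snapshot boundary.
        longest = len(tokens) - len(tokens) % self.interval
        for n in range(longest, 0, -self.interval):
            key = tuple(tokens[:n])
            if key in self.snapshots:
                start, state = n, self.snapshots[key]
                break
        # Recompute everything after the restored boundary, snapshotting as we go.
        for i in range(start, len(tokens)):
            state = step(state, tokens[i])
            if (i + 1) % self.interval == 0:
                self.snapshots[tuple(tokens[:i + 1])] = state
        return state, len(tokens) - start


cache = CheckpointCache(interval=4)
first_turn = list(range(1, 11))        # 10 tokens, snapshots land at 4 and 8
_, recomputed_first = cache.run(first_turn)
second_turn = first_turn + [11, 12]    # shares all 10 tokens with the first turn
_, recomputed_second = cache.run(second_turn)
print(recomputed_first, recomputed_second)
```

The second turn shares ten tokens with the first but resumes from the snapshot at token 8, so it recomputes four tokens rather than two. A paged key-value cache could reuse the full shared prefix; a recurrent state can only be picked up where someone saved it.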
Long-context decay, almost as an aside, is the quietly impressive part: decode holds around 530 tok/s per user at 128K and still manages roughly 445 at 1M, about 16 percent off across an 8x context jump.
What to watch
The GitHub repo lists support for Qwen3.6, DeepSeek V4, and MiniMax M2.7 as in progress, along with prefill-decode disaggregation cleanup, all slated for the coming weeks. Native FA4 is the next shoe to drop. And the open question stays open: somebody outside the LightSeek-Qwen-NVIDIA circle running this against TensorRT-LLM on the same silicon. Until then, 580 is a real, reproducible number for a very specific setup and a headline for everything else. The model weights are public, so anyone with a B200 node can check the homework.




