Red Hat AI engineers writing on the vLLM blog published the first comprehensive accuracy and performance study of TurboQuant, the KV-cache compression method Google Research has been touting since last year. Their conclusion lands roughly opposite to the marketing: FP8 quantization stays the safer default for production serving, and TurboQuant's headline memory savings come with a throughput tax that's hard to justify outside narrow deployment scenarios.
What the benchmarks actually show
The team ran four TurboQuant variants against BF16 and FP8 baselines on Llama-3.3-70B-Instruct, two Qwen3-30B configurations, and the 200B+ parameter MiniMax-M2.7. Five benchmarks covered both long-context retrieval and reasoning workloads. The accuracy story splits cleanly along bit-width lines.
The higher-bit variants, k8v4 (8-bit keys, 4-bit values) and 4bit-nc, hold up well. On Qwen3-30B's long-context retrieval pushed out to 256k tokens, both stayed within a couple of points of the BF16 baseline. The 4bit-nc variant gets you up to 3.4x KV-cache capacity at what the post calls a "modest" accuracy hit, which is honest framing for a 1-4 point drop.
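Where does a figure like 3.4x come from, when 4-bit storage is nominally a 4x reduction from 16-bit BF16? Quantized caches carry scale metadata that eats into the headline ratio. Here's a back-of-envelope sketch; the group size and scale format are illustrative assumptions, not TurboQuant's actual layout:

```python
# Back-of-envelope KV-cache sizing: why 4-bit storage lands near 3.4x
# rather than a clean 4x. Group size and scale dtype are illustrative
# assumptions, not TurboQuant's actual layout.

def kv_bytes_per_token(num_kv_heads: int, head_dim: int, bits: int,
                       group_size: int = 32, scale_bytes: int = 2) -> float:
    """Bytes per token for K+V across all KV heads of one layer."""
    elems = 2 * num_kv_heads * head_dim        # keys + values
    payload = elems * bits / 8                 # quantized elements
    scales = elems / group_size * scale_bytes  # one FP16 scale per group
    return payload + (scales if bits < 16 else 0)

# Llama-3.3-70B-style attention shape: 8 KV heads, head_dim 128.
bf16 = kv_bytes_per_token(8, 128, bits=16)  # 4096 bytes
q4 = kv_bytes_per_token(8, 128, bits=4)     # 1024 + 128 = 1152 bytes
print(f"capacity gain: {bf16 / q4:.2f}x")   # ~3.56x
```

Under these assumptions the gain works out to roughly 3.6x, the same ballpark as the post's 3.4x once real-world overheads are counted.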
The 3-bit variants are a different story. On Qwen3-30B-A3B-Thinking, the k3v4-nc and 3bit-nc configurations dropped roughly 20 points on AIME25 and LiveCodeBench-v6. On long-context retrieval at 256k, the relative AUC fell about 30%. Even MiniMax-M2.7, a much larger model that should absorb quantization noise better, took up to an 8-point hit on the same reasoning benchmarks. So much for the original paper's "absolute quality neutrality" framing once you go below 4 bits.
The throughput problem
Here's where the value proposition gets complicated. TurboQuant compresses KV-cache storage but dequantizes back to BF16 before attention compute. FP8, by contrast, runs attention directly on H100 FP8 Tensor Cores. That architectural difference shows up everywhere in the numbers.
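A minimal sketch of the two data paths, assuming illustrative names and shapes rather than vLLM's actual kernel code:

```python
import torch
import torch.nn.functional as F

# Sketch of the two attention data paths, not vLLM's actual kernels.
# Shapes, scales, and the stand-in FP8 kernel are illustrative assumptions.

def turboquant_style_attention(q, k_q, k_scale, v_q, v_scale):
    # Compressed-cache path: stored K/V must be expanded back to BF16
    # before attention runs. This cost is paid on every read, so it
    # grows with batch size and context length.
    k = k_q.to(torch.bfloat16) * k_scale
    v = v_q.to(torch.bfloat16) * v_scale
    return F.scaled_dot_product_attention(q, k, v)

def fp8_style_attention(q, k_fp8, v_fp8, fused_fp8_kernel):
    # FP8 path: on H100-class GPUs a fused kernel can consume the FP8
    # cache directly on FP8 Tensor Cores, skipping the expansion step.
    # `fused_fp8_kernel` is a placeholder, not a real vLLM API.
    return fused_fp8_kernel(q, k_fp8, v_fp8)

# Toy exercise of the compressed path: int8 cache, per-tensor scales.
q = torch.randn(1, 8, 1, 128, dtype=torch.bfloat16)
k_q = torch.randint(-128, 128, (1, 8, 512, 128), dtype=torch.int8)
v_q = torch.randint(-128, 128, (1, 8, 512, 128), dtype=torch.int8)
out = turboquant_style_attention(q, k_q, 0.02, v_q, 0.02)
```

The dequantize step in the first path runs on every attention call, which is why the overhead tracks how much cache is being read rather than being a one-time cost.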
Latency overhead relative to BF16 ranges from 10% to 68% on Llama-3.3-70B, depending on variant and batch size. Throughput drops to between 66% and 75% of BF16 for the same model. Under burst serving load, TurboQuant variants are 1.5x to 2.5x slower per output token than BF16. FP8 either matches BF16 throughput or beats it across the same metrics.
On Llama-3.3-70B, the TurboQuant tax climbs with batch size, which is the wrong direction for an inference-serving optimization. Every additional request adds KV-cache that has to be dequantized, and the cost scales with the volume of data being read.
So when is it actually useful?
There's one scenario where TurboQuant earns its place, and the vLLM team is straightforward about it. On Llama-3.3-70B running on 4xH100 with tight memory, BF16's P99 time-to-first-token at burst exploded to roughly 17 seconds as the system ran out of KV-cache room and queued requests. TurboQuant variants kept P99 TTFT under 3.5 seconds. FP8 stayed under 1.5 seconds, which makes it the better choice even here, but TurboQuant beats running out of memory.
"TurboQuant 4bit-nc offers a compelling memory-for-throughput tradeoff," the authors write, which is corporate-speak for "use it only when you're out of options."
The recommendation: stick with FP8 for almost everything, consider 4bit-nc for memory-constrained edge deployments after validating on the target workload, and avoid the 3-bit variants entirely for production. The FP8 post from three weeks earlier already made the case for FP8 as default; these results reinforce it.
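For reference, FP8 KV cache is already a stock vLLM engine argument; a minimal setup in the spirit of the post's 4xH100 scenario might look like the sketch below. The post doesn't name the knob that enables TurboQuant, so it isn't shown here:

```python
from vllm import LLM, SamplingParams

# kv_cache_dtype is an existing vLLM engine argument; model choice and
# tensor_parallel_size mirror the post's 4xH100 setup, adjust to taste.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    kv_cache_dtype="fp8",    # store K/V in FP8 instead of BF16
    tensor_parallel_size=4,
)

out = llm.generate(
    ["Summarize the tradeoffs of KV-cache quantization."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```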
A few caveats worth flagging. The tests cover only H100s, not Blackwell or AMD silicon where the FP8 advantage might look different. TurboQuant support in vLLM is also missing sliding-window and hybrid attention, which rules out a chunk of recent model releases. Support landed in vLLM 0.20.2 if you want to reproduce the numbers.