Google Research published TurboQuant on Monday, a vector quantization algorithm that compresses the key-value cache in large language models down to 3 bits per channel with what the company claims is zero accuracy loss. The paper, which will be presented at ICLR 2026, comes from a team led by research scientist Amir Zandieh and Vahab Mirrokni, a VP and Google Fellow. At 4-bit precision, Google says TurboQuant delivers up to 8x speedup in computing attention logits on H100 GPUs compared to the unquantized 32-bit baseline.
That 8x number deserves some scrutiny. It is measured specifically on attention logit computation against a JAX baseline, not on end-to-end inference throughput. The blog post and paper are careful about this distinction, but the headline framing ("up to 8x speedup") is doing a lot of heavy lifting.
Why the KV cache matters right now
The KV cache problem is one of those infrastructure headaches that most people never think about but that dominates the cost calculations of anyone running LLMs in production. Every token a model processes generates key and value vectors that need to be stored so the model does not have to recompute them. For a model like Llama 3 at 70B parameters, serving a batch of 512 requests with a 2,048-token prompt length means storing roughly 512 GB of KV cache alone, according to a recent survey on KV cache compression. That is nearly four times the memory needed for the model weights themselves.
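The arithmetic behind figures like that is easy to reproduce. Here is a back-of-envelope sketch; the function and the config values are illustrative assumptions on my part, not the survey's exact setup:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    # 2 accounts for keys and values; each token stores head_dim values
    # per KV head per layer, at bytes_per_value precision (2 = fp16/bf16)
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical Llama-3-70B-style config: 80 layers, 8 grouped-query
# key/value heads, head dimension 128
total = kv_cache_bytes(80, 8, 128, seq_len=2048, batch_size=512)
print(f"{total / 2**30:.0f} GiB")  # 320 GiB
```

With these GQA assumptions the total lands at roughly 320 GiB, the same order of magnitude as the survey's figure; the exact number depends on the model config, the precision, and whether grouped-query attention is assumed.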
So compressing the cache is not optional anymore. It is an active area of competition between Google, Nvidia, and the open-source research community, and TurboQuant is Google's latest entry.
The two-stage trick
TurboQuant is actually three papers stapled together, which the blog post describes as a unified system. The first component, PolarQuant (to appear at AISTATS 2026), converts vectors from Cartesian to polar coordinates, which eliminates the need to store per-block normalization constants. Traditional quantization methods waste 1 to 2 bits per number on these constants, which partially defeats the purpose of quantizing in the first place. PolarQuant sidesteps this by mapping data onto a fixed circular grid where the boundaries are already known.
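A toy illustration of the polar-coordinate idea, not the paper's actual quantizer: pair up coordinates, encode each pair by its angle on a fixed circular grid, and note that the grid boundaries are known in advance, so no per-block scale factors need to be stored. (Radii are kept exact here for brevity; the real method handles them with known boundaries as well.)

```python
import numpy as np

def polar_quantize(v, angle_bits=4):
    """Toy sketch: encode 2-D blocks of v by their angle on a fixed
    circular grid whose boundaries are known a priori."""
    pairs = v.reshape(-1, 2)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])        # angle of each pair
    levels = 2 ** angle_bits
    codes = np.round((theta % (2 * np.pi)) / (2 * np.pi) * levels).astype(int) % levels
    radii = np.linalg.norm(pairs, axis=1)               # kept exact for brevity
    return codes, radii

def polar_dequantize(codes, radii, angle_bits=4):
    levels = 2 ** angle_bits
    theta = codes * 2 * np.pi / levels                  # grid midpoint angle
    pairs = radii[:, None] * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return pairs.reshape(-1)
```

At 4 bits per angle the worst-case angular error is π/16, so reconstructed vectors stay within a few percent of the originals in cosine similarity, without any stored normalization constants for the angle grid.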
The second piece, Quantized Johnson-Lindenstrauss (QJL), is a 1-bit error correction step. After PolarQuant does the heavy compression, QJL takes the leftover error and reduces each residual vector to a single sign bit. This corrects bias in the inner product estimates, which matters because MSE-optimal quantizers, as the paper proves, do not produce unbiased inner product estimators. The whole system uses most of its bit budget on PolarQuant and just 1 bit on QJL.
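The flavor of that correction can be sketched with a sign-based estimator in the spirit of the QJL construction (simplified here; the paper's exact construction may differ). Projecting a vector through a random Gaussian matrix and keeping only the sign bits still supports an unbiased inner-product estimate, because E[sign(s·x)(s·q)] = sqrt(2/π)·⟨x,q⟩/‖x‖ for Gaussian s:

```python
import numpy as np

def qjl_encode(x, S):
    # Keep only the sign of each random projection: 1 bit per row of S
    return np.sign(S @ x)

def qjl_inner_estimate(sign_bits, q, S, x_norm):
    # Debiased estimator: E[sign(s.x) * (s.q)] = sqrt(2/pi) * <x, q> / ||x||
    m = S.shape[0]
    return x_norm * np.sqrt(np.pi / 2) * (sign_bits @ (S @ q)) / m

rng = np.random.default_rng(0)
d, m = 16, 50_000
x = rng.standard_normal(d)
q = x + 0.1 * rng.standard_normal(d)   # a query correlated with x
S = rng.standard_normal((m, d))        # shared random projection matrix
est = qjl_inner_estimate(qjl_encode(x, S), q, S, np.linalg.norm(x))
# est approximates np.dot(x, q) even though only m sign bits of x were kept
```

The unbiasedness is the whole point: a sign-bit residual can be averaged away in expectation, which is exactly the property the paper says MSE-optimal quantizers lack.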
What caught my attention is that TurboQuant is data-oblivious. It requires no training, no fine-tuning, and no calibration on specific datasets. You rotate the input vectors randomly, which induces a concentrated Beta distribution on the coordinates, then apply scalar quantizers independently. The theoretical backing is solid: the authors prove TurboQuant comes within a factor of about 2.7 of the information-theoretic lower bound on distortion.
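The rotation step is easy to picture with a minimal sketch (dimension and seed are arbitrary choices of mine): one shared random orthogonal matrix is applied to every vector, and even a worst-case input with all of its mass in a single coordinate comes out spread across all of them, so a fixed scalar quantizer works per coordinate:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
# One random orthogonal rotation, shared by all vectors. Data-oblivious:
# it is drawn once, independently of any data or calibration set.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = np.zeros(d)
x[0] = 1.0            # adversarial input: all mass in one coordinate
y = Q @ x             # after rotation the mass is spread out

# The norm is preserved exactly, and each y_i**2 is Beta(1/2, (d-1)/2)
# distributed, concentrated near 1/d
print(np.linalg.norm(y), np.abs(y).max())
```

That concentration is what lets the same fixed quantization grid serve every coordinate of every input, with no per-dataset calibration.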
But how does it compare?
Google benchmarked TurboQuant against KIVI, which was published at ICML 2024 and has become the standard baseline for KV cache quantization. KIVI introduced asymmetric 2-bit quantization (keys per-channel, values per-token) and showed 2.6x memory reduction with minimal quality loss. TurboQuant claims better results at 3 bits with no quality loss at all, and the LongBench numbers on Llama-3.1-8B-Instruct do look favorable.
The needle-in-a-haystack results are the strongest claim: perfect retrieval across all test lengths while compressing the cache by at least 6x. PolarQuant alone gets close to lossless on this task, and TurboQuant matches the uncompressed baseline exactly.
What is missing from the comparison is any head-to-head with Nvidia's KVTC, which was also accepted at ICLR 2026 and claims up to 20x compression. KVTC takes a completely different approach, borrowing from JPEG-style transform coding with PCA-based decorrelation, adaptive quantization, and entropy coding. It does require a calibration step, which TurboQuant avoids, but 20x versus 6x is a large gap in compression ratio even if KVTC needs that extra setup.
The two methods may not be directly comparable. KVTC targets offline cache storage and reuse across conversation turns, while TurboQuant is designed for online, real-time quantization during inference. Different problems. But from an infrastructure buyer's perspective, both are solving the same headache, and Nvidia's approach offers more aggressive compression even if it requires calibration.
What they tested on (and what they did not)
Google evaluated TurboQuant across LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval using Gemma, Mistral, and Llama-3.1-8B-Instruct. That is a reasonable spread for long-context benchmarks. The vector search evaluation used the GloVe dataset against quantization baselines including product quantization (PQ) and RaBitQ.
The paper claims quality neutrality at 3.5 bits and marginal degradation at 2.5 bits. I would have liked to see results on larger models. The biggest model tested appears to be around 8B parameters. Gemini, the model Google is presumably most interested in applying this to, is considerably larger. Whether the approach holds up at hundreds of billions of parameters is an open question that the paper does not address.
The vector search angle
Google frames TurboQuant as useful beyond just LLMs, specifically for vector search at scale. The recall numbers on GloVe do outperform the product quantization and RaBitQ baselines, and the indexing time drops to near zero because TurboQuant is data-oblivious (no codebook training needed). For anyone running large-scale nearest-neighbor search, that zero-indexing-time property is potentially more interesting than the KV cache application, though it got buried in the blog post's second half.
The real question
TurboQuant is mathematically elegant. The theoretical proofs are tight, the approach is clean, and the data-oblivious property makes it easy to deploy. But Google published this research without releasing code or model integrations, which makes it hard to verify the claims independently.
The collaborator list spans Google researchers, an assistant professor at KAIST, and a PhD student at NYU, so this is not purely an internal product effort. Whether TurboQuant ends up integrated into Gemini's serving infrastructure, or remains a research contribution that others build on, depends on implementation details the paper does not cover: custom kernel availability, integration with existing frameworks like vLLM or TensorRT-LLM, and real-world performance at production batch sizes.
For now, KV cache compression is one of those rare areas where the research is moving faster than the infrastructure can absorb it. KIVI shipped with HuggingFace Transformers. KVTC is heading into Nvidia's Dynamo framework. TurboQuant has strong theory and good benchmarks but no deployment story yet. The FTC is not going to care about any of this, but your cloud bill might.