
Meta's REFRAG Squeezes RAG Latency by 30x, but the Code Is Still Missing

A new decoding framework from Meta compresses retrieved context before it hits the LLM. The speedups are real, if narrow.

Oliver Senti, Senior AI Editor
February 16, 2026 · 5 min read

Meta's Superintelligence Labs, working with researchers from the National University of Singapore and Rice University, published a paper describing REFRAG, a decoding framework that compresses retrieved passages into dense embeddings before sending them to an LLM. The headline numbers: 30.85x faster time-to-first-token (TTFT) and 16x larger effective context windows, with no loss in perplexity.

Those are the kind of figures that make you do a double-take. And then look at the fine print.

The actual problem here

Standard RAG pipelines are wasteful in a way that is almost comically obvious once someone points it out. You retrieve a bunch of document chunks, tokenize them, concatenate everything, and shove the whole mess into the LLM's context window. The model dutifully processes every token, including the ones that contribute nothing to the answer. Attention scales quadratically with input length, so doubling your context roughly quadruples attention compute (the KV cache grows linearly on top of that). Everyone building production RAG systems knows this pain.
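The quadratic scaling is easy to see with back-of-the-envelope arithmetic. A minimal sketch, using toy numbers rather than any real model's cost profile:

```python
# Illustrative only: attention cost grows quadratically with sequence
# length, so doubling the context roughly quadruples attention FLOPs.

def attention_flops(seq_len: int, d_model: int) -> int:
    """Rough FLOPs for one self-attention layer: score computation
    (n^2 * d) plus value mixing (n^2 * d), ignoring projections."""
    return 2 * seq_len * seq_len * d_model

base = attention_flops(2048, 4096)
doubled = attention_flops(4096, 4096)
print(doubled / base)  # 4.0 -- 2x the tokens, ~4x the attention compute
```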

What makes REFRAG's approach interesting is the observation that RAG contexts have a peculiar structure most optimization methods ignore. Retrieved passages are often semantically unrelated to each other (thanks to diversity and deduplication during re-ranking), which produces block-diagonal attention patterns. Most cross-chunk attention is close to zero. The model is doing expensive math on relationships that don't exist.
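The block-diagonal structure can be sketched directly: if retrieved chunks are mutually unrelated, meaningful attention lives only inside each chunk's block. The chunk size and count below are arbitrary, and the mask is my illustration, not code from the paper:

```python
import numpy as np

def block_diagonal_mask(num_chunks: int, chunk_len: int) -> np.ndarray:
    """Boolean mask where True marks within-chunk attention pairs;
    cross-chunk positions (near-zero attention in RAG contexts) are False."""
    n = num_chunks * chunk_len
    mask = np.zeros((n, n), dtype=bool)
    for c in range(num_chunks):
        s = c * chunk_len
        mask[s:s + chunk_len, s:s + chunk_len] = True
    return mask

mask = block_diagonal_mask(num_chunks=4, chunk_len=16)
# Only 1/num_chunks of all pairwise positions carry signal.
print(mask.sum() / mask.size)  # 0.25
```

With 4 chunks, 75% of the attention matrix is compute spent on relationships that don't exist, which is exactly the waste REFRAG targets.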

Compress, sense, expand

REFRAG's pipeline works in three steps. A lightweight encoder (think RoBERTa-scale) splits retrieved passages into fixed-size chunks of 16 tokens and compresses each one into a single dense embedding. An RL-trained policy then scores each compressed chunk and decides which ones actually matter for the query. Only those chunks get decompressed back into full token embeddings and sent to the decoder. Everything else stays as a single vector placeholder.
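The three steps above can be sketched in a few lines. This is a toy stand-in, not Meta's implementation: mean-pooling plays the role of the trained encoder, and precomputed scores stand in for the RL policy:

```python
import numpy as np

CHUNK = 16  # fixed chunk size from the paper

def refrag_style_input(token_embs: np.ndarray, scores: np.ndarray,
                       expand_k: int) -> list:
    """Keep the top-`expand_k` chunks as full token embeddings; replace
    every other chunk with a single pooled vector (a stand-in for the
    paper's encoder output). Returns the segments fed to the decoder."""
    chunks = token_embs.reshape(-1, CHUNK, token_embs.shape[-1])
    keep = set(np.argsort(scores)[-expand_k:])
    segments = []
    for i, chunk in enumerate(chunks):
        if i in keep:
            segments.append(chunk)                          # 16 token vectors
        else:
            segments.append(chunk.mean(0, keepdims=True))   # 1 placeholder vector
    return segments

embs = np.random.randn(8 * CHUNK, 64)   # 8 retrieved chunks of 16 tokens
segs = refrag_style_input(embs, scores=np.random.rand(8), expand_k=2)
print(sum(len(s) for s in segs))  # 38 decoder positions instead of 128
```

The decoder sees 2×16 full positions plus 6 single-vector placeholders, which is where the sequence-length savings come from.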

The result is that the LLM's input sequence shrinks dramatically. At a compression rate of k=16, REFRAG achieves 16.53x TTFT acceleration. Push that to k=32 and you hit 30.85x, which is 3.75x better than CEPE, the previous state-of-the-art from Princeton NLP that managed only 2-8x speedups. Throughput improves by up to 6.78x compared to LLaMA baselines.

The "compress anywhere" capability is worth noting. Unlike CEPE, which requires cross-attention modules inserted at every decoder layer, REFRAG can mix compressed and uncompressed chunks at arbitrary positions in the sequence. That matters for multi-turn conversations and agentic workflows where context keeps changing.

How skeptical should we be?

Pretty skeptical, actually. The 30.85x TTFT number comes at k=32 compression, which is aggressive. At that rate, each 16-token chunk becomes a single vector. That's a lot of information to discard, and the paper's claim of "no loss in perplexity" deserves scrutiny.

The perplexity comparison is against their own REFRAG variants and against CEPE, not against a full-context LLaMA baseline processing the same information. When the paper says REFRAG achieves about 9.3% better perplexity than CEPE, that's a comparison between two compressed approaches, not between compression and no compression. The paper does claim comparable performance to LLaMA when given the same number of passages, and slightly better performance under weak retrievers where lots of irrelevant passages dominate. But "comparable" is doing a lot of work in that sentence.
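As a reminder of what the metric actually measures: perplexity is the exponentiated average negative log-likelihood per token, so lower is better and relative differences compound over long sequences. The log-probabilities below are invented for illustration:

```python
import math

def perplexity(token_logprobs: list) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Four tokens with made-up log-probs under some model:
print(round(perplexity([-1.2, -0.4, -2.1, -0.7]), 3))  # 3.004
```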

The training setup is also narrow. REFRAG was pretrained on 20 billion tokens from SlimPajama, specifically the Books and arXiv subsets. Testing happened on Book, arXiv, PG19, and ProofPile datasets. These are all text-heavy, relatively clean domains. How well does the RL chunk selector generalize to code, legal documents, or messy web scrapes? The paper doesn't say.

The RL policy question

The reinforcement learning component is where I have the most questions. The policy learns to predict which chunks the LLM "needs" to see in full versus which can stay compressed. The paper compares three selection methods: the trained RL policy, a perplexity-based heuristic (compress low-perplexity chunks first), and random selection. RL wins, which is expected, but the margin over the perplexity heuristic seems modest in the figures. If a simple heuristic gets you 80% of the way there, the RL training overhead starts to look less attractive for production deployments.
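The perplexity heuristic baseline is simple enough to sketch in full, which is part of why its competitiveness matters. A minimal version, with invented per-chunk perplexity values; the function name and budget parameter are mine, not the paper's:

```python
import numpy as np

def select_for_expansion(chunk_ppl: np.ndarray, budget: int) -> list:
    """Perplexity heuristic: compress low-perplexity chunks first, i.e.
    spend the expansion budget on the chunks the model finds hardest.
    Returns indices of the `budget` chunks kept as full tokens."""
    order = np.argsort(chunk_ppl)[::-1]       # highest perplexity first
    return sorted(order[:budget].tolist())

ppl = np.array([3.1, 12.4, 2.0, 9.8, 5.5])   # made-up per-chunk values
print(select_for_expansion(ppl, budget=2))   # [1, 3]
```

No training loop, no reward shaping, no policy network: if this gets most of the quality at a fraction of the engineering cost, that's a real consideration for production teams.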

What's missing

The code. Meta says it will be released at github.com/facebookresearch/refrag, but as of this writing the repository doesn't exist. A community reimplementation on GitHub exists, but it's a reference implementation, not Meta's production code. Without the official codebase, independent reproduction of these results is essentially impossible.

There's also no discussion of training cost. The paper describes continual pretraining on 20B tokens, which isn't trivial. If you need to retrain or fine-tune the encoder and RL policy for each new domain, the amortized cost could eat into the inference savings.

And the elephant in the room: this was tested exclusively on LLaMA-2 7B with a 4K context window. LLaMA-2 is two generations old at this point. Modern models ship with 128K+ context windows and much better long-context handling out of the box. Whether REFRAG's compression approach still delivers meaningful gains when your base model already handles 128K tokens natively is an open question the paper doesn't address.

Where this actually matters

If you're running a RAG system with a smaller model, hard latency requirements, and lots of retrieved passages, REFRAG could be genuinely useful. The framework doesn't modify the LLM architecture at all, which makes it a plug-and-play addition to existing pipelines. The weak-retriever results are particularly interesting: when your retriever pulls back a lot of junk, REFRAG's selective expansion means you can afford to retrieve more passages (since most will be compressed away) without paying the full latency cost.

For enterprise deployments constrained by GPU memory and KV cache sizes, the 16x context extension could open up use cases that were previously impractical. But that's a specific niche, not the universal RAG fix that some of the coverage has suggested.

The paper was submitted in September 2025 and revised in October. Until Meta releases the code and someone reproduces these numbers independently, treat the benchmarks as promising but unconfirmed.

Tags: RAG, Meta AI, LLM optimization, REFRAG, context compression, inference latency, retrieval augmented generation, reinforcement learning
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
