DeepSeek Sparse Attention Gets a From-Scratch Implementation Built for Reading

A new from-scratch DSA build prioritizes readable code over the fused kernels that make it actually fast.

Liza Chan, AI & Emerging Tech Correspondent
May 24, 2026 · 4 min read
Abstract visualization of sparse attention selecting a small subset of tokens from a long sequence

The widely used LLMs-from-scratch repository recently picked up a from-scratch build of DeepSeek Sparse Attention, the token-selection mechanism at the heart of DeepSeek-V3.2. It is small, it is readable, and it runs as a single script. It also skips, on purpose, the one part that makes the real thing fast.

That last part is where it gets interesting.

The idea, minus the math

Standard causal attention compares every query token against every token before it. Cost grows with the square of the sequence length, which is fine until your context gets long and then it really, really isn't. Sparse attention narrows the field. Instead of looking at everything, each token looks at a chosen subset of what came before.
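To put rough numbers on that, here is a small sketch that counts how many query-key pairs get scored under causal attention, with and without a per-token budget. The function names and the budget of 2,048 are my own illustrative choices, not anything taken from the repository.

```python
def dense_pairs(seq_len: int) -> int:
    """Query-key pairs under causal attention: token t sees tokens 1..t."""
    return seq_len * (seq_len + 1) // 2


def sparse_pairs(seq_len: int, k: int) -> int:
    """Pairs when each token attends to at most k of the tokens it can see."""
    return sum(min(t, k) for t in range(1, seq_len + 1))


if __name__ == "__main__":
    budget = 2_048  # illustrative per-token budget (an assumption)
    for seq_len in (1_024, 32_768, 131_072):
        dense = dense_pairs(seq_len)
        sparse = sparse_pairs(seq_len, budget)
        print(f"L={seq_len:>7}: dense={dense:,}  sparse={sparse:,}  "
              f"ratio={dense / sparse:.1f}x")
```

Below the budget the two counts are identical, and the gap only opens up as the sequence grows past it. The count covers the main attention scores only; scoring candidates to decide which tokens to keep is separate work.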

DeepSeek does the choosing with two pieces. A "lightning indexer" scores how relevant each past token is to the current one, and a token selector keeps the top-K highest scorers and masks the rest to negative infinity before the softmax. The technical report spells out the scoring function; the repo reproduces it with the scale factors made explicit, which is more than most paper appendices bother to do.
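As I read the report, the indexer score for a query and a past token is a weighted sum, over a handful of indexer heads, of ReLU-ed query-key dot products. The sketch below follows that form for a single query in plain Python. Every name in it is hypothetical, the inputs are toy numbers, and it does not reproduce the repo's own scale factors.

```python
def indexer_scores(q_heads, keys, head_weights):
    """Relevance of each visible token to one query token.

    q_heads:      H indexer query vectors for the current token
    keys:         S indexer key vectors, one per visible token
    head_weights: H scalar weights for the current token
    Returns S scores, each sum_j w_j * ReLU(q_j . k_s).
    """
    scores = []
    for k_vec in keys:
        total = 0.0
        for q_vec, w in zip(q_heads, head_weights):
            dot = sum(a * b for a, b in zip(q_vec, k_vec))
            total += w * max(dot, 0.0)  # ReLU on the per-head dot product
        scores.append(total)
    return scores


def top_k_indices(scores, k):
    """Positions of the k highest scores (all positions if fewer than k exist)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    q_heads = [[1.0, 0.0], [0.0, 1.0]]            # two toy indexer heads
    keys = [[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # three visible tokens
    scores = indexer_scores(q_heads, keys, head_weights=[1.0, 1.0])
    print("scores:", scores)
    print("kept positions:", top_k_indices(scores, k=2))
```

The positions that come back are the ones the attention step is allowed to see; everything else gets masked out.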

Three classes come with it: a lightning indexer, a drop-in attention module that applies the sparse mask (with an optional KV cache, if you want to watch it interact with generation), and a small GPT that swaps the new module in. Set a topk flag, run the script, and you can watch the selection happen.

Here's the catch

The README doesn't hide it. The implementation builds the full dense attention score matrix and then applies the top-K mask before softmax. So it computes everything, then throws most of it away. In a fused production kernel, DSA can drop the main attention compute from O(L²) to O(L·k), where L is the sequence length and k is the number of tokens each query keeps. Here it does not. You get the selection logic, fully inspectable, and none of the speedup.
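The trade-off is easy to see for a single query. This is a minimal sketch in plain Python, not the repo's code, with hypothetical function names. One path scores every key and masks afterwards, as the README describes; the other scores only the selected keys, which is the work a sparse kernel would aim for. It assumes at least one position is selected, which causal attention guarantees in practice.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) is 0.0, so masked slots vanish
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_dense_then_mask(q, keys, values, selected):
    """Score every key, then push unselected scores to -inf before softmax."""
    scale = 1.0 / math.sqrt(len(q))
    keep = set(selected)
    scores = [dot(q, k) * scale for k in keys]  # all S dot products
    masked = [s if i in keep else -math.inf for i, s in enumerate(scores)]
    weights = softmax(masked)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def attend_gathered(q, keys, values, selected):
    """Score only the selected keys, skipping the rest entirely."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, keys[i]) * scale for i in selected]  # only k dot products
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected))
            for d in range(dim)]


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
    selected = [0, 2]
    print(attend_dense_then_mask(q, keys, values, selected))
    print(attend_gathered(q, keys, values, selected))
```

Both functions return the same output vector. The difference is only in how many dot products were computed to get there.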

And honestly? That's the right call for a teaching repo. I'd rather read forty lines that show me which tokens get picked than a wall of fused CUDA that runs faster and explains nothing. Sebastian Raschka, who maintains the repository, is upfront that this isn't production code: no Multi-Head Latent Attention, no fused sparse kernels, and none of the serving-side machinery DeepSeek built around the idea. The indexer here even runs off the raw hidden states rather than the compressed latent vectors that the full model feeds it through MLA.

So if you clone this expecting your toy GPT to handle long context cheaply, you'll be disappointed. That's not what it's for.

About that 50 percent number

When DeepSeek shipped V3.2-Exp on September 29, 2025, the headline everywhere was cost. TechCrunch reported up to 50 percent cheaper API calls on long-context work (and yes, DeepSeek cut its own API prices the same week, which suggests they believe the figure). Worth keeping straight where that saving actually lives, though: in the fused kernels and the serving stack, not in the attention idea on its own. This teaching code reproduces the idea. The savings stay behind in DeepSeek's infrastructure.

Does the trick hold up at scale? DeepSeek says yes. The full V3.2 report claims rough parity with V3.1-Terminus across reasoning, coding, and agentic benchmarks, with the efficiency gains concentrated on long sequences. Those are the company's own evaluations on its own setup, so treat the parity claim as a starting point, not a verdict. And the flashier results, like the V3.2-Speciale variant matching Gemini-3.0-Pro and taking gold at the 2025 IMO, come from a separate high-compute model that has little to do with the sparse-attention story.

Who's this actually for?

If you read the V3.2 paper and bounced off the indexer equations, this is the cleanest way I've seen to make them concrete. Pair it with Raschka's companion writeup, which traces the same architecture from V3 forward, and the formula stops being abstract.

For everyone else, the broader release is the real story. DeepSeek published the V3 report in December 2024 and has since shipped V3.2 weights and a full report under an MIT license, kernels included. The open question now isn't whether the indexer works in a notebook. It's whether anyone outside DeepSeek reproduces that long-context cost curve on their own hardware. That's the test worth watching.

Tags: DeepSeek, sparse attention, DSA, lightning indexer, large language models, transformer architecture, long-context, open source AI, LLMs from scratch

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.
