Keye-VL 2.0: Sparse Attention Meets Long Video

Kuaishou's Kwai-Keye team has released Keye-VL 2.0, a 30B mixture-of-experts vision-language model with roughly 3B active parameters and a 256K context window. The pitch is straightforward: take DeepSeek Sparse Attention, the efficiency trick that showed up in text models last fall, and point it at video that runs for an hour. Weights are out on ModelScope under Apache 2.0.

That's the interesting part, and it's worth being clear about what's actually new here, because DSA itself is not. DeepSeek shipped it in V3.2-Exp back in September 2025: a lightning indexer plus fine-grained token selection that drops the main attention cost from quadratic in sequence length to roughly linear. That was a text model. What Kuaishou is claiming is that they're the first to land DSA in a production multimodal system. The model card calls it exactly that: the "first multi-modal model to land DSA in production."
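To make "indexer plus token selection" concrete, here is a minimal NumPy sketch of the general idea: a cheap scoring pass ranks earlier tokens for each query, and full attention then runs only over the top-scoring ones. It is loosely modeled on the public description of DSA and is not DeepSeek's or Kuaishou's implementation. The function names, tensor shapes, the ReLU-weighted scoring form, and the single-head layout are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def indexer_scores(q_idx, k_idx, w):
    """Cheap relevance scores between every query and every key.

    q_idx: (T, H, d_i) small indexer queries, H indexer heads (hypothetical shapes)
    k_idx: (T, d_i)    small indexer keys
    w:     (T, H)      per-query head weights
    Returns a (T, T) score matrix; higher means "more worth attending to".
    """
    dots = np.einsum("thd,sd->ths", q_idx, k_idx)          # (T, H, T)
    return np.einsum("th,ths->ts", w, np.maximum(dots, 0.0))

def sparse_attention(q, k, v, scores, top_k):
    """Each query attends only to its top_k highest-scoring causal positions."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        causal = scores[t, : t + 1]               # only positions <= t are visible
        keep = np.argsort(causal)[-top_k:]        # indices of the selected tokens
        logits = q[t] @ k[keep].T / np.sqrt(d)
        out[t] = softmax(logits) @ v[keep]
    return out

def dense_attention(q, k, v):
    """Reference causal attention over all earlier tokens."""
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    return softmax(logits) @ v
```

With `top_k` at least as large as the sequence length, the sparse path reduces to ordinary causal attention, which is a handy sanity check. With a fixed `top_k`, the expensive attention step touches at most `top_k` keys per query no matter how long the sequence gets. The scoring pass still compares every query against every key, so the savings depend on that pass being much cheaper than full attention.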

The 256K context play

Long video breaks most vision models in a predictable way. Attention spreads thin, the model loses the thread between the opening of a clip and its payoff forty minutes later, and causal chains just dissolve. DSA's whole reason for existing is to keep computation manageable while the context stretches, so coupling it to hour-long video is a sensible bet. Whether it pays off is the question.
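A quick back-of-envelope calculation shows how tight the budget is for an hour of footage. The sketch below assumes "256K" means 262,144 tokens, that the entire window is spent on video frames with nothing reserved for the prompt or the answer, and that frames are sampled uniformly. None of those assumptions come from the model card. The frame counts are the two settings quoted for VideoMME V2 in this article.

```python
CONTEXT_TOKENS = 256 * 1024   # assumption: "256K" read as 262,144 tokens
VIDEO_SECONDS = 60 * 60       # one hour of footage

def frame_budget(num_frames, context_tokens=CONTEXT_TOKENS, video_seconds=VIDEO_SECONDS):
    """Upper bounds under the stated assumptions, not measured values."""
    return {
        "seconds_between_frames": video_seconds / num_frames,
        "max_tokens_per_frame": context_tokens // num_frames,
    }

for n in (64, 512):
    print(n, frame_budget(n))
```

At 64 frames the model sees roughly one frame a minute, and at 512 frames about one every seven seconds. Denser sampling buys temporal coverage at the price of a smaller per-frame token budget, which is exactly the regime where attention cost starts to matter.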

The number Kuaishou leans on is VideoMME V2, and it's the one I find genuinely interesting. Most models get worse as you feed them more frames, drowning in their own input. Keye-VL 2.0 goes the other way: accuracy climbs from 35.3% at 64 frames to 42.4% at 512 frames, and the non-linear reasoning sub-score moves from 18.5 to 24.2. An improvement curve that points up as the input grows is the kind of result that, if it holds, means the sparse attention is doing real work rather than just saving money.

The catch: 42.4% is the ceiling here, not a triumphant score. The story is the trajectory, not the absolute. Read it as "doesn't fall apart" rather than "nails it."
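For a sense of scale, the reported numbers can be turned into absolute and relative gains. The helper names below are hypothetical, and the per-doubling figure assumes the improvement is spread evenly across the three doublings from 64 to 512 frames, which the two published endpoints cannot confirm.

```python
import math

def gain(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return absolute, relative

def points_per_doubling(before, after, frames_before, frames_after):
    """Average gain per doubling of frame count, assuming an even spread."""
    doublings = math.log2(frames_after / frames_before)
    return (after - before) / doublings

overall = gain(35.3, 42.4)       # VideoMME V2 accuracy, 64 -> 512 frames
nonlinear = gain(18.5, 24.2)     # non-linear reasoning sub-score
per_doubling = points_per_doubling(35.3, 42.4, 64, 512)

print(f"overall: +{overall[0]:.1f} points ({overall[1]:.1f}% relative)")
print(f"non-linear reasoning: +{nonlinear[0]:.1f} points ({nonlinear[1]:.1f}% relative)")
print(f"average per doubling of frames: +{per_doubling:.2f} points")
```

That works out to about 7.1 points overall and 5.7 points on the non-linear reasoning sub-score, or roughly 20% and 31% in relative terms. The relative figures look large mostly because the starting scores are low.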

Benchmarks, and who picked them

On LongVideoBench, Kuaishou reports 74.1, and claims it beats Qwen3-VL-235B-A22B, a model with nearly eight times the total parameters. If that comparison is fair, beating a 235B model at 30B scale on long video is the headline. The usual caveat applies, though, because these are the vendor's own selected highlights, and the model card says as much, pointing you to the technical report for the full table. So we're looking at the numbers Kuaishou wanted us to look at.
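The "nearly eight times" figure is easy to check, and it is worth running the same comparison on active parameters, since that is closer to what each token actually costs. The active counts below are read from the "A22B" suffix in the Qwen model name and from the "roughly 3B active" figure for Keye-VL 2.0, so treat them as approximate.

```python
# Parameter counts in billions; active counts are approximate (see above).
keye = {"total": 30, "active": 3}
qwen = {"total": 235, "active": 22}

total_ratio = qwen["total"] / keye["total"]
active_ratio = qwen["active"] / keye["active"]

print(f"total parameters: {total_ratio:.2f}x")
print(f"active parameters: {active_ratio:.2f}x")
```

Both ratios land between seven and eight, so the size gap holds up whether you count stored weights or per-token compute.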

Their temporal grounding figures are framed against Gemini 3 Flash, which is a more honest comparison than benchmarking only against your own last generation. On the TimeLens suite it's a mixed picture: behind Gemini on Charades (58.4 versus 61.19 mIoU), ahead on ActivityNet (58.5 versus 56.95), and well clear on QVHighlights (70.1 versus 49.45). That QVHighlights gap is large enough to make me want to know what the benchmark is actually measuring before reading too much into it.
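For readers unfamiliar with the metric, mIoU in temporal grounding is usually the mean intersection-over-union between a predicted time span and the annotated one. The sketch below shows that standard definition along with the gaps between the reported scores. It is an illustration only: the exact evaluation protocol of the TimeLens suite may differ, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time spans in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Mean IoU over paired predictions and annotations, as a percentage."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(ious) / len(ious)

# Reported mIoU: (Keye-VL 2.0, Gemini 3 Flash)
reported = {
    "Charades": (58.4, 61.19),
    "ActivityNet": (58.5, 56.95),
    "QVHighlights": (70.1, 49.45),
}
deltas = {name: round(keye - gemini, 2) for name, (keye, gemini) in reported.items()}
print(deltas)
```

A prediction of 0 to 10 seconds against an annotation of 5 to 15 seconds overlaps for 5 seconds out of a 15-second union, so it scores one third. On the reported numbers the gaps are about -2.8, +1.6, and +20.7 points, which is why the QVHighlights result stands out from the other two.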

What about the cost claim?

The original Russian-language writeup that pointed me here cited a roughly 50% reduction in prefill cost. I couldn't find that specific figure on the model card, which talks about reduced long-sequence prefill cost and higher training throughput without putting a number on it. For reference, third-party analysis of DeepSeek's own DSA work described it as cutting per-token GPU cost by as much as half in long-sequence settings, so a 50% prefill reduction is in a believable range. But I'm not going to repeat it as Kuaishou's number when Kuaishou didn't publish it.
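Cost claims get quoted both as percentage reductions and as speedup factors, and the two are easy to mix up. The conversion is simple arithmetic, sketched here with hypothetical helper names. It says nothing about what Keye-VL 2.0 actually achieves.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional cost reduction (0.5 = 50% cheaper) to a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1 / (1 - reduction)

def reduction_from_speedup(speedup):
    """Convert a speedup factor (2.0 = twice as cheap) to a fractional cost reduction."""
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    return 1 - 1 / speedup

print(speedup_from_reduction(0.5))   # a 50% reduction as a speedup factor
print(reduction_from_speedup(2.0))   # a 2x cost cut as a fractional reduction
```

A 50% reduction and a 2x cost cut are the same claim stated two ways. The relationship is not linear, though: a 75% reduction corresponds to 4x, not 3x.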

And the architecture isn't pure DSA magic. The efficiency comes bundled with custom kernels, heterogeneous ViT-LM parallelism, and something they call ExtraIO. If you're hoping to reproduce the prefill savings in a vanilla setup without their kernel stack, temper expectations.

The agent angle

Kuaishou is also calling this the first Keye base model with built-in agent behavior, covering code, tool use, and search. That's a lot of capability to claim for a 30B model alongside the long-video story, and the model card offers benchmark dimensions but not much I can independently check yet. File it under "promising, unverified."

The serving path is real, at least. There's a custom GitHub repository, an SGLang branch, a prebuilt Docker image, and a minimal two-GPU H800 launch config. So this isn't a weights-dump-and-disappear release. Someone clearly wants people to run it.

So is it real?

The VideoMME V2 scaling curve is the one result that would change my mind about long-video models if it replicates outside Kuaishou's lab. Everything else is the familiar pattern of a capable Chinese open-weight model arriving with carefully chosen comparisons. The weights are public, the license is permissive, and the serving code exists, which means the independent benchmarks will come soon enough. Watch for third-party LongVideoBench runs over the next few weeks. That's where the 74.1 either holds or doesn't.