Alibaba's Qwen team has open-sourced FlashQLA, a library of GPU kernels purpose-built for the Gated DeltaNet attention layers powering Qwen3-Next. The project repository, distributed under the MIT license, leans on the TileLang DSL and targets NVIDIA's Hopper architecture exclusively.
The bottleneck nobody wanted to talk about
Linear attention has a marketing story and an engineering problem. The marketing story is that it works. The engineering problem is that "works" comes with caveats most papers gloss over.
When Qwen3-Next started leaning hard on Gated DeltaNet, a linear attention variant introduced in a December 2024 research paper from Songlin Yang, Jan Kautz, and Ali Hatamizadeh, the team ran into the kind of trouble that does not show up in headline FLOPs charts. The architecture interleaves three GDN layers for every full attention layer, meaning roughly 75% of the model's attention compute happens in linear blocks. Fine. But when contexts stretch to 256K tokens and active parameters climb into the dozens of billions, those blocks turn memory-bound in a way that's actively hostile to tensor parallelism.
Two problems, really. The kernels keep shuffling K, V, and intermediate states between HBM and on-chip memory. And the recurrent nature of the state means small batches or sharded inference setups leave compute units sitting around doing nothing.
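To see the shape of the problem, here's a toy numpy sketch of one simplified parameterization of the gated delta rule (the function names and the exact gating are mine, not FlashQLA's API; Qwen's layers add normalization and per-head details this omits). Every token reads, modifies, and carries forward a (d_k, d_v) state matrix, strictly in order.

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One token of a simplified gated delta rule (illustrative only; the
    production layers differ in gating details and normalization).

    S      : (d_k, d_v) recurrent state carried across the whole sequence
    k, q   : (d_k,) key / query for this token
    v      : (d_v,) value for this token
    alpha  : scalar decay gate in (0, 1) -- the exponential forgetting
    beta   : scalar write strength for the delta-rule update
    """
    # Decay the old state, erase the stale association for k, write the new one.
    S = alpha * (S - beta * np.outer(k, S.T @ k)) + beta * np.outer(k, v)
    return S, S.T @ q                     # new state, per-token output

def forward_serial(Q, K, V, alphas, betas):
    """Naive scan: step t needs the state from step t-1, so the whole
    (d_k, d_v) state stays resident and gets rewritten token by token."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S, out[t] = gated_delta_step(S, K[t], V[t], Q[t], alphas[t], betas[t])
    return out, S
```

That serial carry is what drags K, V, and the state back and forth through the memory hierarchy, and it's why small batches or sharded setups leave so much silicon idle.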
So what do you do? One option: write a single monolithic kernel and hope. Qwen took a different route.
A compromise, not a monolith
FlashQLA splits the forward pass into two fused kernels with a preprocessing step wedged between them, enabling automatic context parallelism inside a single card. Within each streaming multiprocessor, the team pulls a trick that's become increasingly common in modern Hopper kernels: dedicate some warps to data movement while others ping-pong between Tensor Cores and CUDA Cores, so the matmul units never have to wait for memory.
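A rough way to picture the split, in numpy rather than TileLang (the function names, chunk size, and the dropped delta-rule correction are simplifications of mine, not the actual kernel interfaces): phase one does per-chunk work that needs no neighbors, a cheap boundary scan stitches the chunks together, and phase two finishes each chunk's outputs independently.

```python
import numpy as np

CHUNK = 64   # illustrative chunk length; the real tiling is a Hopper-specific detail

def phase1_local(K, V, a):
    """Kernel 1 (sketch): per-chunk work with no cross-chunk dependency,
    so every chunk can be processed in parallel on the same card."""
    T, d_k = K.shape
    d_v = V.shape[1]
    n = T // CHUNK
    S_local = np.zeros((n, d_k, d_v))   # state each chunk contributes on its own
    A = np.ones(n)                      # total decay accumulated across each chunk
    for c in range(n):
        S = np.zeros((d_k, d_v))
        for t in range(c * CHUNK, (c + 1) * CHUNK):
            S = a[t] * S + np.outer(K[t], V[t])
            A[c] *= a[t]
        S_local[c] = S
    return S_local, A

def preprocess_scan(S_local, A):
    """Preprocessing (sketch): a short serial scan over chunk boundaries,
    n_chunks steps instead of T token steps."""
    S_in = np.zeros_like(S_local)       # state entering each chunk
    S = np.zeros_like(S_local[0])
    for c in range(S_local.shape[0]):
        S_in[c] = S
        S = A[c] * S + S_local[c]
    return S_in

def phase2_output(Q, K, V, a, S_in):
    """Kernel 2 (sketch): with its incoming state known, each chunk finishes
    its own outputs independently of every other chunk."""
    T, d_v = V.shape
    out = np.empty((T, d_v))
    for c in range(T // CHUNK):
        S = S_in[c].copy()
        for t in range(c * CHUNK, (c + 1) * CHUNK):
            S = a[t] * S + np.outer(K[t], V[t])
            out[t] = S.T @ Q[t]         # real kernels batch this into chunk-level matmuls
    return out
```

The only serial piece left is the boundary scan, and it walks chunk summaries rather than tokens, which is what makes the decomposition cheap enough to buy parallelism inside a single GPU.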
That's the structural answer. The interesting part is what they cut out.
Skipping the warm-up
Here's the bit worth paying attention to. GDN's gating mechanism applies exponential decay to the recurrent state. Old tokens fade fast. According to the project, on roughly 60 to 80% of attention heads, the influence of distant tokens drops off quickly enough that you don't actually need to compute the recurrent state from the start of the sequence. FlashQLA does what the team describes as a "light warm-up" across six to eight chunks and gets a practically accurate state for the current block.
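To make the intuition concrete, here's a toy numpy experiment with a simplified scalar-decay recurrence and made-up gate statistics (the gate range, sequence length, and chunk size below are illustrative, not the project's numbers): when the gates sit comfortably below 1, a zero-initialized state run over just the last few chunks converges on the state a full scan would have produced.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k, d_v, CHUNK = 4096, 64, 64, 64

K = rng.standard_normal((T, d_k)) / np.sqrt(d_k)
V = rng.standard_normal((T, d_v))
a = rng.uniform(0.90, 0.99, size=T)       # made-up gates for a fast-forgetting head

def scan_state(start, end):
    """Simplified gated recurrence over tokens [start, end), zero-initialized."""
    S = np.zeros((d_k, d_v))
    for t in range(start, end):
        S = a[t] * S + np.outer(K[t], V[t])
    return S

boundary = T - CHUNK                      # the state the current block actually needs
exact = scan_state(0, boundary)           # full scan from the start of the sequence

for warmup_chunks in (2, 4, 6, 8):
    approx = scan_state(boundary - warmup_chunks * CHUNK, boundary)
    rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print(f"{warmup_chunks:>2} warm-up chunks: relative state error {rel_err:.1e}")
```

Heads whose gates hover near 1 don't forget fast enough for this to work, which is presumably why the shortcut covers a majority of heads rather than all of them.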
Is that good enough? I'd want to see ablations. The project's claim is that the approximation error stays below what matters for downstream loss, but they're picking six to eight chunks the same way most kernel work picks chunk sizes: by looking at what worked. And "what worked" depends entirely on the head decay distribution in the model you trained. Plug FlashQLA into a different GDN-based model with different gate statistics and the warm-up window probably needs retuning.
FlashQLA is tuned for this family of decay profiles. Generic linear attention kernel work happens elsewhere.
Hopper or bust
The library targets SM90 exclusively. If your training cluster is on Ampere A100s, FlashQLA isn't an option. Read the code, save your H200 budget, move on.
The other constraint matters more for adoption: the kernels are tightly coupled to GDN-specific algebra. You cannot drop FlashQLA into Kimi Linear's KDA mechanism, or into Mamba2, or into any other linear attention variant. It's a point solution for one architecture family, not a replacement for the broader Flash Linear Attention library.
Not a criticism, exactly. That's the cost of optimizing this aggressively against one architecture's assumptions.
How fast?
Qwen's reported numbers are 2 to 3x forward speedups and roughly 2x backward speedups against FLA's Triton kernels on Hopper, with larger gaps against FlashInfer. Pretraining and agent inference benefit most, since both push long sequences through the linear blocks, which is where the warm-up shortcut pays off.
I haven't seen independent benchmarks yet. Internal numbers from a model-training company on kernels they wrote for their own architecture should be treated like internal numbers always are.
But the engineering story holds up regardless of the exact speedup ratio. The optimizations follow from the mathematical properties of the architecture, which is rare. Most kernel work is fighting the hardware. This one exploits a property of the gating function.
What happens next
The Flash Linear Attention project added TileLang backend support for GDN, KDA, and parallel attention kernels in April 2026, and FlashQLA's design choices will likely filter into that broader codebase. vLLM and SGLang already automatically dispatch to specialized GDN kernels when serving Qwen3-Next variants. Whether they pick up FlashQLA specifically, or whether the warm-up shortcut gets generalized to KDA and other gated linear attention variants, is the next thing to watch.
For H100 owners, FlashQLA delivers a meaningful speed bump on a model family that's becoming hard to avoid in long-context work. Everyone else waits.




