The transformer block isn't going anywhere, but everything wrapped around it keeps mutating. That's the takeaway from Sebastian Raschka's latest survey of recent open-weight model releases, where Gemma 4, Poolside's Laguna XS.2, Zyphra's ZAYA1-8B, and DeepSeek V4 all push in the same direction: making long-context inference cheaper without shrinking total parameters.
The real bottleneck
Reasoning models and agent workflows keep more tokens in play for longer, and that exposes the same constraints every time: KV-cache size, memory traffic, and attention FLOPs. These are now what architecture teams optimize for, not raw capability gains.
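For a sense of scale: the KV cache of a plain dense decoder grows linearly in layers, KV heads, head dimension, and context length. Here is a back-of-the-envelope calculation with a generic formula and illustrative numbers, not drawn from any of the models above:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Back-of-the-envelope KV-cache size for a dense decoder-only transformer.

    The factor of 2 covers keys and values; dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative: a 32-layer model with 8 KV heads of dim 128 at 128K context
print(kv_cache_bytes(32, 8, 128, 128_000) / 1e9)  # ~16.8 GB per sequence
```

At those sizes the cache, not the weights, is what dictates how many long-context requests fit on a GPU.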
Gemma 4 takes the simplest route. Its small E2B and E4B variants reuse KV tensors across layers, a trick formally introduced in a 2024 NeurIPS paper on cross-layer attention but only now showing up in a flagship release. The E2B has 35 layers; only the first 15 compute their own KV projections, and the rest share. That cuts roughly 2.7GB of KV cache at 128K context. Sharing KV across layers is a lossy approximation of giving every layer its own attention state, but the cross-layer attention research suggests the quality hit is small at these model sizes. Google didn't publish ablation studies, so take that on faith.
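Google hasn't published the exact wiring, so the following is only a minimal sketch of the cross-layer KV-sharing pattern described above, with hypothetical `kv_proj` and `attend` methods standing in for whatever the real block interface looks like:

```python
def forward_with_shared_kv(x, layers, n_kv_layers=15):
    """Sketch of cross-layer KV sharing: only the first n_kv_layers compute
    their own K/V projections; the remaining layers reuse the most recently
    computed KV tensors instead of adding new entries to the cache.

    `layers` is a list of hypothetical blocks exposing .kv_proj() and .attend();
    this is an illustration of the idea, not Gemma's implementation.
    """
    shared_k, shared_v = None, None
    for i, layer in enumerate(layers):
        if i < n_kv_layers:
            shared_k, shared_v = layer.kv_proj(x)   # fresh KV, stored in the cache
        # layers >= n_kv_layers skip the KV projection and reuse shared_k/shared_v
        x = layer.attend(x, shared_k, shared_v)
    return x
```

Only the first 15 layers' KV tensors ever hit the cache, which is where the memory saving comes from.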
Attention in compressed space
ZAYA1-8B is the most architecturally interesting of the bunch. Zyphra's model, trained on AMD GPUs (already an outlier), introduces Compressed Convolutional Attention, detailed in their CCA paper from October 2025. Where DeepSeek's MLA compresses what you store but expands it back for the attention computation, CCA performs attention directly in the compressed latent space. Convolutions on the compressed Q and K tensors give back some of the local context the compression strips away.
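In code terms, the idea looks roughly like the sketch below. This is not Zyphra's implementation: the shapes, layer names, and single down-projection are simplifications, and the causal padding the convolutions would need is omitted.

```python
import torch
import torch.nn.functional as F

def cca_sketch(x, w_down, conv_q, conv_k):
    """Rough sketch of the CCA idea: project into a smaller latent space,
    run short convolutions over the compressed Q/K to recover some local
    context, and perform attention entirely in that compressed space.

    x: (batch, seq, d_model); w_down: (d_model, d_latent)
    conv_q / conv_k: depthwise nn.Conv1d modules over the sequence dimension.
    """
    z = x @ w_down                                   # compress: (B, T, d_latent)
    q = conv_q(z.transpose(1, 2)).transpose(1, 2)    # local mixing on compressed Q
    k = conv_k(z.transpose(1, 2)).transpose(1, 2)    # local mixing on compressed K
    v = z                                            # values stay in the latent space
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out                                       # still d_latent; caller projects back up
```

The contrast with MLA is the missing decompression step: nothing here is expanded back to `d_model` before the attention matmul.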
The Zyphra writeup claims CCA outperforms MLA under comparable compression, but the only evidence so far is Zyphra's own experiments. Independent replication is still pending, and how any attention-compression scheme affects modeling quality tends to depend on training scale.
DeepSeek V4 goes harder
DeepSeek V4-Pro is the headline release of the year and also the most extreme on efficiency. At 1M-token context, the paper reports it uses 27% of the per-token inference FLOPs and 10% of the KV cache size of DeepSeek V3.2. The smaller V4-Flash variant drops further: 10% of the FLOPs, 7% of the cache.
How? Two changes. The model alternates Compressed Sparse Attention and Heavily Compressed Attention layers, which compress along the sequence dimension rather than per-token. CSA keeps more detail but uses sparse top-k selection. HCA bundles every 128 tokens into one compressed entry and runs dense attention over the much shorter cache. The two are complementary, which is why they're interleaved rather than picked as a single winner.
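DeepSeek hasn't released reference code for these layers. As a toy illustration of just the HCA half, the sketch below uses mean pooling as a stand-in for whatever learned compression the real layer applies; causal masking and the CSA top-k path are left out.

```python
import torch
import torch.nn.functional as F

def hca_sketch(q, k, v, block=128):
    """Toy sketch of the HCA idea: collapse every `block` cached tokens into a
    single entry, then run dense attention over the much shorter cache.

    q, k, v: (batch, heads, seq, head_dim). Mean pooling is a placeholder for
    the learned compression; the last block is zero-padded for simplicity.
    """
    B, H, T, D = k.shape
    pad = (-T) % block                                 # pad so seq divides evenly
    k = F.pad(k, (0, 0, 0, pad)).reshape(B, H, -1, block, D).mean(dim=3)
    v = F.pad(v, (0, 0, 0, pad)).reshape(B, H, -1, block, D).mean(dim=3)
    # cache is now T/block entries long; attention cost shrinks accordingly
    return F.scaled_dot_product_attention(q, k, v)
```

With 128-to-1 bundling, a 1M-token cache behaves like an ~8K-entry one inside these layers, which is where the headline FLOP and cache numbers come from.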
The second change is the residual stream itself. DeepSeek V4 widens it using manifold-constrained hyper-connections, a technique from a recent paper. Instead of one residual stream, each block carries four parallel streams with constrained mixing matrices, adding capacity without making the attention or MoE layers wider. The paper reports 6.7% additional training time at four streams, a number that assumes their fused, optimized implementation, so practical mileage will vary.
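The constraint that makes the mixing "manifold-constrained" is the paper's core contribution and isn't reproduced here; the sketch below only shows the bare multi-stream residual pattern, with an arbitrary mixing matrix as a placeholder.

```python
import torch

def hyper_connection_sketch(streams, block, mix):
    """Bare-bones sketch of a widened residual stream: the block reads a
    combination of the parallel streams and its output is added back onto
    all of them.

    streams: (n_streams, batch, seq, d_model)
    mix: (n_streams, n_streams) mixing matrix; the real method constrains
    this matrix, an unconstrained one is used here for brevity.
    block: an ordinary attention or MoE block taking (batch, seq, d_model).
    """
    mixed = torch.einsum('ij,jbtd->ibtd', mix, streams)   # cross-stream mixing
    block_in = mixed.sum(dim=0)                           # collapse to one stream for the block
    block_out = block(block_in)
    return mixed + block_out.unsqueeze(0)                 # broadcast the output back onto all streams
```

The extra cost is the mixing itself, which is why the reported overhead stays in single-digit percentages: the expensive attention and MoE layers still see a single `d_model`-wide input.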
What's missing
None of these releases ship with the kind of clean ablation studies that would settle whether each tweak earns its complexity. DeepSeek V4's quality numbers are reported for the full recipe: new attention, mHC, Muon optimizer, training data, system changes. Pulling apart which piece does what is left to whoever wants to retrain things.
What's clear is that nobody is trying to replace the decoder-only transformer. The basic GPT-style stack from 2019 is still the spine. Raschka estimates the new tweaks make the block roughly 10x as complex in code as a basic transformer block, which is the cost of squeezing long contexts onto current hardware.
DeepSeek V4 is already out. Gemma 4 shipped at the start of April. The next thing to watch is whether any of these compression schemes show up in proprietary releases, where architecture details usually leak out more slowly.