The Allen Institute for AI released OLMo Hybrid 7B, a language model that swaps 75% of standard attention layers for Gated DeltaNet (GDN) recurrent layers. The architecture, detailed in the team's technical paper, alternates three GDN layers for every one full attention layer. On MMLU, it reaches the same accuracy as OLMo 3 7B with roughly half the training tokens, per Ai2's own benchmarks.
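The 3:1 alternation can be pictured as a repeating four-layer block. Below is a minimal sketch of such a layout; the function name and the exact position of the attention layer within each block are illustrative assumptions, not Ai2's published layer placement.

```python
def layer_schedule(n_layers: int, gdn_per_attn: int = 3) -> list[str]:
    """Hypothetical 3:1 GDN-to-attention layer layout.

    Assumes a repeating block of `gdn_per_attn` Gated DeltaNet layers
    followed by one full attention layer; the actual OLMo Hybrid
    ordering within each block may differ.
    """
    period = gdn_per_attn + 1
    return ["attn" if (i + 1) % period == 0 else "gdn" for i in range(n_layers)]

schedule = layer_schedule(8)
# -> ['gdn', 'gdn', 'gdn', 'attn', 'gdn', 'gdn', 'gdn', 'attn']
```

With this schedule, 75% of layers are recurrent, matching the ratio reported for the model.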
The theoretical argument is interesting: GDN layers handle state tracking (sequentially dependent reasoning that pure transformers struggle with), while attention layers handle precise recall from context. Neither primitive covers both on its own. Ai2's scaling experiments from 60M to 1B parameters found that the 3:1 GDN-to-attention ratio beat the alternatives tested, including pure GDN, pure transformer, and Mamba2 hybrids. The blog post frames the gains as architectural, not a training artifact.
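The state-tracking primitive here is a fast-weight recurrence. The sketch below shows one step of a gated delta-rule update as described in the GDN literature; it is an assumption-laden illustration (scalar gates, no projections, no chunked kernels), not OLMo Hybrid's actual layer implementation.

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One step of a gated delta-rule recurrence (illustrative only).

    S:     (d_v, d_k) fast-weight state matrix
    k, q:  (d_k,) key / query vectors, assumed L2-normalized
    v:     (d_v,) value vector
    alpha: scalar decay gate in (0, 1], forgets old state
    beta:  scalar write strength in (0, 1]

    Update rule:  S_t = alpha * S_{t-1} @ (I - beta * k k^T) + beta * v k^T
    The delta term erases what the state previously stored under key k
    before writing the new value, which is what enables sequential
    state tracking across tokens.
    """
    d_k = k.shape[0]
    S = alpha * S @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)
    o = S @ q  # read the state out against the query
    return S, o
```

With `alpha = beta = 1` and a fresh state, reading back with the same key recovers the stored value exactly, which makes the erase-then-write behavior easy to verify.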
Where things get sharp: long-context performance. After mid-training and context extension, OLMo Hybrid scored 85.0 on RULER at 64k tokens versus 70.9 for OLMo 3 7B. That 14-point gap is large for a controlled comparison where the only variable is architecture. The model was pretrained on 6 trillion tokens across 512 GPUs, starting on NVIDIA H100s before migrating to B200s on Lambda.
A caveat from the team's own analysis: post-training results were mixed. The pretraining gains didn't fully translate through instruction tuning and RL stages, with some losses on extended reasoning tasks. Software tooling for GDN models remains rough, and inference optimizations that would unlock the architecture's memory savings are months away. Weights, code, and data are on Hugging Face and GitHub under Apache 2.0.
Bottom Line
OLMo Hybrid matches OLMo 3 7B on MMLU with 49% fewer training tokens and scores 85.0 vs. 70.9 on 64k-context RULER, but post-training gains remain inconsistent.
Quick Facts
- 7B parameters, 6T training tokens
- 3:1 ratio of Gated DeltaNet to attention layers (75% RNN)
- 49% fewer tokens to match OLMo 3 on MMLU (Ai2-reported)
- 85.0 on RULER 64k vs. 70.9 for OLMo 3 7B
- Trained on 512 GPUs (H100s then B200s), Apache 2.0 license