
DeepSeek Paper Stabilizes 27B Model Training With New Residual Connection Math

DeepSeek's mHC paper fixes the training instability that breaks wider residual connections in large transformer models.

Oliver Senti, Senior AI Editor
May 12, 2026 · 3 min read
[Image: abstract visualization of parallel neural network signal pathways converging through a geometric constraint]

DeepSeek published a paper in late December describing a fix for the training instability that shows up when researchers widen the residual connections inside transformer models. The technique, manifold-constrained hyper-connections (mHC), allowed the team to stabilize a 27-billion-parameter run that had previously diverged.

What the fix does

Residual connections have been load-bearing infrastructure inside neural networks since ResNet introduced them. The idea is plain: let signal skip past layers so gradients don't vanish on the way through. Hyper-Connections, an extension proposed in 2024, widened that single shortcut into multiple parallel streams with learnable mixing matrices. More expressive in theory. In practice, the mixing compounds multiplicatively across layers, and deep stacks blow up.
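To make the widening concrete, here is a minimal sketch of the hyper-connection idea in PyTorch. The class name, stream count, and mean-fusion step are illustrative assumptions, not the paper's exact formulation; a stand-in linear layer plays the role of the attention or MLP sublayer.

```python
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    """Sketch of a hyper-connection: the single residual stream is widened
    into n parallel streams, mixed by a learnable n x n matrix. Names and
    shapes are illustrative, not the paper's design."""

    def __init__(self, dim: int, n_streams: int = 4):
        super().__init__()
        self.layer = nn.Linear(dim, dim)  # stand-in for an attention/MLP sublayer
        # Learnable cross-stream mixing, initialized near identity so the
        # block starts out behaving like a plain residual connection.
        self.mix = nn.Parameter(
            torch.eye(n_streams) + 0.01 * torch.randn(n_streams, n_streams)
        )

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, dim)
        mixed = torch.einsum("ij,jbd->ibd", self.mix, streams)  # mix across streams
        update = self.layer(mixed.mean(dim=0))                  # sublayer sees a fused view
        return mixed + update.unsqueeze(0)                      # residual update on every stream
```

The learnable mix matrix is the part that misbehaves: nothing in this formulation stops its entries, and therefore the gain compounded across dozens of layers, from growing without bound.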

DeepSeek's technical paper reports that on an unconstrained 27B run, the composite signal gain hit roughly 3,000. That is the polite way to say the model exploded. mHC fixes this by forcing the residual mixing matrices onto the Birkhoff polytope, the geometric space of doubly stochastic matrices: nonnegative entries, every row and every column summing to 1. The projection runs through the Sinkhorn-Knopp algorithm, which dates to 1967 and was not designed with transformers in mind.
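Sinkhorn-Knopp itself is short enough to sketch. The version below, with hypothetical names and an iteration count picked for illustration, alternately normalizes rows and columns of a positive matrix until both are close to 1; the paper's fused CUDA kernels will differ in the details.

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Push a square matrix toward the Birkhoff polytope by alternately
    normalizing rows and columns (Sinkhorn & Knopp, 1967)."""
    m = torch.exp(logits)  # strictly positive entries, as the algorithm requires
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # columns sum to 1
    return m
```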

With the constraint applied, signal gain drops to about 1.6. No loss spikes. Identity mapping is preserved. Whether you find this elegant or convoluted probably depends on how much linear algebra you've stared at this year.
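Why the constraint tames the gain: a doubly stochastic matrix is a convex combination of permutation matrices, so its spectral norm is exactly 1, and products of such matrices cannot amplify the signal. A rough numerical illustration, reusing the sinkhorn_knopp sketch above and standing in for, not reproducing, the paper's gain metric:

```python
import torch

torch.manual_seed(0)
depth, n = 64, 4

def composed_gain(make_matrix) -> float:
    # Spectral norm of the mixing composed across `depth` layers.
    prod = torch.eye(n)
    for _ in range(depth):
        prod = make_matrix() @ prod
    return torch.linalg.matrix_norm(prod, ord=2).item()

unconstrained = lambda: torch.eye(n) + 0.2 * torch.randn(n, n)
constrained = lambda: sinkhorn_knopp(0.2 * torch.randn(n, n))

print(f"unconstrained: {composed_gain(unconstrained):.1f}")  # typically explodes with depth
print(f"constrained:   {composed_gain(constrained):.3f}")    # pinned near 1
```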

About those benchmarks

DeepSeek reports mHC outperforming both the baseline and standard hyper-connections across eight downstream tasks at 27B parameters. Compared to HC, the gains are 2.1 points on BBH and 2.3 points on DROP. The team also flags improvements on GSM8K and MMLU. Custom CUDA kernels, recomputation tricks, and pipeline tweaks keep the overhead to 6.7% of training time.

Treat those numbers with some skepticism. They come from the same team that built the architecture, on a single training setup, against a baseline they control. A 2.1-point gain on BBH is real but small. The bigger claim is that HC simply could not be trained stably at 27B at all, while mHC can. That is harder to argue with, since the alternative is a divergence chart.

So what now

CEO Liang Wenfeng is listed among the 20 authors, which usually signals that something will land in the next flagship model. DeepSeek's last few releases have leaned on hardware-software co-design, partly because their H800 GPUs ship with throttled interconnect bandwidth. mHC continues that pattern: a fix that targets the systems layer as much as the math.

External interest moved quickly. A PyTorch implementation appeared as a drop-in variant of hyper-connections, though without the paper's CUDA-level optimizations. A separate follow-up, mHC-lite, has already pushed back, arguing that finite Sinkhorn iterations leave an approximation gap that accumulates with depth. The fight over how to do this efficiently is already underway.
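mHC-lite's objection is checkable in the same toy setting: with too few Sinkhorn iterations, each mixing matrix is only approximately doubly stochastic, and the leftover error lets the composed gain drift above 1. A hypothetical probe, not mHC-lite's actual analysis, again reusing the sinkhorn_knopp sketch above:

```python
import torch

torch.manual_seed(0)
depth, n = 64, 4

for iters in (1, 3, 10, 30):
    prod = torch.eye(n)
    for _ in range(depth):
        prod = sinkhorn_knopp(0.5 * torch.randn(n, n), n_iters=iters) @ prod
    gain = torch.linalg.matrix_norm(prod, ord=2).item()
    print(f"{iters:2d} iterations -> composed gain {gain:.3f}")  # gap shrinks as iterations rise
```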

The next concrete data point arrives whenever DeepSeek ships its next model. If mHC is inside it, the 27B benchmarks become the floor.

Tags: DeepSeek, mHC, transformer architecture, LLM training, residual connections, hyper-connections, AI research, neural networks, Liang Wenfeng
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
