DeepSeek dropped a 19-page paper on arXiv on December 31st. Manifold-Constrained Hyper-Connections. Not the catchiest name. But CEO Liang Wenfeng uploaded it himself, and that's a tell. He did the same thing for R1 and V3. Other researchers handle the less important papers.
So what did they actually build?
The problem they're trying to solve
Residual connections have been the backbone of deep learning since Kaiming He introduced them in 2015. The idea is simple: instead of making each layer learn the full transformation, you let the input skip ahead and add itself to the output. Input plus transformation. This keeps gradients flowing through very deep networks without vanishing.
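In code, the idea is a one-liner. A minimal NumPy sketch (my illustration, not any particular framework's API):

```python
import numpy as np

def residual_block(x, f):
    """Residual connection: output = input + transformation(input)."""
    return x + f(x)

# If the layer contributes nothing, the input passes through unchanged --
# that identity path is what keeps gradients flowing in deep stacks.
x = np.ones(4)
y = residual_block(x, lambda v: np.zeros_like(v))
```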
ByteDance researchers published Hyper-Connections in September 2024. Their tweak: instead of one residual stream, create several parallel streams and let the model learn how to mix signals between them. The method showed gains in pre-training large language models, both dense and sparse (MoE).
The catch? Extra streams cost extra memory, which the original method did little to address, limiting its practicality for large-model training. And there's a worse problem. When the model can freely mix signals between streams, those mixing operations compound across layers. At 27 billion parameters, unconstrained connections can reach a gain magnitude of 3,000. The signal explodes. Training crashes.
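A stripped-down sketch of the idea (my simplification, not ByteDance's exact formulation: here a single learned n-by-n matrix mixes the streams, and one stream feeds the layer):

```python
import numpy as np

rng = np.random.default_rng(0)
n_streams, d = 4, 8                  # expansion rate 4, hidden width 8

# Hypothetical simplification: one learned mixing matrix per layer.
# (The real method also learns how the layer output is written back.)
H = rng.normal(loc=1.0 / n_streams, scale=0.1, size=(n_streams, n_streams))
streams = rng.normal(size=(n_streams, d))

streams = H @ streams                # signals flow between streams
layer_out = np.tanh(streams[0])      # one stream feeds the layer
streams[0] = streams[0] + layer_out  # residual write-back
```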
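You can watch the compounding happen in a toy simulation (illustrative numbers, not the paper's setup). Stack unconstrained mixing matrices and the composed gain grows roughly exponentially with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
n_streams, n_layers = 4, 60
gain = np.eye(n_streams)
for _ in range(n_layers):
    # each layer freely mixes the streams with an unconstrained matrix
    gain = rng.normal(size=(n_streams, n_streams)) @ gain

# the composed gain's magnitude explodes with depth
print(np.abs(gain).max())
```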
What mHC actually does
The core innovation is a constraint. Instead of letting the mixing matrices do whatever they want, mHC projects them onto a manifold of doubly stochastic matrices. In plain terms: every entry in the connection matrix is non-negative, and every row and column sums to exactly one. The mixing becomes a weighted average rather than an amplification.
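What the constraint buys, in a few lines (a toy check, not the paper's code):

```python
import numpy as np

# A doubly stochastic matrix: non-negative entries, every row and
# every column sums to exactly one.
M = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
])
assert np.allclose(M.sum(axis=0), 1.0) and np.allclose(M.sum(axis=1), 1.0)

# Mixing is then a weighted average: it can never push any value
# outside the range the streams already span.
streams = np.array([1.0, 4.0, -2.0])
mixed = M @ streams

# Note that np.eye(3) is also doubly stochastic, so "pass the input
# through untouched" remains an option for the model.
```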
The framework projects the residual connection space onto a specific manifold to restore the identity mapping property. Identity mapping is the key insight from the original ResNet work: if a layer can't improve the representation, it should at least be able to pass the input through unchanged. Unconstrained hyper-connections break this. mHC restores it.
They implement the constraint using the Sinkhorn-Knopp algorithm, which iteratively normalizes rows and columns until you get a doubly stochastic matrix. Mathematically elegant. Computationally, it adds overhead.
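Sinkhorn-Knopp itself fits in a few lines. A reference sketch (the paper's fused-kernel version is far more optimized, and the iteration count here is my choice):

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=50):
    """Alternately normalize rows and columns until the matrix is
    (approximately) doubly stochastic."""
    M = np.exp(logits)                         # strictly positive entries
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M

rng = np.random.default_rng(0)
M = sinkhorn_knopp(rng.normal(size=(4, 4)))
```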
The overhead question
Here's where it gets interesting. In its internal tests, DeepSeek determined that mHC incurs a hardware overhead of only 6.27%. That's with an expansion rate of 4 (four parallel streams instead of one).
How'd they get it that low? They wrote three new kernels that use mixed-precision strategies to preserve numerical accuracy without sacrificing speed, and that fuse operations sharing memory access into unified compute kernels. They also throw away intermediate activations after the forward pass and recompute them during backprop. Classic memory-compute tradeoff.
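The recompute trick, stripped of kernels (a NumPy illustration of the principle, not DeepSeek's implementation): the forward pass saves only the cheap-to-store input, and the backward pass rebuilds the activation before using it.

```python
import numpy as np

def forward(x, w):
    """Save only the input; the tanh activation is discarded."""
    return np.tanh(x @ w), x

def backward(grad_out, saved_x, w):
    """Recompute the activation, then backpropagate through it."""
    h = np.tanh(saved_x @ w)            # recomputed, not stored
    grad_pre = grad_out * (1.0 - h**2)  # tanh'(z) = 1 - tanh(z)^2
    return grad_pre @ w.T               # gradient w.r.t. x

rng = np.random.default_rng(0)
x, w = rng.normal(size=(2, 3)), rng.normal(size=(3, 3))
out, saved = forward(x, w)
grad_x = backward(np.ones_like(out), saved, w)
```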
But 6.27% is measured against their heavily optimized internal infrastructure. Good luck replicating that in vanilla PyTorch. The paper doesn't include reference code yet, though DeepSeek has historically been good about releasing implementations eventually.
Does it actually work?
They tested on 3B, 9B, and 27B parameter models. The results at 27B:
mHC avoids the instability that sinks unconstrained HC and ends with a final loss 0.021 below the baseline, with gradient norms staying as stable as the baseline's.
Not a revolution. But the stability is the point. Standard Hyper-Connections at this scale hit loss surges around the 12k step mark. mHC didn't. Where unconstrained HC signals can grow exponentially, mHC keeps them around 1.6x.
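That boundedness isn't luck; it's algebra. A product of doubly stochastic matrices is itself doubly stochastic, so no entry of the composed gain can exceed 1 no matter how deep the stack. A toy check (my construction, not the paper's measurement):

```python
import numpy as np

def project_doubly_stochastic(M, iters=30):
    """Sinkhorn-style projection: alternate row/column normalization."""
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
    return M

rng = np.random.default_rng(0)
n_streams, n_layers = 4, 60
gain = np.eye(n_streams)
for _ in range(n_layers):
    mix = project_doubly_stochastic(
        np.abs(rng.normal(size=(n_streams, n_streams))) + 1e-3
    )
    gain = mix @ gain  # stays doubly stochastic, entries bounded by 1
```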
They also report improvements on reasoning benchmarks like BBH and DROP, a couple of points over HC. That's not why you'd use this. You'd use it because your training runs stop crashing.
The thing they didn't test
DeepSeek-V3 is 671 billion parameters. The largest model in this paper is 27 billion. That's a 25x gap. Does mHC scale to frontier models?
They don't say. Convenient.
The paper talks about "superior scalability" but only demonstrates it within the range they tested. Maybe it works at 671B. Maybe it doesn't. If I had to guess, they've tested it internally and wouldn't be publishing if the results were bad. But guessing isn't data.
Why this matters
Industry expectations are high that DeepSeek will release its next major model in the run-up to the Spring Festival holiday in mid-February. The rumored model is being called R2. DeepSeek's papers tend to telegraph what's coming: they published the R1 research right before releasing R1.
So if mHC is in this paper, there's a reasonable chance it's in the next model. Or at least being tested for one.
The broader pattern: DeepSeek keeps finding ways to squeeze more out of less compute. DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training, dramatically less than competitors. US export restrictions mean Chinese labs can't just buy more Nvidia chips. They have to be clever instead.
mHC isn't clever in a flashy way. It's clever in the way that matters: it solves a real engineering problem (training instability) without breaking other things (efficiency). That's harder than it sounds.
What to watch
The Spring Festival deadline is mid-February. If DeepSeek announces something by then, check the architecture details for mentions of manifold constraints or modified residual connections.
There's also a community implementation of the original ByteDance Hyper-Connections on GitHub. Someone will probably add mHC support eventually. That's when the rest of us get to actually test whether this works as advertised.
For now, it's a paper. A well-engineered paper, uploaded by a CEO who doesn't waste his time on minor work. Make of that what you will.