Sakana AI and University of Tokyo researchers built a method, DiffusionBlocks, that trains a deep network one block at a time, so only a single block sits in memory rather than the whole model. The work was presented at ICLR 2026.
The memory trick
Standard training runs backpropagation through every layer at once, which means every intermediate activation has to sit in memory until the backward pass finishes. Stack more layers and the bill grows in lockstep. DiffusionBlocks chops the network into B blocks and trains each one on its own, so the memory you need drops by roughly a factor of B.
That is the entire pitch, and it is a good one. The project page frames the motivation as making large-model training reachable for people without thousands of GPUs, the kind of accessibility line every efficiency paper opens with. Whether it delivers at that scale is a separate question, and the honest answer right now is that nobody knows yet.
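The factor-of-B claim is easy to sanity-check with a back-of-envelope sketch. It assumes every layer stores the same amount of activation memory, and it ignores weights, optimizer state, and any overhead the real method adds. The function names and the 24-layer, 4-block figures are made up for illustration, not taken from the paper or the repo.

```python
import math


def end_to_end_activation_units(num_layers: int, units_per_layer: float = 1.0) -> float:
    """Activations kept for the backward pass when all layers train together."""
    return num_layers * units_per_layer


def blockwise_activation_units(
    num_layers: int, num_blocks: int, units_per_layer: float = 1.0
) -> float:
    """Activations kept when only one block is resident at a time."""
    # Uneven splits are rounded up, so the largest block sets the peak.
    layers_in_largest_block = math.ceil(num_layers / num_blocks)
    return layers_in_largest_block * units_per_layer


# Hypothetical example: 24 layers split into B = 4 blocks.
full = end_to_end_activation_units(24)
peak = blockwise_activation_units(24, 4)
print(f"end-to-end: {full}, block-wise peak: {peak}, ratio: {full / peak}")
```

When the layer count does not divide evenly, the largest block sets the peak, which is one reason the saving is "roughly" a factor of B rather than exactly B.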
Diffusion, but for training
The clever part is the reframing. The technical paper leans on an old observation that a residual connection looks like one step of a differential equation solver. Push that a little further and each block can be treated as one denoising step of a diffusion process, the same math behind image generators. Give every block a narrow job, nudging the representation slightly closer to the target, and it can learn that job without watching what its neighbors are doing.
So diffusion stops being a thing you use to make pictures and becomes a way to organize training itself. The authors describe each block's goal as "gradually approaching the target," which sounds tidy on a slide and is harder to pull off across five different architectures. They tried anyway.
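The residual-connection observation behind the reframing can be checked numerically. The sketch below is a toy, not the authors' code: `f` is a hypothetical stand-in for a block's residual branch, and the scalar state replaces a real hidden representation. It shows that a residual update is the same as one forward-Euler step with step size 1, and that stacking more, smaller steps tracks the underlying differential equation more closely.

```python
import math


def f(x: float) -> float:
    # Hypothetical residual branch; here it defines the ODE dx/dt = -0.5 * x.
    return -0.5 * x


def residual_block(x: float) -> float:
    # Standard residual update: output = input + branch(input).
    return x + f(x)


def euler_step(x: float, dt: float) -> float:
    # One forward-Euler step of dx/dt = f(x).
    return x + dt * f(x)


def stacked_blocks(x: float, total_time: float, num_steps: int) -> float:
    # A stack of residual-style steps, each covering total_time / num_steps.
    dt = total_time / num_steps
    for _ in range(num_steps):
        x = euler_step(x, dt)
    return x


x0 = 2.0
same_update = residual_block(x0) == euler_step(x0, 1.0)

exact = x0 * math.exp(-0.5)  # closed-form solution at t = 1
error_10 = abs(stacked_blocks(x0, 1.0, 10) - exact)
error_100 = abs(stacked_blocks(x0, 1.0, 100) - exact)
print(same_update, error_10 > error_100)
```

Going from this picture to diffusion-style denoising steps is the paper's contribution, and the toy does not attempt to reproduce it.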
What they actually tested
The test bed: a ViT on CIFAR-100, a DiT on ImageNet at 256×256 resolution, a masked diffusion model on text8, and an autoregressive transformer on OpenWebText. Across those, block-wise training matched or edged past end-to-end baselines while using a fraction of the memory. The GitHub repo currently ships only the ViT classification setup, and the rest you take on faith from the paper for now.
Note what is missing from that list: anything at frontier scale. These are small and mid-size benchmarks. The authors are upfront that they have only validated networks trained from scratch, and that the real prize, converting existing pretrained models, sits in future work. Until someone runs this on a model people actually deploy, the train-anything-on-any-hardware framing stays a hypothesis.
The looped-transformer bonus
One result is harder to wave off. Recurrent-depth models, also called Looped Transformers, run the same block over and over and normally train with backpropagation through time, which gets expensive fast because every iteration's activations have to be kept for the backward pass. Cast as a diffusion process, that loop collapses into a single forward pass during training, while inference keeps the original iterations.
That is the kind of concrete saving that ages better than an accessibility mission statement.
Where to look next
The code, the paper, and an OpenReview thread are all live. The thing to watch is the fine-tuning result the team flagged as next. If DiffusionBlocks can convert a real pretrained model without wrecking its quality, the memory math starts to matter outside a benchmark table. If it cannot, this stays a sharp reinterpretation with a strong looped-transformer footnote.




