Sakana AI DiffusionBlocks Cuts Neural Net Training Memory

Researchers at Sakana AI and the University of Tokyo built DiffusionBlocks, a method that trains a deep network one block at a time, so only a single block has to sit in memory rather than the whole model. The work was presented at ICLR 2026.

The memory trick

Standard training runs backpropagation through every layer at once, which means every intermediate activation has to sit in memory until the backward pass finishes. Stack more layers and the bill grows in lockstep. DiffusionBlocks chops the network into B blocks and trains each one on its own, so the memory you need drops by roughly a factor of B.
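
To put rough numbers on that factor of B, here is a back-of-the-envelope sketch. It assumes equal-sized layers, counts only stored activations (not weights, gradients, or optimizer state), and uses made-up sizes; `peak_activation_bytes` is a hypothetical helper, not something from the DiffusionBlocks code.

```python
def peak_activation_bytes(num_layers: int, per_layer_bytes: int, num_blocks: int = 1) -> int:
    """Rough peak activation memory when only one block is backpropagated at a time."""
    # Ceiling division: the largest block sets the peak.
    layers_per_block = -(-num_layers // num_blocks)
    return layers_per_block * per_layer_bytes


# Hypothetical model: 24 layers, 512 MiB of activations per layer.
PER_LAYER = 512 * 2**20

end_to_end = peak_activation_bytes(24, PER_LAYER)               # whole network at once
blockwise = peak_activation_bytes(24, PER_LAYER, num_blocks=4)  # B = 4 blocks

print(end_to_end / 2**30, blockwise / 2**30)  # 12.0 3.0 (GiB)
```

With B = 4 the toy model's activation footprint falls from 12 GiB to 3 GiB. Real savings depend on everything this estimate leaves out, which is why the article's "roughly" matters.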

That is the entire pitch, and it is a good one. The project page frames the motivation as making large-model training reachable for people without thousands of GPUs, the kind of accessibility line every efficiency paper opens with. Whether it delivers at that scale is a separate question, and the honest answer right now is that nobody knows yet.

Diffusion, but for training

The clever part is the reframing. The technical paper leans on an old observation that a residual connection looks like one step of a differential equation. Push that a little further and each block can be treated as one denoising step of a diffusion process, the same math behind image generators. Give every block a narrow job, nudge the representation slightly closer to the target, and it can learn that job without watching what its neighbors are doing.
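
The residual-as-differential-equation observation fits in two lines. The notation is an assumption chosen for illustration rather than taken from the paper: x_l is the representation entering layer l, f with parameters θ_l is that layer's residual branch, and Δt is a step size.

```latex
% Residual update of layer l
x_{l+1} = x_l + f_{\theta_l}(x_l)

% Forward Euler step for the ODE dx/dt = f(x, t)
x(t + \Delta t) \approx x(t) + \Delta t \, f\bigl(x(t), t\bigr)
```

With Δt = 1 and one layer per time step, the two updates have the same form, which is what lets a stack of residual blocks be read as a discretized trajectory.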

So diffusion stops being a thing you use to make pictures and becomes a way to organize training itself. The authors describe each block's goal as "gradually approaching the target," which sounds tidy on a slide and is harder to pull off across five different architectures. They tried anyway.
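
The idea of one narrow denoising job per block can be illustrated with a deliberately tiny toy. This is not the paper's algorithm: the data, the noise schedule `sigmas`, and the scalar denoiser in `fit_block` are all assumptions made up for the sketch. What it does show is the property the article describes, that each block is fitted from its own noise level and the clean target alone.

```python
import random

random.seed(0)

# Toy 1-D "clean" targets drawn from two clusters (hypothetical data).
clean = [random.choice([-2.0, 2.0]) + random.gauss(0.0, 0.1) for _ in range(2000)]

# Hypothetical schedule: each block owns one noise level, heavy to light.
sigmas = [2.0, 1.0, 0.5, 0.25]


def fit_block(sigma, targets):
    """Fit one scalar denoiser x_hat = a * noisy + b by ordinary least squares.

    Only this block's noise level and the clean targets are used; no other
    block's parameters or activations are needed.
    """
    noisy = [t + random.gauss(0.0, sigma) for t in targets]
    n = len(targets)
    mean_x = sum(noisy) / n
    mean_y = sum(targets) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(noisy, targets))
    var = sum((x - mean_x) ** 2 for x in noisy)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


# Blocks are fitted one at a time; the order does not matter.
blocks = [fit_block(s, clean) for s in sigmas]

for sigma, (a, b) in zip(sigmas, blocks):
    print(f"sigma={sigma}: gain={a:.3f}, bias={b:.3f}")
```

The block assigned the heaviest noise learns to shrink its input hard, while the block assigned the lightest noise learns to leave it nearly alone. Neither needs gradients from the other.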

What they actually tested

The test list: ViT on CIFAR-100, DiT on ImageNet at 256×256 resolution, a masked diffusion model on text8, and an autoregressive transformer on OpenWebText. Across those, block-wise training matched or edged past end-to-end baselines while using a fraction of the memory. The GitHub repo currently ships only the ViT classification setup, so the rest you take on faith from the paper for now.

Note what is missing from that list: anything at frontier scale. These are small and mid-size benchmarks. The authors are upfront that they have only validated networks trained from scratch, and that the real prize, converting existing pretrained models, sits in future work. Until someone runs this on a model people actually deploy, the train-anything-on-any-hardware framing stays a hypothesis.

The looped-transformer bonus

One result is harder to wave off. Recurrent-depth models, also called Looped Transformers, run the same block over and over and normally train through backpropagation through time, which gets expensive fast. Cast as a diffusion process, that loop collapses into a single forward pass during training, while inference keeps the original iterations.

That is the kind of concrete saving that ages better than an accessibility mission statement.

Where to look next

The code, the paper, and an OpenReview thread are all live. The thing to watch is the fine-tuning result the team flagged as next. If DiffusionBlocks can convert a real pretrained model without wrecking its quality, the memory math starts to matter outside a benchmark table. If it cannot, this stays a sharp reinterpretation with a strong looped-transformer footnote.

Liza Chan

AI & Emerging Tech Correspondent

