LLMs & Foundation Models

DMax Turns Diffusion Language Model Decoding Into a Self-Correction Loop

NUS researchers tackle the error accumulation problem in parallel token generation with a training-based fix.

Oliver Senti
Senior AI Editor
April 11, 2026 · 5 min read

A team at the National University of Singapore has released DMax, a new approach to decoding in diffusion language models that treats the generation process less like filling in blanks and more like editing a draft. The technical paper, posted April 9, comes with open-source code and pretrained models on Hugging Face.

The pitch is straightforward: diffusion LLMs (dLLMs) generate tokens in parallel instead of one at a time, but when you unmask too many tokens at once, errors compound. DMax tries to fix this by teaching the model to revise its own mistakes during generation, not just predict masked tokens from scratch.

Why parallel decoding keeps breaking

Conventional masked diffusion models like LLaDA work by gradually unmasking tokens over multiple steps. The problem is binary: a token is either masked or revealed. Once you commit to a prediction, it sticks. Push the model to reveal too many tokens per step and quality tanks, because early mistakes propagate through subsequent iterations with no mechanism for correction.
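The hard-commit behavior is easy to see in a toy sketch. This is not the LLaDA implementation, just a minimal illustration of binary unmasking: each step reveals the k most confident masked positions, and revealed tokens are never revisited (the `MASK` sentinel and function name are invented for the example).

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for a masked position


def hard_unmask_step(tokens, predictions, confidences, k):
    """One step of standard masked-diffusion decoding: reveal the k most
    confident masked positions. Revealed tokens are never revisited, so
    an early wrong commit propagates through every later step."""
    tokens = tokens.copy()
    masked = np.where(tokens == MASK)[0]
    if masked.size == 0:
        return tokens
    # pick the k masked positions with the highest model confidence
    chosen = masked[np.argsort(-confidences[masked])][:k]
    tokens[chosen] = predictions[chosen]  # irreversible commit
    return tokens
```

Raise k and you reveal more tokens per step, but every low-confidence commit becomes permanent context for the next iteration, which is exactly the failure mode described above.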

This is the central bottleneck for the entire dLLM field right now. Fast-dLLM from NVIDIA addressed part of the issue with confidence thresholds, only unmasking tokens the model felt sure about. Other recent work like DEMASK and dParallel have tried dependency-guided selection and certainty-forcing distillation, respectively. But these are mostly inference-time tricks layered on top of models that were never trained to handle their own errors.
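For contrast, the confidence-threshold idea can be sketched in the same toy setup: instead of a fixed budget per step, reveal only positions whose confidence clears a threshold. This is a loose illustration in the spirit of Fast-dLLM's gating, not its actual implementation.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for a masked position


def threshold_unmask_step(tokens, predictions, confidences, tau=0.9):
    """Reveal every masked position whose confidence exceeds tau, leaving
    the rest masked for a later step. Commits are still irreversible;
    the threshold only tries to avoid making bad ones."""
    tokens = tokens.copy()
    reveal = (tokens == MASK) & (confidences > tau)
    tokens[reveal] = predictions[reveal]
    return tokens
```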

DMax takes a different angle. It changes the training.

On-Policy Uniform Training

The core technical contribution is something the authors call On-Policy Uniform Training. Standard dLLM training only teaches the model one thing: predict clean tokens from masked inputs. The model never sees its own wrong predictions during training, so when it encounters them at inference time (which it inevitably does during aggressive parallel decoding), it has no idea what to do with them.

DMax unifies two training regimes. The model learns to recover tokens from both masked inputs and from corrupted sequences containing its own errors. The "on-policy" part means the training data includes the model's actual mistakes, not synthetic noise. This is a meaningful distinction, since the distribution of errors a model makes is specific to that model.
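A rough sketch of what on-policy corruption could look like in data construction, assuming the paper's high-level description: each position is either masked, replaced with the model's own prediction (possibly wrong), or left clean. The function name, rates, and `model_predict` interface are all invented for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical mask symbol


def corrupt_on_policy(clean, model_predict, mask_rate=0.3, error_rate=0.2):
    """Build a training input that mixes standard mask corruption with the
    model's own predictions (on-policy errors), so the denoiser learns to
    fix mistakes it actually makes, not just to fill in blanks."""
    corrupted = []
    for i, tok in enumerate(clean):
        r = random.random()
        if r < mask_rate:
            corrupted.append(MASK_TOKEN)               # mask corruption
        elif r < mask_rate + error_rate:
            corrupted.append(model_predict(clean, i))  # model's own guess
        else:
            corrupted.append(tok)                      # keep clean token
    return corrupted
```

The key design choice is sourcing the corruptions from the model itself rather than from random token noise, so the training-time error distribution matches what decoding actually produces.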

Soft Parallel Decoding

The second piece is how DMax represents intermediate states during generation. Instead of the hard mask-or-token binary, each position gets represented as an interpolation between the predicted token embedding and the mask embedding. The model can then iteratively refine these soft representations, gradually sharpening predictions across steps rather than making irreversible commitments.

Think of it as the difference between writing in pen and writing in pencil. Standard dLLMs write in pen. DMax writes in pencil and keeps erasing.
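The interpolation itself is simple to write down. A minimal sketch, assuming a per-position confidence weight (the function name and the scalar-confidence formulation are illustrative, not taken from the paper):

```python
import numpy as np


def soft_state(token_emb, mask_emb, confidence):
    """Soft intermediate state for one position: confidence 0 is the pure
    mask embedding, confidence 1 is a full commitment to the predicted
    token embedding. Refinement steps can move confidence up or down
    rather than making a one-shot irreversible commit."""
    return confidence * token_emb + (1.0 - confidence) * mask_emb
```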

The numbers

The team benchmarks against LLaDA-2.0-mini, the 16B-parameter MoE diffusion model from Ant Group's InclusionAI team. The headline metric is TPF, tokens per forward pass, which measures how many tokens the model produces per inference step.
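TPF is just generated tokens divided by forward passes; the worked numbers in the comment below are illustrative, not figures from the paper.

```python
def tokens_per_forward(total_tokens, forward_passes):
    """TPF: average number of tokens produced per model forward pass.
    Sequential autoregressive decoding has TPF = 1 by definition; a
    diffusion LLM revealing several tokens per step pushes TPF higher."""
    return total_tokens / forward_passes


# Illustrative: a 256-token answer finished in 47 forward passes gives
# TPF ~= 5.45, roughly the regime DMax reports on GSM8K.
```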

On GSM8K (math reasoning), TPF jumps from 2.04 to 5.47. On MBPP (code generation), from 2.71 to 5.86. Both without accuracy loss, according to the paper. On two H200 GPUs at batch size 1, they report 1,338 tokens per second.

Those TPF numbers are solid. But I want to be careful here. TPF is a useful metric for measuring decoding parallelism, but it doesn't tell you everything about real-world throughput. The 1,338 TPS figure is more practically meaningful, though it's worth noting that's on H200s, NVIDIA's latest datacenter GPU, not something most people have lying around.

And the accuracy preservation claim deserves scrutiny. The paper says accuracy is maintained, but the specific numbers and what "maintained" means in context (within 1%? identical? varies by benchmark?) matters. I couldn't find independent reproduction of these results yet.

Where this sits in a crowded field

The dLLM acceleration space has gotten absurdly competitive in the past few months. You've got Fast-dLLM, D2F, LoPA, SchED, Hierarchy-dLLM, DEMASK, dParallel, Learn2PD, and now DMax, all attacking roughly the same problem from different angles. Some are training-free, some require fine-tuning, some change the architecture entirely.

DMax's bet is that you need to change the training, not just the inference algorithm. That's a more expensive proposition but potentially a more durable one. Training-free methods are limited by what the base model can already do; training-based methods can teach the model new capabilities.

The team behind this, led by Zigeng Chen and Xinchao Wang at NUS, has released the training data on Hugging Face alongside the models. That's a good sign for reproducibility. Whether the approach generalizes beyond the LLaDA-2.0-mini backbone and the specific benchmarks tested remains an open question.

What I find most interesting isn't the speed numbers but the framing. Treating decoding as self-correction rather than progressive unmasking is a conceptual shift that could influence how future dLLMs are designed from the ground up. Or it could end up being one more paper in an already overwhelming stack. The field is moving fast enough that it's genuinely hard to tell.

Tags: diffusion language models, parallel decoding, DMax, NUS, LLaDA, inference optimization, dLLM
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.


DMax: Self-Correcting Parallel Decoding for Diffusion LLMs | aiHola