Machine Learning

Apple and UCSD propose latent diffusion method for LLM reasoning

LaDiR runs chain-of-thought in continuous latent space using a diffusion model, then switches to autoregressive decoding for the answer.

Oliver Senti, Senior AI Editor
May 1, 2026
[Image: abstract visualization of a latent diffusion process denoising vector representations into a coherent reasoning chain]

Apple researchers and collaborators at UC San Diego have published a paper proposing what they call LaDiR, a hybrid reasoning method that pushes most of an LLM's chain-of-thought work out of token space and into a continuous latent space governed by a diffusion model. The technical report first appeared on arXiv last October and has been updated through April ahead of an ICLR workshop. Code is available on GitHub.

The pitch is straightforward, even if the implementation isn't. Standard chain-of-thought generates tokens one after the other, and once a token is committed, the model can't really go back and rewrite it. LaDiR replaces that linear process for the reasoning steps only. Each "thought" gets compressed into a block of continuous latent vectors by a Variational Autoencoder, the diffusion model denoises those vectors over multiple steps, and only when the model decides reasoning is done does it switch to ordinary autoregressive decoding to write out the final answer.
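The shape of that pipeline can be caricatured in a few lines of Python. To be clear, this is a toy sketch, not the authors' implementation: `encode_thought`, `denoise`, the fixed 0.3 step size, and the seeded pseudo-random "encoder" are all invented for illustration, and the real model denoises learned VAE latents with a trained network. What the sketch does show is the structural point: every coordinate of the latent remains revisable at every step, unlike a committed token.

```python
import math
import random

def encode_thought(text, dim=8):
    # Toy stand-in for LaDiR's VAE encoder: deterministically map a
    # reasoning sentence to a small block of continuous latent values.
    rng = random.Random(text)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def denoise(noisy, clean, steps):
    # Toy stand-in for the diffusion denoiser: pull a noisy latent toward
    # its clean target over several steps. Every coordinate is revised at
    # every step; nothing is "committed" the way an emitted token is.
    x = list(noisy)
    for _ in range(steps):
        x = [xi + 0.3 * (ci - xi) for xi, ci in zip(x, clean)]
    return x

def l2(a, b):
    # Euclidean distance between two latent vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

clean = encode_thought("add 19 and 23 to get 42")
rng = random.Random(0)
noisy = [c + rng.gauss(0.0, 1.0) for c in clean]

# More denoising steps leave the latent closer to the clean target,
# mirroring the steps-vs-accuracy knob the paper reports.
for steps in (3, 10, 50):
    print(steps, round(l2(denoise(noisy, clean, steps), clean), 4))
```

Only after the latent block settles would the model switch to ordinary autoregressive decoding for the answer text; that handoff is not modeled here.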

Why bolt diffusion onto a language model

The interesting bit, to me, is the choice of latent diffusion rather than the masked-token diffusion driving most of the diffusion-LM hype. Models like LLaDA work in discrete token space and effectively just unmask tokens in parallel. LaDiR operates on continuous representations from a VAE that the authors initialize from the same backbone LLM. That gives the diffusion process actual continuous gradients to work with, and lets the model revise the meaning of an earlier reasoning step rather than just swapping a masked token for a different word.
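To make the contrast concrete, here is a toy version of the LLaDA-style discrete-unmasking step. The function name and the confidence scheme are invented for illustration; a real masked-diffusion LM predicts token distributions with a transformer. The property that matters is that once a position is committed, later steps never touch it, whereas continuous latents keep every dimension open to revision.

```python
MASK = "_"

def unmask_step(tokens, proposals, k=2):
    # Toy masked-token diffusion step (heavily simplified): commit the k
    # highest-confidence proposals for currently masked positions.
    # Committed tokens are frozen; later steps can fill the remaining
    # masks but never revise an already-placed token.
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
    out = list(tokens)
    for i in ranked[:k]:
        out[i] = proposals[i][0]
    return out

# proposals[i] = (predicted token, confidence) for position i.
step1 = unmask_step([MASK, MASK, MASK],
                    [("a", 0.9), ("b", 0.5), ("c", 0.7)], k=2)
# Even if a later pass is far more confident about positions 0 and 2,
# only the remaining masked slot can change.
step2 = unmask_step(step1,
                    [("x", 0.99), ("b", 0.6), ("y", 0.99)], k=2)
print(step1, step2)
```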

Architecture details are denser than you'd expect for a workshop paper. Each reasoning sentence becomes one block of latent tokens. Inside a block, attention is bidirectional. Across blocks, it stays causal. A special token decides when to stop reasoning and start writing the answer. Training happens in two stages: teacher-forcing on oracle latents from the VAE, then a rollout phase where the model has to denoise its own outputs. Without that second stage, performance tanks. The ablation table is blunt about it: drop Stage 2 and Pass@1 on math benchmarks falls from 43.5% to 27.9% on average.
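The attention pattern described above is easy to render as a mask. A minimal sketch, with the function name and list-of-lists representation mine (a real implementation would build a tensor mask inside the transformer): a query token may attend bidirectionally to any token in its own block, causally to all tokens in earlier blocks, and never to later blocks.

```python
def block_attention_mask(block_sizes):
    # Toy block-causal mask: entry [i][j] is True when query token i may
    # attend to key token j. Attention is bidirectional within a block
    # and causal (block-wise) across blocks, matching the pattern the
    # paper describes. The stop token that ends reasoning is not modeled.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]

# Two reasoning blocks of 2 and 3 latent tokens: a 5x5 mask.
mask = block_attention_mask([2, 3])
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```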

Numbers, with appropriate skepticism

On seven math benchmarks tested with a LLaMA 3.1 8B backbone, LaDiR posts a 43.5% Pass@1 average. That's 1.5 points above TaH+, the strongest prior latent-reasoning baseline, and roughly 3 points above standard CoT supervised fine-tuning. A real improvement, but a modest one. The more interesting result is Pass@100, where LaDiR pulls ahead of AR CoT SFT by 6.1 points; there, the paper's diversity-guidance trick clearly does something.

Countdown is where the gap looks dramatic. On the four-number variant, LaDiR hits 76.6% Pass@1 versus 46.7% for an AR LLaMA 8B fine-tune, and 96.4% Pass@100 against 65.3%. On five numbers it's 38.5% versus 8.9%. Those are not subtle gains.

A couple of caveats are worth flagging. The math win over TaH+ is small enough that a different random seed or decoding temperature could move it around. And the Countdown numbers, while impressive, come on a synthetic combinatorial task that diffusion models tend to do well on for reasons not entirely about "reasoning." MGDM, a small task-specific discrete diffusion model the authors include as a baseline, beats LaDiR on Countdown-4 Pass@1 (91.5%) despite being much smaller. The authors note this and make the fair point that LaDiR is general-purpose while MGDM isn't, but it tells you something about how amenable Countdown is to non-AR approaches in the first place.

For code generation, the team swaps in a Qwen3-8B-Base backbone and reports a 5.2-point average gain over AR SFT on MBPP and HumanEval, with the largest jump (about 8 points) on HumanEval+.

Costs, and the scaling question nobody answers

There's a test-time compute story here that cuts both ways. More denoising steps give you more accuracy. Going from 5 to 10 steps adds 11.7 points on average; going to 50 adds another 9.8 on top of that. A usable knob. But it's a knob you have to turn, and the paper doesn't wrestle hard with what this actually costs in wall-clock latency relative to AR decoding at the same accuracy.

Whether any of this survives at frontier scale is the bigger open question. Everything is 8B. The VAE is trained from the same 8B backbone. The reasoning model is the same backbone again. None of that tells you what happens when the underlying LLM is 70B or 400B, trained with vastly more reasoning supervision than DART-MATH and a filtered code SFT set provide.

So what

LaDiR sits in a small pile of Apple papers exploring alternatives to pure autoregressive reasoning. Their earlier Latent Lookahead work from March pushed in a related direction. Whether any of this lands in a shipped Apple Intelligence feature is a different question, and one nobody at the company will answer. The code is public, the weights are not, and the benchmarks here are the ones the academic field grades on, not the ones a phone assistant gets graded on.

It'll be presented at the ICLR 2026 workshop on latent and implicit thinking. Whether the wider community picks up the latent-diffusion-for-reasoning thread will probably depend less on this result and more on whether someone replicates the Countdown gains at a scale that actually matters.

Tags: latent diffusion, LLM reasoning, Apple research, chain-of-thought, UC San Diego, language models, machine learning, AI research
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

