Bengio Lab's GRAM Adds Stochastic Trajectories to Reasoning

Researchers from KAIST, NYU, and Mila published a paper on Tuesday that gives recursive reasoning models a property they have been missing: randomness. The architecture, called GRAM, slots into a niche that has been getting attention lately, the kind of tiny network that does its thinking in latent space instead of by emitting more tokens.

Yoshua Bengio is on the author list, which got me past the title. His Montreal group produced the original attention paper for neural networks back in 2014, though calling him "the creator" of attention, as some early write-ups have, is a stretch.

What the paper actually does

Recursive reasoning models, in the sense the paper uses the term, are not the chain-of-thought reasoners everyone has been shipping. They iterate inside the network. A latent state gets refined some number of times, and the refined state produces the answer. Sapient Intelligence's Hierarchical Reasoning Model from last summer and Alexia Jolicoeur-Martineau's Tiny Recursive Model (TRM) paper showed this could work at very small parameter counts. TRM in particular hit 45% on ARC-AGI-1 with just 7 million parameters, which embarrassed a lot of larger systems.

But these models had a problem. Same input in, same trajectory out, same answer. The reasoning collapses to a single attractor. If you want to scale at inference time, you can only go deeper, not wider. Parallel sampling does not help if every sample is identical.
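To make the collapse concrete, here is a minimal toy sketch of deterministic latent recursion. It is not code from the paper: the `refine` rule is a made-up contraction standing in for a trained network, and the names and step count are assumptions chosen for illustration.

```python
import math

def refine(state, x):
    """One deterministic refinement step (hypothetical toy rule, not GRAM's or TRM's)."""
    return [0.5 * s + 0.5 * math.tanh(xi) for s, xi in zip(state, x)]

def run_deterministic(x, steps=16):
    """Refine a fixed initial latent state `steps` times and return the result."""
    state = [0.0] * len(x)
    for _ in range(steps):
        state = refine(state, x)
    return state

x = [0.3, -1.2, 2.0]

# Five "parallel samples" of a deterministic model are five copies of one trajectory.
samples = [run_deterministic(x) for _ in range(5)]
all_identical = all(s == samples[0] for s in samples)
print(all_identical)  # True
```

The only knob left in a setup like this is `steps`, which is the "deeper, not wider" limitation in miniature.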

GRAM's contribution is straightforward in description, less so in execution. At each recursion step, the deterministic update gets a learned stochastic shift added on top. The mean of that shift encodes a direction. The variance controls how much to explore. The whole thing is trained with amortized variational inference. The result: you can sample many trajectories in parallel, then pick the best one with a latent process reward model.
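A rough sketch of that sample-then-select loop, under heavy assumptions: `shift_params` stands in for the learned network that outputs the mean and log-variance of the shift, and `toy_reward` stands in for the latent process reward model. Both are hypothetical placeholders of mine, the training procedure is omitted entirely, and the toy reward peeks at a target vector, which a real reward model would not have.

```python
import math
import random

def deterministic_update(z, x):
    """Toy deterministic recursion step (hypothetical)."""
    return [0.5 * zi + 0.5 * math.tanh(xi) for zi, xi in zip(z, x)]

def shift_params(z, x):
    """Placeholder for a learned network: returns mean and log-variance of the shift."""
    mean = [0.1 * (xi - zi) for zi, xi in zip(z, x)]  # direction to move in
    log_var = [-2.0 for _ in z]                        # how much to explore
    return mean, log_var

def stochastic_step(z, x, rng):
    """Deterministic update plus a Gaussian shift with the given mean and variance."""
    base = deterministic_update(z, x)
    mean, log_var = shift_params(z, x)
    return [b + m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for b, m, lv in zip(base, mean, log_var)]

def sample_trajectory(x, steps, rng):
    """Run one stochastic trajectory and return its final latent state."""
    z = [0.0] * len(x)
    for _ in range(steps):
        z = stochastic_step(z, x, rng)
    return z

def toy_reward(z, target):
    """Placeholder scorer: negative squared distance to a target vector."""
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))

def best_of_n(x, target, n=20, steps=16, seed=0):
    """Sample n trajectories (sequentially here) and keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_trajectory(x, steps, rng) for _ in range(n)]
    best = max(candidates, key=lambda z: toy_reward(z, target))
    return best, candidates

x = [0.3, -1.2, 2.0]
target = [0.5, -0.5, 1.0]
best, candidates = best_of_n(x, target)
distinct = len({tuple(c) for c in candidates})
print(distinct > 1)  # the trajectories now differ, so selection has something to choose from
```

The point of the sketch is structural: once the transition is a distribution rather than a function, drawing more samples buys diversity, and a scorer can turn that diversity into accuracy.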

The numbers, and what to make of them

On Sudoku-Extreme, GRAM scores 97.0% with around 10M parameters, against TRM's 87.4%. That is a real improvement. The project page also shows GRAM with 20 parallel samples at 16 iterations beating TRM at 320 iterations (97.0% vs 90.5%), which is the kind of width-versus-depth result the authors clearly want you to focus on.
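The arithmetic behind that comparison is worth spelling out. The sketch below assumes cost is counted in recursion steps and that the two configurations are meant as an equal-budget pairing; that is my reading of the numbers, not a statement from the paper, and the helper name is made up.

```python
def total_steps(num_samples, iterations_per_sample):
    """Total recursion steps spent across all samples."""
    return num_samples * iterations_per_sample

wide = total_steps(num_samples=20, iterations_per_sample=16)   # GRAM configuration
deep = total_steps(num_samples=1, iterations_per_sample=320)   # TRM configuration

print(wide, deep)  # 320 320

# Same total work, but the longest sequential chain differs by a factor of 20.
sequential_depth_ratio = 320 // 16
print(sequential_depth_ratio)  # 20
```

If the 20 samples really do run in parallel, the wide configuration does the same total work with a much shorter sequential chain, which is where the practical appeal would come from.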

The headline numbers most people will quote are ARC-AGI: 52.0% on ARC-AGI-1, 11.1% on ARC-AGI-2. Some early coverage has called these "roughly GPT-5.2 level." That comparison falls apart on inspection. GPT-5.2 Pro hits 90.5% on ARC-AGI-1, the first model to cross 90%. GRAM's 52% sits closer to GPT-5 Mini territory (54.3%). On ARC-AGI-2 the gap is starker: GPT-5.2 Thinking scores 52.9%, GRAM scores 11.1%.

Is the right comparison GPT-5.2 anyway? Probably not. GPT-5.2 Pro is a frontier reasoning model running at roughly $12 per ARC task. GRAM is a 10-million-parameter network. Putting them side by side is silly in both directions. What is genuinely interesting is that GRAM beats TRM by 7 points on ARC-AGI-1 at roughly 10M parameters against TRM's 7M, and that it produces something that can sample diverse hypotheses at all.

The stochasticity question

One of the more honest sections of the paper is the ablation. Removing the stochasticity entirely tanks performance to 0% on both Sudoku and N-Queens. Naive stochasticity (a stochastic decoder, random initialization) does not help TRM. The gain comes specifically from the variational framework, not from just dumping noise into the network.

What I keep coming back to is the constraint satisfaction results. On N-Queens, GRAM reaches 99.7% accuracy, against 96.3% for an autoregressive Transformer and 96.1% for a masked diffusion baseline. On Graph Coloring with 10 vertices, GRAM averages 3.3 conflict edges per solution, against 61.3 for the autoregressive model. That gap is too large to be explained by hyperparameter tuning. Recursive refinement just seems to be a more natural fit for hard constraint satisfaction than generative sampling.
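For readers unfamiliar with these metrics, here is a small sketch of how a "conflict edge" count and an N-Queens validity check are usually defined. These are the standard textbook definitions, not the paper's evaluation code, and the tiny graph and board are invented for illustration.

```python
def count_conflict_edges(edges, coloring):
    """Number of edges whose two endpoints received the same color."""
    return sum(1 for u, v in edges if coloring[u] == coloring[v])

def queens_valid(cols):
    """cols[r] is the column of the queen in row r; True if no two queens attack."""
    n = len(cols)
    for r1 in range(n):
        for r2 in range(r1 + 1, n):
            same_column = cols[r1] == cols[r2]
            same_diagonal = abs(cols[r1] - cols[r2]) == r2 - r1
            if same_column or same_diagonal:
                return False
    return True

# A 4-vertex toy graph: a triangle (0, 1, 2) plus a pendant vertex 3.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
good_coloring = [0, 1, 2, 0]
bad_coloring = [0, 0, 1, 1]

print(count_conflict_edges(edges, good_coloring))  # 0
print(count_conflict_edges(edges, bad_coloring))   # 2
print(queens_valid([1, 3, 0, 2]))                  # True
print(queens_valid([0, 1, 2, 3]))                  # False
```

Both checks are all-or-nothing per constraint, which is part of why one-shot sampling struggles here: a single wrong cell or color invalidates the whole solution, whereas an iterative refiner gets repeated chances to repair it.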

What is missing

Scaling is the obvious problem. GRAM tops out at roughly 10-11M parameters. The authors note in their own conclusion that "the sequential nature of deep supervision limits training efficiency compared to Transformers, posing a significant barrier to scaling GRAM toward larger foundation models." That is a polite way of saying nobody knows whether any of this works at 1B parameters, let alone 70B.

And the ARC-AGI-2 number, while better than TRM's 8%, is well below the test-time-trained systems that placed in the 2025 ARC Prize. NVARC reached around 24% on ARC-AGI-2 under contest constraints. GRAM is doing something architecturally different, but the benchmark gap is real.

Tuesday's release includes the GitHub repo. Whether someone can plug stochastic latent transitions into a larger backbone without the training cost exploding is the open question. I do not have an answer, and neither does the paper.

Oliver Senti
