AI Research

DiffThinker Uses Diffusion Models to Solve Visual Puzzles

New research shows image generation models can solve mazes and Sudoku by drawing solutions directly on images, outperforming GPT-5 and Gemini on visual reasoning tasks.

Oliver Senti, Senior AI Editor
January 3, 2026 · 5 min read

Researchers from multiple institutions dropped a paper on December 30th proposing something that sounds almost too simple: instead of having AI describe how to solve a maze, just have it draw the solution path directly on the image.

The approach is called DiffThinker, and the results are striking enough that I kept double-checking the numbers. We'll get to those.

The setup

So here's what's been happening in multimodal reasoning. When you ask Gemini or ChatGPT to solve a visual puzzle, the model looks at the whole image once, thinks in text, and spits out a text answer. OpenAI's o-series models got slightly cleverer: they can call Python tools to crop and zoom into parts of the image, look again, think more, then answer. It's still fundamentally text-centric reasoning with occasional visual check-ins.

DiffThinker does something different. It takes Qwen-Image-Edit (Alibaba's 20B parameter image editing diffusion model) and fine-tunes it to generate solutions directly in visual space. Ask it to solve a maze? It draws the path. Traveling salesman problem? It connects the dots. Sudoku? It fills in the numbers.

The output image then gets parsed by separate code to extract the actual answer. Which, okay, that's not fully end-to-end. But the reasoning itself happens entirely in image generation.
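The paper doesn't detail the parser here, but the idea is simple enough to sketch. A minimal, hypothetical version for mazes: assume the model draws the solution path in red on a uniform pixel grid, then map "red enough" pixels back to grid cells. The function name, the color convention, and `cell_size` are all my assumptions, not the authors' code:

```python
def extract_path_cells(pixels, cell_size):
    """Map red path pixels back to (row, col) grid cells.

    pixels: 2D list of (r, g, b) tuples, one per pixel.
    cell_size: edge length of one maze cell in pixels.
    Returns the set of grid cells the drawn path passes through.
    """
    cells = set()
    for y, row in enumerate(pixels):
        for x, (r, g, b) in enumerate(row):
            if r > 200 and g < 80 and b < 80:  # crude "is this red?" threshold
                cells.add((y // cell_size, x // cell_size))
    return cells
```

A real parser would also need to order those cells into a start-to-goal sequence and reject disconnected strokes, which is exactly where "not fully end-to-end" starts to bite.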

The numbers that made me pause

The paper claims DiffThinker outperforms GPT-5 by 314% and Gemini-3-Flash by 111% on their benchmark tasks. Also beats a fine-tuned Qwen3-VL-32B baseline by 39%.

I don't love benchmark comparisons where one system is specifically trained on the task types and the other isn't. But the gap here is large enough that something real is going on. The tasks span sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration, which covers a decent range of visual reasoning challenges.

What's more interesting: the training cost. The authors report 3 hours on 8 H200 GPUs per task. That's remarkably cheap for this kind of capability gain. For context, H200s are high-end NVIDIA data center GPUs with 141GB of HBM3e memory, but even so, 3 hours of training to significantly outperform frontier models on specific visual tasks? Worth paying attention to.

Why this might work

The intuition here isn't crazy. Text-based reasoning creates what researchers call a "semantic gap" between continuous visual information and discrete symbolic thought. When you try to describe a maze solution in words ("go right, then down, then..."), you're forcing spatial relationships through a bottleneck that loses information.

Image generation models already understand spatial relationships natively. They know what paths look like, what connections between points look like. The diffusion process itself explores multiple candidate solutions in parallel during denoising, which the authors argue gives it "native parallel reasoning" for free.

The processing happens in latent space after the VAE encoder compresses the image. So you're exploring solution variations in a compressed but still visually meaningful representation, then decoding back to an actual image.
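As a toy illustration of that pipeline (pure-Python stand-ins, nothing from the actual model): several randomly initialized latents are refined in parallel, each step removing a fraction of the remaining noise.

```python
import random

def denoise(latent, clean, steps=50):
    """Toy denoising loop: each step closes part of the remaining gap
    between a noisy latent and a clean target latent."""
    for t in range(steps):
        alpha = 1.0 / (steps - t)  # schedule chosen so the last step lands exactly
        latent = [x + alpha * (c - x) for x, c in zip(latent, clean)]
    return latent

def parallel_denoise(clean, n_candidates=4, dim=8, seed=0):
    """Refine several noisy starts side by side, mimicking the
    parallel candidate exploration the authors describe."""
    rng = random.Random(seed)
    starts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_candidates)]
    return [denoise(s, clean) for s in starts]
```

In this toy every start collapses to the same target; in a real diffusion model the denoiser is stochastic and the conditioning underdetermines the output, which is what lets parallel candidates settle on genuinely different solutions.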

The "but" section

This isn't end-to-end, and that matters. The model generates an image, then separate parsing code extracts structured answers. The model can't verbally explain its reasoning. It can't look at its own output and decide to try again. There's no chain-of-thought here, just input-to-output.

The authors also fine-tune on each task type separately. The GitHub repo shows distinct training pipelines for Maze, TSP, Sudoku, Jigsaw, and FrozenLake tasks. So we don't really know if this generalizes to novel visual reasoning tasks the model hasn't seen. They test on different grid sizes (8x8, 16x16, 32x32 mazes), but that's changing parameters within a task type, not testing true generalization.

The 6.7% computational overhead they report comes with heavily optimized custom kernels. Good luck replicating that with vanilla PyTorch.

Where this fits in the bigger picture

OpenAI's o3 and o4-mini models can "think with images" by cropping, zooming, and rotating uploaded images during their reasoning chains, alongside other simple image transformations. There's now an entire survey paper categorizing this emerging "thinking with images" paradigm.

But those approaches still reason in text, with images as supporting evidence. DiffThinker represents something more radical: reasoning that happens in image space. The output isn't a description of a solution, it's the solution rendered visually.

The practical question is whether this can scale beyond toy problems. Mazes and Sudoku are great for benchmarks, but real-world visual reasoning is messier. Can this approach handle ambiguous inputs? Multiple valid solutions? Tasks that require integrating visual and textual information?

What would make this actually useful

For this paradigm to matter beyond a paper:

The model needs to close the loop. Generate a solution, evaluate it visually, refine if needed. Maybe even explain in text what it did. That requires integrating the diffusion process into a larger system that can plan and verify.

It needs to generalize. Training task-specific models is fine for research, but you'd want something that can tackle novel visual reasoning problems. The authors don't show evidence of this.

Someone needs to benchmark the failure modes. When does visual generation reasoning break down? What happens with adversarial inputs? The paper is light on this.
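The first of those requirements is mostly plumbing. A hypothetical outer loop (none of this is in the paper) that wraps any generator with a task-specific checker and resamples on failure:

```python
def solve_with_verification(generate, is_valid, max_tries=5):
    """Call generate(attempt) until is_valid accepts a candidate.

    generate: produces one candidate per call (e.g. one diffusion
        sample parsed into a structured answer).
    is_valid: task-specific checker (e.g. "path connects start to goal
        without crossing walls").
    Returns the first valid candidate, or None after max_tries.
    """
    for attempt in range(max_tries):
        candidate = generate(attempt)
        if is_valid(candidate):
            return candidate
    return None
```

The checker is the easy half, since maze, Sudoku, and TSP solutions are all cheap to verify; the hard half is making the generator's retries informed by the failure rather than just resampling blindly.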

Bottom line

DiffThinker is a proof of concept that image generation models can be repurposed for visual reasoning in ways that text-based approaches can't match. The results are impressive enough on the specific tasks they tested.

Whether this becomes a real paradigm shift or a clever trick for spatial puzzles depends on work the authors haven't done yet: closing the reasoning loop, demonstrating generalization, and integrating with text-based thinking rather than replacing it entirely.

The model weights and training code are available. Expect to see follow-ups trying to push this further.

Tags: AI research, diffusion models, visual reasoning, multimodal AI, computer vision
Oliver Senti, Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
