A team led by Yann LeCun has released a paper showing that Joint Embedding Predictive Architectures can finally train without the pile of engineering hacks that have kept them unstable for years. The LeWorldModel paper, authored by Lucas Maes, Quentin Le Lidec, Damien Scieur, LeCun, and Randall Balestriero from Mila, NYU, Samsung SAIL, and Brown University, describes a system that learns a world model directly from raw pixels using only two loss terms. Fifteen million parameters. One GPU. A few hours of training.
That last part is the headline, but the interesting story is what they threw away to get there.
The collapse problem, finally addressed
World models built on JEPA have had a persistent, embarrassing failure mode: representation collapse. The model finds a shortcut where it maps every input to the same output vector, driving the prediction loss to zero without learning anything useful. Think of a weather app that just predicts "sunny" every day. Technically correct often enough. Practically useless.
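To make the shortcut concrete, here is a toy sketch (ours, not the paper's; the shapes and names are invented) of a JEPA-style latent prediction loss falling into exactly this trap:

```python
import torch

# A degenerate encoder that ignores its input is a global minimum of a pure
# latent-prediction objective: predicting a constant from a constant is easy.
def encoder(x):
    return torch.zeros(x.shape[0], 192)    # collapse: every frame -> same vector

def predictor(z, action):
    return z                                # identity "prediction" is now perfect

frames_t  = torch.randn(8, 3, 64, 64)       # batch of current frames
frames_t1 = torch.randn(8, 3, 64, 64)       # batch of next frames
actions   = torch.randn(8, 4)               # actions are simply ignored

loss = torch.nn.functional.mse_loss(
    predictor(encoder(frames_t), actions), encoder(frames_t1))
print(loss.item())                          # 0.0 -- perfect loss, zero information
```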
Previous fixes for this were ugly. PLDM, the closest comparable end-to-end approach, required a seven-term training objective derived from VICReg regularization, which according to the paper introduced training instability and made hyperparameter tuning a nightmare: grid search over six loss coefficients scales as O(n^6) in the number of candidate values per coefficient. Other approaches, like DINO-WM, sidestepped the problem entirely by freezing a massive pre-trained vision encoder, which meant giving up on end-to-end learning altogether.
LeWorldModel replaces all of this with two components: a standard mean-squared-error prediction loss and a regularizer called SIGReg that forces the latent embeddings to follow an isotropic Gaussian distribution. The math behind SIGReg relies on the Cramér-Wold theorem (if all one-dimensional projections of a distribution match your target, the full distribution matches too), and the team uses the Epps-Pulley test statistic to check normality along random projections. One hyperparameter to tune instead of six. Bisection search with O(log n) complexity instead of polynomial-time grid search. With ten candidate values per coefficient, that's the difference between a handful of training runs and a million.
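For the curious, here is a minimal sketch of that two-term objective as the paper describes it. This is our illustration, not the released code: the Epps-Pulley closed form below is the standard one from the statistics literature, while the function names, projection count, and weight `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def epps_pulley(y: torch.Tensor) -> torch.Tensor:
    """Epps-Pulley statistic for a 1-D sample: integrated squared distance
    between the empirical characteristic function and that of N(0, 1).
    Near zero when the sample looks standard normal; grows as it deviates."""
    n = y.shape[0]
    pairwise = y[:, None] - y[None, :]
    term1 = torch.exp(-0.5 * pairwise**2).sum() / n
    term2 = (2.0 ** 0.5) * torch.exp(-0.25 * y**2).sum()
    return term1 - term2 + n / (3.0 ** 0.5)

def sigreg(z: torch.Tensor, num_projections: int = 64) -> torch.Tensor:
    """Cramer-Wold in action: if every 1-D projection of z matches N(0, 1),
    the joint distribution is isotropic Gaussian. Test random directions."""
    directions = F.normalize(
        torch.randn(num_projections, z.shape[1], device=z.device), dim=1)
    projections = z @ directions.T          # (batch, num_projections)
    return torch.stack([epps_pulley(projections[:, i])
                        for i in range(num_projections)]).mean()

def total_loss(z_pred, z_next, lam=1.0):
    """The two-term objective: MSE prediction loss plus the Gaussian
    regularizer, with a single weight to tune."""
    return F.mse_loss(z_pred, z_next) + lam * sigreg(z_next)
```

The appeal is visible in the shape of the code: the regularizer touches only the embeddings, and the single weight `lam` is the one knob the bisection search has to tune.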
What the benchmarks actually show
The model was tested on four environments: a 2D navigation task (Two-Room), a two-joint arm controller (Reacher), a block manipulation task (Push-T), and a 3D robotic pick-and-place scenario (OGBench-Cube). LeWorldModel outperformed PLDM on all the challenging tasks and beat DINO-WM on Push-T and Reacher, even without pre-trained visual features.
On Push-T, it beat DINO-WM even when DINO-WM had access to proprioceptive inputs that LeWorldModel didn't use. That's a genuinely surprising result.
But here's the thing. DINO-WM still wins on the visually complex 3D OGBench-Cube task, and the authors acknowledge this is likely because a frozen DINOv2 encoder carries richer visual priors from large-scale pretraining. And LeWorldModel actually underperforms on Two-Room, the simplest environment tested. The authors suspect the task's intrinsic dimensionality is too low for the Gaussian regularizer to produce a well-structured latent space. So we have a method that struggles in very simple environments and in very complex visual scenes. The sweet spot is real, but it is a sweet spot, not a universal solution.
The planning speed numbers are more straightforward. Because LeWorldModel encodes each frame as a single 192-dimensional token (roughly 200 times fewer tokens than DINO-WM), planning completes in about one second versus 47, a roughly 48x speedup. Under fixed compute budgets, LeWorldModel outperforms DINO-WM on both Push-T and OGBench-Cube, which tells you something about efficiency even on the tasks where raw performance lags.
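A back-of-envelope calculation (ours, and it assumes the predictor uses self-attention, which the paper's token framing suggests but which we haven't verified) shows why the token count dominates:

```python
# Per-layer self-attention cost scales roughly with tokens^2, so a 200x
# token gap is worth up to 200^2 = 40,000x in attention operations alone.
tokens_lewm = 1      # one 192-d token per frame (from the paper)
tokens_dino = 200    # the rough ratio the paper quotes for DINO-WM
print(tokens_dino**2 / tokens_lewm**2)   # 40000.0
# The measured wall-clock gap (~48x) is far smaller, since planning time
# also includes encoding, optimizer steps, and other non-attention costs.
```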
Why this matters for LeCun's bigger bet
LeCun published his JEPA position paper, "A Path Towards Autonomous Machine Intelligence," in mid-2022. The core argument: predicting the next word (or pixel) is a dead end for real intelligence. Instead, you should predict abstract representations of future states, discarding unpredictable details. The idea was elegant. The execution kept falling apart because nobody could train these things reliably without bolting on exponential moving averages, stop-gradient operations, frozen encoders, and half a dozen auxiliary losses.
LeWorldModel removes that objection. Small model, stable training, no hacks.
The timing matters. LeCun left Meta in late 2025 and founded AMI Labs, which raised $1.03 billion at a $3.5 billion valuation in early March. Europe's largest seed round ever. Investors include Bezos Expeditions, NVIDIA, Samsung, Toyota Ventures, and Eric Schmidt. AMI's CEO Alexandre LeBrun told TechCrunch that the company would spend its first year on pure research, and that commercial applications could take years. The LeWorldModel paper, dropping two weeks after the funding announcement, reads like the first public proof point that the research direction actually works.
"In six months, every company will call itself a world model to raise funding," LeBrun said. He might be right. Fei-Fei Li's World Labs raised $1 billion in February. General Intuition pulled in $133.7 million. Decart raised $100 million at a $3.1 billion valuation. The money is flowing.
What I'm not convinced about
Fifteen million parameters is compelling for a proof of concept, but LeCun's own company is building systems meant to power robotics, autonomous driving, and healthcare applications. The gap between "competitive on Push-T" and "navigating a construction site" is vast, and the paper says nothing about how SIGReg's Gaussian regularization behaves at scale. Does the Cramér-Wold approach hold up at 1 billion parameters? At 10 billion? With real-world visual complexity instead of simulated block-pushing? I don't know, and the paper doesn't say.
The GitHub repo is public, which is good. The project website includes qualitative rollouts showing both successes and failures, which is refreshingly honest. But the environments tested are all simulated, all relatively low-dimensional, and all involve rigid-body physics. The jump to real-world sensor data, with noise, occlusion, deformable objects, and the sheer visual chaos of actual rooms, is where world models have historically fallen apart regardless of their theoretical elegance.
LeWorldModel is a clean result in a field that desperately needed one. Whether it's the foundation for LeCun's billion-dollar thesis or a neat trick that doesn't generalize is a question that probably won't be answered for another year or two. AMI Labs says it will publish openly as it goes. We'll see.