Ai2's MolmoMotion Forecasts How Objects Will Move in 3D Space Before They Move

Ai2's new 4B model predicts where tagged object points will travel in metric 3D space, given a frame and a text instruction.

Oliver Senti, Senior AI Editor
June 22, 2026

The Allen Institute for AI released MolmoMotion on June 17, a 4-billion-parameter model that predicts how a tagged object will move over the next couple of seconds. You point at something in a frame, write what you want done to it ("move and rotate the wooden bowl with fruit on the table" is their own example), and the model draws the future 3D trajectory of those points in meters relative to the camera. Built on Ai2's Molmo 2 backbone, it ships with weights, a 1.16-million-clip dataset, and a benchmark, all under Apache 2.0.

Forecasting, not perception. That's the pitch. Plenty of models can track what already moved through a scene. Far fewer try to say what happens next, which is the part a robot actually needs before it reaches for a cup.

What's actually in the release

Two variants exist on paper. The autoregressive one (MolmoMotion-AR) writes out coordinates step by step as structured text, the coordinate-prediction trick VLMs already use. The other, a flow-matching version, generates trajectories as a continuous distribution, which the team says fits cases where one instruction has several plausible endings. Useful distinction. But here's the thing: only the AR checkpoints made it into the open release. The flow-matching variant is described in the technical report and that's where it stays, at least for now.

The two public checkpoints are H3-F30, which takes 3 history frames and predicts roughly 2 seconds at 15 fps, and H1-F32, for when you only have a single keyframe. Both live on Hugging Face, with the code and a quickstart script for building inputs and visualizing a prediction. The model card carries a fairly blunt warning, too: predicted trajectories are estimates and should be validated before they drive anything actuated. Good. Somebody was paying attention to the robotics use case.
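The checkpoint names appear to encode their own settings. A minimal sketch of that arithmetic, assuming "H" counts history frames and "F" counts predicted future frames (my reading of the names, not something Ai2's materials spell out here), and assuming 15 fps applies to both checkpoints. The helper names are hypothetical and not part of any Ai2 code:

```python
import re


def parse_checkpoint_name(name):
    """Split a name like 'H3-F30' into (history_frames, future_frames).

    Assumption: H = history frames, F = future frames.
    """
    match = re.fullmatch(r"H(\d+)-F(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def horizon_seconds(future_frames, fps=15.0):
    """Length of the forecast window in seconds at a given frame rate."""
    return future_frames / fps


if __name__ == "__main__":
    for name in ("H3-F30", "H1-F32"):
        history, future = parse_checkpoint_name(name)
        # 15 fps is stated for H3-F30; using it for H1-F32 is an assumption.
        print(name, history, future, round(horizon_seconds(future), 2))
```

Under that reading, 30 future frames at 15 fps is exactly the "roughly 2 seconds" quoted for H3-F30.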

The data and the benchmark don't describe the same thing

Worth slowing down here, because the numbers get conflated easily. The training corpus, MolmoMotion-1M, is 1.16 million clips spanning 736 motion types and about 5.6K distinct objects, assembled by an automatic pipeline that lifts noisy 2D tracks into metric 3D and throws out points that don't move coherently with the rest of the object. The evaluation set is a different, much smaller thing: PointMotionBench, 2.7K human-validated clips covering 111 object categories and 61 motion types, drawn from DAVIS, HOT3D, and WorldTrack. So if you see "736 motion types" attached to the benchmark anywhere, that's the training number bleeding over. The benchmark is leaner and the categories are different.
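To make "lifts 2D tracks into metric 3D and throws out incoherent points" concrete, here is a rough sketch of the general idea, not Ai2's pipeline. It assumes a standard pinhole camera with known intrinsics (fx, fy, cx, cy) and a per-point depth in meters, and it swaps in a deliberately crude coherence test: drop any point whose overall displacement strays too far from the median displacement. That test only captures translation, so a rotating object would need something stronger. Every name and the 5 cm threshold are hypothetical.

```python
import math
from statistics import median


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into camera-frame 3D (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def displacement(track):
    """Net 3D displacement of one point track, first position to last."""
    (x0, y0, z0), (x1, y1, z1) = track[0], track[-1]
    return (x1 - x0, y1 - y0, z1 - z0)


def coherent_tracks(tracks, max_dev_m=0.05):
    """Keep tracks whose displacement is within max_dev_m of the median.

    Translation-only stand-in for a real coherence check (assumption).
    """
    disps = [displacement(t) for t in tracks]
    med = tuple(median(d[i] for d in disps) for i in range(3))
    return [t for t, d in zip(tracks, disps) if math.dist(d, med) <= max_dev_m]
```

A point at the principal pixel backprojects straight down the optical axis, and a track that jumps half a meter while its neighbors move ten centimeters gets filtered out.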

So how good are the numbers?

On PointMotionBench, Ai2 reports MolmoMotion beating everything it was compared against, measured as 3D average displacement error in meters (lower is better). On the HOT3D split, MolmoMotion-AR with 3 history frames lands at 0.109 m, ahead of the next method at 0.129 m and well clear of pixel-space video generators like Wan2.2-5B (0.200 m) and Cosmos Predict (0.225 m). The gap widens dramatically on the WorldTrack split, where some baselines basically fall apart (Track2Act at 1.230 m against MolmoMotion-AR's 0.143 m).
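For readers who haven't met the metric: average displacement error is just the mean Euclidean distance between predicted and ground-truth positions, taken over every point and every timestep. A minimal sketch of that textbook definition follows. How PointMotionBench handles details like occluded points or per-clip averaging isn't covered here, so treat this as the generic formula rather than the benchmark's exact scoring code.

```python
import math


def average_displacement_error(pred, target):
    """Mean 3D Euclidean distance (meters) over all points and timesteps.

    pred and target are lists of tracks; each track is a list of (x, y, z)
    positions in meters, one per future timestep.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of tracks")
    total = 0.0
    count = 0
    for p_track, t_track in zip(pred, target):
        if len(p_track) != len(t_track):
            raise ValueError("each predicted track must match its target length")
        for p, t in zip(p_track, t_track):
            total += math.dist(p, t)
            count += 1
    if count == 0:
        raise ValueError("no points to evaluate")
    return total / count
```

Read this way, a score of 0.109 m means the predicted points sit about 11 cm from where the real ones ended up, on average.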

Now the part that gives me pause. On the DAVIS split, every method struggles, and MolmoMotion-AR comes in around 1.146 m to 1.227 m. The model still wins, but "wins" here means errors over a meter. DAVIS is the hard, unconstrained-video case, and the absolute numbers say nobody has this solved. The benchmark is Ai2's own, the model is Ai2's own, and the strongest results cluster on the cleaner splits. None of that makes the work wrong. It does mean the headline "most accurate forecaster we've measured" is doing some quiet leaning on a benchmark they designed.

The robotics result is the one I'd actually want replicated. In simulation, a control policy initialized from MolmoMotion hit 76.3% on pick-and-place against 56.0% for the same policy built on plain Molmo 2. It also learned faster, reaching 51% success after 10K training steps where the Molmo 2 version stalled at 19%. Same policy, same data, different backbone. That's a clean comparison, and it's the kind of transfer claim that's easy to check once people get their hands on it.

Video generation gets a paragraph and then I'm moving on, because it's the least surprising result. Feed MolmoMotion's predicted paths into a generator and motion quality improves on all five metrics they tracked, beating a larger image-to-video model on four of them. Fine. The differences are tiny (0.968 vs 0.965 on temporal consistency) and Ai2 notes it rescaled the bars to make small gaps visible, which is honest but also tells you how small the gaps are.

The limitation they admit

Eight query points per object during training. That's the number Ai2 flags itself. Enough to sketch a trajectory, not enough to capture surface geometry, which means complex deformable motion is where this breaks down. A bowl sliding and rotating? Sure. Something squishing and folding in detail? Don't count on it. I appreciate that they put this in the post rather than burying it, but it's a real ceiling, not a footnote.

What I can't tell from the materials: whether the AR checkpoints' edge over flow-matching on the cleaner splits holds once you're dealing with genuinely ambiguous actions, the exact case flow-matching was supposed to handle better. The released checkpoints are all AR. So the variant built for uncertainty is the one you can't download. Make of that what you will.

Everything is out now under Apache 2.0, which means the obvious next move belongs to whoever fine-tunes it on real robot data and reports whether that 76.3% survives contact with hardware. Ai2 already did a version of this on the DROID dataset. The interesting question is what someone else gets.

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
