The Allen Institute for AI launched Molmo 2 on December 16, 2025, releasing an open-source video understanding model family that can track objects, count events, and answer questions tied to specific frames in a clip. Models and datasets landed on Hugging Face that day. Training code, promised "soon" in the official announcement, has not appeared. The GitHub repo still reads "Code coming soon" as of this writing.
That gap matters less than it might seem, because the technical report (arXiv 2601.10611) documents the training recipe in enough detail that a well-resourced team could reconstruct it. But it does complicate Ai2's "fully open" framing, which the institute's December newsletter adopted before the training code was actually out.
The efficiency claim is the interesting part
Ai2's strongest argument for Molmo 2 isn't the benchmark numbers. It's the data efficiency. The 8B model was trained on 9.19 million videos. Meta's PerceptionLM used 72.5 million. Ai2 claims Molmo 2 outperforms PerceptionLM on key video tracking metrics despite training on roughly one-eighth of the data. If that holds up under independent testing, it's a meaningful result for anyone trying to build video understanding on a budget.
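The ratio is worth checking, since "one-eighth" gets rounded loosely in coverage like this. Using the two training-set sizes from the technical report:

```python
# Sanity-check the data-efficiency ratio reported for Molmo 2 vs. PerceptionLM.
molmo2_videos = 9.19e6        # training videos for Molmo 2 (8B), per Ai2
perceptionlm_videos = 72.5e6  # training videos for Meta's PerceptionLM

ratio = molmo2_videos / perceptionlm_videos
print(f"Molmo 2 trained on {ratio:.1%} of PerceptionLM's video count")
# → Molmo 2 trained on 12.7% of PerceptionLM's video count
```

That 12.7% is just a hair above one-eighth (12.5%), so "roughly an eighth" is the honest phrasing; either way the gap is close to an order of magnitude.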
The approach relies on dense human annotation rather than volume. Ai2's video caption dataset uses spoken descriptions from human annotators, which the institute says produces richer temporal detail than written captions or auto-generated transcripts. Captions average hundreds of words per video clip. Whether that explains the efficiency gap or whether there are other confounding factors (different training compute, different benchmark selection) is hard to say from the paper alone.
What the benchmarks actually show
Ranjay Krishna, Ai2's computer vision research lead, made the most memorable claim in pre-launch briefings: the 7B model now outperforms last year's 72B model. That's against Ai2's own previous system, not the current field, and the 72B model it's beating is over a year old. Still, a 10x reduction in parameters for equivalent performance is genuinely significant if it replicates.
On video tracking, Molmo 2's numbers against Gemini 3 Pro are sharp: 38.4 vs 20.0 F1 on video pointing, and 56.2 vs 41.1 J&F on video object tracking, according to the technical report. Those are Ai2's own evals, so the usual caveats apply. But the gap is large enough that it's hard to dismiss as cherry-picking.
Here's what Ai2 doesn't hide, and it's more credible to me than the headline claims: "video grounding is still hard, and no model yet reaches 40% accuracy" on their video grounding benchmarks. Including Molmo 2. The honest acknowledgment that they're nowhere near solving the problem is more useful than a press release claiming they have.
Three variants, one fully open
The model family has three main checkpoints. Molmo 2 (8B) and Molmo 2 (4B) are both built on Alibaba's Qwen 3 language model. The third variant, Molmo 2-O (7B), is built on Olmo, Ai2's own fully open language model. The "-O" variant is the one researchers who want full end-to-end transparency should reach for: the underlying LLM, the vision encoder, the connector weights, all accessible. The Qwen-based variants are open-weight but carry Alibaba's licensing restrictions on the LLM backbone.
For inference, the Hugging Face model cards include working code using the transformers library. The Ai2 blog post covers the demo playground where you can upload video clips directly. Deployment beyond that, including multi-node fine-tuning and checkpoint conversion utilities, is still waiting on the training code release.
Why this is still worth watching
Open video understanding models that go beyond description have been scarce. Grounding, the ability to pinpoint where and when something happens at pixel and timestamp level, has basically been a proprietary feature. Molmo 2 is the most capable openly licensed model for that task. And Ai2 built the nine new datasets they used to train it without distilling from closed models, which means the data lineage is clean. Teams building on Molmo 2 aren't inheriting legal uncertainty from GPT-4o-generated annotations.
Whether Ai2 gets the training code out in weeks or months will determine how much of the research community can actually build on this rather than just run inference. The institute's track record on follow-through is good: the original Molmo training code eventually shipped. But "coming soon" has now meant more than two months. The question is whether anyone's waiting.