ByteDance's Intelligent Creation Lab has open-sourced Lance, a multimodal model that handles image and video understanding, generation, and editing inside a single framework. The technical paper was submitted to arXiv on May 18, 2026, with weights posted to Hugging Face under an Apache 2.0 license.
The pitch is efficiency. Lance runs on 3B active parameters and was trained from scratch on no more than 128 A100 GPUs, modest figures in a field where rivals casually ship 7B unified models. The architecture pairs a dual-stream mixture-of-experts setup with what the team calls modality-aware rotary positional encoding, which tags each visual token by its job (analyze this, condition on that, generate this) so the model stops confusing what's being asked of it mid-sequence.
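To make the role-tagging idea concrete, here is a minimal sketch of one way a rotary encoding could be made aware of a token's job. It is not Lance's actual formulation, which this article does not spell out. The role names, the `ROLE_OFFSETS` values, and the function `modality_aware_rope` are all hypothetical, and shifting the position by a per-role offset is an assumption chosen for simplicity.

```python
import math

# Hypothetical role offsets (an assumption, not taken from the Lance paper).
# Each role gets its own position range so identical positions encode differently.
ROLE_OFFSETS = {"understand": 0, "condition": 1000, "generate": 2000}


def rotary_angles(position, dim, base=10000.0):
    """Standard RoPE angles: one angle per pair of feature dimensions."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of vec by position-dependent angles."""
    if len(vec) % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i, theta in enumerate(rotary_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def modality_aware_rope(vec, position, role):
    """Assumed variant: shift the position by a role-specific offset first."""
    return apply_rope(vec, position + ROLE_OFFSETS[role])


if __name__ == "__main__":
    token = [1.0, 0.0, 0.5, 0.5]
    # Same features, same position, different jobs: the encodings differ.
    print(modality_aware_rope(token, 3, "condition"))
    print(modality_aware_rope(token, 3, "generate"))
```

In this toy version, a visual token used as conditioning input and one being generated end up with different rotations even at the same position, which is one plausible way for attention to tell the two roles apart.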
On the scoreboard: 85.11 on VBench for video generation, 0.90 on GenEval for image generation, 62.0 on MVBench for video understanding, and 7.30 on GEdit-Bench for editing. The catch is that every one of those numbers comes from the authors' own paper, framed as best "among unified models," with no independent testing yet. The team itself flags Lance as a research artifact rather than a polished product, capped at 768x768 images and 480p, 12 FPS video.
Code, demos, and benchmark scripts are live on GitHub, and the Apache 2.0 terms allow commercial use. The real test now is whether developers can reproduce the scores outside ByteDance's own setup.
Bottom Line
Lance claims top scores among unified models at 3B active parameters, but all benchmarks are self-reported in the team's paper.
Quick Facts
- 3B active parameters
- Trained from scratch on up to 128 A100 GPUs
- Released under Apache 2.0 license
- Paper-reported scores: VBench 85.11, GenEval 0.90, MVBench 62.0, GEdit-Bench 7.30
- arXiv paper submitted May 18, 2026




