Video Generation

Linum V2: Two Brothers Ship 2B-Parameter Text-to-Video Model Under Apache 2.0

YC-backed startup releases open-weight video generation models after two years of development.

Oliver Senti, Senior AI Editor
January 23, 2026 · 3 min read

Sahil and Manu Chopra, brothers who went through Y Combinator's W23 batch, released Linum v2 on January 20, 2026. The models generate 2-5 second video clips at up to 720p and are available on Hugging Face under Apache 2.0.

Two people, one model

The release is notable mostly for who built it. Linum is a team of two. Not two leads with a supporting cast. Two people total, training foundation models from scratch.

Sahil graduated from Stanford in 2019, where he co-wrote a graduate course on LLMs. Manu finished early at UC Berkeley in 2021 and did reinforcement learning research for anesthesia administration. They started Linum in fall 2022, got into Y Combinator, and have been grinding on video generation since.

The company's launch post is refreshingly candid about the work involved. Data procurement, VLM training for filtering, captioning pipelines for what they describe as "100+ years of video footage." Then there's the compute side: benchmarking cloud providers (apparently not all H100s perform equally), negotiating prices, keeping clusters running.

It took two years.

What V2 actually does

The model uses T5 for text encoding, the Wan 2.1 VAE for compression, and a DiT-variant backbone trained with flow matching. They built their own temporal VAE but ended up using Wan's because it was smaller with equivalent performance. The brothers say they'll open-source their VAE separately.
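Flow matching, in brief, trains the backbone to regress the velocity that carries a noise sample toward a data sample along a straight interpolation path. A minimal numpy sketch of that training objective (shapes and names here are illustrative, not Linum's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "latent video" shape: (frames, channels, height, width).
x1 = rng.standard_normal((4, 8, 16, 16))  # data sample (VAE latent)
x0 = rng.standard_normal(x1.shape)        # pure noise
t = rng.uniform()                         # random timestep in [0, 1]

# Linear interpolation path between noise and data.
x_t = (1.0 - t) * x0 + t * x1

# Flow-matching regression target: the constant velocity along the path.
v_target = x1 - x0

# A real DiT would predict v from (x_t, t, text embedding); here we fake
# a perfect prediction just to show the loss that would be minimized.
v_pred = v_target.copy()
loss = np.mean((v_pred - v_target) ** 2)
print(loss)  # 0.0 for the perfect predictor
```

At sampling time the model integrates the predicted velocity field from noise back to a clean latent, which the VAE then decodes into frames.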

In their Hacker News thread, they're upfront about what works and what doesn't. Cartoons, food, nature scenes, simple character motion: good. Complex physics, fast motion like gymnastics or dancing, consistent text rendering: not good.

The 720p model needs roughly 20GB of VRAM. Generating a 5-second clip takes about 15 minutes on an H100 at 50 steps. A commenter pointed out this seems high for a 2B model; the Chopras explained the T5 encoder alone is around 5B parameters, eating up ~10GB in bfloat16.
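That memory arithmetic checks out: bfloat16 stores two bytes per parameter, so a rough back-of-the-envelope (weights only, ignoring activations and VAE overhead) looks like:

```python
BYTES_PER_PARAM_BF16 = 2

def weight_gb(params_billions: float) -> float:
    """Approximate weight memory in GB for bf16 parameters."""
    return params_billions * 1e9 * BYTES_PER_PARAM_BF16 / 1e9

t5_gb = weight_gb(5.0)   # ~5B-parameter T5 encoder
dit_gb = weight_gb(2.0)  # ~2B-parameter DiT backbone

print(t5_gb, dit_gb)  # 10.0 4.0 GB of weights, before activations
```

The roughly 14 GB of weights plus activation memory lands in the ballpark of the reported ~20 GB requirement.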

The v1 detour

This is their second attempt. The first version bootstrapped off Stable Diffusion XL by inflating the U-Net's 2D convolutions into 3D and adding temporal attention layers. It shipped in January 2024 as a 180p, 1-second GIF bot on Discord.
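The inflation trick is common in early video-model work: a pretrained 2D convolution kernel gains a temporal dimension, initialized so the 3D conv reproduces the 2D conv on each frame. A hedged numpy sketch of one standard recipe (centered initialization; Linum's exact method isn't published):

```python
import numpy as np

def inflate_2d_to_3d(w2d: np.ndarray, kt: int = 3) -> np.ndarray:
    """Inflate a (out, in, kh, kw) conv kernel to (out, in, kt, kh, kw).

    The 2D weights go in the center temporal slice; the other slices
    are zero, so the 3D conv initially acts frame-by-frame, exactly
    like the pretrained 2D conv.
    """
    out_c, in_c, kh, kw = w2d.shape
    w3d = np.zeros((out_c, in_c, kt, kh, kw), dtype=w2d.dtype)
    w3d[:, :, kt // 2] = w2d
    return w3d

w2d = np.random.default_rng(0).standard_normal((64, 32, 3, 3))
w3d = inflate_2d_to_3d(w2d)
print(w3d.shape)  # (64, 32, 3, 3, 3)
```

Fine-tuning then has to teach the temporal slices and attention layers motion from scratch, which is where the image-VAE limitation described below bites.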

That approach hit a dead end. Image VAEs don't understand temporal coherence. Without the original training data, transitioning smoothly between image and video distributions proved too costly. Better to start over.

What they're betting on

The Chopras frame video generation as a stepping stone toward accessible animation filmmaking. Traditional animation software like Blender, they argue, is "functionally rich but semantically poor." You can do anything, but actually doing something useful is hard.

Their bet is that video models can invert this relationship. The physics will be lossy and often wrong (see: their failed generation examples with cars that have foregrounds and backgrounds moving independently). But the controls could eventually become more intuitive.

Whether that bet pays off depends on post-training for physics and deformations, speed improvements through CFG distillation and timestep distillation, and scaling the model up. Audio capabilities are on the roadmap too.

For now, it's 2-5 seconds of 720p video from two brothers who decided training foundation models was preferable to depending on someone else's API.

Tags: Linum, text-to-video, open source AI, Apache 2.0, Y Combinator, video generation, DiT, machine learning
Oliver Senti
Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

