ByteDance's Intelligent Creation Lab published DreamLite, a 0.39B-parameter diffusion model designed to generate and edit 1024×1024 images directly on a smartphone. No cloud required. The team claims it's the first unified on-device model to handle both tasks in a single network, and the project page demos look solid, if limited to curated examples.
The architecture runs on a pruned U-Net backbone derived from SDXL, paired with a tiny 1.2M-parameter VAE and Qwen3-VL as the text encoder (quantized to 4-bit for mobile). Training follows a progressive curriculum: text-to-image first, then editing, then joint training on both. DMD2 step distillation compresses inference to just four denoising steps. On a Xiaomi 14 with a Snapdragon 8 Gen 3, the team reports sub-one-second generation using W8A8 quantization and pre-computed text embeddings. On iPhone 17 Pro, with live text encoding via the 4-bit Qwen3-VL, that stretches to roughly three seconds.
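To see why step distillation matters for latency, here is a minimal sketch of few-step diffusion sampling with a pre-computed text embedding. Everything here is hypothetical: the function names, the noise schedule, and the stub standing in for DreamLite's pruned U-Net (the real weights aren't public), but the loop structure is the standard few-step pattern that DMD2-style models use.

```python
import numpy as np

def generate(latent_shape, text_embedding, denoise_fn, num_steps=4, seed=0):
    """Few-step diffusion sampling sketch: start from pure noise and apply
    the distilled denoiser a fixed, small number of times (~4 steps instead
    of the 25-50 an undistilled model would need)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(latent_shape).astype(np.float32)  # initial noise
    # Descending noise levels; real distilled models use tuned timesteps.
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        # Each call predicts the clean latent, conditioned on the text embedding.
        x0_pred = denoise_fn(x, sigmas[i], text_embedding)
        # Re-noise toward the next (lower) noise level.
        noise = rng.standard_normal(latent_shape).astype(np.float32)
        x = x0_pred + sigmas[i + 1] * noise
    return x

# Stub denoiser standing in for the pruned U-Net: nudges the sample toward
# a clean estimate. A real model would run the quantized network here.
stub = lambda x, sigma, emb: x * (1.0 - sigma)

emb = np.zeros(64, dtype=np.float32)  # pre-computed text embedding (stub)
latent = generate((4, 16, 16), emb, stub)
print(latent.shape)  # latent that a VAE decoder would turn into pixels
```

Pre-computing `emb` offline is exactly the trick behind the sub-second Xiaomi number: the text encoder, the heaviest remaining component, never runs on-device at generation time.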
Benchmark numbers: 0.72 on GenEval for generation, 4.11 on ImgEdit for editing. Both are self-reported and, per the paper, beat existing mobile baselines like SnapGen and SANA-0.6B. The team acknowledges weak spots: the ultra-compact VAE struggles with text rendering and identity preservation in portraits. A larger VAE is planned.
The GitHub repo exists but contains no model weights or code yet. Release timeline: "coming soon."
Bottom Line
DreamLite fits both image generation and editing into a 390M-parameter model that runs locally on phones, but weights and code haven't shipped yet.
Quick Facts
- Model size: 0.39B parameters (U-Net) with 1.2M-parameter VAE
- Inference: 4 denoising steps via DMD2 distillation
- Speed: sub-1s on Xiaomi 14 (pre-computed embeddings), ~3s on iPhone 17 Pro (live text encoding)
- Benchmarks (self-reported): GenEval 0.72, ImgEdit 4.11
- Paper published: March 30, 2026 on arXiv
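The W8A8 setting in the facts above means both weights and activations are stored as 8-bit integers. A minimal sketch of the underlying symmetric per-tensor int8 scheme, as an illustration only (mobile runtimes use per-channel scales and fused integer kernels, and DreamLite's exact recipe isn't published):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: map floats onto [-127, 127]
    using a single scale derived from the tensor's max magnitude."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return q.astype(np.float32) * scale

# Toy weight tensor; real W8A8 applies this to every layer's weights
# and (dynamically or with calibration) to its activations.
w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.dtype, float(err))  # int8 codes, small reconstruction error
```

The payoff is 4x smaller tensors than fp32 and integer matmuls that mobile NPUs execute natively, which is where most of the on-device speedup comes from.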