ACE Studio and StepFun released ACE-Step 1.5, an open-source music generation model that runs locally on machines with as little as 4GB of VRAM. The GitHub repo went live alongside an arXiv paper and Hugging Face weights. The model generates complete songs with vocals and lyrics in under 2 seconds on an A100 and under 10 seconds on an RTX 3090, according to self-reported benchmarks.
The developers claim ACE-Step 1.5 outperforms most commercial alternatives, placing between Suno v4.5 and v5 in their internal tests. Those figures come from the team's own evaluation using AudioBox metrics, where the model scored 8.09 on content understanding and 8.35 on perceptual quality. Independent verification hasn't surfaced yet.
What makes this release unusual: the team says the training data is entirely licensed, royalty-free, or synthetic (MIDI-to-audio conversion). That's a bold claim in a space where most music-generation models face unresolved legal questions over their training data. The MIT license explicitly permits commercial use of generated outputs.
The architecture combines a language model planner with a diffusion transformer for audio synthesis. Users can train LoRAs from just a few songs to capture specific styles. The model supports 50+ languages and scales from short loops to 10-minute compositions.
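To make the two-stage design concrete, here is a minimal conceptual sketch in Python (PyTorch). It is not ACE-Step's actual code or API; the class names `PlannerLM` and `AudioDiffusionTransformer`, the dimensions, and the four-step denoising loop are illustrative assumptions about how a language-model planner can hand a conditioning sequence to a diffusion transformer.

```python
# Conceptual sketch only: ACE-Step's real architecture and API differ.
# Names, dimensions, and the sampling loop are illustrative assumptions.
import torch
import torch.nn as nn

class PlannerLM(nn.Module):
    """Toy stand-in for a language-model planner that turns a prompt
    (lyrics + style tags) into a sequence of conditioning embeddings."""
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))   # (B, T, dim) plan embeddings


class AudioDiffusionTransformer(nn.Module):
    """Toy stand-in for a diffusion transformer that denoises audio latents,
    attending to the planner's output via cross-attention."""
    def __init__(self, latent_dim=64, cond_dim=256, dim=256):
        super().__init__()
        self.latent_proj = nn.Linear(latent_dim, dim)
        self.cond_proj = nn.Linear(cond_dim, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(dim, latent_dim)

    def forward(self, noisy_latents, plan):
        x = self.latent_proj(noisy_latents)          # (B, L, dim)
        cond = self.cond_proj(plan)                  # (B, T, dim)
        return self.out(self.decoder(x, cond))      # predicted cleaner latents


# One (heavily simplified) generation pass: plan the song, then iteratively denoise.
planner, dit = PlannerLM(), AudioDiffusionTransformer()
prompt_tokens = torch.randint(0, 1000, (1, 32))      # stand-in for lyrics/style tokens
plan = planner(prompt_tokens)

latents = torch.randn(1, 128, 64)                    # start from pure noise
for _ in range(4):                                   # real samplers follow a proper noise schedule
    latents = dit(latents, plan)

print(latents.shape)                                 # torch.Size([1, 128, 64])
```

In a design along these lines, the planner handles long-range structure (sections, lyric alignment) in token space, while the diffusion transformer focuses on turning that plan into audio latents; a separate codec or vocoder (not shown) would decode the latents to a waveform.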
The Bottom Line: An open-source music model matching commercial quality on consumer hardware would be a first, but the training data claims remain unverified.
QUICK FACTS
- VRAM requirement: 4GB minimum
- Generation speed: under 2 seconds on A100, under 10 seconds on RTX 3090 (company-reported)
- License: MIT (commercial use permitted)
- Languages supported: 50+




