MiniMax open-sourced the weights for M2.7 on April 12, roughly a month after the model's initial announcement on March 18. The Shanghai-based company had delayed the release, candidly admitting it underestimated the infrastructure work needed to prepare the weights for public deployment. That kind of honesty is refreshing from an AI lab, even if the delay frustrated developers who'd been using M2.7 through the API since launch.
The model itself is a sparse Mixture-of-Experts architecture: 230 billion total parameters, but only 10 billion active per token. That ratio is the entire pitch. You get frontier-adjacent performance at a fraction of the compute cost, with weights now on Hugging Face and deployment support already baked into SGLang and vLLM.
But the part worth paying attention to isn't the architecture; it's the self-evolution story.
The model that rewrote its own scaffolding
MiniMax built a research agent powered by an internal version of M2.7 and pointed it at the company's own reinforcement learning pipeline. The agent handles data pipelines, monitors training runs, debugs failures, and collaborates with human researchers. According to MiniMax's blog post, the model now handles 30 to 50 percent of the RL team's workflow end-to-end, with humans stepping in only for critical decisions.
Then they went further. They tasked M2.7 with optimizing its own programming scaffold, an internal coding harness, over more than 100 autonomous rounds. The loop was simple in concept: analyze failure trajectories, modify code, run evaluations, compare results, decide whether to keep or revert changes. No human directed each step. The result, according to MiniMax, was a 30% improvement on internal evaluation sets.
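In code, that keep-or-revert loop is almost trivial. Here's a sketch (the config parameters and the toy evaluation function are my illustration, not MiniMax's actual harness):

```python
import random

def evaluate(config):
    """Toy stand-in for running the eval suite on a scaffold config.
    Hypothetical: score peaks when temperature is 0.7."""
    return 1.0 - abs(config["temperature"] - 0.7)

def propose_change(config):
    """Toy stand-in for the model analyzing failures and editing its scaffold."""
    candidate = dict(config)
    candidate["temperature"] += random.uniform(-0.1, 0.1)
    return candidate

config = {"temperature": 1.0}
best_score = evaluate(config)
for _ in range(100):                      # "more than 100 autonomous rounds"
    candidate = propose_change(config)
    score = evaluate(candidate)
    if score > best_score:                # keep the change...
        config, best_score = candidate, score
    # ...otherwise revert (keep the old config, discard the candidate)
```

The real system replaces the toy functions with actual code edits and real evaluation runs, but the control flow is the same: propose, measure, keep or revert.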
That 30% figure deserves some scrutiny. It is measured against MiniMax's own internal benchmarks, not any public evaluation. And the optimizations M2.7 discovered weren't architectural breakthroughs. They were things like tuning sampling parameters (temperature, frequency penalty), adding loop detection to the agent loop, and designing workflow guidelines such as automatically searching for the same bug pattern across files after a fix. Useful, sure. But this is hyperparameter search and prompt engineering on autopilot, not a model rewriting its own weights.
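Loop detection, one of the changes M2.7 reportedly made, is the kind of thing you can sketch in a dozen lines: check whether the agent's recent actions repeat a short cycle (an illustrative stand-in, not MiniMax's code):

```python
def detects_loop(actions, max_period=3, repeats=3):
    """Return True if the tail of `actions` repeats a cycle of length
    <= max_period at least `repeats` times in a row."""
    for period in range(1, max_period + 1):
        tail = actions[-period * repeats:]
        if len(tail) < period * repeats:
            continue
        cycle = tail[:period]
        if all(tail[i] == cycle[i % period] for i in range(len(tail))):
            return True
    return False

detects_loop(["plan", "edit", "test", "edit", "test", "edit", "test"])  # True
```

Wired into the agent loop, a check like this lets the harness break out or switch strategy when the model starts thrashing between the same edit and the same failing test.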
The comparison to Andrej Karpathy's autoresearch project is apt, and several commentators have already made it. Karpathy's repo automates the scientific method for ML experiments in about 630 lines of code. M2.7 goes one step further: instead of just iterating on external code, it modifies the harness it runs inside. Whether that distinction matters in practice is an open question.
Benchmarks (with caveats)
On SWE-Pro, which tests real software engineering across multiple languages, M2.7 scored 56.22%. MiniMax says that matches GPT-5.3-Codex. On VIBE-Pro, a repo-level code generation benchmark that covers web, Android, iOS, and simulation tasks, it hit 55.6%, which MiniMax claims is close to Opus 4.6. These are MiniMax's reported numbers. Independent verification is thin on the ground so far, though the model's availability on Ollama and multiple API providers should change that quickly.
The MLE Bench Lite result is more interesting to me. MiniMax gave M2.7 access to 22 machine learning competitions (all runnable on a single A30 GPU) and let it run three 24-hour trials with a simple harness built around short-term memory, self-feedback, and self-optimization. The best run produced 9 gold medals, 5 silver, 1 bronze. Average medal rate across the three trials: 66.6%, tying with Gemini 3.1 and trailing Opus 4.6 at 75.7% and GPT-5.4 at 71.2%.
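MiniMax hasn't published the harness, but a loop built around short-term memory, self-feedback, and a fixed time budget looks roughly like this (the hill-climbing toy is mine; a real run would propose and score actual ML experiments against the competition metric):

```python
import time

def run_trial(propose, score, budget_seconds):
    """Time-budgeted loop: propose a solution conditioned on past attempts,
    score it, and feed the result back as short-term memory."""
    memory, best = [], float("-inf")
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        solution = propose(memory)                         # condition on feedback
        s = score(solution)
        memory.append({"solution": solution, "score": s})  # self-feedback
        best = max(best, s)
    return best, memory

# Toy stand-ins: hill-climb a scalar toward an optimum at 1.0.
def toy_propose(memory):
    if not memory:
        return 0.0
    best_attempt = max(memory, key=lambda m: m["score"])
    return best_attempt["solution"] + 0.05

toy_score = lambda x: -abs(x - 1.0)

best, memory = run_trial(toy_propose, toy_score, budget_seconds=0.05)
```

The point of the structure is exactly what the benchmark result suggests: because each proposal is conditioned on the scored history, more wall-clock budget means more feedback to condition on, and performance keeps climbing through the window.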
What caught my attention wasn't the final medal count. It was that performance kept improving throughout each 24-hour window. The model found better approaches the longer it ran. That's the kind of long-horizon behavior that actually matters for agentic use cases, where a model needs to sustain coherent optimization over hours, not seconds.
On GDPval-AA, which measures professional task delivery across 45 models, M2.7 hit an Elo of 1495, the highest among open-source models, trailing only Opus 4.6, Sonnet 4.6, and GPT-5.4.
The price and the catch
At $0.30 per million input tokens and $1.20 per million output tokens through the API, M2.7 is absurdly cheap for its performance tier. Only xAI's Grok 4.1 Fast undercuts it. For teams running high-volume agent workloads, that pricing changes what's feasible.
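The arithmetic is worth making concrete. A back-of-envelope cost per agent task at those rates (the token counts below are hypothetical, picked to resemble a multi-turn coding task):

```python
INPUT_PER_M = 0.30    # USD per million input tokens (M2.7 API pricing)
OUTPUT_PER_M = 1.20   # USD per million output tokens

def run_cost(input_tokens, output_tokens):
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# Hypothetical agent task: 40 tool-call turns, ~8k input + ~1k output tokens each.
per_task = run_cost(40 * 8_000, 40 * 1_000)
print(f"${per_task:.3f} per task")   # prints "$0.144 per task"
```

Fourteen cents for a forty-turn agent run is the kind of number that makes always-on background agents economically plausible.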
And now the weights are public, so self-hosting is on the table. The GitHub repo provides deployment guides for SGLang, vLLM, and Transformers. NVIDIA has posted an integration blog with optimized kernels for the MoE architecture on Blackwell GPUs, and community quantizations from Unsloth and others are appearing on Hugging Face.
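For a quick local deployment, vLLM's OpenAI-compatible server is the shortest path. A minimal sketch (the Hugging Face repo id below is my guess, and the right parallelism depends on your hardware; check the repo's deployment guide for both):

```shell
# Serve via vLLM's OpenAI-compatible server (repo id is hypothetical)
pip install vllm
vllm serve MiniMaxAI/MiniMax-M2.7 --tensor-parallel-size 8
```

Any OpenAI-client-compatible tooling can then point at the local endpoint.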
The license is MIT, which is about as permissive as it gets.
What's actually new here?
I'm not sure the self-evolution framing is as novel as MiniMax wants it to be. Google DeepMind shipped AlphaEvolve, OpenAI has been talking about Symphony, and Karpathy's autoresearch showed the pattern in a weekend project. The idea that models can participate in their own improvement loop isn't new. What's new is a company shipping it as a product feature and open-sourcing the result.
The more practically interesting development might be Agent Teams, M2.7's native multi-agent collaboration feature with stable role identity across complex workflows. MiniMax also open-sourced OpenRoom, a browser-based desktop where an AI agent operates apps through natural language (and most of the code was, naturally, written by M2.7 itself).
The broader context matters too. M2.7 drops at a moment when Chinese open-source models are stacking up fast. Zhipu released GLM-5.1 recently, and DeepSeek V4 is reportedly coming in late April with native multimodal support. MiniMax is positioning M2.7 not just as a model but as a workflow primitive, one that developers are already running through Claude Code, Cursor, and OpenClaw.
Whether the self-evolution story holds up under independent testing is the question to watch. The weights are out. The community will tell us soon enough.




