Meta released Muse Spark on Wednesday, the first AI model out of its nine-month-old Superintelligence Labs unit and a tacit admission that the Llama era needed a reboot. The model, codenamed Avocado, powers the updated Meta AI assistant on meta.ai starting today, with WhatsApp, Instagram, and Messenger rollouts to follow in the coming weeks.
This is Alexandr Wang's first big deliverable since Meta paid $14.3 billion for a 49% stake in Scale AI last June and installed him as chief AI officer. The price tag makes Muse Spark one of the most expensive model debuts in history, though Meta would probably prefer you not do that math.
The benchmarks tell a familiar story
Muse Spark scores 52 on the Artificial Analysis Intelligence Index, which slots it behind GPT-5.4 and Gemini 3.1 Pro (both at 57) and Claude Opus 4.6 (53). Competitive, sure. But "competitive" is doing a lot of work in Meta's messaging here.
The health benchmarks are where things get interesting. On HealthBench Hard, Muse Spark posted 42.8%, ahead of every rival, including GPT-5.4 at 40.1%. Meta says it worked with over 1,000 physicians to curate training data for health reasoning, which is either a genuine differentiator or the kind of stat that sounds better in a press release than in practice. Time will tell which.
Coding is a different story. On Terminal-Bench 2.0, Muse Spark scored 59.0 against GPT-5.4's 75.1. On ARC-AGI-2, which tests abstract reasoning, it managed 42.5 while both GPT-5.4 and Gemini 3.1 Pro scored in the mid-70s. Meta acknowledged the gaps directly, noting continued investment in "long-horizon agentic systems and coding workflows." At least they're not pretending.
About those Llama 4 benchmarks
There's an elephant in the room. Fortune noted that Meta previously got caught using specialized, unreleased model variants to inflate Llama 4's benchmark scores; the general-release version didn't match those numbers. Whether you trust the Muse Spark figures at face value depends on how much goodwill you think Meta has rebuilt since then.
Independent evaluation from Artificial Analysis does broadly corroborate the picture: Muse Spark is a genuine jump from Llama 4 Maverick, which scored just 18 on the same index. That's a real improvement. But "third-party confirms it's better than the thing everyone agreed was bad" is a low bar.
Closed source, and that matters
The other big shift: Muse Spark is proprietary. No open weights. Meta's CNBC interview included vague language about hoping to open-source future versions, but for now, access runs through Meta's own apps or a private API preview for select partners. That's a significant departure from the Llama strategy that made Meta popular with the open-source community.
The model requires a Facebook or Instagram login to use, which, as TechCrunch pointed out, raises obvious privacy questions. Meta doesn't explicitly say it uses personal data from those accounts, but the company's track record and its positioning of Muse Spark as a "personal superintelligence" product don't exactly inspire confidence.
What's the actual play here?
Meta's framing of Muse Spark as "small and fast by design" reads like careful expectation management. The company is spending somewhere between $115 billion and $135 billion on AI infrastructure in 2026 alone, according to its latest earnings report. Calling your first model from that investment "small" is a choice.
The multi-agent "Contemplating" mode, where parallel subagents tackle different parts of a query simultaneously, is genuinely novel in how it's deployed at consumer scale. Whether it works as advertised for Meta's 3 billion users is another question entirely. It's rolling out gradually, which in product-speak usually means "we're not sure it's ready."
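Meta hasn't published implementation details for Contemplating mode, but the pattern it describes, splitting a query across parallel subagents and merging their partial answers, is a standard fan-out/gather design. The sketch below is purely illustrative: every name (`run_subagent`, `contemplate`, the facet-based decomposition) is hypothetical, and a real system would call a model API and synthesize results rather than concatenate strings.

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    # Stand-in for a model call handling one facet of the query.
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"partial answer for: {subtask}"

async def contemplate(query: str, n_agents: int = 3) -> str:
    # Naive decomposition: one subagent per facet of the query.
    subtasks = [f"{query} (facet {i})" for i in range(n_agents)]
    # Fan out: all subagents run concurrently.
    partials = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    # Gather/merge: a production system would synthesize, not join.
    return " | ".join(partials)

result = asyncio.run(contemplate("summarize these benchmark results"))
print(result)
```

The appeal of this shape at consumer scale is latency: total wall-clock time is bounded by the slowest subagent rather than the sum of all of them. The hard parts, which this sketch skips entirely, are decomposing the query sensibly and reconciling subagents that disagree.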
Muse Spark is available now at meta.ai. The Contemplating mode and broader app rollouts are expected in the coming weeks. API pricing hasn't been announced.