Ant Group's research arm pushed Ling-2.6-flash, an open-source mixture-of-experts instruct model that totals 104 billion parameters but activates only 7.4 billion on any given forward pass, to Hugging Face this week. Mirror weights also went up on ModelScope. The license is MIT.
The architecture continues the direction set by Ling 2.5: a 1:7 ratio of full MLA attention layers to Lightning Linear attention layers, stacked on top of a sparse MoE backbone. Ant says this gets inference to 340 tokens per second on a 4x H20 setup, with prefill and decode throughput up roughly 4x against comparable peers. Those numbers are company-reported, measured on Ant's own hardware.
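For intuition, here is a minimal, non-causal sketch of what a 1:7 interleave of full attention and kernelized linear attention can look like. The module names, the elu+1 feature map, and the use of standard multi-head attention as a stand-in for MLA are illustrative assumptions; the release does not detail the Lightning Linear internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Generic kernelized linear attention (elu+1 feature map), O(n) in
    sequence length -- a stand-in for the Lightning Linear layers, whose
    actual design is not publicly specified here."""
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1                 # positive feature map
        kv = torch.einsum("bsd,bse->bde", k, v)           # d x d running summary
        z = 1.0 / (torch.einsum("bsd,bd->bs", q, k.sum(1)) + 1e-6)
        return self.out(torch.einsum("bsd,bde,bs->bse", q, kv, z))

def build_stack(n_layers: int, d_model: int) -> nn.ModuleList:
    # One full-attention layer per seven linear layers -> the stated 1:7 ratio.
    return nn.ModuleList(
        nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        if i % 8 == 0 else LinearAttention(d_model)
        for i in range(n_layers)
    )
```

The point of the hybrid is that most layers pay O(n) rather than O(n²) in sequence length, which is where the claimed prefill and decode throughput gains at a 262K context would come from.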
The independent read is more modest. Artificial Analysis clocks the median provider at 209.8 tokens per second and scores the model 26 on its Intelligence Index. The full benchmark run consumed 15M output tokens and cost $22.90, which AA flags as somewhat verbose against the 7.9M-token average for the suite. The context window is 262K tokens.
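A quick back-of-envelope on that $22.90, assuming the listed API prices applied to the evaluation; the input-token count below is inferred from the stated figures, not reported by AA.

```python
# Reported: 15M output tokens, $22.90 total, at $0.10/M input and $0.30/M output.
output_cost = 15.0 * 0.30                        # 15M output tokens -> $4.50
implied_input_m = (22.90 - output_cost) / 0.10   # -> ~184M input tokens (inferred)
print(f"output: ${output_cost:.2f}; implied input: ~{implied_input_m:.0f}M tokens")
```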
Ant is aiming this squarely at agent workloads: tool use, multi-step planning, the kind of thing that burns tokens fast in long-reasoning systems. The release notes also concede that the model still hallucinates tool calls in complex scenarios and that bilingual Chinese-English switching needs work.
SGLang and vLLM are both supported at launch, with BF16 and FP8 weights officially provided. Pricing on the lone API provider currently listed sits at $0.10 per million input tokens and $0.30 per million output tokens.
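For reference, a minimal vLLM offline-inference sketch. The Hugging Face repo id below is an assumption based on the release naming; check the model card for the exact id and any required loading flags.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="inclusionAI/Ling-2.6-flash",  # assumed repo id -- verify on the model card
    dtype="bfloat16",                    # FP8 checkpoints are also published
    trust_remote_code=True,              # may be needed for the custom architecture
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Outline a plan to summarize a CSV file."], params)
print(outputs[0].outputs[0].text)
```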
Bottom Line
Ling-2.6-flash activates only 7.4B of its 104B parameters per token and cost $22.90 to evaluate across the full Artificial Analysis Intelligence Index.
Quick Facts
- 104B total parameters, 7.4B active per forward pass
- 262K context window
- 340 tokens/s on 4x H20 (company-reported); 209.8 tokens/s median per Artificial Analysis
- 15M output tokens consumed on the full AA Intelligence Index suite (flagged as verbose vs. the 7.9M-token suite average)
- Pricing: $0.10 / $0.30 per 1M input/output tokens
- MIT license; BF16 and FP8 weights