Open-Source AI

Z.ai Releases GLM-4.7-Flash, a 30B MoE Model Built for Local Inference

Lightweight version of flagship GLM-4.7 targets developers running models on their own hardware.

Andrés Martínez, AI Content Writer
January 20, 2026 · 2 min read
[Illustration: abstract visualization of MoE architecture showing selective neural pathway activation]

Z.ai pushed GLM-4.7-Flash to Hugging Face this week, adding a smaller sibling to the full GLM-4.7 released in late December. The model uses a Mixture-of-Experts architecture with 30 billion total parameters but only activates roughly 3 billion per forward pass, cutting compute costs while keeping performance competitive.
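The arithmetic behind that trade-off is worth spelling out. A back-of-the-envelope sketch (assuming the stated 30B total / ~3B active split and bf16 weights at 2 bytes per parameter) shows why MoE cuts per-token compute but not memory: every expert must sit in memory, while only a fraction runs on each forward pass.

```python
# Rough MoE sizing sketch; the 30B/3B figures are the article's, the rest
# is standard bf16 arithmetic, not Z.ai's published methodology.
total_params = 30e9    # all experts, resident in memory
active_params = 3e9    # experts actually routed per token
bytes_per_param = 2    # bf16 = 16 bits

weight_memory_gb = total_params * bytes_per_param / 1e9
compute_fraction = active_params / total_params

print(f"weights in memory: ~{weight_memory_gb:.0f} GB")  # ~60 GB, near the 62.5GB listing
print(f"params used per token: {compute_fraction:.0%}")  # 10%
```

That ~60GB lines up with the 62.5GB Hugging Face listing, and the 10% active fraction is where the inference savings come from.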

The company's benchmark table puts GLM-4.7-Flash at 91.6% on AIME 2025, 75.2% on GPQA, and 59.2% on SWE-bench Verified. That SWE-bench number is the attention-getter: resolving real GitHub issues at that rate would make it one of the stronger coding models in its weight class. BrowseComp lands at 42.8%, and τ²-Bench hits 79.5%, numbers the company claims represent state-of-the-art for models around 30B parameters.

Deployment runs on vLLM or SGLang, both from their main branches; Z.ai's documentation includes SGLang launch commands with speculative decoding enabled. The full model weights come to about 62.5GB on Hugging Face in bf16 precision. Hardware requirements aren't trivial for local use, but the low active parameter count keeps inference more tractable than dense 30B alternatives.
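As a rough sketch of what such a launch might look like (the model ID, tensor-parallel degree, and speculative-decoding flag values here are illustrative assumptions, not Z.ai's published commands; consult their docs for the authoritative versions):

```shell
# Hypothetical SGLang server launch for GLM-4.7-Flash.
# Model path and flag values are illustrative only.
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --tp 2 \
  --speculative-algorithm EAGLE
```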

Z.ai positions this as a free-tier option alongside the flagship, suitable for coding agents, web browsing tasks, and general chat.

The Bottom Line: GLM-4.7-Flash gives developers an open-weight MoE option that trades total parameter count for inference efficiency, with self-reported benchmarks that warrant independent testing.


QUICK FACTS

  • Architecture: 30B total parameters, ~3B active (MoE)
  • Model weights: 62.5GB (bf16)
  • SWE-bench Verified: 59.2% (company-reported)
  • AIME 2025: 91.6% (company-reported)
  • License: MIT
  • Inference frameworks: vLLM and SGLang (main branches)
Tags: GLM-4.7-Flash, Z.ai, MoE, open-source LLM, coding model, SWE-bench, local deployment
Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.

