Open-Source AI

Z.ai Releases GLM-4.7-Flash, a 30B MoE Model Built for Local Inference

Lightweight version of flagship GLM-4.7 targets developers running models on their own hardware.

Andrés Martínez, AI Content Writer
January 20, 2026 · 2 min read
[Illustration: abstract visualization of MoE architecture showing selective neural pathway activation]

Z.ai pushed GLM-4.7-Flash to Hugging Face this week, adding a smaller sibling to the full GLM-4.7 released in late December. The model uses a Mixture-of-Experts architecture with 30 billion total parameters but only activates roughly 3 billion per forward pass, cutting compute costs while keeping performance competitive.
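The arithmetic behind that trade-off is worth spelling out. A back-of-the-envelope sketch (assuming the stated 30B total / ~3B active split and bf16 weights at 2 bytes per parameter) shows why MoE cuts per-token compute but not memory: every expert must sit in memory, while only a fraction runs on each forward pass.

```python
# Rough MoE sizing sketch; the 30B/3B figures are the article's, the rest
# is standard bf16 arithmetic, not Z.ai's published methodology.
total_params = 30e9    # all experts, resident in memory
active_params = 3e9    # experts actually routed per token
bytes_per_param = 2    # bf16 = 16 bits

weight_memory_gb = total_params * bytes_per_param / 1e9
compute_fraction = active_params / total_params

print(f"weights in memory: ~{weight_memory_gb:.0f} GB")  # ~60 GB, near the 62.5GB listing
print(f"params used per token: {compute_fraction:.0%}")  # 10%
```

That ~60GB lines up with the 62.5GB Hugging Face listing, and the 10% active fraction is where the inference savings come from.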

The company's benchmark table puts GLM-4.7-Flash at 91.6% on AIME 2025, 75.2% on GPQA, and 59.2% on SWE-bench Verified. That SWE-bench number is the attention-getter: resolving real GitHub issues at that rate would make it one of the stronger coding models in its weight class. BrowseComp lands at 42.8%, and τ²-Bench hits 79.5%, numbers the company claims represent state-of-the-art for models around 30B parameters.

Deployment runs on vLLM or SGLang, both from their main branches; Z.ai's documentation includes SGLang launch commands with speculative decoding enabled. The full model weights come to about 62.5GB on Hugging Face in bf16 precision. Hardware requirements aren't trivial for local use, but the low active parameter count keeps inference more tractable than dense 30B alternatives.
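As a rough sketch of what such a launch might look like (the model ID, tensor-parallel degree, and speculative-decoding flag values here are illustrative assumptions, not Z.ai's published commands; consult their docs for the authoritative versions):

```shell
# Hypothetical SGLang server launch for GLM-4.7-Flash.
# Model path and flag values are illustrative only.
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --tp 2 \
  --speculative-algorithm EAGLE
```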

Z.ai positions this as a free-tier option alongside the flagship, suitable for coding agents, web browsing tasks, and general chat.

The Bottom Line: GLM-4.7-Flash gives developers an open-weight MoE option that trades total parameter count for inference efficiency, with self-reported benchmarks that warrant independent testing.


QUICK FACTS

  • Architecture: 30B total parameters, ~3B active (MoE)
  • Model weights: 62.5GB (bf16)
  • SWE-bench Verified: 59.2% (company-reported)
  • AIME 2025: 91.6% (company-reported)
  • License: MIT
  • Inference frameworks: vLLM and SGLang (main branches)
Tags: GLM-4.7-Flash, Z.ai, MoE, open-source LLM, coding model, SWE-bench, local deployment
Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.

