Infrastructure

Z.ai Blames GLM-5 Production Glitches on Inference Stack, Not the Model

Z.ai's debug post traces rare garbled GLM-5 outputs to a KV cache race and HiCache desync, not the model weights.

Oliver Senti, Senior AI Editor
May 4, 2026 · 4 min read
[Image: Server rack with GPU clusters and abstract glitch artifacts overlaid on data streams]

Z.ai published a postmortem on its blog last week about debugging GLM-5 at production scale, and the punchline isn't about the model. The 744B-parameter mixture-of-experts system was producing rare garbled outputs in production: strange symbols, token repetitions, the occasional rogue Chinese character showing up where it didn't belong. Tests passed. Metrics looked clean. And yet the artifacts kept surfacing under real load.

The fix wasn't more training data or another round of fine-tuning. It was the inference stack.

The bugs were in the plumbing

Three problems, by Z.ai's account in the scaling pain writeup. The first was a race condition in the KV cache: under heavy concurrency, the key-value tensors for parallel requests were being read and overwritten in the wrong order. The model wasn't hallucinating. It was being fed corrupted context.
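
To make that failure mode concrete, here is a deliberately tiny Python sketch of how an unsynchronized cache slot corrupts a request's context. It's my illustration of the general bug class, not Z.ai's or SGLang's actual code, and every name in it is invented.

```python
import threading
import time

# Toy sketch of the failure mode, not Z.ai's or SGLang's code: two requests
# end up sharing one KV cache slot because bookkeeping hands the slot out
# before the first request has finished decoding. The first request then
# reads the second request's keys/values mid-generation, i.e. it is fed
# corrupted context.
kv_block = {"owner": None, "kv": None}   # a single shared cache slot

def generate(req_id: str, prompt_kv: str, out: dict) -> None:
    kv_block["owner"] = req_id
    kv_block["kv"] = prompt_kv                        # write this request's context
    tokens = []
    for _step in range(3):                            # a few decode steps
        time.sleep(0.01)                              # widen the race window for the demo
        tokens.append(f"tok(from {kv_block['kv']})")  # read the context back each step
    out[req_id] = tokens

results: dict = {}
a = threading.Thread(target=generate, args=("A", "kv-A", results))
b = threading.Thread(target=generate, args=("B", "kv-B", results))
a.start(); b.start(); a.join(); b.join()

# Request A's tokens come back tagged with kv-B: it decoded against another
# request's context, which is what garbled output looks like from the outside.
print(results["A"])
```

Run it and request A's tokens come back stamped with request B's context. The model-side analogue is attention over someone else's keys and values, which surfaces as exactly the kind of stray symbols and repetitions Z.ai describes.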

The second sat one layer up, in HiCache. SGLang's hierarchical KV cache is supposed to cut latency by tiering hot and cold cache across memory levels. Useful idea, apparently easy to break. Z.ai says certain load patterns desynced the levels, and the cache itself turned into a bug source. (For what it's worth, a separate SGLang issue opened in February already flagged HiCache offloading problems on GLM-4.6 FP8, so the symptom isn't entirely new.)
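
The post doesn't spell out the exact desync mechanism, so the snippet below is only a generic two-tier caching sketch of how levels fall out of sync when an offloaded copy isn't invalidated. The tier names and keys are invented for illustration.

```python
# Generic two-tier cache sketch, not HiCache's implementation: a "hot" tier
# (think GPU memory) backed by a "cold" tier (think host RAM). If an entry is
# refreshed in the hot tier without invalidating the offloaded copy, a later
# reload silently resurrects the stale version: the tiers have desynced.
hot: dict[str, str] = {}
cold: dict[str, str] = {}

def offload(key: str) -> None:
    """Move an entry from the hot tier down to the cold tier."""
    cold[key] = hot.pop(key)

def load(key: str) -> str:
    """Serve from the hot tier, falling back to (and promoting) the cold copy."""
    if key not in hot:
        hot[key] = cold[key]
    return hot[key]

# A cached prefix gets offloaded under memory pressure...
hot["prefix:chat-42"] = "kv-v1"
offload("prefix:chat-42")

# ...a later request recomputes and refreshes it in the hot tier, but the
# cold copy is never invalidated.
hot["prefix:chat-42"] = "kv-v2"

# The hot entry is evicted, and the next load brings back the stale kv-v1.
del hot["prefix:chat-42"]
assert load("prefix:chat-42") == "kv-v1"   # stale context served
```

Real hierarchical caches guard against this with versioning or invalidation on write, and the failure Z.ai hit is presumably subtler. But the shape is the same: the cache stops being a transparent optimization and starts injecting stale state.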

The third item is an optimization rather than a fix: a layer redistribution scheme Z.ai is calling LayerSplit. The team rebalanced model layers across compute resources to keep more silicon busy more of the time. They report a 132% throughput improvement. That's a big number. It's also internal-against-internal benchmarking, and the post doesn't compare against alternative serving stacks like vLLM or FriendliAI's, so the headline figure is hard to read in absolute terms.
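
Z.ai hasn't published how LayerSplit actually works, so the best I can offer is a generic sketch of the idea it names: partition layers into pipeline stages so the most expensive stage is as cheap as possible, which keeps more devices busy. The layer costs and the greedy heuristic below are hypothetical.

```python
# Hypothetical sketch of layer rebalancing, not Z.ai's LayerSplit: assign
# contiguous layers to pipeline stages so no single stage dominates the
# critical path. Costs and heuristic are made up for illustration.

def balance_layers(layer_costs: list[float], num_stages: int) -> list[list[int]]:
    """Greedily assign contiguous layers to stages, opening a new stage when
    the current one would exceed its ideal share of the total cost."""
    target = sum(layer_costs) / num_stages
    stages: list[list[int]] = [[]]
    acc = 0.0
    for i, cost in enumerate(layer_costs):
        remaining_layers = len(layer_costs) - i
        stages_left = num_stages - len(stages)
        if stages[-1] and acc + cost > target and stages_left > 0 and remaining_layers >= stages_left:
            stages.append([])
            acc = 0.0
        stages[-1].append(i)
        acc += cost
    return stages

# Example: expensive early layers, cheaper later ones.
costs = [3.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(balance_layers(costs, 4))   # [[0], [1], [2, 3], [4, 5, 6, 7]], max stage cost 4
# A naive even split ([[0, 1], [2, 3], [4, 5], [6, 7]]) loads one stage with
# cost 6 and leaves the other three waiting on it.
```

Whether LayerSplit resembles this at all is unknown. The point is only that layer placement, not the weights, is the lever being pulled.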

Why this matters more than another benchmark

The GLM-5 weights are public. Anyone can pull them from Hugging Face and run their own inference. What the post is really arguing, between the lines, is that the model isn't the product anymore. The serving infrastructure is.

FriendliAI made a similar argument in a March piece on serving GLM-5: with a 200K context window and sparse MoE routing, the bottleneck shifts from compute to memory bandwidth, KV cache management, and whether your scheduler can handle agentic workflows that hold state for minutes or hours. Long-context inference is a memory problem, not a FLOPs problem. Their pitch, of course, is that they've built infrastructure that handles it. They would say that.
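
A back-of-the-envelope calculation shows why. The architecture numbers below are placeholders, since neither post publishes GLM-5's attention configuration, but the shape of the math holds for any transformer: KV cache grows linearly with context length and with concurrent sessions, regardless of how many FLOPs the GPUs can deliver.

```python
# Back-of-the-envelope KV cache sizing. The architecture numbers are
# placeholders (neither post publishes GLM-5's attention config); only the
# shape of the math matters.
num_layers = 80          # hypothetical transformer depth
num_kv_heads = 8         # hypothetical (grouped-query attention)
head_dim = 128           # hypothetical per-head dimension
bytes_per_value = 2      # fp16/bf16
context_len = 200_000    # the 200K window cited in FriendliAI's post

# 2x for keys and values, per layer, per token.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_bytes_per_request = bytes_per_token * context_len

print(f"{bytes_per_token / 1024:.0f} KiB of KV cache per token")               # 320 KiB
print(f"{kv_bytes_per_request / 2**30:.1f} GiB for one full-context request")  # 61.0 GiB
```

Swap in the real head counts and the constant changes, but the conclusion doesn't: at 200K tokens, a handful of concurrent sessions becomes a memory-capacity problem long before it becomes a compute problem.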

Both posts arrive at the same uncomfortable place. Benchmark scores tell you what a model can do in lab conditions. Production tells you whether the system around it is ready for the real world. Those are increasingly different questions.

The catch

I'd take the 132% figure with a grain of salt until someone outside Z.ai reproduces it. Throughput improvements at this scale depend on which workload you measure, which hardware you use, which baseline you compare against. The blog is light on those specifics.

The KV cache race condition is more interesting because it's concrete and falsifiable. If you have rare garbled outputs at production scale and you're running an SGLang-based stack, that's something to check. Cache races are exactly the bug class that doesn't show up on a quiet test cluster.
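
One cheap way to check is a concurrency probe: hammer the deployment with the same greedy-decoded prompt and diff the completions. The sketch below is my own, not Z.ai's methodology; it assumes an OpenAI-compatible completions endpoint, and the URL and model name are placeholders for whatever you're running.

```python
import concurrent.futures
import requests

# Rough probe for serving-layer nondeterminism: send the same greedy-decoded
# prompt many times concurrently and count distinct completions. With
# temperature=0, divergence that shows up only under concurrent load points
# at the serving stack rather than the weights. URL and model name are
# placeholders, assuming an OpenAI-compatible completions endpoint.
URL = "http://localhost:30000/v1/completions"
PAYLOAD = {
    "model": "glm-5",
    "prompt": "List the first five prime numbers.",
    "max_tokens": 64,
    "temperature": 0,
}

def one_request(_: int) -> str:
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    return resp.json()["choices"][0]["text"]

with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    outputs = list(pool.map(one_request, range(512)))

distinct = sorted(set(outputs))
print(f"{len(distinct)} distinct completions across {len(outputs)} requests")
for text in distinct:
    print(repr(text[:80]))
```

Treat divergence as a lead, not proof: batched inference can be benignly nondeterministic because floating-point reduction order shifts with batch composition. The interesting signal is completions that come back garbled or truncated only when the cluster is busy.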

What the post doesn't include is also telling. No detailed reproduction of the bad outputs, no commit hashes, no error rates before and after the fix. "Rare" is doing a lot of work here. One in ten thousand requests? One in a million? The post doesn't say, and Z.ai didn't respond to questions about it.

What's next

GLM-5.1 shipped in April with stronger long-horizon agent capabilities, and the GitHub repo has kept landing deployment recipes for both vLLM and SGLang. Whether LayerSplit gets upstreamed into either is the question worth watching. If it stays an internal Z.ai optimization, the 132% number means little to anyone running GLM-5 outside their API. If it lands in SGLang, other teams will find out quickly whether the gains hold up.

Tags: GLM-5, Z.ai, LLM inference, SGLang, KV cache, mixture of experts, AI infrastructure, HiCache
Oliver Senti, Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
