StepFun dropped its Step-3.5-Flash model over the weekend, and the numbers are worth a closer look. The Shanghai-based company claims its 196B-parameter model beats DeepSeek's 671B-parameter V3.2 across multiple benchmarks while activating just 11B parameters per token during inference. If those claims hold up under independent testing, the efficiency implications are substantial.
The Parameter Gap
The headline comparison is stark. Step-3.5-Flash uses 196B total parameters with 11B active. DeepSeek V3.2 sits at 671B total with 37B active. That's less than a third the size on both counts: roughly 29% of the total parameters and 30% of the active ones.
StepFun's own benchmarks show an average score of 81.0 across eight core evaluations, compared to 76.7 for DeepSeek V3.2. On AIME 2025, they report 97.3 versus 93.1. On LiveCodeBench-V6, it's 86.4 versus 83.3. The agent benchmark τ²-Bench shows an even wider gap: 88.2 versus 80.3.
But here's where I'd pump the brakes. These are StepFun's numbers, run under their conditions. Independent verification matters. And the model just hit Hugging Face today, so we're waiting on third-party results.
What's Under the Hood
The architecture relies on a sparse Mixture of Experts design, nothing unusual there. The interesting bits are in the attention mechanism: they're using a 3:1 sliding-window attention (SWA) ratio, alternating three SWA layers for every full-attention layer. This hybrid approach keeps the 256K context window manageable without the quadratic cost explosion you'd normally expect.
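To make that 3:1 interleaving concrete, here's a minimal sketch of the layer schedule. The depth and window size are placeholder values I've picked for illustration, not figures from StepFun's documentation.

```python
# Sketch of a 3:1 hybrid attention stack. Layer count and window size are
# illustrative assumptions, not StepFun's published configuration.

SLIDING_WINDOW = 4096   # hypothetical local-attention window
TOTAL_LAYERS = 48       # hypothetical depth

def attention_schedule(num_layers: int, swa_per_full: int = 3):
    """Return a per-layer schedule: 'swa' for sliding-window, 'full' for global."""
    schedule = []
    for i in range(num_layers):
        # Every (swa_per_full + 1)-th layer attends over the full 256K context;
        # the rest only attend within a local window, keeping cost near-linear.
        if (i + 1) % (swa_per_full + 1) == 0:
            schedule.append("full")
        else:
            schedule.append("swa")
    return schedule

print(attention_schedule(8))
# ['swa', 'swa', 'swa', 'full', 'swa', 'swa', 'swa', 'full']
```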
StepFun claims the setup allows for speculative decoding via their 3-way Multi-Token Prediction heads, which they say pushes generation throughput to 100-300 tokens per second in typical use, peaking at 350 tok/s for single-stream coding. That's fast enough for real-time agentic workflows, which appears to be their primary target.
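If you haven't worked with speculative decoding before, the shape of the idea is easy to show in a toy loop: a cheap draft step proposes a few tokens, and one forward pass of the full model either confirms or truncates them. The sketch below is that general pattern with stand-in functions, not StepFun's MTP implementation.

```python
# Toy illustration of speculative decoding with k-token draft heads.
# draft_k_tokens and verify are stand-ins; the point is the control flow:
# propose k tokens cheaply, check them in one full-model pass, keep the
# longest verified prefix plus one token from the full model.

import random

def draft_k_tokens(context, k=3):
    """Stand-in for the draft heads: cheaply guess the next k tokens."""
    return [(context[-1] + i + 1) % 100 for i in range(k)]

def verify(context, proposed):
    """Stand-in for one full-model forward pass over the proposed tokens."""
    accepted = []
    for tok in proposed:
        if random.random() < 0.8:      # pretend the big model agrees 80% of the time
            accepted.append(tok)
        else:
            break
    accepted.append(random.randrange(100))  # full model always contributes one token
    return accepted

def generate(prompt, steps=5):
    tokens = list(prompt)
    for _ in range(steps):
        proposed = draft_k_tokens(tokens, k=3)   # 3-way draft, analogous to MTP-3
        tokens.extend(verify(tokens, proposed))  # between 1 and 4 tokens per full pass
    return tokens

print(generate([1, 2, 3]))
```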
The model also runs on consumer hardware, according to StepFun's documentation. They specifically mention Mac Studio M4 Max and NVIDIA DGX Spark as viable local deployment targets.
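A quick back-of-envelope calculation makes the consumer-hardware claim plausible, at least for the weights alone. The quantization levels below are my assumptions, and the estimate ignores KV cache and activation memory.

```python
# Rough weight-only memory footprint for a 196B-parameter model at different
# quantization levels. Assumed precisions for illustration, not StepFun's
# published deployment configuration.

def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_footprint_gb(196, bits):.0f} GB")
# 16-bit: ~392 GB, 8-bit: ~196 GB, 4-bit: ~98 GB. Only the last fits in a
# 128 GB-class machine, and even then KV cache has to share what's left.
```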
The RL Story
StepFun introduced what they call MIS-PO (Metropolis Independence Sampling Filtered Policy Optimization) for training. The pitch: standard PPO-style importance weighting breaks down on long reasoning sequences because even small token-level discrepancies can produce extreme gradient updates. Their approach uses importance ratios purely as a binary filter, accepting or rejecting whole trajectories rather than scaling gradients continuously.
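Based purely on that description, the mechanism would look something like the sketch below: compute a sequence-level importance ratio, use it only to accept or reject the whole trajectory, and let accepted trajectories contribute unweighted gradients. Treat this as my reading of the summary, not StepFun's published algorithm; the actual acceptance rule and loss are theirs to specify.

```python
# Trajectory-level filtering as described above. The acceptance rule and the
# REINFORCE-style loss are placeholders for illustration.

import math
import random

def sequence_log_ratio(logp_current, logp_behavior):
    """Sum of per-token log-prob gaps between the current and behavior policy."""
    return sum(c - b for c, b in zip(logp_current, logp_behavior))

def mis_filter(trajectories):
    """Accept or reject whole trajectories; never scale gradients by the ratio."""
    kept = []
    for traj in trajectories:
        log_ratio = sequence_log_ratio(traj["logp_current"], traj["logp_behavior"])
        # Metropolis-style acceptance with probability min(1, ratio) -- placeholder.
        accept_prob = 1.0 if log_ratio >= 0 else math.exp(log_ratio)
        if random.random() < accept_prob:
            kept.append(traj)  # accepted trajectories count with weight 1
    return kept

def policy_loss(kept):
    """Unweighted objective over accepted trajectories, no importance scaling."""
    total = 0.0
    for traj in kept:
        total -= traj["advantage"] * sum(traj["logp_current"])
    return total / max(len(kept), 1)
```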
Whether this actually solves the off-policy drift problem at scale remains to be seen. The technical documentation provides some ablation curves but stops short of the kind of comprehensive analysis you'd want for evaluating a new RL method.
vLLM Support Landed Same Day
The timing is notable. StepFun's team submitted a pull request to vLLM that was merged within 24 hours of the model's release. That's fast for a new architecture implementation, and it suggests either strong coordination with the vLLM maintainers or significant pre-release work. Either way, day-zero inference support removes one of the usual friction points for new model adoption.
The PR includes support for their MTP-3 speculative decoding and the custom SwiGLU variant they use in the expert layers.
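In practice, day-zero support means the standard vLLM offline API should work once the change ships in a release. Here's a minimal example; the Hugging Face repo ID is my assumption rather than a confirmed name, so check StepFun's model card for the exact identifier and any speculative-decoding options their PR exposes.

```python
# Minimal vLLM offline-inference sketch. The repo ID below is assumed.

from vllm import LLM, SamplingParams

llm = LLM(model="stepfun-ai/Step-3.5-Flash")  # assumed Hugging Face repo ID
params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(["Write a function that merges two sorted lists."], params)
print(outputs[0].outputs[0].text)
```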
StepFun's Position
StepFun is one of China's "AI Tiger" companies, alongside Zhipu AI, MiniMax, and Moonshot AI. They just closed a $717 million Series B+ round in late January, outpacing the IPO proceeds of competitors who went public on the Hong Kong exchange. The company has partnerships with Chinese phone manufacturers including Honor, Oppo, and ZTE, with their models reportedly running on over 42 million devices.
The company was founded in 2023 by Jiang Daxin, formerly of Microsoft Research Asia. They recently appointed Yin Qi, co-founder of Megvii Technology (the Face++ company), as chairman.
What's Missing
The benchmarks cover reasoning, coding, and agentic tasks. What you don't see: comprehensive multilingual evaluation, detailed safety testing, or independent reproduction of the training efficiency claims. StepFun acknowledges in their documentation that the model "may experience reduced stability during distribution shifts" and can exhibit "repetitive reasoning, mixed-language outputs, or inconsistencies in time and identity awareness" in specialized domains.
The token efficiency concern is worth noting. StepFun admits Step-3.5-Flash "relies on longer generation trajectories than Gemini 3.0 Pro to reach comparable quality." So the per-token cost advantage may get partially eaten by needing more tokens to complete equivalent tasks.
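A toy calculation shows how quickly that can happen; every number here is invented for illustration, not drawn from either vendor's pricing or measured trajectory lengths.

```python
# Made-up numbers: a lower per-token price only wins if the extra tokens
# needed don't cancel it out.

def task_cost(price_per_million_tokens: float, tokens_per_task: float) -> float:
    return price_per_million_tokens * tokens_per_task / 1e6

cheap_but_verbose = task_cost(price_per_million_tokens=0.30, tokens_per_task=12_000)
pricier_but_terse = task_cost(price_per_million_tokens=1.25, tokens_per_task=4_000)

print(f"${cheap_but_verbose:.4f} vs ${pricier_but_terse:.4f} per task")
# 0.0036 vs 0.0050 -- the cheaper model still wins here, but a 4x price gap
# shrinks to about 1.4x once it needs 3x the tokens.
```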
The Broader Pattern
This release fits a pattern we've been watching since DeepSeek V3 dropped in late December 2024. Chinese labs are competing aggressively on inference efficiency, not just benchmark scores. The bet seems to be that making frontier-class models cheap to run matters more than pushing the absolute capability ceiling.
Whether Step-3.5-Flash actually delivers on its efficiency claims will become clear as the community runs independent tests. The GitHub repo has deployment instructions, and weights are available on Hugging Face under Apache 2.0. We should have third-party numbers within the week.
For now, the release is notable primarily for what it implies: the efficiency frontier in open-source AI is moving faster than most observers expected.




