StepFun shipped Step 3.7 Flash, a 198B sparse Mixture-of-Experts vision-language model released under Apache 2.0. The model card pegs it at roughly 11B active parameters per token, a 256K context window, and three selectable reasoning levels. Throughput tops out at 400 tokens per second.
The pitch is agent efficiency, not raw size. StepFun built it to parse documents, charts, and UIs, then write code or call tools without the tool-call drift that breaks long agent loops. The company claims 98%+ on the tau-2 benchmark across all difficulty tiers, plus #1 placements on ClawEval-1.1 (67.1) and SimpleVQA Search (79.2). All of those numbers are self-reported. Independent runs on standard multimodal evals like MMMU and MATH-Vision haven't surfaced yet.
Weights are live on GitHub alongside GGUF quants for local hosting. With 128GB of unified memory, it runs on a Mac Studio, DGX Station, or AMD Ryzen AI Max+ 395 box. Inference works through vLLM, SGLang, llama.cpp, and Hugging Face Transformers.
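A rough back-of-envelope check shows why the 128GB figure implies a quantized build. The sketch below estimates the size of the weights alone from the 198B parameter count at a few bit widths. The bit widths are illustrative assumptions, not the specific GGUF quants StepFun ships. Real quant files mix precisions and carry metadata, and the KV cache for a long context needs memory on top of the weights.

```python
# Back-of-envelope estimate of weight memory for a 198B-parameter model.
# Assumption: every weight is stored at the same bit width, which real
# GGUF quants do not do exactly; treat the output as a rough guide only.

TOTAL_PARAMS_BILLION = 198   # total parameters, from the model card figure
MEMORY_BUDGET_GB = 128       # unified memory on the target machines


def weight_footprint_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Return the approximate size of the weights in decimal gigabytes."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        size_gb = weight_footprint_gb(TOTAL_PARAMS_BILLION, bits)
        headroom_gb = MEMORY_BUDGET_GB - size_gb
        verdict = "fits" if headroom_gb > 0 else "does not fit"
        print(f"{label}: ~{size_gb:.0f} GB of weights, {verdict} in {MEMORY_BUDGET_GB} GB")
```

Under these assumptions only the roughly 4-bit case lands under 128GB, leaving a few tens of gigabytes for the KV cache, activations, and the operating system.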
API access is available now through StepFun's open platform, with OpenRouter and NVIDIA NIM also listed. One reporting outlet put pricing at $0.20 per million input tokens and $1.15 per million output tokens, undercutting Western mid-tier models, though StepFun's own docs don't headline those figures. DeepInfra, Fireworks AI, and Modal are slated to add the model soon.
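To put the reported rates in agent-loop terms, here is a minimal cost sketch. It takes the $0.20 input and $1.15 output per-million-token figures at face value, even though they come from a single outlet and are not confirmed by StepFun. The workload (50 steps, 20,000 input tokens and 500 output tokens per step) is a hypothetical example, not a measured trace.

```python
# Rough cost estimate for an agent run at the reported (unconfirmed) rates.
# Assumption: flat per-token pricing with no caching discounts or minimums.

INPUT_USD_PER_MILLION = 0.20    # reported input price, not confirmed by StepFun
OUTPUT_USD_PER_MILLION = 1.15   # reported output price, not confirmed by StepFun


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in US dollars for the given token counts."""
    input_cost = input_tokens * INPUT_USD_PER_MILLION
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION
    return (input_cost + output_cost) / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent loop: each step re-sends a long context and
    # produces a short tool call or reply.
    steps = 50
    input_tokens_per_step = 20_000
    output_tokens_per_step = 500

    total = run_cost_usd(steps * input_tokens_per_step,
                         steps * output_tokens_per_step)
    print(f"Estimated cost for {steps} steps: ${total:.4f}")
```

With those made-up numbers the run costs roughly $0.23, and most of it comes from re-sent input context, not generated output.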
Bottom Line
Step 3.7 Flash ships as open weights under Apache 2.0 with ~11B active parameters and a 256K context, runnable on a 128GB Mac Studio.
Quick Facts
- 198B total parameters, ~11B active per token
- Up to 400 tokens per second
- 256K context window, three reasoning levels
- 98%+ on tau-2 benchmark (company-reported)
- Runs locally on 128GB unified memory devices




