DeepSeek Releases DSpark Speculative Decoding for V4 Models

Open-source draft-model method promises throughput gains of 51% to 400%, according to the company's own figures.

Andrés Martínez, AI Content Writer
June 29, 2026 (2 min read)

DeepSeek has released DSpark, a speculative decoding method for its new V4 Flash and V4 Pro models, with code on GitHub and an accompanying paper. The technique pairs a small, fast draft model with the main model: the draft proposes several next tokens, and the main model verifies them in a single batch. When the guesses are accepted, several tokens come out of one full-model pass instead of one pass each.
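To make the draft-then-verify loop concrete, here is a minimal sketch of generic greedy speculative decoding. It is not DSpark's actual algorithm, which the article does not detail. The two toy "models" and the names `speculative_step`, `generate`, `draft_next`, and `target_next` are all hypothetical, and verification is written as a loop for readability where a real system would score every drafted position in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal in order.
    #    (Assumption: a real server does this in ONE batched pass.)
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft matched, so the same pass also yields one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return how many verify rounds were used."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prefix) + max_new], rounds


# Toy stand-ins: the target counts upward mod 10; the draft agrees
# except that it guesses wrong whenever the last token is 2.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, rounds = generate([0], draft_next, target_next, k=4, max_new=12)
print(tokens, "verify rounds:", rounds)
```

Because every emitted token is either confirmed or supplied by the target model, the output matches what plain greedy decoding with the target alone would produce. The saving comes from needing fewer target rounds than tokens.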

The pitch is throughput. DeepSeek reports gains ranging from 51% to 400% depending on the model and workload, though those numbers are self-reported and have not yet been independently verified. The width of that range matters: a 51% gain and a 400% gain are very different outcomes, and the headline figure says little about where a given deployment will land.
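One way to see why reported gains can span such a wide range is the standard idealized model of speculative decoding, sketched below. It assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per round, and that one draft step costs `draft_cost_ratio` of a target step. These parameter names and the sample values are illustrative assumptions, not figures from DeepSeek, and the model ignores batching effects that matter for real serving throughput.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify round under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per round divided by the cost of a round, in target-step units."""
    round_cost = k * draft_cost_ratio + 1  # k draft steps + 1 target pass
    return expected_tokens_per_round(alpha, k) / round_cost


# Illustrative sweep: same draft length and draft cost, different acceptance rates.
for alpha in (0.5, 0.7, 0.9):
    s = estimated_speedup(alpha, k=4, draft_cost_ratio=0.05)
    print(f"alpha={alpha:.1f}  speedup={s:.2f}x  gain={100 * (s - 1):.0f}%")
```

In this simplified model the acceptance rate dominates: a draft model that guesses right half the time yields a modest gain, while one that is right nine times in ten yields a multiple. Note also that a 400% gain means 5x the baseline throughput, not 4x.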

What makes this more interesting than a typical inference tweak: DeepSpec, the codebase behind DSpark, ships with full training and evaluation tooling, and the supported target families include Qwen3 and Gemma, not just DeepSeek's own checkpoints. So the acceleration isn't locked to V4.

The V4 models themselves are sizable. The model card lists V4 Pro at 1.6T parameters (49B activated) and V4 Flash at 284B (13B activated), both with a million-token context window. DSpark attaches to the existing checkpoint as a separate module; it does not require retraining the model.

For anyone serving these at scale, the math is straightforward: if quality holds and throughput climbs, you fit more requests on the same GPUs or cut your per-token cost. The repo is MIT-licensed. Whether the headline speedups survive real production traffic is the open question.
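The serving arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch that assumes per-token cost scales inversely with throughput and that quality is unchanged. The function names and the 100-GPU fleet are hypothetical, and the two gain values are simply the endpoints of the company-reported range.

```python
import math


def cost_reduction(gain_pct: float) -> float:
    """Fractional drop in per-token cost for a given throughput gain (in %)."""
    return 1 - 1 / (1 + gain_pct / 100)


def gpus_needed(baseline_gpus: int, gain_pct: float) -> int:
    """GPUs required to serve the same load after the throughput gain."""
    return math.ceil(baseline_gpus / (1 + gain_pct / 100))


# Endpoints of the reported range, applied to a hypothetical 100-GPU fleet.
for gain in (51, 400):
    print(f"{gain}% gain: per-token cost down {cost_reduction(gain):.1%}, "
          f"{gpus_needed(100, gain)} GPUs for the same load")
```

At the low end of the range the per-token cost falls by roughly a third; at the high end it falls by 80%. That gap is why independent verification on production-like traffic matters.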


Bottom Line

DSpark is a speculative decoding module for DeepSeek V4 Pro (1.6T params) and Flash (284B), released open-source under MIT with reported throughput gains of 51% to 400%.

Quick Facts

  • Method: DSpark speculative decoding (DeepSpec codebase)
  • Throughput gain: 51% to 400%, company-reported, unverified
  • V4 Pro: 1.6T total params, 49B activated
  • V4 Flash: 284B total params, 13B activated
  • Context window: 1 million tokens
  • License: MIT; supported targets include Qwen3 and Gemma
Tags: DeepSeek, speculative decoding, inference optimization, open-weight models, DeepSeek V4, LLM throughput