DeepSeek released DeepSeek-OCR-2 today, introducing an approach the company calls Visual Causal Flow that departs from how most vision-language models read documents. The code and paper are now on GitHub, with weights on Hugging Face.
Standard vision LLMs process images as grids, scanning left-to-right and top-to-bottom. OCR-2's architecture instead forms a global understanding of the document first, then determines the logical reading order. For a newspaper with multiple columns or a form with scattered fields, that distinction matters. The model figures out what connects to what before deciding the sequence.
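To make the contrast concrete, here is a toy sketch of the general idea of layout-aware reading order versus a raster scan. It is not DeepSeek's actual Visual Causal Flow mechanism (the paper has those details); the region boxes and the column-grouping heuristic are invented for illustration.

```python
# Toy illustration only: contrasts raster-order reading with layout-aware ordering.
# Region coordinates and the column heuristic are invented for this sketch.
from dataclasses import dataclass

@dataclass
class Region:
    text: str
    x: float  # left edge (0..1)
    y: float  # top edge (0..1)

def raster_order(regions):
    """What a naive grid scan does: strictly top-to-bottom, left-to-right."""
    return sorted(regions, key=lambda r: (r.y, r.x))

def layout_aware_order(regions, column_gap=0.3):
    """Group regions into columns first, then read each column top to bottom."""
    columns = {}
    for r in regions:
        col_key = round(r.x / column_gap)      # crude column assignment by x position
        columns.setdefault(col_key, []).append(r)
    ordered = []
    for col_key in sorted(columns):            # left column before right column
        ordered.extend(sorted(columns[col_key], key=lambda r: r.y))
    return ordered

# Two-column page: raster order interleaves the columns, layout-aware order does not.
page = [Region("Col 1, line 1", 0.05, 0.10), Region("Col 2, line 1", 0.55, 0.10),
        Region("Col 1, line 2", 0.05, 0.15), Region("Col 2, line 2", 0.55, 0.15)]
print([r.text for r in raster_order(page)])
print([r.text for r in layout_aware_order(page)])
```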
The predecessor DeepSeek-OCR achieved 97% OCR accuracy while using 10 times fewer vision tokens than comparable systems. OCR-2 builds on the DeepEncoder architecture with a dynamic resolution system that processes up to six 768×768 tiles plus one 1024×1024 overview, generating anywhere from 144 to more than 400 visual tokens depending on document complexity. The technical paper provides full architectural details.
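A rough sketch of what dynamic-resolution tiling can look like in practice, using the tile sizes quoted above; the grid-selection heuristic and the cap of six tiles are assumptions for illustration, not the DeepEncoder's actual implementation.

```python
# Illustrative sketch of dynamic-resolution tiling, not DeepSeek's DeepEncoder code.
# Tile sizes (768x768 local tiles, 1024x1024 overview) come from the article;
# the grid-selection heuristic and MAX_TILES cap are assumptions.
from PIL import Image

TILE, OVERVIEW, MAX_TILES = 768, 1024, 6

def tile_document(image_path: str):
    img = Image.open(image_path).convert("RGB")
    w, h = img.size

    # Always keep one downscaled global view of the whole page.
    overview = img.resize((OVERVIEW, OVERVIEW))

    # Pick a grid of local tiles proportional to the page's aspect ratio,
    # capped at MAX_TILES so the visual-token count stays bounded.
    cols = max(1, min(MAX_TILES, round(w / TILE)))
    rows = max(1, min(MAX_TILES // cols, round(h / TILE)))
    resized = img.resize((cols * TILE, rows * TILE))

    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    return overview, tiles

# A simple page might yield one or two tiles; a dense multi-column page up to six,
# which is how the token budget scales with document complexity.
```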
Unsloth has already added fine-tuning support for the model. The first version showed an 88% improvement in character error rate after just 60 training steps when adapted to Persian text.
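For a sense of what that fine-tuning setup looks like, here is a minimal LoRA sketch in the style of Unsloth's vision fine-tuning notebooks. The Hugging Face repo ID is assumed, and the data collator and training loop are left to Unsloth's own notebook for the model.

```python
# Minimal LoRA setup sketch in the style of Unsloth's vision fine-tuning notebooks.
# The repo ID "deepseek-ai/DeepSeek-OCR-2" is assumed; check the actual model card,
# and follow Unsloth's notebook for the data collator and training loop.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "deepseek-ai/DeepSeek-OCR-2",      # assumed Hugging Face repo ID
    load_in_4bit=True,                 # 4-bit base weights keep memory low
    use_gradient_checkpointing="unsloth",
)

# Attach LoRA adapters to both the vision encoder and the language decoder,
# the usual recipe for adapting OCR output to a new script such as Persian.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)
```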
The Bottom Line: A 3B-parameter OCR model that reads documents more like humans do, with immediate open-source availability and fine-tuning tools.
QUICK FACTS
- Model size: 3B parameters
- License: Apache 2.0
- Dynamic resolution: up to six 768×768 tiles plus one 1024×1024 overview
- Inference: vLLM and Transformers supported (see the Transformers sketch after this list)
- Outputs: Markdown, structured text, table parsing, figure descriptions
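For inference through Transformers, loading typically looks like the sketch below. The repo ID, prompt string, and the custom `infer` helper follow the pattern of the first DeepSeek-OCR's remote code and are assumptions to verify against the OCR-2 model card.

```python
# Inference sketch via Hugging Face Transformers with trust_remote_code.
# The repo ID, prompt string, and model.infer(...) helper mirror the first
# DeepSeek-OCR's remote code; verify the exact call in the OCR-2 model card.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-OCR-2"   # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model = model.eval().to(torch.bfloat16).cuda()

# Ask for Markdown output, one of the formats listed above.
result = model.infer(
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",
    image_file="invoice_page.png",        # hypothetical input file
)
print(result)
```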
