DeepSeek Open-Sources OCR-2 with Human-Like Reading Architecture

Visualization of DeepSeek-OCR-2's human-like document reading flow connecting related text elements

DeepSeek released DeepSeek-OCR-2 today, introducing an approach the company calls Visual Causal Flow that departs from how most vision-language models read documents. The code and paper are now on GitHub, with weights on Hugging Face.

Standard vision-language models process images as grids, scanning left-to-right and top-to-bottom. OCR-2's architecture instead forms a global understanding of the document first, then determines the logical reading order. For a newspaper with multiple columns or a form with scattered fields, that distinction matters: a fixed grid scan can interleave text from unrelated columns or fields, whereas OCR-2 figures out what connects to what before deciding the sequence.
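To make the difference concrete, here is a toy sketch of grid-style ordering versus a layout-aware ordering on a two-column page. It is not DeepSeek's method: the `Block` type, the coordinates, and the fixed `column_width` are all assumptions made up for illustration, and a real model would infer the layout rather than be told it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A hypothetical text block with the position of its top-left corner."""
    text: str
    x: int
    y: int


def raster_order(blocks):
    """Grid-style order: sort by row first, then by horizontal position."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, column_width):
    """Layout-aware order: assign each block to a column, then read each column downward."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.x // column_width, b.y))]


# Two columns (A and B), two paragraphs each.
page = [
    Block("A1", x=0, y=0),
    Block("B1", x=300, y=0),
    Block("A2", x=0, y=100),
    Block("B2", x=300, y=100),
]

raster = raster_order(page)
columns = column_order(page, column_width=300)

print(raster)   # ['A1', 'B1', 'A2', 'B2']
print(columns)  # ['A1', 'A2', 'B1', 'B2']
```

The raster result jumps between columns after every paragraph, which is the failure mode a multi-column newspaper exposes. The column-aware result keeps each column's text together.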

The predecessor, DeepSeek-OCR, achieved 97% OCR accuracy while using 10 times fewer vision tokens than comparable systems. OCR-2 builds on the DeepEncoder architecture with a dynamic resolution system that processes up to six 768×768 tiles plus one 1024×1024 overview, generating from 144 to more than 400 visual tokens depending on document complexity. The technical paper provides full architectural details.

Unsloth has already added fine-tuning support for the model. When the first version, DeepSeek-OCR, was adapted to Persian text, it showed an 88% improvement in character error rate after just 60 training steps.
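For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between the model output and the reference text, divided by the reference length. The sketch below computes it and shows how a relative improvement figure is derived. It assumes the 88% figure is a relative reduction in CER, which the article does not spell out, and the before/after values are made-up numbers chosen only to illustrate the arithmetic.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction in error rate (assumed meaning of 'improvement')."""
    return (before - after) / before


# Classic example: 3 edits against a 6-character reference.
example_cer = cer("kitten", "sitting")
print(example_cer)  # 0.5

# Hypothetical CER values, not taken from any published result.
gain = relative_improvement(before=0.50, after=0.06)
print(f"{gain:.0%}")  # 88%
```

Because CER is normalized by reference length, it can exceed 1.0 when the output is much longer than the reference.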

The Bottom Line: A 3B-parameter OCR model that reads documents more like humans do, with immediate open-source availability and fine-tuning tools.

QUICK FACTS

Model size: 3B parameters
License: Apache 2.0
Dynamic resolution: up to six 768×768 tiles plus one 1024×1024 overview
Inference: vLLM and Transformers supported
Outputs: Markdown, structured text, table parsing, figure descriptions

Tags: DeepSeek, OCR, vision-language, document-AI, open-source, VLM, DeepEncoder

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.

