Researchers in China have published a new benchmark that tests vision-language models on the full visual history of Chinese writing, and the early scores suggest frontier labs have been quietly avoiding this kind of evaluation for good reason. On Chronicles-OCR, GPT-5 scores 0.0 on detecting individual characters in oracle bone inscriptions. Claude Opus 4.7 scores 0.0. Gemini 3.1 Pro scores 0.0.
The benchmark was assembled by a team led by researchers at the Chinese Academy of Sciences, with annotation help from the Key Laboratory of Oracle Bone Inscription Information Processing at Anyang Normal University and the Palace Museum. It covers what paleographers call the Seven Chinese Scripts, from the angular carvings on tortoise shells in the Shang Dynasty through to running script. The dataset contains 2,800 images, split evenly across the seven scripts (400 per script), and defines four evaluation tasks. The technical report went up alongside the GitHub release.
An OCR test that isn't really OCR
Here's the pitch. Standard OCR benchmarks ask whether a model can read text. Chronicles-OCR asks a different question: when the same character is written seven different ways across three thousand years, on bone, on bronze, in cursive ink, can the model recognize it's the same character?
That's a visual reasoning question, not a text-recognition one. And it's exactly the kind of thing multimodal models are supposed to be getting good at.
Data on the Hugging Face mirror splits cleanly between archaic scripts (Oracle Bone, Bronze, Seal) and mature ones (Clerical onward). Performance falls off a cliff at the archaic boundary, and not just for the small open-source models.
The 0.0 problem
Some numbers, because the numbers are the story. On Oracle Bone character spotting, the best score belongs to ByteDance's Seed 2.0 Pro at 3.0 percent. Second place: 2.4. Of the 29 model runs on the leaderboard, 17 scored exactly 0.0 on that task, including every Western proprietary model. Seed 1.8 avoided a zero only by a margin that amounts to a rounding error.
Fine-grained character recognition is slightly better. Slightly. Kimi K2.5 leads on oracle bones at 11.5 percent. Most others sit between 0 and 5 percent. Bronze script does a bit better: Kimi K2.5 at 25.8, Seed 2.0 Pro at 30.8. Then there's seal script, where Kimi K2.5 manages 58.5 percent. Not great. Not zero either.
Mature scripts are an entirely different story. Kimi K2.5 hits 77.0 percent average classification accuracy across clerical, regular, running, and cursive. Qwen3.5-A17B gets 0.73 normalized edit distance on parsing tasks, which is actually usable. Anything from the Han Dynasty onward, models can mostly handle. Anything older breaks them.
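For readers who haven't met the parsing metric before, here is a minimal sketch of normalized edit distance in Python. It assumes the common definition (Levenshtein distance divided by the length of the longer string). The leaderboard figures quoted here don't say how Chronicles-OCR normalizes, or whether the reported number is the distance itself or its complement (a similarity, where higher is better), so the function names and the `edit_similarity` helper are illustrative assumptions, not the benchmark's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Distance scaled to [0, 1]; 0.0 means the strings are identical.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


def edit_similarity(prediction: str, reference: str) -> float:
    """Complement of the normalized distance, where higher is better."""
    return 1.0 - normalized_edit_distance(prediction, reference)


if __name__ == "__main__":
    # One wrong character out of four (simplified 黄 vs traditional 黃).
    print(normalized_edit_distance("天地玄黄", "天地玄黃"))
    print(edit_similarity("天地玄黄", "天地玄黃"))
```

Under this definition a transcription with one wrong character in four has a distance of 0.25 and a similarity of 0.75, which is why it matters which of the two a leaderboard column reports.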
Does thinking help?
No, mostly. And this is the part of the leaderboard I find most interesting. Chain-of-thought modes (the "Think" column in the results table) don't reliably improve performance and frequently make things worse. Qwen3-VL-A22B drops from 7.8 to 2.1 on character spotting when reasoning is enabled. Kimi K2.5 falls from 27.1 to 20.3 on fine-grained archaic recognition. Seed 2.0 Pro is one of the few that benefits, and only by a hair.
I'm not sure what to make of that, honestly. The intuitive read is that thinking modes encourage models to over-reason about glyphs they can't actually see clearly, generating plausible-sounding interpretations rather than admitting visual uncertainty. Worth digging into.
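To put the drops quoted above in perspective, a few lines of Python turn each pair of scores into an absolute and a relative change. The only inputs are the two score pairs cited in this article; the dictionary layout and the `thinking_delta` helper are my own illustrative names, not anything from the benchmark's codebase.

```python
# (model, task) -> (score without thinking, score with thinking),
# copied from the figures quoted in this article.
quoted_scores = {
    ("Qwen3-VL-A22B", "character spotting"): (7.8, 2.1),
    ("Kimi K2.5", "fine-grained archaic recognition"): (27.1, 20.3),
}


def thinking_delta(no_think: float, think: float) -> tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute = think - no_think
    relative = absolute / no_think * 100.0 if no_think else float("nan")
    return round(absolute, 1), round(relative, 1)


if __name__ == "__main__":
    for (model, task), (no_think, think) in quoted_scores.items():
        absolute, relative = thinking_delta(no_think, think)
        print(f"{model} | {task}: {absolute:+.1f} points ({relative:+.1f}%)")
```

On these two pairs, Qwen3-VL-A22B loses 5.7 points (about 73 percent of its score) and Kimi K2.5 loses 6.8 points (about 25 percent). The absolute drops are similar, but the relative hit is far larger for the model that started lower.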
What this actually measures
There's a fair question about whether oracle bone OCR is something the field should care about. Probably not on its own terms. But the interesting failure mode here isn't "model lacks training data." It's that the visual transformation from oracle bone to seal script to regular script is enormous, and current VLMs apparently don't generalize across it.
Models can label an image as oracle bone script with near-perfect accuracy (classification scores hit 96 to 100 across the board). They just can't read what's on the bone. The benchmark is doing what good benchmarks do: separating two capabilities that previously looked the same.
Worth noting that models from Chinese labs dominate the leaderboard. Moonshot's Kimi K2.5, ByteDance's Seed 2.0 Pro, and Alibaba's Qwen3.5 variants outperform every Western proprietary model on the archaic tasks. Geographic specialization in training data probably explains most of this, though the gap is wider than I expected. GPT-5 scoring 0.0 on character spotting and 3.7 on fine recognition is not a number that gets advertised in keynote slides.
What happens next
Benchmark code is live, the dataset is on Hugging Face under a research-only license, and anyone with API credits can run their own evaluations. The cited paper lists a 2026 publication year, so it's likely still a preprint heading toward review.
For frontier labs, this is a cheap, unflattering test. Whether the next round of model updates picks it up as a target probably depends on whether the labs decide Chinese paleography is a capability worth optimizing for, or a niche that doesn't move the needle. The 0.0 scores suggest no one has been training for it yet.




