AI Benchmarks

Tencent Releases CL-bench, a Benchmark That Exposes Context Learning Gaps in LLMs

Even GPT-5.1 solves fewer than a quarter of tasks that require applying knowledge supplied in the context.

Oliver Senti, Senior AI Editor
February 4, 2026
[Illustration: a robot ignoring instruction manuals while attempting to solve a puzzle, representing AI models failing to use provided context]

Tencent's Hunyuan research team, in collaboration with Fudan University, has released CL-bench, an open-source benchmark designed to evaluate whether language models can learn new knowledge from context rather than relying on pre-trained parameters. The GitHub repository went live this week alongside a Hugging Face dataset and leaderboard at clbench.com.

The results are uncomfortable for the industry. Across ten evaluated models, the average task-solving rate sits at 17.2%. GPT-5.1 with high reasoning effort leads at 23.7%, which sounds reasonable until you consider that every task provides all necessary information within the context itself.

The benchmark nobody wanted

CL-bench contains 1,899 tasks spread across 500 expert-crafted contexts, each requiring around 20 hours of human effort to annotate. The tasks span four categories: domain knowledge reasoning, rule system application, procedural task execution, and empirical discovery.

What makes this different from standard benchmarks is the contamination-free design. Domain experts either created entirely fictional content (like legal systems for made-up countries), modified existing knowledge to create variants, or drew from niche, recently-emerged specialized knowledge. The researchers wanted to ensure models couldn't simply recall something from pre-training.

The team's paper, available via the repository, notes that each context carries an average of 63.2 verification rubrics used for grading. This isn't a pass/fail situation where getting the gist counts.
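
To make that strictness concrete, here is a minimal sketch of how rubric-based grading might aggregate into a task-level verdict. The data structure and the all-rubrics-must-pass threshold are illustrative assumptions, not CL-bench's actual schema.

```python
# Illustrative sketch of instance-level rubric grading (not CL-bench's real schema):
# a task only counts as solved when the per-rubric verdicts clear a strict
# threshold, so partially correct answers do not accumulate credit.

from dataclasses import dataclass

@dataclass
class RubricResult:
    rubric: str      # one verification criterion attached to the task
    satisfied: bool  # judge model's verdict for that criterion

def grade_task(results: list[RubricResult], threshold: float = 1.0) -> bool:
    """Aggregate per-rubric verdicts into a task-level solved/not-solved call.

    With threshold=1.0 every rubric must be satisfied, matching the article's
    point that "getting the gist" does not count.
    """
    if not results:
        return False
    return sum(r.satisfied for r in results) / len(results) >= threshold

# Example: two of three rubrics met -> task still unsolved under strict grading
verdicts = [
    RubricResult("Cites the correct article of the fictional statute", True),
    RubricResult("Applies the penalty schedule defined in the context", True),
    RubricResult("Honors the exception stated in the provided rules", False),
]
print(grade_task(verdicts))  # False
```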

Why the scores are so low

The Hunyuan researchers identified a specific failure mode: models ignore or misuse context even when it's directly relevant. Rather than learning the new rules, definitions, or procedures provided, models default to whatever static knowledge they acquired during training.

Inductive tasks proved especially brutal. When models needed to derive laws from provided data or explore simulated environments, success rates typically dropped below 10%. Applying clearly-stated rules fared better, but even there, performance remained well below what you'd expect given that all necessary information was explicitly provided.

Higher reasoning effort helped in some cases (GPT-5.1 gained roughly 6% on management and experimental data tasks), but the improvement wasn't consistent. Some models showed no benefit or even declined with additional compute.

The strategic timing

This release marks the first public research output from Tencent HY Research, a blog platform the company is using to share frontier work. It's also the first major publication associated with Shunyu Yao since his appointment as Tencent's chief AI scientist in December 2025.

Yao joined Tencent after working at OpenAI, where he contributed to agent-focused projects. His academic work on language agents and the ReAct framework has accumulated over 15,000 citations. The CL-bench author list includes his name alongside researchers from Fudan University and Tencent's Hunyuan team.

Tencent positioning itself as a benchmark-setter rather than just a model-trainer carries some strategic value. Benchmarks shape optimization targets, and whoever defines what "good" looks like gains indirect influence over research direction.

What this implies

The gap CL-bench exposes matters for practical applications. Enterprise deployments often involve feeding models domain-specific documentation, proprietary procedures, or updated technical specs. If models can't reliably learn from and apply this context, the entire premise of context-window scaling becomes less compelling.

The benchmark is licensed for evaluation only, not training, a restriction intended to keep model developers from optimizing directly against it. Whether that holds in practice depends on enforcement.

CL-bench submissions are open. Developers can download the dataset, run inference using provided scripts, and submit results for the leaderboard. The evaluation uses GPT-5.1 as a judge model with instance-level rubrics.
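
For a sense of what an evaluation run involves, here is a hypothetical sketch: load the dataset from Hugging Face, generate a response with the model under test, and ask the judge model whether each rubric is satisfied. The dataset ID, field names, stub generator, and judge prompt are placeholders; the repository's provided scripts define the actual interface.

```python
# Hypothetical evaluation loop for CL-bench-style grading. Dataset ID, split,
# field names, and the judge prompt are assumptions; use the official scripts
# from the repository for real submissions.

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # requires an API key in the environment

def generate_response(context: str, question: str) -> str:
    """Placeholder for the model under evaluation; plug in your own inference call."""
    raise NotImplementedError

def judge_rubric(context: str, response: str, rubric: str) -> bool:
    """Ask the judge model whether the response satisfies a single rubric."""
    verdict = client.chat.completions.create(
        model="gpt-5.1",  # the article says GPT-5.1 serves as the judge
        messages=[{
            "role": "user",
            "content": (
                f"Context:\n{context}\n\nModel response:\n{response}\n\n"
                f"Rubric: {rubric}\nDoes the response satisfy the rubric? Answer YES or NO."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

tasks = load_dataset("tencent/CL-bench", split="test")  # placeholder dataset ID
for task in tasks:
    answer = generate_response(task["context"], task["question"])
    solved = all(judge_rubric(task["context"], answer, r) for r in task["rubrics"])
```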

Oliver Senti, Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
