Tencent Hunyuan and Renmin University's Gaoling School of AI have released PlanningBench, a framework that generates planning problems for both testing and training large language models. The technical paper was posted to arXiv on May 20, 2026.
The pitch: instead of a fixed bag of hand-written examples, PlanningBench synthesizes self-contained tasks with verification checklists baked in, so a model's output gets graded automatically against the constraints. The taxonomy spans more than 30 task types across six families, covering scheduling, routing, resource allocation, emergency response, and a couple of others.
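To make the checklist idea concrete, here is a minimal sketch of what grading a plan against a verification checklist could look like. It is not PlanningBench's actual format or code: the plan layout, the `Check` class, the helper names, and the toy scheduling task are all hypothetical, chosen only to illustrate automatic grading against constraints.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical plan format (an assumption, not PlanningBench's schema):
# task name -> (start_hour, end_hour).
Plan = dict[str, tuple[int, int]]


@dataclass
class Check:
    """One checklist item: a readable description plus a predicate on the plan."""
    description: str
    passes: Callable[[Plan], bool]


def within_window(start: int, end: int) -> Callable[[Plan], bool]:
    """Every task must start and finish inside the [start, end] window."""
    return lambda plan: all(start <= s < e <= end for s, e in plan.values())


def no_overlap(plan: Plan) -> bool:
    """No two tasks may occupy the same hour."""
    spans = sorted(plan.values())
    return all(prev[1] <= nxt[0] for prev, nxt in zip(spans, spans[1:]))


def before(first: str, second: str) -> Callable[[Plan], bool]:
    """The first task must finish before the second one starts."""
    return lambda plan: (
        first in plan and second in plan and plan[first][1] <= plan[second][0]
    )


# A toy scheduling task with four constraints (all illustrative).
CHECKLIST = [
    Check("all tasks are scheduled", lambda p: {"prep", "review", "ship"} <= p.keys()),
    Check("tasks fit in the 9-17 window", within_window(9, 17)),
    Check("no two tasks overlap", no_overlap),
    Check("prep finishes before ship starts", before("prep", "ship")),
]


def grade(plan: Plan, checklist: list[Check]) -> dict:
    """Run every check and report the pass fraction and the failed items."""
    failed = [c.description for c in checklist if not c.passes(plan)]
    return {
        "passed": not failed,
        "score": (len(checklist) - len(failed)) / len(checklist),
        "failed": failed,
    }


good_plan = {"prep": (9, 11), "review": (11, 13), "ship": (13, 15)}
bad_plan = {"prep": (14, 16), "review": (10, 12), "ship": (12, 14)}

good_result = grade(good_plan, CHECKLIST)
bad_result = grade(bad_plan, CHECKLIST)
```

In this sketch the second plan satisfies three of the four checks but schedules prep after ship, so it is not a complete plan. That is the kind of miss that only shows up when constraints are checked together rather than one at a time.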
One detail worth flagging. The GitHub repo currently ships 467 synthetic instances, and the authors say all 467 are meant for evaluation, not training. So the training angle refers to the paper's reinforcement learning experiments, not to a dataset you can download yet.
The headline finding from those experiments: frontier models, open and closed, still flub complete plans once constraints start coupling together. The team reports RL on verified PlanningBench data carried over to unseen planning benchmarks and broader instruction-following, though that's the authors' own measurement and hasn't been independently checked.
The evaluation set is live on Hugging Face now.
Bottom Line
PlanningBench ships 467 evaluation instances on Hugging Face, covering 30+ planning task types, with training-data generation described only in the paper.
Quick Facts
- 467 synthetic evaluation instances released
- 30+ task types across six planning families
- arXiv paper submitted May 20, 2026
- Collaboration: Tencent Hunyuan and Renmin University
- RL gains on unseen benchmarks are author-reported and not independently verified




