Meta has released a new dataset designed to help train AI systems that can assist with scientific research. The Research Plan Generation (RPG) dataset contains approximately 22,500 research tasks across three domains: machine learning, general arXiv papers, and PubMed biomedical literature.
Each task in the dataset includes a research goal, a grading rubric, and a reference solution. The rubrics and solutions were generated by Llama 4 Maverick, Meta's mixture-of-experts model, which means that under the Llama license terms any derivative models must include "Llama" in their name.
The dataset is divided into three subsets: ML (7.56K rows), arXiv (8.07K rows), and PubMed (6.89K rows), each with train and test splits. The structure is straightforward: a goal describes the research task, the rubric provides evaluation criteria, and the reference solution summarizes how authors addressed similar problems. The arXiv subset additionally carries subdomain and category labels for more granular filtering.
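To make the record structure concrete, here is a minimal sketch of what a task and a rubric-based grader might look like. The field names and the keyword-matching scorer are illustrative assumptions, not the dataset's actual schema or Meta's grading method (which, per the cited paper title, uses rubric rewards, presumably with an LLM judge rather than string matching):

```python
from dataclasses import dataclass


@dataclass
class ResearchTask:
    """Illustrative model of one RPG record: goal, rubric, reference solution.

    Field names are assumptions for this sketch; consult the dataset card
    for the real column names.
    """
    goal: str
    rubric: list[str]          # evaluation criteria a research plan should satisfy
    reference_solution: str    # summary of how authors addressed similar problems


def score_plan(plan: str, task: ResearchTask) -> float:
    """Toy grader: fraction of rubric criteria mentioned in the plan.

    A real rubric-reward setup would score each criterion with a model
    judge; this keyword check only illustrates the overall shape.
    """
    if not task.rubric:
        return 0.0
    hits = sum(1 for criterion in task.rubric if criterion.lower() in plan.lower())
    return hits / len(task.rubric)


task = ResearchTask(
    goal="Improve sample efficiency of RL fine-tuning",
    rubric=["baseline", "ablation", "evaluation metric"],
    reference_solution="Compared against a standard baseline with ablations.",
)
print(score_plan("Run an ablation against the baseline; report an evaluation metric", task))
# → 1.0
```

The point of the sketch is the data shape: each record pairs an open-ended goal with machine-checkable grading criteria, which is what lets the dataset serve as a benchmark rather than plain training text.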
Meta positions this as benchmarking data rather than training data in the traditional sense: according to the dataset card, the data is released under CC BY-NC and intended for benchmarking purposes only. The company cites a forthcoming paper, "Training AI Co-Scientists using Rubric Rewards," as the source.
The Bottom Line: A benchmark dataset for measuring how well AI systems can plan scientific research, with Llama 4-generated rubrics as the grading standard.
QUICK FACTS
- ~22,500 total research tasks across three subsets
- Domains: ML, arXiv, PubMed
- License: CC BY-NC (non-commercial)
- Reference solutions generated by Llama 4 Maverick
- Available at huggingface.co/datasets/facebook/research-plan-gen
