AI Research

Wharton Study Finds AI Expert Personas Fail to Improve Answer Accuracy

Researchers tested "you are a physics expert" prompts across six models. The technique mostly flopped.

Liza Chan, AI & Emerging Tech Correspondent

December 9, 2025 · 3 min read
[Illustration: a robot wearing stacked expert costumes, looking confused at a multiple-choice test]

Telling an AI to act like an expert does not make it answer hard questions more accurately, according to new research from the University of Pennsylvania's Wharton School. The study, published by the Generative AI Labs team, tested persona prompts across six major language models and found no consistent benefit from the widely recommended technique.

The Common Advice That Doesn't Hold Up

Google, Anthropic, and OpenAI all recommend persona prompting in their official documentation. Google's Vertex AI guide lists "assign a role" as a best practice. Anthropic offers sample prompts beginning with "You are an expert AI tax analyst." OpenAI's developer materials include "You are a world-class Python developer."

The Wharton team put these recommendations to the test using two challenging benchmarks: GPQA Diamond (198 PhD-level questions in biology, physics, and chemistry) and MMLU-Pro (graduate-level questions across engineering, law, and chemistry). They ran 25 independent trials per question across GPT-4o, GPT-4o-mini, o3-mini, o4-mini, Gemini 2.0 Flash, and Gemini 2.5 Flash.
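The study's setup, as described above, amounts to a simple evaluation harness: run 25 independent trials per question under each prompt condition and compare accuracy. The sketch below is a hypothetical illustration of that loop, not the researchers' code; `ask_model` is a stub standing in for a real API call (e.g. to GPT-4o), and the persona strings are examples rather than the study's exact prompts.

```python
import random

# Hypothetical persona conditions: a no-persona baseline, an expert matched
# to the question domain, and a deliberately mismatched expert.
PERSONAS = {
    "baseline": "",
    "matched_expert": "You are a world-class expert in physics. ",
    "mismatched_expert": "You are a world-class expert in law. ",
}

def ask_model(prompt: str, rng: random.Random) -> str:
    # Stub for a real model call. It ignores the persona prefix entirely,
    # mimicking the paper's headline finding that personas barely move
    # accuracy on hard multiple-choice questions.
    return "A" if rng.random() < 0.7 else "B"

def evaluate(question: str, correct: str, trials: int = 25, seed: int = 0) -> dict:
    """Accuracy per persona condition over `trials` independent runs."""
    results = {}
    for name, prefix in PERSONAS.items():
        # Re-seed per condition so conditions are compared on equal footing.
        rng = random.Random(seed)
        hits = sum(
            ask_model(prefix + question, rng) == correct for _ in range(trials)
        )
        results[name] = hits / trials
    return results

scores = evaluate("Which law relates force and acceleration?", correct="A")
```

In a real replication, `ask_model` would call an actual model API and the per-condition accuracies would then be compared with a statistical test rather than read off directly.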

What the Data Actually Shows

Expert personas like "you are a world-class expert in physics" produced results statistically indistinguishable from a baseline prompt with no persona at all. Matching the persona to the question domain (physics expert for physics questions) offered no advantage over mismatched experts.

The one partial exception was Gemini 2.0 Flash on the MMLU-Pro benchmark, where all five expert personas showed modest improvements. The researchers noted this appears to be model-specific rather than a generalizable finding.

Low-knowledge personas performed worse. When instructed to respond as a "toddler who thinks the moon is made of cheese," four of six models showed reduced accuracy. The degree of implied ignorance correlated with performance drops: toddler prompts performed worse than layperson prompts in five of six models tested on MMLU-Pro.

An Unexpected Failure Mode

Gemini 2.5 Flash exhibited a peculiar behavior when given mismatched expert personas. Under the "unrelated expert" condition on GPQA Diamond, the model declined to answer in an average of 10.56 of 25 trials per question, stating that it lacked the relevant expertise. The researchers flagged this as a risk: overly narrow role instructions can cause a model to withhold knowledge it actually possesses.

What Personas Might Still Be Good For

The study focused strictly on factual accuracy. Personas may still serve purposes the research did not measure, including shifting response tone, changing what factors the model emphasizes, or helping users frame their own questions more effectively. OpenAI's recent system card for GPT-5.1 acknowledges personas play a role in shaping output presentation.

The researchers recommend that organizations invest effort in task-specific instructions and evaluation workflows rather than simply adding expert personas to prompts.

The full paper, including supplementary statistical tables, is available on SSRN.

Tags: AI research, prompt engineering, Wharton School, GPT-4o, Gemini, LLM benchmarks, AI accuracy, machine learning, OpenAI, Google AI
Liza Chan

AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.


