Frontier LLMs Beat Specialized Clinical AI Tools in Nature Medicine Study

NYU-led study finds GPT-5.2, Gemini and Claude outscored OpenEvidence and UpToDate on every test.

By Oliver Senti, Senior AI Editor
June 13, 2026 (4 min read)

General-purpose models from OpenAI, Google and Anthropic outscored two purpose-built clinical AI tools across knowledge tests, expert-alignment tasks and real physician queries, according to a Nature Medicine paper published June 12 by a team led out of NYU Langone Health. The losers were OpenEvidence and Wolters Kluwer's UpToDate Expert AI, tools sold specifically to doctors.

The framing matters here. These are products marketed on the premise that domain-specific training and retrieval beat a generalist chatbot. The study says they don't, at least not yet.

What the numbers actually say

On 500 MedQA licensing-exam questions, Gemini 3.1 Pro Preview led at 97.4% (95% CI 95.6 to 98.5). GPT-5.2 hit 94.2% and Claude Opus 4.6 reached 90.2%. OpenEvidence and UpToDate landed at 89.6% and 88.4%. So the gap on raw medical knowledge is real but narrower than you'd expect from a headline: Claude cleared OpenEvidence by just 0.6 points. Worth flagging: the authors themselves warn that frontier models may have seen MedQA during training, which would inflate their scores. They treat this benchmark as the weakest evidence in the paper, and they're right to.
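The interval quoted for Gemini can be sanity-checked from the headline numbers alone. The sketch below is a rough check under stated assumptions, not the authors' method: it assumes 97.4% of 500 questions means 487 correct answers, and it applies a Wilson score interval, which the article does not say the paper used. The helper name `wilson_interval` is ours.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct, total, confidence=0.95):
    """Wilson score interval for a binomial proportion (hypothetical helper)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumption: 97.4% of 500 questions corresponds to 487 correct answers.
low, high = wilson_interval(487, 500)
print(f"{low:.1%} to {high:.1%}")
```

Under those assumptions the result rounds to 95.6% to 98.5%, in line with the interval reported for Gemini. Swapping in 448 or 442 correct answers gives a feel for how much the specialized tools' intervals overlap with Claude's.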

The stronger signal comes from the real clinical queries test. The team pulled 100 de-identified questions that physicians had actually typed into NYU Langone's HIPAA-compliant GPT instance, then had twelve blinded clinicians rate six systems across correctness, completeness, safety and clarity. That produced 1,800 annotations. Two tiers emerged. The three frontier models clustered at the top (Gemini 3.62, GPT 3.54, Claude 3.52 on a four-point scale) with no meaningful difference between them. The clinical tools sat below, and here's the part that should sting: Google's auto-enabled AI Overview, the thing that shows up uninvited above your search results, scored as well as both specialized products.
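For readers who want to picture the mechanics of a blinded rating exercise like this, here is a minimal sketch. It is not the study's pipeline: the function names, the letter labels and the toy data are all hypothetical, and the real protocol (how raters were assigned, how scores were pooled) is not described in enough detail here to reproduce.

```python
import random
from collections import defaultdict
from statistics import mean


def blind(responses, rng):
    """Hide system names behind shuffled letter labels for one question.

    responses maps system name -> answer text. Returns (blinded, key), where
    blinded maps label -> answer text and key maps label -> system name.
    """
    systems = list(responses)
    rng.shuffle(systems)
    labels = [chr(ord("A") + i) for i in range(len(systems))]
    key = dict(zip(labels, systems))
    blinded = {label: responses[system] for label, system in key.items()}
    return blinded, key


def mean_by_system(ratings, key):
    """Average scores per system after unblinding.

    ratings is a list of (label, criterion, score) tuples from the raters.
    """
    scores = defaultdict(list)
    for label, _criterion, score in ratings:
        scores[key[label]].append(score)
    return {system: mean(values) for system, values in scores.items()}


# Made-up toy data: two placeholder systems, one question, one rater.
rng = random.Random(0)
blinded, key = blind({"system_x": "answer one", "system_y": "answer two"}, rng)
criteria = ("correctness", "completeness", "safety", "clarity")
# The toy rater sees only the answer text, never the system name.
ratings = [
    (label, criterion, 3 if text == "answer one" else 2)
    for label, text in blinded.items()
    for criterion in criteria
]
print(mean_by_system(ratings, key))
```

The point of the design is that the mapping from label to system lives in `key`, which raters never see; scores are tied back to systems only at aggregation time.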

The RAG problem

Why do the dedicated tools lose? The honest answer is nobody outside the companies knows, because the architectures are closed. The authors lean on a likely culprit: retrieval-augmented generation. Both OpenEvidence and UpToDate are thought to rely on RAG, pulling in external literature before answering. When that retrieval surfaces irrelevant material or the base model integrates it poorly, quality drops. The paper cites earlier work from some of the same authors showing medical LLMs are easily distracted by retrieved noise.
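To make the failure mode concrete, here is a toy retrieval-augmented prompt builder. It is an illustration only: the word-overlap scoring, the function names and the three-passage corpus are invented for this sketch, and nothing here reflects how OpenEvidence or UpToDate actually work, since their architectures are closed.

```python
def tokens(text):
    """Lowercased word set; a crude stand-in for a real embedding model."""
    return set(text.lower().split())


def retrieve(query, corpus, k):
    """Return the k passages sharing the most words with the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda passage: len(q & tokens(passage)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Prepend retrieved passages to the question, as a RAG pipeline would."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Made-up toy corpus; only the first passage is relevant to the query.
corpus = [
    "warfarin dosing requires regular INR monitoring",
    "amoxicillin dosing in children depends on weight",
    "visitor parking is free on weekends",
]
query = "warfarin dosing monitoring"

focused = build_prompt(query, retrieve(query, corpus, k=1))
noisy = build_prompt(query, retrieve(query, corpus, k=3))
print(noisy)
```

With `k=1` the prompt contains only the relevant passage. With `k=3` the same question arrives wrapped in two passages the model has to ignore, which is the kind of retrieved noise the authors suspect is dragging down answer quality.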

UpToDate's other problem was refusals. It declined 19% of queries, far more than any frontier model's 1 to 3%, apparently because of aggressive safety filtering. A tool that won't answer one in five clinical questions is a tool clinicians will route around.

OpenEvidence stumbled on a different axis. It scored lowest on clarity, mean 2.84, which the authors read as a communication failure rather than a knowledge one. The model knew things; it just said them badly.

Where the specialists might still win

The team is careful, almost to a fault, about not overclaiming. Hallucination and harmful-content rates didn't differ across any of the six systems, so the frontier models weren't trading safety for performance. And the authors flag that deeply subspecialized tasks could still favor domain-tuned systems. "Our results should therefore be interpreted as a snapshot of a rapidly evolving landscape," they write, which is the kind of hedge that's both genuinely true and convenient if Gemini drops three points next quarter.

One disclosure deserves daylight: senior author Eric Oermann reports consulting for Google, whose Gemini topped two of three tests. The paper names it; you should know it too.

There are real limits. The clinical tools have no public API, so researchers queried them by hand through browser interfaces, which caps sample size and muddies comparisons. The code is posted on GitHub, though the clinical query set stays private for patient-privacy reasons.

The practical takeaway the authors push: hospitals should run their own independent tests before buying clinical AI, rather than trusting vendor benchmarks. Given that OpenEvidence claims use by 40% of US physicians and UpToDate by systems covering most major enterprises, that advice arrives a little late for a lot of buyers.

Tags: clinical AI, large language models, Nature Medicine, OpenEvidence, UpToDate, GPT-5.2, Gemini, medical AI, RAG
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
