Researchers at the University of Chicago Law School ran GPT-5 through a legal reasoning experiment originally designed for 61 U.S. federal judges, and the model applied the correct law in 100% of cases. The judges managed about 52%. Eric A. Posner and Shivam Saran published the results in a new paper posted to SSRN on January 29.
The setup
The experiment, which replicates a 2024 study by Daniel Klerman and Holger Spamann, presents a straightforward scenario: a hypothetical car accident case with a choice-of-law dispute. Three variables get manipulated. First, whether the applicable legal doctrine is a rigid rule or a flexible standard. Second, whether the plaintiff or defendant is portrayed more sympathetically. Third, the location of the accident, which determines whether Kansas or Nebraska law applies.
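The three manipulated variables form a simple 2×2×2 factorial design. A minimal sketch of the condition grid, with illustrative variable names that are assumptions rather than labels from the paper:

```python
# Sketch of the experiment's 2x2x2 factorial design, as described above.
# Variable names and labels are illustrative, not taken from the paper.
from itertools import product

doctrines = ["rule", "standard"]           # rigid rule vs. flexible standard
sympathetic_party = ["plaintiff", "defendant"]  # who is portrayed sympathetically
accident_state = ["Kansas", "Nebraska"]    # determines which state's law applies

conditions = list(product(doctrines, sympathetic_party, accident_state))
print(len(conditions))  # 8 distinct experimental conditions
```

Crossing all three variables yields eight scenario variants, each presented as a complete case vignette.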
That location variable is where things get interesting. Kansas caps pain-and-suffering damages at $250,000. Nebraska doesn't. So in practice, the legal question controls whether an injured plaintiff walks away with $250,000 or $750,000. Not exactly abstract.
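The stakes reduce to a single cap rule. A hedged sketch, using the dollar figures from the article; the function name and claimed amount are illustrative:

```python
# Per the article: Kansas caps pain-and-suffering damages at $250,000;
# Nebraska has no cap. The function name is illustrative.
KANSAS_CAP = 250_000

def recoverable_damages(claimed: int, governing_law: str) -> int:
    """Pain-and-suffering recovery under the applicable state's law."""
    if governing_law == "Kansas":
        return min(claimed, KANSAS_CAP)
    return claimed  # Nebraska: no statutory cap

print(recoverable_damages(750_000, "Kansas"))    # 250000
print(recoverable_damages(750_000, "Nebraska"))  # 750000
```

The choice-of-law question is the only thing separating the two outcomes, which is what gives the location variable its bite.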
What GPT-5 actually did
GPT-5 applied the correct choice-of-law framework every single time, regardless of which state's law it was selecting, regardless of how sympathetic the plaintiff appeared, and regardless of whether the doctrine was a rule or a standard. No variation. No wobble. The model used the Restatement (Second) of Conflict of Laws and its "most significant relationship" test like a machine, which, to be fair, it is.
The federal judges told a different story. Only about half applied the law that the experimental design treated as correct. Many opted for full compensation even when the legal framework pointed toward the Kansas cap. The paper suggests judges gravitated toward the outcome that felt just: compensating a badly injured plaintiff rather than capping damages because of a technicality about which state's law governs.
"We find the LLM to be perfectly formalistic, applying the legally correct outcome in 100% of cases," the authors write, adding that this was "significantly higher than judges, who followed the law a mere 52% of the time." That framing deserves scrutiny, though. Calling judicial discretion an "error" loads the analysis in a particular direction.
Neither humans nor AI cared about sympathy
One finding cuts against the easy narrative. Neither GPT-5 nor the human judges changed their decisions based on whether the plaintiff or defendant was made more sympathetic. The sympathy manipulation didn't move the needle for either group. Where they diverged was on the legal standards themselves: judges exercised the interpretive flexibility that standards are designed to provide, while GPT-5 treated standards and rules identically.
Posner and Saran have been here before. Their earlier paper from January 2025 ran GPT-4o through a different judicial experiment involving a simulated war crimes appeal, originally conducted on 31 federal judges. That study found GPT-4o was strongly affected by precedent but not by sympathy, and no amount of prompt engineering could make it behave more like a human judge. They called it "Judge AI," a formalist judge rather than a human one.
Is perfect formalism the point?
The paper's conclusion is more nuanced than the headline suggests. The authors also tested other models: according to the paper, Gemini 3 Pro matched GPT-5's perfect score, while earlier models performed worse. GPT-4o reportedly made the kind of discretionary choices that human judges make, including attempts to justify departures from damage caps.
"Whatever may explain such behavior in judges and some LLMs, however, certainly does not apply to GPT-5 and Gemini 3 Pro," the authors write. "Across all conditions, regardless of doctrinal flexibility, both models followed the law without fail."
And then the kicker: "And does that mean that LLMs are becoming better than human judges or worse?"
Because here is the tension the paper surfaces but cannot resolve. Judges who declined to apply the Kansas damages cap weren't confused about the law. Many cited fairness concerns. The materials describe a plaintiff with gruesome injuries, and a $250,000 cap that would cover a fraction of her losses. A judge who bends the standard to reach the more generous result isn't making an error in the way a math student gets a problem wrong. They're exercising a function the legal system arguably built into flexible standards on purpose.
GPT-5 can't do that. It won't do that. Whether that's a feature or a bug depends entirely on your theory of what judges are for. Posner and Saran acknowledge as much, framing the question as one about legal philosophy rather than model capability. The AI alignment built into these models prevents them from deviating from stated rules, which looks like perfect compliance from one angle and like a dangerous inability to exercise discretion from another.
The full paper is available on Chicago Unbound. Klerman and Spamann's original experiment, which shaped the methodology, was published in the Journal of Law, Economics, and Organization in 2024.