Law professors picked AI-written answers over ones from fellow professors in 75% of blind comparisons, according to a Stanford study led by professor Julian Nyarko. Sixteen professors from U.S. law schools ran the test on contracts-law tutoring questions, and the paper landed on SSRN on May 27, with Stanford publicizing it on June 1.
The setup was tighter than most AI-versus-human stunts. Participants wrote 40 questions students might actually ask after class, answered them themselves, then judged 2,918 anonymized pairings without knowing which side was machine and which was a colleague. The headline number, a 75.33% win rate for the models, comes straight from the abstract, not a press summary.
The harm number is the interesting one
Forget the win rate for a second. The detail worth chewing on: professors flagged AI answers as pedagogically harmful 3.53% of the time, versus 12.06% for the human-written ones. So the people who spend their careers worrying about misleading students flagged their peers' answers as harmful roughly three and a half times as often as the machines'. That is either a damning verdict on rushed academic writing or a sign of how good current models have gotten at sounding safe. Probably both.
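For anyone who wants to check the arithmetic behind that comparison, here is a minimal sketch in Python. It uses only the two harm rates quoted above; the variable names are illustrative and nothing here comes from the paper's own code or data.

```python
# Harm-flag rates as reported in the study, in percent.
ai_harm_rate = 3.53       # share of AI answers flagged as pedagogically harmful
human_harm_rate = 12.06   # share of human-written answers flagged as harmful

# Relative comparison: how many times more often human answers were flagged.
ratio = human_harm_rate / ai_harm_rate

# Absolute comparison: the gap in percentage points.
gap = human_harm_rate - ai_harm_rate

print(f"ratio: {ratio:.2f}x")               # ratio: 3.42x
print(f"gap: {gap:.2f} percentage points")  # gap: 8.53 percentage points
```

The ratio and the gap tell different stories. Roughly 3.4 times sounds dramatic, while an 8.5-point gap is a reminder that most answers on both sides were not flagged at all.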
Yale co-author Sarath Sanga framed the whole exercise around judgment rather than facts. "In most fields where AI gets tested, there's a right answer. In law, there often isn't," he said, which is the entire reason the result carries more weight than yet another bar-exam benchmark. Two defensible answers can disagree, and the models apparently met the bar lawyers use on each other.
Read the fine print
A few things keep this from being the rout the headline suggests. Nyarko runs Stanford's liftlab, which by its own description builds prototypes and collaborates with industry to push AI into private-sector legal services. That doesn't void the data: the study was blinded and the methods look careful. But the lab has a stake in AI looking good.
Bigger problem: the professors weren't teaching. They were writing quick answers as a side task for an experiment, not running office hours or reading a room. So "AI beats professors" really means AI beats professors dashing off written replies under research conditions. Different thing. The models also matched the best human in the study rather than blowing past everyone, a nuance that tends to vanish in the retelling.
The team did calibrate AI responses to match human answers on length and structure, and tested commercial tutoring tools alongside Google's NotebookLM, with performance varying by model. Even when context limits hurt the AI, professors often still preferred it.
Nyarko, to his credit, isn't overselling. "We're not advocating for wholesale adoption of AI tutors," he said, while arguing that blanket skepticism looks just as unwarranted now. The full paper, all 61 pages, is on SSRN for anyone who wants to check the methodology instead of the press release.




