Regulation

New Research Probes LLMs' Legal Reasoning Abilities as EU AI Act Deadlines Loom

A study from Politecnico di Milano examines what language models can and cannot do in legal contexts, just as regulators tighten requirements.

Liza Chan, AI & Emerging Tech Correspondent
December 16, 2025 · 5 min read
Illustration: justice scales balancing a traditional law book against a neural network pattern, representing the intersection of legal practice and artificial intelligence regulation.

Simone Corbo at Politecnico di Milano has published research examining how large language models perform legal reasoning tasks, from interpreting statutes to drafting contracts. The paper, posted to arXiv this month, arrives as the EU AI Act's core provisions for high-risk systems begin taking effect across the 27-member bloc.

The timing matters. Legal AI systems now fall squarely under the EU's strictest oversight category.

The "high-risk" designation problem

The EU's classification isn't arbitrary. Recital 61 of the AI Act specifically names AI systems that assist judicial authorities in interpreting law and applying it to facts as high-risk. That means any commercial legal AI product operating in Europe faces mandatory impact assessments, transparency requirements, and human oversight obligations.

Corbo's research doesn't shy away from the limitations. The paper flags hallucinations (confident but fabricated responses), algorithmic monoculture (what happens when everyone relies on the same model architecture), and the fundamental tension between AI assistance and professional accountability.

"When it comes to justice, should we let the machines make their case, or is it time to object?" The paper's closing question reads like rhetoric, but the regulatory answer is already here: neither pure automation nor pure rejection, but carefully supervised augmentation.

What the benchmarks actually measure

The paper examines two evaluation frameworks that have emerged to test legal AI capabilities.

LegalBench, a collaborative effort involving 40 contributors from Stanford and other institutions, comprises 162 tasks spanning six categories of legal reasoning. These aren't generic comprehension tests. The benchmark includes tasks like identifying whether evidence qualifies as hearsay, determining if a statute contains a private right of action, and distinguishing between different types of contractual clauses.

The framework draws on the IRAC model (Issue, Rule, Application, Conclusion) that law students learn in their first year. Can an LLM spot the legal issue? Recall the relevant rule? Apply it coherently? Reach the right conclusion? Breaking legal reasoning into discrete components lets researchers identify where models succeed and where they stumble.
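Benchmarks built on this decomposition report accuracy per component rather than one overall score. The sketch below illustrates that idea with a toy task set and exact-match grading; the task data, field names, and scoring function are invented for illustration and are not LegalBench's actual format or API.

```python
# Toy sketch of component-wise benchmark scoring in the IRAC spirit.
# Tasks, fields, and answers here are illustrative, not LegalBench's real data.

TASKS = [
    {"component": "issue", "question": "What legal issue does a late delivery raise?",
     "expected": "breach of contract", "model_answer": "breach of contract"},
    {"component": "rule", "question": "What must hearsay be offered to prove?",
     "expected": "the truth of the matter asserted", "model_answer": "speaker credibility"},
    {"component": "conclusion", "question": "Is a signed offer plus acceptance a contract?",
     "expected": "yes", "model_answer": "yes"},
]

def score_by_component(tasks):
    """Return per-component accuracy using case-insensitive exact match."""
    totals, correct = {}, {}
    for t in tasks:
        c = t["component"]
        totals[c] = totals.get(c, 0) + 1
        if t["model_answer"].strip().lower() == t["expected"].strip().lower():
            correct[c] = correct.get(c, 0) + 1
    return {c: correct.get(c, 0) / totals[c] for c in totals}

scores = score_by_component(TASKS)
print(scores)  # {'issue': 1.0, 'rule': 0.0, 'conclusion': 1.0}
```

A per-component breakdown like this is what lets researchers say a model can spot issues but fumbles rule recall, rather than reporting a single opaque accuracy number.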

LawBench targets Chinese civil law, a system where statutory interpretation takes precedence over case precedent. The benchmark tested 51 models across 20 tasks and found even GPT-4 achieved only about 52% average accuracy. That figure should give pause to anyone assuming these systems are ready for unsupervised deployment.

The regulatory surge nobody can ignore

The paper cites statistics from the AI Index showing AI-related regulations increased 56.3% in the US during 2023 compared to the prior year. Legislative mentions of artificial intelligence nearly doubled globally, jumping from around 1,247 in 2022 to over 2,175 in 2023 across 49 countries.

More recent data shows the trend accelerating. US state legislatures introduced nearly 700 AI-related bills in 2024, up from 191 in 2023. Colorado enacted the first comprehensive state AI legislation in May 2024. The EU AI Act's core provisions for general-purpose AI models, including large language models, became enforceable in August 2025.

China isn't sitting still either. The country's Interim Measures for Managing Generative AI Services took effect in August 2023, and by late 2023, 22 organizations had registered AI models with Chinese authorities.

The research positions all this regulatory activity as context for understanding how legal AI should develop. You can't build compliant systems without knowing what they're supposed to comply with.

Where the use cases hold up

The paper identifies several areas where LLM assistance shows promise.

Contract negotiation and analysis emerges as a natural fit. Identifying clause types, flagging unusual terms, summarizing lengthy agreements, comparing versions of documents. These tasks benefit from pattern recognition across large document corpora without requiring the model to reach binding legal conclusions.
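Commercial tools do this kind of flagging with LLMs; the pattern-matching toy below is only meant to show the shape of the task, and its "risky term" list is invented for illustration.

```python
# Toy illustration of flagging unusual contract terms.
# Real legal-AI products use LLMs over large corpora; this keyword heuristic
# (with an invented pattern list) just shows the structure of the task.
import re

RISKY_PATTERNS = {
    "unlimited liability": r"\bunlimited liability\b",
    "unilateral termination": r"\bterminate .* at any time\b",
    "auto-renewal": r"\bautomatically renews?\b",
}

def flag_unusual_terms(clause: str) -> list[str]:
    """Return the names of risky patterns found in a clause."""
    text = clause.lower()
    return [name for name, pat in RISKY_PATTERNS.items() if re.search(pat, text)]

clause = ("This agreement automatically renews each year and "
          "either party may terminate it at any time.")
print(flag_unusual_terms(clause))  # ['unilateral termination', 'auto-renewal']
```

The output is a list of flags for a human reviewer, which matches the paper's framing: the system surfaces candidates, it does not reach binding conclusions.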

Legal summarization follows similar logic. Condensing case law, extracting key holdings, generating research memos. The output serves as a starting point for human review rather than a final product.

Information retrieval for clarifying ambiguous concepts represents perhaps the safest use case. When a practitioner needs context on how courts have interpreted a particular term or doctrine, retrieval-augmented generation can surface relevant precedents faster than manual search.
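At its simplest, the retrieval step ranks stored passages by overlap with the query before handing the top hits to a model as context. A minimal, self-contained sketch of that step, with invented precedent snippets and naive word-overlap scoring standing in for a real embedding-based retriever:

```python
# Minimal retrieval sketch: rank snippets by word overlap with a query.
# In a real RAG pipeline the top snippets would be embedded/indexed and then
# passed to an LLM as grounding context; the snippets here are invented.

SNIPPETS = [
    "Hearsay is an out-of-court statement offered to prove the truth of the matter asserted.",
    "A private right of action lets an individual sue to enforce a statute.",
    "Consideration is the bargained-for exchange required to form a contract.",
]

def retrieve(query: str, snippets: list[str], k: int = 1) -> list[str]:
    """Return the k snippets sharing the most words with the query."""
    q = set(query.lower().split())
    scored = sorted(snippets,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

print(retrieve("what counts as hearsay in court", SNIPPETS))
```

Because the generated answer is anchored to retrieved text that the practitioner can inspect, this pattern is easier to verify than free-form generation, which is why the paper treats it as the safest use case.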

But Corbo is clear about the boundaries. These applications work best when humans remain firmly in the loop, treating model outputs as suggestions requiring verification rather than authoritative answers.

The hallucination problem isn't going away

Related literature the paper cites sums up the field's current state bluntly: "the study of hallucinations in LLMs is still in its infancy, even outside the legal field."

Legal practice has zero tolerance for fabricated citations, invented case holdings, or imaginary statutes. Yet hallucinations remain an inherent property of how these models generate text. They produce plausible-sounding content without any mechanism for verifying factual accuracy.

The paper doesn't propose a solution. It documents the problem and notes that responsible deployment requires acknowledging models will sometimes generate false information with complete confidence.

What comes next

The research positions itself as foundational rather than conclusive. It maps the landscape of legal AI capabilities, the regulatory constraints taking shape, and the evaluation frameworks available for measuring progress.

The core provisions of the EU AI Act for general-purpose AI systems took effect in August 2025. High-risk AI systems, including those used in legal interpretation, face additional requirements that phase in over the following months. Organizations deploying legal AI in Europe now operate under binding transparency and accountability obligations.

Whether the technology can meet those obligations remains an open question. The benchmarks suggest current models have significant limitations. The regulators have decided those limitations require careful management.

Tags: artificial intelligence, legal tech, EU AI Act, LLMs, LegalBench, contract analysis, AI regulation, legal reasoning, Politecnico di Milano
Liza Chan

AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.


