Andrew Heiss was doing what professors increasingly do at semester's end: checking whether the citations in student papers actually exist. An assistant professor of public policy at Georgia State University, Heiss expected to catch some undergrads who'd asked ChatGPT for sources without verifying them. What he found instead was that the fabrications had metastasized beyond student papers and into the professional scholarly record. When he searched Google Scholar to confirm that a suspicious reference was fake, he found dozens of published, peer-reviewed articles citing slight variations of the same nonexistent studies.
"There have been lots of AI-generated articles, and those typically get noticed and retracted quickly," Heiss told Rolling Stone. The hallucinated-citation problem is different, and worse, because articles that cite nonexistent research aren't getting flagged or retracted. They're becoming legitimate sources themselves.
The cascade
The mechanics of academic citation make this particularly insidious. A fake reference in one paper gets cited by another. That second paper, now published and seemingly credible, gets cited by a third. Each iteration launders the fabrication further into the scholarly ecosystem. Keith Moser, a professor at Mississippi State University, compared it to environmental contamination: "Nonexistent citations show up in research that's sloppy or dishonest, and from there get into other papers and articles that cite them, and papers that cite those, and then it's in the water." Hard to trace, he added, and hard to filter out even when you're trying to avoid it.
Moser takes issue with the term "hallucination" itself. The word implies deviation from normal function, but "a predictive model predicts some text, and maybe it's accurate, maybe it isn't, but the process is the same either way." The models aren't malfunctioning when they invent a journal article. They're doing exactly what they were designed to do: generate plausible-sounding text.
How bad is it, exactly?
That depends on who's testing and which model they're using. A study published in Scientific Reports tested GPT-3.5 and GPT-4 on 42 topics across disciplines and found that 55% of GPT-3.5's citations were fabricated, compared to 18% for GPT-4. Even among the real citations from GPT-4, 24% contained substantive errors like wrong page numbers, incorrect publication years, or nonexistent DOIs.
More recent tests of GPT-4o paint a bleaker picture. Researchers at Deakin University asked the model to write literature reviews on mental health topics. Nearly 20% of the 176 citations it produced were completely fabricated, and another 45% of the real citations contained errors. Only 77 citations, or 43.8%, were both real and accurate.
The fabrication rate varied wildly by subject. Depression citations were fabricated just 6% of the time. Body dysmorphic disorder citations? 29%. Less-studied topics, it seems, give the models sparser training data to draw on, so they improvise more. When researchers asked for narrower, more specific summaries, fabrication rates jumped to 46% for some topics. The more specialized your question, the more likely you are to get fiction.
And the fake citations are convincing. When GPT-4o provided DOIs for fabricated sources, 64% of those links led to real papers that had nothing to do with the AI-generated claims. Someone casually clicking through to verify would find a legitimate-looking journal article at the link, just not the one they were promised.
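The verification gap here is mechanical: a DOI can resolve to a perfectly real paper whose registered metadata simply doesn't match the citation that pointed to it. A minimal sketch of that two-part check against the public Crossref REST API (the function names are illustrative, not from any tool mentioned in this piece):

```python
import json
import re
import urllib.parse
import urllib.request

def normalize(title):
    # Lowercase and drop punctuation so minor formatting
    # differences don't cause a spurious mismatch.
    return re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split()

def check_doi(doi, cited_title):
    """Return (resolves, title_matches) for a citation's DOI.

    A hallucinated citation fails in one of two ways: the DOI
    resolves to nothing, or it resolves to a real paper with a
    different title -- the case that fools casual checkers.
    """
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            meta = json.load(resp)
    except Exception:
        return False, False  # DOI does not resolve at all
    titles = meta.get("message", {}).get("title", [])
    registered = titles[0] if titles else ""
    return True, normalize(registered) == normalize(cited_title)
```

Crossref's works endpoint is public and returns the registered title for any DOI it knows; a production checker would also compare authors and publication year, since titles alone can collide.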
ICLR's ugly surprise
The International Conference on Learning Representations is one of the most prestigious venues in AI research. This fall, GPTZero scanned 300 papers submitted for ICLR 2026 and found that 50 included at least one obvious fabricated citation. Each of those papers had already been reviewed by three to five peer experts. Most of the reviewers missed the fakes. Some papers with hallucinated citations had average ratings of 8/10, meaning they would almost certainly have been accepted.
The irony of an AI conference being flooded with AI-generated problems wasn't lost on anyone. Pangram Labs found that 21% of the peer reviews submitted for ICLR 2026 appeared to be fully AI-generated, and more than half showed signs of AI assistance. The conference organizers are now working with detection tools to screen all 20,000 submissions before announcing acceptances.
ICLR has declared that papers making extensive undisclosed use of LLMs will be desk rejected, and hallucinated citations constitute a Code of Ethics violation. That's the policy, anyway. Enforcing it across thousands of submissions while under deadline pressure is another matter.
Legal precedent
Courts got an early lesson in this problem. In the Mata v. Avianca case, attorney Steven Schwartz used ChatGPT to research legal precedents for a personal injury suit against the airline. The chatbot provided cases including Varghese v. China Southern Airlines and Martinez v. Delta Airlines. None of them existed. When the opposing counsel couldn't locate the citations and raised concerns, Schwartz asked ChatGPT to confirm whether Varghese was real. "Yes," the chatbot assured him, claiming the case "can be found on legal research databases such as Westlaw and LexisNexis."
U.S. District Judge Kevin Castel called the fake decisions "legal gibberish" and fined Schwartz, his colleague Peter LoDuca, and their firm a combined $5,000. The lawyers might have escaped sanctions if they'd admitted the mistake promptly, but instead they doubled down on the fake citations even after the court questioned their existence. "Poor and sloppy research would merely have been objectively unreasonable," the judge wrote. "But Mr. Schwartz was aware of facts that alerted him to the high probability that 'Varghese' and 'Zicherman' did not exist and consciously avoided confirming that fact."
The students are confused
Back in the classroom, Heiss noted a particularly frustrating loop. "The AI-generated things get propagated into other real things, so students see them cited in real things and assume they're real, and get confused as to why they lose points for using fake sources when other real sources use them."
This is the crux of the problem. It's no longer enough to tell students not to use ChatGPT for citations. They also need to verify that the citations in their actual sources are real, because those sources may themselves have been contaminated.
There's no automated solution that catches everything. GPTZero and similar tools flag suspicious citations, but as one skeptic pointed out to BetaKit, a 99% accuracy rate applied to 20,000 submissions would falsely flag 200 papers, potentially creating academic integrity concerns for authors who didn't use AI at all. The tools catch cheaters but generate false positives. Human verification remains the only reliable method, and it doesn't scale.
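The skeptic's arithmetic is just the base-rate effect, worth making explicit: at scale, even a very accurate detector flags a meaningful number of innocent authors. A quick sketch using the numbers above (the 99% accuracy figure is the skeptic's hypothetical, not a measured rate):

```python
def expected_false_flags(submissions, false_positive_rate, clean_fraction=1.0):
    """Expected number of clean papers wrongly flagged by a detector."""
    return submissions * clean_fraction * false_positive_rate

# The skeptic's example: 99% accuracy over ICLR's 20,000 submissions.
print(expected_false_flags(20_000, 0.01))  # 200.0
```

Shrinking `clean_fraction` (if many submissions really did use AI) lowers the count, but any nonzero false-positive rate times tens of thousands of papers still produces accusations against authors who did nothing wrong.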
ICLR expects to announce accepted papers in about a month. The conference's attempts to stem the AI contamination will be instructive for every other venue in academic publishing, all of which face the same problem with fewer resources and less technical sophistication.




