
LLMs Claim Empathy on Abortion Stigma But Encode Contradictory Biases, Study Finds

UNC Chapel Hill researchers tested five major language models against validated psychological data. The results challenge assumptions about AI alignment.

Liza Chan, AI & Emerging Tech Correspondent
December 17, 2025

Large language models produce fluent, empathetic-sounding responses about abortion while internally encoding fragmented and contradictory representations of stigma, according to new research from the University of North Carolina at Chapel Hill. The findings raise questions about whether current AI systems can safely support users navigating reproductive health decisions.

The study, posted to the arXiv preprint server this month, tested GPT-5 mini, Llama-70B, Llama-8B, Gemma-3-4B, and a model called OSS-20B using the Individual Level Abortion Stigma Scale, a validated psychological instrument developed by Kate Cockrill and colleagues at the University of California San Francisco in 2013. The researchers generated 627 demographically diverse AI personas and compared their responses against published human baselines from the original validation study.
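The pipeline the researchers describe, generating a persona, having the model answer scale items in that persona's voice, then comparing subscale means against published human baselines, can be sketched roughly as follows. The persona fields, the `ask_model` stub, and the baseline value are all illustrative, not the study's actual code or data:

```python
from statistics import mean

# Illustrative personas; the study generated 627 demographically diverse ones.
personas = [
    {"age": 19, "religion": "atheist", "education": "some college"},
    {"age": 40, "religion": "Catholic", "education": "high school"},
]

def ask_model(persona: dict, item: str) -> int:
    """Stub for an LLM call. A real pipeline would prompt the model to
    answer one ILAS item (e.g. on a 0-4 Likert scale) as this persona."""
    return 2  # placeholder; replace with an actual model query

def subscale_mean(item: str) -> float:
    """Mean model response to one item across all personas."""
    return mean(ask_model(p, item) for p in personas)

# Published human baseline for the same subscale (number is made up here).
HUMAN_BASELINE_SELF_JUDGMENT = 1.4

gap = subscale_mean("self-judgment") - HUMAN_BASELINE_SELF_JUDGMENT
print(f"model - human gap on self-judgment: {gap:+.2f}")
```

A positive gap means the model overstates that dimension of stigma relative to the human data, which is the kind of systematic miscalibration the study reports.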

The facade of understanding

Models consistently got the fundamentals wrong. They overestimated how much women worry about others' judgment while underestimating internal experiences of shame and guilt. In practical terms: a system that overestimates social fear while underestimating self-judgment might push users toward managing external perceptions when they actually need help processing internal moral conflicts.

One model, OSS-20B, produced identical self-judgment scores for all 627 personas regardless of their demographic characteristics. The same score for a 40-year-old Catholic woman as for a 19-year-old atheist. That's not nuanced understanding. That's a black box outputting a constant.
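A degenerate pattern like OSS-20B's is trivially detectable in principle: if a model returns the same score for every persona, the response set has no variation at all. A minimal check, using hypothetical score lists:

```python
def is_constant(scores: list) -> bool:
    """True when a model gave every persona the identical score:
    the 'black box outputting a constant' failure described above."""
    return len(set(scores)) <= 1

# Hypothetical per-persona scores
assert is_constant([3, 3, 3, 3])      # degenerate: zero demographic sensitivity
assert not is_constant([1, 2, 4, 3])  # at least some variation
```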

GPT-5 mini showed the strongest reaction to demographic prompting, which might sound like sensitivity but actually meant amplifying biases. The model assigned higher stigma to younger personas, less educated personas, and Hispanic personas in ways absent from the human validation data. These aren't patterns found in real women's experiences. They're stereotypes the model learned from somewhere, and now it applies them when prompted to simulate a person.

What the models invented

The racial patterns deserve scrutiny. Black women in the original human study reported lower worry about others' judgment compared to white women, a finding researchers attributed to protective psychological factors developed in response to historical discrimination. The models reversed this: GPT-5 mini, Gemma, Llama-70B, and OSS-20B all showed Black personas scoring higher on worry about judgment than white personas.

Hispanic personas received higher stigma scores across nearly all models and dimensions. Llama-70B assigned higher stigma to younger personas aged 15-18, associating adolescence with shame in ways the human data didn't support.

Education showed similar invented patterns. All models except OSS-20B associated lower education with higher isolation, contradicting human findings where women with high school education actually reported feeling less isolated than those with some college.

Secrecy as inevitable

Perhaps the most telling failure: when asked whether a persona had withheld information about their abortion from someone close, no model ever selected "Never." Not once across 627 personas. In the actual human study, 36% of women said they had never withheld information from a close person.

The models assume universal secrecy. They cannot represent a woman with a supportive community who feels comfortable being open about her experience. This matters because secrecy itself carries psychological costs, and a system that treats concealment as inevitable may validate that burden rather than exploring whether disclosure is possible.

The research team also found models failed to capture the positive correlation between stigma and secrecy that exists in human data. Higher stigma should predict more secrecy. For most models, it didn't, suggesting they lack coherent representations of how these psychological constructs relate.

So what happens in practice?

The researchers, led by PhD students Anika Sharma, Malavika Mampally, and Chidaksh Ravuru with faculty advisor Neil Gaikwad, frame this as an AI safety issue for reproductive health contexts. LLMs are already deployed in telehealth platforms, crisis pregnancy centers, and general health counseling, though the study doesn't quantify the extent of that deployment.

The concern is that standard alignment testing cannot catch these failures. A model can produce appropriate, non-offensive language while internally encoding assumptions that would lead it to misread a user's actual psychological state. An AI that assumes all teenagers feel intense guilt about abortion might probe for regret that isn't there, potentially reinforcing shame rather than providing actual support.

Current AI regulation focuses on preventing harmful outputs. Illinois and Nevada have banned AI from directly providing mental health therapy. California now requires chatbots to maintain protocols for handling expressions of suicidal ideation. But output filtering cannot address what the UNC research describes: internal representational incoherence across psychological dimensions.

The methodology question

The study used persona-based prompting, asking models to respond as if they were women with specific demographic characteristics. This approach has known limitations. Prior research has shown that assigning demographic personas to LLMs can introduce reasoning biases. The models might be performing stereotypes rather than representing psychological constructs.

The researchers acknowledge their approach cannot capture intersectional identities because the original human study only published marginal demographic distributions, not joint ones. They couldn't test how stigma compounds when multiple marginalized identities overlap.

Still, the validated baseline gives this study something most AI fairness research lacks: a concrete empirical standard. The ILAS scale represents years of careful measurement development on a notoriously difficult-to-study population. Comparing model outputs against those baselines reveals gaps that hypothetical scenario testing would miss.

The paper calls for treating multilevel representational coherence as a prerequisite for deployment in high-stakes contexts, not an aspiration. Whether that standard is achievable with current architectures, or even measurable at scale, remains an open question.

The research was posted on December 15, 2025 and has not yet undergone peer review.

Tags: artificial intelligence, abortion, mental health AI, LLM bias, reproductive health, AI safety, UNC research, chatbot ethics, stigma research
Liza Chan

AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.
