Large Language Models Reproduce Biases Despite Being Able to Recognize Them

Recognizing stereotypes doesn’t prevent their continuous reproduction… and this appears to be inherent to most large language models (LLMs).

This conclusion was reached by the French company Giskard as part of the Potential Harm Assessment & Risk Evaluation (Phare) benchmark, which they developed in conjunction with Google DeepMind.

The Phare framework consists of four core modules focused on:

– Hallucinations
– Bias and Fairness
– Harmfulness
– Vulnerability to malicious intentional use

In early May, initial results were shared specifically regarding the first module. Key findings included:

– The popularity of a model doesn’t guarantee factual accuracy.
– The way prompts are formulated greatly influences the model’s ability to detect pseudo-truths.
– System instructions significantly impact hallucination rates.

Biases Found, but Reproduced in Generative Processes

The bias analysis employed an “open” approach: 17 different LLMs were prompted to generate stories featuring characters with specific traits such as age and profession.

Subsequently, the characteristics that emerged naturally were analyzed across thousands of stories to identify the most common associations. Each model was then asked to judge whether these patterns constituted stereotypes.

All models evaluated displayed some stereotypical associations. Some seem reasonable—such as linking “farmer” with “living in rural areas” or “teenager” with “basic education.” Others, however, are more questionable—for example, associations like “man” with “manual labor” (present across all 17 models) or “woman” with “progressive political views” (found in 9 models).

When directly asked, models generally recognize stereotypes in the patterns they produce. This indicates a disconnect between the detection of stereotypes and the generative process itself. Such disconnect is particularly evident concerning religion, professional activity, ethnicity, and income level.

Giskard draws a parallel with an earlier observation from the first phase of the benchmark: models optimized for user satisfaction can generate responses that sound plausible but contain false information.

They advocate for a broader approach that combines their methodology with traditional evaluation techniques (like BBQ, WinoBias). These traditional tests, which often involve question-answer or bias masking tasks, may be better suited for handling predictive scenarios, while generative scenarios remain a challenge.

Complementary Resources to Consider

– When LLMs understand they are being evaluated
– Agentic misalignment: could large language models become internal threats?
– “Improve”—what is the best prompt to give an LLM?
– Can LLMs balance response quality and diversity?
– Have autoregressive LLMs outlived their usefulness?
– Are LLMs “aware” of their learned behaviors?

Recognizing stereotypes remains a complex issue. While models can sometimes acknowledge stereotypical patterns when prompted, the generative process often reproduces societal biases, sometimes unconsciously. This underscores the importance of ongoing research and the development of evaluation tools that can better address these challenges.

Published by:
Clément Bohic

This article sheds light on the persistent challenges in AI fairness and robustness, highlighting the importance of nuanced testing methods and acknowledging the limitations inherent to current large language models.

Dawn Liphardt

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.