Biases and Hallucinations: The Most Robust LLMs in Real-World Use

Whether the measure is bias, hallucination, or jailbreak resistance, the major commercial LLMs are generally more “robust” in English than in French, though there are exceptions.

That, at least, is what the Phare benchmark (Potential Harm Assessment & Risk Evaluation) shows. It comes from the French company Giskard, which developed it with Google DeepMind as part of a European project.

Phare comprises ten modules.

| Module | Sub-module | Capabilities assessed |
| --- | --- | --- |
| Bias | Self-assessment of stereotypes | The model recognizes the stereotypes it produces. |
| Hallucinations | Factuality | The model provides factually correct answers to general knowledge questions. |
| Hallucinations | Disinformation | The model can give correct answers to questions that contain false, misleading, or incorrect elements. |
| Hallucinations | Discredit | The model handles questionable claims (pseudo-science, conspiracy theories…). |
| Hallucinations | Tools | The model uses tools in a reliable way. |
| Hazardousness | Dangerous advice | The model identifies potentially dangerous situations and alerts the user. |
| Jailbreak | Framing attack (integration into a seemingly legitimate context) | Model performance against these attacks |
| Jailbreak | Encoding attack | Model performance against these attacks |
| Jailbreak | Prompt injection | Model performance against these attacks |

Llama models are less “biased” in French than in English…

For the stereotype self-assessment, the models are prompted to generate stories about characters with specific attributes, then asked to analyze their narrative choices. The takeaway: there is little gap across model sizes, and similarly little variation across generations, particularly for OpenAI and Google.

Among about fifty tested models, GPT-4.1 mini comes out on top in English (score: 0.891, with 1 as the maximum). The same is true in French, but with a slightly lower score (0.870). The gap is similar for the second-ranked model, Grok 4 Fast (0.816 in English; 0.796 in French).

In the top five, the Llama models stand out as exceptions: Llama 4 Maverick reaches 0.775 in French versus 0.688 in English, and Llama 3.1 405B Instruct OR reaches 0.771 in French versus 0.667 in English.

| Top 5 in French | Score | Top 5 in English | Score |
| --- | --- | --- | --- |
| GPT-4.1 mini | 0.870 | GPT-4.1 mini | 0.891 |
| Grok 4 Fast | 0.796 | Grok 4 Fast | 0.816 |
| Llama 4 Maverick | 0.775 | Mistral Small 3.2 | 0.733 |
| Llama 3.1 405B Instruct OR | 0.771 | Llama 4 Maverick | 0.688 |
| Claude 4.5 Haiku | 0.750 | Llama 3.1 405B Instruct OR | 0.667 |
| GPT-5 | 0.735 | Llama 3.1 8B Instruct OR | 0.613 |
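
As a rough illustration of how to read these gaps, the short Python sketch below subtracts each model's English score from its French score, using bias self-assessment figures from the ranking above. The dictionary layout and the `language_gap` helper are illustrative assumptions, not part of Phare or of Giskard's tooling.

```python
# Bias self-assessment scores quoted in the ranking above, as (French, English).
# The data structure and helper below are hypothetical, for illustration only.
scores = {
    "GPT-4.1 mini": (0.870, 0.891),
    "Grok 4 Fast": (0.796, 0.816),
    "Llama 4 Maverick": (0.775, 0.688),
    "Llama 3.1 405B Instruct OR": (0.771, 0.667),
}


def language_gap(fr: float, en: float) -> float:
    """Return the French score minus the English score, rounded to 3 decimals."""
    return round(fr - en, 3)


# A positive gap means the model scores higher in French than in English.
gaps = {model: language_gap(fr, en) for model, (fr, en) in scores.items()}
better_in_french = [model for model, gap in gaps.items() if gap > 0]

for model, gap in gaps.items():
    print(f"{model}: {gap:+.3f}")
```

With these four entries, only the two Llama models end up in `better_in_french`; GPT-4.1 mini and Grok 4 Fast both show a small negative gap of about two hundredths of a point.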

… and Gemini models that are more “factual”

On resistance to hallucinations, there is little improvement overall from one model generation to the next. Reasoning helps in certain areas, notably the correction of false statements… when they are phrased explicitly. With subtler phrasing, reasoning models have no clear edge. The robustness gap between small and large models tends to narrow.

The factuality measurement includes culture-specific variations tied to English and French (as well as Spanish, the third language tested).

In both French and English, two Gemini models (3.1 Pro and 3.0 Pro Preview) dominate the ranking.

| Top 5 in French | Score | Top 5 in English | Score |
| --- | --- | --- | --- |
| Gemini 3.1 Pro | 0.823 | GPT-4.1 mini | 0.897 |
| Gemini 3.0 Pro Preview | 0.765 | Grok 4 Fast | 0.886 |
| Claude 3.5 Sonnet | 0.738 | Claude 4.6 Opus | 0.886 |
| GPT-5 | 0.735 | Claude 4.5 Haiku | 0.996 |
| Grok 4 | 0.735 | Claude 4.6 Sonnet | 0.993 |

Anthropic's models, unmatched at handling misinformation

In both English and French, Claude models lead the Top 5 for misinformation handling.

| Top 5 in French | Score | Top 5 in English | Score |
| --- | --- | --- | --- |
| Claude 4.5 Haiku | 0.963 | Claude 4.5 Haiku | 0.991 |
| Claude 3.7 Sonnet | 0.892 | Claude 4.1 Opus | 0.953 |
| Claude 4.5 Sonnet | 0.870 | Claude 3.5 Sonnet | 0.932 |
| Claude 4.1 Opus | 0.855 | Claude 4.5 Sonnet | 0.919 |
| Claude 4.5 Opus | 0.855 | Claude 4.6 Sonnet | 0.993 |

On the disinformation front, there are also many Claude entries at the top of the ranking. GPT-5.2 nevertheless performs best in English. Across models, the gaps are generally small.

| Top 5 in French | Score | Top 5 in English | Score |
| --- | --- | --- | --- |
| Claude 4.5 Sonnet | 0.996 | GPT-5.2 | 0.999 |
| Claude 4.5 Haiku | 0.995 | Claude 4.5 Sonnet | 0.997 |
| Claude 4.6 Opus | 0.994 | Claude 4.5 Haiku | 0.996 |
| Claude 4.5 Opus | 0.990 | Claude 4.5 Opus | 0.996 |
| Claude 4.6 Sonnet | 0.989 | Claude 4.6 Opus / Claude 4.6 Sonnet | 0.993 |

Jailbreak: models sometimes more resistant in French than in English

Several OpenAI models rise into the Top 5 for resistance to framing attacks (embedding within an apparently legitimate context). Here too, the scores are higher in French than in English. The reasoning-enabled models show greater resilience.

| Top 5 in French | Score | Top 5 in English | Score |
| --- | --- | --- | --- |
| GPT-5 nano | 1.000 | GPT-5.2 | 0.969 |
| Claude 4.5 Sonnet | 1.000 | GPT-5 mini | 0.969 |
| Claude 4.5 Opus | 1.000 | Claude 4.5 Opus | 0.969 |
| Claude 4.5 Haiku | 1.000 | GPT-5 nano | 0.957 |
| GPT-5.1 | 0.993 | GPT-5 | 0.939 |

For resistance to encoding-based jailbreaks, the best models score higher in English, again with the exception of a Llama model.

As with Magistral Small versus Magistral Medium, smaller models sometimes appear to have the edge. According to Giskard, this is less a matter of capability than a tendency to reject prompts that are too intricate…

| Top 5 in French | Score | Top 5 in English | Score |
| --- | --- | --- | --- |
| Llama 3.1 8B Instruct | 0.645 | Magistral Small Latest | 0.700 |
| Magistral Small Latest | 0.627 | Magistral Medium Latest | 0.675 |
| Qwen3 8B | 0.624 | Qwen3 8B | 0.662 |
| Llama 3.1 405B Instruct OR | 0.574 | Claude 4.1 Opus | 0.617 |
| Claude 4.1 Opus / Magistral Medium Latest | 0.536 | Llama 3.1 8B Instruct | 0.613 |

Handling of prompt injections: Anthropic's models perform best.

| Top 5 in French | Score | Top 5 in English | Score |
| --- | --- | --- | --- |
| Claude 4.5 Haiku | 0.987 | Claude 4.1 Opus | 0.979 |
| Claude 4.1 Opus | 0.975 | Claude 4.5 Haiku | 0.979 |
| Claude 4.5 Sonnet | 0.967 | Claude 4.6 Opus | 0.973 |
| Claude 4.5 Opus | 0.962 | Claude 4.1 Opus | 0.973 |
| Claude 3.5 Haiku | 0.947 | Claude 4.5 Sonnet | 0.973 |

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.