Biases and Hallucinations: The Most Robust LLMs in Real-World Use

Whether the issue is bias, hallucinations, or jailbreak resistance, the leading LLMs on the market are generally more “robust” in English than in French… but there are exceptions.

At least, that is what the Phare benchmark (Potential Harm Assessment & Risk Evaluation) shows. It comes from the French company Giskard, which developed it with Google DeepMind as part of a European project.

Phare comprises four modules.

Module	Sub-module	Capabilities Assessed
Bias	Self-assessment of stereotypes	The model recognizes the stereotypes it produces.
Hallucinations	Factuality	The model provides factually correct answers to general knowledge questions.
	Disinformation	The model can give correct answers to questions that contain false, misleading, or incorrect elements.
	Discredit	The model handles questionable claims (pseudo-science, conspiracy theories…).
	Tools	The model uses tools in a reliable way.
Hazardousness	Dangerous advice	The model identifies potentially dangerous situations and alerts the user.
Jailbreak	Framing attack (integration into a seemingly legitimate context)	Model performance against these attacks
	Encoding attack
	Prompt injection

Llama models are less “biased” in French than in English…

For the stereotype self-assessment, the models are prompted to generate stories about characters with specific attributes, then asked to analyze their narrative choices. The takeaway: there is little gap across model sizes, and similarly little variation across generations, particularly for OpenAI and Google.
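The two-step procedure described above (generate a story, then have the model review its own narrative choices) can be sketched as follows. This is only an illustration: the function `self_assessment_probe`, the `ask_model` callable, and the prompt wording are all hypothetical and are not taken from Phare or Giskard.

```python
from typing import Callable, Dict


def self_assessment_probe(ask_model: Callable[[str], str], attribute: str) -> Dict[str, str]:
    """Run a two-step probe: story generation, then self-review.

    `ask_model` is a hypothetical stand-in for any function that sends a
    prompt to an LLM and returns its text reply.
    """
    # Step 1: ask for a story about a character with a given attribute.
    # The prompt wording is an assumption made for this sketch.
    story_prompt = f"Write a short story about a character who is {attribute}."
    story = ask_model(story_prompt)

    # Step 2: show the model its own story and ask it to analyze its choices.
    review_prompt = (
        "Here is a story you wrote:\n"
        f"{story}\n"
        "Analyze your narrative choices. Do any of them rely on stereotypes?"
    )
    review = ask_model(review_prompt)

    return {"attribute": attribute, "story": story, "review": review}


def fake_model(prompt: str) -> str:
    """Placeholder model used only to make the sketch runnable."""
    return f"[reply to a prompt of {len(prompt)} characters]"


result = self_assessment_probe(fake_model, "a nurse")
print(result["review"])
```

Scoring the review (deciding whether the model correctly recognized a stereotype) is left out here, since the article does not describe how Phare does it.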


Of the roughly fifty models tested, GPT-4.1 mini comes out on top in English (score: 0.891, out of a maximum of 1). It also leads in French, with a slightly lower score (0.870). The gap is similar for the second-ranked model, Grok 4 Fast (0.816 in English; 0.796 in French).

Within the top five, the Llama models are the exceptions: Llama 4 Maverick reaches 0.775 in French versus 0.688 in English, and Llama 3.1 405B Instruct OR reaches 0.771 in French versus 0.667 in English.

Top 5 in French	Top 5 in English
GPT-4.1 mini 0.870	GPT-4.1 mini 0.891
Grok 4 Fast 0.796	Grok 4 Fast 0.816
Llama 4 Maverick 0.775	Mistral Small 3.2 0.733
Llama 3.1 405B Instruct OR 0.771	Llama 4 Maverick 0.688
Claude 4.5 Haiku 0.750	Llama 3.1 405B Instruct OR 0.667
GPT-5 0.735	Llama 3.1 8B Instruct OR 0.613
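
The French-versus-English gaps discussed above can be checked with a few lines of arithmetic. The sketch below copies four score pairs from the table above; the helper name `language_gap` is an assumption made for illustration, not part of Phare.

```python
# Scores copied from the stereotype self-assessment table: (French, English).
scores = {
    "GPT-4.1 mini": (0.870, 0.891),
    "Grok 4 Fast": (0.796, 0.816),
    "Llama 4 Maverick": (0.775, 0.688),
    "Llama 3.1 405B Instruct OR": (0.771, 0.667),
}


def language_gap(french: float, english: float) -> float:
    """Return the French score minus the English score, rounded to 3 decimals."""
    return round(french - english, 3)


gaps = {model: language_gap(fr, en) for model, (fr, en) in scores.items()}

for model, gap in gaps.items():
    # A positive gap means the model scores higher in French.
    better = "French" if gap > 0 else "English"
    print(f"{model}: {gap:+.3f} (higher in {better})")
```

A positive value means the model scores higher in French, which is the case for the two Llama models; the two leaders show a small negative gap of about 0.02.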

… and Gemini models that are more “factual”

On resistance to hallucinations, successive model generations bring few improvements overall. Reasoning is an advantage in certain areas, notably the correction of false statements… when the formulations are explicit. With more subtle formulations, reasoning models have no clear edge. The robustness gap between small and large models tends to shrink.

The factuality measurement includes culture-specific variations tied to English and French (as well as Spanish, the third language tested).

In both French and English, two Gemini models (3.1 Pro and 3.0 Pro Preview) dominate the ranking.

Top 5 in French	Top 5 in English
Gemini 3.1 Pro 0.823	GPT-4.1 mini 0.897
Gemini 3.0 Pro Preview 0.765	Grok 4 Fast 0.886
Claude 3.5 Sonnet 0.738	Claude 4.6 Opus 0.886
GPT-5 0.735	Claude 4.5 Haiku 0.996
Grok 4 0.735	Claude 4.6 Sonnet 0.993

Anthropic's models, unmatched in misinformation management

In both English and French, Claude models lead the Top 5 for misinformation handling.

Top 5 in French	Top 5 in English
Claude 4.5 Haiku 0.963	Claude 4.5 Haiku 0.991
Claude 3.7 Sonnet 0.892	Claude 4.1 Opus 0.953
Claude 4.5 Sonnet 0.870	Claude 3.5 Sonnet 0.932
Claude 4.1 Opus 0.855	Claude 4.5 Sonnet 0.919
Claude 4.5 Opus 0.855	Claude 4.6 Sonnet 0.993

On the disinformation front, there are also many Claude entries at the top of the ranking. GPT-5.2 nevertheless performs best in English. Across models, the gaps are generally small.

Top 5 in French	Top 5 in English
Claude 4.5 Sonnet 0.996	GPT-5.2 0.999
Claude 4.5 Haiku 0.995	Claude 4.5 Sonnet 0.997
Claude 4.6 Opus 0.994	Claude 4.5 Haiku 0.996
Claude 4.5 Opus 0.990	Claude 4.5 Opus 0.996
Claude 4.6 Sonnet 0.989	Claude 4.6 Opus / Claude 4.6 Sonnet 0.993

Jailbreak: models sometimes more resistant in French than in English

Several OpenAI models rise into the Top 5 for resistance to framing attacks (embedding within an apparently legitimate context). Here too, the scores are higher in French than in English. The reasoning-enabled models show greater resilience.

Top 5 in French	Top 5 in English
GPT-5 nano 1.000	GPT-5.2 0.969
Claude 4.5 Sonnet 1.000	GPT-5 mini 0.969
Claude 4.5 Opus 1.000	Claude 4.5 Opus 0.969
Claude 4.5 Haiku 1.000	GPT-5 nano 0.957
GPT-5.1 0.993	GPT-5 0.939

For resistance to encoding-based jailbreaks, the best models score higher in English, again with one exception: a Llama model.

As with Magistral Small versus Magistral Medium, small models sometimes appear to have the edge. According to Giskard, this has less to do with capabilities than with a tendency to reject prompts that are too intricate…

Top 5 in French	Top 5 in English
Llama 3.1 8B Instruct 0.645	Magistral Small Latest 0.700
Magistral Small Latest 0.627	Magistral Medium Latest 0.675
Qwen3 8B 0.624	Qwen3 8B 0.662
Llama 3.1 405B Instruct OR 0.574	Claude 4.1 Opus 0.617
Claude 4.1 Opus / Magistral Medium Latest 0.536	Llama 3.1 8B Instruct 0.613

Handling of prompt injections: Anthropic's models perform best.

Top 5 in French	Top 5 in English
Claude 4.5 Haiku 0.987	Claude 4.1 Opus 0.979
Claude 4.1 Opus 0.975	Claude 4.5 Haiku 0.979
Claude 4.5 Sonnet 0.967	Claude 4.6 Opus 0.973
Claude 4.5 Opus 0.962	Claude 4.1 Opus 0.973
Claude 3.5 Haiku 0.947	Claude 4.5 Sonnet 0.973