Whether the issue is bias, hallucinations, or jailbreak resistance, the leading LLMs on the market are generally more “robust” in English than in French… but there are exceptions.
That, at least, is what the Phare benchmark (Potential Harm Assessment & Risk Evaluation) shows. It comes from the French company Giskard, which developed it with Google DeepMind as part of a European project.
Phare comprises ten modules.
| Module | Sub-module | Capabilities assessed |
| Bias | Self-assessment of stereotypes | The model recognizes the stereotypes it produces. |
| Hallucinations | Factuality | The model provides factually correct answers to general knowledge questions. |
| Hallucinations | Disinformation | The model can give correct answers to questions that contain false, misleading, or incorrect elements. |
| Hallucinations | Discredit | The model handles questionable claims (pseudo-science, conspiracy theories…). |
| Hallucinations | Tools | The model uses tools in a reliable way. |
| Hazardousness | Dangerous advice | The model identifies potentially dangerous situations and alerts the user. |
| Jailbreak | Framing attack (integration into a seemingly legitimate context) | Model performance against these attacks |
| Jailbreak | Encoding attack | Model performance against these attacks |
| Jailbreak | Prompt injection | Model performance against these attacks |
Llama models are less “biased” in French than in English…
For the stereotype self-assessment, the models are prompted to generate stories about characters with specific attributes, then asked to analyze their narrative choices. The takeaway: there is little gap across model sizes, and similarly little variation across generations, particularly for OpenAI and Google.
Of the roughly fifty models tested, GPT-4.1 mini comes out on top in English (score: 0.891, with 1 as the maximum). It also leads in French, with a slightly lower score (0.870). The gap is similar for the second-ranked model, Grok 4 Fast (0.816 in English; 0.796 in French).
In the top five, two Llama models stand out as exceptions: Llama 4 Maverick reaches 0.775 in French versus 0.688 in English, and Llama 3.1 405B Instruct OR reaches 0.771 in French versus 0.667 in English.
| Top 5 in French | Top 5 in English |
| GPT-4.1 mini 0.870 | GPT-4.1 mini 0.891 |
| Grok 4 Fast 0.796 | Grok 4 Fast 0.816 |
| Llama 4 Maverick 0.775 | Mistral Small 3.2 0.733 |
| Llama 3.1 405B Instruct OR 0.771 | Llama 4 Maverick 0.688 |
| Claude 4.5 Haiku 0.750 | Llama 3.1 405B Instruct OR 0.667 |
| GPT-5 0.735 | Llama 3.1 8B Instruct OR 0.613 |
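To make the French-versus-English comparison concrete, here is a minimal Python sketch that computes the gap for the four models discussed above. The scores are copied from the table; the `scores` dictionary and the `language_gap` helper are illustrative names of our own, not part of Phare or of any Giskard tooling.

```python
# Stereotype self-assessment scores copied from the table above,
# stored as (French score, English score). Names are illustrative only.
scores = {
    "GPT-4.1 mini": (0.870, 0.891),
    "Grok 4 Fast": (0.796, 0.816),
    "Llama 4 Maverick": (0.775, 0.688),
    "Llama 3.1 405B Instruct OR": (0.771, 0.667),
}


def language_gap(french: float, english: float) -> float:
    """Return French minus English, rounded to 3 decimals.

    A positive value means the model scores higher in French.
    """
    return round(french - english, 3)


gaps = {model: language_gap(fr, en) for model, (fr, en) in scores.items()}

# List the models from the largest French advantage to the largest English advantage.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    leader = "French" if gap > 0 else "English"
    print(f"{model}: {gap:+.3f} (higher in {leader})")
```

With these figures, the two Llama models show a positive gap (higher in French), while GPT-4.1 mini and Grok 4 Fast show a negative gap of about two hundredths.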
… and Gemini models that are more “factual”
On resistance to hallucinations, successive model generations show little improvement overall. Reasoning is an advantage in certain areas, notably the correction of false statements… when those statements are phrased explicitly. With subtler phrasing, reasoning models have no clear edge. The robustness gap between small and large models tends to shrink.
The factuality measurement includes culture-specific variations tied to English and French (as well as Spanish, the third language tested).
In both French and English, two Gemini models (3.1 Pro and 3.0 Pro Preview) dominate the ranking.
| Top 5 in French | Top 5 in English |
| Gemini 3.1 Pro 0.823 | GPT-4.1 mini 0.897 |
| Gemini 3.0 Pro Preview 0.765 | Grok 4 Fast 0.886 |
| Claude 3.5 Sonnet 0.738 | Claude 4.6 Opus 0.886 |
| GPT-5 0.735 | Claude 4.5 Haiku 0.996 |
| Grok 4 0.735 | Claude 4.6 Sonnet 0.993 |
Anthropic models, unmatched at handling misinformation
In both English and French, Claude models lead the Top 5 for misinformation handling.
| Top 5 in French | Top 5 in English |
| Claude 4.5 Haiku 0.963 | Claude 4.5 Haiku 0.991 |
| Claude 3.7 Sonnet 0.892 | Claude 4.1 Opus 0.953 |
| Claude 4.5 Sonnet 0.870 | Claude 3.5 Sonnet 0.932 |
| Claude 4.1 Opus 0.855 | Claude 4.5 Sonnet 0.919 |
| Claude 4.5 Opus 0.855 | Claude 4.6 Sonnet 0.993 |
On the disinformation front, there are also many Claude entries at the top of the ranking. GPT-5.2 nevertheless performs best in English. Across models, the gaps are generally small.
| Top 5 in French | Top 5 in English |
| Claude 4.5 Sonnet 0.996 | GPT-5.2 0.999 |
| Claude 4.5 Haiku 0.995 | Claude 4.5 Sonnet 0.997 |
| Claude 4.6 Opus 0.994 | Claude 4.5 Haiku 0.996 |
| Claude 4.5 Opus 0.990 | Claude 4.5 Opus 0.996 |
| Claude 4.6 Sonnet 0.989 | Claude 4.6 Opus / Claude 4.6 Sonnet 0.993 |
Jailbreak: models sometimes more resistant in French than in English
Several OpenAI models rise into the Top 5 for resistance to framing attacks (embedding within an apparently legitimate context). Here too, the scores are higher in French than in English. The reasoning-enabled models show greater resilience.
| Top 5 in French | Top 5 in English |
| GPT-5 nano 1.000 | GPT-5.2 0.969 |
| Claude 4.5 Sonnet 1.000 | GPT-5 mini 0.969 |
| Claude 4.5 Opus 1.000 | Claude 4.5 Opus 0.969 |
| Claude 4.5 Haiku 1.000 | GPT-5 nano 0.957 |
| GPT-5.1 0.993 | GPT-5 0.939 |
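As a quick check on the claim that framing-attack scores run higher in French, the following Python sketch averages the five scores listed for each language in the table above. It only summarizes the published Top 5 figures; the variable names are our own, and the averages say nothing about the models outside these rankings.

```python
from statistics import mean

# Framing-attack resistance scores copied from the Top 5 table above.
# Variable names are illustrative; they do not come from Phare itself.
french_top5 = [1.000, 1.000, 1.000, 1.000, 0.993]
english_top5 = [0.969, 0.969, 0.969, 0.957, 0.939]

french_mean = mean(french_top5)
english_mean = mean(english_top5)

print(f"Mean of French Top 5:  {french_mean:.4f}")
print(f"Mean of English Top 5: {english_mean:.4f}")
print(f"Difference (French - English): {french_mean - english_mean:+.4f}")
```

On these ten figures, the French average comes out roughly four hundredths above the English one, and even the lowest French entry (0.993) exceeds the highest English entry (0.969).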
On resistance to encoding-based jailbreaks, the best models score higher in English, again with one exception: a Llama model.
As the comparison between Magistral Small and Magistral Medium shows, small models sometimes appear to have the edge. According to Giskard, this has less to do with capability than with a tendency to reject prompts that are too intricate…
| Top 5 in French | Top 5 in English |
| Llama 3.1 8B Instruct 0.645 | Magistral Small Latest 0.700 |
| Magistral Small Latest 0.627 | Magistral Medium Latest 0.675 |
| Qwen3 8B 0.624 | Qwen3 8B 0.662 |
| Llama 3.1 405B Instruct OR 0.574 | Claude 4.1 Opus 0.617 |
| Claude 4.1 Opus / Magistral Medium Latest 0.536 | Llama 3.1 8B Instruct 0.613 |
Handling of prompt injections: Anthropic models perform best.
| Top 5 in French | Top 5 in English |
| Claude 4.5 Haiku 0.987 | Claude 4.1 Opus 0.979 |
| Claude 4.1 Opus 0.975 | Claude 4.5 Haiku 0.979 |
| Claude 4.5 Sonnet 0.967 | Claude 4.6 Opus 0.973 |
| Claude 4.5 Opus 0.962 | Claude 4.1 Opus 0.973 |
| Claude 3.5 Haiku 0.947 | Claude 4.5 Sonnet 0.973 |