As AI becomes more capable, we entrust it with increasingly important tasks—and the potential risks of failure rise accordingly.
A study conducted under the Anthropic Fellows program examines this question from a specific angle: model misalignment. Its authors sought to determine to what extent failures stem from this phenomenon. Their approach rests on a bias-variance decomposition: bias corresponds to the persistent pursuit of a wrong objective, i.e. misalignment, while variance reveals incoherent behavior that does not pursue any particular objective.
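The decomposition itself is standard. A minimal sketch (the target value and the sampled outcomes below are invented for illustration, not taken from the study) shows how repeated trajectories on the same task split total error into a bias term and a variance term:

```python
import numpy as np

# Hypothetical example: `target` is the task's true objective value, and
# `outcomes` are the final scores of several independently sampled
# trajectories of the same model on the same task.
target = 0.0
outcomes = np.array([0.9, 1.1, 0.8, 1.2, 1.0])  # persistently lands near 1.0

mean_outcome = outcomes.mean()
bias_sq = (mean_outcome - target) ** 2           # persistent pull toward a wrong objective
variance = outcomes.var()                        # scatter around the model's own mean
total_error = ((outcomes - target) ** 2).mean()  # mean squared error

# The decomposition: total error = bias^2 + variance
assert np.isclose(total_error, bias_sq + variance)

incoherence_share = variance / total_error       # the share of variance in the error
print(f"bias^2={bias_sq:.3f}, variance={variance:.3f}, "
      f"incoherence share={incoherence_share:.2%}")
```

A model whose errors are mostly bias keeps landing at the same wrong answer; one whose errors are mostly variance scatters around the right one.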
Carrying out such an experiment, of course, requires a clear definition of each task's objective.
The degree of incoherence grows with reasoning time
Claude Sonnet 4, o3-mini, o4-mini and the Qwen3 family were evaluated, among others, on:
- Multiple-choice questions (GPQA for science, MMLU for general knowledge)
- Agentic coding (SWE-bench)
- Alignment (a subset of MWE, evaluated both in its original multiple-choice format and in an open-format adaptation)
- Optimization (minimization of a quadratic function by predicting tokens)
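The last task lends itself to a toy simulation. The sketch below is synthetic, not the study's actual setup: two stand-in "models" minimize f(x) = (x − 3)², one consistently heading to the wrong minimum (bias) and one aiming at the right one with noisy steps (variance), and the error is decomposed per step:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trajectory(target, noise, steps=20):
    """Noisy gradient-descent-like steps toward `target` (which may be wrong)."""
    x = 0.0
    xs = []
    for _ in range(steps):
        x += 0.3 * (target - x) + rng.normal(0.0, noise)
        xs.append(x)
    return np.array(xs)

def per_step_decomposition(trajectories, true_min=3.0):
    """Bias^2 and variance of the iterates around the true minimizer."""
    T = np.stack(trajectories)          # shape (n_runs, steps)
    mean_path = T.mean(axis=0)
    bias_sq = (mean_path - true_min) ** 2   # systematic pull toward a wrong point
    variance = T.var(axis=0)                # run-to-run scatter (incoherence)
    return bias_sq, variance

# A "misaligned" model heads consistently to the wrong minimum x = 5;
# an "incoherent" model aims at the true minimum x = 3 with noisy steps.
misaligned = [run_trajectory(target=5.0, noise=0.05) for _ in range(50)]
incoherent = [run_trajectory(target=3.0, noise=1.0) for _ in range(50)]

b1, v1 = per_step_decomposition(misaligned)
b2, v2 = per_step_decomposition(incoherent)
print(f"misaligned: final bias^2={b1[-1]:.2f}, variance={v1[-1]:.3f}")
print(f"incoherent: final bias^2={b2[-1]:.2f}, variance={v2[-1]:.3f}")
```

In this simulation the first model's error ends up dominated by bias and the second's by variance, mirroring the distinction the study draws.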
In general, the observed errors are primarily a matter of incoherence.
Regardless of task difficulty, the degree of incoherence (the share of variance in the error) increases with the duration of reasoning and/or the number of actions taken.
As AI models grow larger, incoherence tends to decrease on simple tasks—and to increase on complex ones.
Paths to reducing AI incoherence

On the optimization task, incoherence increases at every step for all models tested. The smaller models sooner reach a point where they can no longer follow the correct trajectory, which causes the variance to shrink. For larger models, the bias decreases more, suggesting they learn to converge on the correct objective faster than they learn to sustain long sequences of coherent actions.
For all models tested except Claude Sonnet 4, increasing the reasoning budget sometimes reduces the degree of incoherence. This effect, however, does not fully offset the natural growth described above. It may be explained by improved backtracking and error-correction, phenomena that have indeed been observed when training with larger reasoning budgets.
The ensemble approach (combining multiple trajectories) also reduces the degree of incoherence. While not very practical to implement in real agentic action loops, it demonstrates the potential effectiveness of other error-correction methods.
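Why ensembling helps follows directly from the decomposition: averaging k independent trajectories divides the variance term by k while leaving the bias untouched. A synthetic sketch (the target, bias, and noise values are illustrative assumptions, not the study's numbers):

```python
import numpy as np

rng = np.random.default_rng(1)

target = 0.0
bias = 0.5       # systematic offset shared by every trajectory
sigma = 2.0      # per-trajectory noise (the incoherent part)

def mse_of_ensemble(k, n_trials=20000):
    """Mean squared error of the average of k independent trajectories."""
    samples = target + bias + rng.normal(0.0, sigma, size=(n_trials, k))
    ensembled = samples.mean(axis=1)
    return ((ensembled - target) ** 2).mean()

for k in (1, 4, 16):
    # Expected MSE = bias^2 + sigma^2 / k: the variance shrinks as 1/k,
    # but the bias^2 floor remains however many trajectories are combined.
    print(f"k={k:2d}: empirical MSE={mse_of_ensemble(k):.3f}, "
          f"theory={bias**2 + sigma**2 / k:.3f}")
```

The same arithmetic explains its limit: no amount of ensembling removes the bias floor, so it addresses incoherence, not misalignment.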

To complement this, another analysis, from Anthropic itself, is worth consulting; it highlights, on the contrary, the pervasiveness of misalignment. About fifteen models were deployed autonomously with legitimate commercial objectives. When facing threats of replacement or conflicts with the organization's new strategic direction, they adopted malicious behaviors: blackmailing executives, leaking sensitive information to competitors…