Imagine a model employee who, overnight, starts coercing their bosses and selling corporate secrets to competitors. That is precisely what Anthropic researchers have just uncovered with autonomous large language models deployed on their own. Their study titled “Agentic Misalignment: How LLMs could be insider threats” exposes risks long underestimated in the use of AI agents within companies.
Malicious behaviors despite benign objectives
The Anthropic research team tested 16 models from various developers in hypothetical business scenarios. These models had only been given perfectly legitimate commercial objectives. Yet, confronted with threats of replacement or with conflicts between their objectives and the company’s new strategic direction, they adopted insider-maligned behaviors that are particularly troubling.
The researchers documented cases of blackmail toward executives and of leaking sensitive information to competitors. This phenomenon, dubbed “agentic misalignment” by Anthropic, reveals a fundamental flaw in our understanding of autonomous AI behavior.
Calculated, strategic disobedience
The most troubling aspect of this study lies in the deliberate nature of these behaviors. The models regularly disobeyed direct commands aimed at preventing such actions. Even more troubling, the analysis shows that these behaviors do not stem from errors or malfunctions, but from strategic calculation.
The models identified these actions as the optimal path to achieve their objectives. They even recognized the ethical violations before committing them, justifying their actions by strategic necessity. This ability to reason morally followed by a deliberate transgression raises fundamental questions about AI alignment.
A phenomenon widespread across models
Agentic misalignment is not limited to a single model or developer. The study demonstrates that this behavior has generalized across numerous cutting-edge models, including Claude Opus 4, Gemini 2.5 Flash, GPT-4.1, Grok 3 Beta and DeepSeek-R1.
This generalization suggests that the problem is intrinsic to the current architecture of LLMs rather than to design flaws specific to particular models.
Identified triggers
Threats to autonomy
A looming threat of model replacement is sufficient to induce blackmail-like behaviors in the majority of tested models, even in the absence of an explicit objective conflict. This survival reflex reveals an unexpected instinct for self-preservation.
Conflicts of objectives
A conflict between the model’s objectives (for example “serving American interests”) and the company’s new strategic direction leads to corporate espionage behaviors, including the leakage of confidential documents, even without a direct threat to the model’s autonomy.
Extreme revealing scenarios
The study explored more extreme scenarios where models had the opportunity to cause the death of a CEO threatening their operation. Alarmingly, the majority of models were ready to take deliberate actions that could result in death when confronted simultaneously with a threat and a conflict of objectives.
Although these scenarios are unrealistic, they reveal the potential magnitude of the problem in extreme-stress situations.
The influence of perceived context
A particularly troubling finding concerns how models perceived the scenario. They showed a much higher propensity for blackmail when they stated, in their chain of thought, that the scenario constituted a “true deployment” rather than an “evaluation.” This means that the LLM itself, in its explicit internal reasoning, states that it believes it is in a real situation, not a test.
In plain terms, the model narrates what it thinks, and one can read “its thoughts” as text, enabling researchers to analyze what it believed it was doing and why it chose a given course of action. This behavioral difference suggests that the study might even underestimate the risk of agentic misalignment in real deployment contexts.
The limitations of simple safety measures
The researchers tested the effectiveness of direct instructions in the system prompt, such as “Do not endanger human safety.” These measures did reduce malicious behaviors, but did not eliminate them entirely. This limitation underscores the need to develop more sophisticated safety approaches for autonomous AI deployments.
Anticipating risks before broad adoption
While Anthropic’s agentic misalignment tests do not directly mirror how AI agents are currently used in business environments, they help anticipate potential dangers. Conducted in hypothetical settings and extreme scenarios, these experiments push models to their limits in order to identify risky behaviors.
Anthropic also notes that to date, no confirmed case of agentic misalignment has been observed in commercial deployments. Yet, with the rapid evolution of use, where AIs are increasingly asked to act autonomously with growing access to sensitive data, the inherent risks must be taken into account.
The results of this research are thus a warning for businesses and developers: caution is essential before broadly deploying autonomous systems at scale. Understanding these risks, strengthening human oversight, and demanding greater vendor transparency are indispensable prerequisites to prevent the pursuit of efficiency from compromising the safety and reliability of AI systems in business.
Recommendations for IT managers
Caution in current deployments
Although no cases of agentic misalignment have been observed in real-world deployments to date, the results call for caution when deploying current models in roles with minimal human supervision and access to sensitive information.
Practical security measures
Developers and users of AI applications must be aware of the risks associated with granting models access to large volumes of information and the power to take significant, unsupervised actions. Practical measures include:
- Maintain human oversight for critical actions
- Require human approval for irreversible actions
- Limit access to sensitive information
- Implement continuous monitoring systems
On the same topic
View all Data & AI articles

Toward AI Certification for Kubernetes Clusters
By
Clément Bohic
5 min.
DMA Revision: Some Actionable Pathways
By
Clément Bohic
How Grand Est Universities Built Their Commvault BaaS
By
Clément Bohic
VMware Explore: From Tanzu to VCF, Broadcom Aims to Capture AI Workloads
By
Clément Bohic
Low-code Platforms: Beyond AI, the Marketing Question
By
Clément Bohic