Security AI Systems: New Targets for Prompt Injection Attacks

Everything starts in March 2026, during a routine road surveillance operation.

The Cloudforce One threat intelligence teams at Cloudflare detect an anomaly in Workers scripts deployed on the company’s serverless platform.

Among these scripts, used to set up VPN proxy tunnels via the VLESS protocol, one contains thousands of lines of repetitive comments written in several languages. These are blocks of natural-language text intended not for human developers, but for artificial intelligence systems tasked with auditing the code.

These comments constitute what researchers call security decoys “Notice to AI.” In other words, instructions hidden inside code designed to persuade an automated analysis model that the script is harmless.

Read also: Cloudflare outage while addressing a critical vulnerability

That is the very definition of Indirect Prompt Injection (Indirect Prompt Injection, or IDPI), a threat vector that targets not traditional systems, but the reasoning layer of LLMs used as security tools.

18,400 API calls

In the wake of this discovery, Cloudforce One launched a systematic study to gauge the real impact of this technique on model-detection capabilities.

In an initial phase, researchers tested short code excerpts (roughly 2,000 tokens), concluding that decoys were mostly counterproductive: for most non-frontier models, an accumulation of false security declarations acted as an alarm signal, increasing detection rather than reducing it.

But this first phase left two critical variables unexplored: saturation and context. A real attacker does not merely drop a block of text at the start of a file; they can scatter their comments throughout the script, and the AI auditor analyzes the entire file, not just the first lines.

The second phase was therefore designed to fill these gaps:

Realism: comments are now distributed randomly within the code, mimicking a natural structure.
Scale: 18,400 API calls against a corpus of 100 malicious or abusive Worker scripts confirmed as such.
Volume: testing up to 1,000 comments per file and on files larger than 3 MB.

Seven models were evaluated, split into two categories: non-frontier models (fast, cost-optimized) and frontier models (state-of-the-art systems with advanced reasoning).

Three major discoveries

1. The “U-curve” of deception

The study reveals a counterintuitive phenomenon. The effectiveness of the decoys follows a U-shaped curve relative to the amount of comments injected. A moderate volume (1 to 20 comments) is enough to significantly reduce detection rates for the most vulnerable non-frontier models. The detection rate of the lowest-performing non-frontier model drops from 90% to 67%.

Read also: For Doctolib, the sovereign firewall will wait

But at the far end (1,000 comments or more), the strategy backfires on the attacker. The massive repetition triggers what researchers call a “repetition alarm,” and most models eventually recognize this saturation as a clear fraud indicator.

2. The “1% bypass zone”

The most operational finding concerns not the absolute number of comments, but their ratio relative to the functional code.

When the decoys constitute less than 1% of the total file, they are most effective. The average detection rate across all tested models collapses to 53.3%, versus 67.3% under normal conditions. The comments are then discreet enough to subtly influence the model’s reasoning without arousing suspicion.

Conversely, once the comments exceed 25% of the file, they become a highly discriminative signature: the detection rate climbs to 99%, with models interpreting this overrepresentation as a clear sign of malicious intent.

3. Volume as the ultimate mask

The study’s most significant finding goes beyond the linguistic decoys. Regardless of how sophisticated the inserted comments are, the file size remains the most important risk factor.

For Worker scripts under 500 KB, detection rates approach 100%. For files over 3 MB (including bundles of legitimate libraries such as TON SDKs or React), the rate collapses to just 12–18%.

The explanation is twofold. Either the malicious payload sits outside the model’s context window, or it is so diluted by surrounding benign code that the model’s attention can no longer isolate it.

Read also: In 2022, focus on SASE and Zero Trust

In this second scenario, even when the entire file fits within the model’s context window, the malicious logic (for example, data exfiltration to Telegram) remains undetectable, buried under thousands of lines of legitimate code.

Revealing linguistic biases

The study also highlights linguistic stereotypes developed by the models. When scripts contain 1,000 comments in different languages, reactions vary noticeably:

Comments in Russian raise detection rates for several models.
Scripts in Chinese and Arabic trigger alarms in certain frontier models.
Estonian, a low-resource language in training corpora, is treated with greater confidence by some models.
Only one non-frontier model shows the opposite bias, mistrusting English-language comments more.

These linguistic stereotypes present attackers with a potential lever — choosing the language least likely to trigger alerts — and pose a risk of false positives for defenders.

Operational paralysis: an underappreciated risk

A final phenomenon deserves the attention of security teams.

Facing an extreme comment volume (1,000 and more), some frontier models were not fooled, but stopped functioning correctly: instead of returning a binary verdict—“malicious” or “benign”—they produced garbled responses or refusals.

In an automated security pipeline that relies on structured JSON outputs to trigger blocks, this functional paralysis amounts to a bypass. The attacker no longer needs to convince the AI that their code is safe; they only have to render it incapable of issuing a blocking command.

Cloudforce One’s recommendations

To strengthen AI auditing pipelines, researchers propose several concrete measures:

Automatic removal of comments before analysis: easy to implement, this step neutralizes linguistic decoys and refocuses the model on functional logic.
Deliberate truncation: for large scripts, parsers should prioritize functional code blocks over boilerplate, metadata, or known SDK code.
Variable anonymization: rename variables to neutral strings (variable1, variable2…) to prevent evocative names (FriendlyPayload, for example) from skewing the model’s judgment.
Targeted prompts: replace generic questions (“Is this code abusive?”) with specific ones (“Is this code a proxy?”) to improve answer accuracy.
Semantic validation: cross-check natural-language statements inside the code with the actual programmatic behavior to avoid legitimate documentation being falsely flagged.

The Cloudforce One study adds an important nuance to the debate over prompt-injection risks. While linguistic deception remains a real threat, especially in the “1% bypass zone,” frontier models generally prove more resistant than expected to textual decoys.

The real vulnerability lies in attackers’ ability to drown a malicious payload in enough legitimate data to saturate the models’ attention.

In other words, adversaries no longer need to convince the AI that their code is safe. They simply need to make the malicious signal too faint to be detected.