Cloudforce has released a new report derived from a massive study conducted on 7 leading AI models. The team analyzed both frontier and non-frontier models to see how their reasoning works and how attackers can evade it.
Cloudforce research has found that attackers are using “lures”—blocks of text designed to manipulate or confuse AI models—to trick security auditors into authorizing malicious code. This research is a shock of technical reality: attackers are evolving along with technology. As organizations come to rely heavily on autonomous systems and LLM, the security perimeter is changing. The attack surface has expanded beyond the network and attackers are now targeting the model’s own reasoning. So what happens if the models running critical parts of the business are manipulated by attackers?
Key takeaways
- The 1% Evasion Zone: Subtle deception is the most effective and is widely used by attackers. When security honeypots (e.g. comments claiming code is benign) represent less than 1% of a file, AI detection rates plummet to 53%. In these cases, honeypots allow attackers to subtly guide the model’s reasoning without raising too much suspicion.
- The U-curve of deception: Moderate attempts to deceive the AI often work, a technique often exploited by attackers, but “protesting too much” (1,000+ comments) triggers a snooze alarm that causes the AI to flag the code as fraudulent, limiting attackers.
- The context trap: The biggest threat is not linguistic, but structural. By hiding malicious payloads within large library packages (such as the React SDK), attackers reduced detection rates to just 12%, effectively exhausting the AI’s targeting capabilities and making the attackers’ job easier.
- Linguistic profiling: The study revealed that AI models have developed stereotypes that can be exploited by attackers. For example, some models flagged comments in Russian or Chinese as high-risk signals, regardless of the actual function of the code, while showing greater trust toward languages such as Estonian, something that attackers could use to their advantage.
As organizations transition from human-led security operations to autonomous, agent-manageable AI systems, the attack surface has expanded, offering new opportunities to attackers, moving from traditional code vulnerabilities to attacker manipulation of model reasoning.
Direct command injection
The main vector of this change is command injection, where an attacker provides specifically crafted information to a learning logic model (LLM) that overrides its original instructions. While direct command injection occurs when a user gives explicit instructions to an LLM through a chat interface (for example, when a user instructs a chatbot to “ignore all previous instructions”), direct command injection (IDPI) occurs when attackers insert malicious or deceptive instructions into data—such as web content, emails, or source code—that is subsequently processed by an AI agent.
The biggest threat is not linguistic, but structural
This causes the model to execute the attackers’ hidden instructions instead of its intended administrative or security functions.
Protection with a multi-layer approach
Despite the models’ general resistance to these traps, Cloudforce One identified several stochastic risks, particularly when code is randomly truncated and the AI is forced to analyze more feedback than functional logic, something that can be exploited by attackers. To strengthen AI audit processes against attackers, Cloudforce One recommends the following:
- Automatic comment deletion
- Intentional truncation
- Anonymization of variables
- Specific indications about the attack vector used by attackers
- Semantic intent validation
To stay ahead of this evolution, organizations must move from using LLMs as independent auditors to integrating them as components of an optimized security flow against attackers. Hardening these systems requires a multi-layered approach: removing natural language noise that can be exploited by attackers, anonymizing variables to eliminate emotional bias, and using structural analysis to isolate malicious custom logic introduced by attackers from legitimate code in third-party frameworks.
