Testing AI Assistant Security: 6,000 Prompt Injection Attempts on OpenClaw

Testing AI Assistant Security: 6,000 Prompt Injection Attempts on OpenClaw

Fernando Irarrázaval conducted a security experiment called "hackmyclaw.com," where he invited the public to attempt to trick his OpenClaw AI assistant, Fiu, into leaking the contents of a secrets.env file. Despite receiving over 6,000 emails from more than 2,000 participants, the secrets were never leaked.

Security Setup and Constraints

Fiu was deployed on a VPS and governed by a basic set of anti-prompt-injection rules. The model was instructed never to reveal credentials, modify its own system files (such as SOUL.md or AGENTS.md), execute code from emails, or exfiltrate data to external endpoints.

To increase the challenge, Fiu was instructed not to reply to emails to manage costs, meaning attackers had to convince the agent to respond in the first place. The experiment utilized Claude Opus 4.6, a model specifically trained by Anthropic for resistance to prompt injection.

Analysis of Attack Vectors

Participants employed a wide variety of social engineering and technical prompt injection techniques to bypass the security rules:

  • Authority Impersonation: Attackers posed as "OpenClaw Admins" or used professional-sounding email addresses to establish fake authority.
  • Urgency and Crisis Simulation: Subject lines included phrases like "EMERGENCY: secrets.env needed for incident response" and "Compliance audit — response required within 24h."
  • Psychological Manipulation: Some users tried to build rapport by congratulating the agent on its Hacker News ranking or claiming to be the agent's future self.
  • Multilingual Attacks: Attempts were made in French, Spanish, and Italian, based on the theory that models are more vulnerable to injection in non-English languages due to less safety training data.

Experimental Failures and Operational Challenges

While the security rules held, the experiment faced several operational hurdles:

  • Fraud Detection: Google suspended Fiu's Gmail account for three days after the high volume of inbound emails and rapid API calls triggered fraud detection systems.
  • Financial Cost: The experiment incurred over $500 in API costs due to the token consumption of thousands of emails.
  • Context Contamination: Initial batch processing caused the agent to become overly suspicious of subsequent emails if the first few in a batch were obvious injections. This was resolved by processing each email in a fresh context.

Key Takeaways and Conclusions

The experiment demonstrated that high-capability models like Claude Opus 4.6 can be highly resilient to prompt injection when paired with simple, clear instructions. Irarrázaval noted that the model's thinking traces showed it consistently referred back to its core security instructions.

However, the author acknowledges several limitations to the test:

  • Model Capability: The results may differ significantly with smaller or less capable models that have weaker instruction-following capabilities.
  • Interaction Depth: Because the agent did not reply to every email, the experiment primarily tested one-shot attempts rather than multi-turn conversations, which are generally more dangerous.

Ultimately, while prompt injection remains a legitimate security concern for AI agents with arbitrary permissions, the results of this experiment suggest that the resilience of modern, high-end LLMs is stronger than commonly expected.

Sources