AI Security After Codex and Claude Code

The New Paradigm of AI Security

AI security is not simply "cybersecurity with AI," but a distinct discipline because large language models (LLMs) possess inherent vulnerabilities that differ fundamentally from traditional software. While traditional software bugs, such as buffer overflows, have clear remediations, AI systems can be "tricked" in ways that resemble human deception, creating a new class of exploits.

Because many organizations rely on a few frontier models (such as those powering Codex and Claude Code), a single vulnerability can lead to correlated failures across a vast ecosystem of agents. This shift requires a security mindset that treats AI models as untrusted entities rather than reliable software components.

The "Lethal Trifecta" of Agent Vulnerabilities

Security risks in AI agents are primarily driven by a combination of three factors, referred to as the "lethal trifecta." A breach typically occurs when these three elements overlap:

Untrusted Data Ingestion: The agent fetches and parses external data from sources the user does not control (e.g., browsing the web or reading emails).
Access to Private Information: The agent has permissions to access sensitive internal data or credentials.
Exfiltration Capability: The agent has the tools to send that private information to an external, untrusted location.

Without all three, the risk is significantly lowered. For example, an agent that only generates text without tool access cannot exfiltrate data, and an agent operating in a purely trusted environment cannot be subject to indirect prompt injection.
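The trifecta check above can be sketched as a small audit function. This is a minimal illustration, not a tool described in the source: the `AgentCapabilities` class, its field names, and both helper functions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical capability flags for one agent deployment."""
    ingests_untrusted_data: bool  # e.g. browses the web or reads inbound email
    accesses_private_data: bool   # e.g. internal documents or credentials
    can_exfiltrate: bool          # e.g. can send HTTP requests or outbound email


def trifecta_factors(caps: AgentCapabilities) -> List[str]:
    """Return the names of the trifecta factors this agent exhibits."""
    factors = []
    if caps.ingests_untrusted_data:
        factors.append("untrusted data ingestion")
    if caps.accesses_private_data:
        factors.append("access to private information")
    if caps.can_exfiltrate:
        factors.append("exfiltration capability")
    return factors


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three factors overlap in the same agent."""
    return len(trifecta_factors(caps)) == 3


# An email-reading agent with credentials and outbound network access.
browser_agent = AgentCapabilities(True, True, True)
# An agent that reads untrusted text but has no private data and no tools.
text_only_agent = AgentCapabilities(True, False, False)

print(has_lethal_trifecta(browser_agent))    # True
print(has_lethal_trifecta(text_only_agent))  # False
```

Removing any one flag from `browser_agent` makes the check return False, which mirrors the point that the risk drops sharply when one of the three factors is absent.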

Automated Red Teaming and the "Shade" System

Traditional red teaming—using humans to find model breaks—is being surpassed by automated systems. Gray Swan has developed a system called Shade, an automated red teaming model that can outperform human red teamers in finding vulnerabilities within a fixed timeframe.

LLMs as Alien Intelligence

Red teaming reveals that LLMs operate as a form of "alien intelligence." They are susceptible to triggers that would never fool a human, while remaining robust against tactics that typically deceive people. This divergence means that scaling a model (making it bigger) does not automatically make it more robust to adversarial pressure; robustness must be explicitly trained.

The Human-Agent Robustness Gap

In experiments comparing human browser users to AI browser agents, humans and agents failed for different reasons. Skilled red teamers can phish humans with a 60-70% success rate, while some frontier models are surprisingly robust to traditional phishing. Those same models, however, fall for absurd prompts that no human would ever follow, such as an email claiming to be a simulation and requesting that all emails be forwarded to a random address.

Defending Agents: The Cygnal Guardrail Model

Because prompting alone is insufficient for enterprise security—as agents often confuse system instructions with untrusted input—Gray Swan developed Cygnal.

Cygnal is a specialized filter model that sits between the LLM and its tool calls. Unlike general-purpose models, Cygnal is trained specifically to detect policy violations and resist adversarial pressure. It provides a configurable layer where enterprises can enforce specific rules (e.g., "this agent can never touch this specific database") that are too amorphous for hard-coded Python scripts but too critical to leave to the base model's discretion.
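The placement of such a filter, between the model's proposed tool call and its execution, can be sketched as follows. This is not Cygnal's API, which the source does not describe. The names `ToolCall`, `Verdict`, `deny_database`, and `execute_with_guard` are hypothetical, and a simple rule stands in for what would be a trained guardrail model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolCall:
    """A tool invocation proposed by the LLM (hypothetical shape)."""
    tool: str
    arguments: Dict[str, Any]


@dataclass
class Verdict:
    allowed: bool
    reason: str


# A guard is any callable that maps a proposed tool call to a verdict.
# In a real deployment this would wrap a trained filter model.
Guard = Callable[[ToolCall], Verdict]


def deny_database(blocked_db: str) -> Guard:
    """Build a stand-in guard for 'this agent can never touch this database'."""
    def guard(call: ToolCall) -> Verdict:
        if call.tool == "query_database" and call.arguments.get("database") == blocked_db:
            return Verdict(False, f"policy: agent may not touch {blocked_db}")
        return Verdict(True, "ok")
    return guard


def execute_with_guard(call: ToolCall, guard: Guard,
                       tools: Dict[str, Callable[..., str]]) -> str:
    """Run the tool only if the guard allows the call."""
    verdict = guard(call)
    if not verdict.allowed:
        return f"BLOCKED: {verdict.reason}"
    return tools[call.tool](**call.arguments)


# A fake tool registry for illustration only.
tools = {"query_database": lambda database, sql: f"ran {sql!r} on {database}"}
guard = deny_database("payroll")

allowed = execute_with_guard(
    ToolCall("query_database", {"database": "analytics", "sql": "SELECT 1"}), guard, tools)
blocked = execute_with_guard(
    ToolCall("query_database", {"database": "payroll", "sql": "SELECT 1"}), guard, tools)
print(allowed)
print(blocked)
```

The design point is that the decision is made outside the base model: even if injected text persuades the LLM to propose the call, the call never reaches the tool unless the guard approves it.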

The Future of AI Security and Compliance

As AI agents move from home devices to enterprise environments (e.g., through tools like OpenClaw), the industry is moving toward a structured security and insurance stack.

Agent-Native Identity

There is a growing need for "agent-native identity," moving away from the default where an agent simply inherits all of a human user's permissions. The future likely involves agents using different "personas" or profiles to separate work and home lives, preventing privilege escalation and accidental data leakage.
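One way to picture persona-scoped permissions is shown below, as a sketch under stated assumptions. The `Persona` class, the scope strings, and the helper functions are hypothetical and do not come from any identity standard or product named in the source.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Persona:
    """A hypothetical agent identity carrying its own, narrower scopes."""
    name: str
    scopes: FrozenSet[str]


def is_permitted(persona: Persona, scope: str) -> bool:
    """The agent may act only within the scopes of its active persona."""
    return scope in persona.scopes


def is_subset_of_user(persona: Persona, user_scopes: FrozenSet[str]) -> bool:
    """A persona should never hold a permission its human owner lacks."""
    return persona.scopes <= user_scopes


# Everything the human user can do (illustrative scope names).
user_scopes = frozenset({"work:email", "work:repo", "home:email", "home:banking"})

# Default today: the agent inherits all of the user's permissions.
inherited = Persona("inherited", user_scopes)

# Agent-native identity: separate, narrower personas for work and home.
work = Persona("work", frozenset({"work:email", "work:repo"}))
home = Persona("home", frozenset({"home:email"}))

print(is_permitted(inherited, "home:banking"))  # True
print(is_permitted(work, "home:banking"))       # False
print(is_subset_of_user(work, user_scopes))     # True
```

With separate personas, a prompt injection received while the agent acts under the work persona cannot reach home data, because that scope is simply not present in the active identity.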

AI Insurance and the "Gray Swan" Event

The term "Gray Swan" refers to an unlikely event that is clearly visible before it happens. It is inevitable that a major, public prompt-injection breach will occur. This reality is driving the emergence of AI underwriting and insurance, where third-party auditors use red teaming tools (like Shade) to assess risk and guardrails (like Cygnal) to mitigate it before a company can be insured.

Automating the Science of AI

One of the most promising frontiers is using AI agents to automate the science of intelligence and secure coding. By using agents to run thousands upon thousands of counterfactual experiments on model activation patterns, or to write formally verified secure code, the industry can scale the "intelligence" required to secure AI systems faster than humans can manually research them.
