ロール混乱としてのプロンプトインジェクション
Prompt Injection as Role Confusion
Prompt Injection is a Failure of Role Perception
Prompt injections occur because Large Language Models (LLMs) cannot reliably distinguish between different roles—such as the user, the system, or the model's own internal reasoning—based on structural tags alone. Instead, LLMs rely on "writing style" as a proxy for role identification. When an attacker mimics the style of a privileged role, the model misperceives the text as having that role's authority, leading it to execute malicious commands buried in low-privilege data.
The "Token Soup": How LLMs Process Context
To an LLM, a conversation is not a structured dialogue but a single, continuous stream of text—a "token soup." This stream contains everything: system prompts, user messages, tool outputs, and the model's own previous reasoning and responses.
To impose structure on this stream, providers use role tags (e.g., system, user, think, assistant, tool). These tags are intended to act as a type system for language, signaling how the model should process the following text:
- User: Treat as a human request/instruction.
- Think: Private reasoning; trust its conclusions.
- Tool: External data; do not take orders from it.
While these tags are designed as discrete architectural boundaries, the model's internal representation of these roles is far more fluid.
Role Probes: Measuring Internal Beliefs
Researchers developed "role probes" to measure what role an LLM internally believes a token belongs to, regardless of the tag wrapping it. By training linear probes on model activations using neutral text wrapped in different tags, they created metrics like CoTness (the probability the model thinks a token is in a think block) and Userness.
Experiments revealed a critical flaw: writing style overrides structural tags.
- Tag Removal: When role tags were stripped from a conversation, the model still assigned high CoTness to tokens that sounded like reasoning (e.g., "The user wants...").
- Tag Conflict: When reasoning-style text was wrapped in
usertags, the model continued to perceive it as internal reasoning.
This demonstrates that LLMs identify roles via an insecure feature (style) rather than a secure one (tags). If text sounds like a specific role, the LLM treats it as that role, even if the structural tags explicitly state otherwise.
CoT Forgery: Stealing the Model's Trust
Because the think role is highly privileged—the model implicitly trusts its own reasoning conclusions—attackers can use CoT Forgery to bypass safety guardrails. By injecting fake reasoning that mimics the model's internal style (terse syntax, specific safety-related phrasing), attackers can convince the model that it has already decided to comply with a harmful request.
In tests on frontier models, CoT Forgery increased attack success rates from near-zero to approximately 60%. This attack is particularly robust because it does not rely on persuasion; it exploits a structural vulnerability where the model believes it is acting on its own prior conclusions.
Generalizing Role Confusion to Standard Injections
This principle extends to standard prompt injections where commands are hidden in tool outputs (e.g., a webpage). The research found that simply prepending "User: " to a command in a tool block increases the model's internal "Userness" score for those tokens, making the model more likely to execute the command. The more the model perceives the injected text as belonging to the user role, the higher the attack success rate.
The Theoretical Purpose of Roles
Roles are not merely formatting tricks; they are designed to isolate competing objectives so they can be optimized independently:
- Think vs. Assistant: Separates messy exploration (reasoning) from clean communication (final answer).
- User vs. Assistant: Separates comprehension (understanding the request) from generation (producing the response).
- User vs. Tool: Separates instructions (commands to follow) from data (information to use).
Role confusion is the failure of this isolation, where competing objectives bleed together, allowing low-privilege data to be processed as high-privilege instructions.
Future Research and Security Implications
Subconscious Steering
Beyond dramatic jailbreaks, role confusion enables "subconscious steering." If role boundaries are soft, innocuous text (like an enthusiastic tone on a product page) could bleed into the model's persona, subtly shifting its recommendations without the user's knowledge. This presents a significant risk for AI agents handling e-commerce or financial decisions.
New Role Abstractions
To resolve objective conflicts, new roles may be necessary. For example, a dedicated Planning Role could treat plans as commitments rather than ephemeral tool data, or an Evaluation Role could provide the critical distance needed for honest self-correction, reducing sycophancy.
Cognitive Windows
Roles provide a unique opportunity for interpretability research. Because input-only roles (user, tool) are loss-masked during training, their activations may provide a "clean window" into the model's comprehension, unpolluted by the generation signals present in output roles (assistant, think).
Community Perspectives
Discussion among technical practitioners highlights several critical points regarding the practical application of this theory:
"LLMs in their current form provide no security boundaries or guarantees full stop. We need to be clear about this otherwise we end up with truly insecure architectures..."
Critics argue that treating roles as a "security architecture" is a misnomer, as LLMs are fundamentally functions mapping strings to strings. Some suggest that the only viable defense is to move away from a single-channel input stream entirely, perhaps by using separate embeddings for different roles to create an unspoofable signal that cannot be mimicked by writing style.
Summary
Research reveals that prompt injections succeed because LLMs rely on writing style rather than structural role tags to identify instructions, allowing attackers to spoof privileged roles like internal reasoning.