Lethal Trifecta
Definition
Coined by Simon Willison in June 2025 (“The Lethal Trifecta for AI Agents”, simonwillison.net), the Lethal Trifecta is the minimum set of three agent capabilities that, when present together, allow an attacker to easily exfiltrate private data via prompt injection:
- Access to private data — the agent can read sensitive content (mailbox, files, internal docs, RAG store, secrets in env).
- Exposure to untrusted content — the agent ingests content from sources the attacker can influence (web pages, emails, documents, calendar invites, RAG entries, MCP tool descriptions).
- Ability to externally communicate — the agent can send data outside the trust boundary (HTTP requests, outbound mail, file writes to shared storage, markdown image rendering, URL-fetch tool calls).
When all three hold, a successful prompt injection (almost always indirect) can chain the agent’s own capabilities into an exfiltration path. The attacker doesn’t need a software exploit; the agent supplies the read-and-write primitives.
Why It Is “Lethal”
The trifecta is necessary and (in practice) sufficient for end-to-end data exfiltration via natural-language attack:
- Any two of the three capabilities are annoying-but-recoverable: two agents, each holding a different two-of-three combination, can each stay useful without creating an exfiltration path.
- All three in a single agent give an attacker an exfiltration channel that requires no privilege escalation, no code-execution exploit, and often leaves no detectable malware artifact. The compromise looks identical to normal use.
The framing is load-bearing because it gives architects a structural test: ask, “does any single agent in our system hold all three?” If yes, that agent is one indirect injection away from being a data-exfil tool.
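The structural test can be mechanized as a capability audit over agent definitions. A minimal sketch, assuming agents declare their capabilities in a registry (the capability labels and agent names here are hypothetical examples, not a standard schema):

```python
# Minimal trifecta audit over declared agent capabilities.
# Capability labels and agent definitions are illustrative, not a standard.
TRIFECTA = {"private_data", "untrusted_content", "external_comms"}

agents = {
    "research-agent":  {"untrusted_content", "external_comms"},
    "summarizer":      {"private_data", "untrusted_content"},
    "inbox-assistant": {"private_data", "untrusted_content", "external_comms"},
}

def trifecta_violations(registry):
    """Return the agents that hold all three legs of the Lethal Trifecta."""
    return [name for name, caps in registry.items() if TRIFECTA <= caps]

print(trifecta_violations(agents))  # ['inbox-assistant']
```

Running this in CI against every agent definition turns "does any single agent hold all three?" from a design-review question into a deterministic check.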
How the Trifecta Manifests in Real Attacks
| Stage | Capability used | Example |
|---|---|---|
| Trigger | Untrusted content | Hidden HTML comment in a shared Google Doc the agent is asked to summarize |
| Read | Private data access | Agent has access to user’s mailbox or file system |
| Send | External communication | Agent renders a markdown image with ?data=… query string, or POSTs to attacker URL |
The Jules AI kill chain and most of Johann Rehberger’s “Month of AI Bugs” disclosures are concrete instantiations of the trifecta.
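The Send stage in the table above is the leg most amenable to mechanical detection: a markdown image pointing at a non-allowlisted host is a classic exfiltration channel. A minimal sketch of scrubbing it from agent output; the regex, the allowlist, and the replacement text are illustrative, not any vendor's implementation:

```python
import re
from urllib.parse import urlparse

# Illustrative allowlist -- a real deployment would load this from config.
ALLOWED_IMAGE_HOSTS = {"cdn.example-internal.com"}

# Matches markdown images: ![alt](url ...) -- group 1 captures the URL.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")

def scrub_markdown_images(text):
    """Drop markdown images whose host is not allowlisted --
    closing the ?data=... query-string exfiltration channel."""
    def keep_or_drop(match):
        host = urlparse(match.group(1)).netloc
        return match.group(0) if host in ALLOWED_IMAGE_HOSTS else "[image removed]"
    return MD_IMAGE.sub(keep_or_drop, text)

out = scrub_markdown_images(
    "Summary done. ![x](https://attacker.example/p.png?data=SECRET)"
)
print(out)  # Summary done. [image removed]
```

This only closes one rendering-based channel; an agent with tool-level HTTP access needs the network-layer controls discussed below.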
Containment Strategies
Break the trifecta. Remove at least one of the three capabilities from any agent that handles untrusted content. Practical patterns:
- Split agents along the trifecta axes. A “research agent” has untrusted-content + external-comms but no private-data access. A “personal-assistant agent” has private-data + external-comms but no untrusted-content ingest. A “summarizer” has private-data + untrusted-content but no external-comms.
- Remove external communication from sensitive agents. If an agent must touch private data and untrusted content, it cannot also speak to the network. All output must go through a human-reviewed surface.
- Treat retrieved content as data, not instructions. Use system prompt architecture with explicit trust labels. This does not break the trifecta on its own (a determined injection can still succeed) but reduces the success rate.
- Egress filtering. Domain allowlists at the network layer make the “external communication” leg detectable and constrainable.
- Capability-level audit. Every agent definition should declare which legs of the trifecta it holds. The audit asks: is this combination justified?
- Capability-based authorization at the action layer. Even when an agent must hold all three legs, the actions it can take with them can be deterministically constrained per task. Capability-based authorization (e.g. Tenuo Warrants from [[capability-based-authorization-niyikiza-talk|Niyikiza, [un]prompted March 2026]]) issues task-scoped, holder-bound, delegation-aware capabilities, and sub-agent capabilities can only narrow (monotonic attenuation). This contains an exfiltration-oriented prompt injection at execution time without removing any leg of the trifecta from the agent's role: the agent may ingest untrusted content, hold private-data access, and reach the network, but only the specific action set the warrant permits is executable. Niyikiza reports a 90%→0% reduction in multi-agent attack success rate (ASR) on Tenuo's custom harness.
- Layered structural defenses ("Architecting the Fortress"). Nicolas Lidzborski's [[securing-workspace-genai-at-google-lidzborski-talk|Google Workspace talk at [un]prompted March 2026]] presents a four-layer blueprint: (1) low-risk input: strip hidden content, abuse-signal-aware ingestion, and data-provenance tracking; (2) prompt delimitation via sentinel tokens and adversarial fine-tuning; (3) deterministic orchestration with a state-aware finite-state machine (FSM) that constrains downstream capabilities by data origin; (4) output sanitization, including markdown scrubbing, dynamic URL classification, and removal of ungrounded LLM-hallucinated URLs. These layers are combined with Plan-Validate-Execute for high-stakes irreversible actions. Worked example: the Nassi et al. "Invitation Is All You Need" attack (a calendar invite as a zero-click hijack vector for Gemini), extended in Lidzborski's deployment to smart-home control (lights, curtains, heater), is a real-world demonstration of trifecta exploitation when the action surface is broader than recognized.
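Monotonic attenuation, the property that delegated capabilities can only narrow, can be sketched independently of any particular product. This is an illustrative model of the idea, not Tenuo's Warrant API; the field names and prefix semantics are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    """Task-scoped capability: which actions, on which resource prefixes."""
    actions: frozenset
    resource_prefixes: frozenset

    def attenuate(self, actions, resource_prefixes):
        """Delegation may only narrow. A compromised sub-agent therefore
        cannot widen its action set or resource scope."""
        actions = frozenset(actions)
        prefixes = frozenset(resource_prefixes)
        if not actions <= self.actions:
            raise PermissionError("delegation may only attenuate actions")
        if not all(any(p.startswith(parent) for parent in self.resource_prefixes)
                   for p in prefixes):
            raise PermissionError("delegation may only attenuate resources")
        return Capability(actions, prefixes)

# Root capability for a task; a sub-agent gets a strictly narrower one.
root = Capability(frozenset({"read", "send"}), frozenset({"mail/", "docs/"}))
child = root.attenuate({"read"}, {"docs/quarterly/"})
```

An attempt to widen (e.g. `child.attenuate({"send"}, ...)`) raises rather than silently granting, which is the execution-time containment property the bullet above describes.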
Breaking the trifecta is also the architectural premise of Stripe's containment architecture, presented by Andrew Bullen at [[unprompted-conference-march-2026|[un]prompted, March 2026]]; see Breaking the Lethal Trifecta (Without Ruining Your Agents) for the full worked example. Stripe's argument: of the three legs, only egress is feasible to remove in a real enterprise (most agents need private data, and untrusted content is structurally hard to filter without losing utility). The same talk introduces the Lethal Bifecta as a write-side analogue. Niyikiza's capability-warrants approach is complementary: when egress cannot be fully removed, action-layer capability attenuation contains the blast radius even when the agent reaches the network.
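Because egress is the leg most feasible to remove, it is also the one enforceable entirely outside the model. A minimal sketch of a deterministic network-layer gate for an agent's HTTP tool; the allowlisted domains are illustrative, not Stripe's actual configuration:

```python
from urllib.parse import urlparse

# Illustrative per-deployment allowlist; a real one lives at the proxy layer.
EGRESS_ALLOWLIST = {"api.internal.example.com", "docs.internal.example.com"}

def egress_allowed(url):
    """Deterministic gate: the agent's HTTP tool fires only when the
    destination host is explicitly allowlisted. No model judgment involved."""
    host = urlparse(url).hostname or ""
    return host in EGRESS_ALLOWLIST

print(egress_allowed("https://api.internal.example.com/v1/ok"))   # True
print(egress_allowed("https://attacker.example/exfil?d=SECRET"))  # False
```

Because the check is deterministic and sits below the model, a successful prompt injection cannot talk its way past it; it can only attempt destinations that are already trusted.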
Relationship to OWASP Frameworks
- LLM01 Prompt Injection is the attack vector; the Lethal Trifecta is the structural condition that makes it lethal.
- ASI01 Agent Goal Hijack and ASI05 Sensitive Data Disclosure are the OWASP labels for the outcome when the trifecta is exploited.
- Least Agency Principle can be read as the Lethal Trifecta’s positive form: strip every leg you can.
Distinguishing It From Adjacent Concepts
- Lethal Trifecta ≠ “agent can do bad things.” The trifecta is specifically the configuration that makes silent exfiltration trivial. Other harms (cost explosion, data destruction, hallucination cascades) are real but mechanistically different.
- Lethal Trifecta is a Confidentiality + Integrity threat model. The trifecta describes silent exfiltration (C) and the Bifecta describes unintended action (I via the write-side). Availability harms — runaway agents, recursive loops, resource exhaustion — sit outside the trifecta entirely and have their own threat surface; see Agent Availability Threats. The MAAIS CIAA augmentation makes the case for treating Availability and Accountability as co-equal axes alongside C + I.
- Lethal Trifecta is structural, not behavioral. A trifecta agent is a problem before any user interaction occurs. Defense begins at design time.
- The trifecta does not assume the model is misaligned or compromised. A perfectly aligned model with all three capabilities is still exposed because the attacker is the source of misalignment, via injected content.
On "unconditionally vulnerable"
A serious skeptic will push back on the unconditional framing. Stripe (Bullen, March 2026) runs trifecta agents in production with platform-level egress containment plus sensitive-action HITL and reports 1.5–6.7% attack success rates across model generations. CaMeL (Google DeepMind) and deterministic-gating research demonstrate further reductions without splitting the trifecta. The honest framing: the trifecta is necessary for natural-language exfiltration at scale, and, given current defense maturity, sufficient to make platform-layer containment mandatory. In production, containment drives ASR very low but not to zero. Bullen's "even 0.1% is too high" is the operative bar: the threshold, not any unconditional vulnerability, is what makes the trifecta a design-time test. See Wiki Novelty and Counter-Arguments §Thesis 3.
See Also
- Indirect Prompt Injection — the dominant attack vector against trifecta agents
- Tool-Abuse Chains — what happens when external communication is via tool calls rather than text rendering
- Prompt Injection Containment for Agentic Systems — runtime containment when the trifecta cannot be broken at design time
- Least Agency Principle — the autonomy-governance principle that complements trifecta-splitting
- Agentic AI Threat Classes — 2026 Expansion — the broader threat model that contains the Lethal Trifecta as one structural test among five threat classes (insider, APT campaign, collusion, model-version regression, jurisdictional adversaries)