Breaking the Lethal Trifecta (Without Ruining Your Agents)

[un]prompted Conference, March 4, 2026 — Stage 1, 11:20. Andrew Bullen, Head of AI Security at Stripe. Subtitle: Architectural Defenses against Prompt Injection.

This page combines two source artifacts: the slides (13 frames; data + diagrams + the actual ToolAnnotations API) and the speaker transcript (the war stories, caveats, and Q&A that don’t appear in the deck). Where the two sources tell different parts of the story, each claim is tagged with the input it comes from.

TL;DR

  • The model layer will not save you — even Claude 3.7 Sonnet:Thinking sees a 1.5% attack-success rate in the published competition (slide 3). For security, “1% is too high.”
  • Treat prompt injection as inevitable. Containment is two architectural rules:
    • Guardrail 1 (egress): break the Lethal Trifecta by removing the External-Communication leg in any agent that touches private data + untrusted content.
    • Guardrail 2 (sensitive writes): for the write-side analogue (untrusted-content + sensitive-action — Bullen’s “Lethal Bifecta”), require human review of sensitive actions.
  • Both guardrails kill agent UX unless you do compensating UX work: safe search, SaaS-MCP proxy, queued/batched confirmations, optimistic writes with reverts.
  • Both guardrails decay unless you do compensating enforcement work: tag agentic services + CI-time egress checks (Stripe uses Smokescreen); tool annotations evaluated automatically (Stripe’s Toolshed central MCP); proxying connections out of “deep” agent sandboxes — work-in-progress.

The threat baseline (from slides only)

Slide 3 cites “Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition” (arXiv) — attack-success rate on prompt-injection challenges, by model:

Model | ASR | Provider
Llama 3.3 70B | 6.7% | Meta
Pixtral Large | 6.1% | Mistral
Llama 3.1 405B | 5.9% | Meta
o3 Mini:High | 4.9% | OpenAI
Nova Lite v1 | 4.5% | Amazon
Nova Micro v1 | 4.4% | Amazon
Grok 2 | 4.4% | xAI
o3 Mini | 4.3% | OpenAI
Nova Pro v1 | 3.9% | Amazon
Command-R | 3.8% | Cohere
o3 | 2.9% | OpenAI
GPT 4.5 | 2.7% | OpenAI
o1 | 2.7% | OpenAI
GPT 4o | 2.5% | OpenAI
3.5 Haiku | 2.4% | Anthropic
3.5 Sonnet | 1.9% | Anthropic
3.7 Sonnet | 1.7% | Anthropic
3.7 Sonnet:Thinking | 1.5% | Anthropic

Slide 4 grounds the abstract numbers in four incidents (headlines verbatim from the deck): CVE-2025-62453, the Claude→Stripe coupons incident, the Cursor npm compromise (3,200+), and the Slack AI exfiltration. Each now has a wiki incident page (ingested 2026-05-03 from primary sources).

Bullen’s framing (transcript): “even a 0.1% failure chance on attack is not enough … we need something better than just relying on the models.” He believes prevalence in consumer settings will get worse before it gets better.

The two guardrails

Guardrail 1: Prevent egress

The Lethal Trifecta is Private Data + Untrusted Content + External Communication. Bullen’s argument (transcript) is feasibility:

Leg by leg, can Stripe remove it across the agent estate?

  • Untrusted Content: No. “A big part of why LLMs are useful is that you can throw a bunch of stuff at them”; attackers are creative about smuggling injection into content surfaces.
  • Private Data: No. “At Stripe we have a lot — support tickets, emails, etc. Many agents need it.”
  • External Communication (egress): Yes. The only feasible leg to remove.

So Stripe’s first guardrail collapses to: break egress for any agent that has private data + untrusted content. Two surfaces to control:

  1. General web requests (the obvious case).
  2. Third-party SaaS connections — “a very easy way to exfiltrate data would be writing it to a public Google Doc with a Google Doc tool.” Less obvious, but just as effective. (A decision-rule sketch follows the list.)
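A minimal sketch of the resulting decision rule, in Ruby to match the deck's example; the names (AgentCapabilities, enforce_guardrail_1!) are illustrative, not Stripe's actual API:

# Illustrative only: capability flags and names are assumptions, not Stripe code.
AgentCapabilities = Struct.new(:private_data, :untrusted_content, :egress, keyword_init: true)

# Guardrail 1: an agent that combines private data with untrusted content
# must have every egress surface (web requests, SaaS writes) removed.
def lethal_trifecta?(caps)
  caps.private_data && caps.untrusted_content && caps.egress
end

def enforce_guardrail_1!(caps)
  raise "lethal trifecta: remove egress from this agent" if lethal_trifecta?(caps)
end

# A support-ticket summarizer reads private tickets and untrusted customer text,
# so it passes the check only because egress has been removed.
enforce_guardrail_1!(AgentCapabilities.new(private_data: true, untrusted_content: true, egress: false))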

Guardrail 2: Human review of sensitive actions (the “Lethal Bifecta”)

Slide 8 / transcript: data exfil isn’t the only outcome of prompt injection. Bad writes need only two ingredients: untrusted content + the agent’s ability to take a sensitive action. Bullen calls this the Lethal Bifecta, his own coinage on the same naming pattern as Willison’s trifecta, addressing the write side rather than the read/exfil side.

Same feasibility argument: untrusted content can’t be removed, so you control the other leg — route every sensitive action through human review. “Sensitive is very load-bearing here. Generally the rule of thumb: anything that is a production write or broad communication or sending a message.”

The compensating UX work (without which the guardrails are unadoptable)

Bullen’s framing (transcript, repeated in his concluding remark): “Step one threat model, step two mitigation, step three figure out how to put in place the mitigation in a way that the business still can achieve its goal.” The talk is mostly step three. Slide 9 shows the failure mode he’s trying to avoid — the Claude Code permission dialog interrupting rm -r foo over and over, until users just enable --dangerously-skip-permissions.

Egress-side UX

  • ”We need public Internet data” → Safe Search — built on OpenAI Web Search with external_web_access: false, so the agent can request a search and receive cached results without itself triggering egress. Bullen’s honest caveat (transcript only): “That’s not really making anything any safer. … you prevent the data from being egressed by your agent, and instead it just gets egressed by OpenAI.” The trust is shifted, not eliminated.
  • ”I want to use this SaaS MCP” → Proxy third-party MCPs through Toolshed, Stripe’s central MCP proxy. Rules can be written at the proxy: e.g. don’t allow connections to non-Stripe tenants when writing to Google Docs / Figma / etc. (a sketch follows the list). Side benefit: users connect once to one MCP server, not N.
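Toolshed’s internals aren’t public, so the sketch below only illustrates the shape of a proxy-side tenant rule; every name in it is an assumption:

# Hypothetical proxy-side rule: block writes to tenants the org doesn't control,
# closing the "exfiltrate via public Google Doc" channel.
ALLOWED_TENANT_DOMAINS = ["stripe.com"].freeze

def allowed_write_target?(mcp_server, target_tenant)
  case mcp_server
  when "google-docs", "figma"
    ALLOWED_TENANT_DOMAINS.include?(target_tenant)
  else
    true
  end
end

puts allowed_write_target?("google-docs", "stripe.com")  # => true
puts allowed_write_target?("google-docs", "example.com") # => false (public doc = exfil channel)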

Sensitive-write-side UX

Slide 11 enumerates three pain points; each has a concrete countermeasure:

  • Confirmations interrupt the agent → Queue and batch non-blocking confirmations — let the agent keep working, surface them later. Slide 11 shows a UI mockup of a “Pending Actions (3 queued)” approve/reject panel. (Slide labels it: “this is a mockup btw.”) A queue sketch follows the list.
  • Review fatigue → Optimistic writes with reverts — execute reversible writes immediately and offer a revert; keeps the agent moving.
  • Rubber-stamping → LLM-as-second-reviewer (transcript only — not on the slides) — when policy gets sophisticated, an automated reviewer can quickly tell the agent “you’re trying to do something bad” before the human gets pinged.
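The queue mechanics aren’t specified beyond the mockup; here is a minimal in-process sketch of queued, non-blocking confirmations (all names hypothetical):

require "securerandom"

class PendingActions
  Action = Struct.new(:id, :description, :execute, keyword_init: true)

  def initialize
    @queue = []
  end

  # The agent defers a sensitive action and keeps working instead of blocking.
  def defer(description, &block)
    @queue << Action.new(id: SecureRandom.uuid, description: description, execute: block)
    nil
  end

  # Later, a human reviews the batch in one pass ("Pending Actions (3 queued)").
  def review!(approved_ids)
    @queue.each { |action| action.execute.call if approved_ids.include?(action.id) }
    @queue.clear
  end

  def summary
    @queue.map { |action| "#{action.id[0, 8]}: #{action.description}" }
  end
end

pending = PendingActions.new
pending.defer("Send refund confirmation to customer 42") { puts "email sent" }
puts pending.summary  # one queued action awaiting human review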

Enforcement (slide 12 + transcript)

Bullen treats enforcement as the third design problem. Two surfaces:

Egress enforcement

Stripe’s strong egress program predates the AI agent era. The mechanism:

  1. Tag every service that’s an agent. Easy in practice — to be an agent, the service has to talk to a foundation model, and Stripe routes those through a known proxy. (Transcript: “You could use whether your service talks to Bedrock or whatever as a way of doing this.”) This is a generalizable pattern beyond Stripe.
  2. CI-time check — if a service is tagged as an agent, egress can’t be configured without an escalated review process. (A sketch follows.)
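Stripe’s actual tagging schema isn’t in the talk; a sketch of what the CI-time gate might look like, with the YAML layout assumed:

require "yaml"

# Fails CI when an agent-tagged service also declares egress destinations.
def ci_egress_check!(service_config_path)
  config = YAML.safe_load(File.read(service_config_path))
  is_agent   = config.fetch("tags", []).include?("agent")  # tagged e.g. because it talks to the model proxy
  has_egress = !config.fetch("egress_allowlist", []).empty?
  abort "FAIL: agent-tagged service declares egress - escalated review required" if is_agent && has_egress
  puts "OK: #{service_config_path}"
end

# ci_egress_check!("services/ticket-summarizer/config.yaml")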

The network-side egress proxy Stripe leans on is Smokescreen, its open-source implementation.

Sensitive-write enforcement: tool annotations

Slide 12 shows the actual API. Inline Ruby example from the deck:

class SampleGoogleTool
  def self.annotations
    ToolAnnotations.new(
      production_impacting_write: false,
      data_sensitivity: :medium,
      broadcasts_data_internally: false,
    )
  end
end

Two policy surfaces:

  • Tools authored inline in agent frameworks carry the annotations directly.
  • Tools exposed via Stripe’s central MCP service (Toolshed) carry the same annotations at the registration boundary.

The framework reads annotations and decides whether human review applies. Centralizing it gives one UX surface to improve over time.
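The deck shows the annotation fields but not the policy evaluation; below is a sketch of how a framework might apply Bullen’s rule of thumb to them (the decision function is an assumption, only the field names come from the slide):

ToolAnnotations = Struct.new(
  :production_impacting_write, :data_sensitivity, :broadcasts_data_internally,
  keyword_init: true
)

# Rule of thumb from the talk: production writes and broad communication are
# sensitive, so they get routed through human review.
def requires_human_review?(annotations)
  annotations.production_impacting_write ||
    annotations.broadcasts_data_internally ||
    annotations.data_sensitivity == :high
end

sample = ToolAnnotations.new(
  production_impacting_write: false,
  data_sensitivity: :medium,
  broadcasts_data_internally: false,
)
puts requires_human_review?(sample)  # => false: no review gate for this tool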

The hard problem (work-in-progress, transcript only)

“Increasingly, agents don’t need special tools — they’ll just write their own code and hit random APIs on your existing services. So what do we do here? This is, like, work that is in progress right now, but the approach we’re looking at is essentially proxying the connections coming out of agents, out of their sandboxes, and then using that as a choke point where you can similarly have annotations on the API endpoints that the agents are talking to.”

This is the unsolved piece on the slide too — slide 12 ends with the question “What about Claude Code style ‘Deep’ Agents?” but no answer. Important caveat for any downstream wiki claim: Stripe’s tool-annotation enforcement, as of [un]prompted March 2026, does NOT yet cover deep / code-writing agents that bypass declared tools.
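Since the approach is explicitly work-in-progress, the sketch below is speculative: it only shows how endpoint-keyed annotations could gate a sandbox-egress proxy, with every name invented:

# Hypothetical choke point: annotations keyed by API endpoint, not declared tool.
ENDPOINT_ANNOTATIONS = {
  "POST /v1/refunds" => { production_impacting_write: true },
  "GET /v1/charges"  => { production_impacting_write: false },
}.freeze

def proxy_agent_request(method, path)
  annotations = ENDPOINT_ANNOTATIONS["#{method} #{path}"]
  return :deny_unknown_endpoint if annotations.nil?         # unannotated endpoints default-deny
  return :queue_for_human_review if annotations[:production_impacting_write]
  :allow
end

puts proxy_agent_request("POST", "/v1/refunds")  # => queue_for_human_review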

Q&A (transcript only)

Two questions worth quoting verbatim

Q1 (group/team aggregation of human-in-the-loop reviews): “We’re early in the UX experimentation process … in general, trying to find ways to make it so that people need to stop less, fewer checks, fewer reviews need to be made, while ensuring that human judgment is applied at the right time is going to be sort of the North Star.”

Q2 (still doing threat modeling for prompt injection?): “100% there’s a place for detective and other types of controls that aren’t guarantees. Especially for customer-facing products. But ultimately, because we’re not at the point where we can fully trust those, we really want to lean on these more deterministic, architectural controls.”

The Q&A confirms a methodological hierarchy Stripe applies: deterministic architectural controls dominate; behavioral / detective controls are supplementary, especially for surfaces (consumer-facing) where the architectural lever is weaker.

Where this lands in the wiki

Wiki artifact → update from this talk:

  • Lethal Trifecta: existing concept page; this talk supplies the canonical practitioner worked example. The concept page already pointed at “the Stripe Lethal Trifecta containment architecture (pending elevation)” — this page partially fulfills that promise.
  • Lethal Bifecta: NEW concept, Bullen’s coining. Pairs with the trifecta on the write side.
  • Prompt Injection Containment for Agentic Systems: three new concrete patterns: agent-tag + CI egress check, SaaS-MCP proxying, ToolAnnotations declarative schema.
  • MCP Security: ”Toolshed” central MCP is a vendor-named instance of the proxy-MCP-for-policy pattern.
  • Oversight Layer (PDP + PEP for Agentic AI): the Toolshed proxy + tool annotations are a concrete PDP/PEP implementation (PDP = annotation policy; PEP = MCP proxy).
  • Agentic AI Security Reference Architecture (2026): direct evidence for the Egress plane (Smokescreen + agent-tag CI), Control plane (annotations), and Identity plane (per-tenant rules at Toolshed).
  • Agentic AI Security Capability Maturity Model — A 2026 Practical Proposal: L3/L4 evidence for D3 (decision rights via annotations), D4 (ToolAnnotations review policy), D5 (egress via tagged services + CI).
  • Stripe: promotes the entity from stub to real page (Toolshed, Smokescreen, Andrew Bullen, the containment architecture).
  • Toolshed (Stripe): NEW product stub.
  • Smokescreen (Stripe): NEW product stub.
  • Andrew Bullen: NEW people stub.
  • [[unprompted-conference-march-2026|[un]prompted Conference — AI Security Practitioner Conference (March 3–4, 2026)]]
  • [[unprompted-march-2026-talks-vs-ra-cmm|[un]prompted March 2026 Talks — Relevance to RA + CMM]]

Slides-only vs transcript-only — what each input added

This was the wiki’s first ingest built from paired slides + transcript. The split below is so future readers can see what each input contributed (and so we know what we’d have lost by ingesting only one).

Slides only:

  • Hard ASR numbers per model (1.5–6.7%) and the exact arXiv title.
  • The four named CVE / incident headlines (CVE-2025-62453; Claude→Stripe coupons; Cursor npm 3,200+; Slack AI exfil).
  • The actual ToolAnnotations Ruby API with field names (production_impacting_write, data_sensitivity, broadcasts_data_internally).
  • The “Pending Actions” UI mockup design.
  • The trifecta + bifecta diagrams as pictures.

Transcript only:

  • Names of the implementations: Toolshed (central MCP) and Smokescreen (egress proxy).
  • The honest caveat that Safe Search via OpenAI shifts trust, doesn’t eliminate it.
  • The “Lethal Bifecta” coinage (slide just says “Bad Writes are even simpler…”).
  • LLM-as-second-reviewer as a future direction.
  • Q&A (group-level review aggregation, threat-modeling-vs-architectural-controls hierarchy).
  • The “deep agents” work-in-progress: proxying connections out of agent sandboxes as the next chokepoint.
  • The agent-tag heuristic: “to be an agent, you need to talk to foundation models” — generalizable to any org with a model proxy.

If we’d ingested only the slides we’d have the architecture but not the names (Toolshed/Smokescreen) or the limits (deep agents). If we’d ingested only the transcript we’d have the names and limits but no exact API field names, no model ASR numbers, and no incident citations.

See also