AgentDojo — Independent Prompt-Injection Benchmark
A peer-reviewed, independent benchmark for prompt injection against tool-using AI agents, published at NeurIPS 2024 (arXiv:2406.13352). Unlike PyRIT, Garak, Promptfoo, and Mindgard CART, it is academic and venue-validated rather than vendor self-evaluation.
What it is
| Property | Detail |
|---|---|
| Scope | 97 realistic agent tasks and 629 security test cases for prompt injection against tool-using agents |
| Methodology | Realistic agent tasks under attack across multiple LLM targets |
| Headline finding | Attacks succeed against the best agents in <25% of cases; a tool-filtering defense drops attack success rate (ASR) to 7.5% (computation sketched after this table) |
| Use by vendors | Meta uses AgentDojo to evaluate LlamaFirewall PromptGuard 2 (ASR 17.6% → 7.5%; combined with AlignmentCheck 1.75%) |
| Venue | NeurIPS 2024 (peer-reviewed) |
| URL | arxiv.org/abs/2406.13352 |
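The headline figures are attack success rates (ASR): the fraction of injection test cases in which the attacker's injected goal is achieved. Below is a minimal sketch of that computation, assuming a hypothetical per-case result record; it is not AgentDojo's actual output schema.

```python
from dataclasses import dataclass

@dataclass
class InjectionCase:
    """One security test case: a user task paired with an injected attacker goal.
    Hypothetical record layout for illustration only."""
    task_id: str
    attack_succeeded: bool   # attacker's injected goal was carried out
    utility_preserved: bool  # agent still completed the legitimate user task

def attack_success_rate(cases: list[InjectionCase]) -> float:
    """ASR = successful injections / total injection test cases."""
    if not cases:
        return 0.0
    return sum(c.attack_succeeded for c in cases) / len(cases)

# 47 successes out of the benchmark's 629 security cases rounds to the
# 7.5% ASR quoted for the tool-filtering defense above.
cases = [InjectionCase(f"case-{i}", attack_succeeded=(i < 47), utility_preserved=True)
         for i in range(629)]
print(f"ASR: {attack_success_rate(cases):.1%}")  # ASR: 7.5%
```

AgentDojo also tracks utility (whether the user task still succeeds under attack), which is why defenses are judged on both ASR reduction and preserved task completion.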
Why it matters for the wiki
The wiki’s prompt-injection detection-rate citations are mostly vendor self-evaluation: Anthropic Constitutional Classifiers, Meta LlamaFirewall, Promptfoo regression numbers. AgentDojo is the cleanest third-party comparator — Meta’s own evaluation uses it, which means the same benchmark numbers appear in vendor-published evaluations and in independent papers, making cross-comparison defensible.
For the wiki’s CMM D7 L4 evidence requirement (multi-tool red-team eval), AgentDojo serves as the independent benchmark anchor that vendor self-evals are compared against. Mature D7 L4 programs should report both vendor-self-eval and AgentDojo numbers for the same defense.
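A minimal sketch of what that dual reporting could look like as a machine-checkable evidence entry. The field names and the divergence tolerance are illustrative assumptions, not an established wiki schema:

```python
def check_dual_report(entry: dict, tolerance: float = 0.02) -> bool:
    """True if the entry reports both a vendor self-eval ASR and an
    independent AgentDojo ASR, and the two agree within `tolerance`."""
    vendor = entry.get("vendor_asr")
    independent = entry.get("agentdojo_asr")
    if vendor is None or independent is None:
        return False  # one leg of the comparison missing: not dual-reported
    return abs(vendor - independent) <= tolerance

entry = {
    "defense": "LlamaFirewall PromptGuard 2",
    "vendor_asr": 0.075,     # Meta-published figure (itself run on AgentDojo)
    "agentdojo_asr": 0.075,  # independently reproduced figure
}
print(check_dual_report(entry))  # True
```

The tolerance check is the point: when a vendor number and the independent reproduction drift apart, the entry should fail review rather than silently cite the better figure.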
How it differs from vendor red-team tools
| Tool | Type | Self-eval bias |
|---|---|---|
| PyRIT | Multi-turn orchestration framework | DIY — orgs run their own attacks |
| Garak | Probe library | NVIDIA-published probes |
| Promptfoo | Regression suite | Vendor-published regression numbers |
| Mindgard CART | Continuous SaaS | Commercial vendor library |
| AgentDojo | Academic benchmark | None (peer-reviewed, venue-validated) |
The wiki’s CMM D7 L4 should require at least one independent benchmark (AgentDojo, InjecAgent, or WASP) alongside the four-quadrant tool coverage to count as L4 evidence; a sketch of that gate follows.
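A sketch of that gate as a check, assuming the four quadrants map onto the four vendor tools in the table above; that mapping and the set literals are assumptions for illustration, not a published CMM definition:

```python
INDEPENDENT_BENCHMARKS = {"AgentDojo", "InjecAgent", "WASP"}
FOUR_QUADRANT_TOOLS = {"PyRIT", "Garak", "Promptfoo", "Mindgard CART"}

def meets_d7_l4(tools_used: set[str], benchmarks_reported: set[str]) -> bool:
    """L4 gate: full four-quadrant tool coverage AND at least one
    independent benchmark reported for the same defenses."""
    has_quadrants = FOUR_QUADRANT_TOOLS <= tools_used
    has_independent = bool(INDEPENDENT_BENCHMARKS & benchmarks_reported)
    return has_quadrants and has_independent

print(meets_d7_l4(FOUR_QUADRANT_TOOLS, {"AgentDojo"}))  # True
print(meets_d7_l4({"PyRIT", "Garak"}, {"AgentDojo"}))   # False: coverage gap
```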
Related benchmarks
- InjecAgent (arXiv:2403.02691) — indirect-prompt-injection benchmark; ReAct GPT-4 vulnerable in 24% of cases
- WASP (arXiv:2504.18575) — web-agent security benchmark for prompt injection
See Also
- Source Triangulation Audit 2026-05-02 — Claim 5
- PyRIT · Garak · Promptfoo · Mindgard CART — vendor red-team toolchain (AgentDojo is their independent comparator)
- Agentic AI Security CMM 2026 D7 L4