AgentDojo — Independent Prompt-Injection Benchmark

A peer-reviewed, independent benchmark for prompt-injection attacks against tool-using AI agents, published at NeurIPS 2024 (arXiv:2406.13352). Unlike PyRIT, Garak, Promptfoo, and Mindgard CART, it is academic and venue-validated rather than a vendor self-evaluation.

What it is

| Property | Detail |
| --- | --- |
| Scope | 97 tasks / 629 security cases for tool-using agent prompt injection |
| Methodology | Realistic agent tasks under attack across multiple LLM targets |
| Headline finding | Best agents <25% attack success; tool-filtering defense drops ASR to 7.5% |
| Use by vendors | Meta uses AgentDojo to evaluate LlamaFirewall PromptGuard 2 (ASR 17.6% → 7.5%; combined with AlignmentCheck, 1.75%) |
| Venue | NeurIPS 2024 (peer-reviewed) |
| URL | arxiv.org/abs/2406.13352 |

Why it matters for the wiki

The wiki’s prompt-injection detection-rate citations are mostly vendor self-evaluation: Anthropic Constitutional Classifiers, Meta LlamaFirewall, Promptfoo regression numbers. AgentDojo is the cleanest third-party comparator — Meta’s own evaluation uses it, which means the same benchmark numbers appear in vendor-published evaluations and in independent papers, making cross-comparison defensible.

For the wiki’s CMM D7 L4 evidence requirement (multi-tool red-team eval), AgentDojo serves as the independent benchmark anchor that vendor self-evals are compared against. Mature D7 L4 programs should report both vendor-self-eval and AgentDojo numbers for the same defense.
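Reporting both numbers comes down to computing attack success rate (ASR) the same way over each evaluation's case count. A minimal sketch, where the helper name and all raw success counts are hypothetical (only the 629-case suite size comes from the paper):

```python
# Hypothetical sketch: compute attack success rate (ASR) from raw counts
# and report vendor self-eval vs. independent benchmark numbers side by side.

def asr(successful_attacks: int, total_attack_cases: int) -> float:
    """Attack success rate as a percentage of attack cases."""
    return 100.0 * successful_attacks / total_attack_cases

# AgentDojo's security suite has 629 cases (from the paper); the
# success counts below are made up for illustration.
reports = {
    "vendor-self-eval": asr(38, 400),  # hypothetical internal suite
    "agentdojo": asr(47, 629),         # independent benchmark run
}

for source, rate in reports.items():
    print(f"{source}: ASR {rate:.1f}%")
```

Because both figures are plain ratios over published case counts, a D7 L4 evidence table can list them in adjacent columns without methodological caveats.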

How it differs from vendor red-team tools

| Tool | Type | Self-eval bias |
| --- | --- | --- |
| PyRIT | Multi-turn orchestration framework | DIY — orgs run their own attacks |
| Garak | Probe library | NVIDIA-published probes |
| Promptfoo | Regression suite | Vendor (now part of OpenAI) |
| Mindgard CART | Continuous SaaS | Commercial vendor library |
| AgentDojo | Academic benchmark | Peer-reviewed; venue-validated |

The wiki’s CMM D7 L4 should require at least one independent benchmark (AgentDojo, InjecAgent, or WASP) alongside the four-quadrant tool coverage to count as L4 evidence.

  • InjecAgent (arXiv:2403.02691) — indirect-prompt-injection benchmark; ReAct GPT-4 vulnerable in 24% of cases
  • WASP (arXiv:2504.18575) — web-agent security benchmark for prompt injection
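The proposed rule is mechanical enough to sketch as a check. This is an illustrative sketch only; the function name and evidence-list structure are assumptions, and only the three benchmark names come from the text:

```python
# Hypothetical sketch of the proposed D7 L4 evidence rule: the cited
# tooling must include at least one independent benchmark
# (AgentDojo, InjecAgent, or WASP) to count as L4 evidence.

INDEPENDENT_BENCHMARKS = {"agentdojo", "injecagent", "wasp"}

def meets_l4_benchmark_rule(evidence_tools: list[str]) -> bool:
    """True if the evidence set cites at least one independent benchmark."""
    return bool(INDEPENDENT_BENCHMARKS & {t.lower() for t in evidence_tools})

print(meets_l4_benchmark_rule(["PyRIT", "Garak", "Promptfoo"]))         # False
print(meets_l4_benchmark_rule(["PyRIT", "Mindgard CART", "AgentDojo"]))  # True
```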

See Also