AgentDojo — Independent Prompt-Injection Benchmark

A peer-reviewed, independent benchmark for prompt-injection attacks against tool-using AI agents, published at NeurIPS 2024 (arXiv:2406.13352). Unlike PyRIT, Garak, Promptfoo, and Mindgard CART, it is academic and venue-validated rather than a vendor self-evaluation.

What it is

| Property | Detail |
| --- | --- |
| Scope | 97 tasks / 629 security cases for tool-using agent prompt injection |
| Methodology | Realistic agent tasks under attack across multiple LLM targets |
| Headline finding | Best agents <25% attack success; tool-filtering defense drops ASR to 7.5% |
| Use by vendors | Meta uses AgentDojo to evaluate LlamaFirewall PromptGuard 2 (ASR 17.6% → 7.5%; combined with AlignmentCheck, 1.75%) |
| Venue | NeurIPS 2024 (peer-reviewed) |
| URL | arxiv.org/abs/2406.13352 |

Why it matters for the wiki

The wiki’s prompt-injection detection-rate citations are mostly vendor self-evaluation: Anthropic Constitutional Classifiers, Meta LlamaFirewall, Promptfoo regression numbers. AgentDojo is the cleanest third-party comparator — Meta’s own evaluation uses it, which means the same benchmark numbers appear in vendor-published evaluations and in independent papers, making cross-comparison defensible.

For the wiki’s CMM D7 L4 evidence requirement (multi-tool red-team eval), AgentDojo serves as the independent benchmark anchor that vendor self-evals are compared against. Mature D7 L4 programs should report both vendor-self-eval and AgentDojo numbers for the same defense.
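Reporting both numbers comes down to computing attack success rate (ASR) the same way over each evaluation's case count. A minimal sketch, where the helper name and all raw success counts are hypothetical (only the 629-case suite size comes from the paper):

```python
# Hypothetical sketch: compute attack success rate (ASR) from raw counts
# and report vendor self-eval vs. independent benchmark numbers side by side.

def asr(successful_attacks: int, total_attack_cases: int) -> float:
    """Attack success rate as a percentage of attack cases."""
    return 100.0 * successful_attacks / total_attack_cases

# AgentDojo's security suite has 629 cases (from the paper); the
# success counts below are made up for illustration.
reports = {
    "vendor-self-eval": asr(38, 400),  # hypothetical internal suite
    "agentdojo": asr(47, 629),         # independent benchmark run
}

for source, rate in reports.items():
    print(f"{source}: ASR {rate:.1f}%")
```

Because both figures are plain ratios over published case counts, a D7 L4 evidence table can list them in adjacent columns without methodological caveats.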

How it differs from vendor red-team tools

| Tool | Type | Self-eval bias |
| --- | --- | --- |
| PyRIT | Multi-turn orchestration framework | DIY — orgs run their own attacks |
| Garak | Probe library | NVIDIA-published probes |
| Promptfoo | Regression suite | Vendor (now part of OpenAI) |
| Mindgard CART | Continuous SaaS | Commercial vendor library |
| AgentDojo | Academic benchmark | Peer-reviewed; venue-validated |

The wiki’s CMM D7 L4 should require at least one independent benchmark (AgentDojo, InjecAgent, or WASP) alongside the four-quadrant tool coverage to count as L4 evidence.

  • InjecAgent (arXiv:2403.02691) — indirect-prompt-injection benchmark; ReAct GPT-4 vulnerable in 24% of cases
  • WASP (arXiv:2504.18575) — web-agent security benchmark for prompt injection
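The proposed rule is mechanical enough to sketch as a check. This is an illustrative sketch only; the function name and evidence-list structure are assumptions, and only the three benchmark names come from the text:

```python
# Hypothetical sketch of the proposed D7 L4 evidence rule: the cited
# tooling must include at least one independent benchmark
# (AgentDojo, InjecAgent, or WASP) to count as L4 evidence.

INDEPENDENT_BENCHMARKS = {"agentdojo", "injecagent", "wasp"}

def meets_l4_benchmark_rule(evidence_tools: list[str]) -> bool:
    """True if the evidence set cites at least one independent benchmark."""
    return bool(INDEPENDENT_BENCHMARKS & {t.lower() for t in evidence_tools})

print(meets_l4_benchmark_rule(["PyRIT", "Garak", "Promptfoo"]))         # False
print(meets_l4_benchmark_rule(["PyRIT", "Mindgard CART", "AgentDojo"]))  # True
```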

See Also