Red Teaming for AI: Synthesis
Question
What does a complete red-team practice for AI applications look like in 2026, across probe libraries, orchestration, regression suites, and continuous adversarial testing? Specifically: which tools cover which quadrants of the four-quadrant red-team grid from CMM D7 L4? What are the trust and provenance assumptions behind each? How does evaluation methodology (CLASP, ECBD, LLM-as-a-judge) tie back into the practice?
Current Position
This is the most mature of the wiki’s new scope axes — the substrate was already built before the scope expansion. The four-quadrant red-team coverage codified in CMM D7 L4 is the canonical decomposition:
- Probe libraries. garak (NVIDIA) is the canonical OSS LLM vulnerability scanner — ~18+ probe categories spanning encoding, prompt-injection, GCG, DAN, malware generation, XSS, and leak-replay. Vendor-published numbers should be cross-checked against garak outputs.
- Orchestration. PyRIT (Microsoft AI Red Team) provides multi-turn adversarial orchestration with adapters across OpenAI, Anthropic, Google, HuggingFace, and self-hosted endpoints — the de-facto OSS standard for orchestrated red-team campaigns.
- Regression suites. Promptfoo is the regression-test surface for application-layer LLM behavior — most useful as the “CI gate” for prompts and tool definitions.
- Continuous adversarial testing. Mindgard CART is the canonical SaaS for continuous red-team across deployed models; General Analysis is the agentic-AI-specific entrant.
Evaluation methodology is the harder problem. CLASP supplies a capability-centric evaluation rubric (Planning, Tool Use, Memory, Reasoning, Reflection, Perception); ECBD provides the design methodology for benchmark construction; LLM-as-a-judge is the semantic-matching approach that most evaluation toolchains converge on but that carries known failure modes (overconfidence, bias, prompt sensitivity).
The vendor surface for productized red-team-for-AI is consolidating around three incumbents: Lakera Guard for content-layer guardrails plus testing, HiddenLayer for AIDR with model scanning and adversarial robustness assessment, and Protect AI for AI-BOM, ModelScan, and the huntr bounty surface.
Supporting Evidence
- AgentDojo (NeurIPS 2024) is the canonical independent benchmark for tool-using agents — 97 tasks, 629 security cases. Independent academic benchmarks remain rare; this one matters.
- OWASP LLM Top 10 and OWASP Agentic AI Top 10 supply the vulnerability taxonomy that probe libraries map against.
- OWASP AIVSS establishes the scoring framework for AI vulnerabilities — analogous to CVSS for traditional vulnerabilities.
Counter-Evidence
Coverage of model-extraction and inversion attacks in productized tooling
Model-layer attacks (extraction, inversion, membership inference) are well-documented as concepts but undermarked in the productized testing surface. Most commercial scanners focus on prompt-injection and jailbreak; the model-layer attack class is harder to test for and is consequently under-covered.
Independent reproducibility of vendor red-team claims
Vendor-published numbers dominate; few independent reproductions exist. The AgentDojo benchmark is one of the few neutral data points.
How This Has Evolved
Seeded 2026-05-13. This synthesis page consolidates material that was previously spread across concepts/, entities/products/, and the CMM domain definitions. As an existing-content synthesis (not an ingest-driven seed), this page can be promoted to developing status quickly once cross-links are added to the constituent pages.
Open Sub-Questions
- Is
redteam-for-aia separate scope axis, or is it a sub-axis ofsec-of-aithat should be collapsed? Current judgment: keep separate, because the tooling and methodology surface is large enough to warrant its own synthesis address. - How does red-team-for-AI methodology need to evolve for agentic AI (multi-turn, multi-tool, multi-agent) vs. classical LLM testing? Some of CLASP’s extensions hint at this but the field is unsettled.
- See Gaps Index for related open questions.