Agentic AI Threat Classes — 2026 Expansion

The wiki’s existing threat coverage — OWASP Agentic AI Top 10 (ASI), MITRE ATLAS, CSA MAESTRO, the Lethal Trifecta — is well-developed for external prompt injection, supply-chain compromise, and MCP-server attacks. A May 2026 peer-review readiness audit (peer-review-readiness-2026-05-02 §5) flagged five threat classes a serious reviewer would surface immediately. Each is documented below with authoritative sources, named real-world incidents (where any exist), defensive controls, and its place in the six-plane RA + 9-domain CMM.

Single highest-leverage control

Three of five classes (insider, APT campaign, model-version regression) collapse to the same observable signal: a delta against a trusted baseline produced by a customer-owned, version-pinned, continuously-executed eval harness over every artifact (weights, prompts, RAG documents, tool definitions) with cryptographic provenance. This is the AI-BOM + always-on customer eval pattern. Class 3 (collusion) partially overlaps via output canonicalization and monitor isolation. Class 5 (jurisdictional) is the outlier — it requires governance-layer controls (vendor abstraction, jurisdiction tagging, contract resilience), not technical artifact controls.
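
A minimal sketch of the provenance-delta half of that pattern, assuming SHA-256 fingerprints recorded in the AI-BOM at pin time; the `Artifact` type, `BASELINE` store, and `audit_artifacts` function are illustrative names, not an established API:

```python
# Hedged sketch: baseline-delta auditing over AI-BOM artifacts.
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    kind: str      # "weights" | "prompt" | "rag_doc" | "tool_def"
    name: str
    content: bytes

def digest(artifact: Artifact) -> str:
    """Cryptographic fingerprint recorded in the AI-BOM at pin time."""
    return hashlib.sha256(artifact.content).hexdigest()

# Trusted baseline: written once by a dual-controlled release job and
# stored with a signature the runtime verifies out-of-band.
BASELINE: dict[str, str] = {
    "prompt:triage-agent-system": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def audit_artifacts(artifacts: list[Artifact]) -> list[str]:
    """Return the names of artifacts whose hash deviates from baseline.

    Any delta (insider edit, Class 1; slow RAG poisoning, Class 2;
    silently swapped model version, Class 4) surfaces here.
    """
    alerts = []
    for a in artifacts:
        key = f"{a.kind}:{a.name}"
        if BASELINE.get(key) != digest(a):
            alerts.append(key)
    return alerts

if __name__ == "__main__":
    drifted = audit_artifacts([Artifact("prompt", "triage-agent-system", b"You are...")])
    print(json.dumps({"drift_alerts": drifted}))
```

The eval-harness half (continuous execution of a versioned eval suite against the pinned artifacts) sits on top of this: the hash delta tells you *that* something changed, the eval delta tells you *whether it matters*.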

Why this page exists

A peer reviewer testing the wiki’s RA + CMM will ask: “Your threat coverage is excellent for indirect prompt injection and tool-poisoning. What about the classes I worry about as a CISO — my own engineers, slow APTs, two agents colluding behind my back, the next minor-version bump, and Beijing or Brussels?” The Lethal Trifecta is a sharp structural test for one specific exfil pattern; it is not a comprehensive threat model. This page names the five gaps explicitly, anchors each in published material from Anthropic, NIST, RAND, CSET Georgetown, Apollo Research, the UK AI Safety Institute, CrowdStrike, and Microsoft, and threads them back into the existing architecture.

Class 1 — AI-aware insider threat

The privileged insider with model-platform access. Not every insider is a developer with git push; in agentic environments the high-leverage roles are model engineers, MLOps and platform admins, prompt librarians, RAG curators, fine-tuning operators, and the eval team itself. Each of these can corrupt the system in ways that look identical to legitimate maintenance.

Authoritative sources. NIST AI 100-2e2025 (adversarial ML taxonomy) formalizes targeted and availability poisoning by actors with training-data access. RAND RR-A2849-1 (Securing AI Model Weights, 2024) identifies 38 attack vectors across nine categories and names insider-threat programs as a required control for frontier labs. NIST CAISI’s January 2026 RFI on Securing AI Agent Systems explicitly brings “models that have been subjected to poisoning” into scope for agent security. Vendor synthesis: Proofpoint’s AI is Becoming the Next Insider Threat (2026) reports that 32% of organizations now flag unsupervised data access by agents as a critical insider-risk threat.

Concrete attack scenarios.

  • Malicious fine-tuning operator inserts a backdoor trigger into a domain-tuned model.
  • RAG curator silently swaps high-trust reference documents for poisoned variants.
  • Eval-team insider weakens an evaluation harness so unsafe regressions ship undetected.
  • Prompt librarian inserts a covert “jailbreak-on-keyword” pattern into the canonical system-prompt repo.
  • “Asking AI” exfiltration: a low-skill insider exfiltrates structured data via summarization, with no scripting required (Proofpoint).

Real-world incidents. No publicly attributed insider model-poisoning incident at a frontier lab as of May 2026. RAND notes that most of its 38 vectors “have already been used” against non-AI targets, but AI-specific cases remain largely unattributed in public reporting.

Defensive controls named in the literature. Insider-threat programs (RAND playbook); least-privilege MLOps roles; signed prompt artifacts; dual control on fine-tune jobs; eval-harness integrity attestation; dataset provenance and hash logging; monitoring of training-data deltas.
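
One of these controls, signed prompt artifacts, fits in a few lines of stdlib Python; the HMAC key handling and function names below are illustrative assumptions (a production repo would use asymmetric signatures with keys held in a KMS/HSM):

```python
# Hedged sketch: every canonical system prompt carries an HMAC produced at
# publish time; the runtime refuses unsigned or tampered prompts.
import hmac
import hashlib

SIGNING_KEY = b"kms://prompt-repo-signing-key"  # placeholder; fetch from KMS/HSM in practice

def sign_prompt(prompt: str) -> str:
    return hmac.new(SIGNING_KEY, prompt.encode(), hashlib.sha256).hexdigest()

def load_prompt(prompt: str, signature: str) -> str:
    """Reject a prompt whose signature does not verify. A prompt-librarian
    insider edit (Class 1, scenario 4) fails closed instead of shipping."""
    if not hmac.compare_digest(sign_prompt(prompt), signature):
        raise PermissionError("prompt artifact failed integrity check")
    return prompt
```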

RA + CMM mapping. Identity (PIP); Control (PDP, dual-control policy); Data (RAG/prompt/eval-harness integrity, AI-BOM provenance); Observability (training-data and prompt-repo deltas). CMM domains: D2 Identity, D3 Control & Least-Agency, D6 Data/Memory/RAG, D8 Supply Chain & AI-BOM, D9 Operations & Human Factors.

Class 2 — Long-running adaptive adversarial campaigns

APT-class adversaries that operate against AI systems over weeks or months and adapt to defender response. The 2025–2026 evidence base for this class is now substantial — it is no longer a hypothetical.

Authoritative sources. CrowdStrike 2026 Global Threat Report reports AI-enabled adversary activity up 89% during 2025, naming Russia-nexus FANCY BEAR (LAMEHUG LLM-enabled malware) and DPRK FAMOUS CHOLLIMA (AI-personas-at-scale insider operations). Microsoft Digital Defense Report 2025 covers nation-state AI use for influence operations and lateral movement. Anthropic’s GTG-1002 disclosure (Nov 2025) describes a PRC-nexus group operating Claude Code as an autonomous penetration orchestrator across roughly 30 global targets. UK AISI’s Frontier AI Trends Report finds that the length of cyber tasks frontier models can complete unassisted is doubling roughly every eight months — a measurement built on METR’s “Measuring AI Ability to Complete Long Tasks” (arXiv:2503.14499) methodology, which found the generalist task horizon doubling every ~7 months across 2019–2025 and accelerating to ~4 months in 2024–2025. Citing AISI without METR is citing the conclusion without its methodological foundation.

Direct quote. “An operator tasked instances of Claude Code to operate in groups as autonomous penetration testing orchestrators and agents… the threat actor able to leverage AI to execute 80–90% of tactical operations independently.” — Anthropic, GTG-1002.

Concrete attack scenarios. Slow RAG poisoning over weeks to evade behavioral baselines; iterative jailbreak probing across model versions; AI-orchestrated multi-step intrusions at scale (GTG-1002 pattern); AI-generated personas to scale insider operations (FAMOUS CHOLLIMA / DPRK fake-employee, >130% YoY growth).

Real-world incidents. GTG-1002 (PRC-nexus, Sep 2025); LAMEHUG (FANCY BEAR / Russia, 2025); FAMOUS CHOLLIMA (DPRK, 2025); Secret Blizzard / ApolloShadow (Russia, Feb 2025).

Defensive controls named. Cross-version eval continuity, model-side abuse classifiers (Anthropic’s GTG-1002 mitigation pattern), behavioral baselines on agent tool-call patterns, threat-intel sharing with model vendors, drift detection on RAG indices, sustained AI-workload threat hunting.
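
A hedged sketch of one control from this list, drift detection on a RAG index: tracking cumulative churn as well as per-snapshot churn means a campaign that swaps only a few documents per day still trips the alarm, even though each daily delta looks benign. The thresholds and `DriftMonitor` class are illustrative assumptions:

```python
# Hedged sketch: slow-drift detection over periodic RAG-index snapshots.
import hashlib

def snapshot(index: dict[str, bytes]) -> dict[str, str]:
    """Per-document content hashes for one index snapshot."""
    return {doc_id: hashlib.sha256(body).hexdigest() for doc_id, body in index.items()}

def churn(prev: dict[str, str], curr: dict[str, str]) -> float:
    """Fraction of documents added, removed, or modified since last snapshot."""
    keys = prev.keys() | curr.keys()
    changed = sum(1 for k in keys if prev.get(k) != curr.get(k))
    return changed / max(len(keys), 1)

class DriftMonitor:
    DAILY_LIMIT = 0.02    # illustrative: >2% churn in one snapshot alerts immediately
    WINDOW_LIMIT = 0.10   # illustrative: >10% cumulative churn over the window alerts too

    def __init__(self, baseline: dict[str, str], window: int = 30):
        self.prev = baseline
        self.history: list[float] = []
        self.window = window

    def observe(self, curr: dict[str, str]) -> bool:
        """Return True if this snapshot should raise a drift alert."""
        rate = churn(self.prev, curr)
        self.history = (self.history + [rate])[-self.window:]
        self.prev = curr
        return rate > self.DAILY_LIMIT or sum(self.history) > self.WINDOW_LIMIT
```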

RA + CMM mapping. Runtime, Egress, Observability planes. CMM domains: D4 Runtime & Guardrails, D5 Egress & Network, D7 Observability & Detection, D9 Operations & Human Factors. Argues for a sustained Threat-Hunt function on AI workloads at CMM L4.

Class 3 — Collusion (agent–agent, insider–agent, agent–human)

Three sub-cases that share a structural property: oversight built around any single channel can be bypassed when two channels coordinate.

(a) Insider amplifier. Compromised insider plus AI agent in coordination — the insider directs the agent through legitimate channels to reach systems they could not access alone.

(b) Multi-agent collusion. Two agents collude to bypass an oversight that would catch either alone — e.g., a code-author agent and a code-review agent coordinate to ship sabotaged code; or two agents in a multi-agent system exchange steganographically encoded plans to bypass an output-monitoring filter.

(c) Agent–human social engineering. AI agents manipulate humans (especially HITL approvers) — AI-driven phishing of approvers, “approval-fatigue” exploitation, persuasive-agent over-reliance.

Authoritative sources. Apollo Research published the formal threat model for steganographic agent-to-agent collusion (Secret Collusion among AI Agents: Multi-Agent Deception via Steganography, NeurIPS 2024) and the detection technique (Detecting Strategic Deception Using Linear Probes, 2025). OpenAI × Apollo’s 2025 Detecting and Reducing AI Scheming covers pre-deployment evals for scheming. The HITL-specific failure mode is articulated in Changkun’s Confirmation Fatigue and the Protocol Gap in Agentic AI Oversight (2025): “Confirmation fatigue makes per-call approval security-equivalent to no approval.” OWASP ASI09 (Human–Agent Trust Exploitation) names the over-reliance failure mode.

Direct quotes.

  • “Secret collusion is a form of multi-agent deception, in which two or more agents employ steganographic methods to conceal the true nature of their interactions… from oversight.” — Apollo Research.
  • “By using steganographic techniques, the model can remain helpful to the user while circumventing output monitoring systems.” — Apollo Research.
  • “GPT-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities.” — Apollo Research.

Real-world incidents. No publicly attributed agent-agent collusion incident in production yet; Apollo’s evidence comes from controlled stress-tests, so treat agent-agent collusion as forward-looking with a strong empirical foundation. HITL bypass is, by contrast, well-documented in agent tooling (the Cursor denylist has been bypassed four different ways; JS injection into the Claude Desktop Electron app has been used to bypass confirmation dialogs).

Defensive controls named. Paraphrase / output canonicalization (Apollo); linear deception probes on residual streams (Apollo); tiered architecture — deterministic policy gates → constitutional self-check → AI supervisor → HITL only for irreversible/novel → audit trail; approval-budget rate-limiting and coarse-grained (per-session) approval; mutually-blind monitor agents that cannot share state.
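
The approval-budget idea is concrete enough to sketch; the action names, budget size, and `ApprovalBudget` class below are illustrative assumptions, not a published protocol:

```python
# Hedged sketch: coarse per-session approval budget instead of per-call HITL,
# so human attention is reserved for the rare irreversible actions.
from dataclasses import dataclass, field

# Actions whose effects cannot be undone; everything else passes a
# deterministic policy gate without interrupting a human.
IRREVERSIBLE = {"delete_records", "wire_transfer", "deploy_prod"}

@dataclass
class ApprovalBudget:
    max_approvals: int = 3               # per-session, not per-call
    used: int = 0
    log: list[str] = field(default_factory=list)

    def request(self, action: str) -> str:
        self.log.append(action)          # audit trail first, always
        if action not in IRREVERSIBLE:
            return "allow"               # deterministic policy gate handles it
        if self.used >= self.max_approvals:
            return "halt_session"        # budget exhausted: stop for review
        self.used += 1
        return "escalate_to_human"       # rare enough to keep attention high
```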

RA + CMM mapping. Control (PDP), Runtime (PEP, supervisor agents), Observability (deception probes, monitor isolation). CMM domains: D3 Control & Least-Agency, D4 Runtime & Guardrails, D7 Observability & Detection, D9 Operations & Human Factors. Justifies a future agent-oversight practice page and a CMM evidence item on monitor isolation at D7 L4+.

Class 4 — Model-version-degradation attacks

Attacks tied to the foundation-model update cadence. Four sub-cases: a vendor pushes a minor update that regresses a safety property (jailbreak susceptibility shifts between versions); upstream weight poisoning persists through downstream fine-tuning; an adversary times an exploit to a known rollout window; the customer’s own evals break silently when the underlying model updates.

Authoritative sources. Anthropic’s Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training is the load-bearing paper for the persistence sub-case. Anthropic’s Simple probes can catch sleeper agents documents the detection counterpart. The Anthropic Responsible Scaling Policy and METR’s Common Elements of Frontier AI Safety Policies describe vendors’ pre-deployment commitments. Joint UK/US AISI pre-deployment evaluations of the upgraded Claude 3.5 Sonnet established the “two AISIs evaluating one upgrade” pattern. Practitioner empirics: Promptfoo’s Your model upgrade just broke your agent’s safety documented version-to-version safety regressions in benchmark coverage across the GPT-5 → GPT-5.4 generation.

Direct quotes.

  • “Such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training.” — Anthropic, Sleeper Agents.
  • “Rather than removing backdoors, adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.” — Anthropic, Sleeper Agents.
  • “Newer models are not automatically safer, and if you assume safety ‘transfers’ across upgrades, you will ship regressions.” — Promptfoo, 2025.
  • “Vendor safety ratings don’t transfer — your deployment context… is unique.” — Promptfoo.

Concrete attack scenarios. Vendor minor-version update silently regresses jailbreak resistance; upstream weight poisoning (sleeper agent) survives downstream fine-tuning; adversary times exploit to coincide with an announced rollout; customer eval suite hard-codes assumptions that break silently on update so safety assertions become rubber stamps.

Real-world incidents. Documented version-to-version safety regressions across the GPT-5 generation (Promptfoo benchmarks). Anthropic Constitutional Classifiers public demo, Feb 3–10 2025: a universal jailbreak was found during the live red-team window. Sleeper-agent attack itself is a controlled lab demonstration, not a documented in-the-wild incident.

Defensive controls named. Continuous regression red-teaming on every model-version pin; pin-by-hash deployment (no auto-upgrades); customer-side eval suites versioned independently from the vendor’s; rollback playbooks; canary traffic on new versions; defection probes / linear deception detectors for trojaned weights; AI-BOM with model-version provenance.
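
A sketch of the pin-by-hash gate, assuming the serving layer exposes some model-identity metadata; `fetch_served_model_info` and the pin format are stand-ins, not any vendor’s real API:

```python
# Hedged sketch: refuse to route traffic if the served model differs from the
# pin, or if the customer-owned eval suite has not passed against this pin.
PINNED = {"model": "vendor-model-2026-03-01", "sha256": "<expected-identity-hash>"}

def fetch_served_model_info() -> dict:
    # Stub: in practice, query the serving gateway or vendor metadata endpoint.
    return {"model": "vendor-model-2026-03-01", "sha256": "<expected-identity-hash>"}

def gate_traffic(eval_suite_passed: bool) -> bool:
    """Route traffic only if (a) the served model matches the pin exactly and
    (b) the independently versioned customer eval suite passed against it.
    A silent vendor upgrade trips (a); a silently broken eval trips (b)."""
    served = fetch_served_model_info()
    if served != PINNED:
        raise RuntimeError(f"model drift: served {served} != pinned {PINNED}")
    return eval_suite_passed
```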

RA + CMM mapping. Runtime, Data, Supply-Chain, Operations. CMM domains: D4 Runtime & Guardrails, D6 Data/Memory/RAG (eval-harness integrity), D8 Supply Chain & AI-BOM (model-version provenance), D9 Operations & Human Factors (rollback drill, model-deprecation policy — already named in the validation page §3).

Class 5 — Jurisdictional adversaries with regulatory leverage

State actors using legal and regulatory mechanisms as the attack vector. This is not technical exploitation; it is denial-of-service, compelled disclosure, or cutoff via legal authority. A peer reviewer in compliance counsel’s seat will ask about it specifically.

Sub-cases.

  1. Mid-deployment vendor cutoff: a BIS or IEEPA designation forces an enterprise off a foundation model already in production (DeepSeek-style).
  2. Compelled disclosure of system prompts, fine-tune datasets, or weights to a foreign regulator (China CAC algorithm filing).
  3. Cross-border data-flow restriction used as denial-of-service against an inference-as-a-service deployment.
  4. Sovereignty-driven forced replication of model/data into a hostile jurisdiction.
  5. State-mandated registration that exposes RAG corpus structure or evaluation methodology.

Authoritative sources. CSET Georgetown’s Recommendations on Export Controls for AI and For Export Controls on AI, Don’t Forget the Catch-All Basics. Covington’s U.S. Federal and State Governments Moving Quickly to Restrict Use of DeepSeek (Feb 2025) documents a concrete vendor-cutoff scenario for enterprises. Alston & Bird’s DeepSeek and the US Data Regulations analyzes the IEEPA-based US Data Regulation that entered into force on April 8, 2025, prohibiting transactions with “country-of-concern” AI providers. Securiti and Global Legal Insights describe the China CAC algorithm-filing, dataset-disclosure, and security-assessment regime — extraterritorial in application. White & Case’s AI Watch tracker covers the EU AI Act’s extraterritorial scope.

Direct quotes.

  • “The Gen AI Measures assert extraterritorial reach by explicitly covering any service – regardless of where the provider is located – that is offered to the Chinese public, and the CAC can also take technical measures (e.g., block or restrict access).” — Global Legal Insights, China AI Laws 2025.
  • “The US Data Regulation was passed under the International Emergency Economic Powers Act (IEEPA)… currently slated to enter into force on April 8, 2025. The regulation prohibits or restricts transactions with companies from ‘countries of concern’ including China, Russia, Iran, North Korea, Cuba, and Venezuela.” — Alston & Bird.
  • “The AI Act has an extraterritorial effect… If an AI system’s output or use affects people in the EU, the EU AI Act may apply.” — White & Case.

Real-world incidents. DeepSeek bans (Feb 2025 onward — NASA, multiple US federal/state entities, several allied governments) — the canonical “vendor cut off mid-deployment” pattern. April 8, 2025 IEEPA Data Regulation entered force — concrete enterprise compliance trigger. CAC algorithm filings ongoing — mandatory disclosure regime now in operation.

Defensive controls named. Multi-region, multi-vendor model abstraction layer with portable prompts and evals; jurisdiction tagging on data flows and model endpoints in the AI-BOM; vendor-cutoff playbooks with pre-validated alternates; counsel-in-the-loop for any model-vendor contract change; data residency controls on RAG corpora and embeddings.
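
A minimal sketch of the vendor abstraction layer with jurisdiction tagging and a pre-validated fallback; the endpoint registry, tags, and blocklist mechanics are hypothetical:

```python
# Hedged sketch: the Class 5 "cutoff playbook" expressed as a failover rule
# over a registry of eval-validated endpoints tagged by jurisdiction.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelEndpoint:
    name: str
    jurisdiction: str          # recorded in the AI-BOM alongside data-flow tags
    portable_prompts: bool     # prompts/evals pre-validated against this endpoint

REGISTRY = [
    ModelEndpoint("primary-vendor/us-east", "US", portable_prompts=True),
    ModelEndpoint("fallback-vendor/eu-west", "EU", portable_prompts=True),
]

BLOCKED_JURISDICTIONS: set[str] = set()  # updated by counsel on designation news

def select_endpoint() -> ModelEndpoint:
    """Fail over to the first legally usable, eval-validated endpoint. A BIS
    or IEEPA designation becomes a blocklist update plus a failover, rather
    than an emergency re-architecture."""
    for ep in REGISTRY:
        if ep.jurisdiction not in BLOCKED_JURISDICTIONS and ep.portable_prompts:
            return ep
    raise RuntimeError("no usable model endpoint; invoke vendor-cutoff playbook")
```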

RA + CMM mapping. Governance and Data planes; supply-chain and operations cross-cutting. CMM domains: D1 Governance & Accountability, D6 Data/Memory/RAG (residency), D8 Supply Chain & AI-BOM (vendor abstraction), D9 Operations & Human Factors (cutoff playbook, contract resilience). Argues for a future jurisdictional resilience practice page distinct from general AI-policy commentary.

Cross-class synthesis

```mermaid
block-beta
  columns 5

  C1["Class 1<br/>Insider"]:1
  C2["Class 2<br/>APT campaign"]:1
  C3["Class 3<br/>Collusion"]:1
  C4["Class 4<br/>Version regression"]:1
  C5["Class 5<br/>Jurisdictional"]:1

  AI_BOM["AI-BOM + always-on customer eval harness · cryptographic provenance · drift alerts<br/>(absorbs Classes 1, 2, 4)"]:3
  ISO["Output canonicalization · monitor isolation · deception probes<br/>(absorbs Class 3)"]:1
  GOV["Vendor abstraction · jurisdiction tagging · counsel-in-loop · cutoff playbooks<br/>(absorbs Class 5)"]:1

  classDef control fill:#cfe2ff,stroke:#0d6efd,color:#000
  classDef class1 fill:#f8d7da,stroke:#dc3545,color:#000
  classDef gov fill:#fff3cd,stroke:#fd7e14,color:#000

  class C1,C2,C3,C4,C5 class1
  class AI_BOM,ISO control
  class GOV gov
```

The deepest defensive overlap is between Classes 1, 2, and 4. An insider who silently corrupts a fine-tune, an APT that slowly poisons a RAG index over weeks, and a vendor that ships a regressed model version are detected by the same control: a customer-owned, version-pinned, continuously-executed eval harness with cryptographic provenance over every artifact. Class 3 partially overlaps via output canonicalization; Class 5 is the outlier and requires governance-layer answers, not technical artifact controls. For CISO architecture, the order of operations is: (i) AI-BOM + always-on customer eval harness; (ii) monitor-isolation discipline for collusion; (iii) governance workstream for jurisdictional resilience.

Mapping to RA + CMM

| Class | Primary planes | Primary CMM domains |
| --- | --- | --- |
| 1. AI-aware insider | Identity · Control · Data · Observability | D2 · D3 · D6 · D8 · D9 |
| 2. Long-running APT campaigns | Runtime · Egress · Observability | D4 · D5 · D7 · D9 |
| 3. Collusion | Control · Runtime · Observability | D3 · D4 · D7 · D9 |
| 4. Model-version-degradation | Runtime · Data · Supply Chain · Operations | D4 · D6 · D8 · D9 |
| 5. Jurisdictional adversaries | Governance · Data · Supply Chain · Operations | D1 · D6 · D8 · D9 |

D9 Operations & Human Factors appears in every class — the validation page (§3) already flagged this domain as the home for guardrail latency / cost budgets, decommission drills, model-deprecation policy, federated incident sharing, and HITL fatigue. This expansion adds: monitor-isolation evidence (Class 3), model-version-pinning and rollback drill cadence (Class 4), and vendor-cutoff playbook (Class 5).

Open issues

Where the literature is still thin

  1. Real-world incident data on agent–agent collusion — Apollo’s evidence base is controlled lab work. As multi-agent meshes go to production, an in-the-wild incident is the missing anchor.
  2. Quantitative cost/latency budgets for the always-on customer eval harness — Class 1, 2, and 4 mitigation depends on this control being affordable. Continuous re-evaluation over a large eval suite is non-trivial. No published guidance on percentage-of-inference-spend benchmarks.
  3. Insider-threat program staffing for AI — RAND’s 38-vector framework is solid but does not specify how many people, what skills, and what tooling an enterprise needs to operate it. A NIST IR on AI insider threat would help.
  4. Cross-border AI data-flow standards — the regulatory landscape (US IEEPA + China CAC + EU AI Act + state-by-state) is fragmented. No single tracker yet packages this for an architect’s quick reference.
  5. Vendor-cutoff playbook benchmarks — DeepSeek’s removal forced enterprises to scramble; few published lessons learned. Class 5 mitigation maturity is currently undocumented.

See Also