Agentic AI Security CMM — Measurement Protocol (Assessor’s Handbook)
This is the assessment instrument the validation page (Validation: Agentic AI Security CMM vs Widely Adopted Standards §6 rec #2) said the CMM was missing. Without it, two assessors auditing the same organization will reach different verdicts.
The protocol is modeled on BSIMM’s observation/assertion structure (descriptive — record what is actually done) layered with CMMC 2.0’s three-level assessment guide pattern (prescriptive — match observed state against documented criteria). It applies to all 9 CMM domains.
Three-stage assessment
```mermaid
flowchart LR
    P1[Stage 1<br/>Pre-engagement] --> P2[Stage 2<br/>Evidence collection]
    P2 --> P3["Stage 3<br/>Scoring & report"]
    P1 -.- D1[Scope letter<br/>Agent inventory<br/>Document request list]
    P2 -.- D2[Interview script<br/>Artifact checklist<br/>Live observation]
    P3 -.- D3[Per-domain score<br/>Headline ratings<br/>Gap report]
```
Stage 1 — Pre-engagement (1–2 weeks)
The org under assessment delivers:
- Scope letter identifying which agents are in-scope. Each agent gets an Agent Card (system manifest) with: name, owner (human), purpose, data classifications touched, tools/MCP servers used, deployment shape (chatbot / coding tool / RAG / etc.), production status, downstream consumers. A minimal manifest sketch appears at the end of this stage.
- Agent inventory export — the full registry, even if some agents are out-of-scope for this assessment. Required so the assessor can detect shadow agents.
- Document request list response. Standard requests: AI security policy, IR runbook, last red-team report, AI-BOM artifact, gateway config, identity graph export, latest decommission drill report, last quarterly board AI-risk pack.
If any requested document is missing, that's an automatic L1 in the relevant domain.
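For assessors who want a concrete picture of a complete Agent Card, the sketch below shows one possible shape as structured data. The field names mirror the scope-letter bullet above; the CMM does not prescribe a schema, so `DeploymentShape`, `AgentCard`, and `missing_fields` are illustrative, not a standard.

```python
# Illustrative only: field names mirror the Agent Card bullet above; the CMM does
# not prescribe a schema, so treat this as one possible shape, not a standard.
from dataclasses import dataclass, field
from enum import Enum


class DeploymentShape(str, Enum):
    CHATBOT = "chatbot"
    CODING_TOOL = "coding-tool"
    RAG = "rag"
    OTHER = "other"


@dataclass
class AgentCard:
    name: str
    owner: str                       # accountable human, not a team alias
    purpose: str
    data_classifications: list[str]  # e.g. ["public", "internal", "pii"]
    tools_and_mcp_servers: list[str]
    deployment_shape: DeploymentShape
    in_production: bool
    downstream_consumers: list[str] = field(default_factory=list)
    in_scope: bool = True            # assessor marks scope per the scope letter


def missing_fields(card: AgentCard) -> list[str]:
    """Return empty required fields: a quick pre-engagement completeness check."""
    required = {
        "name": card.name,
        "owner": card.owner,
        "purpose": card.purpose,
        "data_classifications": card.data_classifications,
        "tools_and_mcp_servers": card.tools_and_mcp_servers,
    }
    return [k for k, v in required.items() if not v]
```

Running `missing_fields` over the inventory export during the initial gap scan is one way to flag candidates for the automatic-L1 rule above before interviews start.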
Stage 2 — Evidence collection (2–4 weeks)
Three parallel tracks: interviews, artifacts, live observation.
Interview script (per domain)
Each domain has a structured interview block. Sample questions are NOT exhaustive — the assessor follows up on every “yes we do that” with “show me.” Pure verbal evidence is L2 at best; L3+ requires artifact corroboration.
D1 Governance
- Who chairs the AI Risk Committee? When did it last meet? Show the minutes.
- How is an agent’s risk tier assigned? Show the rubric.
- Who can approve a high-risk agent for production? Show one approval.
- Does the board get AI-risk reporting? Show the most recent pack.
D2 Identity & Authorization
- Show me the identity for agent [X]. Trace one of its actions back to the human owner.
- What happens when the human owner of agent [X] leaves the company? Walk me through.
- Show me a credential proxy log for agent [X]. Confirm the agent process never sees the underlying credential.
- How is agent [X]'s identity attested? (SPIFFE / OAuth 2.1 / OIDC / Microsoft Entra Agent ID / Okta for AI Agents.)
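The identity-trace and owner-departure questions above can be made mechanical. The sketch below assumes a simplified identity-graph export (agent to human owner) and a generic audit-log dict; real exports from Entra Agent ID, Okta for AI Agents, or a SPIFFE registry will differ, so the shapes and names here are assumptions.

```python
# Hedged sketch: identity-graph export formats vary by platform, so the dict
# shapes below are assumptions, not a vendor schema.
from typing import Optional


def orphaned_agents(identity_graph: dict[str, Optional[str]],
                    active_humans: set[str]) -> list[str]:
    """Agents whose owner is missing or no longer an active employee:
    the 'what happens when the owner leaves' question, expressed as a query."""
    return [agent for agent, owner in identity_graph.items()
            if owner is None or owner not in active_humans]


def trace_action_to_owner(action: dict,
                          identity_graph: dict[str, Optional[str]]) -> str:
    """Trace one audit-log action back to the accountable human, or raise."""
    agent = action.get("agent_id")
    owner = identity_graph.get(agent)
    if not owner:
        raise LookupError(f"action {action.get('id')} cannot be traced: "
                          f"agent {agent!r} has no human owner on record")
    return owner
```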
D3 Control & Least-Agency
- Show me agent [X]'s tier (auto / notify / confirm / block) per tool. Who decides?
- Show me the PDP config in production. What happens if the PDP is unreachable?
- Trigger a synthetic high-risk-tier action for agent [X] — does HITL fire?
- Show me a lethal-trifecta detection event from the last 30 days.
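A minimal sketch of the tiering and trifecta questions above, assuming the four tiers named in the interview block and a deliberately crude capability-set heuristic for the lethal trifecta. A production PDP (Cedar, OPA, or a gateway-native engine) would express this as policy, not Python; the point of the sketch is the fail-closed default.

```python
# Sketch under assumptions: tier names come from the interview block above
# (auto / notify / confirm / block); the lethal-trifecta heuristic is illustrative.
from enum import Enum
from typing import Optional


class Tier(str, Enum):
    AUTO = "auto"        # execute silently
    NOTIFY = "notify"    # execute, then inform the owner
    CONFIRM = "confirm"  # hold until HITL approval
    BLOCK = "block"      # never execute


def decide(tier_map: Optional[dict[tuple[str, str], Tier]],
           agent: str, tool: str) -> Tier:
    """PDP decision for (agent, tool). If the PDP config is unreachable
    (tier_map is None) or the pair is unmapped, fail closed."""
    if tier_map is None:
        return Tier.BLOCK                      # PDP unreachable: fail-closed
    return tier_map.get((agent, tool), Tier.BLOCK)


def lethal_trifecta(capabilities: set[str]) -> bool:
    """True if one agent combines private-data access, exposure to untrusted
    content, and an external communication channel in a single task context."""
    return {"private_data", "untrusted_content", "external_comms"} <= capabilities
```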
D4 Runtime & Guardrails
- What guardrails sit in front of agent [X]'s LLM call? In-line, sidecar, or external?
- What's the bypass-class coverage of your input filter? (English-only? Multilingual? Leetspeak?)
- Show me an AlignmentCheck firing on a real agent run.
- What’s your sandbox grain — per-call, per-task, per-agent? Show the sandbox config.
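To ground the in-line vs sidecar vs external question, here is a minimal in-line wrapper showing the fail-closed behavior an assessor should expect to see tested. `call_llm`, `input_filter`, and `output_filter` are stand-ins for whatever provider SDK and guardrail product is actually deployed.

```python
# Minimal in-line guardrail sketch; names are placeholders for the deployed stack.
from typing import Callable


class GuardrailBlocked(Exception):
    pass


def guarded_llm_call(prompt: str,
                     call_llm: Callable[[str], str],
                     input_filter: Callable[[str], bool],
                     output_filter: Callable[[str], bool]) -> str:
    """Run input filter -> model -> output filter, failing closed on any error."""
    try:
        if not input_filter(prompt):
            raise GuardrailBlocked("input filter rejected the prompt")
        completion = call_llm(prompt)
        if not output_filter(completion):
            raise GuardrailBlocked("output filter rejected the completion")
        return completion
    except GuardrailBlocked:
        raise
    except Exception as exc:
        # Guardrail or model error: fail closed rather than returning unchecked output.
        raise GuardrailBlocked(f"guardrail stack error, failing closed: {exc}") from exc
```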
D5 Egress & Network
- What proxy / gateway sits between agent [X] and external tools?
- How does agent [X] get a token to call MCP server [Y]? Show the exchange.
- Show me a tool-poisoning detection event. What does the gateway do when it fires?
- Where does agent [X]'s outbound traffic actually go? Show the egress allowlist.
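The egress-allowlist question is easiest to evaluate when the assessor knows what default-deny looks like in the small. The sketch below is an assumption about shape only; real enforcement lives in the proxy or gateway config, and `agent-x` and the hostnames are hypothetical.

```python
# Sketch of a default-deny egress check; real deployments enforce this at a
# forward proxy or gateway, not in application code.
from urllib.parse import urlparse


def egress_allowed(agent: str, url: str,
                   allowlists: dict[str, set[str]]) -> bool:
    """True only if the destination host is on the agent's explicit allowlist.
    Unknown agents and unlisted hosts are denied (default-deny)."""
    host = urlparse(url).hostname or ""
    allowed_hosts = allowlists.get(agent, set())
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)


# Hypothetical example: agent-x may call its MCP server and one SaaS API, nothing else.
allowlists = {"agent-x": {"mcp.internal.example", "api.vendor.example"}}
assert egress_allowed("agent-x", "https://api.vendor.example/v1/orders", allowlists)
assert not egress_allowed("agent-x", "https://attacker.example/exfil", allowlists)
```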
D6 Data, Memory & RAG
- For RAG: show me document attestation at ingest. Show a poisoned-document detection.
- For memory: how do you detect memory poisoning? Show a recent detection.
- Show me the cognitive file integrity baseline for agent [X]'s `IDENTITY.md` / system prompt.
- Are canary tokens deployed in the system prompt? When was the last leak alert?
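A minimal sketch of what a cognitive file integrity baseline and a canary-token leak check might look like, assuming plain SHA-256 hashing of the agent's `IDENTITY.md` and system prompt; the canary string and file handling are illustrative.

```python
# Sketch of a cognitive-file-integrity baseline and a canary-token leak check.
# File names and the canary format are illustrative, not prescribed by the CMM.
import hashlib
from pathlib import Path

CANARY = "canary-7f3a9c"  # hypothetical unique token embedded in the system prompt


def baseline(paths: list[Path]) -> dict[str, str]:
    """Record SHA-256 hashes of the agent's cognitive files (IDENTITY.md, system prompt)."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}


def drifted(paths: list[Path], recorded: dict[str, str]) -> list[str]:
    """Return files whose current hash no longer matches the recorded baseline."""
    current = baseline(paths)
    return [p for p, digest in current.items() if recorded.get(p) != digest]


def canary_leaked(model_output: str) -> bool:
    """True if the canary token appears in model output or an egress capture,
    which is evidence of system-prompt leakage."""
    return CANARY in model_output
```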
D7 Observability & Detection
- Show me OTel `gen_ai.*` traces for an agent run end-to-end.
- Show me a behavioral-drift alert from the agent behavioral monitoring system from the last quarter.
- Walk me through a multi-tool red-team eval — which tools were used (Promptfoo / PyRIT / Garak / Mindgard CART)?
- Show me an MCP CVEs Q1 2026-class CVE alert flowing through your detection pipeline.
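For assessors less familiar with the OTel GenAI conventions, the sketch below shows roughly what emitting a `gen_ai.*` span for one LLM call looks like with the `opentelemetry-api` package. Attribute names follow the GenAI semantic conventions at the time of writing and should be checked against the current spec; the tracer name, the word-count token stand-in, and `call_llm` are illustrative, and a configured TracerProvider/exporter is assumed elsewhere.

```python
# Sketch of one gen_ai.* span per agent LLM call. Requires opentelemetry-api
# (and an SDK/exporter configured elsewhere); attribute names should be checked
# against the current GenAI semantic-conventions release.
from opentelemetry import trace

tracer = trace.get_tracer("agent-runtime")  # tracer name is illustrative


def traced_llm_call(model: str, prompt: str, call_llm) -> str:
    with tracer.start_as_current_span("chat") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        completion = call_llm(prompt)
        # Crude word counts stand in for real token usage reported by the provider.
        span.set_attribute("gen_ai.usage.input_tokens", len(prompt.split()))
        span.set_attribute("gen_ai.usage.output_tokens", len(completion.split()))
        return completion
```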
D8 Supply Chain & AI-BOM
- Show me the AI-BOM for agent [X] (build-time and runtime).
- Show me a sigstore signature for one of your skills / models.
- Show me a registry-scan finding from Aguara Watch / SecureClaw / equivalent.
- Walk me through how you detect a ClawHavoc-class supply-chain event.
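The build-time vs runtime AI-BOM question reduces to a reconciliation diff. The sketch below uses simplified name-to-version/digest maps as a stand-in for CycloneDX/SPDX documents; real reconciliation would parse those formats and verify sigstore signatures first.

```python
# Sketch of build-time vs runtime AI-BOM reconciliation. Component dicts are a
# simplified stand-in for CycloneDX/SPDX entries.
def reconcile(build_bom: dict[str, str],
              runtime_inventory: dict[str, str]) -> dict[str, list[str]]:
    """Compare declared components (name -> version/digest) against what is
    actually loaded at runtime."""
    return {
        # loaded at runtime but never declared: a possible ClawHavoc-class swap-in
        "undeclared": sorted(set(runtime_inventory) - set(build_bom)),
        # declared but not observed at runtime: stale BOM or dead dependency
        "missing": sorted(set(build_bom) - set(runtime_inventory)),
        # declared and present but with a different version or digest
        "mismatched": sorted(k for k in build_bom.keys() & runtime_inventory.keys()
                             if build_bom[k] != runtime_inventory[k]),
    }
```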
D9 Operations & Human Factors
- What’s the p99 latency budget for your guardrail stack? Show the dashboard.
- What’s your fail-mode for a guardrail timeout — fail-closed or fail-open? Show the test.
- When was the last decommission drill? Show the report.
- What’s your HITL approval-rate? Show the rubber-stamp metric (approval-rate without comment).
- Show me a system-prompt leak test result and your canary-token deployment.
- What's your model-deprecation policy? Show the version-pin register for agent [X].
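The timeout fail-mode question is best answered with a test the org can re-run. The sketch below is one way to express it, assuming an illustrative 500 ms budget and a deliberately hung guardrail stub; the point is that a timeout must map to "blocked", never "allowed".

```python
# Sketch of the fail-mode test behind the timeout question: force the guardrail
# past its latency budget and confirm the action is blocked, not waved through.
# Budget value and function names are illustrative.
import concurrent.futures
import time

GUARDRAIL_TIMEOUT_S = 0.5  # example p99 budget; use the org's documented value


def slow_guardrail(prompt: str) -> bool:
    time.sleep(2)           # simulate a hung guardrail
    return True


def guarded(prompt: str, guardrail=slow_guardrail) -> str:
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(guardrail, prompt)
    try:
        allowed = future.result(timeout=GUARDRAIL_TIMEOUT_S)
    except concurrent.futures.TimeoutError:
        return "blocked"    # fail-closed on timeout
    finally:
        pool.shutdown(wait=False)
    return "allowed" if allowed else "blocked"


assert guarded("synthetic high-risk action") == "blocked"
```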
Artifact checklist (required per level)
| Domain | L2 artifacts | L3 artifacts | L4 artifacts | L5 artifacts (achievable today) | L5+ artifacts (leading-edge) |
|---|---|---|---|---|---|
| D1 | Policy doc; RACI | Risk Committee minutes; deployment-gate evidence | KPI dashboard; board pack; gap report; standards crosswalk matrix | Most-recent AIUC-1 / ISO 42001 cert; board-attested risk metrics; ≥1-year committee minutes | Named-contributor evidence; published research; external observability dataset |
| D2 | Agent inventory | Identity graph; sample audit trail; OIDC tokens | Cred-proxy logs; Cedar/OPA repo; tabletop drill report | Registry export; ISPM dashboard; SPIFFE-JWT-SVID chain; coupled-credential migration report | NIST CAISI participation; cross-platform identity federation report |
| D3 | Tool allowlist config | PDP config; tier assignments per agent | Promotion-gate runbook (org-authored); HITL telemetry; trifecta-detection log | Warrant samples; step-up logs; per-release policy-compile artifact; cryptographic SoD evidence | CaMeL production deployment evidence; formal-verification reports; temporal-logic policy artifact |
| D4 | Provider safety config | Hook code; firewall logs; sandbox config | AlignmentCheck logs; CodeShield findings; grounding scores | Platform-enforcement coverage report (zero opt-outs); multi-language eval log; classifier refresh receipts; response-leak alert log; latency/cost dashboard with fail-closed proof | TEE attestation chain; CaMeL split production evidence; bypass-class eval with remediation timeline |
| D5 | Outbound proxy config | Gateway config; certs; A2A enforcement profile | Token-exchange logs; rule sets; CVE-tagged log | Mesh topology with zero-bypass proof; per-task token samples; SSRF closure verification; CVE-feed auto-quarantine log | Sigstore-for-MCP verifier; A2A drift rule library; cross-cloud reconciliation report |
| D6 | Source labels | Scan results; CFI baseline | Attestation logs; rollback drill report | Drift dashboard; threshold-justification memo; conflict-flagging logs; canary-token deployment log; rollback drill RTO report | Per-doc attestation chain; taint-lattice implementation; ZK-proof verifier logs |
| D7 | Tool-call audit log | Trace samples; span schema validation | Behavioral-monitoring dashboards; multi-tool eval reports with ID tags | DeepTracing graph; agent-aware playbook samples; prompt-volume-to-alert dashboard ≥1 quarter; analyst-actionable rate report | Cascade rule registry with thresholds; multi-agent joint-baseline statistics; forward-pass activation monitor |
| D8 | Inventory | AI-BOM artifact; sigstore log | Sig-verified registry; reconciliation report; ID-tagged ML-VEX | Closed-loop diagram with SLA evidence; SLSA L3 attestation; runtime/build AI-BOM reconciliation; ML-VEX feed | SLSA L4 report; cross-vendor AI-BOM federation; standards-WG named contribution |
| D9 | Runbook artifact | Latency/cost dashboard; reaper logs; canary deployment proof | HITL-fatigue KPIs; benign-drift dashboard; drill reports; AI-VEX feed | SLA-bounded controls-update log; clean-state attestations; quarterly continuity-test report; HITL-fatigue dashboard within thresholds | External observability dataset; named contributions to CoSAI IR / OWASP / ATLAS; coordinated-disclosure leadership artifacts |
Live observation requirements
The assessor MUST observe at least one live action per high-risk-tier agent in the assessed scope. Specifically:
- An L3+ assessment requires: live OTel trace + live PDP decision + live HITL gate fire (synthetic if necessary).
- An L4 assessment requires the above plus: live behavioral-drift event from the agent behavioral monitoring system + live red-team eval run.
- An L5 assessment requires the above plus: live closed-loop incident replay (an alert firing → controls update closing the loop within SLA) + verification of the L4→L5 prerequisite gate (≥2-quarter L4 evidence, AIUC-1/ISO 42001 cert dated within last quarter, continuity-test execution proof).
- An L5+ assessment requires the above plus: live attestation chain verification (TEE-backed guardrail execution proof) OR live cascade-detection rule fire OR live cross-vendor AI-BOM reconciliation, AND verification of the named-contributor artifact.
Static configs alone do not satisfy live-observation requirements at L3+.
Stage 3 — Scoring & report (1 week)
Per-domain scoring rubric
For each of the 9 domains, the assessor scores the organization from Level 0 (no evidence of the L1 baseline) through Level 5, plus the optional L5+ tier. The rubric per cell:
| Score | Criterion |
|---|---|
| 0 | No evidence the L1 baseline exists. |
| 1 | L1 verbal evidence; no policy or artifact. |
| 2 | L1 + L2 artifacts present and verifiable. |
| 3 | L1 + L2 + L3 artifacts present, AND ID tagging is operational for findings in this domain (ASI## / AIVSS / AML.T#### / CVE), AND live observation requirement met. |
| 4 | L3 + L4 artifacts AND quantitative metrics are tracked AND multi-tool eval is operational AND ID tagging is comprehensive (no untagged findings in last 90 days). |
| 5 | L4 + L5 artifacts AND closed-loop evidence over ≥2 quarters AND L4→L5 prerequisite gate met (see below). |
| 5+ | L5 + L5+ artifacts AND research-stage primitives in production with documented exit criteria AND active named contribution to one or more standards bodies (PR / RFC / spec authorship). |
Level 3 is the auditable inflection. Below L3, the org is structurally vulnerable and the assessment is largely about whether evidence supports L2 vs L1. At L3+, the assessor is checking platform-level enforcement, ID tagging, and live behavior.
L4 → L5 is a campaign, not a step. Before scoring an organization L5 in any domain, the assessor MUST verify the prerequisite gate (per stress-test §Change 5 and the CMM page level table):
- ≥2 quarters of stable L4 operation across all 9 domains — no regression in the per-domain matrix during the look-back window. Evidence: prior assessment reports OR continuous-monitoring artifacts (KPIs, drift telemetry, red-team results, AI-BOM reconciliation) covering the period.
- AIUC-1 readiness assessment scheduled with an accredited auditor (Schellman or equivalent) OR ISO/IEC 42001 surveillance cycle in flight. Evidence: signed engagement letter or surveillance-audit report.
- Bus-factor ≥2 with documented continuity test — a deputy has executed the runbook end-to-end at least once in the look-back window (anti-pattern I3 recovery). Evidence: continuity-test report.
- Gap-closure plan from floor-domain to L5 — even if the floor is L5, the program must document what L5+ work it is or is not pursuing in each domain.
Meeting every per-domain L5 row without the gate evidence scores L4-stable, not L5. The gate is asymmetric: the same gate is NOT required to claim L4 from L3 — that jump is a step, not a campaign.
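A sketch of the L4→L5 gate expressed as an explicit check, so the asymmetry above (gate required for L5, never for L4) is mechanical rather than a judgment call. Field names are illustrative, not part of the CMM.

```python
# Illustrative encoding of the four L4->L5 gate criteria as booleans/counters.
from dataclasses import dataclass


@dataclass
class GateEvidence:
    quarters_stable_l4: int            # consecutive quarters at L4 across all 9 domains
    audit_engagement_in_flight: bool   # AIUC-1 readiness or ISO/IEC 42001 surveillance
    continuity_test_passed: bool       # a deputy executed the runbook end-to-end
    gap_closure_plan_documented: bool  # weakest domain -> L5 (or L5 -> L5+ posture)


def l5_gate_met(e: GateEvidence) -> bool:
    return (e.quarters_stable_l4 >= 2
            and e.audit_engagement_in_flight
            and e.continuity_test_passed
            and e.gap_closure_plan_documented)


def score_with_gate(per_domain_l5_rows_met: bool, e: GateEvidence) -> str:
    """Meeting every per-domain L5 row without the gate evidence scores L4-stable."""
    if per_domain_l5_rows_met and l5_gate_met(e):
        return "L5"
    return "L4-stable" if per_domain_l5_rows_met else "L4 or below"
```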
L5+ Leading Edge tier. A separate, optional tier that requires L5 across all 9 domains plus (a) at least one research-stage primitive in production deployment with documented exit criteria back to L5 if the pilot fails, and (b) active named contribution to one or more standards bodies (PR / RFC / spec authorship — not membership only). L5+ is intentionally bleeding-edge and unachievable without category-creation work. Most assessments terminate at L5; L5+ scoring is appropriate for frontier labs, hyperscaler platforms, and dedicated AI-security research shops.
Aggregation rule — dependency-resolved effective scores
The organization’s overall rating is reported as a per-domain matrix (raw + effective scores). Aggregation uses dependency-resolved effective scores under the active rule set documented in Effective-Score Dependency Rules. A domain’s effective score = min(raw, min over upstream-dependency raw scores).
Headline format (replaces the prior single-floor headline):
- Typical = median of effective scores across all 9 domains
- Weakest = min of effective scores, with the cap source labeled (which upstream domain set the cap, if any)
- Strongest = max of raw scores, with the domain labeled
- Strategic rationale field for any domain whose raw score is intentionally below its peers (architectural-containment trade-offs)
Cherry-picking is prevented by mandatory matrix disclosure: any rating claim must publish the full per-domain matrix (raw + effective) and the active rule-set version. Reports that cite a single domain’s score without the matrix are non-compliant. This replaces the prior single-floor rule (CMMC import) which misreported 3 of 5 realistic archetypes per the stress test (Stripe-style architectural-containment, Microsoft Agent 365-driven, resource-constrained startup all under-reported).
Active rule set (v1, 2026-05-04): DR-001 D2 caps D5 (per-agent identity required for per-agent egress enforcement), DR-002 D2 caps D7 (per-agent identity required for behavioral attribution), DR-003 D3 caps D4 (PDP decisions required for runtime guardrail enforcement). See dependency-rules page for promotion criteria, candidate registry, and revision protocol.
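A sketch of the aggregation rule under rule set v1 (DR-001 through DR-003). Levels are encoded as integers (0 through 5, with 6 available for L5+) so min and median stay numeric; that encoding and the example scores are assumptions of this sketch, not part of the CMM.

```python
# Dependency-resolved effective scores and the three-number headline, per rule set v1.
from statistics import median

# Upstream dependency rules: capped domain -> upstream domains that cap it
# (DR-001: D2 caps D5; DR-002: D2 caps D7; DR-003: D3 caps D4).
RULES_V1 = {"D5": ["D2"], "D7": ["D2"], "D4": ["D3"]}


def effective_scores(raw: dict[str, int],
                     rules: dict[str, list[str]] = RULES_V1) -> dict[str, int]:
    """effective = min(raw, min over upstream-dependency raw scores)."""
    return {d: min([raw[d]] + [raw[u] for u in rules.get(d, [])]) for d in raw}


def headline(raw: dict[str, int],
             rules: dict[str, list[str]] = RULES_V1) -> dict[str, float]:
    eff = effective_scores(raw, rules)
    return {
        "typical": median(eff.values()),   # median of effective scores
        "weakest": min(eff.values()),      # min of effective scores
        "strongest": max(raw.values()),    # max of raw scores
    }


# Invented example: strong everywhere except identity (D2), which caps D5 and D7.
raw = {"D1": 4, "D2": 2, "D3": 4, "D4": 4, "D5": 4, "D6": 3, "D7": 4, "D8": 3, "D9": 3}
print(effective_scores(raw))  # D5 and D7 drop to 2 because D2 = 2
print(headline(raw))          # typical 3 / weakest 2 / strongest 4
```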
Gap report structure
Final report contains, at minimum:
- Executive summary — three-number headline (typical / weakest / strongest), three-sentence framing, active rule-set version cited.
- Per-domain matrix — 9 rows (D1–D9) × per-row columns: raw level, effective level, cap source (which upstream-dependency rule fired, if any), verdict per L1–L5+ criterion. The L5+ column may be left as “n/a” if the engagement does not target L5+.
- Weakest-domain explanation — which domain holds the weakest effective score, whether a dependency cap fired, and the strategic rationale (if any) for an intentional trade-off (Stripe-style architectural-containment).
- ID-tagged finding registry — every finding with ASI## / AIVSS score / AML.T#### / CVE.
- Crosswalk extract — for each L4+ finding, the corresponding Annex IV / AIUC-1 / ISO 42001 anchor (per Agentic AI Security CMM — Standards Crosswalk Matrix).
- Top 5 prioritized recommendations — what would move the weakest effective score up by one level (and any candidate dependency-rule promotions to monitor).
- Re-assessment cadence — recommendation for next assessment date (tied to AIUC-1 quarterly cadence at L5).
- Active rule-set version — must be cited (e.g. “scored under dependency-rules v1, 2026-05-04”). When the rule set is revised, prior assessments retain their original version; re-scoring under a new version is a separate engagement.
Sample assessment timeline
For a mid-size enterprise with ~30 agents in scope:
| Week | Activity |
|---|---|
| -2 | Scope letter signed; document request list issued |
| -1 | Documents received; initial gap scan |
| 1 | Kickoff; D1 + D2 interviews; identity-graph review |
| 2 | D3 + D4 interviews; live PDP / guardrail observation |
| 3 | D5 + D6 + D7 interviews; behavioral-monitoring / RAG attestation review |
| 4 | D8 + D9 interviews; AI-BOM reconciliation; decommission drill |
| 5 | Synthetic incidents fired across 3 agents (if scope permits) |
| 6 | Scoring synthesis; gap report draft |
| 7 | Report review with org; final report delivered |
Assessor competence requirements
Borrowed from ISO/IEC 42006:2025 (auditor competence) and CMMC C3PAO licensing patterns. The assessor MUST demonstrate:
- Operational experience with at least 4 of the 9 domains.
- Working knowledge of: OWASP ASI Top 10, OWASP AIVSS v0.8, MITRE ATLAS v5.4.0, NIST AI RMF + 600-1, ISO/IEC 42001, EU AI Act high-risk classification.
- Experience reading and validating: OTel `gen_ai.*` traces, AI-BOM (CycloneDX/SPDX), Cedar/OPA policies, MCP server configs, sigstore signatures.
- No conflict of interest (the assessor’s firm did not architect or operate any agent in scope within the last 12 months).
Differences from existing audit programs
| Existing program | Difference vs this protocol |
|---|---|
| ISO/IEC 42001 audit | Governance-heavy; weak on technical AI controls. This protocol pulls technical evidence into stage 2 live observation. |
| AIUC-1 (Schellman) | 4–8 week scope; six pillars. This protocol’s 9 domains are more granular and require multi-tool eval at L4. |
| BSIMM | Descriptive only; no levels. This protocol uses BSIMM-style observation but adds CMMC-style cumulative levels. |
| CMMC 2.0 | Three levels; defense-contractor scope. This protocol uses five levels and is AI-specific. |
| SOC 2 | Type 1 / Type 2 Trust Services Criteria. This protocol’s scope is narrower (agentic AI) and deeper. |
Open gaps in this protocol
Known unfilled spots
- Quantitative metric thresholds at L4. “Quantitative HITL-fatigue indicators” should have specific thresholds (rubber-stamp rate < X%, queue age p95 < Y minutes) — these are TBD pending production data from early adopters.
- Synthetic incident library. Stage 2 calls for synthetic incidents but no library exists yet. Candidates: PoisonedRAG corpus injection, ClawHavoc-class skill swap, prompt-injection via retrieved doc, A2A impersonation.
- Self-attestation form. Some orgs will start with a self-assessment before engaging an external assessor. A self-attestation form would mirror this protocol but with relaxed live-observation requirements.
- Continuous-assessment mode. Some orgs will want continuous (vs annual) assessment — what does the protocol look like in always-on mode? Mindgard CART is the closest model on the testing side.
Relations
- Companion to: Agentic AI Security Capability Maturity Model — A 2026 Practical Proposal — supplies the assessment instrument the CMM lacked.
- Companion to: Agentic AI Security CMM — Standards Crosswalk Matrix — assessor uses crosswalk in stage 3 step 5.
- Resolves: Validation: Agentic AI Security CMM vs Widely Adopted Standards §6 recommendation #2.