Agentic AI Security CMM — Measurement Protocol (Assessor’s Handbook)

This is the assessment instrument the validation page (Validation: Agentic AI Security CMM vs Widely Adopted Standards §6 rec #2) said the CMM was missing. Without it, two assessors auditing the same organization will reach different verdicts.

The protocol is modeled on BSIMM’s observation/assertion structure (descriptive — record what is actually done) layered with CMMC 2.0’s three-level assessment guide pattern (prescriptive — match observed state against documented criteria). It applies to all 9 CMM domains.

Three-stage assessment

flowchart LR
    P1[Stage 1<br/>Pre-engagement] --> P2[Stage 2<br/>Evidence collection]
    P2 --> P3[Stage 3<br/>Scoring & report]
    P1 -.- D1[Scope letter<br/>Agent inventory<br/>Document request list]
    P2 -.- D2[Interview script<br/>Artifact checklist<br/>Live observation]
    P3 -.- D3[Per-domain matrix<br/>Three-number headline<br/>Gap report]

Stage 1 — Pre-engagement (1–2 weeks)

The org under assessment delivers:

  1. Scope letter identifying which agents are in scope. Each agent gets an Agent Card (system manifest) with: name, owner (human), purpose, data classifications touched, tools/MCP servers used, deployment shape (chatbot / coding tool / RAG / etc.), production status, and downstream consumers — a minimal sketch appears after this list.
  2. Agent inventory export — the full registry, even if some agents are out-of-scope for this assessment. Required so the assessor can detect shadow agents.
  3. Document request list response. Standard requests: AI security policy, IR runbook, last red-team report, AI-BOM artifact, gateway config, identity graph export, latest decommission drill report, last quarterly board AI-risk pack.

If any of these documents is missing, that's an automatic L1 in the relevant domain.
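For illustration, the Agent Card from item 1 could be captured as a small structured record. This is a sketch only; the field names and example values are assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentCard:
    """Illustrative Agent Card manifest; field names are assumptions, not a required schema."""
    name: str
    owner: str                       # accountable human, not a team alias
    purpose: str
    data_classifications: list[str]  # e.g. ["PII", "financial"]
    tools: list[str]                 # tools / MCP servers the agent may call
    deployment_shape: str            # "chatbot" | "coding tool" | "RAG" | ...
    production: bool
    downstream_consumers: list[str] = field(default_factory=list)

# Hypothetical entry for the scope letter:
card = AgentCard(
    name="invoice-triage-agent",
    owner="jane.doe@example.com",
    purpose="Classify inbound invoices and draft approval requests",
    data_classifications=["PII", "financial"],
    tools=["erp-mcp-server", "email-send"],
    deployment_shape="RAG",
    production=True,
    downstream_consumers=["payments-workflow"],
)
```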

Stage 2 — Evidence collection (2–4 weeks)

Three parallel tracks: interviews, artifacts, live observation.

Interview script (per domain)

Each domain has a structured interview block. Sample questions are NOT exhaustive — the assessor follows up on every “yes we do that” with “show me.” Pure verbal evidence is L2 at best; L3+ requires artifact corroboration.

D1 Governance

  • Who chairs the AI Risk Committee? When did it last meet? Show the minutes.
  • How is an agent’s risk tier assigned? Show the rubric.
  • Who can approve a high-risk agent for production? Show one approval.
  • Does the board get AI-risk reporting? Show the most recent pack.

D2 Identity & Authorization

  • Show me the identity for agent [X]. Trace one of its actions back to the human owner.
  • What happens when the human owner of agent [X] leaves the company? Walk me through.
  • Show me a credential proxy log for agent [X]. Confirm the agent process never sees the underlying credential.
  • How is agent [X]’s identity attested? (SPIFFE / OAuth 2.1 / OIDC / Microsoft Entra Agent ID / Okta for AI Agents.)

D3 Control & Least-Agency

  • Show me agent [X]’s tier (auto / notify / confirm / block) per tool. Who decides?
  • Show me the PDP config in production. What happens if the PDP is unreachable?
  • Trigger a synthetic high-risk-tier action for agent [X] — does HITL fire?
  • Show me a lethal-trifecta detection event from the last 30 days.

D4 Runtime & Guardrails

  • What guardrails sit in front of agent [X]’s LLM call? In-line, sidecar, or external?
  • What’s the bypass-class coverage of your input filter? (English-only? Multilingual? Leetspeak?)
  • Show me an AlignmentCheck firing on a real agent run.
  • What’s your sandbox grain — per-call, per-task, per-agent? Show the sandbox config.

D5 Egress & Network

  • What proxy / gateway sits between agent [X] and external tools?
  • How does agent [X] get a token to call MCP server [Y]? Show the exchange.
  • Show me a tool-poisoning detection event. What does the gateway do when it fires?
  • Where does agent [X]’s outbound traffic actually go? Show the egress allowlist.

D6 Data, Memory & RAG

  • For RAG: show me document attestation at ingest. Show a poisoned-document detection.
  • For memory: how do you detect memory poisoning? Show a recent detection.
  • Show me the cognitive file integrity baseline for agent [X]’s IDENTITY.md / system prompt.
  • Are canary tokens deployed in the system prompt? When was the last leak alert?

D7 Observability & Detection

  • Show me OTel gen_ai.* traces for an agent run end-to-end.
  • Show me a behavioral-drift alert from the agent behavioral monitoring system within the last quarter.
  • Walk me through a multi-tool red-team eval — which tools were used (Promptfoo / PyRIT / Garak / Mindgard CART)?
  • Show me an MCP CVEs Q1 2026-class CVE alert flowing through your detection pipeline.
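When validating the gen_ai.* traces asked for above, the assessor can spot-check an exported run programmatically. A minimal sketch, assuming the run's spans have already been exported to JSON and flattened into dicts with an `attributes` mapping (the file name and accessor shape are assumptions about the exporter, not part of the protocol):

```python
import json

def genai_coverage(spans: list[dict]) -> float:
    """Fraction of spans that carry at least one gen_ai.* attribute."""
    spans_with_attrs = [s for s in spans if isinstance(s.get("attributes"), dict)]
    if not spans_with_attrs:
        return 0.0
    tagged = [s for s in spans_with_attrs
              if any(k.startswith("gen_ai.") for k in s["attributes"])]
    return len(tagged) / len(spans_with_attrs)

# Hypothetical export of one end-to-end agent run:
with open("agent_run_trace.json") as fh:
    spans = json.load(fh)["spans"]

print(f"gen_ai.* attribute coverage: {genai_coverage(spans):.0%}")
```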

D8 Supply Chain & AI-BOM

  • Show me the AI-BOM for agent [X] (build-time and runtime).
  • Show me a sigstore signature for one of your skills / models.
  • Show me a registry-scan finding from Aguara Watch / SecureClaw / equivalent.
  • Walk me through how you detect a ClawHavoc-class supply-chain event.

D9 Operations & Human Factors

  • What’s the p99 latency budget for your guardrail stack? Show the dashboard.
  • What’s your fail-mode for a guardrail timeout — fail-closed or fail-open? Show the test.
  • When was the last decommission drill? Show the report.
  • What’s your HITL approval-rate? Show the rubber-stamp metric (approval-rate without comment).
  • Show me a system-prompt leak test result and your canary-token deployment.
  • What’s your model-deprecation policy? Show the version-pin register for agent [X].
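The rubber-stamp metric referenced above is straightforward to compute from the HITL approval log. A minimal sketch, assuming each record exposes an outcome and an optional reviewer comment (field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class HitlDecision:
    outcome: str       # "approved" | "rejected" | "escalated"
    comment: str = ""  # reviewer's free-text justification, if any

def rubber_stamp_rate(decisions: list[HitlDecision]) -> float:
    """Share of approvals issued without any reviewer comment."""
    approvals = [d for d in decisions if d.outcome == "approved"]
    if not approvals:
        return 0.0
    silent = sum(1 for d in approvals if not d.comment.strip())
    return silent / len(approvals)
```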

Artifact checklist (required per level)

| Domain | L2 artifacts | L3 artifacts | L4 artifacts | L5 artifacts (achievable today) | L5+ artifacts (leading-edge) |
|---|---|---|---|---|---|
| D1 | Policy doc; RACI | Risk Committee minutes; deployment-gate evidence | KPI dashboard; board pack; gap report; standards crosswalk matrix | Most-recent AIUC-1 / ISO 42001 cert; board-attested risk metrics; ≥1-year committee minutes | Named-contributor evidence; published research; external observability dataset |
| D2 | Agent inventory | Identity graph; sample audit trail; OIDC tokens | Cred-proxy logs; Cedar/OPA repo; tabletop drill report | Registry export; ISPM dashboard; SPIFFE-JWT-SVID chain; coupled-credential migration report | NIST CAISI participation; cross-platform identity federation report |
| D3 | Tool allowlist config | PDP config; tier assignments per agent | Promotion-gate runbook (org-authored); HITL telemetry; trifecta-detection log | Warrant samples; step-up logs; per-release policy-compile artifact; cryptographic SoD evidence | CaMeL production deployment evidence; formal-verification reports; temporal-logic policy artifact |
| D4 | Provider safety config | Hook code; firewall logs; sandbox config | AlignmentCheck logs; CodeShield findings; grounding scores | Platform-enforcement coverage report (zero opt-outs); multi-language eval log; classifier refresh receipts; response-leak alert log; latency/cost dashboard with fail-closed proof | TEE attestation chain; CaMeL split production evidence; bypass-class eval with remediation timeline |
| D5 | Outbound proxy config | Gateway config; certs; A2A enforcement profile | Token-exchange logs; rule sets; CVE-tagged log | Mesh topology with zero-bypass proof; per-task token samples; SSRF closure verification; CVE-feed auto-quarantine log | Sigstore-for-MCP verifier; A2A drift rule library; cross-cloud reconciliation report |
| D6 | Source labels | Scan results; CFI baseline | Attestation logs; rollback drill report | Drift dashboard; threshold-justification memo; conflict-flagging logs; canary-token deployment log; rollback drill RTO report | Per-doc attestation chain; taint-lattice implementation; ZK-proof verifier logs |
| D7 | Tool-call audit log | Trace samples; span schema validation | Behavioral-monitoring dashboards; multi-tool eval reports with ID tags | DeepTracing graph; agent-aware playbook samples; prompt-volume-to-alert dashboard ≥1 quarter; analyst-actionable rate report | Cascade rule registry with thresholds; multi-agent joint-baseline statistics; forward-pass activation monitor |
| D8 | Inventory | AI-BOM artifact; sigstore log | Sig-verified registry; reconciliation report; ID-tagged ML-VEX | Closed-loop diagram with SLA evidence; SLSA L3 attestation; runtime/build AI-BOM reconciliation; ML-VEX feed | SLSA L4 report; cross-vendor AI-BOM federation; standards-WG named contribution |
| D9 | Runbook artifact | Latency/cost dashboard; reaper logs; canary deployment proof | HITL-fatigue KPIs; benign-drift dashboard; drill reports; AI-VEX feed | SLA-bounded controls-update log; clean-state attestations; quarterly continuity-test report; HITL-fatigue dashboard within thresholds | External observability dataset; named contributions to CoSAI IR / OWASP / ATLAS; coordinated-disclosure leadership artifacts |

Live observation requirements

The assessor MUST observe at least one live action per high-risk-tier agent in the assessed scope. Specifically:

  • An L3+ assessment requires: a live OTel trace + a live PDP decision + a live HITL gate firing (synthetic if necessary).
  • An L4 assessment requires the above plus: a live behavioral-drift event from the agent behavioral monitoring system + a live red-team eval run.
  • An L5 assessment requires the above plus: a live closed-loop incident replay (an alert firing → a controls update closing the loop within SLA) + verification of the L4→L5 prerequisite gate (≥2 quarters of L4 evidence, AIUC-1/ISO 42001 cert dated within the last quarter, continuity-test execution proof).
  • An L5+ assessment requires the above plus: live attestation-chain verification (TEE-backed guardrail execution proof) OR a live cascade-detection rule firing OR a live cross-vendor AI-BOM reconciliation, AND verification of the named-contributor artifact.

Static configs alone do not satisfy live-observation requirements at L3+.

Stage 3 — Scoring & report (1 week)

Per-domain scoring rubric

For each of the 9 domains, the assessor scores the organization from Level 0 (no evidence of the L1 baseline) through Level 5, plus the optional 5+ tier. The rubric for each score:

| Score | Criterion |
|---|---|
| 0 | No evidence the L1 baseline exists. |
| 1 | L1 verbal evidence; no policy or artifact. |
| 2 | L1 + L2 artifacts present and verifiable. |
| 3 | L1 + L2 + L3 artifacts present, AND ID tagging is operational for findings in this domain (ASI## / AIVSS / AML.T#### / CVE), AND live observation requirement met. |
| 4 | L3 + L4 artifacts AND quantitative metrics are tracked AND multi-tool eval is operational AND ID tagging is comprehensive (no untagged findings in last 90 days). |
| 5 | L4 + L5 artifacts AND closed-loop evidence over ≥2 quarters AND L4→L5 prerequisite gate met (see below). |
| 5+ | L5 + L5+ artifacts AND research-stage primitives in production with documented exit criteria AND active named contribution to one or more standards bodies (PR / RFC / spec authorship). |
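Because the rubric is cumulative, it can be expressed as a short decision procedure. A minimal sketch, where the boolean inputs stand in for the evidence checks the assessor performs; none of these names are part of the CMM itself:

```python
def score_domain(
    l1_baseline_evidenced: bool,     # at least verbal evidence that the L1 baseline exists
    artifacts: set[str],             # verified artifact levels present, e.g. {"L2", "L3"}
    id_tagging_operational: bool,    # findings carry ASI## / AIVSS / AML.T#### / CVE tags
    id_tagging_comprehensive: bool,  # no untagged findings in the last 90 days
    live_observation_met: bool,
    quant_metrics_tracked: bool,
    multi_tool_eval_operational: bool,
    closed_loop_two_quarters: bool,
    l4_to_l5_gate_met: bool,
    research_primitive_in_prod: bool,
    named_standards_contribution: bool,
) -> str:
    """Return the highest rubric score whose cumulative criteria are all met."""
    if not l1_baseline_evidenced:
        return "0"
    score = "1"
    if "L2" in artifacts:
        score = "2"
    if score == "2" and "L3" in artifacts and id_tagging_operational and live_observation_met:
        score = "3"
    if (score == "3" and "L4" in artifacts and quant_metrics_tracked
            and multi_tool_eval_operational and id_tagging_comprehensive):
        score = "4"
    if score == "4" and "L5" in artifacts and closed_loop_two_quarters and l4_to_l5_gate_met:
        score = "5"
    if (score == "5" and "L5+" in artifacts and research_primitive_in_prod
            and named_standards_contribution):
        score = "5+"
    return score
```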

Level 3 is the auditable inflection point. Below L3, the org is structurally vulnerable and the assessment is largely about whether the evidence supports L2 vs L1. At L3+, the assessor is checking platform-level enforcement, ID tagging, and live behavior.

L4 → L5 is a campaign, not a step. Before scoring an organization L5 in any domain, the assessor MUST verify the prerequisite gate (per stress-test §Change 5 and the CMM page level table):

  1. ≥2 quarters of stable L4 operation across all 9 domains — no regression in the per-domain matrix during the look-back window. Evidence: prior assessment reports OR continuous-monitoring artifacts (KPIs, drift telemetry, red-team results, AI-BOM reconciliation) covering the period.
  2. AIUC-1 readiness assessment scheduled with an accredited auditor (Schellman or equivalent) OR ISO/IEC 42001 surveillance cycle in flight. Evidence: signed engagement letter or surveillance-audit report.
  3. Bus-factor ≥2 with documented continuity test — a deputy has executed the runbook end-to-end at least once in the look-back window (anti-pattern I3 recovery). Evidence: continuity-test report.
  4. Gap-closure plan from floor-domain to L5 — even if the floor is L5, the program must document what L5+ work it is or is not pursuing in each domain.

Meeting every per-domain L5 row without the gate evidence scores L4-stable, not L5. The gate is asymmetric: the same gate is NOT required to claim L4 from L3 — that jump is a step, not a campaign.

L5+ Leading Edge tier. A separate, optional tier that requires L5 across all 9 domains plus (a) at least one research-stage primitive in production deployment with documented exit criteria back to L5 if the pilot fails, and (b) active named contribution to one or more standards bodies (PR / RFC / spec authorship — not membership only). L5+ is intentionally bleeding-edge and unachievable without category-creation work. Most assessments terminate at L5; L5+ scoring is appropriate for frontier labs, hyperscaler platforms, and dedicated AI-security research shops.

Aggregation rule — dependency-resolved effective scores

The organization’s overall rating is reported as a per-domain matrix (raw + effective scores). Aggregation uses dependency-resolved effective scores under the active rule set documented in Effective-Score Dependency Rules. A domain’s effective score = min(raw, min over upstream-dependency raw scores).

Headline format (replaces the prior single-floor headline):

  • Typical = median of effective scores across all 9 domains
  • Weakest = min of effective scores, with the cap source labeled (which upstream domain set the cap, if any)
  • Strongest = max of raw scores, with the domain labeled
  • Strategic rationale field for any domain whose raw score is intentionally below its peers (architectural-containment trade-offs)

Cherry-picking is prevented by mandatory matrix disclosure: any rating claim must publish the full per-domain matrix (raw + effective) and the active rule-set version. Reports that cite a single domain’s score without the matrix are non-compliant. This replaces the prior single-floor rule (a CMMC import), which misreported 3 of 5 realistic archetypes in the stress test (the Stripe-style architectural-containment, Microsoft Agent 365-driven, and resource-constrained-startup archetypes were all under-reported).

Active rule set (v1, 2026-05-04): DR-001 D2 caps D5 (per-agent identity required for per-agent egress enforcement), DR-002 D2 caps D7 (per-agent identity required for behavioral attribution), DR-003 D3 caps D4 (PDP decisions required for runtime guardrail enforcement). See dependency-rules page for promotion criteria, candidate registry, and revision protocol.
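A minimal sketch of the aggregation rule under the v1 rule set. The rule encoding, helper names, and example raw scores are illustrative only, not a prescribed implementation:

```python
from statistics import median

# Hypothetical raw per-domain scores, for illustration only.
RAW = {"D1": 4, "D2": 2, "D3": 4, "D4": 4, "D5": 4, "D6": 3, "D7": 4, "D8": 3, "D9": 4}

# Active rule set v1 (2026-05-04): downstream domain -> upstream domains that cap it.
DEPENDENCY_RULES = {
    "D5": ["D2"],  # DR-001: per-agent identity required for per-agent egress enforcement
    "D7": ["D2"],  # DR-002: per-agent identity required for behavioral attribution
    "D4": ["D3"],  # DR-003: PDP decisions required for runtime guardrail enforcement
}

def effective_scores(raw: dict[str, int]) -> dict[str, int]:
    """effective = min(raw, min over upstream-dependency raw scores)."""
    return {d: min([raw[d]] + [raw[u] for u in DEPENDENCY_RULES.get(d, [])]) for d in raw}

eff = effective_scores(RAW)
weakest_domain = min(eff, key=eff.get)
strongest_domain = max(RAW, key=RAW.get)

print("Typical  :", median(eff.values()))                    # median of effective scores
print("Weakest  :", weakest_domain, eff[weakest_domain])     # cap source reported in the matrix
print("Strongest:", strongest_domain, RAW[strongest_domain]) # max of raw scores
```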

Gap report structure

Final report contains, at minimum:

  1. Executive summary — three-number headline (typical / weakest / strongest), three-sentence framing, active rule-set version cited.
  2. Per-domain matrix — 9 rows (D1–D9) × per-row columns: raw level, effective level, cap source (which upstream-dependency rule fired, if any), verdict per L1–L5+ criterion. The L5+ column may be left as “n/a” if the engagement does not target L5+.
  3. Weakest-domain explanation — which domain holds the weakest effective score, whether a dependency cap fired, and the strategic rationale (if any) for an intentional trade-off (Stripe-style architectural-containment).
  4. ID-tagged finding registry — every finding with ASI## / AIVSS score / AML.T#### / CVE.
  5. Crosswalk extract — for each L4+ finding, the corresponding Annex IV / AIUC-1 / ISO 42001 anchor (per Agentic AI Security CMM — Standards Crosswalk Matrix).
  6. Top 5 prioritized recommendations — what would move the weakest effective score up by one level (and any candidate dependency-rule promotions to monitor).
  7. Re-assessment cadence — recommendation for next assessment date (tied to AIUC-1 quarterly cadence at L5).
  8. Active rule-set version — must be cited (e.g. “scored under dependency-rules v1, 2026-05-04”). When the rule set is revised, prior assessments retain their original version; re-scoring under a new version is a separate engagement.

Sample assessment timeline

For a mid-size enterprise with ~30 agents in scope:

| Week | Activity |
|---|---|
| -2 | Scope letter signed; document request list issued |
| -1 | Documents received; initial gap scan |
| 1 | Kickoff; D1 + D2 interviews; identity-graph review |
| 2 | D3 + D4 interviews; live PDP / guardrail observation |
| 3 | D5 + D6 + D7 interviews; behavioral-monitoring / RAG attestation review |
| 4 | D8 + D9 interviews; AI-BOM reconciliation; decommission drill |
| 5 | Synthetic incidents fired across 3 agents (if scope permits) |
| 6 | Scoring synthesis; gap report draft |
| 7 | Report review with org; final report delivered |

Assessor competence requirements

Borrowed from ISO/IEC 42006:2025 (auditor competence) and CMMC C3PAO licensing patterns. The assessor MUST demonstrate:

  1. Operational experience with at least 4 of the 9 domains.
  2. Working knowledge of: OWASP ASI Top 10, OWASP AIVSS v0.8, MITRE ATLAS v5.4.0, NIST AI RMF + 600-1, ISO/IEC 42001, EU AI Act high-risk classification.
  3. Experience reading and validating: OTel gen_ai.* traces, AI-BOM (CycloneDX/SPDX), Cedar/OPA policies, MCP server configs, sigstore signatures.
  4. No conflict of interest (the assessor’s firm did not architect or operate any agent in scope within the last 12 months).

Differences from existing audit programs

| Existing program | Difference vs this protocol |
|---|---|
| ISO/IEC 42001 audit | Governance-heavy; weak on technical AI controls. This protocol pulls technical evidence into stage 2 live observation. |
| AIUC-1 (Schellman) | 4–8 week scope; six pillars. This protocol’s 9 domains are more granular and require multi-tool eval at L4. |
| BSIMM | Descriptive only; no levels. This protocol uses BSIMM-style observation but adds CMMC-style cumulative levels. |
| CMMC 2.0 | Three levels; defense-contractor scope. This protocol uses five levels and is AI-specific. |
| SOC 2 | Type 1 / Type 2 Trust Services Criteria. This protocol’s scope is narrower (agentic AI) and deeper. |

Open gaps in this protocol

Known unfilled spots

  1. Quantitative metric thresholds at L4. “Quantitative HITL-fatigue indicators” should have specific thresholds (rubber-stamp rate < X%, queue age p95 < Y minutes) — these are TBD pending production data from early adopters.
  2. Synthetic incident library. Stage 2 calls for synthetic incidents but no library exists yet. Candidates: PoisonedRAG corpus injection, ClawHavoc-class skill swap, prompt-injection via retrieved doc, A2A impersonation.
  3. Self-attestation form. Some orgs will start with a self-assessment before engaging an external assessor. A self-attestation form would mirror this protocol but with relaxed live-observation requirements.
  4. Continuous-assessment mode. Some orgs will want continuous (vs annual) assessment — what does the protocol look like in always-on mode? Mindgard CART is the closest model on the testing side.

Relations