Agentic AI Security CMM — Measurement Protocol (Assessor’s Handbook)

This is the assessment instrument the validation page (Validation: Agentic AI Security CMM vs Widely Adopted Standards §6 rec #2) said the CMM was missing. Without it, two assessors auditing the same organization will reach different verdicts.

The protocol is modeled on BSIMM’s observation/assertion structure (descriptive — record what is actually done) layered with CMMC 2.0’s three-level assessment guide pattern (prescriptive — match observed state against documented criteria). It applies to all 9 CMM domains.

Three-stage assessment

flowchart LR
    P1[Stage 1<br/>Pre-engagement] --> P2[Stage 2<br/>Evidence collection]
    P2 --> P3[Stage 3<br/>Scoring & report]
    P1 -.- D1[Scope letter<br/>Agent inventory<br/>Document request list]
    P2 -.- D2[Interview script<br/>Artifact checklist<br/>Live observation]
    P3 -.- D3[Per-domain matrix<br/>Three-number headline<br/>Gap report]

Stage 1 — Pre-engagement (1–2 weeks)

The org under assessment delivers:

  1. Scope letter identifying which agents are in scope. Each agent gets an Agent Card (system manifest) with: name, owner (human), purpose, data classifications touched, tools/MCP servers used, deployment shape (chatbot / coding tool / RAG / etc.), production status, and downstream consumers — a minimal sketch appears after this list.
  2. Agent inventory export — the full registry, even if some agents are out-of-scope for this assessment. Required so the assessor can detect shadow agents.
  3. Document request list response. Standard requests: AI security policy, IR runbook, last red-team report, AI-BOM artifact, gateway config, identity graph export, latest decommission drill report, last quarterly board AI-risk pack.

If any of these documents is missing, that's an automatic L1 in the relevant domain.
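For illustration, the Agent Card from item 1 could be captured as a small structured record. This is a sketch only; the field names and example values are assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentCard:
    """Illustrative Agent Card manifest; field names are assumptions, not a required schema."""
    name: str
    owner: str                       # accountable human, not a team alias
    purpose: str
    data_classifications: list[str]  # e.g. ["PII", "financial"]
    tools: list[str]                 # tools / MCP servers the agent may call
    deployment_shape: str            # "chatbot" | "coding tool" | "RAG" | ...
    production: bool
    downstream_consumers: list[str] = field(default_factory=list)

# Hypothetical entry for the scope letter:
card = AgentCard(
    name="invoice-triage-agent",
    owner="jane.doe@example.com",
    purpose="Classify inbound invoices and draft approval requests",
    data_classifications=["PII", "financial"],
    tools=["erp-mcp-server", "email-send"],
    deployment_shape="RAG",
    production=True,
    downstream_consumers=["payments-workflow"],
)
```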

Stage 2 — Evidence collection (2–4 weeks)

Three parallel tracks: interviews, artifacts, live observation.

Interview script (per domain)

Each domain has a structured interview block. Sample questions are NOT exhaustive — the assessor follows up on every “yes we do that” with “show me.” Pure verbal evidence is L2 at best; L3+ requires artifact corroboration.

D1 Governance

  • Who chairs the AI Risk Committee? When did it last meet? Show the minutes.
  • How is an agent’s risk tier assigned? Show the rubric.
  • Who can approve a high-risk agent for production? Show one approval.
  • Does the board get AI-risk reporting? Show the most recent pack.

D2 Identity & Authorization

  • Show me the identity for agent [X]. Trace one of its actions back to the human owner.
  • What happens when the human owner of agent [X] leaves the company? Walk me through.
  • Show me a credential proxy log for agent [X]. Confirm the agent process never sees the underlying credential.
  • How is agent [X]’s identity attested? (SPIFFE / OAuth 2.1 / OIDC / Microsoft Entra Agent ID / Okta for AI Agents.)

D3 Control & Least-Agency

  • Show me agent [X]’s tier (auto / notify / confirm / block) per tool. Who decides?
  • Show me the PDP config in production. What happens if the PDP is unreachable?
  • Trigger a synthetic high-risk-tier action for agent [X] — does HITL fire?
  • Show me a lethal-trifecta detection event from the last 30 days.

D4 Runtime & Guardrails

  • What guardrails sit in front of agent [X]’s LLM call? In-line, sidecar, or external?
  • What’s the bypass-class coverage of your input filter? (English-only? Multilingual? Leetspeak?)
  • Show me an AlignmentCheck firing on a real agent run.
  • What’s your sandbox grain — per-call, per-task, per-agent? Show the sandbox config.

D5 Egress & Network

  • What proxy / gateway sits between agent [X] and external tools?
  • How does agent [X] get a token to call MCP server [Y]? Show the exchange.
  • Show me a tool-poisoning detection event. What does the gateway do when it fires?
  • Where does agent [X]’s outbound traffic actually go? Show the egress allowlist.

D6 Data, Memory & RAG

  • For RAG: show me document attestation at ingest. Show a poisoned-document detection.
  • For memory: how do you detect memory poisoning? Show a recent detection.
  • Show me the cognitive file integrity baseline for agent [X]’s IDENTITY.md / system prompt.
  • Are canary tokens deployed in the system prompt? When was the last leak alert?

D7 Observability & Detection

  • Show me OTel gen_ai.* traces for an agent run end-to-end.
  • Show me a behavioral-drift alert from the agent behavioral monitoring system within the last quarter.
  • Walk me through a multi-tool red-team eval — which tools were used (Promptfoo / PyRIT / Garak / Mindgard CART)?
  • Show me an MCP CVEs Q1 2026-class CVE alert flowing through your detection pipeline.
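When validating the gen_ai.* traces asked for above, the assessor can spot-check an exported run programmatically. A minimal sketch, assuming the run's spans have already been exported to JSON and flattened into dicts with an `attributes` mapping (the file name and accessor shape are assumptions about the exporter, not part of the protocol):

```python
import json

def genai_coverage(spans: list[dict]) -> float:
    """Fraction of spans that carry at least one gen_ai.* attribute."""
    spans_with_attrs = [s for s in spans if isinstance(s.get("attributes"), dict)]
    if not spans_with_attrs:
        return 0.0
    tagged = [s for s in spans_with_attrs
              if any(k.startswith("gen_ai.") for k in s["attributes"])]
    return len(tagged) / len(spans_with_attrs)

# Hypothetical export of one end-to-end agent run:
with open("agent_run_trace.json") as fh:
    spans = json.load(fh)["spans"]

print(f"gen_ai.* attribute coverage: {genai_coverage(spans):.0%}")
```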

D8 Supply Chain & AI-BOM

  • Show me the AI-BOM for agent [X] (build-time and runtime).
  • Show me a sigstore signature for one of your skills / models.
  • Show me a registry-scan finding from Aguara Watch / SecureClaw / equivalent.
  • Walk me through how you detect a ClawHavoc-class supply-chain event.

D9 Operations & Human Factors

  • What’s the p99 latency budget for your guardrail stack? Show the dashboard.
  • What’s your fail-mode for a guardrail timeout — fail-closed or fail-open? Show the test.
  • When was the last decommission drill? Show the report.
  • What’s your HITL approval-rate? Show the rubber-stamp metric (approval-rate without comment).
  • Show me a system-prompt leak test result and your canary-token deployment.
  • What’s your model-deprecation policy? Show the version-pin register for agent [X].
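The rubber-stamp metric referenced above is straightforward to compute from the HITL approval log. A minimal sketch, assuming each record exposes an outcome and an optional reviewer comment (field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class HitlDecision:
    outcome: str       # "approved" | "rejected" | "escalated"
    comment: str = ""  # reviewer's free-text justification, if any

def rubber_stamp_rate(decisions: list[HitlDecision]) -> float:
    """Share of approvals issued without any reviewer comment."""
    approvals = [d for d in decisions if d.outcome == "approved"]
    if not approvals:
        return 0.0
    silent = sum(1 for d in approvals if not d.comment.strip())
    return silent / len(approvals)
```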

Artifact checklist (required per level)

| Domain | L2 artifacts | L3 artifacts | L4 artifacts | L5 artifacts (achievable today) | L5+ artifacts (leading-edge) |
|---|---|---|---|---|---|
| D1 | Policy doc; RACI | Risk Committee minutes; deployment-gate evidence | KPI dashboard; board pack; gap report; standards crosswalk matrix | Most-recent AIUC-1 / ISO 42001 cert; board-attested risk metrics; ≥1-year committee minutes | Named-contributor evidence; published research; external observability dataset |
| D2 | Agent inventory | Identity graph; sample audit trail; OIDC tokens | Cred-proxy logs; Cedar/OPA repo; tabletop drill report | Registry export; ISPM dashboard; SPIFFE-JWT-SVID chain; coupled-credential migration report | NIST CAISI participation; cross-platform identity federation report |
| D3 | Tool allowlist config | PDP config; tier assignments per agent | Promotion-gate runbook (org-authored); HITL telemetry; trifecta-detection log | Warrant samples; step-up logs; per-release policy-compile artifact; cryptographic SoD evidence | CaMeL production deployment evidence; formal-verification reports; temporal-logic policy artifact |
| D4 | Provider safety config | Hook code; firewall logs; sandbox config | AlignmentCheck logs; CodeShield findings; grounding scores | Platform-enforcement coverage report (zero opt-outs); multi-language eval log; classifier refresh receipts; response-leak alert log; latency/cost dashboard with fail-closed proof | TEE attestation chain; CaMeL split production evidence; bypass-class eval with remediation timeline |
| D5 | Outbound proxy config | Gateway config; certs; A2A enforcement profile | Token-exchange logs; rule sets; CVE-tagged log | Mesh topology with zero-bypass proof; per-task token samples; SSRF closure verification; CVE-feed auto-quarantine log | Sigstore-for-MCP verifier; A2A drift rule library; cross-cloud reconciliation report |
| D6 | Source labels | Scan results; CFI baseline | Attestation logs; rollback drill report | Drift dashboard; threshold-justification memo; conflict-flagging logs; canary-token deployment log; rollback drill RTO report | Per-doc attestation chain; taint-lattice implementation; ZK-proof verifier logs |
| D7 | Tool-call audit log | Trace samples; span schema validation | Behavioral-monitoring dashboards; multi-tool eval reports with ID tags | DeepTracing graph; agent-aware playbook samples; prompt-volume-to-alert dashboard ≥1 quarter; analyst-actionable rate report | Cascade rule registry with thresholds; multi-agent joint-baseline statistics; forward-pass activation monitor |
| D8 | Inventory | AI-BOM artifact; sigstore log | Sig-verified registry; reconciliation report; ID-tagged ML-VEX | Closed-loop diagram with SLA evidence; SLSA L3 attestation; runtime/build AI-BOM reconciliation; ML-VEX feed | SLSA L4 report; cross-vendor AI-BOM federation; standards-WG named contribution |
| D9 | Runbook artifact | Latency/cost dashboard; reaper logs; canary deployment proof | HITL-fatigue KPIs; benign-drift dashboard; drill reports; AI-VEX feed | SLA-bounded controls-update log; clean-state attestations; quarterly continuity-test report; HITL-fatigue dashboard within thresholds | External observability dataset; named contributions to CoSAI IR / OWASP / ATLAS; coordinated-disclosure leadership artifacts |

Live observation requirements

The assessor MUST observe at least one live action per high-risk-tier agent in the assessed scope. Specifically:

  • An L3+ assessment requires: a live OTel trace + a live PDP decision + a live HITL gate firing (synthetic if necessary).
  • An L4 assessment requires the above plus: a live behavioral-drift event from the agent behavioral monitoring system + a live red-team eval run.
  • An L5 assessment requires the above plus: a live closed-loop incident replay (an alert firing → a controls update closing the loop within SLA) + verification of the L4→L5 prerequisite gate (≥2 quarters of L4 evidence, AIUC-1/ISO 42001 cert dated within the last quarter, continuity-test execution proof).
  • An L5+ assessment requires the above plus: live attestation-chain verification (TEE-backed guardrail execution proof) OR a live cascade-detection rule firing OR a live cross-vendor AI-BOM reconciliation, AND verification of the named-contributor artifact.

Static configs alone do not satisfy live-observation requirements at L3+.

Stage 3 — Scoring & report (1 week)

Per-domain scoring rubric

For each of the 9 domains, the assessor scores the organization from Level 0 (no evidence of the L1 baseline) through Level 5, plus the optional 5+ tier. The rubric for each score:

| Score | Criterion |
|---|---|
| 0 | No evidence the L1 baseline exists. |
| 1 | L1 verbal evidence; no policy or artifact. |
| 2 | L1 + L2 artifacts present and verifiable. |
| 3 | L1 + L2 + L3 artifacts present, AND ID tagging is operational for findings in this domain (ASI## / AIVSS / AML.T#### / CVE), AND live observation requirement met. |
| 4 | L3 + L4 artifacts AND quantitative metrics are tracked AND multi-tool eval is operational AND ID tagging is comprehensive (no untagged findings in last 90 days). |
| 5 | L4 + L5 artifacts AND closed-loop evidence over ≥2 quarters AND L4→L5 prerequisite gate met (see below). |
| 5+ | L5 + L5+ artifacts AND research-stage primitives in production with documented exit criteria AND active named contribution to one or more standards bodies (PR / RFC / spec authorship). |
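Because the rubric is cumulative, it can be expressed as a short decision procedure. A minimal sketch, where the boolean inputs stand in for the evidence checks the assessor performs; none of these names are part of the CMM itself:

```python
def score_domain(
    l1_baseline_evidenced: bool,     # at least verbal evidence that the L1 baseline exists
    artifacts: set[str],             # verified artifact levels present, e.g. {"L2", "L3"}
    id_tagging_operational: bool,    # findings carry ASI## / AIVSS / AML.T#### / CVE tags
    id_tagging_comprehensive: bool,  # no untagged findings in the last 90 days
    live_observation_met: bool,
    quant_metrics_tracked: bool,
    multi_tool_eval_operational: bool,
    closed_loop_two_quarters: bool,
    l4_to_l5_gate_met: bool,
    research_primitive_in_prod: bool,
    named_standards_contribution: bool,
) -> str:
    """Return the highest rubric score whose cumulative criteria are all met."""
    if not l1_baseline_evidenced:
        return "0"
    score = "1"
    if "L2" in artifacts:
        score = "2"
    if score == "2" and "L3" in artifacts and id_tagging_operational and live_observation_met:
        score = "3"
    if (score == "3" and "L4" in artifacts and quant_metrics_tracked
            and multi_tool_eval_operational and id_tagging_comprehensive):
        score = "4"
    if score == "4" and "L5" in artifacts and closed_loop_two_quarters and l4_to_l5_gate_met:
        score = "5"
    if (score == "5" and "L5+" in artifacts and research_primitive_in_prod
            and named_standards_contribution):
        score = "5+"
    return score
```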

Level 3 is the auditable inflection point. Below L3, the org is structurally vulnerable and the assessment is largely about whether the evidence supports L2 vs L1. At L3+, the assessor is checking platform-level enforcement, ID tagging, and live behavior.

L4 → L5 is a campaign, not a step. Before scoring an organization L5 in any domain, the assessor MUST verify the prerequisite gate (per stress-test §Change 5 and the CMM page level table):

  1. ≥2 quarters of stable L4 operation across all 9 domains — no regression in the per-domain matrix during the look-back window. Evidence: prior assessment reports OR continuous-monitoring artifacts (KPIs, drift telemetry, red-team results, AI-BOM reconciliation) covering the period.
  2. AIUC-1 readiness assessment scheduled with an accredited auditor (Schellman or equivalent) OR ISO/IEC 42001 surveillance cycle in flight. Evidence: signed engagement letter or surveillance-audit report.
  3. Bus-factor ≥2 with documented continuity test — a deputy has executed the runbook end-to-end at least once in the look-back window (anti-pattern I3 recovery). Evidence: continuity-test report.
  4. Gap-closure plan from floor-domain to L5 — even if the floor is L5, the program must document what L5+ work it is or is not pursuing in each domain.

Meeting every per-domain L5 row without the gate evidence scores L4-stable, not L5. The gate is asymmetric: the same gate is NOT required to claim L4 from L3 — that jump is a step, not a campaign.

L5+ Leading Edge tier. A separate, optional tier that requires L5 across all 9 domains plus (a) at least one research-stage primitive in production deployment with documented exit criteria back to L5 if the pilot fails, and (b) active named contribution to one or more standards bodies (PR / RFC / spec authorship — not membership only). L5+ is intentionally bleeding-edge and unachievable without category-creation work. Most assessments terminate at L5; L5+ scoring is appropriate for frontier labs, hyperscaler platforms, and dedicated AI-security research shops.

Aggregation rule — dependency-resolved effective scores

The organization’s overall rating is reported as a per-domain matrix (raw + effective scores). Aggregation uses dependency-resolved effective scores under the active rule set documented in Effective-Score Dependency Rules. A domain’s effective score = min(raw, min over upstream-dependency raw scores).

Headline format (replaces the prior single-floor headline):

  • Typical = median of effective scores across all 9 domains
  • Weakest = min of effective scores, with the cap source labeled (which upstream domain set the cap, if any)
  • Strongest = max of raw scores, with the domain labeled
  • Strategic rationale field for any domain whose raw score is intentionally below its peers (architectural-containment trade-offs)

Cherry-picking is prevented by mandatory matrix disclosure: any rating claim must publish the full per-domain matrix (raw + effective) and the active rule-set version. Reports that cite a single domain’s score without the matrix are non-compliant. This replaces the prior single-floor rule (a CMMC import), which misreported 3 of 5 realistic archetypes in the stress test (the Stripe-style architectural-containment, Microsoft Agent 365-driven, and resource-constrained-startup archetypes were all under-reported).

Active rule set (v1, 2026-05-04): DR-001 D2 caps D5 (per-agent identity required for per-agent egress enforcement), DR-002 D2 caps D7 (per-agent identity required for behavioral attribution), DR-003 D3 caps D4 (PDP decisions required for runtime guardrail enforcement). See dependency-rules page for promotion criteria, candidate registry, and revision protocol.
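A minimal sketch of the aggregation rule under the v1 rule set. The rule encoding, helper names, and example raw scores are illustrative only, not a prescribed implementation:

```python
from statistics import median

# Hypothetical raw per-domain scores, for illustration only.
RAW = {"D1": 4, "D2": 2, "D3": 4, "D4": 4, "D5": 4, "D6": 3, "D7": 4, "D8": 3, "D9": 4}

# Active rule set v1 (2026-05-04): downstream domain -> upstream domains that cap it.
DEPENDENCY_RULES = {
    "D5": ["D2"],  # DR-001: per-agent identity required for per-agent egress enforcement
    "D7": ["D2"],  # DR-002: per-agent identity required for behavioral attribution
    "D4": ["D3"],  # DR-003: PDP decisions required for runtime guardrail enforcement
}

def effective_scores(raw: dict[str, int]) -> dict[str, int]:
    """effective = min(raw, min over upstream-dependency raw scores)."""
    return {d: min([raw[d]] + [raw[u] for u in DEPENDENCY_RULES.get(d, [])]) for d in raw}

eff = effective_scores(RAW)
weakest_domain = min(eff, key=eff.get)
strongest_domain = max(RAW, key=RAW.get)

print("Typical  :", median(eff.values()))                    # median of effective scores
print("Weakest  :", weakest_domain, eff[weakest_domain])     # cap source reported in the matrix
print("Strongest:", strongest_domain, RAW[strongest_domain]) # max of raw scores
```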

Gap report structure

Final report contains, at minimum:

  1. Executive summary — three-number headline (typical / weakest / strongest), three-sentence framing, active rule-set version cited.
  2. Per-domain matrix — 9 rows (D1–D9) × per-row columns: raw level, effective level, cap source (which upstream-dependency rule fired, if any), verdict per L1–L5+ criterion. The L5+ column may be left as “n/a” if the engagement does not target L5+.
  3. Weakest-domain explanation — which domain holds the weakest effective score, whether a dependency cap fired, and the strategic rationale (if any) for an intentional trade-off (Stripe-style architectural-containment).
  4. ID-tagged finding registry — every finding with ASI## / AIVSS score / AML.T#### / CVE.
  5. Crosswalk extract — for each L4+ finding, the corresponding Annex IV / AIUC-1 / ISO 42001 anchor (per Agentic AI Security CMM — Standards Crosswalk Matrix).
  6. Top 5 prioritized recommendations — what would move the weakest effective score up by one level (and any candidate dependency-rule promotions to monitor).
  7. Re-assessment cadence — recommendation for next assessment date (tied to AIUC-1 quarterly cadence at L5).
  8. Active rule-set version — must be cited (e.g. “scored under dependency-rules v1, 2026-05-04”). When the rule set is revised, prior assessments retain their original version; re-scoring under a new version is a separate engagement.

Sample assessment timeline

For a mid-size enterprise with ~30 agents in scope:

| Week | Activity |
|---|---|
| -2 | Scope letter signed; document request list issued |
| -1 | Documents received; initial gap scan |
| 1 | Kickoff; D1 + D2 interviews; identity-graph review |
| 2 | D3 + D4 interviews; live PDP / guardrail observation |
| 3 | D5 + D6 + D7 interviews; behavioral-monitoring / RAG attestation review |
| 4 | D8 + D9 interviews; AI-BOM reconciliation; decommission drill |
| 5 | Synthetic incidents fired across 3 agents (if scope permits) |
| 6 | Scoring synthesis; gap report draft |
| 7 | Report review with org; final report delivered |

Assessor competence requirements

Borrowed from ISO/IEC 42006:2025 (auditor competence) and CMMC C3PAO licensing patterns. The assessor MUST demonstrate:

  1. Operational experience with at least 4 of the 9 domains.
  2. Working knowledge of: OWASP ASI Top 10, OWASP AIVSS v0.8, MITRE ATLAS v5.4.0, NIST AI RMF + 600-1, ISO/IEC 42001, EU AI Act high-risk classification.
  3. Experience reading and validating: OTel gen_ai.* traces, AI-BOM (CycloneDX/SPDX), Cedar/OPA policies, MCP server configs, sigstore signatures.
  4. No conflict of interest (the assessor’s firm did not architect or operate any agent in scope within the last 12 months).

Differences from existing audit programs

| Existing program | Difference vs this protocol |
|---|---|
| ISO/IEC 42001 audit | Governance-heavy; weak on technical AI controls. This protocol pulls technical evidence into stage 2 live observation. |
| AIUC-1 (Schellman) | 4–8 week scope; six pillars. This protocol’s 9 domains are more granular and require multi-tool eval at L4. |
| BSIMM | Descriptive only; no levels. This protocol uses BSIMM-style observation but adds CMMC-style cumulative levels. |
| CMMC 2.0 | Three levels; defense-contractor scope. This protocol uses five levels and is AI-specific. |
| SOC 2 | Type 1 / Type 2 Trust Services Criteria. This protocol’s scope is narrower (agentic AI) and deeper. |

Open gaps in this protocol

Known unfilled spots

  1. Quantitative metric thresholds at L4. “Quantitative HITL-fatigue indicators” should have specific thresholds (rubber-stamp rate < X%, queue age p95 < Y minutes) — these are TBD pending production data from early adopters.
  2. Synthetic incident library. Stage 2 calls for synthetic incidents but no library exists yet. Candidates: PoisonedRAG corpus injection, ClawHavoc-class skill swap, prompt-injection via retrieved doc, A2A impersonation.
  3. Self-attestation form. Some orgs will start with a self-assessment before engaging an external assessor. A self-attestation form would mirror this protocol but with relaxed live-observation requirements.
  4. Continuous-assessment mode. Some orgs will want continuous (vs annual) assessment — what does the protocol look like in always-on mode? Mindgard CART is the closest model on the testing side.

Relations