CMM Calibration Stress Test
Closes peer-review-readiness §3: “L4 → L5 calibration suspect; cumulative-floor rule un-stress-tested. The L4→L5 jump is much bigger than L3→L4. Separately, ‘floor across all 9 domains’ sounds rigorous but is operationally onerous — most real orgs would self-assess at L1 because of one weak domain. CMMI itself debates this trade-off; ours just declares the rule.”
Two separable problems are addressed below: (1) the L4→L5 jump is asymmetric relative to L3→L4 and mixes shippable controls with research-stage capabilities; (2) the floor rule is unforgiving in ways that under-report strong-but-asymmetric programs. The stress test runs 5 realistic org archetypes against the current rubric and surfaces the systematic biases.
Headline conclusion
The L5 tier as currently defined mixes “achievable maturity” with “leading edge.” A clean split into L5 (Optimizing — stable, achievable) and L5+ (Leading Edge — research-stage) resolves the asymmetric jump without losing aspirational signal. The floor rule should be kept (the alternative is cherry-picking, which is the bigger failure per the anti-patterns catalog B2) but the per-domain matrix becomes the primary report view, with the floor as headline. These changes do not lower the bar; they make the bar more navigable.
Part 1 — The L4→L5 jump is asymmetric
Current criteria
| Level | What’s added vs prior level |
|---|---|
| L1 | Baseline (ad hoc) |
| L2 | Written policy + manual inventory + some prompt-level guardrails |
| L3 | Every agent has its own identity + platform-level hooks + AI-BOM + IR runbook |
| L4 | Quantitative metrics continuously tracked + behavioral drift detection + ≥quarterly red-team + credential proxy + per-task sandbox |
| L5 | Platform-level enforcement everywhere + Proof-of-Guardrail attestation + real-time AI-BOM + multi-agent behavioral monitoring with cascade detection + contributes to standards (CoSAI, OWASP, AIUC-1) + AIUC-1 cert |
The asymmetric jump
L3→L4 adds capabilities most orgs at L3 can build in 1–2 quarters:
- “Behavioral drift detection” → ship Vectra / Miggo / SecureClaw against the baselines from L3
- “Red-team eval ≥quarterly” → schedule Promptfoo / PyRIT / Garak / Mindgard
- “Credential proxy” → deploy AgentKeys / Keychains.dev / Aegis
- “Sandbox per high-risk task” → containerize agent invocations
- “Quantitative metrics” → wire to existing SIEM
L4→L5 mixes several fundamentally different kinds of work:
| L5 criterion | Type | Achievability today |
|---|---|---|
| Platform-level enforcement everywhere | Maturity scale-up | Achievable (extend L4 platform-level to all surfaces) |
| AIUC-1 cert against most recent quarterly refresh | Compliance milestone | Gated by Schellman queue (single accredited auditor); quarterly refresh adds maintenance burden |
| Real-time AI-BOM (Miggo DeepTracing) | Vendor-product dependency | Achievable but single-vendor |
| Proof-of-Guardrail TEE attestation | Research-stage | No integrated product ships (RA explicitly flags as research-stage) |
| Multi-agent behavioral monitoring + cascade detection | Research-stage | Per Multi-Agent Runtime Security: SentinelAgent / TraceAegis / Bi-Level GAD are papers, not products |
| Standards contributions (CoSAI / OWASP / AIUC-1) | Org-character | Small orgs cannot contribute to standards regardless of technical maturity |
L5 mixes stable maturity (platform-level enforcement everywhere) with research-stage capabilities (TEE attestation, multi-agent cascade detection) with org-character (standards contributions). An organization can be the most mature platform-level-enforcement shop in its industry and still fail L5 because it doesn’t publish to OWASP. A peer reviewer is right to call this miscalibrated.
Recommendation — L5 / L5+ split
| Tier | Criteria | What it signals |
|---|---|---|
| L5 — Optimizing (stable, achievable) | Platform-level enforcement everywhere; AIUC-1 readiness assessed against most recent quarterly refresh; quantitative drift / red-team / cred-proxy / sandbox running ≥1 year continuously; per-domain matrix shows L4+ in all 9 domains (no D9 caveat at L5); deputy + runbook-continuity-test (anti-pattern I3) | “This org operates at the upper bound of currently-shippable agentic-AI security.” Achievable by ~5–10% of agentic-AI-deploying orgs in 2026 |
| L5+ — Leading Edge (research-stage) | All of L5, plus: Proof-of-Guardrail TEE attestation in production; multi-agent behavioral monitoring with cascade-detection rules + thresholds; real-time AI-BOM with cross-vendor reconciliation; active standards contributions (named contributor in CoSAI / OWASP / AIUC-1) | “This org is at or beyond the literature.” Achievable by <1% in 2026; ambition for 2027–2028 |
This preserves the aspirational signal (L5+ exists, has rigor) without making “stable maturity” gated on research-stage capabilities. The wiki’s CMM should adopt this split.
Implications for the existing rubric
If the L5/L5+ split is adopted, the following current-L5 criteria move to L5+:
- Proof-of-Guardrail TEE attestation
- Multi-agent behavioral monitoring with cascade detection
- Real-time AI-BOM with cross-vendor reconciliation
- Active standards contribution
And these stay at L5:
- Platform-level enforcement everywhere
- AIUC-1 cert against current quarterly refresh
- Per-domain matrix L4+ across all 9 domains
- ≥1 year continuous L4 operation
- Bus-factor ≥ 2 with continuity test
The jump from L4→L5 becomes capability scale-up + duration + bus factor, which is the “managed → optimizing” transition CMMI defined. The jump from L5→L5+ is standards contribution + research-stage tooling adoption, which is honestly different work.
Part 2 — Cumulative-floor stress test
The floor rule (CMMC import): “The organization’s overall rating is the floor across all 9 domains. An organization rated L4 in D2 and L1 in D9 is rated L1 overall.” Stress-tested against 5 archetypes below.
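As an algorithm the floor rule is a single `min` over the nine domain levels — a minimal sketch (the domain keys and example profile are illustrative, not rubric-canonical):

```python
# Hedged sketch of the cumulative-floor rule: the headline rating is the
# minimum level across all assessed domains. Scores below are illustrative.

def headline_floor(domain_levels: dict[str, int]) -> int:
    """Overall rating = floor (minimum) across all assessed domains."""
    return min(domain_levels.values())

# An otherwise-L3/L4 profile with one weak domain:
scores = {"D1": 3, "D2": 4, "D3": 4, "D4": 3, "D5": 4,
          "D6": 3, "D7": 2, "D8": 3, "D9": 3}
print(headline_floor(scores))  # prints 2 — D7 alone drags the headline to L2
```

The one-line `min` is the whole rule; everything the archetypes below surface is a consequence of that single aggregation choice.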
Archetype 1 — Stripe-style architectural-containment
Per Bullen’s talk: the program leans hard on egress containment + sensitive-action HITL + ToolAnnotations; deliberately lighter on behavioral monitoring + AI-SPM (the talk is explicit about this trade-off).
| Domain | Likely level | Notes |
|---|---|---|
| D1 Governance | L3 | Decision rights documented |
| D2 Identity | L4 | Workspace identity + SPIFFE-class workload IDs |
| D3 Control | L4 | ToolAnnotations + HITL on writes |
| D4 Runtime | L3 | Sandbox + Safe Search; not full L4 (no continuous behavioral drift on the model layer) |
| D5 Egress | L4–L5 | Smokescreen + Toolshed + tagged-services CI gate. Arguably L5 architectural maturity |
| D6 Data | L3 | RAG governance + write classification |
| D7 Observability | L2 | Bullen explicit: they don’t run heavy behavioral monitoring; D7 contradiction callout on the CMM page is exactly this case |
| D8 Supply Chain | L3 | Typosquat-aware install gate + ToolAnnotations on install |
| D9 Operations | L3 | Pending Actions UI + LLM-as-2nd-reviewer + named team |
| Headline floor | L2 | Driven by D7 |
The headline rating dramatically under-reports. This is the case the CMM’s own D7 contradiction callout was added to flag.
Archetype 2 — Microsoft Agent 365-driven enterprise
Hyperscaler-platform-driven. Strong on identity (Entra Agent ID), data (Purview), observability (Defender for Cloud Apps); weak on governance discipline + cross-cloud + ops.
| Domain | Likely level | Notes |
|---|---|---|
| D1 Governance | L3 | Standard governance practice |
| D2 Identity | L5 | Microsoft Agent 365 Registry — this is named as L5 evidence in the CMM |
| D3 Control | L3 | Default Agent 365 controls |
| D4 Runtime | L3 | Prompt Shields + content safety |
| D5 Egress | L3 | mTLS via Entra; no cross-cloud broker |
| D6 Data | L3 | Purview |
| D7 Observability | L4 | Defender for Cloud Apps + Sentinel |
| D8 Supply Chain | L3 | Standard MS supply-chain |
| D9 Operations | L2 | Hyperscaler-platform-buy = coverage-complete anti-pattern (H1); ops discipline often lags procurement |
| Headline floor | L2 | Driven by D9 |
Same pattern: strong domains pulled down by one weak domain. The wiki’s H1 anti-pattern (hyperscaler-buy = coverage-complete) names this exact failure shape.
Archetype 3 — Startup with strong AI security and limited resources
Small team, modern stack, high technical maturity but limited operational resources.
| Domain | Likely level | Notes |
|---|---|---|
| D1 Governance | L2 | Documented but ad-hoc |
| D2 Identity | L3 | Okta for AI Agents |
| D3 Control | L3 | Cedar/OPA + HITL |
| D4 Runtime | L3 | LlamaFirewall stack |
| D5 Egress | L3 | AgentGateway |
| D6 Data | L2 | Basic RAG governance; no attestation |
| D7 Observability | L2 | LangSmith only |
| D8 Supply Chain | L2 | Inventory + manual scanning |
| D9 Operations | L1 | Single person; no deputy |
| Headline floor | L1 | Bus-factor-1 anti-pattern (I3) |
Under-reports the technical maturity (L3 across most operational domains). The recovery path is the anti-patterns catalog I3 hard requirement (bus factor ≥ 2 before D9 can reach L3) — but pre-recovery, the floor rule rates the org L1 despite L3 operation across most domains.
Archetype 4 — Regulated financial services
Mature governance + identity programs from compliance lineage; mid-tier on AI-specific runtime / supply-chain.
| Domain | Likely level | Notes |
|---|---|---|
| D1 Governance | L4 | Existing risk-mgmt program extended to AI |
| D2 Identity | L4 | NHI program from existing IAM maturity |
| D3 Control | L4 | Decision rights from compliance lineage |
| D4 Runtime | L3 | LlamaFirewall + sandbox |
| D5 Egress | L4 | Mature network security extended |
| D6 Data | L3 | RAG governance |
| D7 Observability | L4 | SOC integration |
| D8 Supply Chain | L3 | AI-BOM exists |
| D9 Operations | L3 | Compliance team baseline |
| Headline floor | L3 | Most balanced archetype |
This is the archetype that maps cleanly to the current rubric. Floor rule fairly reflects the org.
Archetype 5 — Multi-cloud / multi-vendor program
Cross-vendor coverage by design; weakest on standards contribution and the AIUC-1 cert (small team).
| Domain | Likely level | Notes |
|---|---|---|
| D1 Governance | L3 | Cross-cloud governance documented |
| D2 Identity | L4 | SPIFFE/SPIRE + per-cloud identity bridges |
| D3 Control | L4 | Vendor-neutral PDP (Cedar/OPA) |
| D4 Runtime | L3 | Mixed runtime stack |
| D5 Egress | L4 | AgentGateway-LF cross-cloud |
| D6 Data | L4 | Cross-cloud data residency + jurisdiction tagging |
| D7 Observability | L4 | OTel gen_ai.* cross-cloud |
| D8 Supply Chain | L3 | AI-BOM per cloud |
| D9 Operations | L3 | Strong cross-cloud ops |
| Headline floor | L3 | Could attempt L4; AIUC-1 cert is the next gate |
Healthy program; the floor rule fairly reflects it.
Stress-test verdict
| Archetype | Headline floor | Per-domain matrix range | Floor under-reports? |
|---|---|---|---|
| Stripe-style architectural-containment | L2 | L2–L5 | Yes — significantly |
| Microsoft Agent 365-driven | L2 | L2–L5 | Yes — significantly |
| Startup with strong AI security | L1 | L1–L3 | Yes — moderately |
| Regulated financial services | L3 | L3–L4 | No — fair |
| Multi-cloud program | L3 | L3–L4 | No — fair |
The floor rule is fair when programs are roughly balanced, and unfair-but-honest when programs have one structurally weak domain. The fix is transparency, not abandoning the floor.
Part 3 — Recommended calibration changes
Change 1 — Adopt L5 / L5+ split
Per Part 1. The wiki’s CMM L5 row should split into:
- L5 Optimizing (stable maturity)
- L5+ Leading Edge (research-stage capabilities + standards contribution)
Implementation: keep the current L5 description but reframe research-stage items as L5+; add explicit L5/L5+ distinction in the level table.
Change 2 — Per-domain matrix becomes primary report view; floor is headline
Per Part 2. The measurement protocol already requires the per-domain matrix; this change elevates it from supporting evidence to primary report view. The recommendation:
- Headline rating = floor across 9 domains (preserves cherry-picking discipline)
- Primary report view = per-domain matrix with all 9 levels visible
- Required L4+ artifact = gap-closure plan from floor-domain to next level
- Auditor disclosure = reports MUST cite floor + matrix together; floor without matrix is non-compliant
This is a documentation / reporting change, not a level-criteria change. It does not relax the floor rule; it makes the rule navigable.
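A minimal sketch of what a compliant report under this change could emit — field names and layout are illustrative, not a mandated schema; the only hard requirement modeled is the floor-plus-matrix disclosure:

```python
# Hedged sketch of the Change-2 report format: the floor stays the headline,
# but a report without the full per-domain matrix is non-compliant.
# Layout and labels are illustrative, not a defined schema.

def render_report(domain_levels: dict[str, int]) -> str:
    floor = min(domain_levels.values())
    floor_domains = [d for d, lvl in domain_levels.items() if lvl == floor]
    lines = [f"Headline (floor across {len(domain_levels)} domains): L{floor}",
             "Per-domain matrix (primary view):"]
    for d, lvl in sorted(domain_levels.items()):
        lines.append(f"  {d}: L{lvl}")
    # Required L4+ artifact: gap-closure plan for each floor domain
    lines.append(f"Gap-closure plan required for: {', '.join(floor_domains)}")
    return "\n".join(lines)

scores = {"D1": 4, "D2": 4, "D3": 4, "D4": 3, "D5": 4,
          "D6": 3, "D7": 4, "D8": 3, "D9": 3}
print(render_report(scores))
```

The design point is that the floor and the matrix are produced by the same function, so a floor-only report cannot be emitted by accident.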
Change 3 — Document the 5 archetypes
Add to the measurement protocol (or a sister page) the 5 archetype profiles + their typical floor pattern + the dominant gap to close. Helps orgs orient: “we look like Archetype 1 (Stripe-style); the wiki tells us our floor will be D7-driven; the recovery is documented in B1 + behavioral baseline maturity ladder.”
Change 4 — D7 contradiction callout converges on a recommendation
The CMM page already carries a [!contradiction] callout (added during the Bullen-talk follow-up) noting that Stripe-tier architectural-containment can rationally score lower on D7 and still be sound. Recommendation: keep D7 L4 criteria as-is (multi-tool red-team + behavioral drift wired to SIEM); add an explicit Stripe-archetype acknowledgement — when D3 + D5 are L4+ AND the program documents the architectural-containment rationale, the D7 weak-floor is a labeled trade-off, not a hidden gap. The floor rule still applies; the matrix view shows readers what kind of L2-floor they’re looking at.
Change 5 — L4→L5 prerequisite gate
Before claiming L5 attempt, the org must show:
- ≥2 quarters of stable L4 operation across all 9 domains
- AIUC-1 readiness assessment scheduled with Schellman (or accredited equivalent)
- Named individual responsible for standards contribution work (even if work hasn’t started)
- Bus-factor ≥ 2 with documented continuity test
This converts L4→L5 from a step to a campaign — closer to how CMMI’s “Maturity Level 4 → 5” is treated in practice.
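The gate reads naturally as an all-true checklist evaluated before an L5 attempt — a sketch with illustrative field names (`quarters_stable_l4` and friends are assumptions, not a defined schema):

```python
# Hedged sketch of the Change-5 prerequisite gate. An L5 attempt is allowed
# only when every gate item holds; field names are illustrative.

def l5_attempt_allowed(org: dict) -> bool:
    gates = [
        org.get("quarters_stable_l4", 0) >= 2,        # across all 9 domains
        org.get("aiuc1_readiness_scheduled", False),  # accredited auditor booked
        org.get("standards_owner_named", False),      # even if work not started
        org.get("bus_factor", 1) >= 2
        and org.get("continuity_test_documented", False),
    ]
    return all(gates)

org = {"quarters_stable_l4": 3, "aiuc1_readiness_scheduled": True,
       "standards_owner_named": True, "bus_factor": 2,
       "continuity_test_documented": True}
print(l5_attempt_allowed(org))  # prints True
```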
Open issues
What this analysis doesn't yet cover
- L5+ adoption signal. No org publicly claims L5+ today. The wiki should track which orgs first claim it (probably Anthropic / Google / Microsoft on their own platforms; possibly UiPath via AIUC-1 cert + standards contribution).
- Quantitative threshold for “L4 stable for ≥2 quarters.” What does “stable” mean — no regressions in the matrix? No P1 incidents? No drift > X%? Currently undefined.
- Archetype-vs-floor mapping accuracy. The 5 archetypes here are first-principles. Real audit data would show how often each profile appears + how often the floor matches reality.
- Multi-cloud archetype’s L5 path. The multi-cloud archetype is structurally well-positioned but has no documented path to L5 because cross-cloud AIUC-1 cert is procedurally complex. Worth a separate analysis.
- L5+ vs aspirational drift. If L5+ becomes the new aspirational target, does the same calibration debate happen later? CMMI faced this; the wiki should pre-empt with explicit “L5+ is intentionally bleeding-edge and may be unachievable without category-creation.”
Where this leaves the framework
This page does NOT unilaterally rewrite the CMM. It surfaces the calibration analysis the gap doc demanded and recommends 5 changes. Adoption is a separate decision — the wiki should hold the recommendations as candidate changes pending peer review of the framework as a whole.
Adoption status — 2026-05-04 (updated same day)
Changes 1 (L5/L5+ split) and 5 (L4→L5 prerequisite gate) adopted into the CMM and measurement protocol in the 2026-05-04 revision. Every L5 row in the CMM was rewritten to point only to currently-shippable products / OSS / specs; research-stage and standards-contribution items moved to a new L5+ Leading Edge tier. The crosswalk matrix is now explicitly L5-only with L5+ deferred until standards bodies publish leading-edge guidance.
Changes 2 (matrix-as-primary-view) and 4 (D7 contradiction resolution) adopted later the same day via the new Effective-Score Dependency Rules page. The cumulative-floor rule was replaced with dependency-resolved effective scores under a small conservative active rule set (v1 = 3 rules: D2→D5, D2→D7, D3→D4). Headline format becomes typical/weakest/strongest plus the per-domain matrix. Cherry-picking is now prevented by mandatory matrix disclosure rather than by mathematical aggregation. The dependency-rule registry is intentional scaffolding with explicit promotion criteria and quarterly revision protocol — designed to grow as new attack-path evidence and practitioner architectures land in the wiki. The Stripe-archetype D7 contradiction is resolved: D7 raw L2 reports honestly with a strategic-rationale field rather than collapsing the whole rating.
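The dependency-rule semantics are not fully specified on this page; under one plausible reading — a rule Da→Db caps Db's effective level at Da's raw level, since Db's controls presuppose Da being in place — the v1 rule set and the typical/weakest/strongest headline could be sketched as:

```python
# Hedged sketch of dependency-resolved effective scores. The cap-at-upstream
# semantics for a rule "Da -> Db" is an assumption, not the page's definition.
from statistics import median

RULES_V1 = [("D2", "D5"), ("D2", "D7"), ("D3", "D4")]  # v1 active rule set

def effective_scores(raw: dict[str, int], rules=RULES_V1) -> dict[str, int]:
    eff = dict(raw)
    for upstream, downstream in rules:
        eff[downstream] = min(eff[downstream], raw[upstream])
    return eff

def headline(eff: dict[str, int]) -> str:
    levels = sorted(eff.values())
    return (f"typical L{int(median(levels))} / "
            f"weakest L{levels[0]} / strongest L{levels[-1]}")

raw = {"D1": 3, "D2": 2, "D3": 4, "D4": 3, "D5": 4,
       "D6": 3, "D7": 3, "D8": 3, "D9": 3}
# D5 and D7 are capped at D2's raw L2 under the assumed rule semantics
print(headline(effective_scores(raw)))  # prints typical L3 / weakest L2 / strongest L4
```

Note the contrast with the pure floor: the same profile would headline as L2 under the old rule, while the dependency-resolved headline preserves the weak-identity signal (weakest L2) without collapsing the whole rating.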
Change 3 (5-archetype documentation in measurement protocol) remains candidate. The 5 archetypes are documented in this stress test and the worked-examples section of the dependency-rules page; copying them into the measurement protocol as standalone templates is a documentation-clarity follow-up that doesn’t gate any other change.
Adopted or held as candidates, the changes are minimally invasive:
| Change | Page edits | Behavior change |
|---|---|---|
| L5 / L5+ split | CMM L5 row + level table; new sub-row for L5+ | Reframes existing L5 criteria; introduces L5+ as new tier |
| Matrix as primary view | Measurement protocol report-format guidance | No level-criteria change; reporting-format change |
| Document 5 archetypes | New sister page or measurement protocol annex | New documentation; no rubric change |
| D7 contradiction recommendation | CMM D7 contradiction callout finalized | Documents trade-off explicitly; no level-criteria change |
| L4→L5 prerequisite gate | CMM L5 row + measurement protocol stage-2 | Adds gate; doesn’t change L5 criteria |
See Also
- Peer-Review Readiness — origin (§3 closed by this page)
- Agentic AI Security CMM 2026 — the rubric being calibrated
- Measurement Protocol — where the matrix-as-primary-view change lives
- Anti-Patterns and Failure Modes — B1 (floor demoralizes) + B2 (cherry-picking) + I3 (bus factor 1) all anchor here
- Wiki Novelty and Counter-Arguments — §Thesis 4 (cumulative floor) is the pre-stress-test framing
- Cybersecurity CMM Exemplars — CMMI’s L4→L5 precedent
- Bullen-talk — Archetype 1 evidence