Agentic AI Security CMM — Effective-Score Dependency Rules
This page defines the dependency-resolved effective-score mechanism that replaces the single cumulative floor as the CMM’s headline aggregation rule. The page is intentionally scaffolded: a small, conservative active rule set (v1 = 3 rules) plus a candidate-rules registry that is populated as the wiki accumulates new attack-path evidence and practitioner architectures.
Why this exists
The prior single-floor rule (imported from CMMC 2.0) misreported 3 of 5 realistic archetypes in the 2026-05-02 stress test — Stripe-style architectural-containment, Microsoft Agent 365-driven, and resource-constrained startup all received headline ratings that materially under-reported the program. The L5/L5+ split adopted on 2026-05-04 also broke the floor rule’s premise that domains are interchangeable units. Dependency-resolved scoring replaces the blunt min() with substantive cross-domain caps anchored to documented attack paths — and explicitly tracks which caps we have evidence for vs. which are still candidates.
The effective-score formula
Each domain D has two scores:
- Raw score — the assessor’s per-domain rating against the L1–L5 (and optionally L5+) criteria in the CMM
- Effective score — the raw score, capped by the raw scores of the domain’s upstream dependencies:
min(raw_score(D), min over dep in dependencies(D) of raw_score(dep))
In pseudocode:
```python
def effective_score(domain, raw_scores, active_rules):
    deps = [rule.upstream for rule in active_rules if rule.downstream == domain]
    if not deps:
        return raw_scores[domain]
    cap = min(raw_scores[d] for d in deps)
    return min(raw_scores[domain], cap)
```

The headline is no longer a single number. It is a three-number summary:
- Typical = median of effective scores across all 9 domains
- Weakest = min of effective scores (with the domain that set it labeled, plus any cap that fired)
- Strongest = max of raw scores (labeled with the domain)
Plus the full per-domain matrix (raw + effective + which caps fired). Plus an optional strategic rationale field for any domain whose raw score is intentionally below its peers (Stripe-style architectural-containment trade-offs).
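The three-number headline can be computed directly from the two score maps. A minimal sketch, with illustrative numbers only (not from any real assessment) and L1–L5 represented as integers 1–5:

```python
from statistics import median

def headline(raw, effective):
    """Three-number summary: typical / weakest / strongest.

    Typical and weakest use effective scores; strongest uses raw,
    per the headline definition above."""
    weakest = min(effective, key=effective.get)
    strongest = max(raw, key=raw.get)
    return {
        "typical": median(effective.values()),
        "weakest": (weakest, effective[weakest]),
        "strongest": (strongest, raw[strongest]),
    }

# Illustrative scores only:
raw = {"D1": 3, "D2": 4, "D3": 3, "D4": 3, "D5": 5,
       "D6": 3, "D7": 2, "D8": 3, "D9": 3}
effective = dict(raw, D5=4)  # DR-001: D2 caps D5 to raw(D2)=4
summary = headline(raw, effective)
```

Note that the strongest entry deliberately reports the raw score (here D5 at 5) even though the effective score is capped; the cap that fired belongs in the per-domain matrix.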
Active rules — v1 (2026-05-04, 3 rules)
These are the rules currently in force. Conservative on purpose: every active rule has a clear cross-domain attack path documented in the wiki and a directional rationale (why upstream caps downstream, not the other way around).
| ID | Rule | Direction | Evidence anchor | Adopted | Notes |
|---|---|---|---|---|---|
| DR-001 | D2 caps D5 | effective(D5) ≤ raw(D2) | Lethal trifecta: without per-agent identity (D2), per-agent egress policy (D5) cannot be enforced — any agent can impersonate any other agent at the network boundary. The egress gateway has nothing to bind policy to. | 2026-05-04 | Stripe and Salesforce both treat D2 as the precondition for meaningful D5 enforcement |
| DR-002 | D2 caps D7 | effective(D7) ≤ raw(D2) | Without per-agent identity (D2), behavioral anomalies (D7) can only be attributed at fleet level. The Salesforce Rittinghouse 1.8M-prompts-to-30-alerts pipeline depends on per-agent identity to make alerts actionable. | 2026-05-04 | Distinct from DR-001: identity gates attribution in D7, not just enforcement in D5 |
| DR-003 | D3 caps D4 | effective(D4) ≤ raw(D3) | Without policy decisions (D3 PDP), runtime guardrails (D4 PEP) have nothing to enforce — the lifecycle hook fires but no policy decision exists to evaluate against. The Sondera Cedar harness makes this explicit: D4 is structurally downstream of D3 in agentic enforcement | 2026-05-04 | The reverse cap (D4 → D3) is also partially true but weaker; we adopt only the stronger direction |
Promotion threshold met for DR-001/002/003: each has ≥2 wiki-documented practitioner architectures (Stripe + Salesforce + AgentCordon for DR-001/002; Sondera + AgentCordon for DR-003) and a clear lethal-trifecta-class attack path.
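The v1 active set is small enough to encode directly. A minimal sketch compatible with the effective_score pseudocode above (the field names are illustrative, not a normative schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DependencyRule:
    rule_id: str
    upstream: str    # domain whose raw score imposes the cap
    downstream: str  # domain whose effective score is capped

# The v1 active set, directions as in the table above:
ACTIVE_RULES_V1 = [
    DependencyRule("DR-001", upstream="D2", downstream="D5"),
    DependencyRule("DR-002", upstream="D2", downstream="D7"),
    DependencyRule("DR-003", upstream="D3", downstream="D4"),
]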
Candidate rules registry
Proposed rules whose evidence is suggestive but not yet sufficient for active promotion. Add new candidates here freely. Promotion to active happens at quarterly CMM revisions (or sooner with explicit wiki ingest evidence).
| ID | Proposed rule | Direction | Evidence shape we’d want | Status | Notes |
|---|---|---|---|---|---|
| DR-C001 | D8 caps D6 | effective(D6) ≤ raw(D8) | ≥2 documented incidents where supply-chain compromise (D8 weak) corrupted data integrity (D6) — e.g. ClawHavoc-class skill swap poisoning a downstream RAG corpus | candidate | Likely promotion in 2026-Q4 once 2+ cross-domain incidents are catalogued; currently 1 (ClawHavoc) |
| DR-C002 | D5 caps D7 | effective(D7) ≤ raw(D5) | Production cases where egress is the only signal source for detection — when D5 is L1, D7 has no telemetry to monitor | candidate | Stripe archetype is the counter-example: their architectural containment makes D5 a primary signal source even with lower D7. Hold pending more data on whether this pattern is general or Stripe-specific |
| DR-C003 | D4 caps D5 | effective(D5) ≤ raw(D4) | Runtime guardrail bypass enabling egress bypass; or runtime hook gap allowing direct OS-level egress | candidate — weak directionality | Runtime and egress are co-load-bearing in most architectures; directionality is unclear. Park until a clear asymmetric attack path is documented |
| DR-C004 | D6 caps D4 | effective(D4) ≤ raw(D6) | Poisoned RAG (PoisonedRAG, ConfusedPilot — see memory-poisoning concept) corrupting runtime decisions | candidate — needs production evidence | The dependency exists conceptually, but the evidence is still research-stage rather than production. Re-check when AgentDojo / equivalent benchmarks publish cross-domain bypass results |
| DR-C005 | D9 caps D2 | effective(D2) ≤ raw(D9) | Operational decommission failures leaving identity-bound credentials live after agent retirement | candidate — operational-vs-technical boundary | Likely belongs as a soft cap (rate-of-decay rather than hard min), not a hard cap. Defer until soft-cap semantics are designed |
| DR-C006 | D1 caps everything | effective(D*) ≤ raw(D1) | Programs with L1 governance that nonetheless ship strong technical controls — does the governance gap actually undermine the technical controls? | candidate — likely rejected | Existing wiki evidence suggests technical controls operate independently of governance maturity in the moment; governance shows up over time, not at enforcement time. Park as a likely non-rule unless evidence flips |
Promotion criteria
A candidate rule is promoted to active when at least one of the following is met, AND the rule is reviewed at the next quarterly CMM revision:
- ≥2 documented incidents in the wiki where the dependency manifests as a real attack path (incident pages with cross-domain causation noted)
- ≥1 peer-reviewed paper or vendor-published threat-model establishing the dependency as substantive (not theoretical)
- ≥2 practitioner architectures documented in the wiki (talks, deployments, vendor whitepapers) where the dependency is treated as load-bearing
- Synthetic-incident library coverage — if the measurement protocol’s synthetic-incident library (currently a known gap) covers the cross-domain attack path with a documented test case
Any one of these four criteria is sufficient. The rule’s evidence anchor in the active table MUST cite the qualifying source(s).
Deprecation criteria
An active rule is deprecated when:
- Counter-evidence accumulates — ≥2 documented practitioner architectures where the dependency is not load-bearing (e.g. Stripe-style architectural patterns where the upstream domain is structurally bypassed without compromising the downstream domain)
- Quarterly revision finds the rule no longer reflects practice (consensus call, documented in the revision log)
- A more precise rule replaces it (e.g. soft caps, conditional caps, archetype-specific caps)
Deprecated rules stay in the registry with status: deprecated and a deprecation rationale. They no longer affect new assessments, but they are retained so historical reports can be reproduced.
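One way to keep deprecated rules queryable without letting them affect new assessments is a status field on each registry entry. A sketch — the enum values mirror the statuses used on this page, but the schema itself is illustrative:

```python
from dataclasses import dataclass
from enum import Enum

class RuleStatus(Enum):
    ACTIVE = "active"
    CANDIDATE = "candidate"
    DEPRECATED = "deprecated"

@dataclass
class RegistryEntry:
    rule_id: str
    upstream: str
    downstream: str
    status: RuleStatus
    rationale: str = ""  # evidence anchor, or deprecation rationale

def rules_in_force(registry):
    # Only active rules shape new assessments; candidate and deprecated
    # entries stay in the registry for promotion review and for
    # reproducing historical reports.
    return [r for r in registry if r.status is RuleStatus.ACTIVE]
```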
Revision protocol
| When | What |
|---|---|
| Any time | New candidates can be added to the candidate-rules table by anyone editing this page. Add id, proposed rule, direction, evidence shape we’d want, status: candidate, notes. |
| Each wiki ingest of an incident | Check whether the new incident provides cross-domain evidence relevant to an existing candidate. If so, add the citation to that candidate’s notes column. |
| Quarterly (Q1 / Q2 / Q3 / Q4) | Review all candidates against promotion criteria. Promote, hold, or reject. Increment rule-set version on any promotion or deprecation (v1 → v2 → …). Log the revision in wiki/log.md and append to the revision history below. |
| CMM major revision | Re-validate active rules against the latest evidence; deprecate rules that no longer reflect practice. |
Reporting impact
The measurement protocol’s gap report changes shape. Old format:
Headline: L1 (floor — D9 set the floor)
Matrix: D1=L3 D2=L4 D3=L4 D4=L3 D5=L4 D6=L3 D7=L2 D8=L3 D9=L1
New format (Stripe-style architectural-containment archetype example, under v1 rules):
Headline:
Typical (median effective): L3
Weakest: D7 effective L2 (raw L2; the DR-002 cap from D2 did not bind)
Strongest: D5 raw L4-L5 (effective L4 — capped by DR-001 from D2)
Strategic rationale: D7 light by deliberate trade-off — D3+D5 architectural containment per Stripe Bullen talk
Per-domain matrix (raw / effective / cap source):
D1: L3 / L3 / —
D2: L4 / L4 / —
D3: L4 / L4 / —
D4: L3 / L3 / capped by DR-003 to raw(D3)=L4 (no effect — raw already L3)
D5: L4-L5 / L4 / capped by DR-001 to raw(D2)=L4
D6: L3 / L3 / —
D7: L2 / L2 / capped by DR-002 to raw(D2)=L4 (no effect — raw already L2)
D8: L3 / L3 / —
D9: L3 / L3 / —
Active rule set: v1 (DR-001, DR-002, DR-003)
The headline is now informative — it shows the program’s shape rather than collapsing it to a single misleading number.
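The per-domain matrix above can be reproduced mechanically from the raw scores and the v1 rule set. A sketch (D5’s raw L4-L5 is represented as 5, its upper end; the rule encoding is illustrative):

```python
# downstream -> upstream capping domains, per the v1 active rules
RULES_V1 = {"D5": ["D2"],   # DR-001
            "D7": ["D2"],   # DR-002
            "D4": ["D3"]}   # DR-003

raw = {"D1": 3, "D2": 4, "D3": 4, "D4": 3, "D5": 5,
       "D6": 3, "D7": 2, "D8": 3, "D9": 3}

effective = {d: min([s] + [raw[u] for u in RULES_V1.get(d, [])])
             for d, s in raw.items()}
# Only DR-001 binds: D5 drops from 5 to raw(D2)=4. DR-002 (on D7) and
# DR-003 (on D4) impose caps above the raw score, so they have no effect.
```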
Worked examples — re-running the stress-test archetypes
Comparison of the 5 archetypes from the stress test under the old floor vs. v1 effective-score:
| Archetype | Old floor (single number) | v1 effective-score headline (typical / weakest / strongest) | Improvement vs old? |
|---|---|---|---|
| Stripe-style architectural-containment | L2 | L3 typical / L2 D7 (intentional trade-off) / L4 D5 (capped by DR-001 from D2) | Yes — typical L3 reflects the program; D7 honestly noted as weakest with rationale |
| Microsoft Agent 365-driven | L2 | L3 typical / L2 D9 (no upstream cap) / L5 D2 | Yes — D9 ops lag doesn’t drag D2 down (no D9→D2 rule in v1; DR-C005 is candidate not active) |
| Startup with bus-factor 1 | L1 | L3 typical / L1 D9 (bus factor) / L3 D2/D3/D4/D5 | Yes — technical maturity isn’t dragged down |
| Regulated FS (balanced L3-L4) | L3 | L3-L4 typical / L3 weakest / L4 strongest | Equivalent — fair under both rules |
| Multi-cloud (balanced L3-L4) | L3 | L3-L4 typical / L3 weakest / L4 strongest | Equivalent — fair under both rules |
Net effect of v1 rules: the 3 archetypes the floor misreported are now reported fairly; the 2 archetypes the floor reported fairly are still reported fairly. Cherry-picking is now prevented by mandatory matrix disclosure + strategic-rationale field rather than by mathematical aggregation.
What this does NOT do
- Does not eliminate the cross-domain attack-path concern. DR-001/002/003 capture the strongest known cases. Future incidents and architectures will surface more (the candidates are the parking lot).
- Does not allow cherry-picking. Reports MUST publish the full matrix; reports that cite a single domain’s score without the matrix are non-compliant with the measurement protocol (anti-pattern B2 reframed accordingly).
- Does not replace the L4→L5 prerequisite gate (≥2 quarters stable L4, AIUC-1 readiness scheduled, bus-factor ≥2, continuity test). Effective-score is aggregation; the prerequisite gate is eligibility for L5 claims. Both apply.
- Does not address weighted scoring. All 9 domains are still treated as equally important when computing typical/weakest/strongest. Domain weighting (e.g. for high-risk-tier applications) is a separate question parked under the agent-archetype tailoring open gap on the CMM page.
Open questions / known unknowns
Things this scaffolding doesn't yet handle
- Soft caps vs hard caps. DR-C005 (D9 caps D2) is a strong candidate for soft capping (operational lag degrades technical controls over time, not in the moment). The current schema only supports hard caps. Soft-cap semantics are a v2+ design problem.
- Conditional caps. Some caps may only apply for specific application archetypes (e.g. D4 caps D5 may apply for consumer-facing chatbots but not for internal agent platforms). The current schema doesn’t support conditions.
- Multi-hop transitive caps. If D2 caps D5 and D5 caps D7 (DR-C002 candidate), should D2 transitively cap D7 via D5? Currently each rule is independent. Worth re-examining if DR-C002 is promoted.
- Rule interactions. Two rules pointing at the same downstream domain currently take min() of their upstream caps. This is the conservative choice, but it may be wrong where the caps are partially redundant (capture the same attack path). No counter-evidence yet, but flagged for review.
- Negative rules / floor-relaxation. Should there be rules that raise an effective score (e.g. D3+D5 both at L4 raises the effective ceiling on D7 for the Stripe-archetype case, since architectural containment substitutes for behavioral observability)? Currently rules can only cap, not relax. v2+ design problem.
- Scoring stability across rule-set versions. When v1 → v2 promotes a new active rule, prior assessments’ headlines may shift. The protocol should specify which rule set a published rating was computed under (annotate as “v1 effective-score” or similar).
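On the multi-hop question above: if transitivity were ever adopted, one way to get it without special-casing chains is to iterate the cap computation to a fixpoint, using effective (not raw) upstream scores until nothing changes. This is an alternative semantics, not the current v1 rule, which is a single pass against raw scores; sketch:

```python
def effective_fixpoint(raw, rules):
    """Iterate caps until stable so chains like D2->D5->D7 propagate.

    `rules` maps downstream domain -> list of upstream capping domains."""
    eff = dict(raw)
    changed = True
    while changed:
        changed = False
        for down, ups in rules.items():
            cap = min(eff[u] for u in ups)
            if cap < eff[down]:
                eff[down] = cap
                changed = True
    return eff

# Hypothetical chain: D2 caps D5 (DR-001) and D5 caps D7 (candidate DR-C002).
raw = {"D2": 2, "D5": 4, "D7": 5}
eff = effective_fixpoint(raw, {"D5": ["D2"], "D7": ["D5"]})
```

Under this semantics D7 ends at 2 (capped through D5 by D2); under the current single-pass rule it would only be capped to raw(D5)=4.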
Revision history
| Version | Date | Changes | Active rule count |
|---|---|---|---|
| v1 | 2026-05-04 | Initial scaffolding. 3 active rules (DR-001 D2→D5, DR-002 D2→D7, DR-003 D3→D4) anchored to lethal-trifecta + Sondera/AgentCordon evidence. 6 candidate rules parked. | 3 |
Relations
- Replaces: the single cumulative-floor rule in CMM 2026 (imported from CMMC 2.0)
- Operationalized by: Measurement Protocol §Floor rule (rewritten 2026-05-04 to point here)
- Resolves: stress test §Change 2 (matrix-as-primary view) and §Change 4 (D7 contradiction recommendation) — both adopted via the new effective-score headline format
- Reframes: Anti-Pattern B1 (cumulative-floor demoralizes — mostly resolved) and Anti-Pattern B2 (cherry-picking — reframed as disclosure-discipline failure)
- Updates: Counter-Arguments Thesis 4 — wiki’s stated position changes from “keep floor” to “replace floor with dependency-resolved effective scores”
- Anchored to: Lethal Trifecta (DR-001, DR-002 directional rationale); Sondera Cedar harness (DR-003 directional rationale); Salesforce Rittinghouse (DR-002 production evidence); Stripe Bullen (Stripe archetype worked example); AgentCordon (DR-001/003 OSS reference architecture)