Agentic AI Security CMM — Effective-Score Dependency Rules

This page defines the dependency-resolved effective-score mechanism that replaces the single cumulative floor as the CMM’s headline aggregation rule. The page is intentionally scaffolded: a small, conservative active rule set (v1 = 3 rules) plus a candidate-rules registry that gets populated as the wiki grows new attack-path evidence and practitioner architectures.

Why this exists

The prior single-floor rule (imported from CMMC 2.0) misreported 3 of 5 realistic archetypes in the 2026-05-02 stress test — Stripe-style architectural-containment, Microsoft Agent 365-driven, and resource-constrained startup all received headline ratings that materially under-reported the program. The L5/L5+ split adopted on 2026-05-04 also broke the floor rule’s premise that domains are interchangeable units. Dependency-resolved scoring replaces the blunt min() with substantive cross-domain caps anchored to documented attack paths — and explicitly tracks which caps we have evidence for vs. which are still candidates.

The effective-score formula

Each domain D has two scores:

  • Raw score: the assessor’s per-domain rating against the L1–L5 (and optionally L5+) criteria in the CMM
  • Effective score: min(raw_score(D), min over deps in dependencies(D) of raw_score(dep))

In pseudocode:

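def effective_score(domain, raw_scores, active_rules):
    """Return the dependency-resolved score for one domain."""
    # Upstream domains that cap this domain under the active rule set.
    deps = [rule.upstream for rule in active_rules if rule.downstream == domain]
    if not deps:
        # No active rule targets this domain: the raw score stands.
        return raw_scores[domain]
    # The cap is the lowest raw score among the upstream domains.
    cap = min(raw_scores[d] for d in deps)
    return min(raw_scores[domain], cap)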

The headline is no longer a single number. It is a three-number summary:

  • Typical = median of effective scores across all 9 domains
  • Weakest = min of effective scores (with the domain that set it labeled, plus any cap that fired)
  • Strongest = max of raw scores (labeled with the domain)

The summary is accompanied by the full per-domain matrix (raw score, effective score, and which caps fired) and by an optional strategic rationale field for any domain whose raw score is intentionally below its peers (for example, Stripe-style architectural-containment trade-offs).
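The three-number summary can be sketched as follows. This is a minimal illustration, not the protocol's reference implementation: it assumes levels are encoded as integers (L1 = 1 … L5 = 5), represents rules as hypothetical (upstream, downstream) pairs, and takes its raw scores from the example matrix in the Reporting impact section.

```python
from statistics import median

# v1 active rules as (upstream, downstream) pairs: DR-001, DR-002, DR-003.
ACTIVE_RULES_V1 = [("D2", "D5"), ("D2", "D7"), ("D3", "D4")]

def effective_scores(raw, rules):
    """Apply every hard cap: effective(D) = min(raw(D), raw of each upstream)."""
    eff = {}
    for domain, score in raw.items():
        caps = [raw[up] for up, down in rules if down == domain]
        eff[domain] = min([score, *caps])
    return eff

def headline(raw, rules):
    """Build the typical / weakest / strongest summary."""
    eff = effective_scores(raw, rules)
    weakest = min(eff, key=eff.get)       # lowest effective score
    strongest = max(raw, key=raw.get)     # highest raw score (first one on ties)
    return {
        "typical": median(eff.values()),
        "weakest": (weakest, eff[weakest]),
        "strongest": (strongest, raw[strongest]),
    }

# Raw scores from the old-format example matrix (integers stand for L1..L5).
raw = {"D1": 3, "D2": 4, "D3": 4, "D4": 3, "D5": 4,
       "D6": 3, "D7": 2, "D8": 3, "D9": 1}
summary = headline(raw, ACTIVE_RULES_V1)
print(summary)
```

For that matrix the sketch reports a typical score of 3, with D9 as the weakest domain at 1. Ties for strongest resolve to the first domain in dictionary order, which is an arbitrary choice of this sketch; a real report would presumably list every tied domain.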

Active rules — v1 (2026-05-04, 3 rules)

These are the rules currently in force. Conservative on purpose: every active rule has a clear cross-domain attack path documented in the wiki and a directional rationale (why upstream caps downstream, not the other way around).

| ID | Rule | Direction | Evidence anchor | Adopted | Notes |
| --- | --- | --- | --- | --- | --- |
| DR-001 | D2 caps D5 | effective(D5) ≤ raw(D2) | Lethal trifecta: without per-agent identity (D2), per-agent egress policy (D5) cannot be enforced, because any agent can impersonate any other agent at the network boundary. The egress gateway has nothing to bind policy to. | 2026-05-04 | Stripe and Salesforce both treat D2 as the precondition for meaningful D5 enforcement |
| DR-002 | D2 caps D7 | effective(D7) ≤ raw(D2) | Without per-agent identity (D2), behavioral anomalies (D7) can only be attributed at fleet level. The Salesforce Rittinghouse 1.8M-prompts-to-30-alerts pipeline depends on per-agent identity to make alerts actionable. | 2026-05-04 | Distinct from DR-001: identity gates attribution in D7, not just enforcement in D5 |
| DR-003 | D3 caps D4 | effective(D4) ≤ raw(D3) | Without policy decisions (D3 PDP), runtime guardrails (D4 PEP) have nothing to enforce: the lifecycle hook fires but no policy decision exists to evaluate against. The Sondera Cedar harness makes this explicit, with D4 structurally downstream of D3 in agentic enforcement. | 2026-05-04 | The reverse cap (D4 → D3) is also partially true but weaker; we adopt only the stronger direction |

Promotion threshold met for DR-001/002/003: each has ≥2 wiki-documented practitioner architectures (Stripe + Salesforce + AgentCordon for DR-001/002; Sondera + AgentCordon for DR-003) and a clear lethal-trifecta-class attack path.
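One way to hold the active rules as data and report which caps actually fired is sketched below. The DependencyRule class and its field names are assumptions made for illustration, and the raw scores describe a hypothetical program with strong egress controls but weak per-agent identity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DependencyRule:
    rule_id: str
    upstream: str    # domain whose raw score sets the cap
    downstream: str  # domain being capped

V1_RULES = (
    DependencyRule("DR-001", "D2", "D5"),
    DependencyRule("DR-002", "D2", "D7"),
    DependencyRule("DR-003", "D3", "D4"),
)

def caps_fired(raw, rules):
    """Return IDs of rules whose upstream score lowers the downstream score."""
    return [r.rule_id for r in rules if raw[r.upstream] < raw[r.downstream]]

# Hypothetical raw scores (integers stand for L1..L5).
raw = {"D2": 2, "D3": 4, "D4": 3, "D5": 4, "D7": 2}
fired = caps_fired(raw, V1_RULES)
print(fired)  # ['DR-001']
```

Only DR-001 fires here: D7 already sits at the D2 level, and D4 is below D3, so DR-002 and DR-003 apply without changing any score.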

Candidate rules registry

Proposed rules whose evidence is suggestive but not yet sufficient for active promotion. Add new candidates here freely. Promotion to active happens at quarterly CMM revisions (or sooner with explicit wiki ingest evidence).

| ID | Proposed rule | Direction | Evidence shape we’d want | Status | Notes |
| --- | --- | --- | --- | --- | --- |
| DR-C001 | D8 caps D6 | effective(D6) ≤ raw(D8) | ≥2 documented incidents where supply-chain compromise (D8 weak) corrupted data integrity (D6), e.g. ClawHavoc-class skill swap poisoning a downstream RAG corpus | candidate | Likely promotion in 2026-Q4 once 2+ cross-domain incidents are catalogued; currently 1 (ClawHavoc) |
| DR-C002 | D5 caps D7 | effective(D7) ≤ raw(D5) | Production cases where egress is the only signal source for detection: when D5 is L1, D7 has no telemetry to monitor | candidate | Stripe archetype is the counter-example: their architectural containment makes D5 a primary signal source even with lower D7. Hold pending more data on whether this pattern is general or Stripe-specific |
| DR-C003 | D4 caps D5 | effective(D5) ≤ raw(D4) | Runtime guardrail bypass enabling egress bypass; or runtime hook gap allowing direct OS-level egress | candidate (weak directionality) | Runtime and egress are co-load-bearing in most architectures; directionality is unclear. Park until a clear asymmetric attack path is documented |
| DR-C004 | D6 caps D4 | effective(D4) ≤ raw(D6) | Poisoned RAG (PoisonedRAG, ConfusedPilot; see memory-poisoning concept) corrupting runtime decisions | candidate (needs production evidence) | The dependency exists conceptually; production evidence is still research-stage. Re-check when AgentDojo / equivalent benchmarks publish cross-domain bypass results |
| DR-C005 | D9 caps D2 | effective(D2) ≤ raw(D9) | Operational decommission failures leaving identity-bound credentials live after agent retirement | candidate (operational-vs-technical boundary) | Likely belongs as a soft cap (rate-of-decay rather than hard min), not a hard cap. Defer until soft-cap semantics are designed |
| DR-C006 | D1 caps everything | effective(D*) ≤ raw(D1) | Programs with L1 governance that nonetheless ship strong technical controls: does the governance gap actually undermine the technical controls? | candidate (likely rejected) | Existing wiki evidence suggests technical controls operate independently of governance maturity in the moment; governance shows up over time, not at enforcement time. Park as a likely non-rule unless evidence flips |

Promotion criteria

A candidate rule is promoted to active when it meets at least one of the following criteria and the promotion is confirmed at a quarterly CMM revision (or sooner, given explicit wiki ingest evidence):

  1. ≥2 documented incidents in the wiki where the dependency manifests as a real attack path (incident pages with cross-domain causation noted)
  2. ≥1 peer-reviewed paper or vendor-published threat-model establishing the dependency as substantive (not theoretical)
  3. ≥2 practitioner architectures documented in the wiki (talks, deployments, vendor whitepapers) where the dependency is treated as load-bearing
  4. Synthetic-incident library coverage: the measurement protocol’s synthetic-incident library (currently a known gap) covers the cross-domain attack path with a documented test case

Any of (1)–(4) is sufficient. The rule’s evidence anchor in the active table MUST cite the qualifying source(s).
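The evidence side of the promotion criteria can be expressed as a simple predicate. This is a sketch under assumptions: the CandidateEvidence record and its field names are hypothetical, and the function checks only the evidence thresholds, not the review step at a CMM revision.

```python
from dataclasses import dataclass

@dataclass
class CandidateEvidence:
    incidents: int = 0                   # criterion 1: wiki incidents with cross-domain causation
    papers_or_threat_models: int = 0     # criterion 2: peer-reviewed papers or vendor threat models
    practitioner_architectures: int = 0  # criterion 3: architectures treating the dependency as load-bearing
    synthetic_test_case: bool = False    # criterion 4: documented synthetic-incident test case

def meets_evidence_threshold(ev):
    """True when any one of the four promotion criteria is satisfied."""
    return (
        ev.incidents >= 2
        or ev.papers_or_threat_models >= 1
        or ev.practitioner_architectures >= 2
        or ev.synthetic_test_case
    )

# DR-C001 as recorded in the candidate registry: one catalogued incident (ClawHavoc).
dr_c001 = CandidateEvidence(incidents=1)
print(meets_evidence_threshold(dr_c001))  # False
```

A second catalogued incident, or any single qualifying source under criteria 2 to 4, would flip the result to True.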

Deprecation criteria

An active rule is deprecated when:

  1. Counter-evidence accumulates — ≥2 documented practitioner architectures where the dependency is not load-bearing (e.g. Stripe-style architectural patterns where the upstream domain is structurally bypassed without compromising the downstream domain)
  2. Quarterly revision finds the rule no longer reflects practice (consensus call, documented in the revision log)
  3. A more precise rule replaces it (e.g. soft caps, conditional caps, archetype-specific caps)

Deprecated rules stay in the registry with status: deprecated and a deprecation rationale. They no longer affect new assessments, but keeping them in the registry lets historical reports be reproduced.
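Keeping deprecated rules in the registry makes it possible to select the rule set that was in force for any published rating. The sketch below assumes each entry records the rule-set version that activated it and, if applicable, the version that retired it; both field names are hypothetical, and the deprecation of DR-003 shown here is invented purely to exercise the logic.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RegistryEntry:
    rule_id: str
    upstream: str
    downstream: str
    introduced_in: int                   # rule-set version that activated the rule
    deprecated_in: Optional[int] = None  # rule-set version that retired it, if any

def rules_in_force(registry, version):
    """Rules that apply to an assessment computed under the given rule-set version."""
    return [
        e for e in registry
        if e.introduced_in <= version
        and (e.deprecated_in is None or version < e.deprecated_in)
    ]

registry = [
    RegistryEntry("DR-001", "D2", "D5", introduced_in=1),
    RegistryEntry("DR-002", "D2", "D7", introduced_in=1),
    # Hypothetical only: pretend DR-003 is deprecated by a future v2.
    RegistryEntry("DR-003", "D3", "D4", introduced_in=1, deprecated_in=2),
]

v1_ids = [e.rule_id for e in rules_in_force(registry, 1)]
v2_ids = [e.rule_id for e in rules_in_force(registry, 2)]
print(v1_ids)  # ['DR-001', 'DR-002', 'DR-003']
print(v2_ids)  # ['DR-001', 'DR-002']
```

A report computed under v1 can then be re-run against exactly the v1 rules, even after later versions drop or replace some of them.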

Revision protocol

| When | What |
| --- | --- |
| Any time | New candidates can be added to the candidate-rules table by anyone editing this page. Add id, proposed rule, direction, evidence shape we’d want, status: candidate, notes. |
| Each wiki ingest of an incident | Check whether the new incident provides cross-domain evidence relevant to an existing candidate. If so, add the citation to that candidate’s notes column. |
| Quarterly (Q1 / Q2 / Q3 / Q4) | Review all candidates against promotion criteria. Promote, hold, or reject. Increment rule-set version on any promotion or deprecation (v1 → v2 → …). Log the revision in wiki/log.md and append to the revision history below. |
| CMM major revision | Re-validate active rules against the latest evidence; deprecate rules that no longer reflect practice. |

Reporting impact

The measurement protocol’s gap report changes shape. Old format:

Headline: L1 (floor — D9 set the floor)
Matrix: D1=L3 D2=L4 D3=L4 D4=L3 D5=L4 D6=L3 D7=L2 D8=L3 D9=L1

New format (Stripe-style architectural-containment archetype example, under v1 rules):

Headline:
  Typical (median effective): L4
  Weakest: D7 effective L2 (raw L2; no upstream cap fired)
  Strongest: D5 raw L4-L5 (effective L4 — capped by DR-001 from D2)
  Strategic rationale: D7 light by deliberate trade-off — D3+D5 architectural containment per Stripe Bullen talk

Per-domain matrix (raw / effective / cap source):
  D1: L3 / L3 / —
  D2: L4 / L4 / —
  D3: L4 / L4 / —
  D4: L3 / L3 / capped by DR-003 to raw(D3)=L4 (no effect — raw already L3)
  D5: L4-L5 / L4 / capped by DR-001 to raw(D2)=L4
  D6: L3 / L3 / —
  D7: L2 / L2 / capped by DR-002 to raw(D2)=L4 (no effect — raw already L2)
  D8: L3 / L3 / —
  D9: L3 / L3 / —

Active rule set: v1 (DR-001, DR-002, DR-003)

The headline is now informative — it shows the program’s shape rather than collapsing it to a single misleading number.
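The cap-source column of the per-domain matrix can be generated mechanically. The sketch below is illustrative only: rules are hypothetical (id, upstream, downstream) tuples, levels are integers, and D5 is rounded up to L5 (the example above records it as L4-L5) so that DR-001 visibly lowers a score.

```python
# v1 active rules as (rule id, upstream, downstream).
RULES = [("DR-001", "D2", "D5"), ("DR-002", "D2", "D7"), ("DR-003", "D3", "D4")]

def matrix_rows(raw, rules):
    """Format one 'raw / effective / cap source' line per domain."""
    rows = []
    for domain in sorted(raw):
        eff, notes = raw[domain], []
        for rule_id, up, down in rules:
            if down != domain:
                continue
            note = f"capped by {rule_id} to raw({up})=L{raw[up]}"
            if raw[up] >= raw[domain]:
                # The rule applies but does not lower the score.
                note += f" (no effect, raw already L{raw[domain]})"
            notes.append(note)
            eff = min(eff, raw[up])
        rows.append(f"{domain}: L{raw[domain]} / L{eff} / {'; '.join(notes) or '-'}")
    return rows

# Stripe-style example matrix, with D5 simplified to L5 (assumption of this sketch).
raw = {"D1": 3, "D2": 4, "D3": 4, "D4": 3, "D5": 5,
       "D6": 3, "D7": 2, "D8": 3, "D9": 3}
rows = matrix_rows(raw, RULES)
print("\n".join(rows))
```

Listing rules that apply without effect, as the D4 and D7 rows do, keeps the report explicit about which dependencies were checked rather than only those that changed a score.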

Worked examples — re-running the stress-test archetypes

Comparison of the 5 archetypes from the stress test under the old floor vs. v1 effective-score:

| Archetype | Old floor (single number) | v1 effective-score headline (typical / weakest / strongest) | Improvement vs old? |
| --- | --- | --- | --- |
| Stripe-style architectural-containment | L2 | L4 typical / L2 D7 (intentional trade-off) / L4 D5 (capped by DR-001 from D2) | Yes: typical L4 reflects the program; D7 honestly noted as weakest with rationale |
| Microsoft Agent 365-driven | L2 | L3 typical / L2 D9 (no upstream cap) / L5 D2 | Yes: D9 ops lag doesn’t drag D2 down (no D9→D2 rule in v1; DR-C005 is candidate, not active) |
| Startup with bus-factor 1 | L1 | L3 typical / L1 D9 (bus factor) / L3 D2/D3/D4/D5 | Yes: technical maturity isn’t dragged down |
| Regulated FS (balanced L3-L4) | L3 | L3-L4 typical / L3 weakest / L4 strongest | Equivalent: fair under both rules |
| Multi-cloud (balanced L3-L4) | L3 | L3-L4 typical / L3 weakest / L4 strongest | Equivalent: fair under both rules |

Net effect of v1 rules: the 3 archetypes the floor misreported are now reported fairly; the 2 archetypes the floor reported fairly are still reported fairly. Cherry-picking is now prevented by mandatory matrix disclosure + strategic-rationale field rather than by mathematical aggregation.

What this does NOT do

  • Does not eliminate the cross-domain attack-path concern. DR-001/002/003 capture the strongest known cases. Future incidents and architectures will surface more (the candidates are the parking lot).
  • Does not allow cherry-picking. Reports MUST publish the full matrix; reports that cite a single domain’s score without the matrix are non-compliant with the measurement protocol (anti-pattern B2 reframed accordingly).
  • Does not replace the L4→L5 prerequisite gate (≥2 quarters stable L4, AIUC-1 readiness scheduled, bus-factor ≥2, continuity test). Effective-score is aggregation; the prerequisite gate is eligibility for L5 claims. Both apply.
  • Does not address weighted scoring. All 9 domains are still treated as equally important when computing typical/weakest/strongest. Domain weighting (e.g. for high-risk-tier applications) is a separate question parked under the agent-archetype tailoring open gap on the CMM page.
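The full-matrix disclosure requirement lends itself to an automated check. The following is a rough sketch, assuming a report is a dictionary with a "matrix" entry keyed by domain; that shape and the field names raw, effective, and cap_source are hypothetical, not part of the measurement protocol.

```python
REQUIRED_DOMAINS = {f"D{i}" for i in range(1, 10)}  # D1..D9
REQUIRED_FIELDS = {"raw", "effective", "cap_source"}

def disclosure_problems(report):
    """List the ways a report falls short of full-matrix disclosure."""
    matrix = report.get("matrix", {})
    problems = []
    missing = sorted(REQUIRED_DOMAINS - set(matrix))
    if missing:
        problems.append("missing domains: " + ", ".join(missing))
    for domain in sorted(matrix):
        absent = sorted(REQUIRED_FIELDS - set(matrix[domain]))
        if absent:
            problems.append(f"{domain} missing fields: " + ", ".join(absent))
    return problems

# A report that cites only its best domain is flagged.
cherry_picked = {"matrix": {"D2": {"raw": 5, "effective": 5, "cap_source": None}}}
# A report with all nine domains and all fields passes.
full = {"matrix": {d: {"raw": 3, "effective": 3, "cap_source": None}
                   for d in REQUIRED_DOMAINS}}

print(disclosure_problems(cherry_picked))
print(disclosure_problems(full))  # []
```

An empty list means the matrix is complete; it says nothing about whether the scores themselves are accurate.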

Open questions / known unknowns

Things this scaffolding doesn’t yet handle:

  1. Soft caps vs hard caps. DR-C005 (D9 caps D2) is a strong candidate for soft capping (operational lag degrades technical controls over time, not in the moment). The current schema only supports hard caps. Soft-cap semantics are a v2+ design problem.
  2. Conditional caps. Some caps may only apply for specific application archetypes (e.g. D4 caps D5 may apply for consumer-facing chatbots but not for internal agent platforms). The current schema doesn’t support conditions.
  3. Multi-hop transitive caps. If D2 caps D5 and D5 caps D7 (DR-C002 candidate), should D2 transitively cap D7 via D5? Currently each rule is independent. Worth re-examining if DR-C002 is promoted.
  4. Rule interactions. Two rules pointing at the same downstream domain currently take min() of their upstream caps. This is the conservative choice but may be wrong where the caps are partially redundant (that is, they capture the same attack path). No counter-evidence yet, but flagged for review.
  5. Negative rules / floor-relaxation. Should there be rules that raise an effective score (e.g. D3+D5 both at L4 raises the effective ceiling on D7 for the Stripe-archetype case, since architectural containment substitutes for behavioral observability)? Currently rules can only cap, not relax. v2+ design problem.
  6. Scoring stability across rule-set versions. When v1 → v2 promotes a new active rule, prior assessments’ headlines may shift. The protocol should specify which rule set a published rating was computed under (annotate as “v1 effective-score” or similar).
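To make the multi-hop question in item 3 concrete, the sketch below contrasts the current independent-rule semantics with one possible transitive variant. It is exploratory only: the transitive behaviour is not part of v1, the rule chain pairs DR-001 with the unpromoted candidate DR-C002, DR-002 is deliberately left out to isolate the effect, and the scores are hypothetical.

```python
def effective_direct(domain, raw, rules):
    """Current semantics: only direct upstream raw scores cap the domain."""
    caps = [raw[up] for up, down in rules if down == domain]
    return min([raw[domain], *caps])

def upstream_closure(domain, rules):
    """All domains that reach `domain` through a chain of one or more rules."""
    seen, stack = set(), [domain]
    while stack:
        current = stack.pop()
        for up, down in rules:
            if down == current and up not in seen:
                seen.add(up)
                stack.append(up)
    return seen

def effective_transitive(domain, raw, rules):
    """Hypothetical variant: every upstream domain in the chain caps the domain."""
    return min([raw[domain], *(raw[d] for d in upstream_closure(domain, rules))])

# DR-001 (D2 caps D5) chained with candidate DR-C002 (D5 caps D7).
chain = [("D2", "D5"), ("D5", "D7")]
raw = {"D2": 2, "D5": 4, "D7": 4}

direct = effective_direct("D7", raw, chain)
transitive = effective_transitive("D7", raw, chain)
print(direct, transitive)  # 4 2
```

Under independent rules D7 keeps its score of 4 because raw D5 is 4, whereas the transitive variant pulls D7 down to 2 through D2. The visited set also keeps the traversal finite if a future rule set ever contains a cycle.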

Revision history

| Version | Date | Changes | Active rule count |
| --- | --- | --- | --- |
| v1 | 2026-05-04 | Initial scaffolding. 3 active rules (DR-001 D2→D5, DR-002 D2→D7, DR-003 D3→D4) anchored to lethal-trifecta + Sondera/AgentCordon evidence. 6 candidate rules parked. | 3 |

Relations