Assessor’s Quick Scorecard — Secure-SDLC and AI Practices for a Large Canadian Bank

A condensed two-party-advisor assessment instrument for evaluating a large Ontario-based federally-regulated bank’s secure-SDLC practices, with explicit overlays for AI applications, optional Frontier-AI vulnerability discovery in the CI/CD pipeline, and continuous penetration testing — the latter explicitly framed to reduce both false-positive findings (alert-fatigue from scanner noise) and false-negative weaknesses (vulnerabilities missed by point-in-time testing).

0 — How to Use

Audience. A 2nd-party advisor engaging a federally-regulated Canadian bank. The bank is either already building AI applications or plans to do so. The scorecard is engagement-oriented: it produces a per-section score, a maturity tier, a prioritized findings backlog, and a 90-day quick-wins list — not a compliance certification.

Engagement flow. Kickoff → Document Request → Interviews (Eng, Security, Risk, Model Risk) → Evidence Collection → Scoring → Findings Workshop → Report.

Scoring rubric. Each question takes one value:

ValueScoreDefinition
Yes2Documented, implemented, evidence available, last reviewed within cadence
Partial1In place but missing one of: documentation, full coverage, cadence, evidence
No0Not in place, or planned-only with no working implementation
N/AexcludedNot applicable to the bank’s current technology footprint (justify briefly)

Evidence type per question. Each question has an expected evidence type — D document, I interview, O live observation, T telemetry / log sample. A “Yes” without the expected evidence type is downgraded to “Partial.”

Section maturity ladder.

TierThresholdInterpretation
L1<30%Ad hoc / undocumented
L230-50%Defined but uneven
L350-75%Implemented and auditable (inflection — minimum expected for a federally-regulated bank)
L475-90%Measured and improving
L5>90%Continuous improvement with evidence

Per-section score is sum(yes × 2 + partial × 1) ÷ (max_possible_excluding_NA) expressed as a percentage. Section tier is derived directly from the threshold. Aligned with the wiki’s CMM measurement protocol for cross-engagement comparability.

Findings priority.

PriorityTrigger
CriticalRegulatory exposure (OSFI B-13 / E-23 / B-10 / PIPEDA citation) or safety-critical AI risk (Lethal Trifecta exposure in production-facing agent)
MajorGap to industry baseline (CMM L3 expectation; NIST SP 800-218A High-priority recommendation)
ModerateGap to leading practice (CMM L4+; SP 800-218A Medium-priority)
InformationalLeading-edge / nice-to-have (CMM L5; SP 800-218A Low-priority or Consideration-level)

1 — Section A: Secure-SDLC Foundation (12 questions)

Anchors: NIST SSDF v1.1 practice groups PO/PS/PW/RV; OSFI B-13 Domain 1 (Governance and Risk Management) and Domain 2 (Technology Operations and Resilience), in particular B-13 §2.4 System Development Life Cycle which is the direct regulatory hook for this section.

#QuestionEvidenceAnchor
A1Are secure software development requirements documented in policy and reviewed at least annually?D, ISSDF PO.1.1; B-13 1.3
A2Are SDLC-related roles documented including a security-champion model or equivalent embedded-security function?D, ISSDF PO.2.1; B-13 1.1
A3Is role-based secure-development training delivered to engineers, with proficiency tracked and refreshed?D, TSSDF PO.2.2; B-13 1.1
A4Is the build / CI/CD toolchain documented, version-controlled, security-vetted, and access-controlled with MFA and least-privilege?D, OSSDF PO.3.1, PO.3.2; B-13 2.4
A5Are software security criteria (gates) defined for material releases, with documented exceptions and exception-approval authority?D, ISSDF PO.4.1; B-13 2.4
A6Is source code under access control with MFA on the SCM, branch protection, and signed commits where feasible?D, OSSDF PS.1.1; B-13 2.4
A7Is a software bill of materials (SBOM, CycloneDX or SPDX) generated for every material release, signed, and retained for the artifact’s full support window?D, TSSDF PS.3.2; B-13 2.4
A8Is threat modeling performed at design time for net-new services and material changes, with output retained and reviewed?D, ISSDF PW.1.1; B-13 1.3
A9Are third-party / open-source components verified (provenance, vulnerability status, support window) and update-policy-bound?D, TSSDF PW.4.4; B-13 2.4; B-10 §2
A10Are SAST, DAST, SCA, and secret-scanning in place with quality gates and tracked false-positive suppression cadence?T, OSSDF PW.7, PW.8; B-13 2.4
A11Is there an inbound vulnerability disclosure channel (security.txt or equivalent) with SLAs and a responsible-disclosure policy?DSSDF RV.1.3; B-13 3.4
A12Are vulnerabilities triaged with documented severity-keyed SLAs and tracked time-to-remediation reported to senior management?D, TSSDF RV.2.2; B-13 1.1, 3.4

2 — Section B: AI Governance and Model Risk (12 questions)

Anchors: OSFI E-23 (2027) Sections B (Enterprise-wide MRM), C (Risk-Based Approach), D (Model Lifecycle Management), and Appendix A (model inventory schema); Canada’s Voluntary AI Code of Conduct (ISED, Sept 2023); NIST SP 800-218A PO/PS overlays; NIST AI RMF 1.0; ISO/IEC 42001. Note: OSFI E-23 (2027) takes effect 2027-05-01 — the bank should be preparing now even if not yet in formal compliance scope.

#QuestionEvidenceAnchor
B1Is there a current enterprise-wide AI / model inventory containing all non-negligible-risk models, with the Appendix-A-aligned schema (model ID, owner, risk rating, dependencies, data sources, limitations, next review)?D, TE-23 C.1, Appendix A
B2Is each AI/ML system risk-rated against the bank’s defined criteria (purpose, impact, data sensitivity, autonomy level) with corresponding control intensity?DE-23 C.2, C.3
B3Is there an AI governance body with documented charter, escalation paths, multi-disciplinary participation (legal, compliance, ethics), and at least quarterly review cadence?D, IE-23 B.1; Voluntary AI Code §1 (Accountability)
B4Is independent model validation performed by reviewers separated from development, with review triggers covering new development, modifications, performance breaches, and significant data changes?D, IE-23 D Stage 2 (Review)
B5Is an AI-BOM maintained for each deployed AI/ML system, covering training-data sources, RAG corpus, frameworks, MCP servers, reward models, and adaptation layers?D, TSP 800-218A PS.3.2; E-23 Appendix A
B6Is training-data provenance tracked when known, integrity-verified before use, and documented when provenance is not knowable?D, TSP 800-218A PW.3.1, PW.3.2
B7Are model weights and configuration parameters protected with cryptographic hashes, digital signatures, least-privilege access, and risk-proportionate additional controls (encryption / multi-party authorization / air-gap)?D, T, OSP 800-218A PS.1.3, PS.1.3.R4
B8Is there an Algorithmic Impact Assessment (or PIPEDA-aligned privacy impact assessment) for each high-risk AI system handling consumer financial data, with documented mitigations?DPIPEDA Principle 4 (Limiting Collection); Voluntary AI Code §2 (Safety)
B9Is the AI system designed such that no critical-path security or financial decision is taken without a human in the loop where the decision is irreversible, material, or rights-affecting?D, OSP 800-218A PW.1.1.C2; Voluntary AI Code §5 (Human Oversight)
B10Are documented model-shutdown / rollback criteria and procedures in place, tested at least quarterly, with named accountable owner per system?D, TSP 800-218A RV.2.2.R2, RV.2.2.C1; E-23 D Decommissioning
B11Are AI/ML model performance and behavior continuously monitored against defined breach thresholds, with documented contingency triggers for drift or autonomous-reparametrization events?T, OE-23 D Stage 5 (Monitoring); SP 800-218A PO.5.3
B12Is PIPEDA breach-notification timing wired into the AI incident-response playbook, with the “real risk of significant harm” determination procedure documented?DPIPEDA §10.1 (Breach of Security Safeguards regs)

3 — Section C: Frontier-AI in CI/CD (Optional Layer, 8 questions)

Anchors: wiki Frontier-AI thesis — harness-over-model architecture; XBOW Mythos eval (42-55% FN reduction vs. Opus 4.6); MDASH (+5 percentage points from harness alone on CyberGym); Big Sleep + CodeMender (Google) production track record. This section is optional — applicable only if the bank uses or is piloting frontier-AI for vulnerability discovery in the development pipeline. If wholly N/A, mark the section excluded.

#QuestionEvidenceAnchor
C1If frontier-AI vulnerability discovery is used in CI/CD, is the vendor or harness identified and documented (Big Sleep / CodeMender / MDASH / XBOW / Glasswing-partner / internal)?DWiki Frontier-AI thesis
C2Is the rollout phased (shadow → advisory → gating) with measurable success criteria at each phase and explicit rollback authority?D, TMDASH 5-stage pipeline pattern
C3Are all AI-discovered findings human-reviewed before patch merge, following the CodeMender / MDASH default pattern? If auto-merge is enabled for any scope, is the auto-merge scope and rollback plan documented?D, TCodeMender / MDASH announcements
C4Is the harness validation step (debater / LLM-as-judge / regression check / functional-equivalence test) documented and version-controlled per the “harness over model” architecture?D, OWiki Frontier-AI thesis; CodeMender multi-agent validation
C5Is false-negative reduction measured against a known-vulnerability ground-truth set (CyberGym-like, internal corpus, or third-party validation), and tracked over time?TXBOW eval pattern; CyberGym leaderboard
C6Is the false-positive rate per AI-discovered finding measured, with a documented SLA and periodic re-tuning cadence?TRTCF Tier 4
C7Are AI-discovered findings tagged with model identity, harness version, confidence score, and evidence chain to support audit reproducibility?T, OSP 800-218A PS.3.2 (provenance); OSFI B-13 1.3
C8Is the bank participating in or evaluating coalition initiatives (Anthropic Glasswing or analogous) for shared vulnerability research, with documented data-sharing controls?D, IAnthropic Glasswing partner pattern

4 — Section D: Continuous Pentesting and AI Red Teaming (12 questions)

Anchors: RTCF Tier 4 (continuous operations); red-teaming for AI synthesis; OSFI B-13 Domain 3 (Cyber Security — Defend, Detect, Respond/Recover/Learn); NIST SP 800-218A PW.8.1.R1 (red-teaming named as a recommended SDLC code-testing form); NIST SP 800-218A glossary definition of AI red-teaming.

#QuestionEvidenceAnchor
D1Is penetration testing performed at least annually and on material change, with named scope, methodology, and senior-management reporting?DB-13 3.2, 3.3
D2Is continuous security testing (BAS, CART, red-team-as-a-service, or equivalent) in place beyond annual point-in-time engagements, with results triaged on a defined cadence?D, TRTCF Tier 4
D3Are per-tool false-positive rates tracked, with documented suppression rules and periodic re-tuning to prevent alert fatigue?TSSDF PW.7, PW.8
D4Are false-negative rates measured via known-vulnerability injection, ground-truth corpora, or independent third-party validation, and tracked over time?TXBOW-style eval; RTCF Tier 4
D5Is the pentest and red-team scope explicitly updated to include AI components — model endpoints, agent loops, MCP servers, retrieval pipelines, and adaptation layers?DSP 800-218A PW.8.1.R1; OWASP LLM Top 10
D6Is there a documented AI red-teaming program with cadence, scope, methodology, and named accountable owner (per the SP 800-218A glossary definition of AI red-teaming)?DSP 800-218A PW.8.1.R1, Appendix A
D7Are adversarial-input corpora and prompt-injection test suites maintained, refreshed against current threats, and rotated (e.g., garak / PyRIT / AgentDojo / Promptfoo)?D, TRTCF Tier 4
D8For agent-based systems, is the Lethal Trifecta (private data + untrusted content + external comms) explicitly assessed per system, with containment controls evidenced where exposure exists?D, IWiki lethal-trifecta concept
D9Are red-team findings tracked separately from pentest findings with AI-specific severity rubrics (data exfiltration via prompt injection, model decision compromise, agent action hijack)?TRTCF Tier 4
D10Are CI/CD gates configured to re-run red-team probes on material change to AI components or prompt assets?D, TRTCF Tier 4; SSDF PW.8
D11Is third-party AI red-teaming evaluated or used (Mindgard CART, HiddenLayer, Protect AI, General Analysis, or equivalent), with sourcing controls applied?DRTCF Tier 5
D12Is a vulnerability disclosure program (or bug bounty) explicitly scoped to AI surfaces (model endpoints, prompt-handling, agent loops), with safe-harbor language?DSSDF RV.1.3

5 — Section E: Identity, Least-Agency, and Supply Chain for AI (10 questions)

Anchors: CMM domains D2 (Identity & Authorization), D3 (Control & Least-Agency), D8 (Supply Chain & AI-BOM); RA Identity, Control, and Data planes; OSFI B-10 (Third-Party Risk Management).

#QuestionEvidenceAnchor
E1Are non-human / agent identities provisioned per agent with documented lifecycle (provision, scope, audit, deprovision) and not shared across services?D, TNHI; CMM D2
E2Are agents authorized through scoped, expiring capability tokens rather than long-lived API keys or static credentials?D, Ocapability-based authorization; CMM D3
E3Is the principle of least-agency enforced — each task grants only the authority required to complete that task — and audited per system?D, TCMM D3
E4For irreversible operations (transactions, customer notifications, identity changes, sharing changes), is Plan-Validate-Execute or an equivalent HITL pattern enforced?D, OWiki PVE concept
E5Is the AI-BOM (model bill of materials) signed, retained, and machine-verifiable for each AI-augmented service?D, TSP 800-218A PS.3.2; CMM D8
E6Is the coding-agent / IDE-extension supply chain governed (rules-file integrity, IDE-extension provenance, destructive-action classification, typosquat defense)?D, TKnostic governance
E7For MCP servers in use, are they scanned at install time, signed, version-pinned, and tracked for CVE feed updates?D, Tsupply chain for agents; MCP CVEs
E8Is agent egress proxied through an inline gateway with tool authorization, response-leak scanning, and audit telemetry?O, TWiki inline-gateway concept; CMM D5
E9Are third-party AI vendors (model providers, agent-platform vendors, MCP-tool providers) governed under OSFI B-10’s third-party-risk framework with documented assessments and SLAs?DB-10 §2, §3
E10Is there a documented agent-containment / kill-switch procedure with named accountable owner, tested at least quarterly?D, TSP 800-218A RV.2.2; CMM D9

6 — Section F: Observability, Detection, and AI Incident Response (8 questions)

Anchors: CMM domains D7 (Observability & Detection), D9 (Operations & Human Factors); OpenTelemetry gen_ai semantic conventions; OSFI B-13 Domain 3 (Cyber Security — Detect, Respond/Recover/Learn); PIPEDA §10.1 Breach of Security Safeguards.

#QuestionEvidenceAnchor
F1Are agent traces emitted in OTel gen_ai.* semantic conventions for every model call, tool call, and RAG retrieval, with retention aligned to OSFI evidence expectations?T, OOTel gen_ai; CMM D7; B-13 3.3
F2Are behavioral baselines maintained per agent (action volume, unique destinations, output length distribution, tool-call profile), with alerts on deviation?TCMM D7
F3Are canary tokens deployed in system prompts and high-value RAG sources to detect system-prompt leakage and prompt-injection exfiltration?D, TWiki canary tokens for LLMs
F4Is there an AI-specific incident-response playbook covering data poisoning, model compromise, prompt injection in production, MCP supply-chain compromise, and agent-action hijack?DCMM D9; SP 800-218A RV.1.1
F5Are OSFI B-13 material-incident reporting timing and PIPEDA breach-notification triggering wired into the AI IR playbook, including the “real risk of significant harm” determination procedure?DB-13 3.4; PIPEDA §10.1
F6Is post-incident root-cause analysis required to determine whether the SDLC, model lifecycle, or governance process should be updated, with named owner for each remediation?DSSDF RV.3; B-13 3.4
F7Are retention windows for AI audit logs documented and aligned with OSFI B-13 evidence-preservation expectations (typically 7 years for material financial systems)?D, TB-13 2.8, 3.3
F8Is the kill-switch / model rollback path tested at least quarterly, with a named accountable owner and documented results retained?D, TSP 800-218A RV.2.2; CMM D9

7 — Scoring, Findings, and Reporting

Per-section roll-up. For each section, compute:

section_score_pct  =  (sum(Yes × 2 + Partial × 1)  /  (max_possible_excluding_NA)) × 100
section_maturity   =  L1 / L2 / L3 / L4 / L5  per the threshold table in §0

Whole-engagement tier. The engagement-level tier is the minimum of the per-section tiers — a single L1 section caps the whole engagement at L1, irrespective of strength elsewhere. This is deliberate: a federally-regulated bank cannot operate at L4 in SDLC fundamentals while at L1 in AI governance and claim L4 maturity.

Findings priority — apply per question marked No or Partial. Use the priority table in §0. A finding tied to an OSFI / PIPEDA citation is at minimum Major; safety-critical AI exposure (e.g., Lethal Trifecta in production-facing agent without containment) is Critical regardless of other context.

Report deliverables.

ArtifactLengthAudience
Executive summary1 pageBoard / Risk Committee / Executive Sponsor
Per-section scorecard table1-2 pagesCISO, Engineering leadership, Model Risk
Prioritized findings backlog2-4 pagesCISO, Eng leads, Model-risk leads
90-day quick-wins list1 pageEng leads, AI platform team
Evidence appendixas neededInternal Audit, future reviewers

8 — References and Crosswalks

Single cross-reference table — section to anchors (wiki, regulatory, source-document).

SectionWiki anchorOSFI / PIPEDA citationSource document
A. Secure-SDLC FoundationSSDF PO/PS/PW/RV; CMM D1B-13 Domain 1, §2.4 SDLCNIST SP 800-218 v1.1
B. AI Governance & Model RiskSP 800-218A; CMM D1, D6E-23 (2027) §§B/C/D + Appendix A; PIPEDA Principle 4NIST SP 800-218A; ISED Voluntary AI Code
C. Frontier-AI in CI/CDFrontier-AI thesis(optional layer)XBOW Mythos eval; Microsoft MDASH; Google Big Sleep / CodeMender
D. Continuous Pentesting & AI Red TeamRTCF Tier 4; red-teaming for AIB-13 Domain 3 (Defend / Detect / Respond)NIST SP 800-218A PW.8.1.R1
E. Identity, Least-Agency, Supply ChainCMM D2/D3/D8B-10 §2-§3; B-13 1.3NIST SP 800-218A PS.1, PS.3.2
F. Observability & AI IRCMM D7/D9B-13 Domain 3 (Detect, Respond/Recover/Learn); PIPEDA §10.1NIST SP 800-218A RV.1, RV.2; OTel gen_ai

Secondary regulatory cross-citations — apply only where the bank has matching operations:

  • PCI DSS 4.0 — Sections A (secure coding), C (AI-assisted code review), D (testing) — applicable if the bank processes payment card data.
  • NYDFS Part 500 (23 NYCRR 500) — §500.5 pentest cadence, §500.17 incident reporting — applicable if the bank is NYDFS-licensed for NY operations.
  • FFIEC IT Examination Handbook — Information Security booklet; Development and Acquisition booklet — applicable if the bank has US federally-regulated operations.
  • DORA (EU 2022/2554) — Articles 5-23 (ICT risk), 17-23 (incident reporting), 24-27 (TLPT — threat-led penetration testing) — applicable if the bank has EU operations.

Out-of-scope but referenced.

  • BSIMM — descriptive benchmark of observed practices; optional comparator for large enterprises; not used as a primary scoring instrument here.
  • SLSA v1.0 — supply-chain provenance specification; recommended at the SBOM / AI-BOM artifact layer; not directly scored.
  • CMMC — US Defense Industrial Base only; not Canadian-relevant.

9 — Notes for the Assessor

The scorecard's first job is to expose the regulatory floor

A federally-regulated Canadian bank operating below L3 on Section A (Secure-SDLC Foundation) or Section B (AI Governance) is exposed under OSFI B-13 (effective since 2022) and pre-positioned for non-compliance under OSFI E-23 (2027). Findings in these sections are at minimum Major, frequently Critical. L3 — implemented and auditable — is the regulatory floor, not the target.

Section C is optional, but increasingly common

Several Glasswing-partner organizations (Microsoft, Google, AWS, JPMorgan, Anthropic itself) have publicly disclosed frontier-AI vulnerability-discovery in production CI/CD. For a large bank, the question is no longer if but when and under what controls. A bank without any Section C activity is not behind; a bank piloting it without harness-validation, FN/FP measurement, and human review is.

Sections D, E, and F sharply differentiate at L4 and L5

Many banks reach L3 on continuous pentesting, identity, and observability through standard enterprise security investments. The L4-L5 differentiation appears specifically in the AI overlay: AI-scoped pentest scope (D5), agent-specific NHI lifecycle (E1), Lethal Trifecta assessment (D8), and OTel gen_ai.* agent traces (F1). These are the questions that separate banks investing in agentic-AI security as a first-class concern from banks merely complying with general SDLC expectations.

OSFI E-23 (2027) effective date is 2027-05-01

Although the guideline was published 2025-09-11, it is not formally effective until 2027-05-01. Findings tied to E-23 are positioned as pre-emptive readiness rather than current non-compliance. Banks operating well below E-23 expectations today have a runway, but the runway is finite and shrinking.

See also