Assessor’s Quick Scorecard — Secure-SDLC and AI Practices for a Large Canadian Bank
A condensed second-party-advisor assessment instrument for evaluating a large Ontario-based, federally-regulated bank’s secure-SDLC practices, with explicit overlays for AI applications, optional Frontier-AI vulnerability discovery in the CI/CD pipeline, and continuous penetration testing — the latter explicitly framed to reduce both false-positive findings (alert fatigue from scanner noise) and false-negative weaknesses (vulnerabilities missed by point-in-time testing).
0 — How to Use
Audience. A second-party advisor engaging a federally-regulated Canadian bank. The bank is either already building AI applications or plans to do so. The scorecard is engagement-oriented: it produces a per-section score, a maturity tier, a prioritized findings backlog, and a 90-day quick-wins list — not a compliance certification.
Engagement flow. Kickoff → Document Request → Interviews (Eng, Security, Risk, Model Risk) → Evidence Collection → Scoring → Findings Workshop → Report.
Scoring rubric. Each question takes one value:
| Value | Score | Definition |
|---|---|---|
| Yes | 2 | Documented, implemented, evidence available, last reviewed within cadence |
| Partial | 1 | In place but missing one of: documentation, full coverage, cadence, evidence |
| No | 0 | Not in place, or planned-only with no working implementation |
| N/A | excluded | Not applicable to the bank’s current technology footprint (justify briefly) |
Evidence type per question. Each question has an expected evidence type — D = document, I = interview, O = live observation, T = telemetry / log sample. A “Yes” without the expected evidence type is downgraded to “Partial.”
Section maturity ladder.
| Tier | Threshold | Interpretation |
|---|---|---|
| L1 | <30% | Ad hoc / undocumented |
| L2 | 30% to <50% | Defined but uneven |
| L3 | 50% to <75% | Implemented and auditable (inflection — minimum expected for a federally-regulated bank) |
| L4 | 75% to <90% | Measured and improving |
| L5 | ≥90% | Continuous improvement with evidence |
Per-section score is sum(Yes × 2 + Partial × 1) ÷ max_possible_excluding_NA, expressed as a percentage, where max_possible_excluding_NA = 2 × the number of applicable (non-N/A) questions. Section tier is read directly off the threshold table above. Scoring is aligned with the wiki’s CMM measurement protocol for cross-engagement comparability.
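A minimal sketch of this computation, assuming answers are captured as the literal strings from the rubric table; the helper names (section_score_pct, maturity_tier) and the example answer set are illustrative, not part of the scorecard.

```python
# Sketch: per-section score and maturity tier, per the rubric and ladder above.
POINTS = {"Yes": 2, "Partial": 1, "No": 0}

def section_score_pct(answers: list[str]) -> float | None:
    """Percentage score for one section; N/A answers are excluded entirely."""
    scored = [a for a in answers if a != "N/A"]
    if not scored:
        return None  # whole section marked N/A and excluded
    return 100.0 * sum(POINTS[a] for a in scored) / (2 * len(scored))

def maturity_tier(pct: float) -> str:
    """Map a section percentage onto the L1-L5 ladder (thresholds as above)."""
    if pct < 30:
        return "L1"
    if pct < 50:
        return "L2"
    if pct < 75:
        return "L3"
    if pct < 90:
        return "L4"
    return "L5"

# Example: a 12-question section with 7 Yes, 3 Partial, 1 No, 1 N/A.
answers = ["Yes"] * 7 + ["Partial"] * 3 + ["No", "N/A"]
pct = section_score_pct(answers)          # (14 + 3) / 22 = 77.3%
print(round(pct, 1), maturity_tier(pct))  # -> 77.3 L4
```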
Findings priority.
| Priority | Trigger |
|---|---|
| Critical | Regulatory exposure (OSFI B-13 / E-23 / B-10 / PIPEDA citation) or safety-critical AI risk (Lethal Trifecta exposure in production-facing agent) |
| Major | Gap to industry baseline (CMM L3 expectation; NIST SP 800-218A High-priority recommendation) |
| Moderate | Gap to leading practice (CMM L4+; SP 800-218A Medium-priority) |
| Informational | Leading-edge / nice-to-have (CMM L5; SP 800-218A Low-priority or Consideration-level) |
1 — Section A: Secure-SDLC Foundation (12 questions)
Anchors: NIST SSDF v1.1 practice groups PO/PS/PW/RV; OSFI B-13 Domain 1 (Governance and Risk Management) and Domain 2 (Technology Operations and Resilience), in particular B-13 §2.4 (System Development Life Cycle), which is the direct regulatory hook for this section.
| # | Question | Evidence | Anchor |
|---|---|---|---|
| A1 | Are secure software development requirements documented in policy and reviewed at least annually? | D, I | SSDF PO.1.1; B-13 1.3 |
| A2 | Are SDLC-related roles documented including a security-champion model or equivalent embedded-security function? | D, I | SSDF PO.2.1; B-13 1.1 |
| A3 | Is role-based secure-development training delivered to engineers, with proficiency tracked and refreshed? | D, T | SSDF PO.2.2; B-13 1.1 |
| A4 | Is the build / CI/CD toolchain documented, version-controlled, security-vetted, and access-controlled with MFA and least-privilege? | D, O | SSDF PO.3.1, PO.3.2; B-13 2.4 |
| A5 | Are software security criteria (gates) defined for material releases, with documented exceptions and exception-approval authority? | D, I | SSDF PO.4.1; B-13 2.4 |
| A6 | Is source code under access control with MFA on the SCM, branch protection, and signed commits where feasible? | D, O | SSDF PS.1.1; B-13 2.4 |
| A7 | Is a software bill of materials (SBOM, CycloneDX or SPDX) generated for every material release, signed, and retained for the artifact’s full support window? (illustrative sketch after this table) | D, T | SSDF PS.3.2; B-13 2.4 |
| A8 | Is threat modeling performed at design time for net-new services and material changes, with output retained and reviewed? | D, I | SSDF PW.1.1; B-13 1.3 |
| A9 | Are third-party / open-source components verified (provenance, vulnerability status, support window) and update-policy-bound? | D, T | SSDF PW.4.4; B-13 2.4; B-10 §2 |
| A10 | Are SAST, DAST, SCA, and secret-scanning in place with quality gates and tracked false-positive suppression cadence? | T, O | SSDF PW.7, PW.8; B-13 2.4 |
| A11 | Is there an inbound vulnerability disclosure channel (security.txt or equivalent) with SLAs and a responsible-disclosure policy? | D | SSDF RV.1.3; B-13 3.4 |
| A12 | Are vulnerabilities triaged with documented severity-keyed SLAs and tracked time-to-remediation reported to senior management? | D, T | SSDF RV.2.2; B-13 1.1, 3.4 |
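A minimal illustration of the artifact A7 expects (flagged in the table above): a CycloneDX-style SBOM record, canonically serialized and hashed so the digest can be signed and retained. Field names follow the CycloneDX JSON layout in outline only, and the component and service names are invented for illustration; a real pipeline would use an SBOM generator and the bank's signing service rather than this by hand.

```python
# Sketch: serialize and hash a minimal CycloneDX-style SBOM record (A7).
import hashlib
import json

sbom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "version": 1,
    "metadata": {"component": {"type": "application",
                               "name": "payments-api",      # illustrative service
                               "version": "4.2.0"}},
    "components": [
        {"type": "library", "name": "openssl", "version": "3.0.13",
         "purl": "pkg:generic/openssl@3.0.13"},
    ],
}

# Canonical serialization so the digest is reproducible across builds.
payload = json.dumps(sbom, sort_keys=True, separators=(",", ":")).encode()
digest = hashlib.sha256(payload).hexdigest()

# The digest (or the full payload) is what gets signed and retained for the
# artifact's support window; the release record stores both.
print(f"sbom sha256: {digest}")
```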
2 — Section B: AI Governance and Model Risk (12 questions)
Anchors: OSFI E-23 (2027) Sections B (Enterprise-wide MRM), C (Risk-Based Approach), D (Model Lifecycle Management), and Appendix A (model inventory schema); Canada’s Voluntary AI Code of Conduct (ISED, Sept 2023); NIST SP 800-218A PO/PS overlays; NIST AI RMF 1.0; ISO/IEC 42001. Note: OSFI E-23 (2027) takes effect 2027-05-01 — the bank should be preparing now even if not yet in formal compliance scope.
| # | Question | Evidence | Anchor |
|---|---|---|---|
| B1 | Is there a current enterprise-wide AI / model inventory containing all non-negligible-risk models, with the Appendix-A-aligned schema (model ID, owner, risk rating, dependencies, data sources, limitations, next review)? | D, T | E-23 C.1, Appendix A |
| B2 | Is each AI/ML system risk-rated against the bank’s defined criteria (purpose, impact, data sensitivity, autonomy level) with corresponding control intensity? | D | E-23 C.2, C.3 |
| B3 | Is there an AI governance body with documented charter, escalation paths, multi-disciplinary participation (legal, compliance, ethics), and at least quarterly review cadence? | D, I | E-23 B.1; Voluntary AI Code §1 (Accountability) |
| B4 | Is independent model validation performed by reviewers separated from development, with review triggers covering new development, modifications, performance breaches, and significant data changes? | D, I | E-23 D Stage 2 (Review) |
| B5 | Is an AI-BOM maintained for each deployed AI/ML system, covering training-data sources, RAG corpus, frameworks, MCP servers, reward models, and adaptation layers? | D, T | SP 800-218A PS.3.2; E-23 Appendix A |
| B6 | Is training-data provenance tracked when known, integrity-verified before use, and documented when provenance is not knowable? | D, T | SP 800-218A PW.3.1, PW.3.2 |
| B7 | Are model weights and configuration parameters protected with cryptographic hashes, digital signatures, least-privilege access, and risk-proportionate additional controls (encryption / multi-party authorization / air-gap)? (hash-manifest sketch after this table) | D, T, O | SP 800-218A PS.1.3, PS.1.3.R4 |
| B8 | Is there an Algorithmic Impact Assessment (or PIPEDA-aligned privacy impact assessment) for each high-risk AI system handling consumer financial data, with documented mitigations? | D | PIPEDA Principle 4 (Limiting Collection); Voluntary AI Code §2 (Safety) |
| B9 | Is the AI system designed such that no critical-path security or financial decision is taken without a human in the loop where the decision is irreversible, material, or rights-affecting? | D, O | SP 800-218A PW.1.1.C2; Voluntary AI Code §5 (Human Oversight) |
| B10 | Are documented model-shutdown / rollback criteria and procedures in place, tested at least quarterly, with named accountable owner per system? | D, T | SP 800-218A RV.2.2.R2, RV.2.2.C1; E-23 D Decommissioning |
| B11 | Are AI/ML model performance and behavior continuously monitored against defined breach thresholds, with documented contingency triggers for drift or autonomous-reparametrization events? | T, O | E-23 D Stage 5 (Monitoring); SP 800-218A PO.5.3 |
| B12 | Is PIPEDA breach-notification timing wired into the AI incident-response playbook, with the “real risk of significant harm” determination procedure documented? | D | PIPEDA §10.1 (Breach of Security Safeguards regs) |
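A minimal sketch of the hash-manifest control B7 asks about (flagged in the table above): compute SHA-256 digests for every file under a model directory and verify them against a retained manifest on load. The directory paths and manifest format are illustrative assumptions; signing the manifest itself is left to the bank's existing code-signing tooling.

```python
# Sketch: integrity manifest for model weight and config files (B7).
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(weights_dir: str) -> dict[str, str]:
    """Digest every file under the model directory."""
    return {str(p): file_sha256(p)
            for p in sorted(Path(weights_dir).rglob("*")) if p.is_file()}

def verify_manifest(weights_dir: str, manifest_path: str) -> list[str]:
    """Return files whose current digest no longer matches the retained manifest."""
    expected = json.loads(Path(manifest_path).read_text())
    current = build_manifest(weights_dir)
    return [p for p, digest in expected.items() if current.get(p) != digest]

# Usage: build once at deployment, verify on every model load.
# json.dump(build_manifest("models/credit-scoring-v7"), open("manifest.json", "w"))
# assert verify_manifest("models/credit-scoring-v7", "manifest.json") == []
```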
3 — Section C: Frontier-AI in CI/CD (Optional Layer, 8 questions)
Anchors: wiki Frontier-AI thesis — harness-over-model architecture; XBOW Mythos eval (42-55% FN reduction vs. Opus 4.6); MDASH (+5 percentage points from harness alone on CyberGym); Big Sleep + CodeMender (Google) production track record. This section is optional — applicable only if the bank uses or is piloting frontier-AI for vulnerability discovery in the development pipeline. If wholly N/A, mark the section excluded.
| # | Question | Evidence | Anchor |
|---|---|---|---|
| C1 | If frontier-AI vulnerability discovery is used in CI/CD, is the vendor or harness identified and documented (Big Sleep / CodeMender / MDASH / XBOW / Glasswing-partner / internal)? | D | Wiki Frontier-AI thesis |
| C2 | Is the rollout phased (shadow → advisory → gating) with measurable success criteria at each phase and explicit rollback authority? | D, T | MDASH 5-stage pipeline pattern |
| C3 | Are all AI-discovered findings human-reviewed before patch merge, following the CodeMender / MDASH default pattern? If auto-merge is enabled for any scope, is the auto-merge scope and rollback plan documented? | D, T | CodeMender / MDASH announcements |
| C4 | Is the harness validation step (debater / LLM-as-judge / regression check / functional-equivalence test) documented and version-controlled per the “harness over model” architecture? | D, O | Wiki Frontier-AI thesis; CodeMender multi-agent validation |
| C5 | Is false-negative reduction measured against a known-vulnerability ground-truth set (CyberGym-like, internal corpus, or third-party validation), and tracked over time? (measurement sketch after this table) | T | XBOW eval pattern; CyberGym leaderboard |
| C6 | Is the false-positive rate per AI-discovered finding measured, with a documented SLA and periodic re-tuning cadence? | T | RTCF Tier 4 |
| C7 | Are AI-discovered findings tagged with model identity, harness version, confidence score, and evidence chain to support audit reproducibility? | T, O | SP 800-218A PS.3.2 (provenance); OSFI B-13 1.3 |
| C8 | Is the bank participating in or evaluating coalition initiatives (Anthropic Glasswing or analogous) for shared vulnerability research, with documented data-sharing controls? | D, I | Anthropic Glasswing partner pattern |
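One way to operationalize the measurement C5 and C6 describe (flagged in the C5 row): replay the harness against a labeled ground-truth corpus and compute false-negative and false-positive rates per run, stored per harness version so the trend is trackable. The finding-ID scheme and example numbers are illustrative assumptions.

```python
# Sketch: FN/FP measurement for AI-discovered findings (C5, C6).
# ground_truth = IDs of known vulnerabilities planted in the test corpus;
# reported = IDs the frontier-AI harness flagged on the same corpus.

def fn_fp_rates(ground_truth: set[str], reported: set[str]) -> dict[str, float]:
    missed = ground_truth - reported        # false negatives
    spurious = reported - ground_truth      # false positives
    return {
        "false_negative_rate": len(missed) / len(ground_truth) if ground_truth else 0.0,
        "false_positive_rate": len(spurious) / len(reported) if reported else 0.0,
    }

# Example run against a 40-item known-vulnerability set.
known = {f"GT-{i:03d}" for i in range(40)}
flagged = {f"GT-{i:03d}" for i in range(34)} | {"NOISE-001", "NOISE-002"}
print(fn_fp_rates(known, flagged))
# -> {'false_negative_rate': 0.15, 'false_positive_rate': 0.0555...}
```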
4 — Section D: Continuous Pentesting and AI Red Teaming (12 questions)
Anchors: RTCF Tier 4 (continuous operations); red-teaming for AI synthesis; OSFI B-13 Domain 3 (Cyber Security — Defend, Detect, Respond/Recover/Learn); NIST SP 800-218A PW.8.1.R1 (red-teaming named as a recommended SDLC code-testing form); NIST SP 800-218A glossary definition of AI red-teaming.
| # | Question | Evidence | Anchor |
|---|---|---|---|
| D1 | Is penetration testing performed at least annually and on material change, with named scope, methodology, and senior-management reporting? | D | B-13 3.2, 3.3 |
| D2 | Is continuous security testing (BAS, CART, red-team-as-a-service, or equivalent) in place beyond annual point-in-time engagements, with results triaged on a defined cadence? | D, T | RTCF Tier 4 |
| D3 | Are per-tool false-positive rates tracked, with documented suppression rules and periodic re-tuning to prevent alert fatigue? | T | SSDF PW.7, PW.8 |
| D4 | Are false-negative rates measured via known-vulnerability injection, ground-truth corpora, or independent third-party validation, and tracked over time? | T | XBOW-style eval; RTCF Tier 4 |
| D5 | Is the pentest and red-team scope explicitly updated to include AI components — model endpoints, agent loops, MCP servers, retrieval pipelines, and adaptation layers? | D | SP 800-218A PW.8.1.R1; OWASP LLM Top 10 |
| D6 | Is there a documented AI red-teaming program with cadence, scope, methodology, and named accountable owner (per the SP 800-218A glossary definition of AI red-teaming)? | D | SP 800-218A PW.8.1.R1, Appendix A |
| D7 | Are adversarial-input corpora and prompt-injection test suites maintained, refreshed against current threats, and rotated (e.g., garak / PyRIT / AgentDojo / Promptfoo)? | D, T | RTCF Tier 4 |
| D8 | For agent-based systems, is the Lethal Trifecta (private data + untrusted content + external comms) explicitly assessed per system, with containment controls evidenced where exposure exists? (screening sketch after this table) | D, I | Wiki lethal-trifecta concept |
| D9 | Are red-team findings tracked separately from pentest findings with AI-specific severity rubrics (data exfiltration via prompt injection, model decision compromise, agent action hijack)? | T | RTCF Tier 4 |
| D10 | Are CI/CD gates configured to re-run red-team probes on material change to AI components or prompt assets? | D, T | RTCF Tier 4; SSDF PW.8 |
| D11 | Is third-party AI red-teaming evaluated or used (Mindgard CART, HiddenLayer, Protect AI, General Analysis, or equivalent), with sourcing controls applied? | D | RTCF Tier 5 |
| D12 | Is a vulnerability disclosure program (or bug bounty) explicitly scoped to AI surfaces (model endpoints, prompt-handling, agent loops), with safe-harbor language? | D | SSDF RV.1.3 |
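A minimal sketch of the per-system screen D8 describes (flagged in the table above): flag any agent where all three trifecta conditions hold, and raise a Critical finding where no containment control is evidenced. The inventory field names are assumptions for illustration; a real assessment would pull these from the AI inventory in Section B.

```python
# Sketch: Lethal Trifecta screen over an agent inventory (D8).
from dataclasses import dataclass

@dataclass
class AgentRecord:
    name: str
    reads_private_data: bool
    ingests_untrusted_content: bool
    can_communicate_externally: bool
    containment_controls: list[str]   # e.g., inline egress gateway, HITL approval

def trifecta_findings(inventory: list[AgentRecord]) -> list[str]:
    findings = []
    for a in inventory:
        exposed = (a.reads_private_data
                   and a.ingests_untrusted_content
                   and a.can_communicate_externally)
        if exposed and not a.containment_controls:
            findings.append(f"CRITICAL: {a.name} has uncontained Lethal Trifecta exposure")
        elif exposed:
            findings.append(f"REVIEW: {a.name} exposed; verify controls {a.containment_controls}")
    return findings

inventory = [
    AgentRecord("dispute-triage-agent", True, True, True, []),
    AgentRecord("kyc-summarizer", True, True, False, []),
]
print(trifecta_findings(inventory))
# -> ['CRITICAL: dispute-triage-agent has uncontained Lethal Trifecta exposure']
```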
5 — Section E: Identity, Least-Agency, and Supply Chain for AI (10 questions)
Anchors: CMM domains D2 (Identity & Authorization), D3 (Control & Least-Agency), D8 (Supply Chain & AI-BOM); RA Identity, Control, and Data planes; OSFI B-10 (Third-Party Risk Management).
| # | Question | Evidence | Anchor |
|---|---|---|---|
| E1 | Are non-human / agent identities provisioned per agent with documented lifecycle (provision, scope, audit, deprovision) and not shared across services? | D, T | NHI; CMM D2 |
| E2 | Are agents authorized through scoped, expiring capability tokens rather than long-lived API keys or static credentials? (token sketch after this table) | D, O | capability-based authorization; CMM D3 |
| E3 | Is the principle of least-agency enforced — each task grants only the authority required to complete that task — and audited per system? | D, T | CMM D3 |
| E4 | For irreversible operations (transactions, customer notifications, identity changes, sharing changes), is Plan-Validate-Execute or an equivalent HITL pattern enforced? | D, O | Wiki PVE concept |
| E5 | Is the AI-BOM (model bill of materials) signed, retained, and machine-verifiable for each AI-augmented service? | D, T | SP 800-218A PS.3.2; CMM D8 |
| E6 | Is the coding-agent / IDE-extension supply chain governed (rules-file integrity, IDE-extension provenance, destructive-action classification, typosquat defense)? | D, T | Knostic governance |
| E7 | For MCP servers in use, are they scanned at install time, signed, version-pinned, and tracked for CVE feed updates? | D, T | supply chain for agents; MCP CVEs |
| E8 | Is agent egress proxied through an inline gateway with tool authorization, response-leak scanning, and audit telemetry? | O, T | Wiki inline-gateway concept; CMM D5 |
| E9 | Are third-party AI vendors (model providers, agent-platform vendors, MCP-tool providers) governed under OSFI B-10’s third-party-risk framework with documented assessments and SLAs? | D | B-10 §2, §3 |
| E10 | Is there a documented agent-containment / kill-switch procedure with named accountable owner, tested at least quarterly? | D, T | SP 800-218A RV.2.2; CMM D9 |
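A minimal sketch of the E2 pattern (flagged in the table above): per-task capability tokens with a narrow scope and a short expiry, verified before each tool invocation. It uses a plain HMAC construction purely for illustration; a production deployment would issue tokens from the bank's existing token service, with managed keys, rotation, and audit logging. All identifiers are assumptions.

```python
# Sketch: scoped, expiring capability tokens for agents (E2).
import base64, hashlib, hmac, json, time

SIGNING_KEY = b"replace-with-managed-key"   # illustrative; never hard-code keys

def issue_token(agent_id: str, scope: list[str], ttl_seconds: int = 300) -> str:
    claims = {"sub": agent_id, "scope": scope, "exp": int(time.time()) + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"

def authorize(token: str, required_scope: str) -> bool:
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False                                  # tampered token
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims["exp"] > time.time() and required_scope in claims["scope"]

tok = issue_token("payments-reconciler-agent", ["ledger:read"])
print(authorize(tok, "ledger:read"))    # True: in scope and not expired
print(authorize(tok, "ledger:write"))   # False: outside the granted scope
```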
6 — Section F: Observability, Detection, and AI Incident Response (8 questions)
Anchors: CMM domains D7 (Observability & Detection), D9 (Operations & Human Factors); OpenTelemetry gen_ai semantic conventions; OSFI B-13 Domain 3 (Cyber Security — Detect, Respond/Recover/Learn); PIPEDA §10.1 Breach of Security Safeguards.
| # | Question | Evidence | Anchor |
|---|---|---|---|
| F1 | Are agent traces emitted in OTel gen_ai.* semantic conventions for every model call, tool call, and RAG retrieval, with retention aligned to OSFI evidence expectations? (trace sketch after this table) | T, O | OTel gen_ai; CMM D7; B-13 3.3 |
| F2 | Are behavioral baselines maintained per agent (action volume, unique destinations, output length distribution, tool-call profile), with alerts on deviation? | T | CMM D7 |
| F3 | Are canary tokens deployed in system prompts and high-value RAG sources to detect system-prompt leakage and prompt-injection exfiltration? | D, T | Wiki canary tokens for LLMs |
| F4 | Is there an AI-specific incident-response playbook covering data poisoning, model compromise, prompt injection in production, MCP supply-chain compromise, and agent-action hijack? | D | CMM D9; SP 800-218A RV.1.1 |
| F5 | Are OSFI B-13 material-incident reporting timing and PIPEDA breach-notification triggering wired into the AI IR playbook, including the “real risk of significant harm” determination procedure? | D | B-13 3.4; PIPEDA §10.1 |
| F6 | Is post-incident root-cause analysis required to determine whether the SDLC, model lifecycle, or governance process should be updated, with named owner for each remediation? | D | SSDF RV.3; B-13 3.4 |
| F7 | Are retention windows for AI audit logs documented and aligned with OSFI B-13 evidence-preservation expectations (typically 7 years for material financial systems)? | D, T | B-13 2.8, 3.3 |
| F8 | Is the kill-switch / model rollback path tested at least quarterly, with a named accountable owner and documented results retained? | D, T | SP 800-218A RV.2.2; CMM D9 |
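A minimal sketch of the trace emission F1 asks for (flagged in the table above), using the OpenTelemetry Python API with gen_ai.* attribute names from the still-incubating GenAI semantic conventions. Exporter and processor setup is assumed to be configured elsewhere; the model-client wrapper, model name, and attribute coverage are illustrative, not a complete mapping.

```python
# Sketch: emitting a model-call span with gen_ai.* attributes (F1).
from dataclasses import dataclass
from opentelemetry import trace   # assumes the opentelemetry-api package

tracer = trace.get_tracer("bank.agent.runtime")

@dataclass
class Completion:                 # stand-in for the real client's response type
    text: str
    input_tokens: int
    output_tokens: int

def call_model(prompt: str) -> Completion:
    # Placeholder for the bank's actual model-client wrapper.
    return Completion(text="(stub)", input_tokens=len(prompt.split()), output_tokens=1)

def traced_model_call(prompt: str) -> str:
    with tracer.start_as_current_span("chat") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.system", "anthropic")              # provider name
        span.set_attribute("gen_ai.request.model", "claude-sonnet")   # illustrative
        completion = call_model(prompt)
        span.set_attribute("gen_ai.usage.input_tokens", completion.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", completion.output_tokens)
        return completion.text
```

The same pattern extends to tool-call and retrieval spans; the point for F1 is that every hop carries attributes an assessor or Internal Audit can replay against the retention expectations in F7.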
7 — Scoring, Findings, and Reporting
Per-section roll-up. For each section, compute:
section_score_pct = (sum(Yes × 2 + Partial × 1) / (max_possible_excluding_NA)) × 100
section_maturity = L1 / L2 / L3 / L4 / L5 per the threshold table in §0
Whole-engagement tier. The engagement-level tier is the minimum of the per-section tiers — a single L1 section caps the whole engagement at L1, irrespective of strength elsewhere. This is deliberate: a federally-regulated bank cannot operate at L4 in SDLC fundamentals while at L1 in AI governance and claim L4 maturity.
Findings priority — apply per question marked No or Partial. Use the priority table in §0. A finding tied to an OSFI / PIPEDA citation is at minimum Major; safety-critical AI exposure (e.g., Lethal Trifecta in production-facing agent without containment) is Critical regardless of other context.
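Building on the per-section sketch in §0, a minimal illustration of the two roll-up rules above: the engagement tier is the minimum section tier, and a No or Partial answer with an OSFI or PIPEDA anchor is floored at Major, with uncontained Lethal Trifecta exposure forced to Critical. Helper names and the example tiers are illustrative; the §0 trigger table can elevate a finding beyond these floors.

```python
# Sketch: engagement-level roll-up and findings-priority floors (§7).
TIER_ORDER = ["L1", "L2", "L3", "L4", "L5"]

def engagement_tier(section_tiers: dict[str, str]) -> str:
    """Min-tier rule: a single weak section caps the whole engagement."""
    return min(section_tiers.values(), key=TIER_ORDER.index)

def finding_priority(answer: str, osfi_or_pipeda_anchor: bool,
                     trifecta_uncontained: bool) -> str | None:
    """Floor rules only; the §0 trigger table may elevate further."""
    if answer == "Yes":
        return None                      # no finding
    if trifecta_uncontained:
        return "Critical"                # safety-critical AI exposure
    if osfi_or_pipeda_anchor:
        return "Major"                   # regulatory citation floors at Major
    return "Moderate"

tiers = {"A": "L4", "B": "L2", "C": "L3", "D": "L4", "E": "L3", "F": "L3"}
print(engagement_tier(tiers))                                   # -> L2: Section B caps it
print(finding_priority("No", osfi_or_pipeda_anchor=True,
                       trifecta_uncontained=False))             # -> Major
```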
Report deliverables.
| Artifact | Length | Audience |
|---|---|---|
| Executive summary | 1 page | Board / Risk Committee / Executive Sponsor |
| Per-section scorecard table | 1-2 pages | CISO, Engineering leadership, Model Risk |
| Prioritized findings backlog | 2-4 pages | CISO, Eng leads, Model-risk leads |
| 90-day quick-wins list | 1 page | Eng leads, AI platform team |
| Evidence appendix | as needed | Internal Audit, future reviewers |
8 — References and Crosswalks
Single cross-reference table — section to anchors (wiki, regulatory, source-document).
| Section | Wiki anchor | OSFI / PIPEDA citation | Source document |
|---|---|---|---|
| A. Secure-SDLC Foundation | SSDF PO/PS/PW/RV; CMM D1 | B-13 Domain 1, §2.4 SDLC | NIST SP 800-218 v1.1 |
| B. AI Governance & Model Risk | SP 800-218A; CMM D1, D6 | E-23 (2027) §§B/C/D + Appendix A; PIPEDA Principle 4 | NIST SP 800-218A; ISED Voluntary AI Code |
| C. Frontier-AI in CI/CD | Frontier-AI thesis | — (optional layer) | XBOW Mythos eval; Microsoft MDASH; Google Big Sleep / CodeMender |
| D. Continuous Pentesting & AI Red Team | RTCF Tier 4; red-teaming for AI | B-13 Domain 3 (Defend / Detect / Respond) | NIST SP 800-218A PW.8.1.R1 |
| E. Identity, Least-Agency, Supply Chain | CMM D2/D3/D8 | B-10 §2-§3; B-13 1.3 | NIST SP 800-218A PS.1, PS.3.2 |
| F. Observability & AI IR | CMM D7/D9 | B-13 Domain 3 (Detect, Respond/Recover/Learn); PIPEDA §10.1 | NIST SP 800-218A RV.1, RV.2; OTel gen_ai |
Secondary regulatory cross-citations — apply only where the bank has matching operations:
- PCI DSS 4.0 — maps to scorecard Sections A (secure coding), C (AI-assisted code review), and D (testing) — applicable if the bank processes payment card data.
- NYDFS Part 500 (23 NYCRR 500) — §500.5 pentest cadence, §500.17 incident reporting — applicable if the bank is NYDFS-licensed for NY operations.
- FFIEC IT Examination Handbook — Information Security booklet; Development and Acquisition booklet — applicable if the bank has US federally-regulated operations.
- DORA (EU 2022/2554) — Articles 5-16 (ICT risk management), 17-23 (incident reporting), 24-27 (resilience testing, including TLPT — threat-led penetration testing) — applicable if the bank has EU operations.
Out-of-scope but referenced.
- BSIMM — descriptive benchmark of observed practices; optional comparator for large enterprises; not used as a primary scoring instrument here.
- SLSA v1.0 — supply-chain provenance specification; recommended at the SBOM / AI-BOM artifact layer; not directly scored.
- CMMC — US Defense Industrial Base only; not Canadian-relevant.
9 — Notes for the Assessor
The scorecard's first job is to expose the regulatory floor
A federally-regulated Canadian bank operating below L3 on Section A (Secure-SDLC Foundation) or Section B (AI Governance) is exposed under OSFI B-13 (issued 2022, in effect since January 2024) and pre-positioned for non-compliance under OSFI E-23 (2027). Findings in these sections are at minimum Major, frequently Critical. L3 — implemented and auditable — is the regulatory floor, not the target.
Section C is optional, but increasingly common
Several Glasswing-partner organizations (Microsoft, Google, AWS, JPMorgan, Anthropic itself) have publicly disclosed frontier-AI vulnerability-discovery in production CI/CD. For a large bank, the question is no longer if but when and under what controls. A bank without any Section C activity is not behind; a bank piloting it without harness-validation, FN/FP measurement, and human review is.
Sections D, E, and F sharply differentiate at L4 and L5
Many banks reach L3 on continuous pentesting, identity, and observability through standard enterprise security investments. The L4-L5 differentiation appears specifically in the AI overlay: AI-scoped pentest coverage (D5), agent-specific NHI lifecycle (E1), Lethal Trifecta assessment (D8), and OTel gen_ai.* agent traces (F1). These are the questions that separate banks investing in agentic-AI security as a first-class concern from banks merely complying with general SDLC expectations.
OSFI E-23 (2027) effective date is 2027-05-01
Although the guideline was published 2025-09-11, it is not formally effective until 2027-05-01. Findings tied to E-23 are positioned as pre-emptive readiness rather than current non-compliance. Banks operating well below E-23 expectations today have a runway, but the runway is finite and shrinking.
See also
- Secure-SDLC Framework Stack for 2026 — Is NIST SSDF + OWASP SAMM Enough? — the wiki thesis this scorecard operationalizes for a specific sector
- Agentic AI Security CMM 2026 — full 9-domain CMM for engagements that need depth beyond the scorecard
- CMM Measurement Protocol — formal three-stage assessment protocol if scorecard outputs warrant deeper measurement
- Agentic AI Security Reference Architecture — six-plane RA used as the design-time blueprint underlying many scorecard questions
- Red-Teaming Capability Framework — five-tier model that anchors Section D
- Frontier AI for Vulnerability Discovery — thesis underlying Section C
- NIST SSDF (SP 800-218 v1.1) and NIST SP 800-218A (SSDF Community Profile for GenAI and Dual-Use Foundation Models) — the federal-anchor citation surface