From Naptime to Big Sleep

Source: Project Zero — From Naptime to Big Sleep (October 31, 2024). Local copy: .raw/articles/google-big-sleep-projectzero-2024-10-31.md.

Source Summary

Foundational technical announcement of Big Sleep — the Google Project Zero + Google DeepMind collaboration that grew out of the earlier Project Naptime framework for LLM-assisted vulnerability research. The post documents Big Sleep’s first real-world vulnerability: an exploitable stack buffer underflow in SQLite, reported and patched before reaching any official release. Project Zero claims it as “the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software.”

The strategic frame: variant analysis — given a previously-fixed vulnerability, find similar patterns elsewhere. This narrower task is “a better fit for current LLMs than the more general open-ended vulnerability research problem,” reduces ambiguity, and starts from “a concrete, well-founded theory: ‘This was a previous bug; there is probably another similar one somewhere.’” Project Zero’s broader thesis: AI narrows the gap fuzzing leaves behind, with potential for “an asymmetric advantage for defenders.”

Key Contributions

Methodology — variant-analysis framing

  • Big Sleep is fed a previously-fixed vulnerability (commit message + diff) and asked to review the current repository (at HEAD) for related issues that might not have been fixed.
  • Variant analysis is chosen because (a) Project Zero observes “continued in-the-wild discovery of exploits for variants of previously found and patched vulnerabilities”, i.e., fuzzing fails to catch variants; (b) for attackers, manual variant analysis is cost-effective; (c) a concrete starting point reduces ambiguity for the LLM.
  • The prior milestone was the Naptime framework’s state-of-the-art results on Meta’s CyberSecEval2 benchmark; Big Sleep is the continuation of that work toward production use.

First disclosed finding — SQLite stack buffer underflow

  • Vulnerability: in seriesBestIndex in SQLite’s ext/misc/series.c, the sentinel value -1 in the column-index field iColumn (used to indicate ROWID) was not handled. The function computed iCol = pConstraint->iColumn - SERIES_COLUMN_START, which yields a negative value when iColumn == -1; the subsequent aIdx[iCol] = i then writes below the aIdx stack buffer, corrupting adjacent stack memory (the low 32 bits of the pConstraint pointer).
  • In debug builds an assert(iCol >= 0 && iCol <= 2) catches the condition; release builds lack the assertion and the corruption proceeds.
  • Discovery context: Big Sleep was given recent SQLite commits (manually filtered to remove trivial and docs-only changes); for each commit, asked to find related issues at HEAD.
  • Disclosure: reported in “early October” 2024 and fixed the same day by the SQLite maintainers. Caught before any official release, so no users were impacted.
  • Inspiration: Team Atlanta (the DARPA AIxCC team later credited in MDASH’s announcement as the source of several Microsoft ACS team members) had earlier discovered a null-pointer dereference in SQLite at the AIxCC event. Project Zero chose SQLite as its testing target to see whether Big Sleep could find something more serious.
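
The underflow pattern described above can be sketched in a few lines of C. This is a hedged reconstruction of the shape of the bug, not the actual ext/misc/series.c source; the value of SERIES_COLUMN_START and the function names are assumptions for illustration.

```c
/* Hedged sketch of the bug pattern -- NOT the actual ext/misc/series.c
   source. SERIES_COLUMN_START's value is assumed for illustration. */
#define SERIES_COLUMN_START 1

/* iColumn == -1 is the sentinel meaning "the ROWID"; subtracting the
   offset turns it into a negative array index. */
int series_map_column(int iColumn) {
    return iColumn - SERIES_COLUMN_START;
}

/* The buggy shape performed `aIdx[iCol] = value;` unconditionally,
   writing below the 3-element stack buffer when iColumn == -1 (release
   builds lacked the debug assert(iCol >= 0 && iCol <= 2)). This checked
   shape rejects the sentinel first, mirroring the fix's intent. */
int store_constraint_checked(int aIdx[3], int iColumn, int value) {
    int iCol = series_map_column(iColumn);
    if (iCol < 0 || iCol > 2) return -1;   /* ROWID sentinel: refuse */
    aIdx[iCol] = value;
    return 0;
}
```

With iColumn == -1 the mapping returns -2, so the unguarded write would land two slots below the buffer; the bounds check reproduces what the debug-build assert enforced.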

Defensive framing

“Finding vulnerabilities in software before it’s even released, means that there’s no scope for attackers to compete: the vulnerabilities are fixed before attackers even have a chance to use them. Fuzzing has helped significantly, but we need an approach that can help defenders to find the bugs that are difficult (or impossible) to find by fuzzing, and we’re hopeful that AI can narrow this gap. We think that this is a promising path towards finally turning the tables and achieving an asymmetric advantage for defenders.”

Subsequent Public Milestones (post-publication)

While not part of this October 2024 post, public follow-ups establish Big Sleep’s trajectory through 2025-2026:

  • July 2025: SQLite CVE-2025-6965 disclosure — Big Sleep finds a vulnerability that was “known only to threat actors and was at risk of being exploited.” Google claims this is the first time an AI agent has directly foiled efforts to exploit a vulnerability in the wild.
  • August 2025: Google reports Big Sleep has found ~20 security vulnerabilities (TechCrunch coverage).
  • May 2026: Anthropic Glasswing announcement names Big Sleep in Heather Adkins’s quote as Google’s parallel AI-powered cybersecurity tool — positioning Big Sleep as Google’s defender-side analogue to Anthropic’s Mythos deployment.

CMM / RA Maps-to

  • CMM D7 (Observability & Detection) L5+ — Big Sleep is a defender-side variant-analysis primitive; its CVE-2025-6965 disclosure (first AI-foiled in-the-wild exploit) is a candidate L5+ Leading-Edge tier evidence item.
  • CMM D3 (Supply Chain) — pre-release vulnerability discovery in OSS dependencies (SQLite is used as a primary example) is a supply-chain primitive.
  • RA Observability Plane — agentic vuln discovery sits on the defender side; Big Sleep is a candidate primitive.

Convergence with Other Wiki Sources

  • Naptime → CyberSecEval2 lineage: Meta’s CyberSecEval2 benchmark predates Big Sleep; Project Naptime achieved state-of-the-art on it. The wiki should track CyberSecEval (Meta) alongside CyberGym as benchmark surface for ai-vuln-discovery.
  • Team Atlanta — Project Zero, then Microsoft: Team Atlanta’s DARPA AIxCC SQLite null-pointer-dereference work inspired Big Sleep’s SQLite testing focus. Several Team Atlanta members later joined Microsoft’s ACS team that built MDASH. The cross-organization personnel flow is the human-capital signal underlying the May 2026 tri-vendor convergence.
  • CodeMender symmetry: CodeMender (Oct 2025) is Google’s parallel agent for the patching half of the workflow that Big Sleep solves on the discovery half. Both DeepMind-affiliated; both AI-agent design.

Limitations

  • Single disclosed finding. The post walks through one vulnerability in detail; broader recall or precision numbers are not published.
  • Research-stage at publication. Project Zero explicitly notes “Our project is still in the research stage” — productization signal is from subsequent posts, not this one.
  • No model attribution. The post does not name which LLM Big Sleep uses (Gemini-family is presumed but unconfirmed in this source).
  • Variant analysis only. This methodology assumes a previously-fixed vulnerability to seed each search. Open-ended discovery (find-anything-from-scratch) is explicitly out-of-scope for the framing.

Open Questions

  • Big Sleep’s full production capability surface as of 2026 (per the May 2026 Glasswing reference, Big Sleep continues to operate as Google’s parallel to Anthropic’s Mythos / Microsoft’s MDASH — but the public technical detail layer is older than the other two).
  • Relationship between Big Sleep and CodeMender — Big Sleep finds, CodeMender patches; integration / handoff architecture is not documented.
  • Google Cloud Security operationalization — Big Sleep is the Project Zero+DeepMind research surface, but how it surfaces to Google Cloud / Vertex AI customers is not in this post.

See Also