CyberGym Benchmark
CyberGym is a public benchmark for AI-driven vulnerability reproduction — a corpus of 1,507 real-world vulnerability reproduction tasks drawn from 188 OSS-Fuzz projects. It is the load-bearing third-party evaluation surface for agentic vulnerability discovery systems, analogous in function to AgentDojo for prompt-injection robustness or MMLU for general-capability ranking.
Seed page — created 2026-05-13
This page is seeded from a single citing source (Microsoft’s MDASH announcement). Direct ingestion of the CyberGym homepage, leaderboard, methodology paper, and competing-system entries is pending. Next ingest candidate.
Significance
CyberGym is presently the most-cited public leaderboard for AI-driven vulnerability reproduction. Its level-1 configuration (vulnerable source provided + high-level vulnerability description) makes it tractable for evaluation while remaining grounded in real CVEs. Higher difficulty levels — not yet covered in detail here — remove context to test blind discovery.
The benchmark’s role on the wiki:
- The first independently verifiable comparison surface for agentic-AI vulnerability-discovery claims by MDASH, future Anthropic Glasswing releases, and any subsequent vendor entries on the [[frontier-ai-for-vuln-discovery|ai-vuln-discovery axis]].
- Counterpart to AgentDojo (prompt-injection) and CLASP (capability-centric agent evaluation) in the four-quadrant red-team grid — CyberGym sits in the “real-world reproduction” slot.
Known Results
| System | Score | Source | Configuration |
|---|---|---|---|
| Microsoft MDASH | 88.45% | Microsoft, May 2026 | level 1 |
| Claude Mythos Preview (raw model) | 83.1% | Anthropic Glasswing, May 2026 | level 1 |
| Claude Opus 4.6 (raw model) | 66.6% | Anthropic Glasswing, May 2026 | level 1 |
Harness over model — the ~5-point delta
MDASH’s 88.45% sits ~5 percentage points above raw Mythos’s 83.1%. The MDASH harness (multi-model ensemble + specialized agents + debate + dedup + automated PoC construction) adds roughly that delta over the raw model alone. This is the clearest quantitative measurement on the wiki of the “harness over model” architectural argument from both XBOW and Microsoft.
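The deltas above can be restated directly from the Known Results table. A trivial sketch (scores and labels as reported in the table; nothing here comes from CyberGym itself):

```python
# Scores from the Known Results table above (all level-1 configuration, May 2026).
scores = {
    "MDASH (harness)": 88.45,
    "Claude Mythos Preview (raw model)": 83.1,
    "Claude Opus 4.6 (raw model)": 66.6,
}

# Harness-over-model delta: MDASH's full harness vs. the strongest raw model.
harness_delta = scores["MDASH (harness)"] - scores["Claude Mythos Preview (raw model)"]
print(f"harness over model: +{harness_delta:.2f} points")  # +5.35 points

# For comparison, the model-generation delta between the two raw models.
model_delta = scores["Claude Mythos Preview (raw model)"] - scores["Claude Opus 4.6 (raw model)"]
print(f"model generation: +{model_delta:.2f} points")
```

The 5.35-point harness delta is the number the “harness over model” argument rests on; note it is smaller than the 16.5-point gap between raw model generations.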
Direct CyberGym sourcing
The full leaderboard, history, and methodology paper have not yet been ingested. Need to source the CyberGym homepage and the competing entries’ published numbers (beyond Mythos + Opus 4.6).
Configuration Levels
- Level 1: vulnerable source code provided + high-level vulnerability description. This is the level at which Microsoft’s published 88.45% was measured.
- Higher levels (less context, blind discovery, harness-format constraints): not yet documented on this page.
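To make the level distinction concrete, here is a hypothetical sketch (not CyberGym’s actual task schema — the field names and example values are invented for illustration) of what a level-1 task bundle provides versus a higher-difficulty blind variant:

```python
# Hypothetical task bundle, illustrating the configuration levels described
# above. Field names and example values are assumptions, not CyberGym's API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReproductionTask:
    project: str                      # OSS-Fuzz project the vulnerability came from
    vulnerable_source: Optional[str]  # provided at level 1, withheld at higher levels
    description: Optional[str]        # high-level vulnerability description (level 1)
    harness_format: str               # e.g. "libFuzzer" or "honggfuzz" (see Limitations)


# Level 1: source and description both provided.
level1 = ReproductionTask(
    project="example-project",
    vulnerable_source="src/parse.c",
    description="heap buffer overflow in tag parser",
    harness_format="libFuzzer",
)

# Higher level: context removed to test blind discovery.
blind = ReproductionTask(
    project="example-project",
    vulnerable_source=None,
    description=None,
    harness_format="honggfuzz",
)
```

The design point is simply that level 1 trades away blind-discovery realism for evaluation tractability: the agent still has to construct a working proof-of-concept, but it knows where to look.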
Limitations and Caveats
- Description quality matters: Microsoft’s failure analysis of MDASH’s remaining ~12% errors shows that 82% of wrong-area findings came from tasks with vague descriptions that also lacked function or file identifiers — description quality is a major factor in scan accuracy.
- Harness-format mismatch: agents occasionally constructed libFuzzer-style inputs when the benchmark task required honggfuzz format, producing otherwise-sound reproductions that fail on harness-format mismatch.
- OSS-Fuzz domain: CyberGym is biased toward C/C++ memory-safety bug classes typical of OSS-Fuzz; coverage of web vulns, prompt-injection, supply-chain, or AI-application classes is structurally limited.
- Public-benchmark contamination risk: as vendors target the leaderboard, model training data may absorb the corpus; the same concern that motivated XBOW’s StorageDrive private-benchmark design.
CMM / RA Maps-to
- CMM D7 (Observability & Detection) L4 — fits the four-quadrant red-team grid’s “real-world reproduction benchmark” slot. Should be cited alongside AgentDojo in CMM evidence checklists for D7 L4.
See Also
- MDASH — current leaderboard leader.
- Microsoft’s MDASH announcement — citing source.
- Frontier AI for Vulnerability Discovery — the wiki thesis CyberGym anchors as a benchmark surface.
- AgentDojo — sibling public benchmark, different bug class (prompt injection).
- Red Teaming for AI: Synthesis — wiki position on the four-quadrant grid.