CyberGym Benchmark
CyberGym is a public benchmark for AI-driven vulnerability reproduction — a corpus of 1,507 real-world vulnerability reproduction tasks drawn from 188 OSS-Fuzz projects. It is the load-bearing third-party evaluation surface for agentic vulnerability discovery systems, analogous in function to AgentDojo for prompt-injection robustness or MMLU for general-capability ranking.
Seed page — created 2026-05-13
This page is seeded from a single citing source (Microsoft’s MDASH announcement). Direct ingestion of the CyberGym homepage, leaderboard, methodology paper, and competing-system entries is pending. Next ingest candidate.
Significance
CyberGym is presently the most-cited public leaderboard for AI-driven vulnerability reproduction. Its level-1 configuration (vulnerable source provided + high-level vulnerability description) makes it tractable for evaluation while remaining grounded in real CVEs. Higher difficulty levels — not yet covered in detail here — remove context to test blind discovery.
The benchmark’s role on the wiki:
- The first independently verifiable comparison surface for agentic-AI vulnerability-discovery claims by MDASH, future Anthropic Glasswing releases, and any subsequent vendor entries on the [[frontier-ai-for-vuln-discovery|ai-vuln-discovery axis]].
- Counterpart to AgentDojo (prompt-injection) and CLASP (capability-centric agent evaluation) in the four-quadrant red-team grid — CyberGym sits in the “real-world reproduction” slot.
Known Results
| System | Score | Source | Configuration |
|---|---|---|---|
| Microsoft MDASH | 88.45% | Microsoft, May 2026 | level 1 |
| Claude Mythos Preview (raw model) | 83.1% | Anthropic Glasswing, May 2026 | level 1 |
| Claude Opus 4.6 (raw model) | 66.6% | Anthropic Glasswing, May 2026 | level 1 |
Harness over model — the ~5-point delta
MDASH’s 88.45% sits ~5 percentage points above raw Mythos’s 83.1%. The MDASH harness (multi-model ensemble + specialized agents + debate + dedup + automated PoC construction) adds roughly that delta over the raw model alone. This is the clearest quantitative measurement on the wiki of the “harness over model” architectural argument from both XBOW and Microsoft.
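The deltas above can be restated directly from the Known Results table. A trivial sketch (scores and labels as reported in the table; nothing here comes from CyberGym itself):

```python
# Scores from the Known Results table above (all level-1 configuration, May 2026).
scores = {
    "MDASH (harness)": 88.45,
    "Claude Mythos Preview (raw model)": 83.1,
    "Claude Opus 4.6 (raw model)": 66.6,
}

# Harness-over-model delta: MDASH's full harness vs. the strongest raw model.
harness_delta = scores["MDASH (harness)"] - scores["Claude Mythos Preview (raw model)"]
print(f"harness over model: +{harness_delta:.2f} points")  # +5.35 points

# For comparison, the model-generation delta between the two raw models.
model_delta = scores["Claude Mythos Preview (raw model)"] - scores["Claude Opus 4.6 (raw model)"]
print(f"model generation: +{model_delta:.2f} points")
```

The 5.35-point harness delta is the number the “harness over model” argument rests on; note it is smaller than the 16.5-point gap between raw model generations.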
Direct CyberGym sourcing
The full leaderboard, history, and methodology paper have not yet been ingested. Need to source the CyberGym homepage and the competing entries’ published numbers (beyond Mythos + Opus 4.6).
Configuration Levels
- Level 1: vulnerable source code provided + high-level vulnerability description. This is the level at which Microsoft’s published 88.45% was measured.
- Higher levels (less context, blind discovery, harness-format constraints): not yet documented on this page.
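To make the level distinction concrete, here is a hypothetical sketch (not CyberGym’s actual task schema — the field names and example values are invented for illustration) of what a level-1 task bundle provides versus a higher-difficulty blind variant:

```python
# Hypothetical task bundle, illustrating the configuration levels described
# above. Field names and example values are assumptions, not CyberGym's API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReproductionTask:
    project: str                      # OSS-Fuzz project the vulnerability came from
    vulnerable_source: Optional[str]  # provided at level 1, withheld at higher levels
    description: Optional[str]        # high-level vulnerability description (level 1)
    harness_format: str               # e.g. "libFuzzer" or "honggfuzz" (see Limitations)


# Level 1: source and description both provided.
level1 = ReproductionTask(
    project="example-project",
    vulnerable_source="src/parse.c",
    description="heap buffer overflow in tag parser",
    harness_format="libFuzzer",
)

# Higher level: context removed to test blind discovery.
blind = ReproductionTask(
    project="example-project",
    vulnerable_source=None,
    description=None,
    harness_format="honggfuzz",
)
```

The design point is simply that level 1 trades away blind-discovery realism for evaluation tractability: the agent still has to construct a working proof-of-concept, but it knows where to look.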
Limitations and Caveats
- Description quality matters: Microsoft’s failure analysis of MDASH’s remaining ~12% errors shows that 82% of wrong-area findings came from tasks with vague descriptions that also lacked function or file identifiers — description quality is a major factor in scan accuracy.
- Harness-format mismatch: agents occasionally constructed libFuzzer-style inputs when the benchmark task required honggfuzz format, producing otherwise-sound reproductions that fail on harness-format mismatch.
- OSS-Fuzz domain: CyberGym is biased toward C/C++ memory-safety bug classes typical of OSS-Fuzz; coverage of web vulns, prompt-injection, supply-chain, or AI-application classes is structurally limited.
- Public-benchmark contamination risk: as vendors target the leaderboard, model training data may absorb the corpus; the same concern that motivated XBOW’s StorageDrive private-benchmark design.
CMM / RA Maps-to
- CMM D7 (Observability & Detection) L4 — fits the four-quadrant red-team grid’s “real-world reproduction benchmark” slot. Should be cited alongside AgentDojo in CMM evidence checklists for D7 L4.
See Also
- MDASH — current leaderboard leader.
- Microsoft’s MDASH announcement — citing source.
- Frontier AI for Vulnerability Discovery — the wiki thesis CyberGym anchors as a benchmark surface.
- AgentDojo — sibling public benchmark, different bug class (prompt injection).
- Red Teaming for AI: Synthesis — wiki position on the four-quadrant grid.