Mythos for Offensive Security — XBOW’s Evaluation
Source: XBOW Blog — Mythos for Offensive Security: XBOW’s Evaluation (2026-05-12). Local copy: .raw/articles/xbow-mythos-evaluation-2026-05-13.md.
Source Summary
XBOW’s public evaluation of Mythos Preview — an Anthropic frontier model to which XBOW was granted early access roughly two months before the post — focused on offensive-security use cases: web-app vulnerability discovery, native-code analysis, reverse engineering, and judgment-quality benchmarks. A ten-expert XBOW team ran Mythos through XBOW’s internal benchmark suite (the same harness previously applied to Opus 4.6 and GPT 5.5), plus expanded coverage of threat-modeling judgment, source-code vs. live-site comparison, and native/embedded-firmware tasks.
The headline finding: Mythos Preview is the strongest frontier model XBOW has evaluated at vulnerability candidate discovery, especially from source code, but it is not self-sufficient as a security tool. It requires “the right harness and the right tools” (XBOW’s framing for tool orchestration plus live-site interaction) to convert candidate findings into validated exploits.
Key Contributions
Quantitative results
- Web exploit benchmark: vs. Opus 4.6, Mythos Preview cut false negatives by 42% without source access and 55% with source access. A finding counted only when PoC-validated (“PoC or GTFO”) within ≤80 actions in XBOW’s harness; the sketch after this list shows how the relative metric is computed.
- Source-code reading > source-code writing: Mythos’s discovery advantage was larger when given source than when constrained to live-site-only. The recurring theme: “impressive at writing code, but even more impressive at reading it.”
- Token-for-token efficiency: at high-accuracy operating points, Mythos beats Opus 4.6 and GPT 5.5 on output-token budget. At lower-accuracy operating points, a longer GPT 5.5 run is often more cost-efficient.
- Anthropic-disclosed pricing (as cited by XBOW): Mythos Preview is to be priced at 5× Opus at GA, making cost-normalized comparisons a genuine procurement consideration rather than a benchmark footnote.
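To make the relative metric concrete, here is a minimal sketch of the false-negative-reduction arithmetic behind the 42%/55% figures. The baseline counts are hypothetical, since the post does not publish absolute pass rates (see Limitations).

```python
# Hypothetical illustration of the relative metric XBOW reports: the post gives
# only percentage reductions, so these baseline miss counts are invented.

def false_negative_reduction(fn_baseline: int, fn_new: int) -> float:
    """Fraction of the baseline model's missed vulnerabilities the new model finds."""
    return 1 - fn_new / fn_baseline

# Suppose, in a given condition, Opus 4.6 missed 100 of the benchmark's
# vulnerabilities; a Mythos miss count of 58 yields the 42% figure, 45 the 55%.
print(false_negative_reduction(100, 58))  # 0.42 -> the reported 42% (no source)
print(false_negative_reduction(100, 45))  # 0.55 -> the reported 55% (with source)
```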
Pricing claim contradicted by Anthropic-direct source
XBOW’s “5× Opus at GA” claim above is contradicted by Anthropic’s Project Glasswing landing page (May 12, 2026, the same day as XBOW’s post). Anthropic states: (a) Mythos is not planned for general availability; (b) Glasswing-participant pricing is $125 per million input/output tokens — approximately 1.67× Opus 4.6’s $75, not 5×. XBOW may have been quoting verbal communication from Anthropic that was subsequently revised, or may have been referring to a different pricing model. Treat the Glasswing landing page as authoritative; the XBOW citation is preserved here for traceability.
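A rough sketch of why the disputed rate matters for cost-normalized procurement comparisons. The token counts below are invented; the per-million rates are the figures in contention above (Opus 4.6 at $75, Glasswing at $125, XBOW’s cited 5× figure at $375).

```python
# Cost-normalizing a hypothetical benchmark run under both pricing claims.
# Token counts are invented; rates come from the contradiction noted above.

OPUS_RATE = 75.0                  # $/M output tokens, Opus 4.6 (Glasswing page)
GLASSWING_RATE = 125.0            # ~1.67x Opus, the Anthropic-direct figure
XBOW_CITED_RATE = 5 * OPUS_RATE   # 375.0, the "5x Opus at GA" claim

def run_cost(output_tokens: int, rate_per_million: float) -> float:
    return output_tokens / 1_000_000 * rate_per_million

# Say Mythos validates a finding in 400k output tokens where Opus needs 900k.
mythos_tokens, opus_tokens = 400_000, 900_000
print(run_cost(opus_tokens, OPUS_RATE))          # $67.50 per finding, baseline
print(run_cost(mythos_tokens, GLASSWING_RATE))   # $50.00: cheaper than Opus
print(run_cost(mythos_tokens, XBOW_CITED_RATE))  # $150.00: pricier than Opus
```

Under the Glasswing rate, the token-efficiency advantage translates directly into a lower cost per finding; under the 5× claim it does not, which is why the discrepancy is a procurement question rather than a benchmark footnote.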
Qualitative results
- Native-code and reverse engineering: strong performance on Chromium-related testing and V8 sandbox work, including true positives in subtle threat models where prior baselines produced candidate findings but no successful validations. Reverse engineering across unusual firmware and embedded-system architectures was among the most striking outcomes.
- Browser interaction and visual acuity: matches Sonnet 4.6, dramatically outperforms Opus 4.6 at UI-element identification. Practically effective for live-site workflows, though not pixel-accurate for exact coordinates.
- Judgment is mixed: command-safety benchmark accuracy of 77.8% — below both Opus 4.6 (81.2%) and Haiku 4.5 (90.1%). Mythos tends to prioritize the letter of safety rules over their spirit; this literal-conservative bias produces false positives in edge cases (illustrated in the sketch after this list).
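A hypothetical illustration (not from XBOW’s benchmark) of the letter-vs-spirit failure mode: a rule keyed on the literal text of a dangerous pattern flags a harmless command that merely mentions it.

```python
# Invented example of literal-conservative bias in command-safety judgment.
# A letter-of-the-rule check fires on any command whose text contains a
# dangerous name; a spirit-of-the-rule check asks what the command does.

import shlex

DANGEROUS_COMMANDS = {"rm", "mkfs", "dd"}

def literal_check(command: str) -> bool:
    """Letter of the rule: flag any command whose text mentions a dangerous name."""
    return any(name in command for name in DANGEROUS_COMMANDS)

def spirit_check(command: str) -> bool:
    """Spirit of the rule (toy heuristic): flag only if the command invokes one."""
    argv = shlex.split(command)
    return bool(argv) and argv[0] in DANGEROUS_COMMANDS

benign = "grep 'rm -rf' /var/log/audit.log"   # searching logs, not deleting
print(literal_check(benign))  # True  -> the false positive XBOW describes
print(spirit_check(benign))   # False -> the judgment-aware reading
```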
Methodological notes
- XBOW disentangles “Mythos the raw model” (used via API as an engine for XBOW’s agents) from “Mythos inside Claude Code” — the orchestration, tooling, prompting, and live-site access materially change outcomes. A minimal hypothetical sketch of such a harness loop follows this list.
- XBOW’s benchmark harness design lets a vulnerability be found from code alone, which lets XBOW measure the source-vs-live-site asymmetry cleanly. Even on benchmarks where the vulnerability lives purely in code, removing live-site access hurts performance more than removing source-code access — that asymmetry is XBOW’s commercial wedge.
- The McGraw quote — “you won’t find the majority of defects by staring at code alone” — is cited as motivation for combining static and dynamic analysis under model orchestration.
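A minimal, purely hypothetical sketch of the distinction the first note draws: the raw model only proposes actions, while the harness supplies tool execution, live-site access, and validation within the action budget. None of the names below come from XBOW’s actual harness.

```python
# Purely hypothetical harness loop; all names (query_model, TOOLS, Finding)
# are invented. The raw model proposes actions; only the harness can execute
# them against a live target and validate a candidate within the budget.

from dataclasses import dataclass

@dataclass
class Finding:
    description: str
    validated: bool = False

def query_model(transcript: list[str]) -> dict:
    """Stand-in for an API call to the raw model, returning a proposed action."""
    return {"tool": "http_request", "args": {"path": "/search?q='"}}

def http_request(path: str) -> str:
    """Stand-in for live-site interaction, the capability a raw model lacks."""
    return "500 Internal Server Error: unterminated string in SQL statement"

TOOLS = {"http_request": http_request}
MAX_ACTIONS = 80  # the per-finding validation budget from the benchmark note

def harness(target: str) -> Finding | None:
    transcript = [f"target: {target}"]
    for _ in range(MAX_ACTIONS):
        action = query_model(transcript)
        observation = TOOLS[action["tool"]](**action["args"])
        transcript.append(observation)
        if "SQL" in observation:  # toy stand-in for PoC validation
            return Finding("candidate SQL injection", validated=True)
    return None

print(harness("https://example.test"))
```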
CMM / RA Maps-to
- CMM D7 (Observability & Detection) L4–L5 — adversarial coverage primitives; XBOW’s harness is an instance of continuous adversarial testing applied to AI app stacks (the four-quadrant red-team grid’s “continuous” quadrant for offensive workflows).
- CMM D9 (Operations & Human Factors) — pricing and operational-cost framing matter; Mythos’s premium pricing (5× Opus per XBOW’s citation, ~1.67× per the Glasswing page) is a procurement consideration relevant to security-operations budgeting.
- RA Observability Plane — vulnerability-discovery agents (XBOW orchestrating Mythos) sit on the defender side; the analogous attacker-side capability (autonomous offensive agents using the same models) is the offensive-axis mirror.
Cross-Axis Implications
- ai-vuln-discovery: first sourced anchor on this wiki axis. Establishes that frontier models are materially advancing vulnerability discovery but require orchestration and tooling to convert candidate findings into validated exploits. Promotes the thesis page from seed to developing.
- ai-in-sec-offense: XBOW is the canonical autonomous-pentest tooling reference; this is the first sourced page describing how XBOW orchestrates a frontier model. Reduces the gap noted in the offensive-AI thesis.
- sec-against-ai: not directly addressed, but the implication is sharp — if XBOW + Mythos produces a 42–55% reduction in false negatives on web vulnerability discovery, an analogous offensive deployment by adversaries (without responsible-disclosure pipelines) collapses defender response times. The SDLC thesis should annex this finding.
Limitations
- No raw numbers for absolute pass rates in the source — the post reports relative improvements (42%, 55%) and qualitative ratings, not the underlying baseline pass rates. Independent reproduction will require XBOW (or AISI / Point Estimate) to publish full benchmark numbers.
- Mythos availability: not yet on public APIs at the time of writing. The evaluation is on a preview build under direct Anthropic partnership; production behavior may differ.
- Vendor evaluation by partner: XBOW received early access from Anthropic and is a commercial beneficiary of Mythos’s strengths (its product orchestrates the model). The post is honest about its position but is not independent.
Open Questions Surfaced
- Where does Project Glasswing (Anthropic’s source-code-audit application of Mythos) sit relative to XBOW’s offensive-orientation? Is Glasswing the defender application that maps to XBOW’s offensive one?
- AI Security Institute (AISI) benchmarked Mythos vs GPT 5.5 (referenced via Point Estimate’s analysis). What is the AISI canonical methodology and how does it compare to XBOW’s?
- What is Anthropic’s stance on offensive-security use of Mythos at GA? The Glasswing partnership signals defender framing; commercial offensive use (XBOW) appears to be tolerated under existing usage policies but the policy boundary is not addressed in the post.
See Also
- XBOW — the company / autonomous-pentest platform doing this evaluation.
- Mythos — the model being evaluated (Anthropic frontier model, preview-stage).
- Anthropic — model vendor.
- Frontier AI for Vulnerability Discovery — the wiki thesis this source anchors.
- Offensive AI: State of the Field — adjacent thesis.
- Red Teaming for AI: Synthesis — for the four-quadrant red-team grid.