XBOW
Sources: Homepage · Mythos Evaluation (May 2026) · Opus 4.7 First Look
XBOW is an autonomous offensive-security platform that orchestrates frontier AI models — Anthropic’s Opus, Sonnet, Haiku, and preview-stage Mythos, plus OpenAI’s GPT line — against live web targets to discover and validate exploitable vulnerabilities. The product is framed as a pentest surface that targets live deployments as an attacker sees them, not as a static analysis tool. The company’s commercial wedge is the orchestration and tooling layer that converts a frontier model’s vulnerability candidates into validated, reproducible exploits.
Capability Surface
- Live-site exploitation harness: agents execute up to ~80 “actions” per case (shell commands, Python scripts, attack tools) to attempt validated exploitation. A case counts as passed only when the agent produces a working, validated proof of concept (“PoC or GTFO”).
- Web exploit benchmark: XBOW’s internal scoring substrate, harvested from open-source applications frozen at known-vulnerable versions. Reused across multiple Anthropic and OpenAI model evaluations.
- Multi-model orchestration: XBOW explicitly maintains “a cadre of models” rather than committing to a single frontier vendor. Choosing, say, a short Mythos run over a longer GPT-5.5 run is framed as a per-task cost-accuracy tradeoff.
- Source + live-site combined detection: the canonical XBOW pattern is to analyze source code for leads, probe live-site behavior to understand deployment-specific exposure, then craft a validated exploit. Removing live-site access hurts performance more than removing source-code access — even on benchmarks where the vulnerability is purely in code.
- Visual acuity / browser UI workflows: XBOW’s agents drive browser interaction; visual-acuity QA is part of the internal benchmark suite.
- Native-code and reverse-engineering: explicit capability area in the Mythos evaluation; XBOW tests against Chromium, V8, and embedded firmware contexts.
- Internal benchmarks named in public posts: web exploit benchmark, command-safety benchmark, visual-acuity QA. Not all of XBOW’s evaluation harness is public.
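The exploitation harness described above — a bounded per-case action budget with pass-only-on-validated-PoC scoring — can be sketched roughly as follows. All names here (`run_case`, `propose_action`, `validate_poc`, the `CaseResult` shape) are illustrative assumptions for exposition, not XBOW’s actual API; only the ~80-action budget and the validated-PoC pass criterion come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ACTIONS = 80  # per-case action budget described in XBOW's write-up (approximate)

@dataclass
class CaseResult:
    passed: bool
    actions_used: int
    poc: Optional[str] = None

def run_case(propose_action: Callable[[list], tuple],
             validate_poc: Callable[[str], bool],
             max_actions: int = MAX_ACTIONS) -> CaseResult:
    """Drive an agent until it submits a validated PoC or exhausts its budget.

    propose_action: given the transcript so far, returns (action, observation).
                    In a real harness the observation would come from executing
                    the action (shell command, Python script, attack tool).
    validate_poc:   returns True only for a working proof of concept; a case
                    is scored as passed on nothing less.
    """
    transcript = []
    for step in range(1, max_actions + 1):
        action, observation = propose_action(transcript)
        transcript.append((action, observation))
        # Hypothetical convention: the agent signals a candidate exploit
        # with a "submit_poc:<payload>" action.
        if action.startswith("submit_poc:"):
            poc = action.split(":", 1)[1]
            if validate_poc(poc):
                return CaseResult(passed=True, actions_used=step, poc=poc)
    return CaseResult(passed=False, actions_used=max_actions)
```

The key design point the source emphasizes is the scoring rule: an unvalidated vulnerability claim never counts, so the loop only returns `passed=True` after `validate_poc` succeeds.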
Position in the Wiki
XBOW closes a major gap on the ai-in-sec-offense scope axis — see Scope Expansion Punch-List item 3 and the Offensive AI: State of the Field thesis. It also anchors the ai-vuln-discovery axis: the Mythos evaluation paper is the first sourced data point on the wiki for frontier-model-driven vulnerability discovery in operational use.
Relative to other entries on the wiki:
- General Analysis (CART for agentic AI) is XBOW’s nearest neighbor in the scope-axis matrix. General Analysis tests AI applications; XBOW tests web applications using AI as the attacker. The methodology surface (adversarial testing, validated outcomes, continuous campaigns) is similar; the targets diverge.
- Mindgard CART is closer to General Analysis (AI-application red-teaming) than to XBOW.
- XBOW’s public partnership with Anthropic on the Mythos preview is analogous to Anthropic’s Project Glasswing — XBOW is the offensive-orientation collaborator; Glasswing is the defender-orientation collaborator.
Mythos Evaluation (May 2026) Highlights
See XBOW’s Mythos Evaluation paper for the full source summary. Key claims from XBOW about Mythos Preview, with XBOW’s framing:
- 42% false-negative reduction (no source) and 55% (with source) vs Opus 4.6 on the web exploit benchmark.
- Source-code reading is Mythos’s strongest mode; live-site validation remains “the hard part” and is XBOW’s commercial wedge.
- Token-for-token efficient at high-accuracy operating points; not best-in-class cost-normalized for lower-accuracy budgets.
- Strong on native-code and reverse engineering; mixed on judgment (77.8% command-safety accuracy, behind Opus 4.6 at 81.2% and Haiku 4.5 at 90.1%).
Open Questions
- Funding and commercial stage: not addressed in the Mythos evaluation. To be sourced from XBOW press releases, investor disclosures, or analyst coverage.
- Customer base / case studies: XBOW’s public materials emphasize self-evaluation against frontier models; customer-side case studies are pending ingest.
- Methodology transparency: XBOW’s benchmark harvest method is described in outline but not in reproducible detail. AI Security Institute (AISI) and Point Estimate analyses (referenced in the Mythos post) are candidate independent comparators.
- Disclosure pipeline: when XBOW + Mythos finds zero-days during evaluation, how does coordinated disclosure proceed? Not addressed in the post.
Founders / Leadership Team
Not captured in the May 2026 source. Next ingest pass should add founders, exec team, and any public board members.
See Also
- Mythos — Anthropic frontier model XBOW orchestrates.
- Anthropic — model vendor / partner for the Mythos evaluation.
- XBOW’s Mythos Evaluation paper — the source for this entity page.
- Offensive AI: State of the Field — the wiki thesis XBOW anchors.
- Frontier AI for Vulnerability Discovery — the adjacent thesis Mythos anchors.
- Red Teaming Capability Framework — tier-5 vendor evaluation criteria; XBOW fits.