XBOW

Sources: Homepage · Mythos Evaluation (May 2026) · Opus 4.7 First Look

XBOW is an autonomous offensive-security platform that orchestrates frontier AI models — Anthropic’s Opus, Sonnet, Haiku, and preview-stage Mythos; OpenAI’s GPT — against live web targets to discover and validate exploitable vulnerabilities. The product is framed as a pentest surface (targeting live deployments as an attacker sees them) rather than a static analysis tool, and the company’s commercial wedge is the orchestration plus tooling layer that converts a frontier model’s vulnerability candidates into validated, reproducible exploits.

Capability Surface

  • Live-site exploitation harness: agents execute up to ~80 “actions” per case (shell commands, Python scripts, attack tools) to attempt validated exploitation. A case counts as passed only when a working, validated proof of concept is produced (PoC-or-GTFO); a minimal sketch of this loop appears after this list.
  • Web exploit benchmark: XBOW’s internal scoring substrate, harvested from open-source applications frozen at known-vulnerable versions. Reused across multiple Anthropic and OpenAI model evaluations.
  • Multi-model orchestration: XBOW explicitly maintains “a cadre of models” rather than committing to a single frontier vendor. The choice between a short Mythos run and a longer GPT-5.5 run is framed as a per-task cost-accuracy tradeoff (a routing sketch follows this list).
  • Source + live-site combined detection: the canonical XBOW pattern is to analyze source code for leads, probe live-site behavior to understand deployment-specific exposure, then craft a validated exploit (see the pipeline sketch after this list). Removing live-site access hurts performance more than removing source-code access, even on benchmarks where the vulnerability is purely in code.
  • Visual acuity / browser UI workflows: XBOW’s agents drive browser interaction; visual-acuity QA is part of the internal benchmark suite.
  • Native-code and reverse-engineering: explicit capability area in the Mythos evaluation; XBOW tests against Chromium, V8, and embedded firmware contexts.
  • Internal benchmarks named in public posts: web exploit benchmark, command-safety benchmark, visual-acuity QA. Not all of XBOW’s evaluation harness is public.
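
XBOW describes the harness only in outline, so the following is a minimal sketch of the action-budget loop from the live-site bullet above, under an assumed agent/target interface. None of these names are XBOW’s API; only the ~80-action budget and the pass-only-on-validated-PoC rule come from the public posts.

```python
# Hypothetical sketch of an action-budget exploitation loop.
# The interface (propose_action / execute / validate_poc) is assumed,
# not XBOW's; the budget and pass rule are from XBOW's posts.

MAX_ACTIONS = 80  # approximate per-case action budget reported by XBOW

def run_case(agent, target):
    """Drive one benchmark case until a validated exploit or budget exhaustion."""
    transcript = []
    for step in range(MAX_ACTIONS):
        # The agent picks its next action: a shell command, a Python
        # script, or an attack-tool invocation against the live target.
        action = agent.propose_action(target, transcript)
        observation = target.execute(action)
        transcript.append((action, observation))

        # PoC-or-GTFO: a case passes only on an end-to-end validated
        # exploit, never on a plausible-looking but unconfirmed finding.
        if target.validate_poc(observation):
            return {"passed": True, "steps": step + 1, "transcript": transcript}
    return {"passed": False, "steps": MAX_ACTIONS, "transcript": transcript}
```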
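
The per-task routing from the multi-model bullet might look like the sketch below. The model names are the ones XBOW mentions; the cost and solve-rate figures are placeholders, not published numbers.

```python
# Illustrative per-task routing over a "cadre of models".
# All cost and solve-rate values below are placeholders.

MODELS = {
    "mythos-preview": {"cost_per_case": 9.0, "solve_rate": 0.62},
    "gpt-5.5":        {"cost_per_case": 4.0, "solve_rate": 0.51},
    "haiku-4.5":      {"cost_per_case": 0.8, "solve_rate": 0.30},
}

def pick_model(budget_per_case: float) -> str:
    """Choose the highest-solve-rate model that fits this case's budget."""
    affordable = {
        name: spec for name, spec in MODELS.items()
        if spec["cost_per_case"] <= budget_per_case
    }
    if not affordable:
        raise ValueError("no model fits this budget")
    return max(affordable, key=lambda name: affordable[name]["solve_rate"])

print(pick_model(5.0))   # -> gpt-5.5 (best solve rate under budget)
print(pick_model(10.0))  # -> mythos-preview (budget admits the top model)
```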
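
And the combined-detection bullet as an illustrative pipeline: source analysis yields leads, live probing filters them down to deployment-reachable ones, and a validated exploit is the only accepted terminal state. Function names here are hypothetical.

```python
# Sketch of the canonical source-plus-live-site pattern described above.
# The stage names (scan_source / probe_live / craft_exploit) are
# illustrative, not XBOW's.

def hunt(source_repo, live_target, agent):
    leads = agent.scan_source(source_repo)  # candidate vulns from code
    for lead in leads:
        # Live probing discards leads the deployment does not actually
        # expose (patched, unreachable, or gated by configuration).
        if not agent.probe_live(live_target, lead):
            continue
        exploit = agent.craft_exploit(live_target, lead)
        if live_target.validate_poc(exploit):
            yield lead, exploit
```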

Position in the Wiki

XBOW closes a major gap on the ai-in-sec-offense scope axis — see Scope Expansion Punch-List item 3 and the Offensive AI: State of the Field thesis. It also anchors the ai-vuln-discovery axis: the Mythos evaluation paper is the first sourced data point on the wiki for frontier-model-driven vulnerability discovery in operational use.

Relative to other entries on the wiki:

  • General Analysis (CART for agentic AI) is XBOW’s nearest neighbor in the scope-axis matrix. General Analysis tests AI applications; XBOW tests web applications using AI as the attacker. The methodology surface (adversarial testing, validated outcomes, continuous campaigns) is similar; the targets diverge.
  • Mindgard CART is closer to General Analysis (AI-application red-teaming) than to XBOW.
  • XBOW’s public partnership with Anthropic on the Mythos preview is analogous to Anthropic’s Project Glasswing (mentioned below): XBOW is the offense-oriented collaborator, Glasswing the defense-oriented one.

Mythos Evaluation (May 2026) Highlights

See XBOW’s Mythos Evaluation paper for the full source summary. Key claims about Mythos Preview, in XBOW’s own framing:

  • 42% false-negative reduction (no source access) and 55% (with source) vs Opus 4.6 on the web exploit benchmark (worked through after this list).
  • Source-code reading is Mythos’s strongest mode; live-site validation remains “the hard part” and is XBOW’s commercial wedge.
  • Token-for-token efficient at high-accuracy operating points, but not best-in-class on a cost-normalized basis at lower-accuracy budgets.
  • Strong on native-code and reverse engineering; mixed on judgment (77.8% command-safety accuracy, behind Opus 4.6 at 81.2% and Haiku 4.5 at 90.1%).
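
For intuition on the headline numbers: a false-negative reduction compresses the miss count, not the pass rate. The baseline below is hypothetical (the summary gives no absolute miss counts); only the 42% and 55% reduction figures come from XBOW’s post.

```python
# What a false-negative reduction means in absolute terms.
# The 100-case / 40-miss baseline is hypothetical; only the 42% and
# 55% reduction figures come from XBOW's evaluation post.

def remaining_misses(baseline_misses: int, reduction: float) -> float:
    return baseline_misses * (1.0 - reduction)

baseline = 40  # hypothetical Opus 4.6 misses out of 100 vulnerable cases
print(remaining_misses(baseline, 0.42))  # no source:   40 -> 23.2 misses
print(remaining_misses(baseline, 0.55))  # with source: 40 -> 18.0 misses
```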

Open Questions

  • Funding and commercial stage: not addressed in the Mythos evaluation. To be sourced from XBOW press releases, investor disclosures, or analyst coverage.
  • Customer base / case studies: XBOW’s public materials emphasize self-evaluation against frontier models; customer-side case studies are pending ingest.
  • Methodology transparency: XBOW’s benchmark harvest method is described in outline but not in reproducible detail. AI Security Institute (AISI) and Point Estimate analyses (referenced in the Mythos post) are candidate independent comparators.
  • Disclosure pipeline: when XBOW + Mythos finds zero-days during evaluation, how does coordinated disclosure proceed? Not addressed in the post.

Founders / Leadership Team

Not captured in the May 2026 source. Next ingest pass should add founders, exec team, and any public board members.

See Also