METR (Model Evaluation and Threat Research)

Independent AI evaluation organization. The wiki’s methodological foundation for long-task autonomy claims — the work that established the “task-completion-horizon doubles every N months” framing that UK AISI’s 8-month cyber-task figure is built on.

Notable outputs

  • “Measuring AI Ability to Complete Long Tasks” (metr.org blog, arXiv:2503.14499) — Generalist task horizon doubles every ~7 months across 2019–2025; accelerated to ~4 months in 2024–2025. Methodologically transparent.
  • Time Horizon 1.1 (metr.org/blog/2026-1-29) — January 2026 refresh of the doubling-time analysis.
  • Common Elements of Frontier AI Safety Policies (metr.org/common-elements) — Comparative view of vendor responsible-update commitments; cross-referenced from the wiki’s Threat Classes 2026 §Class 4 (model-version regression).
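The doubling-time framing above is a simple exponential-growth model: a horizon h(t) = h₀ · 2^((t − t₀)/D), where D is the doubling period. A minimal sketch, with placeholder numbers (the 60-minute starting horizon is a hypothetical input, not METR's published fit):

```python
def horizon(h0_minutes: float, months_elapsed: float, doubling_months: float) -> float:
    """Task horizon after `months_elapsed`, assuming exponential doubling
    with period `doubling_months` (the model behind METR's framing)."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

# With a hypothetical 60-minute horizon and a ~7-month doubling time,
# two doubling periods (14 months) quadruple the horizon to 240 minutes:
print(horizon(60, 14, 7))  # 240.0
```

Note that the headline ~7-month vs ~4-month figures change only the exponent's denominator, which is why a shift in doubling time compounds so quickly over multi-year extrapolations.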

Why it matters for the wiki

UK AISI’s “8-month cyber-doubling” figure is built on METR methodology. Citing UK AISI without METR is citing the conclusion without the foundation. METR is also the only independent (non-vendor, non-government) source for cross-model capability scaling.

See Also