METR (Model Evaluation and Threat Research)
Independent AI evaluation organization and the wiki’s methodological foundation for long-task autonomy claims: METR established the “task-completion horizon doubles every N months” framing on which UK AISI’s 8-month cyber-task figure is built.
Notable outputs
- “Measuring AI Ability to Complete Long Tasks” (metr.org blog, arXiv:2503.14499) — Finds the generalist task horizon doubling every ~7 months across 2019–2025, accelerating to ~4 months in 2024–2025. Methodologically transparent.
- Time Horizon 1.1 (metr.org/blog/2026-1-29) — January 2026 refresh of the doubling-time analysis.
- Common Elements of Frontier AI Safety Policies (metr.org/common-elements) — Comparative view of vendor responsible-update commitments; cross-referenced from the wiki’s Threat Classes 2026 §Class 4 (model-version regression).
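The doubling-time framing above reduces to simple exponential extrapolation. A minimal sketch of that arithmetic — the ~7-month and ~4-month doubling times are from the METR blog post; the 60-minute baseline and the function name are illustrative assumptions, not figures from any source:

```python
def task_horizon(h0_minutes: float, months_elapsed: float, doubling_months: float) -> float:
    """Project a task-completion horizon under a fixed doubling time.

    h0_minutes: horizon at the reference date (hypothetical baseline, not a METR figure).
    doubling_months: ~7 for the 2019-2025 trend, ~4 for 2024-2025 (per METR).
    """
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

# With a hypothetical 60-minute baseline, a ~7-month doubling time
# gives a 4x horizon after 14 months (two doublings).
print(task_horizon(60, 14, 7))  # -> 240.0
```

The same two doublings arrive in 8 months under the faster ~4-month regime, which is why the trend-break matters for any forward projection.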
Why it matters for the wiki
UK AISI’s “8-month cyber-doubling” figure is built on METR methodology. Citing UK AISI without METR is citing the conclusion without the foundation. METR is also the only independent (non-vendor, non-government) source for cross-model capability scaling.
See Also
- Source Triangulation Audit 2026-05-02 — Claim 3 (capability scaling)
- UK AI Safety Institute — government-side counterpart
- Apollo Research — peer organization on safety evals
- Agentic AI Threat Classes 2026 §Class 2, §Class 4