Evidence-Centered Benchmark Design

Stub

A methodology for designing benchmarks that explicitly tie the evidence collected from each task to the underlying capability claims the benchmark is meant to support. Referenced from CLASP as a more rigorous alternative to LLM-as-a-judge for capability scoring. This page still needs a full definition, a description of the methodology, and citations to canonical sources.