LLM-as-a-Judge

An evaluation methodology in which a language model scores or rubric-grades the output of another (typically agentic) system. The judge LLM is used in place of deterministic matching (string comparison, keyword patterns, taxonomy labeling) when the target output is subjective, open-ended, or semantically rich — i.e., when the meaning of an output matters more than its surface form.

Referenced from CLASP as one of two rubric-application approaches for capability scoring.

The core tradeoff

| Approach | Mechanism | Limitation |
| --- | --- | --- |
| Deterministic matching | Compare MITRE categories; pattern-match keywords | “Correct risk, wrong label = failure” — the same risk identified under a different label across two runs counts as wrong |
| LLM-as-a-Judge | Evaluate semantic equivalence between expected and actual output | Introduces a circular dependency: the judge may share the failure modes of the agent under evaluation |
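
A hypothetical illustration of the first row’s failure mode (the strings and the MITRE label below are invented for illustration):

```python
expected = "SQL injection via unsanitized order_id parameter (MITRE T1190)"
actual = "Unsanitized order_id allows attacker-controlled SQL (injection risk)"

# Exact string comparison fails despite semantic equivalence.
print(expected == actual)   # False

# Taxonomy/keyword matching fails because the label differs, even though
# both sentences identify the same underlying risk.
print("T1190" in actual)    # False
```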

The circularity problem and its standard resolution

The fundamental tension: LLM-as-a-Judge is used precisely because the output cannot be deterministically verified. But if the LLM judge has the same weaknesses as the agent (hallucination, reasoning gaps), why trust it to evaluate correctly?

The standard resolution (as applied at Stripe in their threat modeling agent — see Guardrails Beyond Vibes) is a division of labor:

  1. Humans write the gold standard. Domain experts curate past examples of high-quality outputs (e.g., complete, correct threat models from past security reviews). Human judgment defines what a correct answer looks like.
  2. The LLM is tasked only with semantic matching. Given an expected output (gold standard) and an actual output, the judge is asked: are these semantically equivalent in terms of the risks and mitigations conveyed? This is a narrower, more tractable task than generating a correct answer (a minimal sketch of this judge call follows the list).
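
A minimal sketch of the step-2 judge call, assuming an OpenAI-style chat client; the model name, prompt wording, and verdict format are illustrative, not Stripe’s actual implementation:

```python
import json

from openai import OpenAI  # assumes the `openai` Python SDK is installed

client = OpenAI()

JUDGE_PROMPT = """You are grading a threat model against a gold standard.
Do NOT judge whether the gold standard is correct; domain experts already did.
Answer only whether the two texts convey semantically equivalent risks and
mitigations.

Gold standard:
{expected}

Actual output:
{actual}

Respond with JSON: {{"equivalent": true or false, "missing_risks": []}}"""


def judge_semantic_match(expected: str, actual: str) -> dict:
    """Ask the judge LLM for a semantic-equivalence verdict (sketch)."""
    resp = client.chat.completions.create(
        model="gpt-4o",                            # illustrative judge model
        temperature=0,                             # reduce grading variance
        response_format={"type": "json_object"},   # constrain output to JSON
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(expected=expected, actual=actual),
        }],
    )
    return json.loads(resp.choices[0].message.content)
```

Note that the prompt deliberately forbids the judge from re-deriving domain truth; that is the division of labor described in the list above.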

This isolates the judge’s error mode to semantic similarity assessment rather than domain correctness — a narrower task, and one at which LLMs are demonstrably better suited than humans for high-volume scoring.

Uses of the eval pipeline

The Stripe threat modeling case illustrates three distinct operational uses beyond basic accuracy measurement:

  1. Prompt engineering guidance — low-scoring test cases highlight where the prompt fails; improvements are guided by the distribution of failures rather than individual edge cases, which avoids overfitting.
  2. Model selection — when choosing between candidate base models, duplicating the golden test set (to average out non-determinism) and running all candidates through the same scorer gives an empirical comparison (sketched after this list). At Stripe, this process yielded a +10% accuracy improvement.
  3. Regression detection — the most important use. A prompt change that looks fine on individual runs (correctly formatted JSON output) can reduce overall accuracy by 10% (the agent attends to formatting at the expense of security content). The eval pipeline surfaces this; individual inspection does not.
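
A sketch of the model-selection comparison (item 2), assuming a judge like the one sketched earlier; `run_agent` is a hypothetical stand-in for invoking the agent under test with a given candidate model:

```python
import statistics
from typing import Callable


def score_model(
    run_agent: Callable[[str], str],    # hypothetical: test input -> agent output
    judge: Callable[[str, str], dict],  # e.g. judge_semantic_match from above
    golden_set: list[dict],             # [{"input": ..., "expected": ...}, ...]
    n_duplicates: int = 5,              # illustrative repeat count
) -> float:
    """Mean pass rate over the golden set, duplicated to smooth non-determinism."""
    verdicts = []
    for _ in range(n_duplicates):
        for case in golden_set:
            actual = run_agent(case["input"])
            verdict = judge(case["expected"], actual)
            verdicts.append(1.0 if verdict["equivalent"] else 0.0)
    return statistics.mean(verdicts)


# Usage sketch: build one runner per candidate model, score all of them with
# the same judge and golden set, then compare the means.
# scores = {name: score_model(runner, judge_semantic_match, golden_set)
#           for name, runner in candidate_runners.items()}
```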

Regression detection is the primary value

The Stripe talk is explicit: “This eval pipeline really gives us confidence in the changes we make to our prompt in the sense that they’re applying generally speaking, rather than just in minute cases.” The pipeline is the ground-truth check against which all prompt modifications are tested — not just a one-time accuracy measurement.
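
One way such a pipeline can gate changes (a sketch, not Stripe’s actual pipeline; the 2-point tolerance is an invented example value): score the candidate prompt on the golden set and fail the change if aggregate accuracy regresses.

```python
def regression_gate(
    baseline_accuracy: float,   # golden-set accuracy of the current prompt
    candidate_accuracy: float,  # golden-set accuracy of the proposed prompt
    tolerance: float = 0.02,    # illustrative allowed drop before failing
) -> None:
    """Fail a prompt change whose golden-set accuracy regresses.

    A change can look fine on individual runs (e.g. well-formatted output)
    while dropping aggregate accuracy; only the distribution shows it.
    """
    drop = baseline_accuracy - candidate_accuracy
    if drop > tolerance:
        raise SystemExit(
            f"Regression: accuracy fell {drop:.1%} "
            f"({baseline_accuracy:.1%} -> {candidate_accuracy:.1%})"
        )
    print(f"OK: candidate accuracy {candidate_accuracy:.1%}")
```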

LLM-as-a-Judge and human-in-the-loop

LLM-as-a-Judge addresses evaluation confidence; it does not replace human review of agent outputs in production. The Stripe team treats these as complementary:

  • LLM-as-a-Judge scores the agent against a gold standard → produces a confidence/accuracy number used to gate releases and detect regressions.
  • Human-in-the-loop (HITL) reviews actual agent outputs before they affect real workflows → provides the final quality gate and discovers failure modes not covered by the golden set.

“Eval pipelines validate — humans still discover.” See Human-in-the-Loop (HITL) for Agentic AI.

When not to use LLM-as-a-Judge

  • Open-ended routing tasks with no fixed ground truth (e.g., “which security team should handle this question?”). In this case, a gold standard is hard to define and user feedback in production is a stronger signal. The Stripe security routing agent used a phased user-feedback rollout instead of an offline LLM-as-a-Judge pipeline.
  • Tasks with reliable deterministic signals — if keyword matching or structured schema validation suffices, LLM-as-a-Judge adds cost and latency without benefit (a sketch of such a deterministic gate follows this list).
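
When the contract is structural, a deterministic check is cheaper and stricter than a judge. A sketch using the `jsonschema` package; the schema fields are invented for illustration:

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Invented contract: the agent must emit findings with fixed fields.
FINDINGS_SCHEMA = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "risk": {"type": "string"},
                    "mitigation": {"type": "string"},
                },
                "required": ["risk", "mitigation"],
            },
        },
    },
    "required": ["findings"],
}


def structurally_valid(raw_output: str) -> bool:
    """Deterministic gate: parse and schema-check, with no LLM call."""
    try:
        validate(instance=json.loads(raw_output), schema=FINDINGS_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```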

See also