TECHNIQUE

LLM-as-judge

Evaluation

9 APPLICATIONS

9 OBSERVED OPERATORS

State of Practice

CROSS-VALIDATED — 3 OPERATORS

LLM-as-judge is used as a task-specific scoring and validation layer, with explicit evidence from Dropbox, Uber, and Pinterest.

Observed Practices

Use LLMs as task-specific scorers or validators rather than generic text-similarity metrics: Dropbox judges factual correctness, citation support, formatting, and tone; Uber validates whether code antipatterns and optimizations are real; Pinterest scores predicted user-journey relevance.

3 of 3 operators with explicit LLM-as-judge evidence in this pool.

Dropbox, Uber, Pinterest
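
A task-specific judge can be expressed as a list of named criteria, each producing its own verdict. The sketch below is illustrative only: the criterion names follow the Dropbox example above, while `Criterion`, `score_output`, `judge_fn`, and the prompt wording are hypothetical and not taken from any operator's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # what the judge is asked about the output


# Criteria named after the Dropbox example; the question wording is assumed.
CRITERIA: List[Criterion] = [
    Criterion("factual_correctness", "Is every claim supported by the context?"),
    Criterion("citation_support", "Does each citation back its sentence?"),
    Criterion("formatting", "Does the answer follow the required format?"),
    Criterion("tone", "Is the tone appropriate for the product?"),
]


def score_output(output: str, judge_fn: Callable[[str], str]) -> Dict[str, bool]:
    """Ask the judge one question per criterion and collect PASS/FAIL verdicts.

    judge_fn is a stand-in for a real model call: it takes a prompt string
    and returns text that starts with PASS or FAIL.
    """
    verdicts = {}
    for criterion in CRITERIA:
        prompt = (
            f"Criterion: {criterion.name}\n"
            f"Question: {criterion.question}\n"
            f"Output to judge:\n{output}\n"
            "Reply with PASS or FAIL."
        )
        reply = judge_fn(prompt).strip().upper()
        verdicts[criterion.name] = reply.startswith("PASS")
    return verdicts
```

Keeping one verdict per criterion, rather than a single blended score, makes it possible to see which dimension regressed.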

Give the judge domain context along with the output being judged: Dropbox evaluation runs take the query, model answer, source context, and sometimes a hidden reference answer; Uber sends full source code plus antipattern lists; Pinterest provides user features and engagement history.

3 of 3 operators with explicit LLM-as-judge evidence in this pool.

Dropbox, Uber, Pinterest
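
As a rough illustration of passing context alongside the output, the following sketch assembles a judge input from the fields listed for Dropbox (query, model answer, source context, and an optional reference answer). The function name, section labels, and closing instruction are assumptions made for the example.

```python
from typing import Optional


def build_judge_prompt(
    query: str,
    answer: str,
    source_context: str,
    reference_answer: Optional[str] = None,
) -> str:
    """Assemble the judge's input from the available evidence.

    The reference answer is optional because it is only sometimes available.
    Section labels and ordering are illustrative assumptions.
    """
    sections = [
        ("QUERY", query),
        ("SOURCE CONTEXT", source_context),
        ("MODEL ANSWER", answer),
    ]
    if reference_answer is not None:
        sections.append(("REFERENCE ANSWER", reference_answer))
    body = "\n\n".join(f"## {label}\n{text}" for label, text in sections)
    return body + "\n\nJudge the MODEL ANSWER using only the material above."
```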

Automate judge runs inside engineering or production workflows: Dropbox runs PR regression tests, staging suites, and production traffic scoring; Uber routes validated suggestions into Optix and surfaces opportunities through CI/CD and developer workflows.

2 of 3 operators with explicit LLM-as-judge evidence in this pool.

Dropbox, Uber
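
A PR-level regression check built on judge verdicts might reduce to a gate like the one below. This is a minimal sketch under stated assumptions: the function name, the baseline comparison, and the `max_drop` default are hypothetical, and the source does not describe how any operator sets its thresholds.

```python
from typing import Sequence


def regression_gate(
    verdicts: Sequence[bool],
    baseline_pass_rate: float,
    max_drop: float = 0.02,
) -> bool:
    """Return True when the change may merge.

    verdicts: judge pass/fail results for the candidate build.
    baseline_pass_rate: pass rate recorded for the main branch.
    max_drop: tolerated decrease before the gate fails (assumed value).
    """
    if not verdicts:
        raise ValueError("no judge verdicts to evaluate")
    pass_rate = sum(verdicts) / len(verdicts)
    return pass_rate >= baseline_pass_rate - max_drop
```

In a CI job, a wrapper script could exit with a nonzero status whenever the gate returns False.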

Use judge results for monitoring and review dashboards: Dropbox consolidates pass/fail rates, key metrics, and shifts over time; Uber’s LLMCheck dashboards show detection accuracy, error patterns, and antipattern frequency.

2 of 3 operators with explicit LLM-as-judge evidence in this pool.

Dropbox, Uber
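
The pass/fail rates that feed a dashboard can be derived from raw judge results with a small aggregation step. The sketch below assumes each result is a hypothetical `(day, passed)` pair; real pipelines would likely carry more fields.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_day(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (day, passed) judge results into a per-day pass rate."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for day, passed in results:
        totals[day] += 1
        passes[day] += int(passed)
    # Sort by day so the series plots in chronological order
    # (assumes ISO-formatted dates such as "2024-01-01").
    return {day: passes[day] / totals[day] for day in sorted(totals)}
```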

Add checks around LLM judgment instead of relying on an unchecked judge: Dropbox periodically manually labels sampled outputs; Uber uses a jury of LLMs plus domain-specific rule-based validators to catch false positives.

2 of 3 operators with explicit LLM-as-judge evidence in this pool.

Dropbox, Uber
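
When a sample of outputs has been labeled by humans, one way to check the judge is to measure how often it agrees with those labels beyond chance. The sketch below uses Cohen's kappa for binary pass/fail labels; choosing kappa is an assumption for illustration, as the source does not say which agreement statistic any operator uses.

```python
from typing import Sequence


def cohen_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Chance-corrected agreement between judge and human pass/fail labels."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_pos = sum(judge) / n
    human_pos = sum(human) / n
    # Agreement expected if both raters labeled independently at these rates.
    expected = judge_pos * human_pos + (1 - judge_pos) * (1 - human_pos)
    if expected == 1.0:
        return 1.0  # both raters gave the same constant label
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates close agreement; a value near 0 means the judge agrees with humans no more often than chance would predict.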

Where Operators Converge

Every explicitly evidenced operator uses the judge for domain-specific evaluation criteria tied to its product or workflow, not as a generic open-ended reviewer.

Every explicitly evidenced operator feeds contextual evidence into the judgment process: source context or reference answers at Dropbox, source code and antipattern lists at Uber, and user features plus engagement history at Pinterest.

Where Operators Diverge

Where the judge sits in the workflow differs by operator.

APPROACH 01

End-to-end AI evaluation lifecycle: PR regression tests, staging runs over curated datasets, and sampled production traffic scoring.

Dropbox

APPROACH 02

Post-detection validation layer for code-optimization suggestions before they flow into downstream optimization tools and developer workflows.

Uber

APPROACH 03

Relevance scoring for predicted user journeys, where the LLM returns a five-level score with explanations from user features and engagement history.

Pinterest

Judge design differs: single flexible judge, multi-model jury, or scored relevance rubric.

APPROACH 01

Flexible judge model that checks factual correctness against ground truth or context, citation support, formatting, and tone; Dropbox also reports that upgrading the judge model to OpenAI o3 reduced disagreements with humans.

Dropbox

APPROACH 02

Jury of large language models, with models independently assessing whether an antipattern is present and whether the suggested optimization is valid.

Uber
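
A jury of independent models needs a rule for combining votes. The sketch below uses a strict majority over jurors who answer yes to both questions; this combination rule, along with the `JurorVote` and `jury_accepts` names, is an assumption, since the source does not state how Uber aggregates juror outputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JurorVote:
    antipattern_present: bool
    optimization_valid: bool


def jury_accepts(votes: List[JurorVote]) -> bool:
    """Accept a suggestion only if a strict majority of jurors answers yes
    to both questions (assumed aggregation rule)."""
    if not votes:
        return False
    yes = sum(v.antipattern_present and v.optimization_valid for v in votes)
    return yes * 2 > len(votes)
```

Requiring both answers from the same juror is stricter than taking a separate majority per question, which trades recall for fewer false positives.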

APPROACH 03

LLM relevance scorer that produces a five-level score and explanations for predicted journeys.

Pinterest
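
A scored rubric with explanations is easier to consume when the judge reply is parsed into a validated structure. The sketch below assumes a JSON reply and a numeric 1 to 5 scale; both are illustrative choices, as the source only says the score has five levels.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RelevanceJudgment:
    score: int        # 1 (lowest) to 5 (highest); the numeric scale is assumed
    explanation: str


def parse_relevance(raw_reply: str) -> RelevanceJudgment:
    """Parse a reply of the assumed form {"score": 4, "explanation": "..."}."""
    data = json.loads(raw_reply)
    score = data["score"]
    explanation = str(data.get("explanation", "")).strip()
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    if not explanation:
        raise ValueError("judge reply is missing an explanation")
    return RelevanceJudgment(score=score, explanation=explanation)
```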

Human evaluation is handled differently.

APPROACH 01

Manual spot-checks: sampled outputs are periodically labeled by humans.

Dropbox
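
Periodic spot-checks start with drawing a sample of outputs for reviewers. A minimal sketch, assuming uniform random sampling with a fixed seed; the function name and the sampling strategy are hypothetical, not Dropbox's documented procedure.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_for_review(outputs: Sequence[T], k: int, seed: int) -> List[T]:
    """Draw up to k outputs uniformly at random for human labeling.

    A fixed seed makes the draw reproducible so reviewers and engineers
    can refer to the same batch.
    """
    rng = random.Random(seed)
    return rng.sample(list(outputs), min(k, len(outputs)))
```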

APPROACH 02

LLM evaluation is used because human evaluation is described as costly and sometimes inconsistent.

Pinterest

APPROACH 03

False-positive control is delegated to LLM juries and domain-specific rule-based validators, with developer manual review available downstream.

Uber
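
Rule-based validators can reject some hallucinated suggestions deterministically before or after the jury votes. The sketch below shows two hypothetical rules (the antipattern name must be on a known list, and the cited snippet must actually appear in the source); the rules, names, and data shapes are assumptions, not Uber's validators.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Suggestion:
    antipattern: str      # name expected to come from an antipattern list
    cited_snippet: str    # code the model says exhibits the antipattern


def rule_check(suggestion: Suggestion, source_code: str,
               known_antipatterns: Set[str]) -> bool:
    """Cheap deterministic checks that catch some hallucinated suggestions."""
    if suggestion.antipattern not in known_antipatterns:
        return False  # the model invented an antipattern name
    if not suggestion.cited_snippet.strip():
        return False  # nothing concrete to verify
    return suggestion.cited_snippet in source_code  # snippet must really exist


def accept(suggestion: Suggestion, source_code: str,
           known_antipatterns: Set[str], jury_verdict: bool) -> bool:
    """A suggestion moves downstream only if both layers agree."""
    return jury_verdict and rule_check(suggestion, source_code, known_antipatterns)
```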

Watch Items

Operators do not treat LLM judges as automatically reliable: Dropbox frames evaluation as protection against regressions and hallucinations, while Uber adds LLMCheck to catch false positives and reduce hallucinations in optimization suggestions.

Human alignment remains a constraint: Dropbox reports tracking disagreements with humans when changing judge models, while Pinterest cites human evaluation as costly and sometimes inconsistent.

Judge performance is monitored over time: Dropbox dashboards track key metrics, pass/fail rates, and shifts over time; Uber logs detection accuracy, failure rates, error patterns, and potential model drift.
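
Monitoring for shifts over time can be approximated by comparing a recent window of pass rates against a baseline window. This is a minimal sketch; the window sizes, the tolerance, and the function name are assumed values chosen for illustration.

```python
from statistics import mean
from typing import Sequence


def drift_alert(
    daily_pass_rates: Sequence[float],
    baseline_days: int = 7,
    recent_days: int = 3,
    tolerance: float = 0.05,
) -> bool:
    """Flag a shift when the mean of the last `recent_days` entries differs
    from the mean of the first `baseline_days` entries by more than
    `tolerance`, in either direction."""
    if len(daily_pass_rates) < baseline_days + recent_days:
        return False  # not enough history to compare
    baseline = mean(daily_pass_rates[:baseline_days])
    recent = mean(daily_pass_rates[-recent_days:])
    return abs(recent - baseline) > tolerance
```

Alerting on movement in both directions is deliberate: a sudden rise in pass rate can indicate a judge that has become too lenient, not only a better system.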

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
Rubric-anchored judge prompts	pattern	scoring criteria written as explicit rubrics, calibrated on samples	established
Pairwise comparison judging	pattern	relative quality (A vs B) is more reliable than absolute scores	established
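
Pairwise judging is commonly run twice with the candidate order swapped, because LLM judges can favor whichever answer appears first. The sketch below illustrates that guard; `compare_fn` is a hypothetical stand-in for a model call, and the "both orders must agree" rule is one possible policy, not one attributed to any operator above.

```python
from typing import Callable

# compare_fn(first, second) stands in for a model call that returns "first",
# "second", or "tie" for the two candidate texts in the order shown.
CompareFn = Callable[[str, str], str]


def pairwise_winner(a: str, b: str, compare_fn: CompareFn) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders.

    A candidate wins only if it is preferred in both orders; any
    disagreement between the two runs is treated as a tie.
    """
    forward = compare_fn(a, b)    # A shown first
    backward = compare_fn(b, a)   # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"
```

A judge that always picks the first position yields "tie" under this rule, so position bias shows up as a high tie rate instead of a spurious winner.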

Observed in Production

9 APPS

LLM Application Quality Assurance

Enterprise Search Synthetic Evaluation Data Generation

LLM-Assisted Code Review, Test Migration, and Agent Evaluation

AI-Assisted Content and Metadata Data Collection

AI-Assisted Education Evaluation Review

AI-Assisted Product and Developer Collaboration Workflows

Compute-Efficient Media Preview and Qwen Journey Inference Optimization

Go Service Performance Optimization

LLM SQL and Knowledge Base Quality Evaluation