TECHNIQUE

Golden-set offline evals

Evaluation

17 APPLICATIONS

20 OBSERVED OPERATORS

01

State of Practice

CROSS-VALIDATED — 2 OPERATORS

Golden-set offline eval practice in this pool is concrete but narrowly evidenced: Dropbox uses golden/curated suites as CI and staging regression gates, while Pinterest uses SME-labeled gold sets to monitor LLM labeling drift.

Observed Practices

Run curated or golden offline suites before releasing changes, including end-to-end sweeps to detect regressions.

1 of 2 operators with cited golden-set evidence in this pool.

Dropbox

Put a fast canonical-query eval subset into pull-request and merge automation, with red-line misses blocking merges.

1 of 2 operators with cited golden-set evidence in this pool.

Dropbox
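A merge gate of this kind could be sketched as follows. This is a minimal illustration, not Dropbox's implementation: the `EvalResult` record, the `red_line` flag, the `gate` function, and the 0.9 pass-rate threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    query_id: str
    passed: bool
    red_line: bool  # hypothetical flag: a miss on this query must block the merge


def gate(results: list[EvalResult], min_pass_rate: float = 0.9) -> tuple[bool, list[str]]:
    """Return (ok, reasons). Any red-line miss fails the gate outright."""
    reasons = [
        f"red-line miss: {r.query_id}"
        for r in results
        if r.red_line and not r.passed
    ]
    # An empty suite yields a pass rate of 0.0, so the gate fails closed.
    pass_rate = sum(r.passed for r in results) / len(results) if results else 0.0
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2f} below {min_pass_rate:.2f}")
    return (not reasons, reasons)
```

In CI, a wrapper script would typically exit non-zero when `ok` is false so that the merge is blocked; the reasons list gives reviewers the specific queries that regressed.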

Score eval examples using the query, model answer, source context, and sometimes a hidden reference answer; use judge models to check factuality, citation support, formatting, and tone.

1 of 2 operators with cited golden-set evidence in this pool.

Dropbox
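The shape of a scored example can be sketched as below. The field names, the `CHECKS` tuple, and the `Judge` callable are assumptions made for illustration; the source names the inputs and the four check types but not a schema. The `stub_judge` is a toy rule-based stand-in for a judge-model call and exists only to show the wiring.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# The four check types named in the practice above.
CHECKS = ("factuality", "citation_support", "formatting", "tone")


@dataclass(frozen=True)
class EvalExample:
    query: str
    answer: str
    context: str
    reference: Optional[str] = None  # hidden reference answer, when one exists


# A judge takes a check name and an example and returns a verdict.
# In practice this would wrap a judge-model call; here it is an injected callable.
Judge = Callable[[str, EvalExample], bool]


def score_example(example: EvalExample, judge: Judge) -> dict[str, bool]:
    """Run every rubric check and return one verdict per check."""
    return {check: judge(check, example) for check in CHECKS}


def stub_judge(check: str, example: EvalExample) -> bool:
    """Toy stand-in for a judge model, for wiring tests only."""
    if check == "citation_support":
        # Crude proxy: the answer text must appear in the source context.
        return example.answer.lower() in example.context.lower()
    if check == "factuality" and example.reference is not None:
        return example.answer.strip().lower() == example.reference.strip().lower()
    return True
```

Keeping the judge as an injected callable means the same `score_example` function can run against a real judge model in staging and against a deterministic stub in unit tests.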

Build evaluation sets from internal production or dogfood logs, then reuse the same scoring logic across offline suites and production sampling.

1 of 2 operators with cited golden-set evidence in this pool.

Dropbox
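One way to reuse the same scoring logic in both settings is to pass a single scorer into an offline runner and a production sampler. The sketch below assumes a hypothetical record shape and a toy `grounded` scorer; none of these names come from the source.

```python
import random
from typing import Callable, Iterable

Record = dict  # hypothetical shape: {"query": ..., "answer": ..., "context": ...}
Scorer = Callable[[Record], float]


def mean_score(records: Iterable[Record], scorer: Scorer) -> float:
    """Average the scorer over the records; NaN when there is nothing to score."""
    scores = [scorer(r) for r in records]
    return sum(scores) / len(scores) if scores else float("nan")


def offline_suite(golden_set: list[Record], scorer: Scorer) -> float:
    """Score every record in the curated set."""
    return mean_score(golden_set, scorer)


def production_sample(logs: list[Record], scorer: Scorer, k: int, seed: int = 0) -> float:
    """Score a random sample of up to k logged records with the same scorer."""
    rng = random.Random(seed)
    sample = rng.sample(logs, min(k, len(logs)))
    return mean_score(sample, scorer)


def grounded(record: Record) -> float:
    """Toy scorer: 1.0 when the answer appears in the context, else 0.0."""
    return 1.0 if record["answer"] in record["context"] else 0.0
```

Because both entry points call the same scorer, a change in the offline number and a change in the production number can be compared directly rather than reconciled across two metric definitions.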

Periodically compare LLM-and-prompt quality against SME-labeled gold sets to detect drift.

1 of 2 operators with cited golden-set evidence in this pool.

Pinterest
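A periodic drift check against a gold set can be sketched as an agreement rate compared with a stored baseline. The agreement metric, the `tolerance` value, and the function names are assumptions for illustration; the source does not specify which metric or threshold Pinterest uses.

```python
def agreement(gold: dict[str, str], predicted: dict[str, str]) -> float:
    """Fraction of gold items whose predicted label matches the SME label."""
    if not gold:
        raise ValueError("gold set is empty")
    # Items missing from `predicted` count as disagreements.
    hits = sum(predicted.get(item_id) == label for item_id, label in gold.items())
    return hits / len(gold)


def drifted(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """Flag drift when agreement falls more than `tolerance` below the baseline."""
    return (baseline - current) > tolerance
```

Running `agreement` on a schedule for a fixed LLM and prompt pair, and comparing each result with the baseline recorded when the pair was approved, gives a simple signal that either the model or the data has shifted.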

Keep human or SME review in the evaluation loop through manual spot-checks, human validation queues, or SME-reviewed prompts/gold sets.

2 of 2 operators with cited golden-set evidence in this pool.

Dropbox, Pinterest

Operationalize eval outputs with dashboards, pass/fail metrics, trend views, diagnostics, or lineage for auditability.

2 of 2 operators with cited golden-set evidence in this pool.

Dropbox, Pinterest

Where Operators Converge

Both observed operators anchor evaluator quality to gold or reference data rather than relying only on ad-hoc checks.

Both observed operators keep human expertise attached to the eval loop.

Where Operators Diverge

Where the golden set sits in operations differs.

APPROACH 01

Use golden/curated evals as release-regression infrastructure in PR automation, staging, and merge gates.

Dropbox

APPROACH 02

Use SME-labeled gold sets as ground truth for periodic LLM + prompt drift checks in an AI labeling/prevalence pipeline.

Pinterest

The evaluated artifact differs by product context.

APPROACH 01

Evaluate a multi-stage conversational answer pipeline spanning retrieval, ranking, prompt construction, model inference, and safety filtering.

Dropbox

APPROACH 02

Evaluate multimodal LLM labels for policy-defined content categories, with gold-set drift checks and human validation.

Pinterest

Watch Items

Regression and drift are treated as live risks: Dropbox warns that a tweak anywhere in the pipeline can turn a prior good answer into a hallucination, and Pinterest checks LLM + prompt quality against gold sets to detect model drift.

Automated judging or labeling is not left unchecked: both observed operators use human review mechanisms around sampled outputs or labels.

Cost and latency are explicit evaluation concerns: Dropbox includes latency and cost smoke checks in merge automation, while Pinterest tracks token usage and per-run cost by model or metric variant.
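Per-run cost tracking by variant reduces to simple arithmetic over token counts. The sketch below assumes a hypothetical `Usage` record and a caller-supplied price table in currency units per 1,000 tokens; the prices are placeholders, not real vendor rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    variant: str        # model or metric variant label
    input_tokens: int
    output_tokens: int


def run_cost(
    usages: list[Usage],
    price_per_1k: dict[str, tuple[float, float]],
) -> dict[str, float]:
    """Sum cost per variant. `price_per_1k` maps variant -> (input_price, output_price)."""
    totals: dict[str, float] = {}
    for u in usages:
        in_price, out_price = price_per_1k[u.variant]
        cost = u.input_tokens / 1000 * in_price + u.output_tokens / 1000 * out_price
        totals[u.variant] = totals.get(u.variant, 0.0) + cost
    return totals
```

A cost smoke check in merge automation could then compare each variant's total against a budget and fail the run when the budget is exceeded.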

02

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
promptfoo	library	declarative golden-set evals wired into CI	established
Ragas	library	RAG-specific metrics: faithfulness, context precision/recall	established
pytest-style eval harness	pattern	evals as code in the existing test runner and review flow	commodity
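The pytest-style pattern in the last row can be sketched without any framework-specific API, since pytest collects plain functions named `test_*` and reports a failed `assert` as a test failure. The golden set, the `answer` stand-in, and the substring check below are hypothetical; a real harness would call the pipeline under test and would likely use a richer scorer.

```python
# Hypothetical golden set: (query, substring the answer must contain).
GOLDEN_SET = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]


def answer(query: str) -> str:
    """Stand-in for the system under test; replace with the real pipeline call."""
    canned = {
        "What is 2 + 2?": "The answer is 4.",
        "Capital of France?": "Paris.",
    }
    return canned[query]


def failures(golden_set: list[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Return (query, expected, actual) for every golden-set miss."""
    missed = []
    for query, expected in golden_set:
        actual = answer(query)
        if expected not in actual:
            missed.append((query, expected, actual))
    return missed


def test_golden_set() -> None:
    """Collected by pytest (or callable directly); fails on any miss."""
    missed = failures(GOLDEN_SET)
    assert not missed, f"{len(missed)} golden-set misses: {missed}"
```

Keeping the golden set as data in the repository means changes to it go through the same review flow as code changes.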

03

Observed in Production

17 APPS

Technology

CROSS-VALIDATED

LLM-Assisted Code Review, Test Migration, and Agent Evaluation

cubic, DoorDash, Dropbox +4

7 OP

Technology

GROUNDED

LLM Application Quality Assurance

Atlassian, Grab, Unify

3 OP

Education

CROSS-VALIDATED

AI-Assisted Education Evaluation Review

Coursera, Kalvium Labs

2 OP

Technology

GROUNDED

AI-Assisted Product and Developer Collaboration Workflows

Canva, Notion

2 OP

Technology

GROUNDED

Code and Query Defect Validation and Repair

Pinterest, Uber

2 OP

Technology

GROUNDED

Enterprise Search Synthetic Evaluation Data Generation

Canva, Dropbox

2 OP

Healthcare

GROUNDED

Acute Stroke Triage, Thrombectomy Selection, and Message Center Decision Support

The Second Affiliated Hospital of SooChow University

1 OP

Technology

GROUNDED

Agentic ML and Data Pipeline Workflow Orchestration

Technology

GROUNDED

AI Security Decision Audit and Incident Report Generation

Meta

1 OP

Technology

NO RECIPE

AI-Assisted Content and Metadata Data Collection

Wix

1 OP

Technology

GROUNDED

Automated Data and Interest Signal Classification

Grab

1 OP

Education

GROUNDED

Computational Drug Discovery Lab Workflow Instruction

In Silico Toxicology, Institute of Physiology, Charité – Universitätsmedizin Berlin

1 OP

Technology

GROUNDED

LLM SQL and Knowledge Base Quality Evaluation

Uber

1 OP

Technology

GROUNDED

Monorepo Incident Root Cause Identification

Meta

1 OP

E-commerce

GROUNDED

Personalized Feed Candidate Retrieval and Search Ranking

Technology

GROUNDED

Pull Request Mock-Backed Development and LLM Test Generation

Meta

1 OP

Technology

CROSS-VALIDATED

Security and Privacy Policy On-Call Support Copilot

Uber

1 OP