TECHNIQUE

Online evaluation

Evaluation

10 APPLICATIONS
14 OBSERVED OPERATORS
01

State of Practice

CROSS-VALIDATED — 6 OPERATORS

Online evaluation in this pool is tied to production: operators sample and score live outputs, run A/B or traffic-shift tests, and keep human or dashboard oversight around automated judges.

Observed Practices

Sample real production traffic or production-generated content and score or label it, rather than relying only on offline fixtures.

3 of 6 observed operators
Dropbox, Pinterest, Thumbtack
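
One way to sample production traffic for scoring is to hash each request ID into a stable bucket, so the same request is always in or out of the sample. The sketch below is illustrative only: the function names, the `request_id` field, and the salt value are assumptions, not something the observed operators describe.

```python
import hashlib

def in_sample(request_id: str, rate: float, salt: str = "online-eval-v1") -> bool:
    """Return True if this request falls into the evaluation sample.

    The salt is a hypothetical value; changing it reshuffles the sample.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

def sample_for_scoring(records, rate):
    """Keep only the production records selected for scoring or labeling."""
    return [r for r in records if in_sample(r["request_id"], rate)]
```

Hash-based sampling avoids shared state between servers and keeps the sample reproducible when a batch is re-run.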

Use automated evaluators or LLM judges for semantic or policy judgments on sampled outputs.

3 of 6 observed operators
Dropbox, Pinterest, Thumbtack
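
A rubric-based judge can be sketched as a function that scores each criterion and reports which ones fall below a minimum. The judge itself is injected as a plain callable here; in practice it would wrap an LLM call, which is deliberately left out. The criterion names and thresholds are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical judge signature: (criterion, output_text) -> score in [0, 1].
Judge = Callable[[str, str], float]

def score_with_rubric(output_text: str, rubric: Dict[str, float], judge: Judge) -> dict:
    """Score one sampled output against every rubric criterion.

    `rubric` maps a criterion name to the minimum acceptable score.
    """
    scores = {criterion: judge(criterion, output_text) for criterion in rubric}
    failed = [c for c, minimum in rubric.items() if scores[c] < minimum]
    return {"scores": scores, "failed": failed, "passed": not failed}
```

Keeping the judge behind a callable makes it easy to swap models or replay stored judgments without touching the aggregation logic.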

Route a sample of outputs, failures, or AI-approved content to human review to validate automated evaluation quality.

4 of 6 observed operators
Dropbox, Agoda, Pinterest, Thumbtack
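
A minimal routing sketch, assuming the automated judge emits a pass/fail verdict and a confidence score: every failure and every low-confidence pass goes to human review, plus a random audit slice of confident passes. The field names, the 5% audit rate, and the 0.7 confidence floor are illustrative assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class JudgedOutput:
    output_id: str
    judge_passed: bool
    judge_confidence: float  # hypothetical score in [0, 1]

def route_to_human_review(items, audit_rate=0.05, min_confidence=0.7, seed=0):
    """Return (output_id, reason) pairs that should enter the review queue."""
    rng = random.Random(seed)  # seeded so a batch can be reproduced
    queue = []
    for item in items:
        if not item.judge_passed:
            queue.append((item.output_id, "judge_failed"))
        elif item.judge_confidence < min_confidence:
            queue.append((item.output_id, "low_confidence"))
        elif rng.random() < audit_rate:
            # Audit confident passes too, so judge errors on approved content surface.
            queue.append((item.output_id, "random_audit"))
    return queue
```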

Put online evaluation results into dashboards, monitoring stores, traces, or alerts so teams can track pass/fail rates, confidence intervals, lineage, and failures over time.

3 of 6 observed operators
Dropbox, Pinterest, Thumbtack
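
For a dashboard that tracks pass rates with confidence intervals, one common choice is the Wilson score interval, which behaves better than the plain normal approximation at small samples or extreme rates. This is a sketch of that standard formula; the source does not say which interval the operators use.

```python
import math

def wilson_interval(passes: int, total: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the edges.
    return max(0.0, center - half), min(1.0, center + half)
```

For example, `wilson_interval(90, 100)` gives an interval of roughly 0.83 to 0.94 around the observed 0.90 pass rate.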

Evaluate changes with live experimentation or controlled production traffic shifts, including A/B tests, bandit loops, feature flags, and rollback paths.

4 of 6 observed operators
Criteo, Pinterest, Thumbtack, Slack
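
An A/B readout with a guardrail can be sketched as a two-proportion z-test on the primary metric plus a separate check that a guardrail metric has not regressed. The function names, the 0.05 significance level, and the "lower is better" guardrail convention are assumptions for illustration, not a description of any operator's experimentation platform.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for treatment B versus control A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one observation")
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation in either arm, nothing to detect
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def should_ship(z, p_value, guardrail_delta, alpha=0.05, max_guardrail_regression=0.0):
    """Ship only on a significant primary win with no guardrail regression.

    guardrail_delta is treatment minus control for a lower-is-better metric
    such as error rate (a hypothetical convention).
    """
    primary_win = z > 0 and p_value < alpha
    guardrail_ok = guardrail_delta <= max_guardrail_regression
    return primary_win and guardrail_ok
```

Separating the primary test from the guardrail check keeps a large win on one metric from hiding a regression on another.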

Use evaluation gates before and around production release, combining pull-request checks, staging suites, pre-production review, and production monitoring.

3 of 6 observed operators
Dropbox, Thumbtack, Slack
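
An evaluation gate can be reduced to a comparison of stage metrics against minimum thresholds, with the same check reused at pull-request, staging, and production stages under different limits. The stage names, metric names, and thresholds below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateResult:
    stage: str
    passed: bool
    failures: tuple

def check_gate(stage: str, metrics: dict, thresholds: dict) -> GateResult:
    """Fail the gate if any required metric is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return GateResult(stage, not failures, tuple(failures))
```

Treating a missing metric as a failure keeps a broken evaluation job from silently passing the gate.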

Where Operators Converge

Every operator counted in this report ties evaluation or rollout control to production exposure or production-derived data, not only static offline test sets.

Where Operators Diverge

Operators differ on where online evaluation enters the system.

APPROACH 01

Score or label sampled production outputs or impressions.

Dropbox, Pinterest, Thumbtack

APPROACH 02

Run controlled live experiments or traffic shifts.

Criteo, Pinterest, Thumbtack, Slack

APPROACH 03

Use reviewer or peer-review validation around operational AI workflows.

Agoda, Dropbox, Pinterest, Thumbtack

Operators differ on what the online evaluation metric is meant to prove.

APPROACH 01

Semantic quality, groundedness, tone, relevance, or policy-label correctness judged by LLMs or rubrics.

Dropbox, Pinterest, Thumbtack

APPROACH 02

Statistical prevalence estimates with confidence intervals and effective sample size.

Pinterest
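
A prevalence estimate with a confidence interval and effective sample size can be sketched with importance weights and Kish's formula. This is one standard way to produce such numbers under weighted sampling; the source does not state which estimator the operator uses, and the normal-approximation interval below is a simplifying assumption.

```python
import math

def weighted_prevalence(labels, weights, z: float = 1.96):
    """Return (prevalence, effective_sample_size, (ci_low, ci_high)).

    labels: truthy where the sampled item has the property of interest.
    weights: positive sampling weights (inverse inclusion probabilities).
    """
    if len(labels) != len(weights) or not labels:
        raise ValueError("labels and weights must be non-empty and the same length")
    total_w = sum(weights)
    p = sum(w for label, w in zip(labels, weights) if label) / total_w
    # Kish effective sample size: (sum w)^2 / sum w^2.
    ess = total_w ** 2 / sum(w * w for w in weights)
    se = math.sqrt(p * (1 - p) / ess)
    return p, ess, (max(0.0, p - z * se), min(1.0, p + z * se))
```

With equal weights the effective sample size equals the raw sample size; uneven weights shrink it and widen the interval.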

APPROACH 03

Business or intervention performance under live traffic.

Criteo, Pinterest, Thumbtack

APPROACH 04

Availability and safe migration during backend/model-serving changes.

Slack

Operators differ on cadence.

APPROACH 01

Fast pull-request checks plus staging suites plus continuous production sampling.

Dropbox

APPROACH 02

Daily or nightly batch evaluation from production streams or warehouse samples.

Pinterest, Thumbtack

APPROACH 03

Periodic manual or gold-set spot checks layered on top of automated evaluation.

Dropbox, Pinterest

Watch Items

Operators report that small changes can cause hallucinations, unsupported claims, intent mistakes, or drift, so online evaluation is treated as regression protection rather than a one-time benchmark.

Live tests and migrations can change traffic conditions or threaten availability; operators call out calibration shifts, feature flags, and instant rollback as controls.
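
A staged traffic shift with a rollback path can be sketched as a function that either advances to the next exposure step or drops the new path to zero traffic when its error rate exceeds the baseline by more than a tolerance. The step schedule and the 10% relative tolerance are hypothetical values.

```python
def next_traffic_share(
    current: float,
    error_rate: float,
    baseline_error_rate: float,
    steps=(0.01, 0.05, 0.25, 1.0),
    max_relative_increase: float = 0.10,
) -> float:
    """Return the traffic share the new path should receive next.

    Returns 0.0 (rollback) if the new path's error rate is more than
    max_relative_increase above the baseline.
    """
    if error_rate > baseline_error_rate * (1 + max_relative_increase):
        return 0.0  # rollback: send all traffic to the old path
    for step in steps:
        if step > current:
            return step  # advance to the next exposure level
    return current  # already at full traffic
```

In a real system the returned share would feed a feature flag or load balancer weight, which is outside this sketch.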

Automated judges are not treated as self-validating: operators keep human validation queues, peer review, spot checks, or gold-set checks in the loop.
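
One way to check an automated judge against human labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes paired labels for the same sampled outputs; nothing in the source says the operators use this particular statistic.

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels) -> float:
    """Chance-corrected agreement between judge and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    all_labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(judge_counts[l] * human_counts[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 suggests the judge tracks human reviewers closely; a value near 0 means agreement is no better than chance, even if raw agreement looks high.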

02

Implementation Menu

CURATED DEFAULTS
Name                                 Kind     Maturity
A/B testing with guardrail metrics   pattern  established
Implicit feedback capture            pattern  commodity
Langfuse online scoring              library  emerging

03

Observed in Production

10 APPS