TECHNIQUE

Online evaluation

Evaluation

17 APPLICATIONS

20 OBSERVED OPERATORS

State of Practice

CROSS-VALIDATED — 3 OPERATORS

Across the observed operators, online evaluation takes three forms rather than one standard pattern: live operational scoring, A/B-test readouts, and human-alignment checks.

Observed Practices

Use operational signals in evaluation: sampled production traffic, A/B-test rollout traffic, or recent peer-review outcomes.

3 of 3 operators with online-evaluation evidence in this pool.

Dropbox, Criteo, Agoda

Continuously sample live production traffic and score it with the same metrics and logic used in offline suites.

1 of 3 operators with online-evaluation evidence in this pool.

Dropbox
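The practice above, scoring sampled live traffic with the same logic as the offline suite, could be sketched roughly as follows. This is a minimal illustration and not Dropbox's implementation: the record shape, the toy `citation_support` scorer, and the `sample_traffic` helper are all assumptions made for the example. The point it shows is that one scorer function serves both paths, so offline and online numbers stay comparable.

```python
import random
from typing import Callable, Iterable

# Hypothetical record shape: {"answer": str, "citations": list[str]}
Record = dict


def citation_support(record: Record) -> float:
    """Toy scorer (assumption): 1.0 if the answer cites at least one source."""
    return 1.0 if record.get("citations") else 0.0


def score_suite(records: Iterable[Record], scorer: Callable[[Record], float]) -> float:
    """Mean score over the records; 0.0 when there is nothing to score."""
    scores = [scorer(r) for r in records]
    return sum(scores) / len(scores) if scores else 0.0


def sample_traffic(records: list[Record], rate: float, seed: int = 0) -> list[Record]:
    """Keep each production record independently with probability `rate`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]


# The same scorer is applied to the offline suite and to sampled live traffic.
offline_suite = [
    {"answer": "a", "citations": ["doc1"]},
    {"answer": "b", "citations": []},
]
production = [{"answer": str(i), "citations": ["doc"]} for i in range(100)]

offline_score = score_suite(offline_suite, citation_support)
online_score = score_suite(sample_traffic(production, rate=0.1), citation_support)
```

Sampling per record with a fixed seed keeps the example reproducible; a real system would more likely sample at the request or trace level.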

Combine online production scoring with pre-merge and staging evaluation gates.

1 of 3 operators with online-evaluation evidence in this pool.

Dropbox

A/B test new models during rollout and report advertiser outcome uplift from phased deployment.

1 of 3 operators with online-evaluation evidence in this pool.

Criteo
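As a rough illustration of an A/B readout on advertiser outcomes, the sketch below computes relative uplift in return on ad spend (ROAS, revenue divided by spend) between a control arm and a treatment arm. The `ArmTotals` type and the figures are hypothetical, and the calculation is a point estimate only; it says nothing about statistical significance or about how Criteo computes its own readouts.

```python
from dataclasses import dataclass


@dataclass
class ArmTotals:
    """Aggregated totals for one arm of the test (hypothetical shape)."""
    revenue: float
    spend: float

    @property
    def roas(self) -> float:
        if self.spend <= 0:
            raise ValueError("spend must be positive to compute ROAS")
        return self.revenue / self.spend


def relative_uplift(control: ArmTotals, treatment: ArmTotals) -> float:
    """Relative ROAS change of treatment over control, e.g. 0.125 means +12.5%."""
    if control.roas == 0:
        raise ValueError("control ROAS is zero; relative uplift is undefined")
    return (treatment.roas - control.roas) / control.roas


# Made-up numbers for illustration only.
control = ArmTotals(revenue=400.0, spend=100.0)    # ROAS 4.0
treatment = ArmTotals(revenue=450.0, spend=100.0)  # ROAS 4.5
uplift = relative_uplift(control, treatment)
```

In a phased rollout the same readout would be recomputed at each traffic stage before the treatment share is widened.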

Validate LLM workflow outputs against human analyst conclusions during peer review over a recent operating window.

1 of 3 operators with online-evaluation evidence in this pool.

Agoda
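One way to picture validation against human conclusions is an agreement rate computed over a recent window of peer-reviewed cases. The sketch below assumes a hypothetical case record with `case_id`, `reviewed_on`, `llm_verdict`, and `analyst_verdict` fields and a 30-day default window; none of these details come from Agoda.

```python
from datetime import date, timedelta


def peer_review_agreement(cases, today, window_days=30):
    """Return (agreement_rate, disagreeing_case_ids) for recently reviewed cases.

    agreement_rate is None when no case falls inside the window.
    """
    cutoff = today - timedelta(days=window_days)
    recent = [c for c in cases if cutoff <= c["reviewed_on"] <= today]
    disagreements = [
        c["case_id"] for c in recent if c["llm_verdict"] != c["analyst_verdict"]
    ]
    if not recent:
        return None, []
    return 1 - len(disagreements) / len(recent), disagreements


# Made-up cases for illustration only.
today = date(2025, 6, 30)
cases = [
    {"case_id": 1, "reviewed_on": date(2025, 6, 29), "llm_verdict": "benign", "analyst_verdict": "benign"},
    {"case_id": 2, "reviewed_on": date(2025, 6, 20), "llm_verdict": "benign", "analyst_verdict": "escalate"},
    {"case_id": 3, "reviewed_on": date(2025, 6, 10), "llm_verdict": "escalate", "analyst_verdict": "escalate"},
    {"case_id": 4, "reviewed_on": date(2025, 6, 5), "llm_verdict": "benign", "analyst_verdict": "benign"},
    {"case_id": 5, "reviewed_on": date(2025, 1, 1), "llm_verdict": "benign", "analyst_verdict": "escalate"},
]
rate, disagreements = peer_review_agreement(cases, today)
```

Returning the disagreeing case IDs alongside the rate keeps the readout actionable, since those are the cases a reviewer would inspect first.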

Where Operators Converge

Every cited operator evaluates against real operating behavior, but the measured signal differs by product: production traffic scores, A/B-test traffic, or human peer-review alignment.

Where Operators Diverge

What online signal is used as the evaluation unit.

APPROACH 01

Sampled live production traffic is scored with the same metrics and logic as offline suites.

Dropbox

APPROACH 02

New model behavior is evaluated through A/B testing and phased rollout impact on ROAS.

Criteo

APPROACH 03

LLM decisions are validated against human analyst conclusions during peer review.

Agoda

Who or what judges online behavior.

APPROACH 01

Automated scorers and judge-model style checks are used for answer quality, citation support, formatting, and tone.

Dropbox

APPROACH 02

Business-performance readouts from A/B testing and phased rollout are used.

Criteo

APPROACH 03

Human analyst conclusions in peer review are the comparison point.

Agoda

Watch Items

Live or shifted traffic can expose quality and calibration failures. Dropbox reports that pipeline tweaks can turn a previously good answer into a hallucination, while Criteo reports that new models under A/B test can acquire different traffic on which they are not calibrated.

When the LLM lacks sufficient context, Agoda reports erring toward over-escalation: the case is routed to human review rather than decided on the model verdict.
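The over-escalation behavior can be pictured as a routing rule that checks context before trusting the model. The sketch below is an assumption-laden illustration: the `REQUIRED_CONTEXT` field names and the `"human_review"` route label are hypothetical, and the source does not describe how Agoda decides that context is insufficient.

```python
# Hypothetical context fields a verdict is assumed to depend on.
REQUIRED_CONTEXT = ("evidence", "timeline")


def route_case(case: dict, llm_verdict: str) -> tuple[str, list[str]]:
    """Return (route, missing_fields).

    If any required context field is absent or empty, the case goes to
    human review and the model verdict is ignored.
    """
    missing = [field for field in REQUIRED_CONTEXT if not case.get(field)]
    if missing:
        return "human_review", missing
    return llm_verdict, []


complete = {"evidence": ["log line"], "timeline": ["t0", "t1"]}
partial = {"evidence": ["log line"], "timeline": []}

route_complete = route_case(complete, "benign")
route_partial = route_case(partial, "benign")
```

Defaulting to human review on missing context trades analyst time for a lower risk of acting on an unsupported verdict.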

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
A/B testing with guardrail metrics	pattern	model or prompt changes ship behind controlled rollouts	established
Implicit feedback capture	pattern	thumbs, regenerations, and abandonment proxy quality at zero annotation cost	commodity
Langfuse online scoring	library	production traces scored continuously against eval criteria	emerging
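For the first row of the menu, a guardrail check during a controlled rollout might look like the sketch below. The metric names, tolerances, and the three outcomes (`halt`, `expand`, `hold`) are hypothetical choices for illustration. Every change is assumed to be expressed as a relative change oriented so that a negative value means the treatment is worse.

```python
def rollout_decision(primary_uplift, guardrail_changes, tolerances):
    """Decide the next rollout step from a primary metric and guardrail metrics.

    guardrail_changes: metric name -> relative change (negative = worse).
    tolerances: metric name -> largest acceptable regression (positive number);
    a metric with no tolerance listed tolerates no regression at all.
    """
    breached = sorted(
        name
        for name, change in guardrail_changes.items()
        if change < -tolerances.get(name, 0.0)
    )
    if breached:
        return "halt", breached  # a guardrail regressed beyond tolerance
    if primary_uplift > 0:
        return "expand", []      # primary metric improved, guardrails intact
    return "hold", []            # no harm detected, but no gain either


# Made-up readout for illustration only.
decision = rollout_decision(
    primary_uplift=0.02,
    guardrail_changes={"latency": -0.01, "error_rate": -0.10},
    tolerances={"latency": 0.05, "error_rate": 0.02},
)
```

Checking guardrails before the primary metric means a gain on the headline number cannot mask a regression elsewhere.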

Observed in Production

17 APPS

Technology GROUNDED

LLM Application Quality Assurance

AI-Assisted Product and Developer Collaboration Workflows

LLM-Assisted Code Review, Test Migration, and Agent Evaluation

Automated Quality Image Tagging and Cataloging

Enterprise Search Synthetic Evaluation Data Generation

Programmatic Ad Bidding and Budget Pacing Optimization

Agentic ML and Data Pipeline Workflow Orchestration

AI Agent Production Debugging with Logfire MCP and Investigation Memory

AI Security Decision Audit and Incident Report Generation

AI-Assisted Education Evaluation Review

Automated Data and Interest Signal Classification

Compute-Efficient Media Preview and Qwen Journey Inference Optimization

Go Service Performance Optimization

LLM Application Migration and Rollout Validation

LLM SQL and Knowledge Base Quality Evaluation

Personalized Feed Candidate Retrieval and Search Ranking

Radiology Triage and Report Dispatch Optimization