TECHNIQUE
Evaluation
Online evaluation in this pool is production-tied: operators sample and score live outputs, run A/B or traffic-shift tests, and keep human or dashboard oversight around automated judges.
Sample real production traffic or production-generated content and score or label it, rather than relying only on offline fixtures.
3 of 6 observed operators
Use automated evaluators or LLM judges for semantic or policy judgments on sampled outputs.
3 of 6 observed operators
Route a sample of outputs, failures, or AI-approved content to human review to validate automated evaluation quality.
4 of 6 observed operators
Put online evaluation results into dashboards, monitoring stores, traces, or alerts so teams can track pass/fail rates, confidence intervals, lineage, and failures over time.
3 of 6 observed operators
Evaluate changes with live experimentation or controlled production traffic shifts, including A/B tests, bandit loops, feature flags, and rollback paths.
4 of 6 observed operators
Use evaluation gates before and around production release, combining pull-request checks, staging suites, pre-production review, and production monitoring.
3 of 6 observed operators
Every operator counted in this report ties evaluation or rollout control to production exposure or production-derived data, not only static offline test sets.
Operators differ on where online evaluation enters the system.
APPROACH 01
Score or label sampled production outputs or impressions.
APPROACH 02
Run controlled live experiments or traffic shifts.
APPROACH 03
Use reviewer or peer-review validation around operational AI workflows.
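
As a rough illustration of the first approach, the sketch below picks a stable fraction of production requests for scoring. It is not taken from any operator in the report: the function name `in_sample`, the `salt` value, and the idea of keying on a request ID are all assumptions made for the example. Hashing the ID instead of calling a random number generator means the same request is always in or out of the sample, which helps when scores need to be joined back to traces later.

```python
import hashlib


def in_sample(request_id: str, rate: float, salt: str = "online-eval-v1") -> bool:
    """Return True if this request falls in the evaluation sample.

    Hypothetical helper: `salt` lets a team draw a fresh sample by
    changing the string, without touching the request IDs.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare in integer space so rate=1.0 always samples and rate=0.0 never does.
    return bucket < int(rate * 2**64)


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    sampled = [rid for rid in ids if in_sample(rid, rate=0.10)]
    print(f"sampled {len(sampled)} of {len(ids)} requests")
```

The sampled requests would then be passed to whatever scorer or labeling queue is in use; that step is left out here because the report does not describe a specific one.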
Operators differ on what the online evaluation metric is meant to prove.
APPROACH 01
Semantic quality, groundedness, tone, relevance, or policy-label correctness judged by LLMs or rubrics.
APPROACH 02
Statistical prevalence estimates with confidence intervals and effective sample size.
APPROACH 03
Business or intervention performance under live traffic.
APPROACH 04
Availability and safe migration during backend/model-serving changes.
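
The second approach above, prevalence with confidence intervals and effective sample size, can be sketched as follows. The report does not say which estimators operators use, so this example assumes weighted binary labels, Kish's formula for effective sample size, and a normal-approximation interval. The function name and parameters are hypothetical. The normal approximation is weak for small samples or prevalences near 0 or 1, where a Wilson interval would be a reasonable substitute.

```python
import math
from typing import Sequence, Tuple


def weighted_prevalence(
    labels: Sequence[bool],
    weights: Sequence[float],
    z: float = 1.96,
) -> Tuple[float, float, Tuple[float, float]]:
    """Estimate prevalence from weighted binary labels.

    Returns (prevalence, effective_sample_size, (ci_low, ci_high)).
    Assumes every weight is positive, e.g. an inverse sampling probability.
    """
    if len(labels) != len(weights) or len(labels) == 0:
        raise ValueError("labels and weights must be non-empty and equal length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")

    total = sum(weights)
    # Weighted share of positive labels.
    p = sum(w for y, w in zip(labels, weights) if y) / total
    # Kish effective sample size: equals n when all weights are equal,
    # and shrinks as the weights become more uneven.
    n_eff = total**2 / sum(w * w for w in weights)
    # Normal-approximation standard error using n_eff in place of n.
    se = math.sqrt(p * (1.0 - p) / n_eff)
    return p, n_eff, (max(0.0, p - z * se), min(1.0, p + z * se))


if __name__ == "__main__":
    p, n_eff, (lo, hi) = weighted_prevalence(
        labels=[True, False, False, False],
        weights=[1.0, 1.0, 1.0, 1.0],
    )
    print(f"prevalence={p:.3f} n_eff={n_eff:.1f} ci=({lo:.3f}, {hi:.3f})")
```

Reporting the effective sample size next to the interval makes it visible when a heavily weighted sample carries less information than its raw row count suggests.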
Operators differ on cadence.
APPROACH 01
Fast pull-request checks plus staging suites plus continuous production sampling.
APPROACH 02
Daily or nightly batch evaluation from production streams or warehouse samples.
APPROACH 03
Periodic manual or gold-set spot checks layered on top of automated evaluation.
Operators report that small changes can cause hallucinations, unsupported claims, intent mistakes, or drift, so online evaluation is treated as regression protection rather than a one-time benchmark.
Live tests and migrations can change traffic conditions or threaten availability; operators call out calibration shifts, feature flags, and instant rollback as controls.
Automated judges are not treated as self-validating: operators keep human validation queues, peer review, spot checks, or gold-set checks in the loop.
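
One way to check an automated judge against a human validation queue is to measure chance-corrected agreement on the items both have labeled. The sketch below computes Cohen's kappa from two label lists. The choice of kappa is an assumption for illustration; the report says only that operators keep humans in the loop, not which agreement statistic they track. The function name and the example labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Cohen's kappa between automated-judge labels and human labels.

    1.0 means perfect agreement, 0.0 means agreement no better than chance.
    """
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("label lists must be non-empty and equal length")

    n = len(judge)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance from each rater's label frequencies.
    judge_counts, human_counts = Counter(judge), Counter(human)
    all_labels = judge_counts.keys() | human_counts.keys()
    expected = sum(judge_counts[k] * human_counts[k] for k in all_labels) / n**2

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    judge_labels = ["pass", "pass", "fail", "fail"]
    human_labels = ["pass", "fail", "fail", "fail"]
    print(f"kappa = {cohens_kappa(judge_labels, human_labels):.2f}")
```

A falling kappa over successive review batches is one possible signal that the judge prompt or rubric has drifted and needs recalibration against the gold set.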
| Name | Kind | When | Maturity |
|---|---|---|---|
| A/B testing with guardrail metrics | pattern | model or prompt changes ship behind controlled rollouts | established |
| Implicit feedback capture | pattern | thumbs ratings, regenerations, and abandonment serve as quality proxies at zero annotation cost | commodity |
| Langfuse online scoring | library | production traces scored continuously against eval criteria | emerging |
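
To make the "A/B testing with guardrail metrics" row concrete, here is a minimal sketch of a guardrail check that could sit behind a feature flag. It assumes the guardrail is a failure rate (for example, the share of sampled outputs a judge marked as failing) and uses a one-sided two-proportion z-test to decide whether the treatment arm is significantly worse than control. The class, function, thresholds, and return strings are all hypothetical, and the source does not say which test operators apply.

```python
import math
from dataclasses import dataclass


@dataclass
class ArmStats:
    failures: int  # outputs that failed the guardrail check
    total: int     # outputs evaluated in this arm


def guardrail_decision(
    control: ArmStats,
    treatment: ArmStats,
    z_threshold: float = 2.33,  # roughly a one-sided 1% significance level
    min_samples: int = 200,     # assumed floor; must be at least 1
) -> str:
    """Return "rollback" if treatment fails significantly more often, else "continue"."""
    if min_samples < 1:
        raise ValueError("min_samples must be at least 1")
    if min(control.total, treatment.total) < min_samples:
        return "continue"  # not enough data in one of the arms yet

    p_control = control.failures / control.total
    p_treatment = treatment.failures / treatment.total
    # Pooled failure rate under the hypothesis that both arms are equal.
    pooled = (control.failures + treatment.failures) / (control.total + treatment.total)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / control.total + 1.0 / treatment.total))
    if se == 0.0:
        return "continue"  # both arms all-pass or all-fail; no measurable difference

    z = (p_treatment - p_control) / se
    return "rollback" if z > z_threshold else "continue"


if __name__ == "__main__":
    decision = guardrail_decision(ArmStats(20, 1000), ArmStats(60, 1000))
    print(decision)
```

Running this check repeatedly on accumulating data raises the false-alarm rate, so a production version would likely need a sequential testing correction or a fixed evaluation schedule.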