TECHNIQUE
Evaluation
Operators run prompt regression suites as representative, continuously refreshed eval sets or workflows. Some teams enforce PR or release gates; others rely on offline experiments plus production feedback loops.
Maintain representative regression targets—canonical queries, curated conversation scenarios, sample cases, labeled examples, golden datasets, or common ML workflows—to benchmark changes and prevent regressions.
6 of 6 deployed operators in the roster; Meta counted from deployed td_100, not announced td_49.
Wire regression checks into the developer or release workflow so changes are automatically tested before merge, review, or production release.
3 of 6 deployed operators show automated PR, CI, pre-review, or pre-release gating evidence.
Use production signals (logs, traces, live sampled traffic, real-time feedback charts, or customer feedback) to monitor behavior and refresh test sets after launch.
4 of 6 deployed operators show live or post-launch feedback feeding monitoring or test-set maintenance.
Apply automatic evaluators or scoring logic to suite runs instead of relying only on manual inspection.
5 of 6 deployed operators show automatic judging, self-evaluation, custom/heuristic evaluators, metric scoring, or statistical regression tests.
Keep manual or human/user feedback in the loop for labeling, spot checks, troubleshooting, or product acceptance signals.
4 of 6 deployed operators show manual labels, spot checks, user-provided feedback, or explicit human-in-the-loop handling.
Test changes beyond prompt wording: operators use the suites to evaluate retrieval, models, tools, examples, system code/configuration, safety checks, latency, or cost.
6 of 6 deployed operators show regression testing around a broader change surface than prompt text alone.
Every deployed operator in the pool uses a structured, repeatable evaluation target rather than relying solely on one-off manual checks.
Every deployed operator treats regression testing as change management: the suite is used to compare or validate behavior when prompts, retrieval, models, tools, examples, code, or workflow configuration changes.
Where regression suites run and how strongly they gate shipping differs.
APPROACH 01
Hard developer or release gates: automated PR/CI/pre-review/pre-release tests block merges or releases when thresholds or red lines fail.
APPROACH 02
Offline experiments or evaluation runs support iteration and shipping decisions, with no cited hard merge gate in the pool evidence.
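
A hard gate of the kind described in Approach 01 can be sketched in a few lines. This is a minimal illustration, not any operator's actual implementation and not the API of any particular tool: the `CaseResult` type, the `red_line` flag, and the 95% default threshold are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    passed: bool
    red_line: bool = False  # hypothetical flag: a failure on this case always blocks


def gate(results: list[CaseResult], min_pass_rate: float = 0.95) -> tuple[bool, list[str]]:
    """Return (ok, reasons); ok is False when the change should be blocked."""
    if not results:
        # An empty run is treated as a failure so a broken suite cannot pass silently.
        return False, ["no regression cases were run"]

    reasons: list[str] = []

    # Red lines block regardless of the overall pass rate.
    red_line_failures = [r.case_id for r in results if r.red_line and not r.passed]
    if red_line_failures:
        reasons.append("red-line failures: " + ", ".join(red_line_failures))

    # The threshold check covers the suite as a whole.
    pass_rate = sum(r.passed for r in results) / len(results)
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2%} below threshold {min_pass_rate:.2%}")

    return not reasons, reasons
```

In a CI job, the calling script would typically print `reasons` and exit non-zero when `ok` is false. Under Approach 02 the same function could run in report-only mode, with the result informing a shipping decision rather than blocking a merge.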
Regression-case sourcing differs by product and workflow.
APPROACH 01
Mine production logs, traces, live traffic, user actions, or customer feedback into monitoring and test sets.
APPROACH 02
Generate synthetic users/queries and manually label relevant examples for retrieval evaluation.
APPROACH 03
Run shrunk, production-equivalent representative ML workflows to reduce compute while preserving code/configuration coverage.
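
The log-mining approach (Approach 01) might look roughly like the sketch below. The record shape (`query`, `feedback`), the `"negative"` label, and the `max_new` cap are hypothetical; real log schemas will differ. New cases are added without an expected answer, on the assumption that a human labels them before they count toward any score.

```python
from __future__ import annotations


def normalize(query: str) -> str:
    """Collapse case and whitespace so near-identical queries dedupe."""
    return " ".join(query.lower().split())


def refresh_suite(suite: list[dict], log_records: list[dict], max_new: int = 50) -> list[dict]:
    """Return a new suite with flagged, previously unseen production queries appended."""
    seen = {normalize(case["query"]) for case in suite}
    refreshed = list(suite)  # copy so the caller's suite is not mutated
    added = 0
    for record in log_records:
        if added >= max_new:
            break
        if record.get("feedback") != "negative":  # hypothetical feedback label
            continue
        key = normalize(record["query"])
        if key in seen:
            continue
        seen.add(key)
        # "expected" stays None until a human reviewer labels the case.
        refreshed.append({"query": record["query"], "expected": None, "source": "production"})
        added += 1
    return refreshed
```

Capping the number of new cases per refresh is one way to keep suite run time and labeling effort bounded as traffic grows.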
Scoring methods differ.
APPROACH 01
LLM-as-judge, LLM self-evaluation, custom evaluators, or heuristic evaluators score outputs and system health.
APPROACH 02
Task-specific retrieval metrics, such as precision, recall, and F1, are used to score prompt and retrieval experiments.
APPROACH 03
Statistical A/B testing and t-tests detect regressions on workflow performance metrics such as time to first batch.
APPROACH 04
Golden datasets and natural-language test suites benchmark product performance and regression risk.
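
For the retrieval metrics in Approach 02, a minimal set-based sketch follows. It assumes each labeled example carries a set of relevant document IDs and ignores ranking order. The function names and the 0.02 tolerance are illustrative choices, not values taken from any operator.

```python
from __future__ import annotations


def retrieval_scores(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 for one query."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def regressed(baseline: dict[str, float], candidate: dict[str, float],
              tolerance: float = 0.02) -> list[str]:
    """Names of metrics where the candidate falls below baseline by more than tolerance."""
    return [name for name in baseline if candidate[name] < baseline[name] - tolerance]
```

Comparing a candidate prompt or retriever against a stored baseline, rather than against a fixed absolute score, keeps the check meaningful as the labeled set changes.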
Regression coverage is treated as a moving target: operators continuously sample live traffic, collect feedback, or add new scenarios and edge cases rather than freezing the suite once.
Ad hoc or “vibes” checks appear as prior or early-stage practice before teams move to systematic suites.
Small changes are explicitly treated as regression risks because prompt, retrieval, model, safety, or workflow changes can alter outputs or performance.
Cost, latency, compute, and false-positive budgets show up alongside quality as regression-suite constraints.
| Name | Kind | When | Maturity |
|---|---|---|---|
| promptfoo CI gates | library | prompt changes blocked on regression suites like any other code change | established |
| Golden-transcript snapshot diffs | pattern | output drift surfaces as reviewable diffs, not failed scores | commodity |
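
The snapshot-diff pattern in the table can be illustrated with Python's standard `difflib`. This is a sketch under simple assumptions: transcripts are plain text, the golden copy is stored alongside the suite, and the `snapshot_diff` name and file labels are made up for the example.

```python
import difflib


def snapshot_diff(golden: str, current: str, name: str = "transcript") -> str:
    """Unified diff of the stored golden transcript against new output; empty if identical."""
    diff = difflib.unified_diff(
        golden.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile=f"{name} (golden)",
        tofile=f"{name} (current)",
    )
    return "".join(diff)
```

A non-empty result is attached to the change for a reviewer to accept or reject, so drift shows up as a readable diff instead of a pass/fail score. Exact-match diffs are sensitive to nondeterministic output, so this sketch fits best where generation is pinned (for example, fixed sampling settings) or where outputs are normalized before comparison.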