Simulation-based testing

Evaluation

2 APPLICATIONS

2 OBSERVED OPERATORS

State of Practice

CROSS-VALIDATED — 7 OPERATORS

Simulation-based testing is deployed as a practical offline or pre-production evaluation step: operators simulate realistic workflows, policies, attacks, ad pacing, or conversations, then add production grounding, live tests, human gates, or fallbacks where risk remains.

Observed Practices

Evaluate changes in offline, isolated, or pre-production simulations before relying on production rollout.

5 of 7 operators

Amazon, DoorDash, Meta, Thumbtack, Wix

Benchmark candidate behavior against a current or baseline policy using A/B tests, budget-split experiments, or the same simulated scenarios/dataset.

3 of 7 operators

DoorDash, Meta, Wix
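One way to picture benchmarking a candidate against a baseline on the same simulated scenarios is a paired comparison keyed by scenario seed. The sketch below is illustrative only: `simulate`, `baseline_policy`, `candidate_policy`, and the cost model are hypothetical stand-ins, not any operator's actual framework.

```python
import random
from statistics import mean

def simulate(policy, scenario_seed):
    """Run one hypothetical simulated scenario and return its total cost (lower is better)."""
    rng = random.Random(scenario_seed)  # same seed => identical scenario for every policy
    demand = [rng.uniform(0.0, 10.0) for _ in range(50)]
    return sum(policy(d) for d in demand)

def baseline_policy(demand):
    # Hypothetical current policy: cost grows linearly with demand.
    return 1.0 * demand

def candidate_policy(demand):
    # Hypothetical candidate: cheaper per unit of demand, plus a small fixed overhead.
    return 0.9 * demand + 0.1

def compare(baseline, candidate, seeds):
    """Paired comparison: every seed is evaluated under both policies."""
    deltas = [simulate(candidate, s) - simulate(baseline, s) for s in seeds]
    return {
        "mean_delta": mean(deltas),          # negative means the candidate is cheaper
        "wins": sum(d < 0 for d in deltas),  # scenarios where the candidate beats the baseline
        "n": len(deltas),
    }

result = compare(baseline_policy, candidate_policy, seeds=range(20))
print(result)
```

Reusing the seed for both policies removes scenario-to-scenario variance from the comparison, so the difference reflects the policy change and not the luck of the draw.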

Ground simulations in production-like code, historical data, observed KPIs, real telemetry, or actual log databases instead of relying only on model judgment.

4 of 7 operators

Amazon, DoorDash, Meta, Wix

Put safety gates around simulation-tested changes before or during production rollout.

3 of 7 operators

Amazon, Thumbtack, Wix

Make simulation cheap or parallel enough for iterative testing by shrinking workloads or running many variations concurrently.

2 of 7 operators

Amazon, Meta
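The two levers named in this practice, shrinking the workload and running variations concurrently, can be sketched with the standard-library `concurrent.futures` module. Everything about the simulation itself (`run_variation`, the `threshold` parameter, the sample size) is a hypothetical placeholder chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def run_variation(params):
    """Hypothetical simulation of one configuration on a shrunk workload."""
    rng = random.Random(params["seed"])
    # Shrunk workload: a small sample of events instead of a full-scale replay.
    events = [rng.random() for _ in range(params["sample_size"])]
    served = sum(1 for e in events if e < params["threshold"])
    return {"threshold": params["threshold"], "served_rate": served / len(events)}

def run_all(variations, max_workers=4):
    """Run every variation concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_variation, variations))

variations = [
    {"seed": 7, "sample_size": 1000, "threshold": t} for t in (0.2, 0.5, 0.8)
]
results = run_all(variations)
for r in results:
    print(r)
```

A thread pool suits simulations that mostly wait on I/O or external services; for CPU-bound pure-Python simulations, `ProcessPoolExecutor` from the same module is the usual substitute.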

Use conversation-based practice or transcript evaluation loops for conversational systems and sales training.

2 of 7 operators

Otter, Zoom

Where Operators Diverge

The simulated object differs by domain: operators simulate attacks, ML workflows, routing policies, ad pacing, gen-AI operations, or sales/support conversations.

APPROACH 01

Simulate adversarial security techniques with red-team and blue-team agents in isolated environments.

Amazon

APPROACH 02

Continuously A/B test common ML workflows in a pre-production framework to catch performance regressions.

Watch Items

Simulator fidelity remains an explicit concern: observed operators compare simulated current-policy results with production KPIs, keep test code and configurations aligned with production, or validate claims against real execution evidence.
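The first of these fidelity checks, comparing simulated current-policy results with production KPIs, amounts to a per-KPI relative-error test. The following is a minimal sketch under assumed names: the KPI keys, the values, and the 5% tolerance are all hypothetical.

```python
def fidelity_report(simulated, production, tolerance=0.05):
    """Compare simulated current-policy KPIs with observed production KPIs."""
    report = {}
    for kpi, prod_value in production.items():
        if prod_value == 0:
            raise ValueError(f"relative error is undefined for zero-valued KPI {kpi!r}")
        rel_error = abs(simulated[kpi] - prod_value) / abs(prod_value)
        report[kpi] = {"rel_error": rel_error, "ok": rel_error <= tolerance}
    return report

# Hypothetical KPI values, for illustration only.
production = {"spend_rate": 100.0, "delivery_rate": 0.80}
simulated = {"spend_rate": 103.0, "delivery_rate": 0.70}

report = fidelity_report(simulated, production)
trusted = all(entry["ok"] for entry in report.values())
print(report, trusted)
```

If the simulator cannot reproduce the KPIs of the policy already running in production, its predictions for a candidate policy deserve little weight.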

False positives and hallucinations still require controls; observed controls include human-in-the-loop review, false-positive thresholds, and grounded execution evidence.
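The three controls can be combined into a single gate on each simulated finding. This is a rough sketch with invented names (`gate_finding`, `execution_evidence`, `max_fp_rate`) and is not drawn from any operator's implementation.

```python
def gate_finding(finding, observed_fp_rate, max_fp_rate=0.1):
    """Route a simulated finding to 'reject', 'human_review', or 'accept'."""
    # Grounding control: a claim with no real execution evidence is not trusted.
    if not finding.get("execution_evidence"):
        return "reject"
    # Threshold control: a noisy detector sends its findings to a person.
    if observed_fp_rate > max_fp_rate:
        return "human_review"
    return "accept"

# Hypothetical findings, for illustration only.
grounded = {"claim": "rule fires on technique X", "execution_evidence": ["log entry 1"]}
ungrounded = {"claim": "rule fires on technique Y", "execution_evidence": []}

print(gate_finding(grounded, observed_fp_rate=0.02))
print(gate_finding(grounded, observed_fp_rate=0.30))
print(gate_finding(ungrounded, observed_fp_rate=0.02))
```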

Simulation does not eliminate production KPI tradeoffs: Wix kept a fallback for excessive waiting times, while DoorDash warned that overspend mitigation can create underspend or blackout risk.

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
Persona-driven user simulators	pattern	multi-turn behavior must be exercised before real users see it	emerging
Adversarial test generation	pattern	an attacker model generates the cases the golden set misses	emerging
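A persona-driven user simulator, the first pattern in the table, can be sketched as a loop that feeds persona turns to the system under test and records the transcript. In this illustration the persona's turns are scripted and `toy_assistant` is a trivial rule-based stand-in; a real setup would presumably generate user turns with a model, which is an assumption and not something the table specifies.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    goal: str             # keyword the assistant is expected to address
    scripted_turns: list  # stand-in for a model that generates user turns

def toy_assistant(message):
    """Hypothetical system under test: a trivial rule-based responder."""
    if "refund" in message.lower():
        return "I can help with a refund. What is your order number?"
    return "Could you tell me more?"

def simulate_conversation(persona, system, max_turns=5):
    """Exercise multi-turn behavior and return the transcript as (role, text) pairs."""
    transcript = []
    for user_msg in persona.scripted_turns[:max_turns]:
        transcript.append(("user", user_msg))
        transcript.append(("assistant", system(user_msg)))
    return transcript

def goal_addressed(persona, transcript):
    """Crude transcript check: did any assistant turn mention the persona's goal?"""
    return any(
        role == "assistant" and persona.goal in text.lower()
        for role, text in transcript
    )

persona = Persona(
    name="impatient customer",
    goal="refund",
    scripted_turns=["My order arrived broken.", "I want a refund now."],
)
transcript = simulate_conversation(persona, toy_assistant)
print(goal_addressed(persona, transcript))
```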

Observed in Production

2 APPS

Sales (GROUNDED)

AI Sales Engagement Room and Role Play Certification

Zoom (1 OP)

Advertising (NO RECIPE)

Programmatic Ad Bidding and Budget Pacing Optimization

DoorDash (1 OP)