TECHNIQUE
Data & Context Engineering
Synthetic data generation is used as a practical data and context engineering tool for mocks, offline evaluation, retrieval context, and model adaptation, not as a standalone artifact.
Generate synthetic product/search/retrieval cases as offline fixtures for testing, prototyping, or prompt iteration: Airbnb generates GraphQL mock JSON, Canva creates synthetic design content and queries, and New Computer uses synthetic users and generated queries for Dot memory experiments.
3 of 6 deployed/pilot operators in the pool.
Seed synthetic generation with domain context rather than generic prompts: Airbnb includes GraphQL operation/schema/design context, Canva seeds GPT-4o with realistic design topics and real design-type distributions, and New Computer creates synthetic users with LLM-generated backstories.
3 of 6 deployed/pilot operators in the pool.
Validate or label generated data before relying on it: Airbnb validates LLM-generated mocks against GraphQL schema/query data and retries with errors, Canva runs synthetic test cases through the local production-like search pipeline to compute recall and precision, and New Computer labels relevant memories and tracks precision/recall/F1.
3 of 6 deployed/pilot operators in the pool.
Use synthetic data in model adaptation or fine-tuning workflows: Atlassian reports using NVIDIA NeMo Data Designer for synthetic data generation, data cleaning, and fine-tuning; Grab, in an announced teardown, reports synthetic OCR datasets and an in-house pipeline rendering Southeast Asia text into images for training.
1 of 6 deployed/pilot operators in the pool; also observed in announced Grab evidence, which is not counted in deployed/pilot arithmetic.
Embed synthetic or enriched synthetic datasets into retrieval infrastructure: LinkedIn’s SPP AI embeds enriched synthetic datasets into a vector index used as a centralized context repository for query processing.
1 of 6 deployed/pilot operators in the pool.
Use synthetic data to avoid exposing private customer content during evaluation: Canva creates realistic but entirely synthetic content and queries with “zero privacy concerns” for private design search evaluation.
1 of 6 deployed/pilot operators in the pool.
Observed operators feed synthetic data into a downstream workflow (mock serving, offline evaluation, vector-index context, fine-tuning, or training) rather than treating generated data as an end in itself.
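The validate-and-retry pattern reported for Airbnb above can be sketched roughly as follows. This is an illustration under stated assumptions, not Airbnb's implementation: `fake_llm` is a hypothetical stand-in for a model call, `LISTING_SCHEMA` is a hand-written mapping of field names to Python types rather than a real GraphQL schema, and the retry budget of three attempts is arbitrary.

```python
import json
from typing import Callable

# Hypothetical, hand-written "schema": field name -> expected Python type.
# A real setup would validate against the GraphQL schema and query instead.
LISTING_SCHEMA = {"id": str, "title": str, "price": float}


def validate(payload: str, schema: dict[str, type]) -> list[str]:
    """Return human-readable errors; an empty list means the payload is valid."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field '{field}'")
        elif not isinstance(data[field], expected):
            errors.append(f"field '{field}' should be {expected.__name__}")
    return errors


def generate_valid_mock(
    generate: Callable[[list[str]], str],
    schema: dict[str, type],
    max_attempts: int = 3,
) -> dict:
    """Call the generator, feeding validation errors back in, until output is valid."""
    errors: list[str] = []
    for _ in range(max_attempts):
        payload = generate(errors)          # errors from the last attempt go into the prompt
        errors = validate(payload, schema)
        if not errors:
            return json.loads(payload)
    raise ValueError(f"no valid mock after {max_attempts} attempts: {errors}")


def fake_llm(previous_errors: list[str]) -> str:
    """Stand-in for an LLM call: wrong type on the first try, fixed once it sees errors."""
    if not previous_errors:
        return json.dumps({"id": "abc", "title": "Cabin", "price": "120"})
    return json.dumps({"id": "abc", "title": "Cabin", "price": 120.0})


mock = generate_valid_mock(fake_llm, LISTING_SCHEMA)
print(mock)
```

Passing the previous errors back to the generator is what makes the loop self-correcting rather than a blind retry; the hard attempt limit keeps a persistently invalid generation from looping forever.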
Primary use case for synthetic data differs by operator.
APPROACH 01
Frontend/product mocks and offline evaluation fixtures.
APPROACH 02
Model adaptation, fine-tuning, or training data generation.
APPROACH 03
Retrievable context stored in a vector index.
Synthetic data generation mechanisms differ.
APPROACH 01
Prompt a general-purpose LLM with product/domain context to generate mocks, documents, queries, or synthetic-user behavior.
APPROACH 02
Use dedicated synthetic-data tooling or rendering pipelines as part of training/fine-tuning.
APPROACH 03
Create enriched synthetic datasets from structured context and embed them into a vector index.
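As a rough illustration of the third mechanism, the sketch below stores a few synthetic records in an in-memory index and retrieves the closest one for a query. Everything here is assumed for the example: the records are invented, `embed` is a toy bag-of-words function standing in for a real embedding model, and `VectorIndex` is a hypothetical class, not any operator's infrastructure.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorIndex:
    """Minimal in-memory index: stores (id, vector) pairs and scans them linearly."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, doc_id: str, text: str) -> None:
        self._items.append((doc_id, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]


# Invented synthetic records used as retrievable context.
index = VectorIndex()
index.add("syn-001", "customer asks how to export invoices as csv")
index.add("syn-002", "customer asks how to reset a forgotten password")
index.add("syn-003", "customer reports slow dashboard loading times")

print(index.search("reset password"))
```

A production index would replace the linear scan with an approximate nearest-neighbor structure, but the data flow is the same: generate or enrich records, embed them, then look them up at query time.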
Trust controls around generated data vary by workflow.
APPROACH 01
Schema/query validation plus retry loop to self-heal invalid generated mocks.
APPROACH 02
Production-like local evaluation with recall and precision metrics, followed by online A/B experimentation for candidate search changes.
APPROACH 03
Human labeling of relevant memories and experiment metrics such as precision, recall, and F1.
APPROACH 04
Human review to refine generated or auto-labeled document data for label accuracy.
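The precision, recall, and F1 metrics mentioned in the approaches above can be computed per test case from a list of retrieved items and a set of labeled relevant items. The sketch below uses the standard set-based definitions; the case data and memory identifiers are invented for illustration, and the macro-average across cases is one reasonable aggregation choice among several.

```python
def precision_recall_f1(
    retrieved: list[str], relevant: set[str]
) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one retrieval test case."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented labeled cases: (items the system retrieved, items labeled as relevant).
cases = [
    (["m1", "m2"], {"m1", "m3"}),
    (["m4"], {"m4"}),
]

scores = [precision_recall_f1(retrieved, relevant) for retrieved, relevant in cases]

# Macro-average: mean of each metric across cases, so every case counts equally.
macro = tuple(sum(column) / len(column) for column in zip(*scores))
print(scores)
print(macro)
```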
Generated data still needs validity or quality controls before use: Airbnb validates and retries invalid GraphQL mock data; New Computer adds human labels and retrieval metrics; Grab reports human review to achieve high label accuracy.
Representativeness is actively engineered rather than assumed: Canva samples from real design-type distributions, New Computer uses synthetic users and broad generated intents, Airbnb supplies schema/design context, and Grab renders Southeast Asia text with varied fonts, backgrounds, and augmentations.
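One way to engineer representativeness, in the spirit of sampling from real design-type distributions, is to draw the category for each generation prompt from measured weights instead of picking uniformly. The sketch below only builds prompts and makes no model call; the weights, topics, and prompt wording are all hypothetical placeholders, not figures or prompts from any operator.

```python
import random

# Hypothetical design-type shares; in practice these would be measured from real usage.
DESIGN_TYPE_WEIGHTS = {"presentation": 0.5, "poster": 0.3, "resume": 0.2}

# Hypothetical seed topics supplying domain context.
TOPICS = ["quarterly sales review", "school science fair", "bakery grand opening"]


def build_prompts(n: int, seed: int = 0) -> list[str]:
    """Build n generation prompts whose design types follow the weighted distribution."""
    rng = random.Random(seed)  # seeded so the fixture set is reproducible
    design_types = rng.choices(
        list(DESIGN_TYPE_WEIGHTS),
        weights=list(DESIGN_TYPE_WEIGHTS.values()),
        k=n,
    )
    return [
        f"Write the text content of a {design_type} about '{rng.choice(TOPICS)}', "
        "then one search query a user might type to find it later."
        for design_type in design_types
    ]


for prompt in build_prompts(3):
    print(prompt)
```

Seeding the random generator keeps the resulting fixture set stable across runs, which matters when the generated cases are reused as an offline evaluation baseline.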
| Name | Kind | When | Maturity |
|---|---|---|---|
| Teacher-generated instruction data | pattern | a strong model drafts training examples that humans spot-check | established |
| distilabel | library | synthetic generation pipelines with built-in judging and filtering | emerging |