TECHNIQUE

Synthetic data generation

Data & Context Engineering

4 APPLICATIONS
7 OBSERVED OPERATORS
01

State of Practice

CROSS-VALIDATED — 5 OPERATORS

Operators use synthetic data generation as a practical data and context tool for mocks, offline evaluation, retrieval context, and model adaptation, not as a standalone artifact.

Observed Practices

Generate synthetic product/search/retrieval cases as offline fixtures for testing, prototyping, or prompt iteration: Airbnb generates GraphQL mock JSON, Canva creates synthetic design content and queries, and New Computer uses synthetic users and generated queries for Dot memory experiments.

3 of 6 deployed/pilot operators in the pool.
Airbnb, Canva, New Computer

Seed synthetic generation with domain context rather than generic prompts: Airbnb includes GraphQL operation/schema/design context, Canva seeds GPT-4o with realistic design topics and real design-type distributions, and New Computer creates synthetic users with LLM-generated backstories.

3 of 6 deployed/pilot operators in the pool.
Airbnb, Canva, New Computer
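
As a rough illustration of seeding generation with domain context, the sketch below samples a design type from a weighted distribution and pairs it with a topic before building a prompt. The weights, topics, prompt wording, and the `build_seed_prompts` helper are all hypothetical placeholders, not figures or code from any operator above; a real pipeline would derive the weights from aggregate usage statistics and send each prompt to an LLM.

```python
import random

# Hypothetical design-type shares (assumption for illustration only).
DESIGN_TYPE_WEIGHTS = {
    "presentation": 0.40,
    "social_post": 0.35,
    "poster": 0.15,
    "document": 0.10,
}

# Hypothetical seed topics (assumption for illustration only).
TOPICS = ["quarterly sales review", "birthday invitation", "yoga class schedule"]


def build_seed_prompts(n, seed=0):
    """Return n prompt specs whose design types follow the weighted distribution."""
    rng = random.Random(seed)  # fixed seed keeps fixtures reproducible
    types = list(DESIGN_TYPE_WEIGHTS)
    weights = list(DESIGN_TYPE_WEIGHTS.values())
    prompts = []
    for _ in range(n):
        design_type = rng.choices(types, weights=weights, k=1)[0]
        topic = rng.choice(TOPICS)
        prompts.append({
            "design_type": design_type,
            "topic": topic,
            "prompt": (
                f"Write the text content of a {design_type} about '{topic}', "
                "then one search query a user might type to find it later."
            ),
        })
    return prompts


prompts = build_seed_prompts(50)
```

Sampling the type first, rather than asking the model to "pick a realistic type", keeps the mix of generated cases under the engineer's control.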

Validate or label generated data before relying on it: Airbnb validates LLM-generated mocks against GraphQL schema/query data and retries with errors, Canva runs synthetic test cases through the local production-like search pipeline to compute recall and precision, and New Computer labels relevant memories and tracks precision/recall/F1.

3 of 6 deployed/pilot operators in the pool.
Airbnb, Canva, New Computer

Use synthetic data in model adaptation or fine-tuning workflows: Atlassian reports using NVIDIA NeMo Data Designer for synthetic data generation, data cleaning, and fine-tuning; Grab, in an announced teardown, reports synthetic OCR datasets and an in-house pipeline rendering Southeast Asia text into images for training.

1 of 6 deployed/pilot operators in the pool; also observed in announced Grab evidence, which is not counted in deployed/pilot arithmetic.
Atlassian, Grab

Embed synthetic or enriched synthetic datasets into retrieval infrastructure: LinkedIn’s SPP AI embeds enriched synthetic datasets into a vector index used as a centralized context repository for query processing.

1 of 6 deployed/pilot operators in the pool.
LinkedIn
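
To make the "embed synthetic records, then retrieve them as context" idea concrete, here is a toy sketch. It is not LinkedIn's implementation: the bag-of-words `embed` function stands in for a real embedding model, `ToyVectorIndex` stands in for a real vector store, and the three records are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """In-memory index: stores (id, text, vector) rows and ranks by cosine."""

    def __init__(self):
        self._rows = []

    def add(self, doc_id, text):
        self._rows.append((doc_id, text, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        scored = [(cosine(q, vec), doc_id, text) for doc_id, text, vec in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        # Drop rows with no overlap so unrelated context is never returned.
        return [(doc_id, text) for score, doc_id, text in scored[:k] if score > 0]


# Invented synthetic records (assumption for illustration only).
index = ToyVectorIndex()
index.add("syn-1", "metric weekly_active_users counts distinct members active in seven days")
index.add("syn-2", "table job_applications stores one row per application submitted")
index.add("syn-3", "dimension member_region maps members to a geographic region")

results = index.search("how many job applications were submitted")
```

The retrieved rows would then be placed in the prompt as context for query processing.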

Use synthetic data to avoid exposing private customer content during evaluation: Canva creates realistic but entirely synthetic content and queries with “zero privacy concerns” for private design search evaluation.

1 of 6 deployed/pilot operators in the pool.
Canva

Where Operators Converge

Observed operators feed synthetic data into a downstream workflow—mock serving, offline evaluation, vector-index context, fine-tuning, or training—rather than treating generated data as an end in itself.

Where Operators Diverge

Primary use case for synthetic data differs by operator.

APPROACH 01

Frontend/product mocks and offline evaluation fixtures.

Airbnb, Canva, New Computer

APPROACH 02

Model adaptation, fine-tuning, or training data generation.

Atlassian, Grab

APPROACH 03

Retrievable context stored in a vector index.

LinkedIn

Synthetic data generation mechanisms differ.

APPROACH 01

Prompt a general-purpose LLM with product/domain context to generate mocks, documents, queries, or synthetic-user behavior.

Airbnb, Canva, New Computer

APPROACH 02

Use dedicated synthetic-data tooling or rendering pipelines as part of training/fine-tuning.

Atlassian, Grab

APPROACH 03

Create enriched synthetic datasets from structured context and embed them into a vector index.

LinkedIn

Trust controls around generated data vary by workflow.

APPROACH 01

Schema/query validation plus retry loop to self-heal invalid generated mocks.

Airbnb
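
A validate-and-retry loop of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not Airbnb's code: `SCHEMA` is a flat field-to-type map standing in for real GraphQL schema and query validation, and `fake_llm` is a stub that stands in for a model call and "fixes" its output once it is shown the previous errors.

```python
import json

# Hypothetical stand-in for a schema: field name -> required Python type.
SCHEMA = {"id": str, "title": str, "price": float}


def validate(mock, schema):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for field, expected in schema.items():
        if field not in mock:
            errors.append(f"missing field '{field}'")
        elif not isinstance(mock[field], expected):
            errors.append(
                f"field '{field}' should be {expected.__name__}, "
                f"got {type(mock[field]).__name__}"
            )
    for field in mock:
        if field not in schema:
            errors.append(f"unknown field '{field}'")
    return errors


def generate_valid_mock(generate, schema, max_attempts=3):
    """Call generate(errors) until its JSON output passes validation."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        raw = generate(errors)  # previous errors are fed back to the generator
        try:
            mock = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(mock, dict):
            errors = ["top-level value must be a JSON object"]
            continue
        errors = validate(mock, schema)
        if not errors:
            return mock, attempt
    raise ValueError(f"no valid mock after {max_attempts} attempts: {errors}")


def fake_llm(previous_errors):
    """Stub generator: wrong type on the first call, corrected afterwards."""
    if not previous_errors:
        return '{"id": "listing-1", "title": "Beach house", "price": "120"}'
    return '{"id": "listing-1", "title": "Beach house", "price": 120.0}'


mock, attempts = generate_valid_mock(fake_llm, SCHEMA)
```

Capping attempts and raising on exhaustion keeps an invalid mock from silently reaching downstream consumers.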

APPROACH 02

Production-like local evaluation with recall and precision metrics, followed by online A/B experimentation for candidate search changes.

Canva

APPROACH 03

Human labeling of relevant memories and experiment metrics such as precision, recall, and F1.

New Computer
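
For reference, precision, recall, and F1 over a single retrieval query can be computed as in the minimal sketch below. The `retrieval_metrics` helper and the memory ids are hypothetical; it assumes the retrieved results and the human-labeled relevant items can be compared as sets of ids.

```python
def retrieval_metrics(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical ids: 4 memories retrieved, 3 labeled relevant, 2 in common.
metrics = retrieval_metrics(
    retrieved=["m1", "m2", "m3", "m4"],
    relevant=["m1", "m3", "m7"],
)
# precision = 2/4, recall = 2/3, f1 = 4/7
```

Averaging these per-query values across the synthetic test set gives a single number to compare between experiment runs.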

APPROACH 04

Human review to refine generated or auto-labeled document data for label accuracy.

Grab

Watch Items

Generated data still needs validity or quality controls before use: Airbnb validates and retries invalid GraphQL mock data; New Computer adds human labels and retrieval metrics; Grab reports human review to achieve high label accuracy.

Representativeness is actively engineered rather than assumed: Canva samples from real design-type distributions, New Computer uses synthetic users and broad generated intents, Airbnb supplies schema/design context, and Grab renders Southeast Asia text with varied fonts, backgrounds, and augmentations.

02

Implementation Menu

CURATED DEFAULTS
Name | Kind | Maturity
Teacher-generated instruction data | pattern | established
distilabel | library | emerging
03

Observed in Production

4 APPS