Distillation

Model Adaptation

3 APPLICATIONS

2 OBSERVED OPERATORS

State of Practice

CROSS-VALIDATED — 4 OPERATORS

Distillation is deployed or piloted as a product-model transfer technique: operators use it to move the behavior of stronger models into task-specific retrieval, ranking, ads, and agent models, usually alongside fine-tuning or other supervised or post-training signals.

Observed Practices

Operators explicitly use distillation or knowledge distillation in deployed or pilot systems, not only in offline research.

4 of 4 operators in the pool show deployed or pilot use of distillation.

LinkedIn, Meta, Pinterest, Podium

Distillation is used to transfer behavior from a stronger incumbent, foundation, or LLM-based source into downstream product models.

4 of 4 operators cite teacher/source-to-downstream transfer patterns.

LinkedIn, Meta, Pinterest, Podium

Distillation is paired with task-specific training signals rather than treated as a standalone compression step.

4 of 4 operators show distillation combined with fine-tuning, post-training, ground-truth data, task losses, or curated traces.

LinkedIn, Meta, Pinterest, Podium

Operators train or guide distilled models with first-party product interaction data, traces, engagement pairs, or production-model outputs.

4 of 4 operators cite operator-owned data or model outputs as inputs to the adaptation loop.

LinkedIn, Meta, Pinterest, Podium

Some operators use distillation specifically to improve convergence or alignment of downstream models, not only to make models smaller.

3 of 4 operators explicitly connect distillation to alignment, knowledge transfer effectiveness, or better experimental-model convergence.

LinkedIn, Meta, Pinterest

One operator reports a classic batch-training distillation loss: a term added on top of binary-label cross entropy to match prediction differences between the experimental and production models.

1 of 4 operators discloses this specific loss-level implementation.
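
The loss described above can be sketched as follows. This is one plausible reading, not the operator's disclosed implementation: the squared-difference penalty, the `distill_weight` parameter, and all function names are assumptions made for illustration. The source states only that an extra term is added to binary-label cross entropy to match experimental and production predictions.

```python
import math

def bce(target, prob, eps=1e-7):
    """Binary cross entropy for one example, with clipping for numerical safety."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def distillation_loss(labels, student_probs, production_probs, distill_weight=0.5):
    """Mean label BCE plus a weighted penalty on the student/production gap.

    The squared-difference penalty and distill_weight are illustrative
    assumptions; the source does not give the exact form of the added term.
    """
    n = len(labels)
    label_term = sum(bce(y, p) for y, p in zip(labels, student_probs)) / n
    gap_term = sum((p - q) ** 2 for p, q in zip(student_probs, production_probs)) / n
    return label_term + distill_weight * gap_term

labels = [1, 0]
student = [0.8, 0.3]        # experimental-model predictions
production = [0.6, 0.5]     # production-model predictions on the same batch
loss = distillation_loss(labels, student, production)
```

Setting `distill_weight` to zero recovers plain binary cross entropy, which makes the contribution of the production model's signal easy to ablate.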

One operator reports using a Student Adapter that refines teacher outputs with recent ground-truth data before they supervise student models.

1 of 4 operators discloses this adapter-based knowledge-transfer mechanism.
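
A minimal sketch of the adapter idea, under strong assumptions: here the adapter is a two-parameter recalibration (scale and bias on the teacher logit) fitted to recent labeled outcomes by gradient descent. The source does not describe the adapter's architecture or training procedure, so the affine form, `fit_adapter`, `refine`, and the hyperparameters are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_adapter(teacher_logits, recent_labels, lr=0.1, steps=500):
    """Fit scale and bias so sigmoid(scale * logit + bias) tracks recent labels."""
    scale, bias = 1.0, 0.0
    n = len(teacher_logits)
    for _ in range(steps):
        grad_scale = grad_bias = 0.0
        for z, y in zip(teacher_logits, recent_labels):
            err = sigmoid(scale * z + bias) - y  # gradient of BCE w.r.t. the pre-activation
            grad_scale += err * z / n
            grad_bias += err / n
        scale -= lr * grad_scale
        bias -= lr * grad_bias
    return scale, bias

def refine(teacher_logits, scale, bias):
    """Adapter output: refined soft labels that would then supervise the student."""
    return [sigmoid(scale * z + bias) for z in teacher_logits]

# A stale teacher is confident (logit 2.0), but only 1 in 4 recent outcomes is positive.
teacher_logits = [2.0, 2.0, 2.0, 2.0]
recent_labels = [1, 0, 0, 0]
scale, bias = fit_adapter(teacher_logits, recent_labels)
soft_labels = refine(teacher_logits, scale, bias)
```

In this toy case the refined soft labels move from the teacher's original confidence toward the recent positive rate, which is the freshness correction the adapter is meant to provide.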

Where Operators Converge

Every observed operator applies distillation inside a product workflow, not as a generic benchmark exercise: LinkedIn in job search and Feed ranking, Meta in ads recommendation, Pinterest in ads engagement modeling, and Podium in an agentic AI employee.

Every observed operator combines distillation with domain/task adaptation data or training: engagement pairs, ad/user interactions, binary labels plus production predictions, traces, curated conversation scenarios, or human/product annotations.

Every observed operator uses distillation as part of a broader model-system design, alongside components such as retrieval/ranking pipelines, vertical models, MMoE architectures, fine-tuning, evaluation, observability, or agent orchestration.

Where Operators Diverge

Teacher/source of distilled signal differs by operator.

APPROACH 01

Foundation or LLM-derived teacher behavior is transferred into retrieval/ranking or embedding models.

APPROACH 02

A large ads foundation model transfers knowledge to hundreds of user-facing vertical models.

Watch Items

Compute and serving cost remain recurring constraints around distillation-adjacent systems: LinkedIn reports advanced LLMs increased compute requirements; Meta trains GEM across thousands of GPUs and reports training bottleneck work; Pinterest explored infrastructure-cost reduction and mixed-precision latency reduction; Podium reports 20-30 LLM calls per interaction before distilling outputs into a smaller model.

Teacher-student transfer can need freshness or data-gap correction: Pinterest uses knowledge distillation to mitigate a short data-retention performance gap and reports pairwise loss mitigating missing-data gaps; Meta uses a Student Adapter to align teacher predictions with recent observed outcomes and domain-relevant supervision.

Operators keep evaluation, observability, or auditing around adapted systems: Podium needed better visibility into LLM calls and uses offline/online evaluation; Pinterest reports offline and online results by view type; LinkedIn reports regular audits of model behavior.

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
Teacher-labeled synthetic SFT	pattern	a frontier model labels data that trains a cheaper student	established
TRL student training	library	open-weights student models trained on distilled datasets	established
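
The teacher-labeled synthetic SFT pattern in the table can be sketched as a small data-preparation step. `teacher_label` is a hypothetical stand-in for a frontier-model call, and the prompt/completion record shape is an assumption; check the training library's documentation for the dataset schema it expects.

```python
import json

def teacher_label(prompt):
    """Hypothetical stand-in for a frontier-model call; returns a canned label."""
    return "positive" if "great" in prompt.lower() else "negative"

def build_sft_records(prompts, teacher):
    """Pair each unlabeled prompt with the teacher's output as the training target."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]

def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r) for r in records)

prompts = ["Great service, will return.", "The order arrived late."]
records = build_sft_records(prompts, teacher_label)
jsonl = to_jsonl(records)
```

The resulting JSON Lines text would be written to disk and used as the supervised fine-tuning set for the cheaper student model.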

Observed in Production

3 APPS

Enterprise Search Synthetic Evaluation Data Generation

LLM Application Quality Assurance

Multimodal User Interest Profiling for Display Ad Ranking