TECHNIQUE

Distillation

Model Adaptation

2APPLICATIONS
1OBSERVED OPERATORS
01

State of Practice

CROSS-VALIDATED — 4 OPERATORS

Distillation is deployed/piloted as a product-model transfer technique: operators use it to move stronger model behavior into task-specific retrieval, ranking, ads, and agent models, usually alongside fine-tuning or other supervised/post-training signals.

Observed Practices

Operators explicitly use distillation or knowledge distillation in deployed/pilot systems rather than only in offline research.

4 of 4 operators in the pool show deployed or pilot use of distillation.
LinkedInMetaPinterestPodium

Distillation is used to transfer behavior from a stronger incumbent, foundation, or LLM-based source into downstream product models.

4 of 4 operators cite teacher/source-to-downstream transfer patterns.
LinkedInMetaPinterestPodium

Distillation is paired with task-specific training signals rather than treated as a standalone compression step.

4 of 4 operators show distillation combined with fine-tuning, post-training, ground-truth data, task losses, or curated traces.
LinkedInMetaPinterestPodium

Operators train or guide distilled models with first-party product interaction data, traces, engagement pairs, or production-model outputs.

4 of 4 operators cite operator-owned data or model outputs as inputs to the adaptation loop.
LinkedInMetaPinterestPodium

Some operators use distillation specifically to improve convergence or alignment of downstream models, not only to make models smaller.

3 of 4 operators explicitly connect distillation to alignment, knowledge transfer effectiveness, or better experimental-model convergence.
LinkedInMetaPinterest

One operator reports a classic batch-training distillation loss: add a loss on top of binary-label cross entropy to match prediction differences between experimental and production models.

1 of 4 operators discloses this specific loss-level implementation.
Pinterest

One operator reports a Student Adapter to refine teacher outputs with recent ground-truth data before supervising student models.

1 of 4 operators discloses this adapter-based knowledge-transfer mechanism.
Meta

Where Operators Converge

Every observed operator applies distillation inside a product workflow, not as a generic benchmark exercise: LinkedIn in job search and Feed ranking, Meta in ads recommendation, Pinterest in ads engagement modeling, and Podium in an agentic AI employee.

Every observed operator combines distillation with domain/task adaptation data or training: engagement pairs, ad/user interactions, binary labels plus production predictions, traces, curated conversation scenarios, or human/product annotations.

Every observed operator uses distillation as part of a broader model-system design, alongside components such as retrieval/ranking pipelines, vertical models, MMoE architectures, fine-tuning, evaluation, observability, or agent orchestration.

Where Operators Diverge

Teacher/source of distilled signal differs by operator.

APPROACH 01

Foundation or LLM-derived teacher behavior is transferred into retrieval/ranking or embedding models.

LinkedIn

APPROACH 02

A large ads foundation model transfers knowledge to hundreds of user-facing vertical models.

Meta

APPROACH 03

The existing production model teaches a new experimental model during batch training.

Pinterest

APPROACH 04

Outputs from many LLM calls are curated into a smaller model.

Podium

The product target differs: most operators apply distillation to retrieval/ranking or recommendation, while Podium applies it in an agentic conversation product.

APPROACH 01

Search, Feed, ads retrieval/ranking, or recommendation models.

LinkedInMetaPinterest

APPROACH 02

Agentic AI employee for customer conversations, scheduling, sales, and support behavior.

Podium

Disclosed implementation depth differs.

APPROACH 01

Loss-level implementation is disclosed: prediction-difference loss added to standard cross entropy.

Pinterest

APPROACH 02

Adapter/transfer-framework implementation is disclosed: direct and hierarchical knowledge transfer plus Student Adapter.

Meta

APPROACH 03

High-level distillation use is disclosed without the same loss/adaptor detail in the cited teardown.

LinkedInPodium

Watch Items

Compute and serving cost remain recurring constraints around distillation-adjacent systems: LinkedIn reports advanced LLMs increased compute requirements; Meta trains GEM across thousands of GPUs and reports training bottleneck work; Pinterest explored infrastructure-cost reduction and mixed-precision latency reduction; Podium reports 20-30 LLM calls per interaction before distilling outputs into a smaller model.

Teacher-student transfer can need freshness or data-gap correction: Pinterest uses knowledge distillation to mitigate a short data-retention performance gap and reports pairwise loss mitigating missing-data gaps; Meta uses a Student Adapter to align teacher predictions with recent observed outcomes and domain-relevant supervision.

Operators keep evaluation, observability, or auditing around adapted systems: Podium needed better visibility into LLM calls and uses offline/online evaluation; Pinterest reports offline and online results by view type; LinkedIn reports regular audits of model behavior.

02

Implementation Menu

CURATED DEFAULTS
NameKindMaturity
Teacher-labeled synthetic SFTpatternestablished
TRL student traininglibraryestablished
03

Observed in Production

2 APPS