TECHNIQUE

Runtime & hardware

Serving & Inference

3 APPLICATIONS
2 OBSERVED OPERATORS
01

State of Practice

CROSS-VALIDATED — 4 OPERATORS

Runtime and hardware practice is split between direct GPU/serving-stack optimization and managed throughput abstraction, with operators tuning batching, caching, kernels, and capacity around latency and scale constraints.

Observed Practices

Run high-scale model workloads on GPU infrastructure and optimize for GPU utilization.

3 of 4 operators
Criteo, LinkedIn, Meta

Adopt purpose-built LLM serving engines for high-throughput, low-latency serving.

1 of 4 operators
LinkedIn

Use server-side or client-side batching to raise throughput while managing latency.

2 of 4 operators
Criteo, LinkedIn
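The throughput-versus-latency trade-off in batching can be illustrated with a small simulation. This is a minimal sketch, not any operator's implementation: the function name form_batches and the parameters max_batch and max_wait_ms are hypothetical, and arrival times are assumed to be sorted.

```python
from typing import List


def form_batches(arrival_ms: List[float], max_batch: int,
                 max_wait_ms: float) -> List[List[int]]:
    """Group request indices into batches (illustrative only).

    A batch opens when its first request arrives. It closes when it
    already holds max_batch requests, or when the next request arrives
    after the wait window has expired. Assumes arrival_ms is sorted.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    window_start = 0.0
    for i, t in enumerate(arrival_ms):
        full = len(current) >= max_batch
        expired = t - window_start > max_wait_ms
        if current and (full or expired):
            batches.append(current)
            current = []
        if not current:
            window_start = t  # this request opens a new batch
        current.append(i)
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times in milliseconds.
result = form_batches([0, 1, 2, 10, 11, 30], max_batch=2, max_wait_ms=5)
print(result)  # [[0, 1], [2], [3, 4], [5]]
```

A larger max_batch or max_wait_ms tends to raise accelerator utilization at the cost of added queueing delay for the first request in each batch, which is the tension the practice above describes.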

Reduce repeated LLM compute with KV, prefix, or in-batch prefix caching.

1 of 4 operators
LinkedIn
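To show why prefix caching reduces repeated compute, here is a toy model that counts how many prompt tokens would still need fresh computation once earlier prompts with the same leading tokens have been seen. It is a sketch under stated assumptions: PrefixCache, block_size, and tokens_to_compute are hypothetical names, no real KV tensors are stored, and production engines use chained block hashes rather than whole-prefix tuples.

```python
from typing import List, Set, Tuple


class PrefixCache:
    """Toy block-level prefix cache (illustrative, stores no tensors)."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._seen: Set[Tuple[int, ...]] = set()

    def tokens_to_compute(self, tokens: List[int]) -> int:
        """Return the number of tokens needing fresh compute.

        Only full blocks are cacheable, and a block counts as a hit only
        if every block before it in the same prompt was also a hit.
        All full blocks of this prompt are recorded afterwards.
        """
        cached = 0
        still_hitting = True
        full_blocks = len(tokens) // self.block_size
        for b in range(full_blocks):
            end = (b + 1) * self.block_size
            key = tuple(tokens[:end])  # the whole prefix identifies the block
            if still_hitting and key in self._seen:
                cached = end
            else:
                still_hitting = False
                self._seen.add(key)
        return len(tokens) - cached


cache = PrefixCache(block_size=2)
print(cache.tokens_to_compute([1, 2, 3, 4, 9]))     # 5 (cold cache)
print(cache.tokens_to_compute([1, 2, 3, 4, 7, 8]))  # 2 (shared 4-token prefix)
```

The second call reuses the four shared leading tokens, so only the two new tokens count as fresh work. Workloads that repeat a long shared prompt prefix across many requests benefit the most from this pattern.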

Tune low-level runtime paths: fused kernels, custom GPU kernels, execution providers, allocator choices, and handler-thread configuration.

3 of 4 operators
Criteo, LinkedIn, Meta

Export trained models to ONNX and serve them with Dynamo-Triton using the ONNX Runtime backend.

1 of 4 operators
Criteo

Abstract hardware into managed throughput units and mix provisioned capacity with on-demand capacity.

1 of 4 operators
Slack

Build runtime failover and routing around capacity, model, or regional degradation.

1 of 4 operators
Slack

Where Operators Converge

All observed operators tied runtime or hardware decisions to production constraints such as throughput, latency, scale, reliability, or compute efficiency.

Where Operators Diverge

Operators differ on how directly they expose hardware to the serving/application team.

APPROACH 01

Directly tune GPU-backed runtimes and kernels.

Criteo, LinkedIn, Meta

APPROACH 02

Move from GPU instances to managed throughput units and focus on tokens-per-minute capacity.

Slack

Serving runtime choices vary by workload and platform boundary.

APPROACH 01

Open-source LLM serving engines, including vLLM and SGLang, extended for LinkedIn workloads.

LinkedIn

APPROACH 02

ONNX export served through Dynamo-Triton with ONNX Runtime backend.

Criteo

APPROACH 03

Managed ML/LLM serving platforms, moving from SageMaker to Bedrock.

Slack

APPROACH 04

Model-system co-design with multi-dimensional parallelism and custom GPU kernels for large ads foundation-model workloads.

Meta

Capacity strategy differs between dedicated capacity, on-demand spillover, and engine-level throughput tuning.

APPROACH 01

Dedicated provisioned capacity for latency-sensitive features, on-demand for bursty workloads, and spillover beyond reserved limits.

Slack
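The provisioned-plus-spillover pattern can be sketched as a router that tracks a tokens-per-minute budget. This is an illustration only, not Slack's implementation: CapacityRouter, provisioned_tpm, and the pool labels are hypothetical, and the fixed one-minute window is a simplifying assumption (a real limiter may use a sliding window or token bucket).

```python
class CapacityRouter:
    """Route requests to reserved capacity first, then spill over."""

    def __init__(self, provisioned_tpm: int) -> None:
        self.provisioned_tpm = provisioned_tpm  # reserved tokens per minute
        self._window = -1  # index of the current one-minute window
        self._used = 0     # tokens charged to reserved capacity so far

    def route(self, tokens: int, now_s: float) -> str:
        """Return which pool should serve a request of this token size."""
        window = int(now_s // 60)
        if window != self._window:
            # A new minute started: the reserved budget resets.
            self._window = window
            self._used = 0
        if self._used + tokens <= self.provisioned_tpm:
            self._used += tokens
            return "provisioned"
        # Reserved limit would be exceeded: spill to on-demand capacity.
        return "on_demand"


router = CapacityRouter(provisioned_tpm=1000)
print(router.route(600, now_s=0.0))   # provisioned
print(router.route(600, now_s=10.0))  # on_demand
print(router.route(600, now_s=61.0))  # provisioned
```

Keeping latency-sensitive traffic on the reserved pool and letting only the excess spill over is one way to bound cost while absorbing bursts.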

APPROACH 02

Throughput tuning inside the inference server and request path, including batching, handler threads, and allocator changes.

Criteo, LinkedIn

Watch Items

Latency constraints are explicit: Criteo cites real-time bidding at up to a billion inferences per second with serving latency under 10 ms, while LinkedIn cites P99 latency needing to stay within a few hundred milliseconds under load.
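To give a sense of scale for the Criteo figures, Little's law (average concurrency equals arrival rate times time in system) gives a rough upper bound on requests in flight. This is a back-of-the-envelope sketch that treats both cited numbers as upper bounds across the whole fleet; the function name is hypothetical.

```python
def in_flight_requests(rate_per_s: int, latency_ms: int) -> int:
    """Little's law: average concurrency = arrival rate * time in system."""
    return rate_per_s * latency_ms // 1000


# Assumed inputs: one billion inferences per second, 10 ms per inference.
peak = in_flight_requests(1_000_000_000, 10)
print(peak)  # 10000000
```

Under those assumptions, roughly ten million inferences would be in flight at any instant, which helps explain why batching and low-level runtime tuning feature so heavily in the practices listed earlier.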

Compute and GPU capacity pressure shows up repeatedly, including LinkedIn's increased compute requirements from advanced LLMs, Slack's GPU scarcity concern, and Meta's training across thousands of GPUs.

Generic generation-oriented serving paths can add overhead for ranking/scoring workloads; LinkedIn explicitly reports unnecessary overhead from code paths optimized for generation, sampling, or low-concurrency interactive workloads.

Live serving migrations and regional/capacity failures require operational controls; Slack reports live backend switching risk and uses gradual traffic shifts, instant rollback, spillover, and rerouting.
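A gradual traffic shift with instant rollback can be sketched as a percentage-based router with stable per-request bucketing. This is a minimal illustration, not Slack's mechanism: TrafficShifter, the backend labels, and the use of CRC32 for bucketing are all assumptions made for the example.

```python
import zlib


class TrafficShifter:
    """Send a configurable percentage of requests to a new backend."""

    def __init__(self) -> None:
        self.new_backend_percent = 0  # start with all traffic on "old"

    def set_percent(self, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.new_backend_percent = percent

    def rollback(self) -> None:
        """Instantly return all traffic to the old backend."""
        self.new_backend_percent = 0

    def backend_for(self, request_id: str) -> str:
        # Stable hash: the same request id always lands in the same
        # bucket, so raising the percentage only adds traffic to "new".
        bucket = zlib.crc32(request_id.encode("utf-8")) % 100
        return "new" if bucket < self.new_backend_percent else "old"


shifter = TrafficShifter()
shifter.set_percent(10)   # first step of a gradual ramp
shifter.set_percent(50)   # widen once metrics look healthy
shifter.rollback()        # one call undoes the whole shift
print(shifter.backend_for("request-123"))  # old
```

Because bucketing is deterministic, a request id routed to the new backend at 10 percent stays there at 50 percent, which keeps comparisons between ramp steps consistent.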

02

Implementation Menu

CURATED DEFAULTS
Name                                   Kind      Maturity
vLLM on dedicated GPUs                 library   established
SGLang                                 library   emerging
Managed inference (Bedrock / Vertex)   service   established
03

Observed in Production

3 APPS