TECHNIQUE
Serving & Inference
Runtime and hardware practice is split between direct GPU/serving-stack optimization and managed throughput abstraction, with operators tuning batching, caching, kernels, and capacity around latency and scale constraints.
Run high-scale model workloads on GPU infrastructure and optimize for GPU utilization. (3 of 4 operators)
Adopt purpose-built LLM serving engines for high-throughput, low-latency serving. (1 of 4 operators)
Use server-side or client-side batching to raise throughput while managing latency. (2 of 4 operators)
Reduce repeated LLM compute with KV, prefix, or in-batch prefix caching. (1 of 4 operators)
Tune low-level runtime paths: fused kernels, custom GPU kernels, execution providers, allocator choices, and handler-thread configuration. (3 of 4 operators)
Export trained models to ONNX and serve them with Dynamo-Triton using the ONNX Runtime backend. (1 of 4 operators)
Abstract hardware into managed throughput units and mix provisioned capacity with on-demand capacity. (1 of 4 operators)
Build runtime failover and routing around capacity, model, or regional degradation. (1 of 4 operators)
All observed operators tied runtime or hardware decisions to production constraints such as throughput, latency, scale, reliability, or compute efficiency.
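The batching practice listed above trades a small amount of queueing delay for larger, more efficient batches. A minimal sketch of that trade-off follows. It is not any operator's implementation: the `Request` type, the `form_batches` helper, and the `max_batch_size` and `max_wait_ms` parameters are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: str
    arrival_ms: float  # arrival time in milliseconds (hypothetical field)


def form_batches(
    requests: List[Request], max_batch_size: int, max_wait_ms: float
) -> List[List[Request]]:
    """Group requests into batches in arrival order.

    A batch is closed when it is full, or when the next request arrives
    more than max_wait_ms after the first request in the batch.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        batch_is_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (
            req.arrival_ms - current[0].arrival_ms > max_wait_ms
        )
        if current and (batch_is_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


# Illustrative arrivals: four requests close together, one straggler.
arrivals = [Request(f"r{i}", t) for i, t in enumerate([0.0, 1.0, 2.0, 3.0, 20.0])]
batch_sizes = [len(b) for b in form_batches(arrivals, max_batch_size=3, max_wait_ms=5.0)]
```

Raising `max_wait_ms` or `max_batch_size` tends to improve throughput at the cost of tail latency, which is why the two are usually tuned together against a latency budget.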
Operators differ on how directly they expose hardware to the serving/application team.
APPROACH 01
Directly tune GPU-backed runtimes and kernels.
APPROACH 02
Move from GPU instances to managed throughput units and focus on tokens-per-minute capacity.
Serving runtime choices vary by workload and platform boundary.
APPROACH 01
Open-source LLM serving engines, including vLLM and SGLang, extended for LinkedIn workloads.
APPROACH 02
ONNX export served through Dynamo-Triton with ONNX Runtime backend.
APPROACH 03
Managed ML/LLM serving platforms, moving from SageMaker to Bedrock.
APPROACH 04
Model-system co-design with multi-dimensional parallelism and custom GPU kernels for large ads foundation-model workloads.
Capacity strategy differs between dedicated capacity, on-demand spillover, and engine-level throughput tuning.
APPROACH 01
Dedicated provisioned capacity for latency-sensitive features, on-demand for bursty workloads, and spillover beyond reserved limits.
APPROACH 02
Throughput tuning inside the inference server and request path, including batching, handler threads, and allocator changes.
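The dedicated-plus-spillover strategy described in Approach 01 can be sketched as a simple routing rule over a tokens-per-minute budget. This is an illustrative sketch only: the `CapacityPool` class, its field names, and the budget figures are assumptions, and a real managed platform enforces its own quotas.

```python
from dataclasses import dataclass


@dataclass
class CapacityPool:
    provisioned_tpm: int  # reserved tokens per minute (hypothetical figure)
    used_tpm: int = 0     # tokens consumed in the current one-minute window

    def route(self, tokens: int, latency_sensitive: bool) -> str:
        """Pick a capacity tier for one request.

        Latency-sensitive requests use provisioned capacity while it lasts
        and spill over to on-demand beyond the reserved limit. Bursty
        (non-latency-sensitive) requests always go to on-demand.
        """
        fits = self.used_tpm + tokens <= self.provisioned_tpm
        if latency_sensitive and fits:
            self.used_tpm += tokens
            return "provisioned"
        return "on_demand"

    def reset_window(self) -> None:
        """Start a new one-minute accounting window."""
        self.used_tpm = 0


pool = CapacityPool(provisioned_tpm=1000)
decisions = [
    pool.route(600, latency_sensitive=True),   # fits in the reserved budget
    pool.route(600, latency_sensitive=True),   # would exceed it: spillover
    pool.route(100, latency_sensitive=False),  # bursty workload
    pool.route(400, latency_sensitive=True),   # exactly fills the budget
]
```

The sketch keeps the accounting in one process; a production router would need shared state and would likely account for output tokens as they are generated.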
Latency constraints are explicit: Criteo cites real-time bidding at up to a billion inferences per second with serving latency under 10 ms, while LinkedIn cites a need to keep P99 latency within a few hundred milliseconds under load.
Compute and GPU capacity pressure shows up repeatedly, including LinkedIn's increased compute requirements from advanced LLMs, Slack's GPU scarcity concern, and Meta's training across thousands of GPUs.
Generic generation-oriented serving paths can add overhead for ranking/scoring workloads; LinkedIn explicitly reports unnecessary overhead from code paths optimized for generation, sampling, or low-concurrency interactive workloads.
Live serving migrations and regional/capacity failures require operational controls; Slack reports live backend switching risk and uses gradual traffic shifts, instant rollback, spillover, and rerouting.
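The gradual traffic shift and instant rollback mentioned above can be illustrated with a small controller. This is a sketch under stated assumptions, not Slack's mechanism: the `TrafficShifter` class, the step schedule, and the error-rate threshold are all hypothetical.

```python
import random
from typing import Tuple


class TrafficShifter:
    """Moves traffic to a new backend in steps and rolls back on errors."""

    def __init__(
        self,
        steps: Tuple[float, ...] = (0.01, 0.1, 0.5, 1.0),  # hypothetical schedule
        max_error_rate: float = 0.05,                      # hypothetical threshold
    ) -> None:
        self.steps = steps
        self.max_error_rate = max_error_rate
        self.step_index = -1  # -1 means all traffic stays on the old backend

    @property
    def new_backend_share(self) -> float:
        return 0.0 if self.step_index < 0 else self.steps[self.step_index]

    def choose_backend(self, rng: random.Random) -> str:
        """Route one request according to the current traffic share."""
        return "new" if rng.random() < self.new_backend_share else "old"

    def advance(self, observed_error_rate: float) -> float:
        """Move to the next step, or roll back fully if errors are too high."""
        if observed_error_rate > self.max_error_rate:
            self.step_index = -1  # instant rollback to the old backend
        elif self.step_index < len(self.steps) - 1:
            self.step_index += 1
        return self.new_backend_share


shifter = TrafficShifter()
shares = [shifter.advance(0.0), shifter.advance(0.01), shifter.advance(0.2)]
```

Keeping rollback as a single state change (rather than stepping back down) is what makes it "instant" in this sketch; the old backend must stay warm for that to be safe.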
| Name | Kind | When to use | Maturity |
|---|---|---|---|
| vLLM on dedicated GPUs | library | open-weights serving where the team owns the runtime | established |
| SGLang | library | structured-generation-heavy workloads with radix-tree prefix reuse | emerging |
| Managed inference (Bedrock / Vertex) | service | compliance or ops constraints favor cloud-managed model hosting | established |
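
The prefix reuse noted in the table can be illustrated with a toy token trie. This is a minimal sketch, not SGLang's or vLLM's implementation: it uses a plain (uncompressed) trie rather than a true radix tree, stores no KV tensors, and has no eviction. The `PrefixCache` and `PrefixNode` names are hypothetical.

```python
from typing import Dict, List


class PrefixNode:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Tracks which token prefixes have been seen before."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def match_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest.

        In a real engine, the matched tokens are the ones whose KV entries
        could be reused instead of recomputed.
        """
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # First miss: every later token also misses, because the
                # newly created node has no children yet.
                child = PrefixNode()
                node.children[tok] = child
            else:
                matched += 1
            node = child
        return matched


cache = PrefixCache()
reused = [
    cache.match_and_insert([1, 2, 3, 4]),  # cold cache
    cache.match_and_insert([1, 2, 3, 9]),  # shares a 3-token prefix
    cache.match_and_insert([1, 2, 3, 9]),  # exact repeat
]
```

Workloads that share a long system prompt or few-shot preamble across requests benefit most, since the shared tokens are matched on every call after the first.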