TECHNIQUE

Throughput optimization

Serving & Inference

4 APPLICATIONS

3 OBSERVED OPERATORS

State of Practice

CROSS-VALIDATED — 4 OPERATORS

Throughput optimization is deployed as stack-level engineering: batching/caching and runtime tuning in self-managed inference, capacity abstraction in managed LLM serving, and GPU/model-system co-design at foundation-model scale.

Observed Practices

Use GPU inference serving with dynamic or continuous batching to raise throughput while holding latency down.

2 of 4 operators: Criteo and LinkedIn.
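
A minimal sketch of the dynamic-batching idea, using only the Python standard library. It is not taken from either operator's stack: `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names and values. Continuous batching, which admits and retires requests at each decode step inside the engine, is a different mechanism and is not shown here.

```python
import queue
import time


def collect_batch(requests, max_batch_size=8, max_wait_s=0.005):
    """Block for one request, then gather more until the batch is full
    or the wait budget is spent. Both limits are illustrative values."""
    batch = [requests.get()]                      # wait for the first request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                 # latency budget exhausted
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                                 # no more work arrived in time
    return batch


pending = queue.Queue()
for i in range(20):
    pending.put(f"req-{i}")

batch_sizes = []
while not pending.empty():
    batch_sizes.append(len(collect_batch(pending)))
```

The wait budget is the knob that trades latency for throughput: a longer wait fills larger batches, a shorter one bounds the delay added to the first request in each batch.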

Exploit prefix or KV reuse when many requests share input structure.

1 of 4 operators: LinkedIn.
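
A toy illustration of prefix reuse, assuming requests are token lists that share a common leading segment. `PrefixCache` is a hypothetical class, the cached values are string placeholders rather than real KV tensors, and production engines typically cache at block granularity rather than per token.

```python
class PrefixCache:
    """Toy cache keyed by token prefix; the stored value stands in for KV state."""

    def __init__(self):
        self._store = {}
        self.tokens_computed = 0

    def _longest_cached_prefix(self, tokens):
        # Search from the longest candidate prefix down to the shortest.
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._store:
                return end
        return 0

    def prefill(self, tokens):
        """Compute only the uncached suffix; return how many tokens were computed."""
        start = self._longest_cached_prefix(tokens)
        for end in range(start + 1, len(tokens) + 1):
            self._store[tuple(tokens[:end])] = f"kv[0:{end}]"  # placeholder value
            self.tokens_computed += 1
        return len(tokens) - start


cache = PrefixCache()
shared = [101, 7, 7, 42]                  # hypothetical shared prompt tokens
first = cache.prefill(shared + [5, 6])    # nothing cached yet: all 6 tokens computed
second = cache.prefill(shared + [9])      # shared prefix reused: 1 token computed
```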

Batch and preserve batches before GPU execution, including tokenizer-side batching and single-message transfer of tokenized batches to the scheduler.

1 of 4 operators: LinkedIn.
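
The difference between per-request and single-message transfer can be sketched with an in-process queue standing in for the tokenizer-to-scheduler channel. The whitespace tokenizer and all function names are assumptions for illustration, not LinkedIn's implementation.

```python
import queue


def tokenize(text):
    # Stand-in tokenizer: whitespace split.
    return text.split()


def send_individually(texts, channel):
    for text in texts:
        channel.put([tokenize(text)])            # one message per request

def send_as_batch(texts, channel):
    channel.put([tokenize(t) for t in texts])    # one message carries the whole batch


def drain(channel):
    messages = []
    while not channel.empty():
        messages.append(channel.get())
    return messages


texts = ["rank item a", "rank item b", "rank item c"]
per_request_channel, batch_channel = queue.Queue(), queue.Queue()
send_individually(texts, per_request_channel)
send_as_batch(texts, batch_channel)

individual = drain(per_request_channel)   # scheduler sees three size-1 messages
batched = drain(batch_channel)            # scheduler sees one size-3 message
```

In the per-request case the scheduler has to rebuild a batch it was never told about; in the single-message case the batch boundary survives the hop.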

Tune low-level inference-server configuration and runtime choices after profiling, including handler threads, allocator choice, and execution provider selection.

1 of 4 operators: Criteo.
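
One way to ground a handler-thread setting in measurement is a small sweep like the sketch below. The `handle` function simulates I/O-bound work with a sleep, and the thread counts and request count are arbitrary; allocator and execution-provider choices depend on the specific runtime and are not covered here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle(_request):
    time.sleep(0.002)          # stand-in for I/O-bound handler work
    return True


def measure_throughput(num_threads, num_requests=40):
    """Return requests per second for a given handler-thread count."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(handle, range(num_requests)))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed


results = {n: measure_throughput(n) for n in (1, 2, 4, 8)}
best_threads = max(results, key=results.get)
```

Measured numbers vary by machine and by workload, so the sweep should be rerun against the real handler before a value is fixed in configuration.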

Decouple serving APIs from the model engine and use asynchronous request handling to improve parallelism.

1 of 4 operators: LinkedIn.
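
A minimal `asyncio` sketch of the decoupling, assuming a queue sits between the API layer and the engine. `engine_worker` and `api_handler` are hypothetical names, and the sleep is a placeholder for model execution.

```python
import asyncio


async def engine_worker(inbox):
    """Stand-in model engine: pulls requests off a queue and resolves futures."""
    while True:
        payload, future = await inbox.get()
        await asyncio.sleep(0.001)               # placeholder for model execution
        future.set_result(f"score({payload})")
        inbox.task_done()


async def api_handler(inbox, payload):
    """API layer: enqueue, then await the result without blocking other handlers."""
    future = asyncio.get_running_loop().create_future()
    await inbox.put((payload, future))
    return await future


async def main():
    inbox = asyncio.Queue()
    worker = asyncio.create_task(engine_worker(inbox))
    results = await asyncio.gather(
        *(api_handler(inbox, f"q{i}") for i in range(5))
    )
    worker.cancel()                              # stop the engine loop on shutdown
    return results


results = asyncio.run(main())
```

Because handlers only touch the queue, the engine behind it can be swapped or scaled without changing the API surface.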

Allocate throughput by workload class using provisioned capacity for latency-sensitive traffic, on-demand capacity for bursty work, and spillover when reserved limits are exceeded.

1 of 4 operators: Slack.
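
The allocation rule can be sketched as a small router. This assumes a single concurrent-request cap on reserved capacity; `CapacityRouter`, the workload-class labels, and the pool names are hypothetical and do not correspond to any provider's API.

```python
from dataclasses import dataclass


@dataclass
class CapacityRouter:
    provisioned_limit: int          # hypothetical cap on concurrent reserved requests
    provisioned_in_use: int = 0

    def route(self, workload_class):
        """Return the pool for a request: 'provisioned' or 'on_demand'."""
        if workload_class == "latency_sensitive":
            if self.provisioned_in_use < self.provisioned_limit:
                self.provisioned_in_use += 1
                return "provisioned"
            return "on_demand"      # spillover once the reserved limit is hit
        return "on_demand"          # bursty work never takes reserved capacity

    def release(self):
        """Free one reserved slot when a provisioned request completes."""
        self.provisioned_in_use = max(0, self.provisioned_in_use - 1)


router = CapacityRouter(provisioned_limit=2)
routes = [router.route("latency_sensitive") for _ in range(3)]
routes.append(router.route("bursty"))
```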

Use fallback and rerouting when a model or regional endpoint is degraded or at limit.

1 of 4 operators: Slack.
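
A sketch of priority-ordered fallback, assuming each endpoint signals degradation or limit exhaustion by raising an exception. `EndpointUnavailable`, the region names, and both endpoint functions are invented for illustration.

```python
class EndpointUnavailable(Exception):
    """Raised by an endpoint that is degraded or has hit its limit."""


def call_with_fallback(endpoints, request):
    """Try endpoints in priority order; move on when one is unavailable."""
    errors = []
    for name, call in endpoints:
        try:
            return name, call(request)
        except EndpointUnavailable as exc:
            errors.append((name, str(exc)))
    raise RuntimeError(f"all endpoints failed: {errors}")


def primary(_request):
    raise EndpointUnavailable("rate limit reached")


def secondary(request):
    return f"ok:{request}"


served_by, response = call_with_fallback(
    [("region-a", primary), ("region-b", secondary)], "summarize"
)
```

A production router would usually also track endpoint health over time so that a degraded endpoint is skipped up front rather than retried on every request.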

Use model-system co-design, multi-dimensional parallelism, custom GPU kernels, operator fusion, and memory compression to saturate GPU throughput in very large model stacks.

1 of 4 operators: Meta.

Where Operators Converge

Every observed operator treats throughput as a system-level design problem around model execution, not just as model selection.

Throughput work is tied to concrete scale or latency pressure: real-time bidding latency, high-concurrency LLM serving, enterprise LLM capacity, or large-scale ads model compute.

Where Operators Diverge

Operators optimize different layers of the stack.

APPROACH 01

Self-managed inference runtime and request-path optimization: dynamic/continuous batching, prefix caching, async handling, tokenizer batching, execution-provider and allocator tuning.

Criteo, LinkedIn

APPROACH 02

Managed capacity abstraction and routing: model units, provisioned throughput, on-demand endpoints, spillover, and fallback.

Slack

APPROACH 03

Large-scale GPU training/model-system co-design: multi-dimensional parallelism, custom kernels, operator fusion, FP8 activation quantization, and memory locality work.

Meta

Watch Items

Tight latency targets under high concurrency are a recurring constraint: Criteo cites up to a billion inferences per second with serving latency under 10 ms, LinkedIn cites P99 latency within a few hundred milliseconds and hundreds or thousands of scores per query, and Slack cites scaling latency as a prior tax.

Batching can be incomplete or lost across service boundaries; LinkedIn reported examining where batching was missing and that individually transmitted tokenized requests caused the original batch structure to be lost before scheduling.

General-purpose serving paths can add avoidable overhead for specialized workloads; LinkedIn reported that code paths optimized for text generation, sampling, or low-concurrency interactive workloads added unnecessary overhead for prefill-only ranking.

Capacity availability and regional health can drive throughput architecture; Slack reported GPU scarcity, regional-outage resilience requirements, and real-time rerouting when a model or region underperformed or hit limits.

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
vLLM continuous batching	library	GPU utilization under concurrent load is the bottleneck	established
Provider batch APIs	service	offline jobs trade latency for half-price tokens	commodity
TensorRT-LLM	library	squeezing maximum throughput from NVIDIA hardware	established

Observed in Production

4 APPS

Agentic ML and Data Pipeline Workflow Orchestration

LLM Application Migration and Rollout Validation

LLM Application Quality Assurance

Multimodal User Interest Profiling for Display Ad Ranking