
TECHNIQUE

Throughput optimization

Serving & Inference

3 APPLICATIONS
2 OBSERVED OPERATORS
01

State of Practice

CROSS-VALIDATED — 4 OPERATORS

Throughput optimization is deployed as stack-level engineering: batching/caching and runtime tuning in self-managed inference, capacity abstraction in managed LLM serving, and GPU/model-system co-design at foundation-model scale.

Observed Practices

Use GPU inference serving with dynamic or continuous batching to raise throughput while holding latency down.

2 of 4 operators: Criteo and LinkedIn.
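To make the batching trade-off concrete, here is a minimal sketch of a dynamic batcher. It is not any operator's implementation: the class name `DynamicBatcher`, the parameters `max_batch_size` and `max_wait_s`, and the toy `run_batch` function are all assumptions chosen for illustration. A batch is dispatched when it is full or when the wait budget expires, whichever comes first, which is how throughput is raised while added latency stays bounded.

```python
import asyncio
from typing import Callable, List


class DynamicBatcher:
    """Groups concurrent requests into batches bounded by size and wait time."""

    def __init__(self, run_batch: Callable[[List[float]], List[float]],
                 max_batch_size: int = 8, max_wait_s: float = 0.005):
        self.run_batch = run_batch            # stand-in for one batched model call
        self.max_batch_size = max_batch_size  # hypothetical tuning knob
        self.max_wait_s = max_wait_s          # latency budget spent waiting for peers
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []      # recorded for inspection only

    async def infer(self, x: float) -> float:
        # Each caller enqueues its input and waits for its own result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def worker(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(items) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.run_batch([x for x, _ in items])
            self.batch_sizes.append(len(items))
            for (_, fut), y in zip(items, outputs):
                fut.set_result(y)


async def demo():
    batcher = DynamicBatcher(lambda xs: [2 * x for x in xs],
                             max_batch_size=4, max_wait_s=0.05)
    task = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.infer(float(i)) for i in range(6)))
    task.cancel()
    return results, batcher.batch_sizes
```

Calling `asyncio.run(demo())` returns the six doubled inputs together with the sizes of the batches that served them. Continuous batching in LLM engines goes further by admitting and retiring requests at each decoding step, which this sketch does not attempt.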

Exploit prefix or KV reuse when many requests share input structure.

1 of 4 operators: LinkedIn.
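The idea behind prefix reuse can be illustrated with a toy cache. This is a sketch under stated assumptions: `PrefixCache` and its methods are hypothetical, and the "state" is a simple sum standing in for the attention key/value tensors a real engine would keep. The point is only that the shared prefix is processed once, while each request pays for its own suffix.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Caches a per-prefix state so a shared prefix is processed only once."""

    def __init__(self) -> None:
        self._states: Dict[Tuple[int, ...], int] = {}
        self.prefix_computations = 0  # counts cache misses

    def _encode_prefix(self, prefix: Tuple[int, ...]) -> int:
        # Stand-in for the expensive prefill over the shared prefix.
        self.prefix_computations += 1
        return sum(prefix)

    def score(self, prefix: List[int], suffix: List[int]) -> int:
        key = tuple(prefix)
        if key not in self._states:
            self._states[key] = self._encode_prefix(key)
        # Only the request-specific suffix is processed per request.
        return self._states[key] + sum(suffix)


cache = PrefixCache()
shared_prefix = [1, 2, 3]            # e.g. tokens shared by many requests
candidates = [[10], [20], [30, 1]]   # per-request suffixes
scores = [cache.score(shared_prefix, c) for c in candidates]
```

After the three calls, `cache.prefix_computations` is 1: the prefix was encoded on the first request and reused for the other two.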

Batch and preserve batches before GPU execution, including tokenizer-side batching and single-message transfer of tokenized batches to the scheduler.

1 of 4 operators: LinkedIn.
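The difference between sending tokenized requests one by one and sending them as a single message can be sketched as follows. Everything here is illustrative: the word-length `tokenize` function, the in-process `queue.Queue` standing in for the transport, and the function names are assumptions, not LinkedIn's code.

```python
import queue
from typing import List


def tokenize(text: str) -> List[int]:
    # Hypothetical stand-in tokenizer: one integer per whitespace-separated word.
    return [len(word) for word in text.split()]


def send_individually(texts: List[str], channel: "queue.Queue") -> None:
    # Each tokenized request becomes its own message; the grouping is lost.
    for text in texts:
        channel.put([tokenize(text)])


def send_as_batch(texts: List[str], channel: "queue.Queue") -> None:
    # The whole tokenized batch travels as one message, so the scheduler sees it intact.
    channel.put([tokenize(text) for text in texts])


def drain(channel: "queue.Queue") -> List[int]:
    # Returns the number of requests in each message the scheduler would receive.
    sizes = []
    while not channel.empty():
        sizes.append(len(channel.get()))
    return sizes


texts = ["rank this job", "rank that job", "rank another job posting"]
split_channel, batch_channel = queue.Queue(), queue.Queue()
send_individually(texts, split_channel)
send_as_batch(texts, batch_channel)
split_sizes = drain(split_channel)   # one message per request
batch_sizes = drain(batch_channel)   # one message for the whole batch
```

With individual sends the scheduler receives three single-request messages and has to rebuild the batch itself; with the batch send it receives one message containing all three requests.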

Tune low-level inference-server configuration and runtime choices after profiling, including handler threads, allocator choice, and execution provider selection.

1 of 4 operators: Criteo.

Decouple serving APIs from the model engine and use asynchronous request handling to improve parallelism.

1 of 4 operators: LinkedIn.

Allocate throughput by workload class using provisioned capacity for latency-sensitive traffic, on-demand capacity for bursty work, and spillover when reserved limits are exceeded.

1 of 4 operators: Slack.
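A capacity policy of this shape could look roughly like the sketch below. The `CapacityRouter` class, the workload labels `"interactive"` and `"batch"`, and the abstract capacity units are hypothetical; the source does not describe Slack's routing logic at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class CapacityRouter:
    """Routes requests to provisioned or on-demand capacity by workload class."""
    provisioned_limit: int       # hypothetical units, e.g. tokens per window
    provisioned_used: int = 0

    def route(self, workload: str, cost: int) -> str:
        if workload == "batch":
            return "on_demand"            # bursty work never consumes the reservation
        if self.provisioned_used + cost <= self.provisioned_limit:
            self.provisioned_used += cost
            return "provisioned"          # latency-sensitive traffic within the reservation
        return "on_demand_spillover"      # reserved limit would be exceeded

    def reset_window(self) -> None:
        # Called at the start of each accounting window.
        self.provisioned_used = 0


router = CapacityRouter(provisioned_limit=100)
decisions = [
    router.route("interactive", 60),
    router.route("batch", 500),
    router.route("interactive", 30),
    router.route("interactive", 30),
]
```

The first and third interactive requests fit within the reservation, the batch request goes straight to on-demand capacity, and the last interactive request spills over because it would push usage past the limit.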

Use fallback and rerouting when a model or regional endpoint is degraded or at limit.

1 of 4 operators: Slack.
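As a rough illustration of fallback, the sketch below tries endpoints in priority order and moves on when one signals that it is degraded or at its limit. The exception type, the endpoint names, and the client callables are all hypothetical placeholders rather than a real provider API.

```python
from typing import Callable, Dict, List


class EndpointUnavailable(Exception):
    """Raised by a hypothetical endpoint that is degraded or at its limit."""


def call_with_fallback(prompt: str,
                       endpoints: List[str],
                       clients: Dict[str, Callable[[str], str]]) -> str:
    errors = []
    for name in endpoints:  # endpoints are tried in priority order
        try:
            return clients[name](prompt)
        except EndpointUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))


def degraded(prompt: str) -> str:
    # Simulates a throttled or unhealthy endpoint.
    raise EndpointUnavailable("throttled")


def healthy(prompt: str) -> str:
    return f"ok:{prompt}"


clients = {"primary-region": degraded, "secondary-region": healthy}
answer = call_with_fallback("hello", ["primary-region", "secondary-region"], clients)
```

A production router would typically also track endpoint health over time so that a known-bad endpoint is skipped without first paying for a failed call; this sketch only reacts per request.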

Use model-system co-design, multi-dimensional parallelism, custom GPU kernels, operator fusion, and memory compression to saturate GPU throughput in very large model stacks.

1 of 4 operators: Meta.

Where Operators Converge

Every observed operator treats throughput as a system-level design problem around model execution, not just as model selection.

Throughput work is tied to concrete scale or latency pressure: real-time bidding latency, high-concurrency LLM serving, enterprise LLM capacity, or large-scale ads model compute.

Where Operators Diverge

Operators optimize different layers of the stack.

APPROACH 01

Self-managed inference runtime and request-path optimization: dynamic/continuous batching, prefix caching, async handling, tokenizer batching, execution-provider and allocator tuning.

Criteo, LinkedIn

APPROACH 02

Managed capacity abstraction and routing: model units, provisioned throughput, on-demand endpoints, spillover, and fallback.

Slack

APPROACH 03

Large-scale GPU training/model-system co-design: multi-dimensional parallelism, custom kernels, operator fusion, FP8 activation quantization, and memory locality work.

Meta

Batching strategies are workload-specific rather than uniform.

APPROACH 01

GPU-side dynamic or continuous batching for inference serving.

Criteo, LinkedIn

APPROACH 02

Tokenizer-side dynamic batching and explicit batch-send to preserve batch structure before scheduling.

LinkedIn

APPROACH 03

Capacity segmentation instead of request batching: provisioned throughput for interactive traffic and on-demand capacity for bursty scheduled workloads.

Slack

Watch Items

Tight latency targets under high concurrency are a recurring constraint: Criteo cites up to a billion inferences per second with serving latency under 10 ms, LinkedIn cites P99 latency within a few hundred milliseconds and hundreds or thousands of scores per query, and Slack cites scaling latency as a prior tax.

Batching can be incomplete or lost across service boundaries; LinkedIn reported examining where batching was missing and that individually transmitted tokenized requests caused the original batch structure to be lost before scheduling.

General-purpose serving paths can add avoidable overhead for specialized workloads; LinkedIn reported that code paths optimized for text generation, sampling, or low-concurrency interactive workloads added unnecessary overhead for prefill-only ranking.

Capacity availability and regional health can drive throughput architecture; Slack reported GPU scarcity, regional-outage resilience requirements, and real-time rerouting when a model or region underperformed or hit limits.

02

Implementation Menu

CURATED DEFAULTS
Name                        Kind      Maturity
vLLM continuous batching    library   established
Provider batch APIs         service   commodity
TensorRT-LLM                library   established

03

Observed in Production

3 APPS