TECHNIQUE
Serving & Inference
Throughput optimization is deployed as stack-level engineering: batching/caching and runtime tuning in self-managed inference, capacity abstraction in managed LLM serving, and GPU/model-system co-design at foundation-model scale.
Use GPU inference serving with dynamic or continuous batching to raise throughput while holding latency down.
2 of 4 operators: Criteo and LinkedIn.
Exploit prefix or KV reuse when many requests share input structure.
1 of 4 operators: LinkedIn.
Batch and preserve batches before GPU execution, including tokenizer-side batching and single-message transfer of tokenized batches to the scheduler.
1 of 4 operators: LinkedIn.
Tune low-level inference-server configuration and runtime choices after profiling, including handler threads, allocator choice, and execution provider selection.
1 of 4 operators: Criteo.
Decouple serving APIs from the model engine and use asynchronous request handling to improve parallelism.
1 of 4 operators: LinkedIn.
Allocate throughput by workload class using provisioned capacity for latency-sensitive traffic, on-demand capacity for bursty work, and spillover when reserved limits are exceeded.
1 of 4 operators: Slack.
Use fallback and rerouting when a model or regional endpoint is degraded or at limit.
1 of 4 operators: Slack.
Use model-system co-design, multi-dimensional parallelism, custom GPU kernels, operator fusion, and memory compression to saturate GPU throughput in very large model stacks.
1 of 4 operators: Meta.
Every observed operator treats throughput as a system-level design problem around model execution, not just as model selection.
Throughput work is tied to concrete scale or latency pressure: real-time bidding latency, high-concurrency LLM serving, enterprise LLM capacity, or large-scale ads model compute.
Operators optimize different layers of the stack.
APPROACH 01
Self-managed inference runtime and request-path optimization: dynamic/continuous batching, prefix caching, async handling, tokenizer batching, execution-provider and allocator tuning.
APPROACH 02
Managed capacity abstraction and routing: model units, provisioned throughput, on-demand endpoints, spillover, and fallback.
APPROACH 03
Large-scale GPU training/model-system co-design: multi-dimensional parallelism, custom kernels, operator fusion, FP8 activation quantization, and memory locality work.
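The prefix caching named under the first approach can be illustrated with a toy sketch. It is not any operator's implementation: `PrefixCache`, `prefill`, and the doubled-token "KV state" are hypothetical stand-ins, and the caller is assumed to know how many leading tokens are shared. The point is only that the shared prefix is computed once and later requests pay for their suffix alone.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy prefix cache: keeps a stand-in 'KV state' per token prefix (hypothetical)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], List[int]] = {}
        self.tokens_computed = 0  # counts simulated prefill work

    def _compute_kv(self, tokens: List[int]) -> List[int]:
        # Stand-in for the expensive prefill pass over `tokens`.
        self.tokens_computed += len(tokens)
        return [t * 2 for t in tokens]

    def prefill(self, tokens: List[int], shared_len: int) -> List[int]:
        """Return 'KV' for `tokens`, reusing the cached entry for the first `shared_len` tokens."""
        prefix = tuple(tokens[:shared_len])
        if prefix not in self._store:
            # First request with this prefix pays for it once.
            self._store[prefix] = self._compute_kv(list(prefix))
        # Only the non-shared suffix is computed per request. In a real
        # transformer the suffix pass would attend over the cached prefix KV.
        suffix_kv = self._compute_kv(tokens[shared_len:])
        return self._store[prefix] + suffix_kv
```

With a four-token shared prefix, two requests carrying one and two extra tokens cost seven simulated token computations instead of eleven, which is the saving prefix or KV reuse targets when many requests share input structure.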
Batching strategies are workload-specific rather than uniform.
APPROACH 01
GPU-side dynamic or continuous batching for inference serving.
APPROACH 02
Tokenizer-side dynamic batching and explicit batch-send to preserve batch structure before scheduling.
APPROACH 03
Capacity segmentation instead of request batching: provisioned throughput for interactive traffic and on-demand capacity for bursty scheduled workloads.
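The first batching strategy above can be sketched as a size-or-timeout collector. This is a minimal illustration, not the scheduler of any named serving stack: `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names, and the sketch assumes requests arrive on an in-process queue.

```python
import queue
import time
from typing import Any, List


def collect_batch(requests: "queue.Queue[Any]", max_batch_size: int, max_wait_s: float) -> List[Any]:
    """Block for one request, then gather more until the batch is full or the wait budget is spent."""
    batch = [requests.get()]  # always wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait budget exhausted: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch
```

The two knobs express the trade-off the operators describe: a larger `max_batch_size` raises GPU utilization under load, while a small `max_wait_s` bounds the latency added to the first request in each batch.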
Tight latency targets under high concurrency are a recurring constraint: Criteo cites up to a billion inferences per second with serving latency under 10 ms, LinkedIn cites P99 latency within a few hundred milliseconds and hundreds or thousands of scores per query, and Slack cites scaling latency as a prior tax.
Batching can be incomplete or lost across service boundaries; LinkedIn reported examining where batching was missing, and reported that tokenized requests transmitted individually lost their original batch structure before scheduling.
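A small sketch of the difference, under stated assumptions: `tokenize` is a hypothetical stand-in tokenizer, and a "message" is modeled as a plain list of tokenized requests handed to the scheduler. Neither function reflects LinkedIn's actual transport.

```python
from typing import List


def tokenize(text: str) -> List[int]:
    # Hypothetical stand-in tokenizer: one integer (word length) per word.
    return [len(word) for word in text.split()]


def send_individually(texts: List[str]) -> List[List[List[int]]]:
    # One message per request: the scheduler sees unrelated singletons
    # and has to rediscover the batch on its own.
    return [[tokenize(t)] for t in texts]


def send_as_batch(texts: List[str]) -> List[List[List[int]]]:
    # One message for the whole tokenized batch: the grouping formed
    # upstream survives the hop to the scheduler.
    return [[tokenize(t) for t in texts]]
```

For two input texts, `send_individually` yields two messages of one request each, while `send_as_batch` yields a single message holding both, so the scheduler receives the batch intact.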
General-purpose serving paths can add avoidable overhead for specialized workloads; LinkedIn reported that code paths optimized for text generation, sampling, or low-concurrency interactive workloads added unnecessary overhead for prefill-only ranking.
Capacity availability and regional health can drive throughput architecture; Slack reported GPU scarcity, regional-outage resilience requirements, and real-time rerouting when a model or region underperformed or hit limits.
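Spillover and rerouting can both be read as ordered-preference routing. The sketch below assumes a hypothetical `Endpoint` record with an in-flight limit and a health flag. It is not Slack's router or any provider's API, only an illustration of trying provisioned capacity first and falling through when an endpoint is at limit or unhealthy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    capacity: int        # hypothetical limit on in-flight requests
    in_flight: int = 0
    healthy: bool = True

    def can_accept(self) -> bool:
        return self.healthy and self.in_flight < self.capacity


def route(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Pick the first endpoint, in preference order, that is healthy and under its limit."""
    for ep in endpoints:
        if ep.can_accept():
            ep.in_flight += 1  # reserve a slot on the chosen endpoint
            return ep
    return None  # nothing available: caller must queue or reject
```

Listing a provisioned endpoint before an on-demand one gives spillover once the reserved limit is reached, and marking an endpoint unhealthy reroutes traffic to the next entry without changing the caller.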
| Name | Kind | Use when | Maturity |
|---|---|---|---|
| vLLM continuous batching | library | GPU utilization under concurrent load is the bottleneck | established |
| Provider batch APIs | service | offline jobs trade latency for half-price tokens | commodity |
| TensorRT-LLM | library | squeezing maximum throughput from NVIDIA hardware | established |