TECHNIQUE

Latency engineering

Serving & Inference

8 APPLICATIONS

10 OBSERVED OPERATORS

State of Practice

CROSS-VALIDATED — 6 OPERATORS

Across the quoted deployments, latency engineering is handled with concrete serving-path controls: optimized inference engines, batching/async execution, capacity routing, caching/nearline access, gateway overhead reduction, and prompt/pipeline trimming.

Observed Practices

Use high-throughput, low-latency serving engines with batching or asynchronous execution in the inference path.

2 of 6 operators with quoted latency-engineering evidence.

Criteo, LinkedIn
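The batching and asynchronous execution described above can be sketched as a small micro-batcher. This is a minimal illustration, not any operator's implementation: `MicroBatcher`, `fake_model`, `max_batch_size`, and `max_wait_s` are hypothetical names, and the model call is a stand-in function.

```python
import asyncio


class MicroBatcher:
    """Hypothetical helper: groups concurrent requests into one batched call."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.005):
        self.batch_fn = batch_fn            # takes a list of inputs, returns a list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s        # longest time the first request waits for company
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        """Worker loop: take one request, then fill the batch until size or deadline."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            try:
                # A real server would run this off the event loop (GPU or executor).
                outputs = self.batch_fn([item for item, _ in batch])
                for (_, fut), out in zip(batch, outputs):
                    if not fut.done():
                        fut.set_result(out)
            except Exception as exc:
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)


async def demo():
    calls = []  # records the size of each batched call

    def fake_model(batch):  # stand-in for a real inference call
        calls.append(len(batch))
        return [x * 2 for x in batch]

    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    worker.cancel()
    return results, calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off sits in `max_wait_s`: a longer wait fills larger batches and raises throughput, but it adds directly to the latency of the first request in each batch.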

Tune GPU/runtime execution details after profiling, including TensorRT execution providers, gRPC handler threads, memory allocators, PagedAttention, FlashAttention, custom CUDA kernels, and mixed precision.

2 of 6 operators with quoted latency-engineering evidence.

Criteo, LinkedIn

Move or reuse expensive work outside the hottest request path through nearline embeddings, key-value stores, prefix caching, or narrower prompt/context selection.

3 of 6 operators with quoted latency-engineering evidence.

LinkedIn, Dropbox, Mendable.ai
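One way to picture moving expensive work off the hot path is a key-value cache that serves precomputed values and only recomputes on a miss or after expiry. The sketch below is an in-memory stand-in for a nearline store. `TTLCache`, `get_or_compute`, and the example key are hypothetical and not taken from any quoted operator.

```python
import time


class TTLCache:
    """Hypothetical in-memory stand-in for a nearline key-value store."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock      # injectable so the expiry logic can be tested
        self._store = {}        # key -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return a fresh cached value, or call compute() and store the result."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_s:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()       # the expensive call, e.g. an embedding model
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_s=60.0)
    embed = lambda: [0.1, 0.2, 0.3]  # placeholder for an expensive embedding call
    cache.get_or_compute("doc:42", embed)
    cache.get_or_compute("doc:42", embed)
    print(cache.hits, cache.misses)
```

The hit and miss counters are included because a cache only helps latency if its hit rate on the hot path is actually measured.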

Separate capacity by workload latency profile: provisioned/dedicated capacity for interactive latency-sensitive features, on-demand capacity for bursty or scheduled workloads, and spillover/rerouting when reserved capacity or endpoints degrade.

1 of 6 operators with quoted latency-engineering evidence.

Slack
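The capacity split above can be sketched as a small routing function. This is an assumption-laden illustration rather than Slack's implementation: `Pool`, `route`, the workload labels, and the concurrency-based capacity check are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical capacity pool tracked by concurrent in-flight requests."""
    name: str
    capacity: int
    in_flight: int = 0
    healthy: bool = True

    def has_room(self) -> bool:
        return self.healthy and self.in_flight < self.capacity


def route(workload: str, provisioned: Pool, on_demand: Pool) -> str:
    """Pick a pool by latency profile, spilling over when the first choice is unavailable.

    Interactive traffic prefers provisioned capacity and spills to on-demand.
    Bursty or scheduled traffic goes straight to on-demand.
    """
    order = [provisioned, on_demand] if workload == "interactive" else [on_demand]
    for pool in order:
        if pool.has_room():
            pool.in_flight += 1  # the caller is expected to decrement on completion
            return pool.name
    raise RuntimeError("no capacity available for workload: " + workload)


if __name__ == "__main__":
    prov = Pool("provisioned", capacity=1)
    od = Pool("on_demand", capacity=2)
    print(route("interactive", prov, od))  # first interactive request
    print(route("interactive", prov, od))  # provisioned pool is now full
    print(route("scheduled", prov, od))
```

Keeping scheduled work out of the provisioned pool entirely is a design choice in this sketch: it prevents background jobs from consuming the capacity reserved for interactive requests.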

Keep gateway or proxy layers thin when mediating access to external model providers, returning provider responses with little or no added processing time.

1 of 6 operators with quoted latency-engineering evidence.

Grab

Track latency at the application-pipeline level, not just at the model-server level, using retrieval latency budgets, traces, and run-level diagnostics to find slow calls.

2 of 6 operators with quoted latency-engineering evidence.

Dropbox, Mendable.ai
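Pipeline-level tracking with per-stage budgets can be sketched with a context manager that times each stage of a request. The class name `PipelineTrace`, the stage names, and the budget values below are hypothetical. Production systems would more likely emit these spans to a tracing backend than keep them in a list.

```python
import time
from contextlib import contextmanager


class PipelineTrace:
    """Hypothetical per-request trace: times each stage and checks it against a budget."""

    def __init__(self, budgets_ms, clock=time.perf_counter):
        self.budgets_ms = budgets_ms    # stage name -> allowed milliseconds
        self.clock = clock              # injectable so timings can be tested
        self.spans = []                 # list of (stage, elapsed_ms)

    @contextmanager
    def span(self, stage):
        start = self.clock()
        try:
            yield
        finally:
            # Record the span even if the stage raised.
            self.spans.append((stage, (self.clock() - start) * 1000.0))

    def over_budget(self):
        """Stages whose elapsed time exceeded their budget (unbudgeted stages pass)."""
        return [(s, ms) for s, ms in self.spans
                if ms > self.budgets_ms.get(s, float("inf"))]

    def slowest(self):
        """The single slowest recorded stage."""
        return max(self.spans, key=lambda span: span[1])


if __name__ == "__main__":
    trace = PipelineTrace({"retrieval": 100.0, "llm_call": 1000.0})
    with trace.span("retrieval"):
        time.sleep(0.01)                # placeholder for a retrieval call
    with trace.span("llm_call"):
        time.sleep(0.02)                # placeholder for a model call
    print(trace.spans, trace.over_budget())
```

Timing the retrieval and model stages separately is what makes it possible to tell whether a slow response comes from the model server or from the pipeline around it.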

Where Operators Converge

Every quoted operator treats serving latency or request-path overhead as a production constraint, but the control point differs by system: model server, gateway, capacity layer, retrieval pipeline, or agent prompt path.

Where Operators Diverge

Operators differ on the primary latency lever they expose and tune.

APPROACH 01

Model-serving/runtime lever: optimize the inference engine with batching, asynchronous handling, GPU execution, and low-level runtime tuning.

Criteo, LinkedIn

APPROACH 02

Capacity/routing lever: assign provisioned versus on-demand capacity by workload and reroute or fall back when endpoints or models degrade.

Slack

APPROACH 03

Application-pipeline lever: reduce retrieval, prompt, or agent overhead before the model call dominates the user-visible path.

Dropbox, Mendable.ai

APPROACH 04

Gateway/proxy lever: minimize added processing while brokering calls to multiple external providers.

Grab

Watch Items

Scale and capacity pressure remains a reported risk: Criteo cites up to a billion inferences per second under 10 ms, Slack reports scaling latency and GPU scarcity, and LinkedIn cites high computational costs and complex deployment pipelines for production LLMs.

Long prompts and token-heavy generation create latency pressure: Mendable.ai observed a massive concatenated prompt and a 7.23-second ChatOpenAI call, while LinkedIn reports nearly 1000 output tokens on average per candidate and hundreds to thousands of candidates per hirer workflow.

Runtime overhead can remain after adopting optimized engines: LinkedIn observed repeated kernel launches even with CUDA graphs enabled, Criteo still tuned Dynamo-Triton configuration, gRPC handler threads, and allocator choice, and Mendable.ai reported console-log-heavy debugging and high-latency runs before tracing.

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
Token streaming end to end	pattern	perceived latency matters; first token beats total time	commodity
vLLM	library	self-hosted serving with paged attention and speculative decoding	established
Provider prompt caching	service	long stable prefixes amortized across calls	commodity
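For the token-streaming entry, the distinction between first-token time and total time can be made measurable with a thin wrapper around any token iterator. This is a sketch under stated assumptions: `TimedStream` and `fake_stream` are hypothetical, and the wrapper forwards tokens unchanged rather than calling any real provider API.

```python
import time


class TimedStream:
    """Hypothetical pass-through wrapper that records time to first token and total time."""

    def __init__(self, tokens, clock=time.perf_counter):
        self._tokens = tokens
        self._clock = clock     # injectable so timings can be tested
        self.ttft_s = None      # time to first token, in seconds
        self.total_s = None     # time until the stream is exhausted
        self.count = 0

    def __iter__(self):
        start = self._clock()
        for tok in self._tokens:
            if self.ttft_s is None:
                self.ttft_s = self._clock() - start
            self.count += 1
            yield tok           # forward immediately; no buffering
        self.total_s = self._clock() - start


def fake_stream():
    """Stand-in for a streaming model response."""
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok


if __name__ == "__main__":
    stream = TimedStream(fake_stream())
    text = "".join(stream)
    print(text, stream.ttft_s, stream.total_s, stream.count)
```

Streaming end to end only improves perceived latency if every layer between the model and the user forwards tokens as they arrive, so a wrapper like this is most useful when placed at the outermost layer.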

Observed in Production

8 APPS

LLM Application Quality Assurance

Automated Quality Image Tagging and Cataloging

Change Request and CRM Account Linking Copilot

Code and Query Defect Validation and Repair

LLM Application Migration and Rollout Validation

LLM-Assisted Code Review, Test Migration, and Agent Evaluation

Permission-Aware Enterprise AI Search and PII Tagging

Pull Request Mock-Backed Development and LLM Test Generation