TECHNIQUE

Latency engineering

Serving & Inference

4 APPLICATIONS
6 OBSERVED OPERATORS
01

State of Practice

CROSS-VALIDATED — 9 OPERATORS

Latency engineering in this pool is implemented at multiple layers: serving engines and batching, GPU/runtime tuning, capacity routing, prompt/context reduction, streaming, and production latency monitoring.

Observed Practices

Use high-throughput/low-latency serving engines or capacity abstractions for latency-sensitive inference paths.

3 of 9 operators with latency-engineering evidence in the pool.
Criteo, LinkedIn, Slack

Batch or parallelize execution to reduce latency and improve throughput.

3 of 9 operators with latency-engineering evidence in the pool.
Criteo, LinkedIn, AppFolio
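
As a minimal sketch of the parallelization practice above, the following runs independent branches concurrently so that wall-clock latency approaches the slowest branch rather than the sum of all branches. The names `call_branch`, `run_parallel`, and `timed_parallel` are hypothetical, and `asyncio.sleep` stands in for a real model or retrieval call; this does not reproduce any operator's implementation.

```python
import asyncio
import time


async def call_branch(name: str, delay_s: float) -> str:
    # Hypothetical stand-in for a model, tool, or retrieval call.
    await asyncio.sleep(delay_s)
    return f"{name}:done"


async def run_parallel(branches: dict[str, float]) -> list[str]:
    # gather() schedules all branches at once and returns results
    # in the same order as the inputs.
    return await asyncio.gather(
        *(call_branch(name, delay) for name, delay in branches.items())
    )


def timed_parallel(branches: dict[str, float]) -> tuple[list[str], float]:
    start = time.perf_counter()
    results = asyncio.run(run_parallel(branches))
    return results, time.perf_counter() - start
```

With three branches of 0.1 s each, the elapsed time should be close to 0.1 s rather than 0.3 s, provided the branches are genuinely independent.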

Tune GPU/runtime details directly, including mixed precision, attention kernels, execution providers, memory allocators, and handler threads.

3 of 9 operators with latency-engineering evidence in the pool.
LinkedIn, Criteo, Pinterest

Reduce prompt, prefix, or retrieved-context work to protect latency budgets.

4 of 9 operators with latency-engineering evidence in the pool.
LinkedIn, Mendable.ai, Dropbox, Airbnb
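
One way to picture context reduction is a token budget applied to retrieved chunks before they reach the prompt. The sketch below is illustrative only: `trim_chunks` is a hypothetical helper, the relevance scores are assumed to come from an upstream retriever, and whitespace splitting is a crude stand-in for a real tokenizer.

```python
from typing import Callable


def trim_chunks(
    chunks: list[tuple[float, str]],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> tuple[list[str], int]:
    """Keep the highest-scoring chunks that fit within budget_tokens.

    chunks is a list of (relevance_score, text) pairs. The default
    count_tokens is an assumption; swap in a real tokenizer in practice.
    """
    kept: list[str] = []
    used = 0
    # Visit chunks from most to least relevant.
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip chunks that would exceed the budget
        kept.append(text)
        used += cost
    return kept, used
```

A smaller budget lowers prompt-processing work at the cost of dropping lower-ranked context, so the budget itself is a tuning parameter.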

Use streaming or asynchronous handling to improve responsiveness for GenAI workflows.

2 of 9 operators with latency-engineering evidence in the pool.
LinkedIn, Airbnb
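
Streaming mainly improves perceived responsiveness: the user sees the first token long before the full response completes. The sketch below measures time to first token against total time for a simulated stream. `fake_token_stream` and `consume_stream` are hypothetical names, and the per-token delay is an assumption standing in for real decode time.

```python
import time
from typing import Iterable, Iterator, Optional


def fake_token_stream(tokens: list[str], per_token_delay_s: float) -> Iterator[str]:
    # Hypothetical stand-in for a streaming model response.
    for tok in tokens:
        time.sleep(per_token_delay_s)
        yield tok


def consume_stream(stream: Iterable[str]) -> tuple[str, Optional[float], float]:
    """Return (full_text, time_to_first_token_s, total_time_s)."""
    start = time.perf_counter()
    first_token_s: Optional[float] = None
    parts: list[str] = []
    for tok in stream:
        if first_token_s is None:
            # Record latency when the first token arrives.
            first_token_s = time.perf_counter() - start
        parts.append(tok)  # a real client would render the token here
    total_s = time.perf_counter() - start
    return "".join(parts), first_token_s, total_s
```

Tracking time to first token separately from total time makes it possible to tell whether a change helped responsiveness, throughput, or both.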

Route, spill over, or fall back when primary capacity, model, or region is degraded or saturated.

1 of 9 operators with latency-engineering evidence in the pool.
Slack
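
A simple form of fallback can be sketched as a timeout on the primary path followed by a call to a secondary one. The function `call_with_fallback` and its parameters are hypothetical; a production router would also weigh capacity, region health, and cost, none of which are modeled here.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_fallback(
    primary: Callable[[], Awaitable[str]],
    fallback: Callable[[], Awaitable[str]],
    timeout_s: float,
) -> str:
    """Try the primary endpoint; fall back if it is slow or unreachable."""
    try:
        # wait_for cancels the primary call once timeout_s elapses.
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        # Assumed failure modes: saturation (timeout) or a dropped connection.
        return await fallback()
```

The timeout effectively sets the worst-case latency added before the fallback is tried, so it should be chosen relative to the feature's overall budget.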

Instrument production latency or publish explicit latency targets/budgets.

4 of 9 operators with latency-engineering evidence in the pool.
Criteo, Dropbox, AppFolio, Mendable.ai
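
Latency budgets are usually stated against a percentile rather than a mean, since tail requests dominate user-visible slowness. The following sketch computes a nearest-rank percentile over recorded samples and compares it with a budget. `percentile` and `check_budget` are hypothetical helpers, and the choice of p95 is an assumption rather than something the operators reported.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of samples, with p in (0, 100]."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least p% of samples at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check_budget(samples_ms: list[float], budget_ms: float, p: float = 95) -> dict:
    observed = percentile(samples_ms, p)
    return {
        "percentile": p,
        "observed_ms": observed,
        "budget_ms": budget_ms,
        "within_budget": observed <= budget_ms,
    }
```

In a real system the samples would come from request tracing or metrics rather than an in-memory list.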

Keep gateway/proxy overhead low when brokering access to external model providers.

1 of 9 operators with latency-engineering evidence in the pool.
Grab
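
Gateway overhead can be made visible by timing the upstream call separately from the whole request and reporting the difference. This is a rough sketch under stated assumptions: `proxy_call` is a hypothetical function, `upstream` is any callable standing in for a provider client, and the only gateway-side work shown is copying the request.

```python
import time
from typing import Callable


def proxy_call(upstream: Callable[[dict], dict], request: dict) -> tuple[dict, dict]:
    """Forward request to upstream and report gateway overhead in ms."""
    t_start = time.perf_counter()
    prepared = dict(request)  # placeholder for auth, logging, or header work
    t_up_start = time.perf_counter()
    response = upstream(prepared)
    t_up_end = time.perf_counter()
    total_ms = (time.perf_counter() - t_start) * 1000
    upstream_ms = (t_up_end - t_up_start) * 1000
    timings = {
        "total_ms": total_ms,
        "upstream_ms": upstream_ms,
        # Time spent in the gateway itself, outside the provider call.
        "overhead_ms": total_ms - upstream_ms,
    }
    return response, timings
```

Reporting `overhead_ms` as its own metric keeps gateway regressions from hiding behind provider latency.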

Where Operators Diverge

Operators optimize different layers of the stack.

APPROACH 01

Serving-engine and GPU-stack optimization: tune inference servers, batching, kernels, precision, and runtime execution.

Criteo, LinkedIn, Pinterest

APPROACH 02

Application/workflow optimization: parallelize branches, trim prompt or retrieval work, and stream progress or outputs.

AppFolio, Mendable.ai, Dropbox, Airbnb

APPROACH 03

Capacity and routing optimization: reserve/provision capacity for latency-sensitive features and reroute or spill over when limits are hit.

Slack

APPROACH 04

Gateway overhead minimization: proxy provider calls and return provider responses with little or no added processing time.

Grab

Latency targets differ sharply by workload.

APPROACH 01

Real-time advertising/recommendation workloads report millisecond-level constraints or benchmarked latency reductions.

Criteo, Pinterest

APPROACH 02

Interactive GenAI and RAG workloads describe user-facing responsiveness, retrieval latency budgets, or keeping latency low while adding suggestions/actions.

Dropbox, AppFolio, Slack, Mendable.ai

APPROACH 03

Nearline or platform-scale embedding/LLM serving emphasizes throughput, concurrency, and fast downstream access rather than a single per-request user latency number.

LinkedIn

Some operators abstract away hardware; others expose and tune hardware/runtime details.

APPROACH 01

Abstract hardware behind managed throughput units or provider endpoints.

Slack, Grab

APPROACH 02

Expose and tune GPUs, kernels, inference servers, and runtime backends directly.

Criteo, LinkedIn, Pinterest

Watch Items

Capacity, cost, and hardware availability remain first-order latency risks: operators reported high computational costs, GPU scarcity, scaling latency, and infrastructure-cost pressure.

Long prompts, broad context, and agent/tool pipelines can create high-latency runs; operators responded by trimming prompt chunks or measuring slow calls.

Live serving-stack changes can threaten availability; one operator called backend switching during live traffic risky and used gradual traffic shifts plus rollback.
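
The gradual-shift-plus-rollback idea can be sketched as a loop that raises the new backend's traffic share step by step and reverts to zero if an error threshold is crossed. This is an illustration, not the operator's procedure: `shift_traffic`, the step percentages, and the `error_rate_at` callback are all hypothetical.

```python
from typing import Callable


def shift_traffic(
    steps: list[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[int, str]:
    """Walk through traffic percentages; roll back to 0 on a bad step.

    error_rate_at(pct) is assumed to return the observed error rate
    after pct percent of traffic has been moved to the new backend.
    """
    current = 0
    for pct in steps:
        if error_rate_at(pct) > max_error_rate:
            # Rollback: send all traffic back to the old backend.
            return 0, f"rolled back at {pct}%"
        current = pct
    return current, "completed"
```

The same loop could gate on a latency percentile instead of an error rate, depending on which signal the rollout is meant to protect.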

Model, region, or endpoint degradation can force real-time rerouting or fallback in latency-sensitive features.

02

Implementation Menu

CURATED DEFAULTS
Name                        Kind     Maturity
Token streaming end to end  pattern  commodity
vLLM                        library  established
Provider prompt caching     service  commodity
03

Observed in Production

4 APPS