TECHNIQUE

Runtime & hardware

Serving & Inference

5 APPLICATIONS

5 OBSERVED OPERATORS

State of Practice

CROSS-VALIDATED — 6 OPERATORS

All 6 deployed or pilot operators engineer runtime and hardware explicitly around accelerated compute, workload-specific serving runtimes, and capacity controls, rather than treating them as a generic deployment detail.

Observed Practices

Use accelerated GPU or AI hardware where workload scale or latency makes compute a deployment constraint.

5 of 6 operators explicitly cited GPU, edge AI, or accelerated AI hardware in the runtime/hardware path.

Criteo, LinkedIn, Meta, NVIDIA, Vyom Electronics

Adopt specialized LLM serving engines instead of generic generation-serving paths for production GenAI workloads.

1 of 6 operators explicitly cited vLLM and SGLang as production LLM serving systems.

Use batching, asynchronous handling, or concurrency-aware serving controls to trade off throughput and latency.

2 of 6 operators explicitly described batching, async, or concurrency-tuning mechanisms in serving.

Criteo, LinkedIn
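
As a rough illustration of the throughput/latency trade-off behind this practice, the sketch below collects concurrent requests into a batch until either a size cap or a wait deadline is reached. It is not taken from either operator's system: `MicroBatcher`, `max_batch_size`, `max_wait_s`, and the toy doubling "model" are hypothetical names chosen for the example.

```python
import asyncio
from typing import Any, Callable, List


class MicroBatcher:
    """Hypothetical helper: groups concurrent requests into batches."""

    def __init__(self, run_batch: Callable[[List[Any]], List[Any]],
                 max_batch_size: int = 8, max_wait_s: float = 0.01):
        self.run_batch = run_batch            # processes a whole batch at once
        self.max_batch_size = max_batch_size  # larger cap favors throughput
        self.max_wait_s = max_wait_s          # shorter wait favors latency
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []      # recorded for inspection

    async def submit(self, item: Any) -> Any:
        """Enqueue one request and wait for its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def worker(self) -> None:
        """Drain the queue forever, one batch per iteration."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = self.run_batch([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)


async def demo():
    # The batcher is created inside the running loop on purpose.
    batcher = MicroBatcher(lambda xs: [2 * x for x in xs],
                           max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(5)))
    worker.cancel()
    return list(results), batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Raising `max_batch_size` or `max_wait_s` tends to improve accelerator utilization at the cost of tail latency; production inference servers expose comparable knobs, but their names and semantics differ from this sketch.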

Optimize inference runtimes with ONNX, Triton, TensorRT, ONNX Runtime CUDA/TensorRT execution providers, INT8 quantization, or allocator/thread tuning.

2 of 6 operators explicitly cited ONNX/Triton/TensorRT-style inference-runtime optimization.

Criteo, Vyom Electronics

Use memory- and compute-efficiency techniques such as KV/prefix caching, fused kernels, custom GPU kernels, FP8/INT8 quantization, and parallelism.

3 of 6 operators explicitly cited low-level memory or compute-efficiency techniques.

LinkedIn, Meta, Vyom Electronics
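
To make one of these techniques concrete, the sketch below shows symmetric per-tensor INT8 quantization in plain Python. It is a generic textbook scheme, not any operator's implementation; the function names are hypothetical, and real runtimes typically quantize per channel with calibrated scales and fused kernels.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid divide-by-zero
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.27]
    codes, scale = quantize_int8(weights)
    restored = dequantize_int8(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(codes, scale, worst)
```

Each value is stored in one byte instead of four (FP32) or two (FP16), and the rounding error per element is bounded by half the scale, which is the accuracy-for-memory trade these techniques make.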

Run vision inference at the factory edge so inspection feedback can be returned during production.

2 of 6 operators explicitly described edge/factory-floor inference for visual inspection.

NVIDIA, Vyom Electronics

Use managed cloud capacity abstractions for LLM serving, separating latency-sensitive provisioned capacity from bursty on-demand workloads.

1 of 6 operators explicitly described managed capacity units, provisioned throughput, on-demand endpoints, and spillover.

Slack
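
The split between provisioned and on-demand capacity can be pictured with a small routing sketch. This is an assumption-laden illustration, not Slack's implementation and not a cloud provider API: `CapacityRouter`, the fixed 60-second window, and the token budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CapacityRouter:
    """Hypothetical router: fixed provisioned budget with on-demand spillover."""

    provisioned_tokens_per_minute: int
    used_in_window: int = 0
    window_start_s: float = 0.0

    def route(self, tokens: int, now_s: float,
              latency_sensitive: bool = True) -> str:
        # Bursty, latency-tolerant traffic never consumes provisioned budget.
        if not latency_sensitive:
            return "on_demand"
        # Start a fresh accounting window once 60 seconds have elapsed.
        if now_s - self.window_start_s >= 60.0:
            self.window_start_s = now_s
            self.used_in_window = 0
        # Serve from provisioned capacity while the budget allows it.
        if self.used_in_window + tokens <= self.provisioned_tokens_per_minute:
            self.used_in_window += tokens
            return "provisioned"
        # Otherwise spill over rather than queue or reject the request.
        return "on_demand"


if __name__ == "__main__":
    router = CapacityRouter(provisioned_tokens_per_minute=1000)
    for tokens, t in [(600, 0.0), (600, 10.0), (300, 20.0), (600, 61.0)]:
        print(t, tokens, router.route(tokens, t))
```

The design choice to note is that spillover keeps latency-sensitive requests flowing when the reserved budget is exhausted, at the cost of less predictable latency for the overflow portion.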

Add routing, fallback, or rollout controls around serving infrastructure to maintain availability during capacity limits, degradation, or backend migration.

1 of 6 operators explicitly described model hierarchy fallback, real-time rerouting, feature-flagged traffic shifts, and instant rollback.

Slack
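
A minimal sketch of two of these controls, percentage-based traffic shifting and ordered fallback, is shown below. It assumes a hash-based bucketing scheme and plain callables as backends; `in_rollout`, `call_with_fallback`, and the backend names are hypothetical and do not describe Slack's actual tooling.

```python
import hashlib
from typing import Callable, Sequence, Tuple


def in_rollout(request_id: str, percent: int) -> bool:
    """Deterministically place a request in a 0-99 bucket and compare it
    with the rollout percentage. Setting percent to 0 acts as a rollback."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def call_with_fallback(
    prompt: str,
    backends: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each backend in priority order; return the first success."""
    errors = []
    for name, backend in backends:
        try:
            return name, backend(prompt)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


if __name__ == "__main__":
    def throttled(prompt: str) -> str:
        raise TimeoutError("capacity limit reached")

    chain = [("primary", throttled), ("secondary", lambda p: p.upper())]
    print(call_with_fallback("hello", chain))
    print(sum(in_rollout(f"req-{i}", 25) for i in range(1000)), "of 1000 shifted")
```

Because bucketing is a pure function of the request ID, raising the percentage only adds requests to the new path and never reshuffles those already on it, which keeps gradual shifts and rollbacks predictable.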

Where Operators Converge

Every observed operator made runtime or hardware choices part of the deployment design because the cited workloads impose scale, latency, throughput, uptime, line-speed, or compute-efficiency constraints.

Where Operators Diverge

Operators differ on serving substrate: some operate open-source/self-controlled serving engines, some use optimized inference servers, some use managed cloud model-serving capacity, and some run factory-edge inference.

APPROACH 01

Open-source or self-controlled LLM serving engines with custom changes for production workloads.

APPROACH 02

Dynamo-Triton with ONNX Runtime backend on GPU for high-performance structured-data inference.

Criteo

APPROACH 03

Managed cloud serving and capacity abstraction through SageMaker/Bedrock, Provisioned Throughput, On Demand, and Model Units.

Slack

APPROACH 04

Factory-edge vision inference on AI infrastructure or Jetson modules for immediate inspection feedback.

NVIDIA, Vyom Electronics

Operators differ on capacity abstraction: most cited direct GPU or edge AI hardware, while Slack explicitly moved the operational unit of capacity away from GPU instances toward model-throughput units.

APPROACH 01

Directly cited GPU, NVLink, thousands of GPUs, edge AI infrastructure, or Jetson hardware as part of the deployment.

Criteo, LinkedIn, Meta, NVIDIA, Vyom Electronics

APPROACH 02

Abstracted away GPU instances into Model Units and throughput measured in tokens per minute.

Slack

Operators differ on the execution path they optimize for: text-generation serving, scoring/ranking without generation, real-time bidding inference, ads foundation-model training, and edge visual inspection each drive different runtime choices.

APPROACH 01

High-throughput/low-latency LLM serving for text generation and embedding generation.

APPROACH 02

Prefill-only LLM ranking that scores many candidates rather than generating long responses.

APPROACH 03

Real-time bidding inference under massive scale and sub-10 ms serving latency constraints.

Criteo

APPROACH 04

Large-scale ads foundation-model training with thousands of GPUs, custom GPU kernels, parallelism, and model-system co-design.

Watch Items

Compute capacity and GPU availability remain recurring constraints: LinkedIn reported that advanced LLMs increased compute requirements, Slack reported scaling latency and hardware scarcity, and Meta described GPU-scale training as a core system challenge.

Latency targets are tight and workload-specific: Criteo cited up to a billion inferences per second under 10 ms, LinkedIn cited P99 within a few hundred milliseconds under load, Slack separated latency-sensitive features onto dedicated capacity, and NVIDIA and Vyom Electronics reported line-speed or near-real-time factory feedback requirements.

Generic or untuned serving paths can add avoidable overhead: LinkedIn reported generation-oriented paths, repeated kernel launches, lost batch structure, and missing prefill optimizations; Criteo found throughput depended on Triton configuration, gRPC handler threads, TensorRT provider choice, and allocator choice.

Production reliability work extends beyond model speed: Slack reported feature-flagged gradual traffic shifts, instant rollback, spillover, rerouting, and model fallback; NVIDIA and Vyom Electronics logged inspection results and used human/operator feedback loops around edge decisions.
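
To put the tightest latency figure above in perspective, Little's law (average requests in flight = arrival rate × time in system) gives a back-of-the-envelope concurrency bound. The sketch assumes the cited "up to a billion inferences per second" is a fleet-wide peak rate and that 10 ms is an upper bound on per-request latency; the source does not spell out either detail, and the function name is hypothetical.

```python
def in_flight_requests(throughput_per_s: float, latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return throughput_per_s * latency_s


if __name__ == "__main__":
    peak_rate = 1e9       # inferences per second (assumed fleet-wide peak)
    latency_bound = 0.010 # seconds (assumed per-request upper bound)
    bound = in_flight_requests(peak_rate, latency_bound)
    print(f"at most about {bound:,.0f} inferences in flight at peak")
```

Under those assumptions, roughly ten million inferences are in flight across the fleet at peak, which helps explain why per-request overheads such as handler threads and allocator behavior appear in the tuning reports.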

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
vLLM on dedicated GPUs	library	open-weights serving where the team owns the runtime	established
SGLang	library	structured-generation-heavy workloads with radix-tree prefix reuse	emerging
Managed inference (Bedrock / Vertex)	service	compliance or ops constraints favor cloud-managed model hosting	established

Observed in Production

5 APPS

Manufacturing CROSS-VALIDATED

Automated Quality Image Tagging and Cataloging

Enterprise Search Synthetic Evaluation Data Generation

LLM Application Migration and Rollout Validation

LLM Application Quality Assurance

Multimodal User Interest Profiling for Display Ad Ranking