TECHNIQUE

Caching

Serving & Inference

4 APPLICATIONS

5 OBSERVED OPERATORS

State of Practice

GROUNDED

Caching is used by 6 of 7 operators, but at different layers: LLM KV/prefix state, metadata/configuration, intermediate pipeline states, precomputed embeddings, AI responses, and overlapping multi-agent context.

Observed Practices

Use caching or precomputation to reuse work the AI path has already done: fixed embeddings, overlapping context, conversion states, LLM KV/prefix state, metadata/configuration, and AI responses.

6 of 7 operators; no caching evidence was provided for Grab in this pool.

Canva, cubic, Dropbox, LinkedIn, Salesforce, Shopify

Cache LLM serving computation: LinkedIn reports KV caching, prefix caching, and in-batch prefix caching to reduce duplicate work and reuse query-prefix KV.

1 of 7 operators.

LinkedIn
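The idea behind prefix caching can be illustrated with a toy sketch. This is not LinkedIn's implementation: `PrefixKVCache`, `_compute_kv`, and the string "KV state" are hypothetical stand-ins, and production serving engines typically share fixed-size blocks of KV memory rather than copying lists. The point is only that a request sharing a prefix with an earlier one pays for its new suffix alone.

```python
from typing import Dict, List, Tuple


class PrefixKVCache:
    """Toy prefix cache: maps a token prefix to its (fake) KV state."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], List[str]] = {}
        self.computed_tokens = 0  # tokens that needed fresh computation

    def _compute_kv(self, token: int) -> str:
        # Hypothetical stand-in for the real per-token key/value computation.
        self.computed_tokens += 1
        return f"kv({token})"

    def prefill(self, tokens: List[int]) -> List[str]:
        # Find the longest prefix of this request that is already cached.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._store:
                best = n
                break
        kv = list(self._store.get(tuple(tokens[:best]), []))
        # Compute only the uncached suffix, caching each longer prefix.
        for i in range(best, len(tokens)):
            kv.append(self._compute_kv(tokens[i]))
            self._store[tuple(tokens[: i + 1])] = list(kv)
        return kv
```

With this sketch, prefilling `[1, 2, 3, 4]` and then `[1, 2, 3, 9]` computes five tokens in total instead of eight, because the second request reuses the state for `[1, 2, 3]`.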

Put multi-level caches around inference metadata/configuration: Salesforce uses a local cache in the AIMS Client / AI Gateway and an L2 service-level cache in AIMS.

1 of 7 operators.

Salesforce
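A two-level lookup of this kind might be sketched as follows. The source names only the cache locations (a local cache in the client/gateway and an L2 cache in the service); the TTL policy, the class names, and the `load` callback are assumptions made for illustration, and both levels live in one process here, whereas a real L2 would sit behind a network call.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # drop the stale entry
            return None
        return value

    def put(self, key: str, value: dict) -> None:
        self._data[key] = (self._clock() + self._ttl, value)


class TwoLevelMetadataCache:
    """Check the local (L1) cache, then the service-level (L2) cache, then the source."""

    def __init__(self, l1: TTLCache, l2: TTLCache,
                 load: Callable[[str], dict]) -> None:
        self._l1, self._l2, self._load = l1, l2, load

    def get(self, model_id: str) -> dict:
        value = self._l1.get(model_id)
        if value is not None:
            return value
        value = self._l2.get(model_id)
        if value is None:
            value = self._load(model_id)  # hypothetical source-of-truth lookup
            self._l2.put(model_id, value)
        self._l1.put(model_id, value)     # refill L1 on the way back
        return value
```

Giving L1 a shorter TTL than L2 is one common choice: the local copy stays fresh while the shared level absorbs most misses.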

Cache intermediate document-processing states so later LLM summarization and Q&A steps can reuse prior conversions.

1 of 7 operators.

Dropbox
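One way to picture this reuse is a cache keyed by pipeline step and content hash, so that summarization and Q&A over the same file share a single conversion. The sketch below is illustrative only: `ConversionCache`, `to_text`, and the truncation and string formatting that stand in for LLM calls are hypothetical, not Dropbox's pipeline.

```python
import hashlib
from typing import Callable, Dict, Tuple


class ConversionCache:
    """Caches the output of a pipeline step, keyed by step name and content hash."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}
        self.misses = 0

    def get_or_convert(self, step: str, raw: bytes,
                       convert: Callable[[bytes], str]) -> str:
        key = (step, hashlib.sha256(raw).hexdigest())
        if key not in self._store:
            self.misses += 1
            self._store[key] = convert(raw)
        return self._store[key]


def to_text(raw: bytes) -> str:
    # Placeholder for an expensive file-to-text conversion.
    return raw.decode("utf-8").strip()


def summarize(cache: ConversionCache, raw: bytes) -> str:
    text = cache.get_or_convert("to_text", raw, to_text)
    return text[:20]  # stand-in for an LLM summarization call


def answer(cache: ConversionCache, raw: bytes, question: str) -> str:
    text = cache.get_or_convert("to_text", raw, to_text)  # reuses the conversion
    return f"{question} -> {len(text)} chars of context"  # stand-in for LLM Q&A
```

Hashing the file bytes rather than the file name means an edited document misses the cache and is converted again.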

Precompute fixed embedding sets used at inference: Canva precomputes keyword embeddings for all languages for the top 1000 keywords and serves the 10 highest-probability keywords during inference.

1 of 7 operators.

Canva
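The offline/online split can be sketched as below. This is a minimal illustration, not Canva's system: `toy_embed` is a hash-based stand-in for a real embedding model, and ranking by dot product is an assumption about how "highest-probability" keywords are chosen.

```python
import hashlib
import math
from typing import Dict, List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 8) -> Vector:
    # Hypothetical stand-in for an embedding model: hash bytes -> unit vector.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [b - 127.5 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def precompute(
    keywords_by_lang: Dict[str, List[str]]
) -> Dict[str, List[Tuple[str, Vector]]]:
    # Offline step: embed every keyword once per language.
    return {
        lang: [(kw, toy_embed(kw)) for kw in kws]
        for lang, kws in keywords_by_lang.items()
    }


def top_keywords(
    table: Dict[str, List[Tuple[str, Vector]]],
    lang: str,
    query_vec: Vector,
    k: int = 10,
) -> List[str]:
    # Online step: only dot products against the stored vectors.
    scored = [
        (sum(a * b for a, b in zip(query_vec, vec)), kw)
        for kw, vec in table[lang]
    ]
    scored.sort(reverse=True)
    return [kw for _, kw in scored[:k]]
```

Because the keyword set is fixed, the embedding model runs over keywords only when the set changes; at request time it is needed for the query alone.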

Use response caching inside an AI workflow/orchestration layer.

1 of 7 operators.

Shopify
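An exact-match response cache wrapped around a model call might look like the sketch below. The source does not describe Shopify's design; `ResponseCache` and the `call_model` callback are hypothetical, and the key here covers only the model name and prompt, so any other parameter that changes the output would need to be added to it.

```python
import hashlib
import json
from typing import Callable, Dict


class ResponseCache:
    """Returns a stored response when the same model and prompt are seen again."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model  # hypothetical model client
        self._store: Dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = self._call_model(model, prompt)
        self._store[key] = response
        return response
```

This only pays off when identical requests recur, for example when a workflow step is re-run during development or retried after a downstream failure.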

Use caching to manage the token-consumption trade-off from overlapping context in multi-agent review workflows.

1 of 7 operators.

cubic
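The trade-off can be made concrete with a back-of-the-envelope cost model. Everything in this sketch is an assumption for illustration: `cached_rate` is a made-up fraction of full price charged for re-read context (actual pricing varies by provider), and placing shared context first reflects how prefix-based caches generally match, not a documented cubic design.

```python
from typing import List


def billed_tokens(shared: int, per_agent: List[int],
                  cached_rate: float = 1.0) -> float:
    """Input tokens billed when every agent receives the same shared context.

    cached_rate is the hypothetical fraction of full price charged for
    shared tokens re-read from a cache (1.0 means no caching at all).
    """
    if not per_agent:
        return 0.0
    first = shared  # the first agent pays full price to fill the cache
    repeats = (len(per_agent) - 1) * shared * cached_rate
    return first + repeats + sum(per_agent)


def build_prompt(shared_context: str, agent_instructions: str) -> str:
    # Shared context goes first and stays byte-identical across agents,
    # so a prefix-based cache can match it; agent-specific text goes last.
    return shared_context + "\n\n" + agent_instructions
```

For three agents sharing 10,000 tokens of context with 500, 700, and 300 tokens of their own, the model gives 31,500 billed tokens without caching and 16,500 at an assumed `cached_rate` of 0.25.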

Where Operators Converge

Across the six operators with caching evidence, the cached artifact is workload-specific rather than uniform: embeddings, overlapping context, conversion states, LLM KV/prefix state, metadata/configuration, or AI responses.

Where Operators Diverge

Operators differ in which layer they cache.

APPROACH 01

LLM serving-engine state: KV caching, prefix caching, and in-batch prefix KV reuse.

LinkedIn

APPROACH 02

Application metadata/configuration: local AIMS Client / AI Gateway cache plus service-level L2 cache in AIMS.

Salesforce

APPROACH 03

Document pipeline state: cached plugin conversions and intermediate pipeline states.

Dropbox

APPROACH 04

Fixed candidate embeddings: precomputed keyword embeddings for all languages and top keywords.

Canva

APPROACH 05

AI interaction responses inside a workflow framework.

Shopify

APPROACH 06

Overlapping multi-agent context in code review workflows, managed with caching strategies to control token consumption.

cubic

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
Provider prompt caching	service	repeated system prompts and contexts dominate token spend	commodity
Semantic response cache	pattern	near-duplicate queries can safely share answers	emerging
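A semantic response cache returns a stored answer when a new query is close enough to a previous one. The sketch below assumes a bag-of-words `toy_embed` and a cosine threshold of 0.9; both are placeholders, since a real deployment would use a learned embedding model and tune the threshold against the risk of serving a wrong answer.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    # Hypothetical stand-in for an embedding model: sparse word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Serves a cached answer when a query is similar enough to a stored one."""

    def __init__(self, embed: Callable[[str], Counter] = toy_embed,
                 threshold: float = 0.9) -> None:
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[Counter, str]] = []

    def store(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))

    def lookup(self, query: str) -> Optional[str]:
        qv = self._embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(qv, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None
```

The linear scan is fine for a sketch; at scale the stored vectors would normally sit in an approximate nearest-neighbor index.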

Observed in Production

4 APPS

LLM Application Quality Assurance

LLM-Assisted Code Review, Test Migration, and Agent Evaluation

AI-Assisted Product and Developer Collaboration Workflows

Enterprise Search Synthetic Evaluation Data Generation