TECHNIQUE

Caching

Serving & Inference

3 APPLICATIONS
4 OBSERVED OPERATORS
01

State of Practice

GROUNDED

Operators use caching in serving and inference as an efficiency and resilience tactic. What they cache depends on where repeated work appears: LLM KV/prefix state, metadata/configuration, pipeline conversions, responses, or overlapping agent context.

Observed Practices

Use LLM-serving KV or prefix caching to avoid duplicate computation across shared request text.

1 of 6 operators; LinkedIn reports KV caching, prefix caching, and in-batch prefix caching across multiple deployed serving stacks.
LinkedIn
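To make the prefix-reuse idea concrete, the following is a toy sketch, not LinkedIn's implementation or any serving engine's API. It assumes a hypothetical block size `BLOCK` and replaces real KV tensors with a rolling integer "state", so that only the bookkeeping (find the longest cached prefix, compute the rest) is shown.

```python
from typing import Dict, List, Tuple

BLOCK = 4  # hypothetical block size in tokens (assumption for this sketch)


class PrefixCache:
    """Toy prefix cache: maps a block-aligned token prefix to a stand-in state."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], int] = {}
        self.computed_tokens = 0  # tokens that went through the "expensive" path
        self.reused_tokens = 0    # tokens skipped thanks to a cached prefix

    def _compute(self, state: int, tokens: List[int]) -> int:
        # Stand-in for the forward pass that would build KV entries.
        self.computed_tokens += len(tokens)
        for t in tokens:
            state = (state * 31 + t) % 1_000_003
        return state

    def process(self, tokens: List[int]) -> int:
        # Look for the longest block-aligned prefix that is already cached.
        state, start = 0, 0
        full = len(tokens) - len(tokens) % BLOCK
        for end in range(full, 0, -BLOCK):
            key = tuple(tokens[:end])
            if key in self._store:
                state, start = self._store[key], end
                break
        self.reused_tokens += start

        # Compute the remainder block by block, caching each full-block prefix.
        pos = start
        while pos < len(tokens):
            nxt = min(pos + BLOCK, len(tokens))
            state = self._compute(state, tokens[pos:nxt])
            if nxt % BLOCK == 0:
                self._store[tuple(tokens[:nxt])] = state
            pos = nxt
        return state
```

Two requests that share an eight-token system prompt would pay for that prompt once; the second request only computes its own suffix.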

Cache tenant metadata/configuration used by inference routing and tenant-specific context, including a local client cache and service-level L2 cache.

1 of 6 operators; Salesforce reports multi-level caching for AI inference metadata/configuration.
Salesforce
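A minimal sketch of the two-level lookup described above, under stated assumptions: the class name `TwoLevelCache`, the `load` callback, and the use of a plain dictionary as the shared L2 store are all hypothetical and stand in for whatever client library and cache service an operator actually runs.

```python
from typing import Callable, Dict


class TwoLevelCache:
    """Per-client L1 dictionary in front of a shared L2 store (both hypothetical)."""

    def __init__(self, l2: Dict[str, dict], load: Callable[[str], dict]) -> None:
        self.l1: Dict[str, dict] = {}  # local, private to this client
        self.l2 = l2                   # shared, stand-in for a service-level cache
        self.load = load               # fetches from the backing metadata service

    def get(self, tenant_id: str) -> dict:
        # 1. Local client cache.
        if tenant_id in self.l1:
            return self.l1[tenant_id]
        # 2. Service-level cache, then 3. the backend as a last resort.
        if tenant_id in self.l2:
            value = self.l2[tenant_id]
        else:
            value = self.load(tenant_id)
            self.l2[tenant_id] = value
        # Populate L1 so the next lookup from this client stays local.
        self.l1[tenant_id] = value
        return value
```

With two clients sharing one L2 store, the second client's first lookup is served from L2 and never reaches the backend.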

Cache intermediate pipeline states/conversions so later file-summary and Q&A steps can reuse prior work.

1 of 6 operators; Dropbox reports caching each state of its file-conversion pipeline and reusing conversions between plugins.
Dropbox
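One way to picture conversion reuse is a cache keyed by pipeline stage and a hash of the source content. This is an illustrative sketch only: `ConversionCache`, `to_text`, `summarize`, and `answer` are hypothetical names, and the "conversion" is a trivial whitespace normalization rather than a real file converter.

```python
import hashlib
from typing import Callable, Dict, Tuple


class ConversionCache:
    """Caches the output of each pipeline stage, keyed by (stage, content hash)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}
        self.misses = 0

    def get_or_convert(self, stage: str, source: str,
                       convert: Callable[[str], str]) -> str:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        key = (stage, digest)
        if key not in self._store:
            self.misses += 1
            self._store[key] = convert(source)
        return self._store[key]


def to_text(raw: str) -> str:
    # Stand-in for an expensive file-to-text conversion.
    return " ".join(raw.split())


def summarize(cache: ConversionCache, raw: str) -> str:
    text = cache.get_or_convert("text", raw, to_text)
    return text[:20]  # placeholder for a model-generated summary


def answer(cache: ConversionCache, raw: str, question: str) -> bool:
    # Reuses the same cached conversion that the summary step produced.
    text = cache.get_or_convert("text", raw, to_text)
    return question.lower() in text.lower()
```

Running the summary step and then the Q&A step on the same file performs the conversion once.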

Expose response caching as part of an AI workflow/provider abstraction layer.

1 of 6 operators; Shopify reports response caching in Raix, the abstraction layer used by Roast workflows.
Shopify
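As a rough illustration of response caching inside an abstraction layer, the sketch below wraps an arbitrary completion callable. It is not Raix's API: `CachingProvider`, `cache_key`, and the `complete` signature are assumptions made for this example, and exact-match caching like this is mainly useful when requests are expected to be deterministic.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def cache_key(model: str, prompt: str, params: Dict[str, Any]) -> str:
    # Stable key: identical model, prompt, and parameters map to the same hash.
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachingProvider:
    """Wraps a completion callable (hypothetical signature) with a response cache."""

    def __init__(self, complete: Callable[..., str]) -> None:
        self._complete = complete
        self._responses: Dict[str, str] = {}

    def complete(self, model: str, prompt: str, **params: Any) -> str:
        key = cache_key(model, prompt, params)
        if key not in self._responses:
            # Cache miss: call the underlying provider and remember the result.
            self._responses[key] = self._complete(model, prompt, **params)
        return self._responses[key]
```

Because parameters are part of the key, the same prompt sent with a different temperature is treated as a separate request.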

Use caching to manage extra token consumption caused by overlapping context in multi-agent/code-review architectures.

1 of 6 operators; cubic reports overlapping context increased token consumption and was managed through caching strategies.
cubic

Instrument cache behavior with hit ratios, latency buckets, expiration patterns, stale-data thresholds, and alerts.

1 of 6 operators; explicit cache telemetry and alerting are only shown for Salesforce in this pool.
Salesforce
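The telemetry above can be sketched as a small counter object. This is an assumption-laden illustration, not Salesforce's tooling: `CacheStats`, the outcome labels, and the `l2_share_alert` threshold of 0.5 are all hypothetical, and latency buckets and expiration tracking are left out for brevity.

```python
from collections import Counter

OUTCOMES = ("l1", "l2", "miss")


class CacheStats:
    """Counts lookup outcomes and flags a shift of hits from L1 toward L2."""

    def __init__(self, l2_share_alert: float = 0.5) -> None:
        self.counts: Counter = Counter()
        self.l2_share_alert = l2_share_alert  # hypothetical alert threshold

    def record(self, outcome: str) -> None:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.counts[outcome] += 1

    def hit_ratio(self) -> float:
        # Fraction of all lookups served from either cache level.
        total = sum(self.counts.values())
        hits = self.counts["l1"] + self.counts["l2"]
        return hits / total if total else 0.0

    def l2_share(self) -> float:
        # Fraction of cache hits that had to fall through to L2.
        hits = self.counts["l1"] + self.counts["l2"]
        return self.counts["l2"] / hits if hits else 0.0

    def should_alert(self) -> bool:
        return self.l2_share() > self.l2_share_alert
```

A rising L2 share with a steady overall hit ratio would suggest that local caches are expiring or being evicted more often than expected.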

Make cache TTLs configurable when cached inference metadata/configuration has different freshness needs by use case.

1 of 6 operators; configurable TTLs are only shown for Salesforce in this pool.
Salesforce
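Per-use-case TTLs can be sketched as a cache that looks up its expiry from a configuration map. The names here (`TTLCache`, the `ttls` mapping, the use-case labels) are hypothetical, and the clock is injectable only so the behavior can be exercised without waiting.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Cache whose entry lifetime depends on a configurable per-use-case TTL."""

    def __init__(self, ttls: Dict[str, float], default_ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttls = ttls            # e.g. {"routing": 5.0, "tenant_context": 300.0}
        self._default = default_ttl  # used when a use case has no explicit TTL
        self._clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def put(self, use_case: str, key: str, value: Any) -> None:
        ttl = self._ttls.get(use_case, self._default)
        self._store[(use_case, key)] = (self._clock() + ttl, value)

    def get(self, use_case: str, key: str) -> Optional[Any]:
        entry = self._store.get((use_case, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller refetches fresh data.
            del self._store[(use_case, key)]
            return None
        return value
```

With a 5-second TTL for routing data and a 300-second TTL for tenant context, the same elapsed time expires one entry and keeps the other.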

Use caching in deployed or pilot AI systems for performance, reuse, or resilience at the layer where repeated work is observed.

5 of 6 operators; Grab’s cited AI Gateway teardown does not show caching for this technique.
cubic, Dropbox, LinkedIn, Salesforce, Shopify

Where Operators Converge

Observed caching is tactical and layer-specific rather than a single standard pattern: each operator caches the repeatable artifact in its own path, such as KV/prefix state, tenant metadata, conversion states, responses, or overlapping agent context.

Where Operators Diverge

Operators cache different artifacts at different layers of the AI path.

APPROACH 01

Cache LLM serving-engine state: KV caching, prefix caching, and in-batch prefix KV reuse.

LinkedIn

APPROACH 02

Cache inference metadata/configuration at client and service levels.

Salesforce

APPROACH 03

Cache file-conversion pipeline states and intermediate conversions between plugins.

Dropbox

APPROACH 04

Add response caching to the AI workflow/provider abstraction layer.

Shopify

APPROACH 05

Use caching strategies to control token cost from overlapping context in specialized micro-agent workflows.

cubic

Only one operator in the pool reports explicit cache governance mechanics.

APPROACH 01

Track L1/L2 hit ratios, latency buckets, expiration patterns, and stale-data thresholds, and trigger alerts when cache usage shifts.

Salesforce

APPROACH 02

Report caching as an implementation capability or strategy without detailed cache telemetry/governance in the cited teardown.

cubic, Dropbox, LinkedIn, Shopify

Watch Items

Repeated or duplicate work is the pressure that caching is used to control: LinkedIn cites duplicate LLM computation/prefix overhead, while cubic cites increased token consumption from overlapping context.

Cache freshness and usage shifts need operational attention when cached data affects inference behavior; Salesforce tracks expiration patterns, stale-data thresholds, and alerts on shifts to L2 usage.

Caching can become part of resilience planning for backend outages when inference depends on external metadata/configuration services; Salesforce reports serving safe cached responses during outages.
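The outage behavior can be illustrated with a stale-on-error cache: entries are refreshed after a TTL, but if the backend fails, the last known value is served as long as it is within a staleness bound. This is a sketch under assumptions, not Salesforce's design; `StaleOnErrorCache`, `ttl`, and `max_stale` are hypothetical names and thresholds.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serves fresh values normally and bounded-stale values when the backend fails."""

    def __init__(self, load: Callable[[str], Any], ttl: float, max_stale: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._ttl = ttl              # age below which an entry counts as fresh
        self._max_stale = max_stale  # oldest age still acceptable during an outage
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit
        try:
            value = self._load(key)
        except Exception:
            # Backend unavailable: fall back to the cached value if not too old.
            if entry is not None and now - entry[0] <= self._max_stale:
                return entry[1]
            raise
        self._store[key] = (now, value)
        return value
```

The `max_stale` bound is the design choice that matters here: it caps how long inference can run on outdated metadata before the failure is surfaced to the caller.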

02

Implementation Menu

CURATED DEFAULTS
Name                     Kind     Maturity
Provider prompt caching  service  commodity
Semantic response cache  pattern  emerging
03

Observed in Production

3 APPS