TECHNIQUE
Serving & Inference
Operators use caching as a serving/inference efficiency and resilience tactic: they cache LLM KV/prefix work, metadata/configuration, pipeline conversions, responses, or overlapping agent context, depending on where repeated work appears.
Use LLM-serving KV or prefix caching to avoid duplicate computation across shared request text.
1 of 6 operators; LinkedIn reports KV caching, prefix caching, and in-batch prefix caching across multiple deployed serving stacks.
Cache tenant metadata/configuration used by inference routing and tenant-specific context, including a local client cache and service-level L2 cache.
1 of 6 operators; Salesforce reports multi-level caching for AI inference metadata/configuration.
Cache intermediate pipeline states/conversions so later file-summary and Q&A steps can reuse prior work.
1 of 6 operators; Dropbox reports caching each state of its file-conversion pipeline and reusing conversions between plugins.
Expose response caching as part of an AI workflow/provider abstraction layer.
1 of 6 operators; Shopify reports response caching in Raix, the abstraction layer used by Roast workflows.
Use caching to manage extra token consumption caused by overlapping context in multi-agent/code-review architectures.
1 of 6 operators; cubic reports overlapping context increased token consumption and was managed through caching strategies.
Instrument cache behavior with hit ratios, latency buckets, expiration patterns, stale-data thresholds, and alerts.
1 of 6 operators; explicit cache telemetry and alerting are only shown for Salesforce in this pool.
Make cache TTLs configurable when cached inference metadata/configuration has different freshness needs by use case.
1 of 6 operators; configurable TTLs are only shown for Salesforce in this pool.
Use caching in deployed or pilot AI systems for performance, reuse, or resilience at the layer where repeated work is observed.
5 of 6 operators; Grab’s cited AI Gateway teardown does not show caching for this technique.
Observed caching is tactical and layer-specific rather than a single standard pattern: each operator caches the repeatable artifact in its own path, such as KV/prefix state, tenant metadata, conversion states, responses, or overlapping agent context.
Operators cache different artifacts at different layers of the AI path.
APPROACH 01
Cache LLM serving-engine state: KV caching, prefix caching, and in-batch prefix KV reuse.
APPROACH 02
Cache inference metadata/configuration at client and service levels.
APPROACH 03
Cache file-conversion pipeline states and intermediate conversions between plugins.
APPROACH 04
Add response caching to the AI workflow/provider abstraction layer.
APPROACH 05
Use caching strategies to control token cost from overlapping context in specialized micro-agent workflows.
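As an illustration of the response-caching approach (Approach 04), the sketch below wraps a provider call and keys the cache on a hash of the model name and message list. It is a minimal sketch under stated assumptions, not Raix's implementation: `CachedProvider`, `call_provider`, and the key scheme are hypothetical names introduced here, and eviction and expiry are left out.

```python
import hashlib
import json
from typing import Callable, Dict, List

Message = Dict[str, str]


def request_key(model: str, messages: List[Message]) -> str:
    """Build a deterministic key from the model name and the full message list."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedProvider:
    """Hypothetical wrapper: returns the stored response for an identical request."""

    def __init__(self, call_provider: Callable[[str, List[Message]], str]) -> None:
        self._call_provider = call_provider  # assumed signature: (model, messages) -> text
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, messages: List[Message]) -> str:
        key = request_key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for one provider call, then remember the result.
        self.misses += 1
        response = self._call_provider(model, messages)
        self._store[key] = response
        return response
```

Exact-match keying only helps when requests repeat byte for byte after serialization; any change to the model name or to a single message produces a new key and a fresh provider call.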
Only one operator in the pool reports explicit cache governance mechanics.
APPROACH 01
Track L1/L2 hit ratios, latency buckets, expiration patterns, stale-data thresholds, and trigger alerts when cache usage shifts.
APPROACH 02
Report caching as an implementation capability or strategy without detailed cache telemetry/governance in the cited teardown.
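The governance mechanics in Approach 01 (per-level hit ratios, configurable TTLs, and an alert when usage shifts toward L2) can be sketched as follows. This is an illustrative sketch, not Salesforce's implementation: `TTLCache`, `TwoLevelCache`, `l2_alert_ratio`, and the 0.5 default are assumptions made for the example, and latency buckets and stale-data thresholds are omitted.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory TTL cache; stands in for either cache level."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds  # configurable per use case
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries are treated as misses
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (self._clock() + self.ttl_seconds, value)


class TwoLevelCache:
    """Hypothetical L1 (client) + L2 (service) lookup with simple counters."""

    def __init__(self, l1: TTLCache, l2: TTLCache,
                 load: Callable[[str], str], l2_alert_ratio: float = 0.5) -> None:
        self.l1, self.l2, self.load = l1, l2, load
        self.l2_alert_ratio = l2_alert_ratio  # assumed threshold, tune per system
        self.counts: Dict[str, int] = {"l1": 0, "l2": 0, "origin": 0}

    def get(self, key: str) -> str:
        value = self.l1.get(key)
        if value is not None:
            self.counts["l1"] += 1
            return value
        value = self.l2.get(key)
        if value is not None:
            self.counts["l2"] += 1
            self.l1.set(key, value)  # refill L1 from L2
            return value
        self.counts["origin"] += 1
        value = self.load(key)
        self.l2.set(key, value)
        self.l1.set(key, value)
        return value

    def ratios(self) -> Dict[str, float]:
        total = sum(self.counts.values())
        if total == 0:
            return {name: 0.0 for name in self.counts}
        return {name: count / total for name, count in self.counts.items()}

    def l2_shift_alert(self) -> bool:
        """True when the share of lookups served by L2 exceeds the threshold."""
        return self.ratios()["l2"] > self.l2_alert_ratio
```

Giving each level its own `ttl_seconds` is one simple way to express different freshness needs; in a real deployment the counters would feed a metrics system rather than live on the object.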
Repeated or duplicate work is the pressure that caching is used to control: LinkedIn cites duplicate LLM computation/prefix overhead, while cubic cites increased token consumption from overlapping context.
Cache freshness and usage shifts need operational attention when cached data affects inference behavior; Salesforce tracks expiration patterns, stale-data thresholds, and alerts on shifts to L2 usage.
Caching can become part of resilience planning for backend outages when inference depends on external metadata/configuration services; Salesforce reports serving safe cached responses during outages.
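The outage behavior described above can be sketched as a cache that refreshes from the backend when an entry is past its TTL but falls back to the last known value if the backend call fails. This is a sketch under assumptions: `StaleOnErrorCache`, `fetch`, and `max_stale_seconds` are hypothetical, and what counts as a "safe" cached response is operator-specific and not detailed in the source.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve fresh values normally; serve bounded-stale values if the backend fails."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 max_stale_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch  # assumed backend call, may raise on outage
        self.ttl_seconds = ttl_seconds
        self.max_stale_seconds = max_stale_seconds  # assumed upper bound on staleness
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, value)

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh, no backend call
        try:
            value = self._fetch(key)
        except Exception:
            # Backend unavailable: fall back to the old value if it is not too old.
            if entry is not None and now - entry[0] < self.max_stale_seconds:
                return entry[1]
            raise
        self._data[key] = (now, value)
        return value
```

Bounding staleness with a separate limit keeps the fallback from silently serving arbitrarily old configuration during a long outage.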
| Name | Kind | Use when | Maturity |
|---|---|---|---|
| Provider prompt caching | service | repeated system prompts and contexts dominate token spend | commodity |
| Semantic response cache | pattern | near-duplicate queries can safely share answers | emerging |
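To make the semantic response cache row concrete, the sketch below returns a stored answer when a new query is close enough to a previous one. Production systems typically compare embedding vectors; this example deliberately swaps in token-set Jaccard similarity so it runs without dependencies. `NearDuplicateCache` and the 0.8 threshold are assumptions for illustration only.

```python
from typing import List, Optional, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a stand-in for a real embedding."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class NearDuplicateCache:
    """Hypothetical cache that matches queries by similarity, not exact text."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed cutoff; too low risks wrong answers
        self._entries: List[Tuple[Set[str], str]] = []

    def put(self, query: str, answer: str) -> None:
        self._entries.append((tokens(query), answer))

    def get(self, query: str) -> Optional[str]:
        query_tokens = tokens(query)
        best_score = 0.0
        best_answer: Optional[str] = None
        # Linear scan for the most similar stored query.
        for stored_tokens, answer in self._entries:
            score = jaccard(query_tokens, stored_tokens)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

The threshold is the safety control: it decides which near-duplicate queries are allowed to share an answer, so it should be validated against real traffic before the cache serves users.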