TECHNIQUE

Token-cost engineering

Serving & Inference

7 APPLICATIONS

6 OBSERVED OPERATORS

State of Practice

CROSS-VALIDATED — 8 OPERATORS

Token-cost engineering is practiced mainly as context-budget control: operators prune schemas, prompts, tools, and retrieval scope, eliminate duplicate compute, or track per-run token spend rather than simply sending more context.

Observed Practices

Prune or narrow input context before the LLM call.

3 of 8 operators with cited token-cost evidence in this pool.

Airbnb, Grab, Uber
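
A minimal sketch of input pruning, assuming a table schema held as a plain dictionary. The schema, the field names, and the four-characters-per-token estimate in approx_tokens are illustrative assumptions, not details from any cited operator; a real deployment would count tokens with the model's own tokenizer.

```python
import json

def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: assumes roughly four characters per token."""
    return max(1, len(text) // 4)

def prune_schema(schema: dict, needed_fields: set) -> dict:
    """Return a copy of the schema that keeps only the columns the task needs."""
    return {
        "table": schema["table"],
        "columns": [c for c in schema["columns"] if c["name"] in needed_fields],
    }

# Hypothetical schema, for illustration only.
schema = {
    "table": "bookings",
    "columns": [
        {"name": "booking_id", "type": "bigint"},
        {"name": "guest_id", "type": "bigint"},
        {"name": "check_in", "type": "date"},
        {"name": "internal_audit_blob", "type": "json"},
        {"name": "legacy_flags", "type": "varchar"},
    ],
}

pruned = prune_schema(schema, {"booking_id", "check_in"})
before = approx_tokens(json.dumps(schema))
after = approx_tokens(json.dumps(pruned))
print(f"approx prompt tokens for schema: {before} -> {after}")
```

The pruned dictionary, not the full schema, is what would be serialized into the prompt.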

Reduce tool/context surface area exposed to the main model.

3 of 8 operators with cited token-cost evidence in this pool.

Atlassian, Dropbox, LinkedIn
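
One way to shrink the tool surface is to rank tools against the task and expose only the top few. The sketch below assumes a hypothetical in-memory registry with hand-assigned tags and a simple tag-overlap score; the tool names, tags, and max_tools limit are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    tags: frozenset

# Hypothetical tool registry.
REGISTRY = [
    Tool("search_issues", "Search the issue tracker by text", frozenset({"issues", "search"})),
    Tool("create_issue", "Create a new issue", frozenset({"issues", "write"})),
    Tool("search_docs", "Search documents", frozenset({"docs", "search"})),
    Tool("send_email", "Send an email", frozenset({"email", "write"})),
    Tool("run_sql", "Run a read-only SQL query", frozenset({"data", "search"})),
]

def select_tools(task_tags: set, registry: list, max_tools: int = 2) -> list:
    """Score tools by tag overlap with the task and keep only the best matches."""
    scored = [(len(tool.tags & task_tags), tool) for tool in registry]
    relevant = [(score, tool) for score, tool in scored if score > 0]
    # Highest overlap first; ties broken by name so the result is deterministic.
    relevant.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [tool for _, tool in relevant[:max_tools]]

exposed = select_tools({"issues", "search"}, REGISTRY)
print([tool.name for tool in exposed])  # ['search_issues', 'create_issue']
```

Only the descriptions of the exposed tools would be sent to the main model; the rest of the registry stays out of the context.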

Break large LLM tasks into smaller prompts, agents, or subagents to conserve context and focus the model.

3 of 8 operators with cited token-cost evidence in this pool.

Atlassian, Dropbox, Uber
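
Splitting a large input into several smaller prompts can be sketched as greedy packing under a token budget. The example assumes a CSV-like table, a hypothetical budget of 30 tokens, and the same rough four-characters-per-token estimate; none of these values come from the cited deployments.

```python
def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: assumes roughly four characters per token."""
    return max(1, len(text) // 4)

def chunk_rows(header: str, rows: list, budget: int) -> list:
    """Greedily pack rows into chunks so header plus rows stays within the budget.

    A single row larger than the budget still gets its own chunk.
    """
    chunks, current = [], []
    used = approx_tokens(header)
    for row in rows:
        cost = approx_tokens(row)
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], approx_tokens(header)
        current.append(row)
        used += cost
    if current:
        chunks.append(current)
    return chunks

header = "id,name,notes"
rows = [f"{i},user{i},{'x' * 40}" for i in range(6)]  # hypothetical data
chunks = chunk_rows(header, rows, budget=30)

# Each chunk becomes its own prompt, with the header repeated for context.
prompts = [header + "\n" + "\n".join(chunk) for chunk in chunks]
print(len(prompts), "prompts")
```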

Use serving/runtime optimizations to cut repeated computation or resource cost.

2 of 8 operators with cited token-cost evidence in this pool.

LinkedIn, Meta AI

Track or monitor token usage, per-run cost, or cost-related quality signals after deployment.

2 of 8 operators with cited token-cost evidence in this pool.

Pinterest, Uber
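
Per-run accounting can be as simple as a ledger keyed by run ID. The sketch below is an assumption-laden illustration: the RunLedger class, the run IDs, and the prices per 1,000 tokens are hypothetical placeholders, and the token counts would normally come from the usage figures a model API returns.

```python
from dataclasses import dataclass, field

@dataclass
class RunLedger:
    # Hypothetical prices in USD per 1,000 tokens; substitute real rates.
    input_price_per_1k: float = 0.5
    output_price_per_1k: float = 1.5
    usage: dict = field(default_factory=dict)

    def record(self, run_id: str, input_tokens: int, output_tokens: int) -> None:
        """Add one LLM call's token counts to the running totals for a run."""
        totals = self.usage.setdefault(run_id, [0, 0])
        totals[0] += input_tokens
        totals[1] += output_tokens

    def cost(self, run_id: str) -> float:
        """Return the accumulated cost of a run, or 0.0 for an unknown run."""
        inp, out = self.usage.get(run_id, (0, 0))
        return inp / 1000 * self.input_price_per_1k + out / 1000 * self.output_price_per_1k

ledger = RunLedger()
ledger.record("run-1", input_tokens=2000, output_tokens=400)
ledger.record("run-1", input_tokens=1000, output_tokens=600)
print(f"run-1 cost: ${ledger.cost('run-1'):.2f}")
```

Emitting these totals to an existing metrics system would let per-run cost be charted next to quality signals.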

Choose model/context-window tradeoffs explicitly when token volume is part of the workload.

2 of 8 operators with cited token-cost evidence in this pool.

Airbnb, Uber

Where Operators Converge

Across the cited deployments, token-cost engineering is tied to production constraints around context size, duplicate compute, latency, or per-run spend; it is not treated as an offline-only prompt cleanup exercise.

Where Operators Diverge

Operators apply token-cost controls at different layers of the stack.

APPROACH 01

Input/context layer: strip schemas, reduce prompt word count, split large tables, prune columns, or narrow RAG search radius.

Airbnb, Grab, Uber

APPROACH 02

Orchestration layer: reduce exposed tools, route through subagents/meta-tools, or skip retrieval/tool calls when not needed.

Atlassian, Dropbox, LinkedIn

APPROACH 03

Serving/infrastructure layer: reduce repeated LLM compute or retrieval resource cost with KV caching, fused kernels, quantization, chunking, and asynchronous serving.

LinkedIn, Meta AI

APPROACH 04

Accounting/monitoring layer: track token usage, per-run cost, and model/metric variants over time.

Operators differ on whether to pay for larger context windows or aggressively shrink the prompt.

APPROACH 01

Use a large-context model because the workload benefits from a wide window, while still citing speed/quality tradeoffs.

Airbnb

APPROACH 02

Shrink prompts or schemas so the model has less irrelevant context to process.

Airbnb, Grab, Uber

Watch Items

Context bloat degrades agent behavior: operators reported that too many tools, retrieval options, or issue records cannot simply be stuffed into the LLM context.

Large inputs can distract the model from the actual task; Grab reported model capacity being consumed by unnecessary detail, and Airbnb strips unused schema fields before prompting.

Token-cost controls do not remove the need for validation: operators still use human checks, schema validation, rule validators, or monitoring to catch misclassifications, invalid outputs, hallucinations, and drift.
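
A schema-plus-rule validator for model output might look like the sketch below. It assumes the model is asked to return a JSON object with a label and a confidence; the label set, field names, and rules are hypothetical and stand in for whatever contract a real application defines.

```python
import json

ALLOWED_LABELS = {"pii", "non_pii", "unknown"}  # hypothetical label set

def validate_output(raw: str) -> tuple:
    """Check that a model response is a JSON object with the expected fields.

    Returns (is_valid, errors) so callers can log or route failures for human review.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, ["not valid JSON"]
    if not isinstance(data, dict):
        return False, ["top-level value is not an object"]

    errors = []
    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        errors.append(f"unexpected label: {label!r}")
    conf = data.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0 <= conf <= 1:
        errors.append("confidence must be a number in [0, 1]")
    return not errors, errors

print(validate_output('{"label": "pii", "confidence": 0.9}'))
print(validate_output('{"label": "maybe", "confidence": 2}'))
```

Checks like this catch malformed or out-of-range outputs; they do not detect a well-formed but wrong answer, which is where human checks and monitoring come in.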

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
Context pruning with summarization budgets	pattern	context grows unbounded; compaction keeps cost linear	established
Per-feature output budgeting	pattern	cost control by capping output tokens per product surface	commodity
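
Both curated defaults can be sketched in a few lines. This is an illustration under stated assumptions: summarize is a placeholder that truncates text where a real system would call a model, approx_tokens is a rough four-characters-per-token estimate, and the feature names and caps in OUTPUT_CAPS are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: assumes roughly four characters per token."""
    return max(1, len(text) // 4)

def summarize(messages: list, summary_budget: int) -> str:
    """Placeholder summarizer: truncates the joined text to about summary_budget tokens."""
    joined = " ".join(messages)
    return "[summary] " + joined[: summary_budget * 4]

def compact(history: list, budget: int, summary_budget: int, keep_recent: int = 2) -> list:
    """Replace older turns with one bounded summary once the history exceeds the budget."""
    total = sum(approx_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return list(history)
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older, summary_budget)] + recent

# Hypothetical per-feature caps on output tokens.
OUTPUT_CAPS = {"search_snippet": 64, "chat_reply": 512}
DEFAULT_CAP = 128

def output_cap(feature: str) -> int:
    """Look up the output-token cap for a product surface, with a default fallback."""
    return OUTPUT_CAPS.get(feature, DEFAULT_CAP)

history = [f"turn {i}: " + "detail " * 20 for i in range(6)]
compacted = compact(history, budget=100, summary_budget=20)
print(len(history), "->", len(compacted), "messages")
print("chat_reply cap:", output_cap("chat_reply"))
```

The value returned by output_cap would be passed as the maximum-output-tokens setting of whichever model API the feature calls.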

Observed in Production

7 APPS

LLM Application Quality Assurance

Automated Data and Interest Signal Classification

Code and Query Defect Validation and Repair

Enterprise Search Synthetic Evaluation Data Generation

Go Service Performance Optimization

LLM SQL and Knowledge Base Quality Evaluation

Pull Request Mock-Backed Development and LLM Test Generation