TECHNIQUE

Schema-constrained generation

Tool Use & Structured Output

5 APPLICATIONS

8 OBSERVED OPERATORS

State of Practice

CROSS-VALIDATED — 10 OPERATORS

Schema-constrained generation is used as an application boundary: operators ask models for JSON, structured API/tool payloads, or fixed output structures, then add parsers, validation, retries, evals, or human review around those outputs.

Observed Practices

Generate machine-readable JSON or structured payloads instead of relying on prose-only model output.

Observed in 9 of 14 rostered operators.

Airbnb, AppFolio, Dropbox, Kalvium Labs, Mendable.ai, Pinterest, Shopify, Slack, Wix

Use structured output at tool, function, API, or action boundaries so downstream systems can execute or route the result.

Observed in 5 of 14 rostered operators.

AppFolio, Dropbox, Mendable.ai, Slack, Wix
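
A tool boundary of this kind can be sketched as follows. This is a minimal illustration, not any operator's implementation: the payload shape (`tool` plus `arguments`), the tool names, and the `dispatch` helper are all hypothetical.

```python
import json
from typing import Any, Callable

# Hypothetical tools; a real system would call internal APIs here.
def create_ticket(title: str, priority: str) -> dict[str, Any]:
    return {"status": "created", "title": title, "priority": priority}

def lookup_order(order_id: str) -> dict[str, Any]:
    return {"status": "found", "order_id": order_id}

# Registry: tool name -> (handler, exact set of argument names it accepts).
TOOLS: dict[str, tuple[Callable[..., dict[str, Any]], set[str]]] = {
    "create_ticket": (create_ticket, {"title", "priority"}),
    "lookup_order": (lookup_order, {"order_id"}),
}

def dispatch(raw_model_output: str) -> dict[str, Any]:
    """Parse a model-emitted tool call and route it to the matching handler."""
    # json.JSONDecodeError is a ValueError subclass, so callers can catch ValueError.
    call = json.loads(raw_model_output)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    handler, required = TOOLS[name]
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != required:
        raise ValueError(f"{name} expects exactly {sorted(required)}")
    return handler(**args)
```

Only payloads that name a registered tool and carry exactly the expected arguments are executed; everything else is rejected before it reaches a downstream system.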

Validate, parse, or reject structured outputs before downstream consumption.

Observed in 5 of 14 rostered operators.

Airbnb, Dropbox, Pinterest, Slack, Thumbtack
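
One way to picture the validate-or-reject step, using only the Python standard library. The `Judgment` shape, the allowed labels, and the `RejectedOutput` exception are assumptions made for this sketch and do not come from any cited operator.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"relevant", "irrelevant"}  # hypothetical label set

@dataclass(frozen=True)
class Judgment:
    label: str
    score: float
    reason: str

class RejectedOutput(ValueError):
    """Raised when model output must not reach downstream code."""

def parse_judgment(raw: str) -> Judgment:
    """Parse raw model text into a Judgment, or raise RejectedOutput."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RejectedOutput(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise RejectedOutput("expected a JSON object")
    if set(data) != {"label", "score", "reason"}:
        raise RejectedOutput(f"unexpected keys: {sorted(data)}")
    label, score, reason = data["label"], data["score"], data["reason"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise RejectedOutput(f"label not allowed: {label!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise RejectedOutput("score must be a number")
    if not 0 <= score <= 1:
        raise RejectedOutput("score must be within [0, 1]")
    if not isinstance(reason, str) or not reason:
        raise RejectedOutput("reason must be a non-empty string")
    return Judgment(label=label, score=float(score), reason=reason)
```

Downstream code then receives either a typed `Judgment` or an exception, never a half-valid dictionary.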

Break complex workflows into steps with defined output structures, then chain those outputs through application logic.

Observed in 4 of 14 rostered operators.

AppFolio, Shopify, Slack, Wix
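
A rough sketch of such a chain, assuming a hypothetical two-step ticket workflow. The model is passed in as a plain callable (`ModelFn`, an assumption of this example) so that the chaining logic stays in application code; the step names, fields, and queue values are invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical: prompt in, raw model text out

ALLOWED_QUEUES = {"billing", "technical"}  # hypothetical routing targets

@dataclass(frozen=True)
class Extraction:
    issue: str
    product: str

@dataclass(frozen=True)
class Routing:
    queue: str

def extract(model: ModelFn, ticket: str) -> Extraction:
    """Step 1: turn free text into a fixed two-field structure."""
    data = json.loads(model(f"Extract issue and product as JSON: {ticket}"))
    return Extraction(issue=str(data["issue"]), product=str(data["product"]))

def route(model: ModelFn, extraction: Extraction) -> Routing:
    """Step 2: sees only the typed fields from step 1, not the raw ticket."""
    prompt = (
        "Pick a queue as JSON for "
        f"issue={extraction.issue!r}, product={extraction.product!r}"
    )
    queue = json.loads(model(prompt))["queue"]
    if not isinstance(queue, str) or queue not in ALLOWED_QUEUES:
        raise ValueError(f"unknown queue: {queue!r}")
    return Routing(queue=queue)

def run_pipeline(model: ModelFn, ticket: str) -> Routing:
    # Application logic, not the model, decides how step outputs are chained.
    return route(model, extract(model, ticket))
```

Because each step returns a typed value, a failure can be attributed to a specific step rather than to one opaque end-to-end prompt.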

Use retry, optimization, evaluation, or review loops when structured outputs are invalid or semantically uncertain.

Observed in 6 of 14 rostered operators.

Airbnb, AppFolio, Dropbox, Kalvium Labs, Mendable.ai, Thumbtack

Where Operators Converge

Across the cited deployments, structured generation is treated as something application code must consume: operators produce JSON, structured payloads, tool/action schemas, or fixed output structures for downstream systems.

Operators do not present schema-constrained generation as sufficient by itself; the observed systems surround it with parsing, validation, retry, evaluation, tracing, or human review.

Where Operators Diverge

Operators differ on where they enforce structure.

APPROACH 01

Constrain the model through structured-output or tool/function-calling interfaces.

AppFolio, Dropbox, Shopify, Slack

APPROACH 02

Let the model emit structured data, then validate, parse, or retry in the application layer.

Airbnb, Dropbox, Pinterest, Thumbtack

APPROACH 03

Generate structured API or tool schemas from product/UI context and execute them against internal systems.

Mendable.ai, Wix

Operators use schemas for different target artifacts.

APPROACH 01

Scoring, judging, and rubric artifacts.

Dropbox, Kalvium Labs, Thumbtack

APPROACH 02

Tool, action, and internal API payloads.

AppFolio, Mendable.ai, Slack, Wix

APPROACH 03

Generated data artifacts used by application code.

Airbnb

APPROACH 04

User-facing follow-up questions bundled with summaries or answers.

Dropbox

Operators differ in the recovery mechanism when structured generation is imperfect.

APPROACH 01

Feed validation errors back to the model and retry.

Airbnb
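
The general shape of an error-feedback retry loop might look like the sketch below. It uses a generic validator rather than the GraphQL validation mentioned elsewhere on this page; `ModelFn`, the `query` field, the prompt wording, and the attempt limit are all assumptions of the example.

```python
import json
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical: prompt in, raw model text out

def validate(raw: str) -> dict:
    """Accept only a JSON object that has a string "query" field."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("query"), str):
        raise ValueError('expected an object with a string "query" field')
    return data

def generate_with_retry(model: ModelFn, task: str, max_attempts: int = 3) -> dict:
    """Call the model, feeding each validation error into the next attempt."""
    prompt = task
    last_error = None
    for _ in range(max_attempts):
        raw = model(prompt)
        try:
            return validate(raw)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
            # Show the model its own rejected output and the concrete error.
            prompt = (
                f"{task}\n\nYour previous output was rejected:\n{raw}\n"
                f"Error: {exc}\nReturn corrected JSON only."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

Capping the number of attempts keeps a persistently failing generation from looping forever and gives the caller a clear failure to handle.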

APPROACH 02

Optimize prompts or programs against measurable output-quality objectives.

Dropbox

APPROACH 03

Surface failing or low-confidence outputs to humans.

Kalvium Labs, Thumbtack
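
As a small illustration of confidence-based routing: the `Graded` record, its `confidence` field, and the 0.75 threshold are hypothetical, and a real system would need to calibrate that confidence against human judgments before trusting it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Graded:
    answer_id: str
    score: int
    confidence: float  # assumed to lie in [0, 1] and to be calibrated elsewhere

def route_for_review(
    items: list[Graded], threshold: float = 0.75
) -> tuple[list[Graded], list[Graded]]:
    """Split graded items into (auto-accepted, needs human review)."""
    auto: list[Graded] = []
    review: list[Graded] = []
    for item in items:
        (auto if item.confidence >= threshold else review).append(item)
    return auto, review
```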

APPROACH 04

Use tracing and eval datasets to debug and evaluate runs.

AppFolio, Mendable.ai

Watch Items

Invalid or nonconforming structured output remains a production failure mode: Dropbox notes broken JSON cannot be parsed, Airbnb adds GraphQL validation and self-healing retries, Pinterest uses partial JSON parsing, and Thumbtack includes schema validation.
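
To make the partial-parsing idea concrete, here is a best-effort sketch that repairs one common failure, output truncated mid-structure, by appending the missing closers. It is not Pinterest's implementation, and both function names are hypothetical. It deliberately gives up (returns `None`) on truncations it cannot repair, such as a dangling comma or a key without a value.

```python
import json
from typing import Any, Optional

def close_truncated_json(text: str) -> str:
    """Append the quote and bracket closers that a truncated JSON prefix lacks."""
    stack: list[str] = []  # closers still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    suffix = '"' if in_string else ""
    return text + suffix + "".join(reversed(stack))

def parse_partial(text: str) -> Optional[Any]:
    """Return parsed JSON, trying a closer-only repair; None if still invalid."""
    for candidate in (text, close_truncated_json(text)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None
```

A repaired value may contain a cut-off string, so it still needs the same schema and semantic checks as any other output.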

Reliability and observability are recurring concerns in structured agent workflows: Mendable.ai reports reliability and lack of observability as a major problem, Shopify says non-determinism hurts reliability, Slack uses a critic to mitigate hallucinations and variability, and AppFolio relies on tracing to pinpoint issues.

Cost and latency shape implementations: Dropbox moved away from an expensive judge at scale, Mendable.ai found a long ChatOpenAI call and a massive prompt, AppFolio parallelizes to keep latency down, and Slack assigns low-, medium-, and high-cost models to different agent roles.

Structured outputs still need semantic quality checks: Kalvium Labs calibrates against human graders and routes low-confidence answers to review, Thumbtack uses human review for AI-approved content, Dropbox uses human-rated relevance examples, and AppFolio blocks merges unless eval thresholds are met.
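
An eval-threshold gate of the kind described can be reduced to a few lines. The sketch below assumes exact-match scoring over structured outputs and a 0.9 threshold; both are placeholders, since the metrics and thresholds the operators actually use are not stated here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    expected: dict
    actual: dict

def pass_rate(cases: list[EvalCase]) -> float:
    """Fraction of cases whose structured output matches exactly."""
    if not cases:
        return 0.0  # an empty eval set should not pass the gate
    return sum(c.expected == c.actual for c in cases) / len(cases)

def gate(cases: list[EvalCase], threshold: float = 0.9) -> int:
    """Return a process exit code: 0 lets the merge proceed, 1 blocks it."""
    return 0 if pass_rate(cases) >= threshold else 1
```

In a CI job, the script would typically end with `sys.exit(gate(cases))` so that a failing eval run fails the pipeline.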

Implementation Menu

CURATED DEFAULTS

Name	Kind	When	Maturity
OpenAI structured outputs (json_schema strict)	service	managed models must emit output that conforms exactly to the schema, with no parse-repair step	commodity
Outlines	library	self-hosted models need grammar-constrained decoding	established
Instructor	library	outputs need Pydantic validation with automatic retry on validation failure	established
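
For the first row, a strict `json_schema` response format is a plain dictionary. The sketch below builds one for a hypothetical ticket-triage schema and checks two strict-mode requirements at the top level: `additionalProperties` must be false and every property must be listed in `required`. The schema contents and the `check_strict_ready` helper are illustrative; consult the current OpenAI API reference before relying on the exact shape.

```python
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "technical", "other"]},
        "summary": {"type": "string"},
    },
    "required": ["category", "summary"],
    "additionalProperties": False,
}

# Intended to be passed as the `response_format` argument of a chat completion request.
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket_triage",  # hypothetical schema name
        "strict": True,
        "schema": TICKET_SCHEMA,
    },
}

def check_strict_ready(schema: dict) -> list[str]:
    """Report top-level violations of the two strict-mode rules described above."""
    problems: list[str] = []
    if schema.get("additionalProperties") is not False:
        problems.append("additionalProperties must be false")
    missing = set(schema.get("properties", {})) - set(schema.get("required", []))
    if missing:
        problems.append(f"properties not listed in required: {sorted(missing)}")
    return problems
```

The helper inspects only the top-level object; nested objects would need the same check applied recursively.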

Observed in Production

5 APPS

LLM-Assisted Code Review, Test Migration, and Agent Evaluation

AI-Assisted Education Evaluation Review

LLM Application Quality Assurance

Code and Query Defect Validation and Repair

Pull Request Mock-Backed Development and LLM Test Generation