Output-length predictor
Headroom multiplier default 1.5 · per-workload override
Most LLM callers either omit max_tokens entirely (it then defaults to the provider ceiling, which is often huge and billed at full output rate even when the actual response is short) or set it conservatively high. M9 measures the workload's observed output-length distribution (past 14 days, sampled from optimize_savings rows that carry the provider's finish_reason) and injects a per-request max_tokens ceiling at the workload p90 × headroom multiplier (default 1.5). A tight ceiling means a lower output-token bill on the requests that would have padded, and no impact on the requests that actually need the headroom.
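As a worked example of the p90 × headroom arithmetic, here is a minimal sketch. The function name predicted_ceiling and the round-up behaviour are assumptions made for illustration; the text only specifies the product and the 1.5 default.

```python
import math

DEFAULT_HEADROOM = 1.5  # default multiplier stated in the text


def predicted_ceiling(p90_tokens: int, headroom: float = DEFAULT_HEADROOM) -> int:
    """Return workload p90 x headroom, rounded up to a whole token.

    Rounding up is an assumption; the source does not say how fractions are handled.
    """
    return math.ceil(p90_tokens * headroom)


print(predicted_ceiling(400))       # 600
print(predicted_ceiling(400, 2.0))  # 800
```

With a p90 of 400 tokens, the default multiplier gives a 600-token ceiling, so a caller that previously sent max_tokens of 5000 would be tightened to 600.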
M9 is parameter-mutating, not content-mutating. It changes the max_tokens parameter on the outgoing request, never touches the messages. Sits outside the M3 / M7 / M8 single-pick mutex. Composes freely with everything.
The quality gate
M9 refuses to inject a tighter ceiling unless three things hold:
- Workload opt-in is true (output_length_predictor_optin).
- The 14-day-fresh p90 estimate exists. If the cron hasn't populated predicted_output_tokens_p90 yet, or the value is older than 14 days, M9 short-circuits.
- The measured 7-day truncation rate (predicted_output_truncation_rate) is below 0.02, i.e. fewer than 2% of past requests hit the predicted ceiling. If we're truncating the model frequently at the predicted p90, we're mispredicting; M9 stays out of the way until the cron repopulates with fresh data.
Any failed gate → no-op + log reason. M9 also never raises a caller's explicit max_tokens: if the customer set 500, M9's 800 ceiling is irrelevant; if the customer set 5000, M9 may tighten to 800.
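The gate logic above can be sketched as a single decision function. This is an illustration under stated assumptions, not the production code: the WorkloadState type, the decide_ceiling name, the reason strings, the round-up rule and the choice to treat a missing truncation rate as a failed gate are all hypothetical. Only the three conditions, the 14-day and 0.02 thresholds, and the caller-ceiling rule come from the text.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

MAX_P90_AGE = timedelta(days=14)  # freshness window from the text
MAX_TRUNCATION_RATE = 0.02        # gate threshold from the text


@dataclass
class WorkloadState:
    """Hypothetical snapshot of the per-workload values the worker reads."""
    optin: bool
    p90: Optional[int]
    computed_at: Optional[datetime]
    truncation_rate: Optional[float]
    headroom: float = 1.5


def decide_ceiling(
    w: WorkloadState, caller_max_tokens: Optional[int], now: datetime
) -> Tuple[Optional[int], str]:
    """Return (ceiling, reason); ceiling is None when M9 should no-op."""
    if not w.optin:
        return None, "not_opted_in"
    if w.p90 is None or w.computed_at is None:
        return None, "p90_missing"
    if now - w.computed_at > MAX_P90_AGE:
        return None, "p90_stale"
    # Assumption: a missing truncation rate fails the gate.
    if w.truncation_rate is None or w.truncation_rate >= MAX_TRUNCATION_RATE:
        return None, "truncation_rate_too_high"
    ceiling = math.ceil(w.p90 * w.headroom)
    # Only ever tighten: a caller value at or below the ceiling wins.
    if caller_max_tokens is not None and caller_max_tokens <= ceiling:
        return None, "caller_ceiling_tighter"
    return ceiling, "applied"


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
state = WorkloadState(True, 533, now - timedelta(days=1), 0.01)
print(decide_ceiling(state, 5000, now))  # (800, 'applied')
print(decide_ceiling(state, 500, now))   # (None, 'caller_ceiling_tighter')
```

Returning the reason alongside the decision keeps the no-op path loggable, which matches the "no-op + log reason" behaviour described above.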
Headroom multiplier (per-workload, Session 6)
Default headroom: 1.5× p90 (OUTPUT_LENGTH_PREDICTOR_HEADROOM_MULTIPLIER). Per-workload override: workloads.output_length_headroom_multiplier (migration 0074, clamped [1.0, 3.0]). A tighter multiplier (closer to 1.0) extracts more savings but risks truncation; a looser one (closer to 3.0) is safer but less efficient.
Why per-workload: a chat-summarisation workload with stable short outputs benefits from 1.2; a code-generation workload with high variance benefits from 2.0. Sponsor sets the trade-off; the truncation-rate gate enforces the safety floor.
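A minimal sketch of how the override might be resolved at request time. The resolve_headroom helper is hypothetical, and re-applying the [1.0, 3.0] clamp in code is an assumption; the text only says the stored column is clamped.

```python
from typing import Optional

DEFAULT_HEADROOM = 1.5                   # default from the text
MIN_HEADROOM, MAX_HEADROOM = 1.0, 3.0    # clamp range from the text


def resolve_headroom(override: Optional[float]) -> float:
    """Use the per-workload override when set, else the default.

    The clamp is repeated defensively here (an assumption) so that an
    out-of-range value can never produce a ceiling below p90.
    """
    if override is None:
        return DEFAULT_HEADROOM
    return min(MAX_HEADROOM, max(MIN_HEADROOM, override))


print(resolve_headroom(None))  # 1.5
print(resolve_headroom(1.2))   # 1.2
print(resolve_headroom(5.0))   # 3.0
```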
Per-provider field mapping
OpenAI + xAI + every OpenAI-compatible provider use the top-level max_tokens field. Single slot, simple write. Anthropic also uses max_tokens at top level (required field; M9 reads the caller's value first).
Google Gemini uses generationConfig.maxOutputTokens (nested) or max_output_tokens (flat, depending on SDK version). The applier writes to the provider-correct slot while preserving the rest of generationConfig.
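The field mapping can be illustrated with a small applier. This is a sketch under assumptions: the apply_ceiling name and the provider label "gemini" are hypothetical, and choosing the flat Gemini slot only when the request already carries max_output_tokens is a guess at how the SDK variants might be told apart.

```python
import copy
from typing import Any, Dict


def apply_ceiling(provider: str, body: Dict[str, Any], ceiling: int) -> Dict[str, Any]:
    """Return a copy of the request body with the ceiling in the provider's slot."""
    out = copy.deepcopy(body)  # never mutate the caller's request in place
    if provider == "gemini":
        if "max_output_tokens" in out:
            # Flat variant: overwrite the existing top-level field.
            out["max_output_tokens"] = ceiling
        else:
            # Nested variant: keep every other generationConfig key intact.
            gen_config = out.setdefault("generationConfig", {})
            gen_config["maxOutputTokens"] = ceiling
    else:
        # OpenAI, xAI, OpenAI-compatible providers and Anthropic: top-level max_tokens.
        out["max_tokens"] = ceiling
    return out


request = {"generationConfig": {"temperature": 0.2}, "contents": []}
print(apply_ceiling("gemini", request, 800)["generationConfig"])
# {'temperature': 0.2, 'maxOutputTokens': 800}
```

Only the parameter slot changes; messages and other content fields pass through untouched, in line with M9 being parameter-mutating rather than content-mutating.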
The data flywheel
Every measured request writes finish_reason + tokens_out + max_tokens_requested to optimize_savings (migration 0070, columns added 2026-05-22). The M9 cron runs daily and:
- Computes p90 of tokens_out per workload over the past 14 days.
- Computes truncation rate per workload over the past 7 days as COUNT(finish_reason='length') / COUNT(*).
- Writes both back to workloads (predicted_output_tokens_p90 + predicted_output_truncation_rate + predicted_output_tokens_computed_at).
- Triggers KV refresh so the worker reads the fresh values on the next request.
Workloads with low traffic may have p90 estimates based on small sample sizes. That's why we require 14 days of freshness AND the 7-day truncation gate. Two checks limit the mis-prediction surface.
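The two cron measurements can be sketched in a few lines. This assumes a nearest-rank percentile, which the text does not specify, and operates on in-memory lists rather than optimize_savings rows; the function names are hypothetical.

```python
from typing import List, Optional


def p90_nearest_rank(tokens_out: List[int]) -> Optional[int]:
    """Nearest-rank 90th percentile; None when there are no samples."""
    if not tokens_out:
        return None
    ordered = sorted(tokens_out)
    # Integer form of ceil(0.9 * n), avoiding float rounding surprises.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]


def truncation_rate(finish_reasons: List[str]) -> Optional[float]:
    """Share of requests that ended with finish_reason == 'length'."""
    if not finish_reasons:
        return None
    truncated = sum(1 for reason in finish_reasons if reason == "length")
    return truncated / len(finish_reasons)


samples = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(p90_nearest_rank(samples))                              # 900
print(truncation_rate(["stop", "stop", "stop", "length"]))    # 0.25
```

With only a handful of samples, one unusually long response can become the p90 outright, which is the small-sample risk the freshness and truncation checks are meant to limit.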
What we don’t do (yet)
- No per-request prediction. v0.1 uses workload-wide p90. No per-prompt classifier. A future v0.2 could route long-form prompts to a higher ceiling and short-form prompts to a tighter one based on input embedding + historical pairs. Out of v0.1 scope.
- No dynamic adjustment within a session. M9 reads from KV at request time; KV updates only on cron tick or explicit refresh. A session with a sudden output-length shift sees the old prediction until the next cron.
- No auto-loosening on truncation alarms. The cron writes the new measurement and the gate decides; we do not auto-increase headroom multiplier in response to a truncation spike. Sponsor receives an anomaly and decides.
- No streaming-only adjustment. M9 fires the same on streaming and non-streaming. Streaming finish_reason arrives via the SSE parser (wired Session 5, 2026-05-24) so the cron has consistent input regardless of mode.
Verification surfaces
- Response header x-tessera-output-length-ceiling: N on every M9-applied request. Sponsors see exactly what ceiling was applied + can correlate with their own observability.
- Request log line output_length_predictor_applied with ceiling + caller_requested_max_tokens + reason (decision provenance).
- /portal/audit chip strip: m9 chip (non-mutating colour) + max_tokens_requested column showing the injected value.
What we promise
M9 fires only when the data supports it (14-day-fresh p90 AND 7-day truncation rate < 2%), never overrides a tighter caller-set ceiling, never increases a caller's ceiling (only ever tightens), surfaces the applied ceiling on every mutated request, and respects per-workload headroom-multiplier overrides. The data flywheel is closed-loop: every measured request feeds the next prediction.