Mechanics · Quality contract

How it works

Updated 26 May 2026

Tessera is a substrate proxy. Your application points at our endpoint, we forward to OpenAI / Anthropic / Google / xAI on your behalf, and on the way we apply a small set of optimization mechanics measured against a jointly-ratified baseline. Pricing is a flat monthly subscription by token volume; you keep 100% of the measured savings. The mechanics fire conservatively, gated by your eval, with a public composition cap and a hard quality floor that auto-disables anything drifting beneath it.

Every claim on this page is enforced in code, observable on /portal/audit, and gated by a daily quality canary - not by our promise.

The ten mechanics

Each mechanic is opt-in per workload, observable per request, and bypasses when its eligibility heuristic says the savings won't cover the risk. False negatives (we miss a borderline win) are cheap. False positives (we mutate a request and the response degrades) are expensive because they damage trust. So we bias toward pass-through.

M1

Auto-route

non-mutating

When eval confirms a cheaper model produces equivalent output, we forward to the cheaper model.

Per-workload model pairing (gpt-4o → gpt-4o-mini, opus → sonnet → haiku, etc.). The daily canary samples 10% of traffic, double-fires each through the routed model AND the baseline, scores both via your promptfoo eval. Routing pauses for the workload if the canary mean-score drops below 0.95 for 3 consecutive days. M1 supports chained walks past the first hop with cumulative quality_preservation product gating (gated on per-workload signal).
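The chained-walk gate can be sketched as below. This is a minimal illustration, assuming each hop carries a measured quality_preservation score in [0, 1] and that the walk stops as soon as the cumulative product would fall below the 0.95 floor. The function name, the tuple layout, and the example scores are hypothetical, not Tessera's actual interface.

```python
from typing import List, Tuple

QUALITY_FLOOR = 0.95  # canary floor quoted on this page


def walk_route_chain(start_model: str,
                     hops: List[Tuple[str, float]],
                     floor: float = QUALITY_FLOOR) -> str:
    """Return the furthest model reachable without breaching the floor.

    `hops` is an ordered list of (cheaper_model, quality_preservation)
    pairs. Both the layout and the scores are illustrative assumptions.
    """
    chosen = start_model
    cumulative = 1.0
    for model, preservation in hops:
        candidate = cumulative * preservation
        if candidate < floor:
            break  # this hop would compound quality loss below the floor
        cumulative = candidate
        chosen = model
    return chosen
```

With hypothetical scores of 0.98 for opus → sonnet and 0.96 for sonnet → haiku, the product is about 0.94, so the walk stops at sonnet even though each hop clears 0.95 on its own.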

M2

Exact-match cache

non-mutating

Identical request → cached response served from KV. We never call the provider.

Cache key is sha256 of the canonical request body (model, messages, temperature, tools, response_format). Cache hits return upstream of any mutation site, so they never compose with a content-changing mechanic. Default TTL 7 days; cached responses purge automatically.
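A cache key of this shape could be derived as in the sketch below. The five key fields come from the description above. The canonicalization itself (sorted keys, compact separators, missing fields encoded as null) is an assumption made for illustration, and cache_key is a hypothetical name.

```python
import hashlib
import json

# Fields named in the description above; everything else is ignored.
KEY_FIELDS = ("model", "messages", "temperature", "tools", "response_format")


def cache_key(body: dict) -> str:
    """Return a hex sha256 over a canonical JSON form of the key fields."""
    canonical = {field: body.get(field) for field in KEY_FIELDS}
    encoded = json.dumps(
        canonical,
        sort_keys=True,          # key order in the request must not matter
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

Two requests that differ only in key order, or in fields outside the five listed, hash to the same key. Any change to a listed field produces a different key.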

M3

Compress

mutating

Heuristic whitespace + structural normalization. Preserves code, JSON, and tool messages verbatim. Per-role opt-in (system / user turns independent).

Today: deterministic whitespace + structural pass with code-fence + JSON-shape preservation. Per-role split shipped 2026-05-26: compress_system_enabled and compress_user_turns_enabled toggle independently, so workloads with stable, expensive system prompts can compress those without touching variable user content. Roadmap: server-side LLMLingua-2 template substitution per workload (ADR-0003). v0.1 yields 5–15% savings on long prompts.
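A whitespace pass with code-fence and JSON preservation might look like the sketch below. Only the two toggle names come from the text above. The specific normalization rules, the treatment of "JSON-shape preservation" as "leave content that parses as JSON untouched", and the helper names are assumptions.

```python
import json
import re

TICKS = "`" * 3  # built at runtime so the sketch holds no literal fence
FENCE = re.compile("(" + TICKS + ".*?" + TICKS + ")", re.DOTALL)


def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def compress_text(text: str) -> str:
    """Collapse whitespace outside fenced code; leave JSON bodies alone."""
    if _is_json(text):
        return text
    parts = FENCE.split(text)  # odd indexes hold the fenced blocks
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            part = re.sub(r"[ \t]+", " ", part)     # runs of spaces or tabs
            part = re.sub(r" ?\n ?", "\n", part)    # spaces hugging newlines
            part = re.sub(r"\n{3,}", "\n\n", part)  # at most one blank line
        out.append(part)
    return "".join(out)


def compress_messages(messages,
                      compress_system_enabled=False,
                      compress_user_turns_enabled=False):
    """Apply compress_text per role; tool and assistant turns pass through."""
    enabled = {"system": compress_system_enabled,
               "user": compress_user_turns_enabled}
    result = []
    for msg in messages:
        content = msg.get("content")
        if enabled.get(msg["role"], False) and isinstance(content, str):
            result.append({**msg, "content": compress_text(content)})
        else:
            result.append(msg)
    return result
```

Calling compress_messages(messages, compress_system_enabled=True) rewrites only system turns, which matches the stable-system-prompt case described above.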

M5

Semantic cache

non-mutating

Cosine-similar requests (≥ 0.95) served from cache. Catches paraphrased queries that exact-match misses.

Cloudflare Vectorize backs the embedding store. The cache canary measures the served-from-cache response against the live provider on 5% of hits. Drops back to provider on near-miss quality dips.
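The similarity gate can be illustrated with a small in-memory stand-in. The 0.95 threshold is the figure quoted above. The linear scan, the function names, and the entry layout are assumptions for illustration only; the vector store described above is not modeled here.

```python
import math
from typing import List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.95  # threshold quoted on this page


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query: Sequence[float],
                    entries: List[Tuple[Sequence[float], str]],
                    threshold: float = SIMILARITY_THRESHOLD) -> Optional[str]:
    """Return the best cached response, or None to fall through to the provider."""
    best_score, best_response = -1.0, None
    for embedding, response in entries:
        score = cosine(query, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

A None result means the request goes to the provider as usual, so a miss is never worse than pass-through.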

M6

Provider prompt cache

non-mutating

Anthropic, OpenAI, and Google ship native prompt caching. We auto-inject the cache markers on stable system prompts.

Anthropic gives 90% off cached prefix tokens, Google 75%, OpenAI 50%. Per-provider mutation only. Model output contract is identical to the un-marked request. Google cachedContents lifecycle is opt-in per workload because it requires explicit create/reference/refresh.
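For the Anthropic case, marker injection could be sketched as follows. The cache_control block with type "ephemeral" is Anthropic's documented prompt-caching marker. Which block gets marked, and how a system prompt is judged "stable", are assumptions here, and the function name is hypothetical.

```python
import copy


def inject_anthropic_cache_marker(body: dict) -> dict:
    """Return a copy of a Messages API body with the system prompt marked cacheable."""
    out = copy.deepcopy(body)  # never mutate the caller's request
    system = out.get("system")
    if not system:
        return out  # nothing to mark
    if isinstance(system, str):
        # A plain string must become a content block to carry the marker.
        system = [{"type": "text", "text": system}]
    # Marking the last system block caches the whole prefix up to it.
    system[-1]["cache_control"] = {"type": "ephemeral"}
    out["system"] = system
    return out
```

The messages array is left untouched, so only the cacheable prefix is annotated.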

M7

Context pruning

mutating

Long conversations: trim to last N turns + system. RAG: rerank chunks, keep top-K by relevance.

Triggered when messages > 12 OR body > 32KB. Conservative trim: keep system + last 8 turns. RAG re-rank via FlashRank (ADR-0002). On Anthropic, tool_use/tool_result pairs are preserved to maintain the pairing invariant. Per-workload sliding aggressiveness, eval-gated.
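The trigger and the conservative trim can be sketched as below. The thresholds are the ones quoted above. Treating one non-system message as one turn, measuring the body as serialized JSON bytes, and reading 32KB as 32 × 1024 bytes are assumptions. Tool-pair preservation and RAG re-ranking are left out of this sketch.

```python
import json

MAX_MESSAGES = 12
MAX_BODY_BYTES = 32 * 1024  # assumes KB means 1024 bytes
KEEP_LAST_TURNS = 8


def should_prune(body: dict) -> bool:
    """True when either trigger named above fires."""
    size = len(json.dumps(body).encode("utf-8"))
    return len(body.get("messages", [])) > MAX_MESSAGES or size > MAX_BODY_BYTES


def prune_messages(messages: list, keep_last: int = KEEP_LAST_TURNS) -> list:
    """Keep every system message plus the last `keep_last` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```

A conversation with one system message and fourteen other messages trims to nine messages: the system prompt plus the final eight.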

M8

Structured output

mutating

When a workload registers an expected JSON schema, we force the provider to obey it. Schema mode on OpenAI / tool-use on Anthropic / responseSchema on Gemini.

Non-conformant responses log to anomaly inbox at Tier 1. Workloads without a registered schema fall back to auto-mode JSON when the body opts in.
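The three enforcement paths could be applied to a request body roughly as follows. The field names follow each provider's public API (response_format with json_schema on OpenAI, a forced tool call on Anthropic, generationConfig.responseSchema on Gemini). The apply_schema function, its signature, and the tool description text are hypothetical.

```python
import copy


def apply_schema(provider: str, body: dict, name: str, schema: dict) -> dict:
    """Return a copy of `body` that asks the provider to emit `schema`."""
    out = copy.deepcopy(body)
    if provider == "openai":
        # Strict mode places extra requirements on the schema itself.
        out["response_format"] = {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema, "strict": True},
        }
    elif provider == "anthropic":
        # Structured output via a single forced tool call.
        out.setdefault("tools", []).append({
            "name": name,
            "description": "Return the result in the registered schema.",
            "input_schema": schema,
        })
        out["tool_choice"] = {"type": "tool", "name": name}
    elif provider == "gemini":
        config = out.setdefault("generationConfig", {})
        config["responseMimeType"] = "application/json"
        config["responseSchema"] = schema
    else:
        raise ValueError("unsupported provider: " + provider)
    return out
```

Working on a deep copy keeps the original request available for pass-through if the mechanic is bypassed.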

M9

Output-length predictor

non-mutating

When a workload's historical truncation rate is under 2%, we inject a tight max_tokens ceiling from the p90 of past outputs.

Daily compute cron fits a per-workload p90 of completion length. Injects max_tokens = p90 × headroom_multiplier (default 1.3, configurable 1.0–2.0). Cuts output cost on completions that previously over-ran the budget but were never actually long. Truncation rate gate prevents the mechanic from firing on workloads where the model legitimately produces long completions.
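The ceiling computation can be sketched as below. The 2% truncation gate, the 1.3 default, and the 1.0–2.0 range are from the text above. The nearest-rank percentile method, the rounding up, and the function names are assumptions.

```python
import math
from typing import List, Optional

TRUNCATION_GATE = 0.02  # mechanic fires only below this historical rate


def p90(lengths: List[int]) -> int:
    """Nearest-rank 90th percentile of completion lengths."""
    ordered = sorted(lengths)
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]


def max_tokens_ceiling(lengths: List[int],
                       truncation_rate: float,
                       headroom_multiplier: float = 1.3) -> Optional[int]:
    """Return the max_tokens to inject, or None when the mechanic should not fire."""
    if not 1.0 <= headroom_multiplier <= 2.0:
        raise ValueError("headroom_multiplier must be within 1.0 and 2.0")
    if not lengths or truncation_rate >= TRUNCATION_GATE:
        return None  # no history, or the workload truncates too often
    return math.ceil(p90(lengths) * headroom_multiplier)
```

For a workload whose p90 completion length is 100 tokens, a multiplier of 1.5 gives a ceiling of 150 tokens. A workload with a 5% truncation rate gets no ceiling at all.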

M10

Batch arbitrage

non-mutating

Async-tolerant workloads route to provider Batch APIs (OpenAI Batch / Anthropic Message Batches, both 50% off list).

Opt-in per workload. We never auto-promote a real-time workload into batch. Latency contract is sponsor-set. Per-request override via x-tessera-batch-allowed header. Settlement loop polls the batch endpoint hourly; cost reconciliation overwrites dispatch-time estimates with actuals once the upstream output_file is available. Customer settlement-token is stored at-rest-encrypted via Supabase Vault.

Per-mechanic deep dives. Full set: M1 auto-route · M2 exact cache · M3 compress · M5 semantic cache · M6 prompt cache · M7 context prune · M8 structured output · M9 output-length predictor · M10 batch arbitrage · M11 cross-provider failover. Schema references, source-traceable claims, explicit deferrals. One page per mechanic, ten of ten now live.

The composition cap

At most one content-mutating mechanic per request. Mutating mechanics (M3 compress, M7 context prune, M8 structured output) can compound quality risk exponentially when stacked. If M7 trims 2% of quality and M8 schema-forces another 3%, the request that survives both isn't losing 5%. It can lose 15-25% because the prompt has been weakened twice.

The proxy enforces this as a hard mutex with priority M7 > M3 > M8. When two mutating candidates qualify on the same request, the higher-priority one fires and the lower one short-circuits. Auto-route (M1) is also locked off whenever any mutating candidate is present. We never compound model reduction with content reduction.
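The mutex can be sketched as a small resolver. The priority order and the M1 lock are taken from the paragraph above. The function name and the representation of candidates as a set of mechanic IDs are hypothetical.

```python
from typing import List, Set

MUTATING_PRIORITY = ["M7", "M3", "M8"]  # highest priority first
NON_MUTATING = {"M1", "M2", "M5", "M6", "M9", "M10"}


def resolve_stack(candidates: Set[str]) -> List[str]:
    """Return the mechanics allowed to fire for one request."""
    mutating = [m for m in MUTATING_PRIORITY if m in candidates]
    stack = {m for m in candidates if m in NON_MUTATING}
    if mutating:
        stack.add(mutating[0])  # only the top-priority mutator fires
        stack.discard("M1")     # never pair model reduction with content reduction
    return sorted(stack, key=lambda m: int(m[1:]))
```

For candidates M1, M3, M6, and M7, the resolved stack is M6 and M7: M7 outranks M3, and M1 is locked off because a mutating candidate is present.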

Non-mutating mechanics (M1 auto-route, M2 + M5 cache hit, M6 provider prompt cache, M9 output-length predictor, M10 batch arbitrage) carry zero quality risk by contract. They either skip the provider entirely or hand the decision to the provider's own cache layer. They stack freely with one another; the only restriction is the M1 lock-off whenever a mutating candidate is present.

The quality SLA

We commit to a measured quality floor of 0.95 on the daily promptfoo canary. That's a per-workload, per-mechanic-stack average score against a jointly-ratified golden set. If any (workload, stack) combination drifts below 0.95 for three consecutive days, the offending mechanic auto-disables for that workload and we issue a service credit on the next invoice. No paperwork.
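The three-day rule reduces to a check over the most recent daily scores. A minimal sketch, assuming scores arrive oldest first and one score is recorded per day for each (workload, stack) pair; the function name is hypothetical.

```python
from typing import Sequence

QUALITY_FLOOR = 0.95
CONSECUTIVE_DAYS = 3


def should_auto_disable(daily_scores: Sequence[float],
                        floor: float = QUALITY_FLOOR,
                        days: int = CONSECUTIVE_DAYS) -> bool:
    """True when the last `days` scores all sit below the floor."""
    if len(daily_scores) < days:
        return False  # not enough history to trip the rule
    return all(score < floor for score in daily_scores[-days:])
```

A single recovery day resets the streak, so scores of 0.94, 0.96, 0.94, 0.94 do not trigger a disable.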

The canary is stack-aware: the same request body fires through the full mechanic stack AND a pristine pass-through, and the eval compares both against the golden answer. That's how we detect "the route was fine but the compress was lossy" cases a single-mechanic canary would miss.

Every tier — Free Sandbox included — can opt in to Tier 1 (throttle on detection) or Tier 2 (auto-rollback the offending mechanic) in /portal/settings. Default is Tier 0 detection (we surface the anomaly, you decide); no upgrade required to escalate.

Auto-rollback

Tier 2 anomaly response is our deepest trust commitment. When the canary detects a quality regression on a workload that has opted in, the proxy disables the offending mechanic at the next request without operator intervention. The sponsor sees a row in /portal/anomalies explaining what dropped, by how much, and which mechanic got rolled back. Re-enablement is sponsor-driven, not time-based. We don't auto-flip the flag back hoping it'll behave.

Reliability primitives (above the mechanics)

Distinct from the ten cost mechanics, reliability primitives are request-infrastructure features that sit in the same proxy path but address a different concern: keeping your savings stack honest under load, not just under steady state.

Per-provider circuit breaker. Each provider gets an in-memory rolling-window state machine (HEALTHY / OPEN / HALF_OPEN) per worker isolate. When the 5xx rate crosses the threshold over a 60-second window, the provider is marked OPEN and auto-route skips its intra-provider alternative mappings until the half-open probe succeeds. The primary request still flows, because we never block your traffic, but the proxy stops recommending degraded models inside that provider.
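The state machine can be sketched as below. The three states and the 60-second window are from the paragraph above. The 50% error threshold, the 30-second cooldown before the half-open probe, the minimum sample count, and every method name are assumptions chosen for illustration.

```python
from collections import deque


class CircuitBreaker:
    """Rolling-window breaker that gates auto-route alternatives only."""

    def __init__(self, threshold=0.5, window_s=60.0,
                 cooldown_s=30.0, min_samples=5):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_samples = min_samples
        self.state = "HEALTHY"
        self.events = deque()  # (timestamp, is_5xx) pairs
        self.opened_at = None

    def record(self, now: float, is_5xx: bool) -> None:
        if self.state == "HALF_OPEN":
            # The probe result alone decides the next state.
            if is_5xx:
                self.state, self.opened_at = "OPEN", now
            else:
                self.state = "HEALTHY"
                self.events.clear()
            return
        self.events.append((now, is_5xx))
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()  # drop samples outside the window
        if self.state == "HEALTHY" and len(self.events) >= self.min_samples:
            errors = sum(1 for _, failed in self.events if failed)
            if errors / len(self.events) >= self.threshold:
                self.state, self.opened_at = "OPEN", now

    def allow_alternatives(self, now: float) -> bool:
        """May auto-route recommend this provider's alternative models?"""
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"  # next recorded response is the probe
        return self.state == "HEALTHY"
```

The breaker never rejects a request. It only answers whether alternatives may be recommended, which mirrors the rule that primary traffic always flows.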

Per-stack auto-rollback. Already described in the Auto-rollback section above, this is a reliability primitive in the quality-SLA sense: we disable the offending mechanic stack without taking down the rest of your savings. Surgical, not nuclear.

Shipped (MVP, 2026-05-20). Cross-provider failover (M11) via OpenRouter as universal fallback. Passive trigger: when the primary upstream returns a 5xx, a connection error, or a timeout, the proxy retries the same chat-completions body on OpenRouter with the namespaced equivalent model (gpt-5 → openai/gpt-5, deepseek-chat → deepseek/deepseek-chat). Opt-in per workload, default off: flip the toggle and register an OpenRouter API key in /portal/settings. Worker version at ship: 0.43.0-m11-failover. Body-shape-divergent primaries (Anthropic /v1/messages), mid-stream 5xx surfacing, and active failover via population signal are queued for follow-ons. Architecture write-up: cross-provider failover at the edge · Mechanic deep-dive: M11 details
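The passive trigger and the model-name mapping could be sketched as follows. The two mappings are the examples given above. The lookup-table approach, the error labels, and the function names are assumptions; the sketch builds the retry body and makes no network call.

```python
from typing import Optional

# Only the two examples from the text; a real table would be larger.
OPENROUTER_MODEL_MAP = {
    "gpt-5": "openai/gpt-5",
    "deepseek-chat": "deepseek/deepseek-chat",
}


def should_fail_over(status: Optional[int], error: Optional[str] = None) -> bool:
    """True on a 5xx status, a connection error, or a timeout."""
    if error in ("connection_error", "timeout"):  # hypothetical labels
        return True
    return status is not None and 500 <= status <= 599


def build_failover_body(body: dict, failover_enabled: bool) -> Optional[dict]:
    """Return the same body with a namespaced model, or None to skip failover."""
    if not failover_enabled:
        return None  # opt-in per workload, default off
    mapped = OPENROUTER_MODEL_MAP.get(body.get("model"))
    if mapped is None:
        return None  # no known equivalent, so surface the original error
    return {**body, "model": mapped}
```

Returning None for an unmapped model keeps the behavior conservative: the caller sees the primary upstream's error, and no request is retried against a guessed model name.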

The audit surface

Every request your application sends through Tessera produces a row on /portal/audit with the canonical mechanics_stack and the savings delta against the baseline. Mutating mechanics render in warning-amber chips so the composition cap is visible at a glance. If the page ever shows two mutating mechanics on the same row, it's a bug. Report it and we issue a service credit. We surface composition-cap violations to you on the page itself; the honesty has to be load-bearing or it isn't honesty.

Why hosted-only — the conscious tradeoff

A real objection we hear often: "why can't I run Tessera on my own servers?" Honest answer — we ship hosted-only on purpose, and the tradeoff works in your favor for the same reason Stripe and Cloudflare are hosted-only.

What you give up: on-prem deployment, full control of the proxy binary, the ability to audit the routing code at runtime, the option of air-gapped deployment.

What you get back:

  • Zero DevOps overhead. No Cloudflare Workers operational expertise required on your side. No worker version pinning. No KV migration runbooks. Every mechanic upgrade reaches you within 24 hours of merge, the same way Stripe API improvements appear in your account without an SDK upgrade.
  • Multi-source pricing catalog stays fresh centrally. Self-hosting would force you to maintain your own pricing snapshots for OpenAI / Anthropic / Mistral / Groq / Cohere / 8 other providers — that work is the asymmetric ops burden we absorb. Pricing changes from upstreams land in your savings math the same day they ship.
  • Quality canary uses the cross-customer eval ledger. Your stack's 0.95 floor is calibrated against an aggregate signal you wouldn't have in isolation. Self-hosted mode would degrade to per-workload eval only.
  • Cross-provider failover is operational, not theoretical. M11 routes to OpenRouter when your primary upstream degrades — that integration requires us to maintain the OR account, monitor health, and pre-cache fallback pricing. Hard to replicate in a single-customer self-host.

What we're working on: a "deploy to your own Cloudflare account" tier for enterprise customers with strict data-residency or air-gap requirements. You own the worker + KV; we still operate the pricing catalog + eval ledger as a managed service. ETA: post-revenue, after the first enterprise pilot signs. If that's blocking you today, talk to us. We'll prioritize the build if the contract justifies it.

What we won't ship: a full on-prem distribution of the proxy + pricing catalog + eval ledger. The asymmetric IP is exactly that bundle; open-sourcing it makes us a worse-funded version of the AWS playbook. The SDK + wire format are open Apache-2.0 — if we go away tomorrow, point your client back at the provider endpoints and you lose nothing except the mechanic stack.

What we don't do

We don't replace your prompts. We don't fine-tune your models. We don't add inference-time tooling beyond the ten mechanics on this page. We don't aggregate your traffic with other customers' (committed-use pooling is on the roadmap but lives in a separate disclosure when it lands).

We don't ship destructive mechanics. There is no "creative compression" mode that's aggressive by default. There is no token-level random substitution. There is no "model swap A/B test" without an eval gate. If a mechanic can't pass the 0.95 canary on your golden set, it doesn't fire on your traffic.

We charge a flat monthly subscription by token volume, not a cut of your savings, so we have no revenue incentive to push mechanics harder than your quality bar allows. This page exists to make that posture checkable: every commitment here is enforced in code and visible to you in the dashboard.

Questions

Engagement structure and pricing are on the pricing page. Data flow, subprocessors, and retention are covered in the security posture and DPA. For the specific eval methodology and the golden-set construction protocol, talk to us. That's the first thing we set up in onboarding.