Mechanic · M6 · Non-mutating (semantically)

Prompt cache

Anthropic · OpenAI · Google · per-provider integration

Every major provider ships a prompt-caching path that lets the upstream skip re-tokenising and re-computing attention over a long shared prefix on every subsequent request. The hard part is wiring into each provider's specific surface: Anthropic wants a cache_control marker injected into the body; OpenAI applies caching automatically for prefixes ≥ 1024 tokens, but Tessera still needs to emit a signal to attribute cached-input billing; and Google requires creating a cachedContents/{name} resource ahead of time and swapping the inline prefix for a reference on the hot path. M6 unifies these three behind a single sponsor opt-in.

M6 is classified as non-mutating in the composition cap because it does not change the prompt the model sees. At most it adds a cache marker (Anthropic) or swaps an inline prefix for a server-side reference to the identical content (Google). The model's next-token decision is bit-identical to the uncached path. M6 composes freely with M1, M3, M7, M8, and M9.

Anthropic. cache_control marker injection

On Anthropic /v1/messages requests, the applier (applyAnthropicPromptCache) walks the request body to the LAST system block and injects cache_control: { type: 'ephemeral' } on it. Anthropic's server then caches the system content for ~5 minutes. Subsequent requests with the identical system prefix hit the cache; Anthropic bills the cached portion at 10% of the input rate and the non-cached portion at the full rate.

The applier handles both shapes the SDKs use: a single string system field, or an array of typed content blocks. When the body has no system field at all, M6 is a no-op for that request, because there is nothing stable to cache against.
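The two body shapes and the no-op case can be illustrated with a minimal sketch. This is not the real applyAnthropicPromptCache; the function name injectCacheControlSketch, the type names, and the choice to treat an empty string as "no system" are assumptions made for illustration.

```typescript
// Hypothetical types for this sketch; the real applier's types are not shown here.
type CacheControl = { type: "ephemeral" };
type SystemBlock = { type: "text"; text: string; cache_control?: CacheControl };
type AnthropicBody = { system?: string | SystemBlock[]; [key: string]: unknown };

// Returns a copy of the body with cache_control on the LAST system block,
// or the same body object when there is no system content to cache against.
function injectCacheControlSketch(body: AnthropicBody): AnthropicBody {
  const { system } = body;
  // Assumption: an empty string is treated like a missing system field.
  if (system === undefined || system === "") return body;

  // A plain string cannot carry a marker, so wrap it in a single text block.
  // Array blocks are shallow-copied so the caller's body is not mutated.
  const blocks: SystemBlock[] =
    typeof system === "string"
      ? [{ type: "text", text: system }]
      : system.map((block) => ({ ...block }));

  const last = blocks[blocks.length - 1];
  if (!last) return body; // empty array: nothing to mark

  last.cache_control = { type: "ephemeral" };
  return { ...body, system: blocks };
}
```

Returning the original object on the no-op path lets a caller detect "nothing applied" with a reference comparison, which is one way to decide whether to emit the response header.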

Response header on Anthropic M6 hits: x-tessera-prompt-cache: applied-anthropic.

OpenAI. Auto-cache attribution signal

Since October 2024 OpenAI applies prompt caching automatically for any prefix ≥ 1024 tokens; no body change is required. So why does Tessera surface M6 on OpenAI workloads at all?

Because the canary's stack-aware grouping needs M6 in the mechanics_stack for OpenAI requests that hit cache. Without M6 in the stack, OpenAI workloads with cache-hit traffic get scored against the _none (pristine passthrough) bucket or a partial stack, which hides the actual prompt-cache contribution from breach detection. The worker emits the m6 tag for opted-in OpenAI workloads as the canary signal source, even though there's no body mutation.

Response header on OpenAI M6 attribution: x-tessera-prompt-cache: auto-openai. The auto- prefix distinguishes this from applied- markers where Tessera mutated the body.
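As a rough illustration of attribution without mutation, the sketch below adds the m6 tag and the auto-openai header for an opted-in workload and never touches the request body. The function name, the Attribution type, and the append-at-the-end ordering are assumptions; the document does not specify how the canonical mechanics_stack is ordered.

```typescript
// Hypothetical result shape; only the header name/value and the m6 tag come from the text above.
type Attribution = { mechanicsStack: string[]; headers: Record<string, string> };

function attributeOpenAiPromptCacheSketch(
  optedIn: boolean,
  mechanicsStack: readonly string[],
): Attribution {
  // Not opted in: no tag, no header.
  if (!optedIn) return { mechanicsStack: [...mechanicsStack], headers: {} };

  // Add m6 once. Appending at the end is an assumption about stack ordering.
  const stack = mechanicsStack.includes("m6")
    ? [...mechanicsStack]
    : [...mechanicsStack, "m6"];

  return {
    mechanicsStack: stack,
    headers: { "x-tessera-prompt-cache": "auto-openai" },
  };
}
```

The request body is deliberately not a parameter here, which makes the "no body mutation" property visible in the signature.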

Google. cachedContents lifecycle reference

Google Gemini's caching path is explicit: the customer (or Tessera) first creates a cachedContents/{name} resource via the Gemini API with the content to cache + a TTL. Subsequent requests pass cached_content: 'cachedContents/{name}' instead of including the content inline. Google bills the cached portion at 25% of input rate while the cache is live.

Tessera splits this into two phases:

  • Lifecycle (off the hot path). A dashboard cron creates / refreshes the cachedContents resource for opted-in workloads and writes the reference into the GOOGLE_CACHE_REGISTRY KV namespace keyed by workload id.
  • Worker hot path. On a Google opted-in request, the worker looks up the registry, swaps the inline prefix for the cached_content reference, and emits applied-google. On a miss (no registry entry, or an expired one) it emits deferred-google, so ops can see the opt-in is honoured at the intent level even when the cron hasn't populated the registry yet.
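The hot-path decision can be sketched as a pure function, assuming the registry entry has already been read from KV by the caller. The RegistryEntry shape (name, expiresAtMs) and the assumption that the cached prefix is exactly the systemInstruction field are hypothetical; the real registry value format is not described here.

```typescript
// Hypothetical registry value; the real GOOGLE_CACHE_REGISTRY entry shape may differ.
type RegistryEntry = { name: string; expiresAtMs: number };
type GeminiBody = { systemInstruction?: unknown; cached_content?: string; [key: string]: unknown };
type Outcome = { body: GeminiBody; header: "applied-google" | "deferred-google" };

function applyGoogleCacheSketch(
  body: GeminiBody,
  entry: RegistryEntry | undefined,
  nowMs: number,
): Outcome {
  // No entry or expired entry: forward the inline body unchanged.
  if (!entry || entry.expiresAtMs <= nowMs) {
    return { body, header: "deferred-google" };
  }

  // Assumption: the cached prefix is the systemInstruction field, so drop it
  // and reference the server-side copy instead.
  const { systemInstruction: _inlinePrefix, ...rest } = body;
  return {
    body: { ...rest, cached_content: entry.name },
    header: "applied-google",
  };
}
```

Keeping the KV read outside the function means the swap logic can be exercised without a KV binding, and the deferred path never allocates a new body.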

The Gemini API key for the lifecycle calls is stored on workloads.google_api_key per workload. The worker doesn't read this key today: the cron writes the registry entry, and the worker only references existing entries. Worker-side access is reserved for a future inline-create fallback if the cron lags behind demand.

Tier gating

M6 opt-ins are governed by getTierFeatures(customerTier).promptCacheAvailable. After the 2026-05-31 free=paid pivot this flag is on for every active tier, including Free Sandbox, and the per-workload toggle is honoured the same way everywhere. Only the quantitative token cap differentiates Free from paid. M5 / M6 / M7 / M8 share the same gate.

The dashboard kv-sync effective-enabled flag is workload_flag AND tier_allows. This is the single source of truth, so there is no drift between the dashboard's claim and the worker's behaviour.
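The AND rule can be written out as a small sketch. getTierFeaturesSketch is a hypothetical stand-in for getTierFeatures, and the tier ids are invented for illustration; only the promptCacheAvailable flag name and the workload_flag AND tier_allows rule come from the text.

```typescript
// Hypothetical stand-in for getTierFeatures; tier ids below are illustrative only.
type TierFeatures = { promptCacheAvailable: boolean };

function getTierFeaturesSketch(customerTier: string): TierFeatures {
  const activeTiers = ["free_sandbox", "paid"]; // assumed ids for "every active tier"
  return { promptCacheAvailable: activeTiers.includes(customerTier) };
}

// effective-enabled = workload_flag AND tier_allows
function effectiveEnabledSketch(workloadFlag: boolean, customerTier: string): boolean {
  const tierAllows = getTierFeaturesSketch(customerTier).promptCacheAvailable;
  return workloadFlag && tierAllows;
}
```

Computing the flag once at sync time and storing only the result is what keeps the dashboard and the worker from evaluating the rule differently.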

What we don’t do (yet)

  • No xAI prompt-cache integration. xAI hasn't published a cache surface for grok-* as of ship date. M6 falls through to no-op on xAI requests.
  • No multi-marker Anthropic strategies. v0.1 injects on the LAST system block only. Anthropic's API supports up to 4 cache breakpoints with different TTLs. The multi-breakpoint optimisation lands in a follow-up once we have customer traffic shapes that benefit from it (mid-conversation cache anchors).
  • No Google inline-create fallback. When the registry entry is missing or expired, the worker emits deferred-google and forwards the inline body unchanged. We do not create the cachedContents resource on the hot path. That would add a second round-trip per cache-miss request, and the worker latency budget doesn't tolerate it. Cron repopulates within ~5 minutes.
  • No Tessera-side proxy cache for prompt-cache hits. That's M2 (exact cache) and M5 (semantic cache). M6 is specifically about getting the customer's prefix into the provider's OWN cache. We don't double-cache.

Verification surfaces

  • Response header x-tessera-prompt-cache with one of: applied-anthropic / auto-openai / applied-google / deferred-google.
  • /portal/audit chip strip. The m6 chip in canonical mechanics_stack. Non-mutating chip colour (subtle, not the warning-amber of M3 / M7 / M8).
  • Provider-side cache metrics. Anthropic responses carry usage.cache_read_input_tokens and usage.cache_creation_input_tokens; OpenAI Chat Completions responses carry usage.prompt_tokens_details.cached_tokens. These flow into optimize_savings.tokens_in (effective input) so the savings math accounts for the cache discount automatically.
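To make the "effective input" idea concrete, here is a hedged sketch of how Anthropic usage fields could be folded into one discounted token count. The formula and the function name are assumptions, not Tessera's actual savings math. The rates are passed in as integer percentages because this page states only the 10% cache-read rate and not a cache-write rate.

```typescript
// Field names follow Anthropic's usage object; the weighting below is an assumption.
type AnthropicUsage = {
  input_tokens: number; // non-cached input tokens
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
};

function effectiveInputTokensSketch(
  usage: AnthropicUsage,
  cacheReadPercent: number, // e.g. 10 for "10% of input rate"
  cacheWritePercent: number, // not stated on this page; supplied by the caller
): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const write = usage.cache_creation_input_tokens ?? 0;
  // Integer percentages keep the arithmetic exact for whole-token counts.
  return (
    usage.input_tokens +
    (read * cacheReadPercent) / 100 +
    (write * cacheWritePercent) / 100
  );
}

// 200 uncached tokens plus 10,000 cache-read tokens at 10%: 200 + 1,000 = 1,200.
const example = effectiveInputTokensSketch(
  { input_tokens: 200, cache_read_input_tokens: 10000 },
  10,
  100,
);
```

In this example the request carried 10,200 input tokens in total but is accounted as 1,200 full-rate tokens.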

What we promise

M6 marker injection (Anthropic, Google) is structural. It does not change the semantic content the model sees on cache MISS, and on cache HIT the model sees the exact content it would have processed inline. OpenAI's auto-cache path doesn't mutate the body at all; M6 is just the attribution signal so the canary breakdown stays honest.

When provider documentation contradicts our integration (e.g. Anthropic changes the cache_control syntax, OpenAI tightens the 1024-token threshold, Google deprecates cachedContents), the worker version that ships the fix carries a CHANGELOG entry + deprecation window for any sponsor still on a pre-update KV record.

Parent index: How it works. Cache mechanic family: M2 exact cache · M5 semantic cache (deep-dives queued). Adjacent: M1 auto-route · M3 compress.