Semantic cache
Cosine threshold 0.95 default · per-workload override
Requests whose embeddings are cosine-similar to a cached request are served from cache. This catches paraphrased queries that exact-match (M2) misses: "summarise this" and "give me a summary of this" embed to nearly the same vector, so they return the same cached answer when the workload accepts the approximation. Default threshold 0.95. Per-workload override on workloads.semantic_cache_threshold (clamped to [0.85, 0.99]).
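The default and clamp range can be sketched as a small resolver. This is a minimal illustration, not the worker's actual code: the function names are hypothetical, and treating a similarity exactly equal to the threshold as a hit is an assumption.

```typescript
// Sketch only: resolveThreshold and isSemanticHit are illustrative names.
const DEFAULT_THRESHOLD = 0.95;
const MIN_THRESHOLD = 0.85;
const MAX_THRESHOLD = 0.99;

/** Resolve the effective threshold from an optional per-workload override. */
function resolveThreshold(override?: number | null): number {
  if (override == null || !Number.isFinite(override)) return DEFAULT_THRESHOLD;
  // Out-of-range overrides are pulled back into [0.85, 0.99].
  return Math.min(MAX_THRESHOLD, Math.max(MIN_THRESHOLD, override));
}

/** Assumption: a similarity equal to the threshold counts as a hit. */
function isSemanticHit(similarity: number, override?: number | null): boolean {
  return similarity >= resolveThreshold(override);
}
```

With no override, a candidate scoring 0.9743 is a hit; the same candidate is a miss for a workload that raises its threshold to 0.99.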
M5 is the cache mechanic with the highest variance in customer utility. Some workloads (FAQ-style support bots, news summarisers) hit M5 80% of the time. Others (open-ended creative generation, unique user-message workloads) hit ~0%. The eligibility heuristic gates M5 firing per workload; the canary tracks whether the cached responses actually pass the customer's eval.
Embedding model + hot-path budget
Embeddings via OpenAI text-embedding-3-small on a dedicated Tessera-owned key (OPENAI_API_KEY_EMBEDDINGS worker secret), NOT the customer's upstream key. The customer may not send an OpenAI key at all (Anthropic-primary workloads); we provide the embedding capability.
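As a rough sketch of what the embed call involves, the helpers below build a request for OpenAI's embeddings endpoint and read the vector out of the response. The helper names and types are hypothetical; only the endpoint URL, the model name, and the `data[0].embedding` response field come from the public OpenAI API.

```typescript
// Hypothetical helpers, not Tessera's code.
interface EmbeddingRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildEmbeddingRequest(apiKey: string, text: string): EmbeddingRequest {
  return {
    url: "https://api.openai.com/v1/embeddings",
    init: {
      method: "POST",
      headers: {
        // The dedicated embeddings key, never the customer's upstream key.
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
    },
  };
}

/** Pull the vector out of the endpoint's parsed JSON response. */
function parseEmbedding(json: unknown): number[] {
  const data = (json as { data?: Array<{ embedding?: unknown }> } | null)?.data;
  const vector = data?.[0]?.embedding;
  if (!Array.isArray(vector)) throw new Error("unexpected embeddings response");
  return vector as number[];
}
```

In a worker, the `url` and `init` pair would be passed to `fetch`, with the key read from the OPENAI_API_KEY_EMBEDDINGS secret.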
Hot-path latency: ~80 ms p50 for embed + lookup. M5 only fires after M2 miss, so worst case the worker pays exact-match KV + embed + semantic KV before upstream. The trade-off: when M5 hits, the win is the full inference round-trip (saves seconds); when it misses, we've added 80 ms. Opt-in per workload.
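The trade-off reduces to simple arithmetic. The sketch below is a worked illustration rather than measured data: it assumes every M2-miss request pays the 80 ms overhead and that a hit skips the upstream call entirely. The upstream latency is a parameter you supply.

```typescript
// Worked arithmetic only; function names are illustrative.
const M5_OVERHEAD_MS = 80; // p50 embed + lookup, from the text above

/** Expected latency saved per M2-miss request, in ms, at a given M5 hit rate. */
function expectedSavingMs(hitRate: number, upstreamMs: number): number {
  // Without M5: upstreamMs. With M5: overhead + (1 - hitRate) * upstreamMs.
  return hitRate * upstreamMs - M5_OVERHEAD_MS;
}

/** Hit rate at which M5 neither helps nor hurts average latency. */
function breakEvenHitRate(upstreamMs: number): number {
  return M5_OVERHEAD_MS / upstreamMs;
}
```

For an assumed 2,000 ms upstream round-trip, the break-even hit rate is 4%. A workload hitting 80% of the time saves about 1,520 ms per request on average, while a workload that never hits loses the full 80 ms.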
Vector store + entry shape
Per-workload partition in the SEMANTIC_CACHE KV namespace, keyed by semantic:{client_id}:{workload_id}:{n}. Each entry carries:
- embedding - the float32 vector (text-embedding-3-small produces 1536-dim, we store verbatim).
- response_body - the cached provider response, byte-for-byte.
- model - the actual model served. M1 may have routed this earlier; we remember which alt.
- created_at + ttl_hours - default 24-hour TTL.
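The entry and its key might be typed as follows. The field names follow the fields above; the concrete types (a string body, epoch-millisecond timestamps) and the helper functions are assumptions made for illustration.

```typescript
// Field names come from the text; types and helpers are assumed.
interface SemanticCacheEntry {
  embedding: number[];   // 1536 floats from text-embedding-3-small
  response_body: string; // cached provider response, served verbatim
  model: string;         // the model that actually produced the response
  created_at: number;    // assumed: epoch milliseconds
  ttl_hours: number;     // default 24
}

/** Build the KV key for slot n of a workload's partition. */
function entryKey(clientId: string, workloadId: string, n: number): string {
  return `semantic:${clientId}:${workloadId}:${n}`;
}

/** True once created_at + ttl_hours has passed. */
function isExpired(entry: SemanticCacheEntry, nowMs: number): boolean {
  return nowMs >= entry.created_at + entry.ttl_hours * 3600 * 1000;
}
```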
Per-workload semantic_cache_max_entries (clamped [10, 200], default 50) caps the partition size. Beyond the cap, LRU eviction removes the oldest entry on populate. The worker reads + clamps every value at the call site before handing to the lookup library.
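A minimal sketch of the cap follows, under stated assumptions: the names are hypothetical, the partition is modelled as an in-memory array rather than KV keys, and "oldest" is read as lowest created_at. A strict LRU would track last access time instead.

```typescript
// Illustrative only; the real partition lives in KV.
const DEFAULT_MAX_ENTRIES = 50;

/** Clamp the per-workload setting to [10, 200], defaulting to 50. */
function clampMaxEntries(raw?: number | null): number {
  if (raw == null || !Number.isFinite(raw)) return DEFAULT_MAX_ENTRIES;
  return Math.min(200, Math.max(10, Math.trunc(raw)));
}

interface Stamped {
  created_at: number;
}

/** Add `entry`, evicting the oldest entries first when the partition is at the cap. */
function populateWithCap<T extends Stamped>(
  partition: T[],
  entry: T,
  maxEntries?: number | null,
): T[] {
  const cap = clampMaxEntries(maxEntries); // clamp at the call site
  const kept = [...partition].sort((a, b) => a.created_at - b.created_at);
  while (kept.length >= cap) kept.shift(); // drop the oldest entry
  kept.push(entry);
  return kept;
}
```

The input array is copied, not mutated, so a failed write leaves the caller's view of the partition unchanged.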
Eligibility + tier gating
M5 fires only when:
- Workload opt-in is true (tier feature tier.promptCacheAvailable is on for every active tier including Free Sandbox after the 2026-05-31 free=paid pivot; only the token cap differs).
- Provider is OpenAI-shape OR Anthropic-shape (the embed-text extractor handles both; Gemini-shape lands in a follow-up).
- Request is non-streaming.
- SEMANTIC_CACHE and OPENAI_API_KEY_EMBEDDINGS env are present (graceful degrade to disabled when missing - worker logs a warning, doesn't break the request).
Any failed gate → M5 skipped, M2 lookup still ran in front, M6 / M3 / M7 / M8 still run downstream. The request continues with one less optimisation in the stack.
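The gates above can be sketched as a single check that returns the reason M5 is skipped. Everything here is hypothetical (the input shape, the field names, the reason strings); it only mirrors the conditions listed.

```typescript
// Hypothetical shapes; the worker's real types are not shown on this page.
type ProviderShape = "openai" | "anthropic" | "gemini";

interface GateInput {
  workloadOptIn: boolean;
  tierFeatureOn: boolean;           // tier.promptCacheAvailable
  provider: ProviderShape;
  streaming: boolean;
  hasSemanticCacheBinding: boolean; // SEMANTIC_CACHE present
  hasEmbeddingsKey: boolean;        // OPENAI_API_KEY_EMBEDDINGS present
}

/** Returns null when M5 may fire, otherwise the reason it is skipped. */
function semanticCacheSkipReason(g: GateInput): string | null {
  if (!g.workloadOptIn || !g.tierFeatureOn) return "not-opted-in";
  if (g.provider !== "openai" && g.provider !== "anthropic") {
    return "unsupported-provider-shape";
  }
  if (g.streaming) return "streaming-request";
  if (!g.hasSemanticCacheBinding || !g.hasEmbeddingsKey) {
    // Degrade gracefully: the caller logs a warning and carries on.
    return "missing-infrastructure";
  }
  return null;
}
```

A non-null reason means the request skips M5 and continues through the rest of the stack unchanged.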
Populate path
On M5 miss + successful upstream response (200) + body ≤ 256 KB: the worker computes a fresh embedding, or reuses the lookup-side vector if available (the same text produces the same embedding, so we deduplicate the OpenAI embed call when possible), and stores an entry in the workload partition via populateEntry. The populate runs in ctx.waitUntil, so the customer-facing response has already been returned by the time we write to KV.
Why the 256 KB cap: a single KV value over that size invokes chunking + indexing overhead that defeats the purpose. M5 is best for small-to-medium responses (FAQ answers, short summaries). Long-form generation (multi-page essays) bypasses populate to avoid bloating the partition.
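The populate gate and the deferred write might look like the sketch below. The `WaitUntilContext` interface is a minimal stand-in for the Workers execution context, the `populate` callback stands in for the populateEntry call (whose signature is not shown here), and reading 256 KB as 256 × 1024 bytes is an assumption.

```typescript
const MAX_POPULATE_BYTES = 256 * 1024; // assumed: KB means 1024 bytes

/** Only successful, reasonably small responses are worth caching. */
function shouldPopulate(status: number, body: Uint8Array): boolean {
  return status === 200 && body.byteLength <= MAX_POPULATE_BYTES;
}

// Minimal stand-in for the Workers execution context.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

/** Schedule the KV write without delaying the response. Returns whether it was scheduled. */
function schedulePopulate(
  ctx: WaitUntilContext,
  status: number,
  body: Uint8Array,
  populate: () => Promise<void>, // stand-in for the populateEntry call
): boolean {
  if (!shouldPopulate(status, body)) return false;
  ctx.waitUntil(populate()); // runs after the response has been returned
  return true;
}
```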
What we don’t do (yet)
- No Vectorize backing (yet). v0.1 uses KV with in-worker cosine compute (max 200 entries × 1536-dim ≈ small enough to brute-force per request). Cloudflare Vectorize as a proper ANN store lands when partition sizes pass that bound.
- No re-rank by recency. v0.1 returns the single highest-similarity match above threshold. We do not blend recency × similarity. Recency-aware re-ranking is on the roadmap.
- No Gemini-shape support. contents[].parts[] needs its own embed-text extractor; Gemini share gates the lift.
- No customer-side embedding override. M5 uses text-embedding-3-small. Sponsors with their own embedding preferences (Cohere, Voyage) cannot swap models today.
- No automatic threshold tuning. 0.95 is the published default; sponsors override per workload. We do not auto-tune based on canary signal.
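The brute-force lookup described in the first two bullets can be sketched as a linear scan that keeps the single highest-similarity entry at or above the threshold, with no recency blending. The names and the entry shape are illustrative, and returning 0 for a zero-length vector is an assumption.

```typescript
interface Candidate {
  embedding: number[];
  response_body: string;
}

/** Cosine similarity: dot(a, b) / (|a| * |b|). */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // assumed convention for zero vectors
  return dot / Math.sqrt(normA * normB);
}

/** Scan every entry; return the best one at or above the threshold, else null. */
function bestMatch<T extends Candidate>(
  query: number[],
  entries: T[],
  threshold: number,
): { entry: T; similarity: number } | null {
  let best: { entry: T; similarity: number } | null = null;
  for (const entry of entries) {
    const similarity = cosineSimilarity(query, entry.embedding);
    if (similarity >= threshold && (best === null || similarity > best.similarity)) {
      best = { entry, similarity };
    }
  }
  return best;
}
```

At the maximum partition size this touches 200 × 1536 = 307,200 element pairs per lookup, which is why a linear scan is workable before an ANN index becomes necessary.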
Verification surfaces
- Response header x-tessera-cache: semantic-hit on M5 hits, with x-tessera-cache-similarity: 0.9743 (4-decimal similarity score). Misses surface semantic-miss.
- /portal/audit chip strip - m5 chip, semantic_cache_hit = true on the row.
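On the client side, the response headers can be checked with a small reader like the one below. It is a sketch: the function and type names are hypothetical, header names are assumed to be lowercased already, and any value other than the two documented ones is reported as "other".

```typescript
interface CacheOutcome {
  state: "hit" | "miss" | "other";
  similarity: number | null;
}

/** Interpret the x-tessera-cache headers on a response. */
function readCacheOutcome(headers: Record<string, string | undefined>): CacheOutcome {
  const state = headers["x-tessera-cache"];
  if (state === "semantic-hit") {
    const similarity = Number(headers["x-tessera-cache-similarity"]);
    return { state: "hit", similarity: Number.isFinite(similarity) ? similarity : null };
  }
  if (state === "semantic-miss") return { state: "miss", similarity: null };
  return { state: "other", similarity: null };
}
```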
What we promise
M5 is opt-in per workload, threshold-tunable per workload, tier-gated (the feature is on for every active tier, including Free Sandbox), byte-exact on serve (cached response returned verbatim), and graceful-degrade on missing infrastructure. We never silently swap a customer's requested model under M5. The cache always returns the model that was served originally.