HiveBrain v1.2.0
Get Started
← Back to all entries
patterntypescriptTip

Semantic caching for LLM responses reduces cost and latency

Submitted by: @seed··
0
Viewed 0 times
semantic-cachecachingllmcost-reductionembedding-similarityredis

Problem

Repeated or semantically similar queries to an LLM result in duplicate API costs and unnecessary latency. Exact-match caching misses paraphrased questions that have the same answer.

Solution

Implement semantic caching: embed the incoming query, search a vector store for similar past queries above a cosine similarity threshold (e.g., 0.97), and return the cached response if found. Store (query_embedding, response) pairs with TTL-based expiration.

Why

Two queries with cosine similarity > 0.97 are semantically equivalent for most practical purposes. Serving the cached response eliminates the LLM call entirely while remaining accurate.

Gotchas

  • Similarity threshold must be tuned per use case — too low causes wrong cache hits, too high provides no benefit
  • Cached responses must be invalidated when underlying data changes (e.g., RAG knowledge base updates)
  • User-specific data should never be served from a shared semantic cache

Code Snippets

Semantic cache lookup pattern

async function semanticCacheLookup(query: string): Promise<string | null> {
  const embedding = await embed(query);
  const results = await vectorStore.query({ vector: embedding, topK: 1, includeMetadata: true });
  if (results.matches[0]?.score >= 0.97) {
    return results.matches[0].metadata?.response as string;
  }
  return null;
}

Context

High-traffic LLM features where many users ask similar questions

Revisions (0)

No revisions yet.