pattern, typescript, Tip
Semantic caching for LLM responses reduces cost and latency
Tags: semantic-cache, caching, llm, cost-reduction, embedding-similarity, redis
Problem
Repeated or semantically similar queries to an LLM result in duplicate API costs and unnecessary latency. Exact-match caching misses paraphrased questions that have the same answer.
Solution
Implement semantic caching: embed the incoming query, search a vector store for similar past queries above a cosine similarity threshold (e.g., 0.97), and return the cached response if found. Store (query_embedding, response) pairs with TTL-based expiration.
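To make the store side of this concrete, here is a minimal sketch that keeps (query_embedding, response) pairs in memory with a TTL and looks them up by cosine similarity. It is an illustration under assumptions, not a production design: `InMemorySemanticCache`, `CacheEntry`, and `cosineSimilarity` are hypothetical names, the linear scan stands in for a real vector store, and embeddings are passed in as plain number arrays.

```typescript
// Hypothetical in-memory stand-in for a vector store; all names are illustrative.
interface CacheEntry {
  embedding: number[];
  response: string;
  expiresAt: number; // epoch milliseconds
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

class InMemorySemanticCache {
  private entries: CacheEntry[] = [];
  private readonly threshold: number;
  private readonly ttlMs: number;

  constructor(threshold = 0.97, ttlMs = 60 * 60 * 1000) {
    this.threshold = threshold;
    this.ttlMs = ttlMs;
  }

  // Store a response under the embedding of the query that produced it.
  set(embedding: number[], response: string, now = Date.now()): void {
    this.entries.push({ embedding, response, expiresAt: now + this.ttlMs });
  }

  // Return the best match at or above the threshold, or null on a miss.
  get(embedding: number[], now = Date.now()): string | null {
    // Drop expired entries before searching.
    this.entries = this.entries.filter((e) => e.expiresAt > now);
    let best: CacheEntry | null = null;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best !== null && bestScore >= this.threshold ? best.response : null;
  }
}
```

On a miss the caller would call the LLM and then `set` the new response. The linear scan costs O(n) per lookup, so a real deployment would likely replace it with an indexed vector store that handles expiry itself.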
Why
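At a threshold this strict, two queries are very likely asking the same thing, although the exact cutoff that holds depends on the embedding model and the domain. A cache hit skips the LLM call entirely, so its cost disappears and its latency is replaced by one embedding call plus a vector lookup.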
Gotchas
- Similarity threshold must be tuned per use case: too low returns wrong answers from the cache, too high yields so few hits that the cache provides little benefit
- Cached responses must be invalidated when underlying data changes (e.g., RAG knowledge base updates)
- User-specific data should never be served from a shared semantic cache
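One way to approach the threshold-tuning gotcha is to score a candidate threshold against a small hand-labeled sample of query pairs. The sketch below assumes such a sample exists; `LabeledPair`, `ThresholdReport`, and `evaluateThreshold` are hypothetical names introduced for illustration.

```typescript
// Hypothetical labeled sample: the similarity score of a query pair, plus
// whether a reviewer judged that both queries deserve the same answer.
interface LabeledPair {
  similarity: number;
  sameAnswer: boolean;
}

interface ThresholdReport {
  threshold: number;
  hitRate: number; // fraction of all pairs that would be served from cache
  wrongHitRate: number; // fraction of those cache hits that would be wrong
}

function evaluateThreshold(pairs: LabeledPair[], threshold: number): ThresholdReport {
  const hits = pairs.filter((p) => p.similarity >= threshold);
  const wrongHits = hits.filter((p) => !p.sameAnswer);
  return {
    threshold,
    hitRate: pairs.length === 0 ? 0 : hits.length / pairs.length,
    wrongHitRate: hits.length === 0 ? 0 : wrongHits.length / hits.length,
  };
}
```

Running this over a range of thresholds shows the trade-off directly: lowering the threshold raises `hitRate` but tends to raise `wrongHitRate` as well, and the acceptable balance depends on how costly a wrong answer is for the feature.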
Code Snippets
Semantic cache lookup pattern
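// Assumes `embed` and `vectorStore` are provided by the surrounding application.
const SIMILARITY_THRESHOLD = 0.97; // tune per use case and embedding model

async function semanticCacheLookup(query: string): Promise<string | null> {
  const embedding = await embed(query);
  const results = await vectorStore.query({ vector: embedding, topK: 1, includeMetadata: true });
  const best = results.matches[0];
  // `score` may be undefined, so check it explicitly before comparing.
  if (best?.score !== undefined && best.score >= SIMILARITY_THRESHOLD) {
    const response = best.metadata?.response;
    return typeof response === "string" ? response : null;
  }
  return null; // cache miss: the caller should call the LLM and store the result
}
Context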
High-traffic LLM features where many users ask similar questions