pattern · typescript · Moderate
Tag co-occurrence matrix for query expansion in knowledge bases
co-occurrence, tag pairs, semantic expansion, search recall, collaborative filtering
Problem
Knowledge-base search that matches only literal terms misses related entries: searching for 'async' skips entries tagged 'promise' or 'concurrency' even though they are closely related. Users have to guess the exact terminology each entry uses, which lowers search recall.
Solution
Build a tag co-occurrence matrix from existing entries. For each entry, generate every unordered pair of its tags and count how many entries each pair appears on. Store the result as (tag_a, tag_b, count) rows and keep only pairs that meet a minimum threshold (e.g., count >= 2). At search time, look up the co-occurring tags for each search term and add them to the expanded query. For example, if 'react' co-occurs with 'hooks' 47 times and 'state' 34 times, searching 'react bug' also searches 'hooks bug' and 'state bug'. This borrows the distributional idea behind word2vec (items that share contexts tend to be related) and applies it to tag space: it is cheap, needs no ML, and can noticeably improve recall. Keep the expansion best-effort (wrapped in try/catch) so a failure falls back to the unexpanded query and never breaks core search. Rebuild the matrix periodically or on demand via an API endpoint.
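The counting step can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Entry` and `TagPair` shapes, the `buildCooccurrence` name, and the lowercase/trim normalization are all assumptions made for the example.

```typescript
// One row of the matrix, always stored with tagA < tagB (hypothetical shape).
interface TagPair {
  tagA: string;
  tagB: string;
  count: number;
}

// Hypothetical entry shape: only the tags matter for this sketch.
interface Entry {
  tags: string[];
}

function buildCooccurrence(entries: Entry[], minCount = 2): TagPair[] {
  const counts = new Map<string, TagPair>();

  for (const entry of entries) {
    // Normalize (an assumption), de-duplicate, and sort so that every pair
    // is produced in (a < b) order exactly once per entry.
    const tags = Array.from(
      new Set(entry.tags.map((t) => t.trim().toLowerCase()))
    )
      .filter((t) => t.length > 0)
      .sort();

    tags.forEach((a, i) => {
      for (const b of tags.slice(i + 1)) {
        // JSON-encode the pair so tags containing separators cannot collide.
        const key = JSON.stringify([a, b]);
        const row = counts.get(key);
        if (row) {
          row.count += 1;
        } else {
          counts.set(key, { tagA: a, tagB: b, count: 1 });
        }
      }
    });
  }

  // Drop pairs below the threshold to filter out one-off coincidences.
  return Array.from(counts.values()).filter((row) => row.count >= minCount);
}
```

Pair generation is quadratic in the number of tags per entry, which is usually acceptable because entries tend to carry only a handful of tags. The returned rows map directly onto a (tag_a, tag_b, count) table if the matrix is persisted.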
Why
Tags that frequently appear together on the same entries are usually semantically related. Counting co-occurrences is a very cheap approximation of word embeddings: instead of learning vector representations, we just count how often tags share an entry. This works because tag space is small (thousands of tags, not millions), so a simple co-occurrence matrix can capture much of the semantic structure.
Gotchas
- Filter pairs with count < 2 to avoid noise from one-off coincidences
- Limit expansion to top 3 co-occurring tags per word to avoid query explosion
- Rebuild the matrix after bulk imports; it doesn't auto-update
- The matrix is symmetric, so store only sorted pairs (a < b) to save space; lookups must then check both columns
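A sketch of the lookup side with these gotchas applied, under stated assumptions: the `TagPair` shape, the `indexPairs` and `expandQuery` names, and the whitespace tokenization are hypothetical and not taken from any particular library. Rows are assumed to be already filtered by the count threshold and stored as sorted pairs.

```typescript
// Sorted-pair row (tagA < tagB), assumed already filtered by count threshold.
interface TagPair {
  tagA: string;
  tagB: string;
  count: number;
}

interface Neighbor {
  tag: string;
  count: number;
}

type TagIndex = Map<string, Neighbor[]>;

// Because only (a < b) is stored, index each row in both directions so a
// lookup by either tag finds its partner.
function indexPairs(rows: TagPair[]): TagIndex {
  const index: TagIndex = new Map();
  const add = (from: string, to: string, count: number): void => {
    const list = index.get(from) ?? [];
    list.push({ tag: to, count });
    index.set(from, list);
  };
  for (const row of rows) {
    add(row.tagA, row.tagB, row.count);
    add(row.tagB, row.tagA, row.count);
  }
  // Strongest co-occurrences first, so slicing yields the top N.
  index.forEach((list) => list.sort((x, y) => y.count - x.count));
  return index;
}

// Returns the original query followed by its expanded variants.
function expandQuery(query: string, index: TagIndex, topN = 3): string[] {
  const queries = [query];
  try {
    const words = query.toLowerCase().split(/\s+/).filter(Boolean);
    for (const word of words) {
      // Cap expansion per word to avoid query explosion.
      const related = (index.get(word) ?? []).slice(0, topN);
      for (const { tag } of related) {
        // Swap the matched word for the related tag; keep the other words.
        const variant = words.map((w) => (w === word ? tag : w)).join(" ");
        if (queries.indexOf(variant) === -1) queries.push(variant);
      }
    }
  } catch {
    // Best-effort: any failure falls back to the unexpanded query.
    return [query];
  }
  return queries;
}
```

With rows for ('hooks', 'react', 47) and ('react', 'state', 34), `expandQuery("react bug", index)` returns `["react bug", "hooks bug", "state bug"]`. Building the index once after each matrix rebuild keeps per-search cost to a few map lookups.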
Context
When building search functionality for any knowledge base or documentation system