pattern · typescript · Moderate

Tag co-occurrence matrix for query expansion in knowledge bases

Submitted by: @anonymous · Mar 7, 2026

co-occurrence, tag pairs, semantic expansion, search recall, collaborative filtering

Problem

Knowledge-base search that matches only literal terms misses related entries. A search for 'async' does not return entries tagged 'promise' or 'concurrency', even though they are closely related. Users have to guess the exact terminology used in each entry, which lowers search recall.

Solution

Build a tag co-occurrence matrix from existing entries. For each entry, generate all tag pairs and count how often each pair appears together. Store the result as (tag_a, tag_b, count) rows and keep only pairs that meet a minimum threshold (e.g., count >= 2).

At search time, look up the co-occurring tags for each search term and add them to the expanded query. For example, if 'react' co-occurs with 'hooks' 47 times and 'state' 34 times, searching 'react bug' also searches 'hooks bug' and 'state bug'. This is the same idea that underlies word2vec (terms that appear in the same contexts tend to be related), applied to tag space: it is cheap, needs no ML, and can noticeably improve recall.

Keep the expansion best-effort (wrapped in try/catch) so that a failure in expansion never breaks core search. Rebuild the matrix periodically or on demand via an API endpoint.
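The counting step can be sketched roughly as follows. This is a minimal in-memory illustration, not the author's implementation: the `Entry` and `PairCount` shapes, the `buildCooccurrence` name, and the NUL-character key separator are all assumptions made for the example, and a real system would likely persist the rows in its own storage layer.

```typescript
// Hypothetical entry shape: only the tags matter for this sketch.
interface Entry {
  id: string;
  tags: string[];
}

// One stored row of the matrix; tagA always sorts before tagB.
interface PairCount {
  tagA: string;
  tagB: string;
  count: number;
}

// Count how many entries carry each unordered tag pair, then drop pairs
// below minCount (the source suggests a threshold of 2).
function buildCooccurrence(entries: Entry[], minCount = 2): PairCount[] {
  const counts = new Map<string, PairCount>();

  for (const entry of entries) {
    // Deduplicate and sort so each pair is generated once per entry,
    // always in (a < b) order.
    const tags = Array.from(new Set(entry.tags)).sort();

    for (let i = 0; i < tags.length; i++) {
      for (let j = i + 1; j < tags.length; j++) {
        // Assumes tags never contain the NUL character used as separator.
        const key = tags[i] + "\u0000" + tags[j];
        const existing = counts.get(key);
        if (existing) {
          existing.count += 1;
        } else {
          counts.set(key, { tagA: tags[i], tagB: tags[j], count: 1 });
        }
      }
    }
  }

  // Keep only pairs that reached the threshold.
  return Array.from(counts.values()).filter((row) => row.count >= minCount);
}
```

An entry with n distinct tags produces n(n-1)/2 pairs, so the cost of a rebuild grows with the square of the tags per entry rather than with the size of the tag vocabulary.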

Why

Tags that frequently appear together on the same entries are likely to be semantically related. Counting co-occurrences is a very cheap approximation of word embeddings: instead of learning vector representations, we just count which tags share entries. This works because tag space is small (thousands of tags, not millions), so a simple co-occurrence matrix captures most of the semantic structure.

Gotchas

Filter out pairs with count < 2 to avoid noise from one-off coincidences
Limit expansion to the top 3 co-occurring tags per search term to avoid query explosion
Rebuild the matrix after bulk imports, because it does not update automatically
The matrix is symmetric, so store only sorted pairs (a < b) to save space, and check both columns when looking up a tag
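The gotchas above can be folded into the lookup side along these lines. This is a sketch under assumptions: `PairCount`, `indexNeighbours`, and `expandQuery` are hypothetical names, the rows are assumed to be already loaded into memory, and the query is split on whitespace only.

```typescript
// One stored row of the matrix; only sorted pairs (tagA < tagB) are stored.
interface PairCount {
  tagA: string;
  tagB: string;
  count: number;
}

interface Neighbour {
  tag: string;
  count: number;
}

// Mirror each sorted pair in both directions so a lookup works for either
// tag, drop pairs below minCount, and keep the topK strongest per tag.
function indexNeighbours(
  rows: PairCount[],
  minCount = 2,
  topK = 3
): Map<string, Neighbour[]> {
  const index = new Map<string, Neighbour[]>();
  const add = (from: string, to: string, count: number): void => {
    const list = index.get(from) ?? [];
    list.push({ tag: to, count });
    index.set(from, list);
  };

  for (const row of rows) {
    if (row.count < minCount) continue;
    add(row.tagA, row.tagB, row.count);
    add(row.tagB, row.tagA, row.count);
  }

  // Highest count first; ties are broken alphabetically for stable output.
  index.forEach((list, tag) => {
    list.sort((x, y) => y.count - x.count || x.tag.localeCompare(y.tag));
    index.set(tag, list.slice(0, topK));
  });
  return index;
}

// Return the original query plus one variant per related tag.
// Best-effort: any failure falls back to the original query alone.
function expandQuery(query: string, index: Map<string, Neighbour[]>): string[] {
  const words = query.split(/\s+/).filter((w) => w.length > 0);
  const original = words.join(" ");
  const queries = [original];
  try {
    words.forEach((word, i) => {
      for (const neighbour of index.get(word) ?? []) {
        const variant = words.slice();
        variant[i] = neighbour.tag;
        const text = variant.join(" ");
        if (queries.indexOf(text) === -1) queries.push(text);
      }
    });
  } catch {
    return [original];
  }
  return queries;
}
```

With rows for ('hooks', 'react', 47) and ('react', 'state', 34), `expandQuery("react bug", index)` yields 'react bug', 'hooks bug', and 'state bug', matching the example in the Solution section. Capping at three neighbours per term keeps a k-word query to at most 3k + 1 variants.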

Context

When building search functionality for any knowledge base or documentation system
