HiveBrain v1.2.0
Get Started
← Back to all entries
snippetjavascriptTip

Implementing search in JavaScript, using TF-IDF and an inverted index

Submitted by: @import:30-seconds-of-code··
0
Viewed 0 times
javascriptimplementinginvertedandsearchindexusingidf

Problem

In the previous article, I went into detail how one could implement the Porter stemming algorithm in JavaScript. While the algorithm can be used along with a simple tokenization strategy to search for keywords in a document, it's insufficient for complex searching tasks. In this article, we'll explore how we can use TF-IDF for more advanced search tasks and implement an inverted index to speed up the search process.
Before we begin, let's establish a simple tokenization strategy. As we're only concerned with plaintext documents, we can use a regular expression to split the text into words.
All we're doing here is matching characters other than alphanumeric characters, hyphens, and apostrophes and splitting the text into words. We're also filtering out short words and converting all words to lowercase for easier comparison.
Stopwords are common words that are often filtered out before or after processing of natural language data. These words are usually not relevant for search tasks and can be safely removed. Lists of stopwords are readily available online, but for simplicity, we'll use a small list of common English stopwords.
Having a basic understanding of these steps, we can use the Porter stemmer to stem the words and perform a search. We'll use the Porter stemmer from the previous article and search for the word 'jump'.

Solution

const tokenize = str =>
  str
    .split(/[^a-z0-9\-']+/i)
    .filter(tkn => !tkn.length < 2)
    .map(tkn => tkn.toLowerCase());

const text = 'The quick brown fox jumps over the lazy dog.';
const tokens = tokenize(text);
// ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']


All we're doing here is matching characters other than alphanumeric characters, hyphens, and apostrophes and splitting the text into words. We're also filtering out short words and converting all words to lowercase for easier comparison.
Stopwords are common words that are often filtered out before or after processing of natural language data. These words are usually not relevant for search tasks and can be safely removed. Lists of stopwords are readily available online, but for simplicity, we'll use a small list of common English stopwords.
Having a basic understanding of these steps, we can use the Porter stemmer to stem the words and perform a search. We'll use the Porter stemmer from the previous article and search for the word 'jump'.
As you can see, using this technique has some limitations. While we can extend it to search for multiple words, it's overall not particularly good at ranking the results. This is why we'll be exploring a better measure of relevance, TF-IDF.
TF-IDF refers to the term frequency-inverse document frequency. It is a measurement that is intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus.
The implementation is straightforward. All we need to add is a Map to use as a vocabulary and add the terms found in each document to it. Then, we can calculate the term frequency (TF) of a term in a document by dividing the number of times the term appears in the document by the total number of terms in the document.

Code Snippets

const tokenize = str =>
  str
    .split(/[^a-z0-9\-']+/i)
    .filter(tkn => !tkn.length < 2)
    .map(tkn => tkn.toLowerCase());

const text = 'The quick brown fox jumps over the lazy dog.';
const tokens = tokenize(text);
// ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
const removeStopwords = (stopWords, words) =>
  words.filter(tkn => !stopWords.includes(tkn));

const commonStopWords = [
  'a', 'am', 'an', 'are', 'be', 'been', 'being', 'by', 'cannot', 'could',
  'did', 'do', 'does', 'doing', 'for', 'had', 'has', 'have', 'having', 'he',
  'her', 'hers', 'him','his', 'i', 'if', 'is', 'it', 'its', 'me', 'my', 'no',
  'nor', 'of', 'or', 'other', 'our', 'ours', 'she', 'should', 'so', 'such',
  'that', 'the', 'their', 'theirs', 'them', 'then', 'there', 'these', 'they',
  'those', 'too', 'was', 'we', 'were', 'who', 'whom', 'with', 'would', 'you',
  'your', 'yours'
];

const text = 'The quick brown fox jumps over the lazy dog.';
const tokens = tokenize(text);
const filteredTokens = removeStopwords(commonStopWords, tokens);
// ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
// Assuming the `porterStemmer` function is available
import { porterStemmer } from './porterStemmer.js';

const documents = [];

const parseDocument = document => {
  const tokens = tokenize(document);
  const filteredTokens = removeStopwords(commonStopWords, tokens);
  const stemmedTokens = filteredTokens.map(porterStemmer);
  return stemmedTokens;
};

const addDocument = document => {
  const terms = parseDocument(document);
  documents.push(terms);
};

const search = query => {
  const stemmedQuery = porterStemmer(query);
  const results = documents.filter(doc => doc.includes(stemmedQuery));
  return results;
};

addDocument(
  'JavaScript is a programming language, used in programming the web.'
);
addDocument(
  'Python is a high-level programming language, used in data science.'
);
addDocument(
  'CSS is a style sheet language, used in web development.'
);
addDocument(
  'HTML is a markup language, used in web development.'
);

search('javascript'); // Returns the first document [0]
search('language'); // Returns all documents [0, 1, 2, 3]
search('programming language'); // Returns nothing []
search('web'); // Returns the first, third, and fourth document [0, 2, 3]

Context

From 30-seconds-of-code: tf-idf-inverted-index

Revisions (0)

No revisions yet.