HiveBrain v1.2.0
Get Started
← Back to all entries
snippetjavascriptTip

Implementing the Porter stemming algorithm in JavaScript

Submitted by: @import:30-seconds-of-code··
0
Viewed 0 times
javascriptimplementingporterstemmingthealgorithm

Problem

I recently started working on a new search prototype. After all, it was about time I revised this almost six year old relic that is the search powering my website. While I'm working on exciting things behind the scenes, let's have a look at the code I wrote one cloudy afternoon back on 2019 and talk about the Porter stemming algorithm.
The Porter stemming algorithm was devised by Martin F. Porter in 1980 (that's 45 years ago). In the paper he describes a set of steps for stripping English words of common suffixes that depart little to no meaning, reducing them to their bases forms, a process known as stemming.
The algorithm consists of 5 steps, with steps 1 and 5 one being broken down into multiple sub-steps. It's still in use today, even though it's limited to English words and doesn't produce true roots in many cases, due to its relative simplicity. However, it's still a great starting point, of you want to get into natural language processing (NLP) or build a somewhat simple text-based search engine.
We'll brush up on the terminology, as described in the original paper, to make it a little easier to follow along.
A consonant in a word is a letter other than A, E, I, O or U, and other

Solution

[C](VC){m}[V]


The algorithm consists of 5 steps, with steps 1 and 5 one being broken down into multiple sub-steps. It's still in use today, even though it's limited to English words and doesn't produce true roots in many cases, due to its relative simplicity. However, it's still a great starting point, of you want to get into natural language processing (NLP) or build a somewhat simple text-based search engine.
We'll brush up on the terminology, as described in the original paper, to make it a little easier to follow along.
A consonant in a word is a letter other than A, E, I, O or U, and other
than Y preceded by a consonant. If a letter is not a consonant, it is a vowel. These are denoted by c and v, respectively. Non-zero sequences of these are denoted by C and V.
Any word, or part of a word, can be matched by a singular pattern:
In the above definition, square brackets denote arbitrary presence of their contents and the (VC){m} denotes m repetitions of the pattern VC. m is called the measure of any word or part of a word.

Code Snippets

[C](VC){m}[V]
( condition ) S1 -> S2
// c - consonant
const consonant = '[^aeiou]';
// v - vowel
const vowel = '[aeiouy]';
// C - consonant sequence
const consonants = '(' + consonant + '[^aeiouy]*)';
// V - vowel sequence
const vowels = '(' + vowel + '[aeiou]*)';
// m > 0
const mGreaterThanZero = new RegExp(
  '^' + consonants + '?' + vowels + consonants
);
// m = 1
const mEqualsOne = new RegExp(
  '^' + consonants + '?' + vowels + consonants + vowels + '?$'
);
// m > 1
const mGreaterThanOne = new RegExp(
  '^' + consonants + '?(' + vowels + consonants + '){2,}'
);
// *v* - stem contains a vowel
const stemContainsVowel = new RegExp(
  '^' + consonants + '?' + vowel
);
// *o - stem ends with a consonant-vowel-consonant sequence
const stemEndsWithConsonantVowelConsonant = new RegExp(
  '^' + consonants + '?' + consonant + vowel + '[^aeiouwxy]$'
);

Context

From 30-seconds-of-code: porter-stemming-algorithm

Revisions (0)

No revisions yet.