
How does editing software (like Microsoft Word or Gmail) pick the second string to compare in Levenshtein distance?


Problem

I understand the textbook explanation of how to use dynamic programming to find the minimum edit distance between two strings, but how do we pick the second string?

I don't think the entire dictionary is compared, since sometimes the difference is in the middle or at the end of the word. I assume that in the end, the suggestion is the string with the minimum edit distance after building a certain number of $n \times m$ tables, where $n$ is the length of the typed string and $m$ is the length of each candidate word that may be close.

Solution

Yes, the entire dictionary is compared against each word. This can be made fast by using a trie together with an algorithm similar to Levenshtein's.

I have built my own spelling corrector that checks words against a dictionary of about 1 million words and 12.5 million word-pairs (for basic word usage checking). It takes about 0.7 milliseconds to check a single word and provide corrective suggestions based on these two dictionaries. I'm sure it could be significantly faster; I stopped optimizing once it became fast enough for my purposes, which was to check a few paragraphs in under 1 second.

Several of the other answers have hinted toward how spelling and grammar checkers/correctors work, but let's put all the pieces together and see how we could build our own fast spelling corrector.

Disclaimer: I don't know if this is how Microsoft and Google actually do it, but it works very well for my needs.
How to build a spelling corrector

I started from this excellent article by Peter Norvig. It's worth a read if you're serious about implementing your own spelling corrector, but the most important insight is the use of Bayes' Theorem to decompose the question into independent parts. Let's start by actually stating our goal.

Goal: given the word $w$ that the user typed, find and rank the most likely possibilities for what the user actually meant to type, $c$ ("candidate"). In other words, for each $c$ in our list of 1 million candidates, we are looking for the probability that the user wanted to type $c$ when they actually typed $w$. Norvig's insight is to use Bayes' theorem to turn this into manageable independent sub-problems:

$$
P(c|w) = \frac{P(c)P(w|c)}{P(w)}
$$

Where:

$P(c|w)$ = the probability the user meant $c$ when they actually typed $w$ (our goal)

$P(c)$ = the probability of typing $c$. This is simply a measure of the frequency of the candidate word in our language model. For example, in English, P("the") is larger than P("thaw").

$P(w|c)$ = The probability that the user would type $w$ when they meant $c$. This is our error model. For example, the probability the user would type "thew" when they meant "thaw" is probably higher than typing "thew" when they really meant "the".

$P(w)$ = The probability of typing $w$. We're simply not interested in any cases where $w$ was not typed, because we observe that it was. Therefore this value is basically 1 (or at least constant) for our purposes and we divide it out.

Since we're just trying to rank suggestions relative to each other, I'm not going to worry about making sure the probabilities actually add up to 1.
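
To make this concrete, here is a minimal sketch (not the author's actual code) of ranking candidates by the product $P(c)P(w|c)$. The names word_freq and error_prob are placeholders for whatever language model and error model you plug in:

def rank_candidates(w, candidates, word_freq, error_prob):
    """Rank candidate corrections c for the typed word w by P(c) * P(w|c).

    P(w) is the same for every candidate, so it is dropped; the scores only
    make sense relative to each other, not as true probabilities.
    """
    total = sum(word_freq.values())
    scored = []
    for c in candidates:
        p_c = word_freq.get(c, 0) / total   # language model: P(c)
        p_w_given_c = error_prob(w, c)      # error model: P(w|c)
        scored.append((p_c * p_w_given_c, c))
    return [c for _, c in sorted(scored, reverse=True)]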

Levenshtein distance is an error model. We could say that $P(w|c)$ is determined by $\mathrm{Levenshtein}(w,c)$ (the smaller the edit distance, the higher the probability) and we get a perfectly serviceable error model. It might even be the best we can do in some cases, e.g. if we're trying to reconstruct English text that was sent over a noisy line that caused essentially random insertion, deletion, and replacement errors. However, for a human typist, we can do better by taking into account common language-specific errors. For instance, accidentally replacing an "e" with an "a" is much more likely than replacing it with an "m". As gnasher729 pointed out, we could also take the user's keyboard layout into account.
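
For instance, a plain dynamic-programming Levenshtein function can be wrapped into the error_prob used in the ranking sketch above. The exponential weighting below is just one made-up way to turn a distance into a decreasing probability-like score; the original answer does not prescribe one:

import math

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_prob(w, c):
    """A smaller edit distance gives a larger (unnormalized) P(w|c)."""
    return math.exp(-levenshtein(w, c))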
Generating Candidate Words

Now we can get to the heart of your question: how do we generate candidates to check? Every word in our dictionary is a candidate! If we try to do this with a naïve brute-force approach, we can think of some obvious waste. For instance, we shouldn't need to fully evaluate the word "continuations" if we've already evaluated "continuation"; this hints that a Trie (or "prefix tree") might help us here. We also probably shouldn't be comparing the "s" in "continuations" against the "y" in "specificity"; surely we know by that point that neither is a good substitute for the other! This hints that we can significantly cull down our search for candidates if we keep track of compounding error while doing our search.

So our steps are:

Step 1. Organize your dictionary into a giant trie

Let's say our dictionary consists of 12 short words, organized into the corresponding trie. We also want to keep track of which nodes are word endings (marked with a solid blue color in the original figure).

Even with 1 million words, this data structure is only a few hundred megabytes and fits easily into memory. (The n-grams are the memory hogs!)
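
As a rough illustration of Step 1, here is one way such a trie might be represented in Python (the word list is invented for the example; it is not the 12 words from the original figure):

class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # the "solid blue" end-of-word marker

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

trie = build_trie(["sake", "shake", "shaken", "take", "taken", "token"])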

Step 2. Calculate compounding error while searching for suggestions

Let's say the user writes "saken", and we want to suggest corrections from this dictionary. We keep a list of 3-tuples containing the following information:

  • Where we are in the word
  • Where we are in the trie
  • The total compounded error so far
This list starts with one entry: (first letter of the word, root of the trie, 0).

Until we have an empty list, we loop over every element in the list. If it's a valid word (that is, the trie pointer is pointing to a node marked "end-of-word"), we return it as a possible suggestion, along with its total compounded error so far.
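
Putting the 3-tuples and the compounding-error culling together, a minimal sketch of this kind of trie search might look as follows. It builds on the TrieNode/build_trie sketch above and assumes unit costs and a fixed error threshold, neither of which is specified in the original answer:

def suggest(word, root, max_error=2):
    """Collect dictionary words within max_error edits of the typed word.

    Each entry mirrors the 3-tuple above: (position in the typed word,
    current trie node, spelling of the trie path so far, compounded error).
    """
    suggestions = {}  # candidate word -> smallest compounded error found
    entries = [(0, root, "", 0)]
    while entries:
        pos, node, prefix, err = entries.pop()
        if err > max_error:
            continue  # cull: too much compounded error already
        if node.is_word:
            # Charge one unit per unmatched character left in the typed word.
            total = err + (len(word) - pos)
            if total <= max_error and total < suggestions.get(prefix, max_error + 1):
                suggestions[prefix] = total
        if pos < len(word):
            # Deletion: skip a character of the typed word.
            entries.append((pos + 1, node, prefix, err + 1))
        for ch, child in node.children.items():
            if pos < len(word) and ch == word[pos]:
                # Match: advance in both the word and the trie for free.
                entries.append((pos + 1, child, prefix + ch, err))
            else:
                if pos < len(word):
                    # Substitution: advance in both, at a cost.
                    entries.append((pos + 1, child, prefix + ch, err + 1))
                # Insertion: advance only in the trie, at a cost.
                entries.append((pos, child, prefix + ch, err + 1))
    return sorted(suggestions.items(), key=lambda kv: kv[1])

print(suggest("saken", trie))  # e.g. [('shaken', 1), ('taken', 1), ...]

The max_error threshold plays the culling role described above: any entry whose compounded error has already passed it is dropped without exploring the subtree beneath it.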

Context

StackExchange Computer Science Q#143708, answer score: 25
