HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

soundex implementation

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
implementationsoundexstackoverflow

Problem

The soundex algorithm maps several spellings of a name to a 4 character term. (hopefully mapping all different transcriptions to the same term) The steps as described in Introduction to information retrieval are:

  • Keep first Letter for the rest:



  • Change any of 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y' to zero



  • Change all other characters to 1-6, see below



  • remove consecutive doubles



  • remove all zeros, right pad with 0s, keep first 3



I implemented a working solution, but can it be implemented more readable?

import Data.Char

soundex :: String -> String
soundex []      = []
soundex (n:ame) = result
  where 
    compr  = map (compress . toUpper) ame
    noDoub = remDoubles compr
    noZero = filter (/='0') noDoub
    result = n: take 3 (noZero++"000")

remDoubles :: String -> String
remDoubles []       = []
remDoubles (a:[])   = [a]
remDoubles (a:b:bs) | a==b      = remDoubles   (b:bs)
                    | otherwise = a:remDoubles (b:bs)

compress :: Char -> Char
compress char | char `elem` "AEIOUHWY" = '0'
              | char `elem` "BFPV"     = '1'
              | char `elem` "CGJKQSXZ" = '2'
              | char `elem` "DT"       = '3'
              | char `elem` "L"        = '4'
              | char `elem` "MN"       = '5'
              | char `elem` "R"        = '6'
              | otherwise              = char

Solution

Correctness

The algorithm is actually quite tricky to implement correctly. Your algorithm gives the wrong results for some examples in the Wikipedia article:

$$\begin{array}{l|l|l}
\textrm{Name} & \textrm{Correct Soundex} & \textrm{Your result} \\
\hline
\textrm{Ashcroft} & \textrm{A261} & \textrm{A226} \\
\textrm{Pfister} & \textrm{P236} & \textrm{P123} \\
\end{array}$$

Specifically, note the following rule:


Two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.

As a consequence, the code must…

  • … treat "H" and "W" differently from the vowels, so char elem "AEIOUHWY" = '0' is wrong. I recommend using '0' to temporarily indicate the presence of a vowel, and dropping 'H' and 'W' unless it is in the initial position.



  • … lookup the digit for the first character too, to see if the next nonzero digit should be discarded. Therefore, compr = map (compress . toUpper) ame is wrong.



Style

ame and noDoub are pretty weird variable names. For a clear and idiomatic function definition, I suggest

soundex name@(n:ns) = n : take 3 (noZero ++ "000")
  where
    …


Note that result should probably be eliminated.

remDoubles could be simplified to map head . group, using the group function from Data.List. I would rename remDoubles to dedup.

compress is oddly named, as it maps one Char to exactly one Char.

I prefer imports to be more explicit about what functions are imported.

Suggested solution

import Data.Char (toUpper)
import Data.List (group)

soundex :: String -> String
soundex []   = []
soundex name = (toUpper $ head name) : nnn
  where
    dedup = map head . group
    nnn = take 3 $ (++ "000") $ filter (/= '0') $ tail $ dedup $ digitize name

digitize :: String -> String
digitize name = d : filter (/= '-') ds
  where
    digitFor c
      | c `elem` "AEIOUY"   = '0'
      | c `elem` "BFPV"     = '1'
      | c `elem` "CGJKQSXZ" = '2'
      | c `elem` "DT"       = '3'
      | c `elem` "L"        = '4'
      | c `elem` "MN"       = '5'
      | c `elem` "R"        = '6'
      | otherwise           = '-'   -- 'H', 'W', and punctuation
    (d:ds) = map (digitFor . toUpper) name


Note that the order of the chain of functions used to define nnn is critical. Yours was closer to…

nnn = take 3 $ (++ "000") $ filter (/= '0') $ dedup $ digitize $ tail name


… which explains the "Pfister" bug.

To address the "Ashcroft" bug, I've rewritten compress as digitize, which can return a shorter string that drops output corresponding to a non-initial 'H' and 'W'.

Code Snippets

soundex name@(n:ns) = n : take 3 (noZero ++ "000")
  where
    …
import Data.Char (toUpper)
import Data.List (group)

soundex :: String -> String
soundex []   = []
soundex name = (toUpper $ head name) : nnn
  where
    dedup = map head . group
    nnn = take 3 $ (++ "000") $ filter (/= '0') $ tail $ dedup $ digitize name

digitize :: String -> String
digitize name = d : filter (/= '-') ds
  where
    digitFor c
      | c `elem` "AEIOUY"   = '0'
      | c `elem` "BFPV"     = '1'
      | c `elem` "CGJKQSXZ" = '2'
      | c `elem` "DT"       = '3'
      | c `elem` "L"        = '4'
      | c `elem` "MN"       = '5'
      | c `elem` "R"        = '6'
      | otherwise           = '-'   -- 'H', 'W', and punctuation
    (d:ds) = map (digitFor . toUpper) name
nnn = take 3 $ (++ "000") $ filter (/= '0') $ dedup $ digitize $ tail name

Context

StackExchange Code Review Q#117894, answer score: 4

Revisions (0)

No revisions yet.