patternMinor
soundex implementation
Viewed 0 times
implementationsoundexstackoverflow
Problem
The soundex algorithm maps several spellings of a name to a 4 character term. (hopefully mapping all different transcriptions to the same term) The steps as described in Introduction to information retrieval are:
I implemented a working solution, but can it be implemented more readable?
- Keep first Letter for the rest:
- Change any of 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y' to zero
- Change all other characters to 1-6, see below
- remove consecutive doubles
- remove all zeros, right pad with 0s, keep first 3
I implemented a working solution, but can it be implemented more readable?
import Data.Char
soundex :: String -> String
soundex [] = []
soundex (n:ame) = result
where
compr = map (compress . toUpper) ame
noDoub = remDoubles compr
noZero = filter (/='0') noDoub
result = n: take 3 (noZero++"000")
remDoubles :: String -> String
remDoubles [] = []
remDoubles (a:[]) = [a]
remDoubles (a:b:bs) | a==b = remDoubles (b:bs)
| otherwise = a:remDoubles (b:bs)
compress :: Char -> Char
compress char | char `elem` "AEIOUHWY" = '0'
| char `elem` "BFPV" = '1'
| char `elem` "CGJKQSXZ" = '2'
| char `elem` "DT" = '3'
| char `elem` "L" = '4'
| char `elem` "MN" = '5'
| char `elem` "R" = '6'
| otherwise = charSolution
Correctness
The algorithm is actually quite tricky to implement correctly. Your algorithm gives the wrong results for some examples in the Wikipedia article:
$$\begin{array}{l|l|l}
\textrm{Name} & \textrm{Correct Soundex} & \textrm{Your result} \\
\hline
\textrm{Ashcroft} & \textrm{A261} & \textrm{A226} \\
\textrm{Pfister} & \textrm{P236} & \textrm{P123} \\
\end{array}$$
Specifically, note the following rule:
Two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
As a consequence, the code must…
Style
Note that
I prefer imports to be more explicit about what functions are imported.
Suggested solution
Note that the order of the chain of functions used to define
… which explains the "Pfister" bug.
To address the "Ashcroft" bug, I've rewritten
The algorithm is actually quite tricky to implement correctly. Your algorithm gives the wrong results for some examples in the Wikipedia article:
$$\begin{array}{l|l|l}
\textrm{Name} & \textrm{Correct Soundex} & \textrm{Your result} \\
\hline
\textrm{Ashcroft} & \textrm{A261} & \textrm{A226} \\
\textrm{Pfister} & \textrm{P236} & \textrm{P123} \\
\end{array}$$
Specifically, note the following rule:
Two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
As a consequence, the code must…
- … treat "H" and "W" differently from the vowels, so
charelem"AEIOUHWY" = '0'is wrong. I recommend using'0'to temporarily indicate the presence of a vowel, and dropping'H'and'W'unless it is in the initial position.
- … lookup the digit for the first character too, to see if the next nonzero digit should be discarded. Therefore,
compr = map (compress . toUpper) ameis wrong.
Style
ame and noDoub are pretty weird variable names. For a clear and idiomatic function definition, I suggestsoundex name@(n:ns) = n : take 3 (noZero ++ "000")
where
…Note that
result should probably be eliminated.remDoubles could be simplified to map head . group, using the group function from Data.List. I would rename remDoubles to dedup.compress is oddly named, as it maps one Char to exactly one Char.I prefer imports to be more explicit about what functions are imported.
Suggested solution
import Data.Char (toUpper)
import Data.List (group)
soundex :: String -> String
soundex [] = []
soundex name = (toUpper $ head name) : nnn
where
dedup = map head . group
nnn = take 3 $ (++ "000") $ filter (/= '0') $ tail $ dedup $ digitize name
digitize :: String -> String
digitize name = d : filter (/= '-') ds
where
digitFor c
| c `elem` "AEIOUY" = '0'
| c `elem` "BFPV" = '1'
| c `elem` "CGJKQSXZ" = '2'
| c `elem` "DT" = '3'
| c `elem` "L" = '4'
| c `elem` "MN" = '5'
| c `elem` "R" = '6'
| otherwise = '-' -- 'H', 'W', and punctuation
(d:ds) = map (digitFor . toUpper) nameNote that the order of the chain of functions used to define
nnn is critical. Yours was closer to…nnn = take 3 $ (++ "000") $ filter (/= '0') $ dedup $ digitize $ tail name… which explains the "Pfister" bug.
To address the "Ashcroft" bug, I've rewritten
compress as digitize, which can return a shorter string that drops output corresponding to a non-initial 'H' and 'W'.Code Snippets
soundex name@(n:ns) = n : take 3 (noZero ++ "000")
where
…import Data.Char (toUpper)
import Data.List (group)
soundex :: String -> String
soundex [] = []
soundex name = (toUpper $ head name) : nnn
where
dedup = map head . group
nnn = take 3 $ (++ "000") $ filter (/= '0') $ tail $ dedup $ digitize name
digitize :: String -> String
digitize name = d : filter (/= '-') ds
where
digitFor c
| c `elem` "AEIOUY" = '0'
| c `elem` "BFPV" = '1'
| c `elem` "CGJKQSXZ" = '2'
| c `elem` "DT" = '3'
| c `elem` "L" = '4'
| c `elem` "MN" = '5'
| c `elem` "R" = '6'
| otherwise = '-' -- 'H', 'W', and punctuation
(d:ds) = map (digitFor . toUpper) namennn = take 3 $ (++ "000") $ filter (/= '0') $ dedup $ digitize $ tail nameContext
StackExchange Code Review Q#117894, answer score: 4
Revisions (0)
No revisions yet.