patternMinor
Counting word frequencies in Haskell
Viewed 0 times
frequencieswordcountinghaskell
Problem
I am leaning Haskell and decided to do the following lab assignment from CS240h: Functional Systems in Haskell. I welcome any advice on more idiomatic coding style and performance tips.
Notes:
Here is the code I ended with:
`
chunk s = let (word, notGood) = span isGoodChar s
(_, rest) = span (not . isGoodChar) notGood
in word : (if rest == "" then [] else chunk rest)
trim = (dropWhile isPunctuation) . (dropWhileEnd isPunctuation)
lower = map toLower str
in [trimmed | word String -> IO Frequencies
increaseCount frequencies word = do
maybeCount String -> IO Frequencies
countWords frequencies str = do
_ <- mapM (increaseCount frequencies) (getWords str)
return frequencies
-- read list files from arguments or use stdin
processInputs :: IO Frequencies
processInputs = do
frequencies <- emptyFrequencies
args <- getArgs
contents <- if args == [] then sequence
Notes:
- I chose
Data.HashTableon the assumption that it would be faster thanData.Mapas the number of entries increase. Edit: this seems to be an erroneous assumption as timing on about 5M of input hasData.Mapat 3.5s andData.HashTableat 5.3s.
- Some of the
_
- I also wondered if using some sort of ByteString
would be better, but I was also confused on how to make it work with UTF-8 and coming up with a suitable hash function, so I dropped that idea.
Here is the code I ended with:
`
module Lab1 where
import Prelude hiding (lookup, readFile, getContents)
import System.Environment (getArgs)
import System.IO.UTF8 (readFile, getContents)
import Data.Char (isAlpha, toLower, isPunctuation)
import Data.HashTable
import Data.Maybe (fromMaybe)
import Data.Function (on)
import Data.List (sortBy, dropWhileEnd)
type Frequencies = HashTable String Int
-- Get relevant words according to the instructions in the assignment.
-- Convert to lowercase, keep the single quotes ("Don't" -> "don't")
getWords :: String -> [String]
getWords str = let
isGoodChar c = isAlpha(c) || c elem "'"chunk s = let (word, notGood) = span isGoodChar s
(_, rest) = span (not . isGoodChar) notGood
in word : (if rest == "" then [] else chunk rest)
trim = (dropWhile isPunctuation) . (dropWhileEnd isPunctuation)
lower = map toLower str
in [trimmed | word String -> IO Frequencies
increaseCount frequencies word = do
maybeCount String -> IO Frequencies
countWords frequencies str = do
_ <- mapM (increaseCount frequencies) (getWords str)
return frequencies
-- read list files from arguments or use stdin
processInputs :: IO Frequencies
processInputs = do
frequencies <- emptyFrequencies
args <- getArgs
contents <- if args == [] then sequence
Solution
Inside do notation, you don't need to use
Haskell programmers tend to prefer immutable collections like
To work efficiently with Unicode data, consider using the text package.
_ <- to discard the value of a computation. So_ <- update frequencies word (count + 1) can be written as update frequencies word (count + 1)Haskell programmers tend to prefer immutable collections like
Data.Map or Data.IntMap over hash tables. Data.HashTable uses mutable references and requires you to work inside the IO monad. Generally, it is good practice to relegate the IO monad to the outer layers of your program and keep the core pure.To work efficiently with Unicode data, consider using the text package.
Context
StackExchange Code Review Q#22713, answer score: 2
Revisions (0)
No revisions yet.