HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Counting word frequencies in Haskell

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
frequencieswordcountinghaskell

Problem

I am leaning Haskell and decided to do the following lab assignment from CS240h: Functional Systems in Haskell. I welcome any advice on more idiomatic coding style and performance tips.

Notes:

  • I chose Data.HashTable on the assumption that it would be faster than Data.Map as the number of entries increase. Edit: this seems to be an erroneous assumption as timing on about 5M of input has Data.Map at 3.5s and Data.HashTable at 5.3s.



  • Some of the _



  • I also wondered if using some sort of ByteString would be better, but I was also confused on how to make it work with UTF-8 and coming up with a suitable hash function, so I dropped that idea.



Here is the code I ended with:

`
module Lab1 where

import Prelude hiding (lookup, readFile, getContents)
import System.Environment (getArgs)
import System.IO.UTF8 (readFile, getContents)
import Data.Char (isAlpha, toLower, isPunctuation)
import Data.HashTable
import Data.Maybe (fromMaybe)
import Data.Function (on)
import Data.List (sortBy, dropWhileEnd)

type Frequencies = HashTable String Int

-- Get relevant words according to the instructions in the assignment.
-- Convert to lowercase, keep the single quotes ("Don't" -> "don't")
getWords :: String -> [String]
getWords str = let
isGoodChar c = isAlpha(c) || c
elem "'"
chunk s = let (word, notGood) = span isGoodChar s
(_, rest) = span (not . isGoodChar) notGood
in word : (if rest == "" then [] else chunk rest)
trim = (dropWhile isPunctuation) . (dropWhileEnd isPunctuation)
lower = map toLower str
in [trimmed | word String -> IO Frequencies
increaseCount frequencies word = do
maybeCount String -> IO Frequencies
countWords frequencies str = do
_ <- mapM (increaseCount frequencies) (getWords str)
return frequencies

-- read list files from arguments or use stdin
processInputs :: IO Frequencies
processInputs = do
frequencies <- emptyFrequencies
args <- getArgs
contents <- if args == [] then sequence

Solution

Inside do notation, you don't need to use _ <- to discard the value of a computation. So
_ <- update frequencies word (count + 1) can be written as update frequencies word (count + 1)

Haskell programmers tend to prefer immutable collections like Data.Map or Data.IntMap over hash tables. Data.HashTable uses mutable references and requires you to work inside the IO monad. Generally, it is good practice to relegate the IO monad to the outer layers of your program and keep the core pure.

To work efficiently with Unicode data, consider using the text package.

Context

StackExchange Code Review Q#22713, answer score: 2

Revisions (0)

No revisions yet.