HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Finding most frequent words from Google N-Gram dataset

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
googlewordsdatasetgramfindingfromfrequentmost

Problem

I have written a program to process Google 1-Gram dataset in order
to obtain the list of most frequent words. This is my second more or less
working program in Haskell, so I really want to know if it complies with
current Haskell best practices.

I want to hear any your thought about virtually anything: performance, coding style, naming, even grammar and spelling in comments.

```
-- This script processes Google N-Gram dataset to obtain the list of the most
-- frequent words sorted by their frequency descending.
--
-- Usage example:
-- $ zcat googlebooks-eng-all-1gram-20120701-{a..z}.gz |
-- ./ProcessGoogleDataset 1950 5000 "$(echo {a..z})" > result.txt
--
-- where 1950 means to filter out only records newer then year 1950,
-- 5000 means number of the most frequent words to collect,
-- {a..z} means the list of letters allowed in words.
--
-- Copyright (c) 2015, Pavel Kretov.
-- Provided under the terms of Apache 2.0 License.
import Control.Monad.IO.Class (liftIO)
import Data.Char (isSpace, isLower)
import Data.Function (on)
import Data.List (foldl', sortBy, groupBy, partition)
import Data.List.Ordered (mergeBy)
import Data.Maybe (fromJust)
import Data.Ord (comparing)
import Data.Tuple (swap)
import System.Environment (getArgs)
import System.IO (stdin, stdout)

-- Heap seems convenient for finding most frequent items.
import qualified Data.Heap as H

-- Use Conduit package for IO as it has proven to be faster
-- then built-in (lines . readFile) approach.
import Data.Conduit (($$), (=$))
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Lazy as CZ

-- Conduit thinks Data.Text is better then plain strings.
import qualified Data.Text as T
import qualified Data.Text.Read as TR

-- Conduit really insists on that f

Solution

Here are only general thoughts about your code (it doesn’t talk about optimization, or data structures)

Code organisation

I would say it’s time to split your code: you have 19 imports and it becomes hard to know which part of your program requires which library.

You need to separate input/output code, argument handling and data processing. It will allow you to write unit tests more easily.

Comments

Avoid -- Do actual processing comments style. It doesn’t tell anything about the code itself, it just helps the developper to know where she/he is, which a good organisation of the code would have tell him/her.

When commenting functions, describe the arguments and what the function returns (the Haddock documentation will give some hint).

The naive mostFreq implementation code should be put in comments if it is not used.

Code style

Why don’t you use Haddock when commenting your code ? It even allows you to put small tests for QuickCheck.

I’m not a fan of your use of id just to make your code look better. Code should do something.

Hlint advices:

-
ProcessGoogleDataset.hs:67:23: Warning: Use first

Found:
  \ (word, f) -> (truncateUnderscore word, f)
Why not:
  Control.Arrow.first truncateUnderscore


-
ProcessGoogleDataset.hs:68:45: Warning: Avoid lambda

Found:
  \ c -> c /= '_'
Why not:
  (/= '_')


-
ProcessGoogleDataset.hs:73:21: Warning: Use String

Found:
  [Char] -> [(T.Text, Int)] -> [(T.Text, Int)]
Why not:
  String -> [(T.Text, Int)] -> [(T.Text, Int)]


-
ProcessGoogleDataset.hs:118:39: Warning: Use head

Found:
  args !! 0
Why not:
  head args


There are many (T.Text, Int) and (T.Text, Int, Int). It would help to define types (even type synonyms) to make it clear of what data they hold.

Program behaviour

Since your program will be run as a command line program, I think it should behave like a standard command line utility.

For example, it should have a help option, handle incorrect parameter, setting the return code, work with standard files (not only standard input) etc.

Code Snippets

Found:
  \ (word, f) -> (truncateUnderscore word, f)
Why not:
  Control.Arrow.first truncateUnderscore
Found:
  \ c -> c /= '_'
Why not:
  (/= '_')
Found:
  [Char] -> [(T.Text, Int)] -> [(T.Text, Int)]
Why not:
  String -> [(T.Text, Int)] -> [(T.Text, Int)]
Found:
  args !! 0
Why not:
  head args

Context

StackExchange Code Review Q#96778, answer score: 2

Revisions (0)

No revisions yet.