Finding most frequent words from Google N-Gram dataset
Problem
I have written a program to process the Google 1-gram dataset in order
to obtain the list of most frequent words. This is my second more or less
working program in Haskell, so I really want to know if it complies with
current Haskell best practices.
I want to hear your thoughts about virtually anything: performance, coding style, naming, even grammar and spelling in the comments.
```
-- This script processes Google N-Gram dataset to obtain the list of the most
-- frequent words sorted by their frequency descending.
--
-- Usage example:
-- $ zcat googlebooks-eng-all-1gram-20120701-{a..z}.gz |
-- ./ProcessGoogleDataset 1950 5000 "$(echo {a..z})" > result.txt
--
-- where 1950 means to keep only records newer than the year 1950,
-- 5000 means number of the most frequent words to collect,
-- {a..z} means the list of letters allowed in words.
--
-- Copyright (c) 2015, Pavel Kretov.
-- Provided under the terms of Apache 2.0 License.
import Control.Monad.IO.Class (liftIO)
import Data.Char (isSpace, isLower)
import Data.Function (on)
import Data.List (foldl', sortBy, groupBy, partition)
import Data.List.Ordered (mergeBy)
import Data.Maybe (fromJust)
import Data.Ord (comparing)
import Data.Tuple (swap)
import System.Environment (getArgs)
import System.IO (stdin, stdout)
-- Heap seems convenient for finding most frequent items.
import qualified Data.Heap as H
-- Use Conduit package for IO as it has proven to be faster
-- than the built-in (lines . readFile) approach.
import Data.Conduit (($$), (=$))
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Lazy as CZ
-- Conduit thinks Data.Text is better than plain strings.
import qualified Data.Text as T
import qualified Data.Text.Read as TR
-- Conduit really insists on that.
```
Solution
Here are only general thoughts about your code (this doesn't cover optimization or data structures).
Code organisation
I would say it’s time to split your code: you have 19 imports and it becomes hard to know which part of your program requires which library.
You need to separate input/output code, argument handling and data processing. It will allow you to write unit tests more easily.
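To illustrate the idea, here is a minimal, self-contained sketch (not the author's code; the names and the simplified word-counting logic are hypothetical) of keeping the pure processing separate from the IO shell, so the core can be unit-tested with plain lists:

```haskell
module Main where

import Data.List (group, sort, sortBy)
import Data.Ord (Down (..), comparing)

-- Pure core: count word occurrences and sort by descending frequency.
-- Because it touches no IO, it can be tested directly.
mostFrequent :: Int -> [String] -> [(String, Int)]
mostFrequent n =
  take n
    . sortBy (comparing (Down . snd))
    . map (\ws -> (head ws, length ws))
    . group
    . sort

-- Thin IO shell: all impure work stays here.
main :: IO ()
main = interact (unlines . map show . mostFrequent 10 . words)
```

With this split, a test suite only needs to call `mostFrequent`; the IO layer stays trivial.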
Comments
Avoid the `-- Do actual processing` comment style. It doesn't say anything about the code itself; it only helps the developer keep track of where they are, which a good organisation of the code would do anyway.
When commenting functions, describe the arguments and what the function returns (the Haddock documentation will give some hints).
The naive mostFreq implementation should be commented out (or removed) if it is not used.
Code style
Why don't you use Haddock when commenting your code? It even allows you to put small tests for QuickCheck.
I'm not a fan of your use of id just to make your code look better. Code should do something.
HLint advises:
- ProcessGoogleDataset.hs:67:23: Warning: Use first
  Found: \(word, f) -> (truncateUnderscore word, f)
  Why not: Control.Arrow.first truncateUnderscore
- ProcessGoogleDataset.hs:68:45: Warning: Avoid lambda
  Found: \c -> c /= '_'
  Why not: (/= '_')
- ProcessGoogleDataset.hs:73:21: Warning: Use String
  Found: [Char] -> [(T.Text, Int)] -> [(T.Text, Int)]
  Why not: String -> [(T.Text, Int)] -> [(T.Text, Int)]
- ProcessGoogleDataset.hs:118:39: Warning: Use head
  Found: args !! 0
  Why not: head args
There are many (T.Text, Int) and (T.Text, Int, Int) tuples. It would help to define types (even type synonyms) to make it clear what data they hold.
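The "use first" suggestion and the type-synonym advice fit together; here is an illustrative sketch (the names `Token`, `Freq`, and `clean` are hypothetical, not from the reviewed program):

```haskell
import Control.Arrow (first)

-- Type synonyms document what the tuple components mean.
type Token = String
type Freq  = Int

-- HLint: (/= '_') instead of \c -> c /= '_'
truncateUnderscore :: Token -> Token
truncateUnderscore = takeWhile (/= '_')

-- Instead of: map (\(word, f) -> (truncateUnderscore word, f))
-- 'first' applies the function to the first pair component only.
clean :: [(Token, Freq)] -> [(Token, Freq)]
clean = map (first truncateUnderscore)

main :: IO ()
main = print (clean [("one_NOUN", 3), ("two", 1)])
```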
Program behaviour
Since your program will be run from the command line, I think it should behave like a standard command-line utility.
For example, it should have a help option, handle incorrect parameters, set the return code, work with ordinary files (not only standard input), etc.
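One way to get that behaviour with only the base library is System.Console.GetOpt; the following is a hedged sketch (the flag set and the argument check are illustrative, not the reviewed program's actual interface):

```haskell
import System.Console.GetOpt
import System.Environment (getArgs, getProgName)
import System.Exit (exitFailure, exitSuccess)

data Flag = Help deriving (Eq, Show)

options :: [OptDescr Flag]
options = [Option ['h'] ["help"] (NoArg Help) "show this help text"]

main :: IO ()
main = do
  argv <- getArgs
  prog <- getProgName
  case getOpt Permute options argv of
    (flags, args, [])
      -- A help option and explicit exit codes, as a standard utility would have.
      | Help `elem` flags -> putStr (usageInfo prog options) >> exitSuccess
      | length args /= 3  -> putStrLn "expected: YEAR COUNT LETTERS" >> exitFailure
      | otherwise         -> putStrLn ("processing with " ++ show args)
    (_, _, errs) -> putStr (concat errs) >> exitFailure
```

For richer interfaces (subcommands, typed readers), the optparse-applicative package is a common next step.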
StackExchange Code Review Q#96778, answer score: 2