
Optimizing word counter

Submitted by: @import:stackexchange-codereview

Problem

I took Google's Python class today, and this is what I wrote for the exercise on building a word counter.

Please take a look and suggest any improvements. Please also point out any bad practices I have used.

import sys

def make_dict(filename):
    """returns a word/count dictionary for the given input file"""
    myFile=open(filename,'rU')
    text=myFile.read()
    words=text.lower().split()
    wordcount_dict={}   #map each word to its count
    for word in words:
        wordcount_dict[word]=0
    for word in words:
        wordcount_dict[word]=wordcount_dict[word]+1
    myFile.close()
    return  wordcount_dict

def print_words(filename):
    """prints each word in the file followed by its count"""
    wordcount_dict=make_dict(filename)
    for word in wordcount_dict.keys():
        print word, "  " , wordcount_dict[word]

def print_top(filename):
    """prints the words with the top 20 counts"""
    wordcount_dict=make_dict(filename)
    keys = wordcount_dict.keys()
    values = sorted(wordcount_dict.values())
    for x in xrange (1,21): #for the top 20 values
        for word in keys :
            if wordcount_dict[word]==values[-x]:
                print word, "       ",wordcount_dict[word]

def main():
  if len(sys.argv) != 3:
    print 'usage: ./wordcount.py {--count | --topcount} file'
    sys.exit(1)

  option = sys.argv[1]
  filename = sys.argv[2]
  if option == '--count':
    print_words(filename)
  elif option == '--topcount':
    print_top(filename)
  else:
    print 'unknown option: ' + option
    sys.exit(1)

if __name__ == '__main__':
  main()

Solution

For a word counter, a more suitable container is collections.defaultdict.

With it, your make_dict function can be written much more simply:

from collections import defaultdict

def make_dict(filename):
    """returns a word/count dictionary for the given input file"""
    wordcount_dict = defaultdict(int)  # missing words start at 0
    with open(filename, 'rU') as myFile:
        for line in myFile:
            # split() with no arguments already discards surrounding whitespace
            for word in line.lower().split():
                wordcount_dict[word] += 1
    return wordcount_dict


Note that you don't need to initialize dictionary entries for new keys: defaultdict(int) supplies 0 the first time a key is accessed.
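
The behaviour described above can be seen on a small in-memory example. This is a minimal sketch written for Python 3 (the code in this answer is Python 2); count_words and the sample sentence are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def count_words(words):
    """Count occurrences of each word in an iterable of strings."""
    counts = defaultdict(int)  # a missing key is created with int() == 0
    for word in words:
        counts[word] += 1  # no explicit initialization needed
    return counts


sample = "the cat and the hat".split()
counts = count_words(sample)
print(counts["the"])  # 2
print(counts["cat"])  # 1
```

One thing to keep in mind: merely looking up a missing key on a defaultdict inserts it with a count of 0, so use counts.get(word, 0) or "word in counts" for read-only checks.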

A different approach is to use OOP: create a word counter object that holds its own state and methods. The code becomes simpler, encapsulated and easier to extend.

Below is a working OOP proposal. Some of its improvements can also be applied to your functional version if you don't like OOP:

1) I merged your two print methods into one, print_words(self, number=None). To get the top 20, pass 20 as the number; with no argument it prints every word.

2) I added cleanup for words that are attached to punctuation characters (otherwise house, house. and house' would be counted as different words), using constants from the string module.

non_chars = string.punctuation + string.whitespace
words = [item.strip(non_chars).lower() for item in line.split()]


3) I used operator.itemgetter for the sorting key instead of a lambda, which I find more readable.

4) I used string formatting with the classic % operator to align the printed output.
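
Point 2 can be tried in isolation. The following is a minimal sketch written for Python 3 (the answer's own code is Python 2); clean_words and NON_CHARS are hypothetical names, and the final filter that drops empty strings is an addition that is not part of the answer.

```python
import string

# Characters to trim from both ends of each token.
NON_CHARS = string.punctuation + string.whitespace


def clean_words(line):
    """Split a line on whitespace, trim punctuation, and lowercase."""
    words = [item.strip(NON_CHARS).lower() for item in line.split()]
    # A token made only of punctuation (e.g. "--") strips to "", so drop it.
    return [word for word in words if word]


print(clean_words("House, house. HOUSE' -- it's"))
# ['house', 'house', 'house', "it's"]
```

Because strip() only trims the ends of a token, an apostrophe inside a word such as it's is preserved. The full class from the answer follows.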

import operator
import string
from collections import defaultdict

class WordCounter(defaultdict):
    def __init__(self, filename):
        defaultdict.__init__(self, int)
        self.file = filename
        self._fill_it()

    def _fill_it(self):
        "fill dictionary"
        non_chars = string.punctuation + string.whitespace
        with open(self.file, 'rU') as myFile:
            for line in myFile:
                words = [item.strip(non_chars).lower() for item in line.split()]
                for word in words:
                    self[word] += 1

    def print_words(self, number=None):
        """prints the words with the top <number> counts"""
        wc_pairs = self.items()
        wc_pairs.sort(key=operator.itemgetter(1), reverse=True)
        number = number or len(wc_pairs)
        for word, count in wc_pairs[:number]:
            print "%-20s%5s" % (word, count)

my_wc = WordCounter('testme.txt')

print my_wc['aword']    # print 'aword' counts
my_wc.print_words()     # print all (sorted by counts)
my_wc.print_words(3)    # print top 3
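
The sorting and formatting steps (points 3 and 4) can also be sketched for Python 3, where items() returns a view with no sort() method, so sorted() is used instead. This is an illustration under that assumption, not the answer's code; top_words is a hypothetical helper that returns the formatted lines rather than printing them.

```python
import operator


def top_words(counts, number=None):
    """Return formatted 'word count' lines, highest counts first."""
    # sorted() accepts any iterable, including the view that
    # dict.items() returns in Python 3.
    pairs = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    # A slice end of None means "no limit", so number=None keeps all pairs.
    return ["%-20s%5d" % (word, count) for word, count in pairs[:number]]


counts = {"the": 5, "cat": 2, "hat": 1}
for line in top_words(counts, 2):
    print(line)
```

Here %-20s left-aligns the word in a 20-character column and %5d right-aligns the count in a 5-character column. Unlike the "number or len(wc_pairs)" idiom in the class above, passing 0 to this sketch yields no lines rather than all of them.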


A final note: putting a space before and after operators, and after commas, improves readability and is considered good practice.
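
As a small illustration of that spacing advice, here is one statement from the question shown before and after. The snippet is written for Python 3, and the starting values are made up so that it runs on its own.

```python
# Before (cramped, as in the question):
#   wordcount_dict[word]=wordcount_dict[word]+1
# After (spaces around operators and after commas):
wordcount_dict = {"word": 0}  # hypothetical starting state
word = "word"
wordcount_dict[word] = wordcount_dict[word] + 1
print(word, wordcount_dict[word])
```

This matches the whitespace recommendations in PEP 8, Python's style guide.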

Context

StackExchange Code Review Q#6863, answer score: 5
