HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Count occurrence of words in a .txt file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
filewordsoccurrencecounttxt

Problem

I'm taking an intro to programming class and although I've learned some things I didn't know before (I've been using Python for about 1.5 years) I feel like I've not progressed much in writing "beautiful" code. My professor is committed to keeping this as a general intro class, and chose Python for its "friendliness" as an initial language. I can't really tell how much I'm improving (or not) as the grades at this point seem really inflated, so I wanted to get some input from here.

We were assigned an exercise in class to take a .txt file (in our case a .txt of the Gettysburg Address) and count the number of times words occur. We were then to output our results in a neatly formatted fashion. We've been getting drilled in writing functions and are starting to work with dictionaries, so I came up with this solution with those things in mind. I want to know how I can improve my code (i.e make it more efficient, Pythonic, and embrace what Python brings to the table as a language).

```
from re import split

def process_line(words, word_dict):
for word in words:
if word in word_dict:
word_dict[word] += 1
else:
word_dict[word] = 1

def process_dict(word_dict):
temp_list = []
for key, value in word_dict.items():
temp_list.append((value, key))

temp_list.sort()
return temp_list

def format_print(input_list, reverse, word_num):
if reverse:
input_list.sort(reverse=True)

print "\n", ("[Unique Words: " + str(word_num) + "]").center(35, "=")
print "-"35 + "\n", "%-16s %s %16s" % ("Word", "|", "Count"), "\n", "-"35
for count, word in input_list:
print "%-16s %s %16d" % (word, "|", count)

def word_count(_file, max_to_min=False):
txt = open(_file, "rU")
word_dict = {}
for line in txt:
if line.replace(" ", "") != ("\n" or None):
process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)

txt.close()
final_list = process_dict(w

Solution

Let's take a look at word_count, which appears to be the central function:

def word_count(_file, max_to_min=False):
    txt = open(_file, "rU")
    word_dict = {}
    for line in txt:
        if line.replace(" ", "") != ("\n" or None):
            process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)

    txt.close()
    final_list = process_dict(word_dict)
    format_print(final_list, max_to_min, len(word_dict))


_file is not a suitable name as according to PEP 8. It is more Pythonic to use with open(_file, "rU") as f too (known as context managers). With that, I rename _file to filename. These two points are mentioned in vnp's answer; however, I disagree with vnp's suggestion to catch the exception, as there is no need for a graceful exit. The program should crash if the file cannot be opened.

def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        word_dict = {}
        for line in f:
            if line.replace(" ", "") != ("\n" or None):
                process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)

    final_list = process_dict(word_dict)
    format_print(final_list, max_to_min, len(word_dict))


Your function calls the process_line function:

def process_line(words, word_dict):
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1


There's a builtin Python class for this called Counter. It has a dictionary interface too. With that, the process_line function is no longer necessary and we can rewrite this as:

from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        counter = Counter()
        for line in f:
            if line.replace(" ", "") != ("\n" or None):
                counter.update(filter(None, split("[^a-zA-Z']+", line.lower())))

    final_list = process_dict(counter)
    format_print(final_list, max_to_min, len(counter))


Secondly, you appear to be removing all spaces from the line so as to find out if the line is just a series of whitespace and contains no actual words. This can be easily done using the strip function.

from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        counter = Counter()
        for line in f:
            line = line.strip().lower()
            if not line:
                continue
            counter.update(filter(None, split("[^a-zA-Z']+", line)))

    final_list = process_dict(counter)
    format_print(final_list, max_to_min, len(counter))


filter can be rewritten as a generator, which feels more natural to me. That also uses less parenthesis, making the code more readable.

from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        counter = Counter()
        for line in f:
            line = line.strip().lower()
            if not line:
                continue
            counter.update(x for x in split("[^a-zA-Z']+", line) if x)

    final_list = process_dict(counter)
    format_print(final_list, max_to_min, len(counter))


Now, let's take a look at process_dict.

def process_dict(word_dict):
    temp_list = []
    for key, value in word_dict.items():
        temp_list.append((value, key))

    temp_list.sort()
    return temp_list


The first few lines can be done with a lambda. The new function looks like this:

def process_dict(counter):
    temp_list = map(lambda (a, b): (b, a), counter.items())
    temp_list.sort()
    return temp_list


But was there really a need for a function on its own? In fact, I'd argue that since your function is named word_count, the function should only count words. Hence, we should just return the counter object and let the printing be handled. Also, we usually name functions as verbs, so I'll change the name to count_words.

The above change affects our whole program structure. Hence, I'll show the final code before explaining the changes I made.

from collections import Counter
from re import split

BANNER = "-" * 35

def format_print(counter, is_reverse=False):
    lst = counter.items()
    lst.sort(key=lambda (a, b): (b, a), reverse=is_reverse)
    print ("[Unique Words: %d]" % len(lst)).center(35, "=")
    print "%-16s | %16s" % ("Word", "Count")
    print BANNER
    for word, count in lst:
        print "%-16s | %16d" % (word, count)

def count_words(filename):
    counter = Counter()
    with open(filename, "rU") as f:
        for line in f:
            line = line.strip().lower()
            if not line:
                continue
            counter.update(x for x in split("[^a-zA-Z']+", line) if x)
    return counter

format_print(count_words("Gettysburg.txt"), is_reverse=False)


I've removed max_to_min=False since we no longer sort the items in count_words.

I

Code Snippets

def word_count(_file, max_to_min=False):
    txt = open(_file, "rU")
    word_dict = {}
    for line in txt:
        if line.replace(" ", "") != ("\n" or None):
            process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)

    txt.close()
    final_list = process_dict(word_dict)
    format_print(final_list, max_to_min, len(word_dict))
def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        word_dict = {}
        for line in f:
            if line.replace(" ", "") != ("\n" or None):
                process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)

    final_list = process_dict(word_dict)
    format_print(final_list, max_to_min, len(word_dict))
def process_line(words, word_dict):
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        counter = Counter()
        for line in f:
            if line.replace(" ", "") != ("\n" or None):
                counter.update(filter(None, split("[^a-zA-Z']+", line.lower())))

    final_list = process_dict(counter)
    format_print(final_list, max_to_min, len(counter))
from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        counter = Counter()
        for line in f:
            line = line.strip().lower()
            if not line:
                continue
            counter.update(filter(None, split("[^a-zA-Z']+", line)))

    final_list = process_dict(counter)
    format_print(final_list, max_to_min, len(counter))

Context

StackExchange Code Review Q#69556, answer score: 9

Revisions (0)

No revisions yet.