patternpythonMinor
Count occurrence of words in a .txt file
Viewed 0 times
filewordsoccurrencecounttxt
Problem
I'm taking an intro to programming class and although I've learned some things I didn't know before (I've been using Python for about 1.5 years) I feel like I've not progressed much in writing "beautiful" code. My professor is committed to keeping this as a general intro class, and chose Python for its "friendliness" as an initial language. I can't really tell how much I'm improving (or not) as the grades at this point seem really inflated, so I wanted to get some input from here.
We were assigned an exercise in class to take a .txt file (in our case a .txt of the Gettysburg Address) and count the number of times words occur. We were then to output our results in a neatly formatted fashion. We've been getting drilled in writing functions and are starting to work with dictionaries, so I came up with this solution with those things in mind. I want to know how I can improve my code (i.e make it more efficient, Pythonic, and embrace what Python brings to the table as a language).
```
from re import split
def process_line(words, word_dict):
for word in words:
if word in word_dict:
word_dict[word] += 1
else:
word_dict[word] = 1
def process_dict(word_dict):
temp_list = []
for key, value in word_dict.items():
temp_list.append((value, key))
temp_list.sort()
return temp_list
def format_print(input_list, reverse, word_num):
if reverse:
input_list.sort(reverse=True)
print "\n", ("[Unique Words: " + str(word_num) + "]").center(35, "=")
print "-"35 + "\n", "%-16s %s %16s" % ("Word", "|", "Count"), "\n", "-"35
for count, word in input_list:
print "%-16s %s %16d" % (word, "|", count)
def word_count(_file, max_to_min=False):
txt = open(_file, "rU")
word_dict = {}
for line in txt:
if line.replace(" ", "") != ("\n" or None):
process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)
txt.close()
final_list = process_dict(w
We were assigned an exercise in class to take a .txt file (in our case a .txt of the Gettysburg Address) and count the number of times words occur. We were then to output our results in a neatly formatted fashion. We've been getting drilled in writing functions and are starting to work with dictionaries, so I came up with this solution with those things in mind. I want to know how I can improve my code (i.e make it more efficient, Pythonic, and embrace what Python brings to the table as a language).
```
from re import split
def process_line(words, word_dict):
for word in words:
if word in word_dict:
word_dict[word] += 1
else:
word_dict[word] = 1
def process_dict(word_dict):
temp_list = []
for key, value in word_dict.items():
temp_list.append((value, key))
temp_list.sort()
return temp_list
def format_print(input_list, reverse, word_num):
if reverse:
input_list.sort(reverse=True)
print "\n", ("[Unique Words: " + str(word_num) + "]").center(35, "=")
print "-"35 + "\n", "%-16s %s %16s" % ("Word", "|", "Count"), "\n", "-"35
for count, word in input_list:
print "%-16s %s %16d" % (word, "|", count)
def word_count(_file, max_to_min=False):
txt = open(_file, "rU")
word_dict = {}
for line in txt:
if line.replace(" ", "") != ("\n" or None):
process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)
txt.close()
final_list = process_dict(w
Solution
Let's take a look at
Your function calls the
There's a builtin Python class for this called
Secondly, you appear to be removing all spaces from the line so as to find out if the line is just a series of whitespace and contains no actual words. This can be easily done using the
Now, let's take a look at
The first few lines can be done with a lambda. The new function looks like this:
But was there really a need for a function on its own? In fact, I'd argue that since your function is named
The above change affects our whole program structure. Hence, I'll show the final code before explaining the changes I made.
I've removed
I
word_count, which appears to be the central function:def word_count(_file, max_to_min=False):
txt = open(_file, "rU")
word_dict = {}
for line in txt:
if line.replace(" ", "") != ("\n" or None):
process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)
txt.close()
final_list = process_dict(word_dict)
format_print(final_list, max_to_min, len(word_dict))_file is not a suitable name as according to PEP 8. It is more Pythonic to use with open(_file, "rU") as f too (known as context managers). With that, I rename _file to filename. These two points are mentioned in vnp's answer; however, I disagree with vnp's suggestion to catch the exception, as there is no need for a graceful exit. The program should crash if the file cannot be opened.def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
word_dict = {}
for line in f:
if line.replace(" ", "") != ("\n" or None):
process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)
final_list = process_dict(word_dict)
format_print(final_list, max_to_min, len(word_dict))Your function calls the
process_line function:def process_line(words, word_dict):
for word in words:
if word in word_dict:
word_dict[word] += 1
else:
word_dict[word] = 1There's a builtin Python class for this called
Counter. It has a dictionary interface too. With that, the process_line function is no longer necessary and we can rewrite this as:from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
counter = Counter()
for line in f:
if line.replace(" ", "") != ("\n" or None):
counter.update(filter(None, split("[^a-zA-Z']+", line.lower())))
final_list = process_dict(counter)
format_print(final_list, max_to_min, len(counter))Secondly, you appear to be removing all spaces from the line so as to find out if the line is just a series of whitespace and contains no actual words. This can be easily done using the
strip function.from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
counter = Counter()
for line in f:
line = line.strip().lower()
if not line:
continue
counter.update(filter(None, split("[^a-zA-Z']+", line)))
final_list = process_dict(counter)
format_print(final_list, max_to_min, len(counter))filter can be rewritten as a generator, which feels more natural to me. That also uses less parenthesis, making the code more readable.from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
counter = Counter()
for line in f:
line = line.strip().lower()
if not line:
continue
counter.update(x for x in split("[^a-zA-Z']+", line) if x)
final_list = process_dict(counter)
format_print(final_list, max_to_min, len(counter))Now, let's take a look at
process_dict.def process_dict(word_dict):
temp_list = []
for key, value in word_dict.items():
temp_list.append((value, key))
temp_list.sort()
return temp_listThe first few lines can be done with a lambda. The new function looks like this:
def process_dict(counter):
temp_list = map(lambda (a, b): (b, a), counter.items())
temp_list.sort()
return temp_listBut was there really a need for a function on its own? In fact, I'd argue that since your function is named
word_count, the function should only count words. Hence, we should just return the counter object and let the printing be handled. Also, we usually name functions as verbs, so I'll change the name to count_words.The above change affects our whole program structure. Hence, I'll show the final code before explaining the changes I made.
from collections import Counter
from re import split
BANNER = "-" * 35
def format_print(counter, is_reverse=False):
lst = counter.items()
lst.sort(key=lambda (a, b): (b, a), reverse=is_reverse)
print ("[Unique Words: %d]" % len(lst)).center(35, "=")
print "%-16s | %16s" % ("Word", "Count")
print BANNER
for word, count in lst:
print "%-16s | %16d" % (word, count)
def count_words(filename):
counter = Counter()
with open(filename, "rU") as f:
for line in f:
line = line.strip().lower()
if not line:
continue
counter.update(x for x in split("[^a-zA-Z']+", line) if x)
return counter
format_print(count_words("Gettysburg.txt"), is_reverse=False)I've removed
max_to_min=False since we no longer sort the items in count_words.I
Code Snippets
def word_count(_file, max_to_min=False):
txt = open(_file, "rU")
word_dict = {}
for line in txt:
if line.replace(" ", "") != ("\n" or None):
process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)
txt.close()
final_list = process_dict(word_dict)
format_print(final_list, max_to_min, len(word_dict))def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
word_dict = {}
for line in f:
if line.replace(" ", "") != ("\n" or None):
process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)
final_list = process_dict(word_dict)
format_print(final_list, max_to_min, len(word_dict))def process_line(words, word_dict):
for word in words:
if word in word_dict:
word_dict[word] += 1
else:
word_dict[word] = 1from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
counter = Counter()
for line in f:
if line.replace(" ", "") != ("\n" or None):
counter.update(filter(None, split("[^a-zA-Z']+", line.lower())))
final_list = process_dict(counter)
format_print(final_list, max_to_min, len(counter))from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
counter = Counter()
for line in f:
line = line.strip().lower()
if not line:
continue
counter.update(filter(None, split("[^a-zA-Z']+", line)))
final_list = process_dict(counter)
format_print(final_list, max_to_min, len(counter))Context
StackExchange Code Review Q#69556, answer score: 9
Revisions (0)
No revisions yet.