Word count and most frequent words from input text, excluding stop words


Problem

# Program will display a welcome message to the user
print("Welcome! This program will analyze your file to provide a word count, the top 30 words and remove the following stopwords.")

s = open('Obama 2009.txt','r').read()  # Open the input file

# Program will count the characters in text file
num_chars = len(s)

# Program will count the lines in the text file
num_lines = s.count('\n')

# Program will call split with no arguments
words = s.split()
d = {}
for w in words:
    if w in d:
        d[w] += 1
    else:
        d[w] = 1

num_words = sum(d[w] for w in d)

lst = [(d[w],w) for w in d]
lst.sort()
lst.reverse()

# Program assumes user has downloaded an imported stopwords from NLTK
from nltk.corpus import stopwords # Import the stop word list
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english')) # creating a set makes the searching faster
print ([word for word in lst if word not in stop_words])

# Program will print the results
print('Your input file has characters = '+str(num_chars))
print('Your input file has lines = '+str(num_lines))
print('Your input file has the following words = '+str(num_words))

print('\n The 30 most frequent words are /n')

i = 1
for count, word in lst[:50]:
    print('%2s. %4s %s' %(i,count,word))
    i+= 1

print("Thank You! Goodbye.")

Solution

The code is clear, if a bit over-commented; all in all very good, I'd
say. Once you are more familiar with the language, calls like len(...)
should be obvious enough that they generally don't warrant a comment of
their own.

Also, single-character names are generally discouraged, as they don't
carry much meaning and can be harder to read. Off the top of my head, s
is vaguely a string and d could be a dictionary, but which string and
which dictionary I have no idea.

Next, while it works, the general recommendation in longer scripts is
not to call open directly without with, unless you take care to close
the file yourself. The input file should therefore be read as follows:

# Open the input file
with open('Obama 2009.txt', 'r') as f:
    s = f.read()


The with statement takes care to always call close on the open file.
Files left open would otherwise cause problems in long-running
processes by tying up system resources (open file handles).
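To see that the file really is closed once the with block ends, here is a minimal sketch. The helper name read_text and the temporary sample file are assumptions made up for this illustration; they are not part of the original script.

```python
import os
import tempfile


def read_text(path):
    """Read a whole text file and report whether it was closed afterwards."""
    with open(path, 'r') as f:
        text = f.read()
    # Past the with block the file object still exists, but it is closed.
    return text, f.closed


# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('hello world\n')
    sample_path = tmp.name

text, was_closed = read_text(sample_path)
os.remove(sample_path)  # clean up the demo file

print(repr(text))
print(was_closed)
```

The same closing happens if f.read() raises an exception, which is the main advantage over calling close by hand.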

The word counting is again fine, but it can be written more compactly,
notably by using the Counter class from collections.

num_words on the other hand does too much work and can just be
len(words).
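The two points above (Counter for the tallying, len(words) for the total) could look roughly like this. The function name count_words and the sample sentence are made up for the illustration; the original script reads its text from a file instead.

```python
from collections import Counter


def count_words(text):
    """Split on whitespace and tally each word (hypothetical helper)."""
    words = text.split()
    counts = Counter(words)  # replaces the manual if/else dictionary loop
    return words, counts


sample = "the cat and the hat and the bat"  # made-up input
words, counts = count_words(sample)

# The total is simply the length of the word list.
num_words = len(words)
print(num_words)

# most_common(n) returns (word, count) pairs, highest count first.
print(counts.most_common(2))
```

Note that most_common yields (word, count) pairs, the reverse of the (count, word) tuples built in the original lst, so any loop that unpacks them needs its variables swapped.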

list.sort actually has a reverse parameter, so you can just use that
instead of the extra call:

lst.sort(reverse=True)


Usually imports go at the start of the file. It doesn't affect how the
program works much, but you would, for example, find out early if a
library wasn't available.

For formatting output there are a ton of options. At some point you
might want to look at the format method for strings.
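As a small sketch of what that could look like for the ranking line, the hypothetical helper below (format_rank_line is not a name from the original code) reproduces the layout of the '%2s. %4s %s' template with str.format:

```python
def format_rank_line(rank, count, word):
    """Format one ranking line with str.format (hypothetical helper)."""
    # '{:>2}' right-aligns in a field of width 2, '{:>4}' in width 4,
    # matching the '%2s' and '%4s' fields of the original template.
    return '{:>2}. {:>4} {}'.format(rank, count, word)


print(format_rank_line(1, 42, 'the'))
```

The explicit '>' alignment matters here: without it, format would left-align strings and right-align numbers, so the two styles would not always line up the same way.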

Next, the loop at the end prints the 50 most frequent words, not
30 as the output message claims. That is a good opportunity to
introduce a constant for the number of words to print:

PRINT_WORDS = 50

print('\n The {} most frequent words are \n'.format(PRINT_WORDS))


Since it's a constant, its name is in upper case. I've also used format
here, since placeholders look a bit nicer and are more regular than
either concatenation or the % operator. But that is largely a matter of
preference.

Lastly, the final loop uses an explicit counter, which, again, has an
equivalent and nicer standard library helper called enumerate.

for i, (count, word) in enumerate(lst[:PRINT_WORDS], 1):
    print('%2s. %4s %s' % (i, count, word))


enumerate normally starts counting at zero, but it supports an optional
start parameter, so the loop passes 1 to number the words from one.
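To make the effect of the start argument concrete, here is a short sketch on made-up data. The list contents and the small PRINT_WORDS value are assumptions chosen to keep the output short; they do not come from the original input file.

```python
PRINT_WORDS = 2  # hypothetical small limit for the demo

# Made-up (count, word) pairs, already sorted by descending count.
lst = [(3, 'the'), (2, 'and'), (1, 'cat')]

# Without a start argument, enumerate counts from zero.
default_ranks = [i for i, _ in enumerate(lst[:PRINT_WORDS])]

# With start=1 the ranks are one-based, so no manual counter is needed.
one_based_ranks = [i for i, _ in enumerate(lst[:PRINT_WORDS], 1)]

lines = ['%2s. %4s %s' % (i, count, word)
         for i, (count, word) in enumerate(lst[:PRINT_WORDS], 1)]

print(default_ranks)
print(one_based_ranks)
for line in lines:
    print(line)
```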

Context

StackExchange Code Review Q#126518, answer score: 3
