HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Word frequency generator in Python

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
frequencywordpythongenerator

Problem

Please critique my word frequency generator. I am a beginner programmer so any criticism are welcome.

Original Code: http://pastebin.com/rSRfbnCt

Here is my revised code after the feedback:

import string, time, webbrowser, collections, urllib2

def timefunc(function):
    def wrapped(*args):
        start = time.time()
        data = function(*args)
        end = time.time()
        timetaken = end - start
        print "Function: "+function.__name__+"\nTime taken:",timetaken
        return data
    return wrapped

@timefunc
def process_text(text_file):
    words = text_file.read().lower().split()
    words = [word.strip(string.punctuation+string.whitespace) for word in words]
    words = [word for word in words if word]#skips ''(None) elements
    return words

@timefunc
def create_freq_dict(wordlist):
    freq_dict = collections.Counter(wordlist)
    return freq_dict

@timefunc
def create_sorted_list(freqdict):
    sorted_list = [(value,key) for key,value in list(freqdict.items())]#list() provides python 3 compatibility
    sorted_list.sort(reverse=True)
    return sorted_list

@timefunc
def write_results(sorted_list):
    text_file = open('wordfreq.txt','w')
    text_file.write('Word Frequency List\n\n')
    rank = 0
    for word in sorted_list:
        rank += 1
        write_str = "[{0}] {1:-10}\n".format(rank, word[1],word[0])
        text_file.write(write_str)
    text_file.close()

## The Brothers Grimm
## This file can be obtained from Project Gutenberg:
## http://www.gutenberg.org/cache/epub/5314/pg5314.txt
web_file = urllib2.urlopen('http://www.gutenberg.org/cache/epub/5314/pg5314.txt')

wordlist = process_text(web_file)
freqdict = create_freq_dict(wordlist)
sorted_list = create_sorted_list(freqdict)
results = write_results(sorted_list)

webbrowser.open('wordfreq.txt')

print "END"

Solution

import string, time, math, webbrowser

def timefunc(function, *args):
    start = time.time()
    data = function(*args)
    end = time.time()
    timetaken = end - start


I'd recommend calling this time_taken as its slightly easier to read.

print "Function: "+function.__name__+"\nTime taken:",timetaken


Print already introduces newlines and combines different pieces. Take advantage of that.

print "Function: ", function.__name__
   print "Time Taken: ", time_taken


That's easier to follow

return data

def process_text(filename):


You never use filename in here, but you do use fin which is the same thing. typo?

t = []


Not a very descriptive name. I suggest coming up with something clearer

for line in fin:
        for i in line.split():


i usually means index which its not here

word = i.lower()
            word = word.strip(string.punctuation)
            if word != '':
                t.append(word)
    return t


I'd do this as

words = fin.read().lower().split()
 words = [word.strip() for word in words]
 words = [word for word in words if word]
 return words


I think its easier to follow and probably more efficient

def create_freq_dict(wordlist):
    d = dict()


d is not a very good name. Usually dicts are created with {} not dict(). No difference, but the first is generally preffered

for word in wordlist:
        if word not in d.keys():
            d[word] = 1
        else:
            d[word] += 1


Use d = collections.defaultdict(int) or d = collections.Counter(). Both will make it easier to count up like this. See the python documentation for collections. You should actually be able to write this function in one line

return d

def sort_dict(in_dict):
    t = []
    for key,value in in_dict.items():
        t.append((value, key))


in_dict.items() is a list already, there is no reason to copy the elements into the list. (NOTE: in Python 3.x in_dict.items() is no longer a list). Even if it wasn't a list you could do:

t = list(in_dict.items())


Which would do the same thing your code does

t.sort(reverse=True)
    return t


I'd implement this function as

return sorted(in_dict.items())


The sorted function takes anything sufficiently list-like and produces a sorted list from it.

def write_results(sorted_list):


sorted_list isn't a great name. It would be better to give an indication of what's in the list.

fout = open('wordfreq.txt','w')
    fout.write('Word Frequency List\n\n')
    r = 0
    for i in sorted_list:
        r += 1


Use for r, (key, value) in enumerate(sortedlist): That way you don't need to manage r yourself, and you can refer the value as value rather then the harder to read i[1].

fillamount = 20 - (len(i[1]) + len(str(r)))


Some of the those parens are unnecessary.

write_str = str(r)+': '+i[1]+' '+('-' * (fillamount-2))+' '+str(i[0])+'\n'


Multiplication has precedence, you don't need the parens to make that happen. Also, python has a method ljust which does this for you

write_str = (str(r) + ': ' + i[1]).ljust('-', 20) + str(i[0]) + '\n'


You may also want to consider using string formatting rather then adding strings

write_str = ('%d: %s' % (r, i[1])).just('-', 20) + '%d\n' % i[0]


I think its easier to follow, although I'd probably split across several lines

fout.write(write_str)
    fout.close()

## The Brothers Grimm
## This file can be obtained from Project Gutenberg:
## http://www.gutenberg.org/cache/epub/5314/pg5314.txt
fin = open('c:\Python27\My Programs\wx\grimm.txt')


fin presumable stands for file in. Give a name that indicates what's actually in it ike grim_text

wordlist = timefunc(process_text,fin)
freqdict = timefunc(create_freq_dict,wordlist)
sorted_list = timefunc(sort_dict,freqdict)
results = timefunc(write_results,sorted_list)


Very nice

webbrowser.open('wordfreq.txt')


That's a slightly unusual use of a webbrowser

print "END"

Code Snippets

import string, time, math, webbrowser

def timefunc(function, *args):
    start = time.time()
    data = function(*args)
    end = time.time()
    timetaken = end - start
print "Function: "+function.__name__+"\nTime taken:",timetaken
print "Function: ", function.__name__
   print "Time Taken: ", time_taken
return data

def process_text(filename):
for line in fin:
        for i in line.split():

Context

StackExchange Code Review Q#8648, answer score: 7

Revisions (0)

No revisions yet.