HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Checking literary works for Zipf's Law

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
checkingzipfliteraryworksforlaw

Problem

I wrote a Python script that, given a literary work in text file format, takes it apart, makes a word list, counts how many times each words appears in the literary work, and then checks if it follows Zipf's Law by performing the chi-square test.

For those of you who don't know Zipf's Law, put simply, it is a law that states that in literary works, the frequency of a word is inversely proportional to its rank in the frequency table. So, the second most common word will appear half as much as the most common words, the third most common word will appear a third as often, and so on.

Also, for those who don't know the Chi-Square Test, it is a test that looks for statistical significance.

It commonly takes the form of:

$$x^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

I made this script because I had recently learned to code Python and I wanted some small project to occupy my mind. However, as I am rather new, I figured that my code was not exactly "Pythonic", so I was hoping that some kind souls could help me to tell me what I'm doing wrong and provide other helpful shortcuts.

```
from string import maketrans
from operator import itemgetter, attrgetter, methodcaller
import math

punct = [".", ",", "!", "?", "", "\n", "'", '"', "1", "2", "3", "4", "5", "6", "7", "8", "9", "0", "_"]

f_open = open('C:\Python27\Pride and Prejudice.txt', 'r')
lines = f_open.readlines()
f_open.close()
lines = filter(lambda name: name.strip(), lines)

def cleanup(text):
counter = 0
new_text = []
while counter < len(text):
if text[counter] in punct:
counter += 1
elif text[counter] == " " or text[counter] == "-":
new_text.append(" ")
counter += 1
else:
new_text.insert(counter, text[counter].lower())
counter += 1
words = "".join(new_text)
return words.split()

def make_list(line):
words = []
for x in line:
words = words + cleanup(x)
return

Solution

In no particular order:

-
Your code needs more documentation.

There are no comments or docs to help me understand how the code works, or why it was written the way it was. If I don’t know why the code was written this way, I can’t tell whether it’s working correctly – this makes it harder to review, edit and maintain.

There’s Python PEP 257, which describes how you should document a function with docstrings. (PEP means “Python Enhancement Proposal”, like a public design doc for Python.)

As a side note, when I read make_list(), I assumed that it took a single line (one string), which was a little confusing. It wasn't until I saw it called later on that I realised it took an iterable of strings – a collection of lines. Beware confusing typos like this.

-
Prefer comprehensions to loops and lambdas.

Python has a feature called comprehensions, which is a very powerful way for quickly constructing lists, sets and dictionaries. If you’re unfamiliar with the concept, I quite like this introduction. This is often considered to be one of the highlight features of Python.

This lets you write for loops in a more succinct way, for example in get_numbers():

def get_numbers(lists):
    return [x[1] for x in lists]


If the for loop is simple, these are generally considered more Pythonic.

Lambdas are considered fairly un-Pythonic, especially when comprehensions are available. For example, I'd rewrite line 10 as:

lines = [line.strip() for line in lines if line.strip()


-
Use with open() rather than open() … close().

The preferred approach for opening files is as follows:

with open(pride_and_prejudice_path, 'r') as infile:
    # do stuff with infile


This construction ensures that the file is always closed correctly – if you make explicit calls to open() and close(), something could go wrong in the middle and you might not close the file correctly.

Note also that you can iterate over the contents of the file directly, rather than calling readlines() first:

with open(pride_and_prejudice_path, 'r') as infile:
    lines = [name.strip() for name in infile if name.strip()]


Note that calling readlines() will store every line in a list, before you can do anything – depending on the size of the file you’re looking at, this can be prohibitively large. Often better to skip readlines() and iterate over the file directly.

-
Try to avoid unnecessary imports.

The only import you’re actually using is operator.itemgetter(). It’s good practice to avoid importing things that you don’t use.

In the same vein, don’t create variables you don’t need, e.g. the counter variable in count_words().

-
Read PEP 8, the Python style guide.

In general, your code is pretty good. A few small things I noticed:

  • On line 41, you seem to have over-indented by 4 spaces.



-
On line 42, you shouldn’t have spaces around the = sign in keyword arguments, i.e.,

return sorted(word_count, key=itemgetter(1), reverse=True)


-
Line 56 should be broken up for readability, probably by assigning new_list[counted_words.index(x)] to a constant. (Regardless of the line length in the standard, it’s just unwieldy.)

-
Use __future__.division instead of casting to floats.

I can see that you’re casting to float in quite a few places, because Python 2.7 defaults to integer division. Python 3 behaves in a more sensible manner (and I’d recommend using Py3 if you’re starting from scratch). To get this behaviour in the older versions of Python, you can add

from __future__ import division


to the top of your code. This will let you clean up your float() calls.

Note that __future__ imports must precede any other imports.

-
Use data structures other than lists.

In count_words(), you're using a list of lists to keep track of how many times a word appears in the text. A better data structure would be a dictionary.

Further, you're constructing multiple intermediate lists to get this structure. All of those consume memory and take time to construct. If you move this up to where we iterate over the file, we can skip those lists and go much faster:

word_counter = dict()
with open('prideandprejudice.txt', 'r') as infile:
    for line in infile:
        if not line.strip():
            continue
        for word in cleanup(line):
            if word in word_counter:
                word_counter[word] += 1
            else:
                word_counter[word] = 1
counted_words = sorted(word_counter.items(),
                       key=itemgetter(1),
                       reverse=True)


On my machine, your original code took ~40s to run. My updated version runs in

-
Use enumerate() for a more Pythonic loop.

Your cleanup() function has a classic pattern for people coming to Python from other languages:

i = 0
while i < len(iterable):
    item = iterable[i]
    # do stuff with item
    i += 1


Python has loop constructions that make this much nicer. In `clea

Code Snippets

def get_numbers(lists):
    return [x[1] for x in lists]
lines = [line.strip() for line in lines if line.strip()
with open(pride_and_prejudice_path, 'r') as infile:
    # do stuff with infile
with open(pride_and_prejudice_path, 'r') as infile:
    lines = [name.strip() for name in infile if name.strip()]
return sorted(word_count, key=itemgetter(1), reverse=True)

Context

StackExchange Code Review Q#124672, answer score: 2

Revisions (0)

No revisions yet.