HiveBrain v1.2.0

Frequency analysis for counting keywords

Submitted by: @import:stackexchange-codereview

Problem

I have a CSV with around 10,000,000 lines. Each line looks like:

keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn


The following script does a frequency count and writes the results to another text file:

from collections import Counter
import csv
import bz2
import sys

csv.field_size_limit(999999999)

all_terms = []

#with open('keywords.csv', 'rb') as f:
with bz2.BZ2File("keywords.csv.bz2", "r") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        for term in row:
            terms = term.split(",")
            for x in terms:
                all_terms.append( x.strip().lower() )
            if idx % 100000 == 0:
                print idx

c = Counter(all_terms)

with open("counts.txt", 'w+') as fl:
    for k,v in  c.most_common():
        fl.write( "{} | {}\n".format(k,v) )


I can't think of a way to minimize the memory usage with generators, and I'm not sure I could do a better job at the frequency counting than the Counter class from the collections module.

Solution

The obvious issue with this is that you first build the list of all keywords and then count them. It would make more sense to count them while you process each line. collections.Counter supports this with an overridden update method:

>>> c = Counter([1, 2, 3])
>>> c
Counter({1: 1, 2: 1, 3: 1})
>>> c.update([1, 4])
>>> c
Counter({1: 2, 2: 1, 3: 1, 4: 1})


Which you can use in your case like this:

from collections import Counter
import csv
import bz2

csv.field_size_limit(999999999)

c = Counter()

with bz2.BZ2File("keywords.csv.bz2") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        c.update(term.strip().lower() for term in row)
        if idx % 100000 == 0:
            print idx

with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common(): 
        fl.write("{} | {}\n".format(k, v))


I don't see the reason for the inner term.split(","): the CSV reader already splits each row into a list of fields, so I dropped the split and moved the for term in row loop into the generator expression.
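A quick interactive check makes the redundancy visible. This is a sketch in Python 3 syntax; csv.reader accepts any iterable of lines, so a plain list stands in for the file here:

```python
import csv

# csv.reader takes any iterable of lines, not just a file object.
row = next(csv.reader(["spam, eggs, ham"]))

# The reader has already split on commas; only strip/lower remain to be done.
print(row)                                     # ['spam', ' eggs', ' ham']
print([term.strip().lower() for term in row])  # ['spam', 'eggs', 'ham']
```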

You should test the performance of

c.update(term.strip().lower() for term in row)


vs

c.update(map(str.lower, map(str.strip, row)))


vs

c.update(map(lambda term: term.strip().lower(), row))


with your file, because I'm not sure how well the double map is implemented in Python 2.x in comparison to using a lambda or a generator expression.
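A minimal timing harness along these lines can settle it for your data. This is a sketch in Python 3, where the bare lower and strip become the bound methods str.lower and str.strip; the row contents are made up:

```python
import timeit
from collections import Counter

row = ["  Spam ", " EGGS", "ham  "] * 100  # synthetic stand-in for a CSV row

def gen_expr():
    c = Counter()
    c.update(term.strip().lower() for term in row)
    return c

def double_map():
    c = Counter()
    c.update(map(str.lower, map(str.strip, row)))
    return c

def lambda_map():
    c = Counter()
    c.update(map(lambda term: term.strip().lower(), row))
    return c

# Sanity check: all three variants must produce identical counts.
assert gen_expr() == double_map() == lambda_map()

for fn in (gen_expr, double_map, lambda_map):
    print(fn.__name__, timeit.timeit(fn, number=1000))
```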

As an additional aside, it might make sense to not do the update on every iteration, but to buffer the terms for some number of rows. The existing print check can double as the point where the buffer is flushed. With this code:

from collections import Counter
import csv
import bz2

csv.field_size_limit(999999999)

c = Counter()
buf = []
with bz2.BZ2File("keywords.csv.bz2") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        buf.extend(term.strip().lower() for term in row)
        if idx % 100000 == 0:
            print idx
            c.update(buf)
            buf = []
    c.update(buf)

with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))


I get a roughly 25% speed increase on my local test file (which has a lot of repeated lines). At this point it becomes a trade-off between the cost of update and extend. In the unbuffered code, update dominates the runtime:

29877245 function calls (29877242 primitive calls) in 8.559 seconds
...
999181    2.756    0.000    7.515    0.000 collections.py:528(update)


For the buffered version the time is shared between update and extend, and the total is lower:

23882242 function calls (23882239 primitive calls) in 5.870 seconds
...
    12    1.421    0.118    1.791    0.149 collections.py:528(update)
...
999180    0.779    0.000    3.065    0.000 {method 'extend' of 'list' objects}


I also tried essentially your approach, updating the counter only once at the end. That is about another 0.2 seconds faster, at the cost of potentially unbounded memory usage.
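For completeness, the fully streaming version of that idea feeds Counter a single generator over the whole file, so nothing but the Counter itself is accumulated. A sketch in Python 3, where the bz2 archive has to be opened in text mode for csv.reader:

```python
from collections import Counter
import bz2
import csv

def count_terms(path):
    # Open the compressed file in text mode so csv.reader gets strings
    # (the Python 2 code could hand csv.reader a BZ2File directly).
    with bz2.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        # One nested generator over every term of every row:
        # Counter consumes it lazily, one element at a time.
        return Counter(term.strip().lower()
                       for row in reader
                       for term in row)
```

Unlike buffering every term in a list, the generator itself is constant-size; only the Counter grows, with the number of distinct keywords.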


Context

StackExchange Code Review Q#148404, answer score: 3
