Frequency analysis for counting keywords
Problem
I have a CSV with around 10,000,000 lines. Each line looks like:
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn
keyword1, keyword2, ....keywordn

The script which does a frequency count and outputs to another text file:
from collections import Counter
import csv
import bz2
import sys
csv.field_size_limit(999999999)
all_terms = []
#with open('keywords.csv', 'rb') as f:
with bz2.BZ2File("keywords.csv.bz2", "r") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        for term in row:
            terms = term.split(",")
            for x in terms:
                all_terms.append(x.strip().lower())
        if idx % 100000 == 0:
            print idx
c = Counter(all_terms)
with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))

I can't think of a way to minimize the memory utilization using generators, and I'm not sure I could do a better job at the frequency counting than the Counter class from the collections module.

Solution
The obvious issue with this is that you first build the list of all keywords and then count them. It would make more sense to count them while you process each line.
collections.Counter supports this with an overridden update method:

>>> c = Counter([1, 2, 3])
>>> c
Counter({1: 1, 2: 1, 3: 1})
>>> c.update([1, 4])
>>> c
Counter({1: 2, 2: 1, 3: 1, 4: 1})

Which you can use in your case like this:
from collections import Counter
import csv
import bz2
csv.field_size_limit(999999999)
c = Counter()
with bz2.BZ2File("keywords.csv.bz2") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        c.update(term.strip().lower() for term in row)
        if idx % 100000 == 0:
            print idx
with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))

I don't see the reason for the inner term.split(",") in for term in row: the CSV reader already splits each row into a list, so I left it out and moved the processing into the generator.
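As a quick illustration (my own interpreter session with a made-up row, not part of the original answer), the reader already returns each row as a list of fields, which is also why the strip() is still needed:

>>> import csv
>>> list(csv.reader(["keyword1, keyword2, keyword3"]))
[['keyword1', ' keyword2', ' keyword3']]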
You should test the performance of

c.update(term.strip().lower() for term in row)

vs

c.update(map(str.lower, map(str.strip, row)))

vs

c.update(map(lambda term: term.strip().lower(), row))

with your file, because I'm not sure how well the double map is implemented in Python 2.x in comparison to using a lambda or a generator expression.
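If you want to compare them, here is a minimal benchmark sketch (my own addition, not from the original answer; the sample row is made up):

import timeit

row = ["Keyword One ", " keyword two", "KEYWORD3 "]  # made-up sample fields

variants = {
    "genexpr": lambda: list(term.strip().lower() for term in row),  # list() forces evaluation
    "double_map": lambda: map(str.lower, map(str.strip, row)),
    "lambda_map": lambda: map(lambda term: term.strip().lower(), row),
}

for name, func in sorted(variants.items()):
    # run each variant a million times on the same sample row
    print name, timeit.timeit(func, number=1000000)

Note that in Python 2.x map already returns a list, so only the generator expression variant needs list() to force evaluation.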
As an additional aside, it might make sense to not do the update on every iteration, but to buffer the terms for some number of rows. You could use the print as an already existing check for when to write it out. With this code:

from collections import Counter
import csv
import bz2
csv.field_size_limit(999999999)
c = Counter()
buf = []
with bz2.BZ2File("keywords.csv.bz2") as f:
    reader = csv.reader(f)
    for idx, row in enumerate(reader):
        buf.extend(term.strip().lower() for term in row)
        if idx % 100000 == 0:
            print idx
            c.update(buf)
            buf = []
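# flush the terms buffered since the last checkpoint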
c.update(buf)
with open("counts.txt", 'w+') as fl:
    for k, v in c.most_common():
        fl.write("{} | {}\n".format(k, v))

I get some 25% speed increase on my local test file (which has a lot of repeated lines). At this point it becomes a trade-off between the cost of update and extend. In my normal code, the update determines the runtime:

29877245 function calls (29877242 primitive calls) in 8.559 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
   999181    2.756    0.000    7.515    0.000 collections.py:528(update)

For the second code the cost is somewhat shared between update and extend, and lower in sum:

23882242 function calls (23882239 primitive calls) in 5.870 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
       12    1.421    0.118    1.791    0.149 collections.py:528(update)
      ...
   999180    0.779    0.000    3.065    0.000 {method 'extend' of 'list' objects}

I also tried basically your approach, updating the counter only at the end. This is about another 0.2 seconds faster, at the cost of potentially unlimited memory usage.
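For reference (my addition; the script name is hypothetical), profile output like the excerpts above can be produced with the standard cProfile module from the command line:

python -m cProfile -s cumulative count_keywords.py

The ncalls, tottime and cumtime columns in its output are the ones quoted above.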
Context
StackExchange Code Review Q#148404, answer score: 3