Improving python3 processing speed (against a reference perl script)
Problem
In order to create a fast map/reduce processor for Hadoop, I'm evaluating several languages. I'm learning Python, so I would like my Python processor to run as fast as my Perl processor.
So, for this question, the point is to increase the performance of simpleprocessor.py. To measure improvement, there is a benchmarking suite available here: https://github.com/Fade78/Stream-processing-benchmark
simpleprocessor.py
#!/usr/bin/python3
import sys
import re

scripttitle="PYTHON SIMPLE PROCESSOR (regex parsing)"
linecount=0
commentary=0
unknownline=0
DATA={}
pattern_data=re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)")

print("#",scripttitle,"\n#TRANSFORMED INPUT",sep='')
for line in sys.stdin:
    linecount+=1
    if line.startswith("#"):
        commentary+=1
        continue
    line=line.replace(',','')
    m=re.match(pattern_data,line)
    if m:
        i,k1,k2,value = m.group(1,2,3,4)
        i=int(i)
        value=float(value)
        try:
            DATA[k1][k2]+=value
        except KeyError:
            if k1 not in DATA: # Can't automatically create missing key and do the insert?
                DATA[k1]={}
            if k2 not in DATA[k1]:
                DATA[k1][k2]=value
            else:
                DATA[k1][k2]+=value
        print("{0},{1:.0f},{2},{3}".format(i,value,k2,k1))
    else:
        unknownline+=1

print("#DATADUMP")
keystat=0
for k1 in sorted(DATA):
    print(k1,':',sep='',end='')
    for k2 in sorted(DATA[k1]):
        keystat+=1
        print(' (',k2,':',int(DATA[k1][k2]),')',sep='',end='')
    print()

report="#{0}\n#{1}\nparsed line: {2}, commentary line: {3}, unknown line: {4}, keystat: {5}.".format(
    scripttitle, sys.version.replace("\n"," "), linecount, commentary, unknownline, keystat)
print("#REPORT\n"+report,file=sys.stdout)
print(report,file=sys.stderr)

In the benchmark output you can see that the python processor is three times slower than the perl processor.
To test, you can run the benchmark.
Solution
#!/usr/bin/python3
import sys
import re
scripttitle="PYTHON SIMPLE PROCESSOR (regex parsing)"

Python convention is to put constants in ALL_CAPS.
linecount=0
commentary=0
unknownline=0
DATA={}

This isn't a constant, so it really shouldn't be all caps.
pattern_data=re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)")
print("#",scripttitle,"\n#TRANSFORMED INPUT",sep='')

It's best to put all your actual logic inside a function rather than at the main level of a script. It'll run a bit faster that way.
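As a minimal sketch of that idea (the helper name `count_lines` is mine, not from the original script), the hot loop would move into a function:

```python
def count_lines(lines):
    # Inside a function, names like `count` and `line` are locals; CPython
    # resolves locals with a fast array lookup instead of the dict lookup
    # used for module-level (global) names, so the loop body runs faster.
    count = 0
    for line in lines:
        count += 1
    return count
```

The full script would follow the same pattern: wrap the processing in a `main()`-style function and call it once at the bottom of the file.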
for line in sys.stdin:
    linecount+=1
    if line.startswith("#"):
        commentary+=1
        continue

I find code is almost always more readable when you put things in the else block rather than use continue.
line=line.replace(',','')
m=re.match(pattern_data,line)
if m:

Typically we'd explicitly check for None, with

if m is not None:

i,k1,k2,value = m.group(1,2,3,4)

Actually you could use m.groups() here. I'd also avoid such unhelpful names as i, k1, and k2.

i=int(i)

I'm not sure why you bother doing this if you are just going to print it out anyway.
value=float(value)
try:
    DATA[k1][k2]+=value
except KeyError:
    if k1 not in DATA: # Can't automatically create missing key and do the insert?
        DATA[k1]={}
    if k2 not in DATA[k1]:
        DATA[k1][k2]=value
    else:
        DATA[k1][k2]+=value

Python has a useful class called defaultdict. It lets you provide a default value for a dictionary. It also has a class called Counter, for counting things. So you could do this:

DATA = collections.defaultdict(collections.Counter)

Then

DATA[k1][k2] += value

will always work because the default cases are handled.
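For illustration, here is a small self-contained sketch of that defaultdict/Counter combination (the sample triples are mine):

```python
import collections

# Counter returns 0 for missing keys, and defaultdict creates a missing
# inner Counter on first access, so += needs no initialization logic.
data = collections.defaultdict(collections.Counter)

for k1, k2, value in [("a", "x", 1.5), ("a", "x", 2.5), ("b", "y", 3.0)]:
    data[k1][k2] += value  # both levels of missing key are handled for us
```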
print("{0},{1:.0f},{2},{3}".format(i,value,k2,k1))

It'd probably be easier to follow using sep=',' rather than what you've done here.

else:
    unknownline+=1
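A sketch of the sep=',' form (the sample values are mine):

```python
i, value, k2, k1 = 3, 14.0, "bar", "foo"

# sep=',' joins the fields with commas, like the original format string did
print(i, format(value, ".0f"), k2, k1, sep=",")  # prints 3,14,bar,foo
```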
print("#DATADUMP")
keystat=0
for k1 in sorted(DATA):

Instead use

for k1, items in sorted(DATA.items()):

Then items will be DATA[k1] and you avoid re-looking up the data.

print(k1,':',sep='',end='')
for k2 in sorted(DATA[k1]):

Same here, use .items() to fetch keys and values together.

keystat+=1
print(' (',k2,':',int(DATA[k1][k2]),')',sep='',end='')
print()
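A sketch of the .items() version over a tiny nested dict (the sample data is mine; the output shape loosely mirrors the DATADUMP section):

```python
data = {"b": {"y": 2.0}, "a": {"x": 1.0, "z": 3.0}}

lines = []
for k1, items in sorted(data.items()):  # items is data[k1]; no second lookup
    parts = ["{}:".format(k1)]
    for k2, total in sorted(items.items()):  # same trick at the inner level
        parts.append(" ({}: {})".format(k2, int(total)))
    lines.append("".join(parts))
```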
report="#{0}\n#{1}\nparsed line: {2}, commentary line: {3}, unknown line: {4}, keystat: {5}.".format(
scripttitle, sys.version.replace("\n"," "), linecount, commentary, unknownline, keystat)
print("#REPORT\n"+report,file=sys.stdout)
print(report,file=sys.stderr)

As for performance, remember that Perl is the Practical Extraction and Report Language. This kind of thing is Perl's bread and butter, so it's going to be hard for Python to win. Doesn't mean I'm not going to try, though.
EDIT: Performance
I've played with improving performance, a few points:
m=re.match(pattern_data,line)

A better way is to use

m = pattern_data.match(line)

They both do the same thing, but the first has a speed penalty associated with it.
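A quick way to see this (a sketch; the sample line is mine): both calls produce the same match, but the module-level re.match() re-dispatches on its pattern argument every call, while the method call goes straight to the compiled object.

```python
import re
import timeit

pattern = re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)")
line = "42 foo bar 3.14"

# Both forms produce the same match...
assert re.match(pattern, line).groups() == pattern.match(line).groups()

# ...but timing them in a loop shows the method call is cheaper
# (exact numbers depend on the machine and Python version).
t_module = timeit.timeit(lambda: re.match(pattern, line), number=10000)
t_method = timeit.timeit(lambda: pattern.match(line), number=10000)
```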
print(' (',k2,':',int(DATA[k1][k2]),')',sep='',end='')

The print function is expensive, probably due to its versatility. Rewriting your code to use sys.stdout.write() directly gave much better performance.
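A sketch of the write()-based version (the helper name `dump_pair` is mine, for the sake of a runnable example):

```python
import sys

def dump_pair(k2, total):
    # Build the string once and make a single write() call instead of
    # routing five arguments through print()'s generic machinery.
    sys.stdout.write(" (" + k2 + ": " + str(int(total)) + ")")

dump_pair("x", 4.0)
sys.stdout.write("\n")
```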
try:
    DATA[k1][k2]+=value
except KeyError:
    if k1 not in DATA: # Can't automatically create missing key and do the insert?
        DATA[k1]={}
    if k2 not in DATA[k1]:
        DATA[k1][k2]=value
    else:
        DATA[k1][k2]+=value

Replacing this with defaultdict or Counter harmed performance. I rewrote it as

try:
    row = DATA[k1]
except KeyError:
    row = DATA[k1] = {}
try:
    row[k2] += value
except KeyError:
    row[k2] = value

which gave me a speed boost because it avoids looking up the same values in the dictionary more than once.
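The rewrite above can be exercised as a self-contained sketch (the helper `accumulate` and the sample data are mine):

```python
DATA = {}

def accumulate(k1, k2, value):
    # Fetch the inner dict once and reuse it, so the common case costs
    # one outer lookup and one inner lookup instead of repeated DATA[k1][k2].
    try:
        row = DATA[k1]
    except KeyError:
        row = DATA[k1] = {}
    try:
        row[k2] += value
    except KeyError:
        row[k2] = value

accumulate("a", "x", 1.5)
accumulate("a", "x", 2.5)
accumulate("b", "y", 3.0)
```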
With those changes I was able to get within one second of the speed of the perl script. But I was still slower. My semi-educated guess is that Perl wins due to built-in support for sorting the keys of a hash during iteration. In Python the sorting is done in a separate function and may not be able to take advantage of the same things the Perl version can.
FURTHER PERFORMANCE
Put everything in a function. Python optimizes code inside functions more than code outside of functions.
Replace

for key, value in sorted(data.items()):

with

for key in sorted(data):
    value = data[key]

The first looks nicer, but it requires Python to sort a list of tuples rather than a list of strings, which ends up more expensive.
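A sketch comparing the two forms on throwaway data (mine): both produce the same pairs, but the keys-only sort compares plain strings instead of tuples.

```python
import timeit

data = {str(n): n for n in range(1000)}

def by_items():
    return [(k, v) for k, v in sorted(data.items())]

def by_keys():
    return [(k, data[k]) for k in sorted(data)]

# Same result either way; only the sort cost differs.
assert by_items() == by_keys()
t_items = timeit.timeit(by_items, number=100)
t_keys = timeit.timeit(by_keys, number=100)
```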
Replace

sys.stdout.write(' ({}: {})'.format(k1, math.trunc(v)))

with

sys.stdout.write(''.join([' (', k1, ': ', str(math.trunc(v)), ')']))

String formatting is expensive since Python has to parse through the format string.
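The two forms are equivalent, which a quick sketch confirms (sample values mine):

```python
import math

k1, v = "x", 4.7
formatted = " ({}: {})".format(k1, math.trunc(v))
joined = "".join([" (", k1, ": ", str(math.trunc(v)), ")"])

# Identical output, but join() skips parsing a format string at runtime.
assert formatted == joined == " (x: 4)"
```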
Context
StackExchange Code Review Q#10297, answer score: 2