HiveBrain v1.2.0

Improving python3 processing speed (against a reference perl script)


Problem

In order to create a fast map/reduce processor for Hadoop, I'm evaluating several languages. I'm learning Python, so I would like my Python processor to go as fast as my Perl processor.

So, for this question, the point is to increase the performance of simpleprocessor.py. To measure improvement, there is a benchmarking suite available here: https://github.com/Fade78/Stream-processing-benchmark

simpleprocessor.py

#!/usr/bin/python3    
import sys
import re

scripttitle="PYTHON SIMPLE PROCESSOR (regex parsing)"

linecount=0
commentary=0
unknownline=0
DATA={}

pattern_data=re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)")

print("#",scripttitle,"\n#TRANSFORMED INPUT",sep='')

for line in sys.stdin:
    linecount+=1
    if line.startswith("#"):
        commentary+=1
        continue
    line=line.replace(',','')
    m=re.match(pattern_data,line)
    if m:
        i,k1,k2,value = m.group(1,2,3,4)
        i=int(i)
        value=float(value)
        try:
            DATA[k1][k2]+=value
        except KeyError:
            if k1 not in DATA: # Can't automatically create missing key and do the insert?
                DATA[k1]={}
            if k2 not in DATA[k1]:
                DATA[k1][k2]=value
            else:
                DATA[k1][k2]+=value
        print("{0},{1:.0f},{2},{3}".format(i,value,k2,k1))
    else:
        unknownline+=1

print("#DATADUMP")

keystat=0

for k1 in sorted(DATA):
    print(k1,':',sep='',end='')
    for k2 in sorted(DATA[k1]):
        keystat+=1
        print(' (',k2,':',int(DATA[k1][k2]),')',sep='',end='')
    print()

report="#{0}\n#{1}\nparsed line: {2}, commentary line: {3}, unknown line: {4}, keystat: {5}.".format(
               scripttitle, sys.version.replace("\n"," "), linecount, commentary, unknownline, keystat)

print("#REPORT\n"+report,file=sys.stdout)
print(report,file=sys.stderr)


In the benchmark output you can see that the Python processor is three times slower than the Perl processor.

To test, you can run the benchmark.

Solution

#!/usr/bin/python3    
import sys
import re

scripttitle="PYTHON SIMPLE PROCESSOR (regex parsing)"


Python convention is to put constants in ALL_CAPS

linecount=0
commentary=0
unknownline=0
DATA={}


This isn't a constant, so it really shouldn't be all caps.

pattern_data=re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)")

print("#",scripttitle,"\n#TRANSFORMED INPUT",sep='')


It's best to put all your actual logic inside a function rather than at the top level of a script. It'll run a bit faster that way.

for line in sys.stdin:
    linecount+=1
    if line.startswith("#"):
        commentary+=1
        continue


I find code is almost always more readable when you put things in an else block rather than use continue.

    line=line.replace(',','')
    m=re.match(pattern_data,line)
    if m:


Typically we'd check explicitly for None with if m is not None.

        i,k1,k2,value = m.group(1,2,3,4)


Actually, you could use m.groups() here. I'd also avoid such unhelpful names as i, k1, and k2.

        i=int(i)


I'm not sure why you bother doing this if you're just going to print it out anyway.

        value=float(value)
        try:
            DATA[k1][k2]+=value
        except KeyError:
            if k1 not in DATA: # Can't automatically create missing key and do the insert?
                DATA[k1]={}
            if k2 not in DATA[k1]:
                DATA[k1][k2]=value
            else:
                DATA[k1][k2]+=value


Python has a useful class called defaultdict. It lets you provide a default value for missing dictionary keys. It also has a class called Counter for counting things. So you could do this:

DATA = collections.defaultdict(collections.Counter)


Then

DATA[k1][k2] += value


will always work because the default cases are handled.
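A minimal, runnable sketch of that behavior (the keys here are made up for illustration):

```python
import collections

# defaultdict fills in a missing value on first access; Counter returns 0
# for missing keys, so += works without any KeyError handling.
DATA = collections.defaultdict(collections.Counter)

DATA["alpha"]["x"] += 2.5   # neither key existed before this line
DATA["alpha"]["x"] += 1.5
DATA["beta"]["y"] += 4.0

print(dict(DATA))
```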

        print("{0},{1:.0f},{2},{3}".format(i,value,k2,k1))


It'd probably be easier to follow using sep=',' rather than what you've done here.

    else:
        unknownline+=1

print("#DATADUMP")

keystat=0

for k1 in sorted(DATA):


Instead use for k1, items in sorted(DATA.items()): then items will be DATA[k1], and you avoid looking the data up again.

    print(k1,':',sep='',end='')
    for k2 in sorted(DATA[k1]):


Same here: use .items() to fetch keys and values together.

        keystat+=1
        print(' (',k2,':',int(DATA[k1][k2]),')',sep='',end='')
    print()
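Putting both .items() suggestions together, the dump loops could look like this (a sketch; the lines list is mine, used so the output is built up explicitly):

```python
DATA = {"k1": {"b": 2.0, "a": 1.5}}

keystat = 0
lines = []
# Iterate keys and values together instead of re-indexing the dicts.
for k1, row in sorted(DATA.items()):
    parts = [k1, ':']
    for k2, value in sorted(row.items()):
        keystat += 1
        parts.append(' (' + k2 + ':' + str(int(value)) + ')')
    lines.append(''.join(parts))

print('\n'.join(lines))
```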

report="#{0}\n#{1}\nparsed line: {2}, commentary line: {3}, unknown line: {4}, keystat: {5}.".format(
               scripttitle, sys.version.replace("\n"," "), linecount, commentary, unknownline, keystat)

print("#REPORT\n"+report,file=sys.stdout)
print(report,file=sys.stderr)


As for performance, remember that Perl is the Practical Extraction and Report Language. This kind of thing is Perl's bread and butter, so it's going to be hard for Python to win. That doesn't mean I'm not going to try, though.

EDIT: Performance

I've played with improving performance, a few points:

m=re.match(pattern_data,line)


A better way is to use

m = pattern_data.match(line)


They both do the same thing, but the first has a speed penalty associated with it.
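A quick check that both forms really are interchangeable, using the question's own pattern:

```python
import re

pattern_data = re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)")
line = "42 foo bar 3.14"

# Module-level call: re.match() first has to resolve its pattern argument
# on every call before delegating to the compiled object.
m1 = re.match(pattern_data, line)
# Bound method: dispatches straight to the compiled pattern.
m2 = pattern_data.match(line)

print(m1.groups())
print(m2.groups())
```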

print(' (',k2,':',int(DATA[k1][k2]),')',sep='',end='')


The print function is expensive, probably due to its versatility. Rewriting your code to use sys.stdout.write() directly gave much better performance.
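A sketch of what the rewrite could look like; dump is a name I made up, and taking the write callable as a parameter just makes the helper easy to test:

```python
import sys

def dump(data, out=sys.stdout.write):
    """Write the datadump section using raw write() calls instead of print()."""
    for k1 in sorted(data):
        out(k1 + ':')
        for k2 in sorted(data[k1]):
            # One write() per fragment; no print() argument/separator machinery.
            out(' (' + k2 + ':' + str(int(data[k1][k2])) + ')')
        out('\n')

dump({"k1": {"a": 2.0, "b": 3.7}})
```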

    try:
        DATA[k1][k2]+=value
    except KeyError:
        if k1 not in DATA: # Can't automatically create missing key and do the insert?
            DATA[k1]={}
        if k2 not in DATA[k1]:
            DATA[k1][k2]=value
        else:
            DATA[k1][k2]+=value


Replacing this with defaultdict or Counter harmed performance. I rewrote it as

        try:
            row = DATA[k1]
        except KeyError:
            row = DATA[k1] = {}
        try:
            row[k2] += value
        except KeyError:
            row[k2] = value


Which gave me a speed boost, because it avoids looking up the same values in the dictionary more than once.
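Wrapped as a small helper (the function name accumulate is mine), the caching version reads:

```python
def accumulate(data, k1, k2, value):
    """Add value into data[k1][k2], creating missing levels on the fly.

    Each dictionary level is looked up once per call: the inner dict is
    bound to a local name instead of being re-indexed via data[k1].
    """
    try:
        row = data[k1]
    except KeyError:
        row = data[k1] = {}
    try:
        row[k2] += value
    except KeyError:
        row[k2] = value

DATA = {}
accumulate(DATA, "alpha", "x", 2.5)
accumulate(DATA, "alpha", "x", 1.5)
print(DATA)  # {'alpha': {'x': 4.0}}
```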

With those changes I was able to get within one second of the speed of the Perl script, but I was still slower. My semi-educated guess is that Perl wins due to built-in support for sorting the keys of a hash during iteration. In Python the sorting is done in a separate function and may not be able to take advantage of the same things the Perl version can.

FURTHER PERFORMANCE

Put everything in a function. Python runs code inside functions faster than code at module level, largely because local variable lookups are cheaper than global ones.
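A minimal sketch of the pattern, assuming the parsing loop is moved into a function (process is my name for it):

```python
import re

def process(stream):
    """Count lines matching the data pattern.

    Inside the function, pattern_match and parsed are locals, so every
    access in the hot loop is a fast local lookup rather than a global one.
    """
    pattern_match = re.compile(
        r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)").match
    parsed = 0
    for line in stream:
        if pattern_match(line) is not None:
            parsed += 1
    return parsed

sample = ["1 a b 2.5\n", "# comment\n", "3 c d 4\n"]
print(process(sample))  # 2
```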

Replace

for key, value in sorted(data.items()):


with

for key in sorted(data):
    value = data[key]


The first looks nicer, but it requires Python to sort a list of tuples rather than a list of strings, which ends up more expensive.
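Both variants produce the same pairs in the same order; only what gets sorted differs:

```python
data = {"b": 2, "a": 1, "c": 3}

# Variant 1: sort (key, value) tuples -- each comparison carries
# tuple-comparison overhead on top of the string comparison.
pairs_via_items = sorted(data.items())

# Variant 2: sort bare keys (plain string comparisons), then do one
# dict lookup per key.
pairs_via_keys = [(key, data[key]) for key in sorted(data)]

print(pairs_via_items == pairs_via_keys)  # True
```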

Replace

sys.stdout.write(' ({}: {})'.format(k1, math.trunc(v)))


With

sys.stdout.write(''.join([' (', k1, ': ', str(math.trunc(v)), ')']))


String formatting is expensive, since Python has to parse the format string.
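A quick equivalence check of the two spellings:

```python
import math

k1, v = "alpha", 3.7

# format() scans the template for replacement fields on every call;
# join() just concatenates pieces that are already strings.
via_format = ' ({}: {})'.format(k1, math.trunc(v))
via_join = ''.join([' (', k1, ': ', str(math.trunc(v)), ')'])

print(via_format == via_join)  # True
```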


Context

StackExchange Code Review Q#10297, answer score: 2
