HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Statistics about gaps in DNA sequences

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
statisticsdnaaboutsequencesgaps

Problem

Noobie to Numba here, I'm trying to get faster code from existing function but the result is not faster. 10 times faster would be heaven, but I know nothing about optimization.
This is code about parsing gaps in DNA sequences pairwise alignment and doing statistics about it.
The code looks like this:

```
import re
import time
import numpy as np
from numba import autojit, int32, complex64
sstart = 10
send = 52
absoluteRoiStart = 19 #rank of the first nucleotide in the ROI
absoluteRoiStop = 27 #rank of the the first nucleotide after the ROI
#ROI is here 'TATCGA---CAG|TA-----TACTA-C|G--TTGAGAGAGAC-CCCA'
#between | 'T--CGACCAC--|-GATCGAG---ATC|GGCTT--------CTC--A'
source = 'TATCGA---CAGTA-----TACTA-CG--TTGAGAGAGAC-CCCA'
sequence = 'T--CGACCAC---GATCGAG---ATCGGCTT--------CTC--A'
realSource = 'AAGGTTCCAATATCGACAGTATACTACGTTGAGAGAGACCCCACATGACTGACTACGT'

tresholds = {
"DEL" : {
"other" : 2,
"slippage": 2,
"quantity" : 7
},
"INS" : {
"other" : 3,
"slippage": 3,
"quantity" : 7
},
"MUT" : {
"other" : 3,
"slippage": 3,
"quantity" : 7
},
"NA" : {
"other" : 3,
"slippage": 3,
"quantity" : 7
}
}

def getAllGaps(sequence1, sequence2):
starts = []
stops = []
lengths = []
types = []
locations = []
gap = '(\-)+'
x = re.compile(gap)
for m in x.finditer(sequence1):
#Get Gap satrt, stop and length
start,stop = m.span()
#Test if Gap is slippage(compression or extension)
if start > 1 and stop = absoluteRoiStart) & (absoluteIndex = relativeRoiStart) & (events[1] = relativeRoiStart)) | ((events[0] >= relativeRoiStart) & (events[1] <= relativeRoiStop))]
print(deletionOverlappingRoiOrStartingInRoi)

t0 = time.time()
getAlignmentData(source, sequence, sstart, tresholds)
t1 = time.time()
getAlignmentData(source, sequence, sstart, tresholds)
t2 = time.time()

print(str(t1-t0)+' to fir

Solution

After profiling your code as:

import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()
for it in range(0,10000): 
    getAlignmentData(source, sequence, sstart, tresholds)

pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()


You will get:

1370129 function calls (1370122 primitive calls) in 3.249 seconds

Ordered by: cumulative time

ncalls tottime percall cumtime percall filename:lineno(function)
10000 1.003 0.000 3.249 0.000 test.py:71(getAlignmentData)
20000 1.071 0.000 1.684 0.000 test.py:38(getAllGaps)
20000 0.171 0.000 0.171 0.000 {numpy.core.multiarray.array}
20000 0.160 0.000 0.160 0.000 {method 'reduce' of 'numpy.ufunc' objects}
10000 0.018 0.000 0.121 0.000 {method 'min' of 'numpy.ndarray' objects}
400022 0.119 0.000 0.119 0.000 {method 'append' of 'list' objects}
180000 0.107 0.000 0.107 0.000 {method 'replace' of 'str' objects}
100000 0.106 0.000 0.106 0.000 {range}


You can use Cython or f2py to optimise memory management starting with getAllGaps, which is simpler, and then go for getAlignmentData.

Keep in mind that you need to deactivate outputs and take long runs to measure real speedup.

If you compile your code using nuitka you can get 9x speedup.

110128 function calls (110121 primitive calls) in 0.355 seconds

Ordered by: cumulative time

ncalls tottime percall cumtime percall filename:lineno(function)
20000 0.158 0.000 0.158 0.000 {method 'reduce' of 'numpy.ufunc' objects}
10000 0.011 0.000 0.101 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/_methods.py:28(_amin)
10000 0.022 0.000 0.096 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/numeric.py:141(ones)
20000 0.025 0.000 0.081 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py:192(compile)
10000 0.008 0.000 0.076 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/_methods.py:25(_amax)
20000 0.056 0.000 0.056 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py:230(_compile)
10000 0.037 0.000 0.037 0.000 {numpy.core.multiarray.copyto}
10000 0.037 0.000 0.037 0.000 {numpy.core.multiarray.empty}


All you need to do is install nuitka and run:

$ nuitka mycode.py


You also need a larger dataset to run a proper benchmark. Since it's easy to keep small data in cache, the profiler can provide misleading results.

Code Snippets

import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()
for it in range(0,10000): 
    getAlignmentData(source, sequence, sstart, tresholds)

pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()

Context

StackExchange Code Review Q#86051, answer score: 5

Revisions (0)

No revisions yet.