patternpythonMinor
Statistics about gaps in DNA sequences
Viewed 0 times
statisticsdnaaboutsequencesgaps
Problem
Noobie to Numba here, I'm trying to get faster code from existing function but the result is not faster. 10 times faster would be heaven, but I know nothing about optimization.
This is code about parsing gaps in DNA sequences pairwise alignment and doing statistics about it.
The code looks like this:
```
import re
import time
import numpy as np
from numba import autojit, int32, complex64
sstart = 10
send = 52
absoluteRoiStart = 19 #rank of the first nucleotide in the ROI
absoluteRoiStop = 27 #rank of the the first nucleotide after the ROI
#ROI is here 'TATCGA---CAG|TA-----TACTA-C|G--TTGAGAGAGAC-CCCA'
#between | 'T--CGACCAC--|-GATCGAG---ATC|GGCTT--------CTC--A'
source = 'TATCGA---CAGTA-----TACTA-CG--TTGAGAGAGAC-CCCA'
sequence = 'T--CGACCAC---GATCGAG---ATCGGCTT--------CTC--A'
realSource = 'AAGGTTCCAATATCGACAGTATACTACGTTGAGAGAGACCCCACATGACTGACTACGT'
tresholds = {
"DEL" : {
"other" : 2,
"slippage": 2,
"quantity" : 7
},
"INS" : {
"other" : 3,
"slippage": 3,
"quantity" : 7
},
"MUT" : {
"other" : 3,
"slippage": 3,
"quantity" : 7
},
"NA" : {
"other" : 3,
"slippage": 3,
"quantity" : 7
}
}
def getAllGaps(sequence1, sequence2):
starts = []
stops = []
lengths = []
types = []
locations = []
gap = '(\-)+'
x = re.compile(gap)
for m in x.finditer(sequence1):
#Get Gap satrt, stop and length
start,stop = m.span()
#Test if Gap is slippage(compression or extension)
if start > 1 and stop = absoluteRoiStart) & (absoluteIndex = relativeRoiStart) & (events[1] = relativeRoiStart)) | ((events[0] >= relativeRoiStart) & (events[1] <= relativeRoiStop))]
print(deletionOverlappingRoiOrStartingInRoi)
t0 = time.time()
getAlignmentData(source, sequence, sstart, tresholds)
t1 = time.time()
getAlignmentData(source, sequence, sstart, tresholds)
t2 = time.time()
print(str(t1-t0)+' to fir
This is code about parsing gaps in DNA sequences pairwise alignment and doing statistics about it.
The code looks like this:
```
import re
import time
import numpy as np
from numba import autojit, int32, complex64
sstart = 10
send = 52
absoluteRoiStart = 19 #rank of the first nucleotide in the ROI
absoluteRoiStop = 27 #rank of the the first nucleotide after the ROI
#ROI is here 'TATCGA---CAG|TA-----TACTA-C|G--TTGAGAGAGAC-CCCA'
#between | 'T--CGACCAC--|-GATCGAG---ATC|GGCTT--------CTC--A'
source = 'TATCGA---CAGTA-----TACTA-CG--TTGAGAGAGAC-CCCA'
sequence = 'T--CGACCAC---GATCGAG---ATCGGCTT--------CTC--A'
realSource = 'AAGGTTCCAATATCGACAGTATACTACGTTGAGAGAGACCCCACATGACTGACTACGT'
tresholds = {
"DEL" : {
"other" : 2,
"slippage": 2,
"quantity" : 7
},
"INS" : {
"other" : 3,
"slippage": 3,
"quantity" : 7
},
"MUT" : {
"other" : 3,
"slippage": 3,
"quantity" : 7
},
"NA" : {
"other" : 3,
"slippage": 3,
"quantity" : 7
}
}
def getAllGaps(sequence1, sequence2):
starts = []
stops = []
lengths = []
types = []
locations = []
gap = '(\-)+'
x = re.compile(gap)
for m in x.finditer(sequence1):
#Get Gap satrt, stop and length
start,stop = m.span()
#Test if Gap is slippage(compression or extension)
if start > 1 and stop = absoluteRoiStart) & (absoluteIndex = relativeRoiStart) & (events[1] = relativeRoiStart)) | ((events[0] >= relativeRoiStart) & (events[1] <= relativeRoiStop))]
print(deletionOverlappingRoiOrStartingInRoi)
t0 = time.time()
getAlignmentData(source, sequence, sstart, tresholds)
t1 = time.time()
getAlignmentData(source, sequence, sstart, tresholds)
t2 = time.time()
print(str(t1-t0)+' to fir
Solution
After profiling your code as:
You will get:
You can use Cython or f2py to optimise memory management starting with
Keep in mind that you need to deactivate outputs and take long runs to measure real speedup.
If you compile your code using
All you need to do is install nuitka and run:
You also need a larger dataset to run a proper benchmark. Since it's easy to keep small data in cache, the profiler can provide misleading results.
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()
for it in range(0,10000):
getAlignmentData(source, sequence, sstart, tresholds)
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()You will get:
1370129 function calls (1370122 primitive calls) in 3.249 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
10000 1.003 0.000 3.249 0.000 test.py:71(getAlignmentData)
20000 1.071 0.000 1.684 0.000 test.py:38(getAllGaps)
20000 0.171 0.000 0.171 0.000 {numpy.core.multiarray.array}
20000 0.160 0.000 0.160 0.000 {method 'reduce' of 'numpy.ufunc' objects}
10000 0.018 0.000 0.121 0.000 {method 'min' of 'numpy.ndarray' objects}
400022 0.119 0.000 0.119 0.000 {method 'append' of 'list' objects}
180000 0.107 0.000 0.107 0.000 {method 'replace' of 'str' objects}
100000 0.106 0.000 0.106 0.000 {range}
You can use Cython or f2py to optimise memory management starting with
getAllGaps, which is simpler, and then go for getAlignmentData.Keep in mind that you need to deactivate outputs and take long runs to measure real speedup.
If you compile your code using
nuitka you can get 9x speedup. 110128 function calls (110121 primitive calls) in 0.355 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
20000 0.158 0.000 0.158 0.000 {method 'reduce' of 'numpy.ufunc' objects}
10000 0.011 0.000 0.101 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/_methods.py:28(_amin)
10000 0.022 0.000 0.096 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/numeric.py:141(ones)
20000 0.025 0.000 0.081 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py:192(compile)
10000 0.008 0.000 0.076 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/_methods.py:25(_amax)
20000 0.056 0.000 0.056 0.000 /usr/local/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py:230(_compile)
10000 0.037 0.000 0.037 0.000 {numpy.core.multiarray.copyto}
10000 0.037 0.000 0.037 0.000 {numpy.core.multiarray.empty}
All you need to do is install nuitka and run:
$ nuitka mycode.py
You also need a larger dataset to run a proper benchmark. Since it's easy to keep small data in cache, the profiler can provide misleading results.
Code Snippets
import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()
for it in range(0,10000):
getAlignmentData(source, sequence, sstart, tresholds)
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()Context
StackExchange Code Review Q#86051, answer score: 5
Revisions (0)
No revisions yet.