HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Generating DNA sequences and looking for correlations

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
generatinglookingdnaforandsequencescorrelations

Problem

I've written a script to generate DNA sequences and then count the appearance of each step to see if there is any long range correlation.

My program runs really slow for a length 100000 sequence 100 times replicate. I already run it for more than 100 hours without completion.

The first function is seq(), it randomly generate DNA sequences based on the transition matrix follow Markov chain, each step will be either a,c,t, or g. The length is ten thousands.

After the DNA sequence generated, the u will be calculated. u is the total score by DNA walk, for example, at each step, if there is a|g, u + 1, if there is a c|g, u - 1. Therefore we will have u from step 1 to step 10 thousands.

Then we calculate the fluctuation for u from l = 1 step to l = 5000 step, to see if there is a long range correlation exist.

The performTrial() is using to do replicate for fl() function.

```
#!/usr/bin/env python

import sys, random
import os
import math

length = 10000

initial_p = {'a':0.25,'c':0.25,'t':0.25,'g':0.25}

tran_matrix = {'a': {'a':0.495,'c':0.113,'g':0.129,'t':0.263},
'c': {'a':0.129,'c':0.063,'g':0.413,'t':0.395},
't': {'a':0.213,'c':0.495,'g':0.263,'t':0.029},
'g': {'a':0.263,'c':0.129,'g':0.295,'t':0.313}}

def fl():
def seq():
def choose(dist):
r = random.random()
sum = 0.0
keys = dist.keys()
for k in keys:
sum += dist[k]
if sum > r:
return k
return keys[-1]
c = choose(initial_p)
sequence = ''
for i in range(length):
sequence += c
c = choose(tran_matrix[c])
return sequence

sequence = seq()
# This program takes a DNA sequence calculate the DNA walk score.
#print sequence
#print len
u = 0
ls = []
for i in sequence:
if i == 'a' :
#print i

Solution

One word: numpy.

A few more words:

numpy is a library that includes highly-tuned algorithms (often in Fortran, C, and C++) specifically for the purposes of accelerating numerical computations over matrices and other arrays. If it's at all applicable to what you're trying to do, it's almost certainly the best answer. Even when numpy itself isn't appropriate, look at the other components of scipy (the larger project it's part of).

At the moment, the numpy site seems to be down, and the scipy site seems to be responding but broken. The docs wiki is working, at least. (Please don't take this as an indication that numpy is some fly-by-night project; almost everyone who does numerical or scientific computing in Python relies on it. I'm sure it'll be fixed shortly.)

However, there are other possibilities for other use cases:

  • Analyze your algorithms and make sure the problem isn't what you're doing, rather than how you're doing it. There's no point making everything 30% faster, or even 95%, when the real problem is that you're doing 1000000 loops instead of 1000.



  • Profile your code, and don't worry about the parts that aren't slow or aren't called often.



  • Whenever it's possible to use a built-in function, even if it's a bit more awkward, it's often faster. (For example, sticking an iterator in a collections.deque of size 0 is at least 30% faster than any pure-Python way of walking and disposing the iterator.)



  • Cython gives you a way to write almost-Python code that gets compiled into C (and then into binary code).



  • You can port (bits of) your code to C explicitly.



  • Sometimes just running in PyPy (or maybe Jython or IronPython) instead of the normal CPython makes a big difference.



  • Replace explicit lists with generators, and explicit list-building loops that can't be replaced by generators with list comprehensions.



  • If you need that last 2%, there are all kinds of tricks, like rebinding members of objects to local variables, that aren't always more trouble than they're worth.



A couple more words on Cython:

The first step to accelerating some slow critical code with Cython is just to move it as-is into Cython. If that doesn't help enough, try adding explicit types to variables and turning function definitions from def into cdef. If that's still not enough, then you have to actually learn how to use Cython effectively—but you may get enough win that you don't even have to understand why it's helping. (Although obviously you should learn anyway.)

Context

StackExchange Code Review Q#19654, answer score: 12

Revisions (0)

No revisions yet.