HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Counting K-mers (words) in a sequence string

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
merscountingwordssequencestring

Problem

I recently found out how to use joblib for parallelization. I want to iterate through this string seq in steps of 1 and count the number that word is seen. I thought this would be really fast but for larger files it still takes long.

How could I make this word counter faster?

def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]

def countKmers(seq, K=5, num_cores=multiprocessing.cpu_count()):
    """
    Count kmers in sequence
    """
    start_indices = np.arange(0, len(seq) - K + 1, step=1)
    end_indices = start_indices + K
    list_of_kmers = Parallel(n_jobs=num_cores)\
                        (delayed(iterSlice)\
                             (seq=seq, start_idx=start_idx, end_idx=end_idx)\
                                 for start_idx,end_idx in zip(start_indices,end_indices))
    return Counter(list_of_kmers)

seq = "ABCDBBBDBABCBDBABCBCBCABCDBACBDCBCABBCABDCBCABCABCABDCBABABABCD"
countKmers(seq=seq, K=3, num_cores=8)

Counter({'ABA': 2,
         'ABB': 1,
         'ABC': 7,
         'ABD': 2,
         'ACB': 1,
         'BAB': 5,
         'BAC': 1,
         'BBB': 1,
         'BBC': 1,
         'BBD': 1,
         'BCA': 6,
         'BCB': 3,
         'BCD': 3,
         'BDB': 2,
         'BDC': 3,
         'CAB': 6,
         'CBA': 1,
         'CBC': 4,
         'CBD': 2,
         'CDB': 2,
         'DBA': 3,
         'DBB': 1,
         'DCB': 3})


10 loops, best of 3: 186 ms per loop

Solution

Obligatory PEP8 reference: It is recommended to use lower_case names for variables and functions. camelCase is alright if you are consistent (you seem to be using lower_case for variables and camelCase for functions).

The hardest to read part is the calling of the Parallels part. I would make a few changes there to improve readability:

The way delayed is used I would think it is a decorator and you could do:

@delayed
def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]


Unfortunately this does not work, I asked a stackoverflow question why here.

You can however do this:

def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]
iterSlice = delayed(iterSlice)


This seems a bit too verbose to me:

iterSlice(seq=seq, start_idx=start_idx, end_idx=end_idx)


This is just as explicit:

itereSlice(seq, start_idx, end_idx)


Parallels has an implemented contextlib so you can do:

with Parallel(n_jobs=num_cores) as parallel:
    list_of_kmers = parallel(iterSlice(seq, start_idx, end_idx)
                             for start_idx,end_idx in zip(start_indices,end_indices))


It would also allow you to reuse the pool of workers if you had en explicit loop in there, which you don't.

You can probably move the definition of list_of_kmers directly into Counter(...) to avoid generating an intermediate list.

Regarding the performance:

The actually parallelized taks you are performing is getting the slice of the string. This is not really CPU intensive, so the overhead of setting up the parallelization might ruin the benefits.

For large files you will be limited by the reading of the file itself, which will be done linearly with the file size and with one CPU.

Have you tried comparing your code to a naive implementation like this:

from collections import Counter

def count_kmers(seq, K=5):
    """Count kmers in sequence"""
    return Counter(seq[start:start+K] for start in xrange(len(seq) - K))


(In python 3.x use range instead of xrange)

Code Snippets

@delayed
def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]
def iterSlice(seq, start_idx, end_idx):
    """
    Slice sequence
    """
    return seq[start_idx:end_idx]
iterSlice = delayed(iterSlice)
iterSlice(seq=seq, start_idx=start_idx, end_idx=end_idx)
itereSlice(seq, start_idx, end_idx)
with Parallel(n_jobs=num_cores) as parallel:
    list_of_kmers = parallel(iterSlice(seq, start_idx, end_idx)
                             for start_idx,end_idx in zip(start_indices,end_indices))

Context

StackExchange Code Review Q#138205, answer score: 2

Revisions (0)

No revisions yet.