HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Building a report of DNA sites and chunks

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
siteschunksbuildingreportdnaand

Problem

Here is the slow part of my code:

def Prepare_WCfst_file():
    content_WCfst = "iteration\tlocus\tFromSite\tToSite\tWCFst\n"
    nbchunks = nbsites/nbsitesPerChunk
    if nbchunks%1 != 0:
        print("\n\nThe number of sites per chunk for WCFst is not a divisor of the number of sites per locus.\nProcess is aborted!\n\n")
        sys.exit(1)
    nbchunks = int(nbchunks)
    nbsitesPerChunk_minus1 = nbsitesPerChunk - 1
    for iteration in xrange(nbiterations):
        for locus in xrange(nbloci):
            FromSite = 0
            ToSite = nbsitesPerChunk_minus1
            for chunk in xrange(nbchunks):
                content_WCfst += str(iteration) + "\t" + str(locus) + "\t" + str(FromSite) + "\t" + str(ToSite) + "\n"
                FromSite = ToSite + 1
                ToSite = FromSite + nbsitesPerChunk_minus1
    return content_WCfst


Typically, nbiterations and nbloci take values between 1 and 100. nbchunks however can take values of the order of 10^6 to 10^9. This function will be called about 100 to 10,000 times in total.

Suggested data to benchmark

nbloci = 10^6
nbiterations = 20
nbloci = 2
nbsitesPerChunk = 10


What the function does

Basically, what this piece of code does is to create a long string content_WCfst (which will afterward be written on a file). The file needs to contain 4 columns: iteration, locus, FromSite and ToSite. The difference between the columns FromSite and ToSite is always nbsitesPerChunk_minus1. By the end the number of lines is nbiterations nbloci nbchunks.

My thoughts about what to improve

I think the part that can be improved is the three loops or eventually only the most internal loop (for chunk in xrange(nbchunks):)

Note that I've tried to replace

str(iteration) + "\t" + str(locus) + "\t" + str(FromSite) + "\t" + str(ToSite) + "\n"


by

"\t".join(str(iteration), str(locus), str(FromSite), str(ToSite), "\n")


but it was even slower.

Note also that the f

Solution

nbchunks = nbsites/nbsitesPerChunk
if nbchunks%1 != 0:
    print("\n\nThe number of sites per chunk for WCFst is not a divisor of the number of sites per locus.\nProcess is aborted!\n\n")
    sys.exit(1)
nbchunks = int(nbchunks)


This chunk of code is not good because:

nbchunks = nbsites / nbsitesPerChunk


In python-2 is already int for sure, in Python-3 you can write:

nbchunks = nbsites // nbsitesPerChunk


and it will be int for sure, so you can remove

nbchunks = int(nbchunks)


to avoid clutter.

The line:

if nbchunks%1 != 0


  • (minor) should be spaced like if nbchunks % 1 != 0



  • (SERIOUS) is always false as the remainder between a natural number and one is always zero!



Also the error message:

print("\n\nThe number of sites per chunk for WCFst is not a divisor of the number of sites per locus.\nProcess is aborted!\n\n")


is not connected to the if statement...

Exiting explicit like sys.exit(1) should be avoided, you should instead:

raise AdequateException(message)

Code Snippets

nbchunks = nbsites/nbsitesPerChunk
if nbchunks%1 != 0:
    print("\n\nThe number of sites per chunk for WCFst is not a divisor of the number of sites per locus.\nProcess is aborted!\n\n")
    sys.exit(1)
nbchunks = int(nbchunks)
nbchunks = nbsites / nbsitesPerChunk
nbchunks = nbsites // nbsitesPerChunk
nbchunks = int(nbchunks)
if nbchunks%1 != 0

Context

StackExchange Code Review Q#96373, answer score: 5

Revisions (0)

No revisions yet.