patternpythonMinor
Building a report of DNA sites and chunks
Viewed 0 times
siteschunksbuildingreportdnaand
Problem
Here is the slow part of my code:
Typically,
Suggested data to benchmark
What the function does
Basically, what this piece of code does is to create a long string
My thoughts about what to improve
I think the part that can be improved is the three loops or eventually only the most internal loop (
Note that I've tried to replace
by
but it was even slower.
Note also that the f
def Prepare_WCfst_file():
content_WCfst = "iteration\tlocus\tFromSite\tToSite\tWCFst\n"
nbchunks = nbsites/nbsitesPerChunk
if nbchunks%1 != 0:
print("\n\nThe number of sites per chunk for WCFst is not a divisor of the number of sites per locus.\nProcess is aborted!\n\n")
sys.exit(1)
nbchunks = int(nbchunks)
nbsitesPerChunk_minus1 = nbsitesPerChunk - 1
for iteration in xrange(nbiterations):
for locus in xrange(nbloci):
FromSite = 0
ToSite = nbsitesPerChunk_minus1
for chunk in xrange(nbchunks):
content_WCfst += str(iteration) + "\t" + str(locus) + "\t" + str(FromSite) + "\t" + str(ToSite) + "\n"
FromSite = ToSite + 1
ToSite = FromSite + nbsitesPerChunk_minus1
return content_WCfstTypically,
nbiterations and nbloci take values between 1 and 100. nbchunks however can take values of the order of 10^6 to 10^9. This function will be called about 100 to 10,000 times in total.Suggested data to benchmark
nbloci = 10^6
nbiterations = 20
nbloci = 2
nbsitesPerChunk = 10What the function does
Basically, what this piece of code does is to create a long string
content_WCfst (which will afterward be written on a file). The file needs to contain 4 columns: iteration, locus, FromSite and ToSite. The difference between the columns FromSite and ToSite is always nbsitesPerChunk_minus1. By the end the number of lines is nbiterations nbloci nbchunks.My thoughts about what to improve
I think the part that can be improved is the three loops or eventually only the most internal loop (
for chunk in xrange(nbchunks):)Note that I've tried to replace
str(iteration) + "\t" + str(locus) + "\t" + str(FromSite) + "\t" + str(ToSite) + "\n"by
"\t".join(str(iteration), str(locus), str(FromSite), str(ToSite), "\n")but it was even slower.
Note also that the f
Solution
nbchunks = nbsites/nbsitesPerChunk
if nbchunks%1 != 0:
print("\n\nThe number of sites per chunk for WCFst is not a divisor of the number of sites per locus.\nProcess is aborted!\n\n")
sys.exit(1)
nbchunks = int(nbchunks)This chunk of code is not good because:
nbchunks = nbsites / nbsitesPerChunkIn python-2 is already int for sure, in Python-3 you can write:
nbchunks = nbsites // nbsitesPerChunkand it will be int for sure, so you can remove
nbchunks = int(nbchunks)to avoid clutter.
The line:
if nbchunks%1 != 0- (minor) should be spaced like
if nbchunks % 1 != 0
- (SERIOUS) is always false as the remainder between a natural number and one is always zero!
Also the error message:
print("\n\nThe number of sites per chunk for WCFst is not a divisor of the number of sites per locus.\nProcess is aborted!\n\n")is not connected to the
if statement...Exiting explicit like
sys.exit(1) should be avoided, you should instead:raise AdequateException(message)Code Snippets
nbchunks = nbsites/nbsitesPerChunk
if nbchunks%1 != 0:
print("\n\nThe number of sites per chunk for WCFst is not a divisor of the number of sites per locus.\nProcess is aborted!\n\n")
sys.exit(1)
nbchunks = int(nbchunks)nbchunks = nbsites / nbsitesPerChunknbchunks = nbsites // nbsitesPerChunknbchunks = int(nbchunks)if nbchunks%1 != 0Context
StackExchange Code Review Q#96373, answer score: 5
Revisions (0)
No revisions yet.