HiveBrain v1.2.0
Get Started
← Back to all entries
snippetpythonMinor

Speed up file format conversion

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
conversionformatfilespeed

Problem

I wrote this quick script to convert a genotype file into a BED file. Running it seems to be taking a very long time.

#BED file format
#http://genome.ucsc.edu/FAQ/FAQformat.html#format1
#1.chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). 
#2.chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. 
#3.chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. 

from collections import defaultdict
from operator import itemgetter
from itertools import groupby
from __future__ import print_function

genotype = open("", "rU")

def listToRange(list):
    ranges = []
     for k, g in groupby(enumerate(list), lambda (i,x):i-x):
        group = map(itemgetter(1), g)
        ranges.append((group[0], group[-1]))
    return ranges

#Create Dictionary with all positions for each chrom
chromDict = defaultdict(list) 

genotypeLines = list(genotype.readlines()[1:])

for item in genotypeLines:
    objects = item.split()

    chromDict[objects[0]].append(int(objects[1]))

#print() implemented in python 3 for use in python 2 need to: from __future__ import     print_function
for key, value in chromDict.iteritems():
    if len(listToRange(value)) > 1:
        for i in range(0,(len(listToRange(value)))):
            print(key, listToRange(value)[i][0],listToRange(value)[i][1], sep="\t")
    else:
        print(key, listToRange(value)[0][0], listToRange(value)[0][1], sep="\t")

#need to capture output to terminal i.e. python script.py > myfile.bed

genotype.close()


Input file example:

```
Chr Position 1 2 3
HE669507 9752 C C C
HE669507 9753 T T T
HE669507 9755 A

Solution

I do not know python, so I should probably restrain from answering. But my initial observation by looking at your code is the repeated call to listToRange in the last for loop.

We can also skip the if len() > 1.

Rewrite:

for key, value in chromDict.iteritems():
    chrom_range = listToRange(value)
    chrom_len = len(chrom_range)
    for i in range(0, chrom_len):
        print(key, chrom_range[i][0], chrom_range[i][1], sep="\t")


Some other notes:

-
There is also the question about variable names. In your
listToRange(list), list is a python keyword.

-
import __future__ statements should be served before
anything else.

-
Comments and code should be within a width limit. I stay with the 80
rule.

-
Space after comma.

As you tag question with beginner I would also recommend PEP 8 – Style Guide for Python Code. It is one of those things that when one accumulate experience one better see where and when it is appropriate to deviate from. Start out strict and lax later if appropriate.

There is also a pep8 command line tool for easy validation.

As such I can add the bullet point:

  • Function names should be lowercase, with words separated


by underscores as necessary to improve readability.

That is: if this is not part of a bigger code-base where the norm is to use camel-case.

For file read with can be nice to use. I simply quote python.org by:


It is good practice to use the with keyword when dealing with file objects. This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much shorter than writing equivalent try-finally blocks:

As noted, I do not know python, but add a simple writeup anyway, you might find some of it useful:

""" GENO TO BED """
# BED file format
# http://genome.ucsc.edu/FAQ/FAQformat.html#format1
# 1.chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or
# scaffold (e.g. scaffold10671).
# 2.chromStart - The starting position of the feature in the chromosome or
# scaffold. The first base in a chromosome is numbered 0.
# 3.chromEnd - The ending position of the feature in the chromosome or
# scaffold. The chromEnd base is not included in the display of
# the feature. For example:
# The first 100 bases of a chromosome are defined as
# chromStart=0, chromEnd=100, and span the bases numbered 0-99.

from __future__ import print_function
from collections import defaultdict
from operator import itemgetter
from itertools import groupby
import sys

def list_to_range(list_):
""" List to range """
ranges = []
for k, groups in groupby(enumerate(list_), lambda (i, x): i - x):
group = map(itemgetter(1), groups)
ranges.append((group[0], group[-1]))
return ranges

def dump_bed(chrom_dict):
""" Print BED to stdout """
for key, value in chrom_dict.iteritems():
chrom_range = list_to_range(value)
chrom_len = len(chrom_range)
for i in range(0, chrom_len):
print(key, chrom_range[i][0], chrom_range[i][6], sep="\t")

def dict_from_file(filename):
""" Generate dictionary from file """
chrom_dict = defaultdict(list)
with open(filename, "rU") as handle:
next(handle) # Skip first line
for line in handle:
if line == "\n":
continue
try:
key, val = line.split()[0:2]
chrom_dict[key].append(int(val))
except ValueError:
# Here you likely want to print some error.
# Use except ValueError, err: to get the
# error description.
pass

return chrom_dict

def main():
""" Main routine """
chrom_dict = dict_from_file(sys.argv[1])
dump_bed(chrom_dict)
return 0

if __name__ == '__main__':
if len(sys.argv) == 1:
print("Missing filename.", file=sys.stderr)
sys.exit(1)
sys.exit(main())

Code Snippets

for key, value in chromDict.iteritems():
    chrom_range = listToRange(value)
    chrom_len = len(chrom_range)
    for i in range(0, chrom_len):
        print(key, chrom_range[i][0], chrom_range[i][1], sep="\t")

Context

StackExchange Code Review Q#42561, answer score: 5

Revisions (0)

No revisions yet.