snippetpythonMinor
Speed up file format conversion
Viewed 0 times
conversionformatfilespeed
Problem
I wrote this quick script to convert a genotype file into a BED file. Running it seems to be taking a very long time.
Input file example:
```
Chr Position 1 2 3
HE669507 9752 C C C
HE669507 9753 T T T
HE669507 9755 A
#BED file format
#http://genome.ucsc.edu/FAQ/FAQformat.html#format1
#1.chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
#2.chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
#3.chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
from collections import defaultdict
from operator import itemgetter
from itertools import groupby
from __future__ import print_function
genotype = open("", "rU")
def listToRange(list):
ranges = []
for k, g in groupby(enumerate(list), lambda (i,x):i-x):
group = map(itemgetter(1), g)
ranges.append((group[0], group[-1]))
return ranges
#Create Dictionary with all positions for each chrom
chromDict = defaultdict(list)
genotypeLines = list(genotype.readlines()[1:])
for item in genotypeLines:
objects = item.split()
chromDict[objects[0]].append(int(objects[1]))
#print() implemented in python 3 for use in python 2 need to: from __future__ import print_function
for key, value in chromDict.iteritems():
if len(listToRange(value)) > 1:
for i in range(0,(len(listToRange(value)))):
print(key, listToRange(value)[i][0],listToRange(value)[i][1], sep="\t")
else:
print(key, listToRange(value)[0][0], listToRange(value)[0][1], sep="\t")
#need to capture output to terminal i.e. python script.py > myfile.bed
genotype.close()Input file example:
```
Chr Position 1 2 3
HE669507 9752 C C C
HE669507 9753 T T T
HE669507 9755 A
Solution
I do not know python, so I should probably restrain from answering. But my initial observation by looking at your code is the repeated call to
We can also skip the
Rewrite:
Some other notes:
-
There is also the question about variable names. In your
-
anything else.
-
Comments and code should be within a width limit. I stay with the 80
rule.
-
Space after comma.
As you tag question with beginner I would also recommend PEP 8 – Style Guide for Python Code. It is one of those things that when one accumulate experience one better see where and when it is appropriate to deviate from. Start out strict and lax later if appropriate.
There is also a
As such I can add the bullet point:
by underscores as necessary to improve readability.
That is: if this is not part of a bigger code-base where the norm is to use camel-case.
For file read
It is good practice to use the with keyword when dealing with file objects. This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much shorter than writing equivalent try-finally blocks:
As noted, I do not know python, but add a simple writeup anyway, you might find some of it useful:
listToRange in the last for loop.We can also skip the
if len() > 1.Rewrite:
for key, value in chromDict.iteritems():
chrom_range = listToRange(value)
chrom_len = len(chrom_range)
for i in range(0, chrom_len):
print(key, chrom_range[i][0], chrom_range[i][1], sep="\t")Some other notes:
-
There is also the question about variable names. In your
listToRange(list), list is a python keyword.-
import __future__ statements should be served beforeanything else.
-
Comments and code should be within a width limit. I stay with the 80
rule.
-
Space after comma.
As you tag question with beginner I would also recommend PEP 8 – Style Guide for Python Code. It is one of those things that when one accumulate experience one better see where and when it is appropriate to deviate from. Start out strict and lax later if appropriate.
There is also a
pep8 command line tool for easy validation.As such I can add the bullet point:
- Function names should be lowercase, with words separated
by underscores as necessary to improve readability.
That is: if this is not part of a bigger code-base where the norm is to use camel-case.
For file read
with can be nice to use. I simply quote python.org by:It is good practice to use the with keyword when dealing with file objects. This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much shorter than writing equivalent try-finally blocks:
As noted, I do not know python, but add a simple writeup anyway, you might find some of it useful:
""" GENO TO BED """
# BED file format
# http://genome.ucsc.edu/FAQ/FAQformat.html#format1
# 1.chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or
# scaffold (e.g. scaffold10671).
# 2.chromStart - The starting position of the feature in the chromosome or
# scaffold. The first base in a chromosome is numbered 0.
# 3.chromEnd - The ending position of the feature in the chromosome or
# scaffold. The chromEnd base is not included in the display of
# the feature. For example:
# The first 100 bases of a chromosome are defined as
# chromStart=0, chromEnd=100, and span the bases numbered 0-99.
from __future__ import print_function
from collections import defaultdict
from operator import itemgetter
from itertools import groupby
import sys
def list_to_range(list_):
""" List to range """
ranges = []
for k, groups in groupby(enumerate(list_), lambda (i, x): i - x):
group = map(itemgetter(1), groups)
ranges.append((group[0], group[-1]))
return ranges
def dump_bed(chrom_dict):
""" Print BED to stdout """
for key, value in chrom_dict.iteritems():
chrom_range = list_to_range(value)
chrom_len = len(chrom_range)
for i in range(0, chrom_len):
print(key, chrom_range[i][0], chrom_range[i][6], sep="\t")
def dict_from_file(filename):
""" Generate dictionary from file """
chrom_dict = defaultdict(list)
with open(filename, "rU") as handle:
next(handle) # Skip first line
for line in handle:
if line == "\n":
continue
try:
key, val = line.split()[0:2]
chrom_dict[key].append(int(val))
except ValueError:
# Here you likely want to print some error.
# Use except ValueError, err: to get the
# error description.
pass
return chrom_dict
def main():
""" Main routine """
chrom_dict = dict_from_file(sys.argv[1])
dump_bed(chrom_dict)
return 0
if __name__ == '__main__':
if len(sys.argv) == 1:
print("Missing filename.", file=sys.stderr)
sys.exit(1)
sys.exit(main())
Code Snippets
for key, value in chromDict.iteritems():
chrom_range = listToRange(value)
chrom_len = len(chrom_range)
for i in range(0, chrom_len):
print(key, chrom_range[i][0], chrom_range[i][1], sep="\t")Context
StackExchange Code Review Q#42561, answer score: 5
Revisions (0)
No revisions yet.