patternpythonMinor
Efficient parsing of FASTQ
Viewed 0 times
fastqefficientparsing
Problem
FASTQ is a notoriously bad format. This is because it uses the same
I'd like your opinion of my
This works, but I'd like your tips and ideas on ways of improving it.
```
def read_fastq(fileH):
"""
takes a fastq file as input
yields idSeq, sequence and score
for each fastq entry
"""
#initialize the idSeq, sequence, score and index
idSeq, sequence, score = None, None, None
"""
main loop structure:
An outer while loop will run until the file runs out of lines.
If the line starts with @ and score exists, yield the id, sequence and score.
this is where the yielding happens in our loop, it could be considered where each
round of the loop ends
The first id is recorded on the first iteration.
If there is no sequence, begin an inner while loop,
read lines into sequence until we hit a + character, and break out of inner while loop
if there is no score, begin an inner while loop where we will increment score until we have
the same number of characters as sequence, this prevents interpretting a header as a score
"""
while 1:
line = fileH.readline()
#break if we hit the end of the file
if not line:
break
if line.startswith('@') and score:
#before yielding the sequence, remove non-ascii characters
sequence = legalChar(sequence)
yield idSeq, sequence, score
#reset to default values
sequence = None
score = None
idSeq = line.rstrip()
elif not idSeq:
#get our first idSeq
idSeq = line.rstrip()
continue
elif not sequence:
sequence = ""
while not line.startswith('+'):
@ character for the id line as it does for quality scores. Deciding what is a quality score and what is an id is a tricky endeavor with many pitfalls.I'd like your opinion of my
while 1 approach in the function read_fastq.This works, but I'd like your tips and ideas on ways of improving it.
```
def read_fastq(fileH):
"""
takes a fastq file as input
yields idSeq, sequence and score
for each fastq entry
"""
#initialize the idSeq, sequence, score and index
idSeq, sequence, score = None, None, None
"""
main loop structure:
An outer while loop will run until the file runs out of lines.
If the line starts with @ and score exists, yield the id, sequence and score.
this is where the yielding happens in our loop, it could be considered where each
round of the loop ends
The first id is recorded on the first iteration.
If there is no sequence, begin an inner while loop,
read lines into sequence until we hit a + character, and break out of inner while loop
if there is no score, begin an inner while loop where we will increment score until we have
the same number of characters as sequence, this prevents interpretting a header as a score
"""
while 1:
line = fileH.readline()
#break if we hit the end of the file
if not line:
break
if line.startswith('@') and score:
#before yielding the sequence, remove non-ascii characters
sequence = legalChar(sequence)
yield idSeq, sequence, score
#reset to default values
sequence = None
score = None
idSeq = line.rstrip()
elif not idSeq:
#get our first idSeq
idSeq = line.rstrip()
continue
elif not sequence:
sequence = ""
while not line.startswith('+'):
Solution
- Bugs
All the bugs in this section are to do with handling of bad input; see §2.1 below for more about this issue.
-
If there are no sequences in the input, your function wrongly yields an output sequence:
>>> from StringIO import StringIO
>>> list(read_fastq(StringIO('')))
[(None, None, None)]-
If the input is wrongly formatted, your program can go into an infinite loop waiting for a line starting with a plus sign:
>>> list(read_fastq(StringIO('@ID\nSEQUENCE\n')))
^C
Traceback (most recent call last):
File "", line 1, in
File "cr32897.py", line 155, in read_fastq
sequence += line.rstrip().replace(' ', '')
KeyboardInterrupt-
If the input is wrongly formatted, your program can go into an infinite loop waiting for the end of the quality data:
>>> list(read_fastq(StringIO('@ID\nSEQUENCE\n+ID\nSCORE')))
^C
Traceback (most recent call last):
File "", line 1, in
File "cr32897.py", line 162, in read_fastq
if len(score) >= len(sequence):
KeyboardInterrupt- Other comments on your code
-
You don't detect or report problems in the input. You just ignore or suppress them. This leads to outright bugs, as noted in §1 above, but it also means that you don't detect other problems with the input, like these:
- missing sequence id;
- junk lines before the first sequence or between sequences;
- sequence id on
+line fails to match sequence id on@line;
- length of quality data fails to match length of sequence data;
- quality data contains characters outside the correct range.
You shouldn't ignore or suppress errors in the input, because you need to know about these errors in order to discover problems with your processing pipeline. What if you are being fed the wrong input? What if the input is in the wrong format (for example, FASTA instead of FASTQ)? What if there is a bug in the code that generated the input? If you don't detect errors then you will end up generating bogus output and causing trouble for whatever process is next in your processing pipeline.
When writing code that parses input data, the code for detecting and reporting errors typically constitutes the majority of the code. See §3 below for how I would go about writing this kind of parser.
-
The first item of each output tuple (the sequence id) starts with the
@ sign, but the @ sign is not part of the id.-
Your claim, "Deciding what is a quality score and what is an id is a tricky endeavor with many pitfalls" is a bit exaggerated. As far as I can see there is just one pitfall, namely that you might incorrectly stop reading the quality data when you encounter a line starting with
@. If there are other pitfalls, you should probably mention them in a comment in the code so that we can check that they are all avoided.-
It's not clear to me what the
H means in the variable name fileH. Does it stand for "handle"? If that's what you mean, you should write it out for clarity: file_handle. But really there's no reason not just to use the name file here.-
Your code contains two docstrings, but only the first is actually available to a user via the
help function. The second docstring should be a comment instead — or better still, break it up into small comments that appear next to the relevant bits of code.-
The actual comments could be better. There is really no need to write comments like this:
line = fileH.readline()
#break if we hit the end of the file
if not line:
breakYou can expect Python programmers to know that
file.readline returns an empty string to indicate end-of-file; or if not, they can find out by running help(file.readline).Some of the comments are wrong, for example here:
#get our first idSeq
idSeq = line.rstrip()This only gets the "first"
idSeq if there was no previous line starting with an @.-
For an infinite loop, you write:
while 1:but it's conventional to write:
while True:-
In Python, a file object is also an iterator that yields the lines in the file, so instead of
line = fileH.readline()
#break if we hit the end of the file
if not line:
breakyou can write:
line = next(fileH)This raises
StopIteration when there are no more lines.-
You build up the
sequence string by repeated addition:sequence = ""
while not line.startswith('+'):
sequence += line.rstrip().replace(' ', '')
line = fileH.readline()In Python this is an anti-pattern that results in quadratic space and time consumption. The efficient way to build up a string from components is to put the components in a list and then call
str.join:sequence_lines = []
while not line.startswith('+'):
sequence_lines.append(line.rstrip().replace(' ', ''))
line = fileH.readline()
sequence = ''.join(sequence_lines)-
You build up the
score in the form of a list rather than a string:score = []
while 1:
score += line.rstrip().replace(' ', '')
...whic
Code Snippets
>>> from StringIO import StringIO
>>> list(read_fastq(StringIO('')))
[(None, None, None)]>>> list(read_fastq(StringIO('@ID\nSEQUENCE\n')))
^C
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "cr32897.py", line 155, in read_fastq
sequence += line.rstrip().replace(' ', '')
KeyboardInterrupt>>> list(read_fastq(StringIO('@ID\nSEQUENCE\n+ID\nSCORE')))
^C
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "cr32897.py", line 162, in read_fastq
if len(score) >= len(sequence):
KeyboardInterruptline = fileH.readline()
#break if we hit the end of the file
if not line:
break#get our first idSeq
idSeq = line.rstrip()Context
StackExchange Code Review Q#32897, answer score: 4
Revisions (0)
No revisions yet.