High-performance parsing for large, well-formatted text files
Problem
I am looking to optimize the performance of a big-data parsing problem in Python. The example data shown are segments of whole-genome DNA sequence alignments for six primate species.
Each file contains multiple segments with the following format:
- header lines; the number of header lines mirrors the number of data columns
- the word 'DATA'
- data lines with a variable number of columns
- the pattern '//' marking the end of the segment
Example:
SEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
...... # continue for 10-100 thousand lines
//
SEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11562256 11579393 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
.... # continue for 10-100 thousand lines
//
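To make the field numbering concrete, here is a minimal sketch (not part of the original question) of how a 'SEQ' header line splits into whitespace-separated fields, counting 'SEQ' itself as field 1:

```python
line = "SEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)"
fields = line.split()

species = fields[1]                           # field 2: species name
start, end = int(fields[3]), int(fields[4])   # fields 4 and 5: coordinates
flag = fields[5]                              # field 6: quality-control flag
```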
I will only process segments where the species homo_sapiens and macaca_mulatta are both present in the header, and field 6 (a quality-control flag) equals '1' for both of these species. Since macaca_mulatta does not appear in the second example, I would ignore that segment completely and continue to the next header to check the species and flags.
Fields 4 and 5 are the start and end coordinates, which I will record from the homo_sapiens line. I count from the start for another process (described below).
I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta at each position. The line on which a species appears in the header indexes the data column for that species, so in example 1, where both target species are present, the (0-based) indices for homo_sapiens and macaca_mulatta are 0 and 4, respectively.
Solution
You should separate your business logic (i.e., the rules you're using to analyse and filter the file) from your file parsing:
Note the careful use of iterators here to prevent loading more than is needed into RAM.
This also has the benefit of easier file-format validation, rather than failing silently and unexpectedly.
import gzip as gz  # needed for gz.open below

def parse(stream):
    while True:
        headers = []
        for line in stream:
            if not line.strip() and not headers:
                # empty line before chunk start
                pass
            elif line.startswith('SEQ'):
                headers.append(line)
            elif line.startswith('DATA'):
                break
            else:
                raise ValueError("Unexpected line %s" % line)
        else:
            if headers:
                # stream ended, after finding some headers
                raise ValueError("No data section found")
            else:
                # stream ended normally
                return

        def data_iter_gen():
            for line in stream:
                if line.startswith("//"):
                    break
                yield line

        data_it = data_iter_gen()
        yield headers, data_it
        # consume the data iterator, so that we're at
        # the end of a record in the next iteration of this while loop
        for d in data_it:
            pass

def headers_look_ok(headers):
    # headers (a `list`) is small, so this doesn't need much care
    ...

def process(headers, data):
    # headers is a list, data is an iterator
    ...

# compara_file is the path to the gzipped alignment file;
# 'rt' so the lines are str, not bytes, under Python 3
with gz.open(compara_file, 'rt') as f:
    for headers, data in parse(f):
        if headers_look_ok(headers):
            process(headers, data)
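To show how the business logic could slot into this skeleton, here is one possible implementation of the two helpers (a sketch, not part of the original answer), applying the filtering rules from the problem statement; the species names and helper names are taken from the question and the skeleton above:

```python
TARGETS = ('homo_sapiens', 'macaca_mulatta')

def headers_look_ok(headers):
    # Map species name (field 2) -> QC flag (field 6) for each 'SEQ' line.
    flags = {}
    for line in headers:
        fields = line.split()
        flags[fields[1]] = fields[5]
    # Keep the segment only if both target species are present
    # and their quality-control flag equals '1'.
    return all(flags.get(species) == '1' for species in TARGETS)

def process(headers, data):
    # The position of a species' header line gives its data column,
    # so for example 1 the column indices are 0 and 4.
    species = [line.split()[1] for line in headers]
    h, m = (species.index(s) for s in TARGETS)
    matches = 0
    for line in data:
        if line[h] == line[m]:  # compare the DNA bases at this position
            matches += 1
    return matches
```

Since `data` is only ever iterated once, this keeps the memory profile of the surrounding `parse` generator: no segment is held in RAM in full.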
Context
StackExchange Code Review Q#98141, answer score: 4