HiveBrain v1.2.0
pattern · python · Minor

High performance parsing for large, well-formatted text files

Submitted by: @import:stackexchange-codereview

Problem

I am looking to optimize the performance of a big-data parsing task in Python. The example data shown are segments of whole-genome DNA sequence alignments for six primate species.

Each file contains multiple segments with the following format:

  # a variable number of 'SEQ' header lines
  # the word 'DATA'
  # data lines, with one character column per header line
  # the terminating pattern '//'


Example:

SEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
...... # continue for 10-100 thousand lines
//

SEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11562256 11579393 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
.... # continue for 10-100 thousand lines
//



I will only process segments where the species homo_sapiens and macaca_mulatta are both present in the header, and field 6 (a quality control flag) equals '1' for both of these species. Since macaca_mulatta does not appear in the second example, I would ignore this segment completely and continue to the next header to check the species and flags.

Fields 4 and 5 are the start and end coordinates, which I record from the homo_sapiens line. I count positions from the start coordinate for a downstream step.

I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta at each position. The line on which a species appears in the header indexes the data column for that species, so in example 1, where both target species are present, the (0-based) indices for homo_sapiens and macaca_mulatta are 0 and 4, respectively.
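For concreteness, the header fields can be pulled out with a plain split; this is a minimal sketch (the helper names `parse_seq_line` and `species_columns` are mine, not from the original post):

```python
def parse_seq_line(line):
    """Split a 'SEQ ...' header line into the fields used here.

    Fields (1-based): 1='SEQ', 2=species, 3=chromosome,
    4=start, 5=end, 6=quality-control flag.
    """
    fields = line.split()
    return {
        "species": fields[1],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "flag": fields[5],
    }

def species_columns(headers):
    """The 0-based data-column index of a species is simply the
    position of its SEQ line within the header block."""
    return {parse_seq_line(h)["species"]: i for i, h in enumerate(headers)}
```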

Solution

You should separate your "business logic" (i.e., the rules you're using to analyse and filter the file) from your file parsing:

def parse(stream):
    while True:
        headers = []
        for line in stream:
            if not line.strip() and not headers:
                # empty line before chunk start
                pass
            elif line.startswith('SEQ'):
                headers.append(line)
            elif line.startswith('DATA'):
                break
            else:
                raise ValueError("Unexpected line %s" % line)

        else:
            if headers:
                # stream ended, after finding some headers
                raise ValueError("No data section found")
            else:
                # stream ended normally
                return

        def data_iter_gen():
            for line in stream:
                if line.startswith("//"):
                    break
                yield line

        data_it = data_iter_gen()
        yield headers, data_it

        # consume the data iterator, so that we're at
        # the end of a record in the next iteration of this while loop
        for d in data_it: pass

def headers_look_ok(headers):
    # headers (a `list`) is small, so this doesn't need much care
    ...

def process(headers, data):
    # headers is a list, data is an iterator
    ...

import gzip

# open in text mode ('rt') so lines are str, matching the
# startswith('SEQ') checks in parse()
with gzip.open(compara_file, 'rt') as f:
    for headers, data in parse(f):
        if headers_look_ok(headers):
            process(headers, data)


Note the careful use of iterators here to avoid loading more than is needed into RAM.
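To see the streaming behaviour concretely, the parser can be exercised on a small in-memory sample (the sample text below is mine; `parse` is copied verbatim from the answer above):

```python
import io

# Copy of the parse() function above, unchanged.
def parse(stream):
    while True:
        headers = []
        for line in stream:
            if not line.strip() and not headers:
                pass  # empty line before chunk start
            elif line.startswith('SEQ'):
                headers.append(line)
            elif line.startswith('DATA'):
                break
            else:
                raise ValueError("Unexpected line %s" % line)
        else:
            if headers:
                raise ValueError("No data section found")
            return  # stream ended normally

        def data_iter_gen():
            for line in stream:
                if line.startswith("//"):
                    break
                yield line

        data_it = data_iter_gen()
        yield headers, data_it
        for d in data_it:
            pass  # consume the rest of the record

sample = """\
SEQ homo_sapiens 1 10 20 1 (chr_length=100)
SEQ macaca_mulatta 1 30 40 1 (chr_length=90)
DATA
GG
CT
//
"""

# Each record is yielded lazily: data lines are only read from the
# stream as the consumer iterates over them.
for headers, data in parse(io.StringIO(sample)):
    print(len(headers), [line.strip() for line in data])
# prints: 2 ['GG', 'CT']
```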

This also makes file-format validation easier: malformed input raises an error instead of failing silently and unexpectedly.
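For completeness, here is one way the two stubs could be filled in for the rules stated in the problem. Since the question text is truncated, the base-comparison logic in `process` (and its return value) is my reading of the requirements, not the original author's code:

```python
TARGETS = ("homo_sapiens", "macaca_mulatta")

def headers_look_ok(headers):
    """True if both target species are present and each has
    quality-control flag '1' in field 6 of its SEQ line."""
    flags = {}
    for line in headers:
        fields = line.split()
        flags[fields[1]] = fields[5]  # species -> quality flag
    return all(flags.get(species) == "1" for species in TARGETS)

def process(headers, data):
    """Count positions where the two target species carry the same
    base, returning (homo_sapiens start coordinate, match count)."""
    species = [line.split()[1] for line in headers]
    hs = species.index("homo_sapiens")   # 0-based data-column index
    mm = species.index("macaca_mulatta")
    start = int(headers[hs].split()[3])  # field 4: start coordinate
    matches = 0
    for line in data:
        if line[hs] == line[mm]:
            matches += 1
    return start, matches
```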


Context

StackExchange Code Review Q#98141, answer score: 4
