Optimize huge text file search

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Tags: file, search, text, huge, optimize

Problem

I have several huge text files (around 100 MB each) that I need to scan to pick out the frame numbers belonging to a specific log packet of interest. My plan was to scan for these frame numbers and drop them into a list (often 6000+ frames per text file!). OK so far, but there is a second packet of interest that should accompany the first. These pairs of packets can only be matched by a frame number in my newly created list, which avoids grabbing useless or blank data from non-matching frames. So I then re-scan the text file to get the packet data related to each frame. It was suggested on Stack Overflow that I go down the regex route, and whilst this method worked, it was extremely slow, taking up to 3 minutes to process a single text file.

I was wondering if there is a suitable optimization to my code, or even a completely different approach to grabbing this data?

```
for root, subFolders, files in os.walk(path):
    for filename in files:
        if filename.endswith('.txt'):
            with open(os.path.join(root, filename), 'r') as f:
                print '\tProcessing file: '+filename

                for line in f:
                    #first find first key packet and grab frame number
                    if 'KEY_FRAME details' in line:
                        chunk = [next(f) for x in xrange(5)]
                        FRAME = chunk[5].split()
                        FRAME = FRAME[2]

                        #drop frame number into a list
                        framelist.append(str(FRAME))

                #return to the start of the file, and search for next packet
                f.seek(0)

                framed = re.compile('|'.join(framelist))
                framed = framed.pattern

                #Look for any frame number in list based on 'FrameNumber = '+f and 'FN = '+f match
                sentences = f
                for s in sentences:
                    if any(('FrameNumber = '+f) in s for f in framelist):
                        print 'first found'
                        #do stuff

                    if any(('FN = '+f) in s for f in framelist):
                        print 'second found'
                        #do stuff
```

Solution

I would strongly recommend a different algorithm for processing your file. I would do it in a three-stage approach, requiring only two reads of the file:

- In the first stage, we scan the file and do a few things:
  - For each line, call stream.tell() on the file stream and remember the byte position.
  - Identify all Second_Packet blocks and the FN they relate to.
  - Store the FN and the byte position from tell() in a dictionary.
  - Identify all KEY_FRAME blocks and create an object to represent each one, with its frame number. Store these objects in a list.
- The second stage involves processing the KEY_FRAME records and identifying where the matching Second_Packet records are. Using the dictionary, order the requests so that they happen in byte-position order in the stream.
- In the third stage, we scan the file again, in order of the Second_Packet byte positions and the KEY_FRAME instances they belong to:
  - Seek in the file to the position of the next needed Second_Packet.
  - Strip off whatever information you need to complete the record.
  - Update the KEY_FRAME instance with the required information.

By performing only two scans through the file (the first is a full scan, the second is an in-order-but-random-and-selective scan), you reduce the number of times you process the data.
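The staged approach might be sketched roughly as follows. This is only an illustration under stated assumptions: it is written for Python 3 (the question uses Python 2), the line patterns `FrameNumber = <digits>` and `FN = <digits>` are guesses at the log layout, and the names `index_file` and `attach_second_packets` are hypothetical. The file is read in binary mode with `readline()` so that `tell()` returns a plain byte offset for each line.

```python
import io
import re

# Assumed line formats; the real log layout is not shown in the question.
KEY_RE = re.compile(rb"FrameNumber = (\d+)")
SECOND_RE = re.compile(rb"FN = (\d+)")


def index_file(stream):
    """Stage 1: one full scan, recording key frames and byte offsets."""
    key_frames = {}      # frame number -> record to be completed later
    second_offsets = {}  # frame number -> byte offset of its 'FN = ' line
    while True:
        pos = stream.tell()       # offset of the line about to be read
        line = stream.readline()
        if not line:              # empty bytes object means end of file
            break
        match = KEY_RE.search(line)
        if match:
            frame = int(match.group(1))
            key_frames[frame] = {"frame": frame, "second": None}
            continue
        match = SECOND_RE.search(line)
        if match:
            # If a frame number repeats, the last offset seen is kept.
            second_offsets[int(match.group(1))] = pos
    return key_frames, second_offsets


def attach_second_packets(stream, key_frames, second_offsets):
    """Stages 2 and 3: visit only the needed offsets, in ascending order."""
    wanted = sorted(
        (second_offsets[frame], frame)
        for frame in key_frames
        if frame in second_offsets
    )
    for pos, frame in wanted:
        stream.seek(pos)
        key_frames[frame]["second"] = stream.readline().rstrip()
    return key_frames
```

With a real log, the stream would come from something like `open(path, "rb")`. Key frames with no matching second packet simply keep `None` in their `second` field.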

In your current system, you are scanning the file many times (once and then an additional one for each KEY_FRAME record and an additional one again for each second...). The loops you have at the end are very costly:

for s in sentences:
    if any(('FrameNumber = '+f) in s for f in framelist):
        print 'first found'
        #do stuff

    if any(('FN = '+f) in s for f in framelist):
        print 'second found'
        #do stuff
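Each `any(...)` call above walks the whole frame list for every line, so with 6000+ frames the work grows as lines times frames. One way to avoid that, shown here as a rough Python 3 sketch and not as the approach described earlier in this answer, is to pull the frame number out of the line once and test it against a set. The regex and the function name `classify_lines` are assumptions for illustration. Unlike the substring test, this version matches the whole number, so `FN = 12` no longer matches a line containing `FN = 123`.

```python
import re

# Assumed pattern: a decimal frame number after either marker.
FRAME_RE = re.compile(r"\b(FrameNumber|FN) = (\d+)")


def classify_lines(lines, frame_numbers):
    """Return (kind, frame) pairs for lines that mention a wanted frame."""
    wanted = set(frame_numbers)  # constant-time membership test
    hits = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match and match.group(2) in wanted:
            kind = "first" if match.group(1) == "FrameNumber" else "second"
            hits.append((kind, match.group(2)))
    return hits
```

Because each line is examined once and the lookup does not depend on the size of the frame list, the cost per line stays roughly constant as the list grows.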

Context

StackExchange Code Review Q#78224, answer score: 4
