Optimize huge text file search
Problem
I have several huge (100 MB) text files that I need to scan to pick out certain frame numbers which relate to a specific log packet of interest. My plan was to scan for these frame numbers and drop them into a list (often 6000+ frames per text file!). OK so far, but there is a second packet of interest which should accompany the first packet. These pairs of packets can only be matched by a frame number in my newly created list, which avoids grabbing useless/blank data from non-matching frames. So I would then re-scan the text file to get the frame-related packet data. It was suggested on Stack Overflow that I go down the regex route, and whilst this method worked, it was extremely slow, taking up to 3 minutes to process a single text file.
I was wondering if there is a suitable optimization to my code, or even a completely different approach to grabbing this data?
```
for root, subFolders, files in os.walk(path):
    for filename in files:
        if filename.endswith('.txt'):
            with open(os.path.join(root, filename), 'r') as f:
                print '\tProcessing file: '+filename
                for line in f:
                    #first find first key packet and grab frame number
                    if 'KEY_FRAME details' in line:
                        chunk = [next(f) for x in xrange(5)]
                        FRAME = chunk[5].split()
                        FRAME = FRAME[2]
                        #drop frame number into a list
                        framelist.append(str(FRAME))
                #return to the start of the file, and search for next packet
                f.seek(0)
                framed = re.compile('|'.join(framelist))
                framed = framed.pattern
                #Look for any frame number in list based on 'FrameNumber = '+f and 'FN = '+f match
                sentences = f
                for s in sentences:
                    if any(('FrameNumber = '+f) in s for f in framelist):
                        print 'first found'
                        #do stuff
                    if any(('FN = '+f) in s for f in framelist):
                        print 'second found'
                        #do stuff
```
Solution
I would strongly recommend a different algorithm for processing your file. I would do it in a three-stage approach, requiring only two reads of the file:
- In the first stage, scan the file and do a few things:
  - For each line, call `stream.tell()` on the file stream and remember the byte position.
  - Identify all Second_Packet blocks and the FN they relate to.
  - Store the FN and the byte position from `tell()` in a dictionary.
  - Identify all KEY_FRAME blocks and create an object to represent each one, with its number. Store these objects in a list.
- The second stage involves processing the KEY_FRAME records and identifying where their Second_Packet records are. Using the dictionary built in the first stage, order the requests so they happen in byte-position order in the stream.
- In the third stage, scan the file again, in order of the Second_Packet byte positions and the KEY_FRAME instances they belong to:
  - Seek in the file to the position of the first needed Second_Packet.
  - Strip off whatever information you need to complete the record.
  - Update the KEY_FRAME instance with the required information.
By performing only two scans through the file (the first is a full scan, the second is an in-order-but-random-and-selective scan) you reduce the number of times you process the data.
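The stages above could look roughly like the following. This is only a sketch, written for Python 3 (the question's code is Python 2), and it makes assumptions the question does not confirm: that a key frame can be recognised by a line containing `FrameNumber = <token>`, that a second packet can be recognised by a line containing `FN = <token>`, and that a plain dict is enough to represent a KEY_FRAME record. The names `index_file` and `fill_second_packets` are made up for illustration. The file is opened in binary mode because Python 3 disables `tell()` while iterating over a text-mode file.

```python
import re

# Assumed line shapes, based on the strings the question searches for;
# the real log layout may differ.
KEY_RE = re.compile(rb"FrameNumber = (\S+)")
SECOND_RE = re.compile(rb"\bFN = (\S+)")


def index_file(path):
    """Stage 1: one full scan recording key frames and second-packet offsets."""
    key_frames = {}      # frame number -> record to be completed later
    second_offsets = {}  # frame number -> byte offset of its 'FN = ' line
    with open(path, "rb") as stream:
        while True:
            offset = stream.tell()    # position of the line about to be read
            line = stream.readline()
            if not line:
                break
            match = KEY_RE.search(line)
            if match:
                frame = match.group(1)
                key_frames[frame] = {"frame": frame, "second": None}
                continue
            match = SECOND_RE.search(line)
            if match:
                second_offsets[match.group(1)] = offset
    return key_frames, second_offsets


def fill_second_packets(path, key_frames, second_offsets):
    """Stages 2 and 3: visit only the needed offsets, in ascending order."""
    wanted = sorted(
        (offset, frame)
        for frame, offset in second_offsets.items()
        if frame in key_frames        # skip packets with no matching key frame
    )
    with open(path, "rb") as stream:
        for offset, frame in wanted:
            stream.seek(offset)
            # Hypothetical "strip off what you need": keep the whole line.
            key_frames[frame]["second"] = stream.readline().rstrip(b"\r\n")
    return key_frames
```

Sorting the offsets before seeking keeps the second pass moving forward through the file, which is the "in-order" part of the selective scan.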
In your current system, you are scanning the file many times (once and then an additional one for each KEY_FRAME record and an additional one again for each second...). The loops you have at the end are very costly:
```
for s in sentences:
    if any(('FrameNumber = '+f) in s for f in framelist):
        print 'first found'
        #do stuff
    if any(('FN = '+f) in s for f in framelist):
        print 'second found'
        #do stuff
```
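To see why these loops are costly: each `any(...)` runs one substring search per entry in `framelist`, so with 6000+ frames every line of the file is searched thousands of times. One possible alternative, which is not the staged approach described above, is to pull the frame token out of the line once and test it against a set. The sketch below is written for Python 3 and assumes the frame number is the whitespace-delimited token after the label; `classify_lines` is a hypothetical helper name.

```python
import re

# Assumed format: '<label> = <token>', with the token ending at whitespace.
LABEL_RE = re.compile(r"\b(FrameNumber|FN) = (\S+)")


def classify_lines(lines, frames):
    """Yield (label, frame, line) for each line that names a wanted frame."""
    wanted = set(frames)              # average O(1) membership test
    for line in lines:
        match = LABEL_RE.search(line)
        if match and match.group(2) in wanted:
            yield match.group(1), match.group(2), line
```

Note that this compares whole tokens, so a wanted frame `12` no longer matches a line containing `FN = 123`, which the substring test in the loops above would accept.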
Context
StackExchange Code Review Q#78224, answer score: 4