HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Python search for array in large text file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
filesearcharraytextlargeforpython

Problem

I asked similar question at: https://stackoverflow.com/questions/38410982/superfast-regexmatch-in-large-text-file

from datetime import datetime
startTime = datetime.now()
f = open('data.txt', 'r')
t = f.read().splitlines()
paid = list(set(t))
with open('f.txt') as f:
    for line in f:
        for x in paid:
            if line[:7]==x:
                print (line)
print (datetime.now() - startTime)
##=>0:11:05.542505


from datetime import datetime
startTime = datetime.now()
f = open('data.txt', 'r')
t = f.read().splitlines()
paid = list(set(t))
with open('f.txt') as f:
    for line in f:
        for x in paid:
            if line.startswith(x):
                print (line)
print (datetime.now() - startTime)
##=>0:11:32.997729


Is there any faster way?

f.txt has ~ 14 million lines, and looks like this:

```
4287053 06218896 N 19801222 19810901 19881222 M171
4287053 06218896 N 19801222 19810901 19850211 M170
4289713 06222552 Y 19810105 19810915 19930330 SM02
4289713 06222552 Y 19810105 19810915 19930303 M285
4289713 06222552 Y 19810105 19810915 19921208 RMPN
4289713 06222552 Y 19810105 19810915 19921208 ASPN
4289713 06222552 Y 19810105 19810915 19881116 ASPN
4289713 06222552 Y 19810105 19810915 19881107 M171
4289713 06222552 Y 19810105 19810915 19850306 M170
4291808 06215853 N 19801212 19810929 19851031 EXP.
4291808 06215853 N 19801212 19810929 19850812 REM.
4292069 06227825 N 19810123 19810929 19930926 EXP.
4292069 06227825 N 19810123 19810929 19890323 ASPN
4292069 06227825 N 19810123 19810929 19890320 M171
4292069 06227825 N 19810123 19810929 19850314 M170
4292142 06224175 N 19810112 19810929 19930926 EXP.
4292142 06224175 N 19810112 19810929 19890316 M171
4292142 06224175 N 19810112 19810929 19861008 ASPN
4292142 06224175 N 19810112 19810929 19850925 M170
4292142 06224175 N 19810112 19810929 19850925 M176
4292142 06224175 N 19810112 19810929 19850812 REM.

RE45962 14454334 Y 20140807 20160405 20160323 ASPN
RE45972 14335639 N

Solution

Performance

It should be possible to accomplish this task in seconds rather than minutes, with the right data structure. This is your main mistake:

paid = list(set(t))


The problem is, for a list with n items, it takes O(n) time to check whether the list contains a particular item. It's particularly bad if the vast majority of the entries that you are seeking do not appear in the list — then it really does need to look at every single member of the list before concluding that an item is not in the list.

On the other hand, testing whether a particular element is a member of a set is O(1) — extremely efficient!

Miscellaneous

Please avoid cryptic filenames such as x and t. Calling the file handle f is OK, since f is a typical name for a file handle, but since you have two files in this program, more distinctive names would help readability. I would also avoid renaming the patent data file to f.txt, which is too cryptic.

It's a good habit to always call open() in the context of a with block, so that the file handle will be automatically closed for sure. The 'r' mode is implied, and can be omitted.

Suggested solution

with open('data.txt') as paid_file:
    paid = set(line.rstrip() for line in paid_file)

with open('MaintFeeEvents_20160711.txt') as patent_events_file:
    for line in patent_events_file:
        if line.split(' ', 1)[0] in paid:
            print(line, end='')

Code Snippets

paid = list(set(t))
with open('data.txt') as paid_file:
    paid = set(line.rstrip() for line in paid_file)

with open('MaintFeeEvents_20160711.txt') as patent_events_file:
    for line in patent_events_file:
        if line.split(' ', 1)[0] in paid:
            print(line, end='')

Context

StackExchange Code Review Q#135159, answer score: 10

Revisions (0)

No revisions yet.