patternpythonModerate
Python search for array in large text file
Viewed 0 times
filesearcharraytextlargeforpython
Problem
I asked similar question at: https://stackoverflow.com/questions/38410982/superfast-regexmatch-in-large-text-file
Is there any faster way?
f.txt has ~ 14 million lines, and looks like this:
```
4287053 06218896 N 19801222 19810901 19881222 M171
4287053 06218896 N 19801222 19810901 19850211 M170
4289713 06222552 Y 19810105 19810915 19930330 SM02
4289713 06222552 Y 19810105 19810915 19930303 M285
4289713 06222552 Y 19810105 19810915 19921208 RMPN
4289713 06222552 Y 19810105 19810915 19921208 ASPN
4289713 06222552 Y 19810105 19810915 19881116 ASPN
4289713 06222552 Y 19810105 19810915 19881107 M171
4289713 06222552 Y 19810105 19810915 19850306 M170
4291808 06215853 N 19801212 19810929 19851031 EXP.
4291808 06215853 N 19801212 19810929 19850812 REM.
4292069 06227825 N 19810123 19810929 19930926 EXP.
4292069 06227825 N 19810123 19810929 19890323 ASPN
4292069 06227825 N 19810123 19810929 19890320 M171
4292069 06227825 N 19810123 19810929 19850314 M170
4292142 06224175 N 19810112 19810929 19930926 EXP.
4292142 06224175 N 19810112 19810929 19890316 M171
4292142 06224175 N 19810112 19810929 19861008 ASPN
4292142 06224175 N 19810112 19810929 19850925 M170
4292142 06224175 N 19810112 19810929 19850925 M176
4292142 06224175 N 19810112 19810929 19850812 REM.
…
RE45962 14454334 Y 20140807 20160405 20160323 ASPN
RE45972 14335639 N
from datetime import datetime
startTime = datetime.now()
f = open('data.txt', 'r')
t = f.read().splitlines()
paid = list(set(t))
with open('f.txt') as f:
for line in f:
for x in paid:
if line[:7]==x:
print (line)
print (datetime.now() - startTime)
##=>0:11:05.542505from datetime import datetime
startTime = datetime.now()
f = open('data.txt', 'r')
t = f.read().splitlines()
paid = list(set(t))
with open('f.txt') as f:
for line in f:
for x in paid:
if line.startswith(x):
print (line)
print (datetime.now() - startTime)
##=>0:11:32.997729Is there any faster way?
f.txt has ~ 14 million lines, and looks like this:
```
4287053 06218896 N 19801222 19810901 19881222 M171
4287053 06218896 N 19801222 19810901 19850211 M170
4289713 06222552 Y 19810105 19810915 19930330 SM02
4289713 06222552 Y 19810105 19810915 19930303 M285
4289713 06222552 Y 19810105 19810915 19921208 RMPN
4289713 06222552 Y 19810105 19810915 19921208 ASPN
4289713 06222552 Y 19810105 19810915 19881116 ASPN
4289713 06222552 Y 19810105 19810915 19881107 M171
4289713 06222552 Y 19810105 19810915 19850306 M170
4291808 06215853 N 19801212 19810929 19851031 EXP.
4291808 06215853 N 19801212 19810929 19850812 REM.
4292069 06227825 N 19810123 19810929 19930926 EXP.
4292069 06227825 N 19810123 19810929 19890323 ASPN
4292069 06227825 N 19810123 19810929 19890320 M171
4292069 06227825 N 19810123 19810929 19850314 M170
4292142 06224175 N 19810112 19810929 19930926 EXP.
4292142 06224175 N 19810112 19810929 19890316 M171
4292142 06224175 N 19810112 19810929 19861008 ASPN
4292142 06224175 N 19810112 19810929 19850925 M170
4292142 06224175 N 19810112 19810929 19850925 M176
4292142 06224175 N 19810112 19810929 19850812 REM.
…
RE45962 14454334 Y 20140807 20160405 20160323 ASPN
RE45972 14335639 N
Solution
Performance
It should be possible to accomplish this task in seconds rather than minutes, with the right data structure. This is your main mistake:
The problem is, for a list with n items, it takes O(n) time to check whether the list contains a particular item. It's particularly bad if the vast majority of the entries that you are seeking do not appear in the list — then it really does need to look at every single member of the list before concluding that an item is not in the list.
On the other hand, testing whether a particular element is a member of a
Miscellaneous
Please avoid cryptic filenames such as
It's a good habit to always call
Suggested solution
It should be possible to accomplish this task in seconds rather than minutes, with the right data structure. This is your main mistake:
paid = list(set(t))The problem is, for a list with n items, it takes O(n) time to check whether the list contains a particular item. It's particularly bad if the vast majority of the entries that you are seeking do not appear in the list — then it really does need to look at every single member of the list before concluding that an item is not in the list.
On the other hand, testing whether a particular element is a member of a
set is O(1) — extremely efficient!Miscellaneous
Please avoid cryptic filenames such as
x and t. Calling the file handle f is OK, since f is a typical name for a file handle, but since you have two files in this program, more distinctive names would help readability. I would also avoid renaming the patent data file to f.txt, which is too cryptic.It's a good habit to always call
open() in the context of a with block, so that the file handle will be automatically closed for sure. The 'r' mode is implied, and can be omitted.Suggested solution
with open('data.txt') as paid_file:
paid = set(line.rstrip() for line in paid_file)
with open('MaintFeeEvents_20160711.txt') as patent_events_file:
for line in patent_events_file:
if line.split(' ', 1)[0] in paid:
print(line, end='')Code Snippets
paid = list(set(t))with open('data.txt') as paid_file:
paid = set(line.rstrip() for line in paid_file)
with open('MaintFeeEvents_20160711.txt') as patent_events_file:
for line in patent_events_file:
if line.split(' ', 1)[0] in paid:
print(line, end='')Context
StackExchange Code Review Q#135159, answer score: 10
Revisions (0)
No revisions yet.