Why is this program for extracting IDs from a file so slow?
Problem
This is roughly what my data file looks like:

    # Monid U B V R I u g r i J Jerr H Herr K Kerr IRAC1 I1err IRAC2 I2err IRAC3 I3err IRAC4 I4err MIPS24 M24err SpT HaEW mem comp
    Mon-000001 99.999 99.999 21.427 99.999 18.844 99.999 99.999 99.999 99.999 16.144 99.999 15.809 0.137 16.249 99.999 15.274 0.033 15.286 0.038 99.999 99.999 99.999 99.999 99.999 99.999 null 55.000 1 N
    Mon-000002 99.999 99.999 20.905 19.410 17.517 99.999 99.999 99.999 99.999 15.601 0.080 15.312 0.100 14.810 0.110 14.467 0.013 14.328 0.019 14.276 0.103 99.999 0.048 99.999 99.999 null -99.999 2 N

...and it's a total of 31 MB in size. Here's my Python script that pulls the Mon-###### IDs (found at the beginning of each line):

    import re

    def pullIDs(file_input):
        '''Pulls Mon-IDs from input file.'''
        arrayID = []
        with open(file_input,'rU') as user_file:
            for line in user_file:
                arrayID.append(re.findall('Mon\-\d{6}',line))
        return arrayID

    print pullIDs(raw_input("Enter your first file: "))

The script works, but for this particular file it ran for well over 5 minutes and I eventually killed the process out of impatience. Is this just something I'll have to deal with in Python? That is, should this be written in a compiled language, considering the size of my data file?

Further info:

This script is being run within Emacs. As the accepted answer explains, that is why it was running so slowly.
Solution
You said in comments that you don't know how to create a self-contained test case. But that's really easy! All that's needed is a function like this:

    def test_case(filename, n):
        """Write n lines of test data to filename."""
        with open(filename, 'w') as f:
            for i in range(n):
                f.write('Mon-{0:06d} {1}\n'.format(i + 1, ' 99.999' * 20))

You can use this to make a test case of about the right size:

    >>> test_case('cr36275.data', 200000)
    >>> import os
    >>> os.stat('cr36275.data').st_size
    34400000

That's about 34 MB, so close enough. Now we can see how fast your code really is, using the timeit module:

    >>> from timeit import timeit
    >>> timeit(lambda:pullIDs('cr36275.data'), number=1)
    1.3354740142822266

Just over a second. There's nothing wrong with your code or the speed of Python.
So why does it take you many minutes? Well, you say that you're running it inside Emacs. That means that when you run

    >>> pullIDs('cr36275.data')

Python prints out a list of 200,000 ids, and Emacs reads this line of output into the Python buffer and applies syntax highlighting rules to it as it goes. Emacs' syntax highlighting code is designed to work on lines of source code (at most a few hundred characters but mostly 80 characters or less), not on lines of output that are millions of characters long. This is what is taking all the time.

So don't do that. Read the list of ids into a variable and if you need to look at it, use slicing to look at bits of it:

    >>> ids = pullIDs('cr36275.data')
    >>> ids[:10]
    [['Mon-000001'], ['Mon-000002'], ['Mon-000003'], ['Mon-000004'], ['Mon-000005'],
     ['Mon-000006'], ['Mon-000007'], ['Mon-000008'], ['Mon-000009'], ['Mon-000010']]
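If slicing at the prompt becomes repetitive, the same idea can be wrapped in small helpers. The following is only a sketch in Python 3: `flatten`, `preview`, and `write_ids` are hypothetical names that do not appear in the question or answer, and they assume the list-of-lists shape that `pullIDs` returns.

```python
from itertools import chain


def flatten(ids_per_line):
    """Turn [['Mon-000001'], ['Mon-000002'], ...] into a flat list of IDs."""
    return list(chain.from_iterable(ids_per_line))


def preview(ids, limit=10):
    """Return a short, multi-line summary instead of one enormous line."""
    lines = ['{0} ids total'.format(len(ids))]
    lines.extend(ids[:limit])
    if len(ids) > limit:
        lines.append('... ({0} more)'.format(len(ids) - limit))
    return '\n'.join(lines)


def write_ids(ids, filename):
    """Write one ID per line so that no single line grows very long."""
    with open(filename, 'w') as f:
        f.writelines(id_ + '\n' for id_ in ids)


if __name__ == '__main__':
    # Stand-in data with the same shape as the output of pullIDs.
    sample = [['Mon-{0:06d}'.format(i + 1)] for i in range(25)]
    print(preview(flatten(sample), limit=5))
```

Either route keeps every line of output short: `preview` shows a handful of IDs on separate lines, and `write_ids` sends the full list to a file that can be opened separately.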
Context
StackExchange Code Review Q#36275, answer score: 4