Why is this program for extracting IDs from a file so slow?

Submitted by: @import:stackexchange-codereview

Problem

This is roughly what my data file looks like:

# Monid      U        B       V       R       I      u       g        r       i       J      Jerr     H      Herr      K      Kerr   IRAC1    I1err  IRAC2    I2err  IRAC3    I3err  IRAC4    I4err  MIPS24  M24err  SpT    HaEW       mem comp
Mon-000001  99.999  99.999  21.427  99.999  18.844  99.999  99.999  99.999  99.999  16.144  99.999  15.809   0.137  16.249  99.999  15.274   0.033  15.286   0.038  99.999  99.999  99.999  99.999  99.999  99.999  null   55.000        1  N
Mon-000002  99.999  99.999  20.905  19.410  17.517  99.999  99.999  99.999  99.999  15.601   0.080  15.312   0.100  14.810   0.110  14.467   0.013  14.328   0.019  14.276   0.103  99.999   0.048  99.999  99.999  null  -99.999        2  N


...and it's a total of 31 MB in size. Here's my Python script that pulls the Mon-###### IDs
(found at the beginning of each of the lines).

import re

def pullIDs(file_input):
    '''Pulls Mon-IDs from input file.'''

    arrayID = []
    with open(file_input,'rU') as user_file:
        for line in user_file:
            arrayID.append(re.findall('Mon\-\d{6}',line))
    return arrayID

print pullIDs(raw_input("Enter your first file: "))


The script works, but for this particular file it had been running for about 5 minutes when I lost patience and killed the process. Is this just something I'll have to deal with in Python? That is, should this be written in a compiled language, considering the size of my data file?

Further info:
This script is being run within Emacs. According to the accepted answer, this explains why it was running so slowly.

Solution

You said in comments that you don't know how to create a self-contained test case. But that's really easy! All that's needed is a function like this:

def test_case(filename, n):
    """Write n lines of test data to filename."""
    with open(filename, 'w') as f:
        for i in range(n):
            f.write('Mon-{0:06d} {1}\n'.format(i + 1, '  99.999' * 20))


You can use this to make a test case of about the right size:

>>> test_case('cr36275.data', 200000)
>>> import os
>>> os.stat('cr36275.data').st_size
34400000


That's about 34 MB (roughly 33 MiB), so close enough. Now we can see how fast your code really is, using the timeit module:

>>> from timeit import timeit
>>> timeit(lambda:pullIDs('cr36275.data'), number=1)
1.3354740142822266


Just over a second. There's nothing wrong with your code or the speed of Python.

So why does it take you many minutes? Well, you say that you're running it inside Emacs. That means that when you run

>>> pullIDs('cr36275.data')


Python prints out a list of 200,000 ids, and Emacs reads this line of output into the Python buffer and applies syntax highlighting rules to it as it goes. Emacs' syntax highlighting code is designed to work on lines of source code (at most a few hundred characters but mostly 80 characters or less), not on lines of output that are millions of characters long. This is what is taking all the time.
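To get a feel for how long that single output line is, here is a minimal sketch that measures the printed representation without actually printing it. It assumes Python 3, and `fake_ids` is a hypothetical stand-in that only mimics the nested list `pullIDs` returns for the test file; it is not part of the original code.

```python
def fake_ids(n):
    """Build a list shaped like the pullIDs() result for the n-line test file.

    Hypothetical helper for illustration only: it generates the IDs directly
    instead of reading them from disk.
    """
    return [['Mon-{0:06d}'.format(i + 1)] for i in range(n)]


ids = fake_ids(200000)

# repr() is what the interactive prompt would write out, all on one line.
line_length = len(repr(ids))
print(line_length)
```

Each ID contributes 16 characters (the quoted ID, its brackets, and the separator), so the 200,000-line test file produces a single line of 3,200,000 characters, which is what Emacs then has to highlight.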

So don't do that. Read the list of ids into a variable and if you need to look at it, use slicing to look at bits of it:

>>> ids = pullIDs('cr36275.data')
>>> ids[:10]
[['Mon-000001'], ['Mon-000002'], ['Mon-000003'], ['Mon-000004'], ['Mon-000005'],
 ['Mon-000006'], ['Mon-000007'], ['Mon-000008'], ['Mon-000009'], ['Mon-000010']]
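The nested lists in the output above come from `re.findall`, which returns a list for every line. If a flat list is more convenient, something along these lines might work. This is only a sketch under stated assumptions: Python 3, IDs always at the start of a line (as the question describes), and `pull_ids_flat` and `MON_ID` are hypothetical names, not part of the original answer. Unlike `findall`, `match` only looks at the start of each line and yields at most one ID per line.

```python
import re

# Compiled once; the raw string avoids needing to escape the hyphen.
MON_ID = re.compile(r'Mon-\d{6}')


def pull_ids_flat(path):
    """Return a flat list of Mon-IDs found at the start of each line."""
    ids = []
    with open(path) as f:
        for line in f:
            m = MON_ID.match(line)  # anchored at the start of the line
            if m:                   # header and malformed lines are skipped
                ids.append(m.group())
    return ids
```

With this variant, slicing the result shows plain strings such as 'Mon-000001' rather than one-element lists.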

Context

StackExchange Code Review Q#36275, answer score: 4
