HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Script for analyzing millions of lines

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
scriptmillionsanalyzingforlines

Problem

I have files with over 100 million lines in each:

...
01-AUG-2012 02:29:44 important data
01-AUG-2012 02:29:44 important data
01-AUG-2012 02:36:02 important data
some unimportant data
blahblah (also unimportant data)
some unimportant data
01-AUG-2012 02:40:15 important data
some unimportant data
...


As you can see, there are important data (starting with date and time) and unimportant data. Also in each second, there can be many lines of important data.

My goal is to count the number of "important data" in each second (or minute or hour...) and reformat date/time format. My script also lets me count data in each minute, hour etc using options.dlen:

options.dlen = 10 takes YYYY-MM-DDD
options.dlen = 13 takes YYYY-MM-DDD HH
options.dlen = 16 takes YYYY-MM-DDD HH:MM
options.dlen = 20 takes YYYY-MM-DDD HH:MM:SS


I have written the following script (this is the main part - I skip all the file openings, parameters etc).

DATA = {}

# search for DD-MMM-YYYY HH:MM:SS
# e.g. "01-JUL-2012 02:29:36 important data"
pattern = re.compile('^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')

DATA = defaultdict(int)
i = 0
f = open(options.infilename, 'r')
for line in f:
    if re.match(pattern, line):
        if options.verbose:
            i += 1
            # print out every 1000 iterations
            if i % 1000 == 0:
                print str(i) + '\r',

        # converts data date/time format to YYYY-MM-DD HH:MM:SS format (but still keep it as datetime !)
        d = datetime.strptime( line [0:20], '%d-%b-%Y %H:%M:%S')
        # converts d, which is datetime to string again
        day_string = d.strftime('%Y-%m-%d %H:%M:%S')
        DATA [ str(day_string[0:int(options.dlen)]) ] += 1
f.close()
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1), reverse=True)
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1))
L2 = sorted(DATA.iteritems(), key=operator.itemgetter(0))


It takes about 3 hours to process over 100 million lines. C

Solution


  1. Make a test case



The first thing to do is to establish the existence of the performance problem, so let's knock up some test data.

def make_test_file(filename, n, t, delta):
    """
    Write `n` lines of test data to `filename`, starting at `t` (a
    datetime object) and stepping by `delta` (a timedelta object) each
    line.
    """
    with open(filename, 'w') as f:
        for _ in xrange(n):
            f.write(t.strftime('%d-%b-%Y %H:%M:%S ').upper())
            f.write('important data\n')
            t += delta

>>> from datetime import datetime, timedelta
>>> make_test_file('data.txt', 10**5, datetime.now(), timedelta(seconds=1))


And then with the OP's code in the function aggregate1(filename, dlen):

>>> import timeit
>>> timeit.timeit(lambda:aggregate1('data.txt', 16), number = 1)
5.786283016204834


So on the real file (1000 times bigger) that would take an hour and a half on my machine (or longer, if the time complexity is worse than linear). So yes, there's a real performance problem.

  1. Clean up the code



Let's try a bunch of obvious minor improvements and optimizations (mostly as suggested in other answers):

-
Convert dlen to an integer once (not every for every line).

-
Write day_string[:dlen] instead of str(day_string[0:dlen]).

-
Write pattern.match(line) instead of re.match(pattern, line).

-
There's no need for key = operator.itemgetter(0) because the sort will proceed on the first element of the pair in any case.

-
Rename DATA as count and day_string with s (it's really a date-time string, not a day string).

-
Use with to ensure that the file is closed in the event of an error.

-
Import the name strptime so it doesn't have to be looked up for every line.

Let's try that:

def aggregate2(filename, dlen):
    strptime = datetime.datetime.strptime
    dlen = int(dlen)
    pattern = re.compile(r'^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
    count = defaultdict(int)
    with open(filename, 'r') as f:
        for line in f:
            if pattern.match(line):
                d = strptime(line[:20], '%d-%b-%Y %H:%M:%S')
                s = d.strftime('%Y-%m-%d %H:%M:%S')
                count[s[:dlen]] += 1
    return sorted(count.iteritems())

>>> timeit.timeit(lambda:aggregate2('data.txt', 10), number = 1)
5.200263977050781


A small improvement, 10% or so, but clean code makes the next step easier.

  1. Profile



>>> import cProfile
>>> cProfile.run("aggregate2('data.txt', 10)")
         2700009 function calls in 6.262 seconds

   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.262    6.262 :1()
   100000    0.088    0.000    1.020    0.000 _strptime.py:27(_getlang)
   100000    2.098    0.000    4.033    0.000 _strptime.py:295(_strptime)
   100000    0.393    0.000    0.642    0.000 locale.py:339(normalize)
   100000    0.105    0.000    0.747    0.000 locale.py:407(_parse_localename)
   100000    0.119    0.000    0.933    0.000 locale.py:506(getlocale)
   100000    0.067    0.000    0.067    0.000 {_locale.setlocale}
   100000    0.515    0.000    4.548    0.000 {built-in method strptime}
   100000    0.079    0.000    0.079    0.000 {isinstance}
   200000    0.035    0.000    0.035    0.000 {len}
   100000    0.043    0.000    0.043    0.000 {method 'end' of '_sre.SRE_Match' objects}
   300001    0.076    0.000    0.076    0.000 {method 'get' of 'dict' objects}
   100000    0.276    0.000    0.276    0.000 {method 'groupdict' of '_sre.SRE_Match' objects}
   100000    0.090    0.000    0.090    0.000 {method 'index' of 'list' objects}
   100000    0.025    0.000    0.025    0.000 {method 'iterkeys' of 'dict' objects}
   100000    0.046    0.000    0.046    0.000 {method 'lower' of 'str' objects}
   200000    0.553    0.000    0.553    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
   100000    1.144    0.000    1.144    0.000 {method 'strftime' of 'datetime.date' objects}


I've cut some of the output for clarity. It should be clear that the culprits are strptime (73% of runtime), strftime (18%), and match (9%). Everything else is either called by one of those, or negligible.

  1. Pluck the low-hanging fruit



We can avoid calling both strptime and strftime if we recognize that the only things we are achieving by calling these two functions are (a) to translate the months from names (AUG) to numbers (08), and (b) to reorder the components into ISO standard order. So let's do that ourselves:

```
def aggregate3(filename, dlen):
dlen = int(dlen)
months = dict(JAN = '01', FEB = '02', MAR = '03', APR = '04',
MAY = '05', JUN = '06', JUL = '07', AUG = '08',
SEP = '09', OCT = '10', NOV = '11', DEC = '12')
pattern = re.compile(r'^(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}:\d{2}:\d{2}) '
'important data')
count = defaultdict(int)
with open

Code Snippets

def make_test_file(filename, n, t, delta):
    """
    Write `n` lines of test data to `filename`, starting at `t` (a
    datetime object) and stepping by `delta` (a timedelta object) each
    line.
    """
    with open(filename, 'w') as f:
        for _ in xrange(n):
            f.write(t.strftime('%d-%b-%Y %H:%M:%S ').upper())
            f.write('important data\n')
            t += delta

>>> from datetime import datetime, timedelta
>>> make_test_file('data.txt', 10**5, datetime.now(), timedelta(seconds=1))
>>> import timeit
>>> timeit.timeit(lambda:aggregate1('data.txt', 16), number = 1)
5.786283016204834
def aggregate2(filename, dlen):
    strptime = datetime.datetime.strptime
    dlen = int(dlen)
    pattern = re.compile(r'^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
    count = defaultdict(int)
    with open(filename, 'r') as f:
        for line in f:
            if pattern.match(line):
                d = strptime(line[:20], '%d-%b-%Y %H:%M:%S')
                s = d.strftime('%Y-%m-%d %H:%M:%S')
                count[s[:dlen]] += 1
    return sorted(count.iteritems())

>>> timeit.timeit(lambda:aggregate2('data.txt', 10), number = 1)
5.200263977050781
>>> import cProfile
>>> cProfile.run("aggregate2('data.txt', 10)")
         2700009 function calls in 6.262 seconds

   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.262    6.262 <string>:1(<module>)
   100000    0.088    0.000    1.020    0.000 _strptime.py:27(_getlang)
   100000    2.098    0.000    4.033    0.000 _strptime.py:295(_strptime)
   100000    0.393    0.000    0.642    0.000 locale.py:339(normalize)
   100000    0.105    0.000    0.747    0.000 locale.py:407(_parse_localename)
   100000    0.119    0.000    0.933    0.000 locale.py:506(getlocale)
   100000    0.067    0.000    0.067    0.000 {_locale.setlocale}
   100000    0.515    0.000    4.548    0.000 {built-in method strptime}
   100000    0.079    0.000    0.079    0.000 {isinstance}
   200000    0.035    0.000    0.035    0.000 {len}
   100000    0.043    0.000    0.043    0.000 {method 'end' of '_sre.SRE_Match' objects}
   300001    0.076    0.000    0.076    0.000 {method 'get' of 'dict' objects}
   100000    0.276    0.000    0.276    0.000 {method 'groupdict' of '_sre.SRE_Match' objects}
   100000    0.090    0.000    0.090    0.000 {method 'index' of 'list' objects}
   100000    0.025    0.000    0.025    0.000 {method 'iterkeys' of 'dict' objects}
   100000    0.046    0.000    0.046    0.000 {method 'lower' of 'str' objects}
   200000    0.553    0.000    0.553    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
   100000    1.144    0.000    1.144    0.000 {method 'strftime' of 'datetime.date' objects}
def aggregate3(filename, dlen):
    dlen = int(dlen)
    months = dict(JAN = '01', FEB = '02', MAR = '03', APR = '04',
                  MAY = '05', JUN = '06', JUL = '07', AUG = '08',
                  SEP = '09', OCT = '10', NOV = '11', DEC = '12')
    pattern = re.compile(r'^(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}:\d{2}:\d{2}) '
                         'important data')
    count = defaultdict(int)
    with open(filename, 'r') as f:
        for line in f:
            m = pattern.match(line)
            if m:
                s = '{3}-{0}-{1} {4}'.format(months[m.group(2)], *m.groups())
                count[s[:dlen]] += 1
    return sorted(count.iteritems())

>>> timeit.timeit(lambda:aggregate3('data.txt', 10), number = 1)
0.5073871612548828

Context

StackExchange Code Review Q#15214, answer score: 11

Revisions (0)

No revisions yet.