patternpythonModerate
Script for analyzing millions of lines
Viewed 0 times
scriptmillionsanalyzingforlines
Problem
I have files with over 100 million lines in each:
As you can see, there are important data (starting with date and time) and unimportant data. Also in each second, there can be many lines of important data.
My goal is to count the number of "important data" in each second (or minute or hour...) and reformat date/time format. My script also lets me count data in each minute, hour etc using
I have written the following script (this is the main part - I skip all the file openings, parameters etc).
It takes about 3 hours to process over 100 million lines. C
...
01-AUG-2012 02:29:44 important data
01-AUG-2012 02:29:44 important data
01-AUG-2012 02:36:02 important data
some unimportant data
blahblah (also unimportant data)
some unimportant data
01-AUG-2012 02:40:15 important data
some unimportant data
...As you can see, there are important data (starting with date and time) and unimportant data. Also in each second, there can be many lines of important data.
My goal is to count the number of "important data" in each second (or minute or hour...) and reformat date/time format. My script also lets me count data in each minute, hour etc using
options.dlen:options.dlen = 10 takes YYYY-MM-DDD
options.dlen = 13 takes YYYY-MM-DDD HH
options.dlen = 16 takes YYYY-MM-DDD HH:MM
options.dlen = 20 takes YYYY-MM-DDD HH:MM:SSI have written the following script (this is the main part - I skip all the file openings, parameters etc).
DATA = {}
# search for DD-MMM-YYYY HH:MM:SS
# e.g. "01-JUL-2012 02:29:36 important data"
pattern = re.compile('^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
DATA = defaultdict(int)
i = 0
f = open(options.infilename, 'r')
for line in f:
if re.match(pattern, line):
if options.verbose:
i += 1
# print out every 1000 iterations
if i % 1000 == 0:
print str(i) + '\r',
# converts data date/time format to YYYY-MM-DD HH:MM:SS format (but still keep it as datetime !)
d = datetime.strptime( line [0:20], '%d-%b-%Y %H:%M:%S')
# converts d, which is datetime to string again
day_string = d.strftime('%Y-%m-%d %H:%M:%S')
DATA [ str(day_string[0:int(options.dlen)]) ] += 1
f.close()
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1), reverse=True)
#L2 = sorted(DATA.iteritems(), key=operator.itemgetter(1))
L2 = sorted(DATA.iteritems(), key=operator.itemgetter(0))It takes about 3 hours to process over 100 million lines. C
Solution
- Make a test case
The first thing to do is to establish the existence of the performance problem, so let's knock up some test data.
def make_test_file(filename, n, t, delta):
"""
Write `n` lines of test data to `filename`, starting at `t` (a
datetime object) and stepping by `delta` (a timedelta object) each
line.
"""
with open(filename, 'w') as f:
for _ in xrange(n):
f.write(t.strftime('%d-%b-%Y %H:%M:%S ').upper())
f.write('important data\n')
t += delta
>>> from datetime import datetime, timedelta
>>> make_test_file('data.txt', 10**5, datetime.now(), timedelta(seconds=1))And then with the OP's code in the function
aggregate1(filename, dlen):>>> import timeit
>>> timeit.timeit(lambda:aggregate1('data.txt', 16), number = 1)
5.786283016204834So on the real file (1000 times bigger) that would take an hour and a half on my machine (or longer, if the time complexity is worse than linear). So yes, there's a real performance problem.
- Clean up the code
Let's try a bunch of obvious minor improvements and optimizations (mostly as suggested in other answers):
-
Convert
dlen to an integer once (not every for every line).-
Write
day_string[:dlen] instead of str(day_string[0:dlen]).-
Write
pattern.match(line) instead of re.match(pattern, line).-
There's no need for
key = operator.itemgetter(0) because the sort will proceed on the first element of the pair in any case.-
Rename
DATA as count and day_string with s (it's really a date-time string, not a day string).-
Use
with to ensure that the file is closed in the event of an error.-
Import the name
strptime so it doesn't have to be looked up for every line.Let's try that:
def aggregate2(filename, dlen):
strptime = datetime.datetime.strptime
dlen = int(dlen)
pattern = re.compile(r'^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
count = defaultdict(int)
with open(filename, 'r') as f:
for line in f:
if pattern.match(line):
d = strptime(line[:20], '%d-%b-%Y %H:%M:%S')
s = d.strftime('%Y-%m-%d %H:%M:%S')
count[s[:dlen]] += 1
return sorted(count.iteritems())
>>> timeit.timeit(lambda:aggregate2('data.txt', 10), number = 1)
5.200263977050781A small improvement, 10% or so, but clean code makes the next step easier.
- Profile
>>> import cProfile
>>> cProfile.run("aggregate2('data.txt', 10)")
2700009 function calls in 6.262 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 6.262 6.262 :1()
100000 0.088 0.000 1.020 0.000 _strptime.py:27(_getlang)
100000 2.098 0.000 4.033 0.000 _strptime.py:295(_strptime)
100000 0.393 0.000 0.642 0.000 locale.py:339(normalize)
100000 0.105 0.000 0.747 0.000 locale.py:407(_parse_localename)
100000 0.119 0.000 0.933 0.000 locale.py:506(getlocale)
100000 0.067 0.000 0.067 0.000 {_locale.setlocale}
100000 0.515 0.000 4.548 0.000 {built-in method strptime}
100000 0.079 0.000 0.079 0.000 {isinstance}
200000 0.035 0.000 0.035 0.000 {len}
100000 0.043 0.000 0.043 0.000 {method 'end' of '_sre.SRE_Match' objects}
300001 0.076 0.000 0.076 0.000 {method 'get' of 'dict' objects}
100000 0.276 0.000 0.276 0.000 {method 'groupdict' of '_sre.SRE_Match' objects}
100000 0.090 0.000 0.090 0.000 {method 'index' of 'list' objects}
100000 0.025 0.000 0.025 0.000 {method 'iterkeys' of 'dict' objects}
100000 0.046 0.000 0.046 0.000 {method 'lower' of 'str' objects}
200000 0.553 0.000 0.553 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
100000 1.144 0.000 1.144 0.000 {method 'strftime' of 'datetime.date' objects}I've cut some of the output for clarity. It should be clear that the culprits are
strptime (73% of runtime), strftime (18%), and match (9%). Everything else is either called by one of those, or negligible.- Pluck the low-hanging fruit
We can avoid calling both
strptime and strftime if we recognize that the only things we are achieving by calling these two functions are (a) to translate the months from names (AUG) to numbers (08), and (b) to reorder the components into ISO standard order. So let's do that ourselves:```
def aggregate3(filename, dlen):
dlen = int(dlen)
months = dict(JAN = '01', FEB = '02', MAR = '03', APR = '04',
MAY = '05', JUN = '06', JUL = '07', AUG = '08',
SEP = '09', OCT = '10', NOV = '11', DEC = '12')
pattern = re.compile(r'^(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}:\d{2}:\d{2}) '
'important data')
count = defaultdict(int)
with open
Code Snippets
def make_test_file(filename, n, t, delta):
"""
Write `n` lines of test data to `filename`, starting at `t` (a
datetime object) and stepping by `delta` (a timedelta object) each
line.
"""
with open(filename, 'w') as f:
for _ in xrange(n):
f.write(t.strftime('%d-%b-%Y %H:%M:%S ').upper())
f.write('important data\n')
t += delta
>>> from datetime import datetime, timedelta
>>> make_test_file('data.txt', 10**5, datetime.now(), timedelta(seconds=1))>>> import timeit
>>> timeit.timeit(lambda:aggregate1('data.txt', 16), number = 1)
5.786283016204834def aggregate2(filename, dlen):
strptime = datetime.datetime.strptime
dlen = int(dlen)
pattern = re.compile(r'^\d{2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2} important data')
count = defaultdict(int)
with open(filename, 'r') as f:
for line in f:
if pattern.match(line):
d = strptime(line[:20], '%d-%b-%Y %H:%M:%S')
s = d.strftime('%Y-%m-%d %H:%M:%S')
count[s[:dlen]] += 1
return sorted(count.iteritems())
>>> timeit.timeit(lambda:aggregate2('data.txt', 10), number = 1)
5.200263977050781>>> import cProfile
>>> cProfile.run("aggregate2('data.txt', 10)")
2700009 function calls in 6.262 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 6.262 6.262 <string>:1(<module>)
100000 0.088 0.000 1.020 0.000 _strptime.py:27(_getlang)
100000 2.098 0.000 4.033 0.000 _strptime.py:295(_strptime)
100000 0.393 0.000 0.642 0.000 locale.py:339(normalize)
100000 0.105 0.000 0.747 0.000 locale.py:407(_parse_localename)
100000 0.119 0.000 0.933 0.000 locale.py:506(getlocale)
100000 0.067 0.000 0.067 0.000 {_locale.setlocale}
100000 0.515 0.000 4.548 0.000 {built-in method strptime}
100000 0.079 0.000 0.079 0.000 {isinstance}
200000 0.035 0.000 0.035 0.000 {len}
100000 0.043 0.000 0.043 0.000 {method 'end' of '_sre.SRE_Match' objects}
300001 0.076 0.000 0.076 0.000 {method 'get' of 'dict' objects}
100000 0.276 0.000 0.276 0.000 {method 'groupdict' of '_sre.SRE_Match' objects}
100000 0.090 0.000 0.090 0.000 {method 'index' of 'list' objects}
100000 0.025 0.000 0.025 0.000 {method 'iterkeys' of 'dict' objects}
100000 0.046 0.000 0.046 0.000 {method 'lower' of 'str' objects}
200000 0.553 0.000 0.553 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
100000 1.144 0.000 1.144 0.000 {method 'strftime' of 'datetime.date' objects}def aggregate3(filename, dlen):
dlen = int(dlen)
months = dict(JAN = '01', FEB = '02', MAR = '03', APR = '04',
MAY = '05', JUN = '06', JUL = '07', AUG = '08',
SEP = '09', OCT = '10', NOV = '11', DEC = '12')
pattern = re.compile(r'^(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}:\d{2}:\d{2}) '
'important data')
count = defaultdict(int)
with open(filename, 'r') as f:
for line in f:
m = pattern.match(line)
if m:
s = '{3}-{0}-{1} {4}'.format(months[m.group(2)], *m.groups())
count[s[:dlen]] += 1
return sorted(count.iteritems())
>>> timeit.timeit(lambda:aggregate3('data.txt', 10), number = 1)
0.5073871612548828Context
StackExchange Code Review Q#15214, answer score: 11
Revisions (0)
No revisions yet.