Running over a log file to extract log times
Problem
I need to run over a log file with plenty of entries (25+GB) in Python to extract some log times.
The snippet below works. On my test file, I'm ending up with roughly 9:30 min (Python) and 4:30 min (PyPy) on 28 million records (small log).

```python
import datetime
import time
import json
import re

format_path = re.compile(
    r"^\S+\s-\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]"
)

threshold = time.strptime('00:01:00,000'.split(',')[0], '%H:%M:%S')
tick = datetime.timedelta(hours=threshold.tm_hour,
                          minutes=threshold.tm_min,
                          seconds=threshold.tm_sec).total_seconds()
zero_time = datetime.timedelta(hours=0, minutes=0, seconds=0)
zero_tick = zero_time.total_seconds()

format_date = '%d/%b/%Y:%H:%M:%S'
obj = {}

test = open('very/big/log', 'r')
for line in test:
    try:
        chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
        user = chunk[1]
        if user[0] == "C":
            this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
            try:
                machine = obj[user]
            except KeyError:
                machine = obj.setdefault(user, {"init": this_time, "last": this_time, "downtime": 0})
            last = machine["last"]
            diff = (this_time - last).total_seconds()
            if diff > tick:
                machine["downtime"] += diff - tick
            machine["last"] = this_time
    except Exception:
        pass
```

While I'm OK with the time the script takes, I'm wondering whether there are any obvious performance pitfalls that I'm stepping into (I'm still in my first weeks of Python).
Can I do anything to make this run faster?
EDIT:
Sample log entry:

```
['2001:470:1f14:169:15f3:824f:8a61:7b59 - SOFTINST [14/Nov/2012:09:32:31 +0100] "POST /setComputerPartition HTTP/1.1" 200 4 "-" "-" 102356']
```

EDIT 2:
One thing I just did was to check whether log entries have the same log time; if so, I reuse this_time from the previous iteration instead of calling:

```python
this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
```
Solution
- You can use regular expressions with the re module to speed up parsing dramatically. For example, the following code extracts the user and time information using a regular expression (note that the user[0] == "C" condition is also checked by the regexp):

```python
import re
import datetime

match_record = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+)").match
strptime = datetime.datetime.strptime

f = open("very/big/log", "rb")
for line in f:
    match = match_record(line)
    if match is not None:
        user, str_time = match.groups()
        time = strptime(str_time, "%d/%b/%Y:%H:%M:%S")
        print user, repr(time)
```

  I think you can make it even faster by reading bigger chunks of data and using the re.finditer method.
- Don't use dictionaries to store the information in the obj dictionary. Tuples are the way to go here, and if you really need names for the fields you can use collections.namedtuple. In any case, tuples will be faster than dictionaries (and lists) and will take less memory.

- Probably if user in obj will be a little bit faster than try/except KeyError here. You can check execution times for small code blocks with the timeit module.

- Check the Python performance tips. For example, the "Avoiding dots..." chapter can be relevant for your code.
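The tuple suggestion above can be sketched like this (Machine is an illustrative name, not from the original code):

```python
import collections

# Illustrative replacement for the per-user dicts: a namedtuple
# with the same three fields.
Machine = collections.namedtuple('Machine', ['init', 'last', 'downtime'])

m = Machine(init=0.0, last=0.0, downtime=0.0)
# namedtuples are immutable; _replace returns an updated copy.
m = m._replace(last=42.0, downtime=m.downtime + 1.5)
```

Field access on a namedtuple compiles down to an index lookup, which is cheaper than the hashing a dict key lookup requires.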
UPDATE
The following version of the code reads bigger chunks of data and uses re.finditer (not sure whether it is faster or not; also note the changes in the regexp):
```python
import re
import datetime

iter_records = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+) .*$", re.M).finditer
strptime = datetime.datetime.strptime

f = open("very/big/log", "rb")
chunk = ""
while True:
    # Probably buffer size can be even bigger
    new = f.read(1048576)
    chunk += new
    end = 0
    for match in iter_records(chunk):
        user, str_time = match.groups()
        time = strptime(str_time, "%d/%b/%Y:%H:%M:%S")
        print user, repr(time)
        end = match.end()
    if end > 0:
        chunk = chunk[end:]
    elif not new:
        break
```
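The chunked-scan loop can be exercised against an in-memory file. This is a Python 3-compatible sketch of the same technique (the bytes pattern and the scan helper are illustrative, not the answer's exact code):

```python
import io
import re

# Same regexp as above, as a bytes pattern since the file is binary.
iter_records = re.compile(rb"^[^ ]+ - (C[^ ]*) \[([^ ]+) .*$", re.M).finditer

def scan(f, bufsize=1048576):
    """Collect (user, raw_timestamp) pairs from a binary file object."""
    results = []
    chunk = b""
    while True:
        new = f.read(bufsize)
        chunk += new
        end = 0
        for match in iter_records(chunk):
            results.append(match.groups())
            end = match.end()
        if end > 0:
            # Drop the part of the buffer that was already matched.
            chunk = chunk[end:]
        elif not new:
            break
    return results
```

With a two-line sample only the record whose user starts with "C" is kept; the other line is skipped by the regexp itself, with no per-line Python logic.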
Context
StackExchange Code Review Q#23570, answer score: 4