HiveBrain v1.2.0

Running over a log file to extract log times


Problem

I need to iterate over a log file with plenty of entries (25+ GB) in Python to extract some log times.

The snippet below works. On my test file with 28 million records (a small log), it takes roughly 9:30 min under CPython and 4:30 min under PyPy.

import datetime
import time
import json
import re

format_path = re.compile(
    r"^\S+\s-\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]"
)

threshold = time.strptime('00:01:00,000'.split(',')[0],'%H:%M:%S')
tick = datetime.timedelta(hours=threshold.tm_hour,minutes=threshold.tm_min,seconds=threshold.tm_sec).total_seconds()
zero_time = datetime.timedelta(hours=0,minutes=0,seconds=0)
zero_tick = zero_time.total_seconds()
format_date = '%d/%b/%Y:%H:%M:%S'

obj = {}
test = open('very/big/log','r')

for line in test:
  try:
      chunk = line.split('+', 1)[0].split('-', 1)[1].split(' ')
      user = chunk[1]
      if (user[0] == "C"):
        this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)
        try:
            machine = obj[user]
        except KeyError:
            machine = obj.setdefault(user,{"init":this_time,"last":this_time,"downtime":0})
        last = machine["last"]
        diff = (this_time-last).total_seconds()
        if (diff > tick):
          machine["downtime"] += diff-tick
        machine["last"] = this_time

  except Exception:
    pass


While I'm OK with how long the script runs, I'm wondering if there are any obvious performance pitfalls that I'm stepping into (I'm still in my first weeks of Python).

Can I do anything to make this run faster?

EDIT:

Sample log entry:

2001:470:1f14:169:15f3:824f:8a61:7b59 - SOFTINST [14/Nov/2012:09:32:31 +0100] "POST /setComputerPartition HTTP/1.1" 200 4 "-" "-" 102356


EDIT 2:

One thing I just did was to check whether consecutive log entries have the same log time; if so, I reuse this_time from the previous iteration instead of calling:

this_time = datetime.datetime.strptime(chunk[2].split('[')[1], format_date)



Solution

-
You can use regular expressions via the re module to speed up parsing dramatically. For example, the following code extracts the user and time information using a regular expression (note that the user[0] == "C" condition is also checked by the regexp):

import re
import datetime

match_record = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+)").match
strptime = datetime.datetime.strptime

f = open("very/big/log", "rb")

for line in f:
    match = match_record(line)
    if match is not None:
        user, str_time = match.groups()
        time = strptime(str_time, "%d/%b/%Y:%H:%M:%S")
        print user, repr(time)


I think you can make it even faster by reading bigger chunks of data and using re.finditer method.

-
Don't use dictionaries to store the per-user information in the obj dictionary. Tuples are the way to go here, and if you really need named fields you can use collections.namedtuple. In either case tuples will be faster than dictionaries (and lists) and will take less memory.
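One caveat: namedtuples are immutable, so the in-place updates to machine["last"] and machine["downtime"] become whole-record replacements via _replace. A rough sketch of what the per-user record could look like (Python 3 syntax; the Machine type and the sample values are made up for illustration):

```python
import datetime
from collections import namedtuple

# Hypothetical record type mirroring the keys used in the obj entries.
Machine = namedtuple('Machine', ['init', 'last', 'downtime'])

tick = 60  # gap threshold in seconds, as in the question
obj = {}

now = datetime.datetime(2012, 11, 14, 9, 32, 31)
m = obj.setdefault('COMP-1', Machine(init=now, last=now, downtime=0))

# namedtuples are immutable, so an update replaces the whole record:
later = now + datetime.timedelta(seconds=90)
diff = (later - m.last).total_seconds()
obj['COMP-1'] = m._replace(last=later, downtime=m.downtime + diff - tick)
```

Replacing the record allocates a new tuple per update, so whether this wins overall depends on the update rate; plain tuples with positional indexing would be the fastest variant.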

-
Probably if user in obj will be a little faster than the try/except KeyError here. You can check execution times for small code blocks with the timeit module.
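A quick way to run that comparison with timeit (Python 3 syntax; the numbers depend heavily on the hit/miss ratio: try/except is cheap when the key is usually present but expensive when a KeyError is actually raised):

```python
import timeit

setup = "d = {'C%d' % i: i for i in range(1000)}"

# Key present -- the common case once a user has been seen once:
t_in = timeit.timeit("'C500' in d and d['C500']", setup=setup, number=100000)
t_try = timeit.timeit(
    "try:\n    d['C500']\nexcept KeyError:\n    pass",
    setup=setup, number=100000)
print('in: %.4fs  try/except: %.4fs' % (t_in, t_try))
```

For this workload most lines belong to already-seen users, so the miss path runs rarely either way; measure with a realistic key distribution before committing to one style.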

-
Check the Python performance tips; for example, the "Avoiding dots..." chapter may be relevant for your code.
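The "avoiding dots" tip is what the snippets in this answer already do with strptime = datetime.datetime.strptime: bind the attribute to a local name once instead of resolving the chain on every iteration. A small illustration (Python 3 syntax; both helper functions are hypothetical):

```python
import datetime

FMT = '%d/%b/%Y:%H:%M:%S'

def parse_dotted(lines):
    # Resolves datetime.datetime.strptime on every iteration.
    return [datetime.datetime.strptime(raw, FMT) for raw in lines]

def parse_local(lines):
    strptime = datetime.datetime.strptime  # one attribute lookup total
    return [strptime(raw, FMT) for raw in lines]
```

Both return identical results; the local-binding version merely skips the repeated attribute lookups, which adds up over tens of millions of records.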

UPDATE

The following version of the code reads bigger chunks of data and uses re.finditer (I'm not sure whether it is actually faster; also note the changes in the regexp):

import re
import datetime

iter_records = re.compile(r"^[^ ]+ - (C[^ ]*) \[([^ ]+) .*$", re.M).finditer
strptime = datetime.datetime.strptime

f = open("very/big/log", "rb")

chunk = ""
while True:
    # Probably buffer size can be even bigger
    new = f.read(1048576)
    chunk += new
    end = 0
    for match in iter_records(chunk):
        user, str_time = match.groups()
        time = strptime(str_time, "%d/%b/%Y:%H:%M:%S")
        print user, repr(time)
        end = match.end()
    if end > 0:
        chunk = chunk[end:]
    elif not new:
        break


Context

StackExchange Code Review Q#23570, answer score: 4
