Diving into Python sequences: analyze an access.log

Submitted by: @import:stackexchange-codereview

Problem

As a first little Python exercise, I wrote an analyzer/summarizer for my nginx access logs. The code works fine, but I'm not sure whether I used the different sequence types properly, or whether I did other stupid things that could lead to bugs.

Steps:

  • read in access.log and poke around heavily to extract the wanted data (requests, IPs and user agents so far)
  • sum the occurrences
  • sort the sums in descending order and write the top x sums to a file

Example generalized log (I don't know if that's helpful):

```
1.1.1.1 - - [21/Feb/2014:06:35:45 +0100] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
1.1.1.1 - - [21/Feb/2014:06:35:45 +0100] "GET /blog.css HTTP/1.1" 200 3663 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
2.2.2.2 - - [21/Feb/2014:06:52:04 +0100] "GET /main/rss HTTP/1.1" 301 178 "-" "Motorola"
2.2.2.2 - - [21/Feb/2014:06:52:04 +0100] "GET /feed/atom.xml HTTP/1.1" 304 0 "-" "Motorola"
3.3.3.3 - - [21/Feb/2014:06:58:14 +0100] "/" 200 1664 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117"
4.4.4.4 - - [21/Feb/2014:07:22:03 +0100] "/" 200 1664 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117"
5.5.5.5 - - [21/Feb/2014:07:32:48 +0100] "GET /main/rss HTTP/1.1" 301 178 "-" "Motorola"
5.5.5.5 - - [21/Feb/2014:07:32:48 +0100] "GET /feed/atom.xml HTTP/1.1" 304 0 "-" "Motorola"
6.6.6.6 - - [21/Feb/2014:08:13:01 +0100] "GET /main/rss HTTP/1.1" 301 178 "-" "Motorola"
6.6.6.6 - - [21/Feb/2014:08:13:01 +0100] "GET /feed/atom.xml HTTP/1.1" 304 0 "-" "Motorola"
7.7.7.7 - - [21/Feb/2014:08:51:25 +0100] "GET /main.php HTTP/1.1" 200 3681 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461)"
7.7.7.7 - - [21/Feb/2014:08:51:34 +0100] "-" 400 0 "-" "-"
7.7.7.7 - - [21/Feb/2014:08:51:48 +0100] "GET /tag/php.php HTTP/1.1" 200 4673 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q
```

Solution

I find your find_chars() method amusingly creative. I'll comment on the big picture instead.

- Class design: Passing all the parameters into the constructor makes the class less versatile. Those parameters don't need to be part of the object's state. Consider this outline instead:

```python
class LogAnalyzer:
    def __init__(self):
        self.summary = …

    def analyze(self, logfile):
        …

    def summarize(self, topcount=5):
        …
```


Then you have the flexibility to summarize several log files at once:

```python
analysis = LogAnalyzer()
analysis.analyze('access_log.0')
analysis.analyze('access_log.1')
analysis.analyze('access_log.2')
# summarize() takes a count, not a filename; the caller decides where the text goes
print(analysis.summarize(topcount=5))
```


Consider making it the caller's responsibility to write the result to a file. I don't think that it's essential to the business of log analysis.
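To illustrate that split, here is a minimal sketch. It assumes a hypothetical standalone summarize() function standing in for the method outlined above, and a made-up write_report() helper on the caller's side; the counts are invented for the example.

```python
from collections import Counter


def summarize(counter, topcount=5):
    """Hypothetical stand-in for LogAnalyzer.summarize(): returns text, writes nothing."""
    rows = counter.most_common(topcount)
    return "\n".join(f"{count:>6}  {value}" for value, count in rows)


def write_report(text, path):
    """Caller-side helper (assumed name): the only place that touches the filesystem."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text + "\n")


# Made-up counts, for illustration only.
requests = Counter({"GET /main/rss HTTP/1.1": 3, "GET /feed/atom.xml HTTP/1.1": 2, "/": 1})
report = summarize(requests, topcount=2)
print(report)
```

A caller that wants a file would then call write_report(report, 'access_summary.txt'); one that wants the text on the console or in an e-mail can reuse summarize() unchanged.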

- Take advantage of collections.Counter to tally the occurrences.

- Open files using with blocks. Then you never have to worry about closing them.

- Avoid reading everything into memory at once. Read a line at a time, use it to update the cumulative statistics, and don't hold on to lines. If possible, avoid keeping loglist as well.

```python
from collections import defaultdict, Counter

class LogAnalyzer:
    def __init__(self):
        self.linecount = 0
        self.counters = defaultdict(Counter)

    def analyze(self, logfile):
        with open(logfile) as f:
            for line in f:
                self._update(**self._parse(line))

    def summarize(self, topcount=5):
        …

    @staticmethod
    def _parse(line):
        …
        return {'ip': …, 'request': …, 'useragent': … }

    def _update(self, **kwargs):
        self.linecount += 1
        for key, value in kwargs.items():
            self.counters[key][value] += 1
```
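One way the elided parts could be filled in is sketched below. Everything beyond the outline is an assumption: the regular expression presumes lines shaped like the sample log in the question, analyze_lines() is a hypothetical helper added so the class can be fed any iterable of lines, lines that don't match are silently skipped, and summarize() returns a plain dict rather than formatted text.

```python
import re
from collections import Counter, defaultdict

# Assumed line layout, modelled on the sample log in the question.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '      # client, ident, user, timestamp
    r'"(?P<request>[^"]*)" \d{3} \S+ '      # request line, status, bytes sent
    r'"[^"]*" "(?P<useragent>[^"]*)"'       # referer, user agent
)


class LogAnalyzer:
    def __init__(self):
        self.linecount = 0
        self.counters = defaultdict(Counter)

    def analyze(self, logfile):
        # The file object is iterated lazily, one line at a time.
        with open(logfile) as f:
            self.analyze_lines(f)

    def analyze_lines(self, lines):
        """Hypothetical helper: accepts any iterable of lines."""
        for line in lines:
            fields = self._parse(line)
            if fields is not None:  # skip lines that don't match the layout
                self._update(**fields)

    def summarize(self, topcount=5):
        # Counter.most_common() does the sorting and truncation.
        return {key: counter.most_common(topcount)
                for key, counter in self.counters.items()}

    @staticmethod
    def _parse(line):
        match = LINE_RE.match(line)
        return match.groupdict() if match else None

    def _update(self, **kwargs):
        self.linecount += 1
        for key, value in kwargs.items():
            self.counters[key][value] += 1


sample = [
    '2.2.2.2 - - [21/Feb/2014:06:52:04 +0100] "GET /main/rss HTTP/1.1" 301 178 "-" "Motorola"',
    '2.2.2.2 - - [21/Feb/2014:06:52:04 +0100] "GET /feed/atom.xml HTTP/1.1" 304 0 "-" "Motorola"',
    '5.5.5.5 - - [21/Feb/2014:07:32:48 +0100] "GET /main/rss HTTP/1.1" 301 178 "-" "Motorola"',
    '7.7.7.7 - - [21/Feb/2014:08:51:34 +0100] "-" 400 0 "-" "-"',
    'not a log line',
]

analysis = LogAnalyzer()
analysis.analyze_lines(sample)
top = analysis.summarize(topcount=1)
print(analysis.linecount, top)
```

Because only the counters are kept, memory use grows with the number of distinct IPs, requests and user agents rather than with the size of the log.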

Context

StackExchange Code Review Q#44413, answer score: 5
