Diving into Python sequences: analyze an access.log
Problem
As a first little Python exercise, I wrote an analyzer/summarizer for my nginx access logs. The code works fine, but I'm not sure whether I used the different sequence types properly or made other mistakes that could lead to bugs.
Steps:
- read in access.log and heavily poke around to fetch the wanted data (requests, IPs and user agents till now)
- sum the occurrences
- sort the sums desc and write the top x sums into a file
Example generalized log (I don't know if that's helpful):
```
1.1.1.1 - - [21/Feb/2014:06:35:45 +0100] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
1.1.1.1 - - [21/Feb/2014:06:35:45 +0100] "GET /blog.css HTTP/1.1" 200 3663 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
2.2.2.2 - - [21/Feb/2014:06:52:04 +0100] "GET /main/rss HTTP/1.1" 301 178 "-" "Motorola"
2.2.2.2 - - [21/Feb/2014:06:52:04 +0100] "GET /feed/atom.xml HTTP/1.1" 304 0 "-" "Motorola"
3.3.3.3 - - [21/Feb/2014:06:58:14 +0100] "/" 200 1664 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117"
4.4.4.4 - - [21/Feb/2014:07:22:03 +0100] "/" 200 1664 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117"
5.5.5.5 - - [21/Feb/2014:07:32:48 +0100] "GET /main/rss HTTP/1.1" 301 178 "-" "Motorola"
5.5.5.5 - - [21/Feb/2014:07:32:48 +0100] "GET /feed/atom.xml HTTP/1.1" 304 0 "-" "Motorola"
6.6.6.6 - - [21/Feb/2014:08:13:01 +0100] "GET /main/rss HTTP/1.1" 301 178 "-" "Motorola"
6.6.6.6 - - [21/Feb/2014:08:13:01 +0100] "GET /feed/atom.xml HTTP/1.1" 304 0 "-" "Motorola"
7.7.7.7 - - [21/Feb/2014:08:51:25 +0100] "GET /main.php HTTP/1.1" 200 3681 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461)"
7.7.7.7 - - [21/Feb/2014:08:51:34 +0100] "-" 400 0 "-" "-"
7.7.7.7 - - [21/Feb/2014:08:51:48 +0100] "GET /tag/php.php HTTP/1.1" 200 4673 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q
```
Solution
I find your `find_chars()` method amusingly creative. I'll comment on the big picture instead.

- Class design: Passing all the parameters into the constructor makes the class less versatile. Those parameters don't need to be part of the object's state. Consider this outline instead:

  ```
  class LogAnalyzer:
      def __init__(self):
          self.summary = …

      def analyze(self, logfile):
          …

      def summarize(self, topcount=5):
          …
  ```

  Then you have the flexibility to summarize several log files at once:

  ```
  analysis = LogAnalyzer()
  analysis.analyze('access_log.0')
  analysis.analyze('access_log.1')
  analysis.analyze('access_log.2')
  print(analysis.summarize())
  ```

  Consider making it the caller's responsibility to write the result to a file. I don't think that it's essential to the business of log analysis.
- Take advantage of `collections.Counter`.
- Open files using `with` blocks. Then you never have to worry about closing them.
- Avoid reading everything into memory at once. Read a line at a time, use it to update the cumulative statistics, and don't hold on to `lines`. If possible, avoid keeping `loglist` as well.

```
from collections import defaultdict, Counter

class LogAnalyzer:
    def __init__(self):
        self.linecount = 0
        self.counters = defaultdict(Counter)

    def analyze(self, logfile):
        with open(logfile) as f:
            for line in f:
                self._update(**self._parse(line))

    def summarize(self, topcount=5):
        …

    @staticmethod
    def _parse(line):
        …
        return {'ip': …, 'request': …, 'useragent': … }

    def _update(self, **kwargs):
        self.linecount += 1
        for key, value in kwargs.items():
            self.counters[key][value] += 1
```
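The outline above leaves `_parse()` and `summarize()` elided. As a rough illustration only, here is one way the gaps might be filled in. Everything not in the outline is an assumption made for this sketch: the `LOG_PATTERN` regex (written against the sample log lines in the question), the extra `analyze_lines()` helper, the choice to silently skip lines that do not match, and the dict-of-lists shape returned by `summarize()`.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pattern for the log format shown in the question:
# ip, two ignored fields, [timestamp], "request", status, size,
# "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<request>[^"]*)" \d{3} \S+ '
    r'"[^"]*" "(?P<useragent>[^"]*)"'
)


class LogAnalyzer:
    def __init__(self):
        self.linecount = 0
        self.counters = defaultdict(Counter)

    def analyze(self, logfile):
        # The file object is iterated lazily, one line at a time.
        with open(logfile) as f:
            self.analyze_lines(f)

    def analyze_lines(self, lines):
        # Hypothetical helper: accepts any iterable of lines, which
        # makes the class usable without a file on disk.
        for line in lines:
            fields = self._parse(line)
            if fields is not None:  # assumption: skip unparsable lines
                self._update(**fields)

    def summarize(self, topcount=5):
        # Counter.most_common(n) returns the n most frequent
        # (value, count) pairs in descending order of count.
        return {
            key: counter.most_common(topcount)
            for key, counter in self.counters.items()
        }

    @staticmethod
    def _parse(line):
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    def _update(self, **kwargs):
        self.linecount += 1
        for key, value in kwargs.items():
            self.counters[key][value] += 1
```

With this shape, writing the report stays with the caller: after one or more `analyze()` calls, the caller can open an output file in a `with` block and write out whatever `summarize()` returns, formatted however it likes. Note that `linecount` here counts only the lines that parsed successfully.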
Context
StackExchange Code Review Q#44413, answer score: 5