HiveBrain v1.2.0

Cassandra slow query log analysis tool

Submitted by: @import:stackexchange-codereview

Problem

Cassandra, when debugging is enabled, logs slow queries to the debug log file. Typical entries look like:

DEBUG [ScheduledTasks:1] 2017-02-16 18:58:44,342 MonitoringTask.java:572 - 4 operations were slow in the last 5010 msecs:
<SELECT FROM foo.bar WHERE token(id) > token(9be90fe7-9a6d-45d5-ad11-e93cfd56def7) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(91faceee-a64b-4fd3-bb93-ef483acade88) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(47250d17-573a-4d76-9039-d2771a19ff10) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(e04fc6d0-18b8-4ac0-b5f9-df42cd3a03c5) LIMIT 100>, time 1 msec - slow timeout 1 msec


The actual format is only documented in code.

For MySQL, the mysqldumpslow tool parses the logs and prints the queries (and related statistics) in a readable manner. I'm trying to write a similar tool for Cassandra, for the feature request in CASSANDRA-13000.

The goals I set are:

- Use similar options to mysqldumpslow, where applicable, so I have to implement these options:

--help Display help message and exit
-g Only consider statements that match the pattern
-r Reverse the sort order
-s How to sort output
-t Display only first num queries


Sorting options:

  • t, at: Sort by query time or average query time
  • c: Sort by count
Of these, the -g option is yet to be implemented, since there are some problems in how the queries are logged.

I'm also adding long-form variants of these (--sort, --reverse, etc.) consistently.

- Support JSON-encoded input, in a streaming fashion. This is for another related patch I'm submitting, where the queries are dumped with JSON encoding for easier parsing by external tools. The JSON-encoded entry will look like:

{
"operation": "SELECT FROM foo.bar WHERE token(id) > token(60bad0b3-551f-46c7-addc-4e3105561a21) LIMIT 100",
"totalTime": 1,
"timeout": 1,
"isCrossNode": false,
"numTimesReported": 1,
"minTime": 1,
"maxTime": 1,
"keyspace": "foo",
"table": "bar"
}


- Keep compatibility with Python 2 and 3.
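Consuming such entries in a streaming fashion could look roughly like this (a sketch that assumes one JSON object per input line; the exact framing depends on the patch):

```python
import json

def iter_entries(lines):
    """Yield one decoded slow-query entry per non-blank input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Illustrative single-line entry, shaped like the example above.
log = ['{"operation": "SELECT FROM foo.bar", "totalTime": 1, "keyspace": "foo"}']
entries = list(iter_entries(log))
```

Because the generator never materializes the whole log, memory use stays flat regardless of input size.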

Solution

Here are some notes about the code (both performance and code style related):

- since you are initializing a lot of slow_query and query_stats class instances on the fly (also see the note about the naming below), use __slots__ to improve memory usage and performance:

class slow_query:
    __slots__ = ["operation", "stats", "timeout", "keyspace", "table", "is_cross_node"]
    # ...
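As a quick illustration of why this helps: slotted instances carry no per-instance __dict__, which is where the memory saving comes from (the class and attribute names below are illustrative, not the tool's actual ones):

```python
class Plain:
    def __init__(self):
        self.operation = 'SELECT ...'

class Slotted:
    __slots__ = ['operation']
    def __init__(self):
        self.operation = 'SELECT ...'

# A slotted instance stores attributes in a fixed-size structure
# instead of a per-instance dict.
has_dict = hasattr(Plain(), '__dict__')     # True
slot_dict = hasattr(Slotted(), '__dict__')  # False
```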


- switching from json to ujson may dramatically improve the JSON parsing speed

  • or, you can try the PyPy and simplejson combination (ujson won't work on PyPy since it is written in C, simplejson is a fast pure-python parser)
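A common way to take advantage of ujson without making it a hard dependency is an import fallback (a sketch; ujson mirrors the stdlib loads/dumps API for basic use):

```python
try:
    import ujson as json  # C extension, usually much faster at parsing
except ImportError:
    import json  # stdlib fallback, always available

entry = json.loads('{"operation": "SELECT FROM foo.bar", "totalTime": 1}')
```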



- think about the capturing groups in your regular expressions; you can avoid capturing more things than you actually need. For example, in the "start" regular expression you have 2 capturing groups, but you actually use only the first one:

r'DEBUG.*- (\d+) operations were slow in the last \d+ msecs:$'
                                     no group here^
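For example, matching a "start" line with the single capturing group (a runnable sketch using the log line from the problem statement):

```python
import re

# Single capturing group: only the operation count is captured.
START_RE = re.compile(r'DEBUG.*- (\d+) operations were slow in the last \d+ msecs:$')

line = ('DEBUG [ScheduledTasks:1] 2017-02-16 18:58:44,342 MonitoringTask.java:572 '
        '- 4 operations were slow in the last 5010 msecs:')
match = START_RE.search(line)
count = int(match.group(1))  # 4
```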

- the wildcard matches in the regular expressions can be non-greedy - .*? instead of .* (not sure if it will have a measurable impact on performance)

- class names should use a "CamelCase" convention (PEP8 reference)

  • the .get_json_objects() method can be static



  • for the CLI parameter parsing I would use argparse module - you would avoid the boilerplate code you have in the main() and usage() functions



  • use 2 spaces before the # for the inline comment (PEP8 reference)



  • fix typo "avergae" -> "average"
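The argparse suggestion above could be sketched along these lines; the long-form names (--grep, --top) and defaults are assumptions, chosen to mirror the mysqldumpslow-style flags listed in the goals, and --help comes for free:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description='Parse Cassandra slow-query entries from the debug log.')
    parser.add_argument('-g', '--grep', metavar='PATTERN',
                        help='only consider statements matching the pattern')
    parser.add_argument('-r', '--reverse', action='store_true',
                        help='reverse the sort order')
    parser.add_argument('-s', '--sort', choices=['t', 'at', 'c'], default='t',
                        help='how to sort the output')
    parser.add_argument('-t', '--top', type=int, metavar='NUM',
                        help='display only the first NUM queries')
    return parser

args = build_parser().parse_args(['-s', 'c', '-r', '-t', '10'])
```

argparse also validates the -s choices and produces the usage text automatically, which is exactly the boilerplate main() and usage() currently hand-roll.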



- you can improve the readability of the sort_queries() method by introducing a mapping between the key and the sort attribute name, something along these lines:

def sort_queries(self):
    """Sorts "queries" in place, default sort is "by time"."""
    sort_attributes = {
        't': 'time',
        'at': 'avg',
        'c': 'count'
    }
    sort_attribute = sort_attributes.get(self.key, 'time')

    self.queries.sort(key=lambda x: getattr(x.stats, sort_attribute), 
                      reverse=self.reverse)


It feels, though, that this mapping should be defined as a constant beforehand.
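A self-contained sketch with the mapping lifted to a module-level constant; the stats attribute names (time, avg, count) are assumptions taken from the mapping itself:

```python
from types import SimpleNamespace

SORT_ATTRIBUTES = {'t': 'time', 'at': 'avg', 'c': 'count'}

def sort_queries(queries, key='t', reverse=False):
    """Sort queries in place; the default sort is by total time."""
    attr = SORT_ATTRIBUTES.get(key, 'time')
    queries.sort(key=lambda q: getattr(q.stats, attr), reverse=reverse)

# Two stand-in query records, sorted by count, highest first.
queries = [SimpleNamespace(stats=SimpleNamespace(time=5, avg=2, count=1)),
           SimpleNamespace(stats=SimpleNamespace(time=1, avg=9, count=3))]
sort_queries(queries, key='c', reverse=True)
```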

  • improve the documentation: add meaningful docstrings to the class methods, and put comments wherever you think the reader may have difficulty understanding the code - remember, code is read much more often than it is written



Note that this is what I can see by looking at the code. Of course, to really identify the bottleneck(s), you should profile the code properly on a large input.




Context

StackExchange Code Review Q#155563, answer score: 3
