HiveBrain v1.2.0

Cassandra slow query log analysis tool

Submitted by: @import:stackexchange-codereview

Problem

Cassandra, when debugging is enabled, logs slow queries to the debug log file. Typical entries look like:

DEBUG [ScheduledTasks:1] 2017-02-16 18:58:44,342 MonitoringTask.java:572 - 4 operations were slow in the last 5010 msecs:
<SELECT FROM foo.bar WHERE token(id) > token(9be90fe7-9a6d-45d5-ad11-e93cfd56def7) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(91faceee-a64b-4fd3-bb93-ef483acade88) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(47250d17-573a-4d76-9039-d2771a19ff10) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(e04fc6d0-18b8-4ac0-b5f9-df42cd3a03c5) LIMIT 100>, time 1 msec - slow timeout 1 msec


The actual format is only documented in code.

For MySQL, the mysqldumpslow tool parses the logs and prints the queries (and related statistics) in a readable manner. I'm trying to write a similar tool for Cassandra, for the feature request in CASSANDRA-13000.

The goals I set are:

- Use similar options to mysqldumpslow, where applicable, so I have to implement these options:

--help Display help message and exit
-g Only consider statements that match the pattern
-r Reverse the sort order
-s How to sort output
-t Display only first num queries


Sorting options:

  • t, at: Sort by query time or average query time
  • c: Sort by count
Of these, the -g option is yet to be implemented, since there are some problems in how the queries are logged.

I'm also adding long-form variants of these (--sort, --reverse, etc.) consistently.

- Support JSON-encoded input, in a streaming fashion. This is for another related patch I'm submitting, where the queries are dumped with JSON encoding for easier parsing by external tools. The JSON-encoded entry will look like:

{
"operation": "SELECT FROM foo.bar WHERE token(id) > token(60bad0b3-551f-46c7-addc-4e3105561a21) LIMIT 100",
"totalTime": 1,
"timeout": 1,
"isCrossNode": false,
"numTimesReported": 1,
"minTime": 1,
"maxTime": 1,
"keyspace": "foo",
"table": "bar"
}


- Keep compatibility with Python 2 and 3.
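Consuming such entries in a streaming fashion could look roughly like this (a sketch that assumes one JSON object per input line; the exact framing depends on the patch):

```python
import json

def iter_entries(lines):
    """Yield one decoded slow-query entry per non-blank input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Illustrative single-line entry, shaped like the example above.
log = ['{"operation": "SELECT FROM foo.bar", "totalTime": 1, "keyspace": "foo"}']
entries = list(iter_entries(log))
```

Because the generator never materializes the whole log, memory use stays flat regardless of input size.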

Solution

Here are some notes about the code (both performance and code style related):

- since you are initializing a lot of slow_query and query_stats class instances on the fly (also see the note about the naming below), use __slots__ to improve memory usage and performance:

class slow_query:
    __slots__ = ["operation", "stats", "timeout", "keyspace", "table", "is_cross_node"]
    # ...
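As a quick illustration of why this helps: slotted instances carry no per-instance __dict__, which is where the memory saving comes from (the class and attribute names below are illustrative, not the tool's actual ones):

```python
class Plain:
    def __init__(self):
        self.operation = 'SELECT ...'

class Slotted:
    __slots__ = ['operation']
    def __init__(self):
        self.operation = 'SELECT ...'

# A slotted instance stores attributes in a fixed-size structure
# instead of a per-instance dict.
has_dict = hasattr(Plain(), '__dict__')     # True
slot_dict = hasattr(Slotted(), '__dict__')  # False
```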


- switching from json to ujson may dramatically improve the JSON parsing speed

  • or, you can try the PyPy and simplejson combination (ujson won't work on PyPy since it is written in C, simplejson is a fast pure-python parser)
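A common way to take advantage of ujson without making it a hard dependency is an import fallback (a sketch; ujson mirrors the stdlib loads/dumps API for basic use):

```python
try:
    import ujson as json  # C extension, usually much faster at parsing
except ImportError:
    import json  # stdlib fallback, always available

entry = json.loads('{"operation": "SELECT FROM foo.bar", "totalTime": 1}')
```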



- think about the capturing groups in your regular expressions; you can avoid capturing more things than you actually need. For example, in the "start" regular expression you have 2 capturing groups, but you actually use only the first one:

r'DEBUG.*- (\d+) operations were slow in the last \d+ msecs:$'
                                     no group here^
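For example, matching a "start" line with the single capturing group (a runnable sketch using the log line from the problem statement):

```python
import re

# Single capturing group: only the operation count is captured.
START_RE = re.compile(r'DEBUG.*- (\d+) operations were slow in the last \d+ msecs:$')

line = ('DEBUG [ScheduledTasks:1] 2017-02-16 18:58:44,342 MonitoringTask.java:572 '
        '- 4 operations were slow in the last 5010 msecs:')
match = START_RE.search(line)
count = int(match.group(1))  # 4
```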

- the wildcard matches in the regular expressions can be non-greedy - .*? instead of .* (not sure if it will have a measurable impact on performance)

- class names should use a "CamelCase" convention (PEP8 reference)

  • the .get_json_objects() method can be static



  • for the CLI parameter parsing I would use argparse module - you would avoid the boilerplate code you have in the main() and usage() functions



  • use 2 spaces before the # for the inline comment (PEP8 reference)



  • fix typo "avergae" -> "average"
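The argparse suggestion above could be sketched along these lines; the long-form names (--grep, --top) and defaults are assumptions, chosen to mirror the mysqldumpslow-style flags listed in the goals, and --help comes for free:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description='Parse Cassandra slow-query entries from the debug log.')
    parser.add_argument('-g', '--grep', metavar='PATTERN',
                        help='only consider statements matching the pattern')
    parser.add_argument('-r', '--reverse', action='store_true',
                        help='reverse the sort order')
    parser.add_argument('-s', '--sort', choices=['t', 'at', 'c'], default='t',
                        help='how to sort the output')
    parser.add_argument('-t', '--top', type=int, metavar='NUM',
                        help='display only the first NUM queries')
    return parser

args = build_parser().parse_args(['-s', 'c', '-r', '-t', '10'])
```

argparse also validates the -s choices and produces the usage text automatically, which is exactly the boilerplate main() and usage() currently hand-roll.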



- you can improve the readability of the sort_queries() method by introducing a mapping between the key and the sort attribute name, something along these lines:

def sort_queries(self):
    """Sorts "queries" in place, default sort is "by time"."""
    sort_attributes = {
        't': 'time',
        'at': 'avg',
        'c': 'count'
    }
    sort_attribute = sort_attributes.get(self.key, 'time')

    self.queries.sort(key=lambda x: getattr(x.stats, sort_attribute), 
                      reverse=self.reverse)


It feels, though, that this mapping should be defined as a constant beforehand.
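A self-contained sketch with the mapping lifted to a module-level constant; the stats attribute names (time, avg, count) are assumptions taken from the mapping itself:

```python
from types import SimpleNamespace

SORT_ATTRIBUTES = {'t': 'time', 'at': 'avg', 'c': 'count'}

def sort_queries(queries, key='t', reverse=False):
    """Sort queries in place; the default sort is by total time."""
    attr = SORT_ATTRIBUTES.get(key, 'time')
    queries.sort(key=lambda q: getattr(q.stats, attr), reverse=reverse)

# Two stand-in query records, sorted by count, highest first.
queries = [SimpleNamespace(stats=SimpleNamespace(time=5, avg=2, count=1)),
           SimpleNamespace(stats=SimpleNamespace(time=1, avg=9, count=3))]
sort_queries(queries, key='c', reverse=True)
```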

  • improve the documentation: add meaningful docstrings to the class methods, and put comments wherever you think the reader may have difficulty understanding the code - remember, code is read much more often than it is written



Note that this is what I can see by looking at the code. Of course, to really identify the bottleneck(s), you should profile the code properly on a large input.




Context

StackExchange Code Review Q#155563, answer score: 3
