Cassandra slow query log analysis tool
Problem
Cassandra, when debugging is enabled, logs slow queries to the debug log file. Typical entries look like:
DEBUG [ScheduledTasks:1] 2017-02-16 18:58:44,342 MonitoringTask.java:572 - 4 operations were slow in the last 5010 msecs:
<SELECT FROM foo.bar WHERE token(id) > token(9be90fe7-9a6d-45d5-ad11-e93cfd56def7) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(91faceee-a64b-4fd3-bb93-ef483acade88) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(47250d17-573a-4d76-9039-d2771a19ff10) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(e04fc6d0-18b8-4ac0-b5f9-df42cd3a03c5) LIMIT 100>, time 1 msec - slow timeout 1 msec
The actual format is only documented in code.
For MySQL, the mysqldumpslow tool parses the logs and prints the queries (and related statistics) in a readable manner. I'm trying to write a similar tool for Cassandra, for the feature request in CASSANDRA-13000. The goals I set are:
- Use similar options to mysqldumpslow, where applicable, so I have to implement these options:

  --help  Display help message and exit
  -g      Only consider statements that match the pattern
  -r      Reverse the sort order
  -s      How to sort output
  -t      Display only first num queries

  Sorting options:

  t, at: Sort by query time or average query time
  c: Sort by count
  Of these, the -g option is yet to be implemented, since there are some problems in how the queries are logged. I'm also adding long-form variants of these (--sort, --reverse, etc.) consistently.
- Support JSON encoded input, in a streaming fashion. This is for another related patch I'm submitting, where the queries are dumped with JSON encoding for easier parsing by external tools. The JSON-encoded entry will look like:
{
"operation": "SELECT FROM foo.bar WHERE token(id) > token(60bad0b3-551f-46c7-addc-4e3105561a21) LIMIT 100",
"totalTime": 1,
"timeout": 1,
"isCrossNode": false,
"numTimesReported": 1,
"minTime": 1,
"maxTime": 1,
"keyspace": "foo",
"table": "bar"
}
- Keep compatibility with Python 2 and Python 3.
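For the streaming-JSON goal, here is a minimal sketch of one way to consume concatenated JSON objects incrementally; the helper name is hypothetical and not part of the actual patch:

```python
import json

def iter_json_objects(stream):
    # Hypothetical helper: yield JSON objects from an iterable of text
    # chunks that may contain several concatenated objects, without
    # loading the whole input into memory at once.
    decoder = json.JSONDecoder()
    buf = ""
    for chunk in stream:
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                break  # incomplete object, wait for more data
            yield obj
            buf = buf[end:]

entries = list(iter_json_objects([
    '{"operation": "SELECT FROM foo.bar", "totalTime": 1}',
    '{"operation": "SELECT FROM foo.baz", "tot',
    'alTime": 2}',
]))
```

`JSONDecoder.raw_decode` conveniently reports where the first object ends, which is what makes the chunk-by-chunk loop possible.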
Solution
Here are some notes about the code (both performance and code style related):
- since you are initializing a lot of slow_query and query_stats (also see the note about the naming below) class instances on the fly, to improve the memory usage and performance, use __slots__:

  class slow_query:
      __slots__ = ["operation", "stats", "timeout", "keyspace", "table", "is_cross_node"]
      # ...
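To see what __slots__ buys, a small demonstration (the class names here are illustrative, not from the original code):

```python
class PlainQuery:
    pass

class SlottedQuery:
    # Same idea as the slow_query class above: declaring __slots__
    # drops the per-instance __dict__, which saves memory when many
    # instances are created on the fly.
    __slots__ = ("operation", "timeout")

plain = PlainQuery()
slotted = SlottedQuery()
slotted.operation = "SELECT FROM foo.bar"

# Only the plain instance carries a per-instance __dict__.
has_dict = (hasattr(plain, "__dict__"), hasattr(slotted, "__dict__"))

# Assigning an attribute not listed in __slots__ fails.
try:
    slotted.extra = 1
    extra_allowed = True
except AttributeError:
    extra_allowed = False
```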
- switching from json to ujson may dramatically improve the JSON parsing speed; or, you can try the PyPy and simplejson combination (ujson won't work on PyPy since it is written in C; simplejson is a fast pure-Python parser)
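A common way to apply the suggested swap without touching the rest of the code is an import fallback; this is a sketch, not taken from the reviewed code:

```python
# Prefer the faster C parser when installed, fall back to the
# standard library otherwise; callers use json_impl.loads() either way.
try:
    import ujson as json_impl
except ImportError:
    import json as json_impl

entry = json_impl.loads('{"operation": "SELECT FROM foo.bar", "totalTime": 1}')
```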
- think about the capturing groups in your regular expressions; you can avoid capturing more things than you actually need. For example, in the "start" regular expression you have 2 capturing groups, but you actually use only the first one:

  r'DEBUG.*- (\d+) operations were slow in the last \d+ msecs:$'
                                       no group here^
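As a quick sanity check of the single-group pattern, against the sample line from the log excerpt above:

```python
import re

START_RE = re.compile(
    r'DEBUG.*- (\d+) operations were slow in the last \d+ msecs:$')

line = ('DEBUG [ScheduledTasks:1] 2017-02-16 18:58:44,342 '
        'MonitoringTask.java:572 - 4 operations were slow in the last 5010 msecs:')

m = START_RE.search(line)
count = int(m.group(1))  # the only group we actually need
```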
- the wildcard matches in the regular expressions can be non-greedy: .*? instead of .* (not sure if it will have a measurable impact on performance)
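A tiny illustration of the greedy/non-greedy difference (the performance effect would still need measuring, as noted above):

```python
import re

log = 'time 1 msec - slow timeout 1 msec - extra'

# Greedy .* consumes as much as possible before the final " - ".
greedy = re.match(r'(.*) - ', log).group(1)

# Non-greedy .*? stops at the first " - ".
lazy = re.match(r'(.*?) - ', log).group(1)
```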
- class names should use a "CamelCase" convention (PEP8 reference)
- the .get_json_objects() method can be static
- for the CLI parameter parsing I would use the argparse module - you would avoid the boilerplate code you have in the main() and usage() functions
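A hedged sketch of what the argparse version could look like; the long option names (--grep, --top) are assumptions for illustration, not taken from the original code:

```python
import argparse

def build_parser():
    # argparse generates --help and the usage text automatically,
    # replacing the hand-written main()/usage() boilerplate.
    parser = argparse.ArgumentParser(
        description="Summarize Cassandra slow-query debug logs.")
    parser.add_argument("-g", "--grep", metavar="PATTERN",
                        help="only consider statements matching PATTERN")
    parser.add_argument("-r", "--reverse", action="store_true",
                        help="reverse the sort order")
    parser.add_argument("-s", "--sort", choices=["t", "at", "c"], default="t",
                        help="how to sort output")
    parser.add_argument("-t", "--top", type=int, metavar="NUM",
                        help="display only the first NUM queries")
    return parser

args = build_parser().parse_args(["-s", "at", "-r", "-t", "10"])
```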
- use 2 spaces before the # for the inline comment (PEP8 reference)
- fix typo "avergae" -> "average"
- you can improve the readability of the sort_queries() method by introducing a mapping between the key and the sort attribute name, something along these lines:

  def sort_queries(self):
      """Sorts "queries" in place, default sort is "by time"."""
      sort_attributes = {
          't': 'time',
          'at': 'avg',
          'c': 'count'
      }
      sort_attribute = sort_attributes.get(self.key, 't')
      self.queries.sort(key=lambda x: getattr(x.stats, sort_attribute),
                        reverse=self.reverse)
  It feels, though, like this mapping should be defined as a constant beforehand.
- improve on documentation: add meaningful docstrings to the class methods, put comments wherever you think the reader may have difficulty understanding the code - remember, code is read much more often than it is written
Note that this is what I can see by looking at the code. Of course, to really identify the bottleneck(s), you should profile the code properly on a large input.
Context
StackExchange Code Review Q#155563, answer score: 3