Why does the query optimizer choose this bad execution plan?
Problem
We have a MariaDB table (stories) with more than 1TB of data, and we periodically run a query that fetches recently added rows for indexing elsewhere.
Server version:

innodb_version: 5.6.36-82.1
version:        10.1.26-MariaDB-0+deb9u1

The query works just fine when the query optimizer decides on using the secondary index to do a range walk (in buckets of 1000):
explain extended SELECT stories.item_guid
FROM `stories`
WHERE (updated_at >= '2018-09-21 15:00:00')
AND (updated_at <= '2018-09-22 05:30:00')
ORDER BY `stories`.`id` ASC
LIMIT 1000;
+------+-------------+---------+-------+-----------------------------+-----------------------------+---------+------+--------+----------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+------+-------------+---------+-------+-----------------------------+-----------------------------+---------+------+--------+----------+---------------------------------------+
| 1 | SIMPLE | stories | range | index_stories_on_updated_at | index_stories_on_updated_at | 5 | NULL | 192912 | 100.00 | Using index condition; Using filesort |
+------+-------------+---------+-------+-----------------------------+-----------------------------+---------+------+--------+----------+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

But occasionally, even with small differences in the data set (note the different end timestamp compared with the query above; the whole table holds several years of data and several dozen million rows), the optimizer decides to use the primary key index:
explain extended SELECT stories.item_guid
FROM stories
WHERE (updated_at >= '2018-09-21 15:00:00')
AND (updated_at <= '2018-09-22 06:30:00')
ORDER BY stories.id ASC
LIMIT 1000;

Solution
Short Answer: The WHERE and the ORDER BY are tugging in different directions. The Optimizer is not yet smart enough to always correctly decide which direction to be pulled.

Long Answer:
The WHERE benefits from any index starting with updated_at. See the "100%", and other things. Such an index does a quick job of finding the desired rows, all 192K of them. But the query then needs to sort 192K rows (ORDER BY) before it can get to the LIMIT.
The ORDER BY id benefits from the PRIMARY KEY. This index lets the query get all rows in order (thereby avoiding a sort) and hence can get to the LIMIT (thereby never shoveling around more than 1000 rows).

If the 1000 are found at small values of id, it will run fast. If the desired 1000 are late in the table, the query will run slowly. The Optimizer can't predict this (without a lot more smarts and complexity).
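When the optimizer guesses wrong in this way, one blunt workaround is an index hint. This is a sketch, not part of the original answer; it reuses the table and index names from the question, and FORCE INDEX pins the plan to the secondary index at the cost of bypassing the optimizer's judgment entirely:

```sql
-- Sketch: pin the range scan to the secondary index.
-- Uses the table/index names from the question; verify with EXPLAIN first.
SELECT stories.item_guid
FROM stories FORCE INDEX (index_stories_on_updated_at)
WHERE updated_at >= '2018-09-21 15:00:00'
  AND updated_at <= '2018-09-22 06:30:00'
ORDER BY stories.id ASC
LIMIT 1000;
```

The trade-off: if the time window ever matches a huge fraction of the table, the filesort of all matching rows is now paid on every run, with no chance for the optimizer to pick the cheaper plan.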
Does updated_at mostly track id? If so, then either index would lead to essentially the same blocks. But the Optimizer neither notices this nor takes advantage of it.
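Whether updated_at tracks id can be sanity-checked cheaply. A minimal sketch (column and table names as in the question): compare the id range of a recent time window against the table's overall id range; if recent windows map to narrow ranges near the top, the two columns are strongly correlated:

```sql
-- Sketch: do recently-updated rows cluster at the high end of id?
SELECT MIN(id) AS window_min_id,
       MAX(id) AS window_max_id
FROM stories
WHERE updated_at >= '2018-09-21 15:00:00'
  AND updated_at <= '2018-09-22 05:30:00';

-- Compare against the full id range of the table.
SELECT MIN(id), MAX(id) FROM stories;
```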
A possible speedup:

A "covering" index is one that has all the needed columns. Since the PK is 'clustered' with the data, PRIMARY KEY(id) is sort of a covering index. But we saw how bad it could be. The following may be better:

INDEX(updated_at,     -- first, to deal with the WHERE
      id, item_guid)  -- in either order

The query would range-scan the 192K rows faster than if it needed to bounce between INDEX(updated_at) and the data 192K times to find item_guid. The sort would not be improved.

Alas, that only speeds up the already-faster query. So, let me put my thinking cap back on.
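As DDL, the suggested covering index might look like this (the index name is illustrative, not from the original answer):

```sql
-- Sketch: covering index so the range scan never touches the base rows.
-- On a 1TB table this ALTER is a major operation; consider an online
-- schema-change tool rather than a plain ALTER.
ALTER TABLE stories
  ADD INDEX idx_updated_at_id_item_guid (updated_at, id, item_guid);
```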
Partitioning?
Oh, you may have found a 5th use case for PARTITIONing. Six years ago I had only 4 use cases, and I have yet to find a new one. Let me talk through how your case might be different.

Suppose you used PARTITION BY RANGE (TO_DAYS(updated_at)) on the table, setting up about 50 partitions. (If you need to purge 'old' data, Partitioning is excellent.)

When setting up partitioning, one needs to rethink all the indexes. What indexes do you have now? I'll assume these:
PRIMARY KEY(id),
INDEX(updated_at)

In your case, perhaps not much would change:

PRIMARY KEY(id, updated_at),
INDEX(updated_at)

What would happen to your queries?
Tackling the WHERE first would not change much. Yes, there would be partition pruning, plus the secondary index. Speed would stay about the same.

By tackling the ORDER BY first, the query plan could also prune down to one (or two) partitions. Then the scan in id order within that/those partition(s) would be 50 (or 25) times as fast. There would probably be an added sort, since the rows from separate partitions won't be in order.
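The partitioning scheme discussed above could be sketched as follows. Partition names and boundaries are illustrative, not from the original; a real setup would script roughly 50 of them and add new ones ahead of time:

```sql
-- Sketch: range-partition by day of updated_at (names/dates illustrative).
-- Note: every UNIQUE/PRIMARY key must include the partitioning column,
-- hence the switch to PRIMARY KEY(id, updated_at) mentioned above.
ALTER TABLE stories
  PARTITION BY RANGE (TO_DAYS(updated_at)) (
    PARTITION p20180920 VALUES LESS THAN (TO_DAYS('2018-09-21')),
    PARTITION p20180921 VALUES LESS THAN (TO_DAYS('2018-09-22')),
    PARTITION p20180922 VALUES LESS THAN (TO_DAYS('2018-09-23')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
  );

-- Purging old data then becomes a near-instant metadata operation:
-- ALTER TABLE stories DROP PARTITION p20180920;
```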
Other notes

Could we see SHOW CREATE TABLE? A 1TB table needs a lot of things to help with performance. If all your numbers are INTs, there might be a lot of space to recover by shrinking to MEDIUMINT, etc. Normalization may be worth doing. If the GUID is indexed, that must be a big burden on insert performance. Etc, etc.
Ypercube's suggestion: If ORDER BY updated_at, id is OK for the task, then that eliminates the temptation to use the slow explain plan. And INDEX(updated_at, id, item_guid) becomes the best index. And partitioning becomes useless.
Pagination

Not via OFFSET; remember where you left off. (The discussion linked in the original answer covers a "left off" position involving more than a single column.)
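"Remember where you left off" with a two-column key can be sketched like this. The @last_* variables are placeholders for values saved from the last row of the previous batch, and the sketch assumes the ORDER BY updated_at, id variant mentioned above:

```sql
-- Sketch: keyset pagination over (updated_at, id) instead of OFFSET.
-- @last_updated_at / @last_id come from the last row of the previous
-- batch, stored by the indexing job.
SELECT item_guid, updated_at, id
FROM stories
WHERE updated_at > @last_updated_at
   OR (updated_at = @last_updated_at AND id > @last_id)
ORDER BY updated_at, id
LIMIT 1000;
```

Unlike OFFSET, each batch reads only the rows it returns, so the cost per batch stays flat no matter how deep into the table the job has progressed.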
Code Snippets
INDEX(updated_at,     -- first, to deal with the WHERE
      id, item_guid)  -- in either order

PRIMARY KEY(id),
INDEX(updated_at)

PRIMARY KEY(id, updated_at),
INDEX(updated_at)

Context
StackExchange Database Administrators Q#218327, answer score: 2
Revisions (0)
No revisions yet.