MySQL performance optimization, max CPU load, low QPS
Problem
I would appreciate it if anybody could help me optimize the performance of this setup:
I have Ubuntu 12.04 running on VMware 5.1, with 32 GB RAM and 8 cores (no problems regarding CPU scheduling, as the VM is almost alone on the host).
The hardware is an IBM blade with 2x E5-2660 CPUs.
I'm running MySQL 5.5, and have a table looking like this:
```sql
CREATE TABLE `ochrange` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `rangestart` int(8) NOT NULL,
  `rangeend` int(8) NOT NULL,
  `rangelength` int(11) NOT NULL DEFAULT '1',
  `networkoperator` varchar(6) COLLATE latin1_danish_ci NOT NULL,
  `serviceoperator` varchar(6) COLLATE latin1_danish_ci NOT NULL,
  `numbertype` varchar(6) COLLATE latin1_danish_ci NOT NULL,
  `lastupdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `lastupdateFile` varchar(255) COLLATE latin1_danish_ci NOT NULL,
  PRIMARY KEY (`id`),
  KEY `rangestart_2` (`rangestart`,`rangeend`),
  KEY `rangelength` (`rangelength`)
) ENGINE=MyISAM AUTO_INCREMENT=189138 DEFAULT CHARSET=latin1 COLLATE=latin1_danish_ci
```

The table contains 187,500 rows.
I'm running queries like this:

```sql
SELECT networkoperator, numbertype
FROM och.ochrange
WHERE '20972128'
      BETWEEN rangestart AND rangeend
ORDER BY rangelength ASC LIMIT 1;
```
The EXPLAIN output:

```
mysql> EXPLAIN SELECT networkoperator,numbertype
    -> FROM och.ochrange
    -> WHERE '20972128'
    -> BETWEEN rangestart AND rangeend
    -> ORDER BY `rangelength` ASC LIMIT 1;
+----+-------------+----------+-------+---------------+-------------+---------+------+------+-------------+
| id | select_type | table    | type  | possible_keys | key         | key_len | ref  | rows | Extra       |
+----+-------------+----------+-------+---------------+-------------+---------+------+------+-------------+
|  1 | SIMPLE      | ochrange | index | rangestart_2  | rangelength | 4       | NULL |   46 | Using where |
+----+-------------+----------+-------+---------------+-------------+---------+------+------+-------------+
1 row in set (0.
```
Solution
Based on the query plan in the EXPLAIN output you've posted, you may have difficulty believing how this query is actually being processed by the server... but the explanation does illustrate why performance is poor.
Since you've asked for the results to be ordered by `rangelength`, and since the B-TREE index on (rangestart,rangeend) is poorly suited to resolving "x between y and z" expressions, the optimizer has decided to use the index on `rangelength` to determine an order in which it will proceed to read every row, if necessary, in the entire table (type = index), in ascending order as sorted by the index on `rangelength` (key = rangelength), until it finds the first row where (Extra = Using where) the WHERE clause is matched. Since the rows are being read in the desired ORDER BY order, the server can stop after the first matching row... so I would assume this query exhibits substantial variability depending on how much of the table or index has to be scanned to resolve any particular value.

There are two approaches that come to mind to improve this.
Option 1: The first suggestion would be to add an index that includes all three values you're sorting by and selecting by... but not for the usual reasons, because the query isn't going to use it quite like that.
```sql
ALTER TABLE ochrange ADD KEY (rangelength, rangeend, rangestart);
```

This is still far from the ideal index for this query, but it has three advantages over what you have now:
- it's already sorted by rangelength
- it narrows down the number of comparisons needed, because for any given rangelength, the rangestart values that are too high can be ignored
- while it isn't truly a covering index, all of the values of interest in the WHERE clause are found within the index, so the optimizer should be able to qualify or disqualify the rows based on an index scan instead of having to read the table data, and might be able to do more.
Very important note on point 3: I am not saying this index will be used to look up the matching rows, because it can't exactly be used for that. It should at least, however, be used more effectively than the current plan because it contains values we need to use for filtering and because it may also allow out of range values for rangeend to be quickly eliminated, and among the remaining values, it may allow out of range values for rangestart to also be eliminated.
I would also recommend that the where clause be written in a less-ambiguous but logically equivalent form, to potentially make things a little bit easier on the optimizer:
```sql
WHERE 20972128
      BETWEEN `rangestart` AND `rangeend`
ORDER BY `rangelength` ASC LIMIT 1;
```

...becomes this:

```sql
WHERE rangestart <= 20972128
  AND rangeend >= 20972128
ORDER BY rangelength ASC LIMIT 1;
```

The (rangestart,rangeend) index, at first glance, seems like it would have been more useful, but a 2-column B-Tree isn't well suited to finding values between a low and a high bound like this.

The residential telephone directory is a fitting analogy for a two-column index on (last_name, first_name), and illustrates why this kind of index doesn't provide as much benefit as it would seem to.
In such a directory, given the last name "Smith" and first name "John," it is easy to find all of the people named Smith, and very easy to find first name John accompanying last name Smith. It is quite impossible, though, to use the index in a telephone directory to find all of the people with first name John regardless of last name.
What we're asking of the index in this query, whether written the original way or in my suggested way, is to find all of the rows where `rangestart` <= 20972128 and `rangeend` >= 20972128 in the same row. This would be like trying to find all of the people in a telephone directory with a last name of Smith, or any other last name that appears prior to Smith in the directory, and of those people, find those with the first name of John, or any other name that is lexically (alphabetically) "greater than" (after) John. The task would be tedious, and our only consolation is that we don't have to examine any of the pages in the directory that follow Smith; we still have to examine every entry on every preceding page before we can find what we are looking for.
Still, Option #1, adding a new index, seems worth a try. After testing with that index, it could also be worthwhile to add yet another index, this one on (rangelength,rangestart,rangeend), to see which one the optimizer prefers to use. Hopefully it will use one of them, and depending on the data in the table and the values in the query, it might alternate and it might not.
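That comparison could be sketched as follows; the index name is my own invention, and which index the optimizer prefers will depend on your data:

```sql
-- Sketch: add the alternate index suggested above (name is an assumption).
ALTER TABLE ochrange ADD KEY rangelength_2 (rangelength, rangestart, rangeend);

-- Check which index the optimizer chooses for the rewritten query:
EXPLAIN SELECT networkoperator, numbertype
FROM och.ochrange
WHERE rangestart <= 20972128
  AND rangeend >= 20972128
ORDER BY rangelength ASC LIMIT 1;
```

If neither index is chosen consistently, an index hint (`USE INDEX` / `FORCE INDEX`) can be used to benchmark each one directly.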
Option #2 is apparently a little bit "outside the box" in some people's minds, but it is a solution I use for finding the specific block of IP addresses (IPv4 addresses are essentially INT UNSIGNED, with low/high boundaries) in which a particular IP address lies, for geocoding: the ranges are stored as geometry and searched with MBRContains() against a SPATIAL (R-tree) index, which, unlike a B-tree, is designed for exactly this kind of containment test. I caught some grief for suggesting this.
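The `MBRContains(b.node_polygon, POINT(...))` query collected under Code Snippets below shows the mechanism. A minimal sketch of the same technique adapted to this table, under my own assumptions: the column name and polygon construction are illustrative, and SPATIAL indexes in MySQL 5.5 require MyISAM, which this table already uses.

```sql
-- Assumption: represent each [rangestart, rangeend] interval as a thin
-- rectangle along the X axis, so an R-tree (SPATIAL) index can locate
-- the interval containing a point. Names here are not from the answer.
ALTER TABLE ochrange ADD COLUMN range_polygon GEOMETRY NULL;

UPDATE ochrange
SET range_polygon = Polygon(LineString(
      Point(rangestart, -1), Point(rangeend,   -1),
      Point(rangeend,    1), Point(rangestart,  1),
      Point(rangestart, -1)));

-- SPATIAL indexes require NOT NULL, so tighten the column, then index it:
ALTER TABLE ochrange MODIFY range_polygon GEOMETRY NOT NULL,
                     ADD SPATIAL INDEX (range_polygon);

-- The lookup becomes an R-tree containment test instead of a B-tree scan:
SELECT networkoperator, numbertype
FROM ochrange
WHERE MBRContains(range_polygon, Point(20972128, 0))
ORDER BY rangelength ASC LIMIT 1;
```

This is the same shape as the geo_block query in the snippets below, applied to number ranges instead of IP blocks.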
Code Snippets

```sql
ALTER TABLE ochrange ADD KEY (rangelength, rangeend, rangestart);
```

```sql
WHERE 20972128
      BETWEEN `rangestart` AND `rangeend`
ORDER BY `rangelength` ASC LIMIT 1;
```

```sql
WHERE rangestart <= 20972128
  AND rangeend >= 20972128
ORDER BY rangelength ASC LIMIT 1;
```

```sql
SELECT ...
FROM geo_block b
WHERE MBRContains(b.node_polygon, POINT(in_ip_unsigned, in_ip_unsigned))
```

```sql
WHERE YEAR(birthday) = 1973;                                 # bad
WHERE birthday >= '1973-01-01' AND birthday < '1974-01-01';  # good
```

Context
StackExchange Database Administrators Q#54117, answer score: 16