HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Why is a secondary index chosen over a clustered index for SELECT COUNT(*) ...?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
clusteredwhycountselectchosensecondaryforindexover

Problem

In this query:

select count(*) from largetable;


a secondary index is chosen:

mysql> explain select count(*) from largetable;
+----+-------------+------------+-------+---------------+------+---------+------+----------+-------------+
| id | select_type | table      | type  | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+------------+-------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | largetable | index | NULL          | iif  | 5       | NULL | 50000169 | Using index |
+----+-------------+------------+-------+---------------+------+---------+------+----------+-------------+
1 row in set (0.00 sec)

mysql> select count(*) from largetable;
+----------+
| count(*) |
+----------+
| 50000000 |
+----------+
1 row in set (5 min 52.02 sec)


Whereas forcing usage of the clustered index:

select count(*) from largetable force index (primary);


gives better performance:

mysql> explain select count(*) from largetable force index (primary);
+----+-------------+------------+-------+---------------+---------+---------+------+----------+-------------+
| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows     | Extra       |
+----+-------------+------------+-------+---------------+---------+---------+------+----------+-------------+
|  1 | SIMPLE      | largetable | index | NULL          | PRIMARY | 4       | NULL | 50000169 | Using index |
+----+-------------+------------+-------+---------------+---------+---------+------+----------+-------------+
1 row in set (0.00 sec)

mysql> select count(*) from largetable force index (primary);
+----------+
| count(*) |
+----------+
| 50000000 |
+----------+
1 row in set (2 min 23.07 sec)


So that's 5 minutes and 52 seconds versus 2 minutes and 23 seconds.

I am looking to understand why MySQL's query optimizer chooses a secondary index.

There are 50 million rows in the table, with ids from 1 t

Solution

The problem may stem from the way MySQL Query Optimizer makes choices as well as the way indexes are internally represented in InnoDB.

First look at the Cardinality of the Indexes. A primary key's cardinality must always be the actual row count of the InnoDB table. Now, look at the cardinality of the field1. If the index iif is less than that of the primary key, the MySQL Query Optimizer will choose the secondary index. To verify that the Cardinaliry of field1 is lower, run these queries:

SELECT COUNT(DISTINCT field1) FROM largetable;
SELECT field1,COUNT(1) fieldcount FROM largetable
GROUP BY field1 WITH ROLLUP;


Now, look at the internal representation of the indexes. A secondary index will contain two items: 1) the column value(s) being indexed, 2) the rowid from the Clustered Index (a.k.a. the gen_clust_index). Every time a column is referenced in the secondary index, a lookup of the actual row is done as well. Picture it : Two Keys Looks Up for every row in InnoDB.

Putting these two issues together, you find that a secondary index with a lower cardinality than the primary key, will still lookup the actual row using the primary key. This explains why a secondary index is chosen over the primary key and has takes twice as long or even longer to query.

Some people would disagree with this line of reasoning because I answered a question similar to this in StackOverflow (Nov 15, 2011). Although my answer was accepted, it has mixed upvotes and downvotes because some do not view the MySQL Query Optimizer and InnoDB index structure the same way.

If anyone from Percona sees this question and my answer and sees any flaw in my reasoning, please correct me so all can learn.
UPDATE 2012-04-23 12:56 EDT

The InnoDB storage engine does deep dives into the BTREE indexes to take educated guesses at the cardinality. Try setting innodb_stats_on_metadata off

[mysqld]
innodb_stats_on_metadata = 0


According to the documentation, when disabled, InnoDB does not update statistics during these operations. Disabling this variable can improve access speed for schemas that have a large number of tables or indexes. It can also improve the stability of execution plans for queries that involve InnoDB tables.

Code Snippets

SELECT COUNT(DISTINCT field1) FROM largetable;
SELECT field1,COUNT(1) fieldcount FROM largetable
GROUP BY field1 WITH ROLLUP;
[mysqld]
innodb_stats_on_metadata = 0

Context

StackExchange Database Administrators Q#16893, answer score: 4

Revisions (0)

No revisions yet.