HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

MySQL Index Creation Internals

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
indexmysqlcreationinternals

Problem

Right now there are two ways to build an index of a table in MySQL:

  • Create the table structure first, then import the data and add indexes later.



  • Create the table structure with indexes and then import the data.



In the first process we will have contiguous data (all fields) pages and then we have index pages. So when we query using index, MySQL has to first load the index pages and find out the matching keys and have to look up those primary key on the data pages. For this it has to then again load the data pages to get the data. This is useful when we have larger index scans as we have all index loaded contiguously.

In the second way of index creation the filtered index page will most probably contain the data page near to it as they were created at the same time. So I guess look up will be faster for a small range scans.

Is my understanding correct?

UPDATE:

I should have mentioned that the "PRIMARY KEY is enabled" (an auto increment id column) in the first way of importing data. So internal rowid is not generated and lot of IO is saved as we are not going to add a PRIMARY KEY.

As you noted there is fragmentation when we import data using the Second method.

Considering my requirements is for larger range scans (scanning ~100M rows) I guess I will go with the first way of importing data.

update on Jun 8 11:30

``
CREATE TABLE
table_dummy (
id bigint(20) NOT NULL AUTO_INCREMENT,
column1 bigint(20) DEFAULT NULL,
column2 bigint(20) DEFAULT NULL,
column3 bigint(20) DEFAULT NULL,
created_at datetime DEFAULT NULL,
column4 tinyint(1) DEFAULT NULL,
column5 tinyint(4) DEFAULT NULL,
column6 bigint(20) DEFAULT NULL,
column6_created_at datetime DEFAULT NULL,
column7 int(11) DEFAULT NULL,
column8 tinyint(1) DEFAULT NULL,
PRIMARY KEY (
id),
UNIQUE KEY
twtaccount_id_2 (column1,column2),
KEY
twtaccount_id (column1,created_at,column5),
KEY
twt_user_id (column3,created_at,column5`),
K

Solution

There are a few things I can clarify for you here:

-
Yes, it is a good practice to delay secondary index creation until after you import the data (starting from MySQL 5.5 - not before). Mysqlpump does this by default.

-
When you delay secondary index creation, internally MySQL will read, sort and then create the index (reducing fragmentation). For MySQL 5.7 there are additional optimizations here.

-
When you trickle load indexes, they may have more page splits and a lower page fill efficiency (fragmentation). This depends a little on the data, and if it is in order. You are not giving me many clues with column1, column2, but say for example column1 was a timestamp it would probably be in order.

-
Pages in an index are logically in order, not necessarily physically. This distinction may matter if you can not fit your working set for ranges in memory and are using spinning disks for backing storage. It's also a little hard to answer because with spinning disks you are probably using RAID with some stripe size, and the blocks at the filesystem level may also be non-contiguous.

-
Be aware also that the optimizer may not consider the range scan for large ranges over a tablescan (since prior to 5.7 the cost model assumed pages were not in memory). If you can rely on this being the case, you may want to FORCE INDEX and compare actual run times. (More info on new cost model.)

Context

StackExchange Database Administrators Q#102413, answer score: 3

Revisions (0)

No revisions yet.