HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Why is Insert-Ignore so expensive in MySQL?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
ignorewhyinsertmysqlexpensive

Problem

TL;DR:
Inserting a large number of rows with no collisions is MUCH faster than (not) re-inserting the same rows, BOTH using the INSERT IGNORE syntax.

Why is this? I would assume that the index lookup cost would be the same for an insert and an "ignored" insert, given that MySQL does not know if the incoming data has repeated/conflicting data (and therefore needs to be ignored)...thus, indexing occurs in both the initial insert and the ignored insert runs.

Furthermore, I would assume that an "ignored" row should be CHEAPER in that it does not require any disk writes.

But this is most definitely NOT the case.

Long Version:
In this question, we use AWS's Aurora/MySQL and the LOAD DATA FROM S3 FILE syntax to remove any transfer or performance variables. We load a 4 megarow CSV file corresponding to the schema below, and load it in twice, both times with LOAD ... IGNORE.

Please note that the issue also occurred with standard INSERT ... IGNORE, but with use of batched row inserts. The use of LOAD ... IGNORE here is to steer the discussion towards the counter-intuitive nature of the measured results, and not "how to perform large number of ignored inserts". That is not the issue here, as a domain-specific method was worked out.

In the model being tested, there is a three-tier index: the first two being enumerable categorization columns with very low cardinality, and a third column that is essentially "actual" data. For ease of qualifying this question, I am just sticking to my current setup.

Assume the following trivial schema:

our_table:

id: basic auto-increment bigint, primary key
constkey1: varchar(50) -- this is a constant in the insert
constkey2: varchar(50) -- this is a constant in the insert
datakey:   varchar(50) -- this is pulled in from the CSV file
datafield: varchar(50) -- this is pulled in from the CSV file
consttime: datetime    -- this is a constant in the insert

Unique Index: (constkey1, constkey2, datakey)


And the following que

Solution

As one can see from the source code, when a row is inserted into an InnoDB table, it gets added to the clustered index first, then to all secondary indexes in turn. In your case the clustered index is based on an auto-increment column, so the first operation always succeeds. The next operation on a secondary unique index fails because of the duplicate key value. The engine then proceeds to roll back the previous operation, incurring unnecessary performance penalty.

The logic is clearly optimised for the "happy path" -- non-conflicting inserts, at the cost of suboptimal exception path. If they were to validate constraints before applying any changes, conflict resolution cost would be reduced, but the "happy path" would become more expensive, making more users unhappy.

Yours being an AWS Aurora instance, the problem is likely made even worse by the fact that both changes need to be written to the quorum of change logs for replication to the secondary node(s), as well as its possible negative effects on the fast insert acceleration.

Context

StackExchange Database Administrators Q#211018, answer score: 4

Revisions (0)

No revisions yet.