HiveBrain v1.2.0

Large INSERT INTO SELECT [..] FROM gradually gets slower

Submitted by: @import:stackexchange-dba

Problem

I've written a program that does a large INSERT in batches of 100,000 and shows its progress.

The source table contains 2.5GB of data:

CREATE TABLE wikt.text (
    old_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    old_text MEDIUMBLOB NOT NULL,
    old_flags TINYBLOB NOT NULL,

    PRIMARY KEY (old_id),
    KEY old_id (old_id)

)   ENGINE=INNODB
    AUTO_INCREMENT=23565544
    DEFAULT CHARSET=binary;

CREATE INDEX old_id ON text (old_id);


This is the destination table:

CREATE TABLE domains.dictionary_language (
    text_id     INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    english     TINYINT(1) UNSIGNED NOT NULL,

    PRIMARY KEY (text_id),
    KEY         english (english)

)   ENGINE=INNODB
    AUTO_INCREMENT=23565544;


This is the query being run in batches of 100k:

INSERT INTO domains.dictionary_language
    SELECT      old_id,
                IF(old_text LIKE '%==English==%', 1, 0)

    FROM        wikt.text

    LIMIT       {batch}, 100000;


The query is getting slower and slower. The first 1 million records were inserted in 8 minutes. After that, records #1.2m - #1.3m alone took 7 minutes, and #2.3m - #2.4m just finished in 15 minutes.

Here's the log of how long each batch of 100k takes to import.

As you can see, the first 12 batches (1.2 million records) insert in < 1 minute each. After that, the performance drops, with each batch taking a bit longer than the last!
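This curve is exactly what OFFSET-based paging predicts: to serve LIMIT {batch}, 100000 the server still reads and discards the {batch} skipped rows before returning anything, so batch k touches roughly k × 100,000 rows and per-batch time grows with k. A quick back-of-the-envelope sketch of the rows scanned (Python, using the 3,729,613-record total from the log):

```python
BATCH = 100_000
TOTAL = 3_729_613  # record count reported by the import program

# Rows the server must read for batch k: the skipped offset plus the batch itself.
n_batches = TOTAL // BATCH + 1
scanned = [k * BATCH + min(BATCH, TOTAL - k * BATCH) for k in range(n_batches)]

print(scanned[0])    # 100000   -> batch 1 reads only its own rows
print(scanned[11])   # 1200000  -> batch 12 reads 12x as many
print(sum(scanned))  # 74029613 -> ~74M rows read to insert 3.7M
```

The per-batch cost grows linearly, so the total work is roughly quadratic in the table size.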

```
24/11/2013 19:18:40 Ready

24/11/2013 19:18:42 Dictionary import started from Wiktionary
24/11/2013 19:18:42 Records: 3,729,613
24/11/2013 19:18:42 Batches of: 100,000

24/11/2013 19:19:11 Batch 1 finished in 00:00:29.3146767
24/11/2013 19:19:33 Batch 2 finished in 00:00:22.2142706
24/11/2013 19:19:41 Batch 3 finished in 00:00:07.6104353
24/11/2013 19:19:53 Batch 4 finished in 00:00:12.7057267
24/11/2013 19:20:08 Batch 5 finished in 00:00:14.9248537
24/11/2013 19:20:25 Batch 6 finished in 00:00:16.9849715
24/11/2013 19:20:43 Batch 7 finished in 00:00:17.7930177
24/1
```

Solution

Large offsets have exactly this effect: to satisfy LIMIT {batch}, 100000 the server must still read and discard the {batch} skipped rows, so every batch rescans everything before it. I would remove the offset and use only LIMIT 100000, resuming from the highest text_id already copied:

INSERT INTO domains.dictionary_language
  (text_id, english)
    SELECT      t.old_id,
                IF(t.old_text LIKE '%==English==%', 1, 0)

    FROM        wikt.text AS t
      JOIN      ( SELECT COALESCE(MAX(text_id), 0) AS offset
                  FROM domains.dictionary_language
                ) AS m
                ON  t.old_id > m.offset

    ORDER BY    t.old_id
    LIMIT       100000;
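In practice the rewritten statement is simply re-run until it inserts nothing. A minimal, self-contained sketch of that driver loop, using an in-memory SQLite database as a stand-in for MySQL (table and column names mirror the question, scaled down to toy sizes; SQLite has no IF(), so the sketch uses CASE WHEN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE text (
        old_id   INTEGER PRIMARY KEY,
        old_text TEXT NOT NULL
    );
    CREATE TABLE dictionary_language (
        text_id INTEGER PRIMARY KEY,
        english INTEGER NOT NULL
    );
""")
# Toy source data: odd ids carry the ==English== marker.
conn.executemany(
    "INSERT INTO text (old_id, old_text) VALUES (?, ?)",
    [(i, "==English==" if i % 2 else "==French==") for i in range(1, 251)],
)

BATCH = 100  # 100,000 in the real run
while True:
    before = conn.total_changes
    # Keyset pagination: resume from the highest text_id already copied,
    # so each batch reads only its own rows instead of rescanning the offset.
    conn.execute("""
        INSERT INTO dictionary_language (text_id, english)
        SELECT t.old_id,
               CASE WHEN t.old_text LIKE '%==English==%' THEN 1 ELSE 0 END
        FROM   text AS t
        JOIN   (SELECT COALESCE(MAX(text_id), 0) AS offset
                FROM   dictionary_language) AS m
               ON t.old_id > m.offset
        ORDER BY t.old_id
        LIMIT  ?
    """, (BATCH,))
    conn.commit()
    if conn.total_changes == before:  # nothing left to copy
        break

rows, english = conn.execute(
    "SELECT COUNT(*), SUM(english) FROM dictionary_language"
).fetchone()
print(rows, english)  # 250 125
```

Because each batch seeks directly to old_id > MAX(text_id) via the primary key, every batch costs the same regardless of how many rows were already copied.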


Context

StackExchange Database Administrators Q#53831, answer score: 12
