
Updating a table by iterating in batches is faster than updating the whole table in PostgreSQL

Submitted by: @import:stackexchange-dba

Problem

I have a table with approximately one million rows.

It is in production use, and I ran an UPDATE that covered ~95% of those rows.

After five hours I cancelled the request because it was taking so long.

The table has an auto-incrementing ID column, so I tried extending the WHERE condition of my query to include id BETWEEN 1 AND 500.

This update completed in approximately two seconds. I then manually iterated up through the ids in batches of 500, e.g. BETWEEN 500 AND 1000, then BETWEEN 1000 AND 1500.

At this rate it will require 2000 batches of 500 to update the whole table.

Updating 2000 batches at 2 seconds each is just over an hour.
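
For illustration, each manual batch looked something like this (my_table and the SET clause stand in for my actual table and update):

UPDATE my_table
SET my_column = 'new value'   -- stand-in for the real SET clause
WHERE id BETWEEN 1 AND 500;   -- then BETWEEN 500 AND 1000, and so on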

My questions are:

  • What reasons are there for this difference in speed?

  • I don't care about transaction isolation, so is there a way to simulate this 'update in batches' without having to script 2000 separate updates?

Solution

I'll try my best to answer in brief. Since I don't know your level of comfort with PostgreSQL, and I don't have a lot of time for an in-depth explanation anyway, I'll keep the answers simple; ask for clarification if you'd like more info.

1) Why is it faster in batches?

Given the structure of PostgreSQL's write-ahead log (WAL), the limited shared buffer space in RAM, and the attempt to perform the entire UPDATE in a single transaction, my guess is that you simply don't have enough computing resources to handle an update to nearly a million rows in one go efficiently.

PostgreSQL uses multiversion concurrency control (MVCC), which essentially means it has to keep the old copies of your pre-UPDATE rows available during the UPDATE operation. That way, if another client reads those rows while you're updating, if the update fails, or if you cancel it, the old information isn't lost.
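
One way to watch those old row versions pile up is the dead-tuple counter in the statistics views. A quick check, with my_table standing in for your table name:

SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'my_table';

Right after a large UPDATE, n_dead_tup will be roughly the number of rows you touched, and it stays that way until VACUUM reclaims the old versions.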

If you perform a large enough UPDATE, PostgreSQL loads pages into memory and modifies them, but eventually runs out of buffer space to work with; it is then forced to flush these dirty pages to disk immediately so it can load further pages and continue the transaction.

Rather than being able to amortize the disk writes over a period of time, you've just forced your database into a bottleneck.
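
If you want to see how much headroom your server actually has for this kind of work, you can inspect the two most relevant settings with SHOW (on older releases, checkpoint_segments plays the role of max_wal_size):

SHOW shared_buffers;  -- size of the shared page cache; the stock default is only 128MB
SHOW max_wal_size;    -- how much WAL may accumulate between checkpoints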

2) Scripting the updates

You absolutely can script the updates by creating a function in PL/pgSQL. There's a lot to learn about PL/pgSQL, including plenty I probably don't know, but generally speaking, you could do something like this:

CREATE OR REPLACE FUNCTION mini_batch_update()
  RETURNS void AS
$BODY$
DECLARE
  id_val integer;  -- batch counter
BEGIN
  -- 2000 batches of 500 rows covers ids 1 through 1,000,000
  FOR id_val IN 0..1999
  LOOP

    **UPDATE GOES HERE**
    WHERE id > (500 * id_val) AND id <= 500 * (id_val + 1);

  END LOOP;
END;
$BODY$
LANGUAGE plpgsql;
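
Once the placeholder UPDATE is filled in, you run the whole thing with a plain SELECT:

SELECT mini_batch_update();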


I didn't take a lot of time to make this batch function tip-top; what I mean is that I simply hard-coded several of the numerical values for simplicity's sake. In your case, you may want to get more detailed and: 1) look up the maximum id value so that you set your bounds appropriately, and 2) turn the hard-coded batch size of 500 into a function input parameter. A sketch with both changes follows.
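
Here's a minimal sketch with both changes. Note that my_table and my_column are placeholder names and the SET clause is purely illustrative; substitute your real UPDATE logic:

CREATE OR REPLACE FUNCTION mini_batch_update(batch_size integer DEFAULT 500)
  RETURNS void AS
$BODY$
DECLARE
  max_id integer;
  batch  integer;
BEGIN
  -- Look up the real upper bound instead of hard-coding 2000 batches.
  SELECT max(id) INTO max_id FROM my_table;

  FOR batch IN 0..(max_id - 1) / batch_size
  LOOP
    UPDATE my_table
       SET my_column = 'new value'   -- placeholder SET clause
     WHERE id >  batch_size * batch
       AND id <= batch_size * (batch + 1);
  END LOOP;
END;
$BODY$
LANGUAGE plpgsql;

-- Example: run with batches of 1000 instead of the default 500
SELECT mini_batch_update(1000);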

Sorry I don't have time to test this or make sure it really works well. Good luck!


Context

StackExchange Database Administrators Q#123244, answer score: 11
