HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Improve update performance in PostgreSQL data warehouse (RDS)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlupdatewarehouseimproveperformancerdsdata

Problem

I am using a Postgres data warehouse hosted on Amazon RDS. When trying to update one column of a fact table (25 million rows) from another table in the same database, the query takes several days to run. Why is this happening and how can I improve this performance? I know that PG is designed more for OLTP than OLAP, but select query performance is usually pretty decent on this table.

The query in question looks like this:

UPDATE a
SET a.value = b.value
FROM b
WHERE a.id = b.id


b is a temp table in a different schema but same database that has the same number of rows as a. Both tables have primary keys on id. There is no index or constraints on the value column. There are views that depend on table a but no foreign keys

I am using PG 9.5 on RDS. General purpose (SSD) with 256 GB of storage, so after exhausting our initial burst IOPS, I should get a little under 800 IOPS.

Is the IOPS throttling really the issue here? While watching the query run I see ~ 400 IOPS of write performance, and similar read performance. 25,000,000 rows / 400 IOPS = 17 hours, but this query took much longer than 24 hours to run ( cancelled after ~ 30 hours to try and make tweaks). There was some other periodic update traffic on the same table, but I halted this at around the 20 hour mark when I saw how long this query was taking.

I wondering if my general update approach is wrong, or if there is general advice for operating a data warehouse (OLAP workload) using postgres. Could I get better performance by ditching RDS and running an PG on EC2?

UPDATE:
Inspired by responses and comment, I ran a test on 45k rows (by limiting the pk below a certain range)

You can see the results of explain analyze here. The vast majority of the time is spent writing the actual updates to the table. Right now I am still leaning towards write IOPS being a limiting factor, but I will dig into possible replication issues as mentioned by joanolo.

This image shows the RDS instance monitori

Solution

Have you considered the possibility that WAL (Write-Ahead Logging) is slowing down your update? Also see this more detail explanation.

This is a common problem for UPDATEs, INSERTs and ALTER TABLE operations on very large tables. According to my understanding, for any row affected by an update--even if that update is only on a single column--Postgres archives and replaces the entire row and updates all indexes on all indexes columns. While Postgres's implementation of WAL is extremely effective for maintaining data integrity in a transactional setting, it can seriously degrade performance for operations on large tables--particularly for data warehouses, where bulk updates involving many records are common.

I would need to know more about the table being updated to know if WAL is the culprit. Is a large proportion of records (say 30% or more) affected by the update? You stated that the value column is not indexed, but are there indexes on other columns? If the answer to both questions is yes, then I would strongly suspect WAL.

An effective solution to this problem is to create a new, unindexed table using the CREATE TABLE AS method described here. You will need to rebuild indexes, keys and constraints on the new table, but this is still much faster than updating-in-place. See also the related answer here.

The downside to the CREATE TABLE AS approach is that a simple UPDATE query becomes a monster multi-statement transaction. The latter code is not only cumbersome but also fragile: the table schema must be repeated for every update that uses this approach. Imagine a data warehouse pipeline with dozens of such updates. Any change to the table schema must be hard-coded into every single update operation.

As an alternative approach, I suggest you first try stripping all indexes not involved in the update (joins or where clause), then use a regular UPDATE statement and rebuild the indexes. Depending on the dimensions of you table, its indexes, the complexity of the update and the number of rows involved, an update-in-place may nearly as fast as the "CREATE TABLE AS" method, and your code will be simpler and more stable.

Context

StackExchange Database Administrators Q#174097, answer score: 2

Revisions (0)

No revisions yet.