patternsqlModerate
Adding a Primary Key to a Large PostgreSQL Table with High Traffic
Viewed 0 times
postgresqlprimarywithtraffichighaddinglargetablekey
Problem
I need to add a primary key to a large PostgreSQL table (approximately 2TB) with high traffic. This is a critical operation, and I'm looking for guidance on how to do it efficiently.
I have already tried below steps:
I have facing two problems:
-
Table is locked completely on "Step 1" itself.
I know this is expected, but if there is any option to avoid that please suggest.
-
I get below error,
ERROR: could not extend file "base/16401/90996": No space left on device
HINT: Check free disk space.
But I have around
As the table will be locked on "Step 1", and if there is no option to avoid that, I could take the downtime and add the
I don't know whether that would solve the storage error.
Please provide any suggestions so that I can add the PK with the least downtime possible.
PostgreSQL v14.6
I have already tried below steps:
-- Step 1: Add id identity column
ALTER TABLE users
ADD COLUMN id BIGINT GENERATED ALWAYS as IDENTITY;
-- Step 2: Add unique index on (id, user_id) concurrently
CREATE UNIQUE INDEX CONCURRENTLY id_user_id_idx
ON users (id, user_id);
-- verify that step 2 is completed
-- Step 3: Add primary key
ALTER TABLE users
ADD CONSTRAINT users_pkey PRIMARY KEY USING INDEX id_user_id_idx;
I have facing two problems:
-
Table is locked completely on "Step 1" itself.
I know this is expected, but if there is any option to avoid that please suggest.
-
I get below error,
ERROR: could not extend file "base/16401/90996": No space left on device
HINT: Check free disk space.
But I have around
600GB of storage left on my server.As the table will be locked on "Step 1", and if there is no option to avoid that, I could take the downtime and add the
id column first and then run the other two scripts.I don't know whether that would solve the storage error.
Please provide any suggestions so that I can add the PK with the least downtime possible.
PostgreSQL v14.6
Solution
Why?
Your step 1 would need a lot more than 600 GB (temporarily). The table has around 2 TB. About as much (minus possible bloat, plus 8 bytes per row for the new
Minimize blocking AND minimize total storage requirement
Do instead, in this order:
fiddle
Add a nullable column
This way, Postgres can make do with tiny metadata changes. No table rewrite, no blocking.
I would name the PK column "user_id", not a fan of the wide-spread, non-descriptive, and highly duplicative name "id". But keeping "id" to stay in line with the question.
Create a
Make the column "own" the sequence:
Add the column default, which only kicks in for new rows.
See:
Update pre-existing rows (still with
Since the addition of SQL procedures in Postgres 11, we can
Alternatively, loop in your client, and run
See:
Eventually, all old rows are updated.
Now create the unique index
I don't see a good reason to add
Now use the unique index to add the new PK without blocking inserts (your step 3):
This will also implicitly set the column
Finally, use Peter Eisentraut's function
Or stick with the
Related:
Your step 1 would need a lot more than 600 GB (temporarily). The table has around 2 TB. About as much (minus possible bloat, plus 8 bytes per row for the new
bigint column) has to be available at least, because that change forces Postgres to rewrite the whole table.Minimize blocking AND minimize total storage requirement
Do instead, in this order:
fiddle
Add a nullable column
id with no default value, so it will be null initially.ALTER TABLE users ADD COLUMN id bigint;This way, Postgres can make do with tiny metadata changes. No table rewrite, no blocking.
I would name the PK column "user_id", not a fan of the wide-spread, non-descriptive, and highly duplicative name "id". But keeping "id" to stay in line with the question.
Create a
SEQUENCE manually:CREATE SEQUENCE users_id_seq;Make the column "own" the sequence:
ALTER SEQUENCE users_id_seq OWNED BY users.id;Add the column default, which only kicks in for new rows.
ALTER TABLE users ALTER COLUMN id SET DEFAULT nextval('users_id_seq');See:
- Safely rename tables using serial primary key columns
- Creating a PostgreSQL sequence to a field (which is not the ID of the record)
Update pre-existing rows (still with
null values) in batches of like 1 % of the total size (or whatever). In separate transactions, to allow autovacuum to kick in and mark dead rows for reuse. This way, the table won't grow much, and 600 GB are easily enough wiggle room.Since the addition of SQL procedures in Postgres 11, we can
COMMIT in an anonymous code block. Assuming a timestamptz column users.inserted_at (ideally with an index on it!) something like this would work:DO
$do$
DECLARE
_ts timestamptz := (SELECT COALESCE(min(inserted_at), now()) FROM users); -- must not be NULL
_step interval := '7 days'; -- adjust to your data !!!
BEGIN
LOOP
RAISE NOTICE 'Updating rows starting from %', _ts; -- optional
UPDATE users
SET id = nextval('users_id_seq')
WHERE inserted_at >= _ts
AND inserted_at = now(); -- adjust to your case !!!
COMMIT; -- Requires Postgres 11+ !!!
PERFORM pg_sleep(10); -- adapt to your setup: long enough so let autovacuum kick in
_ts := _ts + _step;
END LOOP;
END
$do$;Alternatively, loop in your client, and run
VACUUM users; between iterations to make sure space is reused. (VACUUM cannot run inside a transaction.)See:
- COMMIT works in one plpgsql code block, but not in another?
- Do PL/pgSQL functions autocommit on their own?
Eventually, all old rows are updated.
Now create the unique index
CONCURRENTLY, to avoid blocking inserts. Like your step 2, but only on (id):CREATE UNIQUE INDEX CONCURRENTLY users_id_idx ON users (id);I don't see a good reason to add
user_id to the PK. If you need it for index-only scans consider a covering index with INCLUDE (user_id). But that's not always beneficial overall. See:- Does a query with a primary key and foreign keys run faster than a query with just primary keys?
- Do covering indexes in PostgreSQL help JOIN columns?
Now use the unique index to add the new PK without blocking inserts (your step 3):
ALTER TABLE users ADD CONSTRAINT users_pkey PRIMARY KEY USING INDEX users_id_idx;This will also implicitly set the column
NOT NULL.Finally, use Peter Eisentraut's function
upgrade_serial_to_identity(tbl regclass, col name) to convert the serial to an IDENTITY column. As superuser:SELECT upgrade_serial_to_identity('users'::regclass, 'id')Or stick with the
serial PK, might be good enough.Related:
- How to change a table ID from serial to identity?
Code Snippets
ALTER TABLE users ADD COLUMN id bigint;CREATE SEQUENCE users_id_seq;ALTER SEQUENCE users_id_seq OWNED BY users.id;ALTER TABLE users ALTER COLUMN id SET DEFAULT nextval('users_id_seq');DO
$do$
DECLARE
_ts timestamptz := (SELECT COALESCE(min(inserted_at), now()) FROM users); -- must not be NULL
_step interval := '7 days'; -- adjust to your data !!!
BEGIN
LOOP
RAISE NOTICE 'Updating rows starting from %', _ts; -- optional
UPDATE users
SET id = nextval('users_id_seq')
WHERE inserted_at >= _ts
AND inserted_at < _ts + _step
AND id IS NULL; -- optional
EXIT WHEN NOT FOUND AND _ts >= now(); -- adjust to your case !!!
COMMIT; -- Requires Postgres 11+ !!!
PERFORM pg_sleep(10); -- adapt to your setup: long enough so let autovacuum kick in
_ts := _ts + _step;
END LOOP;
END
$do$;Context
StackExchange Database Administrators Q#331424, answer score: 12
Revisions (0)
No revisions yet.