HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Copying Data into another table and removing older data should I do in batches and should I remove indexes and FK constraints?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
constraintsremovingintoindexesbatchesremovecopyinganothershouldand

Problem

I am looking to remove older data from a table which consists of 889 million rows of data.

I have a script but I am trying to make it more robust as removing approximately 418 million rows of data to be precise.

Am running in Postgres 9.6, with the table and FK Constraints

```
Column | Type | Collation | Nullable |
Default
--------------------------+--------------------------+-----------+----------+---
------------------------------------
game_id | integer | | not null | ne
xtval('game_game_id_seq'::regclass)
game_id | integer | | not null |
session_id | integer | | |
game_created_on | timestamp with time zone | | not null |
currency_code | character(3) | | not null |
game_cash_staked | numeric(12,2) | | |
game_cash_won | numeric(12,2) | | |
game_bonus_staked | numeric(12,2) | | |
game_bonus_won | numeric(12,2) | | |
game_created_by_id | integer | | not null |
game_remote_ref | character varying(50) | | |
game_description | text | | |
game_batch_id | integer | | |
game_rejection_code_id | integer | | |
game_rejection_message | character varying(255) | | |
game_transfer_remote_ref | character varying(128) | | |

Indexes:
"game_pkey" PRIMARY KEY, btree (game_id)
"idx_game_created_on_rejection_code" btree (game_created_on) WHERE game_rejection_code_id IS NULL
"idx_game_game_created_on" btr

Solution

The first step would be to upgrade to a current version. Postgres 9.6
reaches EOL on November 11, 2021 - in three weeks. Postgres 13 or 14 are much faster with big data. Dropping and recreating all indexes has additional benefit in this case: that way you tap into the new feature index deduplication of Postgres 13 or later: shrinks indexes with duplicative data to a fraction of their size.

Assuming no concurrent access.

would it be better to drop the indexes first prior to deleting and then adding them back in once done.

Because you delete half the table, typically yes. The added benefit is that recreated indexes are in pristine condition without bloat.

This will be much faster:

BEGIN;
CREATE TABLE public.gamearchived (LIKE public.game);

-- DROP all indexes on table game here (be sure to remember the DDL!)

WITH del AS (
   DELETE FROM game
   WHERE  game_created_on < NOW() - interval '1 year';  -- older than 1 year
   RETURNING *
   )
 , ins AS (
   INSERT INTO public.gamearchived
   SELECT * FROM del
   ORDER  BY game_created_on;  -- optional, only if it helps future queries
    )
SELECT count(*) FROM del;  -- get your count

-- run checks if you are not sure; last chance.

COMMIT;


If anything goes wrong, the transaction is rolled back. So this is safe.

Creating the new table within the same transaction saves a lot of overhead: no additional WAL needs to be written.

The main difference: this only needs a single sequential scan on the big table. Your original does a lot of additional (pointless) work.

Also, we don't need any indexes for the big DELETE. It's cheaper to drop them before and recreate them after, than to incrementally keep them up to date.

I would at least run this afterwards:

VACUUM ANALYZE game;
VACUUM ANALYZE gamearchived;


To free up space (and while thee is no concurrent access), even:

VACUUM FULL ANALYZE game;


(No point in running VACUUM FULL on gamearchived.)

Ideally, recreate all indexes on table game now.

Also, create any constraints or indexes (PK?) you want on table gamearchived. Doesn't have to be in the same transaction.

The important bit is to do it after the big insert, which is much cheaper and typically produces better results (indexes balanced without bloat).

Outgoing FK constraints in table game don't matter, the don't impose any additional work for DELETE. (Incoming FK constraints would matter, as Postgres would have to check for possible referencing rows in linked tables.)

Related:

  • VACUUM returning disk space to operating system



  • Best way to populate a new column in a large table?



  • Best way to delete millions of rows by ID



Basics in this related chapter of the manual: Populating a Database

Code Snippets

BEGIN;
CREATE TABLE public.gamearchived (LIKE public.game);

-- DROP all indexes on table game here (be sure to remember the DDL!)

WITH del AS (
   DELETE FROM game
   WHERE  game_created_on < NOW() - interval '1 year';  -- older than 1 year
   RETURNING *
   )
 , ins AS (
   INSERT INTO public.gamearchived
   SELECT * FROM del
   ORDER  BY game_created_on;  -- optional, only if it helps future queries
    )
SELECT count(*) FROM del;  -- get your count

-- run checks if you are not sure; last chance.

COMMIT;
VACUUM ANALYZE game;
VACUUM ANALYZE gamearchived;
VACUUM FULL ANALYZE game;

Context

StackExchange Database Administrators Q#301749, answer score: 3

Revisions (0)

No revisions yet.