HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

pgsql - Delete data from huge table where not in

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
deletewherehugepgsqlfromdatatablenot

Problem

I need to delete some rows from one large table. The rows to delete shouldn't be in another table, example:

DELETE FROM LargeTable WHERE id IS NOT IN (SELECT DISTINCT foreign_id from EvenLargerTable)

but my server can't handle such blunt query, because there are almost a million records in the LargeTable and few million records in the EvenLargerTable

How can I solve it?

Solution

For large tables used for test of existence NOT EXISTS frequently works faster then NOT IN. So try

DELETE FROM LargeTable
WHERE NOT EXISTS (
  SELECT *
  FROM EvenLargerTable
  WHERE EvenLargerTable.foreign_id = LargeTable.id);


It is hard to explain why it's happening without having actual data. But with simple playground we can see that NOT IN case does not used indexes to perform the operation:

Playground:

-- drop table if exists a;
-- drop table if exists b;
create table a as select (random()*1000)::int as x from generate_series(1,10000);
create index idx_a on a(x);

create table b as select (random()*1000)::int*10 as x from generate_series(1,1000000);
create index idx_b on b(x);

analyse a;
analyse b;


Tests:

nd@postgres=# explain (verbose) delete from a where a.x not in (select b.x from b);
╔══════════════════════════════════════════════════════════════════════════════════╗
║                                    QUERY PLAN                                    ║
╠══════════════════════════════════════════════════════════════════════════════════╣
║ Delete on nd.a  (cost=0.00..129160170.00 rows=5000 width=6)                      ║
║   ->  Seq Scan on nd.a  (cost=0.00..129160170.00 rows=5000 width=6)              ║
║         Output: a.ctid                                                           ║
║         Filter: (NOT (SubPlan 1))                                                ║
║         SubPlan 1                                                                ║
║           ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4)            ║
║                 Output: b.x                                                      ║
║                 ->  Seq Scan on nd.b  (cost=0.00..14425.00 rows=1000000 width=4) ║
║                       Output: b.x                                                ║
╚══════════════════════════════════════════════════════════════════════════════════╝

nd@postgres=# explain (verbose) delete from a where not exists (select * from b where a.x=b.x);
╔══════════════════════════════════════════════════════════════════════════════════╗
║                                    QUERY PLAN                                    ║
╠══════════════════════════════════════════════════════════════════════════════════╣
║ Delete on nd.a  (cost=0.42..5005.91 rows=1 width=12)                             ║
║   ->  Nested Loop Anti Join  (cost=0.42..5005.91 rows=1 width=12)                ║
║         Output: a.ctid, b.ctid                                                   ║
║         ->  Seq Scan on nd.a  (cost=0.00..145.00 rows=10000 width=10)            ║
║               Output: a.ctid, a.x                                                ║
║         ->  Index Scan using idx_b on nd.b  (cost=0.42..20.78 rows=999 width=10) ║
║               Output: b.ctid, b.x                                                ║
║               Index Cond: (a.x = b.x)                                            ║
╚══════════════════════════════════════════════════════════════════════════════════╝

Code Snippets

DELETE FROM LargeTable
WHERE NOT EXISTS (
  SELECT *
  FROM EvenLargerTable
  WHERE EvenLargerTable.foreign_id = LargeTable.id);
-- drop table if exists a;
-- drop table if exists b;
create table a as select (random()*1000)::int as x from generate_series(1,10000);
create index idx_a on a(x);

create table b as select (random()*1000)::int*10 as x from generate_series(1,1000000);
create index idx_b on b(x);

analyse a;
analyse b;
nd@postgres=# explain (verbose) delete from a where a.x not in (select b.x from b);
╔══════════════════════════════════════════════════════════════════════════════════╗
║                                    QUERY PLAN                                    ║
╠══════════════════════════════════════════════════════════════════════════════════╣
║ Delete on nd.a  (cost=0.00..129160170.00 rows=5000 width=6)                      ║
║   ->  Seq Scan on nd.a  (cost=0.00..129160170.00 rows=5000 width=6)              ║
║         Output: a.ctid                                                           ║
║         Filter: (NOT (SubPlan 1))                                                ║
║         SubPlan 1                                                                ║
║           ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4)            ║
║                 Output: b.x                                                      ║
║                 ->  Seq Scan on nd.b  (cost=0.00..14425.00 rows=1000000 width=4) ║
║                       Output: b.x                                                ║
╚══════════════════════════════════════════════════════════════════════════════════╝

nd@postgres=# explain (verbose) delete from a where not exists (select * from b where a.x=b.x);
╔══════════════════════════════════════════════════════════════════════════════════╗
║                                    QUERY PLAN                                    ║
╠══════════════════════════════════════════════════════════════════════════════════╣
║ Delete on nd.a  (cost=0.42..5005.91 rows=1 width=12)                             ║
║   ->  Nested Loop Anti Join  (cost=0.42..5005.91 rows=1 width=12)                ║
║         Output: a.ctid, b.ctid                                                   ║
║         ->  Seq Scan on nd.a  (cost=0.00..145.00 rows=10000 width=10)            ║
║               Output: a.ctid, a.x                                                ║
║         ->  Index Scan using idx_b on nd.b  (cost=0.42..20.78 rows=999 width=10) ║
║               Output: b.ctid, b.x                                                ║
║               Index Cond: (a.x = b.x)                                            ║
╚══════════════════════════════════════════════════════════════════════════════════╝

Context

StackExchange Database Administrators Q#172688, answer score: 6

Revisions (0)

No revisions yet.