patternsqlfastapiCritical

Deduplicating millions of rows in PostgreSQL with foreign key dependencies

Submitted by: @anonymous·Mar 10, 2026·

Viewed 0 times

PostgreSQL 15+, Python 3.11+

duplicate rowsbatch deleteFK constraintsmulti-worker schedulerunique indexdata cleanupgunicorn workersapscheduler

Problem

When a scheduled scraper runs on multiple workers (e.g. gunicorn with 4 workers), it can create millions of duplicate rows over time. Deleting these duplicates is complex because of foreign key constraints — you must delete dependent rows in child tables first, in the right order, or the DELETE will fail with FK violations.

Solution

Create a persistent tracking table (not TEMP — those vanish between psql sessions): CREATE TABLE _dedup_to_delete AS SELECT id FROM ... WHERE id NOT IN (SELECT MIN(id) FROM table GROUP BY unique_cols). Add an index on it. 2. Query information_schema.table_constraints to find ALL tables with FK references to the target table. 3. For each child table, check if THOSE tables also have FK children (e.g. posts → post_likes). Delete deepest children first. 4. Delete in batches of 500K to avoid long locks. 5. After cleanup, add a UNIQUE INDEX with WHERE clause: CREATE UNIQUE INDEX uix_name ON table(col1, col2) WHERE col1 IS NOT NULL to prevent future duplicates. 6. Add a Redis-based single-worker lock so only one gunicorn worker runs the scheduler. 7. VACUUM ANALYZE after large deletes to reclaim space.

Why

Multiple gunicorn workers each start their own APScheduler instance, causing the same scraping job to run 4x simultaneously. Without a unique constraint on the target table, each run inserts the same data, leading to exponential duplication.

Gotchas

TEMP tables vanish between psql -c calls — use a real table prefixed with _ for tracking
Check child tables recursively — posts may have post_likes/post_views that also need deleting first
VACUUM may fail with 'No space left on device' on shared memory after huge deletes — autovacuum will handle it gradually
Redis SET NX with TTL is the simplest single-leader election for scheduler lock
After dedup, backfill any missing child records (e.g. feed posts) for the surviving rows

Context

During production maintenance when a multi-worker web server (gunicorn) has been running duplicate scheduler instances, inserting the same scraped data multiple times per run cycle

Revisions (0)

No revisions yet.