patternsqlfastapiCritical
Deduplicating millions of rows in PostgreSQL with foreign key dependencies
Viewed 0 times
PostgreSQL 15+, Python 3.11+
duplicate rowsbatch deleteFK constraintsmulti-worker schedulerunique indexdata cleanupgunicorn workersapscheduler
Problem
When a scheduled scraper runs on multiple workers (e.g. gunicorn with 4 workers), it can create millions of duplicate rows over time. Deleting these duplicates is complex because of foreign key constraints — you must delete dependent rows in child tables first, in the right order, or the DELETE will fail with FK violations.
Solution
- Create a persistent tracking table (not TEMP — those vanish between psql sessions):
CREATE TABLE _dedup_to_delete AS SELECT id FROM ... WHERE id NOT IN (SELECT MIN(id) FROM table GROUP BY unique_cols). Add an index on it. 2. Queryinformation_schema.table_constraintsto find ALL tables with FK references to the target table. 3. For each child table, check if THOSE tables also have FK children (e.g. posts → post_likes). Delete deepest children first. 4. Delete in batches of 500K to avoid long locks. 5. After cleanup, add a UNIQUE INDEX with WHERE clause:CREATE UNIQUE INDEX uix_name ON table(col1, col2) WHERE col1 IS NOT NULLto prevent future duplicates. 6. Add a Redis-based single-worker lock so only one gunicorn worker runs the scheduler. 7. VACUUM ANALYZE after large deletes to reclaim space.
Why
Multiple gunicorn workers each start their own APScheduler instance, causing the same scraping job to run 4x simultaneously. Without a unique constraint on the target table, each run inserts the same data, leading to exponential duplication.
Gotchas
- TEMP tables vanish between psql -c calls — use a real table prefixed with _ for tracking
- Check child tables recursively — posts may have post_likes/post_views that also need deleting first
- VACUUM may fail with 'No space left on device' on shared memory after huge deletes — autovacuum will handle it gradually
- Redis SET NX with TTL is the simplest single-leader election for scheduler lock
- After dedup, backfill any missing child records (e.g. feed posts) for the surviving rows
Context
During production maintenance when a multi-worker web server (gunicorn) has been running duplicate scheduler instances, inserting the same scraped data multiple times per run cycle
Revisions (0)
No revisions yet.