patternsqlMinor
What is effective way to delete duplicate records based on multiple columns in MySQL?
Viewed 0 times
whatdeletecolumnsduplicaterecordswaymysqlbasedmultipleeffective
Problem
I have a table named
Two records are said duplicate if both have same
I searched on stackexchange, but couldn't found any solution effective with such a huge data. Any help would be appreciated. My table is in MySQL 5.5.45 with InnoDB engine.
This is the users table. Apart from mentioned columns, I have few more columns but the mentioned ones are relevant to the question.
users having approx 70 Million records, out of which I guess 10 Million is duplicate. How to remove those 10 Million records keeping one copy of that. Two records are said duplicate if both have same
email, alt_email, mobile and alt_mobile.I searched on stackexchange, but couldn't found any solution effective with such a huge data. Any help would be appreciated. My table is in MySQL 5.5.45 with InnoDB engine.
This is the users table. Apart from mentioned columns, I have few more columns but the mentioned ones are relevant to the question.
CREATE TABLE users
(
id int(11) NOT NULL AUTO_INCREMENT,
email varchar(200) DEFAULT NULL,
alt_email varchar(200) DEFAULT NULL,
mobile varchar(100) DEFAULT NULL,
alt_mobile varchar(100) DEFAULT NULL,
PRIMARY KEY (id),
KEY mobile (mobile),
KEY email (email)
) ENGINE=InnoDB;Solution
Consider enforcing this combination of columns as a unique constraint at the database level. That won't be possible until the existing duplicates are cleaned out. But it will help prevent this problem from creeping back into the data.
You should define the 4 columns as a non-unique index now since this process will involving lots of comparisons using them. You should define the 4 columns as a single index listing them in the same order you plan to list them in your query. 70M records is gonna take a long time to query. But with an index on the column in question it may drop to a feasible time frame. And you may end up needing to use
https://www.percona.com/blog/2014/01/03/multiple-column-index-vs-multiple-indexes-with-mysql-56/
Is there a preferred strategy in terms of picking the version of the record to keep? Assuming you want to keep the most recent on file, for example, you can use a date column (I'll call it
You may want to create a new table
Sanity check the results. You want to make damn sure your data matches your expectations before running a
If you don't have a date to compare use
Adding an index on these columns will slow down writes. So if it is vitally important that writes to the
Personally I find slightly slower writes a small price to pay for the potential read boost and data sanitation that comes along with defining natural keys as
If this proves unworkable I can think of two other strategies worth investigating.
Use a cloned table similar to
Or you could rename your existing
You can leave the columns from the
You should define the 4 columns as a non-unique index now since this process will involving lots of comparisons using them. You should define the 4 columns as a single index listing them in the same order you plan to list them in your query. 70M records is gonna take a long time to query. But with an index on the column in question it may drop to a feasible time frame. And you may end up needing to use
FORCE INDEX anyway.https://www.percona.com/blog/2014/01/03/multiple-column-index-vs-multiple-indexes-with-mysql-56/
Is there a preferred strategy in terms of picking the version of the record to keep? Assuming you want to keep the most recent on file, for example, you can use a date column (I'll call it
cdate) or the value of an auto-increment column (I'll call it id). You may want to create a new table
users_test or something similar using an otherwise unchanged CREATE TABLE statement. Then shunt over a more manageable portion of your data to experiment on there. Especially if you already know a few of the duplicate records so that you can be sure to include them. Then you can start out with something like this:SELECT
t1.id,
t2.id,
t1.cdate,
t2.cdate,
t1.email,
t2.email,
t1.alt_email,
t2.alt_email,
t1.mobile,
t2.mobile,
t1.alt_mobile,
t2.alt_mobile
# Include whatever other columns may be helpful
FROM
# Or users_test if it exists
users t1
JOIN users t2 ON (
# We want t2 to always contain a duplicate but older record
t1.cdate > t2.cdate
AND
t1.email = t2.email
AND
t1.alt_email = t2.alt_email
AND
t1.mobile = t2.mobile
AND
t1.alt_mobile = t2.alt_mobile
)
;Sanity check the results. You want to make damn sure your data matches your expectations before running a
DELETE statement in production. But once you are happy with the results you can safely collect t2.id values from that query. Those will all be records with a newer duplicate row. If you don't have a date to compare use
id. If you would rather keep the oldest record on file switch the > to an <. Or you may have a completely different strategy in mind for picking the version of the record to keep. That's left as an exercise for the reader.Adding an index on these columns will slow down writes. So if it is vitally important that writes to the
users table in production are super fast then you may not be able to define the index (unique or not) in production. But with this method you can still dump the whole thing to a staging environment, set up the index there, collect the IDs, and then use just the ID list in DELETE statements in production. DELETE FROM users WHERE id in (/* list of t2.id values here */);Personally I find slightly slower writes a small price to pay for the potential read boost and data sanitation that comes along with defining natural keys as
UNIQUE constraints. But I've also never dealt with a project with 70M rows in the users table. My experiences may not mean much for your particular situation.If this proves unworkable I can think of two other strategies worth investigating.
Use a cloned table similar to
user_test discussed above. Move some much smaller subset of user records into it. Join the two different tables rather than the self join on users. Comparing 10k rows to 70M rows should be much faster than comparing 70M to 70M. But checking all the records will require running 7,000+ iterations. You'd want to use the full users table in the t2 position. And you could potentially pick up some optimizations depending on how you chunk your rows into the table for comparison. Or you could rename your existing
users table to something like users_old. Then you can create a new users table with the unique constraint in place across those 4 columns. Then use a combination of INSERT INTO … SELECT … and INSERT … ON DUPLICATE KEY UPDATE … statements to populate the new table using values from the old table.INSERT INTO users
SELECT * FROM users_old
ON DUPLICATE KEY UPDATE
some_column = VALUES(some_column),
some_other_column = VALUES(some_other_column), …
;You can leave the columns from the
UNIQUE constraint out of your UPDATE clause because you know they wouldn't change anyway. And this strategy goes out the window is enforcing a UNIQUE constraint in production isn't an option for you.Code Snippets
SELECT
t1.id,
t2.id,
t1.cdate,
t2.cdate,
t1.email,
t2.email,
t1.alt_email,
t2.alt_email,
t1.mobile,
t2.mobile,
t1.alt_mobile,
t2.alt_mobile
# Include whatever other columns may be helpful
FROM
# Or users_test if it exists
users t1
JOIN users t2 ON (
# We want t2 to always contain a duplicate but older record
t1.cdate > t2.cdate
AND
t1.email = t2.email
AND
t1.alt_email = t2.alt_email
AND
t1.mobile = t2.mobile
AND
t1.alt_mobile = t2.alt_mobile
)
;DELETE FROM users WHERE id in (/* list of t2.id values here */);INSERT INTO users
SELECT * FROM users_old
ON DUPLICATE KEY UPDATE
some_column = VALUES(some_column),
some_other_column = VALUES(some_other_column), …
;Context
StackExchange Database Administrators Q#142777, answer score: 3
Revisions (0)
No revisions yet.