HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Deleting duplicate records when using text datatype

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
deletingdatatypeduplicaterecordstextusingwhen

Problem

I have a large (~678,000 rows) table storing emails, and I need to delete duplicate records that match the fields of the email: to, from, subject, body, as well as the foreign key record_id.

Normally I would use the following statement to remove duplicates:

DELETE
FROM emails
WHERE email_id NOT IN (
SELECT MAX(email_id) FROM emails
GROUP BY record_id, from_add, to_add, subject, body)


However, body is datatype text, so it cannot be compared/grouped. It gives the following error message on the select query:

The text, ntext, and image data types cannot be compared or sorted, except when using IS NULL or LIKE operator.


What is the best way to identify duplicates with the text datatype? Do I have to convert the column to varchar first?

Solution

You don't have to convert the actual column. Just its usage in the SELECT query.

For a one off clean up task I'd probably just go

;WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY record_id, 
                                                from_add, 
                                                to_add, 
                                                subject, 
                                                CAST(body AS VARCHAR(MAX)) 
                                       ORDER BY email_id DESC) RN
         FROM   emails)
DELETE FROM cte
WHERE  RN > 1


Unless dealing with huge amounts of data that might make spending time on a more efficient solution (that doesn't have to sort huge strings) worthwhile.

Or alternatively you could try

;WITH T
     AS (SELECT MAX(email_id) AS max_email_id
         FROM   emails
         GROUP  BY record_id,
                   from_add,
                   to_add,
                   subject,
                   CAST(body AS VARCHAR(max)))
DELETE e1
FROM   emails e1
WHERE  NOT EXISTS (SELECT *
                   FROM   T
                   WHERE  max_email_id = e1.email_id)


This is likely to give you a better plan than NOT IN as SQL Server treats MAX(not_nullable_column) as being potentially nullable and hence you end up with a potentially more expensive plan.

Code Snippets

;WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY record_id, 
                                                from_add, 
                                                to_add, 
                                                subject, 
                                                CAST(body AS VARCHAR(MAX)) 
                                       ORDER BY email_id DESC) RN
         FROM   emails)
DELETE FROM cte
WHERE  RN > 1
;WITH T
     AS (SELECT MAX(email_id) AS max_email_id
         FROM   emails
         GROUP  BY record_id,
                   from_add,
                   to_add,
                   subject,
                   CAST(body AS VARCHAR(max)))
DELETE e1
FROM   emails e1
WHERE  NOT EXISTS (SELECT *
                   FROM   T
                   WHERE  max_email_id = e1.email_id)

Context

StackExchange Database Administrators Q#31135, answer score: 5

Revisions (0)

No revisions yet.