patternMinor
Deleting duplicate records when using text datatype
Viewed 0 times
deletingdatatypeduplicaterecordstextusingwhen
Problem
I have a large (~678,000 rows) table storing emails, and I need to delete duplicate records that match the fields of the email: to, from, subject, body, as well as the foreign key record_id.
Normally I would use the following statement to remove duplicates:
However, body is datatype
What is the best way to identify duplicates with the
Normally I would use the following statement to remove duplicates:
DELETE
FROM emails
WHERE email_id NOT IN (
SELECT MAX(email_id) FROM emails
GROUP BY record_id, from_add, to_add, subject, body)However, body is datatype
text, so it cannot be compared/grouped. It gives the following error message on the select query:The text, ntext, and image data types cannot be compared or sorted, except when using IS NULL or LIKE operator.What is the best way to identify duplicates with the
text datatype? Do I have to convert the column to varchar first?Solution
You don't have to
For a one off clean up task I'd probably just go
Unless dealing with huge amounts of data that might make spending time on a more efficient solution (that doesn't have to sort huge strings) worthwhile.
Or alternatively you could try
This is likely to give you a better plan than
convert the actual column. Just its usage in the SELECT query.For a one off clean up task I'd probably just go
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY record_id,
from_add,
to_add,
subject,
CAST(body AS VARCHAR(MAX))
ORDER BY email_id DESC) RN
FROM emails)
DELETE FROM cte
WHERE RN > 1Unless dealing with huge amounts of data that might make spending time on a more efficient solution (that doesn't have to sort huge strings) worthwhile.
Or alternatively you could try
;WITH T
AS (SELECT MAX(email_id) AS max_email_id
FROM emails
GROUP BY record_id,
from_add,
to_add,
subject,
CAST(body AS VARCHAR(max)))
DELETE e1
FROM emails e1
WHERE NOT EXISTS (SELECT *
FROM T
WHERE max_email_id = e1.email_id)This is likely to give you a better plan than
NOT IN as SQL Server treats MAX(not_nullable_column) as being potentially nullable and hence you end up with a potentially more expensive plan.Code Snippets
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY record_id,
from_add,
to_add,
subject,
CAST(body AS VARCHAR(MAX))
ORDER BY email_id DESC) RN
FROM emails)
DELETE FROM cte
WHERE RN > 1;WITH T
AS (SELECT MAX(email_id) AS max_email_id
FROM emails
GROUP BY record_id,
from_add,
to_add,
subject,
CAST(body AS VARCHAR(max)))
DELETE e1
FROM emails e1
WHERE NOT EXISTS (SELECT *
FROM T
WHERE max_email_id = e1.email_id)Context
StackExchange Database Administrators Q#31135, answer score: 5
Revisions (0)
No revisions yet.