patternsqlMinor
what is the quickest way to delete duplicate rows?
Viewed 0 times
quickestrowsthewhatdeleteduplicateway
Problem
I need to delete duplicate rows from a large table.
what is the best way to achieve that?
currently I use this algorithm:
when I run it on my system:
I found that <> 1 is better than > 1.
I could create this index, currently not present:
Is there any other way to get this done?
On this occasion this table is not big - about 500,000 records on the live system.
the delete is part of a SSIS package, it runs daily and deletes about 10-15 records a day.
there are problems in the way the data is structured, I just need one AccountCode for each customer but there could be duplicates and if they are not removed, they break the package on a later stage.
It was not me who developed the package, and my scope is not to re-design anything.
I am just after the best
what is the best way to achieve that?
currently I use this algorithm:
declare @t table ([key] int )
insert into @t select 1
insert into @t select 1
insert into @t select 1
insert into @t select 2
insert into @t select 2
insert into @t select 3
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 4
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 5
insert into @t select 6
insert into @t select 6
insert into @t select 6
insert into @t select 7
insert into @t select 7
insert into @t select 8
insert into @t select 8
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9
insert into @t select 9
select * from @t
; with cte as (
select *
, row_number() over (partition by [Key] order by [Key]) as Picker
from @t
)
delete cte
where Picker > 1
select * from @twhen I run it on my system:
;WITH Customer AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY AccountCode ) AS [Version]
FROM Stage.Customer
)
DELETE
FROM Customer
WHERE [Version] <> 1I found that <> 1 is better than > 1.
I could create this index, currently not present:
USE [BodenDWH]
GO
CREATE NONCLUSTERED INDEX []
ON [Stage].[Customer] ([AccountCode])
INCLUDE ([ID])
GOIs there any other way to get this done?
On this occasion this table is not big - about 500,000 records on the live system.
the delete is part of a SSIS package, it runs daily and deletes about 10-15 records a day.
there are problems in the way the data is structured, I just need one AccountCode for each customer but there could be duplicates and if they are not removed, they break the package on a later stage.
It was not me who developed the package, and my scope is not to re-design anything.
I am just after the best
Solution
If the table is small and the number of rows you are deleting is small, then use
Note: In above query you are using an arbitrary ordering in the window order clause
If the table is large (e.g. 5M records) then deleting in small number of rows or chunks will help not bloat transaction log and will prevent lock escalation.
A lock escalation will occur if and only if a Transact-SQL statement has acquired at least 5000 locks on a single reference of a table.
;WITH Customer AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
FROM dbo.Customer
)
DELETE
FROM Customer
WHERE [Version] > 1;Note: In above query you are using an arbitrary ordering in the window order clause
ORDER BY (select null) (learned it from Itzik Ben-Gan's T-SQL Querying book and @AaronBertrand cited that above as well).If the table is large (e.g. 5M records) then deleting in small number of rows or chunks will help not bloat transaction log and will prevent lock escalation.
A lock escalation will occur if and only if a Transact-SQL statement has acquired at least 5000 locks on a single reference of a table.
while 1=1
begin
WITH Customer AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
FROM dbo.Customer
)
DELETE top(4000) -- choose a lower batch size than 5000 to prevent lock escalation
FROM Customer
WHERE [Version] > 1
if @@ROWCOUNT < 4000
BREAK ;
endCode Snippets
;WITH Customer AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
FROM dbo.Customer
)
DELETE
FROM Customer
WHERE [Version] > 1;while 1=1
begin
WITH Customer AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY AccountCode ORDER BY (select null) ) AS [Version]
FROM dbo.Customer
)
DELETE top(4000) -- choose a lower batch size than 5000 to prevent lock escalation
FROM Customer
WHERE [Version] > 1
if @@ROWCOUNT < 4000
BREAK ;
endContext
StackExchange Database Administrators Q#115627, answer score: 5
Revisions (0)
No revisions yet.