HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Index with IGNORE_DUP_KEY = ON filters out non-duplicate records

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
nonwithduplicaterecordsignore_dup_keyfiltersindexout

Problem

Below is a script which can be used to re-produce the issue I am facing.
Basically the question is this: why is it that the record with (ID1 = 6, ID2 = 7) is not populated into the final #tt2 table? I can INSERT it just fine manually, so it doesn't violate the UNIQUE constraint, but it doesn't get populated as part of INSERT/SELECT.

USE [tempdb]
GO 

IF OBJECT_ID('tempdb..#t1') IS NOT NULL DROP TABLE #t1
CREATE TABLE [#t1] ([ID] INT , [V] VARCHAR(10))

INSERT INTO #t1 (ID, V) 
VALUES
(1,'A'),
(2,'B'),
(3,'B'),
(4,'C'),
(5,'E'),
(6,'E')

IF OBJECT_ID('tempdb..#t2') IS NOT NULL DROP TABLE #t2
CREATE TABLE [#t2] ([ID] INT , [V] VARCHAR(10))

INSERT INTO #t2 (ID, V) 
VALUES
(1,'A'),
(2,'B'),
(3,'C'),
(4,'C'),
(5,'D'),
(6,'E'),
(7,'E')

IF OBJECT_ID('tempdb..#tt') IS NOT NULL DROP TABLE #tt

SELECT t1.ID AS ID1, t2.ID AS ID2, t1.V AS V
INTO #tt
FROM #t1 t1
JOIN #t2 t2 ON t1.V=t2.V

--SELECT * FROM #tt

IF OBJECT_ID('tempdb..#tt2') IS NOT NULL DROP TABLE #tt2
CREATE TABLE #tt2 (ID1 INT, ID2 INT, V VARCHAR(10))

CREATE UNIQUE INDEX IDX_TT_ID1 ON #tt2 (ID1) WITH (IGNORE_DUP_KEY = ON)
CREATE UNIQUE INDEX IDX_TT_ID2 ON #tt2 (ID2) WITH (IGNORE_DUP_KEY = ON)

-- Query 1 - this SELECT returns 9 records. 
SELECT ID1, ID2, V
FROM #tt 
ORDER BY ID1, ID2

-- Query 2 - this INSERT populates 4 records, but I would expect 5
INSERT INTO #tt2 (ID1, ID2, V)
SELECT ID1, ID2, V
FROM #tt 
ORDER BY ID1, ID2

-- Why are there only 4 records in this table? (I would expect 5, but the record with ID1 = 6, ID2 = 7  is missing)
SELECT * FROM #tt2

-- I can INSERT the missing record manually:
INSERT INTO #tt2 (ID1, ID2, V)
VALUES (6, 7, 'E')

Solution

I think the confusion arises because the indexes are defined as single-column on ID1 and ID2 separately. This means the "ignore duplicates" check is performed on each individually and only if a row passes both checks will it be inserted into the final #tt2 table. I'll step through and show the working.

One way for the system to check for duplicates is to sort the rows then process them in sequence. This is what happens on my local (2017) instance. So let's look at the contents of #tt, ordered:

Row ID1 ID2 V
i   1   1   A
ii  2   2   B
iii 3   2   B
iv  4   3   C
v   4   4   C
vi  5   6   E
vii 5   7   E
iix 6   6   E
ix  6   7   E


Starting with row i the system asks "have I seen ID1=1 already or have I seen ID2 = 1 already?" Since both are false row i is passed through to the output. Similarly with row ii. For row iii, however, ID2=2 which matches row ii that has been seen already. So row iii is rejected. Row iv is accepted. Row v is rejected since it has the same ID1 value as row iv. Row vi is accepted. Row vii is rejected - ID1=5 which matches row vi. Row iix is rejected since ID2=6 matches row vi. Finally row ix is rejected in both ID1 (matches iix) and ID2 (matches vii).

The output is i, ii, iv, vi and these are inserted into #tt2.

Row ID1 ID2 V
i   1   1   A
ii  2   2   B
iv  4   3   C
vi  5   6   E


It is possible to insert {6,7,E} afterwards because it does not conflict with any row that made it through the filtering and was actually inserted into #tt2. All the rows with ID1=6 were eliminated due to their ID2 values and those with ID2=7 due to their ID1 values.

The above is an explanation of the logical algorithm which produces the observed output. The physical implementation of this algorithm, as observed in the actual execution plan, is different but equivalent. Here's the plan:

fig 1 - actual execution plan for INSERT

The physical plan sorts on ID1 (A in the image) and takes the first row for each value (B). The output of this is then sorted on ID2 (C) and, again, the first row for each value is retained (D). Whichever rows pass both filters are inserted to #tt2.

The tricky part here is the word "first". Relational DB tables have no inherent order. The query optimizer (QO) is free to process rows in whichever order it chooses, as long as its results are logically equivalent to the SQL submitted. This got me wondering if I could produce a different result from the same data without changing the queries. I can.

Sorts are expensive so the QO often chooses other options if it has them. One such option is to read an index. I found that with an index on #tt.ID1 the QO would use it and skip the sort. SQL Server's sorts are order-preserving so if I can present ID2 in an alternate ordering a different "first" row will be chosen for each ID1 value and the overall output will be different. I created a new target table #tt3 identical to #tt2 and ran this (the ORDER BY does not affect this INSERT)

CREATE CLUSTERED INDEX CIX ON #tt(ID1 ASC, ID2 DESC);

INSERT INTO #tt3 (ID1, ID2, V)
SELECT ID1, ID2, V
FROM #tt;


Produced this

Row ID1 ID2 V
i   1   1   A
ii  2   2   B
v   4   4   C
vii 5   7   E


Which is different to the previous result i, ii, iv, vi.

You'll notice that if you order the contents of #tt by ID1 asc, ID1 desc ..

Row ID1 ID2 V
i   1   1   A
ii  2   2   B
iii 3   2   B
v   4   4   C
iv  4   3   C
vii 5   7   E
vi  5   6   E
ix  6   7   E
iix 6   6   E


.. and apply the logical "have I seen" algorithm this alternate result is obtained.

The question remains, why does the QO process ID1 first then ID2? I believe this is driven by the sequence in which the constraints are defined. I would have expected that defining #tt clustered index as (ID2 asc, ID1 desc) would cause ID2 to be processed first, as that would avoid a sort. It is not. It still processed ID1 first with the sort operator brought back. Indeed, creating another constrint in column V adds it to the plan accordingly. Seems like processing in sys.indexes.index_id order may be hard-coded in the QO?

All up, I think caution is advised. Changes in the physical layer can affect the oucome of this query. If the target table's constrints are dropped and re-created we better hope they come back in the same sequence they were previously. If the source table is optimised, say with indexes, it may affect the access path and which row comes "first". This is without considering allocation-order scans or merry-go-round scans.

If more rows were involved it is likely QO would swap to using hash joins. Heavens only knows how that would affect things.

Code Snippets

Row ID1 ID2 V
i   1   1   A
ii  2   2   B
iii 3   2   B
iv  4   3   C
v   4   4   C
vi  5   6   E
vii 5   7   E
iix 6   6   E
ix  6   7   E
Row ID1 ID2 V
i   1   1   A
ii  2   2   B
iv  4   3   C
vi  5   6   E
CREATE CLUSTERED INDEX CIX ON #tt(ID1 ASC, ID2 DESC);

INSERT INTO #tt3 (ID1, ID2, V)
SELECT ID1, ID2, V
FROM #tt;
Row ID1 ID2 V
i   1   1   A
ii  2   2   B
v   4   4   C
vii 5   7   E
Row ID1 ID2 V
i   1   1   A
ii  2   2   B
iii 3   2   B
v   4   4   C
iv  4   3   C
vii 5   7   E
vi  5   6   E
ix  6   7   E
iix 6   6   E

Context

StackExchange Database Administrators Q#301536, answer score: 7

Revisions (0)

No revisions yet.