HiveBrain v1.2.0
Get Started
← Back to all entries
debugsqlModerate

Workaround for Anti-Semi Join bug

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
bugsemiantijoinforworkaround

Problem

I built the following SQL Server query, but it is encountering the anti-semi join defect in SQL Server 2005 which results in inaccurate cardinality estimates (1 -- urgh!) and runs forever. Since it is a longstanding production SQL Server I can't easily suggest to upgrade versions, and as such I cannot force the traceflag 4199 hint on this specific query.

I'm having a hard time refactoring the WHERE AND NOT IN (SELECT). Can anyone care to help? I've made sure to try and use the best joins based on clustered key pairs.

SELECT TOP 5000 d.doc2_id
    ,d.direction_cd
    ,a.address_type_cd
    ,d.external_identification
    ,s.hash_value
    ,d.publishdate
    ,d.sender_address_id AS [D2 Sender_Address_id]
    ,a.address_id AS [A Address_ID]
    ,d.message_size
    ,d.subject
    ,emi.employee_id
FROM assentor.emcsdbuser.doc2 d(NOLOCK)
INNER JOIN assentor.emcsdbuser.employee_msg_index emi(NOLOCK)
    ON d.processdate = emi.processdate
    AND d.doc2_id = emi.doc2_id
INNER LOOP JOIN assentor.emcsdbuser.doc2_address a(NOLOCK)
    ON emi.doc2_id = a.doc2_id
    AND emi.address_type_cd = a.address_type_cd
    AND emi.address_id = a.address_id
INNER JOIN sis.dbo.sis s(NOLOCK) ON d.external_identification = s.external_identification
WHERE d.publishdate > '2008-01-01'
    **AND d.doc2_id NOT IN (
        SELECT doc2_id
        FROM assentor.emcsdbuser.doc2_address d2a(NOLOCK)
        WHERE d.doc2_id = d2a.doc2_id
            AND d2a.address_type_cd = 'FRM'
        )**
OPTION (FAST 10)


Note that the Employee_MSG_Index table is 500m rows, doc2 is 1.5b rows, SIS is ~500m rows.

Any help would be appreciated!

Solution

Since it is a longstanding production SQL Server I can't easily suggest to upgrade versions

The anti semi join cardinality estimation bug is reproducible on all versions of SQL Server from 2005 to 2012 inclusive. All require trace flag 4199 to enable the fix, so upgrading would not solve your problem without activating 4199 (though there are many other good reasons to upgrade from 2005, of course).

...as such I cannot force the traceflag 4199 hint on this specific query.

If it is just one particular query that is affected, you can use OPTION (QUERYTRACEON 4199) to enable the trace flag for just that query. This query hint is documented and supported for use with 4199, and applies from SQL Server 2005 Service Pack 2 onward.

This hint effectively runs DBCC TRACEON (4199) and DBCC TRACEOFF (4199) around the query, and requires sysadmin permission as a result. If that is a problem, add the hint using a plan guide.

You should also look at testing your whole system with 4199 enabled instance-wide. Plan regressions are possible, but overall you may find that the various optimizer fixes enabled by this flag are well worth it. All future plan-affecting query processor fixes require this flag to activate.
All that said...

As mentioned in ypercube's answer, the bug requires two or more join columns to manifest (among many details). The redundancy in your NOT IN clause causes the optimizer to see two column comparisons (though logically there is only one), thereby exposing the bug.

Removing this redundancy will 'solve' the problem for this particular query, though other queries that really do have more than one join predicate will still be vulnerable.
Example

To illustrate, here is an example based on the CSS blog post linked in the question (but with a complete script!):

CREATE TABLE dbo.tst_TAB1
(
    c1 integer NOT NULL, 
    c2 integer NOT NULL, 
    c3 integer NOT NULL
);

CREATE TABLE dbo.tst_TAB2
(
    c1 integer NOT NULL, 
    c2 integer NOT NULL, 
    c3 integer NOT NULL
);

CREATE INDEX i ON dbo.tst_TAB1 (c1, c2);
CREATE INDEX i ON dbo.tst_TAB2 (c1, c2);


Sample data:

INSERT dbo.tst_TAB1
    (c1, c2, c3)
SELECT 
    number, number, number
FROM master.dbo.spt_values
WHERE 
    [type] = N'P' 
    AND number BETWEEN 1 AND 2047;

INSERT dbo.tst_TAB2 (c1, c2, c3)
VALUES (1, 1, 1);


Test query using NOT IN with redundant predicate:

SELECT
    T1.c1
FROM tst_TAB1 AS t1 
WHERE
    t1.c1 NOT IN 
    (
        SELECT 
            t2.c1 
        FROM tst_TAB2 AS t2
        -- This is redundant!
        WHERE
            t2.c1 = t1.c1
    );


The estimated execution plan shows an estimate of 1 row after the anti semi join:

Side note: In fact this is an example of another (rare) bug. Writing the WHERE clause as t1.c1 = t2.c1 instead of t2.c1 = t1.c1 allows the optimizer to see that the two join predicates are in fact the same, and the bug does not manifest.

The same query with OPTION (QUERYTRACEON 4199):

SELECT
    T1.c1
FROM tst_TAB1 AS t1 
WHERE
    t1.c1 NOT IN 
    (
        SELECT 
            t2.c1 
        FROM tst_TAB2 AS t2
        WHERE
            t2.c1 = t1.c1
    )
OPTION (QUERYTRACEON 4199);


The estimated execution plan now shows an estimate of 2046 rows, which is exactly right:

We can also remove the redundant predicate:

SELECT
    T1.c1
FROM tst_TAB1 AS t1 
WHERE
    t1.c1 NOT IN 
    (
        SELECT 
            t2.c1 
        FROM tst_TAB2 AS t2
    );


The execution plan happens to use an additional unrelated optimization (the Stream Aggregate), but the important point is that the post-join estimate is correct without having to enable 4199:

Multiple anti semi join columns

It is possible to express an anti semi join over multiple columns using NOT IN syntax. These cases will require 4199. For example, the next query joins on c1 and c2:

SELECT
    T1.c1
FROM tst_TAB1 AS t1 
WHERE
    t1.c1 NOT IN 
    (
        SELECT 
            t2.c1 
        FROM tst_TAB2 AS t2
        WHERE
            t2.c2 = t1.c2
    );


The execution plan shows the erroneous 1-row estimate:

With 4199, the issue is resolved:

SELECT
    T1.c1
FROM tst_TAB1 AS t1 
WHERE
    t1.c1 NOT IN 
    (
        SELECT 
            t2.c1 
        FROM tst_TAB2 AS t2
        WHERE
            t2.c2 = t1.c2
    )
OPTION (QUERYTRACEON 4199);


Other syntaxes

Using NOT IN in this way is best avoided, not least for the reasons mentioned in Books Online:

That issue with NOT IN and NULLs has been written about many times. There are many alternative syntaxes available, of which NOT EXISTS is my personal preference. Note that changing syntax will not avoid the cardinality estimation bug:

SELECT
    T1.c1
FROM dbo.tst_TAB1 AS t1 
WHERE 
    NOT EXISTS 
    ( 
        SELECT 1 
        FROM dbo.tst_TAB2 AS t2 
        WHERE 
            t2.c1 = t1.c1 
            AND t2.c2 = t1.c2
    );


That two column anti semi join produces the 1-r

Code Snippets

CREATE TABLE dbo.tst_TAB1
(
    c1 integer NOT NULL, 
    c2 integer NOT NULL, 
    c3 integer NOT NULL
);

CREATE TABLE dbo.tst_TAB2
(
    c1 integer NOT NULL, 
    c2 integer NOT NULL, 
    c3 integer NOT NULL
);

CREATE INDEX i ON dbo.tst_TAB1 (c1, c2);
CREATE INDEX i ON dbo.tst_TAB2 (c1, c2);
INSERT dbo.tst_TAB1
    (c1, c2, c3)
SELECT 
    number, number, number
FROM master.dbo.spt_values
WHERE 
    [type] = N'P' 
    AND number BETWEEN 1 AND 2047;

INSERT dbo.tst_TAB2 (c1, c2, c3)
VALUES (1, 1, 1);
SELECT
    T1.c1
FROM tst_TAB1 AS t1 
WHERE
    t1.c1 NOT IN 
    (
        SELECT 
            t2.c1 
        FROM tst_TAB2 AS t2
        -- This is redundant!
        WHERE
            t2.c1 = t1.c1
    );
SELECT
    T1.c1
FROM tst_TAB1 AS t1 
WHERE
    t1.c1 NOT IN 
    (
        SELECT 
            t2.c1 
        FROM tst_TAB2 AS t2
        WHERE
            t2.c1 = t1.c1
    )
OPTION (QUERYTRACEON 4199);
SELECT
    T1.c1
FROM tst_TAB1 AS t1 
WHERE
    t1.c1 NOT IN 
    (
        SELECT 
            t2.c1 
        FROM tst_TAB2 AS t2
    );

Context

StackExchange Database Administrators Q#47127, answer score: 14

Revisions (0)

No revisions yet.