HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Odd execution plan behaviour from variations in WHERE...IN(List)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
oddvariationswhereplanbehaviourlistfromexecution

Problem

(reposted from stack-overflow by request)

I've been looking at trying to optimise (or at least change) some EF code in C# to use stored procedures, and found what seems to be an anomaly (or something new to me) when finding rows matching a constant list.
The typical manually-generated short query I'd use would be something like...

SELECT Something FROM Table WHERE ID IN (one, two, others);


We had an EF query that we were replacing with a stored procedure call, so I looked at the output, saw that it was complex and thought my simpler query (similar to the above) would be better. It wasn't. Here is a quick demo that reproduces this, trying a couple of intermediate variations to see where the change-over in plan behaviour occurs.

Can anyone explain why the execution plans for the final version - with the

...WHERE EXISTS(... (SELECT 1 AS X) AS Alias UNION ALL...) AS Alias...)


construct is better - seemingly because it omits the costly SORT operation, even though the plan includes TWO index scans (or more with more list members) rather than the one of the simpler query.

Here's a self-contained example script (I hope)...

```
USE SandBox; -- a dummy database, not a live one!
-- create our dummy table, dropping first if it exists
IF EXISTS (SELECT NULL FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'Test')
DROP TABLE Test;
CREATE TABLE Test (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, FormId INT NOT NULL, DateRead DATE NULL);
-- populate with some data
INSERT INTO Test VALUES (1, NULL), (1, GETDATE()), (1, NULL), (4, NULL), (5, NULL), (6, GETDATE());

-- Simple query that I might typically use
-- how many un-read entries are there for a set of 'forms' of interest, 1, 5 and 6
-- (we're happy to omit forms with none)
SELECT T.FormId, COUNT(*) AS TheCount
FROM Test AS T
WHERE T.FormId IN (1, 5, 6)
AND T.DateRead IS NULL
GROUP BY T.FormId;

-- This is the first step towards the EF-generated code
-- using an EXISTS gives basically the s

Solution

For the query that you have I don't see a measurable performance advantage or disadvantage to adding the extraneous derived tables. I'm testing against SQL Server 2017 but I can easily reproduce the behavior that you're seeing. I'm going to use a slightly simpler reproduction:

CREATE TABLE Test (FormId INT NOT NULL);

INSERT INTO Test
SELECT TOP (2580) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

-- query 1
SELECT  T.FormId, COUNT(*) AS TheCount
 FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1
                                UNION ALL
                                SELECT 2
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
GROUP BY T.FormId;

-- query 2
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1 FROM (SELECT 1 x) AS X1 -- added
                                UNION ALL
                                SELECT 2
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
GROUP BY T.FormId;


With 2580 rows in the table I get different plans as you did. However, it takes just five logical reads to read all data from the table. For both queries I wasn't able to get SET STATISTICS IO, TIME ON to output anything other than:


CPU time = 0 ms, elapsed time = 0 ms.

The query plans may be different, but the amount of data in the table is so tiny that it doesn't matter.

Now I'm going to truncate the table and insert 10 more rows than before.

TRUNCATE TABLE Test;

INSERT INTO Test
SELECT TOP (2590) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2

OPTION (MAXDOP 1);


With that minor change I get effectively the same plans for both queries. Both queries do one scan of the data and one sort:

As far as I can tell the plan that does multiple scans is not a legal plan option for the first query. I was unable to force it with the USE XML hint, but I could have made a mistake. The second query is able to choose between both plans on a cost basis. The query rule responsible for the multiple scan approach appears to be JNonUNIA. You can see this in action with the undocumented query hint QUERYRULEOFF JNonUNIA. The Google does not give me any results when searching for this optimizer rule, but based on the description it's an abbreviation for "Join Union All -> Union All Join". In other words, take a join to a UNION ALL and switch it to a UNION ALL of joins. That's exactly what we see in the query plan with multiple scans.

I understand that you're using an EF so you don't have complete control over the code. But just because you see a certain plan today doesn't mean that you'll get that plan in all future versions of SQL Server, especially for something as obscure as what you're observing. If for some reason you need to pick one plan over the other for a query I recommend being as direct as possible. For example, in the current version of SQL Server the following query appears to always result in a single scan:

SELECT
  FormId
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId IN (1, 2)
GROUP BY FormId;


And the following query seems to always result in two scans:

SELECT
  1 AS FormId
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId = 1

UNION ALL

SELECT
  2 AS FormId
  , COUNT(*) AS TheCount
FROM Test T
WHERE FormId = 2
OPTION (MAXDOP 1);


Of course, that behavior isn't guaranteed either. It still feels more reliable, if that makes sense.

It is possible to think of modified queries in which using the UNION ALL approach will give a performance boost. Suppose that you have a million rows in the table and you want to put any rows with a FormId between 1 and 600000 into bucket 1 and any rows with a FormId between 600001 and 1000000 into bucket 2.

The following query takes around 200 ms to execute on my machine:

SELECT
  CASE WHEN FormId = 1 AND FormId <= 1000000
GROUP BY CASE WHEN FormId <= 600000 THEN 1 ELSE 2 END
OPTION (MAXDOP 1);


Here's the plan:

However, this query executes in around 100 ms:

SELECT
  1 AS FormBucket
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId >= 1 AND FormId = 600001 AND FormId <= 1000000
OPTION (MAXDOP 1);


Here's the plan:

Code Snippets

CREATE TABLE Test (FormId INT NOT NULL);

INSERT INTO Test
SELECT TOP (2580) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

-- query 1
SELECT  T.FormId, COUNT(*) AS TheCount
 FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1
                                UNION ALL
                                SELECT 2
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
GROUP BY T.FormId;

-- query 2
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1 FROM (SELECT 1 x) AS X1 -- added
                                UNION ALL
                                SELECT 2
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
GROUP BY T.FormId;
TRUNCATE TABLE Test;

INSERT INTO Test
SELECT TOP (2590) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2

OPTION (MAXDOP 1);
SELECT
  FormId
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId IN (1, 2)
GROUP BY FormId;
SELECT
  1 AS FormId
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId = 1

UNION ALL

SELECT
  2 AS FormId
  , COUNT(*) AS TheCount
FROM Test T
WHERE FormId = 2
OPTION (MAXDOP 1);
SELECT
  CASE WHEN FormId <= 600000 THEN 1 ELSE 2 END FormBucket
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId >= 1 AND FormId <= 1000000
GROUP BY CASE WHEN FormId <= 600000 THEN 1 ELSE 2 END
OPTION (MAXDOP 1);

Context

StackExchange Database Administrators Q#190521, answer score: 2

Revisions (0)

No revisions yet.