HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Is it possible to get seek based parallel plan for distinct/group by?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
distinctseekgroupplanpossiblegetparallelforbased

Problem

An example from this question shows that SQL Server will choose a full index scan to solve a query like this:

select distinct [typeName] from [types]


Where [typeName] has a non-clustered, non-unique ascending index on it. His example has 200M rows but only 76 unique values. It seems like a seek plan would be a better choice with that density (~76 multiple binary searchs)?

His case could be normalized but the reason for the question is that I really want to solve something like this:

select TransactionId, max(CreatedUtc) 
from TxLog 
group by TransactionId


There is an index on (TransactionId, MaxCreatedUtc).

Re-writing using a normalized source (dt) does not not change the plan.

select dt.TransactionId, MaxCreatedUtc
 from [Transaction] dt -- distinct transactions
 cross apply
   (
        select Max(CreatedUtc) as MaxCreatedUtc 
          from TxLog tl
         where tl.TransactionId = dt.TransactionId         
   ) ca


Running just the CA subquery as a Scalar UDF does show a plan of 1 seek.

select max(CreatedUtc) as MaxCreatedUtc
 from Pub.TransactionLog 
 where TransactionID = @TxId;


Using that Scalar UDF in the original query seems to work but loses parallelism (known issue with UDFs):

select t.typeName, 
       Pub.ufn_TransactionMaxCreatedUtc(t.TransactionId) as MaxCreatedUtc
  from Pub.[Transaction] t


Rewriting using an Inline TVF reverts it back to the scan based plan.

From answer/comment @ypercube:

select TransactionId, MaxCreatedUtc        
 from Pub.[Transaction]  t
  cross apply
   (
        select top (1) CreatedUtc as MaxCreatedUtc 
        from Pub.TransactionLog l
        where l.TransactionID = t.TransactionId
        order by CreatedUtc desc                     
   ) ca


Plan looks good. No parallelism but pointless since so fast. Will have to try this on a larger problem sometime. Thanks.

Solution

I have exactly the same set up and I've been through the same stages of rewriting the query.

In my case the table names and meaning is a bit different, but overall structure is the same. Your table Transactions corresponds to my table PortalElevators below. It has ~2000 rows. Your table TxLog corresponds to my table PlaybackStats. It has ~150M rows. It has index on (ElevatorID, DataSourceRowID), same as you.

I'll run several variants of the query on the real data and compare execution plans, IO and time stats. I'm using SQL Server 2008 Standard.

GROUP BY with MAX

SELECT [ElevatorID], MAX([DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
GROUP BY [ElevatorID]


Same as for you optimizer scans the index and aggregates results. Slow.

Individual row

Let's see what optimizer would do if I requested MAX just for one row:

SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = 1


Optimizer is smart enough to use the index and it does one seek. By the way, we can see that optimizer uses TOP operator, even though the query doesn't have it. This is a telling sign that optimization paths of MAX and TOP have something in common in the engine, but they are different as we'll see below.

CROSS APPLY with MAX

SELECT
    [dbo].[PortalElevators].elevatorsId
    ,LastItemID
FROM
    [dbo].[PortalElevators]
    CROSS APPLY
    (
        SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
        FROM [dbo].[PlaybackStats]
        WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
    ) AS CA
;


Optimizer still scans the index. It is not smart enough to convert MAX into TOP and scan into seeks here. Slow. I didn't think of this variant originally, my next try was scalar UDF.

Scalar UDF

I saw that plan for getting MAX for individual row had index seek, so I put that simple query in a scalar UDF.

CREATE FUNCTION [dbo].[GetElevatorLastID]
(
    @ParamElevatorID int
)
RETURNS bigint
AS
BEGIN
    DECLARE @Result bigint;
    SELECT @Result = MAX([dbo].[PlaybackStats].[DataSourceRowID])
    FROM [dbo].[PlaybackStats]
    WHERE [dbo].[PlaybackStats].ElevatorID = @ParamElevatorID;
    RETURN @Result;
END

SELECT
    [dbo].[PortalElevators].elevatorsId
    ,[dbo].[GetElevatorLastID]([dbo].[PortalElevators].elevatorsId) AS LastItemID
FROM
    [dbo].[PortalElevators]
;


It does run fast. At least, much faster than Group by. Unfortunately, execution plan doesn't show details of UDF and what is even worse, it doesn't show the real IO stats (it doesn't include IO generated by UDF). You need to run Profiler to see all calls of the function and their stats. This plan shows only 6 reads. The plan for individual row has 4 reads, so real number would be close to: 6 + 2779 * 4 = 6 + 11,116 = 11,122.

CROSS APPLY with TOP

Eventually, I discovered the CROSS APPLY and how it can be applied ;-) in this case.

SELECT
    [dbo].[PortalElevators].elevatorsId
    ,LastItemID
FROM
    [dbo].[PortalElevators]
    CROSS APPLY
    (
        SELECT TOP(1) [dbo].[PlaybackStats].[DataSourceRowID] AS LastItemID
        FROM [dbo].[PlaybackStats]
        WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
        ORDER BY [dbo].[PlaybackStats].[DataSourceRowID] DESC
    ) AS CA
;


Here optimizer is smart enough to do ~2000 seeks. You can see that number of reads is much lower than for group by. Fast.

Interestingly, number of reads here (11,850) is a bit more than reads that I estimated with UDF (11,122). Table IO stats with CROSS APPLY have 11,844 reads and 2,779 scan count of the large table, which gives 11,844 / 2,779 ~= 4.26 reads per index seek. Most likely, seeks for some values use 4 reads and for some 5, with average 4.26. There are 2,779 seeks, but there are values only for 2,130 rows. As I said, it is difficult to get real number of reads with UDF without profiler.

Recursive CTE

As was pointed in comments, Paul White described a Recursive Index Skip Scan method to find distinct values in a large table without performing a full index scan, but doing index seeks recursively. To start recursion we need to find the MIN or MAX value for an anchor and then each step of recursion adds next value one by one. The post explains it in details.

```
WITH RecursiveCTE
AS
(
-- Anchor
SELECT TOP (1) [ElevatorID], [DataSourceRowID]
FROM [dbo].[PlaybackStats]
ORDER BY [ElevatorID] DESC, [DataSourceRowID] DESC

UNION ALL

-- Recursive
SELECT R.[ElevatorID], R.[DataSourceRowID]
FROM
(
-- Number the rows
SELECT
T.[ElevatorID], T.[DataSourceRowID]
,ROW_NUMBER() OVER (ORDER BY T.[ElevatorID] DESC, T.[DataSourceRowID] DESC) AS rn
FROM
[dbo].[PlaybackStats] AS T
INNER JOIN RecursiveCTE AS R ON R.[ElevatorID] > T.[ElevatorID]

Code Snippets

SELECT [ElevatorID], MAX([DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
GROUP BY [ElevatorID]
SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = 1
SELECT
    [dbo].[PortalElevators].elevatorsId
    ,LastItemID
FROM
    [dbo].[PortalElevators]
    CROSS APPLY
    (
        SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
        FROM [dbo].[PlaybackStats]
        WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
    ) AS CA
;
CREATE FUNCTION [dbo].[GetElevatorLastID]
(
    @ParamElevatorID int
)
RETURNS bigint
AS
BEGIN
    DECLARE @Result bigint;
    SELECT @Result = MAX([dbo].[PlaybackStats].[DataSourceRowID])
    FROM [dbo].[PlaybackStats]
    WHERE [dbo].[PlaybackStats].ElevatorID = @ParamElevatorID;
    RETURN @Result;
END

SELECT
    [dbo].[PortalElevators].elevatorsId
    ,[dbo].[GetElevatorLastID]([dbo].[PortalElevators].elevatorsId) AS LastItemID
FROM
    [dbo].[PortalElevators]
;
SELECT
    [dbo].[PortalElevators].elevatorsId
    ,LastItemID
FROM
    [dbo].[PortalElevators]
    CROSS APPLY
    (
        SELECT TOP(1) [dbo].[PlaybackStats].[DataSourceRowID] AS LastItemID
        FROM [dbo].[PlaybackStats]
        WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
        ORDER BY [dbo].[PlaybackStats].[DataSourceRowID] DESC
    ) AS CA
;

Context

StackExchange Database Administrators Q#96364, answer score: 11

Revisions (0)

No revisions yet.