patternsqlModerate
Is it possible to get seek based parallel plan for distinct/group by?
Viewed 0 times
distinctseekgroupplanpossiblegetparallelforbased
Problem
An example from this question shows that SQL Server will choose a full index scan to solve a query like this:
Where [typeName] has a non-clustered, non-unique ascending index on it. His example has 200M rows but only 76 unique values. It seems like a seek plan would be a better choice with that density (~76 multiple binary searchs)?
His case could be normalized but the reason for the question is that I really want to solve something like this:
There is an index on
Re-writing using a normalized source (dt) does not not change the plan.
Running just the CA subquery as a Scalar UDF does show a plan of 1 seek.
Using that Scalar UDF in the original query seems to work but loses parallelism (known issue with UDFs):
Rewriting using an Inline TVF reverts it back to the scan based plan.
From answer/comment @ypercube:
Plan looks good. No parallelism but pointless since so fast. Will have to try this on a larger problem sometime. Thanks.
select distinct [typeName] from [types]Where [typeName] has a non-clustered, non-unique ascending index on it. His example has 200M rows but only 76 unique values. It seems like a seek plan would be a better choice with that density (~76 multiple binary searchs)?
His case could be normalized but the reason for the question is that I really want to solve something like this:
select TransactionId, max(CreatedUtc)
from TxLog
group by TransactionIdThere is an index on
(TransactionId, MaxCreatedUtc).Re-writing using a normalized source (dt) does not not change the plan.
select dt.TransactionId, MaxCreatedUtc
from [Transaction] dt -- distinct transactions
cross apply
(
select Max(CreatedUtc) as MaxCreatedUtc
from TxLog tl
where tl.TransactionId = dt.TransactionId
) caRunning just the CA subquery as a Scalar UDF does show a plan of 1 seek.
select max(CreatedUtc) as MaxCreatedUtc
from Pub.TransactionLog
where TransactionID = @TxId;Using that Scalar UDF in the original query seems to work but loses parallelism (known issue with UDFs):
select t.typeName,
Pub.ufn_TransactionMaxCreatedUtc(t.TransactionId) as MaxCreatedUtc
from Pub.[Transaction] tRewriting using an Inline TVF reverts it back to the scan based plan.
From answer/comment @ypercube:
select TransactionId, MaxCreatedUtc
from Pub.[Transaction] t
cross apply
(
select top (1) CreatedUtc as MaxCreatedUtc
from Pub.TransactionLog l
where l.TransactionID = t.TransactionId
order by CreatedUtc desc
) caPlan looks good. No parallelism but pointless since so fast. Will have to try this on a larger problem sometime. Thanks.
Solution
I have exactly the same set up and I've been through the same stages of rewriting the query.
In my case the table names and meaning is a bit different, but overall structure is the same. Your table
I'll run several variants of the query on the real data and compare execution plans, IO and time stats. I'm using SQL Server 2008 Standard.
GROUP BY with MAX
Same as for you optimizer scans the index and aggregates results. Slow.
Individual row
Let's see what optimizer would do if I requested
Optimizer is smart enough to use the index and it does one seek. By the way, we can see that optimizer uses
CROSS APPLY with MAX
Optimizer still scans the index. It is not smart enough to convert
Scalar UDF
I saw that plan for getting
It does run fast. At least, much faster than
CROSS APPLY with TOP
Eventually, I discovered the
Here optimizer is smart enough to do ~2000 seeks. You can see that number of reads is much lower than for
Interestingly, number of reads here (11,850) is a bit more than reads that I estimated with UDF (11,122). Table IO stats with
Recursive CTE
As was pointed in comments, Paul White described a Recursive Index Skip Scan method to find distinct values in a large table without performing a full index scan, but doing index seeks recursively. To start recursion we need to find the
```
WITH RecursiveCTE
AS
(
-- Anchor
SELECT TOP (1) [ElevatorID], [DataSourceRowID]
FROM [dbo].[PlaybackStats]
ORDER BY [ElevatorID] DESC, [DataSourceRowID] DESC
UNION ALL
-- Recursive
SELECT R.[ElevatorID], R.[DataSourceRowID]
FROM
(
-- Number the rows
SELECT
T.[ElevatorID], T.[DataSourceRowID]
,ROW_NUMBER() OVER (ORDER BY T.[ElevatorID] DESC, T.[DataSourceRowID] DESC) AS rn
FROM
[dbo].[PlaybackStats] AS T
INNER JOIN RecursiveCTE AS R ON R.[ElevatorID] > T.[ElevatorID]
In my case the table names and meaning is a bit different, but overall structure is the same. Your table
Transactions corresponds to my table PortalElevators below. It has ~2000 rows. Your table TxLog corresponds to my table PlaybackStats. It has ~150M rows. It has index on (ElevatorID, DataSourceRowID), same as you.I'll run several variants of the query on the real data and compare execution plans, IO and time stats. I'm using SQL Server 2008 Standard.
GROUP BY with MAX
SELECT [ElevatorID], MAX([DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
GROUP BY [ElevatorID]Same as for you optimizer scans the index and aggregates results. Slow.
Individual row
Let's see what optimizer would do if I requested
MAX just for one row:SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = 1Optimizer is smart enough to use the index and it does one seek. By the way, we can see that optimizer uses
TOP operator, even though the query doesn't have it. This is a telling sign that optimization paths of MAX and TOP have something in common in the engine, but they are different as we'll see below.CROSS APPLY with MAX
SELECT
[dbo].[PortalElevators].elevatorsId
,LastItemID
FROM
[dbo].[PortalElevators]
CROSS APPLY
(
SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
) AS CA
;Optimizer still scans the index. It is not smart enough to convert
MAX into TOP and scan into seeks here. Slow. I didn't think of this variant originally, my next try was scalar UDF.Scalar UDF
I saw that plan for getting
MAX for individual row had index seek, so I put that simple query in a scalar UDF. CREATE FUNCTION [dbo].[GetElevatorLastID]
(
@ParamElevatorID int
)
RETURNS bigint
AS
BEGIN
DECLARE @Result bigint;
SELECT @Result = MAX([dbo].[PlaybackStats].[DataSourceRowID])
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = @ParamElevatorID;
RETURN @Result;
END
SELECT
[dbo].[PortalElevators].elevatorsId
,[dbo].[GetElevatorLastID]([dbo].[PortalElevators].elevatorsId) AS LastItemID
FROM
[dbo].[PortalElevators]
;It does run fast. At least, much faster than
Group by. Unfortunately, execution plan doesn't show details of UDF and what is even worse, it doesn't show the real IO stats (it doesn't include IO generated by UDF). You need to run Profiler to see all calls of the function and their stats. This plan shows only 6 reads. The plan for individual row has 4 reads, so real number would be close to: 6 + 2779 * 4 = 6 + 11,116 = 11,122.CROSS APPLY with TOP
Eventually, I discovered the
CROSS APPLY and how it can be applied ;-) in this case.SELECT
[dbo].[PortalElevators].elevatorsId
,LastItemID
FROM
[dbo].[PortalElevators]
CROSS APPLY
(
SELECT TOP(1) [dbo].[PlaybackStats].[DataSourceRowID] AS LastItemID
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
ORDER BY [dbo].[PlaybackStats].[DataSourceRowID] DESC
) AS CA
;Here optimizer is smart enough to do ~2000 seeks. You can see that number of reads is much lower than for
group by. Fast.Interestingly, number of reads here (11,850) is a bit more than reads that I estimated with UDF (11,122). Table IO stats with
CROSS APPLY have 11,844 reads and 2,779 scan count of the large table, which gives 11,844 / 2,779 ~= 4.26 reads per index seek. Most likely, seeks for some values use 4 reads and for some 5, with average 4.26. There are 2,779 seeks, but there are values only for 2,130 rows. As I said, it is difficult to get real number of reads with UDF without profiler.Recursive CTE
As was pointed in comments, Paul White described a Recursive Index Skip Scan method to find distinct values in a large table without performing a full index scan, but doing index seeks recursively. To start recursion we need to find the
MIN or MAX value for an anchor and then each step of recursion adds next value one by one. The post explains it in details.```
WITH RecursiveCTE
AS
(
-- Anchor
SELECT TOP (1) [ElevatorID], [DataSourceRowID]
FROM [dbo].[PlaybackStats]
ORDER BY [ElevatorID] DESC, [DataSourceRowID] DESC
UNION ALL
-- Recursive
SELECT R.[ElevatorID], R.[DataSourceRowID]
FROM
(
-- Number the rows
SELECT
T.[ElevatorID], T.[DataSourceRowID]
,ROW_NUMBER() OVER (ORDER BY T.[ElevatorID] DESC, T.[DataSourceRowID] DESC) AS rn
FROM
[dbo].[PlaybackStats] AS T
INNER JOIN RecursiveCTE AS R ON R.[ElevatorID] > T.[ElevatorID]
Code Snippets
SELECT [ElevatorID], MAX([DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
GROUP BY [ElevatorID]SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = 1SELECT
[dbo].[PortalElevators].elevatorsId
,LastItemID
FROM
[dbo].[PortalElevators]
CROSS APPLY
(
SELECT MAX([dbo].[PlaybackStats].[DataSourceRowID]) AS LastItemID
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
) AS CA
;CREATE FUNCTION [dbo].[GetElevatorLastID]
(
@ParamElevatorID int
)
RETURNS bigint
AS
BEGIN
DECLARE @Result bigint;
SELECT @Result = MAX([dbo].[PlaybackStats].[DataSourceRowID])
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = @ParamElevatorID;
RETURN @Result;
END
SELECT
[dbo].[PortalElevators].elevatorsId
,[dbo].[GetElevatorLastID]([dbo].[PortalElevators].elevatorsId) AS LastItemID
FROM
[dbo].[PortalElevators]
;SELECT
[dbo].[PortalElevators].elevatorsId
,LastItemID
FROM
[dbo].[PortalElevators]
CROSS APPLY
(
SELECT TOP(1) [dbo].[PlaybackStats].[DataSourceRowID] AS LastItemID
FROM [dbo].[PlaybackStats]
WHERE [dbo].[PlaybackStats].ElevatorID = [dbo].[PortalElevators].elevatorsId
ORDER BY [dbo].[PlaybackStats].[DataSourceRowID] DESC
) AS CA
;Context
StackExchange Database Administrators Q#96364, answer score: 11
Revisions (0)
No revisions yet.