patternsqlCritical
Retrieving n rows per group
Viewed 0 times
retrievingrowspergroup
Problem
I often need to select a number of rows from each group in a result set.
For example, I might want to list the 'n' highest or lowest recent order values per customer.
In more complex cases, the number of rows to list might vary per group (defined by an attribute of the grouping/parent record). This part is definitely optional/for extra credit and not intended to dissuade people from answering.
What are the main options for solving these types of problems in SQL Server 2005 and later? What are the main advantages and disadvantages of each method?
AdventureWorks examples (for clarity, optional)
For example, I might want to list the 'n' highest or lowest recent order values per customer.
In more complex cases, the number of rows to list might vary per group (defined by an attribute of the grouping/parent record). This part is definitely optional/for extra credit and not intended to dissuade people from answering.
What are the main options for solving these types of problems in SQL Server 2005 and later? What are the main advantages and disadvantages of each method?
AdventureWorks examples (for clarity, optional)
- List the five most recent recent transaction dates and IDs from the
TransactionHistorytable, for each product that starts with a letter from M to R inclusive.
- Same again, but with
nhistory lines per product, wherenis five times theDaysToManufactureProduct attribute.
- Same, for the special case where exactly one history line per product is required (the single most recent entry by
TransactionDate, tie-break onTransactionID.
Solution
Let's start with the basic scenario.
If I want to get some number of rows out of a table, I have two main options: ranking functions; or
First, let's consider the whole set from
This returns 418 rows, and the plan shows that it checks every row in the table looking for this - an unrestricted Clustered Index Scan, with a Predicate to provide the filter. 797 reads here, which is ugly.
So let's be fair to it, and create an index that would be more useful. Our conditions call for an equality match on
Having done this, our plan changes significantly, and drops the reads down to just 3. So we're already improving things by over 250x or so...
Now that we've levelled the playing field, let's look at the top options - ranking functions and
You will notice that the second (
So this work across a single product. But we need to consider what happens if we need to do this across multiple products.
The iterative programmer is going to consider the idea of looping through the products of interest, and calling this query multiple times, and we can actually get away with writing a query in this form - not using cursors, but using
The plan for this is the iterative programmers' method - Nested Loop, doing a Top operation and Seek (those 2 reads we had before) for each Product. This gives 4 reads against Product, and 360 against TransactionHistory.
Using
Significantly, though, this plan has an expensive Sort operator. The Merge Join doesn't seem to maintain the order of rows in TransactionHistory, the data must be resorted to be able to find the rownumbers. It's fewer reads, but this blocking Sort could feel painful. Using
Interestingly, if the
This plan uses a Nested Loop, just like with
If I want to get some number of rows out of a table, I have two main options: ranking functions; or
TOP.First, let's consider the whole set from
Production.TransactionHistory for a particular ProductID:SELECT h.TransactionID, h.ProductID, h.TransactionDate
FROM Production.TransactionHistory h
WHERE h.ProductID = 800;This returns 418 rows, and the plan shows that it checks every row in the table looking for this - an unrestricted Clustered Index Scan, with a Predicate to provide the filter. 797 reads here, which is ugly.
So let's be fair to it, and create an index that would be more useful. Our conditions call for an equality match on
ProductID, followed by a search for the most recent by TransactionDate. We need the TransactionID returned too, so let's go with: CREATE INDEX ix_FindingMostRecent ON Production.TransactionHistory (ProductID, TransactionDate) INCLUDE (TransactionID);.Having done this, our plan changes significantly, and drops the reads down to just 3. So we're already improving things by over 250x or so...
Now that we've levelled the playing field, let's look at the top options - ranking functions and
TOP.WITH Numbered AS
(
SELECT h.TransactionID, h.ProductID, h.TransactionDate, ROW_NUMBER() OVER (ORDER BY TransactionDate DESC) AS RowNum
FROM Production.TransactionHistory h
WHERE h.ProductID = 800
)
SELECT TransactionID, ProductID, TransactionDate
FROM Numbered
WHERE RowNum <= 5;
SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
FROM Production.TransactionHistory h
WHERE h.ProductID = 800
ORDER BY TransactionDate DESC;You will notice that the second (
TOP) query is much simpler than the first, both in query and in plan. But very significantly, they both use TOP to limit the number of rows actually being pulled out of the index. The costs are only estimates and worth ignoring, but you can see a lot of similarity in the two plans, with the ROW_NUMBER() version doing a tiny amount of extra work to assign numbers and filter accordingly, and both queries end up doing just 2 reads to do their work. The Query Optimizer certainly recognises the idea of filtering on a ROW_NUMBER() field, realising that it can use a Top operator to ignore rows that aren't going to be needed. Both these queries are good enough - TOP isn't so much better that it's worth changing code, but it is simpler and probably clearer for beginners.So this work across a single product. But we need to consider what happens if we need to do this across multiple products.
The iterative programmer is going to consider the idea of looping through the products of interest, and calling this query multiple times, and we can actually get away with writing a query in this form - not using cursors, but using
APPLY. I'm using OUTER APPLY, figuring that we might want to return the Product with NULL, if there are no Transactions for it.SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
FROM
Production.Product p
OUTER APPLY (
SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
FROM Production.TransactionHistory h
WHERE h.ProductID = p.ProductID
ORDER BY TransactionDate DESC
) t
WHERE p.Name >= 'M' AND p.Name < 'S';The plan for this is the iterative programmers' method - Nested Loop, doing a Top operation and Seek (those 2 reads we had before) for each Product. This gives 4 reads against Product, and 360 against TransactionHistory.
Using
ROW_NUMBER(), the method is to use PARTITION BY in the OVER clause, so that we restart the numbering for each Product. This can then be filtered like before. The plan ends up being quite different. The logical reads are about 15% lower on TransactionHistory, with a full Index Scan going on to get the rows out.WITH Numbered AS
(
SELECT p.Name, p.ProductID, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY h.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
FROM Production.Product p
LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
WHERE p.Name >= 'M' AND p.Name < 'S'
)
SELECT Name, ProductID, TransactionID, TransactionDate
FROM Numbered n
WHERE RowNum <= 5;Significantly, though, this plan has an expensive Sort operator. The Merge Join doesn't seem to maintain the order of rows in TransactionHistory, the data must be resorted to be able to find the rownumbers. It's fewer reads, but this blocking Sort could feel painful. Using
APPLY, the Nested Loop will return the first rows very quickly, after just a few reads, but with a Sort, ROW_NUMBER() will only return rows after a most of the work has been finished.Interestingly, if the
ROW_NUMBER() query uses INNER JOIN instead of LEFT JOIN, then a different plan comes up.This plan uses a Nested Loop, just like with
APPLY. But there's no Top operator, so it pulls all the transactions for each product, and uses a lot more reads than before Code Snippets
SELECT h.TransactionID, h.ProductID, h.TransactionDate
FROM Production.TransactionHistory h
WHERE h.ProductID = 800;WITH Numbered AS
(
SELECT h.TransactionID, h.ProductID, h.TransactionDate, ROW_NUMBER() OVER (ORDER BY TransactionDate DESC) AS RowNum
FROM Production.TransactionHistory h
WHERE h.ProductID = 800
)
SELECT TransactionID, ProductID, TransactionDate
FROM Numbered
WHERE RowNum <= 5;
SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
FROM Production.TransactionHistory h
WHERE h.ProductID = 800
ORDER BY TransactionDate DESC;SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
FROM
Production.Product p
OUTER APPLY (
SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
FROM Production.TransactionHistory h
WHERE h.ProductID = p.ProductID
ORDER BY TransactionDate DESC
) t
WHERE p.Name >= 'M' AND p.Name < 'S';WITH Numbered AS
(
SELECT p.Name, p.ProductID, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY h.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
FROM Production.Product p
LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
WHERE p.Name >= 'M' AND p.Name < 'S'
)
SELECT Name, ProductID, TransactionID, TransactionDate
FROM Numbered n
WHERE RowNum <= 5;WITH Numbered AS
(
SELECT p.Name, p.ProductID, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY p.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
FROM Production.Product p
LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
WHERE p.Name >= 'M' AND p.Name < 'S'
)
SELECT Name, ProductID, TransactionID, TransactionDate
FROM Numbered n
WHERE RowNum <= 5;Context
StackExchange Database Administrators Q#86415, answer score: 86
Revisions (0)
No revisions yet.