HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Most efficient way to join two tables with MAX value

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
tableswithefficientwayjoinvaluetwomaxmost

Problem

I am optimizing a query, where I have to join two tables, let's say Product and TransactionHistory table, and return multiple columns from both table for the last TransactionDate from the TransactionHistory table for each Product in the Product table.

TransactionHistory table has around 13 Million rows and the Product table has almost 2 million rows.

Sounds pretty easy, doesn't it? but for different scenarios different query perform differently.

I am not using my actual table names here, to show the different query syntax I have tried, I am using Adam Machanic's BigProduct and BigSalesHistory Tables.

The query was originally written like....

SELECT p.ProductID
     , p.Name
     , h.Quantity
     , h.TransactionDate
FROM [dbo].[bigProduct] p
INNER JOIN [dbo].[bigTransactionHistory] h ON p.ProductID = h.ProductID
WHERE h.TransactionDate = ( SELECT MAX(TransactionDate)
                            FROM [dbo].[bigTransactionHistory]
                            WHERE ProductID = h.ProductID )
AND p.ProductID > 1317
AND p.ProductID < 1416
GO


Which resulted in some 30 scan counts and almost 300,000 logical reads.

I being a big fan of windowing functions wrote the following queries using ROW_NUMBER(), both backfired badly:

SELECT p.ProductID
     , p.Name
     , c.*
FROM [dbo].[bigProduct] p
INNER JOIN ( SELECT  h.ProductID
                   , h.Quantity
                   , h.TransactionDate
                   , ROW_NUMBER() OVER (PARTITION BY h.ProductID 
                                            ORDER BY h.TransactionDate DESC) rn
              FROM [dbo].[bigTransactionHistory] h
            ) c
            ON p.ProductID = c.ProductID AND rn = 1
WHERE p.ProductID > 1317
  AND p.ProductID < 1416
GO


AND

```
SELECT p.ProductID
, p.Name
, c.*
FROM [dbo].[bigProduct] p
CROSS APPLY ( SELECT h.ProductID
, h.Quantity
, h.TransactionDate
, ROW_NUMBER() OVER (PART

Solution

You need to think about how you would like to solve this if you had a paper-based system.

Like - do you want to use each ProductID one by one and find the newest one? Then your Nested Loop that calls a Seek using TOP 1 on an Index on (ProductID, Transaction Date DESC) will be best.

Do you want to search through all your transactions and group them by ProductID and grab the max? Then that approach could be best.

But I'd suggest to you that what you'd do in real life is to do the second approach on a subset of your data (such as the most recent month), and then do a product-by-product search on the ones that didn't work. Try an outer join onto a subquery that groups by ProductID and filters by TransactionDate to "fairly recent". Then do an Outer Apply to do a Top 1, but include a Predicate in there like where m.ProductID IS NULL, so that it only does those Seeks for products that were missed in the first round. You should see a Filter with a Startup Expression Predicate, and notice that the Index Seek is executed a lot fewer times.

I show this kind of thing at http://blogs.lobsterpot.com.au/2014/07/08/ssis-lookup-transformation-in-t-sql/ - look for the plan I show, on the example with two Outer Apply bits.

Context

StackExchange Database Administrators Q#163238, answer score: 2

Revisions (0)

No revisions yet.