HiveBrain v1.2.0
Filter table before join

Submitted by: @import:stackexchange-dba

Problem

I have a stored procedure that populates the temp table #employee_benefits with a list of Ids. This table ends up being roughly 10,000 rows long. The query below then selects from a table called EmployeeBenefitData that has around 4 million rows.

SELECT  ebd.EmployeeBenefitDataId, ebd.EmployeeBenefitId, ebd.[DataDefinitionId]
FROM    #employee_benefits eb
INNER JOIN EmployeeBenefitData ebd ON eb.EmployeeBenefitId = ebd.EmployeeBenefitId


The bottleneck was the index scan on the EmployeeBenefitData table: the plan scanned the index first and then joined the result to the temp table. Since the temp table was acting as a filter, scanning all the data before the join was very inefficient. I added the following code to change the scan to a seek, which dramatically cut down the number of reads needed.

DECLARE @MinEmpBenId INT, @MaxEmpBenId INT

SELECT @MinEmpBenId = MIN(EmployeeBenefitId), @MaxEmpBenId = MAX(EmployeeBenefitId)
FROM #employee_benefits

SELECT  ebd.EmployeeBenefitDataId, ebd.EmployeeBenefitId, ebd.[DataDefinitionId],
        dd.TypeId, dd.DataDefinitionId, dd.Name, ebd.[Value], ebd.[Date], ebd.[Text]
FROM    #employee_benefits eb
INNER JOIN EmployeeBenefitData ebd ON eb.EmployeeBenefitId = ebd.EmployeeBenefitId
INNER JOIN DataDefinition dd ON ebd.DataDefinitionId = dd.DataDefinitionId
WHERE   ebd.EmployeeBenefitId >= @MinEmpBenId AND ebd.EmployeeBenefitId <= @MaxEmpBenId


It makes a massive difference in the client stats:

+-----------------------------+----------------+----------------+
| Client statistic            | Filtered query | Original query |
+-----------------------------+----------------+----------------+
| Total execution time        |             74 |           1794 |
| Wait time on server replies |             11 |             11 |
+-----------------------------+----------------+----------------+

My question is: Is this good practice? And why doesn't the optimiser do this?

UPDATE
I should have mentioned that the temp table has a clustered index on EmployeeBenefitId.

Solution

Is this good practice?

In this circumstance I would say yes. I'd probably also add an OPTION (RECOMPILE) to let it "sniff" the variable values. The optimal plan will likely vary depending on the proportion of rows in the larger table that fall within this range.
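Applied to the question's query, that suggestion would look something like this (a sketch using the table and column names from the question; not shown in the original answer):

```sql
DECLARE @MinEmpBenId INT, @MaxEmpBenId INT;

SELECT @MinEmpBenId = MIN(EmployeeBenefitId),
       @MaxEmpBenId = MAX(EmployeeBenefitId)
FROM   #employee_benefits;

SELECT  ebd.EmployeeBenefitDataId, ebd.EmployeeBenefitId, ebd.DataDefinitionId
FROM    #employee_benefits eb
INNER JOIN EmployeeBenefitData ebd
        ON eb.EmployeeBenefitId = ebd.EmployeeBenefitId
WHERE   ebd.EmployeeBenefitId BETWEEN @MinEmpBenId AND @MaxEmpBenId
OPTION (RECOMPILE);  -- compile with the sniffed values of @MinEmpBenId/@MaxEmpBenId
```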

It provides a potentially useful extra access path to the optimiser, and it is not something the query optimiser ever does by itself as far as I know. The closest built-in behaviour is that a merge join stops processing an input as soon as either one is exhausted, which can avoid a full scan.

The only downside that springs to mind is the cost of calculating the min/max range values themselves (but this should be very cheap if the table you are using as a filter is indexed on that column).

I created two test tables

CREATE TABLE EmployeeBenefitData(EmployeeID INT PRIMARY KEY);

CREATE TABLE FilteredEmployee(EmployeeID INT PRIMARY KEY);


And loaded EmployeeBenefitData with integers from 1 to 4,000,000 (6,456 pages)

And FilteredEmployee with integers from 2,000,000 to 2,010,000 (19 pages)
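The answer doesn't show the load script; one way the tables could have been populated is with a stacked-CTE number generator (a hypothetical sketch, not the author's actual script):

```sql
-- Generate more than 4,000,000 rows by repeated cross joins, then number them
WITH L0 AS (SELECT 1 AS c FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1)) v(c)), -- 8 rows
     L1 AS (SELECT 1 AS c FROM L0 a CROSS JOIN L0 b),                          -- 64 rows
     L2 AS (SELECT 1 AS c FROM L1 a CROSS JOIN L1 b),                          -- 4,096 rows
     L3 AS (SELECT 1 AS c FROM L2 a CROSS JOIN L2 b),                          -- ~16.7M rows
     Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L3)
INSERT INTO EmployeeBenefitData (EmployeeID)
SELECT n FROM Nums WHERE n <= 4000000;

-- The filter table is just the 2,000,000-2,010,000 slice
INSERT INTO FilteredEmployee (EmployeeID)
SELECT EmployeeID
FROM   EmployeeBenefitData
WHERE  EmployeeID BETWEEN 2000000 AND 2010000;
```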

And then ran 6 queries of the following form

DECLARE @E1 INT,
        @E2 INT

SELECT @E1 = FE.EmployeeID,
       @E2 = EBD.EmployeeID
FROM   FilteredEmployee FE
       INNER LOOP JOIN EmployeeBenefitData EBD
         ON FE.EmployeeID = EBD.EmployeeID
OPTION (MAXDOP 1);


The 6 permutations were produced by swapping the order of the two tables and trying all three join hints: LOOP, MERGE, and HASH.

Results were as follows

+------------+-------------+-------+----------------+-----------+---------------+----------+
| Left Table | Right Table | Join  | EBD Scan Count | EBD reads | FE Scan Count | FE reads |
+------------+-------------+-------+----------------+-----------+---------------+----------+
| FE         | EBD         | Loop  |              0 |     30637 |             1 |       19 |
| EBD        | FE          | Loop  |              1 |      6456 |             0 |  8250009 |
| FE         | EBD         | Merge |              1 |      3257 |             1 |       19 |
| EBD        | FE          | Merge |              1 |      3257 |             1 |       19 |
| FE         | EBD         | Hash  |              1 |      6456 |             1 |       19 |
| EBD        | FE          | Hash  |              1 |      6456 |             1 |       19 |
+------------+-------------+-------+----------------+-----------+---------------+----------+


The above figures illustrate the point about merge join, as it "only" scans just over half the larger table. It still reads all the rows from 1 to 1,999,999 first and discards them, though.

Repeating the experiment with a WHERE EBD.EmployeeID BETWEEN 2000000 AND 2010000 gave the following.

+------------+-------------+-------+----------------+-----------+---------------+----------+
| Left Table | Right Table | Join  | EBD Scan Count | EBD reads | FE Scan Count | FE reads |
+------------+-------------+-------+----------------+-----------+---------------+----------+
| FE         | EBD         | Loop  |              0 |     30637 |             1 |       19 |
| EBD        | FE          | Loop  |              1 |        21 |             0 |    20636 |
| FE         | EBD         | Merge |              1 |        21 |             1 |       19 |
| EBD        | FE          | Merge |              1 |        21 |             1 |       19 |
| FE         | EBD         | Hash  |              1 |        21 |             1 |       19 |
| EBD        | FE          | Hash  |              1 |        21 |             1 |       19 |
+------------+-------------+-------+----------------+-----------+---------------+----------+
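Each permutation in the repeated experiment was presumably run in a form like this (the earlier test query with the range predicate added, shown here for the loop-join case):

```sql
DECLARE @E1 INT,
        @E2 INT;

SELECT @E1 = FE.EmployeeID,
       @E2 = EBD.EmployeeID
FROM   FilteredEmployee FE
       INNER LOOP JOIN EmployeeBenefitData EBD
         ON FE.EmployeeID = EBD.EmployeeID
WHERE  EBD.EmployeeID BETWEEN 2000000 AND 2010000  -- the added range predicate
OPTION (MAXDOP 1);
```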


The only query that didn't benefit from the additional range predicate is the one where the larger table was on the inside of a nested loops join.

This is of course not surprising, as that plan (plan 1) is driven by repeated index seeks using the values from FilteredEmployee.

Plan 1 was also the one chosen "naturally" by the optimiser without the range predicate. With the range predicate in place it chose a different merge join plan (plan 2), seeking into the relevant index range without scanning unnecessary rows, and costed that plan as significantly cheaper.


Context

StackExchange Database Administrators Q#81405, answer score: 6
