patternMinor
Optimising join on large table
Viewed 0 times
joinlargetableoptimising
Problem
I am trying to coax some more performance out of a query that is accessing a table with ~250-million records. From my reading of the actual (not estimated) execution plan, the first bottleneck is a query that looks like this:
See further down for the definitions of the tables & indexes involved.
The execution plan indicates that a nested loop is being used on #smalltable, and that the index scan over hugetable is being executed 480 times (for each row in #smalltable). This seems backwards to me, so I've tried to force a merge join to be used instead:
The index in question (see below for full definition) covers columns fk (the join predicate), added (used in the where clause) & id (useless) in ascending order, and includes value.
When I do this, however, the query blows out from 2 1/2 minutes to over 9. I would have hoped that the hints would force a more efficient join that only does a single pass over each table, but clearly not.
Any guidance is welcome. Additional information provided if required.
Update (2011/06/02)
Having reorganised the indexing on the table, I have made significant performance inroads, however I have hit a new obstacle when it comes to summarising the data in the huge table. The outcome is a summarisation by month, which currently looks like the following:
At present, hugeta
select
b.stuff,
a.added,
a.value
from
dbo.hugetable a
inner join
#smalltable b on a.fk = b.pk
where
a.added between @start and @end;See further down for the definitions of the tables & indexes involved.
The execution plan indicates that a nested loop is being used on #smalltable, and that the index scan over hugetable is being executed 480 times (for each row in #smalltable). This seems backwards to me, so I've tried to force a merge join to be used instead:
select
b.stuff,
a.added,
a.value
from
dbo.hugetable a with(index = ix_hugetable)
inner merge join
#smalltable b with(index(1)) on a.fk = b.pk
where
a.added between @start and @end;The index in question (see below for full definition) covers columns fk (the join predicate), added (used in the where clause) & id (useless) in ascending order, and includes value.
When I do this, however, the query blows out from 2 1/2 minutes to over 9. I would have hoped that the hints would force a more efficient join that only does a single pass over each table, but clearly not.
Any guidance is welcome. Additional information provided if required.
Update (2011/06/02)
Having reorganised the indexing on the table, I have made significant performance inroads, however I have hit a new obstacle when it comes to summarising the data in the huge table. The outcome is a summarisation by month, which currently looks like the following:
select
b.stuff,
datediff(month, 0, a.added),
count(a.value),
sum(case when a.value > 0 else 1 end) -- this triples the running time!
from
dbo.hugetable a
inner join
#smalltable b on a.fk = b.pk
group by
b.stuff,
datediff(month, 0, a.added);At present, hugeta
Solution
Your
In addition:
Try changing the clustered key to
Another option might be to try the FORCE ORDER hint with table order boh ways and no JOIN/INDEX hints. I try not to use JOIN/INDEX hints personally because you remove options for the optimiser. Many years ago I was told (seminar with a SQL Guru) that FORCE ORDER hint can help when you have huge table JOIN small table: YMMV 7 years later...
Oh, and let us know where the DBA lives so we can arrange for some percussion adjustment
Edit, after 02 Jun update
The 4th column is not part of the non-clustered index so it uses the clustered index.
Try changing the NC index to INCLUDE the value column so it doesn't have to access the value column for the clustered index
Note: If value is not nullable then it is the same as
As an example, if you change
The query needs 3 columns: added, fk, value. The first 2 are filtered/joined so are key columns. value is just used so can be included. Classic use of a covering index.
ix_hugetable looks quite useless because:- it is the clustered index (PK)
- the INCLUDE makes no difference because a clustered index INCLUDEs all non-key columns (non-key values at lowest leaf = INCLUDEd = what a clustered index is)
In addition:
- added or fk should be first
- ID is first = not much use
Try changing the clustered key to
(added, fk, id) and drop ix_hugetable. You've already tried (fk, added, id). If nothing else, you'll save a lot of disk space and index maintenanceAnother option might be to try the FORCE ORDER hint with table order boh ways and no JOIN/INDEX hints. I try not to use JOIN/INDEX hints personally because you remove options for the optimiser. Many years ago I was told (seminar with a SQL Guru) that FORCE ORDER hint can help when you have huge table JOIN small table: YMMV 7 years later...
Oh, and let us know where the DBA lives so we can arrange for some percussion adjustment
Edit, after 02 Jun update
The 4th column is not part of the non-clustered index so it uses the clustered index.
Try changing the NC index to INCLUDE the value column so it doesn't have to access the value column for the clustered index
create nonclustered index ix_hugetable on dbo.hugetable (
fk asc, added asc
) include(value)Note: If value is not nullable then it is the same as
COUNT(*) semantically. But for SUM it need the actual value, not existence.As an example, if you change
COUNT(value) to COUNT(DISTINCT value) without changing the index it should break the query again because it has to process value as a value, not as existence.The query needs 3 columns: added, fk, value. The first 2 are filtered/joined so are key columns. value is just used so can be included. Classic use of a covering index.
Code Snippets
create nonclustered index ix_hugetable on dbo.hugetable (
fk asc, added asc
) include(value)Context
StackExchange Database Administrators Q#9858, answer score: 5
Revisions (0)
No revisions yet.