HiveBrain v1.2.0
Get Started
← Back to all entries
principlesqlModerate

logical reads vs. scan count

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
countreadslogicalscan

Problem

I'm joining a small table (1,000 rows) against a large table (8M rows) in SQL Server 2008. The join uses a nonclustered covering index on the large table, and the join can produce three possible query plans. I'm trying to figure out which plan is better, but also I want to generalize this knowledge so next time I can better know what heuristics to use when looking at SQL I/O statistics.

Plan #1 is a loop join and emits statistics for the large table like this:

Scan count 2582, logical reads 35686, physical reads 1041, read-ahead reads 23052


Plan #2 is a merge join and emits statistics like this:

Scan count 1, logical reads 59034, physical reads 49, read-ahead reads 59004


Plan #3 is a hash join and emits statistics like this:

Scan count 3, logical reads 59011, physical reads 5, read-ahead reads 59010


The covering index is ordered by (ID, Date). The query returns data for about 50% of the IDs and, for each ID, returns a contiguous chunk of the most recent 3 months of data, which is usually about 1/4 or the rows for each ID. The query returns about 1/8 of the total rows in the index. In other words, the query is sparse but consistently so.

My assumption is that plan #1 is awful for this workload, because moving the disk head around 2,500 times (or even 1,041 times) is far more expensive than a sequential disk scan. I also assume that #3 and #2 have similar, sequential (and therefore more efficient) I/O patterns.

But is there a case where plan #1 is really best, where "best" means less impact on the I/O subsystem and less impact on other queries running concurrently?

Or does it really depend on many variables like the kind of disk subsystem I have, index fragmentation, etc. If "it depends" are there any rules of thumb to approach the problem?

Solution

Here is the killer deal: it January it was costing $12k to buy 864GB of RAM. You can get a lot of bang for the buck by simply increasing the RAM of your server up to the point that you will never hit a physical read (after warm up, of course).

Other than that is really hard to give an black or white opinion about either of those data points you present. Sure plan #1 had most physical reads, but are you positive that all tests were done on similarly warmed up cache? Could it be that #1 warmed up the cache for #2, what is your test methodology to ensure all cases are considered on level ground? Even so, if you shell out $500 and double the RAM, would it matter any more? #1 does have the least logical reads...

But then #2 is probably benefit from a high DOP (that one scan can be parallel). Is the wall-clock time of #2 better that #1 after you added sufficient RAM?

How many of these plans run in parallel? Are there tens of queries requesting concurrently a significant memory grant for the hash of #3 and thus creating contention for the RESOURCE_SEMAPHORE? Is the #2 doing a sort and also requesting a memory grant? Will #1 work better since it requires no grant (at least from the info posted...)?

Is really really relative and the question you ask is more like finding one solution for a complex system of equations... there simply could be more that one solutions.

One things is sure: 8M rows should fit in RAM with plenty room to spare. Those physical reads are begging for some memory banks.

Context

StackExchange Database Administrators Q#19849, answer score: 10

Revisions (0)

No revisions yet.