HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Clustered index scan appears to be costed too low with row count spool

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
clusteredcountscancostedlowwithtooappearsindexrow

Problem

I've never posted one of these before so let me know if any information is missing.

I have a relatively simple query that performs pretty poorly with the default sample size on a few key columns. The costing of a clustered index scan used by a row count spool appears to be impossibly low. I will go through the technical details that I can share about the problem. Unfortunately, I cannot share statistics or provide a data set. What I am looking for is an explanation of the root cause and additional workarounds to the problem.

This is the SQL Server version that I'm testing on:

Microsoft SQL Server 2014 - 12.0.4100.1 (X64)
Apr 20 2015 17:29:27
Copyright (c) Microsoft Corporation
Developer Edition (64-bit) on Windows NT 6.3 (Build 9600: ) (Hypervisor)

Trace flags 4199 and 9481 are enabled. Trace flags 4134, 2390, and 2371 are also enabled but I don't expect those to have any effect for this query.

The query itself only has two tables.

CREATE TABLE X_JRO_TABLE_1 (
PK_COLUMN NUMERIC(18,0) NOT NULL,
JOIN_COLUMN NUMERIC(18,0) NULL,
PRIMARY KEY (PK_COLUMN)
)
WITH (DATA_COMPRESSION = PAGE);

CREATE TABLE X_JRO_TABLE_2 (
PK_COLUMN NUMERIC(18,0) NOT NULL,
JOIN_COLUMN NUMERIC(18,0) NULL,
DATA_TO_PAD_PAGES VARCHAR(500) NULL
PRIMARY KEY (PK_COLUMN)
)
WITH (DATA_COMPRESSION = PAGE);

-- insert data query here

CREATE STATISTICS X_JRO_TABLE_1__JOIN_COLUMN ON X_JRO_TABLE_1 (JOIN_COLUMN) 
WITH NORECOMPUTE;

CREATE STATISTICS X_JRO_TABLE_2__JOIN_COLUMN ON X_JRO_TABLE_2 (JOIN_COLUMN) 
WITH NORECOMPUTE;


Information that I can share on the statistics for JOIN_COLUMN in X_JRO_TABLE_1:

Rows: about 63500000

Rows sampled: 1056952

Steps: 200

All density: 4.741809E-05

6.168829E+07 EQ_ROWS for the step with a NULL RANGE_HI_KEY.

Rest of the histogram is hard to describe.

Information that I can share on the statistics for JOIN_COLUMN in X_JRO_TABLE_2:

Rows: about 23400000

Rows sampled: 182929

Steps: 188

All density: 0.0003521127

About 170 steps in the histogram w

Solution

Your first question (why is the clustered index scan costed so low) is almost certainly due to the concept of "row goals", a feature of the SQL Server optimizer that examines sections of a query tree where not all rows from that tree are needed (these sub-trees typically implement things like IN, NOT IN, EXISTS, TOP 1 subqueries, etc). SQL Server is basically asking "how many rows do I expect to scan from the X_JRO_TABLE_2 table before I find a match for X_JRO_TABLE_1 (and thus disqualify the row)".

Since SQL Server assumes an even distribution of values throughout the table, it assumes that it won't have to read very many rows/pages before finding a match. Or said another way, the optimizer thinks that most "scans" on the inner side of the loop join will be extremely short partial scans, rather than full scans or anywhere close to full. That's obviously a poor assumption given the real distribution of rows in X_JRO_TABLE_2.

Paul White has answered a very similar question here, much better than I can, discussing row goals and linking to more in-depth resources: Unexpected scans during delete operation using WHERE IN
Note the Top operator above the tblFEStatsPaperHits scan on the inside of a nested loops join.

As for your second question (why the sampling increase on X_JRO_TABLE_2 increases the cost of the table scan) I'd have to play around with things to hazard a nuanced guess, though it is probably also related to the row goal/"how many rows to read before finding a match" issue.

Context

StackExchange Database Administrators Q#148367, answer score: 2

Revisions (0)

No revisions yet.