patternModerate
Poor performing subquery with date comparisons
Viewed 0 times
performingwithdatepoorsubquerycomparisons
Problem
When using a subquery to find the total count of all prior records with a matching field, the performance is terrible on a table with as little as 50k records. Without the subquery, the query executes in a few milliseconds. With the subquery, the execution time is upwards of a minute.
For this query, the result must:
Basic Table Schema
Example Data
Expected Results
For the date range of
Records 96 and 95 are excluded from the result, but are included in the
Current Query
Current Index
```
CREATE NONCLUSTERED INDEX [IDX_my_nme] ON [dbo].[Activity]
(
[ActionDate] ASC
)
INCLUDE ([Address]) WITH (
PAD_INDEX = OFF,
STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ONLINE = OFF,
ALL
For this query, the result must:
- Include only those records within a given date range.
- Include a count of all prior records, not including the current record, regardless of date range.
Basic Table Schema
Activity
======================
Id int Identifier
Address varchar(25)
ActionDate datetime2
Process varchar(50)
-- 7 other columns
Example Data
Id Address ActionDate (Time part excluded for simplicity)
===========================
99 000 2017-05-30
98 111 2017-05-30
97 000 2017-05-29
96 000 2017-05-28
95 111 2017-05-19
94 222 2017-05-30
Expected Results
For the date range of
2017-05-29 to 2017-05-30Id Address ActionDate PriorCount
=========================================
99 000 2017-05-30 2 (3 total, 2 prior to ActionDate)
98 111 2017-05-30 1 (2 total, 1 prior to ActionDate)
94 222 2017-05-30 0 (1 total, 0 prior to ActionDate)
97 000 2017-05-29 1 (3 total, 1 prior to ActionDate)
Records 96 and 95 are excluded from the result, but are included in the
PriorCount subqueryCurrent Query
select
*.a
, ( select count(*)
from Activity
where
Activity.Address = a.Address
and Activity.ActionDate < a.ActionDate
) as PriorCount
from Activity a
where a.ActionDate between '2017-05-29' and '2017-05-30'
order by a.ActionDate descCurrent Index
```
CREATE NONCLUSTERED INDEX [IDX_my_nme] ON [dbo].[Activity]
(
[ActionDate] ASC
)
INCLUDE ([Address]) WITH (
PAD_INDEX = OFF,
STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ONLINE = OFF,
ALL
Solution
With the index definition that you have for
It's likely that seeking on
I'm testing against SQL Server 2016 so there might be some minor syntax differences. Below is some sample data in which I made similar assumptions to the above for data distribution:
I've created your index as described in the question. I'm testing against this query which returns the same data as the one in the question:
I get an index spool. What that means at a basic level is that the query optimizer build a temporary index on-the-fly because none of the existing indexes against the table were suitable.
The query still finishes quickly for me. Perhaps you aren't getting the index spool optimization on your system or there's something different about the table definition or the query. For educational purposes I can use an undocumented feature
Don't be fooled by the appearance of a simple index seek. SQL Server reads nearly 10 million rows from the index:
If I'm going to be running the query more than once then it probably doesn't make sense for the query optimizer to create an index each time it runs. I could create an index upfront that would be more selective for this query:
The plan is similar to before:
However, with the new index SQL Server only reads 1000 rows from the index. 800 of the rows are returned to be counted. The index could be defined to be more selective but this could be good enough depending on your data distribution.
If you aren't able to define any additional indexes on the table I would consider using window functions. The following appears to work:
That query does a single scan of the data but does an expensive sort and calculates the
However, if you really like that code pattern you could define an index to make it more efficient:
That moves the sort towards the end which will be much less expensive:
If none of this helps then you'll need to add more information to the question, preferably including actual execution plans.
IDX_my_nme, SQL Server will be able to seek using the ActionDate column but not with the Address column. The index contains all of the columns needed to cover the subquery but it likely isn't very selective for that subquery. Suppose that almost all of the data in the table has an ActionDate value of earlier than '2017-05-30'. A seek of ActionDate < '2017-05-30' will return almost all of the rows from the index, which are further filtered down after the row is fetched from the index. If your query returns 200 rows then you would probably doing almost 200 full index scans on IDX_my_nme, which means you will read around 50000 * 200 = 10 million rows from the index.It's likely that seeking on
Address will be far more selective for your subquery, although you haven't given us full statistical information about the query so that's an assumption on my part. However, suppose that you create an index on just Address and your table has 10k unique values for Address. With the new index, SQL Server will only need to seek 5 rows from the index for each execution of the subquery, so you'll read around 200 * 5 = 1000 rows from the index.I'm testing against SQL Server 2016 so there might be some minor syntax differences. Below is some sample data in which I made similar assumptions to the above for data distribution:
CREATE TABLE #Activity (
Id int NOT NULL,
[Address] varchar(25) NULL,
ActionDate datetime2 NULL,
FILLER varchar(100),
PRIMARY KEY (Id)
);
INSERT INTO #Activity WITH (TABLOCK)
SELECT TOP (50000) -- 50k total rows
x.RN
, x.RN % 10000 -- 10k unique addresses
, DATEADD(DAY, x.RN / 100, '20160201') -- 100 rows per day
, REPLICATE('Z', 100)
FROM
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) x;
CREATE NONCLUSTERED INDEX [IDX_my_nme] ON #Activity
([ActionDate] ASC) INCLUDE ([Address]);I've created your index as described in the question. I'm testing against this query which returns the same data as the one in the question:
select
a.*
, ( select count(*)
from #Activity Activity
where
Activity.[Address] = a.[Address]
and Activity.ActionDate < a.ActionDate
) as PriorCount
from #Activity a
where a.ActionDate between '2017-05-29' and '2017-05-30'
order by a.ActionDate desc;I get an index spool. What that means at a basic level is that the query optimizer build a temporary index on-the-fly because none of the existing indexes against the table were suitable.
The query still finishes quickly for me. Perhaps you aren't getting the index spool optimization on your system or there's something different about the table definition or the query. For educational purposes I can use an undocumented feature
OPTION (QUERYRULEOFF BuildSpool) to disable the index spool. Here's what the plan looks like:Don't be fooled by the appearance of a simple index seek. SQL Server reads nearly 10 million rows from the index:
If I'm going to be running the query more than once then it probably doesn't make sense for the query optimizer to create an index each time it runs. I could create an index upfront that would be more selective for this query:
CREATE NONCLUSTERED INDEX [IDX_my_nme_2] ON #Activity
([Address] ASC) INCLUDE (ActionDate);The plan is similar to before:
However, with the new index SQL Server only reads 1000 rows from the index. 800 of the rows are returned to be counted. The index could be defined to be more selective but this could be good enough depending on your data distribution.
If you aren't able to define any additional indexes on the table I would consider using window functions. The following appears to work:
SELECT t.*
FROM
(
select
a.*
, -1 + ROW_NUMBER() OVER (PARTITION BY [Address] ORDER BY ActionDate) PriorCount
from #Activity a
) t
where t.ActionDate between '2017-05-29' and '2017-05-30'
order by t.ActionDate desc;That query does a single scan of the data but does an expensive sort and calculates the
ROW_NUMBER() function for every row in the table, so it feels like there's some extra work done here:However, if you really like that code pattern you could define an index to make it more efficient:
CREATE NONCLUSTERED INDEX [IDX_my_nme] ON #Activity
([Address], [ActionDate]) INCLUDE (FILLER);That moves the sort towards the end which will be much less expensive:
If none of this helps then you'll need to add more information to the question, preferably including actual execution plans.
Code Snippets
CREATE TABLE #Activity (
Id int NOT NULL,
[Address] varchar(25) NULL,
ActionDate datetime2 NULL,
FILLER varchar(100),
PRIMARY KEY (Id)
);
INSERT INTO #Activity WITH (TABLOCK)
SELECT TOP (50000) -- 50k total rows
x.RN
, x.RN % 10000 -- 10k unique addresses
, DATEADD(DAY, x.RN / 100, '20160201') -- 100 rows per day
, REPLICATE('Z', 100)
FROM
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) x;
CREATE NONCLUSTERED INDEX [IDX_my_nme] ON #Activity
([ActionDate] ASC) INCLUDE ([Address]);select
a.*
, ( select count(*)
from #Activity Activity
where
Activity.[Address] = a.[Address]
and Activity.ActionDate < a.ActionDate
) as PriorCount
from #Activity a
where a.ActionDate between '2017-05-29' and '2017-05-30'
order by a.ActionDate desc;CREATE NONCLUSTERED INDEX [IDX_my_nme_2] ON #Activity
([Address] ASC) INCLUDE (ActionDate);SELECT t.*
FROM
(
select
a.*
, -1 + ROW_NUMBER() OVER (PARTITION BY [Address] ORDER BY ActionDate) PriorCount
from #Activity a
) t
where t.ActionDate between '2017-05-29' and '2017-05-30'
order by t.ActionDate desc;CREATE NONCLUSTERED INDEX [IDX_my_nme] ON #Activity
([Address], [ActionDate]) INCLUDE (FILLER);Context
StackExchange Database Administrators Q#173550, answer score: 17
Revisions (0)
No revisions yet.