patternsqlMinor
Optimal way to record and retrieve a time-dependent value
Viewed 0 times
optimalwaytimevaluerecordretrieveanddependent
Problem
I have a table that has transactions for a bus (boarding riders). Given the route ID and the date, I need to look up in another table what service type it was doing that day. Bus schedules change at most every 6 months or so, with most going years unchanged.
Currently the schedule table is defined like so:
An example might look like:
So, if I have a transaction from 2015-04-20 for
Basically, the table represents that route 301 was a standard route between 2015-01-01 and 2016-05-31 (and thus any transactions during that period should be categorized as "Standard"), then it was Discontinued on 2016-06-01 (through current day, implicitly as there is no later schedule change noted), while 302 was a Standard route from 2015-01-01 through 2016-12-31, then a ParaTrans(it) route after.
Currently, the query to do this looks like this:
```
SELECT
TRANSIT_DAY,
ROUTE_ID,
(SELECT TOP (1) Type FROM Routes
WHERE (RouteID = dbo.DAILY_SALES_DETAIL.ROUTE_I
Currently the schedule table is defined like so:
CREATE TABLE [dbo].[Routes](
[ID] [int] NOT NULL,
[RouteID] [int] NOT NULL,
[Type] [varchar](50) NOT NULL,
[StartDate] [datetime] NOT NULL,
PRIMARY KEY CLUSTERED
(
[ID] ASC
));An example might look like:
ID RouteID Type StartDate
-- ------- ------------ ----------
1 301 Standard 2015-01-01
2 301 Discontinued 2016-06-01
3 302 Standard 2015-01-01
4 302 ParaTrans 2017-01-01So, if I have a transaction from 2015-04-20 for
RouteID 301, I want to get back "Standard", but if the transaction is from 2018-01-20, it should return "Discontinued". For transactions prior to 2015-01-01, it should return NULL (or "", or anything other than a result that might conflict with a valid answer, i.e., "Standard", "Paratrans", or "Discontinued").Basically, the table represents that route 301 was a standard route between 2015-01-01 and 2016-05-31 (and thus any transactions during that period should be categorized as "Standard"), then it was Discontinued on 2016-06-01 (through current day, implicitly as there is no later schedule change noted), while 302 was a Standard route from 2015-01-01 through 2016-12-31, then a ParaTrans(it) route after.
Route Type Start End
----- ---- ----- ---
301
Standard 2015-01-01 2016-05-31
Discontinued 2016-06-01 Present
302
Standard 2015-01-01 2016-12-31
ParaTrans 2017-01-01 PresentCurrently, the query to do this looks like this:
```
SELECT
TRANSIT_DAY,
ROUTE_ID,
(SELECT TOP (1) Type FROM Routes
WHERE (RouteID = dbo.DAILY_SALES_DETAIL.ROUTE_I
Solution
You could increase performance of your setup, as shown in your question, by changing the
The key here is we're defining the clustered index, which is the table, on the compound of
Instead of doing that, I would consider modifying the
Note the clustered index is now on
The query can now use an
This allows SQL Server to do a simple inner loop join to obtain results. Granted, if you are returning a large amount of rows, there will be a significant sort required, that will probably spill to tempdb.
Using the MCVE I've shown below, we can compare plans for the the two variants. The first plan is your original query with the correlated subquery. The second plan is with the
The 2nd variant has a plan cost which is about 4 times lower than the 1st variant. The sort operators in both plans are requesting 108MB of memory and spilling over 9,000 pages to tempdb - however it's fairly unlikely you'll be requesting the entire result set vs getting a single route, or perhaps a date range. If you add a filter for a single route, there is no large memory grant or spill to tempdb.
What follows is a sample MCVE with 10,000 route rows, and 1,000,000 transaction rows, that can be used to run tests against various designs:
Do this in tempdb to avoid any "accidents" with real tables.
Drop the tables if they exist (this works on SQL Server 2016+):
Create the
Insert 10,000 route rows:
Create the
Insert 1,000,000 rows into the
For a
```
CREATE TABLE dbo.RoutesEndDate(
RouteID int NOT NULL
, [Type] varchar(50) NOT NULL
, StartDate datetime NOT NULL
, EndDate datetime NOT NULL
, CONSTRAINT PK_R
dbo.Routes table to:CREATE TABLE dbo.Routes(
RouteID int NOT NULL
, [Type] varchar(50) NOT NULL
, StartDate datetime NOT NULL
, CONSTRAINT PK_Routes
PRIMARY KEY CLUSTERED
(RouteID, StartDate DESC)
) WITH (DATA_COMPRESSION = PAGE)
ON [PRIMARY];The key here is we're defining the clustered index, which is the table, on the compound of
RouteID and StartDate DESC. This provides the data in exactly the way that is most efficient for the query you've written. The caveat emptor here is inserts into the dbo.Routes for an existing route with a new date will cause page splits to occur since we'll be filling rows in descending order of date. Having said that, with a small number of rows in the Route table, and with the occasional index maintenance, this should not be a big concern.Instead of doing that, I would consider modifying the
dbo.Routes table to include an EndDate column. This eliminates the need to do a subquery with TOP(1) and ORDER BY .... Something like:CREATE TABLE dbo.Routes(
RouteID int NOT NULL
, [Type] varchar(50) NOT NULL
, StartDate datetime NOT NULL
, EndDate datetime NOT NULL
, CONSTRAINT PK_Routes
PRIMARY KEY CLUSTERED
(RouteID, StartDate ASC)
);Note the clustered index is now on
(RouteID, StartDate ASC).The query can now use an
INNER JOIN, instead of the correlated subquery, and looks like:SELECT
t.TRANSIT_DAY
, t.ROUTE_ID
, NCTD_MODE = r.Type
FROM Transactions t
INNER JOIN dbo.Routes r ON t.ROUTE_ID = r.RouteID
AND t.TRANSIT_DAY >= r.StartDate
AND t.TRANSIT_DAY < r.EndDate
ORDER BY t.TRANSIT_DAY
, t.ROUTE_ID;This allows SQL Server to do a simple inner loop join to obtain results. Granted, if you are returning a large amount of rows, there will be a significant sort required, that will probably spill to tempdb.
Using the MCVE I've shown below, we can compare plans for the the two variants. The first plan is your original query with the correlated subquery. The second plan is with the
EndDate column included.The 2nd variant has a plan cost which is about 4 times lower than the 1st variant. The sort operators in both plans are requesting 108MB of memory and spilling over 9,000 pages to tempdb - however it's fairly unlikely you'll be requesting the entire result set vs getting a single route, or perhaps a date range. If you add a filter for a single route, there is no large memory grant or spill to tempdb.
What follows is a sample MCVE with 10,000 route rows, and 1,000,000 transaction rows, that can be used to run tests against various designs:
Do this in tempdb to avoid any "accidents" with real tables.
USE tempdb;Drop the tables if they exist (this works on SQL Server 2016+):
DROP TABLE IF EXISTS dbo.Routes;
DROP TABLE IF EXISTS dbo.Transactions;Create the
dbo.Routes table, with a clustered index on RouteID, StartDate DESC:CREATE TABLE dbo.Routes(
RouteID int NOT NULL
, [Type] varchar(50) NOT NULL
, StartDate datetime NOT NULL
, CONSTRAINT PK_Routes
PRIMARY KEY CLUSTERED
(RouteID, StartDate DESC)
);Insert 10,000 route rows:
;WITH src AS (
SELECT t.n
FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))t(n)
)
, src2 AS (
SELECT RouteID = (s1.n * 1000) + (s2.n * 100) + (s3.n * 10)
, Type = REPLICATE(CHAR(65 + CONVERT(int, CRYPT_GEN_RANDOM(1) % 26)), 50)
FROM src s1
CROSS JOIN src s2
CROSS JOIN src s3
CROSS JOIN src s4
)
INSERT INTO dbo.Routes (RouteID, [Type], StartDate)
SELECT s.RouteID
, s.Type
, StartDate = DATEADD(DAY, ROW_NUMBER() OVER (PARTITION BY RouteID ORDER BY s.RouteID) - 1, '1997-01-01T00:00:00')
FROM src2 sCreate the
dbo.Transactions, with a clustered index on ROUTE_ID, TRANSIT_DAY. Building the clustered index like that optimizes queries that filter on both route and day. CREATE TABLE dbo.Transactions(
TRANSIT_DAY datetime NOT NULL
, ROUTE_ID int NOT NULL
, CONSTRAINT PK_Transactions
PRIMARY KEY CLUSTERED
(ROUTE_ID, TRANSIT_DAY)
);Insert 1,000,000 rows into the
dbo.Transactions table:;WITH src AS (
SELECT t.n
FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))t(n)
)
INSERT INTO dbo.Transactions (TRANSIT_DAY, ROUTE_ID)
SELECT DATEADD(DAY, CONVERT(int, CRYPT_GEN_RANDOM(1)), '1997-01-01') + DATEADD(MILLISECOND, ABS(CONVERT(int, CRYPT_GEN_RANDOM(4))), '00:00:00')
, r.RouteID
FROM dbo.Routes r
CROSS JOIN src s1
CROSS JOIN src s2For a
Routes table with an EndDate column that can be used for comparative tests, I used this:```
CREATE TABLE dbo.RoutesEndDate(
RouteID int NOT NULL
, [Type] varchar(50) NOT NULL
, StartDate datetime NOT NULL
, EndDate datetime NOT NULL
, CONSTRAINT PK_R
Code Snippets
CREATE TABLE dbo.Routes(
RouteID int NOT NULL
, [Type] varchar(50) NOT NULL
, StartDate datetime NOT NULL
, CONSTRAINT PK_Routes
PRIMARY KEY CLUSTERED
(RouteID, StartDate DESC)
) WITH (DATA_COMPRESSION = PAGE)
ON [PRIMARY];CREATE TABLE dbo.Routes(
RouteID int NOT NULL
, [Type] varchar(50) NOT NULL
, StartDate datetime NOT NULL
, EndDate datetime NOT NULL
, CONSTRAINT PK_Routes
PRIMARY KEY CLUSTERED
(RouteID, StartDate ASC)
);SELECT
t.TRANSIT_DAY
, t.ROUTE_ID
, NCTD_MODE = r.Type
FROM Transactions t
INNER JOIN dbo.Routes r ON t.ROUTE_ID = r.RouteID
AND t.TRANSIT_DAY >= r.StartDate
AND t.TRANSIT_DAY < r.EndDate
ORDER BY t.TRANSIT_DAY
, t.ROUTE_ID;USE tempdb;DROP TABLE IF EXISTS dbo.Routes;
DROP TABLE IF EXISTS dbo.Transactions;Context
StackExchange Database Administrators Q#205100, answer score: 5
Revisions (0)
No revisions yet.