HiveBrain v1.2.0
Get Started
← Back to all entries
principlesqlMinor

Indexing strategy when using the between operator SQL Server 2008

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
the2008strategyoperatorindexingsqlbetweenusingwhenserver

Problem

I have a large table ~25 million rows with the structure

CREATE TABLE [dbo].[rx](
            [pat_id] [int] NOT NULL,
            [fill_Date] [date] NOT NULL,
            [script_End_Date]  AS (dateadd(day,[dayssup],[filldate])) persisted,
            [drug_Name] [varchar](50) NULL,
            [days_Sup] [int] NOT NULL,
            [quantity] [float] NOT NULL,
            [drug_Class] [char](3) NOT  NULL,
            CHECK(fill_Date <=script_End_Date
PRIMARY KEY NONCLUSTERED 
(
          [clmid]
)

create clustered index ix_rx_temporal on rx(fill_date asc, script_end_date asc, pat_id asc)


The primary key on this table is never queried on. This table by far is used most often with queries that involve date ranges. I have a calendar table with the structure

CREATE TABLE [dbo].[Calendar](
             [cal_date] [date] PRIMARY KEY,
[Year] AS YEAR(cal_date) PERSISTED,
[Month] AS MONTH(cal_date) PERSISTED,
[Day] AS DAY(cal_date) PERSISTED,
             [julian_seq] AS 1+DATEDIFF(DD, CONVERT(DATE, CONVERT(varchar,YEAR(cal_date))+'0101'),cal_date));


The query that I'm trying to speed up is:

```
;WITH x
AS (
--join finds the amount of distinct drugs that a person was prescribed on a given day
SELECT rx.pat_id,
c.cal_date,
Count(DISTINCT rx.drug_name) AS distinctDrugs
FROM rx,
calendar AS c
WHERE c.cal_date BETWEEN rx.fill_date AND rx.script_end_date
GROUP BY rx.pat_id,
c.cal_date),
y
AS (
--makes a sequence so that contiguous dates can be grouped together as a date range
SELECT x.pat_id,
x.distinctdrugs,
c2.julian_seq- Row_number()
OVER(
partition BY x.pat_id, distinctdrugs
ORDER BY x.cal_date) AS rn,
x.cal_date
FROM x,
calendar AS c2

Solution

This is not a complete answer, I do not have the time now, so let me just share a few thoughts. The complete answer would be huge, and I am not sure you want to know the details.

I have been working with various temporal queries for several years already, and learned a lot in the process. As such, I would rather not have to optimize your query in my production system. I would try very hard to avoid solving it with T-SQL. It is a complex problem. Itzik Ben-Gan has written about "gaps and islands" several times, including a chapter in his latest book on OLAP functions. Your problem is a variation of gaps and islands.

First, I would consider reading all the data to the client and solve it there using loops. I know, it requires sending data over the network, but fast loops in Java/C++/C# work very well for me most of the time. For instance, once I was struggling with a query involving time series and temporal data. When I moved most of the logic to the client, the C# solution was several times shorter and it ran 20,000 times faster. That's not a typo - twenty thousand times faster.

There is another problem with solving such problems in T-SQL - your performance may be unstable. If a query is complex, all of a sudden the optimizer can choose another plan and it will run many times slower, and we have to optimize it again.

Alternatively, I would consider storing data differently. Right now I see two possible approaches.

First, instead of storing intervals we could use this table:

ClientId,
PrescriptionId,
DrugId,
Date


We can use trusted constraints to make sure each PrescriptionId covers a range of dates without gaps and overlaps, so that one interval stores as one unbroken sequence of dates.

Note: I know you are using DISTINCT in your first subquery, so you are assuming that one person can take one drug on one day from more than one prescription. I am not mkaing this assumption, for simplicity. Are you sure it is a correct assumption? If yes, we will have to change the design.

Once we have this table, we can essentially materialize your first subquery as an indexed view:

SELECT ClientId,Date,COUNT_BIG(*) AS DistinctDrugs
GROUP BY ClientId,
Date


THat done, you can use your second subquery to group data points into interval, or just run it on a client, where is can be solved as one trivial loop.

Second approach: instead of intervals I would store sequence of events. there woudl be two kinds of events: interval start and interval end. With every I would store a running total, the number of open events after this event has happened. Essentially this running total is the number of prescriptions active after the event has happened.

As in the previous approach, much of the data you calculate on the fly every time you run your query is pre-calculated in this table. We can use trusted constraints to ensure the integrity of precalculated data.

I can describe it in more detail later if you are interested.

Code Snippets

ClientId,
PrescriptionId,
DrugId,
Date
SELECT ClientId,Date,COUNT_BIG(*) AS DistinctDrugs
GROUP BY ClientId,
Date

Context

StackExchange Database Administrators Q#34541, answer score: 2

Revisions (0)

No revisions yet.