HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How do I optimize large table so queries that target only recent data perform optimaly?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
targetrecentoptimalylargethatoptimizeperformhowqueriesdata

Problem

So I have this table that is ever growing. Most queries are targeting just recent data, say one month old. I suppose this is common problem but I have no idea how it can be solved.

I am open to changing design or if there is mechanism in MsSql to solve this. I have limited options to try different solutions as database is in production and its hard to reproduce.

CREATE TABLE [dbo].[mydata](
[ID] [int] IDENTITY(1,1) PRIMARY KEY NOT NULL,
[Code] [varchar](20) NOT NULL,   -- index1 UNIQUE NONCLUSTERED INDEX
[Data2] [varchar](20) NULL,      
[Data3] [nvarchar](50) NOT NULL,
... bunch of DATA around 5kb
[Time_1] [datetime] NULL,    -- time created, -- index2 NONCLUSTERED INDEX
[Time_2] [datetime] NULL,    -- time finished ( usualy within few days ) -- index3 NONCLUSTERED INDEX
[Status] [int] NOT NULL,     -- active 
[Modid] [timestamp] NOT NULL
)

Solution

Time series should be clustered by time:

CREATE TABLE [dbo].[mydata](
[ID] [int] IDENTITY(1,1) NOT NULL,
[Code] [varchar](20) NOT NULL,   -- index1 UNIQUE NONCLUSTERED INDEX
[Data2] [varchar](20) NULL,      
[Data3] [nvarchar](50) NOT NULL,
... bunch of DATA around 5kb
[Time_1] [datetime] NULL,    -- time created, 
[Time_2] [datetime] NULL,    -- time finished ( usualy within few days ) -- index3 NONCLUSTERED INDEX
[Status] [int] NOT NULL,     -- active 
[Modid] [timestamp] NOT NULL,
CONSTRAINT NONCLUSTERED PRIMARY KEY ([ID])
);

CREATE CLUSTERED INDEX cdxMyDataTime_1 on dbi.mydata (Time_1);


In time series data the time is almost always specified in queries, and usually as a range. With a clustered key based on the time range queries will scan only the relevant portion of the table.

The ID can continue to server the logical primary key role, but there is little benefit from having the table clustered by it since the ID is never used as a range. So off it goes into a non clustered constraint. Singleton lookups based on ID will need two reads, but who cares, is two fast reads.

If you cannot have the Time_1 as clustered key a frequent trick used is to retrieve the ID range for each day, eg. create table of days and min_ID/max_ID. Then use the ID range that covers the time range you're interested in to restrict the range of the scan on the table. This advantage of this approach is that it works for multiple time columns (you cannot cluster by Time_1 and Time_2...) and it is less invasive (can be tried out right away w/o modifying the table). But this approach is very invasive in the application query design, it requires discipline in remembering to use the ID ranges for the days desired. Note that since the IDs usually don't change, they can be cached in the app.

Simple indexes on Time_1 and Time_2 do not work because they hit the index tipping point. Covering index (with INCLUDE columns) on Time_1 and Time_2 explode the data size as often the included column required are ... all columns.

Code Snippets

CREATE TABLE [dbo].[mydata](
[ID] [int] IDENTITY(1,1) NOT NULL,
[Code] [varchar](20) NOT NULL,   -- index1 UNIQUE NONCLUSTERED INDEX
[Data2] [varchar](20) NULL,      
[Data3] [nvarchar](50) NOT NULL,
... bunch of DATA around 5kb
[Time_1] [datetime] NULL,    -- time created, 
[Time_2] [datetime] NULL,    -- time finished ( usualy within few days ) -- index3 NONCLUSTERED INDEX
[Status] [int] NOT NULL,     -- active 
[Modid] [timestamp] NOT NULL,
CONSTRAINT NONCLUSTERED PRIMARY KEY ([ID])
);

CREATE CLUSTERED INDEX cdxMyDataTime_1 on dbi.mydata (Time_1);

Context

StackExchange Database Administrators Q#39866, answer score: 5

Revisions (0)

No revisions yet.