HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Index fragmentation while continuously processing

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
whilefragmentationcontinuouslyindexprocessing

Problem

SQL Server 2005

I need to be able to continuously process about 350M records in a 900M record table. The query I'm using to select the records to process becomes badly fragmented as I process and I have a need to stop the processing to rebuild the index. Pseudo data model & query ...

/**************************************/
CREATE TABLE [Table] 
(
    [PrimaryKeyId] [INT] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    [ForeignKeyId] [INT] NOT NULL,
    /* more columns ... */
    [DataType] [CHAR](1) NOT NULL,
    [DataStatus] [DATETIME] NULL,
    [ProcessDate] [DATETIME] NOT NULL,
    [ProcessThreadId] VARCHAR (100) NULL
);

CREATE NONCLUSTERED INDEX [Idx] ON [Table] 
(
    [DataType],
    [DataStatus],
    [ProcessDate],
    [ProcessThreadId]
);
/**************************************/

/**************************************/
WITH cte AS (
    SELECT TOP (@BatchSize) [PrimaryKeyId], [ProcessThreadId]
    FROM [Table] WITH ( ROWLOCK, UPDLOCK, READPAST )
    WHERE [DataType] = 'X'
    AND [DataStatus] IS NULL
    AND [ProcessDate] < DATEADD(m, -2, GETDATE()) -- older than 2 months
    AND [ProcessThreadId] IS NULL
)
UPDATE cte
SET [ProcessThreadId] = @ProcessThreadId;

SELECT * FROM [Table] WITH ( NOLOCK )
WHERE [ProcessThreadId] = @ProcessThreadId;
/**************************************/


Data content...

While the column [DataType] is typed as a CHAR(1), about 35% of all the records equal 'X' with the remainder equaling 'A'.

Of only the records where [DataType] equals 'X', about 10% will have a NOT NULL [DataStatus] value.

The [ProcessDate] and [ProcessThreadId] columns will be updated for every record processed.
The [DataType] column is updated ('X' is changed to 'A') about 10% of the time.
The [DataStatus] column is updated less than 1% of the time.

For now my solution is to select the primary key of all the records to process into a separate processing table. I delete the keys as I process them so that as the index fragments I'm dealing with l

Solution

What you're doing is you're using a table as a queue. Your update is the dequeue method. But the clustered index on the table is a poor choice for a queue. Using tables as Queues actually impose quite stringent requirements on the table design. Your clustered index must be the dequeue order, in this case likely ([DataType], [DataStatus], [ProcessDate]). You can implement the primary key as a nonclustered constraint. Drop the non-clustered index Idx, as the clustered key takes its role.

Another important piece of the puzzle is to keep the row size constant during processing. You have declared the ProcessThreadId as a VARCHAR(100) which implies the row grows and shrinks as is being 'processed' because the field value changes from NULL to non-null. This grow-and-shrink pattern on the row causes page splits and fragmentation. I can't possibly imagine a thread ID that is 'VARCHAR(100)'. Use a fixed length type, perhaps an INT.

As a side note, you do not need to dequeue in two steps (UPDATE followed by SELECT). You can use OUTPUT clause, as explained in the article linked above:

/**************************************/
CREATE TABLE [Table] 
(
    [PrimaryKeyId] [INT] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    [ForeignKeyId] [INT] NOT NULL,
    /* more columns ... */
    [DataType] [CHAR](1) NOT NULL,
    [DataStatus] [DATETIME] NULL,
    [ProcessDate] [DATETIME] NOT NULL,
    [ProcessThreadId] INT NULL
);

CREATE CLUSTERED INDEX [Cdx] ON [Table] 
(
    [DataType],
    [DataStatus],
    [ProcessDate]
);
/**************************************/

declare @BatchSize int, @ProcessThreadId int;

/**************************************/
WITH cte AS (
    SELECT TOP (@BatchSize) [PrimaryKeyId], [ProcessThreadId] , ... more columns 
    FROM [Table] WITH ( ROWLOCK, UPDLOCK, READPAST )
    WHERE [DataType] = 'X'
    AND [DataStatus] IS NULL
    AND [ProcessDate] < DATEADD(m, -2, GETDATE()) -- older than 2 months
    AND [ProcessThreadId] IS NULL
)
UPDATE cte
SET [ProcessThreadId] = @ProcessThreadId
OUTPUT DELETED.[PrimaryKeyId] , ... more columns ;
/**************************************/


In addition I would consider moving successfully processed items into a different, archive, table. You want your queue tables to hover near zero size, you do no want them to grow as they retain 'history' from unneeded old entries. You may also consider partitioning by [ProcessDate] as an alternative (ie. one current active partition that acts as the queue and stores entries with NULL ProcessDate, and another partition for everything non-null. Or multiple partitions for non-null if you want to implement efficient deletes (switch out) for data that has passed the mandated retention period. If things get hot you can partition in addition by [DataType] if it has enough selectivity, but that design would be really complicated as it requires partitioning by persisted computed column (a composite column that glues together [DataType] and [ProcessingDate]).

Code Snippets

/**************************************/
CREATE TABLE [Table] 
(
    [PrimaryKeyId] [INT] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    [ForeignKeyId] [INT] NOT NULL,
    /* more columns ... */
    [DataType] [CHAR](1) NOT NULL,
    [DataStatus] [DATETIME] NULL,
    [ProcessDate] [DATETIME] NOT NULL,
    [ProcessThreadId] INT NULL
);

CREATE CLUSTERED INDEX [Cdx] ON [Table] 
(
    [DataType],
    [DataStatus],
    [ProcessDate]
);
/**************************************/

declare @BatchSize int, @ProcessThreadId int;

/**************************************/
WITH cte AS (
    SELECT TOP (@BatchSize) [PrimaryKeyId], [ProcessThreadId] , ... more columns 
    FROM [Table] WITH ( ROWLOCK, UPDLOCK, READPAST )
    WHERE [DataType] = 'X'
    AND [DataStatus] IS NULL
    AND [ProcessDate] < DATEADD(m, -2, GETDATE()) -- older than 2 months
    AND [ProcessThreadId] IS NULL
)
UPDATE cte
SET [ProcessThreadId] = @ProcessThreadId
OUTPUT DELETED.[PrimaryKeyId] , ... more columns ;
/**************************************/

Context

StackExchange Database Administrators Q#16617, answer score: 4

Revisions (0)

No revisions yet.