patternsqlMinor
Index fragmentation while continuously processing
Viewed 0 times
whilefragmentationcontinuouslyindexprocessing
Problem
SQL Server 2005
I need to be able to continuously process about 350M records in a 900M record table. The query I'm using to select the records to process becomes badly fragmented as I process and I have a need to stop the processing to rebuild the index. Pseudo data model & query ...
Data content...
While the column [DataType] is typed as a CHAR(1), about 35% of all the records equal 'X' with the remainder equaling 'A'.
Of only the records where [DataType] equals 'X', about 10% will have a NOT NULL [DataStatus] value.
The [ProcessDate] and [ProcessThreadId] columns will be updated for every record processed.
The [DataType] column is updated ('X' is changed to 'A') about 10% of the time.
The [DataStatus] column is updated less than 1% of the time.
For now my solution is to select the primary key of all the records to process into a separate processing table. I delete the keys as I process them so that as the index fragments I'm dealing with l
I need to be able to continuously process about 350M records in a 900M record table. The query I'm using to select the records to process becomes badly fragmented as I process and I have a need to stop the processing to rebuild the index. Pseudo data model & query ...
/**************************************/
CREATE TABLE [Table]
(
[PrimaryKeyId] [INT] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[ForeignKeyId] [INT] NOT NULL,
/* more columns ... */
[DataType] [CHAR](1) NOT NULL,
[DataStatus] [DATETIME] NULL,
[ProcessDate] [DATETIME] NOT NULL,
[ProcessThreadId] VARCHAR (100) NULL
);
CREATE NONCLUSTERED INDEX [Idx] ON [Table]
(
[DataType],
[DataStatus],
[ProcessDate],
[ProcessThreadId]
);
/**************************************/
/**************************************/
WITH cte AS (
SELECT TOP (@BatchSize) [PrimaryKeyId], [ProcessThreadId]
FROM [Table] WITH ( ROWLOCK, UPDLOCK, READPAST )
WHERE [DataType] = 'X'
AND [DataStatus] IS NULL
AND [ProcessDate] < DATEADD(m, -2, GETDATE()) -- older than 2 months
AND [ProcessThreadId] IS NULL
)
UPDATE cte
SET [ProcessThreadId] = @ProcessThreadId;
SELECT * FROM [Table] WITH ( NOLOCK )
WHERE [ProcessThreadId] = @ProcessThreadId;
/**************************************/Data content...
While the column [DataType] is typed as a CHAR(1), about 35% of all the records equal 'X' with the remainder equaling 'A'.
Of only the records where [DataType] equals 'X', about 10% will have a NOT NULL [DataStatus] value.
The [ProcessDate] and [ProcessThreadId] columns will be updated for every record processed.
The [DataType] column is updated ('X' is changed to 'A') about 10% of the time.
The [DataStatus] column is updated less than 1% of the time.
For now my solution is to select the primary key of all the records to process into a separate processing table. I delete the keys as I process them so that as the index fragments I'm dealing with l
Solution
What you're doing is you're using a table as a queue. Your update is the dequeue method. But the clustered index on the table is a poor choice for a queue. Using tables as Queues actually impose quite stringent requirements on the table design. Your clustered index must be the dequeue order, in this case likely
Another important piece of the puzzle is to keep the row size constant during processing. You have declared the
As a side note, you do not need to dequeue in two steps (UPDATE followed by SELECT). You can use OUTPUT clause, as explained in the article linked above:
In addition I would consider moving successfully processed items into a different, archive, table. You want your queue tables to hover near zero size, you do no want them to grow as they retain 'history' from unneeded old entries. You may also consider partitioning by
([DataType], [DataStatus], [ProcessDate]). You can implement the primary key as a nonclustered constraint. Drop the non-clustered index Idx, as the clustered key takes its role. Another important piece of the puzzle is to keep the row size constant during processing. You have declared the
ProcessThreadId as a VARCHAR(100) which implies the row grows and shrinks as is being 'processed' because the field value changes from NULL to non-null. This grow-and-shrink pattern on the row causes page splits and fragmentation. I can't possibly imagine a thread ID that is 'VARCHAR(100)'. Use a fixed length type, perhaps an INT. As a side note, you do not need to dequeue in two steps (UPDATE followed by SELECT). You can use OUTPUT clause, as explained in the article linked above:
/**************************************/
CREATE TABLE [Table]
(
[PrimaryKeyId] [INT] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
[ForeignKeyId] [INT] NOT NULL,
/* more columns ... */
[DataType] [CHAR](1) NOT NULL,
[DataStatus] [DATETIME] NULL,
[ProcessDate] [DATETIME] NOT NULL,
[ProcessThreadId] INT NULL
);
CREATE CLUSTERED INDEX [Cdx] ON [Table]
(
[DataType],
[DataStatus],
[ProcessDate]
);
/**************************************/
declare @BatchSize int, @ProcessThreadId int;
/**************************************/
WITH cte AS (
SELECT TOP (@BatchSize) [PrimaryKeyId], [ProcessThreadId] , ... more columns
FROM [Table] WITH ( ROWLOCK, UPDLOCK, READPAST )
WHERE [DataType] = 'X'
AND [DataStatus] IS NULL
AND [ProcessDate] < DATEADD(m, -2, GETDATE()) -- older than 2 months
AND [ProcessThreadId] IS NULL
)
UPDATE cte
SET [ProcessThreadId] = @ProcessThreadId
OUTPUT DELETED.[PrimaryKeyId] , ... more columns ;
/**************************************/In addition I would consider moving successfully processed items into a different, archive, table. You want your queue tables to hover near zero size, you do no want them to grow as they retain 'history' from unneeded old entries. You may also consider partitioning by
[ProcessDate] as an alternative (ie. one current active partition that acts as the queue and stores entries with NULL ProcessDate, and another partition for everything non-null. Or multiple partitions for non-null if you want to implement efficient deletes (switch out) for data that has passed the mandated retention period. If things get hot you can partition in addition by [DataType] if it has enough selectivity, but that design would be really complicated as it requires partitioning by persisted computed column (a composite column that glues together [DataType] and [ProcessingDate]).Code Snippets
/**************************************/
CREATE TABLE [Table]
(
[PrimaryKeyId] [INT] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
[ForeignKeyId] [INT] NOT NULL,
/* more columns ... */
[DataType] [CHAR](1) NOT NULL,
[DataStatus] [DATETIME] NULL,
[ProcessDate] [DATETIME] NOT NULL,
[ProcessThreadId] INT NULL
);
CREATE CLUSTERED INDEX [Cdx] ON [Table]
(
[DataType],
[DataStatus],
[ProcessDate]
);
/**************************************/
declare @BatchSize int, @ProcessThreadId int;
/**************************************/
WITH cte AS (
SELECT TOP (@BatchSize) [PrimaryKeyId], [ProcessThreadId] , ... more columns
FROM [Table] WITH ( ROWLOCK, UPDLOCK, READPAST )
WHERE [DataType] = 'X'
AND [DataStatus] IS NULL
AND [ProcessDate] < DATEADD(m, -2, GETDATE()) -- older than 2 months
AND [ProcessThreadId] IS NULL
)
UPDATE cte
SET [ProcessThreadId] = @ProcessThreadId
OUTPUT DELETED.[PrimaryKeyId] , ... more columns ;
/**************************************/Context
StackExchange Database Administrators Q#16617, answer score: 4
Revisions (0)
No revisions yet.