HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Filtering data ordered by rowversion

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rowversiondataorderedfiltering

Problem

I have a SQL table of data with the following structure :

CREATE TABLE Data(
Id uniqueidentifier NOT NULL,
Date datetime NOT NULL,
Value decimal(20, 10) NULL,
RV timestamp NOT NULL,
CONSTRAINT PK_Data PRIMARY KEY CLUSTERED (Id, Date)
)


The number of distinct Ids ranges from 3000 to 50000.

The size of the table varies up to over a billion rows.

One Id can cover between a few rows up to 5% of the table.

The single most executed query on this table is :

SELECT Id, Date, Value, RV
FROM Data
WHERE Id = @Id
AND Date Between @StartDate AND @StopDate


I now have to implement incremental retrieval of data on a subset of Ids, including updates.

I then used a request scheme in which the caller provide a specific rowversion, retrieve a block of data and use the maximum rowversion value of the returned data for the subsequent call.

I have written this procedure :

CREATE TYPE guid_list_tbltype AS TABLE (Id uniqueidentifier not null primary key)


CREATE PROCEDURE GetData
@Ids guid_list_tbltype READONLY,
@Cursor rowversion,
@MaxRows int
AS
BEGIN
SELECT A.*
FROM (
SELECT
Data.Id,
Date,
Value,
RV,
ROW_NUMBER() OVER (ORDER BY RV) AS RN
FROM Data
inner join (SELECT Id FROM @Ids) Ids ON Ids.Id = Data.Id
WHERE RV > @Cursor
) A
WHERE RN

Where
@MaxRows will range between 500,000 and 2,000,000 depending on how chunked the client will want his data.

I have tried different approaches :

  • Indexing on (Id, RV) :



CREATE NONCLUSTERED INDEX IDX_IDRV ON Data(Id, RV) INCLUDE(Date, Value);


Using the index, the query seek the rows where
RV = @Cursor for each Id in @Ids, read the following rows then merge the result and sort.

Efficiency then depends on the relative position of
@Cursor` value.

If it is close to the end of data (ordered by RV) the query is instantaneous and if not the query can take up to minutes (nev

Solution

One solution is for the client application to remember the maximum rowversion per ID. The user-defined table type would change to:

CREATE TYPE
    dbo.guid_list_tbltype
AS TABLE 
    (
    Id      uniqueidentifier PRIMARY KEY, 
    LastRV  rowversion NOT NULL
    );


The query in the procedure can then be rewritten to use the APPLY pattern (see my SQLServerCentral articles part 1 and part 2 - free login required). The key to good performance here is the ORDER BY - it avoids unordered pre-fetching on the nested loops join. The RECOMPILE is necessary to allow the optimizer to see the cardinality of the table variable at compilation time (probably resulting in a desirable parallel plan).

ALTER PROCEDURE dbo.GetData

    @IDs        guid_list_tbltype READONLY,
    @MaxRows    bigint

AS
BEGIN

    SELECT TOP (@MaxRows)
        d.Id,
        d.[Date],
        d.Value,
        d.RV
    FROM @Ids AS i
    CROSS APPLY
    (
        SELECT
            d.*
        FROM dbo.Data AS d
        WHERE
            d.Id = i.Id
            AND d.RV > i.LastRV
    ) AS d
    ORDER BY
        i.Id,
        d.RV
    OPTION (RECOMPILE);

END;


You should get a post-execution query plan like this (estimated plan will be serial):

Code Snippets

CREATE TYPE
    dbo.guid_list_tbltype
AS TABLE 
    (
    Id      uniqueidentifier PRIMARY KEY, 
    LastRV  rowversion NOT NULL
    );
ALTER PROCEDURE dbo.GetData

    @IDs        guid_list_tbltype READONLY,
    @MaxRows    bigint

AS
BEGIN

    SELECT TOP (@MaxRows)
        d.Id,
        d.[Date],
        d.Value,
        d.RV
    FROM @Ids AS i
    CROSS APPLY
    (
        SELECT
            d.*
        FROM dbo.Data AS d
        WHERE
            d.Id = i.Id
            AND d.RV > i.LastRV
    ) AS d
    ORDER BY
        i.Id,
        d.RV
    OPTION (RECOMPILE);

END;

Context

StackExchange Database Administrators Q#22129, answer score: 5

Revisions (0)

No revisions yet.