HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Locate the smallest missing element based on a specific formula

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
smallesttheelementbasedmissinglocateformulaspecific

Problem

I need to be able to locate a missing element from a table with tens-of-millions of rows, and has a primary key of a BINARY(64) column (which is the input value to calculate from). These values are mostly inserted in order, but on occasion I want to reuse a previous value that was deleted. It's infeasible to modify the deleted records with a IsDeleted column, as sometimes a row is inserted that is many millions of values ahead of the currently existing rows. This means the sample data would look something like:

KeyCol : BINARY(64)
0x..000000000001
0x..000000000002
0x..FFFFFFFFFFFF


So inserting all the missing values between 0x000000000002 and 0xFFFFFFFFFFFF is infeasible, the amount of time and space used would be undesirable. Essentially, when I run the algorithm, I expect it to return 0x000000000003, which is the first opening.

I've come up with a binary-search algorithm in C#, which would query the database for each value at position i, and test if that value was expected. For context, my terrible algorithm: https://codereview.stackexchange.com/questions/174498/binary-search-for-a-missing-or-default-value-by-a-given-formula

This algorithm would run, for example, 26-27 SQL-queries on a table with 100,000,000 items. (That doesn't seem like a lot, but it's going to be occurring very frequently.) Currently, this table has approximately 50,000,000 rows in it, and performance is becoming noticeable.

My first alternative thought is to translate this to a stored-procedure, but that has it's own hurdles. (I have to write a BINARY(64) + BINARY(64) algorithm, as well as a slew of other things.) This would be painful, but not infeasible. I've also considered implementing the translation algorithm based on ROW_NUMBER, but I have a really bad gut feeling about this. (A BIGINT is not nearly big enough for these values.)

I'm up for other suggestions, as I really need this to be as quick as possible. For what it's worth the only column selected by

Solution

There are a few challenges with this question. Indexes in SQL Server can do the following very efficiently with just a few logical read each:

  • check that a row exists



  • check that a row doesn't exist



  • find the next row starting at some point



  • find the previous row starting at some point



However, they cannot be used to find the Nth row in an index. Doing that requires you roll your own index stored as a table or to scan the first N rows in the index. Your C# code heavily relies on the fact that you can efficiently find the Nth element of the array, but you can't do that here. I think that algorithm isn't usable for T-SQL without a data model change.

The second challenge relates to the restrictions on the BINARY data types. As far as I can tell you cannot perform addition, subtraction, or division in the usual ways. You can convert your BINARY(64) to a BIGINT and it won't throw conversion errors, but the behavior is not defined:


Conversions between any data type and the binary data types are not
guaranteed to be the same between versions of SQL Server.

In addition, the lack of conversion errors is somewhat of a problem here. You can convert anything larger than the biggest possible BIGINT value but it'll give you the wrong results.

It's true that you have values right now that are bigger than 9223372036854775807. However, if you're always starting at 1 and searching for the smallest minimum value then those large values cannot be relevant unless your table has more than 9223372036854775807 rows. This seems unlikely because your table at that point would be around 2000 exabytes, so for the purposes of answering your question I'm going to assume that the very large values do not need to be searched. I'm also going to do data type conversion because they seem to be unavoidable.

For the test data, I inserted the equivalent of 50 million sequential integers into a table along with 50 million more integers with a single value gap about every 20 values. I also inserted a single value that won't properly fit in a signed BIGINT:

CREATE TABLE dbo.BINARY_PROBLEMS (
    KeyCol BINARY(64) NOT NULL
);

INSERT INTO dbo.BINARY_PROBLEMS WITH (TABLOCK)
SELECT CAST(SUM(OFFSET) OVER (ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS BINARY(64))
FROM
(
    SELECT 1 + CASE WHEN t.RN > 50000000 THEN
        CASE WHEN ABS(CHECKSUM(NewId()) % 20)  = 10 THEN 1 ELSE 0 END
    ELSE 0 END OFFSET
    FROM
    (
        SELECT TOP (100000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
        FROM master..spt_values t1
        CROSS JOIN master..spt_values t2
        CROSS JOIN master..spt_values t3
    ) t
) tt
OPTION (MAXDOP 1);

CREATE UNIQUE CLUSTERED INDEX CI_BINARY_PROBLEMS ON dbo.BINARY_PROBLEMS (KeyCol);

-- add a value too large for BIGINT
INSERT INTO dbo.BINARY_PROBLEMS
SELECT CAST(0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008000000000000000 AS BINARY(64));


That code took a few minutes to run on my machine. I made the first half of the table not have any gaps to represent a sort of worse case for performance. The code that I used to solve the problem scans the index in order so it will finish very quickly if the first gap is early on in the table. Before we get to that let's verify that the data is as it should be:

SELECT TOP (2) KeyColBigInt
FROM
(
    SELECT KeyCol
    , CAST(KeyCol AS BIGINT) KeyColBigInt
    FROM dbo.BINARY_PROBLEMS
) t
ORDER By KeyCol DESC;


The results suggest that the maximum value that we converts to BIGINT is 102500672:

╔══════════════════════╗
║     KeyColBigInt     ║
╠══════════════════════╣
║ -9223372036854775808 ║
║            102500672 ║
╚══════════════════════╝


There are 100 million rows with values that fit into BIGINT as expected:

SELECT COUNT(*) 
FROM dbo.BINARY_PROBLEMS
WHERE KeyCol < 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007FFFFFFFFFFFFFFF;


One approach to this problem is to scan the index in order and to quit as soon as a row's value doesn't match the expected ROW_NUMBER() value. The entire table does not need to be scanned to get the first row: only the rows up until the first gap. Here's one way to write code that is likely to get that query plan:

SELECT TOP (1) KeyCol
FROM
(
    SELECT KeyCol
    , CAST(KeyCol AS BIGINT) KeyColBigInt
    , ROW_NUMBER() OVER (ORDER BY KeyCol) RN
    FROM dbo.BINARY_PROBLEMS
    WHERE KeyCol  RN
ORDER BY KeyCol;


For reasons that won't fit in this answer, this query will often be run serially by SQL Server and SQL Server will often underestimate the number of rows that need to be scanned before the first match is found. On my machine, SQL Server scans 50000022 rows from the index before finding the first match. The query takes 11 seconds to run. Note that this returns the first value past the gap. It's not cle

Code Snippets

CREATE TABLE dbo.BINARY_PROBLEMS (
    KeyCol BINARY(64) NOT NULL
);

INSERT INTO dbo.BINARY_PROBLEMS WITH (TABLOCK)
SELECT CAST(SUM(OFFSET) OVER (ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS BINARY(64))
FROM
(
    SELECT 1 + CASE WHEN t.RN > 50000000 THEN
        CASE WHEN ABS(CHECKSUM(NewId()) % 20)  = 10 THEN 1 ELSE 0 END
    ELSE 0 END OFFSET
    FROM
    (
        SELECT TOP (100000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
        FROM master..spt_values t1
        CROSS JOIN master..spt_values t2
        CROSS JOIN master..spt_values t3
    ) t
) tt
OPTION (MAXDOP 1);

CREATE UNIQUE CLUSTERED INDEX CI_BINARY_PROBLEMS ON dbo.BINARY_PROBLEMS (KeyCol);

-- add a value too large for BIGINT
INSERT INTO dbo.BINARY_PROBLEMS
SELECT CAST(0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008000000000000000 AS BINARY(64));
SELECT TOP (2) KeyColBigInt
FROM
(
    SELECT KeyCol
    , CAST(KeyCol AS BIGINT) KeyColBigInt
    FROM dbo.BINARY_PROBLEMS
) t
ORDER By KeyCol DESC;
╔══════════════════════╗
║     KeyColBigInt     ║
╠══════════════════════╣
║ -9223372036854775808 ║
║            102500672 ║
╚══════════════════════╝
SELECT COUNT(*) 
FROM dbo.BINARY_PROBLEMS
WHERE KeyCol < 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007FFFFFFFFFFFFFFF;
SELECT TOP (1) KeyCol
FROM
(
    SELECT KeyCol
    , CAST(KeyCol AS BIGINT) KeyColBigInt
    , ROW_NUMBER() OVER (ORDER BY KeyCol) RN
    FROM dbo.BINARY_PROBLEMS
    WHERE KeyCol < 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007FFFFFFFFFFFFFFF
) t
WHERE KeyColBigInt <> RN
ORDER BY KeyCol;

Context

StackExchange Database Administrators Q#185123, answer score: 8

Revisions (0)

No revisions yet.