HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Fastest way to split/store a long string for charindex function

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
longwaystorecharindexfunctionsplitfastestforstring

Problem

I have a 1 TB string of digits. Given a 12-character sequence of digits I want to get the start-position of this sequence in the original string (charindex function).

I have tested this with a 1GB string and a 9-digit-substring using SQL Server, storing the string as a varchar(max). Charindex takes 10 secs. Breaking up the 1GB string in 900 byte overlapping chunks and creating a table (StartPositionOfChunk, Chunkofstring) with chunkofstring in binary collation, indexed takes under 1 sec. Latter method for 10GB,10 digit-substring rises charindex to 1,5 min. I would like to find a faster storage method.
Example

string of digits: 0123456789 - substring to search 345

charindex('345','0123456789') gives 4

Method 1: I can now store this in a SQL Server table strtable consisting of one column colstr and perform:

select charindex('345',colstr) from strtable


Method 2: or I can make up a table strtable2 (pos,colstr1) by splitting up the original string: 1;012 | 2;123 | 3;234 a.s.o and then we can have the query

select pos from strtable2 where colstr1='345'


Method 3: I can make up a table strtable2 (pos2,colstr2) by splitting up the original string into larger chunks 1;01234 | 4;34567 | 7;6789 and then

select pos2+charindex('345',colstr2) from strtable2 where colstr2 like '%345%'


First method is the slowest.

Second method blows up the database storage size!

Method 3: Setting colstr2 length to 900 bytes in binary collation, creating an index on this column takes 1 sec for 1GB string and 9 digit substring search. For 10GB string and 10 digit substring ist takes 90 secs.

Any other idea how to make this faster (maybe by utilizing the string consists of Digits, with Long integers,....)?

Search is always for a 12 digit substring in a 1TB string of digits SQL Server 2017 Developer Edition, 16 cores, 16GB RAM. Primary goal is search speed! 10 digits in a 10GB string (for performance testing).

Solution

Storing and processing 1TB of data with only 16GB RAM available may prove to be a challenge. 1GB per core is rather unbalanced, especially for this kind of workload. 8GB per core would be a much better starting point, with more desirable.

That said, I would still be tempted to try a variation of method 2:

Store all possible 12-character substrings as bigint in a clustered columnstore table (with archive compression if that turns out to be useful):

CREATE TABLE dbo.Search
(
    pos bigint NOT NULL,
    fragment bigint NOT NULL,

    INDEX CCS CLUSTERED COLUMNSTORE 
        WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE) -- optional
);


You will probably have to implement some way of incrementally loading the source data into this table. Make sure you end up with maximally-sized row groups (1,048,576 rows) in the finished columnstore structure. See the data-loading guidance.

You might stage rows in multiples of 1048576 in an unindexed rowstore table before creating a clustered columnstore index on that, then switching the result directly into a partitioned main table. The exact approach depends on how you intend to load the data, whether it will be added to, and how familiar you are with SQL Server in general.

Very good performance is possible with this method, but as so often with columnstore, you would need to achieve effective partition and segment elimination. Partitioning on the fragment column and building the columnstore index serially while replacing a rowstore clustered index keyed on fragment is the way to achieve this, as noted in the documentation linked above. This will also minimize the storage needs, since fragment values in the same range will be stored in the same segment. This allows effective value rebasing and bit packing.

While loading, try to limit the strings you are working with within SQL Server to non-LOB (max) types. If you do find working with LOBs best for throughput, there is often a sweet spot of data length to be found, above which performance drops off significantly.

Depending on the final size of the structure and the speed of your I/O subsystem, you may find that this approach offers consistently good enough performance. Increasing the memory available would improve things markedly.

Code Snippets

CREATE TABLE dbo.Search
(
    pos bigint NOT NULL,
    fragment bigint NOT NULL,

    INDEX CCS CLUSTERED COLUMNSTORE 
        WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE) -- optional
);

Context

StackExchange Database Administrators Q#229892, answer score: 6

Revisions (0)

No revisions yet.