HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMajor

What is a scalable way to simulate HASHBYTES using a SQL CLR scalar function?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
whatscalablesqlscalarwayclrfunctionusinghashbytessimulate

Problem

As part of our ETL process, we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.

The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use HASHBYTES with the SHA2_256 algorithm and have found that it does not scale on large servers if many concurrent worker threads are all calling HASHBYTES.

Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96 core server. I test by changing the number of concurrent MAXDOP 8 queries from 1 - 12. Testing with MAXDOP 1 showed the same scalability bottleneck.

As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements:

  • The function must be able to participate in parallel queries



  • The function must be deterministic



  • The function must take an input of an NVARCHAR or VARBINARY string (all relevant columns are concatenated together)



  • The typical input size of the string will be 100 - 20000 characters in length. 20000 is not a max



  • The chance of a hash collision should be roughly equal to or better than the MD5 algorithm. CHECKSUM does not work for us because there are too many collisions.



  • The function must scale well on large servers (throughput per thread should not significantly decrease as the number of threads increases)



For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).

What is a scalable way to simulate HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a bo

Solution

Since you're just looking for changes, you don't need a cryptographic hash function.

You could choose from one of the faster non-cryptographic hashes in the open-source Data.HashFunction library by Brandon Dahler, licensed under the permissive and OSI approved MIT license. SpookyHash is a popular choice.
Example implementation
Source Code
using Microsoft.SqlServer.Server;
using System.Data.HashFunction.SpookyHash;
using System.Data.SqlTypes;

public partial class UserDefinedFunctions
{
[SqlFunction
(
DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None,
IsDeterministic = true,
IsPrecise = true
)
]
public static byte[] SpookyHash
(
[SqlFacet (MaxSize = 8000)]
SqlBinary Input
)
{
ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
return sh.ComputeHash(Input.Value).Hash;
}

[SqlFunction
(
DataAccess = DataAccessKind.None,
IsDeterministic = true,
IsPrecise = true,
SystemDataAccess = SystemDataAccessKind.None
)
]
public static byte[] SpookyHashLOB
(
[SqlFacet (MaxSize = -1)]
SqlBinary Input
)
{
ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
return sh.ComputeHash(Input.Value).Hash;
}
}


The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version. The non-LOB version should be significantly quicker.

You might be able to wrap a LOB binary in COMPRESS to get it under the 8000 byte limit, if that turns out to be worthwhile for performance. Alternatively, you could break the LOB up into sub-8000 byte segments, or simply reserve the use of HASHBYTES for the LOB case (since longer inputs scale better).
Pre-built code

You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:

https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300
T-SQL functions

CREATE FUNCTION dbo.SpookyHash
(
    @Input varbinary(8000)
)
RETURNS binary(16)
WITH 
    RETURNS NULL ON NULL INPUT, 
    EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
GO
CREATE FUNCTION dbo.SpookyHashLOB
(
    @Input varbinary(max)
)
RETURNS binary(16)
WITH 
    RETURNS NULL ON NULL INPUT, 
    EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
GO


Usage

An example use given the sample data in the question:

```
SELECT
HT1.ID
FROM dbo.HB_TBL AS HT1
JOIN dbo.HB_TBL_2 AS HT2
ON HT2.ID = HT1.ID
AND dbo.SpookyHash
(
CONVERT(binary(8), HT2.FK1) + 0x7C +
CONVERT(binary(8), HT2.FK2) + 0x7C +
CONVERT(binary(8), HT2.FK3) + 0x7C +
CONVERT(binary(8), HT2.FK4) + 0x7C +
CONVERT(binary(8), HT2.FK5) + 0x7C +
CONVERT(binary(8), HT2.FK6) + 0x7C +
CONVERT(binary(8), HT2.FK7) + 0x7C +
CONVERT(binary(8), HT2.FK8) + 0x7C +
CONVERT(binary(8), HT2.FK9) + 0x7C +
CONVERT(binary(8), HT2.FK10) + 0x7C +
CONVERT(binary(8), HT2.FK11) + 0x7C +
CONVERT(binary(8), HT2.FK12) + 0x7C +
CONVERT(binary(8), HT2.FK13) + 0x7C +
CONVERT(binary(8), HT2.FK14) + 0x7C +
CONVERT(binary(8), HT2.FK15) + 0x7C +
CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
CONVERT(binary(1), HT2.COMP1) + 0x7C +
CONVERT(binary(1), HT2.COMP2) + 0x7C +
CONVERT(binary(1), HT2.COMP3) + 0x7C +
CONVERT(binary(1), HT2.COMP4) + 0x7C +
CONVERT(binary(1), HT2.COMP5)
)
<> dbo.SpookyHash
(
CONVERT(binary(8), HT1.FK1) + 0x7C +
CONVERT(binary(8), HT1.FK2) + 0x7C +
CONVERT(binary(8), HT1.FK3) + 0x7C +
CONVERT(binary(8), HT1.FK4) + 0x7C +
CONVERT(binary(8), HT1.FK5) + 0x7C +
CONVERT(binary(8), HT1.FK6) + 0x7C +
CONVERT(binary(8), HT1.FK7) + 0x7C +
CONVERT(binary(8), HT1.FK8) + 0x7C +
CONVERT(binary(8), HT1.FK9) + 0x7C +
CONVERT(binary(8), HT1.FK10) + 0x7C +
CONVERT(binary(8), HT1.FK11) + 0x7C +
CONVERT(binary(8), HT1.FK12) + 0x7C +
CONVERT(binary(8), HT1.FK13) + 0x7C +
CONVERT(binary(8), HT1.FK14) + 0x7C +
CONVERT(binary(8), HT1.FK15) + 0x7C +
CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
CONVERT(binary(1), HT1.COMP1) + 0x7C +
CONVERT(binary(1), HT1.COMP2) + 0x7C +
CONVERT(binary(1), HT1.COMP3) + 0x7C +

Code Snippets

CREATE FUNCTION dbo.SpookyHash
(
    @Input varbinary(8000)
)
RETURNS binary(16)
WITH 
    RETURNS NULL ON NULL INPUT, 
    EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
GO
CREATE FUNCTION dbo.SpookyHashLOB
(
    @Input varbinary(max)
)
RETURNS binary(16)
WITH 
    RETURNS NULL ON NULL INPUT, 
    EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
GO
SELECT
    HT1.ID
FROM dbo.HB_TBL AS HT1
JOIN dbo.HB_TBL_2 AS HT2
    ON HT2.ID = HT1.ID
    AND dbo.SpookyHash
    (
        CONVERT(binary(8), HT2.FK1) + 0x7C +
        CONVERT(binary(8), HT2.FK2) + 0x7C +
        CONVERT(binary(8), HT2.FK3) + 0x7C +
        CONVERT(binary(8), HT2.FK4) + 0x7C +
        CONVERT(binary(8), HT2.FK5) + 0x7C +
        CONVERT(binary(8), HT2.FK6) + 0x7C +
        CONVERT(binary(8), HT2.FK7) + 0x7C +
        CONVERT(binary(8), HT2.FK8) + 0x7C +
        CONVERT(binary(8), HT2.FK9) + 0x7C +
        CONVERT(binary(8), HT2.FK10) + 0x7C +
        CONVERT(binary(8), HT2.FK11) + 0x7C +
        CONVERT(binary(8), HT2.FK12) + 0x7C +
        CONVERT(binary(8), HT2.FK13) + 0x7C +
        CONVERT(binary(8), HT2.FK14) + 0x7C +
        CONVERT(binary(8), HT2.FK15) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
        CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
        CONVERT(binary(1), HT2.COMP1) + 0x7C +
        CONVERT(binary(1), HT2.COMP2) + 0x7C +
        CONVERT(binary(1), HT2.COMP3) + 0x7C +
        CONVERT(binary(1), HT2.COMP4) + 0x7C +
        CONVERT(binary(1), HT2.COMP5)
    )
    <> dbo.SpookyHash
    (
        CONVERT(binary(8), HT1.FK1) + 0x7C +
        CONVERT(binary(8), HT1.FK2) + 0x7C +
        CONVERT(binary(8), HT1.FK3) + 0x7C +
        CONVERT(binary(8), HT1.FK4) + 0x7C +
        CONVERT(binary(8), HT1.FK5) + 0x7C +
        CONVERT(binary(8), HT1.FK6) + 0x7C +
        CONVERT(binary(8), HT1.FK7) + 0x7C +
        CONVERT(binary(8), HT1.FK8) + 0x7C +
        CONVERT(binary(8), HT1.FK9) + 0x7C +
        CONVERT(binary(8), HT1.FK10) + 0x7C +
        CONVERT(binary(8), HT1.FK11) + 0x7C +
        CONVERT(binary(8), HT1.FK12) + 0x7C +
        CONVERT(binary(8), HT1.FK13) + 0x7C +
        CONVERT(binary(8), HT1.FK14) + 0x7C +
        CONVERT(binary(8), HT1.FK15) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
        CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
        CONVERT(binary(1), HT1.COMP1) + 0x7C +
        CONVERT(binary(1), HT1.COMP2) + 0x7C +
        CONVERT(binary(1), HT1.COMP3) + 0x7C +
        CONVERT(binary(1), HT1.COMP4) + 0x7C +
        CONVERT(binary(1), HT1.COMP5)
    );

Context

StackExchange Database Administrators Q#228789, answer score: 21

Revisions (0)

No revisions yet.