patternsqlMajor
What is a scalable way to simulate HASHBYTES using a SQL CLR scalar function?
Viewed 0 times
whatscalablesqlscalarwayclrfunctionusinghashbytessimulate
Problem
As part of our ETL process, we compare rows from staging against the reporting database to figure out if any of the columns have actually changed since the data was last loaded.
The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use
Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96 core server. I test by changing the number of concurrent
As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements:
For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).
What is a scalable way to simulate
The comparison is based on the unique key of the table and some kind of hash of all of the other columns. We currently use
HASHBYTES with the SHA2_256 algorithm and have found that it does not scale on large servers if many concurrent worker threads are all calling HASHBYTES.Throughput measured in hashes per second does not increase past 16 concurrent threads when testing on a 96 core server. I test by changing the number of concurrent
MAXDOP 8 queries from 1 - 12. Testing with MAXDOP 1 showed the same scalability bottleneck.As a workaround I want to try a SQL CLR solution. Here is my attempt to state the requirements:
- The function must be able to participate in parallel queries
- The function must be deterministic
- The function must take an input of an
NVARCHARorVARBINARYstring (all relevant columns are concatenated together)
- The typical input size of the string will be 100 - 20000 characters in length. 20000 is not a max
- The chance of a hash collision should be roughly equal to or better than the MD5 algorithm.
CHECKSUMdoes not work for us because there are too many collisions.
- The function must scale well on large servers (throughput per thread should not significantly decrease as the number of threads increases)
For Application Reasons™, assume that I cannot save off the value of the hash for the reporting table. It's a CCI which doesn't support triggers or computed columns (there are other problems as well that I don't want to get into).
What is a scalable way to simulate
HASHBYTES using a SQL CLR function? My goal can be expressed as getting as many hashes per second as I can on a large server, so performance matters as well. I am terrible with CLR so I don't know how to accomplish this. If it motivates anyone to answer, I plan on adding a boSolution
Since you're just looking for changes, you don't need a cryptographic hash function.
You could choose from one of the faster non-cryptographic hashes in the open-source Data.HashFunction library by Brandon Dahler, licensed under the permissive and OSI approved MIT license.
Example implementation
Source Code
The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version. The non-LOB version should be significantly quicker.
You might be able to wrap a LOB binary in
Pre-built code
You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:
https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300
T-SQL functions
Usage
An example use given the sample data in the question:
```
SELECT
HT1.ID
FROM dbo.HB_TBL AS HT1
JOIN dbo.HB_TBL_2 AS HT2
ON HT2.ID = HT1.ID
AND dbo.SpookyHash
(
CONVERT(binary(8), HT2.FK1) + 0x7C +
CONVERT(binary(8), HT2.FK2) + 0x7C +
CONVERT(binary(8), HT2.FK3) + 0x7C +
CONVERT(binary(8), HT2.FK4) + 0x7C +
CONVERT(binary(8), HT2.FK5) + 0x7C +
CONVERT(binary(8), HT2.FK6) + 0x7C +
CONVERT(binary(8), HT2.FK7) + 0x7C +
CONVERT(binary(8), HT2.FK8) + 0x7C +
CONVERT(binary(8), HT2.FK9) + 0x7C +
CONVERT(binary(8), HT2.FK10) + 0x7C +
CONVERT(binary(8), HT2.FK11) + 0x7C +
CONVERT(binary(8), HT2.FK12) + 0x7C +
CONVERT(binary(8), HT2.FK13) + 0x7C +
CONVERT(binary(8), HT2.FK14) + 0x7C +
CONVERT(binary(8), HT2.FK15) + 0x7C +
CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
CONVERT(binary(1), HT2.COMP1) + 0x7C +
CONVERT(binary(1), HT2.COMP2) + 0x7C +
CONVERT(binary(1), HT2.COMP3) + 0x7C +
CONVERT(binary(1), HT2.COMP4) + 0x7C +
CONVERT(binary(1), HT2.COMP5)
)
<> dbo.SpookyHash
(
CONVERT(binary(8), HT1.FK1) + 0x7C +
CONVERT(binary(8), HT1.FK2) + 0x7C +
CONVERT(binary(8), HT1.FK3) + 0x7C +
CONVERT(binary(8), HT1.FK4) + 0x7C +
CONVERT(binary(8), HT1.FK5) + 0x7C +
CONVERT(binary(8), HT1.FK6) + 0x7C +
CONVERT(binary(8), HT1.FK7) + 0x7C +
CONVERT(binary(8), HT1.FK8) + 0x7C +
CONVERT(binary(8), HT1.FK9) + 0x7C +
CONVERT(binary(8), HT1.FK10) + 0x7C +
CONVERT(binary(8), HT1.FK11) + 0x7C +
CONVERT(binary(8), HT1.FK12) + 0x7C +
CONVERT(binary(8), HT1.FK13) + 0x7C +
CONVERT(binary(8), HT1.FK14) + 0x7C +
CONVERT(binary(8), HT1.FK15) + 0x7C +
CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
CONVERT(binary(1), HT1.COMP1) + 0x7C +
CONVERT(binary(1), HT1.COMP2) + 0x7C +
CONVERT(binary(1), HT1.COMP3) + 0x7C +
You could choose from one of the faster non-cryptographic hashes in the open-source Data.HashFunction library by Brandon Dahler, licensed under the permissive and OSI approved MIT license.
SpookyHash is a popular choice.Example implementation
Source Code
using Microsoft.SqlServer.Server;
using System.Data.HashFunction.SpookyHash;
using System.Data.SqlTypes;
public partial class UserDefinedFunctions
{
[SqlFunction
(
DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None,
IsDeterministic = true,
IsPrecise = true
)
]
public static byte[] SpookyHash
(
[SqlFacet (MaxSize = 8000)]
SqlBinary Input
)
{
ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
return sh.ComputeHash(Input.Value).Hash;
}
[SqlFunction
(
DataAccess = DataAccessKind.None,
IsDeterministic = true,
IsPrecise = true,
SystemDataAccess = SystemDataAccessKind.None
)
]
public static byte[] SpookyHashLOB
(
[SqlFacet (MaxSize = -1)]
SqlBinary Input
)
{
ISpookyHashV2 sh = SpookyHashV2Factory.Instance.Create();
return sh.ComputeHash(Input.Value).Hash;
}
}
The source provides two functions, one for inputs of 8000 bytes or less, and a LOB version. The non-LOB version should be significantly quicker.
You might be able to wrap a LOB binary in
COMPRESS to get it under the 8000 byte limit, if that turns out to be worthwhile for performance. Alternatively, you could break the LOB up into sub-8000 byte segments, or simply reserve the use of HASHBYTES for the LOB case (since longer inputs scale better).Pre-built code
You can obviously grab the package for yourself and compile everything, but I built the assemblies below to make quick testing easier:
https://gist.github.com/SQLKiwi/365b265b476bf86754457fc9514b2300
T-SQL functions
CREATE FUNCTION dbo.SpookyHash
(
@Input varbinary(8000)
)
RETURNS binary(16)
WITH
RETURNS NULL ON NULL INPUT,
EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
GO
CREATE FUNCTION dbo.SpookyHashLOB
(
@Input varbinary(max)
)
RETURNS binary(16)
WITH
RETURNS NULL ON NULL INPUT,
EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
GOUsage
An example use given the sample data in the question:
```
SELECT
HT1.ID
FROM dbo.HB_TBL AS HT1
JOIN dbo.HB_TBL_2 AS HT2
ON HT2.ID = HT1.ID
AND dbo.SpookyHash
(
CONVERT(binary(8), HT2.FK1) + 0x7C +
CONVERT(binary(8), HT2.FK2) + 0x7C +
CONVERT(binary(8), HT2.FK3) + 0x7C +
CONVERT(binary(8), HT2.FK4) + 0x7C +
CONVERT(binary(8), HT2.FK5) + 0x7C +
CONVERT(binary(8), HT2.FK6) + 0x7C +
CONVERT(binary(8), HT2.FK7) + 0x7C +
CONVERT(binary(8), HT2.FK8) + 0x7C +
CONVERT(binary(8), HT2.FK9) + 0x7C +
CONVERT(binary(8), HT2.FK10) + 0x7C +
CONVERT(binary(8), HT2.FK11) + 0x7C +
CONVERT(binary(8), HT2.FK12) + 0x7C +
CONVERT(binary(8), HT2.FK13) + 0x7C +
CONVERT(binary(8), HT2.FK14) + 0x7C +
CONVERT(binary(8), HT2.FK15) + 0x7C +
CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
CONVERT(binary(1), HT2.COMP1) + 0x7C +
CONVERT(binary(1), HT2.COMP2) + 0x7C +
CONVERT(binary(1), HT2.COMP3) + 0x7C +
CONVERT(binary(1), HT2.COMP4) + 0x7C +
CONVERT(binary(1), HT2.COMP5)
)
<> dbo.SpookyHash
(
CONVERT(binary(8), HT1.FK1) + 0x7C +
CONVERT(binary(8), HT1.FK2) + 0x7C +
CONVERT(binary(8), HT1.FK3) + 0x7C +
CONVERT(binary(8), HT1.FK4) + 0x7C +
CONVERT(binary(8), HT1.FK5) + 0x7C +
CONVERT(binary(8), HT1.FK6) + 0x7C +
CONVERT(binary(8), HT1.FK7) + 0x7C +
CONVERT(binary(8), HT1.FK8) + 0x7C +
CONVERT(binary(8), HT1.FK9) + 0x7C +
CONVERT(binary(8), HT1.FK10) + 0x7C +
CONVERT(binary(8), HT1.FK11) + 0x7C +
CONVERT(binary(8), HT1.FK12) + 0x7C +
CONVERT(binary(8), HT1.FK13) + 0x7C +
CONVERT(binary(8), HT1.FK14) + 0x7C +
CONVERT(binary(8), HT1.FK15) + 0x7C +
CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
CONVERT(binary(1), HT1.COMP1) + 0x7C +
CONVERT(binary(1), HT1.COMP2) + 0x7C +
CONVERT(binary(1), HT1.COMP3) + 0x7C +
Code Snippets
CREATE FUNCTION dbo.SpookyHash
(
@Input varbinary(8000)
)
RETURNS binary(16)
WITH
RETURNS NULL ON NULL INPUT,
EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHash;
GO
CREATE FUNCTION dbo.SpookyHashLOB
(
@Input varbinary(max)
)
RETURNS binary(16)
WITH
RETURNS NULL ON NULL INPUT,
EXECUTE AS OWNER
AS EXTERNAL NAME Spooky.UserDefinedFunctions.SpookyHashLOB;
GOSELECT
HT1.ID
FROM dbo.HB_TBL AS HT1
JOIN dbo.HB_TBL_2 AS HT2
ON HT2.ID = HT1.ID
AND dbo.SpookyHash
(
CONVERT(binary(8), HT2.FK1) + 0x7C +
CONVERT(binary(8), HT2.FK2) + 0x7C +
CONVERT(binary(8), HT2.FK3) + 0x7C +
CONVERT(binary(8), HT2.FK4) + 0x7C +
CONVERT(binary(8), HT2.FK5) + 0x7C +
CONVERT(binary(8), HT2.FK6) + 0x7C +
CONVERT(binary(8), HT2.FK7) + 0x7C +
CONVERT(binary(8), HT2.FK8) + 0x7C +
CONVERT(binary(8), HT2.FK9) + 0x7C +
CONVERT(binary(8), HT2.FK10) + 0x7C +
CONVERT(binary(8), HT2.FK11) + 0x7C +
CONVERT(binary(8), HT2.FK12) + 0x7C +
CONVERT(binary(8), HT2.FK13) + 0x7C +
CONVERT(binary(8), HT2.FK14) + 0x7C +
CONVERT(binary(8), HT2.FK15) + 0x7C +
CONVERT(varbinary(1000), HT2.STR1) + 0x7C +
CONVERT(varbinary(1000), HT2.STR2) + 0x7C +
CONVERT(varbinary(1000), HT2.STR3) + 0x7C +
CONVERT(varbinary(1000), HT2.STR4) + 0x7C +
CONVERT(varbinary(1000), HT2.STR5) + 0x7C +
CONVERT(binary(1), HT2.COMP1) + 0x7C +
CONVERT(binary(1), HT2.COMP2) + 0x7C +
CONVERT(binary(1), HT2.COMP3) + 0x7C +
CONVERT(binary(1), HT2.COMP4) + 0x7C +
CONVERT(binary(1), HT2.COMP5)
)
<> dbo.SpookyHash
(
CONVERT(binary(8), HT1.FK1) + 0x7C +
CONVERT(binary(8), HT1.FK2) + 0x7C +
CONVERT(binary(8), HT1.FK3) + 0x7C +
CONVERT(binary(8), HT1.FK4) + 0x7C +
CONVERT(binary(8), HT1.FK5) + 0x7C +
CONVERT(binary(8), HT1.FK6) + 0x7C +
CONVERT(binary(8), HT1.FK7) + 0x7C +
CONVERT(binary(8), HT1.FK8) + 0x7C +
CONVERT(binary(8), HT1.FK9) + 0x7C +
CONVERT(binary(8), HT1.FK10) + 0x7C +
CONVERT(binary(8), HT1.FK11) + 0x7C +
CONVERT(binary(8), HT1.FK12) + 0x7C +
CONVERT(binary(8), HT1.FK13) + 0x7C +
CONVERT(binary(8), HT1.FK14) + 0x7C +
CONVERT(binary(8), HT1.FK15) + 0x7C +
CONVERT(varbinary(1000), HT1.STR1) + 0x7C +
CONVERT(varbinary(1000), HT1.STR2) + 0x7C +
CONVERT(varbinary(1000), HT1.STR3) + 0x7C +
CONVERT(varbinary(1000), HT1.STR4) + 0x7C +
CONVERT(varbinary(1000), HT1.STR5) + 0x7C +
CONVERT(binary(1), HT1.COMP1) + 0x7C +
CONVERT(binary(1), HT1.COMP2) + 0x7C +
CONVERT(binary(1), HT1.COMP3) + 0x7C +
CONVERT(binary(1), HT1.COMP4) + 0x7C +
CONVERT(binary(1), HT1.COMP5)
);Context
StackExchange Database Administrators Q#228789, answer score: 21
Revisions (0)
No revisions yet.