HiveBrain v1.2.0
Get Started
← Back to all entries
principlesqlMinor

Basic T-SQL to compare two same length strings

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
samesqllengthtwocomparestringsbasic

Problem

I've come up with the following T-SQL to compare similar same length strings.

Example usage would be:

  • OCR returns a value which is expected to be in the database.



  • PATINDEX is used to check the value's in the database, including matching for common errors (e.g. 0/O, I/1/l).



  • If multiple values are found, each match is scored for likeness to the original.



declare @t table (solution nvarchar(32), score bigint)

insert @t(solution) 
select 'Portugal'
union select '9ortugal'
union select '9ortu8al'
union select 'Portu8al'
union select 'P0rtugal'
union select '90rtugal'
union select '90rtu8al'
union select 'P0rtu8al'
union select 'Portuga1'
union select '9ortuga1'
union select '9ortu8a1'
union select 'Portu8a1'
union select 'P0rtuga1'
union select '90rtuga1'
union select '90rtu8a1'
union select 'P0rtu8a1'

declare @answer nvarchar(32) = 'Portugal'

update @t 
set score = cast(cast(cast(@answer as varchar) as varbinary(max)) as bigint) - (cast(cast(cast(solution as varchar) as varbinary(max)) as bigint) & cast(cast(cast(@answer as varchar) as varbinary(max)) as bigint))

select * from @t order by score --first result is the most similar


Results:

solution score
Portugal 0
Portuga1 76
Portu8al 4653056
Portu8a1 4653132
P0rtugal 22236523160141824
P0rtuga1 22236523160141900
P0rtu8al 22236523164794880
P0rtu8a1 22236523164794956
9ortugal 4611686018427387904
9ortuga1 4611686018427387980
9ortu8al 4611686018432040960
9ortu8a1 4611686018432041036
90rtugal 4633922541587529728
90rtuga1 4633922541587529804
90rtu8al 4633922541592182784
90rtu8a1 4633922541592182860


SQL Fiddle

Notes:

  • Conversion to varchar because nvarchar seems to cause a loss of precision



  • Conversion to varbinary because doing varchar straight to numeric would result in a non-numeric issue for non-numeric characters.



  • Conversion to bigint to allow bitwise & arithmetic comparison



  • Use of - and & may be superfluous; perhaps just the - wou

Solution

Firstly, a stylistic point. SQL is typically written with keywords capitalised, SELECT instead of select for example.

Your table and variable names could use a little more meaning. Using @Countries or @CountryNames in place of the meaningless @t, for example.

You use NVARCHAR to hold your data, but immediately convert it to VARCHAR. I would recommend just using VARCHAR in your query, this allows you to remove the cast to VARCHAR's.

Lastly, the logic you use to set the score is a little convoluted, as you thought it might be. I would recommend using:

SET score = CAST(CAST(@SelectedCountry AS VARBINARY(MAX)) AS INT) - CAST(CAST(solution AS VARBINARY(MAX)) AS INT)


These changes would result in this:

DECLARE @CoutryNames TABLE (solution VARCHAR(32), score BIGINT)

INSERT INTO @CoutryNames(solution)
SELECT 'Portugal'
UNION SELECT '9ortugal'
UNION SELECT '9ortu8al'
UNION SELECT 'Portu8al'
UNION SELECT 'P0rtugal'
UNION SELECT '90rtugal'
UNION SELECT '90rtu8al'
UNION SELECT 'P0rtu8al'
UNION SELECT 'Portuga1'
UNION SELECT '9ortuga1'
UNION SELECT '9ortu8a1'
UNION SELECT 'Portu8a1'
UNION SELECT 'P0rtuga1'
UNION SELECT '90rtuga1'
UNION SELECT '90rtu8a1'
UNION SELECT 'P0rtu8a1'

DECLARE @SelectedCountry VARCHAR(32) = 'Portugal'

UPDATE @CoutryNames
SET score =
CAST(CAST(@SelectedCountry AS VARBINARY(MAX)) AS BIGINT)
  • CAST(CAST(solution AS VARBINARY(MAX)) AS BIGINT)



SELECT * FROM @CoutryNames ORDER BY score --first result is the most similar


Here is the SQL Fiddle

The idea to use weighted errors got me thinking about how I would do it.

Here is what I came up with, you could pretty easily turn it into a function by adding the required boilerplate code and changing the SELECT @ErrorScore to RETURN @ErrorScore:

DECLARE @StringA VARCHAR(10) = 'bacon',
@StringB VARCHAR(10) = 'bacon',
@CharPosition TINYINT = 1,
@ErrorScore INT = 0

DECLARE @LETTER_WEIGHTING TABLE(CorrectLetter CHAR(1), IncorrectLetter CHAR(1),LetterWeighting TINYINT)

INSERT INTO @LETTER_WEIGHTING
SELECT 'o','0',1 UNION ALL
SELECT 'o','Q',3 UNION ALL
SELECT 'g','6',2 UNION ALL
SELECT 'g','8',3

IF @StringA <> @StringB
BEGIN
WHILE @CharPosition SUBSTRING(@StringB,@CharPosition,1)
THEN
ISNULL((
SELECT TOP 1 LetterWeighting
FROM @LETTER_WEIGHTING
WHERE CorrectLetter = SUBSTRING(@StringA,@CharPosition,1)
AND IncorrectLetter = SUBSTRING(@StringB,@CharPosition,1)
),5)
*
@CharPosition
ELSE 0
END

SET @CharPosition+=1
END
END

SELECT @ErrorScore


What this does is loop through all the characters of the first string and compare them with the appropriate character in the second string. If the characters do not match, then we use a lookup table of weightings for some errors or we use 5 as a base value if the correct/incorrect character pair is not in the lookup table. We then multiply the value in the previous step by the characters position in the string.

Here is an SQL Fiddle

Context

StackExchange Code Review Q#61872, answer score: 3

Revisions (0)

No revisions yet.