patternsqlMinor
On-Demand Fuzzy Lookup
Viewed 0 times
demandlookupfuzzy
Problem
We've got a customers table (Who doesn't?), containing many records that are, from a business perspective, duplicates. I've been able to create an SSIS package to perform fuzzy grouping, and report on potential duplicates.
Now, suppose I want to do this kind of analysis just as somebody is entering a new customer. The idea would be to perform a fuzzy lookup on customer name (and possibly some other basic info like postal code), and show potential duplicates prior to proceeding to the customer creation form.
The obvious problem here is that the fuzzy grouping and lookup components are part of SSIS. If I wanted to run those on-demand, I'd have to do something insane like putting the search terms in a staging table, running the SSIS package, waiting for it to complete, and fetching the results from an output table. It would be slow, painful, and have severe concurrency problems.
So, the other idea was to use full-text indexing. In experimenting with it, it looks like it won't be suitable. It can't catch subtle misspellings of customer names, or names that differ in "Company" vs. "Corporation" vs. "Co.", or "Anderson" vs. "Andersen", and other such variations.
Is there something that will allow for the flexibility of fuzzy grouping/matching from T-SQL? I can tell a fuzzy lookup to save the tokens, but it looks like I would still have to reimplement most of the matching algorithm to make use of them.
Now, suppose I want to do this kind of analysis just as somebody is entering a new customer. The idea would be to perform a fuzzy lookup on customer name (and possibly some other basic info like postal code), and show potential duplicates prior to proceeding to the customer creation form.
The obvious problem here is that the fuzzy grouping and lookup components are part of SSIS. If I wanted to run those on-demand, I'd have to do something insane like putting the search terms in a staging table, running the SSIS package, waiting for it to complete, and fetching the results from an output table. It would be slow, painful, and have severe concurrency problems.
So, the other idea was to use full-text indexing. In experimenting with it, it looks like it won't be suitable. It can't catch subtle misspellings of customer names, or names that differ in "Company" vs. "Corporation" vs. "Co.", or "Anderson" vs. "Andersen", and other such variations.
Is there something that will allow for the flexibility of fuzzy grouping/matching from T-SQL? I can tell a fuzzy lookup to save the tokens, but it looks like I would still have to reimplement most of the matching algorithm to make use of them.
Solution
In the past I built a "fuzzy-search" in a .Net CLR function. This function gets called the same way a user-defined function gets called.
For example,
would only return customers with a name that was 80% similar to the input name.
The % match is based on the number of changes needed to convert one value to another, not the number of characters that are different. We use it to compare addresses, and found this was more effective because of the numerous street abbreviations that are used.
Here's the code I used to compare strings. I did this so long ago that I can't remember how to deploy it, although a quick search will show you many articles on how to create SQL CLR functions
For example,
select id, name
from customers
where dbo.CompareStrings("newCustomerName", customers.name) > .8would only return customers with a name that was 80% similar to the input name.
The % match is based on the number of changes needed to convert one value to another, not the number of characters that are different. We use it to compare addresses, and found this was more effective because of the numerous street abbreviations that are used.
Here's the code I used to compare strings. I did this so long ago that I can't remember how to deploy it, although a quick search will show you many articles on how to create SQL CLR functions
' Checks two strings against each other and returns a decimal between 0 (doesn't match at all) and 1 (100% match)
_
Public Shared Function CompareStrings(ByVal input1 As SqlChars, ByVal input2 As SqlChars) _
As SqlDecimal
If IsNothing(input1) And IsNothing(input2) Then
Return New SqlDecimal(1.0)
ElseIf IsNothing(input1) Or IsNothing(input2) Then
Return New SqlDecimal(0.0)
End If
Dim s1 As String = New String(input1.Value)
Dim s2 As String = New String(input2.Value)
If s1.Length = 0 Or s2.Length = 0 Then
Return New SqlDecimal(1.0)
Else
Dim re As New Regex("[^A-Za-z0-9 ]", RegexOptions.IgnorePatternWhitespace Xor RegexOptions.Singleline)
s1 = re.Replace(s1, "( )\1*", "$1")
s2 = re.Replace(s2, "( )\1*", "$1")
s1 = UCase(re.Replace(s1, ""))
s2 = UCase(re.Replace(s2.ToString, ""))
Dim dif As Integer = GetStringSimilarity(s1, s2)
Dim max As Integer = s1.Length
If s2.Length > max Then max = s2.Length
Return New SqlDecimal(1.0 - (dif / max))
End If
End Function
' Compares two strings using the relationship in patterns of letters
Private Shared Function GetStringSimilarity(ByVal s1 As String, ByVal s2 As String) As Integer
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim distance(n + 1, m + 1) As Integer
Dim cost As Integer = 0
If n = 0 Then Return m
If m = 0 Then Return n
For i As Integer = 0 To n
distance(i, 0) = i
Next
For j As Integer = 0 To m
distance(0, j) = j
Next
For i As Integer = 1 To n
For j As Integer = 1 To m
If Mid(s2, j, 1) = Mid(s1, i, 1) Then cost = 0 Else cost = 1
distance(i, j) = Min3(distance(i - 1, j) + 1, distance(i, j - 1) + 1, distance(i - 1, j - 1) + cost)
Next
Next
Return distance(n, m)
End Function
' Returns the min of 3 values
Private Shared Function Min3(ByVal x As Integer, ByVal y As Integer, ByVal z As Integer) As Integer
Dim min As Integer = x
If y < min Then min = y
If z < min Then min = z
Return min
End FunctionCode Snippets
select id, name
from customers
where dbo.CompareStrings("newCustomerName", customers.name) > .8' Checks two strings against each other and returns a decimal between 0 (doesn't match at all) and 1 (100% match)
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function CompareStrings(ByVal input1 As SqlChars, ByVal input2 As SqlChars) _
As <SqlFacet(Precision:=10, Scale:=4)> SqlDecimal
If IsNothing(input1) And IsNothing(input2) Then
Return New SqlDecimal(1.0)
ElseIf IsNothing(input1) Or IsNothing(input2) Then
Return New SqlDecimal(0.0)
End If
Dim s1 As String = New String(input1.Value)
Dim s2 As String = New String(input2.Value)
If s1.Length = 0 Or s2.Length = 0 Then
Return New SqlDecimal(1.0)
Else
Dim re As New Regex("[^A-Za-z0-9 ]", RegexOptions.IgnorePatternWhitespace Xor RegexOptions.Singleline)
s1 = re.Replace(s1, "( )\1*", "$1")
s2 = re.Replace(s2, "( )\1*", "$1")
s1 = UCase(re.Replace(s1, ""))
s2 = UCase(re.Replace(s2.ToString, ""))
Dim dif As Integer = GetStringSimilarity(s1, s2)
Dim max As Integer = s1.Length
If s2.Length > max Then max = s2.Length
Return New SqlDecimal(1.0 - (dif / max))
End If
End Function
' Compares two strings using the relationship in patterns of letters
Private Shared Function GetStringSimilarity(ByVal s1 As String, ByVal s2 As String) As Integer
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim distance(n + 1, m + 1) As Integer
Dim cost As Integer = 0
If n = 0 Then Return m
If m = 0 Then Return n
For i As Integer = 0 To n
distance(i, 0) = i
Next
For j As Integer = 0 To m
distance(0, j) = j
Next
For i As Integer = 1 To n
For j As Integer = 1 To m
If Mid(s2, j, 1) = Mid(s1, i, 1) Then cost = 0 Else cost = 1
distance(i, j) = Min3(distance(i - 1, j) + 1, distance(i, j - 1) + 1, distance(i - 1, j - 1) + cost)
Next
Next
Return distance(n, m)
End Function
' Returns the min of 3 values
Private Shared Function Min3(ByVal x As Integer, ByVal y As Integer, ByVal z As Integer) As Integer
Dim min As Integer = x
If y < min Then min = y
If z < min Then min = z
Return min
End FunctionContext
StackExchange Database Administrators Q#13593, answer score: 2
Revisions (0)
No revisions yet.