patternMinor
Find all differences between 2 strings
Viewed 0 times
alldifferencesbetweenfindstrings
Problem
I pulled this out of my code bucket and dusted it off earlier today in response to a post over on SO that made me cringe. This was originally written to highlight changes in Excel cells in real time via
I cleaned it up a bit to modernize the coding style, but I'm mainly looking for input on the algorithm used and suggestions for making the code a bit more understandable for somebody who isn't familiar with it.
It basically works on byte arrays and tracks the current working position in each array with a set of index pointers for the "start" and "end" of each substring that it's working with. The algorithm is similar to a binary search. The entry point function finds the longest matching substring in the two byte arrays, excludes it, logs all the differences to the output, and then recursively calls itself on the slices of the arrays to the right and left of the match:
```
'Returns a comma delimited string containing the positions of differences in the passed byte arrays. Recursive.
'Arrays are not modified, index parameters specify where the pointers are in the arrays on each subsequent call.
Private Function FindDifferences(ByRef first() As Byte, ByRef other() As Byte, Optional ByVal firstStartIndex As Long = -1, _
Optional ByVal firstEndIndex As Long, Optional ByVal otherStartIndex As Long, _
Optional ByVal otherEndIndex As Long) As String
If firstStartIndex = -1 Then
'Find matching substrings and set index markers.
SkipSubstringMatches first, other, firstStartIndex, firstEndIndex, otherStartIndex, otherEndIndex
If firstEndIndex = -1 And otherEndIndex > 0 Then
'All matches in first.
Exit Function
ElseIf otherEndIndex = -1 And firstEndIndex > 0 Then
Worksheet_Change, so it is designed with an eye toward raw speed to avoid blocking the UI. Benchmarks on current hardware are running around 3 seconds to compare 2 1kb strings.I cleaned it up a bit to modernize the coding style, but I'm mainly looking for input on the algorithm used and suggestions for making the code a bit more understandable for somebody who isn't familiar with it.
It basically works on byte arrays and tracks the current working position in each array with a set of index pointers for the "start" and "end" of each substring that it's working with. The algorithm is similar to a binary search. The entry point function finds the longest matching substring in the two byte arrays, excludes it, logs all the differences to the output, and then recursively calls itself on the slices of the arrays to the right and left of the match:
```
'Returns a comma delimited string containing the positions of differences in the passed byte arrays. Recursive.
'Arrays are not modified, index parameters specify where the pointers are in the arrays on each subsequent call.
Private Function FindDifferences(ByRef first() As Byte, ByRef other() As Byte, Optional ByVal firstStartIndex As Long = -1, _
Optional ByVal firstEndIndex As Long, Optional ByVal otherStartIndex As Long, _
Optional ByVal otherEndIndex As Long) As String
If firstStartIndex = -1 Then
'Find matching substrings and set index markers.
SkipSubstringMatches first, other, firstStartIndex, firstEndIndex, otherStartIndex, otherEndIndex
If firstEndIndex = -1 And otherEndIndex > 0 Then
'All matches in first.
Exit Function
ElseIf otherEndIndex = -1 And firstEndIndex > 0 Then
Solution
Nothing much to complain about with this code, so here's a short list of opportunities that may or may not fall under the realm of personal preference:
Dual-branch conditional assignments like this, where there's only a single, non-side-effecting instruction in each branch:
Can be written as a one-liner assignment with the
Single-liner conditional blocks that basically act as guard clauses:
Can be written with the conditional statement syntax:
Note that doing that removes a nesting level here:
Becomes:
The declarations-as-close-as-possible-to-first-use style in
Would have been:
This bit looks a little packed, some vertical whitespace wouldn't hurt:
But otherwise the whole code looks great.
Dual-branch conditional assignments like this, where there's only a single, non-side-effecting instruction in each branch:
If topFirst >= topOther Then
lower = topOther
Else
lower = topFirst
End IfCan be written as a one-liner assignment with the
IIf function:lower = IIf(topFirst >= topOther, topOther, topFirst)Single-liner conditional blocks that basically act as guard clauses:
If firstIndex = firstEndIndex Then
Exit Do
End IfCan be written with the conditional statement syntax:
If firstIndex = firstEndIndex Then Exit DoNote that doing that removes a nesting level here:
If firstEndIndex = -1 Or otherEndIndex = -1 Then
Exit Sub
Else
Do Until first(topFirst) <> other(topOther)
topFirst = topFirst - 1
topOther = topOther - 1
If topFirst < baseFirst Or topOther < baseFirst Then
Exit Do
End If
Loop
firstEndIndex = topFirst
otherEndIndex = topOther
End IfBecomes:
If firstEndIndex = -1 Or otherEndIndex = -1 Then Exit Sub
Do Until first(topFirst) <> other(topOther)
topFirst = topFirst - 1
topOther = topOther - 1
If topFirst < baseFirst Or topOther < baseFirst Then
Exit Do
End If
Loop
firstEndIndex = topFirst
otherEndIndex = topOtherThe declarations-as-close-as-possible-to-first-use style in
StringDiffs wrapper function is not consistent with the rest of the code:Public Function StringDiffs(ByVal first As String, ByVal other As String) As String
Dim firstChars() As Byte
Dim otherChars() As Byte
firstChars = StrConv(first, vbFromUnicode)
otherChars = StrConv(other, vbFromUnicode)
StringDiffs = FindDifferences(firstChars, otherChars)
End FunctionWould have been:
Public Function StringDiffs(ByVal first As String, ByVal other As String) As String
Dim firstChars() As Byte
firstChars = StrConv(first, vbFromUnicode)
Dim otherChars() As Byte
otherChars = StrConv(other, vbFromUnicode)
StringDiffs = FindDifferences(firstChars, otherChars)
End FunctionThis bit looks a little packed, some vertical whitespace wouldn't hurt:
If matchLength <> 0 Then
differences = FindDifferences(first, other, firstStartIndex, firstMatch - 1, otherStartIndex, otherMatch - 1)
If Len(differences) <> 0 Then returnValue = returnValue & "," & differences
differences = FindDifferences(first, other, firstMatch + matchLength, firstEndIndex, otherMatch + matchLength, otherEndIndex)
If Len(differences) <> 0 Then returnValue = returnValue & "," & differences
ElseBut otherwise the whole code looks great.
Code Snippets
If topFirst >= topOther Then
lower = topOther
Else
lower = topFirst
End Iflower = IIf(topFirst >= topOther, topOther, topFirst)If firstIndex = firstEndIndex Then
Exit Do
End IfIf firstIndex = firstEndIndex Then Exit DoIf firstEndIndex = -1 Or otherEndIndex = -1 Then
Exit Sub
Else
Do Until first(topFirst) <> other(topOther)
topFirst = topFirst - 1
topOther = topOther - 1
If topFirst < baseFirst Or topOther < baseFirst Then
Exit Do
End If
Loop
firstEndIndex = topFirst
otherEndIndex = topOther
End IfContext
StackExchange Code Review Q#156670, answer score: 3
Revisions (0)
No revisions yet.