HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Find first non-matching character between two strings

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
noncharactertwofirstbetweenfindstringsmatching

Problem

Posting this with solution provided.

The use case could be framed in a number of ways:

  • Return all characters that match at the start of two strings, until they do not match



  • How many characters match between two strings before one character doesn't match?



  • What is the first character that doesn't match between two strings?



  • etc.



For example:

If I have the strings 'Interesting' and 'Interested' they are similar for all the characters 'Interest' and then the first string ends in 'ing' while the second ends in 'ed'. Therefore, they have 8 characters in common, the first non-matching character is the 9th and the identical string is 'Interest'.

For the question in the title, the first non-matching character is number 9.

Solution

I'll use a numbers table. It can be used as an iterator over the two strings, but in a relational way.

My approach is to use the numbers table as an iterator over the strings, but in a relational way. The question calls for finding the first character in each string where they differ. The characters can be found using SUBSTRING(). Since we want the first difference we can use MIN() of the iterator. If no differences are found, i.e. the strings are equal, the checks on the lengths of the strings stop the iteration continuing to the end of the numbers table, which would be wasteful. I add one to the string length to catch the case where one string is a proper prefix of the other e.g. "interest" and "interested".

declare @FirstString    varchar(20) = 'interested';
declare @SecondString   varchar(20) = 'interesting';

select
    MIN(n.Number)
from dbo.Numbers as n
where SUBSTRING(@FirstString, n.Number, 1) <> SUBSTRING(@SecondString, n.Number, 1)
and n.Number <= LEN(@FirstString) + 1
and n.Number <= LEN(@SecondString) + 1;


If the strings are empty, equal or NULL this will return NULL.

With a clustered index on dbo.Numbers.Number the optimizer (in my instance - 2017 CU12) chooses to read the table in number order (Clustered Index Seek, Ordered = True). Further the predicate is pushed into the index seek. That operator processes 9 rows, the minimum required. The remainder of the plan simply returns the one row from dbo.Numbers which meets the query's requirements.

If the strings were very large, say DNA base pairs, and it was known that the matching prefixes were very long compared to the differing suffixes, it would be possible to invert the sequence. One could use REVERSE() on the strings. Placing a descending index on dbo.Numbers.Number and using MAX() may also work. Further testing would be required to prove a performance benefit, however.

With thanks to Andriy M for valuable observations.

Code Snippets

declare @FirstString    varchar(20) = 'interested';
declare @SecondString   varchar(20) = 'interesting';

select
    MIN(n.Number)
from dbo.Numbers as n
where SUBSTRING(@FirstString, n.Number, 1) <> SUBSTRING(@SecondString, n.Number, 1)
and n.Number <= LEN(@FirstString) + 1
and n.Number <= LEN(@SecondString) + 1;

Context

StackExchange Database Administrators Q#232817, answer score: 3

Revisions (0)

No revisions yet.