HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Is there a SQL Server implementation of the Longest Common Substring problem?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
problemthesqlsubstringlongestserverimplementationtherecommon

Problem

Is there a SQL Server implementation of the Longest Common Substring problem? A solution that checks with all rows of a column in SQL Server? I have seen solutions that take two strings as input, but no SQL Server solution that looks at all rows of a column in a table.

I did try a few things, but to be honest I think a solution goes over my head at the moment, so any suggestions are welcome.

There is no "real world" problem here, I'm just looking at programming problems and how they could be solved with SQL Server.

Solution

This can be done rather easily as a SQLCLR User-Defined Aggregate (UDA). An aggregate operates over a set of rows, so you would be able to operate over all rows or just a subset, based on a WHERE condition and optional GROUP BY (if wanting to operate over separate sets of rows).

BUT, whether or not you should do this depends on what you are planning on doing with the result. If this is a one-off project to do some research that won't be repeated, then it is probably best to just make a small Console App to read in the rows and process accordingly.

However, if you do have some need to use the returned value within a database-centered process, then SQLCLR should be fine (assuming you follow the two recommendations mentioned under "The following tricks can be used to reduce the memory usage of an implementation" in the Pseudocode section. You will just need to find a "creative" way of dealing with the situation of having multiple results for what is considered the longest common substring (i.e. if 2 or more common substrings tie for "first place"). Using the example from the Wikipedia page (which shows 2 matches for the 2 strings):

ABAB
BABA


Returns both:

  • BAB



  • ABA



Perhaps returning an XML document of matches since that is parsable and can contain any string (when properly escaped).

UPDATE (updated, and updated again)

The .NET C# source code for this can be found on Pastebin.com at:

SQLCLR UDA for Longest Common Substring - Source Code

And for anyone that wants to play with this UDA without compiling it, an installation T-SQL script (no external DLL) can be found on Pastebin.com at:

SQLCLR UDA for Longest Common Substring - Installer

The UDA should be fairly memory-efficient as it only stores substrings that match the current string and all previously encountered strings. As a new row calls the UDA, any substrings that are not found in the new string are removed from the collection.

It is also CPU-efficient in that, if at any point the number of "common" substrings goes to zero, it sets a flag indicating that no substrings are even possible and short-cicruits all future executions to simply exit upon being called. This happens immediately if an empty string is encountered. In any of these cases, an empty XML document (i.e. root element only) is returned. The meaning of the empty document is equivalent to an empty string, since the only thing the input strings have in common is that they were non-NULL strings.

NULL, in my interpretation, is ignored and does not indicate no possible matches like an empty string does.

A NULL is returned in the following two cases:

  • All inputs are NULL



  • Only a single non-NULL row was in the set and hence there was no other string to compare to, hence nothing to be considered "common".



I added a second input parameter which controls whether or not the return is just the longest common substrings or all common substrings. In the case of returning all, an attribute is added to each "item" to indicate whether or not it is one of the "longest" substrings:

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test1a]
FROM (VALUES (N'ABAB'), (N'BABA')) tab(col);


Returns:


ABA
BAB



And

SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test1b]
FROM (VALUES (N'ABAB'), (N'BABA')) tab(col);


Returns:


ABA
BAB
AB
BA
A
B



Also, the comparisons are now case-InSensitive to match the typical Collation.

Below are 16 more test cases that check functionality only, not performance. I will post additional tests later that operate over many rows of much longer strings. I intentionally left out Combining Characters and Supplementary Characters, for now, as they are bit a more complicated.

`SELECT dbo.LongestCommonSubstring(tab.col, 1) AS [Test2]
FROM (VALUES (N'ABAB'), (N'BABA'), (N'2BAB5')) tab(col);
-- BAB

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test3]
FROM (VALUES (N'ABAB'), (N'BABA'), (NULL), (N'2BAB5')) tab(col);
-- BAB

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test4]
FROM (VALUES (NULL), (NULL), (NULL)) tab(col);
-- NULL

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test5]
FROM (VALUES (N'ABAB'), (N'BABA'), (N''), (N'2BAB5')) tab(col);
--

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test6]
FROM (VALUES (N'ABAB'), (N'BABA'), (N'L'), (N'2BAB5')) tab(col);
--

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test7]
FROM (VALUES (N'ABAB')) tab(col);
-- NULL

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8a-DuplicatesAcross2Rows]
FROM (VALUES (N'ABAB'), (N'ABAB')) tab(col);
-- ABAB

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8b-DuplicatesAcross3Rows]
FROM (VALUES (N'ABAB'), (N'ABAB'), (N'ABAB')) tab(col);
-- ABAB

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test8c-DuplicatesAcross4Rows]
FROM (VALUES (N'ABAB'), (N'ABAB'), (N'ABAB'), (N'ABAB')) tab(col);
-- ABAB

SELECT dbo.LongestCommonSubstring(tab.col, 0) AS [Test9-DuplicatesWithinOneString]
FROM (V

Context

StackExchange Database Administrators Q#131759, answer score: 7

Revisions (0)

No revisions yet.