HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Efficient way to see what pattern matches a single word

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
matcheswhatefficientwayseesinglewordpattern

Problem

I'm creating a function to find out what search pattern would best fit a single string. The string to match will be compared against a table of what could be possibly hundreds of possible patterns to see if it matches any of them.

I've got a not-so-elegant function working right now that will read all patterns into a memory table and iterate though them to see if each matches. It works but makes me sick to my stomach to think of using this in a production environment, especially if the pattern table grows to thousands of patterns.

Example string to match: 'Master'

Possible patterns:

  • m%r



  • abc%



  • xyz%



  • a%bcd



The outcome would be a match to pattern #1.

In the scenario, I need to be able to match potential part numbers, brands, and other possibly misspelled phrases. I'm incorporating Double Metaphone but when it comes to other items that only a pattern will catch, I need to know what pattern was the match among many. The "StringToSearch" would only return 1 matched pattern.

Does anyone have theory or other code types that can help me with this reverse pattern matching process?

Solution

Without knowing anything about what your actual scenario looks like, I'd perhaps do something like:

USE tempdb;

IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
    TargetString varchar(46) NOT NULL
);

IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
    PossibleMatch varchar(46) NOT NULL
);

INSERT INTO dbo.Strings(TargetString)
VALUES ('Master');

INSERT INTO dbo.Matches(PossibleMatch)
VALUES ('m%r')
    , ('abc%')
    , ('zyx%')
    , ('a%bcd')

SELECT *
FROM dbo.Strings s
    INNER JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch


╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════╬═══════════════╣
║ Master ║ m%r ║
╚══════════════╩═══════════════╝

You'll get a better set of answers if you more accurately describe exactly what your scenario is by explaining how the data is designed, and what you exact design goals are.

To see if this methodology is effective enough, I've extended the MCVE above by adding a couple of indexes and a bunch of data:

The tables now have clustered indexes:

IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
    TargetString varchar(46) NOT NULL
        PRIMARY KEY CLUSTERED
);

IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
    PossibleMatch varchar(46) NOT NULL
        PRIMARY KEY CLUSTERED
);


This inserts 50,000 words into the Strings table:

;WITH src AS (
    SELECT v.val
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS 
(
    SELECT TOP(26) Num = (v.val * 10) + v1.val
    FROM src v
        CROSS JOIN src v1
    ORDER BY v.val, v1.val
)
, words AS
(
    SELECT c1 = CHAR(65 + s1.num)
        , c2 = CHAR(65 + s2.num)
        , c3 = CHAR(65 + s3.num)
        , c4 = CHAR(65 + s4.num)
        , c5 = CHAR(65 + s5.num)
    FROM src26 s5
        CROSS JOIN src26 s4
        CROSS JOIN src26 s3
        CROSS JOIN src26 s2
        CROSS JOIN src26 s1
)
INSERT INTO dbo.Strings(TargetString)
SELECT TOP(50000) words.c1
    + words.c2
    + words.c3
    + words.c4
    + words.c5
FROM words
ORDER BY CRYPT_GEN_RANDOM(5);


A sample of the rows from that table:

╔══════════════╗
║ TargetString ║
╠══════════════╣
║ UKMBL ║
║ ASOCG ║
║ XWACG ║
║ NHRXQ ║
║ LFMBR ║
║ EQELO ║
║ PRHCM ║
║ IIFPD ║
║ OCLJQ ║
║ NJBMB ║
╚══════════════╝

Next, we'll create 1,000,000 rows in the Matches table with wildcards:

;WITH src AS (
    SELECT v.val
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS 
(
    SELECT TOP(26) Num = (v.val * 10) + v1.val
    FROM src v
        CROSS JOIN src v1
    ORDER BY v.val, v1.val
)
, words AS
(
    SELECT c1 = CHAR(65 + s1.num)
        , c2 = CHAR(65 + s2.num)
        , c3 = CHAR(65 + s3.num)
        , c4 = CHAR(65 + s4.num)
        , c5 = CHAR(65 + s5.num)
    FROM src26 s5
        CROSS JOIN src26 s4
        CROSS JOIN src26 s3
        CROSS JOIN src26 s2
        CROSS JOIN src26 s1
)
INSERT INTO dbo.Matches(PossibleMatch)
SELECT TOP(1000000) REPLACE(words.c1, 'A', '_')
    + REPLACE(words.c2, 'E', '_')
    + REPLACE(words.c3, 'I', '_')
    + REPLACE(words.c4, 'O', '_')
    + REPLACE(words.c5, 'U', '%')
FROM words
WHERE NOT (words.c1 = 'A' AND words.c2 = 'E' AND words.c3 = 'I' AND words.c4 ='O' AND words.c5 <> 'U');

SELECT TotalMatches = (SELECT COUNT(1) FROM dbo.Matches)
    , TotalStrings = (SELECT COUNT(1) FROM dbo.Strings)


Here's a sample of those rows:

╔═══════════════╗
║ PossibleMatch ║
╠═══════════════╣
║ LDBFB ║
║ IWLMB ║
║ FNBUA ║
║ _ZEVC ║
║ BSZDB ║
║ TTRIB ║
║ XVLNA ║
║ VLPDB ║
║ UI_IA ║
║ FNLCB ║
╚═══════════════╝

As you can see from the sample above, some of the rows have a T-SQL single-character wildcard, _, and some rows don't have wildcards.

Now, we'll run some code to see if we can find a row from Matches that matches up to a row in Strings:

SELECT *
FROM dbo.Strings s
    LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = (
    SELECT TOP(1) Strings.TargetString
    FROM dbo.Strings 
    ORDER BY CRYPT_GEN_RANDOM(5)
    )


The WHERE clause randomly chooses a word from the Strings table. You may need to run this several times before you get a word with matching values in the Matches table.

Looking up any given word in the Strings table, with associated Matches:

SELECT *
FROM dbo.Strings s
    LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = 'JRZPC'


The plan for this query:

On my system, this query executes in ~100 milliseconds.

The results of that query:

╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════

Code Snippets

USE tempdb;

IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
    TargetString varchar(46) NOT NULL
);

IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
    PossibleMatch varchar(46) NOT NULL
);

INSERT INTO dbo.Strings(TargetString)
VALUES ('Master');

INSERT INTO dbo.Matches(PossibleMatch)
VALUES ('m%r')
    , ('abc%')
    , ('zyx%')
    , ('a%bcd')

SELECT *
FROM dbo.Strings s
    INNER JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
    TargetString varchar(46) NOT NULL
        PRIMARY KEY CLUSTERED
);

IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
    PossibleMatch varchar(46) NOT NULL
        PRIMARY KEY CLUSTERED
);
;WITH src AS (
    SELECT v.val
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS 
(
    SELECT TOP(26) Num = (v.val * 10) + v1.val
    FROM src v
        CROSS JOIN src v1
    ORDER BY v.val, v1.val
)
, words AS
(
    SELECT c1 = CHAR(65 + s1.num)
        , c2 = CHAR(65 + s2.num)
        , c3 = CHAR(65 + s3.num)
        , c4 = CHAR(65 + s4.num)
        , c5 = CHAR(65 + s5.num)
    FROM src26 s5
        CROSS JOIN src26 s4
        CROSS JOIN src26 s3
        CROSS JOIN src26 s2
        CROSS JOIN src26 s1
)
INSERT INTO dbo.Strings(TargetString)
SELECT TOP(50000) words.c1
    + words.c2
    + words.c3
    + words.c4
    + words.c5
FROM words
ORDER BY CRYPT_GEN_RANDOM(5);
;WITH src AS (
    SELECT v.val
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS 
(
    SELECT TOP(26) Num = (v.val * 10) + v1.val
    FROM src v
        CROSS JOIN src v1
    ORDER BY v.val, v1.val
)
, words AS
(
    SELECT c1 = CHAR(65 + s1.num)
        , c2 = CHAR(65 + s2.num)
        , c3 = CHAR(65 + s3.num)
        , c4 = CHAR(65 + s4.num)
        , c5 = CHAR(65 + s5.num)
    FROM src26 s5
        CROSS JOIN src26 s4
        CROSS JOIN src26 s3
        CROSS JOIN src26 s2
        CROSS JOIN src26 s1
)
INSERT INTO dbo.Matches(PossibleMatch)
SELECT TOP(1000000) REPLACE(words.c1, 'A', '_')
    + REPLACE(words.c2, 'E', '_')
    + REPLACE(words.c3, 'I', '_')
    + REPLACE(words.c4, 'O', '_')
    + REPLACE(words.c5, 'U', '%')
FROM words
WHERE NOT (words.c1 = 'A' AND words.c2 = 'E' AND words.c3 = 'I' AND words.c4 ='O' AND words.c5 <> 'U');

SELECT TotalMatches = (SELECT COUNT(1) FROM dbo.Matches)
    , TotalStrings = (SELECT COUNT(1) FROM dbo.Strings)
SELECT *
FROM dbo.Strings s
    LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = (
    SELECT TOP(1) Strings.TargetString
    FROM dbo.Strings 
    ORDER BY CRYPT_GEN_RANDOM(5)
    )

Context

StackExchange Database Administrators Q#184281, answer score: 5

Revisions (0)

No revisions yet.