patternsqlMinor
Efficient way to see what pattern matches a single word
Viewed 0 times
matcheswhatefficientwayseesinglewordpattern
Problem
I'm creating a function to find out what search pattern would best fit a single string. The string to match will be compared against a table of what could be possibly hundreds of possible patterns to see if it matches any of them.
I've got a not-so-elegant function working right now that will read all patterns into a memory table and iterate though them to see if each matches. It works but makes me sick to my stomach to think of using this in a production environment, especially if the pattern table grows to thousands of patterns.
Example string to match: 'Master'
Possible patterns:
The outcome would be a match to pattern #1.
In the scenario, I need to be able to match potential part numbers, brands, and other possibly misspelled phrases. I'm incorporating Double Metaphone but when it comes to other items that only a pattern will catch, I need to know what pattern was the match among many. The "StringToSearch" would only return 1 matched pattern.
Does anyone have theory or other code types that can help me with this reverse pattern matching process?
I've got a not-so-elegant function working right now that will read all patterns into a memory table and iterate though them to see if each matches. It works but makes me sick to my stomach to think of using this in a production environment, especially if the pattern table grows to thousands of patterns.
Example string to match: 'Master'
Possible patterns:
- m%r
- abc%
- xyz%
- a%bcd
The outcome would be a match to pattern #1.
In the scenario, I need to be able to match potential part numbers, brands, and other possibly misspelled phrases. I'm incorporating Double Metaphone but when it comes to other items that only a pattern will catch, I need to know what pattern was the match among many. The "StringToSearch" would only return 1 matched pattern.
Does anyone have theory or other code types that can help me with this reverse pattern matching process?
Solution
Without knowing anything about what your actual scenario looks like, I'd perhaps do something like:
╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════╬═══════════════╣
║ Master ║ m%r ║
╚══════════════╩═══════════════╝
You'll get a better set of answers if you more accurately describe exactly what your scenario is by explaining how the data is designed, and what you exact design goals are.
To see if this methodology is effective enough, I've extended the MCVE above by adding a couple of indexes and a bunch of data:
The tables now have clustered indexes:
This inserts 50,000 words into the
A sample of the rows from that table:
╔══════════════╗
║ TargetString ║
╠══════════════╣
║ UKMBL ║
║ ASOCG ║
║ XWACG ║
║ NHRXQ ║
║ LFMBR ║
║ EQELO ║
║ PRHCM ║
║ IIFPD ║
║ OCLJQ ║
║ NJBMB ║
╚══════════════╝
Next, we'll create 1,000,000 rows in the
Here's a sample of those rows:
╔═══════════════╗
║ PossibleMatch ║
╠═══════════════╣
║ LDBFB ║
║ IWLMB ║
║ FNBUA ║
║ _ZEVC ║
║ BSZDB ║
║ TTRIB ║
║ XVLNA ║
║ VLPDB ║
║ UI_IA ║
║ FNLCB ║
╚═══════════════╝
As you can see from the sample above, some of the rows have a T-SQL single-character wildcard,
Now, we'll run some code to see if we can find a row from
The
Looking up any given word in the
The plan for this query:
On my system, this query executes in ~100 milliseconds.
The results of that query:
╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════
USE tempdb;
IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
TargetString varchar(46) NOT NULL
);
IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
PossibleMatch varchar(46) NOT NULL
);
INSERT INTO dbo.Strings(TargetString)
VALUES ('Master');
INSERT INTO dbo.Matches(PossibleMatch)
VALUES ('m%r')
, ('abc%')
, ('zyx%')
, ('a%bcd')
SELECT *
FROM dbo.Strings s
INNER JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════╬═══════════════╣
║ Master ║ m%r ║
╚══════════════╩═══════════════╝
You'll get a better set of answers if you more accurately describe exactly what your scenario is by explaining how the data is designed, and what you exact design goals are.
To see if this methodology is effective enough, I've extended the MCVE above by adding a couple of indexes and a bunch of data:
The tables now have clustered indexes:
IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
TargetString varchar(46) NOT NULL
PRIMARY KEY CLUSTERED
);
IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
PossibleMatch varchar(46) NOT NULL
PRIMARY KEY CLUSTERED
);This inserts 50,000 words into the
Strings table:;WITH src AS (
SELECT v.val
FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS
(
SELECT TOP(26) Num = (v.val * 10) + v1.val
FROM src v
CROSS JOIN src v1
ORDER BY v.val, v1.val
)
, words AS
(
SELECT c1 = CHAR(65 + s1.num)
, c2 = CHAR(65 + s2.num)
, c3 = CHAR(65 + s3.num)
, c4 = CHAR(65 + s4.num)
, c5 = CHAR(65 + s5.num)
FROM src26 s5
CROSS JOIN src26 s4
CROSS JOIN src26 s3
CROSS JOIN src26 s2
CROSS JOIN src26 s1
)
INSERT INTO dbo.Strings(TargetString)
SELECT TOP(50000) words.c1
+ words.c2
+ words.c3
+ words.c4
+ words.c5
FROM words
ORDER BY CRYPT_GEN_RANDOM(5);A sample of the rows from that table:
╔══════════════╗
║ TargetString ║
╠══════════════╣
║ UKMBL ║
║ ASOCG ║
║ XWACG ║
║ NHRXQ ║
║ LFMBR ║
║ EQELO ║
║ PRHCM ║
║ IIFPD ║
║ OCLJQ ║
║ NJBMB ║
╚══════════════╝
Next, we'll create 1,000,000 rows in the
Matches table with wildcards:;WITH src AS (
SELECT v.val
FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS
(
SELECT TOP(26) Num = (v.val * 10) + v1.val
FROM src v
CROSS JOIN src v1
ORDER BY v.val, v1.val
)
, words AS
(
SELECT c1 = CHAR(65 + s1.num)
, c2 = CHAR(65 + s2.num)
, c3 = CHAR(65 + s3.num)
, c4 = CHAR(65 + s4.num)
, c5 = CHAR(65 + s5.num)
FROM src26 s5
CROSS JOIN src26 s4
CROSS JOIN src26 s3
CROSS JOIN src26 s2
CROSS JOIN src26 s1
)
INSERT INTO dbo.Matches(PossibleMatch)
SELECT TOP(1000000) REPLACE(words.c1, 'A', '_')
+ REPLACE(words.c2, 'E', '_')
+ REPLACE(words.c3, 'I', '_')
+ REPLACE(words.c4, 'O', '_')
+ REPLACE(words.c5, 'U', '%')
FROM words
WHERE NOT (words.c1 = 'A' AND words.c2 = 'E' AND words.c3 = 'I' AND words.c4 ='O' AND words.c5 <> 'U');
SELECT TotalMatches = (SELECT COUNT(1) FROM dbo.Matches)
, TotalStrings = (SELECT COUNT(1) FROM dbo.Strings)Here's a sample of those rows:
╔═══════════════╗
║ PossibleMatch ║
╠═══════════════╣
║ LDBFB ║
║ IWLMB ║
║ FNBUA ║
║ _ZEVC ║
║ BSZDB ║
║ TTRIB ║
║ XVLNA ║
║ VLPDB ║
║ UI_IA ║
║ FNLCB ║
╚═══════════════╝
As you can see from the sample above, some of the rows have a T-SQL single-character wildcard,
_, and some rows don't have wildcards.Now, we'll run some code to see if we can find a row from
Matches that matches up to a row in Strings:SELECT *
FROM dbo.Strings s
LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = (
SELECT TOP(1) Strings.TargetString
FROM dbo.Strings
ORDER BY CRYPT_GEN_RANDOM(5)
)The
WHERE clause randomly chooses a word from the Strings table. You may need to run this several times before you get a word with matching values in the Matches table.Looking up any given word in the
Strings table, with associated Matches:SELECT *
FROM dbo.Strings s
LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = 'JRZPC'The plan for this query:
On my system, this query executes in ~100 milliseconds.
The results of that query:
╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════
Code Snippets
USE tempdb;
IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
TargetString varchar(46) NOT NULL
);
IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
PossibleMatch varchar(46) NOT NULL
);
INSERT INTO dbo.Strings(TargetString)
VALUES ('Master');
INSERT INTO dbo.Matches(PossibleMatch)
VALUES ('m%r')
, ('abc%')
, ('zyx%')
, ('a%bcd')
SELECT *
FROM dbo.Strings s
INNER JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatchIF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
TargetString varchar(46) NOT NULL
PRIMARY KEY CLUSTERED
);
IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
PossibleMatch varchar(46) NOT NULL
PRIMARY KEY CLUSTERED
);;WITH src AS (
SELECT v.val
FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS
(
SELECT TOP(26) Num = (v.val * 10) + v1.val
FROM src v
CROSS JOIN src v1
ORDER BY v.val, v1.val
)
, words AS
(
SELECT c1 = CHAR(65 + s1.num)
, c2 = CHAR(65 + s2.num)
, c3 = CHAR(65 + s3.num)
, c4 = CHAR(65 + s4.num)
, c5 = CHAR(65 + s5.num)
FROM src26 s5
CROSS JOIN src26 s4
CROSS JOIN src26 s3
CROSS JOIN src26 s2
CROSS JOIN src26 s1
)
INSERT INTO dbo.Strings(TargetString)
SELECT TOP(50000) words.c1
+ words.c2
+ words.c3
+ words.c4
+ words.c5
FROM words
ORDER BY CRYPT_GEN_RANDOM(5);;WITH src AS (
SELECT v.val
FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS
(
SELECT TOP(26) Num = (v.val * 10) + v1.val
FROM src v
CROSS JOIN src v1
ORDER BY v.val, v1.val
)
, words AS
(
SELECT c1 = CHAR(65 + s1.num)
, c2 = CHAR(65 + s2.num)
, c3 = CHAR(65 + s3.num)
, c4 = CHAR(65 + s4.num)
, c5 = CHAR(65 + s5.num)
FROM src26 s5
CROSS JOIN src26 s4
CROSS JOIN src26 s3
CROSS JOIN src26 s2
CROSS JOIN src26 s1
)
INSERT INTO dbo.Matches(PossibleMatch)
SELECT TOP(1000000) REPLACE(words.c1, 'A', '_')
+ REPLACE(words.c2, 'E', '_')
+ REPLACE(words.c3, 'I', '_')
+ REPLACE(words.c4, 'O', '_')
+ REPLACE(words.c5, 'U', '%')
FROM words
WHERE NOT (words.c1 = 'A' AND words.c2 = 'E' AND words.c3 = 'I' AND words.c4 ='O' AND words.c5 <> 'U');
SELECT TotalMatches = (SELECT COUNT(1) FROM dbo.Matches)
, TotalStrings = (SELECT COUNT(1) FROM dbo.Strings)SELECT *
FROM dbo.Strings s
LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = (
SELECT TOP(1) Strings.TargetString
FROM dbo.Strings
ORDER BY CRYPT_GEN_RANDOM(5)
)Context
StackExchange Database Administrators Q#184281, answer score: 5
Revisions (0)
No revisions yet.