patternsqlMajor
Slow update on large table with subquery
Viewed 0 times
updatewithslowlargesubquerytable
Problem
With
In English, this query is counting the number of distinct phrases listed in Bad_Phrase that are a substring of the field
I would like some suggestions on how to have this query run considerably faster.
SourceTable having >15MM records and Bad_Phrase having >3K records, the following query takes almost 10 hours to run on SQL Server 2005 SP4.UPDATE [SourceTable]
SET
Bad_Count=
(
SELECT
COUNT(*)
FROM Bad_Phrase
WHERE
[SourceTable].Name like '%'+Bad_Phrase.PHRASE+'%'
)In English, this query is counting the number of distinct phrases listed in Bad_Phrase that are a substring of the field
Name in the SourceTable and then placing that result in the field Bad_Count.I would like some suggestions on how to have this query run considerably faster.
Solution
While I agree with other commenters that this is a computationally expensive problem, I think that there is a lot of room for improvement by tweaking the SQL that you are using. To illustrate, I create a fake data set with 15MM names and 3K phrases, ran the old approach, and ran a new approach.
Full script to generate a fake data set and try out the new approach
TL;DR
On my machine and this fake data set, the original approach takes about 4 hours to run. The proposed new approach takes about 10 minutes, a considerable improvement. Here is a short summary of the proposed approach:
Original approach: algorithmic analysis
From the plan of the original
The query is actually proportional to the length of the
New approach: code
This code is also available in the full pastebin but I've copied it here for convenience. The pastebin also has the full procedure definition, which includes the
New approach: query plans
First, we generate the substring starting at each character offset
Then create a clustered index on these substrings
Now, for each bad phrase we seek into these substrings to identify any matches. We then compute the number of distinct bad phrases that matches one or more substrings of that string. This is really the key step; because of the way that we have indexed the substrings, we no longer have to check a full cross-product of bad phrases and names. This step, which does the actual computation, accounts for only about 10% of the actual run-time (the rest is the pre-processing of substrings).
Lastly, perform the actual update statement, using a
New approach: algorithmic analysis
The new approach can be divided into two phases, pre-processing and matching. Let's define the following variables:
Full script to generate a fake data set and try out the new approach
TL;DR
On my machine and this fake data set, the original approach takes about 4 hours to run. The proposed new approach takes about 10 minutes, a considerable improvement. Here is a short summary of the proposed approach:
- For each name, generate the substring starting at each character offset (and capped at the length of the longest bad phrase, as an optimization)
- Create a clustered index on these substrings
- For each bad phrase, perform a seek into these substrings to identify any matches
- For each original string, compute the number of distinct bad phrases that matches one or more substrings of that string
Original approach: algorithmic analysis
From the plan of the original
UPDATE statement, we can see that the amount of work is linearly proportional to both the number of names (15MM) and the number of phrases (3K). So if we multiple both the number of names and phrases by 10, the overall run time is going to be ~100 times slower. The query is actually proportional to the length of the
name as well; while this is a bit hidden in the query plan, it comes through in the "number of executions" for seeking into the table spool. In the actual plan, we can see that this occurs not just once per name, but actually once per character offset within the name. So this approach is O(# names # phrases name length) in run-time complexity.New approach: code
This code is also available in the full pastebin but I've copied it here for convenience. The pastebin also has the full procedure definition, which includes the
@minId and @maxId variables that you see below to define the boundaries of the current batch.-- For each name, generate the string at each offset
DECLARE @maxBadPhraseLen INT = (SELECT MAX(LEN(phrase)) FROM Bad_Phrase)
SELECT s.id, sub.sub_name
INTO #SubNames
FROM (SELECT * FROM SourceTable WHERE id BETWEEN @minId AND @maxId) s
CROSS APPLY (
-- Create a row for each substring of the name, starting at each character
-- offset within that string. For example, if the name is "abcd", this CROSS APPLY
-- will generate 4 rows, with values ("abcd"), ("bcd"), ("cd"), and ("d"). In order
-- for the name to be LIKE the bad phrase, the bad phrase must match the leading X
-- characters (where X is the length of the bad phrase) of at least one of these
-- substrings. This can be efficiently computed after indexing the substrings.
-- As an optimization, we only store @maxBadPhraseLen characters rather than
-- storing the full remainder of the name from each offset; all other characters are
-- simply extra space that isn't needed to determine whether a bad phrase matches.
SELECT TOP(LEN(s.name)) SUBSTRING(s.name, n.n, @maxBadPhraseLen) AS sub_name
FROM Numbers n
ORDER BY n.n
) sub
-- Create an index so that bad phrases can be quickly compared for a match
CREATE CLUSTERED INDEX IX_SubNames ON #SubNames (sub_name)
-- For each name, compute the number of distinct bad phrases that match
-- By "match", we mean that the a substring starting from one or more
-- character offsets of the overall name starts with the bad phrase
SELECT s.id, COUNT(DISTINCT b.phrase) AS bad_count
INTO #tempBadCounts
FROM dbo.Bad_Phrase b
JOIN #SubNames s
ON s.sub_name LIKE b.phrase + '%'
GROUP BY s.id
-- Perform the actual update into a "bad_count_new" field
-- For validation, we'll compare bad_count_new with the originally computed bad_count
UPDATE s
SET s.bad_count_new = COALESCE(b.bad_count, 0)
FROM dbo.SourceTable s
LEFT JOIN #tempBadCounts b
ON b.id = s.id
WHERE s.id BETWEEN @minId AND @maxIdNew approach: query plans
First, we generate the substring starting at each character offset
Then create a clustered index on these substrings
Now, for each bad phrase we seek into these substrings to identify any matches. We then compute the number of distinct bad phrases that matches one or more substrings of that string. This is really the key step; because of the way that we have indexed the substrings, we no longer have to check a full cross-product of bad phrases and names. This step, which does the actual computation, accounts for only about 10% of the actual run-time (the rest is the pre-processing of substrings).
Lastly, perform the actual update statement, using a
LEFT OUTER JOIN to assign a count of 0 to any names for which we found no bad phrases.New approach: algorithmic analysis
The new approach can be divided into two phases, pre-processing and matching. Let's define the following variables:
N= # of names
B= # of bad phrases
L= averag
Code Snippets
-- For each name, generate the string at each offset
DECLARE @maxBadPhraseLen INT = (SELECT MAX(LEN(phrase)) FROM Bad_Phrase)
SELECT s.id, sub.sub_name
INTO #SubNames
FROM (SELECT * FROM SourceTable WHERE id BETWEEN @minId AND @maxId) s
CROSS APPLY (
-- Create a row for each substring of the name, starting at each character
-- offset within that string. For example, if the name is "abcd", this CROSS APPLY
-- will generate 4 rows, with values ("abcd"), ("bcd"), ("cd"), and ("d"). In order
-- for the name to be LIKE the bad phrase, the bad phrase must match the leading X
-- characters (where X is the length of the bad phrase) of at least one of these
-- substrings. This can be efficiently computed after indexing the substrings.
-- As an optimization, we only store @maxBadPhraseLen characters rather than
-- storing the full remainder of the name from each offset; all other characters are
-- simply extra space that isn't needed to determine whether a bad phrase matches.
SELECT TOP(LEN(s.name)) SUBSTRING(s.name, n.n, @maxBadPhraseLen) AS sub_name
FROM Numbers n
ORDER BY n.n
) sub
-- Create an index so that bad phrases can be quickly compared for a match
CREATE CLUSTERED INDEX IX_SubNames ON #SubNames (sub_name)
-- For each name, compute the number of distinct bad phrases that match
-- By "match", we mean that the a substring starting from one or more
-- character offsets of the overall name starts with the bad phrase
SELECT s.id, COUNT(DISTINCT b.phrase) AS bad_count
INTO #tempBadCounts
FROM dbo.Bad_Phrase b
JOIN #SubNames s
ON s.sub_name LIKE b.phrase + '%'
GROUP BY s.id
-- Perform the actual update into a "bad_count_new" field
-- For validation, we'll compare bad_count_new with the originally computed bad_count
UPDATE s
SET s.bad_count_new = COALESCE(b.bad_count, 0)
FROM dbo.SourceTable s
LEFT JOIN #tempBadCounts b
ON b.id = s.id
WHERE s.id BETWEEN @minId AND @maxIdContext
StackExchange Database Administrators Q#115984, answer score: 21
Revisions (0)
No revisions yet.