patternsqlMinor
Pick the most non-default value from a selection of values
Viewed 0 times
fromtheselectionnonvaluedefaultvaluespickmost
Problem
Given the following tables:
My goal is to select all
The customer will only ever have one item that matches
It is possible (in the database) for the customer to have a
This is similar to a previous question of mine (Select all records, join with table A if join exists, table B if not), but much less pleasant. Because they're
CREATE TABLE FeeTestClient (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, Name VARCHAR(16))
INSERT INTO FeeTestClient (Name)
VALUES ('Test'), ('Test 2'), ('Test 3')
CREATE TABLE FeeTest (FeeId INT IDENTITY(1,1) NOT NULL PRIMARY KEY, ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16))
INSERT INTO FeeTest (ClientId, Fee, Val, Val2)
VALUES (1, 15, 'Default', 'Default'),
(1, 10, 'Default', 'asdf'),
(2, 15, 'Default', 'Default'),
(2, 20, 'Default', 'qwer'),
(2, 10, 'zxcv', 'asdf'),
(3, 20, 'Default', 'Default')My goal is to select all
FeeTestClient elements, and pick the fee that is least default. The rules on a default fee are pretty simple: if Val2 is 'Default', then Val cannot be anything except 'Default', and for each fee, we want the first one where Val is not 'Default', or the first one where Val2 is not 'Default', otherwise we are guaranteed a Val = 'Default' AND Val2 = 'Default' match.The customer will only ever have one item that matches
'Default'/'Default', one item that matches 'Default'/____ and one item that matches ____/____. (Though the last two rows may not exist.) If they have a ____/____ then they'll always have a 'Default'/____, every customer will have a 'Default'/'Default'. They can never have a ____/'Default' — this is an invalid state on the application, and they can never have more than one of the same x/y, this is enforced by a UNIQUE constraint on the table.It is possible (in the database) for the customer to have a
'Default'/a, and a 'Default'/b, but that is considered an invalid state in the application, and there is a test for that. (The user must delete one of the two.)This is similar to a previous question of mine (Select all records, join with table A if join exists, table B if not), but much less pleasant. Because they're
INT (actually FLOAT in the database, but the same issue applies) they're being aggregated Solution
I'm not sure that your MCVE fully represents the problem that you have, but I'll answer the question as given. This question is about performance, so having just a few rows in the table won't cut it. I duplicated your sample data a million times for a total of 6 million rows for
Using
The plan is as I expect. There's a hash join along with a hash aggregate at the end.
However, you have other options because you have a
With
We're searching on
The index isn't covering so the optimizer doesn't want to use it. Using it would require a lot of key lookups to get the columns that aren't on the index. I can force the index to be used with a lazy, undocumented trick that you should not use in production:
Now the query runs in 5 seconds. We can see the key lookups in the plan:
Our final attempt with be with a covering index:
The query from before now runs in four seconds:
How can you apply this to your giant, anonymized query? You have a lot of key lookups and an index spool. Try defining covering indexes so that you can access the data you need more efficiently. I can't make any kind of guarantee about the overall runtime but it should help to some degree. Please note that I did not have time to take a very careful look at the plans that you posted.
In case anyone is following along at home, there's a bizarre edge case that can cause the covering index to still not be used. Here's one way to work around it:
FeeTest and 3 million rows for FeeTestClient. Code to do that below:DROP TABLE IF EXISTS FeeTestClient;
CREATE TABLE FeeTestClient (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, [Name] VARCHAR(16));
INSERT INTO FeeTestClient WITH (TABLOCK)
([Name])
SELECT 'ZZZZZZZ' + CAST(RN AS VARCHAR(7))
FROM
(
SELECT TOP (3000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);
DROP TABLE IF EXISTS FeeTest_source;
CREATE TABLE FeeTest_source (ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16));
INSERT INTO FeeTest_source (ClientId, Fee, Val, Val2)
VALUES (1, 15, 'Default', 'Default'),
(1, 10, 'Default', 'asdf'),
(2, 15, 'Default', 'Default'),
(2, 20, 'Default', 'qwer'),
(2, 10, 'zxcv', 'asdf'),
(3, 20, 'Default', 'Default');
SELECT 3 * client.Id - 2 + client.Id % 3 AS ClientId
, src.Fee
, src.Val
, src.Val2 into #t
FROM FeeTest_source src
INNER JOIN FeeTestClient client ON src.ClientId = 1 + client.Id % 3;
DROP TABLE IF EXISTS FeeTest;
CREATE TABLE FeeTest (FeeId INT IDENTITY(1,1) NOT NULL PRIMARY KEY, ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16));
INSERT INTO FeeTest WITH (TABLOCK)
(ClientId, Fee, Val, Val2)
SELECT * FROM #t;
DROP TABLE #t;Using
GROUP BY and only keeping the relevant aggregate could be a good approach depending on the nature of the data in the table and the indexes that are defined against the table. The query below finishes in 2 seconds on my machine:SELECT Id, Name, MAX(COALESCE(CASE WHEN Val <> 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 = 'Default' THEN Fee END))
FROM FeeTestClient
LEFT OUTER JOIN FeeTest ON FeeTest.ClientId = Id
GROUP BY Id, NameThe plan is as I expect. There's a hash join along with a hash aggregate at the end.
However, you have other options because you have a
FeeTestClient table. Another strategy is to calculate what you're looking for per row with OUTER APPLY. One way to do that is below:SELECT Id, [Name], oa.Fee
FROM FeeTestClient ftc
OUTER APPLY (
SELECT TOP 1 ft.Fee
FROM FeeTest ft
WHERE ft.ClientId = ftc.Id
ORDER BY
CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END
+ CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END
DESC, ft.ClientId
) oa;With
APPLY and TOP you almost always want a good index on the inner table. The query optimizer builds a temporary index for us via the spool, and the query takes 14 seconds to run on my machine:We're searching on
ClientId so let's try an index on that:CREATE INDEX NOT_COVERING ON FeeTest (ClientId);The index isn't covering so the optimizer doesn't want to use it. Using it would require a lot of key lookups to get the columns that aren't on the index. I can force the index to be used with a lazy, undocumented trick that you should not use in production:
SELECT Id, [Name], oa.Fee
FROM FeeTestClient ftc
OUTER APPLY (
SELECT TOP 1 ft.Fee
FROM FeeTest ft
WHERE ft.ClientId = ftc.Id
ORDER BY
CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END
+ CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END
DESC, ft.ClientId
) oa
OPTION (QueryRuleOff BuildSpool);Now the query runs in 5 seconds. We can see the key lookups in the plan:
Our final attempt with be with a covering index:
CREATE INDEX COVERING ON FeeTest (ClientId) INCLUDE (Val, Val2, Fee);The query from before now runs in four seconds:
How can you apply this to your giant, anonymized query? You have a lot of key lookups and an index spool. Try defining covering indexes so that you can access the data you need more efficiently. I can't make any kind of guarantee about the overall runtime but it should help to some degree. Please note that I did not have time to take a very careful look at the plans that you posted.
In case anyone is following along at home, there's a bizarre edge case that can cause the covering index to still not be used. Here's one way to work around it:
ALTER TABLE FeeTest
ADD MAGIC_COLUMN AS CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END + CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END;
CREATE INDEX COVERING_2 ON FeeTest (ClientId) INCLUDE (MAGIC_COLUMN, Fee);Code Snippets
DROP TABLE IF EXISTS FeeTestClient;
CREATE TABLE FeeTestClient (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, [Name] VARCHAR(16));
INSERT INTO FeeTestClient WITH (TABLOCK)
([Name])
SELECT 'ZZZZZZZ' + CAST(RN AS VARCHAR(7))
FROM
(
SELECT TOP (3000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);
DROP TABLE IF EXISTS FeeTest_source;
CREATE TABLE FeeTest_source (ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16));
INSERT INTO FeeTest_source (ClientId, Fee, Val, Val2)
VALUES (1, 15, 'Default', 'Default'),
(1, 10, 'Default', 'asdf'),
(2, 15, 'Default', 'Default'),
(2, 20, 'Default', 'qwer'),
(2, 10, 'zxcv', 'asdf'),
(3, 20, 'Default', 'Default');
SELECT 3 * client.Id - 2 + client.Id % 3 AS ClientId
, src.Fee
, src.Val
, src.Val2 into #t
FROM FeeTest_source src
INNER JOIN FeeTestClient client ON src.ClientId = 1 + client.Id % 3;
DROP TABLE IF EXISTS FeeTest;
CREATE TABLE FeeTest (FeeId INT IDENTITY(1,1) NOT NULL PRIMARY KEY, ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16));
INSERT INTO FeeTest WITH (TABLOCK)
(ClientId, Fee, Val, Val2)
SELECT * FROM #t;
DROP TABLE #t;SELECT Id, Name, MAX(COALESCE(CASE WHEN Val <> 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 = 'Default' THEN Fee END))
FROM FeeTestClient
LEFT OUTER JOIN FeeTest ON FeeTest.ClientId = Id
GROUP BY Id, NameSELECT Id, [Name], oa.Fee
FROM FeeTestClient ftc
OUTER APPLY (
SELECT TOP 1 ft.Fee
FROM FeeTest ft
WHERE ft.ClientId = ftc.Id
ORDER BY
CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END
+ CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END
DESC, ft.ClientId
) oa;CREATE INDEX NOT_COVERING ON FeeTest (ClientId);SELECT Id, [Name], oa.Fee
FROM FeeTestClient ftc
OUTER APPLY (
SELECT TOP 1 ft.Fee
FROM FeeTest ft
WHERE ft.ClientId = ftc.Id
ORDER BY
CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END
+ CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END
DESC, ft.ClientId
) oa
OPTION (QueryRuleOff BuildSpool);Context
StackExchange Database Administrators Q#185898, answer score: 5
Revisions (0)
No revisions yet.