HiveBrain v1.2.0
Get Started
← Back to all entries
principlesqlMinor

Single-row Settings table: pros and cons of Joins vs scalar subqueries

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
settingssubqueriesscalarprossingleandrowtablejoinscons

Problem

I work with an application that uses a SQL Server database including a number of tables that hold a single row of configuration data, which is sometimes needed in queries against more conventional multi-row tables. Most of the code I've seen accesses these tables via joins when working with a single query, but in a recent code review I saw an approach using a scalar subquery, along the lines of:

Select T.Id
From dbo.SomeTable T
Where T.SomeValue > (Select Tolerance From dbo.Settings)


Although it was clearly working, my initial reaction was to assume that it was a risky breach of our standard practices, but I experimented with the form a little and discovered the "Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, , >= or when the subquery is used as an expression" error. That makes this appear to avoid the risk of unwanted behaviour from an unexpected 1:n join. (That shouldn't be a concern with these single-row tables in practice, they're fairly robust, but I've seen it come up elsewhere in the system.)

The execution plans for my simple test case look pretty similar apart from a (probably very cheap) Stream Aggregate and an Assert, which I assume are responsible for the query engine's ability to recognise and throw the error in the multiple-row case.

Is there a generally-accepted best practice for using this kind of table? Are there any major pros and cons I should be aware of when choosing an approach?

(I'm aware that using variables to hold the data is also an option, but it's not always practical to do so in some of our code so I'd like to focus on the comparison of these two approaches and/or any other way of folding this into a single query.)

Solution

I created a set of demo tables to see how this ends up working in practice.

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.Settings;
CREATE TABLE dbo.Settings
(
    Id int IDENTITY(1,1) NOT NULL,
    Tolerance int NOT NULL,

    CONSTRAINT PK_Settings PRIMARY KEY (Id)
);
GO

INSERT INTO dbo.Settings (Tolerance) VALUES (1073741823);

CREATE TABLE dbo.SomeTable
(
    Id int IDENTITY(1,1) NOT NULL,
    SomeValue int NOT NULL,

    CONSTRAINT PK_SomeTable PRIMARY KEY (Id)
);
CREATE NONCLUSTERED INDEX IX_SomeValue
ON dbo.SomeTable (SomeValue);
GO

INSERT INTO dbo.SomeTable
    (SomeValue)
SELECT
    SomeValue = v1.number
FROM master.dbo.spt_values v1
CROSS JOIN master.dbo.spt_values v2;
GO


That puts ~6,000,000 rows into the SomeTable table, ~15,000 of which are great than the row in the Settings table (1073741823).

The scalar subquery version from the OP

SELECT st.Id
FROM dbo.SomeTable st
WHERE st.SomeValue > (SELECT s.Tolerance FROM dbo.Settings s);


This results in a sort of strange query plan. The estimates are off by orders of magnitude. This results in a parallel plan being selected. However, since there's only one row on the upper input of the join, all the rows coming from the lower input end up on one thread - resulting in badly skewed (and completely unhelpful) parallelism.

I don't have a good explanation of why the estimate is so bad here, but it seems like a bad sign for this approach, depending on your table sizes and data distribution.

The INNER JOIN version

SELECT st.Id
FROM dbo.SomeTable st
INNER JOIN dbo.Settings s
    ON st.SomeValue > s.Tolerance;


This gets a much better estimate (only off by 2x), and doesn't warrant parallelism.

Actual Scalar Subquery

We can force the subquery to be scalar by adding TOP (1) to the inner query:

SELECT st.Id
FROM dbo.SomeTable st
WHERE st.SomeValue > (SELECT TOP (1) s.Tolerance FROM dbo.Settings s);


This results in a very similar query plan to the INNER JOIN version, with the TOP there to enforce that we only get one row from the Settings table.

Using a Parameter Instead

You mentioned avoiding this approach, but I was curious. If your application code can cache / retrieve this value, and pass it the query (could be in a stored proc, I'm using sp_executesql here):

EXEC sys.sp_executesql 
    N'SELECT st.Id FROM dbo.SomeTable st WHERE st.SomeValue > @SomeValue',
    N'@SomeValue int',
    @SomeValue = 1073741823;


This results in a good estimate, and in general a fairly efficient execution plan, and doesn't have to touch the Settings table.

Note: all of the items above depend on there being an index on the SomeValue column of the SomeTable table.

Based on all of that, I'd agree with J.D.'s answer that the INNER JOIN approach is probably more reliable. If you're forced to use the subquery approach, see if adding TOP (1) helps things out.

I'd add that if you can use a parameter, even better, but it sounds like that's not viable.

Code Snippets

USE tempdb;
GO

DROP TABLE IF EXISTS dbo.Settings;
CREATE TABLE dbo.Settings
(
    Id int IDENTITY(1,1) NOT NULL,
    Tolerance int NOT NULL,

    CONSTRAINT PK_Settings PRIMARY KEY (Id)
);
GO

INSERT INTO dbo.Settings (Tolerance) VALUES (1073741823);

CREATE TABLE dbo.SomeTable
(
    Id int IDENTITY(1,1) NOT NULL,
    SomeValue int NOT NULL,

    CONSTRAINT PK_SomeTable PRIMARY KEY (Id)
);
CREATE NONCLUSTERED INDEX IX_SomeValue
ON dbo.SomeTable (SomeValue);
GO

INSERT INTO dbo.SomeTable
    (SomeValue)
SELECT
    SomeValue = v1.number
FROM master.dbo.spt_values v1
CROSS JOIN master.dbo.spt_values v2;
GO
SELECT st.Id
FROM dbo.SomeTable st
WHERE st.SomeValue > (SELECT s.Tolerance FROM dbo.Settings s);
SELECT st.Id
FROM dbo.SomeTable st
INNER JOIN dbo.Settings s
    ON st.SomeValue > s.Tolerance;
SELECT st.Id
FROM dbo.SomeTable st
WHERE st.SomeValue > (SELECT TOP (1) s.Tolerance FROM dbo.Settings s);
EXEC sys.sp_executesql 
    N'SELECT st.Id FROM dbo.SomeTable st WHERE st.SomeValue > @SomeValue',
    N'@SomeValue int',
    @SomeValue = 1073741823;

Context

StackExchange Database Administrators Q#318072, answer score: 6

Revisions (0)

No revisions yet.