HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to check if a combination of columns exists in another table in SQL Server?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
combinationhowcolumnssqlanotherexistsserverchecktable

Problem

This is how I would express my query in MySQL or Oracle:

SELECT t1.c1, t2.c1, t1.c2, t2.c2, ...
FROM t1, t2
WHERE (t1.c1, t2.c1) NOT IN (SELECT c1, c2 FROM t3)


How can I express the same query in SQL Server?

I figured I could concatenate them as strings but it cannot be the best way to do it.

I could not find a way to do it nicely in SQL Server.

I cannot join them once more because each table contains millions of records.

Example tables so that it is easier to imagine the problem:

t1:

c1 c2 c3
--------
1  x  A
2  y  B
3  z  C


t2:

c1 c2 c3
--------
1  m  D
2  n  E
3  t  F
4  v  G


t3:

c1 c2 c4
--------
1  2  aa
1  4  bb
2  3  cc

Solution

A join in SQL Server is not automatically implemented as a nested loop. For example, a hash join can be used to implement the NOT IN. This means that the query will not necessarily "end up in a number of tuples over 10^18" even if each of the three tables being joined contains over 10^6 rows.

For example, here is a query plan on a reduced data volume that shows that the largest number of rows at any point in the query plan is proportional to the the size of t1 times the size of t2, but is independent of the size of t3.

Even so, there is no getting around the fact that a query that selects 10 trillion rows (presumably to aggregate them, or to insert them into a columnstore table?) is going to take quite some time to execute. Here's the full script, along with further analysis:

Create sample data

DROP TABLE IF EXISTS #t1, #t2, #t3, #batchMode, #results
GO

CREATE TABLE #t1 (c1 INT NOT NULL, c2 INT NOT NULL, INDEX CI CLUSTERED (c1, c2))
CREATE TABLE #t2 (c1 INT NOT NULL, c2 INT NOT NULL, INDEX CI CLUSTERED (c1, c2))
CREATE TABLE #t3 (c1 INT NOT NULL, c2 INT NOT NULL, INDEX CI CLUSTERED (c1, c2))
CREATE TABLE #batchMode (dummy INT NOT NULL)
CREATE CLUSTERED COLUMNSTORE INDEX CCI ON #batchMode
GO

DECLARE @numRows INT = 32000
INSERT INTO #t1 (c1, c2)
SELECT TOP(@numRows)
    ABS(CRYPT_GEN_RANDOM(8) % 33) AS c1,
    ABS(CRYPT_GEN_RANDOM(8) % 57) AS c2
FROM master..spt_values v1
CROSS JOIN master..spt_values v2

INSERT INTO #t2 (c1, c2)
SELECT TOP(@numRows)
    ABS(CRYPT_GEN_RANDOM(8) % 17) AS c1,
    ABS(CRYPT_GEN_RANDOM(8) % 74) AS c2
FROM master..spt_values v1
CROSS JOIN master..spt_values v2

INSERT INTO #t3 (c1, c2)
SELECT TOP(@numRows)
    ABS(CRYPT_GEN_RANDOM(8) % 33) AS c1,
    ABS(CRYPT_GEN_RANDOM(8) % 17) AS c2
FROM master..spt_values v1
CROSS JOIN master..spt_values v2
GO


Execute the query

By executing the query with different row counts, we can see that query performances increases quadratically (N^2) with the row count of t1 and t2, but is not significantly impacted by the row count of t3. We can therefore extrapolate the execution time with the full 16 trillion row cross product to be about 217 hours.

-- Perform your query, using aggregates to avoid writing billions of rows to console
-- 4K rows / table      CPU time = 13906 ms,  elapsed time = 1510 ms.
-- 8K rows / table      CPU time = 54563 ms,  elapsed time = 4548 ms.
-- 16K rows / table     CPU time = 192032 ms,  elapsed time = 13683 ms.
-- 32K rows / table     CPU time = 757688 ms,  elapsed time = 50754 ms.
-- Since the time scales quadratically once we have enough rows to efficiently utilize
-- all threads, we can predict 4MM rows / table with the following query:
--      SELECT POWER(4000000 / 32000, 2) * (50. / 3600)
-- 4MM rows / table     Estimated elapsed time = 217 hours
SELECT COUNT_BIG(*), SUM(1.0 * t1.c1 * t2.c1 * t1.c2 * t2.c2)
FROM #t1 t1
CROSS JOIN #t2 t2
WHERE NOT EXISTS (
    SELECT *
    FROM #t3 t3
    WHERE t3.c1 = t1.c1
        AND t3.c2 = t2.c2
)
    /* Enable batch mode hash join to avoid repartition streams on billions of rows
        This reduces elapsed time from 77 seconds to 50 seconds at 32K rows / table */
    AND NOT EXISTS (SELECT * FROM #batchMode WHERE 0=1)
OPTION (MAXDOP 16)
GO


Replace loop join with batch mode hash join

Because this query is very CPU intensive, it is natural to consider whether the majority of the work — the loop join to build out the cross-product of rows between t1 and t2 — can be replaced with an operator that supports batch mode execution. One way to do this is to re-implement the cross join as an equijoin on a single column where all rows in each table have the same value.

This reduces the projected time by about a factor of 10, to about 20 hours to filter and aggregate the 16 trillion intermediate rows generated by the cross product.

```
-- Add a column that will always have the value 1 to each table, enabling an
-- as hash match implementation of a cross join by matching on this column
ALTER TABLE #t1 ADD one TINYINT NOT NULL DEFAULT (1)
ALTER TABLE #t2 ADD one TINYINT NOT NULL DEFAULT (1)
GO

-- 32K rows / table CPU time = 61112 ms, elapsed time = 5507 ms.
-- 64K rows / table CPU time = 274955 ms, elapsed time = 19695 ms.
-- Since the time scales quadratically once we have enough rows to efficiently utilize
-- all threads, we can predict 4MM rows / table with the following query:
-- SELECT POWER(4000000 / 64000, 2) * (19. / 3600)
-- 4MM rows / table Estimated elapsed time = 20 hours
SELECT COUNT_BIG(), SUM(1.0 t1.c1 t2.c1 t1.c2 * t2.c2)
FROM #t1 t1
JOIN #t2 t2
/* Add an equijoin logically equivalent to 1=1 in order to enable batch mode
hash join to replace a loop join for this "cross join" */
ON t2.one = t1.one
WHERE NOT EXISTS (
SELECT *
FROM #t3 t3
WHERE t3.c1 = t1.c1
AND t3.c2 = t2.c2
)
/ Enable batch mode hash join /
AND NOT EXISTS (SELECT * FROM #batchMode WHERE

Code Snippets

DROP TABLE IF EXISTS #t1, #t2, #t3, #batchMode, #results
GO

CREATE TABLE #t1 (c1 INT NOT NULL, c2 INT NOT NULL, INDEX CI CLUSTERED (c1, c2))
CREATE TABLE #t2 (c1 INT NOT NULL, c2 INT NOT NULL, INDEX CI CLUSTERED (c1, c2))
CREATE TABLE #t3 (c1 INT NOT NULL, c2 INT NOT NULL, INDEX CI CLUSTERED (c1, c2))
CREATE TABLE #batchMode (dummy INT NOT NULL)
CREATE CLUSTERED COLUMNSTORE INDEX CCI ON #batchMode
GO

DECLARE @numRows INT = 32000
INSERT INTO #t1 (c1, c2)
SELECT TOP(@numRows)
    ABS(CRYPT_GEN_RANDOM(8) % 33) AS c1,
    ABS(CRYPT_GEN_RANDOM(8) % 57) AS c2
FROM master..spt_values v1
CROSS JOIN master..spt_values v2

INSERT INTO #t2 (c1, c2)
SELECT TOP(@numRows)
    ABS(CRYPT_GEN_RANDOM(8) % 17) AS c1,
    ABS(CRYPT_GEN_RANDOM(8) % 74) AS c2
FROM master..spt_values v1
CROSS JOIN master..spt_values v2

INSERT INTO #t3 (c1, c2)
SELECT TOP(@numRows)
    ABS(CRYPT_GEN_RANDOM(8) % 33) AS c1,
    ABS(CRYPT_GEN_RANDOM(8) % 17) AS c2
FROM master..spt_values v1
CROSS JOIN master..spt_values v2
GO
-- Perform your query, using aggregates to avoid writing billions of rows to console
-- 4K rows / table      CPU time = 13906 ms,  elapsed time = 1510 ms.
-- 8K rows / table      CPU time = 54563 ms,  elapsed time = 4548 ms.
-- 16K rows / table     CPU time = 192032 ms,  elapsed time = 13683 ms.
-- 32K rows / table     CPU time = 757688 ms,  elapsed time = 50754 ms.
-- Since the time scales quadratically once we have enough rows to efficiently utilize
-- all threads, we can predict 4MM rows / table with the following query:
--      SELECT POWER(4000000 / 32000, 2) * (50. / 3600)
-- 4MM rows / table     Estimated elapsed time = 217 hours
SELECT COUNT_BIG(*), SUM(1.0 * t1.c1 * t2.c1 * t1.c2 * t2.c2)
FROM #t1 t1
CROSS JOIN #t2 t2
WHERE NOT EXISTS (
    SELECT *
    FROM #t3 t3
    WHERE t3.c1 = t1.c1
        AND t3.c2 = t2.c2
)
    /* Enable batch mode hash join to avoid repartition streams on billions of rows
        This reduces elapsed time from 77 seconds to 50 seconds at 32K rows / table */
    AND NOT EXISTS (SELECT * FROM #batchMode WHERE 0=1)
OPTION (MAXDOP 16)
GO
-- Add a column that will always have the value 1 to each table, enabling an
-- as hash match implementation of a cross join by matching on this column
ALTER TABLE #t1 ADD one TINYINT NOT NULL DEFAULT (1)
ALTER TABLE #t2 ADD one TINYINT NOT NULL DEFAULT (1)
GO

-- 32K rows / table     CPU time = 61112 ms,    elapsed time = 5507 ms.
-- 64K rows / table     CPU time = 274955 ms,   elapsed time = 19695 ms.
-- Since the time scales quadratically once we have enough rows to efficiently utilize
-- all threads, we can predict 4MM rows / table with the following query:
--      SELECT POWER(4000000 / 64000, 2) * (19. / 3600)
-- 4MM rows / table     Estimated elapsed time = 20 hours
SELECT COUNT_BIG(*), SUM(1.0 * t1.c1 * t2.c1 * t1.c2 * t2.c2)
FROM #t1 t1
JOIN #t2 t2
    /* Add an equijoin logically equivalent to 1=1 in order to enable batch mode
        hash join to replace a loop join for this "cross join" */
    ON t2.one = t1.one
WHERE NOT EXISTS (
    SELECT *
    FROM #t3 t3
    WHERE t3.c1 = t1.c1
        AND t3.c2 = t2.c2
)
    /* Enable batch mode hash join  */
    AND NOT EXISTS (SELECT * FROM #batchMode WHERE 0=1)
OPTION (MAXDOP 16)
GO

Context

StackExchange Database Administrators Q#210607, answer score: 3

Revisions (0)

No revisions yet.