Is there appreciable overhead from having a large proportion of empty table partitions in SQL Server?

Submitted by: @import:stackexchange-dba

Problem

I recently inherited a project that partitions a table on a date column. A daily scheduled task implements a sliding window scheme that keeps partitions for 30 days in the past and 60 days in the future.

In reality, the inserted data uses SYSUTCDATETIME() for the partitioning column, so the 60 future partitions are always empty.

Is this an issue that needs addressing or should I just let sleeping dogs lie?

Solution

Having a large number of partitions can cause a noticeable performance impact, so removing the relatively large number of empty ones in this case may provide an easy benefit.

Setup

CREATE PARTITION FUNCTION [pf](datetime2(2)) AS RANGE RIGHT FOR VALUES (
    N'2020-05-06', N'2020-05-07', N'2020-05-08', N'2020-05-09', N'2020-05-10', N'2020-05-11', N'2020-05-12',
    N'2020-05-13', N'2020-05-14', N'2020-05-15', N'2020-05-16', N'2020-05-17', N'2020-05-18', N'2020-05-19',
    N'2020-05-20', N'2020-05-21', N'2020-05-22', N'2020-05-23', N'2020-05-24', N'2020-05-25', N'2020-05-26',
    N'2020-05-27', N'2020-05-28', N'2020-05-29', N'2020-05-30', N'2020-05-31',
    N'2020-06-01', N'2020-06-02', N'2020-06-03', N'2020-06-04', N'2020-06-05', N'2020-06-06', N'2020-06-07',
    N'2020-06-08', N'2020-06-09', N'2020-06-10', N'2020-06-11', N'2020-06-12', N'2020-06-13', N'2020-06-14',
    N'2020-06-15', N'2020-06-16', N'2020-06-17', N'2020-06-18', N'2020-06-19', N'2020-06-20', N'2020-06-21',
    N'2020-06-22', N'2020-06-23', N'2020-06-24', N'2020-06-25', N'2020-06-26', N'2020-06-27', N'2020-06-28',
    N'2020-06-29', N'2020-06-30',
    N'2020-07-01', N'2020-07-02', N'2020-07-03', N'2020-07-04', N'2020-07-05', N'2020-07-06', N'2020-07-07',
    N'2020-07-08', N'2020-07-09', N'2020-07-10', N'2020-07-11', N'2020-07-12', N'2020-07-13', N'2020-07-14',
    N'2020-07-15', N'2020-07-16', N'2020-07-17', N'2020-07-18', N'2020-07-19', N'2020-07-20', N'2020-07-21',
    N'2020-07-22', N'2020-07-23', N'2020-07-24', N'2020-07-25', N'2020-07-26', N'2020-07-27', N'2020-07-28',
    N'2020-07-29', N'2020-07-30', N'2020-07-31',
    N'2020-08-01', N'2020-08-02', N'2020-08-03'
);

CREATE PARTITION SCHEME [ps] AS PARTITION [pf] ALL TO ([PRIMARY]);

CREATE TABLE T1(X INT PRIMARY KEY);

-- Populate T1 with the integers 1 to 30,000
INSERT INTO T1
SELECT TOP 30000 ROW_NUMBER() OVER (ORDER BY @@SPID)
FROM   sys.all_objects o1,
       sys.all_objects o2;

CREATE TABLE T2
(
   X        INT,
   dt2      DATETIME2(2),
   OtherCol CHAR(100),
   PRIMARY KEY(X, dt2) ON ps(dt2)
);

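-- One row every 100 ms starting from 2020-05-05, which spans roughly 25 days.
-- 21,474,836 is the largest row count for which 100 * row number still fits in an INT.
INSERT INTO T2 (X, dt2)
SELECT TOP 21474836 ROW_NUMBER() OVER (ORDER BY @@SPID),
                    DATEADD(MILLISECOND, 100 * ROW_NUMBER() OVER (ORDER BY @@SPID), '2020-05-05')
FROM   sys.all_objects o1,
       sys.all_objects o2,
       sys.all_objects o3;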


This sets up a situation somewhat similar to the one described in the question. The partition function has 90 boundary values and therefore defines 91 partitions: the first 25 contain data and the remaining 66 are empty.
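
One way to check that distribution is to read the per-partition row counts from the catalog. The query below is only a sketch: it assumes T2 was created in the dbo schema and that its primary key is the clustered index (index_id = 1), which is the default when no other clustered index exists. The rows column in sys.partitions is documented as approximate.

```sql
-- Row count per partition of T2; empty partitions show 0 rows.
SELECT p.partition_number,
       p.rows
FROM   sys.partitions AS p
WHERE  p.object_id = OBJECT_ID(N'dbo.T2')  -- assumption: T2 lives in dbo
       AND p.index_id = 1                  -- assumption: clustered primary key
ORDER  BY p.partition_number;
```

With the setup above, this should return 91 rows, with non-zero counts only for the lowest-numbered partitions.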

Query

SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM T1 INNER JOIN T2 ON T1.X = T2.X;


The above took 9.8 seconds for me. It needed to scan the whole of T2.

We have a partition-aligned index with leading column X, so what happens if we force a loop join?

SELECT COUNT(*)
FROM T1 INNER LOOP JOIN T2 ON T1.X = T2.X;


The loop join was actually slightly worse (0.4 seconds slower). The 30,000 seeks are unable to do any partition elimination and all need to inspect 91 partitions, multiplying out the work required significantly.
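
To put a rough number on that multiplication, the arithmetic can be written out directly. This is a back-of-the-envelope sketch, not a measurement: the variable names are made up for illustration, and it simply assumes one index seek per outer row per partition.

```sql
-- Illustrative arithmetic only: per-partition seeks implied by the loop join.
DECLARE @outer_rows      int = 30000; -- rows in T1
DECLARE @all_partitions  int = 91;    -- 90 boundary values => 91 partitions
DECLARE @used_partitions int = 25;    -- partitions that actually hold rows

SELECT @outer_rows * @all_partitions  AS seeks_touching_every_partition,   -- 2,730,000
       @outer_rows * @used_partitions AS seeks_if_empty_ones_were_skipped; -- 750,000
```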

One final attempt...

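SELECT COUNT(*)
FROM T1
-- MaxPtn = highest-numbered partition that actually contains rows
CROSS APPLY (SELECT TOP 1 $partition.pf(dt2) FROM T2 ORDER BY $partition.pf(dt2) DESC) CA(MaxPtn)
-- limit each seek to partitions 1..MaxPtn
INNER LOOP JOIN T2 ON T1.X = T2.X AND $partition.pf(dt2) <= CA.MaxPtn;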


This completed in 3.7 seconds for me. The difference is that the query now first identifies the highest non-empty partition, and the subsequent seeks use this value to avoid doing any work for the empty ones.

So my conclusion is that empty partitions certainly can have a noticeable impact on query performance, and they likely should be addressed if you run queries that do not include the partitioning column in a predicate.
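
For contrast, here is a sketch of the kind of query that does include the partitioning column in a predicate. The date range is a hypothetical value chosen for illustration, and this variant was not timed as part of the tests above. With a range like this, I would expect the optimizer to be able to eliminate every partition outside the requested day, so the empty partitions should matter far less.

```sql
-- Hypothetical example: a predicate on dt2 lets the engine skip
-- partitions that cannot contain matching rows.
SELECT COUNT(*)
FROM   T1
       INNER JOIN T2
               ON T1.X = T2.X
WHERE  T2.dt2 >= '2020-05-10'
       AND T2.dt2 < '2020-05-11';
```

The actual execution plan for such a query reports which partitions were accessed, which is a direct way to confirm whether elimination happened.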

Removing all but one empty tail partition with the below...

DECLARE @dt datetime2(2) = '2020-08-03';

-- Merge away the boundaries from 2020-08-03 back to 2020-05-31, newest first
WHILE @dt >= '2020-05-31'
BEGIN
    ALTER PARTITION FUNCTION pf()
        MERGE RANGE (@dt);

    SET @dt = DATEADD(DAY, -1, @dt);
END;


... gives the same speedup. The hint was also no longer needed to get a loop join plan, as the estimated operator cost for the seek fell from 523.328 in the original INNER LOOP JOIN case to 149.546.
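
To confirm what the merge loop left behind, the partition function can be inspected through the catalog views. This is a minimal sketch that assumes the function is still named pf and that the loop ran to completion.

```sql
-- Number of partitions the function now defines.
SELECT pf.name,
       pf.fanout AS partition_count
FROM   sys.partition_functions AS pf
WHERE  pf.name = N'pf';

-- Remaining boundary values, in order.
SELECT prv.boundary_id,
       prv.value AS boundary_value
FROM   sys.partition_functions AS pf
       INNER JOIN sys.partition_range_values AS prv
               ON prv.function_id = pf.function_id
WHERE  pf.name = N'pf'
ORDER  BY prv.boundary_id;
```

The loop removes the boundaries from 2020-05-31 onwards, so 25 boundary values (2020-05-06 to 2020-05-30) and 26 partitions should remain. The drop from 91 to 26 partitions is a factor of 3.5, which is at least consistent with the estimated seek cost falling from 523.328 to 149.546 (a ratio of roughly 3.5).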

Context

StackExchange Database Administrators Q#268647, answer score: 15
