HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

TOP(1) BY GROUP of very huge (100,000,000+) table

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
topgroup000huge100verytable

Problem

Setup

I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.

CREATE TABLE [data].OperationData NOT NULL,
[Size] [bigint] NULL,
[Begin] datetime2 NULL,
[End] datetime2 NOT NULL,
[Date] AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,
[DataSetCount] [bigint] NULL,
[Result] [int] NULL,
[Error] nvarchar NULL,
[Status] [int] NULL,
CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED
(
[SourceDeviceID] ASC,
[FileSource] ASC,
[End] ASC
))

CREATE TABLE [model].SourceDevice NOT NULL,
[Name] nvarchar NULL,
CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED
(
[ID] ASC
))

ALTER TABLE [data].[OperationData] WITH CHECK ADD CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])
REFERENCES [model].[SourceDevice] ([ID])


The table is clustered at around 500 clusters and on a daily base.

Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.

Index based SELECTs are lightning fast and we had no problem with it.

Problem

I need to know the last (TOP) row by [End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.

Question

I need to find a way to solve this in a good way and without bringing the DB to the limits.

Effort 1

The first try was obvious GROUP BY or SELECT OVER PARTITION BY query. The problem here is also obvious, every query has to scan over very partition order/find the top row. So the query is very slow and has a very high IO impact.

Example query 1

;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn
FROM [data].[OperationData]
)
SELECT *
FROM cte
WHERE rn = 1


Example query 2

`SELECT

Solution

If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.

SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
    FROM data.OperationData d
    WHERE d.SourceID = s.SourceID
    ORDER BY d.[End] DESC) d;


If you know you’re only after your newest partition, you could include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())

Edit: Because your clustered index is on SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.

SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
    FROM data.OperationData d -- Big table quick seeks
    WHERE d.SourceID = s.SourceID
    AND d.Source = s.Source
    AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
    ORDER BY d.[End] DESC) d;

Code Snippets

SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
    FROM data.OperationData d
    WHERE d.SourceID = s.SourceID
    ORDER BY d.[End] DESC) d;
SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
    FROM data.OperationData d -- Big table quick seeks
    WHERE d.SourceID = s.SourceID
    AND d.Source = s.Source
    AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
    ORDER BY d.[End] DESC) d;

Context

StackExchange Database Administrators Q#223631, answer score: 9

Revisions (0)

No revisions yet.