patternsqlMinor
TOP(1) BY GROUP of very huge (100,000,000+) table
Viewed 0 times
topgroup000huge100verytable
Problem
Setup
I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.
The table is clustered at around 500 clusters and on a daily base.
Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.
Index based SELECTs are lightning fast and we had no problem with it.
Problem
I need to know the last (TOP) row by
Question
I need to find a way to solve this in a good way and without bringing the DB to the limits.
Effort 1
The first try was obvious
Example query 1
Example query 2
`SELECT
I have a huge table of ~115,382,254 rows.
The table is relatively simple and logs application process operations.
CREATE TABLE [data].OperationData NOT NULL,
[Size] [bigint] NULL,
[Begin] datetime2 NULL,
[End] datetime2 NOT NULL,
[Date] AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,
[DataSetCount] [bigint] NULL,
[Result] [int] NULL,
[Error] nvarchar NULL,
[Status] [int] NULL,
CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED
(
[SourceDeviceID] ASC,
[FileSource] ASC,
[End] ASC
))
CREATE TABLE [model].SourceDevice NOT NULL,
[Name] nvarchar NULL,
CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED
(
[ID] ASC
))
ALTER TABLE [data].[OperationData] WITH CHECK ADD CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeviceID])
REFERENCES [model].[SourceDevice] ([ID])
The table is clustered at around 500 clusters and on a daily base.
Also, the table is well indexed by PK, statistics are up-to-date and the INDEXer get defraged every night.
Index based SELECTs are lightning fast and we had no problem with it.
Problem
I need to know the last (TOP) row by
[End] and partitioned by [SourceDeciveID]. To get the very last [OperationData] of every source device.Question
I need to find a way to solve this in a good way and without bringing the DB to the limits.
Effort 1
The first try was obvious
GROUP BY or SELECT OVER PARTITION BY query. The problem here is also obvious, every query has to scan over very partition order/find the top row. So the query is very slow and has a very high IO impact.Example query 1
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn
FROM [data].[OperationData]
)
SELECT *
FROM cte
WHERE rn = 1
Example query 2
`SELECT
Solution
If you have a table of
If you know you’re only after your newest partition, you could include a filter on End, like
Edit: Because your clustered index is on
SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), just use OUTER APPLY.SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d
WHERE d.SourceID = s.SourceID
ORDER BY d.[End] DESC) d;If you know you’re only after your newest partition, you could include a filter on End, like
AND d.[End] > DATEADD(day, -1, GETDATE())Edit: Because your clustered index is on
SourceID, Source, End), put Source into your Sources table too and join on that as well. Then you don’t need the new index.SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d -- Big table quick seeks
WHERE d.SourceID = s.SourceID
AND d.Source = s.Source
AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
ORDER BY d.[End] DESC) d;Code Snippets
SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d
WHERE d.SourceID = s.SourceID
ORDER BY d.[End] DESC) d;SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
FROM data.OperationData d -- Big table quick seeks
WHERE d.SourceID = s.SourceID
AND d.Source = s.Source
AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
ORDER BY d.[End] DESC) d;Context
StackExchange Database Administrators Q#223631, answer score: 9
Revisions (0)
No revisions yet.