patternMinor
Database for efficient range aggregate queries?
Viewed 0 times
rangeefficientdatabaseforqueriesaggregate
Problem
As a simplified example, suppose I have a table like this:
The table may contain hundreds of millions of records, and I need to frequently do queries like this:
Even if
Is there any database that can do this efficiently, as in
I have come across a data structure called a Segment Tree as described here. Also sometimes referred to as a range tree or interval tree, although all these names are often described as a slightly different variation of the data structure.
However, I have not come across any database that implements such a data structure. Implementing it from scratch is easy for an in-memory structure, but becomes tricky if it has to be persisted or is too big to fit into memory. If there is an efficient pattern for implementing this on top of an existing database, that could also help.
Side note: This is not an append-only table, so a solution such as keeping a cumulative sum will not work in this case.
seq | value
----+------
102 | 11954
211 | 43292
278 | 19222
499 | 3843The table may contain hundreds of millions of records, and I need to frequently do queries like this:
SELECT sum(value) WHERE seq > $a and seq < $bEven if
seq is indexed, a typical database implementation will loop through each row to compute the sum in best case O(n), where n is the size of the range.Is there any database that can do this efficiently, as in
O(log(n)) per query?I have come across a data structure called a Segment Tree as described here. Also sometimes referred to as a range tree or interval tree, although all these names are often described as a slightly different variation of the data structure.
However, I have not come across any database that implements such a data structure. Implementing it from scratch is easy for an in-memory structure, but becomes tricky if it has to be persisted or is too big to fit into memory. If there is an efficient pattern for implementing this on top of an existing database, that could also help.
Side note: This is not an append-only table, so a solution such as keeping a cumulative sum will not work in this case.
Solution
Using SQL Server ColumnStore indexes
Well, okay, just one -- a clustered CS index.
If you want to read about the hardware I did this on, head over here. Full disclosure, I wrote that blog post on the website of the company I work for.
On to the test!
Here's some generic code to build a pretty big table. Same warning as Evan, this can take a while to build and index.
Well, Evan wins for simplicity, but I've talked about that before.
Here's the index definition. La and dee and dah.
Looking at a count, every Id has a pretty even distribution:
Results:
...
With every Id having ~5,005,005 rows, we can look at a pretty small range of IDs to get you a 10 million row sum.
Result:
Query profile:
For fun, a larger aggregation:
Results:
Query profile:
Hope this helps!
Well, okay, just one -- a clustered CS index.
If you want to read about the hardware I did this on, head over here. Full disclosure, I wrote that blog post on the website of the company I work for.
On to the test!
Here's some generic code to build a pretty big table. Same warning as Evan, this can take a while to build and index.
USE tempdb
CREATE TABLE t1 (Id INT NOT NULL, Amount INT NOT NULL)
;WITH T (N)
AS ( SELECT X.N
FROM (
VALUES (NULL), (NULL), (NULL),
(NULL), (NULL), (NULL),
(NULL), (NULL), (NULL),
(NULL) ) AS X (N)
), NUMS (N) AS (
SELECT TOP ( 710000000 )
ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )) AS N
FROM T AS T1, T AS T2, T AS T3,
T AS T4, T AS T5, T AS T6,
T AS T7, T AS T8, T AS T9,
T AS T10 )
INSERT dbo.t1 WITH ( TABLOCK ) (
Id, Amount )
SELECT NUMS.N % 999 AS Id, NUMS.N % 9999 AS Amount
FROM NUMS;
--(705032704 row(s) affected) --Aw, close enoughWell, Evan wins for simplicity, but I've talked about that before.
Here's the index definition. La and dee and dah.
CREATE CLUSTERED COLUMNSTORE INDEX CX_WOAHMAMA ON dbo.t1Looking at a count, every Id has a pretty even distribution:
SELECT t.Id, COUNT(*) AS [Records]
FROM dbo.t1 AS t
GROUP BY t.Id
ORDER BY t.IdResults:
Id Records
0 5005005
1 5005006
2 5005006
3 5005006
4 5005006
5 5005006...
994 5005005
995 5005005
996 5005005
997 5005005
998 5005005With every Id having ~5,005,005 rows, we can look at a pretty small range of IDs to get you a 10 million row sum.
SELECT COUNT(*) AS [Records], SUM(t.Amount) AS [Total]
FROM dbo.t1 AS t
WHERE t.Id > 0
AND t.Id < 3;Result:
Records Total
10010012 50015062308Query profile:
Table 't1'. Scan count 6, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 2560758, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Segment reads 4773, segment skipped 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 564 ms, elapsed time = 106 ms.For fun, a larger aggregation:
SELECT COUNT(*) AS [Records], SUM(CONVERT(BIGINT, t.Amount)) AS [Total]
FROM dbo.t1 AS t
WHERE t.Id > 0
AND t.Id < 101;Results:
Records Total
500500505 2501989114575Query profile:
Table 't1'. Scan count 6, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 2560758, lob physical reads 0, lob read-ahead reads 0.
Table 't1'. Segment reads 4773, segment skipped 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 1859 ms, elapsed time = 321 ms.Hope this helps!
Code Snippets
USE tempdb
CREATE TABLE t1 (Id INT NOT NULL, Amount INT NOT NULL)
;WITH T (N)
AS ( SELECT X.N
FROM (
VALUES (NULL), (NULL), (NULL),
(NULL), (NULL), (NULL),
(NULL), (NULL), (NULL),
(NULL) ) AS X (N)
), NUMS (N) AS (
SELECT TOP ( 710000000 )
ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )) AS N
FROM T AS T1, T AS T2, T AS T3,
T AS T4, T AS T5, T AS T6,
T AS T7, T AS T8, T AS T9,
T AS T10 )
INSERT dbo.t1 WITH ( TABLOCK ) (
Id, Amount )
SELECT NUMS.N % 999 AS Id, NUMS.N % 9999 AS Amount
FROM NUMS;
--(705032704 row(s) affected) --Aw, close enoughCREATE CLUSTERED COLUMNSTORE INDEX CX_WOAHMAMA ON dbo.t1SELECT t.Id, COUNT(*) AS [Records]
FROM dbo.t1 AS t
GROUP BY t.Id
ORDER BY t.IdId Records
0 5005005
1 5005006
2 5005006
3 5005006
4 5005006
5 5005006994 5005005
995 5005005
996 5005005
997 5005005
998 5005005Context
StackExchange Database Administrators Q#174086, answer score: 8
Revisions (0)
No revisions yet.