patternsqlMinor
What is the correct and performant pattern for loading (and querying) a large clustered columnstore table?
Viewed 0 times
clusteredthewhattablecolumnstorequeryinglargeperformantcorrectfor
Problem
I am doing some test with clustered columnstore tables - both partitioned and not.
In my research, I've seen flavors of this pattern:
It uses a rowstore index to sort the data by the desired
column before switching it to a CCI.
But in other articles or posts, I see the presence of an explicit "ORDER BY"
in the insert statement:
I have these questions.
Is this the most performant code pattern for a large (500m row) table
where partitioning is employed? In contrast, this author
[https://janizajcbi.com/2018/09/14/row-store-to-column-store-story/]
populates the partitions in a loop.
Is the ORDER BY needed? I ran this pattern on a small (30m row) table both with and
without the ORDER BY and in both cases, the CCI metadata shows segments to be
date aligned/sorted. I would think that the ORDER BY is not needed.
But I just ran the same on the large table, and the segments are not
date aligned/sorted. But again note that it was a much larger table and was also partitioned.
Below is the SQL pattern that DID NOT partition but DID result in sorted segments
```
CREATE TABLE dbo.RiskExposure (
...
);
GO
INSERT INTO dbo.Risk
In my research, I've seen flavors of this pattern:
-- step 1: make new table
SELECT TOP 0 * INTO [Fact].[Order_CCI] FROM [Fact].[Order];
-- step 2: rowstore by date
CREATE CLUSTERED INDEX [CCI_Order] ON [Fact].[Order_CCI]
(
[Order Date Key]
) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
ON [ps_OrderDate]([Order Date Key])
-- step 3: insert data
INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK)
SELECT
[Order Key]
,[City Key]
,[Customer Key]
,[Stock Item Key]
,[Order Date Key]
....
FROM [Fact].[Order]
-- step 4: replace rowstore with columnstore
CREATE CLUSTERED COLUMNSTORE INDEX [CCI_Order]
ON [Fact].[Order_CCI]
WITH (DROP_EXISTING = ON, COMPRESSION_DELAY = 0)It uses a rowstore index to sort the data by the desired
column before switching it to a CCI.
But in other articles or posts, I see the presence of an explicit "ORDER BY"
in the insert statement:
INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK)
SELECT
[Order Key]
,[City Key]
,[Customer Key]
,[Stock Item Key]
,[Order Date Key]
....
FROM [Fact].[Order]
ORDER BY [Order Date Key]I have these questions.
Is this the most performant code pattern for a large (500m row) table
where partitioning is employed? In contrast, this author
[https://janizajcbi.com/2018/09/14/row-store-to-column-store-story/]
populates the partitions in a loop.
Is the ORDER BY needed? I ran this pattern on a small (30m row) table both with and
without the ORDER BY and in both cases, the CCI metadata shows segments to be
date aligned/sorted. I would think that the ORDER BY is not needed.
But I just ran the same on the large table, and the segments are not
date aligned/sorted. But again note that it was a much larger table and was also partitioned.
Below is the SQL pattern that DID NOT partition but DID result in sorted segments
```
CREATE TABLE dbo.RiskExposure (
...
);
GO
INSERT INTO dbo.Risk
Solution
The ORDER BY on INSERT is not guaranteed to load the data in order. In particular the INSERT . . . SELECT can be parallelized, and the resulting table will not be loaded in any particular order.
Even single-threaded, the ORDER BY is ignored by INSERT ... SELECT. e g
The insert doesn't have a sort operator:
This is documented here:
When used with a SELECT...INTO statement to insert rows from another
source, the ORDER BY clause does not guarantee the rows are inserted
in the specified order.
SELECT - ORDER BY Clause (Transact-SQL)
And if the table is big enough, you can check the column segments to see that they are not in fact ordered by ProdctKey.
Even single-threaded, the ORDER BY is ignored by INSERT ... SELECT. e g
drop table if exists test
go
select *
into test
from FactInternetSales where 1=0
create clustered columnstore index cci_test on test
insert into test --with (tablock)
select s.* from FactInternetSales s
order by ProductKeyThe insert doesn't have a sort operator:
This is documented here:
When used with a SELECT...INTO statement to insert rows from another
source, the ORDER BY clause does not guarantee the rows are inserted
in the specified order.
SELECT - ORDER BY Clause (Transact-SQL)
And if the table is big enough, you can check the column segments to see that they are not in fact ordered by ProdctKey.
select object_name(p.object_id) table_name, s.segment_id, c.name column_name, s.min_data_id, s.max_data_id
from sys.column_store_segments s
join sys.partitions p
on s.hobt_id = p.hobt_id
left join sys.columns c
on c.object_id = p.object_id
and c.column_id = s.column_id
where c.name = 'ProductKey'Code Snippets
drop table if exists test
go
select *
into test
from FactInternetSales where 1=0
create clustered columnstore index cci_test on test
insert into test --with (tablock)
select s.* from FactInternetSales s
order by ProductKeyselect object_name(p.object_id) table_name, s.segment_id, c.name column_name, s.min_data_id, s.max_data_id
from sys.column_store_segments s
join sys.partitions p
on s.hobt_id = p.hobt_id
left join sys.columns c
on c.object_id = p.object_id
and c.column_id = s.column_id
where c.name = 'ProductKey'Context
StackExchange Database Administrators Q#282944, answer score: 2
Revisions (0)
No revisions yet.