HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

What is the correct and performant pattern for loading (and querying) a large clustered columnstore table?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
clusteredthewhattablecolumnstorequeryinglargeperformantcorrectfor

Problem

I am doing some test with clustered columnstore tables - both partitioned and not.
In my research, I've seen flavors of this pattern:

-- step 1: make new table
SELECT TOP 0 * INTO [Fact].[Order_CCI] FROM [Fact].[Order];

-- step 2: rowstore by date

CREATE CLUSTERED INDEX [CCI_Order] ON [Fact].[Order_CCI]
(
  [Order Date Key]
) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) 
ON [ps_OrderDate]([Order Date Key])

-- step 3: insert data

INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK)
SELECT
  [Order Key]
  ,[City Key]
  ,[Customer Key]
  ,[Stock Item Key]
  ,[Order Date Key]
  ....

FROM [Fact].[Order]

-- step 4: replace rowstore with columnstore

CREATE CLUSTERED COLUMNSTORE INDEX [CCI_Order]
ON [Fact].[Order_CCI]
WITH (DROP_EXISTING = ON, COMPRESSION_DELAY = 0)


It uses a rowstore index to sort the data by the desired
column before switching it to a CCI.

But in other articles or posts, I see the presence of an explicit "ORDER BY"
in the insert statement:

INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK)
SELECT
  [Order Key]
  ,[City Key]
  ,[Customer Key]
  ,[Stock Item Key]
  ,[Order Date Key]
  ....

FROM [Fact].[Order]
ORDER BY [Order Date Key]


I have these questions.

Is this the most performant code pattern for a large (500m row) table
where partitioning is employed? In contrast, this author
[https://janizajcbi.com/2018/09/14/row-store-to-column-store-story/]
populates the partitions in a loop.

Is the ORDER BY needed? I ran this pattern on a small (30m row) table both with and
without the ORDER BY and in both cases, the CCI metadata shows segments to be
date aligned/sorted. I would think that the ORDER BY is not needed.

But I just ran the same on the large table, and the segments are not
date aligned/sorted. But again note that it was a much larger table and was also partitioned.

Below is the SQL pattern that DID NOT partition but DID result in sorted segments

```
CREATE TABLE dbo.RiskExposure (
...
);
GO

INSERT INTO dbo.Risk

Solution

The ORDER BY on INSERT is not guaranteed to load the data in order. In particular the INSERT . . . SELECT can be parallelized, and the resulting table will not be loaded in any particular order.

Even single-threaded, the ORDER BY is ignored by INSERT ... SELECT. e g

drop table if exists test 
go
select *
into test 
from FactInternetSales where 1=0

create clustered columnstore index cci_test on test

insert into test --with (tablock)
select s.* from FactInternetSales s
order by ProductKey


The insert doesn't have a sort operator:

This is documented here:

When used with a SELECT...INTO statement to insert rows from another
source, the ORDER BY clause does not guarantee the rows are inserted
in the specified order.

SELECT - ORDER BY Clause (Transact-SQL)

And if the table is big enough, you can check the column segments to see that they are not in fact ordered by ProdctKey.

select object_name(p.object_id) table_name, s.segment_id, c.name column_name,  s.min_data_id, s.max_data_id
from sys.column_store_segments s
join sys.partitions p
  on s.hobt_id = p.hobt_id 
left join sys.columns c
  on c.object_id = p.object_id 
  and c.column_id = s.column_id
where c.name = 'ProductKey'

Code Snippets

drop table if exists test 
go
select *
into test 
from FactInternetSales where 1=0

create clustered columnstore index cci_test on test

insert into test --with (tablock)
select s.* from FactInternetSales s
order by ProductKey
select object_name(p.object_id) table_name, s.segment_id, c.name column_name,  s.min_data_id, s.max_data_id
from sys.column_store_segments s
join sys.partitions p
  on s.hobt_id = p.hobt_id 
left join sys.columns c
  on c.object_id = p.object_id 
  and c.column_id = s.column_id
where c.name = 'ProductKey'

Context

StackExchange Database Administrators Q#282944, answer score: 2

Revisions (0)

No revisions yet.