HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

SQL Staging Tables: Primary Key Clustered or Heap

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
clusteredtablesprimarysqlheapstagingkey

Problem

We are taking legacy flat txt files and inserting them into stage tables with SSIS.The question arose whether table should have primary clustered key index or not. This is direct flat file import with no transformation.

create table dbo.CustomerTransaction
(
     CustomerName varchar(255),
     PurchaseLocation varchar(255),
     Productid  int,
     AmountSold float,
     CustomerAddress varchar(50)
)

create table dbo.CustomerTransaction
(
     -- discussion for adding this column
     CustomerTransactionId int primary key clustered identity(1,1) 

     CustomerName varchar(255),
     PurchaseLocation varchar(255),
     Productid  int,
     AmountSold float,
     CustomerAddress varchar(50)
)

-- both tables have nonclustered indexes
create nonclustered index idx_ProductId on dbo.CustomerTransaction(ProductId)
create nonclustered index idx_CustomerAddress on dbo.CustomerTransaction(CustomerAddress)

-- Actually have more indexes, tables above are just for sample


1) Before ETL, the staging tables are truncated. There are No Deletes and No Updates. Only Inserts.

truncate table dbo.[CustomerTransaction]


2) Then disable all indexes before ETL.

alter index all on dbo.[CustomerTransaction] DISABLE


3) We conduct SSIS data flow with default fast load, which I read is equivalent to bulk insert. No transformations occur here.

4) Then reenable all indexes after import is done.

alter index all on dbo.[CustomerTransaction] REBUILD


5) The staging tables are then selected on join and where clauses, and placed into datawarehouse. This we why we have nonclustered indexes. After data warehouse is loaded, we truncate the staging tables.

We are hearing information that ETL Stage tables are good as heaps. However, also learning of fragmentation and performance issues with heaps. Reading all the articles below

I am reading conflicting opinions. One says Binary tree clustered are maintenance headaches for import ETL. Other says Heaps have pe

Solution

Having an identity column doesn’t force you to use it as a clustered index key.

You’re right that heaps work well here. I would consider Thomas Kejser to be the authority on the subject, and it’s good you’ve listed him as one of your resources.

As for fragmentation in heaps - doesn’t happen on insert-only.

Edit: Go through this article about parallel insert, and notice the comparisons between heaps and clustered indexes. https://blogs.msdn.microsoft.com/sqlcat/2016/07/21/real-world-parallel-insert-what-else-you-need-to-know/

Context

StackExchange Database Administrators Q#222247, answer score: 5

Revisions (0)

No revisions yet.