What to do with duplicate lookup information
Problem
I have multiple databases whose data I want to store in one data warehouse database. I am wondering how to design the import process to handle lookup tables that exist in more than one source.
For example, say I have 5 databases, all with the lookup table CustomerState. In one database it could look like this:
In another database it could look like this:
How should I handle this in my enterprise layer of my DW database? Do I add a SourceSystemId to the lookup table, maybe something like this:
And then use the pkyCustomerStateId in my Customer table rather than the CustomerStateId?
Solution
This type of thing should be handled by the ETL process that brings the data into the data warehouse. In fact, reconciling source values like this is the transform step, the T in ETL.
What you need to do first is define the logical key column(s) of the tables, so the business meaning of the rows can be equated between the databases. A multi-column key as you propose would complicate matters, and really doesn't solve the problem.
For this example, I would define CustomerState as the logical key column in the dimension, and when the separate tables are merged together, this column would be unique in the result, with new CustomerStateId values assigned. This ensures the dimension primary key is as narrow as possible, which will carry through to the fact tables and make them as narrow as possible as well.
The ETL process might do something like this (assuming the CustomerStateId column of the target table is an IDENTITY column):
MERGE INTO [dbo].[CustomerState] tgt
USING [Staging].[CustomerState] src ON src.CustomerState = tgt.CustomerState
WHEN NOT MATCHED BY TARGET THEN
INSERT (CustomerState) VALUES (src.CustomerState);
(The reason I used MERGE instead of INSERT is that in other dimensions you may need to handle doing updates as well; not in this case as there are no other columns.)
Then, the fact table loading process would use a lookup mechanism (Lookup Transformation in SSIS) to go from the CustomerState logical value to the newly-assigned CustomerStateId value generated by the above statement.
Code Snippets
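As a supplement to the MERGE statement from the solution, here is a minimal end-to-end sketch of the steps around it: consolidating the per-source lookup tables into staging, merging on the logical key, and resolving the new surrogate key with a set-based join (a T-SQL stand-in for the SSIS Lookup Transformation, not the answer's own code). Everything here is an assumption made for illustration: the temp table names, the column types, and the sample values are invented, and temp tables are used only so the script can run without touching real schemas.

```sql
-- Hypothetical source lookup tables: each system assigned its own ids,
-- so the same id can mean different things in different sources.
CREATE TABLE #SourceA_CustomerState (CustomerStateId int PRIMARY KEY, CustomerState varchar(50) NOT NULL);
CREATE TABLE #SourceB_CustomerState (CustomerStateId int PRIMARY KEY, CustomerState varchar(50) NOT NULL);

INSERT INTO #SourceA_CustomerState (CustomerStateId, CustomerState)
VALUES (1, 'Active'), (2, 'Inactive');
INSERT INTO #SourceB_CustomerState (CustomerStateId, CustomerState)
VALUES (1, 'Inactive'), (2, 'Active'), (3, 'Suspended');

-- Step 1: consolidate into staging on the logical key only.
-- UNION (not UNION ALL) removes values that appear in more than one source.
CREATE TABLE #Staging_CustomerState (CustomerState varchar(50) NOT NULL PRIMARY KEY);

INSERT INTO #Staging_CustomerState (CustomerState)
SELECT CustomerState FROM #SourceA_CustomerState
UNION
SELECT CustomerState FROM #SourceB_CustomerState;

-- Step 2: the warehouse dimension, with a narrow IDENTITY surrogate key
-- and a uniqueness guarantee on the logical key.
CREATE TABLE #CustomerState (
    CustomerStateId int IDENTITY(1,1) PRIMARY KEY,
    CustomerState   varchar(50) NOT NULL UNIQUE
);

-- Same shape as the MERGE in the solution; re-running it inserts nothing new.
MERGE INTO #CustomerState AS tgt
USING #Staging_CustomerState AS src ON src.CustomerState = tgt.CustomerState
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerState) VALUES (src.CustomerState);

-- Step 3: resolve the surrogate key while loading dependent rows.
-- Assumes the extract already carries the logical CustomerState value
-- (for example, by joining to the source's own lookup table at extract time).
CREATE TABLE #Staging_Customer (CustomerName varchar(100) NOT NULL, CustomerState varchar(50) NOT NULL);

INSERT INTO #Staging_Customer (CustomerName, CustomerState)
VALUES ('Alice', 'Inactive'), ('Bob', 'Suspended');

SELECT c.CustomerName, d.CustomerStateId
FROM #Staging_Customer AS c
JOIN #CustomerState AS d ON d.CustomerState = c.CustomerState;
```

Because the source ids never reach the dimension, 'Inactive' ends up as a single row in the warehouse regardless of whether a source called it id 2 or id 1. The inner join in step 3 silently drops customers whose state is missing from the dimension, so a real load would probably want a LEFT JOIN plus some handling for unmatched rows.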
MERGE INTO [dbo].[CustomerState] tgt
USING [Staging].[CustomerState] src ON src.CustomerState = tgt.CustomerState
WHEN NOT MATCHED BY TARGET THEN
INSERT (CustomerState) VALUES (src.CustomerState);
Context
StackExchange Database Administrators Q#34816, answer score: 6