patternsqlMinor
Using surrogate key after merging the person's data?
Viewed 0 times
afterpersonthemergingusingsurrogatedatakey
Problem
This question is about business intelligence with sql server 2008 R2.
If you have a person named Jim Brown with his birthdate and social security number (SSN) being located in three different source system.
Do you use surrogate key that should contain Jim's birthday and SSN after you have merged three data row into single one/row in the ETL phase?
If you have a person named Jim Brown with his birthdate and social security number (SSN) being located in three different source system.
Do you use surrogate key that should contain Jim's birthday and SSN after you have merged three data row into single one/row in the ETL phase?
Solution
A surrogate key is a system assigned unique value to identify an entity occurrence. A natural key is what the business uses to identify an entity occurrence. The source systems, as well as your BI/Data Integration database, can use either type to identify the entity occurrence - such as Jim Brown in your example. In the source system we call what the source system uses to identify the entity occurrence a source key. So if you can have 3 different source systems each of which contain Jim Brown, each will have a different source key in addition to the natural key - which you have identified as the SSN + birthdate. The BI staging environment, which the ETL uses, will include a key map table which will map each source key to the assigned surrogate key for the BI database. So for example:
When a transaction against Jim Brown is to be processed, it will come from the source using the source key. Say its system A with source key 12345 with an update to one of the other fields. The ETL looks up the source key in the key map and finds it, and knows to apply that transaction to the row on Person with a surrogate key of 1.
If you only had 1 source system life would be easy. But you have 3. This is where the natural key comes in as you know you can identify Jim in any system by using these 2 characteristics that never change and are true about him. Now, say you integrate System D. The first transaction comes in for Jim Brown in system D, and system D's source key is AB2945. Now when the ETL looks in the key map, it doesn't find it. But since you know you have multiple systems that may contain a person, the ETL also does a lookup on the Person table for the natural key and voila - a match. Now the ETL inserts a new row to the key map for System D also for Jim Brown.
So in summary you always want to use a surrogate key as the PK in the BI database table when doing data integration of the same logical entity among many sources. You'll map it to the source key using a key map table, and prevent duplicates by looking up on the natural key in the BI table (along with an alternate unique key on the natural key to ensure no duplicates). Now there are many other details to work out - like is your natural key really unique, do you want to store it in your key map to prevent look ups on Person, do you have to handle composite source keys, do you keep a history of source key changes, do you track a system of record and systems of reference, how do you handle source key reuse (a potentially very sticky issue in its own right), etc. But this short summary gives you and idea of the differences between natural, surrogate, and source keys and where to use each in the BI scenario. I hope this helps...
Person Table
Surrogate Key Name Birthday SSN Other fields
1 Jim Brown 9-15-1988 123456789
Person Key Map
Surrogate Key Source Key Source System
1 12345 System A
1 230383 System B
1 294829 System CWhen a transaction against Jim Brown is to be processed, it will come from the source using the source key. Say its system A with source key 12345 with an update to one of the other fields. The ETL looks up the source key in the key map and finds it, and knows to apply that transaction to the row on Person with a surrogate key of 1.
If you only had 1 source system life would be easy. But you have 3. This is where the natural key comes in as you know you can identify Jim in any system by using these 2 characteristics that never change and are true about him. Now, say you integrate System D. The first transaction comes in for Jim Brown in system D, and system D's source key is AB2945. Now when the ETL looks in the key map, it doesn't find it. But since you know you have multiple systems that may contain a person, the ETL also does a lookup on the Person table for the natural key and voila - a match. Now the ETL inserts a new row to the key map for System D also for Jim Brown.
So in summary you always want to use a surrogate key as the PK in the BI database table when doing data integration of the same logical entity among many sources. You'll map it to the source key using a key map table, and prevent duplicates by looking up on the natural key in the BI table (along with an alternate unique key on the natural key to ensure no duplicates). Now there are many other details to work out - like is your natural key really unique, do you want to store it in your key map to prevent look ups on Person, do you have to handle composite source keys, do you keep a history of source key changes, do you track a system of record and systems of reference, how do you handle source key reuse (a potentially very sticky issue in its own right), etc. But this short summary gives you and idea of the differences between natural, surrogate, and source keys and where to use each in the BI scenario. I hope this helps...
Code Snippets
Person Table
Surrogate Key Name Birthday SSN Other fields
1 Jim Brown 9-15-1988 123456789
Person Key Map
Surrogate Key Source Key Source System
1 12345 System A
1 230383 System B
1 294829 System CContext
StackExchange Database Administrators Q#18446, answer score: 3
Revisions (0)
No revisions yet.