patternMinor
Design star schema for many-many relationship
Viewed 0 times
stardesignformanyschemarelationship
Problem
What are the steps/rules to build a DW star schema design from a production database; specifically, how do you handle many-to-many relationships.
I understand how to take basic data including a many-to-many relationship, and get to a normalized production database:
For example:
If I want to handle sales transactions, given tables
The
However, that would only allow 1 product and 1 promotion per sale transaction. Since we want to allow one or more products, and zero or more promotions:
-
I remove
-
I create
-
I will create
What would be the equivalent steps to take a production DB design to a DW star schema design?
I understand how to take basic data including a many-to-many relationship, and get to a normalized production database:
For example:
If I want to handle sales transactions, given tables
Product Entity, Promotion, and Employee, the initial step is to construct a table, SaleTransaction:SaleTransaction
- TransactionID
- ProductID
- EmployeeID
- SellingDateID
- Quantity
- SaleAmount
- PromotionIDThe
Promotion entity will be:Promotion
- PromotionID
- ProductID
- DiscountAmountHowever, that would only allow 1 product and 1 promotion per sale transaction. Since we want to allow one or more products, and zero or more promotions:
-
I remove
PromotionID, ProductID, Quantity, and (unless required by some business rule) SaleAmount from SaleTransaction-
I create
SaleTransactionDetail:- DetailID
- TransactionID
- ProductID
- Quantity
- SaleAmount-
I will create
SaleTransactionPromotion:- TransPromoID
- PromotionIDWhat would be the equivalent steps to take a production DB design to a DW star schema design?
Solution
Many-to-many tables are typically handled with a "bridge table." This is really just another name for a many-to-many table like you'd use in a transactional database: one column for an FK to the first table, one column for an FK to the second table, and probably a column or two for housekeeping fields.
You may also want a
Your proposed schema is good as-is, but I think it could be improved slightly; two things come to mind. One is that I would generally collapse
So we have three tables
You could include
The second caveat I had in mind is IDs. I strongly recommend that you use synthetic keys (SKs) in a data warehouse, rather than relying on IDs from the source system. There are a number of reasons, and this is discussed elsewhere on DBA.StackExchange, but offhand:
You can take advantage of synthetic keys in your bridge table by including a "dummy record" for orders without promotions. Populate it with a
You may also want a
Weight field in the bridge table, with a value equal to one divided by the number of entries for each sale. This lets you calculate SUM(Weight) instead of COUNT(*), when you want to avoid double-counting orders which had multiple promotions.Your proposed schema is good as-is, but I think it could be improved slightly; two things come to mind. One is that I would generally collapse
SalesTransactions and SalesTransactionDetails into a single table. This introduces redundancy for things like customer IDs, which ought to make your OLTP instincts flinch (transitioning from normalized databases to data warehousing is a paradigm shift!), but in this case it's worth it. By putting all of your sales into a single table, you simplify joins and, in many databases, signal that you want to use bitmap indices. In a warehouse, you often end up having to do table scans, but that's OK because your fact tables are very narrow, and fit in memory.So we have three tables
Sales, Promotions, and SalesPromotions. You might consider "order" rather than "sale," because the latter implies a finality which may not apply. If your system includes orders which have been placed but not shipped or not closed, the more general term makes sense. That's quibbling, but it never hurts to think carefully about what you name things.CREATE TABLE Orders
-- IDs/dimensions:
OrderID
EmployeeID
ProductID
-- Date dimensions: date placed, date contract signed, date shipped, etc.
-- Metrics: quantity, gross price, net price, extended price
-- Attributes: is taxable, currency code, business key
CREATE TABLE Promotions
PromotionID
-- Date dimensions: date effective, date expires
-- Metrics: fixed amount, percentage amount
-- Attributes: name, type
CREATE TABLE OrderPromotions
OrderID
PromotionID
WeightExtendedPrice can be a computed column rather than storing a value. If attributes start to bloat your Orders table, consider a "junk dimension."You could include
ProductID in Promotions, but only if promotions will only ever apply to a single product. For example, do you have some promotions which are applied to a whole order rather than specific line items? Might a promotion apply to product code ABC1 today, but ABC2 tomorrow, after a price change?The second caveat I had in mind is IDs. I strongly recommend that you use synthetic keys (SKs) in a data warehouse, rather than relying on IDs from the source system. There are a number of reasons, and this is discussed elsewhere on DBA.StackExchange, but offhand:
- A 32- or 64-bit integer for an SK will often be narrower than source system order IDs or customer IDs (e.g., a 144-bit Salesforce ID or other long alphanumeric IDs). This keeps your tables narrow, your secondary indices narrow, and your scans fast.
- You can avoid collisions in the source system. If Marketing some day decides to use product key ABC123 for something unrelated to what it was last year, you can give them two different synthetic keys.
- In particular, if your company acquires another business, you may find that their IDs don't match the data types you've been using. You'll need to map them to a common set, and handle collisions. Best to have the infrastructure for that from day one.
- You can use magic values for specific meanings which are relevant for reporting, but which aren't handled in the transactional system. For example, I always use SKs -1 for "N/A" and 0 for "Unknown". E.g., the
Customers.FirstOrderSKfield will be populated with -1 for customers who've had no orders, and 0 for customers who've had an order but for which no more specific info is available yet. Your reports can reflect this without having to put in special logic to handle NULL values.
You can take advantage of synthetic keys in your bridge table by including a "dummy record" for orders without promotions. Populate it with a
PromotionID of -1, to indicate that no promotion applied; now consumers of your data can filter for orders sold at full price.Code Snippets
CREATE TABLE Orders
-- IDs/dimensions:
OrderID
EmployeeID
ProductID
-- Date dimensions: date placed, date contract signed, date shipped, etc.
-- Metrics: quantity, gross price, net price, extended price
-- Attributes: is taxable, currency code, business key
CREATE TABLE Promotions
PromotionID
-- Date dimensions: date effective, date expires
-- Metrics: fixed amount, percentage amount
-- Attributes: name, type
CREATE TABLE OrderPromotions
OrderID
PromotionID
WeightContext
StackExchange Database Administrators Q#215307, answer score: 7
Revisions (0)
No revisions yet.