HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Is there a non-DDL way to preserve back-references when staging data into production?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
nonpreserveintoproductionwaydataddlstagingbackwhen

Problem

Let's say I've got this hypothetical schema:

Source (OLTP) DB:

Table Orders
------------
OrderID int IDENTITY (PK),
CustomerID int NOT NULL,
OrderAmount decimal NOT NULL


Destination (DSS) DB:

Table Activity
--------------
ActivityID int IDENTITY (PK),
PersonID int NOT NULL,
Amount decimal NOT NULL

Table ActivityOrderImport
--------------------
ActivityID int NOT NULL,
SourceOrderID int NOT NULL

Table CustomerMapping
---------------------
CustomerID int NOT NULL,
PersonID int NOT NULL


Obviously the real deal is considerably more complicated with more transformations. But assume for the moment that all this ETL does is merge specific transactions ("orders") from an external entity into a DSS that tracks generic "activities". The links between the external customer and the DSS person are in the CustomerMapping table.

The idea of the "Import" table is to provide some sort of audit trail in case something goes wrong. We don't have a lot of control over the source system and know it to be a bit on the flaky side. So it's really important for us to be able to understand the origin of any given activity.

Right now, there's a script that does this with DDL, which looks something like this:

ALTER TABLE Activity
ADD OrderID int NULL

MERGE Activity
USING #StagingOrders
(...)

INSERT ActivityOrderImport (ActivityID, SourceOrderID)
SELECT a.ActivityID, s.OrderID
FROM #StagingOrders s
INNER JOIN Activity a
    ON a.OrderID = s.OrderID

ALTER TABLE Activity
DROP COLUMN OrderID


This works fine, in the sense of not crashing and burning, it just makes me nauseous every time I look at the DDL.

-
Permanently adding an OrderID column to the Activity table really isn't an option because the data may come from multiple sources, each currently needing their own log table. Adding a separate column to the main table for all of these would break normalization and quickly turn the production database into a dog's breakfast.

-
Removing the `IDENTI

Solution

I would suggest adding a (set of) staging tables to the destination or an intermediary that can be better controlled and more stable. Place the tracking info there. Then do all the transformation from the staging to the final destination, either carrying the tracking info with it or discarding it.

There are several ways you can generate a tracking key from multiple systems, as long as they all follow the same algorithm, it doesn't have to be an INT. It could be a 2 char prefix with a sequential number. For that matter it doesn't have to be just one column.

Context

StackExchange Database Administrators Q#1395, answer score: 4

Revisions (0)

No revisions yet.