HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

best ETL design to transfer transaction tables records into the data-warehouse

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
tablesthewarehouseintodesignrecordsdatatransactionetltransfer

Problem

I have 2 type of tables to populate the data-warehouse with every day, lookup tables or configuration tables with few 100s records, and thats easy where i just truncate and refill the table.

but for transaction tables, that have many records, i usually increment, that is i run the ETL daily to add yesterdays records.

i have 2 problems that i face always

  • when the job fails for any reason (i lose that Days transactions)



  • when for any reason the job run twice or i run it twice (i get duplicates)



now i am trying to design a way where i over come these 2 problems as well as am trying to develop the ETL in such a way that it can auto fix it self incase any of these events occur.

i want it to check if there are missing days and run run the ETL for that day, and check if there are duplicates and delete them.

below are ways i though of
  1. i take in the last 5 days regardless, every day the ETL runs, deletes the last 5 days and refill.
  2. i check the destination tables if they have missing dates in the last month and then i query the source with the missing days.



keeping in mind that the source is a huge table in a production environment that i have to optimize my query to the maximum when requesting from it.

thanks

Solution

Do the transactions have an audit timestamp ? It has to be one that goes up only (no late-arriving facts. insert/update audit timestamps are good for this)

If so you could use this to define a range to extract. It's a common technique for this type of thing:

  • For each extraction, at the start, determine the range you want to extract (let's call the minimum timestamp min_ts and the maximum timestamp max_ts)



  • At the start of the extraction, put 1 line in a separate table (let's call it extraction_log and also give it a PK) with fields: extraction_id = a unique key, min_ts, max_ts, status = 'Starting Extraction'.



  • Use min_ts & max_ts to extract the data, either in 1 go (select * from where ts > min_ts and ts



  • Ad the end of a successful extraction, update the line and set status to 'Finished OK'



how to determine min_ts and max_ts ?

-
You could take min_ts from the extraction_log, using the last successful max_ts.
select max(max_ts) from extraction_log where status = 'Finished OK'

-
You could take max_ts from the source db at the start of your extraction.
select max(audit_ts) from source_table

There are alternatives here. If you're extracting these into a temporary staging table (a best practice), you could also take too much, ie the last 5 days and deal with duplicates later when upserting the entries in your ODS. For max_ts, if you're absolutely certain that the clocks between your dwh and the source are and will remain in sync (this is a very dangerous assumption - not recommended), you could even use sysdate()

Technically, you can get away with less effort. You don't really need the statuses, or a extraction_log table that keeps track of each batch. But I have found having it like this helps a lot with debugging and troubleshooting later on.
Additionally, if you need a routine to remove entries from a load that failed halfway, to find gaps for ranges of past extractions, and so on, the extraction_log will help. You might even want to include the extraction_id as an additional column in your ODS.

some further thoughts

-
if you don't have a good timestamp candidate, a technical key in the source system with the same properties (must go up only, no late-arriving facts) should be fine too.

-
if the timestamps are generated by the source application and there is a risk that 2 transactions with the exact same timestamp are not inserted at the same time (quite common), then it's safer to use a max_ts that is slightly in the past (
select max(max_ts) - 5 minutes from source_table.)

-
regarding failed loads. if you're only concerned with cleaning up entries of the last failed loads, you can add it as a first step in your ETL flow. (
delete from dwh_table where ts > min_ts`) This will remove the entries from any failures (if any) after the last successful extraction. It won't deal with failures between earlier successful extractions.

Context

StackExchange Database Administrators Q#49068, answer score: 2

Revisions (0)

No revisions yet.