HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Change Data Capture using HIVE

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
hivecaptureusingdatachange

Problem

Could you please review my code and let me know if there is something wrong? I'm getting the expected results.

I'm trying to do change data capture using hive. We already have an existing CDC in DataStage, but we are trying to implement in HIVE now. Here is what I have done, please correct me if I'm doing something wrong.

Here is the source data:

Table name: month1

e_id, e_name, e_ad
11, john, singapore
22, jack, Melbourne
33, jessy, sydney
44, lampard, canada
55, pomer, perth


Table name2: month2

e_id, e_name, e_add
11, john, odisha
22, jack, Melbourne
44, lampard, canada
55, pomer, perth
66, rahul, mumbai


Here, the key field is the e_id, and value field is e_name and e_add:

select
case when cdc_codes ='Updates' Then month2s
when cdc_codes = 'NoChange' then month1s
when cdc_codes = 'New' then month2s
when cdc_codes = 'Deletes' then month1s
else 'Error' end as fin_cols
from (select
case when m1.e_id = m2.e_id and concat(m1.e_name,m1.e_add) = concat(m2.e_name,m2.e_add) then 'NoChange'
when m1.e_id = m2.e_id and concat(m1.e_name,m1.e_add) <> concat(m2.e_name,m2.e_add) then 'Update'
when m1.e_id is null then 'New'
when m2.e_id is null then 'Deletes'
else 'Error' end as cdc_codes,
concat(m1.e_id,',',m1.e_name,',',m1.e_add) as month1s,
concat(m2.e_id,',',m2.e_name,',',m2.e_add) as month2s
from month1 as m1
full outer join
month2 as m2
on m1.e_id = m2.e_id) as b1

Solution

Nitpicks

There are a few inconsistencies in your capitalization of SQL keywords. It's pretty minor. I see then and Then in some places, etc.

Naming

m1 and m2 don't make for very good aliases. Likewise e_id, e_name, e_add are not very good column names, though I realize those are probably example tables and your real schema is probably quite a bit different.

concat

Concatenation can make things confusing in SQL, in my opinion. I would encourage to see if there is a more natural way to compare data in your case statements. For example, instead of this:

case    when m1.e_id = m2.e_id and concat(m1.e_name,m1.e_add) = concat(m2.e_name,m2.e_add)


Why not try:

case    when m1.e_id = m2.e_id and m1.e_name = m2.e_name and m1.e_add = m2.e_add


This one is especially odd looking and would be easy to make a mistake:

concat(m1.e_id,',',m1.e_name,',',m1.e_add) as month1s,


CTE

I think using a common table expression instead of a subquery would make your query more elegant and easier to maintain. I also think a little more vertical space in your case statements to separate the conditions from the results helps the code read better:

with cte_cdc_codes as (
    select 
    case    when m1.e_id = m2.e_id and concat(m1.e_name,m1.e_add) = concat(m2.e_name,m2.e_add) 
                then  'NoChange'
            when m1.e_id = m2.e_id and  concat(m1.e_name,m1.e_add) <> concat(m2.e_name,m2.e_add) 
                then  'Update'
            when m1.e_id is null 
                then 'New'
            when m2.e_id is null 
                then 'Deletes'
            else 'Error' 
            end as cdc_codes,
        concat(m1.e_id,',',m1.e_name,',',m1.e_add) as month1s,
        concat(m2.e_id,',',m2.e_name,',',m2.e_add) as month2s
    from    month1 as m1 
    full outer join 
    month2 as m2
    on  m1.e_id = m2.e_id
)
select 
    case    when cdc_codes ='Updates' 
            Then month2s
        when cdc_codes = 'NoChange' then month1s
        when cdc_codes = 'New' then month2s
        when cdc_codes = 'Deletes' then month1s
        else 'Error' end as fin_cols
from cte_cdc_codes;

Code Snippets

case    when m1.e_id = m2.e_id and concat(m1.e_name,m1.e_add) = concat(m2.e_name,m2.e_add)
case    when m1.e_id = m2.e_id and m1.e_name = m2.e_name and m1.e_add = m2.e_add
concat(m1.e_id,',',m1.e_name,',',m1.e_add) as month1s,
with cte_cdc_codes as (
    select 
    case    when m1.e_id = m2.e_id and concat(m1.e_name,m1.e_add) = concat(m2.e_name,m2.e_add) 
                then  'NoChange'
            when m1.e_id = m2.e_id and  concat(m1.e_name,m1.e_add) <> concat(m2.e_name,m2.e_add) 
                then  'Update'
            when m1.e_id is null 
                then 'New'
            when m2.e_id is null 
                then 'Deletes'
            else 'Error' 
            end as cdc_codes,
        concat(m1.e_id,',',m1.e_name,',',m1.e_add) as month1s,
        concat(m2.e_id,',',m2.e_name,',',m2.e_add) as month2s
    from    month1 as m1 
    full outer join 
    month2 as m2
    on  m1.e_id = m2.e_id
)
select 
    case    when cdc_codes ='Updates' 
            Then month2s
        when cdc_codes = 'NoChange' then month1s
        when cdc_codes = 'New' then month2s
        when cdc_codes = 'Deletes' then month1s
        else 'Error' end as fin_cols
from cte_cdc_codes;

Context

StackExchange Code Review Q#67575, answer score: 2

Revisions (0)

No revisions yet.