HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Identify changes between two data.frames, explain deltas for x columns

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
deltascolumnsidentifytwobetweenchangesframesforexplaindata

Problem

I wrote the following snippet to deal with a situation in which I have two datasets (usually the current version versus a previous one) and I need to identify what changes have occured: usually explaining the deltas of one specific column.

I want to identify what rows have been added, what rows have been removed as well as what (manually specified) variables have changed from the old to the new version.

I've done it like so:

  • Return the most recent version, but containing all rows in both x and y. This, thus, consists of adding the 'missing' or 'lost' rows from the previous version to the new version.



  • I can manually specify the keys that will uniquely identify a row in df.old and df.new;



  • Delta columns are added for variables of my choice; for each of these specified variables the output will contain 'x.delta, x.old, x.new' columns at the end;



  • I wanted to be able to specify additional columns that would be filled for the 'missing rows'. As the output is the most current representation, rows appearing only in the 'old' version by default only consist of the original keys.



So, in short, compare these two data.frames and identify how the difference between the sums of 'qsec' can be allocated to each row:

Original data.frame:

id1 id2  hp cyl  qsec
1      Mazda RX4 Maz 110   6 16.46
2  Mazda RX4 Wag Maz 110   6 17.02
3     Datsun 710 Dat 181   4 33.00
4 Hornet 4 Drive Hor 110   6 19.44
7     Duster 360 Dus 245   8 15.84
8      Merc 240D Mer  62   4 20.00


'New data.frame'

id1 id2  hp cyl  qsec
1         Mazda RX4 Maz 110   6 16.46
2     Mazda RX4 Wag Maz 110   6 17.02
3        Datsun 710 Dat  93   4 18.61
4    Hornet 4 Drive Hor 110   6 19.44
5 Hornet Sportabout Hor 175   8 17.02
6           Valiant Val 105   6 20.22


Function call with output:

```
df.ch <- df.changes(df.old,
df.new,
KEYS=c("id1", "id2"),
VAL = c("qsec"), # Values for which I want a delta column
retain.columns=c("cyl")) # Co

Solution

I feel you wrote some very complicated code when the hardest part (from an algorithm point of view) should be a single merge of your two data.frames. So my rewrite is centered around a call to the base merge function. The only trick is to add an is column to both data.frames before merging so the output data.frame M will contain two is.new and is.old columns
telling us from which of the two input data.frames the output rows are coming from. From here, M contains everything you need to know, and forming the output is just a few lines of vectorized operations to add rows.changed, the new data, the retained data, and the comparisons.

I hope you will agree that it is much more readable and easier to maintain this way. dplyr (similarly data.table) is a great tool for speeds or for making clean/concise code, but I don't think it applied well here.

Also notice how I used stopifnot in a concise manner for doing all your input checks.

df.changes <- function(df.old, df.new, 
                       KEYS = c("id"),
                       VAL = NULL,
                       retain.columns = NULL) {
  # input checks 
  stopifnot(KEYS %in% names(df.old),
            KEYS %in% names(df.new),
            VAL %in% names(df.old),
            VAL %in% names(df.new),
            retain.columns %in% names(df.new),
            retain.columns %in% names(df.old))

  # add columns to help us track new/old provenance
  N <- transform(df.new, is = TRUE)
  O <- transform(df.old, is = TRUE)

  # merge
  M <- merge(N, O, by = KEYS, all = TRUE, suffixes = c(".new",".old"))
  M$is.new <- !is.na(M$is.new) # replace NA with FALSE
  M$is.old <- !is.na(M$is.old) # replace NA with FALSE

  # this will be our output
  O <- M[KEYS]

  # add rows.changed
  O$row.changed <- with(M, ifelse(is.old & is.new, "10.Retained",
                           ifelse(is.old,          "05. Lost",
                                                   "00. New")))
  # add data from new
  original.vars <- setdiff(names(df.new), KEYS)
  for (var in original.vars)
     O[[var]] <- M[[paste0(var, ".new")]]

  # modify data for retain.columns
  for (var in retain.columns)
    O[[var]] <- ifelse(M$is.new, M[[paste0(var, ".new")]],
                                 M[[paste0(var, ".old")]])

  # add comparisons
  for (var in VAL) {
    old.var <- paste0(var, ".old")
    new.var <- paste0(var, ".new")
    del.var <- paste0(var, ".delta")
    O[[del.var]] <- M[[new.var]] - M[[old.var]]
    O[[old.var]] <- M[[old.var]]
    O[[new.var]] <- M[[new.var]]
  }

  # reorder rows
  O[order(O$row.changed), ]
}

Code Snippets

df.changes <- function(df.old, df.new, 
                       KEYS = c("id"),
                       VAL = NULL,
                       retain.columns = NULL) {
  # input checks 
  stopifnot(KEYS %in% names(df.old),
            KEYS %in% names(df.new),
            VAL %in% names(df.old),
            VAL %in% names(df.new),
            retain.columns %in% names(df.new),
            retain.columns %in% names(df.old))

  # add columns to help us track new/old provenance
  N <- transform(df.new, is = TRUE)
  O <- transform(df.old, is = TRUE)

  # merge
  M <- merge(N, O, by = KEYS, all = TRUE, suffixes = c(".new",".old"))
  M$is.new <- !is.na(M$is.new) # replace NA with FALSE
  M$is.old <- !is.na(M$is.old) # replace NA with FALSE

  # this will be our output
  O <- M[KEYS]

  # add rows.changed
  O$row.changed <- with(M, ifelse(is.old & is.new, "10.Retained",
                           ifelse(is.old,          "05. Lost",
                                                   "00. New")))
  # add data from new
  original.vars <- setdiff(names(df.new), KEYS)
  for (var in original.vars)
     O[[var]] <- M[[paste0(var, ".new")]]

  # modify data for retain.columns
  for (var in retain.columns)
    O[[var]] <- ifelse(M$is.new, M[[paste0(var, ".new")]],
                                 M[[paste0(var, ".old")]])

  # add comparisons
  for (var in VAL) {
    old.var <- paste0(var, ".old")
    new.var <- paste0(var, ".new")
    del.var <- paste0(var, ".delta")
    O[[del.var]] <- M[[new.var]] - M[[old.var]]
    O[[old.var]] <- M[[old.var]]
    O[[new.var]] <- M[[new.var]]
  }

  # reorder rows
  O[order(O$row.changed), ]
}

Context

StackExchange Code Review Q#94253, answer score: 5

Revisions (0)

No revisions yet.