patternpythonMinor
Identify changes between two data.frames, explain deltas for x columns
Viewed 0 times
deltascolumnsidentifytwobetweenchangesframesforexplaindata
Problem
I wrote the following snippet to deal with a situation in which I have two datasets (usually the current version versus a previous one) and I need to identify what changes have occured: usually explaining the deltas of one specific column.
I want to identify what rows have been added, what rows have been removed as well as what (manually specified) variables have changed from the old to the new version.
I've done it like so:
So, in short, compare these two data.frames and identify how the difference between the sums of 'qsec' can be allocated to each row:
Original data.frame:
'New data.frame'
Function call with output:
```
df.ch <- df.changes(df.old,
df.new,
KEYS=c("id1", "id2"),
VAL = c("qsec"), # Values for which I want a delta column
retain.columns=c("cyl")) # Co
I want to identify what rows have been added, what rows have been removed as well as what (manually specified) variables have changed from the old to the new version.
I've done it like so:
- Return the most recent version, but containing all rows in both x and y. This, thus, consists of adding the 'missing' or 'lost' rows from the previous version to the new version.
- I can manually specify the keys that will uniquely identify a row in df.old and df.new;
- Delta columns are added for variables of my choice; for each of these specified variables the output will contain 'x.delta, x.old, x.new' columns at the end;
- I wanted to be able to specify additional columns that would be filled for the 'missing rows'. As the output is the most current representation, rows appearing only in the 'old' version by default only consist of the original keys.
So, in short, compare these two data.frames and identify how the difference between the sums of 'qsec' can be allocated to each row:
Original data.frame:
id1 id2 hp cyl qsec
1 Mazda RX4 Maz 110 6 16.46
2 Mazda RX4 Wag Maz 110 6 17.02
3 Datsun 710 Dat 181 4 33.00
4 Hornet 4 Drive Hor 110 6 19.44
7 Duster 360 Dus 245 8 15.84
8 Merc 240D Mer 62 4 20.00'New data.frame'
id1 id2 hp cyl qsec
1 Mazda RX4 Maz 110 6 16.46
2 Mazda RX4 Wag Maz 110 6 17.02
3 Datsun 710 Dat 93 4 18.61
4 Hornet 4 Drive Hor 110 6 19.44
5 Hornet Sportabout Hor 175 8 17.02
6 Valiant Val 105 6 20.22Function call with output:
```
df.ch <- df.changes(df.old,
df.new,
KEYS=c("id1", "id2"),
VAL = c("qsec"), # Values for which I want a delta column
retain.columns=c("cyl")) # Co
Solution
I feel you wrote some very complicated code when the hardest part (from an algorithm point of view) should be a single merge of your two data.frames. So my rewrite is centered around a call to the base
telling us from which of the two input data.frames the output rows are coming from. From here,
I hope you will agree that it is much more readable and easier to maintain this way.
Also notice how I used
merge function. The only trick is to add an is column to both data.frames before merging so the output data.frame M will contain two is.new and is.old columnstelling us from which of the two input data.frames the output rows are coming from. From here,
M contains everything you need to know, and forming the output is just a few lines of vectorized operations to add rows.changed, the new data, the retained data, and the comparisons.I hope you will agree that it is much more readable and easier to maintain this way.
dplyr (similarly data.table) is a great tool for speeds or for making clean/concise code, but I don't think it applied well here.Also notice how I used
stopifnot in a concise manner for doing all your input checks.df.changes <- function(df.old, df.new,
KEYS = c("id"),
VAL = NULL,
retain.columns = NULL) {
# input checks
stopifnot(KEYS %in% names(df.old),
KEYS %in% names(df.new),
VAL %in% names(df.old),
VAL %in% names(df.new),
retain.columns %in% names(df.new),
retain.columns %in% names(df.old))
# add columns to help us track new/old provenance
N <- transform(df.new, is = TRUE)
O <- transform(df.old, is = TRUE)
# merge
M <- merge(N, O, by = KEYS, all = TRUE, suffixes = c(".new",".old"))
M$is.new <- !is.na(M$is.new) # replace NA with FALSE
M$is.old <- !is.na(M$is.old) # replace NA with FALSE
# this will be our output
O <- M[KEYS]
# add rows.changed
O$row.changed <- with(M, ifelse(is.old & is.new, "10.Retained",
ifelse(is.old, "05. Lost",
"00. New")))
# add data from new
original.vars <- setdiff(names(df.new), KEYS)
for (var in original.vars)
O[[var]] <- M[[paste0(var, ".new")]]
# modify data for retain.columns
for (var in retain.columns)
O[[var]] <- ifelse(M$is.new, M[[paste0(var, ".new")]],
M[[paste0(var, ".old")]])
# add comparisons
for (var in VAL) {
old.var <- paste0(var, ".old")
new.var <- paste0(var, ".new")
del.var <- paste0(var, ".delta")
O[[del.var]] <- M[[new.var]] - M[[old.var]]
O[[old.var]] <- M[[old.var]]
O[[new.var]] <- M[[new.var]]
}
# reorder rows
O[order(O$row.changed), ]
}Code Snippets
df.changes <- function(df.old, df.new,
KEYS = c("id"),
VAL = NULL,
retain.columns = NULL) {
# input checks
stopifnot(KEYS %in% names(df.old),
KEYS %in% names(df.new),
VAL %in% names(df.old),
VAL %in% names(df.new),
retain.columns %in% names(df.new),
retain.columns %in% names(df.old))
# add columns to help us track new/old provenance
N <- transform(df.new, is = TRUE)
O <- transform(df.old, is = TRUE)
# merge
M <- merge(N, O, by = KEYS, all = TRUE, suffixes = c(".new",".old"))
M$is.new <- !is.na(M$is.new) # replace NA with FALSE
M$is.old <- !is.na(M$is.old) # replace NA with FALSE
# this will be our output
O <- M[KEYS]
# add rows.changed
O$row.changed <- with(M, ifelse(is.old & is.new, "10.Retained",
ifelse(is.old, "05. Lost",
"00. New")))
# add data from new
original.vars <- setdiff(names(df.new), KEYS)
for (var in original.vars)
O[[var]] <- M[[paste0(var, ".new")]]
# modify data for retain.columns
for (var in retain.columns)
O[[var]] <- ifelse(M$is.new, M[[paste0(var, ".new")]],
M[[paste0(var, ".old")]])
# add comparisons
for (var in VAL) {
old.var <- paste0(var, ".old")
new.var <- paste0(var, ".new")
del.var <- paste0(var, ".delta")
O[[del.var]] <- M[[new.var]] - M[[old.var]]
O[[old.var]] <- M[[old.var]]
O[[new.var]] <- M[[new.var]]
}
# reorder rows
O[order(O$row.changed), ]
}Context
StackExchange Code Review Q#94253, answer score: 5
Revisions (0)
No revisions yet.