patternpythonMinor
Rolling regressions in R
Viewed 0 times
regressionsrollingstackoverflow
Problem
I want to run rolling regressions over each group and store the coefficient. The following works, but it's slow, since I have too many series and I want to run too many regressions for each group.
Using
Using
foreach(), speeds things up (and also getting the coefficient with (X'X)^{-1}X'Y but is there a way to vectorize this operation? Also open for any and all improvements.library(data.table)
run.rolling.regressions <- function(x) {
DT <- data.table( Y = rnorm(10000),
X = rnorm(10000),
key.group = rep(LETTERS[1:10], each = 1000))
window.length <- 12
names.of.groups <- unique(DT$key.group)
number.of.groups <- length(names.of.groups)
X.coefficients <- list()
for(j in 1:length(names.of.groups)) {
regressed.DT <- DT[key.group == names.of.groups[j]]
nrows.of.group <- nrow(regressed.DT)
print(paste0(j, ', key.group: ', names.of.groups[j]))
for (i in window.length:nrows.of.group) {
if(i == window.length) {
X.coefficients[[names.of.groups[j]]] <- c(rep(NA, nrows.of.group)) }
X.coefficients[[names.of.groups[j]]][i] <- lm(Y ~ 1 + X,
data = regressed.DT[(i - 11):i])$coefficients['X']
}
}
return(X.coefficients)
}
system.time(X.coef <- run.rolling.regressions())Solution
For each group in your data table, your code computes the coefficient
Vectorization of prediction models across datasets is in general not straightforward, but for the special case you have here (simple linear regression) is it possible because there is a simple closed-form expression for
The end result is a fully vectorized version of the code (I use the
We can confirm that this yields identical results to the code from the original post about 3 orders of magnitude more quickly:
Note that I slightly modified the provided
b1 from a linear regression y = b0 + b1*x + epsilon, and you want to run this regression and obtain b1 for observations 1-12, 2-13, 3-14, ..., 989-1000. Right now you are separately calling lm for each data subset, which is a non-vectorized approach.Vectorization of prediction models across datasets is in general not straightforward, but for the special case you have here (simple linear regression) is it possible because there is a simple closed-form expression for
b1, the coefficient of interest. In particular, for given vectors x and y we have b1 = (mean(xy) - mean(x)mean(y)) / (mean(x^2) - mean(x)^2). The rolling coefficient value can therefore be computed using the rolling means of x*y, x, y, and x^2 with the appropriate window width.The end result is a fully vectorized version of the code (I use the
RcppRoll package to obtain rolling means):library(RcppRoll)
rolling2 <- function(DT, window.length) {
setNames(lapply(unique(DT$key.group), function(g) {
regressed.DT <- DT[key.group == g]
xyBar = roll_mean(regressed.DT$X*regressed.DT$Y, window.length)
xBar = roll_mean(regressed.DT$X, window.length)
yBar = roll_mean(regressed.DT$Y, window.length)
x2Bar = roll_mean(regressed.DT$X^2, window.length)
c(rep(NA, window.length-1), (xyBar - xBar*yBar) / (x2Bar - xBar^2))
}), unique(DT$key.group))
}We can confirm that this yields identical results to the code from the original post about 3 orders of magnitude more quickly:
set.seed(144)
DT <- data.table( Y = rnorm(10000),
X = rnorm(10000),
key.group = rep(LETTERS[1:10], each = 1000))
system.time(X.coef <- run.rolling.regressions(DT, 12))
# user system elapsed
# 13.321 0.098 13.504
system.time(X.coef2 <- rolling2(DT, 12))
# user system elapsed
# 0.010 0.000 0.011
all.equal(X.coef, X.coef2)
# [1] TRUENote that I slightly modified the provided
run.rolling.regressions function to take DT and window.length as input and to not print progress updates; I think it makes sense to separate the generation of the dataset from the function that computes the rolling means, and down the road it might be useful to have the window length as an adjustable argument instead of a fixed value.Code Snippets
library(RcppRoll)
rolling2 <- function(DT, window.length) {
setNames(lapply(unique(DT$key.group), function(g) {
regressed.DT <- DT[key.group == g]
xyBar = roll_mean(regressed.DT$X*regressed.DT$Y, window.length)
xBar = roll_mean(regressed.DT$X, window.length)
yBar = roll_mean(regressed.DT$Y, window.length)
x2Bar = roll_mean(regressed.DT$X^2, window.length)
c(rep(NA, window.length-1), (xyBar - xBar*yBar) / (x2Bar - xBar^2))
}), unique(DT$key.group))
}set.seed(144)
DT <- data.table( Y = rnorm(10000),
X = rnorm(10000),
key.group = rep(LETTERS[1:10], each = 1000))
system.time(X.coef <- run.rolling.regressions(DT, 12))
# user system elapsed
# 13.321 0.098 13.504
system.time(X.coef2 <- rolling2(DT, 12))
# user system elapsed
# 0.010 0.000 0.011
all.equal(X.coef, X.coef2)
# [1] TRUEContext
StackExchange Code Review Q#125509, answer score: 7
Revisions (0)
No revisions yet.