HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Rolling regressions in R

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
regressionsrollingstackoverflow

Problem

I want to run rolling regressions over each group and store the coefficient. The following works, but it's slow, since I have too many series and I want to run too many regressions for each group.

Using foreach(), speeds things up (and also getting the coefficient with (X'X)^{-1}X'Y but is there a way to vectorize this operation? Also open for any and all improvements.

library(data.table)

run.rolling.regressions <- function(x) {
DT <- data.table(   Y = rnorm(10000), 
                    X = rnorm(10000), 
                    key.group = rep(LETTERS[1:10], each = 1000))

window.length <- 12
names.of.groups <- unique(DT$key.group)
number.of.groups <- length(names.of.groups)

X.coefficients <- list()

for(j in 1:length(names.of.groups)) {
regressed.DT <- DT[key.group == names.of.groups[j]]
nrows.of.group <- nrow(regressed.DT)
            print(paste0(j, ', key.group: ', names.of.groups[j]))
for (i in window.length:nrows.of.group) {
            if(i == window.length) {
            X.coefficients[[names.of.groups[j]]] <- c(rep(NA, nrows.of.group)) }
            X.coefficients[[names.of.groups[j]]][i] <-  lm(Y ~ 1 + X,
                    data = regressed.DT[(i - 11):i])$coefficients['X']

}
}
return(X.coefficients)
}
system.time(X.coef <- run.rolling.regressions())

Solution

For each group in your data table, your code computes the coefficient b1 from a linear regression y = b0 + b1*x + epsilon, and you want to run this regression and obtain b1 for observations 1-12, 2-13, 3-14, ..., 989-1000. Right now you are separately calling lm for each data subset, which is a non-vectorized approach.

Vectorization of prediction models across datasets is in general not straightforward, but for the special case you have here (simple linear regression) is it possible because there is a simple closed-form expression for b1, the coefficient of interest. In particular, for given vectors x and y we have b1 = (mean(xy) - mean(x)mean(y)) / (mean(x^2) - mean(x)^2). The rolling coefficient value can therefore be computed using the rolling means of x*y, x, y, and x^2 with the appropriate window width.

The end result is a fully vectorized version of the code (I use the RcppRoll package to obtain rolling means):

library(RcppRoll)
rolling2 <- function(DT, window.length) {
  setNames(lapply(unique(DT$key.group), function(g) {
    regressed.DT <- DT[key.group == g]
    xyBar = roll_mean(regressed.DT$X*regressed.DT$Y, window.length)
    xBar = roll_mean(regressed.DT$X, window.length)
    yBar = roll_mean(regressed.DT$Y, window.length)
    x2Bar = roll_mean(regressed.DT$X^2, window.length)
    c(rep(NA, window.length-1), (xyBar - xBar*yBar) / (x2Bar - xBar^2))
  }), unique(DT$key.group))
}


We can confirm that this yields identical results to the code from the original post about 3 orders of magnitude more quickly:

set.seed(144)
DT <- data.table(   Y = rnorm(10000), 
                    X = rnorm(10000), 
                    key.group = rep(LETTERS[1:10], each = 1000))
system.time(X.coef <- run.rolling.regressions(DT, 12))
#    user  system elapsed 
#  13.321   0.098  13.504 
system.time(X.coef2 <- rolling2(DT, 12))
#    user  system elapsed 
#   0.010   0.000   0.011 
all.equal(X.coef, X.coef2)
# [1] TRUE


Note that I slightly modified the provided run.rolling.regressions function to take DT and window.length as input and to not print progress updates; I think it makes sense to separate the generation of the dataset from the function that computes the rolling means, and down the road it might be useful to have the window length as an adjustable argument instead of a fixed value.

Code Snippets

library(RcppRoll)
rolling2 <- function(DT, window.length) {
  setNames(lapply(unique(DT$key.group), function(g) {
    regressed.DT <- DT[key.group == g]
    xyBar = roll_mean(regressed.DT$X*regressed.DT$Y, window.length)
    xBar = roll_mean(regressed.DT$X, window.length)
    yBar = roll_mean(regressed.DT$Y, window.length)
    x2Bar = roll_mean(regressed.DT$X^2, window.length)
    c(rep(NA, window.length-1), (xyBar - xBar*yBar) / (x2Bar - xBar^2))
  }), unique(DT$key.group))
}
set.seed(144)
DT <- data.table(   Y = rnorm(10000), 
                    X = rnorm(10000), 
                    key.group = rep(LETTERS[1:10], each = 1000))
system.time(X.coef <- run.rolling.regressions(DT, 12))
#    user  system elapsed 
#  13.321   0.098  13.504 
system.time(X.coef2 <- rolling2(DT, 12))
#    user  system elapsed 
#   0.010   0.000   0.011 
all.equal(X.coef, X.coef2)
# [1] TRUE

Context

StackExchange Code Review Q#125509, answer score: 7

Revisions (0)

No revisions yet.