Determine residuals and outliers

Problem

For my internship I have to perform an analysis to determine residuals and outliers. The table I'm currently using has over 12 million records and 130+ columns.

My first tests take approx. 5398 seconds (about 1.5 hours) to fit the model, identify the outliers, and plot them. The goal of my project is to run the analysis on 14-20 similar models within an hour (on a better server, though).

I just changed glm() to lm(), and the processing time went down from 1.5 hours to 15 minutes.
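
The swap from glm() to lm() only leaves the fit unchanged if the original glm() call used the default gaussian family with an identity link; the post does not say which family was used, so that is an assumption here. Under that assumption, a minimal sketch on synthetic data (the column names `x1`, `x2`, `y` and the 1.5 * IQR outlier rule are hypothetical, not taken from the original table or script) shows that both calls give the same coefficients, and how residuals could then be flagged:

```r
# Synthetic stand-in for the real table; names are hypothetical.
set.seed(42)
n <- 10000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(n)

# Same linear model fitted two ways.
fit_lm  <- lm(y ~ x1 + x2, data = df)
fit_glm <- glm(y ~ x1 + x2, data = df, family = gaussian())

# With a gaussian family and identity link, both solve the same
# least-squares problem, so the coefficients agree up to tolerance.
coef_match <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
print(coef_match)

# Residuals plus a simple quantile-based outlier flag
# (1.5 * IQR rule, chosen here only for illustration).
res <- residuals(fit_lm)
q <- quantile(res, c(0.25, 0.75))
iqr <- q[2] - q[1]
is_outlier <- res < q[1] - 1.5 * iqr | res > q[2] + 1.5 * iqr
print(sum(is_outlier))
```

If the original model used a different family (binomial, poisson, and so on), lm() fits a different model and the speedup is not a like-for-like comparison.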

I also got the following details from Rprof():

```
$by.self
                      self.time self.pct total.time total.pct
"lm.fit"                 920.64    99.16     921.56     99.26
".Call"                    1.48     0.16       1.48      0.16
"plot.xy"                  1.22     0.13       1.22      0.13
".External2"               1.10     0.12       1.52      0.16
"colnames"                 0.44     0.05       0.00      0.00
"eval"                     0.44     0.05       0.00      0.00
"model.frame.default"      0.44     0.05       0.00      0.00
"na.omit"                  0.42     0.05       0.08      0.01
"lapply"                   0.38     0.04       0.04      0.00
"FUN"                      0.34     0.04       0.08      0.01
"na.omit.data.frame"       0.34     0.04       0.08      0.01
"unique.default"           0.24     0.03       0.18      0.02
"factor"                   0.20     0.02       0.00      0.00
"strsplit"                 0.18     0.02       0.18      0.02
"[.data.frame"             0.18     0.02       0.16      0.02
"["                        0.18     0.02       0.00      0.00
"quantile"                 0.16     0.02       0.00      0.00
"quantile.default"         0.16     0.02       0.00      0.00
"match"                    0.12     0.01       0.12      0.01
"sort.int"                 0.12     0.01       0.12      0.01
"sort"                     0.12     0.01
```

Solution

Comment by @flodel:

The CRAN High-Performance Computing Task View mentions the speedglm package. It is worth a try. Note how it says "High performances can be obtained especially if R is linked against an optimized BLAS, such as ATLAS". You will find many articles showing you how to do that if you google "R blas atlas".
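
As a rough illustration of what trying speedglm could look like, the sketch below fits the same linear model with lm() and with `speedglm::speedlm()` on synthetic data. The data and column names are hypothetical, and the block skips the comparison if the package is not installed; it is not a benchmark of the original 12-million-row table.

```r
# Synthetic data; column names are hypothetical.
set.seed(1)
n <- 5000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(n)

# Reference fit with base R.
fit_lm <- lm(y ~ x1 + x2, data = df)

if (requireNamespace("speedglm", quietly = TRUE)) {
  # speedlm() is the package's counterpart to lm().
  fit_fast <- speedglm::speedlm(y ~ x1 + x2, data = df)
  # Compare the estimates as plain numeric vectors.
  print(all.equal(as.numeric(coef(fit_lm)), as.numeric(coef(fit_fast))))
} else {
  message("speedglm is not installed; try install.packages('speedglm')")
}
```

Wrapping each fit in `system.time()` on a realistic subset of the data would show whether the package actually helps for this model.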

I'll point you to these results showing how switching from the default BLAS shipped with R to OpenBLAS improved this person's QR decomposition (what lm uses) computation times by a factor of ~4 (from 417 to 113 ms). So regardless of whether you choose to try speedglm, it is definitely worth checking which BLAS you are currently using and possibly switching to a better one.
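
One way to check which BLAS a session is using, and to get a baseline timing before switching, is sketched below. It assumes R 3.4.0 or later, where `sessionInfo()` reports the BLAS and LAPACK library paths; the matrix size is arbitrary and the timings will differ by machine, so no expected numbers are given.

```r
# Paths of the BLAS and LAPACK libraries this session is linked against
# (reported by sessionInfo() in R >= 3.4.0; NULL on older versions).
si <- sessionInfo()
print(si$BLAS)
print(si$LAPACK)

# Baseline timings on an arbitrary test matrix.
set.seed(1)
X <- matrix(rnorm(2000 * 200), nrow = 2000, ncol = 200)

# QR decomposition, the factorization lm() relies on.
t_qr <- system.time(qr_X <- qr(X))

# A cross-product, which is typically dispatched to the BLAS.
t_cp <- system.time(XtX <- crossprod(X))

print(t_qr["elapsed"])
print(t_cp["elapsed"])
```

Running the same script before and after switching BLAS gives a like-for-like comparison on the actual server rather than relying on someone else's benchmark.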

Context

StackExchange Code Review Q#105318, answer score: 2
