determine residuals and outliers
Problem
For my internship I have to perform certain analyses to determine residuals and outliers. The table I'm currently using has over 12 million records and 130+ columns.
My first tests take approx. 5398 seconds (about 1.5 hours) to apply the model, establish outliers, and put them into a plot. The goal of my project is to perform this analysis on 14-20 similar models within an hour (on a better server, though).
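As a minimal sketch of the kind of analysis described — fit a model, then flag outliers from the residuals — where the data and the column names `x` and `y` are hypothetical placeholders, not from the actual project:

```r
# Fit a linear model, then flag outliers from the standardized residuals.
set.seed(1)
df   <- data.frame(x = rnorm(1e4))
df$y <- 2 * df$x + rnorm(1e4)

fit <- lm(y ~ x, data = df)

r        <- rstandard(fit)     # standardized residuals
outliers <- which(abs(r) > 3)  # a common rule of thumb for flagging outliers

plot(fitted(fit), r)           # residuals-vs-fitted plot
points(fitted(fit)[outliers], r[outliers], col = "red")
```

On 12 million rows the fit itself dominates the runtime, which is exactly what the profiling output below shows.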
I just changed the `glm()` to `lm()` and the process time went down from 1.5 hours to 15 minutes. I also got the following details from `Rprof()`:

```
$by.self
self.time self.pct total.time total.pct
"lm.fit" 920.64 99.16 921.56 99.26
".Call" 1.48 0.16 1.48 0.16
"plot.xy" 1.22 0.13 1.22 0.13
".External2" 1.10 0.12 1.52 0.16
"colnames" 0.44 0.05 0.00 0.00
"eval" 0.44 0.05 0.00 0.00
"model.frame.default" 0.44 0.05 0.00 0.00
"na.omit" 0.42 0.05 0.08 0.01
"lapply" 0.38 0.04 0.04 0.00
"FUN" 0.34 0.04 0.08 0.01
"na.omit.data.frame" 0.34 0.04 0.08 0.01
"unique.default" 0.24 0.03 0.18 0.02
"factor" 0.20 0.02 0.00 0.00
"strsplit" 0.18 0.02 0.18 0.02
"[.data.frame" 0.18 0.02 0.16 0.02
"[" 0.18 0.02 0.00 0.00
"quantile" 0.16 0.02 0.00 0.00
"quantile.default" 0.16 0.02 0.00 0.00
"match" 0.12 0.01 0.12 0.01
"sort.int" 0.12 0.01 0.12 0.01
"sort" 0.12 0.01
```
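Profiling output like the `$by.self` table above comes from `Rprof()`/`summaryRprof()`; a minimal sketch (the data and file name here are illustrative, not from the original post):

```r
# Sample the call stack while fitting, then summarise time per function.
set.seed(1)
df   <- data.frame(matrix(rnorm(1e6 * 5), ncol = 5))
df$y <- rowSums(df) + rnorm(1e6)

Rprof("fit.prof")            # start the sampling profiler
fit <- lm(y ~ ., data = df)  # work to be profiled
Rprof(NULL)                  # stop profiling

prof <- summaryRprof("fit.prof")
head(prof$by.self)           # table like the $by.self output above
```

With a fit of this size, `lm.fit` should dominate `self.time`, mirroring the 99% share seen above.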
Solution
Comment by @flodel:
The CRAN High-Performance Computing Task View mentions the speedglm package. It is worth a try. Note how it says "high performances can be obtained especially if R is linked against an optimized BLAS, such as ATLAS". You will find many articles showing you how to do that if you google "R blas atlas".

I'll point you to these results showing how switching from the default BLAS shipped with R to OpenBLAS improved this person's QR decomposition (what `lm` uses) computation times by a factor of ~4 (from 417 to 113 ms). So regardless of whether you choose to try speedglm, it is definitely worth looking into which BLAS you are currently using and possibly switching to a better one.
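A sketch of what trying speedglm might look like, plus how to check which BLAS R is linked against. The data is simulated for illustration; `speedlm()` is the package's least-squares counterpart to `lm()`:

```r
# install.packages("speedglm")  # if not already installed
library(speedglm)

set.seed(1)
df   <- data.frame(x1 = rnorm(1e5), x2 = rnorm(1e5))
df$y <- df$x1 - df$x2 + rnorm(1e5)

fit_lm    <- lm(y ~ x1 + x2, data = df)
fit_speed <- speedlm(y ~ x1 + x2, data = df)  # updating algorithm, lower memory use

coef(fit_lm)
coef(fit_speed)  # should agree closely with lm()

# sessionInfo() reports the BLAS/LAPACK libraries in use (R >= 3.4),
# so you can confirm whether an optimized BLAS is actually linked:
sessionInfo()
```

Wrapping both fits in `system.time()` on your own data is the quickest way to see whether the switch pays off before committing to it.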
Context

StackExchange Code Review Q#105318, answer score: 2