Processing weather station data using idw - Follow Up
Problem
New Approach
My previous question may have had a little too much going on, and I realized I could simplify the problem by constructing the data a bit differently, thanks to @Gentian Kasa. Previously, the code was constantly filtering the data, which caused a big bottleneck in processing time. I have now constructed the data so that the main station and the local stations have the same number of days, so instead of filtering, the code now simply steps through the data.frames.
Problem
There is 1 main station (`df`) and 3 local stations (`s`) stacked in a single data.frame with values for three days. The idea is to take each day from the main station, find the relative anomaly of the three local stations, and smooth it using inverse distance weighting (IDW) from the phylin package. The result is then applied to the value at the main station by multiplication.
This code is working fine, and it is certainly an improvement from before, but I would like to see if there is a better/faster way using an optimized package/method (e.g. data.table, dplyr, apply). I still don't know how to approach this problem without the cumbersome for loop.
The original data set has around 19,000 days, with 3 different variables, for 20,000 stations, totaling 1.14 billion observations. You can imagine how long this might take: prior estimates were at 14 days, although I have not checked with this updated code.
Data
Main Station (`df`):
```
     id lat long year month day     value
1 12345 100   50 1900     1   1  54.87800
2 12345 100   50 1900     1   2 106.96603
3 12345 100   50 1900     1   3  98.31988
```
Local Stations (`s`):
```
           id     lat     long year month day    value
1 USC00031152 33.5900 -92.8236 1900     1   1 63.31576
2 USC00034638 34.7392 -90.7664 1900     1   1 86.04906
3 USC00036352 35.2833 -93.1000 1900     1   1 76.50639
4 USC00031152 33.5900 -92.8236 1900     1   2 71.37608
5 USC00034638 34.7392 -90.7664 1900     1   2 89.91196
6 USC00036352 35.28
```
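The stacked layout shown above can also be pivoted so that each local station is a row and each date is a column, which is the shape matrix-based approaches usually expect. The following is a minimal base R sketch under assumptions, not the original code: the data frame `s` is rebuilt by hand using the full-precision values quoted in the answer's `s.values`, the `date` column is a helper introduced for illustration, and each station is assumed to have at most one value per day.

```r
# Example local-station data in long format (one row per station and day).
# Values are copied from the example; `date` is a helper column added here.
ids <- c("USC00031152", "USC00034638", "USC00036352")
s <- data.frame(
  id    = rep(ids, times = 3),
  lat   = rep(c(33.59, 34.7392, 35.2833), times = 3),
  long  = rep(c(-92.8236, -90.7664, -93.1), times = 3),
  year  = 1900L,
  month = 1L,
  day   = rep(1:3, each = 3),
  value = c(63.3157576809045, 86.0490598902219, 76.506386949066,
            71.3760752788486, 89.9119576975542, 76.3535163951321,
            53.7259645981243, 61.7989638892985, 85.8911224149051),
  stringsAsFactors = FALSE
)
s$date <- sprintf("%04d-%02d-%02d", s$year, s$month, s$day)

# One row per station, one column per date. With a single observation per
# station and day, taking the first element just moves the value into place;
# station/day combinations that are absent become NA.
s.values <- tapply(s$value, list(s$id, s$date), function(v) v[1])

# Station coordinates, de-duplicated and ordered like the rows of s.values.
coords  <- unique(s[, c("id", "lat", "long")])
coords  <- coords[match(rownames(s.values), coords$id), ]
s.coord <- as.matrix(coords[, c("lat", "long")])
rownames(s.coord) <- coords$id
```

Storing the coordinates once per station, rather than once per station and day, removes the repeated id/lat/long columns from the stacked layout.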
Solution
The code is quite inefficient for two reasons:
a) the `idw` function is poorly written. I see at least three loops (one `for` loop inside `real.dist` and two `apply` calls inside `idw`) that could have been replaced with faster alternatives.
b) but mostly, your algorithm duplicates many operations. For instance, in your example, the distance between your main station and each of the three local stations is recomputed `nrow(df)` times. The duplication also shows in your data, which, in its current form, contains many id/lat/long/year/month/day duplicates.
I would suggest you re-arrange your data as follows:
A matrix of coordinates for the main stations:
```
# rows = main stations, columns = lat/long
m.coord <- structure(c(100, 105, 50, 55),
                     .Dim = c(2L, 2L),
                     .Dimnames = list(c("12345", "12346"),
                                      c("lat", "long")))
```
A matrix of coordinates for the local stations:
```
# rows = local stations, columns = lat/long
s.coord <- structure(c(33.59, 34.7392, 35.2833,
                       -92.8236, -90.7664, -93.1),
                     .Dim = c(3L, 2L),
                     .Dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
                                      c("lat", "long")))
```
A matrix for the values at the local stations, where one dimension corresponds to the stations, the other to the time:
```
# rows = local stations, columns = dates
s.values <- structure(
  c(63.3157576809045, 86.0490598902219, 76.506386949066,
    71.3760752788486, 89.9119576975542, 76.3535163951321,
    53.7259645981243, 61.7989638892985, 85.8911224149051),
  .Dim = c(3L, 3L),
  .Dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
                   c("1900-01-01", "1900-01-02", "1900-01-03")))
```
First step is to compute the distances between the main stations and the local stations. I use `fields::rdist` because it is fast (compiled in Fortran), at least a lot faster than the for loop inside `idw`.
```
library(fields)
d.mat <- rdist(m.coord, s.coord)
```
The output is a matrix where each row contains the distances between one main station and all the local stations.
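To see what the distance matrix holds without installing `fields`, the same plain Euclidean distances can be sketched in base R. `euclid_dist` is a hypothetical helper written for this illustration, not part of any package, and it treats latitude and longitude as flat coordinates, just as the Euclidean approach described here does.

```r
# Coordinates from the example (rows = stations, columns = lat/long).
m.coord <- matrix(c(100, 105, 50, 55), nrow = 2,
                  dimnames = list(c("12345", "12346"), c("lat", "long")))
s.coord <- matrix(c(33.59, 34.7392, 35.2833, -92.8236, -90.7664, -93.1),
                  nrow = 3,
                  dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
                                  c("lat", "long")))

# Hypothetical helper: pairwise Euclidean distance between rows of a and b.
euclid_dist <- function(a, b) {
  d_lat  <- outer(a[, "lat"],  b[, "lat"],  "-")   # all lat differences
  d_long <- outer(a[, "long"], b[, "long"], "-")   # all long differences
  sqrt(d_lat^2 + d_long^2)
}

# One row per main station, one column per local station.
d.mat <- euclid_dist(m.coord, s.coord)
```

Because the stations do not move, this matrix only has to be computed once and can then be reused for every day.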
Next, we go from a distance matrix to a weight matrix:
```
w.mat <- 1 / d.mat ^ 2
```
In the event that one or more of the distances is zero, the weight is now infinite, which is a bit undesirable. To handle that situation, we modify the weights to 0 or 1 for all rows that contain infinite values:
```
is.inf <- is.infinite(w.mat)
has.infinite <- rowSums(is.inf) > 0
w.mat[has.infinite, ] <- as.numeric(is.inf[has.infinite, ])
```
If you only want to work with the 3 closest stations, you can write a function that will only keep the three highest weights on each row and turn all other weights to zero:
```
# keep the n largest weights in x, set the rest to zero
keep_n <- function(x, n = 3) ifelse(rank(x) > length(x) - n, x, 0)
```
and run it on each row of the weight matrix:
```
w.mat <- t(apply(w.mat, 1, keep_n))
```
Then, you want to scale your weights so that they sum to 1 on each row. Easy:
```
w.mat <- w.mat / rowSums(w.mat)
```
Now that you have your weights, computing the weighted averages for all main stations and all days is done in one simple matrix multiplication:
```
new_val <- w.mat %*% s.values
```
If `s.values` contains NAs, then it is a little bit more difficult:
```
z <- is.na(s.values)
# missing values add nothing to the numerator and their weights are left out of the denominator
new_val <- (w.mat %*% ifelse(z, 0, s.values)) / (w.mat %*% !z)
```
I think you will know how to take it from here. Note that if your data is so large that you cannot handle all stations at the same time (i.e. you can't even compute `d.mat`), you could loop over the main stations in batches, e.g. 1000 main stations at a time.
One remark: your earlier solution and the one I suggested here both use basic Euclidean distance. This will probably be fine if your data is restricted to a small region and far from the ±180° meridian, where longitude jumps. Otherwise, you might want to look for a more appropriate function for computing the distance matrix.
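The two closing suggestions, batching the main stations and using a distance better suited to coordinates on a sphere, can be sketched together in base R. This is only an illustration under stated assumptions: `haversine_dist` and `idw_chunked` are hypothetical helpers, the haversine formula with a mean Earth radius of 6371 km stands in for "a more appropriate function", and the main-station coordinates are made up because a great-circle formula needs latitudes between -90 and 90.

```r
# Hypothetical helper: great-circle (haversine) distance in km between every
# row of a and every row of b; both need "lat" and "long" columns in degrees.
haversine_dist <- function(a, b, radius_km = 6371) {
  rad    <- pi / 180
  lat_a  <- a[, "lat"] * rad
  lat_b  <- b[, "lat"] * rad
  d_lat  <- outer(lat_a, lat_b, "-")
  d_long <- outer(a[, "long"] * rad, b[, "long"] * rad, "-")
  h <- sin(d_lat / 2)^2 + outer(cos(lat_a), cos(lat_b)) * sin(d_long / 2)^2
  2 * radius_km * asin(pmin(sqrt(h), 1))  # pmin guards against rounding above 1
}

# Hypothetical helper: IDW averages, computed for one batch of main stations
# at a time so the full distance matrix never has to exist in memory.
idw_chunked <- function(m.coord, s.coord, s.values,
                        chunk_size = 1000, dist_fun = haversine_dist) {
  z     <- is.na(s.values)
  vals0 <- ifelse(z, 0, s.values)   # NAs contribute nothing to the numerator
  ok    <- (!z) * 1                 # 1 where a value is present, 0 otherwise
  out <- matrix(NA_real_, nrow(m.coord), ncol(s.values),
                dimnames = list(rownames(m.coord), colnames(s.values)))
  for (start in seq(1, nrow(m.coord), by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1, nrow(m.coord))
    w   <- 1 / dist_fun(m.coord[idx, , drop = FALSE], s.coord)^2
    is.inf <- is.infinite(w)
    has.infinite <- rowSums(is.inf) > 0
    if (any(has.infinite)) {
      # a main station that coincides with a local station takes its value
      w[has.infinite, ] <- as.numeric(is.inf[has.infinite, ])
    }
    # dividing by the summed weights of the non-missing values normalises rows
    out[idx, ] <- (w %*% vals0) / (w %*% ok)
  }
  out
}

# Local stations and values from the example.
ids <- c("USC00031152", "USC00034638", "USC00036352")
s.coord <- matrix(c(33.59, 34.7392, 35.2833, -92.8236, -90.7664, -93.1),
                  nrow = 3, dimnames = list(ids, c("lat", "long")))
s.values <- matrix(c(63.3157576809045, 86.0490598902219, 76.506386949066,
                     71.3760752788486, 89.9119576975542, 76.3535163951321,
                     53.7259645981243, 61.7989638892985, 85.8911224149051),
                   nrow = 3,
                   dimnames = list(ids, c("1900-01-01", "1900-01-02", "1900-01-03")))

# Made-up main stations; "B" sits exactly on USC00034638.
m.coord <- matrix(c(34.0, 34.7392, -92.0, -90.7664), nrow = 2,
                  dimnames = list(c("A", "B"), c("lat", "long")))

new_val <- idw_chunked(m.coord, s.coord, s.values, chunk_size = 1)
```

The batch size only changes how much memory is needed at once, not the result, so it can be tuned to the machine. Passing a different `dist_fun` swaps the distance measure without touching the weighting code.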
Context
StackExchange Code Review Q#111870, answer score: 3