
Processing weather station data using idw - Follow Up

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


time-limit-exceeded codereview r stackoverflow performance python


Problem

New Approach

My previous question may have had a little too much going on, and I realized I could simplify the problem by constructing the data a bit differently, thanks to @Gentian Kasa. Previously, the code was constantly filtering data, which caused a big bottleneck in processing time. I have now constructed the data so that the main station and the local stations have the same number of days, so instead of filtering, the code now simply steps through the data.frames.

Problem

There is 1 main station (df) and 3 local stations (s) stacked in a single data.frame with values for three days. The idea is to take each day from the main station, find the relative anomaly of the three local stations, and smooth it using inverse distance weighting (IDW) from the phylin package. This is then applied to the value in the main station by multiplication.

This code is working fine and it is certainly an improvement from before, but I would like to see if there is a better/faster way using an optimized package/method (e.g. data.table, dplyr, apply). I still don't know how to approach this problem without the cumbersome for loop.

The original data set has around 19,000 days, with 3 different variables, for 20,000 stations, totaling about 1.14 billion observations (19,000 × 3 × 20,000). You can imagine how long this might take: prior estimates were at 14 days, although I have not checked with this updated code.

Data

Main Station : df

     id lat long year month day     value
1 12345 100   50 1900     1   1  54.87800
2 12345 100   50 1900     1   2 106.96603
3 12345 100   50 1900     1   3  98.31988

Local Stations: s

```
           id     lat     long year month day    value
1 USC00031152 33.5900 -92.8236 1900     1   1 63.31576
2 USC00034638 34.7392 -90.7664 1900     1   1 86.04906
3 USC00036352 35.2833 -93.1000 1900     1   1 76.50639
4 USC00031152 33.5900 -92.8236 1900     1   2 71.37608
5 USC00034638 34.7392 -90.7664 1900     1   2 89.91196
6 USC00036352 35.28
```

Solution

The code is quite inefficient for two reasons:

a) The idw function is poorly written. I see at least three loops (a for loop inside real.dist and two apply calls inside idw) that could be replaced with faster, vectorized alternatives.

b) More importantly, your algorithm duplicates many operations. In your example, the distances between the main station and each of the three local stations are recomputed nrow(df) times, even though they never change. The duplication also shows in your data, which in its current form contains many id/lat/long/year/month/day duplicates.

I would suggest you re-arrange your data as follows:

A matrix of coordinates for the main stations:

m.coord <- structure(c(100, 105, 50, 55),
  .Dim = c(2L, 2L),
  .Dimnames = list(c("12345", "12346"),
                   c("lat", "long")))

A matrix of coordinates for the local stations:

s.coord <- structure(c(33.59, 34.7392, 35.2833,
                      -92.8236, -90.7664, -93.1),
  .Dim = c(3L, 2L),
  .Dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
                   c("lat", "long")))

A matrix for the values at the local stations, where one dimension corresponds to the stations, the other to the time:

s.values <- structure(
  c(63.3157576809045, 86.0490598902219, 76.506386949066, 
    71.3760752788486, 89.9119576975542, 76.3535163951321,  
    53.7259645981243, 61.7989638892985, 85.8911224149051),
  .Dim = c(3L, 3L),
  .Dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
                   c("1900-01-01", "1900-01-02", "1900-01-03")))
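If your data currently sits in a long data.frame like s, one way to get to these matrices is sketched below. It is only an illustration in base R: the data.frame is rebuilt from the values shown above, and the date column and the use of tapply are my own choices, not something taken from your code.

```r
# Long-format local station data, rebuilt from the values shown in this answer.
ids <- c("USC00031152", "USC00034638", "USC00036352")
s <- data.frame(
  id    = rep(ids, times = 3),
  lat   = rep(c(33.59, 34.7392, 35.2833), times = 3),
  long  = rep(c(-92.8236, -90.7664, -93.1), times = 3),
  year  = 1900,
  month = 1,
  day   = rep(1:3, each = 3),
  value = c(63.3157576809045, 86.0490598902219, 76.506386949066,
            71.3760752788486, 89.9119576975542, 76.3535163951321,
            53.7259645981243, 61.7989638892985, 85.8911224149051),
  stringsAsFactors = FALSE
)

# One row of coordinates per station (assumes a station never moves).
st <- unique(s[, c("id", "lat", "long")])
s.coord <- as.matrix(st[, c("lat", "long")])
rownames(s.coord) <- st$id

# Station-by-date matrix of values; id/date pairs with no data become NA.
s$date <- sprintf("%04d-%02d-%02d", s$year, s$month, s$day)
s.values <- tapply(s$value, list(s$id, s$date), function(v) v[1])
```

With 20,000 stations and 19,000 days, a dedicated reshaping tool may be faster than tapply, but the target layout is the same: stations in rows, dates in columns.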

The first step is to compute the distances between the main stations and the local stations. I use fields::rdist because it is fast (compiled Fortran), or at least a lot faster than the for loop inside idw.

library(fields)
d.mat <- rdist(m.coord, s.coord)

The output is a matrix where each row contains the distances between one main station and all the local stations.
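To make it clearer what this distance matrix contains, here is a base R sketch of the same pairwise Euclidean distance. euclid_dist is a hypothetical helper I wrote for illustration; it is not part of fields, and I would expect it to be slower on large inputs.

```r
# Coordinates copied from the example above.
m.coord <- matrix(c(100, 105, 50, 55), nrow = 2,
                  dimnames = list(c("12345", "12346"), c("lat", "long")))
s.coord <- matrix(c(33.59, 34.7392, 35.2833, -92.8236, -90.7664, -93.1),
                  nrow = 3,
                  dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
                                  c("lat", "long")))

# Hypothetical helper: Euclidean distance between every row of a and every row of b.
euclid_dist <- function(a, b) {
  d2 <- outer(a[, 1], b[, 1], "-")^2 + outer(a[, 2], b[, 2], "-")^2
  sqrt(d2)
}

# Rows are main stations, columns are local stations.
d.mat <- euclid_dist(m.coord, s.coord)
dim(d.mat)
```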

Next, we go from a distance matrix to a weight matrix:

w.mat <- 1 / d.mat ^ 2

In the event that one or more of the distances is zero, the weight is now infinite, which is a bit undesirable. To handle that situation, we modify the weights to 0 or 1 for all rows that contain infinite values:

is.inf <- is.infinite(w.mat)
has.infinite <- rowSums(is.inf) > 0
w.mat[has.infinite, ] <- as.numeric(is.inf[has.infinite, ])
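As a small check of the zero-distance handling, here is a sketch on a made-up distance matrix in which the second main station sits exactly on top of the second local station. The numbers are invented for illustration only.

```r
# Made-up distances: 2 main stations (rows) x 3 local stations (columns).
d.mat <- matrix(c(1, 2, 4,
                  3, 0, 5), nrow = 2, byrow = TRUE)

w.mat <- 1 / d.mat^2                  # row 2 now contains Inf

is.inf <- is.infinite(w.mat)          # TRUE where the distance was zero
has.infinite <- rowSums(is.inf) > 0   # rows that need fixing

# In affected rows, coincident stations get weight 1 and all others weight 0.
w.mat[has.infinite, ] <- as.numeric(is.inf[has.infinite, ])

w.mat
```

Row 1 keeps its inverse-square weights, while row 2 puts all its weight on the coincident station. If two local stations coincided with the same main station, both would get weight 1 and share the weight equally after scaling.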

If you only want to work with the 3 closest stations, you can write a function that will only keep the three highest weights on each row and turn all other weights to zero:

keep_n <- function(x, n = 3) ifelse(rank(x) > length(x) - n, x, 0)

and run it on each row of the weight matrix:

w.mat <- t(apply(w.mat, 1, keep_n))
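Here is a sketch of that step on made-up weights for five local stations. The keep_n definition below is one plausible rank-based version with a default of n = 3, which I am assuming for illustration.

```r
# Assumed definition: keep the n largest entries of x, set the rest to zero.
keep_n <- function(x, n = 3) ifelse(rank(x) > length(x) - n, x, 0)

# Made-up weights: 2 main stations (rows) x 5 local stations (columns).
w.mat <- matrix(c(0.50, 0.10, 0.30, 0.05, 0.20,
                  0.01, 0.40, 0.02, 0.30, 0.25), nrow = 2, byrow = TRUE)

# apply() returns one column per input row, hence the transpose.
w.top <- t(apply(w.mat, 1, keep_n))
w.top
```

Each row of w.top keeps only its three largest weights. Note that rank averages tied values, so rows with ties at the cutoff may keep a different number of stations.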

Then, you want to scale your weights so that they sum to 1 on each row. Easy:

w.mat <- w.mat / rowSums(w.mat)

Now that you have your weights, computing the weighted averages for all main stations and all days is done in one simple matrix multiplication:

new_val <- w.mat %*% s.values
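To convince yourself that the matrix product really is a weighted average, here is a sketch that compares one entry against weighted.mean. The raw weights are made up; the station values are the ones from the example above.

```r
s.values <- matrix(
  c(63.3157576809045, 86.0490598902219, 76.506386949066,
    71.3760752788486, 89.9119576975542, 76.3535163951321,
    53.7259645981243, 61.7989638892985, 85.8911224149051),
  nrow = 3,
  dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
                  c("1900-01-01", "1900-01-02", "1900-01-03")))

# Made-up raw weights: 2 main stations (rows) x 3 local stations (columns).
raw.w <- matrix(c(4, 1, 1,
                  1, 2, 1), nrow = 2, byrow = TRUE)

w.mat <- raw.w / rowSums(raw.w)   # each row now sums to 1
new_val <- w.mat %*% s.values     # main stations x days

# The same number computed the long way for main station 1 on the first day.
check <- weighted.mean(s.values[, "1900-01-01"], w = raw.w[1, ])
all.equal(unname(new_val[1, "1900-01-01"]), check)
```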

If s.values contains NAs, then it is a little bit more difficult:

z <- is.na(s.values)
new_val <- (w.mat %*% ifelse(z, 0, s.values)) / (w.mat %*% !z)
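A tiny made-up example shows what the NA version does: a missing station is dropped for that day and the remaining weights are rescaled. The weights and values below are invented for illustration.

```r
# One main station, three local stations, weights already summing to 1.
w.mat <- matrix(c(0.5, 0.3, 0.2), nrow = 1)

# Two days (columns); station 2 is missing on day 1.
s.values <- matrix(c(10, NA, 30,
                     20, 40, 60), nrow = 3)

z <- is.na(s.values)

# Numerator: weighted sum with missing values counted as 0.
# Denominator: sum of the weights of the stations that actually reported.
new_val <- (w.mat %*% ifelse(z, 0, s.values)) / (w.mat %*% !z)
new_val
```

On day 1 the result is (0.5 * 10 + 0.2 * 30) / (0.5 + 0.2), and on day 2 it is the ordinary weighted average. If every local station with non-zero weight is missing on a given day, the division is 0 / 0 and the result is NaN, so you may want to check for that.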

I think you will know how to take it from here. Note that if your data is so large that you cannot handle all stations at the same time (i.e. you can't even compute d.mat), you could loop over the main stations in batches, e.g. 1000 main stations at a time.
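A rough sketch of such a batched loop is below. idw_in_batches and batch_size are hypothetical names, the distance is computed in base R so the block runs on its own, and the zero-distance and keep_n steps are left out to keep it short.

```r
# Hypothetical helper: process the main stations batch_size rows at a time.
# Assumes no zero distances; add the infinite-weight fix if that can happen.
idw_in_batches <- function(m.coord, s.coord, s.values, batch_size = 1000) {
  n <- nrow(m.coord)
  out <- matrix(NA_real_, nrow = n, ncol = ncol(s.values),
                dimnames = list(rownames(m.coord), colnames(s.values)))
  for (start in seq(1, n, by = batch_size)) {
    idx <- start:min(start + batch_size - 1, n)
    m <- m.coord[idx, , drop = FALSE]
    # Distance and weight matrices only for this batch of main stations.
    d <- sqrt(outer(m[, 1], s.coord[, 1], "-")^2 +
              outer(m[, 2], s.coord[, 2], "-")^2)
    w <- 1 / d^2
    w <- w / rowSums(w)
    out[idx, ] <- w %*% s.values
  }
  out
}

# Example data from above.
m.coord <- matrix(c(100, 105, 50, 55), nrow = 2,
                  dimnames = list(c("12345", "12346"), c("lat", "long")))
s.coord <- matrix(c(33.59, 34.7392, 35.2833, -92.8236, -90.7664, -93.1),
                  nrow = 3, dimnames = list(NULL, c("lat", "long")))
s.values <- matrix(c(63.3157576809045, 86.0490598902219, 76.506386949066,
                     71.3760752788486, 89.9119576975542, 76.3535163951321,
                     53.7259645981243, 61.7989638892985, 85.8911224149051),
                   nrow = 3,
                   dimnames = list(NULL, c("1900-01-01", "1900-01-02", "1900-01-03")))

res_all <- idw_in_batches(m.coord, s.coord, s.values)                  # one batch
res_one <- idw_in_batches(m.coord, s.coord, s.values, batch_size = 1)  # two batches
```

Both calls should give the same matrix, since the batches are independent of each other. Only the peak size of the distance matrix changes.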

One remark: your earlier solution and the one I suggested here both use plain Euclidean distance on the latitude/longitude values. This will probably be fine if your data is restricted to a small region, far from where longitude jumps between 180 and -180. Otherwise, you might want to look into a more appropriate function, such as a great-circle distance, for computing the distance matrix.
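As one possible replacement, here is a base R sketch of the haversine great-circle distance. haversine_dist is a hypothetical helper, the Earth is treated as a sphere with an assumed mean radius of 6371 km, and both inputs are assumed to have "lat" and "long" columns in decimal degrees.

```r
# Hypothetical helper: great-circle distance in km between every row of a
# and every row of b, using the haversine formula on a spherical Earth.
haversine_dist <- function(a, b, radius_km = 6371) {
  to_rad <- pi / 180
  lat1 <- a[, "lat"] * to_rad
  lon1 <- a[, "long"] * to_rad
  lat2 <- b[, "lat"] * to_rad
  lon2 <- b[, "long"] * to_rad
  dlat <- outer(lat1, lat2, "-")
  dlon <- outer(lon1, lon2, "-")
  h <- sin(dlat / 2)^2 + outer(cos(lat1), cos(lat2)) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(h) slightly above 1.
  2 * radius_km * asin(pmin(sqrt(h), 1))
}

# Two made-up points on the equator, on either side of the longitude jump.
p <- matrix(c(0, 0, 179, -179), nrow = 2,
            dimnames = list(c("west", "east"), c("lat", "long")))
d.earth <- haversine_dist(p, p)
d.earth
```

The two points are only 2 degrees of longitude apart, which the haversine distance reflects, whereas a Euclidean distance on the raw coordinates would treat them as 358 units apart. The weights would still be 1 / d^2 as before.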

Context

StackExchange Code Review Q#111870, answer score: 3
