patternsqlMinor
PostgreSQL: interpolate missing value
Viewed 0 times
postgresqlinterpolatevaluemissing
Problem
I have a table in PostgreSQL with a timestamp and a value.
I would like to interpolate the missing values under "lat".
The value under "lat" are tidalheights above a datum.
For the purpose it is ok to interpolate the missing value linear between the two known values.
What is the best method to do so in PostgreSQL?
Edit 20200825
I have solved this problem in a different way using the QGIS fieldcalculator. Problem with this method: it takes a long time and it the process runs client-side and I would like to run it directly on the server.
In steps, my workflow was:
In QGIS:
for the first two values from the example image that results in:
(4.95 - 5.07) /10 = -0.012
In QGIS:
for the first value in the example:
2019-01-01 00:15:15 returns: '5'
This is 5 minutes past the last recorded value.
In QGIS
```
CASE
WHEN "lat" = NULL
THEN
-- minutes pas the last recorded instance:
("min_past_rec" *
-- the "t
I would like to interpolate the missing values under "lat".
The value under "lat" are tidalheights above a datum.
For the purpose it is ok to interpolate the missing value linear between the two known values.
What is the best method to do so in PostgreSQL?
Edit 20200825
I have solved this problem in a different way using the QGIS fieldcalculator. Problem with this method: it takes a long time and it the process runs client-side and I would like to run it directly on the server.
In steps, my workflow was:
- The interval between the recorded "lat" values is 10 minutes. I calculated the increment per minute between two recorded values and stored this in an extra column called "tidal_step" at the recorded "lat"-value. (i stored the timestamp also as an "epoch" in a column)
In QGIS:
tidal_step =
-- the lat value @ the epoch, 10 minutes or 600000 miliseconds from the current epoch:
(attribute(get_feature('werkset','epoch',("epoch"+'600000')),'lat') -
-- the lat value @ the current
attribute(get_feature('werkset','epoch',"epoch"),'lat'))
/10for the first two values from the example image that results in:
(4.95 - 5.07) /10 = -0.012
- I determined the amount of minutes of the "lat"-value to be interpolated, past the last recorded instance where a "lat" value was recorded and stored this in a column: "min_past_rec"
In QGIS:
left(
right("timestamp",8) --this takes the timestamp and goes 8 charakters from the right
,1) -- this takes the string from the previous right( and goes 1 character leftfor the first value in the example:
2019-01-01 00:15:15 returns: '5'
This is 5 minutes past the last recorded value.
- I interpolated the missing values by adding the ("min_past_rec" * "tidal_step") to the last recorded "lat" value and stored this in in column called "lat_interpolated"
In QGIS
```
CASE
WHEN "lat" = NULL
THEN
-- minutes pas the last recorded instance:
("min_past_rec" *
-- the "t
Solution
I have a (partial) solution - what I did was the following (see the fiddle available here):
The algorithm I used for interpolating was
-
if there's a sequence of 1
-
A sequence of 2
In order to do this, I did the following:
Create a table:
Populate it with some sample data:
Note that records 3 & 4 and 7 are
And then I ran my first query:
Result (spaces mean
Note the use of the
This and using the
Final result:
So, you can see that the calculated value for record 7 is the average of records 6 & 8 and record 3 is the average of records 1 & 2 and the assigned value for record 4 is the average of 5 & 6. This was enabled by the use of the
I'm not really happy with this solution - it involves hard-coding for the number of
=============================== Full Query ========================
```
WITH cte1 AS
(
SELECT d1.s,
d1.t AS t1, d1.lat AS l1,
LAG(d1.lat, 2) OVER (ORDER BY t ASC) AS lag_t1_2,
LAG(d1.lat, 1) OVER (ORDER BY t ASC) AS lag_t1,
LEAD(d1.lat, 1) OVER (ORDER BY t ASC) AS lead_t1,
LEAD(d1.lat, 2) OVER (ORDER BY t ASC) AS lead_t1_2
FROM data d1
),
cte2 AS
(
SELECT
d2.t AS t2, d2.lat AS l2,
LAG(d2.lat, 1) OVER(ORDER BY t DESC) AS lag_t2,
LEAD(d2.lat, 1) OVER(ORDER BY t DESC) AS lead_t2
FROM data d2
),
cte3 AS
(
SELECT t1.s,
The algorithm I used for interpolating was
-
if there's a sequence of 1
NULL, take the average of the value above and the value below.-
A sequence of 2
NULLs, the top assigned value is the average of the two records above it and the bottom assigned one is the average of the two records below.In order to do this, I did the following:
Create a table:
CREATE TABLE data
(
s SERIAL PRIMARY KEY,
t TIMESTAMP,
lat NUMERIC
);Populate it with some sample data:
INSERT INTO data (t, lat)
VALUES
('2019-01-01 00:00:00', 5.07),
('2019-01-01 01:00:00', 4.60),
('2019-01-01 02:00:00', NULL),
('2019-01-01 03:00:00', NULL),
('2019-01-01 04:00:00', 4.7),
('2019-01-01 05:00:00', 4.20),
('2019-01-01 06:00:00', NULL),
('2019-01-01 07:00:00', 4.98),
('2019-01-01 08:00:00', 4.50);Note that records 3 & 4 and 7 are
NULL.And then I ran my first query:
WITH cte1 AS
(
SELECT d1.s,
d1.t AS t1, d1.lat AS l1,
LAG(d1.lat, 2) OVER (ORDER BY t ASC) AS lag_t1_2,
LAG(d1.lat, 1) OVER (ORDER BY t ASC) AS lag_t1,
LEAD(d1.lat, 1) OVER (ORDER BY t ASC) AS lead_t1,
LEAD(d1.lat, 2) OVER (ORDER BY t ASC) AS lead_t1_2
FROM data d1
),
cte2 AS
(
SELECT
d2.t AS t2, d2.lat AS l2,
LAG(d2.lat, 1) OVER(ORDER BY t DESC) AS lag_t2,
LEAD(d2.lat, 1) OVER(ORDER BY t DESC) AS lead_t2
FROM data d2
),
cte3 AS
(
SELECT t1.s,
t1.t1, t1.lag_t1_2, t1.lag_t1, t2.lag_t2, t1.l1, t2.l2,
t1.lead_t1, t2.lead_t2, t1.lead_t1_2
FROM cte1 t1
JOIN cte2 t2
ON t1.t1 = t2.t2
)
SELECT * FROM cte3;Result (spaces mean
NULL - it's much clearer on the fiddle):s t1 lag_t1_2 lag_t1 lag_t2 l1 l2 lead_t1 lead_t2 lead_t1_2
1 2019-01-01 00:00:00 4.60 5.07 5.07 4.60
2 2019-01-01 01:00:00 5.07 4.60 4.60 5.07
3 2019-01-01 02:00:00 5.07 4.60 4.60 4.7
4 2019-01-01 03:00:00 4.60 4.7 4.7 4.20
5 2019-01-01 04:00:00 4.20 4.7 4.7 4.20
6 2019-01-01 05:00:00 4.7 4.20 4.20 4.7 4.98
7 2019-01-01 06:00:00 4.7 4.20 4.98 4.98 4.20 4.50
8 2019-01-01 07:00:00 4.20 4.50 4.98 4.98 4.50
9 2019-01-01 08:00:00 4.98 4.50 4.50 4.98Note the use of the
LAG() and LEAD() Window functions (documentation). I've used them on the same table, but sorted differently.This and using the
OFFSET option means that from my original single lat column, I now have 6 extra columns of "generated" data which are very useful for assigning values to the missing NULL values. The last (partial) piece of the puzzle is shown below (full SQL query is at the bottom of this post and also in the fiddle).cte4 AS
(
SELECT t1.s,
t1.l1 AS lat,
CASE
WHEN (t1.l1 IS NOT NULL) THEN t1.l1
WHEN (t1.l1 IS NULL) AND (t1.l2) IS NULL AND (t1.lag_t1 IS NOT NULL)
AND (t1.lag_t2 IS NOT NULL) THEN ROUND((t1.lag_t1 + t1.lag_t2)/2, 2)
WHEN (t1.lag_t2 IS NULL) AND (t1.l1 IS NULL) AND (t1.l2 IS NULL)
AND (t1.lead_t1 IS NULL) THEN ROUND((t1.lag_t1 + t1.lag_t1_2)/2, 2)
WHEN (t1.l1 IS NULL) AND (t1.l2 IS NULL) AND (t1.lag_t1 IS NULL)
AND (t1.lead_t2 IS NULL) THEN ROUND((t1.lead_t1 + t1.lead_t1_2)/2, 2)
ELSE 0
END AS final_val
FROM cte3 t1
)
SELECT s, lat, final_val FROM cte4;Final result:
s lat final_val
1 5.07 5.07
2 4.60 4.60
3 NULL 4.84
4 NULL 4.45
5 4.7 4.7
6 4.20 4.20
7 NULL 4.59
8 4.98 4.98
9 4.50 4.50So, you can see that the calculated value for record 7 is the average of records 6 & 8 and record 3 is the average of records 1 & 2 and the assigned value for record 4 is the average of 5 & 6. This was enabled by the use of the
OFFSET option for the LAG() and LEAD() functions. If you get sequences of 3 NULLs, then you'll have to use an OFFSET of 3 and so on.I'm not really happy with this solution - it involves hard-coding for the number of
NULLs and those CASE statements will become even more complex and horrible. Ideally some sort of RECURSIVE CTE solution is required, but I HTH!=============================== Full Query ========================
```
WITH cte1 AS
(
SELECT d1.s,
d1.t AS t1, d1.lat AS l1,
LAG(d1.lat, 2) OVER (ORDER BY t ASC) AS lag_t1_2,
LAG(d1.lat, 1) OVER (ORDER BY t ASC) AS lag_t1,
LEAD(d1.lat, 1) OVER (ORDER BY t ASC) AS lead_t1,
LEAD(d1.lat, 2) OVER (ORDER BY t ASC) AS lead_t1_2
FROM data d1
),
cte2 AS
(
SELECT
d2.t AS t2, d2.lat AS l2,
LAG(d2.lat, 1) OVER(ORDER BY t DESC) AS lag_t2,
LEAD(d2.lat, 1) OVER(ORDER BY t DESC) AS lead_t2
FROM data d2
),
cte3 AS
(
SELECT t1.s,
Code Snippets
CREATE TABLE data
(
s SERIAL PRIMARY KEY,
t TIMESTAMP,
lat NUMERIC
);INSERT INTO data (t, lat)
VALUES
('2019-01-01 00:00:00', 5.07),
('2019-01-01 01:00:00', 4.60),
('2019-01-01 02:00:00', NULL),
('2019-01-01 03:00:00', NULL),
('2019-01-01 04:00:00', 4.7),
('2019-01-01 05:00:00', 4.20),
('2019-01-01 06:00:00', NULL),
('2019-01-01 07:00:00', 4.98),
('2019-01-01 08:00:00', 4.50);WITH cte1 AS
(
SELECT d1.s,
d1.t AS t1, d1.lat AS l1,
LAG(d1.lat, 2) OVER (ORDER BY t ASC) AS lag_t1_2,
LAG(d1.lat, 1) OVER (ORDER BY t ASC) AS lag_t1,
LEAD(d1.lat, 1) OVER (ORDER BY t ASC) AS lead_t1,
LEAD(d1.lat, 2) OVER (ORDER BY t ASC) AS lead_t1_2
FROM data d1
),
cte2 AS
(
SELECT
d2.t AS t2, d2.lat AS l2,
LAG(d2.lat, 1) OVER(ORDER BY t DESC) AS lag_t2,
LEAD(d2.lat, 1) OVER(ORDER BY t DESC) AS lead_t2
FROM data d2
),
cte3 AS
(
SELECT t1.s,
t1.t1, t1.lag_t1_2, t1.lag_t1, t2.lag_t2, t1.l1, t2.l2,
t1.lead_t1, t2.lead_t2, t1.lead_t1_2
FROM cte1 t1
JOIN cte2 t2
ON t1.t1 = t2.t2
)
SELECT * FROM cte3;s t1 lag_t1_2 lag_t1 lag_t2 l1 l2 lead_t1 lead_t2 lead_t1_2
1 2019-01-01 00:00:00 4.60 5.07 5.07 4.60
2 2019-01-01 01:00:00 5.07 4.60 4.60 5.07
3 2019-01-01 02:00:00 5.07 4.60 4.60 4.7
4 2019-01-01 03:00:00 4.60 4.7 4.7 4.20
5 2019-01-01 04:00:00 4.20 4.7 4.7 4.20
6 2019-01-01 05:00:00 4.7 4.20 4.20 4.7 4.98
7 2019-01-01 06:00:00 4.7 4.20 4.98 4.98 4.20 4.50
8 2019-01-01 07:00:00 4.20 4.50 4.98 4.98 4.50
9 2019-01-01 08:00:00 4.98 4.50 4.50 4.98cte4 AS
(
SELECT t1.s,
t1.l1 AS lat,
CASE
WHEN (t1.l1 IS NOT NULL) THEN t1.l1
WHEN (t1.l1 IS NULL) AND (t1.l2) IS NULL AND (t1.lag_t1 IS NOT NULL)
AND (t1.lag_t2 IS NOT NULL) THEN ROUND((t1.lag_t1 + t1.lag_t2)/2, 2)
WHEN (t1.lag_t2 IS NULL) AND (t1.l1 IS NULL) AND (t1.l2 IS NULL)
AND (t1.lead_t1 IS NULL) THEN ROUND((t1.lag_t1 + t1.lag_t1_2)/2, 2)
WHEN (t1.l1 IS NULL) AND (t1.l2 IS NULL) AND (t1.lag_t1 IS NULL)
AND (t1.lead_t2 IS NULL) THEN ROUND((t1.lead_t1 + t1.lead_t1_2)/2, 2)
ELSE 0
END AS final_val
FROM cte3 t1
)
SELECT s, lat, final_val FROM cte4;Context
StackExchange Database Administrators Q#273931, answer score: 3
Revisions (0)
No revisions yet.