patternsqlMinor
Efficient downsampling of a selected timeseries to equidistant samples
Viewed 0 times
equidistantdownsamplingtimeseriesefficientselectedsamples
Problem
We have timeseries data from several sensors in a table and I am interested in getting N equally-spaced samples from the full range, as the whole data would be too much to display in a web-app.
I am currently using the following query for N = 500:
The table contains
The above query roughly takes 36s and the result size for
I have tried
Are there more efficient ways to achieve the above? The table
About PG: "PostgreSQL 12.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit"
Running on: AWS RDS db.t3.micro which has the following specs:
Here and here are execution plans for my query.
I am currently using the following query for N = 500:
SELECT sensor_id, sensor_data_id, val1, val2, datetime
FROM (
SELECT sensor_id, sensor_data_id, val1, val2, datetime
, ROW_NUMBER() OVER (ORDER BY datetime)
FROM sensordata
WHERE sensor_id = 22
) x
WHERE mod(ROW_NUMBER, (SELECT COUNT(1) / 500 FROM sensordata WHERE sensor_id = 22)) = 0;The table contains
val1 and val2 for several timestamps per sensor_id (roughly 2M rows for sensor_id = 22). The table has primary key (sensor_data_id) and the following other indices - (sensor_id) and a composite index (sensor_id, date_time).The above query roughly takes 36s and the result size for
sensor_id = 22 is around 278 MB, found using (SELECT pg_size_pretty( sum(pg_column_size(sensordata))) FROM sensordata WHERE sensor_id = 22;). While the above query does not consider a time-range, I want to eventually add another condition in the WHERE clause to downsample the data for a selected datetime range for a particular sensor_id.I have tried
EXPLAIN ANALYZE a couple of times, and the only hint that seemed to help a little was increasing the work_mem which is 300 MB now. The database does not need to serve a lot of connections simultaneously, so this high a number is not an issue currently.Are there more efficient ways to achieve the above? The table
sensordata is likely to grow in time. Currently around 7 GBs for 121 distinct sensor_id.About PG: "PostgreSQL 12.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit"
Running on: AWS RDS db.t3.micro which has the following specs:
Here and here are execution plans for my query.
Solution
In any case, you need the multicolumn index on
Optimized simple query
You can improve your original query to:
The added
See my bug report on pgsql-bugs and the discussion on pgsql-hackers.
Either way, upgrade to a current version of Postgres. There has been a steady stream of improvements to performance.
Correct query
Neither query gets you exactly 500 equidistant rows.
The function
Adding 1 to the count makes the 500th row land inside the 500th bucket, not the 501st.
Much faster index-based selection?
If there is any reliable (!) pattern in your
Should be substantially faster for big full samples as it can skip over many rows and pick the next via index, recursively.
Related:
Aside
and the following other indices -
The index on just
(sensor_id, datetime) that you already have.Optimized simple query
You can improve your original query to:
SELECT sensor_data_id, val1, val2, datetime
FROM (
SELECT sensor_data_id, val1, val2, datetime
, row_number() OVER (ORDER BY datetime ROWS UNBOUNDED PRECEDING) AS rn
FROM sensordata
WHERE sensor_id = 22
) x
WHERE rn % (SELECT COUNT(*) / 500 FROM sensordata WHERE sensor_id = 22) = 0;The added
ROWS UNBOUNDED PRECEDING is a workaround for a sneaky performance issue in window functions that is fixed in Postgres 16.See my bug report on pgsql-bugs and the discussion on pgsql-hackers.
Either way, upgrade to a current version of Postgres. There has been a steady stream of improvements to performance.
Correct query
Neither query gets you exactly 500 equidistant rows.
The function
width_bucket() would get us there:SELECT DISTINCT ON (bucket)
sensor_data_id, val1, val2, datetime
FROM (
SELECT sensor_data_id, val1, val2, datetime
, width_bucket(row_number() OVER (ORDER BY datetime ROWS UNBOUNDED PRECEDING), 1
, (SELECT COUNT(*) + 1 FROM sensordata WHERE sensor_id = 22), 500) AS bucket
FROM sensordata
WHERE sensor_id = 22
) sub
ORDER BY bucket, datetime;Adding 1 to the count makes the 500th row land inside the 500th bucket, not the 501st.
Much faster index-based selection?
If there is any reliable (!) pattern in your
datetime column (or any column, for that matter; maybe sensor_data_id?), or if sampled rows don't have to be exactly equidistant, a much faster index-based selection may be possible. Say, you have a (mostly) regular time schedule for your sensor (regular intervals for datetime), then you can fetch 1 row per computed time interval:WITH RECURSIVE rcte AS (
( -- parentheses required!
SELECT sensor_data_id, val1, val2, datetime
, (SELECT (max(s2.datetime) - s.datetime) / 500
FROM sensordata s2
WHERE s2.sensor_id = 22) AS intvl -- very fast idx-based
FROM sensordata s
WHERE sensor_id = 22
ORDER BY datetime
LIMIT 1
)
UNION ALL
SELECT s.*, r.intvl
FROM rcte r
CROSS JOIN LATERAL (
SELECT s.sensor_data_id, s.val1, s.val2, s.datetime
FROM sensordata s
WHERE s.sensor_id = 22
AND s.datetime > r.datetime + r.intvl
ORDER BY s.datetime
LIMIT 1
) s
)
TABLE rcte;Should be substantially faster for big full samples as it can skip over many rows and pick the next via index, recursively.
Related:
- SELECT DISTINCT is slower than expected on my table in PostgreSQL
- Best way to select random rows PostgreSQL
Aside
and the following other indices -
(sensor_id) and a composite index (sensor_id, date_time).The index on just
(sensor_id) is kind of redundant. Unless Index-deduplication in Postgres 13 or later compresses it enough to be useful, additionally. See:- Is a composite index also good for queries on the first field?
Code Snippets
SELECT sensor_data_id, val1, val2, datetime
FROM (
SELECT sensor_data_id, val1, val2, datetime
, row_number() OVER (ORDER BY datetime ROWS UNBOUNDED PRECEDING) AS rn
FROM sensordata
WHERE sensor_id = 22
) x
WHERE rn % (SELECT COUNT(*) / 500 FROM sensordata WHERE sensor_id = 22) = 0;SELECT DISTINCT ON (bucket)
sensor_data_id, val1, val2, datetime
FROM (
SELECT sensor_data_id, val1, val2, datetime
, width_bucket(row_number() OVER (ORDER BY datetime ROWS UNBOUNDED PRECEDING), 1
, (SELECT COUNT(*) + 1 FROM sensordata WHERE sensor_id = 22), 500) AS bucket
FROM sensordata
WHERE sensor_id = 22
) sub
ORDER BY bucket, datetime;WITH RECURSIVE rcte AS (
( -- parentheses required!
SELECT sensor_data_id, val1, val2, datetime
, (SELECT (max(s2.datetime) - s.datetime) / 500
FROM sensordata s2
WHERE s2.sensor_id = 22) AS intvl -- very fast idx-based
FROM sensordata s
WHERE sensor_id = 22
ORDER BY datetime
LIMIT 1
)
UNION ALL
SELECT s.*, r.intvl
FROM rcte r
CROSS JOIN LATERAL (
SELECT s.sensor_data_id, s.val1, s.val2, s.datetime
FROM sensordata s
WHERE s.sensor_id = 22
AND s.datetime > r.datetime + r.intvl
ORDER BY s.datetime
LIMIT 1
) s
)
TABLE rcte;Context
StackExchange Database Administrators Q#320935, answer score: 2
Revisions (0)
No revisions yet.