HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Computing Matching Data Points from Fuzzy Timestamps in Postgresql

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqltimestampspointsfromdatamatchingfuzzycomputing

Problem

I have a table containing multiple time series of different types. The timestamps of cohesive points from different series do not match exactly (i. e. the difference can be up to an hour).

Schema

Below is the schema with two example series:

CREATE TABLE series (id integer, series_type integer, charttime timestamp,
                     value integer, PRIMARY KEY (id));
INSERT INTO series VALUES (1, 1, '2018-03-01 12:10:00', 40),
    (2, 1, '2018-03-01 13:25:00', 30), (3, 1, '2018-03-01 14:10:00', 50);
INSERT INTO series VALUES (4, 2, '2018-03-01 11:20:00', 2), (5, 2, '2018-03-01 12:15:00', 6),
    (6, 2, '2018-03-01 13:00:00', 7), (7, 2, '2018-03-01 13:45:00', 1);

id |series_type |charttime           |value |
---|------------|--------------------|------|
1  |1           |2018-03-01 12:10:00 |40    |
2  |1           |2018-03-01 13:25:00 |30    |
3  |1           |2018-03-01 14:10:00 |50    |
4  |2           |2018-03-01 11:20:00 |2     |
5  |2           |2018-03-01 12:15:00 |6     |
7  |2           |2018-03-01 13:45:00 |1     |
6  |2           |2018-03-01 13:00:00 |7     |


Goal

The goal is to select one series together with the closest datapoint from another series. For the example dataset the result should be:

charttime           |s1 |s2 |
--------------------|---|---|
2018-03-01 12:10:00 |40 |6  |
2018-03-01 13:25:00 |30 |1  |
2018-03-01 14:10:00 |50 |1  |


First working approach

My current approach is to select the best matching data point from the other series by a subquery:

SELECT l.charttime, l.value AS s1,
    ( SELECT r.value
      FROM series r
      WHERE ABS( EXTRACT( EPOCH FROM l.charttime - r.charttime ) / 3600 ) < 1
            AND r.series_type = 2
      ORDER BY ABS( EXTRACT( EPOCH FROM l.charttime - r.charttime )) ASC LIMIT 1 
    ) AS s2
FROM series l
WHERE l.series_type = 1
ORDER BY l.charttime ASC


This does not seem to be the best approach as the dataset is quite huge and thus performing many subqueries slows down th

Solution

In your second approach, with the self join, you could remove duplicates using row_number(),

Partition by l.charttime, order by the time difference and filter for row_number = 1.

I think performance will be horrible, however. Because of the cartesian join this will be an O(size(series 1) x size(series 2)) operation.

Having both l.charttime and r.charttime inside functions may also be causing trouble. Try refactoring to (in pseudo code)

r.charttime  l.charttime - 3600


.. and see how the query plan looks. I presume there's an index on charttime. Without one no approach will be fast. Indeed, two partial indexes, one on Series 1 and one on Series 2 may be even better.

Code Snippets

r.charttime < l.charttime + 3600
and r.charttime > l.charttime - 3600

Context

StackExchange Database Administrators Q#211331, answer score: 4

Revisions (0)

No revisions yet.