HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Count number of occurences per location per hour in PostgreSQL

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlnumberperhouroccurencescountlocation

Problem

I have a dataset in Postgres of boat locations on waterways. Here is a sample of the table:

boat_id
ts
waterway_id

Boat_A
2019-01-01 16:29:11
WW_01

Boat_A
2019-01-01 17:03:04
WW_02

Boat_B
2019-01-01 16:11:34
WW_01

Boat_B
2019-01-01 16:13:45
WW_01

Boat_B
2019-01-01 17:05:13
WW_01

Boat_C
2019-01-01 16:03:00
WW_01

Boat_C
2019-01-01 16:09:50
WW_02

Boat_C
2019-01-01 16:16:22
WW_01

Boat_C
2019-01-01 16:45:44
WW_01

boat_id is the unique identification of the boat, ts is timestamp and water_id is the unique identifier of the waterway.
I would like to know for each hour in the dataset how many boats passed each waterway. The result should look like this:

waterway_id
report_ts
passage_count

WW_01
2019-01-01 00:00
3

WW_01
2019-01-01 01:00
1

...
...
...

WW_01
2019-12-31 23:00
5

WW_02
2019-01-01 00:00
13

WW_02
2019-01-01 01:00
11

...
...
...

The raw data contains the position of boats, not passages. Thus:

  • Multiple datapoints of the same boat on the same waterway should be counted as a single passage.



  • If a boat has been on another waterway and comes back it should be counted as another passage.



  • If a boat is detected on the same waterway in multiple hours, without being on anther waterway in between, it should be counted as a single passage in the hour it was first detected.


In the example data above, boat_A makes 1 passage on waterway WW_01 at 16h and 1 on WW_02 at 17h, boat_b makes 1 passages on WW_01 at 16h (there is no passage at 18h because it did not go to antoher waterway in between), boat_C makes 2 passages on waterway WW_01 at 16h and 1 passage on WW_02 at 16h. In a table (waterway-hour combinations with 0 passages do not have to be included in the result):

waterway_id
report_ts
passage_count

WW_01
2019-01-01 16:00
4

WW_02
2019-01-01 16:00
1

WW_02
2019-01-01 17:00
1

What should the query to get this result look like?
In my mind, it consists of two steps:

  • Computing unique passages per boat per waterway



  • Organizing these in a table as

Solution

Editing to address this (emphasis mine) which was not the case with the original request:

In a table (waterway-hour combinations with 0 passages do not have to
be included in the result):

Primary keys are important

But before we get into that, we need to make sure you have the right primary key defined on your data, which is (Boat_Id,Timestamp). Creating this gives us two things:

  • Non-conforming records are rejected (a Boat can't be in two places at once)



  • A B-Tree for efficiently locating prior records for each Boat using a method other than an analytic/windowing function



Getting Passages

To determine if a passage has occurred, we need to know the last position of each Boat, which we get through a correlated subquery searching for the entry with the greatest Timestamp less than the current Timestamp. Since we are only interested in Boats that have moved Waterways, we can exclude them from our result set.

SELECT
  BD.Waterway_ID
 ,date_trunc('hour',BD.TimeStamp) AS Timestamp
 ,COUNT(*) AS passage_count
FROM
  Boat_Data BD
LEFT JOIN
  Boat_Data PriorBD
    ON PriorBD.Boat_Id = BD.Boat_Id
        AND PriorBD.Timestamp =
          (
            SELECT
              MAX(TimeStamp)
            FROM
              Boat_Data
            WHERE
              Boat_Id = BD.Boat_Id
                AND TimeStamp  PriorBD.Waterway_Id
    OR PriorBD.Waterway_Id IS NULL
GROUP BY
  BD.Waterway_ID
 ,date_trunc('hour',BD.TimeStamp)


Alternately, you can use an analytical/windowing function as Erwin and Vérace have done. I provide this as a "second solution" as analytic/windowing functions will force a sort in most instances1. With larger amounts of data (or a different RDBMS), this may be a more expensive operation than just a self join with the proper primary key2. As always, test.

SELECT
  BD.Waterway_ID
 ,date_trunc('hour',BD.TimeStamp) AS Timestamp
 ,COUNT(*) AS passage_count
FROM
  (
    SELECT
      Boat_Id
     ,Timestamp
     ,Waterway_Id
     ,CASE
        WHEN Waterway_Id <> LAG(Waterway_Id,1,'') OVER (PARTITION BY Boat_Id ORDER BY Timestamp) THEN 1
        ELSE 0
      END AS Passage_Ind
    FROM
      Boat_Data
  ) BD
WHERE
  BD.Passage_Ind = 1
GROUP BY
  BD.Waterway_ID
 ,date_trunc('hour',BD.TimeStamp)
;


Modified fiddle here: http://sqlfiddle.com/#!17/2cede7/2

1 In SQL Server (and probably some other commercial platforms) a windowing/analytic function will not force a sort if the PARTITION BY
and ORDER BY statements match the sort order of the clustered index. This is not the case in MySQL.

2 The more recent versions of Postgres allow the INCLUDE statement to force specified non-key columns to be added to the B-Tree. In this instance, you could include the Waterway_Id so the entire query could be fulfilled without touching the heap.

Code Snippets

SELECT
  BD.Waterway_ID
 ,date_trunc('hour',BD.TimeStamp) AS Timestamp
 ,COUNT(*) AS passage_count
FROM
  Boat_Data BD
LEFT JOIN
  Boat_Data PriorBD
    ON PriorBD.Boat_Id = BD.Boat_Id
        AND PriorBD.Timestamp =
          (
            SELECT
              MAX(TimeStamp)
            FROM
              Boat_Data
            WHERE
              Boat_Id = BD.Boat_Id
                AND TimeStamp < BD.Timestamp
         )
WHERE
  BD.Waterway_ID <> PriorBD.Waterway_Id
    OR PriorBD.Waterway_Id IS NULL
GROUP BY
  BD.Waterway_ID
 ,date_trunc('hour',BD.TimeStamp)
SELECT
  BD.Waterway_ID
 ,date_trunc('hour',BD.TimeStamp) AS Timestamp
 ,COUNT(*) AS passage_count
FROM
  (
    SELECT
      Boat_Id
     ,Timestamp
     ,Waterway_Id
     ,CASE
        WHEN Waterway_Id <> LAG(Waterway_Id,1,'') OVER (PARTITION BY Boat_Id ORDER BY Timestamp) THEN 1
        ELSE 0
      END AS Passage_Ind
    FROM
      Boat_Data
  ) BD
WHERE
  BD.Passage_Ind = 1
GROUP BY
  BD.Waterway_ID
 ,date_trunc('hour',BD.TimeStamp)
;

Context

StackExchange Database Administrators Q#295951, answer score: 4

Revisions (0)

No revisions yet.