patternsqlMinor
Count number of occurences per location per hour in PostgreSQL
Viewed 0 times
postgresqlnumberperhouroccurencescountlocation
Problem
I have a dataset in Postgres of boat locations on waterways. Here is a sample of the table:
boat_id
ts
waterway_id
Boat_A
2019-01-01 16:29:11
WW_01
Boat_A
2019-01-01 17:03:04
WW_02
Boat_B
2019-01-01 16:11:34
WW_01
Boat_B
2019-01-01 16:13:45
WW_01
Boat_B
2019-01-01 17:05:13
WW_01
Boat_C
2019-01-01 16:03:00
WW_01
Boat_C
2019-01-01 16:09:50
WW_02
Boat_C
2019-01-01 16:16:22
WW_01
Boat_C
2019-01-01 16:45:44
WW_01
boat_id is the unique identification of the boat, ts is timestamp and water_id is the unique identifier of the waterway.
I would like to know for each hour in the dataset how many boats passed each waterway. The result should look like this:
waterway_id
report_ts
passage_count
WW_01
2019-01-01 00:00
3
WW_01
2019-01-01 01:00
1
...
...
...
WW_01
2019-12-31 23:00
5
WW_02
2019-01-01 00:00
13
WW_02
2019-01-01 01:00
11
...
...
...
The raw data contains the position of boats, not passages. Thus:
In the example data above, boat_A makes 1 passage on waterway WW_01 at 16h and 1 on WW_02 at 17h, boat_b makes 1 passages on WW_01 at 16h (there is no passage at 18h because it did not go to antoher waterway in between), boat_C makes 2 passages on waterway WW_01 at 16h and 1 passage on WW_02 at 16h. In a table (waterway-hour combinations with 0 passages do not have to be included in the result):
waterway_id
report_ts
passage_count
WW_01
2019-01-01 16:00
4
WW_02
2019-01-01 16:00
1
WW_02
2019-01-01 17:00
1
What should the query to get this result look like?
In my mind, it consists of two steps:
boat_id
ts
waterway_id
Boat_A
2019-01-01 16:29:11
WW_01
Boat_A
2019-01-01 17:03:04
WW_02
Boat_B
2019-01-01 16:11:34
WW_01
Boat_B
2019-01-01 16:13:45
WW_01
Boat_B
2019-01-01 17:05:13
WW_01
Boat_C
2019-01-01 16:03:00
WW_01
Boat_C
2019-01-01 16:09:50
WW_02
Boat_C
2019-01-01 16:16:22
WW_01
Boat_C
2019-01-01 16:45:44
WW_01
boat_id is the unique identification of the boat, ts is timestamp and water_id is the unique identifier of the waterway.
I would like to know for each hour in the dataset how many boats passed each waterway. The result should look like this:
waterway_id
report_ts
passage_count
WW_01
2019-01-01 00:00
3
WW_01
2019-01-01 01:00
1
...
...
...
WW_01
2019-12-31 23:00
5
WW_02
2019-01-01 00:00
13
WW_02
2019-01-01 01:00
11
...
...
...
The raw data contains the position of boats, not passages. Thus:
- Multiple datapoints of the same boat on the same waterway should be counted as a single passage.
- If a boat has been on another waterway and comes back it should be counted as another passage.
- If a boat is detected on the same waterway in multiple hours, without being on anther waterway in between, it should be counted as a single passage in the hour it was first detected.
In the example data above, boat_A makes 1 passage on waterway WW_01 at 16h and 1 on WW_02 at 17h, boat_b makes 1 passages on WW_01 at 16h (there is no passage at 18h because it did not go to antoher waterway in between), boat_C makes 2 passages on waterway WW_01 at 16h and 1 passage on WW_02 at 16h. In a table (waterway-hour combinations with 0 passages do not have to be included in the result):
waterway_id
report_ts
passage_count
WW_01
2019-01-01 16:00
4
WW_02
2019-01-01 16:00
1
WW_02
2019-01-01 17:00
1
What should the query to get this result look like?
In my mind, it consists of two steps:
- Computing unique passages per boat per waterway
- Organizing these in a table as
Solution
Editing to address this (emphasis mine) which was not the case with the original request:
In a table (waterway-hour combinations with 0 passages do not have to
be included in the result):
Primary keys are important
But before we get into that, we need to make sure you have the right primary key defined on your data, which is
Getting Passages
To determine if a passage has occurred, we need to know the last position of each
Alternately, you can use an analytical/windowing function as Erwin and Vérace have done. I provide this as a "second solution" as analytic/windowing functions will force a sort in most instances1. With larger amounts of data (or a different RDBMS), this may be a more expensive operation than just a self join with the proper primary key2. As always, test.
Modified fiddle here: http://sqlfiddle.com/#!17/2cede7/2
1 In SQL Server (and probably some other commercial platforms) a windowing/analytic function will not force a sort if the
and
2 The more recent versions of Postgres allow the INCLUDE statement to force specified non-key columns to be added to the B-Tree. In this instance, you could include the
In a table (waterway-hour combinations with 0 passages do not have to
be included in the result):
Primary keys are important
But before we get into that, we need to make sure you have the right primary key defined on your data, which is
(Boat_Id,Timestamp). Creating this gives us two things:- Non-conforming records are rejected (a
Boatcan't be in two places at once)
- A B-Tree for efficiently locating prior records for each
Boatusing a method other than an analytic/windowing function
Getting Passages
To determine if a passage has occurred, we need to know the last position of each
Boat, which we get through a correlated subquery searching for the entry with the greatest Timestamp less than the current Timestamp. Since we are only interested in Boats that have moved Waterways, we can exclude them from our result set.SELECT
BD.Waterway_ID
,date_trunc('hour',BD.TimeStamp) AS Timestamp
,COUNT(*) AS passage_count
FROM
Boat_Data BD
LEFT JOIN
Boat_Data PriorBD
ON PriorBD.Boat_Id = BD.Boat_Id
AND PriorBD.Timestamp =
(
SELECT
MAX(TimeStamp)
FROM
Boat_Data
WHERE
Boat_Id = BD.Boat_Id
AND TimeStamp PriorBD.Waterway_Id
OR PriorBD.Waterway_Id IS NULL
GROUP BY
BD.Waterway_ID
,date_trunc('hour',BD.TimeStamp)Alternately, you can use an analytical/windowing function as Erwin and Vérace have done. I provide this as a "second solution" as analytic/windowing functions will force a sort in most instances1. With larger amounts of data (or a different RDBMS), this may be a more expensive operation than just a self join with the proper primary key2. As always, test.
SELECT
BD.Waterway_ID
,date_trunc('hour',BD.TimeStamp) AS Timestamp
,COUNT(*) AS passage_count
FROM
(
SELECT
Boat_Id
,Timestamp
,Waterway_Id
,CASE
WHEN Waterway_Id <> LAG(Waterway_Id,1,'') OVER (PARTITION BY Boat_Id ORDER BY Timestamp) THEN 1
ELSE 0
END AS Passage_Ind
FROM
Boat_Data
) BD
WHERE
BD.Passage_Ind = 1
GROUP BY
BD.Waterway_ID
,date_trunc('hour',BD.TimeStamp)
;Modified fiddle here: http://sqlfiddle.com/#!17/2cede7/2
1 In SQL Server (and probably some other commercial platforms) a windowing/analytic function will not force a sort if the
PARTITION BYand
ORDER BY statements match the sort order of the clustered index. This is not the case in MySQL.2 The more recent versions of Postgres allow the INCLUDE statement to force specified non-key columns to be added to the B-Tree. In this instance, you could include the
Waterway_Id so the entire query could be fulfilled without touching the heap.Code Snippets
SELECT
BD.Waterway_ID
,date_trunc('hour',BD.TimeStamp) AS Timestamp
,COUNT(*) AS passage_count
FROM
Boat_Data BD
LEFT JOIN
Boat_Data PriorBD
ON PriorBD.Boat_Id = BD.Boat_Id
AND PriorBD.Timestamp =
(
SELECT
MAX(TimeStamp)
FROM
Boat_Data
WHERE
Boat_Id = BD.Boat_Id
AND TimeStamp < BD.Timestamp
)
WHERE
BD.Waterway_ID <> PriorBD.Waterway_Id
OR PriorBD.Waterway_Id IS NULL
GROUP BY
BD.Waterway_ID
,date_trunc('hour',BD.TimeStamp)SELECT
BD.Waterway_ID
,date_trunc('hour',BD.TimeStamp) AS Timestamp
,COUNT(*) AS passage_count
FROM
(
SELECT
Boat_Id
,Timestamp
,Waterway_Id
,CASE
WHEN Waterway_Id <> LAG(Waterway_Id,1,'') OVER (PARTITION BY Boat_Id ORDER BY Timestamp) THEN 1
ELSE 0
END AS Passage_Ind
FROM
Boat_Data
) BD
WHERE
BD.Passage_Ind = 1
GROUP BY
BD.Waterway_ID
,date_trunc('hour',BD.TimeStamp)
;Context
StackExchange Database Administrators Q#295951, answer score: 4
Revisions (0)
No revisions yet.