Validate id sequence consistency in records in Redshift
Problem
I have a backend that logs events to Redshift; for each event it generates a unique id. The id is a sequential number.
I have something like this (`events` table):

```
+-------------------------+------+
| created_at              | id   |
+-------------------------+------+
| 2017-06-30 09:20:47 UTC | 100  |
| 2017-06-30 09:18:31 UTC | 101  |
| 2017-06-30 09:16:19 UTC | 102  |
| 2017-06-30 09:12:08 UTC | 103  |
| 2017-06-30 09:11:59 UTC | 104  |
| 2017-06-30 09:11:15 UTC | 105  |
| 2017-06-30 07:03:41 UTC | 106  |
+-------------------------+------+
```

I have a task that runs every hour and moves a few records to another table (
`deactivated_events`). I would like to verify that for the last 3 hours there are no missing records, using the `id` sequence. First I thought about using `generate_series`, but it is not available in Redshift. Others suggested creating a table holding only the ids, but it is still a pain to populate a table with integers (here is one example to generate 1 million

I'm wondering whether the best approach isn't to use min, max and count, like this:
```sql
WITH merged_events AS
  (SELECT *
   FROM
     (SELECT id, created_at
      FROM events
      UNION
      SELECT id, created_at
      FROM deactivated_events
     ) AS q0  -- Redshift requires an alias on the derived table
   WHERE created_at > GETDATE() - INTERVAL '3 hours'
   ORDER BY id)
SELECT COUNT(*), (max(id) - min(id) + 1) AS diff
FROM merged_events;
```

PS (bonus): how to find missing or duplicate records?
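One thing worth noting about comparing `COUNT(*)` with `max(id) - min(id) + 1`: the check cannot distinguish a gap that is canceled out by a duplicate. A minimal sketch using Python's `sqlite3` as a stand-in for Redshift (the table name and data here are illustrative, not from the question):

```python
import sqlite3

# In-memory stand-in for the merged events table (not actual Redshift).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE merged_events (id INTEGER)")

# id 101 is missing, id 103 appears twice: the gap and the duplicate cancel out.
con.executemany("INSERT INTO merged_events VALUES (?)",
                [(100,), (102,), (103,), (103,), (104,)])

count, diff = con.execute(
    "SELECT COUNT(*), MAX(id) - MIN(id) + 1 FROM merged_events"
).fetchone()

# Both values are 5, so this check alone reports no problem
# even though the data has both a missing and a duplicated id.
print(count, diff)
```

So the count/diff comparison is a cheap smoke test, but answering the bonus question properly needs a row-by-row check like the one in the accepted solution below.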
Solution
Let's imagine you have this setup:
```sql
INSERT INTO events
    (created_at, id)
VALUES
    ('2017-06-30 09:20:47 UTC', 100),
 -- ('2017-06-30 09:18:31 UTC', 101), -- Missing row
    ('2017-06-30 09:16:19 UTC', 102),
    ('2017-06-30 09:12:08 UTC', 103),
    ('2017-06-30 09:11:59 UTC', 104),
    ('2017-06-30 09:11:15 UTC', 105),
    ('2017-06-30 07:03:41 UTC', 106) ;
```

... and:
```sql
INSERT INTO deactivated_events
    (created_at, id)
VALUES
    ('2017-06-30 07:03:41 UTC', 97),
    ('2017-06-30 09:11:15 UTC', 98),
    ('2017-06-30 09:11:15 UTC', 99),
    ('2017-06-30 09:18:31 UTC', 100) -- Repeated row
;
```

Provided that Redshift supports Window Functions, you can use a slight variation of your query to do two things:

- Consider that there are repeated `id`s when you group by `id` and find the count to be > 1.
- Consider that there is a missing row (or more!) if the previous row's `id` + 1 is not the current `id`. This is done by means of the `LAG` function.
This can be done with the following query:

```sql
WITH merged_events AS
(
    SELECT
        id
    FROM
    (
        SELECT
            id
        FROM
            events

        UNION ALL  /* Must be UNION ALL, because we want to find repeated values */

        SELECT
            id
        FROM
            deactivated_events
    ) AS q0
    WHERE true  -- In practice: created_at > GETDATE() - INTERVAL '3 hours'
)
SELECT
    id,
    count(id) > 1 AS repeated_event,
    ((lag(id) OVER(ORDER BY id)) /* previous id */ + 1) <> id AS previous_event_missing
FROM
    merged_events
GROUP BY
    id ;
```

This will produce:
|  id | repeated_event | previous_event_missing |
| --: | :------------- | :--------------------- |
|  97 | f              | null                   |
|  98 | f              | f                      |
|  99 | f              | f                      |
| 100 | t              | f                      |
| 102 | f              | t                      |
| 103 | f              | f                      |
| 104 | f              | f                      |
| 105 | f              | f                      |
| 106 | f              | f                      |
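For the bonus question (listing the actual missing and duplicated ids rather than scanning flag columns by eye), the same GROUP BY / LAG logic can be expressed in plain Python. This is a sketch over the example data above, not Redshift code:

```python
from collections import Counter

# ids from both tables combined (the UNION ALL), per the example data above
ids = [100, 102, 103, 104, 105, 106,  # events (101 is missing)
       97, 98, 99, 100]               # deactivated_events (100 is repeated)

counts = Counter(ids)
# repeated_event: any id seen more than once across both tables
duplicated = sorted(i for i, c in counts.items() if c > 1)

# previous_event_missing, made explicit: enumerate every id falling in a gap
distinct = sorted(counts)
missing = [m for prev, cur in zip(distinct, distinct[1:])
           for m in range(prev + 1, cur)]

print(duplicated)  # [100]
print(missing)     # [101]
```

This matches the table above: id 100 is flagged as repeated, and the gap before id 102 corresponds to the missing id 101.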
You can check the whole setup (with PostgreSQL instead of Redshift) at dbfiddle here.
Caveats: your `id` might have gaps if it actually comes from an IDENTITY column (or a SEQUENCE, if Redshift happens to implement it, which I don't think is the case). So, you should first guarantee by some other means that your ids are originally consecutive...

References:
- Sequence number generation function in AWS Redshift, which discusses different ways of getting sequences and their tradeoffs.
- How to find missing values in a sequence with SQL, which shows different alternatives.
Context
StackExchange Database Administrators Q#177691, answer score: 2