HiveBrain v1.2.0

Validate id sequence consistency in records in Redshift

Submitted by: @import:stackexchange-dba
Tags: validate, redshift, records, sequence, consistency

Problem

I have a backend that logs events to Redshift; for each event it generates a unique id, which is a sequential number.

I have something like (events table):

+-------------------------+------+
| created_at              |  id  |
+-------------------------+------+
| 2017-06-30 09:20:47 UTC | 100  |
| 2017-06-30 09:18:31 UTC | 101  |
| 2017-06-30 09:16:19 UTC | 102  |
| 2017-06-30 09:12:08 UTC | 103  |
| 2017-06-30 09:11:59 UTC | 104  |
| 2017-06-30 09:11:15 UTC | 105  |
| 2017-06-30 07:03:41 UTC | 106  |
+-------------------------+------+


I have a task that runs every hour and moves a few records to another table (deactivated_events).

I would like to verify, using the id sequence, that no records are missing for the last 3 hours.
First I thought about using generate_series, but it is not available in Redshift. Others suggested building a table containing only the ids, but populating a table with consecutive integers is still a pain (one suggested example generates 1 million rows).
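The generate_series idea can still be prototyped locally. Below is a minimal sketch using SQLite as a stand-in for Redshift (table and column names follow the question, the data is made up): a recursive CTE builds the number series between min(id) and max(id), and an anti-join reveals the missing ids. Redshift syntax would differ, so treat this as an illustration of the technique, not Redshift code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (created_at TEXT, id INTEGER)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2017-06-30 09:20:47", 100),
     ("2017-06-30 09:16:19", 102),   # 101 is missing
     ("2017-06-30 09:12:08", 103)])

# Build the full series min(id)..max(id), then anti-join against the table.
missing = con.execute("""
    WITH RECURSIVE series(n) AS (
        SELECT (SELECT MIN(id) FROM events)
        UNION ALL
        SELECT n + 1 FROM series
        WHERE n < (SELECT MAX(id) FROM events)
    )
    SELECT n FROM series
    WHERE n NOT IN (SELECT id FROM events)
""").fetchall()
print(missing)  # -> [(101,)]
```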

I'm wondering whether the best approach is to use min, max and count, like this:

WITH merged_events AS
  (SELECT *
   FROM
     (SELECT id, created_at
      FROM events
      UNION
      SELECT id, created_at
      FROM deactivated_events
     ) AS q  -- Redshift (like PostgreSQL) requires an alias for the subquery
   WHERE created_at > GETDATE() - INTERVAL '3 hours'
   ORDER BY id)
SELECT COUNT(*), (max(id) - min(id) + 1) AS diff
FROM merged_events;
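For reference, here is how that count-vs-span check behaves on a toy dataset, sketched in SQLite as a stand-in for Redshift (the data is invented). Note that because UNION de-duplicates, this check can only reveal gaps, not duplicate ids:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER)")
con.execute("CREATE TABLE deactivated_events (id INTEGER)")
con.executemany("INSERT INTO events VALUES (?)", [(100,), (102,), (103,)])  # 101 missing
con.executemany("INSERT INTO deactivated_events VALUES (?)", [(98,), (99,)])

count, span = con.execute("""
    WITH merged_events AS (
        SELECT id FROM events
        UNION
        SELECT id FROM deactivated_events
    )
    SELECT COUNT(*), MAX(id) - MIN(id) + 1 FROM merged_events
""").fetchone()
# A gap exists whenever count < span:
print(count, span)  # -> 5 6
```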


PS: Bonus: how can I find missing or duplicate records?

Solution

Let's imagine you have this setup:

INSERT INTO events
    (created_at, id)
VALUES
    ('2017-06-30 09:20:47 UTC', 100),
    -- ('2017-06-30 09:18:31 UTC', 101), -- Missing row
    ('2017-06-30 09:16:19 UTC', 102),
    ('2017-06-30 09:12:08 UTC', 103),
    ('2017-06-30 09:11:59 UTC', 104),
    ('2017-06-30 09:11:15 UTC', 105),
    ('2017-06-30 07:03:41 UTC', 106) ;


and ...

INSERT INTO deactivated_events
    (created_at, id)
VALUES
    ('2017-06-30 07:03:41 UTC', 97),
    ('2017-06-30 09:11:15 UTC', 98),
    ('2017-06-30 09:11:15 UTC', 99),
    ('2017-06-30 09:18:31 UTC', 100)     -- Repeated row
    ;


Since Redshift supports window functions, you can use a slight variation of your query to do two things:

  • Detect repeated ids: after grouping by id, a count greater than 1 means the id appears more than once.

  • Detect a missing row (or more!): the previous id plus 1 should equal the current id; if it does not, at least one row is missing. This is done by means of the LAG window function.



This can be done with the following query:

WITH merged_events AS
(
  SELECT 
      id
  FROM
  (
      SELECT 
          id
      FROM 
          events
      UNION ALL  /* Must be UNION ALL, because we want to find repeated values */
      SELECT 
          id
      FROM 
          deactivated_events
  ) AS q0
  WHERE true -- In practice, created_at > GETDATE() - INTERVAL '3 hours'
)
SELECT 
    id, 
    count(id) > 1 AS repeated_event, 
    ((lag(id) OVER(ORDER BY id)) /* previous id */ + 1) <> id AS previous_event_missing
FROM 
    merged_events
GROUP BY
    id ;


This will produce:

+-----+----------------+------------------------+
| id  | repeated_event | previous_event_missing |
+-----+----------------+------------------------+
|  97 | f              | null                   |
|  98 | f              | f                      |
|  99 | f              | f                      |
| 100 | t              | f                      |
| 102 | f              | t                      |
| 103 | f              | f                      |
| 104 | f              | f                      |
| 105 | f              | f                      |
| 106 | f              | f                      |
+-----+----------------+------------------------+
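The same LAG-based query can be reproduced locally. Here is a minimal sketch using SQLite 3.25+ (which supports window functions) as a stand-in for Redshift, with the sample ids from the setup above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER)")
con.execute("CREATE TABLE deactivated_events (id INTEGER)")
con.executemany("INSERT INTO events VALUES (?)",
                [(100,), (102,), (103,), (104,), (105,), (106,)])  # 101 missing
con.executemany("INSERT INTO deactivated_events VALUES (?)",
                [(97,), (98,), (99,), (100,)])                     # 100 repeated

rows = con.execute("""
    WITH merged_events AS (
        SELECT id FROM events
        UNION ALL              -- must be UNION ALL so duplicates survive
        SELECT id FROM deactivated_events
    )
    SELECT id,
           COUNT(id) > 1                        AS repeated_event,
           LAG(id) OVER (ORDER BY id) + 1 <> id AS previous_event_missing
    FROM merged_events
    GROUP BY id
    ORDER BY id
""").fetchall()
for r in rows:
    print(r)
# id 97 has no predecessor (None), id 100 is repeated, id 102 follows a gap.
```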

You can check the whole setup (with PostgreSQL instead of Redshift) at dbfiddle here.

Caveats: your ids might have gaps even without missing records if they come from an IDENTITY column (or from a SEQUENCE, if Redshift implemented them, which I don't think it does). So you should first guarantee by some other means that your ids are originally consecutive.

References:

  • Sequence number generation function in AWS Redshift, which discusses different ways of generating sequences and their tradeoffs.



  • How to find missing values in a sequence with SQL which shows different alternatives.


Context

StackExchange Database Administrators Q#177691, answer score: 2
