HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Form groups of consecutive rows with same value

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
rowssamegroupswithvalueformconsecutive

Problem

I have a situation I think can be solved using window function but I'm not sure.

Imagine the following table

CREATE TABLE tmp (
  date timestamp
, id_type integer
) ;

INSERT INTO tmp (date, id_type)
VALUES
    ( '2017-01-10 07:19:21.0', 3 ),
    ( '2017-01-10 07:19:22.0', 3 ),
    ( '2017-01-10 07:19:23.1', 3 ),
    ( '2017-01-10 07:19:24.1', 3 ),
    ( '2017-01-10 07:19:25.0', 3 ),
    ( '2017-01-10 07:19:26.0', 5 ),
    ( '2017-01-10 07:19:27.1', 3 ),
    ( '2017-01-10 07:19:28.0', 5 ),
    ( '2017-01-10 07:19:29.0', 5 ),
    ( '2017-01-10 07:19:30.1', 3 ),
    ( '2017-01-10 07:19:31.0', 5 ),
    ( '2017-01-10 07:19:32.0', 3 ),
    ( '2017-01-10 07:19:33.1', 5 ),
    ( '2017-01-10 07:19:35.0', 5 ),
    ( '2017-01-10 07:19:36.1', 5 ),
    ( '2017-01-10 07:19:37.1', 5 );


I'd like to have a new group at each change of value in column id_type.
E.G. 1st group from 7:19:21 to 7:19:25, 2nd starting and finishing at 7:19:26, and so on.

At this moment, using the query below ...

SELECT distinct 
    min(min(date)) over w as begin, 
    max(max(date)) over w as end,   
    id_type
FROM tmp
GROUP BY id_type
WINDOW w AS (PARTITION BY id_type)
ORDER BY begin;


I get the following result:
begin end id_type
2017-01-10 07:19:21.0 2017-01-10 07:19:32.0 3
2017-01-10 07:19:26.0 2017-01-10 07:19:37.1 5


While I'd like:
begin end id_type
2017-01-10 07:19:21.0 2017-01-10 07:19:25.0 3
2017-01-10 07:19:26.0 2017-01-10 07:19:26.0 5
2017-01-10 07:19:27.1 2017-01-10 07:19:27.1 3
2017-01-10 07:19:28.0 2017-01-10 07:19:29.0 5
2017-01-10 07:19:30.1 2017-01-10 07:19:30.1 3
2017-01-10 07:19:31.0 2017-01-10 07:19:31.0 5
2017-01-10 07:19:32.0 2017-01-10 07:19:32.0 3
2017-01-10 07:19:33.1 2017-01-10 07:19:37.1 5


Once that works, I want to include more criteria to define groups, and these others will be nullable.

Postgres Version: 8.4. We have Postgres wi

Solution


  1. Window functions plus subqueries



Count the steps to form groups, similar to Evan's idea, with modifications and fixes:

SELECT id_type
     , min(date) AS begin
     , max(date) AS end
     , count(*)  AS row_ct  -- optional addition
FROM  (
   SELECT date, id_type, count(step OR NULL) OVER (ORDER BY date) AS grp
   FROM  (
      SELECT date, id_type
           , lag(id_type, 1, id_type) OVER (ORDER BY date) <> id_type AS step
      FROM   tmp
      ) sub1
   ) sub2
GROUP  BY id_type, grp
ORDER  BY min(date);


This assumes involved columns are NOT NULL. Else you need to do more.

Also assuming date to be defined UNIQUE, else you need to add a tiebreaker to the ORDER BY clauses get deterministic results. Like: ORDER BY date, id.

Detailed explanation (answer to very similar question):

  • Select longest continuous sequence



Note in particular:

-
In related cases, lag() with 3 parameters can be essential to cover the corner case of the first (or last) row elegantly. (The 3rd param is used as default if there is no previous (next) row.

lag(id_type, 1, id_type) OVER ()


Since we are only interested in an actual change of id_type (TRUE), it does not matter in this particular case. NULL and FALSE both don't count as step.

  • count(step OR NULL) OVER (ORDER BY date) is the shortest syntax that also works in Postgres 9.3 or older. count() only counts non-null values ...



In modern Postgres, the cleaner, equivalent syntax would be:

count(step) FILTER (WHERE step) OVER (ORDER BY date)


Details:

  • For absolute performance, is SUM faster or COUNT?



  1. Subtract two window functions, one subquery



Similar to Erik's idea with modifications:

SELECT min(date) AS begin
     , max(date) AS end
     , id_type
FROM  (
   SELECT date, id_type
        , row_number() OVER (ORDER BY date)
        - row_number() OVER (PARTITION BY id_type ORDER BY date) AS grp
   FROM   tmp
   ) sub
GROUP  BY id_type, grp
ORDER  BY min(date);


If date is defined UNIQUE, like I mention above, dense_rank() would be pointless, since the result is the same as for row_number() and the latter is substantially cheaper.

If date is not defined UNIQUE (and we don't know that the only duplicates are on (date, id_type)), all of these queries are pointless, since the result is arbitrary.

Also, a subquery is typically cheaper than a CTE in Postgres. Only use CTEs when you need them.

Related answers with more explanation:

  • GROUP BY and aggregate sequential numeric values



  • Group by repeating attribute



  • GROUP BY uninterrupted sequence of logs for same location



In related cases where we already have a running number in the table, we can make do with a single window function:

  • Rank based on sequence of dates



  1. Top performance with plpgsql function



Since this question has become unexpectedly popular, I'll add another solution to demonstrate top performance.

SQL has many sophisticated tools to create solutions with short and elegant syntax. But a declarative language has its limits for more complex requirements that involve procedural elements.

A procedural solution with a server-side function is faster for this than anything posted so far because it only needs a single sequential scan over the table and a single sort operation. If a fitting index is available, even just a single index-only scan.

CREATE OR REPLACE FUNCTION f_tmp_groups()
  RETURNS TABLE (id_type int, grp_begin timestamp, grp_end timestamp)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _row  tmp;                       -- use table type for row variable
BEGIN
   FOR _row IN
      TABLE tmp ORDER BY date       -- add more columns for deterministic order 
   LOOP
      CASE _row.id_type = id_type 
      WHEN TRUE THEN                -- same group continues
         grp_end := _row.date;      -- remember last date so far
      WHEN FALSE THEN               -- next group starts
         RETURN NEXT;               -- return result for last group
         id_type   := _row.id_type;
         grp_begin := _row.date;
         grp_end   := _row.date;
      ELSE                          -- NULL for 1st row
         id_type   := _row.id_type; -- remember row data for starters
         grp_begin := _row.date;
         grp_end   := _row.date;
      END CASE;
   END LOOP;

   RETURN NEXT;                     -- return last result row      
END
$func$;


Call:

SELECT * FROM f_tmp_groups();


Test with:

EXPLAIN (ANALYZE, TIMING OFF)  -- to focus on total performance
SELECT * FROM  f_tmp_groups();


You could make the function generic with polymorphic types and pass table type and column names. Details:

  • Refactor a PL/pgSQL function to return the output of various SELECT queries



If you don't want to or cannot persist a function for this, it would even pay to create a temporary function on the fly. Costs a few ms.

  • How to create a temporary function in PostgreSQL?



db<>fiddle here - comparing performance

Code Snippets

SELECT id_type
     , min(date) AS begin
     , max(date) AS end
     , count(*)  AS row_ct  -- optional addition
FROM  (
   SELECT date, id_type, count(step OR NULL) OVER (ORDER BY date) AS grp
   FROM  (
      SELECT date, id_type
           , lag(id_type, 1, id_type) OVER (ORDER BY date) <> id_type AS step
      FROM   tmp
      ) sub1
   ) sub2
GROUP  BY id_type, grp
ORDER  BY min(date);
lag(id_type, 1, id_type) OVER ()
count(step) FILTER (WHERE step) OVER (ORDER BY date)
SELECT min(date) AS begin
     , max(date) AS end
     , id_type
FROM  (
   SELECT date, id_type
        , row_number() OVER (ORDER BY date)
        - row_number() OVER (PARTITION BY id_type ORDER BY date) AS grp
   FROM   tmp
   ) sub
GROUP  BY id_type, grp
ORDER  BY min(date);
CREATE OR REPLACE FUNCTION f_tmp_groups()
  RETURNS TABLE (id_type int, grp_begin timestamp, grp_end timestamp)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _row  tmp;                       -- use table type for row variable
BEGIN
   FOR _row IN
      TABLE tmp ORDER BY date       -- add more columns for deterministic order 
   LOOP
      CASE _row.id_type = id_type 
      WHEN TRUE THEN                -- same group continues
         grp_end := _row.date;      -- remember last date so far
      WHEN FALSE THEN               -- next group starts
         RETURN NEXT;               -- return result for last group
         id_type   := _row.id_type;
         grp_begin := _row.date;
         grp_end   := _row.date;
      ELSE                          -- NULL for 1st row
         id_type   := _row.id_type; -- remember row data for starters
         grp_begin := _row.date;
         grp_end   := _row.date;
      END CASE;
   END LOOP;

   RETURN NEXT;                     -- return last result row      
END
$func$;

Context

StackExchange Database Administrators Q#166374, answer score: 18

Revisions (0)

No revisions yet.