HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Omit max entry from each group in postgres?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
entryeachgrouppostgresomitmaxfrom

Problem

I have a query which produces data similar to the following:

a_key     | a_timestamp | a_value
---------------------------------
345223452 | 2016-03-07  | 1
029370982 | 2017-01-12  | 1
984656834 | 2017-02-03  | 1
928459238 | 2014-08-06  | 2
298338793 | 2015-07-10  | 2
203897932 | 2015-09-12  | 3
123477399 | 2016-05-03  | 3


I need to eliminate the most recent timestamp for each given value.

Keys are all unique. Values in this query are all non-unique (that is, there will be no value that is not contained in at least two rows) due to filters already applied.

It seems there should be some easy way to do this with an aggregate function or a "group by" clause, but I can't seem to work it out.

How can I filter this result to exclude the most recently seen key for each value?

(If it helps, I only really need the key from the results, not all three columns.)

I am using Postgres 9.3.11.

Solution

My first thought would be to use a window function like ROW_NUMBER(), almost identical to your solution.

Here are a few more ways to write this query:

WITH mytable AS
  ( --- the query --- )
SELECT a_key
FROM mytable AS t
WHERE EXISTS 
      ( SELECT *
        FROM mytable AS n
        WHERE t.a_value = n.a_value
          AND t.a_timestamp < n.a_timestamp
      ) ;


Using another window function, LEAD():

SELECT a_key
FROM
  ( SELECT a_key,
           (LEAD(a_timestamp) OVER (PARTITION BY a_value
                                    ORDER BY a_timestamp)
            IS NOT NULL) AS ok
    FROM mytable
  ) AS pointless
WHERE ok ;


A variation on the above, using a different condition to check which rows to keep:

(LEAD(a_value) OVER (ORDER BY a_value, a_timestamp)
            = a_value) AS ok


And a rather weird solution:

SELECT a_key
FROM mytable

EXCEPT

( SELECT DISTINCT ON (a_value) a_key
  FROM mytable
  ORDER BY a_value, a_timestamp DESC
) ;


Modified NOT IN to NOT EXISTS:

SELECT a_key
FROM mytable AS t
WHERE NOT EXISTS
      ( SELECT 1 
        FROM mytable AS m 
        WHERE m.a_value = t.a_value 
        HAVING MAX(m.a_timestamp) = t.a_timestamp
      ) ;


And modified again to a JOIN:

SELECT t.a_key
FROM mytable AS t 
  JOIN
    ( SELECT a_value, max(a_timestamp) AS a_timestamp 
      FROM mytable                   
      GROUP BY a_value 
    ) AS m 
  ON  t.a_value = m.a_value 
  AND t.a_timestamp < m.a_timestamp ;


Regarding performance, I did a test with a small 200K rows table (not a subquery), with and without indexes, of the various methods.

Since the query needs to return a large majority of rows (more than 50% and could be close to 100% depending on the distribution), I wouldn't expect indexes to be particularly helpful.

The window function solutions (ROW_NUMBER(), RANK(), LEAD()) performed quite well and similarly to each other (less than 2 sec).

The EXISTS method came a bit slower and last was the DISTINCT ON method (around 3 sec).

The NOT IN method by @joanolo shows a materialized subplan and was really slow (but it may be more efficient if the mytable subquery returns fewer rows). Modified to a similar NOT EXISTS lowered the response time to about 3 seconds. The JOIN modification was somewhat better, around 2 - 2.5 sec.

The plans showed sequential scans of course and most improved with indexes, doing index scans instead (and lowering response time to about 1-1.5 sec for the window functions and the join methods).

(I used (a_value, a_timestamp) and (a_value, a_timestamp, a_key) indexes and variations changing to timestamp DESC but the actual indexes are more or less irrelevant to the specific example, since we have no idea how complex the sybquery is.)

Code Snippets

WITH mytable AS
  ( --- the query --- )
SELECT a_key
FROM mytable AS t
WHERE EXISTS 
      ( SELECT *
        FROM mytable AS n
        WHERE t.a_value = n.a_value
          AND t.a_timestamp < n.a_timestamp
      ) ;
SELECT a_key
FROM
  ( SELECT a_key,
           (LEAD(a_timestamp) OVER (PARTITION BY a_value
                                    ORDER BY a_timestamp)
            IS NOT NULL) AS ok
    FROM mytable
  ) AS pointless
WHERE ok ;
(LEAD(a_value) OVER (ORDER BY a_value, a_timestamp)
            = a_value) AS ok
SELECT a_key
FROM mytable

EXCEPT

( SELECT DISTINCT ON (a_value) a_key
  FROM mytable
  ORDER BY a_value, a_timestamp DESC
) ;
SELECT a_key
FROM mytable AS t
WHERE NOT EXISTS
      ( SELECT 1 
        FROM mytable AS m 
        WHERE m.a_value = t.a_value 
        HAVING MAX(m.a_timestamp) = t.a_timestamp
      ) ;

Context

StackExchange Database Administrators Q#165389, answer score: 3

Revisions (0)

No revisions yet.