HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Get the last 5 distinct values for each ID

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
distinctlasttheeachgetforvalues

Problem

I'm working with PostgreSQL 9.4.

I have a table that contains the following entries:

id | postcode | date_created
 ---+----------+-----------------
 14 | al2 2qp  | 2015-09-23 14:46:57
 14 | al2 2qp  | 2015-09-23 14:51:07
 14 | sp2 8ag  | 2015-09-23 14:56:11
 14 | se4      | 2015-09-23 16:12:05
 17 | e2       | 2015-09-23 16:15:35
 17 | fk20 8ru | 2015-09-23 16:28:35
 17 | fk20 8ru | 2015-09-23 16:35:51
 17 | se2      | 2015-09-23 16:36:17
 17 | fk20 8ru | 2015-09-23 16:36:22
 17 | fk20 8ru | 2015-09-23 16:37:04
 17 | se1      | 2015-09-23 16:37:11
 17 | fk20 8ru | 2015-09-23 16:37:15
 17 | se1 8ga  | 2015-09-24 09:52:46
 17 | se1      | 2015-09-24 10:01:19
 17 | hp27 9rz | 2015-09-24 10:05:27
 17 | hp27 9rz | 2015-09-24 10:05:29
 17 | se1      | 2015-09-24 10:19:46
 14 | tn21 8qb | 2015-09-24 14:49:05
 14 | tn21 8qb | 2015-09-24 15:42:45
 14 | tn21 8qb | 2015-09-24 17:38:06
 14 | n4 1ny   | 2015-09-25 14:49:10


What I want to achieve is a query that returns the 5 most recent unique postcode records for each id:

id | postcode
 ---+---------
 14 | n4 1ny
 14 | tn21 8qb
 14 | se4
 14 | sp2 8ag
 14 | al2 2qp
 17 | se1
 17 | hp27 9rz
 17 | se1 8ga
 17 | fk20 8ru
 17 | se2


What would be the best way of achieving this? I've been playing around with subqueries but keep hitting walls when it comes to ordering them whilst doing a DISTINCT and GROUP BY.

Solution

There are probably many ways to do this. The first that comes to mind is to use window functions:

SELECT 
    id, postcode
FROM
  ( SELECT id, postcode, 
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY MAX(date_created) DESC
                             ) AS rn
    FROM tablename
    GROUP BY id, postcode
  ) AS t
WHERE
    rn <= 5
ORDER BY 
    id, rn ;


Test at SQLfiddle.

If there are ties, say the 5th, 6th and 7th postcode for an id have the same date_created, only one of them (choice will be arbitrary) will be in the results. If you want all the tied postcodes in those cases, use RANK() instead of ROW_NUMBER().

Another option is to use the LATERAL syntax. I'm not sure which will be more efficient, it wil probably depend on the values distribution of the two columns (id and postcode), i.e. how many distinct ids in the whole table, how many distinct postcodes per id and how many rows per (id, postcode) combinations.

SELECT 
    t.id, ti.postcode
FROM
    ( SELECT DISTINCT id
      FROM tablename
    ) AS t
  CROSS JOIN LATERAL
    ( SELECT tt.postcode,
             MAX(tt.date_created) AS date_created
      FROM tablename AS tt
      WHERE tt.id = t.id
      GROUP BY tt.postcode
      ORDER BY date_created DESC
      LIMIT 5
    ) AS ti 
ORDER BY 
    t.id, ti.date_created DESC;


Adding an index on (id, postcode, date_created) would be a good idea, too - or on (id, postcode, date_created DESC).

Code Snippets

SELECT 
    id, postcode
FROM
  ( SELECT id, postcode, 
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY MAX(date_created) DESC
                             ) AS rn
    FROM tablename
    GROUP BY id, postcode
  ) AS t
WHERE
    rn <= 5
ORDER BY 
    id, rn ;
SELECT 
    t.id, ti.postcode
FROM
    ( SELECT DISTINCT id
      FROM tablename
    ) AS t
  CROSS JOIN LATERAL
    ( SELECT tt.postcode,
             MAX(tt.date_created) AS date_created
      FROM tablename AS tt
      WHERE tt.id = t.id
      GROUP BY tt.postcode
      ORDER BY date_created DESC
      LIMIT 5
    ) AS ti 
ORDER BY 
    t.id, ti.date_created DESC;

Context

StackExchange Database Administrators Q#116142, answer score: 11

Revisions (0)

No revisions yet.