HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Is there a way to SELECT n ON (like DISTINCT ON, but more than one of each)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
distincteachbutmorewaylikethanoneselectthere

Problem

I have a table us_customers that looks like this (with hundreds of thousands of rows):

+----------+----------+
|    id    | us_state |
+----------+----------+
| 12345678 | MA       |
| 23456781 | AL       |
| 34567812 | GA       |
| 45678123 | FL       |
| 56781234 | AZ       |
| 67812345 | MA       |
| 78123456 | CO       |
| 81234567 | FL       |
+----------+----------+


... and I want to SELECT a sample of n customers from each us_state.

Is there a way to do that cleanly in PostgreSQL 9.3?

I can get one customer from each us_state easily with:

SELECT DISTINCT ON (us_state) id
FROM us_customers
ORDER BY us_state;


But if I want, say, three customers from each state, is there a way I can do this without running the same query multiple times?

Solution

You can number and order id by us_state using the ROW_NUMBER() Window Function and only keep the n first values:

SELECT * 
FROM (
  SELECT *
    , ROW_NUMBER() OVER(PARTITION BY us_state ORDER BY id) as n
  FROM data
) as ord
WHERE n <= 2
ORDER BY us_state
;


Or you can CROSS JOIN with a subquery:

SELECT l.*
FROM (
  SELECT DISTINCT us_state FROM data
) as s
CROSS JOIN LATERAL (
  SELECT * 
  FROM data d
  WHERE d.us_state = s.us_state
  ORDER BY id
  LIMIT 2
) as l
ORDER BY l.us_state
;


  • Sample SQL Fiddle here



  • I used a small sample of 1 to 3 rows per state. Therefore I limit only to 2 values



  • I ordered it by ids but you can change that and order by whatever works best for you



Output with my small sample:

id | us_state | n
      123 |       AL | 1 
      456 |       AL | 2 
 56781234 |       AZ | 1 
 78123456 |       CO | 1 
 45678123 |       FL | 1 
 81234567 |       FL | 2 
 34567812 |       GA | 1 
      123 |       MA | 1 
      456 |       MA | 2


Note that n is the result of ROW_NUMBER and doesn't exist in the second query.
On a big table, an index on the partition (us-state) and order (id here) columns will help.

Sample used:

CREATE TABLE data
    ("id" int, "us_state" varchar(2))
;

INSERT INTO data
    ("id", "us_state")
VALUES
    (12345678, 'MA'),
    (123, 'MA'),
    (456, 'MA'),
    (23456781, 'AL'),
    (123, 'AL'),
    (456, 'AL'),
    (34567812, 'GA'),
    (45678123, 'FL'),
    (56781234, 'AZ'),
    (67812345, 'MA'),
    (78123456, 'CO'),
    (81234567, 'FL')
;

Code Snippets

SELECT * 
FROM (
  SELECT *
    , ROW_NUMBER() OVER(PARTITION BY us_state ORDER BY id) as n
  FROM data
) as ord
WHERE n <= 2
ORDER BY us_state
;
SELECT l.*
FROM (
  SELECT DISTINCT us_state FROM data
) as s
CROSS JOIN LATERAL (
  SELECT * 
  FROM data d
  WHERE d.us_state = s.us_state
  ORDER BY id
  LIMIT 2
) as l
ORDER BY l.us_state
;
id | us_state | n
      123 |       AL | 1 
      456 |       AL | 2 
 56781234 |       AZ | 1 
 78123456 |       CO | 1 
 45678123 |       FL | 1 
 81234567 |       FL | 2 
 34567812 |       GA | 1 
      123 |       MA | 1 
      456 |       MA | 2
CREATE TABLE data
    ("id" int, "us_state" varchar(2))
;

INSERT INTO data
    ("id", "us_state")
VALUES
    (12345678, 'MA'),
    (123, 'MA'),
    (456, 'MA'),
    (23456781, 'AL'),
    (123, 'AL'),
    (456, 'AL'),
    (34567812, 'GA'),
    (45678123, 'FL'),
    (56781234, 'AZ'),
    (67812345, 'MA'),
    (78123456, 'CO'),
    (81234567, 'FL')
;

Context

StackExchange Database Administrators Q#133393, answer score: 10

Revisions (0)

No revisions yet.