patternsqlMinor
Get top two rows per group efficiently
Viewed 0 times
rowsefficientlytoppergrouptwoget
Problem
Currently, I use a query with
But how can I get previous recent posts for each author? In other words, how to make
I need this to compare recent post with previous one.
This query selects the most recent post for each author. (Third post for each in my example.):
db<>fiddle here
I also need the previous post for each author (Second post for each in my example.)
I don't want to use window functions because my table is large enough, and I suppose window functions could be slow.
DISTINCT ON to get a list of recent posts for each author.But how can I get previous recent posts for each author? In other words, how to make
DISTINCT ON return second row of each group, not first.I need this to compare recent post with previous one.
CREATE TABLE posts (
title varchar(30),
author varchar(30),
created_at date
);
INSERT INTO posts VALUES
('Johns first post', 'John', 'January 1, 2021'),
('Johns second post', 'John', 'January 2, 2021'),
('Johns third post', 'John', 'January 3, 2021'),
('Mikes first post', 'Mike', 'January 1, 2021'),
('Mikes second post', 'Mike', 'January 2, 2021'),
('Mikes third post', 'Mike', 'January 3, 2021');This query selects the most recent post for each author. (Third post for each in my example.):
SELECT DISTINCT ON (author) * FROM posts ORDER BY author ASC, created_at DESCdb<>fiddle here
I also need the previous post for each author (Second post for each in my example.)
I don't want to use window functions because my table is large enough, and I suppose window functions could be slow.
Solution
DISTINCT ON is only good to get a single (distinct) row per group. And only the one you can sort on top somehow. Even then it's only efficient with few rows per group. See:- Select first row in each GROUP BY group?
I am going to assume a big table with many posts per author (the typical case).
Simple and slow
Can be built around
row_number() in a subquery:SELECT *
FROM (
SELECT *, row_number() OVER (PARTITION BY author ORDER BY created_at DESC NULLS LAST) AS post_num
FROM posts
) p
WHERE post_num < 3;DESC NULLS LAST since all your columns can be NULL.You'd really want all columns to be
NOT NULL and created_at to be timestamptz. (See below for more.)The query will use a sequential scan, which is very inefficient for the case. (Just like you supposed.)
Sophisticated and fast
You'd want to use an index efficiently. Assuming a table design as primitive as displayed, one like:
CREATE INDEX ON posts (author DESC NULLS LAST, created_at DESC NULLS LAST);We can put this to work, with some sophistication:
WITH RECURSIVE cte AS (
(
SELECT *
FROM posts
ORDER BY author DESC NULLS LAST, created_at DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT p.*
FROM cte c
CROSS JOIN LATERAL (
SELECT *
FROM posts p
WHERE p.author < c.author -- lateral reference
ORDER BY author DESC NULLS LAST, created_at DESC NULLS LAST
LIMIT 1
) p
)
SELECT *, 1 AS post_num
FROM cte
UNION ALL
SELECT p.*
FROM cte c
CROSS JOIN LATERAL (
SELECT *, 2 AS post_num
FROM posts p
WHERE p.author = c.author
AND p.created_at < c.created_at -- assuming no two posts with same date
ORDER BY created_at DESC NULLS LAST
LIMIT 1
) p;db<>fiddle here
The first step is a classic recursive CTE to get the first post per author. Detailed explanation here:
- Optimize GROUP BY query to retrieve latest row per user
The second step it to fetch the next post for each author in a
LATERAL subquery - making use of the index once more.Simple and fast
In a proper relational design, you'd have a separate
author table like:CREATE TABLE author (
author_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, author text NOT NULL
);
INSERT INTO author(author) VALUES
('John')
, ('Mike');
CREATE TABLE post (
post_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, author_id int NOT NULL REFERENCES author
, title varchar(30) NOT NULL
, created_at timestamptz NOT NULL DEFAULT now()
);
INSERT INTO post (author_id, title, created_at) VALUES
(1, 'Johns first post', 'January 1, 2021')
, (1, 'Johns second post', 'January 2, 2021')
, (1, 'Johns third post', 'January 3, 2021')
, (2, 'Mikes first post', 'January 1, 2021')
, (2, 'Mikes second post', 'January 2, 2021')
, (2, 'Mikes third post', 'January 3, 2021')
;Then the index can simply be:
CREATE INDEX ON post (author_id, created_at);And we can have a very simple and very efficient query:
SELECT p.*
FROM author a
CROSS JOIN LATERAL (
SELECT *
FROM post
WHERE author_id = a.author_id
ORDER BY created_at DESC
LIMIT 2
) p;db<>fiddle here
Code Snippets
SELECT *
FROM (
SELECT *, row_number() OVER (PARTITION BY author ORDER BY created_at DESC NULLS LAST) AS post_num
FROM posts
) p
WHERE post_num < 3;CREATE INDEX ON posts (author DESC NULLS LAST, created_at DESC NULLS LAST);WITH RECURSIVE cte AS (
(
SELECT *
FROM posts
ORDER BY author DESC NULLS LAST, created_at DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT p.*
FROM cte c
CROSS JOIN LATERAL (
SELECT *
FROM posts p
WHERE p.author < c.author -- lateral reference
ORDER BY author DESC NULLS LAST, created_at DESC NULLS LAST
LIMIT 1
) p
)
SELECT *, 1 AS post_num
FROM cte
UNION ALL
SELECT p.*
FROM cte c
CROSS JOIN LATERAL (
SELECT *, 2 AS post_num
FROM posts p
WHERE p.author = c.author
AND p.created_at < c.created_at -- assuming no two posts with same date
ORDER BY created_at DESC NULLS LAST
LIMIT 1
) p;CREATE TABLE author (
author_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, author text NOT NULL
);
INSERT INTO author(author) VALUES
('John')
, ('Mike');
CREATE TABLE post (
post_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, author_id int NOT NULL REFERENCES author
, title varchar(30) NOT NULL
, created_at timestamptz NOT NULL DEFAULT now()
);
INSERT INTO post (author_id, title, created_at) VALUES
(1, 'Johns first post', 'January 1, 2021')
, (1, 'Johns second post', 'January 2, 2021')
, (1, 'Johns third post', 'January 3, 2021')
, (2, 'Mikes first post', 'January 1, 2021')
, (2, 'Mikes second post', 'January 2, 2021')
, (2, 'Mikes third post', 'January 3, 2021')
;CREATE INDEX ON post (author_id, created_at);Context
StackExchange Database Administrators Q#286627, answer score: 3
Revisions (0)
No revisions yet.