snippetsqlMinor
Sort by number of related rows in referencing table
Viewed 0 times
rowsnumberreferencingrelatedsorttable
Problem
Let there be two tables:
Users
Emails
With a single query I want to retrieve:
I'd like the output to be ordered descending by number of emails and filtered including only emails starting with 'a'. Users without emails shall be included, too - treat their emails' count as
There is my query:
And the (expected) result, it looks good:
It's obvious that this is a simple and small example while the actual dataset could be large enough, so I'd like to use
And… fail. I've got incomplete information about Bob only:
Hence the question: how can I apply limit/
Users
id [pk] | name
--------+---------
1 | Alice
2 | Bob
3 | Charlie
4 | DanEmails
id | user_id | email
----+---------+-------
1 | 1 | a.1
2 | 1 | a.2
3 | 2 | a.3
4 | 2 | b.1
5 | 2 | a.4
6 | 2 | a.5
7 | 3 | b.2
8 | 3 | a.6With a single query I want to retrieve:
- user's id and name
- count of user's emails
- user's email and its id
I'd like the output to be ordered descending by number of emails and filtered including only emails starting with 'a'. Users without emails shall be included, too - treat their emails' count as
0.There is my query:
SELECT users.id AS user_id, users.name AS name,
emails.id AS email_id, emails.email AS email,
count(emails.id) OVER (PARTITION BY users.id) as n_emails
FROM users
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'
ORDER BY n_emails DESC;And the (expected) result, it looks good:
user_id | name | email_id | email | n_emails
---------+---------+----------+-------+----------
2 | Bob | 6 | a.5 | 3
2 | Bob | 5 | a.4 | 3
2 | Bob | 3 | a.3 | 3
1 | Alice | 2 | a.2 | 2
1 | Alice | 1 | a.1 | 2
3 | Charlie | 8 | a.6 | 1It's obvious that this is a simple and small example while the actual dataset could be large enough, so I'd like to use
LIMIT/OFFSET for paging. For example, I'd like to fetch a first pair of users (not just rows):-- previous query ...
LIMIT 2 OFFSET 0;And… fail. I've got incomplete information about Bob only:
user_id | name | email_id | email | n_emails
---------+------+----------+-------+----------
2 | Bob | 6 | a.5 | 3
2 | Bob | 5 | a.4 | 3Hence the question: how can I apply limit/
Solution
Exclude users without emails
Assuming we only want users that actually have emails. Users without emails are ignored. The reason I went with this assumption at first is that all your queries do that already:
By adding a
2nd query rewritten
Your 2nd query does not work as advertised, results are not "descending by number of emails". You have to nest the result of
This should be fastest if the predicate is selective enough (selects only a small fraction of all emails). Two window functions with rows sorted differently have their price, too.
3rd query improved
If, on the other hand, the predicate
Include users without emails
You can either include the users table in the inner query again, similar to what you had before. But you have to pull the filter on email into the join condition!
Which will be a bit more expensive.
Since you retrieve users with the most emails first, users without emails will rarely be in the result. To optimize performance, you could use a
Detailed explanation:
I would consider materializing the result for two reasons:
Build a MV from the 2nd query without
Assuming we only want users that actually have emails. Users without emails are ignored. The reason I went with this assumption at first is that all your queries do that already:
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'By adding a
WHERE condition on emails.email you effectively convert your LEFT JOIN to a plain [INNER] JOIN and exclude users without emails. Detailed explanation:- Query with LEFT JOIN not returning rows for count of 0
2nd query rewritten
Your 2nd query does not work as advertised, results are not "descending by number of emails". You have to nest the result of
count() in another CTE or subquery and run dense_rank() on it. You cannot nest window functions in the same query level.SELECT u.name, e2.*
FROM (
SELECT *, dense_rank() OVER (ORDER BY n_emails, users.id) AS rnk
FROM (
SELECT user_id, id AS e_id, e_mail
, count(*) OVER (PARTITION BY user_id) AS n_emails
FROM emails
WHERE email LIKE 'a' || '%' -- one % is enough
) e1
) e2
JOIN users u ON u.id = e2.user_id
WHERE rnk < 3
ORDER BY rnk;This should be fastest if the predicate is selective enough (selects only a small fraction of all emails). Two window functions with rows sorted differently have their price, too.
- A major point is to run the subquery on
emailsonly - which is possible if the preliminary assumption holds.
3rd query improved
If, on the other hand, the predicate
WHERE e.email LIKE 'a' || '%' is not very selective, your 3rd query is probably faster, even if it reads from the table twice - but the second time only desired rows. Also improved:SELECT e.user_id, u.name,
e.id AS e_id, e.e_mail, sq.n_emails
FROM (
SELECT user_id, count(*) AS n_emails
FROM emails
WHERE email LIKE 'a' || '%'
GROUP BY user_id
ORDER BY count(*) DESC, user_id -- break ties
LIMIT 2 OFFSET 0
) sq
JOIN emails e USING (user_id)
JOIN users u ON u.id = e.user_id
WHERE e.email LIKE 'a' || '%'
ORDER BY sq.n_emails DESC;Include users without emails
You can either include the users table in the inner query again, similar to what you had before. But you have to pull the filter on email into the join condition!
SELECT u.name, e2.*
FROM (
SELECT *, dense_rank() OVER (ORDER BY n_emails, users.id) AS rnk
FROM (
SELECT u.id AS user_id, u.name, e.id AS e_id
, count(e.user_id) OVER (PARTITION BY u.id) AS n_emails
FROM users u
LEFT JOIN emails e ON e.user_id = u.id
AND e.email LIKE 'a' || '%' -- !!!
) e1
) e2
WHERE rnk < 3
ORDER BY rnk;Which will be a bit more expensive.
Since you retrieve users with the most emails first, users without emails will rarely be in the result. To optimize performance, you could use a
UNION ALL with LIMIT:( -- parentheses required
SELECT u.name, e2.user_id, e2.e_id, e2.e_mail, e2.n_emails
FROM (
SELECT *, dense_rank() OVER (ORDER BY n_emails, users.id) AS rnk
FROM (
SELECT user_id, id AS e_id, e_mail
, count(*) OVER (PARTITION BY user_id) AS n_emails
FROM emails
WHERE email LIKE 'a' || '%' -- one % is enough
) e1
) e2
JOIN users u ON u.id = e2.user_id
WHERE rnk < 3 -- adapt to paging!
ORDER BY rnk
)
UNION ALL
(
SELECT u.name, u.user_id, NULL AS e_id, NULL AS e_mail, 0 AS n_emails
FROM users u
LEFT JOIN emails e ON e.user_id = u.id
AND e.email LIKE 'a' || '%'
WHERE e.e.user_id IS NULL
)
OFFSET 0 -- adapt to paging!
LIMIT 2 -- adapt to paging!Detailed explanation:
- Optimize a query on two big tables
MATERIALIZED VIEWI would consider materializing the result for two reasons:
- Subsequent queries are much faster.
- You don't have to operate on a moving target. You speak of paging, and if users get new emails between pages, your whole sort order may be moot.
Build a MV from the 2nd query without
LIMIT (REFRESH MATERIALIZED VIEW), then return the first page etc. It's a matter of policy, when you refresh the MV again.Code Snippets
LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'SELECT u.name, e2.*
FROM (
SELECT *, dense_rank() OVER (ORDER BY n_emails, users.id) AS rnk
FROM (
SELECT user_id, id AS e_id, e_mail
, count(*) OVER (PARTITION BY user_id) AS n_emails
FROM emails
WHERE email LIKE 'a' || '%' -- one % is enough
) e1
) e2
JOIN users u ON u.id = e2.user_id
WHERE rnk < 3
ORDER BY rnk;SELECT e.user_id, u.name,
e.id AS e_id, e.e_mail, sq.n_emails
FROM (
SELECT user_id, count(*) AS n_emails
FROM emails
WHERE email LIKE 'a' || '%'
GROUP BY user_id
ORDER BY count(*) DESC, user_id -- break ties
LIMIT 2 OFFSET 0
) sq
JOIN emails e USING (user_id)
JOIN users u ON u.id = e.user_id
WHERE e.email LIKE 'a' || '%'
ORDER BY sq.n_emails DESC;SELECT u.name, e2.*
FROM (
SELECT *, dense_rank() OVER (ORDER BY n_emails, users.id) AS rnk
FROM (
SELECT u.id AS user_id, u.name, e.id AS e_id
, count(e.user_id) OVER (PARTITION BY u.id) AS n_emails
FROM users u
LEFT JOIN emails e ON e.user_id = u.id
AND e.email LIKE 'a' || '%' -- !!!
) e1
) e2
WHERE rnk < 3
ORDER BY rnk;( -- parentheses required
SELECT u.name, e2.user_id, e2.e_id, e2.e_mail, e2.n_emails
FROM (
SELECT *, dense_rank() OVER (ORDER BY n_emails, users.id) AS rnk
FROM (
SELECT user_id, id AS e_id, e_mail
, count(*) OVER (PARTITION BY user_id) AS n_emails
FROM emails
WHERE email LIKE 'a' || '%' -- one % is enough
) e1
) e2
JOIN users u ON u.id = e2.user_id
WHERE rnk < 3 -- adapt to paging!
ORDER BY rnk
)
UNION ALL
(
SELECT u.name, u.user_id, NULL AS e_id, NULL AS e_mail, 0 AS n_emails
FROM users u
LEFT JOIN emails e ON e.user_id = u.id
AND e.email LIKE 'a' || '%'
WHERE e.e.user_id IS NULL
)
OFFSET 0 -- adapt to paging!
LIMIT 2 -- adapt to paging!Context
StackExchange Database Administrators Q#86038, answer score: 2
Revisions (0)
No revisions yet.