patternsqlMinor
Optimize query to find top N users who commented on a post
Viewed 0 times
commentedtopwhoquerypostoptimizefindusers
Problem
Problem
Trying to find the most efficient query to retrieve the top N (5 in the examples) users who have commented on a post, where a user is considered 'top' if they have the most followers. The query optimizer does not seem to be choosing the correct path.
Tables (Postgres v9.4.4)
user_account (40k records)
following (13k records)
follower_count_mv (10.5k records with only 5 users having > 1 follower)
user_post_comment (13.4k records but the majority are on 3 posts)
Queries I've tried
1) The most natural choice: join the tables and sort
This is what I originally had, but the query optimizer seems to have a hard time figuring out the best way to execute this. Something to do with the data distribution perhaps?
```
Limit (cost=0.99..117.00 rows=5 width=580) (actual time=0.082..148.688 rows=2 loops=1)
-> Nested Loop (cost=0.99..12136
Trying to find the most efficient query to retrieve the top N (5 in the examples) users who have commented on a post, where a user is considered 'top' if they have the most followers. The query optimizer does not seem to be choosing the correct path.
Tables (Postgres v9.4.4)
user_account (40k records)
CREATE TABLE user_account (
user_id TEXT PRIMARY KEY,
name TEXT
);following (13k records)
CREATE TABLE following (
follower_user_id TEXT,
followed_user_id TEXT,
PRIMARY KEY (followed_user_id, follower_user_id)
);follower_count_mv (10.5k records with only 5 users having > 1 follower)
CREATE MATERIALIZED VIEW follower_count_mv AS
SELECT followed_user_id AS user_id, COUNT(*)::int AS follower_count
FROM following
WHERE deleted_at IS NULL
GROUP BY user_id;
CREATE UNIQUE INDEX follower_count_mv_user_id_idx ON follower_count_mv (user_id);
CREATE INDEX follower_count_mv_follower_count_idx ON follower_count_mv (follower_count);user_post_comment (13.4k records but the majority are on 3 posts)
CREATE TABLE user_post_comment (
comment_id TEXT PRIMARY KEY,
user_id TEXT,
post_id TEXT
)
CREATE INDEX user_post_comment_user_id_idx ON user_post_comment (user_id);
CREATE INDEX user_post_comment_post_id_idx ON user_post_comment (post_id);Queries I've tried
1) The most natural choice: join the tables and sort
SELECT user_account.*
FROM user_account
JOIN follower_count_mv ON (user_account.user_id = follower_count_mv.user_id)
JOIN user_post_comment ON (user_account.user_id = user_post_comment.user_id)
WHERE user_post_comment.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
ORDER BY follower_count DESC LIMIT 5;This is what I originally had, but the query optimizer seems to have a hard time figuring out the best way to execute this. Something to do with the data distribution perhaps?
```
Limit (cost=0.99..117.00 rows=5 width=580) (actual time=0.082..148.688 rows=2 loops=1)
-> Nested Loop (cost=0.99..12136
Solution
Better data types
Indexes
You only use single-column indexes. Since you need to optimize read performance for your query, add these two multicolumn indexes:
@Ziggy already mentioned the second.
The order of index columns is important:
Btree indexes can be scanned backwards at practically the same cost. But it's even slightly faster to use matching descending sort order. (There's a corner case with NULL values.)
Query
For a single post like in your examples it won't get faster than this:
This is assuming that the same user can comment on the same post multiple times. It's cheapest to fold duplicates before joining to
And join to
Only join to
You did not specify, but your queries have an outer
If
Getting the top N for multiple posts at once is a bit more complex. Detailed explanation in chapter 2a of this related answer:
text is a sub-optimal data type for key columns. It would be more efficient to use integer. Related:- Indexes: integer vs string performance if the number of nodes in the index is the same
'26c72242-7e3b-4982-92c5-021b622d7ecd' in your example looks like a UUID. If you need to use UUIDs, still don't store them as text. The appropriate data type would be uuid - much more efficient. Details:- Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
Indexes
You only use single-column indexes. Since you need to optimize read performance for your query, add these two multicolumn indexes:
CREATE INDEX fc_mv_user_id_follower_count_idx ON follower_count_mv (user_id, follower_count DESC);
CREATE INDEX upc_post_id_user_id_idx ON user_post_comment (post_id, user_id);@Ziggy already mentioned the second.
The order of index columns is important:
- Is a composite index also good for queries on the first field?
Btree indexes can be scanned backwards at practically the same cost. But it's even slightly faster to use matching descending sort order. (There's a corner case with NULL values.)
Query
For a single post like in your examples it won't get faster than this:
SELECT ua.*
FROM (
SELECT user_id, fc.follower_count
FROM (
SELECT DISTINCT user_id
FROM user_post_comment
WHERE post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
) pc
JOIN follower_count_mv fc USING (user_id)
ORDER BY fc.follower_count DESC
LIMIT 5
) sub
JOIN user_account ua USING (user_id)
ORDER BY sub.follower_count DESC;This is assuming that the same user can comment on the same post multiple times. It's cheapest to fold duplicates before joining to
follower_count_mv.And join to
follower_count_mv directly. It's expensive and useless to use user_account as stepping stone.Only join to
user_account after reducing to the top 5.You did not specify, but your queries have an outer
ORDER BY follower_count DESC. I only include follower_count in the subquery for the outer sort.If
(post_id, user_id) is unique (users can only comment once on each post), simplify to:SELECT ua.*
FROM (
SELECT user_id, fc.follower_count
FROM user_post_comment pc
JOIN follower_count_mv fc USING (user_id)
WHERE pc.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
ORDER BY fc.follower_count DESC
LIMIT 5
) sub
JOIN user_account ua USING (user_id)
ORDER BY sub.follower_count DESC;Getting the top N for multiple posts at once is a bit more complex. Detailed explanation in chapter 2a of this related answer:
- Optimize GROUP BY query to retrieve latest record per user
Code Snippets
CREATE INDEX fc_mv_user_id_follower_count_idx ON follower_count_mv (user_id, follower_count DESC);
CREATE INDEX upc_post_id_user_id_idx ON user_post_comment (post_id, user_id);SELECT ua.*
FROM (
SELECT user_id, fc.follower_count
FROM (
SELECT DISTINCT user_id
FROM user_post_comment
WHERE post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
) pc
JOIN follower_count_mv fc USING (user_id)
ORDER BY fc.follower_count DESC
LIMIT 5
) sub
JOIN user_account ua USING (user_id)
ORDER BY sub.follower_count DESC;SELECT ua.*
FROM (
SELECT user_id, fc.follower_count
FROM user_post_comment pc
JOIN follower_count_mv fc USING (user_id)
WHERE pc.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
ORDER BY fc.follower_count DESC
LIMIT 5
) sub
JOIN user_account ua USING (user_id)
ORDER BY sub.follower_count DESC;Context
StackExchange Database Administrators Q#138158, answer score: 4
Revisions (0)
No revisions yet.