HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Optimize query to find top N users who commented on a post

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
commentedtopwhoquerypostoptimizefindusers

Problem

Problem

Trying to find the most efficient query to retrieve the top N (5 in the examples) users who have commented on a post, where a user is considered 'top' if they have the most followers. The query optimizer does not seem to be choosing the correct path.

Tables (Postgres v9.4.4)

user_account (40k records)

CREATE TABLE user_account (
  user_id TEXT PRIMARY KEY,
  name TEXT
);


following (13k records)

CREATE TABLE following (
  follower_user_id TEXT,
  followed_user_id TEXT,
  PRIMARY KEY (followed_user_id, follower_user_id)
);


follower_count_mv (10.5k records with only 5 users having > 1 follower)

CREATE MATERIALIZED VIEW follower_count_mv AS
SELECT followed_user_id AS user_id, COUNT(*)::int AS follower_count
FROM following
WHERE deleted_at IS NULL
GROUP BY user_id;

CREATE UNIQUE INDEX follower_count_mv_user_id_idx ON follower_count_mv (user_id);
CREATE INDEX follower_count_mv_follower_count_idx ON follower_count_mv (follower_count);


user_post_comment (13.4k records but the majority are on 3 posts)

CREATE TABLE user_post_comment (
  comment_id TEXT PRIMARY KEY,
  user_id TEXT,
  post_id TEXT
)
CREATE INDEX user_post_comment_user_id_idx ON user_post_comment (user_id);
CREATE INDEX user_post_comment_post_id_idx ON user_post_comment (post_id);


Queries I've tried

1) The most natural choice: join the tables and sort

SELECT user_account.*
FROM user_account
JOIN follower_count_mv ON (user_account.user_id = follower_count_mv.user_id)
JOIN user_post_comment ON (user_account.user_id = user_post_comment.user_id)
WHERE user_post_comment.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
ORDER BY follower_count DESC LIMIT 5;


This is what I originally had, but the query optimizer seems to have a hard time figuring out the best way to execute this. Something to do with the data distribution perhaps?

```
Limit (cost=0.99..117.00 rows=5 width=580) (actual time=0.082..148.688 rows=2 loops=1)
-> Nested Loop (cost=0.99..12136

Solution

Better data types

text is a sub-optimal data type for key columns. It would be more efficient to use integer. Related:

  • Indexes: integer vs string performance if the number of nodes in the index is the same



'26c72242-7e3b-4982-92c5-021b622d7ecd' in your example looks like a UUID. If you need to use UUIDs, still don't store them as text. The appropriate data type would be uuid - much more efficient. Details:

  • Would index lookup be noticeably faster with char vs varchar when all values are 36 chars



Indexes

You only use single-column indexes. Since you need to optimize read performance for your query, add these two multicolumn indexes:

CREATE INDEX fc_mv_user_id_follower_count_idx ON follower_count_mv (user_id, follower_count DESC);
CREATE INDEX upc_post_id_user_id_idx ON user_post_comment (post_id, user_id);


@Ziggy already mentioned the second.

The order of index columns is important:

  • Is a composite index also good for queries on the first field?



Btree indexes can be scanned backwards at practically the same cost. But it's even slightly faster to use matching descending sort order. (There's a corner case with NULL values.)

Query

For a single post like in your examples it won't get faster than this:

SELECT ua.*
FROM  (
   SELECT user_id, fc.follower_count 
   FROM  (
      SELECT DISTINCT user_id
      FROM   user_post_comment
      WHERE  post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
      ) pc
   JOIN   follower_count_mv fc USING (user_id)
   ORDER  BY fc.follower_count DESC
   LIMIT  5
   ) sub
JOIN   user_account ua USING (user_id)
ORDER  BY sub.follower_count DESC;


This is assuming that the same user can comment on the same post multiple times. It's cheapest to fold duplicates before joining to follower_count_mv.

And join to follower_count_mv directly. It's expensive and useless to use user_account as stepping stone.

Only join to user_account after reducing to the top 5.

You did not specify, but your queries have an outer ORDER BY follower_count DESC. I only include follower_count in the subquery for the outer sort.

If (post_id, user_id) is unique (users can only comment once on each post), simplify to:

SELECT ua.*
FROM  (
   SELECT user_id, fc.follower_count
   FROM   user_post_comment pc
   JOIN   follower_count_mv fc USING (user_id)
   WHERE  pc.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
   ORDER  BY fc.follower_count DESC
   LIMIT  5
   ) sub
JOIN   user_account ua USING (user_id)
ORDER  BY sub.follower_count DESC;


Getting the top N for multiple posts at once is a bit more complex. Detailed explanation in chapter 2a of this related answer:

  • Optimize GROUP BY query to retrieve latest record per user

Code Snippets

CREATE INDEX fc_mv_user_id_follower_count_idx ON follower_count_mv (user_id, follower_count DESC);
CREATE INDEX upc_post_id_user_id_idx ON user_post_comment (post_id, user_id);
SELECT ua.*
FROM  (
   SELECT user_id, fc.follower_count 
   FROM  (
      SELECT DISTINCT user_id
      FROM   user_post_comment
      WHERE  post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
      ) pc
   JOIN   follower_count_mv fc USING (user_id)
   ORDER  BY fc.follower_count DESC
   LIMIT  5
   ) sub
JOIN   user_account ua USING (user_id)
ORDER  BY sub.follower_count DESC;
SELECT ua.*
FROM  (
   SELECT user_id, fc.follower_count
   FROM   user_post_comment pc
   JOIN   follower_count_mv fc USING (user_id)
   WHERE  pc.post_id = '26c72242-7e3b-4982-92c5-021b622d7ecd'
   ORDER  BY fc.follower_count DESC
   LIMIT  5
   ) sub
JOIN   user_account ua USING (user_id)
ORDER  BY sub.follower_count DESC;

Context

StackExchange Database Administrators Q#138158, answer score: 4

Revisions (0)

No revisions yet.