patternsqlModerate
Have Postgresql query planner use nested loop w/ indices over hash join
Viewed 0 times
postgresqlindicesplannerqueryloophashjoinnesteduseover
Problem
I'm having a problem with some StackOverflow-schema related data loaded into PostgreSQL 9.3.4. I have a query that is taking about 10x longer than it should, due to the fact that it is choosing to use a hash join instead of a nested loop w/ indices. For example, if I select 500 users in the query, a hash join is used instead of using the id and type indices on the post_tokenized table:
```
explain
select creation_epoch, user_screen_name, chunk from post_tokenized as tokenized_tbl
join posts as posts_tbl
on posts_tbl.id = tokenized_tbl.id
where type = 'tag'
and user_screen_name is not null
and owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500)
and tokenized_tbl.id in (select id from posts where owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500))
Hash Join (cost=29570.13..751852.55 rows=119954 width=21)
Hash Cond: (tokenized_tbl.id = posts_tbl.id)
-> Index Scan using type_index_post_tokenized on post_tokenized tokenized_tbl (cost=0.44..646219.29 rows=20281711 width=8)
Index Cond: (type = 'tag'::text)
-> Hash (cost=29561.73..29561.73 rows=637 width=25)
-> Hash Join (cost=15576.75..29561.73 rows=637 width=25)
Hash Cond: (posts_tbl.id = posts.id)
-> Nested Loop (cost=48.20..12824.71 rows=106853 width=21)
-> HashAggregate (cost=47.76..52.76 rows=500 width=4)
-> Limit (cost=0.43..41.51 rows=500 width=8)
-> Index Scan using reputation_index_users on users (cost=0.43..211.57 rows=2570 width=8)
Index Cond: (reputation > 100000)
-> Index Scan using owner_user_id_index_posts on posts posts_tbl (cost=0.44..23.40 rows=214 width=25)
```
explain
select creation_epoch, user_screen_name, chunk from post_tokenized as tokenized_tbl
join posts as posts_tbl
on posts_tbl.id = tokenized_tbl.id
where type = 'tag'
and user_screen_name is not null
and owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500)
and tokenized_tbl.id in (select id from posts where owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500))
Hash Join (cost=29570.13..751852.55 rows=119954 width=21)
Hash Cond: (tokenized_tbl.id = posts_tbl.id)
-> Index Scan using type_index_post_tokenized on post_tokenized tokenized_tbl (cost=0.44..646219.29 rows=20281711 width=8)
Index Cond: (type = 'tag'::text)
-> Hash (cost=29561.73..29561.73 rows=637 width=25)
-> Hash Join (cost=15576.75..29561.73 rows=637 width=25)
Hash Cond: (posts_tbl.id = posts.id)
-> Nested Loop (cost=48.20..12824.71 rows=106853 width=21)
-> HashAggregate (cost=47.76..52.76 rows=500 width=4)
-> Limit (cost=0.43..41.51 rows=500 width=8)
-> Index Scan using reputation_index_users on users (cost=0.43..211.57 rows=2570 width=8)
Index Cond: (reputation > 100000)
-> Index Scan using owner_user_id_index_posts on posts posts_tbl (cost=0.44..23.40 rows=214 width=25)
Solution
This closely related answer on SO has an answer to your primary question:
You could use that in similar fashion, to disable hash joins for the current transaction:
But that's not my advice. Read the answer over there.
And this one about statistics and cost settings:
More importantly, untangle your query first:
Should be considerably faster and also make it easier for the query planner to choose the best plan (given sane cost settings and table statistics).
- Setting enable_seqscan = off in a single SELECT query
You could use that in similar fashion, to disable hash joins for the current transaction:
SET LOCAL enable_hashjoin=off;But that's not my advice. Read the answer over there.
And this one about statistics and cost settings:
- Keep PostgreSQL from sometimes choosing a bad query plan
More importantly, untangle your query first:
SELECT creation_epoch, user_screen_name, chunk
FROM (
SELECT id AS owner_user_id
FROM users
WHERE reputation > 100000
ORDER BY reputation
LIMIT 500
) u
JOIN posts p USING (owner_user_id)
JOIN post_tokenized t USING (id)
WHERE type = 'tag'
AND user_screen_name IS NOT NULL;Should be considerably faster and also make it easier for the query planner to choose the best plan (given sane cost settings and table statistics).
Code Snippets
SET LOCAL enable_hashjoin=off;SELECT creation_epoch, user_screen_name, chunk
FROM (
SELECT id AS owner_user_id
FROM users
WHERE reputation > 100000
ORDER BY reputation
LIMIT 500
) u
JOIN posts p USING (owner_user_id)
JOIN post_tokenized t USING (id)
WHERE type = 'tag'
AND user_screen_name IS NOT NULL;Context
StackExchange Database Administrators Q#66678, answer score: 14
Revisions (0)
No revisions yet.