patternsqlMinor
Fastest query for distinct IDs in a many-to-many relationship
Viewed 0 times
distinctqueryidsforfastestmanyrelationship
Problem
I have this table in PostgreSQL 9.4:
The table consists of
Parameters:
I also have an index on the
QUESTION: What would be the most optimal plan for querying all distinct
It's possible to create more indexes on the table. Currently, the plan for the query is:
Is such query plan really optimal in such circumstances? I mean, I'm not sure about correctness of using
CREATE TABLE user_operations(
id SERIAL PRIMARY KEY,
operation_id integer,
user_id integer )The table consists of
~1000-2000 different operations each of which corresponds to some subset (consisting of approximately 80000-120000 elements each) of the set S of all users:S = {1, 2, 3, ... , 122655}Parameters:
work_mem = 128MB
table_size = 880MBI also have an index on the
operation_id.QUESTION: What would be the most optimal plan for querying all distinct
user_id for a significant part of the operation_id set (20%-60%) like:SELECT DISTINCT user_id FROM user_operation WHERE operation_id < 500It's possible to create more indexes on the table. Currently, the plan for the query is:
HashAggregate (cost=196173.56..196347.14 rows=17358 width=4) (actual time=1227.408..1359.947 rows=598336 loops=1)
-> Bitmap Heap Scan on user_operation (cost=46392.24..189978.17 rows=2478155 width=4) (actual time=233.163..611.182 rows=2518122 loops=1)
Recheck Cond: (operation_id Bitmap Index Scan on idx (cost=0.00..45772.70 rows=2478155 width=0) (actual time=230.432..230.432 rows=2518122 loops=1)
Index Cond: (operation_id < 500)Is such query plan really optimal in such circumstances? I mean, I'm not sure about correctness of using
Bitmap Heap Scan. I'll appreciate any references to relevant articles.Solution
What would be the most optimal plan for querying all distinct
for a significant part of the
Use a recursive query:
In combination with an index on
Since there are only
Details:
If the predicate
user_idfor a significant part of the
operation_id set (20%-60%).Use a recursive query:
WITH RECURSIVE cte AS (
( -- parentheses are required
SELECT user_id
FROM user_operations
WHERE operation_id cte.user_id -- lateral reference
ORDER BY user_id
LIMIT 1
) u
)
TABLE cte;In combination with an index on
(user_id, operation_id) - columns in that order. I expect index scans that filter on the second column. Reasonably accurate table statistics are important so Postgres knows it will only have to skip a few rows in the index to find the next user_id. Generally, one might want to increase the statistics target for operation_id in particular:ALTER TABLE user_operations ALTER operation_id SET STATISTICS 1000;Since there are only
~1000-2000 different operations, this may not even be necessary, but it's a small price to pay.Details:
- Optimizing queries on a range of timestamps (two columns)
If the predicate
operation_id
If you have a separate users` table and a large part of all users can be found in your sample, even faster query styles are possible. Details in the linked answer.Code Snippets
WITH RECURSIVE cte AS (
( -- parentheses are required
SELECT user_id
FROM user_operations
WHERE operation_id < 500
ORDER BY user_id
LIMIT 1
)
UNION ALL
SELECT u.user_id
FROM cte, LATERAL (
SELECT user_id
FROM user_operations
WHERE operation_id < 500
AND user_id > cte.user_id -- lateral reference
ORDER BY user_id
LIMIT 1
) u
)
TABLE cte;ALTER TABLE user_operations ALTER operation_id SET STATISTICS 1000;CREATE INDEX foo ON user_operations (user_id) WHERE operation_id < 500;Context
StackExchange Database Administrators Q#118688, answer score: 4
Revisions (0)
No revisions yet.