How to speed up select distinct?
Problem
I have a simple select distinct on some time series data:
SELECT DISTINCT user_id
FROM events
WHERE project_id = 6
AND time > '2015-01-11 8:00:00'
AND time < '2015-02-10 8:00:00';

And it takes 112 seconds. Here's the query plan:
http://explain.depesz.com/s/NTyA
My application has to perform a lot of distinct operations and counts like this. Is there a faster way to get this kind of data?
Solution
You probably don't want to hear this, but the best option to speed up SELECT DISTINCT is to avoid DISTINCT to begin with. In many cases (not all!) it can be avoided with better database design or better queries.

Sometimes, GROUP BY is faster, because it takes a different code path.
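For example, the GROUP BY form is a drop-in rewrite of the query from the question (same result set; whether it is actually faster depends on the plan Postgres picks):

```sql
-- Equivalent to SELECT DISTINCT user_id, but may be planned differently
SELECT user_id
FROM   events
WHERE  project_id = 6
AND    "time" > '2015-01-11 8:00:00'
AND    "time" < '2015-02-10 8:00:00'
GROUP  BY user_id;
```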
In your particular case, it doesn't seem like you can get rid of DISTINCT (well, see below). But you can support the query with a special index if you have many queries of that kind:

CREATE INDEX foo ON events (project_id, "time", user_id);

In Postgres 11 or later, you can use an actual "covering" index like:

CREATE INDEX foo ON events (project_id, "time") INCLUDE (user_id);

Adding user_id is only useful if you get index-only scans out of this. See:
- https://www.postgresql.org/docs/current/indexes-index-only-scans.html
- https://wiki.postgresql.org/wiki/Index-only_scans

This would remove the expensive Bitmap Heap Scan from your query plan, which consumes 90% of the query time.
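One way to check whether you actually get index-only scans (assuming the index foo from above; the VACUUM keeps the visibility map current, which index-only scans depend on):

```sql
-- Keep the visibility map current so index-only scans are possible
VACUUM (ANALYZE) events;

-- Look for "Index Only Scan using foo" and a low "Heap Fetches" count
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT user_id
FROM   events
WHERE  project_id = 6
AND    "time" > '2015-01-11 8:00:00'
AND    "time" < '2015-02-10 8:00:00';
```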
Your EXPLAIN shows 2,491 distinct users out of half a million qualifying rows. This won't become super-fast, no matter what you do, but it can be substantially faster. With around 200 rows per user, emulating an index skip scan on the above index might pay. The range condition on time complicates matters, and 200 rows per user is still a moderate number. So not sure. See:
- SELECT DISTINCT is slower than expected on my table in PostgreSQL
- Select first row in each GROUP BY group?
- Optimize GROUP BY query to retrieve latest row per user
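A minimal sketch of the skip-scan emulation with a recursive CTE (the "loose index scan" pattern; note this steps through users efficiently only with an index leading on (project_id, user_id, ...) — an additional assumption beyond the index suggested above, and the reason the range condition on "time" complicates matters):

```sql
WITH RECURSIVE cte AS (
   -- smallest qualifying user_id
   SELECT min(user_id) AS user_id
   FROM   events
   WHERE  project_id = 6
   AND    "time" >= '2015-01-11 8:00:00'
   AND    "time" <  '2015-02-10 8:00:00'
   UNION ALL
   -- next qualifying user_id after the last one found
   SELECT (SELECT min(user_id)
           FROM   events
           WHERE  project_id = 6
           AND    user_id > cte.user_id
           AND    "time" >= '2015-01-11 8:00:00'
           AND    "time" <  '2015-02-10 8:00:00')
   FROM   cte
   WHERE  cte.user_id IS NOT NULL
   )
SELECT user_id
FROM   cte
WHERE  user_id IS NOT NULL;
```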
Either way, if time intervals in your queries are always the same, a MATERIALIZED VIEW folding user_id per (project_id, ) would go a long way. No chance there with varying time intervals, though. Maybe you could at least fold users per hour or some other minimum time unit; that might buy enough performance to warrant the considerable overhead. Can be combined with either query style.
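A sketch of the per-hour folding idea (view, index, and column names are made up for illustration; the refresh cost and staleness between refreshes are part of the "considerable overhead" mentioned):

```sql
-- Fold users per project and hour; typically far fewer rows than raw events
CREATE MATERIALIZED VIEW mv_hourly_users AS
SELECT project_id
     , date_trunc('hour', "time") AS hour
     , user_id
FROM   events
GROUP  BY 1, 2, 3;

CREATE INDEX ON mv_hourly_users (project_id, hour, user_id);

-- Distinct users for any whole-hour interval, from the much smaller view
SELECT DISTINCT user_id
FROM   mv_hourly_users
WHERE  project_id = 6
AND    hour >= '2015-01-11 8:00:00'
AND    hour <  '2015-02-10 8:00:00';

-- Re-run as often as staleness allows
REFRESH MATERIALIZED VIEW mv_hourly_users;
```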
Nitpick: Most probably, the predicates on "time" should really be:

AND "time" >= '2015-01-11 8:00:00'
AND "time" < '2015-02-10 8:00:00';

Aside: Don't use time as identifier. It's a reserved word in standard SQL and a basic type in Postgres.

Code Snippets
CREATE INDEX foo ON events (project_id, "time", user_id);

CREATE INDEX foo ON events (project_id, "time") INCLUDE (user_id);

AND "time" >= '2015-01-11 8:00:00'
AND "time" < '2015-02-10 8:00:00';

Context
StackExchange Database Administrators Q#93158, answer score: 37