HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMajor

How to speed up select distinct?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
selectdistinctspeedhow

Problem

I have a simple select distinct on some time series data:

SELECT DISTINCT user_id
FROM events
WHERE project_id = 6
AND time > '2015-01-11 8:00:00'
AND time < '2015-02-10 8:00:00';


And it takes 112 seconds. Here's the query plan:

http://explain.depesz.com/s/NTyA

My application has to preform a lot of distinct operations and counts like this. Is there a faster way to get this kind of data?

Solution

You probably don't want to hear this, but the best option to speed up SELECT DISTINCT is to avoid DISTINCT to begin with. In many cases (not all!) it can be avoided with better database-design or better queries.

Sometimes, GROUP BY is faster, because it takes a different code path.

In your particular case, it doesn't seem like you can get rid of DISTINCT (well, see below). But you can support the query with a special index if you have many queries of that kind:

CREATE INDEX foo ON events (project_id, "time", user_id);


In Postgres 11 or later, you can use an actual "covering" index like:

CREATE INDEX foo ON events (project_id, "time") INCLUDE (user_id);


Adding user_id is only useful if you get index-only scans out of this. See:

  • https://www.postgresql.org/docs/current/indexes-index-only-scans.html



  • https://wiki.postgresql.org/wiki/Index-only_scans



Would remove the expensive Bitmap Heap Scan from your query plan, which consumes 90% of the query time.

Your EXPLAIN shows 2,491 distinct users out of half a million qualifying rows. This won't become super-fast, no matter what you do, but it can be substantially faster. With around 200 rows per user, emulating an index skip scan on above index might pay. The range condition on time complicates matters, and 200 rows per user is still a moderate number. So not sure. See:

  • SELECT DISTINCT is slower than expected on my table in PostgreSQL



  • Select first row in each GROUP BY group?



  • Optimize GROUP BY query to retrieve latest row per user



Either way, if time intervals in your queries are always the same, a MATERIALIZED VIEW folding user_id per (project_id, ) would go a long way. No chance there with varying time intervals, though. Maybe you could at least fold users per hour or some other minimum time unit, and that would buy enough performance to warrant the considerable overhead. Can be combined with either query style.

Nitpick:

Most probably, the predicates on "time" should really be:

AND "time" >= '2015-01-11 8:00:00'
AND "time" <  '2015-02-10 8:00:00';


Aside:

Don't use time as identifier. It's a reserved word in standard SQL and a basic type in Postgres.

Code Snippets

CREATE INDEX foo ON events (project_id, "time", user_id);
CREATE INDEX foo ON events (project_id, "time") INCLUDE (user_id);
AND "time" >= '2015-01-11 8:00:00'
AND "time" <  '2015-02-10 8:00:00';

Context

StackExchange Database Administrators Q#93158, answer score: 37

Revisions (0)

No revisions yet.