HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to use index in "NOT IN" statement in PostgreSQL?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlstatementhowindexusenot

Problem

Schema:

CREATE TABLE traffic_hit (
    id            SERIAL                   NOT NULL PRIMARY KEY,
    country       VARCHAR(2)               NOT NULL,
    created       TIMESTAMP WITH TIME ZONE NOT NULL,
    unique        BOOLEAN                  NOT NULL,
    user_agent_id INTEGER                  NULL
);
CREATE TABLE utils_useragent (
    id                SERIAL      NOT NULL PRIMARY KEY,
    user_agent_string TEXT        NOT NULL UNIQUE,
    is_robot          BOOLEAN     NOT NULL
);


Initial Query:

SELECT
  traffic_hit.created::DATE AS group_by,
  COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
  COUNT(*) AS non_unique_visits
FROM
  traffic_hit
LEFT JOIN utils_useragent ON traffic_hit.user_agent_id = utils_useragent.id
WHERE
  traffic_hit.created >= '2016-01-01' AND
  traffic_hit.created < '2017-01-01' AND
  traffic_hit.country = 'CZ' AND
  utils_useragent.is_robot = FALSE
GROUP BY 1


Indexes:

CREATE INDEX traffic_hit_user_agent_id ON traffic_hit (user_agent_id);
CREATE INDEX new_idx ON traffic_hit(created, country, user_agent_id, unique);
CREATE INDEX robots ON utils_useragent (id) WHERE is_robot = TRUE


Query plan:

```
HashAggregate (cost=582436.93..603769.28 rows=1706588 width=20) (actual time=2514.233..2515.597 rows=366 loops=1)
Output: ((traffic_hit.created)::date), count() FILTER (WHERE traffic_hit.""unique""), count()"
Group Key: (traffic_hit.created)::date
-> Hash Join (cost=15732.00..545234.80 rows=4960285 width=5) (actual time=83.141..2157.453 rows=2430245 loops=1)
Output: (traffic_hit.created)::date, traffic_hit.""unique"""
Hash Cond: (traffic_hit.user_agent_id = utils_useragent.id)
-> Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit (cost=0.56..448722.21 rows=5007358 width=13) (actual time=0.066..1278.475 rows=4618870 loops=1)
Output: traffic_hit.created, traffic_hit.country, traffic_hit.user_agent_id, traffic_h

Solution

It does use the index. It uses the index to build the hash table, which it then uses in the filter. Using an in-memory unshared hash table is going to be faster than using an on-disk shared index.

But why are you repeatedly aggregating millions of rows of data which have not changed in 6 months, if it is performance sensitive? Aggregate it once and store the result. You could use materialized views to do this, or just do it by hand.

You can do partial aggregations, for example aggregate data grouping by date(created) and whichever other column you need to. Then people can re-aggregate this reduced data set to a specific date range as long they are happy with integral day boundaries, either filtering on those other column, or aggregating over them, or grouping by them. If they want a count, you have to careful to sum up the counts, not count the counts. If you want an average, you have to be careful to weight that average by the counts, rather than doing unweighted average of the averages.

And of course if you change your mind about what is or is not a robot, then you would have to re-make your partial aggregation table.

Anyway, the bottleneck is not the not-in statement, it is just the raw amount of data you want to process.

Parallel query in the upcoming v10 release of PostgreSQL could help this query.

Context

StackExchange Database Administrators Q#182584, answer score: 4

Revisions (0)

No revisions yet.