snippetsqlMinor
How to use index in "NOT IN" statement in PostgreSQL?
Viewed 0 times
postgresqlstatementhowindexusenot
Problem
Schema:
Initial Query:
Indexes:
Query plan:
```
HashAggregate (cost=582436.93..603769.28 rows=1706588 width=20) (actual time=2514.233..2515.597 rows=366 loops=1)
Output: ((traffic_hit.created)::date), count() FILTER (WHERE traffic_hit.""unique""), count()"
Group Key: (traffic_hit.created)::date
-> Hash Join (cost=15732.00..545234.80 rows=4960285 width=5) (actual time=83.141..2157.453 rows=2430245 loops=1)
Output: (traffic_hit.created)::date, traffic_hit.""unique"""
Hash Cond: (traffic_hit.user_agent_id = utils_useragent.id)
-> Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit (cost=0.56..448722.21 rows=5007358 width=13) (actual time=0.066..1278.475 rows=4618870 loops=1)
Output: traffic_hit.created, traffic_hit.country, traffic_hit.user_agent_id, traffic_h
CREATE TABLE traffic_hit (
id SERIAL NOT NULL PRIMARY KEY,
country VARCHAR(2) NOT NULL,
created TIMESTAMP WITH TIME ZONE NOT NULL,
unique BOOLEAN NOT NULL,
user_agent_id INTEGER NULL
);
CREATE TABLE utils_useragent (
id SERIAL NOT NULL PRIMARY KEY,
user_agent_string TEXT NOT NULL UNIQUE,
is_robot BOOLEAN NOT NULL
);Initial Query:
SELECT
traffic_hit.created::DATE AS group_by,
COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
COUNT(*) AS non_unique_visits
FROM
traffic_hit
LEFT JOIN utils_useragent ON traffic_hit.user_agent_id = utils_useragent.id
WHERE
traffic_hit.created >= '2016-01-01' AND
traffic_hit.created < '2017-01-01' AND
traffic_hit.country = 'CZ' AND
utils_useragent.is_robot = FALSE
GROUP BY 1Indexes:
CREATE INDEX traffic_hit_user_agent_id ON traffic_hit (user_agent_id);
CREATE INDEX new_idx ON traffic_hit(created, country, user_agent_id, unique);
CREATE INDEX robots ON utils_useragent (id) WHERE is_robot = TRUEQuery plan:
```
HashAggregate (cost=582436.93..603769.28 rows=1706588 width=20) (actual time=2514.233..2515.597 rows=366 loops=1)
Output: ((traffic_hit.created)::date), count() FILTER (WHERE traffic_hit.""unique""), count()"
Group Key: (traffic_hit.created)::date
-> Hash Join (cost=15732.00..545234.80 rows=4960285 width=5) (actual time=83.141..2157.453 rows=2430245 loops=1)
Output: (traffic_hit.created)::date, traffic_hit.""unique"""
Hash Cond: (traffic_hit.user_agent_id = utils_useragent.id)
-> Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit (cost=0.56..448722.21 rows=5007358 width=13) (actual time=0.066..1278.475 rows=4618870 loops=1)
Output: traffic_hit.created, traffic_hit.country, traffic_hit.user_agent_id, traffic_h
Solution
It does use the index. It uses the index to build the hash table, which it then uses in the filter. Using an in-memory unshared hash table is going to be faster than using an on-disk shared index.
But why are you repeatedly aggregating millions of rows of data which have not changed in 6 months, if it is performance sensitive? Aggregate it once and store the result. You could use materialized views to do this, or just do it by hand.
You can do partial aggregations, for example aggregate data grouping by
And of course if you change your mind about what is or is not a robot, then you would have to re-make your partial aggregation table.
Anyway, the bottleneck is not the not-in statement, it is just the raw amount of data you want to process.
Parallel query in the upcoming v10 release of PostgreSQL could help this query.
But why are you repeatedly aggregating millions of rows of data which have not changed in 6 months, if it is performance sensitive? Aggregate it once and store the result. You could use materialized views to do this, or just do it by hand.
You can do partial aggregations, for example aggregate data grouping by
date(created) and whichever other column you need to. Then people can re-aggregate this reduced data set to a specific date range as long they are happy with integral day boundaries, either filtering on those other column, or aggregating over them, or grouping by them. If they want a count, you have to careful to sum up the counts, not count the counts. If you want an average, you have to be careful to weight that average by the counts, rather than doing unweighted average of the averages.And of course if you change your mind about what is or is not a robot, then you would have to re-make your partial aggregation table.
Anyway, the bottleneck is not the not-in statement, it is just the raw amount of data you want to process.
Parallel query in the upcoming v10 release of PostgreSQL could help this query.
Context
StackExchange Database Administrators Q#182584, answer score: 4
Revisions (0)
No revisions yet.