patternsqlModerate
Slow query on large table with GROUP BY and ORDER BY
Viewed 0 times
ordergroupwithqueryslowlargeandtable
Problem
I have a table with 7.2 million tuples which looks like this:
Now I want to select some values but the query is incredibly slow:
The
The table
table public.methods
column | type | attributes
--------+-----------------------+----------------------------------------------------
id | integer | not null DEFAULT nextval('methodkey'::regclass)
hash | character varying(32) | not null
string | character varying | not null
method | character varying | not null
file | character varying | not null
type | character varying | not null
Indexes:
"methods_pkey" PRIMARY KEY, btree (id)
"methodhash" btree (hash)Now I want to select some values but the query is incredibly slow:
db=# explain
select hash, string, count(method)
from methods
where hash not in
(select hash from nostring)
group by hash, string
order by count(method) desc;
QUERY PLAN
----------------------------------------------------------------------------------------
Sort (cost=160245190041.10..160245190962.07 rows=368391 width=182)
Sort Key: (count(methods.method))
-> GroupAggregate (cost=160245017241.77..160245057764.73 rows=368391 width=182)
-> Sort (cost=160245017241.77..160245026451.53 rows=3683905 width=182)
Sort Key: methods.hash, methods.string
-> Seq Scan on methods (cost=0.00..160243305942.27 rows=3683905 width=182)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..41071.54 rows=970636 width=33)
-> Seq Scan on nostring (cost=0.00..28634.36 rows=970636 width=33)The
hash column is the md5 hash of string and has an index. So I think my problem is that the whole table is sorted by id and not by hash, so it takes a while to sort it first and then group it?The table
nostring contains only a list of hashes I don't want to have. But I need both tables to have Solution
The
Run
Create a multi-column index that matches your query:
Wait? After I said an index wouldn't help? Well, we need it to
Rerun
If your table sees a lot of write operations, the effect will degrade over time. Schedule
If the percentage of
(Your later update shows your columns to be
If you are running PostgreSQL 9.2 or later (as @deszo commented) the presented indexes may be useful without
The above mentioned partial index could be even more useful in that case.
If, on the other hand, there are no
1.) define it
2.) use
If you have to call this query often and the table is read-only, create a
Exotic fine point:
Your table is named
LEFT JOIN in @dezso's answer should be good. An index, however, will hardly be useful (per se), because the query has to read the whole table anyway - the exception being index-only scans in Postgres 9.2+ and favorable conditions, see below.SELECT m.hash, m.string, count(m.method) AS method_ct
FROM methods m
LEFT JOIN nostring n USING (hash)
WHERE n.hash IS NULL
GROUP BY m.hash, m.string
ORDER BY count(m.method) DESC;Run
EXPLAIN ANALYZE on the query. Several times to exclude cashing effects and noise. Compare the best results.Create a multi-column index that matches your query:
CREATE INDEX methods_cluster_idx ON methods (hash, string, method);Wait? After I said an index wouldn't help? Well, we need it to
CLUSTER the table:CLUSTER methods USING methods_cluster_idx;
ANALYZE methods;Rerun
EXPLAIN ANALYZE. Any faster? It should be.CLUSTER is a one-time operation to rewrite the whole table in the order of the used index. It is also effectively a VACUUM FULL. If you want to be sure, you'd run a pre-test with VACUUM FULL alone to see what can be attributed to that.If your table sees a lot of write operations, the effect will degrade over time. Schedule
CLUSTER at off-hours to restore the effect. Fine tuning depends of your exact use-case. The manual about CLUSTER.CLUSTER is a rather crude tool, needs an exclusive lock on the table. If you can't afford that, consider pg_repack which can do the same without exclusive lock. More in this later answer:- Configuring PostgreSQL for read performance
If the percentage of
NULL values in the column method is high (more than ~ 20 percent, depending on actual row sizes), a partial index should help:CREATE INDEX methods_foo_idx ON methods (hash, string)
WHERE method IS NOT NULL;(Your later update shows your columns to be
NOT NULL, so not applicable.)If you are running PostgreSQL 9.2 or later (as @deszo commented) the presented indexes may be useful without
CLUSTER if the planner can utilize index-only scans. Only applicable under favorable conditions: No write operations that would effect the visibility map since the last VACUUM and all columns in the query have to be covered by the index. Basically read-only tables can use this any time, while heavily written tables are limited. More details in the Postgres Wiki.The above mentioned partial index could be even more useful in that case.
If, on the other hand, there are no
NULL values in column method, you should1.) define it
NOT NULL and2.) use
count(*) instead of count(method), that's slightly faster and does the same in the absence of NULL values.If you have to call this query often and the table is read-only, create a
MATERIALIZED VIEW.Exotic fine point:
Your table is named
nostring, yet seems to contain hashes. By excluding hashes instead of strings, there is a chance that you exclude more strings than intended. Extremely unlikely, but possible.Code Snippets
SELECT m.hash, m.string, count(m.method) AS method_ct
FROM methods m
LEFT JOIN nostring n USING (hash)
WHERE n.hash IS NULL
GROUP BY m.hash, m.string
ORDER BY count(m.method) DESC;CREATE INDEX methods_cluster_idx ON methods (hash, string, method);CLUSTER methods USING methods_cluster_idx;
ANALYZE methods;CREATE INDEX methods_foo_idx ON methods (hash, string)
WHERE method IS NOT NULL;Context
StackExchange Database Administrators Q#29489, answer score: 19
Revisions (0)
No revisions yet.