patternsqlMinor
Speeding up a GROUP BY, HAVING COUNT query
Viewed 0 times
groupqueryhavingspeedingcount
Problem
I'm trying to speed up this query in Postgres 9.4:
I have an index on
Here is the query plan:
I don't understand why it needs to do a Seq Scan.
SELECT "groupingsFrameHash", COUNT(*) AS nb
FROM "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
GROUP BY "groupingsFrameHash"
HAVING COUNT(*) > 1
ORDER BY nb DESC LIMIT 10I have an index on
"groupingsFrameHash". I don't need exact results, a fuzzy approximation is good enough.Here is the query plan:
Limit (cost=17207.03..17207.05 rows=10 width=25) (actual time=740.056..740.058 rows=10 loops=1)
-> Sort (cost=17207.03..17318.19 rows=44463 width=25) (actual time=740.054..740.055 rows=10 loops=1)
Sort Key: (count(*))
Sort Method: top-N heapsort Memory: 25kB
-> GroupAggregate (cost=14725.95..16246.20 rows=44463 width=25) (actual time=615.109..734.740 rows=25977 loops=1)
Group Key: "groupingsFrameHash"
Filter: (count(*) > 1)
Rows Removed by Filter: 24259
-> Sort (cost=14725.95..14967.07 rows=96446 width=25) (actual time=615.093..705.507 rows=96026 loops=1)
Sort Key: "groupingsFrameHash"
Sort Method: external merge Disk: 3280kB
-> Seq Scan on "zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876" (cost=0.00..4431.46 rows=96446 width=25) (actual time=0.007..33.813 rows=96026 loops=1)
Planning time: 0.080 ms
Execution time: 740.877 msI don't understand why it needs to do a Seq Scan.
Solution
You want the 10 most common values for
a fuzzy approximation is good enough
This allows for radically faster solutions. Postgres happens to store just those approximations in the system catalogs, the total count in
Entries are created by
You have been warned.
Also consider the chapter Statistics Used by the Planner in the manual.
If you have set up autovacuum properly and the contents of your table do not change too much, these estimates should be good. In case you run this query immediately after substantial changes to the table (so autovacuum didn't get a chance to kick in), run
There are security considerations. Quoting the manual again:
statistical information about a table's contents might be considered
sensitive. (Example: minimum and maximum values of a salary column
might be quite interesting.)
readable by the current user.
All this considered, you can get quick estimates:
There are a couple of noteworthy features in this query:
-
Provide all identifier strings unescaped and case sensitive.
-
-
Replace with your actual data type.
-
I added
This takes around 1 ms or less - no matter how many rows there are in your table.
Experimental optimizations
If you still need to squeeze out more performance and you have superuser access, you could use
As we are getting closer to the core of Postgres, you need to know what you are doing. We are relying on implementation details that may change across major Postgres versions (though unlikely). Read details about
To squeeze out the last drop, you could even hard-code the attribute number of your column (which changes if you change the position of the column in your table!) and rely on the order of rows returned by
Get your own estimates
With the new
```
SELECT birthday, 10 count() AS estimate
FROM big
TABLESAMPLE SYSTEM (10)
GRO
"groupingsFrameHash" with their respective counts (excluding unique values) - a common task. This specification caught my attention, though:a fuzzy approximation is good enough
This allows for radically faster solutions. Postgres happens to store just those approximations in the system catalogs, the total count in
pg_class and most common values in pg_statistic. The manual about the nature of these numbers:Entries are created by
ANALYZE and subsequently used by the query planner. Note that all the statistical data is inherently approximate, even assuming that it is up-to-date.You have been warned.
Also consider the chapter Statistics Used by the Planner in the manual.
If you have set up autovacuum properly and the contents of your table do not change too much, these estimates should be good. In case you run this query immediately after substantial changes to the table (so autovacuum didn't get a chance to kick in), run
ANALYZE first (or better VACUUM ANALYZE if you can spare the time). You can also fine-tune precision, that's beyond the scope of this question ...There are security considerations. Quoting the manual again:
pg_statistic should not be readable by the public, since evenstatistical information about a table's contents might be considered
sensitive. (Example: minimum and maximum values of a salary column
might be quite interesting.)
pg_stats is a publicly readable view onpg_statistic that only exposes information about those tables that arereadable by the current user.
All this considered, you can get quick estimates:
SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM pg_stats s
CROSS JOIN LATERAL
unnest(s.most_common_vals::text::text[] -- use your actual data type
, s.most_common_freqs) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
CROSS JOIN (
SELECT reltuples FROM pg_class
WHERE oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
) c
WHERE schemaname = 'public'
AND tablename = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND attname = 'groupingsFrameHash' -- case sensitive
ORDER BY v.ord
LIMIT 10;There are a couple of noteworthy features in this query:
-
Provide all identifier strings unescaped and case sensitive.
-
unnest() for multiple arrays requires Postgres 9.4 or later. Details:- Passing arrays to stored procedures in Postgres
-
pg_stats.most_common_vals is a special column with the data pseudo-type anyarray (not available in user tables). It can store arrays of any type. To decompose, cast to text and then to the array type of your column type. Assuming text[] in the example:s.most_common_vals::text::text[]Replace with your actual data type.
-
I added
WITH ORDINALITY to unnest() (Postgres 9.4 or later) to preserve the original order of elements. Since the numbers in the arrays are ordered by descending frequency, we can work with that sort order right away. Consider:- PostgreSQL unnest() with element number
This takes around 1 ms or less - no matter how many rows there are in your table.
Experimental optimizations
If you still need to squeeze out more performance and you have superuser access, you could use
pg_statistic directly:SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM pg_attribute a
JOIN pg_class c ON c.oid = a.attrelid
JOIN pg_statistic s ON s.starelid = a.attrelid
AND s.staattnum = a.attnum
, unnest(s.stavalues1::text::text[]
, s.stanumbers1) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
WHERE a.attrelid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND a.attname = 'groupingsFrameHash'
ORDER BY v.ord
LIMIT 10;As we are getting closer to the core of Postgres, you need to know what you are doing. We are relying on implementation details that may change across major Postgres versions (though unlikely). Read details about
pg_statistics in the manual and comments in the source code.To squeeze out the last drop, you could even hard-code the attribute number of your column (which changes if you change the position of the column in your table!) and rely on the order of rows returned by
unnest(), which normally works:SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM pg_class c
JOIN pg_statistic s ON s.starelid = c.oid
, unnest(s.stavalues1::text::text[], s.stanumbers1) v("groupingsFrameHash", freq)
WHERE c.oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND s.staattnum = int2 '6' -- hard-coded pg_attribute.attnum
LIMIT 10;Get your own estimates
With the new
TABLESAMPLE feature in Postgres 9.5 you can base your aggregates on a (more or less) random sample of the table:```
SELECT birthday, 10 count() AS estimate
FROM big
TABLESAMPLE SYSTEM (10)
GRO
Code Snippets
SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM pg_stats s
CROSS JOIN LATERAL
unnest(s.most_common_vals::text::text[] -- use your actual data type
, s.most_common_freqs) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
CROSS JOIN (
SELECT reltuples FROM pg_class
WHERE oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
) c
WHERE schemaname = 'public'
AND tablename = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND attname = 'groupingsFrameHash' -- case sensitive
ORDER BY v.ord
LIMIT 10;s.most_common_vals::text::text[]SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM pg_attribute a
JOIN pg_class c ON c.oid = a.attrelid
JOIN pg_statistic s ON s.starelid = a.attrelid
AND s.staattnum = a.attnum
, unnest(s.stavalues1::text::text[]
, s.stanumbers1) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
WHERE a.attrelid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND a.attname = 'groupingsFrameHash'
ORDER BY v.ord
LIMIT 10;SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM pg_class c
JOIN pg_statistic s ON s.starelid = c.oid
, unnest(s.stavalues1::text::text[], s.stanumbers1) v("groupingsFrameHash", freq)
WHERE c.oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND s.staattnum = int2 '6' -- hard-coded pg_attribute.attnum
LIMIT 10;SELECT birthday, 10 * count(*) AS estimate
FROM big
TABLESAMPLE SYSTEM (10)
GROUP BY 1
ORDER BY estimate DESC
LIMIT 10;Context
StackExchange Database Administrators Q#131067, answer score: 8
Revisions (0)
No revisions yet.