patternsqlMinor

Speeding up a GROUP BY, HAVING COUNT query

Submitted by: @import:stackexchange-dba·Mar 10, 2026·

Viewed 0 times

groupqueryhavingspeedingcount

Problem

I'm trying to speed up this query in Postgres 9.4:

SELECT "groupingsFrameHash", COUNT(*) AS nb
FROM "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
GROUP BY "groupingsFrameHash"
HAVING COUNT(*) > 1
ORDER BY nb DESC LIMIT 10

I have an index on "groupingsFrameHash". I don't need exact results, a fuzzy approximation is good enough.

Here is the query plan:

Limit  (cost=17207.03..17207.05 rows=10 width=25) (actual time=740.056..740.058 rows=10 loops=1)
  ->  Sort  (cost=17207.03..17318.19 rows=44463 width=25) (actual time=740.054..740.055 rows=10 loops=1)
        Sort Key: (count(*))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  GroupAggregate  (cost=14725.95..16246.20 rows=44463 width=25) (actual time=615.109..734.740 rows=25977 loops=1)
              Group Key: "groupingsFrameHash"
              Filter: (count(*) > 1)
              Rows Removed by Filter: 24259
              ->  Sort  (cost=14725.95..14967.07 rows=96446 width=25) (actual time=615.093..705.507 rows=96026 loops=1)
                    Sort Key: "groupingsFrameHash"
                    Sort Method: external merge  Disk: 3280kB
                    ->  Seq Scan on "zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"  (cost=0.00..4431.46 rows=96446 width=25) (actual time=0.007..33.813 rows=96026 loops=1)
Planning time: 0.080 ms
Execution time: 740.877 ms

I don't understand why it needs to do a Seq Scan.

Solution

You want the 10 most common values for "groupingsFrameHash" with their respective counts (excluding unique values) - a common task. This specification caught my attention, though:

a fuzzy approximation is good enough

This allows for radically faster solutions. Postgres happens to store just those approximations in the system catalogs, the total count in pg_class and most common values in pg_statistic. The manual about the nature of these numbers:

Entries are created by ANALYZE and subsequently used by the query planner. Note that all the statistical data is inherently approximate, even assuming that it is up-to-date.

You have been warned.

Also consider the chapter Statistics Used by the Planner in the manual.

If you have set up autovacuum properly and the contents of your table do not change too much, these estimates should be good. In case you run this query immediately after substantial changes to the table (so autovacuum didn't get a chance to kick in), run ANALYZE first (or better VACUUM ANALYZE if you can spare the time). You can also fine-tune precision, that's beyond the scope of this question ...

There are security considerations. Quoting the manual again:

pg_statistic should not be readable by the public, since even
statistical information about a table's contents might be considered
sensitive. (Example: minimum and maximum values of a salary column
might be quite interesting.) pg_stats is a publicly readable view on
pg_statistic that only exposes information about those tables that are
readable by the current user.

All this considered, you can get quick estimates:

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_stats s
CROSS  JOIN LATERAL
   unnest(s.most_common_vals::text::text[]  -- use your actual data type
        , s.most_common_freqs) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
CROSS  JOIN (
   SELECT reltuples FROM pg_class
   WHERE oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
   ) c
WHERE  schemaname = 'public'
AND    tablename  = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    attname    = 'groupingsFrameHash'  -- case sensitive
ORDER  BY v.ord
LIMIT  10;

There are a couple of noteworthy features in this query:

-
Provide all identifier strings unescaped and case sensitive.

-
unnest() for multiple arrays requires Postgres 9.4 or later. Details:

Passing arrays to stored procedures in Postgres

-
pg_stats.most_common_vals is a special column with the data pseudo-type anyarray (not available in user tables). It can store arrays of any type. To decompose, cast to text and then to the array type of your column type. Assuming text[] in the example:

s.most_common_vals::text::text[]

Replace with your actual data type.

-
I added WITH ORDINALITY to unnest() (Postgres 9.4 or later) to preserve the original order of elements. Since the numbers in the arrays are ordered by descending frequency, we can work with that sort order right away. Consider:

PostgreSQL unnest() with element number

This takes around 1 ms or less - no matter how many rows there are in your table.

Experimental optimizations

If you still need to squeeze out more performance and you have superuser access, you could use pg_statistic directly:

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_attribute a 
JOIN   pg_class     c ON  c.oid = a.attrelid
JOIN   pg_statistic s ON  s.starelid = a.attrelid
                      AND s.staattnum = a.attnum
     , unnest(s.stavalues1::text::text[]
            , s.stanumbers1) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
WHERE  a.attrelid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    a.attname  = 'groupingsFrameHash'
ORDER  BY v.ord
LIMIT  10;

As we are getting closer to the core of Postgres, you need to know what you are doing. We are relying on implementation details that may change across major Postgres versions (though unlikely). Read details about pg_statistics in the manual and comments in the source code.

To squeeze out the last drop, you could even hard-code the attribute number of your column (which changes if you change the position of the column in your table!) and rely on the order of rows returned by unnest(), which normally works:

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_class     c
JOIN   pg_statistic s ON s.starelid = c.oid
     , unnest(s.stavalues1::text::text[], s.stanumbers1) v("groupingsFrameHash", freq)    
WHERE  c.oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    s.staattnum = int2 '6'  -- hard-coded pg_attribute.attnum
LIMIT  10;

Get your own estimates

With the new TABLESAMPLE feature in Postgres 9.5 you can base your aggregates on a (more or less) random sample of the table:

```
SELECT birthday, 10 count() AS estimate
FROM big
TABLESAMPLE SYSTEM (10)
GRO

Code Snippets

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_stats s
CROSS  JOIN LATERAL
   unnest(s.most_common_vals::text::text[]  -- use your actual data type
        , s.most_common_freqs) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
CROSS  JOIN (
   SELECT reltuples FROM pg_class
   WHERE oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
   ) c
WHERE  schemaname = 'public'
AND    tablename  = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    attname    = 'groupingsFrameHash'  -- case sensitive
ORDER  BY v.ord
LIMIT  10;

s.most_common_vals::text::text[]

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_attribute a 
JOIN   pg_class     c ON  c.oid = a.attrelid
JOIN   pg_statistic s ON  s.starelid = a.attrelid
                      AND s.staattnum = a.attnum
     , unnest(s.stavalues1::text::text[]
            , s.stanumbers1) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
WHERE  a.attrelid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    a.attname  = 'groupingsFrameHash'
ORDER  BY v.ord
LIMIT  10;

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_class     c
JOIN   pg_statistic s ON s.starelid = c.oid
     , unnest(s.stavalues1::text::text[], s.stanumbers1) v("groupingsFrameHash", freq)    
WHERE  c.oid = regclass 'public.zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    s.staattnum = int2 '6'  -- hard-coded pg_attribute.attnum
LIMIT  10;

SELECT birthday, 10 * count(*) AS estimate
FROM   big
TABLESAMPLE SYSTEM (10)
GROUP  BY 1
ORDER  BY estimate DESC
LIMIT  10;

Context

StackExchange Database Administrators Q#131067, answer score: 8

Revisions (0)

No revisions yet.