HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Improving distinct values estimates in Postgres

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
distinctestimatespostgresvaluesimproving

Problem

Full counts in Postgres can be slow, for reasons that are well-understood and much discussed. So, I've been using estimation techniques instead, where possible. For rows, pg_stats seems fine, for views, extracting an estimate returned by EXPLAIN works okay.

https://www.cybertec-postgresql.com/en/count-made-fast/

But what about distinct values? Here, I have had a lot less luck. Sometimes the estimates are 100% correct, sometimes they're off by factors of 2 or 20. Truncated tables seem to have badly stale estimates in particular (?).

I just ran this test and have provided some results:

analyze assembly_prods; -- Doing an ANLYZE to give pg_stats every help.

select 'count(*) distinct' as method,
        count(*) as count
from (select distinct assembly_id 
      from assembly_prods) d 
union all
select 'n_distinct from pg_stats' as method,
        n_distinct as count
from pg_stats 
where tablename  = 'assembly_prods' and
      attname    = 'assembly_id';


The results:

method                      count
count(*) distinct           28088
n_distinct from pg_stats    13805


That's only off by a factor of 2, but I've seem much worse in my data. To the point where I won't use estimates. Is there something else that I can try? Is this something that PG 12 improves?

Follow-up

I hadn't ever experimented SET STATISTICS before, because there are only so many hours in a day. Inspired by Laurenz' answer, I've take a quick look. Here's a useful comment from the documentation:

https://www.postgresql.org/docs/current/planner-stats.html


The amount of information stored in pg_statistic by ANALYZE, in
particular the maximum number of entries in the most_common_vals and
histogram_bounds arrays for each column, can be set on a
column-by-column basis using the ALTER TABLE SET STATISTICS command,
or globally by setting the default_statistics_target configuration
variable. The default limit is presently 100 entries. Raising the
limit might

Solution

First, a remark: your query could be written simpler as

SELECT count(DISTINCT assembly_id) FROM assembly_prods;


Also, your statistics query is wrong, because n_distict can also be negative. You should query:

SELECT CASE WHEN s.n_distinct < 0
            THEN - s.n_distinct * t.reltuples
            ELSE s.n_distinct
       END AS n_distinct
FROM pg_class t
   JOIN pg_namespace n ON n.oid = t.relnamespace
   JOIN pg_stats s ON t.relname = s.tablename
                      AND n.nspname = s.schemaname
WHERE s.schemaname = 'public'
  AND s.tablename = 'assembly_prods'
  AND s.attname = 'assembly_id';


For a simple query like that, the statistics should contain a good estimate.

If the estimates are off, try to ANALYZE the table. That will also fix the results for a newly TRUNCATEd table. TRUNCATE does not cause PostgreSQL to autoanalyze the table (there may be room for improvement here).

If that improves the results, see that the table is analyzed more often by configuring

ALTER TABLE assembly_prods SET (autovacuum_analyze_scale_factor = 0.05);


It is also possible to set autovacuum_analyze_scale_factor to 0 and raise autovacuum_analyze_threshold to the daily change rate for the table.

If ANALYZE alone does not improve the estimate, increase the size of the sample:

ALTER TABLE assembly_prods ALTER assembly_id SET STATISTICS 1000;


A new ANALYZE should now produce better estimates.

Getting good n_distinct estimates for more complicated queries becomes increasingly more difficult. Sometimes extended statistics will improve the estimate considerably.

As far as I know, PostgreSQL v12 does not bring any improvements in this area.

Code Snippets

SELECT count(DISTINCT assembly_id) FROM assembly_prods;
SELECT CASE WHEN s.n_distinct < 0
            THEN - s.n_distinct * t.reltuples
            ELSE s.n_distinct
       END AS n_distinct
FROM pg_class t
   JOIN pg_namespace n ON n.oid = t.relnamespace
   JOIN pg_stats s ON t.relname = s.tablename
                      AND n.nspname = s.schemaname
WHERE s.schemaname = 'public'
  AND s.tablename = 'assembly_prods'
  AND s.attname = 'assembly_id';
ALTER TABLE assembly_prods SET (autovacuum_analyze_scale_factor = 0.05);
ALTER TABLE assembly_prods ALTER assembly_id SET STATISTICS 1000;

Context

StackExchange Database Administrators Q#249969, answer score: 6

Revisions (0)

No revisions yet.