patternsqlMinor
Improving distinct values estimates in Postgres
Viewed 0 times
distinctestimatespostgresvaluesimproving
Problem
Full counts in Postgres can be slow, for reasons that are well-understood and much discussed. So, I've been using estimation techniques instead, where possible. For rows, pg_stats seems fine, for views, extracting an estimate returned by
https://www.cybertec-postgresql.com/en/count-made-fast/
But what about distinct values? Here, I have had a lot less luck. Sometimes the estimates are 100% correct, sometimes they're off by factors of 2 or 20. Truncated tables seem to have badly stale estimates in particular (?).
I just ran this test and have provided some results:
The results:
That's only off by a factor of 2, but I've seem much worse in my data. To the point where I won't use estimates. Is there something else that I can try? Is this something that PG 12 improves?
Follow-up
I hadn't ever experimented
https://www.postgresql.org/docs/current/planner-stats.html
The amount of information stored in
particular the maximum number of entries in the
histogram_bounds arrays for each column, can be set on a
column-by-column basis using the
or globally by setting the
variable. The default limit is presently 100 entries. Raising the
limit might
EXPLAIN works okay.https://www.cybertec-postgresql.com/en/count-made-fast/
But what about distinct values? Here, I have had a lot less luck. Sometimes the estimates are 100% correct, sometimes they're off by factors of 2 or 20. Truncated tables seem to have badly stale estimates in particular (?).
I just ran this test and have provided some results:
analyze assembly_prods; -- Doing an ANLYZE to give pg_stats every help.
select 'count(*) distinct' as method,
count(*) as count
from (select distinct assembly_id
from assembly_prods) d
union all
select 'n_distinct from pg_stats' as method,
n_distinct as count
from pg_stats
where tablename = 'assembly_prods' and
attname = 'assembly_id';The results:
method count
count(*) distinct 28088
n_distinct from pg_stats 13805That's only off by a factor of 2, but I've seem much worse in my data. To the point where I won't use estimates. Is there something else that I can try? Is this something that PG 12 improves?
Follow-up
I hadn't ever experimented
SET STATISTICS before, because there are only so many hours in a day. Inspired by Laurenz' answer, I've take a quick look. Here's a useful comment from the documentation:https://www.postgresql.org/docs/current/planner-stats.html
The amount of information stored in
pg_statistic by ANALYZE, inparticular the maximum number of entries in the
most_common_vals andhistogram_bounds arrays for each column, can be set on a
column-by-column basis using the
ALTER TABLE SET STATISTICS command,or globally by setting the
default_statistics_target configurationvariable. The default limit is presently 100 entries. Raising the
limit might
Solution
First, a remark: your query could be written simpler as
Also, your statistics query is wrong, because
For a simple query like that, the statistics should contain a good estimate.
If the estimates are off, try to
If that improves the results, see that the table is analyzed more often by configuring
It is also possible to set
If
A new
Getting good
As far as I know, PostgreSQL v12 does not bring any improvements in this area.
SELECT count(DISTINCT assembly_id) FROM assembly_prods;Also, your statistics query is wrong, because
n_distict can also be negative. You should query:SELECT CASE WHEN s.n_distinct < 0
THEN - s.n_distinct * t.reltuples
ELSE s.n_distinct
END AS n_distinct
FROM pg_class t
JOIN pg_namespace n ON n.oid = t.relnamespace
JOIN pg_stats s ON t.relname = s.tablename
AND n.nspname = s.schemaname
WHERE s.schemaname = 'public'
AND s.tablename = 'assembly_prods'
AND s.attname = 'assembly_id';For a simple query like that, the statistics should contain a good estimate.
If the estimates are off, try to
ANALYZE the table. That will also fix the results for a newly TRUNCATEd table. TRUNCATE does not cause PostgreSQL to autoanalyze the table (there may be room for improvement here).If that improves the results, see that the table is analyzed more often by configuring
ALTER TABLE assembly_prods SET (autovacuum_analyze_scale_factor = 0.05);It is also possible to set
autovacuum_analyze_scale_factor to 0 and raise autovacuum_analyze_threshold to the daily change rate for the table.If
ANALYZE alone does not improve the estimate, increase the size of the sample:ALTER TABLE assembly_prods ALTER assembly_id SET STATISTICS 1000;A new
ANALYZE should now produce better estimates.Getting good
n_distinct estimates for more complicated queries becomes increasingly more difficult. Sometimes extended statistics will improve the estimate considerably.As far as I know, PostgreSQL v12 does not bring any improvements in this area.
Code Snippets
SELECT count(DISTINCT assembly_id) FROM assembly_prods;SELECT CASE WHEN s.n_distinct < 0
THEN - s.n_distinct * t.reltuples
ELSE s.n_distinct
END AS n_distinct
FROM pg_class t
JOIN pg_namespace n ON n.oid = t.relnamespace
JOIN pg_stats s ON t.relname = s.tablename
AND n.nspname = s.schemaname
WHERE s.schemaname = 'public'
AND s.tablename = 'assembly_prods'
AND s.attname = 'assembly_id';ALTER TABLE assembly_prods SET (autovacuum_analyze_scale_factor = 0.05);ALTER TABLE assembly_prods ALTER assembly_id SET STATISTICS 1000;Context
StackExchange Database Administrators Q#249969, answer score: 6
Revisions (0)
No revisions yet.