patternsqlMinor
Is there a configuration setting for how many pages ANALYZE will scan?
Viewed 0 times
analyzescanconfigurationpagessettingwillformanyhowthere
Problem
Is there a setting that controls how many records/pages are scanned by the PostgreSQL
ANALYZE command? It looks like by default it scans 30,000 pages and 30,000 records. The PostgreSQL Server Configuration Documentation lists tons of options, but I don't see anything specific to ANALYZE.Solution
Each column of a table has an
It defaults to
In Statistics Used by the Planner, the doc says:
The amount of information stored in pg_statistic by ANALYZE, in
particular the maximum number of entries in the most_common_vals and
histogram_bounds arrays for each column, can be set on a
column-by-column basis using the ALTER TABLE SET STATISTICS command,
or globally by setting the default_statistics_target configuration
variable
The reason behind the
The
As for the number of pages, since rows don't span across pages, at most
attstattarget property (stored in pg_attribute) that tells how much data should be stored for it from the statistical sample gathered by ANALYZE.It defaults to
default_statistics_target, which itself defaults to 100.In Statistics Used by the Planner, the doc says:
The amount of information stored in pg_statistic by ANALYZE, in
particular the maximum number of entries in the most_common_vals and
histogram_bounds arrays for each column, can be set on a
column-by-column basis using the ALTER TABLE SET STATISTICS command,
or globally by setting the default_statistics_target configuration
variable
The reason behind the
30,000 pages and rows is that the sample size in rows considered by ANALYZE is 300 times the maximum value of attstattarget for the sampled table, which would be the default 100.The
300 comes from a statistical formula mentioned in the source codesrc/backend/commands/analyze.c:/*--------------------
* The following choice of minrows is based on the paper
* "Random sampling for histogram construction: how much is enough?"
* by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
* Proceedings of ACM SIGMOD International Conference on Management
* of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5
* says that for table size n, histogram size k, maximum relative
* error in bin size f, and error probability gamma, the minimum
* random sample size is
* r = 4 * k * ln(2*n/gamma) / f^2
* Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
* r = 305.82 * k
* Note that because of the log function, the dependence on n is
* quite weak; even at n = 10^12, a 300*k sample gives minrows = 300 * attr->attstattarget;As for the number of pages, since rows don't span across pages, at most
N pages are going to be read to get N rows. I believe that ANALYZE intentionally aims at fetching this maximum of pages to get the best sample. It makes sense since rows stored in a same page are more likely to be correlated.Code Snippets
/*--------------------
* The following choice of minrows is based on the paper
* "Random sampling for histogram construction: how much is enough?"
* by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
* Proceedings of ACM SIGMOD International Conference on Management
* of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5
* says that for table size n, histogram size k, maximum relative
* error in bin size f, and error probability gamma, the minimum
* random sample size is
* r = 4 * k * ln(2*n/gamma) / f^2
* Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
* r = 305.82 * k
* Note that because of the log function, the dependence on n is
* quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
* bin size error with probability 0.99. So there's no real need to
* scale for n, which is a good thing because we don't necessarily
* know it at this point.
*--------------------
*/
stats->minrows = 300 * attr->attstattarget;Context
StackExchange Database Administrators Q#66256, answer score: 8
Revisions (0)
No revisions yet.