HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Is there a configuration setting for how many pages ANALYZE will scan?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
analyzescanconfigurationpagessettingwillformanyhowthere

Problem

Is there a setting that controls how many records/pages are scanned by the PostgreSQL ANALYZE command? It looks like by default it scans 30,000 pages and 30,000 records. The PostgreSQL Server Configuration Documentation lists tons of options, but I don't see anything specific to ANALYZE.

Solution

Each column of a table has an attstattarget property (stored in pg_attribute) that tells how much data should be stored for it from the statistical sample gathered by ANALYZE.

It defaults to default_statistics_target, which itself defaults to 100.

In Statistics Used by the Planner, the doc says:


The amount of information stored in pg_statistic by ANALYZE, in
particular the maximum number of entries in the most_common_vals and
histogram_bounds arrays for each column, can be set on a
column-by-column basis using the ALTER TABLE SET STATISTICS command,
or globally by setting the default_statistics_target configuration
variable

The reason behind the 30,000 pages and rows is that the sample size in rows considered by ANALYZE is 300 times the maximum value of attstattarget for the sampled table, which would be the default 100.

The 300 comes from a statistical formula mentioned in the source code

src/backend/commands/analyze.c:

/*--------------------
     * The following choice of minrows is based on the paper
     * "Random sampling for histogram construction: how much is enough?"
     * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
     * Proceedings of ACM SIGMOD International Conference on Management
     * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
     * says that for table size n, histogram size k, maximum relative
     * error in bin size f, and error probability gamma, the minimum
     * random sample size is
     *      r = 4 * k * ln(2*n/gamma) / f^2
     * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
     *      r = 305.82 * k
     * Note that because of the log function, the dependence on n is
     * quite weak; even at n = 10^12, a 300*k sample gives minrows = 300 * attr->attstattarget;


As for the number of pages, since rows don't span across pages, at most N pages are going to be read to get N rows. I believe that ANALYZE intentionally aims at fetching this maximum of pages to get the best sample. It makes sense since rows stored in a same page are more likely to be correlated.

Code Snippets

/*--------------------
     * The following choice of minrows is based on the paper
     * "Random sampling for histogram construction: how much is enough?"
     * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
     * Proceedings of ACM SIGMOD International Conference on Management
     * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
     * says that for table size n, histogram size k, maximum relative
     * error in bin size f, and error probability gamma, the minimum
     * random sample size is
     *      r = 4 * k * ln(2*n/gamma) / f^2
     * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
     *      r = 305.82 * k
     * Note that because of the log function, the dependence on n is
     * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
     * bin size error with probability 0.99.  So there's no real need to
     * scale for n, which is a good thing because we don't necessarily
     * know it at this point.
     *--------------------
     */
    stats->minrows = 300 * attr->attstattarget;

Context

StackExchange Database Administrators Q#66256, answer score: 8

Revisions (0)

No revisions yet.