patternsqlMinor
Inconsistent statistics on jsonb column with btree index
Viewed 0 times
columnwithinconsistentbtreestatisticsindexjsonb
Problem
I noticed that the performance of a query involving a jsonb column varied significantly between VACUUM ANALYZE runs while testing it. I get completely different execution plans seemingly randomly after analyzing the table.
I'm using Postgres 9.6 here. The setup for my tests is the following, I'm inserting a single key "x" into the jsonb column "params" with values between 1 and 6, with 1 being the rarest value and 6 the most common. I also have a regular int column "single_param" that contains the same distribution of values for comparison.:
The query I'm testing is a typical query for paginating results, I'm sorting by id and limiting the output to the first 50 rows.
I get one of the two explain analyze outputs randomnly after running
or
`Limit (cost=8.45..8.46 rows=1 width=33) (actual time=0.032..0.034 rows=10 loops=1)
-> Sort (cost=8.45..8.46 rows=1 width=33) (actual time=0.032..0.032 rows=10 loops=1)
Sort Key: id DESC
Sort Metho
I'm using Postgres 9.6 here. The setup for my tests is the following, I'm inserting a single key "x" into the jsonb column "params" with values between 1 and 6, with 1 being the rarest value and 6 the most common. I also have a regular int column "single_param" that contains the same distribution of values for comparison.:
CREATE TABLE test_data (
id serial,
single_param int,
params jsonb
);
INSERT INTO test_data
SELECT
generate_series(1, 1000000) AS id,
floor(log(random() * 9999999 + 1)) AS single_param,
json_build_object(
'x', floor(log(random() * 9999999 + 1))
) AS params;
CREATE INDEX idx_test_btree ON test_data (cast(test_data.params->>'x' AS int));
CREATE INDEX idx_test_gin ON test_data USING GIN (params);
CREATE INDEX ON test_data(id)
CREATE INDEX ON test_data(single_param)The query I'm testing is a typical query for paginating results, I'm sorting by id and limiting the output to the first 50 rows.
SELECT * FROM test_data where (params->>'x')::int = 1 ORDER BY id DESC LIMIT 50;I get one of the two explain analyze outputs randomnly after running
VACUUM ANALYZE:Limit (cost=0.42..836.59 rows=50 width=33) (actual time=39.679..410.292 rows=10 loops=1)
-> Index Scan Backward using test_data_id_idx on test_data (cost=0.42..44317.43 rows=2650 width=33) (actual time=39.678..410.283 rows=10 loops=1)
Filter: (((params ->> 'x'::text))::integer = 1)
Rows Removed by Filter: 999990"
Planning time: 0.106 ms
Execution time: 410.314 ms
or
`Limit (cost=8.45..8.46 rows=1 width=33) (actual time=0.032..0.034 rows=10 loops=1)
-> Sort (cost=8.45..8.46 rows=1 width=33) (actual time=0.032..0.032 rows=10 loops=1)
Sort Key: id DESC
Sort Metho
Solution
Currently (version 9.6), Postgres does not have any statistics about the internals of document types like
However, there are separate statistics for functional indexes like your
Tip: Although per-column tweaking of
very productive, you might find it worthwhile to do per-column
adjustment of the level of detail of the statistics collected by
highly irregular data distributions might require a finer-grain data
histogram than other columns. See
change the database-wide default using the default_statistics_target
configuration parameter.
Also, by default there is limited information available about the
selectivity of functions. However, if you create an expression index
that uses a function call, useful statistics will be gathered about
the function, which can greatly improve query plans that use the
expression index.
The volume of statistics gathered depends on general setting of
The default setting of
Default names for index columns are not exactly intuitive, but you can look it up with:
Should result in the type name
The best setting for
Internally, this sets the value of
For scalar data types,
"most common values" to collect, and the target number of histogram
bins to create.
Then run
You must
Try your query again ...
Related:
json, jsonb, xml or hstore. (There has been discussion whether and how to change that.) Instead, the Postgres query planner uses constant default frequency estimates (like you observed).However, there are separate statistics for functional indexes like your
idx_test_btree. The manual has this tip for you:Tip: Although per-column tweaking of
ANALYZE frequency might not bevery productive, you might find it worthwhile to do per-column
adjustment of the level of detail of the statistics collected by
ANALYZE. Columns that are heavily used in WHERE clauses and havehighly irregular data distributions might require a finer-grain data
histogram than other columns. See
ALTER TABLE SET STATISTICS, orchange the database-wide default using the default_statistics_target
configuration parameter.
Also, by default there is limited information available about the
selectivity of functions. However, if you create an expression index
that uses a function call, useful statistics will be gathered about
the function, which can greatly improve query plans that use the
expression index.
The volume of statistics gathered depends on general setting of
default_statistics_target, which can be overruled with a per-column setting. The setting for the column automatically covers depending indexes.The default setting of
100 is conservative. For your test with 1M rows, if data distribution is uneven, it may help to increase it substantially. Checking on this once more I found you can actually tweak the statistics target per index column with ALTER INDEX, which is currently not documented. See related discussion on pgsql-docs.ALTER TABLE idx_test_btree ALTER int4 SET STATISTICS 2000; -- max 10000, default 100Default names for index columns are not exactly intuitive, but you can look it up with:
SELECT attname FROM pg_attribute WHERE attrelid = 'idx_test_btree'::regclassShould result in the type name
int4 as index column name for your case.The best setting for
STATISTICS depends on several factors: data distribution, data type, update frequency, characteristics of typical queries, ...Internally, this sets the value of
pg_attribute.attstattarget, and the exact meaning of this is (per documentation):For scalar data types,
attstattarget is both the target number of"most common values" to collect, and the target number of histogram
bins to create.
Then run
ANALYZE if you don't want to wait for autovacuum to kick in:ANALYZE test_data;You must
ANALYZE the table, since you cannot ANALYZE indexes directly. Check with (before and after if you want to verify the effect):SELECT * FROM pg_statistic WHERE starelid = 'idx_test_btree'::regclass;Try your query again ...
Related:
- Check statistics targets in PostgreSQL
- Index that is not used, yet influences query
- PostgreSQL partial index unused when created on a table with existing data
Code Snippets
ALTER TABLE idx_test_btree ALTER int4 SET STATISTICS 2000; -- max 10000, default 100SELECT attname FROM pg_attribute WHERE attrelid = 'idx_test_btree'::regclassANALYZE test_data;SELECT * FROM pg_statistic WHERE starelid = 'idx_test_btree'::regclass;Context
StackExchange Database Administrators Q#167525, answer score: 7
Revisions (0)
No revisions yet.