HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

GIN index not used when adding order clause

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
orderusedginaddingwhenindexnotclause

Problem

I'm trying to speed up a query that executes three ILIKE queries and reduces these with or (returning the overall count and 10 entries)

SELECT , count() OVER() as filtered_count FROM "users"
WHERE (
(f_unaccent("users"."first_name") ILIKE f_unaccent('%foo%') OR
f_unaccent("users"."last_name") ILIKE f_unaccent('%foo%')) OR
f_unaccent("users"."club_or_hometown") ILIKE f_unaccent('%foo%')
) LIMIT 10 OFFSET 0


This works reasonably fast after adding gin-indexes for all queried attributes (here only for first_name):

CREATE INDEX users_first_name_gin
ON users
USING gin
(f_unaccent(first_name::text) COLLATE pg_catalog."default" gin_trgm_ops);


If I however add an additional order clause, e. g. ORDER BY users.first_name ASC, postgresql does not use the gin index, but the normal b-tree index on first_name, then filters the result. This takes significantly longer in my application. How can I adapt the query/indexes to keep using the gin indexes even for ordered queries?

edit: I'm using postgresql 9.4

Explain of unordered query:

```
"Limit (cost=125.98..139.61 rows=10 width=58) (actual time=17.828..17.833 rows=10 loops=1)"
" -> WindowAgg (cost=125.98..2972.72 rows=2088 width=58) (actual time=17.826..17.831 rows=10 loops=1)"
" -> Bitmap Heap Scan on users (cost=125.98..2946.62 rows=2088 width=58) (actual time=0.915..16.816 rows=1755 loops=1)"
" Recheck Cond: ((f_unaccent((first_name)::text) ~~ '%foo%'::text) OR (f_unaccent((last_name)::text) ~~ '%foo%'::text) OR (f_unaccent((club_or_hometown)::text) ~~* '%foo%'::text))"
" Heap Blocks: exact=891"
" -> BitmapOr (cost=125.98..125.98 rows=2088 width=0) (actual time=0.742..0.742 rows=0 loops=1)"
" -> Bitmap Index Scan on users_first_name_gin (cost=0.00..51.80 rows=2074 width=0) (actual time=0.600..0.600 rows=1735 loops=1)"
" Index Cond: (f_unaccent((first_name)::text) ~~* '%foo%'::text)"
"

Solution

Your added comment is on the right track already:

I further found out that the inefficient index is only used if the
search query (foo in my example) appears in pg_stats.most_common_vals
for this column. So I assume this skews the estimated costs into the
wrong direction. Any ideas how to fix this?

If 'foo' is very common, Postgres expects it to be faster to just read from the b-tree index sequentially and just skip rows that do not match. The estimation is also based on cost setting and the expected selectivity of predicates. There are multiple entry points for skewed estimates.

  • row counts in column statistics



  • selectivity estimates for predicates



  • Combined statistics for multiple predicates



  • cost settings



Selectivity of predicates and combinations

Your most important problem seems to be here:

(cost=0.42..84,314.74 rows=2,088 width=58) (actual
time=2,363.900..2,363.904 rows=10 loops=1)

Postgres expects 209 times as many rows as returned. explain.depesz.com can help auditing a query plan: http://explain.depesz.com/s/53E

You may get more accurate estimates by increasing the statistics target for the indexed columns. Postgres not only gathers statistics for table columns, it does the same for indexed expressions (not for plain index columns for which statistics are available already). See:

  • Index that is not used, yet influences query



You can check with:

SELECT * FROM pg_stats WHERE tablename = 'users_first_name_gin';


Basic row counts and settings are in pg_class and pg_attribute:

SELECT *
FROM   pg_attribute
WHERE  attrelid = 'users_last_name_gin'::regclass;

SELECT *
FROM   pg_class
WHERE  oid = 'users_last_name_gin'::regclass;


Indexes are treated as special tables internally. This should make it less surprising that you can do use ALTER TABLE on an index:

ALTER TABLE users_last_name_gin
ALTER COLUMN f_unaccent SET STATISTICS 1000;


Use this to increase the sample size for calculating statistics for the index column. Then run ANALYZE users to update statistics.

You can't provide explicit column names for index entries, names are chosen automatically. You can look it up with the query on pg_attribute above. The column name f_unaccent is derived from the used function name.

Default statistics target is 100, the allowed range is 0 - 10000. For big tables with uneven data distribution, 100 is often not enough to get reasonable estimates. Set it to 1000 (example) for the index expression to get better estimates.
Workaround

Like dezso commented you can get the alternative query plan by encapsulating the original form in a CTE (which acts as optimization barrier in Postgres) - before ORDER BY and LIMIT in the outer SELECT:

WITH cte AS (
   SELECT *, count(*) OVER() AS filtered_count
   FROM   users 
   WHERE (f_unaccent("users"."first_name")       ILIKE f_unaccent('%foo%') OR 
          f_unaccent("users"."last_name")        ILIKE f_unaccent('%foo%') OR 
          f_unaccent("users"."club_or_hometown") ILIKE f_unaccent('%foo%'))
   )
SELECT *
FROM   cte
ORDER  BY first_name
LIMIT  10;


Alternative index

You commented:

I'm always searching all three columns

A single index for all three would be cheaper. GIN indexes can be multicolumn indexes. The manual:

Currently, only the B-tree, GiST, GIN, and BRIN index types support
multiple-key-column indexes.

And:

A multicolumn GIN index can be used with query conditions that involve
any subset of the index's columns. Unlike B-tree or GiST, index search
effectiveness is the same regardless of which index column(s) the
query conditions use.

So:

CREATE INDEX big_unaccent_big_gin_idx ON users USING gin (
     f_unaccent(first_name)       gin_trgm_ops
   , f_unaccent(last_name)        gin_trgm_ops
   , f_unaccent(club_or_hometown) gin_trgm_ops);


Reduces the triple overhead per index entry to just one. Should be faster overall. Or, faster yet, concatenate all three columns into a single string. I am adding a space as separator to avoid false positives. Use any character as separator that is not going to show up in search expressions:

CREATE INDEX big_unaccent_big_gin_idx ON users USING gin (
   f_unaccent(f_concat_ws(' ', first_name, last_name, club_or_hometown)) gin_trgm_ops);


Uses the custom function f_concat_ws(), which must be created first as explained here:

  • PostgreSQL full text search on many columns



If all columns are NOT NULL, you can use plain concatenation instead:

first_name || ' ' || last_name || ' ' || club_or_hometown


Be sure to use the same expression in your query:

WHERE f_unaccent(f_concat_ws(' ', first_name, last_name, club_or_hometown)) ILIKE '%foo%'


Set STATISTICS to 1000 or more like demonstrated above and ANALYZE before you test again. Be sure to run the query multiple times to compare warm cache to warm cache.

Besides the smaller index and faster computation, the main benefit for your case could be that a single predic

Code Snippets

SELECT * FROM pg_stats WHERE tablename = 'users_first_name_gin';
SELECT *
FROM   pg_attribute
WHERE  attrelid = 'users_last_name_gin'::regclass;

SELECT *
FROM   pg_class
WHERE  oid = 'users_last_name_gin'::regclass;
ALTER TABLE users_last_name_gin
ALTER COLUMN f_unaccent SET STATISTICS 1000;
WITH cte AS (
   SELECT *, count(*) OVER() AS filtered_count
   FROM   users 
   WHERE (f_unaccent("users"."first_name")       ILIKE f_unaccent('%foo%') OR 
          f_unaccent("users"."last_name")        ILIKE f_unaccent('%foo%') OR 
          f_unaccent("users"."club_or_hometown") ILIKE f_unaccent('%foo%'))
   )
SELECT *
FROM   cte
ORDER  BY first_name
LIMIT  10;
CREATE INDEX big_unaccent_big_gin_idx ON users USING gin (
     f_unaccent(first_name)       gin_trgm_ops
   , f_unaccent(last_name)        gin_trgm_ops
   , f_unaccent(club_or_hometown) gin_trgm_ops);

Context

StackExchange Database Administrators Q#113132, answer score: 8

Revisions (0)

No revisions yet.