patternsqlMinor
Slow fulltext search due to wildly inaccurate row estimates
Viewed 0 times
estimatesduefulltextsearchinaccurateslowrowwildly
Problem
Fulltext queries against this database (storing RT (Request Tracker) tickets) seem to be taking a very long time to execute. The attachments table (containing the fulltext data) is about 15GB.
The database schema is as follows, it's about 2 million rows:
rt4=# \d+ attachments
Table "public.attachments"
Column | Type | Modifiers | Storage | Description
-----------------+-----------------------------+----------------------------------------------------------+----------+-------------
id | integer | not null default nextval('attachments_id_seq'::regclass) | plain |
transactionid | integer | not null | plain |
parent | integer | not null default 0 | plain |
messageid | character varying(160) | | extended |
subject | character varying(255) | | extended |
filename | character varying(255) | | extended |
contenttype | character varying(80) | | extended |
contentencoding | character varying(80) | | extended |
content | text | | extended |
headers | text | | extended |
creator | integer | not null default 0 | plain |
created | timestamp without time zone |
The database schema is as follows, it's about 2 million rows:
rt4=# \d+ attachments
Table "public.attachments"
Column | Type | Modifiers | Storage | Description
-----------------+-----------------------------+----------------------------------------------------------+----------+-------------
id | integer | not null default nextval('attachments_id_seq'::regclass) | plain |
transactionid | integer | not null | plain |
parent | integer | not null default 0 | plain |
messageid | character varying(160) | | extended |
subject | character varying(255) | | extended |
filename | character varying(255) | | extended |
contenttype | character varying(80) | | extended |
contentencoding | character varying(80) | | extended |
content | text | | extended |
headers | text | | extended |
creator | integer | not null default 0 | plain |
created | timestamp without time zone |
Solution
This can be improved in a thousand and one ways, then it should be a matter of milliseconds.
Better Queries
This is just your query reformatted with aliases and some noise removed to clear the fog:
Most of the problem with your query lies in the first two tables
Assuming
Instead of multiplying rows with two 1:n joins, only to collapse multiple matches in the end with
The subquery will generally only be executed long enough to determine
whether at least one row is returned, not all the way to completion.
Effectiveness depends on how many transactions per ticket and attachments per transaction there are.
Determine order of joins with
If you know that your search term for
[...]
Because the query planner does not always choose the optimal
join order, advanced users can elect to temporarily set this variable
to 1, and then specify the join order they desire explicitly.
Use
The order of
Or use a CTE like @jjanes explains in "Option 2". for a similar effect.
Indexes
B-tree indexes
Take all conditions on
If one of the conditions is variable, drop it from the
Another one on
The third column is just to enable index-only scans.
Also, since you have this composite index with two integer columns on
This additional index is a complete waste, delete it:
Details:
GIN index
Add
You need additional operator classes provided by the additional module
4 bytes from an
A multicolumn GIN index can be used with query conditions that involve
any subset of the index's columns. Unlike B-tree or GiST, index search
effectiveness is the same regardless of which index column(s) the
query conditions use.
Bold emphasis mine. So you just need the one (big and
Better Queries
This is just your query reformatted with aliases and some noise removed to clear the fog:
SELECT count(DISTINCT t.id)
FROM tickets t
JOIN transactions tr ON tr.objectid = t.id
JOIN attachments a ON a.transactionid = tr.id
WHERE t.status <> 'deleted'
AND t.type = 'ticket'
AND t.effectiveid = t.id
AND tr.objecttype = 'RT::Ticket'
AND a.contentindex @@ plainto_tsquery('frobnicate');Most of the problem with your query lies in the first two tables
tickets and transactions, which are missing from the question. I'm filling in with educated guesses.t.status,t.objecttypeandtr.objecttypeshould probably not betext, butenumor possibly some very small value referencing a look-up table.
EXISTS semi-joinAssuming
tickets.id is the primary key, this rewritten form should be much cheaper:SELECT count(*)
FROM tickets t
WHERE status <> 'deleted'
AND type = 'ticket'
AND effectiveid = id
AND EXISTS (
SELECT 1
FROM transactions tr
JOIN attachments a ON a.transactionid = tr.id
WHERE tr.objectid = t.id
AND tr.objecttype = 'RT::Ticket'
AND a.contentindex @@ plainto_tsquery('frobnicate')
);Instead of multiplying rows with two 1:n joins, only to collapse multiple matches in the end with
count(DISTINCT id), use an EXISTS semi-join, which can stop looking further as soon as the first match is found and at the same time obsoletes the final DISTINCT step. Per documentation:The subquery will generally only be executed long enough to determine
whether at least one row is returned, not all the way to completion.
Effectiveness depends on how many transactions per ticket and attachments per transaction there are.
Determine order of joins with
join_collapse_limitIf you know that your search term for
attachments.contentindex is very selective - more selective than other conditions in the query (which is probably the case for 'frobnicate', but not for 'problem'), you can force the sequence of joins. The query planner can hardly judge selectiveness of particular words, except for the most common ones. Per documentation:join_collapse_limit (integer)[...]
Because the query planner does not always choose the optimal
join order, advanced users can elect to temporarily set this variable
to 1, and then specify the join order they desire explicitly.
Use
SET LOCAL for the purpose to only set it for the current transaction.BEGIN;
SET LOCAL join_collapse_limit = 1;
SELECT count(DISTINCT t.id)
FROM attachments a -- 1st
JOIN transactions tr ON tr.id = a.transactionid -- 2nd
JOIN tickets t ON t.id = tr.objectid -- 3rd
WHERE t.status <> 'deleted'
AND t.type = 'ticket'
AND t.effectiveid = t.id
AND tr.objecttype = 'RT::Ticket'
AND a.contentindex @@ plainto_tsquery('frobnicate');
ROLLBACK; -- or COMMIT;The order of
WHERE conditions is always irrelevant. Only the order of joins is relevant here.Or use a CTE like @jjanes explains in "Option 2". for a similar effect.
Indexes
B-tree indexes
Take all conditions on
tickets that are used identically with most queries and create a partial index on tickets:CREATE INDEX tickets_partial_idx
ON tickets(id)
WHERE status <> 'deleted'
AND type = 'ticket'
AND effectiveid = id;If one of the conditions is variable, drop it from the
WHERE condition and prepend the column as index column instead.Another one on
transactions:CREATE INDEX transactions_partial_idx
ON transactions(objecttype, objectid, id)The third column is just to enable index-only scans.
Also, since you have this composite index with two integer columns on
attachments:"attachments3" btree (parent, transactionid)This additional index is a complete waste, delete it:
"attachments1" btree (parent)Details:
- Is a composite index also good for queries on the first field?
GIN index
Add
transactionid to your GIN index to make it a lot more effective. This may be another silver bullet, because it potentially allows index-only scans, eliminating visits to the big table completely.You need additional operator classes provided by the additional module
btree_gin. Detailed instructions:- Inner join using an array column
"contentindex_idx" gin (transactionid, contentindex)4 bytes from an
integer column don't make the index much bigger. Also, fortunately for you, GIN indexes are different from B-tree indexes in a crucial aspect. Per documentation:A multicolumn GIN index can be used with query conditions that involve
any subset of the index's columns. Unlike B-tree or GiST, index search
effectiveness is the same regardless of which index column(s) the
query conditions use.
Bold emphasis mine. So you just need the one (big and
Code Snippets
SELECT count(DISTINCT t.id)
FROM tickets t
JOIN transactions tr ON tr.objectid = t.id
JOIN attachments a ON a.transactionid = tr.id
WHERE t.status <> 'deleted'
AND t.type = 'ticket'
AND t.effectiveid = t.id
AND tr.objecttype = 'RT::Ticket'
AND a.contentindex @@ plainto_tsquery('frobnicate');SELECT count(*)
FROM tickets t
WHERE status <> 'deleted'
AND type = 'ticket'
AND effectiveid = id
AND EXISTS (
SELECT 1
FROM transactions tr
JOIN attachments a ON a.transactionid = tr.id
WHERE tr.objectid = t.id
AND tr.objecttype = 'RT::Ticket'
AND a.contentindex @@ plainto_tsquery('frobnicate')
);BEGIN;
SET LOCAL join_collapse_limit = 1;
SELECT count(DISTINCT t.id)
FROM attachments a -- 1st
JOIN transactions tr ON tr.id = a.transactionid -- 2nd
JOIN tickets t ON t.id = tr.objectid -- 3rd
WHERE t.status <> 'deleted'
AND t.type = 'ticket'
AND t.effectiveid = t.id
AND tr.objecttype = 'RT::Ticket'
AND a.contentindex @@ plainto_tsquery('frobnicate');
ROLLBACK; -- or COMMIT;CREATE INDEX tickets_partial_idx
ON tickets(id)
WHERE status <> 'deleted'
AND type = 'ticket'
AND effectiveid = id;CREATE INDEX transactions_partial_idx
ON transactions(objecttype, objectid, id)Context
StackExchange Database Administrators Q#76229, answer score: 8
Revisions (0)
No revisions yet.