HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Improving FTS phrase search on postgres for common terms

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
searchpostgresftsphrasefortermscommonimproving

Problem

I have a table defined like
CREATE TABLE details_search (
id int4 NOT NULL PRIMARY KEY,
"search" tsvector NULL
);
CREATE INDEX details_search_idx ON details_search USING gin (search);


I ran this to get an idea of its size:
SELECT pg_size_pretty(pg_relation_size('details_search')) relation_size,
pg_size_pretty(pg_total_relation_size('details_search')) total_relation_size,
pg_size_pretty(pg_table_size('details_search')) table_size,
pg_size_pretty(pg_indexes_size('details_search')) indexes_size;


and these are the results

relation_size|total_relation_size|table_size|indexes_size|
-------------+-------------------+----------+------------+
800 MB       |64 GB              |57 GB     |6830 MB     |


I'm interested in only performing phrase searches, and these are used aggregated. When I perform phrase searches with uncommon terms, things work fine. Now when I use a phrase that has a common term, performance suffers a lot.

This query took 192s:

SELECT COUNT(id)
FROM   details_search
WHERE  search @@ phraseto_tsquery('simple', 'data management')


Here's the query plan (and here the query plan in a nice interface):

```
Output: count(id)
Buffers: shared hit=25942383 read=6354221 written=4588
I/O Timings: shared/local read=512605.708 write=122.864
-> Gather (cost=178176.43..178176.64 rows=2 width=8) (actual time=192857.512..192861.652 rows=3 loops=1)
Output: (PARTIAL count(id))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=25942383 read=6354221 written=4588
I/O Timings: shared/local read=512605.708 write=122.864
-> Partial Aggregate (cost=177176.43..177176.44 rows=1 width=8) (actual time=192852.434..192852.435 rows=1 loops=3)
Output: PARTIAL count(id)
Buffers: shared hit=25942383 read=6354221 written=4588
I/O Timings: shared/local read=512605.708 write=122.864
Worker 0: actual time=192851.530..19

Solution

There is nothing much you can do about that. A GIN index indexes the individual constituents, not the phrase. So the "bitmap index scan" will give you all rows that contain "data" and "management", and the recheck in the "bitmap heap scan" weeds out the 95% false positives.

The mere 170 seconds it took to read the over 2 million 8kB blocks indicate that most of your data were cached in the kernel page cache anyway. You can boost the performance slightly by increasing work_mem so that you get no more "lossy" blocks.

If the problem were to boost the search for certain specific phrases, you could use a thesaurus dictionary to replace them with single words like "datamanagement", which would make the index scan more effective. But I guess you want to speed up the search for arbitrary phrases.

Context

StackExchange Database Administrators Q#335867, answer score: 3

Revisions (0)

No revisions yet.