patternsqlModerate
Full text search over multiple related tables: indices and performance
Viewed 0 times
indicestablesfullsearchtextrelatedperformancemultipleandover
Problem
We have the following database structure
The user enters a whitespace-separated list of words (search terms, basically) and we have to search for tasks, that satisfy the following: the task is a "match" if each search term occurs at least once in concatenation of task's description, the name and address of its associated object and the name of its associated actor.
Now, if we are not concerned about performance, we can just do something like this (given query "foo bar"):
Unfortunately, we are concerned about performance. The dataset will, probably, be as follows (and it is expected to grow):
What I've considered:
Make a denormalized table like this:
The field "doc" will contain the concatenation of task's description, associated object's name and address and actor's name. We will have to create an index over this field and it will be used in full text search queries. This table will be updated in update / insert triggers on tasks, actors, objects.
Drawbacks: tons of duplicated data (this one I am not quite concerned about), and updates to the master tables will become unpredictable in terms of number of rows updated (say, you update a name of some object and now suddenly you must update thousands of rows in task_documents
CREATE TABLE objects (
id int PRIMARY KEY,
name text,
address text
);
CREATE TABLE tasks (
id int PRIMARY KEY,
object_id int NOT NULL,
actor_id int NOT NULL,
description text
);
CREATE TABLE actors (
id int PRIMARY KEY,
name text
);The user enters a whitespace-separated list of words (search terms, basically) and we have to search for tasks, that satisfy the following: the task is a "match" if each search term occurs at least once in concatenation of task's description, the name and address of its associated object and the name of its associated actor.
Now, if we are not concerned about performance, we can just do something like this (given query "foo bar"):
SELECT t.id, t.description
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id
WHERE to_tsvector(concat_ws(' ', t.description, o.name, o.address, a.name)) @@
plainto_tsquery('foo bar');Unfortunately, we are concerned about performance. The dataset will, probably, be as follows (and it is expected to grow):
- about 10000 objects
- about 1000 actors
- about 100000 tasks evenly distributed between objects
What I've considered:
Make a denormalized table like this:
CREATE TABLE task_documents (
id int PRIMARY KEY,
doc tsvector
)The field "doc" will contain the concatenation of task's description, associated object's name and address and actor's name. We will have to create an index over this field and it will be used in full text search queries. This table will be updated in update / insert triggers on tasks, actors, objects.
Drawbacks: tons of duplicated data (this one I am not quite concerned about), and updates to the master tables will become unpredictable in terms of number of rows updated (say, you update a name of some object and now suddenly you must update thousands of rows in task_documents
Solution
Optimization
You're going down the right track.
You either need to
Caching the results
What you probably want is a
Then just
Denormalizing the table
This can take a lot of forms, you've got this right though..
Redesign
Searching everything in a fuzzy fashion like this is a losing game. Even this knock off of Dungeon and Dragons meets Yahoo Answers has rules.
It becomes a lot easier to generate a query when you introduce syntax likes
You're going down the right track.
You either need to
- Denormalize
- Cache
Caching the results
What you probably want is a
MATERIALIZED VIEW. This is easy and works reasonably well.CREATE MATERIALIZED VIEW foo
AS
SELECT t.id, to_tsvector(concat_ws(' ',a.name, o.address, t.description, a.name)) AS tsv
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id
;Then just
SELECT * FROM foo WHERE tsv @@ plainto_tsquery('foo bar');Denormalizing the table
This can take a lot of forms, you've got this right though..
Redesign
Searching everything in a fuzzy fashion like this is a losing game. Even this knock off of Dungeon and Dragons meets Yahoo Answers has rules.
It becomes a lot easier to generate a query when you introduce syntax likes
[text] for tagging, and is:answer to search just answers, rather than rebuilding Google and normalized indexes.Code Snippets
CREATE MATERIALIZED VIEW foo
AS
SELECT t.id, to_tsvector(concat_ws(' ',a.name, o.address, t.description, a.name)) AS tsv
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id
;SELECT * FROM foo WHERE tsv @@ plainto_tsquery('foo bar');Context
StackExchange Database Administrators Q#164546, answer score: 12
Revisions (0)
No revisions yet.