HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Full text search over multiple related tables: indices and performance

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
indicestablesfullsearchtextrelatedperformancemultipleandover

Problem

We have the following database structure

CREATE TABLE objects (
    id       int   PRIMARY KEY,
    name     text,
    address  text
);

CREATE TABLE tasks (
    id int           PRIMARY KEY,
    object_id int    NOT NULL,
    actor_id int     NOT NULL,
    description text
);

CREATE TABLE actors (
    id   int  PRIMARY KEY,
    name text
);


The user enters a whitespace-separated list of words (search terms, basically) and we have to search for tasks, that satisfy the following: the task is a "match" if each search term occurs at least once in concatenation of task's description, the name and address of its associated object and the name of its associated actor.

Now, if we are not concerned about performance, we can just do something like this (given query "foo bar"):

SELECT t.id, t.description
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id
WHERE to_tsvector(concat_ws(' ', t.description, o.name, o.address, a.name)) @@
    plainto_tsquery('foo bar');


Unfortunately, we are concerned about performance. The dataset will, probably, be as follows (and it is expected to grow):

  • about 10000 objects



  • about 1000 actors



  • about 100000 tasks evenly distributed between objects



What I've considered:

Make a denormalized table like this:

CREATE TABLE task_documents (
    id int PRIMARY KEY,
    doc tsvector
)


The field "doc" will contain the concatenation of task's description, associated object's name and address and actor's name. We will have to create an index over this field and it will be used in full text search queries. This table will be updated in update / insert triggers on tasks, actors, objects.

Drawbacks: tons of duplicated data (this one I am not quite concerned about), and updates to the master tables will become unpredictable in terms of number of rows updated (say, you update a name of some object and now suddenly you must update thousands of rows in task_documents

Solution

Optimization

You're going down the right track.

You either need to

  • Denormalize



  • Cache



Caching the results

What you probably want is a MATERIALIZED VIEW. This is easy and works reasonably well.

CREATE MATERIALIZED VIEW foo
AS
SELECT t.id, to_tsvector(concat_ws(' ',a.name, o.address, t.description, a.name)) AS tsv
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id 
;


Then just

SELECT * FROM foo WHERE tsv @@ plainto_tsquery('foo bar');


Denormalizing the table

This can take a lot of forms, you've got this right though..

Redesign

Searching everything in a fuzzy fashion like this is a losing game. Even this knock off of Dungeon and Dragons meets Yahoo Answers has rules.

It becomes a lot easier to generate a query when you introduce syntax likes [text] for tagging, and is:answer to search just answers, rather than rebuilding Google and normalized indexes.

Code Snippets

CREATE MATERIALIZED VIEW foo
AS
SELECT t.id, to_tsvector(concat_ws(' ',a.name, o.address, t.description, a.name)) AS tsv
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id 
;
SELECT * FROM foo WHERE tsv @@ plainto_tsquery('foo bar');

Context

StackExchange Database Administrators Q#164546, answer score: 12

Revisions (0)

No revisions yet.