HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Full text search in XML documents

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
fullsearchtextxmldocuments

Problem

I would like to do full-text search in XML documents.

However, I can't get matches for attribute values (I get a syntax error if I provide the full attribute value including the : and / characters in the example below). Also, I don't know how to specify my own delimiters for the parser.

Below is SSCCE:

I would like to be able to select for the last component of the attribute value: attributevalue and for the parser to recognize : and / as delimiters and so yield boo1, boo2, some and attributevalue as lexemes.

DROP TABLE IF EXISTS xmldocument;
CREATE TABLE IF NOT EXISTS xmldocument (
i SERIAL NOT NULL,
content XML NOT NULL
);
ALTER TABLE xmldocument ADD PRIMARY KEY (i);

INSERT INTO xmldocument (content) VALUES
('foo')
,('foo')
;

-- matches both records as expected
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'foo'; 

-- no match
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'attributevalue'; 

-- no match
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'boo2'; 

 -- no match
 SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'boo4';


As to why I don't just use ILIKE % etc. The reason is that I need to optimize by using a GIN index and I don't think it is possible or meaningful to build an index on simple VARCHAR values.

update after accepting the answer

Based on the accepted answer, the following queries all work as expected:

SELECT * FROM fts.xmldocument WHERE 
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('foo');

SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('attributevalue');

SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('boo2');

SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('boo4');

Solution

If you don't want to write your own parser, the quick and dirty solution would be to replace ` with some other punctuation, so that the existing parsers don't decide to discard them as html tags.

SELECT * 
FROM xmldocument 
WHERE to_tsvector(regexp_replace(content::text,'[<>]',' ','g')) @@ 'boo2';



As to why I don't just use ILIKE % etc. the reason is that I need to optimize by using a GIN index and I don't think it is possible or meaningful to build an index on simple VARCHAR values.

If you use the
pg_trgm` extension, you can build a gin index on VARCHAR values which will optimize ILIKE queries. How effective it is depends on the size of your indexed documents, and the size of your query. I'd recommend trying it and seeing how it does for you.

Version 1.2 of pg_trgm (to be included in PostgreSQL 9.6, but it is fairly easy to back-port into 9.4 and 9.5 if you are willing to compile some code) will be much more effective with large queries.

Code Snippets

SELECT * 
FROM xmldocument 
WHERE to_tsvector(regexp_replace(content::text,'[<>]',' ','g')) @@ 'boo2';

Context

StackExchange Database Administrators Q#129413, answer score: 4

Revisions (0)

No revisions yet.