patternsqlMinor
Full text search in XML documents
Viewed 0 times
fullsearchtextxmldocuments
Problem
I would like to do full-text search in XML documents.
However, I can't get matches for attribute values (I get a syntax error if I provide the full attribute value including the
Below is SSCCE:
I would like to be able to select for the last component of the attribute value:
As to why I don't just use
update after accepting the answer
Based on the accepted answer, the following queries all work as expected:
However, I can't get matches for attribute values (I get a syntax error if I provide the full attribute value including the
: and / characters in the example below). Also, I don't know how to specify my own delimiters for the parser.Below is SSCCE:
I would like to be able to select for the last component of the attribute value:
attributevalue and for the parser to recognize : and / as delimiters and so yield boo1, boo2, some and attributevalue as lexemes.DROP TABLE IF EXISTS xmldocument;
CREATE TABLE IF NOT EXISTS xmldocument (
i SERIAL NOT NULL,
content XML NOT NULL
);
ALTER TABLE xmldocument ADD PRIMARY KEY (i);
INSERT INTO xmldocument (content) VALUES
('foo')
,('foo')
;
-- matches both records as expected
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'foo';
-- no match
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'attributevalue';
-- no match
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'boo2';
-- no match
SELECT * FROM xmldocument WHERE to_tsvector(CAST (content AS VARCHAR))@@'boo4';As to why I don't just use
ILIKE % etc. The reason is that I need to optimize by using a GIN index and I don't think it is possible or meaningful to build an index on simple VARCHAR values.update after accepting the answer
Based on the accepted answer, the following queries all work as expected:
SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('foo');
SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('attributevalue');
SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('boo2');
SELECT * FROM fts.xmldocument WHERE
to_tsvector(regexp_replace(content::text,'[<>/]',' ','g')) @@ to_tsquery('boo4');Solution
If you don't want to write your own parser, the quick and dirty solution would be to replace `
Version 1.2 of pg_trgm (to be included in PostgreSQL 9.6, but it is fairly easy to back-port into 9.4 and 9.5 if you are willing to compile some code) will be much more effective with large queries.
with some other punctuation, so that the existing parsers don't decide to discard them as html tags.
SELECT *
FROM xmldocument
WHERE to_tsvector(regexp_replace(content::text,'[<>]',' ','g')) @@ 'boo2';
As to why I don't just use ILIKE % etc. the reason is that I need to optimize by using a GIN index and I don't think it is possible or meaningful to build an index on simple VARCHAR values.
If you use the pg_trgm` extension, you can build a gin index on VARCHAR values which will optimize ILIKE queries. How effective it is depends on the size of your indexed documents, and the size of your query. I'd recommend trying it and seeing how it does for you. Version 1.2 of pg_trgm (to be included in PostgreSQL 9.6, but it is fairly easy to back-port into 9.4 and 9.5 if you are willing to compile some code) will be much more effective with large queries.
Code Snippets
SELECT *
FROM xmldocument
WHERE to_tsvector(regexp_replace(content::text,'[<>]',' ','g')) @@ 'boo2';Context
StackExchange Database Administrators Q#129413, answer score: 4
Revisions (0)
No revisions yet.