HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

SQL Server Full Text Search - .rtf files incorrectly indexed by rtf tags

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
fullsearchsqltextrtftagsindexedfilesserverincorrectly

Problem

I've set up a full-text index on my SQL Server DB on a varbinary(max) column. I have a type column specified, which contains the extension of the file, such a ".doc", ".pdf" etc.

However, I have noticed that when any .rtf files are indexed, SQL is including all of the meta information in the file (such as the RTF tag "listoverridecount0").

This is bloating the indexes quite a bit, and also means that searches will match on these tags (i.e. I can search on "listoverridecount0" and get EVERY .rtf returned).

Is there any reason that the iFilter for .rtf would not be removing the RTF Tags?

When I run this:

SELECT * FROM sys.fulltext_document_types WHERE document_type = '.rtf';


I get this:

document_type  .rtf                                     
class_id       C7310720-AC80-11D1-8DF3-00C04FB6EF4F 
path       c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\msfte.dll
version        12.0.6828.0
manufacturer   Microsoft Corporation


I have filed a Microsoft Connect bug as I don't seem to be able to find any workaround. Presumably this is either a) Error with the RTF iFilter not removing tags, or b) Full Text index issue.

My SQL Server version is:

Microsoft SQL Server 2012 (SP1) - 11.0.3393.0 (X64)
Oct 25 2013 19:04:40
Copyright (c) Microsoft Corporation
Developer Edition (64-bit) on Windows NT 6.2 (Build 9200: ) (Hypervisor)

Solution

I can reproduce this but I think you have a few options:

  • not worry too much about full-text index bloat. Not many people are going to be searching for 'rtf1', 'pard' or 'wmetafile0' and I don't think it will affect your performance that much. You might get the odd clash if someone is searching for 'red' or 'blue' but I'd say this is fairly low risk.



  • Clean the string up eg with RegEx I copied from here. Seemed to make a good go of it. Store it in a separate table with the stream_id ( which is the unique id for the document ) and full-text index that table instead:



-- Convert rtf to plain text
SELECT
CAST( file_stream AS VARCHAR(MAX) ) original
, MDS.mdq.RegExReplace(
CAST( file_stream AS VARCHAR(MAX) ),
'\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?', '', 1 )
FROM dbo.Documents


  • Similar approach with a CLR function I adapted from here:



`using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Linq; // for TakeWhile

public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlChars rtfToText( SqlChars inputRtf )
{
// RTF function lifted from here and adapted for SQL CLR:
// https://stackoverflow.com/questions/23277178/richtextbox-throws-outofmemory-on-azure

bool slash = false; //indicates if backslash followed by the space
bool figure_opened = false; //indicates if opening figure brace followed by the space
bool figure_closed = false; //indicates if closing brace followed by the space
bool first_space = false; //the else spaces are in plain text and must be included to the result

if (inputRtf.Length i && j

I'm sure this isn't the world's most efficient CLR function but it did an ok job on some 100 documents I tested, with a few fails.

Let us know if you hear anything back on the connect item.

Context

StackExchange Database Administrators Q#57383, answer score: 2

Revisions (0)

No revisions yet.