HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Converting RTF in a text column to plain text in bulk

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
columnrtftextbulkconvertingplain

Problem

I have a legacy system with about 10 million rows in a table. In that table there is a column of type text, most of them are standard text but about 500 thousand rows have RTF markup in them. I need to convert the RTF formatted text in to plain text.

My current method is I have a C# program that loads the query in to a DataTable using a SqlDataAdapter and uses the winforms RichTextBox control to do the conversion.

void bw_DoWork(object sender, DoWorkEventArgs e)
{
    count = 0;

    rtbRTFToPlain = new RichTextBox();

    using (SqlDataAdapter ada = new SqlDataAdapter("select note_guid, notes from client_notes", Globals.SQLConnectionString))
    using(SqlCommandBuilder cmb = new SqlCommandBuilder(ada))
    {
        DataTable dt = new DataTable();
        ada.UpdateCommand = cmb.GetUpdateCommand();

        ada.Fill(dt);

        int reportEvery = dt.Rows.Count / 100;
        if (reportEvery == 0)
            reportEvery = 1;
        foreach (DataRow row in dt.Rows)
        {
            if (count % reportEvery == 0)
                bw.ReportProgress(count / reportEvery);

            try
            {
                if (((string)row["notes"]).TrimStart().StartsWith("{") == true)
                {
                    rtbRTFToPlain.Rtf = (string)row["notes"];
                    row["notes"] = rtbRTFToPlain.Text;
                }
            }
            catch
            {
            }

            count++;

        }
        bw.ReportProgress(100);

        this.Invoke(new Action(() => 
            {
                this.ControlBox = false;
                this.Text = "Updating database please wait";
            }));
        ada.Update(dt);
    }
}


This is working great for small tables, however this is the first time I had to run it on a table with such a large data-set (some of the rtf files can be several megabytes in size with embedded pictures) and I am getting OutOfMemory errors with my C# program.

I know I can chunk my query down in t

Solution

I ended up making a CLR function to convert it.

I found this library, I then tweaked it a tiny bit to remove things I did not need like logging and Drawing methods, which allowed me to mark it as safe.

I then just made this small class.

using System.Data.SqlTypes;
using Itenso.Rtf.Converter.Text;
using Itenso.Rtf.Support;

public partial class StoredProcedures
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString RtfToPlainText(SqlString text)
    {
        if (text.Value.StartsWith(@"{\rtf"))
        {
            RtfTextConverter textConverter = new RtfTextConverter();
            RtfInterpreterTool.Interpret(text.Value, textConverter);
            return textConverter.PlainText;
        }
        else
            return text;
    }
}


And ran this in SQL

sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO

CREATE ASSEMBLY ConversionsSqlExtensionsAssembly 
from 'E:\Code\ConversionsSqlExtensions\bin\Debug\ConversionsSqlExtensions.dll' 
WITH PERMISSION_SET = safe
go

CREATE function RtfToPlainText(@value nvarchar(max))
returns nvarchar(max)
AS EXTERNAL NAME ConversionsSqlExtensionsAssembly.StoredProcedures.RtfToPlainText


And it is fast and works great!

Code Snippets

using System.Data.SqlTypes;
using Itenso.Rtf.Converter.Text;
using Itenso.Rtf.Support;

public partial class StoredProcedures
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString RtfToPlainText(SqlString text)
    {
        if (text.Value.StartsWith(@"{\rtf"))
        {
            RtfTextConverter textConverter = new RtfTextConverter();
            RtfInterpreterTool.Interpret(text.Value, textConverter);
            return textConverter.PlainText;
        }
        else
            return text;
    }
}
sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO

CREATE ASSEMBLY ConversionsSqlExtensionsAssembly 
from 'E:\Code\ConversionsSqlExtensions\bin\Debug\ConversionsSqlExtensions.dll' 
WITH PERMISSION_SET = safe
go

CREATE function RtfToPlainText(@value nvarchar(max))
returns nvarchar(max)
AS EXTERNAL NAME ConversionsSqlExtensionsAssembly.StoredProcedures.RtfToPlainText

Context

StackExchange Database Administrators Q#17454, answer score: 5

Revisions (0)

No revisions yet.