patternsqlMinor
Converting RTF in a text column to plain text in bulk
Viewed 0 times
columnrtftextbulkconvertingplain
Problem
I have a legacy system with about 10 million rows in a table. In that table there is a column of type
My current method is I have a C# program that loads the query in to a DataTable using a
This is working great for small tables, however this is the first time I had to run it on a table with such a large data-set (some of the rtf files can be several megabytes in size with embedded pictures) and I am getting OutOfMemory errors with my C# program.
I know I can chunk my query down in t
text, most of them are standard text but about 500 thousand rows have RTF markup in them. I need to convert the RTF formatted text in to plain text.My current method is I have a C# program that loads the query in to a DataTable using a
SqlDataAdapter and uses the winforms RichTextBox control to do the conversion.void bw_DoWork(object sender, DoWorkEventArgs e)
{
count = 0;
rtbRTFToPlain = new RichTextBox();
using (SqlDataAdapter ada = new SqlDataAdapter("select note_guid, notes from client_notes", Globals.SQLConnectionString))
using(SqlCommandBuilder cmb = new SqlCommandBuilder(ada))
{
DataTable dt = new DataTable();
ada.UpdateCommand = cmb.GetUpdateCommand();
ada.Fill(dt);
int reportEvery = dt.Rows.Count / 100;
if (reportEvery == 0)
reportEvery = 1;
foreach (DataRow row in dt.Rows)
{
if (count % reportEvery == 0)
bw.ReportProgress(count / reportEvery);
try
{
if (((string)row["notes"]).TrimStart().StartsWith("{") == true)
{
rtbRTFToPlain.Rtf = (string)row["notes"];
row["notes"] = rtbRTFToPlain.Text;
}
}
catch
{
}
count++;
}
bw.ReportProgress(100);
this.Invoke(new Action(() =>
{
this.ControlBox = false;
this.Text = "Updating database please wait";
}));
ada.Update(dt);
}
}This is working great for small tables, however this is the first time I had to run it on a table with such a large data-set (some of the rtf files can be several megabytes in size with embedded pictures) and I am getting OutOfMemory errors with my C# program.
I know I can chunk my query down in t
Solution
I ended up making a CLR function to convert it.
I found this library, I then tweaked it a tiny bit to remove things I did not need like logging and Drawing methods, which allowed me to mark it as safe.
I then just made this small class.
And ran this in SQL
And it is fast and works great!
I found this library, I then tweaked it a tiny bit to remove things I did not need like logging and Drawing methods, which allowed me to mark it as safe.
I then just made this small class.
using System.Data.SqlTypes;
using Itenso.Rtf.Converter.Text;
using Itenso.Rtf.Support;
public partial class StoredProcedures
{
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlString RtfToPlainText(SqlString text)
{
if (text.Value.StartsWith(@"{\rtf"))
{
RtfTextConverter textConverter = new RtfTextConverter();
RtfInterpreterTool.Interpret(text.Value, textConverter);
return textConverter.PlainText;
}
else
return text;
}
}And ran this in SQL
sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO
CREATE ASSEMBLY ConversionsSqlExtensionsAssembly
from 'E:\Code\ConversionsSqlExtensions\bin\Debug\ConversionsSqlExtensions.dll'
WITH PERMISSION_SET = safe
go
CREATE function RtfToPlainText(@value nvarchar(max))
returns nvarchar(max)
AS EXTERNAL NAME ConversionsSqlExtensionsAssembly.StoredProcedures.RtfToPlainTextAnd it is fast and works great!
Code Snippets
using System.Data.SqlTypes;
using Itenso.Rtf.Converter.Text;
using Itenso.Rtf.Support;
public partial class StoredProcedures
{
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlString RtfToPlainText(SqlString text)
{
if (text.Value.StartsWith(@"{\rtf"))
{
RtfTextConverter textConverter = new RtfTextConverter();
RtfInterpreterTool.Interpret(text.Value, textConverter);
return textConverter.PlainText;
}
else
return text;
}
}sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO
CREATE ASSEMBLY ConversionsSqlExtensionsAssembly
from 'E:\Code\ConversionsSqlExtensions\bin\Debug\ConversionsSqlExtensions.dll'
WITH PERMISSION_SET = safe
go
CREATE function RtfToPlainText(@value nvarchar(max))
returns nvarchar(max)
AS EXTERNAL NAME ConversionsSqlExtensionsAssembly.StoredProcedures.RtfToPlainTextContext
StackExchange Database Administrators Q#17454, answer score: 5
Revisions (0)
No revisions yet.