Replacing non-ASCII characters
Problem
I wrote a C# program to remove non-ASCII characters in a text file, and then output the result to a .NonAsciiChars file.
The input file is in XML format. In fact, the data may all be on two lines, which is why I am not doing the replacement line by line. Instead, I'm using StreamReader.ReadToEnd().
The problem is, the input file can be as big as 4 GB. When this happens, I'm getting the following OutOfMemoryException:
DateTime:2014-08-04 12:55:26,035 Thread ID:[1] Log Level:ERROR Logger Property:OS_fileParser.Program property:[(null)] - Message:System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Text.StringBuilder.ExpandByABlock(Int32 minBlockCharCount)
at System.Text.StringBuilder.Append(Char* value, Int32 valueCount)
at System.Text.StringBuilder.Append(Char[] value, Int32 startIndex, Int32 charCount)
at System.IO.StreamReader.ReadToEnd()
at OS_fileParser.MyProgram.FormatXmlFile(String inFile) in D:\Test\myProgram.cs:line 530
at OS_fileParser.MyProgram.Run() in D:\Test\myProgram.cs:line 336
Line 530 contains the statement content = Regex.Replace(content, pattern, ""); while line 336 calls a method with the following body:
    const string pattern = @"[^\x20-\x7E]";
    string content;
    using (var reader = new StreamReader(inFile))
    {
        content = reader.ReadToEnd();
        reader.Close();
    }
    content = Regex.Replace(content, pattern, "");
    using (var writer = new StreamWriter(inFile + ".NonAsciiChars"))
    {
        writer.Write(content);
        writer.Close();
    }
    using (var myXmlReader = XmlReader.Create(inFile + ".NonAsciiChars", myXmlReaderSettings))
    {
        try
        {
            while (myXmlReader.Read())
            {
            }
        }
        catch (XmlException ex)
        {
            Logger.Error("Validation error: " + ex);
        }
    }
How can I improve the memory footprint of my code?
Solution
You need to use the two Streams as, well, streams: read a manageable part of the input, transform it, write it to the output and repeat.
    int bufferSize = 4096; // or whatever
    char[] characters = new char[bufferSize];
    using (var reader = new StreamReader(inFile))
    using (var writer = new StreamWriter(inFile + ".NonAsciiChars"))
    {
        while (true)
        {
            int read = reader.Read(characters, 0, characters.Length);
            if (read == 0)
                break;
            // Convert only the characters actually read; the rest of the buffer is stale.
            var replaced = Regex.Replace(new string(characters, 0, read), pattern, string.Empty);
            writer.Write(replaced);
        }
    }
Some notes about this code:
- Notice the missing Close() calls: the whole point of using is to close streams and similar resources safely, so you don't need to close them twice.
- This code (just like the original) creates a lot of garbage to be collected by the GC. Since your regex is actually very simple, it might be better to manually work directly with char[]s.
- I used string.Empty instead of "". This makes it very clear that an empty string was actually intended and that it's not just an “I started writing a string and then forgot about it” bug.
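As a rough sketch of the second note, the filtering could be done directly on the char[] buffer, with no regex and no intermediate strings. The names AsciiFilter and CopyPrintableAscii are made up for illustration, and the range check assumes the same set of allowed characters as the pattern [^\x20-\x7E] from the question:

```csharp
using System;
using System.IO;

public static class AsciiFilter
{
    // Copies everything from reader to writer, keeping only characters in the
    // range 0x20..0x7E (the characters the pattern [^\x20-\x7E] leaves alone).
    public static void CopyPrintableAscii(TextReader reader, TextWriter writer, int bufferSize = 4096)
    {
        var buffer = new char[bufferSize];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            int kept = 0;
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (c >= '\u0020' && c <= '\u007E')
                    buffer[kept++] = c; // compact kept characters to the front of the buffer
            }
            // Write only the compacted prefix of the buffer.
            writer.Write(buffer, 0, kept);
        }
    }
}
```

It would be called with the same StreamReader and StreamWriter pair as in the loop above, for example AsciiFilter.CopyPrintableAscii(reader, writer). Because the buffer is reused and compacted in place, the only allocation is the single char[] at the start.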
Code Snippets
    int bufferSize = 4096; // or whatever
    char[] characters = new char[bufferSize];
    using (var reader = new StreamReader(inFile))
    using (var writer = new StreamWriter(inFile + ".NonAsciiChars"))
    {
        while (true)
        {
            int read = reader.Read(characters, 0, characters.Length);
            if (read == 0)
                break;
            // Convert only the characters actually read; the rest of the buffer is stale.
            var replaced = Regex.Replace(new string(characters, 0, read), pattern, string.Empty);
            writer.Write(replaced);
        }
    }
Context
StackExchange Code Review Q#59122, answer score: 12