
Replacing non-ASCII characters

Submitted by: @import:stackexchange-codereview

Problem

I wrote a C# program to remove non-ASCII characters in a text file, and then output the result to a .NonAsciiChars file.

The input file is in XML format. In fact, the data may all be on two lines, which is why I am not doing the replacement line by line. Instead, I'm using StreamReader.ReadToEnd().

The problem is, the input file can be as big as 4 GB. When this happens, I'm getting the following OutOfMemoryException:

DateTime:2014-08-04 12:55:26,035 Thread ID:[1] Log Level:ERROR Logger Property:OS_fileParser.Program property:[(null)] - Message:System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Text.StringBuilder.ExpandByABlock(Int32 minBlockCharCount)
at System.Text.StringBuilder.Append(Char* value, Int32 valueCount)
at System.Text.StringBuilder.Append(Char[] value, Int32 startIndex, Int32 charCount)
at System.IO.StreamReader.ReadToEnd()
at OS_fileParser.MyProgram.FormatXmlFile(String inFile) in D:\Test\myProgram.cs:line 530
at OS_fileParser.MyProgram.Run() in D:\Test\myProgram.cs:line 336


Line 530 contains content = Regex.Replace(content, pattern, "");, while line 336 calls a method with the following body:

const string pattern = @"[^\x20-\x7E]";

string content;
using (var reader = new StreamReader(inFile))
{
    content = reader.ReadToEnd();
    reader.Close();
}

content = Regex.Replace(content, pattern, "");

using (var writer = new StreamWriter(inFile + ".NonAsciiChars"))
{
    writer.Write(content);
    writer.Close();
}

using (var myXmlReader = XmlReader.Create(inFile + ".NonAsciiChars", myXmlReaderSettings))
{
    try
    {
        while (myXmlReader.Read())
        {
        }
    }
    catch (XmlException ex)
    {
        Logger.Error("Validation error: " + ex);
    }
}


How can I improve the memory footprint of my code?

Solution

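You need to use the two Streams as, well, streams: read a manageable part of the input, transform it, write it to the output and repeat. Because the pattern matches one character at a time, a match can never straddle a chunk boundary, so processing the chunks independently gives the same result as processing the whole file at once.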

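const string pattern = @"[^\x20-\x7E]"; // same pattern as in the question
int bufferSize = 4096; // or whatever

char[] characters = new char[bufferSize];

using (var reader = new StreamReader(inFile))
using (var writer = new StreamWriter(inFile + ".NonAsciiChars"))
{
    while (true)
    {
        int read = reader.Read(characters, 0, characters.Length);

        if (read == 0)
            break;

        // Only the first `read` chars are valid; the rest of the buffer
        // still holds data from an earlier iteration and must be ignored.
        var replaced = Regex.Replace(new string(characters, 0, read), pattern, string.Empty);

        writer.Write(replaced);
    }
}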


Some notes about this code:

  • Notice the missing Close() calls: the whole point of using is to close streams and similar resources safely, so you don't need to close them a second time.

  • This code (just like the original) creates a lot of garbage for the GC to collect. Since your regex is actually very simple, it might be better to work directly with the char[] buffer.

  • I used string.Empty instead of "". This makes it very clear that an empty string was actually intended and that it's not just an “I started writing a string and then forgot about it” bug.
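
The char[] idea from the second note could look roughly like the sketch below. It is only an illustration, not a tested replacement: the class name AsciiFilter and the method name CopyPrintableAscii are made up for this example, and it assumes the goal is exactly what the regex [^\x20-\x7E] does, namely to keep only chars in the range 0x20 to 0x7E. Kept characters are compacted towards the front of the same buffer, so no intermediate strings are allocated.

```csharp
using System.IO;

// Hypothetical helper, named for illustration only.
static class AsciiFilter
{
    // Copies text from reader to writer, dropping every char outside
    // 0x20..0x7E (the same chars the regex [^\x20-\x7E] would remove).
    public static void CopyPrintableAscii(TextReader reader, TextWriter writer, int bufferSize = 4096)
    {
        var buffer = new char[bufferSize];
        int read;

        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            int kept = 0;

            // Compact the chars we keep to the front of the buffer.
            // kept never exceeds i, so nothing unread is overwritten.
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (c >= '\u0020' && c <= '\u007E')
                    buffer[kept++] = c;
            }

            // Write only the compacted prefix of this chunk.
            writer.Write(buffer, 0, kept);
        }
    }
}
```

To use it, wrap the call in the same two using statements as above and pass the StreamReader and StreamWriter in, for example AsciiFilter.CopyPrintableAscii(reader, writer). Taking TextReader and TextWriter rather than file names is a design choice made here so the method can also be exercised with StringReader and StringWriter.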

Context

StackExchange Code Review Q#59122, answer score: 12
