HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Code works but memory usage is erratic

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
buterraticusageworksmemorycode

Problem

I have the following class which is a tool for retrieving MD5 hashes of files that are input into it. They can potentially be very large files and I require cancellation and progress report, hence why i rolled my own solution rather than just passing in a filestream to HashAlgorithm.

I have a producer - consumer set up so that I can be reading in file blocks at the same time as calculating the MD5 hasher. I find the best performance is using 16mb blocks, which I allow a max pre-read of 3 blocks. Considering I have up to 5 of these running in parallel (5 drives) then memory usage can max out at 240mb assuming that the GC collects immediately. Obviously this doesn't happen so i see memory usage jump to approx 700mb or more in some instances.

How can I reduce this erratic memory behaviour?

```
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Security.Cryptography;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;

namespace MD5Library
{

public class FileBlock
{
public byte[] Data;

public int Length;

public FileBlock(int inBufferSize)
{
Data = new byte[inBufferSize];
}
}

public class MD5Hasher
{
protected int bufferSize = 1024 1024 16; //16mb at a time

public void MD5Hasher()
{
}

public void MD5Hasher(int _bufferSize)
{
bufferSize = _bufferSize;
}

public Task Go(string inFile, CancellationToken ct, IProgress prg)
{
FileInfo fi = new FileInfo(inFile);
// Queue with capacity 3x the buffer
BlockingCollection BigBuffer = new BlockingCollection(3);

Task readFileTask = new Task(() => readFiles(fi, ct, BigBuffer));

Task computeHashTask = new Task(() => calcHash(fi, BigBuffer, ct, prg));

readFileTask.Start();
computeHashTask.Start();

return computeHashTask;
}

private async void readFiles(FileInfo fi, Cancellat

Solution

-
Before I address your specific issue directly, I'd like to suggest that you review MSDN's guidelines for C# naming conventions, specifically with respect to capitalization. The erratic name formatting makes your code unnecessarily difficult to follow. You will undoubtedly increase the consistency of your code when it becomes more readable, and you will be able to gain much more when interacting with other programmers (such as on CodeReview).

-
I am not 100% certain about the culprit here, but I have a suspicion that your performance issue is being caused by the way you are creating the string in your HashToString method. Your code is concatenating the strings together using the + operator. There are known performance issues with this approach, and the performance degradation increases proportionally to the number of strings. As a general rule, you should use a StringBuilder if you are concatenating more than four strings. An informative article about this guideline is available at: dotnetperls - C# string.Concat. If you are concatenating strings in a loop, you should always use a StringBuilder. You can also read an article with benchmarks and explanation of the relative performance between concatenation and StringBuilder at dotnetperls - C# StringBuilder Performance. After reading these articles, it should be clear that converting any large array to a string using your approach will result in terrible performance. I found another article, which cites performance tests that indicate using a StringBuilder is 550% faster than using a plus sign.

Now, let's start with your code and see if we can find a better way to handle that issue...

Original string.Concat* Version:

**The + operator calls String.Concat under the hood.*

public string HashToString(byte[] inHash)
{
    string hex = "";
    foreach (byte b in inHash)
    {
        hex += b.ToString("x2");
    } 
    return hex;
}


Now, since the concatenation is happening in a loop, we know it's better to use the StringBuilder than the + operator. This is what the code looks like if we make this change:

Better StringBuilder Version:

public string HashToString2(byte[] inHash)
{
    StringBuilder hexBuilder = new StringBuilder(); 
    foreach (byte b in inHash)
    {
        hexBuilder.AppendFormat("{0:x2}", b);
    } 
    return hexBuilder.ToString();
}


Now, if you read the dotnerperls articles, you'd realize we can do a minor tweak to get this to give use even better performance. The performance tweak involves explicitly assigning the StringBuilder's capacity when you create the instance. Since we have an array and we know each element in the array will result in 2 characters, we can determine the capacity of the StringBuilder and eliminate the cost of constant resizing.

Even Better StringBuilder Version:

public string HashToString3(byte[] inHash)
{
    StringBuilder hexBuilder = new StringBuilder( inHash.Length * 2 ); 
    foreach (byte b in inHash)
    {
        hexBuilder.AppendFormat("{0:x2}", b);
    } 
    return hexBuilder.ToString();
}


At this point, you should be able to run tests and notice better performance. Since we know that at least one of your issues was caused by the HashToString method, it's worth stepping back and see if we are converting a hash to a string in the most optimal way. Now, the hash is just a byte array in a specific context, so we really just need to know how to get the best performance when converting a byte array to a hexadecimal string in C#. I googled this, and found out that this issue has been discussed and tested exhaustively. The resources I found include:

-
"How do you convert a byte array to a hexadecimal string and vice versa in C#?" on StackOverflow. The issue gets beaten to death here, with the selected answer getting close to 200 votes.

-
An open source repository for a project called "Convert Byte Array to Hex String Performance Test", which was inspired by the previously noted StackOverflow question. It tested 9 candidate methods, (the StringBuilder implementation analogous to our previous result being the slowest).

The fastest result from the performance test used byte manipulation and looks like:

Ridiculously Optimized Byte Manipulation Version:

public string HashToString4(byte[] bytes) {
    char[] c = new char[bytes.Length * 2];
    byte b;
    for (int i = 0; i > 4));
        c[i * 2] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        b = ((byte)(bytes[i] & 0xF));
        c[i * 2 + 1] = (char)(b > 9 ? b + 0x37 : b + 0x30);
    }
    return new string(c);
}


My only criticism of this is that the byte manipulation technique is very difficult for human beings to read. The second best option was about half as fast as the byte manipulation, but is very readable and leverages a built in .NET function for converting byte arrays to hexadecimal strings. It looks like:

2nd Best, Highly-Readable Version:

```
publ

Code Snippets

public string HashToString(byte[] inHash)
{
    string hex = "";
    foreach (byte b in inHash)
    {
        hex += b.ToString("x2");
    } 
    return hex;
}
public string HashToString2(byte[] inHash)
{
    StringBuilder hexBuilder = new StringBuilder(); 
    foreach (byte b in inHash)
    {
        hexBuilder.AppendFormat("{0:x2}", b);
    } 
    return hexBuilder.ToString();
}
public string HashToString3(byte[] inHash)
{
    StringBuilder hexBuilder = new StringBuilder( inHash.Length * 2 ); 
    foreach (byte b in inHash)
    {
        hexBuilder.AppendFormat("{0:x2}", b);
    } 
    return hexBuilder.ToString();
}
public string HashToString4(byte[] bytes) {
    char[] c = new char[bytes.Length * 2];
    byte b;
    for (int i = 0; i < bytes.Length; i++) 
    {
        b = ((byte)(bytes[i] >> 4));
        c[i * 2] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        b = ((byte)(bytes[i] & 0xF));
        c[i * 2 + 1] = (char)(b > 9 ? b + 0x37 : b + 0x30);
    }
    return new string(c);
}
public string HashToString5(byte[] bytes) {
    return BitConverter.ToString(bytes).Replace("-", "");
}

Context

StackExchange Code Review Q#16331, answer score: 5

Revisions (0)

No revisions yet.