
Word frequency in a large text file


Problem

I'm trying to read a large text file and output the distinct words in it along with their counts. I've tried a couple of approaches so far, and this is by far the fastest solution I have come up with.

private static readonly char[] separators = { ' ' };

public IDictionary<string, int> Parse(string path)
{
    var wordCount = new Dictionary<string, int>();

    using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var streamReader = new StreamReader(fileStream))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
                if (wordCount.ContainsKey(word))
                {
                    wordCount[word] = wordCount[word] + 1;
                }
                else
                {
                    wordCount.Add(word, 1);
                }
            }
        }
    }

    return wordCount;
}


How I am measuring my solution

I have a 200 MB text file whose total word count I know (via a text editor). I'm using the Stopwatch class, counting the words to verify accuracy, and measuring the time taken. So far, it takes around 9 seconds.

Other attempts

I have tried to utilise multithreading to split out the work via the TPL. This involved batching multiple lines, sending the processing of each batch to a separate task, and locking the read/write operations on the dictionary. This, however, did not provide any performance improvement.

It took around 30 seconds. I suspect the locking around dictionary reads/writes is too costly to gain any performance. I'm sure there is a faster way to achieve this! Any suggestions or criticisms of my solution are welcome.
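The locked version isn't shown, but it presumably looked something like the sketch below (a hypothetical reconstruction; the `gate` object and the use of Parallel.ForEach instead of explicit task batching are assumptions). Every increment serializes on a single lock, so the workers spend most of their time waiting, which would explain the slowdown:

```csharp
// Hypothetical sketch of the locked-dictionary attempt described above.
var wordCount = new Dictionary<string, int>();
var gate = new object();

Parallel.ForEach(File.ReadLines(path), line =>
{
    var words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var word in words)
    {
        lock (gate)   // every single increment contends for the same lock
        {
            int count;
            wordCount.TryGetValue(word, out count);
            wordCount[word] = count + 1;
        }
    }
});
```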

Here is the link to the test file I'm using.

Solution

Let's set up code to benchmark different approaches. Every word counter will implement this interface:

interface IWordCounter
{
    IDictionary<string, int> CountWords(string path);
}


And here's our benchmark runner:

var wordCounters = new IWordCounter[]
{
    // ...
};

foreach (var wordCounter in wordCounters)
{
    GC.Collect();
    GC.WaitForPendingFinalizers();

    var sw = Stopwatch.StartNew();
    var wordCount = wordCounter.CountWords(path);
    sw.Stop();

    Console.WriteLine("{0}, {1} entries, {2}", wordCounter.GetType().Name, wordCount.Count, sw.Elapsed);
}


Timings were taken with a release build, on the test file provided, no debugger attached, on .NET 4.5.2.

Here's the original code:

class OriginalWordCounter : IWordCounter
{
    private static readonly char[] separators = { ' ' };

    public IDictionary<string, int> CountWords(string path)
    {
        var wordCount = new Dictionary<string, int>();

        using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
        using (var streamReader = new StreamReader(fileStream))
        {
            string line;
            while ((line = streamReader.ReadLine()) != null)
            {
                var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

                foreach (var word in words)
                {
                    if (wordCount.ContainsKey(word))
                    {
                        wordCount[word] = wordCount[word] + 1;
                    }
                    else
                    {
                        wordCount.Add(word, 1);
                    }
                }
            }
        }

        return wordCount;
    }
}


On my machine, this takes about 8.2s.

We see an improvement using Heslacher's suggestion to use TryGetValue:

class OriginalTryGetWordCounter : IWordCounter
{
    private static readonly char[] separators = { ' ' };

    public IDictionary<string, int> CountWords(string path)
    {
        var wordCount = new Dictionary<string, int>();

        foreach (var line in File.ReadLines(path, Encoding.UTF8))
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);
            foreach (var word in words)
            {
                int count;
                wordCount.TryGetValue(word, out count);
                wordCount[word] = count + 1;
            }
        }

        return wordCount;
    }
}


This takes about 6.7s. (The use of File.ReadLines here doesn't seem to affect the timing; it's just a bit cleaner.)

We get another improvement with Parallel.ForEach together with a ConcurrentDictionary:

class ParallelWordCounter : IWordCounter
{
    public IDictionary<string, int> CountWords(string path)
    {
        var result = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(File.ReadLines(path, Encoding.UTF8), line =>
        {
            var words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (var word in words)
            {
                result.AddOrUpdate(word, 1, (_, x) => x + 1);
            }
        });

        return result;
    }
}


This takes about 5.2s.

You might want to try some of the Parallel.ForEach overloads to see if you can get any further improvements, and remember to take these timings with a grain of salt.
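One overload worth trying is the localInit/localFinally form, which gives each worker thread its own plain Dictionary and merges into the shared result only once per worker, avoiding per-word contention on the ConcurrentDictionary. A sketch (not benchmarked here; the class name is made up):

```csharp
class ThreadLocalWordCounter : IWordCounter
{
    public IDictionary<string, int> CountWords(string path)
    {
        var result = new ConcurrentDictionary<string, int>();

        Parallel.ForEach(
            File.ReadLines(path, Encoding.UTF8),
            // localInit: one private Dictionary per worker thread
            () => new Dictionary<string, int>(),
            // body: count into the thread-local dictionary, no synchronization
            (line, state, local) =>
            {
                var words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                foreach (var word in words)
                {
                    int count;
                    local.TryGetValue(word, out count);
                    local[word] = count + 1;
                }
                return local;
            },
            // localFinally: merge each worker's counts into the shared result
            local =>
            {
                foreach (var pair in local)
                {
                    result.AddOrUpdate(pair.Key, pair.Value, (_, x) => x + pair.Value);
                }
            });

        return result;
    }
}
```

Whether this beats the plain ConcurrentDictionary version depends on the word distribution and core count, so measure it against the same file.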


Context

StackExchange Code Review Q#90925, answer score: 6
