Word frequency in a large text file
Problem
I'm trying to read a large text file and output the distinct words in it along with each word's count. I've tried a couple of approaches so far, and this is by far the fastest solution I have come up with.
How I am measuring my solution
I have a 200MB text file whose total word count I know (via a text editor). I'm using the Stopwatch class, counting the words to verify accuracy, and measuring the time taken. So far, it takes around 9 seconds.
Other attempts
I tried using multithreading to split up the work via the TPL. This involved batching multiple lines, sending each batch of lines to a separate task for processing, and locking the read/write operations on the dictionary. However, this gave me no performance improvement; it took around 30 seconds. I suspect the locking on the dictionary reads/writes is too costly to gain any performance. I am sure there is a faster way to achieve this! Any suggestions/criticisms of my solution are welcome.
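For reference, the batched-and-locked attempt described above might have looked roughly like this (a reconstruction under assumptions, since the parallel code was not posted; the class and helper names are made up):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class LockedParallelWordCounter
{
    private static readonly char[] separators = { ' ' };

    public IDictionary<string, int> CountWords(string path)
    {
        var wordCount = new Dictionary<string, int>();
        var locker = new object();
        var tasks = new List<Task>();

        // Batch the lines and hand each batch to a separate task.
        foreach (var batch in ReadBatches(path, 1000))
        {
            var lines = batch;
            tasks.Add(Task.Run(() =>
            {
                foreach (var line in lines)
                {
                    foreach (var word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
                    {
                        // Every single increment takes the same lock, so the
                        // tasks spend most of their time waiting on each other.
                        lock (locker)
                        {
                            int count;
                            wordCount.TryGetValue(word, out count);
                            wordCount[word] = count + 1;
                        }
                    }
                }
            }));
        }
        Task.WaitAll(tasks.ToArray());
        return wordCount;
    }

    private static IEnumerable<List<string>> ReadBatches(string path, int batchSize)
    {
        var batch = new List<string>(batchSize);
        foreach (var line in File.ReadLines(path))
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<string>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch;
    }
}
```

If the increments dominate, a per-word lock like this serializes nearly all of the work, which would explain the slowdown over the single-threaded version.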
Here is the link to the test file I'm using.
private static readonly char[] separators = { ' ' };
public IDictionary<string, int> Parse(string path)
{
var wordCount = new Dictionary<string, int>();
using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
using (var streamReader = new StreamReader(fileStream))
{
string line;
while ((line = streamReader.ReadLine()) != null)
{
var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);
foreach (var word in words)
{
if (wordCount.ContainsKey(word))
{
wordCount[word] = wordCount[word] + 1;
}
else
{
wordCount.Add(word, 1);
}
}
}
}
return wordCount;
}
Solution
Let's set up code to benchmark different approaches. Every word counter will implement this interface:
interface IWordCounter
{
IDictionary<string, int> CountWords(string path);
}
And here's our benchmark runner:
var wordCounters = new IWordCounter[]
{
// ...
};
foreach (var wordCounter in wordCounters)
{
GC.Collect();
GC.WaitForPendingFinalizers();
var sw = Stopwatch.StartNew();
var wordCount = wordCounter.CountWords(path);
sw.Stop();
Console.WriteLine("{0}, {1} entries, {2}", wordCounter.GetType().Name, wordCount.Count, sw.Elapsed);
}
Timings were taken with a release build, on the test file provided, no debugger attached, on .NET 4.5.2.
Here's the original code:
class OriginalWordCounter : IWordCounter
{
private static readonly char[] separators = { ' ' };
public IDictionary<string, int> CountWords(string path)
{
var wordCount = new Dictionary<string, int>();
using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
using (var streamReader = new StreamReader(fileStream))
{
string line;
while ((line = streamReader.ReadLine()) != null)
{
var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);
foreach (var word in words)
{
if (wordCount.ContainsKey(word))
{
wordCount[word] = wordCount[word] + 1;
}
else
{
wordCount.Add(word, 1);
}
}
}
}
return wordCount;
}
}
On my machine, this takes about 8.2s.
We see an improvement using Heslacher's suggestion to use TryGetValue:
class OriginalTryGetWordCounter : IWordCounter
{
private static readonly char[] separators = { ' ' };
public IDictionary<string, int> CountWords(string path)
{
var wordCount = new Dictionary<string, int>();
foreach (var line in File.ReadLines(path, Encoding.UTF8))
{
var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);
foreach (var word in words)
{
int count;
wordCount.TryGetValue(word, out count);
wordCount[word] = count + 1;
}
}
return wordCount;
}
}
This takes about 6.7s. (The use of File.ReadLines here doesn't seem to affect the timing; it's just a bit cleaner.)
We get another improvement with Parallel.ForEach together with a ConcurrentDictionary:
class ParallelWordCounter : IWordCounter
{
public IDictionary<string, int> CountWords(string path)
{
var result = new ConcurrentDictionary<string, int>();
Parallel.ForEach(File.ReadLines(path, Encoding.UTF8), line =>
{
var words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (var word in words)
{
result.AddOrUpdate(word, 1, (_, x) => x + 1);
}
});
return result;
}
}
This takes about 5.2s.
You might want to try some of the Parallel.ForEach overloads to see if you can get any further improvements, and remember to take these results with a grain of salt.
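One overload worth trying is the thread-local variant of Parallel.ForEach, which takes a localInit/localFinally pair: each worker counts into its own private dictionary and merges into the shared one only once at the end, avoiding per-word contention. A sketch (I haven't timed it against the versions above, so any gain here is an assumption; it could be plugged into the IWordCounter benchmark as another entry):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;

class ThreadLocalWordCounter
{
    public IDictionary<string, int> CountWords(string path)
    {
        var result = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(
            File.ReadLines(path, Encoding.UTF8),
            // localInit: each worker thread gets its own dictionary, so the
            // hot counting loop needs no synchronization at all.
            () => new Dictionary<string, int>(),
            (line, state, local) =>
            {
                var words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                foreach (var word in words)
                {
                    int count;
                    local.TryGetValue(word, out count);
                    local[word] = count + 1;
                }
                return local;
            },
            // localFinally: merge each worker's private counts into the
            // shared dictionary once, when that worker finishes.
            local =>
            {
                foreach (var pair in local)
                {
                    result.AddOrUpdate(pair.Key, pair.Value, (_, x) => x + pair.Value);
                }
            });
        return result;
    }
}
```

The merge step touches the ConcurrentDictionary only once per distinct word per worker, rather than once per word occurrence.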
Context
StackExchange Code Review Q#90925, answer score: 6