HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpModerate

Slow-running File parser

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
parserfilerunningslow

Problem

I've got a method that parses a file. I take all the words and add them to a SortedSet. Every word contains a list of Lines that contain said word. Words are not strings but a class I created:

class Word : IComparable
{
    public Word()
    {
        Lines = new List();
    }
    public string WordStr { get; set; }
    public List Lines { get; set; }

    public int CompareTo(Word other)
    {
        return string.Compare(this.WordStr, other.WordStr, StringComparison.OrdinalIgnoreCase);
    }
}


And the method that parses the file (I suspect I am not doing something properly here):

```
private void AddFile(string path)
{
Regex regex = new Regex("[^A-Za-z0-9\\s\\']");
FileInfo fi = new FileInfo(path);
if (!fi.Exists || Files.Contains(path.ToLower())) //File does not exist or file already indexed
{
return;
}
Files.Add(path.ToLower());
StreamReader sr = new StreamReader(path);
string file = sr.ReadToEnd();
string saniFile = regex.Replace(file, "");
string[] saniLines = saniFile.Split(new char[]{'\r', '\n'}, StringSplitOptions.RemoveEmptyEntries);
int lineNo = 1;
foreach (var l in saniLines)
{
Line line = new Line(l, path, lineNo);
string[] words = l.Split(' ');
foreach (var word in words)
{
Word w = new Word();
w.WordStr = word;

if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
if (!wordToAdd.Lines.Contains(line))
wordToAdd.Lines.Add(line);
}
else
{
w.Lines.Add(line);
Words.Add(w);
}
}
lineNo++;

Solution

Your instincts are basically right that maybe a dictionary of some sort could help. Let's look at some key lines of your code.

foreach (var l in saniLines)
{
    Line line = new Line(l, path, lineNo);
    string[] words = l.Split(' ');
    foreach (var word in words)
    {
        Word w = new Word();
        w.WordStr = word;

        if (Words.Contains(w, new WordComparer())) //Set already contains the word
        {
            Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
            if (!wordToAdd.Lines.Contains(line))
                wordToAdd.Lines.Add(line);


I see 5 nested loops here. If you step through the code, you'll see them, too. The outer 2 are obvious, you've coded them explicitly. But stepping through the code will reveal the looping being done at Words.Contains. The Linq query is also an abstracted loop. And wordToAdd.Lines.Contains is also going to be a loop. You're going to pay a high price for these.

Profiling will always help. But I would suggest perhaps changing Words to some sort of IDictionary, where you can replace

if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
    Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();


With something more like

if (Words.ContainsKey(w.WordStr))
{
    Word wordToAdd = Words[w.WordStr];
    // ...
}


And you could also change Lines into a HashSet instead of a list. That should improve the performance of

if (!wordToAdd.Lines.Contains(line))
    wordToAdd.Lines.Add(line);


In fact, HashSet.Add(T val) could be used by itself, without the call to Contains, as Add returns a bool indicating if the addition could be performed. If a matching value is already in the set, it simply returns false without modifying its contents.

Try these changes, measure the performance before and after and see where you are.

Unrelated, but I also recommend that you wrap your StreamReader in a using () { } block so that the resource is properly disposed when you're through with it.

Code Snippets

foreach (var l in saniLines)
{
    Line line = new Line(l, path, lineNo);
    string[] words = l.Split(' ');
    foreach (var word in words)
    {
        Word w = new Word();
        w.WordStr = word;

        if (Words.Contains(w, new WordComparer())) //Set already contains the word
        {
            Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
            if (!wordToAdd.Lines.Contains(line))
                wordToAdd.Lines.Add(line);
if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
    Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
if (Words.ContainsKey(w.WordStr))
{
    Word wordToAdd = Words[w.WordStr];
    // ...
}
if (!wordToAdd.Lines.Contains(line))
    wordToAdd.Lines.Add(line);

Context

StackExchange Code Review Q#2269, answer score: 14

Revisions (0)

No revisions yet.