patterncsharpModerate
Slow-running File parser
Viewed 0 times
parserfilerunningslow
Problem
I've got a method that parses a file. I take all the words and add them to a
And the method that parses the file (I suspect I am not doing something properly here):
```
private void AddFile(string path)
{
Regex regex = new Regex("[^A-Za-z0-9\\s\\']");
FileInfo fi = new FileInfo(path);
if (!fi.Exists || Files.Contains(path.ToLower())) //File does not exist or file already indexed
{
return;
}
Files.Add(path.ToLower());
StreamReader sr = new StreamReader(path);
string file = sr.ReadToEnd();
string saniFile = regex.Replace(file, "");
string[] saniLines = saniFile.Split(new char[]{'\r', '\n'}, StringSplitOptions.RemoveEmptyEntries);
int lineNo = 1;
foreach (var l in saniLines)
{
Line line = new Line(l, path, lineNo);
string[] words = l.Split(' ');
foreach (var word in words)
{
Word w = new Word();
w.WordStr = word;
if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
if (!wordToAdd.Lines.Contains(line))
wordToAdd.Lines.Add(line);
}
else
{
w.Lines.Add(line);
Words.Add(w);
}
}
lineNo++;
SortedSet. Every word contains a list of Lines that contain said word. Words are not strings but a class I created:class Word : IComparable
{
public Word()
{
Lines = new List();
}
public string WordStr { get; set; }
public List Lines { get; set; }
public int CompareTo(Word other)
{
return string.Compare(this.WordStr, other.WordStr, StringComparison.OrdinalIgnoreCase);
}
}And the method that parses the file (I suspect I am not doing something properly here):
```
private void AddFile(string path)
{
Regex regex = new Regex("[^A-Za-z0-9\\s\\']");
FileInfo fi = new FileInfo(path);
if (!fi.Exists || Files.Contains(path.ToLower())) //File does not exist or file already indexed
{
return;
}
Files.Add(path.ToLower());
StreamReader sr = new StreamReader(path);
string file = sr.ReadToEnd();
string saniFile = regex.Replace(file, "");
string[] saniLines = saniFile.Split(new char[]{'\r', '\n'}, StringSplitOptions.RemoveEmptyEntries);
int lineNo = 1;
foreach (var l in saniLines)
{
Line line = new Line(l, path, lineNo);
string[] words = l.Split(' ');
foreach (var word in words)
{
Word w = new Word();
w.WordStr = word;
if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
if (!wordToAdd.Lines.Contains(line))
wordToAdd.Lines.Add(line);
}
else
{
w.Lines.Add(line);
Words.Add(w);
}
}
lineNo++;
Solution
Your instincts are basically right that maybe a dictionary of some sort could help. Let's look at some key lines of your code.
I see 5 nested loops here. If you step through the code, you'll see them, too. The outer 2 are obvious, you've coded them explicitly. But stepping through the code will reveal the looping being done at
Profiling will always help. But I would suggest perhaps changing
With something more like
And you could also change
In fact,
Try these changes, measure the performance before and after and see where you are.
Unrelated, but I also recommend that you wrap your
foreach (var l in saniLines)
{
Line line = new Line(l, path, lineNo);
string[] words = l.Split(' ');
foreach (var word in words)
{
Word w = new Word();
w.WordStr = word;
if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
if (!wordToAdd.Lines.Contains(line))
wordToAdd.Lines.Add(line);I see 5 nested loops here. If you step through the code, you'll see them, too. The outer 2 are obvious, you've coded them explicitly. But stepping through the code will reveal the looping being done at
Words.Contains. The Linq query is also an abstracted loop. And wordToAdd.Lines.Contains is also going to be a loop. You're going to pay a high price for these. Profiling will always help. But I would suggest perhaps changing
Words to some sort of IDictionary, where you can replace if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();With something more like
if (Words.ContainsKey(w.WordStr))
{
Word wordToAdd = Words[w.WordStr];
// ...
}And you could also change
Lines into a HashSet instead of a list. That should improve the performance of if (!wordToAdd.Lines.Contains(line))
wordToAdd.Lines.Add(line);In fact,
HashSet.Add(T val) could be used by itself, without the call to Contains, as Add returns a bool indicating if the addition could be performed. If a matching value is already in the set, it simply returns false without modifying its contents.Try these changes, measure the performance before and after and see where you are.
Unrelated, but I also recommend that you wrap your
StreamReader in a using () { } block so that the resource is properly disposed when you're through with it.Code Snippets
foreach (var l in saniLines)
{
Line line = new Line(l, path, lineNo);
string[] words = l.Split(' ');
foreach (var word in words)
{
Word w = new Word();
w.WordStr = word;
if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();
if (!wordToAdd.Lines.Contains(line))
wordToAdd.Lines.Add(line);if (Words.Contains(w, new WordComparer())) //Set already contains the word
{
Word wordToAdd = (from wors in Words where wors.WordStr.ToLower() == w.WordStr.ToLower() select wors).First();if (Words.ContainsKey(w.WordStr))
{
Word wordToAdd = Words[w.WordStr];
// ...
}if (!wordToAdd.Lines.Contains(line))
wordToAdd.Lines.Add(line);Context
StackExchange Code Review Q#2269, answer score: 14
Revisions (0)
No revisions yet.