How to optimize C# console application
Problem
I'm wondering whether there is any way to optimize this code.
It must be able to work with big files. For example, formatting a raw 140 KB .txt file (12.5k words) takes 2 seconds (measured with the Stopwatch class).
Text example: http://pastebin.com/2p88v8EN
Maybe I used some bad techniques here, or there is some part that could be simplified? Maybe multithreading? I'm not familiar with it yet.
I would be grateful for any help!
Code below:
```
class TextManipulations
{
    public string[] wordsDist; // main array, contains words in alphabetic order and output lines
    public void TextFormat(string sourcePath) // creating method that will format our source text according to task
    {
        string textInput = System.IO.File.ReadAllText(sourcePath).ToLower(); // reading text from file, lowercased at start for precise search
        MatchCollection m = Regex.Matches(textInput, @"\b[\w']+\b"); // exact search of all alphanumeric "words" including words with apostrophe
        List<string> words = new List<string>(); // creating List for containing unknown amount of words
        foreach (Match match in m) // assigning all matches to List
        {
            words.Add(match.ToString());
        }
        words.Sort(); // sorting words in alphabetic order
        wordsDist = words.Distinct().ToArray(); // assigning words to main array without duplicates
        System.IO.File.WriteAllLines(@"D:\output.txt", wordsDist); // writing words into txt file to edit in setLineNumbers method
    }
    public void setLineNumbers(string sourcePath) // creating method for adding line numbers
    {
        string[] linesOutput = new string[wordsDist.Count()];
```
Solution
The rewritten code is listed under Code Snippets below; the notes here explain the changes.
TextFormat
Using a SortedSet<string> is better because no additional Sort() or Distinct() calls are needed; the set keeps its elements sorted and free of duplicates by itself.
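As a minimal sketch of that point, the helper below collects words into a SortedSet<string> and returns them sorted and de-duplicated. The class and method names (`SortedSetSketch`, `UniqueSortedWords`) are made up for illustration, and the `StringComparer.Ordinal` comparer is an assumption added here so the order does not depend on the machine's culture; the rewritten code uses the default comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SortedSetSketch
{
    // Same word pattern as in the rewritten code.
    private const string WordPattern = @"\b[\w']+\b";

    // Returns the distinct words of the text in sorted order.
    public static string[] UniqueSortedWords(string text)
    {
        // Assumption: ordinal ordering, chosen here for culture-independent results.
        var words = new SortedSet<string>(StringComparer.Ordinal);
        foreach (Match match in Regex.Matches(text.ToLowerInvariant(), WordPattern))
        {
            // Add returns false for a duplicate and leaves the set unchanged.
            words.Add(match.Value);
        }
        return words.ToArray();
    }
}
```

For example, `SortedSetSketch.UniqueSortedWords("The cat and the hat")` yields `and`, `cat`, `hat`, `the`, with no separate sorting or de-duplication step.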
SetLineNumbers
The main problem was that the Regex used for testing was created anew inside the second for loop on every iteration, instead of being created once and then reused in all iterations. The parallel parts aren't necessary, but they speed up the code a little.
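The idea of building each Regex once can be sketched on in-memory data like this. `HoistedRegexSketch` and `FindLineNumbers` are hypothetical names, and the `Regex.Escape` call is an addition of this sketch (it should be harmless for words matched by `[\w']+`), not something the rewritten code does.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HoistedRegexSketch
{
    // For each word, returns the 1-based numbers of the lines that contain it.
    public static List<int>[] FindLineNumbers(string[] words, string[] lines)
    {
        // One Regex per word, built once before any line is scanned.
        // Regex.Escape is an extra safeguard added in this sketch.
        var regexes = words
            .Select(word => new Regex(@"\b" + Regex.Escape(word) + @"\b"))
            .ToArray();

        var result = new List<int>[words.Length];
        for (var j = 0; j < regexes.Length; j++)
        {
            var hits = new List<int>();
            for (var i = 0; i < lines.Length; i++)
            {
                // The inner loop only reuses the prebuilt Regex.
                if (regexes[j].IsMatch(lines[i]))
                {
                    hits.Add(i + 1);
                }
            }
            result[j] = hits;
        }
        return result;
    }
}
```

Each outer iteration writes only to its own slot of `result`, which is why the outer loop can be swapped for `Parallel.For` without any locking.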
Results
On my machine your code ran in a little over 2 seconds, while my version finishes in around 0.25 seconds.
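If you want to repeat such a comparison yourself, a small Stopwatch helper along these lines should do. `TimingSketch` and `Measure` are hypothetical names, and a single run is only a rough indication; several runs give steadier numbers.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Usage would look like `var elapsed = TimingSketch.Measure(() => manipulations.SetLineNumbers(source, output));`, where `manipulations`, `source` and `output` stand for your own instance and paths.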
Other things
Do not ever write comments at the line endings, because they are really useless and annoying.
I've added an additional parameter to each method that specifies the output file. (But the code still isn't really reusable or testable.)
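One possible way towards testability, sketched under assumptions: keep the logic in a method that works on in-memory lines and touch the file system only in a thin wrapper. Note that this is not the approach of the rewritten code. It swaps the per-word regex search for a single pass over the lines, so results can differ in edge cases (for example `\bdon\b` also matches a line that only contains `don't`, while this sketch counts `don't` as its own word). All names here (`WordIndexSketch`, `BuildIndex`, `BuildIndexFile`, `separator`) are hypothetical, and `out var` needs C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordIndexSketch
{
    private const string WordPattern = @"\b[\w']+\b";

    // Pure core: takes lines, returns output lines. No file access, so it can be unit-tested.
    public static string[] BuildIndex(IEnumerable<string> lines, string separator)
    {
        // SortedDictionary keeps the words ordered; each word maps to its line numbers.
        var index = new SortedDictionary<string, List<int>>(StringComparer.Ordinal);
        var lineNumber = 0;
        foreach (var line in lines)
        {
            lineNumber++;
            foreach (Match match in Regex.Matches(line.ToLowerInvariant(), WordPattern))
            {
                if (!index.TryGetValue(match.Value, out var numbers))
                {
                    numbers = new List<int>();
                    index[match.Value] = numbers;
                }
                // Record each line at most once per word.
                if (numbers.Count == 0 || numbers[numbers.Count - 1] != lineNumber)
                {
                    numbers.Add(lineNumber);
                }
            }
        }
        return index.Select(e => e.Key + separator + string.Join(", ", e.Value)).ToArray();
    }

    // Thin wrapper: the only place that touches the file system.
    public static void BuildIndexFile(string sourcePath, string outputPath, string separator)
    {
        File.WriteAllLines(outputPath, BuildIndex(File.ReadLines(sourcePath), separator));
    }
}
```

With this split, a test can call `BuildIndex` on a handful of hard-coded lines and compare the returned strings, without creating any files.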
Code Snippets
```
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class TextManipulations
{
    private const string AlphanumericWords = @"\b[\w']+\b";
    public string[] WordsDist;

    public void TextFormat(string sourcePath, string output)
    {
        // ToLowerInvariant matches the lowercasing used in SetLineNumbers.
        var textInput = File.ReadAllText(sourcePath).ToLowerInvariant();
        var m = Regex.Matches(textInput, AlphanumericWords);
        // The set keeps the words sorted and drops duplicates.
        var words = new SortedSet<string>();
        foreach (Match match in m)
        {
            words.Add(match.ToString());
        }
        WordsDist = words.ToArray();
        File.WriteAllLines(output, WordsDist);
    }

    public void SetLineNumbers(string sourcePath, string output)
    {
        // AsOrdered keeps the lines in file order; PLINQ does not guarantee order otherwise.
        var lines = File.ReadAllLines(sourcePath).AsParallel().AsOrdered().Select(x => x.ToLowerInvariant()).ToArray();
        // Create every Regex once, up front, instead of inside the loops.
        var regExs = WordsDist.Select(word => new Regex("\\b" + word + "\\b")).ToArray();
        var wordsOut = new string[regExs.Length];
        Parallel.For(0, regExs.Length, j =>
        {
            var sb = new List<int>(15);
            var regEx = regExs[j];
            for (var i = 0; i < lines.Length; i++)
            {
                if (regEx.IsMatch(lines[i]))
                {
                    sb.Add(i + 1);
                }
            }
            // Each iteration writes only to its own slot, so no locking is needed.
            wordsOut[j] = WordsDist[j] + "_______________________________" + string.Join(", ", sb);
        });
        File.WriteAllLines(output, wordsOut);
    }
}
```
Context
StackExchange Code Review Q#35532, answer score: 5