
Log-reading & String-matching with hashtable for fastest execution speed

Submitted by: @import:stackexchange-codereview

Problem

I'm working on this exercise for fun: I want to write optimized code that searches each line of a (very large) file for particular strings, counts how often each one occurs, and writes those counts to a text file.

The data format is very consistent, so when I call IndexOf() on the lines of my logs, every line I search conforms to this:

2011-05-13 00:00:00 195.249.159.77 GET /blahblah/blah/lol.png - 80 - 141.166.254.22 Mozilla/5.0+(Windows+NT+6.1;+rv:2.0b12)+Gecko/20100101+Firefox/4.0b12 http://lolcats.com/content/styles.css?x=a9d6c00 lolcats.com


(hence the specific indexes)
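To make the index arithmetic concrete, here is a minimal sketch of the extraction step applied to the sample line above. The names `LogLineParser` and `TryExtractPath` are illustrative and not from the original code. The sketch assumes a 20-character timestamp prefix followed by a dotted IPv4 address of 7 to 15 characters, which is one reading of why the search window starts at index 27 and spans 13 characters. Unlike the original, it clamps the window to the line length and searches for the " - " separator from the start of the path rather than from index 27.

```csharp
using System;

public static class LogLineParser
{
    // Hypothetical helper: returns true and sets `path` to the text between the
    // method token and the first " - " separator; returns false when the line
    // does not match the expected layout.
    public static bool TryExtractPath(string line, out string path)
    {
        path = string.Empty;

        // Same guard as the original: shorter lines cannot hold the method token.
        if (line == null || line.Length <= 27)
            return false;

        // IndexOf(value, startIndex, count, comparison) throws when
        // startIndex + count runs past the end, so clamp the window.
        int window = Math.Min(13, line.Length - 27);
        int methodIndex = line.IndexOf("get", 27, window, StringComparison.OrdinalIgnoreCase);
        if (methodIndex < 0)
            return false;

        int pathStart = methodIndex + 4; // skip "GET" and the space after it
        if (pathStart > line.Length)
            return false;

        int separatorIndex = line.IndexOf(" - ", pathStart, StringComparison.Ordinal);
        if (separatorIndex < 0)
            return false;

        path = line.Substring(pathStart, separatorIndex - pathStart);
        return true;
    }

    public static void Main()
    {
        const string sample =
            "2011-05-13 00:00:00 195.249.159.77 GET /blahblah/blah/lol.png - 80 - " +
            "141.166.254.22 Mozilla/5.0+(Windows+NT+6.1;+rv:2.0b12)+Gecko/20100101+Firefox/4.0b12 " +
            "http://lolcats.com/content/styles.css?x=a9d6c00 lolcats.com";

        string path;
        if (TryExtractPath(sample, out path))
            Console.WriteLine(path); // prints "/blahblah/blah/lol.png"
    }
}
```

In the sample line, "GET" starts at index 35, so the path starts at index 39 and runs up to the first " - ".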

```
private static void Main()
{
    var sw = new Stopwatch();
    var ht = new Hashtable();
    var ts = new TimeSpan();
    var files = Directory.GetFiles(Directory.GetCurrentDirectory());
    var tempPath = Path.GetTempFileName();
    var tempOutput = new StreamWriter(tempPath);

    const int iterations = 5;

    for (int i = 0; i < iterations; i++)
    {
        // ... (lost in extraction: the code between the loop header and this
        //      check, which reads each `line` from the files) ...
                if (line.Length > 27)
                {
                    var lineHit = line.IndexOf("get", 27, 13, StringComparison.OrdinalIgnoreCase);

                    if (lineHit > -1)
                    {
                        var relevantDataIndex = line.IndexOf(" - ", 27,
                            StringComparison.OrdinalIgnoreCase);

                        var relevantData = line.Substring(lineHit + 4, relevantDataIndex - (lineHit + 4));

                        if (ht.Contains(relevantData))
                        {
                            ht[relevantData] = (int) ht[relevantData] + 1;
                        }
                        else
                        {
                            ht.Add(relevantData, 1);
                        }
                    }
                }
            }
        }
        sw.Stop();
        ts += sw.Elapsed;
        Console.WriteLine("tim
```

Solution

- Try to break your code into separate methods that each take care of one thing, such as counting the words or copying the output. Otherwise your code will get messy quickly, and you'll have a hard time trying out different implementations and comparing them against each other.

- Your use of a for loop with the StreamReader is rather unusual. Typically you'd use a while loop like:

string line;
while ((line = sr.ReadLine()) != null)
{
    ...
}

or

while (!sr.EndOfStream)
{
    var line = sr.ReadLine();
    ...
}

Both versions convey the semantics better than the for loop (imho).

- You can reduce nesting a bit by using continue. E.g.

string line;
while ((line = sr.ReadLine()) != null)
{
    if (line.Length <= 27)
        continue;

    ...
}

- Instead of a Hashtable you should use a Dictionary<string, int>. This avoids boxing the value: a Hashtable stores its values as object, which requires boxing for value types like int.

- I'm not sure why you think asynchronous methods would make your code faster. Asynchronous processing is not free and incurs overhead.

- StreamWriter is IDisposable, so the writer (tempOutput in the listing above) should be wrapped in a using block.

- I'm not entirely sure why you write the output to a temporary file first and then append it to the output file. You could simply append it to the final output file directly.

- You are only measuring specific parts of the code. All the output writing and copying around is not free and takes time. From an end user's perspective, I don't care if your application can count specific lines in a file in less than a second when it spends much longer copying stuff around.
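The points above can be combined into one small program. This is a minimal sketch, not the original author's code: the names `RequestCounter`, `CountPaths` and `WriteCounts`, the tab-separated output format and the two command-line arguments are all assumptions made for illustration. It uses separate methods, a while loop with continue, a Dictionary<string, int> instead of a Hashtable, using blocks for the reader and writer, a direct append to the final output file, and a Stopwatch that covers the output step as well.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class RequestCounter
{
    // Counts requested paths from any TextReader (a StreamReader over a log
    // file, or a StringReader in a test).
    public static Dictionary<string, int> CountPaths(TextReader reader)
    {
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length <= 27)
                continue;

            // Clamp the 13-character window so IndexOf cannot run past the end.
            int window = Math.Min(13, line.Length - 27);
            int methodIndex = line.IndexOf("get", 27, window, StringComparison.OrdinalIgnoreCase);
            if (methodIndex < 0)
                continue;

            int pathStart = methodIndex + 4; // skip "GET" and the space after it
            if (pathStart > line.Length)
                continue;

            int separatorIndex = line.IndexOf(" - ", pathStart, StringComparison.Ordinal);
            if (separatorIndex < 0)
                continue;

            string path = line.Substring(pathStart, separatorIndex - pathStart);

            int current;
            counts.TryGetValue(path, out current); // current is 0 when the key is absent
            counts[path] = current + 1;            // no boxing: the value is a plain int
        }
        return counts;
    }

    // Writes one "path<TAB>count" line per entry (an assumed output format).
    public static void WriteCounts(Dictionary<string, int> counts, TextWriter writer)
    {
        foreach (var pair in counts)
            writer.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("usage: RequestCounter <log file> <output file>");
            return;
        }

        // Time the whole job, output included, since that is what a user waits for.
        var sw = Stopwatch.StartNew();

        Dictionary<string, int> counts;
        using (var reader = new StreamReader(args[0]))
        {
            counts = CountPaths(reader);
        }

        // Append straight to the final file; `using` flushes and closes the writer.
        using (var writer = new StreamWriter(args[1], true))
        {
            WriteCounts(counts, writer);
        }

        sw.Stop();
        Console.WriteLine("total time: {0} ms", sw.ElapsedMilliseconds);
    }
}
```

Because `CountPaths` takes a TextReader rather than a file name, alternative implementations can be compared against the same in-memory input without touching the disk.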

Context

StackExchange Code Review Q#80608, answer score: 4
