HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMajor

Looping through an array of strings containing approximately 3 million elements

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
loopingcontainingapproximatelymillionelementsarraythroughstrings

Problem

I have a function that loops through an array of strings containing approximately 3 million lines of strings. My problem is this loop is running too slowly and I want to know how to make it faster.

string[] words = { "Hello", "world", "!", "My", "name", "is", "Something", "."};
    List result = new List();

    for (int i = 0; i  i + 2 && FileContains(words[i] + " " + words[i + 1] + " " + words[i + 2]))
        {
            result.Add(new MyClass(words[i] + " " + words[i + 1] + " " + words[i + 2]));
            words[i + 1] = "&&&";
            words[i + 2] = "&&&";
        }
        else if (words.Length > i + 1 && FileContains(words[i] + " " + words[i + 1]))
        {
            result.Add(new MyClass(words[i] + " " + words[i + 1]));
            words[i + 1] = "&&&";
        }
        else
            result.Add(new MyClass(words[i], Parent));

    }


And my FileContains function which loops through the array of strings:

public static bool FileContains(string Text)
{
    foreach (string line in ARRAYOFWORDS)  // Approximately 3 million strings
    {
        if (Text.ToLower() == line.Substring(0, line.LastIndexOf('|')))
        {
            return true;
        }
    }
    return false;
}

Solution

Well you can optimize FileContains very simply - stop calling Text.ToLower() on every iteration!

In fact, at that point it's really simple to rewrite. Here's the code for the original question, which didn't have the Substring part:

public static bool FileContains(string text)
{
    // Variable name changed to save my eyes from the horror.
    return lines.Contains(text.ToLower());
}


If you store the lines in a HashSet instead of an array or a list, it will be blazingly fast.

With the Substring part, I'd create a new HashSet with just those substrings, assuming you want to look up more than once:

HashSet lineStarts;

...

lineStarts = new HashSet(lines.Select(x => x.Substring(0, x.IndexOf('|'));


Then use lineStarts within FileContains.

Note that using ToLower is a fairly crude approach to case sensitivity. You should think carefully about what this needs to do in terms of culture-sensitivity etc. See the relevant MSDN page for lots more information. (Ain't culture fun?)

Code Snippets

public static bool FileContains(string text)
{
    // Variable name changed to save my eyes from the horror.
    return lines.Contains(text.ToLower());
}
HashSet<string> lineStarts;

...

lineStarts = new HashSet<string>(lines.Select(x => x.Substring(0, x.IndexOf('|'));

Context

StackExchange Code Review Q#59328, answer score: 21

Revisions (0)

No revisions yet.