HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Fast evaluation of Euclidean distance with sparse data

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fastevaluationwithdistanceeuclideandatasparse

Problem

I am working on a KNN implementation for sparse datasets (that I apply to text analysis). My points are represented by a Dictionary, each key represents a word of the text and the value, its count or TFIDF.

I am using index inversion to reduce the number of distances to evaluate. However, evaluating the distance remains the bottleneck. It is represented in the following method:

public static double Euclide(Dictionary sp1, Dictionary sp2)
    {
        double distance = 0;
        foreach (KeyValuePair kvp1 in sp1)
        {
            if (sp2.ContainsKey(kvp1.Key))
                distance += Math.Pow((kvp1.Value - sp2[kvp1.Key]), 2);
            else
                distance += Math.Pow((kvp1.Value), 2);
        }

        foreach (KeyValuePair kvp2 in sp2)
            if (!sp1.ContainsKey(kvp2.Key))
                distance += Math.Pow((kvp2.Value), 2);

        return distance;
    }


How can I make it faster? Any help would be greatly appreciated.

I think I could reduce the time of evaluation using a Dictionary, but I prefer to stick to strings, as it allows me to see at a glance what is going on (hashing words would compromise that).

The following yieled a 1.3~2.1x improvement (depending on the length of the input, but it is still not enough) :

public static double FastEuclide(Dictionary sp1, Dictionary sp2)
    {
        double distance = 0;
        foreach (KeyValuePair kvp1 in sp1)
        {
            double sp1Value = kvp1.Value;

            if (sp2.ContainsKey(kvp1.Key))
            {
                double sp2Value = sp2[kvp1.Key],
                    diff = kvp1.Value - sp2[kvp1.Key];
                distance += diff * diff;
            }
            else
                distance += sp1Value * sp1Value;
        }

Solution

Calling ContainsKey() on a dictionary if you need to get the value later on should be replaced with TryGetValue() to reduce the processing time.

See: what-is-more-efficient-dictionary-trygetvalue-or-containskeyitem

TryGetValue will be faster.

ContainsKey uses the same check as TryGetValue, which internally refers to the actual entry location. The Item property actually has nearly identical code functionality as TryGetValue, except that it will throw an exception instead of returning false.

Using ContainsKey followed by the item basically duplicates the lookup functionality, which is the bulk of the computation in this case

The next part to increase performance is to replace the call to Math.Pow() with multiplying the value by itself.

So instead of Math.Pow((kvp2.Value), 2) you should use kvp2.Value * kvp2.Value.

You just edited your question to do this but I will keep it in my review.

Using the said above will lead to

public static double Euclide(Dictionary sp1, Dictionary sp2)
{
    double distance = 0;
    
    foreach (KeyValuePair kvp1 in sp1)
    {
        double possibleValue = 0.0d;
        sp2.TryGetValue(kvp1.Key, out possibleValue);

        double currentValue = kvp1.Value - possibleValue;

        distance += currentValue * currentValue;
    }

    foreach (KeyValuePair kvp2 in sp2)
    {
        if (!sp1.ContainsKey(kvp2.Key))
        {
            distance += kvp2.Value * kvp2.Value;
        }
    }
    return distance;
}


This

double possibleValue = 0.0d;
sp2.TryGetValue(kvp1.Key, out possibleValue);

double currentValue = kvp1.Value - possibleValue;


will store the value of the searched key inside possibleValue if the key is in the dictionary. If TryGetValue() does not succeed it will return default(T) which for a double is 0.0.

I have added braces {} for the for and the if which makes the code less error prone. I would like to encourage you to always use braces.

Code Snippets

public static double Euclide(Dictionary<string, double> sp1, Dictionary<string, double> sp2)
{
    double distance = 0;
    
    foreach (KeyValuePair<string, double> kvp1 in sp1)
    {
        double possibleValue = 0.0d;
        sp2.TryGetValue(kvp1.Key, out possibleValue);

        double currentValue = kvp1.Value - possibleValue;

        distance += currentValue * currentValue;
    }

    foreach (KeyValuePair<string, double> kvp2 in sp2)
    {
        if (!sp1.ContainsKey(kvp2.Key))
        {
            distance += kvp2.Value * kvp2.Value;
        }
    }
    return distance;
}
double possibleValue = 0.0d;
sp2.TryGetValue(kvp1.Key, out possibleValue);

double currentValue = kvp1.Value - possibleValue;

Context

StackExchange Code Review Q#99170, answer score: 6

Revisions (0)

No revisions yet.