HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

K-nearest neighbours in C# for large number of dimensions

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
numbernearestneighbourslargedimensionsfor

Problem

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples and 25 dimensions.

There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation:

```
// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList trainSamples, IList testSamples, IList trainClasses, int K)
{
Console.WriteLine("Performing KNN with K = "+K);

var testResults = new int[testSamples.Count()];

var testNumber = testSamples.Count();
var trainNumber = trainSamples.Count();
// Declaring these here so that I don't have to 'new' them over and over again in the main loop,
// just to save some overhead
var distances = new double[trainNumber][];
for (var i = 0; i
{
var dist = GetDistance(testSamples[tst], trainSamples[trn]);
// Storing distance as well as index
distances[trn][0] = dist;
distances[trn][1] = trn;
});

// Sort distances and take top K (?What happens in case of multiple points at the same distance?)
var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

// Do a 'majority vote' to classify test sample
var yea = 0.0;
var nay = 0.0;

foreach (var voter in votingDistances)
{
if (trainClasses[(int)voter[1]] == 1)
yea++;
else
nay++;
}
if (yea > nay)
testResults[tst] = 1;
else
testResults[tst] = 0;

}

return testResults;
}

// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList sample1, IList sample2)
{
var distance = 0.0;
// assume sample1 and sample2 are valid

Solution

var distances = new double[trainNumber][]; 
for (var i = 0; i < trainNumber; i++)
{
   distances[i] = new double[2]; // Will store both distance and index in here
}


This is a code smell. You shouldn't use a jagged double array to store an array of distances and indexes. Despite the comment, what you're doing is unclear, and it's very confusing to have a variable named distances that stores both distances and indexes. The only justification for this would be if you actually had hard profiling evidence that it caused a significant speedup.

Make a separate class (or struct, if you're worried about overhead) with members double distance; int index; and then trainInfo (the former distances) should just be a trainNumber-sized array of that type.

Also, since you only need the top K elements, you don't need to sort the whole list (n log n time). You ought to be able to do it with a partial sort (actual code sample) which is almost linear-speed. There are also parallel algorithms for this; you could probably get a speedup using PLINQ with a custom aggregate.

On to the next refactoring.

foreach (var voter in votingDistances)
{
    if (trainClasses[(int)voter[1]] == 1)  
       yea++;
    else
       nay++;
}


This code is also crying out for LINQ. How about

yea = votingDistances.AsParallel().Count(voter=>trainClasses[voter.index] == 1);
nay = votingDistances.Count - yea;


And I'd refactor this:

for (var i = 0; i < sample1.Count; i++)
{   
    var temp = sample1[i] - sample2[i];
    distance += temp * temp;
}


to this:

var differences = sample1.AsParallel().Zip(sample2,(s1,s2)=>s1-s2);
distance = differences.Sum(x=>x*x);


though you could get improved performance from a custom aggregate.

Code Snippets

var distances = new double[trainNumber][]; 
for (var i = 0; i < trainNumber; i++)
{
   distances[i] = new double[2]; // Will store both distance and index in here
}
foreach (var voter in votingDistances)
{
    if (trainClasses[(int)voter[1]] == 1)  
       yea++;
    else
       nay++;
}
yea = votingDistances.AsParallel().Count(voter=>trainClasses[voter.index] == 1);
nay = votingDistances.Count - yea;
for (var i = 0; i < sample1.Count; i++)
{   
    var temp = sample1[i] - sample2[i];
    distance += temp * temp;
}
var differences = sample1.AsParallel().Zip(sample2,(s1,s2)=>s1-s2);
distance = differences.Sum(x=>x*x);

Context

StackExchange Code Review Q#56367, answer score: 7

Revisions (0)

No revisions yet.