patterncsharpMinor
K-nearest neighbours in C# for large number of dimensions
Viewed 0 times
numbernearestneighbourslargedimensionsfor
Problem
I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples and 25 dimensions.
There are only two classes, represented by
```
// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList trainSamples, IList testSamples, IList trainClasses, int K)
{
Console.WriteLine("Performing KNN with K = "+K);
var testResults = new int[testSamples.Count()];
var testNumber = testSamples.Count();
var trainNumber = trainSamples.Count();
// Declaring these here so that I don't have to 'new' them over and over again in the main loop,
// just to save some overhead
var distances = new double[trainNumber][];
for (var i = 0; i
{
var dist = GetDistance(testSamples[tst], trainSamples[trn]);
// Storing distance as well as index
distances[trn][0] = dist;
distances[trn][1] = trn;
});
// Sort distances and take top K (?What happens in case of multiple points at the same distance?)
var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);
// Do a 'majority vote' to classify test sample
var yea = 0.0;
var nay = 0.0;
foreach (var voter in votingDistances)
{
if (trainClasses[(int)voter[1]] == 1)
yea++;
else
nay++;
}
if (yea > nay)
testResults[tst] = 1;
else
testResults[tst] = 0;
}
return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList sample1, IList sample2)
{
var distance = 0.0;
// assume sample1 and sample2 are valid
There are only two classes, represented by
'0' and '1' in my implementation. For now, I have the following simple implementation:```
// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList trainSamples, IList testSamples, IList trainClasses, int K)
{
Console.WriteLine("Performing KNN with K = "+K);
var testResults = new int[testSamples.Count()];
var testNumber = testSamples.Count();
var trainNumber = trainSamples.Count();
// Declaring these here so that I don't have to 'new' them over and over again in the main loop,
// just to save some overhead
var distances = new double[trainNumber][];
for (var i = 0; i
{
var dist = GetDistance(testSamples[tst], trainSamples[trn]);
// Storing distance as well as index
distances[trn][0] = dist;
distances[trn][1] = trn;
});
// Sort distances and take top K (?What happens in case of multiple points at the same distance?)
var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);
// Do a 'majority vote' to classify test sample
var yea = 0.0;
var nay = 0.0;
foreach (var voter in votingDistances)
{
if (trainClasses[(int)voter[1]] == 1)
yea++;
else
nay++;
}
if (yea > nay)
testResults[tst] = 1;
else
testResults[tst] = 0;
}
return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList sample1, IList sample2)
{
var distance = 0.0;
// assume sample1 and sample2 are valid
Solution
var distances = new double[trainNumber][];
for (var i = 0; i < trainNumber; i++)
{
distances[i] = new double[2]; // Will store both distance and index in here
}This is a code smell. You shouldn't use a jagged double array to store an array of distances and indexes. Despite the comment, what you're doing is unclear, and it's very confusing to have a variable named
distances that stores both distances and indexes. The only justification for this would be if you actually had hard profiling evidence that it caused a significant speedup.Make a separate
class (or struct, if you're worried about overhead) with members double distance; int index; and then trainInfo (the former distances) should just be a trainNumber-sized array of that type.Also, since you only need the top
K elements, you don't need to sort the whole list (n log n time). You ought to be able to do it with a partial sort (actual code sample) which is almost linear-speed. There are also parallel algorithms for this; you could probably get a speedup using PLINQ with a custom aggregate.On to the next refactoring.
foreach (var voter in votingDistances)
{
if (trainClasses[(int)voter[1]] == 1)
yea++;
else
nay++;
}This code is also crying out for LINQ. How about
yea = votingDistances.AsParallel().Count(voter=>trainClasses[voter.index] == 1);
nay = votingDistances.Count - yea;And I'd refactor this:
for (var i = 0; i < sample1.Count; i++)
{
var temp = sample1[i] - sample2[i];
distance += temp * temp;
}to this:
var differences = sample1.AsParallel().Zip(sample2,(s1,s2)=>s1-s2);
distance = differences.Sum(x=>x*x);though you could get improved performance from a custom aggregate.
Code Snippets
var distances = new double[trainNumber][];
for (var i = 0; i < trainNumber; i++)
{
distances[i] = new double[2]; // Will store both distance and index in here
}foreach (var voter in votingDistances)
{
if (trainClasses[(int)voter[1]] == 1)
yea++;
else
nay++;
}yea = votingDistances.AsParallel().Count(voter=>trainClasses[voter.index] == 1);
nay = votingDistances.Count - yea;for (var i = 0; i < sample1.Count; i++)
{
var temp = sample1[i] - sample2[i];
distance += temp * temp;
}var differences = sample1.AsParallel().Zip(sample2,(s1,s2)=>s1-s2);
distance = differences.Sum(x=>x*x);Context
StackExchange Code Review Q#56367, answer score: 7
Revisions (0)
No revisions yet.