HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Nearest Neighbour classification algorithm

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
nearestclassificationalgorithmneighbour

Problem

The following code is from a university assignment of mine to write a classification algorithm (using nearest neighbour) to classify whether or not a given feature set (each feature is the frequency of words in an email) is spam or not.

We are given a CSV file (the training data) with the frequencies, and an integer (1 or 0) at the end of the row indicating whether or not it is spam.
So in this form:

X1,X2,...,Xn,SPAM


The test data is also in this form (including the SPAM column, so we can verify the accuracy).

My question is, how can I make this code more idiomatic, and what speedups can I make to this code? For example, is there a way to write getMostCommon, without having to groupBy and then run a maximumBy again?

```
import Text.CSV
import Data.List

data SpamType = Spam | NotSpam
deriving (Show, Eq)

type FeatureSet = [Float]

toSpam :: Int -> SpamType
toSpam 0 = NotSpam
toSpam 1 = Spam
toSpam a = NotSpam

parseClassifiedRecord :: Record -> (FeatureSet, SpamType)
parseClassifiedRecord x = (init converted, toSpam (truncate (last converted)))
where
converted = map (\val -> read val :: Float) x

-- returns the euclidean distance between two feature sets
difference :: FeatureSet -> FeatureSet -> Float
difference first second = sqrt (sum (zipWith (\x y -> (x - y)^2) first second))

-- finds the SpamType of the k nearest 'nodes' in the training set
findKNearest :: [(FeatureSet, SpamType)] -> FeatureSet -> Int -> [SpamType]
findKNearest trainingSet toMatch k = take k (map snd (sortBy (\x y -> (compare (fst x) (fst y))) [(difference (fst x) toMatch, snd x) | x [a] -> a
getMostCommon list = head (maximumBy (\x y -> (compare (length x) (length y))) (groupBy (\x y -> (x == y)) list))

-- given a feature set, returns an ordered (i.e. same order as input)
-- list of whether or not feature is spam or not spam
-- looks at the closest k neighbours
classify :: [(FeatureSet, SpamType)] -> FeatureSet -> Int -> SpamType
classify trainingSet

Solution

One helpful function is Data.Ord.comparing:

comparing :: (Ord a) => (b -> a) -> b -> b -> Ordering
comparing p x y = compare (p x) (p y)


With this and using function composition you can write:

getMostCommon :: (Eq a) => [a] -> a
getMostCommon = head . maximumBy (comparing length) . groupBy (==)


Similar changes apply to several places. Another small simplification could be changing (\x -> (fst x) == (snd x)) to uncurry (==).

Code Snippets

comparing :: (Ord a) => (b -> a) -> b -> b -> Ordering
comparing p x y = compare (p x) (p y)
getMostCommon :: (Eq a) => [a] -> a
getMostCommon = head . maximumBy (comparing length) . groupBy (==)

Context

StackExchange Code Review Q#15721, answer score: 4

Revisions (0)

No revisions yet.