patternMinor
Nearest Neighbour classification algorithm
Viewed 0 times
nearestclassificationalgorithmneighbour
Problem
The following code is from a university assignment of mine to write a classification algorithm (using nearest neighbour) to classify whether or not a given feature set (each feature is the frequency of words in an email) is spam or not.
We are given a CSV file (the training data) with the frequencies, and an integer (1 or 0) at the end of the row indicating whether or not it is spam.
So in this form:
The test data is also in this form (including the SPAM column, so we can verify the accuracy).
My question is, how can I make this code more idiomatic, and what speedups can I make to this code? For example, is there a way to write
```
import Text.CSV
import Data.List
data SpamType = Spam | NotSpam
deriving (Show, Eq)
type FeatureSet = [Float]
toSpam :: Int -> SpamType
toSpam 0 = NotSpam
toSpam 1 = Spam
toSpam a = NotSpam
parseClassifiedRecord :: Record -> (FeatureSet, SpamType)
parseClassifiedRecord x = (init converted, toSpam (truncate (last converted)))
where
converted = map (\val -> read val :: Float) x
-- returns the euclidean distance between two feature sets
difference :: FeatureSet -> FeatureSet -> Float
difference first second = sqrt (sum (zipWith (\x y -> (x - y)^2) first second))
-- finds the SpamType of the k nearest 'nodes' in the training set
findKNearest :: [(FeatureSet, SpamType)] -> FeatureSet -> Int -> [SpamType]
findKNearest trainingSet toMatch k = take k (map snd (sortBy (\x y -> (compare (fst x) (fst y))) [(difference (fst x) toMatch, snd x) | x [a] -> a
getMostCommon list = head (maximumBy (\x y -> (compare (length x) (length y))) (groupBy (\x y -> (x == y)) list))
-- given a feature set, returns an ordered (i.e. same order as input)
-- list of whether or not feature is spam or not spam
-- looks at the closest k neighbours
classify :: [(FeatureSet, SpamType)] -> FeatureSet -> Int -> SpamType
classify trainingSet
We are given a CSV file (the training data) with the frequencies, and an integer (1 or 0) at the end of the row indicating whether or not it is spam.
So in this form:
X1,X2,...,Xn,SPAMThe test data is also in this form (including the SPAM column, so we can verify the accuracy).
My question is, how can I make this code more idiomatic, and what speedups can I make to this code? For example, is there a way to write
getMostCommon, without having to groupBy and then run a maximumBy again?```
import Text.CSV
import Data.List
data SpamType = Spam | NotSpam
deriving (Show, Eq)
type FeatureSet = [Float]
toSpam :: Int -> SpamType
toSpam 0 = NotSpam
toSpam 1 = Spam
toSpam a = NotSpam
parseClassifiedRecord :: Record -> (FeatureSet, SpamType)
parseClassifiedRecord x = (init converted, toSpam (truncate (last converted)))
where
converted = map (\val -> read val :: Float) x
-- returns the euclidean distance between two feature sets
difference :: FeatureSet -> FeatureSet -> Float
difference first second = sqrt (sum (zipWith (\x y -> (x - y)^2) first second))
-- finds the SpamType of the k nearest 'nodes' in the training set
findKNearest :: [(FeatureSet, SpamType)] -> FeatureSet -> Int -> [SpamType]
findKNearest trainingSet toMatch k = take k (map snd (sortBy (\x y -> (compare (fst x) (fst y))) [(difference (fst x) toMatch, snd x) | x [a] -> a
getMostCommon list = head (maximumBy (\x y -> (compare (length x) (length y))) (groupBy (\x y -> (x == y)) list))
-- given a feature set, returns an ordered (i.e. same order as input)
-- list of whether or not feature is spam or not spam
-- looks at the closest k neighbours
classify :: [(FeatureSet, SpamType)] -> FeatureSet -> Int -> SpamType
classify trainingSet
Solution
One helpful function is
With this and using function composition you can write:
Similar changes apply to several places. Another small simplification could be changing
Data.Ord.comparing:comparing :: (Ord a) => (b -> a) -> b -> b -> Ordering
comparing p x y = compare (p x) (p y)With this and using function composition you can write:
getMostCommon :: (Eq a) => [a] -> a
getMostCommon = head . maximumBy (comparing length) . groupBy (==)Similar changes apply to several places. Another small simplification could be changing
(\x -> (fst x) == (snd x)) to uncurry (==).Code Snippets
comparing :: (Ord a) => (b -> a) -> b -> b -> Ordering
comparing p x y = compare (p x) (p y)getMostCommon :: (Eq a) => [a] -> a
getMostCommon = head . maximumBy (comparing length) . groupBy (==)Context
StackExchange Code Review Q#15721, answer score: 4
Revisions (0)
No revisions yet.