patternMinor
Naive Bayes classifier
Viewed 0 times
bayesclassifiernaive
Problem
I just started learning Haskell, and I must say it is unlike anything else. As my first not-entirely-trivial piece of code I tried to write a simple Bayes Classifier. It is in two parts, a classifier module and a cli.
What could be done to improve it? Is it entirely awful? Is it too imperative? Any and all feedback appreciated. Here are the two files:
The Main Classifier:
``
-- Get Probability for eac
What could be done to improve it? Is it entirely awful? Is it too imperative? Any and all feedback appreciated. Here are the two files:
The Main Classifier:
``
module BayesClassifier where
-- Text Classifier Using Bayes Formula
import Data.List
import Data.Char
type Category = String
newtype Classifier = Classifier { training :: [(Category, [String])] } deriving (Eq, Show)
-- Get a new classifer with no training
classifier :: Classifier
classifier = Classifier []
-- classifier probabilities
probabilityOfWordInCategory :: Classifier -> String -> Category -> Double
probabilityOfCategory :: Classifier -> Category -> Double
-- Adding + 1 for Laplacian Correction
probabilityOfWordInCategory (Classifier training) word category = let allInCategory = filter (\(cat, _) -> cat == category) training
allInCategoryContainingWord = filter (\(_, text) -> word elem text) allInCategory
in (fromIntegral $ length allInCategoryContainingWord + 1) / (fromIntegral $ length allInCategory + 1)
probabilityOfCategory (Classifier training) category = let allInCategory = filter (\(cat, _) -> cat == category) training
in (fromIntegral $ length allInCategory) / (fromIntegral $ length training)
-- Train a classifier
train :: Classifier -> String -> Category -> Classifier
train (Classifier training ) text category = Classifier $ (category, cleanInput $ text):training
-- Categorize text with a classifier
classify :: Classifier -> String -> Category
classify classifier text = fst $ head $ sortBy (\(_, a) (_, b) -> b compare` a) $ probabilities classifier text-- Get Probability for eac
Solution
Rather nice attempt :)
A small advice first. Do not use Category as a type name. It may confuse haskell people who might mistake it for a completely different thing.
I added the unit test import here, and the set.
It is rather adviced in haskell to look out for opportunities to take out
general definitions when possible from a more complicated expression. The idea is to build the language to describe your problem and then use it to solve the problem by describing it. So lots of tiny general purpose functions are very good.
I prefer the below version because liftM2 follows naturally and is quite readable.
stopWords is clearly a constant. So take its construction out. Also see that a set is used. It makes the member operation cheaper.
Where possible use function composition. From my experience, it makes it easier to understand the essence of the function. Also if there is a choice between composing functions and using parenthesis or $, go for (.), That would make refactoring easier later. Use currying in preference to explicit lambdas.
Run tt to execute the test.
Try to restrict your line width. First, it makes it easier to read your code, and second, it makes you on the lookout for refactoring opportunities.
Note that the way you declared your data structure for Classifier, you get the function 'training' defined. So there is no point in extracting it using @ as you did.
Thus we take out the wrapfn because it is nice and generic.
We can do the same with classify
And our extract max is now nice and clean.
The training function is clean, but remove the extra $, and make use of your
training accessing function from classify.
The changes here should be self evident now.
Also, make use of precedence of operators and avoid extra parenthesis. They are ugly :)
A small advice first. Do not use Category as a type name. It may confuse haskell people who might mistake it for a completely different thing.
import Data.List
import Data.Char
import qualified Data.Set as Set
import qualified Control.Monad as Ctl
import Test.HUnitI added the unit test import here, and the set.
It is rather adviced in haskell to look out for opportunities to take out
general definitions when possible from a more complicated expression. The idea is to build the language to describe your problem and then use it to solve the problem by describing it. So lots of tiny general purpose functions are very good.
onlyLetters = filter (\x -> isLetter x || isSpace x)I prefer the below version because liftM2 follows naturally and is quite readable.
onlyLetters = filter (Ctl.liftM2 (||) isLetter isSpace)stopWords is clearly a constant. So take its construction out. Also see that a set is used. It makes the member operation cheaper.
stopWords = Set.fromAscList $ sort ["a","about","you","yourself","yourselves"]Where possible use function composition. From my experience, it makes it easier to understand the essence of the function. Also if there is a choice between composing functions and using parenthesis or $, go for (.), That would make refactoring easier later. Use currying in preference to explicit lambdas.
cleanInput :: String -> [String]
cleanInput = filter (flip Set.notMember stopWords) . words . clean
where clean = onlyLetters . (map toLower)Run tt to execute the test.
tests = TestList [TestLabel "clean" testClean]
tt = runTestTT tests
tInput = "I and you and elephant"
testClean = TestCase $ assertEqual
"cleanInput"
["i","and","and","elephant"] (cleanInput tInput)Try to restrict your line width. First, it makes it easier to read your code, and second, it makes you on the lookout for refactoring opportunities.
-- Get Probability for a passage in a certain category
probabilityForCategory :: Classifier -> String -> Category -> Double
probabilityForCategory classifier text cat = sum $ pfst cat : plst cat (cleanInput text)
where plst cat = map (log . probabilityOfWordInCategory classifier cat)
pfst = log . probabilityOfCategory classifierNote that the way you declared your data structure for Classifier, you get the function 'training' defined. So there is no point in extracting it using @ as you did.
-- Get Probability for each Category
probabilities :: Classifier -> String -> [(Category, Double)]
probabilities classifier text = map (wrap text) $ nub $ map fst (training classifier)
where wrap = wrapfn . probabilityForCategory classifierThus we take out the wrapfn because it is nice and generic.
wrapfn fn cat = (cat, fn cat)We can do the same with classify
-- Categorize text with a classifier
classify :: Classifier -> String -> Category
classify = ((fst . extMax snd) .) . probabilitiesAnd our extract max is now nice and clean.
extMax :: Ord b => (x -> b) -> [x] -> x
extMax fn = maximumBy (compare `F.on` fn)The training function is clean, but remove the extra $, and make use of your
training accessing function from classify.
-- How to train a dragon
train :: Classifier -> String -> Category -> Classifier
train c text category = Classifier $ (category, cleanInput text) : training cThe changes here should be self evident now.
filterElem :: Eq a1 => (a -> [a1]) -> a1 -> [a] -> [a]
filterElem fn word = filter ((elem word) . fn)
filterEq :: Eq b => (a -> b) -> b -> [a] -> [a]
filterEq fn x = filter ((x ==) . fn)
allInCategory :: Eq a => a -> [(a, b)] -> [(a, b)]
allInCategory = filterEq fst
fl = fromIntegral . length
-- Adding + 1 for Laplacian Correction
probabilityOfWordInCategory :: Classifier -> String -> Category -> Double
probabilityOfWordInCategory (Classifier training) word category = succ a / succ b
where a = fl $ (filterElem snd) word y
b = fl y
y = allInCategory category training
probabilityOfCategory :: Classifier -> Category -> Double
probabilityOfCategory (Classifier training) category = a / b
where a = fl (allInCategory category training)
b = fl trainingAlso, make use of precedence of operators and avoid extra parenthesis. They are ugly :)
Code Snippets
import Data.List
import Data.Char
import qualified Data.Set as Set
import qualified Control.Monad as Ctl
import Test.HUnitonlyLetters = filter (\x -> isLetter x || isSpace x)onlyLetters = filter (Ctl.liftM2 (||) isLetter isSpace)stopWords = Set.fromAscList $ sort ["a","about","you","yourself","yourselves"]cleanInput :: String -> [String]
cleanInput = filter (flip Set.notMember stopWords) . words . clean
where clean = onlyLetters . (map toLower)Context
StackExchange Code Review Q#12359, answer score: 7
Revisions (0)
No revisions yet.