patternMinor

Naive Bayes classifier

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

bayesclassifiernaive

Problem

I just started learning Haskell, and I must say it is unlike anything else. As my first not-entirely-trivial piece of code I tried to write a simple Bayes Classifier. It is in two parts, a classifier module and a cli.

What could be done to improve it? Is it entirely awful? Is it too imperative? Any and all feedback appreciated. Here are the two files:

The Main Classifier:

``


module BayesClassifier where
-- Text Classifier Using Bayes Formula
import Data.List
import Data.Char
type Category = String
newtype Classifier = Classifier { training :: [(Category, [String])] } deriving (Eq, Show)
-- Get a new classifer with no training
classifier :: Classifier
classifier = Classifier []
-- classifier probabilities
probabilityOfWordInCategory :: Classifier -> String -> Category -> Double
probabilityOfCategory :: Classifier -> Category -> Double
-- Adding + 1 for Laplacian Correction
probabilityOfWordInCategory (Classifier training) word category = let allInCategory = filter (\(cat, _) -> cat == category) training
                                                                      allInCategoryContainingWord = filter (\(_, text) -> word

elem

 text) allInCategory
                                                                  in (fromIntegral $ length allInCategoryContainingWord + 1) / (fromIntegral $ length allInCategory + 1)
probabilityOfCategory (Classifier training) category =  let allInCategory = filter (\(cat, _) -> cat == category) training 
                                                        in (fromIntegral $ length allInCategory) / (fromIntegral $ length training)
-- Train a classifier
train :: Classifier -> String -> Category -> Classifier 
train (Classifier training ) text category = Classifier $ (category, cleanInput $ text):training
-- Categorize text with a classifier
classify :: Classifier -> String -> Category
classify classifier text = fst $ head $ sortBy (\(_, a) (_, b)  -> b

compare` a) $ probabilities classifier text
-- Get Probability for eac

Solution

Rather nice attempt :)

A small advice first. Do not use Category as a type name. It may confuse haskell people who might mistake it for a completely different thing.

import Data.List
import Data.Char
import qualified Data.Set as Set
import qualified Control.Monad as Ctl
import Test.HUnit

I added the unit test import here, and the set.

It is rather adviced in haskell to look out for opportunities to take out
general definitions when possible from a more complicated expression. The idea is to build the language to describe your problem and then use it to solve the problem by describing it. So lots of tiny general purpose functions are very good.

onlyLetters = filter (\x -> isLetter x || isSpace x)

I prefer the below version because liftM2 follows naturally and is quite readable.

onlyLetters = filter (Ctl.liftM2 (||) isLetter isSpace)

stopWords is clearly a constant. So take its construction out. Also see that a set is used. It makes the member operation cheaper.

stopWords = Set.fromAscList $ sort ["a","about","you","yourself","yourselves"]

Where possible use function composition. From my experience, it makes it easier to understand the essence of the function. Also if there is a choice between composing functions and using parenthesis or $, go for (.), That would make refactoring easier later. Use currying in preference to explicit lambdas.

cleanInput :: String -> [String]
cleanInput = filter (flip Set.notMember stopWords) . words . clean
  where clean = onlyLetters . (map toLower)

Run tt to execute the test.

tests = TestList [TestLabel "clean" testClean]
tt = runTestTT tests 

tInput = "I and you and elephant"
testClean = TestCase $ assertEqual
               "cleanInput"
               ["i","and","and","elephant"] (cleanInput tInput)

Try to restrict your line width. First, it makes it easier to read your code, and second, it makes you on the lookout for refactoring opportunities.

-- Get Probability for a passage in a certain category
probabilityForCategory :: Classifier -> String -> Category -> Double
probabilityForCategory classifier text cat = sum $ pfst cat : plst cat (cleanInput text)
  where plst cat = map (log . probabilityOfWordInCategory classifier cat)
        pfst = log . probabilityOfCategory classifier

Note that the way you declared your data structure for Classifier, you get the function 'training' defined. So there is no point in extracting it using @ as you did.

-- Get Probability for each Category
probabilities :: Classifier ->  String -> [(Category, Double)]
probabilities classifier text =  map (wrap text) $ nub $ map fst (training classifier)
    where wrap = wrapfn . probabilityForCategory classifier

Thus we take out the wrapfn because it is nice and generic.

wrapfn fn cat = (cat, fn cat)

We can do the same with classify

-- Categorize text with a classifier
classify :: Classifier -> String -> Category
classify = ((fst . extMax snd) .) . probabilities

And our extract max is now nice and clean.

extMax :: Ord b => (x -> b) -> [x] -> x
extMax fn = maximumBy (compare `F.on` fn)

The training function is clean, but remove the extra $, and make use of your
training accessing function from classify.

-- How to train a dragon
train :: Classifier -> String -> Category -> Classifier
train c text category = Classifier $ (category, cleanInput text) : training c

The changes here should be self evident now.

filterElem :: Eq a1 => (a -> [a1]) -> a1 -> [a] -> [a]
filterElem fn word = filter ((elem word) . fn)
filterEq :: Eq b => (a -> b) -> b -> [a] -> [a]
filterEq fn x = filter ((x ==) . fn)

allInCategory :: Eq a => a -> [(a, b)] -> [(a, b)]
allInCategory = filterEq fst

fl = fromIntegral . length

-- Adding + 1 for Laplacian Correction
probabilityOfWordInCategory :: Classifier -> String -> Category -> Double
probabilityOfWordInCategory (Classifier training) word category = succ a / succ b
  where a = fl $ (filterElem snd) word y
        b = fl y
        y = allInCategory category training

probabilityOfCategory :: Classifier -> Category -> Double
probabilityOfCategory (Classifier training) category =  a / b
  where a = fl (allInCategory category training)
        b = fl training

Also, make use of precedence of operators and avoid extra parenthesis. They are ugly :)

Code Snippets

import Data.List
import Data.Char
import qualified Data.Set as Set
import qualified Control.Monad as Ctl
import Test.HUnit

onlyLetters = filter (\x -> isLetter x || isSpace x)

onlyLetters = filter (Ctl.liftM2 (||) isLetter isSpace)

stopWords = Set.fromAscList $ sort ["a","about","you","yourself","yourselves"]

cleanInput :: String -> [String]
cleanInput = filter (flip Set.notMember stopWords) . words . clean
  where clean = onlyLetters . (map toLower)

Context

StackExchange Code Review Q#12359, answer score: 7

Revisions (0)

No revisions yet.