patternrubyMinor

Spam classifier

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

classifierspamstackoverflow

Problem

I wrote a small script to get the probability of a given text to be spam. I downloaded some ham (good content) and spam (bad content) text from the internet. The spam corpus is a 1.8 MB txt file, and the ham corpus a 3.6 MB txt file.

My method is pretty naive: I count how many times each word appears in the ham and spam and calculate the ratio (ham_count / spam_count): the higher the better. I then average this ratio, getting a final number that should be interpreted as follows:

almost 0 ⇒ almost surely spam

1 ⇒ perfectly equally likely to be spam or ham

big_number ⇒ almost surely ham

TOP_WORDS = 3

class String
    def remove_invalid
        self.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
    end
end

class Array
    def avg
        self.inject(:+) / self.length.to_f
    end
end

def is_spam?(text, ham, spam)
    # Small value means spam: 0 is almost surely spam
    # Big values mean good content

    words = text.split
    words.each                        #+ 1 avoids division by zero error                                
         .map {|word| (ham.count(word) + 1) / (spam.count(word) + 1)}
         .avg
end

# Attribution to 'Enron' for the mail corpus
ham = File.read("good_mails.txt").remove_invalid.split
spam = File.read("bad_mails.txt").remove_invalid.split

messages = ["buy all now friend",
            "you must click here now",
            "I would like to hear from you",
            "what is your research about",
            "buy 100% discount",
            "national lottery: you won",
            "little money needed buy all now!",
            "I am intersted in your work, you could make some good money from it",
            "I would propose a financial affair",
            "new policy free just for you, no trouble",
            "apparently you misunderstood me yesterday"
            ]

puts messages.map{|txt| [txt, ": ", is_spam?(txt, ham, spam)].join}

Solution

Let's kick off with some style stuff, since your algorithm looks good (i.e. no obvious room for improvement) though basic.

I'm using this as my style guide; it seems to be fairly well accepted.

-
Avoid needless metaprogramming. You could easily have avg as its own function; ditto for remove_invalid. There's no particular reason to monkey-patch String and Array besides prettiness, and the code doesn't lose anything to a couple of extra parentheses.

-
Indents in Ruby are two spaces, not four.

-
Put spaces around { (except in string interpolation; see below). It makes your code much more readable. For example, this:

.map {|word| (ham.count(word) + 1) / (spam.count(word) + 1)}

becomes this:

.map { |word| (ham.count(word) + 1) / (spam.count(word) + 1) }

-
Since my attention has been drawn to that area like a velociraptor to impossible faint noise, why did you use words.each instead of just words in is_spam?? That's redundant. Array has a map function, and both Array#map and Enumerable#map return the same thing (assuming that all else is equal). All you have to do is delete the .each on that line.

(You also have some trailing whitespace at the end of the comment on that line. Ew.)

-
What on Earth is this?

[txt, ": ", is_spam?(txt, ham, spam)].join

Just use string interpolation:

"#{ txt }: #{ is_spam?(txt, ham, spam) }"

Much simpler, at least to me.

-
Okay, right around this point I realized you're a Python fellow. It explains a lot1, including this odd indentation:

messages = ["buy all now friend",
            "you must click here now",
            "I would like to hear from you",
            # etc.
            ]

Aside from what I pointed out in #2, there are three issues here (the style guide doesn't explicitly say any of them, AFAICT, but they're the standard from what I've seen):

-
You didn't indent the innards; you should put one indent before each line until the ].

-
The first element should be on the line after the opening brace, not on the same line, and indented the same as everything else.

-
The ending bracket should be indented as much as the opening one, not as much as the elements in the array.

-
You never use TOP_WORDS in your code. Since it's useless, delete it.

-
In Ruby, methods ending in ? tend to return booleans, not numbers, so is_spam? should return true or false, depending on if it's spam or not. Alternatively, you could rename it to spam_factor, since that's what it seems to be calculating. See the code below for how I chose to solve that particular dilemma.

That's about all I see right now. With these changes (and some personal style changes which are purely opinion-based, so I didn't enumerate them), here's what your code looks like:

def remove_invalid(str)
  str.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?')
end

def avg(arr)
  arr.inject(:+) / arr.length.to_f
end

def spam_factor(text, ham, spam)
  # Small value means spam: 0 is almost surely spam
  # Big values mean good content

  words = text.split
  avg(words.map { |word| (ham.count(word) + 1) / (spam.count(word) + 1) })
                                        # + 1 avoids division by zero error
end

def is_spam?(text, ham, spam)
     spam_factor(text, ham, spam) > 1
end

# Attribution to 'Enron' for the mail corpus
ham = remove_invalid(File.read('good_mails.txt')).split
spam = remove_invalid(File.read('bad_mails.txt')).split

messages = [
  'buy all now friend',
  'you must click here now',
  'I would like to hear from you',
  'what is your research about',
  'buy 100% discount',
  'national lottery: you won',
  'little money needed buy all now!',
  'I am intersted in your work, you could make some good money from it',
  'I would propose a financial affair',
  'new policy free just for you, no trouble',
  'apparently you misunderstood me yesterday'
]

puts messages.map { |txt| "#{txt}: #{is_spam?(txt, ham, spam)}" }

From what I can get of your algorithm without having a runnable version to play with, it looks good. About the only suggestion I could offer would be to make it linear rather than exponential (i.e. subtraction instead of division), to make it more intuitive to use/simpler to update with more features. That would change your function (if I've understood it right) to:

def spam_factor(text, ham, spam)
  # Small value means spam: 0 is almost surely spam
  # Big values mean good content

  words = text.split
  avg(words.map { |word| (ham.count(word) + 1) - (spam.count(word) + 1) })
                                        # + 1 avoids division by zero error
end

def is_spam?(text, ham, spam)
    spam_factor(text, ham, spam) > 0

Notice that the sole difference is that I swapped / for -. The result is that the more positive the result, the more hammy it is; the more negative, the more spammy.

I've got a niggling idea for using a Hash rather than recalculating the

Code Snippets

.map {|word| (ham.count(word) + 1) / (spam.count(word) + 1)}

.map { |word| (ham.count(word) + 1) / (spam.count(word) + 1) }

[txt, ": ", is_spam?(txt, ham, spam)].join

"#{ txt }: #{ is_spam?(txt, ham, spam) }"

messages = ["buy all now friend",
            "you must click here now",
            "I would like to hear from you",
            # etc.
            ]

Context

StackExchange Code Review Q#85874, answer score: 7

Revisions (0)

No revisions yet.