HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Bayes rating selective term filtering

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
selectivebayestermratingfiltering

Problem

I am working on a project, and at the moment I have just finished the majority of the Bayes Rating/Classifier.

I have two sets of methods I am most concerned about. Method set 1, will select a certain number of extraneous elements (which have probabilities furthest from 0.5), and only take those into account for classification.

public double GetBayesRating(int numberOfExtremes)
{
    return GetBayesRating(numberOfExtremes, 0.5);
}

public double GetBayesRating(int numberOfExtremes, double messageSpamProbability)
{
    // Formula based on https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering#Other_expression_of_the_formula_for_combining_individual_probabilities
    // We used the natural-log method for the reasons that section describes.
    List topKeywords;

    if (messageSpamProbability = 1)
        throw new ArgumentOutOfRangeException("messageSpamProbability", messageSpamProbability, "The parameter messageSpamProbability must be between the values of 0 and 1.");

    if (numberOfExtremes > 0)
        topKeywords = keywordsMatched.OrderByDescending(x => Math.Abs(x.SpamProbability)).Take(numberOfExtremes).ToList();
    else
        topKeywords = keywordsMatched;

    double n = 0;

    for (int i = 0; i < topKeywords.Count; i++)
        n += (Math.Log(1 - topKeywords[i].GetSpamProbability(messageSpamProbability)) - Math.Log(topKeywords[i].GetSpamProbability(messageSpamProbability)));

    return 1 / (1 + Math.Pow(Math.E, n));
}


The second set, can allow you to drill down into how many extremes you want above/below the median (0.5). That is, if you want the six highest and four lowest, then it will only take those (or how many it can find until it hits 0.5) into account.

```
public double GetBayesRating(int numberAboveMiddle, int numberBelowMiddle)
{
return GetBayesRating(numberAboveMiddle, numberBelowMiddle, 0.5);
}

public double GetBayesRating(int numberAboveMiddle, int numberBelowMiddle, double messageSpamProbability)
{
// Formula based

Solution

Using LINQ can help you a lot here. Wherever you see a loop, there's a good chance you can replace it with LINQ to improve readability and safety. Let's look at the following loop:

double n = 0;

for (int i = 0; i < topKeywords.Count; i++)
   n += (Math.Log(1 - topKeywords[i].GetSpamProbability(messageSpamProbability)) - Math.Log(topKeywords[i].GetSpamProbability(messageSpamProbability)));


First of all, topKeywords[i].GetSpamProbability(messageSpamProbability) is used twice and would be more readable if it were stored in a variable:

double n = 0;

for (int i = 0; i < topKeywords.Count; i++)
{
    var spamProbability = topKeywords[i].GetSpamProbability(messageSpamProbability);
    n += (Math.Log(1 - spamProbability) - Math.Log(spamProbability));
}


Here is the version using LINQ:

double n = topKeywords
    .Select(x => x.GetSpamProbability(messageSpamProbability))
    .Select(x => Math.Log(1 - x) - Math.Log(x))
    .Sum();


Now let's look at the following code in your final GetBayesRating method:

List topKeywordsAbove = keywordsMatched.OrderByDescending(x => x.SpamProbability).Take(numberAboveMiddle).ToList();
List topKeywordsBelow = keywordsMatched.OrderBy(x => x.SpamProbability).Take(numberBelowMiddle).ToList();
List topKeywords = new List();

for (int i = 0; i = 0.5)
        topKeywords.Add(topKeywordsAbove[i]);

for (int i = 0; i < topKeywordsBelow.Count; i++)
    if (topKeywordsBelow[i].SpamProbability <= 0.5 && !topKeywords.Contains(topKeywordsBelow[i]))
        topKeywords.Add(topKeywordsBelow[i]);


I was going to suggest replacing the for loops with LINQ queries using Where, but then I realized that topKeywordsAbove and topKeywordsBelow are ordered, so you could just add a TakeWhile call to the initial queries. Note that you'll have to change <= 5 to < 5 so there isn't any overlap. And now that you're not accessing the collections by index and you're only accessing them once, you don't need to convert them to Lists.

var topKeywordsAbove = 
    keywordsMatched
        .OrderByDescending(x => x.SpamProbability)
        .Take(numberAboveMiddle)
        .TakeWhile(x => x.SpamProbability >= 0.5);
var topKeywordsBelow =
    keywordsMatched
        .OrderBy(x => x.SpamProbability)
        .Take(numberBelowMiddle)
        .TakeWhile(x => x.SpamProbability < 0.5);
var topKeywords = topKeywordsAbove.Concat(topKeywordsBelow);


Finally, you can extract the end of your two big methods into a method:

private double GetBayesRating(IEnumerable topKeywords, double messageSpamProbability)
{
    double n = topKeywords
        .Select(x => x.GetSpamProbability(messageSpamProbability))
        .Select(x => Math.Log(1 - x) - Math.Log(x))
        .Sum();

    return 1 / (1 + Math.Pow(Math.E, n));
}

Code Snippets

double n = 0;

for (int i = 0; i < topKeywords.Count; i++)
   n += (Math.Log(1 - topKeywords[i].GetSpamProbability(messageSpamProbability)) - Math.Log(topKeywords[i].GetSpamProbability(messageSpamProbability)));
double n = 0;

for (int i = 0; i < topKeywords.Count; i++)
{
    var spamProbability = topKeywords[i].GetSpamProbability(messageSpamProbability);
    n += (Math.Log(1 - spamProbability) - Math.Log(spamProbability));
}
double n = topKeywords
    .Select(x => x.GetSpamProbability(messageSpamProbability))
    .Select(x => Math.Log(1 - x) - Math.Log(x))
    .Sum();
List<SpamKeyword> topKeywordsAbove = keywordsMatched.OrderByDescending(x => x.SpamProbability).Take(numberAboveMiddle).ToList();
List<SpamKeyword> topKeywordsBelow = keywordsMatched.OrderBy(x => x.SpamProbability).Take(numberBelowMiddle).ToList();
List<SpamKeyword> topKeywords = new List<SpamKeyword>();

for (int i = 0; i < topKeywordsAbove.Count; i++)
    if (topKeywordsAbove[i].SpamProbability >= 0.5)
        topKeywords.Add(topKeywordsAbove[i]);

for (int i = 0; i < topKeywordsBelow.Count; i++)
    if (topKeywordsBelow[i].SpamProbability <= 0.5 && !topKeywords.Contains(topKeywordsBelow[i]))
        topKeywords.Add(topKeywordsBelow[i]);
var topKeywordsAbove = 
    keywordsMatched
        .OrderByDescending(x => x.SpamProbability)
        .Take(numberAboveMiddle)
        .TakeWhile(x => x.SpamProbability >= 0.5);
var topKeywordsBelow =
    keywordsMatched
        .OrderBy(x => x.SpamProbability)
        .Take(numberBelowMiddle)
        .TakeWhile(x => x.SpamProbability < 0.5);
var topKeywords = topKeywordsAbove.Concat(topKeywordsBelow);

Context

StackExchange Code Review Q#96238, answer score: 5

Revisions (0)

No revisions yet.