patternphpMinor
Spam detection in PHP for comment system
Viewed 0 times
commentspamphpsystemfordetection
Problem
I'am trying to build a PHP/MySQL/jQuery comment system. Off late I started to realize that manual spamming is a serious issue that need to be addressed carefully. So I thought of building a PHP function (initially, later can be implemented in OOPS concept) that will sniff the content and give points (spam points) based on some criteria. Getting a spam point less than will make the content inactive and send for moderation, whereas posts with spam points greater than 2 won't make the content eligible to be inserted in DB. This is a very basic piece of codes, so I want you to give me your valuable suggestion as how to make this code much better and if I am doing something wrong here.
```
2 { wont be posted , ie wont be inserted in DB}
function sniffSpams($content)
{
$spam_points = 0;
$url_pattern = '#(www\.|https?://)?[a-z0-9]+\.[a-z0-9]{2,4}\S*#i';
preg_match_all($url_pattern, $content, $matches, PREG_PATTERN_ORDER);
if(! empty($matches))
{
//url is/are present
$get_number_of_urls = count($matches[0]); //get the number of urls/emails present in content
if($get_number_of_urls > 2)
$spam_points += $get_number_of_urls; //1 point per link
else
{
$spam_words_uri = array('free', 'vote', 'play');
//if less thamn 2 , check for length of url
foreach ($matches[0] as $url)
{
if(strlen($url) > 150) //long url mostly are spam
$spam_points += 1;
foreach($spam_words_uri as $spam)
{
if(stripos($url, $spam) !== false )
$spam_points += 1;
}
}
}
}
$spam_words = array('Levitra', 'viagra', 'casino', '*');
foreach($spam_words as $spam)
{
if(stripos($content, $spam) !== false )
$spam_points += 1;
}
return $spam_points;
}
echo sniffS
```
2 { wont be posted , ie wont be inserted in DB}
function sniffSpams($content)
{
$spam_points = 0;
$url_pattern = '#(www\.|https?://)?[a-z0-9]+\.[a-z0-9]{2,4}\S*#i';
preg_match_all($url_pattern, $content, $matches, PREG_PATTERN_ORDER);
if(! empty($matches))
{
//url is/are present
$get_number_of_urls = count($matches[0]); //get the number of urls/emails present in content
if($get_number_of_urls > 2)
$spam_points += $get_number_of_urls; //1 point per link
else
{
$spam_words_uri = array('free', 'vote', 'play');
//if less thamn 2 , check for length of url
foreach ($matches[0] as $url)
{
if(strlen($url) > 150) //long url mostly are spam
$spam_points += 1;
foreach($spam_words_uri as $spam)
{
if(stripos($url, $spam) !== false )
$spam_points += 1;
}
}
}
}
$spam_words = array('Levitra', 'viagra', 'casino', '*');
foreach($spam_words as $spam)
{
if(stripos($content, $spam) !== false )
$spam_points += 1;
}
return $spam_points;
}
echo sniffS
Solution
Minor (and subjective) remarks:
- Function name: The function does not sniff spam but calculate a score for the given content. Therefore I'd rename the function to something like
getSpamScore
- The variable
$matchesis not declared. Declaring it before using in inpreg_match_allallows IDEs to perform code analysis and other developers don't have a hard time looking for its declaration.
- Two empty lines just take up space. Your code should be clear enough from structure not to need any extra indentions (which you don't have) or double blank lines. As the
preg_match_allis directly related to theifand only to it, I'd not have any lines at all between the match and the if-statement. (This applies to all other lines too)
//url is/are presentcomment: more of a personal thing maybe: I don't like obvious comments. Personally I prefer theif it needs a comment it needs to be refactoredway of writing comments (non in about almost every case). Comments need maintenance time (updating comments for changed code). But time should only be spent here if it provides value. Additionally comments tend to loose their relation to the code they originally were attached to (mostly by refactoring). This greatly confuses other developers.
$get_number_of_urlsnaming: this variable contains a verb. Verbs indicate actions (e.g. a method). In PHP I'd expect this variable to be a closur, not an integer. Yet you don't actually get the number or urls, you already have them. A proper name might benumber_of_urlsinstead.
- Stick to one convention of styling code: in your case this especially applies for where to have whitespaces. In your
ifandforeachlines you mix upforeach(foreach (and yourifconditions sometimes have whitespaces in between and sometimes not. I suppose this is some kind of work-in-progress code. Yet later styling costs time. Code you write always should meetproduction readyrequirements (regarding style). It takes some time at the beginning to stick with someones conventions, but you'll get used to it and do it automatically after some time.
- In my opinion this function has too many responsibilities (at least two). I'd suggest splitting this function into two sub-functions. This function itself only calls and sums of the results of the independent tests. When going OO this of course wants a design pattern ;)
Context
StackExchange Code Review Q#41399, answer score: 3
Revisions (0)
No revisions yet.