Regex to clean text in preparation for word count in PHP

Submitted by: @import:stackexchange-codereview

Problem

EDIT: Here's my totally-revised PHP...

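// Lowercase, then turn everything except letters, digits, space, ' and - into a space.
$text = preg_replace("~[^ a-z0-9'-]~", " ", strtolower($INPUT));
// Blank out any ' or - that is not directly between two letters/digits.
for ($i = 1; $i < strlen($text) - 1; $i++) {
    if (preg_match("~['-]~", $text[$i]) && (!preg_match("~[a-z0-9]~", $text[$i-1]) || !preg_match("~[a-z0-9]~", $text[$i+1]))) {
        $text[$i] = " ";
    }
}
// Collapse runs of spaces into a single space.
while (preg_match("~  ~", $text)) $text = preg_replace("~  ~", " ", $text);
// Drop one leading and one trailing space, apostrophe, or hyphen.
if (preg_match("~[' -]~", $text[0])) $text = substr($text, 1, strlen($text) - 1);
// Length must be strlen - 1 here; strlen - 2 would also cut the last letter.
if (preg_match("~[' -]~", $text[strlen($text)-1])) $text = substr($text, 0, strlen($text) - 1);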


Now, what I said before still applies.

This regex seems to work for me, but I'm curious if anyone can think of a breaking case, or tell me anything I did wrong.

(If it seems the code does what I say I want it to do, say so; then, for fun, you can help me become more neurotic about what I actually want it to do, which hinges on the question of defining an English "word".)

Desired final form of $text: replace, with spaces, all characters in $INPUT except for letters, digits, and any hyphens or apostrophes that are directly between two letters/digits (and hence "part of a word"). Then collapse all whitespace into single spaces, and if necessary, drop the leading and closing space.

The end result should be a lowercase series of words separated by single spaces, where some words may contain (entirely "inside" the word) one or more non-consecutive apostrophes and/or one or more non-consecutive hyphens.
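The rules above can be sketched without the character-by-character loop by using lookaround assertions. This is only one possible reading of the stated requirements, not the code under review: the function name clean_text() is hypothetical, and the sketch assumes ASCII input, so any non-ASCII letter is treated as junk and replaced with spaces.

```php
<?php
// Hypothetical helper; the name clean_text() does not come from the original post.
// Assumes ASCII text: bytes outside a-z, 0-9, ' and - all become spaces.
function clean_text(string $input): string
{
    $text = strtolower($input);

    // 1. Anything that is not a letter, digit, apostrophe, or hyphen becomes a space.
    $text = preg_replace("~[^a-z0-9'-]~", ' ', $text);

    // 2. An apostrophe or hyphen survives only when it sits directly between
    //    two letters/digits. The lookbehind/lookahead test the neighbours
    //    without consuming them, so consecutive marks are all replaced.
    $text = preg_replace("~(?<![a-z0-9])['-]|['-](?![a-z0-9])~", ' ', $text);

    // 3. Collapse runs of spaces, then drop any leading or trailing space.
    return trim(preg_replace('~ +~', ' ', $text));
}

echo clean_text("Don't stop -- it's a well-known 'fact'!"), "\n";
// prints: don't stop it's a well-known fact
```

Because the lookarounds also succeed at the very start and end of the string, a leading or trailing apostrophe or hyphen is handled by the same pattern and needs no special case.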

Does this do that, with no exceptions?

After this step, the next part will be to split (or explode, whichever is better) the string by spaces, then generate a list of words and frequencies. I'm pretty sure how to do that on my own (especially because my teacher actually told us two algorithms). In fact I'll probably hand the assignment in before choosing an answer to this; I'm mostly asking out of curiosity.
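For the follow-up step, a minimal sketch using explode() and array_count_values() might look like the following. The function name word_frequencies() is hypothetical, and the sketch assumes the string has already been cleaned to lowercase words separated by single spaces. explode() is enough for a fixed one-character delimiter; the old split() function was removed in PHP 7.

```php
<?php
// Hypothetical helper; assumes $text is already lowercase, single-spaced, and trimmed.
function word_frequencies(string $text): array
{
    // explode() on an empty string would yield one empty "word", so bail out early.
    if ($text === '') {
        return [];
    }

    $words = explode(' ', $text);          // one array element per word
    $counts = array_count_values($words);  // word => number of occurrences
    arsort($counts);                       // most frequent first, keys preserved

    return $counts;
}

foreach (word_frequencies('the cat and the hat') as $word => $count) {
    echo $word, ': ', $count, "\n";
}
```

Note that array_count_values() turns purely numeric words such as "42" into integer keys, which matters only if the keys are later compared strictly as strings.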

My thinking is...

  • English "words" tend to be case insensitive for distinction (though there is the proper noun issue, which I'll j

Solution

Do it in two steps. First replace everything you deem to be "junk" with spaces, then collapse each run of whitespace and hyphens down to a single space. You can express this as a reduce over a series of patterns, each of which replaces its matches with a space.

Some output from playing with this in Boris, a PHP REPL:

[7] boris> function clean($input) {
[7]     *>   $patterns = array(
[7]     *>     '/[^\w\r\n -]/',
[7]     *>     '/[\r\n -]{2,}/'
[7]     *>   );
[7]     *>   return array_reduce(
[7]     *>     $patterns,
[7]     *>     function($text, $re) { return preg_replace($re, ' ', $text); },
[7]     *>     $input
[7]     *>   );
[7]     *> }
[8] boris> clean('abc def &% f');
// 'abc def f'
[9] boris> clean('abc-def &% f');
// 'abc-def f'
[10] boris> clean('abc-def-&%-f');
// 'abc-def f'
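As a quick check against the question's stated goals, the same clean() function can be run on a few more inputs. The sketch below copies the function from the transcript without the REPL prompts; the extra inputs are illustrative and not from the original answer. They suggest that, as posted, the patterns replace apostrophes, keep underscores (because \w matches them), leave case untouched, and do not strip a single leading hyphen, so the function would need adjusting to match the question exactly.

```php
<?php
// clean() as shown in the transcript, minus the REPL prompts.
function clean($input) {
    $patterns = array(
        '/[^\w\r\n -]/',  // anything that is not a word char, CR, LF, space, or hyphen
        '/[\r\n -]{2,}/'  // runs of two or more separator characters
    );
    return array_reduce(
        $patterns,
        function ($text, $re) { return preg_replace($re, ' ', $text); },
        $input
    );
}

// Illustrative inputs (not from the original answer):
echo clean("don't"), "\n";  // don t   (apostrophe is not whitelisted)
echo clean("a_b"), "\n";    // a_b     (\w includes the underscore)
echo clean("Hello"), "\n";  // Hello   (no lowercasing step)
echo clean("-abc"), "\n";   // -abc    (a single leading hyphen is kept)
echo clean("a - b"), "\n";  // a b     (space-hyphen-space collapses to one space)
```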

Context

StackExchange Code Review Q#40014, answer score: 2
