PHP function to convert a Portuguese word from plural to singular

Submitted by: @import:stackexchange-codereview

Problem

I know, this sounds really difficult, but it is really easy.

I needed to convert a single Portuguese word from the plural into the singular. I know there's a proper name for that, but it escapes me.

The rules are simple. I compiled them from http://www.easyportuguese.com/portuguese-lessons/plural/ and applied them in reverse:

  • If the word ends in a vowel followed by s, remove the s at the end.

  • Words ending in ões, ães and ãos should end with ão.

  • Words ending in is: remove the is and add l to the end.
    Special case: accents should be removed, if needed. The only cases I saw were anéis and pastéis, which have to become anel and pastel.

  • Words ending in ns get it replaced with m.

  • Words ending with [rsz]es should lose the es.
    Special case: words ending in eses need the first e replaced with ê, as in meses => mês.

  • Some words are always used in the plural, like óculos, parabéns and férias.



Here's the code:

function plural_to_singular($string)
{
    if(preg_match('/^(?:[oó]culos|parab[eé]ns|f[eé]rias)$/iu', $string))
    {
        return $string;
    }

    $regexes = array(
        '[õã]es' => 'ão',
        '[áó].*eis' => 'el',
        '[eé]is' => 'el',
        '([^eé])is' => '$1l',
        'ns' => 'm',
        'eses' => 'ês',
        '([rzs])es' => '$1',
        's' => ''
    );

    foreach($regexes as $fragment => $replace)
    {
        $regex = '/' . $fragment . '$/ui';
        if(preg_match($regex, $string))
        {
            return preg_replace($regex, $replace, $string);
        }
    }

    return $string;
}


You can try it at http://sandbox.onlinephpfunctions.com/code/7947a0efd16f361e89491e4a64f71b578d2278df with some test cases.

In your opinion, what can I improve?

Is there any obvious butchering or performance killer?

Solution

Let me start by saying that I respect Mike Brant and have been enjoying his posts for quite a while now. However, I disagree with parts of his review.

  • $regex_config cannot store the replacement values as associative keys unless the regex patterns that share a replacement value are merged. This is not explained in the ... (yatta-yatta). The key clash would be on el.

  • Simply throwing 1 at the end of preg_replace() is NOT going to provide the desired output. Declaring a replacement limit on the call only limits the replacements PER array element, and each pattern is applied to the result of the previous one. The damage is evident in this output: meses => mês = mê

  • Most trivially, array_values() doesn't need to be called, because preg_replace() is "key ignorant" regarding its array inputs.

  • For this process to stay accurate, the function needs to return as soon as a replacement occurs on the input string. To avoid applying multiple replacements, iterate the array of pattern-replacement pairs.

  • You can avoid capture groups and shorten your replacement strings in a couple of places by using the \K escape sequence (which resets the start of the reported match). This way you don't need $1, and you don't need to rewrite literals from the pattern into the replacement.

  • If you need to preserve letter case during the replacement process, you can check the last character of the incoming string. If it is uppercase, assume the whole string is in CAPS and call mb_strtoupper().

  • I don't have a sample string to test against ~[áó].*eis$~iu, so I can't say whether it is accurate; my Portuguese is not too sharp.
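
One candidate sample for that last pattern is móveis, assuming it is a fair test word (the plural of móvel). The sketch below only shows what the broad pattern and the plain [eé]is rule do to that one string; it does not settle whether the broad pattern is needed for other words. The variable names are made up for this illustration.

```php
<?php
// Assumption: "móveis" should become "móvel".

// Broad pattern: everything from the accented vowel to the end is matched,
// so the whole tail "óveis" is replaced by "el".
$broad = preg_replace('~[áó].*eis$~iu', 'el', 'móveis');

// Narrow pattern: only the final "eis" is matched and replaced.
$narrow = preg_replace('~[eé]is$~iu', 'el', 'móveis');

echo $broad, "\n";  // mel
echo $narrow, "\n"; // móvel
```

For this word, at least, the narrower rule alone gives the expected singular.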



After my implementation of \K you can see that two pairs of patterns make the same replacement. If you don't expect to make lots of future adjustments to this set of regex patterns, you could combine each pair with a pipe. Here's what I mean: '~(?:[áó].*eis|[eé]is)$~iu' => 'el', and '~(?:[rzs]\Kes|s)$~iu' => ''
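
As a minimal sketch of the \K idea, the snippet below compares the capture-group form with the \K form on one word, then runs the merged pattern on two words. The sample words come from the test list in this answer; the variable names are only for illustration.

```php
<?php
// Capture-group version: the kept letter is written back through $1.
$with_group = preg_replace('~([rzs])es$~iu', '$1', 'luzes');

// \K version: whatever was matched before \K is left out of the reported
// match, so only "es" is replaced and the "z" stays untouched.
$with_k = preg_replace('~[rzs]\Kes$~iu', '', 'luzes');

// Merged version: an optional "[rzs]e" prefix followed by the final "s".
$merged_es = preg_replace('~(?:[rzs]\Ke)?s$~iu', '', 'vezes');
$merged_s  = preg_replace('~(?:[rzs]\Ke)?s$~iu', '', 'rodas');

echo "$with_group $with_k $merged_es $merged_s\n"; // luz luz vez roda
```

Both single-rule forms give the same result; the merged pattern trades a little readability for one fewer entry in the map.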

I am using the regex patterns as the keys because they will all logically be unique. The same cannot be said about the replacement values (not without merging, anyhow).
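
The key clash is easy to reproduce. In the sketch below (array names are hypothetical), a PHP array literal with a repeated key silently keeps only the last value, so one of the two el rules disappears when the replacement is used as the key.

```php
<?php
// Replacement as key: the second 'el' entry overwrites the first.
$by_replacement = [
    'el' => '~[áó].*eis$~iu',
    'el' => '~[eé]is$~iu',
];

// Pattern as key: both rules survive because the patterns differ.
$by_pattern = [
    '~[áó].*eis$~iu' => 'el',
    '~[eé]is$~iu'    => 'el',
];

echo count($by_replacement), "\n"; // 1
echo count($by_pattern), "\n";     // 2
```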

Code:

function is_allcaps($string)
{
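    // Only the last letter is inspected; if it is uppercase,
    // the whole word is assumed to be uppercase.
    $last_letter = mb_substr($string, -1, 1, 'UTF-8');
    return $last_letter === mb_strtoupper($last_letter, 'UTF-8');
    // otherwise use ctype_upper() and setlocale()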
}

function plural_to_singular($string)
{
    // quick return of "untouchables"
    if(preg_match('~^(?:[oó]culos|parab[eé]ns|f[eé]rias)$~iu', $string))
    {
        return $string;
    }

    $regex_map = [
        '~[õã]es$~iu' => 'ão',
        '~(?:[áó].*e|[eé])is$~iu' => 'el',
        '~[^eé]\Kis$~iu' => 'l',
        '~ns$~iu' => 'm',
        '~eses$~iu' => 'ês',
        '~(?:[rzs]\Ke)?s$~iu' => ''
    ];

    foreach ($regex_map as $pattern => $replacement)
    {
        $singular = preg_replace($pattern, $replacement, $string, 1, $count);
        if ($count)
        {
            return is_allcaps($string) ? mb_strtoupper($singular, 'UTF-8') : $singular;
        }
    }
    return $string;
}

$words = array(
    'óculos' => 'óculos',
    'papéis' => 'papel',
    'anéis' => 'anel',
    'PASTEIS' => 'PASTEL',
    'CAMIÕES' => 'CAMIÃO',
    'rodas' => 'roda',
    'cães' => 'cão',
    'meses' => 'mês',
    'vezes' => 'vez',
    'luzes' => 'luz',
    'cristais' => 'cristal',
    'canções' => 'canção',
    'nuvens' => 'nuvem',
    'alemães' => 'alemão'
);

foreach($words as $plural => $singular)
{
    echo "$plural => $singular = " , plural_to_singular($plural) , "\n";
}
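
To see why the loop returns on the first successful replacement, compare it with the array form of preg_replace(). This is a minimal sketch using two of the patterns above; first_match_replace() is a hypothetical helper name, not part of the original code.

```php
<?php
$patterns     = ['~eses$~iu', '~s$~iu'];
$replacements = ['ês', ''];

// Array form: the limit of 1 applies to each pattern separately, and each
// pattern runs on the result of the previous one.
$chained = preg_replace($patterns, $replacements, 'meses', 1);

// Hypothetical helper: stop at the first pattern that actually replaces.
function first_match_replace(array $map, $word)
{
    foreach ($map as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $word, 1, $count);
        if ($count) {
            return $result;
        }
    }
    return $word; // nothing matched, return the word unchanged
}

$stopped = first_match_replace(array_combine($patterns, $replacements), 'meses');

echo $chained, "\n"; // mê
echo $stopped, "\n"; // mês
```

The array call strips the trailing s from the already-singular mês, which is the meses => mê damage described earlier.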

Context

StackExchange Code Review Q#149991, answer score: 3
