Removing doubles from a string

Submitted by: @import:stackexchange-codereview

Problem

I've written a function that removes doubled letters from a string:

def removeDoubles(string):
    output = ' '
    for char in string:
        if output[-1].lower() != char.lower():
            output += char
    return output[1:]


For example:

  • removeDoubles('bookkeeper') = 'bokeper'
  • removeDoubles('Aardvark') = 'Ardvark'
  • removeDoubles('eELGRASS') = 'eLGRAS'
  • removeDoubles('eeEEEeeel') = 'el'

As you can see, it removes every repeated letter from a string, regardless of whether the letters are uppercase or lowercase.

I was wondering whether this could be more Pythonic. I have to start with a string containing a space so that output[-1] always exists. I was also wondering whether a list comprehension could be used for this.

Solution

Your examples are pretty useful (especially 'Aardvark'), and should be included in the documentation of the function, ideally as a doctest. However, the problem is still underspecified: what should happen when a streak of three identical characters is encountered? Should removeDoubles('eeek') return 'eek' (which is how I would interpret "doubles"), or 'ek' (which is what your code actually does)?

As per PEP 8, the official Python style guide, function names should be lower_case_with_underscores unless you have a good reason to deviate. Therefore, I recommend renaming the function to remove_doubles.

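Initializing output to ' ' and then dropping it with output[1:] is cumbersome and slightly wasteful, since the slice copies the whole string. It also has a side effect: because the sentinel is itself a space, a leading space in the input is treated as a repeat and dropped, so removeDoubles(' a') returns 'a'.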
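
One way to avoid the sentinel, sketched here as an illustration rather than as part of the original review, is to collect the kept characters in a list and check whether the list is empty before looking at its last element. The name remove_doubles_loop is hypothetical.

```python
def remove_doubles_loop(string):
    """Collapse each case-insensitive streak of a character to its first character."""
    kept = []  # characters that survive, in order
    for char in string:
        # "not kept" covers the first character, so no sentinel is needed.
        if not kept or kept[-1].lower() != char.lower():
            kept.append(char)
    return ''.join(kept)


print(remove_doubles_loop('bookkeeper'))  # bokeper
print(remove_doubles_loop('eeEEEeeel'))   # el
# Without a space sentinel, a leading space in the input survives.
print(repr(remove_doubles_loop(' a')))    # ' a'
```

Building a list and joining once at the end also avoids growing a string one character at a time.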

Fundamentally, this operation is a fancy string substitution. Typically, such substitutions are best done using regular expressions. In particular, you need the backreferences feature:


Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python’s string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE.


For example, the following RE detects doubled words in a string.

>>> import re
>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'


For my interpretation of "doubles":

import re

def remove_doubles(string):
    """
    For each consecutive pair of the same character (case-insensitive),
    drop the second character.

    >>> remove_doubles('Aardvark')
    'Ardvark'
    >>> remove_doubles('bookkeeper')
    'bokeper'
    >>> remove_doubles('eELGRASS')
    'eLGRAS'
    >>> remove_doubles('eeek')
    'eek'
    """
    return re.sub(r'(.)\1', r'\1', string, flags=re.I)
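
To make the pair-wise interpretation concrete, here is a small sketch of how the same substitution treats longer streaks. The helper name collapse_pairs is hypothetical; it only wraps the re.sub call shown above. Because re.sub replaces non-overlapping matches from left to right, a streak of n identical characters ends up with n // 2 + n % 2 characters.

```python
import re


def collapse_pairs(string):
    # Same substitution as remove_doubles: each adjacent pair becomes one character.
    return re.sub(r'(.)\1', r'\1', string, flags=re.I)


for n in range(1, 7):
    streak = 'e' * n
    result = collapse_pairs(streak)
    # A streak of n characters shrinks to n // 2 + n % 2 characters.
    print(n, streak, '->', result)

# 1 e -> e
# 2 ee -> e
# 3 eee -> ee
# 4 eeee -> ee
# 5 eeeee -> eee
# 6 eeeeee -> eee
```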


To preserve your implementation's behaviour:

import re

def deduplicate_consecutive_chars(string):
    """
    For each consecutive streak of the same character (case-insensitive),
    drop all but the first character.

    >>> deduplicate_consecutive_chars('Aardvark')
    'Ardvark'
    >>> deduplicate_consecutive_chars('bookkeeper')
    'bokeper'
    >>> deduplicate_consecutive_chars('eELGRASS')
    'eLGRAS'
    >>> deduplicate_consecutive_chars('eeek')
    'ek'
    """
    return re.sub(r'(.)\1+', r'\1', string, flags=re.I)
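
If a regular expression feels heavy, itertools.groupby offers another route that keeps the loop's behaviour and comes close to the comprehension asked about in the question. This is a sketch, not part of the original answer, and dedupe_groupby is a hypothetical name. It also shows one difference worth knowing: in a pattern, '.' does not match a newline unless re.S (re.DOTALL) is passed, so the regex above leaves doubled newlines alone, whereas the question's loop and groupby collapse them.

```python
import re
from itertools import groupby


def dedupe_groupby(string):
    """Keep the first character of each case-insensitive streak."""
    # groupby yields one group per run of characters sharing the same lowercase form;
    # next(group) takes the first character of that run with its original case.
    return ''.join(next(group) for _, group in groupby(string, key=str.lower))


print(dedupe_groupby('eELGRASS'))   # eLGRAS
print(dedupe_groupby('eeEEEeeel'))  # el

text = 'a\n\nb'
# Without re.S, '.' skips newlines, so the doubled newline is kept.
print(repr(re.sub(r'(.)\1+', r'\1', text, flags=re.I)))         # 'a\n\nb'
# With re.S, the doubled newline collapses, matching groupby.
print(repr(re.sub(r'(.)\1+', r'\1', text, flags=re.I | re.S)))  # 'a\nb'
print(repr(dedupe_groupby(text)))                               # 'a\nb'
```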

Context

StackExchange Code Review Q#151715, answer score: 6