HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Python string clean up function with optional args

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
withfunctionargsoptionalpythonstringclean

Problem

I've got a function that I mainly use while web scraping. It gives me the ability to throw in a multi line address and clean it or a name field with unwanted characters and clean those, etc.

Below is the code and I would like to know if this is the best approach. If I should switch to recursive or stick with the while loop. Or if I should look at some other completely different approach. Examples of I/O commented in the code.

```
def clean_up(text, strip_chars=[], replace_extras={}):
"""
:type text str
:type strip_chars list
:type replace_extras dict
*
strip_chars: optional arg
Accepts passed list of string objects to iter through.
Each item, if found at beginning or end of string, will be
gotten rid of.
example:
text input: ' , , , .,.,.,.,,,......test, \t this\n.is.a\n.test...,,, , .'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^------^^^^----^^-----^^-----^^^^^^^^^^^^^^^^^^
strip_chars arg: [',', '.']
output: 'test, this .is.a .test'
*
replace_extras: optional arg
Accepts passed dict of items to replace in the standard
clean_up_items dict or append to it.
example:
text_input: ' this is one test\n!\n'
^--------^^^-----^^-^^
replace_extras arg: {'\n': '', 'one': '1'}
output: 'this is 1 test!'
*
DEFAULT REPLACE ITEMS
---------------------
These can be overridden and/or appended to using the replace_extras
argument.
replace item | with
-
-
-
-
-
*
"""

clean_up_items = {'\n': ' ', '\r': ' ', '\t': ' ', ' ': ' '}
clean_up_items.update(replace_extras)

text = text.strip()

change_made = True
while change_made:
text_old = text
for x in strip_chars:
while text.startswith(x) or text.endswith(x):
text

Solution

Substitution in multiple passes

Your function is slightly buggy in a non-deterministic way, due to the way you perform substitutions in multiple passes. Performing string substitutions in multiple passes is usually a bad idea. Here is another instance of such a bug. The problem is that the result could vary depending on the iteration order through replace_extras.

Take this example:

>>> clean_up('acetone', replace_extras={
...     'one': '1',
...     'acetone': '(CH3)2CO',
...     'CO': 'carbon monoxide',
...     'C': 'carbon',
...     'CH3': 'methane'
... })


What will be the result? If the 'one' substitution is done first, then it becomes 'acet1'. If the 'acetone' substitution is done first, followed by the 'CO' substitution and the 'C' substitution, then it becomes '(carbonH3)2carbon monoxide'. If it's 'acetone', then 'C', then 'CO', then it becomes '(carbonH3)2carbonO'. Another possible outcome is '(methane)2carbon monoxide'. All kinds of results are possible!

Therefore, I recommend doing substitutions in a single pass wherever possible, using regular expressions. Regular expressions will always look for the leftmost longest match.1 With a regex substitution, the result will not be fed back for further processing.

Documentation and function design

You wrote a very long docstring, which is nice, but nowhere did you actually state the purpose of the function.

What is the purpose of the function? There is a strip_chars phase, followed by a replace_extras phase. By the single-responsibility principle, consider splitting the function into two functions, or at least writing it as a composition of two helper functions.

If you have specific examples of inputs and their corresponding outputs, it would be a good idea to write them as doctests instead.

Suggested solution

import re

def clean_up(text, strip_chars=[], replace_extras={}):
    r"""
    Remove all occurrences of strip_chars and whitespace at the beginning
    and end of each line, then perform string substitutions specified by
    the replace_extras dictionary (as well as normalizing all whitespace
    to a single space character), and then strip whitespace from the
    beginning and end.

    >>> clean_up('       ,  ,      , .,.,.,.,,,......test, \t  this\n'
    ...     '.is.a\n'
    ...     '.test...,,,         , .', strip_chars=',.')
    'test, this .is.a .test'

    Any consecutive whitespace is normalized to a single space, but
    you can override these implicit substitutions in replace_extras.

    >>> clean_up(' this is one test\n!\n', replace_extras={'\n': '', 'one': '1'})
    'this is 1 test!'
    """
    # Handle strip_chars
    strip_items = '|'.join(re.escape(s) for s in strip_chars)
    strip_re = r'^(?:{}|\s)+|(?:{}|\s)+.format(strip_items, strip_items)
    text = re.sub(strip_re, '', text, re.MULTILINE)

    # Normalize whitespace and handle replace_extras
    replace_keys = list(replace_extras.keys())
    replace_keys.sort(key=len, reverse=True)
    replace_re = '|'.join([re.escape(s) for s in replace_keys] + [r'\s+'])
    return re.sub(
        replace_re,
        lambda match: replace_extras.get(match.group(), ' '),
        text
    ).strip()

Code Snippets

>>> clean_up('acetone', replace_extras={
...     'one': '1',
...     'acetone': '(CH3)2CO',
...     'CO': 'carbon monoxide',
...     'C': 'carbon',
...     'CH3': 'methane'
... })
import re

def clean_up(text, strip_chars=[], replace_extras={}):
    r"""
    Remove all occurrences of strip_chars and whitespace at the beginning
    and end of each line, then perform string substitutions specified by
    the replace_extras dictionary (as well as normalizing all whitespace
    to a single space character), and then strip whitespace from the
    beginning and end.

    >>> clean_up('       ,  ,      , .,.,.,.,,,......test, \t  this\n'
    ...     '.is.a\n'
    ...     '.test...,,,         , .', strip_chars=',.')
    'test, this .is.a .test'

    Any consecutive whitespace is normalized to a single space, but
    you can override these implicit substitutions in replace_extras.

    >>> clean_up(' this is one test\n!\n', replace_extras={'\n': '', 'one': '1'})
    'this is 1 test!'
    """
    # Handle strip_chars
    strip_items = '|'.join(re.escape(s) for s in strip_chars)
    strip_re = r'^(?:{}|\s)+|(?:{}|\s)+$'.format(strip_items, strip_items)
    text = re.sub(strip_re, '', text, re.MULTILINE)

    # Normalize whitespace and handle replace_extras
    replace_keys = list(replace_extras.keys())
    replace_keys.sort(key=len, reverse=True)
    replace_re = '|'.join([re.escape(s) for s in replace_keys] + [r'\s+'])
    return re.sub(
        replace_re,
        lambda match: replace_extras.get(match.group(), ' '),
        text
    ).strip()

Context

StackExchange Code Review Q#139549, answer score: 5

Revisions (0)

No revisions yet.