patternpythonMinor
Python string clean up function with optional args
Viewed 0 times
withfunctionargsoptionalpythonstringclean
Problem
I've got a function that I mainly use while web scraping. It gives me the ability to throw in a multi line address and clean it or a name field with unwanted characters and clean those, etc.
Below is the code and I would like to know if this is the best approach. If I should switch to recursive or stick with the
```
def clean_up(text, strip_chars=[], replace_extras={}):
"""
:type text str
:type strip_chars list
:type replace_extras dict
*
strip_chars: optional arg
Accepts passed list of string objects to iter through.
Each item, if found at beginning or end of string, will be
gotten rid of.
example:
text input: ' , , , .,.,.,.,,,......test, \t this\n.is.a\n.test...,,, , .'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^------^^^^----^^-----^^-----^^^^^^^^^^^^^^^^^^
strip_chars arg: [',', '.']
output: 'test, this .is.a .test'
*
replace_extras: optional arg
Accepts passed dict of items to replace in the standard
clean_up_items dict or append to it.
example:
text_input: ' this is one test\n!\n'
^--------^^^-----^^-^^
replace_extras arg: {'\n': '', 'one': '1'}
output: 'this is 1 test!'
*
DEFAULT REPLACE ITEMS
---------------------
These can be overridden and/or appended to using the replace_extras
argument.
replace item | with
-
-
-
-
-
*
"""
clean_up_items = {'\n': ' ', '\r': ' ', '\t': ' ', ' ': ' '}
clean_up_items.update(replace_extras)
text = text.strip()
change_made = True
while change_made:
text_old = text
for x in strip_chars:
while text.startswith(x) or text.endswith(x):
text
Below is the code and I would like to know if this is the best approach. If I should switch to recursive or stick with the
while loop. Or if I should look at some other completely different approach. Examples of I/O commented in the code.```
def clean_up(text, strip_chars=[], replace_extras={}):
"""
:type text str
:type strip_chars list
:type replace_extras dict
*
strip_chars: optional arg
Accepts passed list of string objects to iter through.
Each item, if found at beginning or end of string, will be
gotten rid of.
example:
text input: ' , , , .,.,.,.,,,......test, \t this\n.is.a\n.test...,,, , .'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^------^^^^----^^-----^^-----^^^^^^^^^^^^^^^^^^
strip_chars arg: [',', '.']
output: 'test, this .is.a .test'
*
replace_extras: optional arg
Accepts passed dict of items to replace in the standard
clean_up_items dict or append to it.
example:
text_input: ' this is one test\n!\n'
^--------^^^-----^^-^^
replace_extras arg: {'\n': '', 'one': '1'}
output: 'this is 1 test!'
*
DEFAULT REPLACE ITEMS
---------------------
These can be overridden and/or appended to using the replace_extras
argument.
replace item | with
-
-
-
-
-
*
"""
clean_up_items = {'\n': ' ', '\r': ' ', '\t': ' ', ' ': ' '}
clean_up_items.update(replace_extras)
text = text.strip()
change_made = True
while change_made:
text_old = text
for x in strip_chars:
while text.startswith(x) or text.endswith(x):
text
Solution
Substitution in multiple passes
Your function is slightly buggy in a non-deterministic way, due to the way you perform substitutions in multiple passes. Performing string substitutions in multiple passes is usually a bad idea. Here is another instance of such a bug. The problem is that the result could vary depending on the iteration order through
Take this example:
What will be the result? If the
Therefore, I recommend doing substitutions in a single pass wherever possible, using regular expressions. Regular expressions will always look for the leftmost longest match.1 With a regex substitution, the result will not be fed back for further processing.
Documentation and function design
You wrote a very long docstring, which is nice, but nowhere did you actually state the purpose of the function.
What is the purpose of the function? There is a
If you have specific examples of inputs and their corresponding outputs, it would be a good idea to write them as doctests instead.
Suggested solution
Your function is slightly buggy in a non-deterministic way, due to the way you perform substitutions in multiple passes. Performing string substitutions in multiple passes is usually a bad idea. Here is another instance of such a bug. The problem is that the result could vary depending on the iteration order through
replace_extras.Take this example:
>>> clean_up('acetone', replace_extras={
... 'one': '1',
... 'acetone': '(CH3)2CO',
... 'CO': 'carbon monoxide',
... 'C': 'carbon',
... 'CH3': 'methane'
... })What will be the result? If the
'one' substitution is done first, then it becomes 'acet1'. If the 'acetone' substitution is done first, followed by the 'CO' substitution and the 'C' substitution, then it becomes '(carbonH3)2carbon monoxide'. If it's 'acetone', then 'C', then 'CO', then it becomes '(carbonH3)2carbonO'. Another possible outcome is '(methane)2carbon monoxide'. All kinds of results are possible!Therefore, I recommend doing substitutions in a single pass wherever possible, using regular expressions. Regular expressions will always look for the leftmost longest match.1 With a regex substitution, the result will not be fed back for further processing.
Documentation and function design
You wrote a very long docstring, which is nice, but nowhere did you actually state the purpose of the function.
What is the purpose of the function? There is a
strip_chars phase, followed by a replace_extras phase. By the single-responsibility principle, consider splitting the function into two functions, or at least writing it as a composition of two helper functions.If you have specific examples of inputs and their corresponding outputs, it would be a good idea to write them as doctests instead.
Suggested solution
import re
def clean_up(text, strip_chars=[], replace_extras={}):
r"""
Remove all occurrences of strip_chars and whitespace at the beginning
and end of each line, then perform string substitutions specified by
the replace_extras dictionary (as well as normalizing all whitespace
to a single space character), and then strip whitespace from the
beginning and end.
>>> clean_up(' , , , .,.,.,.,,,......test, \t this\n'
... '.is.a\n'
... '.test...,,, , .', strip_chars=',.')
'test, this .is.a .test'
Any consecutive whitespace is normalized to a single space, but
you can override these implicit substitutions in replace_extras.
>>> clean_up(' this is one test\n!\n', replace_extras={'\n': '', 'one': '1'})
'this is 1 test!'
"""
# Handle strip_chars
strip_items = '|'.join(re.escape(s) for s in strip_chars)
strip_re = r'^(?:{}|\s)+|(?:{}|\s)+.format(strip_items, strip_items)
text = re.sub(strip_re, '', text, re.MULTILINE)
# Normalize whitespace and handle replace_extras
replace_keys = list(replace_extras.keys())
replace_keys.sort(key=len, reverse=True)
replace_re = '|'.join([re.escape(s) for s in replace_keys] + [r'\s+'])
return re.sub(
replace_re,
lambda match: replace_extras.get(match.group(), ' '),
text
).strip()Code Snippets
>>> clean_up('acetone', replace_extras={
... 'one': '1',
... 'acetone': '(CH3)2CO',
... 'CO': 'carbon monoxide',
... 'C': 'carbon',
... 'CH3': 'methane'
... })import re
def clean_up(text, strip_chars=[], replace_extras={}):
r"""
Remove all occurrences of strip_chars and whitespace at the beginning
and end of each line, then perform string substitutions specified by
the replace_extras dictionary (as well as normalizing all whitespace
to a single space character), and then strip whitespace from the
beginning and end.
>>> clean_up(' , , , .,.,.,.,,,......test, \t this\n'
... '.is.a\n'
... '.test...,,, , .', strip_chars=',.')
'test, this .is.a .test'
Any consecutive whitespace is normalized to a single space, but
you can override these implicit substitutions in replace_extras.
>>> clean_up(' this is one test\n!\n', replace_extras={'\n': '', 'one': '1'})
'this is 1 test!'
"""
# Handle strip_chars
strip_items = '|'.join(re.escape(s) for s in strip_chars)
strip_re = r'^(?:{}|\s)+|(?:{}|\s)+$'.format(strip_items, strip_items)
text = re.sub(strip_re, '', text, re.MULTILINE)
# Normalize whitespace and handle replace_extras
replace_keys = list(replace_extras.keys())
replace_keys.sort(key=len, reverse=True)
replace_re = '|'.join([re.escape(s) for s in replace_keys] + [r'\s+'])
return re.sub(
replace_re,
lambda match: replace_extras.get(match.group(), ' '),
text
).strip()Context
StackExchange Code Review Q#139549, answer score: 5
Revisions (0)
No revisions yet.