Function to split strings on multiple delimiters

Submitted by: @import:stackexchange-codereview

Problem

I have this implementation of a split algorithm that, unlike the built-in .split() method, accepts multiple delimiters. Is this a good way of implementing it, particularly in terms of performance?

def split(str, delim=" "):
    index = 0
    string = ""
    array = []
    while index < len(str):
        if str[index] not in delim: 
            string += str[index]
        else:
            if string: 
                array.append(string)
                string = ""
        index += 1
    if string: array.append(string)
    return array


Using the standard .split() method:

>>> print "hello = 20".split()
['hello', '=', '20']

>>> print "one;two; abc; b ".split(";")
['one', 'two', ' abc', ' b ']


Using my implementation:

>>> print split("hello = 20")
['hello', '=', '20']

>>> print split("one;two; abc; b ", ";")
['one', 'two', ' abc', ' b ']


Multiple delimiters:

>>> print split("one;two; abc; b.e. b eeeeee.e.e;;e ;.", " .;")
['one', 'two', 'abc', 'b', 'e', 'b', 'eeeeee', 'e', 'e', 'e']

>>> print split("foo barfoo;bar;foo bar.foo", " .;")
['foo', 'barfoo', 'bar', 'foo', 'bar', 'foo']

>>> print split("foo*bar*foo.foo bar;", "*.")
['foo', 'bar', 'foo', 'foo bar;']


Note: something similar can be done with re.split().
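For comparison, the re.split() approach mentioned above might look like the following sketch. The name split_re is hypothetical and not part of the original post. It assumes that every character of delim is a separate single-character delimiter, as in the function above, and it drops empty strings so that runs of delimiters behave the same way.

```python
import re

def split_re(s, delim=" "):
    # Hypothetical helper: treat every character in delim as a delimiter.
    if not delim:
        # No delimiters: mirror the loop version, which keeps the whole string.
        return [s] if s else []
    # re.escape keeps characters such as '.', '*' or ']' literal inside the class.
    pattern = "[" + re.escape(delim) + "]+"
    # re.split can produce empty strings at the ends; filter them out.
    return [word for word in re.split(pattern, s) if word]
```

Whether this is faster than the hand-written loop depends on the input and the interpreter, so it is worth timing both on representative data.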

Solution

There's no need to iterate with a while loop and a manual index; a for loop over the characters is enough.

Also, string concatenation (+=) is expensive. It's better to use a list and join its elements at the end [1].

def split(s, delim=" "):
    words = []
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        else:
            if word:
                words.append(''.join(word))
                word = []
    if word:
        words.append(''.join(word))
    return words
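To check the cost of += against ''.join() on your own interpreter, a rough timing sketch along these lines may help. The function names concat_plus and concat_join and the sample data are made up for illustration. Results vary by Python version, and CPython optimizes some += cases, so the gap may be small.

```python
import timeit

def concat_plus(chars):
    # Build the string with repeated += (the pattern used in the question).
    out = ""
    for c in chars:
        out += c
    return out

def concat_join(chars):
    # Collect the pieces in a list and join once at the end.
    parts = []
    for c in chars:
        parts.append(c)
    return "".join(parts)

if __name__ == "__main__":
    data = "abc" * 10000  # hypothetical sample input
    for fn in (concat_plus, concat_join):
        seconds = timeit.timeit(lambda: fn(data), number=20)
        print("%s: %.4f s" % (fn.__name__, seconds))
```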


As Maarten Fabré suggested, you could also ditch the words list and transform the function into a generator that iterates over (yields) each word. This saves some memory if you're examining only one word at a time and don't need all of them in one shot, for example when you're counting word frequency (collections.Counter(isplit(s))).

def isplit(s, delim=" "):  # iterator version
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        else:
            if word:
                yield ''.join(word)
                word = []
    if word:
        yield ''.join(word)

def split(*args, **kwargs):  # only converts the iterator to a list
    return list(isplit(*args, **kwargs))
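As a sketch of the word-frequency use case, the generator can be fed straight into collections.Counter without building an intermediate list. The sample text is hypothetical, and isplit is repeated here only so the example runs on its own.

```python
import collections

def isplit(s, delim=" "):
    # Same generator as above, repeated so the example is self-contained.
    word = []
    for c in s:
        if c not in delim:
            word.append(c)
        elif word:
            yield ''.join(word)
            word = []
    if word:
        yield ''.join(word)

# Hypothetical sample text; Counter consumes the words one at a time.
text = "foo bar;foo.baz foo;bar"
counts = collections.Counter(isplit(text, " .;"))
# counts["foo"] is 3, counts["bar"] is 2, counts["baz"] is 1
```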


There's also a one-liner solution based on itertools.groupby:

import itertools

def isplit(s, delim=" "):  # iterator version
    # replace the outer parentheses (...) with brackets [...]
    # to transform the generator comprehension into a list comprehension
    # and return a list
    return (''.join(word)
            for is_word, word in itertools.groupby(s, lambda c: c not in delim)
            if is_word)

def split(*args, **kwargs):  # only converts the iterator to a list
    return list(isplit(*args, **kwargs))
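To see why the groupby version works, it can help to look at the raw groups before the filter is applied. In this sketch (the sample string is made up), consecutive characters are grouped by whether they are delimiters, so the groups alternate between words and delimiter runs.

```python
import itertools

s = "one;;two. three"  # hypothetical sample input
delim = " .;"

# Each group is a run of consecutive characters sharing the same key:
# True for word characters, False for delimiter characters.
groups = [(is_word, ''.join(chars))
          for is_word, chars in itertools.groupby(s, lambda c: c not in delim)]
# groups == [(True, 'one'), (False, ';;'), (True, 'two'),
#            (False, '. '), (True, 'three')]
```

Keeping only the groups whose key is True yields exactly the words, which is what the `if is_word` clause in isplit does.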


[1] From https://wiki.python.org/moin/PythonSpeed: "String concatenation is best done with ''.join(seq) which is an O(n) process. In contrast, using the + or += operators can result in an O(n**2) process because new strings may be built for each intermediate step. The CPython 2.4 interpreter mitigates this issue somewhat; however, ''.join(seq) remains the best practice."

Context

StackExchange Code Review Q#47627, answer score: 9
