HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Searching text for a string

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
textstringforsearching

Problem

This is my current code:

def search(word,source):
    places = []
    for x in range(0,len(source)):
        if source[x] == word[0]:
            if source[x:x+len(word)] == word:
                places.append(x)
    return places


How would I optimize this without using regular expressions?

Solution

Using timeit, we can make changes and see how it affects the speed. Baseline:

>>> def search1(word,source):
    places = []
    for x in range(0,len(source)):
        if source[x] == word[0]:
            if source[x:x+len(word)] == word:
                places.append(x)
    return places

>>> timeit.timeit("search1(word, source)", 
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000", 
                  number=10000)
15.60436644058882


Note the following:

  • You calculate word[0] and len(word) on every loop, these can be factored out;



  • A list comprehension is generally faster than for looping and appending; and



  • enumerate allows you to get the index and character from source simultaneously.



With those in mind:

>>> def search2(word, source):
    """Find every index in source at which word starts."""
    firstchar = word[0]
    wordlen = len(word)
    return [index for index, char in enumerate(source) 
            if firstchar == char and word == source[index:index+wordlen]]

>>> timeit.timeit("search2(word, source)", 
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000", 
                  number=10000)
9.99040215947457


I have also added a docstring to explain what the function does.

However, if you really want efficiency (and readability, for that matter), ignoring regular expressions is a bad move:

>>> def search3(word, source):
    """Find every index in source at which word starts."""
    return [m.start() for m in re.finditer(word, source)]

>>> timeit.timeit("search3(word, source)", 
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000", 
                  number=10000)
2.1652778798459167

Code Snippets

>>> def search1(word,source):
    places = []
    for x in range(0,len(source)):
        if source[x] == word[0]:
            if source[x:x+len(word)] == word:
                places.append(x)
    return places

>>> timeit.timeit("search1(word, source)", 
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000", 
                  number=10000)
15.60436644058882
>>> def search2(word, source):
    """Find every index in source at which word starts."""
    firstchar = word[0]
    wordlen = len(word)
    return [index for index, char in enumerate(source) 
            if firstchar == char and word == source[index:index+wordlen]]

>>> timeit.timeit("search2(word, source)", 
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000", 
                  number=10000)
9.99040215947457
>>> def search3(word, source):
    """Find every index in source at which word starts."""
    return [m.start() for m in re.finditer(word, source)]

>>> timeit.timeit("search3(word, source)", 
                  setup="from __main__ import search1, search2, search3; import re; word, source = 'foo', 'foobarbaz'*1000", 
                  number=10000)
2.1652778798459167

Context

StackExchange Code Review Q#70720, answer score: 5

Revisions (0)

No revisions yet.