HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Check if strings from a list in a text and return first longest match

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
checkreturntextlongestmatchfirstandlistfromstrings

Problem

I'm trying to design an application that binds a phrase or word to some item, for example image, and saves phrase-item pair in the database. Then it receives a text, and if it contains binded substring, corresponding item should be returned.
It should return only one item (first match), and longest substrings should take precedence.

I wrote a function, that returns expected values:

from operator import itemgetter

def get_item(text, bindings): 
    text = text.lower()
    matches = []
    for phrase, item in bindings:
        phrase = phrase.lower()
        index = text.find(phrase)
        if index != -1:
            matches.append((phrase, item, index))
    if matches:
        matches.sort(key=lambda x: len(x[0]), reverse=True)
        matches.sort(key=itemgetter(2))
        item_id = matches[0][1]
    else:
        item_id = None
    return item_id


Example:

bindings = [
('i', 'item1'), ('like', 'item2'), ('i like', 'item3'), ('turtles', 'item4'),
]
text = 'I like turtles!'
print(get_item(text, bindings)) # should return item3


Is there cleaner ways to complete such task, or faster, perhaps?

Solution

You can use an early return in your function to get rid of a few lines, because python's default return value is already None:

from operator import itemgetter

def get_item(text, bindings): 
    text = text.lower()
    matches = []
    for phrase, item in bindings:
        index = text.find(phrase)
        if index != -1:
            matches.append((phrase, item, index))
    if matches:
        matches.sort(key=lambda x: len(x[0]), reverse=True)
        matches.sort(key=itemgetter(2))
        return matches[0][1]


Algorithm wise, I'm not sure this code is up to the specification you wrote. You said, phrases should be preferred over single words. However, you only sort by length. Therefore

bindings = [('i am a', 'item1'), ('terminator', 'item2')]
text = 'I am a terminator'
print get_item(text, bindings)


will print 'item2' and not 'item1', even though it is a single word and the other binding is a phrase. Maybe also use something like len(x[0].split()) to give you the number of words as the sort key?

Also, your code scales with something like \$\mathcal{O}(m)\mathcal{O}(nk)\$, where \$m\$ is the number of defined bindings, \$n\$ is the length of the text and \$k\$ is the average length of the phrases in the bindings. I'm not sure whether there is a better algorithm, though.

Code Snippets

from operator import itemgetter

def get_item(text, bindings): 
    text = text.lower()
    matches = []
    for phrase, item in bindings:
        index = text.find(phrase)
        if index != -1:
            matches.append((phrase, item, index))
    if matches:
        matches.sort(key=lambda x: len(x[0]), reverse=True)
        matches.sort(key=itemgetter(2))
        return matches[0][1]
bindings = [('i am a', 'item1'), ('terminator', 'item2')]
text = 'I am a terminator'
print get_item(text, bindings)

Context

StackExchange Code Review Q#140777, answer score: 2

Revisions (0)

No revisions yet.