HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Autocomplete Trie Optimization

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
optimizationautocompletetrie

Problem

I'm currently playing with the Typeahead problem on Talentbuddy. There are already 2 questions on this subject (Typeahead Talent Buddy and Typeahead autocomplete functionality challenge), but none of these use a trie which is why I created a new one.

Problem Statement


Given a big list of user names and a list of strings representing queries Your task is to:



  • Write a function that prints to the standard output (stdout) for each query the user name that matches the query



  • If there are multiple user names matching the query please select the one that is the smallest lexicographically



  • All string matches must be case insensitive



  • If no match is found for a given query please print "-1"





Note that your function will receive the following arguments: usernames



-
Which is an array of strings representing the user names queries

-
Which is an array of strings representing the queries described above



Data constraints



-
The length of the array above will not exceed 100,000 entries

-
Each name or query string will not exceed 30 characters



Efficiency constraints



  • Your function is expected to print the requested result and return in less than 2 seconds





Example

names: ["james", "jBlank"]
queries: ["j", "jm", "jbl", "JB"]
Output: 
      james
      -1
      jBlank
      jBlank"


I quickly whipped up a brute force solution which was far too slow (about 100 seconds on my laptop for one of the medium test data sets). Then I started researching and came to the conclusion that a trie is probably the best data structure for this. I first went with a generic trie which improved the speed by about 100x. This got me a bit further in the tests, but they try with a few bigger data sets after that and with the last one the algorithm is still too slow.

I've optimized the trie specifically for the problem, then profiled and finetuned the implementation, which got me another 100x speed imp

Solution

Do it the simple way: make the usernames into a list of tuples (username.lower(), username). Then sort this list; the usernames are now ordered lexicographically; in your example you'd get

index = [("james", "james"),
         ("jblank", "jBlank")]


Now, use the bisect module and the bisect_left adapted from find_le in the documentation page to find the rightmost value in the list with lowercase username less than or equal to the lowercase prefix; compare the returned value to

i = bisect_left(index, (prefix, ''))
if 0 <= i < len(index):
    found = index[i]
    if found[0].startswith(prefix):
        return found[1]

return '-1'


The returned value is tuple ('username', 'UserName'); now just check if the entry[0].startswith(prefix); if it does, the answer is entry[1], if it does not, give -1.

Adapting this into a class one gets:

from bisect import bisect_left

names = ["james", "jBlank"]
queries = ["j", "jm", "jbl", "JB"]

class Index(object):
    def __init__(self, words):
        self.index = [ (w.lower(), w) for w in words ]
        self.index.sort()

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """

        prefix = prefix.lower()
        i = bisect_left(self.index, (prefix, ''))
        if 0 <= i < len(self.index):
            found = self.index[i]
            if found[0].startswith(prefix):
                return found[1]

        return '-1'

def typeahead(usernames, queries):
    """Given a list of users and queries,
    this function prints all valid users for these queries.

    Args:
        usernames: list of strings representing users.
        queries: list of strings representing (incomplete) input
    """
    users = Index(usernames)
    print("\n".join(users.by_prefix(q) for q in queries))

typeahead(names, queries)


This is guaranteed to be faster than your Trie method; the initial overhead comes from creating a tuple for all entries, which is still less burdensome than your solution having to create O(n) dictionaries; the lookup is guaranteed to be faster too.

In case this is not fast enough, then instead of a trie, you can store all prefixes in a single dictionary!

class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            for i in range(1, len(lw) + 1):
                index[lw[:i]] = w

        self.index = index

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        return self.index.get(prefix.lower(), '-1')

Code Snippets

index = [("james", "james"),
         ("jblank", "jBlank")]
i = bisect_left(index, (prefix, ''))
if 0 <= i < len(index):
    found = index[i]
    if found[0].startswith(prefix):
        return found[1]

return '-1'
from bisect import bisect_left

names = ["james", "jBlank"]
queries = ["j", "jm", "jbl", "JB"]


class Index(object):
    def __init__(self, words):
        self.index = [ (w.lower(), w) for w in words ]
        self.index.sort()

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """

        prefix = prefix.lower()
        i = bisect_left(self.index, (prefix, ''))
        if 0 <= i < len(self.index):
            found = self.index[i]
            if found[0].startswith(prefix):
                return found[1]

        return '-1'


def typeahead(usernames, queries):
    """Given a list of users and queries,
    this function prints all valid users for these queries.

    Args:
        usernames: list of strings representing users.
        queries: list of strings representing (incomplete) input
    """
    users = Index(usernames)
    print("\n".join(users.by_prefix(q) for q in queries))

typeahead(names, queries)
class Index(object):
    def __init__(self, words):
        index = {}
        for w in sorted(words, key=str.lower, reverse=True):
            lw = w.lower()
            for i in range(1, len(lw) + 1):
                index[lw[:i]] = w

        self.index = index

    def by_prefix(self, prefix):
        """Return lexicographically smallest word that starts with a given
        prefix.
        """
        return self.index.get(prefix.lower(), '-1')

Context

StackExchange Code Review Q#63051, answer score: 8

Revisions (0)

No revisions yet.