patternpythonMinor
Autocomplete Trie Optimization
Viewed 0 times
optimizationautocompletetrie
Problem
I'm currently playing with the Typeahead problem on Talentbuddy. There are already 2 questions on this subject (Typeahead Talent Buddy and Typeahead autocomplete functionality challenge), but none of these use a trie which is why I created a new one.
Problem Statement
Given a big list of user names and a list of strings representing queries Your task is to:
Note that your function will receive the following arguments: usernames
-
Which is an array of strings representing the user names queries
-
Which is an array of strings representing the queries described above
Data constraints
-
The length of the array above will not exceed 100,000 entries
-
Each name or query string will not exceed 30 characters
Efficiency constraints
Example
I quickly whipped up a brute force solution which was far too slow (about 100 seconds on my laptop for one of the medium test data sets). Then I started researching and came to the conclusion that a trie is probably the best data structure for this. I first went with a generic trie which improved the speed by about 100x. This got me a bit further in the tests, but they try with a few bigger data sets after that and with the last one the algorithm is still too slow.
I've optimized the trie specifically for the problem, then profiled and finetuned the implementation, which got me another 100x speed imp
Problem Statement
Given a big list of user names and a list of strings representing queries Your task is to:
- Write a function that prints to the standard output (stdout) for each query the user name that matches the query
- If there are multiple user names matching the query please select the one that is the smallest lexicographically
- All string matches must be case insensitive
- If no match is found for a given query please print "-1"
Note that your function will receive the following arguments: usernames
-
Which is an array of strings representing the user names queries
-
Which is an array of strings representing the queries described above
Data constraints
-
The length of the array above will not exceed 100,000 entries
-
Each name or query string will not exceed 30 characters
Efficiency constraints
- Your function is expected to print the requested result and return in less than 2 seconds
Example
names: ["james", "jBlank"]
queries: ["j", "jm", "jbl", "JB"]
Output:
james
-1
jBlank
jBlank"I quickly whipped up a brute force solution which was far too slow (about 100 seconds on my laptop for one of the medium test data sets). Then I started researching and came to the conclusion that a trie is probably the best data structure for this. I first went with a generic trie which improved the speed by about 100x. This got me a bit further in the tests, but they try with a few bigger data sets after that and with the last one the algorithm is still too slow.
I've optimized the trie specifically for the problem, then profiled and finetuned the implementation, which got me another 100x speed imp
Solution
Do it the simple way: make the usernames into a list of tuples
Now, use the
The returned value is tuple
Adapting this into a class one gets:
This is guaranteed to be faster than your Trie method; the initial overhead comes from creating a tuple for all entries, which is still less burdensome than your solution having to create O(n) dictionaries; the lookup is guaranteed to be faster too.
In case this is not fast enough, then instead of a trie, you can store all prefixes in a single dictionary!
(username.lower(), username). Then sort this list; the usernames are now ordered lexicographically; in your example you'd getindex = [("james", "james"),
("jblank", "jBlank")]Now, use the
bisect module and the bisect_left adapted from find_le in the documentation page to find the rightmost value in the list with lowercase username less than or equal to the lowercase prefix; compare the returned value to i = bisect_left(index, (prefix, ''))
if 0 <= i < len(index):
found = index[i]
if found[0].startswith(prefix):
return found[1]
return '-1'The returned value is tuple
('username', 'UserName'); now just check if the entry[0].startswith(prefix); if it does, the answer is entry[1], if it does not, give -1.Adapting this into a class one gets:
from bisect import bisect_left
names = ["james", "jBlank"]
queries = ["j", "jm", "jbl", "JB"]
class Index(object):
def __init__(self, words):
self.index = [ (w.lower(), w) for w in words ]
self.index.sort()
def by_prefix(self, prefix):
"""Return lexicographically smallest word that starts with a given
prefix.
"""
prefix = prefix.lower()
i = bisect_left(self.index, (prefix, ''))
if 0 <= i < len(self.index):
found = self.index[i]
if found[0].startswith(prefix):
return found[1]
return '-1'
def typeahead(usernames, queries):
"""Given a list of users and queries,
this function prints all valid users for these queries.
Args:
usernames: list of strings representing users.
queries: list of strings representing (incomplete) input
"""
users = Index(usernames)
print("\n".join(users.by_prefix(q) for q in queries))
typeahead(names, queries)This is guaranteed to be faster than your Trie method; the initial overhead comes from creating a tuple for all entries, which is still less burdensome than your solution having to create O(n) dictionaries; the lookup is guaranteed to be faster too.
In case this is not fast enough, then instead of a trie, you can store all prefixes in a single dictionary!
class Index(object):
def __init__(self, words):
index = {}
for w in sorted(words, key=str.lower, reverse=True):
lw = w.lower()
for i in range(1, len(lw) + 1):
index[lw[:i]] = w
self.index = index
def by_prefix(self, prefix):
"""Return lexicographically smallest word that starts with a given
prefix.
"""
return self.index.get(prefix.lower(), '-1')Code Snippets
index = [("james", "james"),
("jblank", "jBlank")]i = bisect_left(index, (prefix, ''))
if 0 <= i < len(index):
found = index[i]
if found[0].startswith(prefix):
return found[1]
return '-1'from bisect import bisect_left
names = ["james", "jBlank"]
queries = ["j", "jm", "jbl", "JB"]
class Index(object):
def __init__(self, words):
self.index = [ (w.lower(), w) for w in words ]
self.index.sort()
def by_prefix(self, prefix):
"""Return lexicographically smallest word that starts with a given
prefix.
"""
prefix = prefix.lower()
i = bisect_left(self.index, (prefix, ''))
if 0 <= i < len(self.index):
found = self.index[i]
if found[0].startswith(prefix):
return found[1]
return '-1'
def typeahead(usernames, queries):
"""Given a list of users and queries,
this function prints all valid users for these queries.
Args:
usernames: list of strings representing users.
queries: list of strings representing (incomplete) input
"""
users = Index(usernames)
print("\n".join(users.by_prefix(q) for q in queries))
typeahead(names, queries)class Index(object):
def __init__(self, words):
index = {}
for w in sorted(words, key=str.lower, reverse=True):
lw = w.lower()
for i in range(1, len(lw) + 1):
index[lw[:i]] = w
self.index = index
def by_prefix(self, prefix):
"""Return lexicographically smallest word that starts with a given
prefix.
"""
return self.index.get(prefix.lower(), '-1')Context
StackExchange Code Review Q#63051, answer score: 8
Revisions (0)
No revisions yet.