patternpythonModerate
Trie Tree match string containing ? and *
Viewed 0 times
containingmatchtrieandstringtree
Problem
Using Python 2.7 and here is my code which implement trie match for a whole word, and also support if the whole word contains
My question is wondering if any improvement I can make (in terms of performance) and also if any code functional bugs, please feel free to point out.
? (match any one character) or * (any zero or more characters).My question is wondering if any improvement I can make (in terms of performance) and also if any code functional bugs, please feel free to point out.
from collections import defaultdict
class TrieNode:
def __init__(self):
self.children = defaultdict(TrieNode)
self.isEnd = False
def insert(self, source):
if not source:
return
node = self
for s in source:
node = node.children[s]
node.isEnd = True
def search(self, source):
if not source:
return self.isEnd == True
if source[0] == '?':
for (ch,node) in self.children.items():
if node.search(source[1:]) == True:
return True
return False
elif source[0] == '*':
for (ch, node) in self.children.items():
if ((node.search(source[1:]) == True) or
(node.search(source) == True)):
return True
return False
else:
if source[0] in self.children:
node = self.children[source[0]]
return node.search(source[1:])
else:
return False
if __name__ == "__main__":
root = TrieNode()
root.insert('abc')
root.insert('acd')
root.insert('bcd')
print root.search('acd') # True
print root.search('acdd') # False
print root.search('aaa') # False
print root.search('a?d') # True
print root.search('a*d') # True
print root.search('ad*') # False
print root.search('*c*') # True
print root.search('*c?') # True
print root.search('*a?') # FalseSolution
- Review
-
The implementation with
defaultdict is very elegant. I like it.-
The code is not portable to Python 3 because of the use of the
print statement. It would be easy to make it portable using from __future__ import print_function.-
Set-like data structures contain elements, so
element would be a better variable name than source.-
It's inconvenient to create a trie from some other data structure like a list of strings: you have to call the
insert method for each element you want to add. It would be better if the constructor took an optional iterator and inserted all the elements, just like the constructors for other collections in Python.-
Instead of
self.isEnd == True, just write self.isEnd, and similarly for other comparisons against True.-
The attributes
children and isEnd are only intended for use by the TrieNode class itself. It's conventional to give such attributes names starting with _.-
The code for
insert starts like this:if not source:
returnwhich means that if you try to add the empty string to the trie, it apparently succeeds, but in fact the empty string was not added. This is misleading. If you really want to prevent the caller adding the empty string, then raise an exception:
if not source:
raise ValueError("empty string not supported")But this seems inconvenient to me, so instead I suggest removing these two lines and allowing the empty string as an element. But this reveals a problem with the
search method:>>> root = TrieNode()
>>> root.insert('')
>>> root.search('*')
FalseTo fix this, revise the
search method as follows:elif source[0] == '*':
if node.search(source[1:]):
return True
for (ch, node) in self.children.items():
if node.search(source) == True:
return True
return False(but see below for further improvements to this code).
-
In this code:
for (ch,node) in self.children.items():
if node.search(source[1:]) == True:
return True
return Falseyou don't make any use of
ch, so it would be simpler to write:for node in self.children.values():
if node.search(source[1:]):
return True
return Falsewhich is the same as:
return any(node.search(source[1:]) for node in self.children.values())-
Instead of computing
source[1:] on each iteration of the loop, compute it once and remember it in a local variable:rest = source[1:]
return any(node.search(rest) for node in self.children.values())But better still, reorganize the search so that it remembers the current index into the search term, then you don't have to construct all these substrings. See below for how this might be implemented.
-
The interface to the
search method makes it impossible to tell if strings containing wildcards are elements of the trie. Consider:>>> import random
>>> root = TrieNode()
>>> root.insert(random.choice(['?', 'x']))
>>> root.search('?')
TrueIs
'?' a member of the trie or not? It would make sense to provide an alternative search method that doesn't use wildcards: for example the caller could write 'element' in trie instead of trie.search('element'). This requires a __contains__ method, which is straightforward to implement since it doesn't need to consider wildcards:def __contains__(self, element):
node = self
for k in element:
if k not in node._children:
return False
node = node._children[k]
return node._endNow you can accurately determine membership for strings containing wildcards:
>>> '?' in root
False-
The
search method only tells you whether some element in the trie matched the search term, and doesn't tell you which element or elements match. When the search term contains wildcards, it would be more useful if the search generated the matching elements. For example, if you have added the words in a dictionary to a trie, then you'd like to be able to query 'p?t' and get out 'pat', 'pet', 'pit', 'pot', 'put'. See the revised code in §2 below, and further discussion in §3.-
The data structure presents a set-like interface (it represents a collection of distinct strings where you can add new elements and test elements for membership). It would therefore make sense to design the interface so that it uses the same method names as Python's built-in sets, that is,
add instead of insert, and to implement more of the set interface, for example __iter__, __len__, __or__, __and__ and so on.The easiest way to do this is to inherit from
collections.abc.Set. The idea is that your class implements the __contains__, __iter__ and __len__ methods, and collections.abc.Set class implements the other set methods in terms of these. See the revised code in §2 below.-
The code has test cases, but it's hard to check that they are correct: you have to carefully compare the output against the e
Code Snippets
if not source:
returnif not source:
raise ValueError("empty string not supported")>>> root = TrieNode()
>>> root.insert('')
>>> root.search('*')
Falseelif source[0] == '*':
if node.search(source[1:]):
return True
for (ch, node) in self.children.items():
if node.search(source) == True:
return True
return Falsefor (ch,node) in self.children.items():
if node.search(source[1:]) == True:
return True
return FalseContext
StackExchange Code Review Q#142878, answer score: 10
Revisions (0)
No revisions yet.