HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Trie Tree match string containing ? and *

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
containingmatchtrieandstringtree

Problem

Using Python 2.7 and here is my code which implement trie match for a whole word, and also support if the whole word contains ? (match any one character) or * (any zero or more characters).

My question is wondering if any improvement I can make (in terms of performance) and also if any code functional bugs, please feel free to point out.

from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.isEnd = False

    def insert(self, source):
        if not source:
            return
        node = self
        for s in source:
            node = node.children[s]
        node.isEnd = True

    def search(self, source):
        if not source:
            return self.isEnd == True
        if source[0] == '?':
            for (ch,node) in self.children.items():
                if node.search(source[1:]) == True:
                    return True
            return False
        elif source[0] == '*':
            for (ch, node) in self.children.items():
                if ((node.search(source[1:]) == True) or
                    (node.search(source) == True)):
                    return True
            return False
        else:
            if source[0] in self.children:
                node = self.children[source[0]]
                return node.search(source[1:])
            else:
                return False
if __name__ == "__main__":
    root = TrieNode()
    root.insert('abc')
    root.insert('acd')
    root.insert('bcd')
    print root.search('acd') # True
    print root.search('acdd') # False
    print root.search('aaa') # False
    print root.search('a?d') # True
    print root.search('a*d') # True
    print root.search('ad*') # False
    print root.search('*c*') # True
    print root.search('*c?') # True
    print root.search('*a?') # False

Solution


  1. Review



-
The implementation with defaultdict is very elegant. I like it.

-
The code is not portable to Python 3 because of the use of the print statement. It would be easy to make it portable using from __future__ import print_function.

-
Set-like data structures contain elements, so element would be a better variable name than source.

-
It's inconvenient to create a trie from some other data structure like a list of strings: you have to call the insert method for each element you want to add. It would be better if the constructor took an optional iterator and inserted all the elements, just like the constructors for other collections in Python.

-
Instead of self.isEnd == True, just write self.isEnd, and similarly for other comparisons against True.

-
The attributes children and isEnd are only intended for use by the TrieNode class itself. It's conventional to give such attributes names starting with _.

-
The code for insert starts like this:

if not source:
    return


which means that if you try to add the empty string to the trie, it apparently succeeds, but in fact the empty string was not added. This is misleading. If you really want to prevent the caller adding the empty string, then raise an exception:

if not source:
    raise ValueError("empty string not supported")


But this seems inconvenient to me, so instead I suggest removing these two lines and allowing the empty string as an element. But this reveals a problem with the search method:

>>> root = TrieNode()
>>> root.insert('')
>>> root.search('*')
False


To fix this, revise the search method as follows:

elif source[0] == '*':
    if node.search(source[1:]):
        return True
    for (ch, node) in self.children.items():
        if node.search(source) == True:
            return True
    return False


(but see below for further improvements to this code).

-
In this code:

for (ch,node) in self.children.items():
    if node.search(source[1:]) == True:
        return True
return False


you don't make any use of ch, so it would be simpler to write:

for node in self.children.values():
    if node.search(source[1:]):
        return True
return False


which is the same as:

return any(node.search(source[1:]) for node in self.children.values())


-
Instead of computing source[1:] on each iteration of the loop, compute it once and remember it in a local variable:

rest = source[1:]
return any(node.search(rest) for node in self.children.values())


But better still, reorganize the search so that it remembers the current index into the search term, then you don't have to construct all these substrings. See below for how this might be implemented.

-
The interface to the search method makes it impossible to tell if strings containing wildcards are elements of the trie. Consider:

>>> import random
>>> root = TrieNode()
>>> root.insert(random.choice(['?', 'x']))
>>> root.search('?')
True


Is '?' a member of the trie or not? It would make sense to provide an alternative search method that doesn't use wildcards: for example the caller could write 'element' in trie instead of trie.search('element'). This requires a __contains__ method, which is straightforward to implement since it doesn't need to consider wildcards:

def __contains__(self, element):
    node = self
    for k in element:
        if k not in node._children:
            return False
        node = node._children[k]
    return node._end


Now you can accurately determine membership for strings containing wildcards:

>>> '?' in root
False


-
The search method only tells you whether some element in the trie matched the search term, and doesn't tell you which element or elements match. When the search term contains wildcards, it would be more useful if the search generated the matching elements. For example, if you have added the words in a dictionary to a trie, then you'd like to be able to query 'p?t' and get out 'pat', 'pet', 'pit', 'pot', 'put'. See the revised code in §2 below, and further discussion in §3.

-
The data structure presents a set-like interface (it represents a collection of distinct strings where you can add new elements and test elements for membership). It would therefore make sense to design the interface so that it uses the same method names as Python's built-in sets, that is, add instead of insert, and to implement more of the set interface, for example __iter__, __len__, __or__, __and__ and so on.

The easiest way to do this is to inherit from collections.abc.Set. The idea is that your class implements the __contains__, __iter__ and __len__ methods, and collections.abc.Set class implements the other set methods in terms of these. See the revised code in §2 below.

-
The code has test cases, but it's hard to check that they are correct: you have to carefully compare the output against the e

Code Snippets

if not source:
    return
if not source:
    raise ValueError("empty string not supported")
>>> root = TrieNode()
>>> root.insert('')
>>> root.search('*')
False
elif source[0] == '*':
    if node.search(source[1:]):
        return True
    for (ch, node) in self.children.items():
        if node.search(source) == True:
            return True
    return False
for (ch,node) in self.children.items():
    if node.search(source[1:]) == True:
        return True
return False

Context

StackExchange Code Review Q#142878, answer score: 10

Revisions (0)

No revisions yet.