patternjavaModerate
Name Filtering using a Trie
Viewed 0 times
nameusingtriefiltering
Problem
This question suggests a different (improved?) implementation for searching/filtering names based on a search query. This is based on the question here: Name filtering in Java
A Trie allows the prefix of terms to be used as a search mechanism. The Trie requires pre-processing the data in to a data structure that has very efficient lookups. Lookups in the order of \$O(n)\$ where \$n\$ is the size of the prefix, not the amount of data.
For the purpose of this task, the system indexes a number of names. Names like "Rob Smith". The goal will be to filter names in a text box, and to do that really fast. As the user types in a name, we need to filter them. As the user types, we can use the Trie as follows:
As the user types, starting with 'r' for 'Rob', we can start at the root, then navigate to the 'r' entry. Any names under the 'r' all start with 'r'.
When the user types the 'o', we can go from the root to 'r' and then to 'o'. Any names under that 'o' node start with 'ro'. Wen the user types the 'b', we get to the node containing the name 'Rob Smith'.
The Trie tree above shows only one name, but there are thousands in there. 'Roland', 'Roger', etc. would all be names descendent from the r -> o branch.
Additionally, since people names have first names, and last names, all names are in multiple locations in the Trie. The name would be in one place based on 'Rob', and in another place based on 'Smith'.
By managing the query carefully, when the user enters
The following is a class which 'ingests' and indexes the names in the Trie, then allows you to filter them based on search terms. It is this code that is the primary focus of this Code Review question.
Of particular interest is:
Of course, all other input is valued too.
```
package nameit;
A Trie allows the prefix of terms to be used as a search mechanism. The Trie requires pre-processing the data in to a data structure that has very efficient lookups. Lookups in the order of \$O(n)\$ where \$n\$ is the size of the prefix, not the amount of data.
For the purpose of this task, the system indexes a number of names. Names like "Rob Smith". The goal will be to filter names in a text box, and to do that really fast. As the user types in a name, we need to filter them. As the user types, we can use the Trie as follows:
As the user types, starting with 'r' for 'Rob', we can start at the root, then navigate to the 'r' entry. Any names under the 'r' all start with 'r'.
When the user types the 'o', we can go from the root to 'r' and then to 'o'. Any names under that 'o' node start with 'ro'. Wen the user types the 'b', we get to the node containing the name 'Rob Smith'.
The Trie tree above shows only one name, but there are thousands in there. 'Roland', 'Roger', etc. would all be names descendent from the r -> o branch.
Additionally, since people names have first names, and last names, all names are in multiple locations in the Trie. The name would be in one place based on 'Rob', and in another place based on 'Smith'.
By managing the query carefully, when the user enters
r s (r space s), we can find all names that have a part starting with 'r', and the other name starting with 's'. This would find names like 'Rob Smith', and also 'Stan Rogers'.The following is a class which 'ingests' and indexes the names in the Trie, then allows you to filter them based on search terms. It is this code that is the primary focus of this Code Review question.
Of particular interest is:
- performance
- scalability
- memory usage
Of course, all other input is valued too.
```
package nameit;
Solution
Thread safety
NameParts (and thus child nodes) can still be added via
I have absolutely no idea what's going on here.
If there are no kids, make new ones... Okay, that's the create part...
And then...
It.. finds something? Okay, but if you get less than 0, it was not found ... wait, what, Arrays.binarySearch?
Huh?
What does
Well, at least we need to add a new node in case it wasn't found. That's what
You create a set, give it a name, but don't do anything with it after the function returns. Do you need to keep a reference to it?
I recommend hitting the auto-formatter in your IDE from time to time. Specifically; before committing to a repository and before posting questions to Code Review.
NameParts (and thus child nodes) can still be added via
NameBuilder.addName whilst refineNames is running. Your code is not completely thread safe. (partial credit to @AdriaanKoster for making me look in greater detail at thread-safety).public TrieNode getOrCreateChild(final char c) {
if (kids == null) {
kids = new TrieNode[8];
}
int ip = locateIndex(c);
if (ip < 0) {
// insert new node.
ip = -ip - 1;
if (kidSize == kids.length) {
kids = Arrays.copyOf(kids, kidSize + 8);
}
System.arraycopy(kids, ip, kids, ip + 1, kidSize - ip);
kids[ip] = new TrieNode(c);
kidSize++;
}
return kids[ip];
}I have absolutely no idea what's going on here.
If there are no kids, make new ones... Okay, that's the create part...
And then...
ip (not an ip address, so what is it?) is locateIndex...?private int locateIndex(final char c) {
// binary search for the kid TrieNode with the right char key.
// use the same -ip-1 convention from Arrays.binarySearch(...) to indicate a missing entry
int left = 0;
int right = kidSize - 1;
while (right >= left) {
int mid = left + (right - left) / 2;
char k = kids[mid].key;
if (k == c) {
return mid;
}
if (c < k) {
right = mid - 1;
} else {
left = mid + 1;
}
}
return -left - 1;
}It.. finds something? Okay, but if you get less than 0, it was not found ... wait, what, Arrays.binarySearch?
Huh?
What does
locateIndex even do?Well, at least we need to add a new node in case it wasn't found. That's what
- Instantiating a new array with some magic constant
- Search for where it is or should be
- If it's not there...
- Then we're gonna check if we have space
- And make some space if there's none
- Then we're gonna shift all the boxes to the side
- And put our new box there
- And then we can say okay I've got your box
This should be documented or split up.
I want to see something like this: (comments for mid code explanation)
public TrieNode getOrCreateChild(final char c) {
//no existance check! Make an array of 0 items instead.
int ip = locateIndex(c);//this is good, although I don't know what ip is.
if (ip < 0) {
// insert new node.
return createChild(-ip -1, c);//Creation can be delegated. In turn, creation should delegate array expansion.
}
return kids[ip];
}
Ahhh. Find or create child. Nice.
I'm still stuck with -ip -1 but that's because I have no idea what it's supposed to do.
If performance is REALLY of key issue as such that you can't split up functions, start putting in some comments. Because the next programmer will have to maintain this and he will not be able to.
Additionally, I saw these:
public NameControl(String name) {
this.name = name;
nameParts = name.trim().split("\\s+");
for (int i = 0; i < nameParts.length; i++) {
nameParts[i] = nameParts[i].toLowerCase();
}
}
Any reason you can't do name.trim().toLowerCase().split("\\s+"); and be done with it?
System.out.printf("Generated Name Trei: Input names in %.3fms, Trie in %.3fms%n", (read - start) / 1000000.0, (done - read) / 1000000.0);
You've got a typo here - "Trei" -> "Trie".
for (String name : builder.getRandomNames(count)) {
filter.addName(name);
}
Consider adding a helper method boolean addNames(Collection names)` to facilitate adding multiple names.Set roundHit = new HashSet<>();
// node, and all its children have this prefix...
walkTree(result, node, candidates, seen, roundHit);
return result;You create a set, give it a name, but don't do anything with it after the function returns. Do you need to keep a reference to it?
String name = pos < 0 ? source : source.substring(0,pos);I recommend hitting the auto-formatter in your IDE from time to time. Specifically; before committing to a repository and before posting questions to Code Review.
Code Snippets
public TrieNode getOrCreateChild(final char c) {
if (kids == null) {
kids = new TrieNode[8];
}
int ip = locateIndex(c);
if (ip < 0) {
// insert new node.
ip = -ip - 1;
if (kidSize == kids.length) {
kids = Arrays.copyOf(kids, kidSize + 8);
}
System.arraycopy(kids, ip, kids, ip + 1, kidSize - ip);
kids[ip] = new TrieNode(c);
kidSize++;
}
return kids[ip];
}private int locateIndex(final char c) {
// binary search for the kid TrieNode with the right char key.
// use the same -ip-1 convention from Arrays.binarySearch(...) to indicate a missing entry
int left = 0;
int right = kidSize - 1;
while (right >= left) {
int mid = left + (right - left) / 2;
char k = kids[mid].key;
if (k == c) {
return mid;
}
if (c < k) {
right = mid - 1;
} else {
left = mid + 1;
}
}
return -left - 1;
}public TrieNode getOrCreateChild(final char c) {
//no existance check! Make an array of 0 items instead.
int ip = locateIndex(c);//this is good, although I don't know what ip is.
if (ip < 0) {
// insert new node.
return createChild(-ip -1, c);//Creation can be delegated. In turn, creation should delegate array expansion.
}
return kids[ip];
}public NameControl(String name) {
this.name = name;
nameParts = name.trim().split("\\s+");
for (int i = 0; i < nameParts.length; i++) {
nameParts[i] = nameParts[i].toLowerCase();
}
}System.out.printf("Generated Name Trei: Input names in %.3fms, Trie in %.3fms%n", (read - start) / 1000000.0, (done - read) / 1000000.0);Context
StackExchange Code Review Q#75392, answer score: 10
Revisions (0)
No revisions yet.