patternjavaModerate
Suffix tree implementation in Java
Viewed 0 times
suffiximplementationtreejava
Problem
I was trying to solve the problem where given a larger string and an array/list of smaller string, we need to make the determination whether the larger string contains each of the smaller strings. I realized one way to do it would be to create a suffix tree for the larger string and then look for each of the smaller strings within the suffix tree.
My
import java.util.Map;
import java.util.HashMap;
import java.util.List;
import java.util.ArrayList;
public class SuffixTree{
class Node{
private final char currentValue;
private Map children;
Node(){
this.currentValue = '*';
this.children = new HashMap();
}
Node(char currentValue){
this.currentValue = currentValue;
this.children = new HashMap();
}
char getValue(){
return this.currentValue;
}
void addChild(Node c){
this.children.put(c.getValue(),c);
}
boolean hasChild(Node c){
return this.children.containsKey(c.getValue());
}
Node getChild(Node c){
return this.children.get(c.getValue());
}
public String toString(){
char currentValue = this.getValue();
StringBuffer keys = new StringBuffer();
for(char key:this.children.keySet()){
keys.append("Current key:"+key+"\n");
}
return "Current value:"+currentValue+"\n"+
"Keys:"+keys.toString();
}
}
private Node root;
private void log(Object l){
System.out.println(l);
}
/*
* Helper method that initializes the suffix tree
*/
private Node createSuffixTree(String source,Node root){
for(int i=0;i map){
for(char c:map.keySet()){
System.out.println("Current node has child"+c);
}
}
boolean search(String target){
Map rootChildren = this.root.children;
for(char c:target.toCharArray()){
if(rootChildren.containsKey(c)){
rootChildren = rootChildren.get(c).children;
}else{
My
SuffixTree is essentially a Node which contains a value and a Map of ` for its children.
``import java.util.Map;
import java.util.HashMap;
import java.util.List;
import java.util.ArrayList;
public class SuffixTree{
class Node{
private final char currentValue;
private Map children;
Node(){
this.currentValue = '*';
this.children = new HashMap();
}
Node(char currentValue){
this.currentValue = currentValue;
this.children = new HashMap();
}
char getValue(){
return this.currentValue;
}
void addChild(Node c){
this.children.put(c.getValue(),c);
}
boolean hasChild(Node c){
return this.children.containsKey(c.getValue());
}
Node getChild(Node c){
return this.children.get(c.getValue());
}
public String toString(){
char currentValue = this.getValue();
StringBuffer keys = new StringBuffer();
for(char key:this.children.keySet()){
keys.append("Current key:"+key+"\n");
}
return "Current value:"+currentValue+"\n"+
"Keys:"+keys.toString();
}
}
private Node root;
private void log(Object l){
System.out.println(l);
}
/*
* Helper method that initializes the suffix tree
*/
private Node createSuffixTree(String source,Node root){
for(int i=0;i map){
for(char c:map.keySet()){
System.out.println("Current node has child"+c);
}
}
boolean search(String target){
Map rootChildren = this.root.children;
for(char c:target.toCharArray()){
if(rootChildren.containsKey(c)){
rootChildren = rootChildren.get(c).children;
}else{
Solution
Let's break down your algorithm for a moment:
With that tree, you can then compare matching values to see if they exist in the tree. To see if they exist, you:
Problems
Advantages
Alternatives
There are two alternatives I recommend, the differences between them will depend on the number of letters in the input word. The one solution has a runtime performance of \$O(1)\$, but a memory consumption of \$O(n^2)\$. The other has a runtime performance of \$O(n \log n)\$ and a memory consumption of \$O(n)\$. Note that your solution has a runtime performance of \$O(n)\$ and a memory size of \$O(n^2)\$.
What does that mean? It means the one alternative will have slitgtly worse performance than yours for larger inputs, but will take much less space. The other alternative will be much faster than yours for larger inputs, and will take about the same proportion of space. Each alternative has merits over yours.
In your solution, with \$n\$ being the number of letters in the input word ("bananas"), your runtime performance requires you scanning the Node tree for as many as
The two alternatives I suggest are different, in that the one alternative will have an \$O(n \log n)\$ search performance but an \$O(n)\$ memory performance. Because of the way the data is stored, though, for smaller input strings ("bananas" is small), it will likely be much faster though, than yours.
The other solution is much faster (essentially constant time - \$O(1)\$ ), but has a higher memory cost - about the same as your code.
Fastest - and Largest
The fastest solution is to take all possible substrings from the input and put them in a
Now, there's a single HashSet (which under the covers is a HashMap), and it contains all substrings. Searching for those substrings is a case of computing the hashCode of the search value, and it will find the value "fast"...
That algorithm stores potentially a lot of strings, but the search is lightning fast.
The way this code works, is by taking the input string, and splitting it in all possible ways. For example, from "fubar" you will get:
f fu fub fuba fubar u ub uba ubar b ba bar a ar r
Put those all in the set, and every possible infix "search" word is recorded. The HashSet makes the search effectively an O(1) operation.
Fastish - and Smallish
The second alternative will slow down (very slightly) as the input words get larger, but the memory footprint will be relatively small.
First, create an Array of words.... each word being a suffix:
Then, sort it....
That array is pretty small in comparison to your alternatives....
Now, using that, you can quickly find (in \$O(n \log n)\$ time) whether a search string is a suffix:
This algorithm works in part by relying on the lexical (alphabetical) sorting of any suffix of the input word. Again, using "fubar", the code creates all 5 suffixes:
It then sorts them alphabetically:
```
- get a String to analyze ("bananas").
- do a 'nested' loop (i,j) to find every possible sub-word in it
- for each sub-word, inject it in to the possible node tree 1 character at a time.
With that tree, you can then compare matching values to see if they exist in the tree. To see if they exist, you:
- loop over each character in the search word
- for each character, you check whether a Node exists matching the respective character a the respective level.
- if any character is impossible, you return false. If all characters are possible, you return true.
Problems
- the name - it is not a "Suffix" Tree, it is an "Infix" tree. It does not match only suffixes.
- the overhead - The HashMap instances and the Character and Node classes, are a problem from a memory perspective. Sure, the count of these instances will be relatively small, but, for "bananas", you are creating about.... 28 HashMaps? Each HashMap has a significant memory footprint. It is expensive.
Advantages
- the perceived advantage here is that you have discrete places where the data exists in a tree. You can check for infixes starting from any character in the base word, at the same time.
Alternatives
There are two alternatives I recommend, the differences between them will depend on the number of letters in the input word. The one solution has a runtime performance of \$O(1)\$, but a memory consumption of \$O(n^2)\$. The other has a runtime performance of \$O(n \log n)\$ and a memory consumption of \$O(n)\$. Note that your solution has a runtime performance of \$O(n)\$ and a memory size of \$O(n^2)\$.
What does that mean? It means the one alternative will have slitgtly worse performance than yours for larger inputs, but will take much less space. The other alternative will be much faster than yours for larger inputs, and will take about the same proportion of space. Each alternative has merits over yours.
In your solution, with \$n\$ being the number of letters in the input word ("bananas"), your runtime performance requires you scanning the Node tree for as many as
n nodes (one for each letter), which makes your check performance proportional to the number of characters in the input word. The number of nodes you have is proportional to the square of the number of input letters, so if you double the number of letters, you quadruple the number of nodes.The two alternatives I suggest are different, in that the one alternative will have an \$O(n \log n)\$ search performance but an \$O(n)\$ memory performance. Because of the way the data is stored, though, for smaller input strings ("bananas" is small), it will likely be much faster though, than yours.
The other solution is much faster (essentially constant time - \$O(1)\$ ), but has a higher memory cost - about the same as your code.
Fastest - and Largest
The fastest solution is to take all possible substrings from the input and put them in a
HashSet. The input processing is simple:Set infixes = new HashSet<>();
for (int i = 0; i < input.length; i++) {
for (j = i + 1; j < input.length; j++) {
infixes.add(input.substring(i,j));
}
}Now, there's a single HashSet (which under the covers is a HashMap), and it contains all substrings. Searching for those substrings is a case of computing the hashCode of the search value, and it will find the value "fast"...
return infixes.contains(search);That algorithm stores potentially a lot of strings, but the search is lightning fast.
The way this code works, is by taking the input string, and splitting it in all possible ways. For example, from "fubar" you will get:
f fu fub fuba fubar u ub uba ubar b ba bar a ar r
Put those all in the set, and every possible infix "search" word is recorded. The HashSet makes the search effectively an O(1) operation.
Fastish - and Smallish
The second alternative will slow down (very slightly) as the input words get larger, but the memory footprint will be relatively small.
First, create an Array of words.... each word being a suffix:
String[] suffixes = new String[input.length];
for (int i = 0; i < input.length; i++) {
suffixes[i] = input.substring(i);
}Then, sort it....
Arrays.sort(suffixes);That array is pretty small in comparison to your alternatives....
Now, using that, you can quickly find (in \$O(n \log n)\$ time) whether a search string is a suffix:
int ip = Arrays.binarySearch(suffixes, search);
if (ip >= 0) {
// exact match to a suffix
return true;
}
// no exact match, but, maybe the place the value would
// belong is a longer version of our search term....
ip = -ip - 1;
return ip < suffixes.length && suffixes[ip].startsWith(search);This algorithm works in part by relying on the lexical (alphabetical) sorting of any suffix of the input word. Again, using "fubar", the code creates all 5 suffixes:
fubar ubar bar ar rIt then sorts them alphabetically:
```
Code Snippets
Set<String> infixes = new HashSet<>();
for (int i = 0; i < input.length; i++) {
for (j = i + 1; j < input.length; j++) {
infixes.add(input.substring(i,j));
}
}return infixes.contains(search);String[] suffixes = new String[input.length];
for (int i = 0; i < input.length; i++) {
suffixes[i] = input.substring(i);
}Arrays.sort(suffixes);int ip = Arrays.binarySearch(suffixes, search);
if (ip >= 0) {
// exact match to a suffix
return true;
}
// no exact match, but, maybe the place the value would
// belong is a longer version of our search term....
ip = -ip - 1;
return ip < suffixes.length && suffixes[ip].startsWith(search);Context
StackExchange Code Review Q#87312, answer score: 12
Revisions (0)
No revisions yet.