patternMinor
Turn one string into another with single letter substitions
Viewed 0 times
turnwithintosubstitionsonesingleanotherletterstring
Problem
I want to turn one string into another with only single letter substitions. What is a good way to do this, passing through only valid words in between (this website has some examples)?
Valid here means "a word in English" as this is the domain I consider.
My current idea is that I could use a shortest path algorithm with the Hamming distance for edge weights. The problem is that it will take a long time to build the graph, and even then the weight is not so precise in terms of distance (though it will never underestimate it) unless the weight is one, so I would probably have to find a to build a graph that only had weights of one.
What would be the easiest way to build the graph? Am I taking entirely the wrong approach?
Valid here means "a word in English" as this is the domain I consider.
My current idea is that I could use a shortest path algorithm with the Hamming distance for edge weights. The problem is that it will take a long time to build the graph, and even then the weight is not so precise in terms of distance (though it will never underestimate it) unless the weight is one, so I would probably have to find a to build a graph that only had weights of one.
What would be the easiest way to build the graph? Am I taking entirely the wrong approach?
Solution
The graph I consider has the valid words for vertices and edges correspond to one-letter substitutions. In this graph the Hamming distance never overestimates the distance but can underestimate it (e.g. between $aa$ and $cc$ in the set of valid words ${aa,ab,bb,bc,cc}$)
You don't need an actual graph, you just need a representation of it. The hard part is computing the neighbors : you may want to store the valid words into a prefix tree. But for the search algorithm you already have a non-overestimating distance so you can use that as a admissible heuristic for the $A^*$ algorithm. You can use an hash table to store the marked vertices.
Building the prefix tree
To list all the neighbors of $w$ first use this function that computes all words $v$ such that $w$ and $v$ differs only on the $i$th letter. It runs in $O(|Σ||w|)$ if you implement it correctly : $w$ must be in the form of a list for the
Then to list all the neighbors, just run:
Which runs in $O(|Σ||w|^2)$.
Compared to other methods, where $N_V$ the number of valid words.
Precomputation
At first the goal was to get all the neighbors in only one look up in time $O(|w|)$ by previously having computed some kind of prefix tree with wildcards. (I don't really know the name of this data structure) Then I realized that it can simply be done with just a reverse search.
I insist that this is only interesting if you only care about the look up time. I would probably be a good idea if you have to find the shortest path between several pairs of words. If it's only one time the method above will be enough.
This is really very simple : for each word $w$, enumerate all the possible neighbors $v$ of $w$ and check if they are valid. This is in $O(N_V\log N_V |w||Σ|)$ or $O(N_V^2)$ if $N_V$ is small enough for this to be interesting. Add them to your automaton / prefix tree / hash table with the key $v$ and the value $w$.
Then getting all the neighbors of $v$ is looking up in this structure and getting back all the $w$'s. So you can make your search algorithm with this.
Implementation details:
Conclusion
If the set of valid words does not change I would suggest precomputing and storing in a hash table and using the A* with the hamming distance admissible heuristic.
You don't need an actual graph, you just need a representation of it. The hard part is computing the neighbors : you may want to store the valid words into a prefix tree. But for the search algorithm you already have a non-overestimating distance so you can use that as a admissible heuristic for the $A^*$ algorithm. You can use an hash table to store the marked vertices.
Building the prefix tree
To list all the neighbors of $w$ first use this function that computes all words $v$ such that $w$ and $v$ differs only on the $i$th letter. It runs in $O(|Σ||w|)$ if you implement it correctly : $w$ must be in the form of a list for the
tail() dans head() operation to be in constant time. The operation prefix . c . tail(w) just has to be in linear time.neighbors(i, prefix, w, trie) =
if(i == 0) :
for c in 'a' .. 'z' :
if(tail(w) is a word of trie) :
add(prefix . c . tail(w))
else :
neighbors(i-1, prefix . w[0], tail(w), trie[head(w)])Then to list all the neighbors, just run:
for i in 0 .. length(w) - 1 :
neighbors(i, emptyprefix, w, trie_of_valid_words)Which runs in $O(|Σ||w|^2)$.
Compared to other methods, where $N_V$ the number of valid words.
- if you use a hash table and enumerate all the potential neighbors: $O(|Σ||w|^2)$ in the practical case and $O(|Σ||w|^2N_V)$ in the worst case. It might be faster in practice though.
- if you use a binary search tree: $O(|Σ||w|\log N_V)$ in the worst and average case.
Precomputation
At first the goal was to get all the neighbors in only one look up in time $O(|w|)$ by previously having computed some kind of prefix tree with wildcards. (I don't really know the name of this data structure) Then I realized that it can simply be done with just a reverse search.
I insist that this is only interesting if you only care about the look up time. I would probably be a good idea if you have to find the shortest path between several pairs of words. If it's only one time the method above will be enough.
This is really very simple : for each word $w$, enumerate all the possible neighbors $v$ of $w$ and check if they are valid. This is in $O(N_V\log N_V |w||Σ|)$ or $O(N_V^2)$ if $N_V$ is small enough for this to be interesting. Add them to your automaton / prefix tree / hash table with the key $v$ and the value $w$.
Then getting all the neighbors of $v$ is looking up in this structure and getting back all the $w$'s. So you can make your search algorithm with this.
Implementation details:
- generic binary tree : look up in $O(\log(N_V|Σ||w|))$ which is $O(|w|+\log|Σ|)$ but you don't care about the $\log|Σ|$.
- trie : look up in $O(|w|)$
- hash : look up in $O(|w|)$ (and worst worst case)
Conclusion
If the set of valid words does not change I would suggest precomputing and storing in a hash table and using the A* with the hamming distance admissible heuristic.
Code Snippets
neighbors(i, prefix, w, trie) =
if(i == 0) :
for c in 'a' .. 'z' :
if(tail(w) is a word of trie) :
add(prefix . c . tail(w))
else :
neighbors(i-1, prefix . w[0], tail(w), trie[head(w)])for i in 0 .. length(w) - 1 :
neighbors(i, emptyprefix, w, trie_of_valid_words)Context
StackExchange Computer Science Q#1785, answer score: 6
Revisions (0)
No revisions yet.