HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Understanding the heuristic used for approximate string searching through an FSA

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
understandingtheapproximatesearchingusedheuristicforthroughstringfsa

Problem

The paper I'm looking at: Fast approximate string matching with finite automata (2009)

Explanation of the algorithm (from my understanding anyway):

A word is inputted into the automaton and from each state, a number of possible actions can be taken depending on the current symbol/character being looked at at position $pos$. These "actions" include:

  • Staying at the current state and deleting the character currently being pointed at by $pos$.



  • Taking a transition to a next state and substituting the current


character at $pos$ with the transition label used.

  • Taking a transition to a next state and inserting the transition


label at $pos$.

The action is decides to take is based on a cost function $f=g+h$ (where the lowest $f$ is chosen) for each action.

$g = 0$ if no edit operation is performed on the input word (i.e. the transition label taken to a next state, is the same as the current character at pos) and $g=1$ otherwise.

$h$ is the maximum between two heuristic functions $h(2)$ and $h(\infty)$.

-
$h(2)$ looks at the transition labels that can be reached within two
transitions from that state (note, not the current state) and
compares those reachable transitions to the next two characters in
the input word after $pos$. $h(2)$ gets $+1$ added for each character in
the input word that does not match any of the characters that the
state $s$ can reach within two transitions.

-
$h(\infty)$ looks at the transition labels that can be reached until
the end of the path from that state and compares that to the rest of the input word after pos. As with $h(2)$, $+1$ for each character in the input word that can't be matched to any of the reachable characters from state $s$.

Eventually this algorithm will lead to some accepting state where the path up to that accepting state is the closest accepted word to the input.

I think I understand the algorithm correctly since I tried to implement it from my understanding in Java (I'm doing it for my project, although m

Solution

Those two statements are correlated: "giving priority to the highest word position" is the same as "the node that has advanced farthest in the word is given priority".

This appears to be why the algorithm selects state three. State zero does not result in an advancement in word position, whereas the transition to state three is further in the word position and so satisfies their heuristic of highest word position and thus gets priority for first expansion.

To see why this heuristic results in better results, examine the node zero. Deleting 'd' you see all the letters remaining are in the reachable nodes from zero, which means $h = 0$, and we only performed 1 operation, so $f = 1$. However in order to get to the acceptance state along the shortest path we have to take another operation to go to 3 while adding 'c' back, then we transition through the states 5, and 7 making no changes. Obviously the fastest (and shortest) path is to simply swap 'd' to 'c' and transition to state three first, which the heuristic suggests in this case.

Context

StackExchange Computer Science Q#37451, answer score: 3

Revisions (0)

No revisions yet.