patternMinor
Block matching pattern matching problem
Viewed 0 times
patternblockmatchingproblem
Problem
I've struggling with this problem for a while.
Given text $T=t_1t_2\ldots t_n$ and pattern $P=p_1p_2\ldots p_m$ over alphabet $\Sigma=\mathbb{N}$, we say there's a block matching of $P$ in index $i$ if there exist $m$ successive blocks $S_1,\ldots,S_m$ such that $$p_1=\sum_{s_1 \in S_1}s_1 \land p_2=\sum_{s_2 \in S_2}s_2 \land \cdots \land p_m=\sum_{s_m \in S_m}s_m. $$
For example:
$T=\text{1 2 3 5 1 2 3 2}$
$P=\text{6 5}$
Then we have block matching in index 1 because $1+2+3=6$ and $5=5$.
Schematically it looks like: $\text{|1 2 3||5| 1 2 3 2}$
We also have a block matching in index 4 because $5+1=6$ and $2+3=5$.
Schematically it looks like: $\text{ 1 2 3 |5 1||2 3| 2}$
The task is to find all indices where we have block matching.
I couldn't do better then a $\mathcal{O}(mn)$ but I believe there's a FFT application here.
Any hints?
Given text $T=t_1t_2\ldots t_n$ and pattern $P=p_1p_2\ldots p_m$ over alphabet $\Sigma=\mathbb{N}$, we say there's a block matching of $P$ in index $i$ if there exist $m$ successive blocks $S_1,\ldots,S_m$ such that $$p_1=\sum_{s_1 \in S_1}s_1 \land p_2=\sum_{s_2 \in S_2}s_2 \land \cdots \land p_m=\sum_{s_m \in S_m}s_m. $$
For example:
$T=\text{1 2 3 5 1 2 3 2}$
$P=\text{6 5}$
Then we have block matching in index 1 because $1+2+3=6$ and $5=5$.
Schematically it looks like: $\text{|1 2 3||5| 1 2 3 2}$
We also have a block matching in index 4 because $5+1=6$ and $2+3=5$.
Schematically it looks like: $\text{ 1 2 3 |5 1||2 3| 2}$
The task is to find all indices where we have block matching.
I couldn't do better then a $\mathcal{O}(mn)$ but I believe there's a FFT application here.
Any hints?
Solution
If the numbers are all integers at most $k$, then you can apply a fast algorithm for substring matching with don't-care symbols (see below) to solve the problem in $O(kn(\log m + \log k))$ time by applying the following transformation:
Basically, 1s represent "boundaries" in both strings, and we look for a way to make each boundary in the pattern line up with some boundary in the text.
This paper describes an $O(n+m+\alpha)$-time algorithm, where $\alpha$ is "the total number of occurrences of the subpatterns" (I didn't read the full paper, so I'm not sure exactly what that means). In any case it mentions the 1974 algorithm of Fischer and Paterson, "String matching and other products", which solves the problem in $O(n\log m\log \Sigma)$ time -- though that link is behind a paywall, googling should find other information on it.
- Map each text character $t$ to $t-1$ zeros followed by a 1.
- Map each pattern character $p$ to $p-1$ don't-care symbols followed by a 1, and also prepend a 1 to the overall pattern.
Basically, 1s represent "boundaries" in both strings, and we look for a way to make each boundary in the pattern line up with some boundary in the text.
This paper describes an $O(n+m+\alpha)$-time algorithm, where $\alpha$ is "the total number of occurrences of the subpatterns" (I didn't read the full paper, so I'm not sure exactly what that means). In any case it mentions the 1974 algorithm of Fischer and Paterson, "String matching and other products", which solves the problem in $O(n\log m\log \Sigma)$ time -- though that link is behind a paywall, googling should find other information on it.
Context
StackExchange Computer Science Q#79533, answer score: 2
Revisions (0)
No revisions yet.