patternMinor
Dynamic programming table for finding similar substrings is too large
Viewed 0 times
substringsprogrammingdynamictoolargeforfindingsimilartable
Problem
Substring Diff
Given two strings of length $n$, $P = p_1\dots p_n$ and $Q = q_1 \dots q_n$, we define $M(i, j, L)$ as the number of
mismatches between $p_i \dots p_{i+L-1}$ and $q_j \dots q_{j+L-1}$. In set
notation, $M(i, j, L)$ refers to the size of the set $\{0 \leq x
The recursive function will have the form:
The state space then has size $K^3$. With an upper bound on $K$ of 1500, the running time and space usage will be terrible... So direct dynamic programming will not work without some additional property to reduce the state space.
Ideas?
UPDATE
Using the ideas suggested by both Yuval and Vor, I came up with the following solution that works like a charm, running in $O(K^2)$ time and using $K$ space.
```
def longest_range_min_sum(str1, str2, start1, start2, slice_size, max_sum):
longest = 0
i = 0
running_sum = 0
while i + longest max_sum:
if str1[start1 + i] != str2[start2 + i]:
running_sum -= 1
i += 1
else:
longest += 1
return longest
import sys
data = sys.stdin.readlines()
num_cases = int(data.pop(0))
for ignore in xrange(num_cases):
max_mismatches, str1, str2 = data.pop(0).split()
max_mismatches = int(max_mismatches)
m = n = len(str1)
longest = 0
for i in xrange(m + n + 1):
if i > n:
slice_size = m - (i - n)
else:
slice_size = min(i, m)
if slice_size == 0:
continue
end1 = max(m, m - i)
if i > n:
end1 = m - (i - n)
start1 = end1 - slice_size
end2 = min(i, n)
start2 = end2 - slice_size
#print zeros_and_ones
#print str1[start1:end1], ' - ', str2[
Given two strings of length $n$, $P = p_1\dots p_n$ and $Q = q_1 \dots q_n$, we define $M(i, j, L)$ as the number of
mismatches between $p_i \dots p_{i+L-1}$ and $q_j \dots q_{j+L-1}$. In set
notation, $M(i, j, L)$ refers to the size of the set $\{0 \leq x
- Both $P$ & $Q$ would have the same length
- The size of each of the string would be at the max 1500
- All characters in $P$ and $Q$ are lower-case English letters.
The recursive function will have the form:
longest(string1, string2, allowed_mismatches) =
{
... (something :P )
}The state space then has size $K^3$. With an upper bound on $K$ of 1500, the running time and space usage will be terrible... So direct dynamic programming will not work without some additional property to reduce the state space.
Ideas?
UPDATE
Using the ideas suggested by both Yuval and Vor, I came up with the following solution that works like a charm, running in $O(K^2)$ time and using $K$ space.
```
def longest_range_min_sum(str1, str2, start1, start2, slice_size, max_sum):
longest = 0
i = 0
running_sum = 0
while i + longest max_sum:
if str1[start1 + i] != str2[start2 + i]:
running_sum -= 1
i += 1
else:
longest += 1
return longest
import sys
data = sys.stdin.readlines()
num_cases = int(data.pop(0))
for ignore in xrange(num_cases):
max_mismatches, str1, str2 = data.pop(0).split()
max_mismatches = int(max_mismatches)
m = n = len(str1)
longest = 0
for i in xrange(m + n + 1):
if i > n:
slice_size = m - (i - n)
else:
slice_size = min(i, m)
if slice_size == 0:
continue
end1 = max(m, m - i)
if i > n:
end1 = m - (i - n)
start1 = end1 - slice_size
end2 = min(i, n)
start2 = end2 - slice_size
#print zeros_and_ones
#print str1[start1:end1], ' - ', str2[
Solution
One can reduce your problem to the following. Given a sequence of $N$ numbers, find a contiguous subsequence of length at most $K+1$ having maximal sum. This problem, in turn, is solvable in time $O(N)$.
What's the connection between your problem and mine? Let the positions of the mistakes be $I_1,\ldots,I_t$, and add $I_0 = 0$, $I_{t+1} = N+1$. The sequence in question is $J_1 = I_1 - I_0,\ldots,J_{t+1}=I_{t+1}-I_t$. Every $K+1$ consecutive numbers $J_a,\ldots,J_{a+K}$ correspond to a maximal solution for your problem of length $J_a + \cdots + J_{a+K} - 1$. The entire algorithm takes linear time.
Edit: This calculates the maximal $L$ such that $M(i,i,L) \leq K$ for some $i$. The actual problem wanted to find the maximal $L$ such that $M(i,j,L) \leq K$ for some $i,j$. By considering all possible shifts, we can solve this in $O(N^2)$ time and $O(K)$ space.
What's the connection between your problem and mine? Let the positions of the mistakes be $I_1,\ldots,I_t$, and add $I_0 = 0$, $I_{t+1} = N+1$. The sequence in question is $J_1 = I_1 - I_0,\ldots,J_{t+1}=I_{t+1}-I_t$. Every $K+1$ consecutive numbers $J_a,\ldots,J_{a+K}$ correspond to a maximal solution for your problem of length $J_a + \cdots + J_{a+K} - 1$. The entire algorithm takes linear time.
Edit: This calculates the maximal $L$ such that $M(i,i,L) \leq K$ for some $i$. The actual problem wanted to find the maximal $L$ such that $M(i,j,L) \leq K$ for some $i,j$. By considering all possible shifts, we can solve this in $O(N^2)$ time and $O(K)$ space.
Context
StackExchange Computer Science Q#4994, answer score: 5
Revisions (0)
No revisions yet.