patternMinor
Uniformly sample $x,y\in\{0,1\}^n$ with Levenstein distance $k$
Viewed 0 times
sampleuniformlywithdistancelevenstein
Problem
Given alphabet $\{0,1\}$, we want to uniformly sample $x,y \in \{0,1\}^n$ such that $ed(x,y)=k$, where $ed$ denotes the Levenstein distance, i.e., the minimum number of edit operations (insert, delete, substitute) that can transform $x$ to $y$.
Clearly, this can be done by brute force over all pairs of strings, which is very costly. To reduce the complexity, my main idea is to use Monte Carlo sampling:
Monte Carlo sampling: pick $x\sim\{0,1\}^n$ at random, introduce $k$ successive edit operations (equal number of delete and insert to end up with the same length) ending up in string $y$ and reject the pair iff $ed(x,y)\neq k$. The main challenge is to guarantee that the probability of confirmation is not exponentially low. Such a low probability of success forces us to repeat the sampling an exponential number of times, rendering any time savings useless.
Informal analysis: Let $x=:x_0\to \dots \to x_k:=y$ denote the intermediary strings during our $k$ random edits. Let $S_i(x)$ for $i\le n$ denote the level set for edit distance from $x$:
$$S_i(x)=\{y\in \{0,1\}^n: ed(x,y)=i\}.$$
Therefore, $ed(x,y)=k$ implies $x_i\in S_i$ for all $i=0,\dots, k$.
By construction, if we have $x_i\in S_i$, then $x_{i+1}$ must belong to one of $S_{i-1}, S_i, S_{i+1}$ level sets. Therefore, the probability of failure at step $i$ is bounded by the ratio $\alpha_i=(|S_i|+|S_{i-1}|)/(|S_i|+|S_{i-1}|+|S_{i+1}|)$. It remains to show that if $k$ is small (how small?) the ratio $\alpha_i$ remains very high, and in fact $\prod_i^k \alpha_i$ is not exponentially low.
Two main questions:
Clearly, this can be done by brute force over all pairs of strings, which is very costly. To reduce the complexity, my main idea is to use Monte Carlo sampling:
Monte Carlo sampling: pick $x\sim\{0,1\}^n$ at random, introduce $k$ successive edit operations (equal number of delete and insert to end up with the same length) ending up in string $y$ and reject the pair iff $ed(x,y)\neq k$. The main challenge is to guarantee that the probability of confirmation is not exponentially low. Such a low probability of success forces us to repeat the sampling an exponential number of times, rendering any time savings useless.
Informal analysis: Let $x=:x_0\to \dots \to x_k:=y$ denote the intermediary strings during our $k$ random edits. Let $S_i(x)$ for $i\le n$ denote the level set for edit distance from $x$:
$$S_i(x)=\{y\in \{0,1\}^n: ed(x,y)=i\}.$$
Therefore, $ed(x,y)=k$ implies $x_i\in S_i$ for all $i=0,\dots, k$.
By construction, if we have $x_i\in S_i$, then $x_{i+1}$ must belong to one of $S_{i-1}, S_i, S_{i+1}$ level sets. Therefore, the probability of failure at step $i$ is bounded by the ratio $\alpha_i=(|S_i|+|S_{i-1}|)/(|S_i|+|S_{i-1}|+|S_{i+1}|)$. It remains to show that if $k$ is small (how small?) the ratio $\alpha_i$ remains very high, and in fact $\prod_i^k \alpha_i$ is not exponentially low.
Two main questions:
- How can we make my hand-wavy argument more rigorous?
- In the regime that $k$ is comparable to $n$ the proposed algorithm provably fails. Namely if $k>2n/3$, random walks will never get you much further than $ed(x,y)\approx n/2$, as even the edit distance of two randomly generated strings is concentrated around $n/2$. What would be an alternative to sampling for large $k$?
Solution
There are two considerations: running time, and correctness.
Running time: When $k < (n-1)/\lg(4n)$, heuristically I expect the running time of your algorithm to be fine and I'd guess you won't experience exponential blowup. We can show $|S_k| \le (4n)^k$, so when $k < (n-1)/\lg(4n)$, $|S_k|/2^n \le 1/2$. This is not a proof; to prove that everything is OK, we need to prove that $|S_k|/(|S_0| + \dots + |S_k|)$ is not too small, and I don't have a proof of that.
For large $k$, I don't know what algorithm might be suitable.
Correctness: Your method is not correct. It does not produce $(x,y)$ with the right distribution. Given $x$, not all $y \in S_k(x)$ will be equally be likely to be reached by the procedure you give, so your approach introduces a bias in the distribution of $x,y$.
For example, suppose $x=10001$ and $k=2$. If $y=00000$, there is just one edit script that transforms $x$ into $y$ (substitute bits 1 and 5). If $y=10010$, there are seven edit scripts that transform $x$ into $y$ (substitute bits 4 and 5; or, delete one of bits 2-4 and insert a 0 at the end, or do the same in the opposite order). So, if $x=10001$, your algorithm will be seven times as likely to output $(10001,10010)$ as it is to output $(10001,00000)$. This deviates from the uniform distribution.
Incidentally, I believe the procedure you outline is just rejection sampling (i.e., just Monte Carlo), not MCMC.
Running time: When $k < (n-1)/\lg(4n)$, heuristically I expect the running time of your algorithm to be fine and I'd guess you won't experience exponential blowup. We can show $|S_k| \le (4n)^k$, so when $k < (n-1)/\lg(4n)$, $|S_k|/2^n \le 1/2$. This is not a proof; to prove that everything is OK, we need to prove that $|S_k|/(|S_0| + \dots + |S_k|)$ is not too small, and I don't have a proof of that.
For large $k$, I don't know what algorithm might be suitable.
Correctness: Your method is not correct. It does not produce $(x,y)$ with the right distribution. Given $x$, not all $y \in S_k(x)$ will be equally be likely to be reached by the procedure you give, so your approach introduces a bias in the distribution of $x,y$.
For example, suppose $x=10001$ and $k=2$. If $y=00000$, there is just one edit script that transforms $x$ into $y$ (substitute bits 1 and 5). If $y=10010$, there are seven edit scripts that transform $x$ into $y$ (substitute bits 4 and 5; or, delete one of bits 2-4 and insert a 0 at the end, or do the same in the opposite order). So, if $x=10001$, your algorithm will be seven times as likely to output $(10001,10010)$ as it is to output $(10001,00000)$. This deviates from the uniform distribution.
Incidentally, I believe the procedure you outline is just rejection sampling (i.e., just Monte Carlo), not MCMC.
Context
StackExchange Computer Science Q#140987, answer score: 3
Revisions (0)
No revisions yet.