debugMinor
Sampling a uniform distribution of fixed size strings containing no forbidden substrings
Viewed 0 times
containingsubstringssizesamplinguniformfixeddistributionstringsforbidden
Problem
Given a list of "forbidden" words (substrings), an alphabet, and a desired output string length, how would I efficiently sample output strings containing no forbidden word?
For short output strings with few forbidden words, I would use simple rejection sampling. Pick a string (uniformly) with the specified alphabet and length, return that string if it contains no element of the forbidden list, try again otherwise.
If I use that algorithm for output lengths several times larger than the typical forbidden word, then the probability of rejection will be higher. (Most words are 2 or 3 characters long.)
Assume the requested output length is too long to enumerate and store every possible value. My alphabet size would be 16 to 36 characters, but solutions to large alphabets would be interesting to think about. (In which case I would call these things random sentences, forbidden n-grams, and dictionary words.)
My forbidden word list will have one hundred to one thousand strings. I would like to avoid solutions requiring expensive precomputation or lots of memory.
My first idea was to try to build a random string incrementally, in contrast to the all-or-nothing approach of straightforward rejection sampling. I doubt that my algorithm produces each possible output with equal probability.
The algorithm idea follows:
Step 3 serves to rewind the algorithm, returning the char buffer to a previous legal state.
I understand that clearing the whole buffer in step 3 definitely would produce uniform output just like the straightforward rejection sampling method. However, the average number of r
For short output strings with few forbidden words, I would use simple rejection sampling. Pick a string (uniformly) with the specified alphabet and length, return that string if it contains no element of the forbidden list, try again otherwise.
If I use that algorithm for output lengths several times larger than the typical forbidden word, then the probability of rejection will be higher. (Most words are 2 or 3 characters long.)
Assume the requested output length is too long to enumerate and store every possible value. My alphabet size would be 16 to 36 characters, but solutions to large alphabets would be interesting to think about. (In which case I would call these things random sentences, forbidden n-grams, and dictionary words.)
My forbidden word list will have one hundred to one thousand strings. I would like to avoid solutions requiring expensive precomputation or lots of memory.
My first idea was to try to build a random string incrementally, in contrast to the all-or-nothing approach of straightforward rejection sampling. I doubt that my algorithm produces each possible output with equal probability.
The algorithm idea follows:
- Initialize a char buffer long enough to fit
outlencharacters.
- Pick a random letter of the alphabet and append it to the buffer.
- If the buffer ends with a forbidden word of length
k, then remove the lastkletters from the char buffer and go to 2.
- Otherwise, go to 2 if the buffer has less than
outlencharacters.
- Return the contents of the buffer if it is full.
Step 3 serves to rewind the algorithm, returning the char buffer to a previous legal state.
I understand that clearing the whole buffer in step 3 definitely would produce uniform output just like the straightforward rejection sampling method. However, the average number of r
Solution
Suppose the alphabet is $\{a,b\}$, and you have one forbidden word, $aa$. Suppose we are trying to generate a word of length 3. The first two letters will be distributed uniformly over $ab,ba,bb$. Hence the first letter has the following distribution: $a$ with probability $1/3$, $b$ with probability $2/3$. In contrast, the allowable words are
$$
aba,abb,bab,bba,bbb.
$$
So the first letter should have the distribution $a$ with probability $2/5$, $b$ with probability $3/5$.
Here is an algorithm that does work. Construct a DFA (or UFA) for your language. For each state $q$, using dynamic programming you can count how many words of length $m$ are accepted when the automaton is started at $q$. Let us denote this by $c(q,m)$.
The correct distribution of the first letter $\sigma_1$ of a word of length $n$ in the language is
$$
\Pr[\sigma_1 = \sigma] = \frac{c(\delta(q_0,\sigma),n-1)}{c(q_0,n)}.
$$
More generally, given the first $\ell$ letters $\sigma_1 \ldots \sigma_\ell$, the following letter has the distribution
$$
\Pr[\sigma_{\ell+1} = \sigma \mid \sigma_1 \ldots \sigma_\ell] = \frac{c(\delta(q_0,\sigma_1\ldots\sigma_\ell\sigma),n-\ell-1)}{c(\delta(q_0,\sigma_1\ldots\sigma_\ell),n-\ell)}.
$$
Ignoring the cost of arithmetic, you can implement this scheme in roughly $O(|Q|n)$, where $Q$ is the set of states, or in $O(|\Sigma|n^2)$. (The former, assuming that $|Q| = \Omega(|\Sigma|)$.)
As an example, let us consider the counterexample above. We construct a DFA containing two states (we can omit the sink state, obtaining a UFA) $q_0,q_1$. The transition function is $\delta(q_0,a) = q_1$, $\delta(q_0,b) = q_0$, $\delta(q_1,b) = q_0$. The relevant values of $c$ are
$$
\begin{array}{c|cc}
n & c(q_0,n) & c(q_1,n) \\\hline
0 & 1 & 1 \\
1 & 2 & 1 \\
2 & 3 & 2 \\
3 & 5 & 3
\end{array}
$$
These are computed by the recurrences $c(q_0,n) = c(q_0,n-1) + c(q_1,n-1)$ and $c(q_1,n) = c(q_0,n-1)$, with base case $c(q,0) = 1$.
Since $\delta(q_0,a) = q_1$ and $\delta(q_0,b) = q_0$, we see that (for $n = 3$) $\Pr[\sigma_1 = a] = c(q_1,2)/c(q_0,3) = 2/5$ and $\Pr[\sigma_1 = b] = c(q_0,2)/c(q_0,3) = 3/5$.
$$
aba,abb,bab,bba,bbb.
$$
So the first letter should have the distribution $a$ with probability $2/5$, $b$ with probability $3/5$.
Here is an algorithm that does work. Construct a DFA (or UFA) for your language. For each state $q$, using dynamic programming you can count how many words of length $m$ are accepted when the automaton is started at $q$. Let us denote this by $c(q,m)$.
The correct distribution of the first letter $\sigma_1$ of a word of length $n$ in the language is
$$
\Pr[\sigma_1 = \sigma] = \frac{c(\delta(q_0,\sigma),n-1)}{c(q_0,n)}.
$$
More generally, given the first $\ell$ letters $\sigma_1 \ldots \sigma_\ell$, the following letter has the distribution
$$
\Pr[\sigma_{\ell+1} = \sigma \mid \sigma_1 \ldots \sigma_\ell] = \frac{c(\delta(q_0,\sigma_1\ldots\sigma_\ell\sigma),n-\ell-1)}{c(\delta(q_0,\sigma_1\ldots\sigma_\ell),n-\ell)}.
$$
Ignoring the cost of arithmetic, you can implement this scheme in roughly $O(|Q|n)$, where $Q$ is the set of states, or in $O(|\Sigma|n^2)$. (The former, assuming that $|Q| = \Omega(|\Sigma|)$.)
As an example, let us consider the counterexample above. We construct a DFA containing two states (we can omit the sink state, obtaining a UFA) $q_0,q_1$. The transition function is $\delta(q_0,a) = q_1$, $\delta(q_0,b) = q_0$, $\delta(q_1,b) = q_0$. The relevant values of $c$ are
$$
\begin{array}{c|cc}
n & c(q_0,n) & c(q_1,n) \\\hline
0 & 1 & 1 \\
1 & 2 & 1 \\
2 & 3 & 2 \\
3 & 5 & 3
\end{array}
$$
These are computed by the recurrences $c(q_0,n) = c(q_0,n-1) + c(q_1,n-1)$ and $c(q_1,n) = c(q_0,n-1)$, with base case $c(q,0) = 1$.
Since $\delta(q_0,a) = q_1$ and $\delta(q_0,b) = q_0$, we see that (for $n = 3$) $\Pr[\sigma_1 = a] = c(q_1,2)/c(q_0,3) = 2/5$ and $\Pr[\sigma_1 = b] = c(q_0,2)/c(q_0,3) = 3/5$.
Context
StackExchange Computer Science Q#115518, answer score: 5
Revisions (0)
No revisions yet.