debugMinor

Sampling a uniform distribution of fixed size strings containing no forbidden substrings

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

substrings sampling cs random-number-generator stackoverflow strings

containingsubstringssizesamplinguniformfixeddistributionstringsforbidden

Problem

Given a list of "forbidden" words (substrings), an alphabet, and a desired output string length, how would I efficiently sample output strings containing no forbidden word?

For short output strings with few forbidden words, I would use simple rejection sampling. Pick a string (uniformly) with the specified alphabet and length, return that string if it contains no element of the forbidden list, try again otherwise.

If I use that algorithm for output lengths several times larger than the typical forbidden word, then the probability of rejection will be higher. (Most words are 2 or 3 characters long.)

Assume the requested output length is too long to enumerate and store every possible value. My alphabet size would be 16 to 36 characters, but solutions to large alphabets would be interesting to think about. (In which case I would call these things random sentences, forbidden n-grams, and dictionary words.)

My forbidden word list will have one hundred to one thousand strings. I would like to avoid solutions requiring expensive precomputation or lots of memory.

My first idea was to try to build a random string incrementally, in contrast to the all-or-nothing approach of straightforward rejection sampling. I doubt that my algorithm produces each possible output with equal probability.

The algorithm idea follows:

Initialize a char buffer long enough to fit outlen characters.

Pick a random letter of the alphabet and append it to the buffer.

If the buffer ends with a forbidden word of length k, then remove the last k letters from the char buffer and go to 2.

Otherwise, go to 2 if the buffer has less than outlen characters.

Return the contents of the buffer if it is full.

Step 3 serves to rewind the algorithm, returning the char buffer to a previous legal state.

I understand that clearing the whole buffer in step 3 definitely would produce uniform output just like the straightforward rejection sampling method. However, the average number of r

Solution

Suppose the alphabet is $\{a,b\}$, and you have one forbidden word, $aa$. Suppose we are trying to generate a word of length 3. The first two letters will be distributed uniformly over $ab,ba,bb$. Hence the first letter has the following distribution: $a$ with probability $1/3$, $b$ with probability $2/3$. In contrast, the allowable words are
$$
aba,abb,bab,bba,bbb.
$$
So the first letter should have the distribution $a$ with probability $2/5$, $b$ with probability $3/5$.

Here is an algorithm that does work. Construct a DFA (or UFA) for your language. For each state $q$, using dynamic programming you can count how many words of length $m$ are accepted when the automaton is started at $q$. Let us denote this by $c(q,m)$.

The correct distribution of the first letter $\sigma_1$ of a word of length $n$ in the language is
$$
\Pr[\sigma_1 = \sigma] = \frac{c(\delta(q_0,\sigma),n-1)}{c(q_0,n)}.
$$
More generally, given the first $\ell$ letters $\sigma_1 \ldots \sigma_\ell$, the following letter has the distribution
$$
\Pr[\sigma_{\ell+1} = \sigma \mid \sigma_1 \ldots \sigma_\ell] = \frac{c(\delta(q_0,\sigma_1\ldots\sigma_\ell\sigma),n-\ell-1)}{c(\delta(q_0,\sigma_1\ldots\sigma_\ell),n-\ell)}.
$$
Ignoring the cost of arithmetic, you can implement this scheme in roughly $O(|Q|n)$, where $Q$ is the set of states, or in $O(|\Sigma|n^2)$. (The former, assuming that $|Q| = \Omega(|\Sigma|)$.)

As an example, let us consider the counterexample above. We construct a DFA containing two states (we can omit the sink state, obtaining a UFA) $q_0,q_1$. The transition function is $\delta(q_0,a) = q_1$, $\delta(q_0,b) = q_0$, $\delta(q_1,b) = q_0$. The relevant values of $c$ are
$$
\begin{array}{c|cc}
n & c(q_0,n) & c(q_1,n) \\\hline
0 & 1 & 1 \\
1 & 2 & 1 \\
2 & 3 & 2 \\
3 & 5 & 3
\end{array}
$$
These are computed by the recurrences $c(q_0,n) = c(q_0,n-1) + c(q_1,n-1)$ and $c(q_1,n) = c(q_0,n-1)$, with base case $c(q,0) = 1$.

Since $\delta(q_0,a) = q_1$ and $\delta(q_0,b) = q_0$, we see that (for $n = 3$) $\Pr[\sigma_1 = a] = c(q_1,2)/c(q_0,3) = 2/5$ and $\Pr[\sigma_1 = b] = c(q_0,2)/c(q_0,3) = 3/5$.

Context

StackExchange Computer Science Q#115518, answer score: 5

Revisions (0)

No revisions yet.