patternMinor

Is the language "substrings of a regular language that are over half the length of the superstring" regular?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

automata cs stackoverflow finite-automata regular-languages nondeterminism

substringsthehalfarelengthregularsuperstringlanguagethatover

Problem

We say $x$ is a majority substring of $y$ if $y \in \Sigma^ x \Sigma^$ and $|x| \geq \frac 12|y|$. If $B$ is a regular language, is the set of majority substrings of $B$ regular?

I was provided the following solution, but I don't understand it:

Let M be a DFA that recognizes B. Construct an NFA $N$ that accepts its input $x$ if $x$ is a major substring of some string $y ∈ B$, that is, if $y = uxv ∈ L(M)$ for some string $u$ and $v$ where $|x| ≥ |u| + |v|$. Informally, $N$ uses its nondeterminism to guess the appropriate $y$ as follows. First, $N$ guesses the state $q_1$ that it is in just before it reads the first symbol of $x$ and the state $q_2$ that it is in just after it reads the last symbol of $x$. Then it reads $x$ and checks that $x$ takes $M$ from $q_1$ to $q_2$. In parallel with reading the symbols of $x$ and performing that check, $N$ guesses the symbols of $u$ followed by the symbols of $v$, and checks that $u$ takes M from its start state to $q_1$ and that $v$ takes $M$ from $q_2$ to an accept state. Because $|x| ≥ |u| + |v|$, the machine $N$ has enough time to guess both $u$ and $v$, while it is reading $x$.

What exactly is meant by "$N$ guessing the state that it's in"? Does it iterate over $Q \times Q$ and check if $x$ takes $M$ from $q_1$ to $q_2$? If so, is it done by parallelizing these $|Q|^2$ "lines of thought"?

How is it that $N$ can guess the symbols of $u$ and $v$? It seems to me that there are infinite possibilities for $u$ and $v$, and so iterating over these is beyond the capabilities of an $NFA$.

Lastly, how do we guarantee that the guessed $u$ and $v$ have a combined length no more than that of $x$? Counting is outside of the capabilities of an NFA, so I don't think it should be able to count $x$'s length (call it $k$) and try every $s \in \Sigma^k$.

Solution

The solution provided is pretty concise.

"First, $N$ guesses the state $q_1$ that it is in just before it reads the first symbol of $x$ and the state $q_2$ that it is in just after it reads the last symbol of $x$." That means $N$ is the union of $|Q|^2$ $\text{NFA}$. (That means, running $N$ will be the same as running all those $\text{NFA}$s. $N$ accepts a word iff at least one of those $\text{NFA}$ accepts the word). We can label them $N_{q_1, q_2}$ where both $q_1$ and $q_2$ range over all states of $M$ independently. $N_{q_1, q_2}$ will be constructed to accept the words
$$\{x\in \Sigma^ \mid\text{there exist } u, v\in \Sigma^ \text{ such that } uxv\in L(M)\\ \text{ and } |x| ≥ |u| + |v|\\ \text{ and, if starting at state }q_1 \text{ and given input }x, \ M \text{ will go to }q_2 \}.$$

Here is a description of $N_{q_1, q_2}$. Running $N_{q_1, q_2}$ will be the same as running one DFA and one NFA in parallel. $N_{q_1,q_2}$ accepts a word if and only if that DFA and that NFA both accept.

The $\text{DFA}$ for $N_{q_1, q_2}$ is the same $M$ except that its start state is $q_1$ and its only accept state is $q_2$.

The $\text{NFA}$ for $N_{q_1, q_2}$ has states $(r, t)$, where $s$ and $t$ range over all states of $M$ independently. (So, there are $|Q|^2$ states). $(S, q_2)$ is the start state of this $\text{NFA}$, where $S$ is the start state of $M$. For each accept state of $M$, $(q_1, M)$ is an accept state of this NFA. In each step, regardless of what is the current input symbol, this $\text{NFA}$ will go from its current state $(r, t)$ to either $(r', t)$ where $r'$ is some state of $M$ that is immediately reachable from $r$ or $(r, t')$ where $t'$ is some state of $M$ that is immediately reachable from $t$. However, if the current state is an accept state, this $\text{NFA}$ will remain in that accept state forever.

You can view this $\text{NFA}$ for $N_{q_1, q_2}$ as running two copies of $M$, one from the start state of $M$ and one from the state $q_2$. In each step, it will either run the first copy or the second copy, indeterministically. This $\text{NFA}$ accepts once the first copy reaches $q_1$ and the second copy reaches an accept state of $M$, and it will not change state thereafter. So, when given input $x$ and running in parallel with the $\text{DFA}$ for $N_{q_1, q_2}$, this $\text{NFA}$ for $N_{q_1, q_2}$ is just checking whether within $|x|$ steps it will reach one of its accept states. That is how "$N$ can guess the symbols of $u$ and $v$" even if there might be "infinite possibilities for $u$ and $v$", as well as "guarantee that the guessed $u$ and $v$ have a combined length no more than that of $x$".

Exercise. Fix a positive integer $k$. We say $x$ is a $\frac 1k$-significant substring of $y$ if $y \in \Sigma^ x \Sigma^$ and $|x| \geq\frac1k |y|$. Show that if $B$ is a regular language, the set of $\frac 1k$-significant substrings of strings in $B$ is a regular language.

Context

StackExchange Computer Science Q#144265, answer score: 2

Revisions (0)

No revisions yet.