HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

What are the possible sets of word lengths in a regular language?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
thewhatareregularlanguagepossiblewordlengthssets

Problem

Given a language $L$, define the length set of $L$ as the set of lengths of words in $L$:
$$\mathrm{LS}(L) = \{|u| \mid u \in L \}$$

Which sets of integers can be the length set of a regular language?

Solution

First, an observation which is not crucial but convenient: the set $\mathscr{S}$ of sets of integers that are $LS(L)$ for some regular language $L$ on a non-empty alphabet $\mathscr{A}$ does not depend on the choice of alphabet. To see that, consider a finite automaton that recognizes $L$; the lengths of the words that are in $L$ are the lengths of the paths on the automaton seen as an unlabeled graph from the start state to any accept state. In particular, you can relabel every arrow to $a$ and get a regular language with the same length set over the alphabet $\{a\}$. Conversely, if $L$ is a regular language over a one-element alphabet, it can be trivially injected into a larger alphabet, and the result is still a regular language.

Therefore we are looking for the possible length sets for words over a singleton alphabet. On a singleton alphabet, the language is the length set written out in unary: $\mathrm{LS}(L) = \{n\in\mathbb{N} \mid a^n \in L\}$. Such languages are called unary languages.

Let $L$ be a regular language, and consider a deterministic finite automaton (DFA) that recognizes $L$. The set of lengths of words of $L$ is the set of lengths of paths in the DFA seen as a directed graph that start on the start state and end in one of the accept states. A DFA on a one-element alphabet is pretty tame (NFAs would be wilder): it's either a finite list or a circular list. If the list is finite, number the states from $0$ to $h$ following the list order; if it's circular, number the states from $0$ to $h$ following the head of the list, and $h$ to $h+r$ along the loop.

Let $F$ be the set of indices of accept states up to $h$, and $G$ be the set of indices of accept states from $h$ to $h+r$. Then

$$\mathrm{LS}(L) = F \cup \{ k \, r + x \mid x \in G, k\in\mathbb{N} \}$$

Conversely, let $h$ and $r$ be two integers and $F$ and $G$ be two finite sets of integers such that $\forall x \in F, x \le h$ and $\forall x \in G, h \le x \le h+r$. Then the set $L_{F,G,r} = \{ a^{k\,r+x} \mid x\in G, k\in\mathbb{N} \}$ is a regular language: it is the language recognized by the DFA described above. A regular expression that describes this language is $a^F \mid a^{G} (a^r)^*$.

To summarize in English, the length sets of regular languages are the sets of integers that are periodic¹ above a certain value.

¹
To hang on to a well-established notion, periodic means the characteristic function of the set (which is a function $\mathbb{N}\to\{\mathtt{false},\mathtt{true}\}$ which we lift to a function $\mathbb{Z}\to\{\mathtt{false},\mathtt{true}\}$) is periodic. Periodic above a certain value means that the function restricted to $[h,+\infty[$ can be prolonged into a periodic function.

Context

StackExchange Computer Science Q#164, answer score: 15

Revisions (0)

No revisions yet.