
What's the input to the decoder in a sequence to sequence autoencoder?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·


Problem

What's the input to the decoder part of a sequence to sequence autoencoder? I've seen several examples of such an autoencoder (using LSTMs more often than not) but am still unclear.

- For example, in the often-cited paper by Dai & Le ('Semi-supervised Sequence Learning'), we have the following diagram:

What's the input to the decoder portion of the autoencoder here? In this example it's 'W-X-Y-Z.' But in general, is
it the same as the input to the encoder? Or is it using the output
from the previous timestep/LSTM cell as input?

- Similarly, in another popular paper by Srivastava et al. ('Unsupervised Learning of Video Representations using LSTMs'), they have the following diagram:

It seems they're using the reversed input from the encoder as input
here. However, there's a section as follows:

The decoder can be of two kinds – conditional or unconditioned. A conditional decoder receives the last generated output frame as
input, i.e., the dotted input in Fig. 2 is present. An unconditioned
decoder does not receive that input.

In the unconditioned decoder, what input does the decoder receive?

Solution

I was wondering the same and just stumbled across a nice tutorial by Quoc V. Le. The following explanation deals with the conditional case, since this seems to be the common one. My explanation is based on, and the image is taken from, chapter 5 of that tutorial, "Sequence output prediction with Recurrent Neural Networks".

Background

We consider only a decoder with a single-cell RNN, which has:

- $W$: input-to-hidden weights

- $U$: hidden-to-hidden weights (ignore that the first weight is different here)

- $V$: hidden-to-label weights

$$
f(x) = V h_T \\
h_t = \sigma ( U h_{t-1} + W x_t ) \quad \text{for} ~ t = T, \dots, 1 \\
h_0 = \sigma ( W x_0 )
$$
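The recurrence above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the toy weight values and the helper names `sigmoid`, `matvec` and `rnn_forward` are made up here, and the timesteps are processed in increasing index order starting from $h_0$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    # Plain matrix-vector product on nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, W, U, V):
    """xs = [x_0, ..., x_T]; returns f(x) = V h_T."""
    # h_0 = sigma(W x_0)
    h = [sigmoid(a) for a in matvec(W, xs[0])]
    # h_t = sigma(U h_{t-1} + W x_t) for the remaining timesteps
    for x_t in xs[1:]:
        uh = matvec(U, h)
        wx = matvec(W, x_t)
        h = [sigmoid(a + b) for a, b in zip(uh, wx)]
    return matvec(V, h)

# Hypothetical toy weights: 2 input dims, 2 hidden units, 3 labels.
W = [[0.5, -0.3], [0.1, 0.8]]
U = [[0.2, 0.0], [-0.4, 0.6]]
V = [[1.0, -1.0], [0.5, 0.5], [-0.2, 0.3]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = rnn_forward(xs, W, U, V)
print(scores)  # one unnormalized score per label
```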

Since we are doing sequence prediction, we are interested not only in the last output of the RNN but in the output at every timestep. The conditional decoder is therefore designed to predict each label based on the previous ones. Mathematically, it splits the conditional probability in the following way:

$$
p(y|x) = \prod_{i=1}^n p(y_{i} | y_{i-1}, \dots, y_{1}, x)
$$

As is often done in machine learning, you can take the logarithm to turn the probability into a score (or energy), which replaces the product with a more convenient sum:

$$
\log p(y|x) = \sum_{i=1}^n \log p(y_{i} | y_{i-1}, \dots, y_{1}, x)
$$
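As a quick numeric check of this identity, the sketch below uses made-up per-step conditional probabilities for a single three-token sequence (the values in `step_probs` are purely illustrative) and confirms that the product and the exponentiated sum of logs agree.

```python
import math

# Hypothetical values of p(y_i | y_{i-1}, ..., y_1, x) for one sequence.
step_probs = [0.9, 0.5, 0.8]

joint = math.prod(step_probs)                      # product form
log_joint = sum(math.log(p) for p in step_probs)   # sum-of-logs form

print(joint, math.exp(log_joint))
```

In practice the log form is also numerically safer, since a product of many small probabilities underflows quickly.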

The maximization of that score over our data is the objective. That can be equivalently formulated by the minimization of $ - \sum_{(x,y) \in D} \log p(y|x) $ where $D$ is our training set.

To complete this thought: when this probability is approximated, we are actually interested in the argument $\theta_{min}$ that minimizes $ - \sum_{(x,y) \in D} \log f(y|x,\theta) $, where $f(y|x,\theta)$ is our function approximator, e.g. an RNN, and $\theta$ is the vector containing all the weights $U$, $W$ and $V$.
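The objective can be written down directly. In this sketch `cond_prob` is a hypothetical stand-in for the model's per-step output $f$, and `uniform_model` is a deliberately trivial placeholder (uniform over an assumed vocabulary of four tokens) so that the result is easy to verify by hand.

```python
import math

def sequence_log_prob(y, x, cond_prob):
    # log p(y|x) = sum_i log p(y_i | y_{i-1}, ..., y_1, x)
    return sum(math.log(cond_prob(y[i], y[:i], x)) for i in range(len(y)))

def negative_log_likelihood(dataset, cond_prob):
    # -sum over (x, y) in D of log p(y|x): the quantity to minimize
    return -sum(sequence_log_prob(y, x, cond_prob) for x, y in dataset)

def uniform_model(token, prefix, x):
    # Placeholder model: ignores its inputs, uniform over 4 tokens.
    return 1.0 / 4

# Hypothetical training set D of (x, y) pairs; y is a list of token ids.
dataset = [("abc", [0, 1, 2]), ("de", [3, 0])]

loss = negative_log_likelihood(dataset, uniform_model)
print(loss)  # 5 tokens in total, each contributing log(4)
```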

During Training

During training we have the ground truth at hand so we can feed the decoder (see image) with the respective previous label at each timestep. At the first timestep we use the output of the encoder. Easy peasy.
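A minimal sketch of how the decoder inputs line up with the targets during training (often called teacher forcing). The `"<enc>"` string is only a placeholder for whatever the encoder hands over at the first timestep, and `teacher_forcing_inputs` is a name invented for this example.

```python
def teacher_forcing_inputs(first_input, targets):
    """Decoder input at step i is the ground-truth label from step i-1."""
    return [first_input] + list(targets[:-1])

targets = ["W", "X", "Y", "Z"]
decoder_inputs = teacher_forcing_inputs("<enc>", targets)

for step, (inp, tgt) in enumerate(zip(decoder_inputs, targets), start=1):
    print(f"step {step}: input={inp!r} -> target={tgt!r}")
```

Each step is asked to predict the current label while being fed the previous true label, regardless of what the model itself would have predicted.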

During Inference

Here we obviously do not have the ground truth, so instead we feed in the output of the previous timestep. This poses another problem: to find the sequence of maximum probability, we would have to compute every possible sequence and its probability according to the probability function defined above.

Apparently, one "greedy" approach is to skip the search over whole sequences and simply take the argmax at each timestep.

Another, more faithful approach is beam search, which heuristically keeps only a subset of probable sequences and picks the one with maximum probability among them.
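The difference between the two strategies shows up even on a tiny example. In the sketch below, `next_token_probs` is a hypothetical lookup table standing in for the decoder's per-step distribution, and the `EOS` name is an assumption made for this example. The numbers are chosen so that the locally best first token does not lead to the most probable sequence.

```python
import math

EOS = "<eos>"  # assumed name for the end-of-sequence symbol

def next_token_probs(prefix):
    """Hypothetical stand-in for the decoder's output given a prefix."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"a": 0.55, "b": 0.45},
        ("b",): {"a": 0.9, "b": 0.1},
    }
    # Any prefix of length 2 ends the sequence with certainty.
    return table.get(tuple(prefix), {EOS: 1.0})

def greedy_decode(max_len=10):
    seq, logp = [], 0.0
    while len(seq) < max_len:
        probs = next_token_probs(seq)
        token = max(probs, key=probs.get)  # per-step argmax
        logp += math.log(probs[token])
        if token == EOS:
            break
        seq.append(token)
    return seq, math.exp(logp)

def beam_search(beam_width=2, max_len=10):
    beams = [([], 0.0)]  # unfinished (sequence, log-probability) pairs
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for token, p in next_token_probs(seq).items():
                new_logp = logp + math.log(p)
                if token == EOS:
                    finished.append((seq, new_logp))
                else:
                    candidates.append((seq + [token], new_logp))
        # Keep only the beam_width most probable unfinished prefixes.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    best_seq, best_logp = max(finished or beams, key=lambda c: c[1])
    return best_seq, math.exp(best_logp)

print(greedy_decode())  # commits to "a" first and ends with probability 0.33
print(beam_search())    # keeps "b" alive and finds "b a" with probability 0.36
```

With `beam_width=1` the beam search collapses to the greedy result, and a larger beam still gives no guarantee in general; it only widens the set of candidates that are compared.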

Things to note:

- Inference stops when the End-Of-Sequence symbol is returned (greedy: when a timestep's argmax is that symbol; beam search: when the currently regarded sequence ends in it).

- Neither inference method guarantees retrieving the sequence with maximum probability.

- The output of $f(x,\theta)$ needs to be a probability distribution at each timestep, so the final activation of the RNN is usually a softmax: $f(x) = \operatorname{softmax}(Vh_T)$.
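For reference, a small softmax sketch that turns a vector of scores such as $Vh_T$ into a probability distribution. The input scores are arbitrary illustrative values; subtracting the maximum before exponentiating is a common stability trick and does not change the result.

```python
import math

def softmax(scores):
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```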

Unconditioned Case

Based on the above explanation and the paper you referenced, I'd say that the unconditioned decoder only receives input at its first timestep and then just "works with" the propagation of the hidden states. So the second of the RNN equations changes to $h_t = \sigma ( U h_{t-1} )$.
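Under that reading, an unconditioned decoder rollout could look like the sketch below. The function name `unconditioned_decode`, the toy weights, and the choice to emit a score vector $Vh_t$ after every update are assumptions made for illustration, not details taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def unconditioned_decode(h_first, U, V, steps):
    """Roll out h_t = sigma(U h_{t-1}) and emit V h_t at every step."""
    h = list(h_first)
    outputs = []
    for _ in range(steps):
        h = [sigmoid(a) for a in matvec(U, h)]  # no W x_t term: no per-step input
        outputs.append(matvec(V, h))
    return outputs

# Hypothetical toy weights; h_first stands in for the state from the encoder.
U = [[0.2, -0.1], [0.4, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs = unconditioned_decode([0.1, 0.9], U, V, steps=4)
print(outputs)
```

Everything the decoder produces is then a function of its initial state alone, which is what makes each output depend on $x$ but not on earlier outputs.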

Then the outputs at each step would not be conditioned on the previous timestep's output, but rather be unconditioned. This way the modelled probability would become

$$
p(y|x) = \prod_{i=1}^n p(y_{i} | x)
$$

and the greedy inference strategy would become valid.
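This can be checked by brute force on a toy case. The per-step distributions in `step_dists` are invented for the example; with a fully factorized model, picking the argmax at each step gives the same sequence as an exhaustive search over all sequences.

```python
import itertools
import math

# Hypothetical per-step distributions p(y_i | x) for a 3-step output.
step_dists = [
    {"a": 0.6, "b": 0.4},
    {"a": 0.3, "b": 0.7},
    {"a": 0.55, "b": 0.45},
]

# Greedy: independent argmax at every timestep.
greedy = [max(d, key=d.get) for d in step_dists]

def joint_prob(seq):
    # p(y|x) = prod_i p(y_i | x)
    return math.prod(d[tok] for d, tok in zip(step_dists, seq))

# Exhaustive search over all 2^3 candidate sequences.
best = max(itertools.product("ab", repeat=len(step_dists)), key=joint_prob)

print(greedy, list(best), joint_prob(best))
```

Because each factor can be maximized on its own, no search over sequences is needed in this case.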

Context

StackExchange Computer Science Q#69432, answer score: 7
