HiveBrain v1.2.0

In TensorFlow tutorials, why do they use only the first term of cross-entropy as the cost function?

Submitted by: @import:stackexchange-cs

Problem

The cross-entropy cost function is usually defined as

$$C = -\frac{1}{n} \sum_x \left[y \ln \hat{y} + (1-y ) \ln (1-\hat{y}) \right]$$

where $y$ is the expected output and $\hat{y}$ is the predicted output, for training example $x$.

But, in TensorFlow MNIST tutorial, they use

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))

which, I suppose, is equivalent to

$$C = -\frac{1}{n} \sum_x y\ln\hat{y}$$

That means only the first term of the cross-entropy expression, $y\ln\hat{y}$, is being used. Why? Why isn't the second term, $(1 - y)\ln(1 - \hat{y})$, being used too?

Solution

Great question! Actually, there's no contradiction. The short answer is that the two equations look different because they're intended for slightly different settings, but under the covers they're the same loss. Let me walk through it; by the end you should see why both equations are correct in the context where each is used.

Cross-entropy loss for multi-class neural networks

When using neural networks for MNIST, we have 10 classes (one per digit). The neural net has 10 outputs (i.e., 10 neurons at the final layer). Call the outputs $\hat{y}_0,\hat{y}_1,\dots,\hat{y}_{9}$. If you feed in an image $x$, the intended interpretation is that $\hat{y}_d$ is supposed to represent the neural network's estimate of the "probability" that the image is an instance of the digit $d$.

For the training set, we know what the desired output is. Let's define $y_0,y_1,\dots,y_9$ to be the desired "probability distribution". In particular, $y_d$ should be $1$ for the correct digit $d$ and $0$ for all other values of $d$.

With these definitions, the cross-entropy loss for a single instance $x$ is defined to be

$$C_x = - \sum_{i=0}^9 y_i \log \hat{y}_i.$$

(Notice that if the correct digit is $d$, then this value simplifies to $-\log \hat{y}_d$, since we have $y_d=1$ and $y_i=0$ for all other $i$.)

The empirical cross-entropy loss for an entire training set is the average of these values, over all of the instances in the training set:

$$C = - {1 \over n} \sum_x \sum_{i=0}^9 y_i \log \hat{y}_i.$$

That's the cross-entropy loss, and I think it's exactly what the TensorFlow tutorial is computing. (Side note: this is different from the equation you presented. You were missing the inner sum over all 10 classes; that's what `reduction_indices=[1]` does in the TensorFlow code. I suspect you might have misread it. No biggie.)
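If it helps, here's a small NumPy sketch (the array values are made up for illustration) showing that the TensorFlow expression above computes exactly this double sum: an inner sum over the 10 classes, then an average over the batch.

```python
import numpy as np

# Hypothetical batch of n=2 MNIST-style examples with 10 classes.
# y_true is one-hot (the desired distribution y_0..y_9); y_pred is the
# network's softmax output (each row sums to 1).
y_true = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],   # correct digit: 2
                   [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]])  # correct digit: 7
y_pred = np.array([[0.01, 0.01, 0.90, 0.01, 0.01, 0.01, 0.01, 0.02, 0.01, 0.01],
                   [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.80, 0.03, 0.03]])

# Mirror of: tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))
per_example = -np.sum(y_true * np.log(y_pred), axis=1)  # inner sum over classes
loss = np.mean(per_example)                             # average over the batch

# Because y_true is one-hot, each per-example term simplifies to
# -log(y_hat_d) for the correct digit d, as noted above.
assert np.allclose(per_example, [-np.log(0.90), -np.log(0.80)])
```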

You can also see how this generalizes to any number of classes: the sum over $i=0,1,\dots,9$ gets changed to a sum over all classes, however many of them there may be.

Cross-entropy loss for two-class neural networks

As a special case, suppose we have two classes, and the neural network has two outputs (two neurons at the output layer). Then the cross-entropy loss for a single instance (the inner sum) becomes just

$$C_x = - y_0 \log \hat{y}_0 - y_1 \log \hat{y}_1.$$

Normally we normalize $y_0,y_1$ to be a probability distribution, so $y_0+y_1=1$, and similarly for $\hat{y}_0,\hat{y}_1$. As a result, we have $y_0 = 1-y_1$ and $\hat{y}_0 = 1-\hat{y}_1$. So, for a two-class neural network, we have

$$C_x = - y_1 \log \hat{y}_1 - (1-y_1) \log (1-\hat{y}_1),$$

and the empirical loss for an entire training set is

$$C = - {1 \over n} \sum_x [y_1 \log \hat{y}_1 + (1-y_1) \log (1-\hat{y}_1)].$$

So far, so good.
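As a quick numeric sanity check (values are made up for illustration), the general two-output formula and the familiar binary form give the same number once we impose $y_0 = 1-y_1$ and $\hat{y}_0 = 1-\hat{y}_1$:

```python
import numpy as np

# Two-class example with illustrative values.
y1, yhat1 = 1.0, 0.7            # desired and predicted "probability of class 1"
y0, yhat0 = 1 - y1, 1 - yhat1   # the complementary class-0 entries

# General two-output cross-entropy: -y_0 log(yhat_0) - y_1 log(yhat_1)
general = -y0 * np.log(yhat0) - y1 * np.log(yhat1)

# Binary form: -y_1 log(yhat_1) - (1 - y_1) log(1 - yhat_1)
binary = -y1 * np.log(yhat1) - (1 - y1) * np.log(1 - yhat1)

assert np.isclose(general, binary)  # same loss, just rewritten
```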

Cross-entropy loss for two-class neural networks with a single output

Now if we have a two-class classification problem, it's not actually necessary for the network to produce two outputs. Alternatively, we could build a network with only a single output $\hat{y}$. We could interpret this single output value as "probability" that the input instance should be labelled as class 1. It follows that the "probability" that the instance should be labelled as class 0 is $1-\hat{y}$. So, if $\hat{y}>0.5$, we'll label the input as class 1; otherwise, we'll label it as class 0. This is the architecture used in the first web page you link to.

How should we measure the cross-entropy loss for this network? Well, just replace $\hat{y}_0,\hat{y}_1$ with $1-\hat{y},\hat{y}$ and everything goes through unchanged.

A slightly tricky thing is that we need to replace $y_0,y_1$ with something. What should we replace it with? I suggest we replace it with $1-y,y$, where $y$ is a value that indicates the desired output: if the correct label is class 1, then $y=1$, else $y=0$. Notice how this all works out nicely: if the correct label is class 1, then we get the distribution $0,1$; if the correct label is class 0, we get the distribution $1,0$.
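In code, that label encoding is a one-liner (the function name is mine, purely illustrative):

```python
def target_distribution(y):
    """Desired two-class distribution (p_class0, p_class1) for a scalar label y."""
    return (1 - y, y)

# If the correct label is class 1 (y=1), we get the distribution (0, 1);
# if the correct label is class 0 (y=0), we get (1, 0).
assert target_distribution(1) == (0, 1)
assert target_distribution(0) == (1, 0)
```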

Now plugging into the equations above, we see that the cross-entropy loss for a single instance $x$ is

$$C_x = - y \log \hat{y} - (1-y) \log (1-\hat{y})$$

(since we decided that instead of $\hat{y}_1$ we now have $\hat{y}$, and similarly for $y_1,y$).

As a result, the empirical loss for an entire training set is

$$C = - {1 \over n} \sum_x [y \log \hat{y} + (1-y) \log (1-\hat{y})].$$

This exactly matches the formula found in the first link you gave.

Bottom line

See how it all lines up and is consistent? Basically, the cross-entropy is a well-defined notion in information theory; there is only a single definition of the cross-entropy. In information theory, the cross-entropy is defined in terms of two probability distributions.

To use this idea to construct a loss function for a neural network, we construct two probability distributions for each training instance: the desired distribution, built from the label, and the distribution the network predicts. The loss is the cross-entropy between them. Everything above is just that one definition, specialized to different network architectures.
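That single information-theoretic definition, $H(p,q) = -\sum_i p_i \log q_i$, can be written once and reused for every case above (a minimal sketch; the helper name is mine):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), for probability distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

# The loss for one training instance is the cross-entropy between the
# desired one-hot distribution p and the network's prediction q; with
# p one-hot it collapses to -log(q_d) for the correct class d.
assert np.isclose(cross_entropy([0, 1], [0.3, 0.7]), -np.log(0.7))
```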

Context

StackExchange Computer Science Q#60153, answer score: 2
