
Flaw with Cross Entropy Error in Neural Networks

Submitted by: @import:stackexchange-cs·Mar 10, 2026·


artificial-intelligence neural-networks cs stackoverflow


Problem

I've recently been working on a neural network to classify handwritten digits. I used 1-of-N encoding, so there is one output node per possible digit: the expected output is 1 for the node of the digit that was input and 0 for every other node.

Because this is a classification problem, I opted for Cross Entropy Error. I followed this model shown here: https://visualstudiomagazine.com/articles/2014/04/01/neural-network-cross-entropy-error.aspx
and also shown here:
http://www.mathworks.com/help/nnet/ref/crossentropy.html

The error function is:

$$-\frac1n\sum y\ln(\hat y)$$

where $y$ is the expected output and $\hat y$ is the predicted output.

However, after implementing my network I noticed a problem. This formula ignores the predicted output of every node whose expected output is 0, because each of those terms is multiplied by $y = 0$. As a result, the network tunes its weights and biases so that every output node is close to 1, and I end up with a list of 1s. According to the CE cost function this is good, because the one node with expected output 1 is spot on, but the (huge) error on the other nodes is never looked at, so it is impossible to decide on one output.
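A minimal sketch of this effect, assuming each output node can independently take any value in (0, 1] (for example, one sigmoid per node, which is not stated above). The function name `one_hot_cross_entropy` and the example vectors are hypothetical:

```python
import math

def one_hot_cross_entropy(y, y_hat):
    """Single-sample loss -sum(y_i * ln(y_hat_i)).

    Terms with y_i == 0 are skipped: they contribute nothing to the sum,
    which is exactly why the loss never sees those nodes.
    """
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

target = [0, 0, 1, 0]                    # 1-of-N encoding of class 2
all_ones = [1.0, 1.0, 1.0, 1.0]          # every node saturated at 1
sensible = [0.05, 0.05, 0.85, 0.05]      # a prediction that picks class 2

loss_all_ones = one_hot_cross_entropy(target, all_ones)
loss_sensible = one_hot_cross_entropy(target, sensible)

print(loss_all_ones, loss_sensible)
```

Under this assumption the all-ones vector gets a loss of zero, lower than the prediction that actually singles out the right class, so nothing in the loss discourages the degenerate output.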

Maybe I'm missing something? I see the alternate CE function is

$$-\frac1n\sum \left[ y\ln(\hat y) + (1-y)\ln(1-\hat y) \right]$$

but according to the MathWorks link above, it shouldn't be used for 1-of-N encoding with more than one output node ($N > 1$). I'm not sure why this is the case.

So my question would be:
Why is the first cross-entropy error equation viable for classification if it ignores the error of the nodes where 0 is expected, and therefore always drives all the nodes toward 1?

Solution

Make sure you use softmax as your final layer. This will ensure that the outputs sum to 1. You need this for cross-entropy loss to be meaningful.

Note that the correct cross-entropy loss is

$$-\frac{1}{n} \sum_{i=0}^9 y_i \log \hat{y}_i,$$

where $i$ ranges over all of the classes. You omitted the dependence on the classes.
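Written out for a whole training set, and assuming $n$ denotes the number of training samples (the question does not say so explicitly), the loss might be sketched as follows, with the hypothetical superscript $(k)$ indexing samples:

$$-\frac{1}{n} \sum_{k=1}^{n} \sum_{i=0}^{9} y_i^{(k)} \log \hat{y}_i^{(k)}.$$

For a single sample whose true class is $c$, the one-hot target leaves only one term, $-\log \hat{y}_c$. The loss therefore depends on the other nodes only through whatever constraint ties $\hat{y}_c$ to them.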

Once you change the outputs to use softmax, the anomaly you saw can't happen. The network can't cause all nodes to output 1, because softmax renormalizes the outputs so they sum to 1: pushing one output up necessarily pushes the others down. Cross-entropy loss only makes sense under this constraint, i.e., when $\sum_i y_i = 1$ and $\sum_i \hat{y}_i = 1$.
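A minimal pure-Python sketch of this point, not a full network. The helper names (`softmax`, `cross_entropy`, `loss_from_logits`) and the example scores are hypothetical. It also checks numerically the standard result that, with softmax outputs and a one-hot target, the gradient of the loss with respect to raw score $j$ is $\hat{y}_j - y_j$, so the nodes with target 0 do receive an error signal:

```python
import math

def softmax(logits):
    """Map raw scores to positive values that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, y_hat):
    """Single-sample loss -sum(y_i * ln(y_hat_i)) for a one-hot target y."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

def loss_from_logits(y, logits):
    return cross_entropy(y, softmax(logits))

target = [0, 0, 1, 0]
logits = [2.0, 2.0, 2.0, 2.0]   # every node is pushed equally hard

probs = softmax(logits)         # equal scores give 0.25 each, not all 1s
loss = loss_from_logits(target, logits)

# Analytic gradient of the loss w.r.t. each raw score: y_hat_j - y_j
grad = [p - t for p, t in zip(probs, target)]

# Forward finite-difference check of that gradient
eps = 1e-6
numeric_grad = []
for j in range(len(logits)):
    bumped = list(logits)
    bumped[j] += eps
    numeric_grad.append((loss_from_logits(target, bumped) - loss) / eps)

print(probs, loss, grad)
```

The gradient is positive for the three wrong classes and negative for the correct one, so a gradient-descent step lowers the wrong scores and raises the right one, even though the loss formula itself only mentions the correct class.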

Context

StackExchange Computer Science Q#51950, answer score: 2
