In the previous article, we worked out the derivative of the cross-entropy loss for one of the output classes. Now we will do the same for the remaining ones.
Now let us see what happens when we calculate the derivative of the cross-entropy loss for Virginica with respect to ( b_3 ).
The predicted probability for Virginica is produced by the softmax function. The inputs to the softmax function are the raw output values for Setosa, Versicolor, and Virginica.
Only the green crinkled surface is directly influenced by ( b_3 ). This green surface represents the raw output for Setosa and is one of the inputs to the softmax function. The green surface itself is formed by summing the blue and orange surfaces and then adding ( b_3 ).
Because the cross-entropy loss is linked to ( b_3 ) through the predicted probability for Virginica and the raw output for Setosa, we can apply the chain rule to compute the derivative of the cross entropy with respect to ( b_3 ).
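Spelled out with explicit symbols (here ( p_{Virginica} ) stands for the predicted probability for Virginica and ( z_{Setosa} ) for the raw output for Setosa; these labels are introduced only for this derivation and do not appear in the original figures), the chain rule looks like this:

\[
\frac{\partial\, CE}{\partial b_3}
= \frac{\partial\, CE}{\partial p_{Virginica}}
\times \frac{\partial p_{Virginica}}{\partial z_{Setosa}}
\times \frac{\partial z_{Setosa}}{\partial b_3}
\]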
As before, we start by computing the derivative of the cross entropy with respect to the predicted probability for Virginica. By substituting the cross-entropy equation and simplifying, we obtain the following result.
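In symbols, since the cross entropy for an observed Virginica sample is the negative log of the predicted probability for Virginica, this first factor works out to:

\[
CE = -\log\!\left(p_{Virginica}\right)
\quad\Longrightarrow\quad
\frac{\partial\, CE}{\partial p_{Virginica}} = -\frac{1}{p_{Virginica}}
\]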
When the predicted probability for Virginica is used to compute the cross entropy, the derivative of the cross entropy with respect to ( b_3 ) turns out to be the predicted probability for Setosa.
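One way to see why (a sketch of the algebra, not necessarily the exact sequence of steps shown in the original figures): the softmax derivative for two different classes gives ( \partial p_{Virginica} / \partial z_{Setosa} = -p_{Virginica}\, p_{Setosa} ), and ( \partial z_{Setosa} / \partial b_3 = 1 ), so multiplying the three factors yields

\[
\frac{\partial\, CE}{\partial b_3}
= \left(-\frac{1}{p_{Virginica}}\right)
\times \left(-\,p_{Virginica}\, p_{Setosa}\right)
\times 1
= p_{Setosa}.
\]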
Now let us apply the same reasoning to Versicolor.
When the observed measurements correspond to Versicolor, and we follow the same steps used for Virginica, the resulting derivative is again the predicted probability for Setosa:
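In the same notation (assuming an observed Versicolor sample, so the cross entropy is ( -\log(p_{Versicolor}) )):

\[
\frac{\partial\, CE}{\partial b_3}
= \left(-\frac{1}{p_{Versicolor}}\right)
\times \left(-\,p_{Versicolor}\, p_{Setosa}\right)
\times 1
= p_{Setosa}.
\]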
At this point, we can summarize the results as follows:
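A compact way to state the summary (the observed-Setosa case is the one derived in the previous article; by the same chain rule it comes out to the predicted probability for Setosa minus one):

\[
\frac{\partial\, CE}{\partial b_3} =
\begin{cases}
p_{Setosa} - 1 & \text{if the observed class is Setosa} \\[4pt]
p_{Setosa} & \text{if the observed class is Versicolor} \\[4pt]
p_{Setosa} & \text{if the observed class is Virginica}
\end{cases}
\]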
It is important to note that we are currently targeting ( b_3 ), which only influences the raw output for Setosa. To influence the raw outputs for Versicolor and Virginica, we must instead target ( b_4 ) and ( b_5 ).
The corresponding derivatives are:
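Written in the same case-by-case form (assuming, as stated above, that ( b_4 ) sits on the raw output for Versicolor and ( b_5 ) on the raw output for Virginica), the same pattern gives:

\[
\frac{\partial\, CE}{\partial b_4} =
\begin{cases}
p_{Versicolor} - 1 & \text{if the observed class is Versicolor} \\[4pt]
p_{Versicolor} & \text{otherwise}
\end{cases}
\]
\[
\frac{\partial\, CE}{\partial b_5} =
\begin{cases}
p_{Virginica} - 1 & \text{if the observed class is Virginica} \\[4pt]
p_{Virginica} & \text{otherwise}
\end{cases}
\]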
Now that all the derivatives have been calculated, we will begin optimizing the bias terms using backpropagation in the next article.
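For readers who like to sanity-check the algebra, here is a minimal numerical sketch in Python (the variable names, the example raw output values, and the use of NumPy are assumptions made for this illustration, not part of the original network): it compares the analytic result above with a finite-difference estimate of the derivative with respect to ( b_3 ).

```python
import numpy as np

def softmax(z):
    """Convert raw outputs (logits) into predicted probabilities."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(z, observed):
    """Cross-entropy loss when `observed` is the index of the true class."""
    return -np.log(softmax(z)[observed])

# Hypothetical raw outputs for Setosa, Versicolor, Virginica.
# Index 0 corresponds to Setosa, whose raw output contains b_3.
z = np.array([1.2, -0.4, 0.7])
observed = 2  # pretend the observed class is Virginica

# Analytic result from the article: derivative wrt b_3 is
# p_Setosa (minus 1 only if the observed class were Setosa).
p = softmax(z)
analytic = p[0] - (1.0 if observed == 0 else 0.0)

# Finite-difference check: nudge b_3, i.e. the Setosa raw output.
eps = 1e-6
z_plus = z.copy();  z_plus[0] += eps
z_minus = z.copy(); z_minus[0] -= eps
numeric = (cross_entropy(z_plus, observed) - cross_entropy(z_minus, observed)) / (2 * eps)

print(f"analytic: {analytic:.6f}, finite difference: {numeric:.6f}")
```

Changing observed to 0 (the Setosa index in this sketch) reproduces the ( p_{Setosa} - 1 ) case from the previous article.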
Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.
Just run:
ipm install repo-name
… and you’re done! 🚀

Top comments (1)
This is a very solid walkthrough. I especially like that you do not jump straight to the well known result, but instead show how it emerges from the dependency structure between logits, softmax, and cross entropy.
Many explanations stop at "the gradient is p minus y" and move on, which works for implementation but leaves a gap in understanding. By explicitly tracking which bias affects which logit, and how that influence propagates through softmax, the symmetry between classes becomes obvious rather than magical.
This kind of derivation really helps when debugging training issues or when implementing custom losses, where blindly trusting the formula is no longer enough. It also makes clear why the softmax plus cross entropy combination behaves so cleanly during backpropagation.
Great continuation of the series. Looking forward to seeing how you connect this to the actual optimization step.