Loss Function with Cross Entropy made simple

I found these notes in the Udacity Deep Learning course by Google. I am going to elaborate on them to help you understand the concept better.

Notation:
D(S, L) is the cross-entropy.
L is the one-hot label provided for training.
S(Y) is the output of the softmax: a probability for each class in multinomial logistic classification.
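
In symbols, this is the standard cross-entropy, summed over the classes: D(S, L) = − Σᵢ Lᵢ · log(Sᵢ).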





Why is it called multinomial logistic classification?

Let's look at the figure below. There is an input vector X, with which we train a linear model (also called a logistic regression model): Wx + b. This produces the logits, aka scores, Y, which are then fed into a softmax activation to get a probability for each class.

Linear binary classification is called binomial logistic classification.
"Multinomial" reflects the fact that there are more than two classes (as opposed to binomial, i.e. binary, classification).
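
Here is a minimal sketch of that pipeline in NumPy. The sizes, weights, and input below are made-up illustrative values, not taken from the course:

```python
import numpy as np

# Made-up sizes for illustration: 3 classes, 4 input features.
num_classes, num_features = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(num_classes, num_features))  # weight matrix W
b = np.zeros(num_classes)                          # bias vector b
x = rng.normal(size=num_features)                  # input vector X

logits = W @ x + b  # Y = Wx + b, the logits (scores)

# Softmax turns the logits into class probabilities that sum to 1.
# Subtracting the max is a standard numerical-stability trick.
exp_shifted = np.exp(logits - logits.max())
S = exp_shifted / exp_shifted.sum()

print(S, S.sum())  # probabilities for each class, summing to 1.0
```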


-------------
Let's help you understand the math of cross-entropy. It is basically a function of the output generated by the network and the actual labels. Remember, it is asymmetric, D(S, L) ≠ D(L, S), because of the nasty log term. The reason we take the log of the softmax output rather than of the labels is that the labels are one-hot (a series of 0s and a single 1), and log(0) diverges to -infinity, as seen below.
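
A small sketch makes this concrete. The probabilities and label below are invented examples:

```python
import numpy as np

S = np.array([0.7, 0.2, 0.1])   # softmax output (probabilities)
L = np.array([1.0, 0.0, 0.0])   # one-hot label

def cross_entropy(S, L):
    # D(S, L) = -sum_i L_i * log(S_i)
    return -np.sum(L * np.log(S))

print(cross_entropy(S, L))  # -log(0.7), roughly 0.357

# Swapping the arguments, D(L, S), would need log(0) for the zero entries
# of the one-hot label, which diverges to -infinity. That is why the log is
# taken of the softmax output, never of the labels, and why D is asymmetric.
```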

This is what the loss function looks like when you compute the cross-entropy for every training input and label passed through the network.
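
Written out, the training loss is the cross-entropy averaged over all N training pairs (xᵢ, Lᵢ): 𝓛 = (1/N) · Σᵢ D(S(W·xᵢ + b), Lᵢ).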

Now we need to tune the big matrix W (the weights of the network) using this loss function averaged over the entire training set. One way to do that is to use gradient descent.


Consider a network with 2 weights (a 2-dimensional space). We need to find a numerical solution to the problem: the pair of values of Weight 1 and Weight 2 for which the loss function is at its minimum (the inner red circle in the figure). How do we do that?

Basically, we use the technique of gradient descent. We take the derivative of the loss at different points; the derivative gives you the direction of change. If the derivative is negative, the loss is decreasing (you are progressing in the downhill direction), and following it keeps you on course until you reach a minimum of the loss function, as shown in the figure. However, if you are on the upslope beyond the minimum point of the loss surface shown in the figure, you will move along the negative x-axis direction until you reach the minimum point.
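
Here is a toy sketch of gradient descent on a 2-weight surface. The quadratic loss is invented purely for illustration; in practice the loss would be the averaged cross-entropy above:

```python
import numpy as np

def loss(w):
    # Invented bowl-shaped loss with its minimum at (1.0, -0.5).
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def gradient(w):
    # Analytic derivative of the toy loss with respect to each weight.
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([3.0, 2.0])   # starting point (Weight 1, Weight 2)
learning_rate = 0.1

for step in range(50):
    # Step against the gradient: move downhill wherever the slope points uphill.
    w = w - learning_rate * gradient(w)

print(w, loss(w))  # w ends up close to (1.0, -0.5), where the loss is minimal
```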

I will supplement this with another article that goes into a little more math. Stay tuned :)


