Physics, asked by souradeep55791, 1 year ago

Ground truth and predicted probabilities are given. How do I get the cross entropy loss?

Answers

Answered by Rohit65k0935Me

The cross entropy formula takes in two distributions, p(x), the true distribution, and q(x), the estimated distribution, defined over the discrete variable x, and is given by

H(p,q) = −∑∀x p(x) log(q(x))
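
As a minimal sketch of how that formula might be computed (the function name and the eps guard against log(0) are my own additions, not part of the original answer):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum over x of p(x) * log(q(x)).

    p is the true distribution, q the estimated one; both should be
    non-negative and sum to 1. eps guards against log(0).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))
```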

For a neural network, the calculation is independent of the following:

What kind of layer was used.

What kind of activation was used - although many activations will not be compatible with the calculation because their outputs are not interpretable as probabilities (i.e., their outputs are negative, greater than 1, or do not sum to 1). Softmax is often used for multiclass classification because it guarantees a well-behaved probability distribution (see the sketch below).
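
For illustration, here is a minimal softmax sketch showing why its outputs are interpretable as probabilities (non-negative entries that sum to 1); the max-shift is a standard numerical-stability trick, not something from the original answer:

```python
import numpy as np

def softmax(z):
    """Map raw last-layer outputs to a valid probability vector."""
    z = np.asarray(z, dtype=float)
    z = z - np.max(z)   # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()  # non-negative entries that sum to 1
```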

For a neural network, you will usually see the equation written in a form where y is the ground truth vector and ŷ (or some other value taken directly from the last layer output) is the estimate. For a single example, it would look like this:

L = −y ⋅ log(ŷ)

where ⋅ is the vector dot product.

Your example ground truth y gives all probability to the first value, and the other values are zero, so we can ignore them and just use the matching term from your estimates ŷ:

L = −(1 × log(0.1) + 0 × log(0.5) + ...)

L = −log(0.1) ≈ 2.303
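
Checking that arithmetic directly, using the probabilities quoted in this answer:

```python
import numpy as np

y_true = np.array([1, 0, 0, 0, 0])            # all probability on the first class
y_pred = np.array([0.1, 0.5, 0.1, 0.1, 0.2])  # estimated probabilities

# Only the term matching the true class survives the dot product.
loss = -np.dot(y_true, np.log(y_pred))
print(loss)  # ~2.3026, i.e. -log(0.1)
```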

An important point from the comments:

That means the loss would be the same whether the predictions are [0.1, 0.5, 0.1, 0.1, 0.2] or [0.1, 0.6, 0.1, 0.1, 0.1]?

Yes, this is a key feature of multiclass log loss: it rewards/penalises the probability assigned to the correct class only. The value is independent of how the remaining probability is split between incorrect classes.
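
A quick check of that point, using the two prediction vectors from the comment:

```python
import numpy as np

y_true = np.array([1, 0, 0, 0, 0])

for y_pred in ([0.1, 0.5, 0.1, 0.1, 0.2],
               [0.1, 0.6, 0.1, 0.1, 0.1]):
    # The loss depends only on the probability given to the true class.
    print(-np.dot(y_true, np.log(y_pred)))  # both print ~2.3026
```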

You will often see this equation averaged over all examples as a cost function. The distinction is not always strictly adhered to in descriptions, but usually a loss function is lower level and describes how a single instance or component determines an error value, whilst a cost function is higher level and describes how a complete system is evaluated for optimisation. A cost function based on multiclass log loss for a data set of size N might look like this:

J = −(1/N) ∑_{i=1}^{N} yᵢ ⋅ log(ŷᵢ)
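
As a sketch, that batched cost might be computed like this (vectorised over the N rows; the clip guard against log(0) is my addition):

```python
import numpy as np

def log_loss_cost(Y_true, Y_pred, eps=1e-12):
    """Mean multiclass log loss over a batch.

    Y_true: (N, K) array of one-hot (or soft) target rows.
    Y_pred: (N, K) array of predicted probability rows.
    """
    Y_pred = np.clip(Y_pred, eps, 1.0)  # guard against log(0)
    return -np.mean(np.sum(Y_true * np.log(Y_pred), axis=1))
```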

Many implementations will require your ground truth values to be one-hot encoded (with a single true class), because that allows for some extra optimisation. However, in principle the cross entropy loss can be calculated - and optimised - when this is not the case.
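
For example, a soft (non-one-hot) ground truth still yields a well-defined loss; the numbers below are made up for illustration:

```python
import numpy as np

y_true = np.array([0.9, 0.05, 0.05])  # hypothetical soft / label-smoothed target
y_pred = np.array([0.7, 0.2, 0.1])

# Every nonzero target component now contributes to the loss.
loss = -np.sum(y_true * np.log(y_pred))
print(loss)  # ~0.517
```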
