Physics, asked by mridulgarg9344, 1 year ago

Why is there no vanishing and exploding gradient problem for ReLU?

Answers

Answered by divyansh108
The vanishing gradient problem affects only saturating neurons or units, for example the saturating sigmoid activation function given below.

S(t) = \frac{1}{1 + e^{-t}}

You can easily prove that

\lim_{t \to \infty} S(t) = 1.0

and

\lim_{t \to -\infty} S(t) = 0.0

just by inspection. This saturation effect becomes problematic, as explained below. Consider the first-order derivative of the sigmoid function:

S'(t) = S(t)\,(1.0 - S(t))

Since S(t) ranges from 0.0 to 1.0, its lower and upper asymptotes respectively, the derivative will clearly be near zero for S(t) ≈ 0.0 or S(t) ≈ 1.0. The vanishing gradient problem arises when the error signal, passing backwards, starts approaching zero as it propagates, especially through neurons near saturation. If the network is deep enough, the error signal from the output layer can be completely attenuated on its way back towards the input layer.
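As a quick illustration (a minimal NumPy sketch of the formulas above, not part of the original answer), evaluating S(t) and S'(t) at a few points shows how small the derivative becomes once the unit saturates:

```python
import numpy as np

def sigmoid(t):
    # S(t) = 1 / (1 + e^(-t))
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_grad(t):
    # S'(t) = S(t) * (1 - S(t))
    s = sigmoid(t)
    return s * (1.0 - s)

# The derivative peaks at 0.25 at t = 0 and collapses towards zero as |t| grows.
for t in [0.0, 2.0, 5.0, 10.0]:
    print(f"t={t:5.1f}  S(t)={sigmoid(t):.6f}  S'(t)={sigmoid_grad(t):.6f}")
# t=  0.0  S(t)=0.500000  S'(t)=0.250000
# t= 10.0  S(t)=0.999955  S'(t)=0.000045   <- saturated: gradient nearly gone
```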

The attenuation comes about because the derivative S'(t) will always be near zero for saturating neurons. Notice that if you use the chain rule, which is backpropagation in neural-net terms, you end up multiplying this almost-zero derivative with the error signal before passing it backwards at every level or stage. Keep doing that as the error signal travels backwards and it becomes weaker and weaker, hence it vanishes. The hyperbolic tangent has the same saturating behaviour, so the vanishing gradient problem affects it as well.
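To make the repeated multiplication concrete, here is a toy sketch (the 20-layer depth and the 0.05 per-layer derivative are assumptions for illustration, not numbers from the answer):

```python
# Suppose each of 20 layers contributes a saturated-sigmoid derivative of
# roughly 0.05 to the chain-rule product. The backward error signal shrinks
# multiplicatively at every layer it passes through.
error_signal = 1.0
per_layer_derivative = 0.05  # typical magnitude for a unit near saturation

for layer in range(20):
    error_signal *= per_layer_derivative

print(error_signal)  # ~9.5e-27: effectively zero by the time it reaches the early layers
```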

Units active in the linear region of the sigmoid won't attenuate the error signal that much, but this is still generally problematic for very deep nets.

That's why ReLUs are favorable: not only do they solve the vanishing gradient problem, they also result in highly sparse neural nets, and sparsity means efficient and reliable performance. The rectifier is given below.

f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}

Clearly it ranges from 0 to positive infinity, so it is non-saturating, and its derivative is given by

f'(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}

For every active unit (x ≥ 0) the derivative is exactly 1, hence no attenuation of an error signal propagating backwards. This makes ReLUs favorable for the deeper trainable feature detectors, as in ConvNets: you can have very deep neural nets with ReLUs without the vanishing gradient problem.
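A minimal NumPy sketch of the rectifier and its derivative (again an illustration added here, not part of the original answer):

```python
import numpy as np

def relu(x):
    # f(x) = 0 for x < 0, x for x >= 0
    return np.maximum(0.0, x)

def relu_grad(x):
    # f'(x) = 0 for x < 0, 1 for x >= 0
    return (x >= 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0, 100.0])
print(relu(x))       # [  0.    0.    0.    0.5   3.  100. ]
print(relu_grad(x))  # [0. 0. 1. 1. 1. 1.]
# The gradient of every active unit is exactly 1, no matter how large x gets,
# so the backward error signal passes through active units unattenuated.
```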

EDIT:

You do notice that the negative region has a zero derivative, right? This can be a problem: the neuron is off within this region, it cannot learn, and gradients cannot be backpropagated through an off neuron. One remedy is to add a leakage factor, which gives a non-zero derivative in the negative region and results in a modified unit known as the leaky ReLU.
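A sketch of that remedy (the leakage factor ALPHA = 0.01 below is an assumed, commonly used value, not one stated in the answer):

```python
import numpy as np

ALPHA = 0.01  # assumed leakage factor; small positive slope for x < 0

def leaky_relu(x):
    # alpha * x for x < 0, x for x >= 0
    return np.where(x >= 0, x, ALPHA * x)

def leaky_relu_grad(x):
    # alpha for x < 0, 1 for x >= 0: never exactly zero, so an "off" unit
    # still receives some gradient and can recover.
    return np.where(x >= 0, 1.0, ALPHA)

x = np.array([-5.0, -1.0, 0.5, 4.0])
print(leaky_relu(x))       # [-0.05 -0.01  0.5   4.  ]
print(leaky_relu_grad(x))  # [0.01  0.01  1.    1.  ]
```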

Hope this helps.

Sources:

Activation function · Sigmoid function

