Why is the vanishing and exploding gradient problem relevant to ReLU?
Answers
The vanishing gradient problem affects saturating neurons or units only. Take, for example, the saturating sigmoid activation function given below.
S(t) = \frac{1}{1 + e^{-t}}
You can easily prove that
\lim_{t \to \infty} S(t) = 1.0
and
\lim_{t \to -\infty} S(t) = 0.0
just by inspection. This saturation effect becomes problematic, as explained below. Consider the first-order derivative of the sigmoid function:
S'(t) = S(t)\,(1.0 - S(t))
Since S(t) ranges from 0.0 to 1.0, its lower and upper asymptotes respectively, the derivative will clearly be near zero for S(t) ≈ 0.0 or S(t) ≈ 1.0. The vanishing gradient problem comes about when the error signal passing backwards starts approaching zero as it propagates, especially through neurons near saturation. If the network is deep enough, the error signal from the output layer can be completely attenuated on its way back towards the input layer.
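To see the saturation numerically, here is a minimal Python sketch (not part of the original answer, just an illustration) that evaluates the sigmoid and its derivative at a few points. The derivative peaks at 0.25 at t = 0 and collapses towards zero for large |t|.

import math

def sigmoid(t):
    """Logistic sigmoid S(t) = 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_grad(t):
    """First-order derivative S'(t) = S(t) * (1 - S(t))."""
    s = sigmoid(t)
    return s * (1.0 - s)

for t in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(f"t = {t:6.1f}   S(t) = {sigmoid(t):.6f}   S'(t) = {sigmoid_grad(t):.6f}")
# S'(t) peaks at 0.25 when t = 0 and is roughly 4.5e-5 at |t| = 10,
# which is exactly the saturation effect described above.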
The attenuation comes about because the derivative S'(t) will be near zero for saturating neurons. Notice that when you apply the chain rule, which is backpropagation in neural-net terms, you multiply this almost-zero derivative by the error signal before passing it backwards at every layer. Keep doing that as the error signal travels backwards and it grows weaker and weaker, hence it vanishes. The hyperbolic tangent has the same saturating behavior, so the vanishing gradient problem affects it as well.
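A rough illustration of that repeated multiplication is sketched below. This is a toy calculation, not a full backpropagation implementation: it ignores the weight terms and simply assumes each layer contributes at most the maximum sigmoid derivative of 0.25, so the backward signal shrinks geometrically with depth.

# Toy illustration of the chain rule's repeated multiplication: each layer
# multiplies the backward error signal by a local sigmoid derivative, here
# taken at its maximum possible value of 0.25 (weights are ignored).
error_signal = 1.0
max_sigmoid_grad = 0.25

for layer in range(1, 21):
    error_signal *= max_sigmoid_grad
    if layer in (5, 10, 20):
        print(f"after {layer:2d} layers: error signal ~ {error_signal:.1e}")
# after  5 layers: ~9.8e-04
# after 10 layers: ~9.5e-07
# after 20 layers: ~9.1e-13  -> effectively vanished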
Units active in the linear region of the sigmoid won't attenuate the error signal as much, but saturation is still generally problematic for very deep nets.
That's why ReLUs are favorable: not only do they solve the vanishing gradient problem, they also result in highly sparse neural nets, and sparsity means efficient and reliable performance. The rectifier is given below.
f(x) = \begin{cases} 0, & \text{for } x < 0 \\ x, & \text{for } x \ge 0 \end{cases}
Clearly it ranges from 0 to positive infinity, so it is non-saturating, and its derivative is given by
f'(x) = \begin{cases} 0, & \text{for } x < 0 \\ 1, & \text{for } x \ge 0 \end{cases}
For active units (x ≥ 0) the derivative is exactly 1, so there is no attenuation of the error signal propagating backwards through them. This makes ReLUs favorable for the deeper trainable feature detectors, as in ConvNets: you can have very deep neural nets with ReLUs without the vanishing gradient problem.
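A minimal sketch of the rectifier and its derivative (again illustrative only, not from the original answer):

def relu(x):
    """Rectifier: f(x) = 0 for x < 0, x for x >= 0."""
    return x if x >= 0.0 else 0.0

def relu_grad(x):
    """Derivative matching the piecewise definition above:
    0 for x < 0, 1 for x >= 0 (the value at exactly 0 is a convention)."""
    return 1.0 if x >= 0.0 else 0.0

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x = {x:5.1f}   relu = {relu(x):4.1f}   grad = {relu_grad(x):3.1f}")
# Every active unit has a local derivative of exactly 1, so the error
# signal passes backwards through it without attenuation.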
EDIT:
You do notice that the negative region has a zero derivative, right? This can be a problem: a neuron that is off in this region cannot learn, and gradients cannot be backpropagated through an off neuron. One remedy is to add a leakage factor that gives the negative region a non-zero derivative, resulting in a modified unit known as the leaky ReLU.
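For completeness, here is a hedged sketch of that leaky variant. The slope of 0.01 for the negative region is an assumed, commonly used default; the original answer does not specify a value, and in practice it is a hyperparameter.

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x for x < 0 (alpha assumed to be 0.01)."""
    return x if x >= 0.0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 for x >= 0, alpha for x < 0 -- never exactly zero,
    so gradients can still flow through units in the negative region."""
    return 1.0 if x >= 0.0 else alpha

print(leaky_relu(-3.0), leaky_relu_grad(-3.0))  # -0.03 0.01
print(leaky_relu(2.0), leaky_relu_grad(2.0))    # 2.0 1.0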
Hope this helps.
Sources:
Activation function · Sigmoid function
OTHER ANSWERS

Wonwoo Park, MLP, BP, Kohonen, NNToolbox, CNN
Answered Jun 8 2016 · Author has 73 answers and 29.2k answer views
Maybe because of the linearity …
As a partial linearity, it is also not perfect.