Vanishing Gradient Problem
What & How?
During backpropagation (from the final layer back to the initial layer), the weights of a deep neural network are updated using the gradients of the loss. With certain activation functions, the gradients computed for the initial (front) layers become very small, so the new weights are approximately equal to the old weights. Those layers effectively stop learning and training becomes very slow. This is the vanishing gradient problem.
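To make the effect concrete, here is a minimal sketch in plain Python (the learning rate and gradient value are illustrative assumptions) of a single gradient-descent update: when the gradient reaching an early layer is tiny, the updated weight is essentially the old one.

```python
# Gradient-descent update: w_new = w_old - learning_rate * gradient
learning_rate = 0.01
w_old = 0.5

gradient = 1e-7          # a vanishingly small gradient reaching an early layer
w_new = w_old - learning_rate * gradient

print(w_old, w_new)      # 0.5 vs 0.499999999 -> effectively no learning
```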
Why?
The problem is caused mainly by specific activation functions such as Sigmoid and Tanh. The derivative of the sigmoid lies in (0, 0.25] and the derivative of tanh lies in (0, 1]. For inputs of large magnitude these functions saturate, so their derivatives are close to zero.
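As a quick check, here is a small NumPy sketch of the sigmoid derivative, s'(x) = s(x) * (1 - s(x)): it peaks at 0.25 at x = 0 and shrinks towards zero as |x| grows (the sample inputs are arbitrary).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)       # maximum value is 0.25, reached at x = 0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_derivative(x))
# 0.0 -> 0.25, 2.0 -> ~0.105, 5.0 -> ~0.0066, 10.0 -> ~0.000045
```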
This is not a serious issue for shallow networks, but it becomes a big problem for deep networks. During backpropagation, the gradients are computed from the final layer back to the initial layer, and by the chain rule the gradient of each hidden layer depends on the derivatives of all the layers after it.
For n hidden layers with sigmoid or tanh activations, these derivatives are multiplied together, and the result is scaled by the learning rate. The gradient reaching the initial layers therefore decreases exponentially with depth, so the weights of those layers are barely updated: the old weight and the new weight are approximately equal.
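The multiplicative effect can be illustrated with a short sketch. Assuming, for simplicity, that each layer contributes a derivative at the sigmoid's maximum of 0.25, the gradient reaching the first layer shrinks exponentially with the number of layers.

```python
# Chain rule through n sigmoid layers: the local derivatives multiply.
# Assuming each factor is at most 0.25 (the sigmoid derivative's maximum),
# the gradient reaching the first layer decays exponentially with depth.
for n_layers in [2, 5, 10, 20]:
    upper_bound = 0.25 ** n_layers
    print(n_layers, upper_bound)
# 2 -> 0.0625, 5 -> ~0.00098, 10 -> ~9.5e-07, 20 -> ~9.1e-13
```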
Solution?
- Replace the Sigmoid (or Tanh) activation function with the ReLU activation function in the hidden layers (see the sketch after this list).
- Reduce the number of layers.
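For example, here is a minimal Keras sketch (assuming TensorFlow is installed; the layer sizes and input shape are illustrative) where the hidden layers use ReLU instead of sigmoid, so their derivative is 1 for positive inputs and the backpropagated gradient is not repeatedly shrunk:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                     # 20 input features (illustrative)
    tf.keras.layers.Dense(64, activation="relu"),    # ReLU in the hidden layers
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sigmoid kept only at the output
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```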