Various Optimizers

Optimizer

Optimizers are methods or algorithms used to change the weights and learning rates of a network in order to reduce the losses.

Gradient descent

Gradient descent is one of the most basic optimizers in deep learning. It is used in linear regression and classification problems. Gradient descent is a first-order optimization algorithm: it depends on the first-order derivative of the loss function. It calculates how the weights should be altered in order to reduce the loss function and reach the minimum.
θ=θ−α⋅∇J(θ)
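
As a rough illustration of this update rule, here is a minimal NumPy sketch (the quadratic toy objective and the step size α = 0.1 are illustrative choices, not from the post):

```python
import numpy as np

def gradient_descent_step(theta, grad_fn, alpha=0.1):
    """One gradient descent update: theta = theta - alpha * grad J(theta)."""
    return theta - alpha * grad_fn(theta)

# Toy objective J(theta) = ||theta||^2, whose gradient is 2 * theta.
grad_fn = lambda t: 2 * t

theta = np.array([3.0, -2.0])
for _ in range(50):
    theta = gradient_descent_step(theta, grad_fn)
print(theta)  # approaches [0, 0], the minimum of J
```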

Advantages

1. Easy to understand and compute

Disadvantages

1. May get trapped at local minima
2. Weights are updated only after the gradient has been computed over the whole dataset, so convergence is slow
3. Requires more memory, since the gradient over the entire dataset is needed for a single update

Stochastic gradient descent

This is a variant of gradient descent that updates the parameters much more frequently: the weights are updated after the loss is computed for each individual training example.
θ=θ−α⋅∇J(θ;x(i);y(i))
where x(i) and y(i) are a single training example and its label.
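
A minimal sketch of this per-example update, using squared error on a linear model as an illustrative loss (the data, learning rate, and epoch count are made up for the example):

```python
import numpy as np

def sgd(theta, X, y, alpha=0.01, epochs=20):
    """Stochastic gradient descent: update theta after each training example."""
    n = len(y)
    for _ in range(epochs):
        for i in np.random.permutation(n):          # visit examples in random order
            error = X[i] @ theta - y[i]             # prediction error for one example
            theta = theta - alpha * error * X[i]    # gradient of 0.5*error**2 w.r.t. theta
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -1.0])
y = X @ true_theta
print(sgd(np.zeros(2), X, y))  # approaches [2, -1]
```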

Advantages

1. Converges toward a minimum quickly
2. Requires very little memory

Disadvantages

1. High variance in the parameter updates
2. May overshoot even after reaching the global minimum

Mini-batch gradient descent

This algorithm is more efficient than both standard (batch) gradient descent and stochastic gradient descent. The training data set is divided into small batches, and the weights are updated after every batch.
θ=θ−α⋅∇J(θ; B(i))

where B(i) is the i-th mini-batch of training examples.
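
A sketch of the same linear-regression setup with mini-batch updates (batch size and other hyperparameters are illustrative):

```python
import numpy as np

def minibatch_gd(theta, X, y, alpha=0.05, batch_size=16, epochs=30):
    """Update theta once per mini-batch B(i) instead of per example or per full pass."""
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ theta - y[batch]
            grad = X[batch].T @ error / len(batch)   # gradient averaged over the batch
            theta = theta - alpha * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])
print(minibatch_gd(np.zeros(2), X, y))  # approaches [2, -1]
```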

Advantages

1. Requires a moderate amount of memory
2. Has lower variance in the updates, since gradients are averaged over each batch

Disadvantages

1. Choosing an optimal learning rate is still challenging
2. May get trapped at local minima

Gradient descent with momentum

Momentum was introduced to reduce the high variance of stochastic gradient descent. It smooths the convergence path and accelerates movement in the relevant direction by adding a fraction of the previous update to the current one.
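
The post does not give the momentum formula, so the sketch below assumes the standard formulation v = γ⋅v + α⋅∇J(θ), θ = θ − v, with γ = 0.9:

```python
import numpy as np

def momentum_step(theta, velocity, grad, alpha=0.1, gamma=0.9):
    """Classical momentum: accumulate a velocity and step along it."""
    velocity = gamma * velocity + alpha * grad   # v = gamma*v + alpha*grad
    theta = theta - velocity                     # theta = theta - v
    return theta, velocity

# Toy objective J(theta) = ||theta||^2 with gradient 2 * theta.
theta = np.array([3.0, -2.0])
velocity = np.zeros_like(theta)
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, 2 * theta)
print(theta)  # approaches [0, 0]
```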

Advantages

1. Reduces oscillations and the high variance of the parameter updates
2. Converges very fast

Disadvantage

1. One more hyperparameter (the momentum coefficient γ) is added and must be selected manually

RMSprop

The Root Mean Square Propagation (RMSprop) optimizer is similar to gradient descent with momentum. RMSprop restricts the oscillations in the vertical direction, so we can increase the learning rate and take larger steps in the horizontal direction, converging faster. The difference between RMSprop and gradient descent lies in how the gradients are used: RMSprop divides the step by an exponentially decaying average of the squared gradients. The decay rate (sometimes called the momentum term) is denoted by beta and is usually set to 0.9.
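
A minimal sketch of the RMSprop update, assuming the usual formulation with a decaying average of squared gradients (beta = 0.9 as mentioned above; the other constants are illustrative):

```python
import numpy as np

def rmsprop_step(theta, avg_sq_grad, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """RMSprop: divide the step by a running RMS of recent gradients."""
    avg_sq_grad = beta * avg_sq_grad + (1 - beta) * grad ** 2   # E[g^2]
    theta = theta - alpha * grad / (np.sqrt(avg_sq_grad) + eps)
    return theta, avg_sq_grad

theta = np.array([3.0, -2.0])
avg_sq_grad = np.zeros_like(theta)
for _ in range(500):
    theta, avg_sq_grad = rmsprop_step(theta, avg_sq_grad, 2 * theta)
print(theta)  # settles near [0, 0]
```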

Advantages

1. Converges very fast

Adagrad

One disadvantage of all the optimizers explained so far is that the learning rate is constant for all parameters and for every cycle. Adagrad changes this: it adapts the learning rate ‘η’ for each parameter and at every time step ‘t’, scaling it by the history of gradients computed for that parameter.
η is a learning rate that is modified for a given parameter θ(i) at a given time step based on the previous gradients calculated for that parameter. We store the sum of the squares of the gradients w.r.t. θ(i) up to time step t in G(t), while ϵ is a smoothing term that avoids division by zero (usually on the order of 1e−8):
θ(i) = θ(i) − (η / √(G(t) + ϵ)) ⋅ ∇J(θ(i))
Interestingly, without the square-root operation the algorithm performs much worse.
It makes big updates for infrequent parameters and small updates for frequent parameters.
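
A minimal sketch of the Adagrad update described above (η and the toy gradient are illustrative):

```python
import numpy as np

def adagrad_step(theta, grad_sq_sum, grad, eta=0.5, eps=1e-8):
    """Adagrad: per-parameter step scaled by the accumulated squared gradients."""
    grad_sq_sum = grad_sq_sum + grad ** 2                         # G(t), one entry per parameter
    theta = theta - eta * grad / np.sqrt(grad_sq_sum + eps)       # theta = theta - eta/sqrt(G+eps) * g
    return theta, grad_sq_sum

theta = np.array([3.0, -2.0])
grad_sq_sum = np.zeros_like(theta)
for _ in range(200):
    theta, grad_sq_sum = adagrad_step(theta, grad_sq_sum, 2 * theta)
print(theta)  # moves toward [0, 0], but the steps shrink as G(t) grows
```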

Advantages

1. The learning rate adapts automatically for each parameter at every step; manual changes are not required

Disadvantages

1. Computationally expensive, since a squared-gradient sum must be computed and stored for every parameter
2. The learning rate keeps decaying, which can eventually stall training

Adadelta

Adadelta is an extension of Adagrad. It rectifies Adagrad's disadvantage by removing the decaying learning rate: instead of accumulating all previously squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w. An exponentially decaying moving average of the squared gradients is used rather than the sum of all past gradients.
We set γ to a similar value as the momentum term, around 0.9.
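
A sketch assuming the usual Adadelta formulation, where both the squared gradients and the squared updates are tracked with decaying averages (γ = 0.9 as above; ϵ is illustrative):

```python
import numpy as np

def adadelta_step(theta, avg_sq_grad, avg_sq_delta, grad, gamma=0.9, eps=1e-6):
    """Adadelta: no global learning rate; the step is scaled by the ratio of
    RMS(past updates) to RMS(past gradients)."""
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * grad ** 2
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_delta = gamma * avg_sq_delta + (1 - gamma) * delta ** 2
    theta = theta + delta
    return theta, avg_sq_grad, avg_sq_delta

theta = np.array([3.0, -2.0])
avg_sq_grad = np.zeros_like(theta)
avg_sq_delta = np.zeros_like(theta)
for _ in range(2000):
    theta, avg_sq_grad, avg_sq_delta = adadelta_step(
        theta, avg_sq_grad, avg_sq_delta, 2 * theta)
print(theta)  # moves toward [0, 0]; Adadelta starts with tiny steps and speeds up
```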

Advantages

1. The learning rate does not decay to zero, so training does not stall

Disadvantage

1. Computationally expensive

Adam

Adam (Adaptive Moment Estimation) works with estimates of the first and second moments of the gradients.
Adam combines the advantages of both Adagrad and RMSprop. The key idea of Adam is that, instead of adapting the parameter learning rates based only on the average of the second moments of the gradients (the uncentered variance) as in RMSprop, Adam also makes use of the average of the first moments (the mean), similar to momentum.
M(t) and V(t) are the estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.
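
A sketch of the Adam update using M(t) and V(t) as described, with the commonly used defaults β1 = 0.9, β2 = 0.999, ϵ = 1e−8 (these constants are assumptions, not stated in the post):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first and second moment estimates of the gradients."""
    m = beta1 * m + (1 - beta1) * grad            # M(t), mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # V(t), uncentered variance
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 301):                           # t starts at 1 for bias correction
    theta, m, v = adam_step(theta, m, v, 2 * theta, t)
print(theta)  # settles close to [0, 0]
```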

Advantages

1. It is a very effective and fast method
2. It also rectifies the vanishing (decaying) learning rate and the high variance of the updates

Disadvantage

1. It is computationally expensive

Conclusion

Of all the optimizer algorithms described above, Adam is generally the best choice: if one wants to train a model effectively and in less time, Adam is the one to use. Among the gradient descent variants, mini-batch gradient descent is a good option.
