(16) OPTIMIZATION: Nesterov Momentum or Nesterov Accelerated Gradient (NAG)
Improving Momentum Gradient Descent
In the previous post, we saw how momentum gradient descent (MGD) enhances the regular gradient descent (GD) algorithm. In this post, we will learn how to improve MGD even further, for even faster convergence.
Remember that all three algorithms (GD, MGD and NAG) are able to find the minimum of the loss function, but it is important to achieve the fastest convergence possible in order to reduce the computing resources used. Fewer iterations mean fewer resources, less time and lower costs.
Nesterov Momentum, also known as Nesterov Accelerated Gradient (NAG), tweaks the standard Momentum update by incorporating the gradient evaluated not at the current parameters θ but at an approximation of where the parameters will be at the next time step, based on the current momentum.
Let’s recall the formula for MGD:
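In one common formulation, with learning rate $\eta$, momentum coefficient $\gamma$ and loss $J(\theta)$, the velocity accumulates past gradients and the parameters move along it:

$$
\begin{aligned}
v_t &= \gamma\, v_{t-1} + \eta\, \nabla_\theta J(\theta_{t-1}) \\
\theta_t &= \theta_{t-1} - v_t
\end{aligned}
$$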
With NAG, the update term is different, as we anticipate the future position of the parameters using the momentum coefficient and the velocity vector:
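Under the same notation, one common way to write the NAG update evaluates the gradient at the look-ahead point $\theta_{t-1} - \gamma v_{t-1}$ instead of at the current parameters $\theta_{t-1}$:

$$
\begin{aligned}
v_t &= \gamma\, v_{t-1} + \eta\, \nabla_\theta J(\theta_{t-1} - \gamma\, v_{t-1}) \\
\theta_t &= \theta_{t-1} - v_t
\end{aligned}
$$

To make the difference concrete, here is a minimal NumPy sketch of both update rules (the function names nag_update, momentum_update and grad_fn are illustrative, not from any particular library):

```python
import numpy as np

def momentum_update(theta, velocity, grad_fn, lr=0.01, momentum=0.9):
    """One MGD step: gradient evaluated at the current parameters."""
    velocity = momentum * velocity + lr * grad_fn(theta)
    return theta - velocity, velocity

def nag_update(theta, velocity, grad_fn, lr=0.01, momentum=0.9):
    """One NAG step: gradient evaluated at the anticipated future position."""
    lookahead = theta - momentum * velocity  # where the momentum is about to take us
    velocity = momentum * velocity + lr * grad_fn(lookahead)
    return theta - velocity, velocity

# Toy example: minimise J(theta) = theta^2, whose gradient is 2 * theta
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = nag_update(theta, v, grad_fn=lambda t: 2 * t)
print(theta)  # very close to the minimum at 0
```

Both rules take the same velocity-based step; NAG only changes where the gradient is measured, which lets it correct the direction before the momentum overshoots.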