Polyak momentum is one of the most iconic methods in optimization. Despite its simplicity, it features rich dynamics that depend both on the step-size and momentum parameter. In this blog post we identify the different regions of the parameter space and discuss their convergence properties using the theory of Chebyshev polynomials.
Gradient Descent with Momentum
Gradient descent with momentum, also known as heavy ball or momentum for short, is an optimization method designed to solve unconstrained minimization problems of the form
$$\min_{x \in \mathbb{R}^d} f(x)\,,$$
where the objective function $f$ is differentiable and we have access to its gradient $\nabla f$. In this method the update is a sum of two terms. The first term is the difference between the current and the previous iterate, $(x_t - x_{t-1})$, also known as the momentum term. The second term is the gradient $\nabla f(x_t)$ of the objective function.
Gradient Descent with Momentum
Input: starting guess $x_0$, step-size $\step > 0$ and momentum parameter $\mom \in (0, 1)$. Set $x_1 = x_0 - \frac{\step}{1 + \mom}\nabla f(x_0)$.
For $t = 1, 2, \ldots$ compute
$$x_{t+1} = x_t + \mom\,(x_t - x_{t-1}) - \step\,\nabla f(x_t)\,.$$
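To make the update concrete, here is a minimal NumPy sketch of the iteration. This is our own illustration, not the post's notebook; the first iterate follows the convention $x_1 = x_0 - \frac{\step}{1+\mom}\nabla f(x_0)$ used in the analysis below, and the test problem and parameter values are arbitrary.

```python
import numpy as np

def momentum(grad, x0, step, mom, n_iter=100):
    """Gradient descent with momentum (heavy ball) on a differentiable objective.

    Iterates x_{t+1} = x_t + mom * (x_t - x_{t-1}) - step * grad(x_t),
    with the first step x_1 = x_0 - step / (1 + mom) * grad(x_0).
    """
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev - step / (1 + mom) * grad(x_prev)
    for _ in range(n_iter - 1):
        x, x_prev = x + mom * (x - x_prev) - step * grad(x), x
    return x

# quadratic example: f(x) = 1/2 (x - x_star)^T H (x - x_star)
H = np.diag([0.1, 1.0, 10.0])
x_star = np.ones(3)
grad = lambda x: H @ (x - x_star)
print(momentum(grad, np.zeros(3), step=0.33, mom=0.67, n_iter=200))  # ~[1. 1. 1.]
```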
Despite its simplicity, gradient descent with momentum exhibits unexpectedly rich dynamics that we’ll explore in this post.
The origins of momentum can be traced back to Frankel’s method in the 1950s for solving linear systems of equations. It was later generalized by Boris Polyak to non-quadratic objectives. In recent years there has been a resurgence of interest in this venerable method, as a stochastic variant, where the gradient is replaced by a stochastic estimate, is one of the most popular methods for deep learning. This has led to a flurry of research (and improved understanding) of the stochastic variant. Although this blog post limits itself to the deterministic variant, the interested reader is encouraged to explore the following references. A good starting point is the paper by Sutskever et al., which was among the first to highlight the importance of momentum for deep learning optimization. More recent progress includes an analysis of the last iterate of the method by Tao et al. and a paper by Liu et al. that develops accelerated variants for over-parameterized models.
Coming back to the subject of this post, the (non-stochastic) gradient descent with momentum method, another work that explores the dynamics of momentum is Gabriel Goh’s Why Momentum Really Works. There are subtle but important differences between both analyses. The landscape described in its section “The Dynamics of Momentum” describes the improvement along the direction of a single eigenvector. This partial view produces some misleading conclusions. For example, along the direction of a single eigenvector, the largest improvement is achieved with zero momentum and a step-size of 1 over the associated eigenvalue. This conclusion, however, doesn’t hold in higher dimensions where, as we will see, the momentum term that yields the fastest convergence is non-zero.
How fast is Momentum?
Momentum is fast. So fast that it’s often the default choice of machine learning practitioners. But can we quantify this more precisely?
Throughout the post we’ll assume that our objective function is a quadratic of the form
$$f(x) = \tfrac{1}{2}(x - x^\star)^\top H\,(x - x^\star)\,,$$
where $H$ is a symmetric positive definite matrix and $x^\star$ is the minimizer of the objective. We’ll assume that the eigenvalues of $H$ are in the interval $[\lmin, L]$.
The measure we’ll use to quantify the speed of convergence is the rate of convergence. This is the worst-case relative improvement in the iterate suboptimality at iteration $t$, defined as
$$r_t = \max_{x_0,\, H} \frac{\|x_t - x^\star\|}{\|x_0 - x^\star\|}\,.$$
This is a worst-case measure because, among all problem instances, we take the worst possible initialization $x_0$ and matrix $H$ with eigenvalues in the interval $[\lmin, L]$.
This is a measure of how much progress is made (in the worst case) at iteration $t$. The smaller the value of $r_t$, the faster the algorithm converges. Since all algorithms that we consider converge exponentially fast, for large enough $t$ the error is of the order of $r^t$ for some $r \in (0, 1)$. Hence the most informative quantity is the value of $r$ in this expression. We’ll call this quantity the asymptotic rate of convergence, and denote it
$$r_{\infty} = \limsup_{t \to \infty} \sqrt[t]{r_t}\,.$$
This is the quantity we’ll be discussing throughout the post and what we’ll use to compare the speed of momentum for different values of its hyperparameters.
A connection between optimization methods and polynomials
To easily compute the asymptotic rate of convergence for all admissible values of the step-size and momentum, we’ll use a connection between the optimization of quadratic functions and the theory of orthogonal polynomials. This theory was extensively used in the early days of numerical analysis and provides an elegant and simple way to compute asymptotic rates (and non-asymptotic ones, although those are not the topic of this blog post) from known results in the theory of orthogonal polynomials. We favor this technique for its simplicity and elegance, although other techniques could be used with identical results. These include the linear operator technique used by Polyak, the estimate sequences technique pioneered by Nesterov, and the use of Lyapunov functions.
The main result that will allow us to make the link between optimization and orthogonal polynomials is the following. Its origins seem unclear, although a proof can be found in the 1959 monograph of Rutishauser.
Consider the following polynomial $\tilde{P}_t$ of degree $t$, defined recursively as:
$$\tilde{P}_{t+1}(\lambda) = (1 + \mom - \step\,\lambda)\,\tilde{P}_t(\lambda) - \mom\,\tilde{P}_{t-1}(\lambda)\,,\qquad \tilde{P}_1(\lambda) = 1 - \tfrac{\step}{1+\mom}\,\lambda\,,\qquad \tilde{P}_0(\lambda) = 1\,.$$
Then we can write the suboptimality at iteration $t$ as
$$x_t - x^\star = \tilde{P}_t(H)\,(x_0 - x^\star)\,,$$
where $\tilde{P}_t(H)$ is the matrix obtained from evaluating the (originally real-valued) polynomial at the matrix $H$.
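As a sanity check, the following snippet (our own, using the same first-step convention as above and an arbitrary random quadratic) verifies that the momentum iterates coincide with $\tilde{P}_t(H)(x_0 - x^\star)$ computed from the recurrence:

```python
import numpy as np

rng = np.random.default_rng(0)
d, step, mom, T = 4, 0.05, 0.8, 25

# random quadratic: f(x) = 1/2 (x - x_star)^T H (x - x_star)
A = rng.standard_normal((d, d))
H = A @ A.T / d + np.eye(d)              # symmetric positive definite
x_star = rng.standard_normal(d)
grad = lambda x: H @ (x - x_star)

# momentum iterates and matrix polynomials P_t(H), advanced in lockstep
x_prev = x0 = rng.standard_normal(d)
x = x0 - step / (1 + mom) * grad(x0)
P_prev, P = np.eye(d), np.eye(d) - step / (1 + mom) * H   # P_0(H), P_1(H)
for t in range(1, T):
    x, x_prev = x + mom * (x - x_prev) - step * grad(x), x
    P, P_prev = (1 + mom) * P - step * H @ P - mom * P_prev, P
    assert np.allclose(x - x_star, P @ (x0 - x_star))      # x_t - x* = P_t(H)(x0 - x*)
print("identity verified for", T, "iterations")
```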
This last identity will allow us to easily compute convergence rates. In particular, plugging it into the definition of the convergence rate, we get that the rate is determined by the maximum absolute value of the residual polynomial over the interval:
$$r_t = \max_{\lambda \in [\lmin, L]} |\tilde{P}_t(\lambda)|\,.$$
We’ve now reduced the problem of computing the convergence rate to the problem of bounding the absolute value of a polynomial over a given interval. This is a problem that has been extensively studied in the theory of orthogonal polynomials. In particular, we’ll use known bounds on Chebyshev polynomials of the first and second kind, as the residual polynomial of momentum can be written as a convex combination of these two polynomials. This fact is proven in the next result, which is a generalization of equation (II.29) in (Rutishauser, 1959).
The residual polynomial of momentum can be written in terms of Chebyshev polynomials of the first and second kind as
$$\tilde{P}_t(\lambda) = \mom^{t/2}\left(\tfrac{2\mom}{1+\mom}\,T_t(\sigma(\lambda)) + \tfrac{1-\mom}{1+\mom}\,U_t(\sigma(\lambda))\right),$$
where $\sigma(\lambda) = \frac{1 + \mom - \step\,\lambda}{2\sqrt{\mom}}$ is a linear function that we'll refer to as the link function, and $T_t$ and $U_t$ are the Chebyshev polynomials of the first and second kind respectively.
Let's denote by $Q_t$ the right-hand side of the above equation, that is,
$$Q_t(\lambda) = \mom^{t/2}\left(\tfrac{2\mom}{1+\mom}\,T_t(\sigma(\lambda)) + \tfrac{1-\mom}{1+\mom}\,U_t(\sigma(\lambda))\right).$$
Our goal is to show that $\tilde{P}_t(\lambda) = Q_t(\lambda)$ for all $t$.
For $t = 0$, $T_0(\sigma(\lambda)) = 1$ and $U_0(\sigma(\lambda)) = 1$, so we have
$$Q_0(\lambda) = \tfrac{2\mom}{1+\mom} + \tfrac{1-\mom}{1+\mom} = 1\,,$$
which corresponds to the definition of $\tilde{P}_0$ in the recurrence. Similarly, for $t = 1$, using $T_1(\xi) = \xi$ and $U_1(\xi) = 2\xi$ gives $Q_1(\lambda) = \tfrac{2\sqrt{\mom}}{1+\mom}\,\sigma(\lambda) = 1 - \tfrac{\step}{1+\mom}\lambda = \tilde{P}_1(\lambda)$.
Assume the identity is true for all iterations up to $t$; we will show it also holds for $t+1$. Using the three-term recurrence of Chebyshev polynomials, $T_{t+1}(\xi) = 2\xi\,T_t(\xi) - T_{t-1}(\xi)$ and $U_{t+1}(\xi) = 2\xi\,U_t(\xi) - U_{t-1}(\xi)$, we have
$$\begin{aligned}
Q_{t+1}(\lambda) &= \mom^{(t+1)/2}\left(\tfrac{2\mom}{1+\mom}\,T_{t+1}(\sigma(\lambda)) + \tfrac{1-\mom}{1+\mom}\,U_{t+1}(\sigma(\lambda))\right)\\
&= 2\sigma(\lambda)\sqrt{\mom}\;\mom^{t/2}\left(\tfrac{2\mom}{1+\mom}\,T_{t}(\sigma(\lambda)) + \tfrac{1-\mom}{1+\mom}\,U_{t}(\sigma(\lambda))\right) - \mom\;\mom^{(t-1)/2}\left(\tfrac{2\mom}{1+\mom}\,T_{t-1}(\sigma(\lambda)) + \tfrac{1-\mom}{1+\mom}\,U_{t-1}(\sigma(\lambda))\right)\\
&= (1 + \mom - \step\,\lambda)\,\tilde{P}_t(\lambda) - \mom\,\tilde{P}_{t-1}(\lambda)\,,
\end{aligned}$$
where the third identity follows from grouping polynomials of the same degree, the induction hypothesis, and the identity $2\sqrt{\mom}\,\sigma(\lambda) = 1 + \mom - \step\,\lambda$. The last expression is the recursive definition of $\tilde{P}_{t+1}$, which proves the desired identity.
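The identity can also be checked numerically. The snippet below (our own; the parameter values are arbitrary) evaluates both sides on a grid of $\lambda$ values using SciPy's Chebyshev routines:

```python
import numpy as np
from scipy.special import eval_chebyt, eval_chebyu

step, mom, T = 0.1, 0.6, 12
lam = np.linspace(0.0, 15.0, 200)
sigma = (1 + mom - step * lam) / (2 * np.sqrt(mom))        # link function

# left-hand side: residual polynomial from the recurrence
P_prev, P = np.ones_like(lam), 1 - step / (1 + mom) * lam
for t in range(1, T):
    P, P_prev = (1 + mom - step * lam) * P - mom * P_prev, P

# right-hand side: combination of Chebyshev polynomials of the first and second kind
Q = mom ** (T / 2) * (2 * mom / (1 + mom) * eval_chebyt(T, sigma)
                      + (1 - mom) / (1 + mom) * eval_chebyu(T, sigma))
print(np.max(np.abs(P - Q)))   # agrees up to floating-point error
```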
Tools of the trade: the two faces of Chebyshev polynomials
A key feature of Chebyshev polynomials that we’ll use extensively is that they behave very differently inside and outside the interval $[-1, 1]$. Inside this interval (shaded blue region in the figure) the magnitude of these polynomials stays small, while outside it explodes:
Let’s make this observation more precise.
Inside the interval, Chebyshev polynomials admit the trigonometric definitions $T_t(\cos\theta) = \cos(t\,\theta)$ and $U_t(\cos\theta) = \frac{\sin((t+1)\,\theta)}{\sin\theta}$, and so they have an oscillatory behavior with values bounded in absolute value by $1$ and $t+1$ respectively.
Outside of this interval the Chebyshev polynomials of the first kind admit the explicit form, for $|\xi| > 1$:
$$T_t(\xi) = \tfrac{1}{2}\left[\left(\xi - \sqrt{\xi^2 - 1}\right)^t + \left(\xi + \sqrt{\xi^2 - 1}\right)^t\right].$$
We’re interested in convergence rates, so we’ll look into the $t$-th root asymptotics of these quantities (with little extra effort, it would be possible to derive non-asymptotic convergence rates, although I won't pursue this analysis here). Luckily, these asymptotics are the same for both kinds of polynomials (although we won't use it here, this $t$-th root asymptotic holds for (almost) all orthogonal polynomials, not just Chebyshev polynomials; see for instance Stahl and Totik, referenced below), and taking limits we have that
$$\lim_{t \to \infty} \sqrt[t]{|T_t(\xi)|} \;=\; \lim_{t \to \infty} \sqrt[t]{|U_t(\xi)|} \;=\; |\xi| + \sqrt{\xi^2 - 1}\,.$$
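To get a feel for these asymptotics, here is a quick numerical illustration (our own) using SciPy's Chebyshev evaluations:

```python
import numpy as np
from scipy.special import eval_chebyt, eval_chebyu

xi = 1.05   # a point outside the [-1, 1] interval
for t in [10, 100, 1000]:
    print(t, abs(eval_chebyt(t, xi)) ** (1 / t), abs(eval_chebyu(t, xi)) ** (1 / t))
print("limit:", xi + np.sqrt(xi ** 2 - 1))   # ~1.37; both t-th roots approach this value
```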
The Robust Region
Let’s start by considering the case in which the image of the interval $[\lmin, L]$ under the link function $\sigma$ is contained in the $[-1, 1]$ interval. This is the most favorable case, since there the Chebyshev polynomials are bounded in absolute value by $1$ and $t+1$ respectively. Since the Chebyshev polynomials are evaluated at $\sigma(\lambda)$, this corresponds to requiring $|\sigma(\lambda)| \leq 1$ for all $\lambda \in [\lmin, L]$. We’ll call the set of step-size and momentum parameters for which the previous inequality is verified the robust region.
Let’s visualize this region in a map. Since $\sigma$ is a linear function, its extremal values are reached at the edges:
$$\max_{\lambda \in [\lmin, L]} |\sigma(\lambda)| = \max\{|\sigma(\lmin)|,\, |\sigma(L)|\}\,.$$
Using $\lmin \leq L$ and that $\sigma$ is decreasing in $\lambda$, we can simplify the condition to $\sigma(\lmin) \leq 1$ and $\sigma(L) \geq -1$, which in terms of the step-size and momentum correspond to:
$$\frac{(1 - \sqrt{\mom})^2}{\lmin} \;\leq\; \step \;\leq\; \frac{(1 + \sqrt{\mom})^2}{L}\,.$$
These two conditions provide the lower and upper bound of the robust region.
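In code, membership in the robust region amounts to checking these two inequalities. Below is a small sketch of our own (the helper name `in_robust_region` is made up for illustration):

```python
import numpy as np

def in_robust_region(step, mom, lmin, L):
    """True iff (1 - sqrt(mom))**2 / lmin <= step <= (1 + sqrt(mom))**2 / L."""
    return (1 - np.sqrt(mom)) ** 2 / lmin <= step <= (1 + np.sqrt(mom)) ** 2 / L

# with eigenvalues in [1, 10] and momentum 0.5, the robust step-sizes are roughly [0.086, 0.291]
print(in_robust_region(0.2, mom=0.5, lmin=1.0, L=10.0))    # True
print(in_robust_region(0.05, mom=0.5, lmin=1.0, L=10.0))   # False: step-size too small
```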
Asymptotic rate
Let $\sigma(\lambda) = \cos(\theta_\lambda)$ for some $\theta_\lambda$, which is always possible since $|\sigma(\lambda)| \leq 1$ in this region. In this regime, Chebyshev polynomials verify the identities $T_t(\sigma(\lambda)) = \cos(t\,\theta_\lambda)$ and $U_t(\sigma(\lambda)) = \frac{\sin((t+1)\,\theta_\lambda)}{\sin(\theta_\lambda)}$, which replaced in the definition of the residual polynomial gives
$$\tilde{P}_t(\lambda) = \mom^{t/2}\left[\frac{2\mom}{1+\mom}\cos(t\,\theta_\lambda) + \frac{1-\mom}{1+\mom}\,\frac{\sin((t+1)\,\theta_\lambda)}{\sin(\theta_\lambda)}\right].$$
Since the expression inside the square brackets is bounded in absolute value by $t + 2$, taking the $t$-th root and then limits we have $\lim_{t\to\infty}\sqrt[t]{r_t} = \sqrt{\mom}$ for any step-size and momentum in this region. This gives our first asymptotic rate:
The asymptotic rate in the robust region is $\sqrt{\mom}$.
This is nothing short of magical. It would seem natural (and this will be the case in other regions) that the speed of convergence should depend on both the step-size and the momentum parameter. Yet this result implies that it's not the case in the robust region: there, the convergence only depends on the momentum parameter $\mom$. Amazing. (This insensitivity to the step-size has been leveraged by Zhang et al., 2018 to develop a momentum tuner.)
This also illustrates why we call this the robust region: in its interior, perturbing the step-size so that we stay within the region has no effect on the convergence rate. The next figure displays the asymptotic rate (darker is faster) in the robust region.
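We can also check this insensitivity numerically, by evaluating the residual polynomial on a grid of eigenvalues and taking the $t$-th root of its maximum absolute value. This is a rough sketch of our own (the helper name `empirical_rate` is made up):

```python
import numpy as np

def empirical_rate(step, mom, lmin, L, T=500, n_grid=1001):
    """T-th root of max over [lmin, L] of |P_T(lambda)|, computed via the recurrence."""
    lam = np.linspace(lmin, L, n_grid)
    P_prev, P = np.ones_like(lam), 1 - step / (1 + mom) * lam
    for _ in range(1, T):
        P, P_prev = (1 + mom - step * lam) * P - mom * P_prev, P
    return np.max(np.abs(P)) ** (1 / T)

# three different step-sizes, all inside the robust region for these parameters
for step in [0.1, 0.2, 0.29]:
    print(step, empirical_rate(step, mom=0.5, lmin=1.0, L=10.0))   # all ~sqrt(0.5) ~ 0.71
```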
The Lazy Region
Let’s consider now what happens outside of the robust region. In this case, the convergence will depend on the largest of $|\sigma(\lmin)|$ and $|\sigma(L)|$. We’ll consider first the case in which the maximum is $|\sigma(\lmin)|$ and leave the other one for the next section.
This region is determined by the inequalities $|\sigma(\lmin)| \geq |\sigma(L)|$ and $\sigma(\lmin) > 1$. Using the definition of $\sigma$ and solving for $\step$ gives the equivalent conditions
$$\step \;\leq\; \frac{2\,(1 + \mom)}{\lmin + L} \qquad\text{and}\qquad \step \;<\; \frac{(1 - \sqrt{\mom})^2}{\lmin}\,.$$
Note the second inequality is the same one as for the robust region but with the inequality sign reversed, and so the region will be on the opposite side of that curve. We’ll call this the lazy region, as increasing the momentum will take us out of it and into the robust region.
Asymptotic rate
As we saw earlier, outside of the $[-1, 1]$ interval both kinds of Chebyshev polynomials have simple $t$-th root asymptotics. Using this, and the fact that both kinds agree in sign outside of the interval, we can compute the asymptotic rate as
$$r_{\infty} = \lim_{t \to \infty} \sqrt[t]{\,|\tilde{P}_t(\lmin)|\,} = \sqrt{\mom}\left(|\sigma(\lmin)| + \sqrt{\sigma(\lmin)^2 - 1}\right).$$
This gives the asymptotic rate for this region:
In the lazy region the asymptotic rate is $\sqrt{\mom}\left(|\sigma(\lmin)| + \sqrt{\sigma(\lmin)^2 - 1}\right)$.
Unlike in the robust region, this rate depends on both the step-size and the momentum parameter, which enter the rate through the link function $\sigma$. This can be observed in the color plot of the asymptotic rate.
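The same kind of numerical check as before illustrates this dependence. The sketch below (our own, with arbitrary parameter values placing us in the lazy region) compares the formula with the $t$-th root of the maximum of the residual polynomial:

```python
import numpy as np

lmin, L, mom, step = 1.0, 10.0, 0.3, 0.02    # step-size small enough to be in the lazy region
sigma = lambda lam: (1 + mom - step * lam) / (2 * np.sqrt(mom))   # link function
rate_formula = np.sqrt(mom) * (abs(sigma(lmin)) + np.sqrt(sigma(lmin) ** 2 - 1))

# empirical estimate: T-th root of max |P_T(lambda)| over [lmin, L]
T, lam = 500, np.linspace(lmin, L, 1001)
P_prev, P = np.ones_like(lam), 1 - step / (1 + mom) * lam
for _ in range(1, T):
    P, P_prev = (1 + mom - step * lam) * P - mom * P_prev, P
print(rate_formula, np.max(np.abs(P)) ** (1 / T))    # both ~0.97
```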
Knife’s Edge
The robust and lazy regions occupy most (but not all!) of the set of parameters for which momentum converges. There’s a small region that sits between the lazy and robust regions and the region where momentum diverges. We call this region the knife’s edge.
For parameters not in the robust or lazy region, we have $\sigma(L) < -1$ and $|\sigma(L)| > |\sigma(\lmin)|$. Using the asymptotics of Chebyshev polynomials as we did in the previous section, we have that the asymptotic rate is $\sqrt{\mom}\left(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1}\right)$. The method will only converge when this asymptotic rate is below 1. Enforcing this results in $\step < \frac{2\,(1+\mom)}{L}$. Combining this condition with the one of not being in the robust or lazy region gives the characterization:
$$\max\left\{\frac{(1 + \sqrt{\mom})^2}{L},\; \frac{2\,(1+\mom)}{\lmin + L}\right\} \;<\; \step \;<\; \frac{2\,(1+\mom)}{L}\,.$$
Asymptotic rate
The asymptotic rate can be computed using the same technique as in the lazy region. The resulting rate is the same as in that region, but with $\sigma(L)$ replacing $\sigma(\lmin)$:
In the knife's edge region the asymptotic rate is $\sqrt{\mom}\left(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1}\right)$.
Pictorially, this corresponds to
Putting it All Together
This is the end of our journey. We’ve visited all the regions in which momentum converges. (There's also a small convergent region with negative momentum parameter that we haven't visited; although not typically used for minimization, negative momentum has found applications in smooth games (Gidel et al., 2020).) The only thing left to do is to combine all the asymptotic rates we’ve gathered along the way.
The asymptotic rate of momentum is
$$r_{\infty} = \begin{cases}
\sqrt{\mom} & \text{in the robust region,}\\[0.4em]
\sqrt{\mom}\left(|\sigma(\lmin)| + \sqrt{\sigma(\lmin)^2 - 1}\right) & \text{in the lazy region,}\\[0.4em]
\sqrt{\mom}\left(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1}\right) & \text{in the knife's edge region.}
\end{cases}$$
Plotting the asymptotic rates for all regions, we can see that Polyak momentum (the method with momentum $\mom = \left(\frac{\sqrt{L} - \sqrt{\lmin}}{\sqrt{L} + \sqrt{\lmin}}\right)^2$ and step-size $\step = \left(\frac{2}{\sqrt{L} + \sqrt{\lmin}}\right)^2$, which is asymptotically optimal among the momentum methods with constant coefficients) is at the intersection of the three regions.
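Putting the three cases together, here is a small helper (a sketch of our own, assuming $\mom \in (0, 1)$ and a spectrum contained in $[\lmin, L]$) that computes the asymptotic rate for any pair of parameters. Evaluated at Polyak's parameters it returns $\sqrt{\mom} = \frac{\sqrt{L} - \sqrt{\lmin}}{\sqrt{L} + \sqrt{\lmin}}$:

```python
import numpy as np

def asymptotic_rate(step, mom, lmin, L):
    """Asymptotic rate of momentum on quadratics with eigenvalues in [lmin, L].

    Assumes 0 < mom < 1; returns a value >= 1 when the method does not converge.
    """
    sigma = lambda lam: (1 + mom - step * lam) / (2 * np.sqrt(mom))
    s = max(abs(sigma(lmin)), abs(sigma(L)))   # sigma is linear, so extremes are at the edges
    if s <= 1:                                 # robust region
        return np.sqrt(mom)
    return np.sqrt(mom) * (s + np.sqrt(s ** 2 - 1))   # lazy region or knife's edge

lmin, L = 1.0, 10.0
mom = ((np.sqrt(L) - np.sqrt(lmin)) / (np.sqrt(L) + np.sqrt(lmin))) ** 2
step = (2.0 / (np.sqrt(L) + np.sqrt(lmin))) ** 2
print(asymptotic_rate(step, mom, lmin, L))     # ~0.52 = (sqrt(10) - 1) / (sqrt(10) + 1)
```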
Reproducibility
All plots in this post were generated using the following Jupyter notebook: [HTML][IPYNB]
References
Some methods of speeding up the convergence of iteration methods[link] Polyak, B.T., 1964. USSR Computational Mathematics and Mathematical Physics.
Convergence rates of iterative treatments of partial differential equations[link] Frankel, S., 1950. Mathematical Tables and Other Aids to Computation. JSTOR.
On the importance of initialization and momentum in deep learning[PDF] Sutskever, I., Martens, J., Dahl, G. and Hinton, G., 2013. International conference on machine learning.
The Role of Momentum Parameters in the Optimal Convergence of Adaptive Polyak's Heavy-ball Methods[link] Tao, W., Long, S., Wu, G. and Tao, Q., 2021. International Conference on Learning Representations.
Accelerating SGD with momentum for over-parameterized learning[link] Liu, C. and Belkin, M., 2020. International Conference on Learning Representations.
Theory of Gradient Methods[link] Rutishauser, H., 1959. Refined Iterative Methods for Computation of the Solution and the Eigenvalues of Self-Adjoint Boundary Value Problems.
A method of solving a convex programming problem with convergence rate O(1/k^2)[link] Nesterov, Y.E., 1983. Doklady Akademii Nauk, Vol 269(3), pp. 543--547.
A Lyapunov Analysis of Accelerated Methods in Optimization[HTML] Wilson, A.C., Recht, B. and Jordan, M.I., 2021. Journal of Machine Learning Research, Vol 22(113), pp. 1--34.
Nth Root Asymptotic Behavior of Orthonormal Polynomials[link] Stahl, H. and Totik, V., 1990. Orthogonal polynomials, pp. 395--417. Springer.
Yellowfin and the art of momentum tuning[PDF] Zhang, J. and Mitliagkas, I., 2018. SysML.