
# Lecture 19: Newton's method

## Newton's Method

• The basic idea underlying the gradient descent method was to consider the linear approximation of $f$ around the current point $x$ and take a step that minimizes this simple function.
• But how about generalizing this idea and working with a local approximation that is more complex but also provides a tighter grasp of $f$? That is, how about approximating $f$ by a quadratic function?
• Specifically, by Taylor's theorem we know that $f(x+\delta)= f(x) + \nabla f(x)^T \delta + \frac{1}{2} \delta^T\nabla^2 f(x) \delta + O(\|\delta\|^3).$
• What is the step $\delta^*$ that minimizes this quadratic approximation?

$\Rightarrow$ Provided $\nabla^2 f(x)\succ 0$, this approximation is strongly convex, so a point is its minimizer iff its gradient (with respect to $\delta$) is zero.

$\Rightarrow$ We need that $\nabla f(x) + \nabla^2 f(x) \delta^* = 0.$ (We use here the fact that $\nabla f(x)$ and $\nabla^2f(x)$ do not depend on $\delta$.)

$\Rightarrow$ The resulting Newton's step is $\delta^*=-\left(\nabla^2f(x)\right)^{-1} \nabla f(x).$

• Note: To apply this step we need to either invert the Hessian, or at least solve the linear system $\nabla^2 f(x) z = \nabla f(x)$. (As long as $f$ is strongly convex, $\nabla^2 f(x)\succ 0$, so it is invertible.)

$\Rightarrow$ This can be quite expensive computationally: Gaussian elimination takes $O(n^3)$ time and the best known matrix multiplication algorithm takes $O(n^\omega)=O(n^{2.37})$ time. So, gradient descent is usually more useful. In some special cases, however, e.g., when the Hessian has some special structure, Newton's step might have a much faster implementation.
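A minimal sketch of this computation (the quadratic test problem and function names here are illustrative, not from the lecture). Note that it solves the linear system $\nabla^2 f(x)\, z = \nabla f(x)$ rather than forming the explicit inverse:

```python
import numpy as np

def newton_step(grad, hess):
    """Newton's step delta* = -(hess)^{-1} grad, computed by solving
    the linear system, which is cheaper and numerically more stable
    than forming the explicit inverse."""
    return -np.linalg.solve(hess, grad)

# Toy check on a quadratic f(x) = 1/2 x^T Q x - b^T x with Q > 0:
# here grad = Q x - b and hess = Q, so a single Newton step from
# any starting point lands exactly on the minimizer x* = Q^{-1} b.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)
x = x + newton_step(Q @ x - b, Q)
```

On a quadratic the "approximation" is exact, which is why one step suffices; for general $f$ the step is only as good as the local quadratic model.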

• What is the convergence of this algorithm?
• If our starting point $x^0$ is relatively close to the optimum, in the sense that $\nabla^2 f(x^0)\approx \nabla^2 f(x^*)$, i.e., the Hessians at the two points are similar, the convergence is extremely fast. In fact, the number of iterations depends doubly logarithmically on $\epsilon^{-1}$. (In practice, we rarely need more than $6$ iterations to get a solution that is essentially optimal.)
• But: as our step is not "dampened" by any step size $\eta>0$, if we don't start sufficiently close to $x^*$ we might not converge at all. Instead, we may end up jumping around uncontrollably, since we use our local quadratic approximation of $f$ far outside its region of applicability.
• One can remedy the latter issue by introducing a step size $\eta$ that keeps our steps within the region where the quadratic approximation is still valid.
• Still, the convergence then depends on the "smoothness" of the Hessian of our function. (Instead of smoothness of the gradients, as was the case for gradient descent.)
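A sketch of such a "damped" Newton iteration, under an illustrative choice of objective and step size (the function `damped_newton` and the toy objective are assumptions for this example, not part of the lecture):

```python
import numpy as np

def damped_newton(grad, hess, x0, eta=1.0, tol=1e-10, max_iter=100):
    """Newton's method with each step damped by a step size eta in (0, 1]."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # stop once the gradient is tiny
            break
        x = x - eta * np.linalg.solve(hess(x), g)
    return x

# Toy strongly convex objective f(x) = e^{x_1} + e^{-x_1} + x_2^2,
# whose Hessian is positive definite everywhere; the minimizer is 0.
f_grad = lambda x: np.array([np.exp(x[0]) - np.exp(-x[0]), 2.0 * x[1]])
f_hess = lambda x: np.diag([np.exp(x[0]) + np.exp(-x[0]), 2.0])
x_min = damped_newton(f_grad, f_hess, x0=np.array([2.0, 1.0]), eta=0.5)
```

Setting `eta=1.0` recovers the undamped Newton step; smaller `eta` trades speed for staying inside the region where the quadratic model is trustworthy.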

### Newton's Method as a Variant of Gradient Descent

• It turns out that there is an alternative view/derivation of Newton's method/step. One can view it as an instance of gradient descent, but with respect to a norm other than the Euclidean/$\ell_2$-norm.
• Define, for a positive-definite matrix $A\succ 0$, an $A$-norm $\|\cdot\|_A$ as $\|x\|_A^2 := x^TAx,$ for each vector $x$. Note: If $A=I$ this is just the $\ell_2$-norm.
• Now, recall that the basic idea behind gradient descent method is to consider the first order local approximation of $f$ given by the Taylor approximation $f(x+\delta)=f(x)+\nabla f(x)^T \delta + O(\|\delta\|^2).$
• But: What if we measure the approximation error in an $A$-norm, for some $A$ other than $I$, instead of $\ell_2$-norm? (Note that this will affect the terms hidden in the $O(\cdot)$ notation.) That is, we have that $f(x+\delta)=f(x)+\nabla f(x)^T \delta + O(\|\delta\|_A^2).$
• Consider now the best direction of the (gradient) improvement step $\delta$. That is, we want a direction that minimizes (the linear approximation of) $f$ as much as possible while having a fixed $A$-norm. Or, to put it yet differently, we want the direction that gives us the "best bang for the buck" when "buck" is measured via the $A$-norm of our step.
• In the case of "classic" gradient descent, this direction was simply $-\nabla f(x)$. However, here, it turns out to be $-A^{-1}\nabla f(x)$. (As a sanity check, if $A=I$, these two directions coincide.)
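To see where this direction comes from (a short derivation, filling in a step the notes leave implicit): we minimize the linear term over all steps of a fixed $A$-norm. Since $A\succ 0$, it has a positive-definite square root $A^{1/2}$, and substituting $u = A^{1/2}\delta$ gives $\min_{\|\delta\|_A = c} \nabla f(x)^T\delta = \min_{\|u\|_2 = c} \left(A^{-1/2}\nabla f(x)\right)^T u.$ The $\ell_2$-minimizer of the right-hand side is $u = -c\,\frac{A^{-1/2}\nabla f(x)}{\|A^{-1/2}\nabla f(x)\|_2}$, and translating back, $\delta = A^{-1/2}u$ is proportional to $-A^{-1}\nabla f(x)$, as claimed.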

$\Rightarrow$ Our gradient step thus becomes $-\eta A^{-1}\nabla f(x)$, for an appropriate step size $\eta>0$.

• Note: Now our gradient step requires inverting/solving a linear system wrt $A$.
• What did we gain then?
• Our "effective" Hessian changed. That is, in a sense, our Hessian now is equal to $A^{-\frac{1}{2}}\nabla^2 f(x) A^{-\frac{1}{2}}$.

$\Rightarrow$ But this affects our "effective" $\beta$-smoothness and $\alpha$-strong convexity of $f$, and thus the condition number of $f$, too! After all, these correspond to the largest and smallest eigenvalues of the Hessian, respectively.

• So, what is the best choice of $A$ to make the corresponding condition number the smallest?

$\Rightarrow$ Take $A=\nabla^2 f(x)$!

$\Rightarrow$ The "effective" Hessian at $x$ will become just an identity and thus the condition number at this point will be only $1$.
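A quick numerical check of this claim on a fixed quadratic (the matrix `H` here is an arbitrary illustrative Hessian, not from the lecture):

```python
import numpy as np

# An ill-conditioned positive-definite "Hessian": condition number 100.
H = np.array([[100.0, 0.0], [0.0, 1.0]])

# Precondition with A = H: the effective Hessian is A^{-1/2} H A^{-1/2},
# where A^{-1/2} is built from an eigendecomposition of H.
w, V = np.linalg.eigh(H)
A_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
H_eff = A_inv_sqrt @ H @ A_inv_sqrt
# H_eff is the identity, so the "effective" condition number drops
# from cond(H) = 100 to 1.
```

For a non-quadratic $f$ the Hessian varies with $x$, so this identity holds only at the point where $A=\nabla^2 f(x)$ is taken, which is exactly the "local" caveat discussed below.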

• But the gradient step corresponding to this "best" choice of $A$ is exactly the Newton step we derived above.
• The $A$-norm $\|\cdot\|_{\nabla^2 f(x)}$ corresponding to setting $A=\nabla^2 f(x)$ is called a local norm at $x$ and sometimes denoted just by $\|\cdot\|_x$.
• Subtle but important point: It is true that whenever we use the local norm to take the gradient step, the function $f$ behaves as if it had condition number of $1$. However, this behavior is only local in nature.

In particular, it does not mean that we can use our earlier analysis of gradient descent to conclude that we are guaranteed to converge in, say, $O(\log \frac{\|x^*-x^0\|_{x^0}}{\epsilon})$ iterations. The problem is that our new algorithm uses a different norm at each step, while parts of our earlier analysis (specifically, the lower-bounding part relating to $\alpha$-strong convexity) relied on using a single fixed norm throughout the whole algorithm.

A general analysis of this new algorithm is therefore much more nuanced and goes beyond the scope of this class. Still, special cases of this analysis, i.e., for special classes of functions $f$, become much easier and will, in fact, play an important role in the analysis of interior-point methods that we will discuss next time.