6.854 Lecture Notes

Lecture 16: Gradient Descent

Unconstrained Minimization

Our focus today: Unconstrained minimization problem: given a real-valued function $f$ over $\RR^n$, find its minimum $x^*$ (assuming it exists). That is, solve the problem \[ x^*=\arg\min_{x\in \RR^n} f(x). \]

Note: This problem is very general:
- To get maximization, just minimize $-f(x)$.
- To introduce constraints, just consider minimizing $f(x)+\psi(x)$, where $\psi(x)=0$, if $x$ satisfies all constraints, and $+\infty$, otherwise. (So, in principle, this is stronger than LP!)
To make our discussion simpler, we will assume though that our function $f$ is “nice”. That is, $f$ is:
- continuous;
- (infinitely many times) differentiable. (This requirement can, and often needs to, be relaxed.)

Gradient Descent

How to solve an unconstrained minimization problem?

Powerful approach: Gradient descent method.
Key idea: Apply (continuous) local greedy approach.
Start with some point $x^0$.
In each iteration: move a bit (locally) in the direction that reduces the value of $f$ the most (greedily).
$\Rightarrow$ Guarantees that $f(x^{t+1})< f(x^t)$.

Question: What is the direction of the steepest decrease of $f$?

Recall (multi-variate) Taylor theorem: for any $x\in \RR^n$ and (vector) displacement $\delta\in \RR^n$, we have that \[ f(x+\delta)= f(x) + \nabla f(x)^T \delta + \frac{1}{2} \delta^T\nabla^2 f(y) \delta, \] for some $y=x+\lambda\delta$ with $0\leq \lambda\leq 1$, where
- $\nabla f(x)\in \RR^n$ is the gradient of $f$ at point $x$ and \[ \nabla f(x)_i:=\frac{\partial f(x)}{\partial x_i}, \] for each $i$.
- $\nabla^2 f(x)\in \RR^{n\times n}$ is the Hessian of $f$ at point $x$ and \[ \nabla^2 f(x)_{ij} := \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \] for each $i$ and $j$.
Observe: the gradient term in the Taylor expansion is linear in $\|\delta\|$ while the Hessian term is quadratic in $\|\delta\|$.
Consequently, for small enough step, i.e., $\|\delta\|$, the Hessian term is negligible. That is, \[ f(x+\delta)= f(x) + \nabla f(x)^T \delta + O(\|\delta\|^2)\approx f(x)+\nabla f(x)^T \delta \]
Key conclusion: Even though $f$ might be very complex, locally it is "simple", i.e., it is well approximated by, essentially, the simplest function possible: the linear function!
$\Rightarrow$ We know how to minimize linear functions. Just take $\delta=-\eta \nabla f(x)$, for some step size $\eta>0$.

Resulting algorithm: Gradient descent method:

Start with some $x^0\in \RR^n$.
In each step $t$: $x^{t+1}\leftarrow x^t - \eta \nabla f(x^t)$.

Question: What should $\eta$ be?

Assume that $f$ is $\beta$-smooth, for some $\beta>0$. That is, \[ \|\nabla f(y)-\nabla(x)\| \leq \beta \|y-x\|, \] for any $x, y\in \RR^n$. Intuitively, $\beta$ measures how much the gradient of $f$ can change between two nearby points.
Equivalently (for twice differentiable functions): $f$ is $\beta$-smooth iff $y^T \nabla^2 f(x) y \leq \beta \|y\|^2$, for any $x,y$; or, put yet another way, the maximum eigenvalue of $\nabla^2 f(x)$ is at most $\beta$.
$\Rightarrow$ We have that \[ f(x+\delta)\leq f(x) + \nabla f(x)^T \delta + \frac{\beta}{2} \|\delta\|^2, \] for any $x$ and $\delta$
$\Rightarrow$ Intuitively: For every point $x$, there is a corresponding quadratic (i.e., relatively“simple”) function that upper bounds $f$ everywhere and agrees with $f$ at the point $x$.
$\Rightarrow$ Our progress on minimizing this quadratic function at $x$ lowerbounds our progress on reducing the value of $f$ at $x$.
$\Rightarrow$ If we plug in our choice of $\delta=-\eta \nabla f(x)$, we get that $$ \begin{eqnarray*} f(x+\delta) &\leq & f(x) + \nabla f(x)^T \delta + \frac{\beta}{2} \|\delta\|^2\\ &\leq & f(x) - \eta \|\nabla f(x)\|^2+ \frac{\beta}{2} \eta^2 \|\nabla f(x)\|^2\\ & \leq & f(x)- \frac{1}{2\beta} \|\nabla f(x)\|^2, \end{eqnarray*} $$ for the optimal setting of $\eta = \frac{1}{\beta}$.
$\Rightarrow$ Setting $\eta= \frac{1}{\beta}$ ensures that \[ f(x^{t+1})\leq f(x^t) -\frac{1}{2\beta} \|\nabla f(x)\|^2, \] i.e., we make progress of at least $\frac{1}{2\beta} \|\nabla f(x)\|^2$ towards minimizing the value of $f$.
In practice, we choose best $\eta$ adaptively in each step via binary search -- this is often called line search.

Remaining issue: What if $\|\nabla f(x^k)\|=0$ (or is just very small)?

$x^k$ has to be a critical point -- means $x^k$ is either a local minimum or maximum (with bad initialization) or a saddle point.
If $\nabla^2 f(x^k)\succeq 0$, we know it is a local minimum.
We can deal with the other two possibilities by perturbing our point slightly and resuming the algorithm.
In general: Typically, gradient descent converges to local minimum.
What if we want this local minimum to be a global one?
We need additional (strong) assumption.
$f$ is convex iff, for any $x$ and $y$, \[ f(\lambda x + (1-\lambda) y)\leq \lambda f(x) + (1-\lambda) f(y), \] for any $0\leq \lambda \leq 1$. That is, the epigraph of the function is a convex set.
Alternatively: $f$ is convex iff $\nabla^2 f(x^k)\succeq 0$, for all $x$.
$\Rightarrow$ The only critical points are local minimums!
In fact, a much stronger property holds: all critical points are global minimums.
To see that, note that by Taylor theorem convexity implies that \[ f(x+\delta) = f(x) + \nabla f(x)^T \delta + \frac{1}{2} \delta^T \nabla^2 f(x) \delta \geq f(x) + \nabla f(x)^T \delta. \] That is, every gradient defines a lowerbounding hyperplane for $f$ that agrees with $f$ at $x$.
$\Rightarrow$ If $\nabla f(x)=0$ then $f(x+\delta)\geq f(x)$ for all $\delta$.
It turns out that convexity is a very widespread phenomena in optimization. But there are very important domains, e.g., deep learning, where the underlying optimization problems are inherently non-convex.

Convergence Analysis

How fast does gradient descent converge?

Convexity allows us to bound our (sub-)optimality. Specifically, if $x^*$ is the minimum of $f$, we have that, for any $x$, \[ f(x^*)\geq f(x) + \nabla f(x)^T(x^*-x). \]
$\Rightarrow$ $f(x)-f(x^*)\leq -\nabla f(x)^T(x^*-x)\leq \|\nabla f(x)\|\|x^*-x\|$, where the last inequality follows by Cauchy-Schwartz inequality.

$\Rightarrow$ If $\|\nabla f(x)\|\leq \frac{\eps}{\|x^*-x\|}$, we are by at most $\epsilon$ off from optimum.
The fact that the above near-optimality condition involves $\|x^*-x\|$ is unfortunate (but inherent!). After all, we don't know what this distance is.
To connect this distance to the optimum to the norm of the gradient/difference in function value, and thus to get rid of this dependence, we need to make an (even stronger) assumption on $f$.
Assume that $f$ is $\alpha$-strong convexity. That is, assume that, for any $x$ and $y$, \[ y^T \nabla^2 f(x) y \geq \alpha \|y\|^2. \]
$\Rightarrow$ The smallest eigenvalue of $\nabla^2f(x)$ is always at least $\alpha$.
$\Rightarrow$ "Normal" convexity would correspond to $\alpha=0$ (but we require $\alpha>0$ here).
$\Rightarrow$ We can now strengthen our lowerbounding inequality we got from convexity. Specifically, for any $x$ and $\delta$ we have that \[ f(x+\delta) \geq f(x) + \nabla f(x)^T\delta + \frac{1}{2} \delta^T \nabla^2 f(x)\delta\geq f(x)+ \nabla f(x)^T\delta + \frac{\alpha}{2} \|\delta\|^2. \] That is, for each point $x$, there is a quadratic function that lowerbounds $f$ everywhere and agrees with $f$ at $x$.
Strong convexity indeed allows us to tie distance to optimum in the function value with the metric distance to optimum. We have that, for any $x$, \[ f(x)\geq f(x^*)+\nabla f(x^*)^T (x-x^*) + \frac{\alpha}{2}\|x-x^*\|^2. \]
$\Rightarrow$ $f(x)-f(x^*)\geq \frac{\alpha}{2}\|x-x^*\|^2$, since $\nabla f(x^*)=0$.
To get the convergence bound, let us just put together everything we derived so far:
- Recall that each gradient descent step decreases $f(x^t)-f(x^*)$ be at least $\frac{1}{2\beta}\|\nabla f(x^t)\|^2$. That is, \[ f(x^{t+1})-f(x^*)\leq f(x^k)-f(x^*)-\frac{1}{2\beta}\|\nabla f(x^t)\|^2. \]
- On the other hand, by ($\alpha$-strong) convexity of $f$, we know that \[ \|\nabla f(x^t)\|^2\geq \frac{(f(x^t)-f(x^*))^2}{\|x^*-x^t\|^2}\geq \frac{\alpha}{2} \left(f(x^t)-f(x^*)\right) \]
- Combining these two inequalities we get that $$ \begin{eqnarray*} f(x^{t+1})-f(x^*) &\leq &\left(f(x^k)-f(x^*)\right)\left(1-\frac{1}{2\beta}\frac{\|\nabla f(x^t)\|^2}{f(x^t)-f(x^*)}\right)\\ &\leq & \left(f(x^k)-f(x^*)\right)\left(1-\frac{\alpha}{4\beta}\right)\leq \left(f(x^k)-f(x^*)\right)\left(1-\frac{1}{4\kappa}\right), \end{eqnarray*} $$ where $\kappa:=\frac{\beta}{\alpha}$ is the condition number of $f$. (Intuitively, condition number tells us how “nicely” it behaves. The smaller condition number the faster convergence.)
  $\Rightarrow$ After $O(\kappa \log \frac{(f(x^0)-f(x^*)}{\epsilon})$ steps we obtain a solution that is within $\epsilon$ of the optimal value! Note: The dependence on $\epsilon$ is only logarithmic, which essentially allows us to solve the problem exactly.
What to do if $f$ is not $\alpha$-strongly convex for any $\alpha>0$? (This is often the case in applications.)
A different analysis gives a (much weaker) convergence bound of $O(\frac{\beta \|x^*-x^0\|^2}{\epsilon^2})$. (Here, the dependence on $\epsilon$ is polynomial, so in this regime we can only get approximate answers.)
Alternatively, we could make $f$ $\alpha$-convex by adding $\alpha\|x-x^0\|^2$ to it. (This is an example of a more general technique called regularization.)
$\Rightarrow$ Adding this new term corresponding to adding $\alpha \cdot I$ to the Hessian $\nabla^2 f(x)$ of $f$. So, $f$ is indeed $\alpha$-strongly convex now.
$\Rightarrow$ Problem: The minimizer of $f$ changed! Will need to deal with that by reducing the regularization as we get closer to $x^*$. (Details omitted.)
$\Rightarrow$ This will again deliver a bound that depends on $\epsilon^{-1}$ and $\|x^*-x^0\|$ polynomially instead of logarithmically.

Newton's Method

The basic idea underlying gradient descent method was to consider the linear approximation of $f$ around current point $x$ and take a step that minimizes this simple function.
But, how about generalizing this idea and work with a local approximation that is more complex but also provides tighter grasp of $f$? That is, how about approximating $f$ by a quadratic function?
Specifically, by Taylor theorem we know that \[ f(x+\delta)= f(x) + \nabla f(x)^T \delta + \frac{1}{2} \delta^T\nabla^2 f(x) \delta + O(\|\delta\|^3). \]
What is the step $\delta^*$ that minimizes this quadratic approximation?
$\Rightarrow$ This approximation is strongly convex, a point is its minimizer iff its gradient (wrt to $\delta$) is zero.
$\Rightarrow$ We need that \[ \nabla f(x) + \nabla^2 f(x) \delta^* = 0. \] (We use here the fact that $\nabla f(x)$ and $\nabla^2f(x)$ do not depend on $\delta$.)
$\Rightarrow$ The resulting Newton's step is \[ \delta^*=-\left(\nabla^2f(x)\right)^{-1} \nabla f(x). \]
Note: To apply this step we need to either invert the Hessian, or at least solve the linear system $\nabla f(x) z = \nabla f(x)$. (As long as $f$ is strongly convex, $\nabla^2 f(x)\succ 0$, so it is invertible.)
$\Rightarrow$ This can be quite expensive computationally: Gaussian elimination takes $O(n^3)$ time and best matrix multiplication algorithm takes $O(n^\omega)=O(n^{2.37})$ time. So, gradient descent is usually more useful. In some special cases, however, i.e., when the Hessian has some special structure, Netwon's step might have a much faster implementation.
What is the convergence of this algorithm?
- If our starting point $x^0$ is relatively close to optimum, in a sense of $\nabla^2 f(x^0)\approx \nabla^2 f(x^*)$, i.e., the Hessian at the two points being similar, the convergence is extremely fast. In fact, it depends doubly logarithmically on $\epsilon^{-1}$. (In practice, we never need more than $6$ iterations to get a solution that is essentially optimal.)
- But: As our step is not "dampened" by any step size $\eta>0$, if we don't start sufficiently close to $x^*$ we might not converge at all. Instead, we end up jumping around uncontrollably, as we end up using our local quadratic approximation of $f$ way outside of its region of applicability.
- One can remedy the latter aspect of convergence by introducing a step size $\eta$ that keeps our steps be within the region where our quadratic approximation is still valid.
- Still, the convergence then depends on the "smoothness" of the Hessian of our function. (Instead of smoothness of the gradients, as was the case for gradient descent.)

Newton's Method as a Variant of Gradient Descent

It turns out that there is an alternative view/derivation of Newton's method/step. One can view it as an instance of gradient descent but with respect to other than Euclidean/$\ell_2$-norm.
Define, for a positive-definite matrix $A\succ 0$, an $A$-norm $\|\cdot\|_A$ as \[ \|x\|_A^2 := x^TAx, \] for each vector $x$. Note: If $A=I$ this is just the$\ell_2$-norm.
Now, recall that the basic idea behind gradient descent method is to consider the first order local approximation of $f$ given by the Taylor approximation \[ f(x+\delta)=f(x)+\nabla f(x)^T \delta + O(\|\delta\|^2). \]
But: What if we measure the approximation error in an $A$-norm, for some $A$ other than $I$, instead of $\ell_2$-norm? (Note that this will affect the terms hidden in the $O(\cdot)$ notation.) That is, we have that \[ f(x+\delta)=f(x)+\nabla f(x)^T \delta + O(\|\delta\|_A^2). \]
Consider now the best direction of the (gradient) improvement step $\delta$. That is, we want a direction that minimizes (the linear approximation of) $f$ as much as possible while having a fixed $A$-norm. Or, to put it yet differently, we want the direction that gives us the "best bang for the buck" when "buck" is measured via the $A$-norm of our step.
In the case of "classic" gradient descent, this direction was simply $-\nabla f(x)$. However, here, it turns out to be $-A^{-1}\nabla f(x)$. (As a sanity check, if $A=I$, these two directions coincide.)
$\Rightarrow$ Our gradient step thus becomes $-\nabla A^{-1}\eta f(x)$, for appropriate step size $\eta>0$.
Note: Now our gradient step requires inverting/solving a linear system wrt $A$.
What did we gain then?
Our "effective" Hessian changed. That is, in a sense, our Hessian now is equal to $A^{-\frac{1}{2}}\nabla^2 f(x) A^{-\frac{1}{2}}$.
$\Rightarrow$ But this affects our "effective" $\beta$-smoothness and $\alpha$-strong convexity of $f$, and thus the condition number of $f$, too! After all, these correspond to the smallest and largest eigenvalues of the Hessian.
So, what is the best choice of $A$ to make the corresponding condition number the smallest?
$\Rightarrow$ Take $A=\nabla^2 f(x)$!
$\Rightarrow$ The "effective" Hessian at $x$ will become just an identity and thus the condition number at this point will be only $1$.
But the gradient step corresponding to such "best" choice of $A$ is exactly the Newton's step we derived above.
The $A$-norm $\|\cdot\|_{\nabla^2 f(x)}$ corresponding to setting $A=\nabla^2 f(x)$ is called a local norm at $x$ and sometimes denoted just by $\|\cdot\|_x$.
Subtle but important point: It is true that whenever we use the local norm to take the gradient step, the function $f$ behaves as if it had condition number of $1$. However, this behavior is only local in nature.
In particular, it does not mean that we can use our earlier analysis of gradient descent to conclude that we are guaranteed to converge, say, in $O(\log \frac{\|x^*-x^0\|_{x^0}}{\epsilon})$ iterations. The problem is that our new algorithm uses different norm at each step, while parts of our earlier analysis - specifically, the lower bounding part relating to $\alpha$-strong convexity - relied on using a single fixed norm throughout the whole algorithm.
General analysis of this new algorithm is therefore much more nuanced and it goes beyond the scope of the class. Still, special cases of this analysis, i.e., when we consider special classes of the function $f$, become much easier and will, in fact, play important role in the analysis of interior-point methods that we will discuss next time.