$\newcommand{\V}[1]{\mathbf #1}$
$\newcommand{\M}[1] {#1}$
$\newcommand{\Reals} { \mathbb{R} }$

Finding the optimal (minimum or maximum) value of a real-valued function is a ubiquitous operation in mathematics, engineering, physics, and computer science. This section will discuss mathematical criteria for establishing the existence of optima and some numerical approaches for finding them. Minimization is also closely connected to the problem of root-finding, which is an important problem in its own right.

Here we consider the problem of finding the minimum value that a
real-valued function $f :\mathbb{R}^n \rightarrow \mathbb{R}$ takes on
over its domain. (The problem of maximizing $f$ can be solved by finding
the minimum of $-f$, so we will typically only concern ourselves with
minimization.) This is known as *unconstrained optimization*. The
problem where the domain is limited to a closed set, such as an interval
$[a,b]$, is known as *constrained optimization*, and will be covered in
the next section.

In some cases, it may be possible to examine the function by hand and use mathematical tools to find a closed-form expression of the minimum. When this is possible, such a solution can be very valuable because it can be evaluated quickly on a computer or can be used in further mathematical transformations.

The minimum of $f$ over a domain $S$ is a value $m$ such that $m=f(c)$ for some value $c \in S$ and that $m \leq f(x)$ for all $x \in S$. If such a value exists, we write: $$m = \min_{x \in S} f(x)$$

We may also be interested in obtaining an $x$ value $c$ that attains
this minimum. Such a point is known as an *argument of the minimum* (or arg min):
$$c = \arg\min_{x \in S} f(x).$$ Since a function may attain its minimum
at multiple points, the $\arg\min$ is technically a set of points. It
may also be an empty set if the minimum is not defined, e.g.
$\arg \min_{x \in \mathbb{R}} x$.

A value $x$ for which $f$ obtains its minimum (or maximum) over some
open interval containing $x$ is known as a *local minimum* (or *local
maximum*) of $f$. If $f$ is differentiable, then a necessary condition
for local optimality is that the derivative of $f$ vanishes at $x$
(Figure 1). In other words, if $f$ attains a minimum (or maximum) at $x$
over the interval $(a,b)$, then $f^\prime(x) = 0$.

To prove this fact, we will prove the contrapositive: if $x \in (a,b)$ is a point where $f^\prime(x) = L \neq 0$, then $x$ is not a local minimum. Using the definition of the derivative, $$L = f^\prime(x) = \lim_{h\rightarrow 0} (f(x+h)-f(x))/h.$$ This means that for any $\epsilon > 0$, we can pick a $\delta > 0$ such that $|(f(x+h)-f(x))/h-L| < \epsilon$ whenever $0 < |h|\leq \delta$. Choose $\epsilon = |L|/2$. Since $(a,b)$ is an open set, we can also choose $\delta$ small enough so that $(x-\delta,x+\delta) \subseteq (a,b)$.

If $L$ is positive, pick $h=-\delta$. Then we have $$(f(x-\delta)-f(x))/(-\delta) > L-\epsilon = L/2,$$ which can be rearranged (noting that the division is by a negative number) to obtain $$f(x-\delta) < f(x)-\delta L/2 < f(x).$$ Hence, $f(x-\delta)$ is lower than $f(x)$, and hence $f(x)$ is not a minimum of $f$ over $(x-\delta,x+\delta)$.

On the other hand, if $L$ is negative, pick $h=\delta$. Then we have $$(f(x+\delta)-f(x))/\delta < L+\epsilon = L/2,$$ which can be rearranged to obtain $$f(x+\delta) < f(x)+\delta L/2.$$ Since $L/2$ is negative, this shows that $f(x+\delta)$ is lower than $f(x)$, and hence $f(x)$ is not a minimum either. QED

Hence it is often useful to find all points where $f^\prime(x) = 0$.
These are known as *critical points*. All local minima and maxima are
critical points. However, not all critical points are local minima and
maxima, as the example $f(x) = x^3$ shows. The derivative vanishes at
$x=0$, but the function does not obtain a local extremum. Critical
points that are not local extrema are known as *inflection points*.

If $f$ is twice differentiable at $x$, then a *sufficient condition* for
a critical point $x$ to be a local minimum is that the second derivative
is positive. In other words, $f^{\prime\prime}(x) > 0$. To show this,
consider the definition of the second derivative:
$$f^{\prime\prime}(x) = \lim_{h\rightarrow 0}(f^\prime(x+h)-f^\prime(x))/h$$
Since $x$ is a critical point, $f^\prime(x)=0$, so
$$f^{\prime\prime}(x) = \lim_{h\rightarrow 0}f^\prime(x+h)/h$$

If $f^{\prime\prime}(x) > 0$, then (taking $\epsilon = f^{\prime\prime}(x)/2$ in the definition of the limit) we can find a $\delta$ small enough such that for all $0 < |h|\leq \delta$, $f^\prime(x+h)/h > 0$. Hence, if $h > 0$, then $f^\prime(x+h) > 0$, which means that $f^\prime$ is positive to the right of $x$. If $h < 0$, then $f^\prime(x+h) < 0$, which means that $f^\prime$ is negative to the left of $x$. This means that $f$ is monotonically increasing on the interval $(x,x+\delta)$ and monotonically decreasing on $(x-\delta,x)$. Therefore, $x$ minimizes $f$ over the interval $(x-\delta,x+\delta)$, i.e., $x$ is a local minimum.
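As a concrete worked example of these criteria (a hypothetical function, not one discussed above), consider $f(x) = x^3 - 3x$. Its critical points satisfy
$$f^\prime(x) = 3x^2 - 3 = 0 \quad\Longleftrightarrow\quad x = \pm 1,$$
and the second derivative test gives
$$f^{\prime\prime}(x) = 6x, \qquad f^{\prime\prime}(1) = 6 > 0, \qquad f^{\prime\prime}(-1) = -6 < 0.$$
Hence $x=1$ is a local minimum with $f(1) = -2$, while $x=-1$ is a local maximum (by the symmetric argument with a negative second derivative) with $f(-1) = 2$.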

In many cases, however, $f$ can be very complex, so it may be difficult or impossible to find a closed form expression. Or, $f$ may be given as a black-box computer subroutine, and so a mathematical expression may not be available. In these cases we must resort to numerical (iterative) techniques in order to find a minimum.

The first multivariate optimization technique we will examine is one of the simplest: gradient descent (also known as steepest descent). Gradient descent is an iterative method that is given an initial point, and follows the negative of the gradient in order to move the point toward a critical point, which is hopefully the desired local minimum. Again we are concerned with only local optimization, because global optimization is computationally challenging.

Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle "black box" functions, and each iteration is cheap. Its major disadvantage is that it can take a long time to converge. We will also describe a discrete descent method often used to solve large combinatorial problems.

Given a differentiable scalar field $f(\V{x})$ and an initial guess $\V{x_1}$, gradient descent iteratively moves the guess toward lower values of $f$ by taking steps in the direction of the negative gradient $-\nabla f(\V{x})$. Locally, the negated gradient is the steepest descent direction, i.e., the direction that $\V{x}$ would need to move in order to decrease $f$ the fastest. The algorithm typically converges to a local minimum, but in rare cases it may terminate at a saddle point, or fail to move at all if $\V{x_1}$ lies at a local maximum.

The first order Taylor approximation of $f(\V{x})$ about $f(\V{x_1})$ is $$f(\V{x}) = f(\V{x_1}) + \nabla f(\V{x_1}) \cdot (\V{x}-\V{x_1}) + O(||\V{x}-\V{x_1}||^2).$$ Consider moving from $\V{x_1}$ a small amount $h$ in a unit direction $\V{u}$. We want to find the $\V{u}$ that minimizes $f(\V{x_1} + h\V{u})$. Using the Taylor expansion, we see that $$f(\V{x_1} + h\V{u}) - f(\V{x_1}) = h \nabla f(\V{x_1}) \cdot \V{u} + h^2 O(1).$$ If we make the $h^2$ term insignificant by shrinking $h$, we see that in order to decrease $f(\V{x_1} + h\V{u}) - f(\V{x_1})$ the fastest we must minimize $\nabla f(\V{x_1}) \cdot \V{u}$. The unit vector that minimizes $\nabla f(\V{x_1}) \cdot \V{u}$ is $\V{u} = -\nabla f(\V{x_1}) / || \nabla f(\V{x_1}) ||$ as desired.

The algorithm is initialized with a guess $\V{x_1}$, a maximum iteration count $N_{max}$, a gradient norm tolerance $\epsilon_g$ that is used to determine whether the algorithm has arrived at a critical point, and a step tolerance $\epsilon_x$ to determine whether significant progress is being made. It proceeds as follows:

Algorithm Gradient Descent

- **for** $t=1,2,\ldots,N_{max}$:
  - $\V{x_{t+1}} \gets \V{x_{t}} - \alpha_t \nabla f(\V{x_{t}})$
  - **if** $\|\nabla f(\V{x_{t}})\| < \epsilon_g$, **return** "Converged on critical point"
  - **if** $\|\V{x_{t+1}}-\V{x_{t}}\| < \epsilon_x$, **return** "Converged on an $x$ value"
  - **if** $f(\V{x_{t+1}}) > f(\V{x_t})$, **return** "Diverging"
- **return** "Maximum number of iterations reached"
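The pseudocode above might be sketched in Python as follows; the constant step size and the quadratic test function at the bottom are illustrative choices, not part of the algorithm itself:

```python
import numpy as np

def gradient_descent(f, grad_f, x1, alpha=0.1, N_max=1000,
                     eps_g=1e-6, eps_x=1e-9):
    """Minimize f from x1 by stepping along the negative gradient.

    Returns (x, status), where status mirrors the messages in the
    pseudocode above.
    """
    x = np.asarray(x1, dtype=float)
    for t in range(N_max):
        g = grad_f(x)
        x_next = x - alpha * g          # take a step of size alpha
        if np.linalg.norm(g) < eps_g:
            return x, "Converged on critical point"
        if np.linalg.norm(x_next - x) < eps_x:
            return x, "Converged on an x value"
        if f(x_next) > f(x):
            return x, "Diverging"
        x = x_next
    return x, "Maximum number of iterations reached"

# Minimize f(x1, x2) = x1^2 + x2^2, whose minimum is at the origin.
f = lambda x: x[0]**2 + x[1]**2
grad_f = lambda x: np.array([2*x[0], 2*x[1]])
x_min, status = gradient_descent(f, grad_f, [3.0, -2.0])
```

Here a fixed step size $\alpha_t = \alpha$ is used for simplicity; the next paragraphs discuss how the step size should actually be chosen.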

The variable $\alpha_t$ is known as the *step size*, and should be
chosen to maintain a balance between convergence speed and avoiding
divergence. Note that $\alpha_t$ may depend on the step $t$.

To a first-order approximation, each step decreases the value of $f$ by approximately $\alpha_t ||\nabla f(\V{x_t})||^2$. If $\alpha_t$ is too small, then the algorithm will converge very slowly. On the other hand, if $\alpha_t$ is too large, then the algorithm may fail to reduce the value of $f$, because the first-order approximation is valid only locally.

One approach is to *adapt* the step size $\alpha_t$ in order to achieve
a reduction in $f$ while still making sufficiently fast progress. This
procedure is known as a *line search* and is employed in many
multivariate optimization algorithms. The input to a line search is a
function $f$, an initial point $\V{x}$, a search direction $\V{d}$, and
an initial candidate step size $\gamma_1$. The output is a step size
$\gamma > 0$ such that $f(\V{x}+\gamma \V{d}) < f(\V{x})$, or failure if
the step size falls below some threshold.

The first step is to check whether $\V{x} + \gamma_1 \V{d}$ decreases the value of $f$. If so, then we can either terminate the line search, or search for a larger valid $\gamma$ by repeated doubling: $\gamma_n = \gamma_1 2^n$. If a step of size $\gamma_1$ increases $f$, then we search through repeated halving: $\gamma_n = \frac{\gamma_1}{2^n}$. This continues until a reduction in $f$ is found, or the minimum step size tolerance is reached.
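The doubling/halving procedure might look like this in Python; the default tolerances and the cap on doublings are illustrative choices:

```python
def line_search(f, x, d, gamma1=1.0, gamma_min=1e-12, max_doublings=30):
    """Doubling/halving line search as described above.

    Returns a step size gamma with f(x + gamma*d) < f(x), or None on
    failure.  x and d may be floats or numpy arrays.
    """
    f0 = f(x)
    gamma = gamma1
    if f(x + gamma * d) < f0:
        # gamma1 is valid: search for a larger valid step by doubling.
        for _ in range(max_doublings):
            if f(x + 2 * gamma * d) < f0:
                gamma *= 2
            else:
                break
        return gamma
    # gamma1 is invalid: halve until a reduction is found.
    while gamma > gamma_min:
        gamma /= 2
        if f(x + gamma * d) < f0:
            return gamma
    return None  # failure: step size fell below the threshold
```

For example, minimizing $f(x)=x^2$ from $x=1$ along the negative gradient direction $d=-2$, the candidate $\gamma_1=1$ overshoots ($f(-1)=f(1)$), so halving yields $\gamma=1/2$, which lands exactly at the minimum.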

You might be thinking, "wouldn't it make sense for line search to find
the *optimal* step size rather than any one that reduces the function
value?" It turns out that in practice it is usually not worthwhile to do
extra work to obtain an optimal line search because moving along the
current line typically doesn't get you much closer to the local minimum.
More often it is a better use of time to take a step and get a new
gradient direction at the next point. Nevertheless it is sometimes
useful during analysis to assume that optimal line searches are being
performed.

It can be proven, through a relatively technical argument, that gradient descent with an optimal line search achieves linear convergence. However, there are cases in which the convergence constant is very bad, that is, the error ratio $E_{t+1} / E_{t}$ is close to 1. This occurs when the eigenvalues of $f$'s second derivative matrix, known as the Hessian matrix, are poorly "scaled" at the minimum. Here we will provide some intuition for the causes of slow convergence.

Consider a very simple quadratic function in a two dimensional Euclidean space: $f(x_1,x_2) = a x_1^2 + b x_2^2$, with positive constants $a$ and $b$. The gradient of $f$ is $(2 a x_1, 2 b x_2)$, and of course the minimum of $f$ is at the origin.

If $a=b$, then $f$ increases isotropically from the origin in every direction (Figure 1). Another way to think of this is that the level sets of $f$ are circles. At a point $(x_1,x_2)$, observe that $-\nabla f(x_1,x_2)$ points directly toward the origin. Hence, gradient descent moves any initial point directly toward the origin, and it will perform well.

Now consider the case where $b \gg a$, say $b=100a$. Now the level sets
of $f$ are ellipses that are elongated along the $x_1$ axis and thin in
the $x_2$ direction (Figure 2). At a point $(x_1,x_2)$, the descent
direction $-\nabla f(x_1,x_2) = (-2 a x_1, -2 b x_2)$ is steeper along
the $x_2$ axis than the $x_1$ axis. So, the descent direction does not
point toward the origin. This gives gradient descent a characteristic
zig-zagging trajectory that takes longer to converge to the minimum. In
general, convergence is slow when the "bowl" around the local minimum
is much narrower in one direction than another. Problems like this are
said to have an *ill-conditioned* Hessian matrix. We will make this more
precise in future lectures.

Gradient descent can be generalized to spaces that involve a discrete
component. The method of *steepest descent* is the discrete analogue of
gradient descent, but the best move is computed using a local
minimization rather than by computing a gradient. It is typically
able to converge in a few steps, but it is unable to escape local minima
or plateaus in the objective function.

A discrete search problem is defined by a discrete state space $S$ and a set of edges $E \subseteq S \times S$. This induces a graph $G=(S,E)$ that must be searched in order to find a state that minimizes a given function $f(s)$. A steepest descent search begins at a state $s_0$ and takes steps on $G$ that descend $f(s)$ with the maximum rate of descent.

The algorithm repeatedly computes each transition $s\rightarrow s^\prime$ by performing the local minimization $\arg \min_{s\rightarrow s^\prime \in E} f(s^\prime)$, and terminates when $f(s^\prime) \geq f(s)$. This approach is typically much faster than an exhaustive search if $S$ is large or possibly infinite but the degree of $G$ is small.
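This transition rule can be sketched as follows; the integer-line problem at the bottom is a made-up example, chosen only to exercise the code:

```python
def steepest_descent_search(f, neighbors, s0):
    """Discrete steepest descent: from s, move to the neighbor s' that
    minimizes f, stopping when no neighbor improves on f(s).

    `neighbors(s)` returns the states s' with an edge s -> s'.
    """
    s = s0
    while True:
        candidates = list(neighbors(s))
        if not candidates:
            return s
        best = min(candidates, key=f)  # local minimization over the edges of s
        if f(best) >= f(s):
            return s                   # local minimum (or plateau) reached
        s = best

# Example: search the integers with edges s -> s-1 and s -> s+1,
# minimizing f(s) = (s - 7)^2.  The state space is infinite, but the
# degree of the graph is only 2.
f = lambda s: (s - 7) ** 2
neighbors = lambda s: [s - 1, s + 1]
s_min = steepest_descent_search(f, neighbors, 0)
```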

*Hill climbing* is a similar approach to steepest descent that is used
for large discrete problems in which the state space is combinatorial.
(Although the name is hill *climbing* the approach can be applied to
either minimization or maximization.) If the state is composed of
several attributes $s=(s_1,\ldots,s_n)$, an optimization can be
formulated over the composite state space
$S = S_1 \times \cdots \times S_n$. In each iteration, each of the
attributes $s_i$ is locally optimized in $S_i$ using a single step of
steepest descent. These sweeps continue until convergence or a
termination condition is reached.
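A minimal sketch of these coordinate-wise sweeps is below. For illustration, each attribute is fully minimized over its (small, finite) domain in each sweep, a common variant of the single-step rule described above; the two-variable objective is a made-up example:

```python
def hill_climb(f, domains, s0, max_sweeps=100):
    """Coordinate-wise hill climbing over S = S_1 x ... x S_n.

    In each sweep, every attribute s_i is set to the value in domains[i]
    that minimizes f with the other attributes held fixed.  Stops when a
    full sweep makes no change.
    """
    s = list(s0)
    for _ in range(max_sweeps):
        changed = False
        for i, domain in enumerate(domains):
            best = min(domain, key=lambda v: f(s[:i] + [v] + s[i + 1:]))
            if best != s[i]:
                s[i] = best
                changed = True
        if not changed:
            break
    return s

# Minimize a coupled objective over two small integer grids.
domains = [range(-5, 6), range(-5, 6)]
f = lambda s: (s[0] - 2) ** 2 + (s[1] + 1) ** 2 + s[0] * s[1]
s_best = hill_climb(f, domains, [0, 0])
```

Because of the coupling term $s_1 s_2$, a single sweep is not enough here; the optimum is reached only after the sweeps alternate a few times.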

Local optimization techniques are not guaranteed to find the global
minimum unless it can be proven that the function in question has
only a single minimum or that the initial guess is within the *basin of
attraction* of the global minimum. Moreover, hard optimization problems
are often riddled with local minima. Nevertheless it is possible to
improve the chance of finding a global minimum by performing many local
optimizations with randomly chosen initial points, and picking the
locally optimized point with the lowest $f$ value. This strategy is
known as steepest descent with random restarts. Quantifying the
probability that random restarts find a global minimum is left as an
exercise.
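The restart strategy might be sketched as follows, using a plain fixed-step descent as the local optimizer and a made-up one-dimensional objective whose global minimum (near $x \approx -1.47$) sits in one of two basins of attraction:

```python
import random

def local_descent(f, grad_f, x, alpha=0.01, iters=2000):
    """Plain fixed-step gradient descent in one dimension."""
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

def random_restarts(f, grad_f, sample_x0, n_restarts=20):
    """Run the local optimizer from many random starts; keep the best."""
    results = [local_descent(f, grad_f, sample_x0())
               for _ in range(n_restarts)]
    return min(results, key=f)

random.seed(0)  # reproducible random starts
f = lambda x: x**4 - 4*x**2 + x      # two basins; global minimum near -1.47
grad_f = lambda x: 4*x**3 - 8*x + 1
sample_x0 = lambda: random.uniform(-3.0, 3.0)
x_best = random_restarts(f, grad_f, sample_x0)
```

A single run of `local_descent` started in the wrong basin would return the shallower minimum near $x \approx 1.4$; with 20 uniform restarts, at least one start almost surely lands in the global basin.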

TODO: Describe newton's method and quasi newton methods.

Extreme Value Theorem. The Extreme Value Theorem states that a continuous function $f$ does indeed attain a minimum and maximum over a closed interval $[a,b]$ at some points $c$ and $d$ in $[a,b]$, respectively.

Although it may seem like a trivial statement, it is possible to construct examples where the minimum or maximum is not attained when the interval is not closed, is not bounded, or the function is discontinuous.