$\newcommand{\V}[1]{\mathbf #1}$
$\newcommand{\M}[1] {#1}$
$\newcommand{\Reals} { \mathbb{R} }$

We would like to extend concepts that are familiar to us from working
with real numbers $\mathbb{R}$ to other spaces of interest. For example,
through much of this class we will operate in $n$-dimensional Euclidean
spaces $\mathbb{R}^n$. Elements of $\mathbb{R}^n$ (points) are simply
tuples of $n$ real numbers. In this section we extend our understanding
the notions of sequences, limits, derivatives, and closed and open
intervals to a more general setting of *metric spaces*.

A metric space $S$ is an arbitrary abstract set equipped with a *metric*
$d(x,y)$ that measures a sense of distance between points. The metric is
required to be 1) nonnegative ($d(x,y) \geq 0$), 2) reflexive
($d(x,y) = 0$ iff $x=y$) and 3) symmetric ($d(x,y) = d(y,x)$). It is
also required to satisfy the triangle inequality:
$d(x,y) \leq d(x,z)+d(z,y)$ for all $x,y,z$.

Examples:

$\mathbb{R}$ is a metric space under the absolute distance metric $d(x,y)=|x-y|$.

Any Euclidean space $\mathbb{R}^n$ is a metric space under the Euclidean distance metric (and indeed, many other metrics).

Any set is a metric space under the discrete metric: $d(x,y) = 0$ if $x=y$, and $d(x,y) = 1$ otherwise.

We can easily generalize the notions of convergence and limits of sequences into the metric space setting. This is done simply by replacing the absolute distances $|x-y|$ with the metric $d(x,y)$. Limits of a function also generalize, but only in the standard case (taken relative to some point $c$). The one-sided and asymptotic limits do not generalize in the same way, nor do the notions of minimum/maximum/infimum/supremum.

Neighborhood. For a metric space $(S,d)$, the $r$-neighborhood of a point $x\in S$ is the set $N_r(x) = { y\in S | d(x,y) < r }$. For a Euclidean space $\mathbb{R}^n$ and the Euclidean metric, this is a ball of dimension $n$ and radius $r$ centered on $x$, not including its surface.

Open set. Anopen set$A$ is a subset of a metric space $(S,d)$ such that every point $x \in A$ has an $r$-neighborhood completely contained within $A$ with some $r > 0$. More precisely, for all $x$, there exists an $r > 0$ such that $N_r(x) \subseteq A$. (Note that $r$ is allowed to depend on $x$.)

Closed set. Aclosed set$A$ is a subset of a metric space $(S,d)$ such that its complement $S \setminus A$ is an open set. (Note: typically the "universe" of possible elements $S$ is known and is therefore left implicit.)

Examples:

The half-open interval $(a,b]$ is neither open nor closed.

Any finite set of points $\{ x_1,\ldots,x_n \}$ closed in $\mathbb{R}$.

The empty set is both open and closed.

$\mathbb{R}$ is both open and closed (taking $\mathbb{R}$ as the universe).

The union of any number of open sets is open.

The intersection of any finite number of open sets is open.

The union of any finite number of closed sets is closed.

The intersection of any number of closed sets is closed.

Closure. The closure of a set $A$, denoted $cl(A)$ is the set of all points $x \in S$ such that for any $r > 0$, $N_r(x) \cap A \neq \emptyset$.

From this definition, we can see that $A \subseteq cl(A)$, $cl(A)$ is a closed set, and that $A=cl(A)$ iff $A$ is a closed set.

Interior. The interior of a set $A$, denoted $int(A)$ is the set of all points $x \in S$ such that there exists an $r > 0$ such that $N_r(x) \subseteq A$.

From this definition, we can see that $int(A) \subseteq A$, $int(A)$ is an open set, and that $A=int(A)$ iff $A$ is an open set.

Boundary. The boundary of a set $A$, denoted $\partial A$, is defined as $cl(A)\setminus int(A)$.

Or, equivalently, it is the set of points for which all $r$-neighborhoods with $r>0$ contain at least one point of $A$ and one point of its complement.

We will often work with functions that map many variables (a vector) to
a single number (a scalar). A *scalar field*
$f : \mathbb{R}^n \rightarrow \mathbb{R}$, is a real-valued function of
$n$ real-valued variables. We can either write the arguments explicitly
as $f(x_1,\ldots,x_n)$ or as a single vector argument as $f(\V{x})$. The
same notions of continuity, minima, and maxima that apply to univariate
functions also apply to scalar fields.

Continuity. A scalar field $f$ is continuous on a domain $S\in \mathbb{R}^n$ if for every point $\V{x} \in S$ and $\epsilon > 0$, there exists a neighborhood with radius $\delta$ around $\V{x}$ that satisfies $|f(\V{x})-f(\V{y})| < \epsilon$ for all $y \in N_\delta(\V{x})$.

Extreme value theorem. A continuous scalar field $f$ attains a minimum and maximum value on aclosedset $S \subseteq \mathbb{R}^n$.

Local minima. $\V{x}$ is said to be a local minimum of a scalar field $f$ on anopenset $S \subseteq \mathbb{R}^n$ if there exists a neighborhood $N_r(\V{x})$ around $\V{x}$ such that $f(\V{x}) < f(\V{y})$ for all $\V{y} \in N_r(\V{x}) \setminus \{ \V{x}\}$. A similar definition holds for local maxina.

Partial derivatives allow taking derivatives of a scalar field with
respect to certain individual arguments. The *partial derivative* of $f$
with respect to its argument $x_i$ at the values
$(x_1,\ldots,x_n)=(a_1,\ldots,a_n)$ is given by
$$\frac{\partial f}{\partial x_i}(a_1,\ldots,a_n) = \lim_{h \rightarrow 0} \frac{f(a_1,\ldots,a_{i-1},a_i+h,a_{i+1},\ldots,a_n)-f(a_1,\ldots,a_n)}{h}$$
In other words, this is the deriative in the direction of the $a_i$
while fixing all of the other $a_j$'s, $i\neq j$, constant.

We can write this definition more compactly in vector notation: $$\frac{\partial f}{\partial x_i}(\V{a}) = \lim_{h \rightarrow 0} \frac{f(\V{a}+h\V{e_i})-f(\V{a})}{h}$$

A third possible interpretation is that we are taking the derivative of the function $$g_i(x) = f(a_1,\ldots,a_{i-1},x,a_{i+1},\ldots,a_n)$$ at $x = a_i$. In other words, $g_i^\prime(x)|_{x = a_i} =\frac{\partial f}{\partial x_i}(\V{a})$.

(Note that some other texts may use the notation $f_i$ to denote the partial derivative with respect to the i'th argument to $f$. I find this notation to be extremely confusing so we will use the $\frac{\partial f}{\partial x_i}$ notation throughout this class.)

Partial differentiation operator. The notation $\frac{\partial}{\partial x_i}$ refers to the partial differentiation operator on the $i$'th argument to $f$.

For shorthand, we may occasionally write $\frac{\partial}{\partial i}$.

The rules for computing regular derivatives can also be applied when
computing partial derivatives. The only difference is that *all
arguments, except for the one being differentiated, are treated as
constants*. As an exmaple, consider $$f(x_1,x_2) = x_1^k e^{c x_2}.$$
The partial derivatives are:
$$\frac{\partial}{\partial x_1} f(x_1,x_2) = k x_1^{k-1} e^{c x_2}$$
because $x_2$ is treated as a constant, and
$$\frac{\partial}{\partial x_2} f(x_1,x_2) = c x_1^{k} e^{c x_2}$$
because $x_1$ is treated as a constant.

Be careful when a variable appears in multiple arguments. For example,
given a function $f(x,g(x))$, the partial derivative of $f$ with respect
to $x_1$ does *not* use the partial derivative with respect to $x_2$, or
the derivative of $g$ in its calculation. Instead, you should treat the
second argument as a constant $y$ and proceed as though it has no
dependence on $x$.

You *should* incorporate those derivatives when computing the *total
derivative* with respect to $x$, which is denoted $\frac{d}{dx}$ or
$\cdot^\prime$. The total derivative treats the expression as a single
function, $h(x) = f(x,g(x))$, and involves a chain rule with both
partial derivatives:
$$\frac{d}{dx}f(x,g(x)) = h^\prime(x) = \frac{\partial}{\partial x_1} f(x,g(x)) + \frac{\partial}{\partial x_2} f(x,g(x)) g(x)^\prime.$$

The *directional derivative* is another type of derivative that measures
the rate of change of the scalar field $f$ as you move from $\V{a}$ in a
direction $\V{d}$. You can think about moving along a ray
$\V{x}(u) = \V{a} + u\V{d}$ standing at $\V{a}$ and beginning to move
with velocity $\V{d}$. The directional derivative measures the slope of
the field in your desired direction.

The

directional derivativeof $f$ at $\V{a}$ in direction $\V{d}$ is the value: $$\nabla_{\V{d}}f(\V{a}) = \lim_{h\rightarrow 0} \frac{f(\V{a}+h\V{d})-f(\V{a})}{h}.$$

Note that the standard partial derivative $\frac{\partial}{\partial_i}f(\V{a})$ is equivalent to $\nabla_{\V{e_i}}f(\V{a})$.

A fundamental result in multivariate calculus is that *all directional
derivatives* can be expressed as a dot product of $\V{d}$ with a vector
known as the *gradient* of $f$. The gradient of $f$, denoted
$\nabla f(\V{a})$, is the vector of all partial derivatives of $f$:
$$\nabla f(\V{a}) = \left(\frac{\partial}{\partial x_1} f(\V{a}), \ldots, \frac{\partial}{\partial x_n} f(\V{a})\right).$$

We will prove the key result that $$\nabla_{\V{d}}f(\V{a}) = (\nabla f(\V{a})) \cdot \V{d}$$

We will need a lemma that directional derviatives are linear, specifically that: $$\nabla_{c\V{d}} f(\V{a}) = c \nabla_{\V{d}} f(\V{a})$$ and $$\nabla_{\V{d}+\V{e_i}} f(\V{a}) = \nabla_{\V{d}} f(\V{a}) + \nabla_{\V{e_i}} f(\V{a})$$ for any scalar $c$, vectors $\V{d}$, $\V{d_1}$, and $\V{d_2}$.

First, scaling: $$ \begin{split} \nabla_{c\V{d}} f(\V{a}) &= \lim_{h\rightarrow 0} \frac{f(\V{a} + ch\V{d})-f(\V{a})}{h} \\ &= c \lim_{h\rightarrow 0} \frac{f(\V{a} + ch\V{d})-f(\V{a})}{ch} \\ &= c \lim_{(ch)\rightarrow 0} \frac{f(\V{a} + ch\V{d})-f(\V{a})}{ch} \\ &= c \lim_{h\rightarrow 0} \frac{f(\V{a} + h\V{d})-f(\V{a})}{h} \\ &= c \nabla_{\V{d}} f(\V{a}) \end{split}$$

Now, let's look at summing. $$\nabla_{\V{d_1}+\V{d_2}} f(\V{a}) = \lim_{h\rightarrow 0} \frac{f(\V{a}+h\V{d_1}+h\V{d_2})-f(\V{a})}{h}$$

Now let $g_h(x) = f(\V{a}+h\V{d_1} + x\V{d_2})$ and evaluate the Taylor expansion of $g_h(x)$ around $x=0$: $$g_h(x) = f(\V{a}+h\V{d_1}) + g_h^\prime(x)|_{x=0} x + O(x^2)$$

So $$ \begin{split} \nabla_{\V{d_1}+\V{d_2}} f(\V{a}) &= \lim_{h\rightarrow 0} \frac{1}{h}[f(\V{a}+h\V{d_1}) + g_h^\prime(0) h + O(h^2) - f(\V{a})] \\ &= \lim_{h\rightarrow 0} \left(\frac{1}{h}[f(\V{a}+h\V{d_1}) - f(\V{a}) + O(h^2)] + g_h^\prime(0) \right)\\ &= \nabla_{\V{d_1}} f(\V{a}) + \lim_{h\rightarrow 0} g_h^\prime(x)|_{x=0}. \end{split}$$ The desired result is obtained by noting that $\lim_{h\rightarrow 0} g_h^\prime(x)|_{x=0}$ is identically $\nabla_{\V{d_2}}f(\V{a})$.

Returning to the original question, we use linearity of the directional derivative to obtain $$ \begin{split} (\nabla f(\V{a})) \cdot \V{d} &= \sum_{i=1}^n d_i \nabla_{\V{e_i}}f(\V{a}) \\ &= \sum_{i=1}^n \nabla_{d_i \V{e_i}}f(\V{a}) \\ &= \nabla_{\sum_{i=1}^n d_i \V{e_i}}f(\V{a}) \\ &= \nabla_{\V{d}}f(\V{a}) \end{split}$$ as desired. QED.

First-order Taylor expansion of a scalar field. If a scalar field $f$ is differentiable, we can use the gradient to define the first-order Taylor expansion of $f$ about a point $\V{a}$: $$f(\V{x}) = f(\V{a}) + (\nabla f(\V{a})) \cdot (\V{x}-\V{a}) + O(||\V{x}-\V{a}||^2).$$

Note that the equation
$f(\V{x}) = f(\V{a}) + \nabla f(\V{a}) \cdot (\V{x}-\V{a})$ defines a
plane in the $(\V{x},f)$ space. This says that the first two terms of
the Taylor expansion give the plane that is *tangent to the surface*
defined by $f$ at $\V{a}$. Moreover, when $\V{x}$ is close to $\V{a}$,
the fit is quite good. It therefore serves a role that is very much like
the tangent line found by the Taylor expansion of a univariate function.

Here, let $f$ and $g$ be scalar fields, let $h$ be a univariate function, and let $c$ be a scalar.

*Linearity.*$\nabla ( f(\V{x}) + c g(\V{x})) = \nabla f(\V{x}) + c \nabla g(\V{x})$.*Chain rule.*$\nabla h(f(\V{x})) = h^\prime(f(\V{x}))\nabla f(\V{x})$.*Multiplication rule.*$\nabla (fg)(\V{x})) = f(\V{x}) \nabla g(\V{x}) + g(\V{x}) \nabla f(\V{x})$.

There is another version of the chain rule that is occasionally useful. Let $\V{y}(u)$ be a vector space curve parameterized by $u \in \mathbb{R}$. Then $$\frac{d}{du}(f(\V{y}(u))) = \nabla f(\V{y}(u)) \cdot \V{y}^\prime(u).$$ In other words, you get the directional derivative of $f$ in the direction $\V{y}^\prime(u)$.

Analogously to univariate functions, the critical points of a scalar
field are those points $\V{x}$ where $\nabla f(\V{x})=\V{0}$. All local
minima and maxima are critical points, but not all critical points are
local minima and maxima. These critical points are known as *saddle
points* because in two dimensional scalar fields they look like saddles.