$\newcommand{\V}[1]{\mathbf #1}$
$\newcommand{\M}[1] {#1}$
$\newcommand{\Reals} { \mathbb{R} }$

In previous chapters we have seen the use of myopic controllers like PID or operational space control, as well as some predictive controllers like trajectory generation. Particularly for complex and nonlinear systems like robots, predictive control allows the controller to make better decisions at the current time to account for future possibilities. However, our previous predictive methods were largely restricted to one class of systems. Optimal control addresses these shortcomings in a highly general framework.

Optimal control asks to compute a control function (either open loop or
closed loop) that optimizes some performance metric regarding the
control and the predicted state. For example, a driver of a car would
like to reach a desired location while achieving several other goals:
e.g., avoiding obstacles, not driving erratically, maintaining a
comfortable level of accelerations for human passengers. A driving style
can be composed of some balanced combination of these goals. Optimal
control allows a control designer to specify the *dynamic model* and the
*desired outcomes*, and the algorithm will compute an optimized control.
This relieves some burden by letting the designer reason at the level of
*what* the robot should do, rather than designing *how* it should do it.

In this chapter, we will discuss how to specify optimal control problems
and how to implement and use optimal control techniques. We will
consider only the open loop problem; however we will mention how an open
loop optimizer can be adapted to closed loop control via the use of
*model predictive control*.

An optimal control problem is defined by the dynamics function $f$ and a
*cost functional* over the entire trajectory $x$ and $u$:
$$J(x,u) = \int_0^\infty L(x(t),u(t),t) dt.$$ The term *functional*
indicates that this is a function mapping a function to a real number.
The term $L(x,u,t)$ is known as the *instantaneous cost* which is
accumulated over time, and should be chosen to be nonnegative and to
penalize certain undesirable states, velocities, or controls. Its units
are cost units per second.

The goal of optimal control is to find state and control trajectories $x$ and $u$ such that $J(x,u)$ is minimized: $$ \begin{gathered} x^\star, u^\star = \arg \min_{x,u} J(x,u) \text{ such that} \\ \dot{x}(t) = f(x(t),u(t)) \text{ for all }t \end{gathered} \label{eq:OptimalControl}$$

(For somewhat technical reasons, there are problems for which no optimal trajectory exists, but rather only a sequence of trajectories approaching an optimal cost. Hence, if we prefer to be pedantic, it is often necessary to prove existence of an optimal solution first, or to relax the problem to determine only an approximate optimal.)

A variety of behaviors can be specified in this framework by modifying the instantaneous cost. For example, one might penalize trajectory tracking squared error $L(x,u,t) = \|x - x_D(t)\|^2$. Minimizing effort can be defined in terms of a control penalty $\|u\|^2$. Minimum time to hit a target $x_{tgt}$ could be implemented as an indicator function $I[x\neq x_{tgt}$ where $I[z]$ is 1 if $z$ is true, and 0 otherwise. Obstacle avoidance and other feasibility constraints can be implemented as indicator functions as well, $\infty \cdot I[x \notin \mathcal{F}]$ where $\mathcal{F}$ is the free space.

As stated the optimal control problem is somewhat ill-behaved because it involves an infinite integral, which could achieve infinite cost even for relatively well-behaved trajectories. For example, if the cost were simply squared error, a trajectory that achieves 0.01 steady-state error would be rated as having the same cost as a trajectory that had an error of 1: namely, infinitely bad.

There are two general ways to make the cost functional better behaved.
The first method to truncate the problem at some maximum time $T$,
leading to a *finite-horizon optimal control* cost functional
$$J(x,u) = \int_0^T L(x(t),u(t),t) dt + \Phi(x(T))$$ where $\Phi(x)$ is
a nonnegative *terminal cost* that penalizes the state attained at the
terminal time.

The second method is to modify the instantaneous cost functional by
including a *discount factor* that decays to 0 relatively quickly as
$t \rightarrow infty$. This is usually expressed as the product of a
time-independent term and a time-dependent discount factor term:
$$L(x,u,t) = L(x,u,0) \gamma(t)$$ with $\gamma(t)$ a decaying function,
such as $O(1/t^\alpha)$ or $O(\beta^t)$. It is important to choose a
discount factor that drops relatively rapidly toward 0 to ensure that
the cost is integrable. Discount factors of the form $O(1/t^\alpha)$
must have $\alpha > 1$ to ensure that the cost functional is finite for
all bounded trajectories, and those of the form $O(\beta^t)$ must have
$\beta < 1$.

Usually, optimal control solvers require that the cost functional is smooth, and so non-differentiable constraints like minimum time and obstacle avoidance must be reformulated as hard constraints, external to the cost functional. As a result the reformulation becomes essentially an infinite-dimensional constrained optimization problem. Solvers may differ about whether they can handle constraints on state or constraints on control.

Minimization over a space of functions is considerably more difficult to
solve than typical optimization problems: the space of functions is
*uncountably infinite-dimensional*! There are two general ways to tackle
these problems: analytical or numerical. Analytical techniques use the
mathematical conditions of optimality so that the optimal control can be
determined directly through calculus and algebraic manipulation.
Successfully applying analysis typically requires a relatively simple
dynamics and cost. Numerical techniques approximate the problem by
discretize either the state, time, or control space and attempt to cast
the problem as a finite dimensional optimization. These are more
general-purpose and can be applied to complex problems, but are more
computationally expensive and usually require some parameter tuning to
obtain high-quality solutions.

The simplest class of optimal control problems is LTI systems with costs that are quadratic in $x$ and $u$. Through the calculus of variations, which is beyond the scope of this book, the optimal control for this problem class can be determined analytically as a closed-form function of $x$.

LTI systems with quadratic cost are specified as $\dot{x} = Ax + Bu$ and $$L(x,u,t) = x^T Q x + u^T R u$$ where $Q$ and $R$ are symmetric matrices of size $n\times n$ and $m\times m$, respectively. The magnitude of entries of $Q$ penalize error from the equilibrium point, and the magnitude of entries of $R$ penalize control effort. The overall control functional is therefore $$J(x,u) = \int_0^\infty x(t)^T Q x(t) + u(t)^T R u(t).$$

Here, the optimal control can be shown to be a linear function of $x$:
$$u = -Kx$$ for the gain $K = R^{-1}B^T P$ defined as a function of an
unknown matrix $P$. $P$ is a symmetric $n \times n$ matrix that solves
the following *Riccati equation* :
$$A^TP + PA - PBR^{-1}B^TP + Q = 0.$$ Numerical methods are available
for solving the Riccati equation for $P$. This method is known as the
Linear Quadratic Regulator (LQR) approach.

As we showed in Section $$sec:LTIStability$${reference-type="ref" reference="sec:LTIStability"}, traditional pole placement methods can be used to derive a stable controller for LTI systems. However, the significance of the LQR controller compared to traditional pole stability analysis is that the performance metric is made explicit rather than implicit. Moreover, it gives a closed form solution for any dynamic model specified by $A,B$, so if more information is gathered that yields a better estimate for $A$ and $B$, the LQR method can be applied directly to obtain the optimal gains.

The "magic" of the LQR solution is obtained through a more generic
principle of optimal controllers called the *Pointryagin's minimum
principle*. It defines a *first order*, *necessary* condition for a
particular state/control trajectory to be an optimum. It is derived from
$$eq:OptimalControl$${reference-type="eqref"
reference="eq:OptimalControl"} via a combination of calculus of
variations and the method of Lagrange multipliers. This will briefly be
described here.

In (equality) constrained optimization problems, the method of Lagrange
multipliers defines an auxiliary Lagrange multiplier variable
$\lambda_i$ for each equality constraint. However, in optimal control
problems, there are an infinite number of equality constraints
$\dot{x}(t) = f(x(t),u(t))$ defined for each point in time. As a result,
the Lagrange multipliers for this problem are not single variables, but
rather *trajectories* defined over time. This trajectory of multipliers
is known as a *costate trajectory*
$\lambda(t):[0,\infty)\rightarrow \mathbb{R}^n$.

An auxiliary function, called the *Hamiltonian*, is defined over the
system and costate at particular points in time:
$$H(\lambda,x,u,t)=\lambda^T f(x,u) + L(x,u,t).$$ It is also possible to
maintain control constraints $u\in \mathcal{U}\subseteq \mathbb{R}^m$.

Pointryagin's minimum principle is then stated as follows. An optimal state/ control/ costate trajectory $(x^\star,u^\star,\lambda^\star)$ satisfies for all $t \geq 0$:

$H(\lambda^\star(t),x^\star(t),u^\star(t),t) \leq H(\lambda^\star(t),x^\star(t),u,t)$ for all $u \in \mathcal{U}$

$\dot{x}^\star(t) = f(x^\star(t),u^\star(t))$

$\dot{\lambda}^\star(t) = -\frac{\partial}{\partial x}H(\lambda^\star(t),x^\star(t),u^\star(t),t)$.

The derivation of these equations is outside the scope of this book. But the conditions can be applied in certain cases to obtain optimal controls, or at least limit the range of controls possibly optimal.

As a result, consider the LQR setting. The Hamiltonian is $$H(\lambda,x,u,t) = \lambda^T(Ax + Bu) + x^T Q x + u^T R u$$ and the Pointryagin's minimum principle gives:

Subtracting off terms that do not contain $u$, $\lambda^{\star T}Bu^{\star} + u^{\star T} R u^\star \leq \lambda^{\star T}Bu + u^T R u$ for all $u$.

$\dot{x}^\star = Ax^\star + Bu^\star$

$\dot{\lambda}^\star = - A^T \lambda^{\star} - 2 Qx$.

Expanding 1, we see that for $u^\star$ to be a minimizer of the Hamiltonian, given $\lambda^\star$, $x^\star$, and $t$ fixed, we must have that $B^T \lambda^\star + 2 R u^\star = 0$ so that $u^\star = \frac{1}{2}R^{-1} B^T \lambda^\star$.

Now replacing this into 2 and 3, we have a system of ODEs: $$ \begin{split} \dot{x}^\star &= A x^\star + \frac{1}{2} B R^{-1} B^T \lambda^\star \\ \dot{\lambda}^\star &= - 2 Qx^\star -A^T \lambda^\star \end{split}$$ Hypothesizing that $\lambda^\star = 2 P x^\star$ and multiplying the first equation by $P$, we obtain the system of equations $$ \begin{split} P\dot{x}^\star &= (PA + P B R^{-1} B^T P )x^\star \\ 2P\dot{x}^\star &= (-2Q -2A^T P)x^\star \end{split}$$ After dividing the second equation by 2 and equating the left hand sides, we have an equation that must be satisfied for all $x$. Since the equation must hold for all $x$, the matrices must also be equal, which produces the Riccati equations.

Another result from Pointryagin's minimum principle condition (1) is that the Hamiltonian must be minimized by the control, keeping the state and costate fixed. As a result, there are two possibilities: (1a) the derivative of the Hamiltonian is 0 at $u^\star$, or (1b) the control is at the boundary of the control set $u^\star \in \partial U$.

This leads to many systems having the characteristic of *bang-bang
control*, which means that the optimal control will jump discontinuously
between extremes of the control set. As an example, consider a race car
driver attempting to minimize time. The optimal control at all points in
time will either maximize acceleration, maximize braking, or
maximize/minimize angular acceleration; otherwise, time could be saved
by making the control more extreme.

\margindef{Switching times}
If it can be determined that there are a finite number of possible
controls satisfying condition (1), then the optimal control problem
becomes one of simply finding *switching times* between optimal
controls.

[$$sec:DubinsCar$$]{#sec:DubinsCar label="sec:DubinsCar"} As an example, the Dubins car is a car-like dynamic system on the state variable $(x,y,\theta)\in SO(2)$ and control variable $(v,\phi)$ denoting velocity and steering angle: $$ \begin{split} \dot{x} &= v \cos \theta \\ \dot{y} &= v \sin \theta \\ \dot{\theta} &= v/L \tan \phi . \end{split}$$ Here $(x,y)$ are the coordinates of a point in the middle of the rear axis, and $L$ is the length between the rear and front axle. The velocity and steering angle are bounded, $v \in [-1,1]$ and $\phi \in [-\phi_{max},\phi_{max}]$, and the cost only measures time to reach a target state. Hence, the Hamiltonian is $$H(\lambda,x,u,t) = \lambda_1 v \cos \theta + \lambda_2 v \sin \theta + \lambda_3 v/L \tan \phi + I[x \neq x_{tgt}]$$

The latter term does not contribute to the choice of $u$, so we can ignore it. For $(v,\phi)$ to be a minimum of the Hamiltonian, with $\lambda$ and $x$ fixed, either $\lambda = 0$ and the control is irrelevant, or $\lambda \neq 0$ and $v = -sign(\lambda_1 \cos \theta + \lambda_2 \sin \theta + \lambda_3 /L \tan \phi)$. Then, since $\tan$ is a monotonic function, we have $\phi = -sign(\lambda_3 v)\phi_{max}$. As a result, the only options are the minimum, maximum, and 0 controls on each axis.

The trajectories corresponding to these extrema are straight forward/backward, moving forward while turning left/right, and moving backward while turning left/right. The curves traced out by these trajectories are then either straight line segments or arcs of turning rate $\pm \tan \phi_{max}/L$. To find all minimum-time paths between two points, it is then a matter of enumerating all possible arcs and straight line segments. The solutions are known as Reeds-Shepp curves.

It is not always possible (or easy) to derive elegant expressions of optimal controls using Pointryagin's minimum principle for general nonlinear systems. For example, if any modification is made to an LQR system, such as a non-quadratic term in the cost functional, or if state constraints or control constraints are added, the analytical solution no longer applies. As a result, numerical methods are often applied to solve a wider variety of optimal control problems. We have already seen a form of trajectory optimization under the discussion of kinematic path planning. We extend this to dynamic systems here.

Since trajectories are infinite-dimensional, the main challenge of trajectory optimization is to suitably discretize the space of trajectories. A second challenge is that the discretized optimization problem is usually also fairly large, depending on the granularity of the discretization, and hence optimization may be computationally inefficient.

In any case, the general method for performing trajectory optimization includes 1) defining a set of basis functions for the control trajectory, 2) defining a state evolution technique, 3) reformulating $$eq:OptimalControl$${reference-type="eqref" reference="eq:OptimalControl"} as a finite-dimensional optimization problem over the coefficients of the basis functions, and 4) optimizing using some gradient-based technique.

The main choice in trajectory optimization is how to represent a trajectory of controls? The most typical assumption is that the trajectory consists of piecewise-constant controls at a fixed time step.

If we define the time step $\Delta t = T/N$ with $N$ an integer, then we
have the *computational grid*
$0, \Delta t, 2\Delta t, \ldots, N\Delta t$. Let these grid points be
denoted $t_0,t_1,\ldots,t_N$ respectively with $t_0=0$ and $t_N=T$.

Then, the entire control trajectory is specified by a control sequence $u_1,\ldots,u_N$, with each $u_i$ active on the time range $[t_{i-1},t_i)$. In other words $u(t) = u_i$ with $i = \lfloor t/\Delta t \rfloor + 1$.

Suppose now we define a *simulation function*, which is a method for
integrating the state trajectory over time. Specifically, given an
initial state $x(0)$, a constant control $u$, and a fixed duration $h$,
the simulation function $g$ computes an approximation
$$x(h) \approx g(x(0),u,h)$$ If the timestep is small enough, the Euler
approximation is a reasonable simulation function:
$$x(h) \approx x(0) +h f(x(0),u).$$ If accuracy of this method is too
low, then Euler integration could be performed at a finer time sub-step
up to time $h$, and/or a more accurate integration technique could be
used.

In any case, given a piecewise-constant control trajectory defined by a control sequence $u_1,\ldots,u_N$, we can derive corresponding points on the state trajectory as follows.

Set $x_0 \gets x(0)$.

For $i=1,\ldots,N$, set $x_i = g(x_{i-1},u_i,\Delta t)$

to arrive at a state sequence $x_0=x(0),x_1,\ldots,x_N$. With this definition, each $x_i$ is a function of $u_1,\ldots,u_i$. Hence, we can approximate the cost functional as: $$J(x,u) \approx \tilde{J}(u_1,\ldots,u_n) = \delta t \sum_{i=0}^{N-1} L(x_i,u_{i+1},t_i) + \Phi(x_N).$$ Using this definition we can express the approximated optimal control function as a minimization problem: $$\arg \min_{u_1,\ldots,u_N} \tilde{J}(u_1,\ldots,u_N). \label{eq:DDPProblem}$$ With control space $\mathbb{R}^m$, this is an optimization problem over $mN$ variables.

There is a tradeoff in determining the resolution $N$. With higher values of $N$, the control trajectory can obtain lower costs, but the optimization problem will have more variables, and hence become more computationally complex. Moreover, it will be more susceptible to local minimum problems.

Standard gradient-based techniques can be used to solve the problem
$$eq:DDPProblem$${reference-type="eqref"
reference="eq:DDPProblem"}. One difficulty is that to take the gradient
of $\tilde{J}$ with respect to a control variable $u_i$, observe that
this choice of control affects *every* state variable $x_k$ with
$i \leq k \leq N$. Hence,
$$\frac{\partial J}{\partial u_i} = \Delta t \frac{\partial L}{\partial u_i} (x_{i-1},u_i,t_i) + \Delta t \sum_{k=i}^{N-1} \frac{\partial L}{\partial x}(x_k,u_{k+1},t_k)\frac{\partial x_k}{\partial u_i} + \frac{\partial \Phi}{\partial x_N}\frac{\partial x_N}{\partial u_i}
\label{eq:JacobianUi}$$ The expressions for
$\frac{\partial x_k}{\partial u_i}$ are relatively complex because $x_k$
is defined recursively assuming that $x_{i-1}$ is known. $$
\begin{split}
x_i &= g(x_{i-1},u_i,\Delta t) \\
x_{i+1} &= g(x_i,u_{i+1},\Delta t) \\
x_{i+2} &= g(x_{i+1},u_{i+2},\Delta t) \\
&\vdots
\end{split}$$ In this list, the only equation directly affected by $u_i$
is the first. The effects on the remaining states are due to cascading
effects of previous states. Hence, we see that
$$\frac{\partial x_i}{\partial u_i} = \frac{\partial g}{\partial u}(x_{i-1},u_i,\Delta t)$$
$$\frac{\partial x_{i+1}}{\partial u_i} = \frac{\partial x_{i+1}}{\partial x_i} \frac{\partial x_i}{\partial u_i} = \frac{\partial g}{\partial x}(x_i,u_{i+1},\Delta t) \frac{\partial x_i}{\partial u_i}$$
And in general, for $k > i$,
$$\frac{\partial x_k}{\partial u_i} = \frac{\partial g}{\partial x}(x_{k-1},u_k,\Delta t) \frac{\partial x_{k-1}}{\partial u_i}.$$
This appears to be extremely computationally expensive, since each
evaluation of $$eq:JacobianUi$${reference-type="eqref"
reference="eq:JacobianUi"} requires calculating $O(N)$ derivatives,
leading to an overall $O(N^2)$ algorithm for calculating the gradient
with respect to the entire control sequence. However, with a clever
forward/backward formulation, the gradient can be calculated with $O(N)$
operations.

Note that all expressions of the form $\frac{\partial x_k}{\partial u_i}$ are equivalent to $\frac{\partial x_k}{\partial x_i}\frac{\partial x_i}{\partial u_i}$. So, we observe that $$eq:JacobianUi$${reference-type="eqref" reference="eq:JacobianUi"} is equal to $$\Delta t \frac{\partial L}{\partial u_i}(x_{i-1},u_i,t) + \frac{\partial J}{\partial x_i} \frac{\partial x_i}{\partial u_i}.$$ Then, we can express: $$\frac{\partial J}{\partial x_i} = \Delta t \sum_{k=i}^{N-1} \frac{\partial L}{\partial x}(x_k,u_{k+1},t_k) \frac{\partial x_k}{\partial x_i} + \frac{\partial \Phi }{\partial x_N}\frac{\partial x_N}{\partial x_i}.$$ This entire vector can be computed in a single backward pass starting from $i=N$ back to $i=1$. Starting with $i=N$, see that $$\frac{\partial J}{\partial x_N} = \frac{\partial \Phi }{\partial x_N}$$ Then, proceeding to $i=N-1$, observe $$ \begin{split} \frac{\partial J}{\partial x_{N-1}} &= \Delta t \frac{\partial L}{\partial x}(x_{N-1},u_N,t_{N-1}) + \frac{\partial \Phi }{\partial x_N}\frac{\partial x_N}{\partial x_{N-1}} \\ &= \Delta t \frac{\partial L}{\partial x}(x_{N-1},u_N,t_{N-1}) + \frac{\partial J}{\partial x_N} \frac{\partial x_N}{\partial x_{N-1}}. \end{split}$$ In general, with $i<N$, we have the recursive expression $$\frac{\partial J}{\partial x_i} = \Delta t \frac{\partial L}{\partial x}(x_i,u_{i+1},t_i) + \frac{\partial J}{\partial x_{i+1}} \frac{\partial x_{i+1}}{\partial x_{i}}.$$ The entire set of values can be computed in $O(N)$ time for all $x_1,\ldots,x_N$.

However, problems of this sort are usually poorly scaled, and hence standard gradient descent converges slowly. The most commonly used higher-order technique is known as Differential Dynamic Programming (DDP), which is an efficient recursive method for performing Newton's method. Given a current control trajectory, it approximates the cost function as a quadratic function of the controls, and solves for the minimum of the function. The exact steps for implementing DDP are beyond the scope of this course but are readily available from other sources.

As an alternative to piecewise constant controls, it is also possible to
use other discretizations, such as polynomials or splines. In any case,
the control is specified by a linear combination of *basis functions*
$$u(t) = \sum_{i=1}^N c_i \beta_i(t)$$ where the $c_i \in \mathbb{R}^m$
are control coefficients, which are to be chosen by the optimization,
and the basis functions $\beta_i(t)$ are constant. For example, a set of
polynomial basis functions could be $1$, $t$, $t^2$, ..., $t^{N-1}$.
The difficulty with such parameterizations is that the state trajectory
depends on every control coefficient, so evaluating the gradient is
computationally expensive.

Another approach is to perform *state trajectory parameterization* in
which the state trajectory is an optimization variable that is
parameterized explicitly along with the control trajectory. A potential
difficulty with this method is that enforcing dynamic constraints is
challenging, and instead must be performed at a finite number of points
in time, which are known as *collocation points*. The result is an
equality-constrained, finite-dimensional optimization problem.

There are some challenges when applying trajectory optimization to infinite-horizon optimal control problems. Specifically, it is not possible to define a computational grid over the infinite domain $[0,\infty)$ for the purposes of computing the integral in $J(x,u)$. To do so, there are two general techniques available. The first is to simply truncate the problem at some maximum time $T$, leading to a finite-horizon optimal control problem.

The second method is to reparameterize time so that the range $[0,\infty)$ is transformed into a finite range, say $[0,1]$. If we let $s=1-e^t$ then $s$ is in the range $[0,1]$. The cost functional then becomes: $$J(x,u) = \int_0^1 L(x(-\ln(1-s)),u(-\ln(1-s)),-\ln(1-s)) / (1-s) ds.$$ This leads to a finite-horizon optimal control problem over the $s$ domain, with $T=1$. Hence, if a uniform grid is defined over $s \in [0,1]$, then the grid spacing in the time domain becomes progressively large as $t$ increases.

In the reformulated problem it is necessary to express the derivative of $x$ with respect to $s$ in the new dynamics: $$\frac{d}{d s} x(t(s)) = \dot{x}(t(s)) t^\prime(s) = f(x(t(s)),u(t(s))) / (1-s)$$ Care must also be taken as $s$ approaches 1, since the $1/(1-s)$ term approaches infinity, and if instantaneous cost does not also approach 0, then cost will become infinite. It is therefore customary to use a discount factor. With an appropriately defined discount term, the $s=1$ contribution to cost will be dropped.

A major issue with any descent-based trajectory optimization approach is local minima. Only in a few cases can we prove that the problem is convex, such as in LTI problems with convex costs and linear control constraints. As we have seen before, random restarts are one of the most effective ways to handle minima, but in the high dimensional spaces of trajectory optimization, a prohibitive number of restarts is needed to have a high chance of finding a global optimum.

An alternative method to solve optimal control problems is to find the
solution in *state space* rather than the time domain. In the
Hamilton-Jacobi-Bellman (HJB) equation, so named as an extension of the
Bellman equation for discrete optimal planning problems, a partial
differential equation (PDE) across state space is formulated to
determine the optimal control everywhere. (Contrast this with the
Pointryagin's minimum principle, which is an optimality condition only
along a single trajectory.)

We start by formulating the HJB equation in discrete time. Consider a
finite-horizon optimal control problem, and define the *value function*
as a function
$V(x,t) \mathbb{R}^n \times \mathbb{R} \rightarrow \mathbb{R}$ the
*minimum possible added cost* that could be obtained by any trajectory
starting from initial state $x$ and time $t$. In other words, it
truncates the lower point in the integral in $J(x,u)$ to start from time
$t$.

It is apparent that at time $T$, the only term that remains is the terminal cost, so one boundary term is given: $$V(x,T) = \Phi(x).$$ Now we examine the value function going backwards in time. Suppose we know $V(x,t+\Delta t)$ for all $x$, and now we are considering time $t$. Let us also assume that at a state $x$ with control $u$, the resulting state at time $T$ is approximated by Euler integration, and the incremental cost is approximately constant over the interval $[t,t+\Delta t)$. Then, we have the approximation $$V(x,t) \approx \min_{u\in U} [ \Delta t L(x,u,t) + V(x + \Delta t f(x,u),t+\Delta t)] \label{eq:DiscreteTimeHJB}$$ The minimization is taken over controls to find the optimal control for the next time step. The first term of the minimized term includes the incremental cost from the current state, time, and chosen control. The second term includes the cost contribution from the next state under the chosen control, incremented forward in time.

Note that the first order approximation of $V(x,t)$ is given by: $$V(x+\Delta x,t+\Delta t) \approx V(x,t) + \frac{\partial V}{\partial x}(x,t) \Delta x + \dot{V}(x,t)\Delta t$$ If we take the limit of $$eq:DiscreteTimeHJB$${reference-type="eqref" reference="eq:DiscreteTimeHJB"} as the time step $\Delta t$ approaches 0, subtract $V(x,t)$ from both sides, and divide by $\Delta t$, then we obtain the Hamilton-Jacobi-Bellman PDE : $$0 = \dot{V}(x,t) + \min_{u \in U} [ L(x,u,t) + \frac{\partial V}{\partial x}(x,t) f(x,u)]. \label{eq:HJB}$$

If these equations were to be solved either in discrete or continuous time across the $\mathbb{R}^n \times \mathbb{R}$ state space, then we have a complete description of optimal cost starting from any state. It is also possible to enforce state constraints simply by setting the value function at inadmissible states to $\infty$. Moreover, it is a relatively straightforward process to determine the optimal control given a value function: $$u^\star(x,t) = \arg \min_{u \in U} [ \Delta t L(x,u,t) + V(x + \Delta t f(x,u),t + \Delta t)]$$ for the discrete case and $$u^\star(x,t) = \arg \min_{u \in U} [ L(x,u,t) + \frac{\partial V}{\partial x}(x,t) f(x,u)]$$ for the continuous case. The main challenge here is to represent and calculate a function over an $n+1$ dimensional grid, which is prohibitively expensive for high-D state spaces. It is also potentially difficult to perform the minimization over the control in $$eq:DiscreteTimeHJB$${reference-type="eqref" reference="eq:DiscreteTimeHJB"} and $$eq:HJB$${reference-type="eqref" reference="eq:HJB"}, since it must be performed at each point in time and space.

It is often useful to reduce the dimensionality down to an $n$-D grid if
the incremental cost is time-independent and the problem has an infinite
horizon. With these assumptions, the optimal control is *stationary*,
that is, it is dependent only on state and not time. Then, we can set up
a set of recursive equations on a time-independent value function:
$$V(x) = \min_{u \in U} [ \Delta L(x,u) + V(x+\Delta t f(x,u)) ]
\label{eq:DiscreteHJBStationary}$$ in the discrete time case, or taking
the limit as $\Delta t \rightarrow 0$, we get the continuous PDE
$$0 = \min_{u \in U} [ L(x,u) + \frac{\partial V}{\partial x}(x) f(x,u) ].$$
It can be rather challenging to solve these equations exactly due to
their recursive nature. Also, some discretization of the control set $U$
is usually needed, and a finer discretization will help the method
compute better estimates. Three general methods exist for doing so.

Value iteration uses a guess of $V(x)$ and then iteratively improves it by optimizing $$eq:DiscreteHJBStationary$${reference-type="eqref" reference="eq:DiscreteHJBStationary"} on each $x$ in the grid. This is also known as recursive dynamic programming.

Linear programming uses a set of sample points $x_1,\ldots,x_N$ on a state space grid and points $u_1,\ldots,u_M$ on a control space grid, and then sets up a large linear programming problem with constraints of the form $$eq:DiscreteHJBStationary$${reference-type="eqref" reference="eq:DiscreteHJBStationary"}.

Policy iteration assigns guesses for the policy $u(x)$, and iteratively alternates between a) solving for the $V(x)$ induced by those controls, and b) improving the assigned controls using the induced $V(x)$.

The method of model predictive control (MPC) is a process for building a closed-loop controller when given a method that computes open loop trajectories. Generally speaking, it simply replans a new trajectory starting from the sensed state at each step. It executes some small portion of that trajectory, senses the new state, and then replans again. By repeating this process, MPC is able to cope with unexpected disturbances by dynamically calculating paths to return to desirable states.

There are, however, several caveats involved in successful application of MPC. Let us define the steps more specifically. For a control loop that operates at rate $\Delta t$, perform the following steps:

Sense the current state $x_c$

Compute a finite-time optimal trajectory $x,u$ starting at $x(0) = x_c$.

Execute the control $u(t)$ for $t \in [0,\Delta t)$

Repeat from step 1.

There are several variables of note when creating an MPC approach. First, the time step $\Delta t$ must be long enough for step 2 to find an optimal trajectory. Second, the time horizon used in step 2 is an important variable, because it should be long enough for MPC to benefit from predictive lookahead, but not too long to make computational time exceed $\Delta t$. Third, when recasting the problem as a finite-time optimal control problem, the terminal cost should be somewhat approximate the problem's value function, or else the system may not converge properly. Finally, the optimization method used ought to be extremely reliable. Failures in step 2 can be tolerated to some extent simply by using the previous trajectory segment, but to achieve good performance Step 2 should succeed regularly.

Due to all of these variables, MPC is difficult to analyze, and as employed in practice it usually does not satisfy many theoretical stability / convergence properties. However, with careful tuning, MPC can an extremely high performing and practical nonlinear optimal control technique.

Key takeaways:

Optimal control functions are specified by a dynamics function, a cost functional, and optional constraints. By changing the cost functional, different performance objectives can be specified.

Analytical optimal control techniques can be challenging to apply to a given problem, but for some problems can yield computationally efficient solutions by greatly reducing the search space.

Numerical optimal control techniques are more general, but require more computation.

In trajectory optimization, time is discretized to produce a finite-dimensional optimization problem. There is a tradeoff between optimality and speed in the choice of computational grid resolution, and are also susceptible to local minima.

In Hamilton-Jacobi-Bellman (HJB) techniques, both time and space are discretized. There are no local minima, but the technique suffers from the curse of dimensionality.

Model predictive control (MPC) turns open-loop trajectory optimization into a closed-loop controller by means of repeated replanning.

Table $$tab:OptControlSummary$${reference-type="ref" reference="tab:OptControlSummary"} lists an overview of the approaches covered in this chapter.

**Approach** **Type** **Characteristics**

LQR Analytical Applies to LTI systems with quadratic costs PMP Analytical Defines necessary conditions for optimality Trajectory opt. Numerical Optimize a time-discretized control or trajectory space. Local minima HJB Numerical Discretize and solve over state space. MPC Numerical Closed-loop control by repeated optimization

: Summary of approaches[]{label="tab:OptControlSummary"}

Consider devising an optimal control formulation that describes how your arm should reach for a cup. What is the state $x$? The control $u$? The dynamics $f(x,u)$? The objective functional? Is an infinite-horizon or finite-horizon more appropriate for this problem? Do the same for the task of balancing on one foot.

Recall the point mass double-integrator system: $$\dot{x} \equiv \ColVecTwo{\dot{p}}{\dot{v}} = f(x,u) = \ColVecTwo{v}{u/M}.$$ Express this as an LTI system, and solve for the LQR gain matrix $K$ with the cost terms $Q=\MatTwo{10}{0}{0}{1}$ and $R=5$.

TBD...