$\newcommand{\V}[1]{\mathbf #1}$
$\newcommand{\M}[1] {#1}$
$\newcommand{\Reals} { \mathbb{R} }$

Calibration is one of the most important steps to take toward building a high performance robotic sysystem. The precision by which a robot can navigate and manipulate objects is almost entirely dependent on having meaningful, accurate models of physical quantities. For example, a forward model of a sensor predicts what the robot will observe given certain physical quantities; an inverse sensor model predicts what physical quantities correspond to observations. A kinematics model with accurate link lengths, reference tranforms, and joint axes is crucial for forward and inverse kinematics. Physics models predict the outcomes of the robot interacting with the world. In each of these cases, poor calibration can cause catastrophic failures; good calibration helps crush your competitors!

For whatever reason, calibration has a sort of reputation for being an "unsexy" topic, and most students approach it as a sort of bitter pill to swallow in order to make a robot work. It is perhaps a surprise then, that the techniques and processes used in calibration are essentially the same as in the relatively "flashy" subject of machine learning! So this negative reputation is really undeserved: in both calibration and learning, we use observations of the real world (data) to build computational representations of the world (models) which can then be used to make predictions; in both fields, we must concern ourselves with the same general procedure. First, the *data gathering process* should produce meaningful observations that reflect real-world conditions; next, *model fitting* should propose and estimate parameters of a model that matches the data; and finally, *performance measurement* obtains meaningful estimates of the predictive accuracy of our models. The main difference between calibration and learning is that in calibration, the role of the engineer is to propose models based on *knowledge* (e.g., kinematics, physics, and sensor models), while in machine learning the role of the engineer is to achieve good performance by whatever means necessary.

The general calibration framework is a five step process:

Establish a

**calibration procedure**allows the robot to observe some**ground truth**, i.e., trusted measurements of the phenomenon of interest. Ground truth can be established using a*calibration rig*, precision measurement device, or human annotation.Develop a

**parametric model**relating ground truth to the robot's sensor measurements (observations). Ground truth and observations are linked via a set of unknown parameters. This could be a*forward model*that predicts observations from ground truth, or an*inverse model*that predicts ground truth from observations.Execute the calibration procedure to

**acquire a dataset**of several observations and ground truth measurements.**Estimate the parameters**of the model to minimize prediction error.

Developing a calibration procedure often requires some ingenuity to ensure identifiability of the parameters of interest, as well as to keep the procedure convenient.

Let $\V{z}$ be the vector of the robot's observations and $\V{w}$ be the ground truth. A forward model is a function $\V{z} = f(\V{w})$ which predicts the observations from ground truth, while an inverse model $\V{w} = f(\V{z})$ does the converse. In either case, we would like to generate some function $\V{y} = f(\V{x})$, with inputs $\V{x}$ and outputs $\V{y}$ defined accordingly, that makes accurate predictions for all $\V{x}$ in its domain.

A *dataset* $D$ consists of $N$ paired inputs and outputs $D=\{ (\V{x}^{(1)},\V{y}^{(1)}), \ldots, (\V{x}^{(N)},\V{y}^{(N)}) \}$, where we assume $\V{y}^{(i)} = f(\V{x}^{(i)})$ for all $i$. The goal of model fitting is to estimate a function $\hat{f} \approx f$ so that the error between each the true output and the predicted output is low. Here, the notation $\hat{f}$ distinguishes the estimated model from the "true function" $f$. But from now on, we will deal only with estimated models, so our notation will drop the $\hat{\cdot}$ for simplicity.

Specifically, we define an *error function*
$$ E(f;D) = \sum_{i=1}^N e(\V{y}^{(i)},f(\V{x}^{(i)})) $$
with $e(\V{y},\hat{\V{y}})$ a *loss function* (a.k.a. *cost function*) that measures some notion of prediction error. The most common loss function used in calibration is the *quadratic loss*
$$e(\V{y},\hat{\V{y}}) = \| \V{y} - \hat{\V{y}} \|^2.$$
Model fitting seeks a model $f$ that achieves a low value of $E(f;D)$.

For low dimensional, directly related, and noiseless inputs and outputs, we can simply *tabulate* the relationship between $\V{x}$ and $\V{y}$. For example, the torque curve of a motor relates speed to torque, and can be obtained simply by controlling speed across a grid of values $x^1,\ldots,x^N$ and measuring torque $y^1,\ldots,y^N$ at each value. Simply "connecting the dots" allows the model to achieve zero error on the dataset, and it predicts the value of $y$ for a given value of $\V{x} \notin \{x^1, \ldots,x^N \}$ using interpolation.

Tabulation can only be applied in low dimensional input spaces, because at roughly 4 dimensions or higher, the size of the input grid grows large, which requires an impractical number of experiments. A tabulation approach also has the problem that the model is non-differentiable, which is a disadvantage if the model must need to be used for optimization.

More commonly we define a *parametric model*. These are functions of the form $\V{y} = f(\V{x};\V{\theta})$, in which $\V{\theta}$ is a set of unknown parameters. As $\V{\theta}$ varies, the actual predictive model $f_{\V{\theta}}(\V{x}) \equiv f(\V{x};\V{\theta})$ changes. However, $f$ is a fixed function of both $\V{x}$ and $\V{\theta}$. This representation is said to define a *function class* — a set of possible models. Our goal is to find a single value of $\V{\theta^\star}$ — and hence a single model — that minimizes the error:

$$ \V{\theta}^\star = \min_{\V{\theta}} E(\V{\theta};D) = \sum_{i=1}^N e(\V{y}^{(i)},f(\V{x}^{(i)};\V{\theta})). $$

As a very simple example, suppose a servomotor takes as input consisting of $z_s$ a 7-bit integer from $[0,127]$ defining speed, $z_d$ a single bit defining direction, and outputs $z_a$ a 16-bit integer from [0,65536] defining the current angle read by the encoders. A common assumption is that the true angular velocity $\dot{q}$ is linear in $z_s \cdot 2(z_d-0.5)$ (where 2(z_d-0.5) is the sign $\pm 1$) and the current angle $q$ is linear in $z_d$. So, we hypothesize the relationships $\dot{q} = 2 \theta_s z_s(z_d-0.5)$, with $\theta_s$ an unknown speed coefficient, and $q = \theta_0 + z_a \theta_a$, with $\theta_0$ the unknown zero position and $\theta_a$ an unknown angular range coefficient. Any set of coefficients $(\theta_s, \theta_0, \theta_a)$ generates a possible model from this class of models, but given enough data points, only one will minimize the error.

As you might imagine, figuring out these coefficients with two measurements can be done through simple algebra. Suppose we command an arbitrary nonzero velocity $z_s^{(1)}$ and direction $z_d^{(1)}$ for duration $t$, and then observe movement from $z_a^{(1)}$ to $z_a^{(2)}$ corresponding to physical servo angles $q^{(1)}$ and $q^{(2)}$. ($q$ can be measured to high precision using multiple methods, such as measuring the angle of a reference point on an image, given a camera pointed head on to the motor; or using precisely constructed physical stops). Using the relationship $q^{(2)} - q^{(1)} = (z_a^{(2)}-z_a^{(1)}) \theta_a$ we can set $$\theta_a = \frac{q^{(2)} - q^{(1)}}{z_a^{(2)}-z_a^{(1)}}$$ and then $$\theta_0 = q^{(1)} - \theta_a z_a^{(1)}.$$ Estimating $\dot{q} = (q^{(2)}-q^{(1)})/t$ you can set $$\theta_s = \frac{q^{(2)}-q^{(1)}}{2 t z_s^{(1)}(z_d^{(1)}-0.5)}.$$

This connect-the-dots construction, however, is only correct under perfect precision for each measurement. Real servomotors have quantization error (a range of continuous values corresponding to one discrete measurement) and the physical servo angle measurements will be noisy. Moreover, the uniform velocity assumption may not exactly hold due to ramping, backlash in the motors, friction, and other effects.

In [3]:

```
#Code for the linear model fitting figureâ†”
```

The figure above shows a set of corrupted measurements that lie approximately along a line. If we simply perform "connect-the-dots" on the endpoints, as shown in the middle plot, the model seems somewhat reasonable. However, the noise at the endpoints leads to less accurate prediction outside of the range of the data. The figure on the right shows a zoom-in to the upper right corner of the plot. A better approach is to use all of the data to obtain a tighter fit. Since we only have a couple of parameters to play with, we cannot hope to match all of the data points exactly, and instead we must settle for a *least squares* fit.

Let us for now assume that the output is 1D and the input has $n$ dimensions. Linear least squares fitting assumes the model is linear with one coefficient per input dimension. Specifically:

$$y = f(\V{x};\V{\theta}) = \sum_{k=1}^n x_k \theta_k = \V{x}^T \V{\theta}.$$

With this definition, it is straightforward to minimize a squared loss function

$$E(\V{\theta},D) = \sum_{i=1}^N \| \V{y}^{(i)} - \hat{\V{y}}^{(i)} \|^2 = \sum_{i=1}^N (y^{(i)} - \hat{y}^{(i)})^2$$

Replacing $\hat{y}^{(i)}$ with the expression for the prediction $f(\V{x}^{(i)};\V{\theta})$, we obtain

$$E(\V{\theta},D) = \sum_{i=1}^N (y^{(i)} - \V{x}^{(i)T} \V{\theta})^2.$$

At this point it is helpful to view the sum-of-squares as a dot product of a vector $\V{b}-A\V{\theta}$ with itself, where $$\V{b} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$ and the matrix $A$ is given by $$A = \begin{bmatrix} \V{x}^{(1)} & \cdots & \V{x}^{(N)} \end{bmatrix}^T = \begin{bmatrix} \V{x}^{(1)T} \\ \vdots \\ \V{x}^{(N)T} \end{bmatrix}.$$ In other words, each of the input data points is a row in A and each of the output data points is a corresponding row in $\V{b}$. Then, the error function is expressed

$$E(\V{\theta},D) = (\V{b}-A\V{\theta})^T (\V{b}-A\V{\theta}).$$

The minimizer $\V{\theta}^\star$ of the error function is given by the well-known *least-squares formula*

$$\V{\theta}^\star = (A^T A)^{-1} A^T \V{b}, \label{eq:LeastSquares}$$

which is a unique global minimum provided that $A^T A$ is invertible. This process is known as ordinary least squares (OLS).

It is common to add a constant affine term, which allows the model to predict that the zero input has a nonzero output: $$f(\V{x};\V{\theta}) = \theta_0 + \sum_{k=1}^n x_i \theta_i.$$

In this case, we can formulate the parameter vector $\V{\theta} = (\theta_0,\theta_1,\ldots,\theta_n)$ and a least-squares $A$ matrix augmented with a leading column of 1's as follows:

$$A = \begin{bmatrix} 1 & \V{x}^{(1)T} \\ \vdots & \vdots \\ 1 & \V{x}^{(N)T} \end{bmatrix}.$$

In other words, we imagine each input vector to be augmented with a single observation of 1. Simply applying the OLS formula $\eqref{eq:LeastSquares}$ directly to the augmented matrix gives us the optimal parameter vector including a leading constant.

In general, we would like to handle $m$-D outputs. The general OLS strategy is simply to estimate $m$ separate univariate models, one for each dimension.

$$y_j = f_j(\V{x};\V{\theta}_j) = \V{x}^T \V{\theta}_j \text{ for each }j=1,\ldots,m. $$

Here each of the $\V{\theta}_j$ is an $n$-D vector, giving a total of $mn$ parameters. In other words, the parameters form a matrix $\Theta$ such that the model is $\V{y} = \Theta \V{x}$, with each $\V{\theta}_j$ a row of $\Theta$.

Since the parameters for each univariate model is separable from the rest, their parameters can be estimated independently. Specifically, consider the quadratic loss $e(\V{y},\hat{\V{y}}) = \sum_{j=1}^m (y_j - \hat{y}_j)^2 = \sum_{j=1}^m (y_j - \V{\theta}_j^T \V{x})^2 $. The overall error function can be rearranged into a sum of element-wise error functions, each of which depends only on a single parameter vector $\V{\theta}_j$ as follows:

$$ E(\Theta;D) = \sum_{i=1}^N \sum_{j=1}^m (y_j^{(i)} - \V{\theta}_j^T \V{x}^{(i)})^2 = \sum_{j=1}^m \sum_{i=1}^N (y_j^{(i)} - \V{\theta}_j^T \V{x}^{(i)})^2 = \sum_{j=1}^m E_j(\V{\theta}_j;D). $$

The summands in the final expression are simply the univariate error functions for each element of $\V{y}$.

Linear models are useful in some circumstances, but are extremely limiting. In the general case we would like to be able to represent curves, oscillations, or even jumps. We may formulate a general parametric model and then run standard optimization, such as gradient descent, to minimize the error function.

Suppose for example we observe that the error of a robot joint's controller decays over time, but also obeys some oscillation. A potential parametric model for this behavior is a damped harmonic oscillator

$$y = f(x_0,\dot{x}_0,t;\theta_d,\theta_\omega) = \exp(-\theta_d t)(C_1 \cos(\theta_\omega t) + C_2 \sin(\theta_\omega t) $$

where $x_0$ is the initial error, $\dot{x}_0$ is the initial rate of change of the error, $t$ is a prediction horizon, and $y$ is the error at time $t$. $C_1$ and $C_2$ are chosen to match the initial conditions $y = x_0$ and $\frac{d}{dt}y = \dot{x}_0$ at $t=0$: $$C_1 = x_0,$$ $$C_2 = (\dot{x}_0 + \theta_d x_0)/\theta_\omega.$$

This defines a parameterized class of models with $\V{\theta} = (\theta_d,\theta_\omega)$ for which we can go ahead and simply optimize the error function over all possible parameters: $$\V{\theta}^\star = \min_{\V{\theta}} E(\V{\theta};D)$$ with $E(\V{\theta};D) = \sum_{i=1}^N e(y^{(i)},f(x_0^{(i)},\dot{x}_0^{(i)},t^{(i)};\V{\theta}))$ the sum of losses over $D$ for the parameters $\V{\theta}$.

We can simply run a minimization process, such as gradient descent, quasi-Newton, or Newton's method (see Appendix B.3) to solve this problem. However, as with any nonlinear optimization, such a process can be susceptible to local minima.

In [4]:

```
# Code for nonlinear least squares with a damped harmonic oscillator â†”
```

The leftmost plot in the above figure shows a simulated dataset of 10-second traces, sampled every half-second, and with known initial state. The dataset is subject to noise in both the time and measurement axes. For the initial guesses $\V{\theta}=(0,1)$ and $\V{\theta} = (0.5,0.1)$, optimization converges to a result very close to the true parameters. However, the initial guess $\V{\theta}=(0.2,5)$ fails to find the global optimum.

Because we cannot guarantee a global optimum of a nonlinear least squares problem, it would be helpful to leverage the benefits of OLS while allowing nonlinearity. The notion of a *feature mapping* allows us to apply OLS to estimate a quite versatile set of models. Specifically, it can be applied to estimate the coefficients of any function of the form

$$f(\V{x};\V{\theta}) = \sum_{k=1}^m \theta_k g_k(\V{x})$$

where the set of $m$ functions $g_1,\ldots,g_m$ are known as the *feature mapping*. The model hypothesizes that the output is a linear function of points in the *feature space*:

$$\V{g}(\V{x}) = \begin{bmatrix} g_1(\V{x}) \\ \vdots \\ g_m(\V{x}) \end{bmatrix}. $$

And the parameter estimation is performed as usual with the $N \times m$ matrix $$A = \begin{bmatrix} \V{g}(\V{x}^{(1)})^T \\ \vdots \\ \V{g}(\V{x}^{(N)})^T \end{bmatrix}.$$

The significance is that if the output is some linear combination of the elements of $\V{g}$, then least squares fitting will perform well.

As an example of a feature mapping is the set of polynomial models of the input. We have seen univariate polynomial models of degree $d$

$$f(x;\V{\theta}) = \sum_{k=0}^d \theta_k x^k$$

which can be expressed as a dot product between the $d+1$-D coefficient vector $(\theta_0,\ldots,\theta_d)$ and the *monomial basis functions* $(0,x,x^2,\ldots,x^d)$. We simply set the $m=d+1$ feature mapping elements to be $g_k(x) = x^{k-1}$, giving the $A$ matrix

$$A = \begin{bmatrix} 1 & x^{(1)} & {x^{(1)}}^2 & \cdots & {x^{(1)}}^d \\ \vdots & & & \vdots \\ 1 & x^{(N)} & {x^{(N)}}^2 & \cdots & {x^{(N)}}^d \end{bmatrix}. $$

An example of univariate polynomial fitting is given below.

In [5]:

```
# Code for the polynomial fitting exampleâ†”
```

There are 5 datapoints in this example, and the constant and linear models do a poor job of fitting the data. A quadratic and cubic (degree 3) fit are able to achieve low error, with the cubic polynomial doing a bit better in terms of error while oscillating a bit more than the quadratic one. The quartic (degree 4) fit goes even further still, achieving 0 error! But if we examine the behavior of the polynomial, the oscillations are quite severe, and in the range $[6,10]$ it makes an unintuitive dip downward. Most of us would agree that the purple curve would be a very poor model for this data. This suggests that perhaps goodness-of-fit is not the only criterion we should consider when deciding upon a model? We will discuss this in much more detail in the next section.

Let us delay on that point for one moment, and mention the question of polynomial bases with multivariate input. Suppose for now $\V{x}$ is 2D with a 1D output. One approach to enrich the class of linear models on this space would be to allow sums of univariate polynomials of maximum degree $d$ in each of the components $x_1$ and $x_2$. In other words we define the model class:

$$ y = \theta_{0} + \theta_{1,1} x_1 + \ldots + \theta_{1,d} x_1^d + \theta_{2,1} x_2 + \ldots + \theta_{2,d} x_2^d $$

which has $2n+1$ monomial features $\V{g}(\V{x}) = (1,x_1,x_1^2,\ldots,x_1^d,x_2,x_2^2,\ldots,x_2^d)$. In general for $d$ dimensional input we can define a similar feature space with $nd+1$ dimensions.

For even greater expressiveness, we could allow polynomial *combinations* of input elements, such as $x_1 x_2$, $x_1^2 x_2^3$, and so on and so forth. The canonical monomial basis of maximum degree $d$ consist of all possible products of powers of input elements. For example, with 3-D input and maximum degree 2, the monomial basis is $(1,x_1,x_2,x_3,x_1^2,x_2^2,x_3^2,x_1 x_2,x_1 x_3,x_2 x_3)$. In general, there are ${d+n}\choose{d}$ components in such a feature space.

Separability is a useful property that can break a complex estimation process into several smaller estimation processes. Suppose we can map our parameter vector into subsets $$\V{\theta} = \begin{bmatrix}\V{\theta}_A \\ \V{\theta}_B \end{bmatrix}$$ so that the error function can be broken into two independent summands: $$ E(\V{\theta};D) = E_A(\V{\theta}_A;D) + E_B(\V{\theta}_B;D). $$ Then, we can estimate each of the components separately by optimizing $E_A$ over $\V{\theta}_A$, optimizing $E_B$ over $\V{\theta}_B$, and then reassembling the results to obtain $\V{\theta}$. Specifically, $$ \V{\theta}^\star = \arg \min_\theta E(\V{\theta};D) = \begin{bmatrix} \V{\theta}_A^\star \\ \V{\theta}_B^\star \end{bmatrix} = \begin{bmatrix} \arg \min_{\theta_A} E_A(\V{\theta}_A;D) \\ \arg \min_{\theta_B} E_B(\V{\theta}_A;D) \end{bmatrix}.$$

A similar construction can be used to separate a large estimation into an arbitrary number of independent estimations.

As shown above in the polynomial fitting example, it is critically important to realize that goodness-of-fit is only one comopnent of the quality of a model. The reason is that if the model is given enough parameters to tune, it may be able to fit a given dataset $D$ arbitrarily well. The ability of a model to fit datasets of a given size is also known as the *capacity* of the model, and there are indeed models, like neural networks, which can be configured to have vast amounts of capacity. Let us step back a moment and realize that $D$ only contains a small sampling of the infinite possible situations that we would actually like our model to fit well. So, our actual goal is to build a model that *generalizes well* from a small dataset to the entire space of inputs.

*Overfitting* is the name given to the phenomenon where a model achieves a good fit to $D$, while generalizing poorly to inputs that are outside of $D$. Avoiding overfitting is the primary challenge in estimation problems found in calibration as well as statistics and machine learning. How can we be sure a model generalizes when we are only presented with a finite number of data points? And how ought we deal with the practical labor / time constraints of data gathering, which encourage us to find a dataset as small as possible?

One possibility is to use an auxiliary draw of the data $D^\prime$ to represent points outside of the dataset used to estimate the model. In this way, $D$ is called the *training set* and $D^\prime$ is called the *testing set*. Both are typically assumed to be drawn from the same underlying distribution of real-world conditions. The value of the error function $E(\theta^\star;D^\prime)$ on the training set $E(\theta^\star;D)$ is called the *training error* while the value of the error function on the testing set is the *testing error*. If a model has both low training error and testing error, then assuming the testing set represents real-world conditions, then we can be reasonably assured that it will generalize well. On the other hand, if testing error is large, then it can be concluded that the model is overfitting.

It is important in this process **not to train the model on any elements in the testing set**. The value $\theta^\star$ is *fixed* after minimizing $\theta^\star \gets \arg \min_\theta E(\theta;D)$, and the testing error is evaluated without re-training on $D^\star$. It is often tempting for a beginner to seek lower testing error by also training on $D^\star$ (i.e., optimizing $E(\theta;D \cup D^\star)$. But this is the cardinal sin of estimation! By including elements of the test set into training, the model is "peeking" at the test, and hence the seemingly positive results are actually bogus.

TODO: *Cross-validation*

TODO: describe L2, L1 regularization.

Performance of a calibration should be reported using some error metric both on testing and training sets.

Goodness-of-fit can be reported in terms of the following aggregate metrics:

**Sum of squared errors (SSE)**:
$$\sum_{i=1} \|y^{(i)} - \hat{y}^{(i)}\|^2$$

- Advantages: smooth derivatives
- Disdvantages: varies by number of observations, units are units-of-y squared

**Mean squared error (MSE)**:
$$\frac{1}{N} \sum_{i=1} \|y^{(i)} - \hat{y}^{(i)}\|^2$$

- Advantages: smooth derivatives, gives a sense of average performance. Same optimum as SSE.
- Disdvantages: units are units-of-y squared

**Root mean squared error (RMSE)**:
$$\sqrt{\frac{1}{N} \sum_{i=1} \|y^{(i)} - \hat{y}^{(i)}\|^2}$$

- Advantages: units are same as units-of-y, gives a sense of average performance. Same optimum as SSE/MSE.
- Disdvantages: derivatives are not smooth

**Mean absolute error (MAE)**:
$$\frac{1}{N} \sum_{i=1} \|y^{(i)} - \hat{y}^{(i)}\|$$

- Advantages: units are same as units-of-y, gives a sense of average performance
- Disdvantages: derivatives are even less smooth than RMSE

It can generally be concluded that SSE or MSE are the most convenient for optimization, while RMSE and MAE are most convenient for reporting results. These can also be generalized to other metrics than the Euclidean distance $d(y,\hat{y)) = \|y - \hat{y}\|$.

In many parametric calibration problems the actual *values of the parameters* are more important to estimate accurately, rather than a low error value. For example, if we are calibrating a camera pose, it is not so important what the locations of the markers were during the calibration process. But the error function directly measures how closely the estimated pose predicts the locations of the markers. We should hope that if the parameters fit well, then these parameters should also be close to their true values. But this is not always the case! In problems that are *identifiable*, we can ensure that parameter estimates will be accurate to some amount proportional to the goodness-of-fit. In problems that are *non-identifiable*, such guarantees cannot be made.

More precisely, we are concerned with whether the estimated parameter values $\hat{\V{\theta}} = \arg \min_{\V{\theta}} E(\V{\theta};D) $ are close to their actual values $\V{\theta}$. Specifically, we are concerned with the estimation error:
$$\| \V{\theta} - \hat{\V{\theta}} \|$$
One cause of estimation error is that the dataset $D$ has a finite number of samples which may be noisy, and this noise may affect some dimensions of $\V{\theta}^\star$ more than others. By studying how the error function relates to parameter variations, we can quantify to a first degree how susceptible our model is to noise. A second, more severe cause of error is that there may be *multiple parameters that achieve the minimum fitting error*. Such cases are called non-identifiable.

Let us study the quadratic Taylor expansion of $E$ centered at an optimum $\theta^\star$: $$E(\theta;D) \approx E(\theta^\star;D) + \nabla_\theta E(\theta^\star;D) (\theta - \theta^\star) + \frac{1}{2} (\theta - \theta^\star) \nabla^2_\theta E(\theta^\star;D)(\theta - \theta^\star). $$ We know that the gradient $\nabla_\theta E(\theta^\star;D)=0$ because it is an optimum. Hence the behavior of the error function as a function of $\theta$ is dominated by the term $\frac{1}{2} (\theta - \theta^\star) \nabla^2_\theta E(\theta^\star;D)(\theta - \theta^\star)$. The Hessian matrix $$H = \nabla^2_\theta E(\theta^\star;D)$$ is the critical quantity here. If $H$ is rank deficient, there exists one or more dimensions in $\theta$ space that locally do not change the error function value. Hence, a problem is non-identifiable when $H$ is rank deficient.

A common case of rank-deficiency is when there are fewer values in the dataset than unknown parameters: $N dim(\V{y}) < dim(\V{\theta})$. The solution is to add more data points. Another common case is a geometric degeneracy. This can happen when trying to estimate rotations from the 3D locations of two points: there is ambiguity in the axis of rotation. Another geometric degeneracy is when trying to estimate both the size and distance of an object from a camera image alone.

Let us now return to the question of parameter sensitivity. Since $E$ is at a minimum, $H$ is a symmetric positive semi-definite matrix. This implies that the shape of $E$ is locally approximated by a quadratic "bowl" shape. The eigenvalues of $H$ determine the steepness of the bowl in different dimensions. If $H$ has a small eigenvalue $\lambda_i$ with corresponding eigenvector $v_i$, then changes of $\theta$ in the $\pm v_i$ dimension will have a correspondingly small change of cost. In particular, if $\theta$ moves from the optimum by distance $\pm 1/\sqrt{\lambda_i}$ in the direction $v_i$, then the cost will change by approximately 1/2.

We can make this more precise by a statistical argument. If we considered the selection of the dataset $D$ as an "experiment", and by running more experiments we could draw an increasing number of datasets $D_1,D_2,D_3,\ldots$. For each of these datasets, we could run parameter estimation to obtain estimates $\hat{\theta}_1,\hat{\theta}_2,\hat{\theta}_3,\ldots$, which would not in generally be the same: because of noise in the drawn dataset, each of these estimates will be somewhat different. We can study how far $\hat{\theta}_1,\hat{\theta}_2,\hat{\theta}_3,\ldots$ deviate from the true value $\theta$. In particular, the covariance matrix
$$Var(\theta) = \lim_{N\rightarrow \infty} 1/N \sum_{i=1}^N (\theta_i - \theta) (\theta_i - \theta)^T$$
tells us how wildly the estimates deviate from the true value. In particular, the square root of the diagonal of $Var(\theta)$ is called the *standard error*, and tells us how far $\hat{\theta}$ deviates from $\theta$ on average.

Via a derivation beyond the scope of this course, if we knew that the probability of drawing a particular $y$ for a given $x$ is proportional to $exp(-0.5 e(y,f(x,\theta))$, it can be shown that the inverse of the Hessian matrix gives an approximation of the covariance matrix: $$Var(\theta) \approx H^{-1}. $$ In other words, the "sharper" the error function at the optimum, the less variability we should expect to see in the parameter estimates.

In general, the variance decreases proportionally to the inverse of $|D|$, which means the standard error decreases proportionally to the inverse square root of $|D|$. Hence, to increase accuracy by a factor of 2, the size of the dataset must be increased by a factor of 4; to increase accuracy by a factor of 10, the size of the dataset must be increased by a factor of 100.

In many calibration problems we would like to estimate some parameters of concern $\theta_C$ accurately, but in order to do so we need to estimate other terms in the optimization. These *nuisance parameters* are not observables or parameters of concern, but rather deal with some variable encountered during the calibration process. For example, to calibrate a multi-camera motion capture system, a human will wave a marker in front of the cameras across the workspace. The trajectory taken by the marker is not important, but we certainly don't have ground truth on this. Hence, it will need to be optimized.

To optimize nuisance parameters, we simply gather them in a vector $\theta_N$ and optimize over the concatenated parameters $\theta = (\theta_C,\theta_N)$. However, one problem that may arise during the design of the calibration procedure is when nuisance parameters are added for each observation. Then the estimation problem may become unidentifiable (particularly when the number of nuisance parameters exceeds the number of outputs for an observation), or very high dimensional.

As an example, suppose we are trying to optimize a robot's gripper location (center of the gripper opening) by moving the robot to touch an object at the same local point, but when the object is at a variety of different poses. We observe configurations $q^{(1)},\ldots,q^{(N)}$ and object transforms $T_O^{(1)},\ldots,T_O^{(N)}$. We wish to estimate $\V{x}_G^k$, the gripper location on link $k$. We can derive the link transforms $T_k^{(i)} = T_k(q^{(i)})$ from forward kinematics. But we do not know the gripper point relative to the object ${\V{x}_G^O}^{(i)}$, so we consider letting $\theta_N = ({\V{x}_G^O}^{(1)},\ldots,{\V{x}_G^O}^{(N)})$. Is this identifiable? Let's examine the error function. The $i$'th gripper point is $\V{x}_G^{(i)} = T_k^{(i)} \V{x}_G^k$, so the predicted gripper coordinates in the object frames are ${T_O^{(i)}}^{-1} T_k^{(i)} \V{x}_G^k$. We would like ${T_O^{(i)}}^{-1} T_k^{(i)} \V{x}_G^k \approx {\V{x}_G^O}^{(i)}$ for all $i=1,\ldots,N$. This gives only $3N$ constraints and $3+3N$ unknowns! Hence, this is not identifiable.

Some care must be taken during the setup procedure. If we can constrain some of the object-relative gripper positions to be equal, so that $\V{x}_G^O \equiv {\V{x}_G^O}^{(1)}=\cdots={\V{x}_G^O}^{(N)}$, then we can set $\theta_N = \V{x}_G^O$, and we have only 3 nuisance parameters. The number of constraints remains at $3N$ and the number of unknowns becomes $3+3=6$.

Calibration problems with rigid transforms require some special care to estimate reliably and efficiently due to the use of rotation parameters, which tend to exhibit problems with singularities, nuisance parameters, and local minima. But due to the special structure of rotation matrices, for some estimation problems we can derive analytical solutions.

In this section we will often use the fact that the mean of a set of points minimizes the sum-of-squared distances to points in the set. More precisely, $$ \overline{\V{x}} = 1/N \sum_{i=1}^N \V{x}^{(i)} = \arg \min_{\V{x}} \|\V{x} - \V{x}^{(i)}\|^2.$$ This gives a closed form solution to many common estimation problems.

A common problem in robotics and computer vision is to find a rotation or rigid transform to match points in one set to another. Specifically, assume we have points $\V{a}^{(1)},\ldots,\V{a}^{(N)}$ defined in some local frame, with which would like to match to points $\V{b}^{(1)},\ldots,\V{b}^{(N)}$ defined in some world frame. As it turns out, there is an efficient, analytical way to determine the optimal transform. (It is important to note that this process assumes a direct, one-to-one matching between point sets; that is, point $\V{a}^{(i)}$ should be matched to $\V{b}^{(i)}$ after rotation for all $i$. This is different from an unmatched problem, in which a point in the first set can be matched to any point in the second, and we must figure out both the rigid transform and the association between points. For this latter problem, see the ICP algorithm.)

To find a rotation matrix $\V{b} = R\V{a}$, we would like to minimize the sum of squared errors $$E(R) = \sum_{i=1}^N \| R \V{a}^{(i)} - \V{b}^{(i)} \|^2 $$ over orthogonal matrices $R$. To find a rigid transformation $\V{b} = R\V{a} + \V{t}$, we minimize $$E(R,\V{t}) = \sum_{i=1}^N \| R \V{a}^{(i)} + \V{t} - \V{b}^{(i)} \|^2 $$ over orthogonal matrices $R$ and translations $\V{t}$.

Let us first examine the rotation fitting problem, which is also known as the *orthogonal Procrustes problem*. Let us define the loss function as a Frobeneus norm of a matrix expression
$$ E(R) = \| R A - B \|_F^2 $$
Where the Frobeneus norm is the most natural analogue to the Euclidean norm for matrices:
$$\|X\|_F = \sqrt{\sum_{i,j} x_{i,j}^2}.$$

To do so, we set up $A = \begin{bmatrix} \V{a}^{(1)} & \cdots & \V{a}^{(N)} \end{bmatrix}$ and $B = \begin{bmatrix} \V{b}^{(1)} & \cdots & \V{b}^{(N)} \end{bmatrix}$. Observe that $RA$ is the matrix whose columns are all the rotated $\V{a}$ vectors, and the Frobeneus norm sums up all the squared errors for each rotated vector.

If we denote the matrix inner-product $A \cdot B = \sum_{i,j} a_{i,j} b_{i,j} = tr(A^T B)$, we have $$ E(R) = (R A - B)\cdot (R A - B) = \|A\|_F^2 + \|B\|_F^2 - (RA \cdot B) $$ Since we are minimizing over $R$ and the first two summands do not depend on $R$, we have $$ \arg \min_R E(R) = \arg \max_R (RA \cdot B), $$ so let us focus on the last term. By rewriting the inner product as a trace, $(RA \cdot B) = tr(A^T R B)$, and since the trace is invariant under cyclic permutations, $tr(A^T R B) = tr(R B A^T)$. Hence, we would like to find $$\arg \max_R (R^T \cdot B A^T). $$ Let us perform the singular value decomposition of $BA^T = U \Sigma V^T$ with $\Sigma$ diagonal, and $U$ and $V$ orthonormal. We can see that $(R^T \cdot B A^T) = (R^T \cdot U \Sigma V^T) = tr(R U \Sigma V^T) = tr(V \Sigma U^T R) = tr(\Sigma, U^T R V) = (\Sigma \cdot U^T R V)$.

Since $U^T R V$ is an orthonormal matrix and $\Sigma$ is diagonal and nonnegative, the maximizer of $(\Sigma \cdot U^T R V)$ over all possible orthonormal matrices should satisfy $(U^T R V = I)$. Hence, $R = U V^T$ is the optimum.
(*Note that if $\Sigma$ contains any zero entries on the diagonal, then the point set is degenerate, and it is not guaranteed that $U$ or $V$ calculated by an SVD has positive determinant. Some care must therefore be taken to flip the row of $U$ and column of $V$ corresponding to such values so that the determinant is indeed positive.*)

Let us return to the transform fitting problem. Let us consider the subproblem of minimizing $E(R,\V{t})$ with respect to $\V{t}$, which would require setting its gradient to 0. First we take the gradient of the summand: $$\frac{\partial}{\partial \V{t}} \| R \V{a}^{(i)} + \V{t} - \V{b}^{(i)} \|^2 = \frac{\partial}{\partial \V{t}}\left(\|\V{t}\|^2 + 2\V{t}^T (R \V{a}^{(i)} - \V{b}^{(i)}) + \| R \V{a}^{(i)} - \V{b}^{(i)} \|^2\right) = 2\V{t} + 2 (R \V{a}^{(i)} - \V{b}^{(i)}). $$ Next we replace this into the gradient of the error function: $$\frac{\partial E}{\partial V{t}} = \sum_{i=1}^{N} \frac{\partial}{\partial \V{t}} \| R \V{a}^{(i)} + \V{t} - \V{b}^{(i)} \|^2 = 2 \sum_{i=1}^{N} (\V{t} + R \V{a}^{(i)} - \V{b}^{(i)}).$$ Setting this to 0, dividing by 2N, and pulling out terms from the summation that do not depend on $i$, we see that $$ 0 = 1/2 \frac{\partial E}{\partial V{t}} = N\V{t} + R \sum_{i=1}^{N} \V{a}^{(i)} + \sum_{i=1}^{N} \V{b}^{(i)}.$$ If we were to divide this expression by $N$, we can see that the sums become means, and have $$\V{t} = \overline{\V{b}} - R \overline{\V{a}}$$ with $\overline{\V{a}} = 1/N \sum_{i=1}^{N}a^{(i)}$ and $\overline{\V{b}} = 1/N \sum_{i=1}^{N}b^{(i)}$.

Replacing the expression for the optimal $\V{t}^\star$ back in the error function, we get $E(R,\V{t}^\star) = \sum_{i=1}^N \| R (\V{a}^{(i)}-\overline{\V{a}}) - (\V{b}^{(i)}-\overline{\V{b}}) \|^2 $. This can be thought of as simply a rotation fitting problem on the shifted datasets $\tilde{\V{a}}^{(i)} = \V{a}^{(i)}-\overline{\V{a}}$ and $\tilde{\V{b}}^{(i)} = \V{b}^{(i)}-\overline{\V{b}}$, for $i=1,\ldots,N$.

The overall running time of this procedure is O(N) if we assume dimension is constant, because no matrix has more than O(N) entries and the SVD is performed on a 2x2 or 3x3 matrix. The below figures show the result of this proceedure in 2D, where each vertex in the black polygon is associated with 10 blue points. The estimated transform is drawn as the red and green frame.

In [6]:

```
#Code for point set fitting figureâ†”
```

One common problem with rigid transforms is how to fit a rotation matrix $R$ to a dataset of rotations $R^{(1)},\ldots,R^{(N)}$.

First, we may assume some rotation matrix parameterization $R(\V{\theta})$. It may be possible to use a minimal parameterization of SO(3), such as Euler angles or rotation vectors, but these exhibit problems with singularities and the domain of validity. It may also be possible to use the 9 parameters of 3x3 matrices, but these need to be constrained to obey the orthogonality constraints of SO(3).

If the parameterization is not minimal, we must include in our optimization constraints on the parameters $h(\V{\theta})=0$ and use constrained optimization techniques. Another alternative is to use *projection techniques*, which automatically cast a parameter vector into the range of valid parameters. Most easily this is illustrated with quaternions. If we consider the range of all quaternions $\V{\theta} \in \mathbb{R}^4$, we would like to restrict ourselves to the set of unit quaternions $\V{\theta} \in S^4$ because this is the manifold of quaternions that represent rotations. The quaternion-to-rotation conversion $R_q(\V{\theta}/\|\V{\theta}\|)$ can be fed the normalized quaternion, which defines a valid function in all of $\mathbb{R}^4$ except at the origin.

Second, it is not clear how to even measure goodness of fit between rotation matrices. The Frobeneus norm gives a rotation loss function $e_F(R,\hat{R}) = \| R-\hat{R} \|_F^2$ that is analogous to the quadratic loss. We could also use the absolute angle loss $e_\theta(R,\hat{R}) = angle(R^T \hat{R})$ with $angle(R) = \cos^{-1}((tr(R)-1)/2)$ the absolute angle as described in Chapter 4.

It can be shown, with a bit of work, that $e_F = 4 - 4 \cos e_\theta$, which is monotonically increasing on the range $e_\theta \in [0,\pi]$. There is a slight advantage in using $e_\theta$ as the loss function, which is that the cosine term in the $e_1$ loss function diminishes the importance of small errors, so that the cost of a larger absolute angular deviation is proportionally much higher than the sum of many smaller deviations. This makes the Frobeneus norm more sensitive to outliers.

We can use the Procrustes problem to derive an analytic solution to the minimum Frobeneus loss problem. The Frobeneus loss can be reinterpreted as an orthgonal matching problem between the three basis vectors rotated by $R$ and the rotated basis vectors in the dataset: $$ e_F(R,\hat{R}) = \sum_{j=1}^3 \| \hat{R} \V{e}_j - R \V{e}_j \|^2. $$

To determine the optimal $\hat{R}$, we can use the SVD method as defined in the above section).

We discuss two iterative numerical methods for rotation averaging, which operate on the square of absolute angle loss $e_\theta^2$.

One approach is quaternion averaging with projection. The mean of the rotations is taken in quaternion space, which produces a non-unit quaternion, then this is projected to the unit hypersphere. Letting $q^{(1)},\ldots,q^{(N)}$ be the quaternion representations of $R^{(1)},\ldots,R^{(N)}$, we can determine the mean $\overline{q} = 1/N \sum_{i=1}^N q^{(i)}$. The unit quaternion in this direction, $\overline{q} / \| \overline{q}\|$, is then converted to a rotation matrix. There are two issues with this approach. First, due to the dual representation of quaternions ($q$ and $-q$ both represent the same rotation) it is important to ensure that each of the quaternions are assigned a sign to keep them in the same hemisphere and ideally packed closely together. One approach to doing this is to incrementally perform averages throughout the dataset, choosing the sign of the next quaternion to minimize distance to the running average. Second, there is no clear loss function that this approach minimizes. Nevertheless, we can still guarantee that as the angular spread of the rotation samples grows closer, the average performed in this way will approach the true best-fit rotation.

A second approach is incremental geodesic interpolation. Compared to quaternion averaging, this can do a better job approaching the true optimum when the spread of the samples is large. The idea is that taking the mean of a set of points in Cartesian space can be interpreted as a sequence of incremental linear interpolations. If we use the same procedure but replace linear interpolations with geodesic interpolation on SO(3) we obtain an analogue to the mean.

The main idea exploits the fact that in Cartesian space, the mean of a set of $N$ points can be derived from by interpolating between the the $N$'th point and the mean of the first $N-1$ points. Let $\overline{\V{x}}_N$ be the mean of the first $N$ points and $\overline{\V{x}}_{N-1}$ be the mean of the first $N-1$ points. We see that $$\overline{\V{x}}_N = \frac{1}{N} \sum_{i=1}^N \V{x}^{(i)} = \frac{1}{N} (\V{x}^{(N)} + \sum_{i=1}^{N-1} \V{x}^{(i)})$$ Using $$\overline{\V{x}}_{N-1} = \frac{1}{N-1} \sum_{i=1}^{N-1} \V{x}^{(i)})$$ to derive $\sum_{i=1}^{N-1} \V{x}^{(i)} = (N-1)\overline{\V{x}}_{N-1}$, we can replace this in the above expression to get $$\overline{\V{x}}_N = \frac{1}{N} \V{x}^{(N)} + \frac{N-1}{N} \overline{\V{x}}_{N-1}.$$ This is simply a linear interpolation $1/N$ of the way from $\V{x}^{(N)}$ to $\overline{\V{x}}_{N-1}$.

Let us apply this to rotations using the SO(3) geodesic interpolation discussed in Chapter 4. Let us denote $\overline{R}_k$ as the average of the first $k$ rotations defined in the following sense. The base case is $\overline{R}_1 = R^{(1)}$. For $k>1$, define $$\overline{R}_k = SO3interp(R^{(k)},\overline{R}_{k-1},1/k).$$ Then, $\overline{R}_N$ is an estimate of the average rotation.

This procedure does have a drawback in that the order of rotation samples in the dataset affects the final result. To obtain even better results, this procedure can continue multiple passes through the dataset (with $k$ continually increasing), as though the dataset were duplicated some number of times. In the limit of an infinite number of passes, this procedure will converge regardless of the order of the dataset.

In [7]:

```
# Code to generate the rotation interpolation figureâ†”
```

The above figure show results for a dataset of 50 rotations centered around some "ground truth," with 5 outliers. The 3D plot shows the axes of each of the sample points, as well as the ground truth drawn as a solid line. The dataset is generated so the Z axis (blue) has a greater spread than the Y (green) or X (red) axes. Two estimates are drawn in dotted lines. The table compares numerical optimization of three different metrics (using Euler angle representation) along with the methods discussed above. It shows that each method is able to achieve relatively similar results, but have various strengths in terms of the computation time and the optimized metric. (The column listing ground truth error is not a strong signal of the performance of a given method; it is rather sensitive to the distribution of outliers. With multiple draws of the dataset, the best method varies significantly.)

Any of the prior procedures can be extended almost trivially to average several transforms $T^{(1)},\ldots,T^{(N)}$. Let $R^{(i)},\V{t}^{(i)}$ be the rotation and translation components of $T^{(i)}$, respectively. Via separability of the rotation and translation errors, we can simply generate $\overline{R}$ using any of the procedures described above, and generate $\overline{\V{t}}$ by calculating the mean translation.

Suppose we have a fixed-base robot with an eye-in-hand camera on link $k$, and we would like to estimate the pose of the camera $T_C^k$ relative to link $k$. We can use the methods above to perform this extrinsic calibration process in an efficient manner, given either position or pose readings from fiducial markers. The general procedure is as follows:

Attach $M$ fiducial markers, such as brightly colored spheres or augmented reality targets to a fixed location in the world.

Move the robot to several configurations $q^{(1)},\ldots,q^{(N)}$ in which the marker(s) are observed by the camera. The camera provides observations $z_{m1}^{(1)},\ldots,z_{m1}^{(N)}$ of marker 1, $z_{m2}^{(1)},\ldots,z_{m2}^{(N)}$ of marker 2, and so on. (If the $j$'th marker is not observed at the $i$'th configuration, we allow $z_{mj}^{(i)} = nil$.)

Optimize the camera transform $T_C^k$ and marker transforms $T_{m1},\ldots,T_{mM}$ to minimize prediction error.

(Note that this procedure assumes the marker identity is associated with the set of readings for that particular marker. This can be done using distinctive marker colors, using augmented reality markers that encode a marker ID, or using human annotation.)

At each robot configuration $q$, link $k$ is known to have transform $T_k(q)$, which is provided through forward kinematics. We also assume the camera's intrinsic parameters are calibrated, so the observations are accurately provided in the camera's local coordinates.

Let us assume that the observations consist of marker *poses*, such as what would be produced by augmented reality fiducials (or QR codes). Hence, $z_{mj}^{(i)}$ is an approximate reading of $T_{mj}^C(q^{(i)})$, the marker pose relative to the camera. Let us begin with the simple case of $M=1$, and assume the configurations are chosen to ensure that the marker is always observed. In this case, if $T_C^k$ and $T_{m1}$ were known, the expected reading at $q^{(i)}$ would be

$$\hat{T}_{m1}^C = T_C^{-1} T_{m1} = (T_k(q^{(i)})T_C^k)^{-1} T_{m1} $$

However, we do not know $T_C^k$ and $T_{m1}$, so we can interpret this as a parametric model over $T_C^k$ and $T_{m1}$ (with $T_{m1}$ the nuisance parameter). Letting $R$ and $\V{t}$ denote the rotation and translation parts of a transform $T$, we can use this to define the reprojection loss

$$e(T_{m1}^C,\hat{T}_{m1}^C) = w_\theta e_\theta^2(R_{m1}^C,\hat{R}_{m1}^C) + \|\V{t}_{m1}^C - \hat{\V{t}}_{m1}^C \|^2 $$

where $w_\theta$ weights the cost of the rotation component relative to the translation component. This loss can be directly used in a numerical optimization over 6+6 parameters encoding $T_C^k$ and $T_{m1}$, e.g., using an Euler angle encoding of $R_C^k$ and $R_{m1}$.

Another approach, known as *block coordinate descent*, can be derived using the transform averaging procedure described above. This procedure works in two phases. In the first phase, we assume that we know $T_C^k$, and then optimize for $T_{m1}$. In the second phase, we assume that we know $T_{m1}$, and then optimize $T_C^k$. Repeating this process several times can converge to the optimum. In each phase we may use transform averaging to efficiently estimate the procedure from data, as follows.

We can see without difficulty that the reprojection loss is transform-invariant: $$ e(T,\hat{T}) = e(T^\prime T,T^\prime \hat{T}) = e(T T^\prime,\hat{T} T^\prime).$$ So, for phase 1, we can rewrite $$e(T_{m1}^C,\hat{T}_{m1}^C) = e(T_{m1}^C,{T_C^k}^{-1} T_k^{-1}(q) T_{m1}) = e(T_k(q) T_C^k T_{m1}^C, T_{m1}) $$ If we examine the overall error function as a function of the unknown $T_{m1}$: $$E(T_{m1};D) = \sum_{i=1}^N e(z_{m1}^{(i)},\hat{T}_{m1}^C) = \sum_{i=1}^N e(T_k(q^{(i)}) T_C^k z_{m1}^{(i)}, T_{m1})$$ we can see that this is simply a transform averaging problem over $\{ T_k(q^{(i)}) T_C^k z_{m1}^{(i)} \text{ for } i=1,\ldots,N \}$.

Likewise, we can rearrange the loss function for phase 2 as follows: $$e(T_{m1}^C,\hat{T}_{m1}^C) = e(T_C^k, T_k^{-1}(q) T_{m1} {T_{m1}^C}^{-1})$$ so that estimating $T_C^k$ in phase 2 is simply a transform averaging problem over $\{ T_k^{-1}(q^{(i)}) T_{m1} {z_{m1}^{(i)}}^{-1} \text{ for } i=1,\ldots,N \}$.

In both standard optimization and in block coordinate descent we must take care to 1) ensure sufficient variability in the configurations $q_{(1)},\ldots,q^{(N)}$ with $N \geq 2$, and 2) input a relatively good set of initial guesses for $T_C^k$ and $T_{m1}$.

Now let us consider the case where the markers are point measurements, and they are always observed at each configuration. Although it is possible to calibrate camera transform and marker transforms given a single marker, this usually requires the robot to move to many more configurations, since only $x$, $y$ coordinates are measured accurately at a single image. Given 3 or more markers, it is possible to construct a fiducial reference frame $T_F$ in which the local positions of the markers $\V{m}_1^F,\ldots,\V{m}_M^F$ are known. For example, fix the origin to $\V{m}_1$, the X axis to the direction from $\V{m}_1$ to $\V{m}_2$, and the Z axis to the perpendicular normal between $\V{m}_2-\V{m}_1$ and $\V{m}_3-\V{m}_1$ (assuming $\V{m}_1,\ldots,\V{m}_3$ are not colinear).

Because markers will likely be identified on a rectified camera image, the observations $z_{mj}^{(i)}$ will be x-y pixel coordinates determined by projection in an idealized pinhole camera. The squared reprojection loss for marker $j$ $$e(z_{mj},\hat{z}_{mj}) = \|z_{mj}-\hat{z}_{mj}\|^2$$ has units in pixels squared, with $$\hat{z}_{mj} = p($\V{m}_j^C$) = \begin{bmatrix} c_x x_{mj}^C / z_{mj}^C + d_x \\ c_y y_{mj}^C / z_{mj}^C + d_y \end{bmatrix} $$ where $\V{m}_j^C = (x_{mj}^C,y_{mj}^C,z_{mj}^C)$ are the (predicted) marker coordinates in the camera frame. As usual, the Z axis of the camera frame is assumed to be forward, and $c_x,c_y,d_x,d_y$ are scaling factors determined by the camera intrinsic calibration.

$\V{m}_j^C$ is a function of the robot configuration $q$, the unknown extrinsic parameters $T_C^k$ and the nuisance parameters $T_F$. Specifically, $$\V{m}_j^C(q) = (T_k(q) T_C^k)^{-1} T_F \V{m}_j^F.$$ The overall error then sums the reprojection error over all markers and configurations $$E(T_C^k,T_F;D) = \sum_{i=1}^N \sum_{j=1}^M e(z_{mj},p(\V{m}_j^C(q^{(i)})).$$

Estimation is typically accomplished through numerical optimization of a joint 12D parameterization of $T_C^k$ and $T_F$. Due to the singularity where any $z_{mj}=0$, the initial guesses for these transforms must place each marker in front of the camera.

Another common extrinsic calibration procedure is to determine the pose of a static camera $T_C$ relative to the world. This procedure will make use the robot's knowledge from forward kinematics. The general procedure is as follows:

Attach $M \geq 1$ fiducial markers to the robot on link $k$. $k$ is usually the end effector link.

Move the robot to several configurations $q^{(1)},\ldots,q^{(N)}$ in which the marker(s) are observed by the camera. The camera provides observations $z_{m1}^{(1)},\ldots,z_{m1}^{(N)}$ of marker 1, $z_{m2}^{(1)},\ldots,z_{m2}^{(N)}$ of marker 2, and so on. (If the $j$'th marker is not observed at the $i$'th configuration, we allow $z_{mj}^{(i)} = nil$.)

Optimize the camera transform $T_C$ and link-relative marker transforms $T_{m1}^k,\ldots,T_{mM}^k$ to minimize prediction error.

The model is quite similar to the eye-in-hand case, except the loss is converted to the camera frame as follows: $$e(T_{mj}^C,\hat{T}_{mj}^C) = e(T_{mj}^C,T_C^{-1}T_k(q)T_{mj}^k).$$ As usual, we can optimize over $T_C$ and $T_{m1}^k,\ldots,T_{mM}^k$ using a numerical method.

In the case of pose measurements, we can use the block coordinate descent method. The first phase optimizes $T_C$ as an average of $\{ T_k(q^{(i)})T_{mj}^k{z_{mj}^{(i)}}^{-1} \text{ for } i=1,\ldots,N \} $, and the second phase optimizes $T_{mj}^k$ as an average of $\{ T_k^{-1}(q^{(i)})T_C z_{mj}^{(i)} \text{ for }i=1,\ldots,N \}$.