Week #7: Least Squares Estimation and GraphSLAM
Least Squares Estimation
Maximum Likelihood Estimation (MLE)
Maximum a Posteriori Estimation (MAP)
Bayesian Estimation
GraphSLAM
In the occupancy grid mapping problem we wanted to compute \(p(\mathbf{m}|\mathbf{z}_{1:t}, \mathbf{x}_{1:t})\) over all possible maps.
We can see this problem as a specific instance within a category of problems where we are given data (observations) and we want to “explain” or fit the data using a parametric function.
Example: we think that the 2D data was generated by a line \(z = \theta_{0} + \theta_{1}x\) whose parameters we do not know, and was corrupted by noise.
We are given data points \(\mathbf{(x_1, z_1), \dots, (x_N, z_N)}\)
We think that the data was generated by a parametric function \(\mathbf{z = h(\theta, x)}\)
This parametric model will have a fitting error: \[e(\theta)=\sum_{i=1}^{N}||\mathbf{z}_{i}-\mathbf{h}(\theta,\mathbf{x}_{i})||^{2}\]
The least-squares estimator is: \[\theta_{LS} = \underset{\theta}{\operatorname*{argmin}} e(\theta)\]
Now suppose the data was generated by a parametric function that is linear in the parameters, \(\mathbf{z = h(\theta, x) = H_x\theta}\), where \(\mathbf{H_x}\) is a matrix whose elements depend on \(\mathbf{x}\)
This parametric model will have a fitting error:
\[\begin{align} e(\theta) &= \sum_{i=1}^{N} ||\mathbf{z}_i - \mathbf{H}_{x_i} \theta||^2 \\ &= \sum_{i=1}^{N} \mathbf{z}_i^T \mathbf{z}_i - 2\theta^T \mathbf{H}_{x_i}^T \mathbf{z}_i + \theta^T \mathbf{H}_{x_i}^T \mathbf{H}_{x_i} \theta \end{align}\]
The least-squares estimator minimizes the error:
\[\frac{\partial e(\theta)}{\partial \theta} = 0 \Leftrightarrow -2\sum_{i=1}^{N} \mathbf{H}_{x_i}^T \mathbf{z}_i + 2\sum_{i=1}^{N}\mathbf{H}_{x_i}^T \mathbf{H}_{x_i} \theta = 0 \Leftrightarrow \left[\sum_{i=1}^{N} \mathbf{H}_{x_i}^T \mathbf{H}_{x_i}\right] \theta = \sum_{i=1}^{N} \mathbf{H}_{x_i}^T \mathbf{z}_i\]
\[\theta_{LS} = \operatorname*{argmin}_{\theta} e(\theta) \Leftrightarrow \left[\sum_{i=1}^{N} \mathbf{H}_{x_i}^T \mathbf{H}_{x_i}\right] \theta_{LS} = \sum_{i=1}^{N} \mathbf{H}_{x_i}^T \mathbf{z}_i\]
Example: we think that the 2D data was generated by a line \(z = \theta_{0} + \theta_{1}x\) whose parameters we do not know
We are given data points \((x_1, z_1), \dots, (x_N, z_N)\)
We think that the data was generated by a linear parametric function \(z = h(\mathbf{\theta}, x) = \begin{bmatrix} 1 & x \end{bmatrix}\mathbf{\theta} = \theta_{0} + \theta_{1}x\)
This parametric model will have a fitting error: \[e(\theta_{0},\theta_{1})=\sum_{i=1}^{N}(z_{i}-\theta_{0}-\theta_{1}x_{i})^{2}\]
The least-squares estimator minimizes the error:
\[\theta_{LS} = \operatorname*{argmin}_{\theta_0, \theta_1} e(\theta_0, \theta_1) \Leftrightarrow \left[\sum_{i=1}^{N} \begin{bmatrix} 1 \\ x_i \end{bmatrix} \begin{bmatrix} 1 & x_i \end{bmatrix}\right] \theta_{LS} = \sum_{i=1}^{N} \begin{bmatrix} 1 \\ x_i \end{bmatrix} z_i\]
This is a linear system of 2 equations. If we have at least two data points (with distinct \(x_i\)) we can solve for \(\theta_{LS}\) to define the line.
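As a sanity check, here is a minimal numpy sketch of the line fit above, accumulating \(\sum_i \mathbf{H}_{x_i}^T\mathbf{H}_{x_i}\) and \(\sum_i \mathbf{H}_{x_i}^T z_i\) exactly as in the normal equations (the synthetic data and noise level below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a "true" line z = 1.0 + 2.0 x, corrupted by noise.
theta_true = np.array([1.0, 2.0])
x = rng.uniform(-5, 5, size=50)
z = theta_true[0] + theta_true[1] * x + rng.normal(0, 0.3, size=50)

# Accumulate the normal equations  [sum_i H_i^T H_i] theta = sum_i H_i^T z_i
A = np.zeros((2, 2))
b = np.zeros(2)
for xi, zi in zip(x, z):
    H_i = np.array([[1.0, xi]])     # 1x2 row: h(theta, x) = H_x theta
    A += H_i.T @ H_i                # 2x2 contribution
    b += (H_i.T * zi).ravel()       # 2-vector contribution H_i^T z_i

theta_ls = np.linalg.solve(A, b)
print(theta_ls)                     # close to [1.0, 2.0]
```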
Example: we think that the 2D data was generated by a quadratic \(z = \theta_{0} + \theta_{1}x + \theta_2x^2\) whose parameters we do not know.
We are given data points \((x_1, z_1), \dots, (x_N, z_N)\)
We think that the data was generated by a linear parametric function \(z = h(\mathbf{\theta},x) = \begin{bmatrix} 1 & x & x^{2} \end{bmatrix}\mathbf{\theta}=\theta_{0}+\theta_{1}x+\theta_{2}x^{2}\)
This parametric model will have a fitting error:
\[e(\theta_0, \theta_1, \theta_2) = \sum_{i=1}^{N} (z_i - \theta_0 - \theta_1 x_i - \theta_2 x_i^2)^2\]
The least-squares estimator minimizes the error: \[\theta_{LS} = \operatorname*{argmin}_{\theta_0, \theta_1, \theta_2} e(\theta_0, \theta_1, \theta_2) \Leftrightarrow \left[\sum_{i=1}^{N} \begin{bmatrix} 1 \\ x_i \\ x_i^2 \end{bmatrix} \begin{bmatrix} 1 & x_i & x_i^2 \end{bmatrix}\right] \theta_{LS} = \sum_{i=1}^{N} \begin{bmatrix} 1 \\ x_i \\ x_i^2 \end{bmatrix} z_i\]
This is a linear system of 3 equations. If we have at least three data points (with distinct \(x_i\)) we can solve for \(\theta_{LS}\) to define the quadratic.
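The same machinery works for any model that is linear in the parameters. A sketch for the quadratic case, stacking the rows \([1 \; x_i \; x_i^2]\) into a design matrix and comparing the normal-equation solution with numpy's generic least squares solver (again on made-up synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([0.5, -1.0, 0.25])            # z = 0.5 - x + 0.25 x^2

x = rng.uniform(-4, 4, size=100)
z = np.polyval(theta_true[::-1], x) + rng.normal(0, 0.2, size=100)

# One row [1, x_i, x_i^2] per data point (the H_{x_i} of the slides).
H = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations: (H^T H) theta = H^T z
theta_normal = np.linalg.solve(H.T @ H, H.T @ z)

# Same answer from a generic least squares solver.
theta_lstsq, *_ = np.linalg.lstsq(H, z, rcond=None)

print(theta_normal, theta_lstsq)                    # both close to theta_true
```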
Maximum Likelihood Estimation (MLE)
We are given data points \(\mathbf{d}_{1:N} = \mathbf{d}_1, \ldots, \mathbf{d}_N\)
We think the data has been generated from a probability distribution \(p(\mathbf{d}_{1:N}|\mathbf{\theta})\)
We want to find the parameter of the model that maximizes the likelihood function of the data \[L(\mathbf{\theta}) = p(\mathbf{d}_{1:N}|\mathbf{\theta})\]
which is a function of theta, not a probability distribution. \[\theta_{MLE} = \underset{\theta}{\operatorname*{argmax}} p(\mathbf{d}_{1:N} | \theta)\]
Example: assume we know that 1D data points were generated independently from a Gaussian distribution \(\mathcal{N}(\mu, \sigma^2)\), but we don’t know the mean and variance. The likelihood function of the data is
\[L(\mu, \sigma) = p(\mathbf{d}_{1:N}|\mu, \sigma) = \prod_{i=1}^{N} p(d_i|\mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp(-0.5(d_i - \mu)^2/\sigma^2)\]
And the maximum-likelihood parameter estimates are
\[(\mu, \sigma)_{MLE} = \operatorname*{argmax}_{\mu,\sigma} p(\mathbf{d}_{1:N}|\mu, \sigma) = \operatorname*{argmax}_{\mu,\sigma} \log p(\mathbf{d}_{1:N}|\mu, \sigma) = \operatorname*{argmax}_{\mu,\sigma} \sum_{i=1}^{N} \log p(d_i|\mu, \sigma)\]
\[(\mu, \sigma)_{MLE} = \operatorname*{argmax}_{\mu,\sigma} \sum_{i=1}^{N} \log p(d_i|\mu, \sigma) = \operatorname*{argmax}_{\mu,\sigma} \left[ -N\log(\sqrt{2\pi}\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(d_i - \mu)^2 \right]\]
Set partial derivatives w.r.t. \(\mu\) and \(\sigma\) to zero
\[\mu_{MLE}=\sum_{i=1}^{N}d_{i}/N\]
\[\sigma_{MLE}^{2}=\frac{1}{N}\sum_{i=1}^{N}(d_{i}-\mu_{MLE})^{2}\]
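A quick numerical check of these closed-form estimates on synthetic data (the true mean and standard deviation below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma_true = 3.0, 1.5
d = rng.normal(mu_true, sigma_true, size=10_000)

# Closed-form MLE from setting the partial derivatives to zero.
mu_mle = d.sum() / len(d)
sigma2_mle = np.mean((d - mu_mle) ** 2)     # note: 1/N, not the 1/(N-1) sample variance

print(mu_mle, np.sqrt(sigma2_mle))          # close to (3.0, 1.5)
```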
Maximum a Posteriori Estimation (MAP)
Likelihood: \(p(d|\theta)\)
Prior: \(p(\theta)\)
Posterior: \(p(\theta|d) \propto p(d|\theta)\,p(\theta)\)
\[\begin{align} \theta_{MAP} &= \operatorname*{argmax}_{\theta} p(\theta|\mathbf{d}_{1:N}) \\ &= \operatorname*{argmax}_{\theta} \left[ \frac{p(\mathbf{d}_{1:N}|\theta)p(\theta)}{p(\mathbf{d}_{1:N})} \right] \\ &= \operatorname*{argmax}_{\theta} \left[ p(\mathbf{d}_{1:N}|\theta)p(\theta) \right] \end{align}\]
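The slides do not work a MAP example, but a minimal sketch is easy: assume a Gaussian likelihood with known \(\sigma\) and a Gaussian prior on the unknown mean, and compare the argmax of the likelihood with the argmax of likelihood times prior on a grid (all numbers below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: Gaussian likelihood with known sigma, Gaussian prior on the mean.
sigma = 2.0
d = rng.normal(4.0, sigma, size=5)          # only a few data points
mu0, tau = 0.0, 1.0                         # prior: mu ~ N(0, 1)

mu_grid = np.linspace(-5, 10, 2001)

# Log likelihood and log prior on the grid (additive constants dropped).
log_lik = np.array([np.sum(-0.5 * (d - mu) ** 2 / sigma**2) for mu in mu_grid])
log_prior = -0.5 * (mu_grid - mu0) ** 2 / tau**2

mu_mle = mu_grid[np.argmax(log_lik)]                # argmax of the likelihood
mu_map = mu_grid[np.argmax(log_lik + log_prior)]    # argmax of likelihood * prior

print(mu_mle, mu_map)    # the MAP estimate is pulled toward the prior mean 0
```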
Bayesian Estimation
In the occupancy grid mapping problem we wanted to compute \(p(\mathbf{m}|\mathbf{z}_{1:t}, \mathbf{x}_{1:t})\) over all possible maps.
We can see this problem as a specific instance within a category of problems where we are given data (observations) and we want to “explain” or fit the data using a parametric function.
There are typically three ways to work with this type of problem.
Both MLE and MAP estimators give you a single point estimate.
But there might be many parameters that are compatible with the data.
Instead of point estimates, compute a distribution of estimates that explain the data
\[p(\theta|\mathbf{d}_{1:N}) = \frac{p(\mathbf{d}_{1:N}|\theta)p(\theta)}{p(\mathbf{d}_{1:N})}\]
The probability of the data \(p(\mathbf{d}_{1:N})\) is usually hard to compute. But it does not depend on the parameter \(\theta\), so it is treated as a normalizing factor, and we can still compute how the posterior varies with \(\theta\).
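A sketch of the Bayesian alternative for the same hypothetical Gaussian-mean example used above: instead of keeping only the argmax, normalize likelihood times prior over a grid of parameter values, so \(p(\mathbf{d}_{1:N})\) becomes just a normalizing sum:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 2.0
d = rng.normal(4.0, sigma, size=5)                 # few observations
mu0, tau = 0.0, 1.0                                # prior: mu ~ N(0, 1)
mu_grid = np.linspace(-5, 10, 2001)
dx = mu_grid[1] - mu_grid[0]

log_lik = np.array([np.sum(-0.5 * (d - mu) ** 2 / sigma**2) for mu in mu_grid])
log_prior = -0.5 * (mu_grid - mu0) ** 2 / tau**2
log_post = log_lik + log_prior

post = np.exp(log_post - log_post.max())           # unnormalized posterior
post /= post.sum() * dx                            # p(d_{1:N}) is just this constant

# The full curve `post` is the Bayesian answer; summaries can be read off it.
post_mean = np.sum(mu_grid * post) * dx
print(post_mean)
```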
For example, in occupancy grid mapping this is the full posterior over maps: \[p(\mathbf{m}|\mathbf{z}_{1:t}, \mathbf{x}_{1:t})\]
GraphSLAM
Enable a robot to simultaneously build a map of its environment and estimate where it is in that map.
This is called SLAM (Simultaneous Localization And Mapping)
Today we are going to look at the batch version, i.e. we first collect all measurements and controls, and later form an estimate of the states and the map.
We are going to solve SLAM using least squares
MORESLAM system, McGill, 2016
Source Code: https://github.com/erik-nelson/blam
Google Cartographer: 2D and 3D laser SLAM
The full (batch) SLAM posterior is over the entire trajectory and the map: \[p(\mathbf{x}_{1:T}, \mathbf{m} \mid \mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}, \mathbf{x}_{0})\]
The online (filtering) SLAM posterior only keeps the current state: \[p(\mathbf{x}_t, \mathbf{m}_t \mid \mathbf{z}_{0:t}, \mathbf{u}_{0:t-1}, \mathbf{x}_0)\]
\(p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}\det(\mathbf{\Sigma})^{1/2}}\exp\left(-0.5(\mathbf{x}-\mathbf{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})\right)\)
\(p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}\det(\mathbf{\Sigma})^{1/2}} \exp\left(-0.5\|\mathbf{x} - \mathbf{\mu}\|_{\mathbf{\Sigma}}^2\right)\)
Shortcut notation: \(\|x\|_{\Sigma}^{2}=x^{T}\Sigma^{-1}x\)
Figure: example covariance shapes for a 2D Gaussian, from “Computer Vision: Models, Learning, and Inference”, Simon Prince.
Note: the shapes of these covariances are important, you should know them well. In particular, when are \(x_1\) and \(x_2\) correlated?
\(x_1\) and \(x_2\) are correlated when the shape of the ellipse is rotated, i.e. when there are nonzero off-diagonal terms in the covariance matrix. In the figure, this is the case for panels (e) and (f).
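A small check of the shortcut notation: evaluating the Gaussian density via \(\|\mathbf{x}-\mathbf{\mu}\|^2_{\mathbf{\Sigma}}\) and comparing against scipy (the numbers are arbitrary; the nonzero off-diagonal term makes \(x_1\) and \(x_2\) correlated):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])     # nonzero off-diagonal -> correlated x1, x2
x = np.array([1.5, 1.0])

D = len(mu)
diff = x - mu
maha2 = diff @ np.linalg.inv(Sigma) @ diff                  # ||x - mu||^2_Sigma
pdf = np.exp(-0.5 * maha2) / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))

print(pdf, multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # same value
```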
If we can find a region \(R\) such that \[p(x \in R) = c\] we can then say (for example) there is a 99% probability that the true value is in \(R\).
e.g. for a univariate normal distribution \(N(\mu, \sigma^2)\)
\(p(|x - \mu| < \sigma) \approx 0.68\)
\(p(|x - \mu| < 2\sigma) \approx 0.95\)
\(p(|x - \mu| < 3\sigma) \approx 0.997\)
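These probabilities can be verified with the Gaussian error function, since \(p(|x-\mu| < k\sigma) = \operatorname{erf}(k/\sqrt{2})\):

```python
import math

# P(|x - mu| < k*sigma) for a 1D Gaussian equals erf(k / sqrt(2)).
for k in (1, 2, 3):
    print(k, math.erf(k / math.sqrt(2)))   # ~0.683, ~0.954, ~0.997
```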
\[\mathbb{E}_{x\sim p(X)}[X]=\int_{x}xp(X=x)dx\]
\(\mathbb{E}_{x \sim p(X)}[AX + b] = A\,\mathbb{E}_{x \sim p(X)}[X] + b\)
If \(X\) and \(Y\) are independent: \[\mathbb{E}_{x,y \sim p(X,Y)}[XY] = \mathbb{E}_{x \sim p(X)}[X]\, \mathbb{E}_{y \sim p(Y)}[Y]\]
\[\text{Cov}[X, Y] = E[XY] - E[X]E[Y]\]
\[\begin{align} & \text{Var}[X] = \text{Cov}[X] = \text{Cov}[X, X] = E[X^2] - E[X]^2 \\ & \text{Cov}[AX + b] = A\text{Cov}[X]A^T \\ & \text{Cov}[X + Y] = \text{Cov}[X] + \text{Cov}[Y] + 2\text{Cov}[X, Y] \end{align}\]
Entry (i,j) of the covariance matrix measures whether changes in variable \(X_i\) co-occur with changes in variable \(Y_j\)
It does not measure whether one causes the other.
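A Monte Carlo sanity check of the identity \(\text{Cov}[AX + b] = A\,\text{Cov}[X]A^T\) (the matrices below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([1.0, -1.0])
Y = X @ A.T + b

print(np.cov(Y, rowvar=False))   # empirical covariance of Y, close to ...
print(A @ Sigma @ A.T)           # ... the analytic A Cov[X] A^T
```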
For multivariate Gaussians:
\[\begin{align} E[\mathbf{x}] = \mu \\ \text{Cov}[\mathbf{x}] = \Sigma \end{align}\]
Since we have 2D examples here:
\[\begin{align} \text{Cov}[\mathbf{x}] = \boldsymbol{\Sigma} &= \begin{bmatrix} \text{Cov}[x_1, x_1] & \text{Cov}[x_1, x_2] \\ \text{Cov}[x_2, x_1] & \text{Cov}[x_2, x_2] \end{bmatrix} \\ &= \begin{bmatrix} \text{Var}[x_1] & \text{Cov}[x_1, x_2] \\ \text{Cov}[x_2, x_1] & \text{Var}[x_2] \end{bmatrix} \end{align}\]
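A quick empirical check that samples drawn from \(\mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\) reproduce \(\mathbf{\mu}\) and \(\mathbf{\Sigma}\), and that the off-diagonal entry shows up as correlation between \(x_1\) and \(x_2\) (numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 2.0]])            # Cov[x1, x2] = 0.9 (rotated ellipse)

samples = rng.multivariate_normal(mu, Sigma, size=100_000)
print(samples.mean(axis=0))               # ~ mu
print(np.cov(samples, rowvar=False))      # ~ Sigma
print(np.corrcoef(samples, rowvar=False)[0, 1])   # ~ 0.9 / sqrt(1 * 2) ≈ 0.64
```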
Map \(\mathbf{m} = \{\mathbf{m}_0, \mathbf{m}_1\}\) consists of landmarks that are easily identifiable and cannot be mistaken for one another, i.e. we are avoiding the data association problem here.
Notice that the graph is sparse as long as not too many states observe the same landmark.
That implies that many of the potential dependencies between the random variables in \(p(\mathbf{x}_{1:T}, \mathbf{m} \mid \mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}, \mathbf{x}_{0})\) are not needed and can be dropped.
Instead of computing the posterior \(p(\mathbf{x}_{1:T}, \mathbf{m} \mid \mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}, \mathbf{x}_{0})\) we are going to compute its max
\[\mathbf{x}^*_{1:T}, \mathbf{m}^* = \underset{\mathbf{x}_{1:T},\mathbf{m}}{\operatorname{argmax}} p(\mathbf{x}_{1:T}, \mathbf{m} \mid \mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}, \mathbf{x}_0)\]
(see the least squares part of this lecture)
Expanding this maximization step by step:
\[\begin{align} \mathbf{x}_{1:T}^*, \mathbf{m}^* &= \operatorname*{argmax}_{\mathbf{x}_{1:T}, \mathbf{m}} p(\mathbf{x}_{1:T}, \mathbf{m} | \mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \\ &= \operatorname*{argmax}_{\mathbf{x}_{1:T}, \mathbf{m}} \left[ \frac{p(\mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}|\mathbf{x}_0)}{p(\mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}|\mathbf{x}_0)} \right] && \text{by definition of conditional distribution} \\ &= \operatorname*{argmax}_{\mathbf{x}_{1:T}, \mathbf{m}} p(\mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}|\mathbf{x}_0) && \text{denominator does not depend on the optimization variables} \\ &= \operatorname*{argmax}_{\mathbf{x}_{1:T}, \mathbf{m}} \left[ \prod_{t=1}^{T} p(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{u}_{t-1}) \prod_{t=0}^{T} \prod_{\mathbf{z}_t^{(k)} \in \mathbf{z}_t} p(\mathbf{z}_t^{(k)}|\mathbf{x}_t, \mathbf{m}_k) \right] && \text{factorization (see the claims proved below)} \\ &= \operatorname*{argmax}_{\mathbf{x}_{1:T}, \mathbf{m}} \left[ \sum_{t=1}^{T} \log p(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{u}_{t-1}) + \sum_{t=0}^{T} \sum_{\mathbf{z}_t^{(k)} \in \mathbf{z}_t} \log p(\mathbf{z}_t^{(k)}|\mathbf{x}_t, \mathbf{m}_k) \right] && \log \text{ is monotonically increasing} \end{align}\]
Main GraphSLAM assumptions:
1. Uncertainty in the dynamics model is Gaussian
\(\mathbf{x}_{t}=f(\mathbf{x}_{t-1}, \mathbf{u}_{t-1})+\mathbf{w}_{t}\)
\(\mathbf{w}_{t} \sim \mathcal{N}(0, \mathbf{R}_{t})\)
so
\(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{u}_{t-1} \sim \mathcal{N}(\mathbf{f}(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}), \mathbf{R}_t)\)
2. Uncertainty in the sensor model is Gaussian
\(\mathbf{z}_t^{(k)} = \mathbf{h}(\mathbf{x}_t, \mathbf{m}_k) + \mathbf{v}_t\)
\(\mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_t)\)
so
\(\mathbf{z}_t^{(k)} | \mathbf{x}_t, \mathbf{m}_k \sim \mathcal{N}(\mathbf{h}(\mathbf{x}_t, \mathbf{m}_k), \mathbf{Q}_t)\)
\(\mathbf{x}_1|\mathbf{x}_0, \mathbf{u}_0 \sim \mathcal{N}(\mathbf{f}(\mathbf{x}_0, \mathbf{u}_0), \mathbf{R}_0)\)
Expected to end up at \(\mathbf{x}_1 = \mathbf{f}(\mathbf{x}_0, \mathbf{u}_0)\) from \(\mathbf{x}_0\) but we might end up around it, within the ellipse defined by the covariance matrix \(\mathbf{R}_0\)
\(\mathbf{z}_1^{(0)}|\mathbf{x}_1, \mathbf{m}_0 \sim \mathcal{N}(\mathbf{h}(\mathbf{x}_1, \mathbf{m}_0), \mathbf{Q}_1)\)
Expected to get measurement \(\mathbf{h}(\mathbf{x}_1, \mathbf{m}_0)\) at state \(\mathbf{x}_1\) but it might be somewhere within the ellipse defined by the covariance matrix \(\mathbf{Q}_1\)
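To make these Gaussian assumptions concrete, here is a sketch of one possible pair of models. The unicycle dynamics and range-bearing sensor below are stand-ins (the actual models from Lectures 2 and 4 are not reproduced here), and the noise covariances are made up:

```python
import numpy as np

def f(x, u, dt=0.1):
    """Unicycle dynamics: state x = (px, py, heading), control u = (v, omega)."""
    px, py, th = x
    v, w = u
    return np.array([px + v * np.cos(th) * dt,
                     py + v * np.sin(th) * dt,
                     th + w * dt])

def h(x, m):
    """Range-bearing measurement of landmark m = (mx, my) from state x."""
    dx, dy = m[0] - x[0], m[1] - x[1]
    return np.array([np.hypot(dx, dy),
                     np.arctan2(dy, dx) - x[2]])

# Sampling the generative model of the slides:
#   x_t = f(x_{t-1}, u_{t-1}) + w_t,   w_t ~ N(0, R_t)
#   z_t = h(x_t, m_k)         + v_t,   v_t ~ N(0, Q_t)
rng = np.random.default_rng(7)
R = np.diag([0.01, 0.01, 0.003])          # dynamics noise covariance (assumed)
Q = np.diag([0.05, 0.01])                 # sensor noise covariance (assumed)

x0 = np.array([0.0, 0.0, 0.0])
x1 = f(x0, u=(1.0, 0.2)) + rng.multivariate_normal(np.zeros(3), R)
z1 = h(x1, m=np.array([2.0, 1.0])) + rng.multivariate_normal(np.zeros(2), Q)
print(x1, z1)
```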
Plugging the Gaussian dynamics and sensor models into the log-likelihood terms turns the maximization into a least squares problem:
\[\begin{align} \mathbf{x}_{1:T}^{*}, \mathbf{m}^{*} &= \underset{\mathbf{x}_{1:T}, \mathbf{m}}{\operatorname{argmax}} \left[ \sum_{t=1}^{T} \log p(\mathbf{x}_{t}|\mathbf{x}_{t-1}, \mathbf{u}_{t-1}) + \sum_{t=0}^{T} \sum_{\mathbf{z}_{t}^{(k)} \in \mathbf{z}_{t}} \log p(\mathbf{z}_{t}^{(k)}|\mathbf{x}_{t}, \mathbf{m}_{k}) \right] \\ &= \operatorname*{argmax}_{\mathbf{x}_{1:T},\mathbf{m}} \left[ - \sum_{t=1}^{T} \|\mathbf{x}_t - \mathbf{f}(\mathbf{x}_{t-1}, \mathbf{u}_{t-1})\|_{\mathbf{{R}}_t}^2 - \sum_{t=0}^{T} \sum_{z_t^{(k)} \in \mathbf{z}_t} \|\mathbf{z}_t^{(k)} - \mathbf{h}(\mathbf{x}_t, \mathbf{m}_k)\|_{\mathbf{{Q}}_t}^2 \right] \\ &= \underset{\mathbf{x}_{1:T}, \mathbf{m}}{\operatorname{argmin}} \left[ \sum_{t=1}^{T} \|\mathbf{x}_t - \mathbf{f}(\mathbf{x}_{t-1}, \mathbf{u}_{t-1})\|_{\mathbf{R}_t}^2 + \sum_{t=0}^{T} \sum_{\mathbf{z}_t^{(k)} \in \mathbf{z}_t} \|\mathbf{z}_t^{(k)} - \mathbf{h}(\mathbf{x}_t, \mathbf{m}_k)\|_{\mathbf{Q}_t}^2 \right] \end{align}\]
Notation:
\(\mathbf{x}^T \mathbf{Q}^{-1} \mathbf{x} = ||\mathbf{x}||_{\mathbf{Q}}^2\)
\(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{u}_{t-1} \sim \mathcal{N}(\mathbf{f}(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}), \mathbf{R}_t) \qquad\qquad\qquad \mathbf{z}_{t}^{(k)}|\mathbf{x}_{t}, \mathbf{m}_{k} \sim \mathcal{N}(\mathbf{h}(\mathbf{x}_{t}, \mathbf{m}_{k}), \mathbf{Q}_{t})\)
Need to minimize the sum of the following quadratic terms:
\[\begin{align} & ||\mathbf{x}_1 - \mathbf{f}(\mathbf{x}_0, \mathbf{u}_0)||_{\mathbf{R}_1}^2 \\ & ||\mathbf{x}_2 - \mathbf{f}(\mathbf{x}_1, \mathbf{u}_1)||_{\mathbf{R}_2}^2 \\ & ||\mathbf{x}_3 - \mathbf{f}(\mathbf{x}_2, \mathbf{u}_2)||_{\mathbf{R}_3}^2 \\ & ||\mathbf{z}_0^{(0)} - \mathbf{h}(\mathbf{x}_0, \mathbf{m}_0)||_{\mathbf{Q}_0}^2 \\ & ||\mathbf{z}_1^{(0)} - \mathbf{h}(\mathbf{x}_1, \mathbf{m}_0)||_{\mathbf{Q}_1}^2 \\ & ||\mathbf{z}_1^{(1)} - \mathbf{h}(\mathbf{x}_1, \mathbf{m}_1)||_{\mathbf{Q}_1}^2 \\ & ||\mathbf{z}_2^{(1)} - \mathbf{h}(\mathbf{x}_2, \mathbf{m}_1)||_{\mathbf{Q}_2}^2 \end{align}\]
with respect to variables: \(\mathbf{x_1 \quad x_2 \quad x_3 \quad m_0 \quad m_1}\)
initial state \(\mathbf{x}_0\) is given
Covariance matrices \(\mathbf{R}_t, \mathbf{Q}_t\) are problem-dependent, but they usually do not change with time.
\(\mathbf{x}_t = \mathbf{f}(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}) + \mathbf{w}_t\)
\(\mathbf{z}_{t}^{(k)} = \mathbf{h}(\mathbf{x}_{t}, \mathbf{m}_{k}) + \mathbf{v}_{t}\)
The dynamics model \(\mathbf{f}\) can be any of the dynamical systems we saw in Lecture 2.
\(\mathbf{z}_t^{(k)}\) can be any of the sensors we saw in Lecture 4:
Laser scan \(\{(r_{i},\theta_{i})\}_{i=1:K}\) where \(\mathbf{m}_k\) is an occupancy grid
Range and bearing \((r, \theta)\) to the landmark \(\mathbf{m}_k = (x_k, y_k, z_k)\)
Bearing measurements from images
Altitude/Depth
Gyroscope
Accelerometer
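Putting it together, here is a sketch of the toy problem above (poses \(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\), landmarks \(\mathbf{m}_0, \mathbf{m}_1\), \(\mathbf{x}_0\) given) solved with scipy's nonlinear least squares. For simplicity it assumes 2D point states with additive odometry dynamics \(\mathbf{f}(\mathbf{x},\mathbf{u}) = \mathbf{x}+\mathbf{u}\), relative-position measurements \(\mathbf{h}(\mathbf{x},\mathbf{m}) = \mathbf{m}-\mathbf{x}\), and isotropic \(\mathbf{R}_t, \mathbf{Q}_t\), all of which are illustration choices rather than the lecture's models; the Mahalanobis weights are applied by whitening each residual, so every whitened residual corresponds to one quadratic term in the list above:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy setup: 2D point robot, x_t = x_{t-1} + u_{t-1} + w_t (odometry-style dynamics),
# measurements z_t^(k) = (m_k - x_t) + v_t (relative landmark position).
x0 = np.array([0.0, 0.0])                       # given initial state
u = [np.array([1.0, 0.0]),                      # u_0, u_1, u_2
     np.array([1.0, 0.5]),
     np.array([0.0, 1.0])]
R_sqrt_inv = 1.0 / np.sqrt(0.01)                # isotropic R_t = 0.01 I
Q_sqrt_inv = 1.0 / np.sqrt(0.04)                # isotropic Q_t = 0.04 I

# Simulate ground truth and noisy data for the factors listed in the slides.
true_m = [np.array([2.0, -1.0]), np.array([1.0, 3.0])]
rng = np.random.default_rng(8)
xs_true = [x0]
for ut in u:
    xs_true.append(xs_true[-1] + ut + rng.normal(0, 0.1, 2))
obs = []
for (t, k) in [(0, 0), (1, 0), (1, 1), (2, 1)]:   # z_0^(0), z_1^(0), z_1^(1), z_2^(1)
    obs.append((t, k, true_m[k] - xs_true[t] + rng.normal(0, 0.2, 2)))

def unpack(v):
    xs = [x0] + [v[2 * i:2 * i + 2] for i in range(3)]   # x_0 (fixed), x_1..x_3
    ms = [v[6:8], v[8:10]]                               # m_0, m_1
    return xs, ms

def residuals(v):
    xs, ms = unpack(v)
    res = []
    for t in range(1, 4):                                # dynamics factors
        res.append(R_sqrt_inv * (xs[t] - (xs[t - 1] + u[t - 1])))
    for (t, k, z) in obs:                                # measurement factors
        res.append(Q_sqrt_inv * (z - (ms[k] - xs[t])))
    return np.concatenate(res)

sol = least_squares(residuals, np.zeros(10))
xs_est, ms_est = unpack(sol.x)
print(ms_est)        # close to the true landmark positions
```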
Claim:
\[p(\mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}, \mathbf{x}_{0}) = p(\mathbf{x}_{0}) \prod_{t=1}^{T} p(\mathbf{x}_{t}|\mathbf{x}_{t-1}, \mathbf{u}_{t-1}) \prod_{t=0}^{T} \prod_{\mathbf{z}_{t}^{(k)}\in \mathbf{z}_{t}} p(\mathbf{z}_{t}^{(k)}|\mathbf{x}_{t}, \mathbf{m}_{k})\]
Proof:
\[\begin{align} p(\mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) &= p(\mathbf{z}_T | \mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T-1}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \, p(\mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T-1}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \\ &= p(\mathbf{z}_T | \mathbf{x}_T, \mathbf{m}) \, p(\mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T-1}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \\ &= p(\mathbf{z}_T | \mathbf{x}_T, \mathbf{m}) \, p(\mathbf{z}_{T-1} | \mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T-2}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \, p(\mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T-2}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \\ &= p(\mathbf{z}_T | \mathbf{x}_T, \mathbf{m}) \, p(\mathbf{z}_{T-1} | \mathbf{x}_{T-1}, \mathbf{m}) \, p(\mathbf{x}_{1:T}, \mathbf{m}, \mathbf{z}_{0:T-2}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \\ &\quad \dots \\ &= \prod_{t=0}^{T} p(\mathbf{z}_t | \mathbf{x}_t, \mathbf{m}) \, p(\mathbf{x}_{1:T}, \mathbf{m}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \\ &= \prod_{t=0}^{T} p(\mathbf{z}_t | \mathbf{x}_t, \mathbf{m}) \, p(\mathbf{x}_T | \mathbf{x}_{1:T-1}, \mathbf{m}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \, p(\mathbf{x}_{1:T-1}, \mathbf{m}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \\ &= \prod_{t=0}^{T} p(\mathbf{z}_t | \mathbf{x}_t, \mathbf{m}) \, p(\mathbf{x}_T | \mathbf{x}_{T-1}, \mathbf{u}_{T-1}) \, p(\mathbf{x}_{1:T-1}, \mathbf{m}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \\ &= \prod_{t=0}^{T} p(\mathbf{z}_t | \mathbf{x}_t, \mathbf{m}) \, p(\mathbf{x}_T | \mathbf{x}_{T-1}, \mathbf{u}_{T-1}) \, p(\mathbf{x}_{T-1} | \mathbf{x}_{T-2}, \mathbf{m}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \, p(\mathbf{x}_{1:T-2}, \mathbf{m}, \mathbf{u}_{0:T-1}, \mathbf{x}_0) \\ &\quad \dots \\ &= \left[ \prod_{t=0}^{T} p(\mathbf{z}_t | \mathbf{x}_t, \mathbf{m}) \right] p(\mathbf{x}_0) \prod_{t=1}^{T} p(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{u}_{t-1}) \end{align}\]
Claim: \[p(\mathbf{z}_{t}|\mathbf{x}_{t},\mathbf{m}) = \prod_{\mathbf{z}_{t}^{(k)}\in\mathbf{z}_{t}}p(\mathbf{z}_{t}^{(k)}|\mathbf{x}_{t},\mathbf{m}_{k})\]
where \(\mathbf{z}_t = \left\{ \mathbf{z}_t^{(k)} \text{ iff landmark } \mathbf{m}_k \text{ was observed} \right\}\)
\(\mathbf{m} = \{\text{landmarks } \mathbf{m}_k\}\)
Proof:
Suppose without loss of generality that \(\mathbf{z}_t = \left\{ \mathbf{z}_t^{(k)}, k = 1...K \right\}\) and \(\mathbf{m} = \{\mathbf{m}_{k}, k = 1...K\}\)
i.e. that all landmarks were observed from the state at time t. Then:
\[\begin{align} p(\mathbf{z}_t^{(1)}, \ldots, \mathbf{z}_t^{(K)}|\mathbf{x}_t, \mathbf{m}) &= p(\mathbf{z}_t^{(1)}|\mathbf{z}_t^{(2)}, \ldots, \mathbf{z}_t^{(K)}, \mathbf{x}_t, \mathbf{m}) \, p(\mathbf{z}_t^{(2)}, \ldots, \mathbf{z}_t^{(K)}|\mathbf{x}_t, \mathbf{m}) \\ &= p(\mathbf{z}_t^{(1)}|\mathbf{x}_t, \mathbf{m}_1) \, p(\mathbf{z}_t^{(2)}, \ldots, \mathbf{z}_t^{(K)}|\mathbf{x}_t, \mathbf{m}) \\ &= p(\mathbf{z}_t^{(1)}|\mathbf{x}_t, \mathbf{m}_1) \, p(\mathbf{z}_t^{(2)}|\mathbf{z}_t^{(3)}, \ldots, \mathbf{z}_t^{(K)}, \mathbf{x}_t, \mathbf{m}) \, p(\mathbf{z}_t^{(3)}, \ldots, \mathbf{z}_t^{(K)}|\mathbf{x}_t, \mathbf{m}) \\ &= p(\mathbf{z}_t^{(1)}|\mathbf{x}_t, \mathbf{m}_1) \, p(\mathbf{z}_t^{(2)}|\mathbf{x}_t, \mathbf{m}_2) \, p(\mathbf{z}_t^{(3)}, \ldots, \mathbf{z}_t^{(K)}|\mathbf{x}_t, \mathbf{m}) \\ &\quad \vdots \\ &= \prod_{k=1}^{K} p(\mathbf{z}_t^{(k)}|\mathbf{x}_t, \mathbf{m}_k) \end{align}\]