Kalman Filtering

Explained a bit more simply.

Setup

Consider a linear time-invariant dynamical system, written in discrete time:

\begin{align*}
x_{k+1} &= A x_k + B u_k + w_k\\
y_{k} &= C x_k + v_k
\end{align*}

where

  • $w_k \sim \mathcal{N}(0, Q)$ is zero-mean process noise with covariance $Q$.
  • $v_k \sim \mathcal{N}(0, R)$ is zero-mean measurement noise with covariance $R$.
  • $x_0 \sim \mathcal{N}(\hat{x}_0, P_{0,0})$ is the initial state, which is assumed to be normally distributed around $\hat{x}_0$ with covariance $P_{0,0}$.

For the KF to work, we must assume that the system $(A, C)$ is observable.

Notation:

  • $x_k$ is the true state at time $k$.
  • $\hat{x}_{k,k}$ is our best estimate of $x_k$, using all measurements up to and including time $k$. This is called the posterior estimate.
  • $\hat{x}_{k+1,k}$ is our best estimate of $x_{k+1}$, using all measurements up to and including time $k$. This is called the prior estimate.
  • $P_{k,k} = \text{cov}(x_k - \hat{x}_{k,k})$ is the covariance of the posterior estimate.
  • $P_{k+1,k} = \text{cov}(x_{k+1} - \hat{x}_{k+1,k})$ is the covariance of the prior estimate.

Final Equations

We will be generating an estimate of the state, in the following way:

  1. Predict the state, in an open-loop sense
  2. Correct the estimate, using the received measurements.

Mathematically,

\begin{align*}
\hat{x}_{k+1, k} &= A \hat{x}_{k,k} + B u_k\\
P_{k+1, k} &= A P_{k,k} A^T + Q
\end{align*}

is a direct prediction of where the state should be based on the previous best estimate. The covariance of the prior is also updated.

Now, once the new measurement comes in, we have the ability to check how wrong the estimate is. We call this the innovation, $i_{k+1}$:

$$ i_{k+1} = y_{k+1} - C \hat{x}_{k+1, k} $$

and so we correct the estimated state as

\begin{align*}
\hat{x}_{k+1,k+1} &= \hat{x}_{k+1, k} + K i_{k+1}\\
P_{k+1, k+1} &= (I - K C) P_{k+1, k}
\end{align*}

by simply taking the prior estimate and shifting it to better match the measurement. Notice that since the measurement is noisy, we don’t shift it to perfectly match the noisy measurement. This proportion is controlled by $K$, the Kalman Gain, which is given by

$$ K = P_{k+1, k} C^T (C P_{k+1, k} C^T + R)^{-1} $$

which looks scary, but is actually quite straightforward, as demonstrated below. Also note that $K$ changes at each time step. I have just dropped the subscript ($K_k$) since it only gets in the way.

The important thing to remember is what the Kalman gain is trying to optimize: it is trying to make $P_{k+1, k+1}$, the posterior state estimate covariance, as ‘small’ as possible. Rigorously speaking, it is trying to minimise $\text{trace}(P_{k+1, k+1})$.
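Before the derivation, here is a minimal sketch of what the whole filter looks like in code: one predict/correct cycle in Python with NumPy, written to mirror the equations above (the function name and argument layout are my own choices, not a standard API):

```python
import numpy as np

def kf_step(x_hat, P, u, y_next, A, B, C, Q, R):
    """One predict/correct cycle of the Kalman filter."""
    # Predict (prior estimate and its covariance)
    x_prior = A @ x_hat + B @ u                      # x_{k+1,k}
    P_prior = A @ P @ A.T + Q                        # P_{k+1,k}

    # Correct (posterior estimate and its covariance)
    K = P_prior @ C.T @ np.linalg.inv(C @ P_prior @ C.T + R)
    innovation = y_next - C @ x_prior                # i_{k+1}
    x_post = x_prior + K @ innovation                # x_{k+1,k+1}
    P_post = (np.eye(len(x_hat)) - K @ C) @ P_prior  # P_{k+1,k+1}
    return x_post, P_post
```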

In the rest of this page, we derive the above equations. We first need a bit of mathematical background, and then we derive the equations.


Useful Mathematical properties

There are a few mathematical properties that will be used in the proof. If you are familiar with them, feel free to skip this section.

Properties of Gaussian Random Vectors

A vector $w \in R^{n}$ is a Gaussian vector if it is sampled from a normal distribution. We write

$$ w \sim \mathcal{N}(\bar w, Q) $$

where $\bar w = E(w)$ is the mean, and $Q = \text{cov}(w)$ is the covariance matrix,

$$ Q = \text{cov}(w) = E[(w - \bar w)(w - \bar w)^T] = E[w w^T] - \bar w \bar w^T $$

and is always square (of size $n \times n$), symmetric, and positive semi-definite. Notice that if $\bar w = 0$, the covariance matrix simplifies to $\text{cov}(w) = E[w w^T]$.

The covariance matrix also has the following properties, which we will be using:

  • if $A \in R^{n\times n}$ and $b \in R^n$ are a constant matrix and vector,
$$ \text{cov}(A w + b) = A \text{cov}(w) A^T $$
  • the negative sign does not affect the covariance:
$$ \text{cov}(-Aw) = A \text{cov}(w) A^T $$
  • but you have to be careful when subtracting:
$$ \text{cov}(Aw - Bw) = (A-B) \text{cov}(w) (A-B)^T $$

not $A \text{cov}(w) A^T + B \text{cov}(w) B^T$, as you might naively expect.

  • if (and only if) you have two uncorrelated random vectors $w \sim \mathcal{N}(\bar w, Q)$ and $v \sim \mathcal{N}(\bar v, R)$, then
$$ \text{cov}(v + w) = \text{cov}(v) + \text{cov}(w) $$
  • if the two random vectors are correlated (will not be used in this document, but useful to know), we have
$$ \text{cov}(v + w) = \text{cov}(v) + \text{cov}(w) + \text{cov}(v, w) + \text{cov}(w, v) $$

where $\text{cov}(v, w) = E[(v - \bar v)(w - \bar w)^T]$. You can use this to show why the minus sign appears in $\text{cov}(Aw - Bw)$.
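If these identities feel unfamiliar, they are easy to verify numerically. Below is a quick Monte Carlo check in NumPy (entirely my own illustration; the particular $A$, $B$, $b$ and $Q$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 200_000

# arbitrary constant matrices/vector and a random covariance Q
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
b = rng.standard_normal(n)
L = rng.standard_normal((n, n))
Q = L @ L.T                                  # symmetric positive semi-definite

w = rng.multivariate_normal(np.zeros(n), Q, size=N)   # samples of w ~ N(0, Q)

# cov(Aw + b) should be close to A Q A^T
print(np.abs(np.cov((w @ A.T + b).T) - A @ Q @ A.T).max())

# cov(Aw - Bw) should be close to (A - B) Q (A - B)^T
print(np.abs(np.cov((w @ (A - B).T).T) - (A - B) @ Q @ (A - B).T).max())
```

Both printed values shrink towards zero as the number of samples grows.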

Properties of Trace and Trace Derivatives

We will be minimising the trace of a matrix, and so it is useful to know the following:

Consider two matrices $A \in R^{m \times n}$ and $B \in R^{n\times m}$ (square, where an identity requires it):

\begin{align*}
\text{tr}(AB) &= \text{tr}(BA)\\
\text{tr}(A+B) &= \text{tr}(A) + \text{tr}(B)\\
\frac{d}{dA} \text{tr}(A B) &= \frac{d}{dA} \text{tr}(B A) = B^T\\
\frac{d}{dA} \text{tr}(A B A^T) &= A (B^T + B)
\end{align*}
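These, too, can be sanity checked numerically. The sketch below (my own, with random matrices) verifies the two trace identities and compares the derivative formula $\frac{d}{dA}\text{tr}(ABA^T) = A(B^T + B)$ against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# tr(AB) = tr(BA) and tr(A + B) = tr(A) + tr(B)
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))
print(np.isclose(np.trace(A + B), np.trace(A) + np.trace(B)))

# d/dA tr(A B A^T) = A (B^T + B), checked entry-wise by finite differences
analytic = A @ (B.T + B)
eps = 1e-6
numeric = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        dA = np.zeros_like(A)
        dA[i, j] = eps
        numeric[i, j] = (np.trace((A + dA) @ B @ (A + dA).T)
                         - np.trace(A @ B @ A.T)) / eps
print(np.abs(numeric - analytic).max())   # should be of order 1e-5 or smaller
```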

Derivation of the Kalman Filter Equations:

Predictor Step Derivations

The prediction step involves an open-loop prediction of the new state, based on the dynamics of the system:

$$ \hat{x}_{k+1, k} = A \hat{x}_{k,k} + B u_k $$

therefore, we can determine the prior estimation error:

\begin{align*}
e_{k+1, k} &= x_{k+1} - \hat{x}_{k+1, k}\\
&= (A x_k + B u_k + w_k) - (A \hat{x}_{k,k} + B u_k)\\
&= A(x_k - \hat{x}_{k,k}) + w_k\\
&= A e_{k, k} + w_k
\end{align*}

therefore, the update of the prior covariance is

\begin{align*}
\text{cov}(e_{k+1, k}) &= \text{cov}(A e_{k, k} + w_k)\\
P_{k+1, k} &= \text{cov}(A e_{k, k}) + \text{cov}(w_k)\\
&= A P_{k,k} A^T + Q
\end{align*}

where the first step used the assumption that the noise is uncorrelated with the state and error. Notice the $A P A^T$ comes from the properties of random vectors.
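As a quick sanity check of this propagation (my own sketch, with an arbitrary 2-state system): sample many posterior errors with covariance $P_{k,k}$, push them through $A e_{k,k} + w_k$, and compare the empirical covariance with $A P_{k,k} A^T + Q$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000

# arbitrary example system and covariances (for illustration only)
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
P_post = np.array([[0.5, 0.1],
                   [0.1, 0.3]])
Q = np.diag([0.02, 0.05])

e_post = rng.multivariate_normal(np.zeros(2), P_post, size=N)   # e_{k,k}
w = rng.multivariate_normal(np.zeros(2), Q, size=N)             # w_k
e_prior = e_post @ A.T + w                                      # e_{k+1,k}

print(np.cov(e_prior.T))       # empirical covariance of the prior error
print(A @ P_post @ A.T + Q)    # predicted P_{k+1,k}
```

The two printed matrices should agree to within sampling error.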

Corrector Step

Now the corrector step performs the following operation:

$$ \hat{x}_{k+1, k+1} = \hat{x}_{k+1, k} + K (y_{k+1} - C \hat{x}_{k+1, k}) $$

therefore, the posterior error is

\begin{align*}
e_{k+1, k+1} &= x_{k+1} - \hat{x}_{k+1, k+1}\\
&= x_{k+1} - \hat{x}_{k+1, k} - K (y_{k+1} - C \hat{x}_{k+1, k})\\
&= e_{k+1, k} - K (y_{k+1} - C \hat{x}_{k+1, k})\\
&= e_{k+1, k} - K ((C x_{k+1} + v_{k+1}) - C \hat{x}_{k+1, k})\\
&= e_{k+1, k} - K C (x_{k+1} - \hat{x}_{k+1, k}) - K v_{k+1}\\
&= e_{k+1, k} - K C e_{k+1, k} - K v_{k+1}\\
&= (I - K C) e_{k+1, k} - K v_{k+1}
\end{align*}

and the covariance will be

\begin{align*}
\text{cov}(e_{k+1, k+1}) &= \text{cov}((I - K C) e_{k+1, k} - K v_{k+1})\\
P_{k+1, k+1} &= \text{cov}((I - K C) e_{k+1, k}) + \text{cov}(- K v_{k+1})\\
&= (I - K C) P_{k+1, k}(I - K C)^T + K R K^T
\end{align*}

which shows rather clearly that the posterior covariance is a weighted sum of the prior covariance and the measurement noise! The weight is essentially the Kalman gain $K$.

So how should we pick $K$? Ultimately we want a small posterior covariance, and so we want to choose a $K$ that minimizes $P_{k+1, k+1}$. Before we do this minimization, let's apply some intuition.

Suppose, after the minimization, we have a large $K$. That means that we are not worried about the second term: $R$ must be really small, i.e. the measurement noise is very small. Conversely, suppose $K$ is close to 0. Then $P_{k+1, k+1} \approx P_{k+1, k}$: we are ignoring the measurements, since the measurement noise is too large!

In this way, the ‘magnitude’ of $K$ is determined by whether we trust the measurements (large $K$) or trust the open-loop dynamics (small $K$), as the scalar sketch below illustrates.
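Here is a tiny sketch of the scalar case (my own toy example, with $C = 1$ and an arbitrary prior variance) showing how the gain shrinks as the measurement noise $R$ grows:

```python
P_prior = 1.0   # scalar prior variance (arbitrary)
C = 1.0

for R in [0.01, 0.1, 1.0, 10.0, 100.0]:
    K = P_prior * C / (C * P_prior * C + R)    # K = P C^T (C P C^T + R)^{-1}
    P_post = (1 - K * C) * P_prior
    print(f"R = {R:6.2f}  ->  K = {K:5.3f},  P_post = {P_post:5.3f}")
```

For small $R$ the gain is close to 1 (trust the measurement); for large $R$ it is close to 0 and the posterior variance barely improves on the prior.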

Let's use calculus to determine the optimal $K$.

Kalman Gain:

We wish to solve the following mathematical problem:

$$ \underset{K}{\text{minimise}} \quad \text{tr}(P_{k+1, k+1}) $$

where the trace is used since it equals the sum of the diagonal entries of $P_{k+1,k+1}$, i.e. the sum of the variances of the estimation error in each element of the state vector (it is also the sum of the eigenvalues).

Let's first expand out $P_{k+1, k+1}$:

\begin{align*}
P_{k+1, k+1} &= (I - K C) P_{k+1, k}(I - K C)^T + K R K^T\\
&= (I - K C) (P_{k+1, k} - P_{k+1, k} C^T K^T) + K R K^T\\
&= P_{k+1, k} - K C P_{k+1, k} - P_{k+1, k} C^T K^T + K C P_{k+1, k} C^T K^T + K R K^T\\
&= P_{k+1, k} - K C P_{k+1, k} - P_{k+1, k} C^T K^T + K (C P_{k+1, k} C^T + R) K^T
\end{align*}

and so the trace is

\begin{align*}
\text{tr}(P_{k+1, k+1}) &= \text{tr}\big(P_{k+1, k} - K C P_{k+1, k} - P_{k+1, k} C^T K^T + K (C P_{k+1, k} C^T + R) K^T\big)\\
&= \text{tr}(P_{k+1, k}) - \text{tr}(K C P_{k+1, k}) - \text{tr}(P_{k+1, k} C^T K^T) + \text{tr}\big(K (C P_{k+1, k} C^T + R) K^T\big)\\
&= \text{tr}(P_{k+1, k}) - 2\,\text{tr}(K C P_{k+1, k}) + \text{tr}\big(K (C P_{k+1, k} C^T + R) K^T\big)
\end{align*}

where the last step used $\text{tr}(M) = \text{tr}(M^T)$ together with the symmetry of $P_{k+1,k}$ to combine the two middle terms.

Therefore, we can take the derivative and set it to zero:

\begin{align*}
\frac{d}{dK} \text{tr}(P_{k+1, k+1}) &= 0\\
\frac{d}{dK} \Big[\text{tr}(P_{k+1, k}) - 2\,\text{tr}(K C P_{k+1, k}) + \text{tr}\big(K (C P_{k+1, k} C^T + R) K^T\big) \Big] &= 0\\
\frac{d}{dK} \text{tr}(P_{k+1, k}) - 2 \frac{d}{dK} \text{tr}(K C P_{k+1, k}) + \frac{d}{dK}\text{tr}\big(K (C P_{k+1, k} C^T + R) K^T\big) &= 0\\
0 - 2 (C P_{k+1, k})^T + 2 K (C P_{k+1, k} C^T + R) &= 0
\end{align*}

and therefore, solving for KK we have

$$ K = P_{k+1, k} C^T (C P_{k+1, k} C^T + R)^{-1} $$

and this inverse will (almost) always exist, since $R$ is positive semi-definite and $C P_{k+1,k} C^T$ is symmetric positive semi-definite. If it does not exist, the Kalman Filter cannot be used; this can be remedied by choosing $R$ to be positive definite, in which case the inverse always exists.
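In code, a common practice (not part of the derivation above, just a numerical note) is to avoid forming the inverse explicitly and instead solve the linear system $K (C P C^T + R) = P C^T$; similarly, the un-simplified ‘Joseph form’ update $(I - KC)P(I - KC)^T + KRK^T$ is often preferred over $(I - KC)P$ because it keeps the covariance symmetric and positive semi-definite in floating point. A sketch:

```python
import numpy as np

def gain_and_joseph_update(P_prior, C, R):
    """Compute K via a linear solve and update the covariance in Joseph form."""
    S = C @ P_prior @ C.T + R                  # innovation covariance
    # Solve K S = P C^T (i.e. S^T K^T = C P^T) instead of inverting S
    K = np.linalg.solve(S.T, C @ P_prior.T).T
    I = np.eye(P_prior.shape[0])
    P_post = (I - K @ C) @ P_prior @ (I - K @ C).T + K @ R @ K.T
    return K, P_post
```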

Finally, since we have found the Kalman gain, we can substitute it back into the posterior covariance update, and notice some simplifications:

\begin{align*}
P_{k+1, k+1} &= (I - K C) P_{k+1, k}(I - K C)^T + K R K^T\\
&= (I - K C) (P_{k+1, k} - P_{k+1, k} C^T K^T) + K R K^T\\
&= (I - K C) P_{k+1, k} - (I - K C) P_{k+1, k} C^T K^T + K R K^T\\
&= (I - K C) P_{k+1, k} - P_{k+1, k} C^T K^T + K C P_{k+1, k} C^T K^T + K R K^T\\
&= (I - K C) P_{k+1, k} - P_{k+1, k} C^T K^T + K (C P_{k+1, k} C^T + R) K^T\\
&= (I - K C) P_{k+1, k} - P_{k+1, k} C^T K^T + P_{k+1, k} C^T K^T\\
&= (I - K C) P_{k+1, k}
\end{align*}

where the second-to-last line used $K (C P_{k+1, k} C^T + R) = P_{k+1, k} C^T$, which follows directly from the formula for $K$.

and that's it. That's the derivation of the Kalman Filter equations. Hopefully this is less scary than whatever you had in the textbook. In essence, we are simply propagating the covariance through the predictor and the corrector steps.
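As a final numerical sanity check (my own sketch, with an arbitrary random system), we can confirm that at the optimal gain the short form $(I - KC)P_{k+1,k}$ agrees with the full expression $(I - KC)P_{k+1,k}(I - KC)^T + KRK^T$, and that perturbing $K$ only increases the trace:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 2

# arbitrary random prior covariance, C and R (for illustration only)
Lp = rng.standard_normal((n, n)); P_prior = Lp @ Lp.T
C = rng.standard_normal((m, n))
Lr = rng.standard_normal((m, m)); R = Lr @ Lr.T

K = P_prior @ C.T @ np.linalg.inv(C @ P_prior @ C.T + R)

def P_post(K_):
    I = np.eye(n)
    return (I - K_ @ C) @ P_prior @ (I - K_ @ C).T + K_ @ R @ K_.T

short_form = (np.eye(n) - K @ C) @ P_prior
print(np.abs(P_post(K) - short_form).max())              # ~0: the forms agree

dK = 0.1 * rng.standard_normal((n, m))                   # random perturbation
print(np.trace(P_post(K)) <= np.trace(P_post(K + dK)))   # True: K is optimal
```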

Summary:

To summarise, here are the equations again, in the order you need to compute them:

$$
\boxed{
\begin{aligned}
\hat{x}_{k+1, k} &= A \hat{x}_{k,k} + B u_k\\
P_{k+1, k} &= A P_{k,k} A^T + Q\\
K &= P_{k+1, k} C^T (C P_{k+1, k} C^T + R)^{-1}\\
\hat{x}_{k+1,k+1} &= \hat{x}_{k+1, k} + K (y_{k+1} - C \hat{x}_{k+1, k})\\
P_{k+1, k+1} &= (I - K C) P_{k+1, k}
\end{aligned}
}
$$

If the matrices $A, B, C, Q, R$ are time-varying, simply use the time-varying versions in the equations above.
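To see the boxed equations in action, here is a small self-contained simulation (my own example: a constant-velocity model with a position-only measurement; all the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
dt = 0.1

# constant-velocity model: state = [position, velocity]
A = np.array([[1.0, dt],
              [0.0, 1.0]])
B = np.array([[0.0],
              [dt]])
C = np.array([[1.0, 0.0]])          # we only measure position
Q = np.diag([1e-4, 1e-3])           # process noise covariance
R = np.array([[0.25]])              # measurement noise covariance

x = np.array([0.0, 1.0])            # true initial state
x_hat = np.array([0.0, 0.0])        # initial estimate x_{0,0}
P = np.eye(2)                       # initial covariance P_{0,0}

for k in range(100):
    u = np.array([0.1])             # known input (arbitrary)

    # simulate the true system and the noisy measurement
    x = A @ x + B @ u + rng.multivariate_normal(np.zeros(2), Q)
    y = C @ x + rng.multivariate_normal(np.zeros(1), R)

    # the boxed equations: predict, then correct
    x_prior = A @ x_hat + B @ u
    P_prior = A @ P @ A.T + Q
    K = P_prior @ C.T @ np.linalg.inv(C @ P_prior @ C.T + R)
    x_hat = x_prior + K @ (y - C @ x_prior)
    P = (np.eye(2) - K @ C) @ P_prior

print("true state:     ", x)
print("estimated state:", x_hat)
```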

Closing Notes:

Technically, to prove the optimality of the Kalman filter, you would need to determine why the predictor-corrector separation is possible and indeed optimal in the first place. I would refer to textbooks for that part of the derivation.

Also, in the above derivation, we assumed that the $A, B, C, Q, R$ matrices are perfectly well known. I wonder how this should be modified if these matrices are uncertain too. My intuition says we can bound the error in $A, B, C$ and include it in the $Q, R$ matrices, but I am not sure.