Consider a linear time-invariant dynamical system, written in discrete time:
$$\begin{aligned} x_{k+1} &= Ax_k + Bu_k + w_k \\ y_k &= Cx_k + v_k \end{aligned}$$
where
$w_k \sim \mathcal{N}(0, Q)$ is zero-mean process noise with covariance $Q$.
$v_k \sim \mathcal{N}(0, R)$ is zero-mean measurement noise with covariance $R$.
$x_0 \sim \mathcal{N}(\hat{x}_0, P_{0,0})$ is the initial state, which is assumed to be normally distributed around $\hat{x}_0$ with covariance $P_{0,0}$.
For the KF to work, we must assume that the system $(A, C)$ is observable.
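For example, observability can be checked numerically by testing whether the observability matrix $[\,C^T \;\; (CA)^T \;\; \cdots \;\; (CA^{n-1})^T\,]^T$ has full column rank $n$. Here is a minimal sketch in Python/NumPy, using a hypothetical constant-velocity model purely for illustration:

```python
import numpy as np

# Hypothetical 2-state constant-velocity model (position, velocity),
# measuring position only. These matrices are purely illustrative.
dt = 0.1
A = np.array([[1.0, dt],
              [0.0, 1.0]])
C = np.array([[1.0, 0.0]])

# Observability matrix: stack C, CA, CA^2, ..., CA^(n-1)
n = A.shape[0]
O = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(n)])

# (A, C) is observable iff the observability matrix has full column rank n
print(np.linalg.matrix_rank(O) == n)  # True for this model
```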
Notation:
$x_k$ is the true state at time $k$.
$\hat{x}_{k,k}$ is our best estimate of $x_k$, using all measurements up to and including time $k$. This is called the posterior estimate.
$\hat{x}_{k+1,k}$ is our best estimate of $x_{k+1}$, using all measurements up to and including time $k$. This is called the prior estimate.
$P_{k,k} = \mathrm{cov}(x_k - \hat{x}_{k,k})$ is the covariance of the posterior estimate.
$P_{k+1,k} = \mathrm{cov}(x_{k+1} - \hat{x}_{k+1,k})$ is the covariance of the prior estimate.
Final Equations
We will be generating an estimate of the state in the following way:
Predict the state, in an open-loop sense.
Correct the estimate, using the received measurements.
Mathematically,
$$\begin{aligned} \hat{x}_{k+1,k} &= A\hat{x}_{k,k} + Bu_k \\ P_{k+1,k} &= AP_{k,k}A^T + Q \end{aligned}$$
is a direct prediction of where the state should be based on the previous best estimate. The covariance of the prior is also updated.
Now, once the new measurement comes in, we can check how wrong the estimate is. We call this the innovation, $i_{k+1}$:
$$i_{k+1} = y_{k+1} - C\hat{x}_{k+1,k}$$
and so we correct the estimated state as
$$\begin{aligned} \hat{x}_{k+1,k+1} &= \hat{x}_{k+1,k} + K i_{k+1} \\ P_{k+1,k+1} &= (I - KC)\,P_{k+1,k} \end{aligned}$$
by simply taking the prior estimate and shifting it to better match the measurement. Notice that, since the measurement is noisy, we don't shift it to perfectly match the noisy measurement. The proportion of the shift is controlled by $K$, the Kalman Gain, which is given by
$$K = P_{k+1,k}C^T\left(CP_{k+1,k}C^T + R\right)^{-1}$$
which looks scary, but is actually quite straightforward, as demonstrated below. Also note that $K$ changes at each time step; I have just dropped the subscript on $K_k$ since it only gets in the way.
The important thing to remember is what the Kalman gain is trying to optimize: it is trying to make $P_{k+1,k+1}$, the posterior state estimate covariance, as 'small' as possible. Rigorously speaking, it is trying to minimise $\mathrm{tr}(P_{k+1,k+1})$.
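To make the predict/correct structure concrete, here is a minimal sketch of one filter iteration in Python/NumPy, transcribing the equations above directly (the function and variable names are my own, not a standard API):

```python
import numpy as np

def kf_step(x_post, P_post, u, y, A, B, C, Q, R):
    """One Kalman filter iteration: predict with the model, then correct with the measurement y."""
    # Predict (prior): x_hat_{k+1,k} = A x_hat_{k,k} + B u_k,   P_{k+1,k} = A P_{k,k} A^T + Q
    x_prior = A @ x_post + B @ u
    P_prior = A @ P_post @ A.T + Q

    # Kalman gain: K = P_{k+1,k} C^T (C P_{k+1,k} C^T + R)^{-1}
    S = C @ P_prior @ C.T + R                      # innovation covariance
    K = P_prior @ C.T @ np.linalg.inv(S)

    # Correct (posterior), using the innovation i_{k+1} = y_{k+1} - C x_hat_{k+1,k}
    innovation = y - C @ x_prior
    x_post = x_prior + K @ innovation
    P_post = (np.eye(len(x_prior)) - K @ C) @ P_prior
    return x_post, P_post
```

Running the filter is then just a loop over kf_step, seeded with $\hat{x}_0$ and $P_{0,0}$.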
In the rest of this page, we derive the above equations. We first need a bit of mathematical background, and then we derive the equations.
Useful Mathematical Properties
There are a few mathematical properties that will be used in the proof. If you are familiar with them, feel free to skip this section.
Properties of Gaussian Random Vectors
A vector $w \in \mathbb{R}^n$ is a Gaussian vector if it is sampled from a normal distribution. We write
$$w \sim \mathcal{N}(\bar{w}, Q)$$
where $\bar{w} = E(w)$ is the mean, and $Q = \mathrm{cov}(w)$ is the covariance matrix,
$$Q = \mathrm{cov}(w) = E\left[(w - \bar{w})(w - \bar{w})^T\right] = E[ww^T] - \bar{w}\bar{w}^T$$
which is always square (of size $n \times n$), symmetric, and positive semi-definite. Notice that if $\bar{w} = 0$, the covariance matrix simplifies to $\mathrm{cov}(w) = E[ww^T]$.
The covariance matrix also has the following properties, which we will be using:
if $A \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^n$ are a constant matrix and vector, then
$$\mathrm{cov}(Aw + b) = A\,\mathrm{cov}(w)\,A^T$$
the negative sign does not affect the covariance:
$$\mathrm{cov}(-Aw) = A\,\mathrm{cov}(w)\,A^T$$
but you have to be careful when subtracting:
$$\mathrm{cov}(Aw - Bw) = (A - B)\,\mathrm{cov}(w)\,(A - B)^T$$
not $A\,\mathrm{cov}(w)\,A^T + B\,\mathrm{cov}(w)\,B^T$, nor $(A + B)\,\mathrm{cov}(w)\,(A + B)^T$.
if (and only if) you have two uncorrelated random vectors $w \sim \mathcal{N}(\bar{w}, Q)$ and $v \sim \mathcal{N}(\bar{v}, R)$, then
$$\mathrm{cov}(v + w) = \mathrm{cov}(v) + \mathrm{cov}(w)$$
if the two random vectors are correlated (will not be used in this document, but useful to know), we have
$$\mathrm{cov}(v + w) = \mathrm{cov}(v) + \mathrm{cov}(w) + \mathrm{cov}(v, w) + \mathrm{cov}(w, v)$$
where $\mathrm{cov}(v, w) = E\left[(v - \bar{v})(w - \bar{w})^T\right]$. You can use this to show why the minus sign appears in $\mathrm{cov}(Aw - Bw)$.
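These identities are easy to verify numerically. The sketch below (the matrices are illustrative choices of mine) compares sample covariances against the formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])                     # covariance of w
A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([[0.5, 0.0], [1.0, 1.0]])
b = np.array([3.0, -1.0])

w = rng.multivariate_normal(np.zeros(2), Q, size=200_000)   # rows are samples of w

# cov(Aw + b) should equal A Q A^T: the constant offset b drops out
print(np.cov((w @ A.T + b).T))
print(A @ Q @ A.T)

# cov(Aw - Bw) should equal (A - B) Q (A - B)^T, not A Q A^T + B Q B^T
print(np.cov((w @ A.T - w @ B.T).T))
print((A - B) @ Q @ (A - B).T)
```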
Properties of Trace and Trace Derivatives
We will be minimising the trace of a matrix, and so it is useful to know the following: the trace is the sum of the diagonal entries, which also equals the sum of the eigenvalues,
$$\mathrm{tr}(P) = \sum_i P_{ii} = \sum_i \lambda_i(P)$$
and, for the minimisation, the matrix-derivative identities
$$\frac{\partial}{\partial K}\,\mathrm{tr}(KA) = A^T, \qquad \frac{\partial}{\partial K}\,\mathrm{tr}(KAK^T) = K\left(A + A^T\right)$$
Predictor Step
The predictor step propagates the posterior estimate through the open-loop dynamics, so the prior estimation error is
$$x_{k+1} - \hat{x}_{k+1,k} = \left(Ax_k + Bu_k + w_k\right) - \left(A\hat{x}_{k,k} + Bu_k\right) = A\left(x_k - \hat{x}_{k,k}\right) + w_k$$
and therefore
$$P_{k+1,k} = \mathrm{cov}\!\left(A(x_k - \hat{x}_{k,k}) + w_k\right) = \mathrm{cov}\!\left(A(x_k - \hat{x}_{k,k})\right) + \mathrm{cov}(w_k) = AP_{k,k}A^T + Q$$
where the first step used the assumption that the noise is uncorrelated with the state and error. Notice the $AP_{k,k}A^T$ comes from the properties of random vectors.
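As a quick sanity check of the predictor covariance update (an illustrative Monte Carlo sketch with made-up numbers), we can propagate samples of $x_k$ and $w_k$ through the dynamics and compare the sample covariance of the prediction error with $AP_{k,k}A^T + Q$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])        # illustrative dynamics
Q = 0.01 * np.eye(2)                          # process noise covariance
P_post = np.array([[0.5, 0.1], [0.1, 0.2]])   # posterior covariance P_{k,k}
x_hat = np.array([1.0, -0.5])                 # posterior estimate x_hat_{k,k}

# Sample true states around the estimate, then propagate through the dynamics
x_k = rng.multivariate_normal(x_hat, P_post, size=200_000)
w_k = rng.multivariate_normal(np.zeros(2), Q, size=200_000)
x_next = x_k @ A.T + w_k                      # x_{k+1} = A x_k + w_k (B u_k omitted: it cancels in the error)

# Prediction error relative to the prior estimate A x_hat_{k,k}
err = x_next - x_hat @ A.T
print(np.cov(err.T))                          # sample covariance of the prediction error
print(A @ P_post @ A.T + Q)                   # A P_{k,k} A^T + Q
```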
Corrector Step
Now the corrector step performs the following operation: the posterior estimation error is
$$x_{k+1} - \hat{x}_{k+1,k+1} = x_{k+1} - \hat{x}_{k+1,k} - K\left(Cx_{k+1} + v_{k+1} - C\hat{x}_{k+1,k}\right) = (I - KC)\left(x_{k+1} - \hat{x}_{k+1,k}\right) - Kv_{k+1}$$
and so, since the measurement noise is uncorrelated with the prior error,
$$P_{k+1,k+1} = (I - KC)\,P_{k+1,k}\,(I - KC)^T + KRK^T$$
which shows rather clearly that the posterior covariance is a weighted sum of the prior covariance and the measurement noise! The weight is essentially the Kalman gain $K$.
So how should we pick $K$? Ultimately we want a small posterior covariance, and so we want to choose a $K$ that minimises $P_{k+1,k+1}$ (in the trace sense discussed above). Before we do this minimisation, let's apply some intuition.
Suppose, after the minimisation, we have a large $K$. That means we are not worried about the second term, $KRK^T$: the $R$ must be really small, and therefore the measurement noise is very small. Conversely, suppose $K$ was close to $0$. Then $P_{k+1,k+1} \approx P_{k+1,k}$: we are ignoring the measurements, since the measurement noise is too large!
In this way, the 'magnitude' of $K$ is determined by whether we trust the measurements (large $K$) or trust the open-loop dynamics (small $K$).
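For a little extra concreteness, consider the scalar case with $C = 1$ (a small worked example; it follows directly from the gain formula given earlier). The gain becomes
$$K = \frac{P_{k+1,k}}{P_{k+1,k} + R}$$
so as $R \to 0$ we get $K \to 1$ (shift all the way to the measurement), and as $R \to \infty$ we get $K \to 0$ (keep the open-loop prediction).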
Let's use calculus to determine the optimal $K$:
Kalman Gain:
We wish to solve the following mathematical problem:
$$\underset{K}{\text{minimise}} \quad \mathrm{tr}\left(P_{k+1,k+1}\right)$$
where the trace is used since it is the sum of the eigenvalues of the matrix, which here equals the sum of the variances of each element of the error vector.
Substituting the corrector-step expression $P_{k+1,k+1} = (I - KC)P_{k+1,k}(I - KC)^T + KRK^T$ and using the trace-derivative identities above, setting the derivative to zero gives
$$\frac{\partial}{\partial K}\,\mathrm{tr}(P_{k+1,k+1}) = -2(I - KC)P_{k+1,k}C^T + 2KR = 0$$
and solving for $K$,
$$K\left(CP_{k+1,k}C^T + R\right) = P_{k+1,k}C^T \quad\Longrightarrow\quad K = P_{k+1,k}C^T\left(CP_{k+1,k}C^T + R\right)^{-1}$$
and this inverse will (almost) always exist, since $R$ is positive semi-definite and $CP_{k+1,k}C^T$ is positive semi-definite and symmetric. If it does not exist, the Kalman Filter cannot be used. This can be remedied by choosing $R$ to be positive definite, in which case the inverse always exists.
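To double-check numerically that this $K$ really does minimise the trace (a small illustrative sketch, using the general weighted-sum expression from the corrector step, which is valid for any gain):

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.5, 0.1], [0.1, 0.2]])    # prior covariance P_{k+1,k} (illustrative)
C = np.array([[1.0, 0.0]])                # measure the first state only
R = np.array([[0.04]])                    # measurement noise covariance

def posterior_trace(K):
    """tr(P_{k+1,k+1}) from the general form (I-KC) P (I-KC)^T + K R K^T, valid for any K."""
    I = np.eye(2)
    return np.trace((I - K @ C) @ P @ (I - K @ C).T + K @ R @ K.T)

K_opt = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)

# Random perturbations of the optimal gain never decrease the trace
trials = [posterior_trace(K_opt + 0.1 * rng.standard_normal(K_opt.shape)) for _ in range(1000)]
print(posterior_trace(K_opt) <= min(trials))  # True
```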
Finally, since we have found the Kalman gain, we can substitute it back into the posterior covariance update, and notice some simplifications:
$$\begin{aligned} P_{k+1,k+1} &= (I - KC)\,P_{k+1,k}\,(I - KC)^T + KRK^T \\ &= (I - KC)\,P_{k+1,k} - \left(P_{k+1,k}C^T - K\left(CP_{k+1,k}C^T + R\right)\right)K^T \\ &= (I - KC)\,P_{k+1,k} \end{aligned}$$
where the last step used the fact that $K\left(CP_{k+1,k}C^T + R\right) = P_{k+1,k}C^T$ for the optimal gain.
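A quick numerical confirmation of this simplification (same illustrative matrices as the sketch above): with the optimal gain, the general weighted-sum form and the simplified $(I - KC)P_{k+1,k}$ agree.

```python
import numpy as np

P = np.array([[0.5, 0.1], [0.1, 0.2]])    # prior covariance (illustrative)
C = np.array([[1.0, 0.0]])
R = np.array([[0.04]])
I = np.eye(2)

K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)              # optimal Kalman gain

general = (I - K @ C) @ P @ (I - K @ C).T + K @ R @ K.T   # weighted-sum form
simplified = (I - K @ C) @ P                              # simplified form

print(np.allclose(general, simplified))  # True -- but only for the optimal K
```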
and that's it. That's the derivation of the Kalman Filter equations. Hopefully this is less scary than whatever you had in the textbook. In essence, we are simply propagating the covariance, based on the predictor and the corrector step.
Summary:
To summarise, here are the equations again, in the order you need to compute them:
Predict:
$$\hat{x}_{k+1,k} = A\hat{x}_{k,k} + Bu_k, \qquad P_{k+1,k} = AP_{k,k}A^T + Q$$
Compute the Kalman gain:
$$K = P_{k+1,k}C^T\left(CP_{k+1,k}C^T + R\right)^{-1}$$
Correct:
$$\hat{x}_{k+1,k+1} = \hat{x}_{k+1,k} + K\left(y_{k+1} - C\hat{x}_{k+1,k}\right), \qquad P_{k+1,k+1} = (I - KC)\,P_{k+1,k}$$
If the matrices $A, B, C, Q, R$ are time-varying, simply use the time-varying versions in the equations above.
Closing Notes:
Technically, to prove the optimality of the Kalman Filter, you would need to determine why the predictor-corrector separation is possible, and indeed optimal, in the first place. I would refer to textbooks for that part of the derivation.
Also, in the above derivation, we assumed that the $A, B, C, Q, R$ matrices are perfectly well known. I wonder how this should be modified if these matrices are uncertain too. My intuition says we can bound the error of $A, B, C$ and include it in the $Q, R$ matrices, but I am not sure.