NB The derivation and “intuitive” explanations of terms and concepts in this post are designed with a reader like myself in mind. I wrote this post so that I would have something to refer back to. I have tried to make it accessible to outsiders, but if there are parts that just don’t make sense, consider it an insight into my twisted perspective on mathematics. Alternatively, get in touch and I’ll improve it as best I can.
The problem
In A theory of cortical responses Karl Friston introduces a generative model \(p(u,v)\) that a cognitive system might use. The model enables the system to infer the value of an unobservable state \(v\) from observed data \(u\).
Friston describes an objective function for this model (his equation 3.4, page 821): $$ L= \log{p(u)} - D(q(v)||p(v|u)) $$
By maximising this function (equivalently, minimising the divergence term, which is never negative), the system obtains an appropriate distribution over \(v\), \(q(v)\). This distribution represents the appropriate degrees of belief for the system to have, given observed data \(u\) and generative distribution \(p(u,v)\).
Friston goes on to define a few related terms and state a few assumptions, before claiming that \(L\) is equal to the following (his equation 3.7, page 821): $$ L= -\frac{1}{2}\xi_u^T\xi_u -\frac{1}{2}\xi_p^T\xi_p -\frac{1}{2}\log{|\Sigma_u|} -\frac{1}{2}\log{|\Sigma_p|} $$
The difference between these two formulations is stark. It is not in the least obvious how to derive equation 3.7 from equation 3.4.
This post is an attempt to reconstruct the intended derivation.
Definitions
The idea is that we’re going to guess a value of \(v\) after observing \(u\). The guessed value of \(v\) is labelled \(\class{mj_red}{\phi}.\)
- \(u = g(v,\theta)+\epsilon_u\) for some function \(g\), parameters \(\theta\) and noise term \(\epsilon_u.\) Intuition: what we observe is caused by something we cannot observe.
- \(v = v_p + \epsilon_p\) for some prior expectation \(v_p\) and noise term \(\epsilon_p.\) Intuition: unobservable states fluctuate depending on extraneous forces, all of which can be represented as a noise term.
- \(\mu_v=\mu_p=v_p\): Friston prefers the subscript \(p\) instead of \(v\). This term is the mean of the prior distribution over causes. Since \(p\) stands for ‘prior’ and \(\mu\) stands for ‘mean’, the mean of the prior could be called \(\mu_p.\) Friston calls it \(v_p.\)
- \(\mu_u=g(\class{mj_red}{\phi},\theta)\): The expected value of \(u\) given the choice \(v=\class{mj_red}{\phi}.\) Intuition: the observation is just what you get when you apply the generative process, \(g\), to the inferred cause, \(\class{mj_red}{\phi}\), given your chosen parameters \(\theta\).
- \(\Sigma_{uv}\): Covariance between \(v\) and \(u.\) See Wikipedia on covariance matrices.
- \(\Sigma_{uu}=\Sigma_{u}\): The covariance over data \(u.\)
- \(\Sigma_{vv}=\Sigma_{v}=\Sigma_p\): The covariance over causes \(v\) treated as part of our prior distribution, hence \(\Sigma_p.\)
Finally, we have the funky looking \(\xi\) (Greek letter xi) that appears in the second equation for \(L\). Friston defines these as follows:
$$ \begin{align} \xi_u &= \class{mj_blue}{\Sigma_u^{-\frac{1}{2}}}(u-g(\class{mj_red}{\phi},\theta))\\ \xi_p &= \class{mj_blue}{\Sigma_p^{-\frac{1}{2}}}(\class{mj_red}{\phi}-v_p) \end{align} $$
The two \(\xi\) terms define two constraints on our choice of \(\class{mj_red}{\phi}\). Intuitively, \(\xi_p\) represents the fact that we don’t want to stray too far from the prior. By the same token, \(\xi_u\) represents the fact that we want to make the data probable.
Assumptions
Friston takes the model to be a “static nonlinear generative model under Gaussian assumptions” (821). This entails the following:
- \(p(u)\) is a multivariate Gaussian distribution (see below; see also appendix 2)
- \(p(v|u)\) is a conditional multivariate Gaussian distribution (see below; see also appendix 3)
Friston makes three key assumptions that will allow us to show that the two definitions of \(L\) are equivalent:
- Assumption 1: \(q(v)\) is a point mass, so \(q(\class{mj_red}{\phi})=1\) and \(q(v)=0\) for any \(v\neq\class{mj_red}{\phi}\)
- Assumption 2: Both \(p(u)\) and \(p(v|u)\) are Gaussian
- Assumption 3: The noise terms \(\epsilon_u\) and \(\epsilon_p\) are uncorrelated.
The derivation
We will proceed in four parts. Here’s an overview:
- Get rid of \(q(v)\) by applying assumption 1
- Convert \(p(u)\) and \(p(v|u)\) to Gaussians by applying assumption 2
- Simplify \(p(v|u)\) by applying assumption 3
- A small amount of additional fiddling
Part 1: getting rid of \(q(v)\)
Start with our initial definition of \(L\):
$$ L= \log{p(u)} - D(q(v)||p(v|u)) $$
Write the relative entropy out fully:
$$ L= \log{p(u)} - \sum_v{q(v)\log{\frac{q(v)}{p(v|u)}}} $$
By assumption 1, \(q(v)\) is a point mass. This means every value is zero except \(q(\class{mj_red}{\phi})\), which is equal to \(1\).
$$ \begin{align} L&= \log{p(u)} - \left( 0 + 0 + … + q(\class{mj_red}{\phi})\log{\frac{q(\class{mj_red}{\phi})}{p(v|u)}}+ … + 0 + 0 \right)\\ &= \log{p(u)} - \log{\frac{1}{p(v|u)}} \end{align} $$
Logarithms have this cool property: \(\log{\frac{1}{a}}=-\log{a}.\) In other words, \(-\log{\frac{1}{a}}=\log{a}.\)
$$ L= \log{p(u)} + \log{p(v|u)} $$
Whew! Time to take a break.
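Before we do, here’s a quick numerical sketch of that collapse (my own check, not from the paper): for a discrete distribution, the relative entropy between a point mass at \(\class{mj_red}{\phi}\) and any distribution \(p\) reduces to \(-\log{p(\class{mj_red}{\phi})}\). The toy posterior below is made up purely for illustration.

```python
import numpy as np

# A made-up discrete "posterior" p(v|u) over five possible causes.
p_v_given_u = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# q(v) is a point mass at phi (here, index 2).
phi = 2
q = np.zeros_like(p_v_given_u)
q[phi] = 1.0

# Relative entropy D(q || p(v|u)), treating 0 * log(0/p) as 0.
nonzero = q > 0
D = np.sum(q[nonzero] * np.log(q[nonzero] / p_v_given_u[nonzero]))

# The point mass collapses the sum to a single term: -log p(phi|u).
print(np.isclose(D, -np.log(p_v_given_u[phi])))  # True
```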
Part 2: convert \(p(u)\) and \(p(v|u)\) to Gaussians
Using assumption 2, we can substitute in the formula for a multivariate Gaussian. We’ll do it one distribution at a time, starting with \(p(u)\):
$$ p(u) = \frac{1}{\sqrt{(2\pi)^k\class{mj_blue}{|\Sigma_u|}}} \exp{\left( -\frac{1}{2} (u-\mu_u)^{\intercal} \class{mj_blue}{\Sigma_u^{-1}} (u-\mu_u) \right)} $$
Blurgh! What an ugly formula. (By the way, \(k\) is the dimension of the data – the number of entries in the vector \(u.\)) Fortunately, since we’re interested in the logarithm of \(p(u)\), the result will be prettier:
$$ \log{p(u)}= \log{\frac{1}{\sqrt{(2\pi)^k\class{mj_blue}{|\Sigma_u|}}}} -\frac{1}{2} (u-\mu_u)^{\intercal} \class{mj_blue}{\Sigma_u^{-1}} (u-\mu_u) $$
Hmm… a little prettier, anyway.
We can improve matters by changing \(\mu_u\), the mean of \(u\). Recall that because \(u=g(v,\theta)+\epsilon_u\) we can express the mean – the expected value – of \(u\) by applying the function \(g\) to our guessed value of \(v\) and dropping the noise term, which has zero mean:
$$ \mu_u= g(\class{mj_red}{\phi},\theta) $$
Subbing this in, and making the logarithms a little nicer, we get:
$$ \begin{align} \log{p(u)}= -&\frac{1}{2} (u-g(\phi,\theta))^{\intercal} \class{mj_blue}{\Sigma_u^{-1}} (u-g(\phi,\theta))\\ &-\frac{1}{2}\log{\class{mj_blue}{|\Sigma_u|}}\\ &-\frac{1}{2}\log{(2\pi)^k} \end{align} $$
Remember, however ugly this looks, it’s still just a Gaussian with a logarithm applied.
The same formula gives \(\log{p(v|u)}\):
$$ \begin{align} \log{p(v|u)}= -&\frac{1}{2} (v-\mu_{v|u})^{\intercal} \class{mj_blue}{\Sigma_{v|u}^{-1}} (v-\mu_{v|u})\\ &-\frac{1}{2}\log{\class{mj_blue}{|\Sigma_{v|u}|}}\\ &-\frac{1}{2}\log{(2\pi)^k} \end{align} $$
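To check that I haven’t mangled the expansion, here’s a small numerical sketch (mine, with an arbitrary mean and covariance): evaluate a multivariate Gaussian log-density with scipy and compare it with the sum of the quadratic term, the log-determinant term and the constant term above.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
k = 3
u = rng.normal(size=k)           # an arbitrary "observation"
mu = rng.normal(size=k)          # its mean, e.g. g(phi, theta)
A = rng.normal(size=(k, k))
Sigma = A @ A.T + k * np.eye(k)  # an arbitrary positive-definite covariance

# Left-hand side: scipy's log p(u).
lhs = multivariate_normal(mean=mu, cov=Sigma).logpdf(u)

# Right-hand side: quadratic term + log-determinant term + constant term.
diff = u - mu
quad = -0.5 * diff @ np.linalg.inv(Sigma) @ diff
logdet = -0.5 * np.linalg.slogdet(Sigma)[1]
const = -0.5 * k * np.log(2 * np.pi)

print(np.isclose(lhs, quad + logdet + const))  # True
```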
Part 3: simplifying \(\log{p(v|u)}\)
Now we are faced with a tricky problem. The definitions of \(\mu_{v|u}\) and \(\Sigma_{v|u}\) are rather difficult to work with:
$$ \begin{align} \mu_{v|u} &= \mu_v + \Sigma_{vu}\Sigma_{uu}^{-1}(u-\mu_u)\\ \Sigma_{v|u} &= \Sigma_{vv} - \Sigma_{vu}\Sigma_{uu}^{-1}\Sigma_{uv} \end{align} $$
Fortunately, help is at hand in the shape of assumption 3, which states that the noise terms \(\epsilon_u\) and \(\epsilon_p\) are uncorrelated. This means the cross-covariance between \(u\) and \(v\) is zero:
$$ \Sigma_{uv} = \Sigma_{vu} = 0 $$
As a result, the conditional mean and covariance simplify nicely:
$$ \begin{align} \mu_{v|u} &= \mu_v = v_p\\ \Sigma_{v|u} &= \Sigma_{vv} = \Sigma_v \end{align} $$
So the simplified expression for \(\log{p(v|u)}\) is basically the same as that for \(\log{p(u)}\). We also evaluate it at \(v=\class{mj_red}{\phi}\): the point-mass assumption from Part 1 means the conditional density gets evaluated at our guess, which emphasises that the value of this term depends on what we choose to believe:
$$ \begin{align} \log{p(\class{mj_red}{\phi}|u)}= -&\frac{1}{2} (\class{mj_red}{\phi}-v_p)^{\intercal} \class{mj_blue}{\Sigma_{v}^{-1}} (\class{mj_red}{\phi}-v_p)\\ &-\frac{1}{2}\log{\class{mj_blue}{|\Sigma_{v}|}}\\ &-\frac{1}{2}\log{(2\pi)^k} \end{align} $$
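As a sanity check (my own sketch, with made-up numbers), here is the general conditioning formula applied to a joint covariance whose cross-covariance block is zero: the conditional mean and covariance come out equal to the prior mean and covariance.

```python
import numpy as np

# Arbitrary prior mean/covariance for v, and covariance for u.
mu_v = np.array([1.0, -2.0])
Sigma_v = np.array([[2.0, 0.3],
                    [0.3, 1.0]])
Sigma_u = np.array([[1.5, 0.2],
                    [0.2, 0.8]])
Sigma_vu = np.zeros((2, 2))      # assumption 3: zero cross-covariance

u = np.array([0.5, 0.7])         # an arbitrary observation
mu_u = np.array([0.0, 0.0])

# General formulas for the conditional moments.
mu_v_given_u = mu_v + Sigma_vu @ np.linalg.inv(Sigma_u) @ (u - mu_u)
Sigma_v_given_u = Sigma_v - Sigma_vu @ np.linalg.inv(Sigma_u) @ Sigma_vu.T

print(np.allclose(mu_v_given_u, mu_v))        # True
print(np.allclose(Sigma_v_given_u, Sigma_v))  # True
```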
Part 4: fiddly bits
Recall what we are aiming for:
$$ L= -\frac{1}{2}\xi_u^T\xi_u -\frac{1}{2}\xi_p^T\xi_p -\frac{1}{2}\log{\class{mj_blue}{|\Sigma_u|}} -\frac{1}{2}\log{\class{mj_blue}{|\Sigma_p|}} $$
Compare it to what we currently have:
$$ \begin{align} L=&\ \log{p(u)}+\log{p(v|u)}\\ =&-\frac{1}{2} (u-g(\class{mj_red}{\phi},\theta))^{\intercal} \class{mj_blue}{\Sigma_u^{-1}} (u-g(\class{mj_red}{\phi},\theta))\\ &-\frac{1}{2}\log{\class{mj_blue}{|\Sigma_u|}}\\ &-\frac{1}{2}\log{(2\pi)^k}\\ &-\frac{1}{2} (\class{mj_red}{\phi}-v_p)^{\intercal} \class{mj_blue}{\Sigma_{v}^{-1}} (\class{mj_red}{\phi}-v_p)\\ &-\frac{1}{2}\log{\class{mj_blue}{|\Sigma_{v}|}}\\ &-\frac{1}{2}\log{(2\pi)^k} \end{align} $$
Good news: we have a couple of matching terms! The \(-\frac{1}{2}\log{\class{mj_blue}{|\Sigma|}}\) terms are just what we needed (recall from the definitions that \(\Sigma_v=\Sigma_p\)). We’re halfway there!
To see how the other terms match up, notice that what we’re asking is whether the following is true:
$$ \begin{gather*} -\frac{1}{2}\xi_u^{\intercal}\xi_u\\ =_?\\ -\frac{1}{2} (u-g(\class{mj_red}{\phi},\theta))^{\intercal} \class{mj_blue}{\Sigma_u^{-1}} (u-g(\class{mj_red}{\phi},\theta)) \end{gather*} $$
Using the definition of \(\xi_u\), this can be written more explicitly:
$$ \begin{gather*} -\frac{1}{2} \left( \class{mj_blue}{\Sigma_u^{-\frac{1}{2}}}(u-g(\class{mj_red}{\phi},\theta)) \right)^{\intercal} \left( \class{mj_blue}{\Sigma_u^{-\frac{1}{2}}}(u-g(\class{mj_red}{\phi},\theta)) \right)\\ =_?\\ -\frac{1}{2} (u-g(\class{mj_red}{\phi},\theta))^{\intercal} \class{mj_blue}{\Sigma_u^{-1}} (u-g(\class{mj_red}{\phi},\theta)) \end{gather*} $$
You can see that we at least have all the right terms in there! But do they match?
Let’s just look at the abstract form of the equation. \(\class{mj_blue}{\Sigma_u}\) is just a matrix, so we’ll call it \(A.\) And \(u-g(\class{mj_red}{\phi},\theta)\) is just a vector, so we’ll call it \(w\) (since we’re using \(v\) elsewhere). Ignoring the factor of \(-\frac{1}{2}\), we are asking:
$$ \begin{gather*} \left(A^{-\frac{1}{2}}w\right)^{\intercal}\left(A^{-\frac{1}{2}}w\right)\\ =_?\\ w^{\intercal}A^{-1}w \end{gather*} $$
And we can muck around a bit with the bottom equation to make it equal to the top:
$$ \begin{align*} w^{\intercal}A^{-1}w &=\left(w^{\intercal}A^{-\frac{1}{2}}\right)\left(A^{-\frac{1}{2}}w\right)\\ &= \left(A^{-\frac{1}{2}}w\right)^{\intercal}\left(A^{-\frac{1}{2}}w\right) \end{align*} $$
The second step works because \(A\) stands in for a covariance matrix, so \(A\) and hence \(A^{-\frac{1}{2}}\) are symmetric, which gives \(w^{\intercal}A^{-\frac{1}{2}}=\left(A^{-\frac{1}{2}}w\right)^{\intercal}.\)
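If you’d rather see the identity numerically, here’s a sketch (arbitrary numbers): build a symmetric positive-definite \(A\), form its inverse square root from an eigendecomposition, and compare the two sides.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
B = rng.normal(size=(k, k))
A = B @ B.T + k * np.eye(k)      # symmetric positive-definite, like a covariance
w = rng.normal(size=k)

# Symmetric inverse square root of A via its eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(A)
A_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

lhs = (A_inv_sqrt @ w) @ (A_inv_sqrt @ w)  # (A^{-1/2} w)^T (A^{-1/2} w)
rhs = w @ np.linalg.inv(A) @ w             # w^T A^{-1} w

print(np.isclose(lhs, rhs))  # True
```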
All right! We’ve matched every term in the equation!…… except–
Part 5: what about \(\pi\)?
Yes, there’s a fifth unannounced part to this mystery, because we have two leftover terms. Both have the form \(-\frac{1}{2}\log{(2\pi)^k}\), with \(k\) the dimension of \(u\) in the first and of \(v\) in the second; when the two dimensions are equal they add up to \(-k\log{2\pi}\). What’s going on?
Friston doesn’t mention this term at all, and at first I assumed I’d got a sign wrong and they were supposed to cancel each other out. But I’m pretty sure that’s not it. I think he’s ignoring it because it’s a constant that can’t be affected by any of the other statistics. The only thing it depends on is the dimensionality of the variables, i.e. the number of components \(u\) and \(v\) have.
Still, one would have thought a little bit of clarity is not too much to ask.
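To convince myself the whole derivation hangs together, here’s an end-to-end sketch (entirely my own: the generative function \(g\), the parameters and all the numbers are made up). It computes equation 3.4 under the assumptions above, i.e. \(\log{p(u)}+\log{p(\class{mj_red}{\phi}|u)}\), and checks that it equals equation 3.7 plus the leftover constant from Part 5.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
k_u, k_v = 3, 2                      # dimensions of u and v

def g(v, theta):
    # A made-up nonlinear generative function, purely for illustration.
    return theta @ np.tanh(v)

theta = rng.normal(size=(k_u, k_v))  # arbitrary parameters
u = rng.normal(size=k_u)             # arbitrary observation
phi = rng.normal(size=k_v)           # arbitrary guess at the cause
v_p = rng.normal(size=k_v)           # arbitrary prior mean

def random_cov(k):
    B = rng.normal(size=(k, k))
    return B @ B.T + k * np.eye(k)   # arbitrary positive-definite covariance

Sigma_u, Sigma_p = random_cov(k_u), random_cov(k_v)

def inv_sqrt(S):
    # Symmetric inverse square root via eigendecomposition.
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

# Equation 3.4 under the assumptions: log p(u) + log p(phi|u).
L_34 = (multivariate_normal(mean=g(phi, theta), cov=Sigma_u).logpdf(u)
        + multivariate_normal(mean=v_p, cov=Sigma_p).logpdf(phi))

# Equation 3.7.
xi_u = inv_sqrt(Sigma_u) @ (u - g(phi, theta))
xi_p = inv_sqrt(Sigma_p) @ (phi - v_p)
L_37 = (-0.5 * xi_u @ xi_u - 0.5 * xi_p @ xi_p
        - 0.5 * np.linalg.slogdet(Sigma_u)[1]
        - 0.5 * np.linalg.slogdet(Sigma_p)[1])

# The constant term Friston drops (Part 5).
const = -0.5 * (k_u + k_v) * np.log(2 * np.pi)

print(np.isclose(L_34, L_37 + const))  # True
```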
Appendix 1: Gaussian distributions
A distribution is a function that tells you how probable it is for a random variable to take a particular value (for a continuous variable, strictly speaking, it gives a probability density rather than a probability). A multivariate distribution does the same thing for a vector of values.
The formula for the univariate Gaussian distribution is: $$ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( -\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \right)} $$
where
- \(x\) is the value whose probability you are calculating
- \(\sigma^2\) is the variance of the distribution
- \(\mu\) is the mean of the distribution
Here the parameters of the distribution are its variance and mean. By supplying these two values, you can calculate the probability of any \(x\).
Intuitively, consider what happens if \(x\) is far away from the mean:
- \(x\) is far from \(\mu\)
- \((x-\mu)^2\) is a large (positive) value
- \(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\) is a large (negative) value
- \(\exp{\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)}\) is a small (positive) value
- \(p(x)\) is a small positive value
On the other hand, consider what happens when \(x\) is close to the mean:
- \(x\) is close to \(\mu\)
- \((x-\mu)^2\) is a smaller (positive) value
- \(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\) is a smaller (negative) value
- \(\exp{\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)}\) is a larger (positive) value
- \(p(x)\) is a larger positive value
The closer your \(x\) is to the mean, the greater is its probability.
The variance modulates this effect. A large variance shrinks the differences between those negative values (look at the third lines above), which pushes all the different values of \(p(x)\) closer together.
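Here’s a tiny sketch (mine, with arbitrary parameters) of the formula above, checked against scipy, which also shows the density shrinking as \(x\) moves away from the mean.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0  # arbitrary mean and standard deviation

def gaussian_pdf(x, mu, sigma):
    # The univariate formula written out directly.
    return np.exp(-0.5 * (x - mu) ** 2 / sigma ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

xs = np.array([1.0, 2.0, 5.0, 10.0])  # increasingly far from the mean
print(np.allclose(gaussian_pdf(xs, mu, sigma), norm.pdf(xs, loc=mu, scale=sigma)))  # True
print(gaussian_pdf(xs, mu, sigma))    # densities shrink as x moves away from mu
```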
Appendix 2: Multivariate Gaussian distributions
Consider how this formula must change when \(x\) is a vector. The mean \(\mu\) can be a vector, so you can still use \((x-\mu)\) in the formula. But the variance is no longer a single number: it becomes a covariance matrix.
A covariance matrix \(\Sigma\) lists all the variances of each component \(x_i\) on the diagonal: $$ \Sigma_{ii} = \mathrm{Var}(x_i) $$
Its non-diagonal entries contain the covariances between components: $$ \Sigma_{ij} = \Sigma_{ji} = \mathrm{Cov}(x_i,x_j) $$
From this you can tell that covariance matrices are always symmetric.
In addition, covariance matrices are always positive-semidefinite, a property we won’t worry about for now.
Finally, note that \(\class{mj_blue}{|\Sigma|}\) means the determinant of \(\class{mj_blue}{\Sigma}\).
Since multivariate distributions contain matrices and vectors, the formula is going to contain matrix multiplication. For a random variable with \(k\) components, the formula is:
$$ p(x)= \frac{1}{\sqrt{(2\pi)^k\class{mj_blue}{|\Sigma|}}} \exp{\left( -\frac{1}{2} (x-\mu)^{\intercal} \class{mj_blue}{\Sigma^{-1}} (x-\mu) \right)} $$
What’s going on with the covariance matrix?
In the univariate distribution, the point of the variance was to modulate how the distance from the mean affected the probability. The same thing is going on here. Instead of dividing the squared-difference by the variance, we are putting the inverse covariance matrix in between the difference vector and its transpose. The result is a real number. (As for why we use this particular relationship between covariance and distance-from-the-mean, that’s beyond the scope of this post.)
As in the univariate case, the factor involving \(\class{mj_blue}{|\Sigma|}\) out the front is just there to normalise the whole thing.
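To make the covariance matrix concrete, here’s a small sketch (with made-up data): estimate \(\Sigma\) from samples, check it has the structure described above, and check that the quadratic form in the exponent is a single real number.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.multivariate_normal(mean=[0.0, 1.0, -1.0],
                                  cov=[[2.0, 0.5, 0.0],
                                       [0.5, 1.0, 0.3],
                                       [0.0, 0.3, 1.5]],
                                  size=10_000)

Sigma = np.cov(samples, rowvar=False)  # estimated covariance matrix

print(np.allclose(Sigma, Sigma.T))                               # symmetric
print(np.allclose(np.diag(Sigma), samples.var(axis=0, ddof=1)))  # diagonal = variances

# The exponent's quadratic form: a vector times a matrix times a vector is a scalar.
x = samples[0]
mu = samples.mean(axis=0)
quad = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
print(quad)  # a single real (and non-negative) number
```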
Appendix 3: Conditional multivariate Gaussian distributions
Our formula contains a conditional distribution \(p(v|u)\). We need to know the formula for a conditional multivariate Gaussian.
Like the multivariate Gaussian, the formula will contain a covariance matrix \(\Sigma_{v|u}\). It will also contain a modified mean, \(\mu_{v|u}\).
The bad news is that these two terms have awkward definitions. The good news is that we’re going to assume the noise terms \(\epsilon_u\) and \(\epsilon_p\) are uncorrelated, which means those awkward definitions can be much simplified. The definitions are:
$$ \begin{align} \mu_{v|u} &= \mu_v + \Sigma_{vu}\Sigma_{uu}^{-1}(u-\mu_u)\\ \Sigma_{v|u} &= \Sigma_{vv} - \Sigma_{vu}\Sigma_{uu}^{-1}\Sigma_{uv} \end{align} $$
By assuming the noises for \(u\) and \(v\) are uncorrelated, we are saying \(\Sigma_{vu}=\Sigma_{uv}=0\). So those definitions simplify:
$$ \begin{align} \mu_{v|u} &= \mu_v\\ \Sigma_{v|u} &= \Sigma_{vv} = \Sigma_v \end{align} $$
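For completeness, here’s a sketch (arbitrary numbers, not from the paper) checking the general formulas: build a joint Gaussian over \((u,v)\) with a nonzero cross-covariance, compute \(\log{p(v|u)}\) once as \(\log{p(u,v)}-\log{p(u)}\) and once from the conditional mean and covariance above, and confirm the two agree.

```python
import numpy as np
from scipy.stats import multivariate_normal

# An arbitrary joint Gaussian over (u, v), each 2-dimensional, with nonzero cross-covariance.
mu_u, mu_v = np.array([0.0, 1.0]), np.array([-1.0, 2.0])
Sigma_uu = np.array([[2.0, 0.4], [0.4, 1.0]])
Sigma_vv = np.array([[1.5, 0.2], [0.2, 0.9]])
Sigma_vu = np.array([[0.3, 0.1], [0.0, 0.2]])

mu_joint = np.concatenate([mu_u, mu_v])
Sigma_joint = np.block([[Sigma_uu, Sigma_vu.T],
                        [Sigma_vu, Sigma_vv]])

u = np.array([0.5, -0.3])
v = np.array([0.1, 1.8])

# Route 1: log p(v|u) = log p(u, v) - log p(u).
lhs = (multivariate_normal(mean=mu_joint, cov=Sigma_joint).logpdf(np.concatenate([u, v]))
       - multivariate_normal(mean=mu_u, cov=Sigma_uu).logpdf(u))

# Route 2: the conditional mean and covariance formulas.
mu_v_given_u = mu_v + Sigma_vu @ np.linalg.inv(Sigma_uu) @ (u - mu_u)
Sigma_v_given_u = Sigma_vv - Sigma_vu @ np.linalg.inv(Sigma_uu) @ Sigma_vu.T
rhs = multivariate_normal(mean=mu_v_given_u, cov=Sigma_v_given_u).logpdf(v)

print(np.isclose(lhs, rhs))  # True
```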