# The Free Energy Principle and Active Inference

Hello World!

I wanted to start the blog with something that has been on my mind since the great Computational Psychiatry Course 2016. Some of the talks there revolved around the free energy/active inference framework in cognitive neuroscience. Although “free energy” does sound a little esoteric at first, the name derives from the thermodynamic free energy in statistical physics and just describes the functional form of the objective function that the brain is hypothesized to minimize.

So in this (and the following) post I want to outline the idea and the mathematical machinery behind this framework and pair it with recent advances in stochastic inference using deep generative models (i.e. Stochastic Backpropagation/Variational Autoencoders and Evolution Strategies) to create autonomous agents that can perform goal-directed action while building a generative model of themselves and their surroundings. I decided to try and use the minimal amount of formulae required. But if you find bits too vague (or too technical) please let me know and I’ll try to be clearer (or less technical).

## Active Inference

Here I will just give a brief and subjective summary of the arguments in this paper by Karl Friston.

Active Inference rests on the basic assumption that any agent in exchange with a dynamic, fluctuating environment must keep certain inner parameters within a well-defined range in order to survive. Think of the pH of your blood or the concentration of certain ion species. For example, changing the potassium concentration in your blood serum by even a small amount (a few mmol/l) might disrupt the electrical signal transduction in your heart muscle cells, changing the electrical and mechanical activity of your heart from a stable, cyclic attractor (pumping blood) to a chaotic one, which can lead to immediate heart failure. Looking at the space spanned by all relevant variables required to define the state that the agent is in (including its interaction with its environment, e.g. the forces of gravity acting on it or the temperature of the surrounding environment), this means that the agent must restrict itself to a small volume in this “state space” in order to keep all relevant parameters within a viable range.

One can formalize this notion in terms of the probability distribution on the state space, i.e. how likely it is to find an agent in a certain state. A measure of how concentrated a probability distribution $p(x)$ is on a space $X$ is its entropy:

$H(X) = \int_{x\in X} \left( -\ln p(x)\right) p(x) \,\mathrm{d}x$

The entropy is just the average surprise $-\ln p(x)$ over the distribution. It increases the more “spread out” the distribution becomes. The entropy of a univariate Gaussian, for example, is the logarithm of its standard deviation plus a constant: $\ln \sigma + \frac{1}{2}\ln(2\pi e)$. By minimizing the entropy of its distribution on the space of possible states it can be in, an agent can counter dispersive effects from its environment and keep a stable “identity” (in terms of its inner parameters) in a fluctuating environment.
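To make the Gaussian example concrete, here is a minimal sketch that evaluates the entropy integral numerically and compares it with the closed form $\ln \sigma + \frac{1}{2}\ln(2\pi e)$. The function names, the integration range and the step count are my own choices for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def entropy_numeric(sigma, n_steps=20000, width=10.0):
    """Midpoint-rule approximation of H = integral of -ln p(x) * p(x) dx."""
    lo, hi = -width * sigma, width * sigma
    dx = (hi - lo) / n_steps
    total = 0.0
    for i in range(n_steps):
        x = lo + (i + 0.5) * dx
        p = gaussian_pdf(x, 0.0, sigma)
        if p > 0.0:  # guard against log(0) far out in the tails
            total += -math.log(p) * p * dx
    return total

def entropy_closed_form(sigma):
    """H = ln(sigma) + 0.5 * ln(2 * pi * e)."""
    return math.log(sigma) + 0.5 * math.log(2.0 * math.pi * math.e)

# A wider Gaussian has a larger entropy; the two columns should agree.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, entropy_numeric(sigma), entropy_closed_form(sigma))
```

Doubling the standard deviation adds $\ln 2$ to the entropy, which is the sense in which entropy measures how spread out the distribution is.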

However, a human agent does not have (or at least during most of our evolutionary history did not have) direct access to an objective measurement of its blood-potassium concentration. Instead we only perceive the world around us via our sensory epithelia. But agents whose sensory organs did not provide a good mapping from relevant physical states to appropriate sensory inputs would not have lasted very long. So we further assume that we can upper-bound the entropy of the agent’s physical states by the entropy of its sensory states plus a constant term (technical details here; note that this approximation is no longer required by more recent formulations). Now let $O$ denote the space of all possible observations of an agent. To ensure its survival (in terms of keeping its physiological variables within well-defined bounds) it has to minimize its sensory entropy:

$H(O) = \int_{o\in O} \left( -\ln p(o)\right) p(o) \,\mathrm{d}o$

A further assumption we use is ergodicity. This means that – looking at a large population of agents – the relative proportion of agents in a certain state at a given moment of time is equal to the relative amount of time a single agent spends in this state (i.e. that time- and ensemble-averages are equivalent). This allows us to write the sensory entropy as

$H(O) = \lim\limits_{T \rightarrow \infty} - \frac{1}{T} \int_0^T \ln p(o(t)) \,\mathrm{d}t$
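As a quick sanity check of the ergodic argument, the following sketch simulates a single long trajectory of a simple stationary process and compares the time-averaged surprise with the entropy of its stationary density. The process (a first-order autoregressive process with a standard normal stationary distribution), the constants and the variable names are assumptions chosen purely for illustration.

```python
import math
import random

random.seed(0)

PHI = 0.5                               # autoregressive coefficient (assumed)
NOISE_STD = math.sqrt(1.0 - PHI ** 2)   # chosen so the stationary variance is 1

def surprise(o):
    """-ln p(o) under the stationary density N(0, 1)."""
    return 0.5 * math.log(2.0 * math.pi) + 0.5 * o * o

T = 100_000
o = random.gauss(0.0, 1.0)              # start in the stationary distribution
running_sum = 0.0
for _ in range(T):
    running_sum += surprise(o)
    o = PHI * o + random.gauss(0.0, NOISE_STD)  # one step of the dynamics

time_average = running_sum / T                           # single-agent time average
ensemble_entropy = 0.5 * math.log(2.0 * math.pi * math.e)  # H of N(0, 1)
print(time_average, ensemble_entropy)
```

For a long enough trajectory the two numbers should agree up to sampling noise, which is what the time-average form of $H(O)$ expresses.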

It now follows from the calculus of variations that an agent can minimize its sensory entropy by minimizing the sensory surprise $- \ln p(o(t))$ at all times:

$\nabla_o \left( - \ln p(o(t)) \right) = 0$

To be able to do this efficiently, our agent needs a statistical model of its sensory inputs, to evaluate $p(o)$. Since the world in which we live is hierarchically organised, dynamic, and features a lot of complex noise sources that act on different time scales and enter at different levels of the hierarchy, we assume that the model of our agent is a deep (hierarchical), recurrent (dynamic), latent variable (structured noise) model. In fact, there is even a theorem that “every good regulator of a system must be a model of that system”. Furthermore, we assume that this model is generative, using the observation that we are able to imagine certain situations and perceptions (like the image of a blue sky over a desert landscape) without actually experiencing or having experienced them. We will discuss a possible implementation of this model in the brain later, but let’s first assume our agent possesses a generative model $p_\theta(o,s)$ of sensory observations $o$ and latent variables or states $s$, which we can factorise into a likelihood function $p_\theta(o|s)$ and a prior on the states $p_\theta(s)$:

$p_\theta(o,s) = p_\theta(o|s) p_\theta(s)$

The set $\theta$ comprises the (time-invariant or slowly changing) parameters that the agent can change to improve its model of the world. In the brain this might be the pattern and strength of synapses, the connections between individual neurons. Given this factorisation, to minimize surprise, the agent must solve the hard task of calculating

$p_\theta(o) = \int p_\theta(o|s)p_\theta(s) \,\mathrm{d}s$

by marginalizing over all possible states that could lead to a given observation. As the dimensionality of the latent state space $S$ can be very high, this integral is extremely hard to solve. Therefore, a further assumption of the free energy principle is that agents do not try to minimize the surprise $-\ln p_\theta(o)$ directly, but rather minimize an upper bound, which is a lot simpler to calculate.
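To get a feeling for why this marginalization becomes hard, here is a small sketch with discrete latent states, where the integral turns into a sum. The toy model (independent binary latent variables and a single binary observation) and all function names are hypothetical and only serve to show that the number of terms grows as $2^n$ with the number of latent variables $n$.

```python
import itertools
import math

def prior(s, p_on=0.3):
    """p(s): independent binary latent variables, each 'on' with probability p_on."""
    return math.prod(p_on if bit else 1.0 - p_on for bit in s)

def likelihood(o, s):
    """p(o|s): Bernoulli observation whose rate grows with the number of active latents."""
    rate = 1.0 / (1.0 + math.exp(-(sum(s) - 0.5 * len(s))))
    return rate if o == 1 else 1.0 - rate

def evidence(o, n_latents):
    """p(o) = sum over all 2**n_latents latent configurations of p(o|s) p(s)."""
    return sum(likelihood(o, s) * prior(s)
               for s in itertools.product((0, 1), repeat=n_latents))

# The middle column is the number of terms in the sum.
for n in (2, 8, 16):
    print(n, 2 ** n, evidence(1, n))
```

Already at a few dozen binary latent variables this brute-force sum is out of reach, which motivates replacing the exact surprise with a bound that is cheaper to evaluate.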

To construct this bound, we use the fact that the Kullback-Leibler (KL) divergence

$D_{\mathrm{KL}}(p_a(x)||p_b(x)) = \int_{x \in X} \ln\left( \frac{p_a(x)}{p_b(x)}\right)p_a(x) \,\mathrm{d}x$

between two arbitrary distributions $p_a(x)$ and $p_b(x)$ with a shared support on a common space $X$ is always greater than or equal to zero, and equal to zero if and only if the two distributions are equal. Don’t be scared by the long name or the, at first glance, intimidating notation. This quantity is also known as relative information and has some quite intuitive interpretations. For example, it is the average amount of information that has to be transmitted to communicate the density $p_a(x)$, given that the receiver already knows the density $p_b(x)$. And although there are many more divergence measures between probability densities that are non-negative, and zero if and only if the two densities are equal, the KL divergence is the only one to fulfil three crucial properties: it is independent of the coordinate system used to represent the space on which the two densities are defined; it is local, i.e. local changes to one of the two densities only affect the respective parts of the integral; and it is what physicists call extensive, which just means that it is additive over independent (in the stochastic sense) subsystems. To learn more about the special role of the KL divergence please refer to the wonderful notes on divergence measures by Danilo Rezende. Beyond our current discussion, for these reasons the KL divergence is also exceptionally suitable to be evaluated using sampling-based methods, as shown and discussed recently in great work by Tran, Ranganath, and Blei (2017).
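Here is a minimal sketch of the KL divergence for discrete distributions (sums instead of integrals) that illustrates non-negativity, the lack of symmetry, and additivity over independent subsystems. The example distributions and helper names are made up for illustration.

```python
import math

def kl_divergence(p_a, p_b):
    """D_KL(p_a || p_b) for two discrete distributions given as lists of probabilities."""
    return sum(a * math.log(a / b) for a, b in zip(p_a, p_b) if a > 0.0)

def independent_joint(p, q):
    """Joint distribution of two independent discrete variables, flattened to a list."""
    return [x * y for x in p for y in q]

p_a, p_b = [0.7, 0.2, 0.1], [0.4, 0.4, 0.2]
q_a, q_b = [0.5, 0.5], [0.9, 0.1]

kl_same = kl_divergence(p_a, p_a)   # zero for identical distributions
kl_ab = kl_divergence(p_a, p_b)     # positive for different distributions
kl_ba = kl_divergence(p_b, p_a)     # generally differs from kl_ab: KL is not symmetric

# Extensivity: the KL divergence between the joints of independent subsystems
# is the sum of the KL divergences of the subsystems.
kl_joint = kl_divergence(independent_joint(p_a, q_a), independent_joint(p_b, q_b))
kl_sum = kl_ab + kl_divergence(q_a, q_b)

print(kl_same, kl_ab, kl_ba, kl_joint, kl_sum)
```

Note that the order of the arguments matters, which is why it is called a divergence rather than a distance.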

So using the KL-Divergence, we can define the free energy as:

$F(o,\theta, u) = -\ln p_\theta(o) + D_{\mathrm{KL}}(q_u(s)||p_\theta(s|o)) \geq -\ln p_\theta(o)$

After this admittedly loooong list of assumptions, you might now wonder what $q_u(s)$ is supposed to mean: $q_u(s)$ is an arbitrary (so-called variational) density over the space of latent states $s$, which belongs to a family of distributions parameterized by a time-dependent, i.e. fast-changing, parameter set $u$. If $q_u(s)$ were a diagonal Gaussian, $u = \left\{\mu, \sigma\right\}$ would be the corresponding means and standard deviations. This parameter set can be encoded by the internal states of our agent, e.g. by the neural activity (firing pattern) of the neurons in its brain. Thus, the upper bound $F(o,\theta, u)$ now only depends on quantities to which our agent has direct access, namely the states of its sensory organs $o$, the synapses encoding its generative model of the world $\theta$, and the neural activity representing the sufficient statistics $u$ of the variational density.

Using the definition of the Kullback-Leibler divergence, the linearity of the integral, Bayes’ rule, and manipulation of logarithms, one (You! Try it!) can derive the following equivalent forms of the free energy functional:

$\begin{array}{rcl} F(o,\theta, u) & = & -\ln p_\theta(o) + D_{\mathrm{KL}}(q_u(s)||p_\theta(s|o)) \\ & = & \left< -\ln p_\theta(o,s) \right>_{q_u(s)} - \left< -\ln q_u(s) \right>_{q_u(s)}\\ & = & \left< -\ln p_\theta(o|s) \right>_{q_u(s)} + D_{\mathrm{KL}}(q_u(s)||p_\theta(s)) \\ \end{array}$

Here $\left< f(s) \right>_{q_u(s)}$ means calculating the expectation value of $f(s)$ with respect to the variational density $q_u(s)$.
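If you want to check your derivation numerically, the following sketch evaluates all three forms for a tiny discrete model with two latent states and two possible observations (so expectations become sums). The probability tables and the function name are made-up illustrations, and $q$ is assumed to have strictly positive entries.

```python
import math

prior = [0.8, 0.2]                 # p(s) for s in {0, 1}
likelihood = [[0.9, 0.1],          # p(o|s=0) for o in {0, 1}
              [0.3, 0.7]]          # p(o|s=1) for o in {0, 1}

def free_energy_forms(o, q):
    """Return the three forms of F and the surprise -ln p(o) for a variational q."""
    joint = [likelihood[s][o] * prior[s] for s in range(2)]   # p(o, s)
    evidence = sum(joint)                                      # p(o)
    posterior = [j / evidence for j in joint]                  # p(s|o)

    kl_q_posterior = sum(q[s] * math.log(q[s] / posterior[s]) for s in range(2))
    kl_q_prior = sum(q[s] * math.log(q[s] / prior[s]) for s in range(2))

    form_1 = -math.log(evidence) + kl_q_posterior              # surprise + KL to posterior
    energy = sum(q[s] * -math.log(joint[s]) for s in range(2))
    entropy = sum(q[s] * -math.log(q[s]) for s in range(2))
    form_2 = energy - entropy                                   # "energy minus entropy"
    expected_nll = sum(q[s] * -math.log(likelihood[s][o]) for s in range(2))
    form_3 = expected_nll + kl_q_prior                          # inaccuracy + KL to prior
    return form_1, form_2, form_3, -math.log(evidence)

f1, f2, f3, surprise = free_energy_forms(o=1, q=[0.6, 0.4])
print(f1, f2, f3, surprise)
```

All three values should coincide, and they should lie above the surprise, since the KL term in the first form is non-negative.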

If the agent had been tied to a rock, unable to interact with or change its environment, the only thing it could do to minimize $F$ would be to change the parameters $\theta$ of its generative model and the sufficient statistics $u$ of its inner representation. Looking at the first form of the free energy, optimizing $u$ would correspond to minimizing the Kullback-Leibler divergence between the variational distribution $q_u (s)$ and the true posterior distribution $p_\theta(s|o)$, i.e. the probability over states $s$ given the observations $o$. Thus, the agent automatically forms a probabilistic representation of an approximate posterior on the states of the world, given its sensory input. The optimization of the sufficient statistics $u$ of the variational density $q_u$ is therefore what we call “perception”. As $q_u$ has to be optimized on a fast timescale, quickly changing with the sensory input $o$, it is likely represented in terms of neural activity. This might explain why we often find hallmarks of Bayesian computations and probabilistic representations of sensory stimuli in the brain (Berkes et al.; cf. Knill & Pouget; Fiser et al.). As the variational free energy upper-bounds the surprise $-\ln p_\theta(o)$, minimising free energy with respect to the parameters $\theta$ of the generative model will simultaneously maximise the evidence $p_\theta(o)$ for the agent’s generative model of the world. The resulting optimisation of the parameters $\theta$ of the generative model is what we call perceptual learning and what might be implemented by changing the synaptic connections between neurons in the brain. The second form is just to show the physicists among you where the name “free energy” comes from, since its form is very similar to the Helmholtz free energy of the canonical ensemble (just to demystify its name; if you are not a physicist, feel free to ignore this sentence).
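To make the “perception” step concrete, here is a minimal sketch that minimizes the free energy with respect to $u$ by plain gradient descent for a toy model with two latent states. The variational density is parameterized by a single logit $u$, with $q_u(s=1) = \mathrm{sigmoid}(u)$; the probability tables, the step size and the number of iterations are illustrative assumptions.

```python
import math

prior = [0.8, 0.2]                        # p(s) for s in {0, 1}
likelihood = [[0.9, 0.1], [0.3, 0.7]]     # likelihood[s][o] = p(o|s)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def free_energy(o, u):
    """F = sum_s q(s) * (ln q(s) - ln p(o, s)), the second ("energy minus entropy") form."""
    q = [1.0 - sigmoid(u), sigmoid(u)]
    return sum(q[s] * (math.log(q[s]) - math.log(likelihood[s][o] * prior[s]))
               for s in range(2))

def free_energy_grad(o, u):
    """dF/du = q1 * (1 - q1) * (u - ln[p(o, s=1) / p(o, s=0)])."""
    q1 = sigmoid(u)
    log_odds_joint = (math.log(likelihood[1][o] * prior[1])
                      - math.log(likelihood[0][o] * prior[0]))
    return q1 * (1.0 - q1) * (u - log_odds_joint)

o = 1            # the observation the agent receives
u = 0.0          # initial internal state: q is uniform
for _ in range(200):
    u -= 1.0 * free_energy_grad(o, u)     # perception: gradient descent on u

q1 = sigmoid(u)
evidence = sum(likelihood[s][o] * prior[s] for s in range(2))
exact_posterior = likelihood[1][o] * prior[1] / evidence
print(q1, exact_posterior, free_energy(o, u), -math.log(evidence))
```

In this toy case the variational family contains the true posterior, so the bound becomes tight: $q_u$ converges to $p_\theta(s|o)$ and $F$ converges to the surprise. With a restricted family, $F$ would stay strictly above the surprise.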

So far, this is exactly how variational inference on generative models works. In practice, you do not optimise the sufficient statistics $u$ of the variational density directly, since you would have to perform this optimisation for every single observation $o$. Instead one can use a very flexible, parameterised function $u_{\theta}(o)$, and optimise its parameters together with the parameters of the generative model (for brevity, $\theta$ here lumps together the parameters of this function and those of the generative model). For example, in the diagonal Gaussian case you could use deep neural networks to calculate the means and standard deviations of $q$ from the observations $o$ (cf. the nice work by Rezende, Mohamed & Wierstra, Kingma & Welling, and Chung et al.). This allows us to fit generative latent variable models to our data while simultaneously getting an easy-to-evaluate approximation to the posterior distribution over latent variables given data samples.
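The idea of such an amortised recognition function can be sketched in a deliberately tiny setting. Assume a linear-Gaussian model with $s \sim \mathcal{N}(0, 1)$ and $o|s \sim \mathcal{N}(s, \sigma^2)$, and replace the deep network by a single weight $w$, so that $q(s) = \mathcal{N}(w\,o, v)$. This is a stand-in for the neural networks mentioned above, not the method of the cited papers; the generative model is held fixed and only the recognition parameters are learned. All names and constants are hypothetical.

```python
import math
import random

random.seed(0)

SIGMA2 = 0.5    # observation noise variance of the generative model (assumed)
LR = 0.01       # learning rate (assumed)

w = 0.0         # recognition weight: the mean of q is w * o
log_v = 0.0     # log-variance of q, shared across all observations

for _ in range(5000):
    # Draw one observation from the generative model.
    s = random.gauss(0.0, 1.0)
    o = s + random.gauss(0.0, math.sqrt(SIGMA2))

    m, v = w * o, math.exp(log_v)
    # Closed-form free energy for this model:
    # F = 0.5*ln(2*pi*SIGMA2) + ((o - m)**2 + v) / (2*SIGMA2) + 0.5*(v + m**2 - 1 - ln v)
    dF_dm = -(o - m) / SIGMA2 + m
    dF_dlog_v = v * (0.5 / SIGMA2 + 0.5) - 0.5

    # Stochastic gradient step on the recognition parameters.
    w -= LR * dF_dm * o
    log_v -= LR * dF_dlog_v

exact_w = 1.0 / (1.0 + SIGMA2)       # exact posterior mean is exact_w * o
exact_v = SIGMA2 / (1.0 + SIGMA2)    # exact posterior variance
print(w, exact_w, math.exp(log_v), exact_v)
```

After training, a single evaluation of $w\,o$ gives the posterior mean for any new observation, without running a separate optimisation per observation. That is the point of amortisation.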

Now comes the interesting part, Action:

We now free our agent from its ties, i.e. we give it actuators that allow it to actively change the state of its environment. Suddenly the sensory observations $o$ become a function of the current and past states $a$ of the agent’s actuators (“muscles”), via their influence on the state of the world generating the sensory inputs. Now the agent can minimize the free energy bound not only by learning (optimising the parameters of its generative model) and perception (optimising the sufficient statistics of the variational density), but also by changing the observations it makes. This is called Active Inference. So while learning and perception help the agent to make the free energy a tight bound on the actual sensory surprise $-\ln p(o)$, action can now use this tight bound to change the observations an agent makes, thereby minimizing (given the bound is tight enough) the actual sensory entropy $H(O)$, which was the objective that we started with.

Remembering the last form of the variational free energy,

$F(o,\theta, u) = \left< -\ln p_\theta(o|s) \right>_{q_u(s)} + D_{\mathrm{KL}}(q_u(s)||p_\theta(s))$

we see that our agent can minimize it by seeking out observations $o$ that have a high likelihood $p_\theta(o|s)$ under its own generative model of the world (averaged over its approximate posterior of the state of the world given its previous observations).

This gives us the following dynamics for the parameters $\theta$ of an agent’s generative model, its internal states $u$, encoding the sufficient statistics of the variational density $q$, and the states of its actuators $a$:

$\left(\theta, u, a\right) = \underset{(\theta^*, u^*, a^*)}{\mathrm{arg\,min}}F(o(a^*),\theta^*,u^*)$

But how do we instill goal-directed behavior in such an agent? As mentioned above, the agent will seek out states that conform to its expectations, i.e. that have a high likelihood under its generative model of the world. So, you can encode specific goals in the priors of the generative model of the agent: assigning a higher a priori probability to a specific state, the agent will try to seek out this state more often.
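A toy version of this perception-action loop might look as follows. Assume a one-dimensional world in which the observation simply equals the actuator state, $o(a) = a$, and an agent whose generative model is $s \sim \mathcal{N}(\mathrm{GOAL}, 1)$, $o|s \sim \mathcal{N}(s, \sigma^2)$, so that the goal is encoded as the prior mean. The environment, the constants and the function names are all assumptions made for this sketch; a realistic agent would neither have a closed-form posterior nor know how its actions change its observations.

```python
import math

GOAL = 20.0     # prior mean over the latent state: what the agent "expects" (assumed)
SIGMA2 = 0.5    # sensory noise variance in the agent's generative model (assumed)

def perceive(o):
    """Perception: exact posterior N(m, v) for s ~ N(GOAL, 1), o|s ~ N(s, SIGMA2)."""
    v = SIGMA2 / (1.0 + SIGMA2)
    m = (o + SIGMA2 * GOAL) / (1.0 + SIGMA2)
    return m, v

def free_energy(o, m, v):
    """Third form of F: expected negative log-likelihood plus KL(q || prior)."""
    expected_nll = (0.5 * math.log(2.0 * math.pi * SIGMA2)
                    + ((o - m) ** 2 + v) / (2.0 * SIGMA2))
    kl_to_prior = 0.5 * (v + (m - GOAL) ** 2 - 1.0 - math.log(v))
    return expected_nll + kl_to_prior

a = 5.0          # actuator state, starting far from the goal
history = []     # free energy after each perception step
for _ in range(200):
    o = a                        # toy environment: o(a) = a
    m, v = perceive(o)           # perception: make the bound tight
    history.append(free_energy(o, m, v))
    dF_do = (o - m) / SIGMA2     # gradient of F w.r.t. the observation, q held fixed
    a -= 0.5 * dF_do             # action: change the observation (do/da = 1 here)

print(a, history[0], history[-1])
```

The agent ends up at the state it assigns the highest prior probability, and its free energy drops along the way. Changing `GOAL` changes where the agent goes, which is the sense in which goals can be encoded in priors.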

As noted above, while the parameters $\theta$ describe the agent’s beliefs about the dynamic rules that govern the world, and should be optimized on a slow timescale, i.e. over large batches of data, the internal states $u$ and the action states $a$ must change on the timescale of the sensory input. Thus, in a discrete world they would have to be optimized at each time step. As these kinds of optimizations are costly, it is easier to use functional approximations of $u$ and $a$, e.g. by deep neural networks, and optimize their parameters together with the parameters of the generative model (analogous to the work in variational inference on deep generative models by Rezende, Mohamed & Wierstra, Kingma & Welling, and Chung et al.).

I admit that the premises of Active Inference take some time to get used to. But when you do, suddenly many apparently disparate findings in neuroscience fit together. From optical illusions, to neural responses to oddball stimuli, mirror neurons, predictive coding, basic features of neural architecture, critical neural dynamics, and human choice behaviour, many findings can be framed and explained by the free energy principle and the resulting active inference dynamics.

That’s it. To minimize mathematical machinery, I stayed quite abstract in this post, but I will make up for it by showing you how to implement a full active inference agent in a small simulated environment, using deep neural networks, sampling and evolution strategies in the next one.

Edited Nov. 14th, 2018: Fixed some typos, added a paragraph with some details on the KL-divergence, added a short discussion on the roles of learning, perception and action in the optimisation of the sensory entropy, and added a short note that the assumption of a linear sensory mapping can be relaxed, using more recent formulations (but there will be a separate blog post on this, soon*ish*).

## 5 thoughts on “The Free Energy Principle and Active Inference”

1. Thank you Kai, this is an extremely useful primer! If I’m understanding correctly, the discussion of updating a parameterized function for the recognition density q_u used theta just to illustrate it’s a parameterized function, and not to imply these are the same parameters as those used by the generative model (also called theta).

> Instead one can use a very flexible, parameterised function $u_{\theta}(o)$, and optimise its parameters together with the parameters of the generative model

Did I get that right? The relationship between the generative and recognition models (e.g. whether one is an inversion of the other or not) seems to be somewhat controversial already, so I wanted to clarify the notation here.


1. Thank you for the nice feedback and the important question, Jeremy. You’re exactly right. The amortized recognition function has its own set of parameters. I just lumped it together with the parameters of the likelihood and action functions into the set theta, which summarizes all the parameters of the agent that are optimized during the training process. This was supposed to declutter the notation, but might have created confusion, I’m sorry. Please refer to the paper and the accompanying code for more details (cf. the publications page).
