NeuroImage

Volume 34, Issue 1, 1 January 2007, Pages 220-234

Technical Note
Variational free energy and the Laplace approximation

https://doi.org/10.1016/j.neuroimage.2006.08.035

Abstract

This note derives the variational free energy under the Laplace approximation, with a focus on accounting for additional model complexity induced by increasing the number of model parameters. This is relevant when using the free energy as an approximation to the log-evidence in Bayesian model averaging and selection. By setting restricted maximum likelihood (ReML) in the larger context of variational learning and expectation maximisation (EM), we show how the ReML objective function can be adjusted to provide an approximation to the log-evidence for a particular model. This means ReML can be used for model selection, specifically to select or compare models with different covariance components. This is useful in the context of hierarchical models because it enables a principled selection of priors that, under simple hyperpriors, can be used for automatic model selection and relevance determination (ARD). Deriving the ReML objective function, from basic variational principles, discloses the simple relationships among Variational Bayes, EM and ReML. Furthermore, we show that EM is formally identical to a full variational treatment when the precisions are linear in the hyperparameters. Finally, we also consider, briefly, dynamic models and how these inform the regularisation of free energy ascent schemes, like EM and ReML.

Introduction

The purpose of this note is to describe an adjustment to the objective function used in restricted maximum likelihood (ReML) that renders it equivalent to the free energy in variational learning and expectation maximisation. This is important because the variational free energy provides a bound on the log-evidence for any model, which is exact for linear models. The log-evidence plays a central role in model selection, comparison and averaging (for examples in neuroimaging, see Penny et al., 2004; Trujillo-Barreto et al., 2004).

Previously, we have described the use of ReML in the Bayesian inversion of electromagnetic models to localise distributed sources in EEG and MEG (e.g., Phillips et al., 2002). ReML provides a principled way of quantifying the relative importance of priors that replaces alternative heuristics like L-curve analysis. Furthermore, ReML accommodates multiple priors and provides more accurate and efficient source reconstruction than its precedents (Phillips et al., 2002). More recently, we have explored the use of ReML to identify the most likely combination of priors using model selection, where each model comprises a different set of priors (Mattout et al., in press). This was based on the fact that the ReML objective function is the free energy used in expectation maximisation and is equivalent to the log-evidence Fλ = ln p(y|λ,m), conditioned on λ, the unknown covariance component parameters (i.e., hyperparameters) and the model m. The noise covariance components encoded by λ include the prior covariances of each model of the data y.

However, this free energy is not a function of the conditional uncertainty about λ and is therefore insensitive to additional model complexity induced by adding covariance components. In this note we finesse this problem and show how Fλ can be adjusted to provide the variational free energy, which, in the context of linear models, is exactly the log-evidence ln p(y|m). This rests on deriving the variational free energy for a general variational scheme and treating expectation maximisation (EM) as a special case, in which one set of parameters assumes a point mass. We then treat ReML as the special case of EM, applied to linear models.

Although this note focuses on the various forms for the free energy, we also take the opportunity to link variational Bayes (VB), EM and ReML by deriving them from basic principles. Indeed, this derivation is necessary to show how the ReML objective function can be generalised for use in model selection. The material in this note is quite technical but is presented here because it underpins many of the specialist applications we have described in the neuroimaging literature over the past years. This didactic treatment may be especially useful for software developers or readers with a particular mathematical interest. For other readers, the main message is that a variational treatment of imaging data can unite a large number of special cases within a relatively simple framework.

Variational Bayes, under the Laplace approximation, assumes a fixed Gaussian form for the conditional density of the parameters of a model and is used implicitly in ReML and many applications of EM. Bayesian inversion using VB is ubiquitous in neuroimaging (e.g., Penny et al., 2005). Its use ranges from spatial segmentation and normalisation of images during pre-processing (e.g., Ashburner and Friston, 2005) to the inversion of complicated dynamic causal models of functional integration in the brain (Friston et al., 2003). Many of the intervening steps in classical and Bayesian analysis of neuroimaging data call on ReML or EM under the Laplace approximation. This note provides an overview of how these schemes are related and illustrates their applications with reference to specific algorithms and routines we have described in the past (and are currently developing; e.g., dynamic expectation maximisation; DEM). One interesting issue that emerges from this treatment is that VB reduces exactly to EM, under the Laplace approximation, when the precision of stochastic terms is linear in the hyperparameters. This reveals a close relationship between EM and full variational approaches.

This note is divided into seven sections. In the first we summarise the basic theory of variational Bayes and apply it in the context of the Laplace approximation. The Laplace approximation imposes a fixed Gaussian form on the conditional density, which simplifies the ensuing variational steps. In this section we look at the easy problem of approximating the conditional covariance of model parameters and the more difficult problem of approximating their conditional expectation or mode using gradient ascent. We consider a dynamic formulation of gradient ascent, which generalises nicely to cover dynamic models and provides the basis for a temporal regularisation of the ascent. In the second section we apply the theory to nonlinear models with additive noise. We use the VB scheme that emerges as the reference for subsequent sections looking at special cases. The third section considers EM, which can be seen as a special case of VB in which uncertainty about one set of parameters is ignored. In the fourth section we look at the special case of linear models where EM reduces to ReML. The fifth section considers ReML and hierarchical models. Hierarchical models are important because they underpin parametric empirical Bayes (PEB) and other special cases, like relevance vector machines. Furthermore, they provide a link with classical covariance component estimation. In the sixth section we present some toy examples to show how the ReML and EM objective functions can be used to evaluate the log-evidence and facilitate model selection. This section concludes with an evaluation of the Laplace approximation to the model evidence, in relation to Monte Carlo–Markov chain (MCMC) sampling estimates. The final section revisits model selection using automatic model selection (AMS) and relevance determination (ARD). We show how suitable hyperpriors enable EM and ReML to select the best model automatically, by switching off redundant parameters and hyperparameters. The Appendices include some notes on parameterising covariances and the sampling scheme used for validation of the Laplace approximations.

Empirical enquiry in science usually rests upon estimating the parameters of some model of how observed data were generated and making inferences about the parameters (or model). Estimation and inference are based on the posterior density of the parameters (or model), conditional on the observations. Variational Bayes is used to evaluate these posterior densities.

Variational Bayes is a generic approach to posterior density (as opposed to posterior mode) analysis that approximates the conditional density p(ϑ|y,m) of some model parameters ϑ, given a model m and data y. Furthermore, it provides the evidence (also called the marginal or integrated likelihood) of the model p(y|m) which, under prior assumptions about the model, furnishes the posterior density p(m|y) of the model itself (see Penny et al., 2004 for an example in neuroimaging).

Variational approaches rest on minimising the Feynman variational bound (Feynman, 1972). In variational Bayes the free energy represents a bound on the log-evidence. Variational methods are well established in the approximation of densities in statistical physics (e.g., Weissbach et al., 2002) and were introduced by Feynman within the path integral formulation (Titantah et al., 2001). The variational framework was introduced into statistics through ensemble learning, where the ensemble or variational density q(ϑ) (i.e., approximating posterior density) is optimised to minimise the free energy. Initially (Hinton and von Cramp, 1993, MacKay, 1995a, MacKay, 1995b) the free energy was described in terms of description lengths and coding. Later, established methods like EM were considered in the light of variational free energy (Neal and Hinton, 1998; see also Bishop, 1999). Variational learning can be regarded as subsuming most other learning schemes as special cases. This is the theme pursued here, with special reference to fixed-form approximations and classical methods like ReML (Harville, 1977).

The derivations in this paper involve a fair amount of differentiation. To simplify things we will use the notation $f_x = \partial f/\partial x$ to denote the partial derivative of the function f with respect to the variable x. For time derivatives we will also use $\dot{x} = x_t$.

The log-evidence can be expressed in terms of the free energy and a divergence term

$$\ln p(y|m) = F + D\big(q(\vartheta)\,\|\,p(\vartheta|y,m)\big) \tag{1}$$

$$F = \langle L(\vartheta)\rangle_q - \langle \ln q(\vartheta)\rangle_q, \qquad L = \ln p(y,\vartheta).$$

Here − 〈ln q(ϑ)〉q is the entropy and 〈L(ϑ)〉q the expected energy. Both quantities are expectations under the variational density. Eq. (1) indicates that F is a lower-bound approximation to the log-evidence because the divergence or cross-entropy D(q(ϑ)||p(ϑ|y,m)) is non-negative. In this note, all the energies are the negative of the energies considered in statistical physics. The objective is to compute q(ϑ) for each model by maximising F, and then compute F itself, for Bayesian inference and model comparison, respectively. Maximising the free energy minimises the divergence, rendering the variational density q(ϑ) ≈ p(ϑ|y,m) an approximate posterior, which is exact for linear systems. To make the maximisation easier one usually assumes q(ϑ) factorises over sets of parameters ϑi

$$q(\vartheta) = \prod_i q_i. \tag{2}$$

In statistical physics this is called a mean field approximation. Under this approximation, the Fundamental Lemma of variational calculus means that F is maximised with respect to qi = q(ϑi) when, and only when,

$$\delta F_i = 0 \;\Leftrightarrow\; f_i = \langle f_i\rangle_{q_i}, \qquad f_i = \frac{\partial F}{\partial q_i}, \tag{3}$$

where δFi is the variation of the free energy with respect to qi. From Eq. (1)

$$f_i = I(\vartheta_i) - \ln q_i - \ln Z_i, \qquad I(\vartheta_i) = \langle L(\vartheta)\rangle_{q_{\setminus i}}, \tag{4}$$

where ϑ\i denotes the parameters not in the ith set. We have lumped terms that do not depend on ϑi into ln Zi, where Zi is a normalisation constant (i.e., partition function). We will call I(ϑi) the variational energy, noting that its expectation under qi is the expected energy. Note that when all the parameters are considered in a single set, the energy and the variational energy become the same thing; i.e., I(ϑ) = L(ϑ). The extremal condition in Eq. (3) is met when

$$\ln q_i = I(\vartheta_i) - \ln Z_i \;\;\Leftrightarrow\;\; q(\vartheta_i) = \frac{1}{Z_i}\exp\big(I(\vartheta_i)\big). \tag{5}$$

If this analytic form were tractable (e.g., through the use of conjugate priors) it could be used directly. See Beal and Ghahramani (2003) for an excellent treatment of conjugate-exponential models. However, we will assume a Gaussian fixed form for the variational density to provide a generic scheme that can be applied to a wide range of models. Note that assuming a Gaussian form for the conditional density is equivalent to assuming a quadratic form for the variational energy (cf. a second-order Taylor approximation).
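As a quick sanity check on Eq. (5) (an illustrative aside, not part of the original argument), note that if all the parameters are kept in a single set then I(ϑ) = L(ϑ) = ln p(y,ϑ) and the stationary solution returns the exact posterior,

$$q(\vartheta) = \frac{1}{Z}\exp\big(L(\vartheta)\big) = \frac{p(y,\vartheta)}{\int p(y,\vartheta)\,d\vartheta} = p(\vartheta|y,m), \qquad \ln Z = \ln p(y|m),$$

so the bound is tight. It is the mean-field factorisation and the fixed Gaussian form introduced below that make F an approximation rather than an equality.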

Laplace’s method (also known as the saddle-point approximation) approximates an integral using a Taylor expansion of the integrand's logarithm around its peak. Traditionally, in the statistics and machine learning literature, the Laplace approximation refers to the evaluation of the marginal likelihood or free energy using Laplace's method. This is equivalent to a local Gaussian approximation of p(ϑ|y) around a maximum a posteriori (MAP) estimate (Kass and Raftery, 1995). A Gaussian approximation is motivated by the fact that, in the large data limit and given some regularity conditions, the posterior approaches a Gaussian around the MAP (Beal and Ghahramani, 2003). However, the Laplace approximation can be inaccurate with non-Gaussian posteriors, especially when the mode is not near the majority of the probability mass. By applying Laplace's method in a variational context, we can avoid this problem: in what follows, we use a Gaussian approximation to each p(ϑi|y), induced by the mean-field approximation. This finesses the evaluation of the variational energy I(ϑi), which is then optimised to find its mode. This contrasts with the conventional Laplace approximation, which is applied post hoc, after the mode has been identified. We will refer to the latter as the post hoc Laplace approximation.

Under the Laplace approximation, the variational density assumes a Gaussian form qi = N(μi,Σi) with variational parameters μi and Σi corresponding to the conditional mode and covariance of the ith set of parameters. The advantage of this is that the conditional covariance can be evaluated very simply. Under the Laplace assumption

$$F = L(\mu) + \frac{1}{2}\sum_i\big(U_i + \ln|\Sigma_i| + p_i\ln 2\pi e\big)$$

$$I(\vartheta_i) = L(\vartheta_i,\mu_{\setminus i}) + \frac{1}{2}\sum_{j\neq i} U_j, \qquad U_i = \operatorname{tr}\big(\Sigma_i L_{\vartheta_i\vartheta_i}\big), \tag{6}$$

where pi = dim(ϑi) is the number of parameters in the ith set. The approximate conditional covariances obtain as an analytic function of the modes, by differentiating the free energy in Eq. (6) with respect to Σi and solving for zero

$$F_{\Sigma_i} = \frac{1}{2}L_{\vartheta_i\vartheta_i} + \frac{1}{2}\Sigma_i^{-1} = 0 \;\Rightarrow\; \Sigma_i = -L(\mu)_{\vartheta_i\vartheta_i}^{-1}. \tag{7}$$

Note that this solution for the conditional covariances does not depend on the mean-field approximation but only on the Laplace approximation. Eq. (7) recapitulates the conventional Laplace approximation, in which the conditional covariance is determined by the curvature (Hessian) of the energy, $L(\mu)_{\vartheta_i\vartheta_i}$, evaluated at the variational mode or maximum a posteriori (MAP) estimate. Substitution into Eq. (6) means $U_i = -p_i$ and

$$F = L(\mu) + \sum_i \frac{1}{2}\big(\ln|\Sigma_i| + p_i\ln 2\pi\big). \tag{8}$$
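To make Eqs. (7) and (8) concrete, the following short NumPy sketch evaluates the post hoc Laplace approximation for a toy one-parameter model; the model, tolerances and helper names are illustrative assumptions, not code from the paper or SPM. The mode is found with a generic optimiser and the curvature with finite differences.

import numpy as np
from scipy.optimize import minimize

def log_joint(theta, y, prior_mean=0.0, prior_var=1.0, noise_var=0.01):
    # energy L(theta) = ln p(y|theta) + ln p(theta) for a toy model y = exp(theta) + noise
    pred = np.exp(theta[0])
    log_lik = (-0.5 * np.sum((y - pred) ** 2) / noise_var
               - 0.5 * y.size * np.log(2 * np.pi * noise_var))
    log_prior = (-0.5 * (theta[0] - prior_mean) ** 2 / prior_var
                 - 0.5 * np.log(2 * np.pi * prior_var))
    return log_lik + log_prior

def laplace_free_energy(L, theta0, eps=1e-4):
    # mode mu, covariance Sigma = -L''(mu)^{-1} (Eq. (7)) and
    # F = L(mu) + 0.5*(ln|Sigma| + p*ln 2*pi)   (Eq. (8))
    mu = minimize(lambda th: -L(th), theta0).x
    p = mu.size
    H = np.zeros((p, p))                      # finite-difference Hessian of L at mu
    for i in range(p):
        for j in range(p):
            ei, ej = np.eye(p)[i] * eps, np.eye(p)[j] * eps
            H[i, j] = (L(mu + ei + ej) - L(mu + ei - ej)
                       - L(mu - ei + ej) + L(mu - ei - ej)) / (4 * eps ** 2)
    Sigma = np.linalg.inv(-H)
    F = L(mu) + 0.5 * (np.linalg.slogdet(Sigma)[1] + p * np.log(2 * np.pi))
    return mu, Sigma, F

rng = np.random.default_rng(0)
y = np.exp(0.5) + 0.1 * rng.standard_normal(32)
mu, Sigma, F = laplace_free_energy(lambda th: log_joint(th, y), np.array([0.0]))

The same ingredients (a mode, a curvature and Eq. (8)) appear in all of the schemes below; what changes is how the mode is found.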

The only remaining quantities required are the variational modes, which, from Eq. (5), maximise I(ϑi). This leads to the following compact variational scheme.

The modes can be found using a gradient ascent based on

$$\dot{\mu}_i = I(\mu_i)_{\vartheta_i} = \frac{\partial I(\vartheta_i)}{\partial\vartheta_i}\bigg|_{\vartheta_i=\mu_i}. \tag{10}$$

It may seem odd to formulate an ascent in terms of the motion of the mode in time. However, this is useful when generalising to dynamic models (see below). The updates for the mode obtain by integrating Eq. (10) to give

$$\Delta\mu_i = \big(\exp(tJ) - I\big)J^{-1}\dot{\mu}_i, \qquad J = \frac{\partial\dot{\mu}_i}{\partial\vartheta_i} = I(\mu_i)_{\vartheta_i\vartheta_i}. \tag{11}$$

When t gets large, the matrix exponential disappears, because the curvature is negative definite, and we get a conventional Newton scheme

$$\Delta\mu_i = -I(\mu_i)_{\vartheta_i\vartheta_i}^{-1}\,I(\mu_i)_{\vartheta_i}. \tag{12}$$

Together with the expression for the conditional covariance in Eq. (7), this update furnishes a variational scheme under the Laplace approximation (Eq. (13)).

Note that this scheme rests on, and only on, the specification of the energy function L(ϑ) implied by a generative model.

In some instances, deviations from the quadratic form assumed for the variational energy I(ϑi) under the Laplace approximation can confound a simple Newton ascent. This can happen when the curvature of the objective function is badly behaved (e.g., when the objective function becomes convex, the curvatures can become positive and the ascent turns into a descent). In these situations some form of regularisation is required to ensure a robust ascent. This can be implemented by augmenting Eq. (10) with a decay term

$$\dot{\mu}_i = I(\mu_i)_{\vartheta_i} - v\,\Delta\mu_i. \tag{14}$$

This effectively pulls the search back towards the expansion point provided by the previous iteration and enforces a local exploration. Integration to the fixed point gives a classical Levenberg–Marquardt scheme (cf. Eq. (11))

$$\Delta\mu_i = -J^{-1}\dot{\mu}_i = \big(vI - I(\mu_i)_{\vartheta_i\vartheta_i}\big)^{-1} I(\mu_i)_{\vartheta_i}, \qquad J = I(\mu_i)_{\vartheta_i\vartheta_i} - vI, \tag{15}$$

where v is the Levenberg–Marquardt regularisation. However, the dynamic formulation affords a simpler alternative, namely temporal regularisation. Here, instead of constraining the search with a decay term, one can abbreviate it by terminating the ascent after some suitable period t = v; from Eq. (11)

$$\Delta\mu_i = \big(\exp(vJ) - I\big)J^{-1}\dot{\mu}_i = \Big(\exp\big(vI(\mu_i)_{\vartheta_i\vartheta_i}\big) - I\Big)\, I(\mu_i)_{\vartheta_i\vartheta_i}^{-1}\, I(\mu_i)_{\vartheta_i}, \qquad J = I(\mu_i)_{\vartheta_i\vartheta_i}. \tag{16}$$

This has the advantage of using the local gradients and curvatures while precluding large excursions from the expansion point. In our implementations, v = 1/η is based on the 2-norm η of the curvature for both regularisation schemes. The 2-norm is the largest singular value and, in the present context, represents an upper bound on the rate of convergence of the ascent (cf. a Lyapunov exponent). Terminating the ascent prematurely is reminiscent of “early stopping” in the training of neural networks, in which the number of weights far exceeds the sample size (e.g., Nelson and Illingworth, 1991, p. 165). It is interesting to note that “early stopping” is closely related to ridge regression, which is another perspective on Levenberg–Marquardt regularisation.

A comparative example using Levenberg–Marquardt and temporal regularisation is provided in Fig. 1 and suggests that, in this example, temporal regularisation is better. Either approach can be implemented in the VB scheme by simply regularising the Newton update if the variational energy I(ϑi) fails to increase after each iteration. We prefer temporal regularisation because it is based on a simpler heuristic and, more importantly, is straightforward to implement in dynamic schemes using high-order temporal derivatives.
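The three update rules are easy to compare numerically. The sketch below (NumPy/SciPy; the toy curvature, gradient and helper name are illustrative assumptions) implements the Newton step of Eq. (12), the Levenberg–Marquardt step of Eq. (15) and the temporally regularised step of Eq. (16), with v = 1/η set from the 2-norm of the curvature as described above.

import numpy as np
from scipy.linalg import expm

def ascent_steps(grad, curv, v):
    # grad: dI/dtheta at the current mode; curv: d2I/dtheta^2; v: regularisation
    I_n = np.eye(len(grad))
    newton   = -np.linalg.solve(curv, grad)                          # Eq. (12)
    lm       = np.linalg.solve(v * I_n - curv, grad)                 # Eq. (15)
    temporal = (expm(v * curv) - I_n) @ np.linalg.solve(curv, grad)  # Eq. (16)
    return newton, lm, temporal

curv = np.array([[-4.0, 1.0], [1.0, -2.0]])   # toy negative-definite curvature
grad = np.array([1.0, -0.5])
v = 1.0 / np.linalg.norm(curv, 2)             # v = 1/eta, eta = largest singular value
print(ascent_steps(grad, curv, v))

As v grows, the temporally regularised step approaches the Newton step (the matrix exponential vanishes for negative definite curvature), while small v keeps the update close to the expansion point.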

The second reason we have formulated the ascent as a time-dependent process is that it can be used to invert dynamic models. In this instance, the integration time in Eq. (16) is determined by the interval between observations. This is the approach taken in our variational treatment of dynamic systems; namely, dynamic expectation maximisation or DEM (introduced briefly in Friston, 2005 and implemented in spm_DEM.m). DEM produces conditional densities that are a continuous function of time and avoids many of the limitations of discrete schemes based on incremental Bayes (e.g., extended Kalman filtering). In dynamic models the energy is a function of the parameters and their high-order motion; i.e., $I(\vartheta_i) \to I(\vartheta_i,\dot{\vartheta}_i,\ldots,t)$. This entails the extension of the variational density to cover this motion, using generalised co-ordinates $q(\vartheta_i) \to q(\vartheta_i,\dot{\vartheta}_i,\ldots,t)$. This approach will be described fully in a subsequent paper. Here we focus on static models.

Having established the operational equations for VB under the Laplace approximation we now look at their application to some specific models.

Consider the generative model with additive error y = G(θ) + ε(λ). Gaussian assumptions about the errors or innovations p(ε) = N(0,Σ(λ)) furnish a likelihood p(y|θ,λ) = N(G(θ),Σ(λ)). In this example, we can consider the parameters as falling into two sets ϑ = {θ,λ} such that q(ϑ) = q(θ)q(λ), where q(θ) = N(μθ,Σθ) and q(λ) = N(μλ,Σλ). We will also assume Gaussian priors $p(\theta) = N(\eta_\theta,\Pi_\theta^{-1})$ and $p(\lambda) = N(\eta_\lambda,\Pi_\lambda^{-1})$. We will refer to the two sets as the parameters and hyperparameters. This likelihood and these priors define the energy L(ϑ) = ln p(y|θ,λ) + ln p(θ) + ln p(λ). Note that Gaussian priors are not too restrictive because both G(θ) and Σ(λ) can be nonlinear functions that embody a probability integral transform (i.e., can implement a re-parameterisation in terms of non-Gaussian priors).

Given n samples, p parameters and h hyperparameters, the energy and its derivatives are

$$
\begin{aligned}
L(\vartheta) = &-\tfrac{1}{2}\varepsilon^T\Sigma^{-1}\varepsilon + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi \\
&-\tfrac{1}{2}\varepsilon_\theta^T\Pi_\theta\varepsilon_\theta + \tfrac{1}{2}\ln|\Pi_\theta| - \tfrac{p}{2}\ln 2\pi \\
&-\tfrac{1}{2}\varepsilon_\lambda^T\Pi_\lambda\varepsilon_\lambda + \tfrac{1}{2}\ln|\Pi_\lambda| - \tfrac{h}{2}\ln 2\pi
\end{aligned}
\tag{17}
$$

$$\varepsilon = G(\mu_\theta) - y, \qquad \varepsilon_\theta = \mu_\theta - \eta_\theta, \qquad \varepsilon_\lambda = \mu_\lambda - \eta_\lambda$$

and

$$
\begin{aligned}
L(\mu)_\theta &= -G_\theta^T\Sigma^{-1}\varepsilon - \Pi_\theta\varepsilon_\theta \\
L(\mu)_{\theta\theta} &= -G_\theta^T\Sigma^{-1}G_\theta - \Pi_\theta \\
L(\mu)_{\lambda_i} &= -\tfrac{1}{2}\operatorname{tr}\big(P_i(\varepsilon\varepsilon^T - \Sigma)\big) - \big(\Pi_\lambda\varepsilon_\lambda\big)_i \\
L(\mu)_{\lambda\lambda ij} &= -\tfrac{1}{2}\operatorname{tr}\big(P_{ij}(\varepsilon\varepsilon^T - \Sigma)\big) - \tfrac{1}{2}\operatorname{tr}\big(P_i\Sigma P_j\Sigma\big) - \big(\Pi_\lambda\big)_{ij} \\
P_i &= \frac{\partial\Sigma^{-1}}{\partial\lambda_i}, \qquad P_{ij} = \frac{\partial^2\Sigma^{-1}}{\partial\lambda_i\,\partial\lambda_j}.
\end{aligned}
\tag{18}
$$

Note that we have ignored second-order terms that depend on Gθθ, under the assumption that the generative model is only weakly nonlinear. The requisite gradients and curvatures are

$$
\begin{aligned}
I(\theta)_{\theta_k} &= L(\theta,\mu_\lambda)_{\theta_k} + \tfrac{1}{2}\operatorname{tr}\big(\Sigma_\lambda A_k\big), &
I(\lambda)_{\lambda_i} &= L(\mu_\theta,\lambda)_{\lambda_i} + \tfrac{1}{2}\operatorname{tr}\big(\Sigma_\theta C_i\big), \\
I(\theta)_{\theta\theta kl} &= L(\theta,\mu_\lambda)_{\theta\theta kl} + \tfrac{1}{2}\operatorname{tr}\big(\Sigma_\lambda B_{kl}\big), &
I(\lambda)_{\lambda\lambda ij} &= L(\mu_\theta,\lambda)_{\lambda\lambda ij} + \tfrac{1}{2}\operatorname{tr}\big(\Sigma_\theta D_{ij}\big), \\
A_{ijk} &= -G_{\theta\cdot k}^{T} P_{ij}\,\varepsilon, &
C_i &= -G_\theta^{T} P_i G_\theta, \\
B_{ijkl} &= -G_{\theta\cdot k}^{T} P_{ij} G_{\theta\cdot l}, &
D_{ij} &= -G_\theta^{T} P_{ij} G_\theta,
\end{aligned}
\tag{19}
$$

where Gθ·k denotes the kth column of Gθ. These enter the VB scheme in Eq. (13), giving the two-step scheme of Eq. (20).

The negative free energy for these models is

$$
\begin{aligned}
F = &-\tfrac{1}{2}\varepsilon^T\Sigma^{-1}\varepsilon + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi \\
&-\tfrac{1}{2}\varepsilon_\theta^T\Pi_\theta\varepsilon_\theta + \tfrac{1}{2}\ln|\Pi_\theta| + \tfrac{1}{2}\ln|\Sigma_\theta| \\
&-\tfrac{1}{2}\varepsilon_\lambda^T\Pi_\lambda\varepsilon_\lambda + \tfrac{1}{2}\ln|\Pi_\lambda| + \tfrac{1}{2}\ln|\Sigma_\lambda|.
\end{aligned}
\tag{21}
$$
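For reference, Eq. (21) translates line by line into code. The following NumPy sketch (a hypothetical helper, not an SPM routine) evaluates F given the prediction error, prior errors, precisions and conditional covariances; it assumes these have already been computed by a VB scheme such as the one above.

import numpy as np

def logdet(A):
    return np.linalg.slogdet(A)[1]

def free_energy_nonlinear(e, iS, e_th, P_th, e_la, P_la, C_th, C_la):
    # F of Eq. (21): e = G(mu_theta) - y, iS = Sigma(lambda)^{-1},
    # e_th = mu_theta - eta_theta, P_th = Pi_theta, C_th = Sigma_theta,
    # e_la = mu_lambda - eta_lambda, P_la = Pi_lambda, C_la = Sigma_lambda
    n = e.size
    return (-0.5 * e @ iS @ e         + 0.5 * logdet(iS)   - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * e_th @ P_th @ e_th + 0.5 * logdet(P_th) + 0.5 * logdet(C_th)
            - 0.5 * e_la @ P_la @ e_la + 0.5 * logdet(P_la) + 0.5 * logdet(C_la))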

In principle, these equations cover a large range of models and will work provided the true posterior is unimodal (and roughly Gaussian). The latter requirement can usually be met by a suitable transformation of parameters. In the next section, we consider a further simplification of our assumptions about the variational density and how this leads to expectation maximisation.

There is a key distinction between θ and λ in the generative model above: The parameters λ are hyperparameters in the sense that, like the variational parameters, they parameterise a density. In many instances their conditional density per se is uninteresting. In variational expectation maximisation (EM), we ignore uncertainty about the hyperparameters. In this case, the free energy is effectively conditioned on λ and reduces to

$$
\begin{aligned}
F_\lambda &= \ln p(y|\lambda) - D\big(q(\theta)\,\|\,p(\theta|y,\lambda)\big) \\
&= -\tfrac{1}{2}\varepsilon^T\Sigma^{-1}\varepsilon + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi
-\tfrac{1}{2}\varepsilon_\theta^T\Pi_\theta\varepsilon_\theta + \tfrac{1}{2}\ln|\Pi_\theta| + \tfrac{1}{2}\ln|\Sigma_\theta|.
\end{aligned}
\tag{22}
$$

Here, $F_\lambda \leq \ln p(y|\lambda)$ becomes a lower bound on the log likelihood of the hyperparameters. This means the variational step updating the hyperparameters maximises the likelihood of the hyperparameters ln p(y|λ) and becomes an M-step. In this context, Eq. (20) simplifies because we can ignore the terms that involve Σλ and Πλ, to give the corresponding E- and M-steps.

Expectation–maximisation or EM is an iterative parameter re-estimation procedure devised to estimate the parameters and hyperparameters of a model. It was introduced as an iterative method to obtain maximum likelihood estimators with incomplete data (Hartley, 1958) and was generalised by Dempster et al. (1977). Strictly speaking, EM refers to schemes in which the conditional density of the E-step is known exactly, obviating the need for fixed-form assumptions. This is why we used the term ‘variational EM’ above.

In terms of the VB scheme, the M-step for $\mu_\lambda = \arg\max_\lambda I(\lambda)$ is unchanged because I(λ) does not depend on Σλ. The remaining variational steps (i.e., E-steps) are simplified because one does not have to average over the conditional density q(λ). The ensuing scheme is that described in Friston (2002) for nonlinear system identification and is implemented in spm_nlsi.m. Although this scheme is applied to time series it actually treats the underlying model as static, generating finite-length data sequences. This routine is used to identify hemodynamic models in terms of biophysical parameters for regional responses and dynamic causal models (DCMs) of distributed responses in a variety of applications; e.g., fMRI (Friston et al., 2003), EEG (David et al., 2006), MEG (Kiebel et al., 2006) and mean-field models of neuronal activity (Harrison et al., 2005).

A key point here is that VB and EM are exactly the same when Pij = 0. In this instance the matrices A, B and D in Eq. (19) disappear. This means the VB-step for the parameters does not depend on Σλ and becomes formally identical to the E-step. Because the VB-step for the hyperparameters is already the same as the M-step (apart from the loss of hyperpriors) the two schemes converge. One can ensure Pij = 0 by adopting a hyper-parameterisation which renders the precision linear in the hyperparameters; for example, a linear mixture of precision components Qi (see Appendix 1). The resulting variational scheme is used by the SPM5 version of spm_nlsi.m for nonlinear system identification.
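To spell out why this hyper-parameterisation works (a short aside using the definitions in Eq. (18)), suppose the precision is a linear mixture of precision components Qi:

$$\Sigma(\lambda)^{-1} = \sum_i \lambda_i Q_i \;\;\Rightarrow\;\; P_i = \frac{\partial\Sigma^{-1}}{\partial\lambda_i} = Q_i, \qquad P_{ij} = \frac{\partial^2\Sigma^{-1}}{\partial\lambda_i\,\partial\lambda_j} = 0.$$

With Pij = 0, the matrices A, B and D in Eq. (19) vanish, so the VB-step for the parameters no longer depends on Σλ and reduces to the E-step.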

The second key point that follows from this analysis is that one can adjust the EM free energy to approximate the log-evidence, as described next.

The EM free energy in Eq. (22) discounts uncertainty about the hyperparameters because it is conditioned upon them. This is a well-recognised problem, sometimes referred to as the overconfidence problem, for which a number of approximate solutions have been suggested (e.g., Kass and Steffey, 1989). Here we describe a solution that appeals to the variational framework within which EM can be treated.

If we treat EM as an approximate variational scheme, we can adjust the EM free energy to give the variational free energy required for model comparison and averaging. By comparing Eqs. (21) and (22) we can express the variational free energy in terms of Fλ and an extra term

$$F = F_\lambda + \tfrac{1}{2}\ln|\Sigma_\lambda|, \qquad \Sigma_\lambda = -L(\mu)_{\lambda\lambda}^{-1}, \tag{23}$$

where the expression for L(μ)λλ comes from Eq. (18). Intuitively, the extra term encodes the conditional information (i.e., entropy) about the model's covariance components. The log-evidence will only increase if there is conditional information about the extra component. Adding redundant components will have no effect on F (see the section below on automatic model selection). This term can be regarded as an additional Occam factor (Mackay and Takeuchi, 1996).
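In code, the adjustment is a one-liner once the curvature of the energy with respect to the hyperparameters is in hand; a hypothetical sketch:

import numpy as np

def adjust_em_free_energy(F_lambda, L_ll):
    # F = F_lambda + 0.5*ln|Sigma_lambda| (Eq. (23)), with Sigma_lambda = -L_ll^{-1},
    # where L_ll is the (negative definite) curvature of the energy w.r.t. lambda (Eq. (18))
    Sigma_lambda = np.linalg.inv(-L_ll)
    return F_lambda + 0.5 * np.linalg.slogdet(Sigma_lambda)[1]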

Note that even when conditional uncertainty about the hyperparameters has no effect on the conditional density of the parameters (e.g., when the precisions are linear in the hyperparameters—see above) this uncertainty can still have a profound effect on model selection because it is an important component of the free energy and therefore the log-evidence for a particular model.

Adjusting the EM free energy to approximate the log-evidence is important because of the well-known connections between EM for linear models and restricted maximum likelihood. This connection suggests that ReML could also be used to evaluate the log-evidence and therefore be used for model selection. We now consider ReML as a special case of EM.

In the case of general linear models G(θ) = Gθ with additive Gaussian noise and no priors on the parameters (i.e., Πθ = 0) the free energy reduces to

$$F_\theta = \ln p(y|\lambda) - D\big(q(\theta)\,\|\,p(\theta|y,\lambda)\big)
= -\tfrac{1}{2}\varepsilon^T\Sigma^{-1}\varepsilon + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi + \tfrac{1}{2}\ln|\Sigma_\theta|. \tag{24}$$

Critically, the dependence on q(θ) can be eliminated using the closed-form solutions for the conditional moments

$$\mu_\theta = \Sigma_\theta G^T\Sigma^{-1}y, \qquad \Sigma_\theta = \big(G^T\Sigma^{-1}G\big)^{-1}, \tag{25}$$

which eliminate the divergence term and give

$$
\begin{aligned}
F_\theta = \ln p(y|\lambda) &= -\tfrac{1}{2}\operatorname{tr}\big(\Sigma^{-1}Ryy^TR^T\big) + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi - \tfrac{1}{2}\ln|G^T\Sigma^{-1}G| \\
\varepsilon &= Ry \\
R &= I - G\big(G^T\Sigma^{-1}G\big)^{-1}G^T\Sigma^{-1}.
\end{aligned}
\tag{26}
$$

This free energy is also known as the ReML objective function (Harville, 1977). ReML or restricted maximum likelihood was introduced by Patterson and Thompson in 1971 as a technique for estimating variance components, which accounts for the loss in degrees of freedom that results from estimating fixed effects (Harville, 1977). The elimination makes the free energy a simple function of the hyperparameters and, effectively, the EM scheme reduces to a single M-step or ReML-step (Eq. (27)).
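The ReML-step is easy to prototype. The sketch below is a minimal NumPy implementation of Fisher-scoring ReML for a linear mixture of covariance components, Σ(λ) = Σi λiQi; the function name and arguments are hypothetical assumptions and this is not spm_reml.m. It uses the gradient and expected curvature of Fθ with respect to λ (cf. Eqs. (33) and (37) below), and the returned curvature also furnishes Σλ for the adjustment in Eq. (29).

import numpy as np

def reml(YY, G, Q, r=1, n_iter=32, tol=1e-8):
    # Fisher-scoring ReML for Sigma(lambda) = sum_i lambda_i * Q[i]
    # YY : (n, n) second-moment matrix of the data (y y' for a single realisation)
    # G  : (n, p) design matrix of fixed effects
    # Q  : list of (n, n) covariance components
    # r  : number of realisations contributing to YY
    n, m = G.shape[0], len(Q)
    lam = np.ones(m)
    for _ in range(n_iter):
        S = sum(l * Qi for l, Qi in zip(lam, Q))                       # Sigma(lambda)
        iS = np.linalg.inv(S)
        R = np.eye(n) - G @ np.linalg.solve(G.T @ iS @ G, G.T @ iS)    # Eq. (26)
        P = [-iS @ Qi @ iS for Qi in Q]                                # P_i = dSigma^{-1}/dlambda_i
        g = np.array([-0.5 * r * np.trace(Pi @ R @ (YY / r - S) @ R.T) for Pi in P])
        H = np.array([[-0.5 * r * np.trace(Pi @ R @ S @ Pj @ R @ S) for Pj in P]
                      for Pi in P])                                    # expected curvature
        dl = -np.linalg.solve(H, g)                                    # Fisher-scoring ascent
        lam = lam + dl    # note: components may become negative (the classical behaviour
        if np.abs(dl).max() < tol:                 # revisited in the AMS section below)
            break
    Sigma_lam = np.linalg.inv(-H)                  # conditional covariance of lambda
    return lam, Sigma_lam

The free energy Fθ can then be evaluated from Eq. (26) and adjusted with ½ln|Σλ| as in Eq. (29) to approximate the log-evidence for model comparison.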

Notice that the energy has replaced the variational energy because they are the same: from Eq. (6), I(λ) = L(λ). This is a result of eliminating q(θ) from the variational density. Furthermore, the curvature has been replaced by its expectation, to render the Newton scheme a Fisher scoring scheme, using

$$\big\langle Ryy^TR^T\big\rangle = R\Sigma R^T = \Sigma - G\Sigma_\theta G^T = R\Sigma. \tag{28}$$

To approximate the log-evidence we can adjust the ReML free energy, after convergence, as with the EM free energy

$$F = F_\theta + \tfrac{1}{2}\ln|\Sigma_\lambda|, \qquad \Sigma_\lambda = -L(\mu)_{\lambda\lambda}^{-1}. \tag{29}$$

The conditional covariance of the hyperparameters uses the same curvature as the ascent in Eq. (27). Being able to compute the log-evidence from ReML is useful because ReML is used widely in an important class of models, namely hierarchical models reviewed next.

The application of ReML to the linear models of the previous section did not accommodate priors on the parameters. However, one can generally absorb these priors into the error covariance components using a hierarchical formulation. This enables the use of ReML to identify models with full or empirical priors. Hierarchical linear models are equivalent to parametric empirical Bayes models (Efron and Morris, 1973) in which empirical priors emerge from conditional independence of the errors ε(i) ∼ N(0,Σ(i)):

$$
\begin{aligned}
y^{(1)} &= G^{(1)}\theta^{(2)} + \varepsilon^{(1)} \\
\theta^{(2)} &= G^{(2)}\theta^{(3)} + \varepsilon^{(2)} \\
&\;\;\vdots \\
\theta^{(n)} &= \varepsilon^{(n)}
\end{aligned}
\qquad\Longrightarrow\qquad
y^{(1)} = \varepsilon^{(1)} + G^{(1)}\varepsilon^{(2)} + G^{(1)}G^{(2)}\varepsilon^{(3)} + \cdots + G^{(1)}\cdots G^{(n-1)}\theta^{(n)}.
\tag{30}
$$

In hierarchical models, the random terms model uncertainty about the parameters at each level and the Σ(λ)(i) are treated as prior covariance constraints on the θ(i). Hierarchical models of this sort are very common and underlie all classical mixed-effects analyses of variance. ReML identification of simple two-level models like

$$y^{(1)} = G^{(1)}\theta^{(2)} + \varepsilon^{(1)}, \qquad \theta^{(2)} = \varepsilon^{(2)} \tag{31}$$

is a useful way to impose shrinkage priors on the parameters and covers early approaches (e.g., Stein shrinkage estimators) through to recent developments, such as relevance vector machines (e.g., Tipping, 2001). Relevance vector machines represent a Bayesian treatment of support vector machines, in which the second-level covariance Σ(λ)(2) has a component for each parameter. Most of the ReML estimates of these components shrink to zero. This means the columns of G(1) whose parameters have zero mean and variance can be eliminated, providing a new model with sparse support. This is also known as automatic relevance determination (ARD; MacKay, 1995a, MacKay, 1995b) and will be illustrated below.

Estimating these models through their covariances Σ(i) with ReML corresponds to empirical Bayes. This estimation can proceed in one of two ways: First, we can augment the model and treat the random terms as parameters, to give

$$
y = J\theta + \varepsilon
$$
$$
y = \begin{bmatrix} y^{(1)} \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\quad
J = \begin{bmatrix} K^{(2)} & \cdots & K^{(n)} \\ -I & & \\ & \ddots & \\ & & -I \end{bmatrix}
\quad
\varepsilon = \begin{bmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(n)} \end{bmatrix}
\quad
\theta = \begin{bmatrix} \varepsilon^{(2)} \\ \vdots \\ \theta^{(n)} \end{bmatrix}
$$
$$
K^{(i)} = \prod_{j=1}^{i} G^{(j-1)}, \qquad
\Sigma = \begin{bmatrix} \Sigma^{(1)} & & \\ & \ddots & \\ & & \Sigma^{(n)} \end{bmatrix}
\tag{32}
$$

with G(0) = I. This reformulation is a nonhierarchical model with no explicit priors on the parameters. However, the ReML estimates of Σ(λ)(i) are still the empirical prior covariances of the parameters θ(i) at each level. If Σ(i) is known a priori, it simply enters the scheme as a known covariance component. This corresponds to a full Bayesian analysis with known or full priors for the level in question.
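As a worked illustration of Eq. (32) (a hypothetical sketch; the helper names are not SPM routines), the common two-level shrinkage-prior model of Eq. (31) augments as follows, after which the components could be passed to a covariance-component estimator such as the ReML sketch above.

import numpy as np

def augment_two_level(y, G1, Q1, Q2):
    # two-level model:  y = G1 @ theta2 + e1,  theta2 = e2   (Eq. (31))
    # augmented form:   y_aug = J @ theta + e_aug            (Eq. (32))
    # Q1, Q2 : lists of covariance components for e1 (level 1) and e2 (level 2)
    n, p = G1.shape
    y_aug = np.concatenate([y, np.zeros(p)])
    J = np.vstack([G1, -np.eye(p)])
    Q_aug = []
    for Qi in Q1:                          # level-1 components occupy the data block
        C = np.zeros((n + p, n + p))
        C[:n, :n] = Qi
        Q_aug.append(C)
    for Qi in Q2:                          # level-2 components become empirical priors
        C = np.zeros((n + p, n + p))
        C[n:, n:] = Qi
        Q_aug.append(C)
    return y_aug, J, Q_aug

# e.g., with the ReML sketch above:
#   y_aug, J, Q_aug = augment_two_level(y, G1, [np.eye(n)], [np.eye(p)])
#   lam, Sig_lam = reml(np.outer(y_aug, y_aug), J, Q_aug)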

spm_peb.m uses this reformulation and Eq. (27) for estimation. The conditional expectations of the parameters are recovered by recursive substitution of the conditional expectations of the errors into Eq. (30) (cf. Friston, 2002). spm_peb.m uses the computationally efficient substitution

$$-\tfrac{1}{2}\operatorname{tr}\big(P_i R(yy^T - \Sigma)R^T\big) = -\tfrac{1}{2}y^TR^TP_iRy + \tfrac{1}{2}\operatorname{tr}\big(P_iR\Sigma R^T\big) \tag{33}$$

to avoid computing the potentially large matrix yy^T. We have used this scheme extensively in the construction of posterior probability maps or PPMs (Friston and Penny, 2003) and mixed-effect analysis of multi-subject studies in neuroimaging (Friston et al., 2005). Both these examples rest on hierarchical models, using hierarchical structure over voxels and subjects, respectively.

An equivalent identification of hierarchical models rests on an alternative and simpler reformulation of Eq. (30) in which all the hierarchically induced covariance components $K^{(i)}\Sigma^{(i)}K^{(i)T}$ are treated as components of a compound error

$$y = \varepsilon, \qquad y = y^{(1)}, \qquad
\varepsilon = \sum_{i=1}^{n} K^{(i)}\varepsilon^{(i)}, \qquad
\Sigma = \sum_{i=1}^{n} K^{(i)}\Sigma^{(i)}K^{(i)T}. \tag{34}$$

The ensuing ReML estimates of Σ(λ)(i) can be used to compute the conditional density of the parameters in the usual way. For example, the conditional expectation and covariance of the ith-level parameters θ(i) are

$$
\mu_\theta^{(i)} = \Sigma_\theta^{(i)} K^{(i)T}\tilde{\Sigma}^{-1}y, \qquad
\Sigma_\theta^{(i)} = \big(K^{(i)T}\tilde{\Sigma}^{-1}K^{(i)} + \Sigma^{(i)-1}\big)^{-1}, \qquad
\tilde{\Sigma} = \sum_{j\neq i} K^{(j)}\Sigma^{(j)}K^{(j)T},
\tag{35}
$$

where Σ̃ represents the ReML estimate of error covariance, excluding the level of interest. The component Σ(i) = Σ(λ)(i) is treated as an empirical prior on θ(i). spm_ReML.m uses Eq. (27) to estimate the requisite hyperparameters. Critically, it takes as an argument the matrix yy^T. This may seem computationally inefficient. However, there is a special but very common case where dealing with yy^T is more appropriate than dealing with y (cf. the implementation using Eq. (33) in spm_peb.m):

This is when there are r multiple observations that can be arranged as a matrix Y = [y1,…,yr]. If these observations are independent then we can express the covariance components of the vectorised response in terms of Kronecker tensor products

$$y = \operatorname{vec}\{Y\} = \varepsilon, \qquad
\varepsilon = \sum_{i=1}^{n}\big(I\otimes K^{(i)}\big)\varepsilon^{(i)}, \qquad
\operatorname{cov}\{\varepsilon^{(i)}\} = I\otimes\Sigma^{(i)}. \tag{36}$$

This leads to a computationally efficient scheme employed by spm_ReML.m, which uses the compact forms

$$
\begin{aligned}
L_{\lambda_i} &= -\tfrac{1}{2}\operatorname{tr}\Big(\big(I\otimes P_iR\big)\big(yy^T - I\otimes\Sigma\big)\big(I\otimes R^T\big)\Big)
= -\tfrac{r}{2}\operatorname{tr}\Big(P_iR\Big(\tfrac{1}{r}YY^T - \Sigma\Big)R^T\Big) \\
L_{\lambda\lambda ij} &= -\tfrac{1}{2}\operatorname{tr}\big(I\otimes P_iR\Sigma P_jR\Sigma\big)
= -\tfrac{r}{2}\operatorname{tr}\big(P_iR\Sigma P_jR\Sigma\big).
\end{aligned}
\tag{37}
$$

Critically, the update scheme is a function of the sample covariance of the data (1/r)YYT and can be regarded as a covariance component estimation scheme. This can be useful in two situations:

First, the augmented form in Eq. (32) may produce prohibitively long vectors. This can happen when the number of parameters is much greater than the number of responses. This is a common situation in underdetermined problems. An important example is source reconstruction in electroencephalography, where the number of sources is much greater than the number of measurement channels (see Phillips et al., 2005, for an application that uses spm_ReML.m in this context). In these cases one can form conditional estimates of the parameters using the matrix inversion lemma, and again avoid inverting large (p × p) matrices

$$
\mu_\theta^{(i)} = \Sigma^{(i)}K^{(i)T}\tilde{\Sigma}^{-1}Y, \qquad
\Sigma_\theta^{(i)} = \Sigma^{(i)} - \Sigma^{(i)}K^{(i)T}\tilde{\Sigma}^{-1}K^{(i)}\Sigma^{(i)}, \qquad
\tilde{\Sigma} = \sum_{i=1}^{n} K^{(i)}\Sigma^{(i)}K^{(i)T}.
\tag{38}
$$

The second situation is where there are a large number of realisations. In these cases it is much easier to handle the second-order matrices of the data YYT than the data Y itself. An important application here is the estimation of nonsphericity over voxels in the analysis of fMRI time-series (see Friston et al., 2002, for this use of spm_ReML.m). Here, there are many more voxels than scans and it would not be possible to vectorise the data. However, it is easy to collect the sample covariance over voxels and partition it into nonspherical covariance components using ReML.

In the case of sequential correlations among the errors, $\operatorname{cov}\{\varepsilon^{(i)}\} = V\otimes\Sigma^{(i)}$, one simply replaces $YY^T$ with $YV^{-1}Y^T$. Heuristically, this corresponds to sequentially whitening the observations before computing their second-order statistics. We have used this device in the Bayesian inversion of models of evoked and induced responses in EEG/MEG (Friston et al., 2006).
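A minimal sketch of this device (a hypothetical helper; V is assumed known):

import numpy as np

def second_moment(Y, V=None):
    # Y : (n, r) data matrix with one realisation per column
    # returns Y @ Y.T for i.i.d. realisations, or Y @ V^{-1} @ Y.T when the
    # realisations share serial correlations V (sequential whitening)
    if V is None:
        return Y @ Y.T
    return Y @ np.linalg.solve(V, Y.T)

The result, divided by r, plays the role of the sample covariance passed to the covariance-component scheme above.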

In summary, hierarchical models can be identified through ReML estimates of covariance components. If the response vector is relatively small it is generally more expedient to reduce the hierarchical form by augmentation, as in Eq. (32), and use Eq. (33) to compute the gradients. When the augmented form becomes too large, because there are too many parameters, reformulation in terms of covariance components is computationally more efficient because the gradients can be computed from the sample covariance of the data. The latter formulation is also useful when there are multiple realisations of the data because the sample covariance, over realisations, does not change in size. This leads to very fast Bayesian inversion. Both approaches rest on estimating covariance components that are induced by the observation hierarchy. This enforces a hyper-parameterisation of the covariances, as opposed to precisions (see Appendix 1).

This section contains a brief demonstration of model selection using ReML and its adjusted free energy. In these examples, we use the covariance component formulation (spm_ReML.m) as in Eq. (34), noting that exactly the same results would be obtained with augmentation (spm_peb.m). We use a simple hierarchical two-level linear model, implementing shrinkage priors, because this sort of model is common in neuroimaging data analysis and represents the simplest form of empirical Bayes. The model is described in Fig. 2. Briefly, it has eight parameters that cause a 32-variate response. The parameters are drawn from a multivariate Gaussian whose covariance was a mixture of two known components. Data were generated repeatedly (128 samples) using different parameters for each realisation. This model can be regarded as generating fMRI data over 32 scans, each with 128 voxels; or EEG data from 32 channels over 128 time bins. These simulations are provided as a proof of concept and illustrate how one might approach numerical validation in the context of other models.

The free energy can, of course, be used for model selection when models differ in the number and deployment of parameters. This is because both F and Fθ are functions of the number of parameters and their conditional uncertainty. This can be shown by evaluating the free energy as a function of the number of model parameters, for the same data. The results of this sort of evaluation are seen in Fig. 3 and demonstrate that model selection correctly identifies a model with eight parameters. This was the model used to generate the data (Fig. 2). In this example, we used a simple shrinkage prior on all parameters (i.e., Σ(2) = λ(2)I) during the inversions.

The critical issue is whether model selection will work when the models differ in their hyperparameterisation. To address this, we analysed the same data, produced by two covariance components at the second level, with models that comprised an increasing number of second-level covariance components (Fig. 4). These components can be regarded as specifying the form of empirical priors over solution space (e.g., spatial constraints in an EEG source reconstruction problem). The results of these simulations show that the adjusted free energy F correctly identified the model with two components. Conversely, the unadjusted free energy Fθ rose progressively as the number of components and accuracy increased. See Fig. 5.

The lower panel in Fig. 5 shows the hyperparameter estimates for two models. With the correctly selected model the true values fall within the 90% confidence interval. However, when the model is over-parameterised, with eight second-level components, this is not the case. Although the general profile of hyperparameters has been captured, this suboptimum model has clearly overestimated some hyperparameters and underestimated others.

Finally, to establish that the variational approximation to the log-evidence is veridical, we computed ln p(y|m) using a standard Monte Carlo–Markov chain (MCMC) procedure, described in Appendix 2. MCMC schemes are computationally intensive but allow one to sample from the posterior distribution p(ϑ|y) without making any assumptions about its form. These samples can then be used to estimate the marginal likelihood using, in this instance, a harmonic mean (see Appendix 2). The resulting estimates are not biased by the mean-field and Laplace approximations implicit in the variational scheme and can be used to assess the impact of these approximations on model comparison. The sampling estimates of free energy are provided in Fig. 6 (upper panels) for the eight models analysed in Fig. 5. The profile of true [sampling] log-evidences concurs with the free-energy profile in a pleasing way and suggests that the approximations entailed by the variational approach do not lead to inaccurate model selection, under these linear models. Furthermore, the sampled posterior p(λ|y) of the largest model (with eight second-level covariance components) is very similar to the Laplace approximation q(λ), as judged by their first two moments (Fig. 6; lower panels).

Hitherto, we have looked at model selection in terms of categorical comparisons of model evidence. However, the log-evidence is the same as the objective function used to optimise q(ϑi) for each model. Given that model selection and inversion maximise the same objective function, one might ask if inversion of an over-parameterised model finds the optimal model automatically. In other words, does maximising the free energy switch off redundant parameters and hyperparameters by setting their conditional density q(ϑi) to a point mass at zero (i.e., μi → 0 and Σi → 0)? Most fixed-form variational schemes do this; however, this is precluded in classical ReML because the Laplace approximation admits improper (negative) covariance components when these components are small. This means classical schemes cannot switch off redundant covariance components. Fortunately, it is easy to finesse this problem by applying the Laplace assumption to λ = ln α, where α are scale parameters encoding the expression of covariance components. This renders the form of q(α) log-normal and places a positivity constraint on α.

Consider the hyper-parameterisation

$$\Sigma = \sum_i \exp(\lambda_i)\,Q_i \tag{39}$$

with priors $p(\lambda) = N(\eta_\lambda,\Pi_\lambda^{-1})$. This corresponds to a transformation in which scale parameters αi = exp(λi) control the expression of each covariance component Qi. To first order, the conditional variance of each scale parameter is

$$\Sigma_i^\alpha = \frac{\partial\alpha_i}{\partial\lambda_i}\,\Sigma_i^\lambda\,\frac{\partial\alpha_i}{\partial\lambda_i}\bigg|_{\lambda=\mu^\lambda} = \mu_i^\alpha\,\Sigma_i^\lambda\,\mu_i^\alpha. \tag{40}$$

This means that when $\mu_i^\alpha = \exp(\mu_i^\lambda) \to 0$ we get $\Sigma_i^\alpha \to 0$, which is necessary for automatic model selection. In this limit, the conditional covariance of the hyperparameters is given by Eqs. (7) and (18)

$$\Sigma_\lambda = -L_{\lambda\lambda}^{-1} = \Pi_\lambda^{-1} \tag{41}$$

because $P_i = P_{ij} \to 0$ when $\exp(\mu_i^\lambda) \to 0$ (see Appendix 1). In short, by placing log-normal hyperpriors on the αi, we can use a conventional EM or ReML scheme for AMS. In this case, the conditional uncertainty about covariance components shrinks with their expectation, so that we can be certain they do not contribute to the model. It is simple to augment the ReML scheme and free energy to include hyperpriors (cf. Eq. (27))

$$F = F_\theta + \tfrac{1}{2}\ln|\Sigma_\lambda| + \tfrac{1}{2}\ln|\Pi_\lambda| - \tfrac{1}{2}\varepsilon_\lambda^T\Pi_\lambda\varepsilon_\lambda. \tag{42}$$

Note that adding redundant covariance components to the model does not change the free energy, because the entropy associated with conditional uncertainty is offset exactly by the prior uncertainty they induce, due to the equality in Eq. (41). This means that conventional model selection will show that all over-parameterised models are equally optimal. This is intuitive because the inversion has already identified the optimal model. Fig. 7 shows the hyperparameter estimates for the model described in Fig. 5, with eight second-level covariance components. The upper panel shows the results with a classical hyper-parameterisation and the lower panels show the results with log-normal hyperpriors (before and after log-transformation). Note that hyperpriors are necessary to eliminate unnecessary components. In this example (and below) we used relatively flat hyperpriors; p(λ) = N(−16, 32).
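The log-scale hyper-parameterisation and hyperpriors above amount to a small modification of the ReML sketch given earlier. The following version (again hypothetical and illustrative, not spm code) works with λ = ln α, adds the hyperprior terms to the gradient and expected curvature, and uses p(λ) = N(−16, 32) as in the text, so that redundant components are switched off rather than estimated as small negative quantities.

import numpy as np

def reml_ams(YY, G, Q, r=1, eta=-16.0, h_var=32.0, n_iter=64, tol=1e-6):
    # ReML with Sigma = sum_i exp(lam_i) * Q[i] and hyperpriors p(lam_i) = N(eta, h_var)
    n, m = G.shape[0], len(Q)
    lam = np.zeros(m)
    hP = np.eye(m) / h_var                                  # hyperprior precision Pi_lambda
    for _ in range(n_iter):
        a = np.exp(lam)                                     # scale parameters alpha_i
        S = sum(ai * Qi for ai, Qi in zip(a, Q))
        iS = np.linalg.inv(S)
        R = np.eye(n) - G @ np.linalg.solve(G.T @ iS @ G, G.T @ iS)
        P = [-iS @ (ai * Qi) @ iS for ai, Qi in zip(a, Q)]  # dSigma^{-1}/dlam_i
        g = (np.array([-0.5 * r * np.trace(Pi @ R @ (YY / r - S) @ R.T) for Pi in P])
             - hP @ (lam - eta))                            # gradient, with hyperprior term
        H = (np.array([[-0.5 * r * np.trace(Pi @ R @ S @ Pj @ R @ S) for Pj in P]
                       for Pi in P]) - hP)                  # expected curvature, minus Pi_lambda
        dl = np.clip(-np.linalg.solve(H, g), -4.0, 4.0)     # crude step limit for stability
        lam = lam + dl
        if np.abs(dl).max() < tol:
            break
    return np.exp(lam), np.linalg.inv(-H)                   # alpha_i and Sigma_lambda

Components that are not needed end up with αi = exp(λi) close to exp(−16) ≈ 0 and, by Eq. (41), a conditional covariance that collapses to the hyperprior variance, so the adjusted free energy in Eq. (42) is unchanged by adding them.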

When automatic model selection is used to eliminate redundant parameters, it is known as automatic relevance determination (ARD). In ARD one defines an empirical prior on the parameters that embodies the notion of uncertain relevance. This enables the inversion to infer which parameters are relevant and which are not (MacKay, 1995a, MacKay, 1995b). This entails giving each parameter its own shrinkage prior and estimating an associated scale parameter. In the context of linear models, this is implemented simply by adding a level to induce empirical shrinkage priors on the parameters. ReML (with hyperpriors) can then be used to switch off redundant parameters by eliminating their covariance components at the first level. We provide an illustration of this in Fig. 8, using the models reported in Fig. 3. The lower panel shows the conditional estimates of the parameters using Eq. (35) for the over-parameterised model with 16 parameters. Note that ARD with ReML correctly shrinks and switches off the last eight redundant parameters and has implicitly performed AMS. The upper panel shows that the free energy of all models, with an increasing number of parameters, reaches a maximum at the correct parameterisation and stays there even when redundant parameters (i.e., components) are added.


Discussion

We have seen that restricted maximum likelihood is a special case of expectation maximisation and that expectation maximisation is a special case of variational Bayes. In fact, nearly every routine used in neuroimaging analysis (certainly in SPM5; http://www.fil.ion.ucl.ac.uk/spm) is a special case of variational Bayes, from ordinary least squares estimation to dynamic causal modelling. We have focussed on adjusting the objective functions used by EM and ReML to approximate the variational free energy.

Acknowledgments

The Wellcome Trust and British Council funded this work.

References (44)

  • C. Phillips et al., Systematic regularisation of linear inverse solutions of the EEG source localisation problem, NeuroImage (2002)
  • C. Phillips et al., An empirical Bayesian solution to the source reconstruction problem in EEG, NeuroImage (2005)
  • N. Trujillo-Barreto et al., Bayesian model averaging, NeuroImage (2004)
  • K.Z. Adami, Variational Methods in Bayesian Deconvolution (2003)
  • M.J. Beal, Variational algorithms for approximate Bayesian inference, PhD thesis: ... (1998)
  • M.J. Beal et al., The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures
  • J.O. Berger, Statistical Decision Theory and Bayesian Analysis (1985)
  • C. Bishop, Latent variable models
  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc., Ser. B (1977)
  • B. Efron et al., Stein's estimation rule and its competitors—An empirical Bayes approach, J. Am. Stat. Assoc. (1973)
  • L. Fahrmeir et al.
  • R.P. Feynman, Statistical Mechanics (1972)