NeuroImage

Volume 34, Issue 1, 1 January 2007, Pages 220-234

Technical Note
Variational free energy and the Laplace approximation

https://doi.org/10.1016/j.neuroimage.2006.08.035

Abstract

This note derives the variational free energy under the Laplace approximation, with a focus on accounting for additional model complexity induced by increasing the number of model parameters. This is relevant when using the free energy as an approximation to the log-evidence in Bayesian model averaging and selection. By setting restricted maximum likelihood (ReML) in the larger context of variational learning and expectation maximisation (EM), we show how the ReML objective function can be adjusted to provide an approximation to the log-evidence for a particular model. This means ReML can be used for model selection, specifically to select or compare models with different covariance components. This is useful in the context of hierarchical models because it enables a principled selection of priors that, under simple hyperpriors, can be used for automatic model selection and relevance determination (ARD). Deriving the ReML objective function, from basic variational principles, discloses the simple relationships among Variational Bayes, EM and ReML. Furthermore, we show that EM is formally identical to a full variational treatment when the precisions are linear in the hyperparameters. Finally, we also consider, briefly, dynamic models and how these inform the regularisation of free energy ascent schemes, like EM and ReML.

Introduction

The purpose of this note is to describe an adjustment to the objective function used in restricted maximum likelihood (ReML) that renders it equivalent to the free energy in variational learning and expectation maximisation. This is important because the variational free energy provides a bound on the log-evidence for any model, which is exact for linear models. The log-evidence plays a central role in model selection, comparison and averaging (for examples in neuroimaging, see Penny et al., 2004; Trujillo-Barreto et al., 2004).

Previously, we have described the use of ReML in the Bayesian inversion of electromagnetic models to localise distributed sources in EEG and MEG (e.g., Phillips et al., 2002). ReML provides a principled way of quantifying the relative importance of priors that replaces alternative heuristics like L-curve analysis. Furthermore, ReML accommodates multiple priors and provides more accurate and efficient source reconstruction than its precedents (Phillips et al., 2002). More recently, we have explored the use of ReML to identify the most likely combination of priors using model selection, where each model comprises a different set of priors (Mattout et al., in press). This was based on the fact that the ReML objective function is the free energy used in expectation maximisation and is equivalent to the log-evidence Fλ = ln p(y|λ,m), conditioned on λ, the unknown covariance component parameters (i.e., hyperparameters) and the model m. The noise covariance components encoded by λ include the prior covariances of each model of the data y.

However, this free energy is not a function of the conditional uncertainty about λ and is therefore insensitive to additional model complexity induced by adding covariance components. In this note we finesse this problem and show how Fλ can be adjusted to provide the variational free energy, which, in the context of linear models, is exactly the log-evidence ln p(y|m). This rests on deriving the variational free energy for a general variational scheme and treating expectation maximisation (EM) as a special case, in which one set of parameters assumes a point mass. We then treat ReML as the special case of EM, applied to linear models.

Although this note focuses on the various forms for the free energy, we also take the opportunity to link variational Bayes (VB), EM and ReML by deriving them from basic principles. Indeed, this derivation is necessary to show how the ReML objective function can be generalised for use in model selection. The material in this note is quite technical but is presented here because it underpins many of the specialist applications we have described in the neuroimaging literature over the past years. This didactic treatment may be especially useful for software developers or readers with a particular mathematical interest. For other readers, the main message is that a variational treatment of imaging data can unite a large number of special cases within a relatively simple framework.

Variational Bayes, under the Laplace approximation, assumes a fixed Gaussian form for the conditional density of the parameters of a model and is used implicitly in ReML and many applications of EM. Bayesian inversion using VB is ubiquitous in neuroimaging (e.g., Penny et al., 2005). Its use ranges from spatial segmentation and normalisation of images during pre-processing (e.g., Ashburner and Friston, 2005) to the inversion of complicated dynamic causal models of functional integration in the brain (Friston et al., 2003). Many of the intervening steps in classical and Bayesian analysis of neuroimaging data call on ReML or EM under the Laplace approximation. This note provides an overview of how these schemes are related and illustrates their applications with reference to specific algorithms and routines we have described in the past (and are currently developing; e.g., dynamic expectation maximisation; DEM). One interesting issue that emerges from this treatment is that VB reduces exactly to EM, under the Laplace approximation, when the precision of stochastic terms is linear in the hyperparameters. This reveals a close relationship between EM and full variational approaches.

This note is divided into seven sections. In the first we summarise the basic theory of variational Bayes and apply it in the context of the Laplace approximation. The Laplace approximation imposes a fixed Gaussian form on the conditional density, which simplifies the ensuing variational steps. In this section we look at the easy problem of approximating the conditional covariance of model parameters and the more difficult problem of approximating their conditional expectation or mode using gradient ascent. We consider a dynamic formulation of gradient ascent, which generalises nicely to cover dynamic models and provides the basis for a temporal regularisation of the ascent. In the second section we apply the theory to nonlinear models with additive noise. We use the VB scheme that emerges as the reference for subsequent sections looking at special cases. The third section considers EM, which can be seen as a special case of VB in which uncertainty about one set of parameters is ignored. In the fourth section we look at the special case of linear models where EM reduces to ReML. The fifth section considers ReML and hierarchical models. Hierarchical models are important because they underpin parametric empirical Bayes (PEB) and other special cases, like relevance vector machines. Furthermore, they provide a link with classical covariance component estimation. In the sixth section we present some toy examples to show how the ReML and EM objective functions can be used to evaluate the log-evidence and facilitate model selection. This section concludes with an evaluation of the Laplace approximation to the model evidence, in relation to Monte Carlo–Markov chain (MCMC) sampling estimates. The final section revisits model selection using automatic model selection (AMS) and relevance determination (ARD). We show how suitable hyperpriors enable EM and ReML to select the best model automatically, by switching off redundant parameters and hyperparameters. The Appendices include some notes on parameterising covariances and the sampling scheme used for validation of the Laplace approximations.

Empirical enquiry in science usually rests upon estimating the parameters of some model of how observed data were generated and making inferences about the parameters (or model). Estimation and inference are based on the posterior density of the parameters (or model), conditional on the observations. Variational Bayes is used to evaluate these posterior densities.

Variational Bayes is a generic approach to posterior density (as opposed to posterior mode) analysis that approximates the conditional density p(ϑ|y,m) of some model parameters ϑ, given a model m and data y. Furthermore, it provides the evidence (also called the marginal or integrated likelihood) of the model p(y|m) which, under prior assumptions about the model, furnishes the posterior density p(m|y) of the model itself (see Penny et al., 2004 for an example in neuroimaging).

Variational approaches rest on minimising the Feynman variational bound (Feynman, 1972). In variational Bayes the free energy represents a bound on the log-evidence. Variational methods are well established in the approximation of densities in statistical physics (e.g., Weissbach et al., 2002) and were introduced by Feynman within the path integral formulation (Titantah et al., 2001). The variational framework was introduced into statistics through ensemble learning, where the ensemble or variational density q(ϑ) (i.e., approximating posterior density) is optimised to minimise the free energy. Initially (Hinton and von Cramp, 1993, MacKay, 1995a, MacKay, 1995b) the free energy was described in terms of description lengths and coding. Later, established methods like EM were considered in the light of variational free energy (Neal and Hinton, 1998; see also Bishop, 1999). Variational learning can be regarded as subsuming most other learning schemes as special cases. This is the theme pursued here, with special reference to fixed-form approximations and classical methods like ReML (Harville, 1977).

The derivations in this paper involve a fair amount of differentiation. To simplify things we will use the notation $f_x = \partial f/\partial x$ to denote the partial derivative of the function f with respect to the variable x. For time derivatives we will also use $\dot{x} = x_t$.

The log-evidence can be expressed in terms of the free energy and a divergence term

$$\ln p(y|m) = F + D\big(q(\vartheta)\,\|\,p(\vartheta|y,m)\big) \tag{1}$$

$$F = \langle L(\vartheta)\rangle_q - \langle \ln q(\vartheta)\rangle_q, \qquad L = \ln p(y,\vartheta).$$

Here − 〈ln q(ϑ)〉q is the entropy and 〈L(ϑ)〉q the expected energy. Both quantities are expectations under the variational density. Eq. (1) indicates that F is a lower-bound approximation to the log-evidence because the divergence or cross-entropy D(q(ϑ)||p(ϑ|y,m)) is non-negative. In this note, all the energies are the negative of the energies considered in statistical physics. The objective is to compute q(ϑ) for each model by maximising F, and then compute F itself, for Bayesian inference and model comparison, respectively. Maximising the free energy minimises the divergence, rendering the variational density q(ϑ) ≈ p(ϑ|y,m) an approximate posterior, which is exact for linear systems. To make the maximisation easier one usually assumes q(ϑ) factorises over sets of parameters ϑi

$$q(\vartheta) = \prod_i q_i. \tag{2}$$

In statistical physics this is called a mean field approximation. Under this approximation, the Fundamental Lemma of variational calculus means that F is maximised with respect to qi = q(ϑi) when, and only when,

$$\delta F_i = 0 \;\Leftrightarrow\; f_i = \langle f_i\rangle_{q_i}, \qquad f_i = \frac{\partial F}{\partial q_i}, \tag{3}$$

where δFi is the variation of the free energy with respect to qi. From Eq. (1)

$$f_i = I(\vartheta_i) - \ln q_i - \ln Z_i, \qquad I(\vartheta_i) = \langle L(\vartheta)\rangle_{q_{\setminus i}}, \tag{4}$$

where ϑ\i denotes the parameters not in the ith set. We have lumped terms that do not depend on ϑi into ln Zi, where Zi is a normalisation constant (i.e., partition function). We will call I(ϑi) the variational energy, noting that its expectation under qi is the expected energy. Note that when all the parameters are considered in a single set, the energy and the variational energy become the same thing; i.e., I(ϑ) = L(ϑ). The extremal condition in Eq. (3) is met when

$$\ln q_i = I(\vartheta_i) - \ln Z_i \;\;\Leftrightarrow\;\; q(\vartheta_i) = \frac{1}{Z_i}\exp\big(I(\vartheta_i)\big). \tag{5}$$

If this analytic form were tractable (e.g., through the use of conjugate priors) it could be used directly. See Beal and Ghahramani (2003) for an excellent treatment of conjugate-exponential models. However, we will assume a Gaussian fixed form for the variational density to provide a generic scheme that can be applied to a wide range of models. Note that assuming a Gaussian form for the conditional density is equivalent to assuming a quadratic form for the variational energy (cf. a second-order Taylor approximation).
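As a quick sanity check on Eq. (5) (an illustrative aside, not part of the original argument), note that if all the parameters are kept in a single set then I(ϑ) = L(ϑ) = ln p(y,ϑ) and the stationary solution returns the exact posterior,

$$q(\vartheta) = \frac{1}{Z}\exp\big(L(\vartheta)\big) = \frac{p(y,\vartheta)}{\int p(y,\vartheta)\,d\vartheta} = p(\vartheta|y,m), \qquad \ln Z = \ln p(y|m),$$

so the bound is tight. It is the mean-field factorisation and the fixed Gaussian form introduced below that make F an approximation rather than an equality.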

Laplace’s method (also known as the saddle-point approximation) approximates an integral using a Taylor expansion of the integrand's logarithm around its peak. Traditionally, in the statistics and machine learning literature, the Laplace approximation refers to the evaluation of the marginal likelihood or free energy using Laplace's method. This is equivalent to a local Gaussian approximation of p(ϑ|y) around a maximum a posteriori (MAP) estimate (Kass and Raftery, 1995). A Gaussian approximation is motivated by the fact that, in the large data limit and given some regularity conditions, the posterior approaches a Gaussian around the MAP (Beal and Ghahramani, 2003). However, the Laplace approximation can be inaccurate with non-Gaussian posteriors, especially when the mode is not near the majority of the probability mass. By applying Laplace's method in a variational context, we can avoid this problem: in what follows, we use a Gaussian approximation to each p(ϑi|y), induced by the mean-field approximation. This finesses the evaluation of the variational energy I(ϑi), which is then optimised to find its mode. This contrasts with the conventional Laplace approximation, which is applied post hoc, after the mode has been identified. We will refer to the latter as the post hoc Laplace approximation.

Under the Laplace approximation, the variational density assumes a Gaussian form qi = N(μi,Σi) with variational parameters μi and Σi corresponding to the conditional mode and covariance of the ith set of parameters. The advantage of this is that the conditional covariance can be evaluated very simply. Under the Laplace assumption

$$F = L(\mu) + \frac{1}{2}\sum_i\big(U_i + \ln|\Sigma_i| + p_i\ln 2\pi e\big)$$

$$I(\vartheta_i) = L(\vartheta_i,\mu_{\setminus i}) + \frac{1}{2}\sum_{j\neq i} U_j, \qquad U_i = \operatorname{tr}\big(\Sigma_i L_{\vartheta_i\vartheta_i}\big), \tag{6}$$

where pi = dim(ϑi) is the number of parameters in the ith set. The approximate conditional covariances obtain as an analytic function of the modes, by differentiating the free energy in Eq. (6) with respect to Σi and solving for zero

$$F_{\Sigma_i} = \frac{1}{2}L_{\vartheta_i\vartheta_i} + \frac{1}{2}\Sigma_i^{-1} = 0 \;\Rightarrow\; \Sigma_i = -L(\mu)_{\vartheta_i\vartheta_i}^{-1}. \tag{7}$$

Note that this solution for the conditional covariances does not depend on the mean-field approximation but only on the Laplace approximation. Eq. (7) recapitulates the conventional Laplace approximation, in which the conditional covariance is determined by the curvature (Hessian) of the energy, $L(\mu)_{\vartheta_i\vartheta_i}$, evaluated at the variational mode or maximum a posteriori (MAP) estimate. Substitution into Eq. (6) means $U_i = -p_i$ and

$$F = L(\mu) + \sum_i \frac{1}{2}\big(\ln|\Sigma_i| + p_i\ln 2\pi\big). \tag{8}$$
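To make Eqs. (7) and (8) concrete, the following short NumPy sketch evaluates the post hoc Laplace approximation for a toy one-parameter model; the model, tolerances and helper names are illustrative assumptions, not code from the paper or SPM. The mode is found with a generic optimiser and the curvature with finite differences.

import numpy as np
from scipy.optimize import minimize

def log_joint(theta, y, prior_mean=0.0, prior_var=1.0, noise_var=0.01):
    # energy L(theta) = ln p(y|theta) + ln p(theta) for a toy model y = exp(theta) + noise
    pred = np.exp(theta[0])
    log_lik = (-0.5 * np.sum((y - pred) ** 2) / noise_var
               - 0.5 * y.size * np.log(2 * np.pi * noise_var))
    log_prior = (-0.5 * (theta[0] - prior_mean) ** 2 / prior_var
                 - 0.5 * np.log(2 * np.pi * prior_var))
    return log_lik + log_prior

def laplace_free_energy(L, theta0, eps=1e-4):
    # mode mu, covariance Sigma = -L''(mu)^{-1} (Eq. (7)) and
    # F = L(mu) + 0.5*(ln|Sigma| + p*ln 2*pi)   (Eq. (8))
    mu = minimize(lambda th: -L(th), theta0).x
    p = mu.size
    H = np.zeros((p, p))                      # finite-difference Hessian of L at mu
    for i in range(p):
        for j in range(p):
            ei, ej = np.eye(p)[i] * eps, np.eye(p)[j] * eps
            H[i, j] = (L(mu + ei + ej) - L(mu + ei - ej)
                       - L(mu - ei + ej) + L(mu - ei - ej)) / (4 * eps ** 2)
    Sigma = np.linalg.inv(-H)
    F = L(mu) + 0.5 * (np.linalg.slogdet(Sigma)[1] + p * np.log(2 * np.pi))
    return mu, Sigma, F

rng = np.random.default_rng(0)
y = np.exp(0.5) + 0.1 * rng.standard_normal(32)
mu, Sigma, F = laplace_free_energy(lambda th: log_joint(th, y), np.array([0.0]))

The same ingredients (a mode, a curvature and Eq. (8)) appear in all of the schemes below; what changes is how the mode is found.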

The only remaining quantities required are the variational modes, which, from Eq. (5), maximise I(ϑi). This leads to the following compact variational scheme.

The modes can be found using a gradient ascent based on

$$\dot{\mu}_i = I(\mu_i)_{\vartheta_i} = \frac{\partial I(\vartheta_i)}{\partial\vartheta_i}\bigg|_{\vartheta_i=\mu_i}. \tag{10}$$

It may seem odd to formulate an ascent in terms of the motion of the mode in time. However, this is useful when generalising to dynamic models (see below). The updates for the mode obtain by integrating Eq. (10) to give

$$\Delta\mu_i = \big(\exp(tJ) - I\big)J^{-1}\dot{\mu}_i, \qquad J = \frac{\partial\dot{\mu}_i}{\partial\vartheta_i} = I(\mu_i)_{\vartheta_i\vartheta_i}. \tag{11}$$

When t gets large, the matrix exponential disappears, because the curvature is negative definite, and we get a conventional Newton scheme

$$\Delta\mu_i = -I(\mu_i)_{\vartheta_i\vartheta_i}^{-1}\,I(\mu_i)_{\vartheta_i}. \tag{12}$$

Together with the expression for the conditional covariance in Eq. (7), this update furnishes a variational scheme under the Laplace approximation (Eq. (13)).

Note that this scheme rests on, and only on, the specification of the energy function L(ϑ) implied by a generative model.

In some instances, deviations from the quadratic form assumed for the variational energy I(ϑi) under the Laplace approximation can confound a simple Newton ascent. This can happen when the curvature of the objective function is badly behaved (e.g., when the objective function becomes convex, the curvatures can become positive and the ascent turns into a descent). In these situations some form of regularisation is required to ensure a robust ascent. This can be implemented by augmenting Eq. (10) with a decay term

$$\dot{\mu}_i = I(\mu_i)_{\vartheta_i} - v\,\Delta\mu_i. \tag{14}$$

This effectively pulls the search back towards the expansion point provided by the previous iteration and enforces a local exploration. Integration to the fixed point gives a classical Levenberg–Marquardt scheme (cf. Eq. (11))

$$\Delta\mu_i = -J^{-1}\dot{\mu}_i = \big(vI - I(\mu_i)_{\vartheta_i\vartheta_i}\big)^{-1} I(\mu_i)_{\vartheta_i}, \qquad J = I(\mu_i)_{\vartheta_i\vartheta_i} - vI, \tag{15}$$

where v is the Levenberg–Marquardt regularisation. However, the dynamic formulation affords a simpler alternative, namely temporal regularisation. Here, instead of constraining the search with a decay term, one can abbreviate it by terminating the ascent after some suitable period t = v; from Eq. (11)

$$\Delta\mu_i = \big(\exp(vJ) - I\big)J^{-1}\dot{\mu}_i = \Big(\exp\big(vI(\mu_i)_{\vartheta_i\vartheta_i}\big) - I\Big)\, I(\mu_i)_{\vartheta_i\vartheta_i}^{-1}\, I(\mu_i)_{\vartheta_i}, \qquad J = I(\mu_i)_{\vartheta_i\vartheta_i}. \tag{16}$$

This has the advantage of using the local gradients and curvatures while precluding large excursions from the expansion point. In our implementations, v = 1/η is based on the 2-norm η of the curvature for both regularisation schemes. The 2-norm is the largest singular value and, in the present context, represents an upper bound on the rate of convergence of the ascent (cf. a Lyapunov exponent). Terminating the ascent prematurely is reminiscent of “early stopping” in the training of neural networks, in which the number of weights far exceeds the sample size (e.g., Nelson and Illingworth, 1991, p. 165). It is interesting to note that “early stopping” is closely related to ridge regression, which is another perspective on Levenberg–Marquardt regularisation.

A comparative example using Levenberg–Marquardt and temporal regularisation is provided in Fig. 1 and suggests that, in this example, temporal regularisation is better. Either approach can be implemented in the VB scheme by simply regularising the Newton update if the variational energy I(ϑi) fails to increase after each iteration. We prefer temporal regularisation because it is based on a simpler heuristic and, more importantly, is straightforward to implement in dynamic schemes using high-order temporal derivatives.
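The three update rules are easy to compare numerically. The sketch below (NumPy/SciPy; the toy curvature, gradient and helper name are illustrative assumptions) implements the Newton step of Eq. (12), the Levenberg–Marquardt step of Eq. (15) and the temporally regularised step of Eq. (16), with v = 1/η set from the 2-norm of the curvature as described above.

import numpy as np
from scipy.linalg import expm

def ascent_steps(grad, curv, v):
    # grad: dI/dtheta at the current mode; curv: d2I/dtheta^2; v: regularisation
    I_n = np.eye(len(grad))
    newton   = -np.linalg.solve(curv, grad)                          # Eq. (12)
    lm       = np.linalg.solve(v * I_n - curv, grad)                 # Eq. (15)
    temporal = (expm(v * curv) - I_n) @ np.linalg.solve(curv, grad)  # Eq. (16)
    return newton, lm, temporal

curv = np.array([[-4.0, 1.0], [1.0, -2.0]])   # toy negative-definite curvature
grad = np.array([1.0, -0.5])
v = 1.0 / np.linalg.norm(curv, 2)             # v = 1/eta, eta = largest singular value
print(ascent_steps(grad, curv, v))

As v grows, the temporally regularised step approaches the Newton step (the matrix exponential vanishes for negative definite curvature), while small v keeps the update close to the expansion point.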

The second reason we have formulated the ascent as a time-dependent process is that it can be used to invert dynamic models. In this instance, the integration time in Eq. (16) is determined by the interval between observations. This is the approach taken in our variational treatment of dynamic systems; namely, dynamic expectation maximisation or DEM (introduced briefly in Friston, 2005 and implemented in spm_DEM.m). DEM produces conditional densities that are a continuous function of time and avoids many of the limitations of discrete schemes based on incremental Bayes (e.g., extended Kalman filtering). In dynamic models the energy is a function of the parameters and their high-order motion; i.e., $I(\vartheta_i) \to I(\vartheta_i,\dot{\vartheta}_i,\ldots,t)$. This entails the extension of the variational density to cover this motion, using generalised co-ordinates $q(\vartheta_i) \to q(\vartheta_i,\dot{\vartheta}_i,\ldots,t)$. This approach will be described fully in a subsequent paper. Here we focus on static models.

Having established the operational equations for VB under the Laplace approximation we now look at their application to some specific models.

Consider the generative model with additive error y = G(θ) + ε(λ). Gaussian assumptions about the errors or innovations p(ε) = N(0,Σ(λ)) furnish a likelihood p(y|θ,λ) = N(G(θ),Σ(λ)). In this example, we can consider the parameters as falling into two sets ϑ = {θ,λ} such that q(ϑ) = q(θ)q(λ), where q(θ) = N(μθ,Σθ) and q(λ) = N(μλ,Σλ). We will also assume Gaussian priors $p(\theta) = N(\eta_\theta,\Pi_\theta^{-1})$ and $p(\lambda) = N(\eta_\lambda,\Pi_\lambda^{-1})$. We will refer to the two sets as the parameters and hyperparameters. This likelihood and these priors define the energy L(ϑ) = ln p(y|θ,λ) + ln p(θ) + ln p(λ). Note that Gaussian priors are not too restrictive because both G(θ) and Σ(λ) can be nonlinear functions that embody a probability integral transform (i.e., can implement a re-parameterisation in terms of non-Gaussian priors).

Given n samples, p parameters and h hyperparameters, the energy and its derivatives are

$$
\begin{aligned}
L(\vartheta) = &-\tfrac{1}{2}\varepsilon^T\Sigma^{-1}\varepsilon + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi \\
&-\tfrac{1}{2}\varepsilon_\theta^T\Pi_\theta\varepsilon_\theta + \tfrac{1}{2}\ln|\Pi_\theta| - \tfrac{p}{2}\ln 2\pi \\
&-\tfrac{1}{2}\varepsilon_\lambda^T\Pi_\lambda\varepsilon_\lambda + \tfrac{1}{2}\ln|\Pi_\lambda| - \tfrac{h}{2}\ln 2\pi
\end{aligned}
\tag{17}
$$

$$\varepsilon = G(\mu_\theta) - y, \qquad \varepsilon_\theta = \mu_\theta - \eta_\theta, \qquad \varepsilon_\lambda = \mu_\lambda - \eta_\lambda$$

and

$$
\begin{aligned}
L(\mu)_\theta &= -G_\theta^T\Sigma^{-1}\varepsilon - \Pi_\theta\varepsilon_\theta \\
L(\mu)_{\theta\theta} &= -G_\theta^T\Sigma^{-1}G_\theta - \Pi_\theta \\
L(\mu)_{\lambda_i} &= -\tfrac{1}{2}\operatorname{tr}\big(P_i(\varepsilon\varepsilon^T - \Sigma)\big) - \big(\Pi_\lambda\varepsilon_\lambda\big)_i \\
L(\mu)_{\lambda\lambda ij} &= -\tfrac{1}{2}\operatorname{tr}\big(P_{ij}(\varepsilon\varepsilon^T - \Sigma)\big) - \tfrac{1}{2}\operatorname{tr}\big(P_i\Sigma P_j\Sigma\big) - \big(\Pi_\lambda\big)_{ij} \\
P_i &= \frac{\partial\Sigma^{-1}}{\partial\lambda_i}, \qquad P_{ij} = \frac{\partial^2\Sigma^{-1}}{\partial\lambda_i\,\partial\lambda_j}.
\end{aligned}
\tag{18}
$$

Note that we have ignored second-order terms that depend on Gθθ, under the assumption that the generative model is only weakly nonlinear. The requisite gradients and curvatures are

$$
\begin{aligned}
I(\theta)_{\theta_k} &= L(\theta,\mu_\lambda)_{\theta_k} + \tfrac{1}{2}\operatorname{tr}\big(\Sigma_\lambda A_k\big), &
I(\lambda)_{\lambda_i} &= L(\mu_\theta,\lambda)_{\lambda_i} + \tfrac{1}{2}\operatorname{tr}\big(\Sigma_\theta C_i\big), \\
I(\theta)_{\theta\theta kl} &= L(\theta,\mu_\lambda)_{\theta\theta kl} + \tfrac{1}{2}\operatorname{tr}\big(\Sigma_\lambda B_{kl}\big), &
I(\lambda)_{\lambda\lambda ij} &= L(\mu_\theta,\lambda)_{\lambda\lambda ij} + \tfrac{1}{2}\operatorname{tr}\big(\Sigma_\theta D_{ij}\big), \\
A_{ijk} &= -G_{\theta\cdot k}^{T} P_{ij}\,\varepsilon, &
C_i &= -G_\theta^{T} P_i G_\theta, \\
B_{ijkl} &= -G_{\theta\cdot k}^{T} P_{ij} G_{\theta\cdot l}, &
D_{ij} &= -G_\theta^{T} P_{ij} G_\theta,
\end{aligned}
\tag{19}
$$

where Gθ·k denotes the kth column of Gθ. These enter the VB scheme in Eq. (13), giving the two-step scheme of Eq. (20).

The negative free energy for these models is

$$
\begin{aligned}
F = &-\tfrac{1}{2}\varepsilon^T\Sigma^{-1}\varepsilon + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi \\
&-\tfrac{1}{2}\varepsilon_\theta^T\Pi_\theta\varepsilon_\theta + \tfrac{1}{2}\ln|\Pi_\theta| + \tfrac{1}{2}\ln|\Sigma_\theta| \\
&-\tfrac{1}{2}\varepsilon_\lambda^T\Pi_\lambda\varepsilon_\lambda + \tfrac{1}{2}\ln|\Pi_\lambda| + \tfrac{1}{2}\ln|\Sigma_\lambda|.
\end{aligned}
\tag{21}
$$
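For reference, Eq. (21) translates line by line into code. The following NumPy sketch (a hypothetical helper, not an SPM routine) evaluates F given the prediction error, prior errors, precisions and conditional covariances; it assumes these have already been computed by a VB scheme such as the one above.

import numpy as np

def logdet(A):
    return np.linalg.slogdet(A)[1]

def free_energy_nonlinear(e, iS, e_th, P_th, e_la, P_la, C_th, C_la):
    # F of Eq. (21): e = G(mu_theta) - y, iS = Sigma(lambda)^{-1},
    # e_th = mu_theta - eta_theta, P_th = Pi_theta, C_th = Sigma_theta,
    # e_la = mu_lambda - eta_lambda, P_la = Pi_lambda, C_la = Sigma_lambda
    n = e.size
    return (-0.5 * e @ iS @ e         + 0.5 * logdet(iS)   - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * e_th @ P_th @ e_th + 0.5 * logdet(P_th) + 0.5 * logdet(C_th)
            - 0.5 * e_la @ P_la @ e_la + 0.5 * logdet(P_la) + 0.5 * logdet(C_la))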

In principle, these equations cover a large range of models and will work provided the true posterior is unimodal (and roughly Gaussian). The latter requirement can usually be met by a suitable transformation of parameters. In the next section, we consider a further simplification of our assumptions about the variational density and how this leads to expectation maximisation.

There is a key distinction between θ and λ in the generative model above: The parameters λ are hyperparameters in the sense that, like the variational parameters, they parameterise a density. In many instances their conditional density per se is uninteresting. In variational expectation maximisation (EM), we ignore uncertainty about the hyperparameters. In this case, the free energy is effectively conditioned on λ and reduces to

$$
\begin{aligned}
F_\lambda &= \ln p(y|\lambda) - D\big(q(\theta)\,\|\,p(\theta|y,\lambda)\big) \\
&= -\tfrac{1}{2}\varepsilon^T\Sigma^{-1}\varepsilon + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi
-\tfrac{1}{2}\varepsilon_\theta^T\Pi_\theta\varepsilon_\theta + \tfrac{1}{2}\ln|\Pi_\theta| + \tfrac{1}{2}\ln|\Sigma_\theta|.
\end{aligned}
\tag{22}
$$

Here, $F_\lambda \leq \ln p(y|\lambda)$ becomes a lower bound on the log likelihood of the hyperparameters. This means the variational step updating the hyperparameters maximises the likelihood of the hyperparameters ln p(y|λ) and becomes an M-step. In this context, Eq. (20) simplifies because we can ignore the terms that involve Σλ and Πλ, to give the corresponding E- and M-steps.

Expectation–maximisation or EM is an iterative parameter re-estimation procedure devised to estimate the parameters and hyperparameters of a model. It was introduced as an iterative method to obtain maximum likelihood estimators with incomplete data (Hartley, 1958) and was generalised by Dempster et al. (1977). Strictly speaking, EM refers to schemes in which the conditional density of the E-step is known exactly, obviating the need for fixed-form assumptions. This is why we used the term ‘variational EM’ above.

In terms of the VB scheme, the M-step for $\mu_\lambda = \arg\max_\lambda I(\lambda)$ is unchanged because I(λ) does not depend on Σλ. The remaining variational steps (i.e., E-steps) are simplified because one does not have to average over the conditional density q(λ). The ensuing scheme is that described in Friston (2002) for nonlinear system identification and is implemented in spm_nlsi.m. Although this scheme is applied to time series it actually treats the underlying model as static, generating finite-length data sequences. This routine is used to identify hemodynamic models in terms of biophysical parameters for regional responses and dynamic causal models (DCMs) of distributed responses in a variety of applications; e.g., fMRI (Friston et al., 2003), EEG (David et al., 2006), MEG (Kiebel et al., 2006) and mean-field models of neuronal activity (Harrison et al., 2005).

A key point here is that VB and EM are exactly the same when Pij = 0. In this instance the matrices A, B and D in Eq. (19) disappear. This means the VB-step for the parameters does not depend on Σλ and becomes formally identical to the E-step. Because the VB-step for the hyperparameters is already the same as the M-step (apart from the loss of hyperpriors) the two schemes converge. One can ensure Pij = 0 by adopting a hyper-parameterisation which renders the precision linear in the hyperparameters; for example, a linear mixture of precision components Qi (see Appendix 1). The resulting variational scheme is used by the SPM5 version of spm_nlsi.m for nonlinear system identification.
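To spell out why this hyper-parameterisation works (a short aside using the definitions in Eq. (18)), suppose the precision is a linear mixture of precision components Qi:

$$\Sigma(\lambda)^{-1} = \sum_i \lambda_i Q_i \;\;\Rightarrow\;\; P_i = \frac{\partial\Sigma^{-1}}{\partial\lambda_i} = Q_i, \qquad P_{ij} = \frac{\partial^2\Sigma^{-1}}{\partial\lambda_i\,\partial\lambda_j} = 0.$$

With Pij = 0, the matrices A, B and D in Eq. (19) vanish, so the VB-step for the parameters no longer depends on Σλ and reduces to the E-step.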

The second key point that follows from this analysis is that one can adjust the EM free energy to approximate the log-evidence, as described next.

The EM free energy in Eq. (22) discounts uncertainty about the hyperparameters because it is conditioned upon them. This is a well-recognised problem, sometimes referred to as the overconfidence problem, for which a number of approximate solutions have been suggested (e.g., Kass and Steffey, 1989). Here we describe a solution that appeals to the variational framework within which EM can be treated.

If we treat EM as an approximate variational scheme, we can adjust the EM free energy to give the variational free energy required for model comparison and averaging. By comparing Eqs. (21) and (22) we can express the variational free energy in terms of Fλ and an extra term

$$F = F_\lambda + \tfrac{1}{2}\ln|\Sigma_\lambda|, \qquad \Sigma_\lambda = -L(\mu)_{\lambda\lambda}^{-1}, \tag{23}$$

where the expression for L(μ)λλ comes from Eq. (18). Intuitively, the extra term encodes the conditional information (i.e., entropy) about the model's covariance components. The log-evidence will only increase if there is conditional information about the extra component. Adding redundant components will have no effect on F (see the section below on automatic model selection). This term can be regarded as an additional Occam factor (Mackay and Takeuchi, 1996).
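In code, the adjustment is a one-liner once the curvature of the energy with respect to the hyperparameters is in hand; a hypothetical sketch:

import numpy as np

def adjust_em_free_energy(F_lambda, L_ll):
    # F = F_lambda + 0.5*ln|Sigma_lambda| (Eq. (23)), with Sigma_lambda = -L_ll^{-1},
    # where L_ll is the (negative definite) curvature of the energy w.r.t. lambda (Eq. (18))
    Sigma_lambda = np.linalg.inv(-L_ll)
    return F_lambda + 0.5 * np.linalg.slogdet(Sigma_lambda)[1]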

Note that even when conditional uncertainty about the hyperparameters has no effect on the conditional density of the parameters (e.g., when the precisions are linear in the hyperparameters—see above) this uncertainty can still have a profound effect on model selection because it is an important component of the free energy and therefore the log-evidence for a particular model.

Adjusting the EM free energy to approximate the log-evidence is important because of the well-known connections between EM for linear models and restricted maximum likelihood. This connection suggests that ReML could also be used to evaluate the log-evidence and therefore be used for model selection. We now consider ReML as a special case of EM.

In the case of general linear models G(θ) = Gθ with additive Gaussian noise and no priors on the parameters (i.e., Πθ = 0) the free energy reduces to

$$F_\theta = \ln p(y|\lambda) - D\big(q(\theta)\,\|\,p(\theta|y,\lambda)\big)
= -\tfrac{1}{2}\varepsilon^T\Sigma^{-1}\varepsilon + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi + \tfrac{1}{2}\ln|\Sigma_\theta|. \tag{24}$$

Critically, the dependence on q(θ) can be eliminated using the closed-form solutions for the conditional moments

$$\mu_\theta = \Sigma_\theta G^T\Sigma^{-1}y, \qquad \Sigma_\theta = \big(G^T\Sigma^{-1}G\big)^{-1}, \tag{25}$$

which eliminate the divergence term and give

$$
\begin{aligned}
F_\theta = \ln p(y|\lambda) &= -\tfrac{1}{2}\operatorname{tr}\big(\Sigma^{-1}Ryy^TR^T\big) + \tfrac{1}{2}\ln|\Sigma^{-1}| - \tfrac{n}{2}\ln 2\pi - \tfrac{1}{2}\ln|G^T\Sigma^{-1}G| \\
\varepsilon &= Ry \\
R &= I - G\big(G^T\Sigma^{-1}G\big)^{-1}G^T\Sigma^{-1}.
\end{aligned}
\tag{26}
$$

This free energy is also known as the ReML objective function (Harville, 1977). ReML or restricted maximum likelihood was introduced by Patterson and Thompson in 1971 as a technique for estimating variance components, which accounts for the loss in degrees of freedom that results from estimating fixed effects (Harville, 1977). The elimination makes the free energy a simple function of the hyperparameters and, effectively, the EM scheme reduces to a single M-step or ReML-step (Eq. (27)).
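The ReML-step is easy to prototype. The sketch below is a minimal NumPy implementation of Fisher-scoring ReML for a linear mixture of covariance components, Σ(λ) = Σi λiQi; the function name and arguments are hypothetical assumptions and this is not spm_reml.m. It uses the gradient and expected curvature of Fθ with respect to λ (cf. Eqs. (33) and (37) below), and the returned curvature also furnishes Σλ for the adjustment in Eq. (29).

import numpy as np

def reml(YY, G, Q, r=1, n_iter=32, tol=1e-8):
    # Fisher-scoring ReML for Sigma(lambda) = sum_i lambda_i * Q[i]
    # YY : (n, n) second-moment matrix of the data (y y' for a single realisation)
    # G  : (n, p) design matrix of fixed effects
    # Q  : list of (n, n) covariance components
    # r  : number of realisations contributing to YY
    n, m = G.shape[0], len(Q)
    lam = np.ones(m)
    for _ in range(n_iter):
        S = sum(l * Qi for l, Qi in zip(lam, Q))                       # Sigma(lambda)
        iS = np.linalg.inv(S)
        R = np.eye(n) - G @ np.linalg.solve(G.T @ iS @ G, G.T @ iS)    # Eq. (26)
        P = [-iS @ Qi @ iS for Qi in Q]                                # P_i = dSigma^{-1}/dlambda_i
        g = np.array([-0.5 * r * np.trace(Pi @ R @ (YY / r - S) @ R.T) for Pi in P])
        H = np.array([[-0.5 * r * np.trace(Pi @ R @ S @ Pj @ R @ S) for Pj in P]
                      for Pi in P])                                    # expected curvature
        dl = -np.linalg.solve(H, g)                                    # Fisher-scoring ascent
        lam = lam + dl    # note: components may become negative (the classical behaviour
        if np.abs(dl).max() < tol:                 # revisited in the AMS section below)
            break
    Sigma_lam = np.linalg.inv(-H)                  # conditional covariance of lambda
    return lam, Sigma_lam

The free energy Fθ can then be evaluated from Eq. (26) and adjusted with ½ln|Σλ| as in Eq. (29) to approximate the log-evidence for model comparison.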

Notice that the energy has replaced the variational energy because they are the same: from Eq. (6), I(λ) = L(λ). This is a result of eliminating q(θ) from the variational density. Furthermore, the curvature has been replaced by its expectation, to render the Newton scheme a Fisher scoring scheme, using

$$\big\langle Ryy^TR^T\big\rangle = R\Sigma R^T = \Sigma - G\Sigma_\theta G^T = R\Sigma. \tag{28}$$

To approximate the log-evidence we can adjust the ReML free energy, after convergence, as with the EM free energy

$$F = F_\theta + \tfrac{1}{2}\ln|\Sigma_\lambda|, \qquad \Sigma_\lambda = -L(\mu)_{\lambda\lambda}^{-1}. \tag{29}$$

The conditional covariance of the hyperparameters uses the same curvature as the ascent in Eq. (27). Being able to compute the log-evidence from ReML is useful because ReML is used widely in an important class of models, namely hierarchical models reviewed next.

The application of ReML to the linear models of the previous section did not accommodate priors on the parameters. However, one can generally absorb these priors into the error covariance components using a hierarchical formulation. This enables the use of ReML to identify models with full or empirical priors. Hierarchical linear models are equivalent to parametric empirical Bayes models (Efron and Morris, 1973) in which empirical priors emerge from conditional independence of the errors ε(i) ∼ N(0,Σ(i)):

$$
\begin{aligned}
y^{(1)} &= G^{(1)}\theta^{(2)} + \varepsilon^{(1)} \\
\theta^{(2)} &= G^{(2)}\theta^{(3)} + \varepsilon^{(2)} \\
&\;\;\vdots \\
\theta^{(n)} &= \varepsilon^{(n)}
\end{aligned}
\qquad\Longrightarrow\qquad
y^{(1)} = \varepsilon^{(1)} + G^{(1)}\varepsilon^{(2)} + G^{(1)}G^{(2)}\varepsilon^{(3)} + \cdots + G^{(1)}\cdots G^{(n-1)}\theta^{(n)}.
\tag{30}
$$

In hierarchical models, the random terms model uncertainty about the parameters at each level and the Σ(λ)(i) are treated as prior covariance constraints on the θ(i). Hierarchical models of this sort are very common and underlie all classical mixed-effects analyses of variance. ReML identification of simple two-level models like

$$y^{(1)} = G^{(1)}\theta^{(2)} + \varepsilon^{(1)}, \qquad \theta^{(2)} = \varepsilon^{(2)} \tag{31}$$

is a useful way to impose shrinkage priors on the parameters and covers early approaches (e.g., Stein shrinkage estimators) through to recent developments, such as relevance vector machines (e.g., Tipping, 2001). Relevance vector machines represent a Bayesian treatment of support vector machines, in which the second-level covariance Σ(λ)(2) has a component for each parameter. Most of the ReML estimates of these components shrink to zero. This means the columns of G(1) whose parameters have zero mean and variance can be eliminated, providing a new model with sparse support. This is also known as automatic relevance determination (ARD; MacKay, 1995a, MacKay, 1995b) and will be illustrated below.

Estimating these models through their covariances Σ(i) with ReML corresponds to empirical Bayes. This estimation can proceed in one of two ways: First, we can augment the model and treat the random terms as parameters, to give

$$
y = J\theta + \varepsilon
$$
$$
y = \begin{bmatrix} y^{(1)} \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\quad
J = \begin{bmatrix} K^{(2)} & \cdots & K^{(n)} \\ -I & & \\ & \ddots & \\ & & -I \end{bmatrix}
\quad
\varepsilon = \begin{bmatrix} \varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(n)} \end{bmatrix}
\quad
\theta = \begin{bmatrix} \varepsilon^{(2)} \\ \vdots \\ \theta^{(n)} \end{bmatrix}
$$
$$
K^{(i)} = \prod_{j=1}^{i} G^{(j-1)}, \qquad
\Sigma = \begin{bmatrix} \Sigma^{(1)} & & \\ & \ddots & \\ & & \Sigma^{(n)} \end{bmatrix}
\tag{32}
$$

with G(0) = I. This reformulation is a nonhierarchical model with no explicit priors on the parameters. However, the ReML estimates of Σ(λ)(i) are still the empirical prior covariances of the parameters θ(i) at each level. If Σ(i) is known a priori, it simply enters the scheme as a known covariance component. This corresponds to a full Bayesian analysis with known or full priors for the level in question.
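As a worked illustration of Eq. (32) (a hypothetical sketch; the helper names are not SPM routines), the common two-level shrinkage-prior model of Eq. (31) augments as follows, after which the components could be passed to a covariance-component estimator such as the ReML sketch above.

import numpy as np

def augment_two_level(y, G1, Q1, Q2):
    # two-level model:  y = G1 @ theta2 + e1,  theta2 = e2   (Eq. (31))
    # augmented form:   y_aug = J @ theta + e_aug            (Eq. (32))
    # Q1, Q2 : lists of covariance components for e1 (level 1) and e2 (level 2)
    n, p = G1.shape
    y_aug = np.concatenate([y, np.zeros(p)])
    J = np.vstack([G1, -np.eye(p)])
    Q_aug = []
    for Qi in Q1:                          # level-1 components occupy the data block
        C = np.zeros((n + p, n + p))
        C[:n, :n] = Qi
        Q_aug.append(C)
    for Qi in Q2:                          # level-2 components become empirical priors
        C = np.zeros((n + p, n + p))
        C[n:, n:] = Qi
        Q_aug.append(C)
    return y_aug, J, Q_aug

# e.g., with the ReML sketch above:
#   y_aug, J, Q_aug = augment_two_level(y, G1, [np.eye(n)], [np.eye(p)])
#   lam, Sig_lam = reml(np.outer(y_aug, y_aug), J, Q_aug)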

spm_peb.m uses this reformulation and Eq. (27) for estimation. The conditional expectations of the parameters are recovered by recursive substitution of the conditional expectations of the errors into Eq. (30) (cf. Friston, 2002). spm_peb.m uses the computationally efficient substitution

$$-\tfrac{1}{2}\operatorname{tr}\big(P_i R(yy^T - \Sigma)R^T\big) = -\tfrac{1}{2}y^TR^TP_iRy + \tfrac{1}{2}\operatorname{tr}\big(P_iR\Sigma R^T\big) \tag{33}$$

to avoid computing the potentially large matrix yy^T. We have used this scheme extensively in the construction of posterior probability maps or PPMs (Friston and Penny, 2003) and mixed-effect analysis of multi-subject studies in neuroimaging (Friston et al., 2005). Both these examples rest on hierarchical models, using hierarchical structure over voxels and subjects, respectively.

An equivalent identification of hierarchical models rests on an alternative and simpler reformulation of Eq. (30) in which all the hierarchically induced covariance components $K^{(i)}\Sigma^{(i)}K^{(i)T}$ are treated as components of a compound error

$$y = \varepsilon, \qquad y = y^{(1)}, \qquad
\varepsilon = \sum_{i=1}^{n} K^{(i)}\varepsilon^{(i)}, \qquad
\Sigma = \sum_{i=1}^{n} K^{(i)}\Sigma^{(i)}K^{(i)T}. \tag{34}$$

The ensuing ReML estimates of Σ(λ)(i) can be used to compute the conditional density of the parameters in the usual way. For example, the conditional expectation and covariance of the ith-level parameters θ(i) are

$$
\mu_\theta^{(i)} = \Sigma_\theta^{(i)} K^{(i)T}\tilde{\Sigma}^{-1}y, \qquad
\Sigma_\theta^{(i)} = \big(K^{(i)T}\tilde{\Sigma}^{-1}K^{(i)} + \Sigma^{(i)-1}\big)^{-1}, \qquad
\tilde{\Sigma} = \sum_{j\neq i} K^{(j)}\Sigma^{(j)}K^{(j)T},
\tag{35}
$$

where Σ̃ represents the ReML estimate of error covariance, excluding the level of interest. The component Σ(i) = Σ(λ)(i) is treated as an empirical prior on θ(i). spm_ReML.m uses Eq. (27) to estimate the requisite hyperparameters. Critically, it takes as an argument the matrix yy^T. This may seem computationally inefficient. However, there is a special but very common case where dealing with yy^T is more appropriate than dealing with y (cf. the implementation using Eq. (33) in spm_peb.m):

This is when there are r multiple observations that can be arranged as a matrix Y = [y1,…,yr]. If these observations are independent then we can express the covariance components of the vectorised response in terms of Kronecker tensor products

$$y = \operatorname{vec}\{Y\} = \varepsilon, \qquad
\varepsilon = \sum_{i=1}^{n}\big(I\otimes K^{(i)}\big)\varepsilon^{(i)}, \qquad
\operatorname{cov}\{\varepsilon^{(i)}\} = I\otimes\Sigma^{(i)}. \tag{36}$$

This leads to a computationally efficient scheme employed by spm_ReML.m, which uses the compact forms

$$
\begin{aligned}
L_{\lambda_i} &= -\tfrac{1}{2}\operatorname{tr}\Big(\big(I\otimes P_iR\big)\big(yy^T - I\otimes\Sigma\big)\big(I\otimes R^T\big)\Big)
= -\tfrac{r}{2}\operatorname{tr}\Big(P_iR\Big(\tfrac{1}{r}YY^T - \Sigma\Big)R^T\Big) \\
L_{\lambda\lambda ij} &= -\tfrac{1}{2}\operatorname{tr}\big(I\otimes P_iR\Sigma P_jR\Sigma\big)
= -\tfrac{r}{2}\operatorname{tr}\big(P_iR\Sigma P_jR\Sigma\big).
\end{aligned}
\tag{37}
$$

Critically, the update scheme is a function of the sample covariance of the data (1/r)YYT and can be regarded as a covariance component estimation scheme. This can be useful in two situations:

First, the augmented form in Eq. (32) may produce prohibitively long vectors. This can happen when the number of parameters is much greater than the number of responses. This is a common situation in underdetermined problems. An important example is source reconstruction in electroencephalography, where the number of sources is much greater than the number of measurement channels (see Phillips et al., 2005, for an application that uses spm_ReML.m in this context). In these cases one can form conditional estimates of the parameters using the matrix inversion lemma, and again avoid inverting large (p × p) matrices

$$
\mu_\theta^{(i)} = \Sigma^{(i)}K^{(i)T}\tilde{\Sigma}^{-1}Y, \qquad
\Sigma_\theta^{(i)} = \Sigma^{(i)} - \Sigma^{(i)}K^{(i)T}\tilde{\Sigma}^{-1}K^{(i)}\Sigma^{(i)}, \qquad
\tilde{\Sigma} = \sum_{i=1}^{n} K^{(i)}\Sigma^{(i)}K^{(i)T}.
\tag{38}
$$

The second situation is where there are a large number of realisations. In these cases it is much easier to handle the second-order matrices of the data YYT than the data Y itself. An important application here is the estimation of nonsphericity over voxels in the analysis of fMRI time-series (see Friston et al., 2002, for this use of spm_ReML.m). Here, there are many more voxels than scans and it would not be possible to vectorise the data. However, it is easy to collect the sample covariance over voxels and partition it into nonspherical covariance components using ReML.

In the case of sequential correlations among the errors, $\operatorname{cov}\{\varepsilon^{(i)}\} = V\otimes\Sigma^{(i)}$, one simply replaces $YY^T$ with $YV^{-1}Y^T$. Heuristically, this corresponds to sequentially whitening the observations before computing their second-order statistics. We have used this device in the Bayesian inversion of models of evoked and induced responses in EEG/MEG (Friston et al., 2006).
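A minimal sketch of this device (a hypothetical helper; V is assumed known):

import numpy as np

def second_moment(Y, V=None):
    # Y : (n, r) data matrix with one realisation per column
    # returns Y @ Y.T for i.i.d. realisations, or Y @ V^{-1} @ Y.T when the
    # realisations share serial correlations V (sequential whitening)
    if V is None:
        return Y @ Y.T
    return Y @ np.linalg.solve(V, Y.T)

The result, divided by r, plays the role of the sample covariance passed to the covariance-component scheme above.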

In summary, hierarchical models can be identified through ReML estimates of covariance components. If the response vector is relatively small it is generally more expedient to reduce the hierarchical form by augmentation, as in Eq. (32), and use Eq. (33) to compute the gradients. When the augmented form becomes too large, because there are too many parameters, reformulation in terms of covariance components is computationally more efficient because the gradients can be computed from the sample covariance of the data. The latter formulation is also useful when there are multiple realisations of the data because the sample covariance, over realisations, does not change in size. This leads to very fast Bayesian inversion. Both approaches rest on estimating covariance components that are induced by the observation hierarchy. This enforces a hyper-parameterisation of the covariances, as opposed to precisions (see Appendix 1).

This section contains a brief demonstration of model selection using ReML and its adjusted free energy. In these examples, we use the covariance component formulation (spm_ReML.m) as in Eq. (34), noting that exactly the same results would be obtained with augmentation (spm_peb.m). We use a simple hierarchical two-level linear model, implementing shrinkage priors, because this sort of model is common in neuroimaging data analysis and represents the simplest form of empirical Bayes. The model is described in Fig. 2. Briefly, it has eight parameters that cause a 32-variate response. The parameters are drawn from a multivariate Gaussian whose covariance was a mixture of two known components. Data were generated repeatedly (128 samples) using different parameters for each realisation. This model can be regarded as generating fMRI data over 32 scans, each with 128 voxels; or EEG data from 32 channels over 128 time bins. These simulations are provided as a proof of concept and illustrate how one might approach numerical validation in the context of other models.

The free energy can, of course, be used for model selection when models differ in the number and deployment of parameters. This is because both F and Fθ are functions of the number of parameters and their conditional uncertainty. This can be shown by evaluating the free energy as a function of the number of model parameters, for the same data. The results of this sort of evaluation are seen in Fig. 3 and demonstrate that model selection correctly identifies a model with eight parameters. This was the model used to generate the data (Fig. 2). In this example, we used a simple shrinkage prior on all parameters (i.e., Σ(2) = λ(2)I) during the inversions.

The critical issue is whether model selection will work when the models differ in their hyperparameterisation. To address this, we analysed the same data, produced by two covariance components at the second level, with models that comprised an increasing number of second-level covariance components (Fig. 4). These components can be regarded as specifying the form of empirical priors over solution space (e.g., spatial constraints in an EEG source reconstruction problem). The results of these simulations show that the adjusted free energy F correctly identified the model with two components. Conversely, the unadjusted free energy Fθ rose progressively as the number of components and accuracy increased. See Fig. 5.

The lower panel in Fig. 5 shows the hyperparameter estimates for two models. With the correctly selected model the true values fall within the 90% confidence interval. However, when the model is over-parameterised, with eight second-level components, this is not the case. Although the general profile of hyperparameters has been captured, this suboptimum model has clearly overestimated some hyperparameters and underestimated others.

Finally, to establish that the variational approximation to the log-evidence is veridical, we computed ln p(y|m) using a standard Monte Carlo–Markov chain (MCMC) procedure, described in Appendix 2. MCMC schemes are computationally intensive but allow one to sample from the posterior distribution p(ϑ|y) without making any assumptions about its form. These samples can then be used to estimate the marginal likelihood using, in this instance, a harmonic mean (see Appendix 2). The resulting estimates are not biased by the mean-field and Laplace approximations implicit in the variational scheme and can be used to assess the impact of these approximations on model comparison. The sampling estimates of free energy are provided in Fig. 6 (upper panels) for the eight models analysed in Fig. 5. The profile of true [sampling] log-evidences concurs with the free-energy profile in a pleasing way and suggests that the approximations entailed by the variational approach do not lead to inaccurate model selection, under these linear models. Furthermore, the sampled posterior p(λ|y) of the largest model (with eight second-level covariance components) is very similar to the Laplace approximation q(λ), as judged by their first two moments (Fig. 6; lower panels).

Hitherto, we have looked at model selection in terms of categorical comparisons of model evidence. However, the log-evidence is the same as the objective function used to optimise q(ϑi) for each model. Given that model selection and inversion maximise the same objective function, one might ask if inversion of an over-parameterised model finds the optimal model automatically. In other words, does maximising the free energy switch off redundant parameters and hyperparameters by setting their conditional density q(ϑi) to a point mass at zero (i.e., μi → 0 and Σi → 0)? Most fixed-form variational schemes do this; however, this is precluded in classical ReML because the Laplace approximation admits improper (negative) covariance components when these components are small. This means classical schemes cannot switch off redundant covariance components. Fortunately, it is easy to finesse this problem by applying the Laplace assumption to λ = ln α, where α are scale parameters encoding the expression of covariance components. This renders the form of q(α) log-normal and places a positivity constraint on α.

Consider the hyper-parameterisation

$$\Sigma = \sum_i \exp(\lambda_i)\,Q_i \tag{39}$$

with priors $p(\lambda) = N(\eta_\lambda,\Pi_\lambda^{-1})$. This corresponds to a transformation in which scale parameters αi = exp(λi) control the expression of each covariance component Qi. To first order, the conditional variance of each scale parameter is

$$\Sigma_i^\alpha = \frac{\partial\alpha_i}{\partial\lambda_i}\,\Sigma_i^\lambda\,\frac{\partial\alpha_i}{\partial\lambda_i}\bigg|_{\lambda=\mu^\lambda} = \mu_i^\alpha\,\Sigma_i^\lambda\,\mu_i^\alpha. \tag{40}$$

This means that when $\mu_i^\alpha = \exp(\mu_i^\lambda) \to 0$ we get $\Sigma_i^\alpha \to 0$, which is necessary for automatic model selection. In this limit, the conditional covariance of the hyperparameters is given by Eqs. (7) and (18)

$$\Sigma_\lambda = -L_{\lambda\lambda}^{-1} = \Pi_\lambda^{-1} \tag{41}$$

because $P_i = P_{ij} \to 0$ when $\exp(\mu_i^\lambda) \to 0$ (see Appendix 1). In short, by placing log-normal hyperpriors on the αi, we can use a conventional EM or ReML scheme for AMS. In this case, the conditional uncertainty about covariance components shrinks with their expectation, so that we can be certain they do not contribute to the model. It is simple to augment the ReML scheme and free energy to include hyperpriors (cf. Eq. (27))

$$F = F_\theta + \tfrac{1}{2}\ln|\Sigma_\lambda| + \tfrac{1}{2}\ln|\Pi_\lambda| - \tfrac{1}{2}\varepsilon_\lambda^T\Pi_\lambda\varepsilon_\lambda. \tag{42}$$

Note that adding redundant covariance components to the model does not change the free energy, because the entropy associated with conditional uncertainty is offset exactly by the prior uncertainty they induce, due to the equality in Eq. (41). This means that conventional model selection will show that all over-parameterised models are equally optimal. This is intuitive because the inversion has already identified the optimal model. Fig. 7 shows the hyperparameter estimates for the model described in Fig. 5, with eight second-level covariance components. The upper panel shows the results with a classical hyper-parameterisation and the lower panels show the results with log-normal hyperpriors (before and after log-transformation). Note that hyperpriors are necessary to eliminate unnecessary components. In this example (and below) we used relatively flat hyperpriors; p(λ) = N(−16, 32).
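The log-scale hyper-parameterisation and hyperpriors above amount to a small modification of the ReML sketch given earlier. The following version (again hypothetical and illustrative, not spm code) works with λ = ln α, adds the hyperprior terms to the gradient and expected curvature, and uses p(λ) = N(−16, 32) as in the text, so that redundant components are switched off rather than estimated as small negative quantities.

import numpy as np

def reml_ams(YY, G, Q, r=1, eta=-16.0, h_var=32.0, n_iter=64, tol=1e-6):
    # ReML with Sigma = sum_i exp(lam_i) * Q[i] and hyperpriors p(lam_i) = N(eta, h_var)
    n, m = G.shape[0], len(Q)
    lam = np.zeros(m)
    hP = np.eye(m) / h_var                                  # hyperprior precision Pi_lambda
    for _ in range(n_iter):
        a = np.exp(lam)                                     # scale parameters alpha_i
        S = sum(ai * Qi for ai, Qi in zip(a, Q))
        iS = np.linalg.inv(S)
        R = np.eye(n) - G @ np.linalg.solve(G.T @ iS @ G, G.T @ iS)
        P = [-iS @ (ai * Qi) @ iS for ai, Qi in zip(a, Q)]  # dSigma^{-1}/dlam_i
        g = (np.array([-0.5 * r * np.trace(Pi @ R @ (YY / r - S) @ R.T) for Pi in P])
             - hP @ (lam - eta))                            # gradient, with hyperprior term
        H = (np.array([[-0.5 * r * np.trace(Pi @ R @ S @ Pj @ R @ S) for Pj in P]
                       for Pi in P]) - hP)                  # expected curvature, minus Pi_lambda
        dl = np.clip(-np.linalg.solve(H, g), -4.0, 4.0)     # crude step limit for stability
        lam = lam + dl
        if np.abs(dl).max() < tol:
            break
    return np.exp(lam), np.linalg.inv(-H)                   # alpha_i and Sigma_lambda

Components that are not needed end up with αi = exp(λi) close to exp(−16) ≈ 0 and, by Eq. (41), a conditional covariance that collapses to the hyperprior variance, so the adjusted free energy in Eq. (42) is unchanged by adding them.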

When automatic model selection is used to eliminate redundant parameters, it is known as automatic relevance determination (ARD). In ARD one defines an empirical prior on the parameters that embodies the notion of uncertain relevance. This enables the inversion to infer which parameters are relevant and which are not (MacKay, 1995a, MacKay, 1995b). This entails giving each parameter its own shrinkage prior and estimating an associated scale parameter. In the context of linear models, this is implemented simply by adding a level to induce empirical shrinkage priors on the parameters. ReML (with hyperpriors) can then be used to switch off redundant parameters by eliminating their covariance components at the first level. We provide an illustration of this in Fig. 8, using the models reported in Fig. 3. The lower panel shows the conditional estimates of the parameters using Eq. (35) for the over-parameterised model with 16 parameters. Note that ARD with ReML correctly shrinks and switches off the last eight redundant parameters and has implicitly performed AMS. The upper panel shows that the free energy of all models, with an increasing number of parameters, reaches a maximum at the correct parameterisation and stays there even when redundant parameters (i.e., components) are added.


Discussion

We have seen that restricted maximum likelihood is a special case of expectation maximisation and that expectation maximisation is a special case of variational Bayes. In fact, nearly every routine used in neuroimaging analysis (certainly in SPM5; http://www.fil.ion.ucl.ac.uk/spm) is a special case of variational Bayes, from ordinary least squares estimation to dynamic causal modelling. We have focussed on adjusting the objective functions used by EM and ReML to approximate the variational free energy.

Acknowledgments

The Wellcome Trust and British Council funded this work.

References (44)

  • C. Phillips et al., Systematic regularisation of linear inverse solutions of the EEG source localisation problem, NeuroImage (2002)
  • C. Phillips et al., An empirical Bayesian solution to the source reconstruction problem in EEG, NeuroImage (2005)
  • N. Trujillo-Barreto et al., Bayesian model averaging, NeuroImage (2004)
  • K.Z. Adami, Variational Methods in Bayesian Deconvolution (2003)
  • M.J. Beal, Variational algorithms for approximate Bayesian inference, PhD thesis: ... (1998)
  • M.J. Beal et al., The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures
  • J.O. Berger, Statistical Decision Theory and Bayesian Analysis (1985)
  • C. Bishop, Latent variable models
  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc., Ser. B (1977)
  • B. Efron et al., Stein's estimation rule and its competitors—An empirical Bayes approach, J. Am. Stat. Assoc. (1973)
  • L. Fahrmeir et al.
  • R.P. Feynman, Statistical Mechanics (1972)