NeuroImage

Volume 65, 15 January 2013, Pages 69-82

Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control

https://doi.org/10.1016/j.neuroimage.2012.09.063

Abstract

An ever-increasing number of functional magnetic resonance imaging (fMRI) studies are now using information-based multi-voxel pattern analysis (MVPA) techniques to decode mental states. In doing so, they achieve significantly greater sensitivity than univariate frameworks. However, the new brain-decoding methods have also posed new challenges for analysis and statistical inference on the group level. We discuss why the usual procedure of performing t-tests on accuracy maps across subjects in order to produce a group statistic is inappropriate. We propose a solution to this problem for local MVPA approaches, which achieves higher sensitivity than other procedures. Our method uses random permutation tests on the single-subject level, and then combines the results on the group level with a bootstrap method. To preserve the spatial dependency induced by local MVPA methods, we generate a random permutation set and keep it fixed across all locations. This enables us to later apply a cluster size control for the multiple testing problem. More specifically, we explicitly compute the distribution of cluster sizes and use this to determine the p-values for each cluster. Using a volumetric searchlight decoding procedure, we demonstrate the validity and sensitivity of our approach on both simulated and real fMRI data sets. In comparison to the standard t-test procedure implemented in SPM8, our results showed higher sensitivity. We discuss the theoretical applicability and the practical advantages of our approach, and outline its generalization to other local MVPA methods, such as surface decoding techniques.

Highlights

► New group statistics for classification-based brain decoding
► Based on permutation tests and bootstrap methods
► Higher statistical sensitivity than t-test frameworks
► Higher spatial specificity than t-test frameworks

Introduction

In this paper, we present a non-parametric approach for dealing with group-level analysis in local multi-voxel pattern analysis (MVPA) methods based on classification.

Local MVPA, or information-based brain mapping, aims to assess the task-related neural information at every location in the brain by analyzing the signal patterns extracted from a spatial neighborhood (searchlight) centered at the location (Chen et al., 2011, Kriegeskorte et al., 2006). A convenient way to estimate the information contained in these regions is classification-based MVPA. This involves training a classifier on a subset of the data and predicting the class labels of another, yet unseen subset of the data; thereby, the generalizability of the classifier is assessed (Kriegeskorte, 2011). The average percentage of correctly predicted labels, known as the decoding accuracy, is taken as an indicator of the information content of the searchlight volume. Customarily, the accuracy is mapped to the central voxel of the searchlight. Repeating this procedure for all searchlight locations in the brain mask results in a three-dimensional accuracy map, which reflects the spatial distribution of information decodable from the functional brain images.
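The searchlight procedure described above can be sketched in a few lines. The following is a minimal toy implementation, not the authors' code; the function name, the use of scikit-learn's linear SVM, and the 4-fold cross-validation are our own illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def searchlight_accuracy_map(data, labels, mask, radius=2):
    """Toy volumetric searchlight.

    data: (n_samples, x, y, z) array of brain images
    mask: boolean (x, y, z) array marking brain voxels / searchlight centers
    """
    dims = mask.shape
    grid = np.indices(dims)
    acc_map = np.zeros(dims)
    for cx, cy, cz in np.argwhere(mask):
        # voxels within a sphere centered at (cx, cy, cz), inside the mask
        dist2 = ((grid[0] - cx) ** 2 + (grid[1] - cy) ** 2
                 + (grid[2] - cz) ** 2)
        sphere = (dist2 <= radius ** 2) & mask
        patterns = data[:, sphere]              # (n_samples, n_sphere_voxels)
        # cross-validated decoding accuracy of a linear classifier
        scores = cross_val_score(SVC(kernel="linear"), patterns, labels, cv=4)
        acc_map[cx, cy, cz] = scores.mean()     # map accuracy to center voxel
    return acc_map
```

In practice one would use an optimized implementation (e.g., precomputed sphere offsets), but the logic above is the whole procedure: one cross-validated accuracy per center voxel.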

In the context of neuroscientific studies, the decoding accuracies themselves are usually not of primary interest. Instead, the statistical significance of the decoding accuracy on the group level is of relevance. For this, it is common practice (e.g., Bode and Haynes, 2009, Carlin et al., 2012, Kahnt et al., 2010) to estimate a group-level statistic by performing a voxel-wise t-test against the theoretical chance level (e.g., an accuracy of 0.5 in a two class paradigm) using the accuracy maps of all subjects. Finally, the multiple testing problem is corrected at the cluster level with family-wise error (FWE) or false-discovery rate (FDR) methods.
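The commonly practiced voxel-wise t-test that the paper goes on to criticize can be sketched as follows; this is a hypothetical example using SciPy, with randomly simulated accuracy maps standing in for real subject data:

```python
import numpy as np
from scipy import stats

# Simulated stand-in for real data: one decoding accuracy per voxel and
# subject, shape (n_subjects, n_voxels); chance level is 0.5 (two classes)
rng = np.random.default_rng(0)
acc_maps = rng.normal(loc=0.5, scale=0.05, size=(12, 1000))

# Voxel-wise one-sample t-test against the theoretical chance level
t_vals, p_vals = stats.ttest_1samp(acc_maps, popmean=0.5, axis=0)
# One-sided p-values: only above-chance accuracy is of interest
p_one_sided = np.where(t_vals > 0, p_vals / 2, 1 - p_vals / 2)
```

A multiple testing correction (FWE or FDR) would then be applied to these voxel-wise p-values, typically at the cluster level.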

This commonly practiced group statistic procedure is, however, questionable for various reasons; particularly, the low number of observations and the non-Gaussianity of the probability distribution of accuracy. As a consequence, several assumptions of the t-statistic are not met, rendering the procedure invalid from a theoretical point of view. We will demonstrate that t-test procedures (for decoding studies) are problematic not only from a strict theoretical perspective, but also from practical considerations: using simulations, we will show that t-test procedures which implement cluster control with FDR exhibit exceedingly high false-positive rates. Alternatively, the evaluation of a classifier's performance can be modeled as Bernoulli trials (Pereira and Botvinick, 2011), which leads to a binomial test for the subsequent statistical inference. However, the dependency between the cross-validation folds causes a problem here, because the aggregated performance over all the cross-validation folds is modeled as the performance of a single classifier. This approximation is only valid if the underlying binomial variables from each cross-validation fold are assumed to be independent. Because the cross-validation procedure introduces a correlation between the binomial variables, the assumption of independence is violated. Moreover, it is unclear to what extent the deviation of the accuracy distribution from the Gaussian and binomial distributions affects the group-level statistical test.
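The binomial alternative amounts to pooling all test trials across cross-validation folds and treating them as one Bernoulli experiment. A sketch using SciPy, with made-up trial counts for illustration:

```python
from scipy import stats

# e.g., 48 of 80 pooled test trials classified correctly, chance = 0.5.
# Pooling across folds models all trials as independent Bernoulli draws,
# which is exactly the independence assumption questioned in the text.
result = stats.binomtest(k=48, n=80, p=0.5, alternative="greater")
print(result.pvalue)
```

Because the folds share training data, the trials are not truly independent, so the resulting p-value tends to be miscalibrated.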

To overcome these potential pitfalls, and as a non-parametric alternative, permutation tests can be used to assess statistical significance. Permutation tests for fMRI analysis were pioneered more than 15 years ago (Arndt et al., 1996, Holmes et al., 1996). An excellent primer on the topic is found in Nichols and Holmes (2002). Permutation tests rely on minimal assumptions (Good, 2006) and their use in the context of classification has been verified theoretically (Golland and Fischl, 2003). These tests have now also been adapted to fMRI studies using classification-based MVPA (Chen et al., 2011, Pereira and Botvinick, 2011) and MVPA toolboxes (Hanke et al., 2009).

The idea behind permutation tests for classification is to estimate the dependency between class labels and observations. In a general sense, the null hypothesis for permutation tests is defined as independence between class labels and observations. To approximate the probability distribution of accuracy under this null hypothesis empirically, a large number of permutations are applied to the class labels and the corresponding accuracies are estimated. The probability or significance level for rejecting the null hypothesis is then evaluated by comparing the original accuracy against the accumulated empirical distribution.
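A single-subject permutation test of this kind might look as follows; this is a minimal sketch using scikit-learn, where the function name, classifier choice, and fold count are our own assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def permutation_null(patterns, labels, n_perms=100, seed=0):
    """Empirical null distribution of accuracy under shuffled labels."""
    rng = np.random.default_rng(seed)
    clf = SVC(kernel="linear")
    null_accs = np.empty(n_perms)
    for i in range(n_perms):
        shuffled = rng.permutation(labels)  # break label-observation link
        null_accs[i] = cross_val_score(clf, patterns, shuffled, cv=4).mean()
    observed = cross_val_score(clf, patterns, labels, cv=4).mean()
    # p-value: fraction of permutations at least as accurate as observed
    # (+1 in numerator and denominator so that p is never exactly zero)
    p = (np.sum(null_accs >= observed) + 1) / (n_perms + 1)
    return observed, null_accs, p
```

Note that permuting the labels preserves the class counts, so the same cross-validation scheme can be reused for every permutation.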

The practice of permutation tests in the context of group-level fMRI decoding studies, however, is not straightforward. On the one hand, the permutation test applied on the single-subject level produces the significance of labeling for each subject, while the assessment of statistical significance is more desirable on the group level. On the other hand, the number of (independent) observations in fMRI studies is small, which greatly limits the number of available permutations and thus the precision of the statistical significance evaluation. To account for this, we combine permutation tests on the single-subject level with a bootstrapping procedure on the group level: instead of evaluating the significance of labeling on the single-subject level, we first generate the empirical null distribution subject-wise. Next, a bootstrapping procedure is used to build an empirical null distribution of the mean accuracy across subjects, allowing voxel-wise estimation of significance. This procedure allows us to overcome the limitations imposed both by the potentially small number of available permutations and by the available computational resources.
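At a single voxel, the combination of subject-wise permutation nulls into a group null can be sketched roughly as described; a minimal illustration assuming each subject's permutation accuracies for that voxel are stored as one row of an array (function names are ours):

```python
import numpy as np

def group_null_by_bootstrap(null_accs, n_boot=10000, seed=0):
    """null_accs: (n_subjects, n_perms) permutation accuracies at one voxel.

    Each bootstrap sample draws one permutation accuracy per subject and
    averages across subjects, yielding a null group-mean accuracy.
    """
    rng = np.random.default_rng(seed)
    n_subj, n_perms = null_accs.shape
    idx = rng.integers(0, n_perms, size=(n_boot, n_subj))
    return null_accs[np.arange(n_subj), idx].mean(axis=1)

def group_p(observed_mean, group_null):
    """Voxel-wise p-value of the observed group-mean accuracy."""
    return (np.sum(group_null >= observed_mean) + 1) / (len(group_null) + 1)
```

Drawing with replacement decouples the number of bootstrap samples from the (small) number of permutations actually computed per subject.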

The voxel-wise null hypothesis for our proposed method is defined as follows: the mean group accuracy follows an empirical group null distribution of accuracy values (which is derived by permutation and bootstrapping methods). In other words, the null hypothesis for our method states that there is no class information present at the group level and, hence, the classifiers behave randomly. In contrast, the (voxel-wise) null hypothesis for t-based methods is defined such that the sample of single-subject accuracies stems from a normal distribution centered at an accuracy value of 0.5.

In whole brain analysis approaches, voxel-wise statistics are often not of ultimate interest, because the statistical testing procedures are applied many times (about 50,000 locations for whole brain data at 3 T and 500,000 at 7 T). Hence, both t-based methods and permutation test procedures are subject to the multiple testing problem. Setting an arbitrary threshold on the statistical significance level is questionable: due to the sheer number of statistical tests, a large number of false positives may arise if low thresholds are applied, while high thresholds may obscure true effects. On the other hand, the statistical tests at proximate spatial locations are known to be interdependent, mainly for two reasons: first, the BOLD effects of interest are spatially widespread over several voxels (see Chumbley and Friston, 2009); second, local MVPA approaches introduce spatial correlations through the analysis procedure itself. For instance, the voxels extracted by spherical searchlights largely overlap at adjacent locations.

To address this multiple testing problem, we followed Nichols and Holmes (2002) and used cluster size inference on the group level. The basic idea of cluster size inference is to exploit the fact that the probability of two voxels exceeding a given voxel threshold and simultaneously being contiguous is smaller than the chance of a single voxel surpassing that threshold (Forman et al., 1995). In this approach, the fundamental units of interest are therefore regions and not voxels (Heller et al., 2006). Furthermore, cluster-based approaches have been demonstrated to be statistically more powerful than voxel-based tests (Hayasaka and Nichols, 2003). It is important to emphasize that a cluster size inference applicable to local MVPA is required to account not only for the spatial correlations due to the BOLD effect, but also for spatial correlations from the analysis procedure (e.g., searchlight). Furthermore, it should be mentioned that the latter source of correlation strongly depends on the location and the local information content.

Hence, we ultimately consider cluster-wise statistics instead of voxel-wise statistics; we are interested in how likely it is to observe a cluster of voxels which surpass a certain voxel-wise threshold. In commonly practiced t-based methods, the probability of the occurrence of a cluster (voxels being contiguous and surpassing a t-based voxel-wise threshold) is usually derived by random field methods in combination with FWE or FDR corrections. Most critically, the underlying smoothness of the accuracy maps has to be estimated for this. In our non-parametric approach, we empirically construct a cluster size distribution and apply FDR corrections on the cluster level. Our procedure implicitly accounts for the smoothness of the accuracy maps and renders its explicit estimation obsolete.
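An empirical cluster-size null distribution of this general kind can be sketched with SciPy's connected-component labeling; the function name is our own, and a cluster-level FDR correction (e.g., Benjamini-Hochberg over the resulting p-values) would follow as a separate step:

```python
import numpy as np
from scipy import ndimage

def cluster_size_pvalues(obs_map, null_maps, voxel_thresh):
    """obs_map: observed 3-D statistic map; null_maps: iterable of
    permuted/bootstrapped group maps of the same shape.

    Builds an empirical null distribution of cluster sizes and assigns
    each observed suprathreshold cluster a p-value.
    """
    null_sizes = []
    for nmap in null_maps:
        labeled, n = ndimage.label(nmap > voxel_thresh)
        if n:
            # sizes of all suprathreshold clusters in this null map
            null_sizes.extend(np.bincount(labeled.ravel())[1:])
        else:
            null_sizes.append(0)  # no cluster at all in this null map
    null_sizes = np.asarray(null_sizes)

    labeled, n = ndimage.label(obs_map > voxel_thresh)
    sizes = np.bincount(labeled.ravel())[1:]
    # p-value per observed cluster: fraction of null clusters at least as
    # large (+1 correction so p is never exactly zero)
    pvals = np.array([(np.sum(null_sizes >= s) + 1) / (len(null_sizes) + 1)
                      for s in sizes])
    return sizes, pvals
```

Because the null maps are built from the same (searchlight-induced) spatially correlated data, the smoothness of the accuracy maps is reflected in the null cluster sizes without being estimated explicitly.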

To conclude, our study aims to provide a framework for non-parametric inference for classification-based decoding on a group level, which controls for multiple testing. Using a volumetric searchlight technique, we demonstrate the benefits of our method:

  • We will show the sensitivity of our approach by using a simulation where the information content's size and distribution are known. This allows us to quantify and thus compare the detection rate.

  • We will demonstrate the validity of our method using a large number of null simulations (i.e., simulations where the null hypothesis actually holds). Every significant cluster found here can, thus, be ascribed to false positives.

  • We will display the applicability and sensitivity of our method using a real fMRI data set.

For all data sets, we compare our method to the common practice of conducting t-tests with multiple testing corrections.

Section snippets

Materials and methods

In this section, we describe the methods used in this paper. The methods section is divided into two parts: the first covers the generation of the raw data (two simulation approaches and one fMRI experiment); the second describes the statistical analysis of the raw data.

Results

We applied our method to multiple simulated data sets and a real fMRI data set. In the first simulation, we were able to precisely define informative regions, and assess their information content more accurately, compared to standard procedures. The second simulation allowed us to validate our method using multiple null data sets with no information content. In both the simulations and the real data set, we compared the results obtained by our method with the commonly practiced t-test based procedure.

Discussion

We present a group analysis method tailored for decoding studies based on local MVPA (such as the searchlight approach). Our method incorporates non-parametric statistics and provides a solution for the multiple testing problem based on cluster size thresholding. In the following sections, we lay out our concerns about t-based frameworks and discuss our proposed method.
