NeuroImage

Volume 42, Issue 1, 1 August 2008, Pages 196-206

Technical Note
Within-subject variation in BOLD-fMRI signal changes across repeated measurements: Quantification and implications for sample size

https://doi.org/10.1016/j.neuroimage.2008.04.183

Abstract

Functional magnetic resonance imaging (fMRI) can be used to detect experimental effects on brain activity across measurements. The success of such studies depends on the size of the experimental effect, the reliability of the measurements, and the number of subjects. Here, we report on the stability of fMRI measurements and provide the sample size estimates needed for repeated measurement studies. Stability was quantified in terms of the within-subject standard deviation (σw) of BOLD signal changes across measurements. In contrast to correlation measures of stability, this statistic does not depend on the between-subjects variance in the sampled group. Sample sizes required for repeated measurements of the same subjects were calculated using this σw. Ten healthy subjects performed a motor task on three occasions, separated by one week, while being scanned. To rule out training effects on fMRI stability, all subjects were extensively trained on the task. Task performance, spatial activation pattern, and group-wise BOLD signal changes were highly stable over sessions. In contrast, we found substantial fluctuations (up to half the size of the group mean activation level) in individual activation levels, both in ROIs and in voxels. Given this large degree of instability over sessions, and the fact that the amount of within-subject variation plays a crucial role in determining the success of an fMRI study with repeated measurements, improving stability is essential. To guide future studies, sample sizes are provided for a range of experimental effects and levels of stability. Obtaining estimates of these latter two variables is essential for selecting an appropriate number of subjects.

Introduction

The effect of an intervention, for example pharmacological treatment or repetitive transcranial magnetic stimulation (rTMS), can be investigated with repeated measurements on the same subjects. By administering experimental and control treatments in random order to the same group of subjects, the mean difference between treatment conditions can be calculated and tested for statistical significance. Recently, this type of study (i.e. a crossover design) has been applied to functional MRI (fMRI). For instance, fMRI signal changes were observed in the motor cortex of patients recovering from stroke after treatment with fluoxetine (Pariente et al., 2001), in the amygdala following oxytocin administration (Kirsch et al., 2005), and in the prefrontal cortex in response to a catecholamine-O-methyltransferase inhibitor (Apud et al., 2007). The success of such a design depends on statistical power, which in turn depends on (a) the difference between experimental and control treatment, (b) measurement error, and (c) sample size. For single-session fMRI studies, the effect of measurement error on statistical power and sample size has been determined (Desmond and Glover, 2002). These findings may not be valid for fMRI studies with multiple sessions, however, as factors that are stable within a session (e.g. subject position in the scanner) can differ between sessions (Genovese et al., 1997). To obtain an estimate of this between-session measurement error, a test–retest reliability analysis should be performed, measuring the same variable in the same sample of subjects in the absence of any between-measurement experimental manipulation.

A number of studies have investigated the test–retest reliability of fMRI, reporting reliability ranging from almost perfect (Aron et al., 2006, Fernandez et al., 2003, Specht et al., 2003) to at best moderate (Raemaekers et al., 2007, Wei et al., 2004). The majority of studies expressed test–retest reliability of fMRI signal changes in terms of correlation coefficients, such as the Pearson product–moment correlation coefficient (Pearson's r) and various types of the intraclass correlation coefficient (ICC). Pearson's r is often used to assess the stability of activation in a voxel relative to other voxels (Fernandez et al., 2003, Specht et al., 2003, Tegeler et al., 1999), while the ICC is used to express how consistent activation is in subjects over sessions relative to the rest of the group (Aron et al., 2006, Fernandez et al., 2003, Manoach et al., 2001, Raemaekers et al., 2007, Specht et al., 2003, Wei et al., 2004).

Such correlation analyses, especially Pearson's r, may not always be suitable for assessing fMRI test–retest reliability. Pearson's r is insensitive to systematic differences between measurements: the correlation coefficient can be high even when large changes have occurred across measurements (Bland and Altman, 1986), as long as those changes are consistent across the sample (whether subjects or voxels). Importantly, both Pearson's r and the ICC are sensitive to the spread of values in the sample. For example, exactly the same within-subject variation over sessions yields a low correlation coefficient in a homogeneous group but a high one in a heterogeneous group (Bland and Altman, 1990). Indeed, Fernandez et al. (2003) reported a highly heterogeneous sample, which could have contributed to the high correlation coefficients they present. Another study with high correlation coefficients (Aron et al., 2006) also showed a large spread of values between subjects, both in performance data and in BOLD signal changes. It is therefore difficult to make inferences based on these results, as other samples will have a different degree of heterogeneity, even if they have similar within-subject variation. Finally, there is no consensus on what type of ICC should be used to determine fMRI test–retest reliability (McGraw and Wong, 1996, Shrout and Fleiss, 1979). Indeed, studies that have used the two-way random-effects ICC (Raemaekers et al., 2007, Wei et al., 2004) report lower reliability coefficients than those using the one-way random-effects model (Aron et al., 2006, Fernandez et al., 2003, Specht et al., 2003). The main reason for this difference may be that the one-way random-effects ICC assumes systematic variance to originate from subjects only, while the two-way random-effects ICC allows systematic variance to arise from sessions as well as subjects (McGraw and Wong, 1996).
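To make the difference between the two models concrete, the sketch below computes both coefficients from the same subjects × sessions matrix, following the mean-square formulas in McGraw and Wong (1996). This is an illustrative numpy implementation, not code from any of the studies cited, and the data are hypothetical percent signal changes.

```python
import numpy as np

def icc_one_way(Y):
    """ICC(1): one-way random-effects model; session effects count as error.
    Y: (n_subjects, k_sessions) array of activation estimates."""
    n, k = Y.shape
    grand = Y.mean()
    subj_means = Y.mean(axis=1)
    msb = k * np.sum((subj_means - grand) ** 2) / (n - 1)          # between subjects
    msw = np.sum((Y - subj_means[:, None]) ** 2) / (n * (k - 1))   # within subjects
    return (msb - msw) / (msb + (k - 1) * msw)

def icc_two_way(Y):
    """ICC(2,1): two-way random-effects model, absolute agreement;
    both subjects and sessions are modelled as random effects."""
    n, k = Y.shape
    grand = Y.mean()
    subj_means = Y.mean(axis=1)
    sess_means = Y.mean(axis=0)
    msr = k * np.sum((subj_means - grand) ** 2) / (n - 1)   # subjects (rows)
    msc = n * np.sum((sess_means - grand) ** 2) / (k - 1)   # sessions (columns)
    resid = Y - subj_means[:, None] - sess_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical percent-signal-change data, 5 subjects x 3 sessions, with a
# systematic session effect: the one-way model lumps it into error, while
# the two-way model separates it out.
Y = np.array([[1.2, 1.5, 1.1],
              [0.8, 1.0, 0.7],
              [1.6, 1.9, 1.5],
              [0.5, 0.8, 0.4],
              [1.1, 1.3, 1.0]])
print(icc_one_way(Y), icc_two_way(Y))
```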

One way of minimizing the impact of sample heterogeneity, and thus of making results comparable and generalizable, is to express fMRI test–retest reliability solely in terms of the within-subject variation of activation across sessions (Marshall et al., 2004, Tjandra et al., 2005). Measures of within-subject variation include the within-subject standard deviation (σw, expressed in the same units as the measurement) and the within-subject coefficient of variation (CVw, expressed as a percentage of the mean measurement). In addition, combining this within-subject variation with an estimate of the difference between the experimental and control condition allows estimation of the sample size needed to detect an intervention effect.
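As a concrete illustration of these measures, σw can be computed as the square root of the mean within-subject variance over sessions (equivalently, the residual mean square of a one-way ANOVA with subject as the grouping factor), and CVw as σw expressed as a percentage of the grand mean. The minimal sketch below follows that standard definition; the array layout and function names are our own, not the study's code.

```python
import numpy as np

def within_subject_sd(Y):
    """sigma_w: square root of the mean within-subject variance over
    repeated sessions (the residual mean square of a one-way ANOVA
    with subject as the grouping factor).
    Y: (n_subjects, k_sessions) array, e.g. percent BOLD signal change."""
    return np.sqrt(Y.var(axis=1, ddof=1).mean())

def within_subject_cv(Y):
    """CV_w: sigma_w expressed as a percentage of the grand mean."""
    return 100.0 * within_subject_sd(Y) / Y.mean()
```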

The goal of this study is two-fold. First, within-subject variation in fMRI signal changes is quantified. For this, ten subjects are scanned on three occasions (one week apart) while performing a motor task (Vink et al., 2005, Vink et al., 2006). Reliability is calculated in terms of the within-subject standard deviation (σw), not only for brain activation but also for behavioural measures, to assess the potential contribution of training effects on the task. Second, sample sizes are estimated for fMRI studies with repeated measurements on the same subjects, based on the σw of fMRI signal changes and estimates of the difference between experimental and control conditions.
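For the second goal, a textbook normal-approximation formula shows how σw and the expected condition difference δ translate into a sample size: assuming the paired difference between conditions has standard deviation σw·√2, a two-sided paired test requires approximately n = ((zα/2 + zβ)·σw·√2/δ)² subjects, where zα/2 and zβ are standard-normal quantiles for the significance level and power. The sketch below implements this approximation as an illustration; it is not necessarily the exact calculation used in the paper.

```python
from math import ceil, sqrt
from scipy.stats import norm

def crossover_sample_size(delta, sigma_w, alpha=0.05, power=0.80):
    """Approximate number of subjects needed to detect a mean difference
    `delta` between experimental and control sessions, given a
    within-subject SD `sigma_w` across sessions. Uses the
    normal-approximation formula for a two-sided paired test, with the
    paired difference assumed to have SD sigma_w * sqrt(2)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)            # desired statistical power
    sd_diff = sigma_w * sqrt(2)
    return ceil(((z_alpha + z_beta) * sd_diff / delta) ** 2)

# E.g. detecting a 0.3% signal-change difference with sigma_w = 0.5%
# at alpha = .05 and 80% power requires about 44 subjects.
print(crossover_sample_size(delta=0.3, sigma_w=0.5))
```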

Section snippets

Overview of the method

The present study proposes a method for estimating the stability of fMRI activation levels across measurements. Such an analysis should be performed prior to an intervention study with a crossover design. Stability is expressed in terms of the within-subject standard deviation across measurements (i.e. σw). Our method consists of three steps.

First, stability of task performance is assessed, involving analysis of response data (e.g. reaction times) using σw and group-level spatial …

Reliability of task performance

To assess the potentially confounding effects of practice or strategy shifts on the test–retest reliability analysis of the fMRI data, we first analysed the behavioural data (Fig. 1). The group mean reaction times to GO and STOP trials did not differ significantly across sessions (main effect of session, F(2,8) < 1, p = .45; session × trial type interaction, F(2,8) = 3.26, p = .09), and the within-subject variation in reaction times was small. Correspondingly, we observed a highly reproducible …
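The repeated-measures test reported above can be illustrated with a short sketch. The following is a hypothetical reconstruction using simulated reaction times and the statsmodels AnovaRM class; it is not the study's actual analysis code or data, and the column names are invented.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format reaction-time data (ms): 10 subjects x
# 3 sessions x 2 trial types, one observation per cell; values are
# simulated, with STOP trials made slower than GO trials.
rng = np.random.default_rng(0)
rows = [(s, sess, tt, 400 + 50 * (tt == "STOP") + rng.normal(0, 20))
        for s in range(10) for sess in (1, 2, 3) for tt in ("GO", "STOP")]
df = pd.DataFrame(rows, columns=["subject", "session", "trial_type", "rt"])

# Repeated-measures ANOVA: main effect of session and the
# session x trial type interaction, as tested in the paper.
print(AnovaRM(df, depvar="rt", subject="subject",
              within=["session", "trial_type"]).fit())
```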

Reliability of fMRI

In this study, we investigated the test–retest reliability of fMRI by calculating the within-subject variation in fMRI signal changes across measurements. Despite stable task performance (Fig. 1) and broadly overlapping spatial activation patterns (Fig. 2), we found considerable within-subject variation in fMRI signal changes in task-related regions of interest between sessions (Fig. 3, Fig. 4). The within-subject variation over three sessions was up to about fifty percent of the mean ROI …

Conclusion

This study reports on the stability of fMRI in terms of within-subject variation. The major advantage of this approach is that, in contrast to correlation analyses, sample heterogeneity is left out of the equation. Our results do not depend on sample composition, making them more generalizable. Together with the effect of intervention, the amount of within-subject variation will determine the success of an fMRI study with repeated measurements. Optimizing these variables reduces the need for …

References (49)

  • Apud, J.A., et al. Tolcapone improves cognition and cortical information processing in normal human subjects. Neuropsychopharmacology (2007)
  • Bland, J.M., et al. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet (1986)
  • Cohen, M.S., et al. Stability, repeatability, and the expression of signal magnitude in functional magnetic resonance imaging. J. Magn. Reson. Imaging (1999)
  • Fernandez, G., et al. Intrasubject reproducibility of presurgical language lateralization and mapping using fMRI. Neurology (2003)
  • Poldrack, R.A. Imaging brain plasticity: conceptual and methodological issues—a theoretical review. Neuroimage (2000)
  • Raemaekers, M., et al. Test–retest reliability of fMRI activation during prosaccades and antisaccades. Neuroimage (2007)
  • Raz, A., et al. Ecological nuances in functional magnetic resonance imaging (fMRI): psychological stressors, posture, and hydrostatics. Neuroimage (2005)
  • Seifritz, E., et al. Effect of ethanol on BOLD response to acoustic stimulation: implications for neuropharmacological fMRI. Psychiatry Res. (2000)
  • Tjandra, T., et al. Quantitative assessment of the reproducibility of functional activation measured with BOLD and MR perfusion imaging: implications for clinical trial design. Neuroimage (2005)
  • Tzourio-Mazoyer, N., et al. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. Neuroimage (2002)
  • Vink, M., et al. Striatal dysfunction in schizophrenia and unaffected relatives. Biol. Psychiatry (2006)
  • Wei, X., et al. Functional MRI of auditory verbal working memory: long-term reproducibility analysis. Neuroimage (2004)
  • Wise, R.G., et al. Resting fluctuations in arterial carbon dioxide induce significant low frequency variations in BOLD signal. Neuroimage (2004)
  • Worsley, K.J., et al. A general statistical analysis for fMRI data. Neuroimage (2002)