Abstract
Efficient perception in natural environments depends on neural interactions between voluntary processes within cognitive control, such as attention, and those that are automatic and subconscious, such as brain adaptation to predictable input (also called repetition suppression). Although both attention and adaptation have been studied separately and there is considerable knowledge of the neurobiology involved in each of these processes, how attention interacts with adaptation remains equivocal. We examined how attention interacts with visual and auditory adaptation by measuring neuroimaging effects consistent with changes in either neural gain or selectivity. Male and female human participants were first scanned with functional magnetic resonance imaging (fMRI) while they discriminated repetition of morphed faces or voices and directed their attention to either stimulus identity or spatial location. Attention to face or voice identity, while ignoring stimulus location, solely increased the gain of, respectively, face- or voice-sensitive cortex. The results were strikingly different in a second experiment in which participants attended to voice identity versus stimulus loudness. In this case, attention to voice identity while ignoring sound loudness increased neural selectivity. The combined results show that how attention affects adaptation depends on the level of feature-based competition, reconciling prior conflicting observations. The findings are theoretically important and are discussed in relation to neurobiological interactions between attention and different types of predictive signals.
SIGNIFICANCE STATEMENT Adaptation to repeated environmental events is ubiquitous in the animal brain and constitutes an automatic, typically subconscious, predictive signal. Cognitive influences, such as attention, powerfully affect sensory processing and can overcome brain adaptation. However, how neural interactions occur between adaptation and attention remains controversial. We conducted fMRI experiments regulating the focus of attention during adaptation to repeated stimuli with perceptually balanced stimulus expectancy. We observed an interaction between attention and adaptation consistent with increased neural selectivity, but only under conditions of feature-based competition, challenging the notion that attention interacts with brain adaptation by only affecting response gain. This demonstrates that attention retains its full complement of mechanistic influences on sensory cortex even as it interacts with more automatic or subconscious predictive processes.
Introduction
Operating successfully in natural environments requires adaptable processing of goal-relevant and predictable sensory evidence. Attention and adaptation are important in these regards, but how these neurobiological processes interact remains controversial.
Adaptation, also called “repetition suppression” or “stimulus-specific adaptation” (SSA), is prevalent in the animal brain, characterized by neurons reducing their activity to stimulus repetition or redundant repeating events (Malmierca et al., 2015). Although the functional role of adaptation is debated, ranging from familiarity-based memory to neural calibration to environmental statistics, the consensus view is of an automatic form of prediction not requiring consciousness (Barlow and Foldiak, 1989; Clifford et al., 2000; Bekinschtein et al., 2009; Bastos et al., 2012). In humans, fMRI is often used to investigate adaptation [fMR-adaptation (fMR-A); Grill-Spector et al., 1999; Barron et al., 2013; Larsson and Harrison, 2015], exploiting paradigms that reduce hemodynamic responses to stimulus repetition compared with stronger responses to repetition of stimuli with different features. The dominant theoretical explanations for fMR-A effects are “fatigue” or “sharpening” models (Grill-Spector et al., 2006). The fatigue model describes a systematic response reduction throughout the stimulus-driven neuronal population (a reduction in gain). In comparison, the sharpening model describes changes in selectivity or tuning, with some neuronal responses being enhanced while others are reduced. By contrast, voluntary goal-directed attention can affect neural gain or selectivity under conditions that are well described, resulting in more selective responses when spatially overlapping features compete for attention and neural processing resources (Desimone, 1998; Maunsell and Treue, 2006; Okamoto et al., 2007; Reynolds and Heeger, 2009).
However, whether attentional interactions during adaptation result in gain or selectivity changes remains equivocal. There is evidence that neural adaptation can be modified by top-down feedback via attention, expectation, and learning (Murray and Wojciulik, 2004; Summerfield and de Lange, 2014; Barron et al., 2016). Most studies show that attention can affect fMR-A primarily by nonselectively boosting the gain of sensory signals during adaptation (for review, see Summerfield and de Lange, 2014), in essence counteracting the gain reduction induced by adaptation (Eger et al., 2004; Altmann et al., 2008). However, some authors have found that attention increases neural selectivity during fMR-A (Murray and Wojciulik, 2004; Weiner et al., 2010). Furthermore, when assessing how adaptation interacts with attention in the brain, it is crucial to control for other predictive processes, such as the participant's expectation of whether a repetition will occur, which could also increase neural selectivity (Spratling, 2008; Kok et al., 2012b; Summerfield and de Lange, 2014). Thus, a timely outstanding question is as follows: how do attention and adaptation interact at a neural level? The prior conflicting results may indicate that cognition differentially counteracts fMR-A under conditions that require explication. In particular, a critical outstanding issue is whether the modulatory effects of attention on adaptation are dependent upon whether attended stimulus features are in competition with each other or not (Altmann et al., 2008).
To address this issue, we first manipulated the focus of attention in two sensory modalities (visual or auditory) while participants were scanned with fMRI as they attended to either face or voice identity or to changes in stimulus spatial location (Experiment 1: nonspatial vs spatial selective attention). During the attention task, we systematically manipulated fMR-A while balancing the expectation of a stimulus change or no-change for each individual. In both modalities, these experiments resulted only in attentional gain increases during adaptation in high-level sensory cortex, possibly because attentional processes directed to spatial versus nonspatial features were under minimal competition. Therefore, in Experiment 2 we manipulated attention to voice identity versus sound loudness (both nonspatial features intrinsic to the complex stimulus), which, unlike the observations from Experiment 1, resulted in increased neural selectivity during adaptation. Altogether, the results indicate that top-down attention can differentially counteract neural adaptation effects via changes in gain or neural selectivity, contingent on the level of feature-based competition.
Materials and Methods
Experiment 1
Participants.
Twelve volunteers (7 females, age 22 ± 1.4 SD; 5 males, age 21 ± 2.0 SD) participated in Experiment 1, with each completing the experiment in both the auditory and visual modalities across two separate 1 h fMRI scanning sessions (24 scanning sessions in total). Modality order was counterbalanced across participants. Participants provided informed consent, had normal or corrected-to-normal vision, and reported normal hearing. Ethical approval for the study was awarded by the Newcastle University Ethical Review Committee, and all research was performed in accordance with the ethical standards laid down in the 1990 Declaration of Helsinki.
Stimuli.
Figure 1, A and B, illustrates the face and voice stimuli and how identity and spatial position were systematically manipulated. The auditory stimuli consisted of emotionally neutral “Ah” utterances from six female and six male speakers recorded in a sound-attenuated room using an Edirol R-09HR voice recorder (Roland), sampled at 44.1 kHz and 32-bit resolution. Each female voice was arbitrarily paired with a male voice to generate six voice pairs that differed maximally. We used TANDEM-STRAIGHT (Kawahara et al., 2008; Kawahara and Morise, 2011) to morph between voices in each pair to create six voice identity continua. In brief, spectral envelope, fundamental frequency, and aperiodicity parameters were extracted from the voice sounds. Each parameter was aligned to match between different voices and manipulated parametrically between the two voices of each pair from 0% (male voice) to 100% (female voice) in 5% steps (Fig. 1A), giving a total of 21 voice stimuli per continuum and 126 stimuli in total (6 voice pairs × 21 voice stimuli). We next used REAPER software (http://www.reaper.fm/) to spatially manipulate each voice stimulus so that it was perceived to originate from 40 virtual positions along the azimuth from left (−90°) to right (+90°) in 4° steps (with constant elevation of 0°). Thus, in total, there were 5040 auditory stimuli (40 spatial positions × 126 voice stimuli). This manipulation allowed us to present voice pairs from different spatial positions in virtual acoustic space (Fig. 1B). The average duration of each stimulus was 456 ± 120 ms (mean ± SD). Finally, the voice stimuli were downsampled to 16-bit resolution, gated with a 20 ms cosine ramp (to shape sound onset and offset), and root-mean-square (RMS) normalized following direct-current offset correction.
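For illustration, the following is a minimal MATLAB sketch, not the original stimulus-preparation code, of the final processing steps described above (cosine gating and RMS normalization after offset correction); the variable y holding a mono voice waveform as a column vector, the sampling rate, and the target RMS value are assumptions.

```matlab
% Minimal sketch (not the original stimulus-preparation code) of the final
% processing steps described above. Assumptions: y is a mono voice waveform
% stored as a column vector, fs is 44100 Hz, and targetRMS is an arbitrary
% illustrative level.
fs        = 44100;                  % sampling rate (Hz)
rampDur   = 0.020;                  % 20 ms onset/offset ramp
targetRMS = 0.05;                   % illustrative target RMS level

y = y - mean(y);                    % direct-current (DC) offset correction

nRamp = round(rampDur * fs);                                  % samples per ramp
ramp  = 0.5 * (1 - cos(pi * (0:nRamp-1)' / (nRamp - 1)));     % rises from 0 to 1
y(1:nRamp)         = y(1:nRamp)         .* ramp;              % cosine onset ramp
y(end-nRamp+1:end) = y(end-nRamp+1:end) .* flipud(ramp);      % cosine offset ramp

y = y * (targetRMS / sqrt(mean(y.^2)));                       % RMS normalization
```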
The visual stimuli consisted of grayscale images of six female and six male faces (Goffaux et al., 2005). Faces had emotionally neutral expressions and were masked to show only internal facial features (i.e., no hair and ears). Each female face was arbitrarily paired with a male face to create six face pairs. We used FantaMorph software v5.0 (Abrosoft) to morph between face pairs to create six face identity continua. In brief, we first manually marked 93 corresponding facial features for each of the 12 faces (e.g., corner of eyes, tip of the nose, etc.) to establish a meaningful correspondence across faces regardless of size and facial shape. Next, we took a weighted average between the paired faces from 0% (male face) to 100% (female face) in 5% steps at each facial feature point (Fig. 1A). Thus, there were 21 face images per face pair and 126 face images in total. Finally, we used the SHINE toolbox (Willenbockel et al., 2010) to equalize the luminance across the 126 face images. As the duration of each voice varied, to ensure that the duration of presentation of the faces was equivalent, each of the 126 faces was arbitrarily paired with one of the voices and presented for that duration. Face stimuli subtended ∼8.4° (w) × 7.1° (h) of visual angle. To manipulate spatial position, each face image was presented vertically centered in the left or right half of the screen with a horizontal spatial offset relative to the other (Fig. 1B).
Apparatus.
Participants were trained in a mock scanner environment (Psychology Software Tools) at Newcastle University and tested in a 3-Tesla scanner at the Newcastle Magnetic Resonance Centre (Philips). For the mock scanner, auditory stimuli were presented at ∼75 dB SPL (calibrated with an XL2 sound level meter, NTI Audio) through Sennheiser 380 headphones using an Edirol external sound card (Roland). Visual stimuli were presented on a 19-inch flat-panel monitor at the head of the scanner at 1280 × 1024 pixel resolution. Participants viewed the monitor through a mirror positioned within the mock scanner head coil. They responded using a keypad while in the mock scanner.
At the 3-Tesla scanner, auditory stimuli were presented using an MR-compatible audio system and delivered with electrostatic transducer headphones (NordicNeuroLab). Participants wore earplugs to further shield against scanner noise. Visual stimuli were back-projected onto a screen at the foot of the scanner using a Canon XEED LCD projector (1280 × 1024 pixels). Participants viewed the projection through an angled mirror attached to the head coil ∼10 cm above their eyes. Head motion was restricted by placing foam pads between the head and the head coil. Participants responded via an MR-compatible response pad. The experiments were programmed in MATLAB (MathWorks) with Psychtoolbox (Brainard, 1997; Pelli, 1997).
Task and calibration procedure to balance stimulus change expectancy.
In Experiment 1, participants discriminated the identity or spatial position of sequentially presented face or voice pairs. They were instructed to attend to one of these stimulus features and judge whether the attended feature was the same or different, while ignoring changes to the unattended feature. Differences in both identity and spatial position varied systematically across trials, as we manipulated stimulus repetition and controlled for the participant's expectancy as to whether a stimulus would change on any given trial.
Before the fMRI experiments, we calibrated the psychometric functions of each individual so that during scanning we could present stimuli that balanced the participant's expectancy that pairs of stimuli would be different or not. Perceptually balanced psychometric functions help to minimize differential perceptual expectations for repeated stimuli that may have occurred in previous studies where there were more (perceptually) “different” trials than “same” trials (Murray and Wojciulik, 2004). During the mock scanner calibration session we estimated the perceptual identity levels and spatial position levels needed to balance the psychometric function, which were subsequently used for that individual's fMR-A session. This ensured that nearly half of the stimulus repetition conditions would be perceived as being a stimulus “change” and the others “no-change” (see Fig. 3A,B), while we assessed attention and adaptation interactions.
For the identity task, participants were instructed to attend to identity and ignore any changes to the spatial position between the two stimuli. For the spatial position task, participants were instructed to attend to spatial position and ignore any changes to identity between the two stimuli. Across participants, we counterbalanced the hand and which finger they used to make their responses. Before each block, the participants completed 20 practice trials with feedback to familiarize themselves with the procedure and stimuli. Each block took ∼25 min to complete, and breaks between blocks were given to reduce fatigue. Each participant was trained in a calibration session in the mock scanner followed by two fMR-A scanning sessions. The three sessions were run on separate days.
The identity and spatial position levels are defined as the physical stimulus difference in identity or spatial position, respectively, between a stimulus pair that leads to balanced discrimination performance. To achieve this we estimated the 50 and 75% perceptual levels (i.e., 50 and 75% correct discrimination) in Experiment 1 by fitting psychometric functions to participants' discrimination responses for each stimulus feature.
For identity, the physical stimulus difference was the morph difference between the two faces or two voices presented on a given trial. Seven morph differences were selected as follows. We first randomly selected with replacement a morph stimulus face/voice pair from the 21 face/voice stimuli available. We then selected the second morph stimulus from the remaining 20 stimuli that provided the desired morph difference. For example, suppose we randomly select a face stimulus that was a 35% morph between Face A and Face B. In this case, for a 10% morph difference, we would randomly select either the 25% or the 45% morph between Face A and Face B. For a 0% morph difference (i.e., same stimulus), we would select the same 35% morph between Face A and Face B. Across participants, we sampled the entire identity morph continuum for each face/voice pair. As expected, the morph differences varied slightly for each participant. On average, the seven physical morph differences were for voices: 0 (same), 11.3, 21.3, 32.5, 45.0, 55.0, and 66.3%, and for faces: 0 (same), 11.7, 22.1, 33.3, 45.7, 56.3, and 67.5%.
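This selection logic can be illustrated with a short MATLAB sketch (ours, not the experiment code); the 5% morph grid matches the continua described above, whereas the 10% morph difference is an example value, since the actual differences were individualized per participant.

```matlab
% Illustrative sketch of the stimulus-pair selection logic described above.
morphLevels = 0:5:100;                           % 21 morph levels (%)
morphDiff   = 10;                                % desired morph difference (%)

first = morphLevels(randi(numel(morphLevels)));  % e.g., the 35% morph

% candidate second stimuli at the desired distance, restricted to the continuum
candidates = intersect([first - morphDiff, first + morphDiff], morphLevels);
second     = candidates(randi(numel(candidates)));

fprintf('Pair: %d%% and %d%% morphs (difference = %d%%)\n', first, second, morphDiff);
```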
For spatial position, the physical stimulus difference was the relative spatial displacement between the two voices or the two faces presented on each trial. The seven spatial position differences were selected in an analogous manner as for identity. For the voice stimuli, the two stimuli were presented sequentially in the right or left auditory field separated by, on average, 0 (same), 2.9, 6.1, 9.0, 11.9, 14.8, and 17.5° along the azimuth. For the face stimuli, the face images were presented sequentially with their centers in the right or left visual field (i.e., right or left half of the screen), separated by, on average, 0 (same), 9, 18, 27, 36, 45, or 54 pixels (∼45 pixels/° of visual angle). For both the face and voice stimuli, the stimuli occurred equally often in the right and left auditory/visual fields.
We estimated the 50% identity level from participants' discrimination responses as follows. For each stimulus type, we first calculated the proportion different response as a function of morph difference, pooling across spatial position differences, based on the 42 trials for each of the seven morph differences (6 face/voice pairs × 7 spatial differences). We then fitted a cumulative Gaussian function to each participant's responses for each condition. From each of the two psychometric functions, we estimated the morph difference that gave rise to 50% correct responses. The same procedure was used to estimate the 75% identity level and the 50% and 75% spatial position levels; the spatial position levels were pooled across identity morph differences, also based on the 42 trials for each of the seven spatial differences (6 face/voice pairs × 7 morph differences).
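A minimal MATLAB sketch of this estimation step is given below; it is our illustration (the exact fitting routine used is not specified above), and the data vectors x and p are made-up example values standing in for the seven physical differences and the observed proportion of "different" responses.

```matlab
% Fit a cumulative Gaussian to example discrimination data and invert it to
% obtain the 50% and 75% perceptual levels.
x = [0 11.3 21.3 32.5 45.0 55.0 66.3];      % example morph differences (%)
p = [0.05 0.20 0.45 0.70 0.85 0.93 0.97];   % example proportion "different" responses

cumGauss = @(q, x) 0.5 * (1 + erf((x - q(1)) ./ (q(2) * sqrt(2))));  % q = [mu, sigma]
sse      = @(q) sum((cumGauss(q, x) - p).^2);                        % least-squares error
qHat     = fminsearch(sse, [mean(x), std(x)]);                       % fitted mu and sigma

% Invert the fitted cumulative Gaussian to obtain the physical differences
% corresponding to the 50% and 75% perceptual levels.
level50 = qHat(1);                                           % mu is the 50% point
level75 = qHat(1) + qHat(2) * sqrt(2) * erfinv(2*0.75 - 1);  % inverse of the fitted function
```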
Table 1 shows the estimated 50 and 75% identity and spatial position levels averaged across participants.
The four conditions resulting from the factorial combination of stimulus type (face, voice) and task (identity, spatial position) were run in separate blocks, counterbalanced across participants using a Latin square design. Each block consisted of 294 trials (6 face/voice pairs × 7 morph differences × 7 spatial differences) presented in random order. Each trial began with a white fixation cross presented at the center of the screen for 500 ms, followed by a blank period (black screen) for 1500 ms. The two stimuli were then presented sequentially for their duration, separated by a 500 ms black screen. The participants were instructed to respond by pressing a same or different key after the offset of the second stimulus.
fMRI procedure and experimental design.
In each fMR-A session, participants were presented with only face or voice stimuli, and they discriminated either identity or spatial position in four alternating runs (2 runs per attended feature). Both the order of stimulus type and the order of task were counterbalanced across participants using a Latin square design. Each run was ∼8 min in duration and consisted of 54 trials (6 face/voice pairs × 3 identity levels × 3 spatial position levels). The three perceptual levels for each stimulus feature [0 (same trial), 50, and 75%] were individualized for each participant as described in the prior section, and were randomly interleaved within a run.
We used a sparse imaging protocol so that auditory stimuli could be presented during silent periods without scanner noise. This is a common auditory fMRI scanning paradigm that prevents the scanner noise from masking or competing with the acoustic stimuli (Hall et al., 1999); although it is not required for the visual modality, we opted to keep the scanning paradigm the same throughout. The scanner transmitted a transistor–transistor logic (TTL) pulse every 8000 ms and acquired an MRI volume in the 2000 ms after each pulse. Each trial sequence was triggered by the TTL pulse. The trial began with a 2000 ms blank screen, followed by a white fixation cross at the center of the screen for 500 ms. Then the first stimulus was presented for its duration, followed by a black screen for 500 ms (interstimulus interval), and then the second stimulus for its duration, with the stimulation trial ending with a white fixation cross. This cross changed color from white to green 6000 ms after trial onset and remained on the screen for 1700 ms, during which time the participants made their responses (Fig. 1C). There was a brief 300 ms blank period before a TTL pulse triggered the next trial. The MRI volume acquired for a given trial was timed to occur ∼4 s after the offset of the second stimulus to accommodate the delay in the hemodynamic response (Fig. 1C,D).
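The trial timing can be summarized schematically as follows (a sketch, not the original Psychtoolbox code), under our reading that each 2000 ms volume acquisition begins at the TTL pulse; the 456 ms stimulus durations are the average values reported above.

```matlab
% Schematic of the sparse-imaging trial timing; times are in ms relative to
% the TTL pulse that starts the trial.
stimDur1 = 456;  stimDur2 = 456;            % example stimulus durations (ms)

tVolume   = 0;                              % 2000 ms volume acquired during the blank screen
tFixation = 2000;                           % white fixation cross for 500 ms
tStim1    = 2500;                           % first stimulus onset
tISI      = tStim1 + stimDur1;              % 500 ms interstimulus interval (black screen)
tStim2    = tISI + 500;                     % second stimulus onset
tCue      = 6000;                           % fixation turns green; 1700 ms response window
tNextTTL  = 8000;                           % next TTL pulse: next trial and next volume

% The volume acquired at the start of the following trial sequence therefore
% samples the BOLD response roughly 4 s after the second stimulus offset.
lagToNextVolume = tNextTTL - (tStim2 + stimDur2);   % ~4088 ms with 456 ms stimuli
```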
Participants were instructed to respond by pressing the same or different key during the response period when the fixation cross was green. If they did not respond during this period, the trial was counted as an error. As for the mock-scanner experiments, across participants, we counterbalanced the hand and which finger they used to make their responses. Each auditory or visual fMR-A testing session lasted ∼1 h, which included structural scans if needed. In addition, during one of the scanning sessions we also acquired functional data during face (Schultz and Pilz, 2009) and voice (Belin et al., 2000) localizer runs to identify face- and voice-sensitive regions.
Experiment 2
Participants.
A separate group of 12 volunteers (7 females, age 28.5 ± 7.8 SD; 5 males, age 24.4 ± 3.1 SD) participated in Experiment 2, which was conducted in the auditory modality. The number of participants was the same as in the auditory sessions of Experiment 1 to ensure comparable statistical power across both experiments.
Stimuli.
The Experiment 2 stimuli consisted of the voice stimuli from Experiment 1 concatenated with a short noise burst (50 ms) at the end. Note that we treat the concatenated stimuli as a single complex acoustic object as there are no delays between the two components (with identity and loudness being intrinsic features of this object). For the noise bursts, we used MATLAB to synthesize acoustic pink-noise stimuli that varied in intensity. Pink noise has a 1/f power spectrum that is present in naturally occurring environmental sounds, containing high power at low frequencies with power decreasing as frequency increases. To vary perceived loudness, we systematically manipulated the (physical) mean intensity of the pink noise (i.e., intensity averaged across the entire 50 ms duration). Like the voice morph percentages, there were 21 levels of mean intensity of the pink noise ranging from 0.5 to 0.95 in 0.0214 steps (arbitrary normalized intensity units). On each trial, we independently manipulated the voice identity and the intensity of the concatenated noise stimulus.
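The following MATLAB sketch illustrates one way such a burst could be generated (our sketch, not the original synthesis code); the sampling rate and the use of RMS as the intensity measure are assumptions.

```matlab
% Synthesize a 50 ms pink-noise burst by shaping white Gaussian noise with an
% approximately 1/f power spectrum, then scale it to one of the 21 levels.
fs  = 44100;                         % sampling rate (Hz)
dur = 0.050;                         % 50 ms burst
n   = round(fs * dur);               % number of samples

w = randn(n, 1);                     % white Gaussian noise
W = fft(w);
f = (0:n-1)';
f = min(f, n - f);                   % symmetric bin index so the output stays real
f(1) = 1;                            % avoid dividing by zero at DC
P = W ./ sqrt(f);                    % 1/sqrt(f) amplitude -> ~1/f power spectrum
pink = real(ifft(P));                % back to the time domain

targetIntensity = 0.5;               % one of the 21 levels (arbitrary normalized units)
pink = pink * (targetIntensity / sqrt(mean(pink.^2)));   % scale RMS to the target level
```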
Perceptual calibration, fMRI procedure, and experimental design.
All participants were tested in a mock scanner calibration session followed by one fMR-A session on a separate day. The procedure for both sessions was the same as in Experiment 1, except that participants discriminated identity or loudness between the two complex stimuli presented on each trial. For the identity task, participants were instructed to judge whether the identity of the two voices was the same while ignoring any changes to the noise stimulus. For the loudness task, participants were instructed to judge whether the loudness of the two noise stimuli was the same while ignoring any changes in voice identity.
Similar to Experiment 1, the purpose of the perceptual calibration session was to estimate the 50% and 80% identity and loudness levels, which were subsequently used in the fMR-A session for each participant (Table 1). The two perceptual levels for voice identity were determined as in Experiment 1. For the perceptual level of loudness, the physical stimulus difference was the intensity difference between the two noise stimuli on a given trial. The 50% and 80% loudness levels were determined as follows. We first randomly selected a noise stimulus from the 21 possible stimuli. The intensity of the second noise was then selected to give the desired intensity difference. For Experiment 2, the voice morph differences were as follows: 0, 15, 30, 45, 60, 75, and 90%; and the intensity differences were as follows: 0, 0.0645, 0.129, 0.1935, 0.258, 0.3225, and 0.387 arbitrary units. The 294 trials for each attended feature (6 voice pairs × 7 morph differences × 7 intensity differences) were run in a random order. The auditory stimuli were always presented at 0° azimuth in virtual acoustic space (i.e., straight ahead).
In the fMR-A session in Experiment 2, participants discriminated either identity or loudness in four alternating runs (2 runs per attended feature). The task order was counterbalanced across participants. As in Experiment 1, we systematically manipulated the magnitude of identity or intensity changes on each trial within a run based on each participant's estimated 50% and 80% perceptual level for each stimulus feature. Before testing, participants were given 20 practice trials with feedback. Each run was ∼8 min in duration and consisted of 54 trials (6 voice pairs × 3 identity levels × 3 loudness levels) run in a random order (for an illustration of the trial sequence, see Fig. 1D). The three perceptual levels were 0, 50, and 80%. The scanning session lasted ∼1 h, which included structural and functional scans including a voice localizer (Belin et al., 2000).
Image acquisition.
For all participants in Experiments 1 and 2, anatomical T1-weighted images and functional T2*-weighted gradient-recalled echo-planar images (EPIs) were acquired from a 3-Tesla Philips Intera Achieva MR scanner using a Philips 8-channel head coil. The higher resolution T1-weighted structural image consisted of 150 slices and took ∼5 min to acquire. The parameters of the structural scan were as follows: repetition time (TR) = 9.6 ms, echo time (TE) = 4.6 ms, flip angle = 8°. The field-of-view (FOV) was 249 × 240 × 180 mm3 with a matrix size of 216 × 208 pixels. Each voxel was 0.87 × 0.87 × 1.2 mm3 in size. The T2*-weighted EPI functional images consisted of 28 axial slices acquired from the bottom to the top of the head. The parameters of the EPI scans were as follows: acquisition time = 2.0 s, TR = 8.0 s, TE = 30 ms, flip angle = 90°. The FOV was 192 × 192 × 125.5 mm3 with a matrix size of 64 × 62 pixels. Each voxel was 3 × 3 × 4 mm3 in size, with a 0.5 mm gap between slices. We used sensitivity encoding (SENSE) with a factor of 2 to increase the signal-to-noise ratio of the functional images. For each participant, a total of 54 functional images were acquired in each run. Before each functional run, four “dummy” scans were acquired to allow for equilibration of the T1 signal.
fMRI preprocessing.
We used FSL (Jenkinson et al., 2012) for all fMRI data processing and analyses. The data acquired for the auditory and visual stimuli were analyzed separately. For each dataset, we removed the first volume from each run and concatenated all four runs into a single time series. We then extracted each participant's brain from the skull using BET, normalized the intensity of the resulting functional images, and spatially coregistered the images to that participant's structural scan. The images were spatially smoothed with a 5 mm full-width at half-maximum Gaussian kernel to improve the signal-to-noise ratio. Last, we applied a high-pass filter with a cutoff of 432 s to remove low-frequency drifts in the signal.
Face and voice regions-of-interest.
To identify face-sensitive areas, we used the face localizer from Schultz and Pilz (2009). Briefly, this localizer consisted of fixations, static faces, static phase-scrambled faces, dynamic faces, and dynamic phase-scrambled faces. These conditions were run in a block design, with every condition preceded by each condition equally often (i.e., history-matched blocks). Participants performed a one-back matching task. The face localizer lasted ∼7.5 min, and we used a TR = 1.92 s. To identify voice-sensitive areas, we used the voice localizer from Belin et al. (2000). Briefly, there were 20 × 8 s vocal sounds, 20 × 8 s non-vocal sounds (industrial, environmental, and animal sounds) and 20 silent periods. The stimuli were presented in a fixed intermixed order optimized for the contrast vocal > non-vocal sounds. Participants performed a fixation-change detection task. The duration of the voice localizer was 10 min and we used a TR = 10 s.
Functional data from the localizer runs were preprocessed as described above, and spatially coregistered to each participant's structural scan. Following this, β images for the different stimulus conditions were calculated using procedures described by Schultz and Pilz (2009) and Belin et al. (2000); these images were used in different contrasts to functionally localize face- and voice-sensitive areas. The face regions-of-interest (ROIs) for each participant were defined as voxels from the contrast (static and dynamic faces) > (static and dynamic phase-scrambled faces) with Z > 5.3 and limited to be within the anatomically-defined temporal lobe. Similarly, the voice ROIs for each participant were defined as voxels from the contrast vocal > non-vocal sounds with Z > 5.3 and limited to be within the temporal lobe. Face and voice ROIs are shown in Figure 2. We used standard human atlases in FSL for the temporal lobe (Harvard-Oxford Cortical Structural Atlas and the Juelich Histological Atlas) and spatially coregistered this region to each participant's structural scan.
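The ROI definition and averaging logic can be summarized in a few lines of MATLAB (an illustrative sketch with assumed variable names; the actual analysis used FSL). Here zstatLocalizer, temporalLobeMask, and betaCondition are assumed to be 3-D volumes in register with each other.

```matlab
% Define the ROI from localizer voxels above threshold within the temporal
% lobe, then average condition-wise parameter estimates over those voxels.
zThresh = 5.3;

roiMask = (zstatLocalizer > zThresh) & temporalLobeMask;   % logical ROI mask
fprintf('ROI size: %d voxels\n', nnz(roiMask));

meanPE = mean(betaCondition(roiMask));                     % ROI-averaged parameter estimate
```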
Statistical analyses.
On separate functional runs, participants discriminated changes to one stimulus feature (e.g., identity) of face or voice pairs, while ignoring changes to another feature (e.g., spatial position). The magnitude of physical changes for both stimulus features was systematically varied. For both experiments, we compared the magnitude of BOLD responses in face and voice ROIs (see above) across different experimental conditions. To estimate the magnitude of BOLD responses for each condition, we performed a first-level fixed-effects general linear model analysis using FSL's FEAT function. For each participant and each stimulus type, we constructed two design matrices depending on whether we varied the identity level or the spatial position/loudness level. Each design matrix was constructed as follows. There were six regressors-of-interest to reflect the six experimental conditions [2 tasks × 3 perceptual levels (either identity or spatial position/loudness)]. Because we used sparse imaging, each fMRI time point in the design matrix reflects the BOLD response to the stimulus pair presented on that trial. In addition, the six movement parameters (roll, yaw, pitch, and 3 translation terms) calculated from the motion correction were included in the design matrix as regressors-of-no-interest. A linear combination of the regressors was fitted to the BOLD signal to calculate the parameter estimate for each regressor. The parameter estimate for each experimental condition was extracted from each voxel in the respective individually-localized face or voice ROIs and averaged.
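The structure of this first-level model can be sketched as follows (the actual analysis used FSL FEAT; condIndex, motionParams, the trial counts, and the voxel time series y are assumed variables). With sparse imaging, each volume is treated as the response to the stimulus pair on the corresponding trial, so the design matrix simply codes condition membership plus the six motion parameters.

```matlab
% Schematic first-level GLM with six condition regressors and six motion
% regressors, fitted by ordinary least squares for one voxel.
nTrials = 54 * 4;                          % 54 trials per run x 4 concatenated runs (illustrative)
nConds  = 6;                               % 2 tasks x 3 perceptual levels

X = zeros(nTrials, nConds);                % regressors-of-interest
for t = 1:nTrials
    X(t, condIndex(t)) = 1;                % condIndex(t): condition (1-6) of trial t
end
X = [X, motionParams];                     % append 6 motion regressors-of-no-interest (nTrials x 6)

beta = X \ y;                              % parameter estimates for voxel time series y (nTrials x 1)
```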
For Experiment 1, the proportion different responses and the mean parameter estimates were submitted to two separate within-subjects ANOVAs with stimulus type (face, voice), task (identity, spatial position) and either identity level or spatial position level (0, 50, and 75% for both) as repeated measures. For Experiment 2, the data were also submitted to separate within-subjects ANOVAs with task (identity, loudness) and either identity level or loudness level (0, 50, and 80% for both) as repeated measures. We also conducted a priori trend analyses to detect polynomial trends in the data (e.g., linear trends). For all analyses, we used a significance threshold of p = 0.05. All analyses were conducted using MATLAB and/or SPSS v23.
Results
In this study, we conducted several attentional manipulations within the context of conditions eliciting systematic fMR-A repetition suppression. As an overview, in Experiment 1 in two sensory modalities (auditory and visual) participants were presented with pairs of morphed faces or voices while we measured how the BOLD response changed as a function of the perceptual differences between the stimuli. Here, the participants either attended to identity or spatial position. In Experiment 2, we tested the influence of attention to acoustic features of the complex auditory objects by measuring fMR-A effects while participants listened to pairs of voices and discriminated their identity or the loudness of a noise burst presented immediately after the vocalization.
First, we outline our predictions and how the neuroimaging results were analyzed to test them. fMR-A effects are strongest for repetition of identical stimuli; thus, we expected that the mean parameter estimate in face- and voice-sensitive ROIs would increase in response to the second stimulus as we morphed voice or face identity levels away from the identity presented as the first stimulus. This serves as the fMR-A stimulus-related adaptation function. If attention only affects gain during fMR-A and not selectivity, we expect that the mean parameter estimate in these ROIs would increase by equal amounts at each identity level when participants discriminated identity compared with when they discriminated spatial position (Experiment 1) or loudness (Experiment 2). By comparison, if attention affects neural selectivity, we expect that the mean parameter estimate would increase differentially at each identity level between tasks, evident as an interaction between task and identity level. Data for both experiments reported below are freely available at the Open Science Framework (https://osf.io/dxevz/).
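The two predicted patterns can be made concrete with a toy numerical illustration (ours, not data; all values are arbitrary):

```matlab
% "adapt" is a hypothetical release-from-adaptation function across the three
% identity levels when identity is ignored. Under a pure gain account,
% attention adds the same amount at every level (no interaction); under a
% selectivity account, the attentional boost differs across levels, producing
% a task x identity-level interaction.
identityLevel = [0 50 75];              % perceptual levels (%)
adapt         = [0.2 0.5 0.7];          % hypothetical adaptation function (a.u.)

gainOnly    = adapt + 0.3;              % equal boost at every level
selectivity = adapt + [0.1 0.3 0.5];    % boost grows with identity level

disp([adapt; gainOnly; selectivity])
```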
Experiment 1
Attention to identity
Participants' ability to discriminate identity and spatial position as a function of identity level for faces and voices is shown in Figure 3, A and B. As expected, participants' proportion different response increased as identity level increased for both faces and voices (i.e., as the morph difference between the two stimuli increased) but only when they attended to and discriminated identity (red and blue lines) rather than spatial position (magenta and cyan lines). These observations are reflected by a significant interaction between task and identity level for faces (F(2,22) = 121.474, p < 0.001, ηp2 = 0.917) and for voices (F(2,22) = 26.572, p < 0.001, ηp2 = 0.707). There was a main effect of identity level (F(2,22) = 121.053, p < 0.001, ηp2 = 0.917), but not task (F(1,11) = 1.389, p = 0.263, ηp2 = 0.112) for faces, and a main effect of identity level (F(2,22) = 212.264, p < 0.001, ηp2 = 0.951), but not task (F(1,11) = 0.261, p = 0.600, ηp2 = 0.026) for voices. The behavioral data in this and the subsequent experiment validate the perceptual levels estimated during the calibration sessions.
During fMRI, participants attended to voice or face identity; Figure 3, C and D, shows the mean parameter estimate as a function of stimulus type, task, and identity level. Participants' mean parameter estimate in face and voice ROIs increased as identity level increased, commensurate with release from adaptation (main effect of identity level: F(2,22) = 5.151, p = 0.015, ηp2 = 0.319; significant linear trend: F(1,11) = 6.912, p = 0.023, ηp2 = 0.386). We also found that attention to identity increased the mean parameter estimate in both ROIs (main effect of task: F(1,11) = 5.329, p = 0.04, ηp2 = 0.33). Importantly, we found no significant interaction between task and identity level and no significant three-way interaction (all F values < 1.043), indicative of only increases in gain as a function of attention. Finally, there was a significant main effect of stimulus type (F(1,11) = 34.484, p < 0.001, ηp2 = 0.758), but no interactions between stimulus type and task or identity level (all F values < 0.525).
Attention to spatial position
Participants' behavioral ability to discriminate spatial position for faces and voices is shown in Figure 4, A and B. Similar to the first analysis, participants' proportion different responses increased as spatial position level increased for both stimulus types (i.e., as the spatial displacement between the two stimuli increased) only when they attended to and discriminated spatial position (magenta and cyan lines) but not identity (red and blue lines). These observations are reflected in a significant interaction between task and spatial position level for faces (F(2,22) = 64.981, p < 0.001, ηp2 = 0.855) and for voices (F(2,22) = 15.095, p < 0.001, ηp2 = 0.578). There was a main effect of spatial position level (F(2,22) = 74.950, p < 0.001, ηp2 = 0.872), but not task (F(1,11) = 1.347, p = 0.270, ηp2 = 0.109) for faces, and a main effect of spatial position level (F(2,22) = 16.401, p < 0.001, ηp2 = 0.599), but not task (F(1,11) = 0.278, p = 0.608, ηp2 = 0.025) for voices.
Turning to the fMRI results during attention to spatial location, the mean parameter estimate as a function of stimulus type, task, and spatial position level is shown in Figure 4, C and D. The mean parameter estimate was greater when the participants attended to and discriminated identity compared with spatial position (main effect of task: F(1,11) = 5.714, p = 0.039, ηp2 = 0.342). Moreover, unlike the previous analysis, the mean parameter estimate did not significantly increase with spatial position level, meaning that there was no spatial adaptation effect in face- or voice-sensitive cortex. There were also no interactions (all F values < 1.0). Finally, there was a main effect of stimulus type (F(1,11) = 33.656, p < 0.001, ηp2 = 0.754). These results are expected because face and voice ROIs are more involved in processing face or voice features than spatial location. Furthermore, although attention to spatial position increased the gain of neural responses in these ROIs, it did not change the selectivity of these responses.
Experiment 2
Attention to identity
Participants' ability to discriminate identity and loudness as a function of identity level is shown in Figure 5A. As expected, the proportion different responses increased as identity morph distance increased only when the participants attended to and discriminated identity (blue line) but not loudness (cyan line). These observations are reflected in a significant interaction between task and identity level (F(2,22) = 225.525, p < 0.001, ηp2 = 0.953). There was a main effect of identity level (F(2,22) = 123.443, p < 0.001, ηp2 = 0.918), but not task (F(1,11) = 0.772, p = 0.399, ηp2 = 0.066).
The mean fMRI parameter estimate as a function of task and identity level is shown in Figure 5C. As found in Experiment 1, participants' mean parameter estimate in voice ROIs increased as identity level increased, commensurate with release from adaptation (main effect of identity level: F(2,22) = 18.006, p < 0.001, ηp2 = 0.621; significant linear trend: F(1,11) = 42.5, p < 0.001, ηp2 = 0.795). In addition, there was a main effect of task (F(1,11) = 5.710, p = 0.036, ηp2 = 0.342). However, in contrast to Experiment 1, there was a significant interaction between task and identity level (F(2,22) = 18.004, p < 0.001, ηp2 = 0.621), indicative of a change in selectivity during attention to voice identity. To interrogate this interaction further, we analyzed the main effects of identity level separately for each task. We found that the mean fMRI parameter estimate varied as a function of identity level for the identity task (F(2,22) = 30.16, p < 0.001, ηp2 = 0.733; linear trend: F(1,11) = 88.264, p < 0.001, ηp2 = 0.889; quadratic trend: F(1,11) = 9.512, p = 0.01, ηp2 = 0.464) but not for the loudness task (F(2,22) = 1.281, p = 0.298, ηp2 = 0.104). Overall, our findings suggest that attention to voice identity while ignoring loudness increases both the gain and selectivity of neural responses in the voice ROIs during fMR-A.
Attention to loudness
Participants' ability to discriminate identity and loudness as a function of loudness level is shown in Figure 5B. Their proportion different responses increased as loudness level increased (i.e., as the intensity difference between the two noise components of the complex stimuli increased) only when they attended to and discriminated loudness (cyan line) but not identity (blue line). These observations are reflected in a significant interaction between task and loudness level (F(2,22) = 29.936, p < 0.001, ηp2 = 0.731). There was a main effect of loudness level (F(2,22) = 37.498, p < 0.001, ηp2 = 0.773), but not task (F(1,11) = 0.846, p = 0.378, ηp2 = 0.071).
The mean fMRI parameter estimate as a function of task and loudness level is shown in Figure 5D. Similar to Experiment 1, participants' mean parameter estimate was greater when they attended to and discriminated loudness compared with identity (main effect of task: F(1,11) = 5.872, p = 0.034, ηp2 = 0.348). As expected, there was no main effect of loudness level (F < 1), suggesting no clear adaptation function for loudness level in voice-sensitive cortex. However, unlike the analogous analysis from Experiment 1, there was a significant interaction between task and loudness level (F(2,22) = 4.430, p = 0.024, ηp2 = 0.287). This interaction was driven by a quadratic trend (F(1,11) = 5.314, p = 0.042, ηp2 = 0.326; linear trend: F(1,11) = 2.890, p = 0.117, ηp2 = 0.208; Fig. 5D). Thus, although the voice ROI is insensitive to changes in noise intensity, attention to the complex stimulus features including loudness nonetheless modulated the selectivity of neural responses within this ROI.
Finally, to assess whether the effects on adaptation of attending to different stimulus features were significantly different from each other in Experiments 1 and 2, we performed an additional three-way mixed ANOVA on the fMRI parameter estimates measured when participants were presented with auditory stimuli and we varied the stimulus identity. Here we entered Experiment as a between-subjects factor (Experiment 1, Experiment 2), along with within-group factors of task (attend identity, attend other: position or loudness) and stimulus level (0, 50, 75/80%). The principal finding from this analysis was the significant three-way interaction (F(2,44) = 10.882, p < 0.001, ηp2 = 0.331) that reflects that the effect of the focus of attention on adaptation effects is significantly different between the two experiments in the auditory modality. This interaction is consistent with the gain effects we observed in Experiment 1 and the selectivity (and gain) effects we observed in Experiment 2.
Discussion
The results show that what we attend to in our environment interacts with brain adaptation in distinctly different ways, impacting on cortical response gain or selectivity during adaptation. The findings reconcile prior results showing either attention-related gain (Altmann et al., 2008) or changes in selectivity (Murray and Wojciulik, 2004) during adaptation. Attentional gain modulation during adaptation has most commonly been observed, and to date is the most parsimonious explanation for such interactions (for review, see Summerfield and de Lange, 2014), which we also see here under certain conditions. However, our combined results both challenge the notion that attentional influences during fMR-A only affect response gain, and they clarify the conditions under which attention and adaptation interact to change selectivity.
In the first experiment, with selective attention to identity versus spatial position, attention works additively with adaptation. The effects were remarkably similar in the auditory and visual modalities, in voice- or face-sensitive cortex, respectively. Here, participants attended to the identity of voices/faces while ignoring the spatial location of the stimuli, or vice versa. Because the voice/face localizer ROIs were selected to identify areas processing these features, attention to stimulus features versus their spatial location in this case selects predominantly separate neural populations. For instance, temporal lobe voice-sensitive areas do not overlap with cortical regions involved in sound localization (Belin et al., 2000; Zatorre et al., 2002), although spatial operations are known to be more broadly distributed in auditory cortex than previously thought (Ortiz-Rios et al., 2017). As a result, voice- and face-sensitive regions will adapt to these stimulus features and reduce their responses more to repetition of stimuli with similar features, as a form of neural selection. In these regions, and independent of the level of adaptation, attention amplifies and counteracts the adapted neural signal by a constant multiplicative factor. The resulting fMR-A interaction is a constant change in gain.
By contrast, in the second experiment, attention to identity results in disproportionately greater enhancement for conditions during which there is less adaptation in voice-sensitive cortex; this interaction between attention and adaptation is consistent with increases in neural selectivity (Murray and Wojciulik, 2004). Although intensity is not an optimal feature for voice-sensitive cortex, as shown by the weak to absent adaptation effects to sound intensity, attention to identity versus loudness interacted with adaptation in ways that differed from the effects seen during attention to identity versus spatial location. Adaptation, as in Experiment 1, again acts as a neural selection process tuned to identity. However, in Experiment 2, attention needs to boost the activity of a subpopulation responding to voice content over sound intensity. Specifically, in trials where the two voices presented are identical (identity difference = 0%), there is full adaptation and attention to identity negligibly boosts the signal. However, in trials where the two voices differ by 50%, some neurons that are specifically tuned for identity adapt less because different features are processed in succession. Here, we see that attention selectively boosts the less-adapted response to these stimuli. The effect of attention in this case will therefore depend on the level of adaptation; rather than the “constant gain change” seen in the first experiment, there is an interaction between adaptation and attention.
We now consider how these results can be more formally understood within the broader literature on attentional interactions with different types of predictive signals. SSA, one of the processes that we manipulated here, is thought to result from subcortical and cortical inhibitory interneuron influences on pyramidal neurons, typically based on neuronal data modeled as a decrease in neuronal response gain without a change in selectivity (Parras et al., 2017). Neurophysiological data on goal-directed attention have led to computational models accounting for both attentional gain and changes in selectivity. For instance, in the normalization model of attention by Reynolds and Heeger (2009), two mechanisms lead to attentional enhancement and a change in selectivity. The model is as follows: R = E/N, where the neural response (R) results from two processes: attentional enhancement (E), which is the product of the stimulus–response field (Sf) and the attention field (Af) where attention is directed; and divisive normalization (N) created by the suppressive surround for features that fall outside the attentional field and need to be ignored. Typically, changes in selectivity occur when there is “biased competition”, such as when competing features that spatially overlap are either being attended to or ignored (Desimone, 1998).
Murray and Wojciulik (2004) observed and modeled an increase in selectivity during fMR-A as a function of attention. First, they modeled the greater adaptation for more similar stimulus features, which we depict here as â(s), where s is the stimulus-driven response or tuning curve and â is the effect of adaptation as a function of tuning. Then they modeled the multiplicative effect of attention (g) on the adaptation function as gâ[g(s)]. Note that the nesting of the attentional gain (g) with both â and s is required for a change in fMR-A selectivity: the greater effect of attention on stimuli that produce less adaptation. Integrating the effects of adaptation into the normalization model of attention can be done as follows: R = [E × (Af × â)]/N. As before, E is attentional enhancement, equivalent to g(s) in the adaptation model, and Af × â reflects attention interacting with the level of adaptation, equivalent to gâ in the adaptation model.
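To make the combined expression concrete, the following is a toy MATLAB sketch (our illustration, not code from the cited studies), writing a for the adaptation function â; all numbers are arbitrary, s is a constant unadapted drive, and N is held fixed. The point is only that multiplying the attention field Af with the adaptation function makes the attentional boost depend on the level of adaptation.

```matlab
% Toy version of R = [E x (Af x a)]/N for one identity-tuned population.
s  = 1;                                    % unadapted stimulus drive (a.u.)
a  = [0.3 0.6 0.8];                        % less adaptation (larger a) at larger identity differences
N  = 1;                                    % divisive normalization term (held constant)

AfIgnore = 1.0;  AfAttend = 1.5;           % attention field when identity is ignored vs attended

E_ignore = s * AfIgnore;  E_attend = s * AfAttend;    % attentional enhancement E = Sf x Af
R_ignore = (E_ignore .* (AfIgnore .* a)) ./ N;
R_attend = (E_attend .* (AfAttend .* a)) ./ N;

% Because Af multiplies the adaptation function, the attentional boost
% (R_attend - R_ignore) grows with release from adaptation: a selectivity change.
disp([a; R_ignore; R_attend; R_attend - R_ignore])
```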
Other prominent models that the research community relies on to explain attentional influences are those based on predictive coding. In the predictive coding model of brain function (Friston, 2005; Rao, 2005; Bastos et al., 2012), neurons compute the difference between sensory input and prior predictions, resulting in a prediction error passed from lower to higher hierarchical processing stages. Attention, in this setting, is a top-down influence that weights the flow of prediction errors: prediction errors from neurons responding to attended object features are given more weight than those to ignored features. Mathematically, the weight is computed as the precision (inverse of the SD, σ) of the distribution of prediction errors. Within this framework, the neural response can be summarized as R = (y − ŷ)/σ. Here (y − ŷ) is the prediction error and σ is the SD of the prediction errors. The change in precision (σ) is equivalent to the normalization term (N) in the Reynolds and Heeger (2009) model. This term can also change the gain of attentional enhancement (1/σ is equivalent to g in the adaptation model or to the attentional field Af in the normalization model above). Thus, considering the effects of adaptation on selectivity, the modified predictive coding model accounting for adaptation is R = (y − ŷ)/σ(â), where â, as above, is the adaptation function and σ(â) denotes the dependence of the SD (and precision) on the level of adaptation as a function of tuning.
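By analogy, a toy transcription of these two expressions is sketched below (our illustration only); the particular form chosen for σ(â) is an arbitrary assumption, and the point is simply that an adaptation-dependent precision makes the weighting of the prediction error vary with the level of adaptation.

```matlab
% predErr stands for (y - yhat); a is the adaptation function.
predErr = [0 0.5 0.8];                  % hypothetical prediction errors at the 0/50/75% levels
a       = [0.3 0.6 0.8];                % hypothetical adaptation function

sigmaFixed = 1.0;                       % R = (y - yhat)/sigma
R_fixed    = predErr / sigmaFixed;

sigmaAdapt = 1 ./ (1 + a);              % illustrative sigma(a): precision grows with release from adaptation
R_adapt    = predErr ./ sigmaAdapt;     % R = (y - yhat)/sigma(a)
```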
Although, as we have considered, the predictive coding framework can formalize adaptation and attention influences, it is important to consider the distinctions between different types of predictive signals. For instance, typically within this framework sensory input is processed in relation to a (prior) expectation. However, expectancy effects (controlled in our experiment) partially segregate from SSA effects, are more accessible to consciousness, and thus might interact with attention differently (Summerfield and Egner, 2009; Kok et al., 2012a). Expectation suppression occurs when a repetition is expected, and when this expectation is violated the result is an expectation prediction error. Attention interacting with expectancy manipulations can result in different forms of fMRI-based changes in selectivity (Larsson and Smith, 2012; attentional dampening or sharpening during repetition expectation: Kok et al., 2012a; de Lange et al., 2018). However, even here expectancy effects sometimes only result in gain changes. Namely, Vinken et al. (2018) used the face repetition and expectancy manipulation paradigm from Summerfield et al. (2008) but found no effect of repetition expectancy on neuronal local field potentials in face-sensitive areas in nonhuman primates. Whereas prior work on interactions between adaptation and attention contained a higher proportion of perceived stimulus differences during repetition, which may have influenced participants' expectancy and thus affected selectivity (Murray and Wojciulik, 2004), in our experiment we minimized differences in perceptual expectations individually (balanced psychometric functions where perceived stimulus repetition or change occur with equal probability). Therefore, our study aimed to clarify SSA effects interacting with attention. In this regard, the conditions under which gain and selectivity changes occur under expectancy manipulations could benefit from further study to identify common mechanistic principles.
In conclusion, the properties of the stimulus to which attention is directed appear to determine whether attention, in interaction with adaptation effects, solely modulates the gain of adapted signals or also affects selectivity. We find that adaptation-induced gain changes are prominent during tasks where the observer's attention is directed to features processed by largely segregated neural populations (e.g., spatial vs object features). By contrast, changes in selectivity occur when attention is directed to competing stimulus features processed by neural populations that overlap to a greater extent (e.g., voice identity and sound intensity). Thus, attention plays a multifaceted role in how adaptation-induced gain and selectivity changes affect the flow of information in the brain.
Footnotes
This work was supported by the Biotechnology and Biological Sciences Research Council, BBSRC UK (C.I.P. and Q.C.V.), and additional support from a Wellcome Trust Capital Award to Newcastle University (WT092504) and the Reece Foundation. C.I.P. is a Wellcome Trust Investigator (WT102961MA) and supported by a European Research Council award (MECHIDENT). We thank Tugrul Zengin for assistance with creating face morphs, and Tim Griffiths and Scott Murray for useful discussion.
The authors declare no competing financial interests.
Correspondence should be addressed to Nick E. Barraclough at nick.barraclough@york.ac.uk or Quoc C. Vuong at quoc.vuong@ncl.ac.uk.
This is an open-access article distributed under the terms of the Creative Commons Attribution License Creative Commons Attribution 4.0 International, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.