Selective attention contributes to perceptual efficiency by modulating cortical activity according to task demands. Visual attention is controlled by activity in posterior parietal and superior frontal cortices, but little is known about the neural basis of attentional control within and between other sensory modalities. We examined human brain activity during attention shifts between vision and audition. Attention shifts from vision to audition caused increased activity in auditory cortex and decreased activity in visual cortex and vice versa, reflecting the effects of attention on sensory representations. Posterior parietal and superior prefrontal cortices exhibited transient increases in activity that were time locked to the initiation of voluntary attention shifts between vision and audition. These findings reveal that the attentional control functions of posterior parietal and superior prefrontal cortices are not limited to the visual domain but also include the control of crossmodal shifts of attention.
- crossmodal attention
- attentional control
- visual attention
- auditory attention
- posterior parietal cortex
- superior parietal lobule
- functional magnetic resonance imaging
Attention is the cognitive mechanism by which salient or behaviorally relevant sensory information is selected for perception and awareness (Desimone and Duncan, 1995). Attended visual events are perceived more rapidly and accurately than ignored ones (Posner et al., 1980), and attention to one sensory modality can impair the perception of otherwise salient events in another (Strayer et al., 2003). Evidence from functional neuroimaging and neurophysiology has revealed that neural activity is greater and neural spiking is more synchronous in sensory cortical regions responding to attended visual events than in those responding to ignored ones (Moran and Desimone, 1985; Desimone and Duncan, 1995; O'Craven et al., 1997; Reynolds et al., 1999). A network of areas in the posterior parietal cortex (PPC) and prefrontal cortex (PFC) has been implicated in the control of both spatial and nonspatial deployments of visual attention (Corbetta et al., 1991; Kastner and Ungerleider, 2000; Vandenberghe et al., 2001; Corbetta and Shulman, 2002; Yantis et al., 2002; Giesbrecht et al., 2003; Liu et al., 2003; Serences et al., 2004).
Although the role of the frontoparietal network in controlling visual selective attention is well established, its role in controlling shifts of attention between and within other sensory modalities is less well understood. Most relevant studies have only indirectly suggested that the functions of the parietal lobe might include goal-directed control of attention across modalities (Teder-Salejarvi et al., 1999; Downar et al., 2001; Macaluso et al., 2002b).
We examined cortical activity in human observers during shifts of attention between vision and audition using rapid event-related functional magnetic resonance imaging (fMRI). Observers viewed a visual display containing five rapid serial visual presentation streams of letters appearing in the center of a computer screen; at the same time, they listened via headphones to three streams of letters spoken at the same rate as the visual stream (see Fig. 1). The central visual stream or the central (binaural) auditory stream contained target digits to be detected and identified; distractor streams of letters were continuously present in both modalities. Before each run, an auditory instruction directed attention to either the auditory or the visual stream. The subjects' task was to detect digits in the attended stream. Depending on the identity of the digit, subjects were to either shift their attention from one modality to another or maintain their attention within the currently attended modality. This design permits the measurement of the neural events that are time locked to the act of shifting attention, unconfounded by sensory responses to a distinct attention cue. The nearby distractors maximize sensory competition and, hence, the opportunity for attentional modulation (Desimone and Duncan, 1995; Reynolds et al., 1999).
Materials and Methods
Subjects. Eleven neurologically healthy young adults (19-35 years of age; mean, 24.8 years; seven women) were recruited from the Johns Hopkins University community. Informed consent was obtained from each subject in accordance with the human subjects research protocol approved by the Institutional Review Board at Johns Hopkins University.
Stimuli. Auditory stimuli were recorded digitally using the Computerized Speech Lab software (CSL; Kay Elemetrics, Lincoln Park, NJ). One female and one male talker, both native English speakers, produced two digits and 16 letters (2, 4, A, C, F, G, H, J, K, M, N, P, R, T, U, V, X, and Y). The utterances were edited to be exactly 240 msec in duration with an additional 10 msec of silence added at the end of each utterance, yielding a total duration of 250 msec for each utterance. The male voice was presented binaurally and served as the central auditory target stream. Two different monaural sequences of letters spoken by the female voice were presented to the left and right ears, respectively. The amplitude of each auditory distractor stream was ∼70% of the amplitude of the target stream; all streams were presented well above threshold.
The visual stimuli were white characters (the same set of letters and digits as that used for the auditory stimuli) rendered on a black background. Each character subtended ∼0.65° of visual angle horizontally and 0.8° vertically. The target stream was presented in the center of the display, and four distracting streams were presented surrounding the central stream with an edge-to-edge separation of 0.25°.
Procedure. At the beginning of each run, subjects heard the word “audition” or “vision,” which instructed them to begin the run by attending to the auditory or visual target stream, respectively (Fig. 1). Subjects pressed a button held in their dominant hand whenever they detected the digits 2 or 4 within the attended stream. For one-half of the subjects, the digit 2 instructed them to shift attention from the currently attended stream to the unattended stream (e.g., vision to audition), whereas the digit 4 instructed them to maintain their attention on the currently attended stream; this mapping was reversed for the remaining subjects.
The order of targets was random, with two constraints: (1) no more than two hold events occurred in succession, and (2) targets never appeared in the unattended streams. Targets within the attended stream were separated by a temporal interval that was pseudorandomly jittered between 3 and 5 sec, with an average intertarget interval of 4 sec (Dale and Buckner, 1997). Such temporal jittering allows for the extraction of individual event-related blood oxygenation level-dependent (BOLD) signal time courses after the target events (Burock et al., 1998). Each subject performed two practice runs and 10 experimental runs; each run was 2 min, 28 sec in duration and included eight occurrences of each of the four target types: attend vision, attend audition, switch attention from vision to audition, and switch attention from audition to vision (a total of 32 target events per run). The subjects were instructed to hold attention on the currently attended stream even if they thought they had missed a target. Only the detected target events were included in our analysis.
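The sequencing constraints described above can be illustrated with a short sketch. This is not the authors' stimulus code. It assumes rejection sampling over shuffled hold/shift labels, a run that begins with attention on vision, and intertarget intervals drawn in 250 msec steps (one stimulus frame) between 3 and 5 sec; the names `make_run`, `FRAME_MS`, and `N_PER_TYPE` are hypothetical.

```python
import random

FRAME_MS = 250        # one stimulus every 250 msec, as in the Stimuli section
N_PER_TYPE = 8        # eight targets of each of the four types per run
OTHER = {"vision": "audition", "audition": "vision"}


def make_run(rng, start="vision"):
    """Return [(onset_ms, target_type), ...] for one run (illustrative sketch)."""
    labels = ["hold"] * (2 * N_PER_TYPE) + ["shift"] * (2 * N_PER_TYPE)
    while True:
        rng.shuffle(labels)
        # Constraint: no more than two hold events in succession.
        if any(labels[i:i + 3] == ["hold"] * 3 for i in range(len(labels) - 2)):
            continue
        attended, types = start, []
        for label in labels:
            if label == "shift":
                attended = OTHER[attended]
                types.append("shift to " + attended)
            else:
                types.append("hold " + attended)
        # Accept only orders with exactly eight targets of each type.
        if sorted(types.count(t) for t in set(types)) == [N_PER_TYPE] * 4:
            break
    # Assumed jitter: 3-5 sec in 250 msec steps (mean of 4 sec in expectation).
    steps = range(3000, 5001, FRAME_MS)
    onset, run = 0, []
    for target_type in types:
        onset += rng.choice(steps)
        run.append((onset, target_type))
    return run


run = make_run(random.Random(0))
print(run[:3])
```

Because targets are generated only for the currently attended modality, the second constraint (no targets in unattended streams) holds by construction in this sketch.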
Monitoring eye position. Switch-related activity could, in principle, be attributed to movement of the eyes away from the central region of visual stimulation or to subjects closing their eyes whenever targets were delivered within the auditory stream. To check for these possibilities, eye position was monitored via a custom-made video camera for 6 of the 11 subjects while they performed the task in the scanner. The output of the camera was digitally recorded and later analyzed for eye movements. A calibration procedure established that eye position deviations of 2.5° or more could be detected reliably, as could blinks and eyelid closure. All of the monitored subjects kept their eyes open and directed to the center of the display throughout the task, and no significant changes in fixation were made during the experimental runs. This conclusion was supported by the finding that the locus of attention modulated the magnitude of cortical activity in early visual areas and did not cause a spatial displacement of the locus of cortical activity, which would have been expected if observers diverted their eyes from the visual streams during those intervals.
MRI data acquisition. Imaging data were acquired with a 3 T Philips Gyroscan MRI scanner (Philips, Bothell, WA). Whole-brain echoplanar functional images (EPIs) were acquired with a sensitivity-encoding (SENSE) head coil (MRI Devices, Waukesha, WI). Twenty-seven transverse slices were acquired [repetition time (TR), 1480 msec; echo time (TE), 30 msec; flip angle, 65°; SENSE factor, 2; field of view, 240 mm; matrix, 64 × 64; slice thickness, 3 mm with 1 mm gap, yielding voxels that were 3.75 × 3.75 × 3 mm in size]. These parameters allowed for a whole-brain coverage with 100 volume acquisitions per run. High-resolution anatomic images were acquired with a T1-weighted, 200 slice magnetization-prepared rapid gradient echo sequence with SENSE level 2 (TR, 8.1 msec; TE, 3.8 msec; flip angle, 8°; prepulse T1 delay, 852.1 msec; isotropic voxels, 1 mm).
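The acquisition parameters above are mutually consistent, as a short calculation shows. The following Python sketch uses only values stated in the text; the assumption that the 1 mm gaps lie only between adjacent slices (so that 27 slices have 26 gaps) is ours.

```python
# Values taken from the acquisition parameters reported above.
TR_MS = 1480          # repetition time
N_VOLUMES = 100       # volumes per run
FOV_MM = 240          # field of view
MATRIX = 64           # in-plane matrix size
N_SLICES = 27
SLICE_MM = 3          # slice thickness
GAP_MM = 1            # interslice gap

in_plane_mm = FOV_MM / MATRIX                    # in-plane voxel size
run_ms = TR_MS * N_VOLUMES                       # run duration in msec
minutes, seconds = divmod(run_ms // 1000, 60)
# Assumption: gaps occur only between adjacent slices.
coverage_mm = N_SLICES * SLICE_MM + (N_SLICES - 1) * GAP_MM

print(in_plane_mm)        # 3.75
print(minutes, seconds)   # 2 28
print(coverage_mm)        # 107
```

The run duration of 2 min, 28 sec matches the value given in the Procedure section.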
fMRI data analysis. Neuroimaging data were analyzed using BrainVoyager software (Brain Innovation, Maastricht, The Netherlands). First, functional data were motion and slice-time corrected and then high-pass filtered to remove components occurring three or fewer times over the course of the run. Spatial smoothing was not used. To correct for between-scan motion, each subject's EPI volume was coregistered to the 10th run acquired for that subject (the last functional run performed before the anatomical scan). After this interscan motion correction, functional data were registered with the anatomical volume.
Using a general linear model, the magnitude of the hemodynamic response was estimated for each time point after each event type. Specifically, we separately estimated the scaled fit coefficients (beta weights) at the onset of the target and the subsequent 11 time points (0-16.3 sec) for each event (hold vision, shift to audition, hold audition, and shift to vision) (http://afni.nimh.nih.gov/pub/dist/doc/3dDeconvolve.pdf). The magnitude of the beta weight associated with each time point reflects the relative change in the BOLD signal at that time point after each event.
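The time-point-by-time-point estimation described above corresponds to a finite impulse response design, in which each event type contributes one indicator regressor per post-target time point (4 event types × 12 time points = 48 regressors). The sketch below builds such a design matrix in plain Python. It is illustrative only and is not the authors' pipeline: the function name `fir_design`, the column ordering, and the rounding of target onsets to the nearest volume are assumptions.

```python
TR_S = 1.48
N_VOLUMES = 100
N_LAGS = 12  # target onset plus the subsequent 11 time points
EVENT_TYPES = ["hold vision", "shift to audition", "hold audition", "shift to vision"]


def fir_design(onsets_by_type, n_volumes=N_VOLUMES, n_lags=N_LAGS):
    """Build a finite impulse response design matrix as a list of rows.

    onsets_by_type maps an event type to target onsets given as volume
    indices (assumption: onsets are rounded to the nearest TR).
    Columns hold all lags for the first event type, then the second, etc.
    """
    n_cols = len(EVENT_TYPES) * n_lags
    X = [[0.0] * n_cols for _ in range(n_volumes)]
    for e, event in enumerate(EVENT_TYPES):
        for onset in onsets_by_type.get(event, []):
            for lag in range(n_lags):
                row = onset + lag
                if row < n_volumes:  # ignore lags past the end of the run
                    X[row][e * n_lags + lag] = 1.0
    return X


# Post-target time (sec) covered by each lag; the last is about 16.3 sec.
lag_times = [round(lag * TR_S, 2) for lag in range(N_LAGS)]

X = fir_design({"hold vision": [10], "shift to audition": [50]})
print(len(X), len(X[0]), lag_times[-1])
```

Fitting this matrix to a voxel's time series by least squares yields one beta weight per event type and time point, which is the quantity the contrasts in the text operate on.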
To test for modality-specific attentional modulations, two targeted beta weight contrasts were constructed to represent the crossover interaction pattern observed in previous studies (Yantis et al., 2002; Liu et al., 2003; Serences et al., 2004). For sensory regions that respond more to audition than to vision, we constructed a contrast involving 24 weighted regressors, including regressors associated with six time points after each of the four targets (hold audition, hold vision, switch from audition to vision, and switch from vision to audition). The six regressors indexing activity ∼3-13 sec after hold audition targets were assigned positive values, and the six regressors indexing activity during this same interval after hold vision targets were assigned negative values. To model the crossover in activity evoked by the shift from vision to audition, the three regressors corresponding to time points 2-4 after the target (∼3-6 sec after target onset) were assigned a negative value, and the regressors corresponding to time points 7-9 after the target (∼10-13 sec after target onset) were assigned a positive value. Similarly, the three regressors corresponding to time points 2-4 after switching from audition to vision targets were assigned a positive value, and the regressors corresponding to time points 7-9 after the target were assigned a negative value. A second contrast was conducted with the opposite sign to identify regions that respond more to visual than to auditory input. The data were also analyzed using a generic, less targeted contrast (Yantis et al., 2002; Liu et al., 2003), and this yielded similar results.
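The sign pattern of the targeted contrast can be written out explicitly. The sketch below is a hedged reconstruction: the text specifies only the signs, so unit-magnitude weights are an assumption, as is the choice to use the same six time points (2-4 and 7-9) for the hold events as for the shift events. The function names are hypothetical.

```python
EARLY = (2, 3, 4)   # time points ~3-6 sec after target onset
LATE = (7, 8, 9)    # time points ~10-13 sec after target onset


def auditory_contrast(weight=1.0):
    """Weights keyed by (event type, time point) for auditory-preferring regions.

    Assumptions: unit-magnitude weights, and hold events weighted at the
    same six time points as the shift events.
    """
    c = {}
    for tp in EARLY + LATE:
        c[("hold audition", tp)] = +weight
        c[("hold vision", tp)] = -weight
    for tp in EARLY:
        # Just after the target, activity still reflects the old modality.
        c[("shift vision to audition", tp)] = -weight
        c[("shift audition to vision", tp)] = +weight
    for tp in LATE:
        # Later, activity reflects the newly attended modality.
        c[("shift vision to audition", tp)] = +weight
        c[("shift audition to vision", tp)] = -weight
    return c


def visual_contrast(weight=1.0):
    """Same contrast with the opposite sign, for visual-preferring regions."""
    return {key: -w for key, w in auditory_contrast(weight).items()}


aud = auditory_contrast()
print(len(aud), sum(aud.values()))
```

The 24 weights sum to zero, so the contrast is insensitive to an overall shift in signal level and responds only to the crossover pattern.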
For each contrast, the single-voxel threshold in the group data was set to t(11) = 4.0, p < 0.002, uncorrected. A spatial cluster extent threshold was used to correct for multiple comparisons using AlphaSim (Alpha Simulations) with 2000 Monte Carlo simulations and taking into account the entire EPI matrix (http://afni.nimh.nih.gov/pub/dist/doc/AlphaSim.pdf). This procedure yielded a minimum cluster size of three voxels in the original acquisition space with a map-wise false-positive probability of p < 0.01.
Event-related averages of the BOLD signal were extracted from each significantly activated cluster, as revealed by the two contrasts. The time course was extracted for each subject using the following procedure: a temporal window was defined, extending from 6 sec before the target onset (4 TRs) to 16 sec after the target onset. Time courses were then averaged across all subjects for each of the four event types. The “baseline” (or 0% signal change) was defined as the mean activity during the 6 sec preceding each target. It is important to note that negative deflections in activity cannot necessarily be interpreted as “deactivations” but rather as relative differences in activity after a given event.
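The averaging and baseline procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: onsets are given as volume indices, the window spans 4 volumes before to 11 volumes after the target (about 6 sec before to about 16 sec after at a TR of 1.48 sec), and events whose window extends beyond the run are skipped. The function name `event_related_average` is hypothetical.

```python
PRE_TRS = 4     # ~6 sec before target onset at TR = 1.48 sec
POST_TRS = 11   # ~16 sec after target onset


def event_related_average(signal, onsets):
    """Average percent signal change around each onset (volume indices).

    The baseline (0% signal change) is the mean of the PRE_TRS samples
    preceding each target, computed separately for every event.
    """
    epochs = []
    for onset in onsets:
        start, stop = onset - PRE_TRS, onset + POST_TRS + 1
        if start < 0 or stop > len(signal):
            continue  # assumption: skip events whose window leaves the run
        window = signal[start:stop]
        baseline = sum(window[:PRE_TRS]) / PRE_TRS
        epochs.append([100.0 * (v - baseline) / baseline for v in window])
    if not epochs:
        return []
    return [sum(col) / len(epochs) for col in zip(*epochs)]


# Synthetic series: flat at 100 with a single 2% deflection two volumes after onset.
signal = [100.0] * 40
signal[22] = 102.0
print(event_related_average(signal, [20])[PRE_TRS + 2])  # 2.0
```

Because each event is referenced to its own pre-target baseline, a negative value in the average indicates only that activity fell relative to the preceding 6 sec, in line with the caution about interpreting negative deflections.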
Responses to auditory targets were slightly slower than to visual targets, and there was an additional delay associated with the need to shift attention from one modality to another. Mean response times were 556, 615, 640, and 654 msec for hold vision, shift from vision to audition, hold audition, and shift from audition to vision targets, respectively. A two-way repeated-measures ANOVA was conducted, with target type (hold and shift) and modality (vision and audition) as within-subject factors (modality for the shift events was defined relative to the starting modality, i.e., the shift from audition to vision was defined relative to the auditory modality). There was a significant main effect of modality (F(1,10) = 9.18; p < 0.05) and of target type (F(1,10) = 17.8; p < 0.01). The interaction approached significance (F(1,10) = 4.35; p = 0.085). Detection accuracy was 98, 99, 97, and 98% for hold vision, shift from vision to audition, hold audition, and shift from audition to vision, respectively. No significant effect of modality or target was observed for accuracy (F < 2).
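The effects tested by the ANOVA can be read directly from the four reported mean response times. The following sketch computes the corresponding differences of means; it reproduces no inferential statistics, because the per-subject data are not given in the text. Modality is coded by the starting modality, as in the analysis above.

```python
# Mean response times (msec) reported in the text, keyed by
# (starting modality, target type).
rt = {
    ("vision", "hold"): 556,
    ("vision", "shift"): 615,     # shift from vision to audition
    ("audition", "hold"): 640,
    ("audition", "shift"): 654,   # shift from audition to vision
}

# Cost of shifting relative to holding, within each starting modality.
shift_cost = {m: rt[(m, "shift")] - rt[(m, "hold")] for m in ("vision", "audition")}

# Marginal differences corresponding to the two main effects.
modality_effect = (
    (rt[("audition", "hold")] + rt[("audition", "shift")]) / 2
    - (rt[("vision", "hold")] + rt[("vision", "shift")]) / 2
)
target_effect = (
    (rt[("vision", "shift")] + rt[("audition", "shift")]) / 2
    - (rt[("vision", "hold")] + rt[("audition", "hold")]) / 2
)
# Difference between the two shift costs, corresponding to the interaction.
interaction = shift_cost["vision"] - shift_cost["audition"]

print(shift_cost)        # {'vision': 59, 'audition': 14}
print(modality_effect)   # 61.5
print(target_effect)     # 36.5
print(interaction)       # 45
```

On these means, the shift cost is larger when attention starts on vision (59 msec) than when it starts on audition (14 msec), which is the pattern underlying the interaction term.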
Attentional modulation of sensory cortical activity
Cortical regions exhibiting the crossover time course characteristic of attention shifts in auditory sensory regions are listed in Table 1; these include bilateral activations along the superior temporal gyrus (STG) in Heschl's gyrus (Fig. 2a), the locus of early auditory sensory cortex (Tzourio et al., 1997). The group mean event-related BOLD time course extracted from the left and right STG is shown in Figure 2, b and c, respectively.
The magnitude of the BOLD signal was greater after hold audition targets than after hold vision targets. This difference was already evident at the moment the targets themselves were presented (time 0 on the abscissa). This is because observers were already attending to either the auditory stream or the visual stream when the target in question appeared.
A markedly different BOLD time course in these regions was elicited by shift targets. After shift from audition to vision targets, activity in STG (bilaterally) began by closely tracking the hold audition time course but, after ∼2 sec, decreased toward the hold vision time course. The opposite pattern was observed for shifts of attention from vision to audition. This crossover pattern reflects the dynamics of sensory modulation during a crossmodal shift of attention. The 2 sec delay before the crossover begins is attributable to the sluggish hemodynamic response of the BOLD signal (Boynton et al., 1996).
Several brain regions exhibited greater activity when attention was directed to vision than when it was directed to audition (Table 1), including the bilateral fusiform gyrus (Fig. 2d) and right middle temporal cortex. These attentional modulations of cortical activity in auditory and visual sensory regions are similar to those reported previously in vision (Corbetta et al., 1990; Treue and Maunsell, 1996; Hopfinger et al., 2000; Yantis et al., 2002; Serences et al., 2004) and in comparisons of attention to vision and touch (Macaluso et al., 2000). The crossover pattern that accompanies shifts of attention from one modality to another mirrors that reported previously in vision (Yantis et al., 2002; Liu et al., 2003; Serences et al., 2004). These results confirm that, in these perceptually demanding tasks, top-down attentional modulation serves to bias cortical competition in favor of behaviorally relevant sensory input (Moran and Desimone, 1985; Motter, 1994; Desimone and Duncan, 1995; Reynolds et al., 1999).
The central focus of this study is to examine the control of crossmodal attention shifts. We contrasted the magnitudes of the regression weights associated with shift targets (shift to vision and shift to audition) with the hold regressors (hold vision and hold audition). Cortical regions exhibiting a significantly greater BOLD response after shift events than after hold events included right precuneus/superior parietal lobule (SPL), left inferior parietal lobe, and right medial frontal gyrus (Table 1). The group mean event-related BOLD time course for each of the four target types extracted from precuneus/SPL is shown in Figure 3.
Three aspects of these shift-related time courses distinguish them from the time courses seen in the contrast of attention to vision versus audition (Fig. 2). First, of course, is that shift-evoked responses occurred in parietal and frontal regions and not in primary sensory regions. Second, these regions exhibited no difference in activity for the attend-vision and attend-audition events at the onset of the target event (i.e., time 0). This distinguishes the pattern seen here from that in sensory areas, in which significant differences in the level of cortical activity were already apparent at time 0. Third, unlike the crossover pattern seen in sensory regions, BOLD activation differed only slightly in magnitude for shifts of attention from audition to vision versus shifts from vision to audition (see footnote a).
The present study examined cortical activity changes measured with fMRI during shifts of attention between vision and audition. Attention significantly modulated the magnitude of cortical activity in early visual and auditory sensory regions, and the dynamics of attention shifts were reflected in a crossover pattern of activity. These results corroborate and extend previous demonstrations of attentional modulation within vision and audition, respectively (Moran and Desimone, 1985; O'Craven et al., 1997; Tzourio et al., 1997; Jancke et al., 1999; Reynolds et al., 1999; Yantis et al., 2002), and confirm that our task effectively recruited selective attention and therefore could be used to examine the control of crossmodal attention shifts.
By holding constant the sensory input and response demands of the task, we could examine cortical activity that reflected the initiation of a voluntary shift of attention. We observed transient increases in activity that were time locked to the initiation of crossmodal attention shifts in PPC and PFC. These dorsal frontoparietal attentional control signals are similar to those reported previously during shifts of attention within vision (Corbetta and Shulman, 2002). Previous studies from our laboratory, using a very similar experimental design but examining attention shifts in other perceptual domains (locations, features, and objects, respectively), revealed similar transient increases in activation during attention shifts within similar cortical regions (Yantis et al., 2002; Liu et al., 2003; Serences et al., 2004). Because of differences in stimuli and the lack of within-subject comparisons, strong conclusions about the degree of overlap in activation loci must await future studies.
As noted earlier, there was no difference in the magnitude of activity in these shift-related regions at the moment the target in question appeared. This suggests that these areas do not provide a sustained signal to maintain a modality-specific attentive state. However, in this task, observers were never in a “neutral” or “relaxed” attentive state; instead, attention was continuously focused on either vision or audition throughout each run (except for brief intervals during the attention shifts themselves). It could be the case that the frontoparietal regions are continuously active throughout the task and maintain focused attention and that their activity is transiently enhanced when a shift of attention is required. Because we had no baseline for comparison, we could not test this hypothesis in the present paradigm.
Studies of attention shifts within vision (e.g., shifts of attention from left to right) have often shown modulation of sensory activity in early areas of the visual pathway, including V1, V2, and V4 (Tootell et al., 1998; Yantis et al., 2002). In contrast, the present experiment revealed attentional modulation in lateral occipital regions but not in striate or early extrastriate regions of the occipital lobe. One possible reason is that full attentional modulation requires sensory competition from stimuli within the same modality, which was the case in previous studies. Here, although targets in each modality were accompanied by distracting stimuli in that modality, the primary locus of sensory competition came from stimuli in a different modality, and this may have limited the magnitude of modulation by attention in the earliest visual areas. If selective attention operates by directly modulating competitive neural interactions, stronger within-modality competition would be expected to lead to stronger within-modality attentional modulation (e.g., shifting attention between two visual locations). Tests of this conjecture await additional studies.
Our findings support several conclusions. First, as shown in previous studies, early sensory cortical responses (in auditory and visual cortices) are modulated by attention. The “push-pull” effect of switching attention between vision and audition suggests a neural basis for behavioral evidence that focusing attention on auditory input (e.g., a cellular telephone conversation) can impair the detection of important visual events (e.g., while driving an automobile) (Strayer and Johnston, 2001; Strayer et al., 2003). When attention must be directed to audition, the strength of early cortical representations in the visual system is compromised (and vice versa), leading to potentially significant behavioral impairments.
Second, we observed transient increases in activity in PPC and superior frontal sulcus/precentral gyrus after signals to shift attention from vision to audition or vice versa (Table 1). This suggests that these areas mediate executive control over attention shifts between vision and audition. Two possible implications of this finding can be considered. First, PPC activity at the initiation of a shift of attention from vision to audition and vice versa could reflect a (crossmodal) spatial attention shift (from the center of the head to the center of the computer screen and vice versa). On this account, PPC provides a supramodal spatial map that determines the current locus of attention regardless of sensory modality [for a discussion of supramodal spatial maps in parietal cortex, see the study by Macaluso et al. (2002a)]. In other words, when attention is directed to the viewing screen during epochs of attention to vision, then (on the supramodal account) auditory attention is also directed there and is therefore directed away from the location of the auditory letter stream (i.e., the center of the head). Similarly, when attention is directed to the center of the head during auditory epochs, then visual attention is also directed there and is therefore directed away from the visual letter stream (on the computer screen in front of the subject). This is consistent with the known role of PPC in visually guided behavior (Andersen and Buneo, 2002) and in mediating shifts of spatial attention (Corbetta et al., 1991; Vandenberghe et al., 2001; Yantis et al., 2002; Bisley and Goldberg, 2003).
An alternative possibility is that PPC mediates both spatial and nonspatial shifts of attention within and across multiple modalities. This account is supported by findings that nonspatial visual attention shifts evoke transient activity in PPC (Liu et al., 2003; Serences et al., 2004). This would indicate that the role of PPC in selective attention is more general than believed previously (Wojciulik and Kanwisher, 1999; Corbetta and Shulman, 2002; Macaluso et al., 2003) and is not limited to purely visual, visuomotor, or spatial cognitive operations. Experiments investigating the control of nonspatial shifts of attention within audition and other nonvisual sensory modalities will be required to discriminate among these possibilities.
Footnote a. A repeated-measures ANOVA was conducted with time (4-10 sec after the target onset) and shift event (shift from audition to vision and shift from vision to audition) as within-subject factors and percentage of signal change as the dependent measure. The ANOVA revealed a significant main effect of time (F(4,40) = 7.82; p < 0.001) and a trend toward a significant interaction (F(4,40) = 2.35; p = 0.07). This interaction was primarily driven by the difference evident at the 7 sec time point. The same analysis in the inferior parietal lobe and superior frontal sulcus/precentral gyrus yielded no significant interaction.
This work was supported by National Institutes of Health Grant R01-DA13165 (S.Y.). We thank Ed Connor, Amy Shelton, Brenda Rapp, J. T. Serences, T. Liu, J. B. Sala, F. Tong, M. Behrmann, X. Golay, T. Brawner, and K. Kahl for comments, advice, and assistance.
Correspondence should be addressed to Sarah Shomstein, Department of Psychology, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213.
Copyright © 2004 Society for Neuroscience 0270-6474/04/2410702-05$15.00/0