Attending to a certain region in space enhances activity in visual areas retinotopically mapped to this region; stimuli presented in this region are preferentially processed. The zoom lens model of visual attention proposes that the attended region can be adjusted in size and, because processing capacity is limited, predicts a tradeoff between its size and processing efficiency. Using event-related functional magnetic resonance imaging, we analyzed neural activity in multiple visual areas as a function of the size of an attended visual field region, which was defined by a spatial cue stimulus. After cueing, a target object, defined by a specific feature conjunction, had to be identified among objects within the cued region. Neural activity preceding the onset of the objects in multiple retinotopic visual areas correlated with the size of the attended region, as did subjects' performance. While the extent of activated retinotopic visual cortex increased with the size of the attended region, the level of neural activity in a given subregion decreased. These findings are consistent with the physiological predictions of the zoom lens model. Size-related modulations of neural activity were pronounced in early visual areas. We relate this finding to the small receptive fields of these areas, whereby only neuronal units with receptive fields covering the attended region received a top-down bias. This preactivation of neuronal units may then have gated selective processing of the features of the object that appeared at the attended location, thus enabling feature integration and object identification.
Typical visual scenes contain a variety of different objects, which, because of the limited processing capacity of the visual system, cannot all be processed simultaneously. Therefore, the selection of information relevant to our current behavior is mediated by visual attention. This selection is often compared with a spotlight that highlights a particular region in space, where information processing is facilitated at the expense of information at other locations (Posner and Petersen, 1990). A neural basis of the attention spotlight has been found in recent functional magnetic resonance imaging (fMRI) studies in which subjects were asked to attend to a region in the visual field periphery while maintaining central fixation (Tootell et al., 1998; Brefczynski and DeYoe, 1999; Martinez et al., 1999; Somers et al., 1999). In these studies, visual areas retinotopically mapped to the attended location showed enhanced neural activity, even in the absence of any visual stimulus (Kastner et al., 1999). Thus, visuospatial attention is proposed to be retinotopically organized.
However, in real-life situations attention cannot always be directed to a single stimulus at a certain location. Instead, two or more objects of a scene might be relevant at the same time. The attended region then has to be adjusted in size to process information from these locations in parallel. For example, when driving a car, attention should cover as much of the visual field as possible to notice any upcoming obstacle, at the cost of fine resolution. Conversely, when trying to find a certain street name on a map, it would be helpful to narrow the attention focus to the relevant region, allowing the identification of small details. Accordingly, Eriksen and St. James (1986) (see also Eriksen and Yeh, 1985; Castiello and Umilta, 1990) suggested a “zoom lens” model of visual attention, in which the size of the attention focus can be varied continuously. This model is supported by behavioral evidence for a tradeoff between the size of the covered region and processing efficiency, i.e., resolution.
In the present study, we aimed to extend previous findings of physiological correlates of spatial attention (Brefczynski and DeYoe, 1999). We sought physiological evidence for Eriksen and St. James's zoom lens model by examining neural activity in visual areas as a function of the size of the attended region. More specifically, we asked whether (1) the extent of activated visual cortex increases with the size of the attention focus, (2) activity in a given retinotopic subregion representing a certain location in the visual field decreases when the size of the attended region increases, and (3) the level of activation is related to discrimination performance for target stimuli.
To ensure that subjects had to use attention for target identification, we presented targets that were discriminable from the surrounding nontargets only in terms of a conjunction of elementary features (color and form). Under these circumstances, task performance is known to depend on the use of visual attention (Treisman and Gelade, 1980).
Materials and Methods
Subjects. Five healthy students (ages, 23–29 years) from the Humboldt University of Berlin, with normal color vision and sufficient visual acuity, served as subjects in the study, which was conducted in conformity with the Declaration of Helsinki. All subjects (three females, two males) were right-handed and were paid for their participation.
Behavioral procedure. The experimental design is illustrated in Figure 1. To determine attention modulation of the blood oxygen level-dependent (BOLD) response as a function of focus size, a delay period was interposed between the presentation of a cue indicating the size of the region to attend to and the presentation of a stimulus array. Within the precued region the presence of a target object had to be detected.
To facilitate the alignment of spatial attention after cue onset, placeholders at the locations used for the stimulus array were presented during the whole experiment. These were composed of four black squares in the upper hemifield superimposed on a gray background (luminance, 6.8 cd/m²). Each square subtended 3° of visual angle and was centered on an imaginary circle 7.3° off the fixation point. Small squares (visual angle, 0.2°) at 0.5° off the fixation point served as cues. After cueing, simple two-dimensional geometric objects (circle, square, rhombus; ∼2.5°) in three different colors approximately isoluminant to the gray background color were presented in each of the four upper field squares. The blue circle was defined as the target.
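Stimulus sizes above are given in degrees of visual angle, so their physical size on the projection screen depends on the viewing distance, which the text does not report. The sketch below converts angles to centimeters with the standard trigonometric relations; the 60 cm eye-to-screen distance (via the mirror) is a hypothetical value, and the size formula assumes a stimulus centered on the line of sight, which is only approximate for the placeholders at 7.3° eccentricity.

```python
import math


def visual_angle_to_size(angle_deg: float, distance_cm: float) -> float:
    """Physical extent (cm) of a stimulus, centred on the line of sight, that subtends angle_deg."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def eccentricity_to_offset(ecc_deg: float, distance_cm: float) -> float:
    """On-screen distance (cm) from fixation of a point at ecc_deg eccentricity."""
    return distance_cm * math.tan(math.radians(ecc_deg))


# HYPOTHETICAL effective eye-to-screen distance; not reported in the text.
VIEWING_DISTANCE_CM = 60.0

# Angular values taken from the stimulus description.
STIMULI_DEG = {"placeholder square": 3.0, "object": 2.5, "cue square": 0.2}

for name, deg in STIMULI_DEG.items():
    size_cm = visual_angle_to_size(deg, VIEWING_DISTANCE_CM)
    print(f"{name}: {deg} deg -> {size_cm:.2f} cm")

for label, ecc in (("placeholder centre", 7.3), ("cue square", 0.5)):
    offset_cm = eccentricity_to_offset(ecc, VIEWING_DISTANCE_CM)
    print(f"{label}: {ecc} deg -> {offset_cm:.2f} cm from fixation")
```

A useful sanity check is the familiar rule of thumb that 1° spans about 1 cm at a viewing distance of 57.3 cm.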
Each trial started with a fixation period of variable duration (0–2750 msec; step size, 250 msec). Then either one, two, or four of the central small squares turned dark, indicating whether subjects should focus their attention on a small region in space (designated by the middle left placeholder square), a region of medium size (designated by the two squares in the left hemifield), or a large region encompassing all four squares.
The cue squares remained dark (to reduce demands on working memory) during a variable period of 4, 7, or 10 sec. These long intervals were chosen to allow for the delay in the BOLD response. After the cueing period, an array of four objects was presented for 30 msec. If the blue circle was present (in 50% of the trials), subjects had to press a button with their right index finger within 2 sec; if not, they had to press another button with their right middle finger. Subjects were instructed to respond both quickly and accurately. To avoid decision conflicts, no invalid cues were used; that is, the target was either presented within the cued region or not presented at all.
After responding, subjects passively fixated the central cross for 10,320–13,070 msec, depending on the initial offset. Because the duration of the cueing period and the onset of the search array were unpredictable, subjects had to pay attention throughout the whole cueing phase. Trials differing in cueing duration, target presence, and offset time were presented in randomized order. Each size condition (small, medium, large) was repeated 24 times within two scanning sessions. Each session lasted ∼33 min, with a rest of ∼10 min in between. Subjects were instructed to maintain central fixation during the whole experiment.
Fixation control. Because subjects had to maintain central fixation during the whole experiment, their fixation ability was tested in prior training sessions outside the scanner, in which eye movements were recorded with an infrared video eye tracker system (SensoMotoric Instruments, Teltow, Germany). All subjects maintained fixation within 2° of the center in >99% of trials.
Retinotopic mapping and regions of interest. The present report focuses on the modulation of the BOLD response in retinotopic areas by the size of the attention focus. The BOLD response was measured during the cueing period in regions of interest (ROIs) determined in separate sessions, in which subjects passively viewed the test stimuli presented in each square sequentially at a rate of 8 Hz for 4 × 21 sec while maintaining central fixation. The ROIs were further subdivided along retinotopic boundaries (visual areas V1, V2, VP, and V4). The borders of these visual areas were identified in yet another session, in which checkerboard stimuli were presented along the horizontal and vertical meridians (Sereno et al., 1995). ROIs were marked on reconstructed cortical surfaces for each of the five subjects, resulting in 16 ROIs (4 locations × 4 visual areas) per subject. However, because only the middle left location was cued in all conditions, the ROIs for this location were of particular interest, and the reported results focus on these ROIs.
fMRI procedure. fMRI data were acquired with a 1.5 Tesla magnetic resonance imaging system (Magnetom Vision; Siemens, Erlangen, Germany). Subjects' heads were stabilized with a vacuum pillow in a standard head coil. Stimuli were projected onto a back-projection screen by a liquid crystal display projector (NEC 8000; NEC, Stuttgart, Germany). Subjects viewed the screen via a mirror and used a fiberoptic two-button response box to report their responses.
Functional images were taken with a gradient echoplanar imaging sequence [repetition time (TR), 3000 msec; echo time (TE), 51 msec; flip angle, 90°; in-plane resolution, 3.28 × 3.28 mm] in all experiments. During each functional run, 667 volumes of 26 axial slices (3 mm thickness, spanning the cerebral cortex) were collected. Structural three-dimensional data sets were acquired in the same session using a T1-weighted sagittal magnetization prepared-rapid gradient-echo sequence [TR, 10 msec; TE, 4 msec; flip angle, 12°; inversion time (TI), 100 msec; 265 × 256 matrix; 190 sagittal slices (thickness, 1 mm); voxel size, 1 mm³]. Moreover, high-quality structural three-dimensional data sets of all subjects were recorded using a T1-weighted sagittal fast low angle shot sequence (TR, 38 msec; TE, 5 msec; flip angle, 30°; TI, 100 msec; 265 × 256 matrix; 190 sagittal slices; voxel size, 1 mm³).
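As a quick consistency check on the acquisition parameters, 667 volumes at a TR of 3000 msec take 2001 sec, or about 33.4 min, which matches the ∼33 min session length given in the behavioral procedure if each session corresponds to one functional run (an assumption; the text does not state this explicitly). The four volumes discarded during preprocessing leave 663 volumes per run for analysis.

```python
TR_S = 3.0             # repetition time in seconds (from the text)
VOLUMES_PER_RUN = 667  # volumes per functional run (from the text)
N_DISCARDED = 4        # initial volumes dropped for steady-state magnetization


def run_duration_min(n_volumes: int, tr_s: float) -> float:
    """Acquisition time of one functional run in minutes."""
    return n_volumes * tr_s / 60.0


print(f"run length: {VOLUMES_PER_RUN * TR_S:.0f} s "
      f"= {run_duration_min(VOLUMES_PER_RUN, TR_S):.2f} min")
print(f"volumes entering the analysis: {VOLUMES_PER_RUN - N_DISCARDED}")
```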
Stimulus presentation was controlled by a laptop computer using the Experimental Run Time System software package (Beringer, Frankfurt, Germany). The computer was triggered by a transistor-transistor logic (TTL) signal that the scanner sent at the beginning of each volume acquisition. To improve the temporal resolution of the recorded BOLD signal, a variable offset between trigger and stimulus presentation was used, ranging from 0 to 2750 msec in 250 msec steps. Because the total length of a trial was determined by the number of triggers counted, the final fixation period in each trial varied in length depending on the initial offset; that is, the total length of a trial was constant at 18, 21, or 24 sec, depending on the duration of the cueing period.
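The gain in temporal resolution from the jittered offset can be made concrete. In the simplified sketch below, each volume is treated as an instantaneous sample taken at its trigger and the cue appears a given offset after one trigger (both simplifications; slice timing is ignored). Pooling the 12 offsets then samples the post-cue BOLD response on a 250 msec grid instead of the 3000 msec TR. The constant sum of offset and final fixation (13,070 msec) is taken from the endpoints reported in the behavioral procedure, and the trial lengths of 18, 21, and 24 sec correspond to whole numbers of TRs.

```python
TR_MS = 3000                                            # repetition time
OFFSET_STEP_MS = 250                                    # jitter step (from the text)
OFFSETS_MS = list(range(0, 2750 + 1, OFFSET_STEP_MS))   # 0, 250, ..., 2750 ms: 12 offsets
OFFSET_PLUS_FINAL_FIX_MS = 13070                        # 0 + 13,070 = 2,750 + 10,320


def final_fixation_ms(offset_ms: int) -> int:
    """Final fixation shrinks as the initial offset grows, keeping the trial length fixed."""
    return OFFSET_PLUS_FINAL_FIX_MS - offset_ms


def volume_times_rel_cue(offset_ms: int, n_volumes: int) -> list[int]:
    """Volume onsets (ms) relative to cue onset, assuming the cue follows trigger 0 by offset_ms."""
    return [k * TR_MS - offset_ms for k in range(n_volumes)]


# Pool the post-cue sampling points over all offsets (8 volumes = 24 s, the longest trial).
pooled = sorted({t for o in OFFSETS_MS for t in volume_times_rel_cue(o, 8) if t >= 0})
spacing = {b - a for a, b in zip(pooled, pooled[1:])}
print(len(OFFSETS_MS), "offsets; pooled sampling interval(s):", spacing, "ms")
```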
Behavioral data. Mean reaction times (RTs) for correct answers and error rates (in percent) were entered into separate one-way repeated-measures ANOVAs with the factor “size of focus” (small, medium, large). Uncorrected degrees of freedom and p values are reported because Mauchly tests did not reveal violations of sphericity.
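A one-way repeated-measures ANOVA partitions the total variance into a condition term, a subject term (removed from the error), and a condition × subject residual that serves as the error term. The sketch below is a plain NumPy implementation of that textbook partition, not the statistics package used in the study; the Mauchly sphericity test is not included. For a three-level factor the numerator has 2 df, and the F(2, d2) distribution has the closed-form upper tail (1 + 2F/d2)^(-d2/2), which the sketch uses to return an exact p value.

```python
import numpy as np


def f_sf_df1_2(f: float, df2: int) -> float:
    """Upper-tail p value of an F(2, df2) statistic (exact closed form for df1 = 2)."""
    return (1.0 + 2.0 * f / df2) ** (-df2 / 2.0)


def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA on an (n_subjects, k_conditions) array.

    Returns (F, df_effect, df_error, p); p is only computed when df_effect == 2,
    i.e. for a three-level factor such as "size of focus".
    """
    x = np.asarray(data, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_cond = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between conditions
    ss_subj = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between subjects (removed)
    ss_total = np.sum((x - grand) ** 2)
    ss_error = ss_total - ss_cond - ss_subj               # condition x subject residual
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f = (ss_cond / df_cond) / (ss_error / df_error)
    p = f_sf_df1_2(f, df_error) if df_cond == 2 else float("nan")
    return float(f), df_cond, df_error, p
```

With five subjects and three focus sizes, the function returns df = (2, 8), the degrees of freedom reported for the behavioral effects.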
fMRI preprocessing. fMRI data were analyzed with the BrainVoyager 2000 software (Brain Innovation, Maastricht, The Netherlands). Data from each subject were transformed into Talairach space (Talairach and Tournoux, 1988). To allow for steady-state magnetization, the first four scans of each functional run were discarded from analysis. After correction for slice scan time differences within a volume, functional volumes were coregistered with the three-dimensional structural data sets to generate volume–time courses. Volume–time courses were motion-corrected by translating and rotating all remaining volumes with respect to the first volume, using the Levenberg–Marquardt algorithm (Press et al., 1992) to find the least-squares fit, and were then temporally high-pass filtered with a cutoff period of 240 sec.
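The volume dropping and temporal high-pass filtering steps can be sketched as follows. This is an illustration, not BrainVoyager's implementation: the filter here is a swapped-in technique, namely regressing out a discrete cosine basis (as popularized by SPM), with enough basis functions to cover periods down to the 240 sec cutoff. The constant term is kept so that the run mean, needed later for percent signal change, is preserved. Slice-timing correction, coregistration, and Levenberg–Marquardt motion correction are not sketched.

```python
import numpy as np


def dct_highpass(ts, tr_s: float, cutoff_s: float) -> np.ndarray:
    """Remove fluctuations slower than cutoff_s by regressing out a discrete cosine basis.

    The constant (k = 0) term is kept, so the run mean is preserved.
    """
    y = np.asarray(ts, dtype=float)
    n = y.size
    n_basis = int(np.floor(2.0 * n * tr_s / cutoff_s)) + 1   # count includes the constant
    if n_basis <= 1:
        return y.copy()                                       # run too short to filter
    t = np.arange(n)
    basis = np.column_stack(
        [np.cos(np.pi * k * (2 * t + 1) / (2 * n)) for k in range(1, n_basis)]
    )
    # The k >= 1 cosines have zero mean, so fitting them to the demeaned signal is
    # equivalent to fitting them together with a constant regressor.
    beta, *_ = np.linalg.lstsq(basis, y - y.mean(), rcond=None)
    return y - basis @ beta


def preprocess_run(ts, tr_s: float = 3.0, n_discard: int = 4,
                   cutoff_s: float = 240.0) -> np.ndarray:
    """Drop the first n_discard volumes, then high-pass filter the remaining time course."""
    return dct_highpass(np.asarray(ts, dtype=float)[n_discard:], tr_s, cutoff_s)
```

For a 663-volume run at TR = 3 sec, this uses 16 cosine regressors plus the retained mean.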
fMRI: ROI mapping. For mapping of the ROIs, multiple-regression models were fitted to compute statistical maps for the effect of each stimulus location. The predictors (one for each location) were generated by convolving a square-wave function representing the time course of experimental conditions with a γ function (δ = 2.5, τ = 1.25) modeling the hemodynamic impulse response. Voxels activated by the contrast “location of interest” versus “remaining locations” at p < 10⁻⁵ (uncorrected) or better were marked on the surfaces and assigned to visual areas V1 to V4, whose borders were identified by mapping the meridians.
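The predictor construction and location contrast can be sketched as an ordinary least-squares GLM. Several details are assumptions rather than facts from the text. The γ function is taken to be the Boynton et al. (1996) gamma variate with exponent n = 3 and δ, τ in seconds, which peaks at δ + 2τ = 5 sec. The block schedule (each location stimulated for 21 sec, followed by a 21 sec fixation block, four cycles) is hypothetical. The contrast weights (+1 for the location of interest, −1/3 for each remaining location) are one reasonable coding of “location of interest” versus “remaining locations”. Converting t to the p < 10⁻⁵ threshold would additionally require the t distribution, which is omitted here.

```python
import numpy as np


def boynton_gamma(t, delta: float = 2.5, tau: float = 1.25, n: int = 3) -> np.ndarray:
    """Gamma-variate HRF (assumed Boynton et al., 1996 form), scaled to a peak of 1."""
    s = np.clip((np.asarray(t, dtype=float) - delta) / tau, 0.0, None)
    peak = (n - 1) ** (n - 1) * np.exp(-(n - 1))   # value of s**(n-1)*exp(-s) at s = n-1
    return s ** (n - 1) * np.exp(-s) / peak


def block_boxcars(n_cycles=4, block_s=21.0, dt=0.25, n_locations=4) -> np.ndarray:
    """HYPOTHETICAL schedule: locations 1..4, then a fixation block, each block_s long."""
    per_block = int(round(block_s / dt))
    n_blocks = n_cycles * (n_locations + 1)
    boxcars = np.zeros((n_locations, n_blocks * per_block))
    for b in range(n_blocks):
        loc = b % (n_locations + 1)
        if loc < n_locations:                      # the last block of each cycle is fixation
            boxcars[loc, b * per_block:(b + 1) * per_block] = 1.0
    return boxcars


def design_matrix(tr=3.0, dt=0.25) -> np.ndarray:
    """Convolve each boxcar with the HRF at resolution dt, sample once per TR, add a constant."""
    hrf = boynton_gamma(np.arange(0.0, 30.0, dt))
    step = int(round(tr / dt))
    preds = []
    for box in block_boxcars(dt=dt):
        p = np.convolve(box, hrf)[: box.size][::step]
        preds.append(p / p.max())
    return np.column_stack(preds + [np.ones(preds[0].size)])


def contrast_t(X, y, c):
    """Ordinary least-squares GLM fit and t statistic for contrast vector c."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - np.linalg.matrix_rank(X)
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * c @ np.linalg.pinv(X.T @ X) @ c)
    return float(c @ beta / se), dof


# "location of interest" (location 1) vs. the mean of the remaining three locations
C_LOC1 = np.array([1.0, -1 / 3, -1 / 3, -1 / 3, 0.0])
```

A voxel responding only to location 1 yields a large positive t for `C_LOC1`, whereas a voxel responding to another location yields a negative t, so only the former would be assigned to the location-1 ROI.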
fMRI: attention task. Data were analyzed in two ways. First, multiple-regression analyses were computed on a single-subject basis after z normalization across sessions. Data from the 4 sec delay were not analyzed further because of the delay in the BOLD response; the short delay was introduced only to ensure that subjects paid attention at the beginning of the cueing period. Three predictors modeled the BOLD response during cueing (small, medium, large), and one predictor modeled the activity during the presentation of the search array. We compared the activation maps during cueing (p < 10⁻⁵, uncorrected) with those obtained during ROI mapping. In this way, we checked whether attention activated the same retinotopic areas as passive stimulation. Furthermore, to address whether the extent of activated visual subregions varied with the size of the attention focus, the activated cortical surface area (in square millimeters; p < 10⁻⁵, uncorrected) was compared between the three size conditions. The activated subregions of different visual areas tended to merge into one another when attention covered a large region. Moreover, unlike the analysis in predefined ROIs, the extent of activation is strongly determined by the selected significance threshold and is therefore confounded with the level of activation, which differed systematically across visual areas. For these reasons, we did not analyze separate subregions but instead collapsed the data across visual areas.
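Two steps of this first analysis lend themselves to a short sketch: z normalization of each session's time course before the sessions are concatenated for the regression, and summing the cortical surface area of supra-threshold vertices. The per-vertex area values are a hypothetical input (for example, a share of the area of the mesh triangles around each vertex). The functions are illustrations, not BrainVoyager routines.

```python
import numpy as np


def z_normalize_runs(runs) -> np.ndarray:
    """z-score each session's time course separately, then concatenate the sessions."""
    arrays = (np.asarray(r, dtype=float) for r in runs)
    return np.concatenate([(r - r.mean()) / r.std() for r in arrays])


def activated_area_mm2(p_values, vertex_area_mm2, alpha: float = 1e-5) -> float:
    """Total cortical surface area (mm^2) of vertices passing an uncorrected p < alpha.

    vertex_area_mm2 is a HYPOTHETICAL per-vertex area on the reconstructed surface.
    """
    p = np.asarray(p_values, dtype=float)
    area = np.asarray(vertex_area_mm2, dtype=float)
    return float(area[p < alpha].sum())
```

Re-running `activated_area_mm2` with a more lenient `alpha` adds vertices and thus area, which illustrates why the text treats the extent measure as dependent on the chosen threshold.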
In a second approach, we examined the level of activation in a given retinotopic subregion identified during ROI mapping as a function of the size of the attention focus. For this purpose, event-related activity during the cueing phase of the attention task was calculated in individual subjects by averaging the BOLD response across voxels and across repetitions of each condition for each ROI. The 6 sec preceding cue onset (fixation period) served as the baseline; i.e., the mean BOLD signal of each condition during this period of passive fixation was subtracted from all other values to compensate for signal shifts. Because the order of trials was randomized, the different size conditions contributed equally to the activity during fixation, avoiding systematic differences between conditions before cueing (e.g., differences related to arousal). The temporal peaks of the percent signal change between 2 and 9 sec (7 sec cueing period) or between 2 and 12 sec (10 sec cueing period) after cue onset were then extracted for each ROI. A repeated-measures ANOVA with the factors size of focus (small, medium, and large), visual area (V1, V2, VP, and V4), and cueing duration (7 and 10 sec) was calculated on these peak values. Because Mauchly tests revealed no violations of sphericity, degrees of freedom and p values were not adjusted. Pairwise comparisons [least significant difference (LSD)] were calculated when an interaction between factors or a factor with more than two levels had a significant effect.
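The ROI time-course analysis described above amounts to averaging epochs, subtracting a pre-cue baseline, and finding a peak in a window that depends on cue duration. The sketch below assumes that voxel averaging has already produced one time course per trial, and that percent signal change is expressed relative to the 6 sec baseline mean (the text states that the baseline was subtracted but does not name the divisor). The search window runs from 2 sec to the cue duration plus 2 sec, which reproduces the 2–9 sec and 2–12 sec windows.

```python
import numpy as np


def peak_percent_change(epochs, times_s, cue_s: float):
    """Peak percent signal change of one condition's event-related average in one ROI.

    epochs : (n_trials, n_times) ROI-averaged BOLD signal, time-locked to cue onset
    times_s: sample times in seconds relative to cue onset (negative = before the cue)
    cue_s  : cueing duration (7 or 10 s); the peak is searched from 2 s to cue_s + 2 s
    """
    avg = np.asarray(epochs, dtype=float).mean(axis=0)   # average over repetitions
    t = np.asarray(times_s, dtype=float)
    baseline = avg[(t >= -6.0) & (t < 0.0)].mean()       # 6 s of fixation before the cue
    psc = 100.0 * (avg - baseline) / baseline            # percent change (assumed divisor)
    win = (t >= 2.0) & (t <= cue_s + 2.0)                # 2-9 s or 2-12 s after cue onset
    i = int(np.argmax(psc[win]))
    return float(psc[win][i]), float(t[win][i])
```

One peak value per subject, ROI, size condition, and cueing duration would then form the cells of the three-factor repeated-measures ANOVA.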
Results
The behavioral data are presented in Figure 2. Subjects responded fastest and made the fewest errors in the small attention focus condition and were slowest and made the most errors in the large condition (main effect of size of focus: F(2,8) = 8.58, p < 0.01 for RT; F(2,8) = 4.99, p < 0.04 for errors). Pairwise comparisons (LSD) revealed significant differences between the small and medium conditions (p < 0.01 for RT; p < 0.02 for errors) and between the small and large conditions (p < 0.04 for RT; p < 0.05 for errors), but not between the medium and large conditions (p = 0.13 for RT; p = 0.10 for errors).
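For a within-subject factor, LSD pairwise comparisons reduce to uncorrected paired t tests on the per-subject differences between two conditions. Treating them this way reflects how common statistics packages typically implement LSD for repeated measures, which is an assumption about the software used here. The sketch returns t and df for every pair; two-sided p values would follow from a t distribution with n − 1 df (for example, via SciPy), which is not included to keep the block dependency-free.

```python
import itertools
import math

import numpy as np


def lsd_pairwise(data, labels):
    """Uncorrected paired t tests between all condition pairs (Fisher's LSD logic).

    data  : (n_subjects, k_conditions) array, e.g. mean RT per subject and focus size
    labels: k condition names
    Returns {(label_i, label_j): (t, df)}.
    """
    x = np.asarray(data, dtype=float)
    results = {}
    for i, j in itertools.combinations(range(x.shape[1]), 2):
        d = x[:, i] - x[:, j]                      # within-subject differences
        se = d.std(ddof=1) / math.sqrt(d.size)
        results[(labels[i], labels[j])] = (float(d.mean() / se), d.size - 1)
    return results
```

With three focus sizes, this yields the three comparisons reported above (small vs. medium, small vs. large, medium vs. large), each with 4 df for five subjects.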
The first row of Figure 3 shows the topography of ROIs defined by passive stimulation at the four single locations. The second row shows the activity during cueing in the small condition, when the subject expected a target at the middle left location. Note that passive stimulation of this location yielded a very similar activation pattern. Thus, the region of attention modulation for the cued stimulus location corresponded closely to its retinotopic representation in early visual areas. The third and fourth rows show the activation patterns in the medium and large conditions. In the medium condition, activation encompassed a larger cortical surface than in the small condition. In the large condition, additional activation in the left hemisphere was observed. These observations are quantitatively summarized in Figure 4A, which presents the total activated visual cortical surface area, collapsed across visual areas and subjects. An ANOVA with activated surface area as the dependent variable revealed a main effect of size of focus (F(2,8) = 13.83, p < 0.01) but no effect of cueing duration (F(1,4) = 0.15).
From the individual waveforms during cueing, calculated for ROIs representing the middle left location, the peak responses were extracted and averaged across subjects (Fig. 4B). The ANOVA of these peak values revealed a highly significant main effect of size of focus (F(2,8) = 26.75, p < 0.001) but no effect of visual area (F(3,12) = 0.95) or of cueing duration (F(1,4) = 0.01). No size of focus × visual area interaction occurred (F(6,24) = 0.3); that is, attention had the same impact on all visual areas. Following up the main effect of size of focus, pairwise comparisons revealed significant differences between small and medium (p < 0.01) and between small and large (p < 0.01), but not between medium and large (p = 0.32).
Discussion
We tested neurophysiological predictions derived from the zoom lens model of visual attention (Eriksen and St. James, 1986) by correlating neural activity in visual areas (as measured by fMRI) with the size of the attended visual field region and with behavioral performance. Neural activity was analyzed after a cue had indicated whether a small, medium, or large region had to be attended to and before a target stimulus, defined by a conjunction of form and color, could appear within this region.
Corroborating findings of Kastner et al. (1999), we observed attention modulation in visual areas retinotopically mapped to the cued region before stimulus presentation. Furthermore, our behavioral results are in accord with Eriksen and St. James's (1986) original findings insofar as reaction times increased and accuracy dropped with the increasing size of the attended region.
Our main finding is that this behavioral observation was reflected in the level of activation in visual areas V1, V2, VP, and V4. BOLD responses in ROIs retinotopically mapped to a certain location decreased when the size of the attention focus increased. At the same time, when attention had to cover a large region, enhancement could be observed in additional subregions, although to a lesser extent than with a more focused attention beam. We believe that these observations are best explained within the framework of the zoom lens model of visual attention. However, alternative accounts have to be considered.
First, one may argue that the level of arousal was not the same across conditions. However, according to the behavioral results, the large condition was the most difficult and should therefore have required the highest level of arousal. Nevertheless, the peak activity observed in this condition was smaller than in the other conditions. Second, eye movements can be ruled out as a cause of the observed effects. Eye movement recordings outside the scanner showed that subjects were well able to maintain fixation. Moreover, because we measured activity in ROIs defined by peripheral stimulation, eye movements toward these locations during the attention task would have reduced activity in these ROIs, counteracting the observed effects. Third, our design, in which the condition “small” always corresponded to the same middle left location, could be criticized as inducing a tendency to attend preferentially to this location in all conditions. However, pilot experiments had shown no RT differences with respect to target position. Moreover, even if the middle left position had been preferentially processed, this would have reduced the observed BOLD differences between conditions for this location. Fourth, and probably most crucial, one might suppose that subjects, rather than scaling their attention focus to cover the varying number of possible target locations, shifted a small, fixed-size attention beam between these locations whenever the precise location was not known in advance. This would have reduced activity in the medium and large conditions solely because of shortened “dwell times” at a given location. However, shifting during cueing should have yielded a parametric modulation, related to the number of cued locations, in frontoparietal areas known to control attention shifts (Corbetta et al., 2000). We did not observe such differential activation in frontoparietal areas during cueing (data will be presented in future publications), which argues strongly against a shifting strategy.
In summary, the most straightforward explanation for the observed results is provided by the zoom lens model of visual attention. According to this model, limited processing resources can either be focused on a small region, allowing fast and precise processing in this restricted region, or be distributed over a large region, allowing the processing of multiple stimuli at the cost of efficiency. Because neural activity in visual areas is proposed to reflect their processing capabilities (Ress et al., 2000), the parallel decline in neural activity and in processing speed and accuracy with increasing size of the attended region can be taken as physiological evidence for the zoom lens model.
Most previous studies have reported the strongest attention-related modulation in higher visual areas (Kastner et al., 1999). This observation has been explained within the framework of the biased competition model (Desimone and Duncan, 1995; Kastner and Ungerleider, 2001), in which competition among stimuli is proposed to take place predominantly at the level of a neuron's receptive field (RF). As a consequence, attention modulation should have the strongest impact on visual areas with large receptive fields (such as V4), because these RFs contain more competing stimuli and attention cancels out their mutual suppression.
In contrast, in our study attention-related modulation was already pronounced in early visual areas V1 and V2. Although the degree of attention modulation did not differ statistically across areas (i.e., there was no area × size of attended region interaction), on a physiological basis this may nevertheless indicate preferential modulation of early visual areas, because the modulation in higher areas may have been substantially bottom-up driven. We propose two reasons for the pronounced modulation of early visual areas observed in our study. First, neural activity was assessed during the cueing phase, in which, in the sense of biased competition, no stimuli were competing against each other. Support for this argument comes from studies showing that attention before object presentation can strongly bias signals in favor of the attended location in early visual areas, depending on task context (Ito and Gilbert, 1999; Posner and Gilbert, 1999). For example, Ress et al. (2000) observed strong, purely attention-driven enhancement in V1 (i.e., an enhancement that occurred in the absence of visual stimuli) after an auditory cue that told their subjects to watch for a barely detectable low-contrast stimulus. Second, RF sizes were crucial for our experimental design. The smaller RFs in early visual areas are better suited to being adjusted to the varying size of an attended region, allowing neurons with RFs outside the attended region to be switched off (or even inhibited) when attention is narrowly focused. The fact that in our study target stimuli were defined only by a combination of their features (color and form) might have further driven activity in early visual areas. According to Treisman and Gelade (1980) (see also Treisman, 1998), correct feature binding requires precise spatial selection, which can be accomplished only at an early level of visual processing, where RFs are small. By preactivating neural units with RFs restricted to the attended region, features of an object presented within these neurons' RFs can be preferentially processed across the visual system and, as a consequence, interpreted as belonging to the same object, a prerequisite for object identification.
N.G.M. and O.A.B. were supported by the Deutsche Forschungsgemeinschaft. We thank M. Schira for technical assistance and E. Eger, A. Kleinschmidt, and A. Kraft for valuable comments on previous versions of this manuscript.
Correspondence should be addressed to Dr. Notger G. Müller, Department of Neurology, Johann Wolfgang Goethe University, Schleusenweg 2-16, 60528 Frankfurt am Main, Germany. E-mail:.