Inferior temporal cortex neurons have generally been found to have large visual receptive fields that typically include the fovea and extend throughout much of the visual field. However, a problem of such a large receptive field is that it does not easily support object selection by subsequent processing areas, in that all objects within such a large receptive field might activate inferior temporal cortex cells. To clarify this, we recorded from inferior temporal cortex neurons while macaques searched for objects in complex natural scenes or in plain backgrounds, as normally used. Inferior temporal cortex neuron receptive fields were much smaller in natural scenes (mean radius, 11°) than in plain backgrounds (39°). With two objects in a scene, one of which was a target for action (a touch), the firing rates were equally high during foveation of the effective stimulus when it was the target and when it was the distractor in both the plain and the complex scenes. With a plain background and two objects present, the receptive fields were much larger (24°) for the stimulus when it was the target than when it was the distractor (9°). This effect of object-based attention was much less evident in the complex scene, when the receptive fields were small both when the stimulus was a distractor and when it was a target. The results show that the temporal visual cortex provides an unambiguous representation in natural scenes by responding to the object shown at or close to the fixation point.
Inferior temporal cortex (IT) neurons of macaques have responses that provide information about objects or faces (Gross et al., 1972; Perrett et al., 1982; Rolls, 1992, 2000; Booth and Rolls, 1998; Rolls and Deco, 2002). The responses of these neurons are often relatively invariant with respect to the position in the visual field, size, and even view of the object (Gross et al., 1972; Rolls and Baylis, 1986; Tovee et al., 1994; Booth and Rolls, 1998). This is an important property, because when areas that receive input from the inferior temporal visual cortex, such as the orbitofrontal cortex, amygdala, and hippocampus, learn about one view, position, or size of an object, the learning then generalizes to other views, positions, or sizes of the same object (Rolls and Treves, 1998; Rolls and Deco, 2002).
Much visual neurophysiology is conducted with one visual stimulus present in an otherwise blank visual scene (Hubel and Wiesel, 1982; Gross et al., 1985). Even in studies of the neuronal mechanisms of selective attention, there are usually only two small visual stimuli present in the visual field, which is otherwise blank (Chelazzi et al., 1993, 1998; Desimone and Duncan, 1995; Chelazzi, 1998; Chelazzi and Corbetta, 2000). In conditions in which one visual stimulus is present, inferior temporal cortex neurons typically have large receptive fields, ≥50° in diameter, both under anesthesia (Gross et al., 1972) and when the monkey is performing a visual fixation task (Tovee et al., 1994). However, a problem with such large inferior temporal cortex neuron receptive fields is that they do not easily support object selection by subsequent processing areas, in that all objects within such a large neuronal receptive field might activate different inferior temporal cortex cells, so that the output of the inferior temporal cortex might appear as a “tower of Babel.” For example, if multiple objects were present within the large receptive field of inferior temporal cortex neurons, the orbitofrontal cortex and amygdala would retrieve many different reward–punishment associations simultaneously, and the rest of the brain would not know what to approach or avoid. Nor would there be any way of directing action at the correct goal object in such a scene with multiple objects, or indeed in any cluttered natural scene. Therefore, the issue arises of how the visual system operates in a natural and cluttered visual scene.
The aim was to investigate the sizes of the receptive fields of inferior temporal cortex neurons in natural scenes.
Materials and Methods
We measured the magnitude of the responses of inferior temporal cortex neurons when an effective stimulus was shown in blank scenes, in complex natural scenes, and in scenes with one other image present, as is typical in previous studies of attention. In the visual search task, in one condition the effective image was the object of attention, in the sense that the monkey was required to search for that object on the screen and touch it. In another condition, the effective image for the neuron was not the object of attention, in that the monkey was searching for another object to touch.
Recording techniques. The activity of single neurons was recorded with glass insulated tungsten microelectrodes in two macaque monkeys (Macaca mulatta; weight, ∼4–6 kg) in a primate chair using techniques described previously (Rolls et al., 1990; Tovee et al., 1993; Booth and Rolls, 1998). All preparative and subsequent procedures were performed in accordance with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and were licensed under the UK Animals (Scientific Procedures) Act of 1986. The action potentials of single neurons were amplified (Rolls et al., 1979) and converted to digital pulses using the trigger circuitry of an oscilloscope and analyzed online using an IBM-compatible personal computer. The isolation of single neurons was ensured using Brainwave enhanced Discovery data acquisition, using cluster cutting for offline data analysis (DataWave Technologies, Longmont, CO), and establishing that no spikes occurred very close together in time (<3 msec) in the interspike interval histogram. Eye position was monitored and measured with the scleral search coil technique (Judge et al., 1980) using 1 kHz digitization and storage of new values every 20 msec.
Stimuli. The monkeys performed a visual search task in which, if a particular image shown on a computer monitor was touched, the monkey obtained two to three drops of fruit juice for every touch. The monitor was at a distance of 23 cm from the monkey. The entire screen subtended 70° (horizontal) × 55° (vertical) at the retina (with 512 × 512 pixels), and an object typically subtended 9 × 7° (∼3.6 × 2.8 cm on the screen) in a first series of experiments and 5 × 3.5° in a second series of experiments. An example of a typical visual display is shown in Figure 1. (The refresh rate was 100 Hz.) The object had a resolution of 64 × 64 pixels but was prepared in such a way that each object with its particular outline could be presented on either a complex background or a blank background that had a resolution of 512 × 512 pixels. The target object occurred in random positions on the screen from trial to trial within the boundary of the screen. (We note that if the target was diagonally opposite the fixation position, the eccentricity could be >70°, given that the screen width was 70°.) Each trial was preceded by a 0.5 sec tone cue to enable the monkey to look at the screen. The monkey was allowed to touch up to four times to obtain separate aliquots of a taste reward before the next trial started. Trials in which the target object appeared in a blank screen or a natural scene were run in random order within a block of trials. The first block of trials typically involved measuring the neuronal responses when a single object was shown in a blank screen or in a complex natural scene. Other blocks of trials are described below.
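The relationship between the stated visual angles and the stated on-screen sizes can be checked with a short calculation. The following is a minimal sketch, assuming a flat screen and a stimulus centered on the line of sight; the function names are illustrative and are not taken from the original analysis software.

```python
import math

VIEWING_DISTANCE_CM = 23.0  # monitor distance stated in the text


def size_on_screen_cm(angle_deg, distance_cm=VIEWING_DISTANCE_CM):
    """Physical extent of a centered stimulus that subtends angle_deg at the eye.

    Assumes a flat screen perpendicular to the line of sight (an assumption
    of this sketch, not a statement from the paper).
    """
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def visual_angle_deg(size_cm, distance_cm=VIEWING_DISTANCE_CM):
    """Inverse mapping: visual angle subtended by a centered extent size_cm."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


# The 9 x 7 deg object of the first series of experiments.
width_cm = size_on_screen_cm(9.0)
height_cm = size_on_screen_cm(7.0)
print(f"{width_cm:.2f} x {height_cm:.2f} cm")
```

Under these assumptions the 9 × 7° object comes out at roughly 3.6 × 2.8 cm, in agreement with the sizes given above.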
Procedure. Tracks were made into the cortex of the superior temporal sulcus (STS) and the IT; the responses of isolated neurons were measured for a wide variety of small stimuli on the video monitor. These included faces, objects, sine-wave gratings, and boundary curvature descriptors (Rolls and Tovee, 1995). If the neuron responded to some but not other stimuli, the search continued to find an object to which the neuron had a large response (e.g., >50 spikes/sec); it was easy to find another object to which the neuron did not respond. It was also a condition for running the experiment that the neuron did not respond to the background. Most anterior inferior temporal cortex neurons (at coordinates that were typically 3–7 mm posterior to the sphenoid reference; see Fig. 7) did not respond to the background image (which for most experiments was that shown in Fig. 1), but if the neuron did respond, other background images were tried. (In the more posterior inferior temporal visual cortex, at coordinates ∼9–11 mm posterior to the sphenoid reference, clusters of neurons were frequently encountered that did respond to the background images, and the experiment could not be performed with such neurons.)
Once a stimulus-selective cell was found without responses to the complex background natural scene image, blocks of trials were run in which stimuli were shown for which the target object appeared in a blank screen or a natural scene in random order. The first block of trials typically involved measuring the neuronal responses when a single object was shown in a blank screen or in a complex natural scene. The second block of trials typically involved measuring the neuronal responses when two objects were shown, one of which was a target that when touched led to the delivery of a taste reward, and the other of which if touched led to the delivery of aversive saline. The two objects were shown in a random sequence in plain backgrounds or complex natural scenes. In this second block of trials, a noneffective stimulus for the cell was normally the target, so that the effects of attention directed away from the effective visual stimulus could be measured as a function of the distance of the fovea from the effective stimulus when it was not the target for action. In all cases in these experiments, the ineffective stimulus produced no difference of the firing rate from the spontaneous value. The third block of trials was typically similar to the second block, except that in this case the effective object of the pair was the target for action. This condition enabled the neuronal responses to be measured to an effective stimulus when it was the target for action and when it was shown with a single distractor in either a plain background or in a complex scene. In the third block of trials, the monkey quickly learned, when touches of the previously rewarded object resulted in the delivery of a drop of aversive saline, to search for and touch the other object to obtain fruit juice. 
(It was found, as shown in the figures, that there was no effect on the neuronal responses of this stimulus–reward reversal, in that the neuron responded to the stimuli independently of the reward association provided that the monkey looked at the stimuli, as in previous studies (Rolls et al., 1977). Because trial block 3 was run last, the neuronal responses were sometimes slightly smaller on average than in trial blocks 1 and 2.)
Data analysis. The aim of the data analysis was to obtain measures of the firing rate of the cell when fixation was at different distances from the effective object, in both the complex natural scene and in the blank background.
During the experiment, calibration trials for the scleral eye position recording system were run so that the output of the eye position measurement system could be obtained in degrees for every stimulus position on the screen. To obtain calibration data, the output of the eye position monitoring system was measured while the monkey performed a visual fixation task with fixation points in a five-position array (top left, top right, screen center, bottom left, and bottom right) (or, for five neurons in the sample of neurons tested with small stimuli, in a 12-position calibration array). Similarly good eye calibration data were obtained in a similar task in which small objects were shown at the same grid positions, and the monkey touched the objects to obtain a fruit juice reward. In the touch task, the monkey fixated the small stimuli in the period before it touched them (as will be illustrated below). The values obtained during the calibration task enabled other eye position values obtained during the main experiment to be converted to degrees relative to screen center. The conversion program took into account any rotation and shear that was required to transform the eye position values into screen coordinates in degrees relative to screen center. Proof that this procedure worked accurately was that the transform applied to the calibration data placed the stimuli on the screen accurately to within ∼1° over the entire width and height of the screen. One way in which this was confirmed was by showing that the final eye position measured by the procedure was within 1–2° of the stimulus that was being touched wherever the stimulus was on the screen.
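A calibration of this kind can be sketched as a least-squares affine fit from raw eye-coil output to screen coordinates in degrees, which absorbs scale, rotation, shear, and offset in one step. The code below is an illustrative sketch only: the calibration target positions, the raw units, and all function names are hypothetical, and the original conversion program may have worked differently.

```python
import math


def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_affine(raw_points, target_points):
    """Least-squares affine map from raw (u, v) to degrees (x, y).

    Returns ((ax, bx, cx), (ay, by, cy)) such that x = ax*u + bx*v + cx
    and y = ay*u + by*v + cy.
    """
    rows = [(u, v, 1.0) for u, v in raw_points]
    # Normal equations (A^T A) p = A^T t, solved once per output axis.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for axis in range(2):
        atb = [sum(r[i] * t[axis] for r, t in zip(rows, target_points))
               for i in range(3)]
        params.append(tuple(solve_3x3(ata, atb)))
    return tuple(params)


def apply_affine(params, raw_point):
    """Convert one raw eye-position sample to degrees relative to screen center."""
    u, v = raw_point
    return tuple(a * u + b * v + c for a, b, c in params)


# Hypothetical five-position calibration array, in degrees from screen center.
targets_deg = [(-25.0, 20.0), (25.0, 20.0), (0.0, 0.0), (-25.0, -20.0), (25.0, -20.0)]
# Simulated raw readings with arbitrary gain, cross-talk (shear), and offset.
raw = [(40.0 * x + 3.0 * y + 100.0, -2.0 * x + 38.0 * y - 50.0) for x, y in targets_deg]

params = fit_affine(raw, targets_deg)
worst_error_deg = max(
    math.hypot(px - x, py - y)
    for (px, py), (x, y) in zip((apply_affine(params, r) for r in raw), targets_deg)
)
```

With real recordings the residual `worst_error_deg` would not be zero; the text reports that the transformed calibration data placed the stimuli to within ∼1°.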
Typical eye position data collected during the performance of the task are illustrated in Figure 2. The visual stimulus appeared at time 0 in a complex background. The monkey had to make several saccades (three on the trial shown) around the scene before the stimulus was found. Eventually a saccade found the target, and the eyes tended to remain still fixating the object for several hundreds of milliseconds while the monkey repeatedly touched the touchscreen to obtain up to four aliquots of a fruit juice reward. (One aliquot of ∼0.15 ml was delivered for each press from a tube placed 2 mm in front of the monkey's mouth. The monkey typically had to touch the object within 3° of the center of the object to obtain the reward. If more than one object was present in the scene, touching the wrong object, the S− in what was essentially a visual discrimination task, resulted in the delivery of an aliquot of aversive saline taste.) After each saccade the eyes remained still for periods that were typically in the range of 150–250 msec. In the complex natural scene, the monkey sometimes had to make up to eight saccades before its search found the target. There was no clear pattern to these saccades, and it was only when a saccade landed near the object that the monkey reached to touch the object if it was the target of the search. In the blank scene, often one saccade was sufficient, but especially when two stimuli were on the screen, one or two more saccades were sometimes needed, because sometimes the first saccade was to the nontarget object.
The firing rates of a cell as a function of the distance from the effective stimulus were measured during each period in which the eyes were still for ≥100 msec during the search task. The algorithm implemented in a computer program searched for any such period in which the eyes were still to within 1° for ≥100 msec and measured the number of spikes in one or more periods each 100 msec long in which the eyes remained still. This resulted in a large number of firing rate measurements, each of which was at a given distance from the effective visual stimulus for the cell. [The distance from the target object (S+) and the distractor object (S−) is plotted in Figure 2. It is clear when the monkey found the target, because after that time the eyes remained relatively still on or close to the target, and ∼250 msec later the monkey had touched the screen, as indicated by the T values.] It is shown in Figure 2 that, typically for the cells analyzed here, the neuron responded to the stimulus ∼100 msec after the eyes landed on the target. This is a typical response latency for anterior inferior temporal cortex neurons (Baylis et al., 1987); it reflects retinal delays as well as the cortical processing in each of the stages from the primary visual cortex to the inferior temporal cortex, which is on the order of 15 msec per stage (Panzeri et al., 2001). If the delay was different for a particular cell (as shown by latency measurements both in the touchscreen task and in a visual fixation task), the lag parameter was adjusted accordingly (to compensate for the delay between the arrival of the stimulus at the retina and the neuronal response). The output of the firing rate measurement algorithm consisted of several hundred 100 msec firing rate measurements, each taken when the monkey was fixating at different angular distances from the effective stimulus for the neuron being recorded.
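The measurement procedure described above can be sketched as follows. This is an illustrative reconstruction, not the original program: it assumes that "still" means every sample stays within 1° of the first sample of the period, that eye position is available every 20 msec, and that spikes are counted in windows shifted by a lag of 100 msec. All names and the example data are hypothetical.

```python
import math
from bisect import bisect_left

SAMPLE_MS = 20    # eye position stored every 20 msec (from the text)
WINDOW_MS = 100   # length of each firing-rate window (from the text)
STILL_DEG = 1.0   # eyes must stay within 1 deg (from the text)


def still_periods(eye, tol_deg=STILL_DEG, min_ms=WINDOW_MS):
    """Return (start_ms, end_ms, x_deg, y_deg) for each period of steady gaze.

    eye is a time-ordered list of (t_ms, x_deg, y_deg) samples. Stillness is
    judged relative to the first sample of the period (a sketch assumption).
    """
    periods = []
    i = 0
    while i < len(eye):
        t0, x0, y0 = eye[i]
        j = i + 1
        while j < len(eye) and math.hypot(eye[j][1] - x0, eye[j][2] - y0) <= tol_deg:
            j += 1
        end_ms = eye[j - 1][0] + SAMPLE_MS  # last still sample covers one interval
        if end_ms - t0 >= min_ms:
            periods.append((t0, end_ms, x0, y0))
        i = j
    return periods


def rate_measurements(eye, spikes_ms, stim_xy, lag_ms=100):
    """Return (distance_deg, rate_spikes_per_sec) for each 100 msec still window."""
    spikes = sorted(spikes_ms)
    out = []
    for start, end, x, y in still_periods(eye):
        dist = math.hypot(x - stim_xy[0], y - stim_xy[1])
        t = start
        while t + WINDOW_MS <= end:
            lo = bisect_left(spikes, t + lag_ms)
            hi = bisect_left(spikes, t + lag_ms + WINDOW_MS)
            out.append((dist, (hi - lo) * 1000.0 / WINDOW_MS))
            t += WINDOW_MS
    return out


# Hypothetical trial: 200 msec fixating 10 deg away, then 200 msec on the stimulus.
eye = [(t, 10.0, 0.0) for t in range(0, 200, SAMPLE_MS)]
eye += [(t, 0.0, 0.0) for t in range(200, 400, SAMPLE_MS)]
spikes = [310, 330, 350, 420]
measurements = rate_measurements(eye, spikes, stim_xy=(0.0, 0.0))
```

Each still period contributes one measurement per complete 100 msec window, which is how a single search trial can yield several firing rate values at different distances from the effective stimulus.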
The firing rate measurements were then binned into 2° bins, the first including data for 0–2° from the center of the stimulus, the second 2–4°, etc. (For additional analyses, and in some of the graphs for clarity, 5° bins were used.) The means and SEs of the firing rates at different eccentricities were calculated and graphed to give an indication of which effects were significant. Statistical tests such as t tests were performed to test whether (within each block of trials), for example, the firing rates of a cell were different when an effective stimulus was being fixated in plain and complex natural scenes, and one-way and two-way ANOVAs were performed to test additional hypotheses, as indicated in Results. The statistical analyses included checks that the data were approximately normally distributed and nonparametric analyses for additional confirmation, using the methods described by Siegel and Castellan (1988) and Meddis (1984).
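The binning step can be sketched as below. This is a minimal illustration under stated assumptions: bins are half-open (a measurement at exactly 2° falls in the 2–4° bin, which the text does not specify), the SE is the sample standard deviation divided by the square root of n, and the example values are invented for demonstration.

```python
import math
from collections import defaultdict


def bin_rates(measurements, bin_deg=2.0):
    """Group (distance_deg, rate) pairs into eccentricity bins.

    Returns {bin_start_deg: (mean_rate, sem, n)}. Bins are half-open,
    [start, start + bin_deg), which is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for dist, rate in measurements:
        groups[math.floor(dist / bin_deg) * bin_deg].append(rate)

    summary = {}
    for start in sorted(groups):
        rates = groups[start]
        n = len(rates)
        mean = sum(rates) / n
        if n > 1:
            variance = sum((r - mean) ** 2 for r in rates) / (n - 1)
            sem = math.sqrt(variance / n)
        else:
            sem = float("nan")  # SE is undefined for a single measurement
        summary[start] = (mean, sem, n)
    return summary


# Invented measurements: (distance from stimulus in deg, rate in spikes/sec).
example = [(0.5, 40.0), (1.5, 50.0), (2.5, 30.0), (3.9, 20.0), (3.0, 10.0)]
binned = bin_rates(example)
coarse = bin_rates(example, bin_deg=5.0)  # the 5 deg bins used for some graphs
```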
Recording sites. X-radiographs were taken at the end of each recording session to determine the position of the microelectrode relative to bony landmarks and the permanently implanted reference electrodes. At the end of the final tracks, microlesions were made in the areas of cortex in which recordings were made to mark typical recording sites (Feigenbaum and Rolls, 1991). Reconstructions of the tracks were made in serial 50 μm histological sections using the positions of the microlesions and the reference electrodes in the histology, the corresponding x-ray coordinates, and the x-ray coordinates of all recorded cells, to determine the locations of all of the cells.
Results
We recorded from well isolated inferior temporal cortex neurons in three hemispheres of two monkeys. In the course of these recordings, it was possible to find a reasonable number of neurons that responded well and with selectivity to some of the small test images of objects, faces, etc. that were available, with the proportions of different types of responsive cells similar to those we have reported previously (Baylis et al., 1987). (It was also a condition of the experiment that the neurons did not respond to the complex natural background image.) It was possible to complete all of the extensive testing required in these experiments, which took several hours, for one subset of nine cells tested with 9 × 7° stimuli (supplemented by seven additional neurons tested in the condition with one stimulus present in either a plain or complex background) and a second subset of eight cells tested with 5 × 3.5° stimuli. The same results were found in all 17 neurons on which it was possible to complete sufficient testing in all three conditions. [Of the other neurons recorded that could not be used for the experiments described here, only a proportion (∼22%) responded differentially to any of the images of objects and faces in the set of images available, as described by Baylis et al. (1987).] In addition, some of the neurons did not respond with large responses to the small images used in this investigation; ∼20% responded to the complex background stimuli of natural scenes and so could not be used in this particular experiment; and some were not held long enough for a data set to be obtained.
The data obtained in one of these experiments with 9° stimuli are shown in Figure 3. The firing rate of cell bj168 was ∼25 spikes/sec when the monkey fixated the effective stimulus in the plain background; this value was little affected when the monkey fixated the same object when shown in the complex natural scene (Fig. 3, left). The firing rate remained high even when the monkey fixated far from (up to 40° away from) the effective stimulus in the plain (blank) background. However, with the complex background, the firing rate fell markedly as a function of the distance of the position being fixated from the effective stimulus. The half-amplitude radius of the receptive field (the angle at which the firing rate had dropped to half its value relative to the spontaneous rate when the object was fixated) was ∼17°. The means and SEs of the responses give an indication that, for example, the firing rate as a function of eccentricity was markedly different in the blank and complex natural scenes; indeed, a two-way ANOVA (performed within trial block 1 and with two factors, background type and eccentricity of fixation) revealed a highly significant interaction (F(10,2131) = 3.49; p < 0.0003).
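The half-amplitude radius defined above can be estimated from a binned eccentricity profile. The sketch below assumes linear interpolation between adjacent bins, which the text does not specify, and uses invented numbers rather than the data of cell bj168; the function name is illustrative.

```python
def half_amplitude_radius(ecc_deg, rates, spontaneous):
    """Eccentricity at which the response falls to half its foveal amplitude.

    ecc_deg must be ascending and rates[0] is the rate when the stimulus is
    fixated. Amplitude is measured relative to the spontaneous rate, as in
    the definition in the text. Linear interpolation between neighboring
    points is an assumption of this sketch. Returns None if the rate never
    falls to half amplitude within the sampled range.
    """
    half = spontaneous + 0.5 * (rates[0] - spontaneous)
    points = list(zip(ecc_deg, rates))
    for (e0, r0), (e1, r1) in zip(points, points[1:]):
        if r0 >= half > r1:
            return e0 + (r0 - half) * (e1 - e0) / (r0 - r1)
    return None


# Invented profile: 50 spikes/sec at fixation, spontaneous rate 10 spikes/sec.
radius = half_amplitude_radius([0.0, 10.0, 20.0, 30.0], [50.0, 40.0, 20.0, 12.0], 10.0)
diameter = 2.0 * radius  # receptive-field width, if the field is symmetric

# A flat profile, as with a single stimulus in a blank scene, never reaches half.
flat = half_amplitude_radius([0.0, 20.0, 40.0], [50.0, 48.0, 45.0], 10.0)
```

In this invented example the half-amplitude level is 30 spikes/sec, which is crossed midway between the 10° and 20° points.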
When the same cell (Fig. 3) was tested with two stimuli present, one effective and one noneffective for the cell, and the effective stimulus was not the target to be touched, the firing rate when the effective stimulus was being looked at (Fig. 3, middle top, firing rate close to the fovea) was nevertheless very similar to its value when only the effective stimulus was shown. The same firing rate was obtained for the effective stimulus when it was not the target for action in both the plain and complex natural scenes. Thus, the data show that even when an effective stimulus is not the target for action (and in this sense attention is not being paid to it), there is nevertheless a large firing rate response of the inferior temporal cortex neuron to the stimulus provided that it is being fixated. The effect of attention does become evident however when we consider the firing rates when the monkey is fixating more than a few degrees away from the effective stimulus. Under these conditions, the firing rate drops markedly as a function of the fixation distance away from the object (Fig. 3, middle). Indeed, the radius (measured by the half-amplitude width) of the receptive field of the neuron under these conditions was 9°. Thus, attention (in this case making a different stimulus the target for action) influenced the size of the receptive field of the inferior temporal cortex neuron but not its firing rate when an effective but nontarget stimulus was being fixated.
This point is also established by the data obtained in the third block of trials in which the same two stimuli were being shown, but now the effective stimulus for the neuron was the target to be touched (Fig. 3, right). The effect of this was to increase the receptive-field size of the neuron (to 29°) in the two-stimulus display with a plain background (Fig. 3, compare right with middle). The effect of making the effective stimulus a target to be selected in the two-stimulus display was also to increase the receptive-field size slightly in the complex background (to 17°). Thus, object-based attention in the visual search task had a minor effect in a complex natural scene of increasing the receptive-field size for a target stimulus compared with the condition when the effective stimulus for the neuron was not a target, but the effect was much smaller in the complex natural scene than when two stimuli were shown in a plain background. The latter is the normal condition in which experiments on attention have been performed previously (Chelazzi et al., 1993, 1998; Desimone and Duncan, 1995; Chelazzi, 1998; Chelazzi and Corbetta, 2000).
A comparison of the data shown in Figure 3, top left and top right, showed that this neuron had the same firing rate to the effective stimulus when fixated in the blank screen independently of whether it was shown alone or with a second stimulus. However, simply having a second stimulus present, although it was not a target for action, reduced the size of the receptive field of the neuron somewhat. Thus, just having a second stimulus present had some effect on receptive-field size, but the effect was much less than that produced by a complex natural scene (Fig. 3, compare top right blank screen with top left complex natural scene).
The data for the 9° stimuli for the nine neurons in which testing in all conditions was completed are shown in Figure 4 and Table 1. Figure 4 shows the mean firing rate for the nine neurons in the six conditions tested. To combine the data for different neurons, the firing rates were first expressed as the firing rate for that neuron relative to the condition when the stimulus was being fixated (i.e., as a percentage). The interaction term showing that the population of neurons responded differently as a function of eccentricity in, for example, the condition with one stimulus present (trial block 1) was significant (F(7,56) = 2.53; p < 0.025). Furthermore, a one-way ANOVA showed that the receptive-field sizes were different in the different conditions of trial block 1 (with a single stimulus present) (F(5,48) = 10.27; p < 0.00002). Table 1 shows the receptive-field sizes for the neurons in the six conditions tested. The results for the population of neurons show effects that are very similar to those described for neuron bj168 in Figure 3 (top). In particular, with one object in the scene (Fig. 4, left), the receptive fields are large with a plain background (averaging 71.6° for the half-amplitude width) and very much smaller in a complex natural scene (where they average 25.6° for the half-amplitude width) (post hoc test; p < 0.002; trial block 1). When there are two objects in a plain scene compared with one object, the receptive fields of the neurons are generally smaller (averaging 55.6° when the effective stimulus is the search target). Thus, just introducing a second stimulus into a display, even when it is to be ignored, does reduce the receptive-field size of inferior temporal cortex neurons somewhat (Fig. 4, compare left and right, plain background condition).
However, attentional effects on receptive-field size are most clearly evident when there are two stimuli in an otherwise plain background, because in this condition the receptive field is small (29.6°) when an effective stimulus is not the target for action (Fig. 4, middle, plain background condition) and is larger (55.6°) when the effective stimulus is the target for action (Fig. 4, right, plain background condition) (post hoc test; p < 0.04; comparing trials from blocks 2 and 3 for the plain background condition). (It should be noted that the receptive-field diameters given include the 9 × 7° stimulus, so that a receptive-field width of 20° extends only 5.5° beyond the edge of the stimuli on each side.) In the two-stimulus condition, as soon as a complex natural background is present, the receptive fields become small. Under these conditions of a natural background, attention has some effect in increasing the receptive-field size for a target stimulus, but the effect is rather minor (compare Fig. 4, right and middle). [The increase is from 20.4 to 25.4° diameter, as shown in Table 1, and this difference is not significant (Wilcoxon T = 8.5; n = 9).]
The values for the firing rates when the effective 9 × 7° stimulus is being fixated under the different conditions were very similar to each other and were not statistically significantly different. First, provided that an object is being fixated, there is little difference in the firing rate in the blank scene and complex natural scene conditions [e.g., 48.9 vs 44.7 spikes/sec with one object present in trial block 1 (p values not significant; t = 1.50; df = 8)]. Thus, inferior temporal cortex neurons can respond as well to objects when the objects are shown in complex scenes as they do when the objects are shown in plain backgrounds, provided that the object is fixated. Given that the neurons did not respond at all to the noneffective stimulus, this finding shows that the tuning of the neurons is not affected, and remains tuned to the same effective stimuli, when they are shown in real scenes as when they are shown against blank backgrounds. Thus, inferior temporal cortex neurons code for objects even when they have to be segmented out of complex backgrounds. Second, it is shown that even if an effective stimulus for a cell is not a target for action, then the neuronal response is as large to the stimulus as when it is a target for action (54.6 vs 48.9 spikes/sec in a plain background and 43.8 vs 44.7 spikes/sec in a complex background). Thus, attention per se (defined by whether the object in the scene is to be selected for action in a visual search task) makes little difference to the firing of inferior temporal cortex neurons, provided that the stimulus is being fixated.
In addition to the results described for neurons tested in all six experimental conditions with the 9° stimuli, additional results, which confirm those already described, were obtained in an additional seven neurons in the main experimental condition, with one object that was the target for visual search in the blank versus complex natural scene conditions. In these seven additional experiments (which were performed in two monkeys), the mean ± SEM firing rates when the effective object was being looked at in the blank and complex scenes were similar (53.3 ± 13.1 vs 53.1 ± 15.4 spikes/sec), and the receptive-field diameters were reduced from a mean of 58.8 ± 10.0° in the blank scene to 29.4 ± 4.8° in the complex natural scene.
In general, comparable results for the second subset of eight neurons tested with 5 × 3.5° stimuli were found, except that the receptive-field sizes were slightly smaller overall than with the 9° stimuli, reflecting, as predicted, the smaller stimulus size. Figure 5 shows the data from a single cell tested with the small (5°) stimuli (compare with Fig. 3 for the larger stimuli). The small receptive field in, for example, the condition in which the effective stimulus is not the target for the search is shown in Figure 5 (middle). Table 1 shows the receptive-field sizes in the six test conditions; the firing rates are indicated in Figure 6. The firing rate data show that when the effective stimulus was being fixated, there was little effect of whether the stimulus was shown in a plain background versus a complex natural scene (32.6 vs 33.9 spikes/sec when one stimulus was on the screen). The main effect found was, as before, on the size of the receptive field, which was large with one stimulus present in the plain screen condition (77.6°) and small (22.0°) with the complex background (Table 1). As shown in Table 1, if two stimuli were present, then the receptive fields were still quite large in the plain background if the object was the target of the search (47° diameter) and were quite small (17.2°) if the object was not the target for the search (p < 0.01; one-tailed; Wilcoxon T = 1.5; n = 8). Thus, attention defined operationally in this way did have clear effects on receptive-field size in the plain background. However, in the complex background, the receptive-field size was rather small not only when the object was not the target of the search (15.6° diameter) but also when it was the target of the search (19.2°; Wilcoxon T = 9.5; n = 8; NS) (Table 1). Thus, there was little effect of attention defined operationally in this way with the complex background.
The smallest receptive-field diameter found, 15.6°, extended ∼5.3° beyond the edge of the small (5 × 3.5°) stimuli.
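The arithmetic linking receptive-field diameter, radius, and the extent beyond the stimulus edge can be made explicit. The sketch below assumes that the stimulus is centered in the receptive field and that the stated horizontal stimulus width (9° or 5°) is the relevant dimension; the function name is illustrative.

```python
def margin_beyond_stimulus(rf_diameter_deg, stimulus_width_deg):
    """Distance the receptive field extends past each edge of a centered stimulus."""
    return (rf_diameter_deg - stimulus_width_deg) / 2.0


# Example from the text for the 9 x 7 deg stimuli: a 20 deg field.
margin_large = margin_beyond_stimulus(20.0, 9.0)

# Smallest diameter found with the 5 x 3.5 deg stimuli.
margin_small = margin_beyond_stimulus(15.6, 5.0)

# Diameters from Table 1 expressed as radii (two stimuli, plain background).
radius_target = 47.0 / 2.0
radius_distractor = 17.2 / 2.0

print(round(margin_large, 1), round(margin_small, 1), radius_target, radius_distractor)
```

These values reproduce the 5.5° and ∼5.3° margins quoted in the text, and the halved diameters give radii of 23.5° and 8.6°.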
The main effects for the first subset of cells tested with 9 × 7° stimuli and for the second subset of eight cells tested with 5° stimuli are summarized in Table 1. Some of the important comparisons have been described above. A diagram that schematizes to scale the results found with the 5 × 3.5° stimuli is provided in Figure 6.
Recording tracks in both monkeys were made over an extensive portion of the inferior temporal cortex, from the upper and lower banks and fundus of the superior temporal sulcus, through the middle temporal gyrus to just lateral of the middle temporal sulcus. The recording sites of the cells reported here are illustrated in Figure 7. As can be seen in Figure 7, the cells are distributed in a region that extends from the gyrus lateral to the middle temporal sulcus to the lower bank of the STS, and the investigated area of cortex is indicated by the boxed area contained in the bottom left coronal section.
Discussion
These experiments show that inferior temporal cortex neurons respond to objects very well when they are shown in complex natural scenes, provided that the object is fixated. In addition, in natural scenes the receptive fields of inferior temporal cortex neurons become much smaller than they are for objects shown against blank backgrounds. The output of the inferior temporal cortex thus best reflects the single object that is being fixated in complex natural scenes. This property of inferior temporal cortex neurons helps to make the output of the inferior temporal visual cortex easily decoded and interpreted, for there is no confusion about which object in a complex scene with multiple objects the output is about. The output is about the object being fixated, and the receptive field includes or is close to the fovea.
There has been little previous investigation of IT neuronal activity when objects are shown in natural scenes. During a visual fixation task used to control fixation and with one object in a complex natural scene or against a plain background, Rolls et al. (2000) showed that the tuning of neurons to effective stimuli was similar in complex scenes and in plain backgrounds and that the magnitude of the neuronal responses was also similar, but that the receptive-field sizes were reduced in the complex natural scene. Sheinberg and Logothetis (2001) found, in a task in which a single small target object (1.5°) was presented in a complex natural scene (20 × 20°) and the monkey had to move a lever right or left to different objects, that IT neurons had tuning similar to that in a plain background. (They did not compare the sizes of the receptive fields in blank and natural scene conditions, and they had only one object in the display; thus, they did not study the effects of object-based attention.)
Attention can be operationally defined in the experiment described here by whether an object is the target for action, and is an example of object-based attention (Rolls and Deco, 2002). In terms of this definition, attentional effects were found in these experiments in that the receptive fields were larger (23.5° radius) in the plain background for the object that was the search target than for the object that was to be ignored (8.6° radius, data for 5° stimuli as shown in Table 1). However, the magnitude of this object-based attentional effect was very small in the complex natural scene, in that the receptive fields of inferior temporal cortex neurons were only slightly larger (by 1.8°, 9.6 vs 7.8° radius) for the target than for the object to be ignored in the complex scene (with the 5° stimuli). Space-based attention (in which attention is directed to a particular location) might show a larger effect on the firing rate to the attended versus unattended location, but we note that additional investigations would be needed to show that even this is the case in natural scenes, because most previous studies of attention have been performed when two small objects are present in a plain background (Chelazzi et al., 1993, 1998; Desimone and Duncan, 1995; Chelazzi, 1998; Chelazzi and Corbetta, 2000).
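The size of the object-based attentional effect in each background can be made explicit with a short calculation. The following Python sketch is illustrative only: it uses the receptive-field diameters quoted in the Results for the 5° stimuli with two objects present, converts them to radii, and takes the target minus distractor difference. The dictionary keys and the function name are hypothetical labels, not part of the original analysis.

```python
# Receptive-field diameters (deg) quoted in the text for the 5 deg stimuli
# with two objects present. The keys are illustrative labels.
rf_diameter_deg = {
    ("plain", "target"): 47.0,
    ("plain", "distractor"): 17.2,
    ("complex", "target"): 19.2,
    ("complex", "distractor"): 15.6,
}


def attention_effect_radius(background):
    """Target minus distractor receptive-field radius (deg) for one background."""
    target_radius = rf_diameter_deg[(background, "target")] / 2.0
    distractor_radius = rf_diameter_deg[(background, "distractor")] / 2.0
    return target_radius - distractor_radius


# Round to one decimal place, matching the precision reported in the text.
effects = {bg: round(attention_effect_radius(bg), 1) for bg in ("plain", "complex")}
print(effects)
```

On these numbers the difference in radius is roughly 15° in the plain background and under 2° in the complex scene, which is the contrast described above.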
It might be suggested that in fact attention is needed in the complex scene to process the distractor object, and that this is why the inferior temporal cortex neurons responded to fixated objects in the complex scene even when they were not the target of the visual search. However, we note that the same high firing rate when an object is fixated, and reduction in receptive-field size, occur when objects are passively being viewed in complex scenes during the performance of a visual fixation task (Rolls et al., 2000; E. T. Rolls and M. C. A. Booth, unpublished observations). Thus, the high firing rates for fixated distractor objects and reduction in receptive-field size found in the experiments described here in natural scenes primarily reflect the operation of the inferior temporal cortex neurons in the presence of a complex background, and not that the monkey was paying special attention to the distractor to determine whether it was the target for action or was not to be selected for action. Thus, the mechanism by which the reduction in receptive-field size occurs appears to be related to competition between different features in the visual scene. The result of the competition is that whatever object is at the fovea appears to be given preference in determining the output of inferior temporal cortex neurons. The evidence that this is the case is that the neurons always fired when the monkey fixated the objects, and that the firing decreased monotonically as a function of the distance from the fovea. Models that use the higher cortical magnification factor for the fovea can account for these findings (Rolls and Deco, 2002; Trappenberg et al., 2002).
It is interesting and important that when objects are shown in complex scenes, not only the magnitude of IT neuronal responses but also the tuning of the neurons to the objects remains relatively unaffected (Sheinberg and Logothetis, 2001). This is shown in the present study by the fact that there is little if any reduction in the firing rate to an effective stimulus when a complex natural background is introduced (relative to the rate in the blank background) (Figs. 3-6), and that the neurons did not start to respond in the complex scene to the stimulus that was ineffective when tested in the blank screen. This finding might be called “background invariance,” to capture the point that the tuning of many inferior temporal cortex neurons is invariant when the stimuli are shown against a background. This particular result complements the findings of DiCarlo and Maunsell (2000) and Missal et al. (1999) that inferotemporal cortex neurons respond similarly to an effective shape stimulus for a cell even if some distractor stimuli are present a few degrees away. This result was also found by Rolls and Tovee (1995), who showed in addition that with two stimuli present in the visual field, the anterior inferior temporal neuronal responses were weighted toward the stimulus that was closest to the fovea. In early visual cortical areas (V1–V4), free viewing in natural scenes may produce some reduction in neuronal responses (Livingstone et al., 1996; Gallant et al., 1998), although the testing conditions were so different from those described here that a direct comparison is not realistic.
The results described here confirm the finding by Rolls et al. (1977) that inferior temporal cortex neurons respond to visual stimuli independently of their reward or punishment association; they show in addition that the inferior temporal cortex provides an output that can be unambiguous to receiving stages because it is about one object, that at the fovea. Succeeding stages can then easily use a pattern associator to determine whether that stimulus is associated with taste reward or punishment (Rolls, 1990, 1999; Rolls and Treves, 1998; Rolls and Deco, 2002).
Both object-based and space-based attentional processes can be understood quantitatively by a model of ventral and dorsal visual stream processing in which, during covert visual search (i.e., without eye movements), competition in early common visual areas is biased by top-down object bias applied to the end of the ventral stream or by top-down spatial bias applied to the end of the dorsal stream (Rolls and Deco, 2002). However, the fundamental points made here are that search for objects in complex natural scenes is performed more by overt processes (moving the eyes around the scene) and is facilitated by the much reduced receptive-field sizes of inferior temporal cortex neurons in natural scenes, receptive fields that typically include the fovea, attributable in part to its large cortical magnification factor (Rolls and Deco, 2002; Trappenberg et al., 2002). Moreover, the results described in this paper indicate that the coordinates of the object in space that is to be the target for action are passed to the motor system by virtue of the facts that the object represented in the inferior temporal cortex in complex scenes is at the fovea and that the dorsal visual system that executes the actions has information about eye gaze position (cf. Ballard, 1991; Rolls and Deco, 2002).
This work was supported by the Wellcome Trust and by the Medical Research Council Interdisciplinary Research Centre for Cognitive Neuroscience.
Correspondence should be addressed to Prof. E. T. Rolls, Department of Experimental Psychology, University of Oxford, South Parks Road, Oxford OX1 3UD, UK.