Abstract
The primate visual system continuously selects spatially circumscribed regions, features or objects for further processing. These selection mechanisms, collectively termed selective visual attention, are guided by intrinsic, bottom-up and by task-dependent, top-down signals. While much psychophysical research has shown that overt and covert attention is partially allocated based on saliency-driven exogenous signals, it is unclear how this is accomplished at the neuronal level. Recent electrophysiological experiments in monkeys point to the gradual emergence of saliency signals when ascending the dorsal visual stream and to the influence of top-down attention on these signals. To elucidate the neural mechanisms underlying these observations, we construct a biologically plausible network of spiking neurons to simulate the formation of saliency signals in different cortical areas. We find that saliency signals are rapidly generated through lateral excitation and inhibition in successive layers of neural populations selective to a single feature. These signals can be improved by feedback from a higher cortical area that represents a saliency map. In addition, we show how top-down attention can affect the saliency signals by disrupting this feedback through its action on the saliency map. While we find that saliency computations require dominant slow NMDA currents, the signal rapidly emerges from successive regions of the network. In conclusion, using a detailed spiking network model we find biophysical mechanisms and limitations of saliency computations which can be tested experimentally.
Introduction
A crucial computational strategy of the primate visual system is to swiftly allocate processing resources to a region, feature or object to deal with the many overlapping and partially occluding objects in natural scenes. Attentional selection can be guided by exogenous signals from the environment, such as a red flashing light (bottom-up, saliency-driven attention), by endogenous signals such as when looking for a specific car in a parking lot (top-down, volitionally controlled attention), or by both. Nevertheless, the neural mechanisms underlying these processes are mostly unknown.
From a computational point of view, a purely feedforward model of bottom-up attention incorporating a saliency map successfully predicts a large fraction of fixated locations under free viewing conditions (Itti et al., 1998; Itti and Koch, 2001; Parkhurst et al., 2002; Cerf et al., 2008; Foulsham and Underwood, 2008; Mannan et al., 2009). At its heart there is a two-dimensional (2-D) topographic arrangement of neurons that represent stimulus saliency throughout the visual scene. Initially, prominent locations corresponding to regions with enhanced feature contrast are computed in individual maps (i.e., conspicuity maps) for different dimensions of the stimulus such as intensity, orientation, color, motion, etc. These computations are performed through a set of multiscale, center-surround and normalization operations. Finally, the conspicuity maps are combined to form a single saliency map. Activity in this map does not encode conspicuity in any one particular feature dimension, but encodes the overall conspicuity of a given location relative to its local and global neighborhood. Based on electrophysiological evidence from the monkey, the lateral intraparietal cortex (LIP) and the frontal eye fields (FEF) have been identified as possible saliency maps (Gottlieb et al., 1998; Kusunoki et al., 2000; Bisley and Goldberg, 2003; Moore and Armstrong, 2003; Thompson and Bichot, 2005).
While it is believed that LIP and FEF represent the saliency map, neurons in lower visual areas V1, V4, and V5 (MT) also show differential responses to a target stimulus depending on surrounding stimuli or its spatiotemporal context (Allman et al., 1985; Knierim and van Essen, 1992; Albright and Stoner, 2002; Hegdé and Felleman, 2003; Burrows and Moore, 2009). For example, Hegdé and Felleman (2003) measured the response of V1 neurons to oriented and colored bars in the receptive field (RF) that had different saliency values. In particular, they compared the response to popout targets—say a red bar among green bars that rapidly attracts the eye—to conjunction targets—say a red, vertical bar among red, horizontal and green, vertical and horizontal ones—targets that are defined by the combination of two or more feature dimensions. Such conjunction targets are not readily detectable. They found that V1 neurons do not distinguish between the popout and conjunction targets and therefore, that V1 neurons do not carry saliency signals. More recently, Burrows and Moore (2009) examined the response of V4 neurons to similar stimuli and concluded that these neurons can distinguish between the popout and conjunction targets. Paradoxically, they also showed that the saliency signal in V4 diminishes if the monkey prepares a saccade to a location far from the RF of the neuron, indicating an important role for top-down attention in the formation of the bottom-up driven saliency signal in V4.
These findings raise some important questions. First, how are saliency signals formed in the visual cortex across the cascade of regions from V1, to V2, V3, V4 and so on? Second, how does top-down attention affect these computations?
To answer these questions and shed light on the neural substrates of bottom-up attention and its interaction with top-down attention, we construct a 2-D, biophysically plausible spiking network model. The network contains three distinct layers of neural populations corresponding to three cortical regions (which we identify from here on as V1, V2, and V4) and a higher visual area assumed to instantiate the saliency map (either LIP or FEF). The model neurons receive realistic inputs which are generated from the actual stimuli used in the relevant experiments (Hegdé and Felleman, 2003; Burrows and Moore, 2009). Using our model, we consider biophysical mechanisms and constraints on saliency computations and how these are influenced by top-down attention. Our hypothesis is that feedback from a cortical area representing the saliency map to earlier visual areas improves saliency computations, while top-down attention interferes with these computations by disrupting the feedback through its influence on the activity in the saliency map.
Materials and Methods
We use leaky integrate-and-fire (LIF) model neurons with realistic synapses as building blocks. Our spiking network model contains many 2-D populations of neurons (24 and 10 in the first and second set of simulations, respectively) with realistic inputs and synapses, making it computationally expensive. For example, simulating 200 trials of the response to a given stimulus takes ∼10 h to run on a standard Unix system with a 3 GHz Intel CPU. We therefore had to adopt some simplifications.
First, we assume that inputs to the network are the outputs of lateral geniculate nucleus (LGN) and V1 neurons that are wavelength- and orientation-selective, respectively. We do not explicitly model these cells, using instead their RF properties to generate their response to visual stimuli. As a result, the inputs to the network have wavelength (here the colors red and green) and/or orientation selectivity (0°, 45°, 90°, 135°), and we only explicitly model the visual processing in the output layers of V1, cortical areas V2 and V4, and a higher area corresponding to the saliency map (LIP/FEF). Second, we use Cartesian coordinates and ignore the effect of cortical magnification.
For each trial, we simulate 300 ms of the network dynamics with dt = 0.1 ms using the improved RK2 integration algorithm (Hansel et al., 1998). We directly compare our model against two different electrophysiological experiments in the alert and behaving monkey (Hegdé and Felleman, 2003; Burrows and Moore, 2009) using the same visual stimuli (see below).
Spiking network model.
The model consists of 3 regions of neural populations, each of which contains 4 excitatory and 4 corresponding inhibitory populations (not represented) of LIF units (Fig. 1). Each population consists of 28 × 28 neurons, covering 14° × 14° of the visual field. Therefore, each neuron spans 0.5° of the visual field. We assume periodic boundary conditions for connections between all neurons (i.e., each 2-D neural population is placed on a torus).
We examine two exclusive architectures for the network. In configuration A (Fig. 1A), individual features (i.e., color and orientation) are processed in separate neural populations, and neurons in different populations receive inputs selective to only one feature. That is, there are no cells which participate in saliency computations and are tuned to both color and orientation. To compare the activity of these model neurons with experimentally recorded neurons selective to two features (say a red bar at 0° orientation), we combine the outputs of neural populations selective to the color red and to 0° orientation (Fig. 1A). We formulate this combination by simply adding the spike trains of neurons with similar RF in the corresponding neural populations (to avoid further computations). In configuration B (Fig. 1B), a combination of features is jointly processed and neurons in different populations receive inputs selective to both orientation and color (e.g., red color and 0° orientation). That is, oriented neurons are also color-tuned and we can directly compare the activity of neurons in this configuration with the experimental data. Apart from the input organization, all parameters are similar for the two configurations.
Model parameters for all excitatory neurons are set to: threshold voltage Vth = −50 mV, reset voltage Vreset = −55 mV, leak voltage Vleak = −70 mV, refractory time period tref = 2 ms, capacitance CE = 0.5 nF, and leak conductance Gleak,E = 25 nS. Inhibitory interneurons have similar parameters except that the capacitance and leak conductance are set to CI = 0.2 nF and Gleak,I = 20 nS, respectively. Neurons in each population are connected to all other neurons with a circular Gaussian profile (i.e., with equal width, σ, in both dimensions). That is, most connections are local with synaptic weight falling off with distance.
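The single-neuron parameters above can be illustrated with a minimal sketch of one LIF unit driven by a constant current. This is an illustration under stated assumptions, not the network code: it uses forward Euler integration rather than the RK2 scheme used for the simulations, it omits synaptic, noise and adaptation currents, and the function name `simulate_lif` and the constant input `i_ext_pa` are hypothetical.

```python
def simulate_lif(i_ext_pa, t_max_ms=300.0, dt_ms=0.1,
                 c_pf=500.0, g_leak_ns=25.0,
                 v_leak_mv=-70.0, v_th_mv=-50.0, v_reset_mv=-55.0,
                 t_ref_ms=2.0):
    """Return spike times (ms) of one LIF neuron under constant current.

    Units: pF, nS, mV, ms and pA, so that nS * mV = pA and pA / pF = mV/ms.
    Defaults are the excitatory-cell values quoted in the text
    (0.5 nF = 500 pF, 25 nS). Forward Euler is an assumption made here
    for brevity; the simulations in the text use RK2.
    """
    n_steps = int(round(t_max_ms / dt_ms))
    ref_steps = int(round(t_ref_ms / dt_ms))
    v = v_leak_mv
    ref_left = 0          # remaining refractory steps
    spike_times_ms = []
    for step in range(n_steps):
        if ref_left > 0:  # voltage is held at reset while refractory
            ref_left -= 1
            continue
        dv_dt = (-g_leak_ns * (v - v_leak_mv) + i_ext_pa) / c_pf
        v += dt_ms * dv_dt
        if v >= v_th_mv:  # threshold crossing: record spike and reset
            spike_times_ms.append((step + 1) * dt_ms)
            v = v_reset_mv
            ref_left = ref_steps
    return spike_times_ms


if __name__ == "__main__":
    # Excitatory cell: the steady-state voltage reaches threshold only
    # above 25 nS * 20 mV = 500 pA.
    print(len(simulate_lif(400.0)), len(simulate_lif(1000.0)))
```

With these values the membrane time constant is C/Gleak = 20 ms for excitatory cells and 10 ms for inhibitory cells (`c_pf=200.0, g_leak_ns=20.0`), so inhibitory units respond faster to the same drive.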
Excitatory neurons project to their target neurons through two types of synaptic receptors: fast AMPA, with the time constant τs = 2 ms, and slow saturating NMDA, with the time constant τs = 80 ms. The spatial extent of the connectivity profile (i.e., σ in the Gaussian function) is the same for both receptor types (see supplemental Table 1, available at www.jneurosci.org as supplemental material, for connectivity parameters). Inhibitory neurons are connected to their target neurons through GABA synapses with the time constant τs = 10 ms. The peak conductance for all synapses, gsyn, is set to 1 nS multiplied by the connection strength (see below). All synapses are modeled as conductances that decay exponentially with time.
To capture the observed response adaptation in visual areas, we include spike-rate-adaptation (SRA) current for all neurons. For neurons in feature-selective populations, we set the change in conductance and the time constant of the SRA current to gSRA = 0.6 nS and τSRA = 50 ms, respectively. To reduce the strong response to the single bar stimulus in the saliency population, we adopt a stronger SRA current for neurons in this population (gSRA = 4.0 nS and τSRA = 50 ms).
Every excitatory neuron in a given population is connected to all other neurons in that population and to all interneurons in the corresponding inhibitory population. Cross-orientation inhibition is implemented by subtracting 25% of the mean of all orientation inputs (0°, 45°, 90°, 135°) from a given orientation input. In addition, there are feedforward connections between excitatory neurons with similar feature selectivity in successive regions.
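The cross-orientation inhibition step described above amounts to a one-line subtraction. The sketch below shows it for the four orientation inputs at one location; the function name and the dictionary layout are hypothetical, and whether the result is rectified afterwards is not stated in the text, so no rectification is applied here.

```python
def cross_orientation_inhibition(orientation_inputs, fraction=0.25):
    """Subtract a fraction of the mean orientation input from each input.

    `orientation_inputs` maps an orientation in degrees to the input at
    one location. The 25% default follows the text; leaving negative
    values unrectified is an assumption of this sketch.
    """
    mean_input = sum(orientation_inputs.values()) / len(orientation_inputs)
    return {theta: value - fraction * mean_input
            for theta, value in orientation_inputs.items()}


if __name__ == "__main__":
    # A bar at 0 degrees drives only the 0-degree channel.
    print(cross_orientation_inhibition({0: 1.0, 45: 0.0, 90: 0.0, 135: 0.0}))
```

Because the same amount is subtracted from every channel, the differences between orientation channels are unchanged; only the common baseline is lowered.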
All connection matrices are normalized Gaussian functions, with width σ, multiplied by the weight of these connections, w (see supplemental Table 1, available at www.jneurosci.org as supplemental material). We assume identical values for σ and w in the three simulated regions. Because the connectivity matrices are normalized, the value of w for a given connection should not be taken by itself as the magnitude of that connection strength.
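A minimal sketch of such a connection profile on a 28 × 28 torus is given below. Several details are assumptions made for illustration: σ is expressed in grid units, "normalized" is taken to mean that the weights from one source neuron sum to 1 before scaling by w, and the function name is hypothetical.

```python
import math


def torus_gaussian_weights(n=28, sigma=2.0, w=1.0, source=(0, 0)):
    """Weights from one source neuron to every neuron on an n x n torus.

    Distances wrap around in both dimensions (periodic boundaries).
    The Gaussian is normalized to unit sum and then multiplied by w;
    that normalization convention and sigma in grid units are
    assumptions of this sketch.
    """
    si, sj = source
    raw = []
    for i in range(n):
        row = []
        for j in range(n):
            # shortest distance along each axis on the torus
            di = min(abs(i - si), n - abs(i - si))
            dj = min(abs(j - sj), n - abs(j - sj))
            row.append(math.exp(-(di * di + dj * dj) / (2.0 * sigma * sigma)))
        raw.append(row)
    total = sum(sum(row) for row in raw)
    return [[w * value / total for value in row] for row in raw]


if __name__ == "__main__":
    weights = torus_gaussian_weights(sigma=2.0, w=11.0)
    print(sum(sum(row) for row in weights))  # total outgoing weight
```

Under this convention a wider σ spreads the same total weight over more targets, which is why w alone does not give the strength of any single connection.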
To study the importance of NMDA currents in saliency computations, we reduce NMDA currents while increasing AMPA currents such that their sum remains approximately the same. Specifically, for the two additional sets of simulations presented here, we set the connection strength parameters between excitatory neurons, wEE,i→iAMPA and wEE,i→iNMDA, equal to [33, 22] and [44, 11], respectively (compare with the original values of [11, 44]) (see supplemental Table 1, available at www.jneurosci.org as supplemental material). As we reduce the NMDA currents between excitatory neurons, we also need to reduce the strength of the NMDA currents from the excitatory to inhibitory neurons to prevent the slowly activated inhibitory neurons from suppressing the activity in the network after the onset response. For connections from the excitatory to inhibitory neurons we set wEI,i→iAMPA and wEI,i→iNMDA equal to [180, 45] and [202.5, 22.5], respectively (compare with the original values of [135, 90]). Note that the connection widths, σ, are similar for AMPA and NMDA synapses. These and the rest of the model parameters are kept the same for simulations on the role of NMDA receptors.
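The weight pairs quoted above can be checked with a few lines of arithmetic: in each manipulation the AMPA and NMDA weights sum to the same total while the NMDA share drops. The condition labels in this sketch are hypothetical; the numbers are those given in the text, in [AMPA, NMDA] order.

```python
# [AMPA, NMDA] connection strengths from the text; labels are illustrative.
W_EE = {"original": (11.0, 44.0), "reduced": (33.0, 22.0),
        "strongly_reduced": (44.0, 11.0)}
W_EI = {"original": (135.0, 90.0), "reduced": (180.0, 45.0),
        "strongly_reduced": (202.5, 22.5)}


def summarize(weights):
    """Return {condition: (AMPA + NMDA total, NMDA fraction of total)}."""
    return {name: (ampa + nmda, nmda / (ampa + nmda))
            for name, (ampa, nmda) in weights.items()}


if __name__ == "__main__":
    for label, table in (("E->E", W_EE), ("E->I", W_EI)):
        for name, (total, nmda_fraction) in summarize(table).items():
            print(label, name, total, nmda_fraction)
```

The totals stay at 55 (E to E) and 225 (E to I), while the NMDA share falls from 0.8 to 0.4 to 0.2 for E to E connections and from 0.4 to 0.2 to 0.1 for E to I connections.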
Finally, to study the effect of feedback from the saliency map on the saliency computations in V4, we use a simplified version of our model (for the sake of computational efficiency). This simplified model contains two layers of neural populations: the first layer corresponding to feature-selective neurons in V4 and the second layer corresponding to spatially selective saliency neurons in the LIP/FEF. The saliency map consists of one population of excitatory and one population of inhibitory neurons with strong lateral excitation and inhibition. Each excitatory cell in V4 projects to both excitatory and inhibitory neurons at its corresponding location in the saliency map, and receives feedback from excitatory neurons in this map (see supplemental Table 2, available at www.jneurosci.org as supplemental material, for model parameters). The saliency map also receives an input equal to 20% of the sum of the inputs to all feature-selective populations in V4 to account for direct inputs from the LGN and input layers of V1 to the LIP/FEF, which is supported by observation of an earlier response onset in the FEF with respect to V2 and V4 (Schmolesky et al., 1998). For simulation of the saccade preparation experiment, we use the same parameters while introducing extra inputs to all populations due to the presence of the saccade target (see below).
Inputs to the network.
We use the same visual stimuli as used by Hegdé and Felleman (2003) and Burrows and Moore (2009) to generate the inputs to our model. These visual stimuli consist of arrays of 7 × 7 oriented, colored bars with six different arrangements: singleton (the single bar), homogeneous, color popout, orientation popout, combined popout, and conjunction (supplemental Fig. S1, available at www.jneurosci.org as supplemental material). Popout and conjunction displays contain one colored and oriented bar, the target, which can be distinguished from the rest of the colored and oriented bars, the 48 distractors, by either one or two features, respectively. In the above experiments, preferred and nonpreferred color and oriented bars were determined for each recorded neuron, and then used to construct different stimuli. For convenience, we construct the visual stimuli from four types of bars: green or red and vertical or horizontal. Moreover, we always place the target bar in the center of the array.
All neurons in the network receive two classes of synaptic inputs: background input and feature-selective inputs. The background input represents projections from surrounding cortical neurons and is modeled by Gaussian noise currents. For each excitatory (respectively inhibitory) neuron, this input is equal to the current generated by 1000 cortical neurons firing at 4.0 Hz (respectively 3.0 Hz), through AMPA synapses (with the peak conductance gs = 1 nS). This spontaneous synaptic barrage provides the model with realistic noise and brings neurons near their firing threshold. The feature-selective inputs represent the outputs of color-selective neurons in the LGN, and of orientation-selective neurons in V1 (see below).
Responses to the orientation bars are computed using RF properties of orientation-selective cells in layer 4 of V1 (Dayan and Abbott, 2001). More specifically, the input to location (x, y) at time t is equal to the following: where k is an arbitrary constant, and Lo(x, y, t) is the linear response estimate of neuronal activity in space and time, where s(x, y) is the visual stimulus at location (x, y) and the kernel Do(x′, y′, t′) defines the space-time RF of the neuron. As the input is stationary, s does not depend on time but is equal to the average intensity of that stimulus at a given location. The kernel can be decomposed into spatial and temporal RF components (Dayan and Abbott, 2001), as follows: The spatial RF of orientation-selective neurons is modeled with a Gabor function: where σx and σy determine the extent of the RF, and k = 1/8° is the preferred spatial frequency. In Equation 4, the neuron responds most strongly to 0° orientation. In our model, we employ four types of input neurons selective to different orientations (at 0°, 45°, 90°, 135°). We choose σx = 0.5° for the nonpreferred direction, and σy = 1° for the preferred direction. The temporal RF of the orientation-selective neurons is given by the following (Maex and Orban, 1996; Dayan and Abbott, 2001): where α = 1/(7.5 ms) and bo = 0.85. We introduce a bias factor, bo, to make the integral of Do,t(t′) nonzero, avoiding a vanishing response after the initial onset. Because the image input is stationary, the temporal component of the RF response can be integrated independently of the spatial component.
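The relations this paragraph refers to can be sketched in LaTeX as follows. This is a reconstruction under assumptions, not a quotation of the original equations: the forms follow the standard linear-RF, Gabor and temporal-kernel expressions in Dayan and Abbott (2001); the half-wave rectification $[\cdot]_+$ in the first line, the zero Gabor phase, the upper integration limit $t$ (time since stimulus onset) and the placement of $b_o$ on the second temporal term are inferred from the surrounding definitions; and $k_s$ is a hypothetical symbol for the preferred spatial frequency, used to keep it distinct from the scaling constant $k$.

```latex
% Assumed reconstruction; symbols follow the surrounding text.
\begin{align}
I_o(x,y,t) &= k\,\bigl[L_o(x,y,t)\bigr]_+ \\
L_o(x,y,t) &= \int_0^{t} dt' \iint dx'\,dy'\;
              D_o(x',y',t')\, s(x-x',\,y-y') \\
D_o(x',y',t') &= D_{o,s}(x',y')\, D_{o,t}(t') \\
D_{o,s}(x',y') &= \frac{1}{2\pi\sigma_x\sigma_y}
   \exp\!\left(-\frac{x'^2}{2\sigma_x^2}-\frac{y'^2}{2\sigma_y^2}\right)
   \cos(k_s x') \\
D_{o,t}(t') &= \alpha\, e^{-\alpha t'}
   \left[\frac{(\alpha t')^5}{5!} - b_o\,\frac{(\alpha t')^7}{7!}\right]
\end{align}
```

Under this assumed form each of the two temporal terms integrates to 1 over $t' \in [0,\infty)$, so the integral of $D_{o,t}$ is $1 - b_o = 0.15$ for $b_o = 0.85$. That is consistent with the stated purpose of the bias factor, since without it ($b_o = 1$) the integral would be zero and the response to a stationary stimulus would vanish after the onset transient.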
We assume that the color inputs to our network are generated by the response of the most prevalent type of color-selective neurons in the LGN (Ts'o and Gilbert, 1988). That is, the center region of RF has color-opponency while the surround RF is nonchromatic, matching the modified type II neurons in Ts'o and Gilbert (1988). We assume that the form of spatial and temporal RF of these neurons is similar to ON- and OFF-center neurons in the LGN with the kernel, Here B is a constant which determines the relative contribution of surround to center, σcen = 0.5° and σsur = 1.0° determine the extent of RF in the center and surround, respectively, and Dc,tcen (respectively, Dc,tsur) determines the temporal RF of the center (respectively, surround). We only use ON-center neurons to generate the inputs. The temporal component of the RF for the center and surround are described as follows (Dayan and Abbott, 2001): where αcen = 1/(10 ms), βcen = 1/(32 ms), αsur = 1/(20 ms), βsur = 1/(32 ms), and bc = 0.75 is a constant introduced to avoid the integral of the temporal RF vanishing over time. Because the input is stationary, the temporal component of the RF can be integrated independently of the spatial component.
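The color kernel and its temporal components can be sketched in LaTeX as follows. As with the orientation kernel, this is a reconstruction under assumptions rather than a quotation: it uses the standard center-surround difference-of-Gaussians and biphasic temporal forms from Dayan and Abbott (2001) for an ON-center cell, and the placement of $b_c$ on the slower ($\beta$) term is inferred from its stated purpose.

```latex
% Assumed reconstruction for an ON-center cell; symbols follow the text.
\begin{align}
D_c(x',y',t') &=
   \frac{D_{c,t}^{cen}(t')}{2\pi\sigma_{cen}^2}
   \exp\!\left(-\frac{x'^2+y'^2}{2\sigma_{cen}^2}\right)
   - B\,\frac{D_{c,t}^{sur}(t')}{2\pi\sigma_{sur}^2}
   \exp\!\left(-\frac{x'^2+y'^2}{2\sigma_{sur}^2}\right) \\
D_{c,t}^{cen}(t') &= \alpha_{cen}^2\, t'\, e^{-\alpha_{cen} t'}
   - b_c\,\beta_{cen}^2\, t'\, e^{-\beta_{cen} t'} \\
D_{c,t}^{sur}(t') &= \alpha_{sur}^2\, t'\, e^{-\alpha_{sur} t'}
   - b_c\,\beta_{sur}^2\, t'\, e^{-\beta_{sur} t'}
\end{align}
```

With these forms each term $\alpha^2 t' e^{-\alpha t'}$ integrates to 1, so both temporal kernels integrate to $1 - b_c = 0.25$ for $b_c = 0.75$, which keeps a sustained response to a stationary stimulus.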
To simulate red and green bars, the response of neurons with an RF described above is generated using three color components of the input image. The red response equals the center response to the red component minus the center response to green plus the center response to the average of the three components. The surround response is nonchromatic and is equal to the average of the three color components (Ts'o and Gilbert, 1988). This way, neurons selective to red show a strong response to a red bar in their RF, but a weak response to a green bar in their RF (they prefer red and then anything but green) and vice versa.
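The color-opponent combination described above can be illustrated per pixel. The sketch below is a simplification under stated assumptions: it ignores the spatial and temporal filtering of the RF and treats the center response to a color component as equal to that component's intensity, and the function names are hypothetical.

```python
def red_center_drive(r, g, b):
    """Center drive of a red-preferring cell for one RGB pixel.

    Follows the text: response to red, minus response to green, plus
    response to the average of the three components. Treating each
    response as the raw intensity is an assumption of this sketch.
    """
    return r - g + (r + g + b) / 3.0


def green_center_drive(r, g, b):
    """Mirror-image combination for a green-preferring cell."""
    return g - r + (r + g + b) / 3.0


def surround_drive(r, g, b):
    """Nonchromatic surround: average of the three color components."""
    return (r + g + b) / 3.0


if __name__ == "__main__":
    red_bar, green_bar = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
    print(red_center_drive(*red_bar), red_center_drive(*green_bar))
```

For a pure red pixel the red-preferring drive is 4/3, while for a pure green pixel it is −2/3, which reproduces the ordering stated in the text: red is preferred, and green is the least effective color.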
Simulated V2 and V4 neurons receive color- and/or orientation-selective inputs which are 30% and 15% of the selective inputs to V1 (described above), respectively. These inputs are implemented to account for weak direct inputs from the LGN and input layers of V1 to areas V2 and V4 (Girard and Bullier, 1989; Girard et al., 1991). Note that we obtain similar results even in the absence of these direct inputs to V2 and V4.
Finally, in the saccade preparation experiment of Burrows and Moore (2009) the monkey is cued to make a saccade to a target while visual stimuli are presented in the RF of recorded neurons at a random time before the saccade initiation. To simulate this experiment, we assume that the onset of the saccade target introduces a strong input to the saliency population and a weak input (equal to 20% of the input to the saliency population) to all feature-selective populations, at the location of the saccade target. The strong input to the saliency map is assumed to originate from a working memory network which encodes the location of the saccade target. For simplicity, we assume that the onset of the saccade target is always 50 ms before the onset of the visual stimuli.
Data analysis.
For the results presented here, the average response of a simulated neuron to a given visual display is computed by counting all spikes in the 200 ms interval following the onset of activity (defined as firing above 5.0 Hz) and averaging this number over 200 trials. Because neighboring neurons have overlapping RF and their activity is highly correlated due to all-to-all connectivity, we compute these averages over four neurons with overlapping RF (for both target-selective and distractor-selective neurons). For convenience, we present most of the average responses after normalizing them by the response to the singleton display.
To quantify the saliency computations in successive layers of the network, we compute different quantities. First, we consider the difference between the response to popout and conjunction displays. These differences can be used to define a local saliency measure known as the popout selectivity index (PI) which has been reported in many experimental studies (Knierim and van Essen, 1992; Hegdé and Felleman, 2003; Burrows and Moore, 2009): where Rpopout and Rconj are the average responses of target-selective neurons to a given popout and conjunction display, respectively.
Second, as the saliency of a target depends on its contrast with nearby objects, the neuronal signature of the saliency of a target should depend on the response of target-selective neurons relative to the response of neurons selective to nearby distractors. Therefore, we use the average response of neurons selective to the target and to all 48 distractors surrounding the target to define a global measure of saliency.
We define the global saliency index (GSI) as the difference in the average response of target-selective neurons (Rtarget) and of all distractor-selective neurons (Rdistr) divided by their sum, as follows:
$$\mathrm{GSI} = \frac{R_{\mathrm{target}} - R_{\mathrm{distr}}}{R_{\mathrm{target}} + R_{\mathrm{distr}}}$$
GSI, distributed between −1 and +1, is a measure of how easily the target can be distinguished among the distractors: the closer it is to 1, the larger the neuronal representation of the target relative to the distractors.
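Both indices can be computed with one small helper. The GSI below follows the definition in the text (difference divided by sum). For the PI, the same normalized-difference form is assumed here for illustration; the text names the two responses it compares but does not spell out the normalization. Function names are hypothetical.

```python
def contrast_index(r_a, r_b):
    """Return (r_a - r_b) / (r_a + r_b), bounded in [-1, 1] for rates >= 0."""
    total = r_a + r_b
    if total == 0:
        raise ValueError("both responses are zero; index is undefined")
    return (r_a - r_b) / total


def global_saliency_index(r_target, r_distr):
    """GSI as defined in the text: target vs. distractor responses."""
    return contrast_index(r_target, r_distr)


def popout_index(r_popout, r_conj):
    """PI under the assumed normalized-difference form (see lead-in)."""
    return contrast_index(r_popout, r_conj)


if __name__ == "__main__":
    print(global_saliency_index(30.0, 10.0))  # target 3x the distractors
    print(popout_index(20.0, 20.0))           # no popout selectivity
```

An index of 0 means the two responses are equal, and values approaching +1 mean the first response dominates.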
Results
Saliency computations through lateral interactions in successive layers of spiking neurons
To understand the basic mechanisms underlying saliency computations, we examine two exclusive architectures (Fig. 1). In architecture or configuration A, individual features are processed in separate neural populations and neurons in different populations receive inputs selective to either orientation or color (Fig. 1A). We combine the outputs of such neural populations to construct a neural response selective to both features (see Materials and Methods for more details). In configuration B, a combination of features is jointly processed and neurons in different populations receive inputs selective to both orientation and color (Fig. 1B).
We first compare the response of neurons selective to the target (the central red-vertical bar) in V1, V2, and V4 and for the two configurations (Fig. 2). Note that the target is the same for all displays, while the surrounding distractors differ. The average responses to the six displays are computed over 200 trials of network simulation. The onset of response (defined as firing above 5 Hz) to different stimuli occurs at ∼40 ms for neurons in simulated V1, ∼48 ms for V2, and ∼54 ms for V4.
Soon after the activity onset, the responses to the different displays diverge due to lateral interactions in neural populations. Sometimes a smaller, secondary peak can be observed, a remnant of the shock oscillation caused by all-to-all connectivity. However, this oscillation is damped out quickly and does not contribute to the saliency computations. Because lateral interactions are dominated by inhibition, target-selective neurons show the weakest response to the homogeneous display, and simultaneously the strongest response to the singleton display (Fig. 2, compare black and gray traces). Note that when the singleton display is presented, the distractor-selective neurons receive the smallest inputs and so only weakly inhibit central neurons responding to the target. Yet when the homogeneous field of bars is presented, the distractor-selective neurons receive the largest inputs and so strongly inhibit target-selective neurons.
We also find that the responses in V1 for both configurations to color and orientation popout displays (cyan and blue traces, respectively, in Fig. 2) are similar to the response to the conjunction display (yellow), while they are smaller than the response to the combined popout display (red). Thus, while target-selective V1 cells already show differential responses, they do not distinguish between popout and conjunction per se, similar to what has been observed in monkey V1 (see Hegdé and Felleman, 2003, their Fig. 10). As the signal propagates through V2 and V4, the response to the conjunction becomes smaller than the response to the orientation and color popout in configuration A but not in B (compare yellow traces in Fig. 2A,B). That is, neurons in higher areas show a differential response to the popout and conjunction displays only when individual features are processed independently.
To quantify the evolution of the saliency signal in successive regions, we next consider the average response to the 6 different displays in successive layers of the network. As expected, we find that the average response to each of the five arrays of bars is suppressed compared with the singleton for neurons selective to the target (supplemental Fig. S2, available at www.jneurosci.org as supplemental material). Interestingly, the response in V1 is fairly similar for both configurations A and B and qualitatively matches the experimental results in V1 (Hegdé and Felleman, 2003, their Fig. 5).
The saliency of a target depends on its contrast with nearby objects, here the neighboring distractors. Likewise, the neuronal signature of target saliency should depend on the response of target-selective neurons relative to the response of neurons selective to nearby distractors. Therefore, we further analyze the average response of both target- and distractor-selective neurons in each region and for each configuration (Fig. 3).
When individual features are processed separately, the response to the target in popout displays is reduced by small amounts from one region to the next, and this reduction is much smaller than the reduction in the homogeneous and conjunction displays (Fig. 3A). As a result, the average response to popout targets exceeds the response to the conjunction target for configuration A. This is accompanied by an increase in the difference between the response to the target and distractors in successive layers for popout displays, but not for the conjunction display. On the other hand, the response to the target in both popout and conjunction displays is reduced in configuration B, which is accompanied by an increase in the difference between the response to the target and distractors in successive layers for these displays (Fig. 3B). The differential response of target-selective neurons to popout and conjunction stimuli can be quantified by the popout selectivity index (Knierim and van Essen, 1992; Burrows and Moore, 2009). We find that V4 popout selectivity indices for configuration A are qualitatively similar to those measured for V4 neurons in the monkey (Fig. 3C; cf. Burrows and Moore, 2009, their Fig. 2B).
This differential response in the two configurations is a consequence of the fact that for configuration B, lateral interactions take place between neurons selective to combinations of features. Thus, the response of target-selective neurons to either color or orientation popout or to conjunction displays is suppressed by active distractor-selective neurons, while this is not the case in configuration A. In the latter, the distractor-selective neurons are active in only one of the two populations for color and orientation popout displays. Consequently, the response to popout and conjunction displays is suppressed differentially for configuration A but to the same extent for B. This effect is compounded when ascending through the three regions (Fig. 3), and results in activity patterns in V4 similar to experimental observations (Burrows and Moore, 2009), but only for configuration A. Therefore, we conclude that saliency computations require independent processing of individual features in successive layers of neurons with lateral interactions.
Although popout selectivity has been used as a measure of saliency signals, it has been argued that the absence of popout selectivity may not be equivalent to the absence of saliency signals (Li, 2002). To test this hypothesis, we use the average response of neurons selective to the target and all 48 distractors surrounding the target to define the global popout selectivity indices (see supplemental material, available at www.jneurosci.org). Interestingly, we find that local and global popout selectivity indices are highly correlated in all regions (see supplemental Figs. S4, S5, available at www.jneurosci.org as supplemental material). Local and global saliency signals are correlated because they are generated through lateral interactions between neurons with similar selectivity. Therefore, assuming saliency computation is manifested in the brain through the same mechanism, we conclude that both local and global popout indices can be equally informative about the visual saliency of the target.
Up to this point, we examined the response of target-selective neurons to different displays. We showed how a differential response to popout and conjunction displays is formed in successive layers of neurons through lateral excitatory and inhibitory interactions. The reason for examining the signal in target-selective neurons was to compare our results to electrophysiological recordings, although the presence of this signal is not equivalent to target detection per se (see Discussion). Instead, target selection can be performed by finding the most active location in a topographic map which is not selective to any feature (Itti and Koch, 2001). Therefore, we construct a “hypothetical” saliency map by adding the output of feature-selective neurons in each region to examine the signal related to target detection.
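The construction of the hypothetical saliency map, and target selection as a search for its most active location, can be sketched as follows. The maps here are small nested lists of firing rates with made-up values, and the function names are hypothetical; the sketch only illustrates the summation and winner-selection steps, not the spiking dynamics.

```python
def combine_feature_maps(feature_maps):
    """Sum same-sized 2-D feature maps location by location."""
    n_rows = len(feature_maps[0])
    n_cols = len(feature_maps[0][0])
    return [[sum(fmap[i][j] for fmap in feature_maps) for j in range(n_cols)]
            for i in range(n_rows)]


def most_active_location(saliency_map):
    """Return (row, col) of the largest value in a 2-D map."""
    candidates = ((value, (i, j))
                  for i, row in enumerate(saliency_map)
                  for j, value in enumerate(row))
    return max(candidates, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    # Illustrative rates (Hz): the central location stands out in both maps.
    color_map = [[1.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 1.0]]
    orientation_map = [[2.0, 2.0, 2.0], [2.0, 4.0, 2.0], [2.0, 2.0, 2.0]]
    saliency = combine_feature_maps([color_map, orientation_map])
    print(most_active_location(saliency))
```

Because the summed map carries no feature label, the selected location depends only on how much the combined activity at the target exceeds that at the distractors.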
We find that the fictitious saliency neurons in early visual areas show little or no difference in response to the target and distractors. However, for the network architecture A, differential responses emerge in higher visual areas for the singleton and popout displays but not for the conjunction display (Fig. 4A). More specifically, the global differential response (i.e., the difference between the response of target- and distractor-selective neurons) increases for popout, while it fluctuates around zero for the conjunction display in all three regions (Fig. 4B). This is not the case for the network architecture B, where the global differential response also increases for the conjunction display (supplemental Figs. S6, S7, available at www.jneurosci.org as supplemental material).
Compatible with these observations, we find that when individual features are processed in distinct populations of neurons, the detection of popout but not conjunction targets is improved in successive layers in configuration A but not B (Fig. 5). Overall, these results indicate that a feature-independent saliency signal can be formed by convergence of outputs of different feature-selective neural populations in higher areas of the visual cortex, but this mechanism is effective only if feature processing is kept separated in lower visual areas.
When do local and global saliency signals first emerge? We examined the time course of the local saliency signal by calculating the difference between the activity of target-selective neurons in response to given popout and conjunction displays. Not only does this difference increase, but it also emerges earlier relative to the activity onset as signals propagate from V1 to V2 and V4 (Fig. 6A).
We likewise examined the time course of the global saliency signal by calculating the difference between the response of target-selective neurons and of the most active distractor-selective neurons on each trial (for neurons in the constructed saliency populations). This difference increases and occurs earlier relative to the response onset in higher regions, but only for popout displays (Fig. 6B). Note that at the beginning of a trial before saliency computations are formed through lateral interactions, neurons are mainly driven by external inputs and because of noise many neurons can have higher activity than target-selective neurons. This results in early negative values for global saliency signals in our model.
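The per-trial measure described above can be sketched as follows. This is a minimal illustration, assuming one rate per location in the constructed saliency population; the function name and the toy rates are hypothetical.

```python
# Illustrative sketch only: the rates below are made-up toy values.

def global_saliency_signal(rates, target_loc):
    """Target response minus the largest distractor response on one trial."""
    distractors = [r for loc, r in enumerate(rates) if loc != target_loc]
    return rates[target_loc] - max(distractors)


target_loc = 2
early_rates = [11.0, 14.0, 12.0, 9.0]  # noise-dominated, shortly after onset
late_rates = [8.0, 9.0, 21.0, 7.0]     # after lateral interactions take effect

early_signal = global_saliency_signal(early_rates, target_loc)
late_signal = global_saliency_signal(late_rates, target_loc)
print(early_signal, late_signal)
```

Taking the maximum over distractors, rather than their mean, is what makes the signal negative early in a trial: with many noisy distractor locations, at least one of them is likely to exceed the target by chance.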
Role of NMDA in saliency computations
The excitatory currents in our network model are transmitted through two types of synapses: fast AMPA and slow, saturating NMDA. Generally, we find that the recurrent excitation should be dominated by NMDA currents. To test the role of these synapses in saliency computations, we reduce NMDA currents while increasing AMPA currents such that the overall recurrent excitation stays at the same level (see Materials and Methods for more details).
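One simple way to picture this manipulation is to split a fixed total excitation between the two receptor types according to the AMPA to NMDA ratio. The sketch below is an assumption made for illustration: it treats the two contributions as additive in arbitrary units, whereas the actual matching procedure is the one given in Materials and Methods and may differ, for example because of the different time constants and the saturation of NMDA currents. The function name and units are hypothetical.

```python
# Illustrative sketch only: additive currents in arbitrary units are assumed.

def split_recurrent_excitation(total, ampa_to_nmda_ratio):
    """Split a fixed total recurrent excitation into AMPA and NMDA parts."""
    if total < 0 or ampa_to_nmda_ratio < 0:
        raise ValueError("total and ratio must be non-negative")
    nmda = total / (1.0 + ampa_to_nmda_ratio)
    ampa = total - nmda
    return ampa, nmda


total_excitation = 1.0  # arbitrary units (assumed)
for ratio in (0.1, 0.5, 1.0, 2.0):
    ampa, nmda = split_recurrent_excitation(total_excitation, ratio)
    print(f"ratio={ratio:.1f}  AMPA={ampa:.3f}  NMDA={nmda:.3f}")
```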
We find that increasing the AMPA to NMDA current ratio disrupts the formation of both local and global saliency signals (Fig. 7). That is, the response to popout displays is not different from the response to the conjunction display in higher visual areas (Fig. 7A), and similarly, the difference in response of target- and distractor-selective neurons in the constructed saliency populations is reduced (Fig. 7B).
Moreover, as the NMDA currents are reduced, the amount of increase in the local and global differential response in successive regions is also reduced. Similarly we find that the probability of maximum response at the target location is reduced, especially for color and orientation popout displays (supplemental Fig. S8, available at www.jneurosci.org as supplemental material). Finally, we compute the temporal dynamics of the saliency signal for different values of the AMPA to NMDA current ratio. We find that an increase in the AMPA to NMDA current ratio delays and further eliminates the formation of the saliency signal in higher visual areas and moreover, introduces a strong oscillation in the response in these regions (supplemental Figs. S9, S10, available at www.jneurosci.org as supplemental material).
An explicit saliency map and its action onto earlier stages
To study the effect of feedback from the saliency map on the saliency computations in lower visual areas, we use a smaller version of our model with only two layers of neural populations, corresponding to neurons in V4 and LIP/FEF (because of the computational cost). As we showed in the previous sections, the saliency signal can be formed in successive layers of neurons when individual features are processed separately. Therefore, here we use the same architecture for neurons in V4 (see Materials and Methods for more details).
We first assure ourselves that approximate saliency signals can form in a single simulated cortical area (compared with three as in the above simulations). We use a stronger and more widely projecting regional connectivity matrix than before (compare model parameters in supplemental Tables 1, 2, available at www.jneurosci.org as supplemental material) to reproduce our basic result: the response of color/orientation-selective neurons to all popout displays is larger than the response to the conjunction display (Fig. 8A). Nevertheless, we observe a small positive but significant global saliency index for the conjunction display, which does not appear in saliency computations in successive regions of V1, V2 and V4 (compare Fig. 8 and results for V4 in Fig. 4B).
Because the input to the saliency map is the sum of the outputs of feature-selective neurons, this input carries a smaller saliency signal than the population of neurons selective to the target. That is, the global differential response is larger in the V4 population selective to red-vertical bars than in the saliency map (compare the response to the target and distractors for each case in Fig. 8A). We will return to this issue in the Discussion.
To examine the result of lateral interactions in the saliency populations, we then compute GSI for the synaptic inputs that originate from V4 cells (to the saliency map) and for the response of neurons in the saliency map. As expected, we find that the lateral interactions within the saliency map enhance the GSI and therefore enhance the saliency signal (Fig. 8B). Note that the observed decrease in the normalized response in the saliency map population (compared with its inputs) is due to the large response to the singleton display in this population.
We next introduce feedback from the excitatory neurons in the saliency map to all excitatory and inhibitory populations in V4 at corresponding locations. These two types of feedback are adjusted such that they approximately modulate the response of the feature-selective neurons rather than driving them (Schwabe et al., 2006). Feedback differentially changes the response to each display, increasing popout selectivity while increasing variability (Fig. 9).
Finally, how does this model act in the presence of a top-down signal, as in the saccade preparation experiment of Burrows and Moore (2009)? In this experiment, the monkey was cued to make a saccade to a location far from the RF of the recorded V4 neuron, and was rewarded with a drop of juice for making a saccade to the cued location after the fixation spot disappeared. While the monkey planned such a saccade, a visual stimulus was presented at a random time before the initiation of the saccade. Burrows and Moore found that under this manipulation, the differential response of V4 neurons to the popout and conjunction displays observed in the control passive viewing experiment was eliminated. We hypothesize that interaction between bottom-up and top-down attentional signals occurs in the saliency map (LIP or FEF), where both signals are found (Thompson et al., 2005; Balan and Gottlieb, 2006; Ipata et al., 2006; Buschman and Miller, 2007). More specifically, we assume that saccade preparation induces activity within the saliency map at the saccade target location. This corresponds to the introduction of a highly salient target at that location through working memory, thereby altering feedback from the saliency map to earlier areas.
To test this hypothesis, we simulate the saccade preparation task by adding excitatory inputs to all populations at locations corresponding to the saccade target (to mimic working memory inputs). These inputs are strong enough so there is a representation of the saccade target in the saliency map on all trials, but are weak enough so they do not alter the response to the singleton display.
During the simulated saccade preparation the response to different displays changes slightly and differentially (Fig. 9), such that the popout selectivity indices for popout are reduced, in line with the experimental data (Burrows and Moore, 2009). That is, saccade preparation reduces the response to popout displays more so than the response to the conjunction display. This happens because for popout displays, the target provokes a strong response in the saliency map (Fig. 8A) which in turn results in a strong feedback to feature-selective neurons. The converse is true for the conjunction target. During saccade preparation, the saccade target also invokes a strong response in the saliency population which reduces the response to the target in this population through lateral inhibition (supplemental Fig. S11, available at www.jneurosci.org as supplemental material) and consequently, the response to the target in feature-selective neurons. Therefore, the amount of reduction in the response of feature-selective neurons depends on the target response in the saliency map during the control trials, and on the response to the saccade target in this map.
Discussion
Despite a large body of literature on the psychology of bottom-up attention and how it operates within visual scenes, much less is known about its neural substrates. Here we design a biophysically plausible spiking network model to investigate the representation and formation of saliency signals in the visual cortex and its interaction with top-down attention.
We focus on lateral excitatory and inhibitory interactions as the main mechanism for saliency computations. By comparing two distinct network architectures, we find that local and global saliency signals emerge and increase in successive layers of neural populations only if individual features are processed in different neural populations (configuration A). That is, while the activity of target-selective neurons in the first visual area of our model (V1) does not discriminate between popout and conjunction displays, neurons in higher areas of the model (V2 and V4) show stronger response to popout than to conjunction displays, similar to experimental observations in V1 and V4, respectively (Hegdé and Felleman, 2003; Burrows and Moore, 2009). Moreover, the difference between the response to the target and distractors, as well as target detectability, increases in successive layers for popout but not for conjunction displays, compatible with the basic difference in detection of popout and conjunction targets (Treisman and Sato, 1990).
Similar to experimental data (Burrows and Moore, 2009), we obtain larger local and global saliency signals for the combined popout than for single popout displays; that is, the detection of the popout target is easier when it differs from the distractors in two features rather than one. Our finding is also consistent with the so-called “redundant-signal effect” (shortening of the reaction time when the response is triggered by two rather than one response-related target signal) demonstrated in a visual popout search task. This effect has been attributed to coactivation of different visual pathways and their subsequent convergence before response triggering (Krummenacher et al., 2001, 2002, 2010; Zehetleitner et al., 2008; Töllner et al., 2010). Similarly, in our model saliency signals for the combined popout are stronger because of independent input processing related to different features and their parallel processing in separate neural populations before convergence in the saliency map.
As an alternative to saliency computations in successive layers, we also consider wider and stronger lateral interactions in one layer of neural populations which process individual features separately. Even though we observe local saliency signals (i.e., positive popout selectivity indices), this signal is very small for the orientation popout display (Fig. 9B). Moreover, this mechanism results in a small positive but significant global saliency index for the conjunction display (Fig. 8) and in a noisier detection of the target (data not shown).
Therefore, we conclude that moderate lateral interactions in successive layers of neurons selective to individual features provide a suitable mechanism for early saliency computations. Furthermore, neurons that process individual features separately are more likely to contribute to bottom-up saliency than neurons that are simultaneously selective to both color and orientation (Livingstone and Hubel, 1987).
Our saliency computations can be compared with the standard Itti-Koch computational saliency model (Koch and Ullman, 1985; Itti et al., 1998; Itti and Koch, 2001). The saliency model exploits center-surround computations (i.e., subtracting a filtered image at a lower spatial resolution from the image at a higher spatial resolution) to capture local feature contrasts in the image and to form feature maps, as well as normalization to enhance (respectively suppress) those maps with a few (respectively many) active locations. Lateral excitation and inhibition between neural populations enables our model to approximately perform center-surround and normalization computations without using a multi-resolution representation of an input image at different scales. An important requirement for this to happen is that inhibitory connections should be wider than excitatory connections.
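The requirement that inhibition be wider than excitation can be illustrated with a static, rate-based toy example. The sketch below is not the spiking model: it assumes Gaussian connectivity profiles, a single step of lateral interaction on a one-dimensional array of locations, and arbitrary parameter values, all of which are hypothetical.

```python
import math

# Illustrative sketch only: Gaussian profiles and all parameters are assumed.

def lateral_weight(distance, j_exc=1.0, sigma_exc=1.0, j_inh=0.5, sigma_inh=3.0):
    """Net lateral weight: narrow excitation minus broad inhibition."""
    excitation = j_exc * math.exp(-distance**2 / (2 * sigma_exc**2))
    inhibition = j_inh * math.exp(-distance**2 / (2 * sigma_inh**2))
    return excitation - inhibition


def response(feature_map, loc):
    """Rectified sum of feedforward input and lateral input from other locations."""
    lateral = sum(
        lateral_weight(abs(loc - other)) * rate
        for other, rate in enumerate(feature_map)
        if other != loc
    )
    return max(0.0, feature_map[loc] + lateral)


center = 5
sparse_map = [0.0] * 11
sparse_map[center] = 1.0   # feature map with a single active location
dense_map = [1.0] * 11     # feature map with many active locations

sparse_response = response(sparse_map, center)
dense_response = response(dense_map, center)
print(sparse_response, dense_response)
```

With these assumed parameters the net weight is positive at the center and negative in the surround, so a map with one active location keeps its response while a map with many active locations is suppressed, which mirrors the enhancement of sparse maps and the suppression of dense maps described above.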
Normalization of sensory inputs by the sum of inputs has been used in a few models of top-down attention (Reynolds et al., 1999; Lee and Maunsell, 2009; Reynolds and Heeger, 2009). For example, Reynolds and Heeger (2009) proposed that top-down attention improves sensitivity to faint stimulus through multiplicative interaction of inputs and the “attention field,” followed by normalization of the response by the activity in the “suppressive field.” Lateral interactions proposed here can provide a biophysical mechanism for normalization due to the suppressive field in this model. Alternatively, Lee and Maunsell (2009) proposed that top-down attention affects neural response solely through changing the strength of normalization and not the inputs. If such a mechanism is implemented through lateral interactions, then top-down attention should mostly affect the activity in inhibitory neurons to change the strength of normalization.
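The general form of such divisive normalization can be sketched in a few lines. This is a simplified illustration and not an implementation of any of the cited models: it assumes that the suppressive field is the mean of the attention-weighted drive over all locations, and the function name, the constant `sigma`, and the input values are hypothetical.

```python
# Illustrative sketch only: pooling rule, sigma, and inputs are assumptions.

def normalized_response(drive, attention, sigma=0.1):
    """Attention-weighted drive divided by pooled suppressive drive plus sigma."""
    weighted = [a * e for a, e in zip(attention, drive)]
    suppressive = sum(weighted) / len(weighted)  # assumed pooling: plain mean
    return [w / (suppressive + sigma) for w in weighted]


drive = [0.1, 1.0, 1.0]        # a faint stimulus at location 0
neutral = [1.0, 1.0, 1.0]      # no attention
attend_0 = [2.0, 1.0, 1.0]     # attention field centered on location 0

r_neutral = normalized_response(drive, neutral)
r_attended = normalized_response(drive, attend_0)
print(r_neutral[0], r_attended[0])
```

In this toy setting, attention raises the response to the faint stimulus while slightly lowering the responses at unattended locations, because the attended input also contributes to the shared suppressive pool.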
While lateral interactions between spiking neurons through realistic synapses can approximate center-surround and normalization computations, these computations are limited by biophysical properties of neural elements in the network. This happens because neurons in the network should integrate, in every spike, the noisy background inputs plus excitatory and inhibitory inputs from neighboring neurons and subsequently transmit this signal to other neurons in the network through realistic synapses with different time constants. Due to these factors, some conditions should be met in order for the network to perform efficient center-surround and normalization computations. First, we find that recurrent excitations between excitatory neurons should be dominated by slow NMDA currents. This enables neurons to integrate the noisy input signals over a longer timescale and to increase the signal-to-noise ratio. Second, excitatory to inhibitory connections, which drive the lateral inhibition through inhibitory interneurons, should have both AMPA and NMDA currents. This is because if these connections are also dominated by NMDA currents, the inhibitory interneurons become active slowly, which can suppress the activity in the network after the response onset. These results emphasize the crucial role of the NMDA receptor in the saliency computations, similar to its role in working memory and decision making (Wang, 1999, 2002; Compte et al., 2000). Based on our results, we predict that inactivation of NMDA receptors in the visual cortex results in a noisier and weaker saliency signal, and impairs the performance in the visual search task.
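The benefit of slow integration for the signal-to-noise ratio can be illustrated with a first-order synaptic filter applied to a constant signal in white noise. The sketch below is a deliberately reduced picture: it ignores spiking, NMDA saturation, and voltage dependence, and the time constants (2 ms for the fast filter, 100 ms for the slow one), the noise level, and the function name are assumptions chosen for illustration.

```python
import random
import statistics

# Illustrative sketch only: linear filters and all parameters are assumed.

def synaptic_trace(inputs, tau_ms, s0, dt_ms=1.0):
    """First-order low-pass filter, ds/dt = (input - s) / tau, Euler steps."""
    alpha = dt_ms / tau_ms
    s = s0
    trace = []
    for x in inputs:
        s += alpha * (x - s)
        trace.append(s)
    return trace


rng = random.Random(0)  # fixed seed so the example is reproducible
signal = 1.0
noisy_input = [signal + rng.gauss(0.0, 1.0) for _ in range(5000)]

fast = synaptic_trace(noisy_input, tau_ms=2.0, s0=signal)    # AMPA-like
slow = synaptic_trace(noisy_input, tau_ms=100.0, s0=signal)  # NMDA-like

snr_fast = statistics.mean(fast) / statistics.pstdev(fast)
snr_slow = statistics.mean(slow) / statistics.pstdev(slow)
print(f"SNR fast: {snr_fast:.2f}, SNR slow: {snr_slow:.2f}")
```

The slow filter averages the noise over a longer window and therefore yields a higher signal-to-noise ratio, at the cost of responding more sluggishly to changes in its input.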
Although we find that recurrent excitation in our spiking network should be dominated by slow NMDA currents, this does not translate to slow emergence of the saliency signal. Interestingly, we find that both local and global saliency signals emerge earlier in successive regions of the model (relative to the response onset in each region). Similarly, Buffalo and colleagues recently showed that top-down attentional effects appear earlier and stronger in V4 than in V2 and V1 (Buffalo et al., 2010). This result may also explain how the saliency signal can appear in higher visual areas such as the LIP and FEF even before the appearance of this signal in striate and extrastriate cortices. For example, compare the emergence of this signal in V4 (Burrows and Moore, 2009) with the LIP and FEF (Buschman and Miller, 2007). Interestingly, Buschman and Miller (2007) found an earlier saliency signal in the LIP than in the FEF during popout search, while this signal emerges earlier in the FEF during a visual search task which requires top-down attention (but see Schall et al., 2007). These observations suggest different roles for the LIP and FEF in saliency computations, a topic which requires further investigations.
V1 has been implicated as the site of early saliency signals (Li, 2002) because V1 neurons are influenced by the stimulation of regions outside their classical RF in a nonlinear fashion (DeAngelis et al., 1994; Albright and Stoner, 2002; Cavanaugh et al., 2002a,b). For example, activity of V1 neurons in the alert monkey is only weakly suppressed (with respect to the singleton display) when the surrounding bars are in orientation perpendicular to the orientation of the central bar (Knierim and van Essen, 1992). However, an experiment specifically designed to detect the presence of saliency signals in monkey V1 came up empty handed; Hegdé and Felleman (2003) found that V1 neurons do not distinguish between popout and conjunction displays; rather, they signal the existence of center-surround discontinuity. Similarly, we find that moderate lateral interactions between neurons with similar feature selectivity can result in a response pattern which depends on the display but does not distinguish between popout and conjunction. In addition, we show that the absence of local saliency signals is indicative of the absence of global saliency signals.
Conversely, single-cell data indicate that V4 neurons are selective to bottom-up attentional signals such as popout displays, and are modulated by top-down attention as well as the activity of neurons in LIP and FEF (Schiller and Lee, 1991; Hupé et al., 1998; Reynolds et al., 2000; Tolias et al., 2001; Moore and Armstrong, 2003; Reynolds and Desimone, 2003; Armstrong et al., 2006; Armstrong and Moore, 2007; Gregoriou et al., 2009). Compelling evidence for the presence of saliency signals in V4 was provided by Burrows and Moore (2009). They demonstrated that V4 neurons, considered as a single population, respond more strongly to popout than to conjunction displays. Furthermore, this difference is eliminated when the monkey prepares a saccade to a location far from the RF of the recorded neuron, which indicates that this bottom-up saliency signal is influenced by top-down attentional signals. There is converging evidence that these signals possibly originate in the FEF (Moore and Armstrong, 2003; Armstrong et al., 2006; Armstrong and Moore, 2007; Monosov et al., 2008; Gregoriou et al., 2009). We find in our model that feedback from the saliency map to earlier regions enhances the saliency signal while saccade preparation reduces this signal by altering the feedback.
Interestingly, our model predicts that the effect of saccade preparation on the response to a given stimulus depends on the response of target-selective neurons in the saliency map during the control passive viewing task. This prediction can be tested by recording from neurons in the saliency map (e.g., LIP/FEF) and in feature-selective populations which receive feedback from the saliency map (e.g., V4). Such recording can be used to compute the correlation between the responses of neurons in the saliency map and the reduction in the popout selectivity index for different displays in the control and saccade preparation tasks, respectively. We predict a positive correlation between these two quantities.
By combining the outputs of neural populations which process individual features (in configuration A) we construct different color/orientation-selective neural populations. Among these populations the one which is selective to the target features carries a saliency signal stronger than the signal in the saliency population (Fig. 8). So why should the brain bother with a distinct saliency map in the first place? In the absence of an explicit saliency map, the brain would need to detect the target by first identifying this feature-selective population. This is of course quite difficult in dense natural scenes with many, partially occluded targets, which is why the strategy of a saliency map that labels the sites of potential objects is an attractive computational option. Similarly, feedback to neural populations in V1 from V2 or V4 populations with similar selectivity does not improve saliency signals, as this feedback does not contain any information about the most salient location in the visual scene and only acts as a stronger recurrent input from the same area. Instead, feedback from the saliency map improves the saliency signal, as it can enhance the signal in neurons selective to the most salient location. Finally, a feature-independent saliency map, formed by the convergence of outputs of neural populations selective to different features, is consistent with the observation that the saliency signal in the FEF appears in the spiking activity before the local field potential (Monosov et al., 2008).
Electrophysiological evidence suggests that a saliency map is instantiated in the response of neurons in the posterior parietal area 7a (Constantinidis and Steinmetz, 2001, 2005), area LIP (Gottlieb et al., 1998; Kusunoki et al., 2000; Bisley and Goldberg, 2003), and in the FEF (Thompson and Bichot, 2005). Interestingly, in such a setting the activity of the saliency population and also the detection of the target can be adjusted by gating the inputs from different feature-selective populations (Rutishauser and Koch, 2007) and by top-down signals. To model saccade preparation, we assume that saliency neurons selective to the location of the saccade target become active and stay active during the stimulus presentation (at a fixed level of activity). Conceivably, trial-by-trial variability in the representation of the saccade target can alter the feedback to V4 neurons and consequently, the saliency signal. Such variability has been observed in areas LIP and FEF and was shown to be correlated with monkeys' ability to ignore distractors (Thompson et al., 2005; Balan and Gottlieb, 2006; Ipata et al., 2006).
Finally, while it has been shown that saliency computations can be easily performed through center-surround and normalization computations (Itti et al., 1998), we find biophysical limits for performing these computations by spiking neurons and realistic synapses. These limits point to the general biophysical mechanisms which are used in other parts of the brain. Namely, due to response variability of cortical neurons, the integration of input signals needs to be done through slow NMDA synapses and in successive layers of neural populations. Computation in successive layers of neural populations results in earlier emergence of the saliency signal in higher visual areas, which in turn can provide feedback to lower visual areas and improve the saliency computations.
Footnotes
- We are grateful to the Mathers Foundations, the Office of Naval Research, and the Defense Advanced Research Projects Agency for financial support of the research reported here. We thank Zahra Ayubi, Brittany Burrows, and Tirin Moore for helpful discussions and comments on the manuscript.
- Correspondence should be addressed to Alireza Soltani at the above address. soltani{at}bcm.edu