Abstract
The brain estimates visual motion by decoding the responses of populations of neurons. Extracting unbiased motion estimates from early visual cortical neurons is challenging because each neuron contributes an ambiguous (local) representation of the visual environment and an inherently variable neural response. To mitigate these sources of noise, the brain can pool across large populations of neurons, pool the response of each neuron over time, or a combination of the two. Recent psychophysical and physiological work points to a flexible motion pooling system that arrives at different computational solutions over time and for different stimuli. Here we ask whether a single, likelihood-based computation can accommodate the flexible nature of spatiotemporal motion pooling in humans. We examined the contribution of different computations to human observers' performance on two global visual motion discrimination tasks, one requiring the combination of motion directions over time and another requiring their combination in different relative proportions over space and time. Observers' perceived direction of global motion was accurately predicted by a vector average readout of direction signals accumulated over time and a maximum likelihood readout of direction signals combined over space, consistent with the notion of a flexible motion pooling system that uses different computations over space and time. Additional simulations of observers' performance with a population decoding model revealed a more parsimonious solution: flexible spatiotemporal pooling could be accommodated by a single computation that optimally pools motion signals across a population of neurons that accumulate local motion signals on their receptive fields at a fixed rate over time.
Introduction
The brain estimates visual motion by decoding the responses of populations of neurons. Extracting unbiased motion estimates from early visual cortical populations is challenging: each neuron contributes an ambiguous (local) representation of the visual environment (Hubel and Wiesel, 1962) and an inherently variable response (Schiller et al., 1976; Dean, 1981). To mitigate these sources of uncertainty, the brain can pool local motion measurements across large populations of neurons, pool the response of each neuron over time, or a combination of the two (for review, see Braddick, 1993; Mingolla, 2003; Born and Bradley, 2005; Born et al., 2009; Smith et al., 2009).
Recent psychophysical work suggests that spatiotemporal pooling of local motion samples is dynamic and flexible: rigid motion computations evolve over time and can switch when stimulus attributes change (Stone et al., 1990; Stone and Thompson, 1992; Yo and Wilson, 1992; Burke and Wenderoth, 1993; Lorenceau et al., 1993; Cropper et al., 1994; Bowns, 1996; Amano et al., 2009). For example, the human motion system computes the vector average direction of rigid motion at relatively short stimulus durations and low contrasts and intersection of constraints at longer durations and higher contrasts (Yo and Wilson, 1992; Cropper et al., 1994). Many of these computational dynamics are reflected in the behavior of motion-sensitive neurons in middle temporal (MT) cortex (Pack and Born, 2001; Pack et al., 2001; Smith et al., 2005; Majaj et al., 2007) and ocular following and smooth pursuit eye movements (Recanzone and Wurtz, 1999; Ferrera, 2000; Masson, 2004; Born et al., 2006; Barthélemy et al., 2010), pointing to a flexible motion pooling system that arrives at different computational solutions over time and for different stimuli.
Theoretical considerations suggest a more parsimonious pooling solution (Paradiso, 1988; Foldiak, 1993; Seung and Sompolinsky, 1993; Sanger, 1996; Deneve et al., 1999; Weiss et al., 2002; Jazayeri and Movshon, 2006; Stocker and Simoncelli, 2006), one that can accommodate a range of phenomena with a single, coherent computation: the likelihood function. It differs from other pooling computations because it generates not a single estimate of a stimulus but rather the probabilities that different stimuli could have elicited the responses from a population of neurons. Moreover, visual likelihoods can be implemented within a plausible population decoding framework (Jazayeri and Movshon, 2006) and can, with certain assumptions (Weiss et al., 2002; Jazayeri and Movshon, 2006; Stocker and Simoncelli, 2006), account for a wide range of perceptual behaviors, including orientation discrimination (Regan and Beverley, 1985), perceived direction (Webb et al., 2007), perceived velocity (Weiss et al., 2002; Stocker and Simoncelli, 2006), and cue combination both within (Landy et al., 1995; Jacobs, 1999) and across (Ernst and Banks, 2002; Alais and Burr, 2004) modalities.
Here we ask whether a single, likelihood-based computation can accommodate the flexible nature of spatiotemporal motion pooling in humans. We distinguished the contribution of different computations by probing the underlying neural circuits with asymmetrical distributions of local motion samples with distinct summary statistics. Our results point to a single computation that optimally pools motion signals across a population of neurons that temporally summate local motions within their receptive fields at a fixed rate over time.
Materials and Methods
Subjects
Four human observers (three male, one female) with normal or corrected-to-normal vision participated. Three were authors (F.R., T.L., and B.S.W.), and one (D.M.G.) was naive to the purpose of the experiments.
Visual stimuli
Random dot kinematograms (RDKs) (examples shown in Fig. 1) were generated on a personal computer running custom software written in Python, using components of PsychoPy (Peirce, 2007). Stimuli were displayed on an IIyama Vision Master Pro 514 cathode ray tube monitor with a resolution of 1280 × 1024 pixels and an update rate of 75 Hz, viewed at a distance of 76.3 cm. Each RDK was generated anew before its presentation on each trial. Each image in a motion sequence consisted of 226 dots (luminance, 0.05 cd/m²) displayed within a circular window (6° radius) on a uniform luminance background (25 cd/m²). Continuous apparent motion was produced by presenting the images consecutively at an update rate of 18.75 Hz, which is comparable with our previous work (Webb et al., 2007) and that used in other studies of global motion (Williams and Sekuler, 1984; Watamaniuk et al., 1989; Watamaniuk and Sekuler, 1992; Edwards and Badcock, 1995). Dot density and diameter were 2 dots/deg² and 0.1°, respectively. On the first frame in a motion sequence, dots were randomly positioned in the circular window and were displaced at 5°/s. Dots that fell outside the window were wrapped to the opposite side.
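The stimulus parameters above are mutually consistent, which the following minimal sketch checks by arithmetic. The function and key names are illustrative and not taken from the original software; truncating the dot count to a whole number is an assumption.

```python
import math

# Values quoted in the stimulus description.
MONITOR_HZ = 75.0          # monitor update rate
IMAGE_HZ = 18.75           # rate at which new images were presented
WINDOW_RADIUS_DEG = 6.0    # radius of the circular window
DOT_DENSITY = 2.0          # dots per square degree
DOT_SPEED_DEG_S = 5.0      # dot speed

def stimulus_summary():
    """Derive per-image quantities from the quoted stimulus parameters."""
    window_area = math.pi * WINDOW_RADIUS_DEG ** 2          # deg^2
    return {
        # number of monitor refreshes for which each image is shown
        "frames_per_image": MONITOR_HZ / IMAGE_HZ,
        # density x area, truncated to whole dots (assumed rounding rule)
        "n_dots": int(DOT_DENSITY * window_area),
        # displacement of each dot between consecutive images
        "step_deg": DOT_SPEED_DEG_S / IMAGE_HZ,
    }

if __name__ == "__main__":
    for name, value in stimulus_summary().items():
        print(name, value)
```

A density of 2 dots/deg² over a window of 6° radius gives the 226 dots quoted in the text, and each image spans four monitor refreshes.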
Figure 1. Examples of RDKs used in the temporal and spatiotemporal experiments. In each experiment, observers judged whether sequentially presented standard or comparison RDKs had a more clockwise direction of motion. Temporal and spatial dot directions were sampled with replacement from an asymmetrical probability distribution with distinct measures of central tendency. All dots in the temporal comparison were displaced in the same randomly sampled direction on each image, generating a temporal sequence of directions across images; individual dots in the spatial comparison were displaced in independently sampled directions on each image, generating a spatial distribution of directions on each image. The comparison RDKs used for the spatiotemporal experiments consisted of different mixtures of spatial and temporal dot directions.
Psychophysical procedure
In a temporal two-alternative forced-choice task, observers judged which of two RDKs had a more clockwise direction of motion. On each trial, “standard” and “comparison” RDKs (Fig. 1) were presented in a random temporal order separated by a 1000 ms interval containing a fixation cross (luminance, 0.05 cd/m2) on a uniform luminance background. The standard RDK was always composed of dots that moved in a common direction on each trial, randomly chosen from a uniform distribution spanning 360°. The comparison RDK was composed of dot directions drawn from a probability distribution with distinct measures of central tendency (Fig. 2A–C).
Temporal pooling experiments.
Both comparison and standard RDKs consisted of 25 images, presented for a total duration of 1300 ms. The comparison dots were all displaced in a common direction on each image, sampled randomly and independently from the distribution, generating a temporal sequence of directions.
In the first two experiments, the temporal statistics of the comparison RDK were manipulated by independently varying the SD of the half-widths of the distribution, assigning the left half as the clockwise (CW) SD and the right half as the counterclockwise (CCW) SD. For the first experiment, dot directions were sampled at 5° intervals from a Gaussian distribution. The SD of the CCW half of the Gaussian was either 30, 40, 50, or 60°; corresponding values on the CW half were 30, 20, 10, or 0°, generating asymmetrical distributions of dot directions with an increasingly distinct mode (Fig. 2A). For the second experiment, dot directions of the comparison RDK were sampled from a Gaussian with CCW SDs of 30, 50, 70, or 90° and CW SDs of 6, 10, 14, or 18°. Each half of the distribution was sampled at 5 and 1° intervals, respectively, generating asymmetrical distributions of dot directions with an increasingly distinct vector average (Fig. 2B). For both experiments, the modal direction of the comparison RDK was randomly chosen on each trial using the method of constant stimuli. In the third experiment, dot directions for the comparison RDK were sampled from a uniform distribution with a total range of 180°. We assigned each half of the distribution a different range and sampling density, sampling the CCW half at 5° intervals over a range of 90, 110, 130, or 150° and the CW half in linear intervals over a range of 90, 70, 50, or 30°. This generated asymmetrical distributions with increasingly different medians and vector averages (Fig. 2C). The median direction of the comparison RDK was randomly chosen on each trial using the method of constant stimuli.
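To illustrate how skewing such a distribution separates its measures of central tendency, the sketch below builds a split Gaussian of the kind used in the first experiment and computes its mode, median, and vector average. Several details are assumptions rather than the authors' procedure: directions are expressed relative to the mode at 0°, CCW is taken as positive, and the distribution is truncated at ±175°.

```python
import math

def split_gaussian(ccw_sd, cw_sd, step=5, limit=175):
    """Discrete direction distribution (deg relative to the mode at 0).

    Assumptions: CCW directions are positive, each half follows a Gaussian
    with its own SD, and the distribution is truncated at +/- limit.
    """
    weights = {}
    for d in range(-limit, limit + 1, step):
        sd = ccw_sd if d > 0 else cw_sd
        if d == 0:
            weights[d] = 1.0                     # peak of the Gaussian
        elif sd == 0:
            weights[d] = 0.0                     # an SD of 0 leaves only the mode
        else:
            weights[d] = math.exp(-0.5 * (d / sd) ** 2)
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

def mode(dist):
    """Direction with the highest probability."""
    return max(dist, key=dist.get)

def median(dist):
    """First direction at which the cumulative probability reaches 0.5."""
    cumulative = 0.0
    for d in sorted(dist):
        cumulative += dist[d]
        if cumulative >= 0.5:
            return d
    return max(dist)

def vector_average(dist):
    """Direction of the probability-weighted sum of unit vectors (deg)."""
    x = sum(p * math.cos(math.radians(d)) for d, p in dist.items())
    y = sum(p * math.sin(math.radians(d)) for d, p in dist.items())
    return math.degrees(math.atan2(y, x))

if __name__ == "__main__":
    for ccw, cw in [(30, 30), (40, 20), (50, 10), (60, 0)]:
        dist = split_gaussian(ccw, cw)
        print(ccw, cw, mode(dist), median(dist), round(vector_average(dist), 1))
```

For the symmetrical case the three statistics coincide; as the CCW SD grows and the CW SD shrinks, the median and vector average move away from the mode in the CCW direction.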
Spatiotemporal pooling experiments.
Both the comparison and standard RDKs consisted of two, four, or eight images, presented consecutively for a total duration of 104, 208, or 416 ms. The comparison RDK consisted of different mixtures of “spatiotemporal” (100–0%) and “temporal” (0–100%) dot directions. Spatiotemporal dot directions are hereafter referred to as “spatial.” Spatial directions were sampled independently from each other on the current image and from their own direction on previous images; temporal dots were displaced in a common direction on each image, independently of their direction on previous images. Spatial and temporal dot directions were sampled in different proportions, with replacement, from a single asymmetrical uniform distribution (see Fig. 5). We chose this distribution because it is diagnostic at distinguishing the predictions of a vector average from a maximum likelihood readout of perceived direction (Webb et al., 2007). The median direction of the comparison RDK was randomly chosen on each trial using the method of constant stimuli.
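The mixing of spatial and temporal dot directions can be sketched as follows. This is an illustration only: the 150°/30° ranges, the equal weighting of sampled directions, and all function names are assumptions, since the text specifies the distribution for this experiment only by reference to Figure 5.

```python
import random

def asymmetric_uniform(ccw_range=150, cw_range=30, ccw_step=5):
    """Directions (deg relative to the median at 0) with equal weights.

    Assumption: both halves contain the same number of equally likely
    directions, so the CW half is sampled more densely than the CCW half.
    """
    n_half = ccw_range // ccw_step
    ccw = [ccw_step * (k + 1) for k in range(n_half)]
    cw = [-cw_range * (k + 1) / n_half for k in range(n_half)]
    directions = cw + ccw
    return directions, [1.0] * len(directions)

def sample_image(directions, weights, n_dots, temporal_fraction, rng):
    """Dot directions for one image of the comparison RDK."""
    n_temporal = round(n_dots * temporal_fraction)
    # Temporal dots share a single direction, resampled on every image.
    common = rng.choices(directions, weights)[0]
    temporal = [common] * n_temporal
    # Spatial dots are sampled independently, with replacement.
    spatial = rng.choices(directions, weights, k=n_dots - n_temporal)
    return temporal + spatial

def sample_sequence(directions, weights, n_dots, n_images, temporal_fraction, rng):
    """Independent samples for each image in the motion sequence."""
    return [sample_image(directions, weights, n_dots, temporal_fraction, rng)
            for _ in range(n_images)]

if __name__ == "__main__":
    rng = random.Random(0)
    dirs, w = asymmetric_uniform()
    for image in sample_sequence(dirs, w, 226, 2, 0.5, rng):
        commonest = max(set(image), key=image.count)
        print(commonest, image.count(commonest))
```

Because every temporal dot carries the same direction, that direction accounts for at least half of the dots on each image in a 50/50 mixture, whereas the spatial dots are spread across the whole distribution.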
Data analysis
For each experiment, observers completed a minimum of two runs of 180 trials. Data were expressed as the proportion of trials on which subjects judged the comparison RDK to be more CW than the standard RDK as a function of the angular difference between them. Each psychometric function was fitted with a logistic of the following form:
where Pcw is the proportion of clockwise judgments, μ is the stimulus level at which observers perceived the directions of the standard and comparison to be the same [point of subjective equality (PSE)], and β is an estimate of direction discrimination threshold. Figure 2D shows psychometric functions obtained in the temporal pooling experiment.
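As a minimal sketch of such a fit, the code below assumes the standard two-parameter logistic Pcw(x) = 1 / (1 + exp(-(x - μ) / β)), which matches the parameter definitions above but is not confirmed to be the exact form the authors used. The least-squares grid search is a stand-in chosen for brevity, not the authors' fitting procedure, and the function names are illustrative.

```python
import math

def p_clockwise(x, mu, beta):
    """Assumed logistic: probability of a 'clockwise' judgment at offset x (deg)."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / beta))

def fit_logistic(data, mus=range(-30, 31), betas=range(1, 21)):
    """Grid-search least-squares fit (illustrative stand-in for a proper fit).

    data: list of (angular difference in deg, proportion of CW judgments).
    Returns the (mu, beta) pair on the grid with the smallest squared error.
    """
    best, best_err = None, float("inf")
    for mu in mus:
        for beta in betas:
            err = sum((p - p_clockwise(x, mu, beta)) ** 2 for x, p in data)
            if err < best_err:
                best, best_err = (mu, beta), err
    return best

if __name__ == "__main__":
    # Noise-free synthetic data with a PSE of 10 deg and beta of 5 deg.
    xs = [-20, -10, 0, 10, 20, 30, 40]
    data = [(x, p_clockwise(x, 10.0, 5.0)) for x in xs]
    print(fit_logistic(data))
```

With this form, μ is the stimulus level at which the function crosses 0.5 (the PSE), and smaller β corresponds to a steeper function and hence finer discrimination.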
Population decoding
Basic model.
We begin with a brief description and then detail the full mathematical implementation of the model simulations. We simulated observers' trial-by-trial performance on the temporal and spatiotemporal tasks with a physiologically inspired population decoding model (cf. Webb et al., 2007). The stimuli, timing, psychophysical procedure, and number of trials were the same in the model simulations and human experiments. On each trial, we accumulated the spiking responses of a population of model direction-selective neurons to the comparison and standard RDKs. The central tendency direction of the comparison was randomly chosen on each trial using the method of constant stimuli. A decoder read out the population responses to the standard and comparison and judged which RDK had a more clockwise direction of motion. Psychometric functions based on the model response were accumulated for different forms of decoder.
Wherever practical, our model and manipulations of its parameters were designed to mimic the behavior of an MT population. The model consists of 360 independently responsive, direction-selective neurons, in which the preferred directions of the adjacent neurons are separated by 1°. The sensitivity of the ith neuron, centered at θi, to direction θ is as follows:
where h is the bandwidth (half-height, half-width), fixed at 45°. This bandwidth is chosen to be within the range obtained in previous psychophysical studies on the directional tuning of motion mechanisms (Levinson and Sekuler, 1980; Raymond, 1993; Fine et al., 2004) and physiological studies on the directional tuning of MT neurons (Albright, 1984; Felleman and Kaas, 1984; Britten and Newsome, 1998). The response of the ith neuron to a distribution of dot directions D(θ) is as follows:
where k = Rmaxt. Rmax is the maximum firing rate of the neuron (60 spikes/s), t is stimulus duration, and pr{D(θ)} is the proportion of dot directions. The spiking response (ni) is Poisson distributed with a mean of Ri(D):
We estimated D with three different population decoders. The log likelihood of D was computed by multiplying the response of each neuron by the log of its tuning function (Seung and Sompolinsky, 1993; Jazayeri and Movshon, 2006):
The maximum likelihood direction estimated from the population response is the value of θi for which logL(D) for all D is maximal.
To estimate D with a winner-takes-all decoder, we read off the value of θi at which ni is maximal.
To obtain the corresponding estimate from a vector average decoder, we calculated the average preferred direction of all neurons weighted in proportion to their response magnitude:
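A minimal sketch of the basic model and its three decoders follows, under stated assumptions. The Gaussian tuning function (with half-height, half-width h), the weighted-sum form of the mean response, and Knuth's method for drawing Poisson counts are assumptions made for illustration and are not confirmed to match the authors' implementation; the log likelihood follows the verbal description above (each response multiplied by the log of the tuning function, then summed). All function names are hypothetical.

```python
import math
import random

N = 360                      # number of model neurons
H = 45.0                     # tuning bandwidth (half-height, half-width), deg
R_MAX = 60.0                 # maximum firing rate, spikes/s
PREFERRED = [float(i) for i in range(N)]   # preferred directions, 1 deg apart

def circ_diff(a, b):
    """Signed circular difference a - b, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def log_tuning(theta, preferred):
    """Log of an assumed Gaussian tuning function that equals 0.5 at +/- H."""
    return -math.log(2.0) * (circ_diff(theta, preferred) / H) ** 2

def mean_response(dist, duration_s):
    """Assumed mean count: k times the proportion-weighted tuning, k = Rmax * t."""
    k = R_MAX * duration_s
    return [k * sum(p * math.exp(log_tuning(th, pref)) for th, p in dist.items())
            for pref in PREFERRED]

def poisson(lam, rng):
    """Draw one Poisson deviate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def spike_counts(dist, duration_s, rng):
    """Poisson spike counts for a {direction: proportion} stimulus."""
    return [poisson(r, rng) for r in mean_response(dist, duration_s)]

def max_likelihood(counts):
    """Preferred direction that maximizes sum_i n_i * log f_i(theta)."""
    active = [(n, pref) for n, pref in zip(counts, PREFERRED) if n]
    return max(PREFERRED,
               key=lambda th: sum(n * log_tuning(th, pref) for n, pref in active))

def winner_takes_all(counts):
    """Preferred direction of the most active neuron."""
    return PREFERRED[max(range(N), key=lambda i: counts[i])]

def vector_average(counts):
    """Response-weighted average of preferred directions (deg)."""
    x = sum(n * math.cos(math.radians(p)) for n, p in zip(counts, PREFERRED))
    y = sum(n * math.sin(math.radians(p)) for n, p in zip(counts, PREFERRED))
    return math.degrees(math.atan2(y, x)) % 360.0

if __name__ == "__main__":
    rng = random.Random(0)
    counts = spike_counts({90.0: 1.0}, 1.3, rng)   # all dots moving at 90 deg
    print(max_likelihood(counts), winner_takes_all(counts), vector_average(counts))
```

For a stimulus with a single direction, all three decoders should recover that direction, with the winner-takes-all estimate the most variable because it depends on a single noisy count near the broad peak of the population response.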
Model parameter manipulations.
To test systematically whether a likelihood-based pooling computation alone could accommodate observers' psychophysical performance in the spatiotemporal pooling experiment, we parametrically manipulated the behavior of our direction-tuned model neurons.
First, we varied the number of neurons in the population in the range N = 12–720.
Second, we varied the level at which the response of the ith neuron could reach saturation, such that
where Rsati is a fixed level of response saturation, θ is direction, θ50 is the number of dot directions at which the response reaches half its saturating level (fixed at 20), and η is the slope of the curve (fixed at 0.5).
Third, we conferred temporal dynamics on the response of the ith neuron with a decaying exponential of the following form:
where Rsusi is the sustained part of the response, Rmaxi is the maximum response, τr is a time constant, and t is time (Priebe et al., 2002).
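As an illustration of this kind of temporal dynamics, the sketch below assumes the common transient-sustained form R(t) = Rsus + (Rmax − Rsus) exp(−t / τr). This form is consistent with the parameter definitions above but is an assumption, not a confirmed transcription of the authors' equation; the default values are those quoted in the Results (Rmax = 60 spikes/s, Rsus = 2 spikes/s, τr = 20 ms), and the function names are illustrative.

```python
import math

def transient_response(t_ms, r_max=60.0, r_sus=2.0, tau_ms=20.0):
    """Assumed firing rate (spikes/s) at time t_ms after response onset."""
    return r_sus + (r_max - r_sus) * math.exp(-t_ms / tau_ms)

def mean_rate(duration_ms, r_max=60.0, r_sus=2.0, tau_ms=20.0):
    """Average of the assumed response over the stimulus duration."""
    decay = tau_ms / duration_ms * (1.0 - math.exp(-duration_ms / tau_ms))
    return r_sus + (r_max - r_sus) * decay

if __name__ == "__main__":
    for duration in (104.0, 208.0, 416.0):       # durations used in the experiment
        print(duration, round(mean_rate(duration), 2))
```

Under this assumption the response starts at Rmax and relaxes to Rsus, so the time-averaged rate falls as stimulus duration increases.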
Fourth, we implemented a simple form of temporal summation in which the ith neuron accumulates local motion signals on its receptive field as a power function of time, such that
where ω is a scaling factor, τD is a time constant, and t is time.
Fifth, within each simulated trial, we imposed a correlation structure on the noise in our population of neurons. Using a method described by Huang and Lisberger (2009), we first compute the desired correlation structure (c) across the ith and jth pairs of neurons as
where cmax is the maximum possible correlation between all pairs of neurons, ΔPDi,j is the difference in preferred directions of pairs of neurons, Td is the rate at which correlations decay as a function of ΔPD, and ΔPDmax is fixed at 180°. Using a method developed by Shadlen et al. (1996), we enforce the desired noise correlations across the population by calculating the matrix square root (Q) of the desired correlation matrix
such that every eigenvalue has a non-negative real part. We then multiply a vector of independent normal deviates with unit variance and zero mean (z) by Q:
generating a matrix with covariance c. To derive a matrix of responses with a given correlation structure, we scale and offset y. The responses of the population to a distribution of dot directions can then be calculated as follows:
where Ric(D) is a 1 by N vector of correlated responses that depend on the direction preference (θi) of each neuron. [For a complete derivation and discussion of this approach, see Shadlen et al. (1996), their Appendix 1: Covariance].
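The procedure for imposing correlated noise can be sketched as follows for a small population. Two points are assumptions and are labeled as such in the code: the exponential fall-off of correlation with ΔPD is a hypothetical form chosen for illustration, and a Cholesky factor is used in place of the symmetric matrix square root described above (both satisfy QQᵀ = c, so the resulting deviates have the same covariance). The final scaling and offsetting of y is omitted.

```python
import math
import random

def correlation_matrix(n, c_max, t_d, dpd_max=180.0):
    """Desired correlations between n neurons with evenly spaced preferences.

    Assumption: correlation decays exponentially with the difference in
    preferred direction, c_ij = c_max * exp(-dPD_ij / (t_d * dpd_max)).
    """
    prefs = [i * 360.0 / n for i in range(n)]
    c = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dpd = abs((prefs[i] - prefs[j] + 180.0) % 360.0 - 180.0)
                c[i][j] = c_max * math.exp(-dpd / (t_d * dpd_max))
    return c

def cholesky(c):
    """Lower-triangular Q with Q Q^T = c (stand-in for the matrix square root)."""
    n = len(c)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(q[i][k] * q[j][k] for k in range(j))
            if i == j:
                q[i][j] = math.sqrt(c[i][i] - s)
            else:
                q[i][j] = (c[i][j] - s) / q[j][j]
    return q

def correlated_normals(q, rng):
    """y = Q z, where z holds independent zero-mean, unit-variance deviates."""
    z = [rng.gauss(0.0, 1.0) for _ in q]
    return [sum(q[i][k] * z[k] for k in range(len(z))) for i in range(len(q))]

if __name__ == "__main__":
    c = correlation_matrix(12, c_max=0.5, t_d=0.5)
    q = cholesky(c)
    print([round(v, 2) for v in correlated_normals(q, random.Random(0))])
```

Each draw of y has covariance c across the population, so deviates for neurons with similar direction preferences tend to rise and fall together from trial to trial.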
Results
We first ask which pooling computations govern performance on a task that required human observers to combine local motion directions over time (temporal pooling experiment). Each psychometric function was fitted with a logistic (Eq. 1) (Fig. 2D) to determine the stimulus level at which observers perceived the global directions of the standard and comparison RDKs to be the same (point of subjective equality, or PSE). Figure 2D shows that skewing the distribution of directions in the comparison RDK caused a large (∼45°) shift in the perceived direction of this observer away from the modal toward the median and vector average direction. This shift in perceived direction occurred without a concomitant change in the precision of discrimination performance (slopes of the two psychometric functions are similar).
Figure 2. A–C, Distributions of dot directions for different comparison RDKs. Arrows show the median (white), vector average (gray), and modal (black) direction of the comparison RDKs. D, Directions sampled with replacement from the comparison distributions shown in the top and bottom panels of A. When the comparison distribution is symmetrical, the perceived direction of the standard RDKs aligns with the three measures of central tendency of the comparison RDK. When the comparison distribution is asymmetrical, only the vector average direction aligns with the perceived direction of the standard RDK. The smooth lines through the data points are the best-fitting solutions to Equation 1.
The behavior of this individual was representative of the performance of all observers. Perceived direction corresponded very closely to the vector average stimulus direction calculated over time, diverging substantially from the modal and median direction of motion. Figure 3A–C shows how the perceived direction of all observers changes as a function of skew in different comparison distributions (Fig. 2A–C). Symbols represent each subject's PSE; dotted, dashed, and solid lines represent the modal, median, and vector average direction of motion of the comparison RDK, respectively. When the comparison RDK was generated from a Gaussian distribution with a CCW SD of 60° (Fig. 2A, bottom), the modal direction of the comparison had to be rotated by ∼45°, on average, for the standard and comparison to be perceived moving in the same direction (Fig. 3A). With a CCW SD of the dot distribution equal to 90°, the modal direction of the comparison needed to be rotated by ∼20° for observers to perceive the comparison and the standard moving in the same direction (Fig. 3B). Similarly, when comparison directions were drawn from a uniform distribution with CCW range of 150°, the median direction had to be rotated by ∼20° to be perceived moving in the same direction as the standard (Fig. 3C). Unlike perceived direction, observers' discrimination thresholds were relatively independent of degree of skew in the comparison distributions (Fig. 3D–F).
Figure 3. Vector average of direction distributions accumulated over time predicts the temporal pooling of local motion directions. A–C, Symbols show the perceived direction of four observers as a function of different comparison distributions. Lines show the temporal vector average (solid), median (dashed), and modal (dotted) direction of the comparison distribution. B, Note that the median direction has been offset from the mode to reduce clutter. D–F, Symbols show the direction discrimination thresholds of four observers as a function of different comparison distributions. Error bars are 95% confidence intervals (CIs).
A vector average readout (Eq. 6) from a model population of direction-selective neurons (see Materials and Methods for basic model details) also predicted observers' perceived direction (Fig. 4A–C) and the pattern of discrimination thresholds (Fig. 4D–F) in the temporal pooling experiment. This finding contrasts with our previous work, in which we found that maximum likelihood was a robust estimator of performance on a task that required observers to pool local motion samples across space (Webb et al., 2007). These discrepant results appear consistent with the notion of a flexible motion pooling system that can adopt different computations to address different stimulus demands, as others have found for the perception of rigid motion (Stone et al., 1990; Yo and Wilson, 1992; Burke and Wenderoth, 1993; Lorenceau et al., 1993; Cropper et al., 1994; Bowns, 1996; Amano et al., 2009).
Figure 4. Vector average readout from a population coding model predicts the temporal pooling of local motion directions. A–C, White circles show the average perceived direction of four observers as a function of different comparison distributions. Solid lines show the perceived directions estimated by a vector average decoder (Eq. 6). D–F, White circles show the average direction discrimination thresholds of four observers as a function of different comparison distributions. Black circles show the direction discrimination thresholds estimated by a vector average decoder. Error bars are 95% CIs.
To test whether a flexible pooling process can account for these discrepant results, we designed an additional experiment containing components of the previous two. The task and design were the same as above with the following exceptions. The comparison RDK consisted of different mixtures of spatial and temporal dot directions and was presented at three different stimulus durations (see Materials and Methods). All dot directions in the comparison RDK were drawn, with replacement, from a distribution that was particularly diagnostic at distinguishing between the predictions of a maximum likelihood and vector average readout of perceived motion direction. Figure 5 shows examples of how we sampled different mixtures of spatial and temporal directions from this distribution. Note how the temporal dot directions dominate when the numbers of spatial and temporal dots are equally balanced in the comparison distribution (50% spatial, 50% temporal). The predominance of temporal directions in spatiotemporal motion stimuli tightly constrains the behavior of model neurons that can accommodate performance on the spatiotemporal pooling task. We will return to this important point below.
Figure 5. Spatial and temporal dot directions sampled in different proportions from a single asymmetrical uniform distribution. Rows show different relative percentages of spatial and temporal dot directions. Columns show the samples obtained over time on each positional update of the dots.
Figures 6 and 7 show the performance of observers in the spatiotemporal experiment. Perceived direction (Fig. 6) and discrimination thresholds (Fig. 7) are plotted for three different stimulus durations as a function of the percentage of temporal dots in the comparison (note that the percentage of temporal dots is inversely related to the percentage of spatial dots). Varying the mixture of temporal and spatial dots in the comparison RDK caused large (up to 25°) shifts in observers' perceived direction, with PSEs varying between −10° (100% spatial dots) and 15° (100% temporal dots). Stimulus duration modulated this relationship between perceived direction and percentage of temporal dots in the comparison, an effect that was most apparent when the numbers of spatial and temporal dots were equally balanced (Fig. 6). For all observers, increasing the relative percentage of temporal dots (thus reducing the percentage of spatial dots) in the comparison RDK caused discrimination thresholds to rise. For one observer (F.R.), the relationship between discrimination thresholds and percentage of temporal dots was modulated by stimulus duration: thresholds were larger at shorter stimulus durations. This effect was not marked for the other two observers.
Figure 6. Different population decoders predict the spatial and temporal pooling of local motion directions. A–C, Symbols show the perceived direction of three observers at three stimulus durations, plotted as a function of the percentage of temporal dots (inversely related to the percentage of spatial dots) in the comparison. D, Symbols show the average perceived direction of observers plotted and notated as in A–C. Dashed lines on the right are the perceived direction at three stimulus durations estimated by a vector average (V. Average) decoder (Eq. 6) when all the dots are temporal; black dashed lines on the left of the plot are the perceived direction at three stimulus durations estimated by a maximum likelihood (M. Likelihood) decoder (Eq. 5) when all of the dots are spatial (the maximum likelihood estimates are the same for the three durations). Error bars are 95% CIs.
Figure 7. Direction discrimination thresholds in the spatiotemporal pooling experiment. A–C, Symbols show the direction discrimination thresholds of three observers at three stimulus durations, plotted as a function of the percentage of temporal dots in the comparison. D, Symbols show the average direction discrimination thresholds of observers plotted and notated as in A–C. Error bars are 95% CIs.
Figure 6D shows the average perceived direction of the three observers. The dashed lines on the right show that a vector average decoder (Eq. 6) accurately estimated perceived direction at the three stimulus durations (indicated by different shades of gray) when RDKs were populated by temporal dots. In contrast, a maximum likelihood decoder (Eq. 5) accurately estimated perceived direction at the three stimulus durations (indicated by a single black dashed line because the estimates were the same for the three durations) when RDKs were populated by spatial dots. Yet with either population decoder alone, we were unable to predict the duration dependence of the relationship between perceived direction and percentage of temporal dots in the comparison. These data suggest a flexible form of motion pooling, one that uses different computations in space and time.
In principle, a single, likelihood-based computation could account for the dynamics of spatiotemporal motion pooling. Likelihoods are derived from the tuning and response properties of individual motion-sensitive neurons (Jazayeri and Movshon, 2006), which raises the possibility that the behavior of the input neurons, rather than the pooling computations themselves, governs the flexibility of spatiotemporal motion pooling. Our basic model neurons lacked many of the well known characteristics of motion-sensitive neurons, including nonlinear response saturation (Sclar et al., 1990; Albrecht and Geisler, 1991), temporal response integration (for review, see Born et al., 2009; Smith et al., 2009), temporal summation (Snowden and Braddick, 1991; Watamaniuk and Sekuler, 1992; Burr and Santoro, 2001), and a correlation structure to the noise across the population of neurons (Zohary et al., 1994; Bair et al., 2001; Kohn and Smith, 2005). To test whether a likelihood-based pooling computation alone could accommodate observers' psychophysical performance in the spatiotemporal pooling experiment, we systematically introduced some of these characteristics to our population of MT neurons (for details, see Materials and Methods). The left column in Figure 8 shows examples of the effects of manipulating the behavior of the model neurons on the response of the population when the numbers of spatial and temporal directions are equally balanced in the comparison distribution (50% spatial, 50% temporal). Samples from the distribution (inset in each panel) were presented to the model for a total duration of 104 ms (two images). The right column shows how these manipulations of the model neurons change a maximum likelihood readout (Eq. 5) of the relationship between perceived direction and percentage of temporal dots in the comparison.
Figure 8. Effects of manipulating the behavior of input neurons on the estimate of the maximum likelihood perceived direction in the spatiotemporal pooling experiment. Left column shows basic model population responses to equally balanced (50%, 50%) spatial and temporal samples from a comparison distribution (inset in each panel), presented for a total duration of 104 ms, with examples of changes to the shape of the population response caused by varying the following: A, numbers of neurons in the population (N = 180 neurons); C, level at which the responses of the neurons saturate (Eq. 7, Rsat = 40 spikes/s); E, correlation structure of the internal noise across the population of neurons (Eq. 10, Cmax = 0.5); G, time constant of temporal response integration (Eq. 8, τr = 20 ms); and I, rate at which neurons respond to the number of dot directions on their receptive fields [Eq. 9, Σ(t,D) = 18]. Right column shows how varying the model parameters N, Rsat, Cmax, τr, and Σ(t,D) modulated the estimate of the maximum likelihood of perceived direction in the spatiotemporal experiment. Varying the rate at which neurons respond to the total number of dots on their receptive field was the only manipulation to the basic model that approximated observers' performance at different stimulus durations.
When the numbers of spatial and temporal dots were equally balanced in the comparison, they did not have equivalent effects on the population response. Because the temporal dots all had the same direction, this inevitably swamped the population response, negating the relative contribution of spatial directions (Fig. 8A, N = 180 neurons). The predominance of temporal directions saturated the estimate of the maximum likelihood of perceived direction. Varying the total number of neurons in the population (N = 12–720) had very little impact on this effect: maximum likelihood produced equivalent estimates of perceived direction regardless of whether the comparison stimulus was populated by 50, 75, or 100% of temporal directions (Fig. 8B).
We attempted to mitigate the effects on the readout by fixing the level at which the responses of all neurons saturate. This flattened the peak of the population response (Fig. 8C) (Eq. 7) (Rsat = 40 spikes/s) and eradicated the saturation of the perceived direction estimated by maximum likelihood, particularly when temporal directions outweighed spatial directions (Fig. 8D). However, we could not find a fixed level of response saturation that produced maximum likelihood estimates of perceived direction that corresponded to observers' pattern of performance in the spatiotemporal experiment (Fig. 6).
Perfectly correlated noise between neurons with similar direction preferences (Eq. 10, Cmax = 1) with correlation strength decaying as a function of the difference in preferred directions of pairs of neurons (Eq. 10, Td = 0.5) both increased and broadened the peak of the population response (Fig. 8E). Different patterns of correlated noise across the population mitigated the saturating effects on the readout such that the gradient of the relationship between the maximum likelihood perceived direction and percentage of temporal dots (Fig. 8F) was very similar to that of observers (Fig. 6). However, changes to the noise structure did not accommodate the way in which stimulus duration modulated this relationship.
Conferring a form of temporal integration in which the response of each neuron has a maximum (Eq. 8, Rmaxi = 60 spikes/s) and decays to a sustained level (Eq. 8, Rsusi = 2 spikes/s) exponentially over time (Eq. 8, τ = 20 ms) both reduced and slightly broadened the population response (Fig. 8G). However, varying the time constant of integration (τ) did not capture the way in which stimulus duration affects performance on this task (Fig. 8H).
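The time course described above can be written as a short sketch. We assume that the response starts at its maximum at stimulus onset and relaxes exponentially toward the sustained level; the function name and this particular parameterization are illustrative and may differ in detail from Eq. 8.

```python
import numpy as np

R_MAX = 60.0  # peak response, spikes/s (value quoted in the text)
R_SUS = 2.0   # sustained response, spikes/s (value quoted in the text)
TAU = 0.020   # decay time constant in seconds (20 ms, as quoted in the text)

def transient_response(t, r_max=R_MAX, r_sus=R_SUS, tau=TAU):
    """Assumed form: exponential decay from r_max at t = 0 toward r_sus."""
    t = np.asarray(t, dtype=float)
    return r_sus + (r_max - r_sus) * np.exp(-t / tau)

# Response at a few hypothetical times after stimulus onset (in seconds).
times = np.array([0.0, 0.020, 0.100, 0.500])
print(transient_response(times))
```

With a 20 ms time constant the response is already close to its sustained level well within the first few hundred milliseconds, so most of the change in response occurs shortly after stimulus onset.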
Our last manipulation of the model was built on a well-known characteristic of motion-sensitive neurons in MT: responses saturate at very small numbers of dot directions (Snowden et al., 1991, 1992). Implementing a simple form of temporal summation in which each neuron accumulates local directional signals present within its receptive field as a power function of time [Eq. 9, Σ(t,D)] both reduced and broadened the population response (Fig. 8I). Together, these changes to the population response were sufficient to counteract the dominance of temporal directions and boost the relative contribution of spatial directions to the readout. By fixing the rate at which neurons accumulated direction signals, maximum likelihood was able to read out different numbers of directions over different time epochs. This simple, physiologically plausible change to the direction-selective neurons produced a family of functions (Fig. 8J) that closely approximated the relationship we found in the spatiotemporal experiment.
Figure 9A shows the maximum likelihood readout from this model that most accurately predicts observers' perceived direction in the spatiotemporal experiment. When the input neurons summed local directions at a fixed temporal rate (Eq. 9, τD = 0.36), the correspondence between the model predictions and observers' performance (Fig. 6D) was striking. [This model can also accommodate observers' performance in the temporal experiments (data not shown).] For comparison, we decoded corresponding estimates of perceived direction from the same population of neurons using winner-takes-all (Fig. 9C) and vector average (Fig. 9E). Winner-takes-all predictions were relatively accurate but hugely variable, and vector average predictions diverged substantially from the empirical data. All three decoders produced predictions that captured the relative change in observers' discrimination thresholds as the percentage of temporal directions increased (and the percentage of spatial directions decreased) in the spatiotemporal experiment (Fig. 9B,D,F), yet only winner-takes-all approximated the absolute threshold levels (Fig. 9D).
Figure 9. Computations governing spatiotemporal pooling of local motion directions. A, Maximum likelihood (M. likelihood) readout from model neurons that sum local directions at a fixed rate over time (Eq. 9, τD = 0.36) accurately predicts human observers' perceived direction at different stimulus durations in the spatiotemporal pooling experiment. C, E, Corresponding estimates of perceived direction from winner-takes-all and vector average decoders are accurate but hugely variable (C) and completely inaccurate (E), respectively. B, D, F, Maximum likelihood (B), winner-takes-all (D; W-T-A), and vector average (F; V. average) decoders all approximate the general pattern of observers' direction discrimination thresholds in the spatiotemporal pooling experiment. Error bars are 95% CIs.
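The three readouts compared in Figure 9 can be illustrated with a minimal Python sketch. It assumes von Mises tuning curves and independent Poisson spiking, and it decodes a single stimulus direction rather than the spatiotemporal mixtures used in the experiment; the tuning parameters and the function names `expected_counts` and `decode` are hypothetical, and the sketch omits the saturation, correlation, and temporal summation stages of the full model.

```python
import numpy as np

def expected_counts(theta, prefs, gain=60.0, kappa=2.0, baseline=1.0):
    """Assumed von Mises tuning: mean spike count for direction theta (rad)."""
    return baseline + gain * np.exp(kappa * (np.cos(theta - prefs) - 1.0))

def decode(counts, prefs, grid):
    """Return (maximum likelihood, winner-takes-all, vector average) in radians."""
    # Maximum likelihood under independent Poisson noise:
    # log L(theta) = sum_i n_i * log f_i(theta) - sum_i f_i(theta) + const.
    f = expected_counts(grid[:, None], prefs[None, :])  # shape (grid, neurons)
    log_like = np.log(f) @ counts - f.sum(axis=1)
    ml = grid[np.argmax(log_like)]
    # Winner-takes-all: preferred direction of the most active neuron.
    wta = prefs[np.argmax(counts)]
    # Vector average: direction of the response-weighted sum of unit vectors.
    va = np.angle(np.sum(counts * np.exp(1j * prefs))) % (2.0 * np.pi)
    return ml, wta, va

prefs = np.linspace(0.0, 2.0 * np.pi, 180, endpoint=False)  # 180 neurons
grid = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)   # candidate directions
true_dir = np.pi / 2.0

# Noise-free responses, followed by one noisy trial with a fixed seed.
clean = expected_counts(true_dir, prefs)
rng = np.random.default_rng(0)
noisy = rng.poisson(clean)

ml_clean, wta_clean, va_clean = decode(clean, prefs, grid)
print(np.degrees(decode(noisy, prefs, grid)))
```

For this symmetric, single-direction case all three decoders recover the stimulus direction when the responses are noise free. The decoders differ in how they weight the population: winner-takes-all relies on a single neuron and so is more exposed to response variability, whereas maximum likelihood and vector average combine responses across the whole population.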
Discussion
A simple computational model built on realistic physiological principles could accommodate the dynamic nature of human observers' psychophysical performance on two tasks that required the pooling of motion directions over space and time. We did not have to invoke an adaptive pooling mechanism that derives different computational solutions over space and time to explain observers' perception. Our modeling suggested a more parsimonious solution, whereby the flexible nature of spatiotemporal pooling can be accommodated by a single computation that optimally pools motion signals across a population of neurons that effectively “count” the total number of dots on their receptive fields at a fixed rate over time.
Our results suggest that flexible pooling emerges naturally from the dynamics of the input neurons rather than residing with the pooling computations themselves. This conclusion differs from that of other psychophysical studies of motion perception (Stone et al., 1990; Stone and Thompson, 1992; Yo and Wilson, 1992; Burke and Wenderoth, 1993; Lorenceau et al., 1993; Cropper et al., 1994; Bowns, 1996; Zohary et al., 1996; Amano et al., 2009), which suggest that the visual system can adaptively switch between different pooling computations depending on the nature of the stimulus. Many studies have found that different pooling computations coincide with the perception of weak (low contrast, short duration, one-dimensional) and strong (high contrast, long duration, two-dimensional) forms of rigid motion. Moreover, when a distribution of dot directions is skewed asymmetrically, the perceived direction can be biased away from the mean toward the modal direction of global motion, suggesting that the visual system has access to the entire distribution of local directions and adopts a flexible decision strategy (Zohary et al., 1996). However, it is not clear how the brain decides which computation to use within an adaptive pooling framework. Much of the psychophysical evidence in favor of adaptive pooling does not distinguish stimulus-based from mechanism-based pooling computations (Stone et al., 1990; Stone and Thompson, 1992; Yo and Wilson, 1992; Burke and Wenderoth, 1993; Lorenceau et al., 1993; Cropper et al., 1994; Bowns, 1996; Zohary et al., 1996; Amano et al., 2009). Without distinguishing the computational description of a visual stimulus from the underlying putative mechanism, it is impossible to know whether a single, mechanism-based computation can fully explain the pooling process.
Indeed, many of the adaptive rigid motion effects and the perceptual switch between different motion-based summary statistics can be accommodated by computational models that optimally read out the motion percept with a single, likelihood-based computation (Weiss et al., 2002; Webb et al., 2007).
We have extended this work to show that computations built on well-known physiological properties of MT neurons can accommodate flexible spatiotemporal pooling of local motion signals at a range of stimulus durations in human vision. Temporal pooling improves the precision with which motion signals can be discriminated (van Doorn and Koenderink, 1982; Snowden and Braddick, 1991; Watamaniuk and Sekuler, 1992; Fredericksen et al., 1994; Neri et al., 1998; Burr and Santoro, 2001), but the time window over which signals are accumulated depends on the speed, spatial frequency, contrast, and temporal structure of the stimulus (Nachmias, 1967; Vassilev and Mitov, 1976; van Doorn and Koenderink, 1982; Thompson, 1982; De Bruyn and Orban, 1988; Bialek et al., 1991; Buracas et al., 1998; Bair and Movshon, 2004). Although responses saturate at very small numbers of dot directions (Snowden et al., 1991, 1992) and most of the information about the direction of constant motion is available soon after stimulus onset, MT neurons can transmit more information about stimuli with rich temporal structure (Buracas et al., 1998). Our modeling predicts that the way in which motion-sensitive neurons respond to stimuli with rich temporal structure also contributes to the flexible pooling of motion signals read out from MT. The form of temporal summation is not critical to this argument. In our model, the rate at which MT neurons accumulated local directions grew as a power function of time, but the type of temporal summation described in other psychophysical studies (Snowden and Braddick, 1991; Watamaniuk and Sekuler, 1992; Fredericksen et al., 1994; Neri et al., 1998; Burr and Santoro, 2001) may have performed equally well.
A few studies have emphasized the contribution of rapidly saturating MT responses at small numbers of dot directions (Snowden et al., 1991, 1992) to the pooling of local motion signals (Simoncelli and Heeger, 1998; Dakin et al., 2005), but to our knowledge none has shown how the temporal accumulation of local motion signals mediates flexible pooling. Counting the number of dot directions is equivalent to summing motion energy (Britten et al., 1993), and our results are broadly consistent with the notion that motion-sensitive neurons behave like spatiotemporal energy detectors (Watson and Ahumada, 1983, 1985; van Santen and Sperling, 1984; Adelson and Bergen, 1985; Heeger, 1987; Simoncelli and Heeger, 1998). Recent models of visual motion pooling have extended motion energy models and shown how the nonlinear dynamics of input neurons can contribute to the subsequent pooling of visual motion signals (Rust et al., 2006; Tsui et al., 2010), reinforcing the notion that the complex dynamics of spatiotemporal pooling are inherited rather than adaptively computed at the pooling stage.
Conclusion
We have shown that a single, likelihood-based computation can accommodate the flexible nature of spatiotemporal motion pooling in human vision. Because likelihoods are derived from the tuning and response properties of individual motion-sensitive neurons, flexible pooling emerges naturally from the temporal dynamics of these input neurons. This general principle obviates the need to invoke different computations to accommodate the complex dynamics of motion pooling.
Footnotes
This work was funded by a Wellcome Trust Research Career Development Fellowship (B.S.W.). We thank Neil Roach for useful discussions.
Correspondence should be addressed to Ben S. Webb, Visual Neuroscience Group, School of Psychology, University Park, University of Nottingham, Nottingham NG7 2RD, UK. bsw{at}psychology.nottingham.ac.uk
This article is freely available online through the J Neurosci Open Choice option.