Abstract
We use visual image motion to judge the movement of objects, as well as our own movements through the environment. Generally, image motion components caused by object motion and self-motion are confounded in the retinal image. Thus, to estimate heading, the brain would ideally marginalize out the effects of object motion (or vice versa), but little is known about how this is accomplished neurally. Behavioral studies suggest that vestibular signals play a role in dissociating object motion and self-motion, and recent computational work suggests that a linear decoder can approximate marginalization by taking advantage of diverse multisensory representations. By measuring responses of MSTd neurons in two male rhesus monkeys and by applying a recently-developed method to approximate marginalization by linear population decoding, we tested the hypothesis that vestibular signals help to dissociate self-motion and object motion. We show that vestibular signals stabilize tuning for heading in neurons with congruent visual and vestibular heading preferences, whereas they stabilize tuning for object motion in neurons with discrepant preferences. Thus, vestibular signals enhance the separability of joint tuning for object motion and self-motion. We further show that a linear decoder, designed to approximate marginalization, allows the population to represent either self-motion or object motion with good accuracy. Decoder weights are broadly consistent with a readout strategy, suggested by recent computational work, in which responses are decoded according to the vestibular preferences of multisensory neurons. These results demonstrate, at both single neuron and population levels, that vestibular signals help to dissociate self-motion and object motion.
SIGNIFICANCE STATEMENT The brain often needs to estimate one property of a changing environment while ignoring others. This can be difficult because multiple properties of the environment may be confounded in sensory signals. The brain can solve this problem by marginalizing over irrelevant properties to estimate the property-of-interest. We explore this problem in the context of self-motion and object motion, which are inherently confounded in the retinal image. We examine how diversity in a population of multisensory neurons may be exploited to decode self-motion and object motion from the population activity of neurons in macaque area MSTd.
Introduction
Under natural conditions in which many things may change simultaneously in the environment, incoming sensory signals reflect a combination of multiple environmental causes. However, for specific behavioral tasks, it may be necessary to make judgements about a single environmental cause of sensory input without being influenced by other competing causes (i.e., “nuisance variables”). In general, estimating one variable and ignoring others can be achieved by marginalization, a computation in which some variables are integrated out of a joint probability distribution.
Although marginalization is probably important for many cognitive processes, little is known about its underlying neural mechanisms (Beck et al., 2011). We recently proposed a method for determining a linear decoder that approximates marginalization, and we showed through simulations that diverse multisensory representations could facilitate performance of the method (Kim et al., 2016). Here, we evaluate this method in the context of a naturalistic application: dissociation of self-motion and object motion.
For a moving observer, retinal image motion is a combination of components resulting from self-motion and object motion (Fig. 1A,B). Therefore, to accurately perceive self-motion, the brain needs to marginalize away the effects of object motion, and vice versa. Although some behavioral studies suggest that purely visual mechanisms exist to parse image motion into components related to self-motion and object motion (Rushton and Warren, 2005; Warren and Rushton, 2007, 2008; Matsumiya and Ando, 2009), other studies suggest that nonvisual (e.g., vestibular) signals make important contributions (Wexler, 2003; Wexler and van Boxtel, 2005; MacNeilage et al., 2012; Dupin and Wexler, 2013; Fajen and Matthis, 2013; Dokka et al., 2015a,b).
Little is known about neural computations that can dissociate self-motion and object motion. A few previous studies have examined interactions between visual motion patterns that simulate self-motion and object motion in area MSTd (Logan and Duffy, 2006; Sato et al., 2010; Kishore et al., 2012). These studies show that MSTd responses depend on self-motion and object motion in complex ways, but they did not systematically explore the joint tuning of MSTd neurons for heading and object direction, nor how nonvisual inputs influence the joint representation of self-motion and object motion. Given that vestibular signals contribute to perceptual judgements of self-motion and object motion (Fajen and Matthis, 2013; Dokka et al., 2015a,b), we hypothesized that the capacity of cortical neurons to distinguish self-motion and object motion would be enhanced by vestibular cues.
Cortical neurons in multiple brain areas, including MSTd (Gu et al., 2006, 2008; Fetsch et al., 2011), ventral intraparietal (VIP; Bremmer et al., 2002; Chen et al., 2011b, 2013), visual posterior sylvian (VPS; Chen et al., 2011b), and frontal eye field (FEF; Gu et al., 2016), are selective for the direction of translation (heading) based on both optic flow and vestibular inputs. These areas contain “congruent” cells, which have matched vestibular and visual heading preferences, as well as “opposite” cells, which tend to prefer opposite directions of real and visually-simulated translation (Gu et al., 2006; Chen et al., 2011b). Whereas the activity of congruent cells is compatible with roles in cue integration (Gu et al., 2008; Chen et al., 2013) and reliability-based cue weighting (Fetsch et al., 2011), opposite cells are poorly suited to these computations. However, opposite cells may respond strongly when a moving object produces image motion that is inconsistent with self-translation. Thus, the combined activity of a mixture of congruent and opposite cells could allow the system to identify image motion that is inconsistent with self-motion. Indeed, simulations suggest that a properly trained linear decoder (Kim et al., 2016) can take advantage of this diversity to marginalize over object motion and estimate heading, but this approach has not been applied to real neural data.
We tested macaque MSTd neurons with many combinations of object motion and self-motion (including visual and vestibular cues to self-motion). We examined how vestibular signals modulate the joint tuning of single neurons for heading and object direction. In addition, we evaluated whether our method for approximate linear marginalization (ALM; Kim et al., 2016) could compute the probability distribution over heading while marginalizing out the effects of object motion, or vice versa. Our findings suggest that vestibular signals facilitate a robust neural representation of motion that can be used to estimate either heading or object motion.
Materials and Methods
Subjects and preparation
Two male rhesus monkeys (Macaca mulatta), weighing 12.8 and 11.5 kg, participated in this study. Both animals were 6–8 years of age during the course of the studies, and were pair-housed in a vivarium with a 12 h light cycle (6:00 A.M. to 6:00 P.M.). Under sterile conditions, monkeys were chronically implanted with a circular Delrin ring (diameter: 7 cm) for head stabilization, as described previously (Gu et al., 2006), as well as a scleral search coil for measuring eye movements. After recovery, animals were trained to fixate visual targets for fluid rewards using standard operant conditioning techniques.
For electrophysiological recording, a Delrin grid (2.5 × 4.5 × 0.5 cm) containing rows of holes was stereotaxically secured to the skull inside the head-restraint ring and was positioned in the horizontal plane. The holes in the grid (0.8 mm spacing) allowed vertical penetration of microelectrodes into the brain via transdural guide tubes that were inserted through a small burr hole in the skull. Burr holes were made surgically under aseptic conditions while the subjects were anesthetized. The recording grid extended bilaterally from the midline to regions overlying area MST in both hemispheres. All experimental procedures conformed to National Institutes of Health guidelines and were approved by the University Committee on Animal Resources at the University of Rochester.
Vestibular and visual stimuli
During experiments, monkeys were seated comfortably in a primate chair with their head restrained. The chair was securely attached to a 6 degree-of-freedom motion platform (MOOG 6DOF2000E, Moog) that was used to passively translate the animals. A field-coil system (CNC Electronics) was mounted on top of the motion platform, surrounding the monkey's head and body, and was used to measure eye movements using the scleral coil technique. Visual stimuli were rear-projected onto a tangent screen by a three-chip digital light projector (Mirage S+3K, Christie Digital Systems), and the tangent screen was attached to the front of the field coil (Gu et al., 2006). The display screen measured 60 × 60 cm and was mounted at a viewing distance of ∼30 cm in front of the monkey, thus subtending ∼90 × 90° of visual angle. The sides and top of the field coil frame were covered with a matte black enclosure such that the animal could only see visual stimuli that were presented on the tangent screen; the room was not visible. Platform motion and visual stimuli were updated at 60 Hz and were precisely calibrated such that visual motion was synchronous with platform motion (Gu et al., 2006).
Visual stimuli simulated translational self-motion along one of eight directions, spaced 45° apart, in the frontoparallel (i.e., vertical) plane (Fig. 1C). We elected to restrict self-motion to the frontoparallel plane for two main reasons: (1) it reduced the dimensionality of the stimulus space, which was necessary because we tested all combinations of self-motion and object motion directions; and (2) it simplified the types of visual motion interactions that take place within the receptive field (i.e., no radial motion), thus simplifying predictions for how vestibular signals may interact with visual motion.
The visual scene consisted of a three-dimensional (3D) field of stars. Each star was a triangle that measured 0.15 × 0.15 cm, and the cloud measured 100 cm wide by 100 cm tall by 40 cm deep with a star density of 0.01/cm³. Along the depth dimension, the volume of the star field was centered on the distance of the display screen and fixation target. Given that the viewing distance was ∼30 cm, the star field ranged in depth from 10 to 50 cm in front of the monkey. To provide stereoscopic cues, the dot cloud was rendered as a red-green anaglyph and viewed through custom red-green goggles (Kodak Wratten2 filters #29 and #61). The optic flow field contained naturalistic cues mimicking translation of the observer through the 3D star field, including motion parallax, size variations, and binocular disparity. Visual stimuli were generated using the OpenGL libraries (under Microsoft Visual C++) and were rendered using an accelerated graphics card (Quadro FX4800), with anti-aliasing for smooth motion. Optic flow simulating self-motion was generated by translating the OpenGL cameras (1 for each eye) through the 3D star field. The star field always filled the entire visible video display; as dots disappeared from one edge of the display, other dots appeared on the opposite edge. This helped to enhance the perception that background dots were stationary in the world.
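As a quick check of the display geometry, the short Python sketch below recomputes the visual angle subtended by the screen, the depth range of the star field, and the star count implied by the stated density. The variable names are illustrative, and the calculation assumes a flat screen centered on the line of sight and a uniformly filled star volume.

```python
import math

# Values taken from the text; names are illustrative only.
screen_width_cm = 60.0
viewing_distance_cm = 30.0
cloud_width_cm, cloud_height_cm, cloud_depth_cm = 100.0, 100.0, 40.0
star_density_per_cm3 = 0.01

# Visual angle of a flat screen centered on the line of sight.
visual_angle_deg = math.degrees(
    2 * math.atan((screen_width_cm / 2) / viewing_distance_cm)
)

# The star volume is centered in depth on the screen distance.
near_cm = viewing_distance_cm - cloud_depth_cm / 2
far_cm = viewing_distance_cm + cloud_depth_cm / 2

# Star count implied by the density (assumes a uniformly filled volume).
n_stars = cloud_width_cm * cloud_height_cm * cloud_depth_cm * star_density_per_cm3

print(visual_angle_deg, (near_cm, far_cm), n_stars)
```

The result agrees with the stated ∼90° field of view and the 10 to 50 cm depth range.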
To explore the interaction between self-motion and object motion, a large, multipart object moved in the world while the monkey was translated. The object consisted of a cluster of spheres (diameter 15 cm each) that were composed of random dots, with a dot density (0.25/cm³) that was greater than that of the background dots, such that the object was easily detectable in the scene. Each sphere was separated from its neighbors by 22.5 cm and they were transparent such that they did not occlude background optic flow (Fig. 1A). The multipart object was centered in depth at the same distance from the monkey (∼30 cm) as the fixation point, which was in the plane of the display screen. Note that the entire outer boundary of the group of objects moved relative to the screen boundaries, facilitating the distinction between motion of the object and motion of the (stationary) background dots. After exploring various options including single spheres of different sizes, we elected to use the multipart object for the following reasons. First, because object location was not tailored to the receptive field of each neuron (to facilitate decoding analyses), we wanted object motion to cover a substantial portion of the visual field such that most receptive fields would be stimulated by the object. Second, using a single very large object may create figure-ground ambiguity that we wanted to avoid. Using a cluster of transparent moving spheres as the object allows decent coverage of the visual field while still allowing the moving objects to be easily segmented from the background.
As for self-motion, object motion was along one of eight directions (45° apart) within the frontoparallel plane. All self-motion and object motion trajectories were straight and had a Gaussian velocity profile with a SD of 1/3 s (Gu et al., 2006). The total excursion (0.15 m) and peak velocity (0.45 m/s) of object motion were somewhat greater than those for self-motion (0.10 m and 0.30 m/s, respectively), such that there was always movement of the object relative to the background even when the background and the object moved in the same direction.
Electrophysiology procedures
We recorded action potentials extracellularly from area MSTd of two monkeys. A tungsten microelectrode (Frederic Haer; tip diameter 3 μm, impedance 1–2 MΩ at 1 kHz) was advanced into the cortex through a transdural guide tube, using a hydraulic micromanipulator (Narishige) mounted on top of the head-restraint ring. Action potentials were amplified and isolated using a head-stage preamplifier, a bandpass eight-pole filter (Krohn-Hite, model 3384; 400–4000 Hz), and a dual voltage-time window discriminator (Bak Electronics, model RP-1). The times of occurrence of action potentials and all behavioral events were recorded with 1 ms resolution by the data acquisition computer. Raw neural signals were also digitized at 25 kHz and stored to disk for off-line spike sorting and additional analyses. Experimental control and data acquisition were coordinated by scripts written with TEMPO software (Reflective Computing).
Area MSTd was localized with aid from structural MRI scans and a standard macaque atlas (Van Essen et al., 2001). MSTd was typically identified as a region centered ∼15 mm lateral to the midline and ∼3–6 mm posterior to the interaural plane. Electrode penetrations were also guided by the pattern of background activity as the electrode traversed through gray and white matter, as well as the response properties of neurons to visual stimuli. MSTd was usually the first gray matter encountered, ∼6–10 mm below the cortical surface, which exhibited prominent response modulation to flashing random-dot stimuli and direction-selective responses to motion of the dots. We were careful to distinguish MSTd from the lateral subdivision of area MST (MSTl). To do this, we also carefully mapped the portions of area MT that were found beneath MSTd, in the posterior bank of the superior temporal sulcus. We mainly targeted regions of area MSTd that were located postero-medially, overlying portions of area MT that had receptive fields with moderate to large eccentricities.
Experimental protocol
We searched for neurons while presenting a large pattern of flickering or drifting random dots. After isolating the action potential of an MSTd neuron, we used custom software to perform an initial characterization of response properties, making use of a graphical interface that controlled the position, size, and velocity of a patch of dots. After hand-mapping the receptive field, a reverse-correlation procedure was used to obtain a quantitative map of the receptive field, as described in detail previously (Chen et al., 2008). We also measured heading tuning within the frontoparallel plane by presenting eight directions of motion, 45° apart, defined by optic flow alone (visual condition), platform motion alone (vestibular condition), and congruent visual and vestibular inputs (bimodal condition).
After this initial characterization of response properties, the main experimental protocol was run. This protocol included a fully crossed design in which eight directions of self-motion in the frontoparallel plane were crossed with eight directions of object motion. Two full sets of these 8 × 8 stimulus conditions were presented, a visual condition in which self-motion was visually simulated by optic flow while the animal remained stationary, and a bimodal condition in which self-motion was provided by congruent optic flow and vestibular signals. In addition, the main protocol included three self-motion control conditions in which self-motion was presented without any moving object: a visual, a vestibular, and a bimodal condition. The protocol also included an object-motion control condition in which only the moving object was presented, with no self-motion, and a null condition in which no stimuli were presented except for the fixation target. In total, 161 unique stimulus conditions were randomly interleaved, requiring 805 trials to complete 5 repetitions of each distinct stimulus.
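The condition count can be verified with a short tally. The Python sketch below assumes that each of the three self-motion control conditions and the object-motion control condition was tested at all eight directions; this is implied by the total but not stated explicitly, and the variable names are illustrative.

```python
N_DIRECTIONS = 8   # headings and object directions, 45 deg apart
N_REPETITIONS = 5

crossed = 2 * N_DIRECTIONS * N_DIRECTIONS  # visual and bimodal 8 x 8 sets
self_motion_controls = 3 * N_DIRECTIONS    # visual, vestibular, bimodal (assumed 8 headings each)
object_control = N_DIRECTIONS              # object only (assumed 8 directions)
null_condition = 1                         # fixation target only

n_conditions = crossed + self_motion_controls + object_control + null_condition
n_trials = n_conditions * N_REPETITIONS
print(n_conditions, n_trials)
```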
Data analysis
The response of a neuron to each stimulus was computed as the firing rate over a time window from 500 to 2000 ms following stimulus onset, as this time window was previously found to contain most of the neural response to these stimuli (Gu et al., 2006, their Fig. 1C). The joint tuning of MSTd neurons for the 64 combinations of heading and object direction was plotted as a color map for visualization (Fig. 2C). In addition, data from the control conditions involving only self-motion or only object motion were plotted as tuning curves (Fig. 2A).
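A minimal Python sketch of this response measure is given below. The function name, the use of spike times in milliseconds relative to stimulus onset, and the half-open treatment of the window boundaries are assumptions made for illustration.

```python
def firing_rate(spike_times_ms, start_ms=500.0, stop_ms=2000.0):
    """Mean firing rate (spikes/s) within [start_ms, stop_ms) after stimulus onset."""
    n_spikes = sum(start_ms <= t < stop_ms for t in spike_times_ms)
    return n_spikes / ((stop_ms - start_ms) / 1000.0)


# Hypothetical spike train: four of the six spikes fall inside the window.
example_rate = firing_rate([100.0, 600.0, 900.0, 1500.0, 1999.0, 2100.0])
print(example_rate)
```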
Direction discrimination indices.
To quantify the strength of directional tuning exhibited by a neuron, we computed a well established metric of neural selectivity (Prince et al., 2002; DeAngelis and Uka, 2003) called the direction discrimination index (DDI):

$$\mathrm{DDI} = \frac{R_{\max} - R_{\min}}{R_{\max} - R_{\min} + 2\sqrt{\mathrm{SSE}/(N - M)}} \tag{1}$$

In this formulation, Rmax and Rmin denote the mean responses to the most effective and least effective stimulus directions, respectively, SSE is the sum squared error around the mean responses, N is the number of observations (trials), and M is the number of stimulus values tested.
In our application of DDI, we wanted to quantify the strength of neural selectivity for heading (while pooling across object directions) or the strength of selectivity for object direction (pooling across headings). Thus, we computed two DDI metrics. DDIheading was computed after responses were pooled across the eight possible object motion directions, and DDIobject was computed after responses were pooled across the eight possible headings (see Fig. 4). These pooled DDI metrics therefore reflect the consistency of tuning for one variable (e.g., heading) across variations in the other variable (e.g., object motion). A neuron could have a low DDI value because it is not tuned at all or because its tuning for one variable is not consistent across variations in the other variable. These pooled DDI metrics are useful to evaluate whether the addition of vestibular signals makes tuning more consistent or less consistent.
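The pooling step can be illustrated with a short NumPy sketch. It uses the standard DDI form, (Rmax − Rmin) / (Rmax − Rmin + 2·sqrt(SSE / (N − M))), and simulated responses; the array layout (heading × object direction × repetition), the function names, and the toy tuning curves are assumptions for illustration, not the analysis code used in the study.

```python
import numpy as np


def ddi(responses):
    """DDI for an array of shape (n_directions, n_observations)."""
    responses = np.asarray(responses, dtype=float)
    means = responses.mean(axis=1)
    spread = means.max() - means.min()
    sse = np.sum((responses - means[:, None]) ** 2)
    n_obs, n_dirs = responses.size, responses.shape[0]
    return spread / (spread + 2.0 * np.sqrt(sse / (n_obs - n_dirs)))


def pooled_ddis(joint):
    """joint: (n_headings, n_objects, n_reps). Returns (DDI_heading, DDI_object)."""
    n_h, n_o, _ = joint.shape
    ddi_heading = ddi(joint.reshape(n_h, -1))                     # pool over object directions
    ddi_object = ddi(joint.transpose(1, 0, 2).reshape(n_o, -1))   # pool over headings
    return ddi_heading, ddi_object


rng = np.random.default_rng(0)
angles = np.arange(8) * np.pi / 4
noise = rng.normal(0.0, 1.0, size=(8, 8, 5))

# Toy neuron 1: heading tuning is identical for every object direction.
consistent = 10 + 5 * np.cos(angles)[:, None, None] + noise
# Toy neuron 2: the heading preference shifts with object direction.
shifted = 10 + 5 * np.cos(angles[:, None] - angles[None, :])[:, :, None] + noise

print(pooled_ddis(consistent)[0], pooled_ddis(shifted)[0])
```

In this toy example the second neuron is strongly tuned for heading at every object direction, yet its pooled DDIheading is low because the tuning is not consistent across object directions.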
To further investigate whether the interactions between self-motion and object motion in MSTd responses can be described by baseline response shifts, peak shifts, and/or gain changes, we calculated DDI metrics that are compensated for baseline shifts (mean-compensated DDI), peak shifts (shift-compensated DDI), and gain modulations (gain-compensated DDI). To compute the mean-compensated DDIheading, mean responses were equated across object directions before computing DDIheading. To compute the shift-compensated DDIheading, the peaks of heading tuning curves were aligned across object directions before computing DDIheading. To compute the gain-compensated DDIheading, the peak-trough amplitudes of heading tuning curves were equated across object directions before computing DDIheading. Analogous operations were performed to compute the compensated values of DDIobject, but instead equating mean values, peak locations, or peak-trough amplitudes across headings.
Direction separability index.
If heading tuning is independent of object direction (and vice versa), then we may expect the joint tuning profile for heading and object motion to be multiplicatively separable. In contrast, if heading tuning depends on object direction (or vice versa), then the joint heading/object tuning profile may have a somewhat slanted structure (see Fig. 7A). To quantify the separability of joint heading/object tuning, singular value decomposition (SVD) was applied to the joint tuning profile of each neuron. This approach represents the empirically measured joint tuning profile as a weighted sum of multiplicatively-separable components. Thus, if the joint tuning is multiplicatively separable, then the first singular value will be large and all of the other singular values will be relatively small. To quantify separability, we compute a direction separability index (DSI), which depends on the relative magnitude of the first singular value to the sum of all singular values (Depireux et al., 2001; Mazer et al., 2002):

$$\mathrm{DSI} = 1 - \frac{\lambda(1)^2}{\sum_i \lambda(i)^2} \tag{2}$$

where λ(i) is the ith singular value. If the joint tuning is separable, then DSI would be close to zero, and greater values indicate increasing inseparability.
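The SVD-based index can be sketched in a few lines of NumPy. The sketch assumes the index takes the form 1 − λ(1)² / Σ λ(i)², which is consistent with the statement that separable tuning gives values near zero; the exact normalization, the function name, and the toy tuning profiles are assumptions for illustration.

```python
import numpy as np


def direction_separability_index(joint_tuning):
    """Assumed form: 1 - (first singular value)^2 / (sum of squared singular values)."""
    s = np.linalg.svd(np.asarray(joint_tuning, dtype=float), compute_uv=False)
    return 1.0 - s[0] ** 2 / np.sum(s ** 2)


angles = np.arange(8) * np.pi / 4

# Separable toy profile: outer product of a heading curve and an object curve.
heading_curve = 10 + 5 * np.cos(angles)
object_curve = 1 + 0.5 * np.cos(angles - np.pi / 2)
separable = np.outer(heading_curve, object_curve)

# Slanted toy profile: the response depends on the difference between directions.
slanted = 10 + 8 * np.cos(angles[:, None] - angles[None, :])

dsi_separable = direction_separability_index(separable)
dsi_slanted = direction_separability_index(slanted)
print(dsi_separable, dsi_slanted)
```

Under this assumed form, the rank-one profile yields an index of essentially zero, whereas the slanted profile yields a clearly larger value.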
Population decoding by likelihood computation.
Maximum likelihood decoding can be used to estimate heading or object direction from a sample of activity of our population of MSTd neurons. To compute an estimate of the log likelihood over heading [logL(θ)] or object direction [logL(ϕ)], each neuron's response (ri) to a particular stimulus was multiplied by the log of its tuning function [log fi(θ) or log fi(ϕ)] and the result was summed across neurons (Eqs. 3, 4; Dayan and Abbott, 2001; Jazayeri and Movshon, 2006):

$$\log L(\theta) = \sum_i r_i \log f_i(\theta) - \sum_i f_i(\theta) \tag{3}$$

$$\log L(\phi) = \sum_i r_i \log f_i(\phi) - \sum_i f_i(\phi) \tag{4}$$

The second term in these equations compensates for biases caused by a nonuniform distribution of direction preferences. In these formulations, ri represents the response of the ith neuron to a particular combination of heading and object motion in either the visual or bimodal condition. For heading estimation, fi(θ) could be either the visual or vestibular heading tuning function for each neuron, without object motion. For object motion estimation, fi(ϕ) is the object motion tuning curve measured without self-motion. The peak of the resulting distribution represents the maximum-likelihood estimate of either heading or object motion direction. This approach assumes that neurons have independent Poisson spiking statistics.
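A minimal NumPy sketch of this decoder follows. It uses the standard independent-Poisson log likelihood, Σ ri log fi(θ) − Σ fi(θ), with a simulated population; the tuning curves, the population size, and the treatment of responses as spike counts are assumptions for illustration.

```python
import numpy as np


def log_likelihood(r, tuning):
    """r: (n_neurons,) responses; tuning: (n_neurons, n_directions) mean responses (> 0)."""
    return r @ np.log(tuning) - tuning.sum(axis=0)


# Hypothetical population: 16 neurons with evenly spaced direction preferences.
directions = np.arange(8) * np.pi / 4
preferences = np.linspace(0, 2 * np.pi, 16, endpoint=False)
tuning = 5 + 20 * np.exp(2 * (np.cos(directions[None, :] - preferences[:, None]) - 1))

true_index = 3
r = tuning[:, true_index]             # noise-free population response
log_l = log_likelihood(r, tuning)
estimate = int(np.argmax(log_l))      # maximum-likelihood direction index
print(estimate)
```

With noise-free responses the peak of the log likelihood falls on the true direction; in practice `r` would be a single-trial population response.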
It should be emphasized that the distributions obtained from Equations 3 and 4 will only equal the corresponding log likelihoods over heading and object direction if the shape of the population activity is not influenced by nuisance variables such as object motion and self-motion, respectively. If these conditions are not met (which will generally be the case in our applications, since tuning for heading is altered by object motion and vice versa), then Equations 3 and 4 do not compute true log-likelihoods but may instead be referred to as recognition models (Hinton and Dayan, 1996). Although we do not expect this simple form of population decoding to effectively perform marginalization, it provides an instructive comparison to an optimal linear decoder that does approximate marginalization, as described below.
We also computed the joint likelihood over both heading and object direction by applying the same approach to the joint tuning profiles for heading and object motion:

$$\log L(\theta, \phi) = \sum_i r_i \log f_i(\theta, \phi) - \sum_i f_i(\theta, \phi) \tag{5}$$

Here, fi(θ, ϕ) represents the joint tuning profile for heading and object direction (Fig. 2C) for the ith neuron.
Approximate linear marginalization.
As described above, the responses of MSTd neurons will generally be a function of both heading and object direction, f(θ, ϕ), and the joint posterior over both variables, P(θ, ϕ | r), can be computed from the neural responses, r, and the joint tuning profiles (assuming a flat prior; see Eq. 9). However, in many situations, it is not necessary to compute the joint posterior, and it may be advantageous to directly estimate the marginal posterior over heading or object direction, P(θ | r) or P(ϕ | r), respectively. Mathematically, this involves integrating one of these variables out of the joint posterior (see Eq. 10), a computation known as marginalization. But how can the marginal posterior be computed from neural activity? Is it possible to directly decode neural activity and obtain a reasonable approximation of the marginal posterior?
As described in detail recently (Kim et al., 2016), we sought to find a linear transformation of neural activity that can perform near-optimal marginalization; for example, to compute an approximate marginal posterior distribution over heading that discounts object motion. To perform what we call ALM, we assume that the marginal posterior distribution over heading is approximated by a member (Q) of the exponential family with linear sufficient statistics (Ma et al., 2006):

$$Q(\theta \mid \mathbf{r}; \mathbf{h}, g) = \frac{1}{Z} \exp\left[\mathbf{h}(\theta) \cdot \mathbf{r} + g(\theta)\right] \tag{6}$$

where

$$Z = \sum_{\theta'} \exp\left[\mathbf{h}(\theta') \cdot \mathbf{r} + g(\theta')\right] \tag{7}$$

We then optimize the parameters, h(θ) and g(θ), to best explain data drawn from the true marginal posterior distribution, P(θ | r). This optimization, performed using multinomial logistic regression (Bishop, 2006), maximizes the likelihood of the parameters and minimizes the Kullback–Leibler divergence between the true marginal distribution P and the approximation Q. Given K samples from the joint distribution (θk, ϕk, rk) ∼ P(θ, ϕ, r) for k = 1 … K, where θ denotes heading, ϕ denotes object motion direction, and r denotes the neural population response, the log-likelihood of the model parameters (for heading estimation) is given by the following:

$$L(\mathbf{h}, g) = \sum_{k=1}^{K} \log Q(\theta_k \mid \mathbf{r}_k; \mathbf{h}, g) \tag{8}$$

where h=[hi(θj)] denotes a matrix of weights specifying how the response ri of neuron i influences the log-probability of heading θj, g(θj) represents a set of bias parameters for each heading, and the constant Z ensures that Q(θk|rk; h, g) is properly normalized. These parameters were obtained by performing multinomial logistic regression using the glmtrain() function in the Netlab toolbox for MATLAB (Nabney, 2002; Bishop, 2006), specifically with the algorithmic variant based on an approximate Hessian. For our purposes, h(θ) represents the key quantity that is obtained by ALM: it contains the optimal decoding weights for each neuron that provide the best approximation to the marginal posterior over heading that can be obtained by a linear transformation of responses.
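The logic of ALM can be illustrated with a small, self-contained simulation in Python. This is a sketch under stated assumptions, not the procedure used in the study: it replaces the Netlab glmtrain() routine with plain gradient ascent on the multinomial logistic log-likelihood, standardizes the responses before fitting (an affine step that is absorbed into the weights and biases), and uses a hypothetical population whose joint tuning and Poisson variability are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_DIR, N_NEURONS = 8, 40
ANGLES = np.arange(N_DIR) * 2 * np.pi / N_DIR

# Hypothetical joint tuning: each simulated neuron mixes a heading-tuned
# and an object-tuned component (not fitted to any recorded data).
pref_heading = rng.uniform(0, 2 * np.pi, N_NEURONS)
pref_object = rng.uniform(0, 2 * np.pi, N_NEURONS)
mix = rng.uniform(0.2, 1.0, N_NEURONS)


def mean_counts(theta, phi):
    heading_part = mix * np.exp(np.cos(theta - pref_heading) - 1)
    object_part = (1 - mix) * np.exp(np.cos(phi - pref_object) - 1)
    return 5 + 15 * (heading_part + object_part)


def simulate(n_trials):
    """Draw random heading/object pairs and Poisson population responses."""
    h_idx = rng.integers(N_DIR, size=n_trials)
    o_idx = rng.integers(N_DIR, size=n_trials)
    rates = mean_counts(ANGLES[h_idx][:, None], ANGLES[o_idx][:, None])
    return rng.poisson(rates).astype(float), h_idx


def posterior(r, h, g):
    """Q(theta | r) proportional to exp(h(theta) . r + g(theta)), normalized over theta."""
    logits = r @ h + g
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    q = np.exp(logits)
    return q / q.sum(axis=1, keepdims=True)


def fit(r, labels, lr=0.05, n_iter=500):
    """Gradient ascent on the mean log-likelihood of the heading labels."""
    n_trials, n_neurons = r.shape
    h = np.zeros((n_neurons, N_DIR))
    g = np.zeros(N_DIR)
    target = np.eye(N_DIR)[labels]
    for _ in range(n_iter):
        err = target - posterior(r, h, g)
        h += lr * (r.T @ err) / n_trials
        g += lr * err.mean(axis=0)
    return h, g


# Object direction varies freely in both sets and is never given to the decoder.
r_train, y_train = simulate(2000)
r_test, y_test = simulate(2000)
mu, sd = r_train.mean(axis=0), r_train.std(axis=0)
h, g = fit((r_train - mu) / sd, y_train)
q_test = posterior((r_test - mu) / sd, h, g)
accuracy = (q_test.argmax(axis=1) == y_test).mean()
print(f"heading accuracy: {accuracy:.2f} (chance = {1 / N_DIR:.3f})")
```

Because the training labels specify heading only, the fitted weights must discount the response variation caused by object motion, which is the sense in which the linear readout approximates marginalization.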
The equations above describe computation of the ALM weights for the case of estimating heading, but the procedure is completely analogous for object direction (substituting ϕ for θ), yielding decoding weights given by h(ϕ).
ALM was trained on a stimulus set including 500 simulated trials for each distinct stimulus condition. For each simulated trial, a pseudo-population response was created by sampling a response from each neuron (with replacement). Once ALM was trained, its performance was tested on a different set of pseudo-population responses, again comprising 500 simulated trials for each distinct stimulus condition. The stimulus conditions in the training set included all combinations of eight self-motion directions and eight object motion directions, as well as conditions involving self-motion only or object motion only. Specifically, when ALM was trained to decode heading, 30% of trials in the training set were self-motion only conditions (10% vestibular, 10% visual, and 10% bimodal) and 70% of trials involved combinations of self-motion and object motion (evenly split between visual and bimodal self-motion cues). Similarly, when ALM was trained to decode object motion direction, 30% of trials in the training set were object-only conditions, and 70% of trials involved combinations of self-motion and object motion (exactly the same as for decoding heading). Conditions involving only self-motion or only object motion were included in the training sets such that the resulting decoder should be able to estimate heading or object motion across a range of stimulus conditions.
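The construction of pseudo-population responses can be sketched as follows. The array layout (neuron × condition × repetition), the function name, and the toy data are assumptions for illustration; the sketch draws one recorded repetition per neuron, independently and with replacement, for every simulated trial.

```python
import numpy as np


def pseudo_population(responses, n_sim, rng):
    """responses: (n_neurons, n_conditions, n_reps) recorded responses.

    Returns an array of shape (n_conditions, n_sim, n_neurons) in which each
    simulated trial combines independently resampled repetitions.
    """
    n_neurons, n_cond, n_reps = responses.shape
    rep_idx = rng.integers(n_reps, size=(n_neurons, n_cond, n_sim))
    sampled = np.take_along_axis(responses, rep_idx, axis=2)
    return sampled.transpose(1, 2, 0)


rng = np.random.default_rng(1)
# Toy data: 6 neurons, 4 stimulus conditions, 5 repetitions each.
responses = rng.poisson(10, size=(6, 4, 5)).astype(float)
trials = pseudo_population(responses, n_sim=500, rng=rng)
print(trials.shape)
```

Calling the function twice with fresh random draws gives separate training and test sets, as in the procedure described above.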
Experimental design and statistical analysis
Data were collected from two male animals such that sex was not a variable in the analysis. For each animal, we collected data from between 70 and 100 neurons, which is commensurate with the standards of the field for single-unit studies in awake macaques.
The design of the experiments was such that each neuron was tested with an identical set of stimulus conditions. All stimulus conditions to be directly compared in the main experimental protocol were randomly interleaved in a block of trials for each neuron, such that there were no temporal sequence effects to consider.
Because all neurons were tested with the same stimulus set, statistical analyses focus on examining comparisons of quantitative metrics across the population. For this purpose, data from the two monkeys were pooled together, and there were no between-subject factors in the analysis. All statistical analyses were performed using the Statistics toolbox in MATLAB. Parametric or nonparametric statistics were used, as appropriate to each particular comparison, as described in detail in the Results section.
Results
To investigate how visual motion signals related to self-motion and object motion interact, and how these interactions are modulated by vestibular self-motion signals, we recorded from 164 MSTd neurons in two animals (73 from Monkey D and 91 from Monkey N). Monkeys were required to maintain fixation on a head-fixed target while a large multipart object moved in one of eight directions within the frontoparallel plane (Fig. 1). In addition, self-motion along eight possible directions within the frontoparallel plane was simulated by either full-field optic flow presented alone (visual condition) or by a combination of optic flow and whole-body translation (bimodal condition). All 64 combinations of 8 directions of self-motion and 8 directions of object motion (45° apart) were presented (Fig. 1C), along with controls in which either self-motion or object motion was presented alone.
We attempted to record from any MSTd neuron that exhibited clear visual responses to moving and/or flickering random-dot stimuli. We classified cells as congruent (n = 82) if they had a Pearson correlation coefficient between visual and vestibular heading tuning that was significantly >0.2 (95% confidence interval not including 0.2, bootstrap; n = 1000). Heading tuning curves for an example congruent cell are shown in Figure 2A. Similarly, we classified cells as opposite (n = 29) if the correlation coefficient between visual and vestibular heading tuning was significantly <−0.2 (95% CI not including −0.2, bootstrap). Data from an example opposite cell are shown in Figure 2D. The remaining neurons were denoted as unclassified (n = 53). We found that these criteria produced a categorization of congruent and opposite cells that agreed well with classification by eye.
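The classification rule can be sketched in Python as below. The text does not specify how the bootstrap samples were drawn, so the sketch assumes that trials were resampled with replacement within each heading before recomputing the tuning curves; the function name and the simulated tuning data are likewise illustrative.

```python
import numpy as np


def classify_cell(visual, vestibular, n_boot=1000, threshold=0.2, seed=0):
    """visual, vestibular: (n_headings, n_reps) single-trial responses."""
    rng = np.random.default_rng(seed)
    corr = np.empty(n_boot)
    for b in range(n_boot):
        # Resample trials within each heading (assumed bootstrap scheme).
        idx_vis = rng.integers(visual.shape[1], size=visual.shape)
        idx_ves = rng.integers(vestibular.shape[1], size=vestibular.shape)
        vis_curve = np.take_along_axis(visual, idx_vis, axis=1).mean(axis=1)
        ves_curve = np.take_along_axis(vestibular, idx_ves, axis=1).mean(axis=1)
        corr[b] = np.corrcoef(vis_curve, ves_curve)[0, 1]
    lower, upper = np.percentile(corr, [2.5, 97.5])
    if lower > threshold:
        return "congruent"
    if upper < -threshold:
        return "opposite"
    return "unclassified"


rng = np.random.default_rng(2)
angles = np.arange(8) * np.pi / 4
visual = 20 + 10 * np.cos(angles)[:, None] + rng.normal(0, 1, (8, 5))
vest_same = 15 + 8 * np.cos(angles)[:, None] + rng.normal(0, 1, (8, 5))
vest_flip = 15 - 8 * np.cos(angles)[:, None] + rng.normal(0, 1, (8, 5))

label_same = classify_cell(visual, vest_same)
label_flip = classify_cell(visual, vest_flip)
print(label_same, label_flip)
```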
For comparison with previous studies (Gu et al., 2006), Figure 3 shows the distribution of differences in heading preference (|Δ Preferred Heading|) between the visual and vestibular self-motion conditions. The distribution is bimodal, indicating substantial proportions of both congruent and opposite cells. Although the bimodality is not as strong as in previous studies of MSTd (Gu et al., 2006), this may be due to the fact that heading tuning was measured in the frontoparallel plane in the present study, whereas Gu et al. (2006) measured this relationship in the horizontal plane.
Interactions between self-motion and object motion in single-unit responses
We now examine how self-motion and object motion interact in the responses of MSTd neurons. Joint tuning profiles for heading and object direction are shown in Figure 2, B,C, and E,F, for the example congruent and opposite cells, respectively. When self-motion and object-motion are presented simultaneously, both clearly influence the joint tuning of the example neurons. As a result, heading tuning curves are generally altered by object motion and object tuning curves are altered by heading. Interestingly, for the example congruent neuron (Fig. 2B,C), heading tuning curves are more consistent across object directions in the bimodal condition than in the visual condition, whereas object tuning curves are more consistent across headings in the visual condition than in the bimodal condition. In contrast, for the example opposite cell (Fig. 2E,F), the opposite pattern was observed: heading tuning is more consistent in the visual condition than the bimodal condition, and object tuning is more consistent in the bimodal condition than in the visual condition. In other words, vestibular self-motion signals appear to stabilize heading tuning for congruent cells, whereas they stabilize object tuning for opposite cells. This makes sense, intuitively, because vestibular heading tuning aligns with visual heading tuning for congruent cells, whereas it aligns with object tuning for opposite cells.
To quantify these observations, we used a DDI to measure the overall strength of tuning relative to response variability (Prince et al., 2002; DeAngelis and Uka, 2003). In our application, we wish to quantify both the overall strength of tuning for one variable (heading or object direction) and how it is influenced by variations in the other variable. Thus, we computed a DDI for heading (DDIheading) by pooling responses across all object directions (see Materials and Methods; Fig. 4A), such that DDIheading reflects the consistency of heading tuning across different directions of object motion. Analogously, we computed a DDI for object motion (DDIobject) by pooling responses across headings (see Materials and Methods; Fig. 4B).
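As a rough illustration of why pooling makes the index sensitive to tuning consistency, the sketch below uses the standard DDI form from Prince et al. (2002), (Rmax − Rmin) / (Rmax − Rmin + 2·sqrt(SSE/(N − M))). The function name `ddi` and the toy data are hypothetical, and the exact pooling procedure used here is the one given in Materials and Methods, not this sketch.

```python
import math

def ddi(responses_by_direction):
    """DDI for one tuning curve.

    responses_by_direction: list of lists, one list of single-trial responses
    per stimulus direction. For DDIheading, each inner list would pool trials
    across all object directions for that heading (assumed layout).
    """
    means = [sum(r) / len(r) for r in responses_by_direction]
    # Sum of squared deviations of single trials from their direction mean.
    sse = sum((x - m) ** 2
              for r, m in zip(responses_by_direction, means) for x in r)
    n_total = sum(len(r) for r in responses_by_direction)  # N
    n_dirs = len(responses_by_direction)                   # M
    spread = max(means) - min(means)
    noise = 2.0 * math.sqrt(sse / (n_total - n_dirs))
    return spread / (spread + noise)  # undefined if all responses are equal

# Two headings, four pooled trials each (two object directions x two repeats).
stable = [[10, 10, 10, 10], [20, 20, 20, 20]]   # heading tuning ignores object
shifted = [[10, 10, 20, 20], [20, 20, 10, 10]]  # object motion flips the tuning
print(ddi(stable), ddi(shifted))
```

When object motion changes the heading tuning curve, the pooled trials scatter around each heading's mean, which inflates the residual term and lowers the index.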
Results from our sample of MSTd neurons (Fig. 5) confirm the observations made for the example neurons of Figure 2. For congruent cells (magenta; n = 82), DDIheading is significantly greater in the bimodal condition than in the visual condition (paired t test, t(81) = −8.29, p = 2.1 × 10−12), whereas for opposite cells (cyan; n = 29), DDIheading is significantly greater in the visual condition (paired t test, t(28) = 2.98, p = 0.0057; Fig. 5A). Comparing the two types of cells, the median difference in DDIheading between the bimodal and visual conditions was significantly greater for congruent cells than opposite cells (Wilcoxon rank sum test, p = 9.7 × 10−8). Thus, vestibular signals stabilize the heading tuning of congruent cells, and weaken the heading tuning of opposite cells.
The opposite pattern of results was found for tuning to the direction of object motion (Fig. 5B). For congruent cells, DDIobject is significantly greater in the visual condition than in the bimodal condition (paired t test, t(81) = 6.20, p = 2.2 × 10−8), whereas for opposite cells, DDIobject is greater in the bimodal condition (paired t test, t(28) = −3.30, p = 0.0026). The median difference in DDIobject between bimodal and visual conditions is significantly less for congruent cells than opposite cells (Wilcoxon rank sum test, p = 8.3 × 10−8). Hence vestibular signals stabilize the object tuning of opposite cells but weaken the object tuning of congruent cells.
To further summarize these effects, we plotted the difference in DDIobject between visual and bimodal conditions against the difference in DDIheading between these stimulus conditions (Fig. 5C). Strikingly, across the entire population, there is a very strong inverse correlation between these variables (Spearman's rank r = −0.84, p = 2.80 × 10−45; n = 164), with congruent and opposite cells largely segregated by the unity-slope diagonal. Note that there is also a strong negative correlation in Figure 5C for neurons that are not statistically categorized as congruent or opposite cells (gray symbols; r = −0.78, p = 5.6 × 10−12; n = 53), indicating that the effect of vestibular signals on joint tuning for heading and object motion extends to most neurons in the population.
One may expect that neurons with stronger vestibular inputs would have greater differences in DDIheading or DDIobject between visual and bimodal conditions. Indeed, this was the case for congruent cells: the difference in DDIheading between bimodal and visual conditions is positively correlated with strength of vestibular tuning as quantified by computing DDI for the vestibular heading tuning curve (r = 0.53, p = 4.1 × 10−7; n = 82, data not shown). Analogously, the difference in DDIobject between bimodal and visual conditions is negatively correlated with strength of vestibular tuning for congruent cells (r = −0.37, p = 0.050; n = 82, data not shown). For opposite cells, vestibular tuning strength is significantly negatively correlated with the difference in DDIheading between bimodal and visual conditions (r = −0.51, p = 1.6 × 10−6; n = 29), and significantly positively correlated with the difference in DDIobject (r = 0.56, p = 0.0019; n = 29, data not shown). There was no significant difference overall, however, between the strength of vestibular heading tuning of congruent and opposite cells (Wilcoxon rank sum test, p = 0.22).
DDI quantifies the overall strength of tuning, but it does not indicate which aspects of heading tuning are modulated by object motion and vice versa. For example, different directions of object motion could change the response gain, shift the peak, or vertically shift the baseline of heading tuning curves (Fig. 2). We explored this issue by equalizing one of these aspects of tuning (e.g., peak location) and recomputing DDI values (Fig. 6; see Materials and Methods for details). For heading tuning, original DDIheading values are significantly less than mean-compensated values (Wilcoxon signed rank test: Visual: p = 1.2 × 10−28, Bimodal: p = 1.6 × 10−28; Fig. 6 A,D) and shift-compensated values (Wilcoxon signed rank test: Visual: p = 1.6 × 10−21, Bimodal: p = 2.2 × 10−18; Fig. 6B,E), but are not significantly different from gain-compensated DDIheading values (Wilcoxon signed rank test: Visual: p = 0.19, Bimodal: p = 0.49; Fig. 6C,F). This indicates that object direction mainly affects DDIheading values through baseline and peak shifts, with relatively little effect of gain changes. Similar results are obtained for object direction tuning (Fig. 6G–L; mean-compensated DDIobject: Visual: p = 5.2 × 10−28, Bimodal: p = 1.8 × 10−28; shift-compensated DDIobject: Visual: p = 1.3 × 10−9, Bimodal: p = 1.5 × 10−6; gain-compensated DDIobject: Visual: p = 0.17, Bimodal: p = 0.58). These findings suggest that variations in heading tuning or object tuning are mainly caused by vertical and horizontal shifts of the tuning curves, with less contribution from variations in response gain.
The fact that heading tuning curves show horizontal and vertical shifts as a function of object direction (and vice versa) implies that joint tuning profiles are generally not multiplicatively separable. To quantify inseparabilities in joint tuning, we used SVD analysis to compute a DSI (see Materials and Methods; Depireux et al., 2001; Mazer et al., 2002). For neurons with joint tuning profiles that are multiplicatively separable, the first singular value should dominate, and the DSI value will be close to zero. For neurons with inseparable joint tuning, such as the slanted structure visible in the example of Figure 7A, the second and higher singular values will have non-negligible values (Fig. 7B), and the DSI value will be greater. The key question is whether inseparability is reduced by the inclusion of vestibular signals. Indeed, we found that DSI values for MSTd neurons were significantly smaller (closer to 0) in the bimodal condition than in the visual condition (Wilcoxon signed rank test, p = 1.3 × 10−12; n = 164). There was no significant difference in DSI between congruent, opposite and unclassified cells (ANCOVA, main effect of congruency, p = 0.082) and no significant interaction between congruency and the effect of vestibular signals (ANCOVA, interaction effect, p = 0.069). These results indicate that vestibular signals reduce the interdependencies of heading and object direction in the responses of MSTd neurons.
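The logic of an SVD-based separability measure can be illustrated as follows. The index used in this sketch, one minus the fraction of squared singular values carried by the first component, is an assumption chosen for illustration; the exact DSI definition is the one given in Materials and Methods. The function name `inseparability_index` and the toy tuning surfaces are hypothetical.

```python
import numpy as np

def inseparability_index(joint_tuning):
    """Illustrative index in [0, 1): 0 means multiplicatively separable.

    joint_tuning: array of shape (n_headings, n_object_directions) holding
    mean responses for each heading/object combination.
    """
    s = np.linalg.svd(np.asarray(joint_tuning, dtype=float), compute_uv=False)
    power = s ** 2
    # Assumed definition: power outside the first singular component.
    return 1.0 - power[0] / power.sum()

angles = np.deg2rad(np.arange(0, 360, 45))

# Separable surface: product of a heading curve and an object-direction curve.
heading_curve = 10 + 8 * np.cos(angles)
object_curve = 1 + 0.5 * np.cos(angles - np.pi / 4)
separable = np.outer(heading_curve, object_curve)

# Slanted surface: response depends on the difference between the two angles.
slanted = 10 + 8 * np.cos(angles[:, None] - angles[None, :])

print(inseparability_index(separable), inseparability_index(slanted))
```

A product of two one-dimensional curves has rank one, so all power sits in the first singular value; a slanted (diagonal) structure needs additional components and yields a larger index.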
Likelihood-based decoding of MSTd population responses
Given that self-motion and object motion interact in the responses of MSTd neurons in rather diverse ways, the question arises as to whether it is possible to decode heading or object direction accurately from a population of such neurons. One possibility is that the brain computes the joint posterior probability distribution over both heading (θ) and object direction (ϕ) from the neural population response (r). To illustrate this, we assumed that the prior distributions over heading and object direction are flat (as in our experiment) and that population activity follows independent Poisson statistics, even though the latter is not accurate for real MSTd neurons (Gu et al., 2010, 2011). With these assumptions, we computed the joint posterior as follows (Dayan and Abbott, 2001):
$$P(\theta, \phi \mid \mathbf{r}) \propto \prod_i \frac{f_i(\theta, \phi)^{r_i}\, e^{-f_i(\theta, \phi)}}{r_i!}$$
where fi(θ, ϕ) denotes the joint tuning profile (Fig. 2B,C) of the ith neuron and ri denotes the response of the ith neuron on a particular trial.
An example joint posterior, computed from a single-trial sample of activity from all neurons in our population (Fig. 8A), is sharply peaked around the true stimulus (θ = 135°, ϕ = 135°). If the brain actually computes the joint posterior over both heading and object direction, then it should be possible to estimate both variables accurately (note that this requires the decoder to know the joint heading-object tuning profile of each neuron). Indeed, we found that performance of this decoder was quite accurate and precise (Fig. 8B,C), with errors that were close to zero.
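A joint posterior of this kind, under a flat prior and independent Poisson variability, can be computed on a stimulus grid as sketched below. The function name `joint_posterior`, the synthetic tuning surfaces, and the treatment of fi as an expected spike count are assumptions for illustration only; they do not reproduce the recorded population.

```python
import numpy as np

def joint_posterior(r, f):
    """Posterior over (heading, object direction) on a discrete grid.

    r: (n_neurons,) single-trial spike counts.
    f: (n_neurons, n_headings, n_objects) expected counts, all > 0.
    Flat prior and independent Poisson noise are assumed, so
    log P = sum_i [r_i * log f_i - f_i] + const (the r_i! term is constant).
    """
    r = np.asarray(r, dtype=float)
    f = np.asarray(f, dtype=float)
    log_post = np.tensordot(r, np.log(f), axes=(0, 0)) - f.sum(axis=0)
    log_post -= log_post.max()          # avoid overflow before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical population with additive heading and object tuning components.
rng = np.random.default_rng(0)
angles = np.deg2rad(np.arange(0, 360, 45))
n_neurons = 40
pref_h = rng.uniform(0, 2 * np.pi, n_neurons)
pref_o = rng.uniform(0, 2 * np.pi, n_neurons)
f = (2.0
     + 10.0 * np.exp(np.cos(angles[None, :, None] - pref_h[:, None, None]) - 1.0)
     + 10.0 * np.exp(np.cos(angles[None, None, :] - pref_o[:, None, None]) - 1.0))

r = rng.poisson(f[:, 3, 3])             # one trial at heading 135, object 135
post = joint_posterior(r, f)
i, j = np.unravel_index(post.argmax(), post.shape)
print(np.rad2deg(angles[i]), np.rad2deg(angles[j]))
```

Working in log space and subtracting the maximum before exponentiating keeps the computation stable when many neurons contribute.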
It may not be practical, in general, for the brain to compute the joint posterior because it could have several dimensions. For example, if there were multiple moving objects in a scene, then it would be necessary to compute a multidimensional posterior over heading and each object's motion. Alternatively, for tasks that involve judging heading, object motion can be treated as a “nuisance variable” and an ideal strategy would be for the brain to marginalize over object motion to estimate heading. In other words, it may be necessary to compute the marginal posterior over heading:
$$P(\theta \mid \mathbf{r}) = \int P(\theta, \phi \mid \mathbf{r})\, d\phi$$
Analogously, the brain could marginalize over heading to compute the marginal posterior for object motion. Indeed, marginalization is a general operation that is likely to be performed by the brain for a variety of forms of probabilistic inference.
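On a discrete stimulus grid, marginalizing amounts to summing the joint posterior along the axis of the nuisance variable. The following minimal sketch assumes a joint posterior stored as an array with headings along rows and object directions along columns; the function name `marginal_posteriors` and the numbers are hypothetical.

```python
import numpy as np

def marginal_posteriors(joint_post):
    """Return (P(heading | r), P(object | r)) from a gridded joint posterior.

    joint_post: array of shape (n_headings, n_objects), non-negative.
    """
    joint_post = np.asarray(joint_post, dtype=float)
    joint_post = joint_post / joint_post.sum()   # normalize defensively
    p_heading = joint_post.sum(axis=1)           # sum out object direction
    p_object = joint_post.sum(axis=0)            # sum out heading
    return p_heading, p_object

# Toy joint posterior over 2 headings x 3 object directions.
toy = np.array([[0.10, 0.20, 0.30],
                [0.05, 0.15, 0.20]])
p_heading, p_object = marginal_posteriors(toy)
print(p_heading, p_object)
```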
Mathematically, marginalization simply involves integrating one (or more) variables out of a joint posterior, but how can such an operation be implemented neurally? More specifically, how might the responses of MSTd neurons be decoded to marginalize over object motion and estimate heading, or vice versa? In general, performing optimal marginalization requires nonlinear operations (Beck et al., 2011); thus, we considered whether an approximate solution might be achieved through some form of linear decoding. We first considered the possibility that an approximation to marginalization could be achieved through a standard form of linear decoding that involves estimating the log-likelihood function as a response-weighted sum of tuning curves (maximum likelihood decoding; Dayan and Abbott, 2001). In simulations, we have reported recently that heading could be estimated robustly, in the presence of object motion, by computing approximate log-likelihood functions from a mixed population of congruent and opposite cells based on their vestibular tuning curves (Kim et al., 2016). This approach was quite successful in idealized populations of neurons but substantially less effective in the presence of tuning curve diversity (Kim et al., 2016); hence, it was not clear if it would be successful when applied to real data. Thus, we decoded the responses of real MSTd neurons by computing log-likelihoods based on either visual or vestibular tuning curves (see Materials and Methods; Eqs. 3, 4).
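The response-weighted sum of (log) tuning curves can be sketched as below. This is a generic Poisson log-likelihood decoder in the sense of Dayan and Abbott (2001), not a transcription of Eqs. 3 and 4, which are not reproduced in this section; the function name `decode_heading_loglik`, the inclusion of the −Σ fi(θ) term, and the synthetic tuning curves are assumptions.

```python
import numpy as np

def decode_heading_loglik(r, tuning, headings_deg):
    """Maximum-likelihood heading from unimodal tuning curves.

    r: (n_neurons,) responses on one trial.
    tuning: (n_neurons, n_headings) unimodal tuning curves (all > 0),
            e.g. visual-only or vestibular-only curves.
    Returns the decoded heading and the log-likelihood over headings.
    """
    r = np.asarray(r, dtype=float)
    tuning = np.asarray(tuning, dtype=float)
    # Each neuron's log tuning curve is weighted by its response.
    log_like = r @ np.log(tuning) - tuning.sum(axis=0)
    return headings_deg[int(np.argmax(log_like))], log_like

# Hypothetical population of 30 neurons with von Mises-like heading tuning.
rng = np.random.default_rng(0)
headings_deg = np.arange(0, 360, 45)
prefs = rng.uniform(0, 2 * np.pi, 30)
tuning = 3.0 + 20.0 * np.exp(
    2.0 * (np.cos(np.deg2rad(headings_deg)[None, :] - prefs[:, None]) - 1.0))

decoded, _ = decode_heading_loglik(tuning[:, 2], tuning, headings_deg)
print(decoded)
```

Because the weights are fixed to each neuron's own tuning curve, this decoder has no way to discount response changes caused by object motion, which is the limitation examined next.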
We found that decoding the bimodal responses of MSTd neurons using unimodal tuning curves generally produced large errors (Fig. 9A,B). We summarized these results by computing the root mean square error (RMSE) for estimates of heading (pooled across object directions) and the RMSE for estimates of object direction (pooled across headings). For heading estimation in the bimodal condition, RMSE averaged 73.2° ± 0.98° when heading was decoded based on visual heading tuning curves (Fig. 9C, filled blue bar), and 77.6° ± 6.2° when based on vestibular tuning curves (Fig. 9C, filled red bar). Errors were comparable or larger for the visual condition (Fig. 9C, open blue and red bars). Thus, the strategy of estimating log-likelihood functions over heading from unimodal tuning curves is unsuccessful in the presence of object motion. This strategy likely fails due to the considerable diversity of tuning properties across the MSTd population (Kim et al., 2016).
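Because heading and object direction are circular variables, decoding errors need to be wrapped before they are squared. The sketch below shows one way to do this; whether errors were wrapped to ±180° in the original analysis is an assumption, and the function names are hypothetical.

```python
import math

def circ_diff_deg(estimate, truth):
    """Signed angular error in degrees, wrapped to the range [-180, 180)."""
    return (estimate - truth + 180.0) % 360.0 - 180.0

def rmse_deg(estimates, truths):
    """Root mean square of wrapped angular errors, in degrees."""
    errs = [circ_diff_deg(e, t) for e, t in zip(estimates, truths)]
    return math.sqrt(sum(d * d for d in errs) / len(errs))

# An estimate of 350 deg for a true heading of 10 deg is a 20 deg error,
# not a 340 deg error.
print(circ_diff_deg(350, 10), rmse_deg([350, 10], [10, 350]))
```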
Analogous results are obtained for decoding object direction based on object-only tuning (Fig. 9D, orange bars), which is also unsuccessful. Thus, it does not appear possible for the brain to effectively dissociate heading and object direction by estimating log-likelihood functions from response-weighted sums of unimodal tuning curves. However, this does not rule out the possibility that other forms of linear decoding might better approximate marginalization.
Approximate linear marginalization
We have recently shown, through biologically-constrained simulations, that it is possible to learn a linear decoder of neural responses that can provide a good approximation to the marginal posterior (Kim et al., 2016). We now apply this approximate linear marginalization (ALM) approach to population responses from area MSTd to determine whether it is capable of dissociating the effects of self-motion and object motion.
Similar to the likelihood computation described above, ALM involves a weighted linear decoding of neural responses. However, unlike the likelihood computation in which the weight profiles are constrained to be the neuron's tuning curve, ALM finds a weight profile for each neuron that achieves the best approximation to the marginal posterior. More specifically, ALM seeks to compute an approximate version of the marginal posterior (over heading, for example) that can be described as a member of the exponential family with linear sufficient statistics (see Materials and Methods; Eq. 6). We therefore attempted to find the parameters, h(θ) and g(θ), that best approximate the true marginalized posterior (P) by the approximate form Q. In this formulation (Eq. 6), h(θ) represents a matrix of decoding weights, in which each neuron (row) has a weight corresponding to each possible heading (column), whereas g(θ) represents a response-independent bias parameter for each heading.
We have demonstrated previously (Kim et al., 2016) that the desired quantities, h(θ) and g(θ), can be computed by applying multinomial logistic regression to minimize the Kullback–Leibler divergence between P and Q (Fig. 10A). We refer to the linear transformation that best approximates the marginalized posterior as ALM. The result of this analysis is a set of decoding weights for each neuron, given by h(θ), that describe how much each neuron contributes to the representation of each possible heading, θ, in the approximate marginal posterior. The resulting decoding weights provide the best approximation to the marginalized posterior over heading that can be achieved by a linear decoder in the exponential family. Analogously, we can also learn a set of decoding weights, h(ϕ), to estimate the marginal posterior over object direction (see Materials and Methods for details).
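The fitting step can be illustrated with a small multinomial logistic regression written directly in NumPy. This is a sketch under stated assumptions, not the procedure of Kim et al. (2016): the synthetic population, z-scoring of responses, plain gradient descent, the L2 penalty, and all names (`fit_linear_marginal_decoder`, `H`, `g`) are illustrative choices. The model is Q(θ | r) ∝ exp(h(θ)·r + g(θ)); on sampled trials, minimizing the average negative log Q of the true heading corresponds, up to a term that does not depend on the weights, to minimizing the expected Kullback–Leibler divergence between P and Q.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_linear_marginal_decoder(R, labels, n_classes,
                                lr=0.05, n_iter=2000, l2=1e-3):
    """Fit weights H (neurons x headings) and biases g (headings).

    R: (n_trials, n_neurons) responses; labels: (n_trials,) heading indices.
    Object direction varies across trials but is never given to the decoder,
    so the fit has to discount it.
    """
    n_trials, n_neurons = R.shape
    H = np.zeros((n_neurons, n_classes))
    g = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]                 # one-hot true headings
    for _ in range(n_iter):
        Q = softmax(R @ H + g)                    # approximate posterior
        err = (Q - Y) / n_trials                  # cross-entropy gradient
        H -= lr * (R.T @ err + l2 * H)
        g -= lr * err.sum(axis=0)
    return H, g

# Hypothetical population: 60 neurons, 8 headings x 8 object directions.
rng = np.random.default_rng(0)
angles = np.deg2rad(np.arange(0, 360, 45))
n_neurons, n_rep = 60, 20
pref_h = rng.uniform(0, 2 * np.pi, n_neurons)
pref_o = rng.uniform(0, 2 * np.pi, n_neurons)
h_grid, o_grid = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
h_idx = np.repeat(h_grid.ravel(), n_rep)
o_idx = np.repeat(o_grid.ravel(), n_rep)
rate = (5.0
        + 15.0 * np.exp(2.0 * (np.cos(angles[h_idx][:, None] - pref_h[None, :]) - 1.0))
        + 10.0 * np.exp(2.0 * (np.cos(angles[o_idx][:, None] - pref_o[None, :]) - 1.0)))
counts = rng.poisson(rate)
R = (counts - counts.mean(axis=0)) / counts.std(axis=0)   # z-score per neuron

H, g = fit_linear_marginal_decoder(R, h_idx, n_classes=8)
accuracy = np.mean(np.argmax(R @ H + g, axis=1) == h_idx)
print(H.shape, g.shape, accuracy)
```

Each column of `H` plays the role of the weight profile across neurons for one heading, and each row gives one neuron's weight profile across headings, which is the quantity later compared with tuning curves. Training accuracy on synthetic data is reported only as a sanity check; a real analysis would evaluate on held-out trials.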
We applied ALM to responses of a subpopulation of 58 MSTd neurons: all 29 opposite cells and a subset of 29 congruent cells that were randomly chosen from our sample of 82 congruent cells. Results were obtained for 10 different randomly-chosen subsets of congruent cells, and the outcomes were averaged. This was done such that the contributions of congruent and opposite cells to the population were balanced. Our results demonstrate that ALM is quite accurate in extracting estimates of heading that are robust to object motion and vice versa (compare Figs. 10B,C and 9A,B). For estimating heading, RMSEs were 7.32° and 3.68° for the visual and bimodal conditions (Fig. 9C, brown bars), values that are dramatically smaller than the RMSE values obtained when heading is decoded by computing log-likelihoods based on vestibular tuning (Fig. 9C, red bars; visual: 93.9°, bimodal: 77.6°) or visual tuning (Fig. 9C, blue bars; visual: 72.9°, bimodal: 73.2°). Indeed, the median RMS heading error for ALM was significantly lower than that obtained by decoding with visual or vestibular tuning curves (Wilcoxon signed rank tests, p < 10−12 for both visual and bimodal conditions). Importantly, the RMSE for ALM is significantly smaller in the bimodal condition (3.68°) than the visual condition (7.32°; Wilcoxon signed rank test, p = 1.5 × 10−5; Fig. 9C, brown bars), indicating that vestibular signals significantly improve the accuracy with which the marginal posterior can be estimated from MSTd activity. This benefit of vestibular signals on ALM performance is roughly similar to recent behavioral results from monkeys (Dokka et al., 2015a), as discussed further below.
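The balanced resampling scheme described above can be sketched as follows. The function name `balanced_subsets` and the integer cell identifiers are hypothetical; only the counts (82 congruent cells, 29 opposite cells, 10 repeats) come from the text.

```python
import random

def balanced_subsets(congruent_ids, opposite_ids, n_repeats=10, seed=0):
    """Build populations with equal numbers of congruent and opposite cells.

    Every subset keeps all opposite cells and adds an equally sized random
    draw (without replacement) from the congruent cells.
    """
    rng = random.Random(seed)
    k = len(opposite_ids)
    return [sorted(rng.sample(congruent_ids, k)) + list(opposite_ids)
            for _ in range(n_repeats)]

congruent_ids = list(range(82))        # placeholder identifiers
opposite_ids = list(range(100, 129))   # placeholder identifiers
subsets = balanced_subsets(congruent_ids, opposite_ids)
print(len(subsets), len(subsets[0]))
```

Decoding would then be run once per subset and the resulting errors averaged across the repeats.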
Analogous results were obtained when ALM was used to estimate object motion and marginalize over heading (Figs. 10C, 9D). RMSEs for estimating object direction were greatly reduced compared with when object direction was estimated by computing likelihoods based on object-only tuning (Fig. 9D; visual: p = 1.6 × 10−10, bimodal: p = 6.1 × 10−12). Furthermore, the addition of vestibular signals modestly but significantly reduced the RMSE in object direction from 1.91° in the visual condition to 1.64° in the bimodal condition (Fig. 9D, brown bars; Wilcoxon signed rank test, p = 0.0032).
Although the results of Figure 9, C and D, were obtained from balanced populations of congruent and opposite cells, we also found that qualitatively similar results were obtained when ALM was trained to decode the responses of all MSTd neurons. For decoding heading, RMSE in the bimodal condition (0.37°) was significantly less than RMSE in the visual condition (0.81°; p = 0.019, Wilcoxon signed rank test). For decoding object motion, the RMSE for the bimodal condition (0.28°) was also significantly less than the RMSE in the visual condition (0.36°; p = 2.3 × 10−4, Wilcoxon signed rank test). Thus, the effect of vestibular signals on estimation of heading and object motion can be observed when all neurons are decoded, although the differential effects become smaller as the errors approach zero.
Together the results of Figures 9 and 10 show that it is possible to linearly transform MSTd activity and obtain robust estimates of either heading or object direction; moreover, vestibular input significantly improves the accuracy of these estimates. These findings qualitatively mirror the effects of vestibular signals observed in recent behavioral studies (Fajen and Matthis, 2013; Dokka et al., 2015a,b), although one should exercise caution in comparing absolute errors between ALM and behavior (see Discussion).
Comparison of ALM weights and neural tuning curves
Our previous computational study (Kim et al., 2016) showed that the decoding weight profiles learned by ALM for heading estimation were roughly similar to the vestibular tuning curves of model neurons, but not to the visual tuning curves (for opposite cells). To determine whether similar relationships hold for real MSTd neurons, we compared the learned weight profile for each neuron, hi(θ) or hi(ϕ), with the unimodal heading tuning curves and the object-only tuning curve. As described above, we averaged results across 10 balanced populations of 29 congruent and 29 opposite cells.
Results from an example congruent cell (Fig. 11A) reveal that the ALM weights for estimating heading, hi(θ), have a profile similar in shape to the vestibular and visual heading tuning curves of this neuron. The ALM weights for estimating object motion, hi(ϕ), are roughly similar in shape to the object-only tuning curve but not the visual heading tuning curve. For the example opposite cell (Fig. 11B), the results are quite different. Here, hi(θ) resembles the vestibular heading tuning curve, but not the visual heading tuning curve, and hi(ϕ) resembles the object-only tuning curve and not the visual heading tuning. Notably, hi(θ) resembles the vestibular heading tuning for both the congruent cell and the opposite cell in Figure 11, A and B, similar to the findings of Kim et al. (2016) for simulated neural populations.
The pattern of results exhibited by these example neurons is largely confirmed across the population of MSTd neurons, as assessed by computing correlation coefficients between ALM weights and tuning curves. For congruent cells (Fig. 11C), hi(θ) is generally well correlated with both vestibular and visual heading tuning curves: the median correlation coefficients are 0.69 and 0.65, respectively. Both of these values are significantly different from zero (sign tests: vestibular: p = 1.6 × 10−76, visual: p = 5.6 × 10−61), and the median correlation with vestibular heading tuning is significantly greater than the correlation with visual tuning (Wilcoxon signed rank test, p = 4.1 × 10−4). By comparison, hi(ϕ) for congruent cells is modestly, but significantly, correlated with object-only tuning (median correlation coefficient = 0.30; sign test, p = 2.6 × 10−11).
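The comparison between weight profiles and tuning curves rests on two simple computations: a correlation coefficient per neuron and a sign test on the resulting distribution. The sketch below assumes Pearson correlation and an exact two-sided binomial sign test; the section does not state which variants were used, and the function names and example values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between a weight profile and a tuning curve."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sign_test_p(values):
    """Exact two-sided sign test of the hypothesis that the median is zero."""
    pos = sum(v > 0 for v in values)
    neg = sum(v < 0 for v in values)
    n, k = pos + neg, min(pos, neg)      # zeros are discarded
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical weight profile and tuning curve over 8 headings.
weights = [0.1, 0.4, 0.9, 0.5, 0.0, -0.4, -0.8, -0.3]
tuning = [12, 18, 25, 19, 11, 6, 3, 8]
print(pearson(weights, tuning), sign_test_p([0.5] * 10))
```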
For opposite cells (Fig. 11D), a different pattern emerges. We find that hi(θ) is generally positively correlated with vestibular heading tuning (median correlation coefficient = 0.39; sign test, p = 5.6 × 10−18) but shows no systematic correlation with visual heading tuning (median correlation coefficient = −0.20; sign test, p = 0.80). Comparing ALM heading weights across congruent and opposite cells, we see that ALM weights tend to be positively correlated with vestibular heading tuning for both types of neurons, roughly consistent with the findings of Kim et al. (2016) for model neurons.
For opposite cells, ALM weights for estimating object direction, hi(ϕ), are strongly correlated with object-only tuning (median correlation coefficient = 0.70; sign test, p = 5.3 × 10−49), and this correlation is significantly stronger than the corresponding relationship for congruent cells (Wilcoxon rank sum test, p = 1.3 × 10−13). This finding is consistent with the idea that opposite cells may contribute more substantially to estimating object direction than congruent cells.
The results of Figure 11, C and D, suggest that the weight profiles learned by ALM have some systematic relationships with the unimodal tuning curves for heading and object motion. We also examined whether the amplitudes of the ALM weight profiles, hi(θ) and hi(ϕ), are correlated with the amplitudes of the unimodal tuning curves. We found no significant correlation between the amplitude of hi(θ) and the amplitude of visual or vestibular heading tuning curves, for either congruent or opposite cells (Fig. 11E,F; Spearman rank correlations, p > 0.10). Similarly, we found no correlation between the amplitude of hi(ϕ) and object tuning curves (Spearman rank correlations, p > 0.66). Thus, ALM does not weight the responses of MSTd neurons according to the strength of their unimodal tuning curves (unlike the log-likelihood computation), even though there are systematic relationships between the shapes of ALM weights and tuning curves. This suggests that the superior performance of ALM over the log-likelihood computation arises, at least in part, from applying a gain factor to neurons that is not predictable from their tuning strength.
Discussion
Our findings show that vestibular signals contribute to generating robust representations of self-motion and object motion in area MSTd. Specifically, vestibular signals enhance the separability of the joint heading/object tuning profiles of MSTd neurons. As a result, vestibular signals enhance the representation of heading in congruent cells and the representation of object motion in opposite cells. We demonstrate that a linear transformation of MSTd responses (ALM) can allow fairly accurate decoding of either heading or object motion while marginalizing over variations in the other variable, and that vestibular signals facilitate this marginalization computation. Together, our findings demonstrate a fundamental role for diverse multisensory representations in performing neural computations that parse sensory inputs into components that reflect separate physical causes.
Limitations of stimulus design
Our exploration of interactions between self-motion and object motion was limited to the frontoparallel plane, because it was not practically feasible to record from neurons long enough to study the interactions in all three dimensions. We chose the frontoparallel plane, instead of the horizontal plane, because there are fewer visual cues to distinguish object motion from self-motion, thus making the marginalization problem more difficult. In the horizontal plane, there would also be looming and changing disparity cues that might be exploited to dissociate self-motion and object motion. Previous work (Logan and Duffy, 2006) suggests that the relationship between visual heading tuning and object direction tuning is similar in the frontoparallel and horizontal planes for MSTd neurons; thus, we would expect similar results in the horizontal plane. We recognize, however, that the additional visual cues that would be available in the horizontal plane might also aid the dissociation of self-motion and object motion. Thus, there might be less need for marginalization operations when self-motion occurs in the horizontal plane.
Another limitation is that we used a large, multipart object to elicit robust responses of MSTd neurons to object motion. Although our object is probably larger than most objects encountered in natural scenes, this choice was driven by the fact that most MSTd neurons do not respond well to small objects (Komatsu and Wurtz, 1988a,b), as well as the need to have the object easily visually segmented from the background dots.
Approximate linear marginalization
Marginalization operations are of considerable importance because they allow the brain to estimate a variable of interest and integrate out the effects of nuisance variables. Marginalization can be easily accomplished with a simple linear decoder if nuisance parameters simply change the gain of neural responses (Ma et al., 2006; Beck et al., 2011). However, the interactions between self-motion and object motion in MSTd responses are generally not well described as gain modulations (Figs. 2, 6). Under these conditions, marginalization generally requires nonlinear transformations of neural responses (Beck et al., 2011) that may be difficult for the brain to implement.
We recently demonstrated that it is possible to approximate marginalization using a linear transformation of neural responses that we call ALM (Kim et al., 2016). ALM was successful in biologically-constrained simulations, even when the joint heading-object tuning of model neurons was strongly inseparable. Our present findings show that ALM performs quite well when applied to responses of real MSTd neurons, even when the neural populations are modest in size (∼60 neurons). Our results suggest that simple linear transformations could help to solve a variety of computational problems that require marginalizing out the effects of nuisance parameters. Further work is needed to determine how well ALM performs when multiple nuisance variables are present, but the approach is general and should be applicable to larger-scale problems.
Neural substrates for dissociating heading and object motion
In macaque monkeys, visual motion processing involves a network of cortical areas, including MT, MST, VIP, and FST (Orban et al., 2004). Although many studies have been conducted with a variety of stimuli, including plaids and random-dot patterns that contain multiple velocity components (Treue et al., 2000; Rust et al., 2006; McDonald et al., 2014), only a few previous studies have tested neurons with combinations of visual motion cues that simulate both self-motion and object motion (Logan and Duffy, 2006; Sato et al., 2010; Kishore et al., 2012). Consistent with our findings, these previous studies report that interactions between heading and object motion can be complex, showing a variety of subadditive and superadditive interactions (Sato et al., 2010). A population vector decoder failed to recover heading accurately when object motion was opposite to self-motion (Logan and Duffy, 2006), consistent with our finding that likelihood functions computed as response-weighted sums of tuning curves fail to approximate marginalization. Importantly, these previous studies only examined interactions between visual cues to self-motion and object motion; our study is the first to examine whether vestibular signals help to dissociate self-motion from object motion in neural responses.
Our findings suggest that area MSTd is well suited to play a role in dissociating self-motion and object motion through multisensory integration. However, MSTd neurons generally do not respond well to small moving stimuli (Komatsu and Wurtz, 1988a,b). Indeed, this was also our experience in preliminary experiments, leading us to adopt a stimulus design in which a large multipart object moved independently in the virtual environment (see Materials and Methods). The role of MSTd in representing object motion may therefore be limited to situations in which moving objects subtend a substantial portion of the visual field, and other areas may represent the motion of small objects during self-motion. One candidate area is MSTl, where neurons have smaller receptive fields and frequently exhibit strong surround suppression (Eifuku and Wurtz, 1998). It is unknown, however, whether MSTl neurons carry vestibular signals that could aid in dissociating self-motion and object motion. Another candidate area is V6 (Galletti et al., 1999, 2001), based on human imaging studies (Pitzalis et al., 2010, 2012). To our knowledge, a direct examination of interactions between object motion and self-motion has not been conducted in V6 at the single-neuron level. However, recent results (Fan et al., 2015) indicate that monkey V6 neurons do not carry vestibular signals regarding body translation. Other visual-vestibular areas that may contribute include the VIP area (Chen et al., 2011b), VPS area (Chen et al., 2011a), and the FEF (Gu et al., 2016).
It is important to emphasize that multisensory mechanisms for dissociating self-motion and object motion may operate in parallel with purely visual mechanisms, such as those involving center-surround mechanisms (Anstis and Reinhardt-Rutland, 1976; Frost and Nakayama, 1983). Indeed, computational models have demonstrated that such visual strategies can be effective (Royden, 2002; Royden and Holloway, 2014). Our results suggest that multisensory mechanisms may augment these visual strategies.
Until recently, the functional role of opposite cells has remained unclear. Previous work showed that opposite cells are ill-suited for cue integration (Gu et al., 2008, 2014) and cue-weighting (Fetsch et al., 2011) in heading perception. Choice-related activity suggested that opposite cells might be decoded according to their vestibular heading preferences (Gu et al., 2008, 2014), and simulations showed this to be an effective strategy for estimating heading during object motion (Kim et al., 2016). Our present results are broadly consistent with this idea, as the weights learned by ALM to estimate heading tend to resemble vestibular tuning for opposite cells. Together, these findings suggest that opposite cells make important contributions to dissociating self-motion and object motion; moreover, the computational strategy may be applicable to source separation problems involving other sensory modalities.
Relationship to behavioral studies
Psychophysical studies have explored whether the visual system can parse retinal image motion into components related to self-motion and object motion (Rushton and Warren, 2005; Warren and Rushton, 2007, 2008, 2009; Matsumiya and Ando, 2009). These “flow parsing” studies show that global patterns of optic flow consistent with self-motion can alter the perceived velocity of small moving objects, consistent with discounting global flow, even when the object is located in the hemi-field opposite to optic flow (Warren and Rushton, 2009). However, it is not clear from these studies whether the visual system, by itself, can fully dissociate self-motion and object motion, nor whether vestibular signals may contribute to flow-parsing.
A few recent psychophysical studies have examined perception of object motion during both real and simulated self-motion (MacNeilage et al., 2012; Fajen and Matthis, 2013; Dokka et al., 2015b). Consistent with our findings from decoding MSTd neurons, Dokka et al. (2015b) reported that vestibular signals reduce biases in perceived object direction during self-motion. Dokka et al. (2015b) found substantially larger biases than we predict from ALM, however, suggesting that neural processing in humans may not achieve near-optimal marginalization. Because the species and stimulus conditions differ, direct comparisons are difficult. Using a different paradigm, Fajen and Matthis (2013) also found evidence that vestibular signals contribute to human judgements of object direction during self-motion, indicating that multisensory mechanisms play important roles in dissociating self-motion and object motion.
One can also consider the complementary question of how moving objects affect judgements of heading. Observers can exhibit substantial biases when they judge heading in the presence of objects that move independently in the world, especially when those objects overlap the focus of expansion in optic flow (Warren and Saunders, 1995; Royden and Hildreth, 1996, 1999; Fajen and Kim, 2002; Mapstone and Duffy, 2010). It is generally difficult to compare our population decoding results from MSTd to these previous studies because stimulus conditions were quite different. Our results regarding vestibular signals are most comparable to a recent study of monkey heading perception (Dokka et al., 2015a), despite substantial differences in object size and directions of self-motion. Dokka et al. (2015a) reported that addition of vestibular self-motion signals largely eliminated heading biases caused by object motion, indicating that vestibular signals contribute to dissociating self-motion and object motion. Notably, the relative magnitudes of heading biases found by Dokka et al. (2015a) in the visual and bimodal conditions (see their Fig. 4B) are approximately comparable to our results from applying ALM to MSTd responses (Fig. 9C, brown bars).
In closing, in conjunction with recent computational work (Kim et al., 2016), the present findings suggest that a linear transformation of the responses of a diverse population of multisensory neurons may be sufficient to dissociate self-motion and object motion. Future studies can build on this foundation by recording and manipulating neural activity while animals dissociate object motion and self-motion perceptually, and these efforts will be important to tease apart the circuitry underlying these fundamental computations.
Footnotes
This work was supported by NIH Grants EY016178 (G.C.D.) and DC014678 (D.E.A.), The Uehara Memorial Foundation (R.S.), the Japan Society for the Promotion of Science (R.S.), and an NEI CORE Grant (EY001319). We thank Dina Knoedl, Swati Shimpi, and Emily Murphy for excellent technical support, and Johnny Wen for programming support.
The authors declare no competing financial interests.
Correspondence should be addressed to Dr. Gregory C. DeAngelis, Department of Brain and Cognitive Sciences, Center for Visual Science, 310 Meliora Hall, University of Rochester, Rochester, NY 14627-0268. gdeangelis@cvs.rochester.edu