Abstract
In the brain, spike trains are generated in time and presumably also interpreted as they unfold in time. Recent work (Oram et al., 1999; Baker and Lemon, 2000) suggests that in several areas of the monkey brain, individual spike times carry information because they reflect an underlying rate variation. Constructing a model based on this stochastic structure allows us to apply order statistics to decode spike trains instant by instant as spikes arrive or do not. Order statistics are timeconsuming to compute in the general case. We demonstrate that data from neurons in primary visual cortex are well fit by a mixture of Poisson processes; in this special case, our computations are substantially faster. In these data, spike timing contributed information beyond that available from the spike count throughout the trial. At the end of the trial, a decoder based on the mixtureofPoissons model correctly decoded about three times as many trials as expected by chance, compared with approximately twice as many as expected by chance using the spike count only. If our model perfectly described the spike trains, and enough data were available to estimate model parameters, then our Bayesian decoder would be optimal. For fourfifths of the sets of stimuluselicited responses, the observed spike trains were consistent with the mixtureofPoissons model. Most of the error in estimating stimulus probabilities is attributable to not having enough data to specify the parameters of the model rather than to misspecification of the model itself.
Introduction
Because responses are presumably interpreted as they unfold in time, our goal is to decode spike trains instant by instant depending on whether a spike arrived. If spike trains are thought of as words in a language, then decoding—figuring out what stimulus elicited a particular observed spike train—can be thought of as looking up words in a neural dictionary. Ideally this dictionary should allow us, at any point in time, to translate a spike train into the best possible guess of which stimulus elicited it.
One approach to constructing a neural dictionary is to make as few assumptions as possible. Another approach, which we take here, is to model spike trains using (and checking) certain assumptions about their structure. Recent work shows that in spike trains from three areas of the monkey brain (the lateral geniculate nucleus, primary visual cortex (V1), and primary motor cortex) spike times appear to have been thrown down at random, with probabilities determined by the firing rate profile over time, the peristimulus time histogram (PSTH) (Oram et al., 1999, 2001; Baker and Lemon, 2000). Oram et al. (1999) and Baker and Lemon (2000) simulated stochastic (but nonPoisson) spike trains with relations among spikes that were indistinguishable from those observed in experimental data. Using these models, it is in principle possible to determine how likely each stimulus is to have elicited every possible spike train. However, compiling a dictionary in this way would require simulating very large data sets, which, although less difficult than gathering enough experimental data, would still be prohibitively timeconsuming.
Here we show that for spike trains with the stochastic structure described by Oram et al. (1999), we can use order statistics (Arnold et al., 1992), instead of simulation, to calculate directly how likely each stimulus is to have elicited each spike train. Order statistics give the probabilities of individual spike times in trains with this stochastic structure based on the spike count and the firing rate profile—exactly the parameters used by Oram et al. (1999). Using order statistics, we can update the stimulus probabilities at each instant depending on whether a spike arrives. Once we have these probabilities, we can guess, for example, that the stimulus with the highest probability elicited the spike train. This is something like looking up a word letter by letter and discarding words that do not begin with the letters seen so far.
Although much faster than simulation, decoding using order statistics is, in the general case, still more timeconsuming than we would like. However, calculating order statistics is much simpler and faster in the special case in which the spike trains are considered as arising from a mixture of a small number of Poisson processes. We show that our data are in fact well fit by such a model. The recognition of this structure in the data substantially simplifies and speeds decoding of singleneuron spike trains instant by instant.
Materials and Methods
Decoding. Decoding, that is, guessing which stimulus elicited a particular spike train, has two steps. We first estimate the probability that each stimulus elicited the spike train and then choose a stimulus based on those estimated probabilities. Here we minimize expected error by choosing the most probable stimulus. If some mistakes are more important than others, a decision rule minimizing an appropriate cost function could be used.
With the decision rule fixed, decoding a spike train as it unfolds in time requires estimating the probabilityp _{s}(t _{j}‖R _{j}): the probability, estimated at time t _{j}, that stimulus s elicited the responseR _{j} = (r _{1}, . . .r _{j}), where eachr _{i} is 1 if a spike occurred at timei and 0 if no spike occurred in time bin i. The presence or absence of a spike is measured in 1 msec bins; timet _{i} is the time in the ith bin. Any bin width can be used as long as there can be at most one spike in any bin. All of our calculations could be recast in continuous time in a straightforward manner.
Using Bayes' rule, estimatingp
_{s}(t
_{j}‖R
_{j}) can be transformed into estimating the probability that a spike occurs in the next bin (which may depend on the history of the spike train):
Using Equations 1 and 2, we proceed bin by bin, updating the stimulus probabilities depending on whether a spike occurs in each bin. For efficiency, responses can be decoded in small jumps rather than instant by instant. The probabilityP(r
_{j} . . .r
_{j + k}‖s) for any response can be calculated using the distribution of first spike times,P(τ_{1}‖s):
We avoid overfitting using threeway crossvalidation: the trials were divided into three subsets, with twothirds used to fit the model and the remaining third used to test decoding. Each third of the data was used twice to fit and once to test; we averaged over the three splits.
Order statistics for spike trains. By considering each spike as the “next first spike,” the distribution of first spike timesP(τ_{1} =t_{j} ‖s) was calculated for each stimulus using Equations 3 and 4.
For a spike train with n spikes, the (n,k) order statistic describes the distribution of thekth of n draws, after those draws are sorted; in our case, this will beP(τ_{k}‖s, n), the time of the kth of n spikes in a train elicited by stimulus s:
Figure 1 illustrates estimating the quantities in Equation 6 from data for spike trains elicited by two stimuli (coded black, gray throughout). One stimulus elicits spikes beginning ∼25 msec earlier than the other. For each stimuluss, the firing rate profile isf _{s}(t) (Fig.1 B).F _{s}(t) (Fig. 1 C) is the corresponding cumulative probability. Fromf _{s}(t) andF _{s}(t), we can calculateP(τ_{1}‖s, n) for any spike count n (Eq. 6; Fig. 1 D). Asn increases, the weight of the first order statistic shifts left; i.e., the first spike is likely to occur earlier.
To avoid requiring the decoder to know in advance how many spikesn will arrive in a particular trial, we average the first order statistics P(τ_{1}‖s, n), weighted with the expected distribution of spike countsp̃
_{s}(n) for stimuluss (Fig. 1E; we usep̃
_{s}(n) to distinguish fromp
_{s}(t), estimated probability of stimulus s at time t). This countweighted first order statistic,P(τ_{1}‖s), gives the expected probability of the first spike occurring at time tfor spike density f
_{s}(t) and the expected distribution of spike counts for stimuluss:
The second order statistic conditioned on the first spike time is calculated as the first order statistic of a restricted responsef̃ _{s}, created by lefttruncatingf _{s}(t) at the time of the previous spike and renormalized to sum to 1. The (n,k + 1) order statistic can be calculated as the (n − k, 1) order statistic of the remaining response (Arnold et al., 1992).
Order statistics show a Markov property such that future spike times depend on past spike times only through the number of spikes already observed and the most recent spike time. This dependence on the number of spikes already observed means that the process is not a renewal process. Figure 2 shows first order statistics calculated at the beginning of a trial and after the first and fifth spikes in the response.
Figure 3 shows the decoding of several spike trains for the twostimulus example of Figure 1.
Simplifying order statistics using a mixture of Poisson distributions. Equation 7 can be used to calculate the countweighted first order statistic for any distribution of spike counts. If the spike count distribution is a mixture of a finite number of Poisson distributions, our calculations become much simpler.
For a Poisson process with timevarying mean rate λ_{s}
f
_{s}(t), the probability that the first spike occurs at time j simplifies to:
Equation 8 calculates the sum of countspecific first order statisticsP(τ_{1}‖s, n) for eachn, weighted with the appropriate Poisson probabilities. This simplifies Equation 7 in the case of a simple Poisson distribution of spike counts. However, the distributions of stimuluselicited spike counts from several brain areas are not adequately modeled by a Poisson distribution (Baddeley et al., 1997; Gershon et al., 1998; Wiener et al., 2001).
A mixture of a finite number of Poisson distributions is more flexible than a single Poisson distribution for modeling stimuluselicited spike count distributions and almost as simple. To model the distribution of spike counts elicited by stimulus s as a mixture of Poisson distributions with mean rates λ_{s,i} with weights summing to 1, we add the first order statistics for each mean rate λ_{s,i} weighted by the probabilityp(λ_{s,i}) of observing that rate:
The calculations in Equations 8 and 10 are conceptually identical to those in Equations 6 and 7. However, because the sum in Equation 6 can be taken analytically in the case of a Poisson distribution, the calculations using a mixture of Poisson distributions are approximately an order of magnitude faster for our data sets. The speedup comes because we calculate order statistics for only a few Poisson means rather than for many spike counts. The two methods differ slightly in how they deal with the discretization of time into bins (the approximation in Eq. 9 is unnecessary when using Eqs. 6, 7). Below we will show that the effect of this difference on the estimated stimulus probabilities is negligible.
Model parameters on subintervals. The decoding procedures outlined above iteratively treat each new spike as the first spike in a shorter response. This requires estimating model parameters for the shorter response from the parameters for the full model. For the multiple Poisson formulation, this is simple and quick (whereas in the general case, it is computationally intensive). The rate function of a Poisson process on a subinterval is the original rate function restricted to that interval. The weightsp(λ_{s,i}‖s) are continuously updated using Bayes' rule, because each weight is the probability with which we believe the spike train being decoded comes from the particular process. The Poisson processes differ only by a multiplier of the rate, so our evidence of which process the train comes from is the number of spikes we have seen:
Generating surrogate data. To compare the performance of our decoder on recorded data with its performance on data with the structure for which the decoder was designed, we simulate spike trains using a procedure similar to that used for decoding. We calculate the distribution P(τ‖s, 1) of next spike times (above) and randomly select a spike time. The simulation also proceeds iteratively: after each spike time is chosen, the distribution for the next is calculated. The probabilities in Equations 7 and 10 do not sum to 1, because they omit the probability of no additional spikes. For implementation, the probability that no more spikes occur in the interval is assigned to an additional bin. When this bin is selected, we consider the spike train under construction to be complete. By construction, such trains have the structure specified by the model.
Surrogate spike trains matched to observed data are generated using the observed spike density and spike count distribution for each stimulus. The same number of trials observed is generated for each stimulus.
Incorporating a refractory period. Real spike train data often depart from the simple mixtureofPoissons model by becoming either less or more likely to fire for a short time after a spike (exhibiting a refractory or rebound period, respectively). To adjust, we depart slightly from classical order statistics and replace the spike density function f(t) withf(t)r(t − τ), where τ is the time of the previous spike.
We estimate r by comparing the observed interspike interval distribution with the interspike interval distribution of a set of surrogate trains. Because interval distribution is related to spike count (more spikes mean shorter intervals), we actually generate more surrogate trains than needed and subselect to match the observed spike count distribution. If the interspike interval in the generated data is significantly different from the observed interspike interval (χ^{2} test, p < 0.05), then r(1) is set tog _{obs}(1)/g _{gen}(1), where g _{obs} andg _{gen} are the interspike interval distributions in the observed and generated data. A new set of trials is simulated with this new refractory period (in which the frequency of interspike intervals of length 1 now matches the frequency in the observed data), and the process is repeated for subsequent intervals until the interspike interval distribution of the generated data is indistinguishable from the interspike interval distribution of the experimental data.
Estimating the spike density function from data. The spike density function measures rate variation over time. To estimate the spike density function for a particular stimulus, we create a histogram at 1 msec precision (the experimental sampling rate) across trials for that stimulus and smooth using local regression (Loader, 1999) with a window using 10% of the data. The spike density functions obtained using this window and the optimal smoothing window determined by generalized crossvalidation (Craven and Wahba, 1979; Loader, 1999) are similar: median correlation, 0.98; interquartile range (iqr), 0.95–0.99. The smoothed histogram is normalized so that its sum across bins is 1.
Fitting a mixture of Poisson distributions to observed spike counts. The probability of drawing a particular count nfrom a mixture of Poisson distributions is the weighted sum of the probabilities of drawing the count n from each of the individual Poisson distributions in the mixture:
For each stimulus s, the parameters λ_{s,i} andp(λ_{s,i}) are estimated by maximizing the log likelihood of the observed spike counts given the corresponding model. We seek the most parsimonious description of each distribution. Thus, for each stimulus, we first find the best single Poisson distribution (k = 1). If the observed distribution could reasonably have come from the fit Poisson (χ^{2} test, p > 0.05), we use this model; otherwise, we fit a mixture of two Poisson distributions and again check for consistency. In general, if the data are inconsistent with a mixture of k Poisson distributions, we fit a mixture with k + 1 Poisson distributions. In this work, we did not require mixtures of more than five Poisson distributions (see Results).
Information calculations. Transmitted information (Shannon and Weaver, 1949; Cover and Thomas, 1991) is defined asI(R, S) = ∑S _{r,s} p(r, s)log(p(s‖r)/p(s)), where the stimulus probabilities p(s) are taken from the experiment, and p(s‖r) is calculated using Equation 1 or 2. Using response models avoids some estimation problems associated with small data sets (Panzeri and Treves, 1996; Golomb et al., 1997) and yields estimates of information comparable with those using other validated methods (Gershon et al., 1998; Wiener and Richmond, 1998).
Data sets. We decoded responses recorded from monkey primary visual cortex in two previously reported experiments (Kjaer et al., 1997; Wiener et al., 2001). In each experiment, responses were recorded using standard singleelectrode techniques from complex cells in primary visual cortex of awake rhesus monkeys. At the beginning of each trial, a fixation point appeared on a screen. One hundred milliseconds after the monkey fixated, a stimulus was flashed on the receptive field of the neuron for 300 msec and then replaced with the background. The monkey was not required to react to the stimulus. If the monkey fixated within ∼2° of the fixation point during the entire period from the appearance of the fixation point until the stimulus disappeared, it was rewarded with a drop of liquid when the stimulus disappeared. If the monkey shifted its gaze further than 2° from the fixation point, the trial was aborted, and the monkey received no reward. There was a delay of 300 msec between trials, during which the monkey was not required to fixate.
In one set of experiments (Wiener et al., 2001), 128 stimuli were shown: 32 oriented bars (Fig.4 A), 32 sinewave gratings (Fig. 4 B), 32 Walsh patterns (Fig.4 C), and 32 photographic images (Fig. 4 D). In another experiment (Kjaer et al., 1997), 16 Walsh patterns were used (Fig. 4 E).
Computation. The calculations presented in this paper were performed in the R statistical computing environment (Ihaka and Gentleman, 1996).
Results
Our original analyses were performed on data from the 29 neurons from the experiment with 16 stimuli (Kjaer et al., 1997). In a single experiment (recording from a single neuron), each stimulus was presented approximately the same number of times; the number of presentations per stimulus varied from neuron to neuron, ranging from 19 to 230 (median, 42). The mixtureofPoissons model made decoding sufficiently fast that we were able to decode data from the 17 neurons from the experiment with 128 stimuli (Wiener et al., 2001). In these experiments, the median number of presentations per stimulus ranged from 8 to 52 (median 14) in different neurons; each stimulus was presented approximately the same number of times. Below we will show that decoding results using the two methods on the data from the experiment with 16 stimuli were very similar. Except where stated otherwise, the results in this paper were obtained using the mixtureofPoissons model.
Mixtures of Poisson distributions for spike count
We examined the mixtureofPoissons model for 2636 spike count distributions in two different sets of V1 data (Kjaer et al., 1997;Wiener et al., 2001); 50.8% of the spike count distributions were adequately fit with a single Poisson distribution, 39.4% with a mixture of two Poisson distributions, 7.4% with three, 2.0% with four, and 0.4% with five. No distribution required a mixture of more than five distributions. Examples of mixture models with one, two, and three components are shown in Figure5.
The modeled distributions are, by design, smoother than the observed distributions. Various spike counts are more or less likely in the model than in the data. Spike counts that are very frequent in the data may be assigned less probability in the model, and spike counts not actually observed in the data may be assigned nonzero probability (as is the case for a count of four in Fig. 5 A,C). Fitting a mixture of Poisson distributions to the observed spike count is, among other things, a form of smoothing (meaning we believe that if we gathered enough data we would encounter a response with four spikes elicited by the stimulus corresponding to Fig. 5 C). As expected when smoothing, the means of the modeled distributions are very close to the observed means (median difference across all neurons, 0% of the measured mean; iqr −3 to +4%), but the variances are lower (median difference, −5% of the measured variance; iqr, −24 to + 9%).
In the Poisson decoding, we model the distribution of mean spike counts in any subinterval of the trial (from any particular time to the end of the trial) on the basis of the distribution of mean spike counts in the entire trial [p(λ_{s,i}) at the start of the trial] and the firing rate modulation over time (i.e., how many of the spikes are expected to have occurred by now; see Eq. 11). If we use the mixtureofPoissons model to predict, on five successively shrinking subintervals (beginning 50, 100, 150, 200, and 250 msec after stimulus onset), the variance of spike counts elicited by each stimulus and take a linear regression of log(predicted variance) versus log(observed variance), we find a negative intercept (median across 46 neurons, −0.13; iqr, −0.25 to −0.03) and a slope near 1 (median, 0.98; iqr, 0.95–1.02). The r ^{2}values of the regressions are high (median, 0.94; iqr, 0.91–0.96). Below we will examine the effect on decoding accuracy of using the estimated subinterval spike count distributions compared with using spike count distributions directly measured on the subintervals.
Decoding
Figure 6 shows an example of the development of stimulus probabilities over time (i.e., the probability that each stimulus elicited the observed spike train) when decoding using the mixtureofPoissons method. Figure 6 A shows an example from an experiment in which 16 Walsh patterns were shown (Kjaer et al., 1997), and Figure 6 B shows an example from an experiment in which 128 stimuli were shown (Wiener et al., 2001). The spike trains decoded are shown underneath each panel. The stimulus probabilities at each time depend only on the spike train observed up to that time; the decoding algorithm does not look into the future. Although the probabilities change more abruptly when spikes arrive, they change even when no spikes are fired. Thus the absence of spikes, as well as their presence, is informative (cf. Sherlock Holmes's dog that did not bark).
Figure 6, A and B, shows correctly decoded trials in which one stimulus is far more probable than all others. In these cases, the probability of the guessed stimulus is high. Table1 summarizes the distribution of the maximum estimated probability, that is, the probability of the guessed stimulus, at the end of our decoding window (300 msec after stimulus onset). It also shows the difference between the probabilities of the most probable and second most probable stimulus (the “margin of victory” over the “runnerup”). The winning probabilities are higher in the experiments with 16 stimuli than in the experiments with 128 stimuli, because the total probability (p = 1) is divided among fewer stimuli.
Intuitively, it should become easier to distinguish among stimuli as the probability of one stimulus rises and the probabilities of others fall. We evaluate the decoding in two ways: by measuring the amount of information obtained from the spike train and also by seeing how well the decoding lets us guess which stimulus elicited the observed response. As stated in Materials and Methods, in all our calculations we avoided overfitting using threeway crossvalidation: the available trials were divided into three subsets, with twothirds used to fit the model and the remaining third used to test decoding. Each third of the data was used twice to fit and once to test; the results we present are averaged over the three sets.
Information theory (Shannon and Weaver, 1949) shows that the amount of information in a set of spike trains depends on how they are interpreted, that is, on the nature of the neural code. Figure7, A and C, shows that for both sets of experiments, decoding using only the number of spikes in a response (dark boxes) yields substantially less information than considering both spike count and spike timing using the mixtureofPoissons model (light boxes).
Does this extra information provided by paying attention to timing actually help us decode more accurately? It is natural to guess that the stimulus with the highest probability elicited the observed spike train (as it did in both examples in Fig. 6). Figure 7, Band D, shows the distribution (across 29 and 17 neurons, respectively) of the percent of trials correctly decoded, that is, for which the highestprobability stimulus did elicit the response. For comparability between the two experiments, the figures are shown as multiples of the percent of trials that could be correctly decoded simply by guessing (1/16 = 6.25%, and 1/128 = 0.78%). In both experiments, more trials were correctly decoded using both spike count and timing via the mixtureofPoissons method than using spike count alone.
When timing is taken into account, the amount of information transmitted by the neuronal responses is larger in the experiment with 128 stimuli than in the experiment with 16 stimuli (Fig.7 A,C). This difference could be because one experiment uses more stimuli than the other, or it could be because the larger experiment used stimuli of several different kinds. To check whether the number of stimuli was the crucial factor, we randomly selected subsets of 64, 32, and 16 stimuli, allowing stimuli of all four kinds to enter each set. Figure8 A shows that this has very little effect on the amount of information transmitted by the responses. Figure 8 B compares the amount of information transmitted by responses to the randomly selected groups of 32 stimuli with the amount of information transmitted by the four subsets of 32 stimuli consisting of the bars, gratings, Walsh patterns, and photographic images, respectively (Fig. 4 A–D). Responses to the Walsh patterns and photographs used in these experiments transmit less information than do the responses to bars and gratings.
Wiener et al. (2001), in an analysis using spike count alone, have previously shown that the difference in information transmission in neuronal responses to stimuli of the four different kinds is primarily attributable to the fact that responses to Walsh patterns and photographs have a smaller dynamic range than responses to bars and gratings. This new result shows that the difference persists in an analysis that takes into account spike timing as well. The persistent difference in results obtained using different kinds of stimuli reinforces a point made by Wiener et al. (2001): conclusions based on experiments using a narrow set of stimuli may not generalize well to experiments using a larger set. The need for a large number of stimuli unfortunately conflicts with the need for many trials per stimulus in an experiment with a limited number of trials.
Relation between decoding success and transmitted information
Intuitively, we expect transmitted information and decoding success to be related: if more information is transmitted by the responses, more responses should be correctly decoded. Above (Fig. 7) we showed that in both sets of experiments (with 16 and 128 stimuli), transmitted information and percent of trials correctly decoded both increase as the trials progress. Figure 9shows that, for individual neurons from the two sets of experiments, more information transmitted generally corresponds to a greater portion of trials correctly decoded.
Effect of refractory period
Because we estimate the spike count distribution from data, we incorporate the effect of refractory period on spike count. We found that the effect of any additional refractory or rebound period on the number of trials correctly decoded is very small: 0.2% fewer trials are correctly decoded when the refractory period is included (median across neurons; iqr, −2.7 to 2.2%). The effect on individual stimulus probabilities was also small: in the experiments using 16 stimuli, the median correlation between the estimated stimulus probabilities obtained when taking the additional refractory effects into account and those obtained when not taking the additional refractory effects into account was 0.997 (median across all trials in 29 neurons), and the fifth percentile of correlations was 0.96. The median correlation in the experiments using 128 stimuli was 0.999 (median across all trials in 17 neurons), and the fifth percentile was 0.99. Except where otherwise noted, the results throughout this study were calculated using the refractory period.
Decoding signal versus decoding noise
When there are no differences in timing among responses elicited by different stimuli, we would expect the decoder to reduce to a spike count decoder. To check this, we simulated two sets of responses with identical spike density functions (normalized PSTH) but different distributions of spike count (Poisson distributions with means of 4 and 10). The only nonrandom differences in timing arise from the differences in spike count. A trial with seven spikes should, at the end of the trial, be assigned with probability 0.4 to the Poisson distribution with mean 4 and with probability 0.6 to the Poisson distribution with mean 10. In fact, depending on the data used to create the Poisson models, different spike trains of length 7 will be assigned different probabilities, but we expect the assigned probabilities to converge to p = 0.4 and 0.6. Figure10 A shows that when spike timing is known to be random, decoding on the basis of timing (using the mixtureofPoissons model; white boxes) converges to the correct answer more slowly than the optimal decoding using spike count alone (black boxes).
A similar effect can be observed in surrogate data generated to match the spike count distribution seen in one of the neurons from the experiment with 16 stimuli but with spikes equally likely at any time (i.e., with a flat PSTH) for all stimuli. Here again, spike timing is purely random and carries no information about which stimulus generated the response. Figure 10 B shows that more trials are correctly decoded using the spike count alone than using the full mixtureofPoissons model. Again, attempting to decode using random timing degrades performance.
In the cases just presented, the spike density function is (by construction) of no help in decoding but still must be estimated. When a small amount of data is available, random spike time variations cause differences in the estimates of the spike density functions for different stimuli, and these spurious differences interfere with decoding. The random differences in the spike density functions become smaller when more data are available to estimate them (although true differences in spike density functions can be detectable even with modest amounts of data). The need for data to eliminate spurious correlations would only increase if more complicated timing features were to be taken into account. Thus it is important to avoid trying to decode on the basis of complicated timing features that can be accounted for as stochastic consequences of the spike count and simpler timing features.
Comparing different decoding variants
Trains from the neurons from the experiment with 16 stimuli were decoded using both the order statistic method (with the spike count distributions represented using mixtures of Poisson distributions) and the simplified mixtureofPoissons method. The correlation between the two sets of probabilities at the end of the trial (that is, 300 msec after stimulus onset) was 0.997 (median across neurons; iqr, 0.992–0.999). This verifies that the two methods give nearly identical results (up to certain issues related to time discretization near the end of a trial). As a result, the same stimulus was guessed by the two methods in 95% of the cases (median across neurons; iqr, 94–98%). It may seem odd that different stimuli could be guessed when the correlation between probabilities is so high. Table 1 shows that the difference between the probability of the winning stimulus and the next most probable stimulus can be quite small; thus small discrepancies can change the guessed stimulus in a small number of trials. As would be expected, the differences in the largest and secondlargest probabilities in trials when the two methods made the same guess were ∼5 times as large as the differences in trials when the two methods made different guesses. We did not decode the experiments with 128 stimuli using the order statistic method because of the computational burden.
As previously presented, when using the mixtureofPoissons model, we have modeled the distribution of mean spike counts in any subinterval of the trial (from any particular time to the end of the trial) on the basis of the distribution of mean spike counts in the entire trial [λ_{s,i} and p(λ) at the start of the trial] and the firing rate modulation over time (Eq. 11). It is possible, at substantial expense in both computation time and additional model complexity, to avoid this approximation by measuring the distribution of spike counts in each subinterval and finding a new mixture of Poisson distributions to model each distribution. Despite the additional model complexity and computation, decoding using measured subinterval distributions gives results similar, but not identical, to those obtained when using estimated subinterval distributions: the correlation between the two sets of probabilities at the end of trials was 0.88 (iqr, 0.83–0.94), and the same stimulus was guessed in 75% of the trials (median across neurons; iqr, 65–82%). Approximately the same number of trials were correctly decoded using the two methods (Fig.11 A), although using the measured distributions yielded slightly more transmitted information (Fig. 11 B).
Decoding using the order statistic method or the mixtureofPoissons method with directly measured spike count distributions on subintervals takes substantially longer (by approximately an order of magnitude) than using the mixtureofPoissons model when estimating the count distributions on subintervals. The results of this section show that the extra time produces little change in decoding.
How good is the mixtureofPoissons model?
Timerescaling goodnessoffit test
Barbieri et al. (2001) and Brown et al. (2002) presented a method for checking the goodness of fit of spiketrain models. Their method is based on the timerescaling theorem, which shows that any point process can be transformed, by a suitable rescaling of time, into a homogeneous Poisson process. To apply their method to our data, we randomly chose twothirds of the trains to estimate the model, withholding onethird of the trains to use as a test set. Each model was tested using the Kolmogorov–Smirnov test presented by Brown et al. (2002), withp = 0.95. In the neurons from the experiment with 128 stimuli, the test spike trains were consistent with the model for 109 stimuli (median across neurons; iqr, 104–112). In the neurons from the experiment with 16 stimuli, the test spike trains were consistent with the model for 11 stimuli (median across neurons; iqr, 10–13). When the refractory–rebound effect was ignored, the results were very similar: the test spike trains were still consistent with the model for 108 of 128 stimuli (median across neurons; iqr, 104–111) and in 11 of 16 stimuli (median across neurons; iqr, 9–13). Overall, 82% of stimuli were consistent with the model when the refractory period was included in the model, and 78% were still consistent when the refractory period was ignored. Thus the mixture of Poisson distributions with different mean rates, and therefore different temporal structures, seems to account for much of the observed temporal structure of the spike trains. This may explain why our decoding results with or without the refractory period are so similar.
Number of trials correctly decoded
Another way to examine how well our model describes the data is to compare the performance of the decoder on real data with its performance on artificially generated surrogate data that match the real data in many ways and are known to have the structure specified by the model. If our model describes the data well, we would expect to decode real spike trains nearly as well as we decode artificially generated spike trains. If our model describes the data poorly, we would expect to decode artificial spike trains much more accurately than real spike trains.
We generated cellmatched surrogate data as described in Materials and Methods. Figure 12 Acompares the percent of trials correctly decoded in real and surrogate data, both for the spike count code and using the mixtureofPoissons model. When decoding using the spike count only, we correctly decode 15% (median across neurons; iqr, 7–20%) more trials from surrogate data than from real data in the experiments with 16 stimuli (circles) and 18% (median; iqr, 10–23%) more in the experiments with 128 stimuli (triangles). The decoder does nearly as well when decoding data from the experiments with 16 stimuli using the mixtureofPoissons method (plus symbols): we correctly decode 17% (median; iqr, 7–23%) more trials from surrogate data than from real data. However, when the mixtureofPoissons method is used with data from the experiments with 128 stimuli (times symbols), many more trials are correctly decoded from surrogate data than from real data (median, 100% more; iqr, 75–150%). Figure 12 B shows that the performance of the mixtureofPoissons decoder is related to the amount of data available: the more data available to estimate the model parameters in the first place, the more closely the number of trials correctly decoded in real data approaches the number correctly decoded in surrogate data.
To check whether the number of stimuli involved (and therefore the complexity of the problem) contributes to the difference between the two sets of results, we created 16stimulus neurons from 128stimulus neurons by taking the responses from 16 randomly chosen stimuli and discarding the rest. Twentyfive subneurons were created from each of the two neurons with the most trials per stimulus (50 and 52). In these artificial neurons, 16% more trials are correctly decoded from surrogate data than from the real data (median; iqr, 11–20%), not significantly different from that seen for the neurons from the experiment that actually had only 16 stimuli (Kruskal–Wallis test,p = 0.7). There is still no significant difference if we try to minimize the influence of the number of trials per stimulus by comparing the results for the subsampled data only with results from the seven neurons from the experiment with 16 stimuli with between 40 and 60 trials per stimulus (Kruskal–Wallis test, p = 0.2). Thus correctly choosing among 128 stimuli requires more data to parameterize the model than correctly choosing among 16 stimuli.
Accuracy of individual stimulus probabilities: model error and estimation error
The most stringent possible test of the performance of the decoder assesses the accuracy of all the estimated stimulus probabilities. For example, if we look at all the stimuli assigned probabilities nearp = 0.1 in various trials, we would hope that those stimuli were actually the stimuli that elicited those trials 10% of the time. This is essentially the same test we might use to evaluate the quality of a weatherforecasting service: it should rain on half the days with a 50% chance of rain, onefourth of the days with a 25% chance of rain, and so on. To test this, we bin the estimated probabilities (we have one estimated probability for each stimulus for each trial) and then ask how often the stimuli represented in each bin really did elicit the corresponding trial.
Departure from accurate prediction can come from two sources. The first possible source of error is model misspecification: the mixtureofPoissons model may not capture the true structure of the spike trains. The second possible source of error is difficulty fitting the model on the basis of the amount of data available. We will call these model error and estimation error. To separate these errors, we must examine how accurately the model estimates stimulus probabilities in three different kinds of data: the real data; artificial data, generated according to the model, with the same number of trials per stimulus as the real data; and artificial data, generated according to the model, with many more trials per stimulus than available in the real data. Model error can be examined by comparing results from the real data and the smaller set of artificial data: because the two sets have equal numbers of trials per stimulus, any difference in our ability to decode must derive from the fact that one set of data is known to be consistent with the model, whereas the other is not. Estimation error can be examined by comparing results from the larger and smaller artificial data sets: because both artificial data sets are known to be consistent with the model, any difference in our ability to decode must derive from difference in the amount of data available.
Figure 13 A shows that, for the data from the experiment with 16 stimuli, the predicted stimulus probabilities are close to the true stimulus probabilities. Small predicted stimulus probabilities tend to be slightly too large, and large predicted stimulus probabilities tend to be slightly too small. This is true whether we decode using the spike count only (filled squares) or using the full mixtureofPoissons model (filled circles), although the difference is smaller for spike count only. The same effect is seen in artificial data with the same number of trials per stimulus as seen in the real data, although it is substantially less pronounced (open squares, open circles). This indicates that some of our error is attributable to the model not quite fitting the data. However, in artificial data with many more trials per stimulus than in our actual data (500 trials per stimulus), the predicted probabilities are almost exactly equal to the observed probabilities both when the spike trains are decoded using the spike count only (plus symbols) and when they are decoded using the full model (times symbols). The excellent agreement between estimated and actual stimulus probabilities obtained on large artificial data sets serves as confirmation that the decoder works as it is supposed to and also indicates that part of the error seen with real data sets is attributable to difficulty correctly estimating the model parameters (the spike density functions and the mixtures of Poissons for spike count distributions) using the amount of data available.
Figure 13 B is similar to Figure 13 A but for data from the two neurons with the most (50) trials per stimulus in the experiments with 128 stimuli. Here the predicted probabilities are almost always lower than the true probabilities, whether decoding using the spike count only or using the mixtureofPoissons model. Because of the amount of computation involved, we do not include results for extremely large artificial data sets in Figure 13 B, but the results in Figure 13 A imply that the decoder works properly, so for a sufficiently large artificial data set, the probabilities would lie along the identity line. Figure 13 A shows that most of our misestimation of probabilities can be ascribed to limited sample size rather than to misspecification of the model (i.e., the distance between the filled and open symbols is smaller than the distance between the open symbols and the identity line), particularly for the full mixtureofPoissons model. Excluding the three smallest probability bins (in which the error is extremely small), estimation error accounts for approximately threefifths of the total estimation error when decoding using spike count only and for almost fourfifths of the error when decoding using the full mixtureofPoissons model. The fact that the decoder estimates probabilities less well in the data with 128 stimuli than in the data with 16 stimuli again shows that more data are required to accurately solve the more difficult problem.
Discussion
In this paper we present the order statistic model of spike trains, which is based on the observation that in several brain areas, singleneuron spike trains are almost entirely described as stochastic samples from the firing rate profile (Oram et al., 1999, 2001; Baker and Lemon, 2000). In accordance with these observations, in the order statistic model, individual spike times are informative because they reflect underlying rate variation. This model requires estimating only the spike count distribution, the spike density function, and the interspike interval distribution from data. All of these can be reasonably estimated using amounts of data gathered in virtually every experiment. Our results show that for decoding problems with many conditions, decoding performance is quite good with amounts of data frequently collected, but very large amounts of data may be needed for optimal decoding.
The order statistic model is general and can be used with any estimate of the spike count distribution, modeled or measured. The mixtureofPoissons model gives up some of this generality, modeling each spike count distribution as a mixture of Poisson distributions and the spike trains as instances of a mixture of Poisson processes. In exchange, the Poisson model speeds up the calculation of order statistics; we calculate distributions of next spike times for a few (up to five) Poisson means instead of for many (up to 50) spike counts (Eqs. 8, 10 vs 6, 7). For our data, decoding using the Poisson model was approximately an order of magnitude faster, and the extra generality of the order statistic model was not needed: >98% of the spike count distributions were fit with mixtures of three or fewer Poisson distributions, and all were fit by mixtures of five or fewer. A mixture of Poisson processes may or may not fit a particular data set. Other structures identified in particular data sets might help simplify calculating order statistics in a similar way, although the details of the calculation would be different.
Any decoding method must decide which aspects of a response are important, that is, carry information that can be used to distinguish among stimuli, and which can be ignored. Ignoring spike timing that does carry information reduces decoding accuracy (Fig. 7), but so does paying attention to timing that does not carry information (Fig. 10). When spike timing is taken into account, there are so many distinct responses that a great deal of data is needed to determine whether apparent correlations between stimulus and response are reliable or merely (as in the examples of Fig. 10) sampling artifacts. Thus decoding using timing is difficult precisely because timing presents great opportunities for encoding. Here we avoid the difficulties of directly counting how often each response is elicited by using a model that captures the structure of observed responses, including highorder correlations among spike times (Oram et al., 1999, 2001). This model leads to simple decoding methods that are much more effective than decoding based on spike count alone.
The decoding scheme described here assumes that the response being decoded was generated by one of the stimuli, and in that case, we checked in a small amount of data from area TE of monkey temporal lobe that the decoder works even when spikes occur between trials. We can also treat the nostimulus, or intertrial, condition as just another experimental condition. Even in this case, the decoding algorithm assumes that the time of stimulus onset is known. Several groups have that suggested that eye movement or other causes of largescale change in the visual field cause neural responses that can be used as synchronizing or reset signals (Sobotka and Ringo, 1997; Sobotka et al., 1997; Huang and Paradiso, 2000; Greschner et al., 2002).
The question of response alignment is related to the question of response latency. Once responses are aligned on stimulus onset, systematic differences in response latency are reflected in the firing rate profile over time and possibly also in the spike count distribution. Figures 1 and 3 show examples, based on real data, of constructing order statistics and decoding real data with a latency difference. Rhythmic firing phaselocked to stimulus onset would also be reflected in the PSTH. Rhythmic firing not phaselocked to stimulus onset would not show up in the PSTH but might be accounted for by the refractory–rebound term. Any additional underlying statistical structure different from that assumed by the model might require additional modeling work to accommodate.
Other models and methods
The models presented here can be estimated using modest amounts of data, because they assume that spike trains have a certain stochastic structure. This contrasts with the “direct method,” which makes no assumptions about the form of the neural code but requires very large data sets, because it directly counts how often each stimulus elicits each response. The direct method has been used to calculate information transmission rates in the fly (Strong et al., 1998), in anesthetized monkeys (Reinagel and Reid, 2000), and in isolated retina (Nirenberg et al., 2001). However, in awake monkeys the amount of data gathered has not been sufficient to estimate information transmitted by spike timing; only calculations using the spike count in single bins have been possible (Reich et al., 2001). Thus, at least in awake monkeys, the direct method cannot currently substitute for finding some principle allowing us to compactly describe spike trains, that is, a model.
Others researchers have modeled spike trains using variants of Poisson processes. Lansky and Vaillant (2000) and Lansky et al. (2001) model the responses of hippocampal place cells using mixtures of two Poisson processes. Kass and Ventura (2001) model firing probability as a product of time from stimulus onset and a refractory term depending on the time of the last spike (i.e., an inhomogeneous Poisson process with a refractory period); they estimate the two components together using a generalized linear model, whereas we estimated them separately.Barbieri et al. (2001) model spike trains by characterizing interspike intervals using gamma and inverse Gaussian distributions (generalizing the exponential distribution of intervals arising from a Poisson process). They use time rescaling to account for inhomogeneities in firing rate, inducing a Markov dependence similar to that seen in the order statistic and mixtureofPoissons models presented here. One possible use of statistical models such as the mixtureofPoissons model and the others described above is to provide a benchmark against which to evaluate spike trains produced by biophysical models that address the cellular mechanisms of neuronal information processing.
Another way to take timing into account when decoding is to sequentially decode the spike count in different time bins. In the limit of bins that can contain at most one spike, the decoder simply asks, at each time, “Did a spike occur now? . . . now? . . . now?” This is equivalent to decoding using order statistics or the mixtureofPoissons model only if the responses in all bins are independent. The Markov dependence built into the order statistic and mixtureofPoissons models seems to incorporate most of the correlations observed in data (Oram et al., 1999).
We decode to determine which of a categorical set of stimuli was presented, without regard to stimulus characteristics such as bar or grating orientation. Other groups have developed approaches to reconstruct continuously varying signals (Gabbiani and Koch, 1996;Rieke et al., 1997; Brown et al., 1998; Schwartz and Moran, 2000;Wessberg et al., 2000; Manwani et al., 2001; Serruya et al., 2002). An appropriate discretization of the continuous signal might allow our methods to be applied to these problems. This might require finding a time scale on which the PSTH gave information about the signal (that is, how long a section of signal should be treated as a single stimulus in our formulation), as well as imposing probabilistic continuity conditions (as by Brown et al., 1998).
Conclusion
The results of Oram et al. (1999) suggest that, at least in V1 and the lateral geniculate nucleus, individual spike times are random and carry information only by reflecting underlying rate variation. Other features of precise spike timing can be predicted from spike count and the PSTH and therefore cannot carry information unavailable from these coarser measures. In this context, order statistics (Arnold et al., 1992), whether calculated directly or in special cases such as a mixture of Poisson processes, formalizes the intimate connection between spike timing and spike count (Wiener and Richmond, 1999; Oram et al., 1999). Because spike count distributions are often nonPoisson (Baddeley et al., 1997; Gershon et al., 1998), single inhomogeneous Poisson models cannot be expected to match features of precise timing. Therefore, the mixtureofPoissons model is a more appropriate null hypothesis when searching in neuronal responses for timing relations that are unexpected, and which therefore may carry unique information.
Footnotes

This work was supported by the National Institute of Mental Health (NIMH) and IRP. We thank Dr. John Hertz (Nordita), Dr. Peter Latham (University of California, Los Angeles, CA), Dr. Zheng Liu (NIMH), Dr. Jeff Sachs (Merck Research Laboratories), and Dr. Yasuko SugaseMiyamoto (NIMH) for comments on this manuscript.

Correspondence should be addressed to Barry J. Richmond, Laboratory of Neuropsychology, National Institute of Mental Health, National Institutes of Health, Building 49, Room 1B80, Bethesda, MD 208924415. Email: bjr{at}ln.nimh.nih.gov.

M. C. Wiener's present address: RY84202, Applied Computer Science and Mathematics Department, Merck Research Laboratories, Rahway, NJ 07065. Email: matthew_wiener{at}merck.com.