Abstract
“How is information decoded in the brain?” is one of the most difficult and important questions in neuroscience. We have developed a general framework for investigating to what extent the decoding process in the brain can be simplified. First, we hierarchically constructed simplified probabilistic models of neural responses that ignore correlations of order higher than K using the maximum entropy principle. We then computed how much information is lost when information is decoded using these simplified probabilistic models (i.e., “mismatched decoders”). To evaluate the information obtained by mismatched decoders, we introduced an information theoretic quantity, I*, which was derived by extending the mutual information in terms of the communication rate across a channel. We showed that I* is consistent with the minimum mean-square error as well as with the mutual information, and demonstrated that a previously proposed measure quantifying the importance of correlations in decoding deviates substantially from I* when many cells are analyzed. We then applied the proposed framework to spike data from the vertebrate retina, using short natural scene movies of 100 ms duration as the set of stimuli and computing the information contained in the neural activities. Although significant correlations were observed in the population activities of ganglion cells, the information loss was negligibly small even if all orders of correlation were ignored in decoding. We also found that, if stationarity is inappropriately assumed over long durations in the information analysis of dynamically changing stimuli, such as natural scene movies, correlations appear to carry a large proportion of the total information regardless of their actual importance.
Introduction
An ultimate goal of neuroscience is to elucidate how information is encoded and decoded by neural activities (Averbeck et al., 2006). One way to investigate how much information about certain stimuli is encoded in a given brain area is to calculate the mutual information between the stimuli and the neural responses. Because the mutual information quantifies the maximal amount of information that can be extracted from neural responses, it implicitly assumes that the encoded information is decoded by an optimal decoder. In other words, the brain is assumed to have full knowledge of the encoding process, in which stimuli are transformed into noisy neural activities. Considering the probable complexity of optimal decoding, however, the assumption of an optimal decoder in the brain is doubtful; rather, it is more plausible that information is decoded in a suboptimal manner by a simplified decoder that has only partial knowledge of the encoding process. We call this type of decoder a “mismatched decoder.”
An example of a mismatched decoder is an independent decoder, which ignores correlations in neural activities. Independent decoders are potentially important because they are simpler, and the brain might use them rather than attempt to learn the correlation structure of the responses. An experimental finding that a sufficiently large proportion of the total information is obtained by an independent decoder would suggest that the brain may function in a manner similar to an independent decoder. In this context, Nirenberg et al. (2001) computed the information obtained by an independent decoder from the activities of pairs of retinal ganglion cells and found that no pair of cells showed a loss of information >11%. However, their analysis considered pairs of cells only, and the importance or otherwise of correlations in population activities has not been fully elucidated.
Here, we developed a general framework for investigating the importance of correlations in population activities. Because analysis of population activities generally requires consideration of not only second-order but also higher-order correlations, we hierarchically constructed simplified decoders that ignore correlations of order higher than K using the maximum entropy method (Schneidman et al., 2006). We inferred how many orders of correlation should be taken into account to extract sufficient information by evaluating the information obtained by these simplified decoders. To accurately quantify the information obtained by mismatched decoders, we introduce an information theoretic quantity, I*, derived in the study by Merhav et al. (1994). I* was first introduced to neuroscience in the study by Latham and Nirenberg (2005) to show that the information for mismatched decoders previously proposed in the study by Nirenberg and Latham (2003) is a lower bound of I*.
Here, we showed that this lower bound can be loose when many cells are analyzed. We also justified the use of I* from the viewpoint of the minimum mean-square error. Finally, we quantitatively evaluated the importance of correlations in decoding neural activities by applying our theoretical framework to the vertebrate retina.
Part of this paper was published in the study by Oizumi et al. (2009).
Materials and Methods
Retinal recording.
Details of retinal recording have been described previously (Meister et al., 1994). The dark-adapted retina of a larval tiger salamander was isolated in oxygenated Ringer's medium at 25°C. A piece of retina (2–4 mm) was mounted on a flat array of 61 microelectrodes (MEDP2H07A; Panasonic) and perfused with oxygenated Ringer's solution (2 ml/min; 25°C). Six thousand frames of a movie of natural scenes (van Hateren, 1997) were projected at 30 Hz using a cathode ray tube monitor (60 Hz refresh rate; Dell E551). The mean intensity of light was 4 mW/m^{2}. Voltages from the electrodes were amplified, digitized, and then stored. Well isolated action potentials were sorted offline with custom-built software. All procedures concerning animals met the RIKEN guidelines.
Information for mismatched decoders.
It is well known that neural responses, even to a single repeated stimulus, are noisy and stochastic. Let us represent this stochastic process with the conditional probability distribution p(r|s), namely, the probability that neural responses r are evoked by stimulus s. We can say that the stimulus s is encoded by the neural response r, which obeys the distribution p(r|s). We call p(r|s) the “encoding model.” For the brain to function properly, it must somehow accurately infer which stimulus was presented from the observation of noisy neural responses. We call this inference process the decoding process. To date, it is not known how stimulus information is decoded from noisy neural responses in the brain. Thus, when we investigate neural coding problems, we usually consider the limit of decoding accuracy, assuming that stimulus information is decoded in an optimal way. Optimal decoding is performed by choosing the stimulus that maximizes the Bayes posterior probability,

p(s|r) = p(r|s)p(s)/p(r), (Eq. 1)

where p(r) = Σ_{s} p(r|s)p(s) and p(s) is the prior probability of stimuli. The mutual information introduced by Shannon (1948) is one quantity that provides the upper bound of decoding accuracy. The mutual information between stimulus s and neural responses r is given by the following equation:

I = Σ_{s} Σ_{r} p(s)p(r|s) log_{2} [p(r|s)/p(r)]. (Eq. 2)

If we experimentally obtain the conditional probability distribution p(r|s), we can easily quantify with the mutual information how accurately the stimulus can be decoded from the noisy neural responses.
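To make Equation 2 concrete, the sketch below computes the mutual information directly from a tabulated conditional distribution. It is a minimal illustration; the array layout and function name are our own choices, not part of the original analysis code.

```python
import numpy as np

def mutual_information(p_r_given_s, p_s):
    """Mutual information I(S;R) in bits (Eq. 2).

    p_r_given_s : (n_stimuli, n_responses) array; each row is p(r|s).
    p_s         : (n_stimuli,) prior probability of the stimuli.
    """
    p_r = p_s @ p_r_given_s                 # p(r) = sum_s p(s) p(r|s)
    info = 0.0
    for s in range(len(p_s)):
        for r in range(p_r_given_s.shape[1]):
            joint = p_s[s] * p_r_given_s[s, r]
            if joint > 0:                   # skip zero-probability terms
                info += joint * np.log2(p_r_given_s[s, r] / p_r[r])
    return info

# a noiseless two-stimulus channel carries exactly 1 bit
print(mutual_information(np.eye(2), np.array([0.5, 0.5])))   # prints 1.0
```

Any overlap between the response distributions of different stimuli reduces this value below log2 of the number of stimuli.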
The mutual information is a useful indicator that quantifies how strongly the neural responses are related to the target stimuli. However, it is not evident whether this quantity is biologically relevant, because it implicitly assumes that information about stimuli is optimally decoded in the brain. Given the complexity of optimal decoding and the difficulty for the brain of knowing the actual encoding process p(r|s), it is more plausible that information about stimuli is decoded in a suboptimal manner. Let us assume that the brain has only partial knowledge of the encoding process p(r|s), and denote the probability distribution that partially matches p(r|s) by q(r|s). For instance, if we assume that the brain does not know the complicated correlation structure of the neural responses but knows only the response properties of each individual neuron, q(r|s) is expressed by the product of the marginal distributions of p(r|s): q(r|s) = Π_{i} p(r_{i}|s).
Here, the important question is how accurately the stimulus can be inferred from neural responses with only partial knowledge of p(r|s). In this case, we assume that the inference is performed by choosing the stimulus that maximizes the following posterior probability distribution:

q(s|r) = q(r|s)p(s)/q(r), (Eq. 3)

where q(r) = Σ_{s} q(r|s)p(s). This posterior probability distribution is not equal to the actual distribution (Eq. 1) because q(r|s) is used instead of the actual encoding model p(r|s). We call q(r|s) the “decoding model.” When the decoding model q(r|s) is mismatched with the actual encoding model p(r|s), the accuracy of decoding is naturally degraded.
To quantify how much stimulus information would be lost because of the mismatch in the decoding model, we need an information theoretic quantity that corresponds to the mutual information when a mismatched decoding model is used. Nirenberg and Latham (2003) previously proposed that the information obtained by mismatched decoders can be evaluated using the following:

I^{NL} = Σ_{s} Σ_{r} p(s)p(r|s) log_{2} [q(r|s)/q(r)]. (Eq. 4)

We call their proposed information “Nirenberg–Latham information.” By comparing Equations 2 and 4, we can see that I^{NL} is equal to I when the decoding model q(r|s) is equal to the encoding model p(r|s). To derive I^{NL}, they adopted the yes/no-question formulation of the mutual information given by Cover and Thomas (1991); by extending the mutual information within that framework, they derived I^{NL}. Using a different approach, Pola et al. (2003) derived I^{NL} by decomposing the mutual information. Amari and Nakahara (2006) justified the use of I^{NL} for quantifying the information obtained by mismatched decoding from the viewpoint of information geometry. Because I^{NL} is easy to understand and appears sound, it has been used in neuroscience (Nirenberg et al., 2001; Golledge et al., 2003; Montani et al., 2007). However, as shown in the Appendix, I^{NL} may be an inappropriate measure, particularly when computed in large neural populations.
In the present study, we reintroduce an information theoretic quantity, I*, which was originally derived by Merhav et al. (1994) by extending the mutual information in the context of the best achievable communication rate when a mismatched decoding model is used (see the next section for the information theoretic meaning of I*). We call this quantity “information for mismatched decoders.” In the present study, we use I* to quantify the decoding accuracy when stimulus information is decoded using mismatched probabilistic models of neural responses. I* can be computed by the following equations [for details of the mathematical derivation of I*, see Merhav et al. (1994) and Latham and Nirenberg (2005)]:

I* = max_{β} Ĩ(β), (Eq. 5)

Ĩ(β) = Σ_{s} Σ_{r} p(s)p(r|s) log_{2} [q(r|s)^{β} / Σ_{s′} p(s′)q(r|s′)^{β}]. (Eq. 6)

To compute I*, we need to maximize Ĩ(β) with respect to β; thus, the equations for I* have no closed-form solution. However, we can easily find the maximum of Ĩ(β) numerically by the standard gradient ascent method because this is a convex optimization problem (Latham and Nirenberg, 2005).
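As a sketch of how I* can be computed in practice, the following evaluates Ĩ(β) for discrete stimuli and responses and maximizes it over β. Because Ĩ(β) is concave in β, a simple grid search (used here for clarity in place of gradient ascent) finds the maximum; the function names and the β grid are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def I_tilde(beta, p_r_given_s, q_r_given_s, p_s):
    """Ĩ(β) in bits (Eq. 6) for discrete stimuli and responses."""
    qb = q_r_given_s ** beta
    denom = p_s @ qb                      # sum_s' p(s') q(r|s')^beta
    val = 0.0
    for s in range(len(p_s)):
        for r in range(qb.shape[1]):
            joint = p_s[s] * p_r_given_s[s, r]
            if joint > 0:
                val += joint * np.log2(qb[s, r] / denom[r])
    return val

def I_star(p_r_given_s, q_r_given_s, p_s):
    """I* = max_beta Ĩ(β) (Eq. 5); Ĩ is concave in β, so a coarse
    one-dimensional grid search suffices for this illustration."""
    betas = np.linspace(0.05, 10.0, 400)
    return max(I_tilde(b, p_r_given_s, q_r_given_s, p_s) for b in betas)
```

Setting β = 1 in `I_tilde` gives the Nirenberg–Latham information I^NL, so the gap between `I_tilde(1.0, …)` and `I_star(…)` directly shows how loose that lower bound is; when q(r|s) = p(r|s), the maximum is attained at β = 1 and I* reduces to the mutual information I.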
By comparing Equations 4 and 6, we can see that I ^{NL} is equal to Ĩ(β) when β = 1 (Latham and Nirenberg, 2005). Because I* is the maximum value of Ĩ(β) with respect to β, I* is always larger than or equal to I ^{NL}. Thus, I ^{NL} is a lower bound of I*. I* was first introduced into neuroscience in Latham and Nirenberg (2005) to show that their proposed information, I ^{NL}, provides a lower bound on I*. To our knowledge, however, no application of I* in neuroscience has appeared.
As shown in the Appendix, this lower bound provided by I^{NL} can be loose, and can even be negative when many cells are analyzed. It is also shown that I* is consistent with the minimum mean-square error, whereas I^{NL} is not. Taking account of these facts, we consider that I* should be used instead of I^{NL}.
Information theoretic meaning of information I and I*.
In the previous section, we introduced mutual information as a measure that quantifies how accurately a stimulus is inferred from the observation of noisy neural responses. In information theory, the mutual information has a rigorous quantitative meaning [i.e., it gives the upper bound of the amount of information that can be reliably transmitted over a noisy channel (see below)]. In this section, we first review the meaning of the mutual information I within the framework of information theory using the language of neuroscience. We then explain the meaning of I* as an extension of the mutual information.
Let us consider information transmission using a set of stimuli s and neural responses r (Fig. 1). We will consider a random-dot stimulus moving upward, ↑, or downward, ↓, as an example of stimulus s. The sequence of stimuli s_{1}, s_{2}, … , s_{M} (Fig. 1, ↑↑ … ↓) is sent over a noisy channel, which in this case is a neural population, and the sequence of noisy neural responses r_{1}, r_{2}, … , r_{M} to each stimulus is then received (Fig. 1). We assume that the channel is memoryless; that is, the neural responses r_{1}, r_{2}, … , r_{M} are mutually independent given the stimuli. The sequence of stimuli is called a code word. We consider the limit in which the length of the code word M tends to infinity, M → ∞.
Here, we introduce an important concept, the codebook. A codebook is the assembly of transmitted code words. The sender and the receiver share the codebook. The job of the receiver is to determine which code word was sent from observed neural responses r _{1}, r _{2}, … , r _{M} by consulting the codebook. In this setting, let us consider the following question: How many code words can be sent errorfree when the transmitted code words are “optimally” decoded? In other words, how many code words can the codebook contain?
Optimal decoding is performed by choosing, from the codebook, the code word that maximizes the Bayes posterior probability given the observed sequence of neural responses r_{1}, r_{2}, … , r_{M}. The decoding procedure is described by the following equations:

ĉ = argmax_{c} p(c | r_{1}, r_{2}, … , r_{M}) (Eq. 7)
 = argmax_{c} p(r_{1}, r_{2}, … , r_{M} | c)p(c) (Eq. 8)
 = argmax_{c} p(c) Π_{i=1}^{M} p(r_{i} | s_{i}(c)), (Eq. 9)

where s_{i}(c) means the ith stimulus of the sequence of stimuli corresponding to code word c, and the memoryless channel assumption is used in Equation 9. A uniform prior distribution on c is usually assumed, in which case Equation 7 becomes maximum-likelihood estimation.
If stimuli ↑ and ↓ evoke nonconfusable neural responses, 2^{M} code words can be sent error-free. However, when there is overlap between the neural responses to stimuli ↑ and ↓, the question “How many code words can be sent error-free?” is not easily answered. In this case, we cannot let our codebook contain all 2^{M} possible code words but must instead transmit only a sparse subset of them, so as to avoid confusable neural responses to different code words. Shannon's mutual information gives the answer to this nontrivial question (Shannon, 1948). If we denote the upper bound of the number of code words that can be sent error-free by 2^{K}, K is given by the following:

K = MI, (Eq. 10)

where I is the mutual information given by Equation 2. This relationship can be mathematically proved by taking advantage of the law of large numbers (Shannon, 1948; Cover and Thomas, 1991). The ratio K/M is called the communication rate or information rate. Thus, within the framework of information theory, the mutual information defined by Equation 2 has the meaning of the upper bound of the communication rate (i.e., of the number of code words that can be sent error-free).
When we have full knowledge of the channel property, p(r|s), the mutual information gives the upper bound of the number of code words that can be sent error-free. The next question is how many code words can be sent error-free when we only partially know the channel property. In other words, we assume that a mismatched probability distribution q(r|s), which partially matches the actual channel property p(r|s), is used for decoding. Similarly to Equations 7–9, decoding is done by the following equations:

ĉ = argmax_{c} q(c | r_{1}, r_{2}, … , r_{M}) (Eq. 11)
 = argmax_{c} q(r_{1}, r_{2}, … , r_{M} | c)p(c) (Eq. 12)
 = argmax_{c} p(c) Π_{i=1}^{M} q(r_{i} | s_{i}(c)). (Eq. 13)

Note that q(r|s) is used instead of p(r|s). Merhav et al. (1994) provided the answer to this question: if we denote the upper bound of the number of code words that can be sent error-free by 2^{K*} when the mismatched decoding model q(r|s) is used, K* (<K) is given by the following:

K* = MI*, (Eq. 14)

where I* is the information for mismatched decoders given by Equations 5 and 6. This relationship can also be mathematically proved by making use of large deviation theory (Merhav et al., 1994; Latham and Nirenberg, 2005). Thus, I* gives the upper bound of the number of code words that can be sent error-free with a mismatched decoder. In this sense, I* is a natural extension of the mutual information I.
Stationarity assumption about neural responses.
We used a movie of natural scenes, which was 200 s long and repeated 45 times, as a stimulus. We divided the movie into many short segments, as shown in Figure 2, and considered them as the stimuli over which the information contained in neural activities was computed. We assumed that neural responses were stationary while each short natural scene movie was presented. Thus, the length of each stimulus should be short enough for us to assume stationarity of the neural responses. To determine the appropriate length of the stimuli, we computed the correlation coefficients between temporally separated frames of the natural scene movie. The correlation coefficient between two frames separated by time τ, C(τ), is computed by the following:

C(τ) = 〈(x(t) − 〈x〉)(x(t + τ) − 〈x〉)〉 / 〈(x(t) − 〈x〉)^{2}〉, (Eq. 15)

where x(t) is the gray-scale pixel value of the frame at time t and 〈x〉 is the pixel value averaged over the total duration of the natural scene movie. C(τ) is shown as a dotted line in Figure 3. C(τ) decays rapidly at first and then slowly approaches 0. We fit C(τ) with the sum of two exponentials, y(τ) = a_{1} exp(−τ/τ_{1}) + a_{2} exp(−τ/τ_{2}), by the least-squares method. The fitted line is shown as a solid line in Figure 3. The fitted time constants are τ_{1} = 332 ms and τ_{2} = 9.77 s. This result indicates that the length of the stimuli should be shorter than the faster time constant, τ_{1} = 332 ms.
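The autocorrelation analysis above can be sketched as follows. Here C(τ) is estimated from a one-dimensional sequence of mean pixel values, and the two time constants are recovered by a grid search over (τ1, τ2) with the amplitudes solved by linear least squares; this is a simplified stand-in for the least-squares fit described in the text, and all function names and grids are our own illustrative choices.

```python
import numpy as np

def frame_autocorrelation(x, max_lag):
    """C(tau): correlation coefficient between frames tau bins apart."""
    x = np.asarray(x, float) - np.mean(x)
    denom = np.sum(x * x)
    return np.array([np.sum(x[: len(x) - lag] * x[lag:]) / denom
                     for lag in range(max_lag)])

def fit_two_exponentials(tau, C, tau1_grid, tau2_grid):
    """Least-squares fit of C(tau) ~ a1 exp(-tau/tau1) + a2 exp(-tau/tau2).
    Time constants are grid-searched; amplitudes are solved linearly."""
    best_err, best_params = np.inf, None
    for t1 in tau1_grid:
        for t2 in tau2_grid:
            A = np.column_stack([np.exp(-tau / t1), np.exp(-tau / t2)])
            coef, *_ = np.linalg.lstsq(A, C, rcond=None)
            err = np.sum((A @ coef - C) ** 2)
            if err < best_err:
                best_err = err
                best_params = (coef[0], t1, coef[1], t2)
    return best_params   # a1, tau1, a2, tau2
```

On clean synthetic data built from two known time constants, this recovers them to within the grid resolution; for real movie frames the fast constant plays the role of τ1 = 332 ms.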
Constructing mismatched decoding models by the maximum entropy method.
Figure 4 A shows the responses of seven retinal ganglion cells to the natural scene movie from 0 to 10 s. To apply information theoretic techniques, we first discretized time into small bins Δτ and indicated whether or not a spike was emitted in each bin with a binary variable: r_{i} = 1 means that cell i spiked, and r_{i} = 0 means that it did not. We set the bin length, Δτ, to 5 ms, short enough to ensure that two spikes did not fall into the same bin. In this way, the spike pattern of the ganglion cells was transformed into an N-letter binary word, r = {r_{1}, r_{2}, … , r_{N}}, where N is the number of neurons (Fig. 4 B). We then determined the frequency with which a particular spike pattern, r, was observed during each stimulus and estimated the conditional probability distribution p_{data}(r|s) from the experimental data. If we set the length of stimuli to 100 ms, there were, effectively, a total of 900 (=20 bins × 45 repeats) samples for estimating the conditional probability distribution p_{data}(r|s) of each stimulus, because each 5 ms bin within the 100 ms segment was assumed to come from the same stimulus. Using these estimated conditional probabilities, we evaluated the information contained in the N-letter binary words r.
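The discretization step can be sketched in a few lines; the function below bins each cell's spike times into 5 ms bins and marks a bin as 1 if the cell fired at least once in it (the function name and the toy spike times in the test are our own, not from the recordings).

```python
import numpy as np

def spikes_to_words(spike_times, t_start, t_stop, dt=0.005):
    """Discretize spike trains of N cells into binary words.

    spike_times : list of N arrays/lists of spike times (seconds).
    Returns an (n_bins, N) array of binary responses r_i in {0, 1}.
    """
    edges = np.arange(t_start, t_stop + dt, dt)
    n_bins = len(edges) - 1
    words = np.zeros((n_bins, len(spike_times)), dtype=int)
    for i, st in enumerate(spike_times):
        counts, _ = np.histogram(st, bins=edges)
        words[:, i] = (counts > 0).astype(int)   # 1 if cell i fired in the bin
    return words
```

Each row of the returned array is one N-letter binary word r; stacking the rows recorded during one stimulus gives the samples for p_data(r|s).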
Generally, the joint probability of N binary variables can be written as follows (Amari, 2001; Nakahara and Amari, 2002):

log p_{N}(r) = Σ_{i} θ_{i}r_{i} + Σ_{i<j} θ_{ij}r_{i}r_{j} + Σ_{i<j<k} θ_{ijk}r_{i}r_{j}r_{k} + … + θ_{12…N}r_{1}r_{2} … r_{N} − ψ, (Eq. 16)

where ψ is a normalization term. This type of representation of a probability distribution is called a log-linear model. Because the number of parameters in the log-linear model is equal to the number of all possible configurations of an N-letter binary word r, we can determine the values of the parameters so that the log-linear model p_{N}(r) exactly matches the empirical probability distribution p_{data}(r): that is, p_{N}(r) = p_{data}(r).
To compute the information for mismatched decoders, we constructed simplified probabilistic models of neural responses that partially match the empirical distribution, p_{data}(r). The simplest model is the “independent model,” p_{1}(r), in which only the average of each r_{i} agrees with the experimental data: that is, 〈r_{i}〉_{p_{1}} = 〈r_{i}〉_{p_{data}}. Many probability distributions satisfy this constraint. In accordance with the maximum entropy principle (Jaynes, 1957; Schneidman et al., 2003, 2006), we chose the one that maximizes the entropy H = −Σ_{r} p_{1}(r) log p_{1}(r).
The resulting maximum entropy distribution is as follows:

p_{1}(r) = exp(Σ_{i} θ_{i}^{(1)}r_{i} − ψ^{(1)}), (Eq. 17)

in which the model parameters θ^{(1)} are determined so that the constraints are satisfied. This model corresponds to a log-linear model in which all orders of correlation parameters {θ_{ij}, θ_{ijk}, … , θ_{12…N}} are omitted. If we perform maximum-likelihood estimation of the model parameters θ^{(1)} in this log-linear model, the result is that the average of r_{i} under the model equals the average of r_{i} found in the data: that is, 〈r_{i}〉_{p_{1}} = 〈r_{i}〉_{p_{data}}. This is identical with the constraints of the maximum entropy model. In general, the maximum entropy method is equivalent to maximum-likelihood fitting of a log-linear model (Berger et al., 1996).
Similarly, we can consider a “second-order correlation model,” p_{2}(r), which is consistent with not only the averages of r_{i} but also the averages of all products r_{i}r_{j} found in the data. Maximizing the entropy under the constraints 〈r_{i}〉_{p_{2}} = 〈r_{i}〉_{p_{data}} and 〈r_{i}r_{j}〉_{p_{2}} = 〈r_{i}r_{j}〉_{p_{data}}, we obtain the following:

p_{2}(r) = exp(Σ_{i} θ_{i}^{(2)}r_{i} + Σ_{i<j} θ_{ij}^{(2)}r_{i}r_{j} − ψ^{(2)}), (Eq. 18)

in which the model parameters θ^{(2)} are determined so that the constraints are satisfied.
The procedure described above can also be used to construct a “Kth-order correlation model,” p_{K}(r). If we substitute the simplified model of neural responses p_{K}(r|s) for the mismatched decoding model q(r|s) in Equation 6, we can compute the amount of information that can be obtained when correlations of order higher than K are ignored in decoding, as follows:

I_{K}* = max_{β} Ĩ_{K}(β), (Eq. 19)

Ĩ_{K}(β) = Σ_{s} Σ_{r} p(s)p(r|s) log_{2} [p_{K}(r|s)^{β} / Σ_{s′} p(s′)p_{K}(r|s′)^{β}]. (Eq. 20)

By evaluating the ratio of information, I_{K}*/I, we can infer how many orders of correlation should be taken into account to extract sufficient information.
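For small populations, the Kth-order maximum entropy models above can be fitted by exact gradient ascent on the log-likelihood, which, as noted in the text, is equivalent to matching the model moments to the empirical moments. The sketch below handles K = 1 and K = 2 by enumerating all 2^N patterns; it is an illustrative implementation of our own, not the authors' code, and is only feasible for small N.

```python
import numpy as np
from itertools import combinations, product

def fit_maxent(words, order=1, n_iter=5000, lr=0.2):
    """Fit a Kth-order maximum entropy model (K = 1 or 2) to binary words
    by gradient ascent on the log-likelihood (exact enumeration of states)."""
    words = np.asarray(words, float)
    N = words.shape[1]
    states = np.array(list(product([0, 1], repeat=N)), float)  # all 2^N patterns

    def feats(R):  # features r_i (and r_i r_j for the second-order model)
        cols = [R[:, i] for i in range(N)]
        if order == 2:
            cols += [R[:, i] * R[:, j] for i, j in combinations(range(N), 2)]
        return np.column_stack(cols)

    F = feats(states)
    f_data = feats(words).mean(axis=0)      # empirical moments (constraints)
    theta = np.zeros(F.shape[1])
    for _ in range(n_iter):
        logits = F @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()                        # current model distribution p_K(r)
        theta += lr * (f_data - p @ F)      # push model moments toward data
    logits = F @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return states, p, theta
```

The returned distribution over `states` can then be plugged in as p_K(r|s) (one fit per stimulus s) when evaluating I_K*.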
Limited sampling problem in estimating mutual information.
It is well known that estimating the mutual information in Equation 2 from a limited amount of neuronal data causes a sampling bias problem (Panzeri and Treves, 1996): with a small amount of data, the mutual information is biased upward. Recently, a tight data-robust lower bound to the mutual information, I_{sh}, was developed (Montemurro et al., 2007). I_{sh} is derived using “shuffling,” namely, the shuffling of neural responses across trials, to cancel out the upward bias of the mutual information. In the expression for I_{sh} (Eqs. 21–23), I is the mutual information in Equation 2, p_{1}(r|s) is the independent model, that is, p_{1}(r|s) = Π_{i}p(r_{i}|s), and p_{1−sh}(r|s) is the distribution of the shuffled neural responses. Using p_{1−sh}(r|s) instead of p_{1}(r|s) in Equation 23, the upward bias of I is canceled out by a downward bias, and as a result I_{sh} is mildly biased downward. Using I (Eq. 2) and I_{sh} (Eq. 21), the real mutual information, I_{real}, is therefore bounded from above and below as I_{sh} ≤ I_{real} ≤ I. We computed both I and I_{sh} and found that the difference between them was markedly small even when all recorded cells were analyzed (Fig. 5). This means that we had a sufficient amount of data to accurately estimate the mutual information. Thus, in Results, we show only the value of the mutual information computed directly from Equation 2.
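The upward bias discussed here is easy to reproduce in simulation: even when stimulus and response are statistically independent, so that the true mutual information is zero, the naive plug-in estimate from a limited number of trials is systematically positive. The toy example below uses synthetic data of our own choosing, not the retinal recordings.

```python
import numpy as np

rng = np.random.default_rng(1)

def plugin_mi(s, r, n_s, n_r):
    """Naive (plug-in) mutual information estimate in bits."""
    joint = np.zeros((n_s, n_r))
    np.add.at(joint, (s, r), 1)            # empirical joint counts
    joint /= joint.sum()
    ps = joint.sum(axis=1, keepdims=True)
    pr = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (ps * pr)[mask])))

# stimulus and response are independent, so the true information is zero,
# yet with only 50 trials the plug-in estimate is well above zero on average
n_trials, n_runs = 50, 200
estimates = [plugin_mi(rng.integers(0, 8, n_trials),
                       rng.integers(0, 8, n_trials), 8, 8)
             for _ in range(n_runs)]
bias = float(np.mean(estimates))
```

To first order, the expected bias is roughly (|S| − 1)(|R| − 1)/(2 N_trials ln 2) bits (cf. Panzeri and Treves, 1996), which for the numbers above is on the order of 0.7 bits, far from the true value of zero.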
Results
Information conveyed by correlated activities is negligibly small despite the presence of substantial correlations
We quantitatively evaluated the importance of correlated activities by comparing the mutual information I (Eq. 2) with the information for mismatched decoders I_{K}* (Eqs. 19, 20). We considered the independent model p_{1}(r) in Equation 17 and the second-order correlation model p_{2}(r) in Equation 18 as mismatched decoders. We analyzed two sets of spike data recorded from isolated retinas of different salamanders: seven neurons were simultaneously recorded in spike data 1 and six in spike data 2. The same 200 s natural scene movie was used as the stimulus for both spike data 1 and 2.
We computed the spike-triggered averages of all recorded neurons responding to the natural scene movie stimulus in spike data 1 and 2. The recorded cells were all OFF cells. The fits of two-dimensional Gaussian functions to the spike-triggered averages are shown in Figure 6. As can be seen, the receptive fields mostly overlapped in both spike data 1 and 2. Figure 7 shows cross-correlograms of all pairs of cells in spike data 1 and 2. Many pairs show strong peaks with a width of ∼100 ms around the origin. To show the degree of correlation in the population activities of the retinal ganglion cells, we investigated how accurately the independent model and the second-order correlation model predicted the actual neural responses, following previous studies (Schneidman et al., 2006; Shlens et al., 2006). Figure 8 shows the observed frequency of N-letter binary words r against the frequencies predicted by the independent model and the second-order correlation model. As can be seen from Figure 8, the independent model largely failed to capture the observed statistics of firing patterns, whereas the second-order correlation model substantially improved the prediction of the observed pattern rates. We therefore consider that correlations need to be taken into account to explain the observed neural responses. However, this does not necessarily mean that they need to be taken into account in decoding neural activities (see Discussion).
We computed the ratio of information obtained by the independent model, I_{1}*/I, and that obtained by the second-order correlation model, I_{2}*/I. Considering the decay time constant τ_{1} = 332 ms of the correlations between the frames of the natural scene movie (see Materials and Methods), we set the length of stimuli to 100 ms, providing 2000 stimuli from the 200 s movie. Some of these 100 ms stimuli evoked no spikes at all; we removed these and used the remaining stimuli for analysis. Figure 9 A shows I_{1}*/I and I_{2}*/I as the number of cells analyzed was changed. Although I_{1}*/I decreased slightly as the number of cells analyzed increased, I_{1}*/I was >90% in both spike data 1 and 2 even when all cells (N = 7 in spike data 1 and N = 6 in spike data 2) were analyzed. This result means that the loss of information associated with ignoring correlations was minor.
We computed the mutual information between all stimuli and neural responses. On average, the percentage of information conveyed by correlations was low. However, it is possible that correlations play an important role in discriminating some stimuli. To test this possibility, we computed I and I_{1}* for pairs of 100 ms natural scene movie stimuli selected from all stimuli. Figure 10 shows the histogram of I_{1}*/I when all recorded cells were analyzed. I_{1}*/I was >90% for ∼95% of the pairs of stimuli; pairs for which correlations carried a large proportion of the total information were extremely rare. This result also supports the idea that almost all stimulus information can be extracted even if correlations are ignored in decoding.
An important point is that the amount of information conveyed by correlations was markedly small (Figs. 9 A, 10) even though there were significant correlations in the population activities of the ganglion cells (Figs. 7, 8). This result shows that, to assess the importance of correlations in information processing in the brain, we should not only evaluate the degree to which the actual neural responses deviate from the independent model but should also compute the information obtained by the independent model, I_{1}*.
Pseudo-importance of correlations arising from the stationarity assumption about neural responses
We also computed I_{1}*/I and I_{2}*/I with the length of stimuli set to 10 s, to see what happens if the stimulus length is made considerably longer than the time constant of the stimulus autocorrelation, τ_{1} = 332 ms. Figure 9 B shows I_{1}*/I and I_{2}*/I in this case. When only two cells were considered, I_{1}*/I exceeded 90%, which means that, consistent with the result obtained by Nirenberg et al. (2001), ignoring correlations leads to only a small loss of information. However, when all cells were used in the analysis, I_{1}*/I was only ∼60% for both spike data 1 and 2. Thus, the conclusion reached when the length of stimuli was set to 10 s differed from that reached when it was set to 100 ms. This is because 10 s is too long to be considered as one stimulus during which the neural responses are stationary, that is, during which the neural responses obey the same conditional probability distribution p(r|s). If we assume stationarity when neural responses are not in fact stationary, correlations may appear to carry a large proportion of information regardless of the actual importance of the correlated activities.
Figure 11 shows I _{1}*/I and I _{2}*/I when the duration of stimuli was changed. When the length of stimuli is appropriately set, >90% of information can be extracted even if correlations are ignored in decoding neural activities. However, when the length of stimuli is too long, correlations appear to carry a large proportion of total information because of the stationarity assumption about neural responses.
To clarify why correlations become less important as the stimulus is shortened, we used the toy model shown in Figure 12. We considered the case in which two cells fire independently in accordance with a Poisson process and performed an analysis similar to that for the actual spike data. We used simulated spike data for the two cells generated in accordance with the firing rates shown in Figure 12 A; the firing rates varied sinusoidally in time over the 2 s stimulus. We divided the 2 s stimulus into two 1 s stimuli, s_{1} and s_{2}, as shown in Figure 12 B. We then computed the mutual information I and the information obtained by the independent model, I_{1}*, over s_{1} and s_{2}. Because the two cells fired independently, there were essentially no correlations between them. However, a pseudo-correlation arose because of the assumption of stationarity for the dynamically changing stimulus. The pseudo-correlation was high for s_{1} and low for s_{2}. In contrast to the difference in the degree of “correlation” between the two stimuli, the mean firing rates of the two cells during each stimulus were equal. If the stimulus is 1 s long, therefore, we cannot discriminate the two stimuli using the independent model; namely, I_{1}* = 0. This implies that, when stationarity of neural responses is assumed over long durations, correlations can carry a large proportion of the total information irrespective of their actual importance.
We also considered the case in which the stimulus was 0.5 s long, as shown in Figure 12 C. In this case, pseudo-correlations again appeared, but there was a significant difference in mean firing rates between the stimuli. Thus, almost all the information could be extracted with the independent model. The dependence of I_{1}*/I on stimulus length is shown in Figure 11 C. Similar behavior was also observed in the analysis of the actual spike data for retinal ganglion cells (Fig. 11 A,B). Even if we observe that correlations carry a significantly larger proportion of information for stimuli that are long compared with the speed of change in the firing rates, this may simply result from meaningless correlation. Thus, to assess the role of correlations in information processing, the stimuli used should be sufficiently short that the neural responses to each stimulus can be considered to obey the same probability distribution. Considering the response speed of retinal ganglion cells, the 100 ms stimulus length used in the present study is still not short enough for the stationarity assumption to hold exactly. However, we kept the stimulus length equal to or longer than 100 ms to ensure sufficient data for the mutual information to be reliably estimated. If the stimulus length were shortened, the ratio of information carried by correlations could become even smaller, as suggested by the analysis in this section (Fig. 11 C).
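The pseudo-correlation in this toy model can be reproduced in a few lines: two cells fire independently given a shared, sinusoidally modulated Poisson rate, yet pooling all time bins of a long window (the stationarity assumption) yields a clearly positive correlation coefficient, while conditioning on the time bin, which mimics using short stimuli, gives correlations near zero. All rates, bin sizes, and trial counts below are illustrative choices, not the parameters of Figure 12.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, T, n_rep = 0.005, 1.0, 200                 # 5 ms bins, 1 s window, 200 trials
t = np.arange(0.0, T, dt)
rate = 20.0 + 15.0 * np.sin(2 * np.pi * t / T)  # shared firing rate (Hz)
p_spike = np.tile(rate * dt, n_rep)             # Bernoulli prob. per bin

# the two cells fire independently given the time-varying rate
r1 = (rng.random(p_spike.size) < p_spike).astype(float)
r2 = (rng.random(p_spike.size) < p_spike).astype(float)

# pooling all bins of the long window: an apparent correlation appears
pooled_rho = np.corrcoef(r1, r2)[0, 1]

# conditioning on the time bin (short-stimulus limit): correlation vanishes
R1 = r1.reshape(n_rep, -1)
R2 = r2.reshape(n_rep, -1)
per_bin = [np.corrcoef(R1[:, b], R2[:, b])[0, 1]
           for b in range(R1.shape[1])
           if R1[:, b].std() > 0 and R2[:, b].std() > 0]
noise_rho = float(np.mean(per_bin))
```

Here `pooled_rho` is positive purely because both cells track the same rate modulation, whereas the trial-to-trial (noise) correlation `noise_rho` is consistent with zero, mirroring the argument of this section.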
Comparison between I^{NL} and I*
In the Appendix, we show a simple example in which the difference between I^{NL} and I* becomes large, particularly when many cells are analyzed. To see the difference between I^{NL} and I* in the actual spike data, we computed I_{1}^{NL}, which corresponds to the information obtained by the independent decoder, I_{1}*. The dot-dashed lines in Figure 9 plot I_{1}^{NL}. Although the difference between I^{NL} and I* increased slightly as the number of cells analyzed increased, the lower bound of I_{1}* provided by I_{1}^{NL} was relatively tight, even when all recorded cells were analyzed. These results suggest that the values of I^{NL} previously reported in analyses of pairs of cells were also probably close to I* (Nirenberg et al., 2001; Golledge et al., 2003).
Discussion
Here, we describe a general framework for investigating to what extent the decoding process in the brain can be simplified. In this framework, we first constructed simplified decoding models (i.e., mismatched decoding models) using the maximum entropy method. We then computed the amount of information that can be extracted using the mismatched decoders. We introduced the information for mismatched decoders, I*, which was derived in terms of communication rate in information theory (Merhav et al., 1994). By analytical computation, we showed that both the mutual information I and the information for mismatched decoders I* are inversely proportional to the minimum mean-square error under the condition that neural responses obey Gaussian statistics. We also pointed out that the difference between the previously proposed information I^{NL} (Nirenberg and Latham, 2003) and I* can become large when many cells are analyzed. Using the information theoretic quantity I*, we showed that >90% of the information encoded in the population activities of retinal ganglion cells can be decoded even if all orders of correlation are ignored in decoding. Our results imply that the brain could use a simplified decoding strategy that ignores correlations, with little loss of information.
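For discrete response distributions, the quantities involved can be computed directly. The following sketch (hypothetical toy distributions; the grid search over β follows the maximization in Merhav et al., 1994) computes the mutual information I and the information for a mismatched decoder I*, taking the independent model of two binary cells as the decoder. In this contrived example the marginal firing rates are identical across the two stimuli, so the independent decoder extracts no information:

```python
import numpy as np

def mutual_info(p_s, p_r_s):
    """Mutual information I(S;R) in bits; rows of p_r_s are p(r|s)."""
    p_r = p_s @ p_r_s
    return float((p_s[:, None] * p_r_s * np.log2(p_r_s / p_r)).sum())

def info_star(p_s, p_r_s, q_r_s, betas=np.linspace(0.01, 5, 500)):
    """Information for a mismatched decoder q(r|s): maximum over beta
    of the communication-rate bound (Merhav et al., 1994)."""
    best = -np.inf
    for b in betas:
        qb = q_r_s ** b
        denom = p_s @ qb                       # sum_s' p(s') q(r|s')^beta
        best = max(best, float((p_s[:, None] * p_r_s * np.log2(qb / denom)).sum()))
    return best

# Hypothetical example: two stimuli, two binary cells,
# joint response states ordered (00, 01, 10, 11)
p_s = np.array([0.5, 0.5])
p_r_s = np.array([[0.40, 0.10, 0.10, 0.40],   # cells correlated under s1
                  [0.10, 0.40, 0.40, 0.10]])  # cells anti-correlated under s2

def independent_model(p):
    """Product of the single-cell marginals of a joint distribution p."""
    p1, p2 = p[2] + p[3], p[1] + p[3]
    return np.array([(1 - p1) * (1 - p2), (1 - p1) * p2, p1 * (1 - p2), p1 * p2])

q_r_s = np.vstack([independent_model(row) for row in p_r_s])

I = mutual_info(p_s, p_r_s)
I1 = info_star(p_s, p_r_s, q_r_s)
print(I, I1)   # I > 0, but I1 = 0: all information here lies in the correlations
```

The same routines, applied to response distributions in which the marginals differ across stimuli, give I1 close to I, which is the situation observed in our retinal data.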
Below, we discuss differences between the present and previous studies using the maximum entropy approach (Schneidman et al., 2006; Shlens et al., 2006; Tang et al., 2008); limitations and extensions of the methodology used in this work; and future directions, which concern animal behavior experiments (Stopfer et al., 1997; Ishikane et al., 2005).
Significant correlated activity does not necessarily imply that correlations are important in decoding
Previous studies using the maximum entropy approach (Schneidman et al., 2006; Shlens et al., 2006; Tang et al., 2008) emphasized the discrepancy between the independent model and the actual probability distribution; that is, their results show that there are significant correlations in large neural populations. The impact of such significant correlated neural activity on information encoding has recently been addressed (Montani et al., 2009). In the present study, we addressed how important the correlations are in information decoding. Our results indicate that, even if the independent model fails to capture the statistics of population activities, it does not necessarily mean that correlations play an important role in extracting information about stimuli. Assume that we experimentally obtain the probability distributions of neural responses to two different stimuli, p_{data}(r|s_{1}) and p_{data}(r|s_{2}). Even when the independent models for the two stimuli, p_{1}(r|s_{1}) and p_{1}(r|s_{2}), deviate greatly from the data distributions p_{data}(r|s_{1}) and p_{data}(r|s_{2}), if the two independent models are sufficiently different from each other, correlations are not important in decoding neural activities. In fact, the information conveyed by correlated activity in our analysis represented only ∼10% of the total, even though we observed a large deviation of the independent model from the data distribution in our spike data, as in previous studies (Fig. 8). As shown in Figure 8, the independent model fails disastrously in predicting the actual probability distribution, whereas the second-order correlation model considerably improves the fit to the actual probability distribution, as shown in the previous studies (Schneidman et al., 2006; Shlens et al., 2006). If we considered only the discrepancy between the independent model and the actual probability distribution (Fig. 8), we might mistakenly conclude that correlations play an important role in information processing in the brain. To assess the importance of correlations, we instead need to evaluate the difference between the mutual information and the information obtained by the simplified probabilistic models, I*, as was done in the present study.
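This point can be illustrated numerically (hypothetical toy distributions): the independent model of each stimulus can misfit the corresponding data distribution, as measured by the Kullback-Leibler divergence, while the two independent models still differ strongly from each other, which is what matters for decoding.

```python
import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p||q) in bits."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Hypothetical joint distributions over two binary cells, states (00, 01, 10, 11)
p_s1 = np.array([0.45, 0.05, 0.05, 0.45])   # synchronized firing under s1
p_s2 = np.array([0.89, 0.01, 0.01, 0.09])   # sparse, correlated firing under s2

def independent(p):
    """Product of single-cell marginals."""
    p1, p2 = p[2] + p[3], p[1] + p[3]
    return np.array([(1 - p1) * (1 - p2), (1 - p1) * p2, p1 * (1 - p2), p1 * p2])

q_s1, q_s2 = independent(p_s1), independent(p_s2)

# The independent models misfit each stimulus's response distribution...
print(round(kl_bits(p_s1, q_s1), 2), round(kl_bits(p_s2, q_s2), 2))
# ...yet differ strongly from each other, so firing rates alone discriminate s1 from s2
print(round(kl_bits(q_s1, q_s2), 2))
```

Both misfit divergences are substantial here, yet the divergence between the two independent models is even larger, so a rate-based decoder still separates the stimuli.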
Temporal correlations across time bins
In this study, we focused on synchronous firing within one time bin, on the basis of suggestions that synchronous firing has functional importance (Gray et al., 1989; Meister et al., 1995; Meister, 1996; Stopfer et al., 1997; Dan et al., 1998; Perez-Orive et al., 2002; Ishikane et al., 2005) and that spike timing-based computations taking advantage of synchronous firing can be implemented in a biologically relevant network architecture (Hopfield, 1999; Brody and Hopfield, 2003). Given previous findings that neurons carry substantial sensory information in their response latencies (Panzeri et al., 2001; Reich et al., 2001; Gollisch and Meister, 2008), consideration of temporal correlations across time bins may be important. Statistical models that take account of time-lagged correlations can be constructed based on the maximum entropy method with a Markovian assumption of temporal evolution (Marre et al., 2009) or based on a generalized linear model (Pillow et al., 2005, 2008). By comparing the amount of information obtained by a probabilistic model that takes account of time-lagged correlations with that obtained by a probabilistic model that only takes account of simultaneous firing within a short time bin, we can quantitatively evaluate the amount of information conveyed by the complex temporal correlations between spikes.
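As a minimal illustration of what a time-lagged correlation looks like (hypothetical rates and coupling, standing in for a fitted GLM coupling filter rather than our recorded data), the following sketch couples two simulated cells so that a spike in one transiently raises the other's rate, producing an asymmetric cross-correlogram:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, n_bins = 0.001, 50_000           # 1 ms bins, 50 s of simulated activity
base = 20.0                          # baseline firing rate (Hz)

# Cell 1: homogeneous Poisson. Cell 2: rate transiently tripled for 10 ms
# after each cell-1 spike -- a crude stand-in for a coupling filter.
s1 = rng.random(n_bins) < base * dt
kernel = np.full(10, 2.0)                              # extra gain over the next 10 ms
gain = np.convolve(s1.astype(float), kernel)[:n_bins]
s2 = rng.random(n_bins) < base * (1 + gain) * dt

# Asymmetric coincidences at a 5 ms lag reveal the time-lagged correlation
forward = int(np.sum(s1[:-5] & s2[5:]))   # cell 1 spike -> cell 2 spike 5 ms later
reverse = int(np.sum(s2[:-5] & s1[5:]))   # cell 2 spike -> cell 1 spike 5 ms later
print(forward, reverse)                   # forward exceeds reverse
```

A synchronous (single-bin) model of these two cells would miss the forward/reverse asymmetry entirely, which is the kind of structure the time-lagged models above are designed to capture.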
Using a different approach from ours, Pillow et al. (2008) reported that model-based decoding that exploits time-lagged correlations between neurons extracted 20% more information about the visual scene than decoding under the assumption of independence; decoding performance was quantified using the log signal-to-noise ratio. Our results showed that the second-order correlation model, which takes account of correlations within one time bin only, extracts only ∼10% more information about the visual scene than the independent model. The difference in the improvement over the independent model between this work and that of Pillow et al. may be attributable to the amount of information conveyed by time-lagged correlations. Alternatively, it could be explained by the fact that they analyzed more cells (27 cells) than we did. Additional investigation of the importance of time-lagged correlations in information processing in the brain is required.
Quantitative investigation of the relationship between synchronized activity and animal behavior
We showed that synchronized activity does not convey much information about natural scene stimuli. In some experiments, however, a strong correlation between synchronized activity and animal behavior has been demonstrated (Stopfer et al., 1997; Ishikane et al., 2005). Stopfer et al. (1997) showed that picrotoxin-induced desynchronization impaired the discrimination of molecularly similar odorants in honeybees but did not prevent coarse discrimination of dissimilar odorants. Ishikane et al. (2005) showed that bicuculline-induced desynchronization suppressed escape behavior in frogs. The important point in these studies is that pharmacological blockade of GABA receptors strongly affected synchronization only and had little effect on the firing rates of neurons. If the firing rates of the neurons relevant to the behavior did not change at all, we could say without doubt that synchronized activity is essential to the decoding of neural activities. However, some ambiguity remains, because it is impossible for pharmacological blockade to leave the firing rate of every neuron entirely unaltered. To resolve this ambiguity, the information for mismatched decoders, I*, may be helpful.
Let us assume that we experimentally obtain normal neural responses to a specific stimulus s, r_{1}, and altered neural responses to the same stimulus s after pharmacological blockade of neurotransmitter receptors, r_{2}. If the animal's behavior differed between r_{1} and r_{2}, this would mean that the brain interpreted two "different" stimuli as having been presented when r_{1} and r_{2} were evoked, even though the same stimulus, s, had in fact been presented. The important question is which difference in neural activities before and after the pharmacological blockade determined the brain's judgment. This question can be quantitatively answered by computing the mutual information, I, between the two "different" stimuli interpreted by the brain and the corresponding neural responses, and by comparing I with the information for mismatched decoders, I*. For example, if I_{1}*/I is high, it can be said that the brain's decision is mainly based on the difference in firing rates between the two neural responses r_{1} and r_{2}. However, if I_{1}*/I is low, the difference in correlated activity plays a crucial role in discriminating the stimulus. Applying the information theoretic measures I and I* to behavioral experiments with physiological measurements will provide profound insights into how information is decoded in the brain.
Appendix: Theoretical evaluation of information I, I*, and I^{NL}
In this appendix, we compare three measures of the information contained in neural activities, namely the mutual information I, the information obtained by mismatched decoding I*, and the Nirenberg–Latham information I^{NL}, by analytical computation. Two results were obtained: (1) I and I* provide results consistent with the minimum mean-square error, and (2) the difference between I* and I^{NL} may increase when many cells are analyzed, and I^{NL} can take negative values.
First, let us consider the problem in which mutual information is computed when stimulus s, which is a single continuous variable, and a slightly different stimulus s + Δs are presented. We assume the prior probabilities of the stimuli are equal: p(s) = p(s + Δs) = 1/2. Neural responses evoked by the stimuli are denoted by r, which is considered here to be the vector of neuron firing rates. When the difference between the two stimuli is small, the conditional probability p(r|s + Δs) can be expanded with respect to Δs as follows:

$$p(r|s + \Delta s) = p(r|s) + p'(r|s)\,\Delta s + O(\Delta s^2), \qquad (25)$$

where ′ represents differentiation with respect to s. Using Equation 25, to leading order of Δs, we can write the mutual information I as follows:

$$I = \frac{\Delta s^2}{8}\, G(s), \qquad (26)$$

where

$$G(s) = \int dr\, \frac{p'(r|s)^2}{p(r|s)}. \qquad (27)$$

We assume that the encoding model p(r|s) obeys the Gaussian distribution

$$p(r|s) = \frac{1}{(2\pi)^{N/2} |C|^{1/2}} \exp\!\left[-\frac{1}{2}\,(r - f(s))^T C^{-1} (r - f(s))\right], \qquad (28)$$

where ^T stands for the transpose operation, f(s) is the vector of mean firing rates given stimulus s, and C is the covariance matrix. We consider an independent decoding model q(r|s) that ignores correlations,

$$q(r|s) = \frac{1}{(2\pi)^{N/2} |C_D|^{1/2}} \exp\!\left[-\frac{1}{2}\,(r - f(s))^T C_D^{-1} (r - f(s))\right], \qquad (29)$$

where C_D is the diagonal covariance matrix obtained by setting the off-diagonal elements of C to 0. If the Gaussian integral is performed for Equations 26, 28, and 29, I, I*, and I^{NL} can be written as follows:

$$I = \frac{\Delta s^2}{8}\, f'^T C^{-1} f', \qquad (32)$$

$$I^* = \frac{\Delta s^2}{8}\, \frac{(f'^T C_D^{-1} f')^2}{f'^T C_D^{-1} C\, C_D^{-1} f'}, \qquad (33)$$

$$I^{NL} = \frac{\Delta s^2}{8} \left[\, 2 f'^T C_D^{-1} f' - f'^T C_D^{-1} C\, C_D^{-1} f' \,\right]. \qquad (34)$$

Next, let us consider the minimum mean-square error when stimulus s is presented. The optimal estimate of stimulus s when we know the actual encoding model p(r|s) is the value of ŝ that maximizes the likelihood p(r|s). Similarly, the optimal estimate of stimulus s when we can only use the independent model q(r|s) is the value of ŝ that maximizes the likelihood q(r|s). Previously, Wu et al. (2001) computed the minimum mean-square error when the optimal decoder is applied, MMSE, and the minimum mean-square error when the independent decoder is applied, MMSE*. These are given by the following:

$$\mathrm{MMSE} = \frac{1}{f'^T C^{-1} f'}, \qquad (35)$$

$$\mathrm{MMSE}^* = \frac{f'^T C_D^{-1} C\, C_D^{-1} f'}{(f'^T C_D^{-1} f')^2}. \qquad (36)$$

If we compare Equation 32 with Equation 35, we can see that the mutual information I is inversely proportional to the minimum mean-square error when the optimal decoder is applied. Similarly, as can be seen in Equations 33 and 36, I* is inversely proportional to the minimum mean-square error when the independent decoder is applied. Thus, I* corresponds to the mutual information not only from the viewpoint of communication rate across a channel but also from that of the minimum mean-square error. In contrast, I^{NL} is not inversely proportional to the minimum mean-square error.
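The inverse-proportionality relations between I, I* and the corresponding mean-square errors (Eqs. 32, 33, 35, 36) can be checked numerically for an arbitrary covariance matrix (a sketch with hypothetical random parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
N, ds = 5, 0.1                           # number of cells, stimulus difference

A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)              # random positive-definite covariance
fp = rng.normal(size=N)                  # tuning-curve derivatives f'(s)

Ci = np.linalg.inv(C)
CDi = np.diag(1.0 / np.diag(C))          # inverse of the diagonal matrix C_D

I      = ds**2 / 8 * fp @ Ci @ fp                                      # Eq. 32
Istar  = ds**2 / 8 * (fp @ CDi @ fp)**2 / (fp @ CDi @ C @ CDi @ fp)    # Eq. 33
mmse   = 1.0 / (fp @ Ci @ fp)                                          # Eq. 35
mmse_s = (fp @ CDi @ C @ CDi @ fp) / (fp @ CDi @ fp)**2                # Eq. 36

print(np.isclose(I * mmse, ds**2 / 8))        # True: I  = (ds^2/8) / MMSE
print(np.isclose(Istar * mmse_s, ds**2 / 8))  # True: I* = (ds^2/8) / MMSE*
print(Istar <= I)                             # True: mismatched decoding never gains
```

The last inequality, I* ≤ I, follows from the Cauchy-Schwarz inequality applied to the quadratic forms in Equations 32 and 33.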
As a simple example that demonstrates a large discrepancy between I* and I^{NL}, we considered a uniform correlation model (Abbott and Dayan, 1999; Wu et al., 2001) in which the covariance matrix C is given by $C_{ij} = \sigma^2 [\delta_{ij} + c(1 - \delta_{ij})]$ and assumed that the derivatives of the firing rates were uniform, that is, $f'_i = f'$. In this case, I, I*, and I^{NL} become the following:

$$I = I^* = \frac{\Delta s^2}{8}\, \frac{N f'^2}{\sigma^2 \left(1 + (N-1)c\right)},$$

$$I^{NL} = \frac{\Delta s^2}{8}\, \frac{N f'^2}{\sigma^2} \left[\,1 - (N-1)c\,\right],$$

where N is the number of cells. We can see that I* is equal to I, which means that information is not lost even if correlation is ignored in the decoding process. Figure 13 shows I^{NL}/I and I*/I when the degree of correlation c is 0.01. As shown in Figure 13, the difference between the correct information I* and the Nirenberg–Latham information I^{NL} is markedly large when the number of cells N is large. When N > 1/c + 1, I^{NL} even becomes negative.
Footnotes

This work was partially supported by Grants-in-Aid for Scientific Research 18079003, 20240020, and 20650019 from the Ministry of Education, Culture, Sports, Science and Technology of Japan (M. Okada). M. Oizumi was supported by Grant-in-Aid 08J08950 for Japan Society for the Promotion of Science Fellows.
Correspondence should be addressed to Masato Okada, University of Tokyo, 701 Kibantou, Kashiwanoha, Kashiwa, 277-8561 Chiba, Japan. okada@k.u-tokyo.ac.jp