Abstract
Decoding the activity of a population of neurons is a fundamental problem in neuroscience. A key aspect of this problem is determining whether correlations in the activity, i.e., noise correlations, are important. If they are important, then the decoding problem is high dimensional: decoding algorithms must take the correlational structure in the activity into account. If they are not important, or if they play a minor role, then the decoding problem can be reduced to lower dimension and thus made more tractable. The issue of whether correlations are important has been a subject of heated debate. The debate centers around the validity of the measures used to address it. Here, we evaluate three of the most commonly used ones: synergy, ΔIshuffled, and ΔI. We show that synergy and ΔIshuffled are confounded measures: they can be zero when correlations are clearly important for decoding and positive when they are not. In contrast, ΔI is not confounded. It is zero only when correlations are not important for decoding and positive only when they are; that is, it is zero only when one can decode exactly as well using a decoder that ignores correlations as one can using a decoder that does not, and it is positive only when one cannot decode as well. Finally, we show that ΔI has an information theoretic interpretation; it is an upper bound on the information lost when correlations are ignored.
- retina
- encoding
- decoding
- neural code
- information theory
- population coding
- signal correlations
- noise correlations
Introduction
One of the main challenges we face in neuroscience is understanding the neural code; that is, understanding how information about the outside world is carried in neuronal spike trains. Several possibilities exist: information could be carried in spike rate, spike timing, spike correlations within single neurons, spike correlations across neurons, or any combination of these.
Recently, a great deal of attention has been focused on the last possibility, on spike correlations across neurons (e.g., synchronous spikes). The reason for the strong emphasis on this issue is that its resolution has significant impact on downstream research: whether or not correlations are important greatly affects the strategies one can take for population decoding. If correlations are important, then direct, brute force approaches are ruled out: one simply cannot find the mapping from stimulus to response, as such a mapping would require measuring response distributions in high dimensions, a minimum of N dimensions for N neurons. For more than three or four neurons, the amount of data needed to do this becomes impossibly large, and the direct approach becomes intractable (Fig. 1a). Instead, indirect methods for estimating response distributions, such as modeling the correlations parametrically, must be used.
If, on the other hand, correlations turn out not to be important, then direct approaches can be used, even for large populations. This is because one can build the mapping from stimulus to response for a population of neurons from the individual, single neuron mappings (Fig. 1b). Such an approach would allow rapid movement on the question of how neuronal activity is decoded.
The issue of whether correlated firing is important has been fraught with debate. Several authors (Eckhorn et al., 1988; Gray and Singer, 1989; Gray et al., 1989; Meister et al., 1995; Vaadia et al., 1995; deCharms and Merzenich, 1996; Dan et al., 1998; Steinmetz et al., 2000) have suggested that it is, whereas others (Nirenberg et al., 2001; Oram et al., 2001; Petersen et al., 2001; Levine et al., 2002; Panzeri et al., 2002a,b; Averbeck and Lee, 2003; Averbeck et al., 2003; Golledge et al., 2003) have argued that it is not or that it plays a minor role. The debate has arisen in large part because different methods have been used to assess the role of correlations, and different methods yield different answers.
One early method was to look for stimulus-dependent changes in cross-correlograms (Eckhorn et al., 1988; Gray and Singer, 1989; Gray et al., 1989; Vaadia et al., 1995; deCharms and Merzenich, 1996). This method, however, has two problems. One is that firing rates can significantly alter the shape of cross-correlograms, making it difficult to separate information carried in firing rates from information carried in correlations. The other is that cross-correlograms only tell us about one type of correlation, synchronous or near-synchronous spikes. Correlations that occur on a longer timescale, or correlations among patterns of spikes, are missed by this method.
More recently, information-theoretic approaches have been applied to the problem, because they are more quantitative and are sensitive to correlations other than just synchrony. These methods, however, also turned out to have problems. In particular, two measures that have appeared in the literature, ΔIshuffled (Nirenberg and Latham 1998; Panzeri et al., 2001; Golledge et al., 2003; Osborne et al., 2004) and synergy/redundancy (Brenner et al., 2000; Liu et al., 2001; Machens et al., 2001; Schneidman et al., 2003), seem intuitive but, in fact, turn out to be confounded. Here, we describe a measure that is not confounded and, in fact, provides an upper bound on the importance of correlations for decoding. In addition, we show why the other two methods, ΔIshuffled and synergy/redundancy, can lead one astray.
Results
Definition of correlations
Correlations in neuronal responses arise from two sources. One is the stimulus: if multiple neurons see the same stimulus, then their responses will be related. For example, if a flash of light is presented to the retina, all of the ON retinal ganglion cells will tend to fire at flash onset and all of the OFF cells at flash offset. On average, ON cells will be correlated with ON cells, OFF cells will be correlated with OFF cells, and ON and OFF cells will be anti-correlated with each other.
These stimulus-induced correlations are typically referred to as “signal correlations” (Gawne and Richmond, 1993) and are defined as follows: responses from N neurons, denoted ri, i = 1,..., N, are signal correlated if and only if

p(r1, r2,..., rN) ≠ p(r1)p(r2)···p(rN),
where p(r1, r2,..., rN) and p(ri) are the joint and single neuron response distributions averaged over stimuli. These distributions are given by the standard relationships p(r1, r2,..., rN) ≡ ∑sp(r1, r2,..., rN|s)p(s) and p(ri) ≡ ∑sp(ri|s)p(s), where s is the stimulus. Here and in what follows, the response from neuron i, ri, is essentially arbitrary; it could be firing rate, spike count, or a binary string indicating when a neuron did and did not fire.
The second source of correlations is common input, which can arise from either a common presynaptic source (e.g., two ON ganglion cells that receive input from the same amacrine cell) or direct or indirect interaction between neurons (e.g., gap junction coupling). Correlations of this type are called “noise correlations” (Gawne and Richmond, 1993), and they differ from signal correlations in that they are a measure of the response correlations on a stimulus-by-stimulus basis. Specifically, responses are noise correlated if and only if

p(r1, r2,..., rN|s) ≠ p(r1|s)p(r2|s)···p(rN|s)

for at least one stimulus s.
A population of neurons will almost always exhibit a mix of signal and noise correlations. For example, two ON ganglion cells far apart on the retina (two cells that share no circuitry) will exhibit no noise correlations, but they will exhibit signal correlations, so long as the stimulus has sufficiently long-range spatial correlations to make them fire together. In contrast, two ON cells with overlapping receptive fields (two cells that do share circuitry) will exhibit both signal and noise correlation.
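To make these two definitions concrete, here is a minimal numerical sketch (our own toy distribution, not data from the paper): two neurons with three possible response values and two equiprobable stimuli, for which both factorization conditions are tested directly.

```python
import numpy as np

# Hypothetical example: 2 neurons, responses take values 0..2, two equiprobable stimuli.
# p_r_given_s[s, r1, r2] = p(r1, r2 | s)
p_s = np.array([0.5, 0.5])
p_r_given_s = np.zeros((2, 3, 3))
p_r_given_s[0, 0, 1] = p_r_given_s[0, 1, 0] = 0.5   # stimulus 1
p_r_given_s[1, 1, 2] = p_r_given_s[1, 2, 1] = 0.5   # stimulus 2

# Noise correlations: does p(r1, r2|s) fail to factor into p(r1|s) p(r2|s) for some s?
noise_correlated = any(
    not np.allclose(p_r_given_s[s],
                    np.outer(p_r_given_s[s].sum(axis=1),    # p(r1|s)
                             p_r_given_s[s].sum(axis=0)))   # p(r2|s)
    for s in range(len(p_s))
)

# Signal correlations: does the stimulus-averaged p(r1, r2) fail to factor into p(r1) p(r2)?
p_r = np.tensordot(p_s, p_r_given_s, axes=1)                # p(r1, r2)
signal_correlated = not np.allclose(p_r, np.outer(p_r.sum(axis=1), p_r.sum(axis=0)))

print("noise correlated: ", noise_correlated)    # True for this example
print("signal correlated:", signal_correlated)   # True for this example
```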
It is undisputed that signal correlations are important for decoding, that is, for inferring stimuli from responses. Our brains are built to take correlations in the outside world and reflect them in correlations in neuronal responses. What is not clear, however, is whether noise correlations are important for decoding. It is these that have been the subject of debate, and it is these that we focus on in this paper.
Testing whether correlations are important
The most straightforward way to test whether noise correlations are important for decoding is to build a decoder that does not take them into account and compare its performance with one that does (Dan et al., 1998; Wu et al., 2000, 2001; Nirenberg et al., 2001; Nirenberg and Latham, 2003; Averbeck and Lee, 2003, 2004). If the decoder that does not take correlations into account performs as well as the one that does, then correlations are not important for decoding. If it does not perform as well, then they are.
To construct a decoder that takes correlations into account (and, because we do not know the algorithm the brain uses, we assume optimal decoding), we first record neuronal responses to many stimuli and build the response distribution, p(r|s). [Here, r ≡ (r1, r2,..., rN) is shorthand for the responses from all N neurons.] We then use Bayes' theorem to construct the probability distribution of stimuli given responses, yielding

p(s|r) = p(r|s)p(s)/p(r),

where p(r) ≡ ∑s p(r|s)p(s).
We will take the approach that p(s|r) is our decoder, although in practice one often takes the additional step of choosing a particular stimulus from this distribution, such as the one that maximizes p(s|r).
To construct a decoder that does not take correlations into account, we perform essentially the same steps we used to construct p(s|r). The only difference is that our starting point is the independent response distribution rather than the true one—the response distribution one would build with knowledge of the single neuron distributions but no knowledge of the correlations. This distribution, denoted pind(r|s), is

pind(r|s) ≡ p(r1|s)p(r2|s)···p(rN|s). (1)
Given pind(r|s), we can then construct the “independent” stimulus distribution, pind(s|r), from Bayes' theorem,

pind(s|r) = pind(r|s)p(s)/pind(r),
where pind(r) ≡∑s pind(r|s)p(s) is the total independent response distribution. By construction, pind(s|r) does not use any knowledge of the noise correlations.
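As a minimal illustration of the two constructions, the sketch below (a hypothetical two-neuron distribution of our own choosing) builds both decoders from the same p(r|s); in this example the responses are correlated, yet, as the final check shows, the two decoders agree on every response that actually occurs.

```python
import numpy as np

# Two neurons, responses indexed 0..2; two equiprobable stimuli. The responses are
# correlated under each stimulus, yet p_ind(s|r) = p(s|r) for every response that occurs.
p_s = np.array([0.5, 0.5])
p_r_given_s = np.zeros((2, 3, 3))
p_r_given_s[0, 0, 1] = p_r_given_s[0, 1, 0] = 0.5
p_r_given_s[1, 1, 2] = p_r_given_s[1, 2, 1] = 0.5

def posterior(p_r_given_s, p_s):
    """Bayes' theorem: p(s|r) = p(r|s)p(s) / sum_s' p(r|s')p(s')."""
    joint = p_s[:, None, None] * p_r_given_s            # p(s, r1, r2)
    p_r = joint.sum(axis=0)                             # p(r1, r2)
    post = joint / np.where(p_r > 0, p_r, 1.0)          # zero where r never occurs
    return post, p_r

# Decoder that knows the correlations.
p_s_given_r, p_r = posterior(p_r_given_s, p_s)

# Decoder that ignores them: replace p(r1, r2|s) by p(r1|s) p(r2|s) (Eq. 1).
p_ind_r_given_s = np.einsum('si,sj->sij',
                            p_r_given_s.sum(axis=2),    # p(r1|s)
                            p_r_given_s.sum(axis=1))    # p(r2|s)
p_ind_s_given_r, _ = posterior(p_ind_r_given_s, p_s)

# Compare the two decoders on the responses that actually occur (p(r) > 0).
occurs = p_r > 0
print(np.allclose(p_s_given_r[:, occurs], p_ind_s_given_r[:, occurs]))   # True
```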
To assess the role of correlations, we simply ask the decoders to decode the true responses (the responses that were actually recorded from the animal) and assess how well they do. Specifically, we take responses from simultaneously recorded neurons and compute both p(s|r) and pind(s|r). If they are the same for all stimuli and responses that could occur, then we know that we do not have to take correlations into account when we decode; if they are different for at least one stimulus-response pair, then we know that we do.
In Figure 2a, we perform this procedure for a pair of neurons with correlated responses. Although the responses are, in fact, highly correlated, the correlations do not matter: if one goes through each of the true responses, one can see that they can be decoded exactly as well using the decoder built from the independent response distribution as they can using the decoder built from the true response distribution. Or, expressed in terms of probabilities, pind(s|r1, r2) = p(s|r1, r2) for all responses that can occur. This demonstrates a key point: cells can be highly correlated without those correlations being important for decoding.
Of course, cells can also be correlated with the correlations being important. An example of this is illustrated in Figure 2c, which shows a pair of neurons whose correlational structure is very similar to that shown in Figure 2a, but different enough so that pind(s|r1, r2) ≠ p(s|r1, r2). In this case, knowledge of correlations is necessary to decode correctly, so correlations are important for decoding.
Dan et al. (1998) were the first ones we know of to assess the role of correlations by building decoders that do and do not take correlations into account: they asked whether a decoder with no knowledge of synchronous spikes would do worse than one with such knowledge. Wu et al. (2000, 2001) later extended the idea so that it could be applied to essentially arbitrary correlational structures, not just synchronous spikes, and, recently, we extended it further and developed an information-theoretic cost function that measured the importance of correlations (Nirenberg et al., 2001; Nirenberg and Latham, 2003). Below we show that this cost function provides an upper bound on the information one loses by ignoring correlations. First, however, we show that the other information-theoretic measures that have been proposed do not do this.
Other approaches
Other approaches for assessing the importance of correlations have been proposed. In this section, we consider two of the most common ones, ΔIshuffled and ΔIsynergy.
Shuffled information
The first measure we consider is ΔIshuffled (see Eq. 2 below). This measure emerged from an approach similar to the one described in the previous section, in the sense that the overall idea is to assess the importance of correlations by removing them and looking for an effect. The difference, however, is in how we look. In the previous section, we looked for an effect by building two decoders, one using the true responses and one using the independent ones. The two decoders are then asked to decode the true responses, and their performance is compared. In the ΔIshuffled approach, the same two decoders are built. What is different, however, is what is decoded: the true decoder is asked to decode the true responses, and the independent one is asked to decode the independent responses. As we will see, this seemingly small difference in what is decoded has a big effect on the outcome.
The quantitative measure associated with this approach, ΔIshuffled, is the difference between the information one obtains from the true responses and the information one obtains from the independent responses (i.e., the “shuffled” responses, whose name comes from the fact that, in experiments, the independent responses are produced by shuffling trials). This difference is given by

ΔIshuffled ≡ I(s; r) − Ishuffled(s; r). (2)
Here, I(s; r) is the mutual information between stimuli and responses (Shannon and Weaver, 1949),

I(s; r) = ∑s ∑r p(s)p(r|s) log2 [p(r|s)/p(r)], (3)
and Ishuffled(s; r) is defined analogously, except that p(r|s) is replaced by pind(r|s) (Eq. 1) and p(r) by pind(r). Specifically,

Ishuffled(s; r) = ∑s ∑r p(s)pind(r|s) log2 [pind(r|s)/pind(r)].
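In code, both informations can be computed directly from p(r|s) and its shuffled counterpart; the sketch below (the same hypothetical two-neuron distribution used in the sketches above) evaluates Equations 2 and 3 and yields ΔIshuffled = ¼ bit even though, for this distribution, correlations are irrelevant for decoding.

```python
import numpy as np

def mutual_info(p_s, p_r_given_s):
    """I(s; r) in bits (Eq. 3), with r the joint response (r1, r2)."""
    joint = p_s[:, None, None] * p_r_given_s            # p(s, r1, r2)
    p_r = joint.sum(axis=0)                             # p(r1, r2)
    ratio = np.where(joint > 0, p_r_given_s / np.where(p_r > 0, p_r, 1.0), 1.0)
    return float(np.sum(joint * np.log2(ratio)))

p_s = np.array([0.5, 0.5])
p_r_given_s = np.zeros((2, 3, 3))
p_r_given_s[0, 0, 1] = p_r_given_s[0, 1, 0] = 0.5
p_r_given_s[1, 1, 2] = p_r_given_s[1, 2, 1] = 0.5

# Shuffled (independent) responses: p_ind(r1, r2|s) = p(r1|s) p(r2|s).
p_ind = np.einsum('si,sj->sij', p_r_given_s.sum(axis=2), p_r_given_s.sum(axis=1))

I_true = mutual_info(p_s, p_r_given_s)      # 1 bit
I_shuffled = mutual_info(p_s, p_ind)        # 3/4 bit
print(I_true - I_shuffled)                  # Delta I_shuffled = 0.25
```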
Because I is computed with knowledge of correlations and Ishuffled is computed without this knowledge, one might expect that when ΔIshuffled ≠ 0, correlations are important, and when ΔIshuffled = 0, they are not. This expectation, however, is not correct. The reason is that when one computes ΔIshuffled, one is computing information from a response distribution that never occurred. As a result, one can end up measuring information about a stimulus from responses that the brain never sees.
A corollary of this is that Ishuffled can actually be larger than I. This is troublesome in light of the numerous proposals that correlations can act as an extra channel of information (Eckhorn et al., 1988; Gray and Singer, 1989; Gray et al., 1989; Vaadia et al., 1995; Meister et al., 1995; deCharms and Merzenich, 1996; Steinmetz et al., 2000), because, if they do, removing them should lead to a loss of information rather than a gain. Also troublesome are the observations that ΔIshuffled can be nonzero even when a decoder does not need to have any knowledge of the correlations to decode the true responses, and it can be zero when a decoder does need to have this knowledge (see below).
To gain deeper insight into what ΔIshuffled does and does not measure, let us consider two simple examples, shown in Figure 3. In both examples, correlations are not important; that is, pind(s|r1, r2) = p(s|r1, r2) for all responses that can occur, but, in one case ΔIshuffled is positive and in the other ΔIshuffled is negative.
In the first example, the true (correlated) response distribution is shown in Figure 3a, and its independent (shuffled) counterpart is shown in Figure 3b. It is not hard to see why ΔIshuffled is positive in this case: when the responses are correlated, they are disjoint and thus perfectly decodable (Fig. 3a). When they are made independent, however, they are no longer disjoint; an overlap region is produced, and responses that land in that region could have been caused by either stimulus (Fig. 3b). Thus, the responses in the independent case provide less information about the stimuli than the responses in the correlated case. A straightforward calculation shows that ΔIshuffled = ¼ (see Appendix A).
This example emphasizes why ΔIshuffled is a misleading measure for decoding. Because the amount of information in the shuffled responses is less than that in the true responses, one gets the impression that the decoder built from the shuffled responses would not perform as well as the one built from the true responses. In reality, however, it does perform as well: every one of the true responses is decoded exactly the same using the two decoders. This is reflected in the fact that pind(s|r1, r2) is equal to p(s|r1, r2) for all responses that actually occur.
This is an important point. When one takes a set of correlated responses and makes them independent, one creates a new response distribution. However, the fact that this new response distribution, pind(r1, r2|s), is not equal to the true one, p(r1, r2|s), does not imply the reverse, that pind(s|r1, r2) ≠ p(s|r1, r2). In fact, as the above example indicates and as we showed in Figure 2a, it is easily possible to have pind(s|r1, r2) = p(s|r1, r2) when pind(r1, r2|s) ≠ p(r1, r2|s). This is a surprising finding, and something that would not have been (could not have been) revealed by ΔIshuffled.
In the second example, the true response distribution is shown in Figure 3c, and its independent counterpart is shown in Figure 3d. Here, ΔIshuffled < 0: when the responses are correlated, they land in the overlap region (the region in which the responses could have been caused by either stimulus) on one-half the trials (Fig. 3c), whereas when they are independent, they land in the overlap region on one-quarter of the trials (Fig. 3d). Consequently, when the responses are independent, they provide more information on average (because they are ambiguous less often), and a straightforward calculation yields ΔIshuffled = -¼ (see Appendix A). It is also easy to show that a decoder does not need to know about these correlations: regardless of whether a decoder assumes the neurons are uncorrelated, responses in the overlap region provide no information about the stimulus and responses not in the overlap region are decoded perfectly (see Appendix A). Thus, as with the previous example, the fact that ΔIshuffled is negative is easy to misinterpret: it gives the impression that the correlations are important when they are not.
It is not hard, by extending these examples, to find cases in which ΔIshuffled = 0 when correlations actually are important [Nirenberg and Latham (2003), their supporting information]. This is because the shuffled information can be positive for some parts of the code and negative for others, producing cancellations that make ΔIshuffled = 0. In fact, all combinations are possible: ΔIshuffled can be positive, negative, or zero both when correlations are important and when they are not [Nirenberg and Latham (2003), their supporting information], making this quantity a bad one for assessing the role of correlations in the neural code.
This is not to say that ΔIshuffled is never useful; it can be used to answer the question: given a correlational structure, would more information be transmitted using that structure or using independent responses (Abbott and Dayan, 1999; Sompolinsky et al., 2001; Wu et al., 2002)? This is interesting from a theoretical point of view, because it sheds light on issues of optimal coding, but it is a question about what could be rather than what is.
Synergy and redundancy
Another common, but less direct, measure that has been proposed to assess the role of correlations is the synergy/redundancy measure, denoted ΔIsynergy (Brenner et al., 2000; Machens et al., 2001; Schneidman et al., 2003). It is defined to be

ΔIsynergy ≡ I(s; r1, r2,..., rN) − ∑i I(s; ri), (4)
where I, the mutual information, is given in Equation 3.
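For a concrete calculation, the sketch below evaluates Equation 4 for a three-stimulus, two-neuron distribution of the kind analyzed in Figure 4a and Appendix A (our own encoding of that example onto a discrete grid): the joint code is synergistic, ΔIsynergy = 2 − log2 3 ≈ 0.42 bits, even though, as discussed next, the correlations are not needed for decoding.

```python
import numpy as np

def mutual_info(p_s, p_r_given_s):
    """I(s; r) in bits; p_r_given_s is indexed [s, r...] with any number of response axes."""
    shape = (-1,) + (1,) * (p_r_given_s.ndim - 1)
    joint = p_s.reshape(shape) * p_r_given_s
    p_r = joint.sum(axis=0)
    ratio = np.where(joint > 0, p_r_given_s / np.where(p_r > 0, p_r, 1.0), 1.0)
    return float(np.sum(joint * np.log2(ratio)))

# Three equiprobable stimuli; each stimulus produces one of two disjoint response pairs.
p_s = np.full(3, 1.0 / 3.0)
p_r_given_s = np.zeros((3, 3, 3))
p_r_given_s[0, 1, 2] = p_r_given_s[0, 2, 1] = 0.5    # s1
p_r_given_s[1, 0, 2] = p_r_given_s[1, 2, 0] = 0.5    # s2
p_r_given_s[2, 0, 1] = p_r_given_s[2, 1, 0] = 0.5    # s3

I_joint = mutual_info(p_s, p_r_given_s)              # log2(3)
I_1 = mutual_info(p_s, p_r_given_s.sum(axis=2))      # I(s; r1) = log2(3) - 1
I_2 = mutual_info(p_s, p_r_given_s.sum(axis=1))      # I(s; r2) = log2(3) - 1
print(I_joint - (I_1 + I_2))                         # Delta I_synergy = 2 - log2(3) > 0
```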
Positive values of ΔIsynergy are commonly assumed to imply that correlations are important, a claim made explicitly by Schneidman et al. (2003). This claim, however, breaks down with close examination, for essentially the same reason as in the previous section: ΔIsynergy fails to take into account that pind(r|s) can assign nonzero probability to responses that do not occur. To see this explicitly, we will consider an example in which ΔIsynergy is positive but correlations are not important; that is, pind(s|r1, r2) = p(s|r1, r2) for all true responses.
The example is illustrated in Figure 4. Figure 4a shows the correlated distribution. As one can see, the responses form a disjoint set, so every pair of responses corresponds to exactly one stimulus. Figure 4b shows the independent distribution. Here, the responses do not form a disjoint set. That is because responses along the diagonal, in which both neurons produce approximately the same output, can be caused by more than one stimulus. On the surface, then, it appears that, without knowledge of the correlated distribution, one cannot decode all of the responses perfectly. However, this is not the case: as in Figures 2a and 3a, the responses that could have been caused by more than one stimulus (the ones along the diagonal) never happen. Thus, all responses that do occur can be decoded perfectly. This means that we can decode exactly as well with no knowledge of the correlational structure as we can with full knowledge, even though ΔIsynergy > 0.
If ΔIsynergy can be positive when one can decode optimally with no knowledge of the correlations [that is, when pind(s|r) = p(s|r)], then what does synergy really tell us? To answer this, note that it is a general feature of population codes that observing more neurons provides more information. How much more, however, spans a large range and depends in detail on the neural code. At one end are completely redundant codes, for which observing more neurons adds no information (for example, Fig. 3c). At the other end are synergistic codes, for which observing more neurons results in a large increase in information. Thus, the degree of synergy (the value of ΔIsynergy) tells us where along this range a neural code lies. It does not, however, tell us about the importance of correlations.
Although we focused here on showing that ΔIsynergy can be positive when correlations are not important for decoding, it can also be shown, and has been shown in previous work, that ΔIsynergy can be negative or zero when correlations are not important [when pind(s|r) = p(s|r)]. Likewise, it can be positive, negative, or zero when correlations are important [when pind(s|r) ≠ p(s|r)] [Nirenberg and Latham (2003), their supporting information]. Thus, ΔIsynergy is not a very useful measure for assessing the importance of correlations for decoding.
Redundancy reduction
A long-standing proposal about early sensory processing is that one of its primary purposes is to reduce redundancy (Attneave, 1954; Barlow, 1961; Srinivasan et al., 1982; Atick and Redlich, 1990; Atick, 1992) (but see Barlow, 2001). Given that codes with ΔIsynergy < 0 are referred to as “redundant,” one might interpret this proposal to mean that ΔIsynergy should be maximized, so that the code exhibits as little redundancy as possible. Unfortunately, redundant has two meanings. One is “not synergistic” (ΔIsynergy < 0). The other, as originally defined by Shannon and Weaver (1949), is “not making full use of a noisy channel.” Under the latter definition, redundancy, denoted R, is given by

R = 1 − I/C,
where I is, as above, mutual information, and C is channel capacity: the maximum value of the mutual information with respect to p(s) for fixed p(r|s).
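For completeness, C can be computed numerically, for example with the Blahut-Arimoto algorithm; the sketch below is a generic implementation under our own conventions (the channel matrix and the suboptimal stimulus prior are hypothetical), not a procedure taken from the paper.

```python
import numpy as np

def mutual_info(p_s, W):
    """I(s; r) in bits for prior p_s and channel W[s, r] = p(r|s)."""
    p_r = p_s @ W
    ratio = np.where(W > 0, W / np.where(p_r > 0, p_r, 1.0), 1.0)
    return float(p_s @ (W * np.log2(ratio)).sum(axis=1))

def capacity(W, n_iter=2000):
    """Channel capacity C = max over p(s) of I(s; r), via Blahut-Arimoto iterations."""
    p_s = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(n_iter):
        p_r = p_s @ W
        q = p_s[:, None] * W / np.where(p_r > 0, p_r, 1.0)                 # q(s|r)
        log_c = (W * np.where(W > 0, np.log(np.where(q > 0, q, 1.0)), 0.0)).sum(axis=1)
        p_s = np.exp(log_c - log_c.max())
        p_s /= p_s.sum()
    return mutual_info(p_s, W), p_s

# Hypothetical noisy binary channel and a non-optimal stimulus prior.
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])
p_s = np.array([0.3, 0.7])
I = mutual_info(p_s, W)
C, _ = capacity(W)
print("R = 1 - I/C =", 1.0 - I / C)
```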
The redundancy-reduction hypothesis refers to minimizing R, not maximizing ΔIsynergy. This is sensible: minimizing R corresponds to making the most efficient use of a channel. Maximizing ΔIsynergy also seems sensible on the surface, because maximum synergy codes can be highly efficient [in fact, they can transmit an infinite amount of information (Fig. 5)]. However, biological constraints prevent their implementation, and they are almost always effectively impossible to decode (Fig. 5b). Thus, maximization principles involving ΔIsynergy are unlikely to yield insight into the neural code.
Quantifying the importance of correlations
Our analysis so far has led us to the following statement: correlations are important for decoding if pind(s|r) ≠ p(s|r). However, what if we want to assess how important they are? Intuitively, we should be able to do this by computing the distance between pind(s|r) and p(s|r). If these two distributions are close, then correlations should be relatively unimportant; if they are far apart, then correlations should be important. The question we address now is: what do we mean by “close”?
In previous work (Nirenberg et al., 2001; Nirenberg and Latham, 2003), we argued that the appropriate measure of close is the Kullback-Leibler distance averaged over responses. This distance, which we refer to as ΔI, is given by

ΔI ≡ ∑r p(r) ∑s p(s|r) log2 [p(s|r)/pind(s|r)]. (5)
[See also Panzeri et al. (1999) and Pola et al. (2003), who defined the same quantity, but used the notation Icor-dep instead of ΔI].
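Concretely, Equation 5 can be evaluated directly from p(r|s) and p(s). The sketch below uses two toy distributions of our own: the first is the correlated-but-decodable example from the sketches above, the second is a variant in which the correlations carry all of the information; the code returns ΔI = 0 for the first and ΔI = 1 bit for the second.

```python
import numpy as np

def delta_I(p_s, p_r_given_s):
    """Eq. 5: sum_r p(r) sum_s p(s|r) log2[ p(s|r) / p_ind(s|r) ], in bits."""
    joint = p_s[:, None, None] * p_r_given_s                       # p(s, r1, r2)
    p_r = joint.sum(axis=0)
    p_s_given_r = joint / np.where(p_r > 0, p_r, 1.0)

    # Independent model: p_ind(r1, r2|s) = p(r1|s) p(r2|s), then Bayes' theorem.
    p_ind = np.einsum('si,sj->sij', p_r_given_s.sum(axis=2), p_r_given_s.sum(axis=1))
    joint_ind = p_s[:, None, None] * p_ind
    p_r_ind = joint_ind.sum(axis=0)
    p_ind_s_given_r = joint_ind / np.where(p_r_ind > 0, p_r_ind, 1.0)

    # Whenever p(s, r) > 0, p_ind(s|r) > 0 as well, so the ratio below is well defined.
    ratio = np.where(joint > 0,
                     p_s_given_r / np.where(p_ind_s_given_r > 0, p_ind_s_given_r, 1.0), 1.0)
    return float(np.sum(joint * np.log2(ratio)))

p_s = np.array([0.5, 0.5])

corr_but_decodable = np.zeros((2, 3, 3))
corr_but_decodable[0, 0, 1] = corr_but_decodable[0, 1, 0] = 0.5
corr_but_decodable[1, 1, 2] = corr_but_decodable[1, 2, 1] = 0.5
print(delta_I(p_s, corr_but_decodable))   # 0.0: ignoring correlations costs nothing

corr_essential = np.zeros((2, 3, 3))
corr_essential[0, 0, 0] = corr_essential[0, 1, 1] = 0.5    # s1: equal responses
corr_essential[1, 0, 1] = corr_essential[1, 1, 0] = 0.5    # s2: unequal responses
print(delta_I(p_s, corr_essential))       # 1.0 bit: here the correlations are essential
```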
This measure has a number of desirable properties. First, because ΔI is weighted by p(r), it does the sensible thing and weights responses by how likely they are to occur, with zero weight for responses that never occur (the ones the brain never sees). [This feature also takes care of the problem that p(s|r) is undefined when p(r) = 0.] Second, ΔI is bounded from below by zero, which makes it a nice candidate for a cost function. Third, it is zero if and only if pind(s|r) = p(s|r) for every response that actually occurs (Nirenberg and Latham, 2003), so ΔI = 0 implies that correlations are completely unimportant for decoding. (It is not hard to show that ΔI = 0 for the examples given in Figs. 2a, 3, 4; see Appendix A.) Fourth, it is a cost function that can be thought of in terms of yes/no questions, a common and intuitive way of expressing information theoretic quantities. Specifically, ΔI is the cost in yes/no questions for not knowing about correlations: if one were guessing the stimulus based on the neuronal responses, r, then it would take, on average, ΔI more questions to guess the stimulus if one knew nothing about the correlations than if one knew everything about them (Nirenberg et al., 2001; Nirenberg and Latham, 2003).
The fourth property makes it possible to compare ΔI with the mutual information, I, between stimuli and responses, because I is the reduction in the average number of yes/no questions associated with observing neuronal responses (Cover and Thomas, 1991). The observation that ΔI can be expressed in terms of yes/no questions thus led us to identify the ratio ΔI/I as a measure of the relative importance of correlations. Here we solidify this identification by showing that ΔI is a rigorous upper bound on the information loss. This result is useful because it allows us to interpret the spate of recent experiments in which ΔI/I was found to be on the order of 10% or less (Nirenberg et al., 2001; Petersen et al., 2001, 2002; Panzeri et al., 2002b; Pola et al., 2003): in those experiments, one could ignore correlations when decoding and lose at most ∼10% of the information.
ΔI is an upper bound on information loss
Showing that ΔI is an upper bound on information loss is not straightforward because classical information theory deals with true probability distributions. We, however, want to compute information when a decoder is based (via Bayes' theorem) on the wrong probability distribution: pind(r|s) rather than p (r|s). To do this, we use what is really a very simple idea, one that is closely related to discriminability. The idea is that if a decoder has knowledge of the full correlational structure, it will (typically) be able to make finer discriminations than if it does not, and so will be able to provide more information about the stimulus. For example, a decoder based on p(r|s) might have a discrimination threshold of, say, 1°, whereas one based on pind(r|s) might have a threshold that is twice as large, say 2°. The link between discrimination thresholds and information theory is that the factor of two decrease in the ability to discriminate implies a 1 bit decrease in information, because one-half as many orientations are distinguishable. For this example, then, the information loss associated with ignoring correlations, ΔI, is 1 bit.
To generalize this idea so that it can be applied to any stimulus set, not just simple ones, such as oriented bars, we adopt an information-theoretic construct known as a “codebook.” In information theory, a codebook consists of a set of codewords, each of which is a list of symbols. These codewords are sent over a noisy channel to a “receiver,” who also has access to the codebook. The job of the receiver is to examine the corrupted symbols and determine which codeword was sent. In our application, each symbol is a different stimulus, so a codeword consists of a set of stimuli sent in a particular order. For example, in the case of oriented bars, a codebook might consist of two length-3 codewords, one corresponding to bars that are presented sequentially at 2°, 1°, and 3°, and another to bars presented at 3°, 2°, and 1°. The job of the receiver is to examine the neuronal responses and determine which of the two orders was presented.
The reason we take this approach is that the number of codewords that can be put in the codebook before the receiver starts making mistakes is directly related to the mutual information between stimuli and responses. This follows from what is probably the central, and most profound, result in information theory, which is: if each codeword in the codebook consists of n stimuli, then the upper bound on the number of codewords that can be sent almost error-free is 2^(nI), where I is the mutual information between stimuli and responses (Shannon and Weaver, 1949; Cover and Thomas, 1991). Importantly, and this is the critical insight that allows us to relate ΔI to information loss, the upper bound depends on the probability distribution used by the receiver. If the receiver uses the true response distribution, p(r|s), to build a decoder, then the upper bound really is 2^(nI). If, on the other hand, the receiver uses an incorrect response distribution, then the upper bound is typically smaller: 2^(nI*), where I* ≤ I (Merhav et al., 1994). The information loss associated with using the wrong distribution, which in our case is pind(r|s) rather than p(r|s), is I - I*.
Although this approach seems abstract, in fact, it is closely related to discriminability. For example, if orientations separated by 1° can almost always be distinguished, then we can put ∼180 codewords of length 1 in our codebook (1°, 2°,..., 180°), 180² of length 2 [(1°, 1°), (1°, 2°),..., (180°, 180°)], and so on. If, on the other hand, orientations separated by only 0.5° can almost always be distinguished, then the number of codewords would be 360 and 360², respectively. Thus, the number of codewords in our codebook tells us directly how far apart two orientations must be before they can be easily discriminated.
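In symbols (our notation): with a discrimination threshold of θ degrees over the 180° range of orientations, the number of length-k codewords is mk = (180°/θ)^k, so the information per stimulus is (1/k) log2 mk = log2(180°/θ) bits; going from θ = 1° to θ = 2° halves the number of distinguishable orientations and therefore costs exactly log2 2 = 1 bit, in agreement with the discrimination-threshold argument above.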
The advantage of the codebook/codeword approach over discrimination thresholds is that we can compute information loss simply by counting codewords. Here, we outline the main steps needed to do this; the details are provided in Appendix B. Because this is a very general approach (it can be used to evaluate information loss associated with any wrong distribution, not just one based on the independent responses), in what follows, we use q(r|s) to denote the wrong conditional response distribution rather than pind(r|s). At the end, we can return to our particular problem by replacing q(r|s) with pind(r|s).
As discussed above, we will construct codewords that consist of n stimuli presented in a prespecified order that is known to the receiver, with each codeword having equal probability. We will consider discrete stimuli that come from an underlying distribution p(s), and the stimuli that make up each codeword will be drawn independently from this distribution. We will use w = 1, 2,..., m to label the codewords and m to denote the number of codewords, so our codebook has the form

codeword 1: s1(1), s2(1),..., sn(1)
codeword 2: s1(2), s2(2),..., sn(2)
...
codeword m: s1(m), s2(m),..., sn(m),

where si(w) denotes the ith stimulus of codeword w.
For example, if the stimuli were orientations at even multiples of 1°, then a particular six-symbol codeword might be (2°, 10°, 7°, 1°, 14°, 2°).
The question we address here is: if the receiver uses q(r|s) to build a decoder, how large can we make m before the error rate becomes appreciable? The natural way to answer this is via Bayes' theorem: given a uniform prior on the codewords (a result of the fact that the codewords are sent with equal probability) and a set of observations, r1, r2,..., rn, then the probability the receiver uses for a particular codeword, denoted q(w|r1, r2,..., rn), is

q(w|r1, r2,..., rn) ∝ p(w) ∏i q(ri|si(w)) ∝ ∏i q(ri|si(w)), (6)
where the last ∝ follows from our uniform prior, which is that p(w) is independent of w. Given Equation 6, the optimal estimate of which message was sent, denoted ŵ, is

ŵ = argmaxw ∏i q(ri|si(w)).
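The decoder defined by Equation 6 and this argmax rule is easy to simulate. The sketch below is our own toy setup (a small discrete channel with hypothetical numbers); the model distribution q is set equal to the true p here, and substituting a different q, for example one that ignores correlations, simulates decoding with the wrong distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy channel: 3 stimuli, 4 response values. p is the true channel; q is the decoder's model.
p = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
q = p.copy()                                     # swap in a different q to model mismatch
p_stim = np.full(3, 1.0 / 3.0)

n, m = 20, 16                                    # codeword length, number of codewords
codebook = rng.choice(3, size=(m, n), p=p_stim)  # codeword w is (s1(w), ..., sn(w))

def send_and_decode(w_star):
    # Responses to the sent codeword: r_i ~ p(r | s_i(w*)).
    r = np.array([rng.choice(4, p=p[s]) for s in codebook[w_star]])
    # Receiver: w_hat = argmax_w prod_i q(r_i | s_i(w)), computed in log space.
    log_post = np.log(q[codebook, r]).sum(axis=1)
    return int(np.argmax(log_post))

errors = sum(send_and_decode(w) != w for w in range(m))
print("decoding errors:", errors, "out of", m, "codewords sent")
```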
We can now compute the probability of making a decoding error. If message w* is sent, then the probability, P1, of an error for a particular codeword w ≠ w* is

P1 = Prob[∏i q(ri|si(w)) ≥ ∏i q(ri|si(w*))]. (7)
As we show in Appendix B, P1 is independent of which codeword, w*, is sent. Thus, for any w*, the probability of making at least one error when we have m codewords in our codebook, denoted Perror, is

Perror = 1 − (1 − P1)^m. (8)
(Equation 8 should really have m - 1 in the exponent. However, because m is always large, here and in what follows, we will make no distinction between m and m - 1.)
To see how this equation gives us the upper bound on the transmitted information, let us define I* via the relationship

P1 ≡ 2^(−nI*), (9)
and let

m = 2^(n(I* − ϵ)), (10)
where ϵ is a small, positive constant. Then, inserting Equations 9 and 10 into Equation 8, we find, after straightforward algebra, that

Perror = 1 − [1 − 2^(−nI*)]^(2^(n(I* − ϵ))) ≈ 1 − exp(−2^(−nϵ)). (11)
What Equation 11 tells us is that as n → ∞ the probability of an error vanishes, no matter how small ϵ is. Consequently, 2^(nI*) is an upper bound on m and thus an upper bound on the number of codewords that can be sent with vanishingly small probability of error. This in turn implies that I* is the information associated with using the wrong probability distribution. Using Equation 9 to express I* in terms of P1, we have

I* = −(1/n) log2 P1, (12)
where P1 is given by Equation 7.
Calculating I* thus amounts to solving a math problem: computing P1. This we do in Appendix B. Although there is no closed-form expression for I*, we are able to show that

I − I* ≤ ΔI.
In other words, ΔI is an upper bound on the information loss associated with using the wrong distribution. When the wrong distribution is one that has no information about the correlations, then ΔI is an upper bound on the information loss associated with ignoring correlations.
Discussion
Determining whether correlations are important for decoding, both across spike trains from multiple neurons and within spike trains from single neurons, is a critical problem in neural coding. Addressing this problem requires a quantitative measure, one that takes as input neural data and produces as output a number that tells us how important correlations are. Efforts to develop such a measure, however, have led to a great deal of controversy, in large part because those efforts typically rely on intuitive notions. Two measures in particular, ΔIshuffled and ΔIsynergy, fall into this category. The first, ΔIshuffled, is the difference between the information carried by an ensemble of cells and the information that would be carried by the same cells if their responses were conditionally independent (Eq. 2). The second, ΔIsynergy, is the difference between the information carried by an ensemble of cells and the sum of the information carried by the individual cells (Eq. 4). Both sound reasonable at first glance, but deeper analysis raises serious concerns. In particular, beyond the intuitive meaning that has been attached to these quantities, no rigorous link has been made to the importance of correlations for decoding. In addition, as we showed in Results, ΔIshuffled and ΔIsynergy can take on essentially any values (from positive to zero to negative), both when one needs to know about correlations to decode and when one does not.
In this paper, we took a different strategy. Following the work of others (Dan et al., 1998; Panzeri et al., 1999; Wu et al., 2001, 2002), we approached the problem in the context of what is arguably the central question in neural coding, which is: how do we decode the activity of populations of neurons? Importantly, the answer depends almost exclusively on whether knowledge of correlations is necessary for constructing optimal decoding algorithms. If it is, then we will need to develop parametric models of the correlational structure in neuronal responses; if it is not, then we can use methods that ignore correlations and so are much simpler. We thus developed a measure based on the simple question: do we need to know about correlations to construct optimal decoding algorithms? We showed that the answer is determined by the value of ΔI (Eq. 5). Specifically, if ΔI = 0, then knowledge of correlations is not necessary for optimal decoding, and if ΔI > 0, then it is necessary, and its value places an upper bound on the amount of information lost by ignoring correlations. Thus, ΔI tells us whether correlations are important in a sense that has real and immediate impact on our strategies for decoding populations of neurons.
Recently, Schneidman et al. (2003) criticized this approach. Their criticism was, however, based on a single, very strong premise, which was that ΔIsynergy is the correct measure of the importance of correlations (see Schneidman et al., 2003, their Fig. 6 and associated text). Given this premise, they concluded that any measure that does not give the same answer as ΔIsynergy must be flawed. What was missing, however, was an explanation of why their premise is correct; as far as we know, neither they nor anyone else has demonstrated that ΔIsynergy is the correct measure of the importance of correlations.
Finally, we should point out that identifying a relevant measure (ΔI) is the first step in determining the role of correlations. Methods must be developed to apply this measure to populations of neurons. Here the underlying idea, which is that one can assess the role of correlations by building decoders that ignore them, will be just as useful as the measure itself. This is because one does not actually have to calculate ΔI [a difficult estimation problem, especially for population codes (Paninski, 2003, 2004; Nemenman et al., 2004)], but instead one can build decoders that do and do not take some aspect of correlations into account. If taking correlations into account improves decoding accuracy, then correlations are important for decoding; otherwise, they are not. This approach has already been used successfully for population decoding in motor cortex (Averbeck and Lee, 2003), and we expect it to be the method of choice in the future.
Appendix A: Calculation of ΔI, ΔIshuffled and ΔIsynergy
In this appendix, we compute ΔIshuffled for the probability distributions shown in Figure 3, a and c, ΔIsynergy for the probability distribution shown in Figure 4a, and ΔI for both.
Our first step is to turn the continuous response distributions illustrated in these figures into discrete ones, which we can do because the relevant aspect of a response is which box it lands in. Thus, we will let both r1 and r2 take on integer values that range from 1 to 3, with the former labeling the column and the latter the row. With this scheme, the response (r1, r2) = (2, 3), for example, refers to the box that is in the second column and the top row.
For simplicity, we will assume that the stimuli occur with equal probability. Thus, p(s1) = p(s2) = ½ in Figure 3 and p(s1) = p(s2) = p(s3) = ⅓ in Figure 4. We will also assume that, given a stimulus, the responses that can occur, occur with equal probability. Loosely speaking, this means that all boxes of the same color are equally likely. For example, in Figure 3a, p(2, 3|s2) = p(3, 2|s2) = ½, and, in Figure 4b, p(2, 2|s1) = p(2, 3|s1) = p(3, 2|s1) = p(3, 3|s1) = ¼.
Of the three information-theoretic quantities, ΔI (Eq. 5) is the easiest to compute, so we will start with it. Consider first the distributions in Figures 3, a and b, and 4, a and b. For these, all responses that actually occur (the ones in Figs. 3a and 4a, respectively) are uniquely decodable regardless of whether the decoder assumes independence, which is clear by examining Figures 3, a and b, and 4, a and b. Thus, pind(s|r1, r2) = p(s|r1, r2), which implies, via Equation 5, that ΔI = 0.
For the distributions in Figure 3, c and d, the situation is marginally more complex. The upper right and lower left responses, (r1, r2) = (3, 3) and (1, 1), respectively, are uniquely decodable whether or not the decoder assumes independence. The center response, (r1, r2) = (2, 2), is not. Instead, that response gives no information about the stimulus, meaning that both s1 and s2 are equally likely, and this is true regardless of whether the decoder assumes independence (one can compute this directly, or use symmetry between s1 and s2). Thus, for all responses that actually occur (those in Fig. 3c), pind(s|r1, r2) = p(s|r1, r2), and again ΔI = 0.
Our next task is to compute ΔIshuffled (Eq. 2) for Figure 3, a,b and c,d. Consider first Figure 3a. The responses in this figure are disjoint, so they are uniquely decodable. Thus, the mutual information is equal to the entropy of the stimulus, which is 1 bit. For the corresponding shuffled distribution, Figure 3b, recall that, for each stimulus, all squares of the same color occur with equal probability. Using this fact and examining Figure 3b, we see that, on ¾ of the trials, the responses are uniquely decodable (they provide 1 bit of information), whereas on ¼ of the trials, the responses provide no information. Thus, the responses convey, on average, ¾ bits. This gives ΔIshuffled = 1 - ¾ = ¼.
Turning now to Figure 3c, we see that, on one-half of the trials, the responses provide 1 bit of information (because they are uniquely decodable), and, on the other half, they provide no information. Thus, the mutual information is ½ bits. Because Figure 3d is the same as 3b, the mutual information of the shuffled distribution is again ¾ bits, and ΔIshuffled = ½ - ¾ = -¼.
Our last task is to compute ΔIsynergy (Eq. 4) for the distribution shown in Figure 4a. As in Figure 3a, the responses are uniquely decodable. Consequently, the mutual information, I(s; r1, r2), is equal to the entropy of the stimulus, which is log2 3 bits. To compute the mutual information between one of the responses and the stimuli, note that receiving any one response reduces the number of possible stimuli from three to two. For example, if we observe that r1 = 2, then we know that either s1 or s3 was presented, both with equal probability. Thus, the mutual information is (log2 3 - log2 2) bits. Because I(s; r1) = I(s; r2), it follows that ΔIsynergy = log2 3 − 2(log2 3 − log2 2) = 2 − log2 3 ≈ 0.42 bits, which is positive.
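The arithmetic above is easy to verify numerically. The sketch below encodes the Figure 3 and Figure 4 distributions onto 0-indexed 3 × 3 grids (our own encoding, following the conventions of this appendix) and reproduces ΔIshuffled = ¼, ΔIshuffled = −¼, and ΔIsynergy = 2 − log2 3.

```python
import numpy as np

def I(p_s, p_r_given_s):
    """Mutual information in bits; p_r_given_s is indexed [s, r...]."""
    shape = (-1,) + (1,) * (p_r_given_s.ndim - 1)
    joint = p_s.reshape(shape) * p_r_given_s
    p_r = joint.sum(axis=0)
    ratio = np.where(joint > 0, p_r_given_s / np.where(p_r > 0, p_r, 1.0), 1.0)
    return float(np.sum(joint * np.log2(ratio)))

def shuffle(p):
    """p_ind(r1, r2|s) = p(r1|s) p(r2|s)."""
    return np.einsum('si,sj->sij', p.sum(axis=2), p.sum(axis=1))

half, third = np.array([0.5, 0.5]), np.full(3, 1.0 / 3.0)

fig3a = np.zeros((2, 3, 3))
fig3a[0, 0, 1] = fig3a[0, 1, 0] = fig3a[1, 1, 2] = fig3a[1, 2, 1] = 0.5

fig3c = np.zeros((2, 3, 3))
fig3c[0, 0, 0] = fig3c[0, 1, 1] = fig3c[1, 1, 1] = fig3c[1, 2, 2] = 0.5

fig4a = np.zeros((3, 3, 3))
fig4a[0, 1, 2] = fig4a[0, 2, 1] = 0.5
fig4a[1, 0, 2] = fig4a[1, 2, 0] = 0.5
fig4a[2, 0, 1] = fig4a[2, 1, 0] = 0.5

print(I(half, fig3a) - I(half, shuffle(fig3a)))    #  0.25  (Figs. 3a vs 3b)
print(I(half, fig3c) - I(half, shuffle(fig3c)))    # -0.25  (Figs. 3c vs 3d)
d_syn = I(third, fig4a) - I(third, fig4a.sum(axis=2)) - I(third, fig4a.sum(axis=1))
print(d_syn)                                       #  2 - log2(3), approx. 0.415 (Fig. 4a)
```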
Appendix B: Information-theoretic cost of ignoring correlations
The goal in this appendix is to compute P1, from which we can calculate I*, the information associated with a decoder that uses the wrong distribution. This computation is divided into two parts. In the first, we derive a set of equations whose solution gives us I*. This is a rederivation of a result shown originally by Merhav et al. (1994) and is a straightforward application of large-deviation theory. We include the rederivation here both for completeness and because we use a simpler (although slightly less rigorous) method than that of Merhav and colleagues. Unfortunately, the equations for I* have no closed-form solution. However, they can be used to show that I - I*, the information loss associated with ignoring correlations, is bounded by ΔI. This is the focus of the second part.
Derivation of equations for I*
According to Equation 12, to compute I* we first need to compute P1, the latter given by Equation 7. Our first step, performed for convenience only, is to express P1 in terms of sums of logs of probabilities rather than products of probabilities. This leads to the relationship

P1 = Prob((1/n) ∑i log2 q(ri|si(w)) ≥ (1/n) ∑i log2 q(ri|si(w*))). (B1)
The main idea behind the computation of P1 is that the first term inside the parentheses is a random variable, and the probability that it is greater than the second can be computed using large-deviation theory. This is somewhat tricky because both terms in Equation B1 are random variables. Fortunately, it turns out that we can replace the second one by its average. Intuitively, this is because we are interested in the probability of an error for a single sent codeword, so outliers are not important. Placing this intuition on a rigorous footing, however, requires a fair amount of effort and is the subject of the next section. Those who are satisfied with the intuitive argument should skip directly to the section Large-deviation theory applied to Equation B11, which follows Equation B11.
Justification for averaging the second term in Equation B1
The assertion that the second term in Equation B1 should be replaced by its average while the first should not seems, at first glance, oddly asymmetric: after all, both terms are random variables, so why should we replace one by its average and not the other? The intuitive answer, as stated above, is that only one codeword (w*) is sent at a time, but, for each one we send, there are a large number of possible incorrect ones (all the rest of the w). The goal of this section is to make that intuitive answer mathematically precise.
Our starting point is Equation 8 for the probability, Perror, of making at least one error on m codewords. That equation made the implicit assumption that P1 does not depend on which codeword, w*, was sent. To verify that this assumption is correct, we momentarily drop it and instead average over all possible codewords w*. When we do that, we will see that P1 is effectively independent of w*.
From the point of view of decoding error, the only relevant aspect of w* is its effect on the second term in Equation B1. We thus define (B2)
Because both s and ri are discrete variables, it follows that is also a discrete variable. Letting denote its probability distribution, the expression for the probability of making at least one error is (B3)
where J* is defined via the relationship (B4)
Note that is just the probability of an error, , for a particular codeword, w*. The particular codeword in this case is the one that corresponds to via Equation B2.
Unfortunately, we cannot compute analytically the sum in Equation B3. What we can do, however, is find the maximum number of codewords, m, for which vanishes exponentially fast with n. The information, I*, is then equal to the log of this number divided by n (see Eq. 10).
We begin by letting m = 2^(n(I* − ϵ)), as in Equation 10. Then, applying the formula (1 − x)^y = exp[y ln(1 − x)], Perror may be written (B5)
Our next step is to make the ansatz (B6)
where is the mean value of . What we do now is show that, with this ansatz, is exponentially small in n so long as ϵ > 0. Inserting Equation B6 into B5, we have (B7)
Because the expression in curly brackets is, for large n, essentially a step function in J*, we need to treat the sum over differently on either side of the step. We thus divide it into two terms: one with and one with , where is defined via the relationship (B8)
Equation B7 then becomes
The reason for this particular division into two terms is that we will be able to find exponentially small upper bounds for both. To find the bound for the first term, we use the fact that the expression in curly brackets is a decreasing function of . Then, because is an increasing function of (see Eq. B4), we can upper bound this term by replacing by and then summing over all . For the second term, an upper bound is easily provided by simply dropping the exponential piece, which can only make this term smaller. Performing these manipulations and using Equation B8 for , we find that (B9)
where G(x) ≡ ln(1 - x)/(-x). Note that limx→0G(x) = 1.
To find the large n behavior of the second term in Equation B9, we use the fact that , which follows from the definition of (Eq. B8) and the fact that is an increasing function of . Moreover, for small ϵ, , independent of n. Thus, Hoeffding's theorem (Hoeffding, 1963) tells us that the second term in Equation B9 is bounded by , where κ is a positive, n-independent constant. We thus have
Finally, because G(x) approaches 1 in the small x limit, it follows that the first two terms in this expression reduce to . Thus, in the limit of small ϵ (where κϵ² ≪ ϵ/2), we have (B10)
Equation B10 is the main result of this section. It tells us that if m is chosen according to the ansatz in Equation B6, then, for any fixed ϵ, the probability of the receiver making an error is exponentially small in n. Thus, 2^(nI*) is an upper bound on the number of codewords that can be sent with vanishingly small probability of error, and we identify I* as the information associated with the wrong distribution. What this means is that, given the definition of J* (Eq. B4), we can compute I* by replacing the second term on the right-hand side of Equation B1 with its average. As promised, then, P1 is independent of w*. Because the average of that term is 〈log2 q(r|s)〉p(r, s), the equation for I* is

I* = −(1/n) log2 Prob((1/n) ∑i log2 q(ri|si(w)) ≥ 〈log2 q(r|s)〉p(r, s)). (B11)
Here and in what follows, the notation 〈...〉p means average the terms inside the angle bracket with respect to the probability distribution p.
Large-deviation theory applied to Equation B11
Our task now is to compute the probability on the right-hand side of Equation B11. The key observation we need to do this is that, when w ≠ w*, ri and si(w) are independent, where independent in this context means p(ri, si(w)) = p(ri)p(si(w)). What we compute, then, is the probability that samples of r and s drawn from the distribution p(r)p(s) will yield enough of an outlier to satisfy the inequality in Equation B11. This can be done using Sanov's theorem (Sanov, 1957; Cover and Thomas, 1991; Dembo and Zeitouni, 1993), which tells us that this probability is given by

Prob((1/n) ∑i log2 q(ri|si(w)) ≥ 〈log2 q(r|s)〉p(r, s)) ≈ 2^(−nD(p*(r, s)∥p(r)p(s))), (B12)
where D(·∥·) is the Kullback-Leibler divergence (in the above expression, it is equal to 〈log2[p*(r, s)/p(r)p(s)]〉p*(r, s)), and p*(r, s) is chosen to minimize D(p*(r, s)∥p(r)p(s)) subject to the constraints

〈log2 q(r|s)〉p*(r, s) = 〈log2 q(r|s)〉p(r, s), (B13a)
∑s p*(r, s) = p(r). (B13b)
The first constraint, Equation B13a, tells us that p*(r, s) produces enough of an outlier to just barely satisfy the inequality in Equation B11. The second, Equation B13b, tells us that the responses are typical and is derived using the following reasoning. The probability of an error, P1, should be thought of as a probability over codewords; that is, P1 is the probability that a randomly chosen codeword will satisfy the inequality in Equation B11. This probability depends, of course, on the ri. For large n, it is overwhelmingly likely that the ri will be typical, meaning that the fraction of times a particular r appears is equal to its probability, p(r) (Cover and Thomas, 1991). Moreover, we do not have to worry about outliers because, as mentioned above, we are interested in the probability of error per codeword sent. The same cannot be said about the stimuli: there are an exponentially large number of codewords in the codebook, and the ones most likely to produce an error are those that deviate from the distribution p(s), which is why we do not have the constraint ∑rp*(r, s) = p(s).
Equation B12 tells us that the mutual information associated with the wrong distribution, q(r|s), is given by

I* = D(p*(r, s)∥p(r)p(s)) = 〈log2 [p*(r, s)/p(r)p(s)]〉p*(r, s). (B14)
To compute I*, we need to find p*(r, s), the distribution that minimizes D(p*(r, s)∥p(r)p(s)) subject to the constraints given in Equation B13. This is a straightforward problem in constrained minimization: p*(r, s) is found by solving the equation

∂/∂p*(r, s) [D(p*(r, s)∥p(r)p(s)) − β 〈log2 q(r|s)〉p*(r, s) − ∑r λ(r) ∑s p*(r, s)] = 0, (B15)
and then choosing β and λ(r), the Lagrange multipliers, to satisfy the two constraints in Equation B13.
Equations B13-B15 are, with minor changes in notation, the same as those found by Merhav et al. (1994).
Bounding I*
Although these equations cannot be solved analytically, they can be reduced to a form that allows easy derivation of a bound on I*. To do that, we proceed in steps: first we find the solution for arbitrary β and λ(r), then we eliminate λ(r) by enforcing the constraint in Equation B13b, and finally we cast the remaining equation for β in terms of a minimization problem.
Denoting the solution to Equation B15 p̃(r, s; β, λ), we have, after straightforward algebra,

p̃(r, s; β, λ) = p(r)p(s)q(r|s)^β 2^(λ(r)),

where the constant generated by differentiating the Kullback-Leibler divergence has been absorbed into λ(r). We can solve for λ(r), and thus eliminate it, by enforcing the constraint given in Equation B13b, and we arrive at

p̃(r, s; β) = p(r)p(s)q(r|s)^β / Z(r, β),

where the normalization, Z(r, β), is given by

Z(r, β) = ∑s p(s)q(r|s)^β. (B16)
It is easy to show that p̃(r, s; β) satisfies Equation B13b; that is, ∑s p̃(r, s; β) = p(r).
Our next step is to find the value of β, denoted β*, that satisfies Equation B13a. Although we cannot do this analytically, we can show that β* satisfies a convex optimization problem. This is clearly convenient for numerical work, and it is also convenient for deriving bounds on I*. We start by defining

ΔĨ(β) ≡ D(p(r, s)∥p̃(r, s; β)) = ∑r ∑s p(r, s) log2 [p(r, s)/p̃(r, s; β)]. (B17)
The quantity ΔĨ(β) is significant for two reasons, both of which we demonstrate below: it has a single minimum at β = β*, and its value at that minimum is I - I*, the information loss associated with ignoring correlations.
To show that ΔĨ(β*) is a minimum, we first differentiate both sides of Equation B17 with respect to β, which yields

dΔĨ(β)/dβ = 〈log2 q(r|s)〉p̃(r, s; β) − 〈log2 q(r|s)〉p(r, s). (B18)
The right-hand side of Equation B18 vanishes when β = β*: by definition p̃(r, s; β*) = p*(r, s), and Equation B13a tells us that 〈log2 q(r|s)〉p*(r, s) = 〈log2 q(r|s)〉p(r,s). Thus, β* is an extremum. To show that this extremum is a minimum, we compute the second derivative of ΔĨ(β). Straightforward algebra yields

d²ΔĨ(β)/dβ² = ln 2 ∑r p(r) Varp̃(s|r; β)[log2 q(r|s)], (B19)
where p̃(s|r; β) ≡ p̃(r, s; β)/p(r). Excluding the trivial case of a deterministic mapping from stimulus to response, the variance on the right-hand side of Equation B19 is positive. Thus, ΔĨ(β) is convex and so has a single minimum at β*.
To show that ΔĨ(β*) = I - I*, we use the definition of ΔĨ(β) and a small amount of algebra to derive the relationship

ΔĨ(β) = I − [β〈log2 q(r|s)〉p(r, s) − 〈log2 Z(r, β)〉p(r)].
Then, using Equations B13a and B13b to replace the averages with respect to p(r, s) by averages with respect to p*(r, s) and comparing the resulting expression with Equation B14, it is easy to see that the second term in brackets is equal to I*.
This analysis tells us that the information loss associated with the wrong distribution is the minimum value of ΔĨ(β). That allows us to perform three quick sanity checks. First, because ΔĨ is non-negative (it is a Kullback-Leibler divergence; see Eq. B17), the information loss can never be less than zero. Second, when q(r|s) = p(r|s), ΔĨ(1) = 0, indicating that there is no information loss when we use the true distribution. And third, p̃(r, s; 0) = p(r)p(s), which means that ΔĨ(0) = I; this indicates that the information loss can never exceed the actual information, I.
Because there is no closed-form expression for ΔĨ, it is useful to find an upper bound on it. We can do this by evaluating ΔĨ(β) at any β. A convenient choice is β = 1, at which point we have, using Equation B17 for ΔĨ(β), Equation B16 for Z(r, β), and a small amount of algebra,

ΔĨ(1) = ∑r ∑s p(r, s) log2 [p(s|r) ∑s′ p(s′)q(r|s′) / (p(s)q(r|s))]. (B20)
Finally, to simplify the right-hand side of Equation B20, we define q(s|r) via

q(s|r) ≡ q(r|s)p(s) / ∑s′ q(r|s′)p(s′),
and we find that

I − I* ≤ ΔĨ(1) = ∑r p(r) ∑s p(s|r) log2 [p(s|r)/q(s|r)]. (B21)
When q(s|r) = pind(s|r), the right-hand side of Equation B21 is ΔI. Thus, Equation B21 tells us that ΔI is an upper bound on the information loss.
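As a numerical sanity check on Equations B16, B17, and B21 (our own code; the small random two-neuron channel is hypothetical, and the wrong distribution q is taken to be the independent model), the sketch below evaluates ΔĨ(β) on a grid, confirms that its minimum does not exceed ΔĨ(1), and confirms that ΔĨ(1) equals the ΔI expression of Equation 5.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random two-neuron channel: 2 stimuli; responses r = (r1, r2) flattened to 4 values.
p_s = np.array([0.5, 0.5])
outer = np.einsum('si,sj->sij', rng.dirichlet(np.ones(2), 2), rng.dirichlet(np.ones(2), 2))
p_rs = 0.6 * outer.reshape(2, 4) + 0.4 * rng.dirichlet(np.ones(4), 2)   # true p(r|s), correlated
marg = p_rs.reshape(2, 2, 2)
q_rs = np.einsum('si,sj->sij', marg.sum(axis=2), marg.sum(axis=1)).reshape(2, 4)  # p_ind(r|s)

joint = p_s[:, None] * p_rs                        # p(s, r)
p_r = joint.sum(axis=0)

def delta_I_tilde(beta):
    """Eq. B17 with p_tilde from Eq. B16: D( p(r, s) || p(r) p(s) q(r|s)^beta / Z(r, beta) )."""
    Z = (p_s[:, None] * q_rs ** beta).sum(axis=0)                  # Z(r, beta)
    p_tilde = p_r[None, :] * p_s[:, None] * q_rs ** beta / Z       # p_tilde(s, r; beta)
    return float(np.sum(joint * np.log2(joint / p_tilde)))

betas = np.linspace(0.0, 3.0, 301)
curve = np.array([delta_I_tilde(b) for b in betas])

# Delta I of Equation 5, computed directly from the two posteriors.
q_post = p_s[:, None] * q_rs / (p_s[:, None] * q_rs).sum(axis=0)   # q(s|r) = p_ind(s|r)
p_post = joint / p_r
delta_I = float(np.sum(joint * np.log2(p_post / q_post)))

print(np.isclose(delta_I_tilde(1.0), delta_I))      # True: Equation B21
print(curve.min() <= delta_I_tilde(1.0) + 1e-12)    # True: the loss I - I* <= Delta I
```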
Footnotes
P.E.L. was supported by the Gatsby Charitable Foundation and National Institute of Mental Health Grant R01 MH62447. S.N. was supported by National Eye Institute Grant R01 EY012978. We acknowledge Peter Dayan and Liam Paninski for insightful discussion and comments on this manuscript.
Correspondence should be addressed to Sheila Nirenberg, Department of Neurobiology, University of California at Los Angeles, 10833 Le Conte Avenue, Los Angeles, CA 90095. E-mail: sheilan{at}ucla.edu.