## Abstract

The amount of information encoded by networks of neurons critically depends on the correlation structure of their activity. Neurons with similar stimulus preferences tend to have higher noise correlations than others. In homogeneous populations of neurons, this limited range correlation structure is highly detrimental to the accuracy of a population code. Therefore, reduced spike count correlations under attention, after adaptation, or after learning have been interpreted as evidence for a more efficient population code. Here, we analyze the role of limited range correlations in more realistic, heterogeneous population models. We use Fisher information and maximum-likelihood decoding to show that reduced correlations do not necessarily improve encoding accuracy. In fact, in populations with more than a few hundred neurons, increasing the level of limited range correlations can substantially improve encoding accuracy. We found that this improvement results from a decrease in noise entropy that is associated with increasing correlations if the marginal distributions are unchanged. Surprisingly, for constant noise entropy and in the limit of large populations, the encoding accuracy is independent of both structure and magnitude of noise correlations.

## Introduction

The accuracy of information processing in the cortex depends strongly on how sensory stimuli are encoded by a population of neurons. Two key factors influence the quality of a population code: (1) the shape of the tuning functions of individual neurons and (2) the structure of interneuronal noise correlations. Although the magnitude of noise correlations is debated, a common finding is that they are strongest for neurons with similar tuning properties (Zohary et al., 1994; Bair et al., 2001; Cohen and Newsome, 2008; Smith and Kohn, 2008; Ecker et al., 2010). Interestingly, such a limited range correlation structure seems to be highly detrimental for a population code, even if correlations are small. If correlations are unavoidable, it is generally believed that reducing them improves a population code (Zohary et al., 1994; Abbott and Dayan, 1999; Sompolinsky et al., 2001; Wilke and Eurich, 2002; Ecker et al., 2010). In line with this notion, a number of recent experimental studies find reduced spike count correlations under experimental conditions in which improved coding is expected, such as under attention (Cohen and Maunsell, 2009; Mitchell et al., 2009), adaptation (Gutnisky and Dragoi, 2008), or after learning (Gu et al., 2011).

Most previous theoretical studies of population coding use homogeneous population models, in which all neurons have tuning functions that differ only in their preferred stimulus and are otherwise identical (Snippe and Koenderink, 1992; Abbott and Dayan, 1999; Sompolinsky et al., 2001; Wilke and Eurich, 2002). In these models, limited range correlations introduce a strong noise component in the subspace in which the stimulus is encoded, impairing the population code (Sompolinsky et al., 2001). However, a very prominent feature of cortical neurons is the diversity of their tuning functions. This heterogeneity changes the way the stimulus is encoded and can critically alter the properties of a population code (Shamir and Sompolinsky, 2006). Unfortunately, this has not been sufficiently appreciated. We extended the approach pioneered by Sompolinsky and colleagues (Sompolinsky et al., 2001; Shamir and Sompolinsky, 2006) to study population codes with heterogeneous tuning functions, arbitrary mean/variance relationships (Fano factors), and a broad range of correlation structures. To characterize the encoding accuracy, we used Fisher information and maximum-likelihood decoding and studied a simple model with heterogeneous tuning functions, Poisson-like noise, and limited range correlation structure. We found that, in contrast to current belief, decreasing correlations does not necessarily lead to increased information. Instead, if correlations are strong enough, increasing them can substantially increase the encoding accuracy, even to the point at which a population with limited range correlations is more accurate than an independent one.

We show that this increase in encoding accuracy with higher correlations is due to a decrease in noise entropy. If the entropy is kept constant, the encoding accuracy can be improved substantially by reducing correlations below a critical value. Surprisingly, in large neural populations, the quality of a code is mainly determined by the amount of heterogeneity in the tuning functions and by the noise entropy, while correlations play only a minor role.

## Materials and Methods

Table 1 contains a list of symbols used throughout this article.

##### Population model.

We consider a population of *n* neurons responding to a stimulus, which is characterized by its direction of motion θ ∈ [0, 2π). The response of neuron *j* is given by

$$y_j(\theta) = f_j(\theta) + \eta_j(\theta), \tag{1}$$

where *f*_{j}(θ) is the tuning curve of neuron *j* and η_{j}(θ) is the trial-to-trial variability in the neural responses. The variability is assumed to follow a multivariate normal distribution with zero mean and covariance **Q**(θ). We assume that the preferred directions of the neurons are uniformly spaced around the circle (i.e., ϕ_{j} = 2π*j*/*n*, where *j* = 0, … , *n* − 1). Because of the circular nature of the stimulus, it is sometimes convenient (both for presentation and for calculations) to use negative indices ranging from −*n*/2 to *n*/2. These indices are simply understood modulo *n*. Thus, for example, if *n* = 32, then *f*_{−10} is the same as *f*_{22}.

##### Homogeneous population model.

In the homogeneous population model, all neurons have identical tuning functions except for their preferred directions. In other words,

$$f_j(\theta) = f(\theta - \phi_j). \tag{2}$$

In this model, the population is invariant under rotation. This means that any shift in the stimulus can be translated into a renumbering of the cells in the population. The average population activity has the same shape as the tuning function centered on θ. Throughout the paper, we use von Mises tuning functions given by the following:

$$f(\theta) = \alpha + \beta \exp\left(\gamma\,[\cos(\theta) - 1]\right). \tag{3}$$

We use the parameter values α = 1, β = 19, and γ = 2. These parameters closely resemble the average values found in our recordings from monkey V1 (Ecker et al., 2010) and result in tuning curves with a maximum amplitude of 20 Hz (see Fig. 1*E*).
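As an illustration, the following is a minimal sketch of such a tuning function. It assumes the form *f*(θ) = α + β exp(γ[cos(θ) − 1]), which is consistent with the stated peak rate of α + β = 20 Hz; the function and variable names are hypothetical and not taken from the original analysis code.

```python
import numpy as np

def von_mises_tuning(theta, phi=0.0, alpha=1.0, beta=19.0, gamma=2.0):
    """Mean firing rate (Hz) of a neuron with preferred direction phi.

    Assumed form: alpha + beta * exp(gamma * (cos(theta - phi) - 1)).
    """
    return alpha + beta * np.exp(gamma * (np.cos(theta - phi) - 1.0))

n = 32
phi = 2 * np.pi * np.arange(n) / n              # uniformly spaced preferred directions
population_rates = von_mises_tuning(0.0, phi)   # mean population activity at theta = 0

peak_rate = von_mises_tuning(0.0)               # equals alpha + beta
baseline_rate = von_mises_tuning(np.pi)         # equals alpha + beta * exp(-2 * gamma)
```

With the default parameters, the peak rate is α + β = 20 Hz and the rate at the anti-preferred direction stays slightly above the baseline α.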

##### Random amplitude model.

In the random amplitude model, all neurons are assumed to have identically shaped tuning functions but potentially different amplitudes:

$$f_j(\theta) = a_j\, f(\theta - \phi_j). \tag{4}$$

Here, *f*(θ − ϕ_{j}) is as in the homogeneous model above. The amplitudes *a*_{j} are assumed to be drawn independently for each neuron and to have mean 〈*a*_{j}〉 = 1; their square root has variance κ = Var[√*a*_{j}].

##### Other forms of heterogeneity.

We numerically simulated two additional types of heterogeneous populations. In the first model (see Fig. 8*A*), we randomly drew the tuning widths, γ, from a lognormal distribution with mean 2 and variance 4 while keeping all other parameters constant as in the homogeneous model described above. In the second model (see Fig. 8*B*), we drew (with replacement) triples of tuning parameters (α, β, γ) from a database of 408 orientation tuning curves measured in V1 of awake monkeys (Ecker et al., 2010). We assigned preferred directions uniformly spaced around the circle as above. We did not combine parameters from different neurons independently (i.e., the number of possible tuning curve shapes was 408 rather than 408^{3}).
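Drawing tuning widths from a lognormal distribution with a given mean and variance requires converting these moments into the parameters of the underlying normal distribution. The sketch below shows one way to do this with NumPy; the helper name `lognormal_params` and the sample size are illustrative assumptions, not part of the original simulations.

```python
import numpy as np

def lognormal_params(mean, var):
    """Convert the mean and variance of a lognormal variable into (mu, sigma)
    of the underlying normal distribution."""
    sigma2 = np.log1p(var / mean**2)        # sigma^2 = ln(1 + var / mean^2)
    mu = np.log(mean) - 0.5 * sigma2        # mu = ln(mean) - sigma^2 / 2
    return mu, np.sqrt(sigma2)

mu, sigma = lognormal_params(mean=2.0, var=4.0)

rng = np.random.default_rng(0)
widths = rng.lognormal(mean=mu, sigma=sigma, size=200_000)  # tuning widths gamma
```

For mean 2 and variance 4, this gives σ² = ln 2 and μ = (ln 2)/2.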

##### Correlation structure.

In our model, we assume the correlation coefficient of two neurons to be independent of the stimulus. This allows us to parameterize the covariance matrix as follows:

$$Q_{jk}(\theta) = \sigma_j(\theta)\,\sigma_k(\theta)\,r_{jk}, \tag{5}$$

where σ_{j}^{2}(θ) is the variance of neuron *j* and *r*_{jk} is the correlation coefficient of neurons *j* and *k*. The covariance matrix **Q** can be written as **Q** = **SRS**, where **R** is the correlation coefficient matrix, which is premultiplied and postmultiplied by a diagonal matrix **S** of standard deviations.

The correlation coefficient of two neurons depends only on |ϕ_{j} ⊝ ϕ_{k}|, the angular difference between their preferred directions (we use ⊝ to express the fact that it is a difference between two circular quantities, in other words, ϕ_{j} ⊝ ϕ_{k} = arg exp[*i*(ϕ_{j} − ϕ_{k})]),

$$r_{jk} = c(|\phi_j \ominus \phi_k|) + \delta_{jk}\,[1 - c(0)]. \tag{6}$$

Here, δ_{jk} is the Kronecker delta (δ_{jk} = 1 if *j* = *k* and δ_{jk} = 0 otherwise). We do not require any specific form for *c*(|ϕ_{j} ⊝ ϕ_{k}|), other than that it must lead to a valid covariance matrix. For the large *n* case, this is equivalent to requiring it to be bounded between −1 and 1 and all its Fourier components to be positive. While the former condition is a requirement for correlation coefficients, the latter ensures that the covariance matrix remains positive definite in the limit of large populations. We further assume that the variances are Poisson-like, which means σ_{j}^{2}(θ) = *f*_{j}(θ). This is sometimes referred to as “proportional noise” (Wilke and Eurich, 2002), while the case where σ_{j}^{2} does not depend on the stimulus is referred to as “additive noise.” Because the correlation coefficients depend only on the difference between the preferred directions of two neurons, the correlation matrix, **R**, is circulant. We therefore have *r*_{jk} = *r*_{j−k}, where the vector **r** is the first column of **R**. We often refer to **r** simply as the correlation structure.

For all examples, we assume that the correlation matrix has limited range structure. This means that the correlation between two neurons is maximal if they have identical preferred directions and decreases with increasing difference in preferred direction. The parametric form we use is the following:

$$c(|\phi_j \ominus \phi_k|) = c_0 \exp\left(-\frac{|\phi_j \ominus \phi_k|}{L}\right), \tag{7}$$

where *c*_{0} is the correlation of two neurons with identical preferred directions and *L* controls the spatial scale (the larger *L*, the longer the range of correlations). We use *L* = 1 for all figures in this paper. Qualitatively, the results do not depend on the exact choice of *L* (within reasonable limits), and this value is in good agreement with our V1 data (Ecker et al., 2010) and previously published studies (Zohary et al., 1994; Bair et al., 2001). The parameter *c*_{0} controls the average level of correlations, 〈*r*〉, which also depends on *L*. For *L* = 1, we have 〈*r*〉 ≈ 0.3 *c*_{0}. These values are included in the figures for reference. Note that larger values of *L* lead to higher average correlations relative to *c*_{0} (for instance, for *L* = 2, we have 〈*r*〉 ≈ 0.5 *c*_{0}).
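The relationship between *c*_{0}, *L*, and the average correlation can be checked numerically. The sketch below assumes an exponential decay of correlations with circular distance, *c*_{0} exp(−|ϕ_{j} ⊝ ϕ_{k}|/*L*), which reproduces the quoted ratios of about 0.3 and 0.5; the function name and the population size are illustrative choices.

```python
import numpy as np

def limited_range_correlations(n, c0, L=1.0):
    """First column r of the circulant correlation matrix R
    (assumed exponential decay with circular distance)."""
    phi = 2 * np.pi * np.arange(n) / n
    dist = np.abs(np.angle(np.exp(1j * phi)))   # circular distance to neuron 0, in [0, pi]
    r = c0 * np.exp(-dist / L)
    r[0] = 1.0                                  # each neuron has correlation 1 with itself
    return r

n, c0 = 1024, 0.3
r_L1 = limited_range_correlations(n, c0, L=1.0)
r_L2 = limited_range_correlations(n, c0, L=2.0)

mean_ratio_L1 = r_L1[1:].mean() / c0   # average pairwise correlation relative to c0
mean_ratio_L2 = r_L2[1:].mean() / c0

eigenvalues = np.fft.fft(r_L1).real    # eigenvalues of the circulant matrix R
```

Because **R** is circulant, its eigenvalues are given by the discrete Fourier transform of **r**; checking that they are all positive confirms that this choice yields a valid correlation matrix.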

##### Fisher information.

To quantify the encoding accuracy of the population, we use Fisher information (Cover and Thomas, 1991). If a Gaussian distribution is assumed for the noise, the Fisher information can be written as the sum of two terms, *J* = *J*_{mean} + *J*_{cov}, where (Kay, 1993)

$$J_{\mathrm{mean}} = \mathbf{f}'^{\mathsf{T}}\,\mathbf{Q}^{-1}\,\mathbf{f}', \tag{8}$$

$$J_{\mathrm{cov}} = \frac{1}{2}\,\mathrm{Tr}\left[\mathbf{Q}'\,\mathbf{Q}^{-1}\,\mathbf{Q}'\,\mathbf{Q}^{-1}\right], \tag{9}$$

and **f′** and **Q′** are the derivatives of the tuning curve and the covariance matrix with respect to the stimulus direction, θ. The term *J*_{mean} can be thought of as the information that is encoded in changes in the mean firing rates of the population [i.e., the term **f′**(θ)]. In contrast, *J*_{cov} is the information encoded by changes in the covariances [i.e., the term **Q′**(θ)].

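To evaluate the above two terms, we have to invert the covariance matrix. Because **R** is circulant, it can be diagonalized by changing to the Fourier basis as follows:

$$\mathbf{R} = \mathbf{U}\,\tilde{\mathbf{R}}\,\mathbf{U}^{*}. \tag{10}$$

The matrix **U** is the Fourier basis, given by

$$U_{jk} = \frac{1}{\sqrt{n}} \exp(i\omega jk) \tag{11}$$

(with ω = 2π/*n*), **U**\* is the Hermitian (conjugate) transpose of **U**, and **R̃** is a diagonal matrix containing the eigenvalues of **R**, which can be calculated by the discrete Fourier transform of **r** (the first column of **R**), given by

$$\tilde{r}_k = \sum_{j=0}^{n-1} r_j \exp(-i\omega jk). \tag{12}$$

Using this factorization, **Q**(θ)^{−1} reads

$$\mathbf{Q}(\theta)^{-1} = \mathbf{S}(\theta)^{-1}\,\mathbf{U}\,\tilde{\mathbf{R}}^{-1}\,\mathbf{U}^{*}\,\mathbf{S}(\theta)^{-1}, \tag{13}$$

which is easy to calculate because it contains only inverses of diagonal (**S**, **R̃**) matrices.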

##### Calculation of J_{mean}.

A model similar to ours but with the restriction to additive noise (i.e., with stimulus independent covariance matrix, **Q**) has been studied by Sompolinsky et al. (2001). Although the additive case is somewhat simpler because *J*_{cov} is zero (**Q′** = 0), the approach can be generalized to obtain analytic expressions for *J*_{mean} and *J*_{cov} in the nonadditive case. For *J*_{mean}, we use the above factorization of **Q**^{−1} and substitute it into Equation 8. After substituting *g*_{k} = *f*′_{k}/σ_{k}, we obtain the following:

$$J_{\mathrm{mean}} = \mathbf{g}^{\mathsf{T}}\mathbf{R}^{-1}\mathbf{g} = \sum_{k=0}^{n-1} \frac{|\tilde{g}_k|^2}{\tilde{r}_k}, \tag{14}$$

where **g̃** = **U**\***g** is the unitary discrete Fourier transform of **g** (compare Eq. 12, which omits the factor 1/√*n*). Note that because **g** depends on the stimulus, θ, *J*_{mean} is also stimulus dependent. We usually omit this dependence on θ for clarity.
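The Fourier-domain expression can be compared against a direct evaluation of **f′**^{T}**Q**^{−1}**f′**. The following sketch does this for a small homogeneous population under stated assumptions: von Mises tuning of the form α + β exp(γ[cos(θ − ϕ_{j}) − 1]), Poisson-like variances, exponentially decaying correlations, and a unitary normalization of the Fourier transform of **g**. All variable names are illustrative.

```python
import numpy as np

n, c0, L = 64, 0.3, 1.0
alpha, beta, gamma = 1.0, 19.0, 2.0
theta = 0.0
phi = 2 * np.pi * np.arange(n) / n

# Tuning curves and their derivatives with respect to theta (assumed von Mises form)
bump = np.exp(gamma * (np.cos(theta - phi) - 1.0))
f = alpha + beta * bump
df = -beta * gamma * np.sin(theta - phi) * bump
sigma = np.sqrt(f)                                  # Poisson-like noise: variance = mean

# Limited range correlation structure (assumed exponential decay)
dist = np.abs(np.angle(np.exp(1j * phi)))
r = c0 * np.exp(-dist / L)
r[0] = 1.0

# Direct evaluation: J_mean = f'^T Q^{-1} f' with Q = S R S
idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
R = r[idx]                                          # circulant matrix, R[j, k] = r[(j - k) mod n]
Q = sigma[:, None] * R * sigma[None, :]
J_direct = df @ np.linalg.solve(Q, df)

# Fourier evaluation: sum over modes of |g~_k|^2 / r~_k
g = df / sigma
g_tilde = np.fft.fft(g) / np.sqrt(n)                # unitary DFT of g
r_tilde = np.fft.fft(r).real                        # eigenvalues of R
J_fourier = np.sum(np.abs(g_tilde) ** 2 / r_tilde)
```

Both routes give the same value up to floating-point error, but the Fourier route avoids the *n* × *n* matrix inversion and therefore scales to much larger populations.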

##### J_{mean} in homogeneous population model.

To illustrate the formula, we consider a homogeneous population of neurons (see Fig. 1*E*) with limited range correlation structure (see Fig. 1*A–D*). In the general case, *J*_{mean} is a function of the stimulus, θ. However, for a homogeneous population of neurons, where the tuning functions are broad compared with the spacing between the preferred directions, *J*_{mean} can be treated as independent of θ. We can therefore restrict our analysis to the case θ = 0. In this case, the average population activity, **f**, is given by *f*_{j} = *f*(−ϕ_{j}), where ϕ_{j} is the preferred direction of neuron *j* (see Fig. 1*F*). The covariance matrix **Q** is shown in Figure 1*B*. Figure 1, *C* and *G*, show **r** and **g**, whose Fourier transforms (Fig. 1*D*,*H*) are the two main quantities entering Equation 14. Because both **f** and **σ** are smooth and slowly varying, **g** has almost all of its power in low frequencies, the power spectrum converging to zero for higher frequencies. We can write **r** as *r*_{j} = δ_{j}(1 − *c*_{0}) + *c*_{j}. Its Fourier transform can be split up into two parts as well:

$$\tilde{r}_k = (1 - c_0) + \tilde{c}_k.$$

Because **c** (Eq. 7) is also a smooth and slowly varying function of the differences in preferred direction, it also has most of its power in the low frequencies, power in higher frequencies quickly converging to zero. The delta peak at zero (because each neuron is correlated with itself by 1, while neighboring neurons with similar preferred directions have a correlation of *c*_{0}) has a constant Fourier transform of magnitude 1 − *c*_{0}. Together, this results in a power spectrum with high power in low frequencies decaying to a constant offset 1 − *c*_{0} at high frequencies.

Each of the terms |g̃_{k}|^{2}/*r*̃_{k} in Equation 14 can be seen as a signal-to-noise ratio of the *k*th Fourier mode of the population (Fig. 1*I*). *J*_{mean} is then simply the sum over the individual signal-to-noise ratios. For the homogeneous case, the only difference to the additive noise case studied by Sompolinsky et al. (2001) is that, for other mean-variance relationships, the tuning curve derivatives are normalized by the standard deviations. Because of this, the scaling behavior of *J*_{mean} for large populations is similar to the additive case: the low-frequency Fourier components of signal and noise grow at the same rate with *n*, leading to a saturation of *J*_{mean} for large networks (see Fig. 3*A*).

##### J_{mean} in random amplitude model.

The above considerations suggest that the saturation of *J*_{mean} can be avoided by introducing high-frequency components into the signal, **g̃**, for which the noise amplitude is small. In fact, this is naturally the case for any realistic population of neurons. The vanishing power in high frequencies for the homogeneous population model is a result of the simplifying assumption of identical tuning functions for all neurons. This results in the mean population activity **f** having the same shape as the tuning function *f* evaluated at the preferred directions of the neurons. In realistic populations of neurons, however, tuning curves display a significant amount of heterogeneity between neurons—such as different amplitudes, widths, or baselines of their tuning functions—introducing high-frequency components into **f** and, hence, also in **g.**

To illustrate this point, we ignored all types of heterogeneity except for the overall amplitudes of the neurons and constructed a model population by assigning each neuron a tuning curve with a peak amplitude randomly drawn from a distribution of amplitudes that has the same mean as the homogeneous population (see Fig. 1*J*). Figure 1, *K* and *L*, shows the resulting mean population activity **f** as well as the normalized derivatives **g** for this population at stimulus θ = 0. Because of the randomly selected amplitudes for each neuron, **g** has power in all frequencies (see Fig. 1*M*). Intuitively, this affects *J*_{mean} positively in two ways. First, the overall number of Fourier components significantly contributing to *J*_{mean} is increased. Second, the signal-to-noise ratio is better for the high-frequency components g̃_{k} because *r*̃_{k} is small for large *k* (it converges to 1 − *c*_{0}).

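The exact value of *J*_{mean} depends on the specific set of amplitudes *a*_{j} that are drawn at random. We here generalize the results of Shamir and Sompolinsky (2006) to the case with nonadditive, stimulus-dependent noise. In the random amplitude model with Poisson-like noise,

$$g_j = \frac{f_j'}{\sigma_j} = \sqrt{a_j}\,\frac{f'(\theta - \phi_j)}{\sqrt{f(\theta - \phi_j)}} = \sqrt{a_j}\; g_j^{\mathrm{hom}}.$$

We define √*a*_{j} = μ + *b*_{j}, which splits √*a*_{j} into its mean μ and a random component *b*_{j} with mean zero and variance κ = Var[√*a*_{j}]. For the expected value of *J*_{mean}, we obtain

$$\langle J_{\mathrm{mean}} \rangle = \mu^2 J_{\mathrm{mean}}^{\mathrm{hom}} + \kappa\, d \sum_{k} \frac{1}{\tilde{r}_k}, \tag{17}$$

where *J*_{mean}^{hom} is the linear Fisher information of a homogeneous population and *d* converges to a constant independent of *n* and is defined as follows:

$$d = \frac{1}{n} \sum_{j} \left(g_j^{\mathrm{hom}}\right)^2 = \frac{1}{n} \sum_{j} \frac{f'(\theta - \phi_j)^2}{f(\theta - \phi_j)}.$$

To arrive at Equation 17, we used

$$\left\langle \Big| \sum_{j} u_{jk}^{*}\, b_j\, g_j^{\mathrm{hom}} \Big|^2 \right\rangle = \kappa \sum_{j} u_{jk} u_{jk}^{*} \left(g_j^{\mathrm{hom}}\right)^2 = \kappa\, d,$$

which holds because *b*_{j} is white noise with variance κ (i.e., 〈*b*_{i}*b*_{j}〉 = δ_{ij}κ) and *u*_{jk}*u*\*_{jk} = 1/*n*.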

The above calculations can be generalized to non-Poisson mean-variance relationships. For instance, if σ(θ) = *f*(θ)^{α} (where α here denotes the exponent of the mean-variance relationship, not the baseline of the tuning function), define *a*_{j}^{1−α} = μ + *b*_{j} and κ = Var[*a*_{j}^{1−α}], and Equation 17 will still be valid. By setting α = 0, the result of Shamir and Sompolinsky (2006) is obtained. Unless α = 1 (i.e., the standard deviations are equal to the means), populations with amplitude variability will have *J*_{mean} asymptotically proportional to *n* even in the presence of limited range correlations.

Furthermore, from Equation 17, one can see that for an independent population of Poisson-like neurons, amplitude variability does not affect *J*_{mean} on average. Because for an independent population *r*̃_{k} = 1, we have

$$\langle J_{\mathrm{mean}} \rangle = \left(\mu^2 + \kappa\right) n\, d = n\, d, \tag{20}$$

which is independent of κ (because μ^{2} + κ = 〈*a*_{j}〉 = 1) and identical with that of a homogeneous population.

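Considering the large *n* limit of *J*_{mean} of a correlated population relative to that of an independent population, we obtain

$$\lim_{n \to \infty} \frac{\langle J_{\mathrm{mean}} \rangle}{J_{\mathrm{mean}}^{\mathrm{indep}}} = \lim_{n \to \infty} \frac{\mu^2 J_{\mathrm{mean}}^{\mathrm{hom}} + \kappa\, d \sum_{k} 1/\tilde{r}_k}{n\, d} = \frac{\kappa}{1 - c_0}. \tag{21}$$

The first term in the numerator above saturates to a finite value and can therefore be ignored. For the second term, note that **r** is assumed to be smooth and slowly varying, in which case only a few low-frequency components *r*̃_{k} are large and for the remaining components *r*̃_{k} → 1 − *c*_{0}. Thus, (1/*n*) Σ_{k} 1/*r*̃_{k} → 1/(1 − *c*_{0}) for large *n*.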
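The different scaling of *J*_{mean} in homogeneous and heterogeneous populations can be illustrated numerically. The sketch below evaluates the Fourier-domain sum Σ_{k} |g̃_{k}|²/*r*̃_{k} for both cases. It assumes von Mises tuning of the form α + β exp(γ[cos(θ) − 1]), Poisson-like noise, exponentially decaying correlations, and lognormal amplitudes whose parameters are set analytically so that 〈*a*_{j}〉 = 1 and Var[√*a*_{j}] = κ. The function name, the seed, and this analytic parameter choice are assumptions made for illustration.

```python
import numpy as np

def j_mean(n, c0, kappa, rng, L=1.0, alpha=1.0, beta=19.0, gamma=2.0):
    """J_mean at theta = 0 for one realization of a random amplitude population."""
    phi = 2 * np.pi * np.arange(n) / n
    bump = np.exp(gamma * (np.cos(phi) - 1.0))
    f = alpha + beta * bump                       # f(-phi_j)
    df = beta * gamma * np.sin(phi) * bump        # derivative w.r.t. theta at theta = 0
    g_hom = df / np.sqrt(f)                       # normalized derivative, homogeneous case

    if kappa > 0:
        # Lognormal amplitudes with E[a] = 1 and Var[sqrt(a)] = kappa:
        # s^2 = -4 ln(1 - kappa), m = -s^2 / 2
        s2 = -4.0 * np.log(1.0 - kappa)
        a = rng.lognormal(mean=-0.5 * s2, sigma=np.sqrt(s2), size=n)
    else:
        a = np.ones(n)
    g = np.sqrt(a) * g_hom

    dist = np.abs(np.angle(np.exp(1j * phi)))
    r = c0 * np.exp(-dist / L)
    r[0] = 1.0
    r_tilde = np.fft.fft(r).real                  # eigenvalues of R
    g_tilde = np.fft.fft(g) / np.sqrt(n)          # unitary DFT of g
    return float(np.sum(np.abs(g_tilde) ** 2 / r_tilde))

rng = np.random.default_rng(1)
sizes = [1024, 4096]
J_hom = [j_mean(n, c0=0.3, kappa=0.0, rng=rng) for n in sizes]
J_het = [j_mean(n, c0=0.3, kappa=0.25, rng=rng) for n in sizes]
```

In this setting, quadrupling the population size changes the homogeneous *J*_{mean} only slightly, whereas the heterogeneous *J*_{mean} keeps growing roughly in proportion to *n*.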

##### Calculation of J_{cov}.

Using similar methods as above, we also derived an expression for *J*_{cov} in terms of Fourier transforms that no longer contains an inverse of the covariance matrix:

$$J_{\mathrm{cov}} = \sum_{k} h_k^2 + \frac{1}{n} \sum_{k} |\tilde{h}_k|^2 \left[\tilde{r} * \frac{1}{\tilde{r}}\right]_k, \tag{22}$$

where *h*_{k} = σ′_{k}/σ_{k} and **h̃** = **U**\***h** is the discrete Fourier transform of **h**. The expression [*r*̃ \* 1/*r*̃]_{k} is the *k*th component of the circular convolution of **r̃** with its pointwise inverse.

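We briefly outline the derivation in the following. First, note that the derivative of the covariance matrix with respect to θ is **Q′** = **S′RS** + **SRS′**, where **S′** = Diag(**σ′**). Substituting this expression into Equation 9, expanding the square, and using the matrix trace identity Tr[**ABC**] = Tr[**BCA**], we obtain

$$J_{\mathrm{cov}} = \mathrm{Tr}\left[\left(\mathbf{S}^{-1}\mathbf{S}'\right)^2\right] + \mathrm{Tr}\left[\mathbf{S}^{-1}\mathbf{S}'\,\mathbf{R}\,\mathbf{S}'\mathbf{S}^{-1}\,\mathbf{R}^{-1}\right].$$

Because **S** is diagonal, the first term reduces to

$$\mathrm{Tr}\left[\left(\mathbf{S}^{-1}\mathbf{S}'\right)^2\right] = \sum_{k} \left(\frac{\sigma_k'}{\sigma_k}\right)^2 = \sum_{k} h_k^2.$$

To simplify the second term, let **V** = **U**\***S**^{−1}**S′UR̃** and **W** = **U**\***S′S**^{−1}**UR̃**^{−1}. Because **S**^{−1}**S′** is diagonal, **U**\***S**^{−1}**S′U** is circulant with the first column being the inverse Fourier transform of the diagonal elements. A right multiplication by the diagonal matrices **R̃** and **R̃**^{−1} scales the columns, resulting in the following:

$$V_{jk} = \frac{1}{\sqrt{n}}\,\tilde{h}_{j-k}\,\tilde{r}_k, \qquad W_{jk} = \frac{1}{\sqrt{n}}\,\frac{\tilde{h}_{j-k}}{\tilde{r}_k}.$$

Substituting **V** and **W** into the second term for *J*_{cov}, we obtain

$$\mathrm{Tr}\left[\mathbf{V}\mathbf{W}\right] = \frac{1}{n} \sum_{j,k} |\tilde{h}_{j-k}|^2\, \frac{\tilde{r}_k}{\tilde{r}_j} = \frac{1}{n} \sum_{k} |\tilde{h}_k|^2 \left[\tilde{r} * \frac{1}{\tilde{r}}\right]_k.$$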
For an independent population of neurons, *r*̃_{k} = 1, and by using Parseval's theorem, the simple formula for *J*_{d}, the *J*_{cov} of an independent population of neurons (Shamir and Sompolinsky, 2001), is recovered:

$$J_d = 2 \sum_{k} h_k^2 = 2 \sum_{k} \left(\frac{\sigma_k'}{\sigma_k}\right)^2.$$

Note that, for correlated populations, *J*_{cov} can be bounded from above and below. Thus, asymptotically *J*_{cov} always grows linearly with *n*, regardless of the correlation structure. In addition, unlike for *J*_{mean}, small correlations do not substantially alter *J*_{cov} compared with independence (because the upper and lower bound become equal for *c*_{0} → 0). Finally, it is easy to see that *J*_{cov} is unaffected by amplitude variability because the amplitudes *a*_{k} appear both in the numerator and the denominator of *h*_{k} and cancel.
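The reduction of *J*_{cov} to a form without matrix inverses can be verified numerically on a small population. The sketch below compares the direct evaluation ½ Tr[(**Q′Q**^{−1})²] with a Fourier-domain evaluation, Σ_{k} *h*_{k}² + (1/*n*) Σ_{k} |h̃_{k}|² [*r*̃ \* 1/*r*̃]_{k}, which follows from the derivation outlined above when **h̃** is the unitary Fourier transform of **h**. The tuning function, correlation structure, stimulus value, and variable names are assumptions chosen for illustration.

```python
import numpy as np

n, c0, L = 48, 0.3, 1.0
alpha, beta, gamma = 1.0, 19.0, 2.0
theta = 0.3                                        # arbitrary stimulus direction
phi = 2 * np.pi * np.arange(n) / n

# Assumed von Mises tuning with Poisson-like noise (variance = mean)
bump = np.exp(gamma * (np.cos(theta - phi) - 1.0))
f = alpha + beta * bump
df = -beta * gamma * np.sin(theta - phi) * bump
sigma = np.sqrt(f)
dsigma = df / (2.0 * sigma)                        # derivative of the standard deviation
h = dsigma / sigma

# Assumed limited range correlations; R[j, k] = r[(j - k) mod n]
dist = np.abs(np.angle(np.exp(1j * phi)))
r = c0 * np.exp(-dist / L)
r[0] = 1.0
idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
R = r[idx]

# Direct evaluation: 0.5 * Tr[(Q' Q^{-1})^2]
S, dS = np.diag(sigma), np.diag(dsigma)
Q = S @ R @ S
dQ = dS @ R @ S + S @ R @ dS
A = dQ @ np.linalg.inv(Q)
J_cov_direct = 0.5 * np.trace(A @ A)

# Fourier evaluation using the circular convolution of r~ with 1 / r~
r_tilde = np.fft.fft(r).real
h_tilde = np.fft.fft(h) / np.sqrt(n)               # unitary DFT of h
conv = (r_tilde[None, :] / r_tilde[idx]).sum(axis=1)
J_cov_fourier = np.sum(h ** 2) + np.sum(np.abs(h_tilde) ** 2 * conv) / n

# Independent population: r~_k = 1 for all k
J_cov_indep = 2.0 * np.sum(h ** 2)
```

The two evaluations agree up to floating-point error, and the correlated value does not fall below the independent one in this example.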

##### Effect of correlation structure under constant noise entropy.

To study the effect of the average level of correlations under the constraint of constant amount of total noise, we relax the assumption of Poisson-like noise and adjust the Fano factors of the neurons such that the noise entropy is kept constant. We define

$$\sigma_j^2(\theta, c_0) = F(c_0)\, f_j(\theta),$$

where *F*(*c*_{0}) is the Fano factor. Note that, in this section, we write most quantities as a function of *c*_{0} as we are interested in their behavior with varying *c*_{0}. We can now adjust *F*(*c*_{0}) such that the noise entropy remains constant as we vary *c*_{0}. For Gaussian noise, the differential entropy is given by the following:

$$H = \frac{1}{2} \ln\left[(2\pi e)^n\, |\mathbf{Q}|\right].$$

In the above formula, the only quantity that depends on *c*_{0} is |**Q**|. Thus, to have constant noise entropy, we need |**Q**(*c*_{0})| = |**Q**(0)|, where **Q**(0) is the covariance matrix of an independent population of neurons (*c*_{0} = 0). We can write for the determinant |**Q**| = |**S**|^{2} · |**R**| = |**V**| · |**R**|, where **S** = Diag(**σ**) is the diagonal matrix containing the standard deviations and **V** = Diag(**σ**^{2}). For the independent population, we have **Q**(0) = **V**(0). Writing the determinants as functions of *c*_{0} and requiring constant entropy, we obtain

$$F(c_0)^n\, |\mathbf{V}(0)|\, |\mathbf{R}(c_0)| = |\mathbf{V}(0)|.$$

Solving for the Fano factor *F*(*c*_{0}) results in

$$F(c_0) = |\mathbf{R}(c_0)|^{-1/n} = \left(\prod_{k} \tilde{r}_k\right)^{-1/n},$$

which is the inverse of the geometric mean of the Fourier coefficients *r*̃_{k} of the correlation structure.
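The constant-entropy Fano factor can be computed directly from the Fourier coefficients of **r**. The sketch below assumes the exponentially decaying correlation structure and uses the fact that scaling all variances by *F* multiplies |**Q**| by *F*^{n}; the function name and the parameter values are illustrative.

```python
import numpy as np

def fano_for_constant_entropy(r):
    """Inverse geometric mean of the eigenvalues of the circulant matrix with first column r."""
    r_tilde = np.fft.fft(r).real
    return float(np.exp(-np.mean(np.log(r_tilde))))

n, c0, L = 256, 0.3, 1.0
phi = 2 * np.pi * np.arange(n) / n
dist = np.abs(np.angle(np.exp(1j * phi)))
r = c0 * np.exp(-dist / L)                # assumed limited range correlation structure
r[0] = 1.0

F = fano_for_constant_entropy(r)

# Check against the determinant of the full correlation matrix:
# the entropy changes by 0.5 * (n * ln F + ln|R|) relative to the independent case.
idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
R = r[idx]
sign, logdet_R = np.linalg.slogdet(R)
entropy_change = 0.5 * (n * np.log(F) + logdet_R)
```

The resulting Fano factor is larger than 1 and, for finite *n*, smaller than its large-population limit 1/(1 − *c*_{0}).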

Applying this constant entropy constraint and considering the limit of large populations (*n* → ∞), we find for the dependence of *J*_{mean} on *c*_{0} the following:

$$\lim_{n \to \infty} \frac{J_{\mathrm{mean}}(c_0)}{J_{\mathrm{mean}}(0)} = \lim_{n \to \infty} \frac{\mu^2 J_{\mathrm{mean}}^{\mathrm{hom}}(c_0) + \kappa\, d \sum_{k} 1/\tilde{r}_k}{F(c_0)\, n\, d} = \kappa.$$

As before (Eq. 21), *J*_{mean}^{hom} saturates to a finite value and therefore the first term in the numerator does not play a role. For the second term, note that, for large populations, *F*(*c*_{0}) → 1/(1 − *c*_{0}) and (1/*n*) Σ_{k} 1/*r*̃_{k} → 1/(1 − *c*_{0}), which leads to the above result.

##### Maximum-likelihood estimation.

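Under a Gaussian noise model, the log-likelihood function is

$$\mathcal{L}(\theta) = -\frac{1}{2} \ln|\mathbf{Q}(\theta)| - \frac{1}{2}\, [\mathbf{y} - \mathbf{f}(\theta)]^{\mathsf{T}}\, \mathbf{Q}(\theta)^{-1}\, [\mathbf{y} - \mathbf{f}(\theta)] + \mathrm{const}.$$

Using the fact that **Q**(θ) = **S**(θ)**RS**(θ), we can rewrite the terms that depend on θ as follows:

$$\ln|\mathbf{Q}(\theta)| = 2 \sum_{k} \ln \sigma_k(\theta) + \ln|\mathbf{R}|$$

and

$$[\mathbf{y} - \mathbf{f}(\theta)]^{\mathsf{T}}\, \mathbf{Q}(\theta)^{-1}\, [\mathbf{y} - \mathbf{f}(\theta)] = \mathbf{z}(\theta)^{\mathsf{T}}\, \mathbf{R}^{-1}\, \mathbf{z}(\theta),$$

where we have defined **z**(θ) = **S**(θ)^{−1}(**y** − **f**(θ)). Combining all steps and dropping terms that do not depend on θ, we arrive at

$$\mathcal{L}(\theta) = -\sum_{k} \ln \sigma_k(\theta) - \frac{1}{2}\, \mathbf{z}(\theta)^{\mathsf{T}}\, \mathbf{R}^{-1}\, \mathbf{z}(\theta).$$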
With the additional assumption of Poisson-like noise (i.e., σ_{k}^{2}(θ) = *f*_{k}(θ)) and using the same diagonalization of the quadratic forms as above, we obtain for the first and second derivatives
and
Here, z̃′_{k} is the *k*th component of the discrete Fourier transform of the first derivative of **z**(θ) with respect to θ.

We evaluated the maximum-likelihood estimator (MLE) for homogeneous and heterogeneous populations. We used Newton's method to numerically find the maximum-likelihood estimate θ̂. For the heterogeneous populations, the amplitudes of the neurons *a*_{k} were drawn from a lognormal distribution with mean 1 and variance adjusted manually such that it resulted in κ = 0.25. All parameters for tuning functions and correlations were defined as above. For each population size (minimum, 32; maximum, 4096 neurons), we generated 4096 realizations of heterogeneous populations. The error of the MLE was evaluated at *m* = 32 regularly spaced stimulus values θ_{k} = 2π*k*/*m* for 64 random samples drawn from a normal distribution with mean **f**(θ) and covariance **Q**(θ). Squared errors (θ̂ − θ)^{2} for all populations (of equal size), stimuli, and samples were averaged to calculate the mean squared error, the inverse of which was taken as the efficiency of the maximum-likelihood estimator.

##### Optimal linear decoding.

For linear decoding, it is convenient to rewrite the stimulus as a vector **x** = [cos(θ), sin(θ)]^{T} on the unit circle. The direction θ is easily recovered from **x** via θ = atan2(*x*_{2}, *x*_{1}). The optimal linear estimator (OLE) is defined as the linear estimator minimizing the mean squared error 〈(**x̂** − **x**)^{2}〉. It minimizes the mean squared error for any kind of noise distribution (Salinas and Abbott, 1994) (no Gaussian assumption is necessary) and its weights are given by
where **Q**_{yy} is the response covariance matrix over all stimuli, **f̄**(θ) = **f**(θ) − ∫_{θ} **f**(θ) *d*θ, and **Q**_{xy} is the cross-covariance between stimulus **x** and neural response **y**.

We numerically estimated bias, variance, and mean squared error of the optimal linear estimator at *m* = 32 different stimulus values θ_{k} = 2π*k*/*m*. For each population size (minimum, 64; maximum, 4096 neurons), we generated 8192 realizations of heterogeneous populations with the same parameters as defined above.

## Results

We consider a simple model in which *n* neurons encode a one-dimensional circular stimulus (e.g., direction of motion) through bell-shaped tuning functions (for details on the model, see Materials and Methods). We introduce heterogeneity in the population by allowing the neurons to have different peak firing rates (amplitudes, *a*_{j}) but identical tuning widths. The degree of heterogeneity is set by the amount of variability in the amplitudes, which is controlled by the parameter κ = Var[√*a*_{j}]. The strength of the correlations is controlled by *c*_{0}, the correlation coefficient of two neurons with identical preferred directions. The relationship of *c*_{0} to the average level of correlations 〈*r*〉 depends on the decay constant in the correlation structure (Fig. 1*C*). For the set of parameters we used, 〈*r*〉 is ∼0.3 *c*_{0}. Our choice of parameterization by *c*_{0} is motivated by the fact that, for large populations, it is the more relevant quantity compared with the average correlations.

Although we have to choose a specific set of parameters for the figures, our results hold more generally, as long as both tuning curves and correlation structure do not change as a function of the population size and are sufficiently smooth and slowly varying.

### Dependence of Fisher information on population size

To quantify the accuracy of a population code, we calculate the Fisher information, *J*, which under the assumption of Gaussian noise can be written as the sum of two terms as follows:

$$J = J_{\mathrm{mean}} + J_{\mathrm{cov}}.$$

*J*_{mean} can be thought of as the information that is encoded in the average population activity, while *J*_{cov} is the information contained in the variances and covariances.

The difficulty in evaluating the Fisher information for large populations of neurons lies in inverting the *n* × *n* covariance matrix. Following the approach of Sompolinsky and coworkers (Sompolinsky et al., 2001; Shamir and Sompolinsky, 2006), we obtained an analytic expression for this inverse in our model, leading to an expression for the Fisher information that is considerably easier to study, even for populations with tens of thousands of neurons (for details, see Materials and Methods; Fig. 1; Eqs. 14, 22).

We first study how the Fisher information depends on the number of neurons in the population. This question has been addressed by a number of authors who reported different results (Abbott and Dayan, 1999; Sompolinsky et al., 2001; Wilke and Eurich, 2002; Shamir and Sompolinsky, 2006). The apparent discrepancy arises from subtle differences in the assumptions that were made about the population activity, such as for instance the noise model (additive vs Poisson-like). In the following, we provide a comprehensive treatment of this problem using our framework that includes most of the previous studies as special cases.

Figure 2 shows the total Fisher information, *J*, as a function of the population size, while Figure 3 splits *J* into its two components, *J*_{mean} and *J*_{cov}. For homogeneous populations, the total Fisher information grows with increasing population size and does not saturate to a finite bound, even in the presence of limited range correlations (Fig. 2*A*) (Wilke and Eurich, 2002). This is because the second term, *J*_{cov}, increases linearly with *n* (Fig. 3*A*) if the variances of the neurons are stimulus dependent (a property of Poisson-like noise). In contrast, *J*_{mean} saturates to a finite value if neurons are correlated (Fig. 3*A*) (Sompolinsky et al., 2001). Interestingly, for independent neurons with Poisson-like noise, the degree of amplitude variability does not affect the Fisher information (Fig. 2, compare black lines in *A*, *B*; see Materials and Methods, Eq. 20). If neurons are correlated, however, the total Fisher information is generally higher for heterogeneous populations than for homogeneous ones (Fig. 2, compare *A*, *B*). Responsible for this difference is *J*_{mean}, which no longer saturates in the presence of correlations if neurons have heterogeneous tuning functions (Fig. 3*B*, Eq. 17). In contrast, *J*_{cov} is unaffected by heterogeneity (Fig. 3, dashed lines).

In summary, the Fisher information saturates to a finite bound only if the variances of the neurons do not depend on the stimulus and all neurons have identical tuning curves. If one of these two conditions is not satisfied, the Fisher information increases linearly with the population size. While for large homogeneous populations it is dominated by *J*_{cov} (Fig. 3*C*), most of the information in a heterogeneous population is contributed by *J*_{mean} (Fig. 3*D*).

### Dependence of Fisher information on magnitude of correlations

We now investigate how the magnitude of noise correlations affects the accuracy of a population code. Generally, small limited-range correlations decrease the accuracy compared with the independent case, consistent with previous reports (Zohary et al., 1994; Abbott and Dayan, 1999; Sompolinsky et al., 2001; Wilke and Eurich, 2002). While in homogeneous populations this detrimental effect becomes stronger with increasing correlations (Fig. 2*C*), in heterogeneous populations it is both non-monotonic and population size dependent (Fig. 2*D*). These differences between homogeneous and heterogeneous populations are exclusively due to differences in *J*_{mean}, as *J*_{cov} is independent of both the degree of heterogeneity and the level of correlations (Fig. 3, dashed lines).

To characterize the effect of varying the magnitude of correlations, we quantified *J*_{mean} of a correlated population relative to the total Fisher information of an independent population. We refer to this quantity (*J*_{mean}/*J*_{indep}) as the relative *J*_{mean}. While in homogeneous populations the relative *J*_{mean} decreases with increasing *n* and increasing correlations (Fig. 4*A*), in the presence of tuning curve variability a higher level of limited range correlations can also increase *J*_{mean} substantially, in particular if the population is large (Fig. 4*B*).

There are two regimes for the effect of changes in correlation strength: one in which reducing correlation improves and one in which it impairs the performance. Which of the two applies to a given population code depends on whether correlations are smaller or larger than the minimum in the curves in Figure 4*B*. We term this value the “maximally detrimental” correlation strength *c*_{min}. If correlations are below *c*_{min}, decreasing them further improves the population code. In this regime, which we refer to as the “low-correlation regime,” the behavior is similar to that of homogeneous populations. In the “high-correlation regime,” in contrast, increasing the correlation strength improves the population code. The value of *c*_{min} is inversely related to the population size and decreases with *O*(1/*n* limit the value of *J*_{mean} relative to the independent case is given by the following (see Materials and Methods, Eq. 21):
Thus, asymptotically the relative performance depends only on the correlation coefficient of two neurons with identical preferred orientations, *c*_{0}, and the degree of heterogeneity, κ. The more heterogeneous a population is, the higher *J*_{mean} becomes (Fig. 6). However, it should be noted that *J*_{mean} cannot be increased arbitrarily by simply increasing the heterogeneity, κ, because the amplitudes *a*_{j} are constrained to be positive, have an average of 1, and neurons have a maximum possible firing rate due to biophysical constraints. At the same time, the value κ = 0.25 that we used for the above figures is likely to be an underestimate because in real neural populations sources of heterogeneity other than amplitudes exist. We explore this issue further below.
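
The dependence of the relative *J*_{mean} on correlation strength can be illustrated with a minimal sketch. It is not the model code behind the figures. It assumes von Mises-shaped tuning curves, Poisson-like variances (variance equal to mean), preferred orientations on a regular grid, and a circulant limited range correlation matrix with entries *c*_{0} exp(−distance/*L*). The tuning width, *L* = 1, the uniform amplitude distribution, and all function names are hypothetical choices. Because the correlation matrix is circulant, *J*_{mean} can be evaluated in the Fourier basis without a matrix inverse.

```python
import cmath
import math
import random


def limited_range_eigenvalues(n, c0, length=1.0):
    """Eigenvalues of the circulant correlation matrix with entries
    c0 * exp(-distance / length) off the diagonal and 1 on the diagonal."""
    row = [1.0]
    for d in range(1, n):
        distance = 2.0 * math.pi * min(d, n - d) / n  # circular distance (radians)
        row.append(c0 * math.exp(-distance / length))
    return [sum(row[d] * math.cos(2.0 * math.pi * k * d / n) for d in range(n))
            for k in range(n)]


def relative_j_mean(c0, amplitudes, theta=0.0, width=2.0):
    """J_mean / J_indep for Poisson-like variances (assumed: variance = mean)."""
    n = len(amplitudes)
    g = []  # g[i] = f_i'(theta) / sqrt(f_i(theta)); the peak rate cancels in the ratio
    for i, amp in enumerate(amplitudes):
        delta = theta - 2.0 * math.pi * i / n
        shape = math.exp(width * (math.cos(delta) - 1.0))  # tuning curve shape
        slope = -width * math.sin(delta) * shape           # its derivative
        g.append(math.sqrt(amp) * slope / math.sqrt(shape))
    eig = limited_range_eigenvalues(n, c0)
    j_indep = sum(x * x for x in g)
    j_mean = 0.0
    for k in range(n):
        # Discrete Fourier coefficient of g at frequency k
        g_hat = sum(g[i] * cmath.exp(-2j * math.pi * i * k / n) for i in range(n))
        j_mean += abs(g_hat) ** 2 / (n * eig[k])
    return j_mean / j_indep


def make_populations(n=256, seed=0):
    """Homogeneous amplitudes (all 1) and hypothetical heterogeneous amplitudes."""
    rng = random.Random(seed)
    homogeneous = [1.0] * n
    heterogeneous = [rng.uniform(0.1, 1.9) for _ in range(n)]
    return homogeneous, heterogeneous


if __name__ == "__main__":
    hom, het = make_populations()
    for c0 in (0.0, 0.1, 0.3, 0.6, 0.9):
        print(c0, round(relative_j_mean(c0, hom), 3), round(relative_j_mean(c0, het), 3))
```

In this toy setting the homogeneous ratio keeps falling as *c*_{0} grows, whereas the heterogeneous ratio falls and then rises again, which is the qualitative pattern described above. The absolute numbers depend on the assumed parameters and should not be read as a reproduction of Figure 4.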

### Maximum-likelihood estimator attains Cramér–Rao bound

One potential caveat that has to be addressed when using Fisher information is that it provides only a bound (the Cramér–Rao bound) on the accuracy of a population code. Unfortunately, under some conditions potentially relevant for neural coding this bound is not tight, which means there exists no estimator that achieves this performance (Bethge et al., 2002; Berens et al., 2011). If certain assumptions are placed on the statistics of population activity (e.g., independent and identically distributed samples), the maximum-likelihood estimator can be proven to converge (for large *n*) to the Cramér–Rao bound. However, under the conditions studied here, the responses are neither independent nor identically distributed and we do not know whether and how fast the bound is attained. Thus, it is unclear whether comparing Fisher information under different conditions (such as different levels of correlation or different amounts of heterogeneity) can provide insights into the accuracy of population codes because comparing loose upper bounds is meaningless. To address this problem, we additionally evaluated the efficiency of the maximum-likelihood estimator numerically (for details, see Materials and Methods). For both homogeneous and heterogeneous populations, it attains the Cramér–Rao bound very quickly (Fig. 7), even in the case of nonidentical and nonindependent samples. Interestingly, for homogeneous populations, the rate of convergence depends more on the level of correlations than for heterogeneous populations (Fig. 7). More importantly, for both types of population models, the performance of the maximum-likelihood estimator is within 5% of the Cramér–Rao bound for all population sizes >64 neurons. Thus, the bound is sufficiently tight and the use of Fisher information is well justified.
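
The kind of numerical efficiency check described here can be sketched as follows. This is a simplified stand-in, not the procedure from Materials and Methods: it assumes independent Poisson neurons rather than the correlated populations studied in this paper, and the tuning parameters, grid resolutions, trial count, and function names are all hypothetical. It compares the mean squared error of a grid-search maximum-likelihood estimate with the inverse Fisher information.

```python
import math
import random

# Hypothetical tuning parameters: 64 neurons, baseline 1, gain 10, von Mises width 2
N, BASE, GAIN, WIDTH = 64, 1.0, 10.0, 2.0
PREFS = [2.0 * math.pi * j / N for j in range(N)]


def rates(theta):
    """Mean spike counts of all neurons for stimulus theta."""
    return [BASE + GAIN * math.exp(WIDTH * (math.cos(theta - p) - 1.0)) for p in PREFS]


def fisher_information(theta):
    """Fisher information for independent Poisson neurons: sum of f'^2 / f."""
    total = 0.0
    for p in PREFS:
        bump = GAIN * math.exp(WIDTH * (math.cos(theta - p) - 1.0))
        slope = -WIDTH * math.sin(theta - p) * bump
        total += slope ** 2 / (BASE + bump)
    return total


def poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def log_likelihood(theta, counts):
    """Poisson log-likelihood up to a term that does not depend on theta."""
    return sum(y * math.log(f) - f for y, f in zip(counts, rates(theta)))


def ml_estimate(counts):
    """Coarse grid over the full circle, then a fine grid around the best point."""
    coarse = [2.0 * math.pi * i / 72 for i in range(72)]
    best = max(coarse, key=lambda t: log_likelihood(t, counts))
    step = math.radians(0.25)
    fine = [best + step * i for i in range(-20, 21)]
    return max(fine, key=lambda t: log_likelihood(t, counts))


def efficiency(theta=1.0, trials=200, seed=0):
    """Cramer-Rao bound divided by the empirical mean squared error."""
    rng = random.Random(seed)
    mean_rates = rates(theta)
    sq_err = 0.0
    for _ in range(trials):
        counts = [poisson(f, rng) for f in mean_rates]
        err = ml_estimate(counts) - theta
        err = math.atan2(math.sin(err), math.cos(err))  # wrap to (-pi, pi]
        sq_err += err ** 2
    return 1.0 / (fisher_information(theta) * (sq_err / trials))


if __name__ == "__main__":
    print("efficiency:", efficiency())
```

An efficiency near 1 indicates that the estimator operates close to the bound. With only a few hundred trials the estimate itself carries sampling error of roughly ten percent, so a check of the 5% criterion quoted above would need many more trials.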

### Other sources of heterogeneity

Another issue that has to be addressed is whether our findings above generalize to other forms of tuning curve heterogeneity, such as variable widths and baseline firing rates. Because only the amplitude variability model is analytically tractable, we ran numerical simulations to estimate the Fisher information for populations with other forms of heterogeneity. First, we varied the tuning widths while keeping all other parameters constant (Fig. 8*A*,*C*). Second, we created populations of neurons by randomly drawing sets of tuning parameters from a dataset of orientation tuning curves in monkey V1 (Fig. 8*B*,*D*). In both cases, the dependence of the Fisher information on the level of correlations is similar to that in the amplitude variability model. One notable difference is that *J*_{cov} is not completely independent of correlation strength if parameters other than tuning amplitude are variable. For moderate levels of correlation, however, the differences are relatively small.

In the above analysis, we assumed that the preferred directions of the neurons are arranged on a regular grid around the circle. As this is not the case in real neural populations, we numerically analyzed the effect of this assumption on the results presented above. We found that the Fisher information is virtually unaffected by randomly assigning preferred directions compared with equally spacing them (data not shown).

In addition, in real neural populations not only the tuning curves but also the pairwise correlation coefficients are heterogeneous. This case has been studied by adding independent Gaussian noise (with variance σ^{2}) to each correlation coefficient (Wilke and Eurich, 2002; Shamir and Sompolinsky, 2006). Unfortunately, if the variance of the correlation coefficients is fixed independent of the population size, the covariance matrix will be valid (positive definite) only up to *n* ≈ 1/(2σ)^{2}. Thus, once the population size exceeds this critical *n*, the model is no longer valid. One solution to this problem is to scale the variance of the correlation coefficients by 1/*n*. We simulated this scenario numerically and found that, although it increases the Fisher information by a small constant factor, it does not affect the results qualitatively (data not shown).
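
The critical population size can be made concrete with a small sketch. The eigenvalues of a symmetric matrix with independent entries of variance σ^{2} spread over roughly ±2σ√*n*, so adding such jitter to a matrix whose smallest eigenvalue is 1 drives that eigenvalue to zero near *n* = 1/(2σ)^{2}. The code below assumes an identity baseline (no limited range structure) purely for illustration, and the function names are hypothetical. Positive definiteness is tested by attempting a Cholesky factorization.

```python
import math
import random


def critical_size(sigma):
    """Approximate population size at which positive definiteness is lost."""
    return 1.0 / (2.0 * sigma) ** 2


def is_positive_definite(matrix):
    """Attempt a Cholesky factorization; it succeeds only for positive definite input."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # non-positive pivot: not positive definite
                chol[i][i] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return True


def jittered_identity(n, sigma, rng):
    """Identity matrix plus symmetric Gaussian jitter on the off-diagonal entries
    (assumed baseline: uncorrelated neurons)."""
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i):
            m[i][j] = m[j][i] = rng.gauss(0.0, sigma)
    return m


if __name__ == "__main__":
    rng = random.Random(0)
    sigma = 0.1
    print("critical n:", critical_size(sigma))
    for n in (5, 15, 25, 50, 200):
        print(n, is_positive_definite(jittered_identity(n, sigma, rng)))
```

With σ = 0.1 the critical size is about 25. Matrices well below that size pass the check and matrices far above it fail, while sizes near the threshold can go either way depending on the random draw.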

### Effect of noise entropy on encoding accuracy

So far, we considered the level of correlations a free parameter. We now investigate the implications of this assumption. In general, increasing correlations between neurons while keeping the variances of the individual neurons fixed reduces the noise entropy of the population. Because a lower noise entropy reduces the variance in most directions, it is likely to improve the encoding. To understand this argument intuitively, consider the two-neuron toy example shown in Figure 9. The top row shows the firing rate (marginal) distribution of each neuron, while the bottom row depicts the two-dimensional joint distribution. The entropy of a normal distribution is closely related to the area enclosed by the 1 SD ellipse (it is linearly related to the logarithm of the area). Figure 9*A* shows an uncorrelated reference distribution with marginal SDs of 1. The distribution in Figure 9*B* has the same marginal distributions but a correlation coefficient of 0.8. The entropy of this correlated distribution is ∼0.5 nats (∼0.74 bits) smaller than that of the uncorrelated distribution with identical marginals. To generate a distribution that has the same entropy as our reference distribution but a correlation of 0.8, we have to increase the marginal SDs to 1.29 (Fig. 9*C*).
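
The numbers in this toy example follow from the differential entropy of a bivariate normal distribution, which is half the logarithm of (2π*e*)^{2} times the determinant of the covariance matrix. The short sketch below, with hypothetical function names, evaluates the entropy change and the marginal SD needed to restore the reference entropy.

```python
import math


def gaussian_entropy_bits(sd, rho):
    """Differential entropy (bits) of a bivariate normal with equal marginal SDs
    and correlation coefficient rho."""
    det = sd ** 4 * (1.0 - rho ** 2)  # determinant of the 2x2 covariance matrix
    return 0.5 * math.log2((2.0 * math.pi * math.e) ** 2 * det)


def sd_for_equal_entropy(rho):
    """Marginal SD that restores the entropy of the uncorrelated unit-SD reference:
    sd**4 * (1 - rho**2) = 1."""
    return (1.0 - rho ** 2) ** -0.25


if __name__ == "__main__":
    reference = gaussian_entropy_bits(1.0, 0.0)
    correlated = gaussian_entropy_bits(1.0, 0.8)
    delta_bits = correlated - reference
    print("entropy change (bits):", delta_bits)
    print("entropy change (nats):", delta_bits * math.log(2.0))
    print("SD for equal entropy:", sd_for_equal_entropy(0.8))
```

Under these formulas the entropy reduction at a correlation of 0.8 evaluates to about 0.51 nats (0.74 bits), and the matching marginal SD to about 1.29.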

Because the accuracy of a population code depends on how the signal is encoded relative to the noise, a lower noise entropy by itself does not necessarily imply an improvement in coding accuracy. For instance, in the toy example in Figure 9*B*, the variance along the main diagonal is larger than in the independent case in Figure 9*A*. Therefore, estimating the mean activity of the two neurons is less accurate than, for instance, their difference. In a heterogeneous population of neurons, the signal encoding is distributed across all directions (Fig. 1*M*). Because the Fisher information is the sum over the signal-to-noise ratios in each frequency component, a reduction in noise entropy should lead to an improvement in coding accuracy in this case.

To understand how the level of correlations affects the noise entropy in the multineuron (high-dimensional) case, we calculated the noise entropy as a function of the average correlation strength. As *c*_{0} approaches 1, the noise entropy diverges to −∞ (Fig. 10*A*). In other words, there is a subspace in which the system is effectively noise-free. Any signal in this subspace can be decoded with infinite precision. This explains why the relative Fisher information diverges as *c*_{0} approaches 1 (Fig. 4*B*).
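
For a Gaussian noise model with fixed variances, the noise entropy relative to the independent case is half the log-determinant of the correlation matrix. The sketch below evaluates this for a circulant limited range correlation matrix with entries *c*_{0} exp(−distance/*L*). The circulant form, *L* = 1, the population size, and the function names are assumptions made for illustration.

```python
import math


def limited_range_eigenvalues(n, c0, length=1.0):
    """Eigenvalues of the circulant correlation matrix with entries
    c0 * exp(-distance / length) off the diagonal and 1 on the diagonal."""
    row = [1.0]
    for d in range(1, n):
        distance = 2.0 * math.pi * min(d, n - d) / n  # circular distance (radians)
        row.append(c0 * math.exp(-distance / length))
    return [sum(row[d] * math.cos(2.0 * math.pi * k * d / n) for d in range(n))
            for k in range(n)]


def entropy_change_bits(n, c0, length=1.0):
    """Noise entropy relative to the independent case with equal variances,
    0.5 * log2 det(R), in bits."""
    eig = limited_range_eigenvalues(n, c0, length)
    return 0.5 * sum(math.log2(lam) for lam in eig)


if __name__ == "__main__":
    for c0 in (0.0, 0.3, 0.6, 0.9, 0.99):
        print(c0, round(entropy_change_bits(64, c0), 2))
```

In this sketch the entropy falls monotonically as *c*_{0} grows and drops increasingly steeply as *c*_{0} approaches 1, because most eigenvalues of the correlation matrix lie close to 1 − *c*_{0}.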

In real neural populations, however, there is a certain amount of independent noise in the system, due to input noise (e.g., photoreceptors), channel noise (e.g., unreliable synaptic transmission), or other sources. This noise cannot be removed by any type of encoding. This implies that most models used in previous studies (including our above model) are not well constrained if the correlation strength is considered a free parameter because they allow for a degenerate case in which the noise entropy becomes arbitrarily small (by simply increasing the level of correlations until *c*_{0} = 1).

To disentangle the effect of the noise correlation structure from that of the noise entropy, we here introduce an additional constraint and fix the noise entropy as the correlations are changed. A convenient way to do so is to relax the Poisson assumption on the variances of the neurons, such that in the modified model, the neurons may have Fano factors *F* ≠ 1. For simplicity, we keep the Fano factor constant across the population and define the variances of the neurons as follows:

σ_{j}^{2}(θ) = *F* · *f*_{j}(θ), where *f*_{j}(θ) is the mean response of neuron *j*

Figure 10*B* shows the Fano factors necessary to maintain constant noise entropy when increasing the correlation strength. Changing the Fano factor affects only *J*_{mean} but leaves *J*_{cov} unchanged. Figure 10, *C* and *D*, show *J*_{mean} for heterogeneous populations as a function of the population size for different levels of correlation, analogous to Figure 3, *B* and *D*, but with identical noise entropy among populations of equal size.
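
One way to obtain such Fano factors can be sketched under stated assumptions. If the variances scale with a common factor *F* and the reference is an independent population with *F* = 1, the log-determinant of the covariance matrix changes by *n* log *F* plus the log-determinant of the correlation matrix. Constant entropy then requires *F* to equal the determinant of the correlation matrix raised to the power −1/*n*. The choice of reference, the circulant correlation matrix with *L* = 1, and the function names are assumptions and may differ from the procedure used for Figure 10*B*.

```python
import math


def limited_range_eigenvalues(n, c0, length=1.0):
    """Eigenvalues of the circulant correlation matrix with entries
    c0 * exp(-distance / length) off the diagonal and 1 on the diagonal."""
    row = [1.0]
    for d in range(1, n):
        distance = 2.0 * math.pi * min(d, n - d) / n  # circular distance (radians)
        row.append(c0 * math.exp(-distance / length))
    return [sum(row[d] * math.cos(2.0 * math.pi * k * d / n) for d in range(n))
            for k in range(n)]


def fano_for_constant_entropy(n, c0, length=1.0):
    """Fano factor F solving n * log(F) + log det(R) = 0."""
    eig = limited_range_eigenvalues(n, c0, length)
    return math.exp(-sum(math.log(lam) for lam in eig) / n)


if __name__ == "__main__":
    for n in (64, 256):
        for c0 in (0.1, 0.3, 0.6, 0.9):
            print(n, c0, round(fano_for_constant_entropy(n, c0), 3))
```

Because every eigenvalue is at least 1 − *c*_{0} and their mean is 1, the required Fano factor lies between 1 and 1/(1 − *c*_{0}).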

Similar to the results presented above, under the constant entropy constraint there are two regimes for the effect of correlations. The low-correlation regime applies when populations are small or correlations are low. In this regime, the noise entropy is very similar to the independent case (Fig. 10*A*) and, thus, reducing correlations can lead to a substantial improvement (Figs. 10*D*, 11*B*). As before, the critical level of correlations that separates the low-correlation regime from the high-correlation regime converges to zero for increasing population sizes. In the high-correlation regime, the exact level of correlations is not important if the noise entropy is constant (Fig. 11*B*). This supports the idea that the improvement with increased correlations we observed above is due to a reduction of the noise entropy. In the large *n* limit the expected value for *J*_{mean} of a correlated population is κ times that of an independent population (see Materials and Methods, Eq. 33):

〈*J*_{mean}〉 = κ *J*_{indep}

Thus, *J*_{mean} depends only on the amount of amplitude variability, κ. It is independent of both the structure and the magnitude of correlations as none of the parameters of the correlation structure (e.g., *c*_{0} or *L*) enter the right-hand side of Equation 46.

## Discussion

Many theoretical studies have investigated the effect of noise correlations on encoding accuracy (Snippe and Koenderink, 1992; Zohary et al., 1994; Abbott and Dayan, 1999; Sompolinsky et al., 2001; Wilke and Eurich, 2002; Shamir and Sompolinsky, 2004, 2006; Josić et al., 2009). One of the main conclusions has been that limited range correlations are detrimental for a population code compared with independence or other correlation structures (e.g., uniform) (Abbott and Dayan, 1999; Sompolinsky et al., 2001; Wilke and Eurich, 2002). Because of this, it is often assumed that decreasing correlations improves any population code—as it becomes more similar to the independent case (Zohary et al., 1994; Abbott and Dayan, 1999; Sompolinsky et al., 2001).

In this paper, we showed that these results hold only under the assumption of homogeneous populations of neurons or very small numbers of neurons. In the range of biologically plausible parameters (heterogeneous tuning curves, nonadditive noise, thousands of neurons), increasing correlations can substantially increase the Fisher information (Fig. 4*B*). This increase in accuracy is mainly due to an overall reduction of the noise entropy that is associated with stronger correlations. When correlations are increased, the noise power increases in a few low-frequency Fourier components while it decreases in all higher-frequency components (it quickly asymptotes to 1 − *c*_{0} with increasing frequency; Fig. 1*D*). Because in homogeneous populations the stimulus encoding is confined to the low-frequency Fourier components, higher correlations have detrimental effects. In heterogeneous populations, in contrast, the stimulus encoding is distributed among all frequencies and the high-frequency components have a better signal-to-noise ratio if correlations are strong. These two mechanisms compete, leading to the non-monotonic correlation dependence of the Fisher information of heterogeneous populations. As the population size is increased, the high-frequency components dominate because their number grows linearly with the population size. As a consequence, in the large *n* case, increasing correlations is almost always beneficial while a decrease is beneficial only for small population sizes and small enough correlations.

This result raises two important questions. First, if higher correlations improve the accuracy, why does the brain not implement a population code with strong correlations? Second, why do experimental studies find reduced noise correlations under experimental conditions in which an improved population code is expected (Gutnisky and Dragoi, 2008; Cohen and Maunsell, 2009; Mitchell et al., 2009; Gu et al., 2011)?

With regard to the first question, we suggest that the level of correlations should not be interpreted as a free parameter that can be optimized independently while all other parameters, such as tuning functions and variances, are kept fixed. Under such assumptions, increasing the level of correlations to the maximum (i.e., *c*_{0} = 1) leads to maximum Fisher information (Figs. 4*B*, 6). The noise power in the high frequencies vanishes (as it asymptotes to 1 − *c*_{0}; Fig. 1*D*), allowing for effectively noise-free decoding of any signal that is encoded in these directions. However, in real neural populations, the amount of noise in the response cannot be arbitrarily small because input noise cannot be removed by processing. Thus, finding the optimal level of correlations under the assumption of fixed tuning functions and variances leads to a degenerate solution, which is not biologically plausible.

To avoid the noise-free case and better constrain the problem, we also analyzed population codes in which the total amount of noise (the noise entropy) is kept constant and only the correlation structure is changed. Our results demonstrate that, under this constraint, the performance of population codes in large populations is primarily determined by the amount of heterogeneity in the tuning functions of the neurons and by the overall noise entropy (Fig. 10*D*, Eq. 46), while the specific structure of the noise does not appear to be as important as commonly assumed. Whether the noise entropy is of the same importance for predicting the accuracy of a population code also for other scenarios not considered explicitly in this study—such as stimulus-dependent (Josić et al., 2009) and heterogeneous correlations, or simultaneous encoding of multiple stimulus dimensions (Zhang and Sejnowski, 1999)—remains to be investigated.

The second question raised above was how to reconcile our theoretical results with empirical findings of reduced correlations (Gutnisky and Dragoi, 2008; Cohen and Maunsell, 2009; Mitchell et al., 2009; Gu et al., 2011). One possibility is that the size of the population that is read out is small enough to be in the low-correlation regime. Although we did not assess this in detail by taking into account all relevant parameters from these studies, Figures 5 and 8*B* suggest that the relevant population size would have to be on the order of at most a few hundred cells. Given the number of cells even in a single cortical column, this seems rather unlikely.

A second explanation is related to the neural readout mechanism. Because we currently do not know how information is read out by downstream neurons (or populations thereof), we quantified the maximum amount of information that can be extracted from the population response. Of course, the effect of correlations can be different for other, computationally constrained readouts. If, for instance, a linear readout is assumed, the conclusions would be different. Even though *J*_{mean} is often referred to as the portion of the information that can be read out by linear methods, this notion is problematic—at least in the framework of stimulus reconstruction considered here. To illustrate this point, we estimated the performance of the OLE decoding the activity of a heterogeneous population of neurons. The efficiency of the OLE does not converge to the inverse of *J*_{mean} (Fig. 12*A*). In addition, in contrast to the Fisher information and the maximum-likelihood estimator, the accuracy of the OLE does not increase with increased correlations, not even for large population sizes (Fig. 12*B*). The reason for this behavior is the fact that the OLE is a biased estimator, for which the Cramér–Rao bound is not simply the inverse of the Fisher information but the bias has to be taken into account. Because the estimator bias depends on the correlations, the dependence of the mean squared error on correlations is not captured well by the Fisher information. Consequently, if downstream areas are confined to linear readout mechanisms, reducing pairwise correlations increases the readout accuracy.

Assuming optimal readout, a third possibility is that the improved performance is not exclusively due to reduced correlations. For instance, under attention Fano factors decrease and firing rates increase (Cohen and Maunsell, 2009; Mitchell et al., 2009). Although it has been argued that these changes are small compared with the relative changes in correlation strength and that their effect is negligible for large populations, this argument is problematic. The effect of different factors was assessed using highly suboptimal pooling rules (Cohen and Maunsell, 2009; Mitchell et al., 2009), and the conclusions derived from these pooling models do not generalize to other (e.g., optimal) readout mechanisms, as our above results show. Our analysis suggests an alternative interpretation: the higher firing rates under attention increase the signal while the reduced Fano factors and correlations indicate the suppression of a common noise source, which reduces the noise entropy and therefore leads to improved coding accuracy. For example, if the response of a population of neurons with unit variances and weak correlations (〈*r*〉 = 0.05) is confounded by a common noise source with variance 0.05, removing this common noise source reduces noise correlations from ∼0.1 to ∼0.05 (a 50% change) while reducing the variances from ∼1.05 to 1 (a 5% change). In this situation, however, the changes in correlations cannot be separated from the changes in variances, and neither is more important than the other. In fact, considering each one in isolation is not meaningful.
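
The arithmetic of this example can be checked directly. A shared noise source with variance *v* adds *v* to every variance and to every pairwise covariance. The function names below are hypothetical.

```python
def with_common_noise(private_var, private_corr, common_var):
    """Variance and pairwise correlation after adding a shared noise source."""
    total_var = private_var + common_var
    total_cov = private_corr * private_var + common_var
    return total_var, total_cov / total_var


def relative_change(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    var, corr = with_common_noise(1.0, 0.05, 0.05)
    print("with common noise: variance", var, "correlation", round(corr, 4))
    print("correlation change:", round(relative_change(corr, 0.05), 3))
    print("variance change:", round(relative_change(var, 1.0), 3))
```

With the common noise present, the variance is 1.05 and the correlation is 0.10/1.05 ≈ 0.095. Removing the common source therefore lowers the correlation by about 48% and the variance by about 4.8%, in line with the rounded figures of 50% and 5% given above.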

In summary, the notion that reducing correlations leads to a more accurate encoding is not a general principle but is only true under certain conditions. Assumptions about the size of the population and the way information is read out can strongly affect the conclusions. For optimal decoding of large populations, the total amount of noise—as measured by the noise entropy—is more important than the specific noise correlation structure.

## Notes

Supplemental material for this article is available at http://bethgelab.org/code/ecker2011. It contains MATLAB code to reproduce all figures and numerical simulations in this paper. This material has not been peer reviewed.

## Footnotes

- Received May 21, 2011.
- Revision received July 28, 2011.
- Accepted August 9, 2011.

This work was supported by the Bernstein Award (M.B.) by the German Ministry of Education, Science, Research, and Technology (Bundesministerium für Bildung und Forschung) Grant FKZ 01GQ0601, the German Excellency Initiative through the Centre for Integrative Neuroscience Tübingen (M.B., A.S.E.), the Max Planck Society (M.B., A.S.E., P.B.), the German National Academic Foundation (P.B.), National Eye Institute–National Institutes of Health Grant R01 EY018847 (A.S.T.), The Arnold and Mabel Beckman Foundation Young Investigator Award (A.S.T.), the Veterans Affairs Merit Award (A.S.T.), and The McKnight Endowment Fund for Neuroscience Scholar Award (A.S.T.).

We thank R. Haefner and S. Gerwinn for discussions and comments on an earlier version of this manuscript.

- Correspondence should be addressed to Alexander S. Ecker, Max Planck Institute for Biological Cybernetics, Spemannstrasse 41, 72076 Tübingen, Germany. aecker{at}tuebingen.mpg.de

- Copyright © 2011 the authors 0270-6474/11/3114272-12$15.00/0

This article is freely available online through the *J Neurosci* Open Choice option.