## Abstract

In natural scenes, objects generally appear together with other objects. Yet, theoretical studies of neural population coding typically focus on the encoding of single objects in isolation. Experimental studies suggest that neural responses to multiple objects are well described by linear or nonlinear combinations of the responses to constituent objects, a phenomenon we call stimulus mixing. Here, we present a theoretical analysis of the consequences of common forms of stimulus mixing observed in cortical responses. We show that some of these mixing rules can severely compromise the brain's ability to decode the individual objects. This cost is usually greater than the cost incurred by even large reductions in the gain or large increases in neural variability, explaining why the benefits of attention can be understood primarily in terms of a stimulus selection, or demixing, mechanism rather than purely as a gain increase or noise reduction mechanism. The cost of stimulus mixing becomes even higher when the number of encoded objects increases, suggesting a novel mechanism that might contribute to set size effects observed in myriad psychophysical tasks. We further show that a specific form of neural correlation and heterogeneity in stimulus mixing among the neurons can partially alleviate the harmful effects of stimulus mixing. Finally, we derive simple conditions that must be satisfied for unharmful mixing of stimuli.

- computational neuroscience
- Fisher information
- neural decoding
- neural encoding
- population coding
- theoretical neuroscience

## Introduction

In natural vision, objects typically appear within the context of other objects rather than in isolation. It is, therefore, important to understand how cortical neurons encode multiple objects. Experimental studies suggest that in many cortical areas, neural responses to the presentation of multiple stimuli can be successfully described as a linear or nonlinear combination of the responses to the individual stimuli. In inferior temporal (IT) cortex, for example, responses of many individual neurons to pairs and triplets of objects are well described by the average of their responses to individual stimuli (Zoccolan et al., 2005, 2007). A similar weighted averaging model also provides a good description of the responses of motion-selective middle temporal (MT; Recanzone et al., 1997; Britten and Heuer, 1999) and medial superior temporal (MST) neurons (Recanzone et al., 1997) to pairs of moving objects, responses of V4 neurons to composite shapes consisting of several oriented line segments (Nandy et al., 2013), population responses in V1 to simultaneously presented gratings (Busse et al., 2009; MacEvoy et al., 2009), and at a larger scale, fMRI responses to multiple objects in object selective area LOC (MacEvoy and Epstein, 2009) and in V4 (Beck and Kastner, 2007).

In working memory and associative learning tasks, when multiple stimuli have to be stored in memory simultaneously, responses of single neurons in prefrontal cortex are again a potentially complex function of multiple stimuli as well as other task parameters (Duncan, 2001; Warden and Miller, 2007, 2010). Such “mixed selectivity” has been argued to be crucial for successful performance in context-dependent behavioral tasks (Rigotti et al., 2010, 2013). However, mixed selectivity is not unreservedly beneficial (Barak et al., 2013). By mapping two similar points in the input space to points that are farther apart in the output space, stimulus mixing can make them more easily discriminable. The same, however, applies to noisy versions of the same stimulus that one would not want to make more discriminable, thus creating a problem of generalization or robustness against noise (Barak et al., 2013). The extent of this problem for commonly observed forms of stimulus mixing in the brain is unknown and an analysis of what types of mixing are more or less vulnerable to this problem is lacking.

In this article, using both analytical and numerical tools, we present a systematic analysis of some common forms of stimulus mixing observed in cortical responses with regard to their consequences for stimulus encoding in the presence of neural variability. We show that some of these common mixing rules, such as weighted averaging, can be profoundly harmful for stimulus encoding. Another commonly observed, divisive form of stimulus mixing (Allman et al., 1985; Cavanaugh et al., 2002) can also be harmful for stimulus encoding, although much less so than weighted averaging. We also derive mathematical conditions that must be satisfied for unharmful mixing of stimuli, and provide geometric explanations for what makes particular forms of stimulus mixing more or less harmful than others.

## Materials and Methods

For ease of reference, Table 1 lists the frequently used symbols in the article.

#### Derivation of the Fisher information matrix

We use a multivariate Gaussian distribution to model neural variability. For a Gaussian distribution, the *ij*th term of the Fisher information matrix (FIM) is given by the following (Abbott and Dayan, 1999):
where **f** is a column vector of the mean responses of all neurons in the population. In the linear mixing model (see Results), the mean response of a neuron *k* to a pair of stimuli (*s*_{1}, *s*_{2}) is assumed to be a weighted average of its mean responses to each individual stimulus alone (Eq. 73). The individual tuning functions describing the mean responses of neurons to single stimuli are assumed to be von Mises:
where α, γ, and η are the tuning function parameters and φ is the neuron's preferred stimulus. Here and in the rest of this paper, differences between circular variables should always be understood as angular differences. The covariance matrix *Q* can be expressed as *Q* = *S R S*, where *S* is a diagonal matrix of the standard deviations (SDs) of neural responses and *R* is the correlation matrix. In our problem, *R* has a block structure:
with *A* representing the correlations between the neurons within the same group and *B* representing the across-group correlations. We assume that within-group correlations decay exponentially with the angular difference between the preferred stimuli of neurons:
where δ is the Kronecker delta function. Across-group correlations are simply scaled versions of the within-group correlations:
The inverse of the covariance matrix is given by *Q*^{−1} = *S*^{−1}*R*^{−1}*S*^{−1}. Since *S* is diagonal, its inverse is straightforward. The inverse of *R* is less so. From Equation 3, blockwise inversion of *R* yields:
Importantly, *A* and *B* are circulant matrices, hence they are both diagonalized in the Fourier basis. This implies that Equation 6 can be written as follows:
where *U* is the unitary discrete Fourier transform matrix with entries *U*_{kj} = exp(−2π*ikj*/*n*)/√*n* (where *n* is the number of neurons in each group), *U** is its conjugate transpose, and *Ã* and *B̃* are diagonal matrices of the eigenvalues of *A* and *B*, respectively, which can be found by taking the discrete Fourier transforms (DFT) of the first columns of *A* and *B*.

Let us denote **ã** = diag(*Ã*) and **b̃** = diag(*B̃*) to be the diagonals of *Ã* and *B̃*. Note that because *Ã* and *B̃* are diagonal matrices:
Similarly:

#### Poisson-like noise

We first derive *I*_{mean} and *I*_{cov} for a Poisson-like noise model, in which the variance of each neuron's response is equal to its mean.

##### Derivation of I_{mean}.

We can write down the first term of the Fisher information matrix as follows:
where *S*_{1} and *S*_{2} are diagonal matrices of the SDs σ_{k} of the responses of neurons in the first and second group, respectively. For Poisson-like noise, σ_{k} = √*f*_{k}. We denote by **g**^{i} the vector whose *k*th entry is equal to σ_{k}^{−1}∂*f*_{k}/∂*s*_{i}, i.e., the derivative of the *k*th neuron's mean response with respect to *s*_{i} divided by the SD of its variability, where *k* ranges only over the neurons in the first group. Similarly, we denote by **h**^{i} the vector whose *k*th entry is equal to σ_{k}^{−1}∂*f*_{k}/∂*s*_{i}, but where *k* now ranges over the neurons in the second group only.

With this notation, we can rewrite Equation 10 as follows:
where **g̃**^{i} = *U***g**^{i} and **ğ**^{i} = *U****g**^{i} are the DFT and the inverse DFT of **g**^{i}, respectively. Similarly, **h̃**^{i} and **h̆**^{i} are the DFT and the inverse DFT of **h**^{i}. Recall also that *C* and *D* are diagonal matrices defined in Equations 8 and 9, respectively. Note that there are different conventions on how to compute the DFT and the inverse DFT; our usage is consistent with MATLAB's implementation of the fft and ifft functions.

The scaling of *I*_{ij,mean} with *n* is similar to the corresponding scaling relationship in the case of the encoding of a single stimulus analyzed previously in Sompolinsky et al. (2001) and in Ecker et al. (2011): for a homogeneous population, Equation 11 saturates to a finite value in the presence of noise correlations (*c*_{0} ≠ 0), but diverges for an independent population (*c*_{0} = 0). A detailed analysis of the asymptotic behavior of *I*_{mean} is provided below for the simpler case of additive noise.
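This scaling behavior can be checked numerically. The sketch below computes the linear Fisher information **f′**^{T}*Q*^{−1}**f′** for a single stimulus encoded by a homogeneous population with limited-range correlations and additive noise; the von Mises shape and all parameter values (*c*_{0} = 0.3, *L* = 1, γ = 2) are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def linear_fisher_info(n, c0, L=1.0, gamma=2.0, sigma=1.0):
    """f'^T Q^{-1} f' for one stimulus, additive noise with limited-range
    correlations R_jk = c0 * exp(-|dphi|/L) + (1 - c0) * delta_jk."""
    phi = 2 * np.pi * np.arange(n) / n                   # preferred stimuli
    # assumed von Mises tuning f(s) = exp(gamma*(cos(s - phi) - 1)); derivative at s = 0
    fprime = gamma * np.sin(phi) * np.exp(gamma * (np.cos(phi) - 1))
    d = np.abs(phi[:, None] - phi[None, :])
    d = np.minimum(d, 2 * np.pi - d)                     # angular distance on the circle
    Q = sigma**2 * (c0 * np.exp(-d / L) + (1 - c0) * np.eye(n))
    return fprime @ np.linalg.solve(Q, fprime)

# with correlations the information saturates; without, it grows linearly in n
I_corr = [linear_fisher_info(n, c0=0.3) for n in (64, 256)]
I_ind = [linear_fisher_info(n, c0=0.0) for n in (64, 256)]
```

Quadrupling *n* roughly quadruples the independent-population information but barely changes the correlated-population information, in line with the saturation result of Sompolinsky et al. (2001).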

##### Derivation of I_{cov}.

We now derive the second term of the FIM, *I*_{ij,cov}. We first recall that *Q* = *SRS* and then note that ∂_{i}*Q* = (∂_{i}*S*)*RS* + *SR*(∂_{i}*S*), where we use the shorthand notation ∂_{i}*S* to denote ∂*S*/∂*s*_{i}. The product *S*^{−1}(∂_{i}*S*)*S*^{−1}(∂_{j}*S*) is a diagonal matrix and its trace is given by the following:
where we introduced the notation **p**^{i} for the vector consisting of the diagonal entries of *S*_{1}^{−1}∂_{i}*S*_{1} and **t**^{i} for the diagonal of *S*_{2}^{−1}∂_{i}*S*_{2}. For the second term on the right side in Equation 12, we have the following:
where we denote *V*_{1}^{i} = *S*_{1}^{−1}∂_{i}*S*_{1} and *V*_{2}^{i} = *S*_{2}^{−1}∂_{i}*S*_{2}. Note that the diagonals of *V*_{1}^{i} and *V*_{2}^{i} are the vectors **p**^{i} and **t**^{i}.

Considering the first term on the right side in Equation 14, we can write it as follows:
where ○ is the Hadamard (element-wise) product and conv(·, ·) denotes the circular convolution of two vectors. From the first to the second line above, we simply used the definition of the DFT of a vector. The third line follows from the definition of the circular convolution. For vectors, we use (·)* to denote element-wise conjugation (without transposition). We can express the other terms in Equation 14 in a similar fashion. Thus, Equation 14 becomes:
We can derive a similar expression for the third term in Equation 12:
We now note that the first term on the right side of Equation 12 is equal to the last term. Thus, putting it all together, *I*_{ij,cov} can be written as follows:
As for *I*_{ij,mean}, the scaling of *I*_{ij,cov} with *n* is identical to the corresponding scaling relationship studied in Ecker et al. (2011) for the case of encoding a single stimulus: asymptotically *I*_{ij,cov} scales linearly with *n* regardless of the amount of correlations in the population.

##### Effects of heterogeneity in mixing weights in the linear mixing model on I_{mean} and I_{cov}.

For Poisson-like noise, it is difficult to analytically quantify the effect of heterogeneity in mixing weights on *I*_{mean}. Considering a single neuron *k*, when there is heterogeneity in mixing weights, we have the following:
where we use δ*w*_{i} and δ*w*_{−i} to denote the random fluctuations around the mean mixing weights (the subscript −*i* indicates the stimulus that is not the *i*th stimulus). In this expression, it is not possible to separate out the effect of the mixing weights, as it is in the encoding of a single stimulus studied in Ecker et al. (2011). This makes it difficult to compute expectations over the random fluctuations of the mixing weights.

Similarly, unlike in Ecker et al. (2011), heterogeneity in mixing weights also affects the *I*_{cov} term in our model. Again, this is because the *k*th diagonal entry of *S*_{1}^{−1}∂_{i}*S*_{1} is of the following form (a similar expression holds for the diagonal entries of *S*_{2}^{−1}∂_{i}*S*_{2}):
The weights in the numerator and the denominator do not cancel in this expression, as they do in the case of the encoding of a single stimulus (Ecker et al., 2011). Unfortunately, because the random fluctuations appear in divisive form in the above expressions (and inside a further square-root nonlinearity in the expression for **g**_{k}^{i}), it is difficult to quantify the effect of heterogeneity in mixing weights on *I*_{mean} and *I*_{cov}. Moreover, the asymptotic variance of the optimal estimator is, in its turn, related to the terms of the FIM through an additional nonlinearity (Eq. 21 below).

Because of these difficulties, we resort to sampling to quantitatively account for the effect of heterogeneity in mixing weights on *I*_{mean} and *I*_{cov} and on the asymptotic variance of the optimal estimator. In particular, we draw the mixing weights of the neurons independently from beta-distributions with mean *w*_{1} or *w*_{2} and variance σ_{w}^{2}. The weights are then plugged in Equation 11 for *I*_{mean} and in Equation 18 for *I*_{cov}. A mathematical analysis of the effect of heterogeneity in mixing weights on the FIM is given below for the simpler case of additive noise.
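The weight-sampling step can be sketched as follows; the moment-matching parameterization of the beta distribution is standard, and the particular mean and variance values are placeholders.

```python
import numpy as np

def sample_mixing_weights(mean_w, var_w, n, rng):
    """Draw n mixing weights from a beta distribution on [0, 1] with the given
    mean and variance (requires var_w < mean_w * (1 - mean_w))."""
    nu = mean_w * (1 - mean_w) / var_w - 1.0   # beta 'concentration' parameter
    a, b = mean_w * nu, (1 - mean_w) * nu
    return rng.beta(a, b, size=n)

rng = np.random.default_rng(0)
w = sample_mixing_weights(mean_w=0.7, var_w=0.01, n=100_000, rng=rng)
```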

##### Asymptotic variance and correlation of optimal unbiased estimates.

After computing *I*_{mean} and *I*_{cov}, we find the asymptotic variance of the optimal unbiased estimates of the individual stimuli and the correlation between the two estimates as follows:
where *I*_{11} and *I*_{12} can be written as the sum of the mean and covariance contributions:
We note that because the FIM is symmetric, *I*_{12} = *I*_{21}, and for large *n*, *I*_{12} depends only on Δ*s* = | *s*_{1} − *s*_{2} | and does not depend on the stimuli themselves. Moreover, because of the circular symmetry of the stimuli considered in this paper, *I*_{11} = *I*_{22}, and *I*_{11} does not depend on the stimuli for large *n*. In the following, we will refer to the inverse of the asymptotic variance as encoding precision or, more commonly, as encoding accuracy.
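Assuming Equations 21 and 22 are the usual Cramér–Rao expressions obtained by inverting the 2 × 2 FIM, they can be sketched as:

```python
import numpy as np

def estimator_stats(I11, I22, I12):
    """Asymptotic variance of the s1 estimate and correlation between the two
    estimates, read off from the inverse of the 2x2 Fisher information matrix."""
    det = I11 * I22 - I12**2
    var1 = I22 / det                     # [I_F^{-1}]_{11}
    corr = -I12 / np.sqrt(I11 * I22)     # off-diagonal flips sign under inversion
    return var1, corr

# cross-check against an explicit matrix inverse (values are arbitrary)
I_F = np.array([[5.0, 2.0], [2.0, 5.0]])
var1, corr = estimator_stats(5.0, 5.0, 2.0)
```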

##### Scaling of the asymptotic variance with the mean response gain α and the Fano factor.

In the expression for *I*_{mean} (Eq. 11), the *k*th entry of **g**^{i}, which is proportional to σ_{k}^{−1}∂*f*_{k}/∂*s*_{i}, scales as √α: the mean responses, and hence the response variances, are proportional to α, so that σ_{k} ∝ √α while ∂*f*_{k}/∂*s*_{i} ∝ α. The **h**^{i} vectors also scale as √α, so *I*_{mean} scales linearly with α. *I*_{cov}, on the other hand, is independent of the gain (to see this, note that the entries of the **p** and **t** vectors in Equation 18 are of the form shown in Equation 20, in which the gain term cancels). Hence, from Equation 21, we see that the asymptotic variance has a weaker gain dependence than α^{−1}. This means that a doubling of the gain leads to a less than twofold increase in encoding precision. By a similar reasoning, it is easy to show that the asymptotic variance has a weaker dependence on the Fano factor (FF) than FF itself (because *I*_{mean} scales as FF^{−1} and *I*_{cov} is independent of FF). Again, this implies that halving the Fano factor should lead to a less than twofold increase in encoding precision.

#### Additive noise: finite n

If the noise is assumed to be additive, the covariance matrix *Q* becomes stimulus-independent; hence, *I*_{cov} = 0. Furthermore, the expression for *I*_{mean} (Eq. 11) can be written in a more transparent form in this case, because it becomes possible to separate out the effect of mixing weights in the **g̃** and **h̃** vectors. Each of the terms on the right side of Equation 11 can be expressed as a sum over different Fourier modes. Considering the *I*_{12,mean} term first, we have:
where σ^{2} is the variance of the additive noise, **m** is the vector of the derivatives of the von Mises tuning functions (Eq. 2) with respect to *s*, and δ is an integer that expresses the difference between the two stimuli *s*_{1} and *s*_{2} in terms of the number of tuning function centers that separate them. Because of the circular nature of the stimuli, the Fisher information only depends on the angular distance between the stimuli. Hence, without loss of generality, we assume *s*_{1} = 0, and thus the derivatives in **m** are all evaluated at *s* = 0. In deriving Equation 25, we used the fact that **g**^{i} can be expressed as a scaled circular shift of **g**^{j} by δ, and similarly for **h**^{i} and **h**^{j}. In the Fourier domain, a circular shift corresponds to the multiplication of the *k*th Fourier mode by exp(−2π*ik*δ/*n*). In Equation 25, we also made use of the fact that **m̃**_{k} = **m̃**_{n−k} (due to the circular nature of the stimulus space) and the identity exp(2π*ik*δ/*n*) + exp(2π*i*(*n* − *k*)δ/*n*) = 2 cos(2π*k*δ/*n*).

Similarly, for the *I*_{11,mean} term, we derive the following expression:
In addition, *I*_{22,mean} = *I*_{11,mean} and *I*_{21,mean} = *I*_{12,mean} as usual. Finally, the asymptotic variance and correlation of estimates can be computed using Equations 21 and 22 and recalling that for additive noise *I*_{cov} = 0.

For the additive-noise model, if the neurons are assumed to be independent, we can also write *I*_{11} and *I*_{12} in closed form in terms of the common noise variance σ^{2}. Plugging these in Equation 21, we obtain the following proportionality relation for the asymptotic variance of the optimal estimator:
Inverting this proportionality yields Equation 77 for the encoding precision, which is used below to provide a qualitative explanation for the stimulus dependence of encoding accuracy.

##### Effect of heterogeneity in mixing weights on I_{mean}.

Heterogeneity in the mixing weights can be accounted for by writing **g**^{i} = (**w** + δ**w**) ○ **ḡ**^{i}, where we separated out the effect of the mixing weights on **g**^{i} (note that we can do this for additive noise, but not for Poisson-like noise). In this expression, **w** denotes the mean weight, which is constant across the neurons in the group, and δ**w** represents the stochastic component of the weight vector, which differs from neuron to neuron. We denote the variance of the weights across the neurons by σ_{w}^{2}. We first note that heterogeneity in the mixing weights only affects the *C* terms in *I*_{11,mean} in Equation 11, because the weight fluctuations are assumed to be uncorrelated across different groups and, within the same group, across different stimuli. With this in mind, from Equation 11, we have the following:
where *I*_{11,mean}^{hom} is the Fisher information for a homogeneous population derived above. In terms of *O*(1) quantities, we can express 〈*I*_{11,mean}^{het}〉 as follows:
where *M* = (1/*n*^{2})(∑_{j} *C*_{jj})(∑_{i} ((**ḡ**_{i}^{1})^{2} + (**h̄**_{i}^{1})^{2})) ∼ *O*(1). Thus, the effect of heterogeneity is linear in the number of neurons per group. This scaling is the same as the corresponding scaling relationship derived previously in Ecker et al. (2011) for a neural population encoding a single stimulus. In Ecker et al. (2011), it is further shown that the same scaling holds for Poisson-like noise as well, but σ_{w}^{2} should, in that case, be interpreted as the variance of the square roots of the mixing weights.

#### Additive noise: large n

In this section, we give a detailed analysis of the additive noise scenario in the limit of large *n*. We first derive explicit expressions for the *C* and *D* matrices (Eqs. 8 and 9) in the large *n* limit. To do this, we first need to derive **ã** and **b̃**. It is convenient in this case to use indices ranging from −(*n* − 1)/2 to (*n* − 1)/2. For the derivations to be presented in this section, we adopt the following notation: ω = 2π/*n*, λ = exp (−ω/*L*) and τ = exp (−*i*ω*jk*). For **ã**, we have:
We now take the sum of the two geometric series in the last equation, simplifying with the identity exp(−*i*θ) + exp(*i*θ) = 2 cos(θ). We then consider the large *n* value of the resulting expression by keeping only terms of leading order in ω = 2π/*n*, recalling that exp(*x*) = 1 + *x* + *x*^{2}/2 + *O*(*x*^{3}) and cos(*x*) = 1 − *x*^{2}/2 + *O*(*x*^{4}). The final result is as follows:
This expression is the same as the one derived in Sompolinsky et al. (2001) for the same type of limited-range correlation structure. Proceeding similarly for **b̃**_{k}, we find the following expression:
Plugging these expressions in Equations 8 and 9 for *C*_{kk} and *D*_{kk} and considering the large *n* limit again, we arrive at the following large *n* expressions for *C*_{kk} and *D*_{kk}:
where we denote:
By plugging these large *n* expressions for *C*_{kk} and *D*_{kk} in Equations 25 and 26, we obtain the following large *n* expressions for *I*_{11,mean} and *I*_{12,mean} for the case of additive noise:
As in Sompolinsky et al. (2001), it can be shown that *I*_{11,mean} and *I*_{12,mean} saturate to finite values when *c*_{0} ≠ 0 and diverge for *c*_{0} = 0. To see this, write Ψ_{k} = Φ_{k}/*n* and, similarly, μ_{k} = **m̃**_{k}/*n*. Then, considering *I*_{11,mean} as an example, we have the following:
We note that Ψ_{k} ∼ *O*(*k*^{−2}). Thus, if the power spectrum |μ_{k}|^{2} decays sufficiently rapidly with *k*, e.g., |μ_{k}|^{2} ∼ *O*(*k*^{−p}) with *p* > 3 (meaning that the tuning function derivatives are sufficiently smooth), the sum above remains *O*(1) for *c*_{0} ≠ 0. An identical argument can be made for the sum in *I*_{12,mean}, showing that *I*_{11,mean} and *I*_{12,mean} are both *O*(1) for *c*_{0} ≠ 0, assuming sufficiently smooth tuning function derivatives.

#### The effect of stimulus mixing on encoding accuracy

We can now ask what the effects of changing various parameters are on encoding accuracy in the case of additive noise. We first consider the effect of stimulus mixing. Assuming *w*_{1} = *w* and *w*_{2} = 1 − *w* and ignoring a common prefactor, which is always positive, we have (from Eqs. 25 and 26) the following:
where we use the following abbreviations:
We now show that for the interval 0.5 < *w* < 1, the variance (Eq. 43) is a decreasing function of *w*. Therefore, stimulus mixing always reduces encoding accuracy and the stronger the mixing, the lower the encoding accuracy. For this purpose, it is sufficient to consider the sign of the derivative of Equation 43 with respect to 2*w*^{2} − 2*w* only, because using the chain rule, the derivative with respect to *w* can be obtained by multiplying with 4*w* − 2, which is always positive in the interval 0.5 < *w* < 1. Denoting the derivatives with respect to 2*w*^{2} − 2*w* with a prime, we first observe the following: (1) *G′* = *X* − *Y* ≥ 0; (2) *H′* = *Z* − *T*; (3) *G′* > |*H′*|; and (4) *G* > 0. These are all easy to verify using the definitions in Equations 44–47 and the large *n* expressions for *C* and *D* (Eqs. 35–36).

Now, taking the derivative of the variance (Eq. 43) and ignoring the denominator in the derivative, which is always non-negative, we have the following:
Thus, the derivative is always negative and the variance is a decreasing function of *w*.
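For the special case of independent additive noise, the monotonic dependence of the variance on *w* can also be verified directly. The sketch below builds the 2 × 2 FIM for the linear mixing model from the tuning-curve derivatives and inverts it; the von Mises shape and all parameter values are illustrative assumptions.

```python
import numpy as np

def var_s1(w, ds=np.pi / 4, n=64, gamma=2.0, sigma=1.0):
    """Asymptotic variance of the s1 estimate under linear mixing with
    weights (w, 1 - w) and independent additive noise of variance sigma^2."""
    phi = 2 * np.pi * np.arange(n) / n
    f = lambda s: np.exp(gamma * (np.cos(s - phi) - 1))   # assumed von Mises shape
    fp = lambda s: -gamma * np.sin(s - phi) * f(s)        # its derivative
    s1, s2 = 0.0, ds
    # group 1 responds as w*f(s1) + (1-w)*f(s2); group 2 reverses the weights
    d1s1, d1s2 = w * fp(s1), (1 - w) * fp(s2)
    d2s1, d2s2 = (1 - w) * fp(s1), w * fp(s2)
    I11 = (d1s1 @ d1s1 + d2s1 @ d2s1) / sigma**2
    I22 = (d1s2 @ d1s2 + d2s2 @ d2s2) / sigma**2
    I12 = (d1s1 @ d1s2 + d2s1 @ d2s2) / sigma**2
    return I22 / (I11 * I22 - I12**2)   # [I_F^{-1}]_{11}

variances = [var_s1(w) for w in (0.55, 0.75, 0.95)]
```

As *w* moves from 0.5 (full mixing) toward 1 (no mixing), the variance decreases monotonically, matching the sign analysis above.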

#### The effect of noise correlations on encoding accuracy

In this section, we separately consider the effects of the three parameters determining the shape and the magnitude of noise correlations in the model: *c*_{0}, β, and *L*. Our strategy is to consider the derivative of the variance with respect to the parameter of interest and look at the sign of the derivative for different settings of the parameters. When the derivative is negative, the variance is a decreasing function of the parameter of interest; whereas a positive derivative means that the variance is an increasing function of the parameter of interest. We use the large *n* expressions for the matrices *C* and *D* (thus for *I*_{11,mean}, *I*_{12,mean}, and the asymptotic variance as well) in the analyses to be presented below.

##### The effect of c_{0}.

We first consider the effect of changing *c*_{0}, the maximum correlation between any two neurons in the population. The analysis of the effect of *c*_{0} is easier than the other two cases, because from the large *n* expressions for *I*_{11,mean} and *I*_{12,mean} (Eqs. 38 and 39), we see that the effect of *c*_{0} completely factorizes in these expressions and that they are both proportional to 1/*c*_{0}. Thus, the variance is proportional to *c*_{0}. Hence, increasing *c*_{0} is always harmful for encoding accuracy. This result is consistent with earlier results for homogeneous neural populations with additive noise encoding a single stimulus (Sompolinsky et al., 2001; Ecker et al., 2011).

##### The effect of β.

We next consider the effect of β, the scaling parameter for across-group correlations. β appears in a factorized form in *I*_{11,mean} and *I*_{12,mean} in the large *n* limit (Eqs. 38 and 39) and therefore the derivative of the variance with respect to β is relatively straightforward to compute. Figure 3*A* shows the sign of ∂Var(*ŝ*_{1}|**s**)/∂β for different values of *w*, Δ*s*, and β.

##### The effect of L.

To investigate the effect of correlation length scale *L* on encoding accuracy, we proceed similarly. In Figure 3*B*, we plot the sign of ∂Var(*ŝ*_{1}|**s**)/∂*L* for different values of *w*, Δ*s*, and *L*. This figure suggests that there is a critical value of *L* around *L* ≈ 0.6 below which the derivative is always positive, suggesting that it is beneficial to decrease *L* in this regime. On the other hand, above the critical value of *L*, the derivative is always negative, suggesting that it is beneficial for encoding accuracy to increase *L* in this regime. The critical value of *L* decreases with the concentration parameter of the tuning functions γ (Eq. 2), but does not depend on any other parameters of the encoding model. This type of threshold-like behavior for the effect of correlation length scale *L* is again consistent with a similar behavior reported in Sompolinsky et al. (2001).

#### Optimal linear estimator

For the optimal linear estimator (OLE), we first map the stimuli to Cartesian coordinates, **x** = [cos (*s*_{1}) sin (*s*_{1}) cos (*s*_{2}) sin (*s*_{2})], and calculate the weight matrix *W* that minimizes the mean squared error 〈‖**x** − **x̂**‖^{2}〉 between the actual and estimated stimuli. The optimal weight matrix is given by the following: *W* = *Q*_{rr}^{−1}*Q*_{xr} where *Q*_{rr} = 〈**r**^{T}**r**〉 is the covariance matrix of the responses and *Q*_{xr} = 〈**r**^{T}**x**〉 is the covariance between the stimuli and neural responses (here 〈 · 〉 denotes an average over both stimuli **x** and noisy neural responses **r**). These averages were computed over 8192 random samples of **x** (generated from uniformly distributed *s*_{1} and *s*_{2} values) and **r**. The performance of the estimator was then measured by numerically estimating the mean squared error (MSE) for stimulus pairs of the form (−*s*, *s*) with *s* ∈ [0,π/2].
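A sketch of this OLE procedure on simulated data follows; the von Mises tuning, the symmetric two-group linear mixing, and all parameter values are illustrative assumptions, not the paper's exact simulation settings.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_samples, sigma, w = 32, 8192, 0.1, 0.7
phi = 2 * np.pi * np.arange(n) / n

def responses(s1, s2):
    """Two groups with mixing weights (w, 1-w) and (1-w, w), von Mises tuning."""
    f = lambda s: np.exp(2.0 * (np.cos(s[:, None] - phi[None, :]) - 1))
    return np.hstack([w * f(s1) + (1 - w) * f(s2),
                      (1 - w) * f(s1) + w * f(s2)])

s1 = rng.uniform(0, 2 * np.pi, n_samples)
s2 = rng.uniform(0, 2 * np.pi, n_samples)
X = np.column_stack([np.cos(s1), np.sin(s1), np.cos(s2), np.sin(s2)])
R = responses(s1, s2) + sigma * rng.standard_normal((n_samples, 2 * n))
# W = Qrr^{-1} Qxr estimated from samples; lstsq solves the same normal equations
W, *_ = np.linalg.lstsq(R, X, rcond=None)
mse = np.mean(np.sum((X - R @ W) ** 2, axis=1))
```

Because the two groups mix the stimuli with different weights, the linear readout can demix them, and the resulting MSE is far below that of a trivial constant predictor.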

#### The nonlinear mixing rule of Britten and Heuer (1999)

For the nonlinear mixing model of Britten and Heuer (1999), the mean response *f*_{k} of a neuron to a pair of stimuli (*s*_{1}, *s*_{2}) is given by Equation 78. We include a factor of 2 in the denominator in Equation 78 to make the neural gains approximately independent of ν; Britten and Heuer's original equation does not include this factor. The derivative of the mean response with respect to the stimulus *s*_{i} is given by the following:
where *f* denotes the von Mises tuning function (Eq. 2) and *f*′ its derivative. Given the mean responses and their derivatives with respect to each stimulus, expressions similar to Equations 11 and 18 can be used to compute the Fisher information matrix for the nonlinear mixing model of Britten and Heuer (1999), taking into account that we assume an unsegregated population for this case (see Results).

#### A divisive form of stimulus mixing

In the divisively normalized stimulus mixing model, the response of a neuron in the first group is described by Equation 79. Responses of neurons in the second group are similar, but with the roles of *s*_{1} and *s*_{2} reversed on the right-hand side. In Equation 79, *f*(*s*; φ) is the von Mises tuning function defined in Equation 2 and the weighting profile *w*(φ_{k}, φ_{k′}) is a normalized von Mises function given by the following:
The derivatives of the mean response *f*_{k} with respect to *s*_{1} and *s*_{2} are given by the following:
where *f*′ denotes the derivative of the von Mises tuning function.

Equations 11 and 18 for the Fisher information matrix are still valid for the divisively normalized mixing model with the difference that the mean responses of neurons and their derivatives with respect to the two stimuli are computed according to Equations 79, 51, and 52, respectively.

Figure 10 shows the effect of varying the divisive normalization scaling factor *k*_{w} and the across-group neural correlations β on encoding accuracy for both the divisively normalized mixing model and a model in which the off-diagonal terms of the Fisher information matrix (*I*_{ij} with *i* ≠ *j*) were set to zero, but the diagonal terms were the same as in the divisively normalized mixing model. As explained in Results, this latter model eliminates stimulus mixing, but preserves the neuron-by-neuron response gains of the divisively normalized model.

#### Stimulus mixing is not always harmful for encoding accuracy

We first analyze the general stimulus mixing problem with a two-dimensional toy model. We imagine two “neurons” mixing the two stimuli *s*_{1}, *s*_{2} according to *f*_{1}(*s*_{1}, *s*_{2}) and *f*_{2}(*s*_{1}, *s*_{2}), respectively. The responses of the two neurons are given by *r*_{1} = *f*_{1}(*s*_{1}, *s*_{2}) + ε_{1} and *r*_{2} = *f*_{2}(*s*_{1}, *s*_{2}) + ε_{2}, where ε_{1} and ε_{2} are Gaussian random variables with variance σ^{2} and correlation ρ. Thus, in the following analysis, we assume stimulus-independent additive noise. We denote the Jacobian matrix for the mean responses of the neurons by **J**:
As explained in the Results section, one can think of **J** as a mixing matrix describing the sensitivity of each neuron to changes in each stimulus. The Fisher information matrix is given by *I*_{F} = **J**^{T}**Σ**^{−1}**J**, where **Σ** is the covariance matrix of the response noise. The inverse of *I*_{F} gives the asymptotic covariance matrix of the maximum likelihood estimator. To find the optimal mixing matrix **J**, we minimize the trace of *I*_{F}^{−1}, i.e., Tr[*I*_{F}^{−1}] = Tr[**J**^{−1}**Σ** **J**^{−T}], with respect to **J**, subject to the constraint that the sum of the squares of the derivatives in **J** be a finite constant *K*, i.e., Tr[**J**^{T}**J**] = *K*.

We find the optimal **J** by the method of Lagrange multipliers. The objective function is given by the following:
and the required derivatives are as follows:
Setting these to zero and rearranging, we get the following equations:
where we denote **JJ**^{T} by **Y**. By taking the eigendecomposition of the right-hand side of Equation 57, we obtain:
where **P** is the matrix of eigenvectors of **Σ**, and λ_{1} and λ_{2} are its eigenvalues. Because the trace of a matrix is equal to the sum of its eigenvalues, from Equation 58, we obtain a constraint relating the eigenvalues of **Y** to *K*. The eigenvalues of **Σ** are given by the following:
Plugging these expressions in Equation 60 and simplifying, we get:
Now, the **Y** matrix can be written as follows:
where ∇*f*_{1} and ∇*f*_{2} denote the gradients of *f*_{1} and *f*_{2}, respectively. The cosine of the angle θ* between these two gradients is given by the following:
The optimal solution is thus to set the gradients to have equal norm, with the angle θ* between them determined by the noise correlation ρ; the corresponding figure plots the positive solution as a function of ρ.
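A numerical check of this solution is straightforward for the uncorrelated case (ρ = 0), where the optimum reduces to equal-norm, orthogonal gradients: any other mixing matrix with the same total power does no better. The values of *K* and σ^{2} below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
K, sigma2 = 2.0, 1.0
Sigma = sigma2 * np.eye(2)          # uncorrelated response noise (rho = 0)

def cost(J):
    """Tr[I_F^{-1}] = Tr[J^{-1} Sigma J^{-T}] for a candidate mixing matrix J."""
    Ji = np.linalg.inv(J)
    return np.trace(Ji @ Sigma @ Ji.T)

# equal-norm, orthogonal gradients with total power Tr[J^T J] = K
J_opt = np.sqrt(K / 2) * np.eye(2)
c_opt = cost(J_opt)                 # equals 4 * sigma2 / K at the optimum

# random candidates rescaled to satisfy the same power constraint
n_trials, n_worse = 200, 0
for _ in range(n_trials):
    J = rng.standard_normal((2, 2))
    J *= np.sqrt(K / np.trace(J.T @ J))   # rescale so that Tr[J^T J] = K
    if cost(J) >= c_opt - 1e-9:
        n_worse += 1
```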

The solution of the two-dimensional toy model can be readily generalized to models with more than two neurons under certain assumptions. Consider, for instance, two groups of neurons with responses given, respectively, by:
This model is very similar to the linear mixing model analyzed in detail in this paper (see Results), with the only difference being that the restriction for the weights to be positive and symmetric between the groups is now lifted. Assuming *s*_{1} ≈ *s*_{2} = *s* and an additive noise model, the problem of finding the optimal weights *w*_{1}, *w*_{2}, *w*_{3}, and *w*_{4} subject to a total power constraint on the derivatives, i.e.:
can be directly translated into the two-dimensional problem with the following transformation:
where **f′** is a row vector of the derivatives *df*(*s*; φ_{k})/*ds*, **0** is a row vector of zeros, and *Q* is the covariance structure of the neurons (e.g., *Q* = *SRS* with *R* as given in Eq. 3). In particular, for uncorrelated neurons (diagonal *Q*), we find that the optimal solution is to set the weight vectors [*w*_{1}, *w*_{2}] and [*w*_{3}, *w*_{4}] such that they have equal norm and are orthogonal to each other.

For the more general case of *n* neurons encoding two stimuli, as far as we know, there is no closed-form solution for the optimal mixing matrix subject to a constraint on the total power of the derivatives. We thus solve this more general problem numerically. Figure 11 shows three distinct solutions for *n* = 16 with both a diagonal covariance matrix (Fig. 11*A*) and a limited-range correlation structure (Fig. 11*B*) as in Equation 4 with *c*_{0} = 0.3 and *L* = 2.

Similarly, for the linear encoding model (Eqs. 66 and 67), when *s*_{1} ≉ *s*_{2}, it does not seem possible to reduce the problem of finding the optimal mixing weights *w*_{1}, *w*_{2}, *w*_{3}, and *w*_{4} to our two-dimensional toy problem. Thus, we solve this optimization problem numerically as well. Because *I*_{F}^{−1} depends on Δ*s*, we minimize the following objective function in this case: where the integral is approximated by the average of Tr[*I*_{F}^{−1}(Δ*s*)] over 21 Δ*s* values uniformly spaced in the interval [0, π]. The total resource *K* is assumed to be equal to the number of neurons, and the noise is assumed to be additive (with identical variance for all neurons) in both the arbitrary encoding model and the linear encoding model. All numerical optimization problems were solved using the genetic algorithm routines in MATLAB's Global Optimization Toolbox.

#### Parameter values for the results reported in the figures

For our main results, reported in Figures 1, 2, 4, and 7⇓⇓–10, we use a Poisson-like noise model and a limited-range noise correlation structure with parameters *c*_{0} = 0.3 and *L* = 2, which is consistent with the small but broad noise correlations typically observed in the visual cortical areas (Cohen and Kohn, 2011). For the tuning function parameters, we again use values broadly consistent with response properties of neurons in the visual cortex: α = 20, γ = 2, η = 0 (Ecker et al., 2011). For convenience, we assume tuning function centers that are uniformly spaced between 0 and 2π.

In Figures 3, 11, and 12, we use an additive Gaussian noise model. The additive-noise assumption is needed to establish the analytic results regarding the effects of varying the parameters of the encoding model on encoding accuracy, as well as for the solution of the optimal mixing model presented at the end of the Results section.

For the number of neurons, we used as many neurons as was computationally feasible. Specifically, in Figures 1, 2, 4, 6, and 10, we use *n* = 4096 (number of neurons per group). In Figure 9, since there is only a single, unsegregated population, we use *n* = 8192. Due to computational costs, we had to use a smaller number of neurons per group in Figures 7 and 8: in Figure 7, *n* = 1024 (note this means that the total number of neurons is 2048 for set size *N* = 2 and 6144 for *N* = 6), and in Figure 8, *n* = 1024.

Other parameter values specific to each figure are as follows: in Figure 9, *a* = 1, *b* = 0 (parameters of the nonlinear mixing model of Britten and Heuer, 1999); in Figure 10, Δ = 10, γ_{w} = 2 (divisive normalization parameters); in Figure 12, for the linear encoding model, the tuning function parameters are the same as those reported above for Figure 1.

## Results

### Linear mixing

We consider a population of neurons encoding a pair of stimuli (*s*_{1}, *s*_{2}) where the mean responses of the neurons are expressed as a weighted average of their responses to the individual stimuli:
Here *w*_{1} and *w*_{2} are the mixing weights and φ_{k} is the preferred stimulus of neuron *k*. We call this type of mixed selectivity linear mixing, although it should be noted that the mean response *f*_{k}(*s*_{1}, *s*_{2}) is linear only in the individual responses and not in the stimuli themselves; therefore, linear mixing in this sense is different from what is referred to as linear mixing in Rigotti et al. (2013). Neural responses of this type have been observed throughout the cortex (Recanzone et al., 1997; Britten and Heuer, 1999; Zoccolan et al., 2005, 2007; Beck and Kastner, 2007; Busse et al., 2009; MacEvoy and Epstein, 2009; MacEvoy et al., 2009; Nandy et al., 2013). It is, thus, of considerable interest to understand the information-encoding consequences of this type of mixed selectivity. For our analysis, we separate the neurons into two groups such that neurons in the first group have a larger weight for the first stimulus and neurons in the second group have a larger weight for the second stimulus. For simplicity, we assume symmetric weights for the two groups; i.e., if the mixing weights associated with *s*_{1} and *s*_{2} are *w*_{1} and *w*_{2}, respectively, for the first group (with *w*_{1} ≥ *w*_{2}), they are *w*_{2} and *w*_{1} for the second group. We initially consider the case where all neurons within the same group have the same weights for the two stimuli, but later consider the effects of heterogeneity in mixing weights.

To model variability in neural responses, unless otherwise noted, we use a biologically realistic, Poisson-like noise model, where the variance of the noise is equal to the mean response (see Materials and Methods). We assume that neural variability is correlated across the population. Specifically, within-group correlations between neurons decay exponentially with the distance between their preferred stimuli, consistent with experimental measurements of noise correlations throughout the visual cortex (Cohen and Kohn, 2011): where δ is the Kronecker delta function. Across-group correlations are simply scaled versions of within-group correlations: where 0 ≤ β ≤ 1 represents the scaling factor. In this paper, we only consider stimuli defined over circular spaces due to their conceptual simplicity and analytical tractability. Stimuli defined over bounded spaces would introduce edge effects where stimuli toward the edges are encoded with less accuracy than stimuli toward the center. This can be explained entirely by a decrease in the effective number of neurons covering the stimuli toward the edges and hence is uninteresting for our considerations. We do not expect any of our main results concerning stimulus mixing to depend on the choice of a circular rather than a bounded stimulus space.
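To make the setup concrete, here is a minimal numerical sketch of such a mixed population. The von Mises tuning form and the exact correlation parameterization (unit diagonal, c₀·exp(−d/L) off-diagonal, scaled by β across groups) are our reading of the model, consistent with the parameters quoted in Materials and Methods (α = 20, γ = 2, η = 0; c₀ = 0.3, L = 2), not the authors' code:

```python
import numpy as np

# Hypothetical von Mises tuning curve, consistent with the parameters
# alpha = 20, gamma = 2, eta = 0 quoted in Materials and Methods.
def tuning(s, phi, alpha=20.0, gamma=2.0, eta=0.0):
    return alpha * np.exp(gamma * (np.cos(s - phi) - 1.0)) + eta

n = 64                                                # neurons per group
phi = np.linspace(0.0, 2 * np.pi, n, endpoint=False)  # preferred stimuli
w = 0.7                                               # mixing weight (w1 = w, w2 = 1 - w)
s1, s2 = 1.0, 2.5

# Linear mixing: weighted average of the responses to the individual stimuli.
f_group1 = w * tuning(s1, phi) + (1 - w) * tuning(s2, phi)
f_group2 = (1 - w) * tuning(s1, phi) + w * tuning(s2, phi)

# Limited-range within-group correlations with c0 = 0.3, L = 2,
# using circular distance between preferred stimuli (an assumed form).
c0, L = 0.3, 2.0
d = np.abs(phi[:, None] - phi[None, :])
d = np.minimum(d, 2 * np.pi - d)
R_within = np.eye(n) + c0 * np.exp(-d / L) * (1 - np.eye(n))

beta = 0.5                                            # across-group scaling factor
R_across = beta * c0 * np.exp(-d / L)
```

Note that the summed population response (f_group1 + f_group2) is identical for any mixing weight *w*, the fact used in the Results to argue that the cost of mixing is not a gain effect.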

### Consequences of linear mixing

We derived a mathematical expression for the Fisher information matrix (FIM) of the encoding model described above. The main interest in deriving the FIM comes from the fact that, by the Cramér–Rao bound, the inverse of the FIM provides a lower bound on the covariance matrix of any unbiased estimator of the stimuli and expresses the asymptotic covariance matrix of the maximum-likelihood estimator. From the inverse of the FIM, we obtained expressions for the asymptotic variance of the optimal estimates of *s*_{1} and *s*_{2} and the correlation between the estimates (see Materials and Methods).

We then asked how changes in different parameters affect encoding accuracy, i.e., the inverse of the asymptotic variance of the estimates. Considering the effect of stimulus mixing first and assuming *w*_{1} = *w* and *w*_{2} = 1 − *w* with 0.5 ≤ *w* ≤ 1, we find that increased stimulus mixing (i.e., decreasing *w*) reduces encoding accuracy and that these reductions can be substantial (Fig. 1*A*). The harmful effect of stimulus mixing for encoding accuracy depends on the similarity between the two stimuli (Fig. 1*A*), being more severe for more similar stimuli (smaller Δ*s* = | *s*_{1} − *s*_{2} |). For some stimulus pairs, increased stimulus mixing can cause several orders of magnitude reductions in encoding accuracy (Fig. 1*A*). It is easy to see that the total response across the whole population is independent of the mixing weight *w*. Therefore, the reduction in encoding accuracy with increased stimulus mixing is due entirely to stimulus mixing itself, rather than any reduction in the overall response level.
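A simplified version of this calculation can be sketched for additive noise and independent neurons (the paper's main results instead use Poisson-like noise and correlations; the von Mises tuning derivative below is an assumed form consistent with the parameters in Materials and Methods):

```python
import numpy as np

def tuning_deriv(s, phi, alpha=20.0, gamma=2.0):
    # Derivative of a hypothetical von Mises tuning curve.
    return -alpha * gamma * np.sin(s - phi) * np.exp(gamma * (np.cos(s - phi) - 1.0))

def asymptotic_variances(w, s1, s2, n=128, sigma2=1.0):
    """Diagonal of I_F^{-1} for the linear mixing model with additive,
    independent noise: I_F = J^T J / sigma^2, with J the (2n x 2) Jacobian."""
    phi = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    d1, d2 = tuning_deriv(s1, phi), tuning_deriv(s2, phi)
    # Group 1: f = w f(s1) + (1-w) f(s2); group 2: weights swapped.
    J = np.block([[w * d1[:, None], (1 - w) * d2[:, None]],
                  [(1 - w) * d1[:, None], w * d2[:, None]]])
    IF = J.T @ J / sigma2
    return np.diag(np.linalg.inv(IF))

s = 1.0
v_strong = asymptotic_variances(0.6, s, s + 0.5)[0]    # strong mixing
v_weak   = asymptotic_variances(0.9, s, s + 0.5)[0]    # weak mixing
v_far    = asymptotic_variances(0.6, s, s + np.pi)[0]  # dissimilar stimuli
print(v_strong, v_weak, v_far)
```

In this sketch, stronger mixing inflates the asymptotic variance, and moving the stimuli farther apart (larger Δ*s*) reduces the cost of mixing, matching the pattern in Figure 1*A*.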

We next analyzed the effect of heterogeneity in mixing weights by assuming that the weights of different neurons are drawn from a distribution with mean *w* or 1 − *w* and variance σ_{w}^{2} (see Materials and Methods). Such heterogeneity in the mixing weights partially alleviates the harmful effects of stimulus mixing (Fig. 1*B*). Increasing the across-group neural correlations, i.e., increasing β, can also counteract the effects of stimulus mixing under certain parameter regimes (Fig. 1*C*).

#### Effects of stimulus mixing, heterogeneity in mixing weights, and across-group neural correlations on encoding accuracy

The presence of Poisson-like noise makes an analytic quantification of the effects of stimulus mixing, heterogeneity in mixing weights, and across-group neural correlations on the asymptotic variance difficult. However, for the parameter regimes reported in Figure 1, we numerically checked and confirmed the following: (1) increasing stimulus mixing always increases the asymptotic variance (Fig. 1*A*); (2) increasing the heterogeneity of mixing weights generally reduces the asymptotic variance, except for a small number of cases where this pattern is reversed for very close σ_{w}^{2} values due to stochasticity in sampling (Fig. 1*B*); and (3) for all Δ*s*, the increase in variance caused by halving *w*_{1} = *w* from 1 to 0.5 is always greater than the increase in variance caused by halving the mean response gain or doubling the Fano factor (FF). Figure 2 compares the effect of stimulus mixing on the asymptotic variance with the effects of halving the mean response gain or doubling the FF. The results shown in Figure 2 are for Δ*s* = π, for which the effect of stimulus mixing is among the weakest. For smaller Δ*s* values, stimulus mixing generally has a much larger effect on the variance than the effect of changing the gain or the FF. This result could explain why attention acts primarily by stimulus selection, which in our model corresponds to changing the mixing weights, rather than simply through noise reduction or gain increase (Pestilli et al., 2011), because the former mechanism typically leads to a much larger improvement in encoding accuracy than the latter mechanisms (see Discussion).

#### Analytical results for an additive noise model

To develop a better understanding of the effects of changing the parameters of the encoding model on encoding accuracy, we supplement the numerical results for Poisson-like noise with analytical results for a simpler additive noise model. In the additive noise model, the noise variance, rather than being equal to the mean response as in the Poisson-like noise model, is assumed to be the same for all neurons independent of their mean responses. In Materials and Methods, for the additive noise model and in the limit of a large number of neurons, we mathematically show that increased stimulus mixing always reduces encoding accuracy, increased heterogeneity in mixing weights always improves encoding accuracy, whereas the conditions under which increased neural correlations between the groups, i.e., increasing β, improves encoding accuracy are slightly more complicated. In Figure 3*A*, we plot the sign of ∂Var(*ŝ*_{1}|**s**)/∂β for different values of *w*, Δ*s*, and β. A positive sign means that the asymptotic estimation variance is an increasing function of β (i.e., it is harmful to increase β), whereas a negative sign means that the asymptotic estimation variance is a decreasing function of the parameter. This figure shows that the derivative is always negative for very small values of Δ*s*, suggesting that it is beneficial (in terms of encoding accuracy) to increase the across-group correlations in this case. For other values of Δ*s*, the sign depends on β and *w*, being more likely to be negative for larger β and larger *w* values. A detailed analysis of the effects of changing the other parameters of the encoding model under the additive noise assumption can also be found in the Materials and Methods section. 
In summary, this analysis shows that increasing the maximum neural correlation, *c*_{0}, is always harmful for encoding accuracy, whereas for the correlation length scale *L*, there is a critical threshold below which it is always harmful to increase *L* and above which it is always beneficial to increase *L* (Fig. 3*B*).

#### Correlations between the estimates

The FIM also predicts prominent stimulus-dependent correlations between the estimates of the two stimuli. The asymptotic correlation between the optimal estimates is given by Equation 22 (see Materials and Methods). In Figure 4, we show the correlations between the two estimates under different parameter regimes, using the Poisson-like noise model. Figure 4, *A* and *B*, shows the effects of changing *w* and β, respectively, on the correlation between the estimates. Psychophysical tasks where subjects have to estimate multiple stimuli simultaneously are uncommon (see Orhan and Jacobs (2013) for an exception), but Figure 4 suggests that the pattern of correlations between the estimates obtained from such tasks can be potentially informative about the optimality of the subjects' decoding methods and about possible parameter regimes for their encoding models.

### A reduced model to understand the effects of stimulus mixing

To develop a geometric intuition for the effects of stimulus mixing and across-group neural correlations on encoding accuracy, we consider a simpler, reduced version of the linear mixing model. In this model, neurons in each group are reduced to a single neuron. In addition, we model the responses of these two “reduced neurons” linearly, ignoring the nonlinearity introduced by the tuning function, *f*. Thus, the responses of the two neurons are modeled as follows:
where ε_{1} and ε_{2} are zero-mean random variables with correlation β, representing correlated noise in the responses. A given (*r*_{1}, *r*_{2}) pair describes two lines in the (*s*_{1}, *s*_{2}) plane: *r*_{1} = *ws*_{1} + (1 − *w*)*s*_{2} and *r*_{2} = (1 − *w*)*s*_{1} + *ws*_{2}. The maximum likelihood estimate of the stimuli is given by the intersection of these two lines. As *r*_{1} and *r*_{2} vary stochastically from trial to trial due to noise, the lines, as well as their intersection point, change. If there is any stimulus mixing (*w* ≠ 1), the geometry of the lines dictates that the estimates should be stretched along the antidiagonal direction, making them more variable than under the no mixing condition (*w* = 1) for the same (*r*_{1}, *r*_{2}) values. This is illustrated in the middle plot in Figure 5 for *w* = 0.8 and β = 0. Increasing the stimulus mixing makes the slopes of the lines more similar to each other, which stretches the intersection points even further (Fig. 5, left) and increases their variance. Increasing the across-group neural correlation β, on the other hand, makes the intersection points along the diagonal more probable (Fig. 5, right), counteracting the antidiagonal stretching caused by stimulus mixing and decreasing the variance of the estimates.
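The geometry of the reduced model can be simulated directly. The sketch below (noise level, stimulus values, and trial count are illustrative choices) computes the intersection of the two lines for many noisy response pairs and shows both the variance inflation under strong mixing and the antidiagonal stretching (negative correlation between the two estimates):

```python
import numpy as np

rng = np.random.default_rng(0)

def ml_estimates(w, s1, s2, beta=0.0, sigma=0.05, trials=20000):
    """Intersection of r1 = w*s1 + (1-w)*s2 and r2 = (1-w)*s1 + w*s2
    for noisy (r1, r2), with correlated additive Gaussian noise."""
    A = np.array([[w, 1 - w], [1 - w, w]])
    cov = sigma**2 * np.array([[1.0, beta], [beta, 1.0]])
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=trials)
    r = (A @ np.array([s1, s2]))[None, :] + eps
    return np.linalg.solve(A, r.T).T  # (s1_hat, s2_hat) per trial

est_mild   = ml_estimates(0.9, 1.0, 1.3)
est_strong = ml_estimates(0.6, 1.0, 1.3)
print(est_mild[:, 0].var(), est_strong[:, 0].var())
```

With β = 0, the strong-mixing estimates are both far more variable and strongly anticorrelated, i.e., stretched along the antidiagonal as in the middle panel of Figure 5.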

### Stimulus dependence of the variance

The dependence of the asymptotic variance of the optimal estimates on Δ*s* cannot be explained with the reduced model. To gain some insight into the mechanism behind this dependence, we consider the linear mixing model with additive noise and independent neurons. In this case, it can be shown that the encoding precision is proportional to (see Eq. 27 in Materials and Methods):
where ‖ · ‖ denotes the Euclidean norm and **f** is a column vector of the mean responses of all neurons in both groups. In the linear mixing model, the norms of the derivatives of **f** with respect to *s*_{1} and *s*_{2} do not depend on Δ*s*; hence, the dependence of encoding precision on Δ*s* is determined solely by the magnitude of the inner product between these derivative profiles. When the derivative profiles with respect to *s*_{1} and *s*_{2} overlap more, the encoding precision becomes lower. Figure 6 shows that this overlap increases with decreasing Δ*s*, explaining why stimulus mixing is especially harmful for stimuli with small Δ*s*.

Although the simple proportionality relation above does not hold in the case of Poisson-like noise (or for correlated neurons for that matter), it qualitatively captures the stimulus dependence of the asymptotic variance in Figure 1.

### Generalization to more than two stimuli

It is straightforward to generalize the preceding linear mixing model to more than two stimuli. However, deriving a simplified mathematical expression for the FIM, as was done for the case of two stimuli, becomes infeasible for this case. Therefore, we present results from numerically computed FIMs for up to six stimuli.

To generate the mixing weights of different groups of neurons, for each set size *N*, we first draw a random weight vector **w**, uniformly distributed on the (*N* − 1)-dimensional probability simplex, i.e., the region defined by ∑_{i}*w*_{i} = 1 and *w*_{i} ≥ 0. We then generate an *N* × *N* circulant matrix **W** from the weight vector **w**. The rows of this matrix, which are all circular permutations of **w**, give the weight vectors of each group in the population. We generate 512 such weight matrices and, for each weight matrix, compute the asymptotic variance of the optimal estimator from the inverse of the Fisher information matrix for the particular stimulus configuration where all the stimuli are identical, *s*_{1} = *s*_{2} = … = *s*_{N}. The number of neurons per group and the magnitude of noise correlations between groups are held constant across set sizes in these simulations. Similar to the results for *N* = 2, the estimation variance increases when the weight vector **w** becomes more uniform, i.e., when different groups become equally responsive to all stimuli. To quantify the uniformity of the weight vectors, we use the Shannon entropy of **w** treated as a discrete probability distribution. Figure 7*A* shows the asymptotic estimation variance as a function of the Shannon entropy of the weight vector. When the logarithm of the estimation variance is linearly regressed on the logarithm of the set size *N*, we find a highly significant effect with a positive slope of ∼0.82 (*p* < 0.0001), suggesting that the variance increases with set size (Fig. 7*B*). This result is not sensitive to the particular way the weight matrices **W** are chosen and holds as well for the case where the weight vectors of different groups in the population, rather than being circular permutations of a single weight vector, are random samples from the (*N* − 1)-dimensional probability simplex (Fig. 7*C*). In this case, the linear regression of the logarithm of the estimation variance on the logarithm of the set size yields a highly significant effect with a positive slope of ∼2.2 (*p* < 0.0001), again suggesting an increase in the estimation variance with set size. This increase is not caused by a reduction in gain per stimulus, as the number of neurons per group was held constant and the presented stimuli were identical. Rather, it is due to an increase with *N* in the mean normalized entropy (normalized by the maximum possible entropy, i.e., log *N*) of weight vectors drawn from a probability simplex (Nemenman et al., 2002). In other words, with increasing *N*, it becomes more and more difficult to find “harmless,” low-entropy weight vectors. This result suggests a novel mechanism that might contribute to set size effects, i.e., declines in performance with set size, observed in various psychophysical tasks (see Discussion).
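The weight-generation procedure and the entropy argument can be sketched as follows (drawing Dirichlet samples with all concentration parameters equal to 1 is the standard way to sample uniformly from the simplex; the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_simplex_weights(N, size):
    """Uniform samples from the (N-1)-dimensional probability simplex
    (Dirichlet distribution with all concentration parameters equal to 1)."""
    return rng.dirichlet(np.ones(N), size=size)

def circulant_weight_matrix(w):
    """N x N matrix whose rows are circular permutations of w."""
    return np.stack([np.roll(w, k) for k in range(len(w))])

def normalized_entropy(W):
    """Shannon entropy of each weight vector, normalized by log N."""
    H = -np.sum(np.where(W > 0, W * np.log(W), 0.0), axis=-1)
    return H / np.log(W.shape[-1])

samples = 4000
h2 = normalized_entropy(random_simplex_weights(2, samples)).mean()
h6 = normalized_entropy(random_simplex_weights(6, samples)).mean()
print(h2, h6)  # mean normalized entropy grows with set size N
```

The mean normalized entropy of uniformly drawn weight vectors is higher for *N* = 6 than for *N* = 2, which is the sense in which "harmless," low-entropy weight vectors become harder to find as set size grows.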

### Generalization to a suboptimal decoder

The results presented so far concern the FIM, which describes the asymptotic behavior of the optimal estimator. An important question is to what extent these results generalize to empirically motivated suboptimal decoders. Here, we show that the effects of stimulus mixing, heterogeneity of the mixing weights, and across-group noise correlations obtained from the analysis of the FIM generalize to a particular type of suboptimal decoder called the optimal linear estimator (OLE; Salinas and Abbott, 1994). Because OLE is a biased estimator, we use the mean squared error (MSE) as a measure of the estimator's performance.

Figure 8*A* shows the MSE of the OLE for different degrees of stimulus mixing. Increased stimulus mixing (decreasing *w*) deteriorates the estimator's performance. This is consistent with the results presented above for the FIM. The stimulus dependence of the estimator error, however, has a different form than for the FIM. Figure 8*B* shows the MSE of the OLE for different amounts of heterogeneity in the mixing weights. Again, consistent with the results obtained from the FIM, increased heterogeneity improves the estimator's performance. Figure 8*C* shows the MSE of the OLE for different across-group correlation values. Under strong stimulus mixing (*w* = 0.6), increasing β improves the decoder's performance by up to an order of magnitude in most cases. This effect is also consistent with the results presented earlier for the FIM. The effect of β, however, becomes less significant under the no stimulus mixing condition (*w* = 1). The results for the OLE are also, in general, less dependent on Δ*s* than the results for the FIM.
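A linear decoder in this spirit can be sketched as follows. This is not the exact OLE construction of Salinas and Abbott (1994); it fits a linear readout of (cos *s*_{1}, sin *s*_{1}) by least squares, with illustrative tuning, noise, and population parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def tuning(s, phi, alpha=20.0, gamma=2.0):
    return alpha * np.exp(gamma * (np.cos(s - phi) - 1.0))

def simulate(w, trials, n=32, sigma=2.0):
    """Noisy responses of the two mixed groups to random stimulus pairs."""
    phi = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    s1 = rng.uniform(0.0, 2 * np.pi, trials)
    s2 = rng.uniform(0.0, 2 * np.pi, trials)
    f1 = w * tuning(s1[:, None], phi) + (1 - w) * tuning(s2[:, None], phi)
    f2 = (1 - w) * tuning(s1[:, None], phi) + w * tuning(s2[:, None], phi)
    r = np.concatenate([f1, f2], axis=1) + sigma * rng.standard_normal((trials, 2 * n))
    return r, s1

def linear_decoder_mse(w):
    """Train a linear decoder of (cos s1, sin s1) by least squares and
    report the mean squared circular error on held-out trials."""
    r_tr, s_tr = simulate(w, 20000)
    r_te, s_te = simulate(w, 5000)
    X_tr = np.column_stack([r_tr, np.ones(len(r_tr))])  # bias term
    Y_tr = np.column_stack([np.cos(s_tr), np.sin(s_tr)])
    B, *_ = np.linalg.lstsq(X_tr, Y_tr, rcond=None)
    pred = np.column_stack([r_te, np.ones(len(r_te))]) @ B
    s_hat = np.arctan2(pred[:, 1], pred[:, 0]) % (2 * np.pi)
    err = (s_hat - s_te + np.pi) % (2 * np.pi) - np.pi    # circular error
    return np.mean(err**2)

mse_pure = linear_decoder_mse(1.0)
mse_mixed = linear_decoder_mse(0.6)
print(mse_pure, mse_mixed)
```

Even though a linear readout can demix the groups exactly (e.g., 3·**r**_{1} − 2·**r**_{2} isolates the *s*_{1} responses when *w* = 0.6), the demixing combination amplifies the noise, so the error under mixing remains larger, consistent with Figure 8*A*.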

### Nonlinear forms of stimulus mixing

So far, we have considered a linear stimulus mixing model where the responses of neurons to multiple stimuli are modeled as simple linear combinations of their responses to individual stimuli alone. Do our results also hold for other forms of stimulus mixing, or is the assumption of linearity crucial? To show that our results are not specific to the linear mixing model, here we consider two nonlinear, experimentally motivated forms of stimulus mixing.

#### Nonlinear mixing model of Britten and Heuer (1999)

Britten and Heuer (1999) present pairs of moving gratings inside the receptive fields of MT neurons and show that a nonlinear mixing equation of the following form provides a good characterization of their responses:
where *f*(*s*_{1}; φ_{k}) and *f*(*s*_{2}; φ_{k}) are the mean responses of neuron *k* to the individual gratings. This equation can interpolate smoothly between simple averaging (*a* = 1, ν = 1) and max-pooling (ν → ∞) of the responses to the individual gratings. Britten and Heuer (1999) report a wide range of values for the parameters *a* and ν across the population of neurons they recorded from. They further show that allowing these parameters to vary results in significantly better fits than the simple averaging model for most of the neurons.

We assume a single unsegregated population of neurons and derive the Fisher information matrix as before, using the mean responses described in Equation 78 above. Figure 9*A* shows the asymptotic variance of the optimal estimator as a function of the exponent ν. Increasing ν reduces stimulus mixing and significantly improves the encoding accuracy, consistent with the results from the linear mixing model. As in the linear mixing model, the effect of stimulus mixing on encoding accuracy can be understood, at least qualitatively, by considering the magnitude of the overlap between the profiles of the partial derivatives of the mean responses with respect to the two stimuli (Fig. 9*B*,*C*).

#### Divisive stimulus mixing

We next consider another biologically motivated form of stimulus mixing based on divisive normalization. Specifically, as in the linear mixing model, we separate the neurons into two groups and assume that the responses of neurons in each group are normalized by a weighted sum of the activity of neurons in the other group. The weighting is assumed to be neuron specific such that neurons with similar stimulus preferences exert a larger divisive influence on each other. This type of divisive normalization has previously been motivated by considerations of efficient coding in early visual cortical areas (Schwartz and Simoncelli, 2001) and can be used to describe stimulus-dependent suppressive surround effects in the visual cortex (Allman et al., 1985; Cavanaugh et al., 2002). Mathematically, the response of a neuron in the first group is described by the following:
where for the weighting profile *w*(φ_{k}, φ_{k′}), we use a normalized von Mises function (see Materials and Methods). Responses of neurons in the second group are similar, but with the roles of *s*_{1} and *s*_{2} reversed.
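A sketch of this kind of divisive rule is shown below. The exact functional form of Equation 79 is an assumption here (a von Mises-weighted sum of the other group's activity in the denominator, scaled by a factor *k*_{w}); the tuning curve and parameter values are likewise illustrative:

```python
import numpy as np

def tuning(s, phi, alpha=20.0, gamma=2.0):
    return alpha * np.exp(gamma * (np.cos(s - phi) - 1.0))

def vm_profile(phi_k, phi_all, kappa=2.0):
    """Normalized von Mises weighting profile centered on each phi_k."""
    w = np.exp(kappa * np.cos(phi_k[:, None] - phi_all[None, :]))
    return w / w.sum(axis=1, keepdims=True)

def divisive_responses(s1, s2, k_w, n=64):
    """Sketch of the divisive mixing rule: group 1 responses to s1,
    normalized by a weighted sum of group 2 activity. The exact form of
    Eq. 79 is an assumption here."""
    phi = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    W = vm_profile(phi, phi)
    drive2 = tuning(s2, phi)               # group 2 activity
    denom = 1.0 + k_w * (W @ drive2)
    return tuning(s1, phi) / denom

weakly   = divisive_responses(1.0, 1.2, k_w=0.01)
strongly = divisive_responses(1.0, 1.2, k_w=0.2)
print(weakly.max(), strongly.max())  # stronger normalization suppresses responses
```

Because neurons with similar preferences carry the largest weights in the denominator, the suppression is stimulus dependent, which is what makes this a form of stimulus mixing rather than a uniform gain change.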

Figure 10, *A* and *B*, shows the effects of varying the divisive normalization scaling factor *k*_{w} and the across-group neural correlations β on encoding accuracy. Increasing *k*_{w} decreases encoding accuracy. However, unlike in the linear mixing model, this decrement in encoding accuracy does not only reflect the effect of stimulus mixing per se, but also a stimulus-independent scaling of the response gain. To tease apart the contribution of stimulus mixing per se from that of a simple gain change in the responses, we built a model that, neuron by neuron, had the same gain as the divisive normalization model in Equation 79 (for all stimuli), but whose Fisher information was computed by treating the denominator in Equation 79 as constant (this was done by setting the off-diagonal entries of the FIM to zero). This second model thus eliminates stimulus mixing, but preserves the neuron-by-neuron response gains of the first model. The results for this second model are shown with dashed lines in Figure 10, *A* and *B*. Comparing the dashed lines with the same-colored solid lines, we see that stimulus mixing per se induces a cost to encoding accuracy for stimulus pairs with large and small Δ*s* values, but not for stimulus pairs with intermediate Δ*s* values. As in the linear mixing model, increasing β improves encoding accuracy in the divisively normalized model as well (Fig. 10*B*).

Figure 10*A* suggests that stimulus mixing in the divisively normalized form given in Equation 79 is not as harmful as in the linear mixing model (or in the nonlinear mixing model of Britten and Heuer, 1999). To understand this difference between the two forms of stimulus mixing, we consider a two-neuron version of the divisively normalized model analogous to the two-neuron model considered earlier for the linear mixing model. We again ignore the nonlinearities introduced by the tuning function *f*(*s*; φ) and model the responses of the two neurons as follows:
As in the two-neuron version of the linear mixing model, for a given (*r*_{1}, *r*_{2}) pair, the two equations above define two lines in the (*s*_{1}, *s*_{2}) plane whose intersection gives the maximum likelihood estimate of the stimuli. We first note that, unlike in the linear mixing model, noise in *r*_{i} changes the slopes of the lines. The slopes of the two lines described in Equation 80 are given by 1/(*r*_{1}*w*) and *r*_{2}*w*, respectively. This suggests that as long as *r*_{1} and *r*_{2} are sufficiently large, the slope of the first line will be small, and will remain small despite random variations in *r*_{1}, whereas the slope of the second line will be large and will remain large in the face of variations in *r*_{2}. Unless *r*_{1} and *r*_{2} are very small, the two neurons thus encode the stimuli approximately orthogonally (Fig. 10*C*), unlike in the case of linear mixing, where the slopes of the lines can become arbitrarily close to each other as stimulus mixing increases. This approximately orthogonal coding of the stimuli, in turn, causes only a relatively small amount of distortion in the estimates (*ŝ*_{1}, *ŝ*_{2}) as *r*_{1} and *r*_{2} vary stochastically for a particular stimulus pair (as indicated by the spread of the blue dots in Fig. 10*C*), explaining why stimulus mixing is less harmful in the divisive mixing model than in the linear mixing model. As in the linear mixing model, the stimulus dependence of encoding accuracy in the divisive mixing model can be qualitatively understood by considering the magnitude of the inner product of the derivatives of the mean responses with respect to *s*_{1} and *s*_{2} (data not shown).
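The intersection of the two lines has a closed form, which makes the geometry easy to check. The response equations *r*_{i} = *s*_{i}/(1 + *ws*_{j}) + ε_{i} used below are our reading of the form described for Equation 80 (they reproduce the quoted slopes 1/(*r*_{1}*w*) and *r*_{2}*w*), so treat them as an assumption:

```python
import numpy as np

def intersect(r1, r2, w):
    """ML estimate for the two-neuron divisive model: intersection of
    s1 = r1 (1 + w s2) and s2 = r2 (1 + w s1), solved in closed form.
    Derivation: substituting one line into the other gives
    s1 (1 - r1 r2 w^2) = r1 (1 + w r2), and symmetrically for s2."""
    det = 1.0 - r1 * r2 * w**2
    s1_hat = r1 * (1.0 + w * r2) / det
    s2_hat = r2 * (1.0 + w * r1) / det
    return s1_hat, s2_hat

w, s1, s2 = 0.5, 2.0, 3.0
r1 = s1 / (1.0 + w * s2)     # noise-free responses
r2 = s2 / (1.0 + w * s1)
print(intersect(r1, r2, w))  # recovers (2.0, 3.0)
```

Perturbing *r*_{1} and *r*_{2} around these values and recomputing the intersection shows the relatively small, near-isotropic spread of estimates described for Figure 10*C*.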

### Stimulus mixing is not always harmful for encoding accuracy

The examples of stimulus mixing considered thus far showed a harmful effect on encoding accuracy. This raises the important question: Is stimulus mixing always harmful for encoding accuracy? Here we show that the answer is no. To show this, we first analyze the general stimulus mixing problem with a two-dimensional toy model similar to the ones presented earlier for linear and divisive mixing. We imagine two “neurons” mixing the two stimuli *s*_{1}, *s*_{2} according to *f*_{1}(*s*_{1}, *s*_{2}) and *f*_{2}(*s*_{1}, *s*_{2}), respectively. The responses of the two neurons are given by *r*_{1} = *f*_{1}(*s*_{1}, *s*_{2}) + ε_{1} and *r*_{2} = *f*_{2}(*s*_{1}, *s*_{2}) + ε_{2}, where ε_{1} and ε_{2} are Gaussian random variables with variance σ^{2} and correlation ρ. We denote the Jacobian matrix for the mean responses of the neurons by **J**. One can think of **J** as a mixing matrix describing the sensitivity of each neuron to changes in each stimulus. **J** would be diagonal (or antidiagonal) in the case of no stimulus mixing. The FIM is given by *I*_{F} = **J**^{T}**Σ**^{−1}**J**, where **Σ** is the covariance matrix of the response noise. *I*_{F}^{−1} gives the asymptotic covariance matrix of the maximum likelihood estimator. To find the optimal mixing matrix **J**, we minimize the trace of *I*_{F}^{−1}, i.e., Tr[*I*_{F}^{−1}] = Tr[**J**^{−1}**Σ** **J**^{−T}], with respect to **J**. With no constraints on **J**, *I*_{F}^{−1} can be made arbitrarily small, for example, by making **J** diagonal and its diagonal entries arbitrarily large. Because the derivatives in the Jacobian can be positive or negative, a plausible constraint on **J** is to require the sum of the squares of the derivatives in **J** to be a finite constant *K*. In terms of the matrix **J**, this means requiring Tr[**J**^{T}**J**] = *K*. The optimal **J** can then be found by the method of Lagrange multipliers (see Materials and Methods).

The optimal solution is to set the gradients of the two neurons, ∇*f*_{1} and ∇*f*_{2}, to have equal norm and the angle between them to θ* with cos θ* given by (Fig. 12*A*):
The absolute orientation of the gradients in the plane, however, can be arbitrary. Thus, Equation 81 describes a one-dimensional family of solutions. Because Fisher information is a local measure of information, this solution holds for a given arbitrary point (*s*_{1}, *s*_{2}) in the plane. For optimal mixing over the entire plane, the conditions specified by the solution have to be satisfied for each point (*s*_{1}, *s*_{2}) in the plane. This can be achieved by choosing the mixing function of the first neuron *f*_{1}(*s*_{1}, *s*_{2}) arbitrarily and then choosing the mixing function of the second neuron *f*_{2}(*s*_{1}, *s*_{2}) such that the optimality conditions on the gradients are satisfied at each point in the plane.

Three important aspects of the solution are worth emphasizing. First, the solution does not require **J** to be diagonal (or antidiagonal). Thus, stimulus mixing is not intrinsically harmful; rather, stimuli should be mixed in complementary ways by different neurons so that the conditions on the gradients are satisfied. Second, the optimal solution, in fact, necessitates a certain amount of stimulus mixing when ρ ≠ 0, as the optimality condition for ρ ≠ 0 cannot be satisfied with ∇*f*_{1} and ∇*f*_{2} aligned with the two axes of the (*s*_{1}, *s*_{2}) plane. Third, for ρ → 0, cos θ* → 0; therefore, the gradients have to be orthogonal to each other in this case. Furthermore, the optimal angle θ* between the gradients changes rather slowly as ρ moves away from 0, so the gradients should be close to orthogonal for a large range of ρ values around 0. The orthogonality condition can be understood as follows. Intuitively, ∇*f*_{1} is the first neuron's “least uncertain” direction in the (*s*_{1}, *s*_{2}) plane. The second neuron has to align its least uncertain direction orthogonally to ∇*f*_{1} so that together the two neurons can encode the stimuli with the least total uncertainty. In our original analysis of the linear mixing model, the orthogonality condition can be satisfied only when there is no stimulus mixing because, motivated by a consideration of consistency with physiological data, the weights were assumed to be non-negative in that case.

The solution of the two-dimensional toy model can be readily generalized to models with more than two neurons under certain conditions (Eqs. 66–71; see Materials and Methods). For the general case of *n* neurons encoding two stimuli, as far as we know, there is no closed-form solution for the optimal mixing matrix, **J**, subject to a constraint on the total power of the derivatives. Numerical solution of this more general problem shows that, for any given neural covariance structure, there is a diverse set of solutions: Figure 11*A* shows three example solutions for *n* = 16 independent neurons and Figure 11*B* shows three example solutions for *n* = 16 correlated neurons with a limited-range correlation structure. Moreover, random mixing of the stimuli by neurons performs remarkably well, especially for large *n*. Figure 12*B* compares the performance of the median random mixing model with that of the optimal mixing model for different *n* (compare the black asterisks vs the black open squares). For the random mixing models, the gradients of neurons were chosen subject to the total power constraint, i.e., the sum of the squared norms of the gradients was constant, but they were otherwise random.
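The strong performance of random mixing for independent neurons can be illustrated with a small simulation. In the hypothetical sketch below (the names `error`, `opt`, and `median_random` are ours), *n* = 16 neurons with independent, unit-variance noise encode *k* = 2 stimuli; random gradient matrices are rescaled to satisfy the total power constraint, and their median error is compared with the analytic optimum *k*²/power, attained when **G**ᵀ**G** is proportional to the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 16, 2          # neurons, stimuli
power = float(n)      # total power: sum of squared gradient norms

def error(G):
    """Total estimation error for independent unit-variance neurons."""
    J = G.T @ G                        # Fisher information matrix
    return np.trace(np.linalg.inv(J))

# Optimal error under the power constraint (G^T G proportional to identity).
opt = k**2 / power

# Random mixing: Gaussian gradients rescaled to meet the power constraint.
errs = []
for _ in range(2000):
    G = rng.standard_normal((n, k))
    G *= np.sqrt(power / np.sum(G**2))
    errs.append(error(G))
median_random = np.median(errs)
# median_random lies close to, and is bounded below by, opt.
```

The lower bound follows from the AM–HM inequality on the eigenvalues of **G**ᵀ**G**, and random matrices come close to it because their eigenvalue spread shrinks as *n* grows.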

We also wanted to see how well linear mixing models perform compared with unconstrained encoding models where the gradients can be set arbitrarily (subject to the resource constraint). When the encoding model is constrained to be a linear mixing model with two groups and fixed weights for each stimulus (Eqs. 66 and 67; see Materials and Methods), random ensembles of linear mixing models perform worse than random ensembles of arbitrary encoding models (Fig. 12*B*, compare the red vs black open squares). Interestingly, however, the optimal solution for the linear mixing model appears to have the same form as the optimal solution of the two-dimensional problem and performs as well as the optimal solution for arbitrary encoding models (Fig. 12*B*, compare the red vs black asterisks). Figure 12*B* shows the results for populations with independent neurons. Analogous results hold for correlated neural populations. With correlated neurons, the improvement in the relative performance of random ensembles with increasing *n* becomes somewhat slower and the optimal linear encoding model no longer achieves the same performance as the optimal arbitrary encoding model (Fig. 12*C*).
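A toy version of this comparison can be sketched as follows. The grouped structure below is a simplified stand-in for the linear mixing model of Eqs. 66 and 67 (not reproduced here): we assume each group's gradients share a common non-negative mixing-weight direction, with random magnitudes, whereas the arbitrary model draws gradients freely; both are rescaled to the same total power, and the function names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 16, 2
power = float(n)

def error(G):
    """Total estimation error for independent unit-variance neurons."""
    return np.trace(np.linalg.inv(G.T @ G))

def random_arbitrary():
    # Unconstrained random gradients, rescaled to the total power constraint.
    G = rng.standard_normal((n, k))
    return G * np.sqrt(power / np.sum(G**2))

def random_linear_mixing():
    # Two groups of n/2 neurons; within a group, all gradients point along
    # the group's non-negative mixing-weight direction (random magnitudes).
    dirs = rng.random((2, k))                                # non-negative weights
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    mags = np.abs(rng.standard_normal(n))
    G = np.repeat(dirs, n // 2, axis=0) * mags[:, None]
    return G * np.sqrt(power / np.sum(G**2))

arb = np.median([error(random_arbitrary()) for _ in range(1000)])
lin = np.median([error(random_linear_mixing()) for _ in range(1000)])
# Constrained linear-mixing ensembles incur a larger median error than
# arbitrary ensembles, mirroring the red vs black open squares in Fig. 12B.
```

The gap arises because non-negative weight directions are confined to the first quadrant and so can never be orthogonal, echoing the orthogonality condition of the two-neuron solution.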

## Discussion

Theoretical studies of population coding have traditionally focused on the encoding of single stimuli by neural populations (Abbott and Dayan, 1999; Sompolinsky et al., 2001; Ecker et al., 2011). In this work, we extended the neural population coding framework to the encoding of multiple stimuli, assuming encoding models with biologically motivated properties. We examined a linear mixing rule commonly observed in cortical responses. According to this rule, the response to the presentation of multiple objects can be described as a linear combination of the responses to the constituent objects. We find that this rule incurs a severe cost to encoding accuracy. This cost is directly related to the mixing of the stimuli in the neural responses, is independent of any general decrease in the overall activity of the population, and can be larger than the cost incurred by even large reductions in the gain or large increases in neural variability. As noted earlier, this result could explain why attention acts primarily as a stimulus selection mechanism (Reynolds et al., 1999; Pestilli et al., 2011), rather than purely as a gain increase or noise reduction mechanism. However, it should be emphasized that mechanistically, stimulus selection can be implemented with a combination of gain increase and a nonlinearity that would amplify the gain differences (Reynolds and Heeger, 2009; Pestilli et al., 2011). Therefore, gain increases might be an integral part of the stimulus selection mechanism by which attention operates.

Why does linear mixing seem to be so prevalent in cortical responses, if it indeed incurs such a large cost? It is possible that linear, or close-to-linear, mixing is an inevitable consequence of computations performed for other purposes, such as achieving object representations invariant to identity-preserving transformations (Zoccolan et al., 2005, 2007). We emphasize that our framework evaluates an encoding model in terms of the ability of a downstream decoder to accurately estimate the constituent stimuli in the face of neural variability. If, however, the goal of the computation is not an accurate representation of the constituent stimuli themselves, but computing a possibly complex function of them, then linear mixing or similar models are not necessarily harmful. For example, in the problem we considered in this paper, if the computational goal were only to estimate a weighted average of the two stimuli *s*_{1} and *s*_{2}, linearly mixed responses would be ideally suited for such a task. Finding the optimal neural codes for the representation of multiple stimuli that achieve the simultaneous objectives of successful performance in behaviorally relevant tasks (Salinas, 2006) and accurate encoding of constituent stimuli could be an important future direction.

The harmful effects of stimulus mixing can be partially alleviated by increased across-group neural correlations or by increased heterogeneity in the mixing weights of the neurons. Importantly, all our main results concerning the linear mixing model, i.e., the effects of stimulus mixing, across-group neural correlations, and heterogeneity in mixing weights, generalize to the suboptimal OLE decoder. This is not a trivial result, because there is, in general, no guarantee that manipulating the properties of a neural population should affect the performance of optimal and suboptimal decoders in similar ways. A previous study (Ecker et al., 2011), for instance, found that in the presence of diversity in neural tuning properties, limited-range correlations can be beneficial for accurately encoding a single stimulus, but this holds only if the responses are decoded optimally and does not generalize to the suboptimal OLE decoder.

In the linear mixing model, increasing the number of stimuli makes stimulus mixing even costlier. This result suggests that stimulus mixing might contribute to set size effects commonly observed in visual short-term memory and other psychophysical tasks. Decreases in performance with set size in such tasks are typically attributed to a capacity limitation, e.g., an upper limit on the total amount of neural activity, which might be implemented by divisive normalization (Ma et al., 2014). However, our results demonstrate that even without any constraint on the total amount of neural activity (indeed, in our simulations, total activity was proportional to set size), set size effects would be expected in the linear mixing regime, as it becomes more difficult to find harmless, low (normalized) entropy weight vectors with increasing set size. Such weight vectors can still be found through learning, but any learning algorithm would take longer to reach these low-entropy regions of weight space, and more fine-tuning would be required to keep the weights there once such a region is found, as any noise in the weights, or in the learning algorithm itself, would be more likely to push the weights out of the low-entropy region.

The property that makes linear mixing particularly harmful for encoding accuracy is not the linearity of response mixing per se. It is rather the degree of overlap, or similarity, between the derivative profiles of the neural responses with respect to different stimuli that, to a first approximation, determines how harmful a particular form of stimulus mixing can be (Figs. 6, 9*B*,*C*). Indeed, our results for the nonlinear mixing rule of Britten and Heuer (1999) show that stimulus mixing can lead to a severe reduction in encoding accuracy even when mixing takes a strongly nonlinear form.

Stimulus mixing, in itself, is not always harmful for encoding accuracy. As our analytic solution to the optimal mixing problem in a toy model and numerical solutions in more complex cases suggest, it may even be optimal in the presence of neural correlations. Stimulus mixing has to satisfy certain conditions to be unharmful for encoding accuracy. In a simple two-dimensional problem and with sufficiently low neural correlations, those conditions can be condensed into an intuitive orthogonality constraint on the gradients of the two groups' mean responses. In the linear mixing model, this constraint is satisfied only if there is no stimulus mixing at all or negative weights are allowed. We also found that random mixing by individual neurons, assuming that there is no restriction to non-negative weights, performs remarkably well, especially in large populations. This result is reminiscent of other cases where random solutions have been found to perform well (Rigotti et al., 2010; Barak et al., 2013) and calls for a more general account of the effectiveness of such random solutions in diverse computational problems in neuroscience.

A stimulus mixing problem similar to the one investigated in this paper has been studied previously for temporal signals (White et al., 2004; Ganguli et al., 2008). Stimulus mixing in this context refers to the mixing of signals at different time points in the responses of a recurrently connected dynamical population of neurons. White et al. (2004) show that neural networks with an orthogonal connectivity matrix can achieve optimal estimation performance for temporal signals uncorrelated across time. We note that this is formally similar to the solution for the optimal mixing matrix in our toy stimulus mixing problem, where we found that orthogonal mixing matrices are optimal when neurons are independent.

Finally, our results suggest that psychophysical tasks that require the simultaneous encoding of multiple stimuli can be informative about the brain's decoding strategies. Although behavioral tasks with a single encoded stimulus can already provide useful information about the brain's decoding schemes (Haefner et al., 2013; Hohl et al., 2013), tasks with multiple encoded stimuli can yield additional and possibly richer information about the decoder. For example, an optimal decoder predicts specific types of correlations between the estimates of two stimuli in a task where both stimuli have to be estimated simultaneously (Fig. 4). If the pattern of correlations observed in such an estimation task is found to be inconsistent with the predicted correlations from an optimal decoder, this may be taken as evidence against optimal decoding. Similarly, different types of decoders make different predictions about the stimulus dependence of encoding accuracy even in tasks that require the estimation of a single target stimulus among multiple stimuli (compare Figs. 1 and 8 for the stimulus dependence of encoding accuracy using the optimal and OLE decoders, respectively). Again, such differences can be used to rule in or rule out certain decoding schemes as plausible decoding strategies the brain might be using in a given psychophysical task.

## Footnotes

W.J.M. was supported by Grant R01EY020958 from the National Eye Institute and Grant W911NF-12-1-0262 from the Army Research Office.

The authors declare no competing financial interests.

- Correspondence should be addressed to A. Emin Orhan, Center for Neural Science, 4 Washington Place, New York, NY 10003. eorhan@cns.nyu.edu