## Abstract

Complex cognitive behaviors, such as context-switching and rule-following, are thought to be supported by the prefrontal cortex (PFC). Neural activity in the PFC must thus be specialized to specific tasks while retaining flexibility. Nonlinear “mixed” selectivity is an important neurophysiological trait for enabling complex and context-dependent behaviors. Here we investigate (1) the extent to which the PFC exhibits computationally relevant properties, such as mixed selectivity, and (2) how such properties could arise via circuit mechanisms. We show that PFC cells recorded from male and female rhesus macaques during a complex task show a moderate level of specialization and structure that is not replicated by a model wherein cells receive random feedforward inputs. While random connectivity can be effective at generating mixed selectivity, the data show significantly more mixed selectivity than predicted by a model with otherwise matched parameters. A simple Hebbian learning rule applied to the random connectivity, however, increases mixed selectivity and enables the model to match the data more accurately. To explain how learning achieves this, we provide analysis along with a clear geometric interpretation of the impact of learning on selectivity. After learning, the model also matches the data on measures of noise, response density, clustering, and the distribution of selectivities. Of two styles of Hebbian learning tested, the simpler and more biologically plausible option better matches the data. These modeling results provide clues about how neural properties important for cognition can arise in a circuit and make clear experimental predictions regarding how various measures of selectivity would evolve during animal training.

**SIGNIFICANCE STATEMENT** The prefrontal cortex is a brain region believed to support the ability of animals to engage in complex behavior. How neurons in this area respond to stimuli—and in particular, to combinations of stimuli (“mixed selectivity”)—is a topic of interest. Even though models with random feedforward connectivity are capable of creating computationally relevant mixed selectivity, such a model does not match the levels of mixed selectivity seen in the data analyzed in this study. Adding simple Hebbian learning to the model increases mixed selectivity to the correct level and makes the model match the data on several other relevant measures. This study thus offers predictions on how mixed selectivity and other properties evolve with training.

## Introduction

The ability to execute complex, context-dependent behavior is evolutionarily valuable and ethologically observed (Kalin et al., 1991; Rendall et al., 1999). How the brain carries out complex behaviors is thus the topic of many neuroscientific studies. A region of focus is the prefrontal cortex (PFC; Duncan, 2001; Miller and Cohen, 2001; Botvinick, 2008; Waskom et al., 2014), as lesion (Szczepanski and Knight, 2014) and imaging (Miller and D'Esposito, 2005; Bugatus et al., 2017) studies have implied its role in complex cognitive tasks. As a result, several theories have been put forth to explain how the PFC can support complexity on the computational and neural levels (Miller and Cohen, 2001; Wood and Grafman, 2003; Fusi et al., 2016).

A common way to investigate a neural population's role in a computation is by observing the selectivity profiles of that population's constituent cells. In its simplest form, this involves modeling a neuron's firing rate as a function of a single stimulus, or, perhaps, as an additive function of multiple stimuli (Duhamel et al., 1998; Sahani and Linden, 2003; Moser et al., 2008). More recently, however, the role of neurons that combine inputs in a nonlinear way has been investigated (Mante et al., 2013; Meister et al., 2013; Pagan et al., 2013; Rigotti et al., 2013; Stokes et al., 2013; Raposo et al., 2014; Fusi et al., 2016), often in the PFC. Rather than responding only to changes in one input, or to changes in multiple inputs in a linear way, neurons with nonlinear mixed selectivity have firing rate responses that are a nonlinear function of ≥2 inputs (Fig. 1*B*). Cells with this selectivity (which we call simply “mixed”) are important for population coding because of their effect on the dimensionality of the representation: they increase the dimensionality of the population response, which increases the number of patterns that a linear classifier can read out. This means that arbitrary combinations of inputs can be mapped to arbitrary outputs. In relation to complex behaviors, mixed selectivity allows for a change in context, for example, to lead to different behavioral outputs, even if stimulus inputs are the same (Fusi et al., 2016).

Theoretical work on how these properties can arise on a circuit level shows that random connectivity is surprisingly efficient at increasing the dimensionality of the neural representation (Maass et al., 2002; Jaeger and Haas, 2004; Buonomano and Maass, 2009; Rigotti et al., 2010; Barak et al., 2013; Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017). This means that mixed selectivity can be observed even without learning. However, learning can greatly improve the ability of a linear readout to generalize and hence to make the readout response more robust to noise and variations in the sensory inputs (Fusi et al., 2016). The ideal situation would be one in which a neural population represents only the task-relevant variables and the representation has the maximal dimensionality. In brain areas like the PFC, where there is a huge convergence of inputs from many other brain areas, it might be important to bias the mixed-selectivity representations toward the task-relevant variables, which can be achieved only with learning.

In this study, we characterize the response of a population of PFC cells in terms of the distribution of linear and nonlinear selectivity, the response density, and the clustering of selectivities. All these properties characterize the dimensionality of neural representations and are important for the readout performance. As described above, nonlinear mixed selectivity is important for increasing dimensionality. High dimensionality, however, also requires a diversity of responses. We studied this by determining how the preference to different stimuli are distributed across the population. In some lower sensory areas, cells tend to be categorizable; that is, there are groups of cells that display similar preference profiles (Goard et al., 2016). More associative areas tend to lose this clustering of cell types. Such categories may be useful when an area is specialized for a given task, but diversity is needed for flexibility (Raposo et al., 2014).

After characterizing the PFC response, we show that a model with random connectivity can only partially explain the PFC representation. However, with a relatively small deviation from random connectivity—obtained with a simple form of Hebbian learning characterized by only two parameters—the model describes the data significantly better.

## Materials and Methods

##### Task design.

The data used in this study come from previously published work (Warden and Miller, 2010). In brief, two monkeys performed two variants of a delayed match-to-sample task (Fig. 1*A*). In both task types, after initial fixation, two image cues (chosen from four possible cues) were presented in sequence for 500 ms each with a 1000 ms delay period between the first and second cue. After a second delay period, also lasting 1000 ms, one of two events occurred, depending on the task type. In the “recognition” task, another sequence of two images was shown and the monkey was instructed to release a bar if this test sequence matched the initial sample sequence. In the “recall” task, an array of three images appeared on the screen, and the monkey had to saccade to the two images from the sample sequence in the correct order. Blocks of recall and recognition tasks were interleaved during each recording session. Given that each sequence had two different image cues chosen from the four total image identity options and that there were two task types, the total number of conditions was 4 × 3 × 2 = 24.
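The condition count can be checked by enumerating the design directly (a small sketch; cue and task labels are placeholders):

```python
from itertools import permutations

cues = ["A", "B", "C", "D"]              # four possible image cues
task_types = ["recognition", "recall"]

# A sample sequence is an ordered pair of two *different* cues:
# 4 choices for cue 1 x 3 remaining choices for cue 2.
sequences = list(permutations(cues, 2))
conditions = [(tt, c1, c2) for tt in task_types for (c1, c2) in sequences]

print(len(sequences))   # 12 sequences
print(len(conditions))  # 4 x 3 x 2 = 24 conditions
```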

##### Neural data.

Recordings were made using grids with 1 mm spacing (Crist Instrument) and custom-made independently moveable microdrives to lower eight dura-puncturing Epoxylite-coated tungsten microelectrodes (FHC) until single neurons were isolated. Cells were recorded from two adult rhesus monkeys (*Macaca mulatta*), one female and one male, and combined for analysis. No attempt was made to prescreen neurons, and a total of 248 neurons were recorded (with each neuron observed under both task types).

For the purposes of this study, firing rates for each neuron were calculated as the total number of spikes during the last 900 ms of the second delay period, as it was at this point that the identities of all task variables were known. Any cells that did not have ≥10 trials for each condition or did not have a mean firing rate of ≥1 spike/s as averaged over all trials and conditions were discarded. This left 90 cells.

##### Fano factor measurements.

Noise is an important variable when measuring selectivity. High noise levels require stronger tuning signals to be useful for downstream areas and to reach significance in statistical testing. Thus, any model attempting to match the selectivity profile of a population must be constrained to have the same level of noise. Here, we measure noise as the Fano factor (variance divided by mean) of each cell's activity across trials for each condition (spike count taken from the last 900 ms of the two-object delay). This gives 24 values per cell: the trial Fano factor. Averaging over conditions gives one trial Fano factor value per cell, and averaging over cells gives a single number representing the average noise level of the network. Unless otherwise stated, *FF*_{T} refers to this network-averaged measure.

Another measure of interest is how a neuron's response is distributed across conditions. Do neurons respond differentially to a small number of conditions (i.e., a sparse response), or is the distribution flatter? To measure this, the firing rate for each condition (averaged across trials) was calculated for each neuron and the Fano factor was calculated across conditions. In this case, a large value means that some conditions elicit a very different response than others, while a small value suggests the responses across conditions are more similar. We call this value the response variability (RV). Averaging across all cells gives the RV of the network.

See Figure 1*C* for a visualization of these measures in an example neuron.
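Both measures can be sketched in a few lines (simulated Poisson counts stand in here for the recorded spike counts, which in the real analysis come from the 900 ms delay window):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated spike counts: 90 cells x 24 conditions x 10 trials.
counts = rng.poisson(lam=5.0, size=(90, 24, 10))

# Trial Fano factor FF_T: variance/mean across trials for each condition,
# averaged over conditions (one value per cell), then over cells.
mean_t = counts.mean(axis=2)
var_t = counts.var(axis=2, ddof=1)
ff_t = (var_t / mean_t).mean(axis=1).mean()

# Response variability RV: Fano factor of the trial-averaged rates
# across the 24 conditions, averaged over cells.
rv = (mean_t.var(axis=1, ddof=1) / mean_t.mean(axis=1)).mean()

print(ff_t, rv)  # Poisson counts give FF_T near 1; flat tuning gives small RV
```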

##### Selectivity measurements.

A neuron is selective to a task variable if its firing rate is significantly and reliably affected by the identity of that task variable. In this task, each condition contains three task variables: task type (TT), the identity of the first cue [Cue 1 (C1)], and the identity of the second cue [Cue 2 (C2)]. Therefore, we used a three-way ANOVA to determine whether a given neuron's firing rate was significantly (*p* < 0.05) affected by a task variable or combination of task variables. Selectivity can be of two types: pure or nonlinearly mixed (referred to as just “mixed”), based on which terms in the ANOVA are significant. If a neuron has a significant effect from one of the task variables, for example, it would have pure selectivity to that variable. Interaction terms in the ANOVA represent nonlinear effects from combinations of variables. Therefore, any neurons that have significant contributions from interaction terms as determined by the ANOVA have nonlinear mixed selectivity. As an example, if a neuron's firing rate can be described by a function that is linear in the identity of the TT, the identity of C2, and the identity of the combination of TT and C1, then that neuron has pure selectivity to TT, pure selectivity to C2 and mixed selectivity to the combination of TT and C1 (TT × C1). Note that having pure selectivity to ≥2 task variables is not the same as having nonlinear mixed selectivity to a combination of those task variables.
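The distinction between additive (pure) and interacting (mixed) effects can be illustrated with two hypothetical rate tables; the difference-of-differences used here is a simple stand-in for the ANOVA interaction term, not the test used in the paper:

```python
import numpy as np

# Mean firing rates (spikes/s) over TT (rows: recognition, recall)
# and C1 identity (columns: A, B), holding C2 fixed.
pure = np.array([[2.0, 4.0],
                 [5.0, 7.0]])    # additive: recall adds 3, C1=B adds 2
mixed = np.array([[2.0, 4.0],
                  [5.0, 12.0]])  # the C1=B effect depends on TT

def interaction(table):
    # Difference of differences: zero for a purely additive response.
    return (table[1, 1] - table[1, 0]) - (table[0, 1] - table[0, 0])

print(interaction(pure))   # 0.0 -> no TT x C1 interaction
print(interaction(mixed))  # 5.0 -> nonlinear (supralinear) TT x C1 term
```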

We also investigate whether the nonlinear interactions we observe indicate supralinear or sublinear effects. To do this, we fit a general linear model that includes second-order interaction terms to each neuron's response. The signs of the coefficients for the second-order terms indicate whether a certain nonlinear effect leads to a response higher (supralinear) or lower (sublinear) than expected from a purely additive relationship.

##### Clustering measurement.

Beyond the numbers of neurons selective to different task variables, an understanding of how preferences to task-variable identities cluster can inform network models. For this, we use a method inspired by the projection angle index of response similarity (PAIRS) measurement as described by Raposo et al. (2014). For this measure, each neuron is treated as a vector in selectivity space, where the dimensions represent preference to a given task-variable identity (Fig. 1*D*). To get these values, neuronal responses are fit with a general linear model (GLM) to find which task-variable identities significantly contribute to the firing rate. Note that this gives a β coefficient for each value of each task variable, such as C1 = *B*. These values dictate how the firing rate changes as task-variable identities differ from the reference condition TT = *Recognition*, C1 = *A*, and C2 = *B*. This is expressed as follows: *FR* = *FR*_{ref} + *b*_{1}[*TT* = *Recall*] + *b*_{2}[*C*1 = *B*] + *b*_{3}[*C*1 = *C*] + *b*_{4}[*C*1 = *D*] + *b*_{5}[*C*2 = *A*] + *b*_{6}[*C*2 = *C*] + *b*_{7}[*C*2 = *D*]. The β values found for each cell via this method are shown in Figure 3*C* (nonsignificant coefficients, those with *p* > 0.05, are set to 0).

This analysis does not include interaction terms (second- or third-order terms), partly because, given the relatively few trials, the high-dimensional full GLM would be difficult to fit confidently. In addition, analysis of clustering in a high-dimensional space (the full model would yield a 45-dimensional space) with a relatively small number of neurons would be difficult to interpret. Therefore, we look only at how cells cluster according to their preferences for the identities associated with the pure terms.

The coefficients derived from the GLM define a vector in a 7-D vector space for each neuron (Fig. 1*D*). The clustering method compares the distribution of vectors generated by the data (each normalized to be unit length) to a uniform distribution on the unit hypersphere to determine whether certain combinations of preferences are more common than expected by chance.

In PAIRS (Raposo et al., 2014), this comparison is done by first computing the average angle between a given vector and its *k* nearest neighbors and seeing whether the distribution of those values differs between the data and a random population. That approach is less reliable in higher dimensions. Therefore, we use the Bingham test instead of PAIRS (Mardia and Jupp, 2000). The Bingham test calculates the test statistic *S* = (*np*(*p* + 2)/2)[tr(*T*²) − 1/*p*]. This statistic, which we refer to as the clustering value, measures the extent to which the scatter matrix *T* (an approximation of the covariance matrix) differs from the identity matrix (scaled by 1/*p*), where *p* and *n* are the dimension of the selectivity space (seven) and the number of cells (90), respectively. The higher this value is, the more the data deviate from a random population of vectors wherein selectivity values are assumed to be independent and identically distributed random variables. Thus, a high value suggests that neurons in the population cluster according to task-variable identity preferences. To put this clustering value into context, we compared the value found from the data to two distributions: one generated by shuffled data and one generated from data designed to be highly clustered. For the shuffled data, we created "fake" cell vectors by shuffling the selectivity values across all cells. For the clustered data, we created three categories of fake cells, each defined by pure selectivity to two specific task-variable identities. A population of 90 cells was created by combining 30 cells from each category (the population was also designed to have the same average firing rate and *FF*_{T} as the data). This results in a population that has three clear clusters of cell types in selectivity space. One hundred populations based on each type of fake data were created to generate distributions that represent random and clustered data.
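A sketch of this clustering value, assuming the standard form of the Bingham statistic from Mardia and Jupp (variable names are illustrative):

```python
import numpy as np

def bingham_statistic(vectors):
    """Clustering value: deviation of the scatter matrix T from I/p."""
    n, p = vectors.shape
    u = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    T = u.T @ u / n                                # scatter matrix
    return (n * p * (p + 2) / 2) * (np.trace(T @ T) - 1.0 / p)

rng = np.random.default_rng(1)
uniform = rng.standard_normal((90, 7))             # isotropic directions
clustered = rng.standard_normal((90, 7)) * 0.1
clustered[:, 0] += 1.0                             # shared preference axis
print(bingham_statistic(uniform) < bingham_statistic(clustered))  # True
```

For directions uniform on the hypersphere the statistic is small (it is asymptotically χ² distributed), while a shared preference axis inflates tr(*T*²) and hence the value.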

Using the Giné–Ajne test of uniformity on the hypersphere (Giné, 1975) gives very similar results to the Bingham test results.

##### Circuit model.

To explore the circuit mechanisms behind PFC selectivity, we built a simple two-layer neural model based on previous work (Barak et al., 2013; see Fig. 4*A*). The first layer consists of populations of binary neurons, with each population representing a task-variable identity. To replicate a given condition, the populations associated with the task-variable identities of that condition are turned on (set to 1) and all other populations are off (set to 0). Each population has a baseline of 50 neurons. To capture the biases in selectivities found in this dataset (particularly the fact that, in the 900 ms period we used for this analysis, many more cells show selectivity to TT than to C2, and to C2 than to C1), the numbers of neurons in the TT and C2 populations are scaled up accordingly (80 cells in each TT population and 60 in each C2 population). The exact values of these weightings do not have a significant impact on the properties of interest in the model.

The second layer represents PFC cells. These cells get weighted input from a subset of the first-layer cells. Cells from the input layer to the PFC layer are connected with probability 0.25 (unless otherwise stated), and weights for the existing connections are drawn from a Gaussian distribution (μ_{W} = 0.207, and σ_{W} = μ_{W} unless otherwise stated; because negative weights are set to 0, the actual connection probability and σ_{W} may be slightly lower than given).

The activity of a PFC cell, *i*, on each trial, *t*, is a sigmoidal function of the sum of its inputs (Eq. 1): *r*_{i}^{t} = *k*·σ(∑_{j} *w*_{ij}*x*_{j} − ϴ_{i}), where *x*_{j} is the activity (0 or 1) of the *j*^{th} input neuron and *w*_{ij} is the weight from the *j*^{th} input neuron to the *i*^{th} output neuron. ϴ_{i} is the threshold for the *i*^{th} output neuron, which is calculated as a fraction of the total weight it receives: ϴ_{i} = λ∑_{j} *w*_{ij}. The λ value is constant across all cells, making ϴ cell-dependent. *k* scales the responses so that the average model firing rate matches that of the data.

Two sources of noise are used to model trial-to-trial variability. ϵ_{A} is an additive synaptic noise term drawn independently on each trial for each cell from a Gaussian distribution with mean zero. The SD of this distribution, σ_{A}, is controlled by the parameter *a*, which defines σ_{A} in units of the mean of the weight distribution, μ_{W}. The second noise source is multiplicative and depends on the activity of a given cell on each trial (Eq. 2): *y*_{i}^{t} ∼ 𝒩(*r*_{i}^{t}, (*m*·*r*_{i}^{t})²). Thus, the final activity of an output PFC cell on each trial, *y*_{i}^{t}, is drawn from a Gaussian with an SD that is a function of *r*_{i}^{t}. This SD is controlled by the parameter *m*. Both *m* and *a* are fit to make the model *FF*_{T} match that of the data.

To make the model as comparable to the data as possible, 10 trials are run for each condition and 90 model PFC cells are used for inclusion in the analysis.
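A minimal sketch of the model (population sizes and connection statistics are from the text; λ, *k*, and the noise parameters are illustrative placeholders rather than the fitted values, and the insertion point of the additive noise is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Input layer: one binary population per task-variable identity.
# TT has 2 identities (80 cells each); C1 has 4 (50 each); C2 has 4 (60 each).
sizes = [80, 80] + [50] * 4 + [60] * 4
starts = np.cumsum([0] + sizes[:-1])
n_in = sum(sizes)

mu_w = 0.207
W = rng.normal(mu_w, mu_w, size=(90, n_in))       # sigma_W = mu_W
W *= rng.random((90, n_in)) < 0.25                # connection probability 0.25
W[W < 0] = 0.0                                    # negative weights set to 0

lam, k = 0.5, 10.0                                # illustrative, not fitted
theta = lam * W.sum(axis=1)                       # cell-dependent thresholds

def pfc_rate(x, a=0.1, m=0.1):
    """One noisy trial of the model for a binary input pattern x."""
    eps_a = rng.normal(0.0, a * mu_w, size=90)    # additive synaptic noise
    r = k / (1.0 + np.exp(-(W @ x - theta + eps_a)))
    return rng.normal(r, m * r)                   # multiplicative trial noise

# One condition: TT = Recognition, C1 = A, C2 = A (populations 0, 2, 6).
x = np.zeros(n_in)
for p in (0, 2, 6):
    x[starts[p]:starts[p] + sizes[p]] = 1.0
print(pfc_rate(x).shape)  # (90,)
```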

##### Hebbian learning.

A simplified version of Hebbian learning is implemented in the network in a manner that captures the "rich get richer" nature of Hebbian learning while keeping the overall input to an individual cell constant. In traditional Hebbian learning, weight updates are a function of the activity levels of the presynaptic and postsynaptic neurons: Δ*w*_{ij} = *g*(*x*_{j}, *y*_{i}). In this simplified model, we use connection strength as a proxy for joint activity levels: Δ*w*_{ij} = *g*(*w*_{ij}). We also implement a weight-normalization procedure so that the total input weight to a cell remains constant as weights change.

To do this, we first calculate the total amount of input each output cell, *i*, receives from each input population, *p* (Eq. 3): *W*_{i}^{p} = ∑_{j∈p} *w*_{ij}. The input populations (each corresponding to one task-variable identity) are then ranked according to this value. The top *N*_{L} populations according to this ranking (that is, those with the strongest total weights onto the output cell) have the weights from their constituent cells increased according to the following equation (Eq. 4): *w*_{ij} → (1 + η)*w*_{ij}, where η is the learning rate (set to 0.2 unless otherwise stated). This amounts to a multiplicative scaling of synaptic weights, which is compatible with experimental observations (Turrigiano et al., 1998; Loewenstein et al., 2011). After this, all weights into the cell are normalized via the following equation (Eq. 5): *w*_{ij} → *w*_{ij}·(∑_{j} *w*_{ij}^{before}/∑_{j} *w*_{ij}^{after}). Note, the numerator in the second factor is the sum of all weights into the cell before Equation 4 is applied and the denominator is the sum after it is applied. As learning progresses according to this rule, weights from cells that are not in the top *N*_{L} populations trend to zero. At that point, each learning step increases the weights of all remaining connections by η and normalizes them all by the same amount, resulting in no further changes in the weight matrix.

In this work, two versions of Hebbian learning are tested. In the unrestricted, or "free," learning condition described above, the top *N*_{L} populations are chosen freely from all input populations (equivalently, all task-variable identities) based solely on the total input coming from each population after the random weights are assigned. The alternative, "constrained" learning, is largely the same, but with a constraint on how these top *N*_{L} populations are chosen: all task variables must be represented before any can be repeated. So, two populations representing different identities of the same task variable (e.g., Cue 1A and Cue 1B) will not both be included in the *N*_{L} populations unless both other task variables already have a population included (which would require that *N*_{L} > 3). So, with *N*_{L} = 3, exactly one population from each task variable (TT, C1, C2) will have its weights increased. This variant of the learning procedure was designed to ensure that inputs from different task variables could be mixed, to increase the likelihood that mixed selectivity would arise. Both forms of learning are demonstrated for an example cell in Figure 4*B*.

In both forms of learning, the combination of weight updating and normalization is applied to each cell once per learning step.
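One learning step (total-input ranking, multiplicative boost, then renormalization, as in Eqs. 3–5) can be sketched for a single output cell; the population layout and weights are toy values:

```python
import numpy as np

def hebbian_step(w, pops, n_l, eta=0.2):
    """One free-learning update (Eqs. 3-5) for a single output cell.

    w    : weight vector into the cell
    pops : list of index arrays, one per input population
    """
    totals = np.array([w[idx].sum() for idx in pops])  # Eq. 3: per-population input
    top = np.argsort(totals)[::-1][:n_l]               # strongest n_l populations
    before = w.sum()
    w = w.copy()
    for p in top:
        w[pops[p]] *= 1.0 + eta                        # Eq. 4: multiplicative boost
    return w * before / w.sum()                        # Eq. 5: keep total input fixed

# Toy layout: three populations of two input cells each.
pops = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
w = np.array([1.0, 1.0, 3.0, 3.0, 2.0, 2.0])
w_new = hebbian_step(w, pops, n_l=1)
print(np.isclose(w_new.sum(), w.sum()))  # True: normalization preserves the total
```

Constrained learning would change only how `top` is selected, requiring each task variable to be represented before any is repeated. Repeated application drives weights outside the top populations toward zero, at which point the boost and normalization cancel and the weights stop changing, as described above.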

##### Classification performance.

The measures of selectivity we have looked at in the data are important for the ability of a population to represent task information in a way that can be readily read out. We also test directly the ability to read out task information from our model populations using linear discriminant analysis (LDA). We generate 20 trials per condition from the model and use 10 to train the classifiers and 10 to test. Three separate classifiers are trained to read out each of the three linear terms: TT identity, C1 identity, and C2 identity. The average performance across these three tasks gives the “linear” performance. An additional four classifiers were trained to read out each of the joint identities of TT–C1, TT–C2, C1–C2, and TT–C1–C2. The average performance across these four tasks is called the “higher-order” performance.

We also conduct an explicit test of the model's ability to perform a nonlinearly separable task. For this, all combinations of identities for C1 and C2 are generated as inputs to the network, and the classification task is to determine whether the identities are the same or different (the TT input is held constant). Fifty trials are used for training (using LDA) and 50 for testing. We also measure the ability of the input population to perform this task (by using the binary input population activity directly), in which case additive noise is used to generate multiple trials, and the mean firing rate and FF_{T} are fit to match that of the data.
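The point of this test can be sketched with a plain least-squares readout standing in for LDA: a linear readout of the one-hot cue inputs cannot solve same/different, while a randomly mixed layer can (feature count and threshold here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
cues = np.eye(4)                                  # one-hot cue identities

# All 16 C1 x C2 combinations; label 1 when the identities match.
pairs = [(i, j) for i in range(4) for j in range(4)]
X = np.array([np.concatenate([cues[i], cues[j]]) for i, j in pairs])
y = np.array([1.0 if i == j else 0.0 for i, j in pairs])

def readout_acc(F, y):
    # Least-squares linear readout with a bias, thresholded at 0.5.
    A = np.hstack([F, np.ones((len(F), 1))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ beta > 0.5) == (y > 0.5))

# Random mixed layer: thresholded random projections of the inputs.
H = (X @ rng.standard_normal((8, 200)) > 1.0).astype(float)

print(readout_acc(X, y) < readout_acc(H, y))  # True: mixing helps
```

An additive function of the two one-hot codes cannot place all four "same" conditions above threshold and all twelve "different" conditions below it, so the linear readout must err; the nonlinear expansion makes the 16 patterns (almost surely) linearly separable.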

##### Toy model calculations.

To make calculations and visualizations of the impacts of learning easier, we use a further simplified toy model (see Fig. 8*A*, left). Instead of a sigmoidal nonlinearity, the Heaviside step function is used. The toy model has two task variables (T1 and T2), and each task variable has two possible identities (A or B). Four random weights connect these input populations to the output cell: *W*_{1A}, *W*_{1B}, *W*_{2A}, *W*_{2B}. On each condition, exactly one task-variable identity from each task variable is active (set to 1). This gives four possible conditions, each of which is plotted as a point in the input space in Figure 2. The threshold is denoted by the dotted lines. If the weighted sum of the inputs on a given condition is above the threshold, the cell is active (green); otherwise it is not.

The toy model follows the same learning rules defined for the full model. Examples of the impacts of learning on the representation of the four conditions are shown in Figure 2*A*,*B*.

A cell's selectivity is more robust to additive noise (which functions like a shift in threshold) if there is a large range of threshold values for which its selectivity does not change. To explore noise robustness in this model, we define the following (Eq. 6): Δ_{x} = *W*_{1B} − *W*_{1A}, Δ_{y} = *W*_{2B} − *W*_{2A}, and α = Δ_{y}/Δ_{x}. Thus, α is the ratio of the side lengths of the rectangle formed by the four conditions (Fig. 2*C*, top). Without loss of generality, we define the larger of the two sides as associated with T2: *W*_{2B} > *W*_{2A}, *W*_{1B} > *W*_{1A}, and Δ_{y} > Δ_{x}.

For the cell to display pure selectivity to T2, the following inequality must hold (Eq. 7): *W*_{1B} + *W*_{2A} < ϴ < *W*_{1A} + *W*_{2B}. Therefore, the range of thresholds that give rise to pure selectivity is as follows (Eq. 8): (*W*_{1A} + *W*_{2B}) − (*W*_{1B} + *W*_{2A}) = Δ_{y} − Δ_{x}. The analogous calculation for mixed selectivity (assuming the T1B–T2B condition is the only active one, but results are identical for T1A–T2A being the only inactive condition) is as follows (Eq. 9): *W*_{1A} + *W*_{2B} < ϴ < *W*_{1B} + *W*_{2B}, giving a threshold range of Δ_{x}. Thus, pure selectivity is more noise robust than mixed selectivity when Δ_{y} − Δ_{x} > Δ_{x}, that is, when α > 2. This imbalance can be seen in Figure 2*C*.

Now we show that, given weights drawn at random from a Gaussian distribution, α > 2 is more common than α < 2. The argument goes as follows: because Δ_{x} and Δ_{y} are differences of normally distributed variables, they are themselves normally distributed (with μ = 0, σ = √2·σ_{W}). The ratio of these differences is thus given by a Cauchy distribution. However, because α represents a ratio of lengths, we are only interested in the magnitude of this ratio, which follows a standard half-Cauchy distribution. Furthermore, α is defined such that the larger difference is always in the numerator. Thus (Eq. 10): *P*(α > 2) = (1 − (2/π)arctan 2)/(1 − (2/π)arctan 1) = 2 − (4/π)arctan 2 ≈ 0.59. Therefore, most cells can be expected to have α > 2 with random weights and thus higher noise robustness for pure selectivity than for mixed.
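This claim can be checked numerically; under the convention that the larger side ratio goes in the numerator, the predicted fraction with α > 2 is 2 − (4/π)arctan(2) ≈ 0.59 (a sketch assuming i.i.d. Gaussian weights):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(1.0, 0.5, size=(100_000, 4))       # four i.i.d. Gaussian weights
dx = np.abs(w[:, 1] - w[:, 0])                    # |W1B - W1A|
dy = np.abs(w[:, 3] - w[:, 2])                    # |W2B - W2A|
alpha = np.maximum(dx, dy) / np.minimum(dx, dy)   # larger side in numerator

frac = np.mean(alpha > 2)
pred = 2 - (4 / np.pi) * np.arctan(2)             # half-Cauchy prediction
print(round(frac, 2), round(pred, 2))             # both ≈ 0.59
```

The mean and SD of the weight distribution cancel in the ratio, so the result does not depend on the particular Gaussian chosen.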

This comparison of noise robustness, however, assumes the threshold is placed at the location that is most noise robust for each type of selectivity. Here, the threshold is defined as a fraction of the total weight going into the cell: ϴ = λ∑*W*. As we increase λ, then, the threshold is a line with slope of −1 that moves from the bottom left corner up to the top right. Examples of how this affects selectivity are shown in Figure 2*D*.

To investigate how noise robustness changes with λ, we generate a large (10,000-cell) population, each cell with four random input weights drawn from a Gaussian with positive mean and constrained to be non-negative (qualitative results hold for many weight–variance pairs), and calculate the size of the additive noise shift needed to cause each cell to lose whichever selectivity it has.

Assuming a fixed threshold, we then explore how noise robustness varies with learning. In the case of constrained learning with *N*_{L} = 2, Δ_{x} and Δ_{y} both increase. According to Equations 7 and 9, robustness of both selectivities increases with Δ_{x} (for fixed α, both threshold ranges, Δ_{x} and Δ_{y} − Δ_{x} = (α − 1)Δ_{x}, scale with Δ_{x}). The relative increase in robustness will depend on how α changes. It can be shown that if *W*_{1A}*W*_{2B} > *W*_{1B}*W*_{2A}, then Δ_{x} will expand proportionally more than Δ_{y} and α will decrease, meaning the increase in noise robustness favors mixed selectivity. If *W*_{1A}*W*_{2B} < *W*_{1B}*W*_{2A}, then α will grow, and the increase in noise robustness will be larger for pure than for mixed. However, this condition is less common.

When *N*_{L} = 1, learning ultimately leads to a larger ratio between the side lengths. This is straightforward for *W*_{2B} > *W*_{1B} (Δ_{y} grows and Δ_{x} shrinks). However, if *W*_{1B} > *W*_{2B}, α will first decrease as Δ_{x} grows and Δ_{y} shrinks. This is good for mixed noise robustness. The ratio then flips (Δ_{x} > Δ_{y}), and Δ_{y} (the side that is now shorter) is still shrinking while Δ_{x} is growing. In this circumstance, if Δ_{y}/Δ_{x} becomes <½, the representation will favor pure noise robustness over mixed. This flipping of α is also possible for some cells when *N*_{L} = 2 if *W*_{1B} > *W*_{2B}, but the weights would likely plateau before α became <½, and so the drop in mixed selectivity does not occur.

In free learning with *N*_{L} = 2, cells that have *W*_{1A} > *W*_{2B} will see both weights from T1 increase and (due to the weight normalization) both weights from T2 decrease. Because the weights change in proportion to their value, Δ_{x} increases, Δ_{y} decreases, and so α goes down. This leads to more noise robustness for mixed and less for pure. If *W*_{2A} > *W*_{1B}, these trends are reversed and the cell has more noise robustness for pure and less for mixed.

##### Experimental design and statistical analysis.

As described in *Selectivity measurements*, the main statistical test used in this work was a three-way ANOVA (within-subjects, with a total of 23 degrees of freedom). Each of the 90 cells used had 10 trials from each condition. As part of calculating the clustering value (see *Clustering measurement*), we calculated the *p* value for the *F* statistic of the hypothesis test that each coefficient in our GLM was equal to 0. All analyses were performed in Matlab.

## Results

In this study, we analyzed various measures of selectivity of a population of PFC cells recorded as animals carried out a complex delayed match-to-sample task. Through this process, several properties of the PFC representation were identified, and a simple circuit model that included Hebbian learning was able to replicate them. These properties, combined with the modeling results, support the notion that PFC selectivities are the result of Hebbian learning in a random network.

### PFC population is moderately specialized and selective

The average firing rate of cells in this population was 4.9 ± 5.1 spikes/s. Fano factor analyses provided measurements of the noise and density of response in the data (Fig. 3*B*). The average value of the across-trial Fano factor (*FF _{T}* = 2.8 ± 1.7 spikes/s) shows that the data have elevated levels of noise compared with a Poisson assumption. Looking at RV—a measure of how a cell's response is distributed across conditions—suggests that PFC cells are responding densely across the 24 conditions (*RV* = 1.1 ± 1.1 spikes/s; for comparison, at the observed average firing rates, a cell that responded only to a single condition would have *RV* ≈ 120, one that responded to two conditions would have *RV* ≈ 57). This finding suggests that these cells are not responding sparsely and are not very specialized for the individual conditions of this task.
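The two measures above can be sketched in code. The exact definitions are assumptions here: *FF _{T}* is taken as the across-trial variance/mean ratio computed per condition and then averaged, and RV as the variance of condition-mean rates divided by the grand mean rate (a formalization consistent with the spikes/s units reported above, but not necessarily the authors' exact implementation).

```python
# Sketch of the two response measures for one cell. Definitions are assumed:
# FF_T = per-condition (trial variance / trial mean), averaged over conditions;
# RV = variance of condition-mean rates, normalized by the grand mean rate.
from statistics import mean, pvariance

def fano_across_trials(rates_by_condition):
    """Average of (across-trial variance / mean) over responsive conditions."""
    ffs = [pvariance(trials) / mean(trials)
           for trials in rates_by_condition if mean(trials) > 0]
    return mean(ffs)

def response_variance(rates_by_condition):
    """Variance of condition means, normalized by the grand mean rate."""
    cond_means = [mean(trials) for trials in rates_by_condition]
    return pvariance(cond_means) / mean(cond_means)

# A cell that fires in only 1 of 24 conditions is highly specialized: large RV.
sparse_cell = [[24.0, 24.0, 24.0]] + [[0.0, 0.0, 0.0]] * 23
# A cell responding near 1 spike/s in every condition is dense: RV near zero.
dense_cell = [[1.0, 1.1, 0.9] for _ in range(24)]

rv_sparse = response_variance(sparse_cell)
rv_dense = response_variance(dense_cell)
```

With these toy rates, the single-condition responder yields a far larger RV than the dense responder, mirroring the comparison in the text (the specific values 120 and 57 depend on the recorded firing rates and are not reproduced here).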

Each condition is defined by a unique combination of three task variables: task type (TT), identity of image cue 1 (C1), and identity of image cue 2 (C2) (Fig. 1*A*). Selectivity to task variables was determined via a three-way ANOVA. The results of this analysis are shown in Figure 3*A*: the percentage of cells with selectivity to each task variable and combination of task variables [as determined by a significant (*p* < 0.05) term in the ANOVA]. A cell that has selectivity to any of the regular task variables (TT, C1, C2) has “pure selectivity,” while a cell that has selectivity to any of the interaction terms (combinations of task variables such as TT × C1, TT × C2, etc.) has nonlinear “mixed” selectivity. The final two bars in Figure 3*A* show the percentage of cells with pure and mixed selectivity defined this way. Note that a cell can have both pure and mixed selectivity; thus, the two values sum to >100%.
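The logic behind the pure/mixed distinction can be illustrated with a noiseless two-variable example: decompose a cell's table of mean responses into a grand mean, main effects, and an interaction term, mirroring the terms of the ANOVA. This sketch omits the significance test that the actual analysis requires; the example response tables are invented for illustration.

```python
# Pure vs. nonlinear mixed selectivity for a noiseless cell responding to two
# binary task variables (a 2x2 table of mean rates). A nonzero main effect
# corresponds to pure selectivity; a nonzero interaction term corresponds to
# nonlinear mixed selectivity. Real data would use the ANOVA's F test on each
# term rather than an exact-zero check.

def decompose(table):
    """table[i][j]: mean rate for level i of variable 1, level j of variable 2."""
    grand = sum(sum(row) for row in table) / 4.0
    row_means = [sum(row) / 2.0 for row in table]
    col_means = [(table[0][j] + table[1][j]) / 2.0 for j in range(2)]
    main1 = [m - grand for m in row_means]
    main2 = [m - grand for m in col_means]
    interaction = [[table[i][j] - row_means[i] - col_means[j] + grand
                    for j in range(2)] for i in range(2)]
    return main1, main2, interaction

def classify(table, tol=1e-9):
    main1, main2, interaction = decompose(table)
    pure = any(abs(x) > tol for x in main1 + main2)
    mixed = any(abs(x) > tol for row in interaction for x in row)
    return pure, mixed

additive_cell = [[2.0, 4.0], [6.0, 8.0]]   # effects sum linearly: pure only
xor_like_cell = [[1.0, 5.0], [5.0, 1.0]]   # responds to combinations: mixed only
```

The additive cell has nonzero main effects but a zero interaction term everywhere, while the XOR-like cell has the reverse pattern.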

Most cells (77 of 90) showed pure selectivity to ≥1 task variable. But the population shows clear biases in the distribution of these pure selectivities: TT selectivity is the most common (59 cells) and C2 is represented more than C1 (48 vs 30 cells; these biases are observable in the GLM fits as well; Fig. 3*C*). This latter effect may be due to the time at which these rates were collected: these rates were taken during the second delay, which comes directly after the presentation of C2. The former effect is perhaps more surprising. While the TT is changed in blocks and thus knowable to the animal on each trial (except at block changes), there is no explicit need for the animal to store this information: the presence of a second sequence or an array of images will signal the TT without the need for prior knowledge. However, regardless of its functional role in this task, contextual encoding is a common occurrence (Eichenbaum et al., 2007; Komorowski et al., 2013). Furthermore, the fact that the Recall task is more challenging than the Recognition task may contribute to the clear representation of TT. That is, it is possible that the animals keep track of the TT to know how much effort to exert during the task.

Approximately half of the cells (46) had some form of mixed selectivity, mostly to combinations of two task variables. The population had an approximately equal balance of supralinear and sublinear effects of these two-way interactions (ratio of positive to negative terms: 1.07). The small number of cells with selectivity to the three-way interaction term (TT × C1 × C2) is consistent with the relatively low value of RV in this population, as a strong preference for an individual condition would lead to a high RV. The number of cells with only mixed selectivity was low (1 of 90 cells); 32 cells had only pure selectivity, and 12 cells had no selectivity.

We used a population-level analysis inspired by Raposo et al. (2014) to measure the extent to which cell types are clustered into categories. Here, we used this analysis to determine whether cells cluster according to their responsiveness to different task-variable identities (i.e., Recognition vs Recall). That is, are there groups of neurons that all prefer the same TT and image identities, beyond what would be expected by chance? To explore this, we first used a GLM, with task-variable identities as regressors, to fit each neuron individually. The β coefficients from these fits define a neuron's position in selectivity space (these β-coefficient values, which represent how the identity of each task variable changes a neuron's firing rate compared with the reference condition, are shown in Fig. 3*C*; a schematic of how the clustering measure works is shown in Fig. 1*D*). After normalizing each vector, the clustering measure then determines the extent to which the population of vectors deviates from a uniform distribution on the unit hypersphere. The data had a clustering value of 186.2. Comparing this to the mean values of two distributions of artificially generated populations suggests the data have a mild but significant deviation from randomness: the average clustering value for populations generated by randomly shuffling the coefficient values is −23 ± 22, and the average value for populations that have three distinct clusters of selectivity is 706.7 ± 6.8. As the data-clustering value sits between these values and closer to the shuffled data, we conclude that some structure does exist in the data, yet the cells in this population do not appear to form strongly separable categories as defined by task-variable identity preference (Fig. 3*D*).
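The structure of this analysis can be sketched as follows. This is a simplified stand-in: the statistic used in the study follows Raposo et al. (2014) and differs in detail; here a clustered population is detected by a higher average nearest-neighbor similarity among normalized coefficient vectors than in an unstructured population. The dimensionality and cluster parameters are illustrative.

```python
# Simplified stand-in for the clustering analysis: each neuron is a vector of
# GLM coefficients normalized to the unit hypersphere; a clustered population
# shows higher average nearest-neighbor cosine similarity than an unstructured
# one. Illustration only; the study's statistic (after Raposo et al., 2014)
# differs in detail.
import math, random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_nn_similarity(vectors):
    """Average, over neurons, of the largest cosine similarity to any other."""
    sims = []
    for i, v in enumerate(vectors):
        best = max(sum(a * b for a, b in zip(v, w))
                   for j, w in enumerate(vectors) if j != i)
        sims.append(best)
    return sum(sims) / len(sims)

rng = random.Random(0)
DIM = 8  # number of GLM coefficients per neuron (illustrative)

# Clustered population: noisy copies of three prototype selectivity vectors.
prototypes = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
clustered = [normalize([p + 0.1 * rng.gauss(0, 1) for p in proto])
             for proto in prototypes for _ in range(30)]

# Unstructured population: 90 random directions on the hypersphere.
uniform = [normalize([rng.gauss(0, 1) for _ in range(DIM)]) for _ in range(90)]

clustered_score = mean_nn_similarity(clustered)
uniform_score = mean_nn_similarity(uniform)
```

In the actual analysis, the data statistic is compared against shuffled and explicitly clustered surrogate populations, which is what places the reported value of 186.2 between −23 and 706.7.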

### Circuit model without Hebbian learning cannot replicate mix of density and specialization

A simple circuit model was made to replicate the selectivity properties found in the data. The model contains two layers: an input layer consisting of binary neurons that represent task-variable identities and an output layer consisting of “PFC” neurons that get randomly weighted input from the first layer and whose activity is a nonlinear function of the sum of that input. The model also has two forms of noise: an additive term applied before the nonlinearity (which replicates input/background noise, and implicitly shifts the threshold of the cell), and a multiplicative term applied after (which enforces the observed relationship between firing rate and variance; see Materials and Methods; Fig. 4*A*).
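The forward pass of this model can be sketched directly from the description above. All parameter values below (layer sizes, threshold fraction, gain, noise scales, weight distribution) are placeholders rather than the fitted values; the threshold is expressed, as in the study, as the fraction of a cell's total input weight needed to reach half-maximal activity.

```python
# Sketch of the two-layer circuit model: binary input units code task-variable
# identities; each "PFC" unit applies a sigmoid to its randomly weighted input,
# with additive noise before the nonlinearity and multiplicative noise after.
# All parameter values are illustrative placeholders, not the fitted ones.
import math, random

rng = random.Random(1)
N_INPUT, N_PFC = 30, 50
LAMBDA = 0.3            # threshold, as a fraction of a cell's total input weight
GAIN = 10.0             # steepness of the sigmoid
SIG_ADD, SIG_MULT = 0.05, 0.1   # additive / multiplicative noise scales

# Random nonnegative feedforward weights (placeholder distribution).
weights = [[max(0.0, rng.gauss(1.0, 1.0)) for _ in range(N_INPUT)]
           for _ in range(N_PFC)]

def pfc_response(inputs):
    """inputs: list of 0/1 activations of the input layer for one trial."""
    rates = []
    for w in weights:
        total_w = sum(w)
        drive = sum(wi * xi for wi, xi in zip(w, inputs)) / total_w
        drive += rng.gauss(0.0, SIG_ADD)        # additive noise: shifts threshold
        rate = 1.0 / (1.0 + math.exp(-GAIN * (drive - LAMBDA)))
        rate *= max(0.0, 1.0 + rng.gauss(0.0, SIG_MULT))  # multiplicative noise
        rates.append(rate)
    return rates

# One trial: roughly one-third of input cells active, as in the task structure.
trial_input = [1 if rng.random() < 1 / 3 else 0 for _ in range(N_INPUT)]
rates = pfc_response(trial_input)
```

Because the additive term enters before the nonlinearity, it acts as a trial-to-trial shift of the effective threshold, which is the interpretation used in the toy-model analysis later in the Results.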

The output of the initial circuit model, before any Hebbian learning, was analyzed in the same way as the data to determine whether it matched the properties found in the PFC. The results of this can be found in Figure 5. First, in Figure 5*A*, we demonstrate the impact of the noise parameters on FF_{T}, pure and mixed selectivity, and the clustering value. As expected, increasing the additive and/or multiplicative noise terms increases the FF_{T}, as this is a measure of trial variability. Increasing within-condition noise also makes it less likely that a cell will show significant differences across conditions, and thus the percentages of cells with pure and mixed selectivity are inversely related to the noise parameters (the relative sensitivities of mixed and pure selectivity to noise will be discussed in depth later). For similar reasons, the clustering value also decreases with noise (finding significant deviations from a uniform distribution is less likely if cells do not show sufficiently strong preferences).

To determine the impact other properties of the model had on our measures of interest, we varied several other parameters. Figure 5*B* shows what happens at different values of the threshold parameter. Here, the threshold is given as the amount of input the cell needs to reach half its maximal activity, expressed as a fraction of its total input weight (keep in mind that, given the number of input cells in each population and the task structure, approximately one-third of input cells are on per trial). The colored lines are, for each measure, the extent to which the model differs from the data, expressed in units of the model's SD (calculated over 100 instantiations of the model). Due to the impact of noise parameters discussed above, at each point in this graph the noise parameters were fit to ensure the model was within ±1.5 SDs of the data FF_{T} (this generally meant that the model FF_{T} varied from ∼2.8 to 2.9).

With an increasing threshold, the RV (Fig. 5*B*, green line) increases. This is because higher thresholds mean cells respond to only a few combinations of input, rather than responding similarly to many, and the RV is a measure of variability in response across conditions (note that while RV appears to peak at a threshold of ≈0.35 and then decrease, this particular trend is driven by an increase in the RV SD; the mean continues to increase). The percentage of cells with mixed selectivity (red line) also increases with threshold. With a higher threshold, most conditions give input to the cell that lies in the lower portion of the sigmoidal function (Fig. 4*A*, bottom). The nonlinearity is strong here—with some input producing little to no response—thus, more cells can attain nonlinear mixed selectivity. Pure selectivity also increases with threshold, and the percentage of cells with pure selectivity quickly reaches 100% (while the SD of the model gets increasingly small). We go into more detail about the dependence of selectivity on threshold later.

The clustering value relies on cells having preference for task-variable identities and so increases as selectivity increases initially. However, just having selectivity is not enough to form clusters, and so the clustering value in the model levels off below the data value even as the number of cells with pure selectivity reaches full capacity. Thus, with the exception of the clustering value, the model can reach the values found in the data by using different thresholds. As Figure 5*B* shows, however, at no value of the threshold are all measures of PFC response in the model simultaneously aligned with those in the data.

Figure 5*C* shows how the same measures change when the width of the weight distribution from input to PFC cells is varied. Here, the SD of the distribution from which connection strengths are drawn (σ_{W}) is given as a factor of the mean weight, μ_{W}. Increasing this value increases pure and mixed selectivity as well as RV. Because a wider weight distribution increases the chances of a very strong weight existing from an input cell to an output cell, it makes it easier for selectivity to emerge (that is, the output cell's response will be strongly impacted by the task-variable identity the input cell represents). The RV increase occurs for similar reasons: a cell may have uneven responses across conditions due to strong inputs from single-input cells. Clustering values, however, are unaffected by this parameter. At no point, then, can the model recreate all aspects of the data by varying the weight distribution. Furthermore, while values of mixed selectivity and RV approach the data values with large σ_{W}/μ_{W}, such large values are likely unrealistic. Data show that a σ_{W}/μ_{W} ratio of ∼1 is consistent with observations of synaptic strengths from several brain areas (Barbour et al., 2007).

Varying other parameters, such as the mean weight, number of cells per population, and connection probability, similarly does not allow the model to capture all properties of the data (data not shown).

Figure 5*D* shows the values of the model compared with the data for the set of parameters marked with arrows in Figure 5*B*,*C*. These parameters were chosen because they were capable of capturing the amount of pure selectivity in the model (any lower value of the threshold would lead to too few cells with pure selectivity, for example). On the left are the percentage of cells with different selectivities as in Figure 3*C*. The bars are the data and the lines are the model. On the right are histograms of model values from 100 instantiations, with the red markers showing the data values. The model matches the average firing rate and FF_{T} of the data, as it was fit to do so. Clustering, RV, and the amount of mixed selectivity are too low in the model. We use these parameters as the starting point for learning in this model.

### Circuit model with Hebbian learning captures PFC responses

As described above, responses of PFC cells have a set of qualities that cannot be explained by random connectivity. In particular, the inability of the random network to simultaneously capture the values of RV, clustering, pure selectivity, and mixed selectivity shows that PFC cells have a balance of specialization that may require learning to achieve. Here, we tested two variants of Hebbian learning to determine whether a network endowed with synaptic plasticity can capture the elements of the data that the random network could not. The simple form of Hebbian learning that we use is based on the idea that the input populations that randomly start out giving strong inputs to a cell would likely make that cell fire and thus have their weights increased.

In both variants of learning tested, each cell has the weights from a subset (*N _{L}*) of its input populations increased while the rest are decreased to keep overall input constant (this is done via a weight-increase step and a normalization step). Such balancing of Hebbian and homeostatic plasticity has been observed experimentally (Keck et al., 2017), particularly via the type of synaptic up and down regulation used here (Lo and Poo, 1991; Scanziani et al., 1996; Chistiakova and Volgushev, 2009; Bourne and Harris, 2011). Therefore, it is plausible for an individual neuron to be able to implement such changes across its synapses.

The difference between our two variants of learning comes from which input populations are increased. In general, the top *N _{L}* input populations from which the cell already receives the most input have their weights increased (to capture the “rich get richer” nature of Hebbian learning). In the “constrained” variant, however, weight increases onto a PFC cell are restricted to populations of input cells that come from different task variables (e.g., C1 and C2; see Materials and Methods). This was done to ensure that cells had enough variety of inputs to create mixed selectivity. In the “free” variant, the populations from which a cell receives increased input due to learning are unrestricted. That is, they are determined only by the amount of input that the cell originally received from each population as a result of the random connectivity. This unrestricted form of learning is more biologically plausible as it can be implemented in a way that is local to the postsynaptic neuron, without knowledge of the identity of the upstream inputs. A toy example of each variant can be found in Figure 4*B*. In this example, free learning and constrained learning select different input populations to be enhanced. However, given random weights, free and constrained learning will select the same input populations in some cells.
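One step of the rule described above can be sketched for a single PFC cell. The population names, weights, and learning rate below are illustrative (the rate of 0.2 matches the value quoted for the weight-matrix figure, but the rest is invented), and the update is expressed at the level of summed weight per input population.

```python
# Sketch of one Hebbian step: a PFC cell boosts the weights from its top N_L
# input populations, then renormalizes so its total input weight is unchanged
# (the homeostatic step). "Free" learning picks whichever populations currently
# provide the most input; "constrained" learning forces the winners to come
# from different task variables. Names and values are illustrative.

def hebbian_step(w_by_pop, n_l, rate=0.2, pop_to_var=None, constrained=False):
    """w_by_pop: {input population: summed weight onto this PFC cell}."""
    total = sum(w_by_pop.values())
    ranked = sorted(w_by_pop, key=w_by_pop.get, reverse=True)
    chosen, used_vars = [], set()
    for pop in ranked:
        if constrained and pop_to_var[pop] in used_vars:
            continue  # constrained: at most one winning population per variable
        chosen.append(pop)
        if constrained:
            used_vars.add(pop_to_var[pop])
        if len(chosen) == n_l:
            break
    boosted = {p: w * (1 + rate if p in chosen else 1)
               for p, w in w_by_pop.items()}
    scale = total / sum(boosted.values())     # homeostatic normalization
    return {p: w * scale for p, w in boosted.items()}

pop_to_var = {"TT_recall": "TT", "TT_recog": "TT", "C1_a": "C1", "C1_b": "C1"}
w0 = {"TT_recall": 3.0, "TT_recog": 2.5, "C1_a": 2.0, "C1_b": 0.5}

w_free = hebbian_step(w0, n_l=2)   # boosts the two strongest (both TT) pops
w_con = hebbian_step(w0, n_l=2, pop_to_var=pop_to_var, constrained=True)
```

With these example weights the free variant boosts both TT populations, while the constrained variant skips the second TT population and boosts the strongest C1 population instead; in both cases the total input weight is conserved.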

Figure 4*C* shows how the weight matrix changes with different *N _{L}* values (the number of populations from which weights are increased during learning). Eventually, the learning leads to a steady state in which each PFC cell receives input only from cells in the top *N _{L}* populations. The higher the *N _{L}*, the faster the matrix converges to its final state. When *N _{L}* is low, convergence takes longer as all the weight is transferred to a small number of cells. This plot is shown with a learning rate of 0.2.

The results of both forms of learning are shown in Figure 6*A*. The effects of learning are dependent on *N _{L}*, and different *N _{L}* values are in different colors (*N _{L}* = 1, 2, 3 are tested here). Free learning is shown with solid lines, and constrained with dotted lines, except for the case of *N _{L}* = 1, where free and constrained learning do not differ and only one line is shown. In each plot, the data value is shown as a small black dotted line.

Clustering, mixed selectivity, and RV all increase with learning, for any value of *N _{L}* and both learning variants. When *N _{L}* = 1 (green line), mixed selectivity peaks and then plateaus at a lower value (as connections to all but one population are pruned), while other values of *N _{L}* plateau at their highest values. As it was designed to do, constrained learning is very effective at increasing mixed selectivity, eventually reaching nearly 100% of cells. Free learning produces more modest increases in mixed selectivity, with *N _{L}* = 2 leading to slightly larger increases than *N _{L}* = 3. Before learning, the model matches the data's balance of supralinear and sublinear interaction effects (ratio of positive to negative terms: 1.100 ± 0.048), and learning does not affect this balance (1.095 ± 0.053; SDs over 20 random instantiations).

A factor affecting selectivity in this model—and especially with this task structure—is that cells that receive inputs from multiple populations from a single task variable may not end up having significant selectivity to that variable. This is especially true for the “TT” variable, as cells can easily end up with input from both “Recall” and “Recognition” populations. If the inputs from these populations are somewhat similar in strength, the cell does not respond preferentially to either. This helps explain the discrepancy in how pure selectivity changes with free and constrained learning. In constrained learning, pure selectivity necessarily increases with learning (to the point where nearly all networks have 100% pure selectivity), whereas free learning can have inputs that effectively cancel each other out. A more direct investigation of how selectivity and other properties change with learning comes with the analysis of our toy model in the next two sections.

In these plots, both noise parameters are fixed, which allows us to see how FF_{T} varies with learning (this is also why the values at step 0 in Fig. 6*A* do not always match those shown in Fig. 5, as that model has noise parameters fit to match the data). The changes in FF_{T} stem both from changes in robustness to the additive noise and from changes in the mean responses, which affect FF_{T} via the multiplicative noise term. Figure 6*A* shows that the variant of learning has less of an impact on FF_{T} than *N _{L}* does. In all cases, however, learning ultimately leads to lower trial variability in the model. This is consistent with observations made in the PFC during training (Qi and Constantinidis, 2012).

Overall, low *N _{L}* leads to more concentrated weights and stronger structure and selectivity in the model. Constrained learning, with its guarantee of enhancing weights from different task variables, is also more efficient at enhancing structure and selectivity. The PFC data show a moderate level of structure and selectivity; therefore, the approach best able to capture it is free learning with *N _{L}* = 3. In Figure 6*B*, we show how all model values compare to the data as this form of learning progresses. These plots, similar to those in Figure 5*B*,*C*, show values in units of model SDs away from the data. It is clear from these plots that this form of learning leads all values in the model closer to those of the data. The best fit to the data comes after six learning steps with a learning rate of 0.2 (black arrow). At this point the ratio of the SD to the mean of the weight distribution has only slightly increased, remaining within a biologically plausible range. While the best fit to the data comes before the model reaches its steady state, all values still eventually plateau to within ±2.5 model SDs of the data. Furthermore, there are many reasons why the PFC may not reach steady state; for example, once the animal's performance plateaus, learning may slow (Glimcher, 2011). Also, other uses of the PFC may interfere with learning and prevent the circuit from overfitting to this particular task. A detailed exploration of these mechanisms is beyond the scope of this study.

We plot the values of the data compared with the best-fit model in Figure 6*C*, similarly to Figure 5*D*. At this point, the average percentage of cells with only pure selectivity is 25.4 ± 4.2, with only mixed 4.4 ± 2.2, and with no selectivity 15.9 ± 4.1 (the comparable data values are ≈36, 1, and 13%, respectively). Thus, the model with learning is a much better fit to the data than the purely random network.

In addition to matching the measured properties of the PFC representation, we also tested whether learning makes the neural representation more conducive to decoding. To do this for task information, we trained linear classifiers to read out the task inputs (i.e., the identities of TT, C1, and C2 separately) as well as higher-order terms (i.e., the combined identities of TT–C1, TT–C2, C1–C2, and TT–C1–C2). As expected from a higher-dimensional representation, decoding performance is better in the population after learning for both linear and higher-order terms (Fig. 6*D*, left). Postlearning accuracy for linear terms is 83.2% and for the higher-order terms 70.5% (the respective values for constrained learning after the same number of steps are 88.2 and 83%, data not shown). We also used a same–different task to demonstrate how the representation after learning allows for better performance on a nonlinearly separable problem. Here, all combinations of C1 and C2 identities were generated as inputs, and a linear classifier was trained to read out whether the identities of the two cues were the same or not. Trying to read this information out from the input population is not very successful as these cells only have pure selectivity (Fig. 6*D*, right). Random connectivity is sufficient to expand the dimensionality of the neural representations and to solve nonlinearly separable problems. However, the model PFC population generated from random connectivity performs poorly because the low threshold that we determined by fitting the model to the data leads to low levels of mixed selectivity. After learning, the PFC population performs substantially better on this task.
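The same–different readout above is an XOR-like problem, and the role of mixed selectivity in it can be demonstrated with a minimal toy example (this is a perceptron demo of the underlying principle, not the classifier setup used in the study): a linear readout of purely selective inputs cannot solve it, but succeeds once a nonlinear conjunction feature is added.

```python
# Toy illustration of the same-different readout: a linear classifier cannot
# solve this XOR-like problem from purely selective inputs alone, but succeeds
# once a nonlinear mixed-selective feature (a conjunction unit) is available.
# Minimal perceptron demo, not the classifier setup used in the study.

def perceptron_accuracy(samples, epochs=50, lr=0.1):
    """samples: list of (feature_vector, label in {-1, +1})."""
    w = [0.0] * (len(samples[0][0]) + 1)        # last entry is the bias weight
    for _ in range(epochs):
        for x, y in samples:
            xb = list(x) + [1.0]
            pred = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
            if pred != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, xb)]
    correct = 0
    for x, y in samples:
        xb = list(x) + [1.0]
        pred = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
        correct += pred == y
    return correct / len(samples)

# Cue identities coded as +/-1; the label is +1 when the two cues match.
conds = [(c1, c2, 1 if c1 == c2 else -1) for c1 in (-1, 1) for c2 in (-1, 1)]
pure_only = [((c1, c2), y) for c1, c2, y in conds]
with_mixed = [((c1, c2, c1 * c2), y) for c1, c2, y in conds]  # conjunction unit

acc_pure = perceptron_accuracy(pure_only)
acc_mixed = perceptron_accuracy(with_mixed)
```

The expanded representation is perfectly decodable while the pure-selectivity representation never is, which is the geometric reason the post-learning model population outperforms both the input layer and the low-threshold random network on this task.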

### Understanding properties of selectivity before learning

We have shown that Hebbian learning can affect selectivity properties in a model of the PFC. Some of these impacts, particularly the increase in mixed selectivity, may seem counterintuitive. Here we use a further simplified toy neuron model to understand the properties of the network before learning and then demonstrate how learning causes these changes.

A schematic of this toy model is in Figure 7*A*, and it is described in the Materials and Methods. Briefly, the cell gets four total inputs: two (A and B) from each of two task variables (T1 and T2). The output of the cell is binary: if the weighted sum of the inputs is above the threshold, ϴ, the cell is active and otherwise it is not. As in the full model, ϴ is defined as a fraction, λ, of the sum of the input weights.

This format makes it easy to spot nonlinear mixed selectivity: if the cell is active (or inactive) for exactly one of the four conditions, it has nonlinear mixed selectivity to the combination of T1–T2. If the cell's output can be determined by the identity of only one task variable, it has pure selectivity (and would be active for two of the four conditions). Otherwise it has no selectivity (active or inactive for all conditions; Fig. 2*A*,*B*).
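This classification rule can be written out explicitly. The weight and λ values below are arbitrary examples; the classification follows the scheme just described (one active or inactive condition out of four implies mixed selectivity; a response determined by a single task variable implies pure selectivity).

```python
# The toy neuron made explicit: one input (A or B) per task variable is active
# in each of the four conditions, and the output is 1 when the summed weight of
# the active inputs exceeds theta = lambda * (sum of all weights).
# Weight and lambda values are arbitrary examples.

def toy_responses(w1a, w1b, w2a, w2b, lam):
    theta = lam * (w1a + w1b + w2a + w2b)
    return {(i, j): (wi + wj > theta)
            for i, wi in (("A", w1a), ("B", w1b))
            for j, wj in (("A", w2a), ("B", w2b))}

def selectivity_type(conds):
    n_active = sum(conds.values())
    if n_active in (0, 4):
        return "none"
    if n_active in (1, 3):
        return "mixed"   # exactly one condition differs: nonlinear mixed
    # two active conditions: pure if the output follows a single task variable
    t1_rule = all(conds[("A", j)] and not conds[("B", j)] for j in "AB") or \
              all(conds[("B", j)] and not conds[("A", j)] for j in "AB")
    t2_rule = all(conds[(i, "A")] and not conds[(i, "B")] for i in "AB") or \
              all(conds[(i, "B")] and not conds[(i, "A")] for i in "AB")
    return "pure" if (t1_rule or t2_rule) else "mixed"

# Sweeping the threshold fraction moves one example cell through the regimes.
types = {lam: selectivity_type(toy_responses(1.0, 0.25, 0.75, 0.5, lam))
         for lam in (0.25, 0.5, 0.65)}
```

For these weights, a low threshold leaves the cell active in all conditions (no selectivity), an intermediate one makes its response track task variable 1 (pure), and a high one isolates the single strongest condition (mixed), anticipating the threshold dependence discussed next.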

Learning affects selectivity by altering the way a cell represents these four conditions. To say more about how this occurs, we must first describe the properties of the representation in the random network before learning.

To be robust to noise, the cell's response should be constant across trials within a condition. Additive noise can be thought of as a shift in the threshold, which may lead to a change in the cell's response. Thus, trialwise additive noise drawn from a distribution centered on zero can be thought of as a range of effective thresholds centered on the original one [Fig. 8*A*, black dotted line (threshold without noise), gray shaded area (range of effective thresholds due to noise)]. If the inputs for a given condition fall in this range, the response of the cell will be noisy (i.e., flipping from trial to trial) and selectivity will be lost because the cell's activity will not be a reliable indicator of the condition. Robustness to noise, then, can be measured as the range of thresholds a representation can sustain without any responses flipped, with a larger range implying higher noise robustness (if noise is drawn from a Gaussian distribution, the noise range can represent thresholds within 2 SDs, for example, implying that a cell is robust to noise as long as its response is consistent on 95% of trials).

Assuming optimal threshold values (i.e., those with highest noise robustness) for each type of selectivity, the relative noise robustness of mixed and pure selectivity can be calculated (see Materials and Methods). We find that, thinking of the four conditions as the corners of a rectangle (Fig. 2*C*), mixed selectivity robustness depends on the length of the shorter side, while pure selectivity noise robustness depends on the difference between the two side lengths. We also find that, with random weights, most cells will have a representation that has higher noise robustness for pure selectivity than for mixed selectivity (see Materials and Methods).
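The rectangle picture can be checked numerically. In the sketch below (which restates the Materials and Methods result rather than re-deriving it, with example weights chosen so that input A is the stronger one for both task variables), the four conditions' summed inputs sort as AA > {AB, BA} > BB, and the optimal-threshold margins fall out of that ordering.

```python
# Numerical check of the geometric picture: with A > B within each task
# variable, isolating the top corner (mixed selectivity at its best threshold)
# leaves a margin equal to the shorter rectangle side, min(dx, dy); separating
# T1 = A from T1 = B (pure selectivity) leaves a margin of |dx - dy|.

def margins(w1a, w1b, w2a, w2b):
    sums = {"AA": w1a + w2a, "AB": w1a + w2b,
            "BA": w1b + w2a, "BB": w1b + w2b}
    ordered = sorted(sums.values(), reverse=True)
    mixed_margin = ordered[0] - ordered[1]   # room to isolate the top corner
    pure_margin = sums["AB"] - sums["BA"]    # room to split by task variable 1
    return mixed_margin, abs(pure_margin)

w1a, w1b, w2a, w2b = 1.0, 0.25, 0.75, 0.5    # side lengths dx = 0.75, dy = 0.25
dx, dy = w1a - w1b, w2a - w2b
mixed_margin, pure_margin = margins(w1a, w1b, w2a, w2b)
```

For random weights, |dx − dy| typically exceeds min(dx, dy), which is the algebraic form of the claim that most cells start out more noise robust for pure than for mixed selectivity.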

Noise robustness changes, however, as thresholds deviate from optimal. The type of selectivity cells have in the absence of noise also varies with threshold in a related way. For example, using a low threshold may result in more cells with mixed selectivity and/or cells with pure selectivity that have low noise robustness (Fig. 2*D*). To quantify these trends, we varied the threshold parameter λ and determined both the probability of different types of selectivity as well as the noise robustness for each type (see Materials and Methods). In Figure 7*B*, we show for three different values of λ the fraction of cells that lose selectivity at a given noise level. Noise robustness (plotted as a function of λ in Fig. 7*C*) is defined then as a normalized measure of the noise value that causes 50% of cells to lose selectivity.

Figure 7*C* demonstrates why the random network from which we start learning is necessarily in a condition of low mixed selectivity. Specifically, the value of λ we choose to use is constrained by the fact that the data show high levels of pure selectivity. Therefore, we need a value that leads to both a high probability of pure selectivity and high noise robustness for it (especially because, as we will show, pure selectivity is unlikely to increase much with learning). Values of λ that meet this condition are not favorable for mixed selectivity. The fact that mixed selectivity is less noise robust than pure in the full model can be seen in Figure 5*A*.

Note that while the λ used for the random version of the full model shown in Figure 5*D* was ∼0.27, that value is not directly comparable to the λ values in these plots for many reasons. First, the full model has three task variables, compared with the two used in the toy model. This means that, from the perspective of mixed selectivity for two task variables, a given λ value will create a higher ϴ in the full model with three task variables than in the toy one that has only two (because ϴ is a function of the sum total of all weights, not just those relevant for the two-way selectivity). In addition, in the toy model, 50% of the inputs are on for any given condition, whereas the nature of the task in the full model means that only 25% of inputs are on when looking at C1 × C2 mixed selectivity, while one-third are on for TT × C1, TT × C2, and TT × C1 × C2 mixed selectivity. The percentages of cells are also not directly comparable, as cells in the full model are labeled as pure if they have any of three different types of pure selectivity, and mixed if they have any of four different types of mixed. This toy model is thus meant to provide intuition only.

### How learning affects selectivity

For the reasons just discussed, the random model starts in a regime where pure selectivity has high noise robustness and mixed selectivity does not. To match the amount of mixed selectivity seen in the data, we must then rely on learning to increase noise robustness for mixed selectivity, allowing more mixed cells to move out of the noise range.

Learning affects noise robustness by expanding the representation of the different conditions. An example of this is in Figure 8*A*, where the gray shaded area represents the noise-induced range of the threshold. Before learning, the cell's response is affected by the noise. With learning, different conditions get pulled away from each other and the threshold, creating a much more favorable condition for mixed selectivity to be robust to noise. As can be seen, the responses are now outside the noise range.

For the same reason that learning increases noise robustness (because the expansion increases the range of thresholds that support mixed selectivity), it can also increase the probability of a cell having mixed selectivity in the absence of noise. This can be seen in Figure 8*C* (left), where learning steps are indicated by increasing color brightness (constrained learning with rate of 0.25). At lower λ values, cells that are initially above threshold for all conditions (no selectivity) gain mixed selectivity with learning. But for λ values that support higher levels of pure selectivity (e.g., λ = 0.4, marked with a black dotted line), the percentage of cells with mixed selectivity is not as affected by learning. The percentage of cells with pure selectivity increases only slightly at most λ values.

Noise robustness has a different pattern of changes with learning (Fig. 8*C*, right). In particular, at λ = 0.4, the noise robustness still increases with learning even when the percentage of cells with mixed selectivity does not change. Furthermore, when starting from a λ value that has unequal noise robustness for pure and mixed selectivities, if most cells with pure selectivity are already robust to a given noise value, an increase in noise robustness for pure would only have a moderate effect on the population levels of pure selectivity. Conversely, if most mixed cells have noise robustness less than the current noise value, an increase in that robustness could strongly affect the population. In the same vein, a decrease in robustness will affect the pure population more than the mixed population. Thus, changes in noise robustness seem to play a large role in the increase in mixed selectivity observed in the full model.

In particular, constrained learning with *N _{L}* = 2 always increases the lengths of both sides of the rectangle (as one weight from each task variable increases and the other decreases). As mentioned above, noise robustness for mixed selectivity scales with the length of the shorter side and so it necessarily increases with learning in this condition. Under certain weight conditions, noise robustness will also increase for cells with pure selectivity (Fig. 8*C*; see Materials and Methods).
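A sketch of this argument in code: applying the constrained *N _{L}* = 2 update (boost the stronger weight within each task variable, then renormalize) to example toy-model weights lengthens both rectangle sides, so the mixed-selectivity margin min(dx, dy) necessarily grows. The weights and learning rate are illustrative, with input A assumed stronger for both task variables.

```python
# Constrained N_L = 2 update on the toy neuron: the stronger weight within
# each task variable grows, the weaker one shrinks via renormalization, and
# the total input weight is conserved. Both rectangle sides (dx, dy) then
# lengthen, so the mixed-selectivity noise margin min(dx, dy) increases.
# Assumes input A is the stronger one within both task variables.

def constrained_step(w1a, w1b, w2a, w2b, rate=0.25):
    total = w1a + w1b + w2a + w2b
    w1a, w2a = w1a * (1 + rate), w2a * (1 + rate)   # boost one weight per variable
    scale = total / (w1a + w1b + w2a + w2b)         # homeostatic normalization
    return tuple(w * scale for w in (w1a, w1b, w2a, w2b))

w = (1.0, 0.25, 0.75, 0.5)
dx0, dy0 = w[0] - w[1], w[2] - w[3]
for _ in range(5):
    w = constrained_step(*w)
dx5, dy5 = w[0] - w[1], w[2] - w[3]
```

After a few steps both side lengths have grown while the total input weight is unchanged, which is the sense in which the conditions are "pulled apart" in Figure 8*A*.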

If *N _{L}* = 1, only one side length will increase and the other decrease. If the shorter side decreases, mixed selectivity noise robustness decreases. If the shorter side increases, mixed noise robustness increases, up until the point at which side lengths are equal. At that point the shorter side is now the decreasing side and mixed noise robustness goes down. This trend is reflected in the shape of the mixed selectivity changes seen with *N _{L}* = 1 in Figure 6*A* (mixed selectivity increases, then decreases).

When using free learning (with *N _{L}* = 2), a portion of the cells will by chance have the same changes as with constrained learning. The remaining cells cause the differences observed between the two versions of learning, and can be of two types. In the first type, the larger side length increases and the smaller shrinks, causing a decrease in mixed noise robustness. Free learning does not achieve the same levels of mixed selectivity as constrained learning because these cells continue to be too noisy. In the other type, the shorter side increases and the larger decreases, reducing the difference between the two side lengths and thus reducing pure noise robustness. Free learning loses pure selectivity as these cells become too noisy (Fig. 6*A*; see Materials and Methods).
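A first-type cell can be sketched as follows. The multiplicative update rule is an assumption (chosen so that potentiating two inputs from the same variable widens that variable's side while the other side shrinks); under free learning, the two strongest inputs are potentiated regardless of which task variable they come from.

```python
import numpy as np

# Assumed free-learning rule (N_L = 2, multiplicative updates): potentiate
# the two strongest inputs overall, depress the rest. When both winners come
# from the same task variable ("type 1" cells), the larger side grows and
# the smaller shrinks, so mixed-selectivity noise robustness falls.

def free_step(w, lr=0.03):
    w = w.copy()
    top2 = np.argsort(w)[-2:]      # two strongest inputs overall
    grow = np.zeros_like(w, dtype=bool)
    grow[top2] = True
    w[grow] *= (1.0 + lr)          # potentiate
    w[~grow] *= (1.0 - lr)         # depress
    return w

def side_lengths(w):
    return abs(w[0] - w[1]), abs(w[2] - w[3])

w = np.array([0.9, 0.7, 0.6, 0.4])  # top two inputs both from T1
before = min(side_lengths(w))        # mixed robustness ~ shorter side
for _ in range(20):
    w = free_step(w)
after = min(side_lengths(w))
```

Here `after < before`: the shorter (T2) side shrinks with learning, so this cell's mixed selectivity remains vulnerable to noise, consistent with free learning reaching lower population levels of mixed selectivity than constrained learning.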

Inputs from additional task variables can be thought of as a source of noise as well. In Figure 8*B*, we add a third task variable to the toy model. Now, in the case of the T1B–T2A condition, the identity of T3 determines whether the cell is active or not. From the perspective of T1–T2 mixed selectivity, this has the same impact as shifting the threshold, and thus creates noise. If both T3 inputs are weaker than the strongest two inputs from T1 and T2 (as they are here), they will decrease with learning. This means that not only are different T1–T2 conditions pulled apart with learning, but responses to the same T1–T2 condition also become closer together. This reduces the impact of “noise” from other task variables, and explains why mixed selectivity increases more with *N _{L}* = 2 than with *N _{L}* = 3 (Fig. 6*A*).
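The shrinking of T3-induced “noise” can be sketched numerically. The update rule is again an assumed simplification (potentiate the strongest input from T1 and from T2, depress everything else multiplicatively); the spread between responses to the same T1–T2 condition under the two T3 identities is simply the gap between the two T3 weights.

```python
import numpy as np

# Toy cell with three task variables: w = [T1A, T1B, T2A, T2B, T3A, T3B].
# Assumed N_L = 2 rule: potentiate the strongest input from T1 and from T2,
# depress all other weights multiplicatively. Parameters are illustrative.

def step(w, lr=0.05, decay=0.03):
    w = w.copy()
    keep = {int(np.argmax(w[:2])), 2 + int(np.argmax(w[2:4]))}
    for i in range(6):
        if i in keep:
            w[i] += lr
        else:
            w[i] *= (1.0 - decay)
    return w

def t3_spread(w):
    # Gap between responses to the same T1-T2 condition under T3A vs T3B.
    return abs(w[4] - w[5])

w = np.array([0.8, 0.5, 0.7, 0.4, 0.3, 0.2])  # T3 inputs are the weakest
before = t3_spread(w)
for _ in range(30):
    w = step(w)
after = t3_spread(w)
```

Both T3 weights decay, so `after < before`: responses to the same T1–T2 condition draw together even as different T1–T2 conditions are pulled apart.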

In sum, learning changes a cell's representation of the task conditions. Depending on the threshold value, this can create changes in the probability of mixed and pure selectivity and the relative noise robustness for each. Here, to match the high levels of pure selectivity seen in the data, we use a threshold regime where mixed selectivity noise robustness increases with learning. This causes a gain in the number of cells with mixed selectivity, such that it reaches the level seen in the data.

### How learning affects other properties

This toy model helps us understand why other properties change with learning as well. RV, for example, increases with learning (Fig. 6*A*). The expansion that comes with learning places different conditions at different distances from the threshold. With a sigmoidal nonlinearity, this would translate to more variance in the responses across conditions, increasing RV. Because constrained learning ensures the most expansion, it increases RV more. These increases depend on *N _{L}* because a lower *N _{L}* allows for a more extreme skewing of weights, such that a subset of conditions will be far above threshold while the rest are below (leading to a high RV). RV has a limit, however, because even with *N _{L}* = 1, the cell would still respond equally to a quarter of the conditions (assuming an input from a cue variable).
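The effect of expansion on RV can be verified in a few lines. The sigmoid gain, threshold, and the before/after weight vectors below are illustrative choices, not the paper's fitted parameters; the "after" weights mimic constrained learning having skewed each input pair apart.

```python
import numpy as np

# RV sketch: variance of a cell's responses across the four T1-T2 conditions,
# passed through an assumed sigmoidal nonlinearity. Expansion of the weight
# representation spreads conditions around threshold and raises the variance.

def responses(w, theta=1.0, gain=4.0):
    # Summed input for each of the 4 conditions, through a sigmoid.
    inputs = np.array([w[i] + w[j] for i in (0, 1) for j in (2, 3)])
    return 1.0 / (1.0 + np.exp(-gain * (inputs - theta)))

def rv(w):
    return float(np.var(responses(w)))

w_before = np.array([0.55, 0.45, 0.52, 0.48])  # compressed representation
w_after = np.array([0.95, 0.05, 0.92, 0.08])   # expanded, as after learning
```

With the expanded weights, some conditions sit far above threshold and others far below, so `rv(w_after)` exceeds `rv(w_before)`.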

Clustering values are also affected by how selectivity changes. Clustering in the data appears to be driven by TT selectivity (Fig. 3*C*), and as TT preferences develop in the model, the clustering value increases. Here, the relative sizes of the input populations play a role. Because the input populations that represent TT contain more cells (Fig. 4*A*), these populations are more likely to be among the strongest inputs to a cell, and thus have their weights increased (note that this bias in favor of TT could also arise from the fact that only two TTs are possible, and thus these inputs are on twice as often as cue inputs; such a mechanism cannot be implemented in this model, however, so we use uneven numbers of input cells). Therefore, TT selectivity becomes common and clusters form around the axis representing the first regressor (which captures TT preference). This effect is weaker with free learning because both TT populations may have their weights increased, which diminishes the strength of TT preference. Lower *N _{L}*, which minimizes preferences to other task-variable identities, allows these clusters to be tighter.
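The population-size bias can be checked with a quick simulation. The population sizes and the uniform weight distribution below are hypothetical; the point is only that a larger input population tends to have the larger summed input, and so is the one Hebbian learning strengthens.

```python
import numpy as np

# Sketch of the population-size bias: with more cells in the TT input
# population than in a cue population (sizes are illustrative), the TT
# summed input is the cell's strongest input on almost every draw.

rng = np.random.default_rng(1)

def frac_tt_strongest(n_tt=40, n_cue=20, trials=2000):
    wins = 0
    for _ in range(trials):
        tt_sum = rng.uniform(0, 1, n_tt).sum()    # summed TT input
        cue_sum = rng.uniform(0, 1, n_cue).sum()  # summed cue input
        wins += tt_sum > cue_sum
    return wins / trials

p = frac_tt_strongest()
```

With these sizes, `p` is close to 1, so TT weights are the ones most often potentiated and TT selectivity becomes common, consistent with the clustering observed along the TT-preference axis.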

Finally, it is important to note that the strengths of the inputs shown in Figures 2 and 8 (colored arrows) correspond, in the full model, to the summed input from all cells representing a given task-variable identity (i.e., *I _{i}^{p}*), not to the weights of individual cells. These summed values are what need to change in order to expand the representation and produce the observed changes. This is important for understanding why the Hebbian procedure described here is effective at changing selectivity: many cells, acting in unison to cause postsynaptic activity, each have their individual synaptic weights increased, and thus the sum of those weights increases. Merely increasing the variance of the individual weights does not produce such a coordinated effect and would be less effective at driving these changes (Fig. 5*C*), especially with larger input population sizes.
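The scaling argument behind this point can be made concrete. With *n* weights from one input population, coordinated Hebbian potentiation shifts the summed input by roughly *n* × (learning rate), whereas adding independent zero-mean jitter to individual weights shifts the sum only on the order of √*n* × (jitter SD). The numbers below are illustrative.

```python
import numpy as np

# Summed-input sketch: coordinated potentiation of all n weights from a
# co-active input population vs. independent jitter of the same weights.
# Sizes and rates are illustrative, not the paper's fitted parameters.

rng = np.random.default_rng(0)
n, lr, sigma = 200, 0.01, 0.01
w = rng.uniform(0.0, 0.1, size=n)  # weights from one input population

# Hebbian case: every co-active weight moves up together -> sum shifts by n*lr.
hebbian_shift = np.sum(w + lr) - np.sum(w)

# Variance case: independent zero-mean jitter -> sum shifts by ~sqrt(n)*sigma.
variance_shift = abs(np.sum(w + rng.normal(0.0, sigma, n)) - np.sum(w))
```

The coordinated shift (here 2.0) dwarfs the typical jitter-induced shift (on the order of 0.14), and this gap widens with input population size, which is why increasing weight variance alone is less effective at driving selectivity changes.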

## Discussion

Here, motivated by several theoretical proposals about properties that would benefit encoding, we explored how the PFC represents task variables during a complex task. In particular, we were interested in measures of selectivity (particularly nonlinear mixed selectivity), response density, and clustering of cell types according to preferences. By quantifying and measuring these properties in a PFC dataset, this work connects theoretical literature with experimental data to give insight into how the PFC is able to support complex and flexible behavior. Furthermore, we explored how these response properties could be generated by a simple network model. Through this, we find evidence that the particular level of specialization and structure in the PFC response is not readily achievable in a random network without Hebbian learning. After Hebbian learning, the model—despite its relative simplicity—is able to capture many response properties of the PFC. The changes that come with learning act via an expansion of the way cells represent conditions, and corresponding changes in noise robustness.

Interestingly, the variant of Hebbian learning that best matches the data is not the most effective at increasing mixed selectivity. It may be that the more effective method (“constrained” learning) would be too difficult to implement biologically, but perhaps there is also a computational benefit to the balance of mixed and pure selectivity found in the data. Particularly, preventing high levels of selectivity to this particular task may allow the network to retain flexibility.

In addition to retrospectively matching experimental results, this model also makes predictions regarding how certain values should change with training. In particular, clusters of cells defined by selectivity are expected to emerge with training, and cell responses should become less dense across conditions. Previous work (Rigotti et al., 2013) has shown the value of mixed selectivity for the ability of a population to perform complex tasks. The present work shows that mixed selectivity increases with learning, and these changes in the PFC may correspond to increases in performance (Pasupathy and Miller, 2005), as learning in our model leads to increases in performance on classification tasks. Perhaps surprisingly, this model also predicts a concurrent, though small, decrease in pure selectivity. Indeed, studies that have tracked PFC responses during training show signs of these changes. For example, in Meyer et al. (2011), the amount of pure selectivity was measured directly before and after training, and a significant drop in the percentage of cells with pure selectivity was indeed observed. Furthermore, in the hippocampus, an increase in mixed selectivity and a slight decrease in pure selectivity were also observed with learning (Komorowski et al., 2009). Meyers et al. (2012) showed that the ability to read out match/nonmatch of two input stimuli from the population increases dramatically with learning, suggesting an increase in mixed selectivity. However, the ability to decode the identity of the stimuli (in the comparable portion of the trial) decreases slightly after training, which would be at odds with our linear classification results.

Our model makes many simplifying assumptions. The inputs, for instance, are binary cells that encode only the identity of different task variables. While this implies that the cells representing cue identities already have mixed selectivity (responding to the combination of the image and its place as either C1 or C2), it is still an assumption that the cells providing input to the PFC are otherwise unmixed. This is something that, given current experimental evidence, seems plausible (Pagan et al., 2013), but would benefit from further experimental exploration.

It may seem possible that adding more layers to the network would be a way to get the model to match the data without the need to introduce learning. This, however, is unlikely. For one, the data have high levels of pure selectivity, which would be difficult to maintain through layers of random connections. Mixed selectivity, too, could decrease with layers, especially if each layer is noisy (which would be the realistic way to build such a model). It is also not obvious how such a model would achieve the clustering values observed in the data. Preliminary work on multilayer models supports these notions (data not shown). Also, such a model would not be able to address the changes with training discussed above. Finally, such a model would necessarily contain more parameters than a single-layer network, and that would need to be taken into account when comparing with our learning model, which only introduces two additional parameters (*N _{L}* and the amount of learning, defined by the combination of learning rate and number of steps).

Another valuable endeavor would be to expand this model in the temporal domain. Currently in the model, all the task-variable inputs are given to the network simultaneously. In the experiment, of course, there is a delay between C1 and C2. Delay activity is known to exist in such areas as the inferotemporal cortex (Fuster and Jervey, 1982; Woloszyn and Sheinberg, 2009), and so this information could be being fed into the PFC at the same time. But presumably, recurrent connections in the PFC, and even possibly between the PFC and its input areas, can enhance or alter selectivity. A recurrent model could also explore how PFC responses and representation vary over the time course of the trial, as recent experimental work has provided insight on this (Murray et al., 2017). Interestingly, recent work has demonstrated that Hebbian learning can be used to train recurrent neural networks on context-dependent tasks (Miconi, 2017).

## Footnotes

G.W.L. was supported by a Google PhD Fellowship and the National Institutes of Health (T32 NS064929). S.F. was supported by the Gatsby Charitable Foundation, the Simons Foundation, the Kavli Foundation, and the Grossman Foundation. E.K.M. was supported by the National Institute of Mental Health (NIMH R37MH087027) and the MIT Picower Institute Innovation Fund. We thank Brian Lau for supplying code that implements the tests of uniformity on the hypersphere.

The authors declare no competing financial interests.

- Correspondence should be addressed to Stefano Fusi at the above address. sf2237@columbia.edu