## Abstract

For many neural network models in which neurons are trained to classify inputs like perceptrons, the number of inputs that can be classified is limited by the connectivity of each neuron, even when the total number of neurons is very large. This poses the problem of how the biological brain can take advantage of its huge number of neurons given that the connectivity is sparse. One solution is to combine multiple perceptrons together, as in committee machines. The number of classifiable random patterns would then grow linearly with the number of perceptrons, even when each perceptron has limited connectivity. However, the problem is moved to the downstream readout neurons, which would need a number of connections as large as the number of perceptrons. Here we propose a different approach in which the readout is implemented by connecting multiple perceptrons in a recurrent attractor neural network. We prove analytically that the number of classifiable random patterns can grow unboundedly with the number of perceptrons, even when the connectivity of each perceptron remains finite. Most importantly, both the recurrent connectivity and the connectivity of downstream readouts also remain finite. Our study shows that feedforward neural classifiers with numerous long-range afferent connections can be replaced by recurrent networks with sparse long-range connectivity without sacrificing the classification performance. Our strategy could be used to design more general scalable network architectures with limited connectivity, which resemble more closely the brain neural circuits that are dominated by recurrent connectivity.

**SIGNIFICANCE STATEMENT** The mammalian brain has a huge number of neurons, but the connectivity is rather sparse. This observation seems to contrast with the theoretical studies showing that for many neural network models the performance scales with the number of connections per neuron and not with the total number of neurons. To solve this dilemma, we propose a model in which a recurrent network reads out multiple neural classifiers. Its performance scales with the total number of neurons even when each neuron of the network has limited connectivity. Our study reveals an important role of recurrent connections in neural systems like the hippocampus, in which the computational limitations due to sparse long-range feedforward connectivity might be compensated by local recurrent connections.

## Introduction

The performance of a neural circuit is often evaluated by determining the number of input–output functions that can be implemented or, equivalently, by the number of inputs that can be classified correctly by the neural circuit. Theoretical studies on perceptrons (Rosenblatt, 1957) and recurrent neural circuits (Amit, 1992) have shown that typically the performance of a neural circuit scales with the number of synaptic connections that individual neurons receive, and not with the total number of synapses or with the total number of neurons (Roudi and Latham, 2007). This is clearly a problem in the biological brain in which the connectivity is sparse, especially when long-range connections are considered (Bullmore and Sporns, 2012). One striking example is the mammalian hippocampus (Drew et al., 2013). A typical pyramidal neuron in rodent CA3 receives only 50 synapses from the upstream area (Amaral et al., 1990), the dentate gyrus (DG), which contains around 10^{6} neurons. Not only is the connectivity sparse, but also the neural activity (Jung and McNaughton, 1993; Chawla et al., 2005), which seems paradoxical because a very limited connectivity could be compensated by denser neural activity.

One possible way to overcome the limitations of sparse connectivity is to adopt the strategy of “committee machines” (Nilsson, 1965), which are basically populations of classifiers. Each classifier is weak, as a perceptron with limited connectivity, but the output is generated by reading out a large number of weak classifiers and by combining them using a majority vote or some more sophisticated strategies. The final classification performance is significantly better than the one of each individual classifier, provided that the errors of the individual classifiers are sufficiently independent. The term committee machines goes back to the 1960s (Nilsson, 1965), but they have also been a focus of more recent studies (Parmanto et al., 1996; Bishop, 2007); basically, they are all based on strategies that in machine learning are known as ensemble methods or hypothesis boosting (Kearns M, unpublished observations; Zhou, 2012), strategies that are often adopted also in statistics (Rao and Subrahmaniam, 1971; Efron and Morris, 1973; Rubin and Weisberg, 1975; Green and Strawderman, 1991). Some of the examples include stacking (Wolpert, 1992; Breiman, 1996b), bagging (Breiman, 1996a), arcing (L. Breiman, unpublished observations), and AdaBoost (Adaptive Boosting; Freund et al., 1996; Freund and Schapire, 1997).

One class of committee machines is implemented using populations of neurons, each essentially behaving as a neural classifier, like a perceptron (Mitchison and Durbin, 1989; Monasson and Zecchina, 1995; Kwon and Oh, 1997; Urbanczik, 1997). Classifiers with limited connectivity are weak classifiers. It is possible to compute the classification capacity when each neural classifier has sparse connectivity (Kwon and Oh, 1997). The connections between the *N* input neurons and the *M* < *N* neural classifiers are assumed to be nonoverlapping (*N*/*M* connections per “perceptron”) and plastic. The final response of the committee machine is obtained by majority vote of the *M* neural classifiers, which can be easily implemented by introducing a readout neuron that is connected to all the neural classifiers with equal weights. The maximum number of correctly classified inputs is proportional to , whereas each neural classifier would not go beyond *N*/*M* inputs. This is a favorable scaling, and it is similar to the one obtained in other committee machines. However, one has to keep in mind that in these implementations the neural classifiers have sparse connectivity, but the readout neuron performing the majority vote should have a number of connections that scale with *N*.

Here we propose a network architecture that overcomes the restrictions imposed by the limited connectivity, as in the committee machines, but replaces the readout neuron that has extensive connectivity with a more biologically plausible recurrent network in which all of the neurons have a number of connections that remains finite when the number of classifiable patterns grows unboundedly. More specifically, we show that the number of random inputs that can be correctly classified scales linearly with the number of input neurons *N*, even when the number of connections per neural classifier *C _{F}* does not increase with

*N*. The number of neural classifiers

*M*is assumed to be proportional to

*N*.

Interestingly, under certain conditions the recurrent scheme has larger classification capacity than the majority vote scheme. This happens for sparse input representations, the regime that is relevant for the mammalian hippocampus and that we investigate in detail.

## Materials and Methods

The following sections describe the models in detail and cover all the analytical calculations. They can be skipped if one is not interested in the technical details because in the Results we reintroduce all the important concepts, although in a less detailed format.

### Fully connected readout

In this section, we derive the classification capacity of a single fully connected linear threshold readout, or perceptron (Fig. 1*a*), achieved with a simple learning rule that we use throughout this work. We assume that the input patterns and labels are random and uncorrelated, meaning that the activity of each input unit as well as the label for each pattern is chosen independently, which makes calculations analytically tractable. We use a simple Hebbian-like learning rule, which is not optimal and thus leads to a lower capacity than the 2*N* result in the study by Cover (1965). However, the scaling of the maximal number of learned input patterns *P*_{max} with the number of input units *N* is still linear, as is shown below.

#### Input statistics

We assume that pairs (ξ^{μ}, η^{μ}) of a pattern ξ^{μ} and a label η^{μ} are drawn from a random ensemble of *P* pairs (pattern, label). The pattern components ξ_{i}^{μ} on all *N* input units and labels η^{μ} are random independent variables. We assume that each component ξ_{i}^{μ} (*i* = 1 … *N* is the unit index and μ = 1 … *P* is the pattern index) is activated to 1 with probability *f*, called “coding level,” and otherwise is 0, and that the label η^{μ} takes one of the following two values: η^{μ} = +1 with probability *y*, called the “output sparseness,” and η^{μ} = −1, otherwise:
We have chosen different representations for the input and output variables for mathematical convenience. One can go from {1, 0} representation to {1, −1} and vice versa by changing the threshold of readout neurons.

#### Learning rule and the synaptic current

The linear threshold readout, or perceptron, classifies its inputs based on the sign of the weighted sum of the input components. This sum is sometimes called the “synaptic current,” as it is viewed as modeling the synaptic current into a biological neuron, as follows:
We say that the network has learned the association between *P* input patterns ξ→^{μ}, and *P* labels η^{μ} if for any pattern μ, as follows:
where θ is the threshold, which we further assume to be equal to zero.

Training the network means finding the set of weights *w _{i}* that satisfies the above expression for all

*P*patterns.

The Hebbian-like learning rule, which we use to train the weights {*w _{i}*} of the classifier is as follows:
In the case when patterns are equally likely to belong to either class (

*y*= ), the learning rule simplifies to the following: Here and in all that follows, we set the threshold θ to zero.

After training, the synaptic current in response to a test pattern ξ→^{ν} is as follows:
If ξ→^{ν} together with its label η^{ν} was part of the training set, we can split the sum over patterns into the contribution from the presented pattern ξ→^{ν} and the contribution from other learned patterns as follows:
Here we used (ξ_{i}^{ν})^{2} = ξ_{i}^{ν} because ξ_{i}^{ν} takes value 0 or 1.

We denote the number of active input units for the pattern ν by *n*^{ν}, as follows:
The variable *n*^{ν} is drawn from a binomial distribution of *N* trials with probability *f*, **B**(*N*, *f*), for each realization of the random patterns. Its expected value is determined by the number of input units *N* and the coding level *f*, as follows:
(Here and throughout this text, the angular brackets denote the mean over the realizations of the input patterns.)

We replace the sum in the square brackets of Equation 3.4 with , where we have introduced a noise random variable, *z*_{i}^{ν}, with zero mean and unit variance. The coefficient is concluded from the fact that each individual term (ξ_{i}^{μ} − *f*)(η^{μ} + 1 − 2*y*) has variance, as follows:
and the fact that the ξ_{i}^{μ} variables are mutually independent. By the central limit theorem, the noise variables *z*_{i}^{ν} can be approximated as Gaussian in the limit *P* → ∞ with finite *f*. The sum ∑_{i = 1}^{N}*z*_{i}^{ν}ξ_{i}^{ν} is also Gaussian, with the variance equal to *n*^{ν}, ∑_{i = 1}^{N}*z*_{i}^{ν}ξ_{i}^{ν} = , with *z*^{ν} being a Gaussian random variable with zero mean and unit variance.

In terms of *z*^{ν} and *n*^{ν}, the synaptic current is written as follows:
If a pattern belongs to either class with equal probability (*y* = 1/2), this expression simplifies to the following:
Note that the first term is the one that reflects the correct classification of the input pattern, and the second one represents the noise caused by the interference from other patterns that were learned by the perceptron. The important parameter is the ratio of the two, which is proportional to .

Integrating over the distribution of *z*^{ν} in the appropriate limits gives the probability of *h*^{ν} to have the same sign as η^{ν}. Requiring this probability to exceed 1 − ϵ, where ϵ is the tolerated error rate, leads to the capacity of a fully connected readout, as follows:
The capacity is extensive in (grows linearly with) the number of input neurons; however, this architecture requires the number of converging connections that also grows linearly with *N* (see Fig. 3).

### Committee machine

We now turn to deriving the classification capacity of a committee machine, the network shown on the Figure 1*b*, where each of *M* perceptrons receives feedforward connections from *C _{F}* input units. The connectivity

*C*does not scale when the number of input units

_{F}*N*increases.

The final decision is the majority vote of the classifiers. In other words, if classification is accurate:
Here *I* ∈ *I _{k}* stands for all the input units (there are

*C*of them) that are connected to the readout

_{F}*k*, and

*w*is the strength of the connection from the input unit

_{i}^{k}*i*to the readout

*k*(for the learning rule, we consider that

*w*does not depend on

_{i}^{k}*k*).

The synaptic current into the readout unit *k* when pattern ξ→^{ν} is presented is determined by the following:
The number of active inputs connected to the perceptron *k*, *n*_{k}^{ν} is drawn from the binomial distribution **B**(*C _{F}*,

*f*) of now

*C*trials with the success rate

_{F}*f*; and its expectation value is as follows: Since the number of connections per readout

*C*stays constant as the number of patterns

_{F}*P*and the size of the network (

*N*and

*M*) grow, the probability of a single perceptron to classify a pattern correctly approaches the chance level. Indeed, in contrast to the fully connected perceptron, the number of active inputs

*n*

_{k}

^{ν}per readout neuron does not change with the size of the network (Eq. 3.11). Hence, the first term of the expression (Eq. 3.10) decreases in the absolute value as the number of patterns

*P*grows, while the typical value of the second term stays the same. However, there is always a slight tendency toward the correct answer (〈

*h*

_{k}

^{ν}η

^{ν}〉 > 0), which can be used by having a growing number of sparsely connected classifiers that take a collective decision by majority vote. This scheme is known by the name of committee machine and has been shown to largely exceed the performance of a single classifier.

It is important to note that in order for the capacity of a committee machine to keep increasing as new classifiers (committee members) are added, the responses of different classifiers should stay sufficiently independent from each other. In the case of limited connectivity, which we consider here, the correlations automatically become smaller and smaller as we increase the number of input units. This happens because the probability of a typical pair of readout neurons to have a common input unit, and thus correlated responses, decreases. In order for the correlations not to be a limiting factor of the classification capacity, we need to increase the number of input units linearly with the number of perceptrons. If one introduces some other mechanism of reducing the correlations between the responses of the classifiers with common input units (e.g., making different perceptrons learn different sets of patterns), a sublinear scaling of the number of input units *N* with the number of perceptrons *M* might be sufficient.

#### Nonoverlapping case

The majority vote of *M* linear threshold classifiers is given by the average vote, as follows:
where *h*_{k}^{ν} is given in Equation 3.10. Positive *r*^{ν}η^{ν} means that the pattern ν is classified correctly.

The expectation value of *r*^{ν} follows from Equation 3.10 after integrating over the noise variable *z*_{k}^{ν}, which is approximated to be normally distributed. We make an assumption *Pf* ≫ *n*_{k}^{ν}, which is justified for a large number of patterns and allows us to use the approximation of the error function for small arguments to get the following:
The expectation value 〈〉 is computed over the binomial distribution **B**(*C _{F}*,

*f*) as follows: In the dense regime

*C*≫ 1, it can be approximated by the following: and, in the extremely sparse case, when

_{F}f*C*≫ 1 and only

_{F}f*n*

_{k}

^{ν}= {0, 1} are encountered substantially often, by the following: To proceed with deriving the classification capacity, let us start with independent classifiers first. The independence of the responses can be achieved either by forcing the connections to be nonoverlapping or by assuming an additional mechanism that, for example, causes different classifiers to update their incoming connections in response to different subsets of the input patterns.

In this case, *r*^{ν} can be thought of as drawn from a Gaussian distribution with the mean given by Equation 3.13 and the variance, as follows:
The Gaussian assumption is justified by the law of large numbers.

From here on we ignore the contributions of the subleading order, 𝒪(*P*^{−1}) in this case.

The probability *p*_{correct} to classify a pattern correctly (*r*^{ν}η^{ν} > 0) can then be easily computed.

Fixing the tolerated error rate ϵ and requiring *p*_{correct} > 1 − ϵ leads to the expression for the maximal number of input patterns that can be classified with the accuracy 1 − ϵ, as follows:
Here 〈n〉 denotes the average over binomial distribution, *n* ∼ **B**(*C _{F}*,

*f*).

This result only holds for the case of nonoverlapping connections or in the presence of a decorrelation mechanism. In the following section, we generalize it to random connectivity.

#### Correction to classification capacity due to overlap in the connections

To derive an analogous expression for the overlapping case without a decorrelation mechanism, we need to compute the variance, as follows:
of the average vote *r*^{ν}, defined by Equation 3.12, taking into account the correlations of individual votes *r*_{k}^{ν}.

We start by splitting the covariance into diagonal and nondiagonal contributions, as follows:
We assume that *M* and *N* scale linearly with *P* and *M*, *N*, *P* → ∞. The leading terms are thus of the order , and we ignore all the subleading contributions.

When the classifiers *k* and *l* share input units, the correlation between their responses is positive and is closely related to the correlation of the input currents *h*_{k}^{ν} and *h*_{l}^{ν} (see Eq. 3.10).

Let *n*_{kl}^{ν} be the number of input units that are connected to both the classifier *k* and the classifier *l* and are active in the pattern ξ→^{ν}. For a large number of input units *N* and finite connectivity *C _{F}*, we can assume that

*n*

_{kl}

^{ν}can be either 0 or 1, but not more. Including the terms corresponding to

*n*

_{kl}

^{ν}> 1 would lead to corrections that scale as 1/

*N*and become negligible in the limit for large

*N*. The probability of

*n*

_{kl}

^{ν}being 1 is given by the following: The number of active units that are connected to only one of the two classifiers are denoted by ñ

_{k}

^{ν}and ñ

_{l}

^{ν}, respectively. In the current approximation, both of them can be assumed to be distributed according to a binomial distribution

**B**(

*C*,

_{F}*f*).

Then, the currents can be written as follows (see Eq. 3.10):
where *z*_{k}^{ν}, *z*_{l}^{ν} and *z*_{k}^{ν}*l* are all independent Gaussian variables with zero mean and unit variance.

To compute the covariance:
we start by integrating over the variables *z*_{k}^{ν} and *z*_{l}^{ν} to get the following:
Then, Equation 3.22 can be evaluated using the following table integral (Geller and Ng, 1971, their Eq. 18, p. 158):
In the leading order, we get the following:
In the dense regime (*C _{F}f* ≫ 1), the expression for ϕ

_{CF,f}in Equation 3.25 can be approximated as follows: which leads to the following: while in the sparse approximation (

*C*≪ 1): and Plugging this result into Equation 3.20, we get for the variance of the majority vote

_{F}f*r*

^{ν}in the overlapping case, as follows: which together with Equation 3.13 leads to the maximal number of input patterns that the committee machine can learn to classify with the accuracy 1 − ϵ, as follows: Here ϕ

_{CF,f}is given in Equation 3.25 and is approximated by Equation 3.26 or 3.28.

If both the number of input units *N* and the number of classifiers *M* increase in proportion to each other, the capacity *P* increases linearly with *N* (or *M*).

In the case of dense representations *C _{F}f* ≫ 1, the last expression simplifies to the following:
and in the ultrasparse limit

*C*≪ 1 to the following:

_{F}f### Committee machine with recurrent connections

The majority rule scenario already overcomes the limitations of the connectivity of a single perceptron, but this is not the final answer to constructing a classifier with limited connectivity. The reason is that we still need to implement the majority rule and bring the classification signal to the level of a single unit. The naive way to do it would require another final readout that would have to sample the entire population of *M* intermediate layer perceptrons. Since *M* has to scale linearly with the number of learned patterns *P*, the connectivity of the final readout would also have to scale linearly with *P* (see Eq. 3.30) and would exceed any predetermined limit for a sufficiently large number of learned patterns.

To implement the majority vote of the intermediate perceptrons while keeping the connectivity of any unit in the network limited, we introduce the recurrent connectivity in the layer of perceptrons. Our goal is to have two attractor states of the intermediate layer dynamics that correspond to the two classes. The feedforward input through the connections {*w _{i}^{k}*}, trained in the same way as before, will be slightly biased in the positive direction for one class of the input patterns and in the negative for the other. This slight bias determines which attractor state the network will choose. It is essential that the attractors are far away from each other and do not become closer when the number of learned patterns

*P*increases. This implies that the final readout will be able to discriminate between these states, and thus indicates the class of the presented pattern, even if its connectivity does not scale with

*P*. It turns out that for binary classification it is enough to have random recurrent connectivity with sufficiently large but not increasing with

*P*number of connections per unit. The weights of these recurrent connections do not have to be tuned (no learning required for recurrent connections).

We compute the probability of the network of recurrently connected readouts to go to the correct attractor (the one assigned to the class of the input pattern presented) as a function of the number of input units *N*, the number of perceptrons *M*, and various parameters of the recurrently connected network of perceptrons.

#### Network topology

The recurrent readout network shown on the right of Figure 1*c* consists of the input layer (green), the intermediate layer of perceptrons (orange), and the final readout unit (purple).

As before, the input layer of *N* neurons is presented with a random and uncorrelated pattern (ξ_{i}^{ν})_{i=1 … N} from a set of *P* patterns (ξ→^{μ})_{μ = 1 … P} that the network has learned to classify.

The layer of perceptrons we now call the intermediate layer. It consists of *M* linear threshold readouts, each of which is connected to a randomly chosen *C _{F}* of

*N*input units. Hence, the feedforward connectivity

*C*is the number of feedforward inputs that each perceptron receives. The

_{F}*C*is an important parameter in the problem as it determines the classification capacity of a perceptron considered in isolation. The intermediate layer is recurrently connected. For the case of binary classification, the probability that two units are connected is the same for each pair. The recurrent connections are not plastic and can be chosen to be all of equal strength α.

_{F}The recurrent connectivity matrix *J _{kl}*,

*k*,

*l*∈ [1 …

*M*] is constructed randomly, as follows: with a constraint of being symmetric,

*J*=

_{kl}*J*for all

_{lk}*k*,

*l*in [1 …

*M*].

Here, *C _{R}* is the expected number of recurrent connections per unit, as follows:
where the average is taken over different realizations of the connectivity matrix.

The final layer consists of a single readout unit that is connected to a randomly chosen subset of *C* perceptrons in the intermediate layer, with the strength of all connections taken to be equal.

We will keep the connectivity parameters *C _{F}*,

*C*, and

_{R}*C*and coding level

*f*at fixed constant values, while sending the number of input units

*N*, the number of intermediate perceptrons

*M*, and the number of patterns

*P*to infinity, as follows: We want to recover the linear scaling of the maximal number of patterns

*P*

_{max}that the network can learn to classify with the number of input units

*N*, which is known to hold for the fully connected perceptron (Cover, 1965).

#### Discrete time dynamic model

We model the recurrent dynamics as a probabilistic dynamic process in discrete time *t* with the probabilistic transition rule from a network state at time *t* to a network state at time *t* + 1. Let *s _{k}*(

*t*) ∈ [1 …

*M*] be the dynamic variable describing the state of unit

*k*at time

*t*in a recurrent network.

Let *h̃*_{k}^{ν} be the total current into the readout unit *k*, as follows:
where the first term corresponds to the recurrent contribution, and the second term represents the feedforward current from the input layer (Eq. 3.10), which is constant in time.

The probabilistic transition rule from the state at time *t* to the state at time *t* + 1 is as follows:
Here β is the inverse temperature parameter for the statistical model of the recurrent dynamics, and it characterizes the level of noise.

We approximate this probabilistic recurrent dynamics with a mean field method.

#### Mean field analysis of the recurrent dynamics

To compute the capacity of such a recurrent classifier, we analyze the recurrent dynamics in the mean field approximation. The activities of the recurrently connected units are represented by the variables *s _{k}* = {+1, −1} with

*k*= 1 …

*M*. The average activation of the recurrently connected intermediate layer in response to the pattern ν is defined as follows: where 〈〉

_{β}is the average over the recurrent noise. The mean field equation for the average activation reads as follows: This equation can be obtained by averaging

*s*over the distribution (Eq. 3.37) and by using the self-consistent expression for the recurrent part of the total synaptic current

_{k}*h̃*(

_{k}*t*). It can also be derived more rigorously by following the standard calculation for the overlaps in the Hopfield network (Amit, 1992). The stored patterns of the Hopfield network are replaced by the eigenvectors of the connectivity matrix

*J*, as every symmetric matrix can be expressed as

*J*=∑

_{i=1}

^{N}

*e*, where

_{i}e_{i}^{T}*e*represents the (non-normalized) eigenvectors. We are interested in the overlap with the eigenvector

_{i}*e*

_{k}

^{1}= 1 for all

*k*in [1 …

*M*]. That this is the eigenvector of the chosen connectivity matrix

*J*can be seen from Equation 3.34. The external current

*h*

_{k}

^{ν}can be easily included in the derivation. The average activation

*m*

^{ν}is close to zero if the amount of active and inactive units is approximately the same. If the majority of the units is in the active state,

*m*

^{ν}will be close to 1, and if the majority is inactive,

*m*

^{ν}will be close to −1.

Here *C _{R}* is the average number of connections per unit, α is the strength of recurrent synapses (we assume they are all excitatory and of equal strength), β is the inverse temperature parameter, and

*h*

_{k}

^{ν}is the feedforward input current given by Equation 3.10.

We proceed by analyzing the above equation graphically. The plot of the right-hand side is a sigmoid curve, and the left-hand side is a line at 45°. The intersections of these two lines determine the solutions to the equation. There are two possible situations that correspond to two different scenarios of the network dynamics.

The first scenario, shown in Figure 2, *a* and *b*, is characterized by having only one point of intersection of the line and the sigmoid. In this case, there is only one solution to the mean field equation (Eq. 3.38) and only one stable state of the recurrent network. The right-hand side of the equation is almost but not quite an odd function of its argument *m*^{ν}, so the sigmoidal curve representing it is slightly shifted to the left if ∑_{k = 1}^{M}tanh(β*h*_{k}^{ν}) > 0 and to the right if ∑_{k = 1}^{M}tanh(β*h*_{k}^{ν}) < 0. If the curve is shifted to the left, the single point of its intersection with the strait line passing thorough the origin will be in the right half-plane. So, for the positive input pattern (η^{ν} = +1 and *h*_{k}^{ν} is more likely to be positive), the mean activity of the intermediate layer in the stable state *m*^{ν} will usually be positive, while for the negative input patterns it will be negative. Even though there is a relation between the sign of the mean activity of the intermediate layer in the stable state and the class of the input pattern, this is not helpful for our purposes. The reason is that we encounter exactly the same problem as for the case of no recurrent connections: the absolute value of the average activity *m*^{ν} will decrease with the number of learned patterns *P*, which means that the number of active and inactive units in the intermediate layer will become more and more similar. Consequently, to sample this small imbalance we would require larger and larger connectivity of the final readout. In short, the regime with one stable solution (Fig. 2*a*,*b*) is not much different from the case of no recurrent connections. Not surprisingly, this regime corresponds to relatively weak recurrent connections.

It is the other situation, shown in Figure 2, *c* and *d*, that is actually of interest. There are three points of intersection of the sigmoid curve of the right-hand side of Equation 3.38 and the straight line of the left-hand side. The stable states of the network correspond to the rightmost and the leftmost solutions, which are both characterized by a large imbalance between active and inactive units (|*m ^{v}*| ∼1). Most importantly, these solutions are virtually insensitive to the distribution of

*h*

_{k}

^{ν}, and hence to the number of learned patterns

*P*. So, if we postulate that the right solution corresponds to the positive input patterns and the left solution to the negative ones, it will be easy for a downstream readout with connectivity that does not increase with

*P*to distinguish between them.

The middle intersection point *m _{u}* corresponds to the unstable solution. When the network is initialized at the state {

*s*

_{k}

^{0}} with

*m*

_{0}= ∑

_{k=1}

^{M}

*s*

_{k}

^{0}on the left of the unstable solution

*m*

_{0}<

*m*, the recurrent dynamics will most likely evolve to the left stable state; and if initialized at

_{u}*m*

_{0}>

*m*; it will evolve to the right stable state. As shown in Figure 2,

_{u}*c*and

*d*, the point of unstable equilibrium will be to the left of the origin for a positive input pattern, and to the right of the origin otherwise (due to the difference in the mean of the distributions of

*h*

_{k}

^{ν}). Hence, initiating the network at

*m*

_{0}= 0 will serve the purpose of biasing the evolution of the network toward the stable state that corresponds to the class of the input pattern. If the number of learned patterns

*P*is large, the point of unstable equilibrium is very close to zero |

*m*| ∼ , this is the manifestation of the same problem as before, namely the decrease of the signal-to-noise ratio with the increasing number of learned patterns. Thus, the noise in the initial state of the network

_{u}*m*

_{0}should also decrease as . This is achieved if all of the units in the intermediate layer are initialized at

*s*

_{k}

^{0}= ±1 with equal probabilities independent from each other, and the number of units

*M*is linear in

*P*(the same scaling as for the committee machine discussed earlier). We use this initialization process to derive the classification capacity and to run the simulations. In the section The initial condition of the recurrent network, we suggest a biologically plausible way to initialize the network at the desired point.

To summarize, the information about the class of the input pattern is contained in the feedforward input to the intermediate recurrently connected layer. In the case of a single stable state (Fig. 2*a*,*b*), although the average activity of the network reflects this information, the signal is very small and a fully connected downstream readout is required. In the case of two stable states (Fig. 2*c*,*d*), this small signal biases the network to choose the one corresponding to the class of the input pattern, and by doing so, the network amplifies the feedforward signal, making it easy to read out by a sparsely connected downstream unit.

#### Number of classifiable inputs

As discussed in the previous section, the requirement for the correct classification of an input pattern by means of a recurrently connected committee machine is that the average activity of the network at the initial moment *m*_{0}^{ν} is on the correct side of the point of unstable equilibrium *m*_{u}^{ν}, namely the following:
where η^{ν} is the desired output (η^{ν} = {± 1}).

In what follows we drop the pattern index ν.

The statistics of *m*_{0} over random initializations of the network follows from its definition, as follows:
where each unit is initialized at *s _{k}* = +1 or

*s*= −1 with equal probability, as follows: Since

_{k}*M*is a large number, we approximate the distribution of

*m*

_{0}by a Gaussian distribution with these mean and variance values.

The position of the unstable equilibrium point *m _{u}*, corresponding to one of the three solutions (the one that is close to zero) of the mean field equation (Eq. 3.38), cannot be computed analytically in the general case. However, there are parameter regimes in which we can compute the approximate first- and second-order statistics of

*m*over random realizations of the input patterns. These parameter regimes and corresponding approximations are discussed in the following section. Once the mean μ

_{u}*value, which depends on the number of learned patterns*

_{u}*P*, and the variance σ

_{u}

^{2}of

*m*are known, the requirement to classify

_{u}*P*input patterns with accuracy 1 − ϵ can be written (assuming the distribution of

*m*to be also Gaussian), as follows: The expected number

_{u}*P*of correctly classified patterns can be found by inverting the above equation.

In the following sections, we consider different parameter regimes that lead to different approximations for μ* _{u}* and σ

*.*

_{u}#### The uniform regime

In the current study, among other issues we are interested in the consequences of the sparsity of input representations. Since we consider the feedforward connectivity *C _{F}* to be a constant number and not to scale with the size of the network, for sparse representations there will be a substantial number of perceptrons that receive zero feedforward input. Unless the dynamic noise is very high, these units should be considered separately, and in the mean field approximation an additional order parameter should be introduced to describe their average activity [in the derivation of the mean field equation, the overlap with the uniform eigenvector is never the only one with a macroscopic value]. We call these units “free units.”

The uniform regime is the parameter regime under which it is not necessary to analyze the free units separately, and Equation 3.38 is valid without modifications. Obviously, when the input representations are dense, *C _{F}f* ≫ 1, the network of the intermediate layer is in the uniform regime, since there are not enough free units to make a difference. However, it is valid to assume the uniform regime even for sparse representations, when the dynamic noise is sufficiently large. To be more precise, the dynamic noise should be large when compared with the typical feedforward input (see the next section).

The conditions defining the uniform regime are as follows:

#### Uniform regime, high noise

One approximation we can make to find the unstable solution *m _{u}* of the mean field equation (Eq. 3.38) is the high-noise approximation, which is defined by the following requirement:
It follows from Equation 3.10 for the feedforward current that this requirement is met if:
where 〈

*n*〉

_{n≠0}stands for the mean of the number of active inputs per readout

*n*over the binomial distribution

*n*∼

**B**(

*C*,

_{F}*f*), with the instances of

*n*= 0 excluded. For large values of

*C*, it can be approximated as 〈

_{F}*n*〉

_{n≠0}= , and the above condition becomes the following: The condition for having three solutions of Equation 3.38 rather than one (Fig. 2) is as follows: Since we are looking for the solution, which is close to zero and Equation 3.42 is satisfied for most of the terms, Equation 3.38 can be approximated by replacing the hyperbolic tangent by its argument, as follows: (note that this approximation is also valid for the terms with

*h*

_{k}

^{ν}= 0).

Solving this equation leads to the mean μ* _{u}* and the standard deviation σ

*of*

_{u}*m*: and The mean μ

_{u}*and the standard deviation σ*

_{h}*of the feedforward current*

_{h}*h*

_{k}

^{ν}are computed from Equation 3.10, as follows: The

*C*/

_{F}*N*term in Equation 3.44 comes from the correlations between the feedforward currents

*h*

_{k}

^{ν}into different readouts

*k*due to overlapping connections (Kushnir and Fusi, 2017, their Appendix A1).

Now the maximum number of learned patterns for the classifier in the uniform regime for high-noise approximation can be computed from Equation 3.40 and is given by the following: We note that because of the applicability condition (Eq. 3.43) making the last term in the denominator small requires fine tuning of the parameter β.

#### Uniform regime, low noise

The other approximation in which Equation 3.38 can be solved is as follows:
which is true for most neurons if:
Under this condition, assuming that the uniform regime is valid only if the input representations are dense:
The condition for having three solutions to the mean field equation in the low-noise approximation becomes (see Eq. 3.51) the following:
In this case, the hyperbolic tangent in Equation 3.38 can be approximated by the sign function, as follows:
Let us denote the right side of this equation by *g*(*m _{u}*), where:
is a stochastic function over different realizations of {

*h*

_{k}

^{ν}}.

Note that in this case, having a substantial fraction of terms with *h*_{k}^{ν} = 0 would lead to a discontinuity of the right-hand side at *m*_{u}^{ν} = 0.

The mean 〈*g*(*m*)〉 can be found by integrating over the distribution of *h*_{k}^{ν} (see Eq. 3.10), as follows:
where μ* _{h}* and σ

*are the mean and standard deviation of*

_{h}*h*

_{k}

^{ν}, respectively, which are given by Equation 3.45.

Thus, when averaged over training patterns, the mean field equation becomes the following:
and it has three solutions when the derivative of the right-hand side with respect to *m* at *m* = 0 is larger than 1, which, for μ* _{h}* ≪ σ

*, leads immediately to Equation 3.48.*

_{h}We now return to estimating the mean and the standard deviation of *m _{u}*, which is the unstable solution to the approximated mean field equation
where

*g*(

*m*) is defined by Equation 3.49.

For μ* _{h}* ≪ σ

*, which is always the case if the number of stored patterns*

_{h}*P*is large enough, we assume that

*C*α

_{R}*m*is also small compared with σ

_{u}*and check the self-consistency later. Then, we can use the approximation for the error function at small arguments to get the following: in which the variance of*

_{h}*g*(

*m*) can be written as the sum of the diagonal and the nondiagonal terms, as follows: which is similar to Equation 3.20 for the variance of ∑

_{k = 1}

^{M}sign(

*h*) computed previously in Equation 3.25. The only difference is that here the distribution of

_{k}*h*is shifted by

_{k}*C*α

_{R}*m*. However, because the mean 〈

*h*

_{k}

^{ν}〉 did not affect the result (Eq. 3.25) and

*C*α

_{R}*m*+ μ

_{u}*is still negligible compared with σ*

_{h}*, we can write the following: where ϕ*

_{h}_{CF,f}is given in Equation 3.25.

As a sum of large number *M* of weakly correlated terms, *g*(*m*) can be assumed to be normally distributed and can be written as follows:
where *z*^{ν} is a Gaussian variable with zero mean and unit variance.

Plugging the expression for *g*(*m*) into Equation 3.52 and solving for *m _{u}*, we get the following:
where ϕ

_{CF,f}is given in Equation 3.25.

So, the expectation value of *m _{u}* is as follows:
and the standard deviation is given by the following:
Because uniform regime and low noise imply dense input representation, we can use the dense approximation (Eq. 3.26) for ϕ

_{CF,f}. Plugging these results into Equation 3.40 leads to the capacity for the uniform regime, low noise, as follows:

#### Nonuniform regimes

When the input representation is sparse: there is a substantial fraction of perceptrons for which all inputs are silent; we call them the free units. If the noise is not very high , these units are statistically different from those that do receive a nonzero input. To analyze such a system in the mean-field approximation, two order parameters and two coupled mean field equations should be introduced. To avoid this complication, we consider a simpler case, which we refer to as the “two-subnetwork regime.” This regime is characterized by the recurrent connections that are relatively weak when compared with the feedforward connections, so that the state of those units that do receive nonzero feedforward input is determined by this input only. Neither recurrent input nor noise can flip them. Only the free units participate in the recurrent dynamics, and their mean activity in the final state reflects the class of the input pattern. Which of the two stable states the subnetwork of free units will go to is biased by the input from the input-receiving units, which have the information about the class of the input pattern from the feedforward input.

This approximation is valid if: To be more precise, this condition does not guarantee that the recurrent input will not be able to flip the input-receiving units close to the final state, when most of the free units are aligned. However, if this is the case, their activity already reflects the correct classification of the input pattern, and the input-receiving units will flip in the right direction.

The mean field equation (Eq. 3.38) should now be seen as describing the subnetwork of free units, and should be modified in several ways.

First, the number of units in the network is
since for small *f* the probability of all *C _{F}* independent inputs to be silent is (1 −

*f*)

^{CF}≈ e

^{−CFf}. Second, only

*C*e

_{R}^{−CFf}of

*C*recurrent connections per unit come from other free units. Also, the external input to the network now comes from other (input-receiving) units in the intermediate layer, rather than from the input layer.

_{R}The modified mean-field equation reads as follows:
where *m̃*^{ν} is the average activity of the subnetwork of free units and the index *k* runs over all the free units.

The external input is as follows:
where the summation is over the input-receiving units and *h*_{k}^{ν} is the feedforward current of Equation 3.10 with *n*_{l}^{ν} ≠ 0. *M _{IR}* is the number of input-receiving units, as follows:
On average, the free unit

*k*receives

*C*inputs, and (1 −

_{R}*e*

^{−CFf})

*C*of them come from input receivers. So, Equation 3.63 will have on average

_{R}*C*(1 −

_{R}*e*

^{−CFf}) nonzero terms. Assuming that this is a large number,

*H*

_{k}

^{ν}is a Gaussian variable with the mean given (in the leading order) by the following: which, using Equation 3.13, becomes the following: The number of active inputs

*n*connected to the intermediate unit comes from the binomial distribution,

*n*∼

**B**(

*N*,

*f*).

The standard deviation of *H*_{k}^{ν} is as follows:
(the corrections due to correlations between different input-receiving units are suppressed as 1/*N* and will become negligible for large networks when *C _{R}* does not scale with

*N*).

To find the statistics of *m̃ _{u}*, the point of unstable equilibrium, we again consider high- and low-noise approximations, but now we should compare the inverse temperature parameter β to the standard deviation of

*H*

_{k}

^{ν}.

What we further call intermediate noise is the noise that is small on the scale of the feedforward input (Eq. 3.60) but large when compared with the typical values of *H*_{k}^{ν}.

#### Two-subnetwork regime, intermediate noise

The following analysis is valid if, in addition to the conditions described in Equations 3.58, 3.59, and 3.60, the dynamic noise is high compared with the typical external input to the subnetwork of free units:
The condition for three solutions to the mean field equation (Eq. 3.62) in this case is as follows:
The former inequality allows us to approximate the hyperbolic tangent in Equation 3.62 by its argument when looking for the unstable solution , which is close to zero, as follows:
Each input-receiving unit *l* has *C _{R}* outgoing connections and approximately e

^{−CFf}

*C*of them terminate on a free unit. Hence, the double sum can be rewritten as follows: Solving this equation for

_{R}*m̃*leads (see Eq. 3.61) to the following: where we have introduced

_{u}*r̄*

^{ν}: the sign of the feedforward current averaged over the units for which this current is nonzero, as follows: The statistics of

*r̄*

^{ν}are closely related to previously computed statistics of

*r*

^{ν}(see Eq. 3.12), which is the sign of the feedforward current averaged over all of the intermediate units, namely: The expression for 〈

*r*

^{ν}〉 is given in Equation 3.13, which leads to the following: To compute the second-order statistics of

*r̄*

^{ν}, we use the following relation: The covariance on the right-hand side was also computed in Equation 3.25, which allows us to write the following: Plugging in Equations 3.71 and 3.72 to Equation 3.69 leads to the expressions for the mean and the standard deviation of

*m̃*

_{u}

^{ν}: [the mean 〈n〉 is computed assuming a binomial distribution for the number of active inputs

*n*connected to a readout

*n*∼

**B**(

*N*,

*f*)], and the following: Now we can use Equation 3.40 to compute the maximum number of learned patterns in the two-subnetwork regime under intermediate noise. The number of units in the network

*M*in Equation 3.40 should be replaced by the number of free units

*M*e

^{−CFf}. The result is as follows: where: It is helpful for analyzing this result to rewrite the expression for γ in terms of the following: which is the measure of how far the current parameters are from the transition to the one-solution scenario (Fig. 2

*a*,

*b*), at which the current framework breaks down, as follows: In the ultrasparse approximation: we can use Equations 3.16 and 3.28 to get the following:

#### Two-subnetwork regime, low noise

We now consider the low-noise approximation to the mean field equation for the subnetwork of free units (Eq. 3.62). This approximation is valid when in addition to Equations 3.58, 3.59, and 3.60: In this approximation, the mean field equation has three solutions if: This condition is derived analogously from Equation 3.47.

Under the assumption (Eq. 3.75), the mean field equation (Eq. 3.62) can then be approximated as follows:
As in the section Uniform regime, low noise, let us introduce a stochastic function *g*(*m̃*), as follows:
For small values of the argument *m̃*, the mean of *g*(*m̃*) over different realizations of *H*_{k}^{ν} is approximated as follows:
where μ* _{H}* and σ

*are given by Equations 3.65 and 3.66.*

_{H}To compute the variance of *g*(*m̃*) we need to know the following:
which is calculated in the study by Kushnir and Fusi, 2017, their Appendix A2) and, for large absolute values of the recurrent connectivity *C _{R}e*

^{−CFf}≫ 1), is approximated by the following: Assuming

*H*

_{k}

^{ν}to be Gaussian, we can write the following: where

*z*

^{ν}is a Gaussian variable with zero mean and unit variance.

The statistics of the unstable, close to zero, solution of Equation 3.76 can now be found by plugging in Equation 3.78 as the right-hand side of Equation 3.76, and solving for .

After substituting Equations 3.65 and 3.66 for μ* _{H}* and σ

*, we get the mean and the variance of the unstable solution*

_{H}*m̃*(assuming ), as follows: Using these expressions and Equation 3.40 with

_{u}*M*replaced by the number of free units

*M*=

_{f}*Me*

^{−CFf}, we get the maximal number of classifiable inputs in the following low-noise approximation of the two-subnetworks regime, as follows: Note, that this is the same expression as Equation 3.30 for the majority vote scenario (see the Results for an intuitive explanation).

For very sparse representations: the expression simplifies to the following:

## Results

### The task and the network architecture

To evaluate the performance of different network architectures, we consider a task in which the network is trained to associate a specific response with each input. The response is expressed by the activity of one output neuron, which could represent a decision, the expected value of an input stimulus, or an action. Each input, for example a sensory stimulus, is a pattern of activity across *N* input neurons. Both input and output neurons are either active or inactive, and hence the variables representing their activity are binary. Moreover, we assume that the inputs and the outputs are random and uncorrelated. Input neurons are active with probability *f*, whereas the output neuron is active on average for half of the inputs. Performing this task is equivalent to solving a binary classification problem in which each input is assigned to belong to one of two possible classes. As a measure of the performance, we consider the classification capacity and the maximum number of input patterns that can be correctly classified, and determine how it scales with the total number of neurons in the network. We now consider architectures with increasing complexity, and we eventually show that it is possible to design a network in which the number of classifiable inputs is large and scales linearly with the number of neurons while each neuron has limited connectivity (i.e., the number of connections per neuron is fixed in the sense that it does not scale with the number of neurons).

#### Single fully connected readout

The most basic network that we consider is the one in which the input neurons are directly connected to the output, which is basically the classical perceptron (Rosenblatt, 1957; Figs. 1*a*, 3, the first model on the left). The network is trained by modifying the weights *w _{i}* that connect each input neuron

*i*(Fig. 3, green) to the output (Fig. 3, yellow). The output activity

*o*

^{μ}in response to stimulus μ is determined by thresholding the weighted sum of the inputs: where θ is a threshold and ξ

_{i}

^{μ}is the activity of neuron

*i*when input pattern μ is selected. The weights

*w*and the threshold θ are learned to impose that

_{i}*o*

^{μ}= η

^{μ}, where η

^{μ}is the desired output in response to stimulus μ. We know from many studies (Cover, 1965; Gardner, 1987) that the maximum number

*P*of random inputs that can be correctly classified scales linearly with the number of input units when

*f*= 1/2 (

*P*∼

*N*; Fig. 3, table). This is a very favorable scaling and, actually, is the optimal one in the benchmark that we consider. Unfortunately, the number of connections

*C*of the output neuron (Fig. 3, feedforward connectivity) is equal to the number of input neurons, and hence when the number of classifiable inputs grows, the connectivity also has to increase accordingly. This is true also in the case of sparse input representations. Indeed, for an arbitrary value of

_{F}*f*, when we used the following simple learning rule inspired by Tsodyks and Feigel'Man (1988): and we obtained the limit of a large number of input neurons

*N*, the maximum number of input patterns that can be classified

*P*is given by the following: where ϵ is the maximum tolerated error.

Notice that the factor containing the coding level of the patterns *f* cannot change the scaling properties of *P*, even in the case in which the inputs become very sparse (i.e., when *f* → 0 as 1/*N*). This seems to be in contradiction with the results of other studies (Tsodyks and Feigel'Man, 1988; Amit and Fusi, 1994) in which *P* can scale as *N*^{2} when the inputs are sparse. However, it is important to remember that the *N*^{2} scaling can be achieved only when both the input and the output are sparse, and in the cases that we analyzed here the output is dense (i.e., active in half of the cases).

We now consider a different architecture that partially overcomes the limitation imposed by the limited connectivity assumption.

#### Committee machines

Consider now the architecture of Figure 1*b* (Fig. 3, Committee machine) in which multiple perceptrons are combined. We assume that each perceptron has limited connectivity, or more precisely, that when the number of input neurons becomes large (mathematically, we consider the limit of *N* → ∞), the number of input connections per perceptron, *C _{F}*, does not increase (i.e.,

*C*remains finite when

_{F}*N*→ ∞; Fig. 3, flat line). As a consequence, each perceptron will sample only a small fraction of the input neurons, and, for this reason, it will misclassify most of the inputs when

*P*becomes large (

*P*→ ∞). More quantitatively, the fraction of correctly classified inputs will be slightly above the level of chance (1/2), approximately 1/2 +

*a*/P when

*P*is large (

*a*is a constant).

In this situation, each perceptron is said to be a weak classifier. However, if the responses of different perceptrons are sufficiently independent, they can be combined to perform significantly better than any individual perceptron. The combination of multiple perceptrons makes what is called a committee machine. Typically, the class of an input is decided by the committee using a majority vote rule: if the majority of perceptrons are active, then the output neuron should also be active, otherwise it should be inactive. The majority rule can be easily implemented by summing with equal weights the outputs of all perceptrons.

As mentioned in the Introduction, adding new readouts without increasing the number of input units *N* cannot increase the classification capacity indefinitely, unless an additional mechanism is introduced to decorrelate the responses of different readouts. Such mechanisms may very well exist in the real brain. For example, one could imagine some local changes of synaptic plasticity during the learning phase, which make different readouts update their connections during the presentation of different subsets of patterns. However, in this article we stick to the simple learning rule (Eq. 4.1) and do not consider any decorrelation mechanisms. So, in the present contexts, the only way of increasing the classification capacity of the network without reaching the saturation is to increase the number of input units *N*. Also, to satisfy the requirement of limited connectivity, the number of connections converging onto the same readout, *C _{F}* cannot increase with

*N*, and we need to add new readouts to connect to the newly added input units. We denote the number of readouts (number of committee members) by

*M*and we derive the classification capacity

*P*under the assumption that

*N*,

*M*, and

*Pf*are large numbers and the

*C*connections of every readout (perceptron) are chosen randomly and independently of any other (there will be a random overlap).

_{F}If we use the simple local learning rule (Eq. 4.1), the maximum number of classifiable inputs is as follows:
where ϕ_{CF,f} is of the order of *C _{F}* and depends on the coding level

*f*, but not on

*N*or

*M*(see Eq. 3.25). The factor 〈n〉 is the mean of the square root of the random variable

*n*over the binomial distribution

**B**(

*C*,

_{F}*f*), which is approximately in the case

*C*≫ 1 (dense approximation) and

_{F}f*C*in the case of

_{F}f*C*≪ 1 (ultrasparse approximation; in practice, ultrasparse approximation is quite accurate also for

_{F}f*C*≈ 1). As for the single readout, the required classification accuracy is 1 − ϵ.

_{F}fUsing also the approximations for ϕ_{CF,f} in these two cases, we get the following:
and
For the dense input representations, *C _{F}f* ≫ 1, if the number of input neurons

*N*is kept constant while the number of perceptrons

*M*is increased, the capacity

*P*will saturate when the total number of feedforward connections

*C*is large compared with

_{F}M*N*. The value at which

*P*saturates is the same as for the fully connected readout (Eq. 4.2). If the input representations are sparse,

*C*≲ 1, saturation occurs when

_{F}f*C*becomes large compared with

_{F}M*N*/

*C*. The saturation value of

_{F}f*P*in this case differs from the fully connected case by a factor of order one, 2/π.

This implies that the committee machine is less efficient than the fully connected readout when the total number of feedforward connections is considered. This will also be the case for the recurrent readout scheme that we propose in the following section. It should be noted, however, that the difference in the total number of connections is modest (several times) unless the input representations are extremely sparse *C _{F}f* ≪ 1.

The dependence of *P* on the coding level *f* is weak, unless *C _{F}f* becomes smaller 1. For sparser representations, the capacity becomes proportional to

*C*(unless

_{F}f*M*is increased). This is not surprising because when

*C*< 1, a significant proportion of perceptrons will read out only inactive neurons, which are not informative about the input. However, even for very sparse representations the capacity can be restored by increasing the expansion ratio

_{F}f*M*/

*N*(see Fig. 5

*c*, green line and Fig. 5

*f*, low-

*f*region).

When both *N* and *M* are increased in proportion to each other, the number of classifiable patterns increases linearly with *N*, as in the case of the fully connected single perceptron (Fig. 3). However, now the connectivity of each perceptron is *C _{F}*, which does not scale with

*N*or

*M*. This means that it is possible to overcome the limitations of sparse connectivity. Unfortunately, this is not a satisfactory solution to the problem of limited connectivity because the readout output neuron, which now has to count the votes of all

*M*members of the committee, needs to be connected to

*M*neurons, and

*M*scales as

*N*. So again, we will need a number of connections per neuron that grows linearly with

*N*. The last row in Figure 3 (connectivity of the final readout) shows this dependence.

### Committee machines with recurrent connections

We propose an alternative way of implementing a committee machine, which is based on the use of recurrent connections, and it does not require a fully connected output neuron (Fig. 3). To understand the idea behind it, it is useful to consider a multilayer readout as a way to count the votes of all perceptrons while respecting the limited connectivity constraint: each neuron in the first layer would count the votes of different perceptrons. The neurons in the second layer would then count the votes of first-layer neurons, and so on. For this architecture, the number of neurons would decrease by a factor in every new layer, leading to a total number of neurons that would scale as log(*M*) or, equivalently, as log(*N*). It is also possible to set up a multilayer network with the same number of layers in which every layer contains the same number of neurons *M*. This network would require more neurons, but it is functionally equivalent to the first one that we considered. The reason we are considering this network architecture is that it can be interpreted as a recurrent network unfolded in time: if one assumes that the network dynamics is discrete in time, then every layer could be seen as the same recurrent network at a different time step. Importantly, the weights of the synaptic connections should be the same for every layer, as it is always the same network but at different time steps. As this network would also be functionally equivalent to the first multilayer network that we discussed, a recurrent network can in principle replace a complex multilayer readout, which would require significantly more neurons.

These considerations induced us to study the architecture represented in Figure 1*c* (Fig. 3, Recurrent readout), as follows: each perceptron of the committee machine is now connected to a randomly chosen set of the others through recurrent connections, whose weights are all the same and are equal to α. The number of recurrent connections per perceptron is *C _{R}*.

The recurrent dynamics basically has the role of stabilizing only two of the following attractor states of the network: one in which all perceptrons are in the active state; and one in which they are all in the inactive state. These two states represent the two possible responses of the output and correspond to the two classes the input could belong to. The recurrent network is an attractor network similar to the one proposed by Hopfield (1982), which in turn took inspiration from the studies on spin glasses in the ferromagnetic state, in which the spins are all aligned to each other in one of two possible directions. These two states of magnetization are analogous to the two decisions of the network that we propose. These attractor models have been used more recently to model decision-making, both in simple, more abstract networks (Usher and McClelland, 2001) and in very detailed biologically plausible networks of spiking neurons (Wang, 2002) in which only the two states corresponding to the possible decisions become stable when a sensory stimulus is presented.

The assumption about the excitatory nature of recurrent connections is crucial to have the two attractor states described above. For the case of binary classification, we assume that recurrent connections are either zero or excitatory. For the case of multinomial classification, however, the recurrent connections can also be inhibitory (see subsection Random output).

Once the network has relaxed into one of the two stable states, it becomes easy to determine the class to which the input belongs, as in principle it is sufficient to read out a single perceptron. However, a single neuron readout would not be robust to noise, and hence we will consider the situation in which a number of different perceptrons is read out. We will show that this number remains finite when *N* and *M* become large, which is equivalent to saying that it is possible to construct a network in which all of the neurons, including the output neuron, have limited connectivity and the number of classifiable inputs grows linearly with *N*. These scaling properties are summarized in the last column of Figure 3.

The number of classifiable inputs is derived analytically in Materials and Methods using a mean field approach. This number depends on the parameters that characterize the network architecture (i.e., the number and the connectivity of the different types of neurons) and on the statistics of the inputs that have to be classified. Depending on the assumptions about the parameters, there are different regimes that lead to different analytical expressions. These distinct regimes are summarized in Figure 4 and described in the following sections in great detail. The first distinction relates to whether all the recurrently connected neurons can be considered statistically equivalent or not. We call uniform the regime in which all the neurons can be assumed to be equivalent. This assumption is reasonable except when the connectivity and/or the input representations are so sparse that there is a significant fraction of neurons in the readout layer that do not receive any feedforward input. The number of these neurons depends on the product *C _{F}f* (Fig. 4). This population of neurons behaves differently from the others, to the point that it can be considered as a different subnetwork that requires a different analysis (for this reason, we call this a two-subnetwork regime). The type of analysis we perform also depends on the amount of noise in the network. In particular, when the noise is large, the network always operates in a uniform regime, even when the input is sparse. We now first discuss the uniform regime, and we will cover the nonuniform regime later.

#### The uniform regimes

The behavior of the network in the uniform regime also depends on the amount of noise that is injected into the neurons. We introduced noise as in the Hopfield model: the state of each neuron is stochastic, and its total synaptic current determines the probability distribution of the states. The noise is characterized by a parameter β, which in the language of statistical mechanics would be the inverse temperature parameter. When β is large, the noise is small and the neurons are basically deterministic. As β goes to zero, the neurons become more noisy and less dependent on the total mean synaptic input.

As we know from previous studies on attractor neural networks (Amit, 1992), the noise cannot be too large, otherwise the attractor states remain stable only for a short time (Fig. 2). More specifically, the noise should be smaller than the recurrent input when the network already settled in one of the two attractors and most of the presynaptic neurons of the recurrent network are in the right state. In the uniform regime, this requirement is expressed as β*C _{R}*α > 1. Moreover, to guarantee attractor stability, the recurrent input should also dominate over the feedforward input. More formally, this condition can be expressed as

*A*<

*C*α, where

_{R}*A*= is approximately the range in which the feedforward synaptic input varies when different inputs are presented. It basically determines the selectivity to the inputs in the absence of the recurrent connections (for more details, see Materials and Methods).

The two conditions on noise versus recurrent input and recurrent input versus feedforward input impose constraints on both *A* and β. However, the range in which these parameters can vary still allows the network to operate in qualitatively different regimes that depend on how large the noise is compared with the typical amplitude of the feedforward input.

##### The uniform, high-noise regime.

In the high-noise regime, the noise is so large compared with the feedforward input (β^{−1} ≫ *A*) that all of the different recurrent neurons can behave similarly (uniform regime) even when the feedforward input is so sparse (*C _{F}f* ≲ 1) that many neurons receive zero input. In this regime, the noise is large compared with

*A*, but still small compared with the recurrent input. The number of classifiable patterns

*P*for the high noise, always uniform regime, is given by the following: where The parameter Δ

*should be positive in order for the network to have two stable attractor states. However, increasing Δ*

_{UH}*decreases the number of classifiable inputs*

_{UH}*P*. In the following analysis, we assume that the parameters of the recurrent dynamics are adjusted in such a way that Δ

*is not too close to zero, so that there is no risk of losing the two attractors, but also not too large, so as to not sacrifice too much of the classification performance. A reasonable choice is Δ*

_{UH}*= 0.2.*

_{UH}As Δ* _{UH}* cannot be too small, and

*C*

_{F}f^{2}(1 − f)β

^{2}≪ 1 by the high-noise assumption (Eq. 3.43), it can be seen from Equation 4.6 that the proposed network, when operating in the uniform regime, has a worse performance than a committee machine (Eqs. 4.4 and 4.5) and a fully connected readout (Eq. 4.2). However, in the limit of large

*C*, the performance becomes the same.

_{F}M/NImportantly, if the number of input units *N* and the number of intermediate readouts *M* are increased in the same proportion, the number of classifiable inputs scales linearly with *M* or *N* (Fig. 5*a*), as in the committee machine case. However, now there is not a single neuron that is required to have a connectivity that scales with *N*, so the connectivity of each neuron can remain finite even when *N* and *M* become arbitrarily large. This supports the claim made in Figure 3 about scaling properties of the proposed network. As we will see below, this is not the only regime in which these scaling properties are valid.

The rate at which *P* grows with the number of input neurons *N*, *P*/*N*, depends on the expansion ratio *M*/*N* (the number of intermediate readouts per input neuron), the coding level *f*, and the parameters of the recurrent dynamics Δ* _{UH}* and β. Instead of using

*M*/

*N*as an independent parameter, it is more convenient to express

*P*as a function of the average number of efferent connections per input neuron, as follows: Indeed,

*C*/

_{F}*N*is the probability that an input neuron is connected to a readout. When multiplied by

*M*, it gives the average number of connections that depart from an input neuron and arrive at the intermediate readouts. We assume that, along with

*C*, the number of efferent connections per input neuron should also be minimized, given that these connections could be long range.

_{F}The dependence of the growth rate of the capacity *P*/*N* on the input coding level *f* is shown in Figure 5*d* for different values of *c*. To make these plots, we assumed that the parameters of the recurrent dynamics are chosen anew for every value of *f* so as to keep Δ* _{UH}* = 0.2 and to satisfy the high-noise condition (Eq. 3.42) by the same margin. We also assumed that the number of feedforward connections per perceptron is

*C*= 50, which is consistent with the observations in the mouse hippocampus (see Discussion).

_{F}The classification performance of the recurrent readout *P*/*N* in the high-noise regime increases with *c*, for any value of *f*. In other words, when *N* and *C _{F}* are kept constant, increasing the number of perceptrons

*M*, and hence the total number of connections

*C*, will always increase the capacity

_{F}M*P*. However, the capacity cannot increase indefinitely in this way because of the correlations between the perceptrons that we discussed above. Interestingly, the saturation of the capacity as a function of

*M*(or

*c*) is reached sooner for denser representations.

As can be seen from Figure 5*d*, the classification performance reaches its maximum at *f*_{max} ≈ 0.05, depending on the value of *c*. When *c* increases, the maximum moves toward sparser *f*. The position of the maximum also depends on the number of feedforward afferent connections per perceptron, *C _{F}*, as

*f*

_{max}∝ 1/

*C*.

_{F}##### The uniform, low-noise regime.

When the noise is low compared with both the recurrent and the feedforward input, the density of the input representations starts playing a crucial role in determining whether the network is in a uniform or a nonuniform regime. If the input representation is dense, *C _{F}f* ≫ 1, the network is again in a uniform regime. As before, all of the neurons have the same average activity, but the main source of inhomogeneity is the feedforward input rather than the noise. The number of classifiable inputs in this uniform low-noise regime is as follows:
with
Again, Δ

*= 0 corresponds to the transition from the network with two stable states, which is suitable for classification, to the network where only one state is stable. In this case, it is not the recurrent noise that may destroy the two attractor states, but the variance in the feedforward input (this is why β does not enter into the expression for Δ*

_{UL}*). We assume that the strength of the recurrent connections α is adjusted to keep Δ*

_{UL}*from being too close to zero.*

_{UL}The result (Eq. 4.7) is very similar to the case of dense representations in the high-noise regime. One obvious difference is that the inverse temperature parameter β does not appear since we assumed to be in the low-noise limit β → ∞. As before, the capacity per input neuron *P*/*N* grows as the expansion ratio *M*/*N* or the number of feedforward connections per perceptron *C _{F}* increases.

When compared with the performance of the fully connected readout (Eq. 4.2), this regime implies a slightly lower classification capacity. The difference disappears when *c* is assumed to be large.

Figure 5, *c* and *d*, summarizes the dependence of the classification performance *P*/*N* on the coding level *f* when the noise in the recurrent network is low on the scale of the feedforward input. The high-*f* segments of the curves correspond to the uniform low-noise regime.

#### Nonuniform regimes

When the noise is small compared with the feedforward input and the representations are sparse, the uniform approximation is not valid and the recurrent network behaves in a qualitatively different way; for each input pattern, there would be two distinct populations of neurons: the free neurons, which receive zero feedforward input, and hence are not constrained (free) by the input; and all the others, the input receivers. The two populations would be different for different inputs, they would have different activity distributions, and they would evolve in time differently, although they constantly interact with each other.

Generally, such a regime is intractable with the mean field method, so we need to make the additional assumption that the feedforward synapses are sufficiently strong relative to the recurrent ones, so that the nonzero feedforward inputs are typically larger than the total recurrent inputs in the initial state (before the network reaches the final state, when most of the neurons have the same activity). Furthermore, we need to assume that these feedforward inputs are also much larger than the noise. Under these assumptions, the state of the input receivers is determined by the feedforward input, at least in the initial stages of the dynamics, while the network is deciding which stable state to choose. We then need only to consider the dynamics of the subnetwork of free units, treating the recurrent input from the input receivers as a fixed external input. It is this input that contains the information about the correct classification.

We refer to the described scenario as to the two-subnetwork regime. The classification capacity in the two-subnetworks scenario also depends on the noise. The noise has to be small compared with the feedforward input, otherwise, it might modify the input-receiving neurons. However, it can be either small or large when compared with the amplitude of the recurrent input coming from the input receivers. The noise amplitude determines whether the network operates in a two-subnetworks low-noise regime or in the two-subnetwork intermediate-noise regime.

The two-subnetwork intermediate-noise regime is realized when the representations are sparse (*C _{F}f* ≲ 1) and the noise is small relative to the feedforward input but large in the subnetwork of free neurons, namely relative to the input into free neurons from the input receivers. This regime leads to the classification capacity of the following:
Here 〈n〉 is the mean of n over the binomial distribution

**B**(

*C*,

_{F}*f*), which is approximated by different functions of

*C*and

_{F}*f*depending on whether

*C*is small or large (Eqs. 3.15 or 3.16). The term ϕ

_{F}f_{CF,f}corresponds to the correlations between input receivers (Eq. 3.25). It is of the order of

*C*and depends on the coding level

_{F}*f*. It is also approximated differently depending on the value of

*C*(see Eqs. 3.26 and 3.28). The quantity γ is given by the following: where As in the uniform regimes, the capacity is maximal (smallest γ) when the network of free units is close to transitioning from three fixed points (Fig. 2

_{F}f*c*,

*d*) to one fixed point (Fig. 2

*a*,

*b*).

In the ultrasparse approximation *C _{F}f* ≪ 1, the expression for

*P*becomes the following: Figure 5

*b*shows the linear dependence of

*P*on the number of input neurons

*N*for different expansion ratios in the two-subnetwork intermediate-noise regime. As for uniform regimes, the dependence is linear, confirming once more the scaling properties of Figure 3. The relation between the slope of this linear dependence,

*P*/

*N*, and the coding level for

*C*= 50 and different values of

_{F}*c*=

*C*is represented by the low-

_{F}M/N*f*segments of the curves on Figure 5

*e*. The high-

*f*segments of the curves correspond to the low-noise approximation of the uniform regime, which is characterized by the same relationship between the noise and the typical values of the feedforward input. We do not plot the low-

*f*curves beyond

*f*= 0.05 because for denser representations the fraction of the free units becomes small (∼8%), and it becomes difficult for the randomly and sparsely connected final readout to distinguish between the two states of the free subnetwork. However, we believe that the classification capacity changes smoothly between the two regimes, achieving its maximum for the coding value close to

*f*= 0.05. The location of the maximum will change when different values of

*C*are assumed,

_{F}*f*

_{max}∝ for large

*C*.

_{F}##### The capacity in the intermediate-noise regime can be larger than the capacity of a majority vote committee machine.

Interestingly, the capacity in the two-subnetwork intermediate-noise regime (Eq. 4.8) is larger than in the case of a majority vote committee machine for the same coding level *f* (Eq. 4.3). This result is counterintuitive, but it can be explained, as follows: in the majority vote scenario, both the input-receiving units and the free units contribute to a collective decision, even though the free units carry no information about the class of the input pattern and they actually generate noise as we assume that initially they are in a random state. In contrast, in the recurrent case, the collective state of the network is initially determined mostly by the input-receiving units, which then drive the free units to the right state. The noise contained in the initial state of the free units does not much affect the initial relaxation dynamics, provided that the noise in dynamics is sufficiently large (relatively low β).

In the case of the majority vote committee machine, the class is decided in only one time step and the initially random free units generate a certain amount of noise that depends on their number. In the case of the recurrent dynamics, the connectivity is sparse and each neuron that participates in it samples the noisy neurons a number of times, which depends on the relaxation time. If these neurons can flip randomly at every time step, then their noise is averaged out and the final effect of the free units can be smaller than in the majority vote committee machine.

##### The two-subnetworks, low-noise regime.

We now consider the two-subnetwork low-noise regime. Not surprisingly, the capacity is identical to the one for the majority vote scenario in the sparse limit (see Eq. 4.3), as follows:
and in the ultrasparse approximation *C _{F}f* ≪ 1:
These results are summarized graphically in Figure 5

*c*(green line) and in the low-

*f*region of Figure 5

*f*. As in Figure 5

*e*, the high-

*f*region represents the uniform low-noise regime. We believe that there is no discontinuity in the capacity at the transition between the regimes. Interestingly, the classification capacity is larger in the intermediate-noise regime when compared with the low-noise regime just discussed and the high-noise regime considered earlier, in which the network was still operating in a uniform regime. This implies that there is an optimal level of noise that maximizes the performance of the network.

### Simulation results

Figure 6 shows the results of numerical simulations run to verify the predictions of the above calculations. The two plots correspond to two different regimes characterized by different coding levels *f* of the input representation. Figure 6*a* shows the case of dense input representations, *C _{F}f* = 10 ≫ 1. The theoretical predictions for the majority vote scheme (committee machine) and the recurrent readout are represented by the two dashed lines, and the solid lines with the error bars depict the results of the simulations in the two scenarios. The case of sparse input representations

*C*= 1 is shown in Figure 6

_{F}f*b*. As predicted by the analytical calculations, the recurrent readout scenario leads to higher classification capacity than the majority vote.

The simulation plots were obtained as follows: we fixed the required accuracy of the classification at 1 − ϵ = 0.9 and the number of feedforward connections per readout at *C _{F}* = 50. For each number of the input units

*N*, we chose the corresponding number of the intermediate readouts

*M*=

*N*/30 (

*c*= 5/3), and we computed the predicted classification capacity

*P*. We then trained the network with the set of

*P*

_{1}=

*P*random and uncorrelated patterns, and tested the classification performance on a subset of 500 learned patterns. The obtained accuracy can be expressed as 1 − ϵ̌. Then we trained the same network on the new set of

*P*

_{2}= random patterns. We repeated the procedure 10 times, and we kept only those runs where the accuracy differed from the required one by not >2% (∣ϵ̌ − ϵ̌∣ < 0.02). We then computed the mean and the SE of the corresponding values of

*P*. For the recurrent readout scenario, the recurrent connectivity was random, with all the connections having the same strength and the number of connections per unit being fixed at

*C*= 200. The connectivity matrix was chosen to be symmetrical, and the recurrent dynamics ran for 30 steps of synchronous update. The parameters of the recurrent dynamics for the plot of Figure 6

_{R}*a*(dense input) were β = 0.5 and α = 0.015 (Δ

*= 0.5), and for Figure 6*

_{UH}*b*(sparse input), β = 33 and α = 0.0005 (Δ

*= 0.21).*

_{TI}For the uniform low-noise regime, the simulation results also agree with the theoretical prediction (data not shown). In the case of the two-subnetwork low-noise regime, to get a good agreement with the theory we need to initialize the input-receiving units in the state determined by their total input first and only then run the dynamics of the randomly initialized free units. This procedure corresponds to assuming that a larger synaptic input causes the neuron to change its state faster, which is a reasonable assumption.

### Optimizing the network architecture

#### Optimizing the architecture under the constraint that the total number of long-range connections is constant

The expressions for the classification capacity as a function of the parameters of the network (Eqs. 4.6 to 4.12) allow us to determine the optimal network architecture under different constraints. More specifically, we determine the optimal relation among the number of the input neurons *N*, the size of the perceptron layer *M*, and the feedforward connectivity *C _{F}*. Note that this notion of optimality is independent from tuning the parameters of the recurrent dynamics, such as α, β, and

*C*, which can be chosen later to ensure optimal values of Δ

_{R}*, Δ*

_{UH}*, or Δ*

_{UL}*.*

_{TI}We first discuss the optimization under the constraint on the total number of long-range connections (i.e., the feedforward connections). More specifically, we assume that the number of inputs *N* and the total number of long-range connections *C _{FM}* are fixed (which implies constant

*c*=

*C*), and we ask what value of

_{F}M/N*C*(or

_{F}*M*) will optimize the capacity

*P*.

For dense input representations in the uniform regimes (high or low noise), rearranging the connections while keeping their total number the same has no effect on the classification capacity. This can be seen from Equations 4.6 and 4.7. Although the parameter *C _{F}* enters Equation 4.6 not only in combination with

*C*, for dense representations,

_{F}M*C*

_{F}f^{2}(1 −

*f*)β

^{2}in the last term in the denominator represents the ratio between the noise and the typical value of the feedforward input, which we assume to be constant (see Eq. 3.42).

For the case of sparse representations, independent of the level of noise, the situation is different. In Equations 4.8 and 4.11, the parameter *C _{F}* enters through the quantities 〈n〉 and ϕ

_{CF,f}, while in Equation 4.6 it enters explicitly in the last term of the denominator and in the scaling of the noise parameter β ∝ (see Eq. 3.42). It turns out that for constant

*C*, when

_{FM}*C*is assumed to be large (which we always do for this study), the capacity

_{F}*P*depends only on the product

*C*, the average number of active inputs per perceptron, but not on

_{F}f*C*or

_{F}*f*individually [apart from the factor (1 −

*f*), which is close to 1 for sparse representations]. The dependence of the capacity on the value

*C*will be represented by the same curves as the low-

_{F}f*f*regions of the curves shown in Figure 5

*d–f*. As we can see from these plots, the capacity

*P*increases as a function of

*C*.

_{F}fSo, for sparse representations under low or intermediate noise the optimum of the classification capacity for a fixed number of input neurons *N* and a fixed total number of feedforward connections *C _{F}M* is achieved for the value of

*C*, which corresponds to the boundary of applicability of the two-subnetwork regime,

_{F}*C*≈ 2. For high levels of noise, the optimum is also approximately

_{F}f*C*= 2 (its exact position depends on the value of

_{F}f*c*and the chosen level of noise relative to the feedforward input).

Increasing *C _{F}* under these assumptions is equivalent to increasing the coding level

*f*.

#### Optimizing the architecture under the constraint that the total number of neurons is constant

We now determine the optimal architecture in the case when the total number of neurons is fixed. Basically, we ask how to partition the total set of neurons between the input and the perceptron layer to maximize the classification capacity. This question is sensible if the dimensionality expansion in the input layer is not a limiting factor, as we assume that all *N* input neurons are independent.

It is straightforward to derive the optimal expansion ratio *M*/*N* from Equations 4.6 to 4.12 under the constraint *M* + *N* = constant.

In the uniform regime, unless the noise is very high or the representations are very sparse (see Eqs. 4.6 and 4.7), the expansion ratio that maximizes the capacity can be approximated by the following:
The number of perceptrons *M* is much smaller than the number of inputs *N* (converging architecture) and the number of feedforward connections per input neuron *c* ≈ .

In the uniform, high-noise regime, for very sparse representations or very high levels of noise, *C _{F}f*

^{2}(1 −

*f*)β

^{2}≪ Δ

_{UH}

^{2}, the optimal expansion ratio is given by the following: which implies a higher proportion of the perceptrons

*M*/(

*M*+

*N*) compared with the previous result.

For the two-subnetwork regime with intermediate noise (Eq. 4.8), the result for the optimal expansion ratio is the same unless the input representations are extremely sparse, in which case the optimal proportion of the perceptrons increases.

The optimal expansion ratio for the low noise is identical to the sparse limit of the intermediate-noise case, as follows:
To summarize, our model predicts that under the constraint on the total number of neurons *M* + *N*, the optimal expansion ratio is given by *M*/*N* ≈ (*c* = ), unless the input representations are very sparse or the noise is very large, in which case the optimal proportion of the perceptrons increases. For the intermediate levels of noise, this increase is less profound (happens for more sparse representations).

### Multinomial classification

We now turn to a more difficult problem of classifying the inputs into more than two categories. The scheme presented above can be generalized in a straightforward way to serve as a multinomial classifier. We first present the generalization of the model where instead of a single population of perceptrons in the intermediate layer, we assume multiple subpopulations, each of which is selective to its own class of input patterns. The model requires that the recurrent connectivity in the intermediate layer is restricted to pairs of neurons belonging to the same subpopulation. This scheme implies that the patterns of activity in the intermediate layer, as well as its connectivity structure, have a specific design, which should be imposed to the network and most likely is driven by top–down signals. The architecture that we will describe is probably the result of a learning process, although here we just focus on the already structured network and we do not model explicitly the synaptic modifications that lead to it.

We later discuss a more realistic scenario that supports multinomial classification. Namely, we assume that the desired activity pattern of the intermediate layer in response to each class of the input pattern is chosen randomly. In this case, the only role of the external supervisor is to guarantee that the activity pattern of the perceptron layer is the same during the presentation of different input patterns belonging to the same class. In this case, the recurrent connectivity within the perceptron layer is random and sparse, as for binary classification, but unlike the latter, the strength of the existing connections are plastic and are modified by the desired activity patterns via Hebbian plasticity.

This last, more realistic scenario cannot be considered analytically, but we show with simulations that the capacity decrease compared with binary classification is moderate and, most importantly, that the linear scaling with the network size is preserved.

#### Structured output

The immediate generalization of the recurrent readout scheme to multinomial classification task is to introduce several nonoverlapping populations of intermediate readout neurons, each of which would activate in response to a single class of input stimuli. The recurrent connectivity within a population would be as described before, while no recurrent connections would exist among the neurons belonging to distinct populations. The desired output pattern in response to an input from each class is then structured so that the population corresponding to the given class is active while the others are inactive. The final readout has to contain multiple readout units, one for each class. Their connectivity can still be sparse and random, but the sign of the connections would have to be adjusted based on whether it comes from the neuron in the population selective for the same class as the given final readout or not (Fig. 7*a*).

The classification capacity can now be computed in the same way as above, by noticing that each population is now doing a binary classification, selecting for one of *L* classes. The only difference is that the proportion of “positive” patterns (the output sparseness) is now *y* = 1/*L* instead of 1/2. The capacity formula for the case of sparse output is derived in Materials and Methods, and it differs from the capacity for a dense case by a factor that depends on *y*, as follows:
It should be noted, that the number of intermediate readouts *M*, entering this formula is the number of units in the population selective for a particular class. So, if the total number of intermediate readout units is *M*_{total}, and all populations have equal size, it is *M* = *M*_{total}/*L* = *yM*_{total} that should enter the formulas for the capacity. So, in terms of the total number of intermediate units, in the two-subnetwork intermediate-noise regime, for example Equation 4.8, we have the following:
where γ is given by Equation 4.9. There are two differences with respect to the binary classification case (Eq. 4.8). The first is the factor *L*/4(*L* − 1), which is equal to 1/2 for the case of two classes (*L* = 2). This is the reflection of the fact that when only two classes are possible, the current scheme is redundant; when the first population is active, the other is not, and vice versa. In the limit of a large number of classes, the prefactor is equal to 1/4. The other difference is in the second term in the denominator, which rescales *N*, the number of the input units. Namely, for the number of intermediate readout units, the role of correlations between them is decreased compared with the binary classification case. This is because there is no interference between the readout neurons belonging to different populations.

#### Random output

Another, more realistic scenario is to assume that each class is represented by a distinct random output pattern. In contrast to the previous scenario, now the output pattern can have a nonzero random overlap. In this case, it is necessary to train the recurrent connections, and we assumed a simple Hebbian learning rule, similar to the one used in the Hopfield model, as follows:
where ζ* _{k}^{a}* is the output pattern corresponding to class

*a*(

*a*= 1 …

*L*). In this case, there are no structurally distinct subpopulations of the intermediate readout neurons, which are defined a priori (Fig. 7

*b*). In contrast, the subpopulations of neurons that represent different classes emerge as a consequence of the learning rule (Eq. 4.13).

Figure 7 shows the simulation results for a five-way classification (*L* = 5) of dense input patterns with high dynamic noise (the parameter values are given in the caption of Fig. 7). As expected, also in this case *P* increases linearly with *N*, even when the number of incoming connections is kept constant for all neurons (i.e., it does not scale with *N*). This result also confirms the validity of our approach in a more realistic case for which it is significantly more difficult to perform analytical calculations.

Notice that in this case, although we did not perform the analysis, we know from previous studies on recurrent neural networks (Amit, 1992) that the number of recurrent connections will have to scale linearly with the total number of classes *L*. This would explain why the number of recurrent connections could be much larger than the number of feedforward inputs. A future study will address this specific issue.

### The initial condition of the recurrent network

An important assumption that we made to implement the majority vote with a recurrent readout is that the recurrent network initial condition is unbiased or, in other words, that *m*_{0} = 0 (see subsection Mean field analysis of the recurrent dynamics). This condition might sound difficult to realize in a network that is basically designed to amplify any small deviation from *m*_{0} = 0. However, this condition could be realized as follows: assume that before a pattern to be classified is presented, the input layer is spontaneously active. This spontaneous activity generates a feedforward input *h _{k}^{sp}*, which causes the disordered state (

*m*= 0) to be the only stable state of the recurrent network (Fig. 2

*a*). There are two conditions on the statistics of

*h*that are required to have

_{k}^{sp}*m*= 0 as the only stable state of the system in the mean field approximation. The first requirement is that

*h*has zero expectation value, which is satisfied if the patterns of spontaneous activity are not correlated with the training patterns. The second requirement is that the standard deviation of the distribution is large enough to make the slope of the sigmoidal curve of Figure 2 less than 1. For instance, in the uniform regime (see subsection The uniform regime), the latter requirement is σ

_{k}^{sp}*> (which is the opposite of Eq. 3.48), where σ*

_{h}^{sp}*is the standard deviation of the feedforward current due to spontaneous activity. When the input pattern is presented, then the noise is assumed to decrease to restore the conditions (Eq. 3.48) that allow the recurrent network to have three solutions, two stable, corresponding to the possible classification outcomes, and one unstable, which was the initial state. A reduction in noise during stimulus presentation has been observed in the study by Churchland et al., 2010.*

_{h}^{sp}### Biological interpretation and testable predictions

One of the important results of our analysis is that the sparseness of the neural activity can play an important role, even in the case in which the connectivity is very limited (see also the Introduction). We showed in Figure 5 that the optimal coding level *f* (i.e., the average fraction of active neurons) is always approximately *f* = 0.05, when the number of feedforward afferent connections per neuron *C _{F}* is assumed to be 50, which would be estimated as the average number of synaptic inputs that each neuron in CA3 receives from the DG. This is a surprising result given that the assumed connectivity is so limited, and it explains why the neural activity in the DG can be so sparse without compromising the computational performance of the hippocampal system.

Intuitively, the optimal sparseness can be explained as follows. One of the reasons why the classification capacity is limited is the noise in the synaptic strengths introduced by the other patterns that have been learned by the classifier. This noise increases as the representations become denser (*f* increases), which leads to a decrease of the classification capacity for dense representations. However, when the coding level is too small, the fraction of intermediate readouts whose inputs are all silent, and hence not informative, becomes larger and the capacity again decreases.

The optimal coding level also has an interesting dependence on the number of patterns *P* that should be classified, and this dependence generates a specific prediction on how the richness of the environment can change the sparseness of the representations in the DG.

We assume that the number of input neurons *N*, which in the application to the mammalian hippocampus corresponds to the number of dentate gyrus granular cells, cannot change substantially. Although adult neurogenesis was observed in this area, the number of adult-born cells is negligible compared with the total number of neurons. It is also reasonable to assume that having higher *f* (i.e., higher activity in the DG) has a metabolic cost.

When the animal is put into an enriched environment and has to learn to classify more input patterns *P* than before while keeping *N* fixed, the classification performance of the network, expressed as *P*/*N* should increase. This can be achieved either by increasing the number of neurons in the perceptron layer (the CA3 area in the application to the hippocampus), by increasing the number of feedforward (DG–CA3) connections per perceptron (CA3 pyramidal cell) *C _{F}*, or, finally, by changing the coding level

*f*of the input (DG) representations.

In line with the main assumption of our work, the number of connections per perceptron *C _{F}* is limited by spatial constraints and cannot be increased further. Increasing the number of cells is also unlikely because the number of required additional neurons would scale as

*N*, and this would require additional wiring and more energy. There is a more efficient alternative, which is to adjust the coding level

*f*.

As neuronal firing has a metabolic cost associated with it, we hypothesize, that *f* is kept as low as possible so that the demand on the number of patterns *P* saturates the classification capacity of the network. In other words, if the required number of classifiable patterns is *P*_{1}, the coding level *f*_{1} is such that the point (*P*_{1}/*N*, *f*_{1}) lies on the curve describing the maximal classification capacity of the network (Fig. 8*a*). There can be two values of *f* that correspond to the same *P*_{1}, of which the network prefers the lower one (on the left from the maximum). When the environment is enriched and the animal needs to classify *P*_{2} > *P*_{1} patterns, the coding level should increase to *f*_{2} > *f*_{1}, so that the point (*P*_{1}/*N*,*f*_{1}) is still on the same curve as shown in Figure 8*a* (we assume that *c* = cannot change).

In Figure 8*b*, we present the predicted relation between the coding level *f* and the number of patterns *P* that the animal has to learn to classify correctly (richness of environment), as estimated by our model. Different colors correspond to different values of the feedforward connections per input neuron *c* (number of DG–CA3 connections per DG granular cell) and the different regimes described above. It should be pointed out that the parameter *c* is difficult to measure in an experiment, as the number of perceptrons *M* does not directly correspond to the number of cells in the CA3 area, but rather to the number of cells in the subpopulation whose target activity is assumed to be uniform (see subsection Multinomial classification). However, while the prediction that the coding level should increase with the richness of the environment does not depend on *c*, it is true for a wide range of values of *c* and it is valid for both the intermediate- and low-noise regimes, which are likely to be the only two regimes that are relevant for a computationally efficient biological system. The prediction of the model is that the coding level of neural representations in the DG, which could be measured using calcium imaging, would increase with the richness of the environment. If this is observed, then it would be interesting to determine the role of neurogenesis. The number of newborn neurons in adult animals is probably too small to account for the increase in *N* that would be needed to deal with an enriched environment. However, the newborn neurons could have a significantly larger coding level *f* and affect the effective coding level of DG in a more substantial way. A new model with mixed coding levels would have to be studied to produce specific quantitative predictions.

## Discussion

We presented a model network based on perceptrons in which all the neurons have limited connectivity, and nevertheless the classification capacity grows unboundedly and linearly with the size of the network. The limitations on the classification capacity of the individual perceptrons that are imposed by the limited connectivity are overcome by reading out multiple perceptrons, as in a committee machine. However, the readout mechanism is different from the one normally used in committee machines as it uses a recurrent attractor dynamics of committee members to generate a final vote. Thanks to the recurrent dynamics, it is then possible to read out a small sample of all the committee members to determine the committee decision. This allows for readouts that have a limited connectivity, even when the size of the network becomes very large. The limitations imposed on the number of connections per neuron make the proposed network less efficient than a single readout with unlimited connectivity when the total number of feedforward connections is considered. However, the decrease in classification performance is modest unless the input representations become extremely sparse. Moreover, and most importantly, the scaling properties of our neural system are the same as those of the readout with unlimited connectivity.

Recent theoretical studies (Barak et al., 2013; Cayco-Gajic et al., 2017; Litwin-Kumar et al., 2017) considered a neural architecture that is similar to the one that we analyzed. In all of these studies, the input neurons are connected to an intermediate layer of neurons and then read out by a single cell with unlimited connectivity. The inputs are completely random in the study by Litwin-Kumar et al. (2017), are low-dimensional and correlated in the study by Barak et al. (2013), and are highly correlated in the study by Cayco-Gajic et al. (2017). The intermediate layer makes the neural representations linearly separable by expanding the dimensionality, so that the readout, which is linear, can be trained. One of the studies (Litwin-Kumar et al., 2017) also shows that this dimensionality expansion can be efficiently implemented using random connectivity, and, surprisingly, the optimal number of random connections per neuron is very small. In our case, we also discuss a situation in which the connectivity is limited, but our work addresses a different computational problem: in the study by Litwin-Kumar et al. (2017), the authors focused on the problem of dimensionality expansion and showed that large connectivity can actually reduce the performance of a downstream classifier that reads out randomly connected neurons. In their case, which nicely applies to the cerebellum, the classifier had unlimited connectivity. Here we focused on the problem of how to implement this downstream classifier under the constraint of limited connectivity. In our case, the constraint on limited connectivity is imposed because of the metabolic and spatial cost of wiring, whereas in the study by Litwin-Kumar et al. (2017) it emerges as a requirement for optimizing the ability to expand the dimensionality of the neural representations. In the study by Cayco-Gajic et al. (2017), the authors showed that sparse connectivity can be beneficial in a problem of pattern separation, but again it is the connectivity of the neurons in the intermediate layer that they refer to, and not the readout.

In our article and in one of the articles discussed in the previous paragraph (Litwin-Kumar et al., 2017), the input patterns were assumed to be random and uncorrelated. For any analysis of the performance of a neural circuit, we need to make an assumption about the nature of the inputs and random patterns is a standard assumption enabling theorists to perform analytical calculations. Although real-world sensory inputs are likely to be highly structured and correlated, it is not completely unreasonable to believe that, at least in some brain areas like the hippocampus, the patterns of activity can be modeled as random and uncorrelated. Indeed, the hippocampus is known to be involved in memory consolidation. The most efficient way of storing real-world correlated memories is to memorize only their uncorrelated component. In other words, if it is possible to compress the memories by taking advantage of their correlations, the resulting compressed representations will look random and uncorrelated (Fusi, 2017). This is a process that is normally not modeled explicitly, although there are some models predicting that the hippocampus might be involved in this compression process (M.K. Benna and S. Fusi, unpublished observations).

One of the results that we discussed in our article is that there are situations in which the proposed recurrent readout scheme can outperform classical readouts that are based on a majority vote despite the fact that the majority vote would require a significantly larger readout connectivity (see subsections Nonuniform regimes and Two-subnetwork regime, intermediate noise). For the majority vote scheme, the classification capacity drops drastically when the input representations are very sparse because the fraction of classifiers whose inputs are all silent becomes substantial and these classifiers just contribute to the noise. Instead, for the recurrent readout the classification capacity can be kept high even for very sparse representations in certain parameter regimes because the recurrent dynamics can align the “free” classifiers to the majority decided by the other, informative classifiers. The lower limit on the coding level *f*, below which the capacity drops is determined by the amount of noise in the recurrent dynamics, the expansion ratio, and the number of feedforward connections per perceptron (Fig. 5).

In general, the proposed system is robust to both sparse connectivity and sparse representations, which makes it suitable to describe neural circuits like the DG and CA3 area, where the number of connections of downstream neurons (CA3) is much smaller than the number of neurons in the input (DG) and the neural activity in the input can be very sparse. CA3 is known to have the recurrent connections that would implement our proposed readout mechanism. We showed that for intermediate noise levels (see Results, sections Nonuniform regimes and Two-subnetwork regime, intermediate noise), the classification capacity stays within a reasonable range even when the expected number of active units read out by each perceptron is smaller 1 (Fig. 5*e*,*f*). This result nicely complements the findings of the studies by Barak et al. (2013), Litwin-Kumar et al. (2017), and Cayco-Gajic et al. (2017), where the authors show that low-dimensional correlated inputs require an intermediate layer of neurons (randomly connected in the study by Barak et al., 2013; randomly connected or learned in the studies by Cayco-Gajic et al., 2017; Litwin-Kumar et al., 2017). For these neurons in the intermediate layer, there is an optimal sparseness level, which minimizes the generalization error of a single perceptron-like readout. In the study by Cayco-Gajic et al. (2017), the authors also showed that sparse representations in the intermediate layer are important for performing pattern discrimination. Here we showed that there is a readout scheme that would also work for the sparse representations required in the intermediate layer in the studies by Barak et al. (2013), Cayco-Gajic et al. (2017), and Litwin-Kumar et al. (2017), and does not require an unreasonably large number of long-range connections.

Our model is intentionally abstract with binary neurons, and no separation between excitation and inhibition. However, the dynamics of the recurrent network is very similar to the attractor dynamics so widely studied first in abstract models like the model of Hopfield (1982) and then in more realistic models that contain integrate-and-fire neurons with dynamic synapses, sparse representations (Amit and Brunel, 1997; Wang, 1999; Compte et al., 2000; Brunel and Wang, 2001), and also, in some cases, plastic synapses (Amit and Mongillo, 2003). We believe that the path that goes from our abstract model to more realistic models such as those just described is very similar to the one already followed. This will be an interesting future project, which most likely will confirm the scaling properties that we derived and discussed in our article, as it happened in the case of the attractor neural networks based on the pioneering work of John Hopfield.

## Footnotes

S.F. is supported by the Gatsby Charitable Foundation, the Simons Foundation, the Schwartz Foundation, the Kavli Foundation, and the National Science Foundation NeuroNex Program Award DBI-1707398. L.K. was supported by Grants ANR-10-LABX-0087, ANR-10-IDEX-0001-02 and ERC grant PrediSpike, ERC - 312227.

- Correspondence should be addressed to Stefano Fusi, Center for Theoretical Neuroscience, College of Physicians and Surgeons, Jerome L. Greene Science Center, 3227 Broadway, 5th and 6th Floors, Quad D, Columbia University, New York, NY 10027. sf2237{at}columbia.edu