## Abstract

Organisms evolve as compromises, and many of these compromises can be expressed in terms of energy efficiency. For example, a compromise between the rate of information processing and the energy consumed might explain certain neurophysiological and neuroanatomical observations (e.g., average firing frequency and number of neurons). This perspective reveals that the randomness injected into neural processing by the statistical uncertainty of synaptic transmission optimizes one kind of information processing relative to energy use. A critical hypothesis and insight is that neuronal information processing is appropriately measured, first, by considering dendrosomatic summation as a Shannon-type channel (Shannon, 1948) and, second, by considering such uncertain synaptic transmission as part of the dendrosomatic computation rather than as part of axonal information transmission. Using such a model of neural computation and matching the information gathered by dendritic summation to the axonal information transmitted, *H*(*p**), conditions are defined that guarantee synaptic failures can improve the energetic efficiency of neurons. Further development provides a general expression relating optimal failure rate, *f*, to average firing rate, *p**, and is consistent with physiologically observed values. The expression providing this relationship, *f* ≈ 4^{−H(p*)}, generalizes across activity levels and is independent of the number of inputs to a neuron.

- computation
- efficiency
- energy
- entropy
- information theory
- mutual information
- optimization
- quantal failures
- Shannon

This paper interrelates three topics: synaptic failure rates, dendrosomatic information processing, and neuronal energy use. As an introduction we briefly address each topic.

In the hippocampus and in neocortex, excitatory synaptic connections dominate and are remarkably unreliable. Each synapse transmits, at most, a single standardized package called a quantum (∼10^{4} neurotransmitter molecules). When an action potential arrives presynaptically, the probability of evoking the release of one such quantal package is reported to range from 0.25 to 0.5, with 0.5 being less common and 0.25 being quite common (Thomson, 2000), especially when one takes into account the spontaneous rates of neurons (Stevens and Wang, 1994; Destexhe and Paré, 1999). The failure of quantal synaptic transmission is a random process (Katz, 1966) and is counterintuitive when it exists under physiological conditions. After all, why go to all the trouble, and expense, of transmitting an action potential if a synapse does not use it? The observation of synaptic failures is particularly puzzling in light of observations outside of neocortex showing that failure-free, excitatory synaptic transmission can exist in the brain (Paulsen and Heggelund, 1994, 1996; Bellingham et al., 1998). One insight that clarifies this puzzle is that systems with low failure rates tend to form clusters of synapses, *boutons terminaux*, on a single postsynaptic target neuron, whereas those that fail predominantly form *en passage* synapses with many (thousands or tens of thousands of) postsynaptic neurons. For *en passage* systems, a particular spike works at some synapses but not at others. Thus, in such *en passage* situations, failure at the axon hillock is not equivalent to random synaptic failure because a large number of synapses will transmit, just not a large percentage. Still, failures have the feel of inefficiency. Here we show that quantal failures can be viewed as an energy efficiency mechanism relative to the information that survives neuronal information processing.
That is, under certain circumstances failures will not lower the transmitted computational information of a postsynaptic neuron, but they will lower energy consumption and heat production. The relationship between energy and information has, at least implicitly, been an issue in physics since the time of Maxwell (Leff and Rex, 1990). Today this relationship continues to be discussed, particularly because energy consumption, or heat generation, may place the ultimate limits on man-made computation. In the context of biological computation, such issues also seem particularly relevant because of the large fraction of our caloric intake that goes directly and indirectly toward maintaining brain function (Sokoloff, 1989; Attwell and Laughlin, 2001). Indeed, because of such costs, we proceed under the hypothesis that, microscopically, natural selection has approximately optimized energy use as well as information processing in constructing the way neurons compute and communicate. Recent successes and interest arising from this hypothesis of a joint optimization (Levy and Baxter, 1996; Laughlin et al., 1998; Andreou, 1999; Abshire and Andreou, 2001; Balasubramanian et al., 2001; Schreiber et al., 2001) encourage us to continue examining the possibility that neuronal communication and computation are efficient when considered in the dual context of energy and information rather than either context alone. Particularly encouraging is the energy audit of Attwell and Laughlin (2001). This work concludes that >85% of the energy consumed by the neocortical neuropil goes toward recovering from the ion fluxes that are, in effect, all of the computation and communication within the neocortex.

## MEASURING INFORMATION

At the level of an individual neuron, neuronal computation can be sensibly quantified by viewing a computational transformation as a communication system. The aptness of using information-theoretic ideas for analyzing analog computation was pointed out by von Neumann (see Bremermann, 1982) and by Bremermann (1982), and mutual information is one of several possible measures that seems worth calculating to quantify analog computation. Although these authors give us no details, an analog computation is, most simply, a transformation *X → f(X)*, so that the mutual information, *I*(*X*;*f*(*X*)), is obviously relevant and can aptly be called the information available from neuronal integration.

As is traditional (Shannon, 1948), mutual information is defined as:
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X),
where H(X) and H(Y) are Shannon entropies, H(X|Y) and H(Y|X) are conditional entropies, and logarithms are base two.

When the conditional entropy is zero (e.g., H(X|Y) = 0), then mutual information, I(X;Y), equals the entropy H(X). Because this is true for neocortical axons (Mackenzie and Murphy, 1998; Cox et al., 2000; Goldfinger, 2000), we were able to use entropy rather than mutual information when studying axons.

Previously (Levy and Baxter, 1996) we noted that, solely in the context of signaling information capacity, or equivalently representational capacity, information alone is not optimized by neocortical neurons. In the neocortex, where the maximum spike frequency of a pyramidal neuron is ∼400 Hz, the average rate of axonal spiking is 10–20 Hz, not the 200 Hz optimal for information transmission alone. At the other extreme, there would be no energetic cost if a neuron did not exist, so energy alone is not optimally conserved. However, forming the ratio of information transmitted by an axon to the energy it consumes (a measure whose ultimate dimension is bits per joule) leads to an optimal spike rate value that fits with observed values of spike rates and energy consumption (Levy and Baxter, 1996). This particular optimization is critical to what follows.
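The bits-per-joule idea can be sketched numerically. The linear cost model below (a fixed maintenance cost plus a per-spike cost, with an arbitrary cost ratio) is our illustrative assumption, not the actual cost model of Levy and Baxter (1996); it simply shows how forming the ratio H(p)/energy pushes the optimal firing probability far below the capacity-maximizing p = 0.5.

```python
import math

def H(p):
    """Binary entropy in bits per signaling interval."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bits_per_joule(p, rest_cost=1.0, spike_cost=20.0):
    """Information per unit energy for one binary signaling interval.
    The cost constants are hypothetical placeholders, not measured values."""
    return H(p) / (rest_cost + spike_cost * p)

# Grid search for the energy-optimal firing probability.
grid = [i / 10_000 for i in range(1, 5_000)]
p_opt = max(grid, key=bits_per_joule)
print(round(p_opt, 3))
```

With this hypothetical 20:1 spike-to-rest cost ratio the optimum lands near 0.1 rather than 0.5; making spikes relatively more expensive pushes the optimum down toward the observed 0.025–0.05 range.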

The information flow for a single neuron is depicted in Figure 1*A*. The notation and its biological correspondence are as follows. The random multivariate binary input to a neuron is X, and the output of this neuron is *Z*, a univariate random binary variable, {no spike, spike} ≡ {0, 1}. The spike generator, which in our model absorbs many dendritic nonlinearities, determines when dendrosomatic excitation exceeds threshold. Then *Z* = 1, and the spike is conducted by the axon away from the neuron, eventually arriving presynaptically as Z′, where the cycle begins again. Our specific interest here is the computational transformation that includes the quantal release process. As depicted in Figure 1*A*, input signals to a neuron undergo three information-losing transformations before the transformation by the spike generator: (1) quantal release–failure, (2) quantal amplitude variation, and (3) dendrosomatic summation. The release–failure process (Fig. 1*B*) produces a new binary random variate φ(X_{i}). The probability of a quantal failure is denoted by f, whereas the probability of a successful quantal release is denoted by s, and f = 1 − s. The random variate Q_{i} denotes the amplitude of the i^{th} input when release occurs. Using this notation, the information passing through the computation is explicitly expressed as the mutual information I^{C}:
I^{C} = I(X; ∑_{i}φ(X_{i})Q_{i}) = H(X) − H(X|∑_{i}φ(X_{i})Q_{i}). Also, the lack of spontaneous spikes and the faithful conduction of neocortical axons imply Z = Z′, so that I(Z;Z′) = H(P(Z)), as mentioned earlier.

It is also useful to introduce notation for the energy-optimal capacity of the axon, C^{E}, which occurs at max_{P(Z=1)}[H(P(Z))/axonal energy use], and for p*, the value of P(Z = 1) that produces C^{E}. From our earlier calculations (Levy and Baxter, 1996) and from neurophysiological observations of sensory and association cortex, p* ranges from 0.025 to 0.05 per minimum interspike interval [approximated as 2.5 msec for a synaptically driven pyramidal neuron (Levy and Baxter, 1996)]. This produces C^{E} values ranging from 0.169 to 0.286 bits per 2.5 msec. Importantly, we will suppose that both input and output neurons adhere to the same optimum.
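These capacity figures follow directly from the binary entropy function; a quick check using only the p* range quoted above:

```python
import math

def binary_entropy(p):
    """Shannon entropy of a Bernoulli(p) variable, in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# C^E = H(p*) at the two ends of the reported range (bits per 2.5 msec)
print(round(binary_entropy(0.025), 3))  # 0.169
print(round(binary_entropy(0.05), 3))   # 0.286
```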

We explicitly assume that there is an inconsequential information loss by the spike generator and that the cost of generating extra spikes is negligible. This latter assumption is justified by the energy audit of Attwell and Laughlin (2001), which showed that the energetic costs associated with action potential production in a functioning brain are highest in axons, with ∼47% of the total energy consumed (in agreement with Levy and Baxter, 1996). The next highest cost is associated with dendritic excitation, ∼34% of the total energy consumed. A relatively small amount goes to the presynaptic aspects of synaptic transmission. Perhaps the lowest cost (which is negligible) is associated with the cell body because cell bodies have such small surface areas relative to axons and dendrites. In our model, we assume that the spike generator is part of the cell body. Therefore, the cost of generating extra spikes is negligible, whereas the cost of conducting the spike down the axon is quite high. Regardless of that cost, information must still be transmitted. That is, even if one were compelled to postulate failure at the spike generator, one is still left with an average axonal usage (firing) rate of p*. Thus, it is our explicit hypothesis that information is transmitted at the optimal rate, H(p*), and we are now in a position to be much more explicit about energetically efficient computation.

*Conjecture.* Maximize the computational information developed by a neuron and its inputs to no more than the limit imposed by the information capacity of the axon whose capacity is set by optimizing the energy efficiency of its signaling.

That is, if the axonal transmitting system is energy optimized to H(p*), then this rate is an upper-bound constraint on the computational information that can be transmitted given the hypothesized spike-generating process. Moreover, when failure rates are zero, the computational information always has the potential to exceed H(p*) because this is the amount that would be available after noise-free processing by a neuron with more than just a single input. Because failure rates are not zero, this conjecture leads to the hypothesis that failure rates reduce the energy consumption of computation while not wasting any of (that is, while using all of) the axonal capacity.

Quantal failures are an excellent mechanism to create this matching because of the energy they save. [If every successive step from the arrival of an action potential presynaptically down to the depolarization of the cell body is energy consuming (Attwell and Laughlin, 2001), then a mechanism that eliminates as many of these steps as possible will save the most energy. Specifically, failure of synaptic transmission saves the cost of vesicle recycling, transmitter reuptake and repackaging, and, most of all, it saves on the cost of postsynaptic depolarization.] Moreover, because the information of computation and the information of optimal channel use are both controlled by p*, we can determine a failure rate that brings computational information exactly to its maximally transmittable rate. This failure rate then saves as much energy as possible while still allowing the neuron to develop the maximally transmittable information. Curiously, the optimal failure rate quickly becomes independent of the number of inputs, and this independence holds over the range of input numbers at which neocortical (and, indeed, many other) neurons operate.

In sum, it is our explicit hypothesis that neural computation, as well as neural communication, can be measured from the Shannon perspective of sources and channels. In pursuing an overall analysis, we have opted to partition neuronal function. As a result of this partitioning, the physical correspondence of source and channel changes as the separate parts, or functions, of a neuron are sequentially analyzed. For example, in our previous work an axon is a channel, whereas here the set of axons going into a neuron is an information source. Here the synapses and the dendritic summation process are analyzed as if they are a channel; but, as we shall see, they will also be viewed as a source for the next stage. First, however, let us develop some quantitative intuition by considering a bounding case.

*Special case.* f = 0 and Q_{i} = 1 for all i. If we consider the case with a zero failure probability and all Q_{i} = 1, then I^{C} = H(∑ X_{i}) − H(∑ X_{i}|X), and we easily obtain H(∑ X_{i}) as an upper bound on the mutual information of the computation. This upper bound occurs when the second term is zero, i.e., in the failure-free, noise-free situation with all quanta the same size. Appealing to the central limit theorem, this entropy is well approximated by the entropy of a normal distribution. Therefore, if we suppose each of the n inputs is an independent Bernoulli process with the same parameter p = p*, we get:
H(∑ X_{i}) ≈ (1/2) log_{2}(2πe · np*(1 − p*)) ≈ 6.5 bits (for n = 10,000 and p* = 0.05),
where this value of p* comes from the Levy and Baxter (1996) calculations as well as the actual observed value of average firing rates in neocortex. Although 6.5 bits is a tremendous drop from H(X), which under these assumptions is 2860 bits (10,000 inputs each with 0.286 bits), this 6.5 bits is still a very large number of bits to be transmitted per computational interval compared with the energy-efficient channel capacity of H(p*) = 0.286 bits.
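The 6.5-bit figure (and the ∼3.2-bit figure used below for extreme input dependence) can be checked with the normal-approximation entropy of the summed input; n and p* are the values quoted above.

```python
import math

def normal_entropy_bits(variance):
    """Entropy of a near-normal sum: 0.5 * log2(2*pi*e*variance) bits."""
    return 0.5 * math.log2(2 * math.pi * math.e * variance)

n, p_star = 10_000, 0.05
var = n * p_star * (1 - p_star)                 # variance of a Bernoulli sum
print(round(normal_entropy_bits(var), 1))       # 6.5 bits
print(round(normal_entropy_bits(var / 100), 1)) # 3.2 bits if every 100 inputs act as 1
```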

The reason that 6.5 bits is a tremendous excess arises when we consider Shannon's source/channel theorems. These say that the channel limits the maximum transmittable information to its capacity. As a result, any energy that goes toward producing I(X; ∑ X_{i}) in excess of the channel capacity H(p*) is wasted. This idea is at the heart of the analysis that follows. Because the total information of the computation is many times the energy-efficient channel capacity, much waste is possible. Indeed, even if we dispense with the independence assumption (while still supposing some kind of central limit result holds for the summed inputs) and suppose that the statistical dependence of the inputs is so severe that every 100 inputs act like 1 input, an approximation that strikes us as more than extreme, there are still too many bits (∼3.2 bits) being generated by the computation compared with what can be transmitted. Thus, the computation is not going to be energy-efficient if it takes energy (and it does) to develop this excess, nontransmittable computational information.

## RESULTS

We now begin the formal analysis that substantiates and quantifies the conjecture and that brings to light a set of assumptions making the conjecture true.

### Assumptions

A0: A computation by an excitatory neuron is the summation of its inputs every computational interval. The mutual information of such information processing is closely approximated as:
I^{C} = I(X; ∑_{i}φ(X_{i})Q_{i}).
A1: Axons are binary signaling devices carrying independent spikes and used at their energy optimum; that is, each axon is used at the information rate C^{E} bits per computational interval, which implies firing probability p*.

A2: The number of inputs to a neuron is not too small—say n > 2/p*. Clearly this is true in neocortex; see Fig. 3 for evaluation of this assumption.

A3: With the proviso that A1 and A2 must be obeyed, a process requiring less energy is preferred to a process requiring more energy.

A4: The spike generator at the initial segment, which incorporates generic nonlinearities operating on the linear dendritic summation, creates a bitwise code suitable for the axonal channel, and this encoding is nearly perfect in using the information received from the dendrosomatic computation. That is, as an information source the spike generator produces information at a rate of nearly H(p*).

From these assumptions we have a lemma.

Lemma 1: I^{C} ≥ H(p*). That is, the only way to use an axon at its energy optimal rate, H(p*), is to provide at least that much information to it for possible transmission.

Proof by contradiction: Providing anything less would mean that the axon could be run at a lower rate than implied by p* and, as a result, save energy while failing to attain its optimal efficiency, which contradicts (A1).

The importance of this lemma is the following: no process that is part of the computational transformation or part of energy saving in the computation or part of interfering fluctuations arising within the computation should drive I^{C} below H(p*). In particular, this lemma dictates that quantal failures, as an energy saving device, will be used (or failure rates will be increased) only when I^{C} is strictly greater than H(p*).

With this lemma, and assuming that increased synaptic excitation leads to monotonically increasing energy consumption (Attwell and Laughlin, 2001), we can prove a theorem that leads to an optimal failure rate. The average summed postsynaptic activation, E[∑_{i}φ(X_{i})Q_{i}], should be as small as possible to save energy (A3), whereas (A2) maintains n and (A1) maintains p*. This restricted minimization of average synaptic activation implies processes, including synaptic failures, that reduce energy use. When operating on the energy-efficient side of the depolarization-versus-information curve, reducing the average summed activation monotonically reduces I^{C} along with the energetic costs, and this reduction of I^{C} is unrestricted until Lemma 1 takes force. That is, the reduction of I^{C} should go as far as possible because of A3 (energy saving) but no lower than H(p*) because of the lemma. As a result, energy-optimal computation is characterized by:
I^{C} = H(p*),
an equality that we call "Theorem G." Accepting Theorem G leads to the following corollary about synaptic failures:

### Corollary F

Provided np* > 2, neuronal computation is made more energy-efficient by a process of random synaptic failures (see the quantified version and the Appendix below).

Obviously, failures are in the class of processes that lower average postsynaptic excitation, in part because I^{C} is reduced uniformly as f increases and, in part, because the associated energy consumption is also reduced uniformly. Just below, and in the Appendix, we prove a quantified version of Corollary F showing that the failure rate f producing this optimization is approximated purely as a function of p*; specifically, f ≈ 4^{−H(p*)}.

### Quantified Corollary F

Figure 2*A* illustrates the existence of a unique, optimal failure rate by showing the intersection between C^{E}, the energy-efficient capacity of the axon, and I^{C}, the information of the computation. Here we have used n = 10^{4}, p* = 0.041. From another perspective, Figure 2*B* shows how one might take some physiologically appropriate failure rate, f = 0.7, and determine the optimal p. In either case we note the single intersection of the two monotonic curves.
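The intersection in Figure 2*B* can be reproduced numerically: given a failure rate f, invert the binary entropy on (0, 1/2) by bisection to find the p satisfying H(p) = (1/2) log2(1/f), the Appendix Part A expression for I^{C}. This is our own sketch of the figure's computation, not the authors' code.

```python
import math

def H(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def I_C(f):
    """Information of the computation, Part A approximation: 0.5*log2(1/f)."""
    return 0.5 * math.log2(1.0 / f)

def p_from_f(f, tol=1e-9):
    """Invert H on (0, 0.5) by bisection so that H(p) = I_C(f)."""
    target = I_C(f)
    lo, hi = 1e-12, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(p_from_f(0.7), 3))  # ≈ 0.043, a physiologically plausible p
```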

The generality of what this figure shows is established in the Appendix. Specifically, Part A assumes equal Q_{i} values, whereas Parts B and C allow Q_{i} to vary; they show:
I(∑ X_{i}; ∑_{i}φ(X_{i})Q_{i}) ≈ (1/2) log_{2}(1/f).
Because Theorem G requires *I(X; ∑_{i}φ(X_{i})Q_{i}) = H(p*),* the two results combine, yielding
f ≈ 4^{−H(p*)},
a statement that is notable for its lack of dependence on n, the number of inputs to a neuron. This lack of dependence, illustrated for one set of values in Figure 3, endows the optimization with a certain robustness. Moreover, the predicted values of f also seem about right. For example, p* = 0.05 implies f = 0.67, whereas other values can be read off of Figure 4. Thus, by choosing a physiologically observed p*, the relationship produces failure rates in the physiologically observed range. On these two accounts (the robustness and the prediction of one physiological observation from another, nominally independent, experimental observation), we reap further rewards from the analysis of microscopic neural function in terms of energy-efficient information.
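The prediction f ≈ 4^{−H(p*)} is easy to tabulate; the p* values below are those used in the text and figures.

```python
import math

def H(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def optimal_failure_rate(p_star):
    """f = 4**(-H(p*)); note that n never enters the formula."""
    return 4.0 ** (-H(p_star))

for p_star in (0.025, 0.041, 0.05):
    print(p_star, round(optimal_failure_rate(p_star), 2))
# 0.025 -> 0.79, 0.041 -> 0.71, 0.05 -> 0.67
```

Consistent with the text, the optimal failure rate decreases as p* increases, and it never drops below the f ≥ 0.25 bound implied by H(p) ≤ 1.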

The more involved proof of Part B sheds light on the size of one source of randomness (quantal size) relative to another (failure rate). Taking the SD of quantal size to be 12.5% of the mean quantal size leads to an adjustment of about s/65 in the implied values of f. For example, suppose that with no variation in Q_{i} the optimal failure rate is 70%; taking the variation of Q_{i} into account adjusts this value up to 70.46%. Clearly, the effect of quantal size variation is inconsequential relative to the failure process itself.
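This adjustment is consistent with the quantal-size model of Part B: a coefficient of variation of 12.5% corresponds to a Poisson amplitude parameter α = 1/0.125² = 64 (the value quoted there), and the s/65 correction reproduces the 70% → 70.46% example. The arithmetic:

```python
cv = 0.125                        # SD of quantal size / mean quantal size
alpha = round(1 / cv ** 2)        # Poisson amplitude parameter (Part B's alpha)
f = 0.70                          # optimal failure rate ignoring quantal variation
s = 1 - f
adjusted_f = f + s / (alpha + 1)  # the s/65 adjustment quoted in the text
print(alpha, round(100 * adjusted_f, 2))  # 64 70.46
```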

## DISCUSSION

In addition to the five assumptions (A0–A4) listed above, we made two other implicit assumptions in the analysis. First, we assumed additivity of synaptic events. Although this assumption may seem unreasonable, recent work (Cook and Johnston, 1997, 1999; Magee, 1999, 2000; Poolos and Johnston, 1999; Andrásfalvy and Magee, 2001) makes even a linear additivity assumption reasonable. The observations of Destexhe and Paré (1999), showing a very limited range of excitation, also make a linear assumption a good approximation. Even so, we have explicitly incorporated any nonlinearities that might operate on this sum by grouping them with the spike generator. Second, we have assumed binary signaling. Very high temporal resolution, in excess of 2 × 10^{4} Hz, would allow an interspike interval code that outperforms the energetic efficiency of a binary code. Our unpublished calculations (which of necessity must guess at spike timing precision, including spike generation precision, spike conduction dither, and spike time decoding precision; specifically, a value of 10^{−4} msec was assumed) indicate that the p* for such an interspike interval code would be ∼50% greater than the p* associated with binary coding, as well as being more energetically efficient. However, we suspect such codes exist only in early sensory processing and at the input to cerebellar granule cells. Systems such as those considered here, with single-quantum synapses, quantal failures, and 10–20 Hz average firing rates, would seem to suffer inordinately using interspike interval codes; a quantal failure can cause two errors per failure, and the observed firing rates are suboptimal for an interspike interval code but fit the binary hypothesis.

The relationship f ≈ 4^{−H(p*)} partially confirms, but even more so corrects, the intuition that led us to do this analysis. That is, we had thought that the excess information in the dendrosomatic computation could sustain synaptic failures and still be large enough to fully use the energy-efficient capacity of the axon, C^{E}. However, this same intuitive thinking also said that the more information a neuron receives, i.e., as either p* or n grows, the more the failure rate can be increased; this thought is wrong with regard to both variables.

First, the relationship f ≈ 4^{−H(p*)} tells us that the optimal failure rate actually decreases as p* increases, so intuitive thinking had it backwards. We had thought in terms of the postsynaptic neuron adding up its inputs. In this case, the probability of spikes is like peaches and dollars: the more you possess, the less each one is worth to you. This viewpoint led to the intuition that, when there are more spikes, any one of them can be more readily discarded; i.e., f can be safely increased when p increases. However, this intuition ignored the output spike generator that neuronal integration must supply with information. Here at the generator (and its axon and each of its synapses) the probability of spikes is very different from peaches and dollars: because the curve for binary entropy, H(p), increases as p increases from 0 to 1/2, increasing probability effectively increases the average worth of each spike and, as well, of each nonspike; so it is more costly to discard one. This result, which only became clear to us by quantifying the relationships, leads to optimal failure rates that are a decreasing function of p*.
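The entropy-curve argument can be made concrete: the slope dH/dp = log2((1 − p)/p) is positive throughout (0, 1/2), so at the low firing probabilities relevant here each additional unit of spike probability raises the axon's transmittable information. A quick numerical confirmation:

```python
import math

def H(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def dH_dp(p):
    """Analytic derivative of binary entropy: log2((1-p)/p)."""
    return math.log2((1 - p) / p)

# H is increasing on (0, 1/2); the analytic slope matches a numeric one.
eps = 1e-7
numeric = (H(0.05 + eps) - H(0.05 - eps)) / (2 * eps)
print(round(dH_dp(0.05), 3), round(numeric, 3))
```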

Second, in the neocortically relevant situation, where n is in the thousands, if not tens of thousands, changing n has essentially no effect on the optimal failure rate (Fig. 3). Indeed, the lower bound (A2) is so generous relative to actual neocortical connectivity that there is no way to limit connectivity (and thus, no way to optimize it) based on saving energy in the dendrosomatic computation modeled here. To say it another way, one should look elsewhere to explain the constraints on connectivity [e.g., ideas about volume constraints as in Mitchison (1991) or Ringo (1991), or ideas about memory capacity (Treves and Rolls, 1992)].

Thus, we are forced to conclude that, to a good approximation, once n times p* is large enough, I^{C} depends only on the failure rate, an observation that is visualized by comparing the I^{C} curves of Figures 2, *A* and*B*, and 3.

In sum, the failure channel can be viewed as a process for lowering the energy consumption of neuronal information processing, and synaptic failures do not hurt the information throughput when the perspective is broad enough. More exactly, the optimized failure channel decreases energy consumption by synapses and by dendrites while still allowing the maximally desirable amount of information processing. This result is achieved when I^{C} = C^{E} = H(p*), and this condition implies the optimal failure rate is solely a function of p*.

Finally, optimizations such as these support the long, strong (Barlow, 1959; Rieke et al., 1999; Dayan and Abbott, 2001) and now increasingly popular [e.g., inter alia, Bialek et al., 1991; Tovee and Rolls, 1995; Theunissen and Miller, 1997; Victor, 2000; Attwell and Laughlin, 2001; see also articles in Abbott and Sejnowski (1999)] tradition of analyzing brain function using Shannon's ideas. Successful parametric optimizations like the one presented here (and those produced in some of the previously cited references) reinforce the validity of using entropy-based measures to describe and analyze neuronal information processing and communication. Such results also stimulate hypotheses (Weibel et al., 1998): e.g., not only does natural selection take such measures to heart, but it often does so in the context of energy efficiency.

## Appendix

In this part we relate the quantal failure rate, f, to H(p*), the energy-efficient channel capacity. Parts A, B, and C produce essentially the same result, but Parts A and C are simpler. Part A develops the result when the failure process is assumed to be by far the largest source of noise. Parts B and C relax this assumption to include the effect of variable quantal size which (as shown in Part B) turns out to be negligibly small. Throughout, we assume all synaptic weights are identically equal to one. However, a small number of multiple synapses can accommodate variable synaptic strength without changing the result.

### Part A

A neuron receives an n-dimensional binary input vector X. Each component of the input vector, X_{i}, is a Bernoulli random variable, P(X_{i} = 1) = p*. Define y = ∑ X_{i} as a realization of the random variables summed (without quantal failures). The failure process, φ(), produces a new random variable denoted φ(X_{i}). Then denote y^{φ} = ∑ φ(X_{i}) as a realization of the summed input subject to the failure process. The quantal success rate, the complement of the failure rate, is P(φ(X_{i}) = 1|X_{i} = 1) ≝ *s = 1 − f*, and the other characteristic of such synapses is P(φ(X_{i}) = 0|X_{i} = 0) = 1. We want to examine the mutual information between X and ∑ φ(X_{i}) when Theorem G is obeyed, that is, when I(X; ∑ φ(X_{i})) = H(p*).

First note that because the failure channel at one synapse operates independently of all other inputs defined by X, because the sums ∑ X_{i} = y partition the X values, and because P(∑ φ(X_{i})|X = x, where x ⇒ ∑ X_{i} = y) = P(∑ φ(X_{i})|∑ X_{i} = y), then
H(∑ φ(X_{i})|X) = H(∑ φ(X_{i})|∑ X_{i}),
so that
I(X; ∑ φ(X_{i})) = I(∑ X_{i}; ∑ φ(X_{i})).
We assume that ∑ X_{i} can be modeled as a Poisson random variable with parameter λ = np, where n is the number of inputs (e.g., n ≈ 10,000 and p ≈ 0.05, so λ ≈ 500). That is,
P(∑ X_{i} = y) = e^{−λ}λ^{y}/y!,
and, likewise, when the failure mechanism is inserted:
P(∑ φ(X_{i}) = y^{φ}) = e^{−λs}(λs)^{y^{φ}}/y^{φ}!,
another outcome evolving from the fact that quantal failures occur independently at each activated synapse. On the other hand, the summed response conditioned on ∑ X_{i} is binomial with parameters (y, s); that is,
P(∑ φ(X_{i}) = y^{φ}|∑ X_{i} = y) = C(y, y^{φ}) s^{y^{φ}} f^{y−y^{φ}},
where C(y, y^{φ}) is the binomial coefficient. But this inconvenient form, with its normalization term depending on the conditioning variable, can be reversed. Using the definition of conditional probabilities and our knowledge of the marginals:
P(∑ X_{i} = y|∑ φ(X_{i}) = y^{φ}) = e^{−λf}(λf)^{t}/t!,
where t = y − y^{φ}; note that t ≥ 0 because of the way the failure channel works (i.e., y ≥ y^{φ}).
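That the reversed conditional is Poisson with parameter λf is the classic Poisson "thinning" property, and it is easy to check by simulation. The sketch below uses a binomial input model with our own parameter choices (n = 1,000, p = 0.05, s = 0.33) and only checks the first two moments of t = y − y^{φ} against the predicted Poisson parameter.

```python
import random

random.seed(7)
n, p, s = 1_000, 0.05, 0.33          # s = success rate, f = 1 - s
lam_f = n * p * (1 - s)              # predicted Poisson parameter for t

trials = 5_000
ts = []
for _ in range(trials):
    y = sum(random.random() < p for _ in range(n))      # y = sum of X_i
    y_phi = sum(random.random() < s for _ in range(y))  # surviving quanta
    ts.append(y - y_phi)                                # failures, t = y - y^phi

mean_t = sum(ts) / trials
var_t = sum((t - mean_t) ** 2 for t in ts) / trials
print(round(mean_t, 1), round(var_t, 1))  # both near lam_f = 33.5
```

The mean and variance of t are both close to λf, the Poisson signature (the variance runs slightly low because the base input here is binomial rather than exactly Poisson).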

Now it can be seen that both P(∑ X_{i} = y) and P(∑ X_{i} = y|∑ φ(X_{i}) = y^{φ}) are Poisson distributions, with parameters λ and λ(1 − s), respectively. Greatly simplifying further calculations, this second Poisson parameter is independent of its conditioning variable y^{φ} and is particularly easy to conditionally average because all summations occur over the same range; i.e., note that:
H(∑ X_{i}|∑ φ(X_{i})) = ∑_{y^{φ}} P(∑ φ(X_{i}) = y^{φ}) H(∑ X_{i}|∑ φ(X_{i}) = y^{φ}),
where each term H(∑ X_{i}|∑ φ(X_{i}) = y^{φ}) is the entropy of a Poisson distribution with the same parameter λ(1 − s).
To compute I(∑ X_{i}; ∑ φ(X_{i})) we will use the relation:
I(∑ X_{i}; ∑ φ(X_{i})) = H(∑ X_{i}) − H(∑ X_{i}|∑ φ(X_{i})),
and the normal approximation for the entropy; that is, for a Poisson distribution with parameter (i.e., variance) λ large enough, the entropy is very nearly (1/2) log_{2}(2πeλ). So for the two distributions with the parameters λ and λ(1 − s), respectively, subtracting one entropy from the other yields
I(∑ X_{i}; ∑ φ(X_{i})) = (1/2) log_{2}(2πeλ) − (1/2) log_{2}(2πeλ(1 − s)) = (1/2) log_{2}(1/f).
Thus, *I(∑ X_{i}; ∑ φ(X_{i}))* matches the energy-efficient channel capacity H(p*) when f = (1/4)^{H(p*)}. So the quantal failure rate is uniquely determined by p* and is independent of the number of inputs n, provided that n is sufficiently large. [In fact, sufficiently large is not very large at all. When the Poisson parameter is two, the relative error of the normal approximation is <4% (Frank and Öhrvik, 1994).] Furthermore, because H(p) ≤ 1, the quantal failure rate has the lower bound f ≥ 0.25.
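Part A's key quantity, (1/2) log2(1/f), can also be checked against numerically summed Poisson entropies, with no normal approximation at all; the λ and f below are representative values from the text.

```python
import math

def poisson_entropy_bits(lam):
    """Numerically summed entropy of a Poisson(lam) distribution, in bits."""
    h = 0.0
    width = 10 * math.sqrt(lam) + 10      # +/- 10 SD covers essentially all mass
    for k in range(max(0, int(lam - width)), int(lam + width)):
        log_p = -lam + k * math.log(lam) - math.lgamma(k + 1)
        p = math.exp(log_p)
        if p > 0.0:
            h -= p * math.log2(p)
    return h

lam, f = 500.0, 0.67
exact = poisson_entropy_bits(lam) - poisson_entropy_bits(lam * f)
approx = 0.5 * math.log2(1 / f)
print(round(exact, 3), round(approx, 3))  # both ≈ 0.289
```

At λ = 500 the correction terms in the two entropies nearly cancel, which is why the normal approximation is so accurate here.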

### Part B

The following mathematical development accounts for the variation in synaptic excitation caused by the failure channel plus the variance in quantal size. A number of approximations are involved, but they are all of the same type. That is, we will go back and forth between discrete and continuous distributions (and between summation and integration) with the justification that if two distributions are approximated by the same normal distribution, they can be used to approximate each other.

Each successful transmission, φ(X_{i}) = 1, results in a quantal event size Q_{i} = q_{i}.

The number of synapses is n, and the failure rate is f = 1 − s, where s is the success rate. Each input is governed by a Bernoulli random variable with parameter p.

- X ∈ {0,1}^{n}, the vector of inputs;
- φ(X) ∈ {0,1}^{n}, the vector of active inputs passing through the failure channel;
- ∑ X_{i} ∈ {0, 1, …, n}, the number of active input axons;
- ∑ φ(X_{i}) ∈ {0, 1, …, n}, the number of successfully transmitting synapses;
- Q_{i} ∈ {0, 1, …, m}, a quantal event with a discrete random amplitude;
- ∑ φ(X_{i})Q_{i} ∈ {0, 1, …, n·m}, the sum of the quantal responses with quantal amplitude variation.
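The chain of variables above can be made concrete with a short Monte Carlo sketch (all parameter values are illustrative, and the quantal amplitude distribution here is a rough Gaussian stand-in rather than the Poisson form developed below). It draws one realization of X, passes it through the failure channel φ, attaches amplitudes, and forms the dendritic sum:

```python
import random

random.seed(0)

# Illustrative parameters (not fitted to data): n inputs, firing probability p,
# failure rate f, and a cap m on the discrete quantal amplitude.
n, p, f, m = 1000, 0.05, 0.3, 200
s = 1.0 - f

x = [1 if random.random() < p else 0 for _ in range(n)]        # X
phi_x = [xi if xi and random.random() < s else 0 for xi in x]  # phi(X)
# Q_i in {0, ..., m}; a crude Gaussian stand-in for the amplitude distribution.
q = [max(0, min(m, round(random.gauss(64, 8)))) for _ in range(n)]

sum_x = sum(x)                                            # active input axons
sum_phi = sum(phi_x)                                      # successful transmissions
dendritic_sum = sum(pi * qi for pi, qi in zip(phi_x, q))  # sum of quantal responses

print(sum_x, sum_phi, dendritic_sum)
```

Note that ∑ φ(X_{i}) can never exceed ∑ X_{i}: the failure channel only deletes active inputs, it never creates them.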

Upper case indicates a random variable and lower case a specific realization of the corresponding random variable. The actual (biological) sequence of variables and transformations is:

Equation EB.1.1: X → φ(X) → (φ(X_{1})Q_{1}, …, φ(X_{n})Q_{n}) → ∑ φ(X_{i})Q_{i}

Because the Q_{i} are independent of i and identically distributed, there is a mutual information equivalent sequence:

Equation EB.1.2: X → ∑ X_{i} → ∑ φ(X_{i}) → ∑ φ(X_{i})Q_{i}

which is what we will use to calculate:

Equation EB.1.3: I(X; ∑ φ(X_{i})Q_{i}) = H(∑ φ(X_{i})Q_{i}) − H(∑ φ(X_{i})Q_{i} | X)

### Part B.1

Find H(∑ φ(X_{i})Q_{i}). To get H(∑ φ(X_{i})Q_{i}), we need P(∑ φ(X_{i})Q_{i}), which we get via P(∑ φ(X_{i})Q_{i} = h) = ∑_{y^{φ}} P(∑ φ(X_{i})Q_{i} = h | ∑ φ(X_{i}) = y^{φ}) P(∑ φ(X_{i}) = y^{φ}) and some approximations.

### The first approximation

When nps is large and ps is small, approximate the discrete binomial distribution P(∑ φ(X_{i}) = y^{φ}) = C(n, y^{φ}) (ps)^{y^{φ}} (1 − ps)^{n−y^{φ}} by the continuous distribution

Equation EB.1.4: P(∑ φ(X_{i}) = y^{φ}) ≈ e^{−nps}(nps)^{y^{φ}}/Γ(y^{φ} + 1)

because each is nearly normally distributed and has the same mean and nearly the same variance.

To take quantal sizes into account, start with the assumption that the single event, Q_{i}, is distributed as a Poisson that is large enough to be nearly normal and that has the same shape. We can do this if the value of the random variable at the mean and at 1 SD on either side of the mean yields the same approximate relationship (it will) while the Poisson parameter is large enough for the normal approximation. For now let P(Q_{i} = q) = e^{−α} α^{q}/q!. As noted below, experimental observations allow us to place a value of approximately 64 on α. Note that the distribution of quantal amplitude is usually assumed to be normal, but here we assume a Poisson distribution of about the same shape. Indeed, the normal assumption produces biologically impossible negative values that the Poisson approximation avoids. Now note that the Q_{i} are independent of each other and of the X_{i} values, so P(X_{i}Q_{i} = q_{i} | X_{i} = 1) = P(Q_{i} = q_{i}), and when we sum the independent events another Poisson distribution occurs:
Equation EB.1.5: P(∑ φ(X_{i})Q_{i} = h | ∑ φ(X_{i}) = y^{φ}) = e^{−αy^{φ}}(αy^{φ})^{h}/h!

Now following Bayes's procedure using these two approximations (EB.1.4 and EB.1.5):

Equation EB.1.6: P(∑ φ(X_{i})Q_{i} = h, ∑ φ(X_{i}) = y^{φ}) ≈ [e^{−αy^{φ}}(αy^{φ})^{h}/h!]·[e^{−nps}(nps)^{y^{φ}}/Γ(y^{φ} + 1)]

where the approximation is caused by the approximation noted in EB.1.4. This joint distribution is then marginalized, by summing over y^{φ} via another approximation (*see the Part B.1 footnote just below), to yield a negative binomial distribution with h ∈ {0, 1, …}:

Equation EB.1.7: P(∑ φ(X_{i})Q_{i} = h) ≈ [Γ(h + nps)/(h! Γ(nps))] (α/(α + 1))^{h} (1/(α + 1))^{nps}

This marginal distribution has mean = nps·(α/(α + 1))·(1/(α + 1))^{−1} = αnps and variance = mean·(1/(α + 1))^{−1} = α(α + 1)nps.

Because the negative binomial is nearly Gaussian under the parameterizations of interest here, H(∑ φ(X_{i})Q_{i}) ≈ (1/2) log_{2} [2πeα(α + 1)nps].
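These compound-sum moments can be checked by simulation. The sketch below (illustrative n, p, and s, with α = 64 as in the quantal-amplitude discussion) draws the number of successes, attaches the summed Poisson quantal amplitude through its normal approximation, and compares the empirical mean and variance of ∑ φ(X_{i})Q_{i} with αnps and α(α + 1)nps; the empirical variance falls a few percent short because the binomial variance carries a (1 − ps) factor that the Poisson-style approximation drops:

```python
import math
import random
import statistics

random.seed(2)

# Illustrative parameters; alpha = 64 as in the quantal-amplitude discussion.
n, p, s, alpha = 2000, 0.05, 0.7, 64
nps = n * p * s

sums = []
for _ in range(5000):
    y_phi = sum(random.random() < p * s for _ in range(n))  # successes ~ Bin(n, ps)
    lam = alpha * y_phi
    # The sum of y_phi Poisson(alpha) quanta is Poisson(alpha * y_phi);
    # sample it through its normal approximation, as the derivation does.
    sums.append(max(0.0, random.gauss(lam, math.sqrt(lam))))

mean, var = statistics.mean(sums), statistics.variance(sums)
# Ratios to the predicted moments alpha*nps and alpha*(alpha+1)*nps:
print(round(mean / (alpha * nps), 3), round(var / (alpha * (alpha + 1) * nps), 3))
```

Both ratios should print close to 1, supporting the use of the negative binomial's normal approximation in the entropy formula above.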

### Part B.1. footnote

*Approximate the summation (i.e., the marginalization) with the integration

P(∑ φ(X_{i})Q_{i} = h) ≈ ∫₀^∞ [e^{−αy^{φ}}(αy^{φ})^{h}/h!]·[(y^{φ})^{nps−1}e^{−y^{φ}}/Γ(nps)] dy^{φ}

where the second factor is a gamma density replacing EB.1.4 (both approximate the same normal distribution). Now let t = (α + 1)y^{φ}; then dt/dy^{φ} = α + 1 and dy^{φ} = dt/(α + 1). Substitute to get a recognizable integral:

P(∑ φ(X_{i})Q_{i} = h) ≈ [α^{h}/(h! Γ(nps)(α + 1)^{h+nps})] ∫₀^∞ t^{h+nps−1}e^{−t} dt = [Γ(h + nps)/(h! Γ(nps))] (α/(α + 1))^{h} (1/(α + 1))^{nps}

### Part B.2

To find H(∑ φ(X_{i})Q_{i} | X) we need P(∑ φ(X_{i})Q_{i} | X = x).

Because the Q_{i} are independent of the particular i that generates them and depend only on how many synapses transmitted,

Equation EB.2.1: P(∑ φ(X_{i})Q_{i} | X = x) = P(∑ φ(X_{i})Q_{i} | ∑ X_{i} = ∑ x_{i})
Via the earlier description of the signal flow of computation, we can use the Markov property of conditional probabilities to write:
where the last two steps follow from the definition of conditional probability, multiplying and dividing by the joint probability, and then using Markov's idea again.

Now apply to this last result the quantal size approximation already introduced, approximate summation by integration, and replace a binomial distribution by a gamma distribution where both nearly approximate the same normal distribution.
where the approximate equalities arise just as in the previous subsection. However, in contrast to the calculation of P(∑ φ(X_{i})Q_{i}), when replacing the binomial distribution here we must accommodate the mean and variance being different. Thus, a more complicated gamma distribution is used. Still, the integration is made easy by a change of variable.
which substitutes to give:
Once again, the resulting distribution, P(∑ φ(X_{i})Q_{i}‖X = x), is a negative binomial with mean:

### Part B.3

Now calculate the mutual information with normal approximations for each of the two negative binomials.
Because the first two terms combine to approximately zero and when
then
But I(X; ∑ φ(X_{i})Q_{i}) is set to H(p*) if we obey Theorem G. Then H(p*) = −(1/2) log_{2}(f), implying f = 2^{−2H(p*)} = (1/4)^{H(p*)}.

Two comments are in order.

First, the approximation E[log_{2}(∑ X_{i}/np)] ≈ 0 gets better as np gets larger but is good enough down to np = 4 (Fig. 3). Second, the variance of quantal amplitudes expresses itself in the term αs/(α + 1), as opposed to just s when there is no quantal variance. That is, if we leave out quantal amplitude variations, we get nearly the same answer without even using these approximations.
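The quality of the first approximation can be probed directly. The sketch below (our own construction, not the paper's Figure 3 procedure) estimates E[log₂(∑ X_{i}/np)] by Monte Carlo, using a Poisson stand-in for ∑ X_{i} (the small-p regime) and conditioning on ∑ X_{i} ≥ 1 so the logarithm stays finite; the residual bias is small but visible at np = 4 and shrinks as np grows:

```python
import math
import random

random.seed(3)

def poisson(lam):
    # Knuth's method; adequate for the modest parameters used here.
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

def e_log_ratio(np_value, trials=50_000):
    # Monte Carlo estimate of E[log2(sum(X_i)/np)], with a Poisson stand-in
    # for sum(X_i) and conditioning on sum(X_i) >= 1.
    acc = cnt = 0
    for _ in range(trials):
        sx = poisson(np_value)
        if sx > 0:
            acc += math.log2(sx / np_value)
            cnt += 1
    return acc / cnt

for np_value in (4, 8, 32):            # illustrative values of np
    print(np_value, round(e_log_ratio(np_value), 3))
```

The printed estimates are all modest in magnitude and decrease toward zero as np increases, consistent with the claim that the approximation is usable down to np = 4.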

The particular parameterization of α is arrived at from experimental data. When k equals one, we have the distribution of quantal amplitudes for a single quantum. From published reports (Katz, 1966, their Fig. 30), we estimate that at 1 SD from the mean, the value of the quantal amplitude changes ∼12.5%. Thus, a Poisson distribution, with a nearly normal distribution of the same shape, has a mean amplitude proportional to α = 64 and an amplitude proportional to 72 at 1 SD from the mean (√64 = 8, and 8/64 = 12.5%). That is, 12.5% to either side of the mean value is ∼1 SD of quantal size given that the mean has a value of α. With α = 64, α/(α + 1) = 64/65, which is reasonably close to one.

### Part C

The following simple proof was suggested by Read Montague.

Let Y_{1}, Y_{2}, and Y_{3} be random variables formed from the sums:

Y_{1} = ∑ X_{i}, Y_{2} = ∑ φ(X_{i}), Y_{3} = ∑ φ(X_{i})Q_{i}

where each X_{i} represents the activity of an input signal and is a binary random variable with P(X_{i} = 1) = p and P(X_{i} = 0) = 1 − p = q. The function φ(X_{i}) is associated with quantal failures such that P(φ(X_{i}) = 1 | X_{i} = 1) = s, P(φ(X_{i}) = 0 | X_{i} = 1) = 1 − s = f, P(φ(X_{i}) = 0 | X_{i} = 0) = 1, and P(φ(X_{i}) = 1 | X_{i} = 0) = 0.

We will assume that the quantal amplitudes are Gaussian distributed with mean μ and variance σ^{2} and let such a Gaussian distribution be denoted by P(Q_{i}) = N(μ, σ^{2})dQ_{i}. Furthermore, we will assume that the number of inputs (i.e., the number of terms in the sums) is large enough that P(Y_{1}) = N(np, npq)dY_{1} and P(Y_{2}) = N(nps, nps)dY_{2}.

We seek to determine the mutual information between Y_{3} and Y_{1}:

I(Y_{3}; Y_{1}) = H(Y_{3}) − H(Y_{3} | Y_{1})

To determine I(Y_{3}; Y_{1}), we compute P(Y_{3} | Y_{1}) as:
where we have used the approximation N(μY_{2}, σ^{2}Y_{2}) ≈ N(μY_{2}, npsσ^{2}).

Then we compute P(Y_{3}) as
which yields
If we let σ^{2} = μ, p ≪ 1 (q ∼ 1), and μ ≫ 1, then we again obtain the result of Part B.
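Under the stated Gaussian approximations, the mutual information reduces to a ratio of variances, I(Y₃; Y₁) ≈ (1/2) log₂[Var(Y₃)/E Var(Y₃ | Y₁)]; this variance-ratio reading, and the per-active-input conditional variance sσ² + sfμ², are our own sketch of the Part C logic rather than the paper's exact algebra. With illustrative parameters satisfying the stated limits (σ² = μ, p ≪ 1, μ ≫ 1), the ratio recovers the Part B answer −(1/2) log₂ f to within a few percent:

```python
import math

# Illustrative parameters satisfying the stated limits: sigma^2 = mu, p << 1, mu >> 1.
p, s, mu = 0.005, 0.7, 64.0
f, sigma2 = 1.0 - s, mu                 # failure rate; quantal variance sigma^2 = mu

# Per-input Gaussian-approximation variances for Y3 = sum(phi(X_i) * Q_i)
# (the number of inputs n cancels in the ratio):
var_y3 = p * s * (sigma2 + mu**2) - (p * s * mu) ** 2   # Var(Y3)/n
var_y3_given_y1 = p * (s * sigma2 + s * f * mu**2)      # E[Var(Y3|Y1)]/n

info = 0.5 * math.log2(var_y3 / var_y3_given_y1)   # I(Y3; Y1), Gaussian approximation
target = -0.5 * math.log2(f)                       # the Part B result
print(round(info, 3), round(target, 3))
```

As p → 0 and μ → ∞ with σ² = μ, the numerator tends to psμ² and the denominator to psfμ², so the ratio tends to 1/f and the two quantities coincide exactly.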

## Footnotes

This work was supported by National Institutes of Health Grants MH48161 to W.B.L. and MH57358 to N. Goddard (with subcontract 163194-54321 to W.B.L.), by the Department of Neurosurgery, and by the Meade-Munster Foundation. We thank Toby Berger, Costa Colbert, Read Montague, and Simon Laughlin for their useful comments that helped improve a previous version of this manuscript.

Correspondence should be addressed to William B. Levy, University of Virginia Health System, P.O. Box 800420, Department of Neurosurgery, Charlottesville, VA 22908-0420. E-mail: wbl{at}virginia.edu.