## Abstract

A neural correlate of parametric working memory is a stimulus-specific rise in neuron firing rate that persists long after the stimulus is removed. Network models with local excitation and broad inhibition support persistent neural activity, linking network architecture and parametric working memory. Cortical neurons receive noisy input fluctuations that cause persistent activity to diffusively wander about the network, degrading memory over time. We explore how cortical architecture that supports parametric working memory affects the diffusion of persistent neural activity. Studying both a spiking network and a simplified potential well model, we show that spatially heterogeneous excitatory coupling stabilizes a discrete number of persistent states, reducing the diffusion of persistent activity over the network. However, heterogeneous coupling also coarse-grains the stimulus representation space, limiting the storage capacity of parametric working memory. The storage errors due to coarse-graining and diffusion trade off so that information transfer between the initial and recalled stimulus is optimized at a fixed network heterogeneity. For sufficiently long delay times, the optimal number of attractors is less than the number of possible stimuli, suggesting that memory networks can under-represent stimulus space to optimize performance. Our results clearly demonstrate the combined effects of network architecture and stochastic fluctuations on parametric memory storage.

## Introduction

Persistent neural activity occurs in prefrontal (Fuster, 1973; Funahashi et al., 1989; Romo et al., 1999) and parietal (Pesaran et al., 2002) cortex during the retention interval of parametric working memory tasks. Model networks of stimulus-tuned neurons that are connected with local slow excitation (Wang, 1999) and broadly tuned inhibitory feedback (Compte et al., 2000; Goldman-Rakic, 1995) exhibit localized and persistent high-rate spike train patterns called “bump” states (Compte et al., 2000; Renart et al., 2003). Bumps have initial locations that are stimulus-dependent, so population activity provides a code for the remembered stimulus (Durstewitz et al., 2000). These models relate cortical architecture to persistent neural activity, and are a popular framework for studying working memory (Wang, 2001; Brody et al., 2003).

Neural variability is present in all brain regions and limits neural coding in many sensory, motor, and cognitive tasks (Stein et al., 2005; Faisal et al., 2008; Laing and Lord, 2009). In parametric working memory networks, dynamic input fluctuations cause bump states to wander diffusively (Compte et al., 2000; Laing and Chow, 2001; Wu et al., 2008; Polk et al., 2012; Burak and Fiete, 2012; Kilpatrick and Ermentrout, 2013), degrading stimulus storage over time. Psychophysical data show that the spread of the recalled position increases with delay time (White et al., 1994; Ploner et al., 1998), consistent with diffusive wandering of a bump state. While several results examine how bump formation depends upon neural architecture, little is known about how cortical wiring affects the diffusion of persistent neural activity.

The response properties of cells are often heterogeneous (Ringach et al., 2002), a feature that can improve population-based codes (Chelaru and Dragoi, 2008; Shamir and Sompolinsky, 2006; Marsat and Maler, 2010; Osborne et al., 2008; Padmanabhan and Urban, 2010). In particular, there is a large degree of variation in synaptic plasticity and cortical wiring in prefrontal cortical networks involved in persistent activity during working memory tasks (Rao et al., 1999; Wang et al., 2006). Heterogeneity in excitatory coupling quantizes the neural space used to store inputs, reducing the network's overall storage capacity (Renart et al., 2003; Itskov et al., 2011). On the other hand, stabilizing a discrete number of network states improves the robustness of working memory dynamics to parameter perturbation (Rosen, 1972; Koulakov et al., 2002; Brody et al., 2003; Goldman et al., 2003; Miller, 2006). In this study, we investigate how stabilization introduced by synaptic heterogeneity affects the temporal diffusion of persistent neural activity.

We show that spatial heterogeneities in the excitatory architecture of a spiking network model of working memory reduce the rate with which bumps diffuse away from their initial position. However, the same heterogeneities limit the number of stable network states used to store memories. A tradeoff between these consequences maximizes the transfer of stimulus information at a specific degree of network heterogeneity. For a large number of stimulus locations and long retention times, we show that network architectures that under-represent stimulus space can optimize performance in working memory tasks.

## Materials and Methods

##### Recurrent network architecture.

We used for our network a ring architecture commonly used for generating persistent activity to represent direction between 0 and 360° (Ben-Yishai et al., 1995; Compte et al., 2000) with *N*_{E} = 256 pyramidal cells (*E*) and *N*_{I} = 64 interneurons (*I*). Each leaky integrate-and-fire neuron (Laing and Chow, 2001) was distinguished by its cue orientation preference θ_{j}, where θ_{j}(*E*) = Δ_{E} · *j* (*j* = 1, …, *N*_{E}) and θ_{j}(*I*) = Δ_{I} · *j* (*j* = 1, …, *N*_{I}) for Δ_{E} = 360/256 and Δ_{I} = 360/64, respectively. The subthreshold membrane potential of each neuron, *V*_{α}(θ_{j}, *t*) (α = *E*, *I*), obeyed the following equation: where *I*_{E} = 0.6 and *I*_{I} = 0.6 are bias currents that determine the resting potential of *E* and *I* neurons. The external current, expressed in the following equation: represents sensory input received only by pyramidal neurons, where *I*_{0} = 2 is the input amplitude, *I*_{d} = 3 determines input width, and θ_{ext} is the cue position. The stimulus was turned on at *T*_{ON} = −1 s and off at *T*_{OFF} = 0 s. Interneurons received no external input, so *I*_{ext,I} = 0. Voltage fluctuations were represented by the white noise process *I*_{n,α}(*t*) with variance σ_{V,α}^{2} (σ_{V,E} = 0.5 and σ_{V,I} = 0.3). We scaled and nondimensionalized voltage so that the threshold potential *V*_{th} = 1 and the reset potential *V*_{res} = 0 for all neurons.

Synaptic currents were mediated by a sum of AMPA, NMDA, and GABA currents, each modeled as: where *A*_{AMPA,E} = 1, *A*_{NMDA,E} = 2, *A*_{GABA,E} = 0.81, *A*_{AMPA,I} = 1, *A*_{NMDA,I} = 1, and *A*_{GABA,I} = 0. Orientation preference was introduced into synaptic conductance by the spatially decaying functions (Fig. 1*A*) expressed: where *d*_{AMPA,E} = 0.32, *d*_{NMDA,E} = 0.32, *d*_{GABA,E} = 5, *d*_{AMPA,I} = 5, and *d*_{NMDA,I} = 5. Equation 1 was used for excitatory (AMPA and NMDA) synapses between pyramidal (*E*) cells in the case of spatially homogeneous connectivity. In the case of spatially heterogeneous synaptic strength (Fig. 2), the strength of AMPA and NMDA connections between pyramidal cells (*E*) was given by: where *h* = 0.025 represents the strength of the heterogeneity and *n* is the frequency of the heterogeneity, which must be integer valued.
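Since the explicit connectivity equations are not reproduced above, the scheme can be sketched in code. The exponential form of the spatial decay and the cosine form of the heterogeneous modulation are assumptions for illustration (the paper's exact functional forms appear in the elided equations), as are all function names.

```python
import numpy as np

# Sizes and preferred orientations from the text; the decay and modulation
# functions below are illustrative assumptions, not the paper's exact forms.
N_E = 256
theta = np.linspace(0.0, 2.0 * np.pi, N_E, endpoint=False)

def ring_dist(a, b):
    """Shortest angular distance between orientations a and b (radians)."""
    d = np.abs(a - b)
    return np.minimum(d, 2.0 * np.pi - d)

def footprint(theta_i, theta_j, d):
    """Assumed exponentially decaying synaptic footprint with width d."""
    return np.exp(-ring_dist(theta_i, theta_j) / d)

def heterogeneous_weight(theta_i, theta_j, d, h=0.025, n=4):
    """Assumed multiplicative heterogeneity of strength h and frequency n
    applied to E-to-E connections, giving n strengthened regions."""
    return footprint(theta_i, theta_j, d) * (1.0 + h * np.cos(n * theta_j))

# Full E-to-E weight matrix (row: postsynaptic i, column: presynaptic j)
W = heterogeneous_weight(theta[:, None], theta[None, :], d=0.32)
```

The multiplicative modulation keeps the overall distance-dependent profile while making *n* evenly spaced regions of the ring more strongly recurrently coupled, consistent with the discrete attractors described in Results.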

Synaptic gating variables of type β = *AMPA*, *NMDA*, or *GABA* associated with a neuron at location θ_{j} were instantaneously activating and exponentially decaying, as described by: where α = *E* for β = *AMPA* and *NMDA*, while α = *I* for β = *GABA*. Instantaneous activation is represented here using the delta function δ, so *s*_{β}(θ_{j}, *t*) increments by 1 when *V*_{α}(θ_{j}, *t*) attains the spike threshold *V*_{th} = 1. Decay time constants for each synapse type are τ_{AMPA} = 5 ms, τ_{NMDA} = 100 ms, and τ_{GABA} = 20 ms. Fluctuations in conductance were introduced into each synapse with the term *I*_{s,β}(*t*), which is white noise with variance σ_{s,β}^{2} (σ_{s,AMPA} = 0.1, σ_{s,NMDA} = 0.45, and σ_{s,GABA} = 0.05). We take the variance of noise to NMDA synapses to be high, σ_{s,NMDA} = 0.45, because it leads to high variance in spike times, as commonly observed in prefrontal cortical neurons during the delay period of working memory tasks (Compte et al., 2003). In addition, this generates an error of a few degrees in the recall of cue position for delay periods of 2–10 s, as observed in psychophysical experiments (White et al., 1994).

Numerical simulations were performed using an Euler–Maruyama method with timestep d*t* = 0.1 ms and normally distributed random initial conditions. Spike time rastergrams were smoothed to generate population firing rates as a function of orientation and time, whose maximum at each time point was used to calculate the centroid of the bump (Figs. 1*D*,*E*, 2). Variances (Figs. 1*F*, 3) and probability densities (Figs. 1*E*, 2) were computed using 1000 values of the bump centroid across 10 s. Linear fits of the variance in the cases of spatially homogeneous synapses and spatially heterogeneous synapses with *n* = 8 and *n* = 4 (Fig. 3) were performed using linear regression.
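The centroid-extraction step can be illustrated with a hypothetical population-vector (circular mean) readout of the firing-rate profile; the paper instead tracks the maximum of the smoothed population rate, so this is an approximate stand-in rather than the authors' method.

```python
import numpy as np

def bump_centroid(rates, theta):
    """Population-vector (circular mean) estimate of the bump center from a
    vector of firing rates; an illustrative alternative to locating the
    maximum of the smoothed population rate, as done in the text."""
    z = np.sum(rates * np.exp(1j * theta))
    return float(np.angle(z)) % (2.0 * np.pi)

# Example: a Gaussian-shaped bump of width 0.3 rad centered at 90 degrees
theta = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
center = np.pi / 2.0
d = (theta - center + np.pi) % (2.0 * np.pi) - np.pi  # wrapped difference
rates = np.exp(-0.5 * (d / 0.3) ** 2)
est = bump_centroid(rates, theta)
```

Applied at each time point of the smoothed raster, this yields the centroid trajectory whose variance across trials is fit by linear regression.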

##### Diffusion in the potential well model.

To analyze the diffusive dynamics of the bump, we studied an idealized model of bump motion. In this model, the bump position φ(*t*) obeys the following stochastic differential equation (Lifson and Jackson, 1962; Lindner et al., 2001):
Here φ(*t*) was restricted to the periodic domain φ ∈(− π,π] and d*W* was a standard white noise process. The first term in Equation 2 models the periodic spatial heterogeneity that is responsible for attractor dynamics. Heterogeneity is parametrized by its strength *h* and spatial frequency *n*. In this framework, the dynamics of φ(*t*) is a diffusive process occurring on an energy landscape defined by the periodic potential
producing *n* attractors (Fig. 4*A*). These attractors occur at the minima of Equation 3, given by φ = 2*j*π/*n* where *j* = 1, …, *n*. They are separated from one another by repellers or saddles at the maxima of Equation 3.

To analyze the model, we reformulated Equation 2 as an equivalent Fokker–Planck equation (Risken, 1996),
where *p*(φ, *t*) is the probability density of finding the bump at a given value φ at time *t*. For ease of analytic calculations, we let φ evolve on an infinite domain. Since we worked in parameter regimes where the resulting spread of probability densities was relatively narrow, this did not considerably alter the results. Also, experimentally measured errors in cue recall are typically not large enough to span more than a quarter of the possible stimulus space (Ploner et al., 1998).

For large times and sufficiently high frequency *n*, the variance of the stochastic process Equation 2 can be quantified using an effective diffusion coefficient (Lindner et al., 2001)
The density *p*(φ, *t*) tends asymptotically to:
where *p*_{0}(φ) is the stationary, periodic solution of Equation 4, given by
where χ is a normalization factor. The approximation, Equation 6, matches realizations of the full stochastic process Equation 2 very well (Fig. 4*C*). Clearly, the frequency of the probability distribution's microscale is commensurate with that of the periodic potential Equation 3. We were mainly concerned with computing the effective diffusion of the stochastic process defined by Equation 2. Remarkably, second-order statistics are still well approximated by ignoring the microperiodicity of the density in Equation 6, just using:
(Fig. 4*C*). Previous authors have used asymptotic methods for computing the associated effective diffusion coefficient *D*_{eff} inherent in the formula Equation 6 (Lifson and Jackson, 1962; Lindner et al., 2001). The long-standing result is (Lifson and Jackson, 1962)
and we can compute the integrals in the denominator to find:
where *I*_{0}(*x*) is the modified Bessel function of the first kind of order zero. Equation 8 demonstrates the monotone increasing dependence of the effective diffusion upon the number of potential wells *n*.

To calculate the probability density *p*(φ, *t*), we simulated 10,000 realizations of Equation 2 using an Euler–Maruyama integration scheme with a timestep d*t* = 0.001 from *t* = 0 to *t* = 10 s (Fig. 4*B*,*C*). The effective diffusion coefficient was calculated as the gradient of the variance across the time window and converted to degrees with the change of variables θ = 180(φ + π)/π yielding:
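The effective-diffusion calculation can be sketched under explicit assumptions about the elided equations: taking Equation 2 to have drift −*h* sin(*nφ*) (consistent with attractors at φ = 2*j*π/*n*) and noise amplitude √(2*D*), the Lifson–Jackson formula reduces to *D*_{eff} = *D*/*I*_{0}(*h*/(*nD*))². The sketch below compares this closed form against a Monte Carlo estimate from Euler–Maruyama realizations; the normalization conventions are assumptions, not taken verbatim from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def d_eff(h, n, D):
    """Lifson-Jackson effective diffusion coefficient for the potential
    U(phi) = -(h/n) cos(n phi), i.e., drift -h sin(n phi), with noise
    sqrt(2D) dW (this drift/noise normalization of Eq. 2 is assumed).
    np.i0 is the modified Bessel function of the first kind of order zero."""
    return D / np.i0(h / (n * D)) ** 2

def mc_variance_slope(h, n, D, T=5.0, dt=0.005, trials=2000):
    """Euler-Maruyama estimate of Var[phi(T)] / (2T), which should approach
    d_eff(h, n, D) for delays long compared with the in-well relaxation."""
    phi = np.zeros(trials)
    for _ in range(int(T / dt)):
        phi += (-h * np.sin(n * phi) * dt
                + np.sqrt(2.0 * D * dt) * rng.standard_normal(trials))
    return phi.var() / (2.0 * T)
```

Under these assumptions, `d_eff` increases monotonically with *n* and saturates at the free diffusion coefficient *D* as *n* → ∞, consistent with Figure 4*E*.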

##### Adding unstructured heterogeneity.

Effects of unstructured heterogeneity were studied by perturbing the potential in Equation 2 with a combination of random periodic functions,
The unstructured heterogeneity was given by the random potential
where *a*_{j}, *b*_{j} (*j* = 1, …, *N*_{h}) are randomly drawn from normal distributions. Here η scales the amplitude of the random potential, and the maximal frequency of the unstructured components is given by *N*_{h}. We take the rounded integer *N*_{h} = [0.05(1 + 0.1ξ)/(η + 0.0005) + 1], where ξ is a normally distributed random variable, so larger η values reduce the number of modes added to the potential, decreasing the maximal number of attractors in the system, as in other studies of unstructured heterogeneity in bump attractor networks (Zhang, 1996; Renart et al., 2003; Itskov et al., 2011; Hansel and Mato, 2013). To calculate an effective diffusion coefficient *D*_{h}, we initialized 10,000 simulations of Equation 10 at φ(0) = 0 and computed *D*_{h} = ⟨φ(*T*)^{2}⟩/*T* for *T* = 10 s.
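Construction of the random potential can be sketched as follows, assuming *U*_{p} is a random Fourier series with the quoted *N*_{h} rule; the exact normalization inside the elided equation is not reproduced, so the form below is illustrative.

```python
import numpy as np

def random_potential(eta, rng):
    """Sketch of the unstructured potential U_p: a random Fourier series with
    a_j, b_j ~ N(0, 1) and mode count N_h following the rounded-integer rule
    quoted in the text, so larger eta yields fewer (lower-frequency) modes."""
    xi = rng.standard_normal()
    N_h = int(round(0.05 * (1.0 + 0.1 * xi) / (eta + 0.0005) + 1.0))
    a = rng.standard_normal(N_h)
    b = rng.standard_normal(N_h)
    j = np.arange(1, N_h + 1)
    def U_p(theta):
        theta = np.atleast_1d(np.asarray(theta, dtype=float))
        modes = np.outer(theta, j)
        return eta * (a * np.cos(modes) + b * np.sin(modes)).sum(axis=1)
    return U_p, N_h
```

Because the modes have integer frequencies, each realization of *U*_{p} is 2π-periodic, so the perturbed potential remains well defined on the ring.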

##### Information measures.

To measure the performance of the network on a working memory task, we used a Shannon measure of mutual information for a noisy channel (Cover and Thomas, 2006). We considered a channel receiving one of *m* possible stimuli (*X*), storing an input as one of *n* ≤ *m* possible states (*Y*(*t*)), and reading out the remembered stimulus as one of the original *m* possible values (*Z*). The stored variable *Y*(*t*) evolves during storage time *t* ∈ [0, *T*] due to degradation of the initially loaded signal *Y*(0) by dynamic noise. The stimuli were presented with equal probability *p*_{j} = 1/*m* (*j* = 1, …, *m*), so that the stimulus entropy was as follows: The network represented a stimulus as the bump position at one of the system's *n* attractors. If *m* was a multiple of *n* (*m* = *qn* with *q* an integer), then the mapping from stimulus to loaded representation was straightforward, with *Y*(0) = ceil(*X*/*q*). When *m* was not a multiple of *n*, we allowed the potential well structure of the system to guide the loaded state to the nearest attractor. This led to slightly nonuniform distributions of loaded stimuli. However, the effects of diffusion made this slight nonuniformity insignificant, especially as the length of storage time *T* was increased. In our theoretical calculations, we assumed that the loading algorithm maximized the entropy of the neural representation *Y*(0); this sometimes involved random assignments from *X* to *Y*(0). In fact, we found our numerical results did not stray too far from this approximation (see Figs. 7, 8).

If *n* = *m*, then *Y*(0) = *X* and the loaded representation had the same entropy as the stimulus, *H*(*Y*(0)) = *H*(*X*). If *n* < *m*, then the representation entropy was smaller than the stimulus entropy, *H*(*Y*(0)) < *H*(*X*), since the space into which the stimulus is mapped is smaller than the original stimulus space. Since the stimulus *X* has a uniform distribution, so does *Y*, with the probability that attractor *j* is loaded being *p*_{j} = 1/*n* (*j* = 1, …, *n*) and the entropy of the representation *H*(*Y*(*t*)) = log_{2} *n* (this holds for all *t*). The readout of the neural representation required an expansion from *Y*(*T*) to *Z*. If *m* was a multiple of *n*, then the expansion was straightforward, with *Z* having probability 1/*q* for *Z* = *q*(*Y*(*T*) − 1) + 1, …, *qY*(*T*) and zero otherwise. In this case *H*(*Z*) = log_{2} *m*. If *m* was not a multiple of *n*, we subdivided the domain into *m* evenly spaced subdomains and assigned *Z* accordingly, and for theoretical calculations we again assumed *H*(*Z*) = log_{2} *m*.
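For the simple case *m* = *qn*, the compression and expansion maps described above can be sketched directly; the function names and the example values *m* = 12, *n* = 4 are illustrative.

```python
import numpy as np

def load(x, q):
    """Compression X -> Y(0): stimulus x in {1,...,m} maps to attractor
    ceil(x/q) in {1,...,n} when m = q*n, the mapping quoted in the text."""
    return int(np.ceil(x / q))

def readout_support(y, q):
    """Expansion Y(T) -> Z: stored attractor y is read out as one of the q
    stimulus values q*(y-1)+1, ..., q*y, each with probability 1/q."""
    return list(range(q * (y - 1) + 1, q * y + 1))

# Entropies for an illustrative case: m = 12 stimuli, n = 4 attractors
m, n = 12, 4
q = m // n
H_X = np.log2(m)  # stimulus entropy, log2(m) bits
H_Y = np.log2(n)  # representation entropy, log2(n) bits
```

With *n* < *m*, the quantization alone costs log_{2}(*m*/*n*) bits even before any diffusion occurs, which is the coarse-graining side of the tradeoff analyzed in Results.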

While *H*(*Y*) ≤ *H*(*Z*), *H*(*Y*) nevertheless set an upper limit for the mutual information between the stimulus *X* and readout *Z*, expressed as:
To compute the conditional entropy *H*(*Y*|*Z*), we first calculated the probability of transition between one state and another during the diffusion phase (0 ≤ *t* ≤ *T*). A direct estimate of the transition probabilities was obtained by numerically simulating many realizations of the model and estimating *p*(φ(*T*)), where φ(*T*) is the center of mass of the bump at time *t* = *T*. We subdivided φ into *n* subdomains of equal width, and the area of the density over each subdomain is *p*_{j→k}, the transition probability from the loaded state *X* = *j* to another state (*k* ≠ *j*) or itself (*k* = *j*), where *k* = 1, …, *n*. Due to discrete translation symmetry in both systems, we expected *p*_{j→k} = *p*_{j+l→k+l}. The conditional entropy can then be computed as follows (Cover and Thomas, 2006): Our second method of computing the conditional entropy employed the effective diffusion coefficient *D*_{eff} associated with the probability density for locations of the bump or particle. *D*_{eff} depends upon *n*, ultimately introducing this further *n* dependence into the mutual information (Eq. 12). Using the associated pure Gaussian probability (Eq. 7), we computed transition probabilities analytically for the case *j* = 1, so that, as Equation 14 states: for a set delay time *T*, where *a*(*k*) = (2*k* − 1)π/*n* is the lower boundary of the *k*th subdomain and *k* = 1, …, ceiling(*n*/2). Due to reflection symmetry of the Gaussian, we expected *p*_{1→k} = *p*_{1→n−k}. For even *n*, the (*n*/2 + 1)th subdomain is (−π, π/*n* − π) ∪ (π − π/*n*, π), which would lead to two integrals in Equation 14. As mentioned, the transition probabilities *p*_{j→k} for *j* = 2, …, *n* were easily computed as *p*_{j→k} = *p*_{1→k−j+1} (see Fig. 6*B*,*C*, insets). We then plugged each *p*_{j→k} value into our formula for conditional entropy, Equation 13. The analytic and numeric calculations of *p*_{j→k} led to similar results for *I*(*X*; *Z*), the calculated value of mutual information (Eq. 12; see Fig. 7).
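The Gaussian-approximation method can be sketched end to end: wrap a Gaussian of variance *D*_{eff}*T* onto the ring, integrate it over the *n* subdomains to obtain *p*_{1→k}, and exploit the circulant structure *p*_{j→k} = *p*_{1→k−j+1}. The function names are illustrative, and the quantity computed here, log_{2} *n* − *H*(*Y*(*T*)|*Y*(0)), is only the diffusion-limited contribution to the mutual information of Equation 12, not the full channel calculation.

```python
import numpy as np
from math import erf, sqrt, pi

def transition_row(n, var, wraps=50):
    """p_{1->k}: probability that a Gaussian displacement with variance
    var = D_eff * T lands in the k-th of n equal subdomains tiling the ring,
    obtained by summing wrapped images of the Gaussian (a numerical stand-in
    for the integrals of Eq. 14)."""
    def Phi(x):  # Gaussian CDF, mean 0, variance var
        return 0.5 * (1.0 + erf(x / sqrt(2.0 * var)))
    row = np.zeros(n)
    for k in range(n):
        center = 2.0 * pi * k / n  # attractor k relative to the start
        for w in range(-wraps, wraps + 1):
            lo = center - pi / n + 2.0 * pi * w
            hi = center + pi / n + 2.0 * pi * w
            row[k] += Phi(hi) - Phi(lo)
    return row

def mutual_information(n, var):
    """Diffusion-limited information (bits) between loaded and stored states
    for a uniform load over n attractors; by circulant symmetry
    p_{j->k} = p_{1->k-j+1}, a single row determines H(Y(T)|Y(0))."""
    p = transition_row(n, var)
    h_cond = -np.sum(p * np.log2(p + 1e-300))
    return np.log2(n) - h_cond
```

As *D*_{eff}*T* → 0 the row concentrates on the loaded attractor and the information approaches log_{2} *n*; as *D*_{eff}*T* grows the row flattens toward 1/*n* and the information decays toward zero, reproducing the tradeoff discussed below.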

## Results

### Diffusion of bumps in a spatially homogeneous network

The neural mechanics of parametric working memory has a long history of theoretical investigation (Amari, 1977; Camperi and Wang, 1998; Compte et al., 2000; Laing and Chow, 2001; Wang, 2001; Brody et al., 2003; Renart et al., 2003). Motivated by the working memory of visual cue orientations, we consider a network of spiking model neurons where each neuron has a preferred orientation in its feedforward input. Persistent neural spiking within the network follows from three assumptions about synaptic connectivity (Goldman-Rakic, 1995; Rao et al., 1999; Lewis and Gonzalez-Burgos, 2000). First, the strength of pyramidal-to-pyramidal connectivity decreases as the distance between the tuning peaks of the neurons increases (Fig. 1*A*, red line). Second, excitatory synaptic currents involve both fast-acting AMPA and slow-acting NMDA components (see Materials and Methods). Third, feedback connections from interneurons are broadly tuned (Fig. 1*A*, blue line). With these architectural features, neurons in the network respond to a transient stimulus (Fig. 1, green bar) with an elevated rate of spiking that persists long after the stimulation ceases (Fig. 1*B*). Short-range excitation leads to high-rate pyramidal spiking across a short range of orientations, while wide-range inhibition localizes this spiking (Fig. 1*C*); we refer to this pattern of activity as a “bump.” The position of the bump encodes the initial stimulus position in working memory (Compte et al., 2000; Wang, 2001; Brody et al., 2003).

We model the inherent trial-to-trial variability of neural response with an orientation-independent fluctuating input to each neuron, as well as a stochastic component of the recurrent synaptic feedback (see Materials and Methods). These fluctuations degrade the storage of the orientation cue by causing the bump to stochastically wander away from its initial position (Fig. 1*C*,*D*). Spatiotemporal averaging of the spike time raster plots identifies the maximal firing rate at each time point, and visualizes the bump wandering across the network (Fig. 1*D*, magenta line). We fix the stimulus orientation and perform many trials of the network simulation, with the only difference between trials being the realization of the stochastic forces in the network. The bump's position after a delay period of 10 s can be described by a probability density having an overall Gaussian profile (Fig. 1*E*), and the variance in bump position increases linearly as a function of time (Fig. 1*F*). These last two properties suggest the bump position behaves as a diffusion process (Risken, 1996).

Diffusive dynamics in working memory networks have been studied in several different frameworks (Compte et al., 2000; Miller, 2006; Wu et al., 2008; Burak and Fiete, 2012; Polk et al., 2012; Kilpatrick and Ermentrout, 2013). The intuition for the diffusive character of these networks is best gained from an analysis of the deterministic network. A bump can be formed with its center of mass located at any orientation, allowing for the storage of a continuum of stimuli (Amari, 1977; Camperi and Wang, 1998). However, perturbations that change the bump's position will be integrated and stored as if they were another input. Stochastic inputs lead to a continuous and random displacement of the bump, without the bump relaxing back to its original location. Over time, the position of the bump effectively obeys Brownian motion and recall error increases with the delay period. This diffusion-based error is consistent with psychophysical studies that show the spread of recalled continuous variables scales sublinearly with time (White et al., 1994; Ploner et al., 1998).

### Reduced diffusion in a spatially heterogeneous network

Previous models of working memory have considered networks that use neuronal units with bistable properties (Rosen, 1972; Koulakov et al., 2002; Brody et al., 2003; Goldman et al., 2003; Miller, 2006). These networks lack the homogeneity required for a continuum of neutrally stable stimulus representations, and rather have a discrete number of stable states. One advantage of this network heterogeneity is a “robustness” of representation with respect to parameter perturbation, a feature that is absent in homogeneous networks (Brody et al., 2003; Goldman et al., 2003). We consider spatially periodic modulation of excitatory coupling (Fig. 2, left), where the period of the modulation is 360/*n* degrees, so that *n* cycles cover orientation space. We assume such an architecture would not be biased to favor one particular cue location because errors reported in recalling cues are approximately the same for each cue location (White et al., 1994). Such an architecture may develop from Hebbian plasticity rules during training in working memory tasks, since orientation cues are typically chosen at fixed and evenly spaced locations around the circle (Funahashi et al., 1989; White et al., 1994; Goldman-Rakic, 1995; Meyer et al., 2011). In this situation, some neuron pairs are activated more than other pairs, leading to relative strengthening of their recurrent connections (Clopath et al., 2010; Ko et al., 2011). Alternatively, reward-based plasticity mechanisms could also set up spatial heterogeneity in synapses if it improved a subject's performance during a task (Schultz, 1998; Wang, 2008; Klingberg, 2010).

Spatial biases introduced to network architecture shift the smooth continuum of stable states to a chain of discrete attractors, each separated by a repeller (Fig. 2, compare *A*, *B*, *C*, middle column). This discrete attractor structure occurs because some pyramidal neurons receive stronger excitatory projections than others (Zhang, 1996; Itskov et al., 2011; Hansel and Mato, 2013). Spatial heterogeneity in the strength of excitatory connections (decreasing *n*) stabilizes bump positions to perturbations by noise (Fig. 2, compare *A*, *B*, *C*, right column). For all *n* tested, the probability density of bump positions retains an approximately Gaussian shape with periodic modulation, so the variance of bump positions still grows nearly linearly with time (Fig. 3), and it is well approximated by *Dt* where *D* is the diffusion coefficient (see Materials and Methods). The coefficient *D* drops considerably for a network with spatially heterogeneous synapses, compared with the bump diffusion measured in the homogeneous network. In total, spatial heterogeneity of excitatory coupling helps stabilize bump position in models of working memory with fluctuating stochastic inputs.

### Potential well model for bump diffusion

To analyze the relationship between network heterogeneity and bump diffusion more deeply, we now study an idealized model for parametric working memory. Briefly, a noise-driven particle on a periodic potential landscape retains the essential effects of noise and spatial heterogeneity in our spiking network model (see Materials and Methods). Our simplified model treats the bump position as a particle moving in a landscape of peaks, from which it is repelled (Fig. 4*A*, maxima), and wells, to which it is attracted (Fig. 4*A*, minima). In the potential well model, the memory of the stimulus location is tracked by the particle's position θ(*t*) obeying the following stochastic differential equation,
Here *h* is the amplitude of the periodic potential and *W*(*t*) is a Wiener process (Risken, 1996). The sine function determines the θ-dependent drift of the particle, which ultimately affects its diffusive motion. The positive integer *n* determines the number of stable attractors. Similar reduced neural models have been explored (Renart et al., 2003; Itskov et al., 2011), and the general problem of noise-induced behavior in periodic potentials has been well studied (Lifson and Jackson, 1962; Risken, 1996; Lindner et al., 2001).

Diffusive behavior occurs in periodic potentials, yet the mechanics are different than diffusion on a free potential landscape (*h* = 0). In periodic potentials, a particle typically undergoes small variance dynamics confined within a well. However, rare but large noise kicks eventually push the particle to a neighboring well. These noise-induced well transitions continue indefinitely, and diffusion over the potential landscape occurs in a punctuated fashion. Across many trials, the probability *p*(θ, *t*) of finding the particle at position θ at time *t* evolves like a Gaussian kernel modulated by a periodic function (Fig. 4*B*; see Materials and Methods). The maxima of *p*(θ, *t*) are centered at the minima of the potential, indicating a higher likelihood of finding the particle at the bottom of a well than in transition between wells.

Treating the transitions between wells as a jump process, we can approximate the diffusion of the sine potential model, Equation 15, with the following Brownian walk:
The effective diffusion coefficient *D*_{eff}(*n*) is derived using standard approaches (see Materials and Methods, Eq. 9). Under this approximation, θ(*t*) obeys a Gaussian distribution with variance *D*_{eff}(*n*)*t,* and *p*(θ, *t*) does not possess the periodic microstructure of the actual probability density (Fig. 4*C*, compare blue and black curves). Despite these differences, the approximation agrees very accurately with the variance in the particle's position in the periodic potential (Fig. 4*D*). In this framework we can directly relate the frequency of heterogeneity *n* to the overall diffusivity of the particle through *D*_{eff}(*n*). As with bumps in the spiking network, increasing the frequency *n* of the spatial heterogeneity increases the effective diffusion *D*_{eff} (Fig. 4*D*). This is true across the entire range of frequencies *n,* and as *n* → ∞ the particle's variance saturates to that of a system with a flat potential (Fig. 4*E*). Thus, despite the simplicity of the potential well framework, it can qualitatively explain the diffusivity observed in the spiking network. In both Figure 3 and Figure 4*D*, the variance in bump position is fit well by a linear function whose slope decreases with the number of attractors *n*.

### Impact of unstructured heterogeneity

As we have shown, a structured heterogeneity of cortical architecture that generates evenly spaced attractors (Figs. 2, 4*A*) curtails the rate of diffusion (Figs. 3, 4*D*). However, synaptic architecture may also have random components that are neither related to task specifics nor optimized to any specific computation (Wang et al., 2006). In principle, such perturbations in architecture could degrade the performance of working memory networks that require a fine-tuned architecture. Renart et al. (2003) demonstrated that the deleterious effects of such unstructured heterogeneity in bump attractor networks can be mitigated by homeostatic plasticity, which spatially homogenizes network excitability. In our model, considering such a process would bar the system from establishing spatially structured heterogeneity, which we have shown improves storage accuracy. Thus, we next study how a combination of structured and unstructured spatial heterogeneity affects the diffusive dynamics in working memory networks. We show that the system is still robust to noise, even when the potential is altered in this way.

Specifically, we modify the shape of the potential in Equation 15, so that
where *U*_{p}(θ) is an unstructured perturbation of the underlying cosine potential function (see Materials and Methods). Briefly, *U*_{p}(θ) is a component of the potential that is randomized from trial to trial (two realizations of the full potential are shown in Fig. 5*A*,*B*). Adding a small component of unstructured heterogeneity (η > 0) to a network with an initially flat potential function (*h* = 0) substantially alters the attractor structure (Fig. 5*A*). The network shifts from having a continuum of preferred locations to having a small number of preferred locations at disordered positions. This is analogous to the drastic collapse in the number of possible stable bump locations observed in bump attractor models whose synaptic structure is randomly perturbed (Zhang, 1996; Renart et al., 2003; Itskov et al., 2011; Hansel and Mato, 2013). On the other hand, a network that possesses structured heterogeneity (*h* > 0) retains the original positions of its stable attractors after unstructured heterogeneity is added, even though the profile of the potential is distorted (Fig. 5*B*). When the severity of heterogeneity is increased (larger η), the number of attractors is considerably reduced in the network without structured heterogeneity, while remaining the same in the network with structured heterogeneity (Fig. 5*C*). Last, the effective diffusion *D*_{h} (see Materials and Methods) of the network containing structured heterogeneity increases only gradually as the degree of unstructured heterogeneity is increased (Fig. 5*D*), contrasting with the distinct rise in the effective diffusion *D*_{h} of the network without structured heterogeneity.

Therefore, the spatial organization of attractors in the network with structured heterogeneity is robust to random perturbations of the underlying potential landscape. Recent studies have shown that parametrically perturbed spiking network models of bumps retain dynamics whose spatial profile is bump-shaped (Brody et al., 2003; Itskov et al., 2011; Hansel and Mato, 2013; Kilpatrick and Ermentrout, 2013). Brody et al. (2003) showed that the effective dynamics of the resulting system can then be numerically approximated by a potential well model like Equation 17. Thus, the low-dimensional dynamics of the spiking network model can still be described by the potential well model, so we believe the spiking network will also be robust to unstructured perturbations in its spatial architecture. This robustness allows the reduction of diffusion due to structured heterogeneity to be relatively unaffected by sources of unstructured heterogeneity that undoubtedly exist in most cortical networks (Wang et al., 2006).

### Memory storage as a noisy channel

Structured spatial heterogeneity in recurrent excitatory coupling has two distinct influences on response fidelity in working memory networks. First, it produces a finite set of attractors with which to store stimuli. Second, as we have shown, heterogeneity reduces the diffusion of persistent bump states across the network. These two influences have consequences for the overall storage performance by the network. We next characterize the working memory network as a noisy information channel (Cover and Thomas, 2006) and show how spatial heterogeneity of excitatory coupling mitigates a tradeoff between errors due to these two influences.

Consider a stimulus that is chosen from *m* equally likely values and is to be stored by a working memory network. The network has *n* attractors and must store the stimulus value for *T* seconds before being read out. From a coding perspective, we have a chain where random input *X* ∈ [1, *m*] is loaded into attractor *Y*(0) ∈ [1, *n*] and remains in storage until *Y*(*T*) ∈ [1, *n*], after which it is finally read out as response *Z* ∈ [1, *m*] (Fig. 6*A*). If *n* < *m* then the transition *X* → *Z* involves a compression (*X* → *Y*(0)) and expansion (*Y*(*T*) → *Z*) of data, causing errors in transmission due to the quantization of the neural representation (Materials and Methods).
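The compression and expansion steps can be sketched as nearest-neighbor maps on the ring. This is an illustrative sketch under the assumption of evenly spaced stimuli and attractors; the index convention (0-based) is ours, not the paper's.

```python
import math

def _nearest(v):
    # round half up, avoiding Python's banker's rounding
    return int(math.floor(v + 0.5))

def quantize(x, m, n):
    """Compression X -> Y(0): map stimulus x in {0, ..., m-1} to the
    nearest of n evenly spaced attractors on the ring."""
    return _nearest(x * n / m) % n

def expand(y, m, n):
    """Expansion Y(T) -> Z: read attractor y back out as the nearest of
    the m possible responses."""
    return _nearest(y * m / n) % m
```

When *n* = *m* the round trip is the identity; when *n* < *m*, several stimuli share one attractor, which is the source of quantization error.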

The transition *Y*(0) → *Y*(*T*) involves diffusion across the network, which also degrades storage. To compute the probability of transitioning from one attractor to another during the storage phase, we need only integrate the Gaussian approximation with variance *D*_{eff}(*n*)*T* over the appropriate domain (Fig. 6*B*,*C*). In this way, we calculate the matrix of transition probabilities from one attractor to another during the retention interval. With this matrix, we can calculate the information lost due to diffusion (Materials and Methods). Naturally, as the product *D*_{eff}(*n*)*T* increases, diffusion becomes more prominent (Fig. 6*B*,*C*), and information loss due to diffusion increases.
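The transition-probability calculation can be made concrete as follows. This is a minimal sketch, not our simulation code: it assumes *n* evenly spaced attractors whose basins each span half the inter-attractor spacing on either side, and integrates the Gaussian approximation (variance *D*_{eff}(*n*)*T*, passed in as `DT`) over each basin, summing wrapped copies to respect ring periodicity.

```python
import math

def _cdf(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def transition_matrix(n, DT):
    """P[i][j]: probability a bump starting at attractor i ends in the
    basin of attractor j after the delay, under Gaussian displacement
    with variance DT = D_eff(n) * T."""
    sigma = math.sqrt(DT)
    spacing = 2.0 * math.pi / n
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # signed angular offset from attractor i to j, wrapped to (-pi, pi]
            d = ((j - i) * spacing + math.pi) % (2.0 * math.pi) - math.pi
            p = 0.0
            for k in range(-3, 4):  # wrapped copies of the basin
                lo = (d - spacing / 2 + 2.0 * math.pi * k) / sigma
                hi = (d + spacing / 2 + 2.0 * math.pi * k) / sigma
                p += _cdf(hi) - _cdf(lo)
            P[i][j] = p
    return P
```

Each row sums to one, and for small *D*_{eff}(*n*)*T* the matrix is strongly diagonal (the bump rarely escapes its initial basin), as in Figure 6*B*.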

In total, an increase in *n* has the dual effect of reducing quantization error, yet increasing diffusion error. Thus, we predict that spatial heterogeneity (measured by *n*) causes a tradeoff between quantization and diffusion-based error, and an optimal heterogeneity will maximize the overall information flow across the channel. We explore this prediction in the next section.

### Optimizing information flow with heterogeneous network coupling

Delay time *T* and the number of possible stimuli *m* can be easily controlled in working memory experiments (Funahashi et al., 1989). By fixing the protocol in working memory tasks, it has been shown that animals can improve their performance through extensive training, and boost their average reward rate (Meyer et al., 2011). Performance improvements are likely caused by modifications to the structure of networks underlying working memory, so we presume the spatial heterogeneity of the network, parametrized by *n,* evolves internally through reward-based plasticity mechanisms (Schultz, 1998; Wang, 2008; Klingberg, 2010). To measure the overall success of storage, we consider the mutual information *I* between *X* and *Z* for both the potential well model and the full spiking network (Materials and Methods). Mutual information measures the reduction in uncertainty in stimulus *X* when response *Z* is known. Furthermore, mutual information allows for a clean dissection of the information loss due to quantization-based and diffusion-based errors.

The stimulus space compression involved in *X* → *Y*(0) involves a loss of log_{2}(*m*/*n*) bits, a quantity that decreases with *n*. Calculating information loss due to diffusion (*Y*(0) → *Y*(*T*)) requires that we compute the effective diffusion coefficient *D*_{eff}(*n*). To obtain *D*_{eff}(*n*) for the potential well model, we use our analytical approach (see Materials and Methods, Eq. 9), while for the spiking model we use a numeric fit to *D*_{eff}(*n*) (Fig. 3). Information loss due to diffusion increases with *D*_{eff}(*n*), which in turn increases with *n* (Figs. 3, 4*E*). Given both sources of information loss, we compared the channel theory prediction for *I*(*X*, *Z*) to direct estimates of *I*(*X*, *Z*) based on the joint density *p*(*X*, *Z*) (Materials and Methods), for both the spiking and potential well models.
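The direct estimate of *I*(*X*, *Z*) from a joint density can be sketched in a few lines. This is a generic mutual-information calculation, shown here only to make the quantity concrete; `p_xz` is a joint probability table, estimated however is convenient.

```python
import math

def mutual_information(p_xz):
    """I(X;Z) in bits from a joint probability table p_xz[x][z]."""
    px = [sum(row) for row in p_xz]                 # marginal p(x)
    pz = [sum(col) for col in zip(*p_xz)]           # marginal p(z)
    total = 0.0
    for x, row in enumerate(p_xz):
        for z, p in enumerate(row):
            if p > 0.0:                             # 0 * log 0 = 0 by convention
                total += p * math.log2(p / (px[x] * pz[z]))
    return total
```

A noiseless channel over *m* equiprobable symbols yields the full stimulus entropy log_{2}(*m*) bits, while an independent *X* and *Z* yield zero.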

For a fixed *m* and very short delay time *T*, the information *I*(*X*, *Z*) monotonically increases with *n* and is maximized when *n* = *m* (Fig. 7*A*,*B*; *T* = 0.1 s). This is expected, since recall is near immediate, so that diffusion-based error is negligible and quantization error dominates the information loss. This error vanishes when *n* = *m* and *I*(*X*, *Z*) approaches the stimulus entropy (log_{2}(*m*) bits). As the delay time is increased, information loss due to quantization error does not change, but errors due to diffusion increase. For sufficiently long *T,* information peaks at a value of *n*_{max} < *m* in both the potential well model and the spiking model (Fig. 7*A*,*B*; *T* = 10 s). The value *n*_{max} marks a compromise between quantization and diffusion errors. For the potential well model, the optimal heterogeneity *n*_{max} decreases as the delay time *T* increases (Fig. 7*C*), since diffusion error grows as *T* increases. In total, we find that for sufficiently long delay times the degree of heterogeneity *n* should be less than the stimulus size *m* to optimize information transfer.

For a fixed delay time *T,* varying the number of possible inputs *m* also shifts *n*_{max}. Diffusion error is independent of the number of possible inputs *m*; however, the total possible information increases with the number of inputs *m*. For small *m*, we have that *n*_{max} = *m* since when *n* ≥ *m*, the quantization error is always zero and network diffusion increases with *n* (Fig. 8*A*,*B*; *m* = 4). However, for larger *m*, we find *n*_{max} < *m,* due to a compromise between quantization error and diffusion (Fig. 8*A*,*B*; *m* = 16). These results hold for the potential well model over a wide range of *m* (Fig. 8*C*). Overall, we highlight that a combination of varying *T* and *m* uncovers the effect of diffusion and quantization error on mutual information in a working memory network. In particular, for many combinations of *T* and *m*, an optimal spatial heterogeneity for information transfer can be found.
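The full chain underlying these sweeps, loading onto an attractor, Gaussian diffusion between basins, and read-out, can be assembled into one self-contained calculation. This is an illustrative sketch under the assumptions stated above (evenly spaced stimuli and attractors, 0-based indices); to reproduce the *n*_{max} < *m* effect one would supply an increasing *D*_{eff}(*n*), e.g., from the fits in Figure 3, as the `DT` argument.

```python
import math

def _cdf(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def channel_information(m, n, DT):
    """I(X;Z) in bits for the chain X -> Y(0) -> Y(T) -> Z: m equiprobable
    cues on a ring, nearest-attractor loading onto n evenly spaced
    attractors, Gaussian diffusion with variance DT = D_eff(n) * T between
    attractor basins, then nearest-cue read-out."""
    spacing = 2.0 * math.pi / n
    sigma = math.sqrt(DT)
    pzx = [[0.0] * m for _ in range(m)]                 # p(z | x)
    for x in range(m):
        y0 = int(math.floor(x * n / m + 0.5)) % n       # load: X -> Y(0)
        for yT in range(n):
            # signed angular offset between attractors, wrapped to (-pi, pi]
            d = ((yT - y0) * spacing + math.pi) % (2.0 * math.pi) - math.pi
            p = 0.0
            for k in (-1, 0, 1):                        # ring wrap-around
                p += (_cdf((d + spacing / 2 + 2.0 * math.pi * k) / sigma)
                      - _cdf((d - spacing / 2 + 2.0 * math.pi * k) / sigma))
            z = int(math.floor(yT * m / n + 0.5)) % m   # read out: Y(T) -> Z
            pzx[x][z] += p
    pz = [sum(pzx[x][z] for x in range(m)) / m for z in range(m)]
    info = 0.0
    for x in range(m):
        for z in range(m):
            if pzx[x][z] > 0.0 and pz[z] > 0.0:
                info += (pzx[x][z] / m) * math.log2(pzx[x][z] / pz[z])
    return info
```

With negligible diffusion, the channel carries log_{2}(min(*m*, *n*)) bits, isolating the quantization loss; increasing `DT` then erodes information, producing the tradeoff in *n* described above.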

The information *I*(*X*, *Z*) measures the general relation between *X* and *Z,* one that is decoder independent. However, in psychophysical experiments, a reward is only given when the recall is correct, i.e., *X* = *Z*. Thus, it is important to consider how the probability of correct recall depends on the spatial heterogeneity of excitatory connections. In both the potential well and spiking network models, the probability of correct recall is maximized for a fixed *n* < *m* when *T* is sufficiently long (Fig. 9*A*,*B*), consistent with observations of *I*(*X*, *Z*) (Fig. 7*A*,*B*). The *n* that maximizes the probability of correct recall decreases as storage time increases (Fig. 9*C*), also in agreement with results using *I*(*X*, *Z*) (Fig. 7*C*). The probability of correct recall provides a quantification of error that weights all incorrect responses the same. To use knowledge of the spatial organization of the cue set in determining error, we also measure the impact of the number of attractors on the angular difference between the recalled and cued stimulus position. Specifically, we compute the variance of the difference between the recalled and input cue location *X* − *Z*. The magnitude of the recall error is minimized for a fixed *n* < *m* (Fig. 9*D*,*E*), corroborating our findings for *I*(*X, Z*) and proportion correct. Thus, our core finding that information transfer across the memory network is maximized for a specific degree of spatial heterogeneity also holds for a measure of task performance.

## Discussion

We have outlined how both neural architecture and noisy fluctuations determine error in working memory codes. In working memory networks, the position of a bump in spiking activity encodes the memory of a stimulus, and input fluctuations cause diffusion of the bump position, which degrades the memory. Spatially heterogeneous recurrent excitation reduces the diffusion of bumps by stabilizing a discrete set of bump positions. However, this also introduces memory quantization, limiting the capacity of information transfer. By analyzing the information loss incurred by both error sources, we can maximize the transfer of information between the stimulus and the memory output by tuning the spatial heterogeneity of recurrent excitation. We found that the ideal heterogeneity gives a number of attractors in the network *n*_{max}, which can be less than the number of possible inputs to the network *m*.

### Robust bump dynamics through quantization

Networks whose dynamics lie on a continuum attractor have steady-state activity that can be altered by arbitrarily weak noise and input (Bogacz et al., 2006). The advantage of this feature is that two stimuli with an arbitrarily fine distinction can be reliably stored and distinguished upon recall. However, this structure requires fine-tuning of network architecture, since any parametric jitter will destroy a continuum attractor. Previous work has shown how spatial heterogeneity in recurrent excitatory coupling quantizes the continuum attractor and stabilizes persistent network firing rates to perturbations in model parameters (Koulakov et al., 2002; Brody et al., 2003; Goldman et al., 2003; Cain and Shea-Brown, 2012) and fluctuations (Fransén et al., 2006). We believe our results apply to these models and have extended this previous work in two major ways.

First, we have shown that quantizing the state space of a spatially structured network into a finite number of attractors stabilizes bump position to dynamic noise. Second, we have shown that there is an optimal number of attractor positions for storing stimuli when dynamic noise is present. The optimal number can be lower than the actual quantization of possible stimuli, so that under-representing stimulus space can lead to more reliable coding. Studies of networks encoding the memory of eye position show individual neurons exhibit bistability in their firing rates (Aksay et al., 2003), which motivated modeling their firing rate to input relations as quantized, staircase-shaped functions (Goldman et al., 2003). This provides an example of a working memory network thought to provide a discrete delineation of a continuous variable. Our results also suggest parametric working memory networks should coarsen the stored signal to guard against diffusion error.

### The advantage of spatial heterogeneity

Past work has suggested spatial heterogeneity in working memory networks is a barrier to reliable memory storage (Zhang, 1996; Renart et al., 2003; Itskov et al., 2011; Hansel and Mato, 2013). In these studies, parameters of single neurons (Renart et al., 2003) or synaptic architecture (Itskov et al., 2011; Hansel and Mato, 2013) change throughout the network in a spatially aperiodic way. Substantial quantization error results since the bump drifts toward one of a finite number of attractor positions that may not be evenly spread over representation space. Renart et al. (2003) show this effect can be overcome by considering homeostatic mechanisms that balance excitatory drive to each neuron in the network, halting drift of bumps altogether. On the other hand, both Itskov et al. (2011) and Hansel and Mato (2013) show drift can be slowed by including short-term facilitation in the network. Rather than exploring ways to remove the effective drift introduced by spatial heterogeneity, we have shown spatial heterogeneity can improve network coding. As long as quantization error is outweighed by a reduction in diffusion error, heterogeneous networks make less overall error in recall tasks than spatially homogeneous networks.

Our model represents the space of possible oriented cues as evenly distributed in space with uniform probability of presentation, a protocol often used in experiments (Funahashi et al., 1989; White et al., 1994; Goldman-Rakic, 1995; Meyer et al., 2011). Thus, we reason that the ideal covering of stimulus space by the network will have a uniform distribution. This translates into network spatial heterogeneity that is exactly periodic. The periodicity allows for a compact derivation of the effective diffusion coefficient *D*_{eff}(*n*). Were the stimulus set to have an asymmetric probability distribution, we would expect the ideal spatial heterogeneity would not produce evenly spaced attractors. In this case, approximating bump position is possible, but motion between attractors will depend on θ and will be difficult to interpret as a simple diffusion process. Nevertheless, we expect that the specifics of spatially uneven heterogeneous coupling will significantly impact both the attractor quantization and stochastic drift across the network, and control the information transfer from stimulus to recall.

### Mechanisms that produce structured heterogeneity

We conceive of two main biophysical processes that could produce structured spatial heterogeneity in a working memory network. First, Hebbian plasticity may operate at locations in the network that are driven by common external cues. Such cues will consistently activate neurons of similar orientation preference, so clusters of similarly tuned cells will tend to strengthen recurrent excitation between each other (Goldman-Rakic, 1995; Clopath et al., 2010; Ko et al., 2013). Recent experiments show training does increase the delay period firing rates of neurons with a preference for the encoded cue (Meyer et al., 2011), which may occur due to reinforcement of recurrent excitation. This mechanism would create attractors only at locations in the network that consistently receive feedforward input during training. Neurons that are never directly stimulated by cues would be deprived of continual reinforcement of their excitatory inputs, allowing broadly tuned inhibition to decrease their delay period firing (Wang, 2001). In the framework of our models, a network trained on *n* cue locations would form *n* attractors. Depending on the length of delay, this might not be the optimal number of attractors, but it would improve coding compared with the network without quantization.

Second, reward-based plasticity mechanisms signaled by dopamine may supervise the reinforcement of synaptic excitation to form a network with the optimal number of attractors. Many studies have verified that dopamine can carry reward signals back to the network responsible for a correct action (Schultz, 1998). Selectively acting on specific subsets of neurons, dopamine can prompt plasticity in network architecture to improve future chances of rewards (McNab et al., 2009; Klingberg, 2010). Such supervisory mechanisms could seek an optimal architecture in the network to maximize reward yields for a fixed retention time and number of possible cues. As we have demonstrated, this resulting structured heterogeneity will improve coding, even if there is unstructured heterogeneity present (Wang et al., 2006).

### Relating diffusion of neural activity to behavior

Our results (and those of many other past studies) assume that neural activity has a diffusive component. However, how exactly neural variability drives behavioral variability is largely unknown (Britten et al., 1996; Churchland et al., 2011; Brunton et al., 2013; Haefner et al., 2013). Psychophysical studies of spatial working memory tasks reveal that subjects typically respond with nonzero error. In particular, Ploner et al. (1998) show that the midspread of memory-guided saccades in humans scales sublinearly with delay time over 0.5–20 s. This scaling is consistent with a diffusion of neural activity involved in the storage of memory, giving support to our modeling assumptions.
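The signature of diffusion in these data is the square-root growth of response spread with delay. A minimal sketch of this prediction, with a hypothetical diffusion coefficient `D` and the Gaussian approximation (variance *D T*) used above:

```python
import math

def diffusive_midspread(D, T):
    """Midspread (interquartile range) of the remembered position after a
    delay T under pure diffusion: position ~ Normal(0, D * T), so the IQR
    grows like sqrt(T). D is a hypothetical diffusion coefficient."""
    z75 = 0.6744897501960817   # 75th percentile of the standard normal
    return 2.0 * z75 * math.sqrt(D * T)
```

Quadrupling the delay only doubles the midspread, the sublinear scaling reported by Ploner et al. (1998) over 0.5–20 s.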

In contrast to these data, recent psychophysiological work in rodents and humans performing decision-making tasks lasting 0.5–2 s suggests that models with sensory noise, rather than internal diffusion, best capture these behavioral data (Brunton et al., 2013). However, this study ultimately considers a two-alternative forced-choice task, and does not consider the storage and recall of inputs over a large stimulus space. When there are only two attractors in our network, the diffusion coefficient is near zero, consistent with Brunton et al. (2013). In addition, the timescale of tasks studied by Brunton et al. (2013) may not be long enough to substantially reveal the effects of internal diffusion. These differences between the diffusive nature of working memory and decision integrator networks suggest that more work needs to be done to link variability of neural and behavioral response.

### Implications for multiple object working memory

We emphasize that we did not study network encoding of multiple object working memory. Added complications arise when several items must be remembered at once (Luck and Vogel, 1997). For instance, the error made in recalling the value of a set of multiple continuous variables increases with the set size (Wilken and Ma, 2004). Recently, it has been shown that a spiking network model can recapitulate many of these set size effects (Wei et al., 2012). Interestingly, there is an optimal spread of pyramidal synapses that minimizes errors due to set size. However, reduction of the effects of dynamic noise on the accuracy of memories has yet to be studied. Our ideas could be extended to analyze how networks that encode multiple object memory could be made more robust, applying network quantization to the storage of multiple bump attractors.

## Footnotes

This work was supported by National Science Foundation Grants NSF-DMS-1121784 (B.D.), NSF-DMS-1311755 (Z.P.K.), and NSF-DMS-1219753 (B.E.). We thank Robert Rosenbaum, Ashok Litwin-Kumar, and Krešimir Josić for useful discussions.

The authors declare no competing financial interests.

- Correspondence should be addressed to Zachary P. Kilpatrick, Department of Mathematics, University of Houston, Phillip G. Hoffman 651, Houston, TX 77204-3008. zpkilpat{at}math.uh.edu