## Abstract

Insects and vertebrates harbor specific neurons that encode the animal's head direction (HD) and provide an internal compass for spatial navigation. Each HD cell fires most strongly in one preferred direction. As the animal turns its head, however, HD cells in rat anterodorsal thalamic nucleus (ADN) and other brain areas fire already before their preferred direction is reached, as if the neurons anticipated the future HD. This phenomenon has been explained at a mechanistic level, but a functional interpretation is still missing. To close this gap, we use a computational approach based on the movement statistics of male rats and a simple model for the neural responses within the ADN HD network. Network activity is read out using population vectors in a biologically plausible manner, so that only past spikes are taken into account. We find that anticipatory firing improves the representation of the present HD by reducing the motion-induced temporal bias inherent in causal decoding. The amount of anticipation observed in ADN enhances the precision of the HD compass read-out by up to 40%. More generally, our theoretical framework predicts that neural integration times not only reflect biophysical constraints, but also the statistics of behaviorally relevant stimuli; in particular, anticipatory tuning should be found wherever neurons encode sensory signals that change gradually in time.

**SIGNIFICANCE STATEMENT** Across different brain regions, populations of noisy neurons encode dynamically changing stimuli. Decoding a time-varying stimulus from the population response involves a trade-off: For short read-out times, stimulus estimates are unreliable as the number of stochastic spikes is small; for long read-outs, estimates are biased because they lag behind the true stimulus. We show that optimal decoding of temporally correlated stimuli not only relies on finding the right read-out time window but requires neurons to anticipate future stimulus values. We apply this general framework to the rodent head-direction system and show that the experimentally observed anticipation of future head directions can be explained at a quantitative level from the neuronal tuning properties, network size, and the animal's head-movement statistics.

- anticipation
- computational modeling
- head-direction system
- neural population activity
- optimal decoding
- rodent anterior dorsal nucleus

## Introduction

Spatial orientation requires information about one's current position and heading direction. In rodents, positional information is provided by place cells (O'Keefe and Dostrovsky, 1971), grid cells (Hafting et al., 2005), boundary vector and border cells (Barry et al., 2006; Solstad et al., 2008), and further spatially modulated neurons. As in other vertebrate (Finkelstein et al., 2015) and insect (Seelig and Jayaraman, 2015; Varga and Ritzmann, 2016) species, directional information is represented by head-direction (HD) cells across various brain areas (Taube et al., 1990a; Wiener and Taube, 2005; Taube, 2007).

HD cells integrate motor efference copies with vestibular, visual, and proprioceptive signals, and are maximally active when the animal's head points into a cell-specific preferred direction (Taube, 2007). These directions are uniformly distributed and rotate in close synchrony with rotated salient visual cues. During navigation in darkness, direction-specific firing is maintained, but the preferred directions drift over time (Taube et al., 1990b). Optogenetically interfering with the HD system impairs homing, demonstrating that the HD system plays a crucial role for spatial navigation (Butler et al., 2017).

Conceptually, one can arrange the preferred directions of HD cells on a ring that is anchored to the local environment (see Fig. 1). When the animal turns its head, the packet of active neurons is updated such that those cells are most active whose preferred directions match the animal's new HD. As the HD can be read out from the position of this activity packet along the ring, the HD system is often referred to as an internal or neural compass (Schultheiss and Redish, 2015). Unlike the needle of a physical compass, which points toward magnetic north, the preferred directions of HD cells are determined by local cues and vary from environment to environment.

How accurately can an animal determine its HD from this compass? Typical tuning widths of individual HD cells (see Figs. 1*B*, 2) range from 30° to 70° (see Table 1), too large for accurate navigation on the basis of single-cell responses. However, combining the information provided by all simultaneously active HD cells leads to a far more reliable estimate. To understand HD coding at the network level, we construct a computational model, read out its activity using population-vector decoding (Georgopoulos et al., 1983; Seung and Sompolinsky, 1993; Salinas and Abbott, 1994), and study the accuracy and bias of this neural compass. For concreteness, we focus on the anterodorsal thalamic nucleus (ADN), which is a core region of the HD system (Taube, 2007) and contains the highest fraction of HD cells (Taube and Bassett, 2003).

During foraging, the animal's HD becomes a time-varying quantity. In this case, causal decoding, for which only spikes from the past are used, biases the population vector, as we will show. To counteract this temporal bias, a decoder can reduce the read-out time, yet shorter read-out times result in higher variability.

Interestingly, HD cells in the ADN fire in an anticipatory manner: during head turns, their preferred directions shift such that the cells fire earlier than without movement, as though they encoded the future HD (Blair and Sharp, 1995). Our decoding approach suggests an alternative interpretation: anticipatory firing improves the representation of the present HD by compensating the motion-induced bias inherent in the population vector. If cells anticipate the HD by 25 ms, as observed in the ADN, the HD compass read-out improves by up to 40%. Moreover, across the layers within the HD circuit, the anticipatory time intervals (ATIs) should be staggered to optimize the population-vector codes, which is consistent with experimental observations (see, e.g., Taube, 2007). Similar phenomena are to be expected in other brain areas that encode stimuli that vary smoothly in time, neural coding will benefit from anticipation wherever sensory predictions or motor errors (see, e.g., Mumford, 1992; Wolpert et al., 1995; Berry et al., 1999; Rainer et al., 1999; Bastian, 2006; Bastos et al., 2012; Palmer et al., 2015) need to be calculated.

## Materials and Methods

##### HD trajectories.

The HD trajectories θ(*t*) were recorded in the Buzsáki laboratory (Mizuseki et al., 2009) and are publicly available at http://dx.doi.org/10.6080/K0Z60KZ9. The tracking data (file extension *.whl) are from male rats foraging on square platforms (side length: 120 or 180 cm). The files contain time traces (sampling rate: 39.6 Hz) of the *x* and *y* coordinates of two LEDs mounted on the rats' head stage. To facilitate analyses at a fine temporal scale, we linearly interpolated the line segments between the data points to obtain an artificial sampling rate of 1 kHz. The time course of the angle between the vector connecting the LEDs (caudal to rostral) and the positive *x* axis defines the HD trajectory θ(*t*) with −π ≤ θ(*t*) < π. To avoid erroneous data points, we cut each tracking file at time points where tracking data were missing and only kept trajectory segments during which the distance between the LEDs was between 5 and 16 cm. Lower distances result from strong head tilts (e.g., during eating or rearing); larger distances indicate recording errors (distances cannot be larger than the distance between the LEDs). From the remaining segments, only those >10 s were kept for further analysis. Two of those segments were sorted out manually because of recording irregularities. In sum, this procedure resulted in 1163 segments. The angular head speeds have mean values of 87.7 ± 19.1 deg/s, median values of 60.4 ± 16.5 deg/s, and maximal values of 652 ± 173 deg/s; the forward speeds have mean values of 12.3 ± 3.0 cm/s, median values of 9.9 ± 2.9 cm/s, and maximal values of 65 ± 20 cm/s. The correlation between angular head speed and forward speed is 0.45, indicating that faster head turns are more likely during periods of faster forward motion.

A representative segment with a duration *T _{E}* of approximately 90 s (see Fig. 3) was chosen for the Monte Carlo simulations described below. This segment length is a compromise between sampling enough of the (foraging) behavior of the animal while keeping the computation time at a moderate level, and leads to representative statistics as shown in Figure 7. We mainly use the radian measure for angles in Materials and Methods. In the figure legends and Results, we state the angles in the degree measure for a more intuitive reading.

For comparison, we analyzed a second set of movement data from foraging male rats (Sargolini et al., 2006). As for the Mizuseki et al. (2009) data, we used recordings with two LEDs and selected trajectory periods by excluding erroneous data points, periods of very short and very long LED distances, and all short trajectory segments between these points. The Sargolini et al. (2006) data are already smoothed with a 300 ms running-average filter to remove tracking jitter. When the same smoothing operation is applied to the Mizuseki data, the difference of the duration-weighted averages between both datasets is <0.15° for time windows *T* < 250 ms (for mathematical definitions, see below). This indicates that the Mizuseki et al. (2009) dataset and the trajectory that we selected from this dataset for our simulations capture representative movement statistics of a rat during foraging behavior.

To analyze the effect of anticipatory firing, the time-dependent angular head velocity (AHV) ω(*t*) needs to be known. We numerically calculated ω(*t*) from the experimentally obtained HD trajectories θ(*t*) by central differences (i.e., ω(*t _{i}*) = [θ(

*t*

_{i+1}) − θ(

*t*

_{i−1})]/[

*t*

_{i+1}−

*t*

_{i−1}]) to make sure that position and velocity samples are synchronized. In this case, the circular trajectory θ(

*t*) was unfolded at the wrap-around θ(

*t*) = ±π.

##### Spike train model.

Each model neuron *j* generates an inhomogeneous Poisson spike train with time-dependent rate λ* _{j}*(θ(

*t*)). Accordingly, the number of spikes

*k*(

_{j,T}*t*) fired in a time window of length

*T*ending at time

*t*is given by a Poisson distribution with time-dependent mean α

*(*

_{j,T}*t*) = ∫

_{t−T}

^{t}λ

*(θ(*

_{j}*s*))

*ds*as follows: The population response in that time window is given by the spike-count vector

*K⃗*(

_{T}*t*) = (

*k*

_{1,T}(

*t*),

*k*

_{2,T}(

*t*), …,

*k*(

_{N,T}*t*)). Assuming independence, the probability

*P*(

*K⃗*(

_{T}*t*)) to observe a specific spike-count vector factorizes (i.e.,

*P*(

*K⃗*(

_{T}*t*)) = ∏

_{j=1}

^{N}

*P*(

*k*(

_{j,T}*t*)). Figure 3 shows a raster plot of HD spike trains generated by this process.

##### Homogeneous and inhomogeneous cell populations.

Our mathematical analysis focuses on homogeneous populations of HD cells. In these model populations, neurons have identical von Mises tuning curves λ* _{j}*(θ), given by Equation 17 in Results, which differ only by their equidistant preferred directions θ

*. All other parameters (maximum firing rate*

_{j}*f*, background firing rate

_{max}*f*, inverse tuning width κ, and ATI τ) are chosen as the mean values from Table 1, which contains average experimental HD tuning parameters. The mean (background-corrected) population firing rate is: where

_{bg}*I*

_{0}(κ) is the modified Bessel function of first kind. Using

*I*

_{0}(κ) ≈

*e*

^{κ}/, one obtains (Zhang, 1996) the following: The analytic results obtained for homogeneous populations serve as a good approximation, also in the case of the related inhomogeneous populations (especially as

*N*→∞; see Results).

For a description that is as close as possible to the biological findings, we simulate inhomogeneous populations. In such populations, every neuron has individual tuning parameters (*f _{max,j}*,

*f*, κ

_{bg,j}*and τ*

_{j}*), as shown by the example in Figure 3. Each of these parameters is sampled from a beta distribution, that is, a probability density function*

_{j}*p*(

*x*) ∝ (

*x*−

*x*)

_{min}^{a−1}(

*x*−

_{max}*x*)

^{b−1}on the interval [

*x*,

_{min}*x*], whose parameters

_{max}*a*and

*b*are obtained from fitting the distribution's mean and variance to the values shown in Table 1 (see also Fig. 2). By shuffling the background firing rates, we ensure that the ratio between a cell's maximal firing rate

*f*and its background firing rate

_{max,j}*f*exceeds 5 for every cell, as suggested by Taube (2007).

_{bg,j}This method is not meant to exactly replicate the experimental tuning parameter distributions; rather, we use the method to explore the qualitative effect of inhomogeneous cell properties on the HD code. Nevertheless, because the first two moments of the experimental distributions as well as the observed parameter ranges are reproduced by the beta distributions, they mimick realistic HD populations. A comparison of Figure 2 with Taube (1995, their Figs. 5, 6) shows that the model distributions capture key features of the experimental histograms; however, our distributions have less pronounced peaks, especially for *f _{max}* and

*f*. To model the preferred HD of each cell, an equidistant arrangement (θ

_{bg}*=*

_{j}*j*× 2π/

*N*− π) is distorted by adding to each θ

*a random shift that is chosen from a Gaussian distribution with mean 0 and SD 2π/*

_{j}*N*.

##### Circular error measure.

The accuracy of the HD compass is determined by taking the angle θ̂(*t*) of the population vector (see Eq. 17) and computing its average deviation from the true trajectory θ(*t*). Because we are calculating a difference of circular directions, which are defined on the interval [−π, π), measures such as the squared difference between θ̂(*t*) and θ(*t*) are not appropriate. Consider, for example, cases in which θ̂(*t*) and θ(*t*) represent similar directions but lie at different sides of the wrap-around at ±180°, for example, θ̂(*t*) = 175° and θ(*t*) = −175°. Although the two directions actually differ by only 10°, the squared difference measure would be (350°)^{2}. A well-defined error measure for this situation is the circular distance given by 1 − cos(θ̂(*t*) − θ(*t*)), as discussed by Jammalamadaka and Sengupta (2001). For small angular deviations, the circular distance approaches the squared difference, as illustrated in Figure 5*A*:
Averaging the circular distance between the population vector's direction and the true trajectory value over all possible neural responses defines the mean circular error as follows:
where the expectation is taken over all realizations of the spike-count vector. The mean circular error is the circular equivalent of the mean squared error used for nondirectional variables. Averaging the mean circular error *D*(*t*) along the underlying trajectory θ(*t*) results in the average mean circular error:
The error measure *D* includes the influences of both the stimulus dynamics and the noisy spike trains on the accuracy of the HD code. Compared with studies on the encoding and decoding of static stimuli, such as Mathis et al. (2012), the trajectory takes the role of the prior stimulus distribution. For graphical representation, we transform the error *D* to the degree-scale by defining the estimator's accuracy *A* = (180°/π) · arccos (1−D), similar to the root mean square error used to describe the performance when dealing with noncircular stimuli.

##### Properties of the circular error measure.

Because the reader might not be familiar with directional statistics, we briefly introduce the equivalents of the mean value, variance, and bias (see, e.g., Mardia and Jupp, 2009). The equivalent of the mean value is the mean or “expected” direction 〈θ̂(*t*)〉 = arg {E[*e*^{iθ̂(t)}]}. Note that the *arg* operation is performed *after* the expectation *E* is taken, again to deal with the circular wrap-around in a well-defined manner.

The circular variance is given as *V*[θ̂(*t*)] = 1 − |E[*e*^{iθ̂(t)}]| = 1 − E[cos(θ̂(*t*) − 〈θ̂(*t*)〉)]. Accordingly, the mean circular error can be decomposed into contributions of the circular variance and a circular bias:
where the second term captures the systematic deviation of the estimator's mean direction from the actual trajectory value. This latter term is thus the circular equivalent of a bias (for a visualization of the two error sources, see Fig. 5) and Equation 6 is the circular equivalent of the decomposition of the mean square error into variance and squared bias.

##### Treatment of the zero spike-count case.

If not a single HD cell fires in the interval [*t-T*, *t*] (i.e., *K⃗ _{T}*(

*t*) = 0⃗), the direction of the population vector is undefined and no information about the current HD is available. The probability for this case can be approximated by

*P*

_{0⃗,T}=

*P*(

*K⃗*(

_{T}*t*) = 0⃗) ≈

*e*

^{−(R+N·fbg)·T}, where the population firing rate

*R*is defined in Equation 2. To extend the decoding scheme in a consistent manner to this case, θ̂(0⃗) should be picked randomly from

*U*

_{[−π,π)}, the uniform distribution on the interval [−π, π). Separating this special case from all

*K⃗*≠ 0⃗, one obtains the following: The size of the term

*P*

_{0⃗,T}=

*e*

^{−(R+N·fbg)·T}determines whether the case

*K⃗*= 0⃗ is relevant or not. For the mean parameters of the ADN HD system shown in Table 1,

_{T}*T*∼ 1̸(10

*Hz*·

*N*) in order for the zero-spike case to contribute significantly. Even for a small population of only

*N*= 100 neurons, this time scale is not more than 1 ms, much smaller than the optimal read-out times, in which we are interested. This shows that the zero-spike case is not relevant for our study so that we will not include the extended definition of the error in our analytical calculations.

##### Causal population-vector code.

Equation 19 in the main text describes the relation 〈θ̂* _{T}*(

*t*)〉 ≈ θ̄

*(*

_{T}*t*) between the expected HD of the population vector and the corresponding time-averaged true HD (for the mathematical definition of these quantities, see Results). We now derive this relation for two different scenarios; both require homogeneous populations, but for large enough

*N*, the results are also useful to describe inhomogeneous populations. In the context of 2D place fields and a maximum likelihood decoder, Mosheiff et al. (2017) recently derived a similar relationship.

Let us first assume that the encoded trajectory segments have constant AHV ω. This allows us to exploit a symmetry that arises for trajectory segments that are centered around the preferred direction of one particular neuron. For notational simplicity, let us assign the index 0 to this neuron and θ_{0} = 0, so that θ(*t*) = with *t* ∈ [0, *T*]. The other neurons are indexed such that θ_{−j} = − θ* _{j}*. This arrangement implies that the activity of the population is symmetric around θ = 0 in the time interval

*t*∈ [0,

*T*], i.e., α

_{−j}(

*t*=

*T*) = α

*(*

_{j}*t*=

*T*). Introducing the notation

*K⃗′*for a count vector with mirror indices relative to

*K⃗*, that is,

*k*=

_{j}′*k*

_{−j}, the symmetric activation implies

*P*(

*K⃗′*(

_{T}*t*=

*T*)) =

*P*(

*K⃗*(

_{T}*t*=

*T*)), while θ̂

*(*

_{T}*K⃗′*) = −θ̂

*(*

_{T}*K⃗*). Using these identities, the mean direction 〈θ̂

*(*

_{T}*t*=

*T*)〉 of the estimator θ̂

*(*

_{T}*t*=

*T*) is given by: An illustration of the symmetry of the distribution of θ̂(

*t*=

*T*/2) around the trajectory average of a trajectory segment with constant angular velocity is shown in Figure 6

*A*.

For an encoded variable that is static, the mean population vector can be calculated following Glasius et al. (1997). Applying their approach to a general, time-varying HD signal θ(*t*), we obtain the following:
so that
Thus, the mean population vector points into the direction of the trajectory average θ̄* _{T}*(

*t*). We cannot directly transfer this result to the mean direction 〈θ̂

*(*

_{T}*t*)〉 of the population vector, but it is reasonable to expect a similar behavior. Indeed, as shown by the example trajectory in Figure 6

*B*, the numerical estimates agree well with 〈θ̂(

*t*)〉. As also demonstrated in this figure, the population vector effectively smoothens the representation of a trajectory and delays it by a time lag of

*T*/2.

The histogram in Figure 6*A* shows that, for reasonable HD velocities and neural population sizes, the distribution of θ̂(*t*) is narrow and closely concentrated around 〈θ̂(*t*)〉. We can thus expand the mean circular error *D*(*t*) for small local angular distances to first nonvanishing order as *D*(*t*) ≈ . To obtain an approximation of the time-averaged circular error *D*, we integrate this expression along the whole trajectory over the corresponding time window i.e., *T _{e}* −

*T*, where

*T*indicates the trajectory's duration:

_{e}*V*and

*B*

^{2}denote the (approximate) time-averaged variance and time-averaged squared bias, respectively.

Figure 6 also shows that the width of the θ̂(*t*) distribution is only mildly affected by the AHV, even for high velocities. This implies that its variance *V* depends only weakly on time *t* so that we can omit the temporal average when calculating *V*. Furthermore, as the θ̂(*t*) distribution hardly varies with ω, results by Seung and Sompolinsky (1993) for the (local) variance of the direction of the population vector (derived assuming a static stimulus) apply. We use their approach, which is based on calculating the Fisher information, to approximate the local squared deviation of the population vectors angle from its mean direction as follows:
where θ is a static stimulus parameter, θ̂* _{T}* its estimator, and
are the Fourier components of the zero-centered tuning curve λ(θ). This result holds for any symmetric unimodal tuning curve.

For von Mises curves (2), one obtains λ̃* _{n}* = (

*f*−

_{max}*f*)

_{bg}*e*

^{−κ}

*I*(κ) +

_{n}*f*δ

_{bg}_{0n}because

*I*(κ) = . Inserting the first two terms of the expansion of each Bessel function for large κ, we arrive at the following: for a homogeneous neural ensemble with von Mises tuning curves (see Fig. 7, dotted lines). This shows that, as the average number

_{n}*RT*of (nonbackground) spikes in the spike-count window increases, the variability

*V*decreases and more information about the stimulus is available. Similarly, the variance of the estimator increases for higher background activity (Seung and Sompolinsky, 1993; Mathis et al., 2012).

To first approximation, the time-averaged circular error can be written as
where the second term needs to be evaluated for the encoded trajectory θ(*t*). In Figure 7 (right column), Equation 14 is shown as a function of *T* using the representative trajectory segment θ(*t*) mentioned above. The tuning curve parameters are chosen according to the mean values shown in Table 1.

##### Monte Carlo simulations.

We use Monte Carlo simulations of the average circular error to check the analytical results and assess the effect of inhomogeneous tuning parameters. To this end, we sample the integral over the trajectory and the Poisson noise for *M* ≫ 1 times: first, an endpoint of a trajectory segment *t _{i}* ∈ [

*T*,

*T*] is randomly selected. Then, a count vector

_{e}*K⃗*=

_{i}*K⃗*(

*t*) is sampled from the respective Poisson distribution. Finally, the circular distance

_{i}*d*(

*t*) = 1 − cos(θ̂(

_{i}*K⃗*) − θ(

_{i}*t*)) is calculated and the average circular error is approximated as

_{i}*D̂*= 1/

*M*∑

_{i=1}

^{M}

*d*(

*t*).

_{i}##### Experimental design and statistical analysis.

We reanalyzed data originally recorded by Mizuseki et al. (2009) and by Sargolini et al. (2006), and we refer the reader to those publications for details on the experimental designs. All our analyses were performed in Python (RRID:SCR_008394) using functions from the SciPy (RRID:SCR_008058) and NumPy (RRID:SCR_008633) modules.

## Results

The HD system of rodents encodes the time-dependent azimuth θ(*t*) of the animal's head relative to a landmark-anchored reference direction (Fig. 1) and plays a functional role for spatial navigation (Butler et al., 2017). Notably, during head turns, HD cells start to fire before their preferred direction θ* _{j}* is reached, as though the neurons anticipated the future HD (Blair and Sharp, 1995). Faster rotations lead to stronger shifts in time than slow turns. Denoting the time-dependent AHV by ω(

*t*), the shift can be approximated by the following: where the time constant τ

*is called the ATI of cell*

_{j}*j*and describes the extent to which HD firing seems to be shifted forward in time (Blair and Sharp, 1995; Blair et al., 1997; Taube and Muller, 1998); for a visualization, see Fig. 2

*A*, right.

The rodent HD system consists of multiple processing stages along which ATIs decrease from stage to stage: from the lateral mammillary nuclei (LMN, τ ≈ 40–75 ms) to the anterodorsal thalamic nucleus (ADN) and retrosplenial cortex (τ ≈ 25 ms) to the postsubiculum, where neurons tend to neither lag nor anticipate the cell's preferred firing direction (Taube, 2007). As with other HD cell parameters (see also Table 1), ATIs vary from cell to cell (Blair et al., 1997; Taube and Muller, 1998). At the mechanistic level, ATIs have been explained by an interplay between afferent dynamics and movement statistics (see also Zhang, 1996; van der Meer et al., 2007; Tsanov et al., 2014), but a consistent functional interpretation of ATIs is still missing.

As we will show, anticipatory firing improves the accuracy with which the present HD can be decoded from time-dependent HD signals. For concreteness, we use the population-vector decoder (Georgopoulos et al., 1983, 1986), which provides a reliable estimate of variables encoded by the distributed activity pattern of a neural ensemble (Seung and Sompolinsky, 1993; Salinas and Abbott, 1994; Glasius et al., 1997; Stemmler et al., 2015). We focus on HD cells in the ADN, which contains the highest fraction of HD cells (Taube and Bassett, 2003) and is considered a “core region” (Taube, 2007) or even “central hub” (Peyrache et al., 2017) of the HD system. To cover the dynamic character of head movements, we explicitly consider time-varying stimuli (Mosheiff et al., 2017). We take a computational approach and use experimentally recorded HD trajectories θ(*t*) to generate artificial spike trains (Fig. 3). The periodic tuning curves of individual HD cells can be captured by von Mises functions (Zhang, 1996) whose tuning width, peak firing rate, and background firing rate are chosen according to the available experimental literature (Fig. 2*B*). Spike generation is modeled by inhomogeneous Poisson processes (see Eq. 1). From the spike trains of this simulated neural population, we compute an estimated trajectory θ̂(*t*) based on the population-vector decoder. We then measure the accuracy of the HD system by analyzing the distribution of θ̂(*t*) and comparing its mean 〈θ̂(*t*)〉 with the true trajectory θ(*t*), as described in Materials and Methods. Using this framework, we systematically explore the HD representation and characterize the effect of anticipatory firing. We derive an analytic approximation for the population code's accuracy and use numerical simulations to check its validity and to study populations with inhomogeneous tuning curves. An approach based on a reconstruction from experimentally measured spike trains, as in Johnson et al. (2005) for the postsubiculum, is currently not feasible because simultaneous recordings of HD cells in ADN are only available for small populations of less than 20 HD neurons (Peyrache et al., 2015).

### HD trajectories

The HD trajectories θ(*t*) for this study are taken from a dataset that was recorded in the laboratory of G. Buzsáki (Mizuseki et al., 2009). Because the HD is a circular variable, temporal averages and expectation values need to be consistent with circular statistics (see Materials and Methods). In particular, the temporal average θ̄* _{T}*(

*t*) of a trajectory segment θ(

*s*) within some time interval [

*t*−

*T*,

*t*], is defined by: so that for constant AHV (i.e., θ(

*t*) = θ

_{0}+ ω ·

*t*) the trajectory average corresponds to the midpoint of the trajectory segment, θ̄

*(*

_{T}*T*/2) = θ

_{0}. To avoid overly complicated mathematical expressions, we will omit indices, such as

*T*in θ̄

*(*

_{T}*t*), or variables, such as

*t*in θ̄

*(*

_{T}*t*), unless this could cause misunderstanding.

### Tuning curves and spike-train model

HD cells in the ADN have been studied in great detail (see e.g., Knierim et al., 1995, 1998; Blair and Sharp, 1995; Blair et al., 1997, 1998; Goodridge and Taube, 1997; Stackman and Taube, 1998; Taube and Muller, 1998; Taube, 2007; Clark et al., 2009, 2012; Peyrache et al., 2015, 2017; Tan et al., 2015). Their stereotypical tuning profiles show a single hump and are well fit by von Mises functions (Zhang et al., 1998):
with peak firing rate *f _{max,j}*, background firing rate

*f*, “concentration” parameter κ

_{bg,j}*, and preferred direction θ*

_{j}*. The cell's tuning width σ*

_{j}*can be calculated as σ*

_{j}*= 180°/(). Based on these tuning curves, each model neuron generates an inhomogeneous Poisson spike train (see Eq. 1) with time-dependent rate λ*

_{j}*(θ(*

_{j}*t*)). If we use λ without subscript, we refer to a neuron with preferred direction θ = 0.

According to Taube and Bassett (2003), there are *N* ≈ 12,000 HD cells in ADN per brain hemisphere. We therefore study ensembles up to this size but also consider smaller *N* to analyze size-dependent effects.

The tuning parameters of HD cells vary strongly from neuron to neuron. To evaluate the effect of such variations on the neural code, we describe the parameters by probability distributions whose ranges, means, and SDs are estimated from the literature (see Table 1; Materials and Methods). Sampling the parameters of each model neuron from these distributions results in what we call an inhomogeneous population (Fig. 3). To model the preferred HD of each cell, the equidistant arrangement
is altered by adding to each θ* _{j}* a random shift chosen from a Gaussian distribution with mean 0 and SD 2π/

*N*. The mathematical analysis focuses on homogeneous populations. Here, neurons differ only by their (equidistant) preferred direction but share all other parameters (so that we omit the subscript

*j*), chosen as the mean values in Table 1. For simulations of inhomogeneous populations, we sample ATIs from the distribution shown in Figure 2; for homogeneous populations, we assume τ = 25 ms for all cells, unless noted otherwise.

### Causal population vector reconstruction

The population vector *PV*(*t*) is a weighted sum, *PV*(*t*) = ∑_{j=1}^{N} *k _{j,T}*(

*t*)

*V*, where the

_{j}*V*are unit vectors in θ

_{j}*direction and*

_{j}*k*

_{j}_{,}

*(*

_{T}*t*) denotes the number of spikes fired between time

*t*−

*T*and

*t*(Georgopoulos et al., 1983, 1986). Introducing the phasors exp(

*i*θ

*) = cos(θ*

_{j}*) +*

_{j}*i*sin(θ

*), the population vector can be written as ∑*

_{j}_{j=1}

^{N}

*k*(

_{j,T}*t*) ·

*e*

^{iθj}; its direction is given by the argument of this complex number. Defining the spike-count vector

*K⃗*(

_{T}*t*) = (

*k*

_{1,T}(

*t*),

*k*

_{2,T}(

*t*), …,

*k*(

_{N,T}*t*)), the animal's HD can be decoded from the experimental or modeled HD cell responses as follows: The decoder θ̂

*(*

_{T}*t*) is chosen to obey causality: only spikes from spike-count windows ending at time

*t*are used to estimate θ(

*t*), as illustrated in Figure 4. This choice reflects the fact that any downstream processing step of the HD signal can only depend on spikes that were fired in the past.

### Read-out accuracy: a function of systematic lag (bias) and random noise (variance)

The theoretical framework now at hand allows us to calculate how accurately the current HD θ(*t*) can be determined from the time-dependent population activity of non-anticipatory HD cells in a moving animal. Because the PV estimator θ̂* _{T}*(

*t*) relies exclusively on spikes fired before time

*t*, the computed HD always lags behind the current HD θ(

*t*). In statistical terms, this means that θ̂

*(*

_{T}*t*) is biased with respect to θ(

*t*). Additionally, because the spiking behavior of an ensemble of neurons is noisy, θ̂

*(*

_{T}*t*) exhibits random fluctuations, which are measured by the variance

*V*. As shown in Materials and Methods, the (circular) error

*D*for the HD read-out from a neural ensemble is given by the sum of the (squared) bias

*B*and the variance

*V*as follows: In the following, we describe the behavior of the bias

*B*and the variance

*V*of the causal PV estimator and demonstrate at which point anticipatory firing introduces an improvement of the read-out accuracy.

To determine the bias *B* of the PV estimator, we start by evaluating the estimator's average at a given instance of time. As derived in Materials and Methods,
This means that, when averaging out the noise induced by stochastic spiking, the PV estimator provides the average HD of an animal during the spike-count time *T*. Because the bias is given by the difference between 〈θ̂* _{T}*(

*t*)〉 and θ(

*t*) (see Fig. 5), the dynamic nature of the HD induces a systematic error. To visualize this, consider a trajectory segment with constant AHV: θ(

*t*) = θ

_{0}+ ω

*t*. Here, 〈θ̂

*(*

_{T}*t*)〉 ≈ θ(

*t*) − ω

*T*/2, such that the bias amounts to

*B*= ω

*T*/2. The PV estimator is thus biased for every spike-count window for which the average HD does not match the final HD. Generally, longer spike-count windows will lead to larger biases because the HD has more time to change so that the average HD deviates stronger from the endpoint HD (see also Figure 6). We evaluated the average squared bias

*B*

^{2}along all experimentally recorded trajectories as a function of

*T*. The results are shown in Figure 7

*A*, and all share an increase with integration time

*T*.

Based on this analysis of the bias, it appears reasonable to use very short spike count times to keep the spikes used for decoding as close as possible to the current time point *t*. This will, however, increase the noise of the read-out as less and less spikes are involved in the estimation and their stochastic variations cannot be averaged out anymore. Together, the two opposing effects define an optimal read-out time for which there is an ideal trade-off between systematic lag and statistical fluctuations.

The amount of error induced by noisy spiking (i.e., the variance *V*) can be approximated analytically using Fisher information. Because this noise component is independent of the AHV, we can refer to results derived for the well-studied static case (for the full result, see Materials and Methods; Eq. 13). We find that the variance decreases with the average number of collected spikes as 1/RT, thus decreasing as spikes are collected during longer time windows (Fig. 7*A*, dotted lines).

According to Equation 12, the variance term depends only on the scaling factor (2*NT*)^{−1} and the ratio (λ̃_{0} − λ̃_{2})/λ̃_{1}^{2}. Comparing different tuning curves, normalized such that they result in the same population firing rate, numerical values for this ratio are 0.083 (von Mises tuning), 0.080 (Gauss tuning), and 0.085 (triangular tuning). This calculation shows that the variance term for von Mises tuning differs by less than 5% from the other two tuning functions often used for HD cells.

The trade-off arising from the two different error sources is in Figure 7 (right column). Along with data points obtained from applying our analytical treatment to an experimentally recorded trajectory, we show results from Monte Carlo simulations (see Materials and Methods). These involved either homogeneous or inhomogeneous HD cell ensembles and confirm our mathematical prediction. In particular, as *N* increases, results from inhomogeneous and homogeneous cell ensembles align because all possible HDs are uniformly covered by tuning curves.

When decoding time-independent stimuli, the read-out window *T* is usually assumed to be a free parameter that needs to be chosen in a biologically or methodologically useful way (see, e.g., Seung and Sompolinsky, 1993; Brown et al., 1998; Zhang and Sejnowski, 1999; Mathis et al., 2012). Considering the dynamic nature of a stimulus, as we do in this study, leads to specific spike-count windows *T* for which the highest possible accuracy is achieved (see also Mosheiff et al., 2017). The bias depends exclusively on the shape of the HD trajectory, whereas the variance depends exclusively on the parameters of the neural ensemble. For a non-anticipatory population of HD cells, the only chance to decrease the optimal read-out error is to invest more neural resources (*N*, *f _{max}*) and thus reduce the contribution of the random spiking noise to the error. A causal decoder cannot directly mitigate the systematic lag. We show in the following paragraph that anticipatory firing does influence the bias term and can lead to significant improvements of the HD code.

### Effect of anticipatory firing in HD ensembles

As proposed by Blair and Sharp (1995), we assume that the shift of the preferred HD scales linearly with AHV. Writing λ̃* _{j}* for the average firing rate of an anticipatory neuron in a homogeneous population and inserting Equation 15 for the preferred direction, we obtain the following:
This formulation treats anticipation in HD cells as a shift in the angular trajectory given by ϕ(

*t*) = θ(

*t*) + ω(

*t*)τ, as illustrated in Figure 8. If the AHVω(

*t*) were constant between

*t*and

*t*+ τ, the product ω(

*t*)τ equals the total turning angle in that time interval.

Accordingly, the expected direction of a homogeneous population of anticipatory neurons is given by the following:
The subscript τ indicates that we are describing an anticipatory HD population. Because the trajectory ϕ(*t*) corresponds to an estimate of the future HD, this change of the expected direction θ̂_{T,τ}(*t*) can lead to a compensation of the bias of θ̂(*t*) compared with the non-anticipatory case (Eq. 20).

To better understand this effect, we first examine trajectory segments with constant AHV ω (i.e., with θ(*t*) = θ_{0} + ω · *t*). In this case, ϕ(*t*) = θ_{0} + ω · (*t* + τ) = θ(*t* + τ) so that ϕ(*t*) is a time-shifted version of θ(*t*). We obtain
for the expected direction of the causal estimator θ̂(*t*). The ATI τ thus temporally shifts the expected direction of θ̂(*t*) along the trajectory θ(*t*). Evaluation of the average bias term for a linear trajectory segment leads to *B* = |ω·(*T*/2 − τ)|. Choosing τ within (0, *T*) reduces the bias compared with the situation without anticipation. For constant AHV, it is even possible to fully remove the bias by choosing τ = *T*/2, as illustrated in Figure 8.

Anticipation also reduces the bias when the velocity of the HD θ(*t*) varies in time. To quantify this effect in the experimentally relevant case, we recalculate *B* as for Figure 7*A*, but with anticipatory firing. We use Equation 21 to compute the expected HD. The results of this analysis are shown in Figure 7*B*, *C* (left panels) for τ = 25 ms and τ = 50 ms, respectively. In stark contrast to populations without anticipation (Fig. 7*A*), *B* does not increase monotonically with *T* but has a minimum. Similar to the case of trajectories with constant AHV, the average bias *B* is smallest for *T* ≈ 2τ. In this case, the implicit interpolation into the future described by ϕ(*t*) has the optimal size. However, because of the time-varying nature of the AHV, the bias is not completely canceled.

In contrastg, the variance of θ̂(*t*) does not depend on the ATI τ, as seen by considering trajectory segments with constant AHV or the data shown in Figure 8. To describe the reconstruction error *D* for anticipatory firing analytically, we directly adopt the variance term from Equation 13 and replace the bias term by the anticipatory bias as analyzed above. We now numerically calculate the average circular error *D* with homogeneous and inhomogeneous populations. For the inhomogeneous populations, the ATIs are sampled from the distribution depicted in Figure 2 (for τ = 50 ms, the distribution of τ is shifted by 25 ms to the right). Figure 7*B*, *C* (right panels) shows the analytical result from Equation 13 together with simulation results for ATIs of τ = 25 ms (to mimic experimental ATIs) and τ = 50 ms (to illustrate the effect of longer anticipation).

As in Figure 7*A*, the data shown in Figure 7*B*, *C* demonstrate a qualitative agreement between the analytic description and the simulations. Most notably, the results for populations with realistic ATI distribution align closely with the results for homogeneous populations (for sufficiently large *N*) and show that cell-to-cell ATI variability does not reduce the advantages of anticipatory firing. Compared with populations without anticipation, the minimal errors of anticipatory populations are shifted toward longer spike-count windows *T* because the bias does not increase monotonically with *T*, but rather has a minimum for *T* > 0. Consequently, spikes can be collected over longer time windows (as long as the bias remains small), which reduces the variance. For τ = 25 ms, this improved trade-off leads to a reduction of the minimal error for every population size shown in Figure 7*B*. For large ATIs, however, the minimum achievable bias can become so high that the minimal reconstruction errors are larger than without anticipation (Fig. 7*C*). Hence, anticipation only improves the accuracy of the causal population-vector decoder if the ATIs lie within a certain interval.

In Figure 9*A*, we show how much the minimum achievable error can be reduced by anticipatory firing with different ATIs. To generate this figure, the function *A*(*T*) was numerically minimized for τ = 0 ms and various values τ > 0 ms. The minimum errors for τ = 0 ms are shown in the inset as a function of population size *N*. The main plot presents the (average) fraction of the minimum *A* with and without anticipatory firing. The figure reveals that anticipatory firing can lead to large improvements in reading out the HD compass: the experimentally determined average value of τ = 25 ms increases the accuracy by up to 40%. Figure 9*B* demonstrates how large the population size *N*_{0} of a non-anticipatory population has to be to reach the same accuracy as an anticipatory population of size *N*. For a large range of population sizes and ATIs, non-anticipatory populations need to be more than three times as large as populations of anticipatory neurons (Fig. 9*B*). For τ = 25 ms, the ratio *N*_{0}/*N* is particularly high for (biologically plausible) population sizes between 1000 and 10,000 neurons. These results reveal that the experimentally reported value of τ = 25 ms lies exactly in the range that leads to an optimal improvement of the HD code given the broadly tuned neurons and the temporal statistics of the head movements.

## Discussion

Single HD cells in the rat ADN predict the animal's HD roughly 25 ms into the future (Blair and Sharp, 1995; Taube and Muller, 1998). Ironically, such anticipatory firing helps to encode the present HD, not the future HD, as we have shown here by analyzing the read-out process. Taking the temporal statistics of head movements in foraging rats into account, we found that the error in modeled neuronal population vectors is smallest when HD cells anticipate the true HD by an ATI of around 25 ms for populations with a few thousand neurons, in agreement with the ADN measurements. With realistic parameters for the HD tuning curves, anticipatory firing decreases the minimal error in the HD estimates by up to 40%. One would need a fivefold increase in the number of neurons to achieve the same performance in the absence of anticipation.

Because of their simplicity, population vectors are often the method of choice to decode the HD signal from multielectrode recordings (Peyrache et al., 2015) and calcium-imaging data (Kim et al., 2017). Moreover, population vectors are readily interpreted in the framework of ring- networks, which posits that vestibular and other sensory signals drive bumps of neuronal activity to move along a ring-like network of neurons (for review, see Laurens and Angelaki, 2018). Other decoding approaches, such as spike-by-spike population decoding (Huys et al., 2007), Bayesian filtering (Bobrowski et al., 2009), or predictive Kalman filters, are not likely to provide significant advantage over population-vector decoding. Indeed, in a similar task (i.e., the reconstruction of spatial trajectories from place-cell activity), such alternative strategies did not improve the decoding accuracy (Brown et al., 1998).

Various questions lie outside the scope of this study. We did not model path integration based on accumulating and error-correcting the HD signal (Cheung et al., 2008), nor did we address the mechanistic origin of anticipatory firing. Possible explanations range from neurons sensing head acceleration (Zhang, 1996) to high-pass filtering based on short-term synaptic depression (Puccini et al., 2007) or a combination of spike-rate adaptation and postinhibitory rebound (van der Meer et al., 2007). Instead, we focused on the role anticipatory firing plays in reducing the bias in estimating the time-varying HD signal.

Anticipation resolves a quandary inherent in decoding any time-dependent quantity from a population of spiking neurons. Given the stochastic nature of neuronal discharges, averaging over time should improve the decoding accuracy, were it not for the fact that the stimulus is continually changing. If the decoder reads out the population activity over some time window *T*, a trade-off arises (see, e.g., Brown et al., 1998; Zhang et al., 1998; Mosheiff et al., 2017; and our approach): the stimulus dynamics causes the deviations in the population-vector read-out (relative to the true, instantaneous stimulus) to increase with *T*, whereas the variance for estimating a static stimulus would decrease with *T* (Burak and Fiete, 2012). Systematic biases caused by averaging over time are exacerbated by the requirement that a biologically plausible decoder should be causal: only past neural activity can be taken into account, not future activity. Temporal averaging will therefore shift the population vector to reflect an estimate of the stimulus that lies in the past, unless the neurons anticipated future stimulus values. In an ideal encoding of a time-varying stimulus (HD is just one example), the average ATI should be less than the correlation time of the stimulus, whereas *T*, which reflects the effective integration time constant of a read-out, should be set to achieve the desired accuracy, given the number of neurons and their tuning properties. Crucially, though, *T* and τ should be matched, so that *T* = 2τ, to make the population-vector estimate unbiased. This theoretical result applies to any neural system encoding stimuli that are gradually changing over time.

As shown in this study, measured ATIs in the rat ADN are ideally suited to maximize the read-out accuracy, given the animal's movement statistics, single-neuron parameters, and network size. ATIs are advantageous because the HD trajectory is sufficiently smooth to allow the extrapolation θ(*t*+τ)≈θ(*t*)+ω(*t*)τ so that HD cells can make a useful prediction of future HD values, reminiscent of predictive coding in sensory systems (Palmer et al., 2015). Random head movements with a frequency higher than 1/τ would make it impossible to significantly reduce the bias by anticipation. Indeed, the measured movement correlations are in the range of seconds, much longer than the ATIs of HD cells. The importance of anticipation is underlined by the fact that the bias contributes as much to the error in estimating the HD as neural variability itself, at least for model populations of HD cells with experimentally measured parameters (Fig. 7).

Anticipatory firing might compensate for delays that inevitably arise in sensory transduction (Berry et al., 1999) or as information is processed by a multilayered system (Nijhawan and Wu, 2009), but this might be only part of anticipation's computational purpose. Compensating for delays, for example, does not explain why ATIs increase when rats are rotated passively (Bassett et al., 2005), the delays across the HD system should remain unaffected. Indeed, the increase in ATIs also runs counter to the hypothesis that motor-efference signals cause the ATIs; passive rotation eliminates the motor efference and should cause the ATI to decrease, not increase. Yet a different experimental observation might have an important implication for decoding HD under passive rotation: firing rates in ADN decrease (Bassett et al., 2005). To counterbalance the resulting loss of information about HD, the read-out window *T* should increase (Burak and Fiete, 2012), and so should the ATI according to our framework. Likewise, HD cells with longer ATIs typically fire at a lower rate than cells with shorter ATIs (Blair et al., 1997). If downstream neurons sample from distinct subpopulations of HD cells, this systematic ATI difference would once again match the theoretical framework. Whether ATIs indeed adapt to maintain an optimal read-out is, as yet, unproven. Moreover, what appears to be adaptation, whether optimal or not, might simply be a byproduct of the biophysical mechanism that underlies the ATIs, much like gain control in motion detection is intrinsic to the “Reichardt detector” mechanism (Borst et al., 2005).

If the anticipatory HD signal in the ADN were computed based on the read-out from another HD population, we expect even larger ATIs for that upstream population. As optimal ATIs change only slightly with system size (Fig. 9*A*), ATI differences of ∼25 ms are likely to occur along the hierarchy of the HD system. The staggered ATIs found in the HD system (from τ = 40–75 ms in the rat LMN to τ ≈ 25 ms in the ADN and retrosplenial cortex to zero lag in the postsubiculum) (Taube, 2007) are consistent with this idea. In parallel, LMN activity could also be read out directly over a time window of 80–150 ms to yield a second estimate of current HD in addition to that provided by the ADN. A third and almost instantaneous estimate can be obtained from the postsubiculum. Consistent with our framework, the population size increases with decreasing ATI from LMN (∼1000 cells) to ADN (∼12,000 cells) to postsubiculum (∼60,000 cells). Although these three subnetworks seem to face different trade-offs concerning network size and single-neuron dynamics, they may nevertheless provide excellent HD estimates for downstream computations in other brain areas. For instance, the grid-cell system in the entorhinal cortex may combine HD estimates based on direct input from ADN and postsubiculum (Shibata, 1993; van Groen and Wyss, 1995) with distance information to achieve accurate path integration.

Furthermore, the hypothesis of optimal population-vector decoding leads to testable predictions for the ATIs in other species with different movement statistics, such as flies (Seelig and Jayaraman, 2015; Kim et al., 2017), cockroaches (Varga and Ritzmann, 2016), or bats (Finkelstein et al., 2015). All the information needed to calculate the optimal ATIs can be derived from the movement statistics, single-neuron properties, and network size. If the movements vary with the behavioral state, so should the ATIs. While cells in the dorsal subiculum of bats change the encoding of HD from pure to conjunctive coding when the animal switches from straight long-range navigation to rapid small-scale maneuvering (Finkelstein et al., 2018), the anticipatory behavior of these cells has not yet been addressed.

The HD system is a well-studied model, but it is only one particular example of how bias-variance trade-offs influence causal decoders of neuronal information. Computing error signals, as envisioned in theories of motor control (Wolpert et al., 1995; Bastian, 2006), also faces this trade-off. Neurons must gather sufficient information to reliably perform computations while still adapting quickly enough to reflect changing inputs throughout many hierarchical processing streams in the nervous system. Rather than being governed solely by biophysical limits on the speed of signal propagation, integration, and action-potential generation, the time constants of dendritic and somatic integration as well as axonal propagation delays might also reflect decoding constraints. Further work is needed to elucidate how the requirement that the nervous system operate in real time shapes processing delays.

## Footnotes

This work was supported by the German Federal Ministry for Education and Research Grants 01GQ0440. We thank E.I. Moser and G. Buzsáki for making data from Sargolini et al. (2006) and Mizuseki et al. (2009), respectively, publicly available; and A. Loebel and A. Mathis for stimulating discussions.

The authors declare no competing financial interests.

- Correspondence should be addressed to Andreas V.M. Herz at herz{at}bio.lmu.de