Recently, dopamine (DA) neurons of the substantia nigra pars compacta (SNc) were found to exhibit sustained responses related to reward uncertainty, in addition to the phasic responses related to reward-prediction errors (RPEs). Thus, cue-dependent anticipations of the timing, magnitude, and uncertainty of rewards are learned and reflected in components of DA signals. Here we simulate a local circuit model to show how learned uncertainty responses are generated, along with phasic RPE responses, on single trials. Both types of simulated DA responses exhibit the empirically observed dependencies on conditional probability, expected value of reward, and time since onset of the reward-predicting cue. The model's three major pathways compute expected values of cues, timed predictions of reward magnitudes, and uncertainties associated with these predictions. The first two pathways' computations refine those modeled by Brown et al. (1999). The third, newly modeled, pathway involves medium spiny projection neurons (MSPNs) of the striatal matrix, whose axons corelease GABA and substance P, both at synapses with GABAergic neurons in the substantia nigra pars reticulata (SNr) and with distal dendrites (in SNr) of DA neurons whose somas are located in ventral SNc. Corelease enables efficient computation of uncertainty responses that are a nonmonotonic function of the conditional probability of reward, and variability in striatal cholinergic transmission can explain observed individual differences in the amplitudes of uncertainty responses. The involvement of matricial MSPNs and cholinergic transmission within the striatum implies a relation between uncertainty in cue–reward contingencies and action-selection functions of the basal ganglia.
Firing patterns observed in the dopamine cells of the substantia nigra pars compacta (SNc) and ventral tegmental area (VTA) are related to the occurrence and nonoccurrence of rewards and cues that predict reward (Schultz et al., 1997; Schultz, 1998). Dopamine (DA) cells, which fire tonically at moderate levels, respond immediately to unexpected rewards with a phasic burst. When a reward consistently follows a preceding conditioned stimulus (CS), the phasic burst “transfers” from the time of expected reward delivery to the time of CS onset. The amount of the “transfer” depends on the expectation of reward, R̂ = |R*| × p(R*|CS), that is, the conditional probability, p(R*|CS), that a reward of magnitude |R*| follows the CS (Schultz et al., 1997; Fiorillo et al., 2003; Tobler et al., 2005; Schultz, 1998, 2004). After learning, the omission of an expected reward induces a depression in firing rate to a below-baseline level at the time of expected reward delivery. Thus, dopamine cells are part of an adaptive system that uses learned expectations to filter reward-related signals. This filtering creates dopamine bursts and pauses that respectively signal positive and negative violations of reward-related predictions. Therefore, SNc/VTA dopamine signals, which are broadcast to the dorsal and ventral striatum as well as other brain regions, such as the amygdala and frontal cortex, can be conceptualized as internal teaching signals that foster rapid acquisition of goal-directed behavior (Schultz et al., 1997; Schultz, 1998; Doya, 2002).
Several proposals have been made to explain the adaptive computations that give rise to the “reward-prediction error” (RPE) responses of dopamine neurons (for review, see Wörgötter and Porr, 2005). Some have been implementations of the temporal difference (TD) model (Nakahara et al., 2004; Pan et al., 2005), whereas others have been based on local circuit anatomy and physiology (Houk et al., 1995; Brown et al., 1999). Brown et al. (1999) introduced and simulated a local circuit model that can explain most of the key results without predicting effects that are known not to occur. In particular, as recently shown by Pan et al. (2005), most parameterizations of the TD model incorrectly predict that during training the DA burst will gradually “slide” from the time of reward delivery to the time of CS onset, such that as learning progresses, the burst appears at a full range of successively earlier intermediate times within a trial, but never appears at both the time of the CS and the time of primary reward on a single trial. In contrast, the Brown et al. (1999) model correctly predicts that the burst never occurs at intermediate times between the CS onset and the reward delivery, and that instead there will be DA bursts at just two times on each trial. In particular, across early learning trials, a burst at the time of CS onset will wax (become gradually larger across trials) as the burst at the time of reward delivery wanes (becomes gradually smaller).
Recently, however, Fiorillo et al. (2003) discovered a new reward-related component of DA cell discharges in SNc, which they called an “uncertainty response.” This sustained component builds up during the interval from CS onset to the expected time of reward if the reward schedule is probabilistic, and contrary to published TD model predictions (Niv et al., 2005), uncertainty responses in SNc are robust in single-trial data (Fiorillo et al., 2005). The size of the buildup is a nonmonotonic, “inverted-U” function of p(R*|CS), and also depends on the amount of reward, |R*|, that is at risk. Neither the “uncertainty response” nor its functional dependencies were predicted by any basal ganglia (BG) learning model of the time. However, the Brown et al. (1999) model did imply a separate new observation of Fiorillo et al. (2003) and Tobler et al. (2005), namely that after asymptotic learning, both the degree of waxing of the DA burst to the CS onset and the degree of waning of the burst at the time of reward delivery are monotonic functions of |R*| × p(R*|CS).
The Brown et al. (1999) model, like others, omits many known features of the BG microcircuit that can be expected to play important roles in shaping DA responses, including uncertainty responses. The model SN had only one class of DA cells and lacked corelease of GABA and neuropeptides. The model striatum lacked cholinergic tonically active neurons as well as GABAergic interneurons. In this paper, we propose an extended model based on such features and corresponding computational hypotheses. The new model retains the explanatory successes of the prior model, but also incorporates an efficient basis for the uncertainty responses discovered by Fiorillo et al. (2003).
Materials and Methods
The proposed model is schematized in Figure 1, which also labels sites in the circuit with the names of corresponding variables in the formal model. The model for the genesis of phasic DA responses follows the proposal by Brown et al. (1999), with modest modifications (explained below). The model uses ordinary differential equations in a Hodgkin–Huxley type formulation, modified to emphasize key dynamical properties of cell types. The model is qualitative. Model neuron firing rates range from zero to one, and parameters were constrained to reflect empirically reported responses of neurons. No attempt was made to optimize curve fits. Parameter ranges across which the qualitative behavior of the model DA neuron's phasic responses is preserved and consistent with experimental reports were discussed by Brown et al. (1999). Therefore, we report and discuss only the sensitivity analyses for the key parameters governing the genesis of sustained DA responses. These analyses yield new predictions that are experimentally testable. A mathematical description of the model is given in the Appendix (Eqs. 1–21). The model was implemented in Matlab. Numerical integration was performed with an adaptive step size, fourth-order Runge-Kutta method. The model was simulated with three different reward magnitudes (|R*|; 0.05, 0.15, and 0.50; arbitrary units) intended to span a full range of magnitudes representable by a single neuron, and five different probabilistic schedules [p(R*|CS); 0.00, 0.25, 0.50, 0.75, and 1.00]. Parameter values used for simulations are given in Table 1. The remainder of this section describes the assumptions and hypotheses about the biological substrates of the model. The next section (Results) reports real-time behavior of the model and describes how its operation leads to model DA neuron responses.
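As a rough illustration of this numerical setup, the following Python sketch integrates a single bounded rate unit with a classical fourth-order Runge-Kutta scheme. It is neither the Matlab implementation nor any of Eqs. 1–21: the shunting form dx/dt = −A·x + (1 − x)·I, the decay rate, the step size, and all function names are assumptions chosen only to show how activity can remain between zero and one, and a fixed step is substituted for the adaptive step size for brevity.

```python
# Hypothetical sketch: fixed-step RK4 on an assumed bounded rate unit.
# Only the three reward magnitudes are taken from the text.
REWARD_MAGNITUDES = (0.05, 0.15, 0.50)  # |R*|, arbitrary units


def rate_derivative(x, drive, decay=1.0):
    """Assumed shunting form: passive decay plus excitation that saturates at 1."""
    return -decay * x + (1.0 - x) * drive


def rk4_step(x, drive, dt):
    """Advance x by one classical fourth-order Runge-Kutta step."""
    k1 = rate_derivative(x, drive)
    k2 = rate_derivative(x + 0.5 * dt * k1, drive)
    k3 = rate_derivative(x + 0.5 * dt * k2, drive)
    k4 = rate_derivative(x + dt * k3, drive)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(drive, duration=20.0, dt=0.01, x0=0.0):
    """Integrate from x0 under a constant drive and return the final activity."""
    x = x0
    for _ in range(int(round(duration / dt))):
        x = rk4_step(x, drive, dt)
    return x


# Final activity for each reward magnitude used as a constant input.
steady_states = {r: integrate(r) for r in REWARD_MAGNITUDES}
```

Under these assumptions the unit settles near I/(A + I), so its activity grows with the input magnitude but cannot exceed one.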
Circuit for learned phasic dopamine responses.
A circuit basis for genesis of learned phasic dopamine responses was proposed and discussed in detail by Brown et al. (1999). We provide here a summary for completeness, and describe modifications to the original model. Formal specification is given in the Appendix. A model pedunculopontine tegmental nucleus (PPTN) stage relays signals generated by conditioned and unconditioned rewarding stimuli to DA cells in the SNc (Nakamura and Ono, 1986; Semba and Fibiger, 1992; Brown et al., 1999; Pan and Hyland, 2005). Accordingly, the major excitatory input to phasic DA cells is a signal from the PPTN. The CS is assumed to be relayed to the striatum via corticostriatal afferents (although the relay to ventral striatum may be via amygdala). Within the BG, CS-related inputs excite both ventral striatum, via a set of adaptive synaptic weights, and medium spiny projection neurons (MSPNs) in the striosome compartment of the striatum, via another set of adaptive synaptic weights. After learning, the former pathway is responsible for eliciting the phasic DA response that immediately follows CS onset. This adaptive excitatory pathway is complemented in the model by an adaptive inhibitory pathway from CSs to SNc DA cells via striosomes. An inhibitory projection from striosomes to DA cells is well established (Gerfen, 1992). To explain the data on timing of dopamine dips noted above, the inhibitory signals carried by this projection must be adaptively timed to arrive at DA cells at the expected time of primary reward. Therefore, model striosomal MSPN dendritic spines exhibit a spectrum of delayed calcium spikes mediated by second messengers (Fiala et al., 1996; Brown et al., 1999). When a delayed spike coincides with DA burst release engendered by primary reward (via lateral hypothalamus and PPTN, as shown in Fig. 1), learning specific to the corticostriatal synapse on that spine can occur. 
Such learning potentiates inhibitory outputs at the relevant delay and is self-terminating: once the inhibition is strong enough, it precisely cancels the excitatory effect of PPTN inputs to DA cells, so no DA burst remains to drive further learning.
The ensemble of mechanisms captured in Equations 1–10 (given in the Appendix) enables gradual learned transfer of scaled DA responses from the time of rewarding stimulus onset to the (earlier) time of CS onset (Schultz et al., 1997; Schultz, 1998, 2004). The original formulation of this model was shown to successfully explain the reward- and timing-related responses of phasic dopamine bursts (Brown et al., 1999), and as shown below, all those properties are preserved in the current formulation. However, the learning laws used in our model deviate from those in the report by Brown et al. (1999) in three respects. First, because of a misprint in the report by Brown et al. (1999), the published equation governing CS-striosomal synaptic weight, Zij, converged to zero when positive and negative reinforcement signals (N+ and N−) were both equal to zero. Our equation for the CS-striosomal weight (Eq. 9) corrects this problem while also setting an intrinsic upper bound on weight growth. Second, because DA-dependent learning in the striatum is mediated by second messenger-mediated calcium release (Kötter, 1994), which is slightly delayed relative to CS onset, we modified the learning rule (Eq. 4) to reflect calcium-gated adaptation of corticoventral striatal synapses. This more realistic learning rule helps explain classic observations that a CS must slightly precede a reward for behavioral learning to occur. At the same time, the improved rule eliminates the “self-learning” (i.e., reinforcement of a CS by the very DA burst that it induces) that O'Reilly et al. (2007) noted as a problematic aspect of the Brown et al. (1999) model. Third, the increment and decrement scalars, αWS and βWS, for the nonstriosomal striatal weights (Eq. 4), were adjusted to partially compensate for the very small range of RPEs represented by DA dips relative to the large range of RPEs represented by DA bursts (Fiorillo et al., 2003; Nakahara et al., 2004). 
This adjustment improved the sensitivity of the model's weights to conditional probability (treated in more detail in Results).
Circuit for genesis of learned sustained dopamine responses.
It is currently not known whether DA cells in VTA exhibit similar uncertainty responses. Therefore, we limit our discussion to SNc DA cells. Both GABAergic and dopaminergic cells in the SN exhibit (apparently) unlearned baseline activity levels. However, GABAergic MSPNs of the striatal matrix and striosome compartments can inhibit the firing of the GABAergic pars reticulata (SNr) and dopaminergic SNc cells, respectively (Ragsdale and Graybiel, 1990; Joel and Weiner, 2000). Moreover, the dendrites of GABAergic SNr cells are intermingled with distal dendrites from the dopaminergic cells of the SNc (Condé, 1992; Gerfen and Wilson, 1996). Thus, we hypothesize that matricial MSPNs of the direct pathway, whose terminals contact SNr dendrites, also contact invading dendrites of DA cells whose somata are in SNc (Fig. 1). In particular, the model embodies the hypothesis that such matricial MSPNs, which also release the neuropeptide substance P (SP) (Jessell, 1978; Iversen et al., 1980; Parent, 1986; Condé, 1992; Otsuka and Yoshioka, 1993), exert a net excitatory effect on dopaminergic SNc cells, and that this excitatory effect is responsible for the sustained uncertainty responses of such cells. This complements the treatment above, in which model MSPNs attributed to striosomes learned to regulate the phasic DA responses that are widely regarded as RPE signals (Schultz et al., 1997; Schultz, 1998; Brown et al., 1999).
The primary inputs to the matricial and striosomal MSPNs in striatum are relayed through glutamatergic cortical afferents via adaptive corticostriatal connections (Joel and Weiner, 2000). The model also captures the influence of cholinergic transmission on MSPNs both directly (Di Chiara et al., 1994) and indirectly through GABAergic interneurons (Koós and Tepper, 1999, 2002; de Rover et al., 2002; Ravel et al., 2003), via Equation 11. This treatment is consistent with reports showing that tonically active neurons (TANs; giant cholinergic interneurons) are preferentially located in, or near the border of, the matrix compartment of the striatum (Kubota and Kawaguchi, 1993; Prensa et al., 1999). Although both learned phasic and learned sustained responses of model DA neurons show qualitatively correct functional dependencies on conditional probability of reward in the absence of fast-spiking interneurons (FS-INs) and TANs in the striatal microcircuit, inclusion of the latter markedly enhances the sensitivity of the modeled sustained responses to probability (see Results).
As noted above, in addition to releasing GABA, the MSPNs of the direct pathway release the neuropeptide SP. Such SP release from the axon terminals of MSPNs projecting to SNr can excite the comingled dendrites of DA cells (Condé, 1992; Otsuka and Yoshioka, 1993; Hanson et al., 2002; Martorana et al., 2003; Betarbet and Greenamyre, 2004) via neurokinin receptors (Whitty et al., 1997; Lévesque et al., 2007). Beaujouan et al. (2004) reported activity-dependent release of SP from striatonigral terminals, and SP released intranigrally increases striatal dopamine release (Reid et al., 1990; Khan et al., 1996). The exact mechanisms and net effects of interactions between the coreleased peptide (SP) and nonpeptides (GABA) are currently unknown. However, “practically all combinations of peptide and non-peptide transmitter have been found. … The exact significance of cohabitation remains to be determined. It may be that the peptide and the non-peptide act synergistically or, in contrast, it could be that the peptide or non-peptide act presynaptically to inhibit the release of their companion” (Smith, 1996). Several studies point toward a presynaptic interaction between a peptide and nonpeptide transmitter, such that the nonpeptide transmitter inhibits both its own and its companion's release (Malcangio and Bowery, 1999; Salio et al., 2005, 2006). Notably, several studies clearly show a GABAB-autoreceptor mediated inhibition of corelease of peptides in the cerebral cortex (Bonanno et al., 1996), spinal cord (Malcangio and Bowery, 1999; Riley et al., 2001), and further regions of the nervous system (Bowery, 1993). Available data suggest a similar mechanism in the substantia nigra. 
Indeed, GABA acts via GABAB autoreceptors on presynaptic terminals in SN to progressively inhibit release from those terminals [for a detailed description of the data, see Paladini and Tepper (1999) and Tepper and Lee (2007)], and Humpel and Saria (1989) suggested that GABA receptors are involved in the presynaptic regulation of tachykinin release from striatonigral terminals. Direct evidence for presynaptic inhibition of SP by GABA was provided in seminal reports of Jessell (1978) and Iversen et al. (1980). They showed in vitro that higher GABA concentrations produced progressively greater inhibition of potassium-evoked SP release from striatonigral terminals, without affecting its spontaneous release. This regulation was mediated by GABA receptors. Thus, the weight of available evidence makes it likely that any accurate computational model must assume that GABA, coreleased with SP from striatonigral terminals, inhibits SP release from those terminals via presynaptic GABAB autoreceptors.
To represent these interactions in the model, the net SP release level in SN initially grows as matricial MSPN activation exceeds the firing threshold, but as MSPN activation increases further, the growth in SP release is eventually terminated, and then reversed, by GABA-dependent presynaptic inhibition. For this reason, the net SP release in the model is a nonmonotonic function of matricial MSPN firing rate, which because of learning is a monotonic function of conditional reward probability. Mathematical formulations of this case, as well as an alternative case in which GABA also acts postsynaptically, are provided in the Appendix (Eqs. 12–18). Whereas available data apparently require some presynaptic inhibition of SP release by GABA in models of substantia nigra, our model further assumes that such inhibition of SP release is stronger than presynaptic inhibition of GABA release itself. No empirical assessment of the relative sizes of these effects is known to us. Therefore, the Discussion clarifies how model predictions could be tested with new, in vivo, experiments. A statement of the hypothesis that a GABA–SP interaction of an appropriate mathematical form could enable “uncertainty responses,” together with results of one preliminary simulation (without any model equations), appeared earlier, in a brief letter (Tan and Bullock, 2008a). Since then, the model has been modified in several key respects, and this is the first report detailing model equations, parameter ranges, and alternative interactions that simulate learning and signal processing sufficient to yield both sustained and phasic DA responses as seen in vivo.
Some SNc neurons appear to respond (at statistically significant levels) only with phasic or sustained components to a predictive stimulus, whereas other SNc cells show both components in their responses (C. Fiorillo, personal communication) [see also Fiorillo et al., 2003 (Fig. 3D)]. Therefore, to simulate coexisting responses in a single neuron, the model includes a third set of DA neurons affected by all three pathways (PPTN-mediated excitatory projections and projections from both matricial and striosomal MSPNs).
Dopamine level in the striatum.
In the model, the striatal DA level, D̄, tracks, and is a running average of, momentary DA cell firing rate. This corresponds well with observations (Floresco et al., 2003) that dopamine uptake is quite rapid in striatum, and so DA level will not be elevated above baseline levels for very long after termination of a DA neuron's burst. The model's positive reinforcement signal (which gates learning) reflects transient positive deviations from the running average, whereas the complementary negative reinforcement signal reflects transient negative deviations. Because slowly ramping changes in the DA cell firing rate, e.g., as seen in the model's sustained “uncertainty” responses, induce only negligible deviations from the running average in the normal striatum (with fast DA uptake), they do not induce reinforcement signals. However, such slow changes will be reflected in the striatal DA level (the running average).
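The distinction between fast and slow changes in DA firing can be sketched in a few lines of Python. This is a minimal illustration, not the model's equations: the leaky-average update, the 50 ms time constant, the baseline rate, and the two test signals (a brief burst and a slow ramp) are all assumed values chosen for clarity.

```python
# Hypothetical sketch: running-average DA level and rectified deviation signals.

def rectify(x):
    return max(x, 0.0)


def simulate(firing_rates, dt=0.001, tau=0.05, baseline=0.2):
    """Track a leaky running average of DA firing (forward Euler).

    Returns the peak positive deviation, the peak negative deviation,
    and the final value of the running average.
    """
    level = baseline
    peak_pos = peak_neg = 0.0
    for rate in firing_rates:
        peak_pos = max(peak_pos, rectify(rate - level))  # positive reinforcement
        peak_neg = max(peak_neg, rectify(level - rate))  # negative reinforcement
        level += dt * (rate - level) / tau               # fast uptake: short tau
    return peak_pos, peak_neg, level


STEPS = 2000  # 2 s at 1 ms resolution

# Assumed test signals: a 100 ms burst versus a slow ramp over the whole interval.
burst = [0.8 if 500 <= i < 600 else 0.2 for i in range(STEPS)]
ramp = [0.2 + 0.2 * i / STEPS for i in range(STEPS)]

burst_pos, _, _ = simulate(burst)
ramp_pos, _, ramp_level = simulate(ramp)
```

With these assumed values the burst produces a large positive deviation, whereas the ramp produces only a very small one even though the running average itself follows the ramp upward.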
The equations governing model striatal FS-INs and TANs, whose inclusion improves but is not critical for the emergence of the basic DA responses (as shown below), are derived and discussed in detail by Tan and Bullock (2008b) and are provided as supplemental Equations S1–S8 (available at www.jneurosci.org as supplemental material) for completeness. Operation of the model in real time is explained in more detail next.
Results
Scaled DA bursts induced by uncued unconditioned stimuli (primary rewards)
In the absence of any predictive stimulus, delivery of reward in the model induces a DA burst, at the onset of reward, that is a monotonically increasing function of the reward magnitude (Fig. 2), consistent with neurophysiological observations (Schultz et al., 1997; Schultz, 1998; Tobler et al., 2005). The primary reward input generates phasic firing in the lateral hypothalamus (LH) (Nakamura and Ono, 1986), which transiently excites the PPTN (Semba and Fibiger, 1992) and ventral striatum (Schultz et al., 1992). The PPTN signal, in turn, excites SNc dopaminergic cells (Scarnati et al., 1988; Condé, 1992; Futami et al., 1995) and leads to the phasic dopamine burst (Gerfen, 1992). The present model emphasizes LH cells that respond to a rewarding unconditioned stimulus (US). It does not treat LH cells that show learned responses to CS onsets. However, the network embedding and role of such cells were treated recently in a complementary model (Grossberg et al., 2008) of LH–amygdala–orbitofrontal interactions that mediate such processes as simultaneous visual discrimination, motivational/attentive enhancement of stimulus representations, and rapid selective devaluations of stimuli after satiety.
Learned DA bursts and pauses that reflect reward-prediction errors
When an initially neutral conditioned stimulus arrives at the striatum via cortical afferents, it excites both ventral striatum and striosomal MSPNs through trainable adaptive weights. Ventral striatum disinhibits PPTN through ventral pallidum (Yang and Mogenson, 1987). In the model, phasic DA signals induced by primary reward input act as a teaching signal on these two sets of adaptive weights (Wickens et al., 1996). As learning proceeds, the cortical CS representation learns to excite DA cells by itself through ventral striatum, while it also learns to inhibit DA cells at the expected time of arrival of the primary reward input. The timing of the latter inhibition depends on the internal calcium dynamics of model striosomal MSPNs (Gerfen, 1992; Fiala et al., 1996; Brown et al., 1999). The larger the magnitude of primary reward, the larger the weights (onto ventral striatal and striosomal MSPNs) become during learning. Hence, after learning, the DA burst induced by a predictive stimulus is a monotonically increasing function of the reward magnitude |R*|.
Because the strength of striatal inhibition of model DA cells by MSPNs is matched to the excitation that would be generated by the expected reward, after learning, a larger than expected reward elicits a DA burst, whereas a smaller than expected reward elicits a pause in DA cell firing. These bidirectional DA responses in the model reflect deviations from internal predictions (Fig. 3), i.e., are RPEs. This behavior was also achieved by the precursor dual-pathway model of Brown et al. (1999), and is consistent with many neurophysiological studies (Schultz et al., 1997; Schultz, 1998, 2004; Tobler et al., 2005).
Learning effects of probabilistic schedules of reward
During exposure to probabilistic cue–reward contingencies, adaptive weights are incremented by DA bursts induced by delivery of primary reward, but decremented by dips consequent to each omission of the expected reward. Therefore, the net potentiation of the model's striatal synaptic weights is a function not only of the reward magnitude, as noted above, but also of the conditional probability of reward given the CS, p(R*|CS). Given this dual dependence of the striatal synaptic strengths (Fig. 4A), model neuronal responses to a CS (Fig. 4B; see also Fig. 7) increase with that CS's expected reward value, R̂ = |R*| × p(R*|CS). This dual dependence of CS-induced striatal activations is consistent with experimental observations (Tobler et al., 2005).
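The dual dependence can be illustrated with a deliberately simplified, trial-averaged delta rule in Python. This sketch is not Eq. 4 or Eq. 9: it assumes equal increment and decrement gains (whereas the model's scalars αWS and βWS were adjusted to differ), a single scalar weight, and a hypothetical learning rate, and it replaces random trial outcomes with their average effect.

```python
# Hypothetical sketch: mean-field delta rule on a probabilistic reward schedule.
REWARD_MAGNITUDES = (0.05, 0.15, 0.50)          # |R*|, arbitrary units
PROBABILITIES = (0.00, 0.25, 0.50, 0.75, 1.00)  # p(R*|CS)


def learn_weight(magnitude, probability, rate=0.05, trials=2000):
    """Average weight change per trial under assumed equal gains.

    Bursts (reward delivered, with probability p) increment the weight;
    dips (reward omitted, with probability 1 - p) decrement it.
    """
    weight = 0.0
    for _ in range(trials):
        burst = max(magnitude - weight, 0.0)  # positive RPE on rewarded trials
        dip = max(weight, 0.0)                # size of negative RPE on omissions
        weight += rate * (probability * burst - (1.0 - probability) * dip)
    return weight


# Learned weight for every combination of magnitude and schedule.
learned = {
    (r, p): learn_weight(r, p)
    for r in REWARD_MAGNITUDES
    for p in PROBABILITIES
}
```

Under these assumptions each weight settles at |R*| × p(R*|CS), so it grows with either factor; the full model's unequal gains would shift the exact values.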
Weights of corticostriatal synapses onto matricial MSPNs, striosomal MSPNs, and ventral striatal cells all express the same dual dependence. Thus, striatal inhibition of DA cells in the model is also an increasing function of the CS's expected reward value, R̂, rather than the absolute reward magnitude |R*|. This dependence implies that, for any probabilistic schedule, the expected reward value associated with the CS after learning cannot exceed the absolute reward magnitude (R̂ ≤ |R*|), because probability, by definition, must be between zero and one. Furthermore, the expected reward value of the CS will be equal to the absolute reward value only if the probability of reward given the CS is one [R̂ = |R*| only if p(R*|CS) = 1.00]. As a consequence, a primary reward elicits a residual DA burst in the model (see below, Fig. 7) even after asymptotic learning on a probabilistic schedule, consistent with the observations of Fiorillo et al. (2003) and Tobler et al. (2005). The latter study also observed that after a training schedule with a CS equiprobably followed by two potential outcomes (e.g., no reward or a cue-specific volume of liquid reward, |R*|), the sizes of residual phasic DA responses to the cue-specific reward volumes had become almost independent of the reward volume, in stark contrast to the continuing strong dependence of phasic DA responses on sizes of uncued rewards. A simulation of this paradigm (Fig. 5) showed that the model exhibits similar behavior. Furthermore, a short mathematical derivation by Tan et al. (2008) showed that such an outcome is a robust property of an entire class of RPE models defined by a simplified subset of the assumptions incorporated in the model developed here.
The adaptively weighted CS inputs to the model matricial MSPNs, and thus these neurons' CS-induced activities, become increasing functions of the expected reward during learning. Therefore, in the model as in the data, striatal MSPN activity is modulated by the expected amount of reward (Kawagoe et al., 1998; Watanabe et al., 2003; Samejima et al., 2005). Model simulations (see below) showed that the depth of this modulation is enhanced because matricial MSPNs also receive inputs related to the reward contingency indirectly, through the pathway from DA neurons to striatal TANs to FS-INs to MSPNs (Fig. 1), in accord with neurophysiological studies (Samejima et al., 2005).
Emergence of uncertainty responses on probabilistic schedules of reward
The distal dendrites of SNc dopaminergic cells are densely intermingled with SNr GABAergic cells (Condé, 1992; Gerfen and Wilson, 1996). Thus, we hypothesized that although matricial and striosomal MSPNs project to SNr and SNc, respectively (Flaherty and Graybiel, 1994), matricial MSPNs also affect SNc dopaminergic cells. During the time between CS onset and expected time of reward, the matricial MSPN projection of GABA and SP to the SNc is active, and model transmitter release depends on MSPN activation, in accord with Khan et al. (1996) and Beaujouan et al. (2004). In particular, the model's net SP release, and consequent excitation of SNc, is progressively gated off by presynaptic action of the coreleased GABA. The net effect of the proposed multiplicative gating interaction is an inverted-U-shaped excitation that is lowest at the extreme conditional probabilities, p(R*|CS) = 0.00 and p(R*|CS) = 1.00, and maximal at p(R*|CS) = 0.50. Thus, as shown in Figures 6 and 7, the model is able to explain the genesis of DA uncertainty responses and their nonmonotonic dependence on p(R*|CS), as reported by Fiorillo et al. (2003).
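A minimal Python sketch of such a multiplicative gate is given here for intuition. It is not Eqs. 12–18: it assumes that, for a fixed reward magnitude, matricial MSPN activation after learning simply equals p(R*|CS), that coreleased GABA is proportional to the suprathreshold activation, and that the release threshold and GABA gain take the hypothetical values 0 and 1.

```python
# Hypothetical sketch: GABA-gated substance P release across reward probabilities.
PROBABILITIES = (0.00, 0.25, 0.50, 0.75, 1.00)  # p(R*|CS)


def rectify(x):
    return max(x, 0.0)


def net_sp_release(mspn_activation, threshold=0.0, gaba_gain=1.0):
    """Net SP release under presynaptic gating by coreleased GABA.

    release_drive: suprathreshold MSPN activation (drives both transmitters).
    gaba:          assumed proportional to release_drive.
    The rectified factor (1 - gaba) gates SP release off as GABA grows.
    """
    release_drive = rectify(mspn_activation - threshold)
    gaba = gaba_gain * release_drive
    return release_drive * rectify(1.0 - gaba)


# Assumption: MSPN activation after learning equals the reward probability.
sustained_drive = {p: net_sp_release(p) for p in PROBABILITIES}
```

With these assumed values the product reduces to p(1 − p), which vanishes at both extremes and peaks at 0.50; raising the hypothetical threshold shifts the peak toward higher activations.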
Coexistence of sustained and phasic responses in single dopamine neurons
In the model, DA neurons responding with sustained or phasic activations constitute two of three subpopulations (Eqs. 10, 16). In a third subpopulation, as shown in Figure 7D, both sustained and phasic responses occur in single cells, and these components exhibit the same functional dependencies noted above.
Parameter ranges across which the qualitative behavior of the model DA neurons' phasic responses is preserved were discussed by Brown et al. (1999). Similarly, the choice of parameters for, and associated sensitivity analyses of, striatal TANs were delineated by Tan and Bullock (2008b). An additional key component in the current model is the overall effect of the striatal microcircuit. As the expected reward value [|R*| × p(R*|CS)] increases, so does the DA burst to CS onset, as does the depth of the TAN pause response. Thus, post-CS acetylcholine (ACh) release in the striatum is inversely related to the expected value signaled by the CS. As ACh release decreases, so does the feedforward inhibition exerted on the matricial MSPNs by FS-INs. Therefore, a CS-representing cortical input of a particular weighted value will induce higher activations of MSPNs with than without the DA-induced pause response by the TANs. Thus, although weights of corticostriatal synapses onto MSPNs themselves reflect expected values, cholinergic modulation of MSPNs amplifies the contrast between matricial MSPN activations in response to different reward contingencies (Fig. 8).
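The chain of effects described above can be caricatured in Python as follows. This is only a sketch of the reasoning, not supplemental Equations S1–S8: the linear relations, the tonic ACh level, the pause gain, and the two example expected values are all hypothetical.

```python
# Hypothetical sketch: TAN pause -> less ACh -> less FS-IN inhibition -> more MSPN activity.

def mspn_activation(expected_value, tan_pause=True, tonic_ach=0.1, pause_gain=1.0):
    """Rectified matricial MSPN activation for a CS with the given expected value."""
    # Assumed: pause depth grows with expected value (via the DA burst to the CS).
    pause_depth = min(pause_gain * expected_value, 1.0) if tan_pause else 0.0
    ach_release = tonic_ach * (1.0 - pause_depth)
    fs_inhibition = ach_release      # assumed proportional to ACh release
    cortical_drive = expected_value  # assumed: learned weight tracks expected value
    return max(cortical_drive - fs_inhibition, 0.0)


def contrast(low_value, high_value, tan_pause):
    """Difference in MSPN activation between two reward contingencies."""
    return mspn_activation(high_value, tan_pause) - mspn_activation(low_value, tan_pause)


with_pause = contrast(0.25, 0.50, tan_pause=True)
without_pause = contrast(0.25, 0.50, tan_pause=False)
```

In this toy setting the same cortical drive yields higher MSPN activation when the pause is present, and the difference between the two contingencies is larger with the pause than without it.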
Although Fiorillo et al. (2003) reported an increase in the sustained dopamine response with greater reward magnitude, they did not perform a parametric study to characterize the nature of the dependency. In the model, presynaptic inhibitory gating by GABA (G) of the excitatory substance-P effect, MRS [1 − G]+, on substantia nigra dopaminergic cell dendrites translates the monotonic relation between probability and matricial MSPN activation at the striatal level into a nonmonotonic relation between probability and sustained dopamine activity. The maximum of the resulting nonmonotonic function is determined by the activation threshold of matricial MSPNs for GABA and substance-P release in substantia nigra (Eq. 12). Sensitivity analysis reveals an inverted-U-shaped response to reward probabilities regardless of the reward magnitude if the threshold for GABA and substance-P release is proportional to the magnitude of DA burst in response to CS onset. To assess this, we assumed that after learning, a signal, proportional to the magnitude of DA burst, arrives near terminals of matricial MSPNs at the onset of the predictive stimulus, to scale the release threshold. If such a signal exists, Figure 9 (top) shows that the sustained response of dopamine cells would be approximately similar across a range of absolute reward magnitudes. If such a signal does not exist, and the threshold is constant regardless of the DAergic signal, then the model predicts that the sustained response of dopamine neurons would no longer be a function of reward probability alone (Fig. 9, bottom left). Rather, the sustained response will be a function of expected reward value, R̂ = |R*| × p(R*|CS). As shown in Figure 9 (bottom right), the response is maximum for moderate values of expected reward, and declines toward smaller and larger values. 
Thus, while also serving as a sensitivity analysis for key parameters regulating GABA and substance-P corelease, these results show that a DA-dependent threshold for GABA and substance-P release from the striatonigral terminals is required if the sustained response of dopaminergic cells in response to the reward probability is to be similar across a range of absolute reward magnitudes.
Many reports demonstrate the sensitivity of striatal and DA neurons to properties of cue–reward contingencies that are important for learning to predict outcomes and make decisions (Kawagoe et al., 1998; Watanabe et al., 2003; Samejima et al., 2005). This research helps unify these observations by showing how the sustained uncertainty responses (Fiorillo et al., 2003) that emerge in DA cells when animals experience probabilistic cue–reward contingencies can be explained by correcting and extending a circuit model that Brown et al. (1999) developed to explain phasic DA responses. The new model incorporates several conspicuous, interwoven features of the striatonigral architecture that were previously omitted. The simulations show that uncertainty responses can be computed efficiently via the corelease of GABA and SP by MSPN fibers that synapse not just on SNr cells but also on the comingled dendrites of DA cells whose somata are in SNc. This mechanism works even if the adaptive synapses on cue-excited inputs to the matricial MSPNs giving rise to the GABA/SP fibers obey the same learning law that enables the model to learn to generate phasic DA signals that represent RPEs (Schultz et al., 1997; Schultz, 1998). The model's uncertainty responses are robust on single trials. Because they do not arise by averaging across trials, the uncertainty computation can influence single-trial decision making. This distinguishes the current model from TD models, proponents of which (Niv et al., 2005) showed how TD models could generate uncertainty responses, but only as averaging artifacts. The present model's single-trial RPEs and single-trial uncertainty responses accord with findings (Fiorillo et al., 2005) that uncertainty responses appear in DA neurons on single trials.
Another model proposed as an alternative to TD models was the “primary value, learned value” (PVLV) model (O'Reilly et al., 2007). It incorporated core assumptions from Brown et al. (1999), but omitted others judged problematic, and added postulates that diverge from both the current model and more abstract RPE models. Although the PVLV model cannot explain sustained uncertainty responses, it can avoid a self-learning problem in the Brown et al. (1999) model. Whereas our model avoids that problem via a known biophysical constraint (namely, delayed, calcium-dependent gating of learning, which also explains why a CS must precede a US for robust conditioning), the PVLV model avoids self-learning via formal postulate, and this and further postulates produce a model that deviates in several consequential ways from most RPE-based learning theories. A common feature of the PVLV model and the new model presented here is that both exhibit the phenomenon known as blocking (Kamin, 1968). If a CS, CSA, predicts a reward, with p(US|CSA) = 1, and if CSA learning proceeds until the delayed inhibition of DA neurons (induced by CSA) cancels the US-induced excitation of DA neurons, then a newly introduced second CS, CSB, whose onset is simultaneous with CSA, will not “condition.” The predictively redundant CSB is blocked from becoming able to generate DA bursts, in accord with neural data (Waelti et al., 2001). Finally, O'Reilly et al. (2007) reported a TD model simulation in which, contrary to published observations, a CS failed to become able to induce a model DA burst when the reward followed the CS after unpredictable intervals. Although O'Reilly et al. (2007) mistakenly supposed otherwise, models such as Brown et al. (1999) and the one presented here readily explain CS conditioning with unpredictable CS–US intervals.
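The blocking logic described above can be illustrated with a deliberately abstract sketch. It uses a Rescorla-Wagner-style delta rule, not the circuit model of this paper: a single scalar prediction error stands in for the DA signal, and the function name `train`, the learning rate, and the trial counts are illustrative assumptions.

```python
def train(trials, cues, weights, reward=1.0, rate=0.2):
    """Delta-rule updates: all cues present on a trial share one prediction error."""
    for _ in range(trials):
        prediction = sum(weights[c] for c in cues)
        rpe = reward - prediction  # approaches zero once the reward is fully predicted
        for c in cues:
            weights[c] += rate * rpe
    return weights

# Blocking group: CSA alone until it predicts the reward, then CSA and CSB together.
blocked = train(100, ["A"], {"A": 0.0, "B": 0.0})
blocked = train(50, ["A", "B"], blocked)

# Control group: compound training only, with no pretraining of CSA.
control = train(50, ["A", "B"], {"A": 0.0, "B": 0.0})

print(f"blocked w_B = {blocked['B']:.4f}, control w_B = {control['B']:.4f}")
```

In this sketch the pretrained CSA drives the prediction error to nearly zero, so CSB gains almost no weight, whereas in the control group the two cues share the prediction equally.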
Perhaps the most novel feature of the new model is the role for GABA/SP corelease in genesis of sustained DA responses. If correct, this links two conspicuous BG features that are pivotal in neurodegenerative disease. SP-MSPN loss underlies Huntington's disease, whereas nigral DA cell loss underlies Parkinson's disease. The SP in fibers sent from “direct BG pathway” MSPNs (Kaneko et al., 2000; Beaujouan et al., 2004) to ventral SN makes it the most SP-rich part of the brain (Otsuka and Yoshioka, 1993), with dramatic SP depletion in Huntington's disease (Buck et al., 1981; Cicchetti et al., 2000). As noted above, direct pathway activation causes MSPN corelease of SP and GABA at nigral terminals that are positioned to affect DA neurons, which SP excites. The model predicts that SP-mediated excitation is presynaptically gated by GABAergic inhibition of release, mediated by GABAB receptors. The simulated inverted-U interaction accords with observations (Fiorillo et al., 2003) that cue-induced sustained DA responses scale nonmonotonically with the conditional probability of reward, given the cue. Because in vivo data are scarce, validation of the main hypothesis awaits further experiments. One testable prediction is that local blockade of presynaptic GABAB receptors, in animals trained on a simple pavlovian task with a probabilistic schedule [similar to Fiorillo et al. (2003)], should eliminate the nonmonotonic relationship between probability and sustained responses, and leave sustained responses that are a monotonic function of expected value.
Reid et al. (1990) showed that SP delivery alone may have a nonmonotonic effect on DA cell activation. Part of this nonmonotonic effect may have resulted from SP activation of nigral GABAergic neurons with local collaterals. Whitty et al. (1997) reported that NKR1 mRNAs were found in some nonpigmented (i.e., non-DA) SN cells, and Mendez et al. (1993) reported that GABAergic cells in the SNr that receive SP terminals do synapse onto dendrites that DA cells of SNc extend ventrally into SNr. Thus, SP-excited GABAergic neurons of SNr may shunt the excitatory effect of SP onto SNc DAergic cells' dendrites. The Appendix shows that the model is robust in the presence of such a shunt.
Tobler et al. (2005) reported that “the gain of neural activity with respect to liquid volume appeared to adapt in proportion to the range … of the predicted reward outcomes” (Fig. 5), and noted that adaptation might “be achieved by subtracting the expected value from the absolute reward value and then dividing by the variance.” They also acknowledged that the apparent adaptation might not arise in DA neurons, but might instead arise “upstream.” The choice between these two attributions is theoretically important. If DA neurons divided the difference between expected and actual inputs by the variance, such neurons could not compute RPEs that could guide learning of synaptic efficacies that reflect absolute reward magnitude. Published observations suggest otherwise. Apparent adaptation arises in the current model from “upstream” striatal weight learning, and requires no normalization of the DA neurons' sensitivities to differences. Tan et al. (2008) proved this for a class of dual-path learning models that assume that DA neurons compute the (unnormalized) difference between an unlearned excitation scaled by primary-reward magnitude and a cue-conditioned, time-delayed inhibition, whose size is adjusted by the history of RPEs. One other mathematical constraint was needed: that negative RPEs (DA dips) have smaller dynamic ranges than positive RPEs (DA bursts). This model feature accords with data (Fiorillo et al., 2003; Nakahara et al., 2004).
Fiorillo et al. (2003) did not fully assess how the sustained component depends on reward magnitude: probability varied over its full range, but reward magnitude (|R*|) did not. Simulations of the current model provide novel predictions regarding this dependency. Although the observed function emerged for a full range of conditional probabilities when cues predicted moderate-sized rewards, the model predicts that the peak of the inverted U could shift away from p = 0.5 with very large or very small rewards, unless the threshold for activation of corelease is a function of expected reward magnitude (Eq. 12; ΓM). With fixed threshold, simulated sustained activation can be a nonmonotonic function of expected reward, R̂ = |R*| × p(R*|CS), rather than of probability alone. It is maximum for moderate values of R̂, and declines for smaller and larger values. If the relationship between p(R*|CS) and sustained DA response is to be qualitatively similar across at least moderate-to-large reward magnitudes, then the GABA and substance-P release threshold should depend on expected reward magnitude. In principle (although not modeled), CS-induced phasic release of dopamine from SNc dendrites in SNr might help create such a functional dependence.
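The fixed-threshold prediction can be illustrated with a minimal sketch. It rests on simplifications that are assumptions of the sketch, not the simulated model: matricial MSPN activation is taken to equal expected value, M = |R*| × p; the gating GABA level is taken to equal the potential release level; and the names `sustained_response`, `peak_probability`, `alpha_r`, and `gamma_m` are hypothetical stand-ins for the gated SP drive, αR, and ΓM.

```python
def sustained_response(p, magnitude, alpha_r=1.0, gamma_m=0.0):
    """Gated SP drive for one cue, under the simplifications stated above."""
    m = magnitude * p                           # assumed: M tracks expected value
    release = alpha_r * max(m - gamma_m, 0.0)   # potential corelease level
    gate = max(1.0 - release, 0.0)              # presynaptic GABA gate, [1 - G]+
    return release * gate

def peak_probability(magnitude, steps=100):
    """Probability on a uniform grid at which the sustained response is largest."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: sustained_response(p, magnitude))

# Fixed threshold and gain; only the reward magnitude changes.
for magnitude in (0.5, 1.0, 2.0):
    print(magnitude, peak_probability(magnitude))
```

Under these assumptions the peak sits at p = 0.5 for unit magnitude, moves to p = 0.25 when the magnitude doubles, and moves to p = 1 when it halves, which is the sense in which the response tracks expected value rather than probability alone.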
Although both sustained and phasic responses of model DA neurons survive without the TANs and FS-INs in the model striatum, the amplitude of the inverted U was enhanced by these striatal interneurons. ACh release by model striatal TANs modulates the matricial MSPNs via a dual control of FS-INs by ACh (de Rover et al., 2002; Koós and Tepper, 2002), and via a direct inhibitory influence of ACh on SP-MSPNs (Bernard et al., 1993; Di Chiara et al., 1994). Interestingly, simulations (Tan and Bullock, 2008b) show that many data accord with the hypothesis that TANs' conditioned pause responses largely reflect learned phasic DA inputs. Cholinergic modulation of SP-MSPNs may shed light on the individual differences in uncertainty responses observed (Fig. 6, inset). For example, the model's uncertainty response to p = 1 is zero in the absence of interneurons, whereas the model responds with an ∼10% increase in baseline activity when the striatal circuit is intact (although the qualitative profiles of the uncertainty responses are similar) (Fig. 8). These model variations are consistent with the variations observed in vivo (Fig. 6, inset).
In SNc, it appears that ∼50% of dopaminergic cells respond to a CS with a phasic activation, 9% with a sustained activation, and 18% with both (C. Fiorillo, personal communication). However, neurons with no significant activation might show such activation, if a larger reward were tested. It remains to be discovered whether, in vivo, there are three DA subpopulations that correspond to the model's three distinct DA neurons (phasic only, sustained only, and mixed). Whereas learning is not affected by relative proportions of these cells, the size of the sustained component felt by target structures could be.
Uncertainty responses may be important components of DA signals to the forebrain. Their local effects may differ dramatically across signal recipients because of differing DA receptors, local circuits, DA uptake rates, etc. Predictions vary considerably across conceptualizations of the role of DA in forebrain function (Seamans and Yang, 2004). One hypothesis is that sustained DA release during an interval between CS onset and possible reward delivery might act in frontal cortex to facilitate vigilant attention and working memory storage during the delay, while acting in striatum to facilitate switches among alternative plans. A stimulus that is an uncertain predictor is often the leading edge of an information stream, later components of which resolve the uncertainty. Uncertainty-reducing information will be better detected and learned by a vigilant perceiver who remembers recent events, and better used by an actor ready to switch as further information favors one plan among several suggested by the initial information. Uncertainty responses may also facilitate behavioral switches during reversal learning (Pasupathy and Miller, 2005) (but see Tan and Bullock, 2008a).
Finally, many brain areas not modeled here are implicated in value-related computations. For example, orbitofrontal cortex mediates explicit cognitive “reframing” of costs and benefits (De Martino et al., 2006), operations beyond this model's scope. Model development was driven by data from experiments using pavlovian protocols, in which there is no required action. Therefore, there are no action costs to factor into value computations. Thus, the model does not address many issues in motivated behavior and foraging.
Changes in the level of P are described by the following differential equation, which reflects two afferents to PPTN: a primary reward-induced input, IR, from lateral hypothalamus and a conditioned stimulus-induced input, S, from ventral striatum (via the interposed ventral pallidum): where WRP and WSP are synaptic weights that multiply IR and S, respectively, and UP is an afterhyperpolarization governed by the following equation: The learned ventral striatal activation level, S, is governed by where IR again is the primary reward signal, multiplied by fixed weight WRS, Ii is a signal coming from the ith CS representation to the ventral striatal cells, and WiS is the adaptive synaptic weight that multiplies this signal. Weight adaptation is governed by where CWSmax is the upper bound on each weight. Provided that the CS-induced signal Ii is positive, then synaptic weight potentiation and depression are induced, respectively, by phasic dopamine burst or dip signals, N+ and N− (defined below in Eqs. 20, 21). These signals are never on simultaneously. Learning at the model ventral striatal adaptive weights is gated by delayed release of a second messenger, and calcium signal GWS is governed by Equations 5 and 7 (below) at a rate r = 12.5.
The formal model approximates the spectrum of delayed calcium spikes (adaptive timing spectrum) as follows: a spectrum of striosomal MSPN second messenger activities xij respond to the ith input at rates rj: where the second messenger buildup rates are given by The activities xij induce intracellular calcium dynamics within a given spine (j) at delays determined by rj. The intracellular calcium spike is represented by the quantity [GijYij]+ (Brown et al., 1999), where and where fG(x) is a step function: 0 for x ≤ 0 and 1 for x > 0. Parameters ΓG and ΓY define signal thresholds for calcium accumulation (Eq. 7) and calcium decay (Eq. 8), respectively. In the brief interval when the calcium concentration at a particular spine exceeds a threshold activity ΓS, CS-striosomal weight Zij at that particular spine becomes eligible for adaptation that may be induced by dopaminergic bursts (N+) or dips (N−): Thus, phasic DA bursts and dips during learning trials respectively potentiate and depress two sets of adaptive weights, the Zij and the WiS (treated above). With the parameters chosen in Equations 4 and 9, the latter weights change somewhat faster during extinction than the former. Indeed, primate data reported by Ljungberg et al. (1992), and new rodent data reported by Pan et al. (2005), led the latter authors to conclude that “there are different time courses for acquisition of new conditioned responses to cues and the loss of responses to the rewards predicted by those cues”; notably, the data indicate that the learning responsible for controlling the cue-dependent inhibition of DA responses to primary rewards is slower than that controlling predictive excitatory responses to cues.
As learning progresses, the CS comes to immediately excite the DA neurons through the ventral striatal pathway and to inhibit the same DA cells after a learned delay. The equation governing activity of phasic DA cells is Here the thresholded PPTN signal [P − ΓP]+ excites DA cells, but the summed spectrum of striosomal MSPN signals Σi,j [GijYij − ΓS]+ Zij inhibits DA cells. In Equation 10, ID represents endogenous factors that control the low baseline activation (and associated firing rate) of phasic DA cells.
Matricial MSPNs are modeled by where WiSIi is adaptively weighted cortical input, and the inhibition, F (supplemental Eq. S8, available at www.jneurosci.org as supplemental material), from FS-INs, is gated by factor (1 − T), which reflects activation of presynaptic receptors on FS-IN axon terminals (Koós and Tepper, 2002) by ACh released from TANs (whence “T”). The last term, T, reflects the direct inhibitory action of ACh on MSPNs (Di Chiara et al., 1994). The model used to generate the ACh signal T is treated by Tan and Bullock (2008b) and summarized in Striatal Microcircuit in the supplemental material (Eq. S7, available at www.jneurosci.org as supplemental material). The synaptic strengths of cortical inputs to the matricial MSPNs are assumed to be governed by the same mechanism as ventral striatal cells (WiS in Eq. 4).
To model how axon terminals from such matricial MSPNs affect SNc cell firing, suppose that, in the absence of any synaptic feedbacks, the GABA release level, MRG, and the substance-P release level, MRS, in SN would be equal and proportional to the thresholded matricial MSPN activation: If the release of both GABA and SP from the MSPN terminals is inhibited by released GABA acting via presynaptic GABAB receptors, net GABA (G) and SP (SP) release can be written as where 0 < βM < 1. The latter term introduces an asymmetry to GABAB receptor-mediated inhibition of SP release relative to GABA release. Stronger inhibition of SP release than GABA release is necessary for the model's successful genesis of nonmonotonic uncertainty responses. Although data are apparently lacking to prove this assumed asymmetry, several studies in the cortex and spinal cord indicate that modulation of SP and GABA release via GABA receptors can be heterogeneous, and may even involve different receptor subtypes (Bonanno et al., 1996; Teoh et al., 1996). Therefore, the model's prediction of asymmetric effects is not implausible. Equations 13 and 14 imply that the potential levels MRS and MRG are gated by factors that reflect feedback inhibition of release. Because the multiplicative gating factor declines as corelease of GABA increases, the net SP release given by the product MRS[1 − G]+ is a nonmonotonic, inverted-U, function of matricial MSPN activation level M (see also Tan and Bullock, 2008a). In accord with data already noted, released SP excites the “sustained” type of model DA cell. The net SP level, S, that acts on postsynaptic neurokinin receptors can be approximated by The activity, (Dsust), of the subset of SNc DA cells that receive this net SP signal is modeled by where ID is the same endogenous excitation defined for Equation 10, and αSD is the gain of the SP excitation effect. 
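A short worked calculation makes the inverted-U property explicit. It is a simplified illustration, not the full model: the symbol x is shorthand introduced here for the potential release level, the gating GABA level is assumed equal to x, and the asymmetry parameter βM and the feedback of GABA on its own release are ignored.

```latex
\begin{align*}
x &= \alpha_R\,[M - \Gamma_M]^{+}, \\
S(x) &= x\,[1 - x]^{+}, \\
\frac{dS}{dx} &= 1 - 2x \qquad (0 < x < 1), \\
x^{*} &= \tfrac{1}{2}, \qquad S(x^{*}) = \tfrac{1}{4}, \qquad S(0) = S(1) = 0 .
\end{align*}
```

In terms of MSPN activation, the peak of this simplified function lies at M* = ΓM + 1/(2αR), so raising the release threshold moves the maximum toward higher activation levels.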
Equation 16, which was used for all the simulations reported in the main text, does not include a postsynaptic effect of GABA release. This omission seems reasonable for present purposes, because DA neurons are relatively insensitive to GABA (Tepper and Lee, 2007), and because any postsynaptic effect of increased GABA release from matricial MSPN inputs would likely be fully offset by decreased GABA release from nearby terminals of tonically active SNr cells that would be inhibited by the same matricial MSPNs. This “offset hypothesis” is supported by data indicating that DA cells do not show responses that are time locked to the motor initiations gated by matricial MSPN outputs (DeLong et al., 1983; Romo and Schultz, 1990; Schultz, 1998). Nevertheless, it is important to show that the conclusions reached on the basis of simulations using Equation 16 are robust even if the GABA released by GABA/SP terminals has a postsynaptic effect on DA neurons that is not fully offset by reduced GABA release by fibers from SNr. To treat that case, we also studied the DA cell equation: where ID is a baseline excitation, E is the net excitatory input attributable to released SP, and I is the net inhibitory input attributable to GABA released by GABA/SP terminals. In particular, let I = kX, where X = αR [M − ΓM]+. The coefficient k (set to 0.5) scales the postsynaptic effect of GABA, as needed to reflect the relatively low sensitivity of DA cells to GABA (cf. Tepper and Lee, 2007). One more parameter is needed to allow a thorough study of the robustness of the system's ability to exhibit DA uncertainty responses that are nonmonotonic and approximately proportional to p × (1 − p). That parameter, c, scales the strength of the presynaptic GABAergic reduction of SP release, and 0 ≪ c < 1 allows for significant, but less than total, blockade. (For simplicity, we ignore, without loss of generality, the feedback effect of GABA on itself.) Thus, let E = X × (1 − cX). 
For three parameterizations, Figure 10 plots a full range of solutions of Equation 17 at equilibrium, for values of X ranging from 0 to 1. As can be seen, the qualitative shape of the response is nonmonotonic, with only small differences among the three. The black trace shows the result if the postsynaptic effect of GABA is negligible, i.e., if k = 0 and c = 1. Both the other curves reflect a postsynaptic (ultimately divisive) effect of GABA, and indeed both curves have a smaller amplitude than the black curve. If k = 0.5 and c = 1 (red curve), there is a slight leftward “peak shift.” Although the resultant nonmonotonic function peaks below p = 0.5, it is still compatible with the rank ordering probed in extant experiments: the response at p = 0.5 is greater than the responses at p = 0.25 or p = 0.75, and there is essentially no response at p = 0 or p = 1.
Furthermore, much of this peak shift disappears (blue curve) if k = 0.5 and c = 0.9, i.e., if the presynaptic gating of SP release by GABA remains proportionate, but with a slope less than one. This analysis indicates that the key property, a nonmonotonic function whose shape approximates the function p × (1 − p), is a very robust property of the system, even in the presence of a significant postsynaptic effect of GABA.
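The robustness argument can be explored with a minimal sketch. The definitions E = X × (1 − cX) and I = kX come from the text, but the form of the DA cell equation used below is an assumption of the sketch: a generic shunting equation, dD/dt = −D + (1 − D)(ID + E) − D × I, chosen only because it yields a divisive effect of GABA. The baseline value `i_d = 0.1` and the function names are likewise hypothetical.

```python
def sp_excitation(x, c):
    """E = X(1 - cX): SP drive after presynaptic GABAergic gating."""
    return x * (1.0 - c * x)

def da_equilibrium(x, k, c, i_d=0.1):
    """Steady state of the ASSUMED shunting equation
    dD/dt = -D + (1 - D)(I_D + E) - D*I, with I = kX."""
    e = sp_excitation(x, c)
    i = k * x  # postsynaptic GABA effect, scaled by k
    return (i_d + e) / (1.0 + i_d + e + i)

def peak_x(k, c, steps=1000):
    """Value of X on a uniform grid at which the equilibrium response is largest."""
    grid = [n / steps for n in range(steps + 1)]
    return max(grid, key=lambda x: da_equilibrium(x, k, c))

# The three parameterizations discussed in the text.
for label, k, c in (("black", 0.0, 1.0), ("red", 0.5, 1.0), ("blue", 0.5, 0.9)):
    print(label, peak_x(k, c))
```

With this assumed form, the peak stays at X = 0.5 when k = 0, shifts leftward when k = 0.5 and c = 1, and moves most of the way back when c = 0.9, which mirrors the qualitative pattern described for the three curves without claiming to reproduce their exact values.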
To simulate coexisting responses in a single neuron, the model includes a third set of DA neurons affected by all three pathways (PPTN-mediated excitatory projections and projections from both matricial and striosomal MSPNs): Striatal DA release is modeled by the following equation: where D = Dphasic + Dsust is the overall firing rate of DA cells as a population. Positive (N+) and complementary negative (N−) reinforcement signals are respectively derived from above- and below-baseline fluctuations from the overall DA level, D̄:
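A minimal sketch of how such complementary signals could be derived, assuming simple half-wave rectification of the deviation from baseline. The function name `reinforcement_signals` is hypothetical, and any gains or time averaging used to define D̄ in the model are not represented.

```python
def reinforcement_signals(d, d_bar):
    """Split the deviation of DA level d from baseline d_bar into (N+, N-)."""
    n_plus = max(d - d_bar, 0.0)   # burst: above-baseline part
    n_minus = max(d_bar - d, 0.0)  # dip: below-baseline part
    return n_plus, n_minus

burst = reinforcement_signals(0.8, 0.5)  # DA level above baseline
dip = reinforcement_signals(0.2, 0.5)    # DA level below baseline
print(burst, dip)
```

Because the two rectified terms have opposite signs inside the brackets, at most one of them is nonzero at any moment, in keeping with the statement that the two signals are never on simultaneously.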
This work was supported by National Science Foundation Grant SBE-354378. C.O.T. was partly supported by the Higher Education Council of Turkey and Canakkale Onsekiz Mart University of Turkey. We thank the anonymous reviewers for careful readings and comments that inspired numerous improvements to this article.
Correspondence should be addressed to Daniel Bullock, Cognitive and Neural Systems Department, Boston University, 677 Beacon Street, Boston, MA 02215.