Pavlovian cues often elicit motivations to pursue and consume the rewards (or avoid the threats) with which they have been associated. The cues are called conditioned stimuli or CSs; the rewards or threats are called unconditioned stimuli or UCSs. For addicts and sufferers from related compulsive urges, cue-triggered motivations may become quite powerful and maladaptive; they also underpin various lucrative industries (Bushong, King, Camerer, & Rangel, 2010). Pavlovian learning and responding interact in a rich and complex manner with instrumental learning and responding, in which subjects make choices contingent on expectations or past experience of the outcomes to which they lead.

Computational analyses of instrumental learning (involved in predicting which actions will be rewarded) have paid substantial attention to the critical distinction between model-free and model-based forms of learning and computation (see Fig. 1). Model-based strategies generate goal-directed choices employing a model or cognitive-style representation, which is an internal map of events and stimuli from the external world (Daw, Niv, & Dayan, 2005; Dickinson & Balleine, 2002; Doya, 1999). That internal model supports prospective assessment of the consequences of taking particular actions. By contrast, model-free strategies have no model of outside events; instead, learning takes place merely by caching information about the utilities of outcomes encountered on past interactions with the environment. This generates direct rules for how to behave, or propensities for performing particular actions, on the basis of predictions of the long-run values of actions. Model-free values can be described as being free-floating, since they can become detached from any specific outcome. The model-based/model-free distinction has been experimentally highly fruitful (Daw, Gershman, Seymour, Dayan, & Dolan, 2011; Fermin, Yoshida, Ito, Yoshimoto, & Doya, 2010; Gläscher, Daw, Dayan, & O’Doherty, 2010; Wunderlich, Dayan, & Dolan, 2012). For example, model-based mechanisms are held to produce cognitive or flexibly goal-directed instrumental behavior, whereas model-free mechanisms have often been treated as producing automatic instrumental stimulus–response habits (Daw et al., 2005; though cf. Dezfouli & Balleine, 2013). There are also intermediate points between model-based and model-free instrumental control, which we will briefly discuss below.

Fig. 1

A summary comparison of computational approaches to reward learning. The columns distinguish the two chief approaches in the computational literature: model-based versus model-free. The rows show the potential application of those approaches to instrumental versus Pavlovian forms of reward learning (or, equivalently, to punishment or threat learning). We suggest that the Pavlovian model-based cell (colored at lower left) has hitherto been comparatively neglected, since computational approaches have tended to treat Pavlovian learning as being purely model-free. However, evidence indicates that model-based Pavlovian learning happens and is used for mesolimbic-mediated instant transformations of motivation value. By contrast, instrumental model-based systems that model the value of an outcome on the basis of memory of its hedonic experience may require retasting or reexperiencing an outcome after revaluation in order to update the model (see the text for discussion and alternatives). Each cell contains (a) a brief description of its characteristic computation, (b) an example of behavioral or neural demonstrations in the experimental literature, and (c) a distinguishing feature by which it can be recognized in behavioral or neural experimental findings. Citations: (1) Dickinson & Balleine (2010); (2) Daw et al. (2005); (3) M. J. F. Robinson & Berridge (2013); (4) Schultz et al. (1997)

What makes learning Pavlovian is that the conditioned response is directly elicited by a CS that is predictive of a UCS, without regard to the effect of the response on the provision or omission of that UCS (Mackintosh, 1983). This offers the significant efficiency advantage of substituting genotypic for phenotypic search, amongst a potentially huge range of possible actions, for an action that is usually appropriate to a circumstance, but at the expense of inflexibility of response in particular cases. In contrast to analyses of instrumental learning, computational analyses of Pavlovian learning have, with only a few exceptions (Doll, Simon, & Daw, 2012), presumed the computation of prediction to be model-free, leading to simple stored caches of stimulus–value associations. However, here we will conduct a closer inspection of model-free and model-based alternatives specifically for Pavlovian learning and value predictions, attempting to meld recent insights from affective neuroscience studies of incentive motivation. We will conclude that model-based computations can play a critical role in Pavlovian learning and motivation, and that this generates flexibility in at least some affective/motivational responses to a CS (Fig. 1).

In order to illuminate the contrast between model-free and model-based predictions in Pavlovian situations, we draw on an illustrative experiment (which we call the “Dead Sea salt” experiment) recently performed with rats by M. J. F. Robinson and Berridge (2013). This experiment built on many past demonstrations that inducing a novel body need (sodium appetite) can reveal latent learning about salty food sources. Those sources may not have been attractive during learning, but can be used adaptively when a sodium need state is induced at a later time. For example, rats suddenly put into a salt appetite state will appropriately return to, work for, or even ingest, cues previously associated with salt that had no particular value to them when learned about earlier (Balleine, 1994; Dickinson, 1986; Fudim, 1978; Krieckhaus & Wolf, 1968; Rescorla & Freberg, 1978; Schulkin, Arnell, & Stellar, 1985; Stouffer & White, 2005; Wirsig & Grill, 1982). Sodium need also allows Pavlovian CS stimuli related directly to salt to undergo a hedonic transformation, to become “liked” when reencountered in a relevant appetite state (i.e., CS alliesthesia, similar to alliesthesia of the salty UCS; Berridge & Schulkin, 1989).

In the Dead Sea salt experiment (depicted in Fig. 2), a distinctive Pavlovian CS (the insertion of a lever through a wall into the chamber accompanied by a sound) was first paired with an inescapable disgusting UCS (M. J. F. Robinson & Berridge, 2013). The disgusting UCS was an intra-oral squirt of a saline solution whose high sodium chloride concentration, equivalent to that in the Dead Sea (i.e., triple that of ordinary seawater), made it very aversive. Simultaneously, a different CS (a lever inserted from the opposite side of the chamber accompanied by a different sound) predicted a different, pleasant UCS squirt of sweet sucrose solution into the mouth. The rats approached and nibbled the sucrose-related CS lever, but duly learned to be spatially repulsed by the salt-related CS whenever the lever appeared, physically “turning away and sometimes pressing themselves against the opposite wall” (p. 283), as though trying to escape from the repulsive CS lever and keep as far away as physically possible. This is a prime case of appetitive versus aversive Pavlovian conditioning (Rescorla, 1988), with the escape response being drawn from a species-typical defensive repertoire, and the appetitive response from an ingestive repertoire. Such Pavlovian responses are elicited by cues that predict their respectively valenced outcomes, albeit somewhat adapted to the natures of both the CS and the UCS. The Pavlovian responses here achieved no instrumental benefit, and would likely have persisted even if this had actually decreased sucrose delivery or increased the probability of noxious salt delivery (as in Anson, Bender, & Melvin, 1969; Fowler & Miller, 1963; Morse, Mead, & Kelleher, 1967).

Fig. 2

Instant transformation of a CS’s incentive salience observed in the Dead Sea salt study (M. J. F. Robinson & Berridge, 2013). Initial aversive Pavlovian training of CS+ with a disgusting UCS taste produces gradual learned repulsion. The CS+ value declines negatively over successive CS+ pairings with an NaCl UCS (learned Pavlovian values). After training, sudden hormone injections induce a novel state of salt appetite. The CS value is transformed instantly, to become positive on the very first reencounter in the new appetite state (CS+ presented alone in the crucial test, without the salty UCS being retasted). Behaviorally, rats approach and nibble the CS+ lever, which was previously associated with the disgusting NaCl taste as UCS, as avidly as a different CS lever that had previously been associated with a pleasant sucrose UCS. Neurobiologically, mesolimbic brain activations were observed during the combination of CS+ reencounter plus novel appetite state in dopamine-related structures: ventral tegmentum, nucleus accumbens, prefrontal cortex, and so forth. The quantitative transformation depicted is based on Zhang et al.’s (2009) computational model of incentive salience. Modified from “Instant Transformation of Learned Repulsion Into Motivational ‘Wanting’,” by M. J. F. Robinson and K. C. Berridge, 2013, Current Biology, 23, pp. 282–289, copyright 2013 by Elsevier Ltd., and from “A Neural Computational Model of Incentive Salience,” by J. Zhang, K. C. Berridge, A. J. Tindell, K. S. Smith, and J. W. Aldridge, 2009, PLoS Computational Biology, 5, e1000437, published open access under Creative Commons license. Adapted with permission

On a subsequent day, the rats were injected for the first time ever with the drugs deoxycorticosterone and furosemide. These mimic brain signals normally triggered by angiotensin II and aldosterone hormones under a state of salt deprivation (which the rats had never previously experienced). In their new state of salt appetite, the rats were then again presented with the lever CS, but in extinction (i.e., without the provision of any outcome). Their Pavlovian behavior toward the CS in the new appetite state was reassessed, as was the activation of an immediate early gene in neurons as a signature of neural activity (i.e., c-fos gene translation into Fos protein).

In the new condition, far from eliciting repulsion, as before, the salt-related CS lever suddenly and specifically now became nearly as strongly attractive as the sweet-related lever (appetitive engagement with salt-associated CS increased by a factor of more than 10, as compared with the predeprivation training days), so that the metal CS object was avidly approached, sniffed, grasped, and nibbled (M. J. F. Robinson & Berridge, 2013). These novel salt-related CS responses were again Pavlovian, achieving no instrumental benefit (the metal lever was not salty, and pressing it had never obtained salt). The transformation of the motivation (creating what is known as a motivational magnet) occurred on the very first presentations of the CS in the new state, before the newly positive valence of the salty UCS taste had been experienced, and so without any new learning about its altered UCS or new CS–UCS pairing (Fig. 2). No change in behavior was seen toward the sucrose-associated lever, nor toward a third, control, lever that predicted nothing and typically was behaviorally ignored. Sometimes the salt-associated CS also elicited affective orofacial “liking” reactions in the new appetite state that would ordinarily be elicited by a palatable taste UCS, such as licking of the lips or paws (though the rats had never yet tasted the concentrated NaCl as positively “liked” in their new appetite state; M. J. F. Robinson & Berridge, 2013).

These behavioral changes consequent on first reencountering the salt-related CS lever in the new appetite state were not the only new observation. Neurobiologically, activity in a collection of mesocorticolimbic brain areas was also dramatically up-regulated by the combination of (a) reencountering the CS+ lever simultaneously with (b) being in the new appetite state (Fig. 2). Fos was elevated in the core and rostral shell of the nucleus accumbens, as well as in the ventral tegmental area (VTA) and the rostral ventral pallidum, and in infralimbic and orbitofrontal regions of the prefrontal cortex (M. J. F. Robinson & Berridge, 2013). At least some of those brain areas, and particularly the neuromodulator dopamine (projected from the VTA to the nucleus accumbens and other structures), play a key role in the motivational attribution of incentive value to Pavlovian stimuli, a process known as incentive salience, which makes attributed stimuli (e.g., CSs as well as UCSs) become positively “wanted.” The changes in mesolimbic structures were not merely a function of increased physiological drive, but rather also required the CS+ in the new state. No significant Fos elevation at all was detected in ventral tegmentum given just the isolated state of salt appetite by itself, in the absence of the CS+ lever, and Fos elevation in nucleus accumbens regions was only one-third as high (or less), as compared to when the salt CS and appetite state were combined together. This apparent requirement for simultaneous CS plus appetite state in order to activate mesolimbic circuits maximally replicates a previous finding that firing of neurons in ventral pallidum was also elevated only by simultaneous salt CS plus appetite (but again, without actually tasting the NaCl in the deprived state; Tindell, Smith, Berridge, & Aldridge, 2009), as compared either to the deprived state alone or to CS encounters alone in the normal state. 
The earlier study also used a diffuse auditory CS that could not be spatially approached, ensuring that CS value and not elicited appetitive behavior was driving the neural activation (Tindell et al., 2009).

These new experiments helped resolve an important motivational question as to whether such sudden appetitive behavior toward the saltiness source was motivated simply to alleviate the negative distress of the salt appetite state (i.e., to reduce aversive drive), or whether Pavlovian CSs for saltiness actually become positively “wanted,” endowed with incentive salience when they are reencountered in a novel, relevant state. Pavlovian CSs that are the targets of incentive salience capture attention, are attractive, stimulate approach, and even elicit some forms of consumption behavior, almost as if they had come to share some key characteristics with the food, drug, or other reward UCSs themselves (Berridge, 2007; Toates, 1986). We interpret the results of the Dead Sea salt experiment as demonstrating spontaneous generation of positive Pavlovian incentive salience in a fashion that we suggest is model-based. It also shows that the CS’s transformation of Pavlovian motivation can be so powerful as to reverse nearly instantly an intense earlier learned repulsion into a suddenly positive, intense incentive “want.”

We focus on the Pavlovian reversal from repulsion to attraction because it is an especially vivid example of state-induced transformation in CS value. However, it is only one exemplar of a wider and long-studied class of revaluation changes in Pavlovian responses that we suggest demands explanation in terms of similar model-based mechanisms, involving stimulus–stimulus associations that preserve the details about the identities of events that have been learned (Bouton & Moody, 2004; Dickinson, 1986; Holland, 1990; Holland, Lasseter, & Agarwal, 2008; Rescorla, 1988; Rizley & Rescorla, 1972; Zener & McCurdy, 1939). In all of these cases, individuals show that they can use learned information about the identity of a UCS that is associated with a particular CS when new information is later added to that CS (e.g., developing a taste aversion to an absent UCS when its associated CS later becomes paired associatively with illness; Holland, 1990). Other related cases indicate that the incentive salience of CSs can be similarly multiplied at the time of CS reencounter as a result of neurobiological activations of mesolimbic systems, induced either by sudden dopamine/opioid pharmacological stimulation or by drug-induced neural sensitization interposed between CS–UCS training and CS reencounter (DiFeliceantonio & Berridge, 2012; Peciña & Berridge, 2013; Smith, Berridge, & Aldridge, 2011; Tindell, Berridge, Zhang, Peciña, & Aldridge, 2005; Vezina & Leyton, 2009; Wyvell & Berridge, 2000).

What is needed is a computational dissection of the way that such Pavlovian transformations in CS-triggered motivation happen. We seek the same quality of understanding for Pavlovian conditioning and motivation at the three levels of computational, algorithmic, and implementational analysis (Marr, 1982) that has emerged for instrumental conditioning and action (Dayan & Daw, 2008; Doll et al., 2012). We take each of these levels of analysis in turn, reassembling and reshaping the pieces at the end.

The computational level

The computational level is concerned with the underlying nature of tasks and the general logic or strategy involved in performing them (Marr, 1982). Here, the task is prediction, and we consider both model-based and model-free strategies.

A model-based strategy involves prospective cognition, formulating and pursuing explicit possible future scenarios based on internal representations of stimuli, situations, and environmental circumstances (Daw et al., 2005; de Wit & Dickinson, 2009; Sutton & Barto, 1998). This knowledge jointly constitutes a model and supports the computation of value transformations when relevant conditions change (Tolman, 1948). Such models are straightforward to learn (i.e., acquisition is statistically efficient). However, making predictions can pose severe problems, since massive computations are required to perform the prospective cognition when this involves building and searching a tree of long-run possibilities extending far into the future. The leaves of the tree report predicted future outcomes whose values are also estimated by the model. Such estimates could be available in memory gained through corresponding past experience (for example, actually tasting salt in a novel state of sodium need). This process of acquiring new values through relevant state experiences is sometimes called UCS retasting (in the case of foods) or incentive learning, in the more general instrumental case (Balleine & Dickinson, 1991). However, in cases such as the Dead Sea salt experiment that involve completely novel values and motivational states, the tree-search estimates are inevitably constrained by what is not yet known (unless specific instructions or relevant generalization rules are prescribed in advance). That is, any experience-derived search tree as yet contained no “leaves” corresponding to a value of “nice saltiness.” Only nasty memories of intense saltiness were available. A new leaf would be required somehow to bud.
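
The constraint described above can be illustrated with a minimal sketch (not the authors' implementation) in which a CS is evaluated by searching a toy tree whose leaves are valued from remembered experience. The names `MODEL` and `tree_value`, the node labels, and the values of -1 and +1 are all illustrative assumptions. Because the only stored memory of the salt taste is aversive, the search returns a negative value for the salt CS until the leaf memory is overwritten by retasting.

```python
from typing import Dict, List, Tuple

# Hypothetical toy model: each node lists (probability, successor) pairs.
# Nodes without an entry are treated as leaves (outcomes).
MODEL: Dict[str, List[Tuple[float, str]]] = {
    "salt_lever_cs": [(1.0, "salt_taste")],
    "sucrose_lever_cs": [(1.0, "sucrose_taste")],
}


def tree_value(node: str, leaf_memory: Dict[str, float]) -> float:
    """Expected value of `node`, found by searching forward to the leaves."""
    successors = MODEL.get(node, [])
    if not successors:
        # Leaf: its value is whatever was last experienced and remembered.
        return leaf_memory[node]
    return sum(p * tree_value(nxt, leaf_memory) for p, nxt in successors)


# Values remembered from training in the normal (sodium-replete) state.
memory = {"salt_taste": -1.0, "sucrose_taste": 1.0}
value_before_retasting = tree_value("salt_lever_cs", memory)

# Incentive learning: tasting salt while sodium-depleted overwrites the leaf.
memory["salt_taste"] = 1.0
value_after_retasting = tree_value("salt_lever_cs", memory)

print(value_before_retasting, value_after_retasting)
```

In this toy setting the salt CS changes value only after the retasting step, which is the step that the Dead Sea salt test withheld.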

The other computational strategy is model-free. This is retrospective, in the sense of operating purely using cached values accumulated incrementally through repeated experience (Daw et al., 2005; Dickinson & Balleine, 2002; Doya, 1999; Sutton & Barto, 1998), typically via a temporal-difference prediction error (Sutton, 1988). Such model-free processes must make their future estimates on the basis of reward values that have been encountered in the past, rather than estimating the possible future. In the salt experiment above, therefore, the cached CS prediction error value would have been negative in the new appetite state, as it had been in the past CS–UCS learning experiences. Model-free predictions are free of any content other than value and are unaffected if the environment or the individual’s state has suddenly changed since the past associations were learned, at least until new learning from reencounters with the CS and UCS in the new state has adjusted the contents of the cache. Model-free algorithms such as temporal-difference learning make predictions of the long-run values of circumstances, that is, of the same quantities for which model-based learning builds a tree. They achieve this by bootstrapping, that is, by substituting current, possibly incorrect, estimates of the long-run worth for true values or samples thereof. Model-free estimation is statistically inefficient because of this bootstrapping, since at the outset of learning the estimates used are themselves inaccurate. However, model-free values are immediately available, without the need for complex calculations.
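
To make the caching and bootstrapping concrete, here is a minimal sketch of tabular TD(0) learning applied to a two-step CS then UCS trial. It is offered as an illustration under stated assumptions, not as the procedure used in any cited study: the state names, the learning rate `alpha = 0.1`, the discount `gamma = 1.0`, and the UCS value of -1 are all hypothetical.

```python
from typing import Dict, Optional


def td0_update(values: Dict[str, float], state: str, reward: float,
               next_state: Optional[str], alpha: float = 0.1,
               gamma: float = 1.0) -> float:
    """Apply one TD(0) update to `values[state]`; return the prediction error."""
    next_value = values.get(next_state, 0.0) if next_state is not None else 0.0
    delta = reward + gamma * next_value - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * delta
    return delta


def train(n_trials: int) -> Dict[str, float]:
    """Run CS -> UCS trials in which the salt UCS is worth -1."""
    values: Dict[str, float] = {}
    for _ in range(n_trials):
        # CS step: no reward yet, so the target is the current UCS estimate.
        td0_update(values, "salt_cs", 0.0, "salt_ucs")
        # UCS step: the aversive taste arrives and the trial ends.
        td0_update(values, "salt_ucs", -1.0, None)
    return values


after_one_trial = train(1)
after_many_trials = train(200)
print(after_one_trial, after_many_trials)
```

After the first trial the cached CS value has not moved at all, because it was bootstrapped from a UCS estimate that was still zero; only with repetition does it approach -1. Note also that nothing in the cache refers to the body state, so the stored value would remain negative in a new appetite state until further CS and UCS experience rewrote it.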

The very different statistical and computational properties (Dayan & Daw, 2008) of model-based versus model-free strategies are a good reason to have both in the same brain. But when they coexist, the two strategies can produce values that disagree (Dickinson & Balleine, 2002, 2010). Such discrepancies might be reconciled or resolved in various ways, for instance according to the relative uncertainties of the systems (Daw et al., 2005). So, for example, a model-based strategy might dominate in early instrumental trials, when its superior statistical efficiency outweighs the noise associated with the complex calculations, but a model-free strategy might dominate once learning is sufficient to have overcome the statistical inefficiency of bootstrapping. However, intermediate points between the strategies are also under active investigation, at least in an instrumental context, from viewpoints both theoretical (Dayan, 1993; Dezfouli & Balleine, 2012; Doll et al., 2012; Keramati, Dezfouli, & Piray, 2011; Pezzulo, Rigoli, & Chersi, 2013; Sutton & Barto, 1998) and empirical (Daw et al., 2011; Gershman, Markman, & Otto, 2014; Simon & Daw, 2011). In particular, model-based predictions might train model-free predictions either offline (e.g., during quiet wakefulness or sleep: Foster & Wilson, 2006, 2007) or online (Doll, Jacobs, Sanfey, & Frank, 2009; Gershman et al., 2014), or by providing prediction errors that can directly be used (Daw et al., 2011).
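
One way to picture uncertainty-based resolution is the sketch below, in which control goes to whichever system currently reports the lower variance. This is a simplified stand-in for the idea rather than the scheme of Daw et al. (2005); the `Estimate` class, the `arbitrate` rule, and the variance schedules in `toy_estimates` are hypothetical and chosen only so that the crossover described above is visible.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Estimate:
    value: float
    variance: float  # uncertainty about `value`


def arbitrate(model_based: Estimate, model_free: Estimate) -> str:
    """Name the controller whose estimate is currently less uncertain."""
    if model_based.variance <= model_free.variance:
        return "model-based"
    return "model-free"


def toy_estimates(n_trials: int) -> Tuple[Estimate, Estimate]:
    """Hypothetical uncertainty schedules, chosen only for illustration."""
    # Model-based: learns quickly, but search noise sets a floor of 0.2.
    mb = Estimate(value=1.0, variance=0.2 + 1.0 / (1 + n_trials))
    # Model-free: bootstrapping makes early estimates poor, but no floor.
    mf = Estimate(value=1.0, variance=4.0 / (1 + n_trials))
    return mb, mf


for n in (2, 50):
    print(n, arbitrate(*toy_estimates(n)))
```

With these assumed schedules the model-based system is selected early in training and the model-free system is selected after extensive training.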

Distinguishing Pavlovian model-free from model-based learning

The computational literature has often assumed that Pavlovian learning is purely model-free (Montague, Dayan, & Sejnowski, 1996), similar to stimulus–response habits (Suri & Schultz, 1999). By contrast, we suggest here that a model-based computation is required to encompass the full range of evidence concerning Pavlovian learning and prediction. Our chief reason is that the results from the Dead Sea salt experiment and others cited above hint at a crucial model-based feature: The computation must possess information about the sensory/perceptual identity of Pavlovian outcomes, distinct from mere previous values. Identity information is necessary for appropriately predicting the value of a truly novel UCS that has never yet been experienced (e.g., an intense saltiness sensation as the UCS identity, distinct from previous nastiness or niceness), and for applying that value change selectively to the appropriate CS (i.e., the salt-associated lever) without altering responses to other CSs (i.e., either the sucrose-associated lever or the control CS lever that had been associated with nothing). Identity information is also the most basic expression of a model-based mechanism that predicts an outcome, rather than just carrying forward a previously cached value. However, as we noted above, an identity prediction does not by itself suffice; it must also be connected to the modulation of value by the current physiological state, so that the saltiness representation of the UCS associated with the CS could be predicted to have positive value in a way that would make the CS become attractive and appropriately “wanted.” This predictive transformation is tricky, since the taste outcome’s value had always been disgusting in the past. In particular, we must ask how this Pavlovian value computation is sensitive to the current brain–body state, even if novel, as the empirical results show that it is.
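
The two requirements named above, an identity prediction and state-dependent valuation of that identity, can be sketched as follows. This is a toy illustration only: the dictionaries, function names, and numerical values are hypothetical, and the hard-coded rule in `ucs_value` simply stands in for whatever mechanism actually computes the value of saltiness in a novel appetite state, which is the open question the text raises.

```python
from typing import Dict, Optional

# Learned stimulus-stimulus associations: each CS predicts a UCS identity.
CS_TO_UCS: Dict[str, Optional[str]] = {
    "salt_lever": "salty_taste",
    "sucrose_lever": "sweet_taste",
    "control_lever": None,  # predicts nothing
}

# Model-free cache: values stored during training in the normal state.
CACHED_VALUE: Dict[str, float] = {
    "salt_lever": -1.0,
    "sucrose_lever": 1.0,
    "control_lever": 0.0,
}


def ucs_value(identity: Optional[str], body_state: str) -> float:
    """Hypothetical state-dependent valuation of a predicted UCS identity."""
    if identity is None:
        return 0.0
    if identity == "salty_taste":
        # Assumed: sodium depletion flips intense saltiness from nasty to nice.
        return 1.0 if body_state == "sodium_depleted" else -1.0
    if identity == "sweet_taste":
        return 1.0
    raise ValueError(f"unknown UCS identity: {identity}")


def model_based_cs_value(cs: str, body_state: str) -> float:
    """Value the CS through the identity of the UCS it predicts."""
    return ucs_value(CS_TO_UCS[cs], body_state)


def model_free_cs_value(cs: str, body_state: str) -> float:
    """Return the cached value; `body_state` is deliberately ignored."""
    return CACHED_VALUE[cs]


for cs in CS_TO_UCS:
    print(cs, model_based_cs_value(cs, "sodium_depleted"),
          model_free_cs_value(cs, "sodium_depleted"))
```

In the sketch, a change of body state reverses the value of the salt-associated lever alone, leaving the sucrose and control levers untouched, whereas the cached values are identical in both states.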

We note that several straightforward ways of making a CS value computation sensitive to current state can be ruled out. For example, UCS retasting could have allowed the outcome to have been experienced as positively “liked” rather than as disgusting, which would have updated any cognitive model-based representations derived from value experiences (Balleine & Dickinson, 1991; Dickinson & Balleine, 2010). But in the actual experiment, prior to the crucial CS test, neither the appetite nor the resulting pleasant value of the saltiness UCS had been experienced. Ensuring this novelty was one of the key intents of this experiment; it would be harder to guarantee with food satiety, for instance, since through alliesthesia, the subjects may have the experience of eating food whilst relatively sated at the end of sustained bouts of feeding. In the Dead Sea experiment, from a computational view, the new value could only be inferred, that is, recomputed anew on the basis of internal representations of both the saltiness outcome and the novel motivational state relevant to future value. We will have to turn to alternative methods of computation that go beyond mere recall.

The algorithmic level

The algorithmic level concerns the procedures and representations that underpin computational strategies (Marr, 1982). Psychologically, this is clearest for instrumental conditioning (Daw et al., 2005; Dickinson & Balleine, 2002; Doya, 1999), with a rather detailed understanding of model-free temporal-difference learning (Sutton, 1988) and a variety of suggestions for the nature of model-based calculations (Keramati et al., 2011; Pezzulo et al., 2013; Sutton & Barto, 1998).

There are two main issues for Pavlovian conditioning. The first concerns the nature of the predictions themselves, and the second, how those predictions are translated into behavior. Our focus is on the former concern; however, first we will touch on general aspects of the latter, since Pavlovian responses to CSs are how the predictions are assessed.

Pavlovian responses and incentive salience

CSs that predict appetitive and aversive outcomes elicit a range of conditioned responses. Appetitive predictors ordinarily become attributed with incentive salience during original reward learning, mediated neurobiologically by brain mesolimbic systems. The Pavlovian attribution of incentive salience lets a targeted CS elicit surges of motivation that make that CS and its UCS temporarily more “wanted” (Flagel et al., 2011; Mahler & Berridge, 2012; M. J. F. Robinson & Berridge, 2013; T. E. Robinson & Berridge, 1993). Incentive salience attributed to a CS can direct motivated behavior toward that CS object or location, as in the Dead Sea salt experiment (with an intensity level that is dynamically modifiable by state-dependent changes in brain mesolimbic reactivity to Pavlovian stimuli; Flagel et al., 2011; Saunders & Robinson, 2012; Yager & Robinson, 2013). Incentive salience attributed to the internal representation of a UCS associated with an external CS can also spur instrumental motivation to obtain that UCS, in forms of what is known as Pavlovian-to-instrumental transfer (PIT; Colwill & Rescorla, 1988; Dickinson & Balleine, 2002; Dickinson & Dawson, 1987; Estes, 1943; Holland, 2004; Lovibond, 1981, 1983; Mahler & Berridge, 2012; Peciña & Berridge, 2013; Rescorla & Solomon, 1967). One form is specific PIT, when instrumental actions are directed at exactly the same appetitive outcome that the CS predicts (e.g., the same sugar pellets are Pavlovian UCSs and instrumental rewards). Another form is general PIT, in which instrumental effort is directed to a different outcome from the UCS associated with the CS (though the outcome will generally be a related one, such as when a CS associated with one food spurs effort to obtain another food).
In both cases, the CS spurs a burst of increased motivated effort, even though the CS may never previously have been associated with the instrumental behavior (i.e., no stimulus–response habit or association exists between the CS and the instrumental action).

An aversive CS that predicts an outcome, such as a shock UCS, may elicit freezing; it may also suppress any ongoing, appetitively directed, instrumental responding for food or another reward (Estes & Skinner, 1941; Killcross, Robbins, & Everitt, 1997). This can be seen as an aversive form of general PIT. Pavlovian anticipation of future punishments has further been suggested to lead to pruning of the model-based tree of future possibilities, potentially leading to suboptimal model-based evaluation (Dayan & Huys, 2008; Huys et al., 2012). Pruning is an influence over internal, cognitive actions, rather than external, overt ones (Dayan, 2012).

Pavlovian values

As we noted, the Dead Sea salt experiment suggests that the identity of the UCS (i.e., its saltiness) is predicted by the CS, distinct from the associated previous values (i.e., disgustingness). Once such a model-based mechanism is posited for Pavlovian learning, it may be recognized as potentially playing a role in many CS-triggered incentive responses. Related UCS identity representations of sucrose reward, drug reward, and so forth, all might be implicated in Pavlovian CS amplifications of motivation induced by many neurobiological/physiological manipulations, ranging from permanent drug-induced sensitization to sudden brain stimulations of mesolimbic brain structures that magnify cue-triggered “wanting” (e.g., dopamine/opioid drug stimulation of amygdala or nucleus accumbens).

In the Dead Sea salt experiment, the consequences of the prediction of identity can go one stage further, to conditioned alliesthesia (Toates, 1986), in which the CS is subject to the same physiological modulation as its UCS. Indeed, the Pavlovian lever/sound CS presentation sometimes elicited positive orofacial “liking” reactions in the new appetite state, much as the taste of salt UCS itself later would on the same day (M. J. F. Robinson & Berridge, 2013). By contrast, if a model-free or pure valence-based Pavlovian mechanism had controlled responding, the mechanism would have continued to generate only disgust gapes and avoidance of the lever CS. Model-based control is also consistent with findings that Pavlovian blocking (i.e., the ability of an already-learned CS to prevent new learning to a second CS that begins to be simultaneously paired with the same UCS) dissipates when the identity of the blocked CS’s UCS changes, but its valence remains matched (McDannald, Lucantonio, Burke, Niv, & Schoenbaum, 2011).

However, CS revaluation is not ubiquitous. For example, sometimes in devaluation experiments, an originally appetitive CS persists in stimulating appetitive efforts after its UCS has been made worthless, consistent with model-free evaluation. This is especially well documented in taste aversion conditioning experiments involving overtraining of the food-seeking response prior to UCS devaluation. Although we will discuss some differences later, there are further similarities between Pavlovian and instrumental effects of extensive training. For instance, it has been observed (Holland et al., 2008) that the predictive capacities of first-order appetitive CSs (which are directly associated with UCSs) are immediately affected by UCS revaluation, whereas second-order CSs (whose associations are established via first-order CSs) are less influenced (Holland & Rescorla, 1975; Rescorla, 1973, 1974). Such results suggested that first-order CSs establish stimulus–stimulus associations (i.e., identity predictions), whereas second-order CSs instead directly elicit responses engendered during conditioning (via stimulus–response associations). Indeed, a related gradient of increasing CS resistance may apply from UCS-proximal to UCS-distal CSs in Pavlovian serial associations, and from outcome-proximal to outcome-distal actions in instrumental circumstances (Balleine, Garner, Gonzalez, & Dickinson, 1995; Corbit & Balleine, 2003; Smith et al., 2011; Tindell et al., 2005).

The characteristic of model-free values of being tied to the value but not the identity of any specific outcome is especially evident in other paradigms. For instance, in some situations, the absence of a negative-valenced event may be treated by an individual as similar to the occurrence of a positive-valenced event (Dickinson & Balleine, 2002; Dickinson & Dearing, 1979; Ganesan & Pearce, 1988; Holland, 2004).

Pavlovian models

Having shown that model-based Pavlovian prediction can occur, we next consider what sort of model might be involved.

Stimulus substitution as the most elemental model-based mechanism

One simple form of model-based or sensory prediction that has not been considered from an instrumental viewpoint is Pavlov’s original notion of stimulus substitution. In this, the predicting CS comes to take on for the subject at least some of the sensory properties or qualities of the UCS it predicts, via direct activation of UCS-appropriate brain sensory regions (Pavlov, 1927); and the CS could then naturally come to take on some of the outcome’s incentive properties. Something akin to stimulus substitution is suggested when responses that are directed to the CSs resemble responses to the UCS (e.g., pigeons pecking a key in a different way when it predicts food rather than water; Jenkins & Moore, 1973), and perhaps also some aspects of the progressive instinctive drift evident in Pavlovian misbehavior, in which subjects come to manipulate the CS in some crucial ways as if it shares properties with the UCS (e.g., as when a pig roots a small CS object in a way that would normally be directed at a UCS piece of food; Breland & Breland, 1961; Dayan, Niv, Seymour, & Daw, 2006). Note, though, that substitution is never complete or literal: the CS is never actually mistaken for its UCS, and instead the nature of a Pavlovian response is always channeled by the CS identity as well as the UCS identity (Holland, 1977; Tomie, 1996). For example, a hungry rat that learns that the sudden appearance of another rat as CS predicts quick arrival of a food UCS does not try to eat its fellow rat but rather responds with effusive approach, engagement, social grooming and related positive social behaviors (Timberlake & Grant, 1975). Such cases reflect CS substitution of UCS stimulus incentive properties rather than strict identity processes (Bindra, 1978; Toates, 1986). In short, the CS does not evoke a UCS hallucination.

Stimulus substitution might be seen as one of the simplest steps away from a pure valence expectation, involving a very simple associative (or mediating; Dwyer, Mackintosh, & Boakes, 1998) prediction (Dickinson, 2012). However, it is an efficient representational method to achieve some of the computational benefits of model-based predictions without requiring sophisticated previsioning machinery that depends on such processes as working memory. At the least, it is an important hint that there may be more than one Pavlovian model-based mechanism.

Defocusing of modeled UCS identity representation

Another way to reconcile the facts that CSs can sometimes admit instantaneous revaluation (as in the Dead Sea salt study), yet sometimes resist it (as in overtraining prior to taste aversion in the studies mentioned above), is to view the representation of the predicted UCS as flexible, and able to change over extended learning. We call this hypothesized process model-based UCS defocusing. For instance, over the course of extensive training, the UCS representation might become generalized, blurred or otherwise partially merged with representations of other related outcomes. This defocusing might lead to simple representations that afford generalization by dropping or de-emphasizing some of the particular sensory details of the UCS. Defocusing would be similar to basic concept learning of a category that contains multiple exemplars, such as of “tasty food,” whose representation evolves to be distinct from the unique identity of any particular example.

For a more intuitive view of defocusing, imagine the common experience of walking down a street as mealtime approaches and suddenly encountering the odor of food cooking inside a nearby building. Usually you guess the identity of what is being cooked, but sometimes you cannot. The food smell may be too complex or subtle or unfamiliar for you to recognize the precise identity of its UCS. In such a case, you have merely a defocused representation of the food UCS. But still you might feel suddenly as hungry as if you knew the UCS identity, and perhaps quite willing—even eager—at that moment to eat whatever it is that you smell, despite the lack of focus or any detailed identity knowledge in your UCS representation.

The implications of UCS defocusing are quite profound for the interpretation of experiments into devaluation insensitivity. Instead of resulting exclusively from model-free or habitual control as a result of overtraining, persistence of responding to a CS could at least partly remain model-based. But if extensive training led a model’s representation of the UCS outcome to defocus, that defocused representation might escape any devaluation that depended on recalling the UCS’s precise sensory-identity details (e.g., Pavlovian taste aversion conditioning). The defocused representation could still support appetitive responding (at least until the UCS was actually obtained), despite the reduction in value of the actual UCS, of which the subject might still show full awareness if tested differently. Thus, dropping the particular identity taste representation of, say, fresh watermelon CS, which has now been paired with visceral illness as UCS, may leave a vaguer representation of juicy pleasant food that could still motivate appetitive effort until the previously delicious watermelon is finally retasted as now disgusting (Balleine & Dickinson, 1991; Dickinson & Balleine, 2010). This defocusing effect might especially result when the manipulations used to revalue an outcome are essentially associative or learned, as distinguished from the physiological manipulation by appetite states, drugs, or brain activations that might more directly change CS value in parallel with UCS value, similar to the CS result for Dead Sea saltiness (Berridge, 2012; Zhang, Berridge, Tindell, Smith, & Aldridge, 2009). That difference may be because associative revaluations (e.g., Pavlovian taste aversions) layer on multiple and competing associations to the same food UCS, whereas physiological/brain states (e.g., salt appetite or addictive drugs) may more directly engage revaluation circuitry, and perhaps more readily revalue a CS’s ability to trigger incentive salience (Berridge, 2012).
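The way a defocused representation could escape an identity-specific devaluation can be illustrated with a toy sketch. Everything in it is an assumption made for illustration only: the two-level hierarchy, the `focus` parameter that mixes identity-level and category-level values, and the numerical values themselves. It is not a model proposed in the literature cited here.

```python
# Hypothetical two-level outcome model: each specific UCS identity
# belongs to a more abstract category (e.g., "tasty food").
CATEGORY_OF = {"watermelon": "tasty food", "grapes": "tasty food"}


def predicted_value(ucs, identity_values, category_values, focus):
    """Value a CS predicts for `ucs` under a given degree of focus.

    focus = 1.0 -> prediction uses only the specific sensory identity.
    focus = 0.0 -> prediction is fully defocused (category level only).
    The linear mixing rule is an illustrative assumption.
    """
    if not 0.0 <= focus <= 1.0:
        raise ValueError("focus must lie in [0, 1]")
    category = CATEGORY_OF[ucs]
    return (focus * identity_values[ucs]
            + (1.0 - focus) * category_values[category])


# Before devaluation, both levels agree that the outcome is good.
identity_values = {"watermelon": 1.0, "grapes": 1.0}
category_values = {"tasty food": 1.0}

# Taste aversion is assumed to devalue only the specific identity.
identity_values["watermelon"] = -1.0

focused = predicted_value("watermelon", identity_values, category_values, 1.0)
defocused = predicted_value("watermelon", identity_values, category_values, 0.0)
```

In this sketch the focused prediction tracks the devaluation whereas the fully defocused prediction remains positive, even though both are read out of a model of outcomes rather than from a cached value.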

The suggestion that defocusing occurs for predictions of a UCS should not be seen as contradicting our main proposition that the sensory identity of outcomes is key to understanding model-based Pavlovian learning and motivation. Instead, defocusing is associated with the development of a sophisticated, likely hierarchical, representation of the UCS and model-based predictions thereof, which admits an enriched set of multiple inferences and predictions, arranged along a spectrum of abstraction. For Pavlovian reward or threat exemplars, a variety of defocused or categorical UCS representations might exist: tasty foods, quenching drinks, sexual incentives, arousing drug reward states (e.g., amphetamine and cocaine), hedonic/calming drug reward states (e.g., heroin and morphine), painful events, and so on. These could be arranged in further hierarchical layers.

The details of how this spectrum is built need future clarification. However, it could proceed along the lines of unsupervised-learning models for the development of cortical representations of general sensory input (Hinton & Ghahramani, 1997). Or it could be viewed as akin to the mechanisms of cognitive abstraction in declarative model-based systems, such as for a category of percepts (e.g., chairs in general) derived from several specific exemplars (e.g., particular chairs). Even pigeons can form perceptual abstractions, such as visual categories of pictures that contain images of trees or people, as relatively generalized concepts (Herrnstein, 1990).

Defocusing might also apply to Pavlovian representations of reward that influence instrumental behavior, such as in general PIT, when presenting an appetitive CS spurs a burst of instrumental effort to obtain other rewards (but rewards that are usually categorically similar to the CS’s UCS; e.g., tasty foods). Rather than depending on pure, model-free expectations of value, which is the conventional account of general PIT, this could depend on a model-based, but defocused, abstract, UCS prediction. For example, a CS for an ingestive UCS might trigger in different tests (a) specific PIT for its own UCS food, supported by a highly detailed representation of a reward’s unique sensory identity (e.g., a saltiness representation for the Dead Sea salt CS transformation). The food CS might also trigger (b) a defocused, model-based, PIT for a different food UCS based on a more abstract representation similar to a basic concept (e.g., a tasty food lacking sensory details that produces persistent “miswanting” after specific UCS devaluation). This defocused model would produce general PIT patterns of CS-triggered motivation for other food UCSs that belong to the same defocused class as its own UCS, but would not do so for categorically different UCSs that are quite different (e.g., nonfood rewards such as noncaloric liquids, drugs, sex, etc.). Next, (c) a nearly completely defocused representation of an outcome could simply indicate that it has good or bad valence (allowing predicted omission of a good or bad outcome to be treated similarly to the predicted occurrence of a bad or good outcome, respectively; as reflected in some tests of associative blocking). This would be close to (d) a true model-free, general PIT for a noningestive reward, such as drug reward, sex reward, and so forth. Both options (c) and (d) would generate equal intensities of general PIT for other food UCSs and for noningestive UCSs. 
However, option (c) might still retain other model-based features that could be exposed by different tests. Indeed, future PIT experiments might usefully explore the possibility that there are multiple, simultaneous representations for the same outcome, but at different degrees of defocusing. One way to do this would be to manipulate physiological states, for instance of hunger versus thirst, and then extend the range of instrumental choices in PIT experiments to include multiple UCSs belonging to different categories (e.g., food vs. nonfood rewards), and modulating CS values via relevant versus irrelevant appetites. Such PIT experiments could make more evident the difference between model-free and defocused model-based predictions, and also elucidate the representational hierarchy for the latter.

One might wonder whether the most defocused or abstract UCS prediction could be just the same as pure, model-free value. There are reasons to think not: The key distinction is that the range and form of generalization that underpins defocusing can be manipulated by information that is presented in contexts outside the precise learning problem. Take the case we mentioned above of smelling food whilst out walking. One could learn from a newspaper report that food sold on the street in London, say, is unhygienic. Such information might take the London UCS out of the generalization class of other street food, and perhaps reduce the motivating value of the CS scent of cooked food while walking in London.

Extended training studies by Holland (2004) assessing PIT after devaluation of the UCS (see also earlier examples of persistence after devaluation, such as Wilson, Sherman, & Holman, 1981) might be reinterpreted as a concrete example of defocusing. As expected from the above, extending training rendered the instrumental response resistant to devaluation. More surprisingly, though, UCS devaluation also failed to reduce specific PIT, the boost to the vigor of instrumental actions aimed at obtaining the same identical UCS as predicted by the Pavlovian CS. That is, presenting the CS associated with a devalued UCS food still enhanced effort on the instrumental response that had previously obtained that same food (the PIT test was conducted in extinction, without food outcomes actually being delivered), even though proximal conditioned responses to the CS, such as head entries into the food dish, were reduced by the devaluation. This would be consistent with multiple simultaneous representations of the UCS, with the Pavlovian one that guided instrumental behavior being defocused when accessed by instrumental learning systems, and so unaffected by the particular, identity-specific, devaluation procedure.

Defocusing or loss of UCS identity might also relate to Tolman’s (1949, 1955) interpretation of the original demonstrations that extended overtraining could induce resistance to subsequent UCS devaluation (sometimes called “habits” for that reason). Those demonstrations showed that a suddenly hungry rat, which had always previously been trained while thirsty, continued to seek the location of a water reward in a maze, and continued to ignore the location of an alternative food reward that now ought to be valuable (Thistlethwaite, 1952). Tolman thought that this might involve a “narrowing” of the cognitive map. In his own words,

even though a rat’s drive be changed from thirst to hunger, his need-push may not, for a while, change correspondingly. The water place may still be valenced, even though the drive as measured by our original definition is now one of hunger. In other words, under some conditions, rats, and men too, do not seem to have need-pushes which correspond to their actual drives (and also I would remark, parenthetically, they may often experiences valences that do not correspond to their actual values). (Tolman, 1949, p. 368)

Although habit theorists might be tempted to view the lagging “need-push” as a model-free propensity, an alternative based on defocusing would be to view it as a defocused persistence of the cognitive representation of act–outcome value in the new state, until reinstructed by value experiences relevant to that state (e.g., food becoming more valuable during hunger), all contained in a model-based or cognitive-style representation (Daw et al., 2005; Dickinson & Balleine, 2002; Doya, 1999). Such retasting opportunities lead the rat to subsequently switch to seeking food whenever in the hunger state, and not to persist in seeking water in the maze (Thistlethwaite, 1952). Tolman himself provided a rather model-based account of what he meant in terms of expectancies and cognitive maps in a related article: Namely, that thirsty overtraining with the water reward

interfered with activations of the appropriate scannings and consequent additional discriminations and expectancies necessary for the development of a pragmatic performance vector with respect to the food. The already aroused strong approach-to-water performance vector tended, as I have put it elsewhere, to narrow the rat’s “cognitive maps.” (Tolman, 1955, p. 36)

Although not identical to UCS defocusing, a narrowing of a cognitive map that prevents appropriate scanning of reward expectancies to assess new value might best be viewed in model-based terms.

Such concepts of narrowing the cognitive map or defocusing make it harder to distinguish between model-free and model-based control, since they argue that the model-based system can suffer from a form of pathology that makes its predictions resemble those of a model-free system. However, the concepts do not challenge the fundamental distinction between the two systems; rather, they invite a more discriminative set of experiments, perhaps of the flavor of those described above.

How to characterize Pavlovian model-based evaluation computationally?

A major computational challenge regarding Pavlovian valuation is to capture the change in CS value in algorithmic form. This challenge has yet to be fully met. In fact, a primary purpose of our writing this article is to inspire further attempts to develop better computational models for Pavlovian CS-triggered motivations in future. As an initial step, Zhang, Berridge, Tindell, Smith, and Aldridge (2009) proposed a phenomenological model of CS-triggered incentive salience, as the motivational transform of CS value from a previously learned cache of prediction errors (described in the Appendix). But, as those authors themselves agreed, much more remains to be done.

According to the Zhang et al. (2009) account, the cached value of previous UCS encounters is changed by a physiological–neurobiological factor called kappa that reflects the current brain/body state of the individual (whether the state is novel or familiar). The current kappa value multiplies or logarithmically transforms a temporal difference cache associated with a CS when the cue is reencountered. That transformation operation would be mediated by the mesocorticolimbic activations that produce incentive salience. The Zhang model succeeds in describing quantitatively the value transformations induced by salt appetite, other appetites and satieties, drug-induced priming of motivation, and so forth. However, the Zhang description is purely external to the mechanism in the sense that a kappa modification of a UCS value memory associated with CS captures the transformed motivation output, but does not provide any hypothesis about the internal algorithmic process by which the transformation is achieved. Essentially, the Zhang model shows how violence must be done to any preexisting model-free cache of learned values, such as that accumulated by a temporal difference mechanism, in order to achieve the newly transformed value that can appear in a new state. However, our whole point here is that the accomplishment of such Pavlovian state transformations essentially requires a model-based mechanism, not a model-free one, implying that a quite different computational approach will eventually be required. A comprehensive algorithmic version of the model-based Pavlovian computation has yet to be proposed. We hope better candidates will be proposed in coming years to help fill this important gap.
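To make the externally descriptive character of this account concrete, the following minimal sketch assumes two forms that the kappa transformation could take, multiplicative and log-additive, each applied to the cached reward term of a temporal difference value. The function name, parameter names, and default discount factor are hypothetical, and the precise formulation should be taken from Zhang et al. (2009) rather than from this illustration.

```python
import math


def kappa_transformed_value(cached_reward, next_value, kappa,
                            gamma=0.9, form="multiplicative"):
    """CS-triggered value after a kappa transformation of a cached reward.

    cached_reward : previously learned (cached) reward value of the UCS
    next_value    : cached value of the state that follows the CS
    kappa         : current physiological state factor (assumed > 0,
                    with kappa = 1 meaning "state unchanged")
    form          : "multiplicative" or "additive" (log-additive);
                    both forms are assumptions of this sketch
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    if form == "multiplicative":
        transformed = kappa * cached_reward
    elif form == "additive":
        transformed = cached_reward + math.log(kappa)
    else:
        raise ValueError("form must be 'multiplicative' or 'additive'")
    return transformed + gamma * next_value


# A cached aversive value (e.g., intensely salty UCS learned as -1.0).
unchanged = kappa_transformed_value(-1.0, 0.0, kappa=1.0)
scaled = kappa_transformed_value(-1.0, 0.0, kappa=5.0)
shifted = kappa_transformed_value(-1.0, 0.0, kappa=math.exp(3.0),
                                  form="additive")
```

Under these assumptions, a positive multiplicative kappa can only scale a cached value and so cannot reverse its sign, whereas the log-additive form can move a negative cached value to a positive one. Either way, the sketch only operates on an existing cache from outside; it says nothing about how the brain computes the transformation, which is the gap noted above.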

The implementational level

Marr’s (1982) implementational level concerns the way that the algorithms and representations are physically realized in the brain. A wealth of data have been collected from rodents and human and nonhuman primates as to the neural systems involved in model-based and model-free instrumental systems; given the observation that Pavlovian systems require and exploit predictions of long-run utility in closely related ways, one might hope that these results would generalize.

Very crudely, brain regions such as prefrontal cortex and the dorsomedial striatum, as well as the hippocampus and the default network might be involved in model-based prediction and control (Hassabis, Kumaran, Vann, & Maguire, 2007; Johnson & Redish, 2007; Pfeiffer & Foster, 2013; Schacter, Addis, & Buckner, 2008; Schacter et al., 2012; Spreng, Mar, & Kim, 2009; van der Meer, Johnson, Schmitzer-Torbert, & Redish, 2010). The dopamine system that originates in ventral tegmentum (VTA) and substantia nigra pars compacta (SNc), and its striatal targets, perhaps especially in dorsolateral neostriatum, have sometimes been suggested as being chiefly involved in model-free learning (Balleine, 2005; Daw et al., 2011; Dickinson & Balleine, 2002; Gläscher et al., 2010; Hikosaka et al., 1999; Killcross & Coutureau, 2003; Samejima, Ueda, Doya, & Kimura, 2005; Simon & Daw, 2011; Wunderlich et al., 2012). This has also been contested, and will be examined below.

Some paradigms, notably Pavlovian–instrumental transfer (PIT), provide an additional and more selective view. It is known from rodents that there is a particular involvement of circuits linking the amygdala and the accumbens in PIT, with special roles for the basolateral nucleus of the amygdala and possibly the shell of the accumbens in specific PIT, which is the form of PIT related to model-based evaluation, and the central nucleus of the amygdala and possibly the core of the accumbens in general PIT, which some regard as closer to model-free evaluation (Balleine, 2005; Corbit & Balleine, 2005; Corbit, Janak, & Balleine, 2007; Hall, Parkinson, Connor, Dickinson, & Everitt, 2001; Holland & Gallagher, 2003; Mahler & Berridge, 2012). A related circuit has been implicated in human PIT (Bray, Rangel, Shimojo, Balleine, & O’Doherty, 2008; Prevost, Liljeholm, Tyszka, & O’Doherty, 2012; Talmi, Seymour, Dayan, & Dolan, 2008). Note, though, our discussion above implying that general PIT might be reinterpreted as a form of defocused, model-based, specific PIT. Certainly general PIT undergoes similar transformations that enhance or suppress the ability of valued CSs to trigger “wanting” surges in response to neurochemical stimulations of either nucleus accumbens (shell or core) or central amygdala (Dickinson, Smith, & Mirenowicz, 2000; Mahler & Berridge, 2012; Pecina & Berridge, 2013; Wassum, Ostlund, Balleine, & Maidment, 2011; Wyvell & Berridge, 2000). Defocusing would force us to draw rather different conclusions from these various anatomical studies into the substrates of different control systems (Balleine, 2005).

Where in the brain does Pavlovian motivational revaluation of CS occur?

The answer must accommodate the ability of Pavlovian model-based systems to calculate the values of predicted outcomes under current motivational states. Some might suggest that such prospective revaluation should occur at a cortical level, perhaps involving ventromedial regions of prefrontal cortex. Quite a wealth of evidence indicates that orbitofrontal and related prefrontal areas are involved in model-based predictions of the values associated with stimuli and their predictors, in cases in which value has been obtained by previous experiences in relevant states (Boorman, Behrens, Woolrich, & Rushworth, 2009; Camille, Tsuchida, & Fellows, 2011; Jones et al., 2012; McDannald et al., 2012; O’Doherty, 2011) or even in the apparently purely model-based task of assigning value to imagined foods (Barron, Dolan, & Behrens, 2013). These areas are apparently not so involved instrumentally in directly assigning preexperienced values to current actions (Camille et al., 2011; O’Doherty, 2011), although they can potentially support stimulus-based rather than action-based choice (O’Doherty, 2011).

The orbitofrontal cortex was one of the areas in the rat that was found in the Dead Sea salt experiment to have greatly up-regulated activity in test trials following induction of salt appetite, when the CS lever was reencountered as being attractive in the new appetite state (M. J. F. Robinson & Berridge, 2013). However, the fact that animals whose neocortex has been surgically removed can still show revaluation of learned relations in the face of a new salt appetite (Wirsig & Grill, 1982) suggests that cortex (including orbitofrontal cortex) is at least not necessary for this revaluation. Preprogrammed subcortical sophistication for Pavlovian revaluation could be highly adaptive in realizing the most fundamental needs of an organism; the questions then become the range of outcomes for which subcortical transformation is possible (e.g., different sorts of natural appetite/thirst states, drug-induced states or beyond) and the identity of the regulatory circuitry that interfaces with mesolimbic circuitry, at least for the monitoring of natural need states (hypothalamus, etc.; Berthoud & Morrison, 2008; Gao & Horvath, 2008; Krause & Sakai, 2007). The potentially subcortical nature of Pavlovian motivation transformations may have implications for the degree of sophistication in the model-based previsioning—is it purely temporally local for immediately paired CS–UCS associations (e.g., depending on forms of stimulus substitution), or can it also bridge stimuli and time, as in richer forms of secondary conditioning? How defocused are these subcortical predictions? Can they contribute at all to instrumental model-based evaluation?

Role of mesolimbic dopamine?

Perhaps the most contentious contributor to evaluation is the neuromodulator dopamine. Dopamine neurons in the midbrain VTA project to the nucleus accumbens (ventral striatum) and the prefrontal cortex, and dopamine neurons in the adjacent SNc project to the neostriatum (dorsal striatum), with various subregional patterns of further localization. Many of these dopamine systems have been implicated in reward, though argument continues over precisely which reward-related functions are performed. As a brief summary of evidence, dopamine neurons projecting to nucleus accumbens and neostriatum respond similarly to rewards and to learned Pavlovian and instrumental cues (Montague et al., 1996; Morris, Nevet, Arkadir, Vaadia, & Bergman, 2006; Roesch, Calu, & Schoenbaum, 2007; Schultz, 1998, 2006; Schultz, Dayan, & Montague, 1997), and dopamine release in animals and humans is linked to rewards and cues in both striatum and nucleus accumbens (Boileau et al., 2006; Darvas & Palmiter, 2010; de la Fuente-Fernández et al., 2002; Kishida et al., 2011; Phillips, Stuber, Heien, Wightman, & Carelli, 2003; Roitman, Stuber, Phillips, Wightman, & Carelli, 2004; Volkow, Wang, Fowler, & Tomasi, 2012; Wanat, Willuhn, Clark, & Phillips, 2009; Wise, 2009; Zaghloul et al., 2009). Animals readily learn to emit actions in order to activate dopamine neurons in the VTA and SNc (Nieh, Kim, Namburi, & Tye, 2013; Rossi, Sukharnikova, Hayrapetyan, Yang, & Yin, 2013; Witten et al., 2011). There is also a rich pattern of connections from these structures to the ventral pallidum and to them from the amygdala, as well as with other subcortical nuclei, such as the lateral habenula and rostromedial tegmental nucleus (RMTg) and the serotonergic raphe nucleus, and also pathways linking them to the hypothalamus (Moore & Bloom, 1978; Swanson, 1982). 
Furthermore, the activity of dopaminergic cells, their release of dopamine, and/or the longevity of the neuromodulator at its targets are modulated by almost all addictive drugs (Hyman, Malenka, & Nestler, 2006; Koob & Volkow, 2010; Volkow et al., 2012). Finally, repeated exposure to addictive drugs can more permanently sensitize dopamine-related circuits in susceptible individuals in ways that enhance neural responses to learned reward cues (Leyton & Vezina, 2012; T. E. Robinson & Berridge, 2008; T. E. Robinson & Kolb, 2004; Thomas, Kalivas, & Shaham, 2008; Vezina & Leyton, 2009; Wolf & Ferrario, 2010).

Dopaminergic neurons and many of their targets are modulated by neuropeptide and hormone signals such as corticotropin releasing factor or ghrelin released by the hypothalamus or the periphery that can report on current stress or appetite (e.g., feeding-related) motivational states (Korotkova, Brown, Sergeeva, Ponomarenko, & Haas, 2006; Zigman, Jones, Lee, Saper, & Elmquist, 2006). The VTA and nucleus accumbens were notable among the structures recruited at the moment of salt cue reencounter during appetite in the M. J. F. Robinson and Berridge (2013) study, raising the possibility of dopamine activations as part of the mechanism for sudden CS transformation from repulsive to “wanted.” That possibility is made plausible because elevation of dopamine levels in nucleus accumbens shell or core directly enhances the degree of “wanting” triggered by reward CS above any previously learned levels in general PIT behavior, and in limbic neuronal firing to the CS in ventral pallidum, a chief output structure for nucleus accumbens (Pecina & Berridge, 2013; Smith et al., 2011; Tindell et al., 2005).

Much computational effort in the past decade aimed at understanding these roles of dopamine has focused on its possible involvement in model-free learning, especially in the form of a temporal-difference prediction error for future reward that the phasic activity of dopamine neurons strikingly resembles (Barto, 1995; Berridge, 2007; Mahler & Berridge, 2012; Montague et al., 1996; Schultz, 2006; Schultz et al., 1997). One view suggests that a phasic dopamine pulse is the key teaching signal for model-free prediction and action learning, as in one of reinforcement learning’s model-free learning methods: the actor-critic (Barto, Sutton, & Anderson, 1983), Q-learning (Roesch et al., 2007; Watkins, 1989), or SARSA (Morris et al., 2006; Rummery & Niranjan, 1994). The same dopamine signal can act to realize incentive salience, either as cached value or as Dead Sea type transformation (McClure, Daw, & Montague, 2003; Zhang et al., 2009). The actor-critic is of particular interest (Li & Daw, 2011), since, as studied in conditioned reinforcement (Mackintosh, 1983) or escape from fear (McAllister, McAllister, Hampton, & Scoles, 1980), it separates out circumstance-based predictions (in the critic) from action contingency (in the actor). There is evidence for circumstance-based and action-based prediction errors in distinct parts of the striatum (O’Doherty et al., 2004), although the action-based errors were value-based (associated with a variant of a state–action prediction called a Q-value), rather than purely action-based (as in the actor portion of the actor-critic; Li & Daw, 2011).
The fact that the critic evaluates circumstances rather than actions under particular circumstances makes it a natural candidate as a model-free predictor that can support both Pavlovian and instrumental conditioning, although it remains to be seen whether dual value- and action-based routes to Pavlovian actions also exist, with the latter being a stamped-in stimulus–response mapping (analogous to the instrumental actor). Some complications do arise from the spatial heterogeneity for valence coding that has particularly been observed in the VTA (but see Matsumoto & Hikosaka, 2009), with one group of dopamine neurons being excited by unexpected punishments (rather than being suppressed, as might naively be expected for a prediction error for reward; Brischoux, Chakraborty, Brierley, & Ungless, 2009; Lammel, Lim, & Malenka, 2014; Lammel et al., 2012).
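The separation that the actor-critic draws between circumstance-based prediction and action contingency can be sketched in a few lines. This is a generic, minimal tabular actor-critic for a single cue state, with two hypothetical actions ("press" and "wait"), made-up payoffs, and arbitrary learning rates; it is intended only to show the structure of the algorithm, not to model any experiment discussed here.

```python
import math
import random


def td_error(reward, value_next, value_now, gamma=1.0):
    """Temporal-difference prediction error: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_now


def softmax(prefs):
    """Convert action propensities into choice probabilities."""
    top = max(prefs.values())
    exps = {a: math.exp(p - top) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def actor_critic(n_trials=500, alpha_critic=0.1, alpha_actor=0.1, seed=0):
    rng = random.Random(seed)
    payoff = {"press": 1.0, "wait": 0.0}   # hypothetical outcomes
    value = 0.0                            # critic: value of the cue state
    prefs = {"press": 0.0, "wait": 0.0}    # actor: action propensities
    for _ in range(n_trials):
        probs = softmax(prefs)
        action = "press" if rng.random() < probs["press"] else "wait"
        # The trial ends after the action, so the next state's value is 0.
        delta = td_error(payoff[action], 0.0, value)
        value += alpha_critic * delta         # critic update (state only)
        prefs[action] += alpha_actor * delta  # actor update (chosen action)
    return value, prefs


value, prefs = actor_critic()
```

The same prediction error trains both components, but the critic's value is attached to the cue rather than to any action, which is why it is a natural candidate for a predictor shared by Pavlovian and instrumental conditioning. Both quantities are caches: neither changes if the payoff is altered until the new outcome is actually experienced.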

Equally, some tonic aspects of dopamine release have been suggested to mediate the vigor of action (for instance, reducing reaction times) or the exertion of effort (Niv, Daw, Joel, & Dayan, 2007; Salamone & Correa, 2002). Normal levels of tonic dopamine release are necessary for realizing both appetitive and aversive preparatory motivations elicited by stimulation of the accumbens (Faure, Reynolds, Richard, & Berridge, 2008; Richard & Berridge, 2011), and are involved in at least amplifying phasic bursts of motivation triggered by CS encounters, such as in appetitive PIT (Corbit et al., 2007; Murschall & Hauber, 2006). Appealingly for model-free learning theorists, the straightforward integration over time of the phasic prediction error formally signals the average reward (Daw, Kakade, & Dayan, 2002; although tonic and phasic dopamine activity may be under somewhat separate control; see Floresco, West, Ash, Moore, & Grace, 2003; Goto & Grace, 2005), and it is not clear whether there is also a model-based contribution to this average. The average reward reports the opportunity cost for the passage of time, and has thus been interpreted as being key to the instrumental choice of vigor (Niv et al., 2007).
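The link between the integrated prediction error and the average reward can be seen from a telescoping sum. The sketch below makes simplifying assumptions of our own for illustration: the prediction error is undiscounted, the value function is held fixed, and the function name is hypothetical. Under those assumptions the value terms cancel in pairs, so the mean prediction error equals the mean reward plus an end-point term, (V(s_T) - V(s_0)) / T, that shrinks as the time window T grows.

```python
def mean_prediction_error(rewards, values):
    """Average undiscounted TD error along one trajectory.

    rewards : r_0 ... r_{T-1}, reward received on each step
    values  : V(s_0) ... V(s_T), one more entry than `rewards`
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    deltas = [r + values[t + 1] - values[t] for t, r in enumerate(rewards)]
    return sum(deltas) / len(deltas)


# A trajectory that returns to its starting state, so the end-point
# term vanishes and the mean error equals the mean reward exactly.
rewards = [0.0, 0.0, 1.0, 0.0]
values = [0.2, 0.5, 0.9, 0.1, 0.2]   # arbitrary values, V(s_T) == V(s_0)
average = mean_prediction_error(rewards, values)
mean_reward = sum(rewards) / len(rewards)
```

Note that the result does not depend on the particular values assigned to the intermediate states, which is what allows a slowly integrating (tonic) signal to report the average reward even while the phasic errors themselves fluctuate.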

However, a model-free-learning interpretation of mesolimbic dopamine function cannot be the whole story here, either (Berridge, 2007, 2012). This is indicated by motivational transformation results such as those of the Dead Sea salt experiment and some others mentioned above. Thus, the highly significant up-regulation in activity in VTA and target structures such as nucleus accumbens triggered by the CS following the induction of salt appetite suggests that dopamine release might also be dramatically greater, potentially licensing the excess behavior directed toward the conditioned stimulus. We argued above that this revaluation is the preserve of Pavlovian model-based reasoning, and cannot be accomplished by a model-free system. This then suggests that dopamine release can actually reflect model-based evaluations rather than (at least only) model-free predictions. Similar conclusions might be drawn from the finding that inactivating the VTA disrupts both specific and general PIT (Corbit et al., 2007).

Does the involvement of dopamine in model-based CS evaluations, which had traditionally been thought of instrumentally as being associated with model-free calculations, again imply a critical difference between mechanisms of Pavlovian and instrumental model-based evaluation? There are some reasons for thinking so. For instance, Dickinson and Balleine (2010) postulated that retasting the new value of an outcome in any novel motivational state is necessary for instrumental revaluation to discover its changed value, and retasting of food while hungry in the maze was also able to revalue seeking of food in the original Thistlethwaite (1952) water/food revaluation experiments discussed by Tolman (1949, 1955). This form of instrumental incentive learning appears to be independent of dopamine, proceeding normally under dopamine receptor blockade (Dickinson & Balleine, 2010; Dickinson et al., 2000; Wassum et al., 2011). By contrast, in the Pavlovian case, revaluation does not require retasting, and is powerfully modulated by dopamine (the CS value is suppressed by blockade, and magnified in value by dopamine-stimulating drugs).

Furthermore, endogenous features of an individual’s dopamine systems may be associated with the differences among individuals in the way they assign motivational value to a particular Pavlovian CS, such as a discrete distal cue that is highly predictive of a reward UCS (Flagel et al., 2011; Saunders & Robinson, 2012; Yager & Robinson, 2013). For example, Flagel et al. (2011) measured the release and role of dopamine in two groups of rats: “sign-trackers,” whose motivated responses are directionally targeted to the discrete Pavlovian CS, and “goal-trackers,” who appear to eschew targeting Pavlovian incentive salience to that CS, and instead approach only the dish that delivers the UCS (potentially mediated also by instrumental expectations or by habits). Only the sign-trackers showed a substantial elevation in dopamine release to the predictive CS that attracted them, and their behavior was most sensitively influenced by dopamine antagonists. If the goal-trackers are indeed more subject to instrumental model-based, or to habitual model-free, control, then this absence of dopamine effects contrasts with the dopamine dependence of Pavlovian model-based control of incentive salience that we documented above.

In the end, despite these differences, the case is still open. Take, for instance, instrumental incentive learning. Daw et al. (2005) suggested that the apparent requirement for the outcome to be retasted in order to see an effect of devaluation could instead reflect involvement of model-free habits that compete with a more readily revalued model for behavioral control. Their idea is that instrumental model-based prospective evaluation, just like Pavlovian prospective evaluation, has access to the new value of the UCS. However, because of the change in motivational state, the instrumental evaluation is also aware that it is less certain about this value because novelty promotes uncertainty. Retasting duly reduces that uncertainty. By contrast, model-free predictions know about neither the new value nor the associated new uncertainty. Thus, if model-free and model-based systems compete according to their relative certainties, the model-free habit, and thus devaluation insensitivity, will dominate until retasting. However, Pavlovian model-based predictions might be less uncertain than instrumental ones, since they do not have to incorporate an assessment of the contingency between action and outcome, and so may more easily best their model-free counterparts. Thus, there might still be only one model-based knowledge system, but two control systems or different ways of translating knowledge into action: instrumental act–outcome performance (disrupted by uncertainty) and Pavlovian motivation (less affected by uncertainty in this instance). The route by which this model-based information takes its Pavlovian effects could involve corticolimbic inputs from prefrontal cortex to the midbrain dopamine system. Potential experiments might test this idea, for instance by manipulating such inputs and examining whether instant CS revaluation effects such as the Dead Sea salt study still obtain.
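The certainty-based competition described in this paragraph can be illustrated with a minimal sketch. All values and variances are hypothetical numbers chosen only for illustration, and the rule that the more certain controller wins outright is a simplifying assumption rather than the detailed scheme of Daw et al. (2005).

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    value: float     # predicted value of the outcome
    variance: float  # uncertainty attached to that prediction


def arbitrate(model_free: Estimate, model_based: Estimate) -> str:
    """Return the name of the controller with the more certain estimate."""
    if model_based.variance < model_free.variance:
        return "model-based"
    return "model-free"


# Hypothetical devaluation scenario (illustrative numbers only).
habit = Estimate(value=1.0, variance=0.2)              # cached; unaware of the state change
instrumental_new = Estimate(value=0.0, variance=0.5)   # new value, but inflated uncertainty
instrumental_retasted = Estimate(value=0.0, variance=0.05)  # retasting reduces uncertainty
pavlovian = Estimate(value=0.0, variance=0.1)          # no action-outcome contingency to assess

print(arbitrate(habit, instrumental_new))       # habit controls until retasting
print(arbitrate(habit, instrumental_retasted))  # model-based control after retasting
print(arbitrate(habit, pavlovian))              # Pavlovian model-based prediction wins at once
```

In this toy setting the same model-based knowledge yields different behavior only because the instrumental and Pavlovian routes carry different uncertainties.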

More generally, there is increasing evidence for rich interactions between model-free and model-based predictions (Daw et al., 2011; Gershman et al., 2014; Simon & Daw, 2011). For instance, the activation of VTA dopamine neurons to a saltiness-related CS, if observed in salt appetite conditions, could arise from a model-based system as part of the way that this putatively trains model-free predictions (Doll et al., 2009; Foster & Wilson, 2006, 2007; Gershman et al., 2014). There may still be differences in detail—for instance, model-based influences over dopamine may involve the VTA more than the SNc. These remain to be explored.
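One way to picture a model-based system training model-free predictions is a sketch in which cached values are nudged offline toward model-based evaluations. This is an illustrative assumption about the form of the interaction, not the specific mechanism proposed in the work cited above; the cue name, learning rate, and number of sweeps are hypothetical.

```python
def train_cached_values(cached, model_values, rate=0.5, sweeps=10):
    """Nudge cached (model-free) values toward model-based values, offline.

    Returns a new dictionary; the input dictionaries are left unchanged.
    """
    updated = dict(cached)
    for _ in range(sweeps):
        for cue, target in model_values.items():
            old = updated.get(cue, 0.0)
            # Delta-rule update with the model-based value acting as teacher.
            updated[cue] = old + rate * (target - old)
    return updated


# Hypothetical values: the cache still holds the old (aversive) value of the
# salt-related CS, whereas the model-based evaluation reflects salt appetite.
cached = {"salt_cs": -1.0}
model_values = {"salt_cs": 1.0}

after = train_cached_values(cached, model_values)
print(after["salt_cs"])
```

On this picture, a dopamine response to the CS could reflect the teaching signal passing from the model-based evaluation to the cache.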

Synthesis

We started with an experimental example of instant CS revaluation in the light of prevailing motivational states that poses an important challenge to standard computational accounts of learning and performance in Pavlovian conditioning. Our suggested answer is in one way rather simple: directly importing model-based features that are standard explanations in instrumental conditioning into what have been sometimes treated as purely model-free Pavlovian systems; that is, including for Pavlovian predictions what has long been recognized for instrumental predictions. The revaluation is exactly what a model-based system ideally could produce: a report of the current value of a predicted outcome. Key questions that remain include the circumstances under which this recomputation would seize control, the neural mechanisms responsible, and how direct CS modulation is achieved without necessarily requiring tasting of an altered UCS.
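The contrast between a cached value and a model-based report of current value can be made concrete with a minimal sketch. The utilities assigned to intense saltiness in the two motivational states are hypothetical numbers, and the function names are illustrative assumptions.

```python
# Hypothetical utilities of an intensely salty UCS in two motivational states.
UTILITY = {"normal": -1.0, "salt_appetite": 1.0}

# Model-free: a single cached number acquired during training in the normal state.
CACHED_CS_VALUE = UTILITY["normal"]


def model_free_cs_value(state):
    """Cached value; it ignores the current motivational state."""
    return CACHED_CS_VALUE


def model_based_cs_value(outcome_probability, state):
    """Probability of the predicted UCS times its utility in the current state."""
    return outcome_probability * UTILITY[state]


for state in ("normal", "salt_appetite"):
    print(state, model_free_cs_value(state), model_based_cs_value(1.0, state))
```

Only the model-based evaluation changes sign on the first encounter with the CS in the new state, without any retasting of the UCS.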

However, looked at more closely, things get more interestingly complicated in at least two ways. First, the nature of the computations and algorithms underlying Pavlovian model-based predictions remains open for investigation and future modeling. We discussed evidence hinting that these might not be completely shared with instrumental model-based predictions. The apparently embellished scope of Pavlovian model-based calculation includes such things as instant revaluation, in both normal and decorticate subjects, putatively involving sensory-identity representations of a UCS and the possibility of defocusing that representation into a categorical one along the spectrum between specific and general predictions. These ideas enrich our picture of model-based systems (potentially even applying in some respects to instrumental model-based mechanisms).

Second, consider the conclusion that the results of the salt appetite experiment or other mesolimbic manipulations of cue-triggered incentive salience indeed depend on model-based calculations. This implies that Pavlovian model- and identity-based predictions burrow directly into what has previously been thought of as the neurobiological heart of model-free and purely valence-based predictions (i.e., a temporal difference prediction error mechanism)—namely, dopamine activity and release in nucleus accumbens and related mesostriatal and mesolimbic circuitry. It therefore becomes pressing to reexamine more closely the role of dopamine brain systems in reward learning and motivation. That might include tipping the balance between model-based and model-free Pavlovian predictions. Such issues might be studied, for instance, using manipulations such as the reversible pre- and infralimbic lesions or dorsomedial and dorsolateral neostriatal manipulations (Balleine & O’Doherty, 2010; DiFeliceantonio, Mabrouk, Kennedy, & Berridge, 2012; Killcross & Coutureau, 2003; Smith, Virkud, Deisseroth, & Graybiel, 2012) that have been so revealing for instrumental conditioning.

In summary, the present computational analysis invites a blurring between model-free and model-based systems and between Pavlovian and instrumental predictions. What is clearly left is that there is significant advantage to having structurally different methods of making predictions in a single brain; that there is a critical role for pre-programming in at least some methods for making predictions; that the attention to Pavlovian model-based predictions makes even more acute the question of the multifarious nature of exactly what might be predicted; and finally that all these issues are played out over a range of cortical and sub-cortical neural systems whose nature and rich interactions are presently becoming ever more apparent.