Abstract
Model-free and model-based computations are argued to distinctly update action values that guide decision-making processes. It is not known, however, if these model-free and model-based reinforcement learning mechanisms recruited in operationally based instrumental tasks parallel those engaged by pavlovian-based behavioral procedures. Recently, computational work has suggested that individual differences in the attribution of incentive salience to reward predictive cues, that is, sign- and goal-tracking behaviors, are also governed by variations in model-free and model-based value representations that guide behavior. Moreover, it is not appreciated if these systems that are characterized computationally using model-free and model-based algorithms are conserved across tasks for individual animals. In the current study, we used a within-subject design to assess sign-tracking and goal-tracking behaviors using a pavlovian conditioned approach task and then characterized behavior using an instrumental multistage decision-making (MSDM) task in male rats. We hypothesized that both pavlovian and instrumental learning processes may be driven by common reinforcement-learning mechanisms. Our data confirm that sign-tracking behavior was associated with greater reward-mediated, model-free reinforcement learning and that it was also linked to model-free reinforcement learning in the MSDM task. Computational analyses revealed that pavlovian model-free updating was correlated with model-free reinforcement learning in the MSDM task. These data provide key insights into the computational mechanisms mediating associative learning that could have important implications for normal and abnormal states.
SIGNIFICANCE STATEMENT Model-free and model-based computations that guide instrumental decision-making processes may also be recruited in pavlovian-based behavioral procedures. Here, we used a within-subject design to test the hypothesis that both pavlovian and instrumental learning processes were driven by common reinforcement-learning mechanisms. Sign-tracking and goal-tracking behaviors were assessed in rats using a pavlovian conditioned approach task, and then instrumental behavior was characterized using an MSDM task. We report that sign-tracking behavior was associated with greater model-free, but not model-based, learning in the MSDM task. These data suggest that pavlovian and instrumental behaviors may be driven by conserved reinforcement-learning mechanisms.
Introduction
Cues in the environment that predict rewards can acquire incentive value through pavlovian mechanisms (Flagel et al., 2009) and are necessary for the survival of an organism by facilitating predictions about biologically relevant events that enable an organism to engage in appropriate preparatory behaviors. Pavlovian incentive learning, however, can imbue cues with strong incentive motivational properties that exert control over behavior, which can lead to maladaptive and detrimental behaviors (Saunders and Robinson, 2013). For example, cues that are associated with drug use can enhance craving in addicts and because of their control over behavior may precipitate relapse to drug-taking behaviors in abstinent individuals (Hammersley, 1992). Understanding the biobehavioral mechanisms underlying associative learning could, therefore, provide critical insights into how stimuli gain incentive salience.
Pavlovian associations have largely been presumed to occur through model-free, or stimulus–outcome, learning; cues that are predictive of rewards incrementally accrue value through a temporal-difference signal that is likely to be mediated by mesolimbic dopamine (Huys et al., 2014; Nasser et al., 2017; Saunders et al., 2018). Theoretical work has, however, proposed that pavlovian associations may also involve learning that is described in the computational field as model based (MB; Dayan and Berridge, 2014; Lesaint et al., 2014), whereby individuals learn internal models of the statistics of action–outcome contingencies. This hypothesis has been supported by data indicating that pavlovian associations not only represent accrued value but also the identity of pavlovian outcomes (Robinson and Berridge, 2013) and by neuroimaging studies that identify neural signatures of model-free and model-based learning in humans during a pavlovian association task (Wang et al., 2020).
Pavlovian autoshaping procedures have been used to quantify the extent to which animals attribute incentive salience to cues predictive of rewards (Flagel et al., 2009, 2011; Nasser et al., 2015). When animals are presented with a cue associated with food reward delivery, the majority of rats, known as sign trackers (STs), will approach and interact with the cue, whereas other rats, known as goal trackers (GTs), will approach the location of the reward delivery (Hearst and Jenkins, 1974; Boakes, 1977). Rats that display sign-tracking behaviors, therefore, attribute incentive salience to the cue, whereas rats that display goal-tracking behaviors do not (Robinson and Flagel, 2009), or at least attribute less incentive salience to the cue than to the goal. Our computational work (Lesaint et al., 2014; Cinotti et al., 2019) has suggested that these conditioned responses may be linked to individual differences in the extent to which rats use model-free and model-based reinforcement-learning systems to guide their behavior. For example, when using a hybrid reinforcement-learning model to simulate pavlovian approach behaviors we were able to recapitulate sign-tracking behaviors by increasing the weight of model-free updating and, notably, goal-tracking behaviors by increasing the weight of model-based updating (Cinotti et al., 2019). Variation in pavlovian approach behaviors in rodents may, therefore, reflect individual differences in model-free and model-based control over behavior (Dayan and Berridge, 2014; Lesaint et al., 2014).
Use of the multistage decision-making (MSDM) task in humans (Daw et al., 2011; Culbreth et al., 2016) and animals (Miller et al., 2017; Groman et al., 2019a; Akam et al., 2021) has provided empirical evidence that instrumental behavior is influenced by both model-free and model-based reinforcement learning computations. It is not known, however, if the relative contribution of model-free and model-based mechanisms that are recruited in an individual during pavlovian autoshaping procedures are predictive of their relative contribution during instrumental procedures, such as in the MSDM task (Sebold et al., 2016). If true, this could suggest that the computational mechanisms underlying learning are not unique to pavlovian or instrumental mechanisms but may represent a common reinforcement-learning framework within the brain that could be useful for restoring the learning mechanisms that are abnormal in disease states (Doñamayor et al., 2021; Groman et al., 2021).
In the current study we sought to test the hypothesis that ST rats would preferentially employ a model-free strategy in an instrumental task, whereas GT rats would preferentially employ a model-based strategy. Pavlovian conditioned approach was assessed in rats (Keefer et al., 2020) before model-free and model-based reinforcement-learning was assessed in a rodent analog of the MSDM task (Groman et al., 2019a). We report that sign-tracking behavior is correlated with individual differences in reward-mediated model-free, but not model-based, learning in the MSDM task. These data suggest that the model-free reinforcement-learning systems recruited during pavlovian conditioning parallel those recruited in instrumental learning.
Materials and Methods
Subjects
Twenty male Long-Evans rats were purchased from Charles River Laboratories at ∼6 weeks of age. Rats were pair housed in a climate-controlled vivarium on a 12 h light/dark cycle (lights on at 7:00 A.M., lights off at 7:00 P.M.). Rats had ad libitum access to water and underwent dietary restriction to 90% of their free-feeding body weight throughout the experiment to maintain the same hunger state in both the pavlovian and instrumental environments. Experimental procedures were approved by the Institutional Animal Care and Use Committee at Yale University and were in accordance with the National Institutes of Health institutional guidelines and Public Health Service Policy on Humane Care and Use of Laboratory Animals.
Pavlovian conditioned approach
Rats were first trained using a pavlovian conditioned approach task as previously described (Keefer et al., 2020). During a single trial, a retractable lever, the conditioned stimulus (CS), located to the left or right of a food cup was inserted into the chamber for 10 s. As the lever retracted, a single sucrose pellet (45 mg; BioServ) was dispensed into the food cup. This CS and unconditioned stimulus (US) pairing occurred on a variable-interval 60 s schedule, and each CS-US pairing was presented 25 times in each session. Each rat underwent a single daily session on the pavlovian conditioned approach task for 5 consecutive days. The primary dependent measures collected were latency to approach the lever and food cup as well as the number and probability of interactions rats made with the lever and food cup within each session. These dependent measures were used to generate a pavlovian score for each session a rat completed (see below, Data analysis). This pavlovian score is typically referred to as the Pavlovian Conditioned Approach (PCA) score; however, to avoid confusion with the data reduction technique known as principal component analysis (also commonly referred to as PCA) we refer to this measure as the PavCA score.
Deterministic MSDM task
Following the pavlovian conditioned approach sessions, rats were trained to make operant responses (e.g., nose pokes and lever responses) to receive a liquid reward delivery (90 μl of 10% sweetened condensed milk) in a different environment than that used for the pavlovian conditioned approach task. Once operant responding had been established, rats were trained on a deterministic MSDM task using procedures previously described (Groman et al., 2019a). In the deterministic MSDM task, choices in the first stage deterministically led to the second-stage state. Second-stage choices were probabilistically rewarded. Rats initiated trials by making a response into the illuminated food cup. Two levers located on either side of the food cup were extended into the box and cue lights above the levers illuminated (sA). A response made on one lever (sA, a1) resulted in the illumination of two nose port apertures (e.g., ports 1 and 2, sB), whereas a response made on the other lever (sA, a2) would result in the illumination of two other apertures (ports 3 and 4, sC). Entries into either of the illuminated apertures were probabilistically reinforced using an alternating block schedule.
Each rat was assigned to one specific lever-port configuration (configuration 1, left lever→port 1, 2; right lever→port 3, 4; configuration 2, left lever→port 3, 4; right lever→port 1, 2), which was maintained throughout the study. Reinforcement probabilities assigned to each port, however, were pseudorandomly designated at the beginning of each session (0.90 vs 0.10 or 0.40 vs 0.15; see Fig. 4A). Sessions terminated when 300 trials had been completed or 90 min had elapsed, whichever occurred first. Trial-by-trial data were collected for individual rats, and the probability that rats would select the first-stage option leading to the highest reinforced second-stage option [p(correct|stage1)] was calculated, as well as the probability that rats would select the highest reinforced second-stage option [p(correct|stage2)].
Training on the deterministic MSDM task served three primary purposes: (1) to reduce spatial biases, which are common in rats; (2) to ensure that rats understood the alternating probabilities of reinforcement at the second-stage options; and (3) to verify that rats understood the structure of the task and how first-stage choices led to different second-stage options. If rats appreciated the reinforcement probabilities assigned to the second-stage options and how choices in the first stage influence the availability of second-stage options, then the probability that rats choose the first-stage option leading to the second-stage option with the maximum reward probability [i.e., p(correct|stage1)] should be significantly greater than that predicted by chance. Rats were trained on the deterministic MSDM until they met the criterion of p(correct|stage1) being significantly greater than chance in four of the five sessions after completing the 35th training session on the deterministic MSDM. If rats did not meet the criterion after completing 43 sessions on the deterministic MSDM (N = 3), training was terminated regardless of p(correct|stage1).
Probabilistic MSDM task
Choice behavior was then assessed in the probabilistic MSDM task. Initiated trials resulted in the extension of two levers and illumination of cue lights located above each lever. For most trials (70%), first-stage choices led to the illumination of the same second-stage state that was deterministically assigned to that first-stage choice in the deterministic MSDM (referred to as a common transition). On a limited number of trials (30%), first-stage choices led to the illumination of the second-stage state most often associated with the other first-stage choice (referred to as a rare transition). Second-stage choices were probabilistically reinforced using the same alternating block schedule as that of the deterministic MSDM task. Rats completed 300 trials in each of five daily sessions on the probabilistic MSDM task.
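The common and rare transition structure can be illustrated with a short simulation. This is a minimal sketch, not the task-control code used in the study: the lever labels, the mapping of levers to second-stage states, and the function name are hypothetical, and only the 70%/30% split comes from the task description.

```python
import random

COMMON_PROB = 0.70  # probability of a common transition, from the task description

# Hypothetical mapping: the second-stage state each lever led to during
# deterministic training (the real assignment was counterbalanced across rats).
COMMON_STATE = {"lever_1": "sB", "lever_2": "sC"}


def sample_second_stage(first_choice, rng):
    """Return (second_stage_state, transition_type) for one trial."""
    common = COMMON_STATE[first_choice]
    rare = "sC" if common == "sB" else "sB"
    if rng.random() < COMMON_PROB:
        return common, "common"
    return rare, "rare"


rng = random.Random(0)  # seeded so the sketch is reproducible
trials = [sample_second_stage("lever_1", rng) for _ in range(10000)]
common_fraction = sum(kind == "common" for _, kind in trials) / len(trials)
print(f"fraction of common transitions: {common_fraction:.3f}")
```

Over many simulated trials the fraction of common transitions should settle near 0.70, which is what allows the transition-by-outcome interaction to separate model-based from model-free choice.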
Trial-by-trial data (∼1500 trials/rat) were collected to conduct logistic regression analyses of decision-making (described below). One rat was excluded from all analyses because of an extreme bias in the first-stage choice (i.e., the rat chose one lever on 97% of all trials, regardless of previous trial events).
Data analysis
Pavlovian conditioned approach
To quantify the degree to which individual rats display sign-tracking or goal-tracking behaviors, a PavCA score was calculated for individual rats by averaging three standardized measures, collected as previously described (Meyer et al., 2012). These three measures were (1) a latency score, which was the average latency to make a food cup response during the CS, minus the latency to lever press during the CS, divided by the duration of the CS (10 s); (2) a probability score, which was the probability that the rat would make a lever press, minus the probability that the rat would make a food cup response across the session; and (3) a preference score, which was the number of lever contacts during the CS, minus the number of food cup contacts during the CS, divided by the sum of these two measures. The PavCA score ranged between −1.0 and 1.0, with values closer to 1.0 reflecting a greater prevalence of sign-tracking behaviors and values closer to −1.0 reflecting a greater prevalence of goal-tracking behaviors. Previous studies have calculated the average PavCA score from the last two pavlovian sessions rats complete to classify rats as either exhibiting high or low ST behaviors (Morrison et al., 2015; Rode et al., 2020) as goal-tracking behaviors are less commonly observed within the population. We refer to this average measure as the summary PavCA score. We conducted a similar median split of the distribution of summary PavCA scores and classified rats as either exhibiting high sign-tracking behaviors (high ST rats; N = 10) or low sign-tracking behaviors (low ST rats; N = 10). All group-level analyses reported in the current study were conducted using this binary classification.
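The three component scores and their average can be written out as a short function. This is an illustrative sketch only: the function and argument names are hypothetical, the inputs are assumed to be session-level summaries, and returning a preference score of 0 when no contacts were made is an assumption that the text does not address.

```python
def pavca_score(lever_latency, cup_latency, p_lever, p_cup,
                lever_contacts, cup_contacts, cs_duration=10.0):
    """Average of the latency, probability, and preference scores."""
    # Latency score: (food cup latency - lever latency) / CS duration.
    latency_score = (cup_latency - lever_latency) / cs_duration
    # Probability score: p(lever press) - p(food cup response).
    probability_score = p_lever - p_cup
    # Preference score: (lever - cup contacts) / total contacts.
    total = lever_contacts + cup_contacts
    # Assumption: with no contacts at all, the preference score is set to 0.
    preference_score = (lever_contacts - cup_contacts) / total if total else 0.0
    return (latency_score + probability_score + preference_score) / 3.0


# Made-up example values, not data from the study.
sign_tracker = pavca_score(lever_latency=2.0, cup_latency=10.0,
                           p_lever=1.0, p_cup=0.0,
                           lever_contacts=50, cup_contacts=0)
goal_tracker = pavca_score(lever_latency=10.0, cup_latency=2.0,
                           p_lever=0.0, p_cup=1.0,
                           lever_contacts=0, cup_contacts=50)
print(sign_tracker, goal_tracker)
```

With these made-up inputs the first rat scores close to +1 (sign tracking) and the second close to −1 (goal tracking), matching the interpretation of the scale given above.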
Additionally, a trial-by-trial pavlovian score was calculated to serve as the dependent measure used in the computational analyses described below. Latency, probability, and preference scores were calculated on each trial, and the average of these measures was used to categorize an individual trial as approach to the lever or approach to the magazine. Specifically, the latency score was the average latency to make a food cup response, minus the latency to make a lever press during the CS, divided by the duration of the CS on that trial. The probability score was the probability that rats would make a lever press (+1) versus a food cup response (−1) on that trial. The preference score was the number of lever contacts during the CS, minus the number of food cup contacts during the CS, divided by the sum of these measures for that trial. Although rats could vacillate between these responses within a single trial (e.g., approach lever, check magazine, approach lever), characterizing this within-trial variability is difficult and beyond the scope of the current study. The PavCA score and the trial-by-trial pavlovian score for each of the five pavlovian sessions were positively correlated (all R² values > 0.94; all p values < 0.001), suggesting that these measures were capturing the same variability in pavlovian approach behaviors.
Model-free and model-based learning in the pavlovian conditioned approach task
We have previously reported that individual differences in pavlovian approach behavior can be recapitulated using a combination of model-free and model-based reinforcement-learning algorithms (Lesaint et al., 2014; Cinotti et al., 2019). We sought to use this hybrid reinforcement-learning model to index the contribution of these reinforcement-learning systems to the pavlovian conditioned approach behaviors measured in the current study. This model combines the outputs of these two reinforcement-learning systems to determine the likelihood that rats will approach the lever or approach the magazine on each trial. The structure of each trial of the task is represented by a Markovian Decision Process (MDP) consisting of six different states (Fig. 1A) defined by the experimental conditions, such as the presence of the lever or of the food and the current position of the rat (e.g., close to the magazine or the lever). The five different actions are explore the environment (goE), approach the lever (goL), approach the magazine (goM), engage the closest stimuli or eng<L>/<M>, and eat the reward, and state transitions given a selected action are deterministic. For example, if a rat in state 1 (s1) chooses the action goL, it will always lead to state 2 (s2) whereas if a rat in state 1 (s1) chooses the action goM, it will lead to state 3 (s3). Action values for all possible actions in the current state are generated by the decision-making model, which consists of both an MB and a feature model-free (FMF) reinforcement-learning algorithm (Fig. 1B). The MB and FMF value estimates are combined as a sum into a weighted value determined by the parameter ω. An ω parameter closer to 1 indicates that action values are more influenced by the MB computations, whereas an ω parameter closer to zero indicates that action values are more influenced by the FMF computations. The weighted values are fed into a softmax function representing the action selection mechanism.
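The integration and action-selection steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: the value dictionaries, function names, and example numbers are hypothetical, and the inverse temperature β is assumed to multiply the weighted values inside the softmax.

```python
import math


def integrate_values(q_mb, q_fmf, omega):
    """Weighted sum of MB and FMF values; omega near 1 favors the MB system."""
    return {a: omega * q_mb[a] + (1.0 - omega) * q_fmf[a] for a in q_mb}


def softmax_policy(values, beta):
    """Convert weighted values into action probabilities (beta: inverse temperature)."""
    peak = max(values.values())  # subtract the maximum for numerical stability
    exps = {a: math.exp(beta * (v - peak)) for a, v in values.items()}
    norm = sum(exps.values())
    return {a: e / norm for a, e in exps.items()}


# Hypothetical values in state 1: the FMF system favors the lever (goL),
# whereas the MB system favors the magazine (goM).
q_mb = {"goL": 0.2, "goM": 0.8, "goE": 0.0}
q_fmf = {"goL": 0.9, "goM": 0.1, "goE": 0.0}

fmf_driven = softmax_policy(integrate_values(q_mb, q_fmf, omega=0.0), beta=5.0)
mb_driven = softmax_policy(integrate_values(q_mb, q_fmf, omega=1.0), beta=5.0)
print(fmf_driven)
print(mb_driven)
```

In this toy case, setting ω to 0 makes lever approach the most probable action, whereas setting ω to 1 makes magazine approach the most probable action, which mirrors the sign-tracking and goal-tracking patterns the model is intended to capture.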
The FMF system, compared with instrumental reinforcement-learning algorithms, assigns value representations to the features associated with each action rather than to the states of the task, which allows a generalization of values among different states. For example, when the rat goes toward the magazine in state 1 (s1) or engages the magazine in state 3 (s3), it does so motivated by the same feature value [V(M)] in these two different states, which means V(M) is updated twice in the course of a single trial. After each action, the value of the corresponding feature is updated according to a standard temporal difference (TD) learning rule by first computing a reward prediction error (δ) as follows:
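For orientation, a TD update of this kind is conventionally written as shown below. This is the generic textbook form, expressed with the symbols named in this section (learning rate α, discounting factor γ, reward r). The feature function f and the use of a maximum over next-state actions are assumptions here, and the exact expression used by the model is given in Lesaint et al. (2014).

```latex
\delta \leftarrow r + \gamma \max_{a'} V\bigl(f(s', a')\bigr) - V\bigl(f(s, a)\bigr),
\qquad
V\bigl(f(s, a)\bigr) \leftarrow V\bigl(f(s, a)\bigr) + \alpha\,\delta
```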
The TD learning rule was only applied to the selected feature (E, environment; L, lever; M, magazine) in each state transition, except in the case of food, which was equal to 1, the value of reward. Because the rat is likely to visit the magazine during the intertrial interval (ITI), the values of the magazine are revised between state 5 and state 0 according to the following equation:
The MB system relies on learned transition T and reward R functions for updating action values. The transition value function aims to determine the probability of going from one state to the next given a certain action. After transitioning from state
Because the environment is deterministic, T(s, a, s′) should converge perfectly toward values of 1 for all possible state transitions and remain at a value of 0 for all impossible state transitions (e.g., s1 → s4). Similarly, the reward function R(s, a) is updated according to the following:
This model contained five free parameters, the learning rate α, the discounting factor γ, the inverse temperature β, the ITI update factor
To avoid local maxima, starting values for each free parameter were optimized using a grid search so that each parameter had three possible initial values, and all 3^5 (i.e., 243) possible combinations were tested as the starting point of the gradient descent procedure to maximize the likelihood L. Each pavlovian session only contained 25 trials, which made it difficult to obtain reliable parameter estimates from this reinforcement-learning model. To improve the accuracy and reliability of parameter estimates and model fit, trial-by-trial data for all five of the pavlovian sessions were concatenated into a single dataset for individual rats and analyzed with the reinforcement-learning model. The parameter estimates that yielded the largest log likelihood were retained and are reported in Table 1.
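The multi-start procedure can be sketched as follows. Three candidate values for each of five parameters give 3^5 = 243 starting points. Everything specific in this sketch is an assumption made for illustration: the candidate values, the parameter keys (including `iti_factor` for the ITI update factor), and the `fit_from` callable, which stands in for whatever optimizer is run from each starting point.

```python
import itertools

# Hypothetical candidate starting values; the actual grid is not listed in the text.
START_GRID = {
    "alpha": (0.1, 0.5, 0.9),       # learning rate
    "gamma": (0.1, 0.5, 0.9),       # discounting factor
    "beta": (1.0, 5.0, 10.0),       # inverse temperature
    "iti_factor": (0.1, 0.5, 0.9),  # ITI update factor (hypothetical key name)
    "omega": (0.1, 0.5, 0.9),       # MB/FMF weighting
}


def starting_points(grid):
    """Enumerate every combination of candidate starting values."""
    names = list(grid)
    combos = itertools.product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def best_fit(starts, fit_from):
    """Run fit_from(start) -> (params, log_likelihood) from each start; keep the best."""
    return max((fit_from(start) for start in starts), key=lambda result: result[1])


def toy_fit(start):
    """Placeholder for a real optimizer: scores the start itself, peaking at omega = 0.5."""
    return start, -(start["omega"] - 0.5) ** 2


starts = starting_points(START_GRID)
best_params, best_ll = best_fit(starts, toy_fit)
print(len(starts), best_params["omega"], best_ll)
```

In practice `toy_fit` would be replaced by a routine that runs the optimizer from the given starting point and returns the fitted parameters with their log likelihood; retaining the largest log likelihood across starts corresponds to the selection rule described above.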
Logistic regression of choice data in the MSDM tasks
We have previously shown that the choice behavior of rats in the MSDM task is guided by previous trial events, such as previous trial outcome, choice, and, in the probabilistic MSDM task, state transitions. Trial-by-trial choice data in the deterministic and probabilistic MSDM were analyzed with a logistic regression model using the glmfit function in MATLAB version 2017a (MathWorks). These logistic regression models predicted the likelihood that rats would select the same first-stage choice on the current trial (trial t) that they had on the previous trial (trial t-1), namely the probability of staying or p(stay). The model used to analyze choice data in the deterministic MSDM contained the following predictors.
Intercept: +1 for all trials, which quantifies the tendency for rats to repeat the same first stage option regardless of any other trial events.
Correct: +1 for trials where the rat selects the first-stage option whose common transition leads to the highest reinforced stage 2 option. −1 for trials where the rat selects the first-stage option whose common transition leads to the lowest reinforced stage 2 option.
Outcome: +1 if the previous trial resulted in a rewarded outcome. −1 if the previous trial resulted in an unrewarded outcome.
The model used to analyze choice data in the probabilistic MSDM contained the same predictors as described above as well as two additional predictors.
Transition: +1 if the previous trial included a common transition. −1 if the previous trial included a rare transition.
Transition-by-Outcome: +1 if the previous trial included a common transition and was rewarded or if it included a rare transition and was unrewarded. −1 if the previous trial included a rare transition and was rewarded or included a common transition and was unrewarded.
The correct predictor in the logistic regression prevents spurious loading onto the transition-by-outcome interaction predictor (Akam et al., 2015), which can occur when using blocked schedules of reinforcement in the MSDM task. We included the correct predictor in all logistic regression models to ensure consistency across analyses and MSDM tasks. Critically, the regression coefficient applied to outcome quantifies model-free behavior, and the regression coefficient applied to the transition-by-outcome interaction quantifies model-based behavior.
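The predictor coding listed above can be made concrete with a short sketch that builds one row of predictors per trial for the probabilistic MSDM. This is illustrative only and is not the MATLAB code used in the study: the trial record format and function names are hypothetical, and the correct predictor is assumed here to be coded from the previous trial's choice.

```python
def sign(flag):
    """Code a boolean as +1 (True) or -1 (False)."""
    return 1 if flag else -1


def stay_design_rows(trials):
    """Build predictors and the stay outcome for trials 2..N.

    Each trial is a dict with hypothetical keys:
    choice (lever label), correct, rewarded, common (booleans).
    """
    rows, stays = [], []
    for prev, cur in zip(trials, trials[1:]):
        outcome = sign(prev["rewarded"])
        transition = sign(prev["common"])
        rows.append({
            "intercept": 1,
            "correct": sign(prev["correct"]),  # assumption: coded from trial t-1
            "outcome": outcome,
            "transition": transition,
            # +1 for common/rewarded or rare/unrewarded, -1 otherwise.
            "transition_x_outcome": transition * outcome,
        })
        stays.append(1 if cur["choice"] == prev["choice"] else 0)
    return rows, stays


example_trials = [
    {"choice": "left", "correct": True, "rewarded": True, "common": True},
    {"choice": "left", "correct": True, "rewarded": False, "common": False},
    {"choice": "right", "correct": False, "rewarded": True, "common": False},
]
rows, stays = stay_design_rows(example_trials)
print(rows)
print(stays)
```

Because the interaction term is the product of the ±1 transition and outcome codes, it takes the value +1 exactly in the two cases listed for the transition-by-outcome predictor. The resulting rows and stay indicators could then be passed to any standard logistic regression routine.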
Logistic regression of rewarded and unrewarded outcomes
We found that individual differences in the summary PavCA score were related to variation in the outcome regression coefficient (see below, Results). To determine whether this relationship was due to differences in the influence of rewarded and/or unrewarded outcomes on choice behavior, we analyzed choice data in the MSDM task using a different logistic regression model that estimated the likelihood that rats would repeat the same first-stage choice based on whether the previous trial was rewarded or unrewarded. This logistic regression model, unlike the first, permitted an independent analysis of how each trial outcome (rewarded or unrewarded) influenced first-stage choices. The predictors included in this model were as follows.
Intercept: +1 for all trials. This quantifies the tendency for rats to repeat the same first-stage option regardless of any other trial events.
Rewarded: +1 if the previous trial was rewarded and the rat chose the same lever (first-stage choice) that was selected on the subsequent trial. −1 if the previous trial was rewarded and the rat chose a different lever (first-stage choice) than what was selected on the subsequent trial. 0 if the previous trial was unrewarded.
Unrewarded: +1 if the previous trial was unrewarded and the rat chose the same lever that was selected on the subsequent trial. −1 if the previous trial was unrewarded and the rat chose a different lever than what was selected on the subsequent trial. 0 if the previous trial was rewarded.
Positive regression coefficients for the rewarded and unrewarded predictors indicate that rats are more likely to persist with the same first-stage choice, whereas negative regression coefficients indicate that rats are more likely to shift their first-stage choice. The probability that rats would repeat the same first-stage choice following rewarded and unrewarded trials was also calculated to examine how this more traditional measure of win-stay and lose-stay behaviors might differ between high and low ST rats.
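The more traditional win-stay and lose-stay measures mentioned above can be computed directly from the sequence of first-stage choices and outcomes. This is a minimal sketch with hypothetical function and variable names and made-up example data; it is not the analysis code used in the study.

```python
def stay_probabilities(choices, rewards):
    """Return (p_stay_after_reward, p_stay_after_no_reward).

    choices: first-stage choice on each trial (any hashable label).
    rewards: 1 or 0 for the outcome of each trial.
    """
    counts = {True: [0, 0], False: [0, 0]}  # previous rewarded? -> [stays, total]
    for t in range(1, len(choices)):
        prev_rewarded = bool(rewards[t - 1])
        counts[prev_rewarded][0] += int(choices[t] == choices[t - 1])
        counts[prev_rewarded][1] += 1

    def ratio(stays, total):
        # Undefined (NaN) if no trial of that type occurred.
        return stays / total if total else float("nan")

    return ratio(*counts[True]), ratio(*counts[False])


# Made-up example: five trials from one session.
choices = ["L", "L", "R", "R", "L"]
rewards = [1, 0, 1, 1, 0]
win_stay, lose_stay = stay_probabilities(choices, rewards)
print(win_stay, lose_stay)
```

In this example the rat repeats its choice after two of three rewarded trials and after none of the unrewarded trials, so win-stay is 2/3 and lose-stay is 0.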
Statistical analyses
Values presented are mean ± SEM, unless otherwise noted. Statistical analyses were conducted in IBM SPSS (version 26), MATLAB (version 2017a, MathWorks), and R (https://www.R-project.org) software. Generalized linear models (GLMs; R glmfit package) were used to analyze the relationship between the summary PavCA score and choice behavior of rats in the MSDM task. The dependent variable was a binary array coding for whether the first-stage choice was the same (+1) or different (0) from the previous trial. Predictors in the model could be correct, outcome, transition, transition-by-outcome interaction, and summary PavCA score or the binary classification of low ST or high ST rats. All higher-order (e.g., summary PavCA score times outcome times transition) and lower-order (e.g., summary PavCA score times outcome) interactions were included in the model. Significant interactions were tested using progressively lower-order analyses. Another GLM was used to examine the relationship between the summary PavCA score and the influence of rewarded and unrewarded outcomes on first-stage choices. The dependent variable was a binary array coding for the first-stage choice (+1 for left lever and 0 for right lever). Predictors in the model were rewarded, unrewarded, and summary PavCA score. All interactions (e.g., summary PavCA score times rewarded) were included in the model and significant interactions tested using lower-order analyses.
All other analyses were performed in SPSS. Repeated-measures data were entered into a generalized estimating equation model using a probability distribution based on the known properties of these data. Specifically, event data (e.g., number of trials in which rats chose the highest reinforced first-stage option) were analyzed using a binary logistic distribution. Relationships between dependent variables (e.g., ω and model-free learning) were tested using Spearman's rank correlation coefficient.
Results
Pavlovian conditioned approach
Pavlovian incentive learning was assessed in rats in a pavlovian conditioned approach task for 5 d (Fig. 2A,B). The summary PavCA score was calculated and a median split conducted to classify rats as exhibiting either high (N = 10) or low (N = 9) sign-tracking behaviors (Fig. 2C). As expected, the PavCA score increased across the sessions in the high ST group (Wald χ2 = 91.33; p < 0.001) but not in the low ST group (Wald χ2 = 0.23; p = 0.63; Fig. 2D). We then examined how lever- and food-cup-directed behaviors changed across the five pavlovian conditioning sessions in both high and low ST rats (Fig. 2E–G). Post hoc analysis of the group (high vs low ST) × session interaction (Wald χ2 = 30.37; p < 0.001) indicated that the latency score, the probability score, and the preference score increased across the pavlovian sessions in the high ST group (Wald χ2 = 68.28; p < 0.001) but not in the low ST group (all Wald χ2 values < 0.99; p > 0.32). These session-dependent changes in the high ST rats are similar to observations that we, and others, have reported using pavlovian conditioned approach tasks (Flagel et al., 2011; Saunders and Robinson, 2011; Keefer et al., 2020).
Computational analysis of pavlovian approach behavior
Each pavlovian session consisted of only 25 trials, which limited our ability to obtain reliable and accurate reinforcement-learning parameter estimates for each session and each rat. To overcome this, we concatenated the trial-by-trial data from all five pavlovian sessions into a single 125-trial dataset for individual rats and fitted these data with the hybrid model described above, and estimates of the five parameters (e.g.,
We found that some of the parameter estimates were on extreme ends of the distribution and/or boundary, likely because we were trying to optimize five parameters with a limited number of trials (∼125 trials/rat). To improve model fit, we fixed four of the parameters (
To ensure that the ω parameter estimate was not being skewed by the dynamics of learning that occurs across the five pavlovian sessions, we also estimated the ω parameter using the trial-by-trial data collected in the last two pavlovian sessions (i.e., 50 trials in total). We then compared this estimate with the ω parameter estimate obtained from trial-by-trial data collected in all the pavlovian sessions (i.e., 125 trials). The ω parameter estimates from these analyses were positively correlated with one another (Spearman's ρ = 0.41; p = 0.08), suggesting that inclusion of earlier sessions when learning was occurring did not bias our estimate of the ω parameter. Subsequent analyses reported below were done using the ω parameter that was estimated from trial-by-trial data from all five pavlovian sessions.
Our previous simulation experiments using this reinforcement-learning model have found that as the ω parameter approaches zero, and the decision-making algorithm favors the FMF system, the prevalence of sign-tracking behaviors increases. We hypothesized, therefore, that the ω parameter would be negatively correlated with the summary PavCA scores across rats. Indeed, the ω parameter that was estimated from the trial-by-trial data collected across the five pavlovian sessions rats completed was negatively correlated with the summary PavCA score (Spearman's ρ = −0.89; p < 0.001; Fig. 3C). These results, collectively, indicate that the restricted hybrid reinforcement-learning model can capture meaningful variation in pavlovian approach behavior.
Reward-guided behavior in the deterministic MSDM task is related to ST behaviors
Choice behavior on the deterministic MSDM task was then examined (Fig. 4A,B). The probability that rats selected the first-stage choice associated with the most frequently reinforced second-stage option increased across the 35 training sessions (β = 0.012, p < 0.001) and was significantly greater than that predicted by chance in the last five sessions that rats completed (binomial test, p < 0.001; Fig. 4C). Rats were more likely to repeat a first-stage choice that was subsequently rewarded than a first-stage choice that was subsequently unrewarded (Wald χ2 = 113.57, p < 0.001; Fig. 4D) indicating that second-stage outcomes were able to influence subsequent first-stage choices. These data, collectively, indicate that rats understood the structure of the deterministic MSDM task and, critically, that their first-stage choices influenced the subsequent availability of second-stage options.
To quantify the influence of previous trial events (e.g., correct, outcome) on first-stage choices, choice data from rats were analyzed with a logistic regression model (Fig. 4E, Table 3). The intercept was significantly greater than zero (z = 14.92, p < 0.001), indicating that rats, similar to humans, were more likely to repeat a first-stage choice regardless of previous trial events. Nevertheless, the effect of outcome was also significantly different from zero (z = 46.56, p < 0.001), indicating that rats were using previous trial outcomes (reward and absence of reward) to guide their first-stage choices.
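To make the interpretation of the intercept and outcome coefficients concrete, the following is a simplified sketch of a stay/switch logistic regression with a single predictor. It is not the model reported in Table 3: the coding of outcome as +0.5 (rewarded) and −0.5 (unrewarded), the omission of the other predictors and of any subject-level terms, and the toy counts are all assumptions made for illustration.

```python
from math import exp

def fit_stay_model(outcome, stay, n_iter=25):
    """Fit logit P(stay) = b0 + b1 * outcome by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of X'WX (negative Hessian)
        for x, y in zip(outcome, stay):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy counts (illustration only, NOT study data):
# 80/100 stays after rewarded trials, 60/100 stays after unrewarded trials.
outcome = [0.5] * 100 + [-0.5] * 100
stay = [1] * 80 + [0] * 20 + [1] * 60 + [0] * 40
intercept, outcome_coef = fit_stay_model(outcome, stay)
```

Under this coding, a positive intercept reflects an overall tendency to repeat the previous first-stage choice, and a positive outcome coefficient reflects a greater tendency to repeat after reward than after no reward.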
We then examined whether individual differences in pavlovian approach behavior predicted choice behavior of the same rat in the deterministic MSDM task. The summary pavlovian score was included as a covariate in the logistic regression model, and the two-way interaction between outcome and the pavlovian score was examined. The summary pavlovian score × outcome interaction was a significant predictor in the model (z = 7.51, p < 0.001; Table 3), and post hoc analyses indicated that the regression coefficient for outcome was significantly greater in high ST rats compared with the low ST rats (z = 9.58, p < 0.001; Fig. 4F). These data demonstrate that high ST rats were more likely to use previous trial outcomes to guide their choice behavior compared with low ST rats.
The outcome regression coefficient quantifies the degree to which both rewarded and unrewarded outcomes guide subsequent choice behavior. Differences in the outcome regression coefficient that we observed between high and low ST rats might, therefore, reflect variation in how rats use rewarded or unrewarded outcomes to guide their behavior. To independently assess the impact of rewarded and unrewarded trials on first-stage choices, we conducted a second logistic regression analysis of choice data in the deterministic MSDM task, which examined the likelihood that rats would repeat the same first-stage choice following a rewarded or unrewarded outcome. The rewarded regression coefficient was positive (β = 1.98 ± 0.03, z = 71.59, p < 0.001), indicating that rats repeated first-stage choices that resulted in reward. The unrewarded regression coefficient was also positive (β = 0.33 ± 0.02, z = 17.78, p < 0.001) but smaller than the rewarded regression coefficient (Wald χ2 = 106, p < 0.001), indicating that rats were more likely to repeat rewarded first-stage choices than unrewarded first-stage choices.
We then examined whether the summary pavlovian score interacted with the rewarded or unrewarded regression coefficients to predict first-stage choices in the deterministic MSDM task (Table 4). The summary pavlovian score × rewarded interaction was significant (z = 8.93, p < 0.001, β = 0.51), and post hoc analyses between the low and high ST groups indicated that the rewarded regression coefficient was greater in high ST rats compared with low ST rats (z = 12.89, p = 0.001; Fig. 4G). The summary pavlovian score × unrewarded interaction, however, was not significant (z = 0.27, p = 0.79, β = 0.01; Fig. 4H). To confirm these differences in outcome-specific behaviors, we compared the probability that rats would repeat a first-stage choice following a rewarded (i.e., win-stay) or unrewarded (i.e., lose-stay) outcome between high and low ST rats. The probability of repeating a first-stage choice following a rewarded outcome was greater in high ST rats compared with low ST rats (Wald χ2 = 5.77, p = 0.02). No differences were observed for the probability of repeating a first-stage choice following an unrewarded outcome (Wald χ2 = 1.50, p = 0.22). High ST rats, therefore, used rewarded outcomes to guide their first-stage choices to a greater degree than low ST rats, suggesting that individual differences in pavlovian incentive learning are associated with variation in reward-guided instrumental behavior.
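The win-stay and lose-stay probabilities described above can be computed from a sequence of first-stage choices and trial outcomes. The sketch below assumes one list of choices (coded 0 or 1) and one list of outcomes (1 = rewarded, 0 = unrewarded) per rat; the function name and the example sequence are hypothetical.

```python
def stay_probabilities(choices, rewards):
    """Return (win_stay, lose_stay): P(repeat choice | previous outcome)."""
    # counts[previous trial rewarded?] = [number of stays, number of trials]
    counts = {True: [0, 0], False: [0, 0]}
    for t in range(1, len(choices)):
        prev_rewarded = bool(rewards[t - 1])
        counts[prev_rewarded][0] += int(choices[t] == choices[t - 1])
        counts[prev_rewarded][1] += 1
    # Assumes at least one rewarded and one unrewarded previous trial.
    win_stay = counts[True][0] / counts[True][1]
    lose_stay = counts[False][0] / counts[False][1]
    return win_stay, lose_stay

# Hypothetical session fragment (illustration only, NOT study data).
choices = [0, 0, 1, 1, 0, 0, 0]
rewards = [1, 0, 1, 1, 0, 1, 0]
win_stay, lose_stay = stay_probabilities(choices, rewards)
```

In this fragment the rat repeats its choice after three of four rewarded trials and after one of two unrewarded trials. Group comparisons such as those reported above would then be run on the per-rat probabilities.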
Probabilistic MSDM task and relationship to pavlovian approach behavior
To determine whether the relationship between the summary pavlovian score and reward-guided behavior in the above deterministic version of the MSDM task was associated specifically with model-free or model-based reinforcement learning, the choice behavior of rats was assessed in the probabilistic version of the MSDM task (Fig. 5A). According to model-free theories of reinforcement learning, the probability of repeating a first-stage choice should be influenced only by the previous trial outcome, regardless of whether the state transition was common or rare (Fig. 5B, left). In contrast, model-based theories of reinforcement learning posit that the outcome at the second stage should affect the choice of the first-stage option differently based on the state transition that was experienced (Fig. 5B, right). Evidence in humans and in our previous rodent studies, however, indicates that individuals use a mixture of model-free and model-based strategies in the probabilistic MSDM task. Indeed, the probability that rats in the current study would repeat the same first-stage choice according to outcomes received (rewarded or unrewarded) and the state transitions experienced (common or rare) during the immediately preceding trial indicated that rats were using both model-free and model-based learning to guide their choice behavior (Fig. 5C).
To quantify the influence of model-free and model-based strategies, choice data were analyzed with a logistic regression model (Daw et al., 2011; Akam et al., 2015, 2021; Groman et al., 2019a, b). The main effect of outcome, which provides an index of model-free learning, was significantly greater than zero (z = 22.65, p < 0.001; Fig. 5D, orange bar), indicating that rats were using second-stage outcomes to guide their first-stage choices. The interaction between the previous trial outcome and state transition, which provides an index of model-based learning, was also significantly greater than zero (z = 15.38, p < 0.001; Fig. 5D, purple bar). The combination of a significant main effect for outcome and a significant transition-by-outcome interaction suggests that rats were using both model-free and model-based strategies to guide their choice behavior in the probabilistic MSDM task.
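As a rough illustration of how the outcome main effect and the outcome-by-transition interaction separate the two strategies, the sketch below derives both coefficients from stay counts in the four outcome × transition cells. It assumes that outcome and transition are each coded +0.5/−0.5 and that the interaction predictor is their product, in which case the four-parameter model is saturated and its maximum-likelihood coefficients are simple contrasts of the cell log-odds. The coding, the function names, and the counts are assumptions and do not reproduce the full regression model used in the study.

```python
from math import log

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return log(p / (1.0 - p))

def outcome_and_interaction(stay_counts):
    """stay_counts[(outcome, transition)] = (n_stay, n_trials)."""
    cell = {key: logit(n_stay / n) for key, (n_stay, n) in stay_counts.items()}
    # Effect of reward on staying, computed separately for each transition type.
    effect_common = cell[("rewarded", "common")] - cell[("unrewarded", "common")]
    effect_rare = cell[("rewarded", "rare")] - cell[("unrewarded", "rare")]
    outcome_coef = (effect_common + effect_rare) / 2  # model-free index
    interaction_coef = effect_common - effect_rare    # model-based index
    return outcome_coef, interaction_coef

# Toy counts showing a mixed strategy (illustration only, NOT study data).
stay_counts = {
    ("rewarded", "common"): (80, 100),
    ("rewarded", "rare"): (60, 100),
    ("unrewarded", "common"): (40, 100),
    ("unrewarded", "rare"): (50, 100),
}
mf_coef, mb_coef = outcome_and_interaction(stay_counts)
```

A purely model-free pattern would give a positive outcome coefficient with an interaction near zero, whereas a purely model-based pattern would give a positive interaction with an outcome coefficient near zero. The toy counts yield positive values for both.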
We then examined whether the summary pavlovian score interacted with model-free and/or model-based learning to predict the probability of repeating the same first-stage choice in the probabilistic MSDM task (Table 5). The interaction between the summary pavlovian score and trial outcome significantly predicted choice behavior (z = 3.16, p = 0.002), but the interaction between the summary pavlovian score and the outcome-by-transition predictor did not (z = 1.60, p = 0.11). Post hoc comparisons between low and high ST rats indicated that the outcome regression coefficient, a measure of model-free learning, was significantly greater in high ST rats compared with low ST rats (z = 2.67, p = 0.008, β = 0.09; Fig. 5E), an effect similar to that observed in the deterministic task (Fig. 4F). The outcome-by-transition regression coefficient, a measure of model-based learning, did not differ between the low and high ST rats (Fig. 5F). These differences in the outcome regression coefficient (i.e., model-free learning) and lack of differences in the outcome-by-transition coefficient (i.e., model-based learning), collectively, indicate that high ST rats rely to a greater degree on model-free learning in the MSDM task compared with low ST rats.
The greater model-free learning we observed in high ST rats may be, in part, because high ST rats acquired greater incentive value for the lever used in the pavlovian conditioning task that then biased responding in the MSDM task. We hypothesized that if this were true, then model-free behavior for the lever used in the pavlovian task might be higher than model-free behavior for the lever that was not used in the pavlovian task. To test this hypothesis, the probability that rats would repeat the same first-stage choice based on the second-stage outcomes (rewarded vs unrewarded) and state transition (common vs rare) was calculated for each lever. The difference between the probability of repeating a rewarded first-stage choice and an unrewarded first-stage choice was calculated to obtain an index of model-free learning for each lever. We compared the lever-specific index based on whether the lever in the MSDM task was in the same location as the lever used in the pavlovian task (referred to as “same”) or was in a different location from the lever used in the pavlovian task (referred to as “different”). We found that the model-free index did not differ between the levers (same lever, 0.21 ± 0.05; different lever, 0.28 ± 0.06; Wald χ2 = 0.77, p = 0.38). Notably, the model-free index did not differ between the levers in the high ST rats (same lever, 0.26 ± 0.05; different lever, 0.27 ± 0.07; Wald χ2 = 0.02; p = 0.90), suggesting that prior experience with one of the levers in the pavlovian conditioning task did not bias high ST rats to use a model-free strategy in the MSDM task.
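The lever-specific model-free index described above, the probability of repeating a rewarded first-stage choice minus the probability of repeating an unrewarded one, can be sketched as follows. The trial representation (previous lever label, whether that trial was rewarded, whether the rat stayed), the function name, and the example trials are assumptions for illustration; for brevity the sketch also collapses across common and rare transitions.

```python
def model_free_index(trials):
    """Return {lever: P(stay | rewarded) - P(stay | unrewarded)}.

    Each trial is a tuple (prev_lever, prev_rewarded, stayed).
    """
    # counts[lever][prev_rewarded] = [number of stays, number of trials]
    counts = {}
    for lever, rewarded, stayed in trials:
        cell = counts.setdefault(lever, {True: [0, 0], False: [0, 0]})
        cell[rewarded][0] += int(stayed)
        cell[rewarded][1] += 1
    index = {}
    for lever, cell in counts.items():
        # Assumes each lever has both rewarded and unrewarded trials.
        p_win_stay = cell[True][0] / cell[True][1]
        p_lose_stay = cell[False][0] / cell[False][1]
        index[lever] = p_win_stay - p_lose_stay
    return index

# Hypothetical trials for one rat (illustration only, NOT study data).
trials = [
    ("same", True, True), ("same", True, True),
    ("same", True, False), ("same", True, True),
    ("same", False, True), ("same", False, False),
    ("different", True, True), ("different", True, True),
    ("different", False, True), ("different", False, False),
    ("different", False, False), ("different", False, False),
]
mf_index = model_free_index(trials)
```

Comparing `mf_index["same"]` with `mf_index["different"]` across rats corresponds to the same-versus-different lever comparison reported above.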
To determine whether the summary pavlovian score was associated with rewarded or unrewarded outcomes, choice behavior in the probabilistic MSDM task was analyzed with an alternative logistic regression model (Table 6). Similar to what we had observed in the deterministic MSDM task, the interaction between the summary pavlovian score and rewarded predictor was significant (z = 4.31, p < 0.001, β = 0.23): high ST rats were more likely to repeat a first-stage choice that led to a rewarded second-stage choice compared with low ST rats (Fig. 5G). We also observed a significant interaction between the summary pavlovian score and the unrewarded predictor (z = −2.69, p = 0.007, β = −0.09), but the unrewarded regression coefficient was not statistically different between low and high ST rats (Fig. 5H). Moreover, the probability that rats would repeat a first-stage choice following a rewarded, but not unrewarded, outcome was greater in high ST rats compared with low ST rats (rewarded, Wald χ2 = 4.39, p = 0.04; unrewarded, Wald χ2 = 0.30, p = 0.58). These results, collectively, indicate that individual differences in pavlovian approach behavior are associated with variation in reward-mediated, model-free learning.
Pavlovian ST behavior is associated with reward-based, model-free updating
We found that pavlovian conditioned approach behaviors were associated with reward-mediated, model-free learning in both the deterministic and probabilistic MSDM tasks. This suggests that the model-free computations that guide pavlovian approach behaviors (e.g., FMF learning) may be related to the model-free computations that influence operant choice behavior in the MSDM task. To test this directly, we compared the regression coefficients obtained from the MSDM task in rats that either had a small ω (e.g., more model-free updating in the pavlovian conditioned approach task) or large ω (e.g., more model-based updating in the pavlovian conditioned approach task) parameter estimate (Fig. 6). We hypothesized that if the pavlovian FMF mechanisms were related to the operant-based model-free learning, then the outcome regression coefficient from the MSDM task would differ in rats with a smaller ω parameter estimate (e.g., greater FMF updating) compared with rats with a large ω parameter estimate (e.g., greater MB updating). As predicted, the outcome regression coefficient (e.g., model-free learning) was larger in rats with a smaller ω parameter compared with rats with a large ω parameter (Wald χ2 = 6.22, p = 0.01; Fig. 6A). These differences were specific to model-free learning, as the outcome-by-transition regression coefficient—a measure of model-based learning—did not differ as a function of the ω parameter (Wald χ2 = 1.21, p = 0.27; Fig. 6B). Furthermore, when we compared the rewarded and unrewarded regression coefficients between rats with either a high or low ω parameter, only the rewarded regression coefficient differed between the groups (rewarded, Wald χ2 = 6.51, p = 0.01; Fig. 6C; unrewarded, Wald χ2 = 1.42, p = 0.23; Fig. 6D). These data suggest that the model-free reinforcement-learning systems recruited during pavlovian conditioning parallel those recruited in the instrumental MSDM task.
Discussion
The current study provides new evidence that the model-free mechanisms that are used during the pavlovian conditioned approach task are related to the model-free mechanisms that guide instrumental decision-making behaviors. We report that a greater prevalence of sign-tracking behaviors in the pavlovian approach task is associated with greater model-free, but not model-based, learning in the MSDM task. Differences in model-free updating observed in high and low ST rats were associated specifically with reward-guided behaviors; rats with higher sign-tracking behaviors were more likely to repeat a rewarded choice than rats with lower sign-tracking behaviors. No differences in choice behavior following an unrewarded outcome were observed between low and high ST rats. Our data, collectively, provide direct evidence indicating that individual differences in sign-tracking behaviors are associated with reward-based, model-free computations. These results suggest that the model-free mechanisms mediating pavlovian approach behaviors might be controlled by the same model-free computations that guide instrumental behaviors and use conserved learning systems that are known to be altered in psychiatric disorders.
Individual differences in model-free computations are conserved across instrumental and pavlovian tasks
Rats with higher sign-tracking behaviors in the pavlovian approach task were found to have greater model-free reinforcement learning in both the deterministic and probabilistic MSDM tasks. These data suggest that the mechanisms that assign and update incentive value to cues predictive of rewards might be the same as those that update representations following rewarded actions. We propose, therefore, that pavlovian and instrumental behaviors are controlled by overlapping model-free, reinforcement-learning mechanisms. Alternatively, the related model-free measures that we quantified in the pavlovian and MSDM tasks may be driven by unique model-free mechanisms that rely on the same behavioral output. There is evidence that the neural mechanisms governing pavlovian and instrumental learning differ from each other (Bouton et al., 2021), but how these neural systems are involved in model-free computations that govern both pavlovian and instrumental learning is not fully understood. Future studies comparing how reward-mediated, model-free computations are encoded within these discrete circuits across pavlovian and instrumental environments could provide mechanistic insights into the behavioral correlations observed here.
The logistic regression analyses of choice behavior in the MSDM task indicated that rats with higher sign-tracking behaviors were more likely to repeat rewarded actions compared with rats with lower sign-tracking behaviors. This suggests that the degree of action value updating following rewards was greater in rats with higher sign-tracking behaviors and may explain why rats with greater sign-tracking behaviors are more resistant to outcome devaluation and slower to extinguish reward-predictive cues compared to GT, or lower ST, rats (Morrison et al., 2015; Nasser et al., 2015; Ahrens et al., 2016; Smedley and Smith, 2018; Fitzpatrick et al., 2019; Amaya et al., 2020; Keefer et al., 2020). For example, cached representations of cues predictive of rewards may be exaggerated in individuals with greater sign-tracking behaviors and, consequently, lead to slower adjustments in behavior when the value of the outcome changes. This is not a general impairment in extinction learning as rates of extinction of operant responses are similar between ST and GT rats (Ahrens et al., 2016; Fitzpatrick et al., 2019). Rather, previous work has proposed that strong attribution of incentive salience to reward-predictive cues may bias attention and lead to inflexible patterns of responding (Nasser et al., 2015; Ahrens et al., 2016; Keefer et al., 2020). Indeed, this may explain why sign-tracking behaviors in rats are associated with suboptimal choice behavior in a gambling task (Swintosky et al., 2021).
We did not, however, observe a relationship between pavlovian approach behaviors and model-based updating in the MSDM task. This was surprising given our previous theoretical work and the experimental work of others (Lesaint et al., 2014; Cinotti et al., 2019). The lack of association between the pavlovian summary score and model-based learning in the MSDM task is likely because we only observed a limited number of GT rats in the current sample. Specifically, only three rats in the current cohort of 20 would have been classified as GT rats (Fig. 2). This was not because the distribution of pavlovian approach behaviors in the current study was abnormal; previous studies using larger sample sizes than the current study (e.g., N = 560 vs N = 20) have observed similarly skewed distributions (Fitzpatrick et al., 2013) in food-restricted rats (Fraser and Janak, 2017). It is possible that our food restriction procedure biased rats toward a more model-free strategy in both pavlovian and instrumental environments. Future studies that use large sample sizes and manipulate hunger states to obtain behavioral measures that span the distribution of pavlovian approach behaviors may, therefore, find a relationship between goal-directed behaviors and model-based learning.
Prior experience with a particular lever in the pavlovian conditioning task did not appear to bias the behavior of rats in the MSDM task. It is possible, however, that the use of levers in both the pavlovian and operant environments had a more general influence on behavior in the MSDM task, and this influence was greater in high ST rats that attributed greater incentive salience to the lever. Although the testing environments and outcomes (e.g., sucrose pellet vs sweetened condensed milk solution) used for the pavlovian and MSDM tasks were different from one another, randomizing the order in which animals proceeded through each of the tasks would have reduced any potential order effects that may be confounding our results. We did consider implementing a crossover design to reduce any potential order effects but believed that extensive training in the MSDM task first, compared with the limited exposure in the pavlovian conditioning task, was more likely to have an impact on behavior in the pavlovian task. A more optimal design would have used different manipulanda in the pavlovian and instrumental tasks. Nevertheless, this is a limitation of the current study design, which we will address in future experiments.
The current study was only conducted in male rats, which limits our understanding of how these pavlovian and instrumental reward-based, model-free systems interact in females. Previous studies have not reported robust differences in the prevalence of sign-tracking and/or goal-tracking behaviors between male and female rats (Pitchers et al., 2015) or model-free and model-based learning in male and female humans (Gillan et al., 2015). We would not anticipate observing different results in female rats from those reported here in male rats. Nevertheless, it is possible that the model-free mechanisms mediating pavlovian approach behaviors in females are not the same model-free computations that guide instrumental behavior. This might explain the divergent learning strategies that have been observed between male and female mice (Chen et al., 2020).
Neurobiological mechanisms
Although the neurobiological mechanisms underlying pavlovian and instrumental learning are not fully understood, dopamine neurotransmission is likely to be a point of convergence between sign-tracking behaviors and reward-guided, model-free updating. Midbrain dopamine neurons are known to encode reward-prediction errors (RPEs), which are a fundamental computation in model-free learning (Hollerman and Schultz, 1998). The results of studies using voltammetry to quantify changes in dopamine concentration in the nucleus accumbens, a main output of midbrain dopamine neurons, have proposed that phasic dopamine signals in ST rats are how incentive salience is transferred from the outcome to cue(s) predictive of reward (e.g., lever extension; Flagel et al., 2011). These dopaminergic RPEs were not observed in goal-tracking rats, suggesting that variation in attribution of incentive salience may reflect underlying differences in dopaminergic RPEs (Derman et al., 2018; Lee et al., 2018). Indeed, antagonism of dopamine signaling in the nucleus accumbens attenuates the expression of sign-tracking behaviors (Saunders and Robinson, 2012).
Dopamine, however, has also been implicated in model-based reinforcement learning. Individual differences in [18F]DOPA accumulation and dopamine tone in the nucleus accumbens of humans and rats, respectively, are associated with variation in model-based learning in the MSDM task (Deserno et al., 2015; Groman et al., 2019a). Dopamine may play a role in both reinforcement-learning systems. Indeed, previous studies have reported that both model-free and model-based calculations are encoded in the activity of midbrain dopamine neurons (Sadacca et al., 2017; Sharpe et al., 2017; Keiflin et al., 2019), but the influence of these dopaminergic neurons over behavior, and likely learning systems, is mediated by functionally heterogeneous circuits (Keiflin and Janak, 2015; Saunders et al., 2018). For example, mesocortical dopaminergic projections may encode model-based computations, whereas mesostriatal/mesopallidal dopaminergic projections may encode model-free computations (Chang et al., 2015). Studies that integrate circuit-based imaging approaches with biosensor technology (e.g., dLight) to measure circuit-specific dopamine transients in behaving animals could help resolve these critical questions regarding the functional role of dopamine circuits in these learning mechanisms (Kuhn et al., 2018).
Implications for addiction
Differences in the degree to which individuals attribute incentive salience to cues predictive of reward have been hypothesized to confer vulnerability to addiction. Indeed, there is evidence that ST rats will work hard to obtain cocaine (Saunders and Robinson, 2011), show greater cue-induced reinstatement (Saunders and Robinson, 2010; Saunders et al., 2013; Everett et al., 2020), are resistant to punished drug use (Saunders et al., 2013; Pohořalá et al., 2021), have a greater propensity for psychomotor sensitization (Flagel et al., 2008), and also display a higher preference for cocaine over food (Tunstall and Kearns, 2015) compared with GT rats. Drug self-administration in short access sessions, however, does not differ between ST and GT rats (Saunders and Robinson, 2011; Pohořalá et al., 2021). These data suggest that drug reinforcement may be similar between ST and GT rats but that ST rats may be more susceptible or prone to developing compulsive-like behaviors following initiation of drug use.
Only a few studies have used the MSDM task to examine the role of model-free and model-based learning in addiction susceptibility. In a previous study we reported that individual differences in model-free learning in the MSDM task were predictive of methamphetamine self-administration in long-access sessions (Groman et al., 2019b). This relationship, however, was negative; rats with lower model-free learning in the MSDM task took more methamphetamine than rats with higher model-free learning. Although additional addiction-relevant behaviors were not assessed in this previous study (e.g., progressive ratio, extinction, or reinstatement), the negative relationship between model-free learning and methamphetamine self-administration is surprising given the positive relationship between model-free learning and sign-tracking behaviors we observed here. These data might suggest a dynamic role of model-free learning in the different stages of addiction susceptibility (Kawa et al., 2016). For example, greater model-free learning before drug use may protect against drug intake but render individuals more vulnerable to the detrimental effects of the drug when ingested. Indeed, ST rats are less sensitive to the acute locomotor effects of cocaine but have a greater propensity for psychomotor sensitization (Flagel et al., 2008). Future studies that assess pavlovian conditioned approach behaviors and instrumental reinforcement-learning mechanisms in the same individual before evaluating drug-taking and drug-seeking behaviors may provide a greater understanding of the biobehavioral mechanisms underlying addiction susceptibility.
Summary
The present article provides direct evidence linking incentive salience processes with reward-guided, instrumental behaviors in adult male rats. Our data suggest that pavlovian approach behaviors and choice behavior of rats in a multistage decision-making task are driven by conserved model-free reinforcement-learning mechanisms that are known to be altered in individuals with mental illness, such as addiction (Groman et al., 2022). Future studies integrating systems-level approaches with the sophisticated behavioral and computational approaches used here will provide new insights into the biobehavioral mechanisms that are altered in individuals with mental illness.
Footnotes
This work was supported by National Institutes of Health–National Institute on Drug Abuse Grants DA041480 (J.R.T.), DA043443 (J.R.T.), DA051598 (S.M.G.), and DA043533 (D.J.C.); McKnight Foundation Memory and Cognitive Disorders Award (D.J.C.); and the State of Connecticut Department of Mental Health and Addiction Services through its support of the Ribicoff Laboratories. We thank Matthew Roesch for leading discussions that made this collaborative work a possibility.
The authors declare no competing financial interests.
Correspondence should be addressed to Stephanie M. Groman at sgroman@umn.edu