Abstract
Dopamine (DA) and norepinephrine (NE) have been repeatedly implicated in neuropsychiatric vulnerability, in part via their roles in mediating decision-making processes. Although the two neuromodulators share a synthesis pathway and are coactivated under states of arousal, they engage distinct circuits and play distinct modulatory roles. However, the specific role of each neuromodulator in decision-making, in particular the exploration–exploitation tradeoff, remains unclear. Revealing how each neuromodulator contributes to the exploration–exploitation tradeoff is important in guiding mechanistic hypotheses emerging from computational psychiatric approaches. To understand the differences and overlaps in the roles of these two catecholamine systems in regulating exploration, a direct comparison using the same dynamic decision-making task is needed. Here, we ran male and female mice in a restless two-armed bandit task, which encourages both exploration and exploitation. We systemically administered a nonselective DA antagonist (flupenthixol), a nonselective DA agonist (apomorphine), a NE beta-receptor antagonist (propranolol), and a NE beta-receptor agonist (isoproterenol) and examined changes in exploration within subjects across sessions. We found a bidirectional modulatory effect of dopamine on exploration: increasing dopamine activity decreased exploration, and decreasing dopamine activity increased exploration. The modulatory effect of beta-noradrenergic receptor activity on exploration was mediated by sex. Reinforcement learning model parameters suggested that dopamine modulation affected exploration via decision noise, whereas norepinephrine modulation affected exploration via sensitivity to outcome. Together, these findings suggest that the mechanisms that govern the exploration–exploitation transition are sensitive to changes in both catecholamine systems and reveal differential roles for NE and DA in mediating exploration.
- catecholamine
- decision-making
- dopamine
- exploration–exploitation tradeoff
- norepinephrine
- reinforcement learning
Significance Statement
Both dopamine (DA) and norepinephrine (NE) have been implicated in the decision-making process. Although these two catecholamines share aspects of their biosynthetic pathways and projection targets, they are thought to exert many core functions via distinct neural targets and receptor subtypes. However, the computational neuroscience literature often ascribes similar roles to these catecholamines, despite the above evidence. Resolving this discrepancy is important in guiding mechanistic hypotheses emerging from computational psychiatric approaches. This study examines the roles of dopamine and norepinephrine in the explore–exploit tradeoff. By testing mice, we were able to compare multiple pharmacological agents within subjects and examine sources of individual differences, allowing direct comparison between the effects of these two catecholamines in modulating decision-making.
Introduction
Dysfunctions in cognitive processes, particularly within the domain of executive function, are implicated in numerous neuropsychiatric disorders (Elliott, 2003; Malloy-Diniz et al., 2017; Grissom and Reyes, 2019). One essential aspect of executive function is value-based decision-making. Decision-making under uncertainty has been a focus in computational neuroscience, which has developed approaches for discovering latent structures in decision-making strategies (Rothenhoefer et al., 2017; Ebitz et al., 2018; Chen et al., 2021a,b). Many computational models for value-based decision-making distinguish two essential latent processes: exploration and exploitation (Sutton and Barto, 1998; Niv et al., 2002; Daw et al., 2006; Jepma et al., 2020; R. C. Wilson et al., 2021). Exploration is an information-seeking process that helps discover the best action to take in an uncertain environment (Gershman, 2019; R. C. Wilson et al., 2021; Chen et al., 2021b). Once a favorable action is discovered, exploitation of the rewarding action is necessary to sustain reward. However, exploitation must be balanced with continued exploration as reward contingencies change. Dysregulation of the exploration–exploitation tradeoff is observed in the phenotypes of numerous neuropsychiatric disorders and challenges, including schizophrenia, autism spectrum disorders, addictions, and chronic stress (H. Kim et al., 2007; Mussey et al., 2015; Morris et al., 2016; Addicott et al., 2020; Kaske et al., 2022). Understanding the fundamental mechanisms impacting the computations in our brain that underlie the exploration–exploitation tradeoff could help identify critical circuits associated with differential risk factors for neuropsychiatric disorders and open avenues for novel interventions for executive function challenges.
Many neuropsychiatric challenges are associated with dysfunction in the catecholamines dopamine and norepinephrine (Kobayashi, 2001; Ressler and Nemeroff, 2001; Aston-Jones and Cohen, 2005; Bouret and Sara, 2005; Money and Stanwood, 2013; Addicott et al., 2020; Williams et al., 2021). These neuromodulators are well positioned to carry information about the state of the environment and influence behavioral outputs via downstream control of action selection (Cohen et al., 2007; R. C. Wilson et al., 2021). Indeed, dopamine and norepinephrine have each separately been implicated in mediating the exploration–exploitation tradeoff (Seamans and Yang, 2004; Aston-Jones and Cohen, 2005; Bouret and Sara, 2005; Cohen et al., 2007; Redish et al., 2007; R. C. Wilson et al., 2021; Cremer et al., 2022). While dopamine and norepinephrine are coactivated under states of arousal, they act through partially separable circuits and have distinct pharmacological profiles (Ranjbar-Slamloo and Fazlali, 2019). It is notable, therefore, that dopamine and norepinephrine have largely been ascribed similar roles in mediating the explore–exploit tradeoff in the computational neuroscience literature. Dopamine has often been ascribed a role in mediating reward prediction errors and value-based learning (Schultz, 1998, 2013; Daw et al., 2006; Niv, 2009), while norepinephrine has been ascribed a role in modulating arousal, attention, and value assessment (Aston-Jones and Cohen, 2005; Yu and Dayan, 2005; Amemiya et al., 2016); more recently, both have been implicated in action selection (Humphries et al., 2012; Warren et al., 2017). Previously we showed that multiple latent cognitive processes can describe changes in exploration (Chen et al., 2021b). It is unknown whether dopamine and norepinephrine govern this exploration–exploitation tradeoff via distinct or similar latent cognitive processes because they have been largely studied in isolation (Cremer et al., 2022).
To uncover the roles of dopamine and norepinephrine in the exploration–exploitation tradeoff, we conducted within-subjects pharmacological manipulations of these two neuromodulators and compared the modulatory effects on exploration and on the latent cognitive parameters that influence exploration, in the same dynamic decision-making task (Chen et al., 2021b). We found a bidirectional modulatory effect of dopamine: increasing dopamine activity decreased exploration, and decreasing dopamine activity increased exploration. Beta-noradrenergic receptor activity also modulated exploration, but the modulatory effect was mediated by sex. Fitting a reinforcement learning (RL) model revealed that dopamine mediates exploration by changing the precision of value-based choice selection. In contrast, noradrenergic activity surprisingly influenced exploration by changing outcome sensitivity in a sex-dependent manner. These data suggest differential roles of dopamine and norepinephrine in mediating exploration and complex roles for both catecholamines in signaling reward.
Materials and Methods
Animals
Thirty-two B6129SF1/J mice (16 males and 16 females) were obtained from Jackson Laboratories (stock #101043). Mice arrived at the lab at 7 weeks of age and were adapted to a reversed light cycle (0900–2100 h) maintained throughout the testing phase. Mice were housed in groups of four with ad libitum access to water while being mildly food restricted (85–95% of free-feeding weight) before the start of the experiment (at 12 weeks) and during the experiment. All animals were cared for according to the guidelines of the National Institutes of Health, and experimental protocols were approved by the Institutional Animal Care and Use Committee (IACUC) of the University of Minnesota.
Apparatus
Sixteen identical triangular touchscreen operant chambers (Lafayette Instrument) were used for training and testing. Two walls were black acrylic plastic. The third wall housed the touchscreen and was positioned directly opposite the magazine. The magazine provided liquid reinforcers (50% Ensure) delivered by a peristaltic pump, typically 7 µl (280 ms pump duration). ABET-II software (Lafayette Instrument) was used to program operant schedules and to analyze all data from training and testing.
Behavioral task
Two-armed spatial restless bandit task
Animals were trained to perform a two-armed spatial restless bandit task in the touchscreen operant chamber. On each trial, animals were presented with two identical squares on the left and right sides of the screen, and a nose poke to one of the target locations on the touchscreen was required to register a response. Each location was associated with some probability of reward, which changed independently over time: on every trial, there was a 10% chance that the reward probability of a given arm would increase or decrease by 10%. All walks were generated randomly, subject to a few criteria: (1) the overall reward probabilities of the two arms were within 20% of each other, preventing one arm from being consistently better than the other; (2) the reward probability could not fall to 0% or rise to 100%; and (3) to maintain motivation, there were no 30 consecutive trials in which the reward probabilities of both arms were below 20%.
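For illustration, reward-probability walks satisfying these criteria could be generated by rejection sampling along the following lines (a sketch under the stated constraints; the function names, the 10–90% bounds used to enforce criterion 2, and the use of mean reward probability for criterion 1 are our assumptions rather than the published code):

```python
import numpy as np

def generate_restless_walk(n_trials=300, p_step=0.1, step=0.1,
                           lo=0.1, hi=0.9, seed=None):
    """Reward probabilities for one arm of the restless bandit.

    Each trial there is a p_step chance the reward probability moves up
    or down by `step`; clipping to [lo, hi] keeps it away from 0%/100%
    (criterion 2, assuming 10%/90% bounds).
    """
    rng = np.random.default_rng(seed)
    p = np.empty(n_trials)
    p[0] = rng.uniform(lo, hi)
    for t in range(1, n_trials):
        if rng.random() < p_step:
            p[t] = np.clip(p[t - 1] + rng.choice([-step, step]), lo, hi)
        else:
            p[t] = p[t - 1]
    return p

def generate_session(n_trials=300, seed=None):
    """Rejection-sample a pair of walks satisfying the session criteria."""
    rng = np.random.default_rng(seed)
    while True:
        left = generate_restless_walk(n_trials, seed=int(rng.integers(1 << 31)))
        right = generate_restless_walk(n_trials, seed=int(rng.integers(1 << 31)))
        # Criterion 1: overall reward probabilities within 20% of each other.
        if abs(left.mean() - right.mean()) > 0.2:
            continue
        # Criterion 3: no 30-trial stretch with both arms below 20%.
        both_low = (left < 0.2) & (right < 0.2)
        run = np.convolve(both_low.astype(int), np.ones(30, int), "valid")
        if run.max() >= 30:
            continue
        return left, right
```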
Animals ran a simple deterministic schedule on Monday to readapt to the operant chamber after weekends off and ran a different restless bandit walk each day from Tuesday to Friday. Animals ran for two rounds of 4 consecutive days, and within each day, animals either completed 300 trials or spent a maximum of 2 h in the operant chamber. On average across all sessions, animals performed 276.5 trials with a standard deviation of 8.6 trials (male average, 253.7 trials, SD = 15.4; female average, 299.3 trials, SD = 0.74). Data were recorded by the ABET II system and exported for further analysis. All computational modeling was conducted using Python.
Drug administration
To assess the effect of norepinephrine and dopamine receptor activity on exploratory behavior, we used the following four drugs to increase or decrease beta-noradrenergic or dopaminergic receptor activity systemically: a beta-noradrenergic receptor agonist, isoproterenol (0.3 mg/kg); a beta-noradrenergic receptor antagonist, propranolol (5 mg/kg); a nonselective dopamine receptor agonist, apomorphine (0.1 mg/kg); and a nonselective dopamine receptor antagonist, flupenthixol (0.03 mg/kg).
All drugs were fully dissolved in 0.9% saline and protected from light. Animals received an intraperitoneal (i.p.) injection of drug or saline (control) at an injection volume of 5 ml/kg immediately before the experiment, with the exception of apomorphine, which was administered 30 min before the experiment due to its immediate influence on locomotor behavior. Drug and saline were administered in alternating sessions using a within-subject design, so that every animal received each of the four drugs. Beta-noradrenergic drug testing occurred first, followed by dopaminergic manipulations. All animals received interleaved saline and propranolol administration for 12 sessions (six sessions per condition) and then received interleaved saline and isoproterenol administration for 12 sessions (six sessions per condition). After this, animals underwent a 3-week washout and then began testing with dopamine receptor drugs. The dopamine receptor agonist and antagonist were administered counterbalanced across animals: half of the animals (n = 16) were randomly selected to receive apomorphine first (six vehicle sessions and six drug sessions, interleaved), and the other half received flupenthixol first (six vehicle sessions and six drug sessions, interleaved). The washout period between drug conditions within a receptor type (beta-adrenergic antagonist vs agonist; dopamine receptor antagonist vs agonist) was 3 d. By interleaving control and drug sessions, we effectively collected a control session between successive drug sessions, allowing us to account for potential effects of repeated drug administration and long-term changes. Each drug session was compared with its own control session, the session immediately prior. Thus, potential changes from Control day 1 to Control day 6 control for potential effects of repeated drug administration from Drug day 1 to Drug day 6.
Dosages were chosen based on previous studies of drug effects on cognitive functions (Fernandez-Tome et al., 1979; Cabib et al., 1984; Goldschmidt et al., 1984; Ichihara et al., 1988; Nakamura et al., 1998; Grigoryan, 2012; Cinotti et al., 2019), taking the lowest doses necessary to produce alterations in cognitive functions, including decision-making, learning, and exploration, with minimal influence on locomotion.
Data analysis
General analysis techniques
Data were analyzed with custom Python scripts and Prism 8. Generalized linear mixed models (GLMMs), ANOVAs, and t tests were used to determine the fixed effects of drug, sex, and their interaction, accounting for random effects of animal identity and session, unless otherwise specified. p values were compared against the standard α = 0.05 threshold. The sample size was n = 32 (16 males and 16 females) for all statistical tests. No animal was excluded from the experiment. All statistical tests used and statistical details are reported in Results. All figures depict mean ± SEM.
Generalized linear mixed models
To determine whether drug and sex predicted reward acquisition performance, response time, probability of win-stay, probability of lose-shift, outcome sensitivity, and probability of exploration, we fitted a series of GLMMs to examine the fixed effects of drug, sex, and their interaction, with animal identity and session number as random effects.
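For illustration, one way such a model could be specified in Python using statsmodels (a linear mixed-model sketch; the column names are placeholders, and the GLMMs used here may differ in link function and random-effects structure):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per session: an outcome measure plus predictors. Column names
# ('p_explore', 'drug', 'sex', 'animal', 'session') are illustrative.
df = pd.read_csv("session_data.csv")

# Fixed effects of drug, sex, and their interaction; random intercept
# for animal identity, with session as an additional variance component.
model = smf.mixedlm(
    "p_explore ~ drug * sex",
    data=df,
    groups="animal",
    re_formula="1",
    vc_formula={"session": "0 + C(session)"},
)
result = model.fit()
print(result.summary())
```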
Outcome sensitivity
To examine how much of the switching behavior was sensitive to reward outcome, we calculated the difference between the probability of switching given no reward and the probability of switching given reward, normalized by the total amount of switching. If an animal was not sensitive to outcome at all, we would expect the outcome sensitivity metric to be close to zero, as switching would be equally likely after reward and after no reward.
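One plausible formalization of this metric, consistent with the description above (the exact normalization in Eq. 2 may differ), is

$$\text{outcome sensitivity} = \frac{P(\text{switch} \mid \text{no reward}) - P(\text{switch} \mid \text{reward})}{P(\text{switch})},$$

which is zero when switching is equally likely after rewarded and unrewarded trials.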
Hidden Markov model
To identify when animals were exploring, we fit a hidden Markov model (HMM) to the animals’ choice sequences. Our model consisted of two types of hidden states—explore and exploit—which are defined by the probability of making each choice (k, out of N possible options) and the probability of transitioning from one state to another. Fitting the HMM to animals’ choices estimates a transition matrix, which maps every state at one time point to every state at the next time point and thus describes how the distribution over states evolves over time. The HMM modeled exploration as a uniform distribution over choices; i.e., the emissions model for the explore state was uniform across options. This is the maximum entropy distribution for a categorical variable, which makes the fewest assumptions about the true distribution of choices during exploration and therefore does not bias the model toward or away from any particular type of high-entropy choice period. In the exploitation state, subjects repeatedly sample the same choice; therefore, each exploit state permits only one choice; i.e., the exploit-left state emits only left choices and the exploit-right state emits only right choices.
The latent states in this model are Markovian: they depend only on the most recent state, not on the full history of states. The estimated transition matrix is a system of stochastic equations describing the one-time-step probability of transitioning between every combination of states. In our model, there were three possible states (two exploit states and one explore state). The parameters were tied across exploit states such that each exploit state had the same probability of beginning (from exploring) and of sustaining itself. Transitions out of exploration, into exploitative states, were also tied. The model also assumed that the mice had to pass through exploration in order to start exploiting a new option, even if only for a single trial. By fixing the emissions model, constraining the structure of the transition matrix, and tying the parameters, the final HMM had only two free parameters: one corresponding to the probability of exploring, given exploration on the last trial, and one corresponding to the probability of exploiting, given exploitation on the last trial.
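As a concrete illustration of this constrained structure, a minimal sketch follows (the function and parameter names are hypothetical; states are ordered [explore, exploit-left, exploit-right]):

```python
import numpy as np

def build_hmm(p_stay_explore, p_stay_exploit):
    """Construct the constrained 3-state HMM described above.

    States: 0 = explore, 1 = exploit-left, 2 = exploit-right.
    Only two free parameters: P(explore -> explore) and
    P(exploit -> same exploit). Exploit states cannot transition
    directly into each other (mice must pass through exploration).
    """
    p_enter = (1 - p_stay_explore) / 2  # tied across both exploit states
    A = np.array([
        [p_stay_explore,     p_enter,        p_enter],
        [1 - p_stay_exploit, p_stay_exploit, 0.0],
        [1 - p_stay_exploit, 0.0,            p_stay_exploit],
    ])
    # Emission model is fixed: uniform over the two choices when exploring,
    # deterministic in each exploit state. Columns: [left, right].
    B = np.array([
        [0.5, 0.5],
        [1.0, 0.0],
        [0.0, 1.0],
    ])
    pi = np.array([1.0, 0.0, 0.0])  # sessions start in the explore state
    return A, B, pi
```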
The model was fit via expectation-maximization using the Baum–Welch algorithm (Bilmes, 1998). This algorithm finds (possibly local) maxima of the complete-data likelihood. A complete set of parameters θ includes the emission and transition models, discussed already, but also the initial distribution over states. Because the mice had no knowledge of the environment at the first trial of the session, the initial distribution started with p (explore state) = 1. The algorithm was reinitialized with random seeds 10 times to avoid local optima, and the model that maximized the observed (incomplete) data log likelihood across all the sessions for each animal was ultimately taken as the best. To decode latent states from choices, we used the Viterbi algorithm to discover the most probable a posteriori sequence of latent states.
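For concreteness, a compact log-space sketch of Viterbi decoding under this model (assuming the A, B, and pi arrays from the sketch above; an illustration, not the implementation used here):

```python
import numpy as np

def viterbi(choices, A, B, pi):
    """Most probable a posteriori state sequence given a choice sequence.

    choices: array of 0 (left) / 1 (right); A, B, pi as built above.
    Returns an array of state labels (0 = explore, 1/2 = exploit).
    """
    eps = 1e-12  # guard against log(0) for forbidden transitions/emissions
    logA, logB = np.log(A + eps), np.log(B + eps)
    T, S = len(choices), len(pi)
    delta = np.zeros((T, S))          # best log probability ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    delta[0] = np.log(pi + eps) + logB[:, choices[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA  # rows: from-state, cols: to-state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, choices[t]]
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):     # trace the best path backward
        states[t] = back[t + 1, states[t + 1]]
    return states
```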
Reinforcement learning models
We fitted six RL models that could potentially characterize animals’ choice behaviors, with the structure of each RL model detailed below. AIC weights were calculated from AIC values across all treatment groups and compared across models to determine the best model with the highest relative likelihood.
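AIC weights follow a standard closed form: the weight of model $i$ is $w_i = \exp(-\Delta_i/2) / \sum_j \exp(-\Delta_j/2)$, where $\Delta_i$ is the difference between model $i$'s AIC and the minimum AIC. A minimal sketch:

```python
import numpy as np

def aic_weights(aic_values):
    """Convert AIC scores (lower is better) into normalized AIC weights,
    interpretable as the relative likelihood of each candidate model."""
    aic = np.asarray(aic_values, dtype=float)
    delta = aic - aic.min()       # difference from the best-fitting model
    rel = np.exp(-0.5 * delta)    # relative likelihoods
    return rel / rel.sum()

# e.g., weights = aic_weights([aic_random, aic_wsls, aic_rl,
#                              aic_rlck, aic_rlg, aic_rlckg])
```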
The first model (random) assumes that animals choose between two arms randomly with some overall bias for one side over the other. This choice bias for choosing the left side over the right side is captured with a parameter b. The probability of choosing the left side on trial t is:
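In its standard form (cf. R. Wilson and Collins, 2019),

$$p_t(\text{left}) = b, \qquad p_t(\text{right}) = 1 - b.$$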
The second model is a noisy win-stay lose-shift (WSLS) model that adapts choices with regard to outcomes. This model assumes a win-stay lose-shift policy, that is, to repeat a rewarded choice and to switch to the other choice if not rewarded. Furthermore, this model includes a parameter ɛ that captures the level of randomness, allowing stochastic application of the win-stay lose-shift policy. The probability of choosing arm k on trial t is:
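In the standard noisy WSLS form, the policy is applied with probability $1 - \varepsilon$ and a random choice is made otherwise, so that

$$p_t(k) = \begin{cases} 1 - \varepsilon/2, & \text{if } k \text{ is the WSLS-consistent option} \\ \varepsilon/2, & \text{otherwise,} \end{cases}$$

where the WSLS-consistent option is the previously chosen arm after a reward and the alternative arm after no reward.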
The third model (RL) is a basic delta-rule RL model. This two-parameter model assumes that animals learn by continually updating Q values, which are values defined for each option (left and right side). These Q values, in turn, dictate what choice to make next. For example, in a multiarmed bandit task, $Q_t^k$ is the value estimate of how good arm k is at trial t and is updated based on the reward outcome of each trial:
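In the standard delta-rule form,

$$Q_{t+1}^{k} = Q_t^{k} + \alpha\,(r_t - Q_t^{k}),$$

where $r_t$ is the reward outcome on trial t (1 if rewarded, 0 otherwise) and the update applies to the chosen arm.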
In each trial, $r_t - Q_t^k$ captures the reward prediction error (RPE), which is the difference between the expected outcome and the actual outcome. The parameter α is the learning rate, which determines how strongly the RPE updates the value estimate. With Q values defined for each arm, choice selection on each trial was performed based on a Softmax probability distribution:
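In its standard form,

$$p_t(k) = \frac{\exp\!\left(\beta Q_t^{k}\right)}{\sum_{i} \exp\!\left(\beta Q_t^{i}\right)},$$

where the inverse temperature β determines decision noise: lower β yields noisier, more exploratory choices.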
The fourth model (RLCK) incorporates a choice-kernel updating rule in addition to the value-updating rule in Model 3. The model assumes that a choice kernel, which captures the outcome-independent tendency to repeat a previous choice, also influences decision-making. The choice kernel updating rule is similar to the value-updating rule:
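In its standard form,

$$CK_{t+1}^{k} = CK_t^{k} + \alpha_c\,(a_t^{k} - CK_t^{k}),$$

where $a_t^k = 1$ if arm k was chosen on trial t and 0 otherwise, and $\alpha_c$ is the choice kernel updating rate. Consistent with the three-parameter description of this model, choices can then be selected by applying the Softmax to $Q_t^k + CK_t^k$ with a single inverse temperature.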
The fifth model (RLγ) is an asymmetrical learning model that incorporates an asymmetric learning scalar parameter, which scales the learning rate on trials where there is no reward. Choice selection is again based on a Softmax probability distribution, where the inverse temperature determines the decision noise in the system.
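Under this description, the value update becomes

$$Q_{t+1}^{k} = Q_t^{k} + \begin{cases} \alpha\,(r_t - Q_t^{k}), & r_t = 1 \\ \gamma\alpha\,(r_t - Q_t^{k}), & r_t = 0, \end{cases}$$

where γ is the asymmetric learning scalar.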
The sixth model (RLCKγ) is the combination of the fourth and the fifth model, incorporating both asymmetrical learning and choice kernel. This model included a learning rate, an asymmetric learning scalar, a choice kernel, and an inverse temperature parameter.
Results
Age-matched adult male and female wild-type mice (n = 32, 16 males and 16 females, strain B6129SF1/J) were trained to perform a restless two-armed spatial bandit task in touchscreen operant chambers (Fig. 1A). In this task, animals were presented with two identical visual targets (squares) on the left and right side of the screen during each trial. They indicated their choices by nose poking at one of the two target locations, each of which provided some probability of reward that changed independently and randomly over time (Fig. 1B). The dynamic reward contingency of this task naturally encourages the animal to balance between exploiting a favorable option when it is found and exploring to gain information about potentially better alternatives. This task has been employed in rodents and primates as well as human subjects to understand learning and exploration and has successfully revealed divergent exploration strategies (Ebitz et al., 2018; Grossman et al., 2020; Chen et al., 2021b). In this study, we adopted this task to understand the modulatory effect of catecholamine receptor activity on the transition between exploration and exploitation.
Figure 1-1
Average correct performance and response time across sessions. A) Average probability of choosing the best choice at any given trial (correct performance) across sessions under flupenthixol (FLU)/vehicle and apomorphine (APO)/vehicle. B) Average probability of choosing the best choice at any given trial (correct performance) across sessions under propranolol (PRO)/vehicle and isoproterenol (ISP)/vehicle. C) Average response time across sessions under flupenthixol (FLU)/vehicle and apomorphine (APO)/vehicle. D) Average response time across sessions under propranolol (PRO)/vehicle and isoproterenol (ISP)/vehicle. Graphs depict mean ± SEM across animals. Download Figure 1-1, TIF file.
Figure 1-2
Changes in task performance, response time and level of exploration within sessions across 30-minute time bins. A) Average probability of choosing the best choice at any given trial (highest payoff choice) within sessions across 30-minute time bins under flupenthixol (FLU)/vehicle and apomorphine (APO)/vehicle. B) Average probability of choosing the best choice at any given trial (highest payoff choice) within sessions across 30-minute time bins under propranolol (PRO)/vehicle and isoproterenol (ISP)/vehicle. C) Average response time within sessions across 30-minute time bins under flupenthixol (FLU)/vehicle and apomorphine (APO)/vehicle. D) Average response time within sessions across 30-minute time bins under propranolol (PRO)/vehicle and isoproterenol (ISP)/vehicle. E) Average probability of exploration inferred from Hidden Markov model (HMM) within sessions across 30-minute time bins under flupenthixol (FLU)/vehicle and apomorphine (APO)/vehicle. F) Average probability of exploration inferred from Hidden Markov model (HMM) within sessions across 30-minute time bins under propranolol (PRO)/vehicle and isoproterenol (ISP)/vehicle. Graphs depict mean ± SEM across animals. Download Figure 1-2, TIF file.
To examine the distinct and/or overlapping roles of dopamine and norepinephrine in modulating exploration, we tested the systemic effect of a nonselective dopamine receptor agonist apomorphine (0.1 mg/kg), a nonselective dopamine receptor antagonist flupenthixol (0.03 mg/kg), a beta-noradrenergic receptor agonist isoproterenol (0.3 mg/kg), and a beta-noradrenergic receptor antagonist propranolol (5 mg/kg) on exploration. The experiment used a within-subject design, where all animals received each of the four drugs tested. This design allowed us to directly compare the effect of each drug within individual animals and dissociate drug effects from potential cohort differences. Since each animal received all four drugs, the length of the study did not allow us to test multiple doses for each drug. Therefore, we selected the dosages of the drugs based on numerous previous studies on the role of dopamine and norepinephrine in cognitive processes (Fernandez-Tome et al., 1979; Cabib et al., 1984; Goldschmidt et al., 1984; Ichihara et al., 1988; Nakamura et al., 1998; Grigoryan, 2012; Cinotti et al., 2019). Animals were intraperitoneally administered either saline (control) or treatment with 5 mg/kg of propranolol, 0.3 mg/kg of isoproterenol, or 0.03 mg/kg of flupenthixol immediately prior to the bandit task, or 0.1 mg/kg of apomorphine 30 min prior to the bandit task. Saline and drug treatments were administered interleaved by session (Figs. 1C, 2A). Mice performed 12 sessions under each pharmacological condition (six control and six drug treatment sessions). Each drug effect was compared with its own vehicle control collected in the interleaved sessions. The washout period between each drug condition was 3 d. Each session consisted of up to 300 trials. We examined the effect of repeated drug administration on the probability of choosing the best choice at any given trial from Session 1 to Session 6 for each drug (Extended Data Fig. 1-1A,B). No overall effect of repeated apomorphine administration on choice performance was found, and the only session difference was between Session 1 and Session 5 after correcting for multiple comparisons (p = 0.03). No effect of repeated flupenthixol, isoproterenol, or propranolol administration on choice performance was found across sessions after correcting for multiple comparisons. We also examined the effect of drug administration on average response time (the time elapsed between the onset of choices and the nose poke response to make a choice) from Session 1 to Session 6 for each drug (Extended Data Fig. 1-1C,D). No effect of repeated apomorphine administration on response time was found across sessions after correcting for multiple comparisons. In the flupenthixol group, the session difference in response time was driven by longer response times in Session 3 compared with the other sessions at p < 0.05 after correcting for multiple comparisons. No effect of repeated isoproterenol or propranolol administration on response time was found across sessions after correcting for multiple comparisons. In short, these results suggested a minimal effect of repeated drug administration on task behaviors.
To examine whether mice had learned to perform the task and were able to sustain reward, we first calculated the average probability of obtaining reward across sessions in all control and treatment groups. It is worth noting that in this task, there is no asymptotic performance because the animals need to constantly adapt to the changing reward contingencies and no single option is always the best choice. At any given time, an animal could decide to repeat a favorable option or to explore the other option, so performance is best measured by the amount of reward acquired. Because reward schedules were dynamic and stochastic, sessions could differ slightly in the amount of reward that was available. We therefore compared the probability of obtained reward against the probability of reward expected from random choice (chance). In the dopamine modulation condition, both control groups (saline) and treatment groups (apomorphine and flupenthixol) were able to obtain reward more frequently than chance [Fig. 1D,F; one-sample t test, vehicle control (apomorphine): t(31) = 13.96, p < 0.0001; apomorphine: t(31) = 17.24, p < 0.0001; vehicle control (flupenthixol): t(31) = 12.36, p < 0.0001; flupenthixol: t(31) = 16.48, p < 0.0001]. When modulating beta-noradrenergic receptor activity, both control groups (saline) and treatment groups (isoproterenol and propranolol) were able to obtain reward more frequently than chance [Fig. 2B,D; one-sample t test, vehicle control (isoproterenol): t(31) = 10.01, p < 0.0001; isoproterenol: t(31) = 13.84, p < 0.0001; vehicle control (propranolol): t(31) = 10.60, p < 0.0001; propranolol: t(31) = 20.08, p < 0.0001]. These results suggested that animals were not choosing randomly: they learned the task and were able to effectively obtain rewards above chance under baseline (vehicle control) and all drug treatments.
Next, we asked whether manipulating dopaminergic receptor activity influenced reward acquisition performance. We used a GLMM to estimate effects and interactions of drug treatments and sex, with session and animal identity as random effects (Eq. 1; see Materials and Methods). Increasing dopamine receptor activity with apomorphine administration did not significantly alter reward acquisition performance (Fig. 1D; GLMM, main effect of drug, p = 0.11, β1 = −0.017). Antagonizing dopamine receptor activity with flupenthixol administration also did not significantly alter reward acquisition performance (Fig. 1F; GLMM, main effect of drug, p = 0.28, β1 = −0.008). Increasing beta-noradrenergic receptor activity with isoproterenol administration did not affect the amount of reward obtained (Fig. 2B; GLMM, main effect of drug, p = 0.93, β1 = −0.0005). However, when decreasing beta-noradrenergic receptor activity with propranolol administration, animals obtained significantly more reward (Fig. 2D; GLMM, main effect of drug, p < 0.0001, β1 = −0.497). Even though manipulating dopamine receptor activity and increasing beta-noradrenergic receptor activity did not affect the overall amount of reward obtained, we considered the possibility that tuning catecholamine receptor activity affected the strategy used to explore an uncertain environment without affecting the level of accuracy. Our previous study demonstrated that similar reward acquisition performance can be achieved via divergent explore strategies (Chen et al., 2021b).
To examine the temporal changes in behaviors as drugs were onboarded and metabolized throughout a session, we calculated the probability of choosing the best choice (highest payoff choice) as a task performance metric throughout a single session in 30 min time bins (four time bins total). A two-way ANOVA revealed that task performance (probability of choosing the best choice) did not significantly change across time bins within a single session after administration of vehicle/apomorphine or vehicle/flupenthixol, after correcting for multiple comparisons (Bonferroni; Extended Data Fig. 1-2A). After propranolol administration, the probability of choosing the best choice significantly decreased across time bins (Extended Data Fig. 1-2B; main effect of time bin, F(3,93) = 5.41, p = 0.002). After correcting for multiple comparisons, the probability of choosing the best choice in the first 30 min and the last 30 min differed significantly at p < 0.01 (mean diff = 0.12; 95% CI = [0.03, 0.20]). The results also revealed a significant main effect of drug: the probability of choosing the best choice under propranolol was higher than that under vehicle control across all time bins (F(1,31) = 6.2; p = 0.018). This suggests that while there were temporal changes in task performance that may be associated with drug metabolism, the effect of propranolol persisted throughout the session, and thus no data needed to be excluded from analyses as a result of drug washout. After isoproterenol administration, the probability of choosing the best choice did not significantly change across time bins (Extended Data Fig. 1-2B).
Pharmacological manipulations of both dopaminergic and noradrenergic receptor activity have been shown to have dose-dependent effects on motor functions (Goldschmidt et al., 1984; Weed and Gold, 1998). In this study, dosage selection was based on previous literature examining cognitive functions, and a low dose was used to influence cognition while avoiding motor function impairment (Nakamura et al., 1998; Cinotti et al., 2019). One commonly used behavioral metric to examine motor function in a cognitive task is response time. Therefore, we calculated the response time, which was the time elapsed between the onset of choices and the nose poke response to make a choice, as recorded by the touchscreen operant chamber. If the administration of drugs impaired motor function, we would expect to see longer response times in the drug group compared with vehicle control. In the dopamine agonist condition, the mean response time under saline control was 3.6 s with a standard deviation of 1.88, and the mean response time under apomorphine treatment was 3.56 s with a standard deviation of 1.18 (Fig. 1E). Upregulating dopamine receptor activity with apomorphine did not significantly influence the response time (GLMM, main effect of drug, p = 0.77, β1 = 0.10). In the dopamine antagonist condition, the mean response time under saline control was 3.52 s with a standard deviation of 1.27, and the mean response time under flupenthixol treatment was 3.52 s with a standard deviation of 1.24 (Fig. 1G). Similarly, downregulating dopamine receptor activity with flupenthixol did not significantly alter animals’ response time (GLMM, main effect of drug, p = 0.96, β1 = −0.01).
To measure the changes in response time as drugs were metabolized, we examined the average response time throughout a single session, binned in 30 min time bins (four time bins total; Extended Data Fig. 1-2C). After apomorphine administration, the response time did not differ significantly across time bins throughout a single session after correcting for multiple comparisons. In the vehicle/flupenthixol group, the main effect of time bin on average response time was driven by changes between the first 30 min and 60–90 min after correcting for multiple comparisons (Bonferroni) under both vehicle control and flupenthixol (30 vs 90 min: vehicle: mean diff = −2.32, 95% CI = [−4.48, −0.15], p = 0.03; FLU: mean diff = −3.01, 95% CI = [−5.18, −0.85], p = 0.002). Since this increase in response time was observed in both the vehicle and flupenthixol conditions, it is unlikely that this change in response time reflected temporal effects of drug metabolism. The change could reflect other temporal effects, such as changes in satiety or motivation.
Modulating beta-noradrenergic receptor activity resulted in bidirectional changes in response time. When increasing beta-noradrenergic receptor activity with isoproterenol, both males and females took significantly longer to respond (Fig. 2C; GLMM, main effect of drug, p < 0.0001, β1 = −3.51). In the beta-adrenergic agonist condition, the mean response time under saline control was 4.05 s with a standard deviation of 1.7, and the mean response time under isoproterenol treatment was 7.34 s with a standard deviation of 4.31. Decreasing beta-noradrenergic receptor activity with propranolol significantly decreased the response time (Fig. 2E; GLMM, main effect of drug, p = 0.04, β1 = 0.51), primarily driven by a response time reduction in females under propranolol compared with vehicle controls (GLMM, interaction term, p < 0.009, β3 = −0.57). In the beta-adrenergic antagonist condition, the mean response time under saline control was 4.77 s with a standard deviation of 1.87, and the mean response time under propranolol treatment was 4.54 s with a standard deviation of 1.87. The average response time of females was 4.97 s ± 1.58 SD under the control condition and 4.46 s ± 1.35 SD under the propranolol condition, whereas the average response time of males was 4.57 s ± 2.15 SD under the control condition and 4.63 s ± 2.32 SD under the propranolol condition. Previous studies have linked response time with the complexity of the strategy used (Kool et al., 2010; Filipowicz et al., 2019; Chen et al., 2021b). Faster response times under propranolol might therefore point to the possibility that decreasing beta-noradrenergic receptor activity brought on changes in strategies that took less time to execute. Isoproterenol, by contrast, was a highly significant predictor of longer response times in both male and female mice, implicating a potential drug effect on motor behavior.
We also examined changes in average response time as drugs were metabolized throughout a single session, binned in 30 min time bins (Extended Data Fig. 1-2D). After isoproterenol administration, the response time significantly decreased over time (main effect of time bin, F(3,93) = 4.94, p = 0.003), and a post hoc multiple comparison suggested that the average response time in the first 60 min was significantly higher than in later bins at p < 0.01. Since no change in task performance was observed over time under isoproterenol, it is possible that the higher response times early on were due to an acute drug effect on locomotion. After propranolol administration, the response time was significantly shorter in the first 30 min under both the vehicle control and propranolol conditions after correcting for multiple comparisons (30 vs 120 min, vehicle: mean diff = −3.58, 95% CI = [−5.36, −1.80], p < 0.0001; PRO: mean diff = −2.70, 95% CI = [−4.57, −0.84], p = 0.0013). Since this temporal effect was observed in both the vehicle and drug conditions, it is unlikely to be a result of drug metabolism. There was also a main effect of drug such that response time under propranolol was faster than under vehicle throughout a session (F(1,31) = 8.44; p = 0.0067).
Opposing modulatory effect of dopamine and norepinephrine receptor activity on win-stay lose-shift behaviors
The role of dopamine has been heavily studied and shown to be a key contributor to value-based decision-making and RL (Seamans and Yang, 2004; Redish et al., 2007; Niv, 2009; Floresco, 2013). Although modulating dopamine receptor activity did not significantly influence the overall amount of reward obtained or the response time, it is possible that apomorphine and flupenthixol exerted influence on how animals adapted their choices to reward outcomes. To understand how increasing and decreasing dopamine activity influenced animals’ reward-driven choice behaviors, we examined the probability of win-stay (repeating a choice when it was rewarded on the previous trial) and the probability of lose-shift (switching to the other choice when not rewarded). We found that when increasing dopamine receptor activity with apomorphine administration, both males and females showed increased win-stay behavior (Fig. 1H; GLMM, main effect of drug, p = 0.011, β1 = −0.02) and decreased lose-shift behavior (Fig. 1I; GLMM, main effect of drug, p = 0.016, β1 = 0.034). To control for the total amount of switching, we calculated outcome sensitivity (Eq. 2; see Materials and Methods) to understand how much of the switching behavior was sensitive to outcome. There were no significant changes in outcome sensitivity when animals were administered apomorphine (Fig. 1J; GLMM, main effect of drug, p = 0.22, β1 = −0.08). This result suggested that apomorphine increased the overall “stickiness,” or inflexibility, of choice, producing more stay behaviors regardless of the outcome.
There was no significant change in the probability of win-stay when the animals were administered flupenthixol (Fig. 1K; GLMM, main effect of drug, p = 0.27, β1 = −0.001). It is worth noting that there was a main effect of sex on the probability of win-stay, with females having a higher probability of win-stay, in both DA agonist and antagonist manipulations (GLMM, main effect of sex, apomorphine: p = 0.065, β2 = −0.054; flupenthixol: p = 0.01, β2 = −0.05). The results also revealed an effect opposite to that of the dopamine agonist: antagonizing dopamine receptor activity with flupenthixol administration increased the probability of lose-shift (Fig. 1L; GLMM, main effect of drug, p = 0.04, β1 = −0.014). Moreover, administration of flupenthixol significantly increased outcome sensitivity across both sexes (Fig. 1M; GLMM, main effect of drug, p = 0.006, β1 = −0.06), with a greater increase in outcome sensitivity in males (GLMM, interaction term, p = 0.027, β3 = −0.10).
Modulating beta-noradrenergic receptor activity also influenced win-stay and lose-shift behaviors. Increasing beta-noradrenergic activity with isoproterenol administration decreased the probability of win-stay compared with vehicle control (Fig. 2F; GLMM, main effect of drug, p = 0.009, β1 = 0.007). Isoproterenol administration also decreased the probability of lose-shift in both sexes (Fig. 2G; GLMM, main effect of drug, p = 0.0007, β1 = 0.083), primarily driven by a significant decrease in lose-shift in females under isoproterenol (Fig. 2G, inset; GLMM, interaction term, p = 0.004, β3 = −0.077). Outcome sensitivity analysis revealed a significant decrease in outcome sensitivity when isoproterenol was administered compared with vehicle control (Fig. 2J; GLMM, main effect of drug, p = 0.021, β1 = 0.134). We then examined outcome sensitivity across sexes under the vehicle/isoproterenol condition and found that the changes in sensitivity to outcome were driven by a main effect of drug, rather than primarily driven by one sex (Fig. 2K; nonsignificant effect of sex, p = 0.65).
In contrast, decreasing beta-noradrenergic activity with propranolol administration increased the probability of win-stay compared with vehicle control (Fig. 2H; GLMM, main effect of drug, p < 0.0001, β1 = −0.023). There was also a significant interaction between sex and drug such that males showed a greater increase in win-stay on propranolol compared with females (Fig. 2H, inset; GLMM, interaction term, p = 0.02, β3 = −0.036). Interestingly, propranolol administration also decreased the probability of lose-shift but only in males (Fig. 2I; GLMM, main effect of drug, p = 0.45, β1 = −0.01; interaction term, p = 0.002, β3 = 0.08). Outcome sensitivity analysis revealed a significant increase in outcome sensitivity with propranolol administration (Fig. 2L; GLMM, main effect of drug, p = 0.003, β1 = −0.186). We then examined outcome sensitivity across sexes under the vehicle/propranolol condition and found that the changes in sensitivity to outcome were driven by a main effect of drug, rather than primarily driven by one sex (Fig. 2M; nonsignificant effect of sex, p = 0.83). Our results suggested a bidirectional modulatory effect of NE manipulation on sensitivity to outcome, when controlling for the total amount of switching.
Together, these results suggested opposing roles of dopamine and norepinephrine in modulating reward-driven choice behaviors. Increasing dopamine or decreasing norepinephrine activity resulted in increased win-stay and decreased lose-shift, but this effect was more sex-dependent in the noradrenergic manipulations.
A hidden Markov model identifies distinct patterns of exploration and exploitation associated with modulation of dopamine and norepinephrine activity
The changes in reward-driven behaviors such as win-stay and lose-shift could be a manifestation of strategic changes in how animals explored the changing environment. To understand how modulation of dopamine and norepinephrine affects the balance between exploration and exploitation, we used an HMM that modeled exploration and exploitation as two latent goal states to infer which choices were exploratory or exploitative (Fig. 3A). The HMM has previously been used in rodents, primates, and humans (Ebitz et al., 2018, 2020; Chen et al., 2021b) to infer explore–exploit states from choices and has shown robust correlations with neural activity as well as other behavioral metrics, including response time, value function, and RL. We have previously established fast-switching (putatively explore) and slow-switching (putatively exploit) regimes in mouse choice behavior (Chen et al., 2021b). This HMM method produces explore/exploit labels that better track neural activity and choice behavior across species than RL models (Ebitz et al., 2018; Chen et al., 2021b). Figure 3A shows an example of HMM labeling of the choices of an animal in the restless bandit task. The shaded area indicates exploratory trials inferred by the HMM. The animal displayed mixtures of exploratory bouts, where choices were distributed across the two options, and exploitative bouts, where one choice was repeated.
Figure 3-1
Effect of dopamine and norepinephrine modulation on exploration in female and male mice. Download Figure 3-1, TIF file.
The HMM allows us to make statistical inferences about the probability that each choice was due to exploration or exploitation by modeling explore/exploit as distinct latent goal states underlying choices. To evaluate the face validity of the HMM labels, we next examined whether HMM-labeled exploration matched the normative definition of exploration. First, by definition, exploration is a pattern of nonreward-maximizing choices whose purpose is to gain information. A nonreward-maximizing goal would produce choices that are orthogonal to reward value. Therefore, we examined the probability of choosing a choice with regard to its relative value. In all control and dopamine treatment conditions, HMM-labeled exploratory choices were nonreward maximizing: they were orthogonal to reward value, whereas exploit choices were correlated with relative reward value (Fig. 3B,E). The probability of choosing a target choice across bins of reward value was not different from chance (t(43) = 0.39; p = 0.70), meaning that choice behavior was not driven by reward value during exploration. During exploration, animals chose both high-value choices and low-value choices at around chance level (mean = 49.8% ± 3.4% SD across all value bins). During exploitation, animals were much more likely to choose a high-value choice over a low-value choice (mean probability of choosing high-value choice = 75.8% ± 10.0% SD; mean probability of choosing low-value choice = 23.3% ± 10.1% SD). Second, prior research has shown that exploratory choices are more computationally expensive, thus taking longer than exploitative choices (Kool et al., 2010; Filipowicz et al., 2019; Chen et al., 2021b). We calculated the response time across HMM-inferred states and found that across all treatment conditions, HMM-inferred explore choices had longer response times than exploit choices (Fig. 3C,D,F,G; paired t test, DA antagonist condition: vehicle, t(31) = 2.27, p = 0.03; FLU, t(31) = 4.49, p < 0.0001; DA agonist condition: vehicle, t(31) = 4.48, p < 0.0001; APO, t(31) = 4.42, p < 0.0001). Together, these results suggested that the HMM produced meaningful and robust labels of explore/exploit choices that matched the normative definition of exploration and explained variance in value selection and response time. Because this approach to inferring exploration is agnostic to the generative models and depends only on the temporal statistics of choices (Ebitz et al., 2018, 2020; Chen et al., 2021b), it is particularly well suited to circumstances like this one, where we suspect that the generative computations may differ across treatment groups.
Modulating dopamine receptor activity bidirectionally affected the level of exploration
To examine how much animals were exploring, we calculated the average number of exploratory choices inferred from the HMM. The results revealed a bidirectional modulatory effect of dopamine receptor activity on the level of exploration. When animals were administered apomorphine, we found that animals on average had fewer exploratory trials than when they were on vehicle control, with 55.5% ± 12.6% SD of trials labeled as exploratory on apomorphine and 72.3% ± 13.6% SD of trials labeled as exploratory on vehicle (Fig. 3H,J; GLMM, main effect of drug, p < 0.0001, β1 = 0.174). This is also consistent with our findings in the win-stay lose-shift analysis that apomorphine increased the stickiness or repetitiveness of choice behaviors. In contrast, when decreasing dopamine receptor activity with flupenthixol administration, animals explored more compared with vehicle control, with 58.2% ± 12.0% SD of trials being exploratory on flupenthixol and 52.1% ± 12.3% SD of trials being exploratory on vehicle control (Fig. 3H,J; GLMM, main effect of drug, p = 0.002, β1 = −0.062). This result suggested that manipulating dopamine receptor activity bidirectionally affected the level of exploration across both sexes (Fig. 3I, Extended Data Fig. 3-1)—increasing dopamine receptor activity decreased exploratory choices and decreasing dopamine receptor activity increased exploratory choices. We also calculated temporal changes in the probability of exploration throughout a session in 30 min time bins to ensure the changes in the level of exploration were not driven by a specific temporal epoch (Extended Data Fig. 1-2E). After apomorphine/flupenthixol administration, the level of exploration did not differ significantly across time bins after correcting for multiple comparisons (Bonferroni), suggesting that DA manipulations had a sustained influence on exploration throughout the session.
The effect of beta-noradrenergic receptor activity on exploration was sex modulated
First, we conducted an HMM validation by examining value selection and response time during HMM-labeled explore and exploit states. Consistent with our findings in the dopamine treatment conditions, we found that in all norepinephrine conditions, HMM-inferred exploratory choices were orthogonal to reward value and exploit choices were proportional to relative reward value (Fig. 4A,D). The probability of choosing a target choice across bins of reward value was not different from chance (t(43) = 0.85; p = 0.398). During exploration, animals chose both high-value choices and low-value choices at around chance level (mean = 49.8% ± 1.8% SD across all value bins). During exploitation, animals were much more likely to choose a high-value choice over a low-value choice (mean probability of choosing high-value choice = 74.5% ± 8.6% SD; mean probability of choosing low-value choice = 24.2% ± 9.3% SD). Response times during HMM-labeled exploratory choices were also longer than during exploitative choices across the two vehicle control conditions in the norepinephrine manipulation group (Fig. 4B,E; paired t test, NE antagonist condition: vehicle, t(31) = 1.93, p = 0.06; NE agonist condition: vehicle, t(31) = 3.0, p = 0.005). However, response times during HMM-labeled exploratory choices were not significantly longer than during exploitative choices in the propranolol and isoproterenol conditions (Fig. 4C,F; paired t test, PRO, t(31) = 1.235, p = 0.23; ISP, t(31) = 1.48, p = 0.15). This is most likely due to changes in response time as an effect of the drug manipulation; as shown above, modulating beta-noradrenergic receptor activity resulted in bidirectional changes in response time. Because we observed explore/exploit state-differentiated response times in both saline conditions, giving us confidence in the HMM approach to inferring exploration, we next examined the effect of propranolol and isoproterenol on the level of exploration.
Modulating beta-noradrenergic receptor activity also influenced the probability of exploration, and the modulatory effect on exploration was associated with sex. Decreasing beta-noradrenergic receptor activity with propranolol significantly decreased the level of exploration (Fig. 4G–I) but only in males (Fig. 4J, Extended Data Fig. 3-1; GLMM, interaction term, p = 0.0001, β3 = 0.138). Increasing beta-noradrenergic receptor activity with isoproterenol administration significantly decreased exploration in females (Fig. 4K, Extended Data Fig. 3-1; GLMM, interaction term, p = 0.002, β3 = −0.103). We calculated temporal changes in the probability of exploration throughout a session in 30 min time bins (Extended Data Fig. 1-2F). After propranolol/isoproterenol administration, the level of exploration did not differ significantly across time bins after correcting for multiple comparisons (Bonferroni), suggesting that NE manipulations had a sustained influence on exploration throughout the session. Together these results suggested a sex-linked neuromodulatory effect of beta-noradrenergic receptors on exploration—increasing beta-noradrenergic activity decreased exploration in females and decreasing beta-noradrenergic activity decreased exploration in males. One possibility is that this sex-differentiated modulatory effect reflected sex-dependent ceiling/floor effects of noradrenergic signaling.
Reinforcement learning models revealed changes in distinct parameters under modulation of dopamine and norepinephrine
The results of the HMM suggested that both dopamine and norepinephrine modulation influenced the level of exploration. However, it is unclear whether changes in exploration were due to changes in similar or distinct latent cognitive parameters. Do dopamine and norepinephrine modulate exploration via distinct or overlapping mechanisms? In our previous work, we linked changes in different latent cognitive processes, as inferred from RL models, to changes in overall exploration (Chen et al., 2021b). While the inverse temperature (decision noise) parameter in RL models is traditionally thought to be related to exploration and the learning rate to learning, we have found that exploration as inferred from the HMM is a function of both value difference and decision noise in the model (Chen et al., 2021b). Both the learning rate and inverse temperature parameters can increase the likelihood of making an exploratory choice. The HMM modeling results implicated changes in exploration with dopamine and norepinephrine manipulation. Fitting RL models could give us insight into whether changes in exploration with pharmacological manipulation were an effect of how values are learned or an effect of how noisy decision-making is. To address this question, we fitted a series of RL models to understand the effect of pharmacological manipulation on the latent cognitive parameters that could influence exploration and exploitation (Ishii et al., 2002; Daw et al., 2006; Pearson et al., 2009; Jepma and Nieuwenhuis, 2011).
To make inferences about changes in the RL model parameters, we first identified the best-fitting RL model for the animals’ behaviors. The majority of RL model parameters can be categorized in three ways: value-dependent learning terms, value-independent bias terms, and decision noise/randomness terms (Katahira, 2018). Previous studies demonstrated the effect of various RL parameters on value-based decision-making, including value-dependent terms such as learning rate and asymmetrical learning rate (Frank et al., 2007; Gershman, 2016; Cinotti et al., 2019; Chen et al., 2021b), value-independent terms such as choice bias (Katahira, 2018; R. Wilson and Collins, 2019; Chen et al., 2021b), and noise terms such as inverse temperature and lapse rate (Economides et al., 2015; Cinotti et al., 2019; R. Wilson and Collins, 2019). Here, we compared six RL models that incorporated one or more of the above parameters (learning rate, bias, noise) and had different assumptions about the latent cognitive processes that animals might employ during exploration (Fig. 5A; Eqs. 3–7). These models included the following: (1) a “random” model with some overall bias for one choice over the other; (2) a “noisy win-stay lose-shift” (WSLS) model that assumes a win-stay lose-shift policy with some level of randomness; (3) a two-parameter “RL” model with a consistent learning rate and an inverse temperature that captures decision noise; (4) a three-parameter “RLCK” model that captures both value-based and value-independent decisions, with two parameters for learning rate and choice bias and an overall decision noise parameter; (5) a three-parameter “RLγ” model that captures asymmetrical learning, with a learning rate parameter, a scaling parameter for negative outcome learning, and a decision noise parameter; and (6) a four-parameter “RLCKγ” asymmetrical learning and bias model that includes a choice bias term on top of the “RLγ” model (see Materials and Methods). We fitted these models to each individual animal and compared the likelihood of each of the six models under the four vehicle control conditions and four drug conditions.
Figure 5-1
Correlation between the reinforcement learning (RL) model parameters (learning rate and inverse temperature, i.e., decision noise) and the probability of exploration as inferred from the hidden Markov model (HMM) under all treatment conditions. A higher level of exploration was correlated with a lower inverse temperature, i.e., higher decision noise. A) Correlation between learning rate (α) and probability of exploration under dopamine (DA) manipulations (flupenthixol (FLU) on the left and apomorphine (APO) on the right). B) Correlation between learning rate (α) and probability of exploration under beta-noradrenergic (NE) manipulations (propranolol (PRO) on the left and isoproterenol (ISP) on the right). C) Correlation between inverse temperature (β) and probability of exploration under DA manipulations (FLU on the left and APO on the right). D) Correlation between inverse temperature (β) and probability of exploration under NE manipulations (PRO on the left and ISP on the right). Spearman’s rho is reported. * indicates p < 0.05, ** indicates p < 0.01, *** indicates p < 0.001.
Figure 5-2
The choice kernel updating rate (αc) parameter in the reinforcement learning (RL) model was not affected by modulation of dopamine or noradrenergic receptor activity. A) Average choice kernel updating rate (αc) across flupenthixol (FLU) and vehicle. B) Average αc across apomorphine (APO) and vehicle. C) Average αc across propranolol (PRO) and vehicle. D) Average αc across isoproterenol (ISP) and vehicle. Graphs depict mean ± SEM across animals.
In the dopamine modulation conditions, the RLCK model (learning + choice kernel) and the RLCKγ (asymmetrical learning and bias) model best fit the behaviors (Fig. 5B; average AIC across DA conditions: model 1, random: 73,838.03; model 2, WSLS: 55,843.26; model 3, RL: 58,133.45; model 4, RLCK: 54,652.02; model 5, RLγ: 57,545.47; model 6, RLCKγ: 54,612.56). Since the RLCKγ model did not significantly improve the model fit, we used the more parsimonious RLCK model with three parameters (learning rate, choice bias, and decision noise). To examine how well the best-fitting RLCK model predicted animals’ choices, we measured the model agreement for each model, calculated as the probability of choices correctly predicted by the optimized model parameters (Fig. 5C). In the dopamine agonist condition, the RLCK model predicted over 66% of animals’ actual choices (vehicle: 66.1% ± 5.3% SD; APO: 67.6% ± 5.0%). In the dopamine antagonist condition, the RLCK model also predicted over 66% of animals’ actual choices (vehicle: 66.4% ± 6.4%; FLU: 67.3% ± 5.4%). Figure 5, D and E, shows the model simulation, the animal's choice probability (probability of choosing left), and the reward probability for the same animal. This visually demonstrated that the RLCK model accurately predicted the animal's changing choice behavior and that choice behavior tracked the reward contingency.
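For reference, the two model-selection quantities used above could be computed as in this brief sketch; treating "model agreement" as the fraction of trials on which the option the fitted model deems more likely matches the actual choice is our reading of the text, and the function names are illustrative.

```python
import numpy as np

def aic(neg_log_lik, n_params):
    """Akaike information criterion: 2k + 2 * NLL (lower indicates better fit)."""
    return 2 * n_params + 2 * neg_log_lik

def model_agreement(p_choose_left, chose_left):
    """Fraction of trials where the option assigned >0.5 probability by the
    fitted model matches the animal's actual choice (one plausible reading
    of 'probability of choice correctly predicted')."""
    predicted = (np.asarray(p_choose_left) > 0.5).astype(int)
    return float(np.mean(predicted == np.asarray(chose_left)))
```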
To understand how changes in the RL model parameters related to changes in the level of exploration inferred from the HMM, we tested correlations between the probability of exploration (HMM parameter) and the learning rate and inverse temperature (RL model parameters; Extended Data Fig. 5-1). There was no significant correlation between probability of exploration and learning rate (α) under any vehicle control or drug condition (Extended Data Fig. 5-1A, left, vehicle: rho = 0.30, p = 0.09; FLU: rho = 0.29, p = 0.10; Extended Data Fig. 5-1A, right, vehicle: rho = 0.20, p = 0.27; APO: rho = 0.32, p = 0.07; Extended Data Fig. 5-1B, left, vehicle: rho = 0.21, p = 0.26; PRO: rho = 0.22, p = 0.23; Extended Data Fig. 5-1B, right, vehicle: rho = 0.29, p = 0.10; ISP: rho = −0.02, p = 0.92). The level of exploration was negatively correlated with inverse temperature under both dopamine agonist and antagonist conditions and under the propranolol condition (Extended Data Fig. 5-1C, left, vehicle: rho = −0.82, p < 0.0001; FLU: rho = −0.65, p = 0.009; Extended Data Fig. 5-1C, right, vehicle: rho = −0.69, p < 0.0001; APO: rho = −0.68, p < 0.0001; Extended Data Fig. 5-1D, left, vehicle: rho = −0.45, p = 0.009; PRO: rho = −0.36, p = 0.04). HMM-inferred exploration was negatively correlated with inverse temperature under the vehicle control for isoproterenol, but there was no correlation under the isoproterenol condition (Extended Data Fig. 5-1D, right; vehicle: rho = −0.59, p < 0.001; ISP: rho = −0.29, p = 0.10). Together, these results suggested that changes in the level of exploration as modeled by the HMM were strongly correlated with decision noise, modeled as the inverse temperature in the RL model: the higher the decision noise (the lower the inverse temperature), the higher the probability of exploration. Our previous study suggested that both learning rate and inverse temperature could influence the overall level of exploration (Chen et al., 2021b). In the present study, changes in decision noise were more strongly correlated with, and thus potentially a primary driver of, changes in exploration.
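The per-condition correlations above could be reproduced with a few lines, as in this sketch; the synthetic arrays are hypothetical stand-ins for the per-animal fitted values.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
beta_hat = rng.uniform(1, 10, size=32)                    # fitted inverse temperatures
p_explore = 1 / (1 + beta_hat) + rng.normal(0, 0.02, 32)  # HMM P(explore) per animal

rho, p = spearmanr(p_explore, beta_hat)
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")  # strongly negative, as in Fig. 5-1C
```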
We then examined how the RL model parameters changed with dopamine modulation. Our previous study showed that both the learning rate and the inverse temperature parameter could induce changes in the overall level of exploration (Chen et al., 2021b). Therefore, we asked whether the bidirectional changes in exploration when up- or downregulating dopamine receptor activity were due to changes in learning rate or in decision noise. The results revealed a bidirectional effect of dopamine modulation on the inverse temperature parameter (Fig. 5F–H). Because RL model parameters can be non-normally distributed, we conducted the Shapiro–Wilk test of normality, which indicated that the parameters were not always normally distributed (decision noise: vehicle: p = 0.85, APO: p < 0.0001; vehicle: p < 0.0001, FLU: p = 0.09); we therefore report both parametric and nonparametric statistics for RL parameters. Increasing dopamine receptor activity with apomorphine administration increased inverse temperature, that is, decreased decision noise (Fig. 5F; Wilcoxon matched-pairs test, p = 0.0165; paired t test, p = 0.053). Decreasing dopamine receptor activity with flupenthixol administration decreased inverse temperature, that is, increased decision noise (Fig. 5G; Wilcoxon matched-pairs test, p = 0.0253; paired t test, p = 0.053).
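This report-both-tests approach can be wrapped as a small helper; a sketch assuming per-animal paired arrays, with all names hypothetical.

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

def compare_paired(vehicle, drug):
    """Check normality per condition with Shapiro-Wilk, then report both the
    parametric and the nonparametric paired test, as done in the text."""
    vehicle, drug = np.asarray(vehicle), np.asarray(drug)
    print("Shapiro-Wilk p (vehicle, drug):",
          shapiro(vehicle).pvalue, shapiro(drug).pvalue)
    print("paired t test p:", ttest_rel(vehicle, drug).pvalue)
    print("Wilcoxon matched-pairs p:", wilcoxon(vehicle, drug).pvalue)
```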
We also found that the dopamine antagonist (flupenthixol) significantly increased the learning rate compared with vehicle control (Fig. 5I; paired t test, t(31) = 3.45, p = 0.0016; Wilcoxon matched-pairs test, p = 0.0015). The increase in learning rate with flupenthixol is consistent with the increase in outcome sensitivity under flupenthixol: both the model-based and model-free measures reflected greater value updating from each rewarded or nonrewarded outcome following administration of the dopamine antagonist. The dopamine agonist (apomorphine) did not significantly influence the learning rate (Fig. 5J) or the choice kernel (Extended Data Fig. 5-2B) compared with vehicle control (paired t test, learning rate: t(31) = 0.38, p = 0.71; choice kernel: t(31) = 0.21, p = 0.84). The dopamine antagonist (flupenthixol) did not significantly change the choice kernel compared with vehicle control (Extended Data Fig. 5-2A; paired t test, t(31) = 0.96, p = 0.34). A two-way ANOVA examining sex-dependent effects on RL model parameters revealed no main effect of sex in any model parameter under dopamine manipulation.
In the norepinephrine modulation conditions, the RLCK model (learning + choice kernel) and the RLCKγ (asymmetrical learning and bias) model also best fit the behavioral data (Fig. 6A; average AIC across NE conditions: model 1, random: 66,898.83; model 2, WSLS: 56,061.3; model 3, RL: 53,194.59; model 4, RLCK: 50,035.02; model 5, RLγ: 52,664.48; model 6, RLCKγ: 50,027.82). Since the RLCKγ model did not significantly improve the model fit, we used the more parsimonious RLCK model for the NE conditions as well. The model agreement was calculated for all NE modulations. In the noradrenergic agonist condition, the RLCK model predicted over 65% of animals’ actual choices (Fig. 6B; vehicle: 65.7% ± 5.9% SD; ISP: 66.4% ± 5.7%). In the noradrenergic antagonist condition, the RLCK model also predicted over 65% of animals’ actual choices (vehicle: 65.5% ± 3.9% SD; PRO: 69.8% ± 4.5%). Figure 6, J and K, shows the model simulation, the animal's choice probability (probability of choosing left), and the reward probability for the same animal as in Figure 5D,E.
We found that modulating beta-noradrenergic receptor activity also changed the RL model parameters. Upregulating norepinephrine activity with isoproterenol significantly decreased the learning rate (Fig. 6C; Wilcoxon matched-pairs test, p = 0.0001; paired t test, p = 0.0008). This result is consistent with the decreased win-stay and lose-shift behaviors under isoproterenol: a lower learning rate implies reduced outcome sensitivity (Fig. 2J), that is, learning less from either a rewarded or a nonrewarded outcome. When learning rates were normalized to their respective vehicle controls, the normalized learning rate under isoproterenol was significantly lower than that under propranolol (Fig. 6E; paired t test, t(31) = 3.27, p = 0.0026). There were no significant changes in inverse temperature (Fig. 6H; t(31) = 0.7, p = 0.49) or choice kernel (Extended Data Fig. 5-2D; t(31) = 0.72, p = 0.47) under isoproterenol compared with vehicle control. Interestingly, we found that females had a higher inverse temperature, i.e., lower decision noise, than males in both the vehicle control and isoproterenol conditions (Fig. 6I; main effect of sex, F(1,30) = 4.53, p = 0.04).
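The vehicle normalization could look like the following; whether a ratio or a difference was used is not specified here, so the ratio form and all names are assumptions.

```python
import numpy as np
from scipy.stats import ttest_rel

def normalize_to_vehicle(drug_alpha, vehicle_alpha):
    """Express each animal's fitted learning rate under drug relative to its
    own vehicle-control sessions (ratio normalization; an assumed form)."""
    return np.asarray(drug_alpha) / np.asarray(vehicle_alpha)

# Hypothetical usage comparing the two NE drugs after removing baselines:
# t, p = ttest_rel(normalize_to_vehicle(alpha_isp, alpha_veh_isp),
#                  normalize_to_vehicle(alpha_pro, alpha_veh_pro))
```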
We did not find significant changes in any RL parameter under propranolol compared with vehicle control (Fig. 6D,F; paired t test, learning rate: t(31) = 0.9, p = 0.38; inverse temperature: t(31) = 0.08, p = 0.93; choice kernel: t(31) = 1.09, p = 0.29), despite a significant increase in reward acquisition with propranolol administration. Furthermore, no main effect of sex was found in any model parameter in the propranolol condition (Fig. 6G). One possible explanation is that the parsimonious RL model we selected was not complex enough to capture this change in reward acquisition with a single parameter. Moreover, in our previous study, we showed that different RL model parameters can interact to produce similar final performance (Chen et al., 2021b). More complex RL models will be needed in future experiments to better capture task performance. Together, these results suggest distinct roles for dopaminergic drugs in modulating exploration via decision noise and for beta-noradrenergic drugs in modulating exploration via outcome sensitivity.
Discussion
In this study, we pharmacologically manipulated dopamine and norepinephrine and examined their modulatory effects on exploration in an explore/exploit task, using computational approaches to characterize latent cognitive variables that could influence exploration. Modulating dopamine activity bidirectionally modulated exploration: dopamine agonism decreased exploration and dopamine antagonism increased exploration. Modulation of beta-noradrenergic receptor activity also influenced how much animals explored, via changes in sensitivity to outcome. However, these modulatory effects were influenced by sex, making the role of norepinephrine in exploration more nuanced. While previous studies examined these two catecholamines in isolation, the current study allowed a direct comparison of the modulatory effects of dopamine and norepinephrine on exploration in the same task within animals.
Dopamine's neuromodulatory function has been described in two key ways. First, the phasic activation of midbrain dopamine neurons is thought to be a key contributor to RL (Barto, 1995; Montague et al., 1995, 1996; Schultz et al., 1997; Schultz, 1998). Numerous studies across species have shown that dopamine neurons encode the reward prediction error (RPE; Schultz, 1998; Niv, 2009). Phasic midbrain dopamine activity and release in the ventral striatum are proposed to reflect this error, which is used to update action values (Schultz, 1998, 2013; Niv, 2009). Consistent with this view, previous pharmacological investigations of human decision-making have observed changes in model-derived learning rates that directly correlate with dopamine synthesis agonism via ʟ-DOPA (Frank et al., 2004; Rutledge et al., 2009) and dopamine D2/D3 receptor antagonism via amisulpride (Cremer et al., 2022). However, this framework does not fully account for a large body of literature describing a role for tonic or sustained dopamine levels in modulating motivation, vigor, and cognitive flexibility (Floresco, 2013; Beeler et al., 2014), both systemically and when pharmacologically targeted, especially in the frontal cortex. Because pharmacological agents such as the ones used here by definition modulate receptor activity over long periods of time, our results should be at least partially interpreted as showing a role for tonic neuromodulation. However, the role of tonic dopamine changes in the RL computations supporting decision-making has been less well described. This is notable in light of demonstrations that pharmacologically or genetically modulating tonic dopamine function induces profound changes in decision noise or perseverative errors, consistent with a role for dopamine modulation in cognitive flexibility (Beeler et al., 2010; Humphries et al., 2012; Eisenegger et al., 2014; E. Lee et al., 2015; Cinotti et al., 2019; Ebitz et al., 2019). Two prior examples combining pharmacological and computational modeling approaches to explicitly test the explore–exploit tradeoff particularly highlight this. Cinotti et al. (2019) found that systemic dopamine blockade with flupenthixol increased decision noise without affecting learning rate in a rat three-arm bandit task. Similarly, Ebitz et al. (2019) reported that systemic cocaine, which blocks dopamine and norepinephrine reuptake, reduced flexibility in rhesus macaques and regulated “tonic exploration,” that is, spontaneous exploration that does not provide maximum information, as opposed to “phasic” exploration, which occurred specifically to maximize information. Our data thus provide additional support for the idea that tonic dopamine receptor modulation across the brain, in addition to supporting learning, is a central mechanism for tuning the precision or flexibility of value-based decisions.
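For readers less familiar with the RPE formalism referenced above, the canonical update is a one-liner; this is the textbook form, not code from the study, and the names are illustrative.

```python
def rpe_update(value, reward, alpha):
    """Canonical update ascribed to phasic dopamine:
    delta = r - V, then V <- V + alpha * delta."""
    delta = reward - value   # reward prediction error (RPE)
    return value + alpha * delta, delta

v, delta = rpe_update(value=0.5, reward=1.0, alpha=0.1)  # v -> 0.55, delta = 0.5
```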
Dopaminergic mechanisms in the brain exhibit complex functional heterogeneity, emerging in part from differences in receptor subtypes, especially in the striatum (Watabe-Uchida et al., 2012; Beier et al., 2015; Collins and Saunders, 2020; J. Lee and Sabatini, 2021; Delevich et al., 2022). One primary projection of the midbrain VTA/SNc dopamine neurons is the striatum, which contains GABAergic medium spiny neurons (MSNs) expressing either D1 or D2 dopamine receptors (Albin et al., 1989; Delevich et al., 2022). Although we used receptor-nonspecific pharmacologic modulation here, it is possible that the impacts are primarily due to modulation of one receptor/MSN circuit. The D1 and D2 subclasses of DA receptors mediate different latent processes underlying decision-making (Kwak et al., 2014; E. Lee et al., 2015; Tecuapetla et al., 2016; Collins and Saunders, 2020; Delevich et al., 2022). Numerous studies have implicated striatal D2R-expressing neurons in changes in decision noise or action selection (Kwak et al., 2014; E. Lee et al., 2015; Kwak and Jung, 2019; Delevich et al., 2022). D2R antagonism increased noise in the striatal representation of value, whereas D1R antagonism did not affect RL (E. Lee et al., 2015). Similarly, Kwak et al. (2014) found increased noise in D2R-KO mice but minimal reward performance deficits in D1R-KO mice. However, there are also mixed results regarding receptor-specific effects on RL. Kwak and Jung (2019) proposed that D1R inactivation increased noise and D2R inactivation reduced reward learning. Delevich et al. (2022) found that enhancing D2R-expressing iSPN activity and reducing D1R-expressing dSPN activity in the dorsomedial striatum (DMS) increased noise and random exploration. In our study, systemic dopamine antagonism increased decision noise and exploration, which is more aligned with functions mediated primarily via D2R signaling. However, since the current study used systemic manipulations with nonselective receptor agonists/antagonists, we were unable to further test receptor-specific or circuit-specific hypotheses. Future studies could use receptor-specific drugs or more precise manipulations, such as DREADDs and optical techniques, to investigate further. Our results demonstrate that the behavioral paradigm and computational models developed here allow for the dissociation of latent cognitive processes underlying decision-making and can be used to inform hypotheses and experimental design in the future.
Surprisingly, we found that modulating beta-noradrenergic receptor activity changed the level of HMM-inferred exploration via changes in outcome sensitivity. Propranolol increased outcome sensitivity, as seen in increased win-stay and lose-shift. In contrast, isoproterenol decreased outcome sensitivity, as seen in reduced win-stay and lose-shift. Previous studies have proposed that the locus ceruleus-norepinephrine (LC-NE) system regulates exploration by changing decision noise (Kane et al., 2017; R. C. Wilson et al., 2021). Other studies have echoed the finding that norepinephrine levels predicted the level of random exploration (Jepma and Nieuwenhuis, 2011; Ebitz and Moore, 2018; Joshi and Gold, 2020). In this study, we did not find any changes in the noise parameter of the RL models (the precision of value-based choice selection) associated with up- or downregulation of beta-noradrenergic activity. Because our manipulations were nonspecific modulators of beta receptors, it remains a caveat that they could have altered noradrenergic signaling more broadly, a focus of future research. How can we reconcile our observation of beta-adrenergic impacts on learning rates with previous experiments pinpointing noise parameters? We have previously shown that both the learning rate and the inverse temperature in RL models can influence the overall level of exploration (Chen et al., 2021b), implying that the interpretation of computational model parameters should depend on the context in which they are used (Eckstein et al., 2022). Here, our result is consistent with the broader prior evidence that increasing norepinephrine increases exploration (R. C. Wilson et al., 2021) but suggests that the “noisier” decisions we observed were a result of changes in the learning rate, which is also partially captured by the model-free metric of outcome sensitivity. This supports the idea that exploration could be driven by long-term changes in the influence of reward. Work on beta-noradrenergic signaling in the context of aversive learning has highlighted a role for noradrenergic signaling in extinction (Iordanova et al., 2021), over a phasic timeframe of minutes to hours. This suggests the intriguing possibility that the effect of noradrenergic blockade here may have been to prevent extinction of learned values. In short, although the most common associations in the computational neuroscience literature are between dopamine and reward learning versus norepinephrine and decision noise, our current data suggest a more nuanced role for each of these modulators.
Adding a further layer of nuance, we found that sex differences are a major axis of variability in the effect of beta-noradrenergic receptor modulation on the explore–exploit tradeoff. Isoproterenol decreased exploration only in males, while the beta blocker propranolol decreased exploration especially in females. Past studies have reported sex differences in how stress engages the locus ceruleus-norepinephrine (LC-NE) system (Bangasser et al., 2016), an effect primarily mediated by β-adrenergic receptor signaling (T-H. Kim et al., 2019). Females have been reported to have higher tonic NE levels due to structural differences and estrogen-mediated effects (Pinos et al., 2001; Bangasser et al., 2016). One possibility is that baseline sex differences in NE levels influence the modulatory effect of our manipulations. A nonlinear, inverted-U-shaped dose effect on cognitive performance has been reported for noradrenergic activity (Aston-Jones and Cohen, 2005; Baldi and Bucherelli, 2005; Redish et al., 2007), beta-adrenergic receptor activity (Campbell et al., 2008; de Rover et al., 2015), and stress (Yerkes and Dodson, 1908; Ross and Van Bockstaele, 2020). However, the length of the current study did not allow for testing multiple drug dosages. Future studies could conduct a formal dose–response experiment to examine how sex differences in baseline NE levels interact with different strengths of beta-noradrenergic receptor manipulation to influence exploration and RL strategy.
Footnotes
This work was supported by the National Institute of Mental Health (NIMH) R01 MH123661 (N.M.G.), NIMH P50 MH119569 (N.M.G.), and NIMH T32 Training Grant MH115886 (C.S.C.), startup funds from the University of Minnesota (N.M.G.), a Young Investigator Grant from the Brain and Behavior Foundation (R.B.E.), an Unfettered Research Grant from the Mistletoe Foundation (R.B.E.), and Fonds de Recherche du Québec Santé, Chercheur-Boursier Junior 1, #284309 (R.B.E.). We thank Dr. David Redish and Madison J. Merfeld for comments improving this manuscript. We thank Anila Bano for helping with the experiment.
The authors declare no competing financial interests.
- Correspondence should be addressed to Cathy S. Chen at chen5388@umn.edu or Nicola M. Grissom at ngrissom@umn.edu.