Abstract
Dopamine D2/3 receptor signaling is critical for flexible adaptive behavior; however, it is unclear whether D2, D3, or both receptor subtypes modulate precise signals of feedback and reward history that underlie optimal decision making. Here, PET with the radioligand [11C]-(+)-PHNO was used to quantify individual differences in putative D3 receptor availability in rodents trained on a novel three-choice spatial acquisition and reversal-learning task with probabilistic reinforcement. Binding of [11C]-(+)-PHNO in the midbrain was negatively related to the ability of rats to adapt to changes in rewarded locations, but not to the initial learning. Computational modeling of choice behavior in the reversal phase indicated that [11C]-(+)-PHNO binding in the midbrain was related to the learning rate and sensitivity to positive, but not negative, feedback. Administration of a D3-preferring agonist likewise impaired reversal performance by reducing the learning rate and sensitivity to positive feedback. These results demonstrate a previously unrecognized role for D3 receptors in select aspects of reinforcement learning and suggest that individual variation in midbrain D3 receptors influences flexible behavior. Our combined neuroimaging, behavioral, pharmacological, and computational approach implicates the dopamine D3 receptor in decision-making processes that are altered in psychiatric disorders.
SIGNIFICANCE STATEMENT Flexible decision-making behavior is dependent upon dopamine D2/3 signaling in corticostriatal brain regions. However, the role of D3 receptors in adaptive, goal-directed behavior has not been thoroughly investigated. By combining PET imaging with the D3-preferring radioligand [11C]-(+)-PHNO, pharmacology, a novel three-choice probabilistic discrimination and reversal task and computational modeling of behavior in rats, we report that naturally occurring variation in [11C]-(+)-PHNO receptor availability relates to specific aspects of flexible decision making. We confirm these relationships using a D3-preferring agonist, thus identifying a unique role of midbrain D3 receptors in decision-making processes.
Introduction
Decision-making processes are compromised in individuals with psychiatric disorders, such as schizophrenia and addiction, and are thought to underlie the behavioral problems that are observed in individuals with these debilitating disorders (Jentsch and Taylor, 1999; Millan et al., 2012; Lee, 2013). The development of novel pharmacological techniques that improve flexible, goal-directed behaviors has been of interest as a source of potential therapeutics for psychiatric disorders (Etkin et al., 2013). However, the neural mechanisms underlying flexible and adaptive decision-making strategies are not fully understood.
One of the most established laboratory tasks for indexing flexible, goal-directed behaviors is the reversal-learning paradigm (for review, see Jentsch et al., 2014). In this paradigm, subjects learn to make a choice to obtain a desired outcome. Once the stimulus-reward association is learned, the stimulus-reward relationship is changed, or reversed, and subjects have to adapt their responses based on the new stimulus-reward relationship. Pharmacological and neuroimaging studies have indicated that the dopamine D2/3 receptor system is critical for reversal learning. Antagonism of D2/3 receptors impairs reversal performance in nonhuman primates and pigeons (Lee et al., 2007; Herold, 2010) and D2/3 receptor availability covaries with reversal performance in nonhuman primates (Groman et al., 2011, 2014). Therefore, D2/3 receptor dysfunction, observed in individuals with psychiatric disorders, such as addiction (Volkow et al., 2001; Lee et al., 2009), may be the mechanism by which rigid, inflexible behaviors emerge (Groman and Jentsch, 2012).
It is unclear whether the relationships between D2/3 receptors and reversal learning are due to D2, D3, or both receptor subtypes (Boulougouris et al., 2009). There is evidence implicating both. Mice lacking the D2 receptor are impaired in reversal-learning tasks (Kwak et al., 2014), whereas mice lacking the D3 receptor have better reversal performance compared with littermate controls (Glickstein et al., 2005). Flexible decision-making mechanisms may, therefore, rely on the balance of dopamine-mediated signaling through D2 and D3 receptors. Only a few studies have investigated a selective role of D3 receptors in decision-making processes, and these have focused on the impact that larger/uncertain rewards have in promoting riskier choices (St Onge and Floresco, 2009; Stopper et al., 2013). Moreover, no studies have combined neuroimaging, pharmacological, and computational approaches to assess the precise role for dopamine D3 signaling mechanisms in flexible decision-making strategies.
The current study sought to examine the relationship between naturally occurring variation in D3 receptor availability, using PET with the D3-preferring ligand [11C]-(+)-PHNO (Searle et al., 2010; Tziortzi et al., 2011), and decision-making of rats in a three-choice spatial acquisition and reversal-learning task with probabilistic reinforcement. We hypothesized, based on previous work (Glickstein et al., 2005), that higher [11C]-(+)-PHNO binding in the midbrain, where binding is exclusively due to D3 receptors (Rabiner et al., 2009; Searle et al., 2010), would be associated with poorer performance following a reversal of stimulus-reward association. In this study, we found that midbrain [11C]-(+)-PHNO binding was negatively related to the ability of rats to reverse, but not acquire, a stimulus-reward association. Computational modeling also indicated that greater [11C]-(+)-PHNO binding in the midbrain was associated with a lower rate of learning and lower sensitivity to positive feedback following a reversal. Similarly, administration of the D3-preferring agonist, pramipexole, impaired performance of rats in the reversal phase, decreased the learning rate, and reduced sensitivity to positive outcomes, confirming the PET-based relationships. These results indicate that inflexible decision-making strategies may reflect high D3 receptor availability, and implicate the D3 receptor as a potential target for the treatment of decision-making problems in psychiatric disorders (Neisewander et al., 2014).
Materials and Methods
Subjects.
Thirteen male Long–Evans rats (Charles River), ranging from 7 to 9 weeks of age, were used in the neuroimaging portion of this study. Nineteen additional male Long–Evans rats (Charles River) were used in the pharmacological study. Rats were pair housed in a climate-controlled vivarium and maintained on a 12 h light/dark cycle (lights on at 7:00 A.M.; lights off at 7:00 P.M.). The diet was restricted to maintain body weight at ∼90% of free-feeding weight throughout the experiment. Water was available ad libitum, except during behavioral assessments (1–2 h per day). The experimental protocols were consistent with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and were approved by the Institutional Animal Care and Use Committee at Yale University.
Operant testing apparatus.
Operant behavior was assessed in standard aluminum and Plexiglas operant conditioning chambers. They were equipped with a photocell pellet-delivery magazine and a curved panel with five photocell-equipped noseports on the opposite side (Med Associates) and housed inside of sound-attenuating cubicles, with background white noise being broadcast and a house light illuminating the environment.
Drugs.
The dopamine D3-preferring receptor agonist (S)-2-amino-4,5,6,7-tetrahydro-6-(propylamino)benzothiazole dihydrochloride (pramipexole dihydrochloride) was purchased from Sigma-Aldrich. Pramipexole was dissolved in sterile 0.9% sodium chloride daily and administered subcutaneously at a volume of 1 ml/kg 10 min before each testing session.
Magazine and noseport training.
Following 7 d of dietary restriction, rats were trained in a single 30 min session to obtain rewards (45 mg sucrose pellets; BioServ) from the magazine receptacle. The following day, rats were trained to make nosepoke entries into illuminated noseport receptacles to earn rewards. The magazine receptacle was illuminated at the beginning of each trial, and rats were trained to make a response (a single nosepoke) into the illuminated magazine to illuminate a single noseport on the opposite panel (one of the three interior noseport apertures). A nosepoke response into the illuminated noseport for a duration of at least 0.25 s resulted in delivery of a single reward and 1 s illumination of the magazine receptacle. Nosepoke responses into nonilluminated noseports resulted in a 10 s timeout in which all cues were extinguished. If rats failed to make a nosepoke response within 2 min, the trial was terminated and an omission was recorded followed by a 10 s timeout. Following a 5–7 s intertrial interval, the magazine was illuminated and rats could initiate another trial. Sessions terminated when rats completed 100 trials or 60 min had lapsed, whichever occurred first. If rats did not earn at least 60 rewards (e.g., 60 correct nosepoke responses) during the session, the session was repeated, using the same nosepoke duration requirement, the following day(s) until the performance criterion was met. Rats were then trained to make nosepoke responses that had durations of at least 0.5 and 1 s in separate sessions using the procedures described above, and the performance criterion was increased to 80 and 85 earned rewards, respectively. These noseport training procedures typically take between 4 and 7 d to complete.
Between-sessions acquisition and reversal training.
Next, rats were trained to acquire and reverse probabilistically reinforced three-choice spatial discrimination problems between individual sessions. Three-choice reversal paradigms offer an advantage over two-choice paradigms by allowing a direct quantification of the types of errors subjects make (Izquierdo and Jentsch, 2012). In these sessions, a response into the magazine aperture resulted in the illumination of three noseports (three interior ports) and rats could respond to any of the noseports to earn a potential reward. Each of the noseports was associated with a fixed probability of reward delivery (70%, 30%, or 30%) for the duration of the session. Initially, rats were required to learn which one of the three illuminated noseports was associated with the highest probability of reinforcement (70%) solely through trial and error (acquisition phase). Approximately 10% of all initiated trials were “forced trials” in which only one randomly chosen noseport was illuminated. These forced trials ensured that rats sampled all noseports. Correct responses during forced trials were reinforced using the probabilities assigned to that particular noseport. The remaining 90% of initiated trials were “free trials” in which three noseport apertures were illuminated and rats could respond to any one of the noseports to earn probabilistically delivered rewards. Sessions terminated when the performance criterion was met (80% of the last 40 responses were made to the most frequently reinforced noseport), when 200 trials had been completed or 60 min had lapsed, whichever occurred first. If rats did not meet the performance criterion, the same stimulus-reward probabilities were presented the following day(s) until the criterion was met.
Once rats acquired the stimulus-reward association, they were presented with the same spatial discrimination problem, but the stimulus-reward relationship was switched between two of the noseports: the noseport previously associated with the highest probability of reinforcement (70%) was now associated with the lowest probability of reinforcement (30%), and one of the noseports previously associated with the lowest probability of reinforcement (30%) was now associated with the highest probability of reinforcement (70%). This reversal session terminated once the performance criterion was met or was repeated the following day(s) until the performance criterion was met. Following the first reversal, the stimulus-reward probabilities were once again reversed between two of the noseports and choice behavior of rats assessed until the performance criterion was met. These between-session acquisition and reversal training sessions served to introduce rats to changes in noseport-reinforcement relationships.
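The performance criterion used throughout training (80% of the last 40 responses made to the most frequently reinforced noseport) can be sketched as a small check. This is a minimal illustration, not the authors' task code; the function name, the port labels, and the decision to return `False` when fewer than 40 responses exist are assumptions.

```python
def criterion_met(choices, best_port, window=40, threshold=0.80):
    """Return True if at least `threshold` of the last `window` choices
    were made to `best_port` (the most frequently reinforced noseport).

    Assumption: the criterion cannot be met before `window` responses
    have been made; the text does not say how shorter runs are handled.
    """
    if len(choices) < window:
        return False
    recent = choices[-window:]
    hits = sum(1 for c in recent if c == best_port)
    return hits / window >= threshold


# Usage with hypothetical port labels 1-3, where port 1 is the 70% port.
session = [2, 3, 1, 1] + [1] * 32 + [2] * 8
print(criterion_met(session, best_port=1))
```

In this usage example, 32 of the final 40 responses go to port 1, which is exactly 80% and therefore satisfies the criterion.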
Within-sessions acquisition and reversal training.
The ability of rats to acquire and reverse probabilistically reinforced, three-choice spatial discrimination problems within a single 240 trial session was then assessed (Fig. 1). Reinforcement probabilities (54%, 18%, or 6%) were randomly assigned to each noseport at the beginning of each session, and rats had 120 trials to learn which one of the three noseports was associated with the highest probability of reinforcement (referred to as the Acquisition Phase). Once rats completed 120 trials, the reinforcement probabilities for the noseports were changed, increasing for two of the noseports (6%–64% and 18%–36%) and decreasing for one noseport (54%–16%). Choice behavior was assessed for an additional 120 trials (referred to as the Reversal Phase). Sessions terminated when rats completed 240 trials or 75 min had lapsed, whichever occurred first. Approximately 20% of initiated trials (∼40 trials) were forced trials in which only one noseport aperture was illuminated. This was done to ensure that rats sampled all available noseports. Correct responses on forced trials were reinforced using the reinforcement probabilities assigned to each noseport. Rats completed between 10 and 13 of these within-session PRL-LP sessions.
Diagram of the PRL-HP task. Trials are initiated when a rat makes a 1 s noseport response into the magazine. Three noseports on the opposite panel are illuminated (stars), and rats can respond to any of the three apertures to earn probabilistically delivered rewards. If no response is made within 20 s, the trial is terminated and an omission recorded.
Our initial results indicated that performance of rats in the reversal phase of the PRL-LP was not significantly better than chance (see Results). To potentially improve reversal performance, rats were assessed on a within-session probabilistic acquisition and reversal-learning task using procedures similar to those in the PRL-LP, but with reinforcement probabilities that spanned a greater range. In these PRL-HP sessions, the probabilities of reinforcement for noseports during the acquisition phase were 72%, 24%, or 8% and changed to 16%, 36%, or 64% during the reversal phase. Rats completed 28 of these PRL-HP sessions. Because the difference between the noseport reinforcement probabilities was greater in the PRL-HP than in the PRL-LP, performance of rats in both the acquisition and reversal phase in the PRL-HP was expected to be better than that in the PRL-LP.
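The PRL-HP reinforcement schedule can be sketched as a lookup that switches at trial 120. This is an illustrative sketch only: the port labels are hypothetical, and the port-by-port mapping across the reversal (72% to 16%, 24% to 36%, 8% to 64%) is an assumption based on the pattern described for the PRL-LP, since the text lists the PRL-HP probabilities without pairing them.

```python
import random

# Reinforcement probabilities reported for the PRL-HP.
# Keys are hypothetical labels for each port's rank during acquisition.
ACQUISITION = {"high": 0.72, "mid": 0.24, "low": 0.08}
# Assumed pairing after the reversal (not stated port-by-port in the text).
REVERSAL = {"high": 0.16, "mid": 0.36, "low": 0.64}

REVERSAL_TRIAL = 120  # reversal occurs after 120 completed trials


def reward_probability(port, trial):
    """Probability of reward for `port` on zero-based trial index `trial`."""
    table = ACQUISITION if trial < REVERSAL_TRIAL else REVERSAL
    return table[port]


def deliver_reward(port, trial, rng):
    """Draw a probabilistic outcome for one response."""
    return rng.random() < reward_probability(port, trial)


rng = random.Random(0)
outcomes = [deliver_reward("high", t, rng) for t in range(240)]
print(sum(outcomes[:120]), sum(outcomes[120:]))
```

A rat that keeps responding to the originally best port would be rewarded far less often after trial 120, which is what makes perseverative responding costly in this design.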
Over the course of extensive training, some rats (N = 6) began responding only to one particular noseport and/or avoiding some noseports regardless of changes in reinforcement probabilities. If a rat exclusively selected and/or avoided the same noseport(s) for three consecutive sessions, it was trained on a program in which the avoided noseport was associated with a 70% chance of reinforcement and the preferred noseport was associated with a 30% chance of reinforcement. Once the performance criterion was met (80% of the last 40 responses were at the noseport associated with the highest probability of reinforcement), the rat was returned to the version of the PRL task (LP or HP) on which it had previously been assessed. Four rats required 1 d of bias correction, and two rats required 3 d of bias correction.
Data processing.
In some sessions, rats did not complete 240 trials. If rats failed to complete at least 180 trials (e.g., 120 trials during the acquisition phase and 60 trials during the reversal phase), data from that session were excluded from further analysis.
The primary dependent measures for the PRL-LP and PRL-HP sessions were the percentage of trials in which rats chose the highest reinforced noseport during the acquisition (the first 120 trials completed) and reversal phase (the last 120 trials completed). The percentage of trials in which a perseverative response was made (e.g., a response on the noseport associated with the highest probability of reinforcement during the acquisition phase) and an intermediate response was made (e.g., a response to the noseport associated with an intermediate probability of reinforcement) during the reversal phase was also calculated. The data presented here are the mean ± SEM, unless otherwise stated, of the dependent measures collected in the last five PRL-HP sessions completed by each rat for which these criteria were met (Groman et al., 2011). The dependent measures collected from the PRL-HP sessions were compared with [11C]-(+)-PHNO BPND because these sessions were closest in time to the PET scans.
Because of the design of the PRL task, the reversal of the noseport-reinforcement probabilities occurred regardless of whether rats actually learned the noseport-reinforcement probability discrimination during the acquisition phase. To ensure that the relationships detected here were not driven by a failure of rats to acquire the initial discrimination, we conducted a separate set of analyses that only included sessions in which rats completed at least 180 trials and performed better than chance (33.3%) during the acquisition phase.
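The dependent measures and the session inclusion rules described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function and argument names are hypothetical, choices are given as one port label per completed trial, the reversal phase is taken to be every trial after the first 120, and forced trials are not treated separately because the text does not say how they were scored.

```python
def prl_measures(choices, acq_best, rev_best, intermediate, n_acq=120):
    """Percentage-based dependent measures for one session.

    choices      : list of port labels, one per completed trial
    acq_best     : port with the highest reward probability in acquisition
    rev_best     : port with the highest reward probability after reversal
    intermediate : port with the intermediate probability in both phases
    """
    acq, rev = choices[:n_acq], choices[n_acq:]

    def pct(trials, port):
        return 100.0 * sum(1 for c in trials if c == port) / len(trials)

    return {
        "acq_correct": pct(acq, acq_best),
        "rev_correct": pct(rev, rev_best),
        "perseverative": pct(rev, acq_best),  # old best port chosen after reversal
        "intermediate": pct(rev, intermediate),
    }


def include_session(n_completed, acq_correct, min_trials=180, chance=100.0 / 3):
    """Stricter inclusion rule: enough trials and above-chance acquisition."""
    return n_completed >= min_trials and acq_correct > chance


# Hypothetical session: ports labelled "A", "B", "C".
session = ["A"] * 90 + ["B"] * 30 + ["C"] * 60 + ["A"] * 36 + ["B"] * 24
m = prl_measures(session, acq_best="A", rev_best="C", intermediate="B")
print(m, include_session(len(session), m["acq_correct"]))
```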
Computational modeling.
To quantify different aspects of decision-making processes, the choice behavior of rats in the last five PRL-HP sessions was analyzed using a previously validated forgetting Q-learning model (Barraclough et al., 2004; Ito and Doya, 2009). In this reinforcement-learning model, the value for each chosen noseport x (Vx) is updated after each trial (t + 1) according to the following model:
where the learning rate α determines how quickly the value for the chosen noseport decays (i.e., α = 1 indicates that the value is reset every trial), and Δ(t) indicates the change in the value that depends on the outcome from the chosen noseport in trial t. If the outcome of a trial was a reward (e.g., delivery of a sucrose pellet), then the value function of the chosen noseport Vx(t + 1) was updated by Δ(t) = Δ1, the reinforcing strength of a reward. However, if the outcome of a trial was the absence of reward, the value function of the chosen noseport was updated by Δ(t) = Δ2, the aversive strength of a no-reward outcome. The probability of choosing one noseport (e.g., NP1) over the other two noseports (e.g., NP2 and NP3) was calculated according to the softmax function (Eq. 2), using the value function of each respective noseport (VNP) on each trial (t) (Eq. 1) as follows:
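Written out in a form consistent with the definitions above, Eq. 1 and Eq. 2 could take the following shape. This is a reconstruction from the surrounding text rather than a verbatim copy of the printed equations: the decay factor is written as (1 − α) on the assumption that α = 1 must reset the value, and no separate inverse-temperature term is included because only three parameters (α, Δ1, Δ2) are fit.

```latex
\begin{align}
V_x(t+1) &= (1-\alpha)\,V_x(t) + \Delta(t),
\qquad
\Delta(t) =
\begin{cases}
\Delta_1 & \text{if trial } t \text{ was rewarded}\\
\Delta_2 & \text{if trial } t \text{ was not rewarded}
\end{cases}
\tag{1}\\[6pt]
P_{NP1}(t) &= \frac{\exp\!\big(V_{NP1}(t)\big)}
{\exp\!\big(V_{NP1}(t)\big)+\exp\!\big(V_{NP2}(t)\big)+\exp\!\big(V_{NP3}(t)\big)}
\tag{2}
\end{align}
```

Under this form, Δ1 and Δ2 set the scale of the values and therefore also control how deterministic the softmax choice is.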
Trial-by-trial choice data of each rat were fit with three parameters (α, Δ1, and Δ2) selected to maximize the likelihood of each rat's sequence of choices. Parameter estimates of choice behavior may change as a function of task phase (acquisition vs reversal) and differentially relate to [11C]-(+)-PHNO binding. To investigate this possibility, trial-by-trial choices during the acquisition and during the reversal were independently analyzed, resulting in six parameter estimates (acquisition: αACQ, Δ1-ACQ, and Δ2-ACQ; reversal: αREV, Δ1-REV, and Δ2-REV).
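The likelihood that such a fit maximizes can be sketched as below. This is a hedged illustration, not the authors' fitting code: it assumes the value update V ← (1 − α)V + Δ applied only to the chosen noseport, initial values of zero, and a softmax without a separate inverse temperature. All names are hypothetical.

```python
import math


def neg_log_likelihood(params, choices, rewards, n_ports=3):
    """Negative log likelihood of a choice sequence under the sketched model.

    params  : (alpha, delta1, delta2)
    choices : chosen port index (0..n_ports-1) on each trial
    rewards : True/False outcome on each trial
    """
    alpha, delta1, delta2 = params
    values = [0.0] * n_ports  # assumption: values start at zero
    nll = 0.0
    for choice, rewarded in zip(choices, rewards):
        # Softmax over current values (shifted by the max for stability).
        top = max(values)
        exps = [math.exp(v - top) for v in values]
        p_choice = exps[choice] / sum(exps)
        nll -= math.log(p_choice)
        # Update only the chosen port's value (assumption from the text).
        delta = delta1 if rewarded else delta2
        values[choice] = (1.0 - alpha) * values[choice] + delta
    return nll


# Hypothetical data: the rat repeatedly picks port 0 and is rewarded.
choices = [0, 0, 0, 0]
rewards = [True, True, True, True]
print(neg_log_likelihood((0.5, 1.0, -0.5), choices, rewards))
```

Minimizing this function over (α, Δ1, Δ2), separately for the acquisition and reversal trials of each rat, would yield the six parameter estimates described above; any general-purpose optimizer or a coarse grid search could be used for that step.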
[11C]-(+)-PHNO PET scans.
Approximately 2–3 d after completing the 28 PRL-HP sessions, rats underwent PET scans with [11C]-(+)-PHNO on a Focus 220 PET scanner (Siemens). [11C]-(+)-PHNO was synthesized as previously described (Gallezot et al., 2012). Rats were transported to the Yale PET center where they were anesthetized with 2%–5% isoflurane in oxygen for the duration of the PET scan. Vital signs (e.g., respiratory rate) were monitored throughout the scan. A tail-vein catheter was placed, and then two rats were positioned side by side, or one rat was positioned by itself, in the bed of the scanner. A transmission scan (9 min) with 57Co was acquired for attenuation correction. Rats then received a bolus injection of [11C]-(+)-PHNO (injected activity: 0.37 ± 0.07 mCi; injected mass: 0.00016 ± 0.00002 mg/kg) and dynamic data were acquired for 120 min. After completion of the scan, rats were removed from gas anesthesia and allowed to recover before being returned to the vivarium.
Reconstruction of PET images.
Three-dimensional sinogram files were created by binning these data into a total of 30 frames (1 × 30 s, 5 × 60 s, 1 × 90 s, 1 × 120 s, 1 × 210 s, and 21 × 300 s). Emission files in list mode were reconstructed using the motion-compensation ordered subset expectation maximization list-mode algorithm for resolution-recovery reconstruction, which includes corrections for normalization, dead time, scatter, and attenuation (Carson et al., 2003). The resultant dynamic images had voxel dimensions of 0.949 × 0.949 × 0.796 mm and matrix dimensions of 256 × 256 × 95.
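The framing scheme can be expanded into explicit frame start times and durations with a few lines of arithmetic. The sketch below only restates the schedule given above; the variable and function names are hypothetical.

```python
# (number of frames, frame duration in seconds), as listed in the text.
FRAME_SCHEDULE = [(1, 30), (5, 60), (1, 90), (1, 120), (1, 210), (21, 300)]


def expand_frames(schedule):
    """Return (start_times, durations, total_seconds) for a framing scheme."""
    durations = [dur for count, dur in schedule for _ in range(count)]
    starts = []
    elapsed = 0
    for dur in durations:
        starts.append(elapsed)
        elapsed += dur
    return starts, durations, elapsed


starts, durations, total = expand_frames(FRAME_SCHEDULE)
midpoints = [s + d / 2 for s, d in zip(starts, durations)]
print(len(durations), total, total / 60)
```

The listed frames number 30 and sum to 7050 s (117.5 min) of the 120 min acquisition. The frame midpoints are the time values typically paired with each frame when plotting or fitting time-activity curves.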
Calculation of [11C]-(+)-PHNO binding potential (BPND).
Three-dimensional PET images were coregistered to a [11C]-(+)-PHNO BPND template image developed in house using tools within the FSL suite (FMRIB Software Library, version 4.0) (Smith et al., 2004). The [11C]-(+)-PHNO BPND template was coregistered to a T2-weighted structural magnetic resonance image template (Nie et al., 2013) using the PFUS module within PMOD (version 3.15; PMOD Technologies), and the resultant transformation was applied to all PET images across time for each rat. ROIs were drawn on the magnetic resonance image (Nie et al., 2013) (see Fig. 3) using FSL View (FMRIB Software Library, version 4.0), and activity concentration was extracted from three ROIs: dorsal striatum, midbrain and cerebellum. Although the ventral striatum has a high density of D3 receptors, high activity of [11C]-(+)-PHNO in the dorsal striatum has the potential to contaminate neighboring brain regions, such as the ventral striatum. As such, binding potential estimates in the ventral striatum, particularly in rats, may not represent as true a measure of D3 receptor availability as the midbrain, which does not suffer from these contamination issues. Therefore, we restricted our analysis to [11C]-(+)-PHNO binding potential in the midbrain as a measure of brain D3 receptors. Time-activity curves from each ROI were fit with the simplified reference tissue model (SRTM) (Lammertsma et al., 1996) in the PKIN module of PMOD (version 3.15; PMOD Technologies) to provide an estimate of R1, BPND, and k2′, the rate constant of tracer transfer from the reference region (cerebellum) to plasma. Using the k2′ estimate in the dorsal striatum (the region with the highest activity), time-activity curves were refitted using the SRTM2 model with the average, fixed k2′ value applied to all brain regions (Wu and Carson, 2002) to provide an estimate of nondisplaceable binding potential (BPND). One rat was excluded from the analysis due to poor fitting of time activity curves.
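For reference, the operational equation of the simplified reference tissue model is commonly written as shown below, where C_T(t) is the activity in the target region, C_R(t) is the activity in the reference region (here the cerebellum), and ⊗ denotes convolution. The notation is the standard textbook form and is given as a reading aid; it is not taken from the analysis software used in the study.

```latex
\begin{equation}
C_T(t) = R_1\,C_R(t)
+ \left(k_2 - \frac{R_1\,k_2}{1 + BP_{ND}}\right)
C_R(t) \otimes \exp\!\left(-\frac{k_2\,t}{1 + BP_{ND}}\right),
\qquad
k_2' = \frac{k_2}{R_1}
\end{equation}
```

In SRTM2, k2′ is fixed to a single value across regions, so only R1 and one further rate constant are estimated per region, which generally reduces the variance of the BPND estimates.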
Pharmacological effects of a D3-preferring agonist on PRL performance.
To provide support for the relationships detected between [11C]-(+)-PHNO binding and reversal learning, the effects of the D3-preferring agonist (pramipexole dihydrochloride) on PRL performance were examined in a separate cohort of rats (N = 19). This experimental group was trained using procedures similar to those described above but did not receive training on the PRL-LP. Instead, this group of rats received an additional 12 d of training on the PRL-HP. Using a within-subjects, Latin square design, rats received a subcutaneous injection of pramipexole (0.01 or 0.05 mg/kg) or vehicle (0.9% sodium chloride) 10 min before performance in the PRL was assessed. Doses were separated by at least 2 d, during which no behavioral testing was conducted.
Statistical analyses.
All statistical analyses were completed using SPSS (version 21, IBM). Reliability was assessed using Cronbach's α, a measure of internal consistency. Paired t tests were used to compare the performance and choices of rats between the acquisition and reversal phases. The statistical relationships between the dependent measures were examined using the Pearson product-moment correlation coefficient and tested against a Student's t distribution. The fits of the computational models to the trial-by-trial choices of rats were estimated using a pseudo-R2 (Camerer and Hua Ho, 1999; Daw et al., 2006). Using McFadden's approach (McFadden, 1974), the pseudo-R2 was calculated as (r − m)/r, where m and r are the log likelihoods of the data under the model and under random choices (a probability of 0.33 for each noseport on all trials), respectively. The behavioral effects of pramipexole were assessed using repeated-measures ANOVA. All significant drug effects were examined using paired t tests comparing measures collected following administration of vehicle with those collected after each dose of pramipexole.
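The pseudo-R2 calculation can be illustrated with a short sketch. It assumes that the random-choice log likelihood is the number of trials multiplied by log(1/3) for a three-choice task; the function name and example numbers are hypothetical.

```python
import math


def pseudo_r2(model_log_lik, n_trials, n_options=3):
    """McFadden-style pseudo-R2: (r - m) / r.

    model_log_lik : log likelihood of the choices under the fitted model (m)
    n_trials      : number of choices that entered the fit
    The random-choice log likelihood r assumes each option is chosen
    with probability 1 / n_options on every trial.
    """
    random_log_lik = n_trials * math.log(1.0 / n_options)
    return (random_log_lik - model_log_lik) / random_log_lik


# Hypothetical example: 120 reversal trials, model log likelihood of -100.
print(round(pseudo_r2(-100.0, 120), 3))
```

A value of 0 means the model predicts choices no better than random guessing, and values approaching 1 mean the model assigns high probability to the choices the rat actually made.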
Results
Training results
Rats required 2.31 ± 0.17, 1.0 ± 0, and 1.23 ± 0.12 sessions to reach criterion in the 0.25, 0.5, and 1 s noseport training phases, respectively. For the between-session acquisition and reversal training, rats required 276 ± 42 trials to meet the performance criterion for the initial discrimination, 291 ± 51 trials for the first reversal, and 375 ± 59 trials for the second reversal.
Performance in the PRL-LP sessions
Rats completed 230 ± 3.24 trials per session. The percentage of trials in which rats chose the noseport with the highest probability of reinforcement was 65.0 ± 4.2% during the acquisition phase and 38.1 ± 3.9% during the reversal phase. Rats chose the noseport with the highest probability of reinforcement during the acquisition phase significantly more than in the reversal phase (t(12) = 4.83; p < 0.001), indicating that they found the reversal phase more difficult than the acquisition phase. Although rats chose the highest reinforced noseport at a frequency significantly greater than chance during the acquisition phase (t(12) = 7.61; p < 0.001), this was not true for choice behavior during the reversal phase (t(12) = 1.30; p = 0.22). In an attempt to improve performance of rats in the reversal phase, the behavior of rats was assessed in the PRL-HP task using reinforcement probabilities that spanned a greater range than that used in the PRL-LP task.
Performance in the PRL-HP sessions
The percentage of trials in which rats chose the noseport associated with the highest probability of reinforcement for the last five sessions completed was reliable in both phases of the task (acquisition: Cronbach's α = 0.74; reversal: Cronbach's α = 0.63), indicating that performance was stable before the PET scans were conducted. As such, the remaining analyses used the average of the dependent measures collected in the last five sessions because they provide the estimate of behavioral performance immediately before the PET scan. Rats completed 234 ± 1.46 trials per session. The percentage of trials in which rats chose the noseport with the highest probability of reinforcement was 73.3 ± 2.7% during the acquisition phase and 47.7 ± 3.9% (mean ± SEM) during the reversal phase (Fig. 2A,B). The percentage of trials in which rats chose the highest reinforced noseport during both the acquisition and reversal phases occurred at a frequency significantly greater than chance (acquisition phase: t(64) = 15.12; p < 0.001; reversal phase: t(64) = 3.83; p < 0.001; Figure 2C), indicating that rats learned to acquire, as well as reverse, the discrimination problems.
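The reliability index used here, Cronbach's α, can be computed from a rats-by-sessions table of scores. The sketch below uses the standard formula with sample variances; it is illustrative only and is not the SPSS routine used in the study. The example data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores : list of rows, one row per rat, one column per session
             (for example, percent correct in each of the last five sessions).
    """
    k = len(scores[0])  # number of sessions (items)
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: three rats whose scores are identical across two sessions.
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```

When rats keep the same rank order from session to session, as in this toy example, α approaches 1; inconsistent ordering across sessions pulls it toward 0.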
Performance of rats in the PRL-HP task. A, Average response probability (10 trial moving window ± SEM) of all rats (solid lines) and that predicted by the forgetting Q-learning model (dashed lines) (Barraclough et al., 2004) during the Acquisition Phase (first 120 trials) and those during the Reversal Phase (last 120 trials). B, Average probability (± SEM) that rats chose the highest reinforced noseport (P(NPhighest reinforced)) during the acquisition phase (red bar) was significantly greater than that during the reversal phase (green bar). C, Average probability (± SEM) that rats made a perseverative response (red bar) or an intermediate response (blue bar) following a reversal. ***p < 0.001.
As hypothesized, the percentage of trials in which rats chose the highest reinforced noseport during the reversal phase of the PRL-HP was significantly greater than that in the PRL-LP (t(64) = 2.99; p = 0.01). Nevertheless, the percentage of trials in which rats chose the noseport with the highest probability of reinforcement in the reversal phase was significantly less than during the acquisition phase (t(64) = 6.25; p < 0.001), indicating that, despite receiving extensive training on the PRL-HP, rats still found the reversal phase more difficult than the acquisition phase. Following the reversal, the percentage of trials in which a rat made a response to the noseport that was previously associated with the highest probability of reinforcement during the acquisition phase (e.g., a perseverative response) was 29.5 ± 3.1% and to the noseport that was always associated with an intermediate level of reinforcement (e.g., an intermediate response) was 22.8 ± 2.8% (Fig. 2C).
Computational model
Table 1 presents the average values for each of the parameter estimates and pseudo-R2 obtained using the forgetting Q-learning model (Barraclough et al., 2004) for choices made during the PRL-HP. The computational model fit the choice behavior of rats substantially better than random choices, as indicated by high pseudo-R2 values.
Parameter estimates (mean ± SEM), negative log likelihood (−LL), and pseudo-R2 obtained when choices made by rats for the whole sessions or the acquisition and reversal phase of the PRL-HP were analyzed using the forgetting Q-learning model
The performance of rats during the acquisition and reversal phase was compared with the parameter estimates computed based on choices during the acquisition and reversal phase, respectively, of PRL-HP. The probability that rats chose the highest reinforced option during the acquisition was not related to any of the parameter estimates (all p values >0.10). However, the probability that rats chose the highest reinforced noseport option following a reversal was positively related to the αREV (r = 0.70; p = 0.007) and the Δ1-REV parameter (r = 0.85; p < 0.001). The Δ2-REV parameter was not significantly related to reversal performance (r = −0.24; p = 0.44).
Comparing [11C]-(+)-PHNO BPND to PRL-HP performance
To examine the role of D3 receptors in reversal learning, [11C]-(+)-PHNO BPND in the dorsal striatum (2.12 ± 0.07; mean ± SEM) and midbrain (0.50 ± 0.05; mean ± SEM; Fig. 3; Table 2) was compared with the choice behavior of rats in the PRL-HP because this behavior was collected closest in time to the [11C]-(+)-PHNO scans (∼2–3 d apart). [11C]-(+)-PHNO BPND in neither the midbrain nor the dorsal striatum was significantly related to the percentage of choices made to the most frequently reinforced noseport during the acquisition phase (all p values > 0.60; Fig. 4A). In contrast, midbrain [11C]-(+)-PHNO BPND was negatively related to the probability of choosing the most frequently reinforced noseport following a reversal (r = −0.68; p = 0.01). This relationship was not observed for [11C]-(+)-PHNO BPND in the dorsal striatum (r = 0.17; p = 0.60; Fig. 4B). Furthermore, [11C]-(+)-PHNO BPND in the midbrain was positively related to the probability that rats would make a perseverative response during the reversal phase (r = 0.79; p = 0.002), but not to the probability that rats would make a response to the intermediate noseport (r = −0.03; p = 0.94; Fig. 4D). [11C]-(+)-PHNO BPND in the dorsal striatum was not related to the percentage of perseverative responses (r = −0.02; p = 0.94) or intermediate responses (r = −0.32; p = 0.31) during the reversal phase (Fig. 4C,D).
[11C]-(+)-PHNO BPND in the rat brain. A, Average [11C]-(+)-PHNO BPND (N = 12) presented in a transverse section (left), and coronal sections at the level of the midbrain (top) and striatum (bottom) overlaid on an magnetic resonance template (Nie et al., 2013). Transparent blue shading represents ROIs for the striatum and midbrain. B, A time-activity curve for a single rat that is presenting the standardized uptake value (SUV) in the dorsal striatum (open circles), midbrain (gray circles), and cerebellum (triangles).
[11C]-(+)-PHNO receptor availability measurements for each individual rat in the dorsal striatum and midbrain
Relationships between [11C]-(+)-PHNO BPND and behavior of rats in the PRL-HP task. A, [11C]-(+)-PHNO BPND in the dorsal striatum (open circles) and midbrain (closed circles) is not related to the probability of choosing the most frequently reinforced noseport (P(NPhighest reinforced)) during the acquisition phase. B, [11C]-(+)-PHNO BPND in the midbrain (closed circles), but not the dorsal striatum (open circles), is related to the probability that rats chose the most frequently reinforced noseport during the reversal phase (r = −0.68; p = 0.01). C, [11C]-(+)-PHNO BPND in the midbrain (closed circles) is related to the probability of making a perseverative response (r = 0.79; p = 0.002), but (D) not to the probability of making an intermediate response (r = −0.03; p = 0.94), during the reversal phase. [11C]-(+)-PHNO BPND in the dorsal striatum (open circles) is not related to either of these measures.
Next, midbrain [11C]-(+)-PHNO BPND was compared with the parameters of the forgetting Q-learning model, obtained from the choice behavior of rats in the PRL-HP. [11C]-(+)-PHNO BPND in the midbrain was negatively related to the learning rate, αREV (r = −0.74; p = 0.006; Fig. 5A) and the strength of reward, Δ1-REV, during the reversal phase (r = −0.69; p = 0.01; Fig. 5B). [11C]-(+)-PHNO BPND in the midbrain was not related to the aversive strength of no reward during the reversal phase, Δ2-REV (r = 0.25; p = 0.43; Fig. 5C). To ensure that these relationships were not driven by the two most extreme points, we used the Fisher r-to-z transformation to compare the correlation coefficients obtained when these subjects were included with those obtained when they were excluded. Exclusion of these two subjects reduced the correlation coefficients, but these coefficients were not significantly different from the original coefficients (z < 0.82 for all comparisons; p > 0.42 for all comparisons), indicating that these two subjects were not driving the relationships reported here.
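The Fisher r-to-z comparison mentioned above can be sketched as follows. This is the textbook version for two correlation coefficients, which treats the samples as independent; the text does not say how the overlap between the full and reduced samples was handled, so this should be read as an approximation. The reduced-sample coefficient used in the example (−0.60, n = 10) is a hypothetical placeholder, because the actual reduced coefficients are not reported.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two correlations.

    Returns the z statistic and a two-tailed p value from the standard
    normal distribution.
    """
    z1 = math.atanh(r1)  # Fisher transformation of each coefficient
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p value
    return z, p


if __name__ == "__main__":
    # r1 and n1 come from the text; r2 and n2 are hypothetical.
    z, p = compare_correlations(-0.74, 12, -0.60, 10)
    print(f"z = {z:.2f}, p = {p:.2f}")
```

With samples this small the standard error of the difference is large, so even a noticeable drop in r after removing two animals would typically not reach significance.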
Midbrain [11C]-(+)-PHNO BPND is related to parameter estimates of choice behavior following a reversal. [11C]-(+)-PHNO BPND in the midbrain is negatively related to (A) the αREV parameter (r = −0.74; p = 0.006) and (B) the Δ1-REV parameter (r = −0.69; p = 0.01). C, [11C]-(+)-PHNO BPND in the midbrain is not related to the Δ2-REV parameter.
The same pattern of results was observed when midbrain [11C]-(+)-PHNO BPND was compared with the performance of rats only in sessions in which at least 180 trials were completed and acquisition performance was significantly better than that expected at chance (i.e., probability of choosing the highest reinforced noseport option was significantly >33%). Midbrain [11C]-(+)-PHNO BPND was negatively related to the percentage of choices that rats made to the highest reinforced noseport during the reversal, but not the acquisition, phase (r = −0.64; p = 0.03), as well as to the αREV (r = −0.74; p = 0.006) and Δ1-REV (r = −0.56; p = 0.06) parameters estimated from the choice behavior of rats during the reversal phase. Therefore, the relationships between midbrain [11C]-(+)-PHNO BPND and reversal performance cannot be accounted for by differences in the ability of rats to acquire the initial discrimination.
To ensure that the correlations detected here were not due to differences in injected mass that potentially materialize as artificial variations in [11C]-(+)-PHNO BPND, the statistical dependencies between the injected mass of [11C]-(+)-PHNO and the dependent measures were examined. No statistically significant relationships were detected (p > 0.24 for all comparisons).
Effects of a D3-preferring agonist on PRL-HP performance
Based on our results with [11C]-(+)-PHNO, we hypothesized that pharmacological activation of D3 receptors with the D3-preferring agonist pramipexole (Piercey et al., 1996) would impair performance of rats in the reversal phase of the PRL-HP while leaving performance in the acquisition phase unaffected. Indeed, administration of pramipexole did not alter the performance of rats in the acquisition phase (F(2,32) = 0.17; p = 0.85; Fig. 6A). Pramipexole had a significant effect on performance during the reversal phase (F(2,32) = 4.91; p = 0.01). Administration of 0.05 mg/kg pramipexole, but not 0.01 mg/kg pramipexole (t = 0.81; p = 0.43), significantly reduced the probability that rats would choose the highest reinforced noseport during the reversal phase (t = 2.27; p = 0.04; Fig. 6A).
The effects of the D3-preferring agonist, pramipexole, on PRL performance. A, Administration of pramipexole did not affect the ability of rats to acquire the spatial discrimination, but 0.05 mg/kg pramipexole impaired the ability of rats to reverse the spatial discrimination. B, Administration of 0.05 mg/kg pramipexole reduced the learning rate (α) and the reinforcing strength of positive outcome (Δ1). *p < 0.05.
The same reinforcement-learning model described above was fit to choice behavior of rats during the reversal phase for each of the drug sessions. Similar to our findings with [11C]-(+)-PHNO, systemic administration of pramipexole significantly affected the αREV (F(2,24) = 3.60; p = 0.04), Δ1-REV (F(2,24) = 3.57; p = 0.04), and Δ2-REV (F(2,24) = 4.18; p = 0.03) parameter estimates (Fig. 6B). Post hoc paired t tests indicated that administration of 0.05 mg/kg pramipexole significantly reduced αREV (t = 2.24; p = 0.04) and Δ1-REV (t = 2.69; p = 0.02) parameter estimates, with a nonsignificant decrease in the Δ2-REV parameter (t = 2.11; p = 0.06).
Discussion
By combining PET neuroimaging and D3 receptor pharmacology with a novel within-session three-choice probabilistic acquisition and reversal-learning task in rats, the current study provides evidence that the dopamine D3 receptor is critically involved in flexible decision-making processes. We report that rats with higher midbrain [11C]-(+)-PHNO BPND have problems reversing a stimulus-reward association, a greater degree of perseveration, lower sensitivity to positive outcomes, and a reduced learning rate for choice values than rats with lower [11C]-(+)-PHNO BPND. Furthermore, a pharmacological challenge confirmed our correlational findings. Systemic administration of a D3-preferring agonist impaired reversal performance, decreased the sensitivity of rats to positive outcomes, and reduced the learning rate. These data provide insight into the role of D3 receptors in flexible choice behavior and are relevant to understanding the decision-making impairments that are common to many psychiatric disorders.
[11C]-(+)-PHNO BPND is related to reversal learning
We have previously reported that D2/3 receptor availability, measured using the high-affinity D2/3 receptor radioligand [18F]fallypride, is associated with individual differences in the ability of monkeys to reverse, but not acquire, a visual discrimination (Groman et al., 2011, 2014). Higher striatal [18F]fallypride BPND was associated with better reversal performance. In the current study, however, we report that higher midbrain [11C]-(+)-PHNO BPND is associated with poorer reversal performance and that systemic administration of a D3-preferring agonist impairs reversal-learning performance. Given the evidence indicating that [11C]-(+)-PHNO BPND in the midbrain is 100% attributable to D3 receptor binding (Rabiner et al., 2009; Graff-Guerrero et al., 2010; Searle et al., 2010), this indicates that the relationship between [11C]-(+)-PHNO BPND and reversal performance reflects the level of D3, rather than D2, receptors. This hypothesis is supported by work indicating that mice lacking the D3 receptor have better reversal performance (Glickstein et al., 2005). Pharmacological studies have also indicated opposing roles of D2 and D3 receptors in cognition. Administration of nonselective D2/3 receptor antagonists impairs reversal-learning performance (Lee et al., 2007; Herold, 2010), whereas selective D3 receptor antagonists have been reported to have procognitive effects (Watson et al., 2012). Together, these studies provide evidence that D2 and D3 receptors have opposing roles in reversal learning and further suggest that goal-directed behaviors may rely on a balance of dopamine signaling through both D2 and D3 receptors within the corticostriatal circuit.
Although our results indicate that midbrain D3 receptors, as indexed by [11C]-(+)-PHNO BPND, play an important role in mediating reversal-learning performance, it is likely that D3 receptors in other brain regions are also critical modulators of decision-making. Intracranial administration of a D3-preferring agonist into the nucleus accumbens decreases reward sensitivity in tasks assessing risky choice behavior (Stopper et al., 2013), consistent with the correlations and effects observed in the current study.
In contrast to previous studies using [18F]fallypride (Groman et al., 2011), we did not detect a significant relationship between dorsal striatal [11C]-(+)-PHNO BPND and reversal-learning performance. Although binding of [11C]-(+)-PHNO in the dorsal striatum is predominantly reflective of D2 receptors (Rabiner et al., 2009; Erritzoe et al., 2014), ∼6%–20% of [11C]-(+)-PHNO BPND is attributable to D3 receptors. It is possible that the lack of correlation between [11C]-(+)-PHNO BPND in the dorsal striatum and reversal learning detected here is due to opposing influences that these receptor subtypes have on flexible decision making. Additional PET studies that combine receptor-specific pharmacology and ex vivo techniques will clarify the role of D2 and D3 receptor subtypes across different brain regions in decision-making processes.
Using a computational model of choice behavior, we found that midbrain D3 receptors are related to the learning rate and sensitivity to positive outcomes following a reversal: rats with greater midbrain D3 receptor availability had lower learning rates and sensitivity to positive outcomes following a reversal than those with less midbrain D3 receptor availability. Both of these parameter estimates were linked to the probability that rats would choose the highest reinforced noseport following a reversal, suggesting that D3-mediated influences on learning rate and sensitivity to positive outcomes are the mechanism by which midbrain D3 receptors impact flexible decisions. The observation that a D3 agonist impaired reversal learning performance, and related work in humans (Santesso et al., 2009), support this conclusion. Together, these studies add to a growing body of literature implicating D3 receptors in reinforcement-learning processes.
Midbrain dopamine D3 receptors act as autoreceptors, regulating the release of dopamine in striatal regions (Koeltzow et al., 1998) which, itself, has been reported to influence reversal-learning performance (O'Neill and Brown, 2007; Cools et al., 2009; Clarke et al., 2011; Groman et al., 2013). It is possible, therefore, that individuals with high midbrain D3 receptor density have low striatal dopamine tone (Tang et al., 1994; Gobert et al., 1996) that reduces their learning rate and sensitivity to positive outcomes, resulting in inflexible, rigid behaviors. Future studies could directly test this prediction.
Implications for psychiatric disorders
Reversal learning is altered in substance-dependent individuals (Ghahremani et al., 2011) and in animals chronically exposed to drugs of abuse (Jentsch et al., 2002; Schoenbaum et al., 2004; Groman et al., 2012). Preexisting differences in reversal-learning performance covary with future drug-taking behaviors (Cervantes et al., 2013), suggesting that inflexible decision-making processes may be both an antecedent and a consequence of addiction (Jentsch and Taylor, 1999; Groman and Jentsch, 2012). The current study found that greater midbrain [11C]-(+)-PHNO BPND was associated with worse reversal-learning performance. It is possible, therefore, that greater midbrain D3 receptor availability is associated with a greater risk for future drug-taking behaviors by impairing the decision-making processes that regulate drug intake.
Previous studies have implicated the D3 receptor in addiction: greater midbrain [11C]-(+)-PHNO BPND and D3 mRNA have been observed in substance-dependent individuals (Staley and Mash, 1996; Boileau et al., 2012; Matuskey et al., 2014; Payer et al., 2014), and antagonism of the D3 receptor reduces drug self-administration in animals (Higley et al., 2011; Song et al., 2012). However, as noted above, it is unknown whether disruptions in D3 receptor signaling are a consequence of chronic drug use or a preexisting difference that increases the likelihood of developing an addiction. We hypothesize, based on the current results and the work of others (Payer et al., 2014; Lobo et al., 2015), that greater preexisting D3 receptor density may enhance addiction vulnerability by impairing the ability of individuals to make flexible, adaptive decisions. Additional studies quantifying, as well as manipulating, D3 receptor density before and after drug use are needed to disentangle these relationships. Moreover, the use of computational models to characterize choice behavior may identify reinforcement learning parameters that are unique to or dissociate specific aspects of decision-making strategies. Such studies in animal models combined with neuroimaging offer a powerful method to disentangle vulnerability factors from those that are a consequence of drug exposures within midbrain corticostriatal circuits to provide insight into the pathophysiology of addiction.
In conclusion, we report here that higher midbrain [11C]-(+)-PHNO BPND and pharmacological stimulation of D3 receptors reduce the ability of rats to reverse probabilistic spatial discriminations, an impairment characterized by a lower learning rate and reduced sensitivity to positive outcomes. These results, obtained with imaging and computational approaches of high translational utility, add to a growing body of work suggesting that D3 receptor dysregulation may underlie the behavioral problems observed in psychiatric disorders and implicate the D3 receptor as a target for the treatment of disorders that are associated with decision-making impairments.
Footnotes
This work was supported by Public Health Service Grants DA011717 and DA027844 to J.R.T., Distinguished Investigator National Alliance for Research on Schizophrenia and Depression Award to J.R.T., Yale Center for Clinical Investigation UL1 TR000142, Research Training Biological Sciences Grant 5T32 MH14276 to S.M.G., and National Science Foundation Graduate Research Fellowship DGE-1122492 to J.R.P. We thank Dr. Edythe London for providing additional imaging software necessary for completing these studies.
The authors declare no competing financial interests.
Correspondence should be addressed to Dr. Jane R. Taylor, Department of Psychiatry, 300 George Street, New Haven, CT 06511. jane.taylor@yale.edu