Orbitofrontal Dopamine Depletion Upregulates Caudate Dopamine and Alters Behavior via Changes in Reinforcement Sensitivity

Schizophrenia is associated with upregulation of dopamine (DA) release in the caudate nucleus. The caudate has dense connections with the orbitofrontal cortex (OFC) via the frontostriatal loops, and both areas exhibit pathophysiological change in schizophrenia. Despite evidence that abnormalities in dopaminergic neurotransmission and prefrontal cortex function co-occur in schizophrenia, the influence of OFC DA on caudate DA and reinforcement processing is poorly understood. To test the hypothesis that OFC dopaminergic dysfunction disrupts caudate dopamine function, we selectively depleted dopamine from the OFC of marmoset monkeys and measured striatal extracellular dopamine levels (using microdialysis) and dopamine D2/D3 receptor binding (using positron emission tomography), while modeling reinforcement-related behavior in a discrimination learning paradigm. OFC dopamine depletion caused an increase in tonic dopamine levels in the caudate nucleus and a corresponding reduction in D2/D3 receptor binding. Computational modeling of behavior showed that the lesion increased response exploration, reducing the tendency to persist with a recently chosen response side. This effect is akin to increased response switching previously seen in schizophrenia and was correlated with striatal but not OFC D2/D3 receptor binding. These results demonstrate that OFC dopamine depletion is sufficient to induce striatal hyperdopaminergia and changes in reinforcement learning relevant to schizophrenia.


Introduction
Modern versions of the dopamine (DA) hypothesis of schizophrenia suggest that important changes in DA function occur at two sites, the striatum and prefrontal cortex (PFC; Weinberger, 1987). In the striatum, increased presynaptic DA synthesis and increased striatal D2 receptors correlate with the magnitude of positive symptoms in schizophrenia (Miyake et al., 2011), and blockade of striatal D2 receptors (Davis et al., 1991; Kapur and Remington, 2001) alleviates such symptoms. Moreover, the onset of psychosis is heralded by changes in DA function specifically within the caudate nucleus (Howes et al., 2009; Fusar-Poli et al., 2010), a key site of the increased D2 receptor availability seen in schizophrenia (Miyake et al., 2011). Decreased D1 receptor neurotransmission in the PFC is proposed to cause the "negative" (cognitive deficit) symptoms (Weinberger, 1987), although DA D3/D4 receptor mRNA is also downregulated in the orbitofrontal cortex (OFC; Meador-Woodruff et al., 1997) and in the cognitive-deficit syndrome of schizophrenia (Kanahara et al., 2013).
This raises the question as to whether these striatal and orbitofrontal changes observed in schizophrenia are causally related. Previous studies have provided evidence for interactions between other prefrontal cortical regions and striatal dopamine activity (Pycock et al., 1980; Roberts et al., 1994; Kolachana et al., 1995; Scornaiencki et al., 2009). Furthermore, the OFC not only innervates the caudate nucleus, but also projects directly and indirectly to the midbrain ascending DA systems (Leichnetz and Astruc, 1975; Haber et al., 1995), where it inhibits ventral tegmental area (VTA) neurons (Lodge, 2011); moreover, glucose metabolism in the OFC correlates with D2 receptor availability in the human striatum (Volkow et al., 2001). Finally, prolonged psychological stress, a known risk factor and trigger for schizophrenia (van Winkel et al., 2008), reduces PFC DA transmission (Mizoguchi et al., 2000) and increases striatal DA uptake (Copeland et al., 2005). However, the specific relationship between DA in the OFC and the striatum has not yet been studied.
Thus, the present study determined whether depletions of dopamine, specifically within the OFC, can cause changes in D2 receptor transmission in the caudate nucleus. In a New World primate, the common marmoset, OFC dopamine was reduced using the neurotoxin 6-hydroxydopamine (6-OHDA) and its effects on striatal DA were assessed using 18F-fallypride positron emission tomography (PET) to quantify D2/D3 receptor binding, and in vivo microdialysis to assess levels of extracellular DA. In addition, the effects of OFC DA reductions were determined on performance of a probabilistic discrimination task in which marmosets had to learn which of two visual stimuli was more associated with reward. Patients with schizophrenia can show two distinct behavioral changes compared with controls on such tasks: they can adopt different strategies, such as switching response location at different rates (Frith and Done, 1983), and they can show altered sensitivity to positive or negative feedback that impacts upon learning (Waltz et al., 2007). How such behavioral changes relate to altered prefrontostriatal DA function is unclear. Therefore, we applied computational reinforcement learning models to subjects' performance to test for changes in either strategy or reinforcement learning.

Overview and behavioral methods
Subjects (n = 7) completed a probabilistic discrimination learning task consisting of multiple discriminations, each comprising two abstract multicolored stimuli presented on a touch-sensitive computer screen as described previously (Clarke et al., 2007). All monkeys were trained to enter a clear plastic transport box for marshmallow reward, familiarized with the testing apparatus, and trained to respond to the touchscreen. They learned through trial and error which stimulus was usually (70 or 80%) associated with a 5 s banana milkshake reward and sometimes punished (30 or 20%) with a 0.3 s 100 dB loud noise, and vice versa (Fig. 1). Subjects completed one session of 40 trials per day, and an individual discrimination was considered learned when they reached a criterion of 90% or more correct choices in one session. A new discrimination was then started the next day. The rate of learning was assessed by calculating how many incorrect choices were made during each discrimination. While learning the preoperative discriminations, they were scanned with 18F-fallypride to assess their D2/D3 receptor nondisplaceable binding potential (BPND; D2RB). Once the task was learned, they underwent a 6-OHDA-induced selective depletion of DA within the OFC or a control procedure. When recovered, they continued with postoperative discriminations, and ~16 weeks after surgery were rescanned with 18F-fallypride to assess their postoperative D2RB and microdialyzed to assess the levels of extracellular DA in the caudate nucleus.

Subjects and housing
Seven common marmosets (Callithrix jacchus; 3 females, 4 males) bred on site at the University of Cambridge Marmoset Breeding Colony were housed in pairs. All monkeys were fed 20 g of MP.E1 primate diet (Special Diet Services) and two pieces of carrot 5 d per week after the daily behavioral testing session, with simultaneous access to water for 2 h. On weekends, their diet was supplemented with fruit, rusk, malt loaf, eggs, bread, and treats and they had ad libitum access to water. Their cages contained a variety of environmental enrichment aids that were varied regularly and all procedures were performed in accordance with the UK Animals (Scientific Procedures) Act, 1986. One sham-operated control subject contributed to imaging data and dialysis (n = 7) but was not part of the behavioral study (n = 6) and its postmortem data were lost due to a freezer malfunction.

Structural magnetic resonance imaging
Subjects were premedicated with ketamine hydrochloride (Pharmacia and Upjohn, 0.05 ml of a 100 mg/ml solution, i.m.) and given a long-lasting prophylactic analgesic (Carprieve; 0.03 ml of 50 mg/ml carprofen, s.c.; Pfizer). The tail vein was cannulated (Intraflon 2 i.v. catheter attached to a Lock Stopper with injectable membrane; Vygon), the cannula was flushed with 0.5 ml saline and 0.25 ml heparinized saline, and the monkey was subsequently intubated and maintained on isoflurane gas anesthetic (flow rate: 2.0-2.5% isoflurane in 0.3 l/min O2; Novartis). The monkey was positioned in the MRI scanner and monitored throughout (pulse oximetry, temperature).
Animals were scanned supine in a Bruker PharmaScan 47/16 system, using a locally built birdcage coil for signal transmission and reception. Structural images were obtained using a RARE sequence optimized for contrast between gray and white matter (TR/TEeff 7455/36 ms, echo train length 8, field-of-view 7.68 × 7.68 cm, matrix 256 × 192, reconstructed to final resolution 300 × 300 μm, 50 slices of thickness 1 mm with gap 0.2 mm). Regions-of-interest (ROIs) were delineated for each subject independently on a slice-by-slice basis by a single expert reviewer (A.C.R.) using Analyze 8.1 (Mayo Clinic). ROIs were drawn for the orbitofrontal cortex, ventrolateral prefrontal cortex, ventromedial caudate, caudate body, dorsolateral caudate, putamen, nucleus accumbens, amygdala, ventral hippocampus, and cerebellum. Upon completion of the MRI scans, the monkeys were transferred to the PET scanner while still unconscious, and the PET scan commenced.
Figure 1 legend (D, discrimination): Representative stimuli are shown, labeled + for correct and − for incorrect. Reinforcement probabilities are shown: for example, "90:10 probability" indicates that P(reward | correct stimulus selected) = P(punishment | incorrect stimulus selected) = 0.9 and P(punishment | correct stimulus selected) = P(reward | incorrect stimulus selected) = 0.1. The intensity of auditory punishment is shown in dB SPL.

18F-Fallypride positron emission tomography
To determine the effects of an OFC hypodopaminergic state on D2 receptor availability in the striatum we used PET imaging with the highly selective dopamine D2/D3 receptor radioligand 18F-fallypride. The high affinity of 18F-fallypride allows the investigation of areas of both high and low D2/D3 receptor density (e.g., the striatum and PFC, respectively; Lataster et al., 2011). The marmoset OFC preferentially innervates the ventromedial caudate in the striatum (Roberts et al., 2007), the caudate being implicated in the increase in D2 receptor availability seen in schizophrenia (Miyake et al., 2011). Thus, the caudate nucleus was an a priori ROI. As it has been demonstrated that baseline striatal DA synthesis (Cools et al., 2009) and D2 receptor binding (Groman et al., 2011) vary between individuals, the monkeys were scanned both before and after surgery so that each monkey could act as its own control when assessing the effects of OFC DA depletion on caudate D2/D3 receptor binding.
Animals were imaged using a microPET P4 scanner (Concorde Microsystems). The brain was located centrally in the field of view of the scanner (78 mm axial × 200 mm diameter) to maximize sensitivity and spatial resolution. The amount of 18F-fallypride injected was governed by the desire to minimize any mass-related perturbation of receptor availability, while also providing adequate counting statistics. Consequently, 0.49 ± 0.04 nmol/kg was injected, which corresponded to an activity range of 5.1-23.1 MBq across the animals. 18F-Fallypride was injected intravenously as a bolus over 10 s, followed by a 10 s heparinized saline flush. List-mode data acquired over 180 min after injection were subsequently histogrammed into the following time frames: 10 × 10 s, 3 × 20 s, 6 × 30 s, 10 × 60 s, 10 × 120 s, and 29 × 300 s. The energy and timing windows used were 350-650 keV and 6 ns, respectively. Before injection, windowed coincidence mode transmission data were collected for 11 min with a rotating 68Ge point source (~100 MBq) to allow measured attenuation correction.
The images were reconstructed using the PROMIS 3D filtered backprojection algorithm (Kinahan and Rogers, 1989), adapted locally for the specific scanner. Corrections for randoms, dead time, background, normalization, attenuation, and sensitivity were applied to the data during reconstruction. Images were reconstructed into 0.5 × 0.5 × 0.5 mm3 voxels in a 180 × 180 × 151 array, and a Hann window cutoff at the Nyquist frequency was incorporated into the reconstruction filters to give an image resolution of ~2.3 mm (full-width, half-maximum). For each scan an added image (120-180 min) was coregistered to its own MRI using rigid coregistration. ROIs delineated on the MRI were applied to the coregistered dynamic PET images to extract ROI time-activity curves (TACs).
ROI nondisplaceable binding potential was estimated from the ROI TACs with the simplified reference tissue model (reference tissue: cerebellum) using basis functions (sRTM; Gunn et al., 1997). One hundred and fifty basis functions, spaced logarithmically in the range 0.009 ≤ θ3 ≤ 0.60 min−1, were used.

Depletion of dopamine from the orbitofrontal cortex
Subjects were premedicated, given an analgesic, and anesthetized as above, before being placed in a stereotaxic frame modified for the marmoset (David Kopf). Anesthesia was monitored clinically and by pulse oximetry with capnography.
Lesions of the dopaminergic innervation of the OFC were made using 6-OHDA (Sigma-Aldrich; 6 μg/μl) in saline/0.1% L-ascorbic acid. To protect the serotoninergic innervation of the OFC from the 6-OHDA, the selective serotonin reuptake inhibitor citalopram (Lundbeck; 5 mg/kg) was administered concomitantly in the infusate. Injections (0.04 μl/20 s) were made into five sites on each side within the OFC, using a 30 gauge cannula attached to a 2 μl Hamilton syringe. All injections were made 0.7 mm above the base of the brain. The coordinates and volumes used were as follows: AP +16.75: LM ±2.5 (0.4 μl) and LM ±3.5 (0.4 μl); AP +17.75: LM ±2.0 (0.4 μl) and LM ±3.0 (0.4 μl); and AP +18.5: LM ±2.0 (0.6 μl), having been adjusted where necessary in situ according to cortical depth (Roberts et al., 2007). Sham surgery was identical except for the omission of the toxin from the infusion. Postoperatively, all monkeys received the analgesic meloxicam (0.1 ml of a 1.5 mg/ml oral suspension; Boehringer Ingelheim) before being returned to their home cage for 10 d of "weekend diet" and water ad libitum to allow complete recovery before returning to testing.

In vivo striatal microdialysis
Following isoflurane anesthesia, commercially available BASi brain microdialysis probes with a 2 mm membrane (BASi MD-2200, BR-2, Bioanalytical Systems) were implanted acutely into the ventromedial (AP +12.5 mm; L 2.3 mm; DV +9.8 mm) and lateral (AP +12 mm; L 3.5 mm; DV +11.0 mm) caudate nucleus and used for collection of the dialysate. Harvard microsyringe pumps with 2.5 ml gas-tight syringes were used to perfuse artificial CSF (aCSF) through the dialysis probe at a flow rate of 1.0 μl/min. The aCSF had the following composition (in mM): NaCl 147, KCl 3.0, CaCl2 1.3, MgCl2 1.0, NaH2PO4 0.2, and Na2HPO4 1.3. After allowing 3 h for the implanted probes to equilibrate, dialysate fractions were collected every 20 min into 2 μl of 0.01 M perchloric acid. After three baseline samples, monkeys received a 75 mM K+ challenge for 10 min, which was followed by a further four baseline samples. Samples were stored at −80°C before being analyzed using reversed-phase high-performance liquid chromatography (HPLC) and electrochemical detection as described previously (Clarke et al., 2007). The signal was integrated using Chromeleon software (v6.2, Dionex). Due to HPLC malfunction there was loss of data from the ventromedial caudate of one monkey and the lateral caudate of another. As the values were similar across the two regions, values from all animals were averaged across the two areas, where available.

Postmortem histochemical assessment
The specificity and extent of OFC DA depletion following 6-OHDA infusions into the OFC was assessed by postmortem analysis of monoamine levels in cortical and subcortical regions 448.75 ± 5.70 (mean ± SEM) days after administration of the neurotoxin, as described previously (Clarke et al., 2007). Tissue samples were homogenized in 200 μl of 0.2 M perchloric acid for 1.5 min and centrifuged at 6000 rpm for 20 min at 4°C. The supernatant (75 μl) was subsequently analyzed using HPLC as described above.

Statistics
Behavioral, D2/D3 binding and DA-depletion data were analyzed using R (http://www.R-project.org/) and SPSS (IBM). For ANOVA, homogeneity of variance was verified using Levene's test; type III sums of squares and full factorial models were used unless stated. For designs with within-subjects factors, where applicable, the Huynh-Feldt correction was used to correct for any violations of the sphericity assumption as assessed by the Greenhouse-Geisser test. Computational model parameters were estimated using a hierarchical Bayesian analysis. Rather than confidence intervals, this produces credible intervals, specifically highest posterior density intervals (HDI). An x% HDI is the narrowest interval containing x% of the posterior probability mass. For example, if the 50% HDI for a parameter excludes zero, then it is more likely than not that the parameter is non-zero; if the 95% HDI excludes zero, then the probability that the parameter is non-zero exceeds 0.95. A 95% HDI excluding zero is, therefore, in general better evidence for a parameter being non-zero than a 95% confidence interval, which merely describes the likelihood of the data given the null hypothesis.
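To make the HDI definition concrete, the narrowest interval containing a given posterior mass can be computed directly from posterior samples. The sketch below is illustrative Python (not the R/Stan analysis code used in the study); the function name hdi and the sample data are our own:

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the posterior samples."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    k = int(np.ceil(mass * n))           # number of points the interval must cover
    widths = x[k - 1:] - x[:n - k + 1]   # width of every candidate interval
    i = int(np.argmin(widths))           # index of the narrowest candidate
    return float(x[i]), float(x[i + k - 1])

# For a right-skewed posterior, the HDI hugs the mode rather than being central
rng = np.random.default_rng(0)
lo, hi = hdi(rng.gamma(2.0, 1.0, size=100_000), mass=0.95)
```

For a symmetric posterior the HDI coincides with the central credible interval; for skewed posteriors, as here, it does not.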

Computational modeling of behavior
We analyzed behavior in several ways, including the fitting of several computational models of reinforcement learning to the behavioral data. We aimed to address several behavioral possibilities: (1) We analyzed behavior according to the reinforcement occurring on the immediately preceding trial, in a win-stay/lose-shift analysis, as is common (den Ouden et al., 2013). (2) The analyses in (1) indicated that OFC-depleted and control groups differed in their response to reinforcement veracity (whether reinforcement on the preceding trial was "true," meaning in the majority, or "false," meaning in the minority and misleading as to the best stimulus). This suggests an effect of prior history, so we examined the dependence of choice on preceding reward/punishment, and also on subjects' prior stimulus choices (to account for stimulus-bound perseveration) in terms of several preceding trials, using an n-back analysis with a family of conditional logit regression models (Lau and Glimcher, 2005; Seymour et al., 2012). However, this family of models did not explain the group differences in the win-stay/lose-shift analysis, and even the best of them provided a poor description of behavior (as judged by the Bayesian Information Criterion; BIC) compared with state-based reinforcement learning models, considered below. We do not present full n-back analyses for reasons of space. (3) We considered the possibility that subjects used "model-based" (declarative) learning (Wunderlich et al., 2012), such as tracking reinforcement probabilities and their certainties about those probabilities in a Bayesian or similar fashion, and altering their estimates of probability less when their certainty is already high. (4) We considered a family of conventional value-/state-based ("model free") reinforcement learning rules, in which subjects update simple representations of their environment after each trial.
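For illustration, the win-stay/lose-shift summary in analysis (1) can be computed directly from raw choice and outcome sequences. This Python sketch is ours, not the study's analysis code, and the data values in the usage example are invented:

```python
import numpy as np

def win_stay_lose_shift(choices, rewards):
    """P(stay | win on previous trial) and P(shift | loss on previous trial).

    choices: chosen stimulus per trial; rewards: 1 = reward, 0 = punishment.
    """
    choices = np.asarray(choices)
    rewards = np.asarray(rewards)
    stay = choices[1:] == choices[:-1]      # did the subject repeat its choice?
    win = rewards[:-1] == 1                 # was the previous trial rewarded?
    p_win_stay = stay[win].mean() if win.any() else np.nan
    p_lose_shift = (~stay[~win]).mean() if (~win).any() else np.nan
    return float(p_win_stay), float(p_lose_shift)

# Toy data: five trials, two stimuli (0 and 1)
p_ws, p_ls = win_stay_lose_shift([0, 0, 1, 1, 0], [1, 0, 1, 1, 0])
```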

Model fitting and comparison
Likelihood calculation and maximum a posteriori fitting. Several regression and reinforcement learning (RL) models were compared. Each applies its own algorithm, with a certain number of parameters, to the sequence of stimuli and rewards experienced by the subjects. Sessions were treated as contiguous. In all cases, the model M, having parameters θ, calculated the probability of choosing each possible action (i.e., of selecting each of two given stimuli). The vector of actions actually chosen by subject s was denoted a_s, or a_s,t at each trial t. The model's performance was evaluated by calculating the likelihood P(choice actually made | M, θ) for each trial. The log-likelihood (LL) was calculated as LL = Σ_t ln P(a_s,t | M, θ). We conducted maximum a posteriori (MAP) fitting using priors as follows: learning rates and other parameters that are in principle constrained to the range (0,1) were given priors of Beta(1.1, 1.1), whereas softmax inverse temperatures (β) and "stickiness" maxima (see below) were given priors of Gamma(shape = 1.2, scale = 5).
We selected model parameters θ_M for a given model M to maximize the posterior probability of the obtained data D given the model and its parameters, θ̂_M = argmax_θ P(D | M, θ) P(θ), for each subject using the optim() function of R (R Core Team, 2012). Logarithms are to base e throughout.
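A minimal illustration of MAP fitting under these priors might look as follows in Python. This is our sketch, not the study's code: a crude grid search stands in for optim()/gradient-based optimization, the two-parameter delta-rule likelihood is only an example, and the prior log-densities omit their normalizing constants (which do not affect the argmax):

```python
import math

def log_beta_prior(x, a=1.1, b=1.1):
    # Beta(1.1, 1.1) prior for rate-like parameters in (0, 1), up to a constant
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)

def log_gamma_prior(x, shape=1.2, scale=5.0):
    # Gamma(shape=1.2, scale=5) prior for inverse temperatures, up to a constant
    return (shape - 1) * math.log(x) - x / scale

def delta_rule_log_lik(alpha, beta, trials):
    """LL = sum_t ln P(a_t | M, theta); two-stimulus delta rule with softmax.

    trials: sequence of (choice, reward) pairs, choice in {0, 1}, reward in {0, 1}.
    """
    v = [0.5, 0.5]                        # initial stimulus values
    ll = 0.0
    for choice, reward in trials:
        ez = [math.exp(beta * q) for q in v]
        ll += math.log(ez[choice] / sum(ez))
        v[choice] += alpha * (reward - v[choice])
    return ll

def map_fit(trials, grid=50):
    """Grid-search MAP estimate of (alpha, beta) -- illustration only."""
    best, best_lp = None, -math.inf
    for i in range(1, grid):
        alpha = i / grid
        for j in range(1, grid):
            beta = 10.0 * j / grid
            lp = (delta_rule_log_lik(alpha, beta, trials)
                  + log_beta_prior(alpha) + log_gamma_prior(beta))
            if lp > best_lp:
                best, best_lp = (alpha, beta), lp
    return best, best_lp
```

In practice a proper per-subject optimizer would be used; the grid only keeps the example self-contained.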
Model selection. Models were selected using the Bayesian information criterion, BIC = −2 LL + k ln(n), where k is the number of parameters in the model and n is the number of observations (trials; Schwarz, 1978; Burnham and Anderson, 2004). Lower BIC values indicated a better fit after penalization for the number of parameters. The BIC was computed across all subjects, such that k = zs, where s is the number of subjects and z is the number of parameters per subject entering the reinforcement learning model. This method gives more weight to subjects contributing more trials, but correctly so in terms of optimizing the overall fit, because such subjects contribute more information about the common model identity. There were no major differences if the corrected Akaike information criterion was used instead.
Reference model BIC. We included a model choosing at random (p = 0.5 for each trial, for n = 4814 trials) for comparison of BIC values.
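The BIC comparison, including the random-choice reference model, reduces to a one-line computation. A Python sketch (ours; the fitted log-likelihood value and parameter counts are invented purely for illustration):

```python
import math

def bic(log_lik, k, n):
    # BIC = -2 LL + k ln(n); lower values indicate a better penalized fit
    return -2.0 * log_lik + k * math.log(n)

# Reference model choosing at random (p = 0.5 per trial, no free parameters)
n_trials = 4814
ll_random = n_trials * math.log(0.5)
bic_random = bic(ll_random, k=0, n=n_trials)

# A hypothetical fitted model: z = 4 parameters per subject, s = 7 subjects, k = z * s
ll_model = -2900.0                       # illustrative value only
bic_model = bic(ll_model, k=4 * 7, n=n_trials)
better = bic_model < bic_random          # does the model beat random choice?
```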
Exceedance probability. Following MAP estimation, we also calculated the model that was most likely (across all subjects) based on the random-effects analysis of Stephan et al. (2009), which treats the model identity as a random variable.
Parameter comparison. For some models, we compared model parameters across groups using summary statistics. Because the number of trials varied by subject, in some cases we also compared model fits by comparing the mean LL per trial, calculated for each subject, across groups. For the best model, we estimated parameters and group differences using a full Bayesian hierarchical method, described below.

Optimal Bayesian choice algorithm
A hypothetical ideal subject would estimate the probability of reinforcement for each stimulus, represent its uncertainty about those estimates, and choose so as to maximize the reward obtained. We modeled this behavior using an optimal Bayesian method.
The probability of reward for each stimulus was represented by a probability density function (PDF). The prior PDF was uniform (that is, before a discrimination begins, a subject is assumed to believe equally strongly that a given stimulus will always deliver reward, never deliver reward, deliver reward with a 40% probability, and so on). For a uniform prior, the posterior PDF after T trials with r rewards and s = T − r punishments is the Beta distribution Be(p | r + 1, s + 1) ∝ p^r (1 − p)^s. The probability of choosing a stimulus was determined by randomized probability matching (RPM; Scott, 2010). In RPM, an agent selects a series of actions a_t at time t, and observes a sequence of rewards y^t = (y_1, …, y_t). For our purposes, reward occurs or does not occur on each trial, so y_t ∈ {0, 1}. Each action generates reward independently from the reward distribution f_a_t(y | θ), where θ is an unknown parameter vector; for our purposes, f_a(y | θ) is the Bernoulli distribution with success probability θ_a. The quantity μ_a(θ) = E(y_t | θ, a_t = a) is the expected reward from f_a(y | θ). If θ were known, the optimal strategy would be to choose the option with the largest μ_a(θ). RPM calculates the quantity w_at = P(μ_a(θ) = max_a′ μ_a′(θ) | y^t) and allocates choice t + 1 to option a with probability w_at. When rewards are drawn from independent Bernoulli random variables (a "binomial bandit"), as in the current situation, the optimality probability (Scott, 2010) is given by w_at = ∫ Be(θ | Y_at + 1, T_at − Y_at + 1) ∏_{a′ ≠ a} P(θ_a′ < θ) dθ, calculated across the actions on offer, where Be(θ | α, β) is the density of the Beta distribution for a random variable with parameters α and β, and Y_at and T_at are the cumulative numbers of successes and trials, respectively, observed for action a up to time t. RPM has no parameters and therefore requires no fitting. We used this model in isolation, but also in a variant with an added softmax stage applied to the w_at values.
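For a binomial bandit, the RPM optimality probabilities can be estimated by Monte Carlo: sample from each arm's Beta posterior and count how often each arm has the largest draw (equivalent to Thompson sampling allocation probabilities). A Python sketch, ours rather than the study's code:

```python
import numpy as np

def rpm_probs(successes, trials, n_samples=100_000, rng=None):
    """Monte Carlo estimate of w_a = P(theta_a is largest | data) for a
    binomial bandit with uniform Beta(1, 1) priors (RPM; Scott, 2010).
    RPM then chooses action a with probability w_a."""
    rng = rng or np.random.default_rng(0)
    y = np.asarray(successes, dtype=float)
    t = np.asarray(trials, dtype=float)
    # Posterior for each arm after Y successes in T trials: Beta(Y + 1, T - Y + 1)
    draws = rng.beta(y + 1.0, t - y + 1.0, size=(n_samples, len(y)))
    best = np.argmax(draws, axis=1)      # which arm wins each joint draw?
    return np.bincount(best, minlength=len(y)) / n_samples

# Arm 0 rewarded 8/10 times, arm 1 rewarded 2/10: RPM strongly favors arm 0
w = rpm_probs([8, 2], [10, 10])
```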

Value-/state-based RL models
Delta-rule updates of stimulus value. Simple value-based RL algorithms assign a value to each stimulus or action, and choose accordingly; the values are updated according to rules and parameters determining the impact of reward or punishment, but (unlike models such as RPM) they do not represent the statistical structure of the environment in a more complex way. Subjects' behavior was modeled using a delta-rule update function that allowed different speeds of response to reward (α_r) and to punishment (α_p): after reward, the chosen stimulus's value was updated as V ← V + α_r(1 − V), and after punishment as V ← V + α_p(0 − V). For each subject, the eight stimuli presented in discriminations D5-D8 were each assigned a value, which was updated according to this rule. The initial values of all stimuli were set to 0.5, midway between the target value for reward (1) and punishment (0). In other model variants, the constraint α_r = α_p = α_rp was applied.
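A single step of this delta rule, with separate reward and punishment learning rates, can be written out directly (illustrative Python; the parameter values are invented):

```python
def delta_update(v, reinforced, alpha_r, alpha_p):
    """One delta-rule step for the chosen stimulus's value v in [0, 1]:
    target is 1 after reward, 0 after punishment, with separate speeds."""
    if reinforced:
        return v + alpha_r * (1.0 - v)
    return v + alpha_p * (0.0 - v)

# Starting from the initial value 0.5, reward pulls v up, punishment pulls it down
v = 0.5
v = delta_update(v, True, alpha_r=0.3, alpha_p=0.1)   # 0.5 + 0.3 * 0.5  = 0.65
v = delta_update(v, False, alpha_r=0.3, alpha_p=0.1)  # 0.65 - 0.1 * 0.65 = 0.585
```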
Stickiness. In a subset of models, the tendency to repeat choices (Lau and Glimcher, 2005; Seymour et al., 2012) was modeled using additional parameters α_c and c. Each stimulus i was assigned a stickiness trace C^i_t, updated as C^i_{t+1} = Prop(α_c, C^i_t, c) when that stimulus was chosen and C^i_{t+1} = Prop(α_c, C^i_t, 0) when not chosen/not presented, where Prop(α, C, target) = C + α(target − C).
The initial C value for each stimulus was set to 0.
In variants, the constraint α_c = α_r or the constraint c = 1 was applied.
Side bias. In some models, one of two possible sources of side bias was included. In one, based on the subject's own behavior only, the left side was favored as a result of previous choices to the left side: by analogy with the stimulus stickiness update, the bias value was moved (with learning rate α_LC) toward its maximum d_LC after responses to the left and toward 0 after responses to the right. In the other, based on reinforcement, the left side was favored as a result of previous reinforcement following choice of the left side, or punishment following choice of the right side: the bias value was moved correspondingly (with learning rate α_LR) toward its maximum d_LR. Bias values were initially set to 0.
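The Prop-style trace update used for stickiness (and, by analogy, side bias) can be sketched as follows. The explicit form of Prop here is our reconstruction from the surviving fragment Prop(α_c, C^i_t, 0): an exponential approach of the trace toward its target at rate α:

```python
def prop(alpha, current, target):
    # Reconstructed Prop(alpha, C, target): move the trace a fraction alpha
    # of the way from its current value toward the target.
    return current + alpha * (target - current)

# Stimulus stickiness: the trace moves toward its maximum c when the stimulus
# is chosen, and decays toward 0 when it is not chosen / not presented.
c, alpha_c = 2.0, 0.5
trace = 0.0
trace = prop(alpha_c, trace, c)    # chosen:     0.0 -> 1.0
trace = prop(alpha_c, trace, 0.0)  # not chosen: 1.0 -> 0.5
```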
Softmax. The probability of responding was calculated according to a softmax rule, P(choose stimulus i) = exp(βQ_i) / Σ_j exp(βQ_j), applied only across the two stimuli presented on each trial (n = 2). The softmax (soft maximization) function takes a number of inputs and provides the same number of outputs. The outputs sum to 1. The largest input produces the largest output (maximization), and the proportion of the output captured by the largest input is determined by the softmax parameter (soft maximization rather than hard, or absolute, maximization). It has a variable inverse temperature β (low β, or high temperature, leads to nearly equiprobable actions). The use of β rather than temperature 1/β is for computational reasons, to avoid division by zero following underflow.
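The softmax can be implemented with the standard max-subtraction trick for numerical stability, which does not change the computed probabilities (illustrative Python, ours):

```python
import math

def softmax(values, beta):
    """P(action i) = exp(beta * v_i) / sum_j exp(beta * v_j), here applied
    across the two stimuli presented on a trial. Subtracting the maximum
    before exponentiating prevents overflow without altering the result."""
    m = max(values)
    ez = [math.exp(beta * (v - m)) for v in values]
    total = sum(ez)
    return [e / total for e in ez]

p_indifferent = softmax([1.0, 0.0], beta=0.0)    # beta = 0: equiprobable choice
p_greedy = softmax([1.0, 0.0], beta=10.0)        # high beta: near-deterministic
```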
In variants, the constraint β = 1 was used. (This constraint was always used when α_r and α_p were separate parameters, since to vary β would have been confounded with the difference between α_r and α_p and would lead to over-parameterization: for a given α_r, either an increase in β or an increase in α_p would lead to exaggerated preferences between any pair of stimuli; thus, β and α_p would be negatively coupled.)

Simple approximation to stimulus reliability in predicting reinforcement
We also tested models that calculated the total number of trials T_at and the total number of rewards Y_at for each action a up to time t, as for RPM, but updated values according to simpler rules: the effect of reward was weighted by the action's "reliability" r = Y_at/T_at, and the effect of punishment by (1 − r), each in proportion to a fixed propensity (τ_r, τ_p) to do so. This reduces to the previous model of value updating when τ_r = τ_p = 0. As an example, in the case where τ_r = τ_p = 1, this model would weight the effect of reward by 0.8 for an action that had been rewarded on 80% of previous trials, and weight the effect of punishment by 0.2 for the same action.
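Because the original equations for this rule did not survive extraction, the sketch below is our reconstruction, chosen to satisfy the two stated constraints: with τ_r = τ_p = 0 it reduces to the plain delta rule, and with τ_r = τ_p = 1 it weights reward by 0.8 and punishment by 0.2 for an action rewarded on 80% of previous trials. A single learning rate α is used for brevity:

```python
def reliability_weighted_update(v, reinforced, y, t, alpha, tau_r, tau_p):
    """Reconstructed reliability-weighted delta rule (our sketch).

    y / t is the action's reliability r (fraction of rewarded trials so far);
    tau interpolates between ignoring reliability (0) and full weighting (1).
    """
    r = y / t if t > 0 else 0.5
    if reinforced:
        weight = (1.0 - tau_r) + tau_r * r           # reward weighted by r
        return v + alpha * weight * (1.0 - v)
    weight = (1.0 - tau_p) + tau_p * (1.0 - r)       # punishment weighted by 1 - r
    return v + alpha * weight * (0.0 - v)
```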

Hybrid models incorporating Bayesian and value-based elements
Finally, we created models that blended simple value-based and optimal Bayesian responding. We calculated decision probabilities based on the simple delta-rule models, and separately according to RPM (with or without an RPM softmax stage), with a further parameter for each subject, 0 ≤ w ≤ 1, such that the decision probability for each action at each time point was w·p_RPM + (1 − w)·p_delta. The complete set of reinforcement learning models tested is shown in Table 1.
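The hybrid decision probability is a simple convex mixture of the two models' probabilities (illustrative Python, ours):

```python
def hybrid_choice_prob(p_rpm, p_delta, w):
    """Blend of Bayesian (RPM) and delta-rule decision probabilities:
    p = w * p_rpm + (1 - w) * p_delta, with 0 <= w <= 1 per subject."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * p_rpm + (1.0 - w) * p_delta
```

At w = 1 the subject is purely Bayesian; at w = 0 purely a delta-rule learner; intermediate w mixes the two.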

Hierarchical Bayesian modeling of the optimal RL model
The best model (Delta1C_LC) was subjected to a full hierarchical Bayesian analysis using Stan (Stan Development Team, 2014), with the following parameters: (1) a shared group SD for each parameter: these had a prior distribution of half-Cauchy(0, 5) and constraints of [0, +∞); (2) a per-group mean for each parameter: group mean values of α_rp, α_c, and α_LC had prior distributions of Beta(1.1, 1.1) and were constrained to the range [0,1]; group mean values of c and d_LC had prior distributions of Gamma(shape = 1.2, scale = 5) and constraints of [0, +∞); (3) per-subject parameters, with similar constraints, drawn from normal distributions defined by the group-level parameters; (4) per-trial probabilities of choosing the best stimulus, calculated deterministically from the per-subject parameters according to the RL algorithm; and (5) the observed choices, modeled as Bernoulli outcomes with those per-trial probabilities.
To compare the model's predictions to the behavioral analysis of "obey" probabilities, the probabilities of obeying preceding feedback of different types were sampled from the best-fit computational model of behavior. Per-subject estimates were sampled of the mean probability (as determined by the model) of choosing an option that would correspond to "obeying," given the actual choice made and actual reinforcement obtained on the previous trial.
To compare the model's predictions to the behavioral analysis of errors to criterion, six "virtual" subjects chose according to the computational model and their mean posterior per-subject parameter values. For each subject, the model's probabilities of choosing the correct stimulus were converted into actual choices and fed into a virtual environment embodying a simple model of the task (in which the probability of valid reinforcement was 0.8, the probability of the correct stimulus being on the left or right was 0.5, using sessions of 30 trials each, with a stopping criterion of 90% correct in a session just as for the monkeys). Reinforcement from the virtual environment was fed back into the model, to update its state for the next trial. The mean number of errors to criterion was measured, across 1000 iterations of the task, for each virtual subject.
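The virtual-subject simulation loop can be sketched as follows. This is our simplified Python illustration of the procedure described above (a delta-rule learner with softmax choice, 80% valid feedback, 30-trial sessions, and a 90% stopping criterion), not the study's simulation code, and it omits the stickiness and side-bias components of the full model:

```python
import math
import random

def simulate_subject(alpha, beta, n_sessions=100, trials_per_session=30,
                     p_valid=0.8, criterion=0.9, seed=0):
    """Errors to criterion for one virtual subject on a simplified task."""
    rng = random.Random(seed)
    v = [0.5, 0.5]                       # values of correct (0) / incorrect (1) stimulus
    errors = 0
    for _ in range(n_sessions):
        n_correct = 0
        for _ in range(trials_per_session):
            # Two-option softmax: P(correct) = 1 / (1 + exp(-beta * (v0 - v1)))
            p_correct = 1.0 / (1.0 + math.exp(-beta * (v[0] - v[1])))
            choice = 0 if rng.random() < p_correct else 1
            valid = rng.random() < p_valid           # feedback valid with p = 0.8
            rewarded = (choice == 0) == valid        # valid feedback rewards correct
            target = 1.0 if rewarded else 0.0
            v[choice] += alpha * (target - v[choice])
            if choice == 0:
                n_correct += 1
            else:
                errors += 1
        if n_correct / trials_per_session >= criterion:
            return errors                # stopping criterion reached
    return errors                        # criterion never reached
```

A sharper learner should reach criterion with fewer errors than a near-random chooser, which is the kind of parameter-to-behavior mapping the simulations probe.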
To establish the necessity and sufficiency of model parameter changes to cause the behavioral effects on errors to criterion and obey probabilities, multiple (n = 1000) virtual subjects per group were similarly simulated, with either all parameters varying between groups (all subjects taking their group mean value for each parameter); one parameter set varying between groups [either the reinforcement rate parameter (α_rp), the stimulus stickiness parameters (α_c and c), or the side stickiness parameters (α_LC and d_LC)], with all other parameters taking the overall mean values; or two parameter sets varying between groups, with the remaining parameter set taking the overall mean values.
A potential cause of this decreased caudate D2RB was competition from increased extracellular DA that reduced radioligand binding. To test this hypothesis, all subjects underwent caudate microdialysis. Extracellular DA levels were measured at baseline and following 75 mM K+. These measures are taken to reflect tonic and phasic DA release, respectively, as the influx of K+ mimics the arrival of an action potential and thus induces the release of DA in a phasic manner. Consistent with this hypothesis, OFC DA-depleted monkeys showed a significant increase in baseline DA compared with controls (samples 1-3, t(5) = -2.745, p = 0.041; Fig. 2C) that was maintained after the K+ challenge (samples 7-8, p = 0.04). However, differences in DA release in response to the K+ challenge were not seen (t(5) = -0.410, p = 0.699), suggesting that OFC DA modulates only tonic, rather than evoked, striatal DA release. Furthermore, the extent of caudate extracellular DA release correlated negatively with the reduced levels of 18F-fallypride binding seen in the ventromedial caudate of the OFC DA-depleted monkeys (r = -0.807, p = 0.028; Fig. 2D). These findings are consistent with competition between endogenous extracellular DA and 18F-fallypride binding and demonstrate that OFC DA dysfunction modulates caudate DA levels.

Table 1 (legend). Model performance was assessed by the BIC. Low BIC values indicate a better fit, having penalized the models for their complexity. BIC values and corresponding model ranks (1 being best) are shown for sham-operated control subjects, for OFC DA-depleted subjects, and for all subjects together. Ranking by BIC across all subjects gives more weight to subjects contributing more trials, which is correct in terms of optimizing the overall fit, because such subjects contribute more information about the common model identity. The best model overall (Delta1C_LC) was also the best model for control subjects. It was the second-best model for lesioned subjects, with a BIC difference of only 1 from the best model for lesioned subjects (Delta1C), and it differed structurally from that model only in parameters whose values were demonstrably different in the OFC DA-depleted group (Fig. 5C). β, softmax inverse temperature; τ, rate constant; α_r, learning rate for reward; α_p, learning rate for punishment; α_rp, learning rate for reinforcement whether reward or punishment; τ_c, learning rate for stimulus stickiness; τ_LC, learning rate for side stickiness; τ_LR, learning rate for side bias based on reinforcement; d_c, maximum for stimulus stickiness; d_LC, maximum for side stickiness; d_LR, maximum for side bias based on reinforcement; κ, propensity to weight reward or punishment by its reliability; w, fraction of decision making based on Bayesian (vs delta rule) processes.
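The BIC-based model comparison described above can be sketched as follows (a minimal Python illustration with hypothetical log-likelihoods and parameter counts, not the authors' fitting code):

```python
import math

def bic(log_likelihood, n_params, n_trials):
    """Bayesian Information Criterion: penalizes model complexity,
    with the penalty growing with the number of observations."""
    return n_params * math.log(n_trials) - 2.0 * log_likelihood

def rank_models(fits):
    """fits: {model_name: (log_likelihood, n_params, n_trials)}.
    Returns model names ordered best (lowest BIC) first."""
    scores = {name: bic(*args) for name, args in fits.items()}
    return sorted(scores, key=scores.get)

# Hypothetical fits: a model with extra parameters must improve the
# log-likelihood enough to offset the larger penalty term; here it does.
fits = {
    "Delta1C":    (-800.0, 3, 1000),
    "Delta1C_LC": (-790.0, 5, 1000),
}
ranked = rank_models(fits)  # richer model wins on these numbers
```

Because the penalty term scales with the number of trials, pooling BIC across subjects naturally weights subjects who contributed more trials, as noted in the Table 1 legend.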
Postmortem analysis at ~448 d after surgery confirmed that injection of 6-OHDA caused a significant DA depletion (45 ± 7.9%) in the OFC compared with controls (t(3) = 4.27, p = 0.024; Table 2). Because our previous work has shown that these OFC DA depletions recover considerably over time, we also analyzed the time course of OFC DA depletion obtained from our analysis of identical lesions from previous studies at earlier time points (81% depletion at 16 d, 75% depletion at 84 ± 3 d, and 51% depletion at 370 ± 23 d). Thus, the time period during which the behavioral analysis, imaging, and microdialysis were performed corresponds to periods of very high (in excess of 70%) OFC DA depletion (Fig. 3). 5-HT levels were unaffected, and although the medial prefrontal cortex and the OFC/lateral PFC also showed significant depletions of DA and NA, respectively (medial PFC DA, t(2) = 11.063, p = 0.008; OFC NA, t(3) = 11.145, p = 0.002; lateral PFC NA, t(3) = 3.892, p = 0.03), these depletions were not apparent at any earlier time point, suggesting that they may be due to compensatory processes that occur later than our period of interest.

OFC DA depletion improved overall behavioral performance and resulted in decreased sensitivity to false punishment
To assess how behavior was altered by the OFC DA depletion, we focused not only on overall errors to criterion but also on how the feedback on the immediately preceding trial influenced stimulus selection on the current trial, using a win-stay/lose-shift analysis. This approach has frequently been used (Waltz and Gold, 2007; den Ouden et al., 2013) to reveal how sensitivity to positive or negative feedback governs subsequent behavioral choice (the premise being that positive feedback should lead to repeated choice of a given stimulus, whereas negative feedback should lead to a shift in choice to an alternative stimulus) and has successfully revealed differences in reinforcement learning in schizophrenia (Waltz et al., 2007). We therefore defined "shifting" as choosing a different stimulus from that chosen on the previous trial and "staying" as its converse, and calculated the probability of obeying reinforcement (staying after reward and shifting after punishment). We analyzed responding on trial X according to (1) valence, whether the response on trial X - 1 was rewarded or punished, and (2) veracity, whether that reinforcement was true (majority; e.g., reward following selection of the "correct" stimulus) or false/misleading (minority; e.g., punishment following selection of the correct stimulus).
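As a concrete illustration, the obey probabilities defined above can be computed from a trial log split by valence and veracity of the preceding trial's feedback (a sketch; the field names are hypothetical):

```python
def obey_probabilities(trials):
    """trials: list of dicts with keys 'stimulus' (chosen stimulus),
    'rewarded' (bool), and 'true_feedback' (bool: majority/valid feedback).
    Returns P(obey) -- P(stay | reward) or P(shift | punishment) --
    keyed by (rewarded, true_feedback) on the preceding trial."""
    counts = {}  # (rewarded, true_feedback) -> [n_obeyed, n_total]
    for prev, cur in zip(trials, trials[1:]):
        stayed = cur["stimulus"] == prev["stimulus"]
        obeyed = stayed if prev["rewarded"] else not stayed
        key = (prev["rewarded"], prev["true_feedback"])
        n = counts.setdefault(key, [0, 0])
        n[0] += obeyed
        n[1] += 1
    return {k: n_obey / n_tot for k, (n_obey, n_tot) in counts.items()}
```

For example, the key `(False, False)` indexes shifting after false (misleading) punishment, the measure that differed between groups in the analyses below.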

Presurgical performance
There were no between-group differences in either errors to criterion or in any win-stay/lose-shift parameters (D1-D4, all p > 0.05, NS; Fig. 4).
Preliminary win-stay/lose-shift analysis established that no term involving the reinforcement probabilities (80:20 vs 70:30) was significant (F ≤ 1.34, p ≥ 0.311), so this term was dropped from subsequent analyses.

Behavior was best described by a simple computational model of reinforcement learning
To examine the behavioral strategy used by the subjects and to characterize the lesion-induced change better, several computational models of behavior were compared (Table 1). The best model for controls was one in which subjects' choices were governed by reinforcement, their own recent choice of stimulus, and their own recent choice of response side (model Delta1C_LC; BIC 3371 across controls; Table 1). The best model for OFC DA-depleted subjects was either the same model (BIC 1639) or a similar model lacking the dependence on their own recent choice of response side (model Delta1C; BIC 1638). Model Delta1C_LC also had the highest exceedance probability, at 0.705 (the probability that this model is more likely than any other model tested), and the lowest BIC across all subjects. Thus, this model was selected as the winner. It incorporated parameters for (1) sensitivity to reinforcement (α_rp, rate), without the need for different response rates to reward and punishment; (2) stimulus stickiness, the tendency to repeat choices of stimuli that have been recently chosen (τ_c, rate; d_c, maximum effect relative to reinforcement); and (3) side stickiness, the tendency to repeat choices of the side (left vs right) that had been recently chosen (τ_LC, rate; d_LC, maximum effect relative to reinforcement). The stickiness parameters govern a process analogous to exploration versus exploitation strategies (Lau and Glimcher, 2005; Seymour et al., 2012).
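A single trial of such a model might be sketched as below (illustrative Python; the symbols α_rp, τ_c, d_c, τ_LC, d_LC, and β follow the text, but the exact form of the stickiness traces is a plausible reconstruction, not the authors' implementation):

```python
import math
import random

def softmax_choice(values, beta, rng=random):
    """Choose an option index with probability proportional to exp(beta*value)."""
    mx = max(values)
    exps = [math.exp(beta * (v - mx)) for v in values]
    r = rng.random() * sum(exps)
    for i, e in enumerate(exps):
        r -= e
        if r <= 0:
            return i
    return len(values) - 1

def run_trial(state, stim_sides, reward_fn, params):
    """One trial of a delta-rule model with stimulus and side stickiness.
    state holds value estimates V (per stimulus) and stickiness traces
    C (per stimulus) and L (per side). stim_sides maps stimulus -> side."""
    a_rp, tau_c, d_c, tau_lc, d_lc, beta = (params[k] for k in
        ("alpha_rp", "tau_c", "d_c", "tau_lc", "d_lc", "beta"))
    stimuli = list(stim_sides)
    # Net value: reinforcement value plus weighted stickiness bonuses.
    net = [state["V"][s] + d_c * state["C"][s] + d_lc * state["L"][stim_sides[s]]
           for s in stimuli]
    choice = stimuli[softmax_choice(net, beta)]
    r = reward_fn(choice)                                 # 1 reward, 0 punishment
    state["V"][choice] += a_rp * (r - state["V"][choice])  # delta rule
    for s in stimuli:                                     # decay/refresh traces
        state["C"][s] += tau_c * ((s == choice) - state["C"][s])
    for side in set(stim_sides.values()):
        state["L"][side] += tau_lc * ((side == stim_sides[choice]) - state["L"][side])
    return choice, r
```

With d_c = d_lc = 0, the model reduces to pure delta-rule reinforcement learning; larger d values make recent choices of stimulus or side "stickier" relative to reinforcement.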

OFC DA depletion increased response exploration and reinforcement sensitivity, as assessed by computational modeling of behavior
This model was a good descriptor of both groups, and there were no preoperative group differences in its parameters. Postoperatively, however, OFC DA-depleted animals exhibited alterations in both strategy and reinforcement-related behavior.
OFC DA depletion made subjects less reliant on a strategy of reselecting a recently chosen side (Fig. 5C: strong evidence for reduced d_LC (the maximum for side stickiness; probability of non-zero difference = 0.051), with some evidence for increased τ_LC (learning rate for side stickiness; probability of non-zero difference = 0.938), indicating that their side stickiness strategy had a smaller effect on behavior overall and updated more rapidly). OFC DA-depleted animals also showed enhanced reinforcement sensitivity (Fig. 5C): the reinforcement rate parameter (α_rp) had a posterior probability of 0.778 of being nonzero (considerably stronger evidence for a difference than a frequentist p value of 0.222 would represent). There were no group differences in the parameters governing stimulus stickiness.
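The evidence summary used here (the width of the largest HDI excluding zero; see the legend to Fig. 5) can be sketched from posterior samples of a group difference (illustrative Python using a coarse grid search over credible masses, not the authors' code):

```python
import math

def hdi(samples, mass):
    """Narrowest interval containing `mass` of the samples (the HDI,
    for a unimodal posterior)."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))          # samples inside the interval
    widths = [(xs[i + k - 1] - xs[i], i) for i in range(n - k + 1)]
    _, i = min(widths)                        # narrowest window
    return xs[i], xs[i + k - 1]

def prob_nonzero(samples, grid=200):
    """Mass of the largest HDI that excludes zero, found by scanning
    candidate masses from small to large."""
    best = 0.0
    for j in range(1, grid):
        m = j / grid
        lo, hi = hdi(samples, m)
        if lo > 0 or hi < 0:                  # interval excludes zero
            best = m
    return best
```

A posterior lying entirely on one side of zero gives a value near 1; a posterior centered on zero gives a value near 0.5 or below.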
To independently assess the contributions of the altered strategy (reduced side stickiness) and the increased reinforcement sensitivity to these predictions, the analyses were replicated in simulations that allowed only subsets of the parameters to vary between groups (see Materials and Methods). This revealed that the reduced side stickiness did not affect win-stay/lose-shift behavior, but that changes in the overall sensitivity to reinforcement (regardless of whether it was reward or punishment) were necessary and sufficient to reproduce the reductions in both errors to criterion and sensitivity to false punishment shown by the OFC DA-depleted group behaviorally (Fig. 5E).
In summary, the OFC DA-depleted animals showed an increase in reinforcement sensitivity, and a decrease in side stickiness. Of these two changes, the increase in reinforcement sensitivity was responsible for the changes in errors to criterion and sensitivity to false punishment shown behaviorally.

Behavioral changes correlated with D2 receptor binding in the caudate nucleus but not with OFC D2 receptor binding
Given the specificity of these changes in D2RB to the caudate nucleus, and previous evidence that caudate D2RB is implicated in the ability to shift responding in response to changing feedback (Groman et al., 2011), we investigated whether (1) the reduction in side stickiness, (2) the increased reinforcement sensitivity, or (3) the resulting reduction in sensitivity to misleading negative feedback induced by OFC DA depletion were related to changes in caudate DA and D2RB.
Side stickiness maximum (d_LC) was positively correlated with D2RB in the caudate nucleus (ventromedial caudate, r = 0.887, uncorrected p = 0.018; dorsolateral caudate, r = 0.870, p = 0.024; and caudate body, r = 0.884, p = 0.019; Fig. 6C) but not with D2RB in the OFC (r = -0.020, uncorrected p = 0.971; Fig. 6D). We did not examine the relationship with τ_LC (learning rate for side stickiness) as well, because τ_LC and d_LC were themselves strongly anticorrelated (r = -0.959, p = 0.002). Similarly, the probability of shifting following false punishment correlated with D2RB in both the ventromedial caudate (r = 0.893, p = 0.017; Fig. 6E) and the caudate body (r = 0.841, p = 0.036) but not the OFC (r = 0.157, p = 0.767; Fig. 6F). Neither τ_c (learning rate for stimulus stickiness) nor d_c (maximum for stimulus stickiness) correlated with D2RB in the caudate or OFC. Because the computational model revealed reinforcement sensitivity as the key driver of changes in overall task performance, this suggests that performance improves as caudate DA and reinforcement sensitivity increase. Similar results were found with a voxel-based approach, and the differences did not exist preoperatively (data not shown).
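The r values reported here are Pearson correlation coefficients, which with samples this small are sensitive to individual subjects (hence the uncorrected p values). A minimal sketch of the computation:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

The sign convention matters for interpretation: here, a positive r between d_LC and caudate D2RB means that lower D2RB (i.e., higher extracellular DA) accompanies lower side stickiness.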

Discussion
Depleting OFC DA led to an upregulation of tonic extracellular striatal DA levels, measured by microdialysis, with a corresponding decrease in DA D2/D3 receptor binding potential, measured by PET. This depletion improved subjects' ability to learn visual discriminations in a task offering partially ambiguous feedback. OFC DA-depleted subjects were less driven by a tendency to persist in choosing a recently chosen side, as established by computational modeling, although this change did not explain their behavioral alterations. They also showed an increase in reinforcement sensitivity, which did predict the observed behavioral changes, namely a reduction in shifting away from the better stimulus in the face of punishment and a reduction in the number of errors made before criterion performance was attained. Parameters representing reinforcement sensitivity and the tendency to choose a recently chosen side were anticorrelated and correlated (respectively) with striatal D2RB, an inverse measure of striatal DA itself, but were not related to OFC D2RB. These results suggest that OFC DA depletion increases behavioral switching and reinforcement sensitivity via increases in striatal DA release.

Figure 5 (legend continued). ...(α_rp), a tendency to repeat choices of recently chosen stimuli (τ_c, d_c), and a tendency to repeat choices of recently chosen sides (τ_LC, d_LC). Lesioned subjects showed increased sensitivity to reinforcement (higher α_rp). They also showed less side stickiness (shown both by a lower d_LC, indicating a reduction in the overall influence of side stickiness compared with that of reinforcement, and a higher τ_LC, indicating that the influence of side stickiness was less long-lasting). The dagger (†) indicates that between-group differences in α_rp were necessary and sufficient for the other behavioral effects shown in D and E (see Materials and Methods, and Results). Error bars show the posterior distributions of group differences in group mean parameter values, as highest-density intervals (HDIs; orange, 75% HDI excludes zero; red, 95% HDI excludes zero). Percentages are the posterior probabilities that the parameter differs from zero (width of the largest HDI excluding zero), as described in Materials and Methods; they are not frequentist p values. D, This computational model predicted fewer errors to criterion in the OFC DA-depleted group (compare with A). E, Moreover, the computational model predicted the differences in responding to false punishment seen in the behavioral data (compare with B).

Figure 6. Relationship between behavior and striatal dopamine. A, The d_LC parameter correlated with 18F-fallypride BP_ND in the caudate (ventromedial caudate shown) but (B) not the OFC. C, Similarly, the α_rp parameter correlated with 18F-fallypride BP_ND in the caudate but (D) not the OFC. The probability of shifting after false-negative feedback correlated with (E) the reduced levels of ventromedial caudate 18F-fallypride BP_ND seen in the OFC DA-depleted monkeys but not (F) the levels of 18F-fallypride BP_ND seen in the OFC.
The novel finding that DA depletion specifically within the OFC induces selective caudate DA excess is relevant to models of schizophrenia. Most previous work on the relationship between PFC and striatal DA relates to the dorsolateral PFC, the whole PFC, or the rodent ventromedial PFC, rather than the OFC. Catecholamine depletion of the ventromedial PFC in rats increases DA throughout the dorsal and ventral striatum (Pycock et al., 1980), whereas N-acetyl-aspartate levels (a putative marker of neuronal integrity) in the dorsolateral PFC predict striatal D2 receptor availability in schizophrenia (Bertolino et al., 1999). PFC DA receptor binding is abnormal in schizophrenia (Okubo et al., 1997), and the magnitude of prefrontal dysfunction predicts increased striatal DA uptake during the Wisconsin card-sorting task in schizophrenia (Meyer-Lindenberg et al., 2002) and the prodromal state (Fusar-Poli et al., 2010), supporting the hypothesis that abnormal frontostriatal interactions contribute to the development of this disorder. It is known that the OFC inhibits firing in the VTA (Lodge, 2011) and that OFC damage disrupts striatal dopaminergic signaling and learning from unexpected outcomes in rats (Takahashi et al., 2009) and humans (Tsuchida et al., 2010). Here, we demonstrate for the first time that a reduction in primate OFC DA elevates DA levels in the caudate (perhaps also via VTA disinhibition), the site where changes in dopaminergic function are associated with the onset of psychosis (Fusar-Poli et al., 2010).
The instrumental behavior required by the probabilistic discrimination task can be generated by several interacting neuropsychological systems (Cardinal et al., 2002). It can be habitual, using "model-free" reinforcement learning driven by reward prediction errors without representing the causal structure of the world, or goal-directed (model-based), based on an internal model of the consequences of actions derived from experience of their outcomes. The OFC is implicated in aspects of model-based learning (McDannald et al., 2011), and the balance between model-based and model-free learning can be altered by DA manipulations (Wunderlich et al., 2012). Our behavioral results were not well described by a shift between model-free and model-based learning systems, but our task was not explicitly designed to compare the two and may be underpowered to detect such effects. Indeed, computational models of model-based strategies described behavior poorly in both the lesion group and controls. Our results are also not explicable simply by changes in motor function: response latencies were unaffected.
The most parsimonious account of our behavioral results was offered by a model-free computational model in which learning was driven by reinforcement (according to a simple delta rule operating at the same rate for reward and punishment), by stimulus stickiness (the tendency to choose the stimulus chosen on the previous trial), and by side stickiness (the tendency to respond to the side of the testing chamber chosen on the previous trial). Like schizophrenia patients, who show alterations in both strategy and reinforcement learning, OFC DA-depleted monkeys were less strongly influenced by their recently chosen side and more influenced by reinforcement. This effect was predicted by caudate but not OFC D2RB. The model also retrodicted the behavioral outcomes: OFC DA-depleted monkeys learned the task faster, and their choice selection was less affected by unpredicted negative outcomes.
The effect on side stickiness can be viewed as favoring exploration of stimulus locations over exploitation, or as an increase in the rate of response-based or side-lateralized switching. Our results are compatible with the theory of Humphries et al. (2012) that tonic striatal DA influences the trade-off between exploration and exploitation. Their network simulations suggest that in a two-choice task, high tonic dopamine promotes exploration under certain circumstances, and that the exploration-exploitation trade-off can alter win-stay/lose-shift probabilities and overall measures of task success, as seen in our data. They provide a potential neurobiological substrate for the increase in response switching between two locations induced by the indirect DA agonist amphetamine (Evenden and Robbins, 1983; Ridley et al., 1988) in a manner similar to that seen in schizophrenia (Frith and Done, 1983). They are also consistent with the well established roles of striatal DA in controlling responding within egocentric space (Cook and Kesner, 1988). The effects on side stickiness and reinforcement sensitivity may be neurally separable, because changes in reinforcement sensitivity (but not side stickiness) were capable of driving changes in win-stay/lose-shift behavior. If so, DA in the caudate body might mediate changes in side stickiness specifically, given that D2RB changes in the caudate body correlated with side stickiness and not reinforcement sensitivity (which was limited to the head of the caudate).
Much interest has centered on the role of the striatum in reinforcement learning, and it is of note that had we not selected among competing computational models, our win-stay/lose-shift outcomes would have found a ready explanation in reward prediction error signaling theories of the striatum (Sutton and Barto, 1998). Midbrain DA neurons fire in response to unexpected rewards (for convenience we will term these "blips"), and reduce their firing in response to unexpected omission of reward ("dips"; Schultz, 2002). In our study, OFC DA depletion increased tonic striatal DA without affecting K+-induced phasic DA release. This increase in tonic DA might mask the dips when reward is unexpectedly not delivered (and a mildly aversive outcome delivered instead), without affecting the blips in response to unexpected reward. Accordingly, one would expect a selective decrease in the normal behavioral response to unexpected punishment/reward omission, as observed. However, although convenient, this interpretation would imply that the changes in behavior were related to the difference between unexpected and expected reward, or between reward and punishment, or both. Instead, our results were explicable in terms of a simpler change in reinforcement sensitivity. This could be viewed as an enhancement of a model-free reinforcement learning system due to increased caudate DA. It is no surprise that an increase in reinforcement sensitivity was associated with fewer errors to criterion: because the majority of reinforcement is valid, the invalid minority feedback impacts upon behavior less, and thus animals with increased sensitivity will be more likely to ignore misleading feedback. Our results also emphasize the importance of considering the reinforcement-independent functions of the striatum, because a change in response strategy can influence simple behavioral measures often assumed to depend on reinforcement learning (Humphries et al., 2012).
Striatal dopamine increases have been suggested to contribute to changes in salience processing and psychosis, particularly early in the course of schizophrenia (Kapur, 2003;Fusar-Poli et al., 2010). In particular, because established schizophrenia is associated with impairments in reinforcement learning (Waltz et al., 2007), an important question arising from our results is whether prodromal or early psychosis, via increases in striatal dopamine, can be associated with improvements in reinforcement learning under some circumstances; certainly, global performance improvements have at times been reported in schizophrenia (Kasanova et al., 2011). Alternatively, the overall improvement apparent in our animal model may be a consequence of our control subjects not relying on a model-based system that is so prominent in humans. Schizophrenia involves many neural changes and animal models such as this one do not attempt to reproduce the entire disorder. Nevertheless, reproducing individual aspects of the disorder's complex neurobiology is helpful in isolating the cause of the individual neurobehavioral sequelae that do present in this complex disorder.
Orbitofrontal cortex DA function is abnormal in schizophrenia (Meador-Woodruff et al., 1997). The cause or causes of these abnormalities remain unknown, and it is uncertain whether they contribute to symptoms of the disorder, though it has long been hypothesized that prefrontal dopaminergic dysfunction is responsible for the striatal dopaminergic hyperfunction (Weinberger, 1987). One potential mechanism is via genetic changes affecting the OFC. Knock-out of the DISC1 schizophrenia susceptibility gene reduces OFC tyrosine hydroxylase expression (Sekiguchi et al., 2011). Another is via stress, as prolonged psychological stress reduces PFC DA transmission (Mizoguchi et al., 2000). A third is via distant cortical damage. For example, early ventromedial temporal lobe lesions damage PFC and impair dorsolateral prefrontal cortical regulation of striatal DA (Saunders et al., 1998;Bertolino et al., 2002), with dorsolateral PFC DA abnormalities also seen in schizophrenia (Davis et al., 1991). Here, using a combined behavioral, neuroimaging and computational approach we have demonstrated (to our knowledge for the first time) a specific additional mechanism of prefrontal-striatal regulation, in which DA depletion of the primate OFC causes an increase in tonic DA in the caudate nucleus. Behaviorally, this depletion caused an increase in the tendency to switch response location, a feature of choice behavior observed in patients with schizophrenia, and an increase in reinforcement sensitivity, both of which correlated with striatal but not OFC D2/D3 receptor binding. These results provide causal evidence that altered OFC DA transmission contributes to the striatal hyperdopaminergia known to contribute to behavioral dysfunction in schizophrenia.