Abstract
The orbitofrontal cortex (OFC) has been implicated in reinforcement-guided decision making, error monitoring, and the reversal of behavior in response to changing circumstances. The anterior cingulate sulcus (ACCS) has also been implicated in similar aspects of behavior. Dissociating the unique functions of these areas would improve our understanding of the decision-making process. We studied how selective OFC lesions affected the way monkeys used the history of reinforcement to guide choices of either particular actions or particular stimuli, and compared these effects with those of ACCS lesions. Both lesions disrupted decision making, but their effects were differentially modulated by whether choices depended on action–value or stimulus–value contingencies. OFC lesions caused a deficit in stimulus but not action selection, whereas ACCS lesions had the opposite effect, disrupting action but not stimulus selection. Furthermore, OFC lesions, which have previously been found to impair decision making when deterministic stimulus–reward contingencies are switched, caused a more general learning impairment in more naturalistic situations in which reward was stochastic. Both OFC and ACCS are essential for reinforcement-guided decision making and are more concerned with learning and making decisions than with error monitoring or behavioral reversal alone, but their roles in selecting between stimulus and action values are distinct.
Introduction
A long history of research has implicated orbitofrontal cortex (OFC) in reinforcement-guided decision-making (Schoenbaum and Roesch, 2005; Fellows, 2007; Murray et al., 2007). More recently, however, the anterior cingulate sulcus (ACCS) has also been implicated in reinforcement-guided decision-making (Shima and Tanji, 1998; Kennerley et al., 2006). Whether and how the contribution of OFC to guiding choice behavior differs from that of the ACCS is unclear. Notably, it has been suggested that both OFC and ACCS are especially important when error feedback suggests a change or reversal in choice is required (Bush et al., 2002; Ridderinkhof et al., 2004; Murray et al., 2007).
The first aim of the present study was to assess whether OFC lesions impair performance on decision-making tasks previously shown to be impaired by ACCS lesions (Kennerley et al., 2006). The tasks required choices between two actions. In experiment 1, only one action was associated with reward on a given trial and the rewarded action was intermittently changed (action reversal task) (Fig. 1 A). In experiment 2, the subjects made choices between two actions associated with different probabilities of reinforcement (action matching task) (Fig. 1 A). In addition to comparing monkeys with OFC lesions to controls, we also review the previously published performance of monkeys with ACCS lesions on identical tasks.
Whether OFC or ACCS plays the critical role in decision making may reflect whether the choice is being made between actions or between visual stimuli (Rushworth et al., 2007). The OFC, particularly lateral OFC, and ACCS are, respectively, more densely interconnected with high-level sensory processing streams and the motor system (Carmichael and Price, 1996). The second aim of the present study was, therefore, to compare the effects of OFC and ACCS lesions on reinforcement-guided stimulus choices as opposed to reinforcement-guided action choices (experiment 3) (Fig. 1 A). In experiments 1 and 2, choices could only be made between different actions using a single manipulandum in the absence of any guiding or conditional stimulus cues. In contrast, choices in experiment 3 were between visual stimuli whose spatial positions, and therefore their associated actions, varied from trial to trial.
The third aim was to examine the importance of OFC in new learning of stimulus–reward associations. Despite previous reports that OFC lesions do not affect discrimination learning (Iversen and Mishkin, 1970; Meunier et al., 1997; Izquierdo et al., 2004), it is conceivable that an OFC contribution to new learning may be more apparent if reward associations are probabilistic rather than simply deterministic (present or absent). If reward delivery is stochastic, then the best choice is dependent on the integrated history of reinforcement rather than just the last outcome. Although the previous trial may have been unrewarded, the value of the chosen option may still be higher than the value of the alternatives. By making it harder for subjects to determine which option is the most valuable, it may be possible to gain insights into prefrontal function not seen in deterministic settings. Recent evidence suggests ACCS is active and critical whenever reinforcement, whether positive or negative, provides information that allows a revision of the value of a choice (Matsumoto et al., 2003; Walton et al., 2004; Amiez et al., 2005, 2006; Kennerley et al., 2006), and it is possible that OFC performs a similar function. Experiment 4, therefore, examined whether OFC lesions impaired stimulus–reward learning in conditions in which each stimulus was allocated reward stochastically in a three-choice stimulus discrimination task (Fig. 1 A–C).
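A small worked example helps explain why a single unrewarded trial need not make an option worse than its alternatives. The sketch below is purely illustrative and is not the analysis used in this study. It applies a simple delta-rule value update, V ← V + α(outcome − V), to two hypothetical reward histories. The learning rate α = 0.2, the starting value of 0.5, and both histories are arbitrary assumptions. Option X ends with an unrewarded trial and option Y with a rewarded one, yet X keeps the higher integrated value.

```python
def delta_rule_value(outcomes, alpha=0.2, v0=0.5):
    """Running value estimate V <- V + alpha * (outcome - V) over 0/1 outcomes."""
    v = v0
    for outcome in outcomes:
        v += alpha * (outcome - v)  # move V a fraction alpha toward the outcome
    return v

# Hypothetical histories: X is usually rewarded but was unrewarded last;
# Y is usually unrewarded but was rewarded last.
history_x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
history_y = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1]

v_x = delta_rule_value(history_x)
v_y = delta_rule_value(history_y)
print(f"V(X) = {v_x:.3f}, V(Y) = {v_y:.3f}")  # → V(X) = 0.602, V(Y) = 0.424
```

A choice rule that looked only at the last outcome would favor Y. An integrated estimate still favors X.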
Materials and Methods
Subjects.
In experiment 1, 12 male rhesus macaques (Macaca mulatta) aged between 4 and 10 years and weighing between 7 and 13 kg served as subjects. Three received bilateral aspiration lesions of the orbitofrontal cortex (OFC group), three received bilateral lesions of the cortex of the anterior cingulate sulcus (ACCS group), and six served as unoperated controls (CON group). The same groups were tested again in experiment 2, although only four individuals from the CON group were tested. In experiment 3, the same lesion groups were tested, with three animals in the CON group. For experiment 4, the same OFC and CON groups were tested again. The data from the ACCS lesion macaques in experiments 1 and 2 have been reported previously (Kennerley et al., 2006). All animals were maintained on a 12 h light/dark cycle and had ad libitum access to water 24 h a day, except during testing. All experiments were conducted in accordance with the United Kingdom Animals (Scientific Procedures) Act (1986).
Apparatus.
The joystick apparatus for experiments 1 and 2 has been described in detail previously (Kennerley et al., 2006). In brief, all testing was conducted while macaques sat in wheeled transport cages in a quiet testing room. The transport cage was positioned in front of a custom-made joystick that could either be lifted (moved vertically) or turned (moved to the right). For experiments 3 and 4, visual stimuli were presented on a 19 inch visual display unit (VDU) fitted with a touch-sensitive screen (3M Micro-touch: 3M). Monkeys sat in transport cages 20 cm from the VDU. Stimuli were 8-bit color clip-art bitmap images (128 × 128 pixels) randomly selected from a larger pool of images.
In all experiments, food pellets served as a reward (190 mg pellets: P. J. Noyes) and were dispensed into a food well above the joystick or to the right of the touch screen. Food pellet delivery was controlled by a MED Associates dispenser located behind the joystick/touch screen. A large food box located to the left of both the joystick and touch screen contained most of the monkey's total amount of food for the day, and this was opened at the end of testing.
Experiment 1: reinforcement-guided action selection task.
Before the start of testing, all macaques had received extensive training with the joystick and knew that either lifting or turning it would lead to food reward. The initially “correct” action was counterbalanced across testing sessions. The start of each trial was signaled by an 800 Hz tone lasting 300 ms. If a macaque made the correct action, a reward and a 400 ms, 500 Hz tone were delivered; if it made the incorrect action, neither a reward nor a tone was delivered. After each trial, there was a 600 ms intertrial interval (ITI) before the start of the next trial. During the ITI, the joystick could still be moved in either direction; any action made during this period was recorded as an ITI error. The action–reward contingencies were kept stable for the first 25 rewarded trials and were then reversed after every further 25 rewarded trials. Preoperatively and postoperatively, monkeys completed five full sessions of 150 rewarded trials, with one session run per day.
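The reversal rule can be sketched as a short simulation. The functions `run_reversal_session` and `win_stay_lose_shift` are hypothetical names used only for illustration. The win-stay/lose-shift chooser is an arbitrary example policy, not a model of the monkeys, and ITI errors are not modeled. The sketch reverses the rewarded action after every 25 rewarded trials and ends the session at 150 rewarded trials, as described above.

```python
import random

OTHER = {"lift": "turn", "turn": "lift"}

def run_reversal_session(choose, first_correct="lift", block_size=25,
                         total_rewards=150, max_trials=5000):
    """Simulate one deterministic action-reversal session.

    `choose(history)` returns "lift" or "turn". The rewarded action flips after
    every `block_size` rewarded trials. Returns (choice, correct_action, rewarded)
    for each trial."""
    correct = first_correct
    rewards_in_block = 0
    n_rewards = 0
    history = []
    while n_rewards < total_rewards and len(history) < max_trials:
        choice = choose(history)
        rewarded = choice == correct
        history.append((choice, correct, rewarded))
        if rewarded:
            n_rewards += 1
            rewards_in_block += 1
            if rewards_in_block == block_size:
                correct = OTHER[correct]  # reverse the action-reward contingency
                rewards_in_block = 0
    return history

def win_stay_lose_shift(history, rng=random):
    """Hypothetical example policy: repeat a rewarded action, switch after an error."""
    if not history:
        return rng.choice(["lift", "turn"])
    last_choice, _, last_rewarded = history[-1]
    return last_choice if last_rewarded else OTHER[last_choice]

session = run_reversal_session(win_stay_lose_shift)
errors = sum(not rewarded for _, _, rewarded in session)
print(len(session), "trials,", errors, "errors")
```

Because the contingencies are deterministic, this policy makes exactly one error after each of the five within-session reversals, plus possibly one on the first trial.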
The “EC” analyses were conducted in the same manner as in Kennerley et al. (2006). This analysis examines the degree to which macaques are able to use reward information to sustain correct performance, by determining the impact of each successive correct response (“C”) on behavior after an initial error (“E”). Only trial types with at least 15 instances were analyzed, so trial types EC+1 to EC(9)+1 were included. The “EE” analysis is almost identical to the EC analysis, but instead of determining the impact of successive correct trials on performance, it assesses the impact of successive errors on subsequent performance.
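The EC and EE tallies can be sketched from a per-trial sequence of correct/incorrect outcomes. This is a minimal reconstruction based on the description above, not the original analysis code. The function names are hypothetical. The run-length keys (0 for E+1, 1 for EC+1, and so on) are an assumed encoding. The 15-instance threshold is taken from the text.

```python
from collections import defaultdict

def ec_analysis(correct, max_run=9, min_instances=15):
    """Percent correct on trial t, keyed by the number n of consecutive correct
    trials between the most recent error and t (n=0: E+1, n=1: EC+1, ...)."""
    tally = defaultdict(lambda: [0, 0])  # n -> [correct on t, instances]
    seen_error = False
    run = 0  # consecutive correct trials since the most recent error
    for c in correct:
        if seen_error and run <= max_run:
            tally[run][0] += int(c)
            tally[run][1] += 1
        if c:
            run += 1
        else:
            seen_error, run = True, 0
    return {n: 100.0 * k / m for n, (k, m) in sorted(tally.items()) if m >= min_instances}

def ee_analysis(correct, max_run=9, min_instances=15):
    """Percent correct on trial t, keyed by the number n of consecutive errors
    immediately preceding t (n=1: E+1, n=2: EE+1, ...)."""
    tally = defaultdict(lambda: [0, 0])
    run = 0  # consecutive errors immediately before t
    for c in correct:
        if 1 <= run <= max_run:
            tally[run][0] += int(c)
            tally[run][1] += 1
        run = 0 if c else run + 1
    return {n: 100.0 * k / m for n, (k, m) in sorted(tally.items()) if m >= min_instances}

seq = [True, False, True, True, False, False, True]
print(ec_analysis(seq, min_instances=1), ee_analysis(seq, min_instances=1))
```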
Experiment 2: action–outcome matching task.
Macaques could either lift or turn the joystick for food reward. However, unlike experiment 1, in which at any one time one action was deterministically associated with reward (1.0 probability of reward) and the other with no reward (0.0 probability of reward), each action was now associated with a probability of reward. Each day, monkeys were presented with one of four possible reward probability ratios: 1.0:0.0, 0.75:0.25, 0.5:0.2, and 0.4:0.1 (high reward probability action to low reward probability action), counterbalanced for each action alternative. The probability that a reward would be delivered for making each action was determined on a trial-by-trial basis by two independent probability algorithms. In keeping with previous matching protocols, if a reward was allocated to a particular action on a given trial, it stayed available until the macaque selected that action (Herrnstein, 1997; Sugrue et al., 2004; Kennerley et al., 2006). To obtain the maximum possible rate of reward, macaques therefore had to switch between the two options, harvesting the rewards available on each. All testing was conducted postoperatively.
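The "reward stays available until chosen" (baited) rule can be sketched as a small simulation. `BaitedSchedule` and `simulate` are hypothetical names, not the study's software. The sketch assumes that on each trial both options are independently allocated a reward with their own probability before the choice is made, and that an option already holding a reward simply keeps it. It shows why exclusive choice of the richer option is not optimal under this rule.

```python
import random

class BaitedSchedule:
    """Hypothetical two-option matching schedule: each option is independently
    allocated a reward with its own probability on every trial, and an allocated
    reward stays available until that option is chosen."""

    def __init__(self, p_high, p_low, rng=None):
        self.p = {"high": p_high, "low": p_low}
        self.armed = {"high": False, "low": False}
        self.rng = rng if rng is not None else random.Random()

    def step(self, choice):
        # Allocation is assumed to happen before the choice on each trial;
        # an option that already holds a reward simply keeps it.
        for option, prob in self.p.items():
            if not self.armed[option] and self.rng.random() < prob:
                self.armed[option] = True
        rewarded = self.armed[choice]
        self.armed[choice] = False  # the chosen option is emptied either way
        return rewarded

def simulate(p_high, p_low, rate_high, n_trials=100_000, seed=0):
    """Rewards per trial for a chooser that picks 'high' with probability rate_high."""
    rng = random.Random(seed)
    schedule = BaitedSchedule(p_high, p_low, rng)
    n_rewards = 0
    for _ in range(n_trials):
        choice = "high" if rng.random() < rate_high else "low"
        n_rewards += schedule.step(choice)
    return n_rewards / n_trials

always_high = simulate(0.75, 0.25, rate_high=1.0)
mixed = simulate(0.75, 0.25, rate_high=0.9)
print(f"always high: {always_high:.3f}, 90% high: {mixed:.3f}")
```

With a 0.75:0.25 ratio, always choosing the high option harvests about 0.75 rewards per trial. Occasionally sampling the low option also collects the rewards that accumulate on it.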
The performance of each macaque was assessed by calculating the number of trials taken to reach the ratio of responses (high-probability action choices to low-probability action choices) that yielded 97% of the maximum rate of reward (r opt) (Kennerley et al., 2006). In brief, r opt was calculated by plotting the expected value, EV(r), for each rate, r, of high-probability action choice, for each of the probability ratios (1:0, 0.75:0.25, 0.5:0.2, 0.4:0.1), where p and q represent the probabilities of reward associated with the high- and low-probability actions or stimuli (Eq. 1). r opt was then determined for each probability ratio by taking the maximum of each of the plotted curves (Fig. 2). The response ratios that yielded more than 97% of the maximal reward rate are marked on each curve by a red box (Fig. 2). Once the response ratio associated with 97% of the maximum reward rate had been calculated, the number of trials that macaques took to achieve this response ratio was determined using a 50 trial moving window (−25/+25) of the subjects' choices of the high reward probability action.
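The r opt computation can be sketched as follows. The closed form below is our own derivation and is not necessarily identical to Eq. 1. It assumes that the high-probability option is chosen independently with probability r on each trial, and that each option is allocated a reward with probability p (or q) before the choice and keeps it until chosen. Under these assumptions, an option chosen with probability c and armed with probability a holds a reward on a given choice with probability a/(a + c − ac). This gives EV(r) = rp/(p + r − rp) + (1 − r)q/(q + (1 − r) − (1 − r)q). The function names are hypothetical.

```python
import numpy as np

def harvest_rate(c, a):
    """Expected rewards per trial from one baited option that is chosen with
    probability c and allocated a reward with probability a on each trial."""
    c = np.asarray(c, dtype=float)
    denom = a + c - a * c                   # P(armed | chosen) = a / denom
    safe = np.where(denom > 0, denom, 1.0)  # avoid 0/0 when a = c = 0
    return np.where(denom > 0, c * a / safe, 0.0)

def expected_value(r, p, q):
    """EV(r): rewards per trial when the high option (p) is chosen at rate r."""
    r = np.asarray(r, dtype=float)
    return harvest_rate(r, p) + harvest_rate(1.0 - r, q)

def optimal_ratio(p, q, criterion=0.97, n_grid=1001):
    """Grid estimate of r_opt, the maximal EV, and the range of r within 97% of it."""
    r = np.linspace(0.0, 1.0, n_grid)
    ev = expected_value(r, p, q)
    r_opt = r[np.argmax(ev)]
    band = r[ev >= criterion * ev.max()]
    return r_opt, ev.max(), (band.min(), band.max())

for p, q in [(1.0, 0.0), (0.75, 0.25), (0.5, 0.2), (0.4, 0.1)]:
    r_opt, ev_max, (lo, hi) = optimal_ratio(p, q)
    print(f"{p}:{q}  r_opt={r_opt:.3f}  EV={ev_max:.3f}  97% band r in [{lo:.3f}, {hi:.3f}]")
```

For 0.75:0.25 this sketch places r opt near 0.9, whereas exclusive choice of the high option (r = 1) yields only 0.75 rewards per trial. The lower edge of the 97% band is the response ratio a learner starting from indifference would reach first.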
Experiment 3: stimulus–outcome matching task.
Experiment 3 was designed to be analogous to the action–outcome matching task (experiment 2), except that instead of choosing between actions associated with different probabilities of reward, macaques chose between two different stimuli. Before the start of testing, all macaques had received extensive training with touch screens and knew that touching a stimulus on the screen could lead to food reward. Each day, macaques were presented with two novel stimuli on the touch screen simultaneously, in a left/right configuration. At the start of testing, each stimulus was randomly assigned one of two reward probabilities.
When the macaque chose one of the stimuli, it stayed on the screen for 1 s, whereas the nonselected stimulus disappeared. The probability that a reward would be dispensed after selecting each of the two stimuli was determined on a trial-by-trial basis by two independent algorithms. There was then a 2 s ITI before the next trial. The probability of reward associated with each stimulus was kept stable across the testing session and was predetermined according to one of two probability ratios: 0.75:0.25 and 0.50:0.18 (high reward probability stimulus to low reward probability stimulus). Stimuli were randomly assigned to the left or right position on each trial. One session was run per day, and monkeys were tested on each ratio twice, making a total of four sessions. In keeping with experiment 2 and previous probability matching protocols, if a reward was allocated to a particular stimulus on a given trial, it stayed available until the macaque selected that stimulus (Herrnstein, 1997; Sugrue et al., 2004). The performance of each macaque was then assessed using the same methods as in experiment 2 (Eq. 1, Fig. 2). The number of trials that macaques took to reach the response ratio yielding 97% of the maximum reward rate was determined using a 20 trial moving window (−10/+10) of the subjects' choices of the high reward probability stimulus.
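The trials-to-criterion measure used in experiments 2 and 3 can be sketched as the first trial at which a centered moving proportion of high-probability choices reaches the criterion ratio. `trials_to_criterion` is a hypothetical name. The sketch rests on two assumptions. First, a window of half-width h around trial t covers trials t − h to t + h − 1, which gives 2h trials (20 for −10/+10, 50 for −25/+25). Second, trials without a full window are skipped.

```python
import numpy as np

def trials_to_criterion(chose_high, criterion, half_window=10):
    """First (0-based) trial t at which the proportion of high-probability choices
    in trials t - half_window .. t + half_window - 1 reaches `criterion`.
    Returns None if the criterion is never reached."""
    x = np.asarray(chose_high, dtype=float)
    w = 2 * half_window
    if len(x) < w:
        return None
    counts = np.convolve(x, np.ones(w), mode="valid")  # counts[i] = sum of x[i:i + w]
    hits = np.nonzero(counts / w >= criterion - 1e-12)[0]  # tolerance for rounding
    return int(hits[0]) + half_window if hits.size else None

# Example: 30 low-probability choices followed by 30 high-probability choices.
print(trials_to_criterion([0] * 30 + [1] * 30, criterion=0.8))  # → 36
```

For experiment 2, the same function would be called with `half_window=25`.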
To investigate the macaques' ability to integrate probability and reward magnitude, an additional four sessions were run in which, instead of both stimuli potentially yielding a single food reward, each stimulus was associated with a different amount of food reward. Selecting the stimulus associated with a high probability of reward (0.75 or 0.5) potentially yielded two pellets, whereas selecting the other stimulus, associated with a low probability of reward (0.25 or 0.18), potentially led to the delivery of four pellets. Performance was again assessed using the same methods as before, but to take this difference in reward size between the two options into account, r opt and 97% of the maximum reward rate were calculated with a function that incorporates reward magnitude (Eq. 2; Fig. 2, bottom row).
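One way to fold reward magnitude into the expected-value curve is to weight each option's harvest rate by its number of pellets. This sketch may differ from the paper's Eq. 2. It assumes the same baited-schedule rule as in the matching tasks, independent choices at rate r, and reward allocation before each choice. The function names are hypothetical.

```python
import numpy as np

def harvest_rate(c, a):
    """Rewards per trial from a baited option chosen with probability c and
    allocated a reward with probability a on each trial (assumed model)."""
    c = np.asarray(c, dtype=float)
    denom = a + c - a * c
    safe = np.where(denom > 0, denom, 1.0)  # avoid 0/0 when a = c = 0
    return np.where(denom > 0, c * a / safe, 0.0)

def weighted_ev(r, p, q, m_high, m_low):
    """Expected pellets per trial when the high-probability stimulus (p, m_high
    pellets) is chosen at rate r and the low one (q, m_low pellets) at 1 - r."""
    r = np.asarray(r, dtype=float)
    return m_high * harvest_rate(r, p) + m_low * harvest_rate(1.0 - r, q)

r = np.linspace(0.0, 1.0, 1001)
r_opt = {}
for mags in [(1, 1), (2, 4)]:
    ev = weighted_ev(r, 0.75, 0.25, *mags)
    r_opt[mags] = r[np.argmax(ev)]
    print(f"magnitudes {mags}: r_opt = {r_opt[mags]:.3f}, max EV = {ev.max():.3f} pellets/trial")
```

Under these assumptions, making the low-probability stimulus worth four pellets (against two) moves r opt for the 0.75:0.25 ratio from about 0.9 down to about 0.75. This is why the single- and differential-reward conditions need separate criteria.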
Experiment 4: three-choice stimulus–outcome task.
Three stimuli, which were entirely novel at the start of each testing session, were presented on a touch screen in one of four spatial configurations (Fig. 1 B). The configuration on a given trial was randomly selected. The spatial position of each stimulus within each configuration was further randomized to ensure that macaques were not simply using spatial cues to guide their choices. Each stimulus was associated with a different probability of receiving a single food reward. When a macaque chose one of the stimuli, it stayed on the screen for 1 s, whereas the other nonselected stimuli disappeared. A reward was then dispensed according to a predetermined reward schedule (Fig. 1 C). There was then a 2 s ITI before the next trial. Rewards were not assigned according to a “matching” schedule and did not remain available for a given stimulus until it was selected (Herrnstein, 1997; Sugrue et al., 2004; Kennerley et al., 2006).
The reward schedule defined whether a reward would be delivered on a given trial for selecting a particular stimulus. The probability of reward associated with each stimulus on a given trial was calculated from these schedules using a 20 trial moving window (−10/+10) (Fig. 1 C). Over the 300 trial session, the probability of reward associated with each stimulus varied between 0.0 and 0.9. At the start of a session, one stimulus (stimulus A) was associated with a high probability of reward, which then gradually diminished over the course of the session (Fig. 1 C). In contrast, the probabilities associated with the other two stimuli were relatively low at the start of the session but progressively increased with different time courses, such that only one of these stimuli (stimulus B) was associated with the highest probability of reward at the end of the session. Preoperatively and postoperatively, macaques completed five sessions, with one session run per day.
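The per-trial reward probabilities can be sketched from a binary schedule matrix (trials × stimuli, 1 = a reward would be delivered for choosing that stimulus on that trial). The option with the highest windowed rate then gives a per-trial H option, in the spirit of the H schedule measure described next. Several details are assumptions: the function names, the window for trial t covering trials t − 10 to t + 9, truncation of the window at the session edges, and ties being broken toward the lower stimulus index.

```python
import numpy as np

def windowed_reward_prob(schedule, half_window=10):
    """Proportion of rewarded entries per stimulus in a centered window around
    each trial (truncated at the start and end of the session)."""
    s = np.asarray(schedule, dtype=float)
    n = s.shape[0]
    probs = np.empty_like(s)
    for t in range(n):
        lo, hi = max(0, t - half_window), min(n, t + half_window)
        probs[t] = s[lo:hi].mean(axis=0)
    return probs

def h_schedule(schedule, half_window=10):
    """Index of the stimulus with the highest windowed reward rate on each trial
    (np.argmax breaks ties toward the lower index)."""
    return np.argmax(windowed_reward_prob(schedule, half_window), axis=1)

# Toy schedule: stimulus 0 pays on the first 20 trials, stimulus 1 on the last 20.
toy = np.zeros((40, 3))
toy[:20, 0] = 1
toy[20:, 1] = 1
print(h_schedule(toy))
```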
The choices of the OFC and CON groups on the three-choice stochastic reward task were compared by counting how many times each subject chose the stimulus associated with the highest probability of reward (H option) (stimuli A–C) (Fig. 1 C). An objective estimate of the H option, H schedule, was determined for every trial by comparing the mean reward rate associated with each option over a 20 trial period. A subjective estimate of the H option, H RL, was also derived for each trial using a reinforcement learning model (Sutton and Barto, 1998). For this measure, the model determined the value associated with each option based on each macaque's choices. H RL was calculated by assuming a simple Rescorla–Wagner model with a Boltzmann action selection rule (Behrens et al., 2007). We fitted the learning rate, α, and the temperature parameter, β, using standard nonlinear minimization routines. The learning rate was determined separately for each schedule and macaque based on its preoperative performance. This learning rate was then applied to both preoperative and postoperative testing sessions, and the numbers of choices made in accordance with the predictions of the model were counted and compared. An index of how different the macaque's choices were from the option with the highest subjective value predicted by the model, called the error score, was also calculated and compared.
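The model-based measure can be sketched as a Rescorla–Wagner learner with softmax (Boltzmann) choice, fitted by maximum likelihood. This is a sketch under stated assumptions and not the authors' code. A coarse grid search stands in for the nonlinear minimization routine. Values start at 0. Only the chosen option is updated. β is used here as an inverse temperature (larger means more deterministic), whereas the text calls β the temperature parameter. The first-trial tie is broken toward option 0. The error score is not reproduced because its formula is not given. All names and the simulated data are hypothetical.

```python
import numpy as np

def rw_values(choices, rewards, alpha, n_options=3):
    """Option values *before* each trial; only the chosen option is updated."""
    v = np.zeros(n_options)
    out = np.empty((len(choices), n_options))
    for t, (c, r) in enumerate(zip(choices, rewards)):
        out[t] = v
        v[c] += alpha * (r - v[c])  # Rescorla-Wagner prediction-error update
    return out

def nll_from_values(values, choices, beta):
    """Negative log-likelihood of the choices under softmax selection."""
    z = beta * values
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(choices)), np.asarray(choices)].sum()

def neg_log_lik(choices, rewards, alpha, beta, n_options=3):
    return nll_from_values(rw_values(choices, rewards, alpha, n_options), choices, beta)

def fit_rw(choices, rewards, n_options=3,
           alphas=np.linspace(0.02, 1.0, 50), betas=np.linspace(0.25, 20.0, 80)):
    """Grid-search maximum-likelihood estimate of (alpha, beta)."""
    best = (np.inf, None, None)
    for a in alphas:
        values = rw_values(choices, rewards, a, n_options)  # depends on alpha only
        for b in betas:
            nll = nll_from_values(values, choices, b)
            if nll < best[0]:
                best = (nll, a, b)
    return best[1], best[2]

def percent_h_rl(choices, rewards, alpha, n_options=3):
    """Percentage of choices of the option with the highest model value (H_RL)."""
    h_rl = np.argmax(rw_values(choices, rewards, alpha, n_options), axis=1)
    return 100.0 * np.mean(h_rl == np.asarray(choices))

# Hypothetical data: a simulated softmax learner on fixed reward probabilities.
rng = np.random.default_rng(1)
p_reward = np.array([0.8, 0.3, 0.1])
choices, rewards, v = [], [], np.zeros(3)
for _ in range(300):
    p = np.exp(5.0 * v) / np.exp(5.0 * v).sum()
    c = int(rng.choice(3, p=p))
    r = float(rng.random() < p_reward[c])
    v[c] += 0.3 * (r - v[c])
    choices.append(c)
    rewards.append(r)

alpha_hat, beta_hat = fit_rw(choices, rewards)
pct = percent_h_rl(choices, rewards, alpha_hat)
print(f"alpha = {alpha_hat:.2f}, beta = {beta_hat:.2f}, H_RL choices = {pct:.1f}%")
```

Following the procedure described above, `alpha_hat` would be fitted on preoperative sessions and then reused unchanged when computing H RL for postoperative sessions.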
In summary, the H schedule measure is the percentage of trials on which the monkey chose the option that, according to the schedule, had the highest probability of reward. However, the subjects would often have had to be clairvoyant to pick this option, because the only information available to them came from their previous experience of reinforcement associated with the different options. Optimal behavior might alternatively be analyzed as the percentage of trials on which the monkeys chose the option that had recently been observed to be most valuable. This is quantified by the H RL measure.
Surgery.
All surgery was performed under sterile conditions in a dedicated operating theater. At least 12 h before surgery, macaques were treated with an antibiotic (8.75 mg/kg amoxicillin, i.m.) and a steroidal antiinflammatory (20 mg/kg methylprednisolone, i.m.) to reduce the risk of postoperative infection, edema, and inflammation. Additional supplements of steroids were given at 4–6 h intervals during surgery. On the morning of surgery, animals were sedated with ketamine (10 mg/kg, i.m.) and xylazine (0.5 mg/kg, i.m.) and given injections of atropine (0.05 mg/kg), an opioid (0.01 mg/kg buprenorphine), and a nonsteroidal antiinflammatory (0.2 mg/kg meloxicam) to reduce secretions and provide analgesia, respectively. They were also treated with an H2 receptor antagonist (1 mg/kg ranitidine) to protect against gastric ulceration, which might otherwise have resulted from administering both steroidal and nonsteroidal antiinflammatory treatments. Macaques were then moved to the operating theater, where they were intubated, switched onto isoflurane anesthesia (1–2%, to effect, in 100% oxygen), and placed in a head holder. The head was shaved and cleaned using antimicrobial scrub and alcohol. A midline incision was made, the tissue was retracted in anatomical layers, and a bilateral bone flap was removed. All lesions were made by aspiration with a fine gauge sucker. Throughout the surgery, heart rate, respiration rate, blood pressure, expired CO2, and body temperature were continuously monitored. At the completion of the lesion, the wound was closed in anatomical layers. Nonsteroidal antiinflammatory analgesic (0.2 mg/kg meloxicam, orally) and antibiotic (8.75 mg/kg amoxicillin, orally) treatments were administered for ∼5 d postoperatively. At least 3 weeks were allowed for recovery before testing resumed.
Three animals received bilateral OFC lesions. In these macaques, the intention was to make lesions similar to those that have previously been reported by Izquierdo et al. (2004). Therefore, care was taken to remove all the tissue on the orbital surface medial to the lateral orbital sulcus as far as the gyrus rectus. The caudal boundary of the lesion was an imaginary line perpendicular to the caudalmost point of the lateral and medial orbital sulci. The rostral boundary was the set of lines joining the most rostral points of the medial and lateral orbital sulci.
As previously described, three animals received bilateral ACCS lesions (Kennerley et al., 2006). Briefly, the ACCS lesion was made by removing the cortex within the dorsal and ventral banks of the anterior cingulate sulcus (including areas 24c, 24c′). The caudal limit of the ACCS lesion was an imaginary perpendicular line drawn through the midpoint of the precentral dimple. The lesion then extended to the most rostral point on the cingulate sulcus. The method of Parker and Gaffan (1997) was used to best preserve the blood supply to the tissue dorsal and lateral to the lesion. When the lesion in the first hemisphere was complete, the falx was cut to allow access to the second hemisphere.
Histology.
When the animals had completed their testing, they were anesthetized with sodium pentobarbitone and perfused with 90% saline and 10% formalin. The brains were then removed and placed in 10% sucrose formalin until they sank. The brains were blocked in the coronal plane at the level of the most medial part of the central sulcus. Each brain was cut in 50 μm coronal sections. Every 10th section was retained for analysis and stained with cresyl violet.
Both the OFC and ACCS lesions were mostly as intended (Fig. 3). OFC lesions reliably destroyed lateral OFC, including Walker's areas 11 and 13, in all cases (Fig. 3) (Walker, 1940). There was, however, some variation in the amount of damage to the parts of OFC on the medial surface, including Walker's area 14. At their most anterior extent, lesions often failed to include the most medial portion of the orbital surface, and at the posterior end of the lesion, parts of the gyrus rectus were spared. The extent of the ACCS lesions has been reported in detail previously (Kennerley et al., 2006; Rudebeck et al., 2006a). Briefly, ACCS lesions reproducibly destroyed areas 24c and 24c′. Both dorsal and ventral banks of the cingulate sulcus were ablated anterior to a point adjacent to the midpoint of the precentral dimple. In one animal, the lesion started anterior to this point. All ACCS lesions ended at the most anterior point on the cingulate sulcus (Kennerley et al., 2006).
Data analysis.
Where possible, the data from all experiments were analyzed using repeated-measures ANOVA statistical methods with surgery (preoperative vs postoperative performance), ratio, reward (experiments 2 and 3), response (experiment 3), and schedule (experiment 4) as within-subject factors and group (OFC vs ACCS vs CON) as a between-subject factor. Additional post hoc simple main effects analyses were conducted to explore any significant main effects or interactions (p < 0.05) suggesting a change in performance after a lesion. If the data failed the assumptions of normality or equal variance, they were transformed (log10) to improve the distribution.
Results
Experiment 1: reinforcement-guided action selection task
Lesions of the OFC did not affect overall performance of the reinforcement-guided action selection task, in direct contrast to lesions confined to the ACCS (Fig. 4 A) (Kennerley et al., 2006). Postoperatively, although the OFC group made more errors than the CON group, this increase was not statistically significant (mean ± SEM, preoperative, CON, 106.67 ± 13.01; OFC, 114.33 ± 14.71; postoperative, CON, 140.33 ± 22.12; OFC, 204.67 ± 56.29) (group by surgery interaction, F (1,7) = 2.1, p > 0.1). The OFC group was also able to switch action–outcome associations flexibly when the reward contingencies were reversed, regaining peak performance at a rate comparable to that of the CON group (10 trials after a reversal; group by surgery interaction, F (1,7) = 0.57, p > 0.4) (Fig. 4 A).
Despite this apparent lack of effect on reinforcement-guided action selection, it might be the case that OFC lesions only subtly alter the ability to use either reward or error information to guide subsequent action selection and that this effect was not picked up in the analyses of overall or switch-related performance. To explore this possibility, the influence of rewarded and unrewarded trials on subsequent choices was determined (Fig. 4 B). Immediately after a rewarded or correct action (correct+1 trials), the OFC group (mean ± SEM, preoperative, 93.2 ± 2.1; postoperative, 86.8 ± 6.1) were just as likely to make correct responses as the CON group (mean ± SEM, preoperative, 93.4 ± 1.1; postoperative, 92.2 ± 1.8) (group by surgery interaction, F (1,7) = 1.86, p > 0.2). Similarly, immediately after an unrewarded or error response (error+1), the OFC group (mean ± SEM, preoperative, 64.45 ± 8.47; postoperative, 57.21 ± 6.23) were statistically just as likely as the CON (mean ± SEM, preoperative, 63.89 ± 3.77; postoperative, 62.06 ± 4.1) group to switch to making the correct, rewarded response (group by surgery interaction, F (1,7) = 0.83, p > 0.3). These results suggest that OFC lesions did not affect the influence of correctly performed trials or errors on subsequent choices.
The impact of OFC lesions on action selection might, however, be even more subtle and only apparent when the influence of multiple rewarded or unrewarded trials on subsequent behavior is examined. To test this hypothesis, we conducted the same EC and EE analyses used by Kennerley et al. (2006). The first part of this analysis compared the percentage of correct choices after different numbers of consecutive correct (EC) responses following an error. The EC analysis revealed that, postoperatively, the CON and OFC groups performed similarly when positive reinforcement was delivered for making the alternative action after an error (trial types E+1 to EC(8)+1; group by surgery interaction, F (1,7) = 2.3, p > 0.1) (Fig. 5). The second part of the analysis compared the percentage of correct responses after different numbers of consecutive error responses (EE) following an initial error. The EE analysis showed that the OFC group's choices were comparable to those of the CON group when no reward was delivered for making the same response after an error (trial types E+1 to EE+1; group by surgery interaction, F (1,7) = 0.25, p > 0.5).
It has previously been shown that ACCS lesions impair performance on action–outcome reversal tasks (Kennerley et al., 2006) (Figs. 4, 5). Although ACCS lesions did not significantly alter performance immediately after an error, animals were unable to use subsequent successive positive reinforcement to sustain repeated selection of the alternative correct action (Fig. 5, bottom). The data from three control macaques in the present study have not been reported previously. We therefore reanalyzed the ACCS data, comparing them with the current CON data, to confirm that the previously reported effects were still apparent. Once again, the ACCS lesion clearly impaired overall performance (p < 0.05) (Fig. 4 A), particularly on trials that followed correctly performed trials (p < 0.01) (Fig. 4 B). The EC analysis showed that ACCS animals were less likely to use successive positive reinforcement to sustain repeated selection of the correct action (p < 0.01) (Fig. 5, bottom). These findings agree directly with our previously published report of the effect of ACCS lesions on action–outcome reversal (Kennerley et al., 2006).
Experiment 2: action–outcome matching task
After the completion of experiment 1, macaques were tested postoperatively on a discrete trial version of a dynamic probability-matching task (Kennerley et al., 2006). Performance was assessed by calculating the number of trials that macaques took to reach the ratio of responses that yielded 97% of the maximum rate of reward (r opt). The maximum reward rate and r opt differ for each ratio and were determined individually using the methods outlined by Kennerley et al. (2006) (Fig. 1). Choices from the stochastic and deterministic ratios were analyzed separately because the aim of this experiment was to assess decision making in stochastic reward environments, which can be thought of as more naturalistic. Lesions of the OFC did not affect the ability of macaques to learn the ratio of responses associated with r opt for each ratio of probabilities. Macaques with OFC lesions (mean overall trials to r opt ± SEM, 125.22 ± 47.63) learned at a rate similar to that of the CON group (mean overall trials to r opt ± SEM, 89.88 ± 17.95) (Fig. 6 A) and took a similar number of trials to reach 97% of r opt across the three stochastic reward ratios (effect of group, F (1,5) = 0.61, p > 0.4; group by ratio interaction, F (2,10) = 0.89, p > 0.4) (Fig. 7 A). In contrast, macaques with ACCS damage (mean overall trials to r opt ± SEM, 200.46 ± 31.94) took more trials than the CON group to reach the optimal ratio of responses across the three stochastic ratios (effect of group, F (1,5) = 16.12, p < 0.05) (Kennerley et al., 2006). Importantly, neither the OFC nor the ACCS group differed from the CON group when the reward ratio was deterministic (1.0:0.0 reward ratio; both comparisons, p > 0.2).
Experiment 3: stimulus–outcome matching task
Although the ACCS was critical for selecting which action to make for reinforcement in experiments 1 and 2, lesions of the OFC, and not the ACCS, disrupted the ability to learn the optimal ratio of choices in a stimulus–outcome matching task (Fig. 6 B). Postoperatively, the OFC group learned more slowly (Fig. 6 B) and took more trials to reach the ratio of choices that yielded 97% of the maximal reward rate, r opt, than did the CON group (Fig. 7 B) (effect of group, F (2,6) = 5.47, p < 0.05; OFC vs CON, p < 0.05). In contrast, the ACCS group learned just as quickly as the CON group (p > 0.5) (Fig. 7 B), reaching the optimal ratio of responses within ∼30 trials (Fig. 6 B). Changing the amount of food reward or ratio of probabilities associated with the two options did not affect the pattern of results (either interaction between group, reward, and ratio, F (2,6) < 2.94, p > 0.1).
A number of previously published reports have highlighted the role of the OFC in changing behavior in response to errors (Fellows and Farah, 2003; Izquierdo et al., 2004). It was therefore critical to determine whether the macaques' deficit in learning the optimal ratio of responses resulted from an inability to switch between the two visual stimuli, or from an inability to alter behavior after unrewarded as well as rewarded trials. First, the macaques' choices in experiment 3 were analyzed to determine the overall probability of switching for each ratio and reward level, by determining whether a macaque chose the alternative option on the next trial, regardless of the outcome of the current trial. Macaques with OFC lesions were just as likely to switch between the two visual stimuli as the CON and ACCS groups (probability of switching: effect of group, F (2,6) = 0.15, p > 0.5).
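The outcome-independent switching measure reduces to the fraction of consecutive trial pairs on which the choice changed. `switch_probability` is a hypothetical helper, and the stimulus labels in the example are arbitrary.

```python
import numpy as np

def switch_probability(choices):
    """Proportion of trials on which the chosen stimulus differs from the
    previous trial's choice, ignoring the previous outcome."""
    c = np.asarray(choices)
    if c.size < 2:
        return float("nan")
    return float(np.mean(c[1:] != c[:-1]))

print(switch_probability(["A", "B", "B", "A"]))  # 2 switches out of 3 transitions
```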
Second, the macaques' choices immediately after trials that were either unrewarded or rewarded were analyzed by reward level (single or differential reward), ratio (0.75:0.25 or 0.5:0.18), and whether the response was to the high-probability (H) or low-probability (L) stimulus. After unrewarded trials, the OFC group showed a weaker tendency than the CON group to keep choosing the H stimulus as opposed to the L stimulus (supplemental Fig. S1, available at www.jneurosci.org as supplemental material); however, this was only true when both the H and L stimuli were associated with a single reward (reward by group by ratio by response interaction, F (2,6) = 5.94, p < 0.05; single reward condition CON vs OFC; group by response interaction, F (1,4) = 7.99, p < 0.05). A similar, albeit slightly milder, pattern of effects was found on trials immediately after a rewarded trial. Macaques with OFC lesions did not show the same difference between selecting the H as opposed to the L stimuli as the CON group, but only when a single reward could be obtained on each trial (group by reward by response interaction, F (2,6) = 18.09, p < 0.01; single reward condition, CON vs OFC: group by response interaction, F (1,4) = 4.36, p = 0.054). These analyses indicate that the impairments associated with OFC lesions cannot simply be attributed to switching or to the use of unrewarded trials alone, because the use of rewarded trials was also affected.
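The conditional analysis can be sketched as the probability of choosing H on the next trial, split by the previous trial's outcome and by whether H or L was chosen on that trial. The helper name and the 'H'/'L' encoding are assumptions. In the study these proportions would additionally be broken down by reward level and ratio before entering the ANOVA.

```python
from collections import defaultdict

def conditional_choice_of_high(choices, rewards):
    """P(choose H on trial t+1), keyed by (rewarded on t, stimulus chosen on t).
    choices: sequence of 'H'/'L'; rewards: sequence of bools."""
    tally = defaultdict(lambda: [0, 0])  # key -> [next choice was H, instances]
    for prev_c, prev_r, next_c in zip(choices[:-1], rewards[:-1], choices[1:]):
        key = (bool(prev_r), prev_c)
        tally[key][0] += next_c == "H"
        tally[key][1] += 1
    return {key: n_h / n for key, (n_h, n) in tally.items()}

choices = ["H", "H", "L", "H", "L", "L"]
rewards = [True, False, False, True, False, True]
print(conditional_choice_of_high(choices, rewards))
```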
Experiment 4: three-choice stimulus discrimination task
It might be argued that the stimulus–outcome matching experiment contains an implicit response reversal requirement. In the matching task, the value of the nonchosen stimulus increases over successive trials, because a reward allocated to a stimulus on a given trial remains available until that stimulus is chosen, even if this only occurs on a later trial. To further explore the role of the OFC in new learning of stimulus–reward associations, experiment 4 also investigated decision making in a dynamic stochastic environment. Unlike the matching task, however, experiment 4 used a predetermined reinforcement schedule in which the value of each option did not depend on the macaques' history of previous choices, so the points at which the relative values of the stimuli reversed were clearly defined within each testing session.
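Why baiting makes a neglected option increasingly attractive can be sketched as follows. The sketch assumes, hypothetically, that each option is armed independently with a fixed probability on every trial and that an armed reward persists until collected; the exact arming rule of the matching task is not specified in this section.

```python
def p_reward_waiting(p_arm: float, k_unchosen: int) -> float:
    """Probability that a reward is waiting on a baited option.

    Hypothetical model: the option is armed independently with probability
    p_arm on every trial, and an armed reward persists until the option is
    chosen. k_unchosen is the number of trials since the option was last
    chosen, so there have been k_unchosen + 1 chances for it to be armed.
    """
    if not 0.0 <= p_arm <= 1.0 or k_unchosen < 0:
        raise ValueError("need 0 <= p_arm <= 1 and k_unchosen >= 0")
    return 1.0 - (1.0 - p_arm) ** (k_unchosen + 1)


# Under a predetermined (non-baited) schedule the answer would stay at p_arm
# for every k; under baiting it climbs toward 1 the longer the option is ignored.
for k in range(4):
    print(k, round(p_reward_waiting(0.25, k), 3))
```

This is the sense in which the matching task carries an implicit reversal requirement: after enough trials on the richer option, the leaner option becomes the better bet.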
Experiment 4 had a second advantage: it made it possible to investigate the distribution of choices in relation to the subjective value of the stimuli. Although it is standard practice to assess lesion effects in relation to an experimenter-determined reinforcement schedule, the subjective value of a stimulus can also be estimated using standard reinforcement learning models (Sutton and Barto, 1998). Such models, however, are less appropriate for the reversal and matching tasks used in experiments 1–3, which impose additional reversal and matching contingencies. In the action– and stimulus–outcome matching tasks, the probability of getting a reward is explicitly a function of previous choices; choosing a particular stimulus can change the likelihood of getting a reward on the next trial.
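As a hedged sketch of the kind of model meant here, the code below implements a textbook delta-rule (Rescorla–Wagner style) learner with softmax choice (Sutton and Barto, 1998). The specific model, learning rate alpha, and inverse temperature beta used in the study are not given in this section, so all of them are assumptions; under such a model, the "H RL" option on a trial would presumably be the option with the highest value estimate before that trial.

```python
import math
from typing import List, Sequence


def delta_rule_values(
    choices: Sequence[int], rewards: Sequence[float], n_options: int, alpha: float
) -> List[List[float]]:
    """Value estimates held *before* each trial under a simple delta rule.

    Only the chosen option is updated: Q[c] <- Q[c] + alpha * (r - Q[c]).
    """
    q = [0.0] * n_options
    values_before = []
    for c, r in zip(choices, rewards):
        values_before.append(list(q))  # snapshot used to judge this trial's choice
        q[c] += alpha * (r - q[c])     # prediction-error update for the chosen option
    return values_before


def softmax(values: Sequence[float], beta: float) -> List[float]:
    """Choice probabilities; beta (inverse temperature) sets how deterministic choice is."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


history = delta_rule_values(choices=[0, 0, 1], rewards=[1, 0, 1], n_options=2, alpha=0.5)
print(history)  # [[0.0, 0.0], [0.5, 0.0], [0.25, 0.0]]
```

In practice alpha and beta would be fitted to each animal's choices, for example by maximizing the summed log softmax probability of the choices actually made.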
Preoperatively, macaques in both groups learned which of the three novel stimuli (Fig. 8 A, stimulus A, blue line) was associated with the highest probability of reward within the first 100 trials (mean ± SEM, choice probability of stimulus A at trial 70, CON, 72.33 ± 2.85; OFC, 73.67 ± 3.18). During this period, both groups selected this option at comparable rates [number of H option (H schedule) stimulus choices within first 100 trials, F (1,4) = 3.79, p > 0.1]. When the stimulus associated with the highest probability of reward subsequently changed to stimulus B (Fig. 8 A, red line), macaques in both groups switched to choosing this option, but their switch lagged behind the change in the underlying reward schedule. Such a lag is expected given the stochastic nature of the task: the value of a stimulus can only be determined over a period of trials in which that stimulus is selected. If the reward rate of the stimulus a macaque is selecting declines, the macaque is likely to sample the other available options to establish whether one of them now has a higher probability of reward. This sampling process takes a number of trials, producing the gap between the change in the underlying reward schedule and the change in the macaques' choices.
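The expected size of this lag can be worked out under a simple delta-rule learner, which is an assumption rather than the specific model used here. Suppose the value Q of the currently favored option is updated only when it is chosen, Q ← Q + α(r − Q); its value is Q_0 when its reward probability drops to p′, and the animal moves on once the value falls below some threshold θ (for example, the value it assigns to an alternative). Taking expectations over rewards, with n counting the choices of that option after the change:

```latex
\begin{aligned}
\mathbb{E}[Q_{k+1}] &= (1-\alpha)\,\mathbb{E}[Q_k] + \alpha\,p'
  \;\Longrightarrow\; \mathbb{E}[Q_n] = p' + (Q_0 - p')(1-\alpha)^{n},\\
\mathbb{E}[Q_n] \le \theta
  &\iff n \ge \frac{\ln\big[(\theta - p')/(Q_0 - p')\big]}{\ln(1-\alpha)},
  \qquad p' < \theta < Q_0 .
\end{aligned}
```

The inequality flips when dividing because ln(1 − α) < 0. With hypothetical values α = 0.2, Q_0 = 0.7, p′ = 0.2, and θ = 0.4, this gives n ≥ ln 0.4 / ln 0.8 ≈ 4.1, so about five choices of the declining option are needed on average, before counting any trials spent sampling the alternatives.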
Across the 300-trial testing session as a whole, macaques in both groups selected the option associated with the highest objective and subjective probability of reward at a level greater than chance [number of responses to the stimulus of highest value (H schedule): one-sample t test vs chance (0.33), both groups, t > 4.5, p < 0.05]. Importantly, both groups selected the stimulus associated with the highest probability of reward to a similar degree (H schedule or H RL: effect of group, F (1,4) < 0.55, p > 0.4) (Fig. 8 A). When macaques failed to select the highest-probability option, the difference between the subjective value of the highest-value option and that of the chosen option was also comparable between the two groups (error score, effect of group, F (1,4) = 0.04, p > 0.5).
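The error score can be sketched as the mean gap between the highest model value and the value of the chosen option, taken over trials on which the highest-valued option was not chosen. This reading of the measure, and the function below, are assumptions for illustration; the per-trial values would come from a fitted reinforcement learning model.

```python
import math
from typing import Sequence


def error_score(values_before: Sequence[Sequence[float]], choices: Sequence[int]) -> float:
    """Mean value gap (V_best - V_chosen) on trials where the best option was not chosen.

    values_before[t] holds model-estimated values before trial t; trials on which
    the chosen option is (jointly) highest-valued are not counted as errors.
    Returns NaN when there are no error trials, because the mean is undefined.
    """
    gaps = []
    for values, c in zip(values_before, choices):
        best = max(values)
        if values[c] < best:
            gaps.append(best - values[c])
    return sum(gaps) / len(gaps) if gaps else math.nan


# Hypothetical model values for three trials with three stimuli.
values = [[0.6, 0.3, 0.1], [0.6, 0.3, 0.1], [0.5, 0.4, 0.2]]
print(error_score(values, choices=[0, 1, 2]))  # gaps 0.3 and 0.3 -> about 0.3
```

Conditioning on error trials keeps this score separate from how often the best option is chosen: an animal can err more often yet make errors of similar size, which is the pattern reported for the OFC group.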
Postoperatively, the OFC group was considerably worse than the CON group at choosing the option with the highest objective value across the whole 300-trial testing session (H schedule: group by surgery interaction, F (1,4) = 18.54, p < 0.05) (Fig. 8 A). Analyzing the macaques' choices with a reinforcement learning model revealed that the OFC group also chose options with lower subjective values throughout each day's session (H RL: group by surgery interaction, F (1,4) = 35.54, p < 0.01) (Fig. 8 B). These effects were apparent even in the first 100 trials, before any programmed change in stimulus values (group by surgery interaction, number of trials on which macaques selected the H schedule option within the first 100 trials, F (1,4) = 63.21, p < 0.001), indicating that the effect is related to learning probabilistic stimulus–reward associations rather than to reversing stimulus–reward associations per se.
Additional analysis of the choices of macaques with OFC lesions across all postoperative sessions revealed only a trend for these subjects to select the stimulus associated with the highest probability of reward more often than would be expected by chance (one-sample t test vs chance (0.33), t = 3.17, p = 0.087). Despite this finding, it is unlikely that macaques with OFC lesions were merely selecting stimuli at random. On trials in which OFC lesion animals failed to select the highest-value option, they often selected the second-best option, just as the CON group did when they failed to select the highest-value option. Although it should be reiterated that the OFC lesion animals selected the best option (H RL) on fewer trials than the CON group, the errors made by animals in each group in valuing the different stimuli were similar: when subjects failed to select the H RL option postoperatively, the difference in value between the best option and the chosen option was not significantly greater in the OFC group (error score: group by surgery interaction, F (1,4) = 3.28, p > 0.1). This result is important because it suggests that macaques with OFC lesions were not simply selecting stimuli at random during the testing sessions but were instead failing to value the stimuli accurately.
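Chance with three stimuli is 1/3, the 0.33 above. Assuming three animals per group (consistent with the F (1,4) degrees of freedom for a two-group comparison of six animals) and a two-tailed test, the one-sample test has 2 degrees of freedom, and the reported p value can be checked with the closed-form t distribution for 2 df. The sketch below is illustrative only; the helper names are hypothetical, and a general-purpose statistics library would normally be used for other sample sizes.

```python
import math
import statistics
from typing import Sequence, Tuple


def one_sample_t(sample: Sequence[float], mu0: float) -> Tuple[float, int]:
    """One-sample t statistic against mu0 and its degrees of freedom (n - 1)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    sem = statistics.stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 denominator
    return (statistics.fmean(sample) - mu0) / sem, n - 1


def two_tailed_p_df2(t: float) -> float:
    """Exact two-tailed p value for Student's t with 2 degrees of freedom.

    With 2 df the CDF has the closed form F(t) = 1/2 + t / (2 * sqrt(t**2 + 2)),
    so p = 2 * (1 - F(|t|)) = 1 - |t| / sqrt(t**2 + 2).
    """
    return 1.0 - abs(t) / math.sqrt(t * t + 2.0)


print(round(two_tailed_p_df2(3.17), 3))  # 0.087, matching the value reported above
```

Applied to per-animal proportions of best-option choices, `one_sample_t(proportions, 1 / 3)` would return the t statistic and df that feed into `two_tailed_p_df2` when df is 2.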
Discussion
The study examined the effect of OFC lesions on reinforcement-guided stimulus selection and action selection and compared this effect with that of ACCS lesions. Both OFC and ACCS are necessary for effective reinforcement-guided decisions, but their contributions are distinct. The ACCS, but not OFC, is critical when decisions are based on the association of actions with reinforcement (experiments 1 and 2) (Figs. 4–7). No discriminative stimuli were present in experiments 1 and 2, so no stimulus–reinforcement associations were available to guide decision making. In contrast, OFC, but not ACCS, is critical when decisions depend on the association of stimuli with reinforcement (experiments 3 and 4) (Figs. 6–8). The actions made to the chosen stimuli varied from trial to trial, so no action–reinforcement associations were available to guide performance in experiments 3 and 4.
The ACCS impairment was apparent both when animals learned reversals of deterministic action–reward associations (experiment 1) and when they learned the value of actions with probabilistic reward associations (experiment 2) (Kennerley et al., 2006). It is therefore unlikely that the lack of an OFC effect on the action tasks was caused by the paradigms being too simple. In contrast, experiment 3 showed that OFC lesions, but not ACCS lesions, impaired performance in a situation analogous to experiment 2, but in which probabilistic stimulus–reinforcement associations, rather than probabilistic action–reward associations, had to be learned. Although this point has not been emphasized or addressed before, a reconsideration of existing studies reveals that performance on a stimulus–reinforcement reversal task, analogous to the action reversal task used here, is impaired by OFC lesions (Iversen and Mishkin, 1970; Izquierdo et al., 2004) but not by ACCS lesions (Rudebeck et al., 2006a).
Experiments 3 and 4 also revealed that OFC lesions can impair even the learning of completely novel stimulus values when the association with reward is stochastic and dynamically changing. In experiment 4, macaques with OFC lesions chose options with a worse history of reinforcement than did control animals (Fig. 8 B). The impairment was apparent even in the first 100 trials of each day's session, before any reversal in the values of the objects (Fig. 8 A). These results suggest that decision-making deficits after OFC lesions may ultimately be the consequence of impaired reinforcement learning (Fellows, 2007). The OFC contribution to new learning may have been apparent because the reward associations in experiment 4 were probabilistic rather than deterministic. Because reward delivery in probabilistic settings is stochastic, the best choice cannot be discerned from the outcome of the last trial; instead, it has to be determined from an integrated history of reinforcement. As already noted, such probabilistic paradigms may be an important tool for future research into the function of the prefrontal cortex.
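The difference between relying on the last outcome and integrating reinforcement history can be illustrated with a small simulation. This is a hypothetical sketch, not the study's analysis: win-stay/lose-shift stands in for a last-trial strategy, an epsilon-greedy delta-rule learner stands in for an integrating strategy, and the reward probabilities, alpha, and epsilon are invented and held fixed rather than changing as in experiment 4.

```python
import random
from typing import Sequence


def fraction_best(policy: str, p_reward: Sequence[float], n_trials: int, seed: int,
                  alpha: float = 0.1, epsilon: float = 0.1) -> float:
    """Fraction of trials on which the objectively best option is chosen.

    policy "wsls": win-stay/lose-shift, using only the last outcome.
    policy "delta": epsilon-greedy on delta-rule values that integrate past outcomes.
    """
    if policy not in ("wsls", "delta"):
        raise ValueError("policy must be 'wsls' or 'delta'")
    rng = random.Random(seed)
    n = len(p_reward)
    best = max(range(n), key=lambda i: p_reward[i])
    q = [0.0] * n
    choice = rng.randrange(n)
    n_best = 0
    for _ in range(n_trials):
        if policy == "delta":
            # Explore with probability epsilon, otherwise pick the highest current value.
            if rng.random() < epsilon:
                choice = rng.randrange(n)
            else:
                choice = max(range(n), key=lambda i: q[i])
        reward = 1.0 if rng.random() < p_reward[choice] else 0.0
        n_best += choice == best
        if policy == "delta":
            q[choice] += alpha * (reward - q[choice])
        elif reward == 0.0:
            # Lose-shift: move to one of the other options at random.
            choice = rng.choice([i for i in range(n) if i != choice])
    return n_best / n_trials


probs = [0.1, 0.3, 0.6]  # hypothetical static reward probabilities
print(fraction_best("wsls", probs, 20_000, seed=1))
print(fraction_best("delta", probs, 20_000, seed=1))
```

For these probabilities, win-stay/lose-shift settles at roughly half of trials on the best option (its long-run share of each option is proportional to 1/(1 − p)), whereas the integrating learner chooses the best option on a clearly larger share of trials.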
It is now clear that ACCS does not simply represent the occurrence of an error but is instead important whenever an outcome suggests the need to revise the estimate of the value of an action (Walton et al., 2004; Amiez et al., 2005, 2006; Kennerley et al., 2006; Matsumoto et al., 2007; Quilodran et al., 2008). Our results suggest that OFC can be similarly important in the learning of new stimulus–reinforcement associations and not simply in effecting a reversal in behavior after an error as has been previously highlighted (Fellows and Farah, 2003; Izquierdo et al., 2004). This is consistent with evidence of encoding of reinforcement expectations in OFC neurons in both rats and monkeys (Schoenbaum et al., 1998; Tremblay and Schultz, 1999; Padoa-Schioppa and Assad, 2006).
The results support the hypothesis that there is a relative specialization for stimulus/object–reinforcement and action–reinforcement association learning in OFC and ACCS, respectively (Rushworth et al., 2007). Although a direct comparison of the two brain areas has not previously been made in the macaque, it is the case that reports of ACCS lesion impairment have usually used action–reinforcement learning tasks (Shima and Tanji, 1998; Hadland et al., 2003; Kennerley et al., 2006), whereas OFC impairment has been investigated using stimulus–reinforcement learning tasks (Rolls, 1999; Murray et al., 2007).
In addition to the distinct roles that they play in reinforcement-guided action selection and stimulus selection, the OFC and ACCS differ in other respects. For example, a distinction has been drawn between control of decision making by action value caching or by model-based representation of outcome predictions (Daw et al., 2005). OFC represents the expected value of an outcome or goal associated with a stimulus; animals normally do not make object choices that lead to outcomes that have been devalued, but this is not the case after OFC lesions (Baxter et al., 2000; Izquierdo et al., 2004). In contrast, the ACCS integrates information about the intrinsic costs of an action as well as its reward-related benefits; ACC, but not OFC, lesions alter action–effort cost–benefit decision making (Walton et al., 2003; Rudebeck et al., 2006b) and ACCS neuron activity reflects the cost as well as the benefit of a course of action (Kennerley et al., 2008). In other words, ACCS may represent the value of an action, whereas OFC may represent the value of an expected outcome.
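The caching versus outcome-based distinction (Daw et al., 2005), as it applies to devaluation, can be illustrated with a toy example. Everything here is hypothetical: a fixed object-to-food mapping, made-up food values, and a frozen copy of earlier values standing in for a cache learned through experience. It does not model OFC or ACCS; it only shows why a cached value keeps favoring a devalued outcome while a valuation that consults the current outcome value does not.

```python
from typing import Dict

# Hypothetical task: each object deterministically delivers one food.
OBJECT_TO_FOOD = {"object_1": "peanut", "object_2": "raisin"}


def outcome_based_values(food_values: Dict[str, float]) -> Dict[str, float]:
    """Model-based valuation: look up the current value of each object's outcome."""
    return {obj: food_values[food] for obj, food in OBJECT_TO_FOOD.items()}


def preferred(values: Dict[str, float]) -> str:
    """Return the object with the highest value."""
    return max(values, key=values.get)


food_values = {"peanut": 1.0, "raisin": 0.8}
cached = outcome_based_values(food_values)  # stand-in for values cached during training
food_values["peanut"] = 0.0                 # selective devaluation, e.g. by satiety
current = outcome_based_values(food_values)

print(preferred(cached))   # object_1: the cache still favors the devalued outcome
print(preferred(current))  # object_2: re-evaluating outcomes tracks the devaluation
```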
The shared involvement of OFC and ACCS in reinforcement-guided decision making is consistent with the areas' shared anatomical connections with brain structures critical for reinforcement learning such as the amygdala and ventral striatum (Porrino et al., 1981; Morecraft et al., 2007). The differences in OFC and ACCS function are also, however, consistent with differences in their relative strength of connection with sensory and motor systems. Although OFC accesses highly processed sensory information, particularly visual information, via its connections with temporal and perirhinal cortex, there are fewer such connections to ACCS or adjacent medial frontal cortex (Van Hoesen et al., 1993; Webster et al., 1994; Carmichael and Price, 1996; Kondo et al., 2005). In contrast, ACCS may influence action selection via its direct connections to spinal cord and premotor and motor cortex (Dum and Strick, 1993; Wang et al., 2004). Diffusion weighted imaging and tractography studies suggest similar differences in ACCS and OFC connectivity in the human brain (Croxson et al., 2005).
Temporal and perirhinal projections are particularly prominent to the lateral OFC (Carmichael and Price, 1996), and the selectivity of the OFC impairment for stimulus–reinforcement learning may stem from the restriction of the present lesions to more lateral parts of OFC (Brodmann areas 11 and 13) (Fig. 3). The lesions therefore resemble those studied by Murray and colleagues (Izquierdo et al., 2004) more closely than the lesions in patients, which are often centered on medial rather than lateral parts of OFC (Fellows and Farah, 2003). Given its distinct connectivity, medial OFC may play a different role in reinforcement-guided decision making (Fellows, 2007).
Previous recording studies of primate OFC have reported that, on presentation of a visual stimulus, neural activity occurs that is correlated with reward expectation (Tremblay and Schultz, 1999; Wallis and Miller, 2003). The same studies show little evidence of the encoding of information about actions or action–reinforcement associations. However, it could be argued that, because these paradigms lacked consistent action–reinforcement associations, they were not the most appropriate for identifying action–reinforcement encoding. In contrast, ACCS neurons carry information about whether an action should be made in addition to information about reinforcement expectation (Matsumoto et al., 2003; Quilodran et al., 2008). Despite these suggestive differences, it is important to note that activity in primate OFC and ACCS has not been directly compared in the dynamic stochastic reward environments used in the present study. The present results, however, demonstrate that action values and stimulus values can both be learned and used to guide decision making, albeit via distinct neural circuits. A recent report showed that many neurons in both ACCS and OFC encode the value of choices when decision variables such as probability and magnitude are fixed across a testing session (Kennerley et al., 2008). In many situations, both systems are likely to work together, and in the short term, damage to either may affect a number of cognitive processes even if longer-term impairments are specific to one system. Both systems may be prerequisites for effective decision making in natural environments.
Footnotes
- This work was supported by the Medical Research Council United Kingdom (P.H.R., T.E.B., M.J.B., M.F.S.R.) and The Wellcome Trust (M.E.W., M.G.B.). We thank Dr. D. Gaffan for his advice and encouragement as well as Greg Daubeny for assistance with histology.
- Correspondence should be addressed to Peter H. Rudebeck at his present address: Laboratory of Neuropsychology, National Institute of Mental Health–National Institutes of Health, Building 49, Suite 1B80, 49 Convent Drive, Bethesda, MD 20892-4415. rudebeckp{at}mail.nih.gov