Abstract
Deciding whether to forego immediate rewards or explore new opportunities is a key component of flexible behavior and is critical for the survival of the species. Although previous studies have shown that different cortical and subcortical areas, including the amygdala and ventral striatum (VS), are implicated in representing the immediate (exploitative) and future (explorative) value of choices, the effect of the motor system used to make choices has not been examined. Here, we tested male rhesus macaques with amygdala or VS lesions on two versions of a three-arm bandit task where choices were registered with either a saccade or an arm movement. In both tasks we presented the monkeys with explore–exploit tradeoffs by periodically replacing familiar options with novel options that had unknown reward probabilities. We found that monkeys explored more with saccades but showed better learning with arm movements. VS lesions caused the monkeys to be more explorative with arm movements and less explorative with saccades, although this may have been due to an overall decrease in performance. VS lesions affected the monkeys’ ability to learn novel stimulus-reward associations in both tasks, while after amygdala lesions this effect was stronger when choices were made with saccades. Further, on average, VS and amygdala lesions reduced the monkeys’ ability to choose better options only when choices were made with a saccade. These results show that learning reward value associations to manage explore–exploit behaviors is motor system dependent and they further define the contributions of amygdala and VS to reinforcement learning.
Significance Statement
The amygdala and VS are known to be important for learning reward associations and for mediating explore–exploit behaviors. These behaviors are typically studied in experimental paradigms where choices are made with a single motor system. Here we show that nonhuman primates mediate explore–exploit behaviors in a motor system-dependent way. Monkeys were more explorative with eye movements but showed better learning performance with arm movements. Moreover, we showed different effects of amygdala and VS lesions on explore–exploit behaviors based on the motor system implementing task choices. Thus, we further define amygdala and VS contributions to explore–exploit behaviors and suggest that a different value representation might be driving learning in the oculomotor and skeletomotor systems.
Introduction
Living in a changing environment requires deciding whether to forego immediate rewards or explore new opportunities. This decision is known as the explore–exploit dilemma (Sutton and Barto, 2018), and mediating it properly is crucial for optimal interaction with the environment. The dilemma requires animals and humans to learn the values of their choices, which can be described in terms of stimuli or actions, to determine when exploration is advantageous. Exploitation of familiar choice options is optimal in a stable environment because it maximizes the immediate reward. Exploratory behaviors, in contrast, are advantageous when circumstances change, as exploration decreases uncertainty about novel stimuli and may lead to greater future reward (Averbeck, 2015). Thus, this process relies primarily on the brain's ability to learn value representations of stimuli and/or actions, and to use these representations to select options that maximize future rewards. Understanding how the brain mediates explore–exploit behaviors is fundamental, as alteration of explorative behavior has been linked to brain disorders (Djamshidian et al., 2011; Averbeck et al., 2013; Martinelli et al., 2018; Sethi et al., 2018).
Previous work has identified several cortical regions involved in driving exploration and the responses to novel stimuli (Daw et al., 2006; Pearson et al., 2009; Raja Beharelle et al., 2015; Zajkowski et al., 2017; Blanchard and Gershman, 2018; Ebitz et al., 2018; Costa and Averbeck, 2020; Ogasawara et al., 2022), initially suggesting top-down control in shifting between exploitative and explorative behaviors. Recent studies have shown that the amygdala and ventral striatum (VS), traditionally linked to associative and value-based learning (Haber and Knutson, 2010; Floresco, 2015; Averbeck and Costa, 2017), also play a major role in novelty-driven exploration (Costa et al., 2019), perhaps under the influence of the dopamine system (Kakade and Dayan, 2002; Costa et al., 2014; Lak et al., 2016) but at least in parallel with cortical value encoding (Hogeveen et al., 2022). However, in previous studies, subjects were required to learn the values of choice options through responses made with a single effector (e.g., eye movements or arm movements). This approach may reflect the common assumption that there exists a single value representation in the brain that drives learning, independently of the motor system used to make the choice. For example, actor–critic models suggest that a single value representation in the VS underlies policy learning in motor systems (Joel et al., 2002; O'Doherty et al., 2004). This hypothesis, however, does not account for differential effects of lesions on learning stimulus-outcome versus action-outcome associations (Goldstein et al., 2012; Rothenhoefer et al., 2017).
Early studies proposed that the VS might work as a motivational-motor interface by virtue of its connections with both motivational and motor regions (Mogenson et al., 1980; Groenewegen and Russchen, 1984; Brog et al., 1993). The amygdala also shows connectivity with motor areas (Avendano et al., 1983; Amaral and Price, 1984; Carmichael and Price, 1995; Ghashghaei and Barbas, 2002; Stefanacci and Amaral, 2002). Therefore, both regions are situated to participate in mediating stimulus- and action-value learning.
Here we investigated the causal role of the amygdala and VS in explore–exploit behaviors when monkeys learned the values of choice options with two different motor systems. We contrasted the performance of control monkeys against monkeys that received excitotoxic lesions of either the amygdala or the VS on two different versions of a three-arm bandit task, in which choices were registered with either saccades (eye task) or arm movements (arm task). In both tasks we repeatedly presented the monkeys with explore–exploit tradeoffs through the periodic introduction of novel choice options with unknown reward probabilities. Therefore, regardless of the motor system used to implement choices, exploration was required to learn if novel options could lead to increased future rewards. If the causal contributions of the amygdala and VS to reward learning are dependent on the output system, then the effect of lesions in the two areas should depend on whether monkeys learn via saccades or arm movements.
Methods
Subjects
The study was carried out on 23 male rhesus monkeys (Macaca mulatta), with weights ranging from 6 to 11 kg. Three of the monkeys received bilateral excitotoxic lesions of the VS, eight received bilateral excitotoxic lesions of the amygdala, and the remaining 12 served as unoperated controls. For the duration of the study, the monkeys were placed on water (eye task) or food (arm task) control and earned either fluid or food through their performance on the task on testing days. Experimental procedures for all monkeys were performed in accordance with the Guide for the Care and Use of Laboratory Animals and were approved by the National Institute of Mental Health Animal Care and Use Committee.
Surgery and lesion assessment
Monkeys received two separate stereotaxic surgeries, one for each hemisphere, which targeted either the VS, using quinolinic acid, or the amygdala, using ibotenic acid. Details of the surgeries have been reported in a previous study (Costa et al., 2016). After lesion surgeries, each monkey was implanted with a titanium head post to facilitate head restraint. Unoperated controls received the same type of cranial implant. All behavioral training and data collection for the present study occurred after the lesion surgeries. For each monkey in the lesion groups, the location and extent of the amygdala or VS lesions were quantitatively assessed from postoperative MRI scans. Extensive details of the method have been reported in a previous study (Basile et al., 2017). Briefly, the location and extent of damage were evaluated from T2-weighted scans obtained within 10 d of surgery. For each operated animal, we matched coronal MR scan slices to drawings of coronal sections from a standard rhesus monkey brain at 1 mm intervals. The postoperative hypersignal that indicates cell death was plotted onto the standard sections. The location and extent of the amygdala and VS hypersignal are shown on representative sections (Fig. 1b). Further, six of the monkeys with amygdala lesions, all three of the monkeys with VS lesions, and four of the control monkeys were previously tested in other studies (Dal Monte et al., 2015; Costa et al., 2016; Rothenhoefer et al., 2017; Taswell et al., 2018, 2021; Taubert et al., 2018). In all monkeys, the observed hypersignal indicated substantial damage to the target structures and minimal damage to surrounding structures.
Behavioral task design and lesion maps. a, Upper panel, the structure of the task, which consisted of multiple blocks. On every block, the first set of options (S1) consisted of three novel visual choice options. This set was presented to the monkeys for 10–30 trials. One of the options was then randomly replaced with a novel option, which had a randomly assigned reward probability. Thirty-two novel options were introduced in each block. Bottom panel, example of an individual trial in the eye task and the arm task. b, Lesion extent maps for the amygdala lesion (upper panel) and VS lesion (lower panel) groups, shown on representative sections. For the amygdala lesion map, the two brain slices represent the eight different monkeys (four monkeys per slice).
Experimental setup
The monkeys were trained to perform an oculomotor version and a skeletomotor version of a three-armed bandit task, referred to as the eye task and the arm task, respectively. Table 1 shows how monkeys were distributed across the two tasks. The monkeys were seated in a primate chair in front of a 19 in LCD monitor in the eye task or a 15 in touchscreen monitor (Elo Touch Solutions, Inc.) in the arm task. The monitor was placed 40 cm (eye task) or 30 cm (arm task) from the monkey's eyes. The eye task was the same as that used in previous studies (Costa et al., 2019; Costa and Averbeck, 2020; Tang et al., 2022). Task control was implemented with the MonkeyLogic behavioral control system for the eye task (Hwang et al., 2019) and with Presentation software (Neurobehavioral Systems) for the arm task. The monkey's eye movements during the eye task were monitored with an Arrington ViewPoint eye tracking system (Arrington Research) and sampled at 1 kHz. Eye position was not monitored during the arm task. Monkeys were rewarded with juice delivered via a pressurized fluid delivery system (Mitz, 2005) in the eye task and with nutritionally complete, grain-based food pellets delivered via pellet dispensers (Med Associates) in the arm task.
Distribution of monkeys tested in the arm task and eye task. Control (black), VS lesions (VS, blue), and amygdala lesions (AMY, red) monkeys
Both tasks had the same underlying structure (Fig. 1a, top panel). Monkeys performed a series of blocks, each of which was composed of up to 650 trials. The first trial of each block started with the presentation of a set of three novel choice options. Each option was associated with a specific reward probability drawn from a symmetric reward schedule centered on a reward probability of 0.5. Thus, each of the three initial options was associated with either a low reward probability (range, 0.2–0.3), a medium reward probability (0.5), or a high reward probability (range, 0.7–0.8). On each trial the monkeys chose one of the options, and their choice was followed (or not) by a reward according to the reward probability of the chosen option. To maximize reward, the monkeys had to learn the reward probability associated with each option by accumulating outcomes over trials and select the option with the highest reward probability. To present the monkeys with a recurring explore–exploit dilemma, every 10–30 trials we replaced one randomly chosen option with a novel choice option with an unknown reward probability. Within a single block of 650 trials, 32 novel options were introduced. When a novel option was introduced, it was pseudorandomly assigned a reward probability, which remained fixed for all the trials during which that option was presented to the monkeys. The only constraint when assigning the reward probability to a novel option was that the three stimuli could not all be assigned the same a priori reward probability. In both tasks, the three options were located at the vertices of a triangle, with one vertex pointing either up or down on each trial. The locations of the three options were randomized across trials.
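The block schedule described above can be sketched in simulation. The code below is a minimal illustration of the task structure only, not the actual task-control code (which used MonkeyLogic and Presentation); the specific probability values (0.25/0.5/0.75) and the uniform-random choice policy are placeholder assumptions.

```python
import random

def simulate_block(n_trials=650, swap_range=(10, 30), seed=0):
    """Sketch of one task block: three options carry low/medium/high reward
    probabilities; one randomly chosen option is replaced by a novel option
    every 10-30 trials, and the three options may never all share the same
    a priori probability. Values and the choice policy are illustrative."""
    rng = random.Random(seed)
    levels = [0.25, 0.5, 0.75]       # representative low / medium / high values
    options = levels[:]               # current reward probability of each option
    next_swap = rng.randint(*swap_range)
    n_novel, outcomes = 0, []
    for t in range(n_trials):
        if t == next_swap:            # introduce a novel option
            slot = rng.randrange(3)
            others = [p for i, p in enumerate(options) if i != slot]
            while True:
                p_new = rng.choice(levels)
                # reject assignments that make all three probabilities equal
                if not (others[0] == others[1] == p_new):
                    break
            options[slot] = p_new
            n_novel += 1
            next_swap += rng.randint(*swap_range)
        choice = rng.randrange(3)     # stand-in policy: choose at random
        outcomes.append(rng.random() < options[choice])
    return n_novel, outcomes
```

With replacement gaps of 10–30 trials, a 650-trial block yields roughly 32 novel options, matching the schedule described above.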
Although the eye and arm versions of the task were quite similar, there were necessarily some differences. In the eye task (Fig. 1a, bottom left), the monkeys started each trial by acquiring and holding a central fixation point for a variable time (250–750 ms). After the hold period, the three peripheral choice options appeared on the screen. The monkeys were required to saccade to one of the peripheral choice targets and hold fixation on it for 500 ms in order to receive (or not) the reward. Rewards were delivered stochastically according to the reward probability assigned to the chosen option. In the arm task (Fig. 1a, bottom right), the monkeys had to double tap a green central square target [100 × 100 pixels; RGB = (70, 147, 40)] to start a trial. The central target then disappeared, and the three peripheral choice options appeared on the screen. Once the options appeared, the monkeys had 3 s to reach to the chosen option with their hand and double tap on it. Failure to select an image within 3 s resulted in a distinctive audio cue (“d'oh”) and aborted the trial. The monkeys' choice was then followed (or not) by the reward, with an associated audio cue (“excellent!”), according to the reward probability of the chosen option. We used audio cues because the control monkeys had been trained in previous experiments that used audio cues as positive/negative feedback. Double tapping was implemented to prevent accidental touches from selecting the choice options. The monkeys' arms were not restrained during the arm task. Monkeys could reach the monitor and perform the task with their preferred arm through a small door in the chair, which was kept open throughout the task. The intertrial interval was 1.5 s in the eye task and 2 s in the arm task.
Data analysis and statistics
All analyses were performed with custom code written in MATLAB (www.mathworks.com). For all comparisons reported in this study, we carried out an N-way ANOVA with the task performed and the lesion group assignment as fixed factors. The fraction of times the monkeys chose each option type served as the dependent variable, with task and lesion group as independent variables. Monkeys were included as a random factor. For all reported ANOVAs, we ran a full model with all factors and interactions of all orders. Nonsignificant interactions and comparisons are reported only when critical for the results and their interpretation. Although some monkeys were trained on both tasks, all statistics were computed as between-subjects comparisons. This can only reduce the power to detect significant effects and is therefore a conservative approach.
Results
We tested control, amygdala lesion, and VS lesion monkeys on two versions of a three-arm bandit task, referred to as the eye task and the arm task (Fig. 1). In both tasks, we repeatedly presented the monkeys with explore–exploit tradeoffs through the periodic introduction of novel choice options with unknown reward probabilities. Thus, the monkeys had to periodically mediate an explore–exploit tradeoff and either choose a familiar option with a known reward probability or explore the novel option to see if it was better than the best familiar option. The reward probabilities associated with the three options could occur in all combinations (e.g., 0.2/0.2/0.8 or 0.5/0.8/0.8) with the exception that all three options could not have the same reward probability. While both tasks were characterized by this common explore–exploit design, monkeys were required to make their choices with two different effectors: either with an eye movement, in the eye task, or an arm movement, in the arm task. To quantify choice behavior and the effect of lesions in the two tasks we proceeded in two steps. In the first step, we focused on the explore–exploit decisions, comparing choices of the novel option relative to the two familiar options. In the second step of our analysis, we focused on the learning performance. As novel options were periodically introduced, the monkeys had to periodically learn reward associations. We performed this analysis first for the novel options as a function of trials since they were introduced and subsequently averaged across all trials.
Explore–exploit behavior
To quantify the explore–exploit behavior, we analyzed how choices were distributed among the novel option and the two familiar options. To do this, we first characterized the monkeys' estimate of the reward probability associated with each option on each trial as the number of times the option was rewarded divided by the number of times it was chosen. This estimate was used to define the best and worst familiar options on each trial: the best familiar option was the familiar option with the highest estimated reward probability, and the worst familiar option was the familiar option with the lowest estimated reward probability. The novel option was the most recently introduced option. The reward probability estimate for each option was updated each time the option was chosen; importantly, the estimated reward probability for an option did not necessarily match the reward probability assigned by the experimenter, especially before the option had been sampled several times. We then measured the fraction of times the monkeys chose each of the three options (i.e., novel, best familiar, and worst familiar) as a function of the number of trials since the novel option was introduced (Fig. 2a).
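The trial-wise labeling described above can be sketched as follows. This is an illustrative reimplementation, not the original MATLAB analysis code; the function name, the data layout, and the default estimate of 0 for unsampled options are assumptions.

```python
def classify_options(history, novel_id):
    """Sketch of the estimate described above: each option's estimated reward
    probability is (# times rewarded) / (# times chosen), and the two familiar
    options are ranked into best and worst by that estimate.
    `history` maps option id -> (times_chosen, times_rewarded)."""
    def estimate(opt):
        chosen, rewarded = history[opt]
        # assumption: options never chosen default to an estimate of 0
        return rewarded / chosen if chosen > 0 else 0.0
    familiar = [opt for opt in history if opt != novel_id]
    best = max(familiar, key=estimate)    # highest estimated reward probability
    worst = min(familiar, key=estimate)   # lowest estimated reward probability
    return best, worst, {opt: estimate(opt) for opt in history}
```

Note that, as in the text, the empirical estimate (e.g., 7 rewards in 10 choices gives 0.7) can deviate from the experimenter-assigned probability until an option has been sampled many times.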
Explore–exploit behavior in the arm task and eye task and relative effect of amygdala and VS lesions. a, Fraction of times the monkeys chose the novel, best, and worst options as a function of the number of trials since the novel option was introduced in the arm task (top row) and the eye task (bottom row), for all groups (columns). When a new novel option was introduced, the previous novel option became a familiar option. b, Fraction of times the monkeys chose the novel, best, and worst options in the first trial after the novel option was introduced. Shaded areas and error bars represent ±1 SEM across monkeys.
We found differences in the monkeys’ choice behavior in the two tasks following the introduction of a novel choice option. Monkeys in the eye task tended to choose the novel option, whereas monkeys in the arm task tended to choose the best familiar option. Monkeys differed in their first-trial preference among options in the two tasks (Fig. 2b; task × option, F(2, 44) = 77.581, p < 0.001). Monkeys chose the novel option more frequently in the eye task than in the arm task (t(26) = 7.622, p < 0.001), whereas monkeys showed a higher preference for the best familiar option in the arm task compared to the eye task (t(26) = 8.086, p < 0.001).
We also found that lesions influenced the choice between options in the two tasks (task × group × option, F(4, 44) = 6.674, p < 0.001), and this effect also occurred separately for each task (group × option, arm task, F(4, 16) = 6.215, p = 0.003; eye task, F(4, 28) = 3.134, p = 0.03). Relative to controls, VS lesions increased preference for the novel option in the arm task (t(5) = 3.138, p = 0.025) but decreased preference for the novel option in the eye task (t(9) = −2.7414, p = 0.022). VS lesions also increased choice of the worst option in the eye task (t(9) = −2.658, p = 0.026), but not in the arm task (t(5) = −0.604, p = 0.571), compared to controls.
In both tasks, amygdala lesions did not affect how monkeys chose the novel option (arm task, t(6) = 0.611, p = 0.571; eye task, t(12) = −0.686, p = 0.522), the best alternative option (arm task, t(6) = 0.224, p = 0.83; eye task, t(12) = 1.086, p = 0.3), or the worst alternative option (arm task, t(6) = −0.71, p = 0.51; eye task, t(12) = −0.2807, p = 0.784) in comparison to controls.
We next examined whether choice latencies for the novel and familiar options differed in the two tasks and were affected by the lesions (Fig. 3). Choice latencies were defined as the time between stimulus onset and the monkey's choice. Choice latencies differed between the two tasks (task, F(1, 46) = 988.16, p < 0.001) and between options (novel vs familiar; option, F(1, 46) = 9.4, p = 0.004). Monkeys also showed different choice latencies for novel and familiar options in the two tasks (option × task, F(1, 46) = 12.84, p < 0.001). Monkeys showed longer choice latencies for the novel options compared to familiar options in the arm task [mean (±SEM), familiar, 1130.50 (56.46) ms; novel, 1366.78 (83.19) ms; t(20) = 2.35, p = 0.029] but not in the eye task [mean (±SEM), familiar, 152.63 (6.39) ms; novel, 154.67 (5.93) ms; t(32) = 0.24, p = 0.807]. Lesions also influenced choice latencies (group, F(2, 46) = 5.44, p = 0.007). In both tasks, compared to controls, VS lesions reduced choice latencies for the novel option [eye task, mean (±SEM), control, 159.92 (11.14) ms; VS, 140.04 (3.12) ms; t(9) = −2.46, p = 0.035; arm task, mean (±SEM), control, 1408.05 (104.92) ms; VS, 1127.34 (89.55) ms; t(5) = −3.05, p = 0.028] but not for the familiar option [eye task, mean (±SEM), control, 157.53 (12.47) ms; VS, 138.59 (3.42) ms; t(9) = 0.89, p = 0.394; arm task, mean (±SEM), control, 1085 (69.70) ms; VS, 1052.58 (58.69) ms; t(5) = 0.98, p = 0.369]. In contrast, in both tasks amygdala lesions did not affect choice latencies for the novel options [eye task, mean (±SEM), AMY, 155.26 (7.54) ms; t(12) = −0.95, p = 0.360; arm task, mean (±SEM), AMY, 1505.10 (164.84) ms; t(6) = 0.43, p = 0.679] or the familiar options [eye task, mean (±SEM), AMY, 153.10 (7.17) ms; t(9) = −0.89, p = 0.394; arm task, mean (±SEM), AMY, 1233.81 (130) ms; t(6) = 0.86, p = 0.42] compared to controls.
Choice latencies. Choice latencies, defined as the time between stimulus onset and the monkeys' choice, in control, amygdala lesion, and VS lesion monkeys in the arm task (left) and eye task (right), averaged across monkeys. Error bars represent ±1 SEM across monkeys.
Learning novel reward associations
Next, we asked how efficiently the monkeys learned to differentiate the values of the novel options when they were introduced. We first measured the fraction of times the monkeys chose the novel options based on their a priori reward probabilities (i.e., high, medium, and low), as a function of the number of trials since the novel options were introduced (Fig. 4a). These curves were, however, affected by the overall preference for novel options. To remove the overall preference for novel options and characterize how well the monkeys learned to differentiate options with different values, we computed the average difference of the monkeys’ choices for each pair of options over the first 15 trials (i.e., high vs low, high vs medium, and medium vs low valued options, Fig. 4b). Examining the differences between the probability of choosing differently valued options instead of the actual probabilities of choosing each option allowed us to estimate the learning without being biased by the different exploratory tendency (i.e., the overall preference for novel options in the eye task) that we observed in the two tasks.
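The choice-difference measure described above can be sketched as follows. This is an illustrative reimplementation, not the original analysis code; the data layout (per-trial choice fractions keyed by a priori value) and all names are assumptions.

```python
def choice_difference(choices_by_value, first_n=15):
    """Sketch of the learning measure: for novel options of each a priori
    value (high/medium/low), average the fraction of trials on which they
    were chosen over the first `first_n` trials after introduction, then
    form the pairwise differences (high-low, high-medium, medium-low).
    Differencing removes the overall novelty preference shared by all curves."""
    def mean_first(curve):
        window = curve[:first_n]
        return sum(window) / len(window)
    avg = {value: mean_first(curve) for value, curve in choices_by_value.items()}
    return {
        "high_vs_low": avg["high"] - avg["low"],
        "high_vs_medium": avg["high"] - avg["medium"],
        "medium_vs_low": avg["medium"] - avg["low"],
    }
```

Because each pair of curves shares the same overall tendency to choose novel options, the subtraction isolates how well differently valued options were discriminated, which is the point made in the text about removing the exploratory bias.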
Learning novel reward associations in the arm task and eye task and relative effect of amygdala and VS lesions. a, Fraction of times the monkeys chose the novel option associated with high, medium, and low reward probability as function of the number of trials since the novel option was introduced in the arm task (top row) and the eye task (bottom row), and for all groups (columns). b, Average choice difference based on the different value (i.e., high vs low, high vs medium, and medium vs low reward probability options) in control (white), amygdala lesions (AMY, black), and VS (red) lesions monkeys. Shaded areas and error bars represent ±1 SEM across monkeys.
Monkeys learned to discriminate among the options and to choose the higher-valued options more frequently. The choice differences were larger for high versus low than for high versus medium and medium versus low valued options (value difference, F(2, 44) = 37.788, p < 0.001). Choice differences also differed between tasks (task, F(1, 22) = 8.401, p = 0.008) and by value difference (task × value difference, F(2, 44) = 4.045, p = 0.024). In the arm task compared to the eye task, choice differences were larger between high and low value options (t(26) = 2.3, p = 0.03) and between medium and low value options (t(26) = 3.73, p < 0.001).
We also found that lesions influenced the ability to discriminate between options, measured as the choice difference (group, F(1, 22) = 6.735, p = 0.005) and these effects differed by both value difference (group × value difference, F(4, 44) = 5.574, p = 0.001) and by task (task × group × value difference, F(4, 44) = 3.205, p = 0.021). When we analyzed the tasks separately, we found that in the eye task lesions led to a general deficit in the discrimination performance (group, F(2, 42) = 5.35, p = 0.008), without a lesion-dependent effect on how monkeys learned between differently valued options (group × value difference, F(4, 42) = 0.44, p = 0.778). We observed an overall decrease of the discrimination performance following both amygdala (t(40) = 2.24, p = 0.03) and VS (t(31) = 2.315, p = 0.027) lesions relative to controls in the eye task.
In contrast, in the arm task lesions influenced how monkeys discriminated between differently valued options (group, F(2, 24) = 13.78, p < 0.001; group × value difference, F(4, 24) = 4.26, p = 0.009). Following VS lesions, monkeys were less able to discriminate between all pairs of options in comparison to controls (t(19) = 2.84, p = 0.011). They also specifically discriminated less between high and low reward probability options (t(5) = 3.097, p = 0.027) and between medium and low reward probability options (t(5) = 4.791, p = 0.005). Amygdala lesions reduced the monkeys' ability to discriminate between high and low (t(6) = 4.149, p = 0.006) and high and medium (t(6) = 0.237, p = 0.017) reward probability options, without an effect on how monkeys discriminated between medium and low reward probability options (t(6) = 0.056, p = 0.957), relative to controls.
Overall performance
Next, we measured how monkeys in each group chose between low, medium, and high reward probability options, averaged across all trials, regardless of whether the reward probabilities were associated with novel or familiar options (Fig. 5a). This contrasts with the results above, which examined only choices of the novel options over the first 15 trials after they were introduced (Fig. 4). Monkeys discriminated between reward probabilities differently in the two tasks (task × reward probability, F(2, 44) = 24.32, p < 0.001). They chose the high reward probability options more frequently (t(26) = 5.53, p < 0.001) and the low reward probability options less frequently (t(26) = −5.79, p < 0.001) in the arm task relative to the eye task. The discrimination between reward probabilities was also affected by the lesions (group × reward probability, F(4, 44) = 3.92, p = 0.038). Following both lesions, monkeys did not discriminate between medium and low reward probability options in the eye task [amygdala, t(5) = 0.127, p = 0.904; VS, t(2) = −0.917, p = 0.456]. Thus, monkeys were generally more efficient at discriminating reward probabilities in the arm task than in the eye task, and amygdala and VS lesions specifically affected the monkeys' overall performance in the eye task, but not in the arm task. The better discrimination by reward probability in the arm task was mirrored by the amount of reward earned (Fig. 5b). The reward earned by the monkeys, relative to the maximum possible, was higher during the arm task than during the eye task (Fig. 5b, task, F(1, 22) = 52.43, p < 0.001) and was not influenced by the lesions in either task (group × task, F(2, 22) = 0.15, p = 0.863). This suggests better implementation of explore–exploit behaviors when choices were made with arm movements and further suggests no differential effect of lesions on the reward earned based on the reward type in the two tasks.
Discrimination of choice options by reward probability and reward earned. a, Average fraction of times the monkeys chose high, medium, and low probability options for each task and group. Choices were averaged across the type of option selected (e.g., novel, best, and worst alternative). b, Probability of earning reward, relative to the maximum possible, in control, amygdala lesions, and VS lesions monkeys in the two tasks averaged across monkeys. Error bars represent ±1 SEM across monkeys.
We further investigated the overall performance as well as the effect of lesions in the two tasks by examining the number of trials performed, the error rates, and the single-trial duration (Fig. 6). Overall, the number of trials and the error rates were higher in the eye task than in the arm task (n trials, task, F(1, 22) = 53.68, p < 0.001; error rates, task, F(1, 22) = 83.77, p < 0.001) and were not affected by the lesions (n trials, group, F(2, 22) = 1.4, p = 0.267; error rates, group, F(2, 22) = 1.67, p = 0.212). Lesions also did not affect the number of trials performed or the error rates in a task-specific way (n trials, group × task, F(2, 22) = 1.47, p = 0.253; error rates, group × task, F(2, 22) = 0.08, p = 0.919). Lastly, while single trials were longer in the arm task than in the eye task (task, F(1, 22) = 446.41, p < 0.001), as expected from the longer choice latencies, trial duration was affected by lesions (group, F(2, 22) = 7.25, p < 0.005), and in a task-specific way (group × task, F(2, 22) = 8.93, p = 0.001). In the arm task, trial duration was longer after VS lesions (t(5) = 9.75, p < 0.001) but not after amygdala lesions (t(6) = 1.2, p = 0.289). In contrast, neither lesion affected trial duration in the eye task (amygdala, t(12) = 2.0, p = 0.07; VS, t(9) = −0.17, p = 0.869).
Overall performance. Average number of trials (top), error rates (middle), and trial duration (bottom) during the arm task (left column) and the eye task (right column) for control, amygdala lesions, and VS lesions monkeys. Error bars represent ±1 SEM across monkeys.
Discussion
In this study, we tested unoperated controls and monkeys with amygdala or VS lesions on two different versions of a three-arm bandit task, in which choices were made using either the oculomotor (i.e., saccades) or skeletomotor (i.e., arm movements) systems. In both tasks we repeatedly presented the monkeys with explore–exploit tradeoffs by regularly introducing a novel choice option with unknown reward probability. This approach allowed us to investigate the following: (1) how monkeys mediate explore–exploit behaviors when choices are made with different motor systems and (2) differential effects of amygdala and VS lesions on choice behavior mediated by different motor systems.
Motor system-dependent explore–exploit behaviors
We found that monkeys implemented different, motor system-dependent explore–exploit behaviors. When the novel option was first introduced, monkeys were more novelty prone with eye movements, but they selected the best alternative option more frequently with arm movements. This difference in explorative tendency could reflect how well the monkeys had learned the value of each option before the novel option was introduced: better learning would more clearly define the best option and thereby decrease exploration. Consistent with this, monkeys showed better learning performance with arm movements.
Traditionally, reinforcement learning models assume a single value representation in the brain that drives learning independently of the motor system. The actor–critic model, for example, assumes that a single state value in the VS drives policy learning in motor systems (Joel et al., 2002; O'Doherty et al., 2004; Takahashi et al., 2008). In this model the critic learns and makes predictions about state values and computes reward prediction errors as the monkey moves from one trial to the next. The actor, in contrast, learns to perform the action with the highest value (i.e., select the most rewarded stimulus) using the reward prediction errors generated by the critic. The critic is often associated with the VS and dopamine neurons, and the actor with the dorsal striatum (DS; Cooper et al., 2012; Seo et al., 2012; Lee et al., 2015; Averbeck and O'Doherty, 2022).
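The division of labor between critic and actor can be sketched as a minimal tabular update rule for a three-arm bandit. This is an illustrative sketch, not the model fitted in this study; the learning rates, softmax temperature, and reward probabilities below are arbitrary assumptions chosen only to make the example run.

```python
import math
import random

def softmax(prefs, beta=1.0):
    """Convert actor preferences into choice probabilities."""
    exps = [math.exp(beta * p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def run_actor_critic(reward_probs, n_trials=2000, alpha_critic=0.1,
                     alpha_actor=0.1, beta=3.0, seed=0):
    """Minimal actor-critic learner for a three-arm bandit.

    The critic tracks a single state value v and emits a reward
    prediction error (RPE) on every trial; the actor uses that same
    RPE to adjust its preference for the option it just chose.
    """
    rng = random.Random(seed)
    v = 0.0                            # critic's state value
    prefs = [0.0] * len(reward_probs)  # actor's action preferences
    choices = []
    for _ in range(n_trials):
        probs = softmax(prefs, beta)
        a = rng.choices(range(len(prefs)), weights=probs)[0]
        r = 1.0 if rng.random() < reward_probs[a] else 0.0
        rpe = r - v                    # critic's prediction error
        v += alpha_critic * rpe        # critic update
        prefs[a] += alpha_actor * rpe  # actor update for chosen option
        choices.append(a)
    return prefs, choices

# Hypothetical reward schedule: three options rewarded with
# probabilities 0.2, 0.5, and 0.8.
prefs, choices = run_actor_critic([0.2, 0.5, 0.8])
best_rate = choices[-500:].count(2) / 500  # late preference for best option
```

Because the actor's policy improves only through the critic's prediction errors, lesioning the critic in this scheme would degrade learning in every motor system equally, which is the prediction the present results speak against.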
To date, only one study has compared the value representations underlying learning with the oculomotor and skeletomotor systems (Wunderlich et al., 2009). They found value representations for the hand in premotor cortex and the dorsal putamen, and value representations related to eye movements in the pre-supplementary eye field and intraparietal regions. These results, along with the data from the present study, are not consistent with a single value learning system in the VS that drives policy learning identically in all motor systems. One possible hypothesis is that the motor systems compute separate action-value representations based on the stimulus values encoded in motivational regions. Another is that distinct stimulus value representations, encoded in motivational regions but specific to each motor system, drive separate policy learning in the oculomotor and skeletomotor regions.
One further reason for the difference in explore–exploit behavior between the two systems may lie in their ecological relevance. Quickly detecting novel stimuli in the environment through eye movements is crucial for the survival of the species. Many primate groups have capitalized on vision at the expense of other sensory systems (e.g., olfaction), and multiple visual cortex expansions and other adaptations have occurred in primate lineages (Kaas, 2004; Eldridge et al., 2021). Primates, like many other animals, use the oculomotor system to extract the visual information needed to interact with, and learn about, objects in the environment (Gottlieb et al., 2013, 2014; Ngo et al., 2022). This learning and interaction are eventually used to guide actions through the skeletomotor system to achieve goals. More explorative behavior with the eye and better learning performance with the arm could therefore mirror the ecological relevance of the two systems. It remains to be seen whether the dissociable learning processes observed in the two tasks rely on different explore–exploit strategies, and what the contributions of vision and active pursuit were during the arm task. A further question is whether the decision process in the arm task unfolds as a first saccade toward the novel option followed by selection of the best familiar option with an arm movement. Similarly, it remains to be determined whether allowing the monkeys to visually explore the options in the eye task before making their choice would translate into a different explore–exploit behavior. Lastly, it is worth noting that the monkeys' reward was juice in the eye task and food in the arm task. This could alter the reward structure of the two tasks and affect the monkeys' exploratory tendencies with eye and arm.
For example, the shorter time needed to consume liquid reward compared to food may decrease the subjective value the monkeys assigned to single-trial rewards and increase exploration in the eye task. The different consumption times of the two rewards may also underlie the different numbers of trials performed and amounts of reward earned per day in the two tasks. However, previous studies have shown similar behavioral performance with food and liquid rewards in monkeys (Seo et al., 2012) and rodents (Carelli et al., 2000; Goltstein et al., 2018). Furthermore, neurons in the nucleus accumbens responded similarly to food and liquid rewards (Carelli et al., 2000), and human participants receiving abstract rewards and monkeys receiving juice rewards showed similar explore–exploit behavior (Hogeveen et al., 2022). Thus, we believe that the difference in reward type did not play a major role in the explore–exploit behavior or the learning performance in the two tasks.
Effect of amygdala and VS lesions
We also found that amygdala and VS lesions impaired behavior in a task-specific way. Both lesions caused deficits in learning novel stimulus-reward associations in the two tasks. However, monkeys with amygdala lesions retained the ability to identify a novel option with low reward probability, but only in the arm task. VS lesions led to poor discrimination among all cues in both tasks and affected exploration in a task-specific way: after VS lesions, monkeys were more novelty oriented in the arm task but less novelty oriented in the eye task. Further, monkeys with amygdala and VS lesions showed deficits in discriminating options by reward value only in the eye task.
Consistent with these lesion effects, previous studies have demonstrated that amygdala and VS lesions in monkeys affect learning of stimulus-reward associations in some (Costa et al., 2016; Rothenhoefer et al., 2017), but not all (Malkova et al., 1997; Vicario-Feliciano et al., 2017; Taswell et al., 2021), tasks. There are many differences between the tasks in which lesions do and do not affect learning, and it is not currently clear which aspects of the tasks drive these effects. One important difference is that the tasks in which effects are observed tend to use concurrent reward schedules (i.e., each option is rewarded, but with different probabilities), so these structures may be more important for learning to select among several relatively rewarding options. What is clear is that the amygdala and VS are not necessary for reinforcement learning in general.
Studies in human and nonhuman primates suggest that the amygdala and VS likely mediate these functions as part of a larger circuit, including much of the frontal cortex, temporal areas, and subcortical regions (Daw et al., 2006; Cohen et al., 2007; Badre et al., 2012; Costa et al., 2019; Costa and Averbeck, 2020; Giarrocco and Averbeck, 2021; Ogasawara et al., 2022). The ventral–dorsal distinction in the striatum also mirrors its connectivity within the cortico-basal ganglia loops controlling the oculomotor and skeletomotor systems (Alexander et al., 1986; Haber, 2016; Saga et al., 2017). For example, premotor cortex projections to the putamen, which in turn projects to the ventrolateral portion of the globus pallidus pars interna (GPi), define a circuit controlling the skeletomotor system. Posterior prefrontal cortex projections to the caudate, which projects to the dorsomedial GPi, give rise to a circuit controlling eye movements (Heilbronner et al., 2018). The amygdala, in turn, projects to the VS, which projects to the ventromedial GPi. This topographical organization extends across thalamic nuclei, which then project back to the same cortical areas, closing the loops (Giarrocco and Averbeck, 2022).
Recent studies have suggested that switching between exploiting and exploring relies on the interaction between motivational (e.g., amygdala and VS) and motor regions (Hogeveen et al., 2022; Tang et al., 2022), including contributions of prefrontal regions, while forming new choice policies (Ebitz et al., 2018; Domenech et al., 2020). Further, amygdala lesions in humans and monkeys modulate the representation of immediate expected value in medial prefrontal cortex and orbitofrontal cortex during learning (Hampton et al., 2007; Rudebeck et al., 2017).
The VS is classically considered a motivational–motor interface (Mogenson et al., 1980), and VS lesions modulate the activity of motor cortical areas and the motor output (Suzuki and Nishimura, 2022). Interestingly, we found that VS lesions decreased choice latencies in both tasks, suggesting that the VS plays a key role in delaying responses to obtain rewards and in mediating speed–accuracy tradeoffs (Rothenhoefer et al., 2017). The amygdala is connected with motor areas of the frontal cortex (Avendano et al., 1983; Amaral and Price, 1984; Carmichael and Price, 1995; Ghashghaei and Barbas, 2002; Stefanacci and Amaral, 2002) and plays a role in the control of gaze patterns (Dal Monte et al., 2015; Taubert et al., 2018; Maeda et al., 2020). Moreover, stimulation of both the VS and the amygdala increases locomotor activity (Pijnenburg et al., 1973; Burns et al., 1996; Sargolini et al., 1999). The different effects of amygdala and VS lesions on the monkeys' behavior in the two tasks may therefore arise through distinct, partially overlapping circuits. It should be noted that behavioral training and data collection occurred after the lesion surgeries, so we cannot exclude that the lesions affected how the monkeys learned the basic task rule as well as their explore–exploit behaviors. However, significant differences in explore–exploit behaviors occurred specifically in the early trials following the presentation of the novel option. Conversely, monkeys showed a similar overall preference pattern between novel and familiar options within each task regardless of lesion group (e.g., a high preference for the best option in both control and lesion groups tested in the arm task), and all groups generally showed a low preference for the worst option. This similar pattern most likely reflects a similar implementation of the task rule, regardless of the lesions.
Conclusion
In this study, we tested control monkeys and monkeys with amygdala or VS lesions on two versions of a three-arm bandit task: the eye task, in which monkeys made choices with eye movements, and the arm task, in which they made choices with arm movements. In both tasks, we presented monkeys with explore–exploit tradeoffs by periodically replacing a familiar option with a novel option that had an unknown reward probability. Monkeys showed less explorative behavior, better learning performance, and earned more reward in the arm task than in the eye task. We also showed that amygdala and VS lesions influenced choice behavior in a motor system-dependent way. Both lesions made monkeys less efficient at learning novel reward associations in the two tasks, although after amygdala lesions monkeys retained, to some degree, the ability to learn the value of novel options when choices were made with an arm movement. Additionally, both amygdala and VS lesions led to poorer discrimination of options by reward probability only in the eye task. Our results show that explore–exploit behaviors are mediated in a motor system-dependent way and, together with the effects of amygdala and VS lesions, suggest that different value representations may drive learning in the oculomotor and skeletomotor systems, likely involving separable cortico-basal ganglia-thalamocortical circuits.
Footnotes
This work was supported by the Intramural Research Program of the National Institute of Mental Health (ZIA MH002928 to B.B.A.), National Institutes of Health Grant R01MH125824, and Oregon National Primate Research Center Grant P51OD011092 to V.D.C.
The authors declare no competing financial interests.
- Correspondence should be addressed to Bruno B. Averbeck at bruno.averbeck{at}nih.gov.