Abstract
The striatum is a major input site of the basal ganglia, which play an essential role in decision making. Previous studies have suggested that subareas of the striatum have distinct roles: the dorsolateral striatum (DLS) functions in habitual actions, the dorsomedial striatum (DMS) in goal-directed actions, and the ventral striatum (VS) in motivation. To elucidate the distinctive functions of striatal subregions in decision making, we systematically investigated the information represented by phasically active neurons in DLS, DMS, and VS. Rats performed two types of choice tasks: fixed-choice and free-choice tasks. In both tasks, rats were required to perform nose poking to either the left or right hole after cue-tone presentation. A food pellet was delivered probabilistically depending on the presented cue and the selected action. The reward probability was fixed in the fixed-choice task and varied in a block-wise manner in the free-choice task. We found the following: (1) when rats began the tasks, a majority of VS neurons increased their firing rates, and information regarding task type and state value was most strongly represented in VS; (2) during action selection, information about action and action values was most strongly represented in DMS; (3) action-command information (action representation before action selection) was stronger in the fixed-choice task than in the free-choice task in both DLS and DMS; and (4) action-command information was strongest in DLS, particularly when the same choice was repeated. We propose a hypothesis of hierarchical reinforcement learning in the basal ganglia to coherently explain these results.
Introduction
The basal ganglia are known to play an essential role in decision making. The striatum, the major input site of the basal ganglia, has a dorsolateral-ventromedial gradient in its input modality. That is, the dorsolateral striatum receives sensorimotor-related information and the ventromedial region receives associative and motivational information (Voorn et al., 2004; Samejima and Doya, 2007). This organization suggests different roles for different subareas of the striatum in decision making (Balleine et al., 2007; Wickens et al., 2007).
Lesion studies suggest that the dorsomedial striatum (DMS) and the dorsolateral striatum (DLS) contribute to goal-directed and habitual actions, respectively (Yin et al., 2004, 2005a,b, 2006; Balleine et al., 2007; Balleine and O'Doherty, 2010). Lesion and recording studies of the ventral striatum (VS) have suggested its role in motivation in response to reward-predicting cues (Berridge and Robinson, 1998; Cardinal et al., 2002; Nicola, 2010).
Based on reinforcement learning theory (Watkins and Dayan, 1992; Sutton and Barto, 1998), the actor-critic hypothesis posits that the patch compartment, dominant in VS, realizes the critic, which learns reward prediction in the form of a "state value," and that the matrix compartment, dominant in the dorsal striatum (DS), implements the actor, which learns action selection (Houk et al., 1995; Joel et al., 2002). A variant of this hypothesis is that matrix neurons learn "action values" of candidate actions (Doya, 1999, 2000). Theoretical models have also suggested that model-based action selection, which can realize flexible, goal-directed action selection (Doya, 1999; Daw et al., 2005, 2011), occurs in the network linking the prefrontal cortex and the striatum.
To further clarify the different roles of striatal subregions, however, it is essential to record from DLS, DMS, and VS during choice behaviors. Many previous recording studies have reported neural representations of state, action, reward, past action, past reward, reward expectation, action value, and chosen value within the striatum, but without systematic comparisons among the subregions (Samejima et al., 2005; Pasquereau et al., 2007; Lau and Glimcher, 2008; Hori et al., 2009; Ito and Doya, 2009; Kim et al., 2009; Kimchi and Laubach, 2009; Kimchi et al., 2009; Roesch et al., 2009; Wunderlich et al., 2009; Stalnaker et al., 2010; Gremel and Costa, 2013; Kim et al., 2013). A small number of studies have reported some regional differences: upcoming action is represented in DS, but not in VS (Kim et al., 2009); upcoming state in VS, but not in DS (van der Meer et al., 2010); and past-action information is stronger in DMS than in DLS (Kim et al., 2013).
In this study, we systematically analyzed neuronal activity in DLS, DMS, and VS of rats performing a nose-poke choice task consisting of fixed-choice and free-choice blocks. We specifically investigated two predictions: (I) representation of action values is strongest in DMS; and (II) action-command representation is stronger in fixed-choice blocks than in free-choice blocks in DLS, and the opposite holds in DMS.
Materials and Methods
Subjects.
Male Long–Evans rats (n = 7; 250–350 g body weight; ∼14–29 weeks old) were housed individually under a reversed light/dark cycle (lights on at 20:00, off at 08:00). Experiments were performed during the dark phase. Food was provided after training and recording sessions so that body weights did not fall below 90% of initial levels. Water was supplied ad libitum. The Okinawa Institute of Science and Technology Animal Research Committee approved the study.
Apparatus.
All training and recording procedures were conducted in a 40 × 40 × 45 cm experimental chamber placed in a sound-attenuating box (O'Hara & Co.). The chamber was equipped with three nose-poke holes in one wall and a pellet dish on the opposite wall (see Fig. 1A). Each nose-poke hole had an infrared sensor to detect head entry, and the pellet dish had an infrared sensor to detect the presence of a sucrose pellet (25 mg) delivered by a pellet dispenser. The top of the chamber was open to allow connections between electrodes mounted on the rat's head and an amplifier. House lights, a video camera, and a speaker were placed above the chamber. A computer program written in LabVIEW (National Instruments) controlled the speaker and the dispenser and monitored the states of the infrared sensors.
Behavioral task.
Animals were trained to perform a free-choice task and a fixed-choice task using nose-poke responses. In both tasks, each trial started with a tone presentation (start tone: 2300 Hz, 1000 ms). When the rat performed a nose-poke in the center hole for 500–1000 ms, one of three cue tones (left tone: 900 Hz, 1000–2000 ms; right tone: 6500 Hz, 1000–2000 ms; choice tone: white noise, 1000–2000 ms) was presented (see Fig. 1B). The left and right tones indicated which choice the rat should make to have the highest probability of reward. In contrast, the choice tone offered no such information, leaving the rat to choose based on its own experience. The rat had to maintain the nose-poke in the center hole during presentation of the cue tone; otherwise, the trial was ended (a wait-error trial) with presentation of an error tone (9500 Hz, 1000 ms). After the offset of the cue tone, the rat was required to perform a nose-poke in either the left or right hole within 60 s (otherwise, the trial was ended as an error trial after the error tone), and then either a reward tone (500 Hz, 1000 ms) or a no-reward tone (500 Hz, 250 ms) was presented. The reward tone was followed by delivery of a sucrose pellet to the food dish. Reward probabilities varied depending on the cue tone and the chosen action (see Fig. 1C). Reward probabilities were fixed for the left tone (50% chance of reward for the left hole choice, 0% for the right hole choice) and the right tone (0%, 50%). Reward probabilities for the choice tone varied in a block-wise manner among four settings: (L: 90%, R: 50%), (50%, 90%), (50%, 10%), and (10%, 50%).
Surgery.
After rats mastered the free-choice task, they were anesthetized with pentobarbital sodium (50 mg/kg, i.p.) and placed in a stereotaxic frame. The skull was exposed, and holes were drilled over the recording sites. Three drivable electrode bundles were implanted into DLS in the left hemisphere (0.7 mm anterior, 3.8 mm lateral from bregma, 4.0 mm ventral from the brain surface), DMS in the left hemisphere (0.4 mm posterior, 2.0 mm lateral from bregma, 3.2 mm ventral from the brain surface), and VS in the right hemisphere (1.7 mm anterior, 1.7 mm lateral from bregma, 6.0 mm ventral from the brain surface). Each electrode bundle was composed of eight Formvar-insulated nichrome wires (25 μm bare diameter; A-M Systems) inserted into a stainless-steel guide cannula (0.3 mm outer diameter; Unique Medical). Tips of the microwires were cut with sharp surgical scissors so that ∼1.5 mm of each tip protruded from the cannula. Each tip was electroplated with gold to obtain an impedance of 100–200 kΩ at 1 kHz. Electrode bundles were advanced by 125 μm per recording day to acquire activity from new neurons.
Electrophysiological recording.
Recordings were made while rats performed the fixed- and free-choice tasks. Neuronal signals were passed through a head amplifier at the head stage and then fed into the main amplifier through a shielded cable. Signals passed through a bandpass filter (50–3000 Hz) to a data acquisition system (Power1401; CED), by which all waveforms that exceeded an amplitude threshold were time-stamped and saved at a sampling rate of 20 kHz. The threshold amplitude for each channel was adjusted so that action potential-like waveforms were not missed while noise was minimized. After a recording session, off-line spike sorting was performed with Spike2 (CED) using a template-matching algorithm and principal component analysis: recorded waveforms were classified into several groups based on their shapes, and a template waveform for each group was computed by averaging. Groups of waveforms whose templates appeared to be action potentials were accepted, and the others were discarded. Then, to test whether accepted waveforms were recorded from multiple neurons, principal component analysis was applied to the waveforms. Clusters in principal component space were detected by fitting a Gaussian mixture model, and each cluster was identified as signals from a single neuron. This procedure was applied to each 50 min data segment; if stable results were not obtained, the data were discarded.
Then, the spike data were refined by omitting neurons that satisfied at least one of the following five conditions: (1) the amplitude of waveforms increased >150% or decreased <50% during the recording session; (2) the amplitude of waveforms was <7× the SD of background noise; (3) the firing rate calculated from perievent time histograms (PETHs) (from −4.0 to 4.0 s with 100 ms time bins, aligned on the onset of the cue tone, the exit from the center hole, or the entry into the left or right hole) was <1.0 Hz for all time bins of all PETHs; (4) the firing rate represented by EASHs (see below) with 10 ms time bins, smoothed by a Gaussian filter with σ = 10 ms (see Fig. 3D–I, black), was <1.0 Hz for all time bins; (5) the estimated recording site was outside the target area. Furthermore, considering the possibility that the same neuron was recorded by different electrodes in the same bundle, we calculated cross-correlation histograms with 1 ms time bins for all pairs of neurons recorded from different electrodes in the same bundle. If the frequency at 0 ms was >10× the mean frequency (from −200 to 200 ms, excluding the 0 ms bin) and the PETHs of the pair had similar shapes, one neuron of the pair was removed from the database. After this procedure, to extract phasically active neurons (PANs; putative medium spiny neurons), the proportion of time spent in interspike intervals (ISIs) longer than 1 s (PropISIs>1s) was calculated for each neuron (Schmitzer-Torbert and Redish, 2004). Neurons for which PropISIs>1s was >0.4 were regarded as PANs.
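For concreteness, the PAN criterion can be computed directly from spike timestamps. The following is a minimal sketch in Python with NumPy (the function names are ours, not from the original analysis code) of the PropISIs>1s measure and the 0.4 criterion described above.

```python
import numpy as np

def prop_isis_over_1s(spike_times):
    """Proportion of recording time spent in interspike intervals (ISIs)
    longer than 1 s (Schmitzer-Torbert and Redish, 2004)."""
    isis = np.diff(np.sort(np.asarray(spike_times)))  # ISIs in seconds
    total = isis.sum()
    return isis[isis > 1.0].sum() / total if total > 0 else 0.0

def is_pan(spike_times):
    """Classify as a phasically active neuron (putative medium spiny
    neuron) when PropISIs>1s exceeds the 0.4 criterion."""
    return prop_isis_over_1s(spike_times) > 0.4
```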
Histology.
After all experiments were completed, rats were anesthetized as described in the surgery section, and a 10 μA positive current was passed for 30 s through one or two recording electrodes of each bundle to mark the final recording positions. Rats were perfused with 10% formalin containing 3% potassium hexacyanoferrate (II), and brains were carefully removed so that the microwires would not cause tissue damage. Sections were cut at 60 μm on an electrofreeze microtome and stained with cresyl violet. Final positions of electrode bundles were confirmed by the Prussian blue dots. The position of each recorded neuron was estimated from the final position and the distance the electrode bundle had been advanced. If the position was outside DLS, DMS, or VS, the recorded data were discarded.
Decision trees.
To estimate a decision tree for choice tones (see Fig. 2D), sequences of choice behavior in choice-tone trials were extracted. We denote the action in the tth choice trial as a(t) ∈ {L, R}, the reward as r(t) ∈ {0, 1}, and the experience as

e(t) = (a(t), r(t)).

The conditional probability of making a left choice given the preceding sequence of experiences is estimated by

P(a(t) = L | e(t − 1), e(t − 2), …, e(t − d)) = NL(e(t − 1), e(t − 2), …, e(t − d)) / [NL(e(t − 1), e(t − 2), …, e(t − d)) + NR(e(t − 1), e(t − 2), …, e(t − d))],

where NL(e(t − 1), e(t − 2), …, e(t − d)) and NR(e(t − 1), e(t − 2), …, e(t − d)) are the numbers of occurrences of the left (L) and right (R) actions, respectively, after the experience (e(t − 1), e(t − 2), …, e(t − d)), and d is the number of previous trials taken into consideration. In this study, conditional probabilities of left choices were calculated for all possible combinations for d = 1 and d = 2. In the same way, to estimate a decision tree for the left or right tone, sequences of choice behavior in left-tone or right-tone trials (both in single fixed-choice blocks and in double fixed-choice blocks) were used, respectively (for more detail, see Ito and Doya, 2009).
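A minimal sketch of this counting estimator is given below (Python; the data layout and function name are illustrative assumptions). It tabulates, for every history of the preceding d experiences, the empirical probability of a left choice.

```python
from collections import defaultdict

def left_choice_probabilities(actions, rewards, d):
    """Estimate P(a(t) = L | e(t-1), ..., e(t-d)) by counting, where an
    experience is e(t) = (a(t), r(t)).  `actions` is a sequence of
    'L'/'R' and `rewards` a sequence of 0/1 from choice-tone trials."""
    counts = defaultdict(lambda: {'L': 0, 'R': 0})
    experiences = list(zip(actions, rewards))
    for t in range(d, len(actions)):
        history = tuple(experiences[t - d:t])  # the preceding d experiences
        counts[history][actions[t]] += 1
    return {h: c['L'] / (c['L'] + c['R'])      # NL / (NL + NR)
            for h, c in counts.items()}
```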
Evaluation of decision-making models.
Any decision-making model for a single stimulus (state) and binary choice (action) can be defined by the conditional probability of the current action given past experiences,

PL(t) = P(a(t) = L | e(1:t − 1)),

where e(1:t − 1) is shorthand for e(1), e(2), …, e(t − 1). Behavioral data are composed of a set of sequences (sessions) of actions and rewards. Where necessary, we use l as the session index, for example, a{l}(t). The number of trials in session l is denoted by Tl, and the number of sessions by L.
To fit the parameters to the choice data and to evaluate the models, we used the likelihood criterion, which is the probability that the observed data were produced by the model. The likelihood can be normalized so that it equals 0.5 when predictions are made with chance-level accuracy (PL(t) = 0.5 for all t). The normalized likelihood is defined by

exp[(1 / Σl Tl) Σl Σt log z{l}(t)],

where z{l}(t) is the likelihood for a single trial,

z{l}(t) = PL(t) if a{l}(t) = L, and z{l}(t) = 1 − PL(t) if a{l}(t) = R.

The (normalized) likelihood can be regarded as the prediction accuracy, namely, how accurately the model predicts actions from past experiences. Generally, models with more free parameters can fit the data more accurately and show a higher likelihood; however, such models may fail to fit new data because of overfitting. For a fair comparison of models, the choice data were divided into training data (101 sessions) and test data (101 sessions). The free parameters of each model were determined by maximizing the likelihood of the training data, and the model was then evaluated by the likelihood or normalized likelihood of the test data (holdout validation).
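The normalized likelihood amounts to the geometric mean of single-trial likelihoods, as in the following sketch (Python; the `model` callable, which maps the history of (action, reward) pairs so far to PL(t), is an assumed interface).

```python
import numpy as np

def normalized_likelihood(sessions, model):
    """Geometric mean of single-trial likelihoods z(t) over all sessions;
    equals 0.5 when the model predicts at chance (PL(t) = 0.5 for all t).
    `model` is a callable mapping the history of (action, reward) pairs
    observed so far to the predicted probability of a left choice."""
    total_log, n_trials = 0.0, 0
    for actions, rewards in sessions:   # one session = (actions, rewards)
        history = []
        for a, r in zip(actions, rewards):
            p_left = model(history)
            z = p_left if a == 'L' else 1.0 - p_left  # single-trial likelihood
            total_log += np.log(z)
            n_trials += 1
            history.append((a, r))
    return np.exp(total_log / n_trials)
```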
Markov models.
Markov models are the simplest nonparametric models. They predict an action from the experiences in the past d trials. For each possible sequence of actions and rewards over d trials, a separate parameter defining the action probability was assigned:

P(a(t) = L | e(t − 1), …, e(t − d)) = w(e(t − 1), …, e(t − d)),

where the w values are the free parameters (for the parameter search, see Ito and Doya, 2009). Markov models provide a useful measure for objectively evaluating other models.
DFQ-learning model.
The DFQ-learning model (Ito and Doya, 2009) is an extension of the standard Q-learning model. Action values Qi, which are estimates of the reward obtained by taking action i ∈ {L, R}, are updated by

QL(t + 1) = (1 − α1) QL(t) + α1 κ1, if a(t) = L and r(t) = 1,
QL(t + 1) = (1 − α1) QL(t) − α1 κ2, if a(t) = L and r(t) = 0,
QL(t + 1) = (1 − α2) QL(t), if a(t) = R,

and symmetrically for QR(t), where α1 is the learning rate for the selected action, α2 is the forgetting rate for the action not chosen, κ1 represents the strength of reinforcement by reward, and κ2 represents the strength of aversion resulting from the no-reward outcome. This set of equations reduces to standard Q-learning by setting α2 = 0 (no forgetting of the action not chosen) and κ2 = 0 (no aversion from lack of reward). The FQ model is the version with the restriction α1 = α2. Using the action values, the prediction of the choice at trial t is given by

PL(t) = 1 / {1 + exp[−(QL(t) − QR(t))]}. (1)

We considered cases of fixed and time-varying parameters. For fixed-parameter models, the parameters α1, α2, κ1, and κ2 were free parameters assumed to be constant over all sessions. For time-varying parameters, α1, α2, κ1, and κ2 were estimated in each trial and assumed to vary as random walks,

αj(t + 1) = αj(t) + ζj(t) and κj(t + 1) = κj(t) + ξj(t), for j = 1, 2,

where ζj and ξj are noise terms drawn independently from Gaussian distributions N(0, σα²) and N(0, σκ²), respectively, and σα and σκ are free parameters that control the magnitude of change. The predictive distribution P(h(t)|e(1:t − 1)) of the parameters h = [QL, QR, α1, α2, κ1, κ2] given past experiences e(1:t − 1) was estimated using a particle filter (Samejima et al., 2005; Ito and Doya, 2009). The action probability PL(t) was obtained from Equation 1 with the means of the predictive distributions of QL(t) and QR(t). In this study, 5000 particles were used for the estimation.
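A minimal sketch of one DFQ-learning step and the resulting choice probability (Equation 1) is shown below (Python; the dictionary representation of action values is our illustrative choice, and the particle-filter estimation of time-varying parameters is omitted).

```python
import numpy as np

def dfq_update(q, action, reward, a1, a2, k1, k2):
    """One DFQ-learning step on action values q = {'L': QL, 'R': QR}:
    the chosen action moves toward +kappa1 (reward) or -kappa2 (no
    reward) with learning rate alpha1; the unchosen action decays with
    forgetting rate alpha2.  alpha2 = kappa2 = 0 gives standard
    Q-learning; alpha1 = alpha2 gives the FQ model."""
    other = 'R' if action == 'L' else 'L'
    q[action] = (1 - a1) * q[action] + a1 * (k1 if reward else -k2)
    q[other] = (1 - a2) * q[other]
    return q

def p_left(q):
    """Choice probability from the action-value difference (Equation 1)."""
    return 1.0 / (1.0 + np.exp(q['R'] - q['L']))
```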
Event-aligned spike histogram (EASH).
In our choice task, six task events were defined: entry into the center hole (E1), onset of the cue tone (E2), offset of the cue tone (E3), exit from the center hole (E4), entry into the left or right hole (E5), and exit from the left or right hole (E6). The intervals between task events varied across trials. To align task-event timings across all trials, we devised EASHs. First, the average duration of each event interval was calculated: E1 to E2 (Phase 2) = 0.75 s, E2 to E3 (Phase 3) = 1.50 s, E3 to E4 (Phase 4) = 0.54 s, E4 to E5 (Phase 5) = 0.76 s, and E5 to E6 (Phase 6) = 0.38 s. Then, the spike timings within each event interval of each trial were linearly transformed onto the corresponding average event interval, preserving the number of spikes in each interval. Furthermore, we defined time points E0 (2 s before E1) and E7 (2 s after E6) to define Phase 1 (E0 to E1) and Phase 7 (E6 to E7). The transformation described above was not applied to spike timings in Phases 1 and 7 because the durations of these phases did not vary across trials. In this way, a regular raster plot (see Fig. 3B, top) was transformed into an event-aligned raster plot (see Fig. 3C, top). Taking a time histogram of the transformed raster plot with 10 ms bins then yielded the EASH (see Fig. 3C, bottom).
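The core of the EASH transformation is a piecewise-linear rescaling of spike times, as in the following sketch (Python with NumPy; the handling of the untransformed Phases 1 and 7 is omitted, and the function name is ours). Histogramming the aligned spike times across trials with 10 ms bins then yields the EASH.

```python
import numpy as np

def event_aligned_times(spike_times, trial_events, mean_events):
    """Piecewise-linearly rescale spike times between consecutive task
    events (E1...E6) of one trial onto the average event times,
    preserving the spike count within each event interval."""
    spike_times = np.asarray(spike_times)
    aligned = []
    for (t0, t1), (m0, m1) in zip(zip(trial_events[:-1], trial_events[1:]),
                                  zip(mean_events[:-1], mean_events[1:])):
        spikes = spike_times[(spike_times >= t0) & (spike_times < t1)]
        aligned.append(m0 + (spikes - t0) * (m1 - m0) / (t1 - t0))
    return np.concatenate(aligned)
```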
Mutual information.
To elucidate when and how much information about each task event, such as state, action, and reward, was coded in neuronal activity, the mutual information between firing and each event was calculated using the method described by Ito and Doya (2009). For a given time window in each trial, we defined a neuronal activity F and a task event X. F is a random variable taking one of four values, f1, f2, f3, or f4, for each trial, representing the level of the firing rate. X is a random variable taking x1 or x2. For the action information (mutual information between firing rate and action), x1 and x2 correspond to the chosen action, left or right, respectively. For the state information, x1 and x2 correspond to fixed-choice and free-choice blocks, and for the reward information, x1 and x2 correspond to rewarded and unrewarded choices, respectively. The mutual information shared by F and X is defined by

I(F; X) = Σf Σx P(f, x) log2 [P(f, x) / (P(f) P(x))].

For each neuron, mutual information (bits/s) was estimated (for more detail, see Ito and Doya, 2009) for every 100 ms time bin of an EASH, using all trials, including single fixed-, double fixed-, and free-choice blocks (see Fig. 5A,C,F). To test whether the averaged mutual information (see Fig. 5B,D,E,G) was significant, a threshold indicating significant information (p < 0.01) was obtained as follows. A binary event, x1 or x2, was generated randomly for each trial, and the mutual information between this random event and the spikes was calculated for all neurons and then averaged over each region. This calculation was repeated 100 times with new random events, and the second-largest averaged mutual information for each time window was taken as the threshold indicating significant information.
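A sketch of the mutual information computation and the shuffle-based significance threshold is shown below (Python with NumPy; here the threshold is computed per neuron for brevity, whereas the analysis above first averages the shuffled information over each region).

```python
import numpy as np

def mutual_information(firing_levels, events):
    """I(F; X) = sum_f sum_x P(f, x) log2[P(f, x) / (P(f) P(x))] for a
    discretized firing level F (four levels) and a binary event X; both
    arguments are NumPy arrays with one entry per trial.  Dividing the
    result by the window length (in s) converts bits to bits/s."""
    info = 0.0
    for f in np.unique(firing_levels):
        for x in np.unique(events):
            p_fx = np.mean((firing_levels == f) & (events == x))
            if p_fx > 0:
                p_f = np.mean(firing_levels == f)
                p_x = np.mean(events == x)
                info += p_fx * np.log2(p_fx / (p_f * p_x))
    return info

def shuffle_threshold(firing_levels, n_shuffles=100, seed=0):
    """Significance threshold: second-largest information obtained with
    randomly generated binary events (100 repetitions, p < 0.01)."""
    rng = np.random.default_rng(seed)
    vals = [mutual_information(firing_levels,
                               rng.integers(0, 2, len(firing_levels)))
            for _ in range(n_shuffles)]
    return sorted(vals)[-2]
```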
Regression analysis.
We conducted multiple linear regression analyses to capture the information coded in neuronal firing. Because there are various candidates for explanatory variables, selecting a set of explanatory variables is an important issue. A model with many parameters tends to show low fitting error but can overfit. Furthermore, if some explanatory variables in a regression model are correlated, a regression analysis tends to fail to detect these variables, resulting in a Type II error. In the present study, we used the Bayesian information criterion (BIC) to select a set of explanatory variables from the full model (2) (described in Results). The BIC can be regarded as a fitting measure that penalizes the number of parameters in the model. Assuming that the model errors ε(t) are independent and identically distributed according to a normal distribution, the BIC is given by

BIC = N log σ̂² + k log N,

where N is the amount of data, k is the number of parameters (the number of β terms), and σ̂² is the error variance,

σ̂² = (1/N) Σt [y(t) − ŷ(t)]².

Here, ŷ(t) is the prediction of the model with the parameters βi tuned to minimize σ̂². A smaller BIC indicates a better model. Because the full regression model (2) includes six explanatory variables (in addition to the constant term β0 and the trend term βLt), we can consider 2⁶ = 64 models covering all combinations of whether each explanatory variable is included. We calculated the BIC for all combinations and selected the set of explanatory variables with the smallest BIC. We then tested the statistical significance of each regression coefficient in the selected model using standard regression analysis. If p < 0.01, the corresponding variable was regarded as being coded in the firing rate. This variable selection was conducted independently for each neuron and each time bin.
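The exhaustive BIC-based selection can be sketched as follows (Python with NumPy; as a simplification, the intercept is always included here, and column indices stand in for the explanatory variables of model (2)).

```python
import itertools
import numpy as np

def bic_select(y, X):
    """Exhaustive BIC-based variable selection: fit ordinary least
    squares for every subset of the candidate regressors (columns of X)
    and return the subset minimizing BIC = N log(sigma^2) + k log(N)."""
    n = len(y)
    best_bic, best_subset = np.inf, ()
    for size in range(X.shape[1] + 1):
        for subset in itertools.combinations(range(X.shape[1]), size):
            Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            sigma2 = max(np.mean((y - Xs @ beta) ** 2), 1e-12)
            bic = n * np.log(sigma2) + Xs.shape[1] * np.log(n)
            if bic < best_bic:
                best_bic, best_subset = bic, subset
    return best_subset, best_bic
```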
Results
We recorded neuronal activity from DLS, DMS, and VS of rats (n = 7) performing fixed-choice and free-choice tasks in an experimental chamber (Fig. 1A). After a rat poked its nose into the center hole, one of three cue tones (left tone, right tone, or choice tone) was presented (Fig. 1B,C). Reward probabilities varied depending on the cue tone and the chosen action (Fig. 1C). Reward probabilities were fixed for the left tone (50% for the left choice, 0% for the right choice) and the right tone (0%, 50%). Reward probabilities for the choice tone varied in a block-wise manner among four settings: (90%, 50%), (50%, 90%), (50%, 10%), and (10%, 50%). In the first and second blocks, the left and right tones were presented, respectively (single fixed-choice blocks) (Fig. 2A). In the third and fourth blocks, the left and right tones were presented randomly on each trial (double fixed-choice blocks). In the fifth to eighth blocks, only the choice tone was presented (free-choice blocks) (Fig. 2A). The four reward-probability pairs (Fig. 1C) were randomly assigned to these four blocks. The same block continued until at least 20 choice trials were completed (40 trials for double fixed-choice blocks). A block was completed when the choice frequency of the action associated with the higher reward probability reached 80% during the last 20 trials (40 trials, covering both tones, in double fixed-choice blocks), and a new block then started with no explicit cue to the rats; a sketch of this criterion follows below. To assess sensitivity to changes in reward probability, an extinction test consisting of five trials with no reward was conducted for the left and right tones between the third and fourth (double fixed-choice) blocks and for the choice tone between the sixth and seventh (free-choice) blocks. This block sequence was repeated two or three times in a one-day recording session.
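For clarity, here is a minimal sketch of the block-completion criterion just described (Python; names are illustrative).

```python
def block_completed(choices, better_action, min_trials=20):
    """True once at least `min_trials` choice trials have been run and
    the action with the higher reward probability was chosen on >= 80%
    of the last `min_trials` trials (use min_trials=40 for double
    fixed-choice blocks)."""
    if len(choices) < min_trials:
        return False
    recent = choices[-min_trials:]
    return recent.count(better_action) / min_trials >= 0.8
```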
The numbers of trials required to complete one block were 22.56 ± 8.70 for single fixed-choice blocks, 47.11 ± 20.29 for double fixed-choice blocks, and 41.10 ± 27.58 for free-choice blocks (mean ± SD). Here, we report the results of all 78,107 trials in 86 recording sessions performed by seven rats, consisting of 12,185 single fixed-choice trials (16.4%), 22,234 double fixed-choice trials (30.0%), 36,292 free-choice trials (48.9%), and 3505 extinction-test trials (4.7%).
Behavioral performance
First, we tested whether action selection was sensitive to reward omission in fixed-choice and free-choice blocks. In the extinction tests (Fig. 2B,C), whereas choice probabilities for the choice tone shifted toward 50% after the first nonrewarded trial (Fig. 2B, orange lines), choice probabilities for the left and right tones remained biased even after five no-reward trials (Fig. 2B, blue and red lines), suggesting low sensitivity to the change in reward contingency. Choice probabilities during extinction tests differed significantly between the left and right tones and the choice tone (Fig. 2C; p < 0.0001, χ2 test).
Decision trees (Fig. 2D–F) indicate how choice probabilities changed with previous action and reward experiences. The decision tree for choice tones expands symmetrically (Fig. 2D), indicating that action selection was sensitive to past experiences, as in our previous study of free-choice trials (Ito and Doya, 2009). On the other hand, action probabilities for the left or right tones (in both single and double fixed-choice blocks) are biased toward 1 (Fig. 2E) or 0 (Fig. 2F), respectively, indicating insensitivity to past experiences.
These results suggest that action selection in the free-choice blocks was flexible, whereas action selection in the fixed-choice blocks was inflexible. The actions in the fixed-choice blocks might be related to habitual action (Barnes et al., 2005; Bayley et al., 2005; Broadbent et al., 2007). In recent years, the term "habitual action" has often been used in contrast to "goal-directed action"; in this context, the two are distinguished by outcome devaluation and/or contingency degradation tests (Yin et al., 2004, 2005a,b, 2006; Balleine et al., 2007; Balleine and O'Doherty, 2010). Further experiments are required to test whether the behaviors in the fixed- and free-choice blocks of the present task can be regarded as habitual and goal-directed actions, respectively.
We then analyzed choice sequences in the free-choice task using value-based reinforcement learning models (Ito and Doya, 2009). We used DFQ-learning (Q-learning with differential forgetting) models, in which the action values Qi(t) for i ∈ {L, R} were updated from the previous action and reward with four parameters: the learning rate α1 for the chosen action, the forgetting rate α2 for the action not chosen, the strength of reinforcement κ1 by reward, and the strength of aversion κ2 resulting from no reward. DFQ-learning models are experience-based, model-free algorithms that cannot reproduce goal-directed behaviors.
We considered cases of fixed and time-varying parameters. Fixed parameters were assumed to be constant over all sessions. Time-varying parameters were assumed to vary with drift-rate parameters σα and σκ and were estimated, along with the time courses of the action values, by dynamic Bayesian inference (Ito and Doya, 2009). We tested six subtypes of DFQ models (Q-learning with α2 = κ2 = 0, FQ-learning with α1 = α2, and full DFQ-learning, each with fixed or time-varying parameters) and found that the FQ-learning model with time-varying parameters predicted rat behaviors best (Fig. 2G). Its prediction accuracy was higher than that of Markov models (generic time-series prediction models); the dth Markov model is a purely descriptive model that predicts an action from the experiences in the last d trials. This model-fitting result was almost the same as in our previous study (Ito and Doya, 2009). Figure 2H shows the time courses of the estimated parameters (κ1 and κ2; α is not shown) and of the estimated action values for left (QL) and right (QR) choices, which were used for the analysis of neuronal activity (see Figs. 7 and 8).
We also tested a regular actor-critic model with constant parameters (Sutton and Barto, 1998), but the normalized likelihood of the test data was close to chance level (0.5080; data not shown). For actor-critic models, numerous variations of the actor can be considered; however, we could not find an actor-critic model that fit the rats' behavior better than the Q-learning models with constant parameters.
Activity patterns of phasically active neurons
We recorded neuronal activity in DLS, DMS, and VS of rats performing the fixed- and free-choice tasks. Each rat was implanted with three bundles of eight microwires. Bundles were advanced by 125 μm between recording sessions so that data from new neurons were acquired in each session. Stable recordings were made from 260 neurons in DLS, 178 neurons in DMS, and 179 neurons in VS from seven rats (Fig. 3A) (see Materials and Methods). Among these, 190, 105, and 119 neurons from DLS, DMS, and VS, respectively, were classified as PANs (putative medium spiny projection neurons) based on interspike-interval statistics (Schmitzer-Torbert and Redish, 2004) and waveforms (see Materials and Methods). Only data from these PANs were used in the following analyses.
Intervals between task events (the start of center hole poking, the onset of the cue tone, the offset of the cue tone, the end of center hole poking, the start of L/R hole poking, and the end of L/R hole poking) varied across trials (Fig. 3B). To develop an overall neuronal activity profile despite this timing variability, we created event-aligned spike histograms (EASHs) (Fig. 3C). An EASH is derived by linearly scaling the time intervals between task events in each trial to the average intervals across all trials (see Materials and Methods). The peak at the start of L/R poking is clearer in the EASH than in the PETH aligned to the timing of center hole entry (Fig. 3B,C). We defined the intervals between task events as trial Phases 1 through 7 (Fig. 3C). DLS, DMS, and VS neurons were activated at different task events, such as the start of center hole poking (Fig. 3D), and in different trial phases, such as between the exit from the center hole and the start of left or right hole poking (trial Phase 5, action execution) (Fig. 3D,F,G). Most neurons changed their activity patterns depending on upcoming actions, selected actions, reward outcomes, and types of choice blocks. For instance, DLS neurons (Fig. 3D,E) changed their activity before action execution depending on whether the left or right hole was to be selected. Activities of DMS and VS neurons were modulated by both executed actions and reward outcomes (Fig. 3F,H). DMS and VS neurons (Fig. 3G,I, respectively) showed different activities in fixed-choice and free-choice blocks. Interestingly, Roesch et al. (2009) conducted similar fixed- and free-choice tasks using odor stimuli and reported that the population activity pattern in VS was the same in both trial types. A possible reason for this difference is that Roesch et al. (2009) randomly interleaved fixed-choice and free-choice trials, whereas in our task these trials were separated into different blocks (Fig. 2A).
To develop an overview of neural activity profiles in DLS, DMS, and VS, we visualized normalized EASHs of all PANs (Fig. 4A), with neurons sorted by the timing of their activation peaks. For each trial phase, we found neurons that increased their activity in all three subareas, but in different proportions (Fig. 4B). The proportion of neurons that increased their activity as a rat approached the center hole (trial Phase 1) was >60% in VS, significantly larger than in DLS and DMS (p = 0.0015 and p < 0.0001, respectively, χ2 test). Between a rat's exit from the center hole and its entry into the L/R hole (trial Phase 5), >60% of DMS neurons were activated, a proportion significantly larger than those in DLS and VS (p = 0.00021 for DLS, p = 0.040 for VS, χ2 test). During the phase of receiving a sucrose pellet (trial Phase 7), the proportion of activated neurons was significantly larger in VS than in DLS (p = 0.0040, χ2 test). Furthermore, the duration of activation (time above half of peak activity) was longest in VS neurons and shortest in DLS neurons (VS vs DLS, p < 0.0001; VS vs DMS, p < 0.0001; DLS vs DMS, p = 0.040; Mann–Whitney U tests) (Fig. 4C).
Information coding of state, action, and reward
To elucidate when and how much information about each task event was represented in each subarea of the striatum, the amount of mutual information between neuronal firing and each task event was calculated (Panzeri and Treves, 1996; Ito and Doya, 2009).
State information (fixed- or free-choice block) was strongest in VS, especially during the approach to the center hole (trial Phase 1) (p < 0.0001 for DLS, p = 0.022 for DMS, Mann–Whitney U tests) (Fig. 5A,B), the phase in which more than 60% of VS neurons were activated (Fig. 4A,B). Action information (left or right hole choice) started increasing during tone presentation (trial Phase 3), specifically in DLS (Fig. 5C–E).
Regarding action information, Kim et al. (2009) reported that a slight but significant upcoming-action signal was represented in DS but not in VS. In this study, we found consistent and more detailed action representations. Action information during the 100 ms before the offset of the cue tone was significantly higher in DLS than in DMS and VS (p = 0.0175 for DMS, p = 0.00091 for VS, Mann–Whitney U tests). Immediately after the offset of the cue tone (trial Phase 4), action information in DMS rapidly increased and became higher than that in DLS and VS during the 100 ms before the exit from the center hole, whereas information in VS was weakest (p = 0.16 for DLS, p = 0.0027 for VS, p < 0.0001 for DLS vs VS, Mann–Whitney U tests) (Fig. 5D,E). During action execution (trial Phase 5), action information was highest in DMS (p < 0.0001 for DLS, p < 0.0001 for VS, p < 0.0001 for DLS vs VS, Mann–Whitney U tests) (Fig. 5D).
Reward information (pellet delivered or not) started rising simultaneously in all subareas after the start of L/R poking, when the reward or no-reward tone was presented (trial Phase 6, Fig. 5G). Reward information was strongest in VS, followed by DMS, and reward information in DLS was significantly weaker than that in VS and DMS (p = 0.012 for DLS vs DMS, p < 0.0001 for DLS vs VS, p = 0.13 for DMS vs VS, Mann–Whitney U tests). We found similar patterns in the proportions of event-coding neurons, namely, how many neurons changed their firing rates depending on selected actions, reward outcomes, and choice-block types (states) (Fig. 6). For instance, the proportion of state-coding neurons during cue presentation (trial Phase 3) was largest in VS, and the proportion of action-coding neurons was largest in DLS before action execution and in DMS during action execution. The proportion of reward-coding neurons was largest in VS during L/R poking (Fig. 6D). These proportions were similar in fixed-choice and free-choice blocks (Fig. 6B,C).
Model-based analysis of action value and state value coding
We then conducted a model-based analysis of neural coding (Corrado and Doya, 2007; O'Doherty et al., 2007) using the FQ-learning model with time-varying parameters, which best fit rat behaviors during free-choice blocks (Fig. 2G,H) (Ito and Doya, 2009). We conducted multiple linear regression analysis using the following regression model:

y(t) = β0 + βLt + β1a(t) + β2r(t) + β3a(t − 1) + β4r(t − 1) + β5QL(t) + β6QR(t) + ε(t), (2)

where y(t) is the number of spikes in trial t in a given time bin, the βi are regression coefficients, a(t) is the action (1 for left, 2 for right), and r(t) is the reward (1 for reward, 0 for no reward). a(t − 1) and r(t − 1) are the action and reward in the previous trial, respectively. QL(t) and QR(t) are the action values estimated by the FQ-learning model, and ε(t) is an error term. The second term, βLt, was included to absorb any increasing or decreasing trend of the firing rate during a session. How to select a set of explanatory variables is an important issue in regression analyses. In this analysis, for each neuron and for each time bin, we chose the set of explanatory variables from the full model (2) that minimized the BIC (see Materials and Methods).
The firing rate of a DLS neuron (Fig. 7A) was significantly correlated with the left action value QL during cue presentation, before action onset (trial Phase 3). The firing of a DMS neuron (Fig. 7B) showed a significant negative correlation with the left action value QL during action execution (trial Phase 5). A VS neuron (Fig. 7C) showed firing negatively correlated with both the left and right action values during the approach to the center hole (trial Phase 1), suggesting state-value coding. In these 100 ms time bins, firing changed in a manner remarkably similar to the time courses of the action values and state values estimated from animal behavior (Fig. 7A–C).
Substantial proportions of neurons in DLS, DMS, and VS coded actions before and after action execution (Fig. 8A). Kim et al. (2013) reported that the action in the previous trial was strongly coded in DMS throughout the entire trial period. In our experiment, although the previous-action signal in DMS seemed relatively weak, a significant proportion of DMS neurons coded previous actions during action execution (trial Phase 5; Fig. 8B). Substantial proportions of neurons in all subareas coded rewards after the start of L/R hole poking (Fig. 8C), and rewards in the previous trial were also strongly coded in all trial phases in all striatal regions (Fig. 8D), consistent with previous studies (Ito and Doya, 2009; Kim et al., 2009, 2013).
Action-value coding neurons were defined as neurons whose activity was significantly correlated with either action value, left (QL) or right (QR). Although significant proportions of neurons in each subarea coded action values in all trial phases, the proportion of action-value coding neurons was highest in DMS during action execution (Fig. 8E; p = 0.00058 for DLS, p = 0.081 for VS, Mann–Whitney U test). The majority of these action-value coding neurons in DMS represented QR, the action value of the right hole choice, and the proportion of neurons coding QR during action execution was significantly larger in DMS than in DLS (p = 0.00017, Mann–Whitney U test); both regions were recorded from the left hemisphere.
The strongest action-value representation in DMS is consistent with our prediction (I). We also found action-independent value-coding neurons (state-value coding neurons), defined as neurons whose activity was significantly correlated with both action values with the same sign. The proportion of state-value coding neurons was highest in VS in all trial phases; in particular, in the starting phase (trial Phase 1) and the action-initiation phase (trial Phase 4), the proportion was significantly larger in VS than in DLS and DMS (p = 0.039 for DMS and p = 0.0027 for DLS in trial Phase 1; p = 0.024 for DMS and p = 0.0045 for DLS in the first half of Phase 4; Mann–Whitney U test) (Fig. 8F). We did not find significant numbers of policy-coding neurons (those with significant correlations with both action values with opposite signs) in any subarea or any time bin (data not shown).
Action-command information in fixed- and free-choice blocks
To assess the roles of the striatal subareas in fixed- and free-choice blocks, we compared action-command coding in the three subareas separately for four types of task blocks: single fixed-choice blocks, double fixed-choice blocks, free-choice blocks with higher reward probabilities [(L: 90%, R: 50%) and (50%, 90%)], and free-choice blocks with lower reward probabilities [(L: 50%, R: 10%) and (10%, 50%)] (Fig. 9). We calculated action information from the last 20 trials of each block so that the estimation bias of the mutual information was identical for all block types. Action information in the single and double fixed-choice blocks started to rise first in DLS during cue presentation, then in DMS after cue offset (trial Phase 4), and finally in VS after the onset of action execution (trial Phase 5) (Fig. 9A,B). Contrary to expectation, this temporal pattern of action information was also seen in free-choice blocks (Fig. 9C,D). There were no conditions in which the action command in DMS was stronger than that in DLS.
We also compared the strength of action coding across blocks for each subarea (Fig. 10). In DLS, action information in single and double fixed-choice blocks started increasing immediately after the onset of cue presentation (Fig. 10A). Action information in free-choice blocks with higher reward probabilities increased more slowly than in fixed-choice blocks, and action information in free-choice blocks with lower reward probabilities appeared later and was weaker. There were significant differences in action information between fixed-choice and free-choice blocks in trial Phase 3 (during the 500 ms before cue offset) and trial Phase 4 (during the 500 ms before center hole exit) (p = 0.00047 and p = 0.00037, respectively, paired Mann–Whitney U test). In DMS, action information was stronger in single and double fixed-choice blocks than in free-choice blocks (Fig. 10B), similar to DLS, with significant differences in trial Phases 3 and 4 (p = 0.040 for Phase 3, p = 0.0029 for Phase 4, paired Mann–Whitney U test). Action information in VS had a transient peak during cue presentation only in single fixed-choice blocks and remained low until the onset of action execution (Fig. 10C), with no significant differences between fixed- and free-choice blocks in trial Phases 3 and 4 (p = 0.23 and p = 0.28, respectively, paired Mann–Whitney U test). Whereas the stronger action-information coding in fixed-choice blocks by DLS neurons is consistent with our prediction (II), the equally stronger action information in fixed-choice blocks found in DMS neurons is contrary to prediction (II).
We further analyzed the action command in free-choice blocks depending on whether the action was the same as in the previous trial (stay trials) or different (switch trials) (Fig. 10D–F). In DLS, action information was stronger in stay trials than in switch trials in trial Phase 3 (during the 500 ms before cue offset) (p = 0.019, Mann–Whitney U test). In DMS, however, we found no difference in the action-command signal between stay and switch trials.
Discussion
To clarify the distinct roles of DLS, DMS, and VS in decision making, we recorded neuronal activity from these portions of the striatum in rats performing a fixed-choice task and a free-choice task. The analysis of phasically active neurons, which are thought to be medium spiny neurons, revealed differences in the temporal profiles of firing and information coding. When rats began the tasks by approaching the center hole, more than half of VS neurons increased their firing (Fig. 4A,B), and their activity coded information about the task condition (fixed- or free-choice blocks) (Figs. 5B and 6D) and the state value (Fig. 8F). When presentation of the cue tone started, action information began to rise only in DLS (Fig. 5E). Then, immediately after the offset of the cue tone, action information in DMS sharply increased (Fig. 5D,E). When rats started moving to the left or right hole, action information became higher in DMS than in DLS and VS, and the proportion of action-value coding neurons increased specifically in DMS (Fig. 8E).
Clear peaks of action-value information during action execution in DMS and state-value representation in VS are consistent with our prediction (I). Contrary to our prediction (II), action information before action execution was stronger in fixed-choice blocks than in free-choice blocks in both DLS and DMS (Figs. 9 and 10). However, further analysis of free-choice blocks revealed that action-command coding was stronger in stay trials than in switch trials in DLS, although there was no significant difference in DMS (Fig. 10D,E). This suggests relatively stronger involvement of DLS in repetitive behaviors.
Action value and state value coding
Although neuronal correlates of action values have been reported mainly in DS (Samejima et al., 2005; Pasquereau et al., 2007; Lau and Glimcher, 2008; Hori et al., 2009; Kim et al., 2009; Wunderlich et al., 2009), clear differences in action-value representation among the subregions of the striatum have not been detected (Kim et al., 2009, 2013; Stalnaker et al., 2010). In support of prediction (I), our analysis revealed that the action-value signal was highest in DMS during action execution (Fig. 8E). This is consistent with the role of action values in realizing flexible action selection. Regarding state values, consistent with previous suggestions (O'Doherty et al., 2004; Atallah et al., 2007; Takahashi et al., 2007, 2008), we observed the strongest neuronal correlate of the state-value signal in VS. A significant proportion of state-value coding VS neurons was observed from the approach period (trial Phase 1) to the action-initiation period (trial Phase 4) (Fig. 8F), supporting the idea that VS plays the role of the critic (Joel et al., 2002).
Different dynamics of action command in DLS and DMS
Information about the upcoming action, namely, the action command, has been found in DS of monkeys performing choice tasks (Pasupathy and Miller, 2005; Samejima et al., 2005; Pasquereau et al., 2007). In rodent studies, by contrast, the action-command signal in DS has been reported to be relatively weak or absent in spatial choice tasks (Kim et al., 2009, 2013; Thorn et al., 2010; van der Meer et al., 2010). However, Stalnaker et al. (2010) reported a clear action-command signal in DLS and DMS, similar to our results. In both their task and ours, rats were required to maintain nose poking during cue presentation before action selection. This immobile phase might be important for capturing the action-command signal.
In our task, the rat had to maintain its nose-poke in the center hole until the offset of a cue tone before moving to the left or right hole; otherwise, the trial was ended as an error trial without reward. Thus, this task required two processes: one was to wait until the offset of the cue, and the other was to select the left or right hole for the given cue. Considering the temporal profile of action-command coding (Fig. 5D,E), DLS and DMS appear to be involved in parallel, competing decision modules. DLS might belong to an elementary decision module that rapidly selects an action in response to the given cue, ignoring the waiting process. DMS might belong to a comprehensive decision module that takes into account both waiting and selecting. We speculate that a decision module related to DMS attempted to maintain nose poking, whereas another decision module related to DLS attempted to respond, and that the action was realized when both modules agreed after the offset of the cue tone in successful (no wait-error) trials.
Action command in fixed- and free-choice blocks
In fixed-choice blocks, the rats kept making the same response to each tone cue regardless of the outcome, whereas in free-choice blocks the rats showed high sensitivity to reward omission and past experiences (Fig. 2B–F). From this result, we expected that the action-command representation in DLS, which is involved in habitual action, would be stronger in fixed-choice blocks than in free-choice blocks, and that this relation would be reversed in DMS, which is involved in flexible, goal-directed action.
However, in both DLS and DMS, the action-command signal was significantly stronger in fixed-choice blocks than in free-choice blocks (Fig. 10A,B). Stalnaker et al. (2010) conducted similar tasks using odor cues and also reported that the action command of DMS was stronger in forced-choice trials than in free-choice trials. Furthermore, temporal patterns of action-command signals of all subareas (Fig. 5D,E) were preserved when these were calculated separately for fixed- and free-choice blocks (Fig. 9). When we compared the action-command signal in stay trials and switch trials in free-choice blocks, while action-command coding was stronger in stay trials than in switch trials in DLS, we could not find stronger action-command coding in switch trials in DMS (Fig. 10D,E).
These results suggest that the idea that DLS or DMS becomes dominantly active in fixed- or free-choice tasks, respectively, is not correct. Rather, our results show that the computations in DLS and DMS are not mutually exclusive but are performed in parallel in both fixed- and free-choice tasks. A downstream network might select the final output depending on the task condition (Thorn et al., 2010).
Hierarchical reinforcement learning model
Summarizing previous findings, the roles of the three striatal subareas could be described by a "two actors-one critic" model. DLS is involved in simple state-action or stimulus-response associations (inflexible, habitual action), whereas DMS performs actions based on action-outcome associations (flexible, goal-directed action). VS might help the learning of these two actors via dopaminergic responses. However, a dissociation of DLS and DMS functions solely on the basis of habitual versus goal-directed actions cannot explain our results (Figs. 4–8).
An alternative model is that a hierarchical reinforcement learning system is realized along the dorsolateral-ventromedial axis of the basal ganglia (Samejima and Doya, 2007; Ito and Doya, 2011). In the present task, a rat needs to perform multiple motor actions to complete a single trial: approaching the center hole, inserting its nose into the hole, maintaining the nose-poke in the hole, and so on. It is reasonable to assume that the striatum is also involved in these detailed actions. We have proposed a working hypothesis that VS, DMS, and DLS are hierarchical learning modules in charge of actions at different physical and temporal scales (Ito and Doya, 2011). VS is the coarsest module, governing actions of the whole animal, such as aiming for a goal, avoiding a danger, or simply resting. DMS is the middle module, in charge of abstract actions, such as turning left, turning right, or going straight, taking contextual information into account. DLS is the module in charge of the finest control of physical actions, such as the control of each limb. Consistent with this hypothesis, the average firing duration of VS neurons was the longest among the three subareas, that of DMS neurons was the second longest, and that of DLS neurons was the shortest (Fig. 4C).
A large majority of VS neurons were activated when rats started the tasks (Fig. 4B). This might be interpreted to mean that VS is involved in higher-order decisions to initiate tasks, or that this activity serves as a signal promoting flexible approach, as proposed by Nicola (2010). In DMS, most neurons were activated during execution of the selected action (Fig. 4B), at which time action information was strongly represented (Fig. 5D). A similar DMS activation during action selection was also reported by Thorn et al. (2010). These findings suggest that DMS is the site most likely to be involved in decisions regarding abstract actions, such as "select the left hole" or "select the right hole." Activity peaks of DLS neurons were not only sharper than those of DMS and VS neurons (Fig. 4C) but also more uniformly distributed, without specific preferred task events (Fig. 4B). Each activity peak might help to control the body and limbs within a brief time window, as proposed by Ito and Doya (2011).
Footnotes
This work was supported by Ministry of Education, Culture, Sports, Science and Technology KAKENHI Grants 23120007 and 26120729, Japan Society for the Promotion of Science KAKENHI Grant 25430017, and Okinawa Institute of Science and Technology Graduate University research support to K.D. We thank anonymous reviewers, whose comments and criticisms greatly improved the manuscript; and Steven D. Aird, technical editor of Okinawa Institute of Science and Technology Graduate University, for thorough editing and proofreading.
The authors declare no competing financial interests.
Correspondence should be addressed to Dr. Makoto Ito, Okinawa Institute of Science and Technology Graduate University, 1919-1 Tancha, Onna-son, Okinawa 904-0412, Japan. ito@oist.jp
This article is freely available online through the J Neurosci Author Open Choice option.