2006 Special Issue
Brain mechanism of reward prediction under predictable and unpredictable environmental dynamics
Introduction
When we learn or plan an optimal action, the dynamics of the environment are a critical factor. If the environmental state transitions are completely random, the best we can do is select the action with the maximal expected immediate reward. If, on the other hand, the future states that follow a selected action are known or learnable, we can exploit this causal structure to select actions that maximize long-term expected reward. Reinforcement learning theory tells us how an agent can learn optimal behavior given a reward setting and environmental dynamics (Sutton & Barto, 1998). Recently, reinforcement learning theory has been used successfully to explain animal and human action learning behavior and the reward-predictive activities of the midbrain dopaminergic system, as well as those of the cortex and the striatum (Berns et al., 2001; Doya, 2000; Houk et al., 1995; McClure et al., 2003; O’Doherty et al., 2003; Schultz et al., 1997). However, almost all of these studies considered a non-dynamic environment or a simple delay from actions to rewards, leaving unclear the brain mechanisms of action learning under different environmental dynamics.
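The core learning rule assumed in this family of studies is the temporal-difference (TD) update, in which the reward prediction error drives learning. As a minimal illustration (a generic Q-learning sketch with arbitrary toy states and parameters, not the model fitted in this study):

```python
def td_q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One temporal-difference (Q-learning) update: move Q(state, action)
    toward the received reward plus the discounted value of the best
    action available in the next state."""
    best_next = max(Q[next_state].values())
    td_error = reward + gamma * best_next - Q[state][action]  # reward prediction error
    Q[state][action] += alpha * td_error
    return td_error

# Toy example: two states, two actions, all action values start at zero.
Q = {s: {a: 0.0 for a in ("left", "right")} for s in (0, 1)}
err = td_q_update(Q, state=0, action="left", reward=1.0, next_state=1)
# With all values at zero, the prediction error equals the reward (1.0),
# and Q[0]["left"] moves to alpha * 1.0 = 0.1.
```

The prediction error returned here is the quantity that TD accounts of dopaminergic activity (e.g. Schultz et al., 1997) identify with phasic dopamine responses.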
Here, to explore how the predictability of environmental dynamics affects subjects’ learning behavior and brain activity, we compared brain activity while subjects performed a Markov decision task under deterministic and stochastic state transition rules. In a model-based analysis, we estimated meta-parameters and reward prediction signals from each subject’s action sequence using Bayesian estimation methods. We found that subjects used larger temporal discounting parameters under predictable than under unpredictable environmental dynamics, and that different basal ganglia loops were involved in reward prediction under the two kinds of dynamics.
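Why a larger temporal discounting parameter pays off only when the dynamics are predictable can be seen from the discounted return itself. A minimal sketch, with hypothetical reward values (not the task’s actual ones): a far-sighted path accepts several small losses to reach one large reward, while a myopic path takes a small gain every step.

```python
def discounted_return(rewards, gamma):
    """Present value of a reward sequence under exponential discounting:
    sum over t of gamma**t * r_t."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical reward sequences (illustrative values only).
farsighted = [-10, -10, -10, 100]  # small punishments, then a large reward
myopic = [10, 10, 10, 10]          # a small immediate gain every step

# With a small discount factor the myopic path is valued higher; with a
# large discount factor the far-sighted path wins.
print(discounted_return(farsighted, 0.3) < discounted_return(myopic, 0.3))    # True
print(discounted_return(farsighted, 0.99) > discounted_return(myopic, 0.99))  # True
```

A large discount factor is only useful when future states can actually be anticipated, which is why a deterministic transition rule makes far-sighted valuation worthwhile.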
Section snippets
Behavioral task
In the Markov decision task (Fig. 1), markers on the corners of a square represent four states, and the subject selects one of two actions by pressing a button ( = left button, = right button) (Fig. 1A). The action determines both the amount of reward and the movement of the marker (Fig. 1B). In the REGULAR condition, each trial starts from the marker position at the end of the previous trial. Therefore, in order to maximize the reward acquired in the long run, the subject has to
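The contrast between the two conditions can be sketched as a tiny four-state simulator. The transition and reward tables below are hypothetical placeholders chosen for illustration, not the actual values used in the task:

```python
import random

N_STATES = 4
ACTIONS = ("left", "right")

# Deterministic ("REGULAR") transitions: each action moves the marker to a
# fixed neighboring corner of the square.
regular_next = {(s, "left"): (s + 1) % N_STATES for s in range(N_STATES)}
regular_next.update({(s, "right"): (s - 1) % N_STATES for s in range(N_STATES)})

def step(state, action, condition, rng=random):
    """Return (next_state, reward) for one trial under the given condition."""
    # Placeholder reward rule: one state/action pair yields a large reward,
    # everything else a small punishment.
    reward = 20 if (state == 3 and action == "left") else -5
    if condition == "REGULAR":
        next_state = regular_next[(state, action)]
    else:  # "RANDOM": the next state is drawn uniformly, so the
        next_state = rng.randrange(N_STATES)  # transitions carry no exploitable structure
    return next_state, reward

# In the REGULAR condition, repeating one action cycles the marker
# deterministically around the square.
state = 0
for _ in range(5):
    state, r = step(state, "left", "REGULAR")
```

Only in the REGULAR condition does the current state predict the next, so only there can the subject plan a multi-step route to the large reward.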
Behavioral results
Fig. 2 summarizes the learning performance of a representative single subject (solid line) and the group average (dashed line) during the fMRI measurements. All subjects successfully learned to take the larger immediate rewards in the RANDOM condition (Fig. 2A). In the REGULAR condition, fourteen subjects successfully learned to take a large reward at after small punishments at , , and (Fig. 2B). Two subjects fell into a three-state cycle; they took a large reward at , a large punishment at
Possible cortical roles in action learning under predictable and unpredictable environments
In the control condition, in which the reward was always zero, the subjects’ actions did not affect the reward outcomes. In contrast, in the RANDOM and REGULAR conditions, in which one of the two actions led to a larger reward at each state, the subjects performed reward-based action learning. Thus the results of our block-design comparison with the control condition in the OFC, dPM, PC and striatum may reflect several functions needed for reward-based action learning, regardless of the
Conclusion
We demonstrated the differential involvement of cortico-basal ganglia loops in action learning under different environmental dynamics. The OFC-ventral striatum loop was involved in action learning based on immediate reward in both predictable and unpredictable environments. In contrast, in a predictable environment, the DLPFC-dorsal striatum loop was dominantly involved in action learning that takes future states into consideration. In our previous study, although we showed different involvement of
Acknowledgements
We thank M. Kawato for helpful discussions. This research was funded by “Creating the Brain”, Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency.
References (38)
- Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology (2000).
- Temporal prediction errors in a passive learning task activate human striatum. Neuron (2003).
- Temporal difference models and reward-related learning in the human brain. Neuron (2003).
- Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience (1986).
- Prefrontal cortex and decision making in a mixed-strategy game. Nature Neuroscience (2004).
- Emotion, decision making and the orbitofrontal cortex. Cerebral Cortex (2000).
- Predictability modulates human brain response to reward. Journal of Neuroscience (2001).
- Differential response patterns in the striatum and orbitofrontal cortex to financial reward in humans: A parametric functional magnetic resonance imaging study. Journal of Neuroscience (2003).
- Memory related motor planning activity in posterior parietal cortex of macaque. Experimental Brain Research (1988).
- A model of how the basal ganglia generate and use neural signals that predict reinforcement.
- Expectation of reward modulates cognitive signals in the basal ganglia. Nature Neuroscience.
- 3-D diffusion tensor axonal tracking shows distinct SMA and pre-SMA projections to the human striatum. Cerebral Cortex.
- Diffusion tensor fiber tracking shows distinct corticostriatal circuits in humans. Annals of Neurology.
- Neurophysiological investigation of the basis of the fMRI signal. Nature.
- Separate neural systems value immediate and delayed monetary rewards. Science.
- Neuronal activity in the primate premotor, supplementary, and precentral motor cortex during visually guided and internally determined sequential movements. Journal of Neurophysiology.
- Dissociating valence of outcome from behavioral control in human orbital and ventral prefrontal cortices. Journal of Neuroscience.
- Planning and spatial working memory: A positron emission tomography study in humans. European Journal of Neuroscience.
- Activity in human ventral striatum locked to errors of reward prediction. Nature Neuroscience.