Neural Networks

Volume 19, Issue 8, October 2006, Pages 1233-1241

2006 Special Issue
Brain mechanism of reward prediction under predictable and unpredictable environmental dynamics

https://doi.org/10.1016/j.neunet.2006.05.039

Abstract

In learning goal-directed behaviors, an agent has to consider not only the reward given at each state but also the consequences of the dynamic state transitions associated with action selection. To understand the brain mechanisms of action learning under predictable and unpredictable environmental dynamics, we measured brain activity with functional magnetic resonance imaging (fMRI) during a Markov decision task with predictable and unpredictable state transitions. Whereas the striatum and orbitofrontal cortex (OFC) were significantly activated under both predictable and unpredictable state transition rules, the dorsolateral prefrontal cortex (DLPFC) was more strongly activated under predictable than under unpredictable state transition rules. We then modeled subjects' choice behaviors using a reinforcement learning model and a Bayesian estimation framework and found that the subjects used larger temporal discount factors under predictable state transition rules. Model-based analysis of the fMRI data revealed different engagement of the striatum in reward prediction under different state transition dynamics. The ventral striatum was involved in reward prediction under both unpredictable and predictable state transition rules, whereas the dorsal striatum was dominantly involved in reward prediction under predictable rules. These results suggest different learning systems in the cortico-striatal loops depending on the dynamics of the environment: the OFC-ventral striatum loop is involved in action learning based on the present state, while the DLPFC-dorsal striatum loop is involved in action learning based on predictable future states.

Introduction

When we learn or plan an optimal action, the dynamics of the environment are a critical factor. If the environmental state transitions are totally random, the best we can do is select the action with the maximal expected immediate reward. On the other hand, if the future states given a selected action are known or learnable, we can exploit this causality to select actions that maximize long-term expected reward. Reinforcement learning theory tells us how an agent can learn an optimal behavior given the reward setting and the environmental dynamics (Sutton & Barto, 1998). Recently, reinforcement learning theory has been used successfully to explain animal and human action learning behavior and the reward-predictive activities of the midbrain dopaminergic system as well as those of the cortex and the striatum (Berns et al., 2001, Doya, 2000, Houk et al., 1995, McClure et al., 2003, O'Doherty et al., 2003, Schultz et al., 1997). However, almost all of these studies considered a non-dynamic environment or a simple delay from actions to rewards, leaving the brain mechanisms of action learning under different environmental dynamics unclear.
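
To make this distinction concrete, the following is a minimal sketch (ours, not the paper's) of planning with a known transition model via value iteration. With discount factor gamma = 0 the agent simply picks the action with the largest immediate expected reward; gamma near 1 exploits the state transition dynamics. The randomly generated transition and reward tables are purely illustrative.

```python
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
# P[s, a, s'] : transition probabilities; R[s, a] : immediate expected reward
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

def optimal_policy(gamma, n_iter=200):
    """Value iteration: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        Q = R + gamma * (P @ Q.max(axis=1))
    return Q.argmax(axis=1)

print(optimal_policy(gamma=0.0))   # greedy on immediate expected reward
print(optimal_policy(gamma=0.9))   # takes future state transitions into account
```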

Here, to explore how the predictability of environmental dynamics affects subjects' learning behaviors and brain activities, we compared brain activities while subjects performed a Markov decision task with deterministic and stochastic state transition rules. In a model-based analysis, we estimated meta-parameters and reward prediction signals from each subject's action sequence using Bayesian estimation methods. We found that subjects used larger temporal discounting parameters under predictable than under unpredictable environmental dynamics, and that different basal ganglia loops were involved in reward prediction under predictable and unpredictable environmental dynamics.
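
As a rough illustration of this model-based approach (a simplification: the paper uses a full Bayesian estimation framework, whereas this sketch does a plain maximum-likelihood grid search, and the learning rate alpha, inverse temperature beta, and data format are our assumptions), one can score candidate discount factors by how well a softmax Q-learning model predicts the observed choice sequence:

```python
import numpy as np

def choice_log_likelihood(states, actions, rewards, gamma,
                          alpha=0.1, beta=2.0, n_states=4, n_actions=2):
    """Log-likelihood of a choice sequence under softmax Q-learning (TD(0))."""
    Q = np.zeros((n_states, n_actions))
    ll = 0.0
    for t in range(len(actions) - 1):
        s, a, r, s_next = states[t], actions[t], rewards[t], states[t + 1]
        p = np.exp(beta * Q[s])
        p /= p.sum()                                 # softmax choice probabilities
        ll += np.log(p[a])
        td = r + gamma * Q[s_next].max() - Q[s, a]   # temporal-difference error
        Q[s, a] += alpha * td                        # value update
    return ll

def fit_gamma(states, actions, rewards):
    """Grid search for the discount factor that best explains the choices."""
    grid = np.linspace(0.0, 0.99, 100)
    lls = [choice_log_likelihood(states, actions, rewards, g) for g in grid]
    return grid[int(np.argmax(lls))]
```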

Section snippets

Behavioral task

In the Markov decision task (Fig. 1), markers on the corners of a square represent four states, and the subject selects one of two actions by pressing a button (a1 = left button, a2 = right button) (Fig. 1A). The action determines both the amount of reward and the movement of the marker (Fig. 1B). In the REGULAR condition, the next trial starts from the marker position at the end of the previous trial. Therefore, in order to maximize the reward acquired in the long run, the subject has to …
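
For concreteness, here is a runnable sketch of the task structure as described above. The reward table and the mapping from actions to marker movements are assumptions for illustration only; the snippet does not give the paper's actual values.

```python
import random

STATES = [0, 1, 2, 3]                 # s1..s4: the four corners of the square

# Assumed reward table REWARD[state][action]; all values are illustrative.
REWARD = [[+100, +10],                # s1
          [-10,  +20],                # s2
          [-10,  +20],                # s3
          [-10,  +20]]                # s4

def step(state, action, regular=True):
    """One trial: returns (reward, next_state) for action in {0, 1} (a1/a2).

    REGULAR condition: the marker movement is a deterministic function of
    the action (here a1 moves one corner back, a2 one corner forward; an
    assumed mapping).  RANDOM condition: the next state is drawn uniformly.
    """
    r = REWARD[state][action]
    if regular:
        next_state = (state - 1) % 4 if action == 0 else (state + 1) % 4
    else:
        next_state = random.choice(STATES)
    return r, next_state
```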

Behavioral results

Fig. 2 summarizes the learning performance of a representative single subject (solid line) and the group average (dashed line) during the fMRI measurements. All subjects successfully learned to take larger immediate rewards in the RANDOM condition (Fig. 2A). In the REGULAR condition, fourteen subjects successfully learned to take a large reward at s1 after small punishments at s2, s3, and s4 (Fig. 2B). Two subjects fell into a three-state cycle; they took a large reward at s1, a large punishment at s2 …

Possible cortical roles in action learning under predictable and unpredictable environments

In the control condition, in which the reward was always zero, the subjects' actions did not affect the reward outcomes. In contrast, in the RANDOM and REGULAR conditions, in which one of the two actions led to a larger reward at each state, the subjects performed action learning based on reward. Thus, our block-design results comparing these conditions with the control condition in the OFC, dPM, PC, and striatum may reflect several functions needed for reward-based action learning, regardless of the …

Conclusion

We demonstrated the differential involvement of cortico-basal ganglia loops in action learning under different environmental dynamics. The OFC-ventral striatum loop was involved in action learning based on immediate reward in both predictable and unpredictable environments. In contrast, in a predictable environment, the DLPFC-dorsal striatum loop was dominantly involved in action learning that takes future states into consideration. In our previous study, although we showed different involvement of …

Acknowledgements

We thank M. Kawato for helpful discussions. This research was funded by “Creating the Brain”, Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency.

References (38)

• K. Doya. Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology (2000).
• S.M. McClure et al. Temporal prediction errors in a passive learning task activate human striatum. Neuron (2003).
• J.P. O'Doherty et al. Temporal difference models and reward-related learning in the human brain. Neuron (2003).
• G.E. Alexander et al. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience (1986).
• D.J. Barraclough et al. Prefrontal cortex and decision making in a mixed-strategy game. Nature Neuroscience (2004).
• A. Bechara et al. Emotion, decision making and the orbitofrontal cortex. Cerebral Cortex (2000).
• G.S. Berns et al. Predictability modulates human brain response to reward. Journal of Neuroscience (2001).
• R. Elliott et al. Differential response patterns in the striatum and orbitofrontal cortex to financial reward in humans: A parametric functional magnetic resonance imaging study. Journal of Neuroscience (2003).
• J.W. Gnadt et al. Memory related motor planning activity in posterior parietal cortex of macaque. Experimental Brain Research (1988).
• J.C. Houk et al. A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Models of Information Processing in the Basal Ganglia (1995).
• R. Kawagoe et al. Expectation of reward modulates cognitive signals in the basal ganglia. Nature Neuroscience (1998).
• S. Lehericy et al. 3-D diffusion tensor axonal tracking shows distinct SMA and pre-SMA projections to the human striatum. Cerebral Cortex (2004).
• S. Lehericy et al. Diffusion tensor fiber tracking shows distinct corticostriatal circuits in humans. Annals of Neurology (2004).
• N.K. Logothetis et al. Neurophysiological investigation of the basis of the fMRI signal. Nature (2001).
• S.M. McClure et al. Separate neural systems value immediate and delayed monetary rewards. Science (2004).
• H. Mushiake et al. Neuronal activity in the primate premotor, supplementary, and precentral motor cortex during visually guided and internally determined sequential movements. Journal of Neurophysiology (1991).
• J. O'Doherty et al. Dissociating valence of outcome from behavioral control in human orbital and ventral prefrontal cortices. Journal of Neuroscience (2003).
• A.M. Owen et al. Planning and spatial working memory: A positron emission tomography study in humans. European Journal of Neuroscience (1996).
• G. Pagnoni et al. Activity in human ventral striatum locked to errors of reward prediction. Nature Neuroscience (2002).