Abstract
Learning occurs across multiple timescales, with fast learning crucial for adapting to sudden environmental changes, and slow learning beneficial for extracting robust knowledge from multiple events. Here, we asked if miscalibrated fast vs slow learning can lead to maladaptive decision-making in individuals with problem gambling. We recruited participants with problem gambling (PG; N = 20; 9 female and 11 male) and a recreational gambling control group without any symptoms associated with PG (N = 20; 10 female and 10 male) from the community in Los Angeles, CA. Participants performed a decision-making task involving reward-learning and loss-avoidance while being scanned with fMRI. Using computational model fitting, we found that individuals in the PG group showed evidence for an excessive dependence on slow timescales and a reduced reliance on fast timescales during learning. fMRI data implicated the putamen, an area associated with habit, and medial prefrontal cortex (PFC) in slow loss-value encoding, with significantly more robust encoding in medial PFC in the PG group compared to controls. The PG group also exhibited stronger loss prediction error encoding in the insular cortex. These findings suggest that individuals with PG have an impaired ability to adjust their predictions following losses, manifested by a stronger influence of slow value learning. This impairment could contribute to the behavioral inflexibility of problem gamblers, particularly the persistence in gambling behavior typically observed in those individuals after incurring loss outcomes.
Significance Statement
Over five million American adults are considered to experience problem gambling (PG), leading to financial and social devastation. Yet the neural basis of PG remains elusive, impeding the development of effective treatments. We apply computational modeling and neuroimaging to understand the mechanisms underlying problem gambling. In a decision-making task involving reward-learning and loss-avoidance, individuals with PG show impaired behavioral adjustment following losses. Computational model-driven analyses suggest that, while all participants relied on learning over both fast and slow timescales, individuals with PG showed an increased reliance on slow learning from losses. Neuroimaging identified the putamen, medial prefrontal cortex, and insula as key brain regions in this learning disparity. This research offers new insights into the altered neural computations underlying problem gambling.
Introduction
The world changes across various timescales (e.g., the temperature changes both daily and seasonally). Hence, humans and animals must have developed sophisticated mechanisms to learn and adapt over such timescales. Extensive research has shown behavioral and neural evidence for learning across different timescales, including at the level of the neural code (Fairhall et al., 2001; Ulanovsky et al., 2004; La Camera et al., 2006; Wark et al., 2009; Lundstrom et al., 2010; Raubenheimer et al., 2012; Murray et al., 2014; Spitmaan et al., 2020; Mahajan et al., 2021; Soltani et al., 2021; Masset et al., 2023), behavior (Kording et al., 2007; Iigaya et al., 2018, 2019; Zimmermann et al., 2018), and ecology and evolution (Lemke, 2001, 2002; Kuzawa and Thayer, 2011). While such multi-timescale learning is generally effective, improper weighting of information from different timescales can lead to maladaptive behaviors. For instance, over-reliance on slow learning in rapidly changing environments may result in maladaptive decision-making (Iigaya, 2016; Iigaya et al., 2019). This raises an intriguing question: could such miscalibrated learning be relevant to psychiatric conditions, such as problem gambling (PG), and if so, what neural mechanisms underlie this phenomenon?
PG is broadly characterized by persistent and problematic gambling behavior that often leads to significant impairment and distress, whereas gambling disorder (GD) is a clinically severe form of PG that is formally defined in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) (Hodgins et al., 2011; Potenza et al., 2019). A key behavioral symptom of PG is loss chasing, whereby individuals persist in gambling behavior ostensibly to remediate their financial position after extensive losses have already been incurred (Xuan and Shaffer, 2009; Zhang and Clark, 2020).
The underlying neurocomputational basis of the propensity to chase losses in PG is currently unclear. A number of studies have utilized behavioral economics methods to probe attitudes to loss in PG during choice behavior; one natural hypothesis is that individuals with PG have reduced loss aversion. However, collectively such studies have found varying and somewhat inconsistent evidence for behavioral differences between gamblers and non-gamblers in the degree of loss aversion during choice (Giorgetta et al., 2014; Takeuchi et al., 2016; Genauck et al., 2017; Ring et al., 2018). Another direction of research has been to test the sensitivity of individuals with GD to changing their behavior following feedback. One notable behavioral finding that aligns with persistent behavior in gamblers, especially in the face of losses, is increased perseveration on reversal learning paradigms (De Ruiter et al., 2009; Perandrés-Gómez et al., 2021). Several recent studies have utilized reinforcement-learning models to examine altered computational substrates for learning in gambling disorder, identifying a number of differences in reinforcement-learning computations, such as decreased direct exploration (Wiehler et al., 2021) and altered learning rates for both gains and losses (Suzuki et al., 2023).
The aim of the present study was to examine the role of alterations in slow and fast learning in PG. We utilized a variant of a task designed to probe reinforcement-learning in both gain and loss contexts separately (Kim et al., 2006). Specifically, in the gain condition, participants could choose between two stimuli associated with differing probabilities of winning a monetary reward outcome. Their goal in that condition was to learn to choose the stimulus associated with the highest probability of winning money. In the loss avoidance condition, participants had to choose between two stimuli associated with differing probabilities of obtaining a monetary loss. Their goal in that condition was to choose the stimulus associated with the lowest probability of losing. After a variable number of trials, the contingencies were reversed in both conditions so that the stimuli previously associated with the highest probability of winning, or the lowest probability of losing were no longer advantageous, thus necessitating the need for participants to adjust their choice of stimuli in each condition.
Utilizing a formal computational model-based approach, we differentiated between the contribution of two different forms of reinforcement learning, slow and fast learning, in guiding behavioral performance on this task. We could further separately evaluate the contribution of these two processes in learning about gains and in learning about losses. Previous studies have shown that there is a set of optimal relative weights that should be assigned across different timescales (Iigaya, 2016; Iigaya et al., 2019). We hypothesized that individuals with PG might rely excessively on slow-learning when making decisions, as an overemphasis on slow-learning processes could provide an explanation for the behavioral phenomenon of increased perseveration previously reported in such individuals. Furthermore, we aimed to test whether a reliance on increased slow-learning would be especially prominent when learning from losses, which could potentially account more specifically for the behavioral observation of increased loss chasing in gamblers. Increased reliance on slow-learning from losses could lead to the behavioral manifestation of a reduced tendency to switch away from a losing option, because changes in the expectation about the future value of that option would be updated only very slowly, even after accumulating losses.
To examine this hypothesis, we analyzed behavioral and fMRI data collected from 20 individuals who exhibited PG (by scoring two or more on the clinical scale of the DSM-IV criteria for gambling disorder), while they performed the gain/loss learning task, alongside data from 20 recreational gamblers as a well-matched comparison group (scoring zero on the DSM-IV gambling disorder scale). Participants were recruited through flyers from the Los Angeles area.
In addition to testing our computational hypothesis at the behavioral level, we also aimed to examine the extent to which neural responses related to slow learning were distinct in individuals with problem gambling, in order to help pinpoint the specific neural computational substrates associated with altered reinforcement learning in this cohort. We hypothesized that brain regions previously implicated in loss-related learning and aversion-related processing, such as the insular cortex and striatum (Seymour et al., 2005; Tom et al., 2007; Clark et al., 2008; Samanez-Larkin et al., 2008) would show alterations in the neural correlates of loss-related computations in individuals with PG.
Materials and Methods
Participants
We recruited 20 participants into our PG group and 20 participants into a recreational gambling control group. Participants were recruited from the greater Los Angeles area using a combination of posted flyers and Craigslist. A requirement for participation in the study was that participants had gambled at least once in the last six months. We recruited people from the population at large, as opposed to targeting individuals already in treatment for problem gambling, for two reasons: individuals who have already decided to seek treatment may differ from the population of problem gamblers at large, and the treatment program itself could confound our measured effects. We further aimed to specifically recruit “recreational” gamblers into our control group, that is, people who have at least some experience of gambling (and have gambled in the past six months), but who do not exhibit PG. We reasoned that these individuals would be better matched to our PG group in terms of life experience, demographics, and exposure to gambling-related stimuli and environments than individuals with no gambling experience whatsoever.
More than 400 potential participants replied to our advertisement and were subjected to an initial telephone screening to determine their eligibility. One hundred and ten passed the initial screening and were subsequently invited to come to the lab to participate in a more detailed in-person clinical interview. A trained research psychologist administered the SCID-GD under the supervision of the senior clinician involved with the study. The senior clinician or another licensed clinician then reviewed the interview and screening results. The participants assigned to the PG group were selected to have a score of 2 or more on the clinical scale in DSM-IV, while those assigned to the control group all had a score of 0. Note that data collection for this study began in 2013, prior to the official release of DSM-5; hence, we utilized the diagnostic criteria of DSM-IV throughout. However, given the participants’ characteristics, we do not expect that group assignment would have changed had the study been implemented under DSM-5 criteria. The participants in both groups were furthermore required to be free of Axis I disorders (e.g., anxiety, depression, PTSD, as well as schizophrenia) and were not currently taking any neuromodulatory medication.
We continued recruitment until we reached our target goal of 20 participants in the PG group and 20 matched controls in the recreational gambling group. None of the participants in our final sample in either group reported being previously diagnosed with a gambling disorder, and none reported seeking treatment for gambling-related problems. The recruitment and experimental administration were organized in a double-blind fashion so that the researcher responsible for running the fMRI experiment was not aware of the participant’s group status until after the data were collected. One additional participant (who would have been assigned to the PG group) was interviewed and scanned but was subsequently excluded due to an incidental finding on the structural scan.
The participants were remunerated $20–30 for participating in the interview session and $50–65 for participating in the scanning session. All participants were also given information about resources available for PG, although they were not informed of the outcome of the clinical interview. The research was approved by the Institutional Review Boards of the California Institute of Technology and the University of California, Los Angeles, and each participant gave informed consent.
The participants were well matched on demographic variables, including age, years of education, and gender. The PG group consisted of 9 females and 11 males with an average age of 37.9 (SD 12.3) and an average of 13.6 (SD 4.8) years of education. The control group consisted of 20 participants (10 female) with an average age of 36.9 (SD 11.6) and an average of 14.6 (SD 3.9) years of education. Figure 1 shows the distributions of scores on the DSM-IV scale, the Alcohol Use Disorders Identification Test (Reinert and Allen, 2007), and the Fagerstrom Test for Nicotine Dependence (Uysal et al., 2004).
Clinical scales for participants. DSM-IV (left), Alcohol Use Disorders Identification Test (AUDIT; middle), and Fagerstrom Test for Nicotine Dependence (FTND; right). The control group is shown in black and the PG group in purple.
The forms of gambling predominantly reported in our sample took place inside casinos and included either casino table games (blackjack, poker) or slot machines. Sports betting and lottery play were not commonly reported as primary forms of gambling.
Task description
The task performed by the participants was a modified version of the reward and avoidance learning task first used in Kim et al. (2006). The task involved two-alternative free choices with three intermixed conditions: a reward, a loss avoidance, and a neutral condition. On each trial, the participant is presented with one of three pairs of fractal stimuli. Each pair is specific to one of the three conditions (Fig. 2). On each trial, the participant then has to choose between the two presented fractals, each of which is associated with differing probabilities of yielding particular outcomes. In the reward condition, participants could either win $1, or else gain nothing, while in the loss avoidance condition, participants could either lose $1 or else gain nothing. The neutral condition probabilistically involves the visual feedback of a scrambled dollar bill or else no outcome, which in either case results in no change in overall winnings. Irrespective of the condition, when the trial yields no outcome, no feedback is given, and the fixation cross of the next trial is presented.
Task and behavior. A, The task. On each trial, a participant was presented with two stimuli, each of which was associated with a unique probability of outcomes. There were three trial types. On trials in the gain condition, the stimuli were associated with potential monetary gains. On trials in the loss-avoidance condition, the stimuli were associated with potential monetary losses. On trials in the neutral condition, participants received no monetary outcomes. B, Example outcome probabilities in the reward condition. The probabilities of reward associated with the two stimuli (choice A and choice B) are plotted as a function of the trial number. Choice B is initially more rewarding than choice A. After about 15 trials, the reward probabilities are reversed. C, Behavioral performance, measured as the total number of received rewards (losses) divided by the total number of trials in the gain (loss) condition, shown before and after the reversal points for the control and PG groups.
An experimental session consists of two blocks. Each block contains 90 trials, consisting of 30 trials in the gain condition, 30 trials in the loss condition, and 30 trials in the neutral condition. Each block has a unique set of stimuli, predicting a 70% or a 30% chance of yielding an outcome. After 10–20 trials (independently jittered between conditions), the outcome probabilities are reversed within each condition.
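For illustration, the reversal schedule for a single condition can be sketched as follows (a minimal sketch; the randomization shown here is an assumption made for illustration, not the original task code):

```python
import random

def make_schedule(n_trials=30, p_good=0.7, p_bad=0.3):
    """Generate one condition's outcome schedule with a single reversal,
    jittered uniformly between trials 10 and 20 as described above."""
    reversal = random.randint(10, 20)
    schedule = []
    for t in range(n_trials):
        # The 70%/30% outcome probabilities of the two stimuli swap at the reversal.
        p_a, p_b = (p_good, p_bad) if t < reversal else (p_bad, p_good)
        schedule.append({"trial": t, "p_A": p_a, "p_B": p_b})
    return reversal, schedule

reversal, schedule = make_schedule()
```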
Before the experiment, participants are informed that the three conditions exist, but not which stimuli are associated with each particular condition. They are furthermore informed that the amount they earn during the experiment will be added to a base pay of $40, so that they could potentially either win more than $40, or else come away with less than $40 depending on how much money they won or lost while performing the task. Thus, in the reward condition, participants should aim to choose the fractal associated with the greatest probability of winning $1, thereby adding to their total winnings, while in the loss condition, they should aim to choose the option associated with the lowest probability of losing $1, thereby avoiding decrementing their overall winnings.
Notes on task design: We note that the study was originally designed to perform an analysis with a standard reinforcement learning model with a single timescale. However, after data collection and preliminary analysis, we realized the potential relevance of the newly validated two-timescale learning model. This is the reason that the task design was not initially optimized to probe the current two-timescale model.
fMRI data acquisition
The fMRI data were acquired using a Siemens Tim Trio 3T scanner located at the Caltech Brain Imaging Center. For each block of trials, 185 volumes of 44 slices covering the whole brain were acquired (interleaved ascending acquisition, 30-degree slice tilt, TR = 2.78 s, TE = 30 ms, voxel size 3 mm isotropic).
Computational modeling
We used a variant of previously validated computational models that learn reward histories over multiple timescales (Corrado et al., 2005; Iigaya et al., 2019). Formally, the model assumes that the value of stimuli is computed as a sum of the values learned both quickly and slowly.
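A minimal formal sketch of this model class follows (notation ours, assuming standard leaky-integrator updates; the exact parameterization of the full model is given in Iigaya et al., 2019). The decision value of option i on trial t combines fast and slow values with a choice kernel K:

\[
V_i(t) = w_{\mathrm{fast}}\, V_i^{\mathrm{fast}}(t) + w_{\mathrm{slow}}\, V_i^{\mathrm{slow}}(t) + \eta\, K_i(t),
\]

where, after choosing option c and observing outcome r(t), each component is updated with its own time constant,

\[
V_c^{\mathrm{fast}}(t+1) = V_c^{\mathrm{fast}}(t) + \frac{1}{\chi_{\mathrm{fast}}}\bigl(r(t) - V_c^{\mathrm{fast}}(t)\bigr),\qquad
V_c^{\mathrm{slow}}(t+1) = V_c^{\mathrm{slow}}(t) + \frac{1}{\chi_{\mathrm{slow}}}\bigl(r(t) - V_c^{\mathrm{slow}}(t)\bigr),
\]
\[
K_c(t+1) = K_c(t) + \frac{1}{\nu}\bigl(1 - K_c(t)\bigr),
\]

and choices are generated by a softmax over the decision values V_i(t).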
Because we have only 30 trials per participant, fitting this full model is not possible. Therefore, we simplified the above model by assuming the time constant of the fast system to be one trial, that is, χfast = 1 and ν = 1. Note that η is normally negative (Lau and Glimcher, 2005; Fonseca et al., 2015; Iigaya et al., 2018), meaning that the choice kernel normally captures a tendency towards alternation. The combination of fast reward value and fast choice kernel produces behavior similar to win-stay, lose-switch behavior. Thus, the two-timescale model captures a combination of win-stay, lose-switch behavior and slower, reinforcement-learning-based behavior.
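A minimal simulation sketch of this simplified model (χfast = ν = 1) is given below; the parameterization follows the sketch above, and all parameter values are illustrative rather than fitted:

```python
import numpy as np

def simulate_two_timescale(outcome_probs, chi_slow=10.0, w_fast=1.0,
                           w_slow=1.0, eta=-0.5, beta=3.0, seed=None):
    """Simulate choices from the simplified two-timescale model.

    With chi_fast = nu = 1, the fast value tracks the most recent outcome
    and the choice kernel tracks the most recent choice (eta < 0 biases
    toward alternation); their combination yields win-stay/lose-switch-like
    behavior, while the slow value is a leaky integrator of outcome history
    with time constant chi_slow.
    """
    rng = np.random.default_rng(seed)
    v_fast, v_slow, kernel = np.zeros(2), np.zeros(2), np.zeros(2)
    choices = []
    for p_a, p_b in outcome_probs:
        dv = w_fast * v_fast + w_slow * v_slow + eta * kernel
        p_choose_a = 1.0 / (1.0 + np.exp(-beta * (dv[0] - dv[1])))
        c = 0 if rng.random() < p_choose_a else 1
        r = float(rng.random() < (p_a, p_b)[c])  # probabilistic outcome
        v_fast[c] = r                            # chi_fast = 1: last outcome
        v_slow[c] += (r - v_slow[c]) / chi_slow  # slow leaky update
        kernel[:] = 0.0
        kernel[c] = 1.0                          # nu = 1: last choice
        choices.append(c)
    return choices

# Example: 30 gain-condition trials with a reversal at trial 15.
choices = simulate_two_timescale([(0.3, 0.7)] * 15 + [(0.7, 0.3)] * 15, seed=0)
```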
Model fitting
In order to determine the distribution of model parameters h, we conducted a hierarchical Bayesian, random-effects analysis (Huys et al., 2011; Iigaya et al., 2016), estimating parameters for each subject. In this, the (suitably transformed) parameters h_i of experimental session i are treated as a random sample from a Gaussian distribution with means θ and variance Σ, h_i ∼ N(θ, Σ). The prior distribution parameters (θ, Σ) were set to their maximum-likelihood estimates given all participants’ data, obtained via expectation–maximization (Huys et al., 2011).
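A simplified sketch of such a fitting loop is shown below (this schematizes the approach of Huys et al., 2011 rather than reproducing our exact code; neg_log_lik stands in for the model's negative choice log-likelihood and must be supplied):

```python
import numpy as np
from scipy.optimize import minimize

def fit_hierarchical(data, neg_log_lik, n_params, n_iter=20):
    """Alternate per-subject MAP estimation under a Gaussian prior with
    updates of the group-level prior (a simplified EM scheme; the full
    procedure also propagates per-subject posterior uncertainty into
    the M-step, omitted here for brevity)."""
    theta = np.zeros(n_params)            # group-level means
    sigma2 = np.ones(n_params)            # group-level variances
    h = np.zeros((len(data), n_params))   # per-subject (MAP) parameters
    for _ in range(n_iter):
        for i, d in enumerate(data):      # E-step (approximate): MAP fits
            obj = lambda x: (neg_log_lik(x, d)
                             + 0.5 * np.sum((x - theta) ** 2 / sigma2))
            h[i] = minimize(obj, h[i], method="BFGS").x
        theta = h.mean(axis=0)            # M-step: refit the Gaussian prior
        sigma2 = h.var(axis=0) + 1e-6
    return theta, sigma2, h
```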
Model comparison
We compared models according to their integrated Bayes Information Criterion (iBIC) scores (Huys et al., 2011; Iigaya et al., 2016). We analyzed the model log-likelihood integrated over the individual-level parameters, approximated by averaging the likelihood of each subject’s data over samples drawn from the fitted prior N(θ, Σ), and penalized for the number of group-level hyperparameters |M|: iBIC = −2 log p(D|M) + |M| log |D|, where |D| is the total number of choices. A lower iBIC indicates a better account of the data.
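The integrated likelihood entering the iBIC can be approximated by sampling from the fitted prior, as sketched below (a sketch following the logic of Huys et al., 2011; log_lik is the model's per-subject log-likelihood, and the dictionary key n_trials is an illustrative convention):

```python
import numpy as np

def ibic(data, log_lik, theta, sigma2, n_samples=2000, seed=0):
    """Monte Carlo estimate of each subject's marginal likelihood under
    the fitted Gaussian prior N(theta, Sigma), penalized by the number
    of group-level hyperparameters (the means and variances)."""
    rng = np.random.default_rng(seed)
    total_log_ml, n_data = 0.0, 0
    for d in data:
        samples = rng.normal(theta, np.sqrt(sigma2),
                             size=(n_samples, len(theta)))
        lls = np.array([log_lik(s, d) for s in samples])
        m = lls.max()                                    # log-mean-exp
        total_log_ml += m + np.log(np.mean(np.exp(lls - m)))
        n_data += d["n_trials"]
    n_hyper = 2 * len(theta)
    return -2.0 * total_log_ml + n_hyper * np.log(n_data)
```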
fMRI analysis
The data were processed using SPM 12, with which we performed a standard GLM analysis. The SPM feature for asymmetrically orthogonalizing parametric regressors was disabled throughout. We first performed second-level inference pooled across all participants, and then for the control and PG groups separately. The onsets of trial, action, and outcome were modeled with stick regressors, separately for the three trial types (gain, loss, neutral). In addition, the GLM had the following parametric regressors: at trial onset, parametric modulators included the total fast value, the total slow value, and the output of the choice kernel, all separately for the three trial types. At outcome onset, parametric regressors included the prediction error computed from the fast value, the prediction error computed from the slow value, and the outcome, all separately for the three trial types. Movement regressors obtained from SPM preprocessing were also included.
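For illustration, a parametric regressor of this kind can be built by placing the model-derived value on a zero-duration stick function at each event onset and convolving with a hemodynamic response function. The sketch below uses nilearn rather than SPM, with illustrative onsets and values (our analysis itself was run in SPM 12):

```python
import numpy as np
from nilearn.glm.first_level import compute_regressor

TR, n_scans = 2.78, 185
frame_times = np.arange(n_scans) * TR

# Illustrative inputs: loss-trial onset times (s) and the model's slow
# value on each of those trials (hypothetical numbers).
onsets = np.array([10.0, 22.5, 37.0, 51.2])
slow_value = np.array([0.40, 0.10, -0.15, -0.30])
modulator = slow_value - slow_value.mean()   # mean-center the modulator

# Zero-duration (stick) events scaled by the modulator, convolved with HRF.
exp_condition = np.vstack([onsets, np.zeros_like(onsets), modulator])
regressor, names = compute_regressor(exp_condition, hrf_model="glover",
                                     frame_times=frame_times)
```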
Validation analysis for between-group comparisons and statistical analyses
Choosing ROIs on the basis of clusters of voxels identified as significant in the pooled group analysis, and then interrogating those ROIs for group differences, could in principle bias the group-difference analyses toward erroneously significant results. To ensure that this is not the case, we conducted a supplementary analysis, focusing on the slow value contrast as an example (although the simulation results should hold for all contrasts reported). We first calculated the covariance matrix between participants based on their beta values for the slow value over all grey-matter voxels. We then generated synthetic beta values (10 million voxels for each participant) across participants, modeled as a multivariate Gaussian distribution with the real data’s mean and the calculated covariance. These synthetic data preserved the original correlation patterns observed between individuals and across groups.
Next, we applied a standard group-level analysis to identify which of the synthetic voxels exhibited significant correlations (large betas) by applying t-tests with a threshold of p < 0.001. Within these significant voxels, we estimated the likelihood of observing significantly different mean beta values between the control and PG groups, with the significance threshold set at p < 0.05 by permutation test, as in our primary analysis.
We found that, across the synthetic voxels exhibiting “significant” pooled group effects (at p < 0.001), the probability of subsequently finding a significant difference between the control and PG groups in those voxels (tested at p < 0.05) was 0.036, confirming that the ROI selection did not favor a significant group difference beyond what would be expected by chance. If we instead selected the synthetic voxels that did not exhibit significant group-level effects (p > 0.2), the probability of finding a significant group difference was similar, at 0.060. This analysis demonstrates that the ROI selection does not bias the statistical inference conducted on those ROIs in favor of a significant group-difference effect.
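A compact sketch of this validation procedure is given below (scaled down, with an illustrative covariance matrix; a two-sample t-test stands in here for the permutation test used in the primary analysis):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_vox = 20, 100_000   # voxel count scaled down from 10 million

# Synthetic betas for 40 participants from a multivariate Gaussian that
# preserves between-participant covariance (illustrative values here; the
# real analysis used the empirical mean and covariance of the beta images).
cov = 0.3 * np.ones((40, 40)) + 0.7 * np.eye(40)
betas = rng.multivariate_normal(np.zeros(40), cov, size=n_vox).T  # (40, n_vox)

# Step 1: pooled one-sample t-test per voxel defines the "ROI" voxels.
_, p_pooled = stats.ttest_1samp(betas, 0.0, axis=0)
roi = p_pooled < 0.001

# Step 2: within ROI voxels, how often is the control-vs-PG comparison
# significant? Unbiased selection should keep this near the nominal 0.05.
_, p_group = stats.ttest_ind(betas[:n_per_group, roi],
                             betas[n_per_group:, roi], axis=0)
print("P(significant group difference | ROI voxel):", np.mean(p_group < 0.05))
```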
Results
The study design
After an initial phone screening, we conducted a clinical interview of 110 participants who reported engaging in regular gambling but who had no prior psychiatric diagnoses, including GD. From this initial cohort, we identified 20 individuals who had a score of 2 or more on the clinical scale in DSM-IV (referred to as the PG group) and 20 “recreational” gamblers (who had at least some recent experience of having gambled within the past 6 months, but who had none of the diagnostic criteria of gambling disorder; referred to as the control group).
During the study, participants underwent fMRI scanning while performing an adaptation of a well-established probabilistic reward and avoidance learning task (Kim et al., 2006) (Fig. 2A). On each trial, participants were presented with two alternative target stimuli on the screen and were required to make a choice between them. The stimuli were organized into three sets, with each pair belonging to one of three conditions. In the reward condition, the stimuli were associated with the chance of receiving monetary rewards. In the loss avoidance condition, the stimuli were associated with the chance of avoiding monetary losses. The neutral condition consisted of stimuli with no associated outcomes.
The outcome probabilities remained relatively stable for approximately 15 trials in each condition. Subsequently, without any warning, the outcome probabilities were reversed, challenging participants to adapt their choices based on their previous outcome experiences (Fig. 2B). Each condition underwent one reversal, and a total of 30 trials per condition were conducted across two sessions.
Behavioral results
Overall performance
We first investigated whether there were differences in overall performance on the task between the PG and control groups and between the pre- and post-reversal phases. For this, we performed an ANOVA with two factors, group (PG vs control) and reversal (before vs after), on performance, measured as the proportion of choices of the currently designated high-value stimulus, computed separately for the pre- and post-reversal phases of the task. We found no significant group effects (gain condition: F(1,76) = 0.19; loss condition: F(1,76) = 0.99), and no significant interaction between group and reversal in either the gain or loss condition (gain: F(1,76) = 0.12; loss: F(1,76) = 1.11).
We also tested a different performance measure based on the total gain or loss accumulated in the gain or loss condition, respectively (Fig. 2C). In the gain condition, we found no significant effect of group (F(1,76) = 0.01), but a significant effect of the reversal (F(1,76) = 11.51, p < 0.01). The interaction between group and reversal was not significant. In the loss condition, we found a significant group effect (F(1,76) = 4.55, p < 0.05), but a non-significant reversal effect (F(1,76) = 2.79). We found an interaction effect between group and reversal in the loss condition that was trending toward significance (F(1,76) = 3.89, p = 0.052). To follow up, we performed an additional permutation test on the group difference in performance separately for the pre and post reversal. We found that in the loss condition, the group difference after reversal was significant (permutation test; p < 0.01).
Dynamics of behavior around reversal
In order to examine behavior in a more fine-grained manner, we next implemented a trial-by-trial analysis of the dynamics of choice around the reversal. First, we looked at the probability of choosing the correct target after reversal. In the gain condition, we found evidence of maladaptive choice behavior (Fig. 3A), albeit in a manner not specific to the PG group: the choice probability in both groups did not appear to rise above chance after reversal. In the loss condition, such maladaptive behavior was not apparent (Fig. 3C).
Detailed task behavior. A, Trial-by-trial choice dynamics in the gain condition. The mean (solid line) choice probability for the control (black) and PG (purple) groups is shown across trials. The ideal choice probability is 0 before the reversal and 1 after the reversal. The shaded area indicates the SEM. B, The probability of repeating the same choice (good or bad) after receiving no reward (left) or reward (right). The probability is computed separately for the control (black; C) and PG (purple; G) groups pre- and post-reversal. The good choice is defined as the choice of the alternative associated with a higher gain probability or a smaller loss probability. C, Trial-by-trial choice dynamics in the loss condition. D, The probability of repeating the same choice after receiving a loss (left) or avoiding a loss (right). The group difference was significant for the probability of repeatedly selecting the better target after avoiding loss following reversals (p < 0.01, permutation test).
Next, we computed the probability of repeating the same choice (the probability of repeat) after different trial conditions and outcomes. We found evidence for a perseverative choice tendency in the gain condition after reversals in both groups, where participants continued selecting an option that was no longer very rewarding even after receiving no reward (Fig. 3B). The probability of repeating the choice that was currently not very rewarding was significantly higher after than before reversals for both groups (p < 0.01, permutation tests). On the other hand, we found in both groups that the probability of repeating the same choice after a loss following reversals was not greater than chance level, which does not support loss chasing (After loss in Fig. 3D).
Next, we tested for a difference between the groups in perseverative choice. In the gain condition, we found no significant group difference in perseverative choice. In the loss condition, on the other hand, this detailed analysis provided evidence for a difference between the groups. Specifically, we found that the PG group was significantly less likely than the control group to repeat the same “good choice” after avoiding a loss, that is, after selecting the target associated with a smaller loss probability following reversals (p < 0.01, permutation tests). To see if this was a transient effect around the reversal, we removed seven trials immediately after the reversal from our analysis. The effect remained significant (p < 0.01, permutation tests), suggesting that it is not transient. Seven trials were chosen based on the initial transient of behavioral adjustment after the reversal observed in Figure 3C, but we also confirmed that this finding was robust to the number of trials eliminated after reversals (from 1 to 10 trials).
We also confirmed this effect using an ANOVA with two factors, group and reversal, on perseverative choice, showing a significant effect of group (F(1,76) = 42.5; p < 0.01) and a significant interaction between group and reversal (F(1,76) = 4.35; p < 0.05). These findings suggest that, while participants did not actively chase losses, the PG group exhibited altered choice behavior on the task, in relation to a decreased probability of persisting with the correct choice after a reversal in the loss condition.
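The permutation tests used here can be sketched as follows (a generic two-group permutation test on a per-participant summary statistic, such as the repeat probability; illustrative, not our exact implementation):

```python
import numpy as np

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([group_a, group_b])
    observed = np.mean(group_a) - np.mean(group_b)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:len(group_a)].mean() - perm[len(group_a):].mean()
        count += abs(diff) >= abs(observed)
    return count / n_perm   # two-sided p-value
```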
Computational modeling analysis suggests over-weighting of slow learning relative to fast learning in PG
We then performed a computational model-driven analysis to examine whether the observed behavioral performance differences could be captured by an over-weighting of slowly learned value. We used a variant of a previously validated multi-timescale learning model of decision making, which has been validated in a similar two-alternative choice task with probabilistic reward (Sutton, 1995; Corrado et al., 2005; Iigaya et al., 2018, 2019) (see Materials and Methods). The model integrates reward history across two timescales (fast and slow) and computes the value of each choice (Fig. 4A). For simplicity, the fast timescale is set to one trial. As a result, the model captures a combination of win-stay, lose-switch behavior and behavior driven by standard incremental learning (Q-learning) of reward history. The model has critical free parameters that determine the relative impact of fast and slow learning (wfast/wslow). If this relative weight is large, the model captures a variant of win-stay, lose-switch behavior (Lee et al., 2004; Iigaya et al., 2018). If it is small, on the other hand, the model’s decisions follow the bias computed from the reward history over many trials. An intermediate value of the relative weight thus gives a combination of the two. Previous studies have shown that this relative weight can change across trials (Iigaya, 2016) and across days (Iigaya et al., 2019), and that it depends on the duration of inter-trial intervals (Iigaya et al., 2018). Here, however, we assumed the weights were fixed for each participant but different for the gain and loss conditions.
A computational model describing learning of reward history across multiple timescales captures human behavior on the task. A, Schematic of the computational model. Gain or loss history is integrated over two timescales independently to compute fast and slow values. The two values are then weighted with relative weights to compute an overall decision value for each condition separately. B, The model’s estimates of relative weights. The relative weights assigned to slowly learned value with respect to the fast value are plotted for the control and PG groups for reward trials (top) and loss trials (bottom). There is no significant difference between groups in gain trials, but individuals with PG show a significantly larger weight than controls in loss trials (p < 0.001, permutation test). C, Model simulations show that the classic reinforcement learning model does not capture the behavioral data, but the two-timescale model does. From left to right: data, one-timescale model simulation, and two-timescale model simulation. The average monetary gain received per trial is shown before and after reversals for the control (black) and PG (purple) groups. The behavioral effects shown in Figure 2C are captured by the two-timescale learning model (right) but not the standard RL model (middle).
We performed model fitting using a hierarchical Bayesian method (Iigaya et al., 2018). We found that the relative impact of fast and slow learning (wfast/wslow) was similar for the PG and control groups in the reward condition. However, the impact of slow learning relative to fast learning was significantly greater for the PG group than for the control group in the loss-avoidance condition (Fig. 4B). This result confirms our hypothesis that individuals with PG rely excessively on slow learning when learning from losses.
To establish the validity of our two-timescale model, we also fit a standard reinforcement learning model with a single learning rate to the behavioral task data across groups. Note that the two-timescale model learned the value over slow and fast timescales, whereas the one-timescale model learned the value over a single timescale. When comparing the fit of this standard model against that of the two-timescale model, the integrated-BIC score (Huys et al., 2011; Iigaya et al., 2018) was 4533 for the standard RL model and 4383 for the two-timescale model (lower scores indicate a better fit), implying that the two-timescale model significantly outperformed the standard RL model with a single timescale. Further, model simulations showed that the standard single-learning-rate model fails to capture the behavioral signature in the real human choice data, namely the group differences we found in performance after reversals on loss trials. The two-timescale model, on the other hand, captures this core behavioral feature well (Fig. 4C).
fMRI reveals neural correlates of slowly learned value in the putamen and the ACC/dmPFC
Behavioral analyses suggest that the key difference between the control and PG groups lies in the slow value computed on loss trials. The next question, therefore, is how slow value is computed in the brain, with a particular focus on the loss-avoidance condition, before investigating neural differences between groups in such value signals. For this, we performed a standard GLM analysis utilizing parametric regressors with variables from the two-timescale computational model (see Materials and Methods). For each participant, we entered parametric regressors generated using individual subject-level parameter estimates (MAP estimates) drawn from the hierarchical behavioral model fitting procedure (see Materials and Methods).
We first asked how the slow value is represented at the time of each trial onset in the loss-avoidance condition, using all participants’ data (i.e., pooling over the two groups). A whole-brain analysis with cluster-level FWE correction revealed that slow value from the computational model was correlated with activity in the left putamen (peak coordinate [−24,−2,0]) and anterior cingulate cortex (peak coordinate [−6,44,28]) (Fig. 5A).
fMRI correlates of the slowly learned value component from the two-timescale computational model. A, Clusters of voxels in the left putamen and in the ACC are significantly correlated with the model’s predicted signal (whole-brain cFWE p < 0.05 with height threshold at p < 0.001). These results come from an analysis that pooled across the two groups. B, A cluster in the left insula was significantly correlated with prediction error signals from the slow-learning component of the two-timescale model (whole-brain cFWE p < 0.05 with height threshold at p < 0.001).
In order to gain insight into how slow-value signals are computed on the basis of trial-and-error feedback (loss or no loss), we next examined how prediction error signals are represented at the time of outcome presentation. Aversive prediction error signals were generated using the model with the MAP estimates of parameters for each trial for each participant. A whole-brain analysis with cluster-level FWE correction revealed that activity in the left insula was significantly negatively correlated with the aversive-learning prediction error signal (Fig. 5B). That is, across all participants, this insular PE signal responded positively to unexpected losses and negatively to unexpected loss omissions.
We also examined fast value signals generated from the model in the loss condition within the same GLM analysis. We found no significant activity correlating with fast value at whole-brain FWE-corrected levels. However, at an uncorrected threshold, a cluster of voxels in mPFC was found to correlate with the fast value signal (see Table 1). We do not focus on prediction error signals computed on the fast timescale in the present study because they are highly correlated with static gain, non-gain, loss, and non-loss signals, respectively, and thus the present fMRI study is not optimized to uniquely identify them.
Non-significant fMRI results
Finally, we also tested for BOLD correlates of slow and fast values in the gain condition in the same GLM analysis. We found no clusters surviving whole-brain FWE correction for these contrasts. However, a cluster in the dmPFC was found to correlate with slow value in gain trials at an uncorrected threshold (Table 1). A cluster in vmPFC was found to be correlated with fast value in gain trials at an uncorrected threshold (Table 1).
fMRI reveals different neural responses in PG group compared to control group in dACC and insular cortex
We next tested for differences between the PG and control groups in the fMRI activity related to slow and fast value in the loss condition, a particular focus for our fMRI analysis given the behavioral finding that the PG group differed from the control group when learning from losses but not gains. For this, we utilized regions of interest defined on the basis of the clusters identified in the pooled group results. Note that selecting ROIs based on the pooled group results to investigate group differences is justified on the basis that such an ROI analysis identifies relevant voxels responsive to the task variables, while such an ROI selection strategy does not positively bias the likelihood of observing a significant group difference (see Materials and Methods for simulations that demonstrate this).
Specifically, to test whether correlations with slow learning were greater for the PG than for the control group, we used regions of interest (ROIs) defined from the putamen and anterior cingulate clusters identified in the pooled analysis. We found no significant group difference in the strength of the correlation with slow value in the putamen ROI, but the correlation with slow value in the ACC ROI was significantly greater in the PG group than in the control group (Fig. 6A).
Group difference in the fMRI correlates of slow-value learning in loss trials. A, The slow-value signal. There was no significant difference between groups in an ROI defined on the putamen cluster identified from the pooled analysis; however, the correlation with slow value in the ACC ROI was significantly greater in the PG group than in the control group. B, The prediction error signal. The magnitude of correlation was significantly greater for the PG group than for the control group (p < 0.05, permutation test).
Next, we tested for a difference in prediction error responses related to slow learning. Specifically, we asked whether PG or control groups showed a stronger or weaker correlation with aversive-learning prediction errors in the left insular cortex ROI defined from the cluster identified via the pooled analysis. We found that individuals with PG showed a significantly greater correlation with aversive prediction errors in loss trials compared to the control group (Fig. 6B).
For completeness, we also tested for differences in the neural responses between the groups related to fast value during the loss condition, focusing on the cluster that was identified in mPFC from the pooled analysis. No significant difference was found between groups in this cluster.
We also tested for group differences related to slow-value in the gain condition, focusing on the dmPFC cluster reported in the pooled analysis. We found no significant difference between groups in that signal.
Lastly, we tested for a difference between groups in the vmPFC cluster identified as responding to fast-value during the gain condition in the pooled analysis. The correlation with fast-value in that cluster was significantly stronger for the control group than the PG group (p < 0.05). However, the correlation with fast value was negative, meaning that trials after no reward show stronger activation than trials after reward (Table 1).
Discussion
We examined behavioral and neural differences in reinforcement-learning mechanisms pertaining to gain-seeking and loss avoidance between recreational gamblers who did not reach any of the diagnostic criteria for GD (control group) and individuals with problem gambling (PG group). Our hypothesis was that individuals with PG would exhibit stronger learning impairments from recent losses due to an increased reliance on slow learning. We found evidence supporting this hypothesis through behavioral analyses using computational modeling of multiple timescale learning.
Impaired behavioral adjustments following loss in gambling disorder have been well-documented. A notable phenomenon is loss chasing (Lesieur, 1979; Campbell-Meiklejohn et al., 2008; Zhang and Clark, 2020), where gamblers amplify their betting after losses. However, our results were not entirely consistent with loss chasing. Rather, individuals in the PG group showed a greater tendency to switch their choices after avoiding losses following contingency reversals. Our model captures this as a result of ignoring recent outcomes due to an over-reliance on slow learning.
To delve into the neural mechanisms underlying these behavioral differences, we conducted a model-based fMRI analysis. Our results revealed correlations of slow value during the loss avoidance condition in both the putamen and ACC. Previous studies in healthy volunteers have implicated the putamen in habits (Tricomi et al., 2009; McNamee et al., 2015). The present findings implicating the putamen in tracking slowly-learned value signals could align with those prior results. It has been suggested that model-free reinforcement-learning provides a computational account for habit learning (Daw et al., 2005, 2011). However, the precise relationship between model-free learning and habits has remained unclear at both behavioral and neural levels (Friedel et al., 2014; Gillan et al., 2015; Sjoerds et al., 2016).
Here, we have dissociated two distinct components of reinforcement learning: slow and fast learning. Both of these forms of learning, as described here, can be considered to be model-free. However, the present neural findings relating slow value to the posterior putamen raise the possibility that the relationship between habits and model-free learning might be more closely aligned with the slow, as opposed to the fast, model-free learning component. One complication for this interpretation of our findings is that we observed slow-value signals in the putamen in the loss-avoidance condition but not in the gain condition. One possible account for these findings is that habitization might play a stronger role early on in loss avoidance than in gain learning, thereby more rapidly engaging the putamen. Further research will be necessary to compare the emergence of habitization under appetitive and aversive learning conditions. In any event, however, putamen responses did not differ significantly between the PG and control groups, suggesting that this brain structure does not directly account for the impaired loss-learning observed in individuals with PG.
In addition to the region of posterior putamen aligned with slow-value learning, we also found slow-value learning signals in the anterior cingulate cortex. Crucially, the anterior cingulate cortex slow-value areas did show significant differences between the problem gambling and recreational gambling group, in the direction of enhanced slow-learning signals in the PG group. The ACC has been suggested to play a central role in integrating various information for the purpose of computing decision values (Kennerley et al., 2006), which in the present study could reflect the fact that the PG group exhibits an increased reliance on utilizing slowly-learned values to guide their choices.
One possible computational mechanism by which enhanced slow-value components might emerge in the first place in the PG group is via alterations in the means by which value signals are learned. In typical reinforcement-learning accounts, this learning is assumed to occur via prediction errors that signal the difference between expected and actual outcomes. Here, we examined prediction errors that are involved in slow learning. During the loss-avoidance condition, we tested for such prediction error signals in the brain, and we found a slow-learning aversive prediction error to be encoded in the left insular cortex. The insular cortex has previously been suggested to encode aversive-learning prediction errors on a similar task (Kim et al., 2006). This insular cortex signal was significantly enhanced in the PG group compared to the recreational gambling group, suggesting that individuals with PG may experience enhanced slow-learning contributions to behavior because of an enhanced engagement of slow-learning prediction error signals during the reinforcement-learning process. The identification of alterations in prediction error coding in the anterior insula in the problem gambling group aligns with previous studies that have implicated the insula in PG (Choi et al., 2012; Clark et al., 2014; Tsurumi et al., 2014; Limbrick-Oldfield et al., 2017; Clark et al., 2019; Suzuki et al., 2023). The insula has also been implicated in other addiction disorders, such as substance use disorders in general (Naqvi and Bechara, 2009), as well as in the encoding of risk more generally (Preuschoff et al., 2008). Our findings suggest a specific computational role for the insula in PG: encoding prediction errors from losses related to the slow-learning component.
In light of our behavioral findings that the individuals with PG were impaired when learning from losses, we focused the neuroimaging analysis on the loss avoidance condition. We did also find some tentative evidence for altered responses in PG during gain-learning, specifically in the ventromedial prefrontal cortex. Thus, it may be premature to conclude that individuals with PG are exclusively impaired in the loss processing domain. Indeed, numerous fMRI studies of GD have reported alterations in reward-related responses in the medial prefrontal cortex and ventral striatum (Koehler et al., 2013; Potenza, 2013; Meng et al., 2014; Fauth-Bühler et al., 2017; Clark et al., 2019). However, it seems reasonable to conclude that alterations in slow-learning have the potential to account for an important element of real-world behavior in individuals with PG, whereby they persist in gambling even after sustaining substantial monetary losses.
We found that participants in both the control and PG groups exhibited deficits in adjusting their choices after the reversal in the gain condition, manifested in their poor performance after reversal. One possible explanation for this unusual result is that the task contained separate loss and gain conditions, potentially lowering the subjective sensitivity to no-reward outcomes relative to losses (Bavard and Palminteri, 2023).
A limitation of the present study is its relatively small sample size of 20 individuals per group. To assess the robustness of the present findings, it will therefore be important to follow up with a replication study in a larger cohort. Despite this limitation, the present study has a number of significant methodological advantages over the typical approach in the GD literature. First, unlike studies that recruit patients with PG from a GD clinic, here we recruited participants with problem gambling directly from the community. While the clinic provides a convenient means of recruitment, it has several downsides. The subset of individuals with PG actively seeking treatment may be qualitatively distinct in a number of respects from those who are not, thereby introducing a potential sampling bias. A further concern is that if such patients have already been in treatment through the clinic (as would be the case for many such patients), then any assessment of their behavior and/or neural responses will be confounded by treatment effects. The individuals in our study indicated that they were not actively seeking treatment (though we did provide them with information about how to do so after they participated in our study). Second, we recruited our comparison group in the same way as the individuals with PG, and we ensured that they had all recently gambled recreationally while showing no clinical signs of the disorder. This ensured that our comparison group was well matched demographically. Consequently, in spite of our limited sample size, the present study improves on the typical approach used in the field in a number of important methodological respects, on account of the careful recruitment and design procedures we employed.
More generally, the present study provides an example of how careful computational modeling, applied to behavioral and fMRI data, can provide insight into the nature of the computations that may be altered in psychiatric disease (Montague et al., 2012; Wang and Krystal, 2014; Huys et al., 2016; Redish and Gordon, 2016). While it remains an open question how well the current study design, which involves temporally changing gain and loss probabilities, generalizes to actual gambling scenarios with fixed probabilities, studies have shown that even an optimal observer can infer changing biases from fair coin flips (Bialek, 2005), and that people perceive biases, such as hot-hand effects, even in their absence (Neiman and Loewenstein, 2011). Our model captures these biases and is thus potentially relevant to actual gambling behavior. The findings of enhanced neural responses in the anterior cingulate and anterior insula related to slow-value encoding and learning suggest the possibility that these neural responses could be precisely targeted for treatment in individuals with PG, for example, through focused transcranial magnetic stimulation (Zucchella et al., 2020).
Footnotes
We thank Ardy Rahman for implementing the recruitment and screening of participants, Jeff Cockburn, Caroline Charpentier, Vince Man, Reza Tadayon-Nejad, and Sandy Tanwisuth for valuable discussions. This work was supported by a grant from the International Center for Responsible Gaming to JOD and TF. KI is supported by the BBRF Young Investigator Grant.
Correspondence should be addressed to Kiyohito Iigaya at ki2151@columbia.edu.