Abstract
Real-world choice options have many features or attributes, whereas the reward outcome from those options only depends on a few features or attributes. It has been shown that humans combine feature-based learning with more complex conjunction-based learning to tackle the challenges of learning in naturalistic reward environments. However, it remains unclear how different learning strategies interact to determine what features or conjunctions should be attended to and control choice behavior, and how subsequent attentional modulations influence future learning and choice. To address these questions, we examined the behavior of male and female human participants during a three-dimensional learning task in which reward outcomes for different stimuli could be predicted based on a combination of an informative feature and conjunction. Using multiple approaches, we found that both choice behavior and reward probabilities estimated by participants were most accurately described by attention-modulated models that learned the predictive values of both the informative feature and the informative conjunction. Specifically, in the reinforcement learning model that best fit choice data, attention was controlled by the difference in the integrated feature and conjunction values of the two alternative options. The resulting attention weights modulated learning by increasing the learning rate on attended features and conjunctions. Critically, modulating decision-making by attention weights did not improve the fit to the data, providing little evidence for direct attentional effects on choice. These results suggest that in multidimensional environments, humans direct their attention not only to selectively process reward-predictive attributes but also to find parsimonious representations of the reward contingencies for more efficient learning.
- curse of dimensionality
- decision-making
- feature-based learning
- reinforcement learning
- selective attention
Significance Statement
From trying exotic recipes to befriending new social groups, outcomes of real-life actions depend on many factors, but how do we learn the predictive values of those factors based on feedback we receive? It has been shown that humans simplify this problem by focusing on individual features that are most predictive of the outcomes but can extend their learning strategy to include combinations of features when necessary. Here, we examined the interaction between attention and learning in a multidimensional reward environment that requires learning about individual features and their conjunctions. Using multiple approaches, we found that learning about features and conjunctions controls attention in a cooperative manner and that the ensuing attentional modulations mainly affect future learning and not decision-making.
Introduction
Every day, we face many choice options or actions that have a multitude of features or attributes. However, after making a decision, the feedback we receive is often binary (e.g., success or failure) or a simple scalar indicating how good or bad the outcome is, but not which feature(s) or attribute(s) resulted in the observed outcome. To guide decisions based on reward feedback, an organism could track reward history associated with the selection of each choice option, which can be viewed as a unique combination of multiple attributes (e.g., shape, color). However, the number of possible combinations grows exponentially as the number of attributes grows, a problem referred to as the curse of dimensionality (Barto and Mahadevan, 2003; Sutton and Barto, 2018), making this strategy unfeasible due to memory constraints and insufficient reward feedback. Fortunately, in the real world, choice options that have similar perceptual properties also have similar predictive reward values (e.g., most green fruits are unripe). Therefore, learning about features or attributes (i.e., feature-based learning) could provide an efficient strategy by avoiding learning about each choice option or action individually (Niv et al., 2015), thereby mitigating the curse of dimensionality without sacrificing much precision (Farashahi et al., 2017b).
In addition, feature-based learning allows the organism to deploy attention to process certain features more strongly during learning and/or decision-making, resulting in additional flexibility (Mackintosh, 1975; Pearce and Hall, 1980; Busemeyer and Townsend, 1993; Dayan et al., 2000; Kruschke, 2001; Krajbich et al., 2010; Wilson and Niv, 2012; Niv et al., 2015; Akaishi et al., 2016; Soltani et al., 2016; Leong et al., 2017; Farashahi et al., 2017b, 2019; Busemeyer et al., 2019; Oemisch et al., 2019). For example, attention can bias decision-making by causing predictive reward values associated with different feature dimensions to be weighted differently when they are combined. Moreover, attention can bias learning by attributing the reward outcome to certain feature dimensions, leading the organism to preferentially update the values associated with those dimensions.
Unfortunately, feature-based learning becomes imprecise when certain features are predictive of reward values only when considered in conjunction with some other features (e.g., not all red or crispy fruits are edible but red crispy fruits usually are). Feature-based learning alone will ignore these important interactions, leading to incorrect generalizations. Of course, such imprecision can be mitigated by simultaneous learning about features and conjunctions of features (O’Reilly and Rudy, 2001; Farashahi and Soltani, 2021). However, this solution can become impractical as the number of feature conjunctions also grows exponentially. Importantly, not all feature conjunctions of a choice option are equally predictive of its reward value. Therefore, organisms can achieve an appropriate performance if they also deploy attention to enhance learning about the most informative conjunctions of features and to utilize those representations more strongly when making decisions. This necessitates a generalized form of selective attention beyond the selection among individual elementary features. It remains unclear whether this form of attention affects different types of learning strategies as well as choice behavior and, if so, how these effects interact with each other.
Finally, the mechanism by which learned predictive reward values influence attention on a trial-by-trial basis is not fully understood. In general, attention could be guided by reward value in multiple ways. For example, in some studies on simple and multidimensional reinforcement learning (RL), attention has been shown to depend on the absolute difference (Hunt et al., 2014; Soltani et al., 2016; Farashahi et al., 2017b), sum (Niv et al., 2015; Soltani et al., 2021), or maximum (Anderson et al., 2011; Leong et al., 2017; Gluth et al., 2018; Daniel et al., 2020) of the feature values of the alternative options. However, these findings come from experiments using relatively simple reward schedules, where different functions on the values may yield similar attention weights. This similarity makes it challenging to precisely identify how predictive reward value influences attention. Because all of these functions can be implemented through canonical neural circuits endowed with reward-dependent plasticity (Soltani and Wang, 2006, 2010), elucidating how value is transformed into attention at the behavioral level can provide testable predictions about the underlying neural mechanisms.
In general, an individual's abilities to focus on specific features (feature-based selective attention) and to understand how combinations of features interact to produce reward outcomes (configural learning) are crucial components of representation learning. This learning involves identifying a compact yet task-relevant representation of choice options that enables efficient learning by pinpointing which features are informative about reward values and how different feature dimensions interact (Niv, 2019; Radulescu et al., 2019a, 2021). Although previous studies have investigated different learning strategies separately, the mechanisms of their interaction, and how this interaction controls behavior, remain unknown. Understanding these processes provides an important step toward uncovering more robust solutions to the curse of dimensionality in complex, naturalistic environments.
Here, we used multiple methods including various RL models to examine the interaction between learning, choice behavior, and attention in a three-dimensional reward learning task in humans. In previous studies, attention has been directly measured using methods such as eye-tracking (Krajbich et al., 2010; Leong et al., 2017) or has been indirectly estimated as a latent variable (attention weights) within a formal mathematical model. These attention weights then influence other processes within the model to modulate behavior (Busemeyer and Townsend, 1993; Kruschke, 2001; Soltani et al., 2016; Farashahi et al., 2017b; Busemeyer et al., 2019). In the current study, we took the latter approach and used the fit of choice data with RL models to infer attentional effects via attention weights. We aimed to answer the following questions. First, how do simple feature-based learning and more complex conjunction-based learning interact to control choice behavior? Second, how do these learning strategies collectively determine or control attention (cooperatively or competitively)? Third, how and where do attentional modulations exert their influence: on choice, learning, or both? Answers to these questions provide testable predictions about neural mechanisms by which attention shapes representation learning in multidimensional reward environments.
Materials and Methods
Participants
In total, 92 healthy participants (N = 66 females) were recruited from the Dartmouth College student population (ages 18–22 years). Participants were recruited through the Department of Psychological and Brain Sciences experiment scheduling system. They were compensated with money and T-points, which are extra-credit points for classes within the Department of Psychological and Brain Sciences at Dartmouth College. All participants were compensated at $10/hour or 1 T-point/hour. They could receive an additional amount of monetary reward for their performance of up to $10/hour. Similar to a previous study based on the same dataset (Farashahi and Soltani, 2021), we excluded participants whose performances (proportion of trials in which the more rewarding option is chosen) after the initial 32 trials were <0.53. We also excluded an additional participant who failed to provide any reward probability estimates in three out of five bouts. These criteria resulted in the exclusion of 25 participants in total. All experimental procedures were approved by the Dartmouth College Institutional Review Board and informed consent was obtained from all participants before the experiment.
Experimental paradigm
The multidimensional reward learning task involved learning about the predictive reward values (reward probabilities) associated with 27 abstract, visual stimuli through reward feedback and consisted of choice and estimation trials (Fig. 1). Stimuli consisted of three feature dimensions (color, shape, and texture) where each feature dimension had three possible values (three colors, three shapes, and three textures), leading to 27 stimuli (objects) in total. During the choice trials, the participants were presented with two stimuli that had distinct features in all three dimensions, and they were asked to choose between them to obtain a reward. Reward feedback—whether a reward point was earned or not—was provided randomly after each choice, with a probability determined by the reward schedule. During choice trials, the order of the stimuli's presentation was pseudorandomized such that all pairs that were distinct in all three feature dimensions were presented four times in different spatial layouts. During the estimation trials, participants were presented with each of the 27 stimuli in a random order and were asked to estimate the probability that the selection of each stimulus would lead to a reward. There were 432 choice trials in total. The estimation trials were interspersed in five bouts that appeared after choice trial numbers 86, 173, 259, 346, and 432.
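To make the structure of the task concrete, the following Python sketch (illustrative only; feature names and orderings are assumptions) enumerates the 27 stimuli and the choice pairs that differ in all three feature dimensions; presenting each such pair four times reproduces the 432 choice trials described above.

```python
from itertools import product, combinations

# The 27 stimuli as (color, shape, texture) index triplets, three values each.
stimuli = list(product(range(3), repeat=3))
assert len(stimuli) == 27

# Choice pairs must differ in all three feature dimensions.
pairs = [(a, b) for a, b in combinations(stimuli, 2)
         if all(a[d] != b[d] for d in range(3))]

print(len(pairs))       # 108 valid pairs
print(len(pairs) * 4)   # 432 choice trials when each pair is shown four times
```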
The reward schedule
An algorithm analogous to the Naive Bayes algorithm can be used to approximate the predictive reward values associated with each stimulus using the average reward probabilities associated with its features, as detailed below (Murphy, 2012; Hunt et al., 2014; Farashahi et al., 2017b), under the simplifying assumption that individual features are conditionally independent given the reward outcome. When each stimulus/object has m features and there are n instances of each feature, the average reward probability for a given instance of a feature can be obtained by marginalizing over the other feature dimensions, that is, by averaging the reward probabilities of the n^(m−1) stimuli that contain that feature instance.
A similar method can be used to estimate stimulus values based on a mixture of “feature” and “conjunction” values, for example, by combining the marginal reward probability associated with one feature of a stimulus with the marginal reward probability associated with the conjunction of its other two features.
Different feature-based and mixed feature-based and conjunction-based strategies can lead to different amounts of approximation error, which we define as the average Kullback–Leibler (KL) divergence between the reward probabilities estimated under a given strategy and the actual reward probabilities in the reward schedule.
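As an illustration of this procedure, the Python sketch below computes feature and conjunction marginals from a hypothetical reward schedule, forms a simple mixed estimate, and measures the approximation error as an average Bernoulli KL divergence. The equal-weight mixing rule, the direction of the KL divergence, and all variable names are illustrative assumptions rather than the exact formulation used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical reward schedule: one reward probability per stimulus, indexed
# by (color, shape, texture), each dimension taking three values.
p_reward = rng.uniform(0.1, 0.9, size=(3, 3, 3))

def feature_values(p, dim):
    """Average reward probability of each instance of one feature, obtained by
    marginalizing (averaging) over the other two dimensions."""
    other = tuple(d for d in range(3) if d != dim)
    return p.mean(axis=other)                      # shape (3,)

def conjunction_values(p, dims):
    """Average reward probability of each instance of a two-feature conjunction."""
    other = tuple(d for d in range(3) if d not in dims)
    return p.mean(axis=other)                      # shape (3, 3)

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli distributions with parameters p and q."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Illustrative mixed estimate: equal-weight average of one feature's value
# (here, color) and the conjunction of the other two features (shape x texture).
f = feature_values(p_reward, dim=0)
c = conjunction_values(p_reward, dims=(1, 2))
estimate = 0.5 * f[:, None, None] + 0.5 * c[None, :, :]

approx_error = bernoulli_kl(p_reward, estimate).mean()
print(f"average KL approximation error: {approx_error:.4f}")
```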
Computational models of value-driven attention
To capture participants’ trial-by-trial learning and choice behavior, we fit various RL models that estimated different types of predictive reward values (i.e., probability of reward associated with features, conjunctions, or stimuli) and included different types of attentional modulation during learning and/or decision-making. Here, attention is defined as a set of normalized weights that multiplicatively modulate different components of the RL model. These weights were indirectly inferred by fitting computational models to choice data (Busemeyer and Townsend, 1993; Busemeyer et al., 2019) instead of being directly measured with methods such as eye-tracking (Krajbich et al., 2010; Leong et al., 2017).
Briefly, to make decisions, an RL model with no attention calculates the value of each stimulus/object by taking a simple average of the values of its constituent features and/or conjunctions, without applying any weighting, on each trial (see below). To learn from the subsequent reward feedback, the model then updates the values of different features and conjunctions with the same learning rates. When attention modulates choice, it means that the values of different features and/or conjunctions are weighted differently in the calculation of each object's value (Eq. 6). When attention modulates learning, it means that the values of different features and conjunctions are updated using different learning rates (Eq. 4). When there is attentional modulation of both choice and learning, it means that both processes happen. Furthermore, these attention weights are themselves dependent on the values learned by RL (Eqs. 7, 8), allowing them to be adjusted on a trial-by-trial basis according to reward feedback. Mechanistically, these modulations could be achieved by modulating the gain of the stimulus-encoding units before a choice was made and/or after the reward was received.
To test how learned predictive reward values drive attention, we compared three possible relationships between the predictive reward values of the two presented choice options (stimuli) on a given trial and attentional modulation, in addition to constant and uniform attention independent of value (denoted “const”): attention based on summed values (denoted “sum”), attention based on the absolute difference in values (denoted “diff”), and attention based on the maximum value (denoted “max”). In addition, we assumed that attention could modulate choice, learning, or both, which led to a total of 10 variations of attentional mechanisms. A suffix of “X none” in model names indicates no attentional modulation, a suffix of “X C” indicates attention at choice only, a suffix of “X L” indicates attention at learning only, and a suffix of “X CL” indicates attention at both choice and learning. We also considered five different “learning strategies”: (1)
F, feature-based learning; (2)
We based all our learning models on the RL models with decay, which have been shown to capture behavior in similar tasks successfully (Niv et al., 2015; Farashahi et al., 2017b; Farashahi and Soltani, 2021). This means that while the values associated with features, conjunctions, and/or the identity of the chosen option were updated after each reward feedback, all other values decayed toward a baseline (see below). Specifically, the values associated with the features and conjunctions of the chosen stimulus/object, or the chosen stimulus/object itself, were updated toward the experienced reward outcome according to their learning rates (Eq. 4), whereas all other values decayed toward a baseline value (Eq. 5).
The log-odds of choosing the stimulus on the left was given by the weighted sum of the feature, conjunction, and/or object values of that stimulus relative to those of the alternative stimulus.
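To make these computations concrete, the following Python sketch implements one trial of a simplified feature- and conjunction-based learner with value-difference (“diff”) attention that modulates learning only. The parameter names (alpha, decay, beta, gamma, omega), the softmax normalization of attention, and the exact functional forms are illustrative assumptions and may differ from the fitted models.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def run_trial(V, left, right, reward_fn, alpha=0.1, decay=0.02, baseline=0.5,
              beta=5.0, gamma=50.0, omega=0.5, rng=np.random.default_rng(0)):
    """One trial of an illustrative feature + conjunction RL model in which
    value-difference attention modulates learning only (a "diff X L" variant).

    V:           dict mapping dimension name -> array of values (e.g., V['color']
                 has 3 entries, V['shape_texture'] has 9).
    left, right: dicts mapping dimension name -> instance index for each option.
    """
    dims = list(V.keys())
    features = ('color', 'shape', 'texture')

    # Attention: normalized over dimensions, driven by the absolute difference
    # between the two options' values along each dimension.
    diff = np.array([abs(V[d][left[d]] - V[d][right[d]]) for d in dims])
    attention = softmax(gamma * diff)

    # Decision: simple (unweighted) averages of feature and conjunction values,
    # combined with relative weight omega; attention does not enter here.
    def option_value(opt):
        feat = np.mean([V[d][opt[d]] for d in dims if d in features])
        conj = np.mean([V[d][opt[d]] for d in dims if d not in features])
        return omega * feat + (1 - omega) * conj

    p_left = 1.0 / (1.0 + np.exp(-beta * (option_value(left) - option_value(right))))
    choice = left if rng.random() < p_left else right
    reward = reward_fn(choice)   # 0 or 1 from a (hypothetical) reward schedule

    # Learning: chosen values move toward the outcome with attention-scaled
    # learning rates; all other values decay toward the baseline.
    for a, d in zip(attention, dims):
        for k in range(len(V[d])):
            if k == choice[d]:
                V[d][k] += a * alpha * (reward - V[d][k])
            else:
                V[d][k] += decay * (baseline - V[d][k])
    return choice, reward, attention

# Minimal usage with one feature dimension and one conjunction dimension:
values = {'color': np.full(3, 0.5), 'shape_texture': np.full(9, 0.5)}
out = run_trial(values, {'color': 0, 'shape_texture': 4},
                {'color': 2, 'shape_texture': 7}, reward_fn=lambda s: 1)
```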
To show that having dynamic attention weights improves the fit of RL models, we also performed model comparison between the best-fitting model in the current study and the best-fitting models from the previous study that used the same dataset (Farashahi and Soltani, 2021). In each of the previous mixed feature- and conjunction-based learning models, the values of a specific feature and the conjunction of the other two features were learned. The decay process (Eq. 5) was shown to significantly improve the model's fit; therefore, it was also included here. Finally, models in the previous study assumed separate learning rates and decision weights for feature and conjunction values. To ensure a fair comparison, we modified these models to have separate decision weights only (i.e., the ω parameter) to maintain consistency with other models tested.
Model fitting and model selection
The RL models were fit using the Bayesian Adaptive Direct Search optimization algorithm (Acerbi and Ma, 2017). Forty random initial optimization points were sampled to avoid local optima. For each model with attention, one set of the initial parameter values was chosen as the best parameters for the base model with the same learning strategy but without attention. Models were compared using the Akaike Information Criterion (AIC) and random-effect Bayesian Model Selection (BMS) based on the Bayesian Information Criterion (BIC) as an approximation for model evidence (Stephan et al., 2009; Rigoux et al., 2014). We report the posterior model probability, which is the posterior estimate of a model's frequency, as well as the protected exceedance probability (pxp), which is the probability that a given model occurs more frequently than all other models. Unless otherwise specified, the Bayesian omnibus risk, which measures the probability that the observed differences in model frequencies are due to chance, was <0.001 for all our model comparisons. This suggests that there were significant discrepancies in different models’ ability to account for the behavioral data.
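For reference, the information criteria mentioned above can be computed directly from a model's maximized log-likelihood; the numbers in the sketch below are made-up examples, not fits to the actual data.

```python
import numpy as np

def aic_bic(neg_log_likelihood, n_params, n_trials):
    """AIC and BIC from a model's maximized (negative) log-likelihood."""
    aic = 2.0 * n_params + 2.0 * neg_log_likelihood
    bic = n_params * np.log(n_trials) + 2.0 * neg_log_likelihood
    return aic, bic

# Hypothetical fits of one participant's 432 choice trials:
print(aic_bic(neg_log_likelihood=250.0, n_params=6, n_trials=432))   # simpler model
print(aic_bic(neg_log_likelihood=245.0, n_params=9, n_trials=432))   # richer model
```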
For all mixed-effect modeling, we fit a random slope and intercept for each participant. If the model did not converge, we incrementally simplified the random-effect structure by first removing correlation between random effects, followed by the random slopes. If the model failed to converge even with only random intercepts, an ordinary linear regression model was used.
Model-based analysis of choice behavior
Using the estimated model parameters from fitting the choice data with the best-fitting model, we were able to simulate our models and examine the latent variables (Wilson and Collins, 2019), including the inferred subjective values and attention weights. We computed the entropy of the attention weights, defined as H = −Σi ai log ai, where ai is the attention weight for dimension i.
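A minimal sketch of the entropy and the Jensen–Shannon divergence (JSD) between attention weights on consecutive trials is given below, assuming natural-log units and a small epsilon to avoid log(0).

```python
import numpy as np

def entropy(w, eps=1e-12):
    """Shannon entropy (in nats) of a normalized attention-weight vector."""
    w = np.clip(np.asarray(w, dtype=float), eps, 1.0)
    return float(-np.sum(w * np.log(w)))

def js_divergence(w1, w2, eps=1e-12):
    """Jensen-Shannon divergence between attention weights on consecutive trials."""
    w1 = np.clip(np.asarray(w1, dtype=float), eps, 1.0)
    w2 = np.clip(np.asarray(w2, dtype=float), eps, 1.0)
    m = 0.5 * (w1 + w2)
    kl = lambda p, q: float(np.sum(p * np.log(p / q)))
    return 0.5 * kl(w1, m) + 0.5 * kl(w2, m)

# Focused attention -> low entropy; a switch between trials -> high JSD.
print(entropy([0.9, 0.05, 0.05]))                            # ~0.39
print(js_divergence([0.9, 0.05, 0.05], [0.05, 0.9, 0.05]))   # ~0.46
```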
Model-free analysis of choice behavior and value estimation
To identify biases in reward credit assignment and choice, we fit generalized linear mixed-effect models to predict choice based on either reward and choice history or based on true reward values at different time points (early and late) in the experiment, as described below.
First, to examine bias in reward credit assignment, we generalized the win–stay, lose–switch analysis commonly used to study learning in simple environments (Lau and Glimcher, 2005; Noonan et al., 2010, 2017; Walton et al., 2010; Katahira, 2018; Moran et al., 2019) to our multidimensional environment. More specifically, we used a generalized linear mixed model (GLMM) in which the current choice was predicted, for each feature and conjunction dimension, by whether each option shared that dimension’s instance with the previously chosen stimulus (choice history) and by the interaction of this indicator with the previous reward outcome (reward history); a simplified sketch of the regressor construction is given below.
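The sketch below illustrates one way to construct such regressors; it uses an ordinary logistic regression for brevity, whereas the actual analysis used mixed-effects models with participant-level random effects, and the exact coding of the predictors may differ.

```python
import numpy as np
import statsmodels.api as sm

def shares(option, prev_choice, dims):
    """1 if `option` matches the previously chosen stimulus on all given dims."""
    return int(all(option[d] == prev_choice[d] for d in dims))

def build_regressors(trials, dimensions):
    """Win-stay/lose-switch style regressors for a multidimensional task.

    trials: list of dicts with keys 'left', 'right' (feature-index tuples),
            'choice' (the chosen tuple), and 'reward' (0 or 1).
    dimensions: list of dimension index tuples, e.g., [(0,), (1,), (2,),
                (0, 1), (0, 2), (1, 2)] for the three features and conjunctions.
    """
    X, y = [], []
    for prev, cur in zip(trials[:-1], trials[1:]):
        row = []
        for dims in dimensions:
            # +1 if the left option shares this dimension with the previous
            # choice, -1 if the right option does, 0 otherwise.
            s = shares(cur['left'], prev['choice'], dims) \
                - shares(cur['right'], prev['choice'], dims)
            row.append(s * (1 if prev['reward'] else -1))   # reward-history term
            row.append(s)                                   # choice-history term
        X.append(row)
        y.append(int(cur['choice'] == cur['left']))         # 1 = chose left
    return np.array(X), np.array(y)

# X, y = build_regressors(trials, [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)])
# fit = sm.Logit(y, sm.add_constant(X)).fit()   # simplified, no random effects
```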
We used the first 150 trials to study reward credit assignment because participants’ performances reached their steady state at that point and because participants’ sensitivity to reward history should decrease with more learning. Although the GLMMs cannot separate the influence of attentional modulations at the time of choice and learning (Katahira, 2018), they can nonetheless detect biases in choice behavior. Importantly, attention would lead to some coefficients being higher than the other ones, indicating heightened sensitivity to the reward history associated with some feature(s) and/or conjunction(s).
Second, to examine the participants’ choice strategy at the end of the experiment (last 150 trials), we fit GLMMs that used the values associated with the informative feature and the informative conjunction to predict participants’ choice. If the participants employed a feature- and conjunction-based learning strategy, the values associated with both the informative feature and informative conjunction should significantly predict choice.
Finally, to examine participants’ value representations more directly, we also fit their estimations of the reward probability associated with each stimulus/object with GLMMs that used predictive reward values along different feature or conjunction dimensions as the independent variables. Both the reward probability estimates and the marginal probabilities were transformed through a logit transformation, logit(p) = log(p/(1 − p)), before fitting.
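As a simple illustration, the logit transform and the subsequent regression could look like the following sketch (ordinary least squares shown instead of the mixed-effects fit; `est` and `marg` are placeholder variable names):

```python
import numpy as np
import statsmodels.api as sm

def logit(p, eps=1e-6):
    """Logit transform applied to both the probability estimates and the
    marginal probabilities before regression (clipped away from 0 and 1)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

print(logit([0.25, 0.5, 0.8]))   # ~[-1.10, 0.00, 1.39]

# est:  a participant's estimated reward probability for each of the 27 stimuli
# marg: matrix whose columns hold each stimulus's marginal reward probability
#       along the different feature/conjunction dimensions (reward schedule)
# fit = sm.OLS(logit(est), sm.add_constant(logit(marg))).fit()
```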
Model recovery and model validation
Due to the large number of models we used to fit the data, a comprehensive model recovery was intractable. Instead, we performed a model recovery of the best-fitting model against other models that shared the same learning strategy or the same attentional mechanism. To perform model recovery, we used the same reward schedule and sampled 100 stimulus sequences using the same rule as for participants in the experiment. For each model type, we sampled parameter sets uniformly or log-uniformly from a plausible range of values. Consistent with the experimental data, we retained only those parameter sets that resulted in a choice sequence exceeding our performance exclusion threshold. We also excluded parameter sets that produced highly random choice sequences, specifically those where the average likelihood per trial was <0.525, given the sampled parameters. Based on these parameters, we simulated choice sequences using the actual sequences of stimuli observed by each participant. We then fit all models and compared them using random-effect BMS. We also verified parameter recovery by reporting Spearman's rank correlation between the ground-truth parameters and the estimated parameters.
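A schematic of this recovery loop is sketched below; `simulate` and `fit` stand in for the actual simulation and fitting routines, and the parameter names and ranges are illustrative, while the exclusion thresholds mirror those described above.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

def recover(simulate, fit, n_sets=100):
    """Sample ground-truth parameters, simulate choices, refit, and report
    Spearman correlations between true and recovered parameters.
    `simulate(theta)` and `fit(data)` are placeholders for the actual routines."""
    true_sets, recovered_sets = [], []
    while len(true_sets) < n_sets:
        theta = {'alpha': rng.uniform(0.01, 1.0),                       # uniform
                 'beta': np.exp(rng.uniform(np.log(0.5), np.log(50)))}  # log-uniform
        data = simulate(theta)
        # Apply the same exclusion criteria as for participants and discard
        # parameter sets that produce near-random choices.
        if data['performance'] < 0.53 or data['mean_likelihood'] < 0.525:
            continue
        true_sets.append(theta)
        recovered_sets.append(fit(data))
    for name in true_sets[0]:
        rho, p = spearmanr([t[name] for t in true_sets],
                           [r[name] for r in recovered_sets])
        print(f"{name}: Spearman rho = {rho:.2f}, p = {p:.3g}")
```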
Finally, to qualitatively validate our winning model, we used the estimated parameters based on the best-fitting model to simulate choice sequences and trial-by-trial attention weights. As multiple studies have pointed out (Palminteri et al., 2017; Wilson and Collins, 2019), even the quantitatively best-fitting model may not qualitatively capture important aspects of empirical behavior. When fitting decision-making and learning models on a trial-by-trial basis, it is common practice to use the sequence of choices made by participants. When these models were then used to generate their own choices, a mis-specified model could generate a sequence of choices that differ significantly from empirical ones. This necessitates the so-called model validation procedure, where we simulated sequences of choices generated by the quantitatively best-fitting models and compared them against empirical choices. We compared the simulated performance against participants’ empirical performances. We also compared the model's attention weights when it was simulated using the sequences of choices made by participants and the sequences of choices generated by the model itself. We conducted these analyses on 3,350 simulated experiment sessions, after generating 50 simulated sessions for each participant using their best-fitting parameters. Finally, we also performed the same model-free analysis of credit assignment on simulated behavior and compared them with results obtained from empirical choice behavior. This analysis involved 670 simulated sessions, where 10 sessions were generated for each participant using their best-fitting parameters.
Results
To investigate the effects of attention on learning and decision-making in high-dimensional environments, we reanalyzed choice behavior in a multidimensional probabilistic learning task (Farashahi and Soltani, 2021) in which human participants selected between pairs of visual stimuli (objects), each defined by three visual features (color, pattern, and shape). The three visual features resulted in three types of conjunctions between two features: shape and color, pattern and color, and shape and pattern. Since each feature could take three values, each conjunction could take nine values. This produced a total of 27 distinct stimuli/objects to learn about. Each choice was followed by binary reward feedback with the probability of harvesting a reward determined by the reward schedule. Participants learned about these reward probabilities through trial and error (see Materials and Methods for more details). We also asked participants to provide their estimations of reward probabilities for each stimulus at five evenly spaced time points throughout the experiment (Fig. 1A).
The experimental design and overall performance of participants. A, The timeline of a choice trial consisting of fixation, stimulus presentation, choice presentation, and reward feedback. The inset shows layout of an example probability estimation trial. Matrices depict reward probabilities associated with 27 stimuli, each identified by three visual features, as shown in the inset. B, The average learning curve across all participants. The black and gray curves show the average reward received and the proportion of trials the participants chose the better option, respectively. Both curves were smoothed by a moving average filter over 20 trials. The shaded area indicates the SEM. Performance reached the steady state at ∼150 trials. Arrows on the x-axis indicate the choice trials after which a bout of value estimation trials was administered. C, The plot shows the informativeness of each dimension as measured by the proportion of variance in the reward schedule explained by each feature, conjunction, and the stimulus/object identity. By design, the informative feature and the informative conjunction explain a larger amount of variance in the reward schedule than noninformative ones. Additional variance can only be explained by the stimulus (object) identity. D, The average predictive reward value of individual stimuli ordered by each feature dimension and contained feature. The height of the bars shows the mean values with error bars indicating standard deviation, and circles show the exact stimulus values with a small jitter for clarity. The predictive reward values of stimuli that share each of the two noninformative features varied from each other due to the design of the reward schedule even though these values were close to 50% on average. E, The error in estimated probabilities (approximation error) based on different learning strategies. Learning about the informative feature and informative conjunction provided the smallest error, whereas learning about the noninformative features and conjunctions leads to a similar error as learning about the informative feature alone (compare leftmost point on the blue curve and rightmost point on the red and yellow curves).
Critically, the reward schedule was designed such that one informative feature and the conjunction of two other noninformative features (i.e., the informative conjunction) explained a large proportion of variance in the reward schedule (see Materials and Methods for more details), whereas the remaining proportion of the variance was specific to each object (Fig. 1C). Therefore, by design, participants could learn the reward schedule with reasonable, though not perfect, accuracy by correctly identifying the informative feature and informative conjunction and learning their predictive values. Nonetheless, reward probabilities associated with the 27 stimuli greatly varied according to the presence of the informative or noninformative features (Fig. 1D), making it a nontrivial task to find the informative feature, unlike previous experiments (Niv et al., 2015; Leong et al., 2017; Oemisch et al., 2019). By learning about and combining the predictive values of the informative feature and conjunction, participants could achieve a good approximation to the actual reward probabilities associated with each stimulus. In contrast, learning about the noninformative feature and conjunction dimensions leads to inaccurate approximation while still allowing performance that was better than chance (Fig. 1E; see also Materials and Methods, Experimental paradigm).
Overall, we found that participants’ performance became significantly better than chance (0.5) soon after the beginning of the experiment and reached steady state after ∼150 trials (Fig. 1B). By examining how well different RL models account for participants’ choice behavior, a previous study verified that participants learned about both the informative feature and the conjunction (Farashahi and Soltani, 2021). However, this study did not investigate how participants arrived at this learning strategy, whether some participants deviated from this strategy in a systematic way, and the role of attention in learning. Here, using a combination of model-free analyses and fitting choice data with more complex RL models, we characterized how participants’ choice behavior was affected by different attentional strategies, how these attentional strategies interacted with value learning and decision-making, and how these interactions affected the overall performance.
Learning in multidimensional environments is guided by the informative feature and conjunction
To determine how participants adjusted their choices based on reward and choice history early in the experiment, we applied a mixed-effect logistic regression to the first 150 trials of the experiment (see Materials and Methods for more details). Participants’ performance gradually increased in the first 150 trials, where we hypothesized that their choices were adjusted according to reward and choice history due to ongoing learning. Indeed, we found that during this period, participants’ choices could be significantly predicted by the reward outcome from the previous trial associated with the informative feature (b = 0.19; SE = 0.04; t(9970) = 4.27; p < 0.001; Fig. 2A) and informative conjunction (b = 0.17; SE = 0.08; t(9970) = 2.15; p = 0.03; Fig. 2A). This means that whenever one of the two choice options (stimuli) in the current trial shared the same informative feature or informative conjunction as the chosen stimulus from the previous trial, participants were more likely to choose it if the previous choice was rewarded and avoid it if the previous choice was not rewarded (i.e., win–stay and lose–switch based on feature). In contrast, there was no evidence that participants adjusted their choices according to feedback based on the noninformative features or noninformative conjunctions (p > 0.05). These results provide evidence that early in the experiment, participants selectively associated rewards with the informative feature and conjunction, but not the noninformative features (including the ones that are constituents of the informative conjunction) or the noninformative conjunctions.
Model-free analysis of reward credit assignment and decision-making at the beginning and end of the experiment. A, Analysis of choice behavior during the first 150 trials of the experiment. The plot shows the regression weights from the mixed-effect logistic regression model to predict choice using features of choice and reward outcomes in the previous trial. Early choices can be significantly predicted by reward outcome associated with the informative feature and informative conjunction, but not noninformative features or conjunctions. Choices can also be significantly predicted by the informative feature and one of the noninformative features of previous choice, but not other variables. One, two, and three asterisks indicate p < 0.05, p < 0.01, and p < 0.001, respectively. B, Analysis of choice during the last 150 trials of the experiment. The plot shows the fit of late choice data using the informative feature (blue) and conjunction (purple) as indicated in the legend. The inset shows the regression weights from the mixed-effect logistic regression analysis of choice, using ground-truth reward probabilities associated with the informative feature and conjunction. Choices later in the experiment were strongly informed by the values of the informative feature and informative conjunction. See Extended Data Figure 2-1 and Extended Data Table 2-1 for results of additional analyses. See Extended Data Figure 2-2 for similar analysis of the simulated behavior using the best-fitting model.
Figure 2-1
Results of additional regression analyses, demonstrating gradual learning and persistent attentional biases. (A) Results of the logistic regression model of the effect of reward and choice history (same as Fig. 2A) applied to the last 150 trials of the experiment. There was an overall higher level of choice repetition due to more learning. There was evidence for selective learning about the informative feature, informative conjunction, as well as the non-informative feature 1 and non-informative conjunction 1. (B) Results of the mixed-effects logistic regression based on true values (similar to Fig. 2B) applied to all choice trials, in blocks of 108 trials. The dark colored lines show the fixed-effects, and the lighter color lines show the participant-specific random-effects. The fixed-effect slopes for the value of both the informative feature and conjunction increased over time. The participant-specific slopes for almost all participants also increased at least over parts of the experiment. The fixed-effect slope for the object/stimulus value was not significant in any block. Significance levels are indicated as follows: + for p < 0.1, * for p < 0.05, ** for p < 0.01, *** for p < 0.001. Download Figure 2-1, TIF file.
Table 2-1
Results of fitting choice behavior using logistic regression with actual reward values as predictors, separately for each of four blocks of 108 trials. Download Table 2-1, DOCX file.
Figure 2-2
Model-free analysis of the simulated behavior based on the best-fitting model, separately for early and late stages of the experiment. (A–B) Plots show the regression weights from the mixed effects logistic regression model to predict simulated choice using features of choice and reward outcomes in the previous trial, separately for the first 150 (A) and the last 150 (B) trials of the experiment. Early choices can be significantly predicted by reward outcome associated with the informative feature and informative conjunction. Late choices were also influenced by reward associated with one of the non-informative conjunctions. Additionally, the features of the previous choice could significantly predict choices, with this effect strengthening later in the experiment. Significance levels are indicated as follows: + for p < 0.1, * for p < 0.05, ** for p < 0.01, *** for p < 0.001. Download Figure 2-2, TIF file.
In addition, we found that participants’ choices were significantly influenced by the informative feature (b = 0.15; SE = 0.05; t(9970) = 2.78; p = 0.005) and noninformative Feature 1 (b = 0.18; SE = 0.05; t(9970) = 3.57; p < 0.001) of the stimulus selected on the previous trial. This means that participants were more likely to repeat choosing an option that shared either the informative feature or the noninformative Feature 1 with the previously chosen option, regardless of the previous reward outcome. Such choice autocorrelation could be purely due to choice bias (i.e., a tendency to choose options that share a feature or conjunction with a past choice), or it could result from biased learning that places more weight on positive reward outcomes than on negative ones (Palminteri and Lebreton, 2022). Although both mechanisms may be at play (Katahira, 2018; Palminteri and Lebreton, 2022), the latter suggests that a positivity bias in learning can also skew choice autocorrelation through attention, leading to increased sensitivity to the choice history associated with some feature(s) and/or conjunction(s).
This pattern slightly changed as performance and learning about the informative feature and conjunction saturated toward the end of the experiment. More specifically, by performing the same analysis on the last 150 trials of the experiment, we found that participants still selectively associated reward feedback with the informative feature and both informative and noninformative conjunctions (informative feature, b = 0.13; SE = 0.05; t(9970) = 2.82; p = 0.005; noninformative Conjunction 1, b = 0.19; SE = 0.08; t(9970) = 2.35; p = 0.02; the informative conjunction, b = 0.16; SE = 0.08; t(9970) = 1.96; p = 0.05; Extended Data Fig. 2-1A). They were also more likely to select a stimulus/object if it contained the same informative feature as the previously selected stimulus/object and, to a lesser extent, if it contained either of the two noninformative features of the previously selected stimulus/object (informative feature, b = 0.38; SE = 0.06; t(9970) = 5.90; p < 0.001; noninformative Feature 1, b = 0.17; SE = 0.05; t(9970) = 3.20; p = 0.001; noninformative Feature 2, b = 0.10; SE = 0.05; t(9970) = 2.19; p = 0.03; Extended Data Fig. 2-1A).
Comparison of the results between the first and last 150 trials suggests that later in the experiment, participants increasingly relied on learned values rather than adjusting their behavior based on new reward feedback. This shift can explain why, during the last 150 trials, the weights for choice-related coefficients increased, while those for reward-related coefficients decreased, particularly for the informative conjunction (Extended Data Fig. 2-1A). In addition, the significant effects of choices associated with noninformative features may reflect stickiness in behavior, possibly as a result of confirmation bias caused by the asymmetry in learning rates from positive and negative outcomes.
Additionally, our findings suggest that later in the experiment, participants demonstrated enhanced learning from reward feedback associated with one of the noninformative conjunctions. This enhancement could stem from several factors. First, there is considerable variability in learning strategies among participants, with some participants focusing on noninformative features and conjunctions due to the inherent complexity of the utilized reward schedule. Second, toward the end of the experiment, the impact of reward feedback may naturally diminish as participants learn through reward prediction errors. As their estimates become closer to the true reward values, the reward prediction errors decrease, leading to a general decline in learning. Finally, as we show below, simulations of the experiment with a model that only learned the values of the informative feature and conjunction were able to qualitatively reproduce many of the patterns observed in Figure 2A and Extended Data Figure 2-1A, including the association of reward with some of the noninformative features and conjunctions (Extended Data Fig. 2-2). Together, these results suggest that the behavior of participants was consistent with learning the informative feature and the conjunction.
Later in the experiment, participants’ performance reached a steady state, which could indicate reduced learning in this period. Therefore, we hypothesized that participants’ choices should reflect the predictive reward values of the informative feature and conjunction. To test this, we used the log ratio of the actual reward probabilities/values of the two options along the informative feature and conjunction dimensions to predict participants’ choice in the last 150 trials using a mixed-effect logistic regression model. We found that the log ratio of values along both the informative feature (b = 1.78; SE = 0.20; t(10047) = 8.71; p < 0.001) and the informative conjunction (b = 0.23; SE = 0.08; t(10047) = 2.89; p = 0.003) significantly predicted participants’ steady-state choices, suggesting that participants learned the values of both the informative feature and conjunction and made choices based on a weighted combination of these values, as reflected in the best-fitting logistic curves based on the informative feature and conjunction values (Fig. 2B). Compared with a model that only considered the value of the informative feature, a likelihood ratio test revealed that incorporating the values of the informative conjunction as an additional predictor significantly improved the fit.
To get a more fine-grained understanding of how this value-based choice behavior emerged over time, we also fit logistic regression models to predict choice with log-odds of true feature and conjunction values as predictors (similar to Fig. 2B), separately to choice data from blocks of 108 trials. We found that the regression coefficients/slopes for both the informative feature and conjunction values increased over time (Extended Data Table 2-1). In contrast, the slope for the stimulus/object value did not significantly rise above 0. To further investigate participant-specific learning of predictive reward values, we also examined the random effects of the coefficients. For almost all participants, their individual coefficients for both the informative feature and conjunction rose over at least an earlier portion of the experiment (Extended Data Fig. 2-1B).
Moreover, participant-specific slopes for Finf and Cinf suggest that a few participants only learned the informative feature and not the informative conjunction as evidenced by the slope for Cinf being small or even negative. This indicates that some participants did not effectively learn the predictive values of the informative conjunctions. The reduction in the slope of Cinf could also be a consequence of the observed increase in reward coefficients for noninformative conjunctions and the rise in choice coefficients for noninformative features later in the experiment (Extended Data Fig. 2-1A). These changes could collectively diminish the impact of informative conjunctions over time (Extended Data Fig. 2-1B, middle panel).
Overall, these results indicate that values of both the informative feature and conjunction were learned gradually, but the feature values were learned faster than conjunction values. In addition, participants did not transition to object-based learning, as indicated by object values not significantly predicting choice behavior, though some explored the noninformative conjunctions (Extended Data Fig. 2-1B).
Although the above logistic regression analyses cannot separate the influence of attention on choice versus learning (Katahira, 2018), they enabled us to detect biases in choice behavior that imply differential processing of different features or conjunctions of the stimuli. Based on the above analyses, we conclude that in the multidimensional reward learning task, participants’ initial behavioral adjustments indicated a higher sensitivity to the reward and choice history of certain features and conjunctions over other ones. This adaptive strategy allowed participants to learn an approximate value representation by learning the values of the informative feature and informative conjunction without having to learn the object/stimulus values directly. This had lasting effects on the participants’ behavior, as they made their choices by combining the values from the informative feature and conjunction once their performances had reached steady state. These results motivated us to investigate the mechanisms through which this attentional bias could have emerged and how it could have interacted with value learning (see the next section).
Attention is guided jointly by the informative feature and conjunction and only affects learning
The above model-free analyses verified that participants’ credit assignment and/or decision-making were biased toward certain features and conjunctions. Next, to explain how these attentional biases emerge and exert their influences, we constructed various RL models that included different attentional mechanisms and fit choice behavior with these models. The general architecture of the models (Fig. 3A) was inspired by the hierarchical decision-making and learning model proposed by Farashahi et al. (2017b). In these models, a set of nine feature-encoding units (three for each feature), 27 conjunction-encoding units (nine for each conjunction), and 27 object-encoding units are tuned to different dimensions of the stimuli. Each of these units projects to a corresponding value-encoding unit via synapses that undergo reward-dependent plasticity. This allowed the value-encoding units to estimate reward values associated with individual features, conjunctions of features, and objects. In addition, feature- and conjunction-value–encoding units send input to the corresponding attentional-selection circuits that, in turn, provide feedback to multiplicatively modulate the gain of stimulus-encoding units. During decision-making, when feature and/or conjunction values were added together to calculate the predictive value of each choice option, this feedback could enable the model to assign higher weights to values associated with certain stimulus dimensions over others. The ensuing attention-weighted value signals drive the decision-making circuit that generates choice on each trial and could result in harvesting reward. The reward outcome on each trial in turn modulates the update of synapses between sensory- and value-encoding units (Soltani and Wang, 2010). In addition to gain modulation during decision-making, attention could also differentially modulate the rate of synaptic updates by changing the gain of the presynaptic stimuli encoding units, which ultimately modifies the learning rates or associability of different stimuli dimensions (Kruschke, 2001). As a result, on a given trial, the values associated with features and/or conjunctions that received higher attention weights would be updated more and vice versa.
The computational model for learning in high-dimensional environments with value-driven attention. A, Illustration of the model's architecture. Sensory units encode different features, conjunctions, and stimulus/object identities of the choice options. These units project to feature-value, conjunction-value, and object-value–encoding units via plastic synapses that undergo reward-dependent plasticity, allowing the latter units to estimate the corresponding predictive reward values. Feature-value and conjunction-value-encoding circuits send feedback to feature and conjunction attention circuits, which potentially interact with each other. Using these value signals, the attention circuit calculates modulatory signals that feed back into the sensory encoding units to modulate the gain of feature/conjunction encoding, which in turn modulates decision-making and/or learning. B, Results of random-effect BMS. Reported values in the middle panel are pxp of all models. The column and row above and to the right are the results of family-wise BMS aggregating across different types of value learning strategies and across different attentional mechanisms, respectively. The name of attentional mechanisms is given by how attention is calculated (const, constant and uniform; diff, absolute difference; sum, average; max, maximum) and when attention is applied (none, no attentional modulation; C, during choice; L, during learning; CL, during choice and learning). For example, diff X L denotes that attention is calculated based on absolute difference and modulates value updates. See Extended Data Figure 3-1 for the results of model and parameter recovery analysis. See Extended Data Figure 3-2 and Extended Data Table 3-1 for additional model comparison results with models in a previous study by Farashahi and Soltani (2021). See Extended Data Table 3-2 for summary statistics of the best-fitting model's estimated parameters. C, The average differences in trial-wise BIC between the best-fitting model (F + Cjoint, diff X L) and the other models.
Figure 3-1
Model and parameter recovery. (A–B) Using Bayesian Model Selection, we compared the winning model with other models with: the same attentional mechanism (modulation of learning based on value difference) and different learning strategies (A), and the same learning strategy (F + C learning with joint attention) but different attentional mechanisms (B). The reported values are pxp, and the color of the cells shows the posterior model probabilities. The true models were generally well-recovered with perfect accuracy (pxp = 1), except for those in which attention driven by value differences modulated both choice and learning. Those models were difficult to distinguish from models in which only learning was modulated. However, the false model did not fit the data significantly better than the true model (Wilcoxon’s signed-rank test).
Figure 3-2
Model comparison with models used by Farashahi and Soltani (2021). (A) Result of BMS between the best-fitting model in the current study (F + Cjoint, diff X L) and the mixed feature- and conjunction-based learning models of Farashahi and Soltani (2021).
Table 3-1
Comparison between the goodness-of-fit of the mixed feature and conjunction-based learning models of Farashahi and Soltani (2021) and the best-fitting model of the current study. Reported values are the mean negative loglikelihood (NLL) and BIC. Numbers in parentheses denote the standard error of mean. Download Table 3-1, DOCX file.
Table 3-2
Summary statistics of estimated parameters of the best-fitting model for predicting choice data. Download Table 3-2, DOCX file.
To test how learned predictive reward values could drive attention, we compared three possible relationships between these values and attention (in addition to no relationship): summation, absolute difference, and maximum (see Materials and Methods). All of these functions could be implemented by canonical recurrent neural network circuits and had been used in prior studies to calculate attention based on subjective reward values, sometimes in much simpler reward schedules where these possibilities could be hard to distinguish from each other (Anderson et al., 2011; Hunt et al., 2014; Niv et al., 2015; Soltani et al., 2016, 2021; Leong et al., 2017; Farashahi et al., 2017b; Gluth et al., 2018; Daniel et al., 2020; Farashahi and Soltani, 2021; Pettine et al., 2021). In our model, predictive reward values of the two presented options along different dimensions are first passed through one of the above functions and then normalized across dimensions to have a sum of 1, resulting in one attention weight per feature or conjunction dimension (e.g., color or color/shape conjunction). Attention could modulate decision-making, learning, both, or neither. We assumed that during decision-making, attention could modulate the relative weights of a stimulus’ feature and conjunction values in determining the log-odds of choosing that stimulus. In contrast, the learning rates of feature- and conjunction-value updates could be modulated by the attention weights to model the effects of attention during learning.
We also considered five “learning strategies”: (1) feature-based learning
After fitting models to trial-by-trial choice data of each participant using maximum likelihood estimation, we applied BMS to the BIC of all models to determine the best model. The model that best explained our choice data was the model that learned feature and conjunction values with joint attention, in which attention was driven by the absolute difference in values and modulated learning only (F + Cjoint, diff X L; Fig. 3B).
We then pooled models with similar attentional mechanisms or learning strategies to compute family-wise BMS. This confirmed that across all learning strategies, in the best-fitting models, attention modulated only the value updates, not the decision-making. Moreover, this attentional mechanism was controlled by the absolute value difference between the two alternative options (on each trial) along different dimensions (diff X L, posterior probability = 0.42).
Moreover, across all attentional mechanisms, the learning strategies that best fit our data were feature and conjunction learning with joint attention (F + Cjoint).
By examining the trial-wise BIC (Eq. 11) of the models with the best attentional strategy (diff X L), we found that the trial-wise BIC of the best-fitting model (F + Cjoint) was, on average, lower than that of the models with alternative learning strategies (Fig. 3C).
We also compared the ability of the best-fitting model and multiple models that have previously been used to fit the same choice data (Farashahi and Soltani, 2021). Specifically, each of the previous models learned about the values of a specific feature and the conjunction of the other two features but did not include trial-by-trial, value-driven attention weights (Extended Data Fig. 3-2; Extended Data Table 3-1).
Attention weights estimated by the RL model explain biases in credit assignment and individual differences in performance
Using the maximum likelihood parameter estimates of the best-fitting model (Extended Data Table 3-2; see Extended Data Fig. 3-1C,D for parameter recovery results), we inferred the trial-by-trial subjective attention weights in the same way that subjective values could be inferred from the RL models that fit choice data the best. Using this approach, we investigated the distribution and dynamics of attention across the informative and noninformative stimulus dimensions. Because attention weights add up to 1, we treated their distribution as a categorical probability distribution over the dimensions and applied an information-theoretic approach to characterize their dynamic properties with the following observations. First, we found that the entropy of the attention weights, which is inversely related to how focused attention is, decreased throughout the experiment (Fig. 4A).
Dynamics of attentional modulation and its effects on performance. A, The plot shows the entropy of the attention weights and JSD between attention weights on consecutive trials. Entropy rapidly decreased and remained low, suggesting that attention became more focused over time. The increase in JSD suggests that attention tended to switch across trials even late in the experiment. A moving average of a window size of 20 trials was applied for visualization purposes, not hypothesis testing. B, The average attention weights for individual participants. The color indicates the average cross-trial JSD for each participant. Participants exhibited a variety of patterns for attentional modulation. Some concentrated on either the informative feature and conjunction pair or one of the noninformative feature and conjunction pairs (points close to the vertices of the triangle with low JSD). Others oscillated between two or three different dimensions (points far from the vertices with high JSD). C, The plot shows the time course of smoothed average attention weights across participants, weighted by participants’ overall sensitivity to reward feedback (the product of the inverse temperature and the learning rate; see Extended Data Fig. 4-1 for the unweighted version).
Figure 4-1
Trial-by-trial average attention weights estimated by the best-fitting RL model, not weighted by the product of inverse temperature and learning rate. The difference between the informative and first non-informative dimensions diminished, but the same qualitative pattern was still present. Conventions are the same as in Fig. 4C. Download Figure 4-1, TIF file.
Figure 4-2
Sources of variabilities in the attention weights over time. (A) Plot shows the trial-by-trial attention weights of an example participant, as inferred by the best-fitting model. These weights became close to zero or one as learning progressed, but switches between dimensions were still present across trials. This explains the decrease in the entropy of the attention weights and the accompanying increase in JSD. (B) Plot shows the value separability equal to the standard deviation of learned values within each feature/conjunction dimension, separately for each feature and conjunction. These quantities, which were independent of the pair of options presented on a given trial, reached asymptotic values quickly. Shaded area shows one standard error. No smoothing was applied. Download Figure 4-2, TIF file.
Figure 4-3
The attention weights estimated using the F + Cjoint model with attention modulating choice only (diff X C). Shaded areas show one standard error. Compared to the best-fitting model (diff X L), this model did not exhibit the strong initial bias to a non-informative feature and conjunction. Figure conventions are the same as Fig. 4C. Download Figure 4-3, TIF file.
Even though value learning was a gradual process in our experiment, attention weights in our models also depended on the pair of options available on each trial. This could enable stimulus-specific, rapid shifts in attention across trials, particularly when the inverse temperature for attentional selection (γ in Eq. 8) is high. In line with this, we found the estimated values of γ to be relatively large (Median = 257.88, IQR = 70.39–410.17; Fig. 5A), making the competition for the control of attention close to a hard winner-take-all.
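To illustrate how a large γ produces such near winner-take-all competition, the following minimal sketch implements a generic softmax attentional selection over per-dimension value differences. It is only a schematic of the kind of mechanism described by Eq. 8, with hypothetical inputs rather than values from our fitted models.

```python
import numpy as np

def attention_weights(value_diffs, gamma):
    """Softmax over per-dimension value differences between the two options.

    value_diffs : absolute differences in integrated feature-and-conjunction
                  values between the two presented options, one entry per
                  dimension (illustrative input, not from our task code).
    gamma       : inverse temperature for attentional selection.
    """
    logits = gamma * np.asarray(value_diffs, dtype=float)
    logits -= logits.max()            # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()                # attention weights sum to 1

# With a large gamma (the median estimate was ~258), even modest differences
# in value separation drive attention toward a near winner-take-all pattern.
print(attention_weights([0.10, 0.12, 0.08], gamma=258))  # ~[0.006, 0.994, 0.00003]
```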
Distribution of key model parameters. A, The distribution of γ values (inverse temperature for attentional selection) in the best-fitting model. Higher values of γ correspond to more sensitive attentional selection. B, The distribution of ω values, measuring the relative weighting of feature versus conjunction for decision-making. This distribution suggests that participants tended to consider both feature and conjunction when making decisions but with a slight overall bias toward feature-based learning. C, The plot shows the distributions of the learning rates for rewarded and unrewarded trials
Figure 5-1
Estimated learning rates for the feature-based and conjunction-based learning in the model without attention. Learning rates were larger for rewarded than unrewarded trials, demonstrating a positivity bias. This suggests that the positivity bias in learning rate was not due to the inclusion of attention or the inadequacy of the current attentional mechanisms in explaining learning from a lack of reward. Download Figure 5-1, TIF file.
Figure 5-2
The histogram of estimated inverse temperatures for attention normalization in the F + Cjoint model with attention modulating choice only (diff X C). Download Figure 5-2, TIF file.
As attention weights became more all-or-none with learning, switching between them became more drastic, resulting in higher JSD, as demonstrated by the attention weights of an example participant (Extended Data Fig. 4-2A). To illustrate that variability in the pairs of options available on each trial could partially account for the high JSD, we performed an additional analysis to remove the effect of the available choice options. To that end, we computed the standard deviation of values within each dimension. This quantity, which we refer to as value separability, measures how different the values of the instances of a feature or conjunction are, which is the quantity that determines attention weights in our best-fitting model. However, unlike the actual attention weights, it does not depend on the choice options available on any given trial. We found that this quantity changed smoothly and converged as learning progressed (Extended Data Fig. 4-2B). This indicates that the options available on each trial played a large role in the observed jumps in attention weights.
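The information-theoretic quantities used above can be computed directly from the inferred attention weights. The sketch below assumes the weights are stored as one probability vector per trial and that the learned values of each dimension's instances are available; variable names are hypothetical.

```python
import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon

def attention_entropy(weights):
    """Entropy of one trial's attention weights (lower = more focused attention)."""
    return entropy(weights, base=2)

def cross_trial_jsd(weights_prev, weights_curr):
    """Jensen-Shannon divergence between attention weights on consecutive trials."""
    # scipy returns the JS distance (square root of the divergence)
    return jensenshannon(weights_prev, weights_curr, base=2) ** 2

def value_separability(values_within_dimension):
    """SD of learned values within one feature or conjunction dimension.

    Unlike the trial-wise attention weights, this quantity does not depend on
    which pair of options is presented on a given trial.
    """
    return float(np.std(values_within_dimension))

# Hypothetical weights over the three dimensions on two consecutive trials:
w_prev, w_curr = np.array([0.90, 0.08, 0.02]), np.array([0.05, 0.93, 0.02])
print(attention_entropy(w_curr))            # low entropy: attention is focused
print(cross_trial_jsd(w_prev, w_curr))      # high JSD: attention switched dimensions
print(value_separability([0.2, 0.5, 0.8]))  # e.g., values of a feature's three instances
```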
In addition, we found that attention was influenced by both feature values and conjunction values and, in turn, modulated the learning of both sets of values, as revealed by the distribution of the ω values that quantify the relative influence of feature and conjunction in the model (
The distribution of participants’ trial-averaged attention weights (Fig. 4B) revealed that participants employed a diverse range of attentional strategies. Although some participants focused more on the informative dimensions (i.e., the informative feature and informative conjunction pair), a substantial proportion of them developed attention toward noninformative Dimension 1 (the noninformative Feature 1 and noninformative Conjunction 1). The trial-by-trial, participant-averaged attention weights also demonstrated an initial bias toward noninformative Dimension 1 (Fig. 4C). To account for each participant's overall sensitivity to reward feedback, the attention trajectory of each participant was weighted by
One possible explanation for the observed bias toward noninformative Dimension 1 was the asymmetry in learning rates (Fig. 5C), which led participants to update their values to a lesser extent after no reward, causing them to learn predictive reward values in a biased fashion (Cazé and van der Meer, 2013; Katahira, 2018; Palminteri and Lebreton, 2022). This would also explain why the choice history of noninformative Feature 1 had a strong effect on participants’ ongoing choices (Fig. 2B). This asymmetry in learning rates was not merely a result of adding the attentional component, as the same asymmetry was present in the model without attention (
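To illustrate how asymmetric learning rates can sustain such a bias, here is a minimal sketch of a Rescorla-Wagner style update with separate (hypothetical) rates for rewarded and unrewarded outcomes; it is not our fitted model, only a demonstration of the positivity bias.

```python
def update_value(v, reward, alpha_rew=0.30, alpha_unrew=0.05):
    """Value update with asymmetric learning rates (hypothetical values).

    When alpha_rew > alpha_unrew, rewards are incorporated faster than
    omissions, so an option that happens to be rewarded early retains an
    inflated value, producing the positivity bias discussed above.
    """
    alpha = alpha_rew if reward else alpha_unrew
    return v + alpha * (reward - v)

v = 0.5
for r in [1, 1, 0, 0, 0]:        # two early rewards followed by three omissions
    v = update_value(v, r)
print(round(v, 3))               # ~0.647, well above the 0.4 average outcome
```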
Importantly, individuals’ estimated trial-averaged attention weights predicted their actual performance in the task, such that the more they allocated attention toward learning the values of the informative feature and conjunction, the higher their performance was (ρ(65) = 0.35; p = 0.003; Fig. 4D). In contrast, more attention toward learning the values of noninformative Dimension 1 corresponded with lower performance (ρ(65) = −0.27; p = 0.03). Similarly, the correlation between attention toward noninformative Dimension 2 and performance was negative but not significant (ρ(65) = −0.13; p = 0.30). This null result, however, could be due to the small number of participants who ended up focusing on noninformative Dimension 2 (Fig. 4B).
Overall, these results suggest that initially, participants tended to develop a bias toward one of the noninformative dimensions. After receiving more reward feedback, however, they transitioned to learning about the correct combination of the informative feature and conjunction. Ultimately, the extent of learning (credit assignment) about the informative dimensions influenced their final performance. These observations highlight the complex but critical role of attention in performing multidimensional learning tasks.
RL model captures key characteristics of the experimental data
When fitting decision-making and learning models on a trial-by-trial basis, it is common to generate choice data based on the estimated parameters of the best-fitting model to ensure that this model is able to generate behavior that qualitatively resembles the empirical data (Palminteri et al., 2017; Wilson and Collins, 2019). This procedure, known as model validation, ensures that the results are not merely due to the best model capturing the specific sequence of choices made by the participants. Therefore, we simulated the multidimensional probabilistic learning task using the estimated parameters from the best-fitting model. Specifically, we simulated 50 sessions of the experiment using each participant's estimated parameters and the exact order of choice options they were presented with. We found that on average, the best-fitting model's performance matched the empirical learning curve (Fig. 6A). Moreover, the model was able to capture individual differences in performance, as there was a high degree of correlation between the empirical performance and the average simulated performance for individual participants
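The validation procedure can be summarized by the sketch below. Here `simulate_session` is a placeholder for the full generative model (value learning, attentional modulation, and softmax choice), and the data layout is hypothetical; the sketch only conveys the structure of this posterior-predictive check.

```python
import numpy as np

N_SIMS = 50  # simulated sessions per participant

def validate_model(participants, simulate_session):
    """Regenerate choice data from fitted parameters and compare to behavior.

    participants     : list of dicts holding each participant's fitted
                       parameters, the option sequence they saw, and their
                       observed trial-by-trial reward outcomes.
    simulate_session : function(params, options) -> array of simulated
                       per-trial rewards (hypothetical placeholder).
    """
    empirical, simulated = [], []
    for p in participants:
        sims = [simulate_session(p["params"], p["options"]) for _ in range(N_SIMS)]
        empirical.append(np.mean(p["rewards"]))   # observed performance
        simulated.append(np.mean(sims))           # average simulated performance
    # Across-participant correlation between empirical and simulated performance
    return np.corrcoef(empirical, simulated)[0, 1]
```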
Analyses of simulated data generated by the best-fitting model using estimated parameters for individual participants. A, B, Performance based on simulated data and comparison with the average performance of all participants. The simulated performance matched the empirical performance on average and across participants. Shaded areas and error bars show the SEM. C, The entropy of the attention weights and the JSD between attention weights on consecutive trials. Simulated trial-by-trial attention became more concentrated over time, and attention could jump sharply across consecutive trials. D, The average attention weights for simulated data. Conventions are similar to Figure 4B. The distribution of attention weights in the simulated data exhibited a diverse set of credit assignment strategies. E, Trial-by-trial attention weights estimated from the simulated data. Conventions are similar to Figure 4C. See Extended Data Figure 6-1 for the unweighted average of the simulated attention weights. F, Relationship between the allocation of attention and performance in simulated data. Higher attention on the informative feature and conjunction predicted better performance. Triple asterisks indicate
Figure 6-1
Trial-by-trial average attention weights during simulated choice sequences, not weighted by the product of inverse temperature and learning rate. The difference between the informative dimension and the first noninformative dimension diminished, but the same qualitative pattern was still present. The trajectory of the simulated attention weights qualitatively replicated the characteristics of the empirical attention weights. Conventions are the same as in Fig. 4C. Download Figure 6-1, TIF file.
Figure 6-2
Variability of value estimations and attention weights across simulations of each individual participant based on the winning model. (A) Plot shows the variability of each feature/conjunction value, as quantified by the standard deviation of each value across different simulation runs with the same set of empirical parameters. Gray lines show individual values, and dark green lines show the aggregate across all feature and conjunction values. This shows that even with the same set of parameters, variability in subjective values can persist throughout the experiment due to stochasticity in choice. (B) The variability in attention, as quantified by the KL divergence between the attention weights in each simulation run and the average attention weights across all simulation runs for each participant. Similar to the subjective values, variability in attention persisted throughout the experiment. Download Figure 6-2, TIF file.
Figure 6-3
Attention at the time of choice only was insufficient to explain participants’ behavior. (A) The average attention weights for each simulated experiment using the F + Cjoint model with attention modulating choice only (diff X C). The color indicates the average cross-trial JSD for each simulation run. (B) The simulated attention-weight trajectory failed to replicate important aspects of the empirical attention trajectory, i.e., the greater attention toward a specific noninformative feature and conjunction pair. Download Figure 6-3, TIF file.
Computing the trial-by-trial simulated attention weights, we found that attention in the model became more focused over time, as indicated by the decrease in the entropy of the attention weights across trials (linear mixed-effect model; main effect of trial, b = −0.16; SE = 0.002; t(67.00) = −79.80; p < 0.001; Fig. 6C), similar to our experimental observation (Fig. 4A). Moreover, the JSD of attention weights across consecutive trials increased throughout the experiment, suggesting larger jumps of attention across consecutive trials as the experiment progressed (linear mixed-effect model; main effect of trial, b = 0.04; SE = 0.001; t(67.00) = 36.60; p < 0.001; Fig. 6C). In addition, similar to our experimental observations, we found diverse attentional strategies (compare Fig. 6D and Fig. 4B), suggesting that participants’ attention weights could diverge due to noise in the choice sequence.
We also repeated the cluster-based permutation test (one-sided t test; cluster threshold
In addition, similar to experimental data, we found a significant correlation between performance and attentional strategy in the simulated data such that more attention to the informative dimension was associated with better performance (ρ(3348) = 0.38; p < 0.001; Fig. 6F) and more attention toward the noninformative dimensions was associated with lower performance (ρ(3348) = −0.28; p < 0.001 for noninformative Dimension 1; ρ(3348) = −0.20; p < 0.001 for noninformative Dimension 2; Fig. 6F). In summary, these results indicate that patterns of the attention weights inferred from choice were not merely due to the specific sequence of choices used to fit the model. Instead, they represent a general characteristic of both the model and the experimental design.
Finally, to more clearly demonstrate the ability of the best-fitting model to capture empirical behavior, we also performed the model-free analysis on the simulated data generated by the best-fitting RL model. Specifically, we simulated 10 sessions of the experiment for each participant using the model with attentional modulation of learning and their best-fitting parameters. These simulations (Extended Data Fig. 2-2) successfully reproduced many of the patterns observed in Figure 2 and Extended Data Figure 2-1A, including the preferential learning about the informative feature and conjunction, as well as the association of reward with some of the noninformative features and conjunctions.
Attentional modulation of decision-making only could not explain the experimental data
Results from fitting choice data revealed that the best-fitting model does not include attentional modulation of decision-making (no attention weights on choice), which seemingly contradicts findings from previous studies (Leong et al., 2017). However, detecting the effect of attention weights during choice may have been challenging when attentional modulation was influencing the learning processes. More specifically, we defined attention weights at decision-making as the weights that determine how learned values are combined to drive choices. If attention weights already influenced value updates, this biased learning alone could account for behavioral biases observed in the data, potentially masking any direct effects of attention on decision-making. This also reflects the general difficulty of distinguishing between learning rates and decision weights/inverse temperature in RL models. This challenge was evident in our model recovery results, where we found that among the models with the best-fitting learning strategy (F + Cjoint), the data generated by models with attention at both learning and choice (diff X CL) could be fit equally well by models with attention at learning only (diff X L; Extended Data Fig. 3-1B). Similarly, our qualitative behavioral signatures, calculated using logistic regression (Fig. 2B), could be the result of attentional effects during choice, learning, or both (Katahira, 2018).
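To make this distinction concrete, the sketch below contrasts the two loci of attentional modulation under deliberately simplified rules: in a "diff X L"-style variant, attention scales the effective learning rate of each dimension, whereas in a "diff X C"-style variant, attention weights how learned values are combined at the moment of choice. Both functions are schematic illustrations with hypothetical inputs, not our fitted models, and they highlight why the two mechanisms can produce similar behavioral biases.

```python
import numpy as np

def update_with_attention_at_learning(values, chosen_idx, reward, attn, alpha=0.2):
    """Attention modulates learning: attended dimensions receive larger
    effective learning rates, biasing credit assignment toward them.

    values     : array of shape (n_dimensions, n_instances) of learned values
    chosen_idx : index of the chosen option's instance along each dimension
    attn       : attention weights over dimensions (sum to 1)
    """
    v = values.copy()
    for d in range(v.shape[0]):
        i = chosen_idx[d]
        v[d, i] += attn[d] * alpha * (reward - v[d, i])
    return v

def choice_prob_with_attention_at_choice(values_opt1, values_opt2, attn, beta=5.0):
    """Attention modulates choice: learned values are combined using attention
    weights before entering the softmax, while updates remain unweighted."""
    v1, v2 = float(np.dot(attn, values_opt1)), float(np.dot(attn, values_opt2))
    return 1.0 / (1.0 + np.exp(-beta * (v1 - v2)))  # P(choose option 1)
```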
To address this, we performed additional analyses to demonstrate that having attention only at decision-making was not sufficient to explain our findings. Specifically, we repeated our model-based analyses on the behavior of a model that was identical to the best-fitting model except that the attention weights modulated decision-making instead of learning. We found that the trajectory of attention weights did not exhibit the initial bias toward the noninformative feature and conjunction that later shifted toward the informative feature and conjunction. Instead, attention weights were allocated mostly to the informative feature and conjunction soon after the beginning of the experiment (Extended Data Fig. 4-3). In addition, we found that the estimated inverse temperature for attentional selection
We suspect that the quick convergence of the subjective reward values to the correct values resulted from the absence of the variability and biases that attention introduces during learning. Moreover, because the model with attention during choice only allocated its attention weights to the informative feature and conjunction, it could not explain the important suboptimal attentional biases that participants exhibited (i.e., the bias toward noninformative Feature and Conjunction 1), especially at the beginning of the experiment (Fig. 4C). We further verified these intuitions by simulating the F + Cjoint diff X C model, using parameters estimated from fitting participants’ data. We found that in the majority of simulated sessions, the average attention weights estimated from the simulated data were strongly biased toward the informative feature and conjunction (Extended Data Fig. 6-3A), unlike the variety of attentional strategies produced by the best-fitting diff X L model (Fig. 6D). In addition, by examining the average trajectories of attention weights across simulated sessions, we found no initial bias toward a specific combination of a noninformative feature and conjunction (compare Fig. 6E and Extended Data Fig. 6-3B). Together, these findings indicate that models with attention weights solely modulating decision-making do not accurately reflect the patterns of experimental data. Instead, attentional modulation during learning appears crucial for accurately explaining participants’ behavior.
Value estimations are biased by the informative feature and conjunction
To further test how participants represented reward contingencies or values, we next analyzed their estimates of reward probabilities associated with the 27 stimuli/objects. Our hypothesis was that the similarities in a participant's value estimates for different stimuli could be influenced by their attentional biases. For example, a participant who focused their attention on the color dimension would rate all stimuli that share the same color as having similar values. Alternatively, a participant who focused their attention on the conjunction of shape and pattern dimensions would rate all stimuli that share the same shape and pattern configuration similarly but would not rate stimuli that share only a shape or a pattern similarly.
Expanding on methods used in previous work (Farashahi and Soltani, 2021), we fit linear mixed-effect models of the participants’ estimates of the reward probability associated with each stimulus using different predictive reward values based on features, conjunctions, or their combinations as the independent variables. More specifically, we fit five models using the following independent variables: (1) values of the informative feature,
Analyses of estimation trials reveal that participants’ estimates were mostly influenced by the informative feature and informative conjunction. A, Adjusted
Table 7-1
Comparison of the ability of different types of reward values in predicting participants’ reward probability estimates across five bouts of estimation trials (
Table 7-2
Participants combined the values of the informative feature and conjunction to form value estimations. Table shows results of the linear mixed-effects modeling of the participants’ reward probability estimates across five bouts of estimation trials (
Table 7-3
Informative feature and conjunction explained more variance in participants’ reward value estimations. Table shows results from the mixed-effects ANOVA analysis of participants’ reward probability estimates across five bouts of estimation trials (
In order to further examine whether participants learned about stimulus/object values in addition to feature and conjunction values, we fit mixed-effect models using predictive reward values of the informative feature and the informative conjunction in addition to the actual reward values of the stimuli/objects. We found that the regression weight on the informative feature was consistently larger than 0 (
Although the above analyses revealed that participants estimated the reward values of the informative feature and conjunction, they cannot detect whether participants' estimates deviated from the actual reward probabilities due to attentional biases. Such biases can be uncovered by fitting ANOVA models that use the features of each stimulus and their interactions to predict the participants’ reward probability estimates and comparing the variance explained by each feature and their interactions. This method does not depend on the ground-truth reward probabilities and, therefore, is better able to capture biases in learning. Using this method, we found that, consistent with the previous method, both the informative feature and the interaction of the two noninformative features (the informative conjunction) explained a significant amount of variance in the value estimates from all estimation trials (informative feature,
To further relate the reward probability estimates from the estimation bouts to the RL models used to fit choice data, we also utilized reward values from the best-fitting RL model (F + Cjoint diff X L) to predict reward probability estimates. To that end, we first computed a weighted average of the subjective values along different feature and conjunction dimensions to obtain the predictive reward value of each stimulus before each estimation bout (i.e., after choice trials 86, 173, 259, 346, and 432). We z-scored these subjective values from the RL model for each participant because the inverse temperature parameter allowed these values to assume very different ranges. Using a linear mixed-effect model, we then fit a scaling parameter and an intercept to predict participants’ estimated reward probabilities based on the subjective values from the RL model. We found that the reward probability estimates were better fit by the subjective values based on the best RL model than by the actual reward values (Extended Data Table 7-1). This result further demonstrates that the winning RL model effectively captured both the learning and the integration of values across different dimensions.
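A minimal sketch of this final regression, assuming the probability estimates and the RL-derived values are arranged in a long-format table with one row per participant, estimation bout, and stimulus (all column and file names hypothetical), could look like this:

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import zscore

# Hypothetical columns: participant, bout, stimulus, estimate, rl_value
df = pd.read_csv("estimation_trials.csv")

# z-score the RL-derived subjective values within each participant, because the
# inverse temperature parameter lets these values span very different ranges
df["rl_value_z"] = df.groupby("participant")["rl_value"].transform(zscore)

# Scaling parameter (slope) and intercept, with a random intercept per participant
model = smf.mixedlm("estimate ~ rl_value_z", data=df, groups=df["participant"])
result = model.fit()
print(result.summary())
```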
Overall, analyses of the participants’ value estimates confirmed our analyses of choice data, suggesting that participants learned about the informative feature and the informative conjunction and prioritized learning about these dimensions over other dimensions. In addition, these analyses captured the suboptimal attentional strategies that led to biases in value estimation. Finally, the fact that subjective probability estimates were well captured by the subjective values based on the best RL model, even though that model was not fit on the value estimates, further validates the suitability of this model for capturing behavioral data.
Discussion
Learning in naturalistic environments relies on the interaction between multiple forms of reward learning, decision-making, and selective attention. To investigate this interaction and underlying neural mechanisms, we analyzed behavioral data from a multidimensional probabilistic reward learning task with multiple approaches. Using model-free methods, we found differences in sensitivity to both the reward and choice history associated with different features and conjunctions. Specifically, participants adjusted their behavior by preferentially associating reward outcomes with the informative feature and conjunction of the selected stimulus/object on a given trial. They also tended to repeat choosing options that shared the same informative feature (and, in some participants, one of the noninformative features) as the option chosen on the preceding trial. Similarly, their value estimations resembled a weighted combination of the estimated reward values associated with the informative feature and the informative conjunction, with an initial bias toward one of the noninformative features that diminished over time.
To explain these observations, we constructed multiple RL models that learned predictive values of features, conjunctions, and/or stimuli using different value-based attentional mechanisms that modulated decision-making and/or learning using attention weights. The best model in terms of explaining participants’ choices was the one that learned both feature and conjunction values. In this model, attention weights were controlled by the difference in predictive reward values of the two options in terms of feature and conjunction and, in turn, modulated learning but not choice behavior. Interestingly, feature and conjunction values influenced attention weights in a cooperative manner such that the values of a feature were first integrated with the values of the conjunction of the two other features, and the resulting attention weights modulated the learning of both feature and conjunction values. RL models also allowed us to examine the participant-specific trial-by-trial learning strategies. Specifically, the trial-wise fit of choice data revealed a transition from attention-modulated feature–based learning to attention-modulated mixed feature- and conjunction-based learning. In addition, the amount each participant attended to the informative feature and conjunction correlated with their performance in the task.
In high-dimensional environments, predictive reward values of stimuli can be represented in multiple ways, including feature-based, conjunction-based, or object-based representations, or a combination thereof. Learning in these environments requires flexible stimulus representations, which can be achieved either by deploying selective attention to reduce dimensionality (Niv et al., 2015; Leong et al., 2017; Mack et al., 2020) or by acquiring conjunctive representations to increase dimensionality (Bernardi et al., 2020). These are two aspects of representation learning (Niv, 2019; Radulescu et al., 2019a, 2021), which have been studied using a variety of tasks in different fields. This includes the information integration and weather prediction tasks in category learning (Kruschke, 2001; Ashby and Maddox, 2005), multicue conditioning tasks in classical conditioning (Mackintosh, 1975; Pearce and Hall, 1980; Dayan et al., 2000; O’Reilly and Rudy, 2001; Harris, 2006), and variants of multidimensional RL tasks (Niv et al., 2015; Leong et al., 2017; Farashahi et al., 2017b; Farashahi and Soltani, 2021). All of these tasks require learning of stimulus–outcome contingencies in which the ground-truth contingencies contain certain structures that allow the use and generalization of learning strategies. For example, some features provide little or no information about reward outcomes and can be filtered out through selective attention to enable more efficient learning with little impact on performance (Niv et al., 2015; Leong et al., 2017; Mack et al., 2020). Moreover, associations between certain stimulus features and outcomes can be generalizable or context specific (Kruschke, 2001; Ashby and Maddox, 2005), making outcomes predictable based on elemental or configural stimulus representations, respectively (Mackintosh, 1975; Harris, 2006); these translate to feature-based or object-based learning in RL (Farashahi et al., 2017b). Here, by carefully parameterizing the relationship between the combinations of stimulus features and the reward outcome probabilities associated with each three-dimensional choice option, we were able to control the generalizability of the reward schedule to include novel structures that have not been tested in previous studies, i.e., the presence of both an informative feature and an informative conjunction.
Most previous studies on the interactions between RL and selective attention did not consider the possibility of reward-predictive conjunctions. This is despite evidence that, in multidimensional environments, healthy individuals are capable of learning the values of conjunctions/configurations of features (O’Reilly and Rudy, 2001; Farashahi et al., 2017b; Duncan et al., 2018; Ballard et al., 2019; Pelletier and Fellows, 2019). This could be because previous studies on configural learning were performed in the context of two-feature choice options with little ambiguity about the informative conjunctions. In contrast, our results based on three-dimensional choice options suggest that in naturalistic environments in which predictive reward values cannot be generalized across features, selective attention can be more complex than a simple competition among features (or conjunctions of features). Instead, value representations based on both features and conjunctions interact to determine attention, which in turn shapes efficient state representations upon which learning can happen. Importantly, this value-guided attention provides an additional mechanism for controlling the trade-off between adaptability and precision (Farashahi et al., 2017a,b). By focusing on the most task-relevant aspects of the environment, this strategy allows for the conservation of cognitive resources via selective attention while developing more sophisticated state-space representations (e.g., conjunctive representations).
Such flexible and reward-predictive stimulus representations can be implemented by mixture-of-experts models (Badre and Frank, 2012; Frank and Badre, 2012; Collins and Frank, 2013; Lee et al., 2014; Cortese et al., 2021). In these models, multiple "expert" modules utilizing different stimulus representations compete to predict the reward outcome, and (approximate) Bayesian inference is used to arbitrate among these representations. In contrast, our proposed models assume that the value learning circuit itself can exhibit attentional biases due to internal activity within the decision-making circuit (Szabo et al., 2006; Pannunzi et al., 2012), without requiring an external arbitration circuit. In particular, the best-fitting model resembles the hierarchical decision-making and learning model proposed in a previous study (Farashahi et al., 2017b), which used the relative informativeness of stimulus/object values and integrated feature values to arbitrate between feature-based and object-based learning.
Interestingly, stochastic, reward-dependent Hebbian synaptic plasticity provides a biologically plausible mechanism for implementing the naive Bayes algorithm (Soltani and Wang, 2010; Murphy, 2012). Naive Bayes is an efficient classification algorithm for problems where the different features are conditionally independent given the outcome (Ng and Jordan, 2001; Murphy, 2012). However, when this assumption is violated, as in our experiment in which the conjunction of two features provided additional information about reward probabilities, naive Bayes in its simplest form (similar to elemental or feature-based learning) results in suboptimal behavior. The attentional selection of the informative conjunction during learning, however, is conceptually similar to hierarchical naive Bayes, which has been proposed to deal with the problem of conditional dependence between features (Han et al., 2005; Langseth and Nielsen, 2006). Similar to our best-fitting model, instead of trying to filter out irrelevant features or irrelevant conjunctions, hierarchical naive Bayes selects features to create informative conjunctions. Although the models tested here were not motivated by optimality principles, the connection between our findings and the above algorithms may provide a normative explanation for why this attentional mechanism is adopted.
Notably, functions similar to selective attention and conjunction-based learning can also be achieved by exemplar-based models from the category learning literature (Kruschke, 2001, 2020; Jones and Canas, 2010; Mack et al., 2016; Bornstein et al., 2017). These models calculate the reward value of a stimulus by averaging all past reward outcomes, weighted by the similarities between the current stimulus and previously chosen ones. This similarity can depend on not only the features but also conjunctions of multiple features. Attention weights in these models are updated in an error-driven fashion using gradient descent. Although testing these models against RL models is beyond the scope of the current study, we note that most exemplar-based models only incorporate feature-based attention. Therefore, additional mechanisms are needed to capture the dissociation between learning about a conjunction of two features and learning about the two features separately that was observed in our data.
Although we assumed that value-based attention could affect both choice and learning (but only found evidence for attentional effects on learning), other theoretical works have suggested that attention at choice and learning could serve different roles to improve decision-making in high-dimensional and uncertain environments (Dayan et al., 2000) and that the outcome of the choice may play a role in switching attention before value updates (Kruschke, 2001). There is also evidence for different degrees of attentional modulation at choice and learning and for outcome-dependent attention during learning (Kruschke, 2001; Akaishi et al., 2016). In the models tested here, the switching of attention was dependent on gradual value learning but not on instantaneous reward feedback. Due to concerns of parameter identifiability, we also did not test for separate attentional mechanisms during choice and learning. Eye-tracking could be used as a model-independent measure of attention to avoid this issue (Krajbich et al., 2010; Leong et al., 2017). Although uniform attention to multiple features individually might be hard to differentiate from attention to the conjunctions of these features even with eye-tracking, this method could provide auxiliary data for fitting attention-modulated RL models (Radulescu et al., 2019b).
Overall, our study provides evidence for the existence and possible origin of attentional modulations in multidimensional reward learning. It reveals the intricate interactions between selective attention and configural learning in human reward learning and sheds light on the general process of representation learning as a mechanism for balancing the flexibility and precision of learning and mitigating the curse of dimensionality (Farashahi et al., 2017b; Radulescu et al., 2019a, 2021). Nonetheless, the specific neural mechanisms and substrates through which attention modulates learning remain unknown. Future studies incorporating neural recordings, along with predictions from the computational models presented here, could explore how attentional modulations emerge with value learning and how these modulations in turn alter learning.
Footnotes
We thank Chanc Orzell for helpful comments on the manuscript. This work was supported by a National Science Foundation CAREER Award (BCS1943767) to A.S.
The authors declare no competing financial interests.
- Correspondence should be addressed to Alireza Soltani at alireza.soltani@dartmouth.edu.