Abstract
It is well established that holding information in working memory (WM) elicits sustained stimulus-specific patterns of neural activity. Here, however, we provide evidence for a distinct class of neural activity that tracks the number of individuated items in WM, independent of the type of visual features stored. We present two EEG studies of young adults of both sexes that reveal a robust signal tracking the number of individuated representations in WM, regardless of the specific feature values stored. In Study 1, subjects maintained either colors or orientations across separate blocks in a single session. We found near-perfect generalization of the load signal between these two conditions, even though we could simultaneously decode which feature had been voluntarily stored. In Study 2, participants attended to two features with very distinct cortical representations: color and motion coherence. We again found evidence for a neural load signal that robustly generalized across these features, despite the cortically disparate regions that process color and motion coherence. Moreover, representational similarity analysis provided converging evidence for a content-independent load signal, while simultaneously showing that unique variance in EEG activity tracked the specific features that were stored. We posit that this load signal reflects a content-independent “pointer” operation that binds objects to the current context, while parallel but distinct neural signals represent the features that are stored for each item in memory.
Significance Statement
The format of representations in working memory, along with its capacity limits, is highly debated. Here, we provide strong evidence for a content-independent load signal. We theorize this signal reflects the assignment of pointers that bind objects to a spatiotemporal context. This theory provides a unifying framework that may capture many of the behavioral phenomena seen in WM studies, from object-based benefits to dissociations between the number and precision of representations in WM.
Introduction
It is well established that WM storage elicits sustained patterns of neural activity that track the contents of the stored items, even in the absence of the remembered stimulus (Fuster and Alexander, 1971; Fuster and Jervey, 1981; Funahashi et al., 1989; Goldman-Rakic, 1995; Harrison and Tong, 2009; Serences et al., 2009; D'Esposito and Postle, 2015; Rademaker et al., 2019). These stimulus-specific neural signals are contingent on voluntary storage goals, and they track behavioral estimates of the precision of the stored memories (Emrich et al., 2013; Ester et al., 2013; Wimmer et al., 2014). Thus, there has been a clear motivation for the strong focus on the content-specific neural signals that are sustained during WM storage. Nevertheless, the present work will highlight a distinct class of storage-related neural activity that is functionally separable from the representation of content, per se.
Specifically, we are referring to neural signals that track the number of items stored in working memory, without respect to the specific features of those items (Todd and Marois, 2004; Vogel and Machizawa, 2004; Xu and Chun, 2006; Adam et al., 2020; Thyer et al., 2022). For example, Vogel and Machizawa (2004) discovered a contralateral EEG waveform, the contralateral delay activity (CDA), that tracks the number of items stored on the attended side and is highly predictive of individual WM capacity (Luria et al., 2016). Critically, the amplitude of the CDA is similar for single- and multi-featured objects (Woodman and Vogel, 2008), suggesting that it tracks the number of items stored rather than the total amount of information associated with those items. More recently, Adam et al. (2020) used a machine learning approach that uses the full scalp topography of EEG voltage to decode the number of individuated items stored in WM. This approach is sensitive enough to provide above-chance decoding with single trials of data and reveals a signature of WM storage that generalizes across novel observers. Importantly, Thyer et al. (2022) showed that this multivariate load signature (mvLoad) generalizes across stimuli that vary in both the type (color vs orientation) and the number (single vs dual-feature objects) of features within the stored items. These findings provide evidence for an EEG signature of WM load that tracks the number of individuated representations stored rather than the type or amount of information associated with each item.
What is the computational role of content-independent load signals? We hypothesize that they may reflect an indexing operation that enables the online tracking of items through time and space. For example, Kahneman et al. (1992) proposed the object file as a temporary episodic representation that enables the continuous tracking of items through time and space, despite possible changes in the visual appearance and position of those items. Kahneman et al. (1992) argued that this was essential for binding the representation of an attended item to the context of an unfolding event. Likewise, Pylyshyn (1989) described “fingers of instantiation” (FINSTs) that enable the dynamic tracking of items despite changes in their appearance or location. With both object files and FINSTs, the core insight is that observers require a flexible indexing system that can support the tracking and storage of objects, even though the appearance and position of a relevant item can change dramatically during an unfolding event. Thus, object files and FINSTs were proposed as a mechanism to ensure the continuity of an item's representation despite these challenges. Here we refer to object files and FINSTs as spatiotemporal “pointers” to highlight the process of binding an item to a specific set of spatiotemporal coordinates, and we argue that this may be a fundamental requirement for storage in visual working memory. This perspective aligns with prominent models of WM that propose separate neural processes for the maintenance of features and binding of items to a specific context (Xu and Chun, 2006; Swan and Wyble, 2014; Oberauer and Lin, 2017; Balaban et al., 2019; Bouchacourt and Buschman, 2019; Hedayati et al., 2022). For example, Xu and Chun (2006) argued for two separate neural pathways for object identification and individuation. Similarly, Swan and Wyble (2014) proposed a “neural binding pool” that binds the features of an object together so that it can be represented as an individuated token within a specific scene. Critically, both theories argue for a clear separation between the processes that represent the specific details of an item and the processes that bind that representation to the surrounding context. Thus, clear evidence for content-independent load signals that track the number of stored items without tracking their featural content would support the hypothesis that the binding of items to context reflects a distinct aspect of working memory from the maintenance of each item's features.
As noted above, Thyer et al. (2022) used multivariate decoding approaches to show that EEG signatures of WM load generalized across variations in both the type and number of visual features stored. Although this suggests a clear dissociation between WM load and featural content, there are two key limitations of the Thyer et al. demonstration that merit discussion and follow-up. First, the generalization shown by Thyer et al. (2022) was imperfect, such that decoding accuracy dropped when a model trained on one condition (e.g., color) was tested on another condition (e.g., orientation). This decline in decoding could indicate that there were feature-specific aspects of the EEG load signature. However, each condition in the Thyer et al. study was collected during a different experimental session, opening the possibility that small differences in electrode placement and noise levels reduced generalization. Thus, further work is needed to determine the degree of overlap between load signatures for distinct visual features. Second, the choice of color and orientation as the stored features might have led to the overestimation of content independence, simply because the neural populations representing color and orientation are so interdigitated in the visual cortices. Given the coarse spatial resolution of scalp-recorded EEG activity, it is possible that distinct populations of color and orientation cells could still produce similar enough EEG signals to mimic a general load signal (Sandhaeger and Siegel, 2023). In the present work, we address both of these concerns by collecting all conditions within a single experimental session (Experiment 1) and by examining generalization across color and motion coherence (Experiment 2), features that are known to have highly separable cortical populations (Zeki, 1978; Felleman and Van Essen, 1991; Vaina, 1994). Finally, we applied representational similarity analysis (RSA) to obtain converging evidence for a common load signature across disparate stimulus types, while showing that distinct variance in ongoing EEG activity was explained by the specific features that were maintained.
To anticipate the results, Experiment 1 revealed effectively perfect generalization between the load signatures generated during color and orientation blocks, in line with the content-independent pointer hypothesis. Moreover, in the same dataset we could robustly decode which feature (color or orientation) was being attended and stored, corroborating our assumption that observers were selectively storing distinct aspects of the stimuli in the color and orientation conditions. In Experiment 2, we again observed strong generalization between the load signatures for two cortically disparate features, color and motion coherence, as well as robust decoding of the attended feature. Finally, RSA provided complementary evidence of a load signal that was independent of the attended feature, even while distinct variance in EEG activity tracked the specific features that were stored in working memory. Together, these results provide strong evidence for content-independent pointers as a key component of storage in visual working memory.
Materials and Methods
Subjects
Experiments included 29 volunteers (Experiment 1, n = 13; Experiment 2, n = 16) participating for monetary compensation ($20 per hour). Subjects were between the ages of 18 and 35 years old, reported normal or corrected-to-normal visual acuity, and provided informed consent according to procedures approved by The University of Chicago Institutional Review Board. Subjects were recruited via online advertisements and fliers posted on the university campus.
Experiment 1
Our target sample in Experiment 1 was 12 subjects. Seventeen volunteers participated in Experiment 1 (8 females; mean age = 24.9 years, SD = 3.8). Four subjects were excluded from the final sample for the following reasons: the session was ended early due to eye movements (n = 2); the subject's data was corrupted or otherwise unusable (n = 2). The final sample size was 13 (6 female; mean age = 25.0 years; SD = 4.1).
Experiment 2
Our target sample in Experiment 2 was 16 subjects. Nineteen volunteers participated in Experiment 2 (9 females; mean age = 26.6 years; SD = 4.0). Three subjects were excluded from the final sample for the following reasons: the session was ended early due to eye movements (n = 2); the subject did not have enough data after artifact rejection (n = 1). The final sample size was 16 (9 female; mean age = 26.7 years; SD = 2.2).
Apparatus
We tested the subjects in a dimly lit, electrically shielded chamber. Stimuli were generated using PsychoPy (Peirce et al., 2019). Subjects viewed the stimuli on a gamma-corrected 24 in LCD monitor (refresh rate = 120 Hz; resolution = 1,920 × 1,080 pixels) with their chins on a padded chin rest at a viewing distance of 75 cm.
Task procedures
Luminance-balanced displays
For each experiment, the luminance of the target color set was measured using an LS-150 Luminance Meter. Next, the luminance meter was used to identify an RGB value that produced a gray matching the average luminance of the target set in each experiment. Placeholders were colored with this gray and set to cover the same area as the targets, controlling the total area and luminance across set sizes.
Experiment 1
Experiment 1 used a whole-field change detection task (Fig. 1a). On each trial, a memory array appeared containing four total elements. There were one or three targets to be remembered, and the remaining items were gray placeholders. Stimuli were presented against a mid-gray background (∼61 cd/m2). Items were positioned with a maximum of one item per quadrant, and all items were placed at least 4° apart from fixation and from one another.
Memory targets were colored circles (radius, 1.3°) with oriented bars cut out of the middle (height, 2.6°; width, 0.5°). The possible orientations were 0°, 45°, 90°, and 135°, and they were sampled without replacement for each trial. The possible colors were randomly sampled without replacement from a set of four colors (RGB values: red, 255, 0, 0; green, 0, 255, 0; blue, 0, 0, 255; yellow, 255, 255, 0). The placeholders were filled circles (radius, 1.13°, to match the same total area) in a shade of gray (RGB value = 149, 150, 149) that matched the average luminance of all possible colors in the color set (see above, Luminance-balanced displays).
On each trial, subjects viewed a memory array (250 ms), remembered items across a delay (1,000 ms), were probed on one item, and reported whether the probed item was the same as or different from the remembered item (unspeeded). In alternating blocks, subjects were instructed to attend to either the color or the orientation of the memory items. Only the attended feature dimension could change. On a change trial, the relevant feature could change into any other value from the set (i.e., any other color or orientation), regardless of whether that feature value was present in another item in the display.
Experiment 2
Experiment 2 used a whole-field change detection task (Fig. 1b). On each trial, a memory array appeared containing three total items. There were one or two targets to be remembered, and the remaining items were gray placeholders. Stimuli were presented against a mid-gray background (∼61 cd/m2). Items were positioned with a maximum of one item per quadrant, with the center of each item randomly selected between 1.5 and 3° away from fixation on both the horizontal and vertical axes.
Memory targets were colored random dot kinematograms (RDKs; radius, 1.5°). The RDKs were presented as 100 individual dots moving through a circular area. Dots were 7.5 pixels in size, lasted four frames, and moved at a speed of 0.06°/s through the circular area. The possible colors were randomly sampled without replacement from a set of seven colors (RGB values: red, 255, 0, 0; green, 0, 255, 0; blue, 0, 0, 255; yellow, 255, 255, 0; purple, 255, 0, 255; teal, 0, 255, 255; orange, 255, 128, 0). The moving dots within each item were either moving coherently (all in one direction) or incoherently (moving in random directions). For coherently moving dots, the direction of movement was randomly sampled from a uniform distribution between 0 and 359° (inclusive) in 1° steps. To discourage subjects from encoding coherence as a specific motion direction, the direction of the probed cloud was again randomly sampled at test so that it would be independent of the original direction. The placeholders were also RDKs with the same dimensions, shown in a shade of gray (RGB value = 166, 166, 166) that matched the average luminance of all possible colors in the color set (see above, Luminance-balanced displays).
After data collection, a coding error was discovered in how coherence was assigned: the coherence options were sampled from a repeated set, such that there were always one or two coherent dot patches in the display. In set size 1, the probability of the target RDK being coherent was 50%, but, due to this error, when the target was coherent, at most one distractor was coherent, and when the target was incoherent, at least one distractor was coherent. Probabilities in the set size 2 condition mirrored those of set size 1; for example, when both targets were coherent, there was no possibility that the distractor would be coherent, and vice versa. To address this, we examined whether we could decode coherence. As described in the Results section, the amount of coherence was decodable only for targets and only when subjects were attending to coherence. Given that coherence was not decodable during the color blocks, stimulus-driven effects of coherence cannot explain the common load signature between the color and motion coherence conditions.
On each trial, subjects viewed a memory array (500 ms), remembered items across a delay (1,000 ms), were probed on one item, and reported whether the probed item was the same as or different from the remembered item (unspeeded). In alternating blocks, subjects were instructed to attend to either the color or the motion coherence of the memory items. Only the attended feature dimension could change. As in Experiment 1, on a change trial, the relevant feature could change into any other value from the set (i.e., any other color, or the alternate coherence level), regardless of whether that feature value was present in another item in the display.
EEG acquisition and preprocessing
EEG acquisition
We recorded EEG activity from 30 active Ag/AgCl electrodes mounted in an elastic cap (Brain Products, actiCHamp). We recorded from international 10–20 sites Fp1, Fp2, F7, F3, Fz, F4, F8, FT9, FC5, FC1, FC2, FC6, FT10, T7, C3, Cz, C4, T8, CP5, CP1, CP2, CP6, P7, P3, Pz, P4, P8, O1, Oz, and O2. Two additional electrodes were affixed with stickers to the left and right mastoids, and a ground electrode was placed in the elastic cap at position Fpz. All sites were recorded with a right-mastoid reference and were rereferenced off-line to the algebraic average of the left and right mastoids. Data were filtered on-line (low cutoff, 0.01 Hz; high cutoff, 80 Hz; slope from low to high cutoff, 12 dB/octave) and were digitized at 500 Hz using BrainVision Recorder (Brain Products) running on a PC. Impedance values were brought below 10 kΩ at the beginning of the session.
Eye tracking
We monitored gaze position using a desk-mounted EyeLink 1000 Plus infrared eye tracking camera (SR Research). Gaze position was sampled at 1,000 Hz. According to the manufacturer, this system provides spatial resolution of 0.01° of visual angle and average accuracy of 0.25–0.50° of visual angle. We calibrated the eye tracker every one to two blocks of the task and between trials during the blocks if necessary. We drift-corrected every five trials. Additionally, we drift-corrected the eye tracking data for each trial by subtracting the mean gaze position measured during the 200 ms window immediately preceding the memory array.
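For illustration, the following is a minimal sketch of this per-trial drift correction, assuming a hypothetical gaze array epoched at 1,000 Hz with the memory array onset at sample 200 (all names and shapes are illustrative, not the exact implementation):

```python
import numpy as np

def drift_correct(gaze: np.ndarray, array_onset: int = 200) -> np.ndarray:
    """Subtract each trial's mean gaze position during the 200 ms window
    (200 samples at 1,000 Hz) immediately preceding the memory array.

    gaze: array of shape (n_trials, n_samples, 2) holding x/y positions.
    """
    baseline = gaze[:, array_onset - 200:array_onset, :].mean(axis=1, keepdims=True)
    return gaze - baseline
```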
Artifact rejection
We segmented the EEG data into epochs time locked to the onset of the memory array (200 ms before until 1,000 ms after stimulus onset). We baseline-corrected the EEG data by subtracting mean voltage during the 200 ms window immediately prior to stimulus onset. Eye movements, blinks, blocking, drift, and muscle artifacts were detected by applying automatic criteria, and we discarded any epochs contaminated by artifacts. All subjects included in the final sample had at least 140 trials of each condition.
Eye movements and blinks
We employed real-time eye movement detection. If subjects moved their eyes >1.25° from fixation, the trial was interrupted and their eye position was shown to them for feedback purposes. Interrupted trials were made up at the end of each block. During preprocessing, we rejected trials that contained eye movements beyond 1° of visual angle using the pop_artextval function in ERPLAB (Lopez-Calderon and Luck, 2014).
Drift and muscle artifacts
We checked for drift (e.g., skin potentials) with the pop_rejtrend function in ERPLAB. We checked for muscle artifacts and extreme values with the pop_artextval function in ERPLAB, excluding trials in which EEG activity exceeded ±80 µV. We also excluded trials with peak-to-peak activity >100 µV within a 200 ms window advancing in 100 ms steps.
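For illustration, here is a minimal Python sketch of equivalent threshold checks (the analyses themselves used ERPLAB functions in MATLAB; the array shapes and sampling assumptions are illustrative):

```python
import numpy as np

def exceeds_absolute(epoch: np.ndarray, thresh_uv: float = 80.0) -> bool:
    """Flag an epoch (n_channels, n_samples, in µV) containing any sample
    beyond ±thresh_uv."""
    return bool(np.any(np.abs(epoch) > thresh_uv))

def exceeds_peak_to_peak(epoch: np.ndarray, thresh_uv: float = 100.0,
                         win: int = 100, step: int = 50) -> bool:
    """Flag epochs whose peak-to-peak amplitude exceeds thresh_uv in any
    200 ms window (100 samples at 500 Hz), advancing in 100 ms steps."""
    for start in range(0, epoch.shape[1] - win + 1, step):
        seg = epoch[:, start:start + win]
        if np.any(seg.max(axis=1) - seg.min(axis=1) > thresh_uv):
            return True
    return False
```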
Multivariate decoding and generalization procedure
All decoding analyses followed the framework of the multivariate load analysis (mvLoad), a within-subject decoding approach using the Scikit-Learn Logistic Regression model (Pedregosa et al., 2011), which has been previously described (Adam et al., 2020; Thyer et al., 2022). First, we divided each trial into nonoverlapping 25 ms windows and calculated the average voltage for each electrode in each window. Next, we randomly binned trials of the same condition into sets of 15 trials and averaged across the trials within each bin to increase the signal-to-noise ratio. The binned trials were then sorted into training and testing sets, stratified by trial condition, using the train_test_split function from Scikit-Learn (Pedregosa et al., 2011). This cross-validation procedure splits the binned data into 80% training and 20% testing sets while balancing the percentage of samples for each condition. Training data were z-score normalized by their mean and standard deviation at each time point using the StandardScaler Scikit-Learn function, and test data were z-score standardized using the mean and standard deviation of the training set, before training and testing the model at each time point. Our measure of decodability on the test sets is described below. A measure using the test set with shuffled labels was also recorded to produce an empirical null distribution. This process, including the initial random binning, was repeated 1,000 times, and the results for each subject at each time point were averaged across repetitions.
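For concreteness, the following is a minimal sketch of one repetition of this pipeline at a single time window; the synthetic arrays stand in for real epoched voltages, and all names and shapes are illustrative assumptions rather than the exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 30))   # stand-in: 240 trials x 30 electrodes (one 25 ms window)
y = np.repeat([1, 3], 120)       # stand-in set size labels

def bin_trials(X, y, bin_size=15, rng=rng):
    """Randomly average same-condition trials into bins of bin_size trials."""
    Xb, yb = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        for i in range(0, len(idx) - bin_size + 1, bin_size):
            Xb.append(X[idx[i:i + bin_size]].mean(axis=0))
            yb.append(c)
    return np.array(Xb), np.array(yb)

Xb, yb = bin_trials(X, y)
# 80/20 split, stratified by condition
X_tr, X_te, y_tr, y_te = train_test_split(Xb, yb, test_size=0.2, stratify=yb)
scaler = StandardScaler().fit(X_tr)   # z-score using training statistics only
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
```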
Decodability measure: hyperplane contrast
In past work (Thyer et al., 2022), we tested the generality of mvLoad EEG patterns by training the model on one condition and measuring decoding accuracy in a different condition. High decoding accuracy was taken to indicate reliable generalization of the load patterns across the two conditions. Although this is a very common approach for examining the generalizability of a multivariate pattern, it is possible for generalization to be imperfect even if there is a perfectly overlapping load signal. For instance, if there are differences in EEG activity due to feature-specific activity, and those differences are not fully orthogonal to the axis that encodes WM load, then generalization could be undermined despite a shared representation of load. To provide an analogy, there is an additive shift in the CDA (i.e., an increase that is constant across all set sizes) when people hold orientations in WM rather than colors (Woodman and Vogel, 2008). If a model's decision boundary was fit to separate set size 1 from set size 2 based on the amplitude of the CDA with colors, it would be biased to call all orientation trials set size 2, even though the same CDA difference between set sizes is present. A better alternative is to estimate the vector differentiating pairs of conditions (e.g., the CDA set size difference) and ask how similar it is to the vector differentiating a new pair of conditions.
Given those considerations, we attempted to isolate the multivariate axis that was specific to WM load by estimating a “hyperplane contrast.” We recorded each test trial bin's signed distance from the fit model's decision boundary, also known as a hyperplane, and distances for trials of the same condition were averaged together. Decodability was defined as the contrast (i.e., the difference) between the two conditions' mean distances from the hyperplane. This is a cross-validated estimate of the distance between two conditions along the discriminating axis that best separates them. When this procedure is performed using a linear discriminant analysis (LDA) as the classifier, it is called the linear discriminant contrast (LDC) and is equivalent to cross-validated Mahalanobis distance (Walther et al., 2016). The only difference between the LDC and this procedure is that we use logistic regression to identify the hyperplane, rather than LDA. To test generalization, we recorded the signed distance of test data from a pair of conditions from a held-out context (i.e., the other attended feature) and computed the contrast between the two conditions. The random binning and averaging of the held-out context data happened at the same time as it did for the trained context data.
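Continuing the hypothetical sketch above, the hyperplane contrast can be computed from the fitted model's signed distances; the held-out-context arrays referenced in the final comment are assumed to be binned in the same way as the training-context data:

```python
import numpy as np

def hyperplane_contrast(model, scaler, X_test, y_test, cond_a, cond_b):
    """Mean signed distance of cond_a test bins minus that of cond_b bins."""
    scores = model.decision_function(scaler.transform(X_test))
    dist = scores / np.linalg.norm(model.coef_)  # signed distance to the hyperplane
    return dist[y_test == cond_a].mean() - dist[y_test == cond_b].mean()

within = hyperplane_contrast(model, scaler, X_te, y_te, 3, 1)
# Generalization: apply the same model to bins from the other attended feature,
# e.g., across = hyperplane_contrast(model, scaler, X_other, y_other, 3, 1)
```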
This hyperplane contrast is beneficial for a few reasons. First, it is less coarse than accuracy, which requires recoding the test scores into binary categories. This binarization treats trials adjacent to the hyperplane the same as trials extremely far from the hyperplane. Second, it is not impacted by the exact placement of the hyperplane along the discriminant axis, which is beneficial when estimating generalization on held-out conditions. If there is a signal that indicates a change in context that is not parallel to the hyperplane (i.e., not orthogonal to the discriminant axis; e.g., the additive shift for orientations described above), this would bias classification, even if the true signal was present in the new context and identical to the original signal that the model was trained on. Third, it is unbounded, in contrast to accuracy and area-under-the-curve, so it should not be warped by ceiling effects and should grow proportionally with the signal separating the conditions. This makes the interpretation of the contrast straightforward: if the contrast in the training context matches the contrast in the held-out context, and the same holds when the contexts are flipped, then the discriminating axis and the magnitude of the signal separating the conditions are the same in both contexts. That is, it is the same signal.
Attended feature decoding
In both experiments, we assessed whether we could decode which feature was attended. When setting the training data, the number of trials for each set size was equated to prevent the model from leveraging a potential content-independent WM load signal.
Load decoding and generalization
In both experiments, we assessed the decodability of load within each feature and whether the load signal generalized to the other feature (e.g., whether a model trained on color data would accurately decode orientation data, and vice versa). Trials for the held-out feature were also randomly binned, and 20% were randomly selected for testing on each permutation following the same procedure as the trained feature (see above).
Coherence level decoding
We assessed whether we could decode the coherence level in Experiment 2. First, we asked whether we could decode the number of coherent items in the display while controlling for the number of attended coherent items. To maximize power, we combined three pairs of conditions, which differed only in the number of coherent distractors: set size 1 trials with no coherent targets and either 1 or 2 coherent distractors, set size 1 trials with 1 coherent target and either 0 or 1 coherent distractors, and set size 2 trials with 1 coherent target and either 0 or 1 coherent distractors. Within each pair of conditions, the number of trials was equated by randomly downsampling the condition with more trials on each permutation. Once the number of trials was equated for each pair, all conditions with fewer coherent distractors were randomly binned together, as were all conditions with more coherent distractors. As the proportions of trials contributed by the three pairs after downsampling were roughly ¼, ¼, and ½, we set the number of trials per bin to 16. After binning, decoding proceeded in the same way as the above analyses.
Next, we asked if we could decode the number of attended coherent targets. The above analysis showed no evidence for a signal reflecting the number of coherent distractors (see Results). Therefore, to maximize power, we focus just on the number of coherent targets. We did this by using set size 1 trials and contrasting trials with a coherent target against trials with an incoherent target. Both analyses were run separately for the attend coherence and the attend-color blocks.
Significance testing
In all decoding analyses, we tested whether decodability was significantly above chance at each time point using a paired-samples, one-tailed t test against the empirical null. This empirical null was defined by testing the trained models on randomly shuffled trial labels (see above). To test for imperfect generalization, we assessed whether the difference in the hyperplane contrast between the trained feature and the held-out feature was larger than the difference between the two empirical nulls, also with a paired-samples, one-tailed t test. Because we tested for significance at each time point, we used the Benjamini–Hochberg procedure to control the false discovery rate (FDR) at 0.05.
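A minimal sketch of this per-time-point test, assuming hypothetical (subjects × time points) arrays of real and label-shuffled (empirical null) contrasts:

```python
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

def test_timecourse(real, null, alpha=0.05):
    """One-tailed paired t test (real > null) at each time point, followed by
    Benjamini-Hochberg FDR correction across time points."""
    _, p = ttest_rel(real, null, axis=0, alternative='greater')
    sig, p_adj, _, _ = multipletests(p, alpha=alpha, method='fdr_bh')
    return sig, p_adj
```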
Eye movement controls
Eye-based decoding and correlations with EEG results
To examine whether decoding was driven by neural activity related to eye movements, we repeated all decoding analyses using the eye movement data available for each subject to see if any signals were decodable. These analyses were run simultaneously with the main decoding analyses, so that the same trials were randomly binned together and the same train–test split occurred. This allowed us to also ask whether the individual test sets produced similar predictions between the eye and EEG data. For each participant, we correlated the predictions across all 1,000 repetitions to assess the shared information between them. These correlations were transformed to Fisher's z scores and tested at the group level against 0 at each time point using a one-tailed t test. p values were FDR corrected, as above. In Experiment 2, one subject's eye data could not be aligned, so they were excluded from these analyses. All of the above analyses were rerun without this subject to ensure there were no qualitative differences. The only change was that the target coherence level in attend-motion blocks was no longer significantly decodable at any time point after FDR correction, though several time points trended toward significance (p < 0.1).
Dissociating EEG-based decodability from eye-based decodability across participants
In the above analyses, we tested whether EEG-based decodability aligned with eye-based decodability across trials, within each subject. As an additional test of whether EEG-based decodability was driven by eye-based decodability, we tested whether the two aligned across subjects. As the major motivation of this work is whether WM load signals generalize across attended features, we focused on the strength of the across-feature contrast (training on one feature and finding the size of the contrast in the other), averaging across the two models per experiment. To increase the SNR, we averaged across the last 500 ms of the delay period in each experiment, as both the EEG-based and eye-based contrasts appeared stable during this window, and combined the data across experiments. We fit a linear regression model predicting an individual's average EEG-based cross-feature contrast from their eye-based cross-feature contrast, with a separate intercept for each experiment.
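A minimal sketch of this regression, with a dummy-coded intercept per experiment and a shared slope; the data arrays are synthetic stand-ins:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
exp = np.repeat([1, 2], [13, 15])     # subjects per experiment (stand-in counts)
eye = rng.normal(size=exp.size)       # eye-based cross-feature contrasts (stand-in)
eeg = 0.5 + 0.3 * eye + rng.normal(scale=0.5, size=exp.size)  # EEG contrasts (stand-in)

# Design matrix: one intercept column per experiment plus a shared slope for eye
X = np.column_stack([(exp == 1).astype(float), (exp == 2).astype(float), eye])
fit = sm.OLS(eeg, X).fit()
print(fit.params)                     # [intercept_exp1, intercept_exp2, slope]
```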
Representational similarity analysis
While decoding gives us an indication of which signals are present and of their similarity across contexts, decoding results can be difficult to interpret, as models will leverage all available signals to discriminate between conditions. For example, the ability to decode which feature is attended could result from the model leveraging task-set-like signals, feature-specific signals, or a mixture of both. RSA provides an alternative means of assessing which signals are present in the data. Here, we leverage RSA to assess whether feature-specific load signals are present in the data and whether a content-independent load signal remains while controlling for other confounds.
We only ran this analysis in Experiment 2, as RSA benefits from condition-rich designs (Kriegeskorte et al., 2008). Experiment 1 can only be divided into four total conditions (2 set sizes × 2 attended features). In contrast, the decoding results below provide initial evidence that the coherence level of the memory targets can be decoded when participants are tasked with holding the coherence level in WM, suggesting Experiment 2 can be meaningfully divided into 10 conditions once the coherence level of the memory targets in the displays is included. The division into specific coherence levels is beneficial because it allows examination of the content-independent load signal while simultaneously providing a complementary means of assessing the apparent coherence level signal that was identified via decoding. Such convergence would confirm that scalp EEG activity tracks the specific coherence values that are held in WM. This could be a useful signal for future studies focused on feature-specific activity and its time course.
At a high level, we computed the empirical representational dissimilarity matrices (RDMs) between every pair of conditions at each time point for each subject, using the same time windows as the decoding analyses. We then computed the semipartial correlation between these empirical RDMs and predicted dissimilarities based on a set of theoretically relevant factors: feature-specific load signals, an attended feature signal, an attended coherence level signal, and a content-independent load signal. We assessed whether these semipartial correlations were significantly greater than 0 at the group level. Because some regressors were based on pupil size, only the 15 subjects with available eye data were analyzed.
Computation of empirical dissimilarities
To estimate the empirical RDMs, we computed the cross-validated Mahalanobis distance, also known as the LDC, between every pair of conditions. This is known to be a reliable measure for constructing RDMs (Walther et al., 2016). In addition, as mentioned above, it is a variant of the hyperplane contrast used in our decoding analyses that effectively replaces the logistic regression classifier with a linear discriminant classifier. The LDC was computed simultaneously for each pair of conditions A and B with the following formula:

$$\mathrm{LDC}_{A,B} = \left(\bar{x}_A^{\text{train}} - \bar{x}_B^{\text{train}}\right)^{\top} \Sigma^{-1} \left(\bar{x}_A^{\text{test}} - \bar{x}_B^{\text{test}}\right)$$

where $\bar{x}_A$ and $\bar{x}_B$ are the mean voltage patterns across electrodes for conditions A and B, the train and test superscripts denote independent splits of the data, and $\Sigma$ is the noise covariance matrix estimated from the training split.
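The following is a minimal sketch of the LDC for one pair of conditions, assuming (trials × electrodes) arrays split into independent train and test halves; the shrinkage covariance estimator (Ledoit-Wolf) is an illustrative choice rather than a detail confirmed by the text:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def ldc(a_train, b_train, a_test, b_test):
    """Cross-validated Mahalanobis distance (LDC) between conditions a and b."""
    # Noise covariance estimated from demeaned training trials (shrinkage-regularized)
    resid = np.vstack([a_train - a_train.mean(0), b_train - b_train.mean(0)])
    cov_inv = np.linalg.inv(LedoitWolf().fit(resid).covariance_)
    d_train = a_train.mean(0) - b_train.mean(0)   # train-split mean difference
    d_test = a_test.mean(0) - b_test.mean(0)      # test-split mean difference
    return d_train @ cov_inv @ d_test
```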
Theoretical dissimilarities
We compared the empirical RDMs to a set of RDMs constructed to reflect predicted differences based on theoretical factors. Four of the RDMs were constructed based on theory, and four were based on empirical results. Our content-independent load RDM predicted that all set size 1 trials would look identical, all set size 2 trials would look identical, and all set size 1 trials would look equally distinct from all set size 2 trials. Our attended-feature RDM predicted that all attend-color conditions would be equally similar to one another and equally dissimilar to all attend-motion conditions (which would all be equally similar to one another). We also included two feature-specific load signals, one for the number of maintained colors and one for the number of maintained motion patches. Each assumed that the amount of feature-specific information for the attended feature increased stepwise with set size and was 0 (or at least smaller, as the distances are converted to ranks; see below) when the other feature was attended. Note that, because the distances are rank transformed (see below), only the relative size of the steps matters. While we present results assuming an equal step from load 0 to load 1 and from load 1 to load 2 for the feature-specific load signals, we confirmed that our results are qualitatively identical if one assumes compressive steps (the load-1–load-2 step is smaller than the load-0–load-1 step) or expansive steps (the load-1–load-2 step is larger than the load-0–load-1 step).
We also included four additional regressors, based on empirical results. First, we included an RDM for the attended coherence level, as we found slight evidence for an attended coherence level signal (see Results). We did not have strong predictions for how the attended coherence level signal would change at set size two. It might reflect the number of coherent targets, in which case the signal might be confounded with set size and mimic a motion-specific load signal, as the number of coherent targets will increase with the number of targets. Alternatively, the signal might reflect the proportion of coherent targets, in which case it would be deconfounded from set size. Thus, we empirically estimated the RDM. Using the set size 1 model to decode the target coherence level in attend-motion blocks, we found the distance from the hyperplane for the two training conditions and the eight remaining conditions, averaged across subjects. These distances were converted into pairs of distances between all 10 conditions. This was done at each time window separately, as the coherence level signal may change across time. To avoid data leakage, each subject's data was excluded before averaging and converting the hyperplane distances into pairwise distances, resulting in a unique coherence level RDM for each subject, at each time window, that did not contain their data.
Second, we included a set of three regressors to examine whether some effort signal contributed to the apparent content-independent load signal. As proxies for effort, we used condition-level accuracy, pupil size, and the interaction of accuracy and pupil size to generate RDMs. We chose accuracy as one potential proxy for effort under the assumption that it may reflect the difficulty of conditions. Some subjects may expend more effort in more difficult conditions, in which case the RDMs for accuracy and effort would be the same, or they may not expend different levels of effort, in which case effort could not explain the apparent content-independent load signal. Relatedly, we chose pupil size as an alternative proxy for effort, based on previous work linking pupil size both to WM load (Koevoet et al., 2023) and to effort (van der Wel and van Steenbergen, 2018). Pupil size RDMs were built in each time window via the following procedure. First, the pupil diameter was baselined using the same window as the EEG data, i.e., the 200 ms preceding stimulus onset, and then binned in 25 ms time windows. To facilitate intersubject averaging, the pupil data were z-score standardized across all trials and time points (Naber et al., 2013) and then averaged within each condition across time. Finally, the data were averaged across subjects within each condition, and the RDMs were constructed for each time window. Lastly, we included an RDM reflecting the interaction of pupil size and accuracy. To do this, we z-score standardized both the group-averaged accuracy across conditions and the group-averaged pupil size across conditions for each time window and multiplied the two together before constructing the final RDM at each time window.
We examined the unique information for each regressor relative to the other regressors by computing the variance inflation factor (VIF) for each subject at each time point (as the coherence level, pupil, and accuracy × pupil RDMs changed across time). The VIF of a given regressor $i$ is a function of the proportion of that regressor's variance that is explained by the other regressors ($R_i^2$) and is computed as follows:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the coefficient of determination obtained by regressing regressor $i$ on all the remaining regressors.
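A minimal sketch of this computation over a design matrix whose columns are the vectorized RDM regressors (names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(design: np.ndarray) -> np.ndarray:
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i
    of the design matrix on all remaining columns."""
    out = np.empty(design.shape[1])
    for i in range(design.shape[1]):
        others = np.delete(design, i, axis=1)
        target = design[:, i]
        r2 = LinearRegression().fit(others, target).score(others, target)
        out[i] = 1.0 / (1.0 - r2)
    return out
```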
Identifying theoretical factors that explain reliable neural variance
To compare the overall empirical RDM to our factor RDMs, we used a rank regression procedure (Iman and Conover, 1979; Kiat et al., 2022). For each theoretical factor, we computed the semipartial rank correlation by subtracting the R2 of a submodel excluding that factor from the R2 of the full model and multiplying the square root of this difference by the sign of the factor's coefficient in the full model. As in Kiat et al. (2022), we chose rank correlation because we did not assume a linear relationship between any of our theoretical factors and the observed condition distances. We tested these correlations at the group level using a one-tailed Wilcoxon signed-rank test against zero. We tested each time window, with FDR correction using the Benjamini–Hochberg procedure.
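A minimal sketch of the signed semipartial rank correlation for a single factor, assuming vectorized empirical and factor RDMs (all names are illustrative):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LinearRegression

def semipartial_rank(empirical, factors, i):
    """Signed semipartial rank correlation of factor i with the empirical RDM:
    sqrt(R2_full - R2_without_i), signed by factor i's full-model coefficient."""
    y = rankdata(empirical)
    X = np.column_stack([rankdata(f) for f in factors])
    full = LinearRegression().fit(X, y)
    r2_full = full.score(X, y)
    X_sub = np.delete(X, i, axis=1)
    r2_sub = LinearRegression().fit(X_sub, y).score(X_sub, y)
    return np.sign(full.coef_[i]) * np.sqrt(max(r2_full - r2_sub, 0.0))
```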
Estimating spatial attention signals
One possible explanation for a generalizable, content-independent load signal is that it reflects the allocation of spatial attention which differs across set sizes but is consistent across attended feature conditions. To examine this possibility, we assessed whether we could decode how spatial attention was distributed across trials and whether it differed across set sizes, in both experiments. For set size 1 trials, we attempted to decode which quadrant the colored target was in. For set size 3 (Experiment 1) or set size 2 (Experiment 2) trials, we attempted to decode which quadrant the irrelevant placeholder was in. The broader question we examined was whether these spatially specific patterns could provide an alternative explanation for why load decoding generalized across distinct features.
Given that there are four possible locations for each set size and each attended feature (16 total conditions per experiment), we used an analytical pipeline similar to the RSA above. Namely, at each time point, we computed the cross-validated Mahalanobis distance between all conditions simultaneously across 1,000 train–test splits, following the same procedure as above. Across splits, the different locations in the training set were balanced via downsampling to contribute equally to the estimate of the covariance matrix. We next took the signed square root of these distance measures as a cross-validated estimate of d′, i.e., the SNR between conditions. At each time point and for each participant, we averaged the d′ values across location pairs, separately for each set size of each attended feature.
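Under the same assumptions as the LDC sketch above, the cross-validated d′ estimate is simply the signed square root of the LDC:

```python
import numpy as np

def ldc_to_dprime(ldc_val: float) -> float:
    """Signed square root of the LDC, a cross-validated estimate of d'."""
    return float(np.sign(ldc_val) * np.sqrt(np.abs(ldc_val)))
```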
For each set size of each attended feature, we tested whether spatial decodability between locations (average d’) was greater than 0 using a one-tailed t test at each time point, with FDR correction using the Benjamini–Hochberg procedure as above. At each time point, we also ran a two-way repeated-measures analysis of variance (rmANOVA) with a factor for attended feature (color vs orientation/motion coherence) and set size (1 vs 3/2). p values were FDR corrected using the Benjamini–Hochberg procedure.
Results
Behavior is impacted by WM load and remembered feature
Across both experiments and conditions, subjects performed the task with above-chance accuracy (Fig. 2). In Experiment 1, an rmANOVA revealed a significant main effect of set size, indicating that accuracy declined as set size increased, F(1,11) = 20.59, p < 0.001. There was also a significant main effect of attended feature, F(1,11) = 32.74, p < 0.001, and a significant interaction of attended feature and set size, F(1,11) = 16.72, p < 0.001. Accuracy was higher for color (M = 0.95; SD = 0.069) than for orientation (M = 0.89; SD = 0.099), and this effect was larger in the set size 3 condition than in the set size 1 condition (t(11) = 4.09; p = 0.0018).
In Experiment 2, an rmANOVA revealed a significant main effect of set size, indicating that accuracy declined as set size increased, F(1,12) = 31.47, p < 0.001. There was also a significant main effect of attended feature, F(1,12) = 26.02, p < 0.001, and a significant interaction of attended feature and set size, F(1,12) = 27.76, p < 0.001. Accuracy was higher for color (M = 0.98; SD = 0.027) than for motion coherence (M = 0.848, SD = 0.097), and this effect was larger in the set size 2 condition than in the set size 1 condition (t(15) = 4.68; p = 0.0003).
The attended feature can be reliably decoded across time
EEG-based decodability
In each experiment, we first assessed whether subjects were attending to the relevant feature by computing the decodability of the attended feature (Figs. 3, 4). In both experiments, decodability, as measured by the hyperplane contrast, was sustained throughout the entire delay period (Fig. 4). In Experiment 1, the mean hyperplane contrast during the delay period was 0.730 (arbitrary units, SD = 0.756). In Experiment 2, the mean hyperplane contrast during the delay period was 3.696 (arbitrary units, SD = 1.564). We note that this is a very large signal: if the predictions were binarized by category and scored as accuracy instead of the contrast, decoding would peak at ∼90% and end at ∼70%. In line with this, remarkably different ERPs were evoked by physically identical stimuli when the relevant feature switched between color and motion coherence (Fig. 3).
Eye-based decodability
We also assessed whether the decodability of the attended feature was driven by eye movements. In Experiment 1, the attended feature could not be decoded from eye movements at any time point. Despite this, there was a correlation of the contrasts across permutations between the two models, which was significant at times. However, this correlation was very small (mean Fisher's Z across the delay = 0.0263; equivalently, r = 0.0263). In Experiment 2, there was sustained decodability of the attended feature from eye movements throughout the delay. However, the time courses of decodability differed, with eye-based decodability showing a secondary increase late in the delay period that was absent from the EEG results. In addition, there was no significant correlation of the contrasts across permutations between the two models. Together, this suggests that eye movements had little, if any, impact on the decodability of the attended feature.
The amount of motion coherence can be decoded, but only when attended
In addition to attended feature decodability, we examined whether we could decode the level of coherence (i.e., the number or proportion of coherent stimuli) in the display. We focus on decoding coherence in the attend-motion blocks, though we repeated the same analyses in the attend-color blocks.
First, we asked whether we could decode the total amount of coherence on the screen, regardless of how many coherent items were stored, by identifying pairs of conditions that differed only in the number of coherent distractors and pooling them together. We found no decodability for the total amount of coherence, either via EEG signals or eye movements (Fig. 5, left column).
We next asked whether we could decode the number of coherent stimuli that were stored in working memory by contrasting set size 1 trials with and without a coherent target (Fig. 5, right column). We were able to decode the attended coherence level via both EEG and eye movements, though their contrasts were not correlated across permutations. Notably, there were only two significant time points overall after FDR correction, and there were no longer any significant time points after excluding the participant for whom we did not have eye tracking data, indicating that the signal-to-noise ratio of this signal is low.
We repeated these analyses using attend-color trials and found no decodability of the amount of coherence via either EEG or eye movements and no correlations in the contrasts across permutations. This suggests that subjects maintained information about the number of coherent clouds only when that information was relevant to the task at hand, in line with previous findings that observers have voluntary control over which features of an object are stored (Woodman and Vogel, 2008; Serences et al., 2009). Future work may confirm the presence of this signal with a larger sample and leverage it to study feature-specific activity and its time course in WM.
WM load can be decoded and robustly generalizes across features
EEG-based decodability
We next assessed whether we could decode WM load. We assessed the decodability and generalizability simultaneously; we trained a model to decode load within each attended feature and assessed the decodability of load both for that feature via held-out test data (within-feature) and for the other attended feature (across-feature).
Figure 6 summarizes our results in Experiment 1. We were able to decode load for both color and orientation, with sustained decodability within-feature across the delay period. In addition, we saw effectively perfect generalization. Across-feature decodability was also sustained throughout the delay period, and there was no time point in which the across-feature contrast was significantly smaller than the within-feature contrast in either case.
Figure 7 summarizes our results in Experiment 2. We again were able to decode load for both color and motion coherence, with sustained decodability within-feature across the delay period. When training on color data and testing on coherence, we again saw effectively perfect generalization, as there was no time point in which the color-based contrast was significantly bigger than the coherence-based contrast. When training on coherence, we again saw generalization to color data, but it was imperfect. The color-based contrast was significantly smaller than the coherence-based contrast, especially during the delay period. This suggests that the model was able to pick up on motion-specific load signals during training, hurting the generalization to color. This may also affect our ability to decode the attended feature, as the model may leverage feature-specific load signals to differentiate between the attended feature conditions.
Eye-based decodability
We also assessed whether eye movements drove load decoding in the two experiments (Figs. 6, 7, bottom 2 rows). In both experiments, load was decodable across the delay period within features, with effectively perfect generalization across features. However, the time courses of decodability and generalization differed from those of the EEG-based results, beginning in the delay period and plateauing, whereas EEG-based decodability began during the stimulus presentation and decreased over the delay period. In addition, the contrasts from EEG-based and eye-based models trained and tested on the same data were not correlated across permutations. The mean Fisher's Z for the eight models (2 experiments × 2 training conditions × 2 testing conditions) across the trial epochs ranged between −0.004 and 0.013, with only a single significant time point (Experiment 1, training on orientation and testing on color, during the baseline period, after FDR correction via the Benjamini–Hochberg procedure).
Eye-based decodability does not drive EEG-based results across individuals
The preceding analysis examined whether eye-based decoding results corresponded to EEG-based results across permutations within individuals, finding no evidence of correlations between the two. However, the permutation outputs that were used for the contrasts may be too noisy to reliably identify correlations. Therefore, we also examined whether eye-based results drove EEG-based results across individuals. Specifically, we fit a linear regression to predict the average across-feature EEG-based contrast from the average across-feature eye-based contrast. To increase the SNR, we averaged across the last 500 ms of each experiment's delay period and pooled across experiments (see Materials and Methods). The model included a unique intercept per experiment but a shared slope associating the EEG-based results with the eye-based results.
The model was significant (Fig. 8; F(2,25) = 3.616; p = 0.042), with a significant regressor for eye-based decoding (t = 2.35; p = 0.027). However, both intercepts were also significant, meaning that EEG-based decoding was above chance even after regressing out the impact of eye-based decoding across individuals (Exp1: t = 2.067, p = 0.049; Exp2: t = 2.082, p = 0.048). Furthermore, visual inspection of Figure 8 suggests that the significant relationship between EEG-based decoding and eye-based decoding was driven by two participants in Experiment 1 (Fig. 8, top right points). Removing these subjects from the model removed the association but did not impact the significance of the intercepts (F(2,23) = 0.114, p = 0.89; teye = 0.268, p = 0.791; tExp1 = 2.671, p = 0.014; tExp2 = 2.548, p = 0.018). The results were preserved if we additionally removed the remaining subject that was furthest from the other data points (Fig. 8, top left point; also from Experiment 1) along with the other two subjects. Lastly, we confirmed that the EEG-based load decoding and generalization results of Experiment 1 are qualitatively identical after excluding these participants. Taken together with the differences in time courses between the EEG-based and eye-based decodability, these results suggest that, while eye movements also vary systematically across WM loads in a way that is independent of the maintained features, eye movements cannot explain the generalization of WM load signals that we observe in EEG.
Representational similarity analysis identifies feature-specific and content-independent signals
In addition to the decoding analyses described above, we also examined the similarity structure of the Experiment 2 conditions across time, using RSA. RSA is complementary to our decoding analyses in two important ways. First, decoding models will leverage all available signals for differentiating a pair of conditions, making interpretation difficult if multiple features are changing. Second, RSA forces researchers to make relatively precise predictions about the nature of signals of interest, and RSA enables concurrent measurement of the unique variance in neural activity that is associated with each predicted signal. Thus, we used RSA to confirm the presence of a content-independent load signal, while simultaneously examining the presence of other related signals. Within each time window, we computed the semipartial rank correlation between our empirical RDMs and a set of RDMs based on theoretical factors (see Materials and Methods). This allows us to examine the variance that is uniquely explained by a given factor, while controlling for the other, potentially confounding factors. For example, it is unclear to what extent attended-feature decodability is driven by task-set-like signals or the ability of the models to leverage feature-specific load signals. RSA provided converging evidence for feature-specific and content-independent signals.
We found evidence for the presence of multiple signals in the EEG data (Fig. 9). First, we again found evidence for an attended feature signal, which emerged shortly after stimulus onset (window centered on 52.5 ms) and lasted through a large portion of the delay period (final time window centered on 1,012.5 ms). Second, while we did not find evidence for a color-specific load signal, we did find evidence for a motion-specific load signal, which achieved significance later in the delay period (first window centered on 892.5 ms). Notably, the time course of the motion-specific load signal appeared to match that of the drop in generalization in decodability when training on attend-motion data and testing on attend-color data. This suggests that the imperfect generalization we observed when training on motion data and testing on color data was driven by the existence of motion-specific load signals. As we did not have a large enough sample for an individual differences analysis, we instead ran a within-subject analysis to investigate this, post hoc. For each subject, we correlated the time course of the motion-specific load signal with that of the drop in generalization when training on attend-motion data and testing on attend-color data (contrastmotion − contrastcolor), using the 70 time windows, transformed the correlations to Fisher's Z, and assessed the correlations at the group level with a two-tailed t test. There was a significant correlation between the two time courses at the group level (mean Z = 0.411; mean r = 0.390; t = 7.08; p = 5.48 × 10−6; note that this comparison and the whole RSA were run using the 15 subjects with available eye tracking and pupil size data), supporting the idea that the drop in generalization in decodability was driven by the leveraging of motion-specific load signals.
In this analysis, the coherence signal was not significant at any time point. We compared it to the other regressors by examining its VIF, a measure of the proportion of variance in the coherence RDM that was explained by the other RDMs. When a VIF is high, it can be difficult to recover accurate estimates of a regressor's contribution to a model. With this said, the VIF for coherence was low, peaking at ∼3 around 750 ms (roughly two-thirds of its variance explained) but averaging 1.41 (roughly 30% explained), meaning that it was only partially explained by the other regressors. It may be that previous decoding results simply reflected the same variance captured by these other regressors, or it may be that this partial overlap with other regressors, combined with the exclusion of the subject without available eye tracking data, reduced our sensitivity to an already subtle signal in this analysis. In addition, it may be that the signal reflects each individual's ability to extract coherence and therefore is hindered by using predictions based on group averages.
We also examined accuracy, pupil size, and their interaction as proxies for the varying difficulty or effort across conditions. To increase SNR, given the relatively low trial count per condition, we averaged measures across participants before constructing the predicted RDMs. The accuracy-based predicted differences were significant at a single time point (window centered on 604.5 ms), but the pupil-size-based predicted differences were never significant, nor were the predicted differences based on the interaction of accuracy and pupil size. We also examined predictions based on individual-specific differences in accuracy, pupil size, and their interaction; these produced qualitatively identical results.
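A minimal sketch of how such a difficulty-based model RDM might be constructed is shown below; the condition values, the absolute-difference metric, and the interaction term are illustrative assumptions rather than the paper's exact specification.

```python
# Sketch: predicted dissimilarity between conditions = absolute difference
# in a group-averaged behavioral measure (accuracy or pupil size) between
# those conditions. Values below are hypothetical.
import numpy as np

def difficulty_rdm(condition_means):
    v = np.asarray(condition_means, dtype=float)
    return np.abs(v[:, None] - v[None, :])

# e.g., hypothetical mean accuracies for four conditions
acc_rdm = difficulty_rdm([0.92, 0.78, 0.90, 0.75])
# An interaction predictor could be built analogously from the elementwise
# product of z-scored accuracy and pupil-size condition means.
```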
Finally, we saw evidence for a sustained content-independent load signal (first significant window centered on 124.5 ms, last centered on 1,372.5 ms). This result was qualitatively unchanged if we excluded the color-specific load factor, which appeared to have the worst fit to the empirical RDM, or if we excluded any other subset of regressors. We also examined an alternative coding of accuracy, specified at the level of the behavioral analysis (2 set sizes × 2 attended features). With this regressor included, the motion-specific load signal was no longer significant at any time point. Examination of the VIFs revealed that this accuracy regressor was highly similar to the motion-specific load regressor, producing extremely high VIFs for both accuracy (mean VIF = 246.44) and motion-specific load (mean VIF = 270.71), with over 99.5% of each factor's variance being explained by the others. In this model, the attended feature and content-independent load RDMs were again qualitatively unchanged. Overall, RSA provided clear converging evidence for a content-independent load signal that is independent of the attended feature of the stored items.
Spatial attention signals are transient in raw voltage, unlike WM load signals
One possible explanation for an apparent content-independent load signal is that it is driven by changes in spatial attention across set sizes. To examine whether this is the case in the current dataset, we tested our ability to decode different configurations of targets across set sizes. We took advantage of the fact that each stimulus was presented in a distinct quadrant, meaning that each target or placeholder could be coded by the quadrant in which it was presented. For each experiment and each attended feature, we simultaneously computed the cross-validated d′ (a contrast measure similar to the hyperplane contrast used above; see Materials and Methods) between every pair of target locations for set size 1 and every pair of placeholder locations for the higher set size. We compared each decodability trace against 0 using one-tailed t tests and ran two-way rmANOVAs comparing decodability across set sizes and attended features.
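As an illustration of a cross-validated d′ computation, the sketch below uses an LDA projection as the hyperplane estimate; the paper's exact estimator and fold structure are described in Materials and Methods, and all names and the synthetic data here are our own.

```python
# Sketch of a cross-validated d' between two locations: project held-out
# trials onto the training-set discriminant axis, then compute
# d' = (mean_A - mean_B) / pooled SD of the projected values.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold

def cross_validated_dprime(X, y, n_folds=5, seed=0):
    dprimes = []
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in skf.split(X, y):
        lda = LinearDiscriminantAnalysis().fit(X[train], y[train])
        proj = X[test] @ lda.coef_.ravel()  # distance along the hyperplane normal
        a, b = proj[y[test] == 0], proj[y[test] == 1]
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        dprimes.append((a.mean() - b.mean()) / pooled_sd)
    return np.mean(dprimes)

# Synthetic usage: 80 trials x 32 electrodes, two location labels
rng = np.random.default_rng(1)
X = rng.standard_normal((80, 32))
y = np.repeat([0, 1], 40)
print(cross_validated_dprime(X, y))
```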
In both experiments, we could decode the location of both the set size 1 target and the higher set size placeholder (Fig. 10), and two-way rmANOVAs revealed set size effects on spatial decodability. However, these effects were transient, ending soon after stimulus onset (final time bins centered on 460.5 ms in Experiment 1 and 724.5 ms in Experiment 2). This suggests that the generalizable WM load signals during the delay period are not driven by differences in the strength of spatial attention signals.
One may argue that even if the overall strength of spatial signals is equated across set sizes, there may still be a difference in the format of the spatial signal between set sizes (e.g., an enhancement signal for set size 1 targets and a suppression signal for higher set size placeholders), which could drive load decoding. However, the overall spatial decodability time course is also too transient to support this account. In both experiments, spatial decodability peaked during or soon after the initial stimulus presentation before dropping (Fig. 10). In Experiment 1, orientation-based spatial decodability was sustained, but color-based spatial decodability was sparse after ∼650 ms (time bin centered on 628.5 ms), with a few more significant moments between ∼800 and 1,050 ms (final time bin centered on 1,036.5 ms). In Experiment 2, motion-based spatial decodability of set size 1 targets lasted until ∼1,100 ms (final time bin centered on 1,108.5 ms), and color-based spatial decodability was sparse starting ∼900 ms (window centered on 892.5 ms), with a few more significant time points between ∼1,000 and 1,150 ms (final time bin centered on 1,156.5 ms). Thus, the time windows in which spatial decoding succeeded could not explain the sustained time course over which content-independent load activity was observed.
These time courses also align with recent work that actively separated spatial attention signals from WM load signals using “dot cloud” stimuli, which could change in overall area and overlap with one another, enabling a separation between the number of items stored and the breadth of the attended region (Jones et al., 2024). That work found that the two signals are indeed dissociable and that signals reflecting the breadth of spatial attention fade rapidly across the delay period while WM load signals are sustained. We note that both the previous work and the current work find this drop in spatial signals when examining raw EEG voltage rather than other signal bands such as alpha power, because our question is whether spatial attention signals can drive WM load signals, which are likewise estimated from raw EEG voltage. Taken together, these results suggest that spatial attention signals cannot explain the sustained and generalizable WM load decoding that we find throughout the delay period.
Discussion
Our findings provide evidence for a content-independent WM load signal that tracks the number of items stored regardless of the features that define those items. At the same time, distinct aspects of the EEG signal tracked the attended features, in line with many past studies that have reported stimulus-specific neural activity during WM storage. Thus, we are not challenging the importance of stimulus-specific delay activity. Instead, we are highlighting the importance of a distinct class of storage-related neural activity that is not tied to feature-specific activity.
Earlier work from Thyer et al. (2022) provided evidence for a content-independent load signal by showing strong generalization between WM load models for color and orientation. Here, we provide three critical extensions of that prior work. First, we replicated the generalization between color and orientation load signals using a within-session manipulation of the attended feature, showing near-perfect generalization of the load signature between color and orientation. Second, we showed that WM load signatures generalize across color and motion coherence, features that are known to be processed in cortically disparate regions. Thus, content-independent load patterns cannot be explained by the hypothesis that the neural populations processing the two features were too finely interwoven to be distinguished by EEG. Third, we provide converging evidence for a content-independent load signal from a different analytic approach, RSA, which showed that unique variance in the EEG signal was explained by a content-independent load signal while simultaneously identifying feature-specific signals.
We considered the possibility that this content-independent load signal actually reflects the load for a single consistent feature that is encoded across conditions. For example, in Experiment 2, color was the target-defining feature. If participants must attend to color to select targets in both feature conditions, this signal might reflect the number of colors in memory. However, there are both theoretical and empirical arguments against this possibility. First, if participants must encode colors into working memory in order to assess target status, they would also need to encode the colors of distractors into working memory, leading to equivalent color load across the set sizes. Relatedly, there is growing empirical evidence that people can attend to target-defining features, which requires relatively deep processing, without appearing to maintain the featural information in WM, a phenomenon known as attribute amnesia (Chen and Wyble, 2016). Thus, even if participants attend only to the color of targets and not the distractors to make a target judgment, the literature suggests that they do not encode it into (or else rapidly remove it from) WM when it is not relevant.
Second, we found consistent evidence in this work that participants were selectively attending to the relevant feature. In both experiments, we could decode the attended feature. Further, in Experiment 2, RSA showed evidence for motion-specific load signals, and decoding results provided initial evidence that we could decode the amount of coherence present, but only when it was the relevant feature (though future work with a larger sample size is needed to confirm its presence). The latter finding falls in line with past studies showing that observers can selectively store the relevant features of multifeature objects (Woodman and Vogel, 2008; Serences et al., 2009). For instance, Serences et al. (2009) presented observers with oriented gratings of different colors and found that patterns of activity in the primary visual cortex faithfully decoded the attended but not the unattended features of the gratings. Taken together, this pattern of results strongly argues against the possibility that participants simply maintained one or both features equally regardless of the experimental condition, and it supports our interpretation of the signal as truly content independent.
What underlying process or processes does this signal reflect? Although we offer a working hypothesis that focuses on the binding of items to context, we acknowledge the need to consider alternative explanations based on nonmnemonic processes that might be confounded with WM load. Here, we examined two prominent possibilities: cognitive effort and spatial attention (shown in past work to support rehearsal in visual WM; Awh and Jonides, 2001; Williams et al., 2013). We attempted to capture putative changes in effort across conditions by including accuracy, pupil size, and their interaction as proxies for effort in the RSA. We continued to find evidence for a sustained content-independent load signal, arguing against effort as an explanation. To examine whether content-independent decoding of load could be explained by a common spatial attention signal, we examined whether EEG activity tracked the locus of spatial attention and whether this spatially specific signal distinguished between set sizes. Although we did observe differences in spatial decodability across set sizes, these effects were transient and did not align with the time course of the WM load signal. This is consistent with recent work (Jones et al., 2024), in which we found that spatial attention signals were more transient than load signals and explained distinct variance in EEG activity. Thus, differences in effort or covert spatial orienting do not explain the sustained content-independent load signals observed here. Given these results, we next offer a working hypothesis that these content-independent signals may reflect a content-general process for binding item representations to the surrounding context.
Prominent theoretical accounts of WM storage propose separable neural processes for the maintenance of item representations, on the one hand, and the individuation and binding of the stored representations to the current context, on the other hand (Xu and Chun, 2006; Swan and Wyble, 2014; Balaban et al., 2019; Bouchacourt and Buschman, 2019; Oberauer, 2019). These proposals align with past theories of dynamic visual cognition (Kahneman et al., 1992; Pylyshyn, 2009), which require a spatiotemporal indexing process that enables the continuous monitoring of items through time and space (Hakim et al., 2019; Thyer et al., 2022). Thus, our hypothesis is that content-independent load signals may reflect the number of spatiotemporal pointers that are deployed to bind items to context, enabling their maintenance and retrieval from WM.
This distinction between content-independent pointers and feature-specific neural activity may provide a productive perspective for a number of findings in the working memory literature. For example, Fukuda et al. (2010) found that the number of items that each individual could store in working memory was correlated with fluid intelligence, while no similar link was observed between intelligence and the precision of the stored memories. It is possible that the number of items that can be stored is limited by the number of pointers that can be simultaneously deployed, while mnemonic precision is determined by parallel but separate neural processes that maintain precise feature representations for each stored item. Another key finding in the visual WM literature is that storage is limited by the number of objects instead of the total number of feature values to be stored (Luck and Vogel, 1997). Although there is some behavioral cost of adding new features to objects that are stored in working memory (Olson and Jiang, 2002), there are robust “object-based benefits,” such that substantially more features can be stored when they are integrated within a smaller number of objects. Robust evidence for object-based limits in visual WM storage (Ngiam et al., 2023) argues against a storage ceiling that is determined by the total amount of information stored. However, if there is a limit to the number of content-independent pointers that can be concurrently deployed, this could explain why visual WM performance is better predicted by the total number of objects to be stored than by the total amount of feature information contained within those objects. Finally, a dichotomy between the maintenance of stimulus-specific details and content-independent indexing operations may explain why distinct regions of the cortex have been shown to track the number and complexity of objects in visual WM (Xu and Chun, 2006). For example, the motion-specific load signals found in this work may reflect the maintenance of those coherence representations in downstream visual regions (Zeki, 1978; Felleman and Van Essen, 1991; Vaina, 1994), which are pointed at (producing the content-independent load signal) in order to be actively maintained and accessible in working memory.
It is important to note that different theories of WM storage may describe different content-independent processes that scale with WM load. One possibility is that WM storage relies on the oscillation of the focus of attention between WM representations that would otherwise rapidly decay (Mongillo et al., 2008; Landau and Fries, 2012; Fiebelkorn et al., 2013; Chota et al., 2022). Under this scenario, the sampling frequency of attention may change as it alternates among more items with increasing WM load (Holcombe and Chen, 2013). The signal we observe may reflect this change in sampling frequency, and this theory may be extended to incorporate the object-based benefits described above. Thus, while there is broad theoretical motivation for an item-based binding operation that scales with WM load, more work is needed to test this computational account of content-independent load activity. For example, future work could examine whether these content-independent load signals are a necessary precursor for accessing item representations via contextual retrieval cues, as predicted by models that distinguish between recollective and familiarity-based modes of memory retrieval (Yonelinas, 2023).
In summary, we present strong evidence for a content-independent load signature that generalizes across color and orientation, as well as across color and motion coherence, the latter pair being processed in cortically disparate regions. This conclusion was supported both by the generalization of classifier models across distinct features and by RSA analyses that identified a pure WM load signal. We argue that this signal is distinct from spatial attention and effort and that theories of WM must include content-independent processes. These findings align with a broad class of models that distinguish between the maintenance of stimulus-specific details and the binding of stored items to the current context. We propose that this signal may reflect the allocation of spatiotemporal pointers for binding items to context, providing a unifying perspective on previously identified neural and behavioral effects.
Data Availability
The code is available at https://github.com/henrymj/ContentIndependentLoad. A complete repository including the data can be accessed at https://osf.io/q8fya/.
Footnotes
This research was supported by National Institute of Mental Health Grant No. R01MH087214 and Office of Naval Research Grant No. N00014-12-1-0972 to E.A., and a Neubauer Distinguished Scholar Doctoral Fellowship from The University of Chicago to H.M.J.
The authors declare no competing financial interests.
Correspondence should be addressed to Henry M. Jones at henryjones@uchicago.edu.