Abstract
Real-world listening settings often consist of multiple concurrent sound streams. To limit perceptual interference during selective listening, the auditory system segregates and filters the relevant sensory input. Previous work provided evidence that the auditory cortex is critically involved in this process and selectively gates attended input toward subsequent processing stages. Using functional magnetic resonance imaging (fMRI) and a naturalistic selective listening task, we studied at which level of auditory cortex processing this filtering of attended information occurs. Forty-five human listeners (of either sex) attended to one of two continuous speech streams, presented either concurrently or in isolation. Functional data were analyzed using an inter-subject analysis to assess stimulus-specific components of ongoing auditory cortex activity. Our results suggest that stimulus-related activity in the primary auditory cortex and the adjacent planum temporale is hardly affected by attention, whereas brain responses at higher stages of the auditory cortex processing hierarchy become progressively more selective for the attended input. Consistent with these findings, a complementary analysis of stimulus-driven functional connectivity further demonstrated that information on the to-be-ignored speech stream is shared between the primary auditory cortex and the planum temporale but largely fails to reach higher processing stages. Our findings suggest that the neural processing of ignored speech cannot be effectively suppressed at the level of early cortical processing of acoustic features but is gradually attenuated once the competing speech streams are fully segregated.
Significance Statement
When listening selectively to one of multiple speech streams, the attended stream is segregated from the acoustic mixture and gated to further information processing, whereas the processing of irrelevant speech must be attenuated to minimize distraction. We here used functional magnetic resonance imaging to investigate the role of the auditory cortex in this attentional filtering of auditory input. Our data suggest that stimulus representations in the primary auditory cortex and the planum temporale are hardly affected by attention, whereas responses further along the auditory cortex processing hierarchy become progressively more selective for the attended input. This suggests that ignored speech cannot be effectively suppressed at the level of acoustic features but is gradually attenuated once the competing speech streams are fully segregated.
Introduction
Real-world listening environments often consist of multiple concurrent sound streams. When listening selectively to one of these streams, for example, a person talking in a crowded cocktail bar, the attended stream is segregated from the acoustic mixture and gated to further information processing, whereas irrelevant acoustic input (e.g., background noise, competing speech) is suppressed to minimize distraction and perceptual interference (Cherry, 1953; Hillyard et al., 1973; McDermott, 2009; Khalighinejad et al., 2019).
This attention-driven filtering of relevant auditory input relies on an efficient interplay of stimulus-related sensory processing and top-down attentional control (Shinn-Cunningham, 2008; Evans et al., 2016; Puschmann et al., 2017; King et al., 2018). While modulations of auditory sensory activity by attention have been observed at earlier stages of the auditory pathway (Lehmann and Schönwiesner, 2014; Forte et al., 2017; Etard et al., 2019), the auditory cortex is commonly thought to be critically involved in the attentional gating of information, such that irrelevant input hardly reaches subsequent processing stages (Zion Golumbic et al., 2013).
We here studied the spatial progression of attention-driven filtering of relevant auditory input along the auditory cortex processing hierarchy, ranging from primary sensory areas to auditory association cortex. To this end, we acquired functional magnetic resonance imaging (fMRI) data while participants performed a continuous selective listening task, in which one of two concurrent speech streams had to be attended. Previous research using similar competing-speaker paradigms provided evidence that early sensory responses localized to the primary auditory cortex are not much affected by attention and contain information on the acoustic features of both selectively attended and ignored sound streams (O’Sullivan et al., 2019; Kiremitçi et al., 2021). In contrast, neural responses at later stages of auditory cortex processing were reported to predominantly reflect the processing of the attended input (Mesgarani and Chang, 2012; O’Sullivan et al., 2019). Notably, while some studies suggested that the processing of to-be-ignored speech is limited to its acoustic features and early sensory areas (Brodbeck et al., 2018; Teoh et al., 2022), others provided evidence for an expanded processing of ignored speech, potentially involving some degree of early linguistic analysis (Kiremitçi et al., 2021).
An inter-subject analysis approach was applied to extract stimulus-specific neural representations of selectively attended and ignored speech. Inter-subject analysis approaches have been previously used to reveal shared cortical patterns of experience- and stimulus-driven brain activity in naturalistic listening or viewing settings (Hasson et al., 2004; Honey et al., 2012; Regev et al., 2013, 2021), as well as the sharing of stimulus-driven information between brain regions (Simony et al., 2016). The degree of inter-subject synchronization of brain activity has been previously shown to be sensitive to experimental manipulations, such as attention engagement (Regev et al., 2019; Rosenkranz et al., 2021) or inter-individual differences in brain function (Naci et al., 2014; Puschmann et al., 2021).
Previous studies on inter-subject alignment of brain activity have mainly focused on the synchronization of activation time series, but others have also demonstrated an inter-subject alignment of spatial voxel patterns (Chen et al., 2017; Zadbood et al., 2017; Nastase et al., 2019; Chien and Honey, 2020). Here we applied both approaches in parallel to gain complementary information on spatial and temporal auditory cortex representations of attended and ignored speech during selective listening. Moreover, we explored whether time-resolved estimates of inter-subject synchronization, based on momentary voxel patterns, can be used to assess the listener's attention focus.
Based on the evidence outlined above, we expected attention-driven modulations of stimulus-driven activity to increase along the auditory processing hierarchy. We hypothesized that the attended and the ignored speech streams are similarly represented at the level of the primary auditory cortex, whereas the superior temporal cortex and sulcus regions primarily process the attended stream (cf. O’Sullivan et al., 2019). Consequently, information on the to-be-ignored speech stream was not expected to propagate to brain regions beyond the auditory cortex that are involved in natural language processing (cf. Zion Golumbic et al., 2013; Regev et al., 2019).
Materials and Methods
Participants
The study included two participant groups. Group 1 (N = 24; 15 females, 9 males; 20–28 years of age; mean ± SD, 24 ± 2 years) performed a selective listening task used to investigate attention-driven modulations in auditory cortex processing. Group 2 (N = 21; 12 females, 9 males; 19–35 years of age; mean ± SD, 25 ± 5 years) listened to the speech streams used in the selective listening run in isolation, each in a separate fMRI run. These data were used to generate predictor time series of stimulus-driven brain processing. All participants were right-handed and reported no history of neurological, psychiatric, or hearing-related disorders. The experimental procedures were approved by the Research Ethics Committee of the University of Oldenburg, and written informed consent was obtained from all participants.
fMRI task
Audio recordings of two German fairy tales by Wilhelm and Jacob Grimm [The Singing, Springing Lark (narrative A) and The Golden Bird (narrative B)] served as stimulus material for the functional MRI experiment. Both narratives were read by the same female narrator. To ease stream segregation, we adjusted the mean F0 difference between both narratives to 3 semitones using the Change Pitch module in Audacity (www.audacityteam.org; RRID:SCR_007198) with high-quality stretching to preserve the original stimulus duration. Narratives A and B were cut to 660 and 630 s duration, respectively, and adjusted for RMS level. As a last step, 30 s of silence were added at the beginning of narrative B, resulting in an effective duration of 660 s for both audio stimuli.
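For reference, a separation of 3 semitones corresponds to a fundamental frequency ratio of 2^(3/12) ≈ 1.19; that is, the mean F0 of the two streams differed by roughly 19%.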
Stimuli were presented at individual sound comfort level which was adjusted inside the MRI scanner prior to the experiment. For this, participants could listen repeatedly to the first 30 s of narrative A under scanner background noise and provided feedback on perceived stimulus loudness via button press.
Participant group 1 performed three task runs in total. In task run 1, participants were presented with the mixture of both narratives. They were instructed to fixate on a cross, which was displayed centrally on a screen, and to listen selectively to the content of narrative A, which started 30 s earlier than narrative B. Afterward, they were asked to answer 20 questions on the contents of narratives A and B. Questions on the to-be-ignored stream were unexpected by the participants. In task runs 2 and 3, participants listened to narratives B and A in isolation and answered 20 other questions on the respective story contents afterward. Task runs 2 and 3 were not used for the presented data analysis.
Participants in group 2 also performed three task runs in total. In task runs 1 and 2, they listened to narratives A and B in isolation while fixating on a cross presented on a screen. Afterward, they were asked to answer 20 content questions on the respective story (i.e., the same questions that were asked in task run 1 of participant group 1). Presentation order was randomized across participants to minimize habituation effects. In task run 3, they listened to the mixture of both narratives at changing SNR (−3, 0, +3 dB) and were instructed to focus only on narrative A. Task run 3 was not analyzed for this manuscript.
Additional behavioral testing
Previous studies reported relationships between selective listening performance and auditory working memory performance (Puschmann et al., 2019; Bidelman and Yoo, 2020). Also, given that F0 represented a dominant sensory cue for stream segregation in our selective listening setting (i.e., no spatial separation, same speaker with F0 shifted), we speculated that individual frequency discrimination abilities may affect selective listening success. To control for such relationships, working memory and pitch discrimination thresholds were assessed in the selective listening group (i.e., participant group 1). In addition, a pure tone audiogram was obtained from all participants (frequency range, 125–8,000 Hz). The pure tone average (PTA; i.e., the mean hearing threshold obtained at 500, 1,000, 2,000, and 4,000 Hz) served as a measure of individual hearing loss.
Working memory was assessed using a tonal sequence manipulation task (Foster et al., 2013; Albouy et al., 2017; Puschmann et al., 2019). In each trial, a 3-tone piano sequence was presented and then repeated in reversed order after a 2 s retention period. Participants were asked to indicate whether or not the pitch of a single tone was changed during the reversed presentation. Pitch changes preserved the overall contour of the sequence and had a magnitude of 2 or 3 semitones (equally distributed). Half of the 108 trials contained a pitch change, and no more than three trials of the same condition (same vs changed pitch) occurred in a row. The working memory score was computed as the sum of correct responses over all trials.
Frequency discrimination thresholds were obtained using a two-alternative forced choice staircase procedure. In each trial, participants listened to two pure tone stimuli of 250 ms duration, separated by a silent interval of 500 ms. One stimulus was fixed at a reference frequency of 200 Hz, and the frequency of the other stimulus was adjusted based on performance. Initially, the frequency difference Δf was set to 7% of the reference frequency and was changed in steps of 1.25% with a one-up/two-down adaptation rule. The staircase procedure was terminated after 15 turns. Three staircases were completed by each participant, and the frequency discrimination threshold was computed as the mean frequency difference (in percentage) of the last 10 turns, averaged across the three staircase runs.
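In outline, a single staircase run followed the logic of the MATLAB sketch below; the variable names and the runTrial stub are ours and hypothetical, not taken from the original implementation.

    % Sketch of one one-up/two-down staircase run (hypothetical names).
    fRef     = 200;       % reference frequency (Hz)
    deltaF   = 7;         % initial frequency difference (% of fRef)
    step     = 1.25;      % step size (% of fRef)
    turns    = [];        % frequency differences at direction reversals
    nCorrect = 0;         % consecutive correct responses
    lastDir  = 0;         % last step direction (-1 down, +1 up, 0 none yet)
    while numel(turns) < 15
        correct = runTrial(fRef, fRef * (1 + deltaF/100));  % one 2AFC trial (stub)
        if correct
            nCorrect = nCorrect + 1;
            if nCorrect == 2                                % two correct: harder
                if lastDir == +1, turns(end+1) = deltaF; end
                deltaF  = max(deltaF - step, step);         % keep deltaF positive
                lastDir = -1; nCorrect = 0;
            end
        else                                                % one error: easier
            if lastDir == -1, turns(end+1) = deltaF; end
            deltaF  = deltaF + step;
            lastDir = +1; nCorrect = 0;
        end
    end
    threshold = mean(turns(end-9:end));  % mean of the last 10 turns; the reported
                                         % threshold averages three such runs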
MRI data acquisition
MRI data were recorded using a 3 T MRI system (MAGNETOM Prisma, Siemens Healthcare) and a 20-channel head coil. In each functional run, 400 T2*-weighted gradient echo planar imaging volumes with BOLD contrast were obtained (TR, 1,750 ms; TE, 30 ms; flip angle α = 75°; voxel size, 3.125 × 3.125 × 3.5 mm³). Functional volumes were recorded in ascending order and consisted of 28 transverse slices with a gap of 0.7 mm in-between, resulting in full coverage of the cerebral cortex and cerebellum. A high-resolution T1-weighted structural image (TR, 2,000 ms; TE, 2.07 ms; flip angle α = 9°; voxel size, 0.75 × 0.75 × 0.75 mm³) and gradient field maps were additionally obtained from each participant.
Audio onset was synchronized to the acquisition of the 11th volume of each functional MRI run. Stimulus presentation was controlled via MATLAB (MathWorks; RRID:SCR_001622) using Psychtoolbox-3 (www.psychtoolbox.org, RRID:SCR_002881). Audio signals were output via a USB audio interface (UA-25EX, Roland), amplified with a stereo amplifier (A-9510, Onkyo), and delivered via MRI-compatible earphones (S14, Sensimetrics). Participants were additionally equipped with over-ear hearing protectors during MRI recordings to attenuate scanner background noise.
MRI data preprocessing
Structural and functional MRI data were preprocessed using fMRIprep 1.2.3 (Esteban et al., 2019, RRID:SCR_016216), which is based on Nipype 1.1.6-dev (Gorgolewski et al., 2011, RRID:SCR_002502). Functional data preprocessing included correction for susceptibility distortion, spatial realignment which corrected for head motion, slice timing correction, spatial coregistration with anatomical data, and normalization to MNI152NLin2009cAsym space with resampling to isotropic 2 mm voxels. Details are described below.
The T1-weighted (T1w) image was corrected for intensity nonuniformity (INU) using N4BiasFieldCorrection (Tustison et al., 2010; ANTs 2.2.0) and used as T1w reference throughout the workflow. The T1w reference was then skull-stripped using antsBrainExtraction.sh (ANTs 2.2.0), using OASIS as target template. Brain surfaces were reconstructed using recon-all (Dale et al., 1999; FreeSurfer 6.0.1, RRID:SCR_001847), and the brain mask estimated previously was refined with a custom variation of the method to reconcile ANTs-derived and FreeSurfer-derived segmentations of the cortical gray matter (GM) of MindBoggle (Klein et al., 2017; RRID:SCR_002438). Spatial normalization to the ICBM 152 Nonlinear Asymmetric template version 2009c (Fonov et al., 2009; RRID:SCR_008796) was performed through nonlinear registration with antsRegistration (Avants et al., 2008; ANTs 2.2.0, RRID:SCR_004757), using brain-extracted versions of both T1w volume and template. Brain tissue segmentation of cerebrospinal fluid (CSF), white matter (WM), and GM was performed on the brain-extracted T1w using fast (Zhang et al., 2001; FSL 5.0.9, RRID:SCR_002823).
For each of the BOLD runs, the following preprocessing was performed. First, a reference volume and its skull-stripped version were generated using a custom methodology of fMRIPrep. A deformation field to correct for susceptibility distortions was estimated based on a field map that was coregistered to the BOLD reference, using a custom workflow of fMRIPrep derived from D. Greve's epidewarp.fsl script and further improvements of HCP Pipelines (Glasser et al., 2013). Based on the estimated susceptibility distortion, an unwarped BOLD reference was calculated for a more accurate coregistration with the anatomical reference. The BOLD reference was then coregistered to the T1w reference using bbregister (FreeSurfer), which implements boundary-based registration (Greve and Fischl, 2009). Coregistration was configured with nine degrees of freedom to account for distortions remaining in the BOLD reference. Head motion parameters with respect to the BOLD reference (transformation matrices and six corresponding rotation and translation parameters) were estimated before any spatiotemporal filtering using mcflirt (FSL 5.0.9; Jenkinson et al., 2002). BOLD runs were slice-time corrected using 3dTshift from AFNI 20160207 (Cox and Hyde, 1997; RRID:SCR_005927). The BOLD time series were resampled onto their original, native space by applying a single, composite transform to correct for head motion and susceptibility distortions. The BOLD time series were then resampled to MNI152NLin2009cAsym standard space with an isotropic voxel size of 2 × 2 × 2 mm³. Resampling was performed using antsApplyTransforms (ANTs), configured with Lanczos interpolation to minimize the smoothing effects of other kernels (Lanczos, 1964).
All region of interest (ROI) analyses described below were based on the nonsmoothed preprocessed functional data in MNI152NLin2009cAsym space. Only for the voxel-wise whole-brain data analysis, spatial smoothing with an isotropic Gaussian kernel of 6 mm full-width at half-maximum was performed using SPM12 (https://www.fil.ion.ucl.ac.uk/spm/; RRID:SCR_007037).
The smoothed and nonsmoothed preprocessed functional data in MNI152NLin2009cAsym space were cleaned from physiological noise and motion artifacts using nuisance regression with confound regressors provided by fMRIPrep that accounted for head motion (3 translation + 3 rotation) and global signal fluctuations in WM and CSF. The residual voxel time series were high-pass filtered with a cutoff period of 128 s to remove slow signal drifts.
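In outline, this cleanup corresponds to the following MATLAB sketch; the confound matrix is assumed to have been loaded from fMRIPrep's confound files, and the use of SPM's discrete cosine transform (DCT) basis for high-pass filtering is our assumption.

    % Y: preprocessed voxel time series [nVol x nVox]; C: confound matrix
    % [nVol x 8] (3 translations, 3 rotations, WM and CSF signals); TR in s.
    X    = [ones(size(C,1),1), C];          % nuisance design matrix with intercept
    R    = Y - X * (X \ Y);                 % residuals after nuisance regression
    nVol = size(R,1);
    nK   = fix(2 * (nVol * TR) / 128 + 1);  % DCT regressors for a 128 s cutoff
    K    = spm_dctmtx(nVol, nK);            % low-frequency cosine basis (SPM12)
    K    = K(:,2:end);                      % drop the constant term
    Rhp  = R - K * (K \ R);                 % high-pass filtered residuals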
Auditory cortex parcellation
To assess attention-driven modulations in stimulus processing at different stages of the auditory cortex processing hierarchy, we parcelled the auditory cortex into 16 (i.e., eight per hemisphere) anatomical ROIs. As shown in Figure 1, ROIs covered Heschl's gyrus (Te1), the planum temporale (Te2.1, Te2.2), the lateral aspect of the superior temporal gyrus (Te3), the upper (STS1) and lower (STS2) bank of the superior temporal sulcus, and the temporo-insular cortex (TI, TeI). Auditory cortex ROIs were described in detail in previous publications (Morosan et al., 2001; Zachlod et al., 2020) and are available for download as part of the Julich Brain Atlas (Amunts et al., 2020; RRID:SCR_023277). The area Te1 map was created using SPM12 Image Calculator by merging the maps of areas Te1.0, Te1.1, and Te1.2.
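For the Te1 merge, a minimal sketch using SPM12's image calculator could look as follows; the file names are hypothetical, and binarizing the summed maps is our assumption.

    % Merge the three Te1 subdivision maps into a single Te1 mask (SPM12).
    Vi = spm_vol(char('Te1_0.nii', 'Te1_1.nii', 'Te1_2.nii'));
    spm_imcalc(Vi, 'Te1.nii', '(i1 + i2 + i3) > 0');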
Auditory cortex parcellation: The figure depicts the auditory cortex ROIs, projected on the left temporal lobe of the MNI152 template using MRIcroGL software (RRID:SCR_024413; https://www.nitrc.org/projects/mricrogl). The coronal view (left panel; y = −10 mm) shows the ROI topology, progressing from the temporo-insular cortex (regions TI, TeI) via Heschl's gyrus (Te1) and the superior temporal gyrus (Te2.1, Te3) toward the superior temporal sulcus (STS1, STS2). The axial view (right panel; z = 10 mm) shows ROI locations on the superior temporal plane.
Temporal inter-subject analysis of auditory cortex activity
This analysis aimed to identify attention-driven modulations of stimulus-related auditory cortex activations by relating ROI signal time series during selective listening to signal time series obtained from the same auditory cortex ROIs during single-speaker listening.
The overall analysis workflow is depicted in Figure 2A. As a first step, individual ROI time series (averaged across all voxels) were obtained for both single-speaker runs of participant group 2. ROI time series were z-standardized to zero mean and unit standard deviation to minimize bias due to inter-individual differences in BOLD signal amplitude and averaged across individuals (per narrative). This resulted in two predictor time series for each ROI, reflecting the mean temporal pattern of stimulus-driven activity when listening to narratives A and B (compare Fig. 2A, mean time series indicated in bold red and blue). These predictor time series were submitted to a general linear model (GLM) to predict ROI time series (averaged across all voxels) during the selective listening run for each individual of participant group 1. The GLM analysis resulted in regression coefficients β that reflect the association between the ROI activity during selective listening and the processing of narratives A and B, when presented in isolation.
Inter-subject analysis of stimulus-driven brain activity: We applied two complementing analysis methods, investigating inter-subject synchronization of activation time series and of spatially distributed voxel patterns. A, For the temporal inter-subject analysis, mean ROI signal time series were obtained from participant group 2, which listened to both narratives in consecutive MRI runs. Time series were z-standardized to zero mean and unit standard deviation and averaged across individuals (per narrative). These time series, which are thought to reflect the commonly shared and time-locked processing of narratives A and B, served as predictors in a GLM framework to assess stimulus-specific representations of narratives A and B during selective listening. The resulting β estimates were analyzed as a function of attention. B, For the spatial inter-subject analysis, individual ROI voxel activation patterns, reflecting the spatially distributed brain response at a single point in time, were extracted from the single-speaker runs of participant group 2. The extracted spatial voxel patterns were z-standardized to zero mean and unit standard deviation and averaged across individuals (per narrative). This resulted in two predictor patterns for each ROI, reflecting the spatial representation of each narrative in auditory cortex at a given point in time. The predictor patterns were submitted to a GLM analysis to predict the spatial activation pattern (at the same point in time) during selective listening for each individual of participant group 1. C, The spatial inter-subject analysis was conducted time point by time point, resulting in a time series of β estimates for each individual, showing changes in the association between the regional activation pattern during selective listening and the single-speaker data over time. The figure depicts an exemplary time series of regression estimates β for a single participant and right area STS2. β estimates related to the attended narrative A and the ignored narrative B are shown in red and blue, respectively.
The GLM analysis was implemented in MATLAB using the glmfit function. To minimize effects related to habituation and attention orienting, we restricted the analysis to the last 10 min of each audio track (i.e., volumes 44–387 of each task run).
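For a single ROI and one participant of group 1, the analysis reduces to the following sketch; the array names are ours, and entering both predictors into one model reflects our reading of the methods.

    % tsA, tsB: single-speaker ROI time series of group 2 [nVol x nSubj];
    % ySel: one group 1 participant's ROI time series, selective listening run.
    win   = 44:387;                               % last 10 min of each run
    predA = mean(zscore(tsA(win,:)), 2);          % group-mean predictor, narrative A
    predB = mean(zscore(tsB(win,:)), 2);          % group-mean predictor, narrative B
    b     = glmfit([predA, predB], zscore(ySel(win)));  % normal GLM, identity link
    betaA = b(2);                                 % weight of (attended) narrative A
    betaB = b(3);                                 % weight of (ignored) narrative B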
Inter-regional temporal inter-subject analysis
To investigate the sharing of stimulus-specific information between auditory cortex ROIs during selective listening, we performed an inter-regional temporal inter-subject analysis (Simony et al., 2016). This analysis is an extension of the previously described temporal inter-subject analysis approach, in which the GLM predictor and response time series stem from two different ROIs. The resulting regression coefficients β can be interpreted as markers of stimulus-related functional connectivity between ROIs, selectively related to the processing of narratives A and B.
For the analysis, individual ROI time series (averaged across all voxels) were extracted from both single-speaker runs of participant group 2. ROI time series were z-standardized to zero mean and unit standard deviation and averaged across individuals (per narrative). These group-averaged time series served as predictors in a GLM framework to predict the signal time series in another ROI (averaged across all voxels) during the selective listening run for each individual of participant group 1. GLMs were computed between all ROI combinations to obtain two sets of stimulus-driven connectivity matrices, for the attended and the ignored speech stream.
The GLM analysis was implemented in MATLAB using the glmfit function. As before, the analysis window was restricted to the last 10 min of each audio track (i.e., volumes 44–387 of each task run) to minimize effects related to habituation and attention orienting.
As the directionality of the GLM analysis (i.e., single-speaker time series of ROIi predict selective listening data in ROIj vs single-speaker time series of ROIj predict selective listening data in ROIi) was not considered meaningful in our analysis framework, each connectivity matrix was averaged with its transpose to obtain a symmetrical connectivity pattern.
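Schematically, for one participant and using the group-mean predictor time series defined above (here with one column per ROI), the connectivity matrices could be built as follows; the array names are ours.

    % predA, predB: group-mean single-speaker time series [nT x nROI];
    % ySel: this participant's selective listening time series [nT x nROI].
    nROI  = size(ySel, 2);
    C_att = zeros(nROI);  C_ign = zeros(nROI);
    for i = 1:nROI                        % ROI providing the predictors
        for j = 1:nROI                    % ROI providing the response
            b = glmfit([predA(:,i), predB(:,i)], ySel(:,j));
            C_att(i,j) = b(2);            % connectivity for attended narrative A
            C_ign(i,j) = b(3);            % connectivity for ignored narrative B
        end
    end
    C_att = (C_att + C_att') / 2;         % symmetrize: direction not meaningful
    C_ign = (C_ign + C_ign') / 2;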
Whole-brain temporal inter-subject analysis
The ROI-based data analyses of auditory cortex time series were complemented with a voxel-wise whole-brain data analysis to investigate stimulus processing beyond the auditory cortex. To this end, we applied a mass-univariate voxel-wise temporal inter-subject analysis. The applied analysis workflow was identical to the ROI-based analysis. However, instead of using mean ROI signal time series, the analysis was performed voxel by voxel.
For each voxel, individual signal courses from both single-speaker runs of participant group 2 were extracted, z-standardized to zero mean and unit standard deviation, and averaged across individuals (per narrative). This resulted in two group-mean time series of stimulus-driven activity related to the processing of narratives A and B. These time series were submitted to a GLM analysis to predict stimulus-specific activations during the selective listening run for each individual of participant group 1. The GLM analysis resulted in two whole-brain volumes of regression coefficients β per participant that reflected the association between activation time series during selective listening and the processing of narratives A and B.
To compare stimulus representations during selective listening and single-speaker listening, we additionally computed voxel-wise maps of inter-subject synchronization for the single-speaker runs of the experiment. To this end, voxel time series of narratives A or B were extracted from participant group 2 and z-standardized to zero mean and unit standard deviation. For each voxel, all but one time series were then averaged across individuals to obtain a predictor time series. The predictor time series was submitted to a GLM analysis to predict stimulus-specific activations in the left-out participant. This analysis was repeated for all participants. The leave-one-out cross-validation approach was used to ensure independence of response and predictor time series.
The whole-brain analysis was restricted to the last 10 min of each audio track (i.e., volumes 44–387 of each task run) to minimize effects related to habituation and attention orienting.
The GLM analysis was implemented in MATLAB using the glmfit function. The SPM12 function spm_read_vols was used to read out the voxel time series from fMRI volumes. The resulting whole-brain maps of inter-subject synchronization were written to a three-dimensional volume using the spm_write_vol function.
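The leave-one-out variant for the single-speaker runs can be sketched as follows; the array layout is hypothetical and chosen for clarity rather than memory efficiency.

    % Y: z-standardized single-speaker voxel time series of participant
    % group 2 [nVol x nVox x nSubj], restricted to the analysis window.
    nSubj = size(Y, 3);
    nVox  = size(Y, 2);
    beta  = zeros(nVox, nSubj);
    for s = 1:nSubj
        others = setdiff(1:nSubj, s);
        pred   = mean(Y(:,:,others), 3);       % group-mean predictor per voxel
        for v = 1:nVox
            b = glmfit(pred(:,v), Y(:,v,s));   % predict the left-out participant
            beta(v,s) = b(2);
        end
    end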
Spatial inter-subject analysis of auditory cortex activity
Complementing the previously described analyses of auditory cortex signal time series during selective listening, this analysis aimed to identify attention-driven modulations in the spatially distributed activation pattern related to the processing of narratives A and B during selective listening.
The overall analysis workflow is depicted in Figure 2B. For each ROI and each narrative, individual voxel activation patterns, reflecting the spatially distributed brain response at a single point in time, were extracted from the single-speaker runs of participant group 2. The extracted spatial voxel patterns were z-standardized to zero mean and unit standard deviation and averaged across individuals (per narrative). This resulted in two predictor patterns for each ROI, reflecting the spatial voxel representation of each narrative at a given point in time. The group mean patterns were submitted to a GLM analysis to predict the spatial activation pattern (at the same point in time) during selective listening for each individual of participant group 1. The GLM analysis resulted in regression coefficients β that reflect the association between the spatially distributed activation pattern during selective listening and the activation pattern related to the processing of narratives A and B when presented in isolation. This analysis was repeated for each time point of the fMRI data time series, resulting in a “time series” of β estimates over the course of the selective listening run. An example time series of an individual participant is shown in Figure 2C.
The GLM analysis was implemented in MATLAB using the glmfit function. Similar to the temporal inter-subject analysis, the spatial analysis was restricted to the last 10 min of the selective listening data (i.e., volumes 44–387; compare Fig. 2C, shaded area) to reduce onset and habituation effects.
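For a single ROI and one participant of group 1, the time-resolved procedure reduces to the following sketch; the array names are ours.

    % pattA, pattB: group-mean single-speaker voxel patterns of one ROI
    % [nVoxROI x nVol]; pattSel: this participant's selective listening patterns.
    win   = 44:387;
    betaA = zeros(numel(win), 1);
    betaB = zeros(numel(win), 1);
    for k = 1:numel(win)
        t = win(k);
        b = glmfit([pattA(:,t), pattB(:,t)], pattSel(:,t));  % GLM across voxels
        betaA(k) = b(2);   % momentary association with the narrative A pattern
        betaB(k) = b(3);   % momentary association with the narrative B pattern
    end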
Experimental design and statistical analysis
Four participants of group 1 were removed from the dataset prior to data analysis—two due to not completing the study, one due to falling asleep during MRI measurements, and another one due to poor task performance (i.e., proportion of correct answers on story contents was three standard deviations below group average). Similarly, one dataset was removed from participant group 2 due to elevated hearing thresholds (>20 dB HL) at the time of the experiment. Therefore, group-level statistical analysis of fMRI and behavioral data was based on 20 datasets of each participant group.
Statistical testing was performed in jamovi (ANOVA; https://www.jamovi.org/; RRID:SCR_016142), MATLAB (t tests for ROI analyses, attention classification), and SPM12 (whole-brain data analysis).
Temporal inter-subject analysis of auditory cortex activity
To test for stimulus-driven responses (i.e., >0) during selective listening, β estimates of each ROI were submitted to one-tailed one-sample t tests. Differences in regression coefficients β related to attention allocation were assessed using a three-factorial repeated-measures analysis of variance (ANOVA) with factors attention (two levels, attended, ignored speech), hemisphere (two levels, left, right), and auditory cortex region (eight levels, TI, TeI, Te1, Te2.1, Te2.2, Te3, STS1, STS2). Greenhouse–Geisser correction was applied to adjust for lack of sphericity. Additional two-tailed paired t tests were performed to investigate differences between the attended and the to-be-ignored speech stream in all ROIs. Effects were deemed statistically significant for p < 0.05. For t tests, p values were corrected for false discovery rate (FDR) using the Benjamini–Hochberg procedure. The FDR correction was implemented using the fdr_bh function by D. Groppe for MATLAB (via MATLAB File Exchange; file ID: 27418).
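With the fdr_bh function named above, the correction amounts to a single call; the 'pdep' option implements the classic Benjamini–Hochberg procedure.

    % p: vector of uncorrected p-values from the post hoc paired t tests.
    [h, crit_p, ~, adj_p] = fdr_bh(p, 0.05, 'pdep');
    % h: logical mask of tests surviving FDR correction at q = 0.05;
    % adj_p: Benjamini-Hochberg adjusted p-values.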
Inter-regional inter-subject analysis
To test for stimulus-driven functional connectivity between auditory cortex ROIs during selective listening (i.e., >0), we submitted connectivity matrices obtained for the to-be-attended and for the ignored stream to element-wise one-tailed one-sample t tests. To investigate differences in stimulus-driven functional connectivity related to attention, we performed element-wise two-tailed paired t tests between connectivity matrices of the attended and the to-be-ignored speech stream. Effects were deemed statistically significant for p < 0.05, FDR corrected using the Benjamini–Hochberg procedure.
Spatial inter-subject analysis of auditory cortex activity
Mean β estimates (i.e., averaged over volumes 44–387) were computed as an index of mean stimulus-driven activation for each ROI and each speech stimulus and submitted to statistical testing. To test for stimulus-driven responses (i.e., >0) within the different auditory cortex ROIs, we submitted mean β estimates to one-tailed one-sample t tests. Differences in regression coefficients β related to attention allocation were assessed using a three-factorial repeated-measures ANOVA with factors attention (two levels, attended, ignored speech), hemisphere (two levels, left, right), and auditory cortex region (eight levels, TI, TeI, Te1, Te2.1, Te2.2, Te3, STS1, STS2). Greenhouse–Geisser correction was applied to adjust for lack of sphericity. Additional two-tailed paired t tests were performed to investigate differences between the attended and the to-be-ignored speech stream in all ROIs. Effects were deemed statistically significant for p < 0.05. For t tests, p values were FDR corrected using the Benjamini–Hochberg procedure.
Classification of the auditory attention focus
The listeners’ momentary attention focus was classified (i.e., attending to narrative A vs attending to narrative B) using the time-resolved β estimates obtained from the spatial inter-subject analysis. The proportion of time points with βA > βB was computed for each ROI and sliding windows of different duration, ranging from 1 (i.e., classification is based on single β estimate) to 344 time points (i.e., classification is based on the mean regression estimate over the whole analysis window). Classification accuracy was reported as being significantly above chance when the lower boundary of the 95% confidence interval of group-level decoding performance did not overlap with the upper boundary of the 95% confidence interval for chance-level performance, which is considered to follow a binomial distribution. The upper boundary of the 95% confidence interval for chance-level performance was estimated using the Wilson score interval method (Wilson, 1927).
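Our reading of this procedure, sketched for a single ROI and a window length w (array names are ours):

    % betaA, betaB: time-resolved estimates from the spatial inter-subject
    % analysis [nT x nSubj]; w: sliding window length in volumes.
    nT      = size(betaA, 1);
    nWin    = nT - w + 1;
    correct = zeros(nWin, size(betaA, 2));
    for t = 1:nWin
        mA = mean(betaA(t:t+w-1, :), 1);    % window-mean beta, narrative A
        mB = mean(betaB(t:t+w-1, :), 1);    % window-mean beta, narrative B
        correct(t,:) = mA > mB;             % correct if attended > ignored
    end
    acc = mean(correct, 1);                 % per-participant classification accuracy
    % Wilson score upper bound of the 95% CI for chance level (p0 = 0.5):
    z  = 1.96;  p0 = 0.5;  n = nWin;
    ub = (p0 + z^2/(2*n) + z*sqrt(p0*(1-p0)/n + z^2/(4*n^2))) / (1 + z^2/n);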
Whole-brain temporal inter-subject analysis
Whole-brain β maps of inter-subject synchronization obtained for narratives A and B during selective listening and during single-speaker listening were tested against zero-baseline using voxel-wise one-tailed one-sample t tests. Voxel-wise differences in inter-subject synchronization between the attended and the ignored stream during selective listening were assessed using one-tailed paired t tests. To assess the impact of distracting speech on the neural processing of the attended speech stream, we compared whole-brain β maps obtained for narrative A during selective listening and single-speaker listening using two-tailed two-sample t tests. Effects were reported as statistically significant for p < 0.05, corrected for family-wise errors (FWE) on the voxel-level. Clusters consisting of <10 voxels were not reported.
Behavioral data analysis
To confirm the effectiveness of the applied attention manipulation, the proportion of correct responses was compared between the attended narrative A and the to-be-ignored narrative B using a two-tailed paired t test. To analyze the impact of distracting speech on comprehension of the attended speech stream, we compared the proportion of correct responses to narrative A between the selective listening condition (participant group 1 data) and when listening to the same narrative in isolation (participant group 2 data) using a two-tailed two-sample t test.
The relationship between individual selective listening performance (i.e., the proportion of correct responses to the attended narrative A) and (1) hearing loss, (2) frequency discrimination thresholds, and (3) auditory working memory scores was explored using robust regression analyses in MATLAB (with default settings, i.e., bisquare weighting and a tuning constant of 4.685). Robust regression was preferred over ordinary least-squares regression to reduce the effect of outliers on the parameter estimation.
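With MATLAB's defaults, each of these analyses reduces to a single robustfit call; the variable names are ours, and the R² shown is one common way to summarize fit quality for a robust model.

    % x: individual predictor (e.g., working memory score); y: proportion of
    % correct responses to the attended narrative A.
    b    = robustfit(x, y, 'bisquare', 4.685);  % returns [intercept; slope]
    yhat = b(1) + b(2) * x;
    R2   = 1 - sum((y - yhat).^2) / sum((y - mean(y)).^2);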
To control for differences in question difficulty between both narratives, we performed two-tailed paired t tests between the proportion of correct responses to narratives A and B, when presented in isolation (data of participant group 2—task runs 1 and 2).
Results of the behavioral data analysis were reported as statistically significant for p < 0.05, FDR corrected using the Benjamini–Hochberg procedure.
Code and data availability
Preprocessed functional MRI data, analysis scripts, and unthresholded statistical maps of the whole-brain analysis are available for download via the Open Science Framework (https://osf.io/ze4wu/).
Results
Behavioral data
Figure 3 depicts the proportion of correctly answered questions on story content during selective listening (participant group 1 data) and when listening to each narrative in isolation (participant group 2 data). After completing the selective listening run, participants correctly answered 62.5% ± 12.4% (mean ± SD; range, 40–90%) of questions on the to-be-attended narrative A, but only 0.8% ± 2.5% (mean ± SD; range, 0–10%) of questions on the to-be-ignored input [t(19) = 22.1; p(FDR) < 0.001]. Overall, 18 of 20 participants could not correctly answer a single question on the content of the to-be-ignored narrative, demonstrating that the attention manipulation was highly effective.
Behavioral data: The figure depicts the percentage of correctly answered questions on the content of narratives A and B, respectively, during the selective listening run (left panel; data of participant group 1) and when listening to each narrative in isolation (right panel; data of participant group 2). Each dot represents the performance of an individual participant.
When listening to narratives A and B in isolation, participant group 2 answered 76.8% ± 10.4% (mean ± SD; range, 60–90%) and 79.5% ± 14.9% (mean ± SD; range, 40–100%) of questions correctly, suggesting no overall difference in comprehensibility or question difficulty between narratives [t(19) = −0.9; p(FDR) = 0.551]. Compared with the selective listening situation, a higher proportion of content questions on narrative A was answered correctly when the speech stream was presented without interfering speech input [t(38) = 3.9; p(FDR) = 0.001], which indicates a negative influence of the to-be-ignored speech stream on overall story comprehension.
We explored whether individual differences in selective listening performance are associated with differences in auditory working memory raw scores (mean ± SD, 59 ± 9; range, 45–83), frequency discrimination thresholds (mean ± SD, 2.46% ± 1.65%; range, 1.20–5.83%), or individual hearing loss (mean ± SD, 2.4 ± 3.9 dB HL; range, −3.1 to 9.4 dB HL). Robust regression analyses suggested no significant relationship between these measures and the proportion of correctly answered content questions on the to-be-attended narrative A [working memory, R2 = 0.06, p(FDR) = 0.551; frequency discrimination, R2 = 0.02, p(FDR) = 0.599; hearing loss, R2 = 0.02, p(FDR) = 0.599].
Attention-driven modulation of auditory cortex time series
Mean activation time series of different ROIs along the auditory cortex processing hierarchy were analyzed using a temporal inter-subject approach to assess the auditory cortex representation of attended and to-be-ignored speech during selective listening.
Results of the temporal inter-subject analysis are depicted in Figure 4A. The time series data show significant stimulus-driven activation related to processing of the attended speech stream (i.e., narrative A) in all auditory cortex ROIs except for the left and right temporo-insular region TI [one-tailed one-sample t tests at p(FDR) < 0.05], with β estimates increasing from anterior to posterior and toward superior temporal sulcus regions. In contrast, auditory cortex processing of the to-be-ignored speech stream was more confined, with the activation peak in area Te2.1. Left area TI as well as areas STS1 and STS2 showed no significant representation of the to-be-ignored stream [one-tailed one-sample t tests at p(FDR) < 0.05]. Differences in stimulus processing related to the allocation of attention during selective listening were investigated using a three-factorial repeated-measures ANOVA and post hoc t tests. The ANOVA revealed main effects of attention (F(1,19) = 35.5, p < 0.001) and brain region (F(3.2,59.9) = 36.3, p < 0.001), but no significant main effect of hemisphere (F(1,19) = 1.5, p = 0.236). It further revealed significant attention-by-brain region (F(3.7,69.5) = 36.7, p < 0.001) and brain region-by-hemisphere (F(4.4,83.9) = 2.8, p = 0.024) interactions. Interactions between attention and hemisphere (F(1,19) = 0.2, p = 0.686) as well as the three-way interaction between attention, brain region, and hemisphere (F(4.5,85.5) = 0.3, p = 0.865) were not significant. Post hoc paired t tests were performed to explore differences in stimulus-driven auditory cortex processing as a function of attention and brain region. We found that, in comparison with the attended speech stream, the neural representation of the to-be-ignored speech was significantly reduced in left Te2.2 [p(FDR) < 0.05] and bilaterally in Te3, STS1, and STS2 [all p(FDR) < 0.001].
Results of the temporal inter-subject analysis. A, The figure depicts group mean β estimates (±SD) in the left (i.e., left group of bars) and right auditory cortex. Filled bars mark β estimates that are significantly larger than zero [one-tailed one-sample t test, p(FDR) < 0.05]. Significant attention-driven modulation of stimulus-driven activation was observed in auditory cortex regions Te2.2 (left hemisphere only), Te3, STS1, and STS2 [two-tailed paired t tests; *p(FDR) < 0.05, ***p(FDR) < 0.001]. B, The sharing of stimulus-related information between different levels of auditory cortex processing was analyzed using an interregional temporal inter-subject analysis. The left and middle panels depict thresholded connectivity matrices for the attended narrative A and the to-be-ignored narrative B [tested against zero-baseline using one-tailed one-sample t tests; p(FDR) < 0.05]. The right panel shows interregional connections that are significantly enhanced for the attended as compared with the ignored speech stream [two-tailed paired t tests; p(FDR) < 0.05]. The panels show the lower triangle of the symmetrical connectivity matrices; left and right hemispheric regions alternate. The diagonal of the matrix is not displayed as it depicts within-region associations (i.e., identical to the data shown in Fig. 4A).
Stimulus-related functional connectivity of auditory cortex regions
To investigate the sharing of stimulus-related information between auditory cortex regions during selective listening, an interregional temporal inter-subject analysis was performed.
As shown in Figure 4B, inter-regional associations between activation time series related to the processing of the attended speech stream were widespread and qualitatively increased toward higher-order auditory cortex regions Te3, STS1, and STS2 [left panel; one-tailed one-sample t tests, p(FDR) < 0.05]. In contrast, inter-regional associations related to the processing of the to-be-ignored stimulus were more restricted, with strongest associations between areas Te1, Te2.1, Te2.2, and Te3 [Fig. 4B, middle panel; one-tailed one-sample t tests, p(FDR) < 0.05]. A direct comparison of connectivity patterns revealed stronger inter-regional connections with areas Te3, STS1, and STS2 for the attended versus the to-be-ignored speech stream [Fig. 4B, right panel; two-tailed paired t tests; p(FDR) < 0.05]. These findings suggest that stimulus-related information on the to-be-ignored speech stream is shared among low-level auditory cortex regions but largely fails to reach higher-order structures of the auditory processing hierarchy.
Attention-driven modulation of spatial auditory cortex activation patterns
Complementing the analysis of auditory cortex activation time series, we investigated attention-driven modulations in the voxel pattern representation of attended and ignored speech during selective listening using a spatial inter-subject analysis, which was performed volume by volume for each time point of the selective listening run.
Figure 5A depicts the mean β estimates of the spatial pattern analysis, averaged over the analyzed time window. Spatial activation patterns during selective listening were significantly related to the processing of the to-be-attended speech stream in all ROIs except for left and right TI [one-tailed one-sample t tests at p(FDR) < 0.05]. β estimates reflecting the processing of the attended stream overall increased from anterior to posterior and ventrally toward the superior temporal sulcus. Spatial voxel patterns were significantly related to processing of the to-be-ignored stream in left TI, right TeI, bilateral Te1, Te2.1, and Te2.2, as well as in left Te3 and STS1 [one-tailed one-sample t tests at p(FDR) < 0.05]. Similar to the time series analysis, β estimates reflecting the processing of the to-be-ignored speech stream were maximal for region Te2.1 and diminished toward higher-order stages of the auditory processing hierarchy. Differences in stimulus-driven processing related to the allocation of attention during selective listening were investigated using a three-factorial repeated-measures ANOVA and post hoc t tests. The ANOVA revealed main effects of attention (F(1,19) = 64.3, p < 0.001) and brain region (F(4.6,87.4) = 30.4, p < 0.001), but no significant main effect of hemisphere (F(1,19) = 1.25, p = 0.277). We further found a significant attention-by-brain region interaction (F(4.0,75.5) = 35.67, p < 0.001), but no interactions between brain region and hemisphere (F(3.9,73.8) = 1.2, p = 0.313), attention and hemisphere (F(1,19) = 1.93, p = 0.181), or all three ANOVA factors (F(4.6,87.8) = 1.4, p = 0.238). Post hoc t tests showed that β estimates obtained for left TeI, left Te2.2 [both p(FDR) < 0.05], and bilateral Te3, STS1, and STS2 [all p(FDR) < 0.001] were significantly increased for the attended versus the to-be-ignored speech stream.
Results of the spatial inter-subject analysis. A, The figure shows group mean β estimates (±SD) for left (i.e., left group of bars) and right auditory cortex ROIs, averaged over the analyzed time window. Filled bars mark β estimates that are significantly larger than zero [one-tailed one-sample t test, p(FDR) < 0.05]. Significant modulations of stimulus-driven activation patterns by selective attention were observed in auditory cortex regions Te2.2 (left hemisphere only), Te3, STS1, and STS2 [two-tailed paired t tests; *p(FDR) < 0.05, ***p(FDR) < 0.001]. B, We explored the possibility of classifying the listeners’ attention focus based on the time-resolved regression estimates. The figure depicts the classification accuracy (i.e., proportion of time points with βA > βB) for different time windows, including 1–344 data points. Shaded areas indicate 95% confidence intervals of classification accuracy; the dotted gray line depicts the upper boundary of the 95% confidence interval of chance-level performance.
Attention focus classification
As mean β estimates of the spatial inter-subject analysis provided evidence for attention-driven modulations across multiple brain regions, we explored whether shorter segments of the β time series also provide sufficient information to robustly classify the listeners’ attention focus during selective listening. To this end, we compared mean β estimates obtained for the attended and ignored stream for time windows ranging from 1 (i.e., classification based on single fMRI volumes) to all 344 time points of the task run.
Figure 5B depicts the classification accuracy (i.e., βA > βB) as a function of time window for each auditory cortex ROI. Overall, the data suggest that the attention focus on narrative A can be successfully assessed from spatial activation patterns of regions Te3, STS1, and STS2, with near-perfect classification accuracy for long analysis time windows. In contrast, the attention focus could not be reliably classified based on data from Heschl's gyrus (Te1), the planum temporale (Te2.1, Te2.2), or temporo-insular cortex regions (TI, TeI).
Whole-brain analysis
A whole-brain inter-subject analysis of voxel time series was conducted to assess the representation of attended and ignored speech beyond auditory sensory areas. Figure 6A depicts thresholded activation maps associated with the neural processing of the attended and to-be-ignored speech input during selective listening. Stimulus-driven brain responses related to the attended speech stream encompassed the lateral Heschl's gyrus, posterior Heschl's sulcus, the planum temporale, large parts of the superior temporal gyrus, the superior temporal sulcus, and the middle temporal gyrus, as well as anterior and mid-posterior portions of the left inferior temporal gyrus. Beyond the temporal lobe, stimulus-driven responses were found in the inferior frontal gyrus (pars orbitalis and pars triangularis), the right superior frontal gyrus, the superior medial orbitofrontal and medial frontal cortex, the angular gyrus, the inferior parietal lobe, the intraparietal sulcus, small parts of the inferior and middle occipital gyrus, the amygdala, the hippocampus, the posterior cingulate cortex, the precuneus, as well as lobules VIIa and VI of the cerebellum [one-tailed one-sample t test; p(FWE) < 0.05].
Results of the whole-brain analysis: To explore stimulus-driven brain activity beyond the auditory cortex, a voxel-wise temporal inter-subject analysis was conducted. A, The top and middle panels show stimulus-specific synchronization of brain activity related to processing of the attended and the to-be-ignored speech stream during selective listening [one-tailed one-sample t test; p(FWE) < 0.05]. For the attended stream, stimulus-driven responses encompassed distributed brain regions, including the auditory cortex, the middle temporal gyrus, anterior and mid-posterior portions of the left inferior temporal gyrus, the inferior frontal gyrus, the right superior frontal gyrus, the superior medial orbitofrontal and medial frontal cortex, the angular gyrus, the inferior parietal lobe, the intraparietal sulcus, small parts of the inferior and middle occipital gyrus, the amygdala, the hippocampus, the posterior cingulate cortex, and the precuneus, as well as lobules VIIa and VI of the cerebellum. In contrast, stimulus-specific brain activity related to processing of the ignored speech stream was restricted to auditory sensory areas, even when lowering the statistical threshold to p < 0.001 (uncorrected; red shaded area). The bottom panel shows brain areas in which stimulus-specific synchronization of brain activity was significantly enhanced for the attended as compared with the ignored speech stream [one-tailed paired t test; p(FWE) < 0.05]. B, The figure depicts inter-subject synchronization of brain activity related to processing of narratives A and B when presented in isolation, without competing input [one-tailed one-sample t test; p(FWE) < 0.05]. All activations are displayed on the MNI152 template using MRIcroGL software.
For the to-be-ignored speech stream, statistically significant stimulus-driven activation was restricted to the anterior portion of the left planum temporale [one-tailed one-sample t test; p(FWE) < 0.05]. Lowering the statistical threshold to p < 0.001 (uncorrected) resulted in large bilateral coverage of the auditory cortex, including Heschl's gyrus and the planum temporale (red shaded areas). A direct comparison between the attended and the to-be-ignored speech stream during selective listening is shown in the bottom panel of Figure 6A [one-tailed paired t test; p(FWE) < 0.05].
Overall, the whole-brain activation pattern observed for the attended speech stream during selective listening closely resembled the voxel-wise map of stimulus-driven responses found for the same story when presented in isolation (compare Fig. 6B); a direct comparison did not reveal any statistically significant differences [two-tailed two-sample t test; p(FWE) < 0.05].
Discussion
We here investigated the neural representation of selectively attended and ignored speech in a naturalistic multispeaker setting. Our results suggest that low-level stimulus representations in Heschl's gyrus (Te1) and in the adjacent portion of the planum temporale (Te2.1) are not significantly modulated by attention, whereas neural responses in the lateral superior temporal gyrus (Te3) and superior temporal sulcus (STS1, STS2) regions are increasingly dominated by the attended input. Complementing this progressive change in neural activation, we observed decreased stimulus-related functional connectivity between lower- and higher-order auditory cortex regions for the to-be-ignored versus the attended speech stream, demonstrating that the spread of to-be-ignored information gets largely attenuated toward later stages of auditory cortex processing. These findings support the view that the auditory cortex selectively gates attended input to higher-order brain areas involved in natural language processing.
Auditory cortex gating of selectively attended speech
Previous studies demonstrated that neuronal receptive fields at the level of the primary auditory cortex are modulated by task goals and attention; the low-level spectrotemporal representation of relevant input is enhanced while neural responses to irrelevant sounds are simultaneously suppressed (Fritz et al., 2007; O’Connell et al., 2014; De Martino et al., 2015). This dynamic short-term plasticity is thought to facilitate stimulus discriminability and to stabilize the perceptual grouping of sounds (Carlin and Elhilali, 2015). Attention-driven modulations of primary auditory cortex activity have, however, mainly been observed in experimental paradigms involving simple tonal or isolated speech-like stimuli. In contrast, recent work using continuous natural speech in competing-speaker paradigms reported rather small modulations of primary sensory activity by attention (O’Sullivan et al., 2019; Kiremitçi et al., 2021). Adding to this picture, our current data show only a minor modulation of stimulus-driven primary auditory cortex responses by attention.
The reduced attention-driven modulation of primary auditory cortex activity observed in competing-speaker paradigms may be attributed to the variability and overlap of spectrotemporal features of the speech streams, which potentially limits the effectiveness of receptive field sharpening for attentional selection based on stimulus acoustics. Instead, attentional selection of competing speech sounds is likely to occur primarily based on a selective gating of segregated stream representations (Shinn-Cunningham, 2008; Marinato and Baldauf, 2019; de Vries et al., 2021). Previous work suggests that stream segregation is incomplete at the level of the primary auditory cortex and that neural activity in this region represents the acoustic input mixture, rather than individual speech streams (Micheyl et al., 2007; Puvvada and Simon, 2017; Hausfeld et al., 2018).
The planum temporale has been previously reported to contain segregated representations of the acoustic input (Smith et al., 2010) and is commonly thought to be critically involved in auditory scene analysis (Griffiths and Warren, 2002; Zatorre et al., 2002; Deike et al., 2010; Ragert et al., 2014; Teki et al., 2016). Consistent with the view that attentional selection in competing-speaker experiments acts strongly on segregated speech representations, our data show a marked increase in attention-driven modulations of stimulus-driven activity at later stages of the auditory processing hierarchy, with the (left) posterior planum temporale (Te2.2) being the hierarchically lowest region showing statistically significant effects. Moreover, functional connectivity between early auditory sensory regions and areas Te3, STS1, and STS2 related to the processing of the to-be-ignored speech stream was largely attenuated, suggesting that information on the attended stream is gated selectively toward higher-order auditory information processing.
The cortical representation of ignored speech
Our data suggest that the neural processing of ignored speech is largely confined to the auditory cortex. While the neural representation of the ignored stimulus was strongest in early auditory areas, we also found weak signatures of ignored speech processing in the lateral superior temporal gyrus (Te3) and in the upper bank of the superior temporal sulcus (i.e., for the spatial inter-subject analysis). This finding agrees with recent fMRI data provided by Kiremitçi et al. (2021), which show ignored speech processing at different levels of the auditory processing hierarchy, up to the superior temporal sulcus and superior marginal sulcus areas. Moreover, while early auditory cortex representations of speech can be linked to stimulus acoustics, activations in higher-level auditory regions have been demonstrated to reflect articulatory and semantic speech features (Huth et al., 2016; de Heer et al., 2017; Kiremitçi et al., 2021). This suggests that irrelevant speech undergoes some degree of linguistic feature analysis (see also Har-shai Yahav and Zion Golumbic, 2021).
Previous studies provided mixed evidence regarding the processing of to-be-ignored speech. While some reported a fast attenuation of the unattended input (Ding and Simon, 2012; O’Sullivan et al., 2015; Brodbeck et al., 2018), others found a more persistent representation that extended to later processing stages (Zion Golumbic et al., 2013; Puschmann et al., 2019; Kiremitçi et al., 2021). The discrepancies possibly reflect variations in the experimental setup, for example, in the spatial configuration or the contrast between the voices. Such variations are likely to affect the listeners’ ability to filter relevant auditory information and to avoid intrusions by unattended speech (Cherry, 1953; Bronkhorst, 2000). Moreover, differences between the studied participant samples may add to the variability in the literature. For example, better speech-in-speech (and speech-in-noise) perception has been reported for trained musicians compared with non-musicians, possibly mediated by a musical training-related increase in auditory working memory (Clayton et al., 2016; Puschmann et al., 2019; Bidelman and Yoo, 2020). Puschmann et al. (2019) further found that this perceptual advantage was accompanied by an increased cortical representation of the to-be-ignored speech, both in the auditory cortex and beyond.
In our inter-subject analysis framework, neural responses to ignored speech during selective listening are modeled using stimulus-driven responses to the same speech stimulus when it was presented in isolation (and actively attended). Our analysis is therefore unlikely to pick up potential temporal or spatial modulations of the neural response that occur exclusively when a speech stream is actively ignored. However, previous studies have linked selective attention in cocktail party settings primarily to changes in neural gain (Ding and Simon, 2012; Brodbeck et al., 2018), whereas only a few studies reported the emergence of additional (late) response components that are exclusively linked to the neural processing of the ignored speech (Fiedler et al., 2019). In our view, it remains open whether these responses reflect stimulus information processing per se or rather cognitive control, potentially related to distractor suppression.
Inter-subject analysis of stimulus-driven auditory cortex activity
We applied two complementary inter-subject analysis methods to assess attention-driven modulations of stimulus-related auditory cortex activity, investigating the synchronization of regional activation time series and spatial voxel patterns across individuals and listening conditions. Notably, both analysis approaches yielded comparable results, suggesting that local auditory cortex responses were temporally and spatially aligned across individuals. Moreover, the time-resolved estimates of stimulus-driven activity based on spatially distributed voxel patterns contained sufficient information to classify the participants’ attention focus from regions Te3, STS1, and STS2. Potentially, the time-resolved analysis approach could be further extended to study fluctuations in attention allocation over time.
The inter-subject synchronization of spatial voxel patterns was less pronounced than the alignment of regional activation time series. This was expected given the previously demonstrated inter-individual variability in spatial tuning to different sound features (Schönwiesner and Zatorre, 2009; Moerel et al., 2012). Hyperalignment, that is, the transformation of anatomically defined voxel patterns into a shared information space across individuals, may be an effective method to mitigate this problem (Haxby et al., 2020). Alternatively, a “within-subject” analysis, in which stimulus-specific associations in the spatial voxel pattern are computed between independent task runs recorded from the same individual, could provide a superior spatial correspondence. However, stimulus repetition effects may limit the applicability of this method (Sternin et al., 2023).
Overall, our data show an increase in inter-subject synchronization along the auditory cortex processing hierarchy, consistent across both analysis approaches. Similar patterns have been observed in some prior studies (Regev et al., 2013; Zhang et al., 2023); other data, however, suggest a strong inter-subject alignment of early auditory sensory activity (Nastase et al., 2019; Regev et al., 2019; Li et al., 2022). This indicates that early sensory activations related to story listening synchronize across individuals, but not to the same degree in all experimental contexts. The variability across studies could reflect differences in task difficulty, stimulus material, or characteristics of the participant sample. It cannot be ruled out that the comparably weak inter-subject synchronization in early auditory areas limited our ability to reliably detect attention-driven modulations of stimulus-driven activity in these regions.
Summary
We here adopted an inter-subject analysis approach to study attention effects on stimulus-driven auditory cortex activity during selective listening to continuous natural speech. Our data show that the auditory cortex gates attended input along the auditory processing hierarchy while responses to ignored input are gradually attenuated. Consistent with the view that attentional selection in competing-speaker experiments acts on segregated representations of the competing speech streams, attention-driven modulations of stimulus-related activity were largely restricted to later stages of the auditory processing hierarchy. In contrast, activity in the primary auditory cortex, which presumably reflects the acoustic input mixture, was only weakly affected by attention.
Footnotes
This work was supported by the Cluster of Excellence Hearing4all, funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (EXC2177/1, Project ID 390895286), by the Neuroimaging Unit of the Carl von Ossietzky Universität Oldenburg, funded by a DFG grant (3T MRI INST 184/152-1 FUGG), and by a Foundation Grant from the Canadian Institutes of Health Research (FDN1432179) to R.J.Z. S.P. was supported by a DFG research scholarship (PU590/1).
The authors declare no competing financial interests.
Correspondence should be addressed to Sebastian Puschmann at sebastian.puschmann@uni-oldenburg.de.