Abstract
Previous work has demonstrated that performance in an auditory selective attention task can be enhanced or impaired, depending on whether a task-irrelevant visual stimulus is temporally coherent with a target auditory stream or with a competing distractor. However, it remains unclear how audiovisual (AV) temporal coherence and auditory selective attention interact at the neurophysiological level. Here, we measured neural activity using EEG while human participants (men and women) performed an auditory selective attention task, detecting deviants in a target audio stream. The amplitude envelope of the two competing auditory streams changed independently, while the radius of a visual disk was manipulated to control the AV coherence. Analysis of the neural responses to the sound envelope demonstrated that auditory responses were enhanced largely independently of the attentional condition: both target and masker stream responses were enhanced when temporally coherent with the visual stimulus. In contrast, attention enhanced the event-related response evoked by the transient deviants, largely independently of AV coherence. These results provide evidence for dissociable neural signatures of bottom-up (coherence) and top-down (attention) effects in AV object formation.
SIGNIFICANCE STATEMENT Temporal coherence between auditory stimuli and task-irrelevant visual stimuli can enhance behavioral performance in auditory selective attention tasks. However, how audiovisual temporal coherence and attention interact at the neural level has not been established. Here, we measured EEG during a behavioral task designed to independently manipulate audiovisual coherence and auditory selective attention. While some auditory features (sound envelope) could be coherent with visual stimuli, other features (timbre) were independent of visual stimuli. We find that audiovisual integration can be observed independently of attention for sound envelopes temporally coherent with visual stimuli, while the neural responses to unexpected timbre changes are most strongly modulated by attention. Our results provide evidence for dissociable neural mechanisms of bottom-up (coherence) and top-down (attention) effects on audiovisual object formation.
Introduction
In many real-world sound environments, sounds originate from multiple sources; the auditory system needs to appropriately segregate and group sound features to efficiently process the entire scene (Shamma et al., 2011; Maddox and Shinn-Cunningham, 2012; Middlebrooks et al., 2017). Several psychoacoustic studies have demonstrated that visual cues that are temporally coherent with sounds can modulate auditory processing. For example, a synchronous, task-irrelevant light flash improves the detection of weak auditory signals (Lovelace et al., 2003). Similarly, task-irrelevant visual stimuli that are temporally coherent with a speech envelope enhance speech intelligibility in background babble noise (Yuan et al., 2020). Furthermore, performance in an auditory selective attention task can be enhanced or impaired, depending on whether the task-irrelevant visual stimulus is temporally coherent with a target sound stream or a competing masker stream (Maddox et al., 2015). However, the neural mechanisms mediating the interactions between temporal coherence and selective attention in facilitating audiovisual (AV) integration remain unknown.
Several previous studies have identified potential neural correlates of attentional modulation of AV integration. For example, a study using simple tone pips and visual gratings demonstrated that ERPs related to multisensory integration were amplified by selective attention (Talsma and Woldorff, 2005). When both visual and auditory stimuli were attended, the ERP peak amplitude showed superadditive AV effects; however, subadditive effects were observed for unattended stimuli (Talsma et al., 2007). Some EEG and MEG studies have analyzed "neural envelope-tracking responses" to speech by modeling the relationship between neural activity and the auditory envelope (Golumbic et al., 2013; Crosse et al., 2015), and have found that congruent AV speech enhances the envelope-tracking response relative to auditory speech alone or the linear summation of auditory and visual speech. Other studies have used auditory selective attention tasks to show that attention is necessary for AV speech integration. For example, Morís Fernández et al. (2015) measured fMRI data and showed that multisensory integration occurred almost exclusively when the congruent AV speech was attended. However, Ahmed et al. (2021) measured EEG and found some evidence for early AV integration in the unattended stream, consistent with the idea that distinct AV computations emerge at different processing stages (Talsma and Woldorff, 2005; Talsma et al., 2007; Kayser and Shams, 2015; Zumer et al., 2020). One difficulty with interpreting findings from AV speech processing is that it is hard to know the extent to which they generalize to other continuous AV stimuli, given that speech processing can be heavily influenced by linguistic knowledge and expectations. Thus, these speech-specific studies might not reflect more general mechanisms of visual influences on auditory processing.
Consistent with AV integration occurring independently of attention for nonspeech stimuli, neural correlates of AV integration were observed in single neurons in the auditory cortex of passively exposed ferrets, where temporal coherence enhanced the neural representation not only of the temporally coherent feature (i.e., envelope) but also of other sound features (i.e., timbre) (Atilgan et al., 2018). Taken together, these findings leave it unclear whether such bottom-up effects modulate the cortical representation of auditory streams independently of attentional top-down enhancement, or whether these effects are synergistic.
Here, we use EEG to investigate the electrophysiological correlates of AV temporal coherence and auditory selective attention on sound processing in an auditory selective attention task. Listeners were required to detect short timbre deviants in an attended audio stream, while a visual stimulus was paired with either the target, masker, or neither sound through coherent size/amplitude fluctuations. First, we focused our analysis on how AV coherence and attention affected the neural signatures of continuous stream processing, as manifest in the envelope-tracking response. Second, we focused on the transient auditory deviants, whose timing was independent of the features of the visual stream, and compared deviant-evoked ERPs between conditions. Our goal was to test the hypothesis that attention and AV integration operate independently.
Materials and Methods
Participants
Twenty volunteers were recruited for this experiment (median ± SD age, 22 ± 2 years; 12 males; 19 right-handed). All participants were healthy, had self-reported normal hearing, and had normal or corrected-to-normal vision. Before the experiment, each participant gave written informed consent. All procedures were approved by the Human Subjects Ethics Sub-Committee of the City University of Hong Kong.
Stimuli
We adapted the behavioral paradigm from previous psychoacoustic studies (Maddox et al., 2015; Atilgan and Bizley, 2021). Stimuli included two simultaneously presented auditory streams and one visual stream. One auditory stream was to be attended and will be referred to as the target stream (At), while the other was to be ignored and will be referred to as the masker stream (Am). Finally, stimulation included a concurrently presented visual stream (V), which comprised a radius-modulated disk. The auditory streams were independently amplitude-modulated, and the modulation of the visual disk could be temporally coherent with the amplitude of the target stream (AtAmVt), with the masker stream (AtAmVm), or independent of both (AtAmVi) (Fig. 1A).
The <7 Hz envelopes were generated using the same methods as in Maddox et al. (2015). Briefly, frequency-domain synthesis was used to generate the envelopes. In the frequency domain, the amplitudes of frequency bins between 0 and 7 Hz were set to 1 and those of all other frequency bins to 0. The nonzero bins were given a random phase drawn from a uniform distribution between 0 and 2π, at an audio sampling rate of 24,414 Hz. To maintain Hermitian symmetry, the frequency bins mirrored across the Nyquist frequency were set to the respective complex conjugates. The inverse Fourier transform was calculated to create the time-domain envelope. The three envelopes of each trial were computed using this method and orthogonalized using the Gram–Schmidt procedure. Visual envelopes were generated by downsampling the auditory envelope to the monitor frame rate of 60 Hz, with the disk radius of the first frame corresponding to the first auditory sample. Each auditory stream consisted of one continuous amplitude-modulated synthetic vowel, either /u/ or /a/, generated by filtering a click train at four "formant" frequencies (F1-F4). The fundamental frequency (F0) of vowel /u/ was 175 Hz, and its formant peaks were 460, 1105, 2975, and 4263 Hz, while the F0 of vowel /a/ was 195 Hz, and its formant peaks were 936, 1551, 2975, and 4263 Hz. Auditory deviants were embedded in the auditory streams by temporarily changing the timbre of the vowel. The deviant in vowel /u/ transitioned (in F1/F2 space) toward the vowel /ε/, with the maximum timbre change resulting in formant peaks at 525, 1334, 2975, and 4263 Hz, while the deviant in vowel /a/ transitioned toward /i/, with formant peaks at 860, 1725, 2975, and 4263 Hz. Each deviant lasted 200 ms, comprising a linear change of the formants toward the deviant for 100 ms and back for 100 ms.
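The envelope synthesis and orthogonalization steps described above can be sketched as follows. This is a minimal Python illustration of the general method, not the authors' code; the function names, the exclusion of the DC bin, and the peak normalization are our assumptions.

```python
import numpy as np

def make_envelope(n_samples, fs, f_cut=7.0, rng=None):
    """Frequency-domain synthesis of a <f_cut Hz random envelope:
    unit amplitude and random phase for the pass-band bins, zero elsewhere.
    Using the rfft half-spectrum and irfft enforces Hermitian symmetry."""
    rng = np.random.default_rng() if rng is None else rng
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spec = np.zeros(len(freqs), dtype=complex)
    band = (freqs > 0) & (freqs < f_cut)          # DC bin excluded (our choice)
    spec[band] = np.exp(1j * rng.uniform(0, 2 * np.pi, band.sum()))
    env = np.fft.irfft(spec, n=n_samples)
    return env / np.max(np.abs(env))              # normalize peak to 1

def gram_schmidt(envs):
    """Orthogonalize equal-length envelopes (rows) by classical Gram-Schmidt."""
    out = []
    for v in envs:
        v = np.asarray(v, dtype=float).copy()
        for u in out:
            v -= (v @ u) / (u @ u) * u            # remove projection onto u
        out.append(v)
    return np.array(out)
```

In practice the three per-trial envelopes would be generated at the audio rate (24,414 Hz) and the visual one then downsampled to the 60 Hz frame rate.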
The visual stimulus was a gray disk surrounded by a white ring presented at the center of the black screen. The radius of the visual stimulus was modulated by the visual envelope, such that the disk subtended between 1° and 2.5°, and the white ring extended 0.125° beyond the gray disk.
Each trial lasted 14 s and comprised three streams. The target audio stream and the visual stream were each 14 s in duration, while the masker stream, although also generated to be 14 s in duration, was silenced for the first second. The initial 1 s, during which only the target stream was audible, cued the listener as to which stream was the to-be-attended target. Auditory deviants could occur at any time during a window beginning 2 s after the onset of the target audio stream and ending 1 s before the end of the trial, subject to the constraint that the minimum interval between auditory deviants was 1.2 s. On average, each stream contained 2 deviants (range 1-3 across trials). Unlike Maddox et al. (2015), the visual stream did not contain any color deviants.
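The deviant-timing constraint (onsets restricted to a window and separated by at least 1.2 s) can be implemented by simple rejection sampling. This is one possible sketch, not the authors' implementation; the window bounds follow the trial description above (2 s after target onset to 1 s before trial end).

```python
import numpy as np

def draw_deviant_times(n_deviants, t_min=2.0, t_max=13.0, min_gap=1.2,
                       rng=None, max_tries=1000):
    """Draw sorted deviant onsets in [t_min, t_max] s, rejecting draws
    whose minimum inter-deviant interval is below min_gap."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        t = np.sort(rng.uniform(t_min, t_max, n_deviants))
        if n_deviants < 2 or np.min(np.diff(t)) >= min_gap:
            return t
    raise RuntimeError("could not satisfy spacing constraint")
```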
Procedure
Participants were seated in a sound-attenuated room. Auditory stimuli were presented binaurally via earphones (ER-3, Etymotic Research), using an RZ6 signal processor at a sampling rate of 24,414 Hz (Tucker-Davis Technologies). The sound level was calibrated at 65 dB SPL. Visual stimuli were presented on a 24-inch computer monitor using the Psychophysics Toolbox for MATLAB. Participants were asked to pay attention to the target auditory stream and to detect the embedded auditory deviants by pressing a keyboard button. They were instructed to refrain from pressing buttons in response to any events in the masker stream.
Before the main task, all participants completed a training session to verify that they were able to detect the auditory deviants. The training session included four blocks of 9 trials each. Feedback on performance was given after each block, and all participants demonstrated that they could perform the task (d′ > 0.8) in at least one of the four blocks.
Participants were instructed to minimize eye blinks and body movements during the EEG recording. Continuous EEG signals were collected using an ANT Neuro EEGo Sports amplifier from 64 scalp channels at a sampling rate of 1024 Hz. The EEG signals were grounded at the nasion and referenced to the Cpz electrode. Each participant completed 12 blocks of the task, with 18 trials (6 trials × 3 conditions) in each block. Trials belonging to different conditions were presented in a randomly interleaved order. In total, each participant completed 216 trials (72 trials × 3 conditions). Feedback on behavioral performance was provided after each block. Triggers corresponding to trial and deviant onset were recorded along with the EEG signal.
Behavioral data analysis
A "hit" was defined as a response occurring within 1 s of the onset of a deviant in the target auditory stream, and a "false alarm" was defined as a response to a deviant that occurred in the masker stream. To study how visual coherence affects auditory deviant detection, we conducted a one-way repeated-measures ANOVA on the sensitivity measure d′ with a within-subjects factor of AV condition (visual coherent with the target, AtAmVt; visual coherent with the masker, AtAmVm; and independent visual, AtAmVi).
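As a concrete illustration, d′ can be computed from hit and false-alarm counts as z(hit rate) − z(false-alarm rate). This sketch uses only the Python standard library; the clipping rule for extreme rates (0 or 1) is our assumption, as the text does not specify one.

```python
from statistics import NormalDist

def dprime(hits, n_targets, false_alarms, n_distractors):
    """d' = z(hit rate) - z(false-alarm rate). Rates of 0 or 1 are clipped
    to 1/(2N) and 1 - 1/(2N) (a standard correction; our assumption)."""
    z = NormalDist().inv_cdf

    def rate(k, n):
        return min(max(k / n, 1 / (2 * n)), 1 - 1 / (2 * n))

    return z(rate(hits, n_targets)) - z(rate(false_alarms, n_distractors))
```

For example, the group-mean rates reported in the Results (hit rate ≈ 69%, false-alarm rate ≈ 4%) yield a d′ slightly above 2.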
EEG signal preprocessing
EEG signals were preprocessed using the SPM12 Toolbox (Wellcome Trust Center for Neuroimaging, University College London) for MATLAB. Continuous data were downsampled to 500 Hz, high-pass filtered at a cutoff frequency of 0.01 Hz, notch-filtered between 48 and 52 Hz, and then low-pass filtered at 90 Hz. All filters were fifth-order zero-phase Butterworth. Eyeblink artifacts were removed using a principal component analysis (PCA)-based "preselection" spatial filtering technique (Ille et al., 2002). Specifically, eyeblink artifacts were detected by computing the PCs of the signal in channel Fpz and removed by subtracting the first two spatiotemporal components associated with each eyeblink from all channels (Ille et al., 2002). The EEG data were then rereferenced to the average of all channels. The preprocessed data were further analyzed in two ways. For the response to the sound amplitude envelope, the preprocessed data were bandpass filtered between 0.3 and 30 Hz (Crosse et al., 2015), downsampled to 64 Hz, and subjected to temporal response function (TRF) estimation or stimulus reconstruction (see below). For the deviant-evoked response analysis, the preprocessed data were epoched from −100 to 500 ms relative to deviant onset. Epoched EEG signals were then denoised using the "Dynamic Separation of Sources" (DSS) algorithm (de Cheveigné and Simon, 2008), which is commonly used to maximize the reproducibility of stimulus-evoked responses across trials while maintaining the differences across stimulus types (here, 2 vowel types × 3 experimental conditions). Epoched data were linearly detrended, and the first seven DSS components were retained and used to project the data back into sensor space. The SD of the voltage over time was computed for each trial, and noisy trials whose SD exceeded the across-trial median by more than 2 SDs were excluded.
On average, ∼30 trials were excluded per participant (829 ± 31 [median ± SD] of 864 trials were retained). Denoised data were averaged across the retained trials.
EEG response to sound amplitude envelopes
Stimulus reconstruction
To investigate how visual temporal coherence and attention affect multisensory integration, we quantified the accuracy of neural tracking of the sound amplitude envelope. We reconstructed the amplitude envelopes of different elements of the AV scene (Crosse et al., 2015) from the EEG data using a linear model as follows:

ŝ(t) = Σn Στ r(t + τ, n) g(τ, n)

where ŝ(t) is the reconstructed envelope at time t, r(t + τ, n) is the EEG response at channel n and time t + τ, and g(τ, n) is the decoder weight for channel n at time lag τ.
Since, in the condition AtAmVi, the envelopes of At, Am, and Vi are independent of each other, we could obtain decoders for the envelope of the auditory target only, the auditory masker only, and the visual stimulus only, respectively. To obtain the decoder for each condition, we used leave-one-trial-out cross-validation to select the λ value (from the set 10−6, 10−5, …, 105, 106) for which the correlation between the reconstructed and actual envelopes of the held-out trials was maximized.
The reconstruction accuracy (r) was defined as the Pearson correlation coefficient between the actual stimulus envelope and the estimated envelope.
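A backward (stimulus-reconstruction) model of this kind can be sketched as a ridge regression from time-lagged multichannel EEG onto the stimulus envelope. This is a generic illustration of the approach of Crosse et al. (2015), not the authors' code; the lag range, regularization, and edge handling are simplified.

```python
import numpy as np

def lagged_matrix(resp, lags):
    """Stack time-lagged copies of the multichannel response.
    resp: (T, n_channels) -> (T, n_channels * n_lags)."""
    T, C = resp.shape
    X = np.zeros((T, C * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(resp, lag, axis=0)
        if lag > 0:                 # zero samples that wrapped around
            shifted[:lag] = 0
        elif lag < 0:
            shifted[lag:] = 0
        X[:, i * C:(i + 1) * C] = shifted
    return X

def train_decoder(resp, stim, lags, lam):
    """Ridge solution g = (X'X + lam*I)^-1 X's for envelope reconstruction."""
    X = lagged_matrix(resp, lags)
    XtX = X.T @ X
    return np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), X.T @ stim)

def reconstruct(resp, g, lags):
    """Apply a trained decoder to (possibly held-out) EEG responses."""
    return lagged_matrix(resp, lags) @ g

def accuracy(stim, est):
    """Reconstruction accuracy: Pearson r between actual and estimated envelope."""
    return np.corrcoef(stim, est)[0, 1]
```

In the actual analysis, λ would be chosen by leave-one-trial-out cross-validation as described above.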
Based on our main research question, namely, whether the effects of attention and coherence are independent or synergistic, the possible scenarios for combining the effects of coherence and attention were considered in the context of two main models of AV coherence: an integration model and a summation model (Fig. 1B). To test whether the reconstruction accuracy using either the AV decoder ("integration model") and/or the A + V decoder ("summation model") was significantly larger than chance, we conducted a nonparametric permutation test. A null distribution of 1000 Pearson's r values was created for each subject by correlating estimated sound envelopes with the actual sound envelopes of randomly shuffled trials. We estimated the sound envelopes using each decoder separately and generated a null distribution for each condition.
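The trial-shuffling permutation test can be sketched as follows. This is a simplified illustration; the exact shuffling scheme and summary statistic used by the authors may differ.

```python
import numpy as np

def permutation_null(est_trials, true_trials, n_perm=1000, rng=None):
    """Null distribution of reconstruction accuracy: on each permutation,
    pair every trial's estimated envelope with a randomly chosen trial's
    true envelope and average the resulting Pearson r values."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(true_trials)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(n)
        rs = [np.corrcoef(est_trials[j], true_trials[perm[j]])[0, 1]
              for j in range(n)]
        null[i] = np.mean(rs)
    return null
```

The observed mean accuracy (matched pairing) is then compared against, e.g., the 95th percentile of this null distribution.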
To test for the interaction of attention and AV integration, we computed a repeated-measures ANOVA on reconstruction accuracy with two main within-subjects factors: attention (target vs masker) and integration decoder (“integration model”: AV vs “summation model”: A + V).
TRF estimation
To investigate how visual temporal coherence and attention affect AV integration across the EEG channels, we estimated the linear TRF (Crosse et al., 2016a), which links the EEG response at each channel to the sound envelope. The TRF is the linear filter that best describes the brain's transformation of the sound envelope into the continuous neural response at each EEG channel location (Haufe et al., 2014). TRFs were estimated separately for each experimental condition (AtAmVt, AtAmVm, AtAmVi) as follows:

r(t, n) = Στ w(τ, n) s(t − τ) + ε(t, n)

where r(t, n) is the EEG response at channel n and time t, s(t) is the sound envelope, w(τ, n) is the TRF weight for channel n at time lag τ, and ε(t, n) is the residual.
To test whether AV integration is affected by attention, we compared the TRF amplitudes between the temporally coherent and independent conditions across EEG channels and time points. Single-participant TRF data were converted into 3D images (2D: spatial topography, 1D: time) and entered into a repeated-measures ANOVA with two within-subjects factors: attention (attended: TRFAt and TRFAtVt; unattended: TRFAm and TRFAmVm) and integration (coherent: TRFAtVt and TRFAmVm; independent: TRFAt and TRFAm).
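For illustration, a forward TRF model of this kind regresses each EEG channel onto time-lagged copies of the stimulus envelope. This minimal sketch assumes a simple ridge solution and is not the authors' implementation (which followed Crosse et al., 2016a).

```python
import numpy as np

def estimate_trf(stim, resp, lags, lam):
    """Forward model: ridge-regress each EEG channel onto time-lagged
    copies of the stimulus envelope.
    stim: (T,); resp: (T, n_channels) -> TRF weights (n_lags, n_channels)."""
    T = len(stim)
    S = np.zeros((T, len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(stim, lag)
        if lag > 0:                 # zero samples that wrapped around
            shifted[:lag] = 0
        elif lag < 0:
            shifted[lag:] = 0
        S[:, i] = shifted
    StS = S.T @ S
    # one ridge solve shared across channels (resp columns)
    return np.linalg.solve(StS + lam * np.eye(len(lags)), S.T @ resp)
```

The resulting weight matrix can then be compared across conditions, channels, and time lags, as in the ANOVA described above.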
Auditory deviant-evoked ERP
To assess how attention and visual coherence affect deviant-evoked activity, the EEG data were first subjected to a traditional channel-by-channel mass-univariate analysis. Epoched data were averaged over trials, separately for the deviants in At and Am and for each visual condition (Vt, Vm, Vi). Single-subject ERP data were converted into 3D images (two spatial dimensions and one temporal dimension) and entered into a repeated-measures ANOVA with two within-subjects factors: attention (attended: deviant in the At stream; unattended: deviant in the Am stream) and visual coherence (coherent with the sound containing deviants: deviants in AtVt and AmVm; visual condition independent of the sound: AtVm and AmVt). The two-way repeated-measures ANOVA was implemented as a GLM in SPM12. The resulting statistical parametric maps, representing the main and interaction effects, were thresholded at p < 0.05 (two-tailed) and corrected for multiple comparisons across spatiotemporal voxels at an FWE-corrected p = 0.05 (cluster-level).
In a follow-up attempt to isolate dissociable neural signatures of attention and visual coherence, we used PCA to reduce the dimensionality of the EEG data and obtain spatial PCs (representing channel topography weights) and temporal PCs (representing voltage time series). The ERP data were concatenated across participants before being subjected to PCA, so that the same PCs were obtained for all participants. The PCs quantified independent contributions to the whole-scalp data, increasing sensitivity to the isolated components relative to the original data, in which components are mixed. The first four PCs (explaining 80% of the original variance across participants) were used to extract single-participant ERP components for further analysis. Each PC was then converted into 1D images (time) and subjected to statistical inference using repeated-measures ANOVAs, as above. Significance thresholds were kept identical to those of the traditional univariate analysis, but correction for multiple comparisons was implemented across time points (rather than spatiotemporal voxels).
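The PCA decomposition of the concatenated ERPs can be sketched with an SVD. This is a schematic illustration; the exact concatenation, centering, and scaling conventions used by the authors may differ.

```python
import numpy as np

def erp_pca(erps, n_keep=4):
    """SVD-based PCA on ERPs concatenated across participants.
    erps: (n_participants, n_channels, n_times). Channels are treated as
    variables: spatial PCs weight channels, and the (unit-norm) right
    singular vectors give per-participant component time series."""
    P, C, T = erps.shape
    X = erps.transpose(1, 0, 2).reshape(C, P * T)   # channels x (participants*time)
    X = X - X.mean(axis=1, keepdims=True)           # center each channel
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    var_explained = s**2 / np.sum(s**2)
    spatial = U[:, :n_keep]                         # channel topographies
    scores = Vt[:n_keep].reshape(n_keep, P, T)      # per-participant time series
    return spatial, scores, var_explained[:n_keep]
```

Because all participants share the same spatial PCs, the per-participant temporal scores can be entered directly into the repeated-measures ANOVAs described above.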
Correlating timbre deviant evoked ERP magnitude with behavioral performance
Since the behavioral task was to detect deviants in the target auditory stream, we extracted the EEG responses to deviants in At and measured the peak-to-peak amplitude of the ERP PCs identified above. We then calculated Pearson correlation coefficients between the mean behavioral d′ and the mean PC amplitude over conditions (AtAmVt, AtAmVm, and AtAmVi). To reduce the number of comparisons, we limited our correlation analyses to those ERP components and factors that showed significant effects. Specifically, for the first PC, for which we identified a significant main effect of attention (see Results), the negative and positive peaks were measured between 100 and 200 ms and between 220 and 300 ms, respectively. For the third PC, for which we identified significant main and interaction effects of attention and coherence (see Results), the positive and negative peaks were measured between 50 and 160 ms and between 220 and 400 ms, respectively. Before calculating the correlations, we fitted the behavioral d′ to the PC peak-to-peak amplitude using a linear regression model and detected outliers in each condition using Cook's distance (threshold: three times the mean Cook's distance).
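The outlier-screening step can be illustrated as follows: Cook's distance for a simple linear regression, with points flagged when their distance exceeds three times the mean distance. This is a generic sketch, not the authors' code.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for simple linear regression of y on x:
    D_i = r_i^2 / (p * MSE) * h_i / (1 - h_i)^2,
    where r_i are residuals and h_i are leverages."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
    p = X.shape[1]
    mse = resid @ resid / (len(y) - p)
    return resid**2 / (p * mse) * h / (1 - h)**2

def flag_outliers(x, y, k=3.0):
    """Flag points whose Cook's distance exceeds k times the mean distance
    (k = 3 corresponds to the threshold used in the text)."""
    d = cooks_distance(np.asarray(x, float), np.asarray(y, float))
    return d > k * d.mean()
```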
Results
Behavioral results
First, we investigated whether behavioral performance was stable over time, which would warrant pooling data from all blocks. To this end, we calculated the single-participant hit rate separately for each of the 12 blocks and fitted the data using a linear regressor representing the block number. The resulting regression coefficient (slope) was not statistically different from zero across participants (one-sample t test, t = 1.11, p = 0.28), suggesting that there were no significant learning or fatigue effects during the experiment.
To investigate the effect of visual temporal coherence on behavioral performance, we performed one-way repeated-measures ANOVAs on d′ (F = 0.15, p = 0.85) (Fig. 1C), hit rates (F = 0.42, p = 0.66), and false alarm rates (F = 2.12, p = 0.13). The hit rates were 69 ± 2.7%, 70 ± 2.6%, and 70 ± 2.9% (mean ± SEM), and the mean false alarm rates were 4 ± 0.6%, 5 ± 0.9%, and 5 ± 1% for the three conditions (AtAmVt, AtAmVm, AtAmVi), respectively. No significant effect of visual coherence on deviant detection was observed, likely because of the large variability and heterogeneity of response patterns across participants. For instance, while some participants showed behavioral benefits of visual coherence (e.g., larger d′ in the AtAmVt condition than in AtAmVm), others showed the opposite effect (Fig. 1C). Two previous studies using similar stimulus paradigms (Maddox et al., 2015; Atilgan and Bizley, 2021) reported enhanced task performance when the target stream and visual stimulus were temporally coherent. Our failure to replicate these findings may be attributable to small but perhaps important differences in the details of the experimental paradigms, especially the manipulation of visual attention (see Discussion). Furthermore, our behavioral results are consistent with the general framework of the possible effects of attention and coherence (Fig. 1B), in which the relative contribution of the integration term might be small compared with the summation term. However, the aim of this study was to identify effects of auditory selective attention and AV coherence on physiological measures of neural stimulus representations, and the timbre deviants primarily served as a device for controlling and monitoring our participants' attention. The relatively high hit rates and low false alarm rates indicate that the deviants fulfilled that purpose.
Stimulus reconstruction reveals temporal coherence mediated AV integration
To investigate whether AV integration occurred in both attended and unattended conditions, we reconstructed an estimate of the sound envelope from the recorded EEG waveforms. We used the condition in which the visual stimulus was independent of both auditory streams to estimate unimodal reconstructions for the target auditory stream (At), the masker stream (Am), and the visual stream (Vi) (Fig. 2B). From this condition, we could independently estimate unisensory response elements without introducing some of the confounds inherent in comparing activity across multisensory and unisensory trials, where prestimulus expectation and attention may differ (Mishra et al., 2007; Rohe et al., 2019). We first confirmed that the unimodal reconstructions for all conditions were significantly better than chance, as estimated using a permutation test. From the unisensory reconstructions, we estimated the response to stimuli in which the visual stimulus was coherent with one or the other stream by linear summation. This linear summation model was compared with an integration model in which AV envelopes were reconstructed from the responses to conditions in which the visual stimulus was temporally coherent with one or the other stream (i.e., AtVt and AmVm) (Fig. 2A). Testing for the interaction of attention and integration in a two-way repeated-measures ANOVA, we found that only the main effect of integration was significant (F = 491.8, p < 0.001). In post hoc comparisons, we observed that the average reconstruction accuracy of the AV decoder was significantly higher than that of the A + V decoder for both the target stream (Fig. 2C, Wilcoxon signed-rank test, p < 0.001) and the masker stream (Fig. 2D, Wilcoxon signed-rank test, p < 0.001), consistent with AV integration occurring independently of attention.
Forward models highlight attentional modulation of auditory responses
We next asked how temporal coherence and attention affect AV integration across the different EEG channels by estimating TRFs for each channel. While stimulus reconstruction quantifies the accuracy of cortical tracking of the amplitude envelope using the multichannel EEG response (and may therefore be dominated by visual responses), TRFs reflect the linear transformation of the sound envelope into the neural response at each EEG channel. We first explored whether we could observe similar evidence of AV integration from the TRF estimates as we did with the stimulus reconstruction. We estimated unisensory TRFs for the auditory target stream (TRFAt), the auditory masker stream (TRFAm), and the visual stimulus (TRFVi), separately, from the responses in the condition AtAmVi, in which the temporal envelopes of all three streams were independent. We then estimated TRFAtVt and TRFAmVm using the responses in the conditions AtAmVt and AtAmVm, respectively.
To investigate how the cortical representation of amplitude envelopes was influenced by attention and AV integration, we used a two-way repeated-measures ANOVA to assess the influence of AV integration and attention on the TRF amplitudes across all EEG channels. We observed a significant main effect of attention (Fig. 3A, anterior cluster, 78-219 ms, Fmax = 26.68, Zmax = 4.51, pFWE < 0.001; Fig. 3B, anterior and central cluster, 297-391 ms, Fmax = 27.98, Zmax = 4.61, pFWE = 0.008) and integration (Fig. 3C, anterior cluster, 219-250 ms, Fmax = 14.12, Zmax = 3.35, pFWE = 0.009).
In summary, when measures of stimulus reconstruction accuracy were used to analyze the neural responses to the sound envelopes, we observed evidence that AV integration occurred for both the target and masker auditory streams. Analysis of TRF amplitudes across all EEG channels showed that attention modulated the magnitude of the TRF, and AV integration was observed for the masker stream in central and frontal channels. The attention effect was observed for a subset of channels in the TRF analysis but not in the stimulus reconstruction, which pooled the responses across all channels. Another possible reason that attentional effects were observed with the TRF but not the stimulus reconstruction is that the latter might be dominated by the responses to the visual stimulus (Fig. 2B). Together, our results suggest that AV integration occurs automatically, before attentional modulation.
Effects of AV temporal coherence and selective attention on deviant-evoked responses
The analysis so far has focused on the neural responses to the amplitude envelopes of the AV scene, and has revealed evidence for both attentional modulation of acoustic responses and AV integration of temporally coherent cross-modal sources. Since, in the temporally coherent conditions, the visual and auditory streams convey redundant information, this integration falls short of the stricter definition of binding proposed by Bizley et al. (2016), which requires an enhancement of independent features other than those that link the stimuli across modalities. Here, neither the presence nor the timing of the auditory timbre deviants that listeners detected in the selective attention task was predicted by the amplitude changes of the audio or visual envelopes, and the deviants thus provide a substrate with which to explore binding.
To investigate how AV temporal coherence and attention affect the deviant-evoked responses, we compared the ERPs evoked by deviants embedded in At and Am streams, and, to look for evidence of binding, asked how AV temporal coherence modulated these responses (Fig. 4). As shown in scalp topographies, which visualize the response change over time for each condition (Fig. 4A), the deviant-evoked response in the target stream was clearly stronger than that in the masker stream.
Accordingly, in a traditional channel-by-channel mass-univariate analysis, correcting for multiple comparisons across all channels and time points, we observed a significant main effect of attention (anterior cluster, 196-302 ms, Fmax = 16.87, Zmax = 3.65, pFWE < 0.001; posterior cluster, 210-320 ms, Fmax = 21.32, Zmax = 4.08, pFWE < 0.001) and a significant interaction effect of attention and temporal coherence (anterior cluster, 62-146 ms, Fmax = 11.26, Zmax = 2.99, pFWE = 0.036, posterior cluster, 58-146 ms, Fmax = 16.92, Zmax = 3.66, pFWE = 0.03). No main effect of temporal coherence was observed.
Significant post hoc comparisons between conditions were consistent with the main effect of attention: for both temporally coherent and temporally independent streams, the deviant response in the target always exceeded that of the masker. The amplitude of the ERP evoked by timbre deviants presented in the target stream (AtVm) was significantly larger than that in the masker stream (AmVt) in two clusters: negative peak enhancement was observed over anterior channels (Fig. 4B, first row, 210-300 ms after deviant onset, pFWE < 0.001, Tmax = 4.23), and positive peak enhancement over posterior channels (Fig. 4B, second row, 212-302 ms after deviant onset, pFWE < 0.001, Tmax = 4.68). In the AV coherent stream, we observed that ERP amplitude evoked by the timbre deviants in the attended coherent stream (AtVt) was significantly stronger than in the unattended coherent stream (AmVm) in two clusters: one over the central and frontal channels between time lag 236-310 ms (Fig. 4C, first row, cluster-level pFWE < 0.001, Tmax = 3.8), and one over posterior channels between time lag 234-350 ms (Fig. 4C, second row, cluster-level pFWE = 0.007, Tmax = 4.18).
Post hoc comparisons also allowed us to examine the interaction between temporal coherence and attentional condition. We observed that the amplitude of ERP evoked by deviants in the masker stream was significantly smaller when this was accompanied by a temporally coherent visual stimulus (Fig. 4E). The deviant-induced ERP was smaller in the AmVm condition than in the AmVt condition in two clusters: one over the central and frontal channels between time lag 74-186 ms (Fig. 4E, first row, cluster-level pFWE = 0.011, Tmax = 4.38), and one over left temporal and posterior channels between time lag 94-180 ms (Fig. 4E, second row, cluster-level pFWE = 0.005, Tmax = 3.72). In contrast, AV temporal coherence did not influence the size of the deviant response in the target stream (Fig. 4D).
From the mass-univariate ERP data analysis (i.e., when analyzing all channels and correcting for multiple comparisons across channels and time points), attention was the main modulator of the size of the deviant response, with temporal coherence only influencing the deviant responses in the masker stream. In a follow-up exploratory analysis, we investigated whether effects of visual coherence, as well as attention, can be identified when EEG channels are grouped into principal spatiotemporal components explaining different sources of variance. To this end, we performed a PCA to extract the spatiotemporal components of the ERP, and performed separate two-way repeated-measures ANOVAs, with attention (attended, unattended) and visual coherence (coherent, incoherent) as factors, on the first four PCs in the time domain. These four PCs together explained 80% of the original variance. The analysis of the first PC (Fig. 5A, explaining 67% of the original variance) showed only a main effect of attention (time lag between 208 and 284 ms, Fmax = 32.53, Zmax = 4.92, pFWE < 0.001). No main or interaction effects were significant for the second and fourth PCs (Fig. 5B and Fig. 5D, explaining 6% and 3% of the original variance, respectively). However, the analysis of the third PC (explaining 4% of the original variance) showed main effects of attention (time lag between 8 and 84 ms, Fmax = 43.33, Zmax = 5.53, pFWE < 0.001; 134-170 ms, Fmax = 77.98, Zmax = 6.88, pFWE < 0.001; 260 ms, Fmax = 26.54, Zmax = 4.50, pFWE < 0.001) and coherence (time lag at 346 ms, Fmax = 9.98, Zmax = 2.81, pFWE < 0.001), as well as an interaction between attention and visual coherence (time lag between 214 and 238 ms, Fmax = 14.82, Zmax = 3.43, pFWE < 0.001). We therefore subjected the third PC to further analyses, described below.
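As a concrete illustration, the PCA step above amounts to extracting spatial components from the channel × time ERP matrix and then analyzing the component time courses. The following minimal sketch uses synthetic data and a single grand-average ERP rather than subject- and condition-wise responses; it illustrates the technique, not the authors' actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_times = 64, 350
# Random stand-in for a grand-average ERP (channels x time points).
mean_erp = rng.standard_normal((n_channels, n_times))

# Treat time points as observations and channels as variables, so the
# SVD yields spatial components whose time courses can be tested.
X = mean_erp.T - mean_erp.T.mean(axis=0)      # center: (times, channels)
U, S, Vt = np.linalg.svd(X, full_matrices=False)

explained = S**2 / np.sum(S**2)               # variance explained per PC
pc_timecourses = X @ Vt[:4].T                 # (times, 4): first 4 PCs
```

Each column of `pc_timecourses` would then enter a repeated-measures ANOVA across conditions, as described above.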
Post hoc tests on this PC supported the idea that attention dominates the neural response, but that temporal coherence can modulate it. In keeping with the main ERP results, the effect of AV temporal coherence was apparent only in the unattended stream, suggesting that the effect of attention may be strong enough to elicit a ceiling effect. Specifically, we observed a main effect of attention (AtVm > AmVt: 86-244 ms, cluster-level pFWE < 0.001, Tmax = 8.16; AtVt > AtVm at 38 ms, cluster-level pFWE < 0.001, Tmax = 4.60; at 178 ms, cluster-level pFWE = 0.032, Tmax = 3.45; Fig. 5C). The effect of attention on the incoherent stream extended over more time points than the effect of attention on the coherent stream. Consistent with this reflecting a temporal coherence-mediated enhancement of the masker stream, the deviant-evoked responses in the masker were significantly greater when accompanied by a temporally coherent visual stimulus (AmVm > AmVt: 100-132 ms, Tmax = 3.79, cluster-level pFWE < 0.001; 240-268 ms, cluster-level pFWE < 0.001, Tmax = 3.55; Fig. 5C). The PC was dominated by responses from the left temporal and right frontal channels (Fig. 5C, last column).
Correlations between behavioral performance and EEG
To examine the relationship between the EEG responses and behavioral performance, we calculated Pearson correlation coefficients between measures of behavioral performance and neural activity. Outliers were deleted using Cook's distance if the distance exceeded 3 times the mean Cook's distance. We first considered whether the magnitude of the deviant response in the target stream correlated with overall behavioral performance (mean d′ across all visual conditions), reasoning that participants with a stronger deviant response might be better able to accurately report timbre deviants. For both PC1 and PC3, the peak-to-peak amplitudes of the PC time courses obtained for deviants in the target stream (At) correlated with overall d′ performance (PC1 peak-to-peak amplitude: Fig. 6A, r = 0.55, p = 0.019; PC3: Fig. 6B, r = 0.61, p = 0.005).
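The Cook's distance criterion (deleting points with distance greater than 3 times the mean) can be sketched as follows. This is an illustrative implementation on synthetic data; the variable names (`amplitude`, `dprime`) and sample size are assumptions, not the study's actual variables.

```python
import numpy as np

rng = np.random.default_rng(1)
amplitude = rng.standard_normal(20)              # e.g., PC peak-to-peak
dprime = 0.5 * amplitude + 0.3 * rng.standard_normal(20)

# Simple linear regression of d' on amplitude.
X = np.column_stack([np.ones_like(amplitude), amplitude])
beta, *_ = np.linalg.lstsq(X, dprime, rcond=None)
resid = dprime - X @ beta

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverage values
p = X.shape[1]
s2 = resid @ resid / (len(dprime) - p)           # residual variance
cooks_d = resid**2 / (p * s2) * h / (1 - h)**2

keep = cooks_d <= 3 * cooks_d.mean()             # deletion criterion
r = np.corrcoef(amplitude[keep], dprime[keep])[0, 1]   # Pearson r
```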
The auditory selective attention task required that participants not only detect timbre deviants, but also successfully differentiate target and masker events. We therefore hypothesized that listeners who more successfully engaged selective attention mechanisms might show larger differences in the magnitude of the deviant response to target and masker deviants. To test this, we subtracted the peak-to-peak amplitude of EEG responses to masker deviants from the peak-to-peak amplitude to target deviants, and correlated this difference with behavioral performance (d′). This relationship was observed for PC3 (Fig. 6C, r = 0.67, p = 0.001), but not PC1 (r = −0.01, p = 0.971).
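The target-minus-masker peak-to-peak measure might be computed along these lines; the data here are synthetic, and the subject count and window length are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subj, n_times = 18, 350
# Per-subject PC time courses for target- and masker-stream deviants.
target_pc = rng.standard_normal((n_subj, n_times))
masker_pc = rng.standard_normal((n_subj, n_times))
dprime = rng.standard_normal(n_subj)             # behavioral sensitivity

# Peak-to-peak (max minus min) per subject, then the target-minus-
# masker difference is correlated with d-prime across subjects.
diff = np.ptp(target_pc, axis=1) - np.ptp(masker_pc, axis=1)
r = np.corrcoef(diff, dprime)[0, 1]
```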
Finally, while the visual condition did not significantly influence behavioral performance at the group level, there was significant heterogeneity among our listeners. To determine whether the modulation of behavioral performance by the visual stimulus correlated with the magnitude of the attention × visual condition interaction in PC3, we considered the difference in normalized d′ between target-coherent and masker-coherent trials (i.e., the difference between target-coherent d′ and masker-coherent d′, divided by overall d′) and correlated this with the difference in the attentional modulation of the third PC peak-to-peak amplitude across visual conditions, that is, AtVt-AmVt versus AtVm-AmVm (Fig. 6D, r = 0.51, p = 0.031). While the correlation was significant and in the predicted direction (i.e., participants who showed a benefit for target-coherent trials had greater attentional modulation in the target-coherent condition), we note that it was principally driven by a single participant whose removal renders the correlation nonsignificant.
Discussion
This study used an auditory selective attention task, performed in the presence of a temporally modulated visual stimulus, to dissect the neural signatures of selective attention and AV temporal coherence. Our EEG envelope-response data provide evidence for AV integration of temporally coherent AV envelopes, which occurred independently of selective attention. Meanwhile, selective attention had a strong effect on the amplitude of TRFs derived from the envelope responses, with TRFs corresponding to target streams yielding higher amplitudes than those corresponding to masker streams. To further investigate AV binding, we examined the EEG responses to the timbre deviants, which occurred independently of the amplitude envelopes of the audio(visual) streams. That the EEG responses elicited by the timbre deviants were affected by the visual coherence of the stimulus can be interpreted as evidence that temporal coherence in the AV streams favored the emergence of a fused AV percept, which contrasts more strongly against the deviants than a purely auditory stream would. In direct support of this notion, we observed that, in some spatiotemporal components of the neural response, AV temporal coherence interacted with selective attention.
Temporal coherence-based AV integration occurs independently of attention
Based on the stimulus envelope reconstruction analysis, we found that the cortical responses to the AV amplitude envelope were better explained by an AV integration model than by a linear summation (A + V) model in both the attended and unattended streams, suggesting that attention was not required to link the audio and visual streams. Our study thus provides evidence that AV integration based on temporal coherence between the auditory and visual streams can occur independently of attention. This result contrasts with previous studies using speech stimuli. Ahmed et al. (2021) found that AV integration was only observed for the attended speech stream, demonstrating that responses to attended speech were better explained by an AV model, while responses to unattended speech were better explained by the A + V model. However, their integration model outperformed the linear summation model for unattended speech at very short (0-100 ms lag) latencies, suggesting that distinct multisensory computations occur at different processing stages. In contrast to studies using natural speech and videos of faces, our visual disk was much simpler. One possibility, already noted by Atilgan et al. (2018), is that bottom-up AV integration does occur independently of attention for simple nonspeech stimuli. Another possibility is that watching a competing talker is more distracting than watching an uninformative disk, perhaps leading observers to actively suppress a competing face in the context of a selective attention task. A final difference might be that subjects in Ahmed et al. (2021) were instructed to look at the eyes of the face, whereas our listeners fixated on the disk itself; potentially, the radius changes of the disk, presented at the fovea, provide a more salient temporal cue. In support of this possibility, we note that the stimulus reconstruction accuracy of the visual-only decoder in the independent condition was quite high, and significantly larger than that of the audio-only decoder.
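A backward (stimulus-reconstruction) decoder of the kind referred to here can be sketched as lagged ridge regression from EEG to envelope. This is a simplified, synthetic-data illustration, not the study's exact decoder or regularization; in the actual model comparison, accuracies of decoders trained on the AV, A, and V data would be contrasted (AV vs A + V).

```python
import numpy as np

def lagged(eeg, n_lags):
    """Design matrix of time-lagged EEG: (times, lags * channels)."""
    n_times, n_ch = eeg.shape
    X = np.zeros((n_times, n_lags * n_ch))
    for lag in range(n_lags):
        X[lag:, lag * n_ch:(lag + 1) * n_ch] = eeg[:n_times - lag]
    return X

def train_decoder(eeg, envelope, n_lags=10, lam=1e2):
    """Ridge-regression backward model mapping lagged EEG to envelope."""
    X = lagged(eeg, n_lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]),
                           X.T @ envelope)

rng = np.random.default_rng(3)
n_times, n_ch = 2000, 8
envelope = rng.standard_normal(n_times)
# Synthetic EEG: the envelope linearly mixed into channels, plus noise.
eeg = np.outer(envelope, rng.standard_normal(n_ch)) \
      + rng.standard_normal((n_times, n_ch))

w = train_decoder(eeg, envelope)
reconstruction = lagged(eeg, 10) @ w
r = np.corrcoef(reconstruction, envelope)[0, 1]  # decoding accuracy
```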
We used a forward model to examine the cortical representation of the sound amplitude envelope across all EEG channels. Two-way repeated-measures ANOVA indicated significant main effects of attention and integration. In the unattended sound stream, the TRFAV amplitude was significantly stronger than the summation of TRFA and TRFV amplitude, which suggests that AV integration occurs independently of attention. This result is consistent with our results from the envelope reconstruction (Fig. 2), as well as a previous study from Crosse et al. (2015), both in terms of the direction of the effect (AV vs A + V) and its latency in the ∼200 ms range. Furthermore, attention strongly modulated the TRF, with the TRFAV amplitude for the target stream being significantly larger than that for the masker stream. This finding is consistent with previous studies, demonstrating an enhancement of attended speech streams (Ding and Simon, 2012; Mesgarani and Chang, 2012) and AV streams (Zion Golumbic et al., 2013). An open question is why AV temporal coherence did not influence the attended stream TRFAV. Perhaps the enhancement of the TRF by attention generated a ceiling effect, or possibly, if we had required subjects to attend to the visual stimulus, we might have observed stronger AV interactions. Nevertheless, our TRF results reveal the effects of both AV temporal coherence and attention on the TRF amplitude.
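The forward-model (TRF) estimation described above can be sketched as ridge regression of each EEG channel onto time-lagged copies of the envelope; the additivity test then compares TRF_AV against TRF_A + TRF_V. The data and parameters below are synthetic and purely illustrative.

```python
import numpy as np

def estimate_trf(envelope, eeg, n_lags=40, lam=1e1):
    """Forward model: TRF weights (lags x channels) by ridge regression."""
    n_times = len(envelope)
    X = np.zeros((n_times, n_lags))           # lagged envelope copies
    for lag in range(n_lags):
        X[lag:, lag] = envelope[:n_times - lag]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_lags), X.T @ eeg)

rng = np.random.default_rng(4)
env = rng.standard_normal(3000)
true_kernels = rng.standard_normal((4, 40))   # one response kernel/channel
# Synthetic EEG: envelope convolved with each channel's kernel, plus noise.
eeg = np.column_stack([np.convolve(env, k)[:3000] for k in true_kernels]) \
      + 0.5 * rng.standard_normal((3000, 4))

trf = estimate_trf(env, eeg)                  # estimated (lags, channels)
```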
Attention and coherence effects on the deviant evoked responses
In this study, we adapted the behavioral paradigm of previous studies (Maddox et al., 2015; Atilgan and Bizley, 2021); however, we failed to replicate the behavioral findings. Two key differences may explain this: first, the magnitude of the timbre deviants was increased, which effectively rendered the task easier. The overall d′ scores are higher in the current dataset than in previous ones. A recent study (Cappelloni et al., 2022) also suggested that the temporal coherence of the visual stream might not provide additional benefit if the two auditory streams were easily segregated. Second, in these previous studies, listeners were also required to detect occasional color deviants in the visual stimulus, which required them to maintain some level of attention toward the visual modality. In our experiment, the visual stimulus neither contained deviants of its own, nor did it provide cues that might facilitate the detection of auditory deviants. Within the framework of the model included in Figure 1B, attending to the visual stream would lead to further enhancement. It is possible that this difference explains why, at the group level, we did not observe a significant effect of AV temporal coherence on auditory deviant detection.
A whole-scalp analysis of deviant-evoked ERPs provided evidence for a main effect of attention, with the latency of the effect corresponding to a P300 time window. The P300 is a late component in response to novelty, occurring between 200 and 600 ms relative to deviant onset, and has been previously described for the auditory and visual modalities (Friedman et al., 2001). Previous studies showed that the P300 is attention-dependent (Polich, 2007), consistent with our findings. The anterior-posterior topography of the effect shown in Figure 4 reflects our choice of re-referencing to the average of all channels. In addition to this robust modulation of the deviant response by attention, a further PCA based on the timbre deviant-elicited ERPs revealed interactions between attention and AV temporal coherence. For specific PCs, there was an attention-dependent enhancement of the deviant-evoked responses in the target stream, independent of visual coherence. This suggests that the attentional modulation of the target stream is sufficiently strong that temporal coherence exerts no additional effect. We found the main effect of attention to modulate activity at very early latencies (8-84 ms), although cluster-based statistics do not indicate that all time points within this window show significant effects, only that some time points within the cluster do. The post hoc test showed that the early peak of the attention effect was at 38 ms (Fig. 5C, AtVt vs AtVm). Previous studies have shown similarly early attention effects on auditory responses: for example, in a previous MEG study (Auksztulewicz and Friston, 2015), a main effect of attention was observed at ∼27-40 ms after tone onset. Such early latencies are consistent with earlier results obtained in attentional paradigms based on auditory filtering (Rif et al., 1991) and could be interpreted as evidence of attentional gating (Lange, 2013).
However, for the unattended stream, temporal coherence did enhance the deviant-evoked response in the masker stream. One possibility, therefore, is that, in this paradigm, the attentional modulation was sufficiently strong that, for target sounds, there was a ceiling effect preventing any further modulation by AV temporal coherence (equivalent, in the model in Fig. 1B, to the magnitude of attentional enhancement rendering small changes due to AV integration irrelevant to the eventual summed activity). Some caution is required in interpreting these results given that the third PC accounted for only 4% of the variance in the EEG data, but it is noteworthy that this PC also correlated with differences in task performance: the magnitude of attentional modulation scaled with overall behavioral performance (d′; Fig. 6C). There was some evidence for a correlation between the extent to which the visual condition influenced behavioral performance and the magnitude of the temporal coherence-dependent attentional effects (Fig. 6D), although this requires replication, preferably with task parameters that more reliably elicit a modulation of task performance by AV temporal coherence. That we see significant AV integration in the envelope tracking responses, but not in behavior or in the main ERP analysis (Fig. 4) of the timbre deviants, potentially suggests that both behavior and timbre deviant responses are dominated by attentional effects. Future experiments could make attentional selection harder, for example, by making the pitch or timbre of the two streams more similar, to determine whether it is possible to unmask the AV temporal coherence effects hinted at by our PCA of the timbre deviant responses.
Our results are consistent with previous studies of speech stream segregation in “cocktail party” settings, in which congruent visual stimuli enhanced the cortical representation of the speech envelope of attended speech streams relative to unattended streams (Zion Golumbic et al., 2013; Crosse et al., 2016b). However, unlike in these previous studies, where visual speech provided temporal and contextual information about the auditory envelope, we used a simple disk as a visual stimulus, which provided no information about the auditory deviant. While previous studies have demonstrated that attention directed to one feature of an object may enhance the responses to other features of that object in both auditory (Alain and Arnott, 2000; Shinn-Cunningham, 2008; Shamma et al., 2011; Maddox and Shinn-Cunningham, 2012) and visual modalities (O'Craven et al., 1999; Blaser et al., 2000), our results provide new evidence that temporal coherence modulates the attentional enhancement of the neural response to the timbre deviants (“other” features) of the AV object.
In conclusion, we examined the effects of temporal coherence and attention on neural responses to the continuous sound envelope and on the deviant-evoked response, respectively. Temporal coherence facilitated AV integration independently of attention, and attention further enhanced the AV integration of the coherent AV stream. Attention amplified a large portion of the deviant-evoked response independently of temporal coherence, while coherence only modulated deviant-evoked responses in the unattended auditory stream. These results provide evidence for partly dissociable neural signatures of bottom-up (coherence) and top-down (attention) effects in AV object formation.
Footnotes
This work was supported in part by Wellcome & Royal Society Sir Henry Dale Fellowship Wellcome Trust Award 098418/Z/12/Z. This work was supported by European Commission's Marie Skłodowska-Curie Global Fellowship 750459 to R.A.; and European Community/Hong Kong Research Grants Council Joint Research Scheme Grant 9051402 to R.A. and J.W.S. For the purpose of open access, the author has applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission. We thank Reuben Chaudhuri and On-mongkol Jaesiri for assistance with data collection.
The authors declare no competing financial interests.
- Correspondence should be addressed to Jan W. Schnupp at wschnupp{at}cityu.edu.hk or Ryszard Auksztulewicz at ryszard.auksztulewicz{at}fu-berlin.de