Abstract
Neuroimaging studies suggest cross-sensory visual influences in human auditory cortices (ACs). Whether these influences reflect active visual processing in human ACs, which drives neuronal firing and concurrent broadband high-frequency activity (BHFA; >70 Hz), or whether they merely modulate sound processing is still debatable. Here, we presented auditory, visual, and audiovisual stimuli to 16 participants (7 women, 9 men) with stereo-EEG depth electrodes implanted near ACs for presurgical monitoring. Anatomically normalized group analyses were facilitated by inverse modeling of intracranial source currents. Analyses of intracranial event-related potentials (iERPs) suggested cross-sensory responses to visual stimuli in ACs, which lagged the earliest auditory responses by several tens of milliseconds. Visual stimuli also modulated the phase of intrinsic low-frequency oscillations and triggered 15–30 Hz event-related desynchronization in ACs. However, BHFA, a putative correlate of neuronal firing, was not significantly increased in ACs after visual stimuli, not even when they coincided with auditory stimuli. Intracranial recordings demonstrate cross-sensory modulations, but no indication of active visual processing in human ACs.
Significance Statement
Visual information has a profound influence on auditory processing, particularly in noisy conditions. These “cross-sensory” influences begin as early as in the ACs, the brain areas that process sound signals. It has, however, been unclear whether the auditory cortex actively processes visual information or whether visual signals merely change the way sounds are processed. We addressed this question with neurophysiological recordings in 16 participants with epilepsy who had electrodes implanted in their brains for medical reasons. Using these intracranial recordings, we show that cross-sensory visual information modulates sound processing but triggers no high-frequency activity—a correlate of local neuronal firing—in the auditory cortex. This result provides important information on the role of sensory areas in multisensory processing in the human brain.
Introduction
Cross-sensory visual information is known to influence early processing in human auditory cortices (ACs; Molholm et al., 2002; Foxe and Schroeder, 2005; Pekkola et al., 2005; Raij et al., 2010), but the functional significance of these influences remains unclear. According to a more conservative hypothesis, cross-sensory influences play a modulatory role in early ACs, to enhance relevant and suppress irrelevant sound inputs. In line with this suggestion, the effects of unimodal visual stimuli on ACs have often been limited to subthreshold synaptic influences (Schroeder and Foxe, 2002; Ghazanfar et al., 2005; Kayser et al., 2008). For example, laminar extracellular recordings in nonhuman primates (NHPs) suggest that cross-sensory visual stimuli modulate the phase of neuronal oscillations but do not increase multiunit firing activity in ACs (Lakatos et al., 2007). At the same time, neurophysiological evidence of multisensory interactions (MSIs), nonadditive changes in neuronal firing rates to multi- versus unisensory stimuli (Stein and Meredith, 1993), concentrates on higher rather than early sensory areas (Konorski, 1967; Barlow, 1972). One previous NHP model suggested that firing activity is suppressed rather than increased to concurrent audiovisual (AV) versus auditory inputs in ACs (Kayser et al., 2008).
A more provocative hypothesis suggests that cross-sensory stimuli may be actively processed in early ACs (Calvert et al., 1997; Foxe et al., 2000; Ferraro et al., 2020). This hypothesis has gained support from single-unit recordings in NHPs (Brosch et al., 2005) and other mammals (Wallace et al., 2004; Bizley et al., 2007; Kobayasi et al., 2013), which provide evidence for cross-sensory firing activity in ACs. According to one of these studies, cross-sensory stimuli can drive single-unit firing patterns that convey information about the properties of visual stimuli in ferret ACs (Bizley et al., 2007). A recent mouse study also suggested firing activity to visual stimuli in the deep infragranular layers of ACs (Morrill and Hasenstaub, 2018).
Whether cross-sensory activity effects such as those seen in animal models also occur in human ACs is difficult to examine using conventional noninvasive neuroimaging. Human neuroimaging evidence of responses to (unimodal) cross-sensory stimuli in ACs consists mainly of studies using MEG/EEG (Giard and Peronnet, 1999; Raij et al., 2010) and fMRI (Calvert et al., 1997; Pekkola et al., 2005), which offer limited means for making inferences about neuronal mechanisms. At the same time, the few previous intracranial studies on cross-sensory influences in ACs have been limited to relatively small samples of participants (N = 3–8) and individual-level analyses (Mercier et al., 2015; Ferraro et al., 2020), which may limit the generalizability of the results.
Here, we tested the hypothesis that human ACs are not only modulated but also activated by cross-sensory visual stimuli using intracranial stereo-EEG (SEEG) recordings in 16 participants with depth electrodes implanted near ACs for preoperative monitoring. SEEG provides direct measurements of local field potentials (LFPs) from the neuronal tissue, allowing quantification of broadband high-frequency activity (BHFA; above 70 Hz). In contrast to gamma band oscillations that are also visible to MEG and EEG, BHFA signals are believed to be reliable correlates of local spiking activity (Manning et al., 2009; Miller, 2010; Ray and Maunsell, 2011; Parvizi and Kastner, 2018). Moreover, in studies on ACs, the recording contacts of depth electrodes may extend across the depth of the superior temporal plane, which helps reveal auditory versus cross-sensory responses in much greater detail than is possible with noninvasive recordings, or even with subdural electrocorticography (ECoG). A challenge in previous human SEEG studies of cross-sensory modulation of human ACs has been that the clinically determined anatomical implantation plans vary between participants, thereby complicating robust hypothesis testing at the group level (Besle et al., 2008; Ferraro et al., 2020). Here, to facilitate group analyses, we complemented traditional “electrode-space” analyses with our recently introduced surface-based source modeling technique that estimates the neuronal activity in the anatomically normalized cortical “source space” (Lin et al., 2021).
Materials and Methods
Participants
We studied 16 participants (15–45 years, seven women) with pharmacologically intractable epilepsy who were undergoing clinically indicated intracranial SEEG recordings. All aspects such as the implantation and positioning of electrodes, as well as the durations of the recordings, were based purely on clinical needs. All participants gave written informed consent before participating in this study. The study was approved by the Institutional Review Board of Taipei Veterans General Hospital. The details of the participants’ demographics and electrode implantation plans are provided in Table 1. The sample size was selected to exceed the sample sizes of previous comparable intracranial human ECoG (Mercier et al., 2015) and SEEG studies (Ferraro et al., 2020).
Task and stimuli
In the main experiment, the 16 participants were presented with auditory (A), visual (V), or AV stimuli in a randomized order, at a 1.2–2.8 s jittered interstimulus interval (ISI; mean 2 s). In an oddball design, the participant was asked to press a button with their right index finger upon detecting a target stimulus (p = 0.1) among the nontarget stimuli (Fig. 1). In the A trials, the nontarget stimulus was a 300 ms white-noise burst and the target stimulus an equally long 440 Hz pure tone. In the V trials, the nontarget stimuli consisted of static checkerboard stimuli (3.5 × 3.5 degrees of visual angle, 300 ms duration) and the target was the same checkerboard overlaid with a central black diamond pattern. In the AV trials, the participant heard and saw the combinations of the A and V stimuli; the AV target was the combination of a pure tone and a checkerboard with a black diamond. The onset of the V stimulus preceded the A onset by 48 ms. Two 6 min runs of data were collected from each participant. In total, each participant was presented with 100 A, 100 V, and 100 AV stimuli. Auditory stimuli were delivered binaurally via insert earphones (Model S14, Sensimetrics) and the visual stimuli on a 17 inch computer screen (Asus MM17; ASUSTeK Computer) positioned at a viewing distance of 60 cm, controlled using E-Prime (Psychology Software Tools).
A subgroup of eight subjects also participated in a separate session with a more complex task and stimuli than in the main experiment (Table 1). This additional session was based on a delayed match-to-sample (DMS) design, with separate lexical and nonlexical parts. The SEEG data from the two DMS tasks were analyzed together for the present purposes. In each trial of the lexical DMS task, the participant was first presented with an A stimulus item, a voice with one of four Mandarin tones. After a randomly jittered delay period of 1–1.5 s, the participant was shown a V probe, which contained one of the four possible numbers (i.e., 1, 2, 3, or 4) on the video screen. The participant was asked to press one button with their right-hand index finger if the content of the V probe matched that of the A item and another button with their right-hand middle finger if it did not match. The task was self-paced; in other words, the task sequence moved on to the following trial only after the participant had responded. In each trial of the nonlexical DMS task, the participant was first presented with an A stimulus, consisting of a male or female voice. Then, after a jittered 1–1.5 s delay, the participant saw a V probe, consisting of the Mandarin symbol meaning “male” (i.e., 男) or “female” (i.e., 女). Their task was to press one button with their right-hand index finger if the genders of the A voice and the V symbol matched or another button with their right-hand middle finger if there was a mismatch.
Data acquisition
SEEG electrodes were placed solely based on the clinical need of each participant with epilepsy, to identify epileptogenic zones. The participants were implanted with 6–17 electrodes (0.86 mm diameter, 5 mm contact spacing; 6, 8, or 10 contacts per electrode; Ad-Tech). SEEG data were collected at 2,048 samples/s. A white matter contact was used as the reference electrode in five of the 16 participants. In the remaining 11 participants, the scalp EEG electrode at the location FPz served as the initial reference (Caune et al., 2014; Jonas et al., 2014, 2016; Rikir et al., 2014; Koessler et al., 2015; Mittal et al., 2016; Cam et al., 2017). However, in these 11 participants, the SEEG signals were re-referenced to a contact located in white matter before data analysis. Figure 2a depicts the distribution of all electrode contacts of all participants in a standard brain representation.
Two sets of anatomical MRIs were obtained, one before and another after the implantation surgery of each participant. The preoperative anatomical MRI was obtained with a 3 T scanner (Siemens Skyra, Siemens) using an MPRAGE sequence (TR/TI/TE, 2,530/1,100/3.49 ms; flip angle, 7°; partition thickness, 1.33 mm; matrix, 256 × 256; 128 partitions; FOV, 21 cm × 21 cm). The postoperative anatomical MRI was acquired at 1.5 T (GE Signa HDxt system) with a fast spoiled gradient–recalled echo sequence (TR/TE/TI, 10.02/4.28/0 ms; flip angle, 15°; matrix, 256 × 256; bandwidth, 31.2 kHz; FOV, 256 × 256 mm; axial slice thickness, 1.0 mm).
Preprocessing
SEEG preprocessing was performed using the Matlab (MathWorks) EEGLAB toolbox (version 14.1.0; http://sccn.ucsd.edu/eeglab/). Electrode contacts that carried excessive line noise were excluded from the analyses based on visual inspection. These analyses also excluded electrode contacts within epileptic lesion sites according to the neurologists (authors H.-Y.Y., C.-C.C.). SEEG waveforms were notch filtered at 60 Hz to reduce power line artifacts and detrended when cropping the raw data into trials. Trials including epileptiform activity were excluded. The SEEG recordings were then segmented into epochs of 1,500 ms duration, including a 500 ms pre-stimulus baseline, relative to the A, V, and AV stimulus onsets. SEEG epochs with deflections larger than 6 SDs from the epoch average were discarded. Analyses of time–frequency representations (TFRs) of power and inter-trial phase consistency (ITPC) were based on individual trials. The event-related response analyses in Figures 3 and 4 utilize trial-averaged responses.
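As an illustration, the epoching and amplitude-based rejection steps can be sketched in Python/NumPy (the function name, array shapes, and the exact operationalization of the 6 SD criterion are our assumptions; the original pipeline was implemented in Matlab/EEGLAB):

```python
import numpy as np

def epoch_and_reject(data, events, sfreq=2048, tmin=-0.5, tmax=1.0, sd_limit=6.0):
    """Cut continuous SEEG data (n_channels, n_samples) into epochs around event
    onsets (sample indices) and drop epochs containing deflections larger than
    sd_limit standard deviations from the epoch average."""
    n0, n1 = int(tmin * sfreq), int(tmax * sfreq)
    # (n_epochs, n_channels, n_times); here 1,500 ms with a 500 ms baseline
    epochs = np.stack([data[:, ev + n0:ev + n1] for ev in events])
    mean = epochs.mean(axis=-1, keepdims=True)
    sd = epochs.std(axis=-1, keepdims=True)
    # largest deviation from the epoch average, expressed in SD units
    peak_dev = (np.abs(epochs - mean) / sd).max(axis=-1)
    keep = (peak_dev < sd_limit).all(axis=1)
    return epochs[keep], keep
```

With the sampling rate of 2,048 samples/s used in the study, each epoch spans 3,072 samples; an epoch is kept only if every channel stays within the deviation limit.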
Source modeling
The coverage provided by depth electrode contacts is sparse and, due to clinical reasons, differs between participants. Thus, it is difficult to analyze the extent of activations within any individual subject or to conduct anatomically normalized group analyses. These challenges can potentially be addressed by intracranial source modeling of the SEEG data. Although SEEG source modeling has the characteristic limitations associated with ill-posed inverse problems, a crucial advantage compared with MEG/EEG source modeling is the availability of directly measured depth information about the origins of the signals within the brain parenchyma. Moreover, the BHFA signal that is measurable with SEEG, and which is not visible to noninvasive MEG/EEG, has been reported to be highly focal (Manning et al., 2009; Ray and Maunsell, 2011). Therefore, in addition to conventional signal analyses, we employed inverse modeling of the intracranial source currents to facilitate anatomically normalized group analyses of cross-sensory influences on human AC.
SEEG source modeling was conducted using our recently published strategy (Lin et al., 2021). The locations of the electrode contacts in the individual’s brain were identified from the post-surgery MRI, based on the discrete dark image voxel clusters caused by the susceptibility artifact at each electrode contact. After specifying the distances between neighboring contacts and the number of contacts in an electrode, the electrode was manually aligned with the dark image voxel clusters in the post-surgery MRI using our in-house Matlab (MathWorks) software with a graphical user interface. Thereafter, the contact locations were further optimized (within ±10 mm translation and ±2° rotation) by minimizing the sum of squares of the image voxel values at all contact locations and their neighboring voxels within a 3 × 3 × 3 voxel cube in the post-surgery MRI, using the Matlab patternsearch function. The SEEG electrode contact locations were then registered to the pre-surgery MRI, which was used to build the boundary element models (BEMs) required for the lead field calculation and to define the locations of potential neural current sources. The inner-skull, outer-skull, and outer-scalp surfaces for the BEMs, as well as the cortical source spaces at the gray/white matter boundaries and the subcortical source spaces, including the thalamus, caudate, putamen, hippocampus, amygdala, and brainstem, were automatically reconstructed from the pre-surgery MRI using FreeSurfer (http://surfer.nmr.mgh.harvard.edu). At each cortical source location, with locations separated by ∼5 mm from one another, we placed three orthogonal neural current dipoles in the +x, +y, and +z directions. In the subcortical regions, the sources were separated by 2 mm along the three orthogonal directions. An example of the current source space and electrode contacts, as well as of the inner-skull, outer-skull, and outer-scalp surfaces, from a representative participant is shown in Figure 2b.
The lead fields were calculated by using the OpenMEEG package (https://openmeeg.github.io/; Kybic et al., 2005; Gramfort et al., 2010), with the relative conductivity values for air, scalp, brain parenchyma, and skull being 0, 1, 1, and 0.0125, respectively.
The measured SEEG data and the cortical current sources at time t are related to each other by y(t) = A x(t) + n(t), where y(t) is the collection of SEEG data across electrode contacts, x(t) denotes the unknown current strengths, and n(t) denotes noise. Electrode contacts potentially related to epileptic activity were excluded. The vector x(t) has 3 × m elements describing the currents in three orthogonal directions at m brain locations, and A is the lead field matrix. For a unit current dipole source at location r′ in the +x, +y, or +z direction, the electric potentials measured from all electrode contacts are denoted by a(r′) = [ax(r′), ay(r′), az(r′)]. The lead field matrix A was obtained by assembling a(r′) across all m candidate source locations:

A = [a(r′₁), a(r′₂), …, a(r′ₘ)].

To estimate x(t), we used the minimum-norm estimate (MNE):

x̂(t) = Aᵀ(AAᵀ + λ²C)⁻¹ y(t),

where C is the noise covariance matrix and λ² is a regularization parameter.
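A minimal NumPy sketch of the minimum-norm estimation step follows (matrix dimensions and the SNR-based scaling of the regularization parameter λ² are illustrative assumptions; the actual implementation is part of the Matlab toolbox referenced in Data for reference):

```python
import numpy as np

def minimum_norm_estimate(y, A, noise_cov, snr=3.0):
    """Regularized minimum-norm estimate x_hat(t) = A^T (A A^T + lambda^2 C)^-1 y(t).

    y         : (n_contacts, n_times) SEEG measurements
    A         : (n_contacts, 3 * m) lead field, three orthogonal dipoles per location
    noise_cov : (n_contacts, n_contacts) noise covariance C
    """
    # illustrative SNR-based choice of the regularization parameter lambda^2
    lam2 = np.trace(A @ A.T) / (np.trace(noise_cov) * snr ** 2)
    inverse_operator = A.T @ np.linalg.inv(A @ A.T + lam2 * noise_cov)
    return inverse_operator @ y  # (3 * m, n_times) source current estimates
```

The inverse operator is time independent, so it can be computed once and applied to every sample of the epoch.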
Regions of interest
Three ROIs were defined based on the Human Connectome Project (HCP) multimodal parcellation (MMP1) atlas, combined version (Glasser et al., 2016), which was projected to the FreeSurfer “fsaverage” standard brain surface representation. From the HCP atlas, we selected the ROIs “early auditory cortex” (AC), “auditory association cortex” (AAC), and “temporo-parieto-occipital junction” (TPOJ), which were subsequently resampled to each participant’s individual cortical surface representation. For the group and individual-level TFR analyses, we determined in each subject the contact closest to the HCP labels AC, AAC, and TPOJ, after co-registering the labels to each subject’s cortical surface representation (Arnulfo et al., 2015). For the iERP analysis, average time courses were calculated across the vertices of each ROI, with the waveform signs of the sources aligned on the basis of surface-normal orientations to avoid phase cancellations.
Time–frequency representation analysis
TFRs of SEEG power were calculated in Matlab by convolving individual trials with a dictionary of 7-cycle Morlet wavelets. In the case of the ROIs of source estimates, the TFRs of power were determined from the sums of squares of the amplitude values along the three orthogonal dipole directions. For the group analyses, the power values were then averaged across trials and baseline normalized. In the supplementary individual-level analysis, we compared the power at each time/frequency element after the stimulus onset with the mean power during the respective pre-stimulus baseline period using paired t tests. ITPC across trials was calculated by dividing the amplitude of the mean of the wavelet coefficients across trials by the mean of the absolute values of the amplitudes:

ITPC(t, f) = |Σₖ Wₖ(t, f)| / Σₖ |Wₖ(t, f)|,

where Wₖ(t, f) denotes the complex wavelet coefficient of trial k at time t and frequency f, and the sums run over trials.
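A simplified single-channel sketch of the wavelet TFR and ITPC computation (in Python/NumPy rather than the Matlab used in the study; the ±3.5 σ wavelet support and unit-energy normalization are our assumptions):

```python
import numpy as np

def morlet_tfr(trials, sfreq, freqs, n_cycles=7):
    """Complex wavelet coefficients (n_trials, n_freqs, n_times) obtained by
    convolving single trials (n_trials, n_times) with 7-cycle Morlet wavelets."""
    n_trials, n_times = trials.shape
    W = np.empty((n_trials, len(freqs), n_times), dtype=complex)
    for fi, f in enumerate(freqs):
        sigma_t = n_cycles / (2 * np.pi * f)          # temporal SD of the Gaussian
        t = np.arange(-3.5 * sigma_t, 3.5 * sigma_t, 1 / sfreq)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t ** 2 / (2 * sigma_t ** 2))
        wavelet /= np.sqrt(np.sum(np.abs(wavelet) ** 2))  # unit energy
        for k in range(n_trials):
            W[k, fi] = np.convolve(trials[k], wavelet, mode="same")
    return W

def itpc(W):
    """Inter-trial phase consistency: |sum_k W_k| / sum_k |W_k| over trials."""
    return np.abs(W.sum(axis=0)) / np.abs(W).sum(axis=0)
```

Phase-locked trials yield ITPC near 1 at the stimulation frequency, whereas trials with random phases yield values near zero.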
Statistical analyses
For the behavioral data, we conducted a repeated-measures ANOVA across the task conditions. We then compared the reaction times (RTs) and proportions of correct responses (PCRs) between the AV and A conditions, as well as between the AV and V conditions, using paired t tests, corrected for multiple comparisons with the Bonferroni correction (total number of tests, 4). We also calculated a behavioral index of multisensory RT benefit, that is,
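The paired comparisons and Bonferroni adjustment can be sketched as follows (a NumPy illustration; the helper names are ours, and p values are assumed to come from a separate t-distribution lookup):

```python
import numpy as np

def paired_t_and_d(x, y):
    """Paired t statistic and Cohen's d (mean difference / SD of differences)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))
    return t, d.mean() / d.std(ddof=1)

def bonferroni(p_values, n_tests=None):
    """Bonferroni adjustment: multiply each p value by the number of tests,
    capped at 1."""
    p = np.asarray(p_values, dtype=float)
    m = n_tests if n_tests is not None else p.size
    return np.minimum(1.0, p * m)
```

With four planned tests, as in the study, a raw p of 0.01 becomes a corrected p of 0.04.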
For the iERP analysis, a nonparametric maximum-statistic permutation test was utilized to determine a critical value at which the group-average A and V responses significantly exceeded the pre-stimulus baseline. In each of 1,000 permutations, the order of the time points of each participant’s baseline period for the A and V responses was randomized before calculating the group average. To address the multiple-comparison problem, from each permutation we selected the maximum amplitude across the A and V baseline periods to be entered into the null distribution. The critical value was determined as the 95th percentile of this null distribution.
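A sketch of this maximum-statistic permutation procedure (Python/NumPy; the data layout, the use of absolute amplitudes, and the fixed random seed are our assumptions):

```python
import numpy as np

def max_stat_critical_value(baselines, n_perm=1000, alpha=0.05, seed=0):
    """Critical value from a maximum-statistic permutation test.

    baselines : list of (n_subjects, n_times) baseline arrays, one per condition.
    In each permutation, the time points of every participant's baseline are
    shuffled, the group average is computed, and the maximum absolute amplitude
    across all conditions enters the null distribution."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for i in range(n_perm):
        max_amp = 0.0
        for cond in baselines:
            shuffled = np.stack([rng.permutation(row) for row in cond])
            max_amp = max(max_amp, np.abs(shuffled.mean(axis=0)).max())
        null[i] = max_amp
    return np.quantile(null, 1.0 - alpha)  # e.g., the 95th percentile
```

Because the maximum is taken jointly across both conditions and all baseline time points, the resulting threshold controls the family-wise error rate over all of them at once.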
TFRs of power values at each frequency were averaged across trials, after which we calculated the base-10 logarithm of each value relative to the average power during the baseline period. Because the ITPC values are bounded within the range between 0 and 1, the values were transformed to z-score–like values using the Rayleigh test and the norminv.m function of Matlab (Fisher, 1993). We used a linear mixed-effects (LME) model to calculate the statistical inferences of the TFRs of power and ITPC in the A, V, and AV conditions across participants in both the electrode and source analyses, using the fitlme.m function of Matlab (MathWorks). For each condition, the analysis was conducted in each element of the relevant time–frequency grid. To compare each task condition separately against the baseline, we used LME models defined by the formula [Wilkinson notation (Wilkinson and Rogers, 1973)] M ∼ 1 + (1|subject_id), where M is the to-be-predicted effect (signal power or ITPC value, respectively), 1 refers to the fixed effect of the intercept, and the term (1|subject_id) refers to the random effect (i.e., random intercept) of the subject identity. Here, the statistical significance was determined based on the fixed effect of the intercept. For the contrast between AV and the other conditions, we first calculated the to-be-predicted effect size of the contrast of interest, which was then predicted by the same model. The group TFR analysis of Experiment 2 was conducted in the contacts closest to the ROIs representing AC, AAC, and TPOJ, separately for all A item trials and V probe epochs. In this analysis, the A and V epoch data from the lexical and nonlexical DMS tasks were pooled together. All TFR comparisons were corrected for the inflation of type I error by controlling the false discovery rate (FDR) across the time–frequency elements.
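The FDR step applied to the time–frequency p values can be sketched as a standard Benjamini–Hochberg step-up procedure (Python/NumPy; returning a boolean rejection mask is an illustrative choice):

```python
import numpy as np

def fdr_bh(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(p_values, dtype=float).ravel()
    m = p.size
    order = np.argsort(p)
    # compare the sorted p values against the step-up thresholds q * i / m
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting the criterion
        reject[order[:k + 1]] = True      # reject all hypotheses up to that rank
    return reject
```

In the study, the mask would span all time–frequency elements of a comparison, flattened before correction and reshaped afterward.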
Data for reference
Code used for this study is openly available at https://github.com/fahsuanlin/fhlin_toolbox/wiki. Deidentified data used to produce our main results are available at https://dataverse.harvard.edu/privateurl.xhtml?token=d7216368-34bf-4a8a-858a-824b216385fb. Any additional information required to reanalyze the data reported in this paper is available from the corresponding author.
Results
To investigate cross-sensory influences in AC, we presented A, V, and AV stimuli to 16 participants with SEEG depth electrodes implanted near ACs for presurgical monitoring. We found strong evidence of subthreshold modulatory effects, reflected as “phase resetting” of ongoing intrinsic low-frequency oscillations and enhanced event-related desynchronization (ERD) of 15–30 Hz β oscillations in ACs. Cross-sensory effects were also found in the ROI analysis of iERP source waveforms in AC. However, although the participants significantly benefited from combining information from the two modalities in terms of behavioral RTs, our group analyses provided little evidence of enhanced BHFA to V stimuli, whether presented alone or in combination with A stimuli.
Behavioral results
All 16 participants were able to perform the task according to the instructions: On average, they detected almost all targets in the A, V, and AV conditions, with no statistically significant differences in PCRs between the conditions in our repeated-measures ANOVA (F(2,30) = 0.45; p = 0.64). The mean ± standard error (SEM) values of the PCR for A, V, and AV target stimuli were PCRA = 0.92 ± 0.04, PCRV = 0.93 ± 0.04, and PCRAV = 0.94 ± 0.03, respectively. The high PCR value for the unimodal V targets provides evidence that the participants were able to focus their gaze according to the instructions. Evidence for significant advantages of combining A and V stimuli was found in the RTs to correctly detect target stimuli. The mean ± SEM values of the RT for the A, V, and AV conditions were RTA = 668 ± 24 ms, RTV = 575 ± 19 ms, and RTAV = 546 ± 24 ms, respectively. There was a significant main effect of task condition (repeated-measures ANOVA; F(2,30) = 22.7; p < 0.001). According to subsequent Bonferroni-corrected t tests across the conditions, the RTs were significantly faster to AV targets than to A targets (t(15) = −5.6; pCorrected < 0.001; Cohen’s d = −1.5), as well as to AV targets than to V targets (t(15) = −3.0; pCorrected < 0.05; d = −0.8).
Intracranial event-related potentials in AC
An example of iERPs in the A and V conditions in the electrode contact closest to AC in a single participant is shown in Figure 3b. The corresponding source estimate for the AC ROI is shown in Figure 3d. Both the closest-contact and the source-modeled waveforms show prominent cross-modal activity in AC in the V condition. Previous noninvasive studies in humans warrant the hypothesis that cross-sensory responses to simple V stimuli are elicited in ACs, lagging the first A responses by several tens of milliseconds (Raij et al., 2010). To evaluate this hypothesis using intracranial SEEG data, we compared the group-averaged source-estimated iERP waveforms in the AC ROI for A and V stimuli against a critical value determined using a maximum-statistic permutation procedure (Fig. 3d–f). The group-average AC source activity rose significantly above the pre-stimulus noise at 27 ms post stimulus for A responses and at 113 ms for V responses.
Time–frequency representation of SEEG power
Recent single-unit recordings in NHPs (Brosch et al., 2005) and other mammals (Wallace et al., 2004; Bizley et al., 2007; Kobayasi et al., 2013), including recent studies in mouse ACs (Morrill and Hasenstaub, 2018), suggest that, beyond purely modulatory influences, cross-sensory visual stimuli can also trigger suprathreshold activations of neurons in ACs, reflecting active processing of visual information. Here, we examined BHFA, a putative correlate of multiunit firing activity in nearby regions (Ray et al., 2008; Parvizi and Kastner, 2018), using both the closest-electrode-contact and source-modeled data.
For each subject, we determined the electrode contacts that were closest to the centroids of three cortical HCP labels across the auditory hierarchy, including “early auditory cortex” (AC), “auditory association cortex” (AAC), and “temporo-parieto-occipital junction” (TPOJ; Fig. 4b).
A limitation of the electrode-contact analysis is that the recording sites, which were determined by clinical reasons only, differ between the participants. Therefore, to allow for anatomically normalized analyses across the auditory processing hierarchy, we used our recently developed SEEG source modeling approach (Lin et al., 2021). Hypothesis testing was conducted using LME models analogous to those used for electrode-contact analyses in the cortical ROIs AC, AAC, and TPOJ (Fig. 4c).
TFRs of baseline-normalized signal power values to A, V, and AV stimuli in these contacts of interest and the respective ROIs were then entered into LME models comparing them with the pre-stimulus baseline and determining the differences between the AV–(A + V), AV–A, and AV–V contrasts. These LME models assessed the statistical significance of the intercept while controlling for the fixed effect of implanted hemisphere and the random effect of subject identity at all time and frequency instances, with the p values corrected for multiple comparisons using the FDR procedure of Benjamini and Hochberg (1995). The critical p value was determined jointly across all conditions/contrasts and contacts of interest at 0–500 ms and 6–250 Hz.
A, V, and AV stimulus conditions
In the contact closest to the HCP label AC, TFRs to A and AV stimuli showed a characteristic event-related pattern in which a statistically significant (FDR-adjusted p < 0.05) early power increase (or “event-related synchronization,” ERS) at the 6–15 Hz θ/α range is followed by a β-range (15–30 Hz) power suppression (or “event-related desynchronization,” ERD; Fig. 4b). At the group level, this θ/α-β ERS/ERD pattern was coupled with a significant (FDR-adjusted p < 0.05) sustained increase of BHFA. Interestingly, the β-range ERD was equally clear to V stimuli alone, suggesting statistically significant (FDR-adjusted p < 0.05) cross-sensory modulation of β-range oscillations in or near human ACs, starting ∼200 ms after the stimulus onset. However, no evidence of increases of BHFA, a putative marker of underlying firing activity, was found after V stimuli in the contacts closest to AC, AAC, or TPOJ.
The A, V, and AV stimulus-related TFRs of SEEG power of each individual participant’s contact closest to AC are shown in Figure 5. In these individual-level analyses, single-trial TFR power was compared with the pre-stimulus baseline using paired t tests across all trials. The comparisons against the baseline also served to normalize the TFRs across frequencies, mitigating the 1/f trend of signal power. To avoid false negatives in our interpretation of the lack of BHFA to V stimuli in ACs, we applied no correction for multiple comparisons in these specific single-participant analyses. Despite this lenient criterion, in the electrode-contact space, evidence of a robust increase of visual BHFA, deemed qualitatively different from pre-stimulus noise, was found in only one participant, Subject 8. However, in this participant, the contact that was closest to early AC in the 3D volume appears, in fact, to be located closer to the fundus of the STS (a classic multisensory area) than to AC per se.
Results of our anatomically normalized group ROI source-modeling analysis agreed with the SEEG contact analysis (Fig. 4c). At the lower frequency range, responses to A and AV stimuli were characterized by an early θ/α ERS (<200 ms; FDR-adjusted p < 0.05), followed by a longer-lasting β ERD (<30 Hz; FDR-adjusted p < 0.05). Consistent with the electrode-contact analysis, this later lower-frequency ERD effect was also prominently significant to V stimuli alone (FDR-adjusted p < 0.05), constituting the strongest putatively cross-sensory influence across the auditory hierarchy in the present study. As for the BHFA range, statistically significant (FDR-corrected p < 0.05) power increases were observed to A and AV stimuli alone in AC, AAC, and TPOJ areas. No significant increases of cross-sensory BHFA were observed to V stimuli in AC or AAC. Traces of significantly increased BHFA to V stimuli were, however, observed in the area TPOJ (FDR-adjusted p < 0.05). Figure 6 shows the source modeling results of the AC ROI in each individual participant.
Contrasts between A, V, and AV conditions
The group analysis of TFRs from the contacts closest to the three ROIs suggested significant (FDR-adjusted p < 0.05) nonadditive MSI effects, determined from the AV–(A + V) contrast, at the β range (Fig. 7a). As the β power was suppressed to both A and V stimuli, these positive MSI effects can be viewed as a sub-additive reduction of β suppression. Other than the confirmatory contrast between the AV and V conditions, where power changes to a stimulus containing auditory information versus no auditory information are unsurprising, there was no evidence that the simultaneous presentation of A and V stimuli increases BHFA in the contact closest to AC. Similar to the electrode-contact analysis, the results of the source-modeling ROI analyses suggested significant (FDR-adjusted p < 0.05) nonadditive MSI effects in AC, as well as in AAC and TPOJ, occurring at the θ and β ranges (Fig. 7b). Consistent with the electrode-contact analysis, these positive MSI effects are reflected as sub-additive reductions of β and θ suppression, which occurred to both A and V stimuli alone during the same time/frequency windows. In the AV versus A contrast, we also observed significant increases of early high-α activity and subsequent β ERD effects. Besides the anticipated BHFA effects in the AV–V contrast, the only evidence of increased BHFA was in the AV–A contrast in TPOJ. In particular, no other BHFA effects were observed in the AV–(A + V) or AV–A contrasts.
Finally, to evaluate the behavioral relevance of multisensory information, we compared the behavioral index of multisensory RT benefit with the AV–(A + V) power contrast estimated from the ROI analysis. No significant correlations (cluster-based randomization test) between the multisensory RT benefit and the AV–(A + V) contrast were observed.
Intracranial inter-trial phase consistency
In light of previous studies in NHPs (Lakatos et al., 2007), we tested whether cross-sensory stimuli affect the phase of ongoing oscillations in human AC. The results suggest significant increases of low-frequency ITPC after V stimuli, both in the electrode contact closest to AC (Fig. 8a) and in all three source-modeling ROIs across the auditory processing hierarchy (FDR-adjusted p < 0.05; Fig. 8b). In the contact analysis, this ITPC effect was significant already ∼100 ms after the onset of the V stimuli. Interestingly, the cross-sensory ITPC increases across the auditory processing hierarchy occurred in time–frequency windows where there was no evidence of concurrent increases in TFR power. Apart from the expected difference between the AV and A conditions, the ITPC analysis of contrasts yielded no consistently significant differences reflecting potential MSIs in either the electrode or ROI analyses.
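ITPC quantifies phase alignment across trials independently of power. A minimal sketch of the standard definition, the length of the mean resultant vector of single-trial phases (`itpc` is a hypothetical helper name; the authors' exact estimator may differ):

```python
import numpy as np

def itpc(phases):
    """Inter-trial phase consistency: length of the mean resultant
    vector of single-trial phases (0 = random, 1 = fully aligned)."""
    return np.abs(np.mean(np.exp(1j * np.asarray(phases))))

# Identical phases across 200 trials -> maximal consistency
aligned = np.zeros(200)
# Uniformly scattered phases -> consistency near zero
rng = np.random.default_rng(0)
scattered = rng.uniform(-np.pi, np.pi, 200)
```

A stimulus that resets the phase of an ongoing oscillation increases this measure at the reset frequency even when trial-averaged power is unchanged, which is why ITPC increases without concurrent power increases are taken as evidence of phase resetting rather than an additive evoked response.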
Experiment 2: cross-sensory power changes in auditory cortex during complex stimulation and task
An important concern is whether the results described above, which were obtained with a simple task and basic stimuli, generalize to a more complex cognitive setting. We therefore analyzed data from a subgroup of eight participants during additional lexical and nonlexical DMS tasks (Fig. 9a). The behavioral analysis suggested that the participants were able to perform according to the instructions. In the lexical and nonlexical DMS tasks, respectively, the mean ± SEM values of the PCR were 0.92 ± 0.04 and 0.96 ± 0.02, and the respective RT values were 1,643 ± 347 ms and 1,155 ± 179 ms.
Figure 9 displays the results of a group analysis, which was conducted across the contacts that in each participant were closest to AC, AAC, or TPOJ in both hemispheres. Consistent with the findings of the main experiment, we found no evidence of cross-sensory BHFA increases to symbolic V stimuli presented in either lexical or nonlexical contexts in AC, AAC, or TPOJ. At the same time, we observed significant suppression of θ (<8 Hz) and β (15–30 Hz) power to visual stimuli in the contacts closest to the left AC, AAC, and TPOJ. Interestingly, these cross-sensory effects were more pronounced in the left than in the right hemisphere. Indices of predominantly left-lateralized effects were also observed in the BHFA measures in the higher areas AAC and TPOJ.
Discussion
Here, we used intracranial SEEG to examine cross-sensory influences in human ACs. In Experiment 1 (N = 16), we used simple noise-burst sounds and “checkerboard” visual stimuli lacking strong auditory associations, to minimize higher-order (e.g., semantic) feedback influences. The results of Experiment 1 were compared with those of the cognitively more complex Experiment 2 (N = 8). Visual stimuli modulated the ITPC of low-frequency oscillations in AC without increasing signal power, suggesting cross-sensory phase resetting of intrinsic oscillations (Lakatos et al., 2007). Visual stimuli also resulted in a θ-to-β-range ERD in ACs. We further observed cross-sensory iERP responses to V stimuli that lagged auditory responses by several tens of milliseconds. However, no robust evidence of visually driven BHFA, a putative correlate of local neuronal firing (Manning et al., 2009; Miller, 2010; Ray and Maunsell, 2011), was found in ACs.
The significant increase in ITPC without concurrent signal amplification, which resembles previous NHP (Lakatos et al., 2007) and human ECoG results (Mercier et al., 2015), could reflect phase modulations of intrinsic oscillations in AC. Lakatos et al. (2007) proposed that such cross-sensory modulations could help ensure that concurrent auditory inputs arrive at ACs at an optimal high-excitability phase, to amplify weak inputs in noisy conditions (Lakatos et al., 2007; Luo et al., 2010; Atilgan et al., 2018). For example, a recent intracranial EEG study suggested that auditory neurons track the phase of visual speech, helping amplify the cortical representation of speech signals (Megevand et al., 2020). Cross-sensory phase resetting of AC oscillations has been reported in noninvasive studies as well (Luo et al., 2010; Biau et al., 2015).
BHFA is thought to reflect a nonoscillatory correlate of neuronal firing (Manning et al., 2009; Miller, 2010; Ray and Maunsell, 2011), offering a measure of suprathreshold activations using human SEEG (Parvizi and Kastner, 2018; for an alternative model, see Leszczynski et al., 2020). In our source analyses, traces of visually triggered BHFA were observed only in TPOJ. In contrast to the strong AV and auditory BHFA patterns, no significant visual BHFA effects were observed in AC, AAC, or in the SEEG contacts closest to these ROIs. Significant BHFA to visual stimuli was observed in only one of the 16 individual participants, whose contact closest to AC was, in fact, located in the upper bank of the STS. Consistent with previous smaller-sample human ECoG (Mercier et al., 2015) and SEEG studies (Ferraro et al., 2020), these results suggest that cross-sensory visual stimuli drive very weak, if any, suprathreshold population-level activations near human ACs (see also Quinn et al., 2014).
The lack of cross-sensory BHFA in human ACs is in line with neurophysiological recordings in NHPs (Lakatos et al., 2007; Kayser et al., 2008; Kajikawa et al., 2017). Laminar recordings revealed no cross-sensory somatosensory multi-unit activity (MUA) in NHP primary AC (Lakatos et al., 2007). Further, while visual stimuli modulated single-unit activity (SUA) and MUA responses to concurrent sounds, visual stimuli alone triggered no SUA or MUA in NHP AC (Kayser et al., 2008). Contrasting NHP findings of visually triggered firing effects in ACs have, in turn, been attributed to byproducts of extensive training (Brosch et al., 2005). Beyond this, visually triggered firing activity has been found only in the ACs of nonprimate mammals (Bizley et al., 2007; Morrill and Hasenstaub, 2018).
One of the most prominent cross-sensory effects in ACs was the visual θ-to-β ERD, which mimicked the ERD to auditory stimuli. In unimodal studies, α-range ERD has been associated with stimulus discrimination/detection (Mazaheri and Picton, 2005): Auditory α-ERD increases when a participant comprehends noisy speech sounds (Dimitrijevic et al., 2017) or hears clear rather than degraded speech (Tavabi et al., 2011; Billig et al., 2019). Intracranial sleep studies suggest that auditory α/β ERD occurs only in wakefulness, highlighting its relationship to conscious access to auditory stimulation (Hayat et al., 2022). Consistent with the present results, there is previous evidence that α-range ERD is also elicited by cross-modal cues, including visual gestures preceding speech–sound onsets (Biau et al., 2015) and sounds that predict visuospatial target locations (Feng et al., 2017). These cross-sensory ERDs reportedly correlate with behavioral performance in the task-relevant modality (Feng et al., 2017). The θ-to-β ERDs elicited by visual stimuli in ACs could, thus, reflect cross-modal (or state-dependent; Bimbard et al., 2023) priming that facilitates auditory–perceptual processing.
Cross-modal visual influences can not only enhance but also suppress auditory-related fMRI and MEG/EEG signals (Jääskeläinen et al., 2004; Lehmann et al., 2006; Stekelenburg and Vroomen, 2012; Gau et al., 2020). These suppression effects have been associated with predictive coding, enabled by cross-modal information becoming available slightly before the sound onsets (Stekelenburg and Vroomen, 2012; van Laarhoven et al., 2021). An alternative explanation for the visually triggered θ-to-β ERD could, thus, be cross-modal suppression of ACs. However, this alternative interpretation is countered by neurophysiological evidence that, in the unimodal case, α/β ERD correlates with enhanced rather than suppressed firing activity (Ray et al., 2008).
The pathways for cross-sensory modulations and MSI effects in human ACs have been proposed to include (1) feedback from higher-order areas (Smiley et al., 2007; Luo et al., 2010), (2) direct connectivity between sensory areas (Rockland and Ojima, 2003; Budinger et al., 2006; Bizley et al., 2007; Falchier et al., 2010), and/or (3) subcorticocortical influences originating in nonspecific thalamic nuclei (Hackett et al., 2007). Many of the present effects, including the θ-to-β ERD and low-frequency MSI effects in ACs, occurred at relatively long latencies. Furthermore, the earliest significant low-frequency MSI effects occurred in TPOJ rather than in AC. Such long-latency effects could reflect supramodal top-down influences (Lakatos et al., 2009) or even behavioral byproducts of the cross-sensory stimuli (Bimbard et al., 2023). However, the cross-sensory ITPC effects in ACs, which became significant already ∼100 ms after the stimulus onset, could be early enough to be explained by more direct cross-sensory influences. Notably, stronger early-latency modulations could be expected for visual stimuli with strong auditory associations (e.g., articulatory gestures).
In the classical sense, MSI refers to nonadditive changes in SUA, with the AV responses being stronger than the sum of unimodal auditory and visual responses (Stein and Meredith, 1993). Here, we found significant MSI effects only in low-frequency oscillations. However, these MSI effects occurred at time–frequency windows, in which we observed a similar ERD pattern for all three stimulus types: In these cases, the positive AV–(A + V) contrast refers to sub- rather than super-additive influences.
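The sign logic of this contrast can be illustrated with hypothetical numbers (the dB values below are illustrative only, not measured data):

```python
# Hypothetical baseline-normalized power changes in dB (negative = ERD).
erd_A, erd_V, erd_AV = -2.0, -2.0, -3.0

# Nonadditive MSI contrast: AV - (A + V) is positive even though every
# condition shows suppression, i.e., a sub-additive reduction of
# suppression rather than a super-additive response enhancement.
msi = erd_AV - (erd_A + erd_V)   # -3 - (-4) = +1
```

In other words, a positive AV–(A + V) value is only interpretable as super-additive enhancement when the unimodal responses themselves are power increases; with shared ERD, the same sign indicates sub-additivity.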
Notably, low-frequency subthreshold modulations (also detectable in noninvasive MEG/EEG) might be easier to detect than cross-sensory driving of neuronal activity. Spiking effects are more focal than modulatory effects in terms of the underlying neuronal tuning (for an AC example, see Eggermont et al., 2011). Furthermore, due to their quadrupolar field pattern, spikes can be recorded only near the activated neuron. The present measure of driving effects, BHFA, could also be harder to detect using sparse SEEG sampling, because higher-frequency signals have a shorter “coherence distance” than lower-frequency oscillations. It is also worth noting that BHFA is a correlate of, but not a direct measure of, underlying neuronal activity (Manning et al., 2009; Miller, 2010; Ray and Maunsell, 2011; Parvizi and Kastner, 2018; see also Leszczynski et al., 2020). Nonetheless, as estimated by the group averages of baseline-normalized power to auditory stimuli, the effect sizes of individual participants’ BHFA were at least as strong as, if not stronger than, the effect sizes of the lower-frequency modulatory effects.
The present analyses are limited to visual cross-sensory influences in or near ACs. This is because of the lack of (clinically determined) SEEG contacts in early visual areas in our participants, a limitation shared by the majority of previous human SEEG studies (Parvizi and Kastner, 2018; see, however, also Jonas et al., 2014). Another limitation relates to MSI effects at the BHFA range. The strongest modulatory cross-sensory effect to visual stimuli in auditory areas was the β ERD, which started only ∼200 ms after the stimulus onset. A design with a comparable lag between the visual and auditory stimuli might, therefore, have provided more sensitivity for detecting MSI effects at the BHFA range.
In conclusion, we found intracranial evidence that cross-sensory visual stimuli modulate subthreshold LFP signals, including through low-frequency phase resetting in human ACs. However, overall, cross-sensory influences in ACs were somewhat weaker than we had anticipated. As determined via BHFA, there was little evidence of direct activation of ACs by visual stimuli alone. Even when combined with concurrent auditory inputs, visual stimuli drove only weak, if any, increases of BHFA in ACs. Visually triggered BHFA influences were limited to the more polymodal area TPOJ. Finally, many of the most prominent cross-sensory modulatory influences, including the β-range ERD and concurrent MSI influences, occurred at relatively long latencies in ACs. It thus seems that visual inputs modulate auditory responses, but that little visual information processing takes place in human ACs. Analogous to selective attention, such cross-sensory modulatory influences could help suppress irrelevant sound features (Kauramaki et al., 2010), enhance task-relevant rhythmic inputs (Lakatos et al., 2007; Schroeder et al., 2008), and help guide attentional resources to appropriate points in time to enhance auditory speech perception (Zion Golumbic et al., 2013).
Footnotes
This work was supported by the National Institute on Deafness and Other Communication Disorders (R01DC017991, R01DC016765, R01DC016915); the Academy of Finland (Suomen Akatemia; 276643, 298131, 308431); and the Russian Science Foundation (22-48-08002).
*J.A. and H.J.L. contributed equally to this work.
**I.P.J. and F.H.L. are the joint senior authors.
The authors declare no competing financial interests.
Correspondence should be addressed to Jyrki Ahveninen at jahveninen@mgh.harvard.edu.