Abstract
Natural conversation is multisensory: when we can see the speaker's face, visual speech cues improve our comprehension. The neuronal mechanisms underlying this phenomenon remain unclear. The two main alternatives are visually mediated phase modulation of neuronal oscillations (excitability fluctuations) in auditory neurons and visual input-evoked responses in auditory neurons. Investigating this question using naturalistic audiovisual speech with intracranial recordings in humans of both sexes, we find evidence for both mechanisms. Remarkably, auditory cortical neurons track the temporal dynamics of purely visual speech using the phase of their slow oscillations and phase-related modulations in broadband high-frequency activity. Consistent with known perceptual enhancement effects, the visual phase reset amplifies the cortical representation of concomitant auditory speech. In contrast to this, and in line with earlier reports, visual input reduces the amplitude of evoked responses to concomitant auditory input. We interpret the combination of improved phase tracking and reduced response amplitude as evidence for more efficient and reliable stimulus processing in the presence of congruent auditory and visual speech inputs.
SIGNIFICANCE STATEMENT Watching the speaker can facilitate our understanding of what is being said. The mechanisms responsible for this influence of visual cues on the processing of speech remain incompletely understood. We studied these mechanisms by recording the electrical activity of the human brain through electrodes implanted surgically inside the brain. We found that visual inputs can operate by directly activating auditory cortical areas, and also indirectly by modulating the strength of cortical responses to auditory input. Our results help to understand the mechanisms by which the brain merges auditory and visual speech into a unitary perception.
- audiovisual speech
- broadband high-frequency activity
- crossmodal stimuli
- intracranial electroencephalography
- neuronal oscillations
- phase–amplitude coupling
Introduction
Viewing one's interlocutor significantly improves intelligibility under noisy conditions (Sumby and Pollack, 1954). Moreover, mismatched auditory (A) and visual (V) speech cues can create striking illusions (McGurk and Macdonald, 1976). Despite the ubiquity and power of visual influences on speech perception, the underlying neuronal mechanisms remain unclear. The cerebral processing of auditory and visual speech converges in multisensory cortical areas, especially the superior temporal cortex (Miller and D'Esposito, 2005; Beauchamp et al., 2010). Crossmodal influences are also found in cortical areas that are traditionally considered to be unisensory; in particular, visual speech modulates the activity of auditory cortex (Calvert et al., 1997; Besle et al., 2008; Kayser et al., 2008).
The articulatory movements that constitute visual speech strongly correlate with the corresponding speech sounds (Chandrasekaran et al., 2009; Schwartz and Savariaux, 2014) and predict them to some extent (Arnal et al., 2009), suggesting that visual speech might serve as an alerting cue to auditory cortex, preparing the neural circuits to process the incoming speech sounds more efficiently. Earlier, we raised the hypothesis that this preparation occurs in part through a resetting of the phase of neuronal oscillations in auditory cortex: through this phase reset, visual speech cues influence the temporal pattern of neuronal excitability fluctuations in auditory cortex (Schroeder et al., 2008). This hypothesis rests on four lines of evidence. First, auditory speech has predictable rhythms, with syllables arriving at a relatively rapid rate (4–7 Hz) nested within the slower rates (1–3 Hz) of phrase and word production. These rhythmic features are critical for speech to be intelligible (Shannon et al., 1995; Greenberg et al., 2003). Second, auditory cortex synchronizes its oscillations to the rhythm of heard speech, and the magnitude of this synchronization correlates with the intelligibility of speech (Ahissar et al., 2001; Luo and Poeppel, 2007; Ding and Simon, 2014; Vander Ghinst et al., 2016). Third, neuronal oscillations correspond to momentary changes in neuronal excitability, so that, independent of modality, the response of sensory cortex depends on the phase of its oscillations at stimulus arrival (Lakatos et al., 2008). Fourth, even at the level of primary sensory cortex, oscillations can be phase reset by stimuli from other modalities, and this crossmodal reset influences the processing of incoming stimuli from the preferred modality (Lakatos et al., 2007; Kayser et al., 2008).
Human electroencephalographic (EEG) and magnetoencephalographic (MEG) studies of cerebral responses to continuous, naturalistic audiovisual (AV) speech have established that oscillations are influenced by the visual as well as the auditory component of speech (Luo et al., 2010; Crosse et al., 2015, 2016; O'Sullivan et al., 2016; Park et al., 2016, 2018; Giordano et al., 2017). While these observations are compatible with the phase reset hypothesis, they do not rule out the possibility that the apparent phase alignment simply reflects a succession of crossmodal sensory-evoked responses; in fact, some favor this interpretation (Crosse et al., 2015, 2016). Our perspective is that phase reset and evoked response mechanisms ordinarily operate in complementary fashion (Schroeder and Lakatos, 2009). Thus, in the present context, we expect that both will mediate visual influences on auditory speech processing. To dissect these influences, one must be able to resolve the local activity of a given cortical area well enough to dissociate a momentary increase in phase alignment from any coincident increase in oscillatory power (Makeig et al., 2004; Shah et al., 2004). No noninvasive neurophysiological study to date meets that standard, but invasive techniques are better suited for that level of granularity.
Here, we used intracranial EEG (iEEG) to probe the mechanistic basis for the effect of visual speech on auditory processing. We find that (1) unisensory visual speech resets the phase of low-frequency neuronal oscillations in auditory cortex; and (2), consistent with known perceptual effects, the visual input-mediated phase reset amplifies cortical responses to concomitant auditory input. These results strongly support crossmodal phase reset as one of the neuronal mechanisms underlying multisensory enhancement in audiovisual speech processing. We also observe a complementary effect, visual input-evoked reduction of response amplitude to concomitant auditory input. Together with the improved phase tracking, we interpret this as evidence for more efficient and reliable stimulus processing in the presence of congruent audiovisual speech inputs (Kayser et al., 2010).
Materials and Methods
Experimental design
Participants
Nine patients (5 women; age range, 21–52 years) with drug-resistant focal epilepsy who were undergoing video-iEEG monitoring at North Shore University Hospital (Manhasset, NY) participated in the experiments. All participants were fluent English speakers. The participants provided written informed consent under the guidelines of the Declaration of Helsinki, as monitored by the Feinstein Institutes for Medical Research institutional review board.
Stimuli and task
Stimuli (Zion Golumbic et al., 2013b) were presented at the bedside using a laptop computer and Presentation software (version 17.2; Neurobehavioral Systems; RRID:SCR_002521; http://www.neurobs.com). Trials started with a 1 s baseline period consisting of a fixation cross on a black screen. The participants then viewed or heard video clips (7–12 s) of a speaker telling a short story. The clips were cut short so as to omit the last word. A written word was then presented on the screen, and the participants had to indicate whether that word was an appropriate ending for the story. There was no time limit for participants to indicate their answer; reaction time was not monitored. There were two speakers (one woman) telling four stories each (eight distinct stories); each story was presented with each of eight different ending words (four appropriate), for a total of 64 trials. These were presented once in each of three sensory modalities: audiovisual (movie with audio track), auditory (soundtrack with a fixation cross on a black screen), and visual (silent movie). Trial order was randomized, with the constraint that the same story could not be presented twice in a row, regardless of modality. Precise timing of stimulus presentation with respect to iEEG data acquisition was verified using an oscilloscope, a microphone, and a photodiode.
The task was intended to ensure that participants were attending to the stimuli. Performance was on average 85% (range, 59–95%) in the audiovisual modality, 84% (61–95%) in the auditory modality, and 68% (44–88%) in the visual modality. Performance was significantly above chance in each modality (paired t tests; AV: t(8) = 8.37, p = 4.74 × 10^−5; A: t(8) = 8.81, p = 4.74 × 10^−5; V: t(8) = 3.44, p = 0.0088; p values corrected for multiple comparisons using the false discovery rate (FDR) procedure; Benjamini and Hochberg, 1995).
Data acquisition
iEEG electrode localization
The placement of iEEG electrodes (subdural and depth electrodes; Ad-Tech Medical and Integra LifeSciences) was determined on clinical grounds, without reference to this study. The localization and display of iEEG electrodes was performed using iELVis [RRID:SCR_016109 (http://ielvis.pbworks.com); Groppe et al., 2017]. For each participant, a postimplantation high-resolution computed tomography (CT) scan was coregistered with a postimplantation 3D T1 1.5 tesla MRI scan and then with a preimplantation 3D T1 3 tesla MRI scan via affine transforms with 6 df using the FMRIB Linear Image Registration Tool included in the FMRIB Software Library [RRID:SCR_002823 (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki); Jenkinson et al., 2012] or the bbregister tool included in FreeSurfer [RRID:SCR_001847 (https://surfer.nmr.mgh.harvard.edu/fswiki/FreeSurferWiki); Fischl, 2012]. Electrodes were localized manually on the CT scan using BioImage Suite [RRID:SCR_002986 (https://medicine.yale.edu/bioimaging/suite/); Joshi et al., 2011]. The preimplantation 3D T1 MRI scan was processed using FreeSurfer to segment the white matter, deep gray matter structures, and cortex; reconstruct the pial surface; approximate the leptomeningeal surface (Schaer et al., 2008); and parcellate the neocortex according to gyral anatomy (Desikan et al., 2006). To compensate for the brain shift that accompanies the insertion of subdural electrodes through a large craniotomy, subdural electrodes were projected back to the preimplantation leptomeningeal surface (Dykstra et al., 2012). For depth electrodes, only contacts that were located in gray matter were retained for further analysis (Mercier et al., 2017). Each iEEG electrode was attributed to a cortical region according to automated parcellation in FreeSurfer (for a similar approach, see Mégevand et al., 2017; Arnal et al., 2019).
iEEG recording and preprocessing
Intracranial EEG signals were referenced to a vertex subdermal electrode, filtered and digitized (0.1 Hz high-pass filter; 200 Hz low-pass filter; 500–512 samples/s; XLTEK EMU128FS or Natus Neurolink IP 256 systems, Natus Medical). Analysis was performed offline using the FieldTrip toolbox [RRID:SCR_004849 (http://www.fieldtriptoolbox.org/); Oostenveld et al., 2011] and custom-made programs for MATLAB [MathWorks; RRID:SCR_001622 (https://www.mathworks.com/products/matlab.html)]. The 60 Hz line noise and its harmonics were filtered out using a discrete Fourier transform filter. iEEG electrodes contaminated with noise or abundant epileptiform activity were identified visually and rejected. iEEG electrodes that lay in white matter were also rejected (Mercier et al., 2017). The remaining iEEG signals were rereferenced to average reference.
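As an illustration, the line-noise filtering and rereferencing steps described above could be implemented along the following lines in FieldTrip. This is a minimal sketch under stated assumptions: the variable names (data_raw, data_clean, data_reref) are placeholders, and the configuration shown is our reading of the description rather than the code actually used for the study.

```matlab
% Minimal FieldTrip sketch of the preprocessing described above (illustrative;
% variable names and option values are assumptions, not the study's actual code).
cfg            = [];
cfg.dftfilter  = 'yes';             % discrete Fourier transform filter for line noise
cfg.dftfreq    = [60 120 180];      % 60 Hz line noise and its harmonics
data_clean     = ft_preprocessing(cfg, data_raw);   % data_raw: FieldTrip raw data structure

% After visual rejection of noisy, epileptic, and white-matter channels (not shown),
% rereference the retained channels to their common average.
cfg            = [];
cfg.reref      = 'yes';
cfg.refchannel = 'all';             % common average reference over retained channels
data_reref     = ft_preprocessing(cfg, data_clean);
```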
The absolute phase and power of an iEEG signal are quantities that depend on the reference; consequently, the quantification of synchronization between electrodes is strongly influenced by the choice of the reference (Guevara et al., 2005; Mercier et al., 2017). Here, however, we strictly focus on the relative relationship between a continuous sensory stimulus and the phase or power response at a given electrode; therefore, at no point is the phase or power of an iEEG signal measured at a given electrode compared with those at another. Furthermore, all statistical testing is performed at the single-electrode level through permutation testing; thus, any influence of the reference on the observed data is also present in the surrogate data generated by the permutation test (see below). For these reasons, the analyses presented here are immune to the choice of a particular referencing scheme.
Intracerebral recordings are not immune to volume conduction: laminar electrodes in monkey auditory cortex could record responses from a nearby visually responsive area situated ∼1 cm away (Kajikawa and Schroeder, 2011). However, the amplitude of the volume-conducted local field potential (LFP) decreases steadily with increasing distance between the measuring electrode and the source, meaning that each iEEG electrode remains influenced predominantly by locally occurring activity (Dubey and Ray, 2019). Furthermore, bipolar montages in iEEG have their own issues, including the risk of mixing signals from disparate generators, which would negate the spatial precision of iEEG (Zaveri et al., 2006). Based on these considerations, we selected referential over bipolar recordings for our analyses.
Data analysis
Time courses of auditory and visual speech stimuli
The envelope of auditory speech stimuli was computed by filtering the audio track of the video clips through a gammatone filter bank approximating cochlear filtering, with 128 center frequencies equally spaced on the equivalent rectangular bandwidth–rate scale and ranging from 80 to 5000 Hz (Carney and Yin, 1988), computing the Hilbert transform to obtain power in each frequency band, and averaging over frequency bands (MATLAB Toolbox, Institute of Sound Recording, University of Surrey; https://github.com/IoSR-Surrey/MatlabToolbox). The time course of visual speech stimuli was estimated by manually measuring the vertical opening of the mouth on each still frame of the video clips (Park et al., 2016). Auditory and visual speech stimulus time courses were then resampled to 200 Hz at the same time points as the iEEG signals.
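For illustration, a simplified version of this envelope computation could look like the sketch below. It substitutes a small bank of Butterworth band-pass filters for the 128-channel gammatone filter bank of the IoSR toolbox that was actually used; the file name, band count, and variable names are placeholders introduced here.

```matlab
% Simplified sketch of speech-envelope extraction (illustrative only; the study
% used a 128-channel gammatone filter bank, approximated here by Butterworth bands).
[audio, fsAudio] = audioread('clip.wav');        % 'clip.wav' is a placeholder file name
audio = mean(audio, 2);                          % collapse to mono
edges = logspace(log10(80), log10(5000), 33);    % 32 bands between 80 and 5000 Hz
env   = zeros(numel(audio), numel(edges) - 1);
for k = 1:numel(edges) - 1
    [b, a]    = butter(4, edges(k:k+1) / (fsAudio/2), 'bandpass');
    env(:, k) = abs(hilbert(filtfilt(b, a, audio))).^2;   % band-limited power via Hilbert transform
end
envelope = mean(env, 2);                         % average power over bands
envelope = resample(envelope, 200, fsAudio);     % resample to 200 samples/s to match the iEEG
```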
Time–frequency analysis of iEEG signals
To obtain instantaneous low-frequency power and phase, the iEEG signal was filtered between 0.5 and 9 Hz (−3 dB cutoff points, sixth-order Butterworth filters), downsampled to 200 Hz, and Hilbert transformed. Instantaneous power and phase in the delta band (−3 dB cutoff points at 0.5–3.5 Hz) and the theta band (−3 dB cutoff points at 4–7.5 Hz) were computed in similar fashion. The intertrial coherence (ITC), a measure of phase alignment across repetitions of the same sensory stimulus, was computed as the mean resultant vector length using the CircStat toolbox (https://github.com/circstat/circstat-matlab; Berens, 2009). Broadband high-frequency activity (BHA), which reflects local neuronal activity (Crone et al., 1998; Ray et al., 2008), was computed by filtering the iEEG signal in 10 Hz bands between 75 and 175 Hz (fourth-order Butterworth filters), computing the Hilbert transform to obtain instantaneous power, dividing instantaneous power in each band by its own mean over time to compensate for the 1/f power drop, and then averaging over bands (Golan et al., 2016). BHA was then downsampled to 200 Hz.
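The sketch below illustrates these computations for a single electrode. The variable names (ieeg, fsRaw, phaseTrials) and the exact filter calls are our reading of the description above, not the actual analysis code.

```matlab
% Illustrative sketch of the low-frequency and BHA computations described above.
fsRaw = 500;                                   % acquisition rate (Hz); ieeg: column vector, one electrode

% --- Broadband high-frequency activity: 10 Hz bands between 75 and 175 Hz ---
bandLo = 75:10:165;
bha    = zeros(size(ieeg));
for k = 1:numel(bandLo)
    [b, a] = butter(2, [bandLo(k), bandLo(k) + 10] / (fsRaw/2), 'bandpass');  % 4th-order overall
    pwr    = abs(hilbert(filtfilt(b, a, ieeg))).^2;   % instantaneous power in this band
    bha    = bha + pwr / mean(pwr);                   % normalize by the band's mean (1/f compensation)
end
bha = resample(bha / numel(bandLo), 200, fsRaw);      % average over bands, downsample to 200 Hz

% --- Low-frequency (0.5-9 Hz) phase and power ---
[b, a]   = butter(3, [0.5, 9] / (fsRaw/2), 'bandpass');   % 6th-order overall
lf       = resample(filtfilt(b, a, ieeg), 200, fsRaw);    % filter, then downsample to 200 Hz
analytic = hilbert(lf);
lfPhase  = angle(analytic);                               % instantaneous phase
lfPower  = abs(analytic).^2;                              % instantaneous power

% --- Intertrial coherence (ITC) across repetitions of the same stimulus ---
% phaseTrials: trials x time matrix of instantaneous phase at one electrode
itc = abs(mean(exp(1i * phaseTrials), 1));                % mean resultant vector length per time point
```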
Stimulus–response cross-correlation
The relationship between speech stimuli and brain responses was quantified by computing their cross-correlation. For each iEEG electrode, data from all trials in each sensory modality were concatenated and were then cross-correlated with the corresponding concatenated stimulus time courses. For low-frequency power and BHA, Pearson correlation was computed; for the low-frequency phase, linear-to-circular correlation was computed (Berens, 2009). To account for the fact that brain responses to sensory stimuli occur with some delay, lags of −200 to +200 ms between stimuli and responses were allowed. The maximum of the absolute value of the correlation coefficient over this range of lags was retained. Because the above analysis included lag values where brain responses could theoretically precede the corresponding sensory stimulus, we repeated the entire analysis while allowing only physiologically plausible lags (i.e., from −200 to 0 ms). This reanalysis yielded essentially identical results.
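As an illustration of this step, the sketch below computes the lagged correlations for one electrode. Here stim, bhaResp, and lfPhase are placeholder names for the concatenated stimulus and response time courses (all at 200 samples/s), and shifting by circular rotation is a simplification of the lagging procedure.

```matlab
% Illustrative sketch of the lagged stimulus-response correlation described above.
fs     = 200;
maxLag = round(0.2 * fs);                     % +/- 200 ms expressed in samples
lags   = -maxLag:maxLag;
rLin   = zeros(size(lags));                   % Pearson correlation (for BHA or power)
rCirc  = zeros(size(lags));                   % linear-to-circular correlation (for phase)
for k = 1:numel(lags)
    resp    = circshift(bhaResp, lags(k));    % circular shift as a simple stand-in for lagging
    rLin(k) = corr(stim, resp);               % Pearson correlation
    phi     = circshift(lfPhase, lags(k));
    % circular-linear correlation, as implemented in circ_corrcl of the CircStat toolbox
    rcx = corr(stim, cos(phi)); rsx = corr(stim, sin(phi)); rcs = corr(cos(phi), sin(phi));
    rCirc(k) = sqrt((rcx^2 + rsx^2 - 2 * rcx * rsx * rcs) / (1 - rcs^2));
end
[~, iBest] = max(abs(rLin));                  % maximum absolute correlation over lags
rLinMax    = rLin(iBest);
rCircMax   = max(rCirc);                      % circular-linear r is non-negative by construction
```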
Statistical testing
To assess the statistical significance of observed cross-correlation coefficients, their distribution under the null hypothesis was estimated using a permutation test. In each iteration, trial labels were shuffled to disrupt the temporal relationship between stimuli and responses, and one value of the correlation coefficient was computed. The procedure was repeated 1000 times. Observed values of correlation coefficients were then expressed as z scores of the null distribution. For an illustration of the procedure, see Figure 2A–E.
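A minimal sketch of this permutation scheme is shown below; respTrials (a cell array of single-trial response time courses) and the helper maxLaggedCorr (returning the maximum absolute lagged correlation, as in the previous sketch) are hypothetical names introduced for illustration.

```matlab
% Illustrative sketch of the trial-shuffling permutation test described above.
% respTrials: cell array of single-trial response time courses (hypothetical name);
% maxLaggedCorr: hypothetical helper returning the maximum absolute lagged correlation.
nPerm = 1000;
rNull = zeros(nPerm, 1);
for p = 1:nPerm
    order    = randperm(numel(respTrials));          % shuffle the trial labels
    respPerm = vertcat(respTrials{order});           % reconcatenate responses in shuffled order
    rNull(p) = maxLaggedCorr(stim, respPerm);        % same statistic as for the observed data
end
rObs = maxLaggedCorr(stim, vertcat(respTrials{:}));  % observed cross-correlation
z    = (rObs - mean(rNull)) / std(rNull);            % observed value as a z score of the null
```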
Associations between cross-correlation magnitudes across cortical sites were assessed with Spearman's rank correlation to account for the non-normal distribution of cross-correlation coefficients. Differences in cross-correlations across conditions were assessed either with paired t tests, when distributions were approximately normal, or with permutation testing, when they were not. Differences in power or ITC from baseline to the stimulus period were assessed with permutation testing.
Correction for multiple comparisons
The p values were corrected for multiple comparisons over electrodes using an FDR procedure (Benjamini and Hochberg, 1995) with the false discovery rate set at 0.05, implemented in the Mass Univariate ERP toolbox [RRID:SCR_016108 (https://github.com/dmgroppe/Mass_Univariate_ERP_Toolbox); Groppe et al., 2011a]. The Benjamini–Hochberg procedure maintains control of the false discovery rate even in the case of positive dependencies between the observed variables.
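For reference, the Benjamini–Hochberg step-up procedure amounts to the following; this is a generic sketch rather than the Mass Univariate ERP toolbox implementation that was actually used.

```matlab
% Generic sketch of the Benjamini-Hochberg step-up procedure (illustrative).
function h = fdr_bh_sketch(pvals, q)
    % pvals: uncorrected p values (one per electrode); q: desired false discovery rate
    m           = numel(pvals);
    [ps, order] = sort(pvals(:));              % sort p values in ascending order
    crit        = (1:m)' * q / m;              % step-up critical values i*q/m
    iMax        = find(ps <= crit, 1, 'last'); % largest i with p(i) <= i*q/m
    h           = false(m, 1);
    if ~isempty(iMax)
        h(order(1:iMax)) = true;               % reject hypotheses 1..iMax (in sorted order)
    end
end
```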
Rationale for the iEEG analysis strategy
The question of how best to analyze iEEG data over multiple participants is not straightforward, because the coverage and density of iEEG material typically vary widely from one patient to the next. Selecting electrodes that sample from one or a handful of predefined anatomic regions of interest is a common approach. When sufficiently dense sampling is available from large patient cohorts performing the same task, whole-brain activity maps and statistics can be generated (Kadipasaoglu et al., 2015; Grossman et al., 2019a,b). In our case, we did not want to restrict our analysis to anatomically defined regions of interest. The approach we took was to analyze all data at the single-electrode level, and then to apply strong correction for multiple comparisons through the Benjamini–Hochberg FDR procedure (Benjamini and Hochberg, 1995) performed over all electrodes in the dataset (N = 1012). Studies using simulated EEG and event-related potential data showed that this procedure provides adequate correction for multiple comparisons (i.e., adequate control of the FDR) in the case of neurophysiological data (Groppe et al., 2011a,b).
For each variable of interest, statistical testing was systematically performed independently for each electrode in the dataset (N = 1012) and then corrected for multiple comparisons on the entire electrode set. Thus, the auditory-responsive cortical sites were the sites (N = 186) that displayed significant BHA tracking of the speech envelope, after FDR correction over all 1012 sites. Separately, sites were deemed to significantly track visual speech through their low-frequency phase or power if statistical testing survived FDR correction over all 1012 sites (independent of whether they were auditory responsive or not). It is only after this first, stringent correction that we combined the sites that were both auditory responsive and tracked visual speech through low-frequency activity. This conjunction analysis is not circular, because the order in which the selection criteria are applied has no importance and does not artificially preordain its results.
To compute cross-correlations, stimuli from all trials were concatenated into a single minutes-long time course. Cortical responses at each electrode were concatenated over trials in similar fashion. The permutation procedure simply altered the order in which the responses were concatenated in order to disrupt the temporal relationship between stimuli and responses. The procedure did not introduce any unnatural interruption of either stimuli or responses. It also fully respected any existing dependencies between variables, such as correlated response properties at neighboring electrodes.
Overall, 159 cortical sites tracked mouth movements through their high-frequency activity; of those, 37 also tracked the speech envelope. We intentionally did not analyze visually responsive cortical sites further or focus on the smaller contingent of dual tracking sites, for the following two reasons: (1) the phase reset hypothesis is directional and makes predictions about the influence of visual speech on auditory cortex, but not the reverse; and (2) the hypothesis suggests that the crossmodal influence of visual speech on auditory cortex could be largely subthreshold—that is, manifesting as changes in oscillatory phase, but not neuronal firing (Schroeder et al., 2008). Hence, we focused on investigating how visual speech cues influenced oscillatory activity in auditory-responsive cortex.
The superadditive approach has been abundantly used to demonstrate multisensory effects with linear variables like the amplitude of event-related potentials: a cortical site exhibits a superadditive effect if its response to a multisensory stimulus is larger than the sum of its responses to unisensory stimuli. Nonlinear approaches to examine superadditive multisensory effects on oscillatory power (Senkowski et al., 2007) or intertrial coherence (Mercier et al., 2015) have been described as well. Circular variables like phase angles, however, do not sum easily, especially in our case where there is a variable lag between the continuous auditory and visual stimulus streams. For this reason, we limited our analysis to the relatively simple comparison of responses to audiovisual versus auditory-alone stimuli in auditory cortex, similar to previous work (Kayser et al., 2008).
Data and software availability
Data and custom-made software are available on request from Pierre Mégevand (pierre.megevand{at}unige.ch).
Results
Cortical tracking of auditory and audiovisual speech
We recorded iEEG signals from electrodes implanted in the brain of nine human participants undergoing invasive electrophysiological monitoring for epilepsy (Fig. 1C). Patients attended to clips of a speaker telling a short story, presented in the auditory (soundtrack with black screen), visual (silent movie), and audiovisual modalities (Fig. 1A). Cortical sites were considered to be auditory responsive if the time course of their local neuronal activity, assessed by BHA (also known as “high-gamma power”; Crone et al., 1998), correlated significantly with that of auditory speech (indexed by the amplitude of the speech envelope). We quantified the magnitude of speech–brain correlations through cross-correlation and tested for significance using permutation testing, as illustrated in Figure 2A–E. After FDR correction for multiple comparisons, 186 cortical sites, centered mostly on the superior and middle temporal gyri of both cerebral hemispheres, displayed significant BHA tracking of auditory speech (Fig. 2F). These sites were analyzed further as auditory-responsive cortex.
We also examined how low-frequency oscillatory activity in auditory-responsive cortex tracks auditory speech. As expected, we found strong tracking through both low-frequency phase and power, the intensity of which clearly correlated with BHA tracking (Fig. 3A,B). Next, we asked whether the tracking of the speech envelope differed in response to audiovisual speech compared with unisensory auditory speech. On average, tracking through the phase of low-frequency activity was stronger for audiovisual speech than for purely auditory speech, whereas tracking through low-frequency power was weaker for audiovisual than for auditory speech (Fig. 3C,D). Mapping the anatomic localization of auditory-responsive sites revealed that cortical areas related to auditory and language processing in the superior temporal gyrus (and, to a lesser extent, the middle temporal gyrus) mostly showed increased phase tracking, but reduced power tracking, for audiovisual versus auditory speech (Fig. 3E,F). The improvement in phase tracking for audiovisual speech suggests that visual speech cues provide an additional input to auditory cortex that improves the phase alignment of its low-frequency activity to the speech envelope. The opposite pattern in low-frequency power tracking is inconsistent with the idea that this improvement is simply an artifact of increased evoked response power.
Speech resets the phase of ongoing oscillations in auditory-responsive cortex
A phase-resetting mechanism presupposes the existence of ongoing oscillations outside periods of sensory processing. To examine this, we measured the instantaneous power and phase of low-frequency iEEG activity at baseline (0.5 s before stimulus onset) and during stimulation (1.5 s after stimulus onset); we picked this later time point to avoid the response evoked by stimulus onset. It is for this analysis that our choice of a task design in which the same stimulus was presented multiple times becomes relevant (Fig. 4A): we found that oscillatory power decreased relative to baseline during the presentation of a continuous stimulus (Fig. 4B), while at the same time phase alignment across repeated presentations of the same stimulus (quantified as the ITC) increased (Fig. 4C). This analysis makes it clear that oscillatory activity is already present before stimulus onset. During stimulus presentation, the observation of a decrease in power coincident with the increase in phase alignment argues that the latter reflects a pattern of phase resetting rather than a succession of evoked responses. While there are alternative possibilities (e.g., prestimulus oscillations are suppressed and replaced with a completely new set of cortical oscillatory dynamics), we think that our interpretation of these events is the most parsimonious.
Tracking of visual speech by auditory-responsive cortex
We then asked how unisensory visual speech influences low-frequency activity in auditory-responsive cortex. To index the time course of visual speech, we measured the vertical opening of the mouth, a metric that correlates with the area of mouth opening and with the speech envelope (Fig. 1B). We quantified the intensity of the tracking of mouth opening by either low-frequency phase or power in auditory-responsive cortex, using the same approach as for the tracking of the speech envelope. We found that a subset of auditory-responsive cortical sites displayed phase tracking of visual speech (Fig. 5A,C). These sites were clustered in the superior and middle temporal gyri, for the most part. We also found power tracking of visual speech in another subset of auditory-responsive cortical sites (Fig. 5A,D). Importantly, these sites were generally different from those that displayed phase tracking, and their anatomic localization was more diffuse, including temporal cortex but also spreading to the occipital, parietal, and frontal cortices (Fig. 5B). This segregation of phase and power tracking sites is consistent with the idea that phase reset and evoked responses provide complementary mechanisms for the influence of visual speech in auditory cortex.
Next, we examined the influence of phase reset on local neuronal activation as indexed by BHA. The intensity of BHA tracking correlated with that of tracking through the low-frequency phase (Fig. 5E), indicating coupling between the low-frequency phase and the amplitude of neuronal activation (Canolty et al., 2006). By contrast, there was no detectable correlation between BHA tracking and low-frequency power tracking (Fig. 5F). These observations are consistent with the hypothesis that phase reset to visual speech augments local neuronal activation in auditory cortex.
Since the placement of iEEG electrodes was determined solely by clinical circumstances, anatomic coverage varied significantly across participants. Figure 6 shows the tracking of visual speech by auditory-responsive cortex in individual patients. The patients with denser sampling of temporal regions (patients 1, 3, 4, 5, and 8) tended to be the ones in whom we observed tracking of visual speech by auditory-responsive cortex.
Speech is a mixture of rhythms: syllables, which occur at a frequency well approximated by the theta band of cerebral oscillations, are nested within the slower rates of word and phrase production, which themselves correspond to the delta band. To assess whether auditory-responsive cortex was differently sensitive to these two dimensions of speech in the visual modality, we repeated our analysis of speech tracking by iEEG phase and power in the delta- and theta-frequency bands. This analysis showed a clear dissociation between phase and power tracking, which adds to the evidence that these two phenomena are distinct. Thirty auditory-responsive cortical sites tracked visual speech with the phase of their delta oscillations, whereas 17 sites showed delta power tracking and 4 sites displayed both (Fig. 7A). There was a significant correlation between delta phase and BHA tracking of visual speech (Spearman's ρ(n = 34) = 0.4273, p = 0.0123), whereas the correlation between delta power and BHA tracking was not significant (ρ(n = 21) = 0.2065, p = 0.3675). For the most part, delta-phase tracking sites were clustered in the superior and middle temporal gyri, similar to what we observed for the broad low-frequency band. In the theta band, by contrast, a single electrode displayed phase tracking, whereas 23 showed power tracking. The correlation between theta power and BHA tracking did not reach significance (ρ(n = 23) = 0.3192, p = 0.1377). These results suggest that, at least in our experimental conditions, visual speech cues provide mostly suprasyllabic information to auditory cortex (words and phrases) in the form of ongoing delta-phase reset.
Because our experiment by design entailed repeated presentations of the same speech stimuli, we performed a split-halves analysis to ensure that the effects that we observed were not caused by the participants' increased familiarity with the material. We found no difference in the magnitude of speech tracking (quantified by the stimulus–response cross-correlation) between early and late trials for the BHA tracking of auditory speech at auditory-responsive sites (paired t test: t(185) = −0.1985, p = 0.9628) or for the tracking of visual speech by either low-frequency phase (t(25) = 0.8403, p = 0.4087) or low-frequency power (t(27) = −0.2581, p = 0.7983). We also did not find any difference in behavioral performance for the early versus late trials (A: t(8) = 1.0000, p = 0.3466; AV: t(8) = 0.2294, p = 0.8243; V: t(8) = −1.8521, p = 0.1012). This suggests that increased familiarity with the speech stimuli did not significantly affect their cortical tracking.
Discussion
Both phase-entrained low-frequency activity and fluctuations in broadband high-frequency activity in auditory cortex track the temporal dynamics of auditory speech (Ding and Simon, 2014). Previous neurophysiological studies have shown that the visual component of audiovisual speech influences cerebral activity, mostly in visual areas, but also in the cortical network that processes auditory speech, which includes superior temporal, inferior frontal, and sensorimotor areas (Luo et al., 2010; Zion Golumbic et al., 2013b; Crosse et al., 2015, 2016; O'Sullivan et al., 2016; Park et al., 2016, 2018; Giordano et al., 2017; Ozker et al., 2018; Micheli et al., 2020). Collectively, these studies have demonstrated that auditory cortical dynamics are sensitive to visual speech but were not able to identify the underlying mechanisms.
Here, we used iEEG recordings for a more direct examination of these mechanisms. Within the cortical network that responds to auditory speech, tracking by low-frequency phase was enhanced by audiovisual compared with auditory stimulation, while the opposite was true for tracking by low-frequency power fluctuations. This dissociation is incompatible with the notion that the enhancement of phase tracking in the audiovisual condition is simply an artifact of increased evoked response power. Rather, it suggests that two complementary mechanisms may be operating. The first, visual phase reset-induced enhancement of cortical responses to auditory speech, seems to best account for the well known perceptual enhancement of auditory speech by concomitant visual cues (Sumby and Pollack, 1954), since both phase tracking and intelligibility are improved in response to audiovisual speech over auditory speech alone. The second mechanism, a visual speech-mediated reduction of evoked responses in auditory cortices, is in line with previous observations that neurophysiological responses to audiovisual stimuli in both auditory and visual cortex are generally smaller than those to the preferred modality stimulus alone (Besle et al., 2008; Mercier et al., 2013, 2015; Schepers et al., 2015). The paradox that a reduction in response amplitude accompanies the perceptual enhancement afforded by audiovisual speech can be reconciled when one considers that the audiovisual stimuli used here and in the above-mentioned studies were well above threshold as well as congruent. Work in monkeys revealed that the information gain from congruent audiovisual input, compared with auditory input alone, resulted in an increase in the temporal precision of firing by auditory neurons, together with a reduction in the total number of action potentials fired (Kayser et al., 2010). Similarly, we interpret our observation of improved phase tracking and reduced response amplitude as evidence for more efficient and reliable cortical processing of congruent audiovisual speech.
We observed dissociations between the sites that display phase versus power tracking of visual speech: the phase tracking sites concentrated in the auditory- and language-related superior, lateral and inferior temporal cortices, while the power tracking sites also involved temporal cortex but were more widely distributed in frontal, parietal, and occipital regions. Further, phase tracking was evident at lower frequencies (in the delta band), whereas power tracking extended to the theta band. These anatomic and physiological distinctions are consistent with the idea that phase reset and evoked responses provide complementary mechanisms for the influence of visual speech in auditory cortex. The magnitude of visual speech tracking by BHA correlated significantly with that of phase tracking, but not with power tracking. This pattern of effects suggests that phase reset by visual speech augments the neuronal representation of auditory speech in auditory cortex (Schroeder et al., 2008). The same phase reset would also provide a reference frame for spike-phase coding of information (Kayser et al., 2009, 2010), but unit recordings would be necessary to evaluate that idea. The proposal that evoked response reductions to audiovisual speech in power tracking sites might represent more efficient cortical processing when both sensory streams bring congruent information could be tested by varying the congruence and information content of each sensory input, as has been explored for visual cortex (Schepers et al., 2015).
Not all auditory cortical sites tracked visual speech. One possible explanation is that the influence of visual speech cues on auditory cortex in general is relatively subtle. Alternatively, patches of auditory-responsive cortex that are also sensitive to visual speech cues could be interspersed within regions that only respond to one or the other modality, as was shown in the superior temporal sulcus of humans and monkeys (Beauchamp et al., 2004; Dahl et al., 2009). Further studies with higher-density iEEG electrode arrays and better coverage of Heschl's gyrus, the planum temporale, and the superior temporal sulcus will provide a finer-grained picture of visual speech tracking by auditory cortex.
The pattern of rapid quasi-rhythmic phase resetting that we observe has strong implications for the mechanistic understanding of speech processing in general. Indeed, this phase resetting aligns the ambient excitability fluctuations in auditory cortex with the incoming sensory stimuli, potentially helping to parse the continuous speech stream into linguistically relevant processing units such as syllables (Schroeder et al., 2008; Giraud and Poeppel, 2012; Zion Golumbic et al., 2012). As attention strongly reinforces the tracking of a specific speech stream (Mesgarani and Chang, 2012; Zion Golumbic et al., 2013b; O'Sullivan et al., 2015), phase resetting will tend to amplify an attended speech stream above background noise, increasing its perceptual salience.
We focused on the impact of visual speech cues on auditory cortex, and not the reverse, because the auditory component of speech is the more relevant one for intelligibility. Furthermore, although the relative timing of individual visual speech cues and the corresponding speech sounds is somewhat variable, on average the visual cues precede the auditory ones (Chandrasekaran et al., 2009; Schwartz and Savariaux, 2014). Accordingly, the phase reset hypothesis (Schroeder et al., 2008) posits that visual cues influence the processing of incoming speech sounds through phase reset, but does not make any prediction regarding the influence of speech sounds on the processing of visual speech. Speech sounds have been shown to modulate the responses of visual cortex to visual speech cues (Schepers et al., 2015); further work will need to examine the nature of that modulatory effect (crossmodal evoked responses vs phase reset).
The statistical relationship between auditory speech and the preceding visual speech gestures allows the brain to predictively bias auditory cortical excitability toward an optimal dynamic state. Oscillatory enhancement of local neuronal excitability operates over a relatively large range of phase angles (Buzsáki and Draguhn, 2004; Lakatos et al., 2005); for delta and theta oscillations, this implies a relatively wide temporal window. At least three anatomic and functional routes could underlie the influence of visual cues on auditory speech processing (Schroeder et al., 2008), as follows: (1) feedback from higher-order, multisensory, speech- or language-related cortex; (2) lateral projections from visual to auditory cortical areas; and (3) feedforward projections from visual thalamic nuclei. Given the progressively increasing response latencies to visual stimuli (Schroeder et al., 1998), including visual speech cues (Nishitani and Hari, 2002), in higher-order visual areas, and the relatively short audiovisual asynchronies in natural speech, the feedback route is unlikely to be the sole or even the major driver of crossmodal phase reset. Projections from visual thalamic nuclei and visual cortical areas to auditory cortex are documented in nonhuman primates (Smiley and Falchier, 2009; Falchier et al., 2010) and might underlie the short-latency responses of early auditory regions to crossmodal sensory input; these responses are modulatory rather than excitatory, in contrast to those driven by the feedforward thalamocortical projections of the preferred sensory modality (Schroeder et al., 2001; Lakatos et al., 2007, 2009). In sum, the range of available crossmodal circuitry, in combination with the time range over which excitability enhancement can operate, can readily support the temporal parameters of predictive phase reset as outlined in our hypothesis, although determining how the different components contribute will require additional research.
The importance of the lag between the visual and auditory components of vocalization stimuli was highlighted in a monkey study of a voice-sensitive area in the anterior temporal lobe: that lag determined whether neurons increased or decreased their firing rate in response to the auditory cue (Perrodin et al., 2015). The generalization of these observations to naturalistic audiovisual speech is hampered by the fact that only onset responses to short vocalizations were studied, making it impossible to differentiate phase reset from evoked responses. It would be revealing to study cortical tracking of continuous audiovisual speech while manipulating the lag between the auditory and visual streams. Alternatively, the variable lags that naturally occur between articulatory gestures and speech sounds could be leveraged to similar effect.
Could auditory speech imagery explain the influence of visual speech cues on auditory cortex? Efforts to track the neuronal correlates of speech imagery have struggled with the temporal alignment between cortical activity and the presumed (physically absent) speech, even for single words (Martin et al., 2016). In our experiment, the comparatively longer duration of the speech stimuli (7–11 s) makes it unlikely that participants could have learned the stories well enough to generate auditory imagery with perfect timing relative to the visual speech cues. If that had been the case, the split-halves analysis would presumably have disclosed better performance for the late trials versus the early ones. That we found no such improvement argues against speech imagery as the major explanation for the tracking of visual speech by auditory cortex in our data. A recent MEG study showed that auditory cortex tracked low-frequency (<0.5 Hz) features of the absent speech sounds in response to watching silent speech (Bourguignon et al., 2020). In that study as well, auditory imagery was deemed unlikely to explain auditory cortical entrainment.
To disentangle the contributions of low-frequency phase versus power in auditory cortical responses to visual speech, we examined linear-to-linear (for power) and linear-to-circular (for phase) cross-correlations separately. Few sites exhibited both phase and power tracking, in fact no more than expected by chance. This suggests that, although we did not address the issues of stimulus autocorrelation and correlation between auditory and visual stimuli, we were still able to delineate two sets of auditory-responsive cortical sites that tracked visual speech using distinct fundamental neuronal mechanisms. Other approaches were previously applied to characterize cortical responses to speech, like the spectrotemporal/temporal response function, which linearly models the cortical response based on the spectrotemporal characteristics of the stimulus (Zion Golumbic et al., 2013a,b). We are not aware of previous attempts to use such methods to describe the relationship between a linearly varying stimulus and a circularly varying response like oscillatory phase. Eventually, methods based on linear-to-circular spectrotemporal response functions, or partial correlation accounting for both circular and linear variables, might prove superior to the first step that we took here.
The intelligibility of the speech cues themselves was not the focus of this study, which is why we placed little emphasis on the participants' behavioral performance. Furthermore, we did not control for the participants' deployment of attention to a particular component of the stimuli. Thus, we cannot distinguish whether the differences in cortical tracking of audiovisual versus auditory speech are due to automatic crossmodal stimulus processing or to the attentional effects of multisensory versus unisensory stimuli (Macaluso et al., 2000; Johnson and Zatorre, 2006). Future work will reveal how cortical tracking is influenced by manipulating the participants' comprehension of audiovisual speech stimuli, as well as the way they focus their attention on them.
Visual enhancement of speech takes place within the context of strong top-down influences from frontal and parietal regions that support the processing of distinct linguistic features (Park et al., 2016, 2018; Di Liberto et al., 2018; Keitel et al., 2018). Further, low-frequency oscillations relevant to speech perception can themselves be modulated by transcranial electrical stimulation (Riecke et al., 2018; Zoefel et al., 2018). Our findings highlight the need to consider oscillatory phase in targeting potential neuromodulation therapy to enhance communication.
Footnotes
This work was supported by the Swiss National Science Foundation (Grants 139829, 145563, 148388, and 167836 to P.M.), the National Institute of Neurological Disorders and Stroke (Grant NS-098976 to M.S.B., C.E.S., and A.D.M.), and the Page and Otto Marx Jr Foundation (to A.D.M.). We thank the patients for their participation; Erin Yeagle, Willie Walker Jr., and the physicians and other professionals of the Neurosurgery and Neurology Departments of North Shore University Hospital; and Itzik Norman and Bahar Khalighinejad for their assistance. Part of the computations for this work was performed at the University of Geneva on the Baobab cluster.
The authors declare no competing financial interests.
- Correspondence should be addressed to Charles E. Schroeder at cs2388{at}columbia.edu or Ashesh D. Mehta at amehta{at}northwell.edu