Electrophysiological evidence for a multisensory speech-specific mode of perception
Highlights
► Is sine-wave speech (SWS) integrated with lipread information?
► Audiovisual integration occurred only when SWS was perceived as speech.
► A McGurk MMN with SWS was evoked in speech mode, but not in non-speech mode.
► Perceptual mode affects audiovisual integration at the sensory processing stage.
Introduction
An important question about speech perception is whether speech is processed like all other sounds (Fowler, 1996, Kuhl et al., 1991, Massaro, 1998), or whether there are specialized mechanisms responsible for translating the acoustic signal into phonetic segments (Repp, 1982, Tuomainen et al., 2005). A relevant finding favoring the notion of ‘speech-specificity’ was provided by Remez, Rubin, Pisoni, and Carrell (1981). They created time-varying sine-wave speech (SWS) replicas of natural speech that were perceived by naïve listeners as non-speech whistles, bleeps, or ‘computer sounds’; when another group of subjects was instructed about the speech-like nature of the stimuli, however, they could easily assign linguistic content to the sounds. This ambiguous nature of SWS has provided researchers with a tool to study the neural and behavioral specificity of speech sound processing, because identical acoustic stimuli can be used that are perceived differently, depending on the mode of the listener. In this way, it has been shown with functional magnetic resonance imaging (fMRI) that SWS stimuli elicit stronger activity within the left posterior superior temporal sulcus for listeners in speech mode than for listeners in non-speech mode (Möttönen et al., 2006). The phonetic content of SWS is also more likely to be integrated with visual information from lipread speech if listeners are in speech mode rather than non-speech mode, suggesting that audiovisual (AV) integration of phonetic information only occurs when listeners perceive the sound as speech (Tuomainen et al., 2005, Vroomen and Baart, 2009). Not all aspects of audiovisual integration, though, have been found to depend on the mode of the listener. One example is that perception of temporal synchrony between a heard SWS sound and lipread information does not differ for listeners in speech or non-speech mode (Vroomen & Stekelenburg, 2011).
Furthermore, lipread information can improve auditory detection of SWS targets in noise, but the size of the improvement does not depend on the mode of the listener (Eskelund, Tuomainen, & Andersen, 2011). This indicates that phonetic processing of the sound, but not its temporal or loudness processing, depends on the speech mode of the listener.
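The SWS transformation introduced by Remez and colleagues can be sketched numerically: the speech signal is reduced to a few sinusoids whose instantaneous frequencies and amplitudes follow the formant tracks of the original utterance. The sketch below is a minimal, self-contained illustration; the formant tracks are made-up sinusoidal trajectories, not tracks estimated from a real recording.

```python
import numpy as np

def synthesize_sws(formant_freqs, formant_amps, sr=16000):
    """Sum time-varying sinusoids, one per formant track.

    formant_freqs, formant_amps: arrays of shape (n_formants, n_samples)
    giving instantaneous frequency (Hz) and amplitude per sample.
    """
    n_formants, n_samples = formant_freqs.shape
    out = np.zeros(n_samples)
    for k in range(n_formants):
        # Integrate instantaneous frequency over time to obtain phase.
        phase = 2 * np.pi * np.cumsum(formant_freqs[k]) / sr
        out += formant_amps[k] * np.sin(phase)
    return out / n_formants  # keep the waveform within [-1, 1]

# Hypothetical tracks for three "formants" over 0.5 s of signal.
sr, dur = 16000, 0.5
n = int(sr * dur)
t = np.linspace(0, dur, n, endpoint=False)
freqs = np.stack([500 + 200 * np.sin(2 * np.pi * 2 * t),    # F1-like track
                  1500 + 300 * np.sin(2 * np.pi * 3 * t),   # F2-like track
                  2500 + 100 * np.sin(2 * np.pi * 1 * t)])  # F3-like track
amps = np.ones_like(freqs)
signal = synthesize_sws(freqs, amps, sr)
```

Because the sinusoids discard the harmonic and broadband structure of natural speech, the result sounds like whistles unless the listener is told it is derived from speech.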
In the current study, we searched for a neural correlate of the distinction between a ‘speech’ and ‘non-speech’ mode of audiovisual speech processing. It is generally acknowledged that in audiovisual speech the auditory and visual signals are integrated at some processing stage into a coherent multisensory representation. Hemodynamic studies have shown that multisensory cortices (superior temporal sulcus/gyrus) (Callan et al., 2004, Calvert et al., 2000, Skipper et al., 2005), ‘sensory-specific’ cortices (Callan et al., 2003, Calvert et al., 1999, Kilian-Hütten et al., 2011, von Kriegstein and Giraud, 2006) and motor areas (Ojanen et al., 2005, Skipper et al., 2005) are involved in audiovisual speech integration. Electrophysiological studies have shown that AV speech interactions occur in the auditory cortex as early as 100 ms (Arnal et al., 2009, Besle et al., 2004, Stekelenburg and Vroomen, 2007, van Wassenhove et al., 2005). Here, we used the well-known mismatch negativity (MMN) component in the electroencephalogram (EEG) as a neural marker of audiovisual speech integration. The MMN is a fronto-centrally negative event-related potential (ERP) component that is elicited by sounds that violate the automatic predictions of the central auditory system (Näätänen, Gaillard, & Mäntysalo, 1978). The MMN is measured by subtracting the ERP of a frequent ‘standard’ sound from an infrequent ‘deviant’ sound, and it appears as a negative deflection with a fronto-central maximum peaking around 150–250 ms from the onset of the sound change. The MMN is most likely generated in the auditory cortex and presumably reflects pre-attentive auditory deviance detection (Näätänen, Paavilainen, Rinne, & Alho, 2007).
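The deviant-minus-standard logic of the MMN can be made concrete with a short sketch. The ERPs below are simulated (a noisy baseline plus an extra negativity around 200 ms on deviants at a hypothetical fronto-central site such as Fz); the difference wave and its peak latency in the canonical 150–250 ms window are computed exactly as described above.

```python
import numpy as np

sr = 500  # Hz; hypothetical EEG sampling rate
times = np.arange(-0.1, 0.5, 1 / sr)  # epoch from -100 to 500 ms

# Simulated averaged ERPs at a fronto-central electrode.
rng = np.random.default_rng(0)
standard = rng.normal(0.0, 0.1, times.size)
# Deviant: same waveform plus an extra negativity centered at 200 ms.
deviant = standard - 2.0 * np.exp(-((times - 0.2) ** 2) / (2 * 0.03 ** 2))

# MMN = deviant ERP minus standard ERP.
mmn = deviant - standard

# Most negative point within the canonical 150-250 ms window.
window = (times >= 0.15) & (times <= 0.25)
peak_idx = np.argmin(mmn[window])
peak_latency_ms = times[window][peak_idx] * 1000
```

In a real analysis the averages would come from many artifact-free trials per condition, but the subtraction and peak search are this simple.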
Important from our perspective is that the MMN has also been successfully used to probe the neural mechanisms underlying the integration of information from different senses, as in the case of hearing and seeing speech (i.e., lipreading; Colin et al., 2002, Kislyuk et al., 2008, Saint-Amour et al., 2007, Sams et al., 1991). In a typical experiment, an intersensory conflict is created between the heard and lipread information, e.g., hearing /ba/ but seeing the face of a speaker articulate /ga/. The crucial aspect of this stimulus is that the incongruent lipread information leads to an illusory change in the perceived quality of the sound: a perceiver typically reports ‘hearing’ /da/ when in fact auditory /ba/ was presented combined with visual /ga/ (McGurk & MacDonald, 1976). This change in the perceived quality of the sound then evokes a ‘McGurk-MMN’, even though the acoustic information remains unchanged. This finding has generally been taken as evidence that lipread information can penetrate the auditory cortex and modulate its activity at a very basic level.
In the current study, we examined whether the McGurk-MMN indeed depends on the illusory change of the sound. So far, this has only been shown indirectly because the illusion itself has not been manipulated in any direct way. Another issue that has not yet been resolved is by what mechanism the distinction between a speech and non-speech mode of sound processing actually affects audiovisual integration. Tuomainen et al. (2005) speculated that attention may play a key role in the effect. It was argued that in speech mode, attention enhances processing and binding of those features in the stimuli that form a phonetic object, whereas in non-speech mode attention is focused on some other acoustic aspect (e.g., loudness, pitch, duration) that discriminates the stimuli. Whether attention is indeed the crucial factor can be tested using the McGurk-MMN because this brain potential does not require attention to be evoked. Here, we employed SWS while listeners were in speech or non-speech mode. Our ‘standard’ stimulus was an SWS sound derived from natural auditory /onso/ that was combined with the video of a speaker articulating the same syllables /onso/ (AnVn). The ‘deviant’ stimulus contained exactly the same SWS sound /onso/, but now combined with incongruent lipread information of /omso/ (AnVm). The incongruent lipread information was expected to change the percept of the SWS sound from /onso/ into /omso/ if the SWS sound was heard as speech, but this illusory change should be almost completely abolished if the SWS sound is perceived as non-speech (for behavioral evidence, see Eskelund et al., 2011, Tuomainen et al., 2005, Vroomen and Stekelenburg, 2011). This dissociation then allowed us to test, with identical stimuli, whether the illusory change in the sound actually affects the McGurk-MMN. If so, one would expect a McGurk-MMN with SWS sounds for listeners in speech mode, but not for listeners in non-speech mode.
For comparative purposes, we included a third group of listeners who heard the original natural recordings of /omso/ and /onso/ to confirm that the McGurk-MMN would be elicited by the original speech sounds. For all three groups, we also included a visual-only (V-only) and an auditory-only (A-only) condition. The V-only condition served as a control to rule out that the McGurk-MMN was based on the visual difference between standard and deviant (a visual MMN), and to correct the McGurk-MMN accordingly by subtracting the V-only deviance wave from the AV-wave (Saint-Amour et al., 2007). With the A-only condition, we could test whether an actual auditory change from SWS /onso/ into /omso/ resulted in a (non-illusory) auditory MMN, and whether that differed for listeners in speech and non-speech mode.
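The V-only correction amounts to simple wave arithmetic: the corrected McGurk-MMN is the AV deviant-minus-standard difference wave minus the corresponding V-only difference wave. A toy sketch with made-up four-sample ‘ERPs’ (a real analysis would use baseline-corrected averages over many trials):

```python
import numpy as np

def difference_wave(erp_deviant, erp_standard):
    """Deviant-minus-standard difference wave for one condition."""
    return np.asarray(erp_deviant) - np.asarray(erp_standard)

def corrected_mcgurk_mmn(av_dev, av_std, v_dev, v_std):
    """Subtract the visual-only deviance wave from the AV deviance wave."""
    return difference_wave(av_dev, av_std) - difference_wave(v_dev, v_std)

# Hypothetical four-sample "ERPs" per condition (arbitrary units).
av_std = np.array([0.0, -0.2, -0.1,  0.0])
av_dev = np.array([0.0, -0.8, -1.2, -0.1])
v_std  = np.array([0.0, -0.1, -0.1,  0.0])
v_dev  = np.array([0.0, -0.2, -0.3,  0.0])

mmn = corrected_mcgurk_mmn(av_dev, av_std, v_dev, v_std)
```

Whatever visual mismatch response the V-only condition produces on its own is thereby removed, leaving only the genuinely audiovisual part of the deviance response.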
Section snippets
Participants
Forty-five healthy students (13 males, 32 females, mean age of 21.0 years) with normal hearing and normal or corrected-to-normal vision participated after giving written informed consent (in accordance with the Declaration of Helsinki). They received course credits for their participation. They were equally divided into three between-subjects conditions (i.e., natural speech, SWS speech mode, and SWS non-speech mode). Note that a between-subject design was required because once participants
Behavioral data
The proportion of correctly identified auditory stimuli was calculated for the congruent and incongruent stimuli. As clearly visible in Fig. 1, and as expected, lipread information changed the percept of the sound if that sound was perceived as speech, but not if it was perceived as non-speech. This was confirmed in a repeated-measures MANOVA with Congruency (congruent vs. incongruent) as a within-subjects variable and Group (SWS non-speech mode, SWS speech mode,
Discussion
Using SWS, we demonstrated that the speech mode of a listener affects the McGurk-MMN. A single auditory token of SWS was presented repeatedly, together with either congruent or incongruent lipread information while listeners were either in speech or non-speech mode. The incongruent lipread information in the AV condition evoked a McGurk-MMN for listeners in speech mode, but not for listeners in non-speech mode. A behavioral experiment further demonstrated that the incongruent lipread
References (45)
- et al. (1998). Hemispheric lateralization in preattentive processing of speech sounds. Neuroscience Letters.
- et al. (2004). Auditory speech detection in noise enhanced by lipreading. Speech Communication.
- Calvert et al. (2000). Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Current Biology.
- Colin et al. (2002). Mismatch negativity evoked by the McGurk–MacDonald effect: A phonetic representation within short-term memory. Clinical Neurophysiology.
- Gratton et al. (1983). A new method for off-line removal of ocular artifact. Electroencephalography & Clinical Neurophysiology.
- et al. (2003). Electrophysiological indicators of phonetic and non-phonetic multisensory interactions during audiovisual speech perception. Brain Research, Cognitive Brain Research.
- Möttönen et al. (2006). Perceiving identical sounds as speech or non-speech modulates activity in the left posterior superior temporal sulcus. Neuroimage.
- Näätänen, Gaillard, & Mäntysalo (1978). Early selective-attention effect in evoked potential reinterpreted. Acta Psychologica.
- Näätänen, Paavilainen, Rinne, & Alho (2007). The mismatch negativity (MMN) in basic research of central auditory processing: A review. Clinical Neurophysiology.
- Ojanen et al. (2005). Processing of audiovisual speech in Broca's area. Neuroimage.
- Saint-Amour et al. (2007). Seeing voices: High-density electrical mapping and source-analysis of the multisensory mismatch negativity evoked during the McGurk illusion. Neuropsychologia.
- Sams et al. (1991). Seeing speech: Visual information from lip movements modifies activity in the human auditory cortex. Neuroscience Letters.
- et al. Auditory frequency discrimination and event-related potentials. Electroencephalography & Clinical Neurophysiology.
- et al. Seeing to hear better: Evidence for early audio–visual interactions in speech identification. Cognition.
- Skipper et al. (2005). Listening to talking faces: Motor cortical activation during speech perception. Neuroimage.
- Stekelenburg et al. (2004). Illusory sound shifts induced by the ventriloquist illusion evoke the mismatch negativity. Neuroscience Letters.
- Tuomainen et al. (2005). Audio–visual speech perception is special. Cognition.
- Vroomen & Baart (2009). Phonetic recalibration only occurs in speech mode. Cognition.
- Vroomen & Stekelenburg (2011). Perception of intersensory synchrony in audiovisual speech: Not that special. Cognition.
- Arnal et al. (2009). Dual neural routing of visual facilitation in speech processing. Journal of Neuroscience.
- Besle et al. (2004). Bimodal speech: Early suppressive visual effects in human auditory cortex. European Journal of Neuroscience.