Electrophysiological evidence for a multisensory speech-specific mode of perception
Highlights
► Is sine-wave speech (SWS) integrated with lipread information?
► Audiovisual integration occurred only when SWS was perceived as speech.
► A McGurk MMN with SWS was evoked in speech mode, but not in non-speech mode.
► Perceptual mode affects audiovisual integration at the sensory processing stage.
Introduction
An important question about speech perception is whether speech is processed like all other sounds (Fowler, 1996, Kuhl et al., 1991, Massaro, 1998), or whether there are specialized mechanisms responsible for translating the acoustic signal into phonetic segments (Repp, 1982, Tuomainen et al., 2005). A relevant finding favoring the notion of ‘speech-specificity’ was provided by Remez, Rubin, Pisoni, and Carrell (1981). They created time-varying sine-wave speech (SWS) replicas of natural speech that were perceived by naïve listeners as non-speech whistles, bleeps, or ‘computer sounds’; when another group of subjects was instructed about the speech-like nature of the stimuli, however, they could easily assign linguistic content to the sounds. This ambiguous nature of SWS has provided researchers with a tool to study the neural and behavioral specificity of speech sound processing, because identical acoustic stimuli can be used that are perceived differently, depending on the mode of the listener. In this way, it has been shown with functional magnetic resonance imaging (fMRI) that SWS stimuli elicit stronger activity within the left posterior superior temporal sulcus for listeners in speech mode than for listeners in non-speech mode (Möttönen et al., 2006). The phonetic content of SWS is also more likely to be integrated with visual information from lipread speech if listeners are in speech mode rather than non-speech mode, suggesting that audiovisual (AV) integration of phonetic information only occurs when listeners perceive the sound as speech (Tuomainen et al., 2005, Vroomen and Baart, 2009). Not all aspects of audiovisual integration, though, have been found to depend on the mode of the listener. One example is that perception of temporal synchrony between a heard SWS sound and lipread information does not differ for listeners in speech or non-speech mode (Vroomen & Stekelenburg, 2011).
Furthermore, lipread information can improve auditory detection of SWS targets in noise, but the size of the improvement does not depend on the mode of the listener (Eskelund, Tuomainen, & Andersen, 2011). This indicates that phonetic processing of the sound, but not its temporal or loudness processing, depends on the speech mode of the listener.
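The SWS transformation introduced by Remez and colleagues can be sketched numerically: the speech signal is reduced to a few sinusoids whose instantaneous frequencies and amplitudes follow the formant tracks of the original utterance. The sketch below is a minimal, self-contained illustration; the formant tracks are made-up sinusoidal trajectories, not tracks estimated from a real recording.

```python
import numpy as np

def synthesize_sws(formant_freqs, formant_amps, sr=16000):
    """Sum time-varying sinusoids, one per formant track.

    formant_freqs, formant_amps: arrays of shape (n_formants, n_samples)
    giving instantaneous frequency (Hz) and amplitude per sample.
    """
    n_formants, n_samples = formant_freqs.shape
    out = np.zeros(n_samples)
    for k in range(n_formants):
        # Integrate instantaneous frequency over time to obtain phase.
        phase = 2 * np.pi * np.cumsum(formant_freqs[k]) / sr
        out += formant_amps[k] * np.sin(phase)
    return out / n_formants  # keep the waveform within [-1, 1]

# Hypothetical tracks for three "formants" over 0.5 s of signal.
sr, dur = 16000, 0.5
n = int(sr * dur)
t = np.linspace(0, dur, n, endpoint=False)
freqs = np.stack([500 + 200 * np.sin(2 * np.pi * 2 * t),    # F1-like track
                  1500 + 300 * np.sin(2 * np.pi * 3 * t),   # F2-like track
                  2500 + 100 * np.sin(2 * np.pi * 1 * t)])  # F3-like track
amps = np.ones_like(freqs)
signal = synthesize_sws(freqs, amps, sr)
```

Because the sinusoids discard the harmonic and broadband structure of natural speech, the result sounds like whistles unless the listener is told it is derived from speech.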
In the current study, we searched for a neural correlate of the distinction between a ‘speech’ and ‘non-speech’ mode of audiovisual speech processing. It is generally acknowledged that in audiovisual speech the auditory and visual signals are integrated at some processing stage into a coherent multisensory representation. Hemodynamic studies have shown that multisensory cortices (superior temporal sulcus/gyrus) (Callan et al., 2004, Calvert et al., 2000, Skipper et al., 2005), ‘sensory-specific’ cortices (Callan et al., 2003, Calvert et al., 1999, Kilian-Hütten et al., 2011, von Kriegstein and Giraud, 2006) and motor areas (Ojanen et al., 2005, Skipper et al., 2005) are involved in audiovisual speech integration. Electrophysiological studies have shown that AV speech interactions occur in the auditory cortex as early as 100 ms (Arnal et al., 2009, Besle et al., 2004, Stekelenburg and Vroomen, 2007, van Wassenhove et al., 2005). Here, we used the well-known mismatch negativity (MMN) component in the electroencephalogram (EEG) as a neural marker of audiovisual speech integration. The MMN is a fronto-centrally negative event-related potential (ERP) component that is elicited by sounds that violate the automatic predictions of the central auditory system (Näätänen, Gaillard, & Mäntysalo, 1978). The MMN is measured by subtracting the ERP of a frequent ‘standard’ sound from an infrequent ‘deviant’ sound, and it appears as a negative deflection with a fronto-central maximum peaking around 150–250 ms from the onset of the sound change. The MMN is most likely generated in the auditory cortex and presumably reflects pre-attentive auditory deviance detection (Näätänen, Paavilainen, Rinne, & Alho, 2007).
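The deviant-minus-standard logic of the MMN can be made concrete with a short sketch. The ERPs below are simulated (a noisy baseline plus an extra negativity around 200 ms on deviants at a hypothetical fronto-central site such as Fz); the difference wave and its peak latency in the canonical 150–250 ms window are computed exactly as described above.

```python
import numpy as np

sr = 500  # Hz; hypothetical EEG sampling rate
times = np.arange(-0.1, 0.5, 1 / sr)  # epoch from -100 to 500 ms

# Simulated averaged ERPs at a fronto-central electrode.
rng = np.random.default_rng(0)
standard = rng.normal(0.0, 0.1, times.size)
# Deviant: same waveform plus an extra negativity centered at 200 ms.
deviant = standard - 2.0 * np.exp(-((times - 0.2) ** 2) / (2 * 0.03 ** 2))

# MMN = deviant ERP minus standard ERP.
mmn = deviant - standard

# Most negative point within the canonical 150-250 ms window.
window = (times >= 0.15) & (times <= 0.25)
peak_idx = np.argmin(mmn[window])
peak_latency_ms = times[window][peak_idx] * 1000
```

In a real analysis the averages would come from many artifact-free trials per condition, but the subtraction and peak search are this simple.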
Important from our perspective is that the MMN has also been successfully used to probe the neural mechanisms underlying the integration of information from different senses, as in the case of hearing and seeing speech (i.e., lipreading; Colin et al., 2002, Kislyuk et al., 2008, Saint-Amour et al., 2007, Sams et al., 1991). In a typical experiment, an intersensory conflict is created between the heard and lipread information, e.g., hearing /ba/ but seeing the face of a speaker articulate /ga/. The crucial aspect of this stimulus is that the incongruent lipread information leads to an illusory change in the perceived quality of the sound: a perceiver typically reports ‘hearing’ /da/ when in fact auditory /ba/ was presented combined with visual /ga/ (McGurk & MacDonald, 1976). This change in the perceived quality of the sound then evokes a ‘McGurk-MMN’, even though the acoustic information remains unchanged. This finding has generally been taken as evidence that lipread information can penetrate the auditory cortex and modulate its activity at a very basic level.
In the current study, we examined whether the McGurk-MMN indeed depends on the illusory change of the sound. So far, this has only been shown indirectly because the illusion itself has not been manipulated in any direct way. Another issue that has not yet been resolved is by what mechanism the distinction between a speech and non-speech mode of sound processing actually affects audiovisual integration. Tuomainen et al. (2005) speculated that attention may play a key role in the effect. It was argued that in speech mode, attention enhances processing and binding of those features in the stimuli that form a phonetic object, whereas in non-speech mode attention is focused on some other acoustic aspect (e.g., loudness, pitch, duration) that discriminates the stimuli. Whether attention is indeed the crucial factor can be tested using the McGurk-MMN because this brain potential does not require attention to be evoked. Here, we employed SWS while listeners were in speech or non-speech mode. Our ‘standard’ stimulus was an SWS sound derived from natural auditory /onso/ that was combined with the video of a speaker articulating the same syllables /onso/ (AnVn). The ‘deviant’ stimulus contained exactly the same SWS sound /onso/, but now combined with incongruent lipread information of /omso/ (AnVm). The incongruent lipread information was expected to change the percept of the SWS sound from /onso/ into /omso/ if the SWS sound was heard as speech, but this illusory change should be almost completely abolished if the SWS sound is perceived as non-speech (for behavioral evidence, see Eskelund et al., 2011, Tuomainen et al., 2005, Vroomen and Stekelenburg, 2011). This dissociation then allowed us to test, with identical stimuli, whether the illusory change in the sound actually affects the McGurk-MMN. If so, one would expect a McGurk-MMN with SWS sounds for listeners in speech mode, but not for listeners in non-speech mode.
For comparative purposes, we included a third group of listeners who heard the original natural recordings of /omso/ and /onso/ to confirm that the McGurk-MMN would be elicited by the original speech sounds. For all three groups, we also included a visual-only (V-only) and an auditory-only (A-only) condition. The V-only condition served as a control to rule out that the McGurk-MMN was based on the visual difference between standard and deviant (a visual MMN), and to correct the McGurk-MMN accordingly by subtracting the V-only deviance wave from the AV-wave (Saint-Amour et al., 2007). With the A-only condition, we could test whether an actual auditory change from SWS /onso/ into /omso/ resulted in a (non-illusory) auditory MMN, and whether that differed for listeners in speech and non-speech mode.
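The V-only correction amounts to simple wave arithmetic: the corrected McGurk-MMN is the AV deviant-minus-standard difference wave minus the corresponding V-only difference wave. A toy sketch with made-up four-sample ‘ERPs’ (a real analysis would use baseline-corrected averages over many trials):

```python
import numpy as np

def difference_wave(erp_deviant, erp_standard):
    """Deviant-minus-standard difference wave for one condition."""
    return np.asarray(erp_deviant) - np.asarray(erp_standard)

def corrected_mcgurk_mmn(av_dev, av_std, v_dev, v_std):
    """Subtract the visual-only deviance wave from the AV deviance wave."""
    return difference_wave(av_dev, av_std) - difference_wave(v_dev, v_std)

# Hypothetical four-sample "ERPs" per condition (arbitrary units).
av_std = np.array([0.0, -0.2, -0.1,  0.0])
av_dev = np.array([0.0, -0.8, -1.2, -0.1])
v_std  = np.array([0.0, -0.1, -0.1,  0.0])
v_dev  = np.array([0.0, -0.2, -0.3,  0.0])

mmn = corrected_mcgurk_mmn(av_dev, av_std, v_dev, v_std)
```

Whatever visual mismatch response the V-only condition produces on its own is thereby removed, leaving only the genuinely audiovisual part of the deviance response.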
Section snippets
Participants
Forty-five healthy students (13 males, 32 females, mean age of 21.0 years) with normal hearing and normal or corrected-to-normal vision participated after giving written informed consent (in accordance with the Declaration of Helsinki). They received course credits for their participation. They were equally divided into three between-subjects conditions (i.e., natural speech, SWS speech mode, and SWS non-speech mode). Note that a between-subject design was required because once participants
Behavioral data
The proportion of correctly identified auditory stimuli was calculated for the congruent and incongruent stimuli. As clearly visible in Fig. 1, and as expected, lipread information changed the percept of the sound if that sound was perceived as speech, but not if it was perceived as non-speech. This was confirmed in a repeated-measures MANOVA with Congruency (congruent vs. incongruent) as a within-subjects variable and Group (SWS non-speech mode, SWS speech mode,
Discussion
Using SWS, we demonstrated that the speech mode of a listener affects the McGurk-MMN. A single auditory token of SWS was presented repeatedly, together with either congruent or incongruent lipread information while listeners were either in speech or non-speech mode. The incongruent lipread information in the AV condition evoked a McGurk-MMN for listeners in speech mode, but not for listeners in non-speech mode. A behavioral experiment further demonstrated that the incongruent lipread
References (45)
- et al. (1998). Hemispheric lateralization in preattentive processing of speech sounds. Neuroscience Letters.
- et al. (2004). Auditory speech detection in noise enhanced by lipreading. Speech Communication.
- Calvert et al. (2000). Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Current Biology.
- Colin et al. (2002). Mismatch negativity evoked by the McGurk–MacDonald effect: A phonetic representation within short-term memory. Clinical Neurophysiology.
- Gratton et al. (1983). A new method for off-line removal of ocular artifact. Electroencephalography & Clinical Neurophysiology.
- et al. (2003). Electrophysiological indicators of phonetic and non-phonetic multisensory interactions during audiovisual speech perception. Brain Research, Cognitive Brain Research.
- Möttönen et al. (2006). Perceiving identical sounds as speech or non-speech modulates activity in the left posterior superior temporal sulcus. Neuroimage.
- Näätänen, Gaillard, & Mäntysalo (1978). Early selective-attention effect in evoked potential reinterpreted. Acta Psychologica.
- Näätänen, Paavilainen, Rinne, & Alho (2007). The mismatch negativity (MMN) in basic research of central auditory processing: A review. Clinical Neurophysiology.
- Ojanen et al. (2005). Processing of audiovisual speech in Broca's area. Neuroimage.
- Saint-Amour et al. (2007). Seeing voices: High-density electrical mapping and source-analysis of the multisensory mismatch negativity evoked during the McGurk illusion. Neuropsychologia.
- Sams et al. (1991). Seeing speech: Visual information from lip movements modifies activity in the human auditory cortex. Neuroscience Letters.
- et al. Auditory frequency discrimination and event-related potentials. Electroencephalography & Clinical Neurophysiology.
- et al. Seeing to hear better: Evidence for early audio–visual interactions in speech identification. Cognition.
- Skipper et al. (2005). Listening to talking faces: Motor cortical activation during speech perception. Neuroimage.
- Stekelenburg et al. (2004). Illusory sound shifts induced by the ventriloquist illusion evoke the mismatch negativity. Neuroscience Letters.
- Tuomainen et al. (2005). Audio–visual speech perception is special. Cognition.
- Vroomen & Baart (2009). Phonetic recalibration only occurs in speech mode. Cognition.
- Vroomen & Stekelenburg (2011). Perception of intersensory synchrony in audiovisual speech: Not that special. Cognition.
- Arnal et al. (2009). Dual neural routing of visual facilitation in speech processing. Journal of Neuroscience.
- Besle et al. (2004). Bimodal speech: Early suppressive visual effects in human auditory cortex. European Journal of Neuroscience.