Neuropsychologia

Volume 50, Issue 7, June 2012, Pages 1425-1431

Electrophysiological evidence for a multisensory speech-specific mode of perception

https://doi.org/10.1016/j.neuropsychologia.2012.02.027

Abstract

We investigated whether the interpretation of auditory stimuli as speech or non-speech affects audiovisual (AV) speech integration at the neural level. Perceptually ambiguous sine-wave speech (SWS) replicas of natural speech were presented to listeners who were either in ‘speech mode’ or ‘non-speech mode’. At the behavioral level, incongruent lipread information led to an illusory change of the sound only for listeners in speech mode. The neural correlates of this illusory change were examined in an audiovisual mismatch negativity (MMN) paradigm with SWS sounds. In an oddball sequence, ‘standards’ consisted of SWS /onso/ coupled with lipread /onso/, and ‘deviants’ consisted of SWS /onso/ coupled with lipread /omso/. The AV deviant induced a McGurk-MMN for listeners in speech mode, but not for listeners in non-speech mode. These results demonstrate that the illusory change in the sound induced by incongruent lipread information evoked an MMN, which presumably arises at a pre-attentive stage of sensory processing.

Highlights

  • Is sine-wave speech (SWS) integrated with lipread information?
  • Audiovisual integration occurred only when SWS was perceived as speech.
  • A McGurk-MMN with SWS was evoked in speech mode, but not in non-speech mode.
  • Perceptual mode affects audiovisual integration at the sensory processing stage.

Introduction

An important question about speech perception is whether speech is processed like all other sounds (Fowler, 1996, Kuhl et al., 1991, Massaro, 1998), or whether there are specialized mechanisms responsible for translating the acoustic signal into phonetic segments (Repp, 1982, Tuomainen et al., 2005). A relevant finding favoring the notion of ‘speech-specificity’ was provided by Remez, Rubin, Pisoni, and Carrell (1981). They created time-varying sine-wave speech (SWS) replicas of natural speech that were perceived by naïve listeners as non-speech whistles, bleeps, or ‘computer sounds’, but when another group of subjects was instructed about the speech-like nature of the stimuli, they could easily assign linguistic content to the sounds. This ambiguous nature of SWS has provided researchers with a tool to study the neural and behavioral specificity of speech sound processing, because identical acoustic stimuli can be used that are perceived differently depending on the mode of the listener. In this way, it has been shown with functional magnetic resonance imaging (fMRI) that SWS stimuli elicit stronger activity within the left posterior superior temporal sulcus for listeners in speech mode than for listeners in non-speech mode (Möttönen et al., 2006). The phonetic content of SWS is also more likely to be integrated with visual information from lipread speech if listeners are in speech mode rather than non-speech mode, suggesting that audiovisual (AV) integration of phonetic information only occurs when listeners perceive the sound as speech (Tuomainen et al., 2005, Vroomen and Baart, 2009). Not all aspects of audiovisual integration, though, have been found to depend on the mode of the listener. For example, perception of temporal synchrony between a heard SWS sound and lipread information does not differ between listeners in speech and non-speech mode (Vroomen & Stekelenburg, 2011). Furthermore, lipread information can improve auditory detection of SWS targets in noise, but the size of the improvement does not depend on the mode of the listener (Eskelund, Tuomainen, & Andersen, 2011). Together, these findings indicate that phonetic processing of the sound, but not its temporal or loudness processing, depends on the speech mode of the listener.
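For readers unfamiliar with sine-wave speech, the sketch below illustrates in Python the general resynthesis idea: the speech signal is replaced by a sum of sinusoids whose frequencies and amplitudes follow the formant tracks of the original utterance (Remez et al., 1981). The formant tracks here are placeholder arrays for illustration only; the stimuli used in the present study were SWS tokens derived from natural recordings of /onso/ and /omso/, and the exact analysis parameters are not those shown here.

```python
import numpy as np

def synthesize_sws(formant_freqs, formant_amps, frame_rate, sample_rate=16000):
    """Resynthesize sine-wave speech from per-frame formant tracks.

    formant_freqs, formant_amps: arrays of shape (n_frames, n_formants)
    giving the centre frequency (Hz) and amplitude of each formant per frame.
    """
    n_frames, n_formants = formant_freqs.shape
    n_samples = int(n_frames * sample_rate / frame_rate)
    t_frames = np.arange(n_frames) / frame_rate
    t_samples = np.arange(n_samples) / sample_rate
    signal = np.zeros(n_samples)
    for k in range(n_formants):
        # Interpolate the frame-level tracks up to the audio sample rate.
        freq = np.interp(t_samples, t_frames, formant_freqs[:, k])
        amp = np.interp(t_samples, t_frames, formant_amps[:, k])
        # Integrate the instantaneous frequency to obtain the sinusoid's phase.
        phase = 2 * np.pi * np.cumsum(freq) / sample_rate
        signal += amp * np.sin(phase)
    return signal / np.max(np.abs(signal))  # normalize to [-1, 1]

# Hypothetical tracks: three formants gliding over a 500-ms token (100 frames at 200 Hz).
freqs = np.stack([np.linspace(500, 700, 100),
                  np.linspace(1500, 1200, 100),
                  np.linspace(2500, 2600, 100)], axis=1)
amps = np.ones_like(freqs) * np.array([1.0, 0.6, 0.3])
sws = synthesize_sws(freqs, amps, frame_rate=200)
```

Because only three time-varying sinusoids remain, the token sounds like whistles or ‘computer sounds’ to naïve listeners, yet carries enough formant information to be heard as speech once listeners know what to listen for.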

In the current study, we searched for a neural correlate of the distinction between a ‘speech’ and ‘non-speech’ mode of audiovisual speech processing. It is generally acknowledged that in audiovisual speech the auditory and visual signals are integrated at some processing stage into a coherent multisensory representation. Hemodynamic studies have shown that multisensory cortices (superior temporal sulcus/gyrus) (Callan et al., 2004, Calvert et al., 2000, Skipper et al., 2005), ‘sensory-specific’ cortices (Callan et al., 2003, Calvert et al., 1999, Kilian-Hütten et al., 2011, von Kriegstein and Giraud, 2006), and motor areas (Ojanen et al., 2005, Skipper et al., 2005) are involved in audiovisual speech integration. Electrophysiological studies have shown that AV speech interactions occur in the auditory cortex as early as 100 ms (Arnal et al., 2009, Besle et al., 2004, Stekelenburg and Vroomen, 2007, van Wassenhove et al., 2005). Here, we used the well-known mismatch negativity (MMN) component of the electroencephalogram (EEG) as a neural marker of audiovisual speech integration. The MMN is a fronto-central negative event-related potential (ERP) component that is elicited by sounds that violate the automatic predictions of the central auditory system (Näätänen, Gaillard, & Mäntysalo, 1978). The MMN is measured by subtracting the ERP to a frequent ‘standard’ sound from the ERP to an infrequent ‘deviant’ sound, and it appears as a negative deflection with a fronto-central maximum peaking around 150–250 ms after the onset of the sound change. The MMN is most likely generated in the auditory cortex and presumably reflects pre-attentive auditory deviance detection (Näätänen, Paavilainen, Rinne, & Alho, 2007).
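As a minimal sketch of the subtraction logic described above, the snippet below averages hypothetical standard and deviant epochs, forms the deviant-minus-standard difference wave, and reads off the most negative deflection per channel in a 150–250 ms window. Array shapes, channel counts and sampling parameters are illustrative and not those of the present recordings.

```python
import numpy as np

def mismatch_negativity(std_epochs, dev_epochs, times, window=(0.15, 0.25)):
    """Compute the MMN difference wave and its peak within a latency window.

    std_epochs, dev_epochs: arrays of shape (n_epochs, n_channels, n_times).
    times: time points in seconds relative to the onset of the sound change.
    """
    std_erp = std_epochs.mean(axis=0)          # ERP to the standard
    dev_erp = dev_epochs.mean(axis=0)          # ERP to the deviant
    diff_wave = dev_erp - std_erp              # deviant minus standard
    mask = (times >= window[0]) & (times <= window[1])
    # Most negative value (and its latency) per channel within the MMN window.
    peak_amps = diff_wave[:, mask].min(axis=1)
    peak_lats = times[mask][diff_wave[:, mask].argmin(axis=1)]
    return diff_wave, peak_amps, peak_lats

# Hypothetical data: 100 epochs, 32 channels, -100 to 500 ms at 1000 Hz.
times = np.arange(-100, 500) / 1000.0
std = np.random.randn(100, 32, times.size)
dev = np.random.randn(100, 32, times.size)
diff, amps, lats = mismatch_negativity(std, dev, times)
```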

Important from our perspective is that the MMN has also been successfully used to probe the neural mechanisms underlying the integration of information from different senses, as in the case of hearing and seeing speech (i.e., lipreading; Colin et al., 2002, Kislyuk et al., 2008, Saint-Amour et al., 2007, Sams et al., 1991). In a typical experiment, an intersensory conflict is created between the heard and lipread information, e.g., hearing /ba/ while seeing the face of a speaker articulate /ga/. The crucial aspect of this stimulus is that the incongruent lipread information leads to an illusory change in the perceived quality of the sound: a perceiver typically reports ‘hearing’ /da/ when in fact auditory /ba/ was presented combined with visual /ga/ (McGurk & MacDonald, 1976). This change in the quality of the sound then evokes a ‘McGurk-MMN’, even though the acoustic information remains unchanged. This finding has generally been taken as evidence that lipread information can penetrate auditory cortex and modulate its activity at a very basic level.

In the current study, we examined whether the McGurk-MMN indeed depends on the illusory change of the sound. So far, this has only been shown indirectly, because the illusion itself has not been manipulated in any direct way. Another issue that has not yet been resolved is by what mechanism the distinction between a speech and a non-speech mode of sound processing actually affects audiovisual integration. Tuomainen et al. (2005) speculated that attention may play a key role in the effect. It was argued that in speech mode, attention enhances processing and binding of those features in the stimuli that form a phonetic object, whereas in non-speech mode attention is focused on some other acoustic aspect (e.g., loudness, pitch, duration) that discriminates the stimuli. Whether attention is indeed the crucial factor can be tested using the McGurk-MMN, because this brain potential does not require attention to be evoked. Here, we employed SWS while listeners were in speech or non-speech mode. Our ‘standard’ stimulus was an SWS sound derived from natural auditory /onso/ that was combined with the video of a speaker articulating the same syllables /onso/ (AnVn). The ‘deviant’ stimulus contained exactly the same SWS sound /onso/, but now combined with incongruent lipread information of /omso/ (AnVm). The incongruent lipread information was expected to change the percept of the SWS sound from /onso/ into /omso/ if the SWS sound was heard as speech, but this illusory change should be almost completely abolished if the SWS sound is perceived as non-speech (for behavioral evidence, see Eskelund et al., 2011, Tuomainen et al., 2005, Vroomen and Stekelenburg, 2011). This dissociation then allowed us to test, with identical stimuli, whether the illusory change in the sound actually affects the McGurk-MMN. If so, one would expect a McGurk-MMN with SWS sounds for listeners in speech mode, but not for listeners in non-speech mode. For comparative purposes, we included a third group of listeners who heard the original natural recordings of /omso/ and /onso/ to confirm that the McGurk-MMN would be elicited by the original speech sounds. For all three groups, we also included a visual-only (V-only) and an auditory-only (A-only) condition. The V-only condition served as a control to rule out that the McGurk-MMN was based on the visual difference between standard and deviant (a visual MMN), and to correct the McGurk-MMN accordingly by subtracting the V-only deviance wave from the AV deviance wave (Saint-Amour et al., 2007). With the A-only condition, we could test whether an actual auditory change from SWS /onso/ into /omso/ resulted in a (non-illusory) auditory MMN, and whether that differed for listeners in speech and non-speech mode.
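The visual correction just described reduces to a subtraction of difference waves. A minimal sketch, assuming averaged ERPs are available for the standard and deviant in both the AV and V-only conditions (variable names are illustrative, not taken from the article):

```python
import numpy as np

def corrected_mcgurk_mmn(av_std, av_dev, v_std, v_dev):
    """Correct the AV deviance wave for purely visual deviance.

    Each argument is an averaged ERP array of shape (n_channels, n_times).
    Returns (AV deviant - AV standard) - (V-only deviant - V-only standard),
    i.e. the McGurk-MMN corrected for a possible visual MMN.
    """
    av_deviance = av_dev - av_std     # AV deviance wave
    v_deviance = v_dev - v_std        # visual-only deviance wave
    return av_deviance - v_deviance   # corrected McGurk-MMN
```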

Section snippets

Participants

Forty-five healthy students (13 males, 32 females, mean age of 21.0 years) with normal hearing and normal or corrected-to-normal vision participated after giving written informed consent (in accordance with the Declaration of Helsinki). They received course credits for their participation. They were equally divided into three between-subjects conditions (i.e., natural speech, SWS speech mode, and SWS non-speech mode). Note that a between-subject design was required because once participants

Behavioral data

The proportion of correctly identified auditory stimuli was calculated for the congruent and incongruent stimuli. As clearly visible in Fig. 1, and as expected, lipread information changed the percept of the sound if that sound was perceived as speech, but not if it was perceived as non-speech. This was confirmed in a repeated measures MANOVA with Congruency (congruent vs. incongruent) as a within-subjects variable and Group (SWS non-speech mode, SWS speech mode,
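For illustration, an analogous between × within analysis can be set up as follows. This sketch uses a univariate mixed ANOVA from the pingouin package on a hypothetical long-format data frame (one proportion-correct score per participant and congruency level), rather than the repeated measures MANOVA reported here, and all data values are simulated.

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: 45 participants, 3 groups, 2 congruency levels.
rng = np.random.default_rng(0)
groups = np.repeat(['natural speech', 'SWS speech mode', 'SWS non-speech mode'], 15)
rows = []
for subj, group in enumerate(groups):
    for congruency in ['congruent', 'incongruent']:
        rows.append({'subject': subj,
                     'group': group,
                     'congruency': congruency,
                     'prop_correct': rng.uniform(0.5, 1.0)})
df = pd.DataFrame(rows)

# Mixed ANOVA: Congruency (within) x Group (between) on proportion correct.
aov = pg.mixed_anova(data=df, dv='prop_correct',
                     within='congruency', subject='subject',
                     between='group')
print(aov.round(3))
```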

Discussion

Using SWS, we demonstrated that the speech mode of a listener affects the McGurk-MMN. A single auditory token of SWS was presented repeatedly, together with either congruent or incongruent lipread information while listeners were either in speech or non-speech mode. The incongruent lipread information in the AV condition evoked a McGurk-MMN for listeners in speech mode, but not for listeners in non-speech mode. A behavioral experiment further demonstrated that the incongruent lipread

References (45)

  • D. Saint-Amour et al.

    Seeing voices: High-density electrical mapping and source-analysis of the multisensory mismatch negativity evoked during the McGurk illusion

    Neuropsychologia

    (2007)
  • M. Sams et al.

    Seeing speech: Visual information from lip movements modifies activity in the human auditory cortex

    Neuroscience Letters

    (1991)
  • M. Sams et al.

    Auditory frequency discrimination and event-related potentials

    Electroencephalography & Clinical Neurophysiology

    (1985)
  • J.L. Schwartz et al.

    Seeing to hear better: Evidence for early audio–visual interactions in speech identification

    Cognition

    (2004)
  • J.I. Skipper et al.

    Listening to talking faces: Motor cortical activation during speech perception

    Neuroimage

    (2005)
  • J.J. Stekelenburg et al.

    Illusory sound shifts induced by the ventriloquist illusion evoke the mismatch negativity

    Neuroscience Letters

    (2004)
  • J. Tuomainen et al.

    Audio–visual speech perception is special

    Cognition

    (2005)
  • J. Vroomen et al.

    Phonetic recalibration only occurs in speech mode

    Cognition

    (2009)
  • J. Vroomen et al.

    Perception of intersensory synchrony in audiovisual speech: Not that special

    Cognition

    (2011)
  • L.H. Arnal et al.

    Dual neural routing of visual facilitation in speech processing

    Journal of Neuroscience

    (2009)
  • J. Besle et al.

    Bimodal speech: Early suppressive visual effects in human auditory cortex

    European Journal of Neuroscience

    (2004)
  • Boersma, P., & Weenink, K. (1999–2005). Praat: Doing phonetics by computer. Retrieved from:...