NeuroImage

Volume 37, Issue 4, 1 October 2007, Pages 1445-1456

Audiovisual integration of emotional signals in voice and face: An event-related fMRI study

https://doi.org/10.1016/j.neuroimage.2007.06.020

Abstract

In a natural environment, non-verbal emotional communication is multimodal (i.e. speech melody, facial expression) and multifaceted concerning the variety of expressed emotions. Understanding these communicative signals and integrating them into a common percept is paramount to successful social behaviour. While many previous studies have focused on the neurobiology of emotional communication in the auditory or visual modality alone, far less is known about multimodal integration of auditory and visual non-verbal emotional information. The present study investigated this process using event-related fMRI. Behavioural data revealed that audiovisual presentation of non-verbal emotional information resulted in a significant increase in correctly classified stimuli when compared with visual and auditory stimulation. This behavioural gain was paralleled by enhanced activation in bilateral posterior superior temporal gyrus (pSTG) and right thalamus, when contrasting audiovisual to auditory and visual conditions. Further, a characteristic of these brain regions, substantiating their role in the emotional integration process, is a linear relationship between the gain in classification accuracy and the strength of the BOLD response during the bimodal condition. Additionally, enhanced effective connectivity between audiovisual integration areas and associative auditory and visual cortices was observed during audiovisual stimulation, offering further insight into the neural process accomplishing multimodal integration. Finally, we were able to document an enhanced sensitivity of the putative integration sites to stimuli with emotional non-verbal content as compared to neutral stimuli.

Introduction

Taking part in social interactions requires us to integrate a variety of different inputs from several sense organs into a single percept of the situation we are dealing with. Inability to perceive and understand non-verbal emotional signals (i.e. speech melody, facial expression, gestures) within this process will often result in impaired communication.

Over the past years a plethora of neuroimaging studies have addressed the processing of faces (e.g. Haxby et al., 1994, Kanwisher et al., 1997, Sergent et al., 1992) and emotional facial expressions (e.g. Blair et al., 1999, Breiter et al., 1996, Morris et al., 1996, Phillips et al., 1997; reviewed in Posamentier and Abdi, 2003) as well as the processing of voices (e.g. Belin et al., 2000, Belizaire et al., 2007, Fecteau et al., 2004, Giraud et al., 2004, Kriegstein and Giraud, 2004) and emotional prosody (e.g. Buchanan et al., 2000, Ethofer et al., 2006b, Ethofer et al., 2006c, George et al., 1996, Grandjean et al., 2005, Imaizumi et al., 1997, Kotz et al., 2003, Mitchell et al., 2003, Wildgruber et al., 2002, Wildgruber et al., 2004, Wildgruber et al., 2005; reviewed in Wildgruber et al., 2006) identifying neural networks subserving the perception of visual and auditory non-verbal emotional signals. Yet, to date only very few neuroimaging studies on audiovisual integration of non-verbal emotional communication are available (Dolan et al., 2001, Ethofer et al., 2006a, Ethofer et al., 2006d, Pourtois et al., 2005). Behavioural studies demonstrated that congruence between facial expression and prosody facilitates reactions to stimuli carrying emotional information (De Gelder and Vroomen, 2000, Dolan et al., 2001, Massaro and Egan, 1996). This parallels findings from audiovisual integration of non-emotional information which indicate shortened response latencies and heightened perceptual sensitivity upon audiovisual stimulation (Miller, 1982, Schroger and Widmann, 1998). Moreover, affective signals received within one sensory channel can affect information processing in another. For instance, the perception of a facial expression can be altered by accompanying emotional prosody (Ethofer et al., 2006a, Massaro and Egan, 1996). Since these crossmodal biases occur mandatorily and irrespective of attention (De Gelder and Vroomen, 2000, Ethofer et al., 2006a, Vroomen et al., 2001) one might assume that the audiovisual integration of non-verbal affective information is an automatic process. This assumption gains further support in the results of electrophysiological experiments providing evidence for multisensory crosstalk during an early perceptual stage about 110–220 ms after stimulus presentation (De Gelder et al., 1999, Pourtois et al., 2000, Pourtois et al., 2002).

Neuroimaging data on audiovisual integration of non-verbal emotional information highlight stronger activation in the left middle temporal gyrus (MTG) (Pourtois et al., 2005) and the left posterior superior temporal sulcus (pSTS) (Ethofer et al., 2006d) during audiovisual stimulation as compared to either unimodal stimulation. These findings of activations adjacent to the superior temporal sulcus (STS) correspond well with reports of enhanced pSTS responses during audiovisual presentation of animals (Beauchamp et al., 2004b), tools (Beauchamp et al., 2004a, Beauchamp et al., 2004b), speech (Calvert et al., 2000, van Atteveldt et al., 2004, Wright et al., 2003) and letters (van Atteveldt et al., 2004).

Moreover, data from several electrophysiological and functional neuroimaging studies (Ghazanfar et al., 2005, Giard and Peronnet, 1999, von Kriegstein and Giraud, 2006) document early audiovisual integration processes within modality-specific cortices. To date it remains controversial to what extent these audiovisual response modulations arise from feedback connections of the STS with the respective unimodal auditory and visual cortices or from direct crosstalk between auditory and visual associative cortices.

So far, none of the neuroimaging studies on audiovisual integration of emotional prosody and facial expression has employed dynamic visual stimuli. Also, the stimulus material in these studies portrayed only two exemplary emotions (happy, fearful) and did not contain emotionally neutral stimuli.

In the present study, we used functional magnetic resonance imaging (fMRI) to delineate the audiovisual integration site for dynamic non-verbal emotional information. Participants were scanned while they performed a classification task on audiovisual (AV), auditory (A) or visual (V) presentation of people speaking single words expressing different emotional states in voice and face. Dynamic stimulation in combination with a broad variety of emotions was chosen in order to approximate real life conditions of social communication. The classification task was applied to ascertain constant attention to the stimuli and to acquire a behavioural measure of the audiovisual integration effect. In a prestudy, stimuli used in the fMRI experiment were tested for a relevant behavioural integration effect during the bimodal condition.

This experiment was designed to investigate audiovisual integration of non-verbal communication signals in the context of an explicit emotional categorization task, and it therefore treats the class of stimuli with emotionally neutral prosody and facial expression as one "emotional" category. In other words, the main focus of the present study lies on audiovisual integration of non-verbal communication under the top-down influence of an emotional classification task, rather than on bottom-up (stimulus-driven) effects of emotional content on the audiovisual integration of non-verbal communication.

For the identification of brain regions contributing to multimodal integration of emotional signals, responses during bimodal stimulation were compared to both unimodal conditions. Areas characterized by stronger responses to audiovisual than to either unimodal stimulation were considered candidate regions for the integration process.
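For illustration only (this is a minimal sketch, not the authors' analysis pipeline), the logic of this comparison can be written as a minimum-statistic conjunction of the two contrasts AV > A and AV > V (cf. Nichols et al., 2005), here in Python with hypothetical t-maps:

    import numpy as np

    # Hypothetical voxelwise t-maps for the contrasts AV > A and AV > V
    # (placeholder random data; in practice these would come from a standard GLM).
    rng = np.random.default_rng(0)
    t_av_gt_a = rng.normal(size=(64, 64, 30))
    t_av_gt_v = rng.normal(size=(64, 64, 30))

    # Minimum-statistic conjunction: a voxel is a candidate integration site
    # only if BOTH contrasts exceed the height threshold.
    t_threshold = 3.1  # assumed threshold, for illustration
    conjunction = np.minimum(t_av_gt_a, t_av_gt_v)
    candidate_mask = conjunction > t_threshold
    print("candidate voxels:", int(candidate_mask.sum()))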

We expected that in such a region, associated with audiovisual integration, the gain in classification accuracy under audiovisual stimulation as compared to either unimodal stimulation might be paralleled by increased cerebral activation.
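As a purely illustrative sketch (the per-subject variables and the definition of the gain are assumptions, not taken from the original methods), such a brain-behaviour relationship can be tested as a correlation across subjects between the audiovisual gain in classification accuracy and the BOLD response under audiovisual stimulation:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_subjects = 24

    # Hypothetical per-subject unbiased hit rates for A, V and AV (synthetic values).
    hu_a = rng.uniform(0.25, 0.45, n_subjects)
    hu_v = rng.uniform(0.45, 0.70, n_subjects)
    hu_av = rng.uniform(0.65, 0.90, n_subjects)

    # One plausible (assumed) operationalisation of the audiovisual gain:
    # improvement over the better of the two unimodal conditions.
    gain = hu_av - np.maximum(hu_a, hu_v)

    # Synthetic per-subject BOLD estimates for a candidate region under AV
    # stimulation, constructed here to show a positive relationship.
    bold_av = 0.8 * gain + rng.normal(0, 0.05, n_subjects)

    r, p = stats.pearsonr(gain, bold_av)
    print(f"r = {r:.2f}, p = {p:.3g}")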

Moreover, Macaluso and colleagues (2000) demonstrated that a perceptual gain during bimodal as compared to unimodal stimulation was paralleled by enhanced effective connectivity between associative sensory cortices and supramodal integration sites for vision and touch.

Accordingly, we expected enhanced effective connectivity between putative audiovisual integration areas and voice-sensitive (Belin et al., 2000) as well as face-sensitive (Haxby et al., 1994, Kanwisher et al., 1997, Sergent et al., 1992) areas during the audiovisual condition as compared to either unimodal condition, as a possible correlate of the perceptual gain in congruent audiovisual integration.
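One common way to formalise such a test of condition-dependent coupling is a psychophysiological interaction (PPI) regression (cf. Gitelman et al., 2003). The following sketch is illustrative only, uses hypothetical time courses, and omits the hemodynamic deconvolution a full PPI analysis would include:

    import numpy as np

    # Hypothetical time courses over n scans (illustration only).
    n = 200
    rng = np.random.default_rng(2)
    seed_ts = rng.normal(size=n)     # e.g. signal extracted from an integration-site seed
    target_ts = rng.normal(size=n)   # e.g. signal from voice- or face-sensitive cortex
    psych = np.where((np.arange(n) % 40) < 20, 1.0, -1.0)  # AV vs. unimodal periods

    # PPI regressor: seed activity modulated by the psychological factor.
    ppi = seed_ts * psych

    # GLM: target = b0 + b1*seed + b2*psych + b3*ppi + error
    X = np.column_stack([np.ones(n), seed_ts, psych, ppi])
    beta, *_ = np.linalg.lstsq(X, target_ts, rcond=None)
    print("interaction (PPI) estimate:", beta[3])

A positive interaction estimate would indicate stronger coupling between the two regions during audiovisual than during unimodal stimulation.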

A final point of interest was whether the putative integration sites exhibit a different sensitivity to stimuli with emotional prosody/facial expression than to stimuli with neutral prosody/facial expression.

In summary, our fMRI study was designed to delineate brain areas specifically involved in the process of audiovisual integration of dynamic emotional signals from voice and face on the basis that they exhibit stronger responses to audiovisual than to either unimodal stimulation.

Further expectations regarding the response pattern of regions subserving audiovisual integration of non-verbal emotional communication were:

  • a) increase of cerebral responses in correlation with gain of classification accuracy under audiovisual stimulation as compared to either unimodal stimulation;

  • b) enhanced effective connectivity with voice-sensitive and face-sensitive cortices during bimodal stimulation as compared to either unimodal stimulation.

To gain additional information about a possible differential sensitivity of the integration sites to the emotionality of the stimulus content, we compared the responses to stimuli with emotional non-verbal content with those to stimuli with neutral non-verbal content.

Based on recent neuroimaging studies on audiovisual integration (Beauchamp et al., 2004a, Beauchamp et al., 2004b, Calvert et al., 2000, Ethofer et al., 2006d, van Atteveldt et al., 2004, Wright et al., 2003) we hypothesised that a region featuring the aforementioned characteristics might be located in the pSTS.

Section snippets

Subjects

Thirty right-handed subjects (15 male, 15 female; mean age 23, S.D. 3 years) participated in the behavioural prestudy. Twenty-four right-handed volunteers (12 male, 12 female; mean age 26, S.D. 5 years) who did not take part in the behavioural experiment were included in the fMRI experiment. All participants were native speakers of German and had no history of neurological or psychiatric illness, substance abuse, or impaired vision or hearing. None of the participants

Behavioural data

Mean unbiased hit rates (Hu) in the classification task (± standard error of the mean, S.E.M.) were 0.35 ± 0.02 (A), 0.58 ± 0.02 (V) and 0.76 ± 0.02 (AV), corresponding to 56% (A), 75% (V) and 86% (AV) correct classifications. A two-way ANOVA with experimental condition (A, V, AV) and stimulus type (emotional vs. neutral) as within-subject factors indicated significant differences regarding experimental condition (F(1.8, 43.2) = 94.0, P < 0.001) and stimulus type (F(1, 24) = 24.0, P < 0.001) while
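For readers unfamiliar with the measure, the sketch below illustrates how unbiased hit rates are typically computed from a confusion matrix, following Wagner's (1993) definition; the matrix values are hypothetical and the exact implementation used in the study is not shown in this excerpt:

    import numpy as np

    # Hypothetical confusion matrix for one participant and one condition:
    # rows = presented emotion category, columns = chosen response category.
    confusion = np.array([
        [18,  2,  0,  0],
        [ 3, 15,  1,  1],
        [ 1,  2, 16,  1],
        [ 0,  1,  3, 16],
    ], dtype=float)

    hits = np.diag(confusion)
    presented = confusion.sum(axis=1)   # how often each category was presented
    chosen = confusion.sum(axis=0)      # how often each response was given

    # Unbiased hit rate: the hit rate multiplied by the proportion of uses of
    # the corresponding response that were correct (corrects for response bias).
    hu = hits**2 / (presented * chosen)
    print(np.round(hu, 2))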

Behavioural integration

At the behavioural level we found a perceptual gain in the emotional classification task when contrasting the bimodal condition with either of the unimodal conditions. This is in keeping with results from a series of experiments conducted by De Gelder and Vroomen (2000) which demonstrated that emotional facial expressions are more easily classified if accompanied by congruent emotional prosody. One of the most noticeable differences between the present study and the one performed by de Gelder

Conclusion

Bilateral pSTG and right thalamus responded more strongly to audiovisual than to visual and auditory stimuli. Of these areas, the left pSTG conformed best to the additional a priori characteristics assumed for an integrative brain region for non-verbal emotional information, namely a positive linear relationship of the BOLD response under audiovisual stimulation with the behaviourally documented perceptual gain during the bimodal condition and an enhanced effective connectivity with auditory as well as

Acknowledgments

This study was supported by the Deutsche Forschungsgemeinschaft (SFB 550/B10).

References

  • K.J. Friston et al. Classical and Bayesian inference in neuroimaging: applications. NeuroImage (2002)
  • D.R. Gitelman et al. Modeling regional and psychophysiologic interactions in fMRI: the importance of hemodynamic deconvolution. NeuroImage (2003)
  • A.R. Hariri et al. Neocortical modulation of the amygdala response to fearful stimuli. Biol. Psychiatry (2003)
  • L. Jancke et al. Phonetic perception and the temporal cortex. NeuroImage (2002)
  • E.R. John. The neurophysics of consciousness. Brain Res. Brain Res. Rev. (2002)
  • S.A. Kotz et al. On the lateralization of emotional prosody: an event-related functional MR investigation. Brain Lang. (2003)
  • K.V. Kriegstein et al. Distinct functional substrates along the right superior temporal sulcus for the processing of voices. NeuroImage (2004)
  • K. Lange et al. Task instructions modulate neural responses to fearful facial expressions. Biol. Psychiatry (2003)
  • J. Miller. Divided attention: evidence for coactivation with redundant signals. Cogn. Psychol. (1982)
  • R.L. Mitchell et al. The neural response to emotional prosody, as revealed by functional magnetic resonance imaging. Neuropsychologia (2003)
  • T. Nichols et al. Valid conjunction inference with the minimum statistic. NeuroImage (2005)
  • R.C. Oldfield. The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia (1971)
  • G. Pourtois et al. Facial expressions modulate the time course of long latency auditory brain potentials. Brain Res. Cogn. Brain Res. (2002)
  • G. Pourtois et al. Perception of facial expressions and voices and of their combination in the human brain. Cortex (2005)
  • M.F. Rushworth et al. Action sets and decisions in the medial prefrontal cortex. Trends Cogn. Sci. (2004)
  • B. Seltzer et al. Afferent cortical connections and architectonics of the superior temporal sulcus and surrounding cortex in the rhesus monkey. Brain Res. (1978)
  • N. Tzourio-Mazoyer et al. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage (2002)
  • N. van Atteveldt et al. Integration of letters and speech sounds in the human brain. Neuron (2004)
  • D. Wildgruber et al. Dynamic brain activation during processing of emotional intonation: influence of acoustic parameters, emotional valence, and sex. NeuroImage (2002)
  • D. Wildgruber et al. Identification of emotional intonation evaluated by fMRI. NeuroImage (2005)
  • D. Wildgruber et al. Cerebral processing of linguistic and emotional prosody: fMRI studies. Prog. Brain Res. (2006)
  • M.S. Beauchamp et al. Unraveling multisensory integration: patchy organization within human STS multisensory cortex. Nat. Neurosci. (2004)
  • P. Belin et al. Voice-selective areas in human auditory cortex. Nature (2000)
  • G. Belizaire et al. Cerebral response to 'voiceness': a functional magnetic resonance imaging study. NeuroReport (2007)
  • R.J. Blair et al. Dissociable neural responses to facial expressions of sadness and anger. Brain (1999)
  • D.L. Collins et al. Automatic 3D intersubject registration of MR volumetric data in standardized Talairach space. J. Comput. Assist. Tomogr. (1994)