NeuroImage

Volume 47, Issue 1, 1 August 2009, Pages 392-402

Auditory–motor integration during fast repetition: The neuronal correlates of shadowing

https://doi.org/10.1016/j.neuroimage.2009.03.061

Abstract

This fMRI study examined which structures of a proposed dorsal stream system are involved in the auditory–motor integration during fast overt repetition. We used a shadowing task which requires immediate repetition of an auditory–verbal input and is supposed to elicit unconscious imitation effects of phonologically irrelevant speech parameters. Subjects' responses were recorded in the scanner. To examine automated auditory–motor mapping of speech gestures of others onto one's own speech production system we contrasted the shadowing of pseudowords produced by multiple speakers (men, women, and children) with the shadowing of pseudowords produced by a single speaker. Furthermore, we asked whether behavioral variables predicted changes in functional activation during shadowing.

Shadowing multiple speakers compared to a single speaker elicited increased bilateral activation predominantly in the superior temporal sulci. These regions may mediate acoustic–phonetic speaker normalization in preparation of a translation of perceptual into motor information. Additional activation in Broca's area and the thalamus may reflect motor effects of the adaptation to multiple speaker models. Item-wise correlational analyses of response latencies with BOLD signal changes indicated that longer latencies were associated with increased activation in the left parietal operculum, suggesting that this area plays a central role in the actual transfer of auditory–verbal information to speech motor representations. A multiple regression of behavioral with imaging data showed activation in a right inferior parietal area near the temporo-parietal boundary which correlated positively with the degree of speech rate imitation and negatively with response latency. This activation may be attributable to attentional and/or paralinguistic processes.

Introduction

The ultimate target of any speech motor activity is to generate sound sequences which are perceived as intelligible and acceptable linguistic utterances. The production of linguistically meaningful sounds involves a number of complex physical transformations elicited by appropriate vocal tract movements. Implicit knowledge about how articulation creates sound is acquired during speech acquisition in childhood (e.g. Kuhl, 2000), or, in adulthood, when a language with an unfamiliar phonological inventory is learned. The learning of these complex movement-to-sound mappings presupposes the existence of a close link between perception and production. The processes of speech perception (e.g. Cutler and Clifton, 1999) and production of speech (e.g. Levelt, 1989) have been investigated extensively. In contrast, relatively few models have specifically addressed the cross-talk of these two functions in auditory–motor mapping.

Among the theoretical accounts, the ‘Motor Theory of Speech Perception’ (Liberman and Mattingly, 1985; based on Liberman et al., 1967) constitutes a first approach towards describing the close connection between speech perception and production. The proponents of this theory postulated that speech is perceived as phonetic gestures, not as acoustic features. The ‘phonetic module’ is a structure assumed to be shared by the speech perception and production systems, suggesting that the two systems depend on a common code and may have co-evolved (Fowler and Galantucci, 2005). A more recent model addressing an auditory–motor link was advanced by Hickok and Poeppel (2000, 2004; see also Warren et al., 2005). Their functional–anatomical model postulates the existence of a dorsal stream system subserving the auditory–motor integration of speech. The dorsal stream is assumed to connect superior temporal areas with a posterior perisylvian area at the temporo-parietal boundary (‘area Spt’, for ‘Sylvian–parietal–temporal’) and with posterior inferior frontal areas, and is thought to be involved in the mapping of acoustic speech signals onto articulatory representations. It may be particularly engaged in tasks requiring explicit manipulations of phonological representations, including the segmentation of linguistic units and the retrieval of sublexical elements.

Clinical evidence consistent with a role of the dorsal stream in auditory-to-motor mapping during speech production comes from patients with conduction aphasia. Following damage that includes the left posterior superior temporal gyrus (STG) and perisylvian temporo-parietal cortex, these patients often show fair comprehension and fluent speech production, while repetition is typically and prominently disrupted (Bartha and Benke, 2003, Caramazza et al., 1981, Damasio and Damasio, 1980). Thus, structures associated with the dorsal stream appear to be central to the capability to repeat, a function which requires auditory-to-motor mapping.

Additional behavioral evidence for a direct link between auditory and motor systems is provided by shadowing experiments which require immediate repetition of an auditory–verbal stimulus. In a shadowing condition, participants tend to automatically and unconsciously imitate phonologically irrelevant acoustic details of the model (Goldinger, 1998). For example, subjects have been found to imitate fundamental frequency when shadowing a short text (Bailly, 2003). In other studies, subjects imitated phonologically irrelevant variations of voice-onset-time (VOT) without an explicit instruction to do so (Shockley et al., 2004, Fowler et al., 2003).

There is also structural evidence supporting a temporal–parietal–frontal link as the central pathway for the auditory–motor integration of speech. Between childhood and adolescence, structural maturation of the white matter pathway connecting temporal and frontal areas (arcuate fasciculus) is more pronounced in the left as compared to the right hemisphere (Paus et al., 1999). This may lead to an increased interaction between temporal and frontal language areas and facilitate a fast bidirectional cross-talk between auditory and motor regions. In adulthood, the dominant left hemisphere shows stronger connectivity than the right hemisphere between posterior superior temporal and ventrolateral prefrontal areas via the arcuate fasciculus (Parker et al., 2005). Catani et al. (2005) used diffusion tensor magnetic resonance imaging to examine the connection between posterior superior temporal regions and the posterior ventrolateral frontal lobe. The authors reported two parallel pathways between these regions: the “classical” arcuate fasciculus, which provides a direct link between temporal and frontal regions, and a second, indirect pathway running from temporal regions to the inferior parietal cortex and from the parietal region to frontal areas. The authors suggested that the function of the direct route is the fast and automated preparation of a motor copy of the perceived speech input, whereas the indirect pathway is used when an intervening stage, such as phonologic transcoding, lies between auditory input and articulatory output. Two recent studies using diffusion MRI tractography demonstrated the relevance of the dorsal language pathway to phonologic processing indirectly and directly, respectively (Glasser and Rilling, 2008, Saur et al., 2008).

In addition, several recent functional neuroimaging studies on speech processing provide evidence that auditory and motor systems are closely intertwined. Passive listening to stories without any overt motor response requirement, in addition to the expected temporal activation, co-activates motor areas which could have generated the speech sounds (Skipper et al., 2005). Similarly, listening to monosyllables activates ventral premotor areas which are also active when producing the same syllables (Wilson et al., 2004). These results are consistent with the assumption that the motor system is automatically engaged when an acoustic speech signal is encoded.

Despite a body of evidence for a strong interconnection between the perception and production of speech, there are still questions concerning the role of a dorsal stream (i.e., a temporal–parietal–frontal circuit) in the auditory–motor integration of speech which remain unresolved. There are now several fMRI studies which involve translating from perception to (covert or overt) production using verbal (Buchsbaum et al., 2001, Buchsbaum et al., 2005) and/or tonal stimuli (Pa and Hickok, 2008, Hickok et al., 2003). The tasks in all of these studies (i.e., perceiving and silently rehearsing the stimuli over periods of several seconds) explicitly draw on verbal working memory resources to some degree. It is largely unknown, however, whether the dorsal stream is also involved in the immediate and direct transformation of acoustic input into motor output. A first indication was provided by an fMRI study which showed activation of the parietal aspect of the dorsal stream (left inferior parietal lobule) during covert and overt repetition of long vs. short words (Shuster and Lemieux, 2005). A participation of dorsal stream structures in the implicit processing of word length is also suggested by another study using a covert picture naming task with words which varied in the number of syllables. This study reported the greatest effect of word length in the left posterior Sylvian cortex (Okada et al., 2003).

Thus, the aim of the present study was to examine whether structures associated with the dorsal stream are engaged in immediate auditory–motor integration. We used a shadowing paradigm in order to induce articulatory reproduction of spoken input which, compared to conventional repetition tasks, is relatively free from working memory requirements. Shadowing requires subjects to immediately repeat a verbal auditory input, without any explicit phonological processing requirement. In a reaction time task, subjects respond faster to vowel strings when the response is made by shadowing than when it is made by pressing a button (Porter and Lubker, 1980), suggesting that shadowing involves a direct transfer from speech input to oral articulators (Goldinger, 1998). Experiments based on the shadowing paradigm have shown that speakers, in their shadowing responses, tend to subconsciously echo phonologically irrelevant acoustic details (speaking rate, pitch, intonation, voice-onset time) of the model. The process of shadowing encompasses perceptual as well as production aspects of speech processing, and involves the automated transfer of speech gestures of others, via the perceived acoustic signal, into one's own speech production system. Specifically, extraction of gestural information from the acoustic signal, in the understanding of various authors, helps to compensate for the acoustic variation between talkers (Studdert-Kennedy and Goldstein, 2003, Liberman and Whalen, 2000, Goldinger, 1998; see also Johnson, 2005). This information is then transferred to the speaker's own speech production system.

As stimulus materials, we used pseudowords to discourage semantic processing of the stimuli. The items were either spoken by several speakers varying in gender and age (hereafter called ‘multiple speaker condition’), or by a single speaker (hereafter called ‘single speaker condition’). Auditory–motor mapping demands should be higher for the shadowing of utterances of multiple speakers than for the shadowing of a single speaker, because of the higher variability across items in acoustic parameters, such as word duration (as a correlate of speaking rate) or fundamental frequency (F0).
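The higher mapping demands in the multiple speaker condition follow from the larger across-item spread in duration and F0. A minimal sketch of how that spread could be quantified (the item measurements below are invented for illustration and are not data from the study):

```python
import numpy as np

def condition_variability(durations_s, f0_hz):
    """Summarize across-item acoustic variability for one condition.

    Returns the sample standard deviation of item duration (a correlate
    of speaking rate) and of mean fundamental frequency (F0).
    """
    return {
        "duration_sd_s": float(np.std(durations_s, ddof=1)),
        "f0_sd_hz": float(np.std(f0_hz, ddof=1)),
    }

# Hypothetical per-item measurements (illustrative values only).
single = condition_variability(
    durations_s=[0.61, 0.63, 0.60, 0.62],  # one adult speaker
    f0_hz=[118, 121, 119, 120],
)
multiple = condition_variability(
    durations_s=[0.52, 0.71, 0.58, 0.66],  # men, women, children
    f0_hz=[105, 210, 265, 140],
)
```

Under any plausible numbers of this kind, both spread measures come out larger for the multiple speaker condition, which is the sense in which its auditory–motor mapping demands are higher.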

To examine neuronal activation we used event-related fMRI. An event-related design is more suitable for experiments which involve movements of the participants than a block design (Haller et al., 2005, Birn et al., 1999, Birn et al., 2004, Preibisch et al., 2003, Dogil et al., 2002). We decided to test overt speech production mainly because we wanted to record the verbal responses. These recordings enabled us to monitor and analyze subjects' productions with respect to accuracy and response latency, and allowed for a qualitative analysis of potential imitative behavior with respect to fundamental frequency and speech rate.

Even though sparse temporal sampling is recommended for overt speech production to avoid movement artifacts (Gracco et al., 2005, Tanaka et al., 2000, Eden et al., 1999, Elliott et al., 1999), we used continuous sampling in order to keep the duration of the scanning sessions within a tolerable time frame. Furthermore, speech-related hemodynamic signal change is delayed and reaches its maximum 5–6 s after event onset, while movement-induced signal change occurs during the act of speaking (Haller et al., 2005, Birn et al., 2004). In order to correct for potential movement artifacts, we used an additional speech movement regressor (see Materials and methods).
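The timing argument above — speech-related hemodynamic signal peaking 5–6 s after event onset while movement-induced signal change coincides with the act of speaking — can be illustrated by constructing the two regressors. This is a generic sketch using an SPM-style double-gamma HRF with conventional default parameters, not the study's actual design matrix:

```python
from math import gamma

import numpy as np

def double_gamma_hrf(t, p1=6.0, p2=16.0, ratio=6.0):
    """Canonical double-gamma hemodynamic response function (SPM-style)."""
    h = (t ** (p1 - 1) * np.exp(-t) / gamma(p1)
         - t ** (p2 - 1) * np.exp(-t) / gamma(p2) / ratio)
    return h / h.max()

dt = 0.1                                  # time step in seconds
t = np.arange(0, 30, dt)
hrf = double_gamma_hrf(t)

# One speech event: the subject speaks for 1 s starting at t = 0.
speaking = ((t >= 0) & (t < 1.0)).astype(float)

# Task regressor: the event convolved with the HRF (delayed peak).
bold = np.convolve(speaking, hrf)[: len(t)]

# Movement regressor: concurrent with the act of speaking itself.
peak_bold_s = t[np.argmax(bold)]          # ~5-6 s after onset
peak_move_s = t[np.argmax(speaking)]      # during the utterance
```

Because the two peaks are several seconds apart, adding the unconvolved speech movement regressor lets the model absorb articulation artifacts without stealing variance from the delayed task-related signal.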

We asked (i) which regions are involved in the entire automated process of auditory–motor mapping of speech gestures produced by others onto one's own speech production system, (ii) whether there are any areas that are associated with the auditory-to-motor transfer process per se, and (iii) whether the behavioral variables (degree of imitation, response latency, and accuracy) affect overall functional activation. We hypothesized that the shadowing paradigm activates a temporo-frontal network for the fast and automated translation of acoustic input into speech motor output. For the comparison of the multiple with the single speaker condition, we expected significantly higher neural activation in temporal and/or frontal regions rather than in temporo-parietal aspects of the dorsal stream. Because they affect executive aspects of speaking, we expected the behavioral variables to be correlated with signal changes in frontal areas, most likely in the left hemisphere.
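The planned item-wise analysis relating behavioral measures to BOLD signal change amounts, per region, to a Pearson correlation across items. A minimal sketch with invented numbers (not data from the study):

```python
import numpy as np

def itemwise_correlation(latencies_ms, bold_change):
    """Pearson correlation of per-item response latency with BOLD change."""
    return float(np.corrcoef(latencies_ms, bold_change)[0, 1])

# Hypothetical per-item values for one region of interest.
r = itemwise_correlation(
    latencies_ms=[420, 510, 380, 600, 450],
    bold_change=[0.2, 0.4, 0.1, 0.6, 0.3],
)
```

A positive r of this kind, computed in a region such as the left parietal operculum, is what the study reports as longer latencies being associated with increased activation.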

Section snippets

Participants

Twenty healthy right-handed subjects participated in the study. All subjects were native speakers of German (mean age 26.1 years, range 20–36, ten females) without any history of serious medical, neurological, or psychiatric illness, or of hearing loss. Hand preference was tested with the 10-item version of the Edinburgh Handedness Inventory (Oldfield, 1971). Subjects had an average laterality quotient of 0.87 (range 0.4–1.0).
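The Edinburgh laterality quotient is conventionally computed as (R − L)/(R + L) over the summed item preferences, giving −1 for fully left-handed and +1 for fully right-handed subjects. A sketch with a hypothetical score sheet:

```python
def laterality_quotient(right, left):
    """Edinburgh Handedness Inventory laterality quotient.

    right/left: summed preference points over the 10 inventory items.
    Ranges from -1 (fully left-handed) to +1 (fully right-handed).
    """
    return (right - left) / (right + left)

# Hypothetical score sheet: 9 right-preferring items, 1 left.
lq = laterality_quotient(right=9, left=1)  # -> 0.8, within 0.4-1.0
```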

This study was performed according to the guidelines of the Declaration of

Behavioral data

The first step in vocal response analysis was to determine the accuracy of responses, using the sound editing software ‘Audacity’. Two raters (CP and a second rater who was not familiar with the goal of the study) decided independently whether responses were qualitatively acceptable (against the backdrop of residual scanner noise or interfering scanner pulses; 93 of 3840 items not analyzable, or 2.4%). Next, analyzable verbal responses were evaluated independently by the two raters for
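The section does not state how agreement between the two raters was quantified; one conventional option for binary accept/reject judgments is Cohen's kappa, sketched here with invented ratings:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters' binary accept/reject judgments."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    pa, pb = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(pa[c] * pb[c] for c in set(pa) | set(pb)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on eight items (1 = acceptable, 0 = not).
k = cohens_kappa([1, 1, 1, 0, 1, 0, 1, 1],
                 [1, 1, 0, 0, 1, 0, 1, 1])
```

Kappa corrects raw percent agreement for the agreement expected by chance, which matters here because most responses were acceptable and raw agreement would be inflated.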

Accuracy

Three of the 23 subjects originally included failed to reach the criterion of at least 75% correct responses and were excluded from further analysis. One subject reached an accuracy rate of 74.4%. Because this was close to the cut-off value, we decided to include the subject. Generally, included participants reached a high level of accuracy in the shadowing task, with a group mean of 86.1% correct responses (range 74.4–93.8%, SD: 5.5%). For the multiple and single speaker condition, the group
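The 75%-correct inclusion criterion, softened for the borderline 74.4% subject, can be expressed as a simple filter. The tolerance band below is our assumption, introduced only to reproduce that borderline inclusion; the study itself describes it as an ad hoc decision:

```python
def apply_accuracy_criterion(accuracies, cutoff=0.75, tolerance=0.01):
    """Keep subjects at or near the accuracy criterion.

    Subjects below cutoff - tolerance are excluded; borderline cases
    (such as the 74.4% subject retained in the study) survive only if
    they fall within the tolerance band.  The tolerance is hypothetical.
    """
    return [a for a in accuracies if a >= cutoff - tolerance]

# Hypothetical accuracy rates: one clear fail, one borderline include.
kept = apply_accuracy_criterion([0.938, 0.744, 0.70, 0.861])
```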

Discussion

Recently a dorsal stream connecting superior temporal with inferior parietal and posterior inferior frontal areas has been suggested to subserve the cortical auditory–motor integration of speech (Hickok and Poeppel, 2000, Hickok and Poeppel, 2004, Hickok et al., 2003). Previous studies have shown that the dorsal stream is involved in perceptual tasks requiring explicit manipulations of phonological representations by translating acoustic speech signals into articulatory representations (Gandour

Acknowledgments

This study was funded by a grant from the German Federal Ministry of Education and Research (BMBF-01GW0572) to W.Z. and A.B. and carried out as part of the collaborative BMBF research project “From dynamic sensorimotor interaction to conceptual representation: Deconstructing apraxia”. We thank Christian Buechel for helpful suggestions on the experimental design, and Gregory S. Hickok and an anonymous reviewer for constructive comments on an earlier version of the manuscript.

References (81)

  • Bailly, G. Close shadowing natural versus synthetic speech. Int. J. Speech Tech. (2003)
  • Belin, P., et al. Adaptation to speaker's voice in right anterior temporal lobe. NeuroReport (2003)
  • Binder, J.R., et al. Function of the left planum temporale in auditory and linguistic processing. Brain (1996)
  • Binder, J.R., et al. Human temporal lobe activation by speech and nonspeech sounds. Cereb. Cortex (2000)
  • Birn, R.M., et al. Event-related fMRI of tasks involving brief motion. Hum. Brain Mapp. (1999)
  • Bonilha, L., et al. Speech apraxia without oral apraxia: can normal brain function explain the physiopathology? NeuroReport (2006)
  • Burton, M.W., et al. The role of segmentation in phonological processing: an fMRI investigation. J. Cogn. Neurosci. (2000)
  • Gracco, V.L., et al. Imaging speech production using fMRI. NeuroImage (2005)
  • Guenther, F.H. Cortical interactions underlying the production of speech sounds. J. Commun. Disord. (2006)
  • Haller, S., et al. Overt sentence production in event-related fMRI. Neuropsychologia (2005)
  • Hickok, G., et al. Towards a functional neuroanatomy of speech perception. Trends Cogn. Sci. (2000)
  • Hickok, G., et al. A functional magnetic resonance imaging study of the role of left posterior superior temporal gyrus in speech production: implications for the explanation of conduction aphasia. Neurosci. Lett. (2000)
  • Hickok, G., et al. Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition (2004)
  • Krekelberg, B., et al. Adaptation: from single cells to BOLD signals. Trends Neurosci. (2006)
  • Liberman, A.M., et al. The motor theory of speech perception revised. Cognition (1985)
  • Liberman, A.M., et al. On the relation of speech to language. Trends Cogn. Sci. (2000)
  • Meyer, M., et al. Distinct fMRI responses to laughter, speech, and sounds along the human peri-sylvian cortex. Brain Res. Cogn. Brain Res. (2005)
  • Morosan, P., et al. Human primary auditory cortex: cytoarchitectonic subdivisions and mapping into a spatial reference system. NeuroImage (2001)
  • Okada, K., et al. Left posterior auditory-related cortices participate both in speech perception and speech production: neural overlap revealed by fMRI. Brain Lang. (2006)
  • Oldfield, R.C. The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia (1971)
  • Pa, J., et al. A parietal–temporal sensory–motor integration area for the human vocal tract: evidence from an fMRI study of skilled musicians. Neuropsychologia (2008)
  • Parker, G.J., et al. Lateralization of ventral and dorsal auditory–language pathways in the human brain. NeuroImage (2005)
  • Preibisch, C., et al. Event-related fMRI for the suppression of speech-associated artifacts in stuttering. NeuroImage (2003)
  • Price, C., et al. Speech-specific auditory processing: where is it? Trends Cogn. Sci. (2005)
  • Pugh, K.R., et al. Auditory selective attention: an fMRI investigation. NeuroImage (1996)
  • Riecker, A., et al. The influence of syllable onset complexity and syllable frequency on speech motor control. Brain Lang. (2008)
  • Scott, S.K., et al. The functional neuroanatomy of prelexical processing in speech perception. Cognition (2004)
  • Shuster, L.I., et al. An fMRI investigation of covertly and overtly produced mono- and multisyllabic words. Brain Lang. (2005)
  • Skipper, J.I., et al. Listening to talking faces: motor cortical activation during speech perception. NeuroImage (2005)
  • Warren, J.E., et al. Sounds do-able: auditory–motor transformations and the posterior temporal plane. Trends Neurosci. (2005)