Abstract
Languages differ depending on the set of basic sounds they use (the inventory of consonants and vowels) and on the way in which these sounds can be combined to make up words and phrases (phonological grammar). Previous research has shown that our inventory of consonants and vowels affects the way in which our brains decode foreign sounds (Goto, 1971; Näätänen et al., 1997; Kuhl, 2000). Here, we show that phonological grammar has an equally potent effect. We build on previous research, which shows that stimuli that are phonologically ungrammatical are assimilated to the closest grammatical form in the language (Dupoux et al., 1999). In a cross-linguistic design using French and Japanese participants and a fast event-related functional magnetic resonance imaging (fMRI) paradigm, we show that phonological grammar involves the left superior temporal and the left anterior supramarginal gyri, two regions previously associated with the processing of human vocal sounds.
- speech perception
- cross-linguistic study
- language-specific illusion
- planum temporale
- fMRI
- phonological grammar
Introduction
Languages differ considerably not only in their inventory of consonants (Cs) and vowels (Vs) but also in the phonological grammar that specifies how these sounds can be combined to form words and utterances (Kaye, 1989). Regarding the inventory of consonants and vowels, research has shown that infants become attuned to the particular sound categories used in their linguistic environment during the first year of life (Werker and Tees, 1984a; Kuhl et al., 1992). In adults, these categories strongly influence the way in which foreign sounds are perceived (Abramson and Lisker, 1970; Goto, 1971; Miyawaki et al., 1975; Trehub, 1976; Werker and Tees, 1984b; Kuhl, 1991), making certain non-native contrasts very difficult to discriminate. For instance, Japanese listeners have persistent trouble discriminating between English /r/ and /l/ (Goto, 1971; Lively et al., 1994). The current interpretation of these effects is that experience with native categories shapes the early acoustic-phonetic speech-decoding stage (Best and Strange, 1992; Best, 1995; Flege, 1995; Kuhl, 2000). Indeed, language experience has been found to modulate the mismatch negativity (MMN) response, which is thought to originate in the auditory cortex (Kraus et al., 1995; Dehaene-Lambertz, 1997; Näätänen et al., 1997; Dehaene-Lambertz and Baillet, 1998; Sharma and Dorman, 2000).
Regarding phonological grammar, its role has been studied primarily by linguists, starting with early informal reports (Polivanov, 1931; Sapir, 1939) and, more recently, with the study of loanword adaptations (Silverman, 1992; Hyman, 1997). Although these studies do not include experimental tests, they suggest a strong effect of phonological grammar on perception. For instance, Japanese is composed primarily of simple syllables of the consonant-vowel type and does not allow complex strings of consonants, whereas English and French do allow such strings. Conversely, Japanese distinguishes short from long vowels (e.g., “tokei” and “tookei” are two distinct words in Japanese), whereas English and French do not. Accordingly, when Japanese speakers borrow foreign words, they insert so-called “epenthetic” vowels (usually /u/) into illegal consonant clusters so that the outcome fits the constraints of their grammar: the word “sphinx” becomes “sufinkusu,” and the word “Christmas” becomes “Kurisumasu.” Conversely, when English or French speakers import Japanese words, they neglect the vowel-length distinction: “Tookyoo” becomes “Tokyo,” and “Kyooto” becomes “Kyoto.” Recent investigations have claimed that such adaptations result from perceptual processes (Takagi and Mann, 1994; Dupoux et al., 1999, 2001; Dehaene-Lambertz et al., 2000). The current hypothesis is that the decoding of continuous speech into consonants and vowels is guided by the phonological grammar of the native language; illegal strings of consonants or vowels are corrected through insertion (Dupoux et al., 1999) or substitution of whole sounds (Massaro and Cohen, 1983; Halle et al., 1998). For instance, Dupoux et al. (1999) found that Japanese listeners have trouble distinguishing “ebza” from “ebuza,” and Dehaene-Lambertz et al. (2000) reported that this contrast does not generate a significant MMN in Japanese listeners, contrary to what is found with French listeners. This suggests that the process that turns ungrammatical sequences of sounds into grammatical ones may take place at an early locus in acoustic-phonetic processing, probably within the auditory cortex.
In the present study, we aimed to identify the brain regions involved in the application of phonological grammar during speech decoding. We built on previous studies to construct a fully crossed design with two populations (Japanese and French) and two contrasts (ebuza-ebuuza and ebza-ebuza). One contrast, ebuza-ebuuza, is licensed by the phonological grammar of Japanese but not by that of French, in which differences in vowel length are not contrastive within words. In French, both ebuza and ebuuza receive the same phonological representation (ebuza), and French participants can discriminate these stimuli only by relying on the acoustic differences between them. The other contrast, ebza-ebuza, has the same characteristics in reverse: it can be distinguished phonologically by the French participants but only acoustically by the Japanese participants. To make acoustic discrimination possible, we presented the contrasts without any phonetic variability; that is, the tokens were always spoken by the same speaker and, when identical, were physically identical. Indeed, previous research has found that phonetic variability considerably increases the error rate for acoustic discriminations (Werker and Tees, 1984b; Dupoux et al., 1997). Our aim here was to obtain good performance on both acoustic and phonological discrimination while showing that these two kinds of discrimination nonetheless involve different brain circuits.
French and Japanese volunteers were scanned while performing an AAX discrimination task. In each trial, three pseudowords were presented; the first two were always identical, and the third was either identical or different. When identical, all three stimuli were acoustically the same. When different, the third item could differ from the other two in vowel duration (e.g., ebuza and ebuuza) or in the presence or absence of the vowel “u” (e.g., ebza and ebuza). As explained above, the change that was phonological for one population was only acoustic for the other (Table 1). Hence, by contrasting the activations elicited by the phonological discriminations with those elicited by the acoustic discriminations, the brain areas involved in phonological processing alone can be pinpointed (Binder, 2000). Such a comparison is free of stimulus artifacts because, across the two populations, the stimuli involved in the phonological and acoustic contrasts are exactly the same.
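To make the logic of this crossed design concrete, the following minimal sketch (in Python) encodes how the very same physical change is labeled phonological or acoustic depending on the listener's native language, as summarized in Table 1. The function and label names are illustrative and are not part of the original study.

```python
# Illustrative sketch of the crossed design: the same physical change counts as
# "phonological" or "acoustic" depending on the listener's native language.

def condition_label(change_type: str, language: str) -> str:
    """Map a change type and a native language onto a design cell.

    change_type: 'vowel_length'   (e.g., ebuza vs ebuuza)
                 'vowel_presence' (e.g., ebza vs ebuza)
    language:    'Japanese' or 'French'
    """
    # Vowel length is contrastive in Japanese but not in French; the presence
    # of a vowel inside a consonant cluster is contrastive in French but is
    # repaired by epenthesis in Japanese.
    phonological_for = {
        "vowel_length": "Japanese",
        "vowel_presence": "French",
    }
    return "phonological" if phonological_for[change_type] == language else "acoustic"


if __name__ == "__main__":
    for change in ("vowel_length", "vowel_presence"):
        for lang in ("Japanese", "French"):
            print(f"{change:14s} heard by {lang:8s} -> {condition_label(change, lang)}")
```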
Two examples of change trials and the condition to which they belong as a function of the native language of the participant
The three auditory stimuli of each example are presented with their spectrogram.
Materials and Methods
Participants. Seven native speakers of Japanese (25-36 years of age; mean, 27 years) and seven native speakers of French (21-30 years of age; mean, 25 years) were recruited in Paris and participated in the study after giving written informed consent. All Japanese participants had started studying English after the age of 12 and French after the age of 18. None of the French participants had studied Japanese. All participants were right-handed according to the Edinburgh inventory. None had a history of neurological or psychiatric disease or of hearing deficit.
Stimuli and task. The stimuli were the 20 triplets of pseudowords described by Dupoux et al. (1999). They followed the pattern VCCV/VCVCV/VCVVCV (e.g., ebza/ebuza/ebuuza). To fit the three stimuli of a trial into the 2 sec silent window (see below), the stimuli were compressed to 60% of their original duration using the PSOLA algorithm in the Praat software (available at http://www.praat.org), so that their final duration was 312 msec on average (±43 msec). A fast event-related fMRI paradigm was used. Each trial lasted 3.3 sec and was composed of a silent window of 2 sec, during which three stimuli were presented through headphones mounted with piezoelectric speakers (stimulus onset asynchrony, 600 msec), followed by 1.3 sec of fMRI acquisition. Thus, the noise of the scanner gradients did not interfere with the presentation of the stimuli. Trials were administered in sessions of 100, with each session lasting 6 min. Trials were of five types: acoustic change, acoustic no-change, phonological change, phonological no-change, and silence. The first four types corresponded to the crossing of two variables: acoustic versus phonological and change versus no-change. The acoustic versus phonological variable was defined as a function of the language of the subject (Table 1). The no-change trials contained the same items as the corresponding change trials, except that the three stimuli were physically identical. Within a session, 20 trials of each type were presented in random order. After a practice session, each participant performed between four and six experimental sessions during fMRI scanning.
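The sparse acquisition scheme described above can be summarized with a short timing sketch. The constants below are taken from the Methods, whereas the schedule function and the session layout are illustrative assumptions rather than the authors' presentation code.

```python
# Sketch of the sparse, fast event-related timing: each 3.3 sec trial holds a
# 2 sec silent window containing three stimuli at a 600 msec SOA, followed by
# 1.3 sec of image acquisition.

TRIAL_DUR = 3.3       # sec, duration of one trial (silence + acquisition)
SILENT_WINDOW = 2.0   # sec, stimuli presented while the gradients are off
SOA = 0.6             # sec, stimulus onset asynchrony within the AAX triplet
ACQ_DUR = 1.3         # sec, one functional volume acquired at the end of a trial


def trial_schedule(trial_index: int):
    """Return the three stimulus onsets and the acquisition onset (in sec)."""
    t0 = trial_index * TRIAL_DUR
    stimulus_onsets = [round(t0 + k * SOA, 3) for k in range(3)]  # A, A, X
    acquisition_onset = round(t0 + SILENT_WINDOW, 3)
    return stimulus_onsets, acquisition_onset


# Example: the first two trials of a session
for i in range(2):
    stims, acq = trial_schedule(i)
    print(f"trial {i}: stimuli at {stims} sec, acquisition {acq}-{round(acq + ACQ_DUR, 3)} sec")
```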
Participants were instructed that they would hear a series of three auditory stimuli, of which the first two would always be identical, and that they had to judge whether the last stimulus was strictly (physically) identical to the first two. They indicated their responses (same or different) by pressing a response button with either their left or right thumb. The response side was switched at the midpoint of the experiment.
Brain imaging. The experiment was performed on a 3-T whole-body system (Bruker, Ettlingen, Germany) equipped with a quadrature birdcage radio frequency coil and a head-gradient coil insert designed for echo-planar imaging. Functional images were obtained with a T2-weighted gradient-echo echo-planar imaging sequence (repetition time, 3.3 sec; echo time, 40 msec; field of view, 240 × 240 mm²; matrix, 64 × 64). Each image, acquired in 1.3 sec, was made up of 22 axial slices, 4 mm thick, covering most of the brain. A high-resolution (1 × 1 × 1.2 mm) anatomical image was also acquired for each participant using a three-dimensional gradient-echo inversion-recovery sequence.
fMRI data analysis was performed using statistical parametric mapping (SPM99; Wellcome Department of Cognitive Neurology, London, UK). Preprocessing involved, in this order, slice-timing correction, movement correction, spatial normalization, and smoothing (kernel, 5 mm). The resulting functional images had cubic voxels of 4 × 4 × 4 mm³. Temporal filters were applied (high-pass cutoff, 80 sec; low-pass Gaussian width, 4 sec). For each participant, a linear model was generated by entering five distinct variables corresponding to the onsets of each of the five trial types (acoustic change, acoustic no-change, phonological change, phonological no-change, and silence). Planned contrast images were obtained, smoothed with a 6 mm Gaussian kernel, and submitted to one-sample t tests (random-effects analysis). Unless specified otherwise, the threshold for significance was set at p < 0.001, voxel-based, uncorrected, and p < 0.05, corrected for spatial extent.
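For readers who wish to reproduce an analysis of this kind with current tools, the rough sketch below re-expresses the subject-level model (five trial-type regressors, smoothing, high-pass filtering) using nilearn instead of SPM99. The functional image path and the events table are hypothetical placeholders, and the parameter choices only approximate the original pipeline.

```python
# Approximate re-sketch of the subject-level model with nilearn (not SPM99 and
# not the authors' code); only the modeling logic mirrors the text.
import pandas as pd
from nilearn.glm.first_level import FirstLevelModel

TR = 3.3  # repetition time in seconds (one volume per trial)

# Hypothetical events: one row per trial onset, labeled with its trial type.
events = pd.DataFrame({
    "onset": [0.0, 3.3, 6.6, 9.9, 13.2],
    "duration": [2.0] * 5,  # the 2 sec stimulation window
    "trial_type": ["acoustic_change", "acoustic_nochange",
                   "phono_change", "phono_nochange", "silence"],
})

model = FirstLevelModel(
    t_r=TR,
    slice_time_ref=0.5,    # stands in for slice-timing correction
    smoothing_fwhm=5,      # 5 mm kernel, as in the text
    high_pass=1.0 / 80.0,  # 80 sec high-pass cutoff
)
model = model.fit("sub01_func.nii.gz", events=events)  # hypothetical file

# Planned contrast analogous to "phonological change vs phonological no-change".
phono_map = model.compute_contrast("phono_change - phono_nochange",
                                   output_type="effect_size")
```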
Results
Analysis of the behavioral results revealed that the participants (combining the Japanese and French groups) were able to detect the change in both conditions (90% correct overall). Reaction times and error rates were submitted to ANOVAs with the factors language (Japanese vs French) and condition (phonological vs acoustic). The phonological condition was easier overall than the acoustic condition (error rates, 5.6 vs 13.6%; F(1,12) = 25.1; p < 0.001; reaction times, 707 vs 732 msec; F(1,12) = 5.9; p < 0.05 for the phonological vs acoustic condition, respectively). There was no main effect of language, but in the analysis of errors only, there was a significant interaction between language and condition (F(1,12) = 14.4; p < 0.01). Post hoc comparisons of the errors showed that the effect of condition was significant in the Japanese group (3.1 vs 17.2%; p < 0.01; phonological vs acoustic condition, respectively) but not in the French group (8 vs 9.9%; p > 0.1). Post hoc comparisons of the reaction times showed that the effect of condition was significant in the French group (690 vs 725 msec; p < 0.05; phonological vs acoustic condition, respectively) but not in the Japanese group (724 vs 739 msec; p > 0.1). A similar asymmetry between speed and accuracy across languages was observed by Dupoux et al. (1999), but in both languages the conclusion is the same: there is an advantage for the phonological condition relative to the acoustic condition. The overall size of the phonological effect is smaller than in the Dupoux et al. study because we purposely used a single speaker's voice, thereby facilitating discrimination on the basis of acoustic differences.
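The 2 × 2 analysis reported above (language as a between-participants factor, condition as a within-participants factor) can be sketched as follows. The data frame is invented purely for illustration, and pingouin is used here as an assumed stand-in for whatever statistics software was originally used.

```python
# Sketch of the 2 (language, between) x 2 (condition, within) ANOVA on error
# rates. The numbers are invented placeholders, not the study's data.
import pandas as pd
import pingouin as pg

df = pd.DataFrame({
    "subject":    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "language":   ["Japanese"] * 6 + ["French"] * 6,
    "condition":  ["phonological", "acoustic"] * 6,
    "error_rate": [3.0, 17.0, 3.2, 17.5, 3.1, 16.9,   # Japanese: large gap
                   8.1, 9.8, 7.9, 10.1, 8.0, 9.9],    # French: small gap
})

# Mixed ANOVA: condition is the repeated factor, language the group factor.
aov = pg.mixed_anova(data=df, dv="error_rate",
                     within="condition", subject="subject",
                     between="language")
print(aov.round(3))
```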
In analyzing the fMRI data, we computed three contrasts, one to identify the circuits involved in the detection of an acoustic change, one for the circuits involved in the detection of a phonological change, and one for the difference between the two circuits.
First, we calculated contrast images between the acoustic change and acoustic no-change conditions. Acoustic change activated a large network comprising the right superior and middle temporal gyri and, bilaterally, the intraparietal sulci, inferior frontal gyri, insula, cingulate cortex, and thalamus (Fig. 1A; Table 2), which is congruent with previous studies (Zatorre et al., 1994; Belin et al., 1998). Second, we calculated the contrast between the phonological change and phonological no-change conditions. Phonological change activated perisylvian areas in the left hemisphere, including the inferior frontal gyrus, superior temporal gyrus (STG), supramarginal and angular gyri, and left intraparietal sulcus (Fig. 1B; Table 3), regions typically associated with discrimination tasks involving speech sound analysis (Démonet et al., 1992; Zatorre et al., 1992; Burton et al., 2000). Significant activation was also observed bilaterally in the cingulate cortex, insula, and precentral gyrus. To a lesser extent, the right inferior frontal gyrus and the right superior and middle temporal gyri were also activated. Regions activated in both conditions were the insula, cingulate cortex, and central sulcus; these regions have been shown to be involved in the motor and cognitive components of an auditory task requiring attention and a motor response (Zatorre et al., 1994). Finally, we calculated the difference between the phonological and acoustic change circuits. Two regions were significantly more activated by the phonological than by the acoustic changes: the left STG and the anterior part of the left supramarginal gyrus (SMG) (Fig. 2). When the threshold was lowered to p < 0.01, a region in the right STG also appeared (x = 52; y = -8; z = 4; z-score, 3.4; cluster size, 71; p = 0.036, corrected). No region was significantly more activated by the acoustic changes than by the phonological changes.
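At the group level, the key comparison amounts to a one-sample t test over per-subject contrast images (random effects). The sketch below uses nilearn as a stand-in for SPM99's random-effects analysis; the file names are hypothetical placeholders for each participant's first-level "phonological change minus acoustic change" contrast image.

```python
# Sketch of the random-effects group comparison (phonological vs acoustic
# change), using nilearn instead of SPM99. File names are hypothetical.
import pandas as pd
from nilearn.glm.second_level import SecondLevelModel

# One first-level contrast image per participant (7 Japanese + 7 French).
contrast_imgs = [f"sub{i:02d}_phono_minus_acoustic.nii.gz" for i in range(1, 15)]

# A design matrix containing only an intercept yields a one-sample t test.
design = pd.DataFrame({"intercept": [1] * len(contrast_imgs)})
group_model = SecondLevelModel().fit(contrast_imgs, design_matrix=design)

# z-map to be thresholded at the voxel-wise p < 0.001 criterion used in the text.
z_map = group_model.compute_contrast("intercept", output_type="z_score")
```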
A, B, Activation rendered on the left hemisphere (left) and right hemisphere (right) of the brain. A, Areas activated by an acoustic change (reaching significance in the comparison of an acoustic change vs no-change conditions). B, Areas activated by a phonological change (reaching significance in the comparison of phonological change vs no-change conditions). Group analysis, voxel-based threshold at p < 0.001, uncorrected; spatial extent threshold, p < 0.05.
Brain areas activated by the detection of an acoustic change
Brain areas activated by the detection of a phonological change
Areas significantly more activated by a phonological change than by an acoustic change. A, Rendering on a three-dimensional left hemisphere template. B, C, Sections centered on the two local maxima in the left STG (B) (coordinates in standard stereotactic space of Talairach and Tournoux: x = -48 mm; y = -24 mm; z = 8 mm; z-score, 3.65; cluster size, 14 voxels) and in the left SMG (C) (coordinates: x = -60 mm; y = -20 mm; z = 28 mm; z-score, 3.92; cluster size, 17 voxels). D, Plots of the size of the effect at the two local maxima, as a function of condition and language (Japanese and French). Bars show the mean percentage signal change (±SE) for each of the following conditions: phonological (phonological change vs phonological no-change) and acoustic (acoustic change vs acoustic no-change).
Discussion
We found a phonological grammar effect in two regions of the left hemisphere: one in the STG and one in the anterior SMG (Fig. 2). These regions were more activated when the stimuli changed phonologically than when they changed only acoustically. These activations were found by comparing the same two sets of stimuli across French and Japanese speakers. In principle, participants could have discriminated all stimuli solely on the basis of acoustic features. However, our results suggest that a phonological representation of the stimuli was activated and informed the discrimination decision. This is confirmed by the behavioral data, which show that performance was slightly but significantly better in the phonological condition than in the acoustic condition.
The peak activation in the left STG lies on the boundary between the Heschl gyrus (HG) and the planum temporale (PT). Atlases (Westbury et al., 1999; Rademacher et al., 2001) indicate an ∼40-60% probability of localization in either structure (note that the activation observed in the right STG when lowering the statistical threshold is probably located in the Heschl gyrus). Because it is generally believed that the PT handles more complex computations than the primary auditory cortex (Griffiths and Warren, 2002), it is reasonable to think that the complex process of phonological decoding takes place in the PT. Yet the current state of knowledge does not allow us to claim categorically that the HG cannot support this process. Jäncke et al. (2002) observed activations that also straddled the PT and HG when comparing unvoiced versus voiced consonants. Numerous studies have revealed increases in PT activation with the spectrotemporal complexity of sounds (for review, see Griffiths and Warren, 2002; Scott and Johnsrude, 2003). The present data indicate that PT activations do not simply depend on the acoustic complexity of speech sounds but also reflect processes tuned to the phonology of the native language. This result adds to the converging evidence for the involvement of the PT in phonological processing. First, lesions in this region can provoke word deafness, the inability to process speech sounds despite hearing acuity within normal limits (Metz-Lutz and Dahl, 1984; Otsuki et al., 1998), and syllable discrimination can be disrupted by electrical interference in the left STG (Boatman et al., 1995). Second, activity in the PT has been observed in lip-reading versus watching meaningless facial movements (Calvert et al., 1997), when profoundly deaf signers process meaningless parts of signs corresponding to syllabic units (Petitto et al., 2000), and when reading (Nakada et al., 2001). Finally, PT activations have also been reported in speech production (Paus et al., 1996). These data are consistent with the notion that the PT subserves the computation of an amodal, abstract, phonological representation.
The second region activated by phonological change was located in the left SMG. Focal lesions in this region are not typically associated with auditory comprehension deficits (Hickok and Poeppel, 2000), and SMG activations are not usually reported when people simply listen to speech (Crinion et al., 2003). Yet activations in the SMG have been observed when subjects had to perform experimental tasks involving phonological short-term memory (Paulesu et al., 1993; Celsis et al., 1999). A correlation and regression analysis has also revealed that patients impaired in syllable discrimination tend to have lesions involving the left SMG (Caplan et al., 1995). Thus, the left SMG activation found in the present study may be linked to the working memory processes, and to the processes translating auditory into articulatory representations, that can be involved in speech discrimination tasks (Hickok and Poeppel, 2000).
Remarkably, we did not find that frontal areas were more involved in the phonological condition than in the acoustic condition, even when the threshold was lowered. This differs from neuroimaging studies that have claimed that phonological processing relies on left inferior frontal regions (Démonet et al., 1992; Zatorre et al., 1992; Hsieh et al., 2001; Gandour et al., 2002). Those studies used tasks that require the explicit extraction of an abstract linguistic feature, such as a phoneme, a tone, or vowel duration. Such explicit tasks are known to depend on literacy and to engage orthographic representations (Morais et al., 1986; Poeppel, 1996). Burton et al. (2000) claimed that frontal activation is found only in tasks that require explicit segmentation into consonants and vowels or that place high demands on working memory. In the present study, the task does not require explicit segmentation of the auditory stimuli.
Previous research on speech processing has focused on the effects of consonant and vowel categories. These categories are acquired early by preverbal infants (Werker and Tees, 1984a; Kuhl et al., 1992; Maye et al., 2002), affect the decoding of speech sounds (Goto, 1971; Werker and Tees, 1984b), and involve areas of the auditory cortex (Näätänen et al., 1997; Dehaene-Lambertz and Baillet, 1998). In contrast, the effect of phonological grammar has been studied less, but it also seems to be acquired early (Jusczyk et al., 1993, 1994) and to shape the decoding of speech sounds (Massaro and Cohen, 1983; Dupoux et al., 1999; Dehaene-Lambertz et al., 2000). At first sight, the regions we found (left STG and SMG) might be the same as those involved in consonant and vowel processing. Additional research is needed to establish whether these regions uniformly represent the different aspects of the sound system, or whether separate subparts of the STG sustain the processing of consonant and vowel categories on the one hand and phonological grammar on the other. This, in turn, could help us tease apart theories of perception that posit two distinct processing stages involving either phoneme identification or grammatical parsing (Church, 1987) from theories in which these two processes are merged into a single step of syllabic template matching (Mehler et al., 1990).
Footnotes
This work was supported by a Cognitique PhD scholarship to C.J., an Action Incitative grant to C.P., a Cognitique grant, and a BioMed grant. We thank G. Dehaene-Lambertz, P. Ciuciu, E. Giacomeni, N. Golestani, S. Franck, F. Hennel, J.-F. Mangin, S. Peperkamp, M. Peschanski, J.-B. Poline, and D. Rivière for help with this work.
Correspondence should be addressed to Charlotte Jacquemot, Laboratoire de Sciences Cognitives et Psycholinguistique, Ecole des Hautes Etudes en Sciences Sociales, 54 bd Raspail, 75006 Paris, France. E-mail: jacquemot@lscp.ehess.fr.
Copyright © 2003 Society for Neuroscience 0270-6474/03/239541-06$15.00/0