Abstract
Bilinguals derive the same semantic concepts from equivalent, but acoustically different, words in their first and second languages. The neural mechanisms underlying the representation of language-independent concepts in the brain remain unclear. Here, we measured fMRI in human bilingual listeners and reveal that response patterns to individual spoken nouns in one language (e.g., “horse” in English) accurately predict the response patterns to equivalent nouns in the other language (e.g., “paard” in Dutch). Stimuli were four monosyllabic words in both languages, all from the category of “animal” nouns. For each word, pronunciations from three different speakers were included, allowing the investigation of speaker-independent representations of individual words. We used multivariate classifiers and a searchlight method to map the informative fMRI response patterns that enable decoding of spoken words within languages (within-language discrimination) and across languages (across-language generalization). Response patterns discriminative of spoken words within languages were distributed in multiple cortical regions, reflecting the complexity of the neural networks recruited during speech and language processing. Response patterns discriminative of spoken words across languages were limited to localized clusters in the left anterior temporal lobe, the left angular gyrus and the posterior bank of the left postcentral gyrus, the right posterior superior temporal sulcus/superior temporal gyrus, the right medial anterior temporal lobe, the right anterior insula, and bilateral occipital cortex. These results corroborate the existence of “hub” regions organizing semantic-conceptual knowledge in abstract form at the fine-grained level of within-semantic-category discriminations.
Introduction
In multilingual environments, verbal communication relies on extraction of unified semantic concepts from speech signals in native as well as non-native languages. How and where our brain performs these language-invariant conceptual transformations remains essentially unknown. Spoken word comprehension proceeds from a sound-based (acoustic–phonological) analysis in posterior superior temporal areas toward meaning-based processing in the semantic–conceptual network of the brain (Binder et al., 2000; Hickok and Poeppel, 2007). This network—including anterior and middle temporal cortex, posterior parietal cortex and inferior frontal gyrus—is involved in the neural representation of semantic memories and connects to modality-specific brain regions subserving perception and action (Martin, 2007; Meyer and Damasio, 2009; Meyer et al., 2010). Evidence from brain lesions in patients with selective deficits of semantic knowledge (semantic dementia) further suggests a crucial role of the anterior temporal lobe (ATL) as a “semantic hub,” in which abstract, amodal representations of words and concepts are constructed (Damasio et al., 1996; Patterson et al., 2007). In neuroimaging studies, it has been more difficult to find reliable ATL activation because of susceptibility artifacts or limited field of view (Visser and Lambon Ralph, 2011). However, recent studies using distortion-corrected fMRI (Binney et al., 2010; Visser et al., 2012), fMRI decoding (Kriegeskorte and Bandettini, 2007), MEG and EEG decoding (Marinkovic et al., 2003; Chan et al., 2011a), or intracranial EEG (Chan et al., 2011b) support the important role of the ATL in semantic processing.
Here, we investigate the semantic representation of spoken words at the fine-grained level of within-category distinctions (animal nouns) and across-language generalization. We do so by combining fMRI, state-of-the-art multivariate pattern analysis (MVPA), and an experimental design that exploits the unique capacities of bilingual listeners. In separate Dutch and English blocks, we asked bilingual participants to listen to individual animal nouns and to detect occasional non-animal target nouns. Speech comprehension in skilled bilinguals provides a means to discover neural representations of concepts while keeping the input modality constant. Furthermore, our focus on within-category distinctions avoids global effects resulting from across-category differences (Caramazza and Mahon, 2006). Following supervised machine learning approaches, we train multivoxel classifiers [linear support vector machine (SVM); Cortes and Vapnik, 1995] to predict the identity of the animal noun a participant was listening to from new (untrained) samples of brain activity. In a first analysis, we aim to identify the network of brain regions involved in within-language word discrimination. To this end, we train classifiers to discriminate brain responses to English (e.g., “horse” vs “duck”) and Dutch (e.g., “paard” vs “eend”) nouns. In a second analysis, we aim to isolate brain regions involved in language-independent decoding of the animal nouns (Fig. 1A). Here we train classifiers to discriminate brain responses to words in one language (e.g., horse vs duck) and test whether this training generalizes and allows discrimination of brain responses to the corresponding nouns in the other language (e.g., the Dutch equivalents, paard vs eend). Importantly, each word was spoken by three female speakers, allowing for speaker-invariant word discrimination. Moreover, all words are acoustically/phonetically distinct both within and across languages.
Materials and Methods
Participants
Ten native Dutch (L1) participants proficient in English (L2) took part in the study (three males; one left-handed; mean ± SD age, 25.4 ± 2.84 years). All participants were undergraduate or postgraduate students of Maastricht University studying and/or working in an English-speaking environment. All participants reported normal hearing abilities and were neurologically healthy. English proficiency was assessed with the LexTALE test, a vocabulary test including 40 frequent English words and 20 nonwords (Lemhöfer and Broersma, 2012). The mean ± SD test score was 85.56 ± 12.27% correct. This score is well above the average score (70.7%) of a large group of Dutch and Korean advanced learners of English performing the same test (Lemhöfer and Broersma, 2012). For comparison, participants also completed the Dutch version of the vocabulary test. The mean ± SD Dutch proficiency score was 91.11 ± 6.57%. The study was approved by the Ethical Committee of the Faculty of Psychology and Neuroscience at Maastricht University.
Stimuli
Stimuli consisted of Dutch and English spoken words representing four different animals (English: Bull, Duck, Horse, and Shark; and the Dutch equivalents: Stier, Eend, Paard, and Haai) and three inanimate object words (English: Bike, Dress, and Suit; and the Dutch equivalents: Fiets, Jurk, and Pak; Fig. 1A). All stimuli were monosyllabic nouns, acoustically/phonetically distinct from each other both within and across languages. Phonetic distance between word pairs was quantified using the Levenshtein distance, which gives the number of phoneme insertions, deletions, and/or substitutions required to change one word into the other, divided by the number of phonemes of the longest word (Levenshtein, 1965). On a scale from 0 (no changes) to 1 (maximum number of changes), the mean ± SD Levenshtein distances corresponded to 0.83 ± 0.15 for Dutch word pairs, 0.93 ± 0.12 for English word pairs, and 1.00 ± 0.00 for English–Dutch word pairs. Furthermore, all animal nouns had an early age of acquisition in Dutch (mean ± SD, 5.28 ± 0.98 years; De Moor et al., 2000) and a medium–high frequency of use expressed on a logarithmic scale in counts per million tokens in Dutch (mean ± SD, 1.29 ± 0.71) and English (mean ± SD, 1.50 ± 0.42; Celex database; Baayen et al., 1995). To add acoustic variability and allow for speaker-invariant MVPA, the words were spoken by three female native Dutch speakers with good English pronunciation. Stimuli were recorded in a soundproof chamber at a sampling rate of 44.1 kHz (16-bit resolution). Postprocessing of the recorded stimuli was performed in PRAAT software (Boersma and Weenink, 2001) and included bandpass filtering (80–10,500 Hz), manual removal of acoustic transients (clicks), length equalization, removal of sharp onsets and offsets using 30 ms ramp envelopes, and amplitude equalization (average RMS). Stimulus length was equated to 600 ms (original range, 560–640 ms) using pitch-synchronous overlap and add (75–400 Hz as extrema of the F0 contour). We carefully checked our stimuli for possible alterations in F0 after length equalization and did not find any detectable changes. We ensured that the stimuli were unambiguously comprehended by the participants during the stimulus familiarization phase before the experiment.
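For illustration, the normalized Levenshtein distance described above can be computed as in the following minimal Python sketch; the phoneme transcriptions in the example are illustrative placeholders, not the exact transcriptions used in the study.

```python
# Minimal sketch of the normalized Levenshtein distance between two phoneme
# sequences. The example transcriptions are illustrative placeholders only.

def levenshtein(a, b):
    """Number of insertions, deletions, and substitutions turning a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the length of the longer word (range 0-1)."""
    return levenshtein(a, b) / max(len(a), len(b))

# Example with made-up phoneme codes for "horse" and "paard"
print(normalized_levenshtein(["h", "O:", "r", "s"], ["p", "a:", "r", "t"]))  # 0.75
```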
Experimental procedures
The experimental session was organized in four runs, each run containing two blocks (one Dutch and one English). Each block included 27 nouns: 24 animal nouns and three (11.1%) non-animal nouns. Runs 1 and 3 started with an English block, followed by a Dutch block; runs 2 and 4 started with a Dutch block, followed by an English block (Fig. 1B). Participants were instructed to actively listen to the stimuli and to press a button (with the left index finger) whenever they heard a non-animal word. The goal of the task was to help maintain a constant attention level throughout the experiment and to promote speech comprehension at every word presentation. All participants paid attention to the words, as indicated by a mean ± SD detection accuracy of 97.5 ± 5.68%. Non-animal trials were excluded from further analysis. The 24 relevant animal nouns in each block corresponded to six repetitions of each of the four animal nouns. Because nouns were pronounced by three different speakers, each physical stimulus was repeated twice in each block. Stimulus presentation was pseudorandomized within each block, avoiding consecutive presentations of the same words or sequences of words. Throughout the experiment, each animal noun was presented 24 times per language.
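One way to realize this pseudorandomization is sketched below (a minimal Python sketch using rejection sampling; word and speaker labels are illustrative, and the additional constraint on repeated word sequences is not implemented here).

```python
# Sketch of building one 27-trial block: 4 animal nouns x 3 speakers x 2
# repetitions plus 3 non-animal targets, with no immediate repetition of the
# same word. Rejection sampling; repeated word *sequences* are not checked here.
import random

def make_block(animals, targets, speakers, reps=2, max_tries=1000):
    trials = [(word, spk) for word in animals for spk in speakers] * reps
    trials += [(word, random.choice(speakers)) for word in targets]
    for _ in range(max_tries):
        random.shuffle(trials)
        if all(trials[i][0] != trials[i + 1][0] for i in range(len(trials) - 1)):
            return trials
    raise RuntimeError("no valid trial order found")

block = make_block(["bull", "duck", "horse", "shark"],
                   ["bike", "dress", "suit"],
                   ["speaker1", "speaker2", "speaker3"])
print(len(block))  # 27 trials: 24 animal + 3 non-animal
```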
Stimuli were presented binaurally at a comfortable intensity level using MR-compatible in-ear headphones. Stimulus presentation occurred in a silent interval of 700 ms between two volume acquisitions (Fig. 1C,D). Following a slow event-related design, the average intertrial interval between two stimuli was 13.5 s (jittered between 10.8 and 16.2 s, corresponding to 4, 5, or 6 fMRI scanning repetitions). Each block had a duration of 6 min and each run took 12 min, resulting in a total functional scanning time of 48 min. A gray fixation cross against a black background was used to keep the visual stimulation constant during the entire duration of a block. Block and run transitions were marked with written instructions.
fMRI acquisition
Functional and anatomical image acquisition was performed on a Siemens Allegra 3 tesla scanner (head setup) at the Maastricht Brain Imaging Center. Functional runs were collected per subject with a spatial resolution of 3 mm isotropic using a standard echo-planar imaging sequence [repetition time (TR), 2.7 s; acquisition time (TA), 2.0 s; field of view, 192 × 192 mm; matrix size, 64 × 64; echo time (TE), 30 ms]. Each volume consisted of 33 slices (10% voxel size gap between slices) that covered most of the brain of the participants (in five subjects, the superior posterior parietal tip of the cortex had to be left out). The difference between the TA and TR introduced a silent gap used for the presentation of the auditory stimuli. High-resolution (voxel size, 1 × 1 × 1 mm3) anatomical images covering the whole brain were acquired after the second functional run using a T1-weighted three-dimensional ADNI (Alzheimer's Disease Neuroimaging Initiative) sequence (TR, 2050 ms; TE, 2.6 ms; 192 sagittal slices).
fMRI data preprocessing
fMRI data were preprocessed and analyzed using Brain Voyager QX version 2.4 (Brain Innovation, Maastricht, The Netherlands) and custom-made MATLAB routines. Functional data were 3D motion corrected (trilinear interpolation), corrected for slice scan time differences, and temporally filtered by removing frequency components of five or fewer cycles per time course (Goebel et al., 2006). Anatomical data were corrected for intensity inhomogeneity and transformed into Talairach space (Talairach and Tournoux, 1988). Functional data were then aligned with the anatomical data and transformed into the same space to create 4D volume time courses. Individual cortical surfaces were reconstructed from gray–white matter segmentations of the anatomical data and aligned using a moving target-group average approach based on curvature information (cortex-based alignment) to obtain an anatomically aligned group-averaged 3D surface representation (Goebel et al., 2006; Frost and Goebel, 2012).
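The temporal filtering step can be illustrated with a simple FFT-based sketch that removes components with five or fewer cycles per time course; the analysis software's actual implementation is not reproduced here, only the stated criterion.

```python
# Sketch of temporal high-pass filtering: zero out Fourier components with five
# or fewer cycles over the run (the DC component is kept, preserving the mean).
# Illustrative only; the exact filter used by the analysis software may differ.
import numpy as np

def highpass_cycles(ts, max_cycles=5):
    """ts: 1D voxel time course; removes components with 1..max_cycles cycles."""
    spec = np.fft.rfft(ts)
    spec[1:max_cycles + 1] = 0.0      # drop slow drift components
    return np.fft.irfft(spec, n=len(ts))

# Toy example: a drifting time course roughly the length of one 12-min run (TR = 2.7 s)
rng = np.random.default_rng(0)
ts = 100 + np.linspace(0, 5, 267) + rng.normal(0, 1, 267)
filtered = highpass_cycles(ts)
```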
fMRI data analysis: univariate statistics
Univariate effects were analyzed using a random-effects general linear model (GLM). A single predictor per stimulus condition was convolved with a double-gamma hemodynamic response function. To identify cortical regions generally involved in the processing of spoken words, we constructed functional contrast maps (t statistics) by (1) comparing activation to all animal words versus baseline across subjects, (2) combining all possible binary contrasts within nouns of the same language, and (3) grouping all equivalent nouns into single concepts and contrasting all possible binary combinations of concepts. All GLM analyses were performed at the cortical surface after macro-anatomical variability had been minimized by applying cortex-based alignment (Goebel et al., 2006; Frost and Goebel, 2012). Each univariate statistical map was corrected for multiple comparisons by applying a surface-based cluster-size threshold with a false-positive rate (α) of 1% (Forman et al., 1995; Goebel et al., 2006; Hagler et al., 2006). The cluster-size thresholds were obtained after setting an initial vertex-level threshold (p < 0.01, uncorrected) and submitting the maps to a whole-brain correction criterion based on the estimate of the spatial smoothness of the map (Goebel et al., 2006).
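As an illustration of this step, the sketch below builds condition predictors by convolving stimulus onsets with a double-gamma HRF and fits them by ordinary least squares; the HRF parameters are common textbook defaults and the onsets are toy values, not the study's actual design files.

```python
# Sketch of the univariate GLM: one predictor per condition, convolved with a
# double-gamma HRF and fit by ordinary least squares. HRF parameters are common
# defaults (not necessarily those of the analysis software); onsets are toy values.
import numpy as np
from scipy.stats import gamma

TR = 2.7  # s

def double_gamma_hrf(tr, duration=30.0):
    t = np.arange(0.0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0   # peak minus undershoot
    return hrf / hrf.sum()

def make_predictor(onsets_s, n_vols, tr):
    stick = np.zeros(n_vols)
    stick[np.rint(np.asarray(onsets_s) / tr).astype(int)] = 1.0
    return np.convolve(stick, double_gamma_hrf(tr))[:n_vols]

n_vols = 267
X = np.column_stack([make_predictor([13.5, 40.5, 67.5], n_vols, TR),   # condition A
                     make_predictor([27.0, 54.0, 81.0], n_vols, TR),   # condition B
                     np.ones(n_vols)])                                 # constant
y = np.random.default_rng(1).normal(size=n_vols)                       # toy voxel time course
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
contrast_A_minus_B = betas[0] - betas[1]
```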
fMRI data analysis: MVPA
To investigate speech content information in the fMRI responses, we used a supervised machine learning algorithm (linear SVM; Cortes and Vapnik, 1995) combined with single-trial multivoxel classification (Cox and Savoy, 2003; Formisano et al., 2008). Classifications were performed to evaluate whether patterns of voxels conveyed information on the identity of spoken words (within-language word discrimination) and their language-invariant representations (across-language word generalization). Within-language word discrimination entailed training a classifier to discriminate between two words of the same language (e.g., horse vs duck) and testing on the same words spoken by a speaker not included in the training phase. Across-language word generalization was performed by training a classifier to discriminate between two words within one language (e.g., horse vs duck) and testing on the translation-equivalent words in the other language (e.g., paard vs eend), thus relying on language-independent information about spoken words representing equivalent concepts in Dutch and English (Fig. 1A). Additional methodological steps encompassing the construction of the fMRI feature space (feature extraction and feature selection) and the computational strategy to validate (cross-validation) and display (discriminant maps) the classification results are described below in detail.
Classification procedure
Feature extraction.
Initially, the preprocessed fMRI time series were split into single trials, time locked to the presentation of each stimulus. Then, for each trial, the time series of the voxels were normalized using the percentage of BOLD signal change with reference to a baseline extracted from the averaged interval between 2.0 s before stimulus onset and 0.7 s after stimulus onset (2 TRs). Finally, one feature was calculated per voxel from the averaged interval between 3.4 and 6.1 s after stimulus onset (2 TRs). Considering the timing of stimulus presentation, this ensured a strict separation between the feature estimates for two subsequent trials.
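A minimal sketch of this feature extraction step is given below, assuming the single-trial time courses have already been resampled to volume times relative to stimulus onset; the window boundaries are those stated above.

```python
# Sketch of single-trial feature extraction: percent BOLD signal change relative
# to a baseline window (-2.0 to 0.7 s around onset), then one feature per voxel
# averaged over the 3.4-6.1 s window. Assumes trial-locked volume times are known.
import numpy as np

def trial_features(trial_ts, vol_times, baseline=(-2.0, 0.7), window=(3.4, 6.1)):
    """trial_ts: (n_vols, n_voxels) single-trial data;
    vol_times: volume times (s) relative to stimulus onset."""
    vol_times = np.asarray(vol_times)
    base = trial_ts[(vol_times >= baseline[0]) & (vol_times <= baseline[1])].mean(axis=0)
    psc = 100.0 * (trial_ts - base) / base            # percent signal change
    feat = psc[(vol_times >= window[0]) & (vol_times <= window[1])].mean(axis=0)
    return feat                                       # one value per voxel
```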
Feature selection.
To avoid degraded performance of the classification algorithm attributable to the high dimensionality of the feature space (model overfitting; for a description, see Norman et al., 2006), a reduction of the number of features (voxels) is usually performed. In the present work, feature selection was accomplished using the “searchlight method” (Kriegeskorte et al., 2006), which restricts the features to a spherical selection of voxels and is repeated for all possible gray-matter locations. Here, the searchlight used a linear SVM classifier and a sphere radius of 2.5 voxels.
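The sphere selection underlying the searchlight can be sketched as follows (a didactic Python loop; optimized searchlight implementations exist, e.g., in the nilearn package, and the helper names here are illustrative).

```python
# Sketch of the searchlight voxel selection: for a given center voxel, collect
# the gray-matter voxels within a 2.5-voxel radius to form one feature set.
# Didactic version; real searchlight code iterates this over all centers.
import numpy as np
from itertools import product

def sphere_offsets(radius=2.5):
    r = int(np.floor(radius))
    return np.array([(x, y, z)
                     for x, y, z in product(range(-r, r + 1), repeat=3)
                     if x * x + y * y + z * z <= radius ** 2])

def searchlight_indices(center, gm_mask, radius=2.5):
    """Flat indices of gray-matter voxels inside the sphere around `center`."""
    coords = np.asarray(center) + sphere_offsets(radius)
    in_vol = np.all((coords >= 0) & (coords < gm_mask.shape), axis=1)
    coords = coords[in_vol]
    in_gm = gm_mask[coords[:, 0], coords[:, 1], coords[:, 2]]
    return np.ravel_multi_index(coords[in_gm].T, gm_mask.shape)
```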
Cross-validation.
Cross-validation for both types of analysis exploited the divisions of stimuli naturally resulting from our design. Cross-validation for word discrimination was performed by assessing the accuracy of across-speaker generalization. At each cross-validation fold, the stimulus examples recorded from one speaker were left out for testing, whereas the examples recorded from the other two speakers were used for training. Hence, during word discrimination, classifiers were assessed on their ability to discriminate the words participants listened to independently of the speaker pronouncing them. Cross-validation for language-invariant semantic representations was performed by assessing the accuracy of across-language generalization. This resulted in two folds: (1) training on Dutch words and testing on their English translation equivalents; and (2) training on English words and testing on their Dutch translation equivalents. Cross-validation approaches based on the generalization of other stimulus dimensions have been used successfully before (Formisano et al., 2008; Buchweitz et al., 2012) and enable detecting activation patterns that convey information on a stimulus dimension, such as word identity or semantic information, regardless of variations in other stimulus dimensions, such as speaker or input language. Hence, the within-language word discrimination and the across-language generalization analyses relied on a different number of training and testing trials (within-language: 18 trials for training and six trials for testing per class; across-language: 24 trials for training and 24 trials for testing per class).
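The two cross-validation schemes can be illustrated with the following sketch for one searchlight selection, using a linear SVM from scikit-learn (the study used custom routines; variable names and the SVM cost parameter are illustrative).

```python
# Sketch of the two cross-validation schemes for one searchlight feature set.
# X: (n_trials, n_voxels) features; y: class labels coding the concept (so that
# "horse" and "paard" share a label); speakers/language: per-trial annotations.
# Illustrative only; parameter settings do not necessarily match the study.
import numpy as np
from sklearn.svm import SVC

def _fold_accuracy(X_tr, y_tr, X_te, y_te):
    return SVC(kernel="linear", C=1.0).fit(X_tr, y_tr).score(X_te, y_te)

def within_language_accuracy(X, y, speakers):
    """Leave-one-speaker-out: train on two speakers, test on the held-out one."""
    accs = [_fold_accuracy(X[speakers != s], y[speakers != s],
                           X[speakers == s], y[speakers == s])
            for s in np.unique(speakers)]
    return np.mean(accs)

def across_language_accuracy(X, y, language):
    """Two folds: train on one language, test on the translation equivalents."""
    accs = [_fold_accuracy(X[language == tr], y[language == tr],
                           X[language == te], y[language == te])
            for tr, te in (("nl", "en"), ("en", "nl"))]
    return np.mean(accs)
```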
Discriminative maps
At the end of the searchlight computation, accuracy maps of within-language word discrimination and across-language word generalization were constructed. Accuracy maps were averaged within each subject across binary comparisons and cross-validation folds. Thereafter, the individual averaged accuracy maps were projected onto the group-averaged cortical surface and anatomically aligned using cortex-based alignment (Goebel et al., 2006). To assess the statistical significance of the individual accuracy maps (chance level is 50%) across subjects, a two-tailed nonparametric Wilcoxon test was used. In the within-language discrimination analysis, one map was produced per language, and the two maps were subsequently combined into a single map by means of a conjunction analysis (Nichols et al., 2005). The within-language word discrimination map thus depicts regions with consistent sensitivity in English and Dutch. For the across-language word generalization, one map was produced from all possible binary language generalizations. The resulting statistical maps were corrected for multiple comparisons by applying a cluster-size threshold with a false-positive rate (α) of 5% (within-language discrimination) or 1% (across-language generalization) after setting an initial vertex-level threshold (within language: p < 0.05, uncorrected; language generalization: p < 0.01, uncorrected) and submitting the maps to a whole-brain correction criterion based on the estimate of the spatial smoothness of the map (Forman et al., 1995; Goebel et al., 2006; Hagler et al., 2006).
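The group-level testing of the accuracy maps can be sketched as follows: a two-tailed Wilcoxon signed-rank test of the subjects' accuracies against the 50% chance level at each vertex, followed by a minimum-statistic conjunction of the English and Dutch maps; the cluster-size correction itself is not reproduced in this sketch.

```python
# Sketch of the vertex-wise group statistics: two-tailed Wilcoxon signed-rank
# test of per-subject accuracies against 50% chance, plus a conjunction of the
# English and Dutch within-language maps. Cluster-size correction not shown.
import numpy as np
from scipy.stats import wilcoxon

def group_pmap(acc, chance=0.5):
    """acc: (n_subjects, n_vertices) averaged accuracy maps; returns p per vertex."""
    p = np.ones(acc.shape[1])
    for v in range(acc.shape[1]):
        diffs = acc[:, v] - chance
        if np.any(diffs != 0):
            p[v] = wilcoxon(diffs, alternative="two-sided").pvalue
    return p

def conjunction(p_map_english, p_map_dutch):
    """Minimum-statistic conjunction: a vertex is significant only if it is
    significant in both maps, i.e., keep the larger (less significant) p-value."""
    return np.maximum(p_map_english, p_map_dutch)
```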
Results
Univariate analysis
Figure 2 and Table 1 illustrate the fMRI responses elicited by all animal words across subjects, as computed by a functional contrast map (t statistics) comparing all animal words versus baseline. The words elicited extensive bilateral activation in the superior temporal cortices as well as in the right inferior frontal lobe and the bilateral anterior insulas. To assess univariate differences between the responses evoked by individual words, we conducted all possible word-to-word contrasts within the same language (e.g., horse vs duck), as well as all possible concept-to-concept contrasts (e.g., horse + paard vs duck + eend). None of the possible contrasts yielded significant differences within or across participants.
Multivariate analysis
Figure 3 depicts the statistical maps of searchlight selections for which the word discrimination and the language generalization analyses yielded accuracies significantly above chance level (50%). Significance was tested with a Wilcoxon's test across participants and corrected for multiple comparisons using a surface-based cluster-size threshold.
Word discrimination identified a broad set of cortical areas in both hemispheres (Fig. 3A; Table 2). Discriminative voxels surviving cluster-size multiple comparisons correction clustered in the dorsal superior temporal gyrus (STG), posterior STG/supramarginal gyrus, angular gyrus (AG), and middle temporal lobe and ATL in the right hemisphere, as well as bilaterally in the inferior and superior frontal cortex, superior parietal cortex, occipital cortex, and the anterior insulas.
Figure 3B and Table 3 show the statistical maps with consistent across-language generalization performance. Here, a region within the left ATL, in the anterior portion of the superior temporal sulcus (STS), was highlighted together with other clusters in the left and right hemispheres. Specifically, the left inferior parietal lobe/AG, the posterior bank of the left postcentral gyrus, the right posterior STS/STG, the anterior portion of the right insula, the medial part of the right ATL, and areas within the occipital cortices bilaterally showed significant generalization. In the across-language generalization analysis, classifications relied on language-invariant properties of the words, consistently across subjects. There was no reportable difference in the directionality of the generalization: training on Dutch and testing on English, and vice versa, yielded similar results.
Discussion
By combining fMRI MVPA and an experimental design that exploits the unique capacities of bilingual listeners, we identified a distributed brain network enabling within-category discrimination of individual spoken words. Most crucially, we were able to isolate focal regions including the left ATL that play a relevant role in the generalization of the meaning of equivalent words across languages. These findings provide direct evidence that, in bilingual listeners, semantic-conceptual knowledge is organized in language-independent form in focal regions of the cortex.
The univariate analysis showed distributed brain activation associated with the processing of spoken words (Binder et al., 2000; Pulvermüller et al., 2009; Price, 2010) but did not allow the discrimination of individual words or language-independent concepts. In a first multivariate analysis, we identified the network of brain regions informative of within-language word discrimination. Each area identified in the word discrimination map (Fig. 3A; Table 2) was able to consistently predict single words in both languages based on a local interaction of neighboring voxels. Because the discrimination was generalized across speakers, the maps reflect neural representations of the words that are robust to variations of low-level auditory features and speaker-related phonetic–phonological features. Within-language word discrimination relied on acoustic–phonetic and semantic–conceptual differences between the nouns, as well as possible other differences reflecting their individual properties. Accordingly, our results yielded a distributed cortical network associated with multiple aspects of speech processing (Hickok and Poeppel, 2007), including regions associated with spectrotemporal analysis in the dorsal STG, a sensorimotor interface at the posterior STG/inferior parietal lobe, and an articulatory network involving (pre-)motor areas and the inferior frontal gyrus. Moreover, the involvement of sensorimotor areas during speech perception is a highly debated topic (Galantucci et al., 2006; Pulvermüller et al., 2006; Hickok et al., 2011) to which multivoxel speech decoding may provide important contributions, for example, by describing the neural code underlying phonological and articulatory features of speech.
The second multivariate analysis—across-language generalization—relied uniquely on language-invariant semantic–conceptual properties of the nouns. This analysis revealed neural representations of the animal words, independent of the language in which they were presented, in the left ATL (STS) but also in other clusters, namely the left AG, the posterior bank of the left postcentral gyrus, the right posterior STS/STG, the right anterior insula, the medial part of the right ATL, and areas within visual cortex (Fig. 3B; Table 3). This finding is consistent with the observation that the failure to recognize words in bilingual aphasics typically occurs simultaneously in the first and second languages (anomic aphasia: Kambanaros and van Steenbrugge, 2006; Alzheimer's disease: Hernàndez et al., 2007; primary progressive aphasia: Hernàndez et al., 2008). Furthermore, our observation of language-independent representations of spoken words in the left ATL converges with previous findings showing fMR adaptation effects in this region for across-language semantic priming of visual words in bilinguals (Crinion et al., 2006). It corroborates neuropsychological evidence from semantic dementia (Damasio et al., 1996) and indicates a role of the left ATL as a central hub in abstract semantic processing (Patterson et al., 2007). However, the location of our left ATL cluster appears more dorsal, comprising the anterior STS, compared with the more ventral regions in the anterior medial temporal gyrus/inferior temporal gyrus commonly reported in studies using visual stimulation (Binder et al., 2009). Such a dorsal shift for semantic–conceptual processing of auditory words within the ATL could not be established from lesion studies, given the typically large extent of the lesions, but has been reported in recent fMRI studies (Scott et al., 2000; Visser and Lambon Ralph, 2011).
We also observed significant language-independent decoding in other clusters of the left and right hemispheres, namely, the left AG and the posterior bank of the left postcentral gyrus, the right posterior STS/STG, the right anterior insula, medial portions of the right ATL, and regions of the occipital cortices bilaterally. These regions have been associated with cross-modal (Shinkareva et al., 2011; Visser et al., 2012) and bilingual (Buchweitz et al., 2012) processing of nouns and with the interface between semantic representations and articulatory networks during speech processing (Hickok and Poeppel, 2007; Shuster, 2009). Furthermore, the left AG has been associated with both semantic representation (Binder et al., 2009) and semantic control (Visser and Lambon Ralph, 2011). Our results suggest a similar role of these regions when bilinguals process within-category semantic distinctions of spoken words. At a more general level, our results indicate the involvement of the right hemisphere in speech processing and lexical–semantic access, consistent with previous findings in patients with left hemispheric damage (Stefanatos et al., 2005), subjects undergoing selective hemispheric anesthesia (Wada procedure; McGlone et al., 1984), and brain imaging studies (Hickok and Poeppel, 2007; Binder et al., 2009).
Conclusions
Brain-based decoding of individual spoken words at the fine-grained level of within-semantic-category distinctions is possible within and across the first and second languages of bilingual adults. In particular, our observation of language-invariant representations of spoken words in the left ATL concurs with the role attributed to this region as a central semantic hub that emerges from the integration of distributed sensory and property-based representations (Patterson et al., 2007).
Our results show the benefits of MVPA based on the generalization of pattern information across specific stimulus dimensions. This approach enabled examining the representation of spoken words independently of the speaker and the representation of semantic–conceptual information independently of the input language. Additional studies using similar decoding techniques for bilingual perception of a larger variety of individual words, or of words combined in sentences, promise to extend these findings to a brain-based account of unified semantic representations (Hagoort, 2005) as they are constructed in real-life multilingual environments.
Footnotes
This study was supported by the European Union Marie Curie Initial Training Network Grant PITN-GA-2009-238593. Additionally, the authors thank Marieke Mur, Job van den Hurk, Francesco Gentile, and Inge Timmers for the valuable discussions throughout the design and implementation of the study and Sanne Kikkert for her assistance during the stimuli recording.
The authors declare no conflict of interest.
Correspondence should be addressed to João Correia, Oxfordlaan 55, 2nd floor, Room 014, 6229 EV Maastricht, The Netherlands. E-mail: joao.correia@maastrichtuniversity.nl