Abstract
Word segmentation, detecting word boundaries in continuous speech, is a critical aspect of language learning. Previous research in infants and adults demonstrated that a stream of speech can be readily segmented based solely on the statistical and speech cues afforded by the input. Using functional magnetic resonance imaging (fMRI), the neural substrate of word segmentation was examined on-line as participants listened to three streams of concatenated syllables, containing either statistical regularities alone, statistical regularities and speech cues, or no cues. Despite the participants’ inability to explicitly detect differences between the speech streams, neural activity differed significantly across conditions, with left-lateralized signal increases in temporal cortices observed only when participants listened to streams containing statistical regularities, particularly the stream containing speech cues. In a second fMRI study, designed to verify that word segmentation had implicitly taken place, participants listened to trisyllabic combinations that occurred with different frequencies in the streams of speech they just heard (“words,” 45 times; “partwords,” 15 times; “nonwords,” once). Reliably greater activity in left inferior and middle frontal gyri was observed when comparing words with partwords and, to a lesser extent, when comparing partwords with nonwords. Activity in these regions, taken to index the implicit detection of word boundaries, was positively correlated with participants’ rapid auditory processing skills. These findings provide a neural signature of on-line word segmentation in the mature brain and an initial model with which to study developmental changes in the neural architecture involved in processing speech cues during language learning.
- fMRI
- language
- speech perception
- word segmentation
- statistical learning
- auditory cortex
- inferior frontal gyrus
Introduction
Researchers have long been fascinated by how infants and young children can rapidly “crack” the language code with far greater ease than adults. Although ample behavioral evidence suggests a marked decrease in the ability to acquire a new language with native-like proficiency after puberty (Johnson and Newport, 1989; Weber-Fox and Neville, 2001), little is known about the neural correlates underlying this phenomenon. A continuous process of neural commitment to the statistical and prosodic patterns of the language experienced early in life may account for the corresponding decrease in the ability to acquire another language later on (Kuhl, 2004). However, the mechanisms by which the brain encodes these statistical and prosodic patterns during the initial stages of language learning have not been examined in adults until now.
Previous functional neuroimaging studies have used several approaches to investigate different aspects of language learning. Some studies examined changes in the pattern of neural activity after a period of intensive training on a novel linguistic task (Friederici et al., 2002; Callan et al., 2003, 2005; Golestani and Zatorre, 2004). Although these studies can address issues of neural plasticity (because they focus on the neural reorganization that results from learning), they do not provide information about the neural systems that subserve the learning process itself. Other investigations explored the neural basis of language learning by examining changes in brain activity that occur while participants learn an artificial grammar in the scanner (Opitz and Friederici, 2003, 2004; Thiel et al., 2003; Hashimoto and Sakai, 2004). Such studies, however, involved language processing in the visual domain and, thus, cannot shed light on the neural mechanisms underlying the learning of the acoustic properties of a language. To investigate such processes, functional magnetic resonance imaging (fMRI) was used to evaluate learning-associated changes in neural activity during initial exposure to artificial languages presented in the auditory modality, when a listener faces the daunting task of identifying word boundaries in a continuous stream of speech.
Word segmentation is a fundamental aspect of language learning because word boundaries must be identified before any additional linguistic analysis can be performed (Jusczyk, 2002; Saffran and Wilson, 2003). Although speech does not contain acoustic cues that invariably specify word boundaries, research has shown that infants as young as 7.5 months of age can use statistical regularities such as the greater co-occurrence of syllables within words than between words to successfully parse a continuous stream of speech (Saffran et al., 1996a; Aslin et al., 1998). In addition to these transitional probabilities, speech cues available in the input such as stress patterns (i.e., longer duration, increased amplitude, and higher pitch on certain syllables) can also guide word segmentation (Johnson and Jusczyk, 2001; Thiessen and Saffran, 2003). By adapting a well established paradigm from the infant behavioral literature, the present research examined the neural correlates of speech parsing. In light of evidence indicating that the ability to process rapid acoustic changes is highly predictive of future linguistic skills (Benasich and Tallal, 2002), the relationship between neural activity indexing successful word segmentation and rapid auditory processing ability was also explored.
Materials and Methods
Participants
Twenty-seven participants (13 female; mean age, 26.63 years; range, 20–44 years) volunteered in this study after giving written informed consent for the experimental protocol approved by the University of California, Los Angeles Institutional Review Board. By report, the participants were right-handed, native English speakers with no history of neurological or psychiatric disorders. All 27 participants completed an fMRI speech stream exposure scan, and 15 of them (six female; mean age, 26.92 years; range, 20–31 years) also underwent an event-related fMRI word discrimination scan immediately after the speech stream exposure scan. All participants completed a behavioral “word discrimination task” outside of the scanner after the scanning session.
Stimuli and tasks
Speech stream exposure task.
Participants listened to three counterbalanced streams of nonsense speech supposedly spoken by aliens from three different planets. Participants were not explicitly instructed to perform a task except to listen, given that a recent study has demonstrated that implicit learning can be attenuated by explicit memory processes during sequence learning (Fletcher et al., 2004). As shown in Figure 1, A and B, the three streams were created by repeatedly concatenating 12 syllables (a different set of 12 syllables was used for each speech stream). In each of the two artificial language conditions, the 12 syllables were used to make four trisyllabic words, following the exact same procedure used in previous infant and adult behavioral studies (Saffran et al., 1996a,b; Aslin et al., 1998; Johnson and Jusczyk, 2001). Each syllable was recorded separately using SoundEdit (Macromedia; Adobe Systems, San Jose, CA), ensuring that the average syllable duration (0.267 s), amplitude (18.2 dB), and pitch (221 Hz) were (1) not significantly different across the experimental conditions and (2) matched to those used previously in the behavioral literature. For each artificial language, the four words were randomly repeated three times to form a block of 12 words, subject to the constraint that no word was repeated twice in a row. Five such different blocks were created, and then this five-block sequence was itself concatenated three times to form a continuous speech stream lasting 2 min and 24 s, during which each word occurred 45 times. For example, the four words “pabiku,” “tibudo,” “golatu,” and “daropi” were combined to form a continuous stream of nonsense speech containing no breaks or pauses (e.g., pabikutibudogolatudaropitibudo…). Within the speech stream, transitional probabilities for syllables within a word and across word boundaries were 1 and 0.33, respectively. Thus, as the words were repeated, transitional probabilities could be computed and used to segment the speech stream. In the “unstressed language condition” (U), the stream contained only transitional probabilities as cues to word boundaries. In the “stressed language condition” (S), the speech stream contained transitional probabilities, as well as speech, or prosodic, cues introduced by adding stress to the initial syllable of each word, one-third of the time it occurred. Stress was added by slightly increasing the duration (0.273 s), amplitude (16.9 dB), and pitch (234 Hz) of these stressed syllables. At the same time, these small increases were offset by minor reductions in these parameters for the remaining syllables within the stressed language condition, which ensured that the mean duration, amplitude, and pitch would not be reliably different across the three experimental conditions. The initial syllable was stressed because 90% of words in conversational English have stress on their initial syllable (Cutler and Carter, 1987).
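For readers who wish to trace the logic of the stream construction, the following Python sketch assembles one artificial-language stream under the constraints just described (four trisyllabic words, blocks of 12 word tokens with no back-to-back repetition, five different blocks concatenated three times, giving 45 occurrences per word). This is an illustrative reconstruction, not the stimulus-generation code actually used; function names are ours.

```python
import random

# Illustrative sketch (not the authors' stimulus code): assemble one artificial-language
# stream -- four trisyllabic "words", blocks of 12 word tokens with no word repeated
# back-to-back, five different blocks concatenated three times, so that each word
# occurs 45 times (~144 s at ~0.267 s per syllable).
WORDS = ["pabiku", "tibudo", "golatu", "daropi"]  # example words from the text

def make_block(words, reps=3):
    """Return reps * len(words) word tokens with no immediate repetition."""
    while True:
        block = words * reps
        random.shuffle(block)
        if all(a != b for a, b in zip(block, block[1:])):
            return block

blocks = [make_block(WORDS) for _ in range(5)]                      # five different blocks
tokens = [w for _ in range(3) for block in blocks for w in block]   # 180 word tokens
stream = "".join(tokens)                                            # continuous, no pauses

assert all(tokens.count(w) == 45 for w in WORDS)
print(stream[:48] + "...")
```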
A “random syllables condition” (R) was also created so as to identify activity associated with the actual process of word segmentation, as afforded by the presence of statistical and prosodic cues available in the two artificial language conditions (U + S), as opposed to activity related to merely listening to a series of concatenated syllables. In this condition, the 12 syllables were not arranged into four words as in the two artificial language conditions; rather, these syllables were arranged pseudorandomly such that no three-syllable string was repeated more than twice in the stream (the frequency with which two-syllable strings occurred was also minimized). Therefore, in this condition, neither statistical regularities nor prosodic cues were afforded to the listener to aid speech parsing. Thus, as depicted in Figure 1C, each participant listened to three 144 s speech streams (R, U, and S) interspersed between 30 s of resting baseline. Short samples of these speech streams are available in supplemental audio clips 1–3 (available at www.jneurosci.org as supplemental material). The order of presentation of the three experimental conditions was counterbalanced across participants according to a “Latin square” design.
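The random-syllable stream can be sketched in the same spirit. The rejection-sampling scheme below is one plausible way to enforce the constraint that no three-syllable string occurs more than twice; it is an assumption rather than the procedure actually used, and the additional minimization of repeated two-syllable strings is not reproduced.

```python
import random

# Assumed (illustrative) procedure for the random syllables stream: syllables are drawn
# one at a time, and a draw is rejected if appending it would make any three-syllable
# string occur more than twice. Syllable labels are placeholders.
SYLLABLES = [f"s{i:02d}" for i in range(12)]
STREAM_LENGTH = 540  # 540 syllables, matching the ~144 s artificial-language streams

def make_random_stream(syllables, length, max_triplet_repeats=2):
    stream, triplet_counts = [], {}
    while len(stream) < length:
        s = random.choice(syllables)
        if len(stream) >= 2:
            triplet = (stream[-2], stream[-1], s)
            if triplet_counts.get(triplet, 0) >= max_triplet_repeats:
                continue  # this trisyllabic string has already occurred twice; resample
            triplet_counts[triplet] = triplet_counts.get(triplet, 0) + 1
        stream.append(s)
    return stream

random_stream = make_random_stream(SYLLABLES, STREAM_LENGTH)
```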
It is important to note the following: (1) the same number of syllables was repeated the same number of times across all three conditions, although it was only in the two artificial language conditions that cues were available to guide word segmentation; (2) across participants, each set of 12 syllables was used with the same frequency in each condition, thus ensuring that any difference between conditions would not be attributable to different degrees of familiarity with a given set of syllables; (3) to guard against the possibility that the computation of the transitional probabilities between the syllables chosen to form the words in the two artificial languages might be influenced by previous experience with the transitional probabilities between these syllables in English, three different versions of each language were created by rotating the position of the syllables within the words of each language (e.g., “pabiku” in one version became “kubipa” and “bipaku” in the others, with each word being used equally often across participants); and (4) although the length of the activation blocks used in this task was unconventional for an imaging study and could have resulted in decreased power to detect reliable differences between conditions, we opted to adhere to the paradigm used in previous behavioral studies in light of preliminary analyses conducted on the first 12 subjects demonstrating that sufficient power was nevertheless achieved. Furthermore, effect sizes, computed to address the robustness of the analyses performed, demonstrated that rather large effects were obtained (Cohen’s d > 0.80).
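As a point of reference for item (4), a minimal illustration of the effect-size calculation, assuming one contrast value per participant (the values below are made up):

```python
import numpy as np

# Hypothetical illustration of the effect-size calculation mentioned in point (4):
# for a within-subject contrast (one value per participant), Cohen's d can be taken
# as the mean contrast divided by its standard deviation.
contrast_values = np.array([1.2, 0.8, 1.5, 0.9, 1.1, 0.7, 1.3, 1.0, 0.6, 1.4, 0.9, 1.2])
cohens_d = contrast_values.mean() / contrast_values.std(ddof=1)
print(f"Cohen's d = {cohens_d:.2f}")  # d > 0.80 is conventionally considered a large effect
```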
Word discrimination task.
Fifteen participants also completed a second fMRI task to provide an implicit index of successful word segmentation in lieu of the assessment procedure (i.e., head-turn novelty preference) used in the infant behavioral studies. In this mixed block/event-related fMRI design, participants were presented with trisyllabic combinations from the “speech stream exposure task” and were simply told to listen to what might have been words in the artificial languages they just heard. Based on evidence from previous behavioral studies, participants were not expected to be able to explicitly identify whether these trisyllabic combinations were words in the artificial languages after such a short exposure to the streams (Saffran et al., 1996b; Sanders et al., 2002). Accordingly, participants were not asked to make an explicit judgment in the scanner so as not to have explicit task demands confound the activity associated with the implicit processing of the stimuli. As shown in Figure 2, A and B, trisyllabic combinations were presented in three activation blocks, with each block using the set of syllables that corresponded to those used in one of the three speech streams heard previously by a participant. Within each block, participants listened to the four “words” (W) used to create the artificial language streams (e.g., “golatu”, “daropi”) as well as to four “partwords” (PW), that is, trisyllabic combinations formed by grouping syllables from adjacent words within the speech streams (e.g., the partword “tudaro” consisted of the last syllable of the word “golatu” and the first two of the adjacent word “daropi” within the stream “… golatudaropi… ”). As in the previous behavioral studies, the transitional probabilities between the first and second syllables in the words were appreciably higher than those between the first two syllables in the partwords (1 as opposed to 0.33) as a result of the words and partwords having occurred within the speech stream 45 and 15 times, respectively. Each word and partword was repeated three times and interspersed with null events. The same pseudorandomized order of presentation for words and partwords was used in each block such that no word or partword was repeated twice in a row and no more than three words, partwords, or null events occurred consecutively. The words and partwords from the stressed language condition were presented in their unstressed version, such that there were no differences in duration, amplitude, or pitch between any of the syllables used in this task. The trisyllabic combinations used in the block corresponding to the random syllables condition effectively served as “nonwords” (NW), because these combinations had occurred only one time during the random syllables stream, in which no statistical or prosodic cues were afforded to the listener (transitional probability between the first two syllables equal to 0.1).
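The frequency-based distinction between words, partwords, and nonwords follows directly from the transitional probabilities of adjacent syllables. A small sketch, assuming the stream has already been split into a list of syllables (function and variable names are ours):

```python
from collections import Counter

# Illustrative sketch: the transitional probability of syllable B given syllable A is
# count(A followed by B) / count(A). In the artificial languages, word-internal pairs
# yield a TP of 1.0 and pairs spanning a word boundary yield a TP of about 0.33; in the
# random stream, pairs approach 0.1. The stream is assumed to be available as a list of
# syllables (e.g., ["go", "la", "tu", "da", "ro", "pi", ...]).
def transitional_probability(syllables, a, b):
    pair_count = sum(1 for x, y in zip(syllables, syllables[1:]) if (x, y) == (a, b))
    a_count = Counter(syllables[:-1])[a]  # occurrences of A that have a successor
    return pair_count / a_count if a_count else 0.0

# e.g., transitional_probability(stream_syllables, "go", "la") -> 1.0   (within "golatu")
#       transitional_probability(stream_syllables, "tu", "da") -> ~0.33 (across a boundary)
```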
Behavioral tasks.
To investigate whether participants were able to explicitly discriminate between words and partwords, behavioral measures (response times and accuracy scores) for the word discrimination task were collected outside of the scanner. Each participant listened to the word stimuli and responded yes or no as to whether they thought each trisyllabic combination could be a word in the artificial languages they heard previously. All 27 participants completed this behavioral testing. All but one participant also completed a computerized version of the Tallal Repetition Test, a hierarchical series of subtests shown to assess several aspects of central auditory processing, including rapid auditory processing skills (Tallal and Piercy, 1973). This test uses a two-alternative forced-choice design in which participants are operantly trained to respond to two complex tones by clicking on two different squares on the computer screen. After participants learn the tone–square association, they are presented with progressively longer sequences of complex tones and must subsequently indicate the temporal order of such sequences. The accuracy scores from the SerialMemory3Fast (SM3Fast) subscale of this test were used in this study because maximum variability is observed among participants at this difficulty level. In this subscale, the stimuli were two 75 ms complex tones, with fundamental frequencies of 100 and 300 Hz. Random combinations of these two tones were presented as tone triplets, consisting of three elements (e.g., tone 1-2-1, 2-1-1, etc.) separated by varying interstimulus intervals of silence (0, 10, 60, or 150 ms). Participants listened to 24 such three-element trials, indicating after each trial the order of the stimulus pattern by clicking on the squares in the same order as the tones were presented.
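To make the structure of these trials concrete, the sketch below synthesizes one tone triplet. Only the parameters given in the text (75 ms tones, 100 and 300 Hz fundamentals, interstimulus intervals of 0, 10, 60, or 150 ms) are taken from the task; the sampling rate and the harmonic content of the "complex tones" are assumptions.

```python
import numpy as np

# Sketch of one tone-triplet trial. The "complex tones" are approximated here as a
# fundamental plus two harmonics; the actual spectral content of the Tallal stimuli is
# not specified in this section and is assumed.
SR = 44100   # sampling rate in Hz (assumed)
DUR = 0.075  # tone duration in seconds (75 ms, from the text)

def complex_tone(f0, dur=DUR, sr=SR):
    t = np.arange(int(dur * sr)) / sr
    return sum(np.sin(2 * np.pi * f0 * k * t) / k for k in (1, 2, 3))  # f0 + two harmonics

def tone_triplet(pattern, isi_s):
    """pattern: e.g. (1, 2, 1); isi_s: inter-stimulus interval in seconds."""
    tones = {1: complex_tone(100.0), 2: complex_tone(300.0)}
    gap = np.zeros(int(isi_s * SR))
    parts = []
    for i, p in enumerate(pattern):
        parts.append(tones[p])
        if i < len(pattern) - 1:
            parts.append(gap)
    return np.concatenate(parts)

stimulus = tone_triplet((1, 2, 1), isi_s=0.060)  # one of the 24 three-element trials
```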
fMRI data acquisition
Functional images were collected using a Siemens Allegra 3 Tesla head-only MRI scanner. A two-dimensional spin-echo scout [repetition time (TR), 4000 ms; echo time (TE), 40 ms; matrix size, 256 × 256; 4 mm thick; 1 mm gap] was acquired in the sagittal plane to allow prescription of the slices to be obtained in the remaining scans. For each participant, a high-resolution structural T2-weighted echo-planar imaging (EPI) volume [TR, 5000 ms; TE, 33 ms; matrix size, 128 × 128; field of view (FOV), 20 cm; 36 slices; 1.56 mm in-plane resolution; 3 mm thick] was acquired coplanar with the functional scans to allow for spatial registration of each participant’s data into a standard coordinate system. For the speech stream exposure task, one functional scan lasting 8 min and 48 s was acquired covering the whole cerebral volume (174 images; EPI gradient echo sequence; TR, 3000 ms; TE, 25 ms; flip angle, 90°; matrix size, 64 × 64; FOV, 20 cm; 36 slices; 3.125 mm in-plane resolution; 3 mm thick; 1 mm gap). For the word discrimination task, a second functional scan lasting 3 min and 30 s was acquired that also covered the whole brain (103 images; EPI gradient echo sequence; TR, 2000 ms; TE, 25 ms; flip angle, 90°; matrix size, 64 × 64; FOV, 20 cm; 36 slices; 3.125 mm in-plane resolution; 3 mm thick; 1 mm gap).
Participants listened to the auditory stimuli through a set of magnet-compatible stereo headphones (Resonance Technology, Northridge, CA). Stimuli were presented using MacStim 3.2 psychological experimentation software (CogState, West Melbourne, Victoria, Australia).
fMRI data analysis
Using automated image registration (Woods et al., 1998a,b), functional images for each participant were (1) realigned to each other to correct for head motion during scanning and coregistered to their respective high-resolution structural images using a six-parameter rigid body transformation model and a least-squares cost function with intensity scaling, (2) spatially normalized into a standard stereotactic space (Woods et al., 1999) using polynomial nonlinear warping, and (3) smoothed with a 6 mm full-width at half-maximum isotropic Gaussian kernel to increase the signal-to-noise ratio. Statistical analyses were implemented in SPM99 (Wellcome Department of Cognitive Neurology, London, UK; http://www.fil.ion.ucl.ac.uk/spm/). For each participant, contrasts of interest were estimated according to the general linear model using a canonical hemodynamic response function. For the speech stream exposure task, the exponential decay function in SPM99 (which closely approximates a linear function) was also used to model changes that occurred within each activation block as a function of exposure to the speech stream. Contrast images from these fixed effects analyses were then entered into second-level analyses using random effects models to allow for inferences to be made at the population level (Friston et al., 1999). All reported activity survived correction for multiple comparisons at the cluster level (p < 0.05, corrected) and t > 3.3 for magnitude (p < 0.001, uncorrected), unless otherwise noted. Small-volume correction at the cluster level with a sphere of 1 cm radius around the peak of activation was used in the putamen for the speech stream exposure task because of an a priori hypothesis of basal ganglia involvement during sequence learning.
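As a schematic of the block-design modeling step (not the SPM99 implementation), the following numpy/scipy sketch convolves a boxcar for each 144 s activation block with a canonical double-gamma HRF sampled at TR = 3 s and estimates betas by ordinary least squares. The block onsets assume a 30 s rest period before each stream, which is consistent with 174 images acquired at TR = 3 s; all data values are placeholders.

```python
import numpy as np
from scipy.stats import gamma

# Schematic block-design GLM (not the SPM99 code): boxcars for the three 144 s streams
# are convolved with a canonical double-gamma HRF sampled at TR = 3 s, and beta weights
# are estimated by ordinary least squares.
TR, N_SCANS = 3.0, 174

def canonical_hrf(tr, duration=32.0):
    t = np.arange(0.0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0  # standard double-gamma shape
    return hrf / hrf.sum()

def block_regressor(onset_s, duration_s=144.0, tr=TR, n_scans=N_SCANS):
    boxcar = np.zeros(n_scans)
    boxcar[int(onset_s / tr):int((onset_s + duration_s) / tr)] = 1.0
    return np.convolve(boxcar, canonical_hrf(tr))[:n_scans]

# Design matrix: one regressor per speech stream (onsets assume 30 s rest before each
# stream: 30, 204, 378 s) plus a constant term.
X = np.column_stack([block_regressor(30), block_regressor(204), block_regressor(378),
                     np.ones(N_SCANS)])
y = np.random.randn(N_SCANS)                  # placeholder voxel time series
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
contrast_value = betas[0] - betas[1]          # difference between two stream regressors
```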
For the speech stream exposure task, separate one-sample t tests were implemented for each condition (unstressed language, stressed language, and random syllables vs resting baseline) to identify blood oxygenation level-dependent (BOLD) signal increases associated with listening to each speech stream. Direct comparisons between each condition, as well as between the two artificial language conditions (U + S) and the random syllables condition, were also implemented to examine differential activity related to the presence of statistical and prosodic cues within the artificial language streams. Specifically, in an effect akin to repetition priming, the detection of word boundaries (afforded by the statistical regularities and prosodic cues available in the artificial languages only) was expected to yield overall less activity for the artificial language streams within language-relevant frontotemporal regions and the basal ganglia, which are known to be involved in sequence learning (Poldrack et al., 2005).
In addition, because word segmentation is thought to take place as a function of exposure to the language streams, signal increases over time were expected in left-hemisphere language cortices while listening to the artificial languages only, with a larger effect predicted for the stressed language condition in which both statistical and prosodic cues were available to guide word segmentation. A region-of-interest (ROI) analysis was then conducted in the temporal cortices in which such effects were observed to examine whether these signal increases over time for the artificial languages were indeed left-lateralized. This functionally defined ROI included all voxels showing reliable signal increases at the group level in either the left hemisphere (LH) or right hemisphere (RH), as well as their respective counterparts in the opposite hemisphere. For each participant, a laterality index was computed based on the number of voxels showing increased activity over time within these symmetrical ROIs in the LH and RH: (number of voxels activated in the LH − number of voxels activated in the RH) / (number of voxels activated in the LH + number of voxels activated in the RH).
Because the ability to identify words that had occurred within the speech streams should be related to activity observed during exposure to the artificial languages if this activity truly reflects word segmentation, regression analyses were also conducted to assess the relationship between accuracy scores on the behavioral post-scanner word discrimination test and neural activity associated with listening to the two artificial languages compared with the random syllables during the speech stream exposure task.
For the word discrimination task, separate one-sample t tests were implemented to identify changes in the BOLD signal associated with listening to the words, partwords, and nonwords that had occurred with different frequencies during the speech stream exposure task. Direct comparisons between each condition were then implemented to test the hypothesis that differential activity in language networks would be observed for words, partwords, and nonwords, as would be expected if word segmentation had indeed occurred during the speech stream exposure task. Specifically, this effect could manifest as stronger signal increases in response to words (than partwords and nonwords) if these were recognized as having “word-like” status, or, alternatively, the opposite pattern could be observed because of a novelty effect for nonwords. The SPM99 toolbox MarsBaR (http://marsbar.sourceforge.net/) was used to extract parameter estimates for each participant from regions that were significantly active in response to words, partwords, and nonwords.
Finally, given that rapid auditory processing skills are highly predictive of future language outcomes (Benasich and Tallal, 2002), a regression analysis was conducted to examine the relationship between participants’ rapid auditory processing skills and the ability to implicitly segment the artificial language streams, as indexed by activity observed in response to words in the word discrimination task. Based on previous findings linking the left middle frontal gyrus (MFG) to the processing of rapid acoustic changes (Temple, 2002), a positive correlation was expected between accuracy scores on the SM3Fast subscale of the Tallal Repetition Test and neural activity associated with listening to words in the word discrimination task.
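A numpy stand-in for the ROI-extraction and brain-behavior regression steps just described (MarsBaR itself is not reproduced here; all arrays, dimensions, and values below are placeholders):

```python
import numpy as np

# Average each participant's contrast (beta) image within a region-of-interest mask,
# then correlate those per-participant values with behavioral accuracy scores.
def roi_mean(beta_map, roi_mask):
    """beta_map: 3-D array of parameter estimates; roi_mask: boolean array, same shape."""
    return beta_map[roi_mask].mean()

n_subjects = 15
roi_mask = np.zeros((64, 64, 36), dtype=bool)
roi_mask[30:34, 20:24, 18:22] = True                                  # placeholder ROI
beta_maps = [np.random.randn(64, 64, 36) for _ in range(n_subjects)]  # placeholder maps
roi_values = np.array([roi_mean(b, roi_mask) for b in beta_maps])
sm3fast_accuracy = np.random.uniform(0.4, 1.0, n_subjects)            # placeholder scores

r = np.corrcoef(roi_values, sm3fast_accuracy)[0, 1]
print(f"Pearson r between ROI activity and SM3Fast accuracy: {r:.2f}")
```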
Results
Behavioral results
Accuracy and reaction times on the behavioral word discrimination test conducted immediately after the fMRI scans are reported in Table 1. Not surprisingly, given the findings of previous behavioral research (Saffran et al., 1996b; Sanders et al., 2002), participants were unable to explicitly recognize which trisyllabic combinations might have been words in the artificial languages they heard during the speech stream exposure task, as demonstrated by accuracy scores not different from chance for either the stressed or unstressed language conditions. Similarly, there were no significant differences in reaction time to items from within the unstressed language, stressed language, or random syllables streams.
Speech stream exposure task
Random effects models were used to identify changes in the BOLD signal associated with listening to the unstressed language, stressed language, and random syllables during the speech stream exposure task. Compared with resting baseline, listening to all three streams of continuous speech activated a large-scale bilateral neural network including canonical language areas in the LH and their RH counterparts, with extensive activation in temporal, frontal, and parietal cortices as well as in the putamen (Fig. 3; Table 2). Overall, a pattern of more focal activity (i.e., less activation in terms of spatial extent) was observed when participants listened to the artificial languages (U + S) than when they listened to the stream of random syllables (Table 2). Similarly, more focal activity was observed when participants listened to the stressed language than to the unstressed language (Table 3), indicating that the presence of statistical cues in both artificial language conditions, as well as the addition of prosodic cues in the stressed language, afforded more efficient neural processing.
Because transitional probabilities and the related identification of word boundaries are computed on-line during the course of the activation conditions, regions in which activity might be increasing as a function of exposure to the speech streams were examined. A statistical contrast modeling increases in activity over time for the artificial languages (U↑ + S↑) revealed significant bilateral signal increases in superior temporal gyrus (STG) and transverse temporal gyrus, extending into the supramarginal gyrus (SMG) in the LH (Fig. 4A,B; Table 4). A region-of-interest analysis indicated that this increasing activity during exposure to the unstressed and stressed language conditions was reliably stronger in the LH than in the RH, as indicated by a significantly larger number of voxels showing this effect in the LH than in the RH (LH, 235 vs RH, 180; F(1,26) = 4.622; p < 0.05). In contrast, no reliable signal increases over time were observed during exposure to the random syllables condition, as can be seen in the time series shown in Figure 4B. A contrast directly comparing signal increases occurring during exposure to the two artificial languages versus signal increases occurring during exposure to the random syllables condition [(U↑ + S↑) − R↑] confirmed that activity in bilateral middle temporal gyrus (MTG) and STG, and left SMG showed significantly greater increases as a function of exposure to the two artificial language conditions than the random syllables condition (Table 4). This pattern held when comparing signal increases during each artificial language condition versus the random syllables condition (U↑ − R↑ and S↑ − R↑) (Fig. 5A,B; Table 5), with stronger signal increases for the comparison of the stressed language versus random syllables than for the unstressed language versus random syllables [(S↑ − R↑) − (U↑ − R↑)] in left STG and the right temporoparietal junction (Fig. 5C; Table 5).
A statistical contrast modeling decreases in activity over time for the artificial language conditions (U↓ + S↓) revealed activity in ventromedial prefrontal cortex (Talairach coordinates 4, 56, 8; t = 5.26, corrected for multiple comparisons at the cluster level). Interestingly, at more liberal thresholds (t > 1.71; p < 0.05, uncorrected; minimum cluster size, 5 voxels), signal decreases were also observed in several other regions (i.e., left MTG, bilateral inferior parietal lobule (IPL), right middle and medial frontal gyri, anterior cingulate, and left putamen) in which overall greater activity was observed for the random syllables versus artificial language conditions [R − (S + U)]. Conversely, at these more liberal thresholds, greater activity was observed for the artificial languages versus random syllables condition [(S + U) − R] in the left STG and IPL, that is, within those regions in which significant increases over time were observed for the artificial languages. For the random syllables condition (R↓), signal decreases over time were only observed in the left insula and IPL (Talairach coordinates −34, −24, −2, t = 5.3 and −50, −36, 50, t = 3.75, respectively, both corrected for multiple comparisons at the cluster level).
Correlation with behavioral word discrimination accuracy
Participants’ accuracy at discriminating words on the post-scanner behavioral word discrimination task was positively correlated with activity in the left STG as participants listened to the stressed and unstressed languages compared with random syllables during the speech stream exposure task (Fig. 6; Table 6). Importantly, this effect was observed in the same region showing increasing activity over time while listening to the artificial languages.
Word discrimination task
Random effects models were used to identify changes in the BOLD signal associated with listening to words, partwords, and nonwords presented during the three blocks of the word discrimination task. Activity in response to hearing both words and partwords (which had occurred 45 and 15 times, respectively, during the speech stream exposure task) was summed across the unstressed and stressed language blocks. (In keeping with the head-turn novelty preference paradigm used in the developmental literature on word segmentation, only 12 words and 12 partwords were presented within each language block. Collapsing across the language blocks yielded 24 events for words and 24 for partwords to be contrasted with the 24 events for the nonwords from the random syllable block.) Activity for the nonwords corresponded to trisyllabic combinations presented during the random syllables block (these combinations had occurred only once during the random syllables condition of the speech stream exposure task). A statistical contrast comparing BOLD signal associated with listening to words versus nonwords (W − NW) revealed significantly greater activity for words in left inferior frontal gyrus (IFG) and MFG (Fig. 7A; Table 7). Reliably greater activity, albeit to a lesser extent, was also observed in these same regions when comparing words versus partwords (W − PW) (Fig. 7B; Table 7) and partwords versus nonwords (PW − NW) (Table 7). The partwords versus nonwords comparison also revealed significantly greater activity in the superior parietal lobule. Importantly, no reliably greater cortical activity was observed for any of the reverse statistical comparisons (NW − W, NW − PW, and PW − W). Despite the limited number of words and partwords within each language block, when the above analyses were conducted for the stressed and unstressed language blocks separately, a virtually identical pattern of results was observed (albeit with smaller cluster size).
Additional analyses were conducted to determine whether differential activity would be observed as participants listened to words that had occurred in the speech stream when both prosodic and statistical cues were available to guide word segmentation (stressed language) as opposed to words that had occurred in the speech stream containing only statistical cues (unstressed language). Comparing activity observed during the stressed language block with that observed during the unstressed language block [S(W + PW) − U(W + PW)] revealed that left MFG and the right IPL (Talairach coordinates −48, 12, 34, t = 2.85 and 36, −50, 38, t = 3.43, respectively, corrected for multiple comparisons at the cluster level) showed greater activation when participants listened to words and partwords from within the speech stream that had contained additional prosodic cues (stressed language) than to words and partwords from within the stream containing statistical regularities alone (unstressed language). No reliable activity was detected for the reverse comparison [U(W + PW) − S(W + PW)].
Correlation with rapid auditory processing skills
Participants’ performance on the Tallal Repetition Test was positively correlated with left MFG activity while participants listened to words during the word discrimination task (Fig. 8). That is, the better the participants’ rapid auditory processing skills, as indexed by accuracy on the SM3Fast subscale of the Tallal Repetition Test (mean ± SD, 80.3 ± 19.6%), the greater the activity observed in left MFG during the word discrimination task when listening to trisyllabic combinations that had occurred 45 times during the stressed and unstressed language conditions of the speech stream exposure task.
Discussion
The present research investigated the neural correlates of word segmentation, a crucial process in the early stages of language learning. Over the past decade, milestone studies in the infant behavioral literature have revolutionized our thinking about the mechanisms underlying language acquisition, showing that the brain possesses striking computational abilities that allow young infants to rapidly identify word boundaries by extracting statistical and prosodic patterns within continuous speech (Saffran et al., 1996a; Aslin et al., 1998; Johnson and Jusczyk, 2001; Thiessen and Saffran, 2003). The neural architecture underlying this computational feat, however, has not yet been identified. By adapting the paradigm used in the infant literature, this study explored how the brain processes the statistical and prosodic speech cues available in the input to crack the code of a new artificial language.
Despite the somewhat unusual fMRI design, the frontotemporal networks activated while participants listened to streams of concatenated syllables are consistent with those observed in previous studies involving the presentation of similar speech-like stimuli and spectrally complex acoustic information (Binder et al., 2000; Hickok and Poeppel, 2000; Johnsrude et al., 2000; Scott et al., 2000; Vouloumanos et al., 2001; Gandour et al., 2002; Jäncke et al., 2002; Zatorre et al., 2002; Gelfand and Bookheimer, 2003; Scott and Johnsrude, 2003; Dehaene-Lambertz et al., 2005) (Fig. 3; Table 2). The observed activity in motor speech areas is also in line with a recent study showing considerable overlap between the areas active during passive listening to syllables and those active during overt syllable production (Wilson et al., 2004). Moreover, our finding of basal ganglia activity is in agreement with previous studies involving sequence learning across different modalities (Poldrack et al., 2005; Doyon et al., 2003; Saint-Cyr, 2003; Van der Graaf et al., 2004).
The more focal pattern of activity observed while participants listened to the artificial languages, as opposed to the stream of random syllables, suggests that the statistical and prosodic cues to word segmentation present only in the artificial languages facilitated the parsing of the speech streams, thus affording more efficient neural processing (Table 2). This interpretation is in line with other studies reporting less activity for easier or better-learned tasks (Petersen et al., 1998; Landau et al., 2004; Casey et al., 2005; Kelly and Garavan, 2005). The overall greater activity associated with listening to the stream of random syllables, in which the order of occurrence of syllables could not be predicted, is also consistent with a study showing greater activity in superior temporal cortices in response to sequences of visual stimuli with low temporal predictability versus those with high predictability (Bischoff-Grethe et al., 2000).
During exposure to three continuous streams of concatenated syllables, left-lateralized BOLD signal increases in temporal and inferior parietal cortices occurred only when participants listened to the two streams containing statistical regularities, particularly the stream also containing prosodic cues (Figs. 4, 5; Tables 4, 5). Because the artificial languages and random syllables stream differed only in the presence of cues to word boundaries, the signal increases observed for both artificial languages along the superior temporal gyri may reflect the ongoing computation of transitional probabilities (Blakemore et al., 1998; Bischoff-Grethe et al., 2000), whereas the increases in activity in the left supramarginal gyrus may reflect the development of phonological representations for the “words” in the artificial languages (Gelfand and Bookheimer, 2003; Xiao et al., 2005). The stronger signal increases observed in these regions, as well as in primary auditory cortices, for the stressed language compared with the unstressed language, suggest that the presence of prosodic cues did aid speech parsing, in line with previous behavioral data in both infants and adults (Johnson and Jusczyk, 2001; Thiessen and Saffran, 2003) (Table 5). Although several studies have reported decreases in activity as participants learn to perform a linguistic task (Thiel et al., 2003; Golestani and Zatorre, 2004; Noppeney and Price, 2004) and, indeed, some decreases were also observed in the present study, our findings are consistent with previous reports of increased activity as a function of learning in primary and secondary sensory and motor areas, as well as in regions involved in the storage of task-related cortical representations (for review, see Kelly and Garavan, 2005).
Our finding of signal increases in temporal and inferior parietal cortices during exposure to the artificial languages indicates that initial word segmentation had occurred, despite the fact that, overall, participants could not reliably identify the words used to create the speech streams in a post-scan behavioral test. This interpretation is strongly supported by the positive relationship found between activity within the left STG during exposure to the artificial languages (compared with the random syllables stream) and participants’ ability to discriminate words in the post-scanner behavioral word discrimination task (Fig. 6; Table 6). Additionally, although neuroimaging studies of learning have typically examined changes in neural activity associated with observable behavioral changes (indexed by explicit or implicit measures), several other studies have reported changes in neural activity that occur in the absence of any overt change in task performance (Shadmehr and Holcomb, 1997; Jaeggi et al., 2003; Landau et al., 2004; Kelly and Garavan, 2005). Neural changes that precede behavioral changes have also been shown in event-related potential (ERP) studies across a variety of linguistic tasks (Shestakova et al., 2003; McLaughlin et al., 2004). Notably, an ERP study of word segmentation has also demonstrated that, although adults required prolonged exposure and training to explicitly identify words within a speech stream, they displayed an increased N100 component to word initial syllables (taken to index the implicit detection of word boundaries) long before showing evidence of explicit knowledge (Sanders et al., 2002).
The results of the word discrimination task, in which participants were exposed to trisyllabic combinations that had occurred with different frequencies within the speech streams, provide additional evidence that word segmentation had implicitly occurred during the speech stream exposure task. Here, greater activity was observed in left prefrontal regions in response to trisyllabic combinations with higher frequency of occurrence within the speech streams and, thus, higher transitional probabilities between syllables (Fig. 7; Table 7). Specifically, listening to words compared with partwords (i.e., trisyllabic combinations that had occurred 45 times versus 15 times within the artificial language streams) and nonwords (i.e., trisyllabic combinations that occurred only once within the random syllables stream) elicited greater activity in the posterior, superior region of left IFG extending into the MFG, regions known to be important for phonological and sequential processing (Gelfand and Bookheimer, 2003). That this effect was more pronounced for the words from the stressed language than the unstressed language again suggests that participants were able to capitalize on the presence of prosodic cues to aid word segmentation during the speech stream exposure task. The observed left-lateralized prefrontal activity is consistent with a host of previous imaging studies implicating these regions in several aspects of phonological processing (for review, see Bookheimer, 2002), including phonemic discrimination, temporal sequencing, articulatory recoding, and the maintenance of acoustic information in working memory (Burton et al., 2000; Newman and Twieg, 2001; Gelfand and Bookheimer, 2003; LoCasto et al., 2004; Xiao et al., 2005).
Interestingly, activity within the left MFG while listening to words from within the stressed and unstressed languages was modulated by participants’ rapid auditory processing skills, as indexed by their performance on the Tallal Repetition Test (Tallal and Piercy, 1973) (Fig. 8). Previous research has shown that the ability to process rapid acoustic changes is a highly significant predictor of future language outcomes and verbal intelligence and that this ability is compromised in individuals with language learning impairments such as dyslexia (Benasich and Tallal, 2002; Tallal, 2004). The observed relationship between MFG activity (taken to index successful implicit word segmentation) and individual differences in rapid auditory processing skills further attests to the importance of this region for language learning. Although other neuroimaging studies have linked MFG activity to the processing of rapid acoustic information at the phonemic level (Fiez et al., 1995; Temple et al., 2000; Poldrack et al., 2001; Temple, 2002), the present results indicate that rapid auditory processing skills may also be important for other aspects of language learning.
In conclusion, the current research confirms that minimal exposure to a continuous stream of speech containing statistical and prosodic cues is sufficient for implicit word segmentation to occur and provides a neural signature of on-line word segmentation in the mature brain. At an applied level, this paradigm may be useful for identifying abnormalities in the neural architecture subserving language learning in developmental language disorders and for exploring changes in this circuitry after interventions. At a theoretical level, given that infants and adults can segment sequences of tones just as well as syllables (Saffran et al., 1999), a comparison of neural activity during speech parsing versus parsing of nonlinguistic stimuli such as tones may allow us to disentangle domain-specific from domain-general mechanisms subserving language learning. Most importantly, the present findings in adults can serve as a neurodevelopmental endpoint to investigate changes in the neural basis of language learning occurring with age and linguistic experience to help answer the long-standing question of why children are better language learners than adults.
Footnotes
- This work was supported by the Santa Fe Institute Consortium, the National Alliance for Autism Research, the National Science Foundation, and the Foundation for Psychocultural Research–UCLA Center for Culture, Brain, and Development. For generous support, we also thank the Brain Mapping Medical Research Organization, Brain Mapping Support Foundation, Pierson-Lovelace Foundation, The Ahmanson Foundation, William M. and Linda R. Dietel Philanthropic Fund at the Northern Piedmont Community Foundation, Tamkin Foundation, Jennifer Jones-Simon Foundation, Capital Group Companies Charitable Foundation, Robson Family, and Northstar Fund. The project described was supported by Grant Numbers RR12169, RR13642, and RR00865 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH); its contents are solely the responsibility of the authors and do not necessarily represent the official views of NCRR or NIH. Thanks also to Nicole Vazquez, Jennifer Pfeifer, Ting Wang, and Ashley Scott for their help with stimuli recording and data analysis, and to Paula Tallal for her insightful comments on a previous version of this manuscript.
- Correspondence should be addressed to Mirella Dapretto, Ahmanson-Lovelace Brain Mapping Center, 660 Charles E. Young Drive, Los Angeles, CA 90095. Email: mirella@loni.ucla.edu