Abstract
Understanding spoken language requires a complex series of processing stages to translate speech sounds into meaning. In this study, we use functional magnetic resonance imaging to explore the brain regions that are involved in spoken language comprehension, fractionating this system into sound-based and more abstract higher-level processes. We distorted English sentences in three acoustically different ways, applying each distortion to varying degrees to produce a range of intelligibility (quantified as the number of words that could be reported) and collected whole-brain echo-planar imaging data from 12 listeners using sparse imaging. The blood oxygenation level-dependent signal correlated with intelligibility along the superior and middle temporal gyri in the left hemisphere and in a less-extensive homologous area on the right, the left inferior frontal gyrus (LIFG), and the left hippocampus. Regions surrounding auditory cortex, bilaterally, were sensitive to intelligibility but also showed a differential response to the three forms of distortion, consistent with sound-form-based processes. More distant intelligibility-sensitive regions within the superior and middle temporal gyri, hippocampus, and LIFG were insensitive to the acoustic form of sentences, suggesting more abstract nonacoustic processes. The hierarchical organization suggested by these results is consistent with cognitive models and auditory processing in nonhuman primates. Areas that were particularly active for distorted speech conditions and, thus, might be involved in compensating for distortion, were found exclusively in the left hemisphere and partially overlapped with areas sensitive to intelligibility, perhaps reflecting attentional modulation of auditory and linguistic processes.
- speech
- language
- auditory cortex
- hierarchical processing
- primate
- human
- inferior frontal gyrus
- temporal lobe
- hippocampus
- sentence processing
- fMRI
Introduction
Understanding spoken language is a rapid and seemingly automatic process. The translation of speech sounds (in our native language) into meaning is generally achieved without awareness of intervening processes, despite the background noise and interspeaker variability that is characteristic of everyday speech. This robustness reflects the multiple acoustic means by which stable elements (such as phonetic features or syllables) are coded in clear speech; this redundancy permits comprehension when some acoustic information is lost. Robustness in speech comprehension may derive from the operation of compensatory mechanisms that are recruited when speech becomes difficult to understand, such as listening to loudspeaker announcements at a busy train station or a radio with poor reception. In this study, we use functional magnetic resonance imaging (fMRI) to explore the functional organization of brain regions involved in spoken language comprehension, with a view to understanding the neural basis for normal comprehension and processes that are recruited when speech becomes more difficult to understand.
Several different levels of representation (e.g., phonetic features, phonemes, morphemes, and words) have been proposed to mediate between an incoming speech signal and the computation of its meaning (McClelland and Elman, 1986; Gaskell and Marslen-Wilson, 1997). Models of spoken language comprehension assume that processing is hierarchically organized, with greater abstraction from the surface (acoustic) properties of speech at higher processing levels. However, the degree to which higher-level linguistic processes can be distinguished from less-specialized auditory and sound-form-based processes remains unclear (Whalen and Liberman, 1987; Remez et al., 1994; Scott et al., 2000).
This hierarchy of processing stages may map onto auditory anatomy. The auditory cortex in primates comprises several cortical fields, organized into core (primary), belt (secondary), and parabelt regions. Anatomical and electrophysiological studies indicate that adjacent regions are interconnected and information proceeds from core, to belt, to parabelt, and to more distal areas as processing demands become more complex (for review, see Rauschecker, 1998; Kaas and Hackett, 2000). Neuroimaging studies have suggested a similar processing hierarchy in humans, but, to date, such studies have been limited to nonlinguistic stimuli (e.g., frequency-modulated tones or bandpassed noise) (Wessinger et al., 2001; Hall et al., 2002).
In this study, we alter (distort) the specific surface properties of speech in three different ways (see Fig. 1) and use a correlation design to relate brain activity to intelligibility. We operationalize “intelligibility” as the amount of a sentence that is understood, an aggregate measure of the multiple hierarchically organized processes involved in comprehension. Within areas that are sensitive to intelligibility, we can differentiate regions that are also sensitive to the type of distortion used (form-dependent) and, thus, probably involved in acoustic analysis, and those that are insensitive to distortion type (form-independent); these areas may be involved in higher-level linguistic processes.
We can also identify the neural correlates of effortful understanding by contrasting the neural response to distorted (yet still intelligible) sentences for which comprehension is difficult with conditions that involve less effort (cf. Poldrack et al., 2001). Activation in this contrast may reflect the action of compensatory mechanisms that modulate activity at an acoustic level or at higher levels of processing.
Materials and Methods
Stimulus preparation
The stimuli were 190 declarative English sentences on a range of topics, each comprising 5–17 words (1.7–4.3 sec in duration) and digitized at a sampling rate of 22.1 kHz, taken from the test and filler sentences used in a previous behavioral study (Fig. 1a) (Davis et al., 2002). Three forms of distortion were applied to these sentences using Praat software (Institute of Phonetic Sciences, University of Amsterdam, Amsterdam, The Netherlands) (available at www.praat.org). All three forms of distortion preserved the duration, amplitude, and average spectral composition of the original sentences, although the acoustic forms of sentences processed with the three types of distortion were markedly different.
Segmented. Segmented speech was created by dividing the speech waveform into short chunks at fixed intervals and replacing even-numbered chunks of speech with a signal-correlated noise (SCN) version of the original speech (Bashford et al., 1996). Signal-correlated noise is a waveform with the same spectral profile and amplitude envelope as the original speech but consisting entirely of noise. Although it retains some physical properties of the speech that it replaces (e.g., a speech-like rhythmical structure), these periods of signal-correlated noise do not contain any intelligible speech sounds (Schroeder, 1968). The duration of clear speech was fixed at 200 msec, and 500, 200, or 100 msec sections of speech were replaced by signal-correlated noise (Fig. 1b).
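For concreteness, here is a minimal Python (NumPy/SciPy) sketch of this segmentation step. The signal-correlated-noise helper, which shapes white noise to the sentence's long-term spectrum and imposes its smoothed amplitude envelope, is an approximation of the Praat processing described above rather than the authors' script; the function names and the ~30 Hz envelope smoother are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def signal_correlated_noise(x, fs, rng=None):
    """Approximate signal-correlated noise: white noise given the sentence's
    long-term spectrum and smoothed amplitude envelope (the two properties
    described in the text); an illustration, not the original algorithm."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(x))
    # keep the speech's magnitude spectrum, take the noise's phase
    spec = np.abs(np.fft.rfft(x)) * np.exp(1j * np.angle(np.fft.rfft(noise)))
    scn = np.fft.irfft(spec, n=len(x))
    # impose the speech's broadband amplitude envelope (~30 Hz smoothing, assumed)
    b, a = butter(2, 30.0 / (fs / 2.0))
    env_x = filtfilt(b, a, np.abs(hilbert(x)))
    env_n = filtfilt(b, a, np.abs(hilbert(scn))) + 1e-12
    return scn * env_x / env_n

def segment_speech(x, fs, clear_ms=200, noise_ms=500, rng=None):
    """Alternate fixed chunks of clear speech (200 msec) with chunks of
    signal-correlated noise (500, 200, or 100 msec in the experiment)."""
    x = np.asarray(x, dtype=float)
    scn = signal_correlated_noise(x, fs, rng)
    out, i, clear = x.copy(), 0, True
    while i < len(x):
        step = int(round((clear_ms if clear else noise_ms) * fs / 1000.0))
        if not clear:
            out[i:i + step] = scn[i:i + step]
        i, clear = i + step, not clear
    return out
```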
Vocoded. Noise-vocoded speech (Shannon et al., 1995) was created by dividing the speech signal between 50 and 8000 Hz into 4, 7, or 15 bandpass-filtered frequency bands. Sentences were resynthesized by replacing the information in each frequency band with amplitude-modulated bandpass noise (Fig. 1c). Frequency bands were approximately logarithmically spaced (i.e., the width of each band was proportional to the center frequency of that band). Noise-vocoded speech sounds like a harsh, robotic whisper.
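A comparable sketch of the noise-vocoding step (again an illustration, not the original Praat implementation) splits the 50–8000 Hz range into log-spaced bands, extracts each band's amplitude envelope, and uses it to modulate bandpass-filtered noise; the filter orders and the ~30 Hz envelope smoothing are assumed values.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_bands=4, lo=50.0, hi=8000.0, rng=None):
    """Minimal noise-vocoder sketch (after Shannon et al., 1995): modulate
    bandpass noise with the amplitude envelope of each speech band."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    edges = np.geomspace(lo, hi, n_bands + 1)            # log-spaced band edges
    env_sos = butter(2, 30.0 / (fs / 2.0), output='sos') # ~30 Hz envelope smoother (assumed)
    noise = rng.standard_normal(len(x))
    out = np.zeros_like(x)
    for f1, f2 in zip(edges[:-1], edges[1:]):
        sos = butter(4, [f1 / (fs / 2.0), f2 / (fs / 2.0)], btype='band', output='sos')
        band = sosfiltfilt(sos, x)                           # band-limited speech
        env = sosfiltfilt(env_sos, np.abs(hilbert(band)))    # its amplitude envelope
        carrier = sosfiltfilt(sos, noise)                    # band-limited noise carrier
        out += np.clip(env, 0, None) * carrier
    return out * np.sqrt(np.mean(x ** 2) / (np.mean(out ** 2) + 1e-12))  # match RMS
```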
Noise. Speech in noise was generated by adding a continuous speech-spectrum noise background to sentences at three signal-to-noise ratios (−1, −4, or −6 dB) (Fig. 1d). The overall amplitude of each speech-in-noise stimulus was reduced to match the amplitude of the original sentence.
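The mixing and level-matching steps can be written in a few lines (a sketch under the assumption that a continuous speech-spectrum noise array is already available; this is not the original processing code):

```python
import numpy as np

def add_noise_at_snr(x, noise, snr_db):
    """Mix speech with a noise background at a given signal-to-noise ratio
    (in dB), then rescale so the mixture has the same RMS level as the
    original sentence."""
    x = np.asarray(x, dtype=float)
    n = np.asarray(noise[:len(x)], dtype=float)
    rms = lambda s: np.sqrt(np.mean(s ** 2))
    # scale the noise so that 20*log10(rms(speech)/rms(noise)) equals snr_db
    n = n * (rms(x) / (rms(n) * 10 ** (snr_db / 20.0)))
    mix = x + n
    return mix * (rms(x) / rms(mix))    # match overall level to the original

# e.g., the three conditions in the study used SNRs of -1, -4, and -6 dB:
# noisy = add_noise_at_snr(speech, speech_spectrum_noise, snr_db=-4)
```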
In addition to these three forms of distortion, a signal-correlated noise baseline was generated using the same algorithm as that for segmented speech but without periods of clear speech. Sentences processed in this way sound like a rhythmic sequence of noise bursts, carry no linguistic information, and are entirely unintelligible (Schroeder, 1968).
Pilot study
A pilot behavioral study was conducted to ensure that a continuum of intelligibility was obtained for each form of distortion. Eighteen native English speakers heard the sentences, one at a time, over closed-ear headphones (model DT770; Beyerdynamic, West Sussex, UK) played from the soundcard of a Dell laptop PC (Dell Computer Company, Round Rock, TX). Immediately after each item, participants were required either to type as many words as they could understand or to rate its intelligibility on a nine-point scale. Sentences were pseudorandomly assigned to a type and level of distortion (three versions of the test were created, with the same sentences assigned to different conditions). Each subject was tested on one version of this behavioral study and therefore heard each sentence only once. Word-report performance (calculated as the proportion of words per sentence that were reported correctly) and rated intelligibility were averaged over five items per condition per subject. Six levels of intelligibility were tested for each form of distortion. Across the 19 conditions tested (six levels of each of the three types of distortion, plus clear speech), word-report scores and rated intelligibility were reliably correlated (r = 0.99; p < 0.001).
We selected three levels of each form of distortion described above: a low-intelligibility condition (∼20% of words reported correctly), a medium-intelligibility condition (∼65% of words reported correctly), and a high-intelligibility condition (∼90% of words reported correctly) (Fig. 2). ANOVA comparing intelligibility ratings showed no significant difference between the three types of distortion at any level of intelligibility (low intelligibility, F(2,34) = 1.58, p > 0.1; medium and high intelligibility, both F values < 1). However, for word-report scores, some differences between types of distortion were reliable at each level of intelligibility: at low intelligibility (F(2,34) = 8.75; p < 0.001), pairwise comparisons indicated significantly reduced intelligibility for vocoded speech; at medium intelligibility (F(2,34) = 4.35; p < 0.05), increased intelligibility for segmented speech; and at high intelligibility (F(2,34) = 3.76; p < 0.05), marginally reduced intelligibility for speech in noise.
Scanning procedure
Twelve right-handed volunteers (five females) between 18 and 42 years of age were scanned. All subjects were native speakers of English, without any history of neurological illness, head injury, or hearing impairment. This study was approved by the Addenbrooke's Local Research Ethics Committee (Cambridge, UK), and written informed consent was obtained from all subjects. Volunteers were told that they would be listening to sentences that were distorted with different amounts and types of noise and were asked to rate the intelligibility of each item using a four-alternative button press with their right hand. The alternatives ranged from understanding most or all of the sentence (index finger; button 4) to none or not very much of the sentence (little finger; button 1). Volunteers were given a short period of practice in the scanner with a different set of sentences that were processed in the same way as the experimental items.
We acquired imaging data using a Medspec (Bruker, Ettlingen, Germany) 3 tesla MRI system with a head gradient set. Echo-planar imaging (EPI) volumes (228 in total) were acquired over two 17 min sessions. Each volume consisted of 21 × 4 mm thick slices with an interslice gap of 1 mm; field of view, 25 × 25 cm; matrix size, 128 × 128; echo time, 27 msec; acquisition time, 3.02 sec; and actual repetition time, 9 sec. We used a sparse imaging technique in which stimuli are presented in the silent period between successive scans, minimizing acoustic interference (Edmister et al., 1999; Hall et al., 1999). Acquisition was transverse oblique, angled away from the eyes, and covered all of the brain except in a few cases (the very top of the superior parietal lobule, the anterior inferior temporal cortex, and the inferior aspect of the cerebellum).
Two scanning sessions of 114 trials each were performed. Each trial comprised a stimulus item followed by a tone pip and a single EPI volume (Fig. 3). Stimulus items were pseudorandomly drawn from the 11 experimental conditions (low-, medium-, and high-intelligibility conditions for each of the three forms of distortion, plus signal-correlated noise and clear speech). There were 19 trials of each stimulus type and an additional 19 silent trials. Stimulus onset and offset were jittered relative to scan onset by temporally aligning the midpoint of the stimulus item (0.8–2.1 sec after sentence onset) with the midpoint of the gap between scans (6 sec), thus ensuring that scans were obtained 3–6 sec after stimulation. This coincided with the peak of the hemodynamic response evoked by the stimulus (Edmister et al., 1999; Hall et al., 1999). The tone pip occurred 1 sec after stimulus offset (or at a matched position in silent trials) and cued the subject to rate the intelligibility of the item just presented [or to make a self-determined (random) button press on silent trials]. Items from the 190-sentence corpus were pseudorandomly assigned to different distortion conditions using three different forms of randomization. Sentences presented as SCN were drawn from those used in the other conditions; no other items were presented more than once in the experiment. Stimuli were presented diotically, using a high-fidelity auditory stimulus-delivery system incorporating flat-response electrostatic headphones inserted into sound-attenuating ear defenders (Palmer et al., 1998). To further attenuate scanner noise, participants wore insert earplugs (E.A.R. Supersoft; Aearo Company, Indianapolis, IN) rated to attenuate by ∼30 dB. When wearing earplugs and ear defenders, participants reported that the scanner noise was unobtrusive and that sentences were presented at a comfortable listening volume and at equal levels in both ears. Custom software (Palmer et al., 1998) was used to present the stimulus items, and DMDX (Forster and Forster, 2003) was used to record button-press responses.
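The timing scheme can be made concrete with a small worked example using the values reported above (a ~6 sec silent gap between scans and sentence durations of 1.7–4.3 sec); the helper function is illustrative rather than the presentation software actually used.

```python
# Sketch of the sparse-imaging trial timing described above.
GAP = 6.0                                   # silent interval between scans (sec)

def stimulus_onset(duration, gap=GAP):
    """Onset (sec after the previous scan ends) that aligns the stimulus
    midpoint with the midpoint of the silent gap between scans."""
    return gap / 2.0 - duration / 2.0

for dur in (1.7, 4.3):                      # shortest and longest sentences
    onset = stimulus_onset(dur)
    print(f"{dur:.1f} s sentence: onset {onset:.2f} s after the previous scan; "
          f"next scan begins {GAP - (onset + dur):.2f} s after sentence offset "
          f"({GAP - onset:.2f} s after sentence onset)")
```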
Analysis of fMRI data
Data processing and analysis were accomplished using Statistical Parametric Mapping (SPM99; Wellcome Department of Cognitive Neurology, London, UK). Preprocessing steps included within-subject realignment, spatial normalization of the functional images to a standard EPI template (masking regions of susceptibility artifact to reduce tissue distortion) (Brett et al., 2001), and spatial smoothing using a Gaussian kernel of 12 mm, suitable for random-effects analysis (Xiong et al., 2000).
We were interested, first of all, in identifying areas in which activation correlated with intelligibility (see Fig. 4a). Within these intelligibility-sensitive areas, we then wanted to differentiate between areas of form dependence (activation that was sensitive to the acoustic form of the stimulus, as shown in Fig. 4b) and areas of form independence (areas that responded equivalently to the different forms of distortion). This distinction might plausibly separate areas involved in lower-level acoustic processes from higher-order linguistic levels of processing. In addition, we thought it would be informative to establish the overlap, if any, between intelligibility-sensitive form-dependent areas and primary-like cortical auditory areas. Such cortical auditory areas were identified as those exhibiting an elevated response to signal-correlated noise over silence. Some spatial segregation of the two response types might indicate a hierarchy of processing within auditory cortices as stimulus characteristics become more complex, such as has been observed in the macaque (Rauschecker et al., 1995; Rauschecker, 1998).
We also wanted to identify brain areas exhibiting increased response to degraded speech stimuli, over and above any correlation with intelligibility. We hypothesized that, when speech is difficult to understand (i.e., when speech is distorted yet still potentially intelligible), listeners will make additional effort to extract as much meaning as possible. This might be reflected in an increased brain response to distorted speech compared with clear speech (which, with sparse imaging, a high-fidelity sound delivery system, and comfortable listening volume, was presented under near-ideal conditions for effortless comprehension). Because this elevated response for more distorted speech could also arise from processes that are recruited as the auditory stimulus becomes less comprehensible and therefore less engaging (cf. Stark and Squire, 2001), we also compared activation for distorted speech with signal-correlated noise, which, as is immediately evident to the subjects, is not speech (see Fig. 4c). An elevated response to the distorted conditions, over normal speech and SCN, would be consistent with mechanisms acting to enhance comprehension for potentially intelligible input. Such mechanisms may be domain general (e.g., attentional modulation of auditory processing) or more specific to speech processing. Overlap with other kinds of response would be informative in this regard; a distortion-elevated response that overlapped with sensitivity to distortion type might indicate compensation acting on an acoustic level (consistent with attentional facilitation). In contrast, areas exhibiting an elevated response to distorted input and sensitivity to intelligibility, independent of distortion type, might indicate that distortion places additional “load” on higher-level linguistic processes (these alternatives are presented in more detail in Discussion).
Two separate design matrices were constructed for each listener to optimize sensitivity to our effects of interest. The primary analysis included two columns indicating both the presentation of a sentence before each scan (as opposed to a silent period) and the mean proportion of words reported correctly for that type and level of distortion in the behavioral pilot. [We used word-report scores as a covariate for two reasons: (1) to compensate for the small but significant differences in the intelligibility of the three types of distortion identified in the pilot study; and (2) because report scores provide a more objective measure than the ratings obtained during scanning, and the two measures are highly correlated (see Fig. 2).] Three additional columns were included in the design matrix to code which of the three types of distortion was applied to each sentence (scans following signal-correlated noise and clear-speech stimuli were modeled only in the first two columns). Realignment parameters and a dummy variable coding the two scanning sessions were included as covariates of no interest, and a correction for global signal magnitude was made. The second design matrix was used to evaluate activation for signal-correlated noise sentences compared with silence and to obtain signal change estimates for each condition (see Figs. 5f, 7d); it included a single column for each of the twelve conditions in the experiment and covariates of no interest as before.
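The structure of the primary design matrix can be sketched as follows. The condition labels, word-report values, and simple shuffling are illustrative stand-ins for the pilot data and the SPM99 model (nuisance regressors are indicated only in a comment); this is a schematic, not the authors' analysis code.

```python
import numpy as np

conditions = ["silence", "SCN", "clear",
              "seg_low", "seg_med", "seg_high",
              "voc_low", "voc_med", "voc_high",
              "noise_low", "noise_med", "noise_high"]
word_report = {"SCN": 0.0, "clear": 1.0,                 # assumed values; the real
               "seg_low": 0.2, "seg_med": 0.65,          # covariate came from the
               "seg_high": 0.9, "voc_low": 0.2,          # behavioral pilot
               "voc_med": 0.65, "voc_high": 0.9,
               "noise_low": 0.2, "noise_med": 0.65, "noise_high": 0.9}

def design_row(cond):
    sentence = float(cond != "silence")                  # a sentence preceded this scan
    report = word_report.get(cond, 0.0)                  # intelligibility covariate
    seg, voc, noi = (float(cond.startswith(p))           # distortion-type columns
                     for p in ("seg", "voc", "noise"))   # (SCN/clear left at zero)
    return [sentence, report, seg, voc, noi]

trial_order = np.repeat(conditions, 19)                  # 19 trials per condition = 228 scans
np.random.shuffle(trial_order)                           # pseudorandom order (simplified)
X = np.array([design_row(c) for c in trial_order])
# Realignment parameters, a session regressor, and global scaling would be appended
# as further columns before fitting the GLM, e.g.:
# beta = np.linalg.lstsq(np.column_stack([X, np.ones(len(X))]), y, rcond=None)[0]
```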
The parameter estimates for each subject, derived from the least-mean-squares fit of these models, were entered into second-level group analyses in which t values were calculated for each voxel, treating intersubject variation as a random effect. For main effects of intelligibility and compensation for distortion, we report activation foci that survive a whole-brain false discovery rate (FDR) (Genovese et al., 2002) correction at p < 0.05. This procedure controls the expected proportion of false positives among suprathreshold voxels at the specified rate (0.05). When the null hypothesis is true everywhere (i.e., there are no activated voxels), the FDR procedure controls the familywise error rate, as does a Bonferroni correction (Benjamini and Hochberg, 1995). Contrasts that were used to evaluate sensitivity to acoustic form (i.e., detecting differences between the three forms of distortion) were applied over the whole brain (with appropriate correction) and within regions of interest defined by the areas revealed as active by the intelligibility and compensation-for-distortion contrasts.
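For reference, the Benjamini-Hochberg step underlying the FDR correction can be expressed in a few lines (a generic sketch, not SPM99's implementation):

```python
import numpy as np

def fdr_threshold(p_values, q=0.05):
    """Benjamini-Hochberg FDR threshold: return the largest sorted p value
    p(k) satisfying p(k) <= (k / m) * q; voxels with p at or below this
    threshold are declared suprathreshold."""
    p = np.sort(np.asarray(p_values).ravel())
    m = p.size
    below = p <= (np.arange(1, m + 1) / m) * q
    return p[below].max() if below.any() else 0.0

# Example usage on a map of voxelwise p values:
# thresh = fdr_threshold(p_map, q=0.05); active = p_map <= thresh
```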
Results
Behavioral data were not available from four subjects in the fMRI study. Mean four-point ratings from the remaining eight subjects correlated highly (r = 0.98; p < 0.001) with report scores from the 18 subjects in the pilot study (Fig. 2). (Report scores of zero were assumed for signal-correlated noise in the pilot study.) Given the greater accuracy and consistency of the report scores compared with the ratings obtained during scanning, we used the report scores as regressors in the analysis of the fMRI data. Intelligibility-sensitive areas were identified as those voxels in which the blood oxygenation level-dependent (BOLD) signal was correlated with word-report scores (Fig. 4a).
The comparison of SCN versus silence across subjects yielded activation bilaterally in Heschl's gyrus and surrounding areas, consistent with recruitment of core and belt auditory cortex, as predicted (Fig. 5a,b, pink-blue intensity scale).
Correlation with intelligibility
The BOLD signal was positively correlated with word-report score in voxels along the length of the superior and middle temporal gyri in the left hemisphere, extending outward from auditory cortex toward the temporal pole and the temporoparietal junction (Fig. 5a,b, red and yellow scale). Similar, less-extensive activation was observed in the right superior and middle temporal gyri. A portion of the left inferior frontal gyrus also showed a positive correlation with intelligibility, as did the body of the left hippocampal complex (Fig. 5d). Within the superior temporal gyri, a correlation with intelligibility was observed in areas adjacent to those activated in the SCN–silence contrast (Fig. 5a,b).
To test for brain areas sensitive to differences between the three forms of distortion, the intelligibility-responsive region was masked by all six possible contrasts between pairs of the three distortion types (Fig. 4b). Setting the threshold for each of these six contrasts to p = 0.00851 results in a combined α level of p < 0.05 [because, by binomial expansion, 0.95 = (1 − 0.00851)^6]. Intelligibility-responsive areas in which none of these contrasts reach significance at p < 0.00851 can therefore be considered to be form independent (i.e., insensitive to differences between types of distortion) at an uncorrected p > 0.05 (Fig. 5c,d,e, blue; Table 1, top). A form-independent correlation with intelligibility was observed in the anterior middle temporal gyrus bilaterally and in the posterior superior temporal sulcus, inferior frontal gyrus, hippocampus, and precuneus in the left hemisphere (for a typical response profile, see Fig. 5f).
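The per-contrast threshold follows directly from the combined α level and can be verified with a one-line calculation (a worked check, not part of the original analysis):

```python
# With six contrasts each thresholded at p_each, the chance of any false
# positive under the null stays at 0.05 when (1 - p_each) ** 6 = 0.95.
p_each = 1 - 0.95 ** (1 / 6)
print(round(p_each, 5))   # 0.00851
```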
By increasing the statistical threshold for the form-dependent contrasts to a more stringent significance level (p < 0.001 uncorrected, corresponding to p < 1.67 × 10^−4 in each of the six contrasts, or to an FDR-corrected p < 0.00851 within the region of interest), we identified two areas in which activation not only reflects a response to intelligibility but also shows form dependence at a corrected level of significance. This contrast identifies areas that are engaged in spoken language comprehension (as shown by the significant correlation between activation and intelligibility) but that also show differential activation depending on the acoustic form of the distorted speech presented. Such a response was observed bilaterally in the superior temporal gyrus (Fig. 5c,d,e, red).
To explore the nature of the sensitivity to acoustic form observed in the form-dependent regions, an estimate of the effect of different distortion types was calculated for peak voxels in the left and right superior temporal gyrus (Fig. 6a,b; see figure legend for coordinates). In both hemispheres, this difference arises through an elevated response to segmented speech compared with the other two forms of distortion, particularly speech in noise.
Compensation for distortion
We reasoned that mechanisms involved in compensating for distortion are not required in either the clear-speech condition or the signal-correlated-noise condition, and we identified relevant areas by comparing activation for the potentially intelligible conditions with both the fully intelligible and the completely unintelligible conditions. In addition, because differences in intelligibility between conditions were included in the model, elevated activity for distorted speech is statistically independent of a response that is correlated with intelligibility. A distortion-elevated response was observed in left-hemisphere areas, including the middle and superior temporal gyri, the opercular part of the inferior frontal gyrus, the lateral inferior frontal gyrus, the posterior middle frontal gyrus (premotor cortex), and the ventral anterior nucleus of the thalamus (Fig. 7a; Table 1, bottom). Both a distortion-elevated response and a correlation with intelligibility were observed in the left temporal cortex and left frontal operculum.
As for the correlation with intelligibility, the distortion-elevated response was masked with a combined map showing sensitivity to different forms of distortion. Regions surviving this exclusive masking procedure (at p > 0.00851, uncorrected, for each of the six contrasts, equivalent to a combined p > 0.05) are considered to be insensitive to acoustic form (Fig. 7b,c, blue; for the response profile in a typical voxel, see Fig. 7d). This analysis also revealed that a subset of the areas showing a distortion-elevated response was also sensitive to acoustic form (Fig. 7b,c, red). Interestingly, the distortion-elevated response in the temporal lobe was primarily form dependent, except for its most posterior, anterior, and inferior aspects. As discussed before (and shown in Fig. 6a), this form-dependent response arises from an elevated response to segmented speech. In contrast, most of the frontal-lobe distortion-elevated activation was form independent, except for a region in the frontal operculum, just lateral to the insula. This region exhibited a significantly elevated response for vocoded speech compared with segmented speech (Fig. 6c).
Discussion
Our observation of intelligibility-sensitive regions in the lateral temporal lobe replicates and extends the findings of previous functional imaging studies (Binder et al., 2000; Scott et al., 2000; Vouloumanos et al., 2001). These studies used subtractive designs; regions that were active for intelligible speech were identified by comparison with nonspeech baseline conditions that may not be of equivalent acoustic structure or complexity. Although Scott et al. (2000) used more than one form of intelligible speech with a well controlled baseline, their design did not permit them to fractionate areas that respond to speech intelligibility into regions that are involved in low-level acoustic and higher-level nonacoustic processing. Our correlational design reduces the importance of subtractions from baseline conditions. Comparing among different forms of distortion allows us to test for acoustic processes within intelligibility-responsive areas.
Sound (compared with silence) produced activation in the probable location of primary auditory cortex (Rivier and Clarke, 1997; Morosan et al., 2001; Rademacher et al., 2001). Importantly, activation in primary auditory cortex did not correlate reliably with intelligibility; instead, the bilateral temporal-lobe region in which activation correlated with intelligibility is adjacent to primary auditory cortex. The form-dependent portion of this intelligibility-sensitive region includes both auditory belt and parabelt areas (and beyond) and, therefore, probably more than one processing stage (Rauschecker et al., 1995; Rauschecker, 1998; Kaas and Hackett, 2000), although these data cannot speak to further functional segregation.
The form-dependent profile observed in these periauditory areas arises from an increased response to segmented speech compared with other forms of distortion that are matched on intelligibility. This may reflect differential sensitivity of neurons to particular acoustic features of speech. For instance, neurons sensitive to rapid spectral changes or formant transitions that are present in clear speech might respond more strongly to segmented speech. Alternatively, neurons that are sensitive to transitions between periodic and aperiodic sounds might respond more strongly to segmented speech, because these transitions are absent from the other forms of distortion.
Surrounding this periauditory form-dependent region anteriorly, posteriorly, and inferolaterally, we observed areas in which activation correlated significantly with intelligibility but was insensitive to acoustic differences among types of distortion. We conclude that these form-independent areas are involved in processing speech at more abstract nonacoustic levels of representation. The hierarchical structure that we infer from these results is consistent with cognitive accounts of spoken language comprehension (McClelland and Elman, 1986; Gaskell and Marslen-Wilson, 1997) in which lexical and semantic processes are driven by the output of lower-level acoustic and phonetic processes. This finding also mirrors what is known of the anatomical and functional organization of the auditory system in nonhuman primates. Whereas form-dependent responses were observed in both core and belt areas of auditory cortex, it is only in the parabelt and more distant polymodal cortex that we see a form-independent response to the intelligibility of speech signals.
A stream of processing, specialized for sound–object identification, has been documented previously in nonhuman primates. This extends anteriorly within lateral temporal neocortex (Rauschecker and Tian, 2000; Tian et al., 2001), similar to the anterior temporal portion of the form-independent intelligibility response. Future work to determine the functional specialization of these anterior temporal regions might therefore focus on whether responses in these regions are affected by the lexical and semantic content of sentences. An additional inferior frontal area exhibited a similar form-independent intelligibility profile. This area in humans, as in other primates, may receive projections from anterior auditory areas and anterior temporal lobe, extending the anteroventral-processing stream into ventrolateral frontal cortex (Hackett et al., 1999; Romanski et al., 1999a,b; Mamata et al., 2002).
We also observed a form-independent, intelligibility-related response in the left posterior superior temporal gyrus and left angular gyrus. These activations may be indicative of other parallel streams of processing, extending posteriorly from auditory and form-dependent regions (Hickok and Poeppel, 2000; Scott and Johnsrude, 2003). Although the functional significance of such posterior streams has yet to be firmly established, one proposal common to these accounts is that a stream running dorsal to the sylvian fissure may play a role in linking the perception and production of speech. In support of this account, a number of previous cognitive models have proposed separate processing pathways involved in phonological versus lexical processing of speech (Gaskell and Marslen-Wilson, 1997; Norris et al., 2000).
An additional region in which the BOLD signal was correlated with intelligibility in a form-independent manner was the left anterior hippocampus. The medial temporal-lobe structures of the left (usually language-dominant) hemisphere are known to be required for the encoding and retention of verbal material (Milner, 1958; Johnsrude, 2001; Strange et al., 2002). Results from neuroimaging studies suggest that activation in the left anterior hippocampus is sensitive to the presence of meaning in verbal stimuli (Martin et al., 1997; Otten et al., 2001). The correlation that we observe may thus reflect the increasing memorability of sound sequences as they become more meaningful.
In addition to establishing anatomical specialization for speech comprehension, we wanted to identify brain areas that exhibited an increased response to degraded speech stimuli compared with both clear speech and an unintelligible baseline. We observed a left-lateralized frontal and temporal lobe system that showed this profile. This network of areas is consistent with anatomical connectivity. Auditory belt and parabelt are known (from work in nonhuman primates) to be reciprocally connected with prefrontal areas, including premotor cortex and areas in the inferior frontal gyrus, such as Brodmann area 45 (Hackett et al., 1999; Romanski et al., 1999a,b). These connections may provide a means by which frontal areas can modulate the operation of lower-level auditory areas in the temporal lobe during effortful comprehension of spoken language, thereby assisting in the recovery of meaning from distorted speech input.
Mechanisms to compensate for degraded input need not be speech specific. Low-level auditory or attentional processes that segregate speech from noise or restore continuity to speech briefly masked by noise (Cherry, 1953; Warren, 1970; Bregman, 1990) may assist in perceiving speech heard in noisy environments. Because the distortion-elevated activation in the temporal lobe was primarily form dependent (i.e., sensitive to distortion type) and particularly pronounced in areas adjacent to the probable location of primary auditory cortex, this response may reflect increased allocation of attention to spoken input (Grady et al., 1997). Because the response of this temporal lobe region was particularly pronounced for segmented speech, we speculate that perceiving speech in the “gaps” between noise bursts places a particular demand on this attentional system (cf. Warren, 1970; Bashford et al., 1996).
In contrast to the response profile observed in temporal lobe regions, the elevated response to distorted speech in the left frontal cortex was primarily insensitive to the form of distortion applied, as might be expected for compensatory processes that apply at a nonacoustic level. Although some of these frontal regions may not be involved in processes specific to language comprehension (such as decision processes involved in assessing intelligibility), we propose that a restricted portion of this activated area, the inferior frontal region in which responses were also correlated with intelligibility, contributes to the linguistic processes involved in accessing and combining word meanings (Thompson-Schill et al., 1997; Wagner et al., 2001). All three forms of distortion might be expected to draw more heavily on processes in which semantic or syntactic context is used to recover words and meanings that cannot be identified from bottom-up information alone (Miller et al., 1951; Gordon-Salant and Fitzgibbons, 1997). The results of previous work, in which sentence comprehension is challenged by the inclusion of more complex grammatical structures or lexical ambiguity, are consistent with this hypothesized role for left inferior frontal regions (Kaan and Swaab, 2002; Rodd et al., 2002).
Finally, we observed a focal region in the frontal operculum that showed an elevated response to distortion that was form dependent (particularly sensitive to noise-vocoded speech). This was the only form-dependent region that we observed outside of the temporal lobe and may correspond to an area previously identified electrophysiologically in macaques, which responds specifically to auditory stimuli, including both vocal and nonvocal sounds (Romanski and Goldman-Rakic, 2002). Additional behavioral work investigating comprehension of noise-vocoded speech may be informative in assessing the role of this region of elevated activation.
Footnotes
This work was supported by the Medical Research Council (UK). We thank Paul Boersma and Chris Darwin for assistance with Praat scripts, Philip Dilks and Iain Turnbull for their help with the pilot study, the staff of the Wolfson Brain Imaging Centre, University of Cambridge, for their help with data acquisition, Matthew Brett and Ian Nimmo-Smith for advice on image processing and statistical analysis, and Brian Cox for his assistance with figures. We are also grateful to Daniel Bor, John Duncan, Stefan Kohler, William Marslen-Wilson, Dennis Norris, Sophie Scott, and anonymous reviewers for comments and suggestions. Example sounds can be found on the Internet at: www.mrc-cbu.cam.ac.uk/∼matt.davis/jneurosci/.
Correspondence should be addressed to Matt Davis, Medical Research Council Cognition and Brain Sciences Unit, 15 Chaucer Road, Cambridge CB2 2EF, UK. E-mail: matt.davis@mrc-cbu.cam.ac.uk.