Language and music exhibit similar acoustic and structural properties, and both appear to be uniquely human. Several recent studies suggest that speech and music perception recruit shared computational systems, and a common substrate in Broca's area for hierarchical processing has recently been proposed. However, this claim has not been tested by directly comparing the spatial distribution of activations to speech and music processing within subjects. In the present study, participants listened to sentences, scrambled sentences, and novel melodies. As expected, large swaths of activation for both sentences and melodies were found bilaterally in the superior temporal lobe, overlapping in portions of auditory cortex. However, substantial nonoverlap was also found: sentences elicited more ventrolateral activation, whereas the melodies elicited a more dorsomedial pattern, extending into the parietal lobe. Multivariate pattern classification analyses indicate that even within the regions of blood oxygenation level-dependent response overlap, speech and music elicit distinguishable patterns of activation. Regions involved in processing hierarchical aspects of sentence perception were identified by contrasting sentences with scrambled sentences, revealing a bilateral temporal lobe network. Music perception showed no overlap whatsoever with this network. Broca's area was not robustly activated by any stimulus type. Overall, these findings suggest that basic hierarchical processing for music and speech recruits distinct cortical networks, neither of which involves Broca's area. We suggest that previous claims are based on data from tasks that tap higher-order cognitive processes, such as working memory and/or cognitive control, which can operate in both speech and music domains.
Language and music share a number of interesting properties spanning acoustic, structural, and possibly even evolutionary domains. Both systems involve the perception of sequences of acoustic events that unfold over time with both rhythmic and tonal features; both systems involve a hierarchical structuring of the individual elements to derive a higher-order combinatorial representation; and both appear to be uniquely human biological capacities (Lerdahl and Jackendoff, 1983; McDermott and Hauser, 2005; Patel, 2007). As such, investigating the relation between neural systems supporting language and music processing can shed light on the underlying mechanisms involved in acoustic sequence processing and higher-order structural processing, which in turn may inform theories of the evolution of language and music capacity.
Recent electrophysiological, functional imaging, and behavioral work has suggested that structural processing in language and music draws on shared neural resources (Patel, 2007; Fadiga et al., 2009). For example, presentation of syntactic or musical “violations” (e.g., ungrammatical sentences or out-of-key chords) results in a similar modulation of the P600 event-related potential (ERP) (Patel et al., 1998). Functional magnetic resonance imaging (MRI) studies have reported activation of the inferior frontal gyrus during the processing of aspects of musical structure (Levitin and Menon, 2003; Tillmann et al., 2003), which is a region that is also active during the processing of aspects of sentence structure (Just et al., 1996; Stromswold et al., 1996; Caplan et al., 2000; Friederici et al., 2003; Santi and Grodzinsky, 2007), although there is much debate regarding the role of this region in structural processing (Hagoort, 2005; Novick et al., 2005; Grodzinsky and Santi, 2008; Rogalsky et al., 2008; Rogalsky and Hickok, 2009, 2010). Behaviorally, it has been shown that processing complex linguistic structures interacts with processing complex musical structures: accuracy in comprehending complex sentences in sung speech is degraded if the carrier melody contains an out-of-key note at a critical juncture (Fedorenko et al., 2009).
Although this body of work suggests that there are similarities in the way language and music structure are processed under some circumstances, a direct within-subject investigation of the brain regions involved in language and music perception has not, to our knowledge, been reported. As noted by Peretz and Zatorre (2005), such a direct comparison is critical before drawing conclusions about possible relations in the neural circuits activated by speech and music. This is the aim of the present functional MRI (fMRI) experiment. We assessed the relation between music and language processing in the brain using a variety of analysis approaches including whole-brain conjunction/disjunction, region of interest (ROI), and multivariate pattern classification analyses (MVPA). We also implemented a rate manipulation in which linguistic and melodic stimuli were presented 30% faster or slower than their normal rate. This parametric temporal manipulation served as a means to assess the effects of temporal envelope modulation rates, a stimulus feature that appears to play a major role in speech perception (Shannon et al., 1995; Luo and Poeppel, 2007), and provided a novel method for assessing the domain specificity of processing load effects by allowing us to map regions that were modulated by periodicity manipulations within and across stimulus types.
Materials and Methods
Twenty right-handed native English speakers (9 male, 11 female; mean age = 22.6 years, range 18–31) participated in this study. Twelve participants had some formal musical training (mean years of training = 3.5, range 0–8). All participants were free of neurological disease (self report) and gave informed consent under a protocol approved by the Institutional Review Board of the University of California, Irvine (UCI).
In our mixed block and event-related experiment, subjects listened to blocks of meaningless “jabberwocky” sentences, scrambled jabberwocky sentences, and simple novel piano melodies. We used novel melodies and meaningless sentences to emphasize structural processing over semantic analysis. Within each block, the stimuli were presented at three different rates, with the midrange rate being that of natural speech/piano playing (i.e., the rate at which a sentence was read or a composition played without giving the reader/player explicit rate instructions). The presentation rates of the stimuli in each block were randomized, with the restriction that each block had the same total amount of time in which stimuli were playing. Each trial consisted of a stimulus block (27 s) followed by a rest period (jittered around 12 s). Within each stimulus block, the interval between stimulus onsets was 4.5 s. Subjects listened to 10 blocks of each stimulus type, in a randomized order, over eight scanning runs.
As mentioned above, three types of stimuli were presented: jabberwocky sentences, scrambled jabberwocky sentences, and simple novel melodies. The jabberwocky sentences were generated by replacing the content words of normal, correct sentences with pseudowords (e.g., “It was the glandar in my nederop”); these sentences had a mean length of 9.7 syllables (range = 8–13). The scrambled jabberwocky sentences were generated by randomly rearranging the word order of the previously described jabberwocky sentences; the resulting sequences, containing words and pseudowords, were recorded and presented as concatenated lists. As recorded, these sentences and scrambled sentences had a mean duration of 3 s (range = 2.5–3.5). The sentences and scrambled sentences then were digitally edited using sound-editing software to generate the two other rates of presentation: each stimulus's tempo was increased by 30%, and also decreased by 30%. The original (normal presentation rate) sentences and scrambled sentences averaged 3.25 syllables/s; the stimuli generated by altering the tempo of the original recordings averaged 4.23 syllables/s (30% faster tempo than normal) and 2.29 syllables/s (30% slower tempo than normal), respectively.
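The relation between the tempo change and the syllable rate can be checked with a short calculation. The sketch below assumes an idealized, uniform tempo change applied to the reported normal rate of 3.25 syllables/s; the helper name `scaled_rate` is hypothetical and is not part of the authors' pipeline. The measured averages reported above (4.23 and 2.29 syllables/s) are close to these idealized values.

```python
# Hypothetical helper: expected syllable rate after a uniform tempo change.
def scaled_rate(base_rate: float, tempo_change: float) -> float:
    """Return the rate after changing tempo by `tempo_change` (+0.30 = 30% faster)."""
    return base_rate * (1.0 + tempo_change)

BASE_RATE = 3.25  # syllables/s at the normal rate, as reported in the text

faster = scaled_rate(BASE_RATE, +0.30)  # idealized rate for the speeded stimuli
slower = scaled_rate(BASE_RATE, -0.30)  # idealized rate for the slowed stimuli

print(f"30% faster: about {faster:.3f} syllables/s")
print(f"30% slower: about {slower:.3f} syllables/s")
```

A tempo increase raises the number of syllables per second and a tempo decrease lowers it, so the larger of the two reported averages belongs to the speeded stimuli.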
The melody stimuli were composed and played by a trained pianist. The composition of each melody outlined a common major or minor chord in the system of tonal Western harmony, such as C major, F major, G major, or D minor. Most pitches of the melodies were in the fourth register (octave) of the piano; pitches ranged from G3 to D5. Durations in the sequence were chosen to sound relatively rhythmic according to typical Western rhythmic patterns. There were 5–17 pitches in each melodic sequence (on average, ∼8 pitches per melody). Pitch durations ranged from 106 to 1067 ms (mean = 369 ms, SD = 208 ms). The melodies were played on a Yamaha Clavinova CLP-840 digital piano and were recorded through the MIDI interface, using MIDI sequencing software. The melodies, as played by the trained pianist, averaged 3 s in length. The melodies (like the sentences and scrambled sentences) were then digitally edited to generate versions of each melody that were 30% slower and 30% faster than the original melodies.
fMRI data acquisition and processing.
Data were collected on the 3T Philips Achieva MR scanner at the UCI Research Imaging Center. A high-resolution anatomical image was acquired, in the axial plane, with a three-dimensional spoiled gradient-recalled acquisition pulse sequence for each subject [field of view (FOV) = 250 mm, repetition time (TR) = 13 ms, flip angle = 20°, voxel size = 1 mm × 1 mm × 1 mm]. Functional MRI data were collected using single-shot echo-planar imaging (FOV = 250 mm, TR = 2 s, echo time = 40 ms, flip angle = 90°, voxel size = 1.95 mm × 1.95 mm × 5 mm). MRIcro (Rorden and Brett, 2000) was used to reconstruct the high-resolution structural image, and an in-house Matlab program was used to reconstruct the echo-planar images. Functional volumes were aligned to the sixth volume in the series using a six-parameter rigid-body model to correct for subject motion (Cox and Jesmanowicz, 1999). Each volume then was spatially filtered (full-width at half-maximum = 8 mm) to better accommodate group analysis.
Data analysis overview.
Analysis of Functional NeuroImaging (AFNI) software (http://afni.nimh.nih.gov/afni) was used to perform analyses on the time course of each voxel's blood oxygenation level-dependent (BOLD) response for each subject (Cox and Hyde, 1997). Initially, a voxelwise multiple regression analysis was conducted, with regressors for each stimulus type (passive listening to jabberwocky sentences, scrambled jabberwocky sentences, and simple piano melodies) at each presentation rate. These regressors (in addition to motion correction parameters and the grand mean) were convolved with a hemodynamic response function to create predictor variables for analysis. An F statistic was calculated for each voxel, and activation maps were created for each subject to identify regions that were more active while listening to each type of stimulus at each presentation rate compared to baseline scanner noise. The functional maps for each subject were transformed into standardized space and resampled into 1 mm × 1 mm × 1 mm voxels (Talairach and Tournoux, 1988) to facilitate group analyses. Voxelwise repeated-measures t tests were performed to identify active voxels in various contrasts. We used a relatively liberal threshold of p = 0.005 to ensure that potential nonoverlap between music and speech activated voxels was not due to overly strict thresholding. 
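The regression step described above can be illustrated with a minimal sketch for a single simulated voxel. It is not the AFNI implementation: the single-gamma response shape, the alternating block order, the simulated amplitudes, and the names `gamma_hrf` and `block_regressor` are all assumptions made for illustration, and the rate conditions and motion regressors are omitted. Only the TR (2 s) and the block and rest durations (27 s, about 12 s) come from the text.

```python
import numpy as np

TR = 2.0        # s, repetition time reported in the text
BLOCK_S = 27.0  # s, stimulus block duration reported in the text
REST_S = 12.0   # s, nominal rest duration (the reported jitter is ignored here)

def gamma_hrf(tr, duration=30.0, shape=6.0):
    """Single-gamma response sampled every `tr` seconds (illustrative shape only)."""
    t = np.arange(0.0, duration, tr)
    h = t ** (shape - 1.0) * np.exp(-t)
    return h / h.sum()

def block_regressor(onsets_s, n_vols, tr, block_s):
    """Boxcar covering each block, convolved with the response function."""
    box = np.zeros(n_vols)
    for onset in onsets_s:
        start = int(round(onset / tr))
        stop = int(round((onset + block_s) / tr))
        box[start:stop] = 1.0
    return np.convolve(box, gamma_hrf(tr))[:n_vols]

# Toy design: sentence and melody blocks alternate, each followed by rest.
n_vols = 160
cycle = BLOCK_S + REST_S
sentence_reg = block_regressor([0 * cycle, 2 * cycle, 4 * cycle, 6 * cycle], n_vols, TR, BLOCK_S)
melody_reg = block_regressor([1 * cycle, 3 * cycle, 5 * cycle, 7 * cycle], n_vols, TR, BLOCK_S)
X = np.column_stack([sentence_reg, melody_reg, np.ones(n_vols)])  # last column: mean

# Simulated voxel: responds twice as strongly to sentences as to melodies.
rng = np.random.default_rng(0)
y = 2.0 * sentence_reg + 1.0 * melody_reg + 100.0 + rng.normal(0.0, 0.5, n_vols)

# Ordinary least squares estimate of the response amplitudes.
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
contrast = betas[0] - betas[1]  # sentences minus melodies
print("betas:", np.round(betas, 2), "contrast:", round(float(contrast), 2))
```

In a whole-brain analysis the same fit is repeated for every voxel, and the resulting amplitude estimates feed the group-level contrasts.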
In these contrasts, we sought to identify regions that were (1) active for sentences compared to baseline (rest), (2) active for melodies compared to baseline, (3) equally active for sentences and melodies (i.e., the conjunction of 1 and 2), (4) more active for sentences than music, (5) more active for music than sentences, (6) more active for sentences than scrambled sentences (to identify regions selective for sentence structure over unstructured speech), or (7) more active for sentences than both scrambled sentences and melodies (to determine whether there are regions selective for sentence structure compared to both unstructured speech and melodic structure). In addition, several of the above contrasts were performed with a rate covariate (see immediately below).
Analysis of temporal envelope spectra.
Sentences, scrambled sentences, and melodies differ not only in the type of information conveyed, but also potentially in the amount of acoustic information being presented per unit time. Speech, music, and other band-limited waveforms have a quasiperiodic envelope structure that carries significant temporal-rate information. The envelope spectrum provides important information about the dominant periodicity rate in the temporal-envelope structure of a waveform. Thus, additional analyses explored how the patterns of activation were modulated by a measure of information presentation rate. To this end, we performed a Fourier transform on the amplitude envelopes extracted from the Hilbert transform of the waveforms, and determined the largest frequency component of each stimulus's envelope spectrum using MATLAB (MathWorks). The mean peak periodicity rate of each stimulus type's acoustic envelope varies between the three stimulus types at each presentation rate (Fig. 1, Table 1). Of particular importance to the interpretation of the results reported below is that, at the normal presentation rate, the dominant frequency components of the sentences' temporal envelope [mean (M) = 1.26, SD = 0.78] are significantly different from those of the scrambled sentences (M = 2.13, SD = 0.42; t(82) = 14.1, p < 0.00001); however, at the normal presentation rate, the sentences' principal frequency components are not significantly different from those of the melodies (M = 1.19, SD = 0.82; t(82) = 0.34, p = 0.73). To account for temporal envelope modulation rate effects, we conducted a multiple regression analysis, including the principal frequency component of each stimulus as a covariate regressor (Cox, 2005). This allowed us to assess the distribution of brain activation to the various stimulus types controlled for differences in peak envelope modulation rate.
In addition, we assessed the contribution of envelope modulation rate directly by correlating the BOLD response with envelope modulation rate in the various stimulus categories.
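The envelope-spectrum measure described in this section (Hilbert envelope, Fourier transform, largest frequency component) can be sketched as follows. The original analysis was run in MATLAB; this NumPy version is only an illustration. The function names, the 20 Hz search band, and the synthetic amplitude-modulated test tone are assumptions and do not come from the study.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal computed with the FFT (standard discrete Hilbert construction)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def peak_envelope_rate(waveform, fs, max_rate_hz=20.0):
    """Frequency (Hz) of the largest non-DC component of the envelope spectrum."""
    envelope = np.abs(analytic_signal(waveform))
    envelope = envelope - envelope.mean()            # remove the DC term
    power = np.abs(np.fft.rfft(envelope)) ** 2
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    band = (freqs > 0) & (freqs <= max_rate_hz)      # search band is an assumption
    return float(freqs[band][np.argmax(power[band])])

# Synthetic check: a 440 Hz tone whose amplitude is modulated at 4 Hz.
fs = 8000
t = np.arange(int(3.0 * fs)) / fs                    # 3 s, like the mean stimulus duration
tone = np.sin(2 * np.pi * 440.0 * t) * (1.0 + 0.8 * np.sin(2 * np.pi * 4.0 * t))
rate = peak_envelope_rate(tone, fs)
print("peak envelope modulation rate (Hz):", rate)
```

For a real stimulus, the returned value would play the role of the per-stimulus covariate, under the assumption that the peak is taken over a comparable low-frequency band.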
Regions of interest analyses.
Voxel clusters identified by the conjunction analyses described above were further investigated in two ways: (1) mean peak amplitudes for these clusters across subjects were calculated using AFNI's 3dmaskdump program and MATLAB, and (2) an MVPA was conducted on the overlap regions. MVPA provides an assessment of the pattern of activity within a region and is thought to be sensitive to weak tuning preferences of individual voxels driven by nonuniform distributions of the underlying cell populations (Haxby et al., 2001; Kamitani and Tong, 2005; Norman et al., 2006; Serences and Boynton, 2007).
The MVPA was performed to determine whether the voxels of overlap in the conjunction analyses were in fact exhibiting the same response pattern to both sentences and melodies. This analysis proceeded as follows: ROIs for the MVPA were defined for each individual subject by identifying, in the four odd-numbered scanning runs, the voxels identified by both (i.e., the conjunction of) the sentences > rest and the melodies > rest comparisons (p < 0.005) across all presentation rates. In addition, a second set of ROIs was defined in a similar manner, but also included the envelope modulation rate for each stimulus as a covariate. All subjects yielded voxels meeting these criteria in both hemispheres. The average ROI sizes for the left and right hemispheres were as follows: 93.4 (range 22–279) and 37.2 (range 6–94) voxels without the covariate, respectively, and 199.6 (range 15–683) and 263.7 (range 26–658) voxels with the envelope modulation rate covariate.
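The ROI definition above amounts to a voxelwise logical AND of two thresholded maps. The sketch below assumes that each input is a per-voxel p value for a stimulus > rest comparison computed from the odd-numbered runs; the function name and the toy values are hypothetical.

```python
import numpy as np

def conjunction_mask(p_sentences, p_melodies, alpha=0.005):
    """Voxels that pass the threshold in BOTH comparisons (logical AND)."""
    return (p_sentences < alpha) & (p_melodies < alpha)

# Toy p values for five voxels (made up for illustration).
p_sent = np.array([0.001, 0.001, 0.200, 0.004, 0.500])
p_mel = np.array([0.002, 0.300, 0.001, 0.0049, 0.600])

overlap = conjunction_mask(p_sent, p_mel)               # sentence-melody overlap ROI
sentence_only = (p_sent < 0.005) & ~(p_mel < 0.005)     # passes for sentences only
print("overlap voxels:", np.flatnonzero(overlap))
print("sentence-only voxels:", np.flatnonzero(sentence_only))
```

Defining the mask on one half of the runs and testing on the other half keeps voxel selection independent of the data used for classification.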
MVPA was then applied in this sentence–melody overlap ROI in each subject to the even-numbered scanning runs (i.e., the runs not used in the ROI identification process). In each ROI, one pairwise classification was performed to explore the spatial distribution of activation that varies as a function of the presentation of sentence versus melody stimuli, across all presentation rates. MVPA was conducted using a support vector machine (SVM) (Matlab Bioinformatics Toolbox v3.1, The MathWorks) as a pattern classification method. This process is grounded in the logic that if the SVM is able to successfully distinguish (i.e., classify) the activation pattern in the ROI for sentences from that for melodies, then the ROI must contain information that distinguishes between sentences and melodies.
Before classification, the EPI data were motion corrected (described above). The motion-corrected data were then normalized so that in each scanning run a z score was calculated for each voxel's BOLD response at each time point. In addition, to ensure that overall amplitude differences between the conditions were not contributing to significant classification, the mean activation level across the voxels within each trial was removed before classification. We then performed MVPA on the preprocessed dataset using a leave-one-out cross-validation approach (Vapnik, 1995). In each iteration, data from all but one even-numbered session were used to train the SVM classifier and then the trained SVM was used to classify the data from the remaining session. The SVM-estimated labels of sentence and melody conditions were then compared with the actual stimulus types to compute a classification accuracy score. Classification accuracy for each subject was derived by averaging the accuracy scores across all leave-one-out sessions. An overall accuracy score was then computed by averaging across all subjects.
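The preprocessing and cross-validation steps above can be sketched on simulated data. Note the substitution: the study used an SVM from the Matlab Bioinformatics Toolbox, whereas this sketch swaps in a simple nearest-centroid classifier so that it depends only on NumPy. All function names, the simulated voxel pattern, and the data sizes are assumptions.

```python
import numpy as np

def zscore_within_run(data, runs):
    """Z-score each voxel separately within each run. data: (n_trials, n_voxels)."""
    out = np.empty_like(data, dtype=float)
    for r in np.unique(runs):
        idx = runs == r
        mu = data[idx].mean(axis=0)
        sd = data[idx].std(axis=0)
        out[idx] = (data[idx] - mu) / np.where(sd > 0, sd, 1.0)
    return out

def remove_trial_mean(data):
    """Subtract each trial's mean across voxels to discard overall amplitude."""
    return data - data.mean(axis=1, keepdims=True)

def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier (NOT the SVM used in the study)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]

def leave_one_run_out_accuracy(data, labels, runs):
    """Train on all runs but one, test on the held-out run, average the scores."""
    scores = []
    for held_out in np.unique(runs):
        test = runs == held_out
        pred = nearest_centroid_predict(data[~test], labels[~test], data[test])
        scores.append(np.mean(pred == labels[test]))
    return float(np.mean(scores))

# Simulated ROI: 4 runs x 10 trials x 30 voxels, 0 = sentence, 1 = melody.
rng = np.random.default_rng(0)
n_runs, trials_per_run, n_vox = 4, 10, 30
runs = np.repeat(np.arange(n_runs), trials_per_run)
labels = np.tile([0, 1], n_runs * trials_per_run // 2)
pattern = rng.normal(size=n_vox)
pattern -= pattern.mean()                      # condition-specific spatial pattern
signs = np.where(labels == 1, 1.0, -1.0)[:, None]
raw = rng.normal(size=(len(labels), n_vox)) + signs * pattern + 2.0 * labels[:, None]

prepped = remove_trial_mean(zscore_within_run(raw, runs))
accuracy = leave_one_run_out_accuracy(prepped, labels, runs)
print("leave-one-run-out accuracy:", accuracy)
```

Because the trial mean is removed, the classifier in this sketch can rely only on the spatial pattern across voxels, not on the global amplitude offset added to one condition.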
We then statistically evaluated the classification accuracy scores using nonparametric bootstrap methods (Lunneborg, 2000). Similar classification procedures were repeated 10,000 times for the pairwise classification within each individual subject's dataset. The only difference between this bootstrapping method and the classification method described above is that in the bootstrapping, the labels of “sentences” and “melodies” were randomly reshuffled on each repetition. This process generates a random distribution of the bootstrap classification accuracy scores ranging from 0 to 1 for each subject for the pairwise classification, where the ideal mean of this distribution is at the accuracy value of 0.5. We then tested the null hypothesis that the original classification accuracy score equals the mean of the distribution via a one-tailed accumulated percentile of the original classification accuracy score in the distribution. If the accumulated p > 0.95, the null hypothesis was rejected and it was concluded that, for that subject, signal from the corresponding ROI can classify the sentence and melody conditions. In addition, a bootstrap t approach was used to assess the significance of the classification accuracy on the group level. For each bootstrap repetition, a t test accuracy score across all subjects against the ideal accuracy score (0.5) was calculated. The t score from the original classification procedures across all subjects was then statistically tested against the mean value of the distributed bootstrap t scores. An accumulated p > 0.95 was the criterion for rejecting the null hypothesis and concluding that the accuracy score is significantly greater than chance (this is the same criterion as used for the individual subject testing).
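The label-reshuffling procedure and the accumulated-percentile criterion can be sketched with the standard library alone. This is a simplified illustration: instead of retraining a classifier on every repetition, the hypothetical `score_fn` below scores a fixed set of made-up predictions against the reshuffled labels. The function names and the toy data are assumptions.

```python
import random

def shuffled_label_null(score_fn, labels, n_reps=10_000, seed=0):
    """Scores obtained after randomly reshuffling the condition labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_scores = []
    for _ in range(n_reps):
        rng.shuffle(shuffled)
        null_scores.append(score_fn(shuffled))
    return null_scores

def accumulated_percentile(observed, null_scores):
    """Fraction of null scores that fall strictly below the observed score."""
    return sum(s < observed for s in null_scores) / len(null_scores)

# Toy data: 20 trials, predictions agree with the true labels on 16 of them.
true_labels = [0, 1] * 10
predictions = list(true_labels)
for i in range(4):                      # flip four predictions to create errors
    predictions[i] = 1 - predictions[i]

def accuracy_against(labels):
    """Proportion of predictions that match the given labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

observed = accuracy_against(true_labels)
null_scores = shuffled_label_null(accuracy_against, true_labels, n_reps=2000)
percentile = accumulated_percentile(observed, null_scores)
null_mean = sum(null_scores) / len(null_scores)
print("observed accuracy:", observed)
print("null mean:", round(null_mean, 3), "accumulated percentile:", round(percentile, 3))
```

With balanced conditions the null distribution centers near 0.5, and an accumulated percentile above 0.95 corresponds to the rejection criterion described above.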
Results
Activations for sentences and melodies compared with rest
Extensive activations to both sentence and melodic stimuli were found bilaterally in the superior temporal lobe. The distribution of activity to the two classes of stimuli is far from identical: a gradient of activation is apparent with more dorsomedial temporal lobe regions responding preferentially to melodic stimuli and more ventrolateral regions responding preferentially to sentence stimuli, with a region of overlap in between, bilaterally. The melody-selective response in the dorsal temporal lobe extended into the inferior parietal lobe. Sentences activated some frontal regions (left premotor, right prefrontal), although not Broca's area (Fig. 2A).
Consistent with the claim that speech and music share neural resources, our conjunction analysis showed substantial overlap in the response to sentence and music perception in both hemispheres. However, activation overlap does not necessarily imply computational overlap or even the involvement of the same neural systems at a finer-grained level of analysis. Thus we used MVPA to determine whether the activation within the overlap region may be driven by nonidentical cell assemblies. Classification accuracy for sentence versus melody conditions in the ROI representing the region of overlap was found to be significantly above chance in both hemispheres: left hemisphere, classification accuracy [proportion correct (PC)] = 0.79, t = 12.75, p = 0.0015, classification accuracy (d′) = 2.90, t = 10.87, p = 0.0012; right hemisphere, classification accuracy (PC) = 0.79, t = 11.34, p = 0.0003, classification accuracy (d′) = 3.19, t = 10.61, p = 0.0006. This result indicates that even within regions that activate for both speech and music perception, the two types of stimuli activate nonidentical neural ensembles, or activate them to different degrees.
Analysis of the conjunction between speech and music (both vs rest) using the envelope modulation rate covariate had two primary effects on the activation pattern: (1) the speech-related frontal activation disappeared, and (2) the speech stimuli reached threshold in the medial posterior superior temporal areas, leaving the anterior region bilaterally as the dominant location for music-selective activations, although some small posterior foci remained (Fig. 2B). MVPA was run on the redefined region of overlap between music and speech. The results were the same in that the ROI significantly classified the two conditions in both hemispheres: left hemisphere, classification accuracy (PC) = 0.70, t = 6.49, p = 0.0026, classification accuracy (d′) = 1.88, t = 3.73, p = 0.0027; right hemisphere, classification accuracy (PC) = 0.75, t = 8.66, p = 0.0008, classification accuracy (d′) = 2.49, t = 6.50, p = 0.0003.
Because the envelope modulation rate covariate controls for a lower-level acoustic feature and tended to increase the degree of overlap between music and speech stimuli, indicating greater sensitivity to possibly shared neural resources, all subsequent analyses use the rate covariate.
Contrasts of speech and music
A direct contrast between speech and music conditions using the envelope modulation rate covariate was performed to identify regions that were selective for one stimulus condition relative to the other. Figure 3 shows the distribution of activations revealed by this contrast. Speech stimuli selectively activated more lateral regions in the superior temporal lobe bilaterally, while music stimuli selectively activated more medial anterior regions on the supratemporal plane and extending into the insula, primarily in the right hemisphere. It is important not to conclude from this apparently lateralized pattern for music that the right hemisphere preferentially processes music stimuli as is often assumed. As is clear from the previous analysis (Fig. 2), music activates both hemispheres rather symmetrically; the lateralization effect is in the relative activation patterns to music versus speech.
Previous work has suggested the existence of sentence-selective regions in the anterior temporal lobe (ATL) and posterior temporal lobe (bilaterally). Contrasts were performed to identify these sentence-selective regions as they have been defined previously, by comparing structured sentences with unstructured lists of words, and then to determine whether this selectivity holds up in comparison to musical stimuli, which share hierarchical structure with sentences.
The contrast between listening to sentences and listening to scrambled sentences identified bilateral ATL regions that were more active for sentence than scrambled sentence stimuli (left −53 −2 −4, right 58 −2 −4) (Fig. 4, Table 2). No inferior frontal gyrus (IFG) activation was observed in this contrast, and no regions were more active for scrambled than nonscrambled sentences. Figure 4 shows the relation between sentence-selective activations (sentences > scrambled sentences) and activations to music perception (music > rest). No overlap was observed between the two conditions.
As a stronger test of sentence-selective responses, we performed an additional analysis that looked for voxels that were more active for sentences than scrambled sentences and more active for sentences than music. This analysis confirmed that sentences yield significantly greater activity than both scrambled sentences and melodies in the left anterior (−53 7 −4; −55 −5 0), middle (−57 −21 4), and posterior (−57 −35 2) temporal lobe; fewer voxels survived this contrast in the right hemisphere, however, and those that did were limited to the anterior temporal region (53 −25 −2; 57 −3 2) (Fig. 5, Table 2).
Envelope modulation rate effects
In addition to using envelope modulation rate as a covariate to control for differences in the rate of information presentation in the different stimulus conditions, we performed an exploratory analysis to map brain regions whose activity was correlated with envelope modulation rate directly and to determine whether these regions varied by stimulus condition. The logic is that the rate manipulation functions as a kind of parametric load manipulation in which faster rates induce more processing load. If a computation is unique to a given stimulus class (e.g., syntactic analysis for sentences), then the region involved in this computation should be modulated only for that stimulus class. If a computation is shared by more than one stimulus type (e.g., pitch perception), then the regions involved in that computation should be modulated by rate similarly across those stimulus types.
Envelope modulation rate of sentences positively correlated with activity in the anterior and middle portions of the superior temporal lobes bilaterally as well as a small focus in the right posterior temporal lobe (45 −39 10) (Fig. 6A, Table 3). Positive correlations with the envelope modulation rate of scrambled sentences had a slightly more posterior activation distribution, which overlapped that for sentences in the middle portion of the superior temporal lobes bilaterally [center of overlap: left hemisphere (LH): −55 −11 4, right hemisphere (RH): 59 −15 2]. There was no negative correlation with envelope modulation rate for either sentences or scrambled sentences, even at a very liberal threshold (p = 0.01).
Envelope modulation rate of the music stimuli was positively correlated with a very different activation distribution that included the posterior inferior parietal lobe bilaterally; more medial portions of the dorsal temporal lobe, which likely include auditory areas on the supratemporal plane; the insula; and prefrontal regions. There was no overlap in the distribution of positive envelope modulation rate correlations for music and either of the two speech conditions. A negative correlation with envelope modulation rate of the music stimuli was observed primarily in the lateral left temporal lobe (see supplemental Fig. 1, available at www.jneurosci.org as supplemental material).
Direct contrasts of envelope modulation rate effects between stimulus conditions revealed a large bilateral focus in the middle portion of the superior temporal lobe with extension into anterior regions that was more strongly correlated with sentence than melody envelope modulation rate (LH: −53 −27 2; RH: 53 −21 −4) (Fig. 6B). Anterior to this region were smaller foci that were correlated more strongly with sentence than scrambled sentence envelope modulation rate and/or with sentence than melody envelope modulation rate. Thus, the envelope modulation rate analysis was successful at identifying stimulus-specific effects and confirms the specificity of more anterior temporal regions in processing sentence stimuli.
Sentences and scrambled sentences at normal rates
The contrast between sentences and scrambled sentences at the normal rate identified voxels that activated more to sentences (p = 0.005) in two left ATL clusters (−51 15 −13, −53 −6 1), a left posterior middle temporal gyrus cluster (−43 −51 −1), and the left IFG (−46 26 8), as well as in the right ATL (55 2 −8), right IFG (32 26 9), and right postcentral gyrus (49 −9 18) (supplemental Fig. 2, available at www.jneurosci.org as supplemental material). No other regions were more active for sentences than for scrambled sentences at this threshold (for peak t values, see supplemental table, available at www.jneurosci.org as supplemental material), and no regions were found to be more active for the scrambled sentences than sentences. Signal amplitude plots for a left ATL (supplemental Fig. 2, middle graph, available at www.jneurosci.org as supplemental material) and the left frontal ROI (supplemental Fig. 2, right graph, available at www.jneurosci.org as supplemental material) are presented for each condition for descriptive purposes. It is relevant that in both ROIs the music stimulus activation is low (particularly in the ATL), and in fact, music stimuli do not appear to generate any more activation than the scrambled sentence stimuli.
Discussion
The aim of the present study was to assess the distribution of activation associated with processing speech (sentences) and music (simple melodies) information. Previous work has suggested common computational operations in processing aspects of speech and music, particularly in terms of hierarchical structure (Patel, 2007; Fadiga et al., 2009), which predicts significant overlap in the distribution of activation for the two stimulus types. The present study indeed found some overlap in the activation patterns for speech and music, although this was restricted to relatively early stages of acoustic analysis, and not in regions, such as the anterior temporal cortex or Broca's area, that have previously been associated with higher-level hierarchical analysis. Specifically, only speech stimuli activated anterior temporal regions that have been implicated in hierarchical processes (defined by the sentence vs scrambled sentence contrast) and Broca's area did not reliably activate to either stimulus type once lower-level stimulus features were factored out. Furthermore, even within the region of overlap in auditory cortex, multivariate pattern classification analysis showed that the two classes of stimuli yielded distinguishable patterns of activity, likely reflecting the different acoustic features present in speech and music (Zatorre et al., 2002). Overall, these findings seriously question the view that hierarchical processing in speech and music relies on the same neural computation systems.
If the neural systems involved in processing higher-order aspects of language and music are largely distinct, why have previous studies suggested common mechanisms? One possibility is that similar computational mechanisms are implemented in distinct neural systems. This view, however, is still inconsistent with the broader claim that language and music share computational resources (e.g., Fedorenko et al., 2009). Another possibility is that the previously documented similarities derive not from fundamental perceptual processes involved in language and music processing but from computational similarities in the tasks used to assess these functions. Previous reports of common mechanisms employ structural violations in sentences and melodic stimuli, and it is in the response to violations that similarities in the neural response have been observed (Patel et al., 1996; Levitin and Menon, 2003). The underlying assumption of this body of work is that a violation response reflects the additional structural load involved in trying to integrate an unexpected continuation of the existing structure and therefore indexes systems involved in more natural structural processing. However, this is not the only possibility. A structural violation could trigger working memory processes (Rogalsky et al., 2008) or so-called “cognitive control” processes (Novick et al., 2005), each of which has been linked to regions, Broca's area in particular, that are thought to be involved in structural processing. The fact that Broca's area was not part of the network implicated in the processing of structurally coherent language and musical stimuli in the present study argues in favor of the view that the commonalities found in previous studies indeed reflect a task-specific process that is not normally invoked under more natural circumstances.
The present study used a temporal rate manipulation that, we suggest, may function as a kind of parametric load manipulation that can isolate stimulus-specific processing networks. If this reasoning is correct, such a manipulation may be a better alternative to the use of structural violations to induce load. We quantified our rate manipulation by calculating the envelope modulation rates of our stimuli and used these values as a predictor of brain activity. Envelope modulation rate has emerged in recent years as a potentially important feature in the analysis of acoustic input in that it may be important in driving endogenous cortical oscillations (Luo and Poeppel, 2007). We reasoned that correlations with envelope modulation rate would tend to isolate higher-level, integrative aspects of processing on the assumption that such levels would be more affected by rate modulation (although this is clearly an empirical question). Indeed, based on the pattern of correlated activity, this appears to be the case. Regions that correlated with the modulation rate of sentences included anterior and middle portions of the superior temporal lobes bilaterally as well as a small focus in the right posterior temporal lobe (Fig. 6A, Table 3). No core auditory areas showed a correlation with the modulation rate of sentences, nor did Broca's region. When structural information is largely absent, as in scrambled sentences, modulation rate does not correlate with more anterior temporal regions, but instead has a more middle to posterior distribution. In the absence of structural information, the load on lexical–phonological-level processes may be amplified, explaining the involvement of more posterior superior temporal regions, which have been implicated in phonological-level processes (for review, see Hickok and Poeppel, 2007).
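As a purely illustrative aid, the idea of an envelope modulation rate can be sketched in a few lines of Python. This is not the procedure used in the present study. It assumes one simple approach: full-wave rectification, smoothing with a moving average, and selection of the strongest spectral peak of the resulting envelope. The function names, the 20 ms smoothing window, the candidate rates, and the synthetic amplitude-modulated tone are all hypothetical choices made for the example.

```python
import math


def amplitude_envelope(signal, sample_rate, window_s=0.02):
    """Full-wave rectify the signal and smooth it with a moving average.

    The 20 ms default window is an assumption chosen for illustration.
    """
    win = max(1, int(window_s * sample_rate))
    rectified = [abs(x) for x in signal]
    envelope, running = [], 0.0
    for i, value in enumerate(rectified):
        running += value
        if i >= win:
            running -= rectified[i - win]  # drop the sample leaving the window
        envelope.append(running / min(i + 1, win))
    return envelope


def dominant_modulation_rate(envelope, sample_rate, candidate_rates_hz):
    """Return the candidate rate (Hz) with the most power in the envelope."""
    mean = sum(envelope) / len(envelope)
    centered = [e - mean for e in envelope]  # remove the DC component
    best_rate, best_power = None, -1.0
    for rate in candidate_rates_hz:
        step = 2.0 * math.pi * rate / sample_rate
        re = sum(c * math.cos(step * n) for n, c in enumerate(centered))
        im = sum(c * math.sin(step * n) for n, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_rate, best_power = rate, power
    return best_rate


def synthetic_stimulus(mod_rate_hz, sample_rate=2000, duration_s=2.0,
                       carrier_hz=220.0):
    """Hypothetical test signal: a tone amplitude-modulated at mod_rate_hz."""
    n_samples = int(sample_rate * duration_s)
    return [
        (1.0 + 0.8 * math.cos(2.0 * math.pi * mod_rate_hz * n / sample_rate))
        * math.sin(2.0 * math.pi * carrier_hz * n / sample_rate)
        for n in range(n_samples)
    ]


if __name__ == "__main__":
    sample_rate = 2000
    candidates = [0.5 * k for k in range(2, 21)]  # 1.0 to 10.0 Hz
    env = amplitude_envelope(synthetic_stimulus(4.0, sample_rate), sample_rate)
    print(dominant_modulation_rate(env, sample_rate, candidates))
```

In an analysis of this kind, one such value per stimulus would then serve as a parametric predictor of the hemodynamic response. Real speech and music have broadband modulation spectra, so a single peak is a coarse summary at best.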
Regions that correlated with the modulation rate of melodies were completely nonoverlapping with regions correlated with the modulation rate of sentences. The modulation rate of the melodies correlated with dorsomedial regions of the anterior temporal lobe, likely including portions of auditory cortex, inferior parietal–temporal regions, and prefrontal cortex. A previous study that manipulated prosodic information in speech, which is similar to a melodic contour, also reported activation of anterior dorsomedial auditory regions (Humphries et al., 2005), and other studies have implicated the posterior supratemporal plane in aspects of tonal perception (Binder et al., 1996; Griffiths and Warren, 2002; Hickok et al., 2003). Lesion studies of music perception, similarly, have implicated both anterior and posterior temporal regions (Stewart et al., 2006), consistent with our findings.
The rate modulation analysis should be viewed as preliminary and requires further empirical exploration regarding the present assumptions. Although the pattern of activity revealed by these analyses is consistent with our reasoning that rate modulation correlations isolate higher-order aspects of processing, this has not been demonstrated directly. Another aspect of this approach that warrants investigation is the possibility of a nonmonotonic function relating rate modulation and neural load. For example, one might expect that both very slow and very fast rates induce additional processing loads, perhaps in different ways. In addition, it is important to note that the lack of a response in core auditory regions to the temporal modulation rate does not imply that auditory cortex is insensitive to modulation rate generally. In fact, previous studies have demonstrated that auditory cortex is sensitive to the temporal modulation rate of various types of noise stimuli (Giraud et al., 2000; Wang et al., 2003; Schönwiesner and Zatorre, 2009). However, the modulation rates in our stimuli were below the modulation rates used in these previous studies, and the range of rates in our stimuli was relatively small. It is therefore difficult to compare the present study with previous work on temporal modulation rate.
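The concern about nonmonotonicity can be made concrete with a small sketch, offered here only as an illustration and not as an analysis performed in the present study. If load were U-shaped with respect to rate, a linear rate predictor would correlate poorly with it, whereas a quadratic predictor (the squared, mean-centered rate) would capture it. The rate values and the load function below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def linear_and_quadratic_regressors(rates):
    """Mean-center the rates (linear term) and square them (quadratic term)."""
    mean = sum(rates) / len(rates)
    linear = [r - mean for r in rates]
    quadratic = [v * v for v in linear]
    return linear, quadratic


# Hypothetical modulation rates (Hz) and a hypothetical U-shaped load:
# both the slowest and the fastest rates are assumed to be hardest.
rates = [1.0, 2.0, 3.0, 4.0, 5.0]
load = [(r - 3.0) ** 2 for r in rates]

linear, quadratic = linear_and_quadratic_regressors(rates)
linear_r = pearson(linear, load)        # near zero for a symmetric U shape
quadratic_r = pearson(quadratic, load)  # near one for a symmetric U shape
```

Under these assumptions, a region whose load follows such a U-shaped function would be missed by a purely linear rate predictor, which is one reason the relation between rate and load needs to be examined empirically.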
Nonetheless, in the present study, both the traditional conjunction/disjunction and rate modulation analyses produced convergent results, suggesting that higher-order aspects of language and music structure are processed largely within distinct cortical networks, with music structure being processed in more dorsomedial temporal lobe regions and language structure being processed in more ventrolateral structures. This division of labor likely reflects the different computational building blocks that go into constructing linguistic compared to musical structure. Language structure is largely dependent on the grammatical (syntactic) features of words, whereas musical structure is determined by pitch and rhythmic contours (i.e., acoustic features). Given that computation of hierarchical structure requires the integration of these lower-level units of analysis to derive structural relations, given that the representational features of the language and melodic units are so different, and given that the “end-game” of the two processes differs (derivation of a combinatorial semantic representation vs acoustic recognition or perhaps an emotional modulation), it seems unlikely that a single computational mechanism could suffice for both.
In terms of language organization, available evidence points to a role for the posterior–lateral superior temporal lobe in analyzing phonemic and lexical features of speech (Binder et al., 2000; Vouloumanos et al., 2001; Liebenthal et al., 2005; Okada and Hickok, 2006; Hickok and Poeppel, 2007; Obleser et al., 2007; Vaden et al., 2010) with more anterior regions playing some role in higher-order sentence-level processing (Mazoyer et al., 1993; Friederici et al., 2000; Humphries et al., 2001, 2005, 2006; Vandenberghe et al., 2002; Rogalsky and Hickok, 2009). This is consistent with the present finding from the rate modulation analysis of an anterior distribution for sentences and a more posterior distribution for speech with reduced sentence structure (scrambled sentences). Perhaps music perception has a similar posterior-to-anterior, lower-to-higher-level gradient, although this remains speculative.
One aspect where there may be some intersection is with prosodic aspects of language, which, like music, rely on pitch and rhythm contours, and which certainly inform syntactic structure [e.g., see Cutler et al. (1997) and Eckstein and Friederici (2006)]. As noted above, prosodic manipulations in sentences activate a dorsomedial region in the anterior temporal lobe (Humphries et al., 2005) that is similar to the region responsive to musical stimuli in the present study. This point of contact warrants further investigation on a within-subject basis.
Despite increasing enthusiasm for the idea that music and speech share important computational mechanisms involved in hierarchical processing, the present direct within-subject comparison failed to find compelling evidence for this view. Music and speech stimuli activated largely distinct neural networks except for lower-level core auditory regions, and even in these overlapping regions, distinguishable patterns of activation were found. Many previous studies hinting at shared processing systems may have induced higher-order cognitive mechanisms, such as working memory or cognitive control systems, that may be the basis of the apparent process similarity rather than a common computation system for hierarchical processing.
This research was supported by National Institutes of Health Grant DC03681 to G.H.
Correspondence should be addressed to Gregory Hickok, Department of Cognitive Sciences, University of California, Irvine, 3151 Social Sciences Plaza, Irvine, CA 92697.