Abstract
Linguistic content can be conveyed both in speech and in writing. But how similar is the neural processing when the same real-life information is presented in spoken and written form? Using functional magnetic resonance imaging, we recorded neural responses from human subjects who either listened to a 7 min spoken narrative or read a time-locked presentation of its transcript. Next, within each brain area, we directly compared the response time courses elicited by the written and spoken narrative. Early visual areas responded selectively to the written version, and early auditory areas to the spoken version of the narrative. In addition, many higher-order parietal and frontal areas demonstrated strong selectivity, responding far more reliably to either the spoken or written form of the narrative. By contrast, the response time courses along the superior temporal gyrus and inferior frontal gyrus were remarkably similar for spoken and written narratives, indicating strong modality-invariance of linguistic processing in these circuits. These results suggest that our ability to extract the same information from spoken and written forms arises from a mixture of selective neural processes in early (perceptual) and high-order (control) areas, and modality-invariant responses in linguistic and extra-linguistic areas.
Introduction
Until ∼5000 years ago, before the development of logographic and alphabetic writing systems, human language relied mainly upon spoken utterances (Houston, 2004). The advent of written language provided a new, visual pathway for communication. However, because written language requires extensive training and typically follows the acquisition of spoken language, it is thought to rely on neural pathways that originally supported spoken language (van Atteveldt et al., 2004). But which brain regions are common to the written and spoken language systems, and do they function in the same way across modalities?
Prior work has mapped the regions responsive to linguistic stimuli presented visually (writing) and auditorily (speech). Regions that show “modality-selective” responses were defined as those that produce aggregate activity increases for stimuli of only a single modality. Regions that show “modality-invariant” responses were defined as those that responded to both written and spoken stimuli. This usage of “modality” therefore subsumes both the sensory modality (auditory vs visual) and the task modality (listening vs reading). Modality selectivity was observed in sensory cortices: early auditory cortex responds to spoken (but not written) stimuli, whereas early visual cortex responds to written (but not spoken) stimuli. Modality-invariant activations were observed in widespread language systems, including the posterior superior temporal cortex, inferior parietal cortex, and subsets of inferior frontal cortex (Raij et al., 2000; Shaywitz et al., 2001; Booth et al., 2002; Marinković, 2004; van Atteveldt et al., 2004; Spitsyna et al., 2006; Jobard et al., 2007; Lindenberg and Scheef, 2007; Vagharchakian et al., 2012). Thus, prior studies suggest a hierarchical model in which early sensory regions are modality-selective but the written and spoken systems gradually converge, so that modality-invariance increases toward higher-order language regions.
The evidence supporting the hierarchical convergence of spoken and written systems can be criticized in two respects. First, responses to language stimuli were measured via aggregate activations to constrained stimuli such as isolated words or isolated short sentences. These paradigms do not map the full set of regions engaged in real-life language comprehension (Lerner et al., 2011; Ben-Yakov et al., 2012), and thus may underestimate the responses in high-order regions to spoken and written language, as well as their overlap. Second, the demonstration of spatially overlapping neural responses is not strong evidence for modality invariance, because an aggregate activation may be observed in both modalities even when different kinds of processing are taking place (Dinstein et al., 2007; Ben-Yakov et al., 2012; Honey et al., 2012). Although spatial overlap between averaged signals is an informative finding that suggests some level of modality invariance, a stronger form of modality invariance is indicated when a region responds with the same temporal response profile to spoken and written forms of the same linguistic input.
In the current study, we measured temporal response profiles to a 7 min real-life narrative and directly compared the response time courses within and across the spoken and written versions of the narrative. Real-life linguistic stimuli can evoke highly reliable and selective responses, even in regions that show little response modulation to isolated letters, words, or sentences (Lerner et al., 2011). Intersubject correlation analysis (inter-SC; Hasson et al., 2010) provides an ideal tool for measuring the reliability of response time courses to natural stimuli. Each subject in this study either listened to the spoken version or read the written version of the same continuous narrative while undergoing functional magnetic resonance imaging (fMRI). Subjects were instructed to attend to the details of the narrative, and a postscan questionnaire assessed their comprehension and engagement.
This design allowed us to identify two types of response profiles: (1) modality-selective responses in areas that responded more reliably across subjects to the written (or spoken) narrative, and (2) potentially modality-invariant responses in areas that responded equally reliably to the spoken and written narratives. For each potentially modality-invariant region, we then examined whether that region produced the same time-varying response profile when the narrative was spoken and when it was written, by performing intersubject correlations across modalities. Regions passing this test were classified as displaying truly modality-invariant responses. Throughout the paper, the term modality refers to both the "sensory modality" (auditory vs visual) and the "task modality" (listening vs reading). Using this approach, we identified robust modality-invariant responses in linguistic areas along the posterior superior temporal gyrus (pSTG) and in the left inferior frontal gyrus, as well as in some extra-linguistic areas, such as the precuneus. However, not all higher-order areas demonstrated modality-invariant responses: some parietal and frontal areas produced responses that were selective for either the spoken or written narrative, in addition to the selectivity observed in early sensory areas.
Materials and Methods
Subjects
Thirty-eight subjects participated in one of the two main experimental conditions (written narrative and spoken narrative) or in a third condition (combined narrative), designed to guide us in defining a set of independent regions of interest (ROIs). Eleven subjects were excluded from the analysis: four due to head motion >2 mm, two due to corrupted functional signal, one due to corrupted anatomical signal, one due to anomalous anatomy, one due to difficulties in hearing the stimulus, and two due to failure of the stimulus comprehension test. Additional subjects were scanned until data from nine subjects were collected for the spoken (four males, five females; ages 19–28), written (five males, four females; ages 19–22), and combined (five males, four females; ages 19–22) narrative conditions. In addition, nine of the subjects from the written and spoken conditions also participated in an unintelligible written control experiment, and an additional set of 11 subjects participated in an unintelligible spoken control experiment. Procedures were approved by the Princeton University Committee on Activities Involving Human Subjects. All subjects were right-handed native English speakers with normal hearing and provided written informed consent.
MRI acquisition
Subjects were scanned in a 3T full-body MRI scanner (Skyra, Siemens) with a 12-channel head coil. For functional scans, images were acquired using a T2*-weighted echo planar imaging (EPI) pulse sequence [repetition time (TR), 1500 ms; echo time (TE), 28 ms; flip angle, 64°], each volume comprising 27 slices of 4 mm thickness with 0 mm gap; slice acquisition order was interleaved. In-plane resolution was 3 × 3 mm2 [field of view (FOV), 192 × 192 mm2]. Anatomical images were acquired using a T1-weighted magnetization-prepared rapid-acquisition gradient echo (MPRAGE) pulse sequence (TR, 2300 ms; TE, 3.08 ms; flip angle, 9°; 0.89 mm3 resolution; FOV, 256 mm2). To minimize head movement, subjects' heads were stabilized with foam padding. Stimuli were presented using the Psychophysics toolbox (Brainard, 1997; Pelli, 1997). Subjects were provided with MRI-compatible in-ear mono earbuds (Sensimetrics model S14), which provided the same audio input to each ear. MRI-safe passive noise-canceling headphones were placed over the earbuds for noise attenuation and safety.
Stimuli and experimental design
The spoken language stimulus was a 7 min real-life story ("Pie-man," told by Jim O'Grady) recorded at a live storytelling performance ("The Moth" storytelling event, New York City). The written language stimulus was a 954-word transcript of the same narrative. In a third, combined condition, the spoken and written versions of the narrative were presented simultaneously as an audiovisual stimulus; this combined experiment was used to define an unbiased set of ROIs. These three stimuli (Fig. 1) were presented in a between-subjects design; each subject participated in only one of the following conditions: the spoken condition (auditory stimulus), the written condition (visual stimulus), or the combined condition (audiovisual stimulus).
In the written condition, words were presented individually at the center of the screen in rapid serial visual presentation, in a rhythm that accurately matched the timing of the original spoken version. In cases where several spoken words were inseparable in time (46.17% of the screens), we presented multiple words on a single screen (two-word screens appeared 207 times, three-word screens 64 times, and screens of four or more words six times). Overall, the narrative contained 600 screen images, 0.7 ± 0.5 s each. Infrequently, the recording contained laughter and applause from the audience. Each of these laughter/applause segments was classified as a "single word" event (5.5% of screens), and a "smiley" icon was presented during these segments in the written condition. Neutral lead-in music was played for 12 s before the onset of the spoken stimulus, and graphical music symbols were shown for 12 s before the onset of the written stimulus. Responses to these initial 12 s were excluded from all analyses.
Subjects also participated in two control conditions, one for the spoken and one for the written conditions (Fig. 1). In the unintelligible spoken condition, the narrative waveform was played reversed in time (backward narrative), creating the perceptual effect of an unintelligible speech-like stimulus. In the unintelligible written condition, the letters constituting each word were randomly permuted and then the entire scrambled word was rotated 180 degrees, creating an unreadable array of unfamiliar letters from the exact same set of visual features.
Data analysis
Preprocessing.
fMRI data were reconstructed and analyzed with the BrainVoyager QX software package (Brain Innovation) and with in-house software written in MATLAB (MathWorks). Preprocessing of functional scans included intrasession 3D motion correction, slice scan time correction, linear trend removal, and high-pass filtering (two cycles per condition). Spatial smoothing was applied using a Gaussian filter of 6 mm full-width at half-maximum value. The cortical surface was reconstructed from the 3D MPRAGE anatomical images using BrainVoyager software. The complete functional dataset was transformed to a 3D Talairach space (Talairach and Tournoux, 1988) and projected on a reconstruction of the cortical surface.
Inter-SC maps were produced for each condition (e.g., spoken narrative, written narrative, combined narrative) and across conditions (e.g., spoken narrative vs written narrative). The inter-SC maps provide a measure of the reliability of brain responses to each of the conditions by quantifying the correlation of the time course of BOLD activity across subjects listening to the spoken narrative or reading the same written narrative (Hasson et al., 2004, 2010; Lerner et al., 2011).
For each voxel, inter-SC within a condition (R) is calculated as the average, across subjects, of the correlation between each subject's response time course and the mean response time course of the remaining subjects:

$$R = \frac{1}{n}\sum_{j=1}^{n} r_j, \qquad r_j = \operatorname{corr}\!\left(x_j,\ \frac{1}{n-1}\sum_{k \neq j} x_k\right),$$

where $x_j$ denotes the voxel's BOLD time course in subject $j$ and $n$ is the number of subjects in that condition.
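The original analyses were implemented in BrainVoyager and in-house MATLAB code; purely as an illustration, the leave-one-out computation for a single voxel might be sketched in Python as follows (array shapes and the function name are assumptions for this sketch, not taken from the original code).

```python
import numpy as np

def intersubject_correlation(bold):
    """Leave-one-out inter-SC for a single voxel.

    bold : array of shape (n_subjects, n_timepoints) holding the
           preprocessed BOLD time courses of all subjects at this voxel.
    Returns the average Pearson correlation (R) between each subject's
    time course and the mean time course of the remaining subjects.
    """
    n_subjects = bold.shape[0]
    r = np.empty(n_subjects)
    for j in range(n_subjects):
        others = np.delete(bold, j, axis=0).mean(axis=0)   # mean of the other subjects
        r[j] = np.corrcoef(bold[j], others)[0, 1]          # Pearson r for subject j
    return r.mean()                                        # average correlation R
```

Applying such a function to every voxel yields the inter-SC map for one condition; correlating each subject's time course against the average of the subjects from the other condition gives the between-condition version used below.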
Projection of white matter.
To diminish the impact of global, non-neural signal artifacts on local BOLD signals, we projected out the mean white matter signal from the BOLD signal in each voxel in each subject. The mean white matter signal was calculated individually for each subject and entered into a linear regression to predict the BOLD signal in each voxel. The BOLD signals were then replaced with the residuals resulting from this regression, and the mean and variance of each of these residuals were matched to the mean and variance of the preprojection BOLD signal.
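Continuing the same illustrative Python idiom (the actual implementation was in-house MATLAB code), the projection and rescaling for one voxel could look like the sketch below; the helper name and the simple intercept-plus-slope design matrix are assumptions.

```python
import numpy as np

def project_out_white_matter(voxel_ts, wm_ts):
    """Regress the mean white-matter signal out of one voxel's BOLD time course.

    voxel_ts : 1D array, BOLD time course of a single voxel.
    wm_ts    : 1D array, the subject's mean white-matter time course.
    Returns the regression residuals, rescaled to the mean and variance
    of the original (preprojection) voxel time course.
    """
    # Least-squares fit of the voxel signal on the white-matter signal (plus intercept)
    X = np.column_stack([np.ones_like(wm_ts), wm_ts])
    beta, *_ = np.linalg.lstsq(X, voxel_ts, rcond=None)
    residuals = voxel_ts - X @ beta
    # Match the residuals' mean and variance to those of the original signal
    residuals = (residuals - residuals.mean()) / residuals.std()
    return residuals * voxel_ts.std() + voxel_ts.mean()
```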
Bootstrapping by phase-randomization.
Because of the presence of long-range temporal autocorrelation in the BOLD signal (Zarahn et al., 1997), the statistical likelihood of each observed correlation was assessed using a bootstrapping procedure based on phase-randomization. The null hypothesis was that the BOLD signal in each voxel in each individual was independent of the BOLD signal values in the corresponding voxel in any other individual at any point in time (i.e., that there was no inter-SC between any pair of subjects).
For all conditions, a phase randomization of each voxel time course was performed by applying a fast Fourier transform to the signal, randomizing the phase of each Fourier component, and inverting the Fourier transformation. This procedure scrambles the phase of the BOLD time course but leaves its power spectrum intact. For each randomly phase-scrambled surrogate dataset, we computed the inter-SC (R) for all voxels in the exact same manner as the empirical correlation maps described above, i.e., by calculating the Pearson correlation between that voxel's BOLD time course in one individual and the average of that voxel's BOLD time courses in the remaining individuals. The resulting correlation values were averaged within each voxel across all subjects, creating a null distribution of average correlation values for all voxels.
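As a minimal sketch of the phase-scrambling step (the conjugate symmetry required to keep the surrogate real-valued is handled here via the real-valued FFT, an implementation detail not spelled out in the text), a single surrogate time course can be generated as follows:

```python
import numpy as np

def phase_randomize(ts, rng):
    """Return a phase-scrambled surrogate of a real-valued time course.

    The surrogate preserves the power spectrum of the input but has random
    Fourier phases, destroying any temporal alignment across subjects.
    """
    spectrum = np.fft.rfft(ts)                               # FFT of the real signal
    phases = rng.uniform(0, 2 * np.pi, size=spectrum.shape)
    phases[0] = 0.0                                          # keep the mean (DC) term unchanged
    if len(ts) % 2 == 0:
        phases[-1] = 0.0                                     # Nyquist bin must stay real for even-length signals
    surrogate_spectrum = np.abs(spectrum) * np.exp(1j * phases)
    return np.fft.irfft(surrogate_spectrum, n=len(ts))       # back to the time domain
```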
To correct for multiple comparisons, we selected the highest inter-SC value from the null distribution across all voxels in a given iteration. We repeated this bootstrap procedure 1000 times to obtain a null distribution of maximum noise correlation values (i.e., the chance level of obtaining high correlation values across all voxels in each iteration). The familywise error rate (FWER) threshold (R*) was defined as the value exceeded by only the top 5% of this null distribution of maximum correlation values, and the veridical map was thresholded at R* (Nichols and Holmes, 2002). In other words, in the inter-SC map, only voxels with a mean correlation value (R) above the threshold derived from the bootstrapping procedure (R*) were considered significant after correction for multiple comparisons and were presented on the final map. Using this method, the thresholds for each condition were as follows: spoken condition R* = 0.17; written condition R* = 0.16; combined condition R* = 0.15; unintelligible spoken control R* = 0.12; unintelligible written control R* = 0.16. The same procedure was performed on the correlation values computed between the spoken and the written conditions, producing a threshold of R̃* = 0.17.
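A sketch of this max-statistic correction, reusing the phase_randomize and intersubject_correlation helpers from the sketches above; the brute-force loops and the data layout (voxels × subjects × timepoints) are illustrative assumptions, not the original implementation.

```python
import numpy as np

def fwer_threshold(bold, n_iterations=1000, alpha=0.05, seed=0):
    """Estimate the FWER-corrected inter-SC threshold R*.

    bold : array of shape (n_voxels, n_subjects, n_timepoints).
    On each iteration every time course is phase-scrambled, the inter-SC
    map is recomputed, and only its maximum is kept; R* is the
    (1 - alpha) quantile of these maxima.
    """
    rng = np.random.default_rng(seed)
    max_null = np.empty(n_iterations)
    for it in range(n_iterations):
        scrambled = np.array([
            [phase_randomize(ts, rng) for ts in voxel_ts]    # scramble each subject's time course
            for voxel_ts in bold                             # ... within each voxel
        ])
        null_map = np.array([intersubject_correlation(v) for v in scrambled])
        max_null[it] = null_map.max()                        # keep only the maximum across voxels
    return np.quantile(max_null, 1 - alpha)                  # e.g., the 95th percentile
```

Voxels whose empirical R exceeds the returned threshold survive correction, mirroring the R* values reported above.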
To identify areas that showed an increase in response reliability for one condition over the other, a t test (α = 0.05) was performed within each voxel that exceeded the threshold in at least one of the inspected conditions (see Fig. 4A). That is, within each voxel, the t test compared the per-subject correlation values from the spoken condition, $\{r_1^{\mathrm{spoken}}, \ldots, r_n^{\mathrm{spoken}}\}$, with the per-subject correlation values from the written condition, $\{r_1^{\mathrm{written}}, \ldots, r_n^{\mathrm{written}}\}$.
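In the same illustrative idiom, the voxelwise comparison could be expressed as below; the per-subject r values are assumed to have been computed with the leave-one-out procedure, and the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def modality_bias_map(r_spoken, r_written, passed_threshold):
    """Voxelwise t test on per-subject inter-SC values.

    r_spoken, r_written : arrays of shape (n_voxels, n_subjects) holding the
        per-subject leave-one-out correlation values in each condition.
    passed_threshold    : boolean array of shape (n_voxels,), True for voxels
        that exceeded R* in at least one condition.
    Returns voxelwise t and p values, with NaN for voxels that were not tested.
    """
    n_voxels = r_spoken.shape[0]
    t_vals = np.full(n_voxels, np.nan)
    p_vals = np.full(n_voxels, np.nan)
    for v in np.where(passed_threshold)[0]:
        t_vals[v], p_vals[v] = stats.ttest_ind(r_spoken[v], r_written[v])
    return t_vals, p_vals
```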
ROI analysis.
In this work, we used two types of ROIs. (1) To sample an ROI in the primary auditory area (A1+), the narrative's sound envelope was convolved with a hemodynamic response function (Glover, 1999) to simulate a BOLD time course, and was used as a regressor. A bilateral superior temporal ROI called A1+ was then defined using two 10 × 10 × 10 mm3 cubes, located on the peaks of the audio regression in each hemisphere. (2) A set of independent ROIs (see Figs. 4, 5) was defined based on an intersubject reliability map calculated within an independent group of subjects who concurrently read and listened to the narrative (the combined condition). The ROIs were defined by sampling 252–3186 adjoining voxels around the response reliability peaks in the vicinity of the following areas: calcarine sulcus (primary visual area, V1+), left posterior dorsolateral prefrontal cortex (pDLPFC) and anterior dorsolateral prefrontal cortex (aDLPFC), the left inferior frontal gyrus (IFG) [which includes pars opercularis (approximately BA44) and pars triangularis (approximately BA45)], the left and right angular gyrus, left and right posterior regions of the dorsal medial prefrontal cortex (dmPFC), the left pSTG, the precuneus, and anterior regions in the left and right inferior parietal lobule (aIPL; Table 1).
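As a rough sketch of how the audio regressor used to define A1+ can be built (the study used the Glover, 1999 hemodynamic response function; the double-gamma parameters below are a generic stand-in rather than the exact function, and the sampling to the TR is simplified):

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr=1.5, duration=30.0):
    """A generic double-gamma HRF sampled at the TR (illustrative parameters)."""
    t = np.arange(0, duration, tr)
    peak = gamma.pdf(t, 6)           # positive response peaking around 5-6 s
    undershoot = gamma.pdf(t, 16)    # delayed undershoot
    h = peak - 0.35 * undershoot
    return h / h.max()

def envelope_regressor(envelope_tr, tr=1.5):
    """Convolve the per-TR sound envelope with the HRF to predict the BOLD response."""
    reg = np.convolve(envelope_tr, canonical_hrf(tr))[: len(envelope_tr)]
    return (reg - reg.mean()) / reg.std()   # z-score before entering the regression
```

Regressing this predictor against every voxel's time course and taking the bilateral superior temporal peaks gives the locations around which the 10 × 10 × 10 mm3 A1+ cubes were centered.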
Talairach coordinates of independently defined ROIs
To identify which of the ROIs showed an increase in response reliability for one condition over the other, a one-tailed t test (α = 0.05) was performed within each ROI. That is, within each ROI, the t test compared each spoken-condition subject's mean correlation value across the ROI's voxels, $\{\bar{r}_1^{\mathrm{spoken}}, \ldots, \bar{r}_n^{\mathrm{spoken}}\}$, with the corresponding values from the written-condition subjects, $\{\bar{r}_1^{\mathrm{written}}, \ldots, \bar{r}_n^{\mathrm{written}}\}$.
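The ROI-level test differs from the voxelwise one only in that each subject contributes the mean correlation value across the ROI's voxels and the alternative is one-tailed; a sketch (the `alternative` keyword assumes SciPy ≥ 1.6):

```python
import numpy as np
from scipy import stats

def roi_modality_bias(r_spoken_roi, r_written_roi):
    """One-tailed t test on per-subject mean inter-SC within an ROI.

    r_spoken_roi, r_written_roi : arrays of shape (n_subjects, n_roi_voxels)
        with per-subject, per-voxel correlation values for each condition.
    Tests whether reliability in the ROI is greater for the spoken than
    for the written condition (swap the arguments for the opposite tail).
    """
    spoken_means = r_spoken_roi.mean(axis=1)    # one mean correlation per subject
    written_means = r_written_roi.mean(axis=1)
    return stats.ttest_ind(spoken_means, written_means, alternative='greater')
```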
Behavioral assessment
Immediately following the scan, we assessed each subject's engagement with and comprehension of the stimulus using a simple questionnaire. Subjects were asked to write down a summary of the narrative they had just heard or read, in as much detail as possible (spoken n = 9, written n = 9). Four independent raters graded these written records against a standard consisting of four questions about characters in the narrative, nine questions about particular events in the narrative, two questions about prominent keywords, as well as the overall comprehensiveness of the summary. The mean of the resulting scores (on a scale from 0 to 13) provided a measure of each subject's comprehension of the narrative. In addition, most of the subjects were asked to rate on a scale from 1 to 7 how difficult it was for them to reconstruct the narrative (spoken n = 7, written n = 9) and how engaged they felt with the narrative (spoken n = 8, written n = 9).
Two-tailed Welch's t tests (α = 0.05) were conducted between the two groups to compare the effects of the experimental conditions on self-reported engagement and recall difficulty, as well as on independently rated narrative comprehension.
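For the behavioral scores, the comparison reduces to a Welch-corrected two-sample test; a minimal sketch (variable names are placeholders for the per-subject comprehension, difficulty, or engagement scores):

```python
from scipy import stats

def compare_groups(spoken_scores, written_scores):
    """Two-tailed Welch's t test: equal_var=False applies the Welch correction."""
    return stats.ttest_ind(spoken_scores, written_scores, equal_var=False)
```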
Results
We compared the behavioral and neural responses within and between two groups of subjects. One group (“spoken”) listened to a 7 min real-life spoken narrative, whereas the other group (“written”) read an exact transcript of the spoken narrative, in which words were presented at the center of the screen, at a presentation rate which was matched to the spoken condition (see Materials and Methods; Fig. 1).
A 1.2 s segment of the 7 min narrative stimulus. While undergoing fMRI, subjects were exposed to one of two narrative presentation modes: a spoken version or a written version. Subjects were also exposed to spoken and written control conditions. In the spoken control, the narrative was played backwards, creating the perceptual effect of unintelligible speech; in the written control, the letters in each word were permuted and the resulting letter array was rotated by 180 degrees, producing an unintelligible written stimulus.
Behavioral results
Subjects comprehended the narrative well (spoken: M = 10, SD = 2.94; written: M = 8.86, SD = 2.32), with no difference in comprehension across the two conditions (spoken and written; t(15.17) = 0.86, p = 0.4; Fig. 2A). The subjective level of difficulty recalling the narrative was low and equal across the two conditions (t(13.99) = −1, p = 0.33; Fig. 2B). The subjective level of engagement was high and also equal across conditions (t(8.57) = −1.41, p = 0.19; Fig. 2C). These results suggest that the nonstandard reading condition for the written group (see Materials and Methods) did not hinder their comprehension or engagement with the narrative, relative to the spoken group. We next compared the time courses of neural activation for spoken and written naturalistic language.
Behavioral measures of comprehension and engagement did not differ between the spoken and written conditions. A, Comparable levels of narrative comprehension were measured across the two modes of narrative presentation. B, Difficulty in recalling the narrative was equal across the two conditions. C, Similar levels of engagement were measured across the two modes of narrative presentation. Values are means and error bars represent SEM across subjects.
Identifying the language network involved in listening and reading
We began by identifying the set of brain areas that (1) responded reliably across subjects who listened to the spoken narrative or (2) responded reliably across subjects who read the written narrative. This was done by mapping the inter-SC separately for the subjects in the spoken group and for the subjects in the written group.
Intersubject correlation in the spoken condition
Consistent with previous reports (Lerner et al., 2011; Honey et al., 2012), the spoken condition showed reliable responses across subjects in early auditory areas, as well as linguistic and extra-linguistic areas (Fig. 3A). Early auditory areas included primary and secondary cortices that process low-level features of the sound (A1+; Romanski and Averbeck, 2009). Linguistic areas include the pSTG and posterior superior temporal sulcus (pSTS), angular gyrus, supramarginal gyrus, posterior inferior parietal lobule, and IFG (which includes its opercular and ventral triangular parts; Table 2). Each of these regions has been previously linked with one or more core linguistic processes at the level of phonemes, lexical items, grammar, or articulation (Hickok and Poeppel, 2007; Sahin et al., 2009; Price, 2010). Finally, extra-linguistic regions, which seem to be involved in processing the narrative and the social content of the story (Fletcher et al., 1995; Xu et al., 2005; Ferstl et al., 2008; Lerner et al., 2011), include the precuneus, the posterior cingulate cortex (PCC), left aDLPFC, left orbitofrontal cortex, and dmPFC (Table 2; summary of Talairach coordinates of all areas presented in each map).
Reliability of brain responses within the spoken and the written conditions. The fMRI BOLD time course in each voxel was correlated across subjects to produce a map of inter-SC within each presentation mode. A, B, The surface maps show the areas exhibiting reliable responses for (A) subjects who listened to the narrative and (B) subjects who read the narrative (p(FWER) < 0.05). C, Brain regions that responded reliably to both the spoken narrative and the written narrative; this is the intersection of the areas shown in A and B.
Talairach coordinates of the statistical maps
Intersubject correlation in the written condition
Computing the inter-SC within the written condition (Fig. 3B) revealed reliable responses across subjects in the occipital visual cortex, as well as in many of the linguistic and extra-linguistic areas observed in the spoken condition (Fig. 3C). Areas that exhibited a reliable response include the pSTG and pSTS, anterior superior temporal gyrus (aSTG), angular gyrus, and IFG, all in the language network, and extra-linguistic areas including the precuneus, the aIPL, the PCC, the dmPFC, DLPFC, and the orbitofrontal cortex (Table 2).
Modality-selective responses
Next, considering all regions that responded reliably to one or both conditions, we sought to identify the voxels with significantly greater response reliability in one condition than in the other, using a voxelwise t test (see Materials and Methods).
The written condition evoked significantly greater reliability not only in the visual cortex, but also in the aIPL and some frontal areas, including the left and right pDLPFC, dorsal regions of the pars triangularis, the right orbitofrontal cortex, and a posterior region within the right dmPFC (Fig. 4A, green; Table 2). The spoken condition evoked significantly greater reliability not only in early auditory cortex, but also in a smaller set of frontal and parietal areas, including the left anterior DLPFC, a posterior region within the left dmPFC, and the right superior parietal lobule (Fig. 4A, red; Table 2). In addition, bilateral areas in the middle STG and an area within the right STS exhibited significantly more reliable responses for the spoken condition than for the written condition (p < 0.05). The responses in the right STS may be related to prosodic information, which is thought to be preferentially processed in the right hemisphere (Ross and Mesulam, 1979; George et al., 1996) and especially in the right STS (Belin et al., 2002; Bestelmeyer et al., 2011).
Brain regions exhibiting modality-selective bias to the spoken or written narratives. A, t tests between the spoken and written conditions within each voxel revealed areas that exhibit more reliable responses in the spoken condition (red) and other areas that exhibit more reliable responses in the written condition (green). Dashed circles represent ROI locations. B, Independently defined ROIs that exhibit modality-selective neural responses to spoken and written narratives. These regions include the primary auditory area (A1+), the primary visual area (V1+), the left pDLPFC and aDLPFC, the left and right posterior dmPFC, and anterior regions in the left and right IPL.
The results of the voxelwise analysis were reproduced in a group of sensory and high-order ROIs (see Materials and Methods). Although early auditory and visual modality-selective areas responded reliably to both the intelligible narratives and the unintelligible scrambled stimuli, high-order parietal and frontal regions that showed modality-selective responses did not respond reliably to the unintelligible scrambled stimuli (Fig. 4B). Early auditory areas (A1+), as defined using the narrative's acoustic envelope, showed reliable responses in the spoken condition, but not in the written condition (t(16) = 8.08, p ≪ 0.0001). Moreover, responses in A1+ were reliable even when the speech was unintelligible (played backwards), suggesting that this region is involved in low-level, prelinguistic processing of the spoken input. Unimodal visual areas (V1+) exhibited reliable responses in the written condition, but not in the spoken condition (t(16) = 11.65, p ≪ 0.0001). Moreover, this early visual area responded reliably, but to a lesser extent, when the letters in each word were scrambled and rotated to create unreadable arrays of visual input. This effect suggests that V1+ is involved in low-level processing of visual inputs, but may be influenced to some extent (via top-down feedback or attentional modulation) by the presence of readable orthographic input.
Some frontal ROIs, such as the left posterior dmPFC, exhibited significantly more reliable responses in the spoken condition than in the written condition (t(16) = 3.03, p = 0.003), but did not respond to the unintelligible written condition. Conversely, some high-order areas in the left pDLPFC (t(16) = 3.06, p = 0.004) and the left and right aIPL (t(16) = 5.02, p ≪ 0.0001; t(16) = 3.91, p = 0.001) exhibited significantly more reliable responses in the written condition than in the spoken condition, but did not respond to the unintelligible spoken condition. Overall, these results suggest differential involvement of these frontal and parietal areas in the processing of spoken and written information.
Modality-invariant responses
Spatial overlap of regions responsive to spoken and written narratives
Next, we examined the areas that responded reliably to both the spoken and the written narratives. Overlap between the two conditions was observed in many high-order linguistic and extra-linguistic areas (Fig. 3C). The linguistic areas include the pSTG and pSTS, the angular gyrus, the supramarginal gyrus, and the IFG (which includes its opercular and ventral triangular parts). The extra-linguistic areas include the precuneus, PCC, anterior regions within the dmPFC, and the posterior IPL (Table 2).
Direct comparison of response time courses to spoken and written narratives
Overlap in the reliability of responses across the spoken and written conditions does not tell us whether the response time courses for individual sentences embedded within a real-life narrative are similar across the two conditions. To test for the direct correspondence, we correlated the response time courses in the listeners' brains to the response time courses in the readers' brains within each brain area.
Most linguistic areas and some extra-linguistic areas demonstrated a remarkable invariance to modality, responding similarly to the narrative regardless of whether it was presented aurally or visually (Fig. 5). To illustrate the effect, we first present the mean time courses for the spoken and written conditions sampled from two independent ROIs in the left pSTG and precuneus (Fig. 5A), followed by the whole-brain analysis (Fig. 5B) and the ROI analysis (Fig. 5C). Written and spoken narratives clearly evoked similar mean response time courses in the pSTG and precuneus (Fig. 5A). Equally modality-invariant responses were observed in the angular gyrus, IFG, anterior dmPFC, and left DLPFC (Fig. 5B; Table 2). The results of the voxelwise analysis were reproduced in a group of independent linguistic ROIs (see Materials and Methods), such as the left IFG, the left pSTG, and the left and right angular gyrus, and extra-linguistic ROIs such as the precuneus (Fig. 5C). These areas exhibited similar responses regardless of presentation modality, but did not respond to the unintelligible conditions in either modality.
Brain regions exhibiting modality-invariant responses to the spoken and the written narratives. A, The average time courses of the responses in the left pSTG and the precuneus evoked by the written (green) and spoken (red) narrative. B, The fMRI BOLD time course in each area was correlated across conditions to produce a map of inter-SC across modes of presentation. Dashed circles represent ROI locations. C, Independently defined regions of interest that exhibit modality-invariant neural responses to spoken and written narratives. These regions include the left IFG, the left pSTG, the precuneus, and the left and right angular gyrus.
The time-locking of auditory and visual stimulus onsets cannot account for the cross-modally shared neural responses. We tested the magnitude of this low-level onset effect by presenting unintelligible scrambled letters in a time-locked rhythm with the full spoken story. This control stimulus did not exhibit any significant correlations with the full spoken story across the entire brain using the same corrected threshold (p(FWER) < 0.05). In addition, the correlations across subjects within the scrambled-letters condition were reliable only within visual cortex, and not in any of the high-order areas that exhibited modality-invariant responses in other conditions (Figs. 4B, 5C). These data rule out the possibility that regular stimulus onsets could have elicited the cross-modally reliable responses.
Discussion
In this study, we compared brain responses within and across subjects who either listened to a real-life spoken narrative or read a time-locked presentation of its transcript. Analysis of the temporally extended neural responses revealed two novel findings. The first finding is of robust modality-invariant response time courses within language-related areas along the pSTG and the IFG, as well as the precuneus (Fig. 5). The invariant response time courses indicate that, not only do these regions process real-life linguistic inputs of multiple modalities, they process that information in a similar fashion across modalities. The second finding is of modality-selective responses, which were not restricted to early visual and auditory cortices, but were also observed in parietal and frontal cortices. Although the sensory regions are expected to exhibit modality-selectivity, the observation of selective responses in parietal and frontal areas is surprising in light of their suggested amodal control functions (Mesulam, 1998; Chee et al., 1999).
Modality-invariant responses
The spoken and written narratives we presented have little in common in terms of their low-level sensory properties. This fit well with our observation of modality-selective response patterns in early visual and auditory areas (Fig. 4B). Nevertheless, both forms convey essentially the same meaning as verified by our comprehension test (Fig. 2). The behavioral invariance was paralleled by robust response invariance within language-related areas. Prior studies reported spatial overlap between areas that respond to spoken language and areas that respond to written text (Raij et al., 2000; Shaywitz et al., 2001; Booth et al., 2002; Marinković, 2004; van Atteveldt et al., 2004; Spitsyna et al., 2006; Jobard et al., 2007; Lindenberg and Scheef, 2007; Vagharchakian et al., 2012). In principle, a brain area could respond reliably to both spoken and written narratives, but with one temporal response profile for the spoken narrative and another for the written narrative. This study goes beyond these results by demonstrating that the response time courses for real-life complex narratives across the two conditions were highly similar within the regions noted above (Fig. 5). Moreover, the shared responses across spoken and written language extended to the precuneus, a high-order area whose responses are strongly contextually modulated (Ben-Yakov et al., 2012) and which does not respond reliably to streams of unrelated words or sentences (Xu et al., 2005; Lerner et al., 2011).
The similarity in neural responses across spoken and written stimuli in pSTG, left IFG, and precuneus may arise from the grammatical structures, lexical items, and situation models (van Dijk and Kintsch, 1983; Zwaan and Radvansky, 1998; Fairhall and Caramazza, 2013) that are shared by the spoken and written stimuli. Shared responses in many of the same temporal and parietal areas were also observed across Russian-speakers who listened to a Russian narrative and English-speakers who listened to its English translation (Honey et al., 2012). The partial invariance to both modalities and languages in these areas indicates that their representations are highly abstracted from the sensory input.
Given the uncommon reading task, where words appear in the middle of the screen at a fixed rate, our design is not suitable for revealing additional processes (such as the control of eye movements), which are unique to the reading of written text. However, our behavioral results point toward similar levels of comprehension and engagement with our stimuli across the groups. Moreover, such task differences cannot induce similarities in the neural activity across conditions; rather, they will tend to reduce the correlation across subjects who read and listened to the story. Thus, the actual invariant responses across reading and listening may be even more extensive than reported in this study.
The similarity of response time courses to spoken and written narratives attests to the plasticity of the language system. Regions that exhibit modality-invariant responses in the present study would have processed only auditory language signals within the first few years of life, before written language skills were acquired. Remarkably, the human nervous system learns to extract similar information from purely visual signs. In earlier stages of language processing, within the superior temporal cortex, this invariance may reflect the encoding of visual information (graphemes) into originally auditory representations (phonemes; Calvert et al., 2000; Raij et al., 2000). However, in regions further away from the auditory cortex and the STG, the modality-invariant responses more likely reflect amodal information processing elicited in a similar fashion by auditory and visual input.
Modality-selective responses
Surprisingly, a subset of parietal and frontal cortices exhibited strong preference for one modality over the other (Fig. 4). In particular, we observed a greater reliability for spoken narratives in the left anterior DLPFC, and a greater reliability for written narratives in the left posterior DLPFC. Similarly, the responses in the left (right) posterior dmPFC were more reliable for spoken (written) narratives. Finally, responses in the lateral anterior parietal cortex were more reliable for the written version of the narrative. The double-dissociated selective responses such as in the anterior and posterior portions of left DLPFC may indicate differential frontal cortical involvement in the active maintenance of auditory and visual information.
The functional selectivity observed in frontal areas in this study is consistent with nonhuman primate studies that have revealed response selectivity for faces in anterior ventrolateral prefrontal cortex (VLPFC) and for vocalizations in posterior VLPFC (Romanski, 2007). In addition, distinct frontoparietal networks in humans have been associated with memory for auditory and visual inputs (Protzner and McIntosh, 2007) and with attention to frequency-based auditory information and spatial-based visual information (Braga et al., 2013).
The fact that distinct subsections of medial and lateral frontal cortex exhibited a preference for spoken over written language (and vice versa; Fig. 4) suggests that spoken and written language inputs may induce distinct control processes. These distinct control processes may be related to the sensory modality (i.e., audio vs visual) as well as to the task modality (i.e., reading vs listening). At the same time, it appears that the modality-specific information that reaches these frontal and parietal regions is not low-level sensory information, because the higher-order regions only respond reliably to the meaningful linguistic stimuli and not to unintelligible scrambled letters or sounds (Figs. 4, 5).
Although our inter-SC analysis method is successful at characterizing the neural dynamics that are shared over time across spoken and written natural conditions, it also has its limitations. First, more spatially refined methods, such as fMRI-adaptation, are needed to map the neural organization of writing-selective, speech-selective, and amodal neurons at subvoxel resolution (Grill-Spector and Malach, 2001; van Atteveldt et al., 2010). Second, identifying areas with superadditive responses to simultaneous spoken and written stimuli could potentially indicate how neurons in these areas integrate information across modalities (Calvert, 2001).
In conclusion, the present study reveals modality-invariant and modality-selective responses to written and spoken narrative by directly comparing response time courses across listeners and readers. First, we observed that real-life narratives evoked reliable responses across many brain areas, ranging from early sensory areas to linguistic areas, and up to high-order parietal and frontal areas. Second, we observed a remarkable invariance to input form in linguistic areas, which responded similarly to the spoken and written narratives. However, the strong modality-invariance in these linguistic areas was accompanied by modality-selective responses in high-order parietal and frontal cortices. These findings challenge the classical distinction between sensory unimodal areas and higher-order amodal areas by demonstrating that some higher-order areas can display strong invariance to the input modality, whereas other areas can retain strong selectivity.
Footnotes
This work was supported by NIH Grant R01-MH094480 (to U.H., M.R., C.J.H., and E.S.). We thank Ido Davidesco and Janice Chen for helpful comments on the paper, and Yulia Lerner for sharing her data.
The authors declare no competing financial interests.
Correspondence should be addressed to Uri Hasson, 3-C-13 Green Hall, Psychology Department, Princeton University, Princeton, NJ 08540. hasson@princeton.edu