Abstract
Neuroanatomical models hypothesize a role for the dorsal auditory pathway in phonological processing as a feedforward efferent system (Davis and Johnsrude, 2007; Rauschecker and Scott, 2009; Hickok et al., 2011). But the functional organization of the pathway, in particular the time course of interactions among auditory, somatosensory, and motor regions and the pattern of hemispheric lateralization, is largely unknown. Here, ambiguous duplex syllables, with elements presented dichotically at varying interaural asynchronies, were used to parametrically modulate phonological processing and associated neural activity in the human dorsal auditory stream. Subjects performed syllable and chirp identification tasks, while event-related potentials and functional magnetic resonance images were concurrently collected. Joint independent component analysis was applied to fuse the neuroimaging data and study the neural dynamics of brain regions involved in phonological processing with high spatiotemporal resolution. Results revealed a highly interactive neural network associated with phonological processing, composed of functional fields in posterior superior temporal gyrus (pSTG), inferior parietal lobule (IPL), and ventral central sulcus (vCS) that were engaged early and almost simultaneously (at 80–100 ms), consistent with a direct influence of articulatory somatomotor areas on phonemic perception. Left hemispheric lateralization was observed 250 ms earlier in IPL and vCS than in pSTG, suggesting that functional specialization of somatomotor (and not auditory) areas determined lateralization in the dorsal auditory pathway. The temporal dynamics of the dorsal auditory pathway described here offer a new understanding of its functional organization and demonstrate that temporal information is essential to resolve the neural circuits underlying complex behaviors.
Introduction
Under adverse listening conditions, when speech is distorted, noisy, or delivered with a foreign accent, phonemic perception is effortful and facilitated by phonological processing. Phonological processing consists of short-term maintenance of sound sequences in auditory memory during analysis of their auditory, somatosensory, and motor properties to support phonemic categorization (Wise et al., 2001; Buchsbaum et al., 2005; Hickok and Poeppel, 2007). Categorization of ambiguous syllables engages a dorsal pathway, from primary auditory cortex to posterior superior temporal gyrus (pSTG) and ventral parietal regions, associated with auditory short-term memory and interaction with somatosensory and motor areas (Callan et al., 2004; Golestani and Zatorre, 2004; Dehaene-Lambertz et al., 2005; Desai et al., 2008; Liebenthal et al., 2010; Kilian-Hütten et al., 2011a).
The functional organization of auditory regions in the dorsal pathway during phonological processing, particularly the time course of interactions with somatosensory and motor regions and the hemispheric lateralization pattern, is largely unknown. The notion of a simple hierarchical organization in posterior temporal cortex has been challenged by findings of phonemic neural representations not only in left pSTG (Dehaene-Lambertz et al., 2005; Chang et al., 2010; Liebenthal et al., 2010; Kilian-Hütten et al., 2011a), but also near the auditory core (Kilian-Hütten et al., 2011a), though the relative timing of phonemic activity in these areas is unknown. Interhemispheric differences in sensitivity to temporal and spectral sound properties (Zatorre and Belin, 2001; Poeppel, 2003; Boemio et al., 2005), or in resting oscillatory properties of neurons (Giraud et al., 2007), have been suggested to predispose the left auditory cortex for processing of fast spectral transitions characteristic of phonemes. In the ventral auditory pathway, the first left-lateralized speech-specific stage of processing may be when phonemic representations in middle STG are accessed (Liebenthal et al., 2005; Obleser et al., 2007). However, it has been argued that phonological processing in pSTG is bilateral (Hickok and Poeppel, 2007).
Here, ambiguous duplex syllables, with elements presented dichotically at varying interaural asynchronies [stimulus-onset asynchronies (SOAs)], were used to parametrically modulate phonological processing and associated neural activity in the dorsal auditory pathway. Subjects performed a syllable and a chirp identification task with identical stimuli, while event-related potentials (ERPs) and functional magnetic resonance (fMR) images were concurrently collected. The premise for the experimental design was that duplex syllable identification would degrade with increasing SOA, whereas duplex chirp identification would not, permitting characterization of the neural network associated with phonological processing independent of general auditory analysis. Joint independent component analysis (jICA) was applied to the neuroimaging data to study the neural dynamics of brain regions involved in phonological processing with high spatiotemporal resolution (Calhoun et al., 2006; Mangalathu-Arumana et al., 2012).
Results showed that phonological processing is highly interactive, with functional fields in pSTG, inferior parietal lobule (IPL), and ventral central sulcus (vCS) engaged early and almost simultaneously (at 80–100 ms latency), and with activity rebounding in a similar network after 300 ms. Left hemispheric lateralization was observed in IPL and vCS at 120–130 ms, but in pSTG only at 380 ms, suggesting that dorsal stream lateralization stems from a functional specialization of articulatory somatomotor (and not auditory) fields.
Materials and Methods
Subjects.
Participants were 25 adults (15 females; mean age, 24 years) with no history of neurological or hearing impairments, all native speakers of English and right handed according to the Edinburgh Handedness Inventory (Oldfield, 1971). Data from five participants were excluded from ERP analysis, and data from one participant were excluded from fMR image analysis, due to excessive artifact contamination (defined as artifact affecting ≥15% of trials in a condition). In four additional participants, ERP data were not obtained due to equipment malfunction. In total, the report is based on behavioral and fMRI results from 24 participants, and on ERP and jICA results from 15 participants. Informed consent was obtained in accordance with the Medical College of Wisconsin Institutional Review Board.
Stimuli.
The duplex stimuli were derived from a natural utterance of /ga/, resynthesized to a two-formant syllable using Multispeech 3700 (Kay Elemetrics) as described in prior work (Liebenthal et al., 2005), and edited to 295 ms duration. The second formant (F2) spectral transition of /ba/ and /ga/ was separated from the remaining sound structure to create an isolated chirp (containing the distinctive cue for syllable identification) and a base (identical for the two syllables), each presented to one of the ears (with equal frequency) at four SOAs (0, 20, 40, and 80 ms; Fig. 1). Sounds were delivered using a pneumatic audio system (Avotec) at ∼70 dB, adjusted individually to accommodate differences in hearing and in positioning of the ear tips. Stimulus delivery was controlled with Presentation (Neurobehavioral Systems).
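For concreteness, the dichotic assembly of a duplex stimulus can be sketched in code. The following minimal Python sketch is an illustration only: the actual chirp and base were resynthesized from natural speech with Multispeech, and the sampling rate, sweep frequencies, and fixed channel assignment below are placeholder assumptions (in the experiment, ear of presentation was counterbalanced).

```python
# Minimal sketch of dichotic duplex-stimulus assembly (illustration only; all
# signal parameters are placeholders, not the published stimulus values).
import numpy as np

FS = 44100                                 # assumed playback sampling rate (Hz)

def f2_chirp(direction, dur=0.050):
    """Schematic F2 transition: a rising or falling frequency sweep."""
    t = np.arange(int(dur * FS)) / FS
    f0, f1 = (1100, 1800) if direction == "rising" else (1800, 1100)
    freq = np.linspace(f0, f1, t.size)     # linear sweep (placeholder values)
    phase = 2 * np.pi * np.cumsum(freq) / FS
    return np.sin(phase) * np.hanning(t.size)    # tapered to avoid clicks

def duplex(chirp, base, soa_ms):
    """Base to one ear, chirp to the other, chirp onset delayed by the SOA."""
    delay = int(soa_ms / 1000 * FS)
    n = max(base.size, delay + chirp.size)
    stereo = np.zeros((n, 2))
    stereo[:base.size, 0] = base                 # base: left channel
    stereo[delay:delay + chirp.size, 1] = chirp  # chirp: right channel
    return stereo

base = np.random.randn(int(0.295 * FS)) * 0.01   # placeholder for the 295 ms base
stim = duplex(f2_chirp("rising"), base, soa_ms=40)
```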
Task design and procedure.
Trials, each consisting of a sequence of three identical duplex stimuli presented at 1 Hz, were delivered in eight runs during which simultaneous ERP/fMRI data were acquired. In half of the runs (presented in random order), participants were instructed to identify the duplex syllables as ba or ga (syllable task), and in the other half they were instructed to identify the initial chirp as falling or rising (chirp task). Participants responded by pressing one of two keys at the end of each trial. In each run, 20 trials per SOA and 20 silence baseline trials were presented in random order, for a total of 80 trials per SOA and task condition.
In a prescan session, participants practiced the syllable identification task with the original (nonduplex) syllables and with the duplex syllables at an SOA of 0 ms, and the chirp identification task with the isolated chirps and with the duplex syllables at an SOA of 80 ms. These stimuli were selected for practice because performance with them was expected to be highest in each task. Participants were required to achieve 90% accuracy in two consecutive practice runs to proceed to the neuroimaging session, and all were able to reach this accuracy level in two to six practice runs.
fMR image acquisition and analysis.
Images were acquired on a 3 T Excite scanner (GE Medical Systems). Functional data consisted of T2*-weighted, gradient echo, echoplanar images acquired using clustered acquisition at 7 s intervals (echo time = 20 ms, flip angle = 77°, acquisition time = 2 s). The sound sequences were positioned to start 500 ms after the end of each image acquisition to avoid perceptual masking by the acoustic noise of the scanner and to synchronize the next image acquisition with the estimated peak of the BOLD response to the sounds (Vagharchakian et al., 2012). Functional images consisted of 35 axially oriented 3.5 mm slices with a 0.5 mm interslice gap (field of view = 192 mm, 64 × 64 matrix), covering the whole brain. A total of 72 images were acquired per run. High-resolution anatomical images of the entire brain were obtained, using a 3-D spoiled gradient-recalled acquisition in steady state, as a set of 130 contiguous axial slices with 0.938 × 0.938 × 1.0 mm voxel dimensions.
Image analysis was conducted in AFNI (Cox, 1996). Within-subject analysis consisted of spatial coregistration of the functional images to minimize motion artifacts and registration to the anatomical images. Voxelwise multiple linear regression was applied to analyze individual time series, with reference functions representing the task (syllable, chirp), and the four SOA levels coded as a linear progression (1, 2, 3, 4). The trial reaction time (RT) was also modeled to remove activity related to behavioral performance, and six motion parameters were included as covariates of no interest. In each analysis, general linear tests were conducted between conditions.
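The structure of this regression can be illustrated schematically. The following minimal Python sketch is not the AFNI implementation; it assumes that, given the clustered sparse acquisition, each volume samples the BOLD response to the preceding trial, and the trial lists and variable names are hypothetical.

```python
# Minimal sketch of the voxelwise regression logic (illustration only; the
# actual analysis used AFNI's tools).
import numpy as np

SOA_CODE = {0: 1, 20: 2, 40: 3, 80: 4}    # SOA levels coded as a linear progression

def design_matrix(trial_soas, trial_rts, motion):
    """One row per volume/trial. trial_soas uses None for silence baseline
    trials; motion is the (n_trials, 6) motion-parameter array."""
    n = len(trial_soas)
    stim = np.array([s is not None for s in trial_soas], float)     # stimulation vs. silence
    soa = np.array([SOA_CODE.get(s, 0) for s in trial_soas], float)  # parametric SOA
    rt = np.asarray(trial_rts, float)      # nuisance: trial reaction time
    return np.column_stack([np.ones(n), stim, soa, rt, motion])

def fit_voxel(X, y):
    """Ordinary least squares for one voxel's time series y: beta = (X'X)^-1 X'y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```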
The individual statistical maps and the anatomical scans were projected into standard stereotaxic space (Talairach and Tournoux, 1988) by linear resampling and then smoothed with a Gaussian kernel of 4 mm FWHM. Group maps were created in a random-effects analysis. The group condition maps were thresholded at a voxelwise p < 0.01 and corrected for multiple comparisons by removing clusters <380 μl, resulting in a mapwise two-tailed p < 0.01. The group contrast maps were thresholded at a voxelwise p < 0.05 and corrected for multiple comparisons by removing clusters <1641 μl, resulting in a mapwise two-tailed p < 0.05. The cluster thresholds were determined through Monte Carlo simulations that provide the chance probability of spatially contiguous voxels exceeding the voxelwise p threshold.
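The logic of this cluster-extent correction can be sketched as follows. The Python sketch below is an illustrative stand-in for AFNI's cluster simulation tools, with placeholder grid size and smoothness values.

```python
# Illustrative Monte Carlo estimate of a cluster-extent threshold on smooth
# Gaussian null maps (grid size and smoothness are placeholders).
import numpy as np
from scipy import ndimage, stats

def max_null_cluster_sizes(n_iter=1000, shape=(64, 64, 35), fwhm_vox=1.5, p_vox=0.01):
    z_thr = stats.norm.ppf(1 - p_vox / 2)  # two-tailed voxelwise threshold
    sigma = fwhm_vox / 2.3548              # FWHM to Gaussian sigma
    sizes = np.zeros(n_iter, int)
    for i in range(n_iter):
        noise = ndimage.gaussian_filter(np.random.randn(*shape), sigma)
        noise /= noise.std()               # re-standardize after smoothing
        labels, n = ndimage.label(np.abs(noise) > z_thr)
        if n:
            sizes[i] = np.bincount(labels.ravel())[1:].max()
    return sizes

# The cluster cutoff is the percentile of the null maximum-cluster-size
# distribution matching the desired mapwise p, e.g.:
# np.percentile(max_null_cluster_sizes(), 99)   # mapwise p < 0.01
```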
ERP acquisition and analysis.
Sixty-four-channel EEG activity was acquired using the Maglink system (Compumedics) in a continuous mode, with Quik-Cap electrodes positioned according to the International 10–20 System, and CPz serving as the reference. Activity was recorded at full bandwidth and digitally sampled at 500 Hz per channel. Vertical eye movements and electrocardiogram activity were monitored with bipolar recordings. Electrode impedances were kept below 5 kΩ.
EEG analysis was conducted in Scan 4.4 (Compumedics) and consisted of bandpass filtering at 0.1–30 Hz; ballistocardiogram artifact removal (Ellingson et al., 2004); creation of epochs of −100 to +500 ms from sound onset, baseline correcting each epoch by removing the mean voltage value of the whole sweep; and rejection of epochs with voltage values exceeding ±100 μV. The remaining epochs were sorted and averaged according to task and SOA condition.
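A minimal sketch of this epoching pipeline, in Python rather than the Scan 4.4 software actually used, is given below; the zero-phase Butterworth filter and the treatment of the rejection criterion as an absolute-amplitude test (on data in volts) are assumptions.

```python
# Plain-numpy sketch of the ERP preprocessing pipeline (illustration only).
import numpy as np
from scipy.signal import butter, filtfilt

FS = 500                                   # EEG sampling rate (Hz)

def preprocess(eeg, onsets, conditions):
    """eeg: (n_channels, n_samples) in volts; onsets: stimulus sample indices;
    conditions: one task/SOA label per onset. Returns per-condition averages."""
    b, a = butter(4, [0.1, 30.0], btype="bandpass", fs=FS)
    eeg = filtfilt(b, a, eeg, axis=1)      # 0.1-30 Hz band-pass, zero phase
    pre, post = int(0.1 * FS), int(0.5 * FS)   # -100 to +500 ms epochs
    averages = {}
    for cond in set(conditions):
        kept = []
        for t, c in zip(onsets, conditions):
            if c != cond:
                continue
            ep = eeg[:, t - pre:t + post]
            ep = ep - ep.mean(axis=1, keepdims=True)   # whole-sweep baseline
            if np.abs(ep).max() <= 100e-6:             # reject epochs > +/-100 uV
                kept.append(ep)
        averages[cond] = np.mean(kept, axis=0)         # sorted, averaged ERPs
    return averages
```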
jICA of ERP and fMRI.
In a variation of previous jICA approaches to multimodal neuroimaging data (Calhoun et al., 2006), condition-wise fMR images and ERP epochs at 62 electrode sites were integrated in a within-subject analysis (Mangalathu-Arumana et al., 2012). This version of jICA is powerful in that it is sensitive to nonlinear (as well as linear) patterns of dependence on an experimental variable, and is readily amenable to incorporation of the full array of spatiotemporal information in the ERPs. Here, jICA was applied in each subject across the four SOA levels in each task. The fMR images and ERP data in each task were restructured into a joint matrix in which each row corresponded to the flattened t-score functional image for one SOA level (relative to baseline), concatenated with the ERP time series (−100 to +500 ms from stimulus onset) for the same SOA level, flattened across electrodes. Principal component analysis was applied to the joint matrix to whiten the data without reducing its dimensionality, and jICA was then applied to the principal components using the Fusion ICA Toolbox (http://mialab.mrn.org/software/fit/). Four components were returned, each containing a flattened array of fMRI/ERP activity covarying across SOA levels. The components were expanded into their native spaces, resulting in four jICA-fMRI maps and four corresponding jICA-ERP field map time series in each task. The jICA-fMRI maps with the highest positive amplitude values in the syllable and chirp tasks were considered to represent the bulk of the activity related to phonological and general auditory processing, respectively, and were selected for further analysis. In a second step, to consider the possibility that relevant activity was also represented in other components, jICA-fMRI maps with amplitude values reaching at least 50% of those in the map selected in the first step were added. Using this two-step procedure, one joint component per task was selected in 12 subjects, and two components were selected and summed together in one of the tasks in 3 subjects.
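The construction of the joint matrix and its decomposition can be sketched as follows. This conceptual Python sketch substitutes scikit-learn's FastICA (which performs PCA whitening internally) for the Fusion ICA Toolbox actually used, and the array shapes are assumptions.

```python
# Conceptual sketch of the within-subject jICA fusion (illustration only).
import numpy as np
from sklearn.decomposition import FastICA

n_soa = 4                                  # SOA levels: 0, 20, 40, 80 ms
n_vox = 64 * 64 * 35                       # flattened fMRI t-map size (assumed)
fmri_maps = np.random.randn(n_soa, n_vox)  # t-maps vs. baseline, one per SOA
erp = np.random.randn(n_soa, 62, 300)      # 62 electrodes x 600 ms at 500 Hz

# Each row: one SOA level's fMRI map concatenated with its flattened ERP.
joint = np.hstack([fmri_maps, erp.reshape(n_soa, -1)])

# Independence is sought over the joint feature dimension, so features are
# passed as observations; the mixing matrix then holds the SOA loadings.
ica = FastICA(n_components=n_soa, random_state=0)
patterns = ica.fit_transform(joint.T)      # (n_features, 4) joint components
soa_loadings = ica.mixing_                 # (4 SOA levels, 4 components)

# Unfold each component back into its native fMRI and ERP spaces:
jica_fmri = patterns[:n_vox, :].T                        # jICA-fMRI maps
jica_erp = patterns[n_vox:, :].T.reshape(4, 62, 300)     # jICA-ERP time series
```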
Grouping of the individual jICA-fMRI maps was performed using a random-effects model and mapwise correction levels as used in the multiple regression fMRI analysis (computed relative to the distribution of the nonselected jICA-fMRI components). To gain greater sensitivity in the temporal cortex, the jICA-fMRI task-contrast map was also corrected by removing clusters smaller than 800 μl in the left and right superior and middle temporal gyri, resulting in two-tailed p < 0.05 in these areas (see Fig. 5, bottom row). Grand average (across subjects) jICA-ERP waveforms were computed for each task condition.
jICA-ERP source reconstruction.
Source reconstruction of the grand average (n = 15) jICA-ERP waveform was performed using the weighted minimum norm estimate to solve the inverse problem (Brainstorm 3.0), with a template head model created from T1-weighted MR images of the Colin brain available in Brainstorm, using a three-shell spherical Berg approximation representing the brain, scalp, and skull. The cortical surface was tessellated into a high-density mesh of 15,000 vertices, with elementary current dipoles positioned at each vertex, oriented perpendicular to the cortical surface. Electrode positions were approximated based on a template electrode position file. Activity in vertices at the base of the temporal pole was masked because the assumption of uniform skull thickness is violated and a spherical head model is therefore inadequate in this region (Teale et al., 2002; Hamalainen et al., 2007).
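The core of the minimum norm inverse can be sketched as follows. This minimal Python sketch implements only an unweighted, Tikhonov-regularized estimate; Brainstorm's weighted variant additionally applies depth weighting and noise-covariance whitening, and the regularization scaling below is an assumption.

```python
# Unweighted, Tikhonov-regularized minimum-norm sketch (illustration only).
import numpy as np

def minimum_norm(leadfield, data, lam=0.1):
    """leadfield: (n_electrodes, n_sources); data: (n_electrodes, n_times).
    Returns estimated source time series, shape (n_sources, n_times)."""
    L = leadfield
    gram = L @ L.T                         # (n_electrodes, n_electrodes)
    reg = lam**2 * np.trace(gram) / gram.shape[0] * np.eye(gram.shape[0])
    return L.T @ np.linalg.solve(gram + reg, data)
```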
Current source density estimates in each task (see Fig. 6A,B, top and middle rows) were expressed as time point by time point z-scores relative to the mean activity in the baseline period (−100 to −2 ms before stimulus onset). Contrast maps between tasks (see Fig. 6A,B, bottom rows) were computed by subtraction of the z-score source maps. The task maps were thresholded vertexwise at z-scores of ±4, and the contrast maps at z-scores of ±3, corresponding to p < 10^−5 and p < 10^−3, respectively, relative to the poststimulus distribution of all vertices across time. Cluster thresholding at seven contiguous vertices, spatial smoothing with a Gaussian kernel SD of five vertices, and low-pass temporal filtering at 15 Hz were applied to remove spatially and temporally spurious activity.
A region-of-interest (ROI) analysis of the jICA-ERP source maps was performed to examine the temporal course of processing in brain areas involved in the two tasks. The ROIs were seeded in areas in which fMRI activity related to SOA was stronger in one of the tasks (see Figs. 3, 5, bottom rows), but was not correlated with RT. Using a conservative approach guided by general anatomical (arrangement of functional regions according to gyral patterns) and methodological (lower spatial resolution of ERP compared with fMRI) considerations, the ROIs were expanded along gyri so as to cover the posterior superior temporal, parietal, and precentral cortex, without overlap. Seven ROIs were created in the left hemisphere and then mirrored on the right. The regions (Fig. 7A,B, middle) consisted of the pSTG (blue), IPL (magenta), superior parietal lobule (SPL; cyan), ventral post-central gyrus (vPostCG; red), dorsal post-central gyrus (dPostCG; yellow green), ventral pre-central gyrus (vPreCG; green), and dorsal pre-central gyrus (dPreCG; yellow brown). Note that, based on these considerations, IPL included only the anterior part of the supramarginal gyrus. The pSTG included the portion of STG ventral and posterior to Heschl's gyrus. The temporal course of jICA-ERP mean source activity across vertices in each ROI was computed in each task and expressed as z-scores at each time point relative to the mean activity in the baseline period (−100 to −2 ms). Differences in time course between the two tasks and hemispheres in each ROI were considered significant at pointwise t scores corresponding to p < 0.0002, sustained for a period of at least 30 ms, resulting in a corrected p < 0.005 relative to a temporally randomized distribution of points (Guthrie and Buchwald, 1991).
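The sustained-significance criterion can be sketched as follows. The Python sketch assumes that subject-wise ROI time courses are available and that a paired t-test is the pointwise statistic; both details are assumptions about the analysis described above.

```python
# Sketch of the sustained-significance criterion (Guthrie and Buchwald, 1991):
# pointwise tests must remain below the p threshold for at least 30 ms.
import numpy as np
from scipy import stats

FS = 500
MIN_RUN = int(0.030 * FS)                  # 30 ms = 15 consecutive samples

def sustained_difference(roi_a, roi_b, p_thresh=0.0002):
    """roi_a, roi_b: (n_subjects, n_times) ROI time courses.
    Returns a boolean mask marking time points inside sustained runs."""
    _, p = stats.ttest_rel(roi_a, roi_b, axis=0)   # pointwise paired t-tests
    sig = p < p_thresh
    out = np.zeros(sig.size, bool)
    start = None
    for i, s in enumerate(np.append(sig, False)):  # sentinel closes a final run
        if s and start is None:
            start = i
        elif not s and start is not None:
            if i - start >= MIN_RUN:
                out[start:i] = True                # keep only runs >= 30 ms
            start = None
    return out
```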
Results
Ambiguous duplex syllables, consisting of a brief spectral transition containing the cue for syllable identification (the chirp) and a base containing the remainder of the syllable, were each presented to one ear at interaural stimulus onset intervals (i.e., SOA) ranging from 0 to 80 ms (Fig. 1). Subjects were asked to identify the syllables as /ba/ or /ga/ (syllable task), or the chirps as rising or falling (chirp task). In this paradigm, only syllable identification is expected to depend on SOA, thereby allowing disentanglement of the neural processes associated with phonological processing (engaged in the syllable task) from those associated with general auditory (nonlinguistic, engaged in the chirp task) analysis of the speech sounds.
Behavioral performance
The effects of task (syllable, chirp) and SOA (0, 20, 40, 80 ms) on the behavioral accuracy and RT of duplex stimulus identification were examined in an ANOVA and are summarized in Figure 2. Overall, the results confirm that the performance accuracy and RT were dependent on SOA only in the syllable task.
Mean accuracy across SOAs was significantly lower in the syllable (78%) than in the chirp (92%) task (F(1,23) = 34.66, p < 0.00001). Across tasks, accuracy was lower at an SOA of 80 ms than at SOAs of 0, 20, and 40 ms (F(3,69) = 9.10, p < 0.00005). There was also an interaction between task and SOA, with accuracy in the syllable task lower at SOAs of 80 ms (66%, SE = 3%) and 40 ms (78%, SE = 3%) than at SOAs of 20 ms (86%, SE = 3%) and 0 ms (83%, SE = 2%); at an SOA of 80 ms than at an SOA of 40 ms; and no accuracy change as a function of SOA in the chirp task (F(3,69) = 15.37, p < 0.00001).
For RT, there was a trend toward an overall longer RT in the syllable (1025 ms) than in the chirp (965 ms) task (p = 0.07). Across tasks, the RT was longer at an SOA of 80 ms than at SOAs of 40, 20, and 0 ms (F(3,69) = 6.01, p < 0.001). There was also a significant interaction between task and SOA, with RT in the syllable task significantly longer at SOAs of 80 ms (1073 ms, SE = 40 ms) and 40 ms (1034 ms, SE = 37 ms) than at SOAs of 20 ms (1001 ms, SE = 35 ms) and 0 ms (992 ms, SE = 34 ms); at an SOA of 80 ms than at an SOA of 40 ms; and no RT change as a function of SOA in the chirp task (F(3,69) = 3.74, p < 0.01).
Importantly, the mean RT in the syllable task was related to SOA, such that the increase in RT at each SOA (relative to an SOA of 0 ms) closely tracked the SOA itself (81 ms at an SOA of 80 ms, 43 ms at an SOA of 40 ms, and 10 ms at an SOA of 20 ms). This suggests that the longer RT in the syllable task relative to the chirp task was largely due to the increase with SOA in the duration of the initial portion of the syllable (consisting of the chirp and the corresponding portion of the base) that contains the relevant information for syllable identification (Fig. 1), and not to an inherently longer neural processing time for phonological relative to nonlinguistic auditory processing.
Functional magnetic resonance imaging
The effects of task and SOA on the fMRI activity were investigated in a voxelwise multiple regression coding the trial SOA as a linear progression, and including regressors representing trial RT to model activity related to behavioral performance.
In the syllable task, activity in bilateral pSTG, IPL, and vCS increased proportionally with SOA (Fig. 3, top row). In the chirp task, only activity negatively related to SOA was found, in left angular gyrus (AG), right inferior frontal gyrus (IFG), and the postcentral gyrus, and in bilateral supplementary motor area (SMA; Fig. 3, middle row). The areas of negative activation in the chirp SOA correlation map largely coincided with areas of positive activation in the RT correlation map, suggesting that the chirp SOA correlation map primarily reflected a decrease in the engagement of executive functions with SOA in this task. Activity systematically related to behavioral performance may have been less effectively modeled (and removed) by the RT regressor in the chirp condition because of the limited variability in mean performance measures in that condition (Fig. 2). A direct comparison of the SOA effect in the two tasks revealed a stronger linear relationship with SOA in the syllable task in left pSTG, IPL, and vCS (Fig. 3, bottom row). Other foci of activation seen in this contrast were either in areas related to RT and negatively activated in the chirp condition (right IFG, bilateral SMA, and anterior cingulate) or reflected stronger activity during the baseline period (left anterior temporal pole; see Fig. 5). Importantly, the activity positively and linearly related to SOA in the syllable task, in left pSTG, IPL, and vCS, was in areas in which activity was not associated with behavioral performance. The size, mean and peak amplitudes, and peak locations of the activation clusters in each contrast in Figure 3 are given in Table 1.
Event-related potentials
The ERP waveforms in the two tasks, shown averaged across SOA conditions in Figure 4, were characterized by a frontocentral negativity peaking at ∼160 ms followed by a frontocentral positivity peaking at ∼250 ms, consistent with the spatiotemporal characteristics of N1 and P2 responses evoked by syllables (Näätänen and Picton, 1987; Martin et al., 1997). In the syllable task, this sequence of ERPs was followed by a prolonged frontal negativity peaking at ∼350 ms, coinciding with the range of the N320 and N350 components previously associated with phonological processing (Bentin et al., 1999).
Joint independent component analysis of fMRI and ERP
Integration of the fMRI and ERP results was conducted using a within-subject variant of jICA (Mangalathu-Arumana et al., 2012), to examine the temporal course of processing in brain regions showing linear or nonlinear variation with SOA. In each subject and task, the mean fMRI and ERP signals covarying as a function of SOA were associated with a joint component consisting of an fMRI spatial map (jICA-fMRI) and an ERP topographical map time series (jICA-ERP). Group jICA-fMRI maps in each task were computed using a random-effects model (Fig. 5), and the neural generators of the grand average jICA-ERP in each task were estimated using a minimum norm solution (Fig. 6). The temporal course of neural activity, in seven brain regions in each hemisphere that were more strongly activated in one of the tasks, was assessed in an ROI analysis of the jICA-ERP sources (Fig. 7).
jICA-fMRI
The syllable and chirp tasks (compared with baseline) induced overall similar patterns of activation, with greater signal (Fig. 5, two top rows, orange-yellow colors) in bilateral STG, IPL, IFG, SMA, and thalamus, and in left SPL and pre- and postcentral gyri. Greater signal in the baseline (blue-cyan colors) was observed in bilateral anterior temporal pole, AG, parieto-occipital cortex, middle frontal gyrus, anterior cingulate, and precuneus. A direct comparison between the two tasks showed greater activation in the chirp task (Fig. 5, third row, blue-cyan colors) in IPL and SPL bilaterally. Stronger activity for the syllable over the chirp task was observed in the left pSTG, albeit only at a more lenient threshold using a small-volume temporal lobe mask correction (Fig. 5, bottom row). Importantly, there was no significant activity in the right temporal cortex even at this lenient correction level. Other small foci of activity in the task contrast, in left AG, bilateral precuneus, medial superior frontal gyrus, and anterior temporal pole, were due to greater activation in the baseline relative to task conditions (observed as negative activation in Fig. 5, two top rows), consistent with task-induced deactivations possibly reflecting suspension of spontaneous semantic processing during rest (McKiernan et al., 2006). The small foci in the task contrast in bilateral SMA were in areas where activity was related to RT. The size, mean and peak amplitudes, and peak locations of the activation clusters in each contrast are given in Table 2.
Together, the linear regression (Fig. 3) and jICA (Fig. 5) fMRI maps suggest that neural activity in the syllable task in left pSTG, IPL, and vCS varied linearly with SOA and independently of performance RT. This activity was better described in the linear regression analysis (which was based on trial-by-trial variations with SOA and included regressors for trial RT) than in the jICA (which was based on average variations with SOA across trials). In contrast, neural activity in the chirp task in bilateral IPL and SPL varied nonlinearly with SOA and was therefore better described in the jICA-fMRI map (which was sensitive to nonlinear variations).
jICA-ERP
ERP activity in both tasks was observed during two main time periods: an early period at ∼80–230 ms and a late period from ∼300 ms onward. Neural source reconstructions of the grand average jICA-ERP waveforms in the early and late time periods (Fig. 6A,B, respectively) are shown for the syllable (Fig. 6A,B, top row) and chirp (Fig. 6A,B, middle row) tasks, and for the task difference (Fig. 6A,B, bottom row). The main activity stronger in the syllable task was seen in bilateral pSTG and in left ventral parietal and posterior frontal areas (IPL, vPostCG, and vPreCG). Activity stronger in the chirp task was seen primarily in bilateral STG and right parietal areas (SPL, IPL, and dPostCG). Other activation foci in the task contrast (in bilateral SMA, left anterior STG, left parieto-occipital cortex, and right IFG) were in areas in which activity was found to be stronger in the baseline condition (Fig. 5, jICA-fMRI maps) or was related to RT.
The temporal course of ERP activity in each ROI, represented as mean z-scores across all vertices in the ROI in the period after stimulus presentation (0–500 ms) relative to the baseline (−100 to −2 ms), is shown for the syllable (full trace) and chirp (dotted trace) tasks in the left and right hemispheres (Fig. 7A,B, respectively). The pattern of lateralization in each ROI and task is shown in Figure 7C. Overall, neural activity was stronger in the left hemisphere and centered in ventral parietal areas in the syllable task, and stronger in the right hemisphere and centered in dorsal parietal areas in the chirp task. In the syllable task, the earliest activity was observed in bilateral pSTG, starting at 80 ms and peaking at ∼100 ms. It was closely followed and then dominated by activity in left IPL, starting at 96 ms and peaking at ∼140 ms, with somewhat weaker activity in left vPostCG and vPreCG following a temporal profile similar to that of left IPL. The activity was significantly left lateralized in the ventral parietal regions (IPL, vPostCG), vPreCG, and SPL in the early time window, and in pSTG, vPostCG, and vPreCG in the late time window. In the chirp task, the earliest and strongest activity was observed in right SPL, starting at 90 ms and peaking at ∼160 ms, with activity in bilateral STG exceeding the significance threshold only at ∼120 ms. Activity in the chirp task was right lateralized in all ROIs except STG and dPreCG, in both the early and late time windows.
Discussion
The neural dynamics of phonological processing were examined independently of those of auditory processing by parametrically modulating the perception of ambiguous speech stimuli. The behavioral and neuroimaging results, showing a significant interaction between the effects of task type and interaural SOA on both behavioral performance measures and functional brain maps, confirm that different neural processing of identical duplex stimuli was elicited in each task. The dependence of behavioral and neural measures on interaural SOA specifically in the syllable task suggests that dichotic fusion of the chirp and base portions of the syllable was required for syllable identification, consistent with prior reports (Repp et al., 1983; Bentin and Mann, 1990). Phonemic perception in the duplex syllable task thus emerged from dichotic fusion of temporally misaligned spectral elements of the syllable. Under these listening conditions with distorted speech input, a dorsal auditory stream associated with phonological processing (Wise et al., 2001; Buchsbaum et al., 2005; Hickok and Poeppel, 2007) was activated, with relatively stronger activity in the syllable task in a perisylvian network including left pSTG, IPL, and vCS.
Activity in left pSTG, IPL, and vCS increased linearly with interaural SOA specifically in the syllable task and independently of fluctuations in behavioral performance (Fig. 3), consistent with a role for these areas in phonological processing. jICA of the EEG and fMRI data indicated that activity correlated with SOA in the syllable task arose in bilateral pSTG at 80–90 ms after stimulus onset; was quickly followed and dominated by strong activity in left IPL, vPostCG, and vPreCG at 95–230 ms; and rebounded in bilateral STG, IPL, left vPostCG, and vPreCG after 300 ms (Figs. 6, 7). This pattern of neural dynamics is incompatible with a simple hierarchical organization in an afferent pathway, whereby neural activity flows from primary to higher fields that process increasingly complex auditory patterns, and on to cognitive and executive areas. Rather, the left IPL, an area associated with short-term phonological storage (Paulesu et al., 1993; Buchsbaum and D'Esposito, 2009), possibly subvocalization (Koelsch et al., 2009; Price, 2010), and somatosensory articulatory feedback (Tourville et al., 2008), and the left vCS, associated with orofacial somatomotor control (Corfield et al., 1999; Fesl et al., 2003), were active early, during the period of perceptual analysis of the sounds in superior temporal cortex, pointing to interactive processing with efferent feedback from somatomotor to auditory cortex. The period 80–230 ms after sound onset coincides with the N1 and P2 electrophysiological responses, reflecting neural processes related to auditory analysis and object perception in superior temporal cortex, including analysis of speech spectrotemporal features relevant to phonemic perception (Liégeois-Chauvel et al., 1999; Eggermont and Ponton, 2002; Ahveninen et al., 2006; Chang et al., 2010; Liebenthal et al., 2010; Steinschneider et al., 2011; Tsunada et al., 2011). The strong and early activity in left ventral parietal and central sulcus regions observed here provides firm evidence that the somatomotor cortex associated with orofacial movement control plays a direct role in phonemic perception, at least when speech is ambiguous. Later activity observed after 300 ms in the duplex syllable task, in the same left temporoparietal–posterior frontal perisylvian network, could reflect the activation and maintenance of categorical neural representations of the syllables for task-related response selection. Indeed, this time range coincides with the N2 and P3 ERP components, shown to accurately reflect speech categorization in discrimination tasks (Maiste et al., 1995; Martin et al., 1997; Toscano et al., 2010), and with the negative N320 and N350 responses to pronounceable letter strings, associated with the activation of phonological representations from print (Bentin et al., 1999).
Despite the mounting evidence for involvement of the dorsal auditory stream in perceptual processes, the neural mechanisms underlying auditory and somatomotor interactions are far from resolved. Recent neuroanatomical models (Rauschecker and Scott, 2009; Hickok et al., 2011) postulate a primary role for the dorsal auditory stream in speech perception as a feedforward efferent system, from frontal premotor to posterior temporal auditory areas via inferior parietal cortex, whereby predictive motor signals modulate sensory processing, at least when the speech input is degraded. Motor influences on phonemic categorization performance have been described in the ventral precentral gyrus at the level of the premotor cortex (Wilson et al., 2004; Meister et al., 2007; Chang et al., 2011; Osnes et al., 2011) and primary motor cortex (Mottonen et al., 2009). In ventral parietal cortex, neurons representing the somatosensory (tactile and proprioceptive) articulatory properties of speech sounds have been hypothesized to exert a modulatory influence on phonemic perception (Guenther, 2006; Tourville et al., 2008). Feedback from IPL to left posterior auditory areas has been suggested to play an important role in perceptual learning of ambiguous phonemic categories (Kilian-Hütten et al., 2011b). Indeed, the left IPL is activated during overt categorization tasks of phonemic and also of trained nonphonemic sounds, with the level of activity positively related with categorization ability (Caplan and Waters, 1995; Celsis et al., 1999; Jacquemot et al., 2003; Dehaene-Lambertz et al., 2005; Raizada and Poldrack, 2007; Desai et al., 2008). Feedback to the pSTG from areas representing the somatomotor articulatory properties of speech sounds may act as a top-down selection mechanism to tune the auditory areas to the set of possible phonemic inputs (Hickok et al., 2011) and to facilitate activation of phonemic representations in pSTG. Neurocomputationally, the role of bottom-up and top-down interactions in perceptual processing can be understood in terms of predictive coding, whereby forward connections conveying prediction errors and reciprocal backward connections mediating predictions are balanced to minimize prediction errors and optimize the probabilistic representation of sensory input (Friston, 2010). Such an inference scheme can be used to model the neural computations underlying perceptual categorization, corresponding to the mapping of noisy and dynamic sensory input to a fixed point in perceptual space (Friston and Kiebel, 2009).
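As a purely schematic illustration of this predictive-coding idea (after Friston, 2010; not a model fitted to the present data), a minimal update rule can be written as follows, with all quantities hypothetical: backward predictions g(mu) meet forward input y, and the precision-weighted prediction error drives belief updating.

```python
# Toy predictive-coding iteration (schematic; all quantities hypothetical).
import numpy as np

def pc_step(mu, y, g, dg, precision=1.0, lr=0.1):
    """One update: the precision-weighted prediction error (forward signal)
    revises the higher-level estimate mu, whose prediction g(mu) is the
    backward signal."""
    err = precision * (y - g(mu))          # forward: prediction error
    return mu + lr * dg(mu).T @ err        # backward-constrained belief update

# Toy linear generative model, y = W @ mu_true + noise:
W = np.array([[1.0, 0.5], [0.2, 1.0]])
g, dg = (lambda mu: W @ mu), (lambda mu: W)
mu, y = np.zeros(2), np.array([1.0, 0.3])
for _ in range(200):
    mu = pc_step(mu, y, g, dg)             # mu converges toward W^-1 @ y
```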
The present findings, revealing the temporal dynamics of the temporoparietal–posterior frontal perisylvian network during a phonemic categorization task, permit the updating of our current understanding of auditory dorsal pathway function in phonemic perception in several important ways. The early timing of the IPL and vCS activity, during the phase of auditory perceptual analysis and a mere 15–20 ms after the pSTG, alleviates any concerns regarding potential confounding effects of subarticulatory or behavioral decision-making processes (Callan et al., 2004; Hickok et al., 2011) and confirms a genuine role for these regions in phonemic perception. The location of the BOLD activity in the ventral tip of the post- and pre-central gyri is consistent with that of primary sensorimotor cortex associated with orofacial motor control (Corfield et al., 1999; Fesl et al., 2003), although involvement of premotor fields cannot be ruled out. The strong activity in inferior parietal cortex further suggests that somatosensory feedback contributed significantly to phonemic perception in the syllable task. On the other hand, the ventrolateral prefrontal cortex was not differentially activated during phonological processing despite the use of an overt categorization task, consistent with a domain-nonspecific role for this region. Together, the results suggest the existence of a direct feedback loop from ventral parietal and ventral central sulcus regions to posterior temporal cortex, representing an influence of somatosensory and articulatory representations of speech sounds on phonemic perception. In the monkey (Seltzer and Pandya, 1978; Petrides and Pandya, 1984, 2009) and more recently in the human (Frey et al., 2008; Makris et al., 2009), the anterior part of the IPL was demonstrated to have strong reciprocal connections with ventral premotor (and ventrolateral prefrontal) areas controlling orofacial musculature via the superior longitudinal fasciculus, and with the posterior superior temporal cortex via the middle longitudinal fasciculus. These anatomical connections could form the basis for the functional phonological loop activated here. Feedback from discrete somatomotor representations of speech may serve to narrow the range of possible sound inputs, and to activate categorical phoneme representations in pSTG. The activation and maintenance of categorical phonemic representations may correspond to the second phase of neural activity observed in this study. According to this view, neural representations of both graded and categorical properties of sounds are present within the same general posterior temporal region, but are activated at different time phases, consistent with the existence of feedforward and feedback processes in this region.
An important aspect of the findings is that left lateralization in the syllable task was observed in IPL and vCS in the early time period, but in the pSTG only at ∼380 ms latency. This pattern implies that left lateralization in the phonological dorsal pathway is due initially to stronger activation of left IPL and vCS, and is imposed only later on the pSTG through feedback interactions. The results are generally incompatible with theories emphasizing hemispheric differences in auditory processing (Boemio et al., 2005; Giraud et al., 2007), at least as the basis for left hemispheric dominance during phonological processing. Instead, functional specialization of somatosensory and motor areas may determine lateralization in the dorsal auditory stream.
In summary, the neural dynamics of phonological processing in the dorsal auditory pathway described here are consistent with reciprocal activity in pSTG, IPL, and vCS, and with left lateralization originating in IPL and vCS.
Notes
Supplemental material for this article is available at http://www.neuro.mcw.edu/~einatl/files/. This material has not been peer reviewed.
Footnotes
The authors declare no competing financial interests.
This research was supported by National Institute on Deafness and other Communication Disorders Grant R01 DC006287 (E.L.) and the Medical College of Wisconsin Clinical Translational Science Institute (E.L., S.A.B.).
Correspondence should be addressed to Einat Liebenthal, Medical College of Wisconsin, Department of Neurology, 8701 West Watertown Plank Road, Milwaukee, WI 53226. E-mail: einatl@mcw.edu