Abstract
Debates about motor theories of speech perception have recently been reignited by a burst of reports implicating premotor cortex (PMC) in speech perception. Often, however, these debates conflate perceptual and decision processes. Evidence that PMC activity correlates with task difficulty and subject performance suggests that PMC might be recruited, in certain cases, to facilitate category judgments about speech sounds (rather than speech perception, which involves decoding of sounds). However, it remains unclear whether PMC does, indeed, exhibit neural selectivity that is relevant for speech decisions. Further, it is unknown whether PMC activity in such cases reflects input via the dorsal or ventral auditory pathway, and whether PMC processing of speech is automatic or task-dependent. In a novel modified categorization paradigm, we presented human subjects with paired speech sounds from a phonetic continuum but diverted their attention from phoneme category using a challenging dichotic listening task. Using fMRI rapid adaptation to probe neural selectivity, we observed acoustic-phonetic selectivity in left anterior and left posterior auditory cortical regions. Conversely, we observed phoneme-category selectivity in left PMC that correlated with explicit phoneme-categorization performance measured after scanning, suggesting that PMC recruitment can account for performance on phoneme-categorization tasks. Structural equation modeling revealed connectivity from posterior, but not anterior, auditory cortex to PMC, suggesting a dorsal route for auditory input to PMC. Our results provide evidence for an account of speech processing in which the dorsal stream mediates automatic sensorimotor integration of speech and may be recruited to support speech decision tasks.
Introduction
There remains an ongoing and contentious debate in speech processing concerning what role premotor cortex (PMC) might play in speech perception, a function that is largely attributed to the ventral auditory stream (Hickok and Poeppel, 2007; Rauschecker and Scott, 2009; DeWitt and Rauschecker, 2012). Although there is little doubt that PMC can be active while listening to speech (Wilson et al., 2004; Skipper et al., 2005; Turkeltaub and Coslett, 2010), recent studies providing evidence that disrupting PMC processing impedes performance on explicit phoneme category judgments (Meister et al., 2007; Sato et al., 2009) have rekindled the debate about whether PMC function in speech perception is essential (Meister et al., 2007; Pulvermüller and Fadiga, 2010) or not (Hickok et al., 2011; Schwartz et al., 2012). Yet, the aforementioned studies linking PMC to phoneme categorization performance used speech categorization tasks, making it difficult to disentangle PMC's contribution to perceptual versus decision processes (Binder et al., 2004; Holt and Lotto, 2010).
Current theories of speech processing have partially converged in proposing a nonperceptual role for the dorsal stream (Hickok et al., 2011; Rauschecker, 2011): the dorsal auditory stream likely mediates sensorimotor integration (i.e., mapping speech signals to motor articulations, and vice versa). Although this role could explain PMC activity during listening to speech, it is currently unknown whether the dorsal stream exhibits neural tuning reflecting this mapping.
Recent evidence that PMC activity is modulated by performance during a phoneme categorization task (Alho et al., 2012) has given rise to the hypothesis that the dorsal stream might be recruited in a task-specific manner to support speech categorization. Although this evidence provides circumstantial support for the hypothesis, it is currently unknown whether the dorsal stream exhibits neural selectivity supportive of speech categorization, as it is difficult to infer neural tuning properties from traditional BOLD-contrast fMRI paradigms. Additionally, it is unknown whether dorsal-stream processing is automatic, a defining criterion for which is that processing take place even when it is unnecessary for the task at hand (Moors and De Houwer, 2006). Finally, although evidence suggests that dorsal-stream recruitment may vary across subjects (Alho et al., 2012; Szenkovits et al., 2012), it is currently unclear whether the underlying neural tuning properties predict behavioral performance.
Using fMRI rapid adaptation (fMRI-RA), we probed neural selectivity in human subjects listening to paired speech stimuli from a place-of-articulation continuum (/da/−/ga/). Subjects were asked to perform a difficult dichotic listening task for which phoneme category information was irrelevant. The following questions were addressed: (1) Would graded acoustic-phonetic differences between stimuli be detected in auditory cortical regions, especially in the left hemisphere? (2) Would phoneme category-selectivity be observed in the PMC despite it being task-irrelevant? Importantly, we asked whether the degree of category selectivity in PMC would be found to correlate with the sharpness of subjects' category boundary. The latter was measured outside the scanner after the imaging experiment. (3) In addition, structural equation modeling (SEM) was applied to reveal the source of auditory cortical input to PMC.
Materials and Methods
Participants.
Sixteen subjects participated in this study (7 females, 18–32 years of age). The Georgetown University Institutional Review Board approved all experimental procedures, and all subjects gave written informed consent before participating. All subjects were right-handed, reported no history of hearing problems or neurological illness, and spoke American English as their native language. One subject (female) was excluded from the imaging study based on the behavioral pretest showing no evidence of categorical phoneme perception (see below). Imaging data from one additional subject (male) were excluded from subsequent analyses because of excessive head motion, leaving a total of 14 subjects in the imaging study.
Stimuli.
To probe neuronal tuning for phonemes, we generated a place-of-articulation continuum between natural utterances of /da/ and /ga/ (Fig. 1) using the MATLAB toolbox STRAIGHT (MathWorks). STRAIGHT allows for finely and parametrically manipulating the acoustic and acoustic-phonetic structure of high-fidelity natural voice recordings (Kawahara and Matsui, 2003). Speech stimuli were recordings (44.1 kHz sampling frequency) of natural /da/ and /ga/ utterances taken from recordings provided by Shannon et al. (1999). Two phonetic continua (or “morphs”) were generated at 0.5% intervals between /da/ and /ga/ prototypes: one for a male voice and one for a female voice. Morphed stimuli were generated up to 25% beyond each prototype (i.e., from −25% /ga/ to 125% /ga/), for a total of 301 stimuli per morph line. The stimuli created beyond the prototypes were qualitatively assessed to ensure they were intelligible, and this assessment was then verified behaviorally in an identification experiment (described below), wherein subjects robustly labeled these stimuli appropriately. All stimuli were then resampled to 48 kHz, trimmed to 300 ms duration, and root-mean-square normalized in amplitude. A linear amplitude ramp of 10 ms duration was applied to sound offsets to avoid auditory artifacts. Amplitude ramps were not applied to the onsets, however, to avoid interfering with the natural features of the consonant sound.
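The stimulus count follows directly from the step size and the extrapolation range. The short Python sketch below only illustrates that arithmetic; the function name and the convention of expressing morph position as percent /ga/ are assumptions made here for illustration and are not part of the STRAIGHT toolbox.

```python
# Morph positions are expressed as percent /ga/
# (0 = /da/ prototype, 100 = /ga/ prototype).
STEP_PERCENT = 0.5   # spacing between neighbouring morphs (from the text)
EXTENSION = 25.0     # extrapolation beyond each prototype (from the text)


def morph_levels(step=STEP_PERCENT, extension=EXTENSION):
    """Return all morph positions from -extension to 100 + extension, inclusive."""
    n_steps = round((100.0 + 2 * extension) / step)
    return [-extension + i * step for i in range(n_steps + 1)]


levels = morph_levels()
print(len(levels), levels[0], levels[-1])
```

The list includes both prototypes (0 and 100) as well as the extrapolated stimuli on either side.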
Discrimination behavior.
To identify participants' individual category boundaries while minimizing the risk that they would covertly categorize sounds in the scanner, participants completed a discrimination test before scanning. Participants' discrimination thresholds were measured at 10% intervals along each continuum, for stimuli displaced in each direction. The adaptive staircase algorithm QUEST (Watson and Pelli, 1983), implemented in MATLAB using the Psychophysics Toolbox (Brainard, 1997; Pelli, 1997), was used to adjust the difference between paired stimuli based on subject performance. On each trial, participants heard two sounds and were asked to report as quickly and accurately as they could whether the two sounds were exactly the same, or in any way different. A maximum time of 3.0 s was allowed for a response before the next trial started. In half the trials, paired stimuli were identical and in the other half they were different. Of the different pairs, half were for displacement in one direction along the continuum, and half were in the opposing direction. Subjects completed 28 trials per condition, for 20 conditions (10% intervals from 0 to 90% with displacements toward 100%, and 10% intervals from 10 to 100% with displacement toward 0%), yielding 560 total trials.
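As a check on the design arithmetic, the 20 discrimination conditions can be enumerated explicitly. This is a minimal sketch; the tuple representation and the direction labels are illustrative assumptions, not the data structures used in the experiment code.

```python
TRIALS_PER_CONDITION = 28  # from the text

# Each condition is (reference position in percent /ga/, displacement direction).
toward_100 = [(pos, "toward 100%") for pos in range(0, 100, 10)]   # 0, 10, ..., 90
toward_0 = [(pos, "toward 0%") for pos in range(10, 110, 10)]      # 10, 20, ..., 100
conditions = toward_100 + toward_0

total_trials = len(conditions) * TRIALS_PER_CONDITION
print(len(conditions), total_trials)
```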
Identification behavior.
After subjects completed the fMRI experiment, they were asked to categorize the auditory stimuli along each continuum to confirm the location of participants' category boundary and to measure its sharpness. Categorization was tested at 10% intervals from −25% (i.e., 25% past /da/ away from /ga/) to 125% (25% past /ga/ away from /da/). On each trial, subjects were presented with a single sound and given up to 3 s to indicate as quickly and accurately as they could whether they heard /da/ or /ga/. Subjects completed 20 trials per condition for 15 conditions, for a total of 300 trials per morph. We then fit the resulting data with a sigmoid function to estimate the boundary location as well as boundary sharpness for each subject. The sigmoid was calculated according to the generalized logistic curve: $f(x) = \frac{1}{1 + e^{-(x - \alpha)/\beta}}$, where x is the location along the morph line, α is the location of the boundary along the morph line, and 1/β is the steepness of the boundary (with larger values of 1/β resulting in sharper boundaries).
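The boundary estimate can be illustrated with a small fitting sketch. It assumes the logistic takes the standard two-parameter form 1 / (1 + exp(−(x − α)/β)), which matches the parameter definitions given above, and it replaces whatever fitting routine was actually used with a coarse least-squares grid search. The data are synthetic and the function names are hypothetical.

```python
import math


def logistic(x, alpha, beta):
    """Proportion of /ga/ responses at morph position x (percent /ga/)."""
    return 1.0 / (1.0 + math.exp(-(x - alpha) / beta))


def fit_logistic(xs, ps, alphas, betas):
    """Coarse least-squares grid search over candidate (alpha, beta) pairs."""
    best = None
    for a in alphas:
        for b in betas:
            err = sum((logistic(x, a, b) - p) ** 2 for x, p in zip(xs, ps))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]


# Synthetic responses generated from a known curve, not data from the study.
xs = list(range(0, 101, 10))
ps = [logistic(x, 48.0, 6.0) for x in xs]

alpha_hat, beta_hat = fit_logistic(
    xs, ps,
    alphas=[float(a) for a in range(30, 71)],      # candidate boundaries, 30..70
    betas=[0.5 * b for b in range(2, 41)],         # candidate widths, 1.0..20.0
)
print("boundary:", alpha_hat, "sharpness (1/beta):", 1.0 / beta_hat)
```

A gradient-based or maximum-likelihood fit would be the more usual choice in practice; the grid search is used here only to keep the example dependency-free.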
Event-related fMRI-RA experiment.
After the phoneme discrimination screening, subjects participated in an fMRI-RA experiment. fMRI adaptation techniques have been developed to improve upon the limited ability of conventional fMRI to probe neuronal selectivity. Conventional fMRI takes average BOLD-contrast response to the stimuli of interest as an indicator of neuronal stimulus selectivity in a particular voxel. Although this is sufficient to assess whether neurons in the voxel respond to the stimuli of interest, a more detailed assessment of the degree of selectivity (and, of particular interest in our study, whether neurons show selectivity for acoustic features or for phoneme category labels) is complicated by the fact that the density of selective neurons as well as the broadness of their tuning contribute to the average activity level in a voxel. A particular mean voxel response could be obtained by a few neurons in that voxel, which each respond unselectively to many different sounds, or by a large number of highly selective neurons that each respond only to a few sounds, yet these two scenarios obviously have very different implications for neuronal selectivity in that particular region (Jiang et al., 2006). In contrast, the fMRI-RA approach is motivated by findings from inferotemporal cortex monkey electrophysiology experiments reporting that the second presentation of a stimulus (within a short time period) evokes a smaller neural response than the first (Miller et al., 1993). It has been shown that this adaptation can be measured using fMRI and that the degree of adaptation depends on stimulus similarity, with repetitions of the same stimulus causing the greatest suppression.
We (Jiang et al., 2006, 2007) as well as others (Murray and Wojciulik, 2004; Fang et al., 2007; Gilaie-Dotan and Malach, 2007) have provided evidence that parametric variations in visual object parameters (shape, orientation, or viewpoint) are reflected in systematic modulations of the fMRI-RA signal and can be used as an indirect measure of neural population tuning (Grill-Spector et al., 2006). In fMRI-RA, two stimuli are presented in rapid succession in each trial, with the similarity between the two stimuli in each trial varied to investigate neuronal tuning along the dimensions of interest. In our experiment, in each trial, subjects heard pairs of sounds, each of which was 300 ms in duration, separated by 50 ms, similar to Joanisse et al. (2007) and Myers et al. (2009). There were five different conditions of interest, defined by acoustic-phonetic change and category change between the two stimuli: M0, same category and no acoustic-phonetic difference; M3within, same category and 33% acoustic-phonetic difference; M3between, different category and 33% acoustic-phonetic difference; M6, different category and 67% acoustic-phonetic difference; and null trials (Fig. 1). This made it possible to attribute possible signal differences between M3within and M3between, which were equalized for acoustic-phonetic dissimilarity, to an explicit representation of the phonetic categories. Specifically, we predicted that brain regions containing category-selective neurons should show stronger activations in the M3between and M6 trials than in the M3within and M0 trials, as the stimuli in each pair in the former two conditions should activate different neuronal populations, whereas they would activate the same group of neurons in the latter two conditions. 
In this way, fMRI-RA makes it possible to dissociate phonetic category selectivity, which requires neurons to respond similarly to dissimilar stimuli from the same category as well as respond differently to similar stimuli belonging to different categories (Freedman et al., 2003; Jiang et al., 2007), from mere tuning to acoustic differences, where neuronal responses gradually drop off with acoustic dissimilarity, without the sharp transition at the category boundary that is a hallmark of perceptual categorization. Morph lines were also extended beyond the prototypes (25% in each direction) so that the actual stimuli used to create the stimulus pairs for each subject would span 100% of the difference between /da/ and /ga/ but could be shifted so that they centered on the category boundary for each subject, measured behaviorally before scanning. To observe responses largely independent of overt phoneme categorization, we scanned subjects while they performed an attention-demanding distractor task for which phoneme category information was irrelevant. Each sound in a pair persisted slightly longer in one ear than in the other (∼30 ms between channels). The subject was asked to listen for these offsets and to report whether the two sounds in the pair persisted longer in either the same or different ears. Subjects held two response buttons, and an “S” and a “D,” indicating “same” and “different,” respectively, were presented on opposing sides of the screen to indicate which button to press. Their order was alternated on each run, to disentangle activation resulting from categorical decisions from categorical motor activity (i.e., to average out the motor responses). The nature of this task required that subjects listen closely to all sounds presented. The average performance across subjects was 71%, indicating that the task was attention-demanding, minimizing the chance that subjects covertly categorized the stimuli in addition to doing the dichotic listening task.
Importantly, because reaction time differences have been shown to cause BOLD response modulations (Sunaert et al., 2000; Mohamed et al., 2004) that could interfere with the fMRI adaptation effects of interest, we confirmed that reaction times on the dichotic listening task did not differ significantly across the four conditions of interest (ANOVA, p > 0.17). Images were collected for 6 runs, each run lasting 669 s. Trials lasted 12 s each, yielding 4 volumes per trial, and there were two silent trials (i.e., 8 images) at the beginning and end of each run. The first 4 images of each run were discarded, and analyses were performed on the other 50 trials (10 for each condition of interest). Trial order was randomized and counterbalanced using m-sequences (Buracas and Boynton, 2002), and the number of presentations was equalized for all stimuli in each experiment.
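The logic that defines the four non-null trial types (M0, M3within, M3between, M6) can be written as a small classifier. The sketch below is illustrative only: it assumes four stimulus positions spaced one third of the morph range apart and centered on the subject's boundary, which is one arrangement consistent with the description above but is not confirmed by the text, and the function name is hypothetical.

```python
def trial_type(first, second, boundary):
    """Label a stimulus pair by morph distance and category membership.

    Positions are in percent /ga/; `boundary` is the subject's category boundary.
    """
    distance = abs(second - first)
    same_category = (first < boundary) == (second < boundary)
    if distance == 0:
        return "M0"
    if round(distance) == 33:
        return "M3within" if same_category else "M3between"
    if round(distance) == 67 and not same_category:
        return "M6"
    raise ValueError("pair does not match a condition of interest")


# Assumed layout: four positions in steps of one third, centered on the boundary.
boundary = 50.0
positions = [boundary - 50.0 + i * 100.0 / 3.0 for i in range(4)]

print(trial_type(positions[1], positions[1], boundary))  # identical pair
print(trial_type(positions[0], positions[1], boundary))  # same side of the boundary
print(trial_type(positions[1], positions[2], boundary))  # straddles the boundary
print(trial_type(positions[0], positions[2], boundary))  # two steps, straddles the boundary
```

The key property is that M3within and M3between share the same morph distance and differ only in whether the pair crosses the boundary.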
fMRI data acquisition.
All MRI data were acquired at the Center for Functional and Molecular Imaging at Georgetown University on a 3.0-Tesla Siemens Trio Scanner using whole-head echo-planar imaging sequences (flip angle = 90°, TE = 30 ms, FOV = 205, 64 × 64 matrix) with a 12-channel head coil. A clustered acquisition paradigm (TR = 3000 ms, TA = 1500 ms) was used such that each image was followed by an equal duration of silence before the next image was acquired. Stimuli were presented after every fourth volume, yielding a trial time of 12.0 s. In all functional scans, 28 axial slices were acquired in descending order (thickness = 3.5 mm, 0.5 mm gap; in-plane resolution = 3.0 × 3.0 mm²). After functional scans, high-resolution (1 × 1 × 1 mm³) anatomical images (MPRAGE) were acquired. Auditory stimuli were presented using Presentation (Neurobehavioral Systems) via customized STAX electrostatic earphones at a comfortable listening volume (∼65–70 dB) worn inside ear protectors (Bilsom Thunder T1) giving ∼26 dB attenuation.
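The timing constraints of the clustered acquisition can be checked with a few lines of arithmetic. The sketch below uses only the durations stated in the text; the assumption that a stimulus pair is delivered entirely within one silent gap is ours and is not stated explicitly.

```python
TR_MS = 3000      # repetition time (from the text)
TA_MS = 1500      # acquisition time per volume (from the text)
TRIAL_MS = 12000  # trial duration (from the text)
SOUND_MS = 300    # duration of each sound in a pair (from the text)
GAP_MS = 50       # silence between the two sounds of a pair (from the text)

silent_gap_ms = TR_MS - TA_MS              # scanner-quiet period after each volume
volumes_per_trial = TRIAL_MS // TR_MS      # volumes acquired per trial
pair_duration_ms = 2 * SOUND_MS + GAP_MS   # total duration of one stimulus pair

# Assumption: the whole pair is played inside a single silent gap.
pair_fits_in_gap = pair_duration_ms <= silent_gap_ms
print(silent_gap_ms, volumes_per_trial, pair_duration_ms, pair_fits_in_gap)
```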
fMRI data analysis.
Data were analyzed using the software package SPM2 (http://www.fil.ion.ucl.ac.uk/spm/software/spm2/). After discarding images from the first 12 s of each functional run, echo-planar images were temporally corrected to the middle slice, spatially realigned, resliced to 2 × 2 × 2 mm³, and normalized to a standard MNI reference brain in Talairach space. Images were then smoothed using an isotropic 6 mm Gaussian kernel.
For whole-brain analyses, a high-pass filter (1/128 Hz) was applied to the data. We then modeled fMRI responses with a design matrix comprising the onset of predefined non-null trial types (M0, M3within, M3between, and M6) as regressors of interest using a standard canonical hemodynamic response function, as well as six movement parameters and the global mean signal (average over all voxels at each time point) as regressors of no interest. The parameter estimates of the hemodynamic response function for each regressor were calculated for each voxel. The contrasts for each trial type against baseline at the single-subject level were computed and entered into a second-level model (ANOVA) in SPM2 (participants as random effects) with additional smoothing (6 mm). For all whole-brain analyses, a threshold of at least p < 0.001 (uncorrected) and cluster-level correction to p < 0.05 using AlphaSim (Forman et al., 1995) was used unless otherwise mentioned.
Results
Behavior
Before scanning, subjects first participated in a behavioral experiment. The experiment served both as an initial screen to ensure subjects exhibited clear category boundaries for our stimuli and as a means to locate the subject-specific category boundary for each of the two morph lines. To minimize the likelihood of subjects covertly categorizing the sounds during the scans, we did not have them overtly categorize the sounds before scanning. Instead, subjects were asked to report whether pairs of sounds were exactly the same or in any way different. Discrimination was tested at 10% intervals along each morph line, and the morph distance between paired sounds was varied according to the adaptive staircase algorithm QUEST (Watson and Pelli, 1983) in each direction of the morph line, similar to the experiments by Kuhl (1981). This allowed us to measure the just-noticeable difference (JND) at each location (for each morph direction), which should have its minimum value at the category boundary. In all but one subject, we observed clear minima in the JND for morph differences in each direction. The subject with no detectable boundary was excluded from further participation and analyses. The boundary in all other subjects was inferred to be halfway between the two smallest JND measurements (one in each direction). After scanning, subjects participated in a second behavioral experiment in which they were asked to overtly categorize each sound to explicitly confirm the location of the category boundary. Average discrimination and categorization results for the group (Fig. 2) indicate a clear correspondence between the JND minima and the explicit category boundary.
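The boundary rule described above (the midpoint between the two JND minima, one per morph direction) can be sketched as follows. The JND values are invented for illustration and the function name is hypothetical; only the midpoint rule itself comes from the text.

```python
def boundary_from_jnds(jnd_toward_ga, jnd_toward_da):
    """Estimate the category boundary as the midpoint of the two JND minima.

    Each argument maps a reference position (percent /ga/) to the JND measured
    for displacements in that direction.
    """
    pos_a = min(jnd_toward_ga, key=jnd_toward_ga.get)
    pos_b = min(jnd_toward_da, key=jnd_toward_da.get)
    return (pos_a + pos_b) / 2.0


# Hypothetical JNDs (percent morph distance), not measurements from the study.
jnd_toward_ga = {30: 22.0, 40: 12.0, 50: 18.0, 60: 25.0}
jnd_toward_da = {40: 26.0, 50: 19.0, 60: 11.0, 70: 21.0}

print(boundary_from_jnds(jnd_toward_ga, jnd_toward_da))
```

In this toy example the smallest JND occurs just below the boundary for displacements toward /ga/ and just above it for displacements toward /da/, so the midpoint falls between the two reference positions.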
fMRI-RA
To probe neuronal selectivity using fMRI, we used an event-related fMRI-RA paradigm (Kourtzi and Kanwisher, 2001; Jiang et al., 2006, 2007) in which a pair of sounds of varying acoustic-phonetic similarity and category similarity was presented in each trial while subjects performed the dichotic listening task. To identify the subset of voxels that exhibited adaptation to our phoneme stimuli, we created a mask using the contrast of M6 > M0 (Table 1). As a control, we performed the inverse contrast of M0 > M6, which identified no selective clusters even at the very permissive threshold of p < 0.1 (corrected) anywhere in the brain. To locate cortical regions sensitive to acoustic-phonetic differences regardless of category label, we examined the results of the contrast M6 + M3between + M3within > 3 * M0 within the M6 > M0 mask. This contrast yielded two significant clusters (Fig. 3) in the left hemisphere: the anterior middle temporal gyrus (aMTG; MNI coordinate, −56, −6, −22) and the posterior superior temporal gyrus (pSTG; MNI coordinate, −56, −50, 8). We then extracted the percentage signal change for each condition from each of these clusters. Post hoc paired t tests revealed that M3within, M3between, and M6 conditions were each significantly higher than the M0 condition in both the aMTG and pSTG (p < 0.0028, one-tailed).
Phoneme category selectivity in left PMC
We then examined the M3between > M3within contrast on the activation masked by M6 > M0 (see above) to locate cortical regions explicitly selective for the phoneme categories. This contrast yielded only one significant cluster (Fig. 4), in the posterior portion of the left middle frontal gyrus (MNI coordinate, −54, 4, 44), corresponding to PMC (Brodmann area 6). Post hoc paired t tests showed that M3between and M6 conditions were each significantly higher than both the M0 and M3within conditions (p < 0.007, one-tailed), with no significant differences between M0 and M3within or between M3between and M6 (p > 0.28, two-tailed).
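The three contrasts used in this and the preceding section can be written as weight vectors over the four non-null conditions. The sketch below applies them to two invented response profiles to show how a graded and a categorical region would dissociate; the percent-signal-change values are hypothetical and the dictionary layout is an illustrative choice, not SPM2 syntax.

```python
CONDITIONS = ("M0", "M3within", "M3between", "M6")

CONTRASTS = {
    # M6 > M0 (mask)
    "adaptation_mask": {"M0": -1, "M3within": 0, "M3between": 0, "M6": 1},
    # M6 + M3between + M3within > 3 * M0
    "acoustic_phonetic": {"M0": -3, "M3within": 1, "M3between": 1, "M6": 1},
    # M3between > M3within
    "category": {"M0": 0, "M3within": -1, "M3between": 1, "M6": 0},
}


def contrast_value(name, response):
    """Weighted sum of condition responses for the named contrast."""
    weights = CONTRASTS[name]
    return sum(weights[c] * response[c] for c in CONDITIONS)


# Hypothetical percent-signal-change profiles, not values from the study.
graded = {"M0": 0.10, "M3within": 0.25, "M3between": 0.25, "M6": 0.40}
categorical = {"M0": 0.10, "M3within": 0.10, "M3between": 0.40, "M6": 0.40}

for label, profile in (("graded", graded), ("categorical", categorical)):
    values = {name: round(contrast_value(name, profile), 2) for name in CONTRASTS}
    print(label, values)
```

Each weight vector sums to zero, so a region that responds equally to all four conditions yields a contrast value of zero. Only the categorical profile produces a positive value on the M3between > M3within contrast.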
Functional connection between pSTG and PMC
To determine whether the category signals observed in PMC are functionally linked to either the anterior or posterior speech selectivity clusters identified in auditory cortex, we examined the functional connectivity between each acoustic-phonetic region of interest (ROI) (i.e., aMTG and pSTG) and the phoneme category-selective ROI in PMC. For this analysis, we extracted the measured BOLD time course in each of these group-defined ROIs and calculated their pairwise correlation coefficients in each subject. A strong correlation was found between aMTG and pSTG (mean, r = 0.41), and the distribution of r values was found to be significantly different from zero (p < 0.0001). Similarly, connectivity was detected between pSTG and PMC (mean, r = 0.28, p < 0.01). However, correlations between aMTG and PMC were not significant (mean, r = 0.11, p > 0.3). Testing for partial correlations, controlling for the strong correlation between pSTG and aMTG, further supported the existence of a dorsal-stream connection from pSTG to PMC (mean, r = 0.29, p < 0.005), and no ventral stream connection from aMTG to PMC (mean, r = 0.015, p > 0.8). These results suggest that the category selectivity in PMC arises from processing in the dorsal stream (via pSTG) rather than the ventral stream, probably because of a direct projection from posterior auditory areas (e.g., pSTG) to premotor areas (e.g., PMC). To test this hypothesis, we used SEM, with the SEM toolbox (Steele et al., 2004), to assess the connectivity among the three brain regions. Using the pairwise correlation coefficients between the three regions (see above), we tested all 60 “probable” candidate models of connectivity among the three regions ($60 = 4^3 - 4$, after assuming that any one of the three regions is connected to at least one of the other two regions).
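The partial-correlation step can be illustrated with the standard first-order formula. The sketch below plugs in the group-mean pairwise correlations quoted above purely as an example; because the reported partial correlations were averaged over per-subject estimates, this calculation is not expected to reproduce the reported means exactly. The function name is hypothetical.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Group-mean pairwise correlations quoted in the text (illustrative inputs only).
r_pstg_pmc = 0.28
r_amtg_pmc = 0.11
r_amtg_pstg = 0.41

# pSTG-PMC controlling for aMTG (candidate dorsal route)
dorsal = partial_correlation(r_pstg_pmc, r_amtg_pstg, r_amtg_pmc)
# aMTG-PMC controlling for pSTG (candidate ventral route)
ventral = partial_correlation(r_amtg_pmc, r_amtg_pstg, r_pstg_pmc)

print(round(dorsal, 3), round(ventral, 3))
```

Qualitatively, the pSTG-PMC relationship survives after controlling for aMTG, whereas the aMTG-PMC relationship drops to approximately zero after controlling for pSTG.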
The goodness of fit between the model and data was tested according to the standardized root mean square residual (SRMR), which can have values between 0 and 1 and where SRMR < 0.05 indicates a particularly good fit (Steele et al., 2004). The best-fitted structural equation model is shown in Figure 5 (SRMR: 0.0246 ± 0.0193; mean ± SD). This model includes a bidirectional connection between aMTG and pSTG and a unidirectional connection from pSTG to PMC, suggesting that the PMC selectivity we observe via fMRI-RA can be attributed to the dorsal, rather than the ventral, auditory pathway. Additionally, the identified dorsal connection was found to be directional, with signals propagating from pSTG to PMC.
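For readers unfamiliar with the fit index, the sketch below computes an SRMR between an observed and a model-implied correlation matrix. It uses a common textbook definition (root mean square of the residuals over the unique matrix elements); the SEM toolbox may define or normalize the index differently. The model-implied matrix is a deliberately simplified assumption: a pure chain in which the aMTG-PMC correlation equals the product of the two adjacent correlations.

```python
import math


def srmr(observed, implied):
    """Standardized root mean square residual between two correlation matrices.

    Averages squared residuals over the p * (p + 1) / 2 unique elements
    (lower triangle including the diagonal).
    """
    p = len(observed)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            total += (observed[i][j] - implied[i][j]) ** 2
    return math.sqrt(total / (p * (p + 1) / 2))


# Region order: aMTG, pSTG, PMC. Observed values are the group means from the text.
observed = [
    [1.00, 0.41, 0.11],
    [0.41, 1.00, 0.28],
    [0.11, 0.28, 1.00],
]
# Simplified chain assumption (aMTG - pSTG - PMC), not the fitted model itself.
chain = 0.41 * 0.28
implied = [
    [1.00, 0.41, chain],
    [0.41, 1.00, 0.28],
    [chain, 0.28, 1.00],
]

print(srmr(observed, implied))
```

The value obtained from group means under this simplification should not be compared directly with the reported SRMR, which summarizes per-subject fits of the full model.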
PMC signals predict behavior
The phoneme category selectivity identified in PMC raises the possibility that this area might be recruited when subjects are asked to explicitly categorize morphed phoneme stimuli. In a separate experiment, after the scanning, we therefore determined whether the category selectivity observed in left premotor cortex was correlated with subjects' ability to categorize the morphed phoneme stimuli. More specifically, we predicted that subjects with larger signal differences between the M3between and M3within conditions, indicative of stronger release from adaptation for stimuli in different phoneme categories relative to stimuli of the same acoustic-phonetic dissimilarity but belonging to the same category, would exhibit better categorization performance. To test this, we extracted the M3between-M3within signal values for each subject from the group-defined PMC cluster and tested for correlations against the sharpness of the behavioral category boundary for each subject. Although there appeared to be a strong correlation for most of the subjects, there were several subjects (n = 4) for which M3between-M3within was zero, making the correlation for the group nonsignificant. We hypothesized that this was attributable to intersubject variability in functional localization, and we refined our analysis by defining subject-specific PMC ROIs. For each individual subject, we searched the M6 + M3between > M3within + M0 contrast map and identified activity peaks within a 20 mm radius of the group-defined PMC cluster. In cases where multiple peaks were detected, the one closest to the group-defined PMC cluster was chosen. We then centered a 5 mm spherical ROI on each of these coordinates and extracted the percentage signal change for each testing condition.
Figure 6 shows the signal difference M3between-M3within (in percentage signal change) measured in each subject's PMC ROI against the sharpness of their behavioral category boundary, revealing a significant correlation between these measures (r = 0.63, p < 0.014).
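The brain-behavior analysis amounts to a Pearson correlation across subjects between two per-subject measures. A minimal sketch follows; the six data points are invented for illustration (they are not the values plotted in Figure 6), and the helper names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical per-subject values, not data from the study.
boundary_sharpness = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20]   # 1/beta from the logistic fit
signal_difference = [0.00, 0.02, 0.05, 0.04, 0.09, 0.10]    # M3between - M3within (% signal change)

r = pearson_r(boundary_sharpness, signal_difference)
print(round(r, 3), round(t_from_r(r, len(boundary_sharpness)), 3))
```

Converting the t statistic to a p value requires the t distribution with n − 2 degrees of freedom, which is omitted here to keep the example free of external dependencies.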
Discussion
Although current theories of speech processing (Hickok et al., 2011; Rauschecker, 2011) implicate the dorsal auditory stream in sensorimotor integration—mapping speech sounds to their associated motor articulations, and vice versa—contentious debate still remains about whether the dorsal stream is involved in speech “perception” (decoding of speech sounds and subsequent mapping of these acoustically variable sounds to their invariant representations). Recent reports of correlative (Osnes et al., 2011; Alho et al., 2012; Szenkovits et al., 2012) and causal (Meister et al., 2007; Sato et al., 2009) evidence of PMC involvement in speech processing have given rise to the hypothesis that the dorsal stream might be recruited in a task-specific manner to support speech categorization tasks. Although providing circumstantial support for this hypothesis, these studies do not address whether or not PMC exhibits neural tuning properties that might be relevant to a speech categorization task.
What neural tuning properties are exhibited by PMC?
In the present fMRI-RA study, we detected selectivity for phoneme category in left PMC (Fig. 4). Although previous studies have suggested that the dorsal stream may be recruited to perform speech judgments (at least under adverse listening conditions), our present results provide insight into why such recruitment may be beneficial; in the case of phoneme categories, PMC may provide additional information (potentially useful for disambiguating similar words when sensory information is degraded). Whereas phoneme-category tuning was observed in left PMC, graded acoustic-phonetic selectivity was observed in left aMTG and left pSTG (Fig. 3). Although not tested in the current experiment, such neural tuning properties are likely useful in tasks requiring fine discriminations between speech sounds. Additional experiments would add valuable insight into why this tuning profile was observed in both anterior (ventral) and posterior (dorsal) regions and how the function of both regions differs (for a potential explanation, see Obleser et al., 2007).
Is PMC tuning the result of dorsal- or ventral-stream processing?
In our ROI and connectivity analyses, we identified a left-lateralized network in which connectivity was directional from pSTG to PMC. Similar directional connectivity, in that case between anatomically defined ROIs in planum temporale and premotor cortex, was observed in a previous study (Osnes et al., 2011). We interpret our finding to reflect processing within the dorsal stream. This interpretation is consistent with recent electroencephalography data (De Lucia et al., 2010), reporting selectivity for conspecific vocalizations in humans at 169–219 ms after stimulus in pSTG, and then later at 291–357 ms in PMC. Whereas Osnes et al. (2011) also found connectivity from middle superior temporal sulcus to PMC, we found no connectivity between aMTG and PMC. We interpret our finding to indicate that our difficult dichotic listening task (71% mean accuracy) was effective at distracting subjects from the task-irrelevant phoneme-category information. Our findings are also consistent with the sensorimotor integration function putatively assigned to the dorsal auditory pathway, which critically involves an internal modeling process (Hickok et al., 2011; Rauschecker, 2011), similar to that described in motor control theory (Wolpert et al., 1995). This internal model is thought to map motor articulations to the speech sounds that result (i.e., a “forward model”) and can be inverted to map a speech sound to the motor articulations likely to have caused it (an “inverse model”). We interpret the category selectivity observed in PMC to reflect the categorical sensorimotor representation of this inverse model.
Is dorsal stream phoneme-category selectivity automatic or task-specific?
We observed phoneme-category selectivity in PMC despite subjects never being informed of, and actively being distracted from, the phoneme-category information present in the stimuli. Our dichotic listening task proved to be difficult for subjects, who attained only 71% mean accuracy. Although we interpret this behavioral performance to indicate that the dichotic listening task was effective at distracting subjects from the task-irrelevant phoneme-category information, we cannot be certain that this manipulation of subject attention was 100% effective (i.e., subjects may have still attended to some degree to phoneme-category information). Nonetheless, our observation suggests that this process fits at least one defining criterion for an automatic process, in that it is goal-independent, being performed even when it is irrelevant for the proximal goal (Moors and De Houwer, 2006), in our case, localization and not categorization. This suggests that this mapping takes place regardless of whether the resulting category-label information is task-relevant, which is compatible with general theories of cognitive processing implicating the premotor cortex in automatic processing in a variety of tasks (Ashby et al., 2007; D'Ostilio and Garraux, 2011), including other sensorimotor tasks, such as handwriting (Ashby et al., 2007; D'Ostilio and Garraux, 2011).
Do neural tuning properties in dorsal stream predict subject performance?
We found that subjects with more category-selective activity in PMC also exhibited sharper category boundaries. Thus, our findings provide novel evidence that the degree of category tuning exhibited by PMC can account for behavioral performance in phoneme categorization, further supporting the idea that this representation might be usefully recruited to assist in speech categorization tasks. One possible explanation for this correlation is that category selectivity in PMC might merely be a corollary of earlier categorization processes somewhere else in the brain that may not have been detected by our fMRI-RA paradigm. Previous work has also reported category effects for speech in posterior STG (Liebenthal et al., 2010) and supramarginal gyrus (Raizada and Poldrack, 2007). However, a causal contribution of representations in PMC to phoneme categorization, at least in some circumstances, is supported by recently reported results from transcranial magnetic stimulation studies, where temporary disruption of the PMC has been found to interrupt (Meister et al., 2007; Möttönen and Watkins, 2009) or delay (Sato et al., 2009) phoneme category processing. Our results are also in agreement with a recent study using magnetoencephalography that found a correlation of left PMC responses with behavioral accuracy in a phoneme categorization task (Alho et al., 2012; Szenkovits et al., 2012).
Why did we not find tuning for phoneme category in the ventral auditory stream?
We found evidence of graded acoustic-phonetic selectivity in an anterior region of auditory cortex, which has also been reported by other studies (Frye et al., 2007; Toscano et al., 2010). This result is consistent with studies reporting phoneme selectivity in the ventral stream (Liebenthal et al., 2005; Leaver and Rauschecker, 2010; DeWitt and Rauschecker, 2012). One previous study (using an adaptation design similar to the present one and stimuli paired along a phonetic continuum) even claimed evidence for phonetic category selectivity in the left anterior superior temporal sulcus/middle temporal gyrus (Joanisse et al., 2007), which contrasts with the present results. A difficulty in comparing the results of the two studies is that Joanisse et al. (2007) presented the stimuli used in their “within-category” and “same-category” conditions substantially more often than those used in the “between-category” condition. This imbalance in stimulus presentations would be expected to cause increased adaptation for the “within” and “same-category” stimuli relative to the “between-category” stimuli (Grill-Spector et al., 1999), possibly producing the appearance of category selectivity in the adaptation paradigm despite an underlying graded noncategorical representation.
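The logic of this confound can be made concrete with a toy calculation. The sketch below is illustrative only: the functional form, the parameter values, and the presentation counts are assumptions chosen for demonstration and are not taken from Joanisse et al. (2007) or from the present study. It models a purely graded population whose release from adaptation depends only on the acoustic distance within a pair, and adds cumulative adaptation that grows with the number of prior presentations.

```python
# Toy adaptation model; functional form and all numbers are illustrative assumptions.

def pair_response(acoustic_distance: float, prior_presentations: int,
                  release_gain: float = 1.0, buildup_rate: float = 0.05) -> float:
    """Response of a graded (noncategorical) population to a stimulus pair.

    Release from adaptation depends only on acoustic distance; cumulative
    adaptation grows with how often the stimuli were presented before.
    """
    release = 1.0 + release_gain * acoustic_distance
    cumulative_adaptation = 1.0 / (1.0 + buildup_rate * prior_presentations)
    return release * cumulative_adaptation


def apparent_category_effect(n_within: int, n_between: int,
                             distance: float = 0.33) -> float:
    """Between-minus-within response for pairs of equal acoustic distance."""
    within = pair_response(distance, n_within)
    between = pair_response(distance, n_between)
    return between - within


# Hypothetical presentation counts, chosen only to contrast the two designs.
balanced = apparent_category_effect(n_within=20, n_between=20)
imbalanced = apparent_category_effect(n_within=60, n_between=20)

print(balanced)        # 0.0: no category effect when presentations are balanced
print(imbalanced > 0)  # True: imbalance alone yields between > within
```

In this sketch the population has no categorical tuning at all, yet the imbalanced design produces a larger response to “between-category” than to “within-category” pairs, which is the signature that an adaptation analysis would attribute to category selectivity.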
Two other recent studies have observed prefrontal correlates of phoneme category even when subjects performed noncategorical tasks, such as volume- or pitch-difference detection (Myers et al., 2009; Lee et al., 2012). We think that these results are compatible with ours; although the volume- and pitch-difference detection tasks may not require category information, it is unclear whether subjects covertly categorized the sounds. We interpret the absence of any prefrontal ROI in our study, which would be expected during explicit categorization (Binder et al., 2004; Husain et al., 2006), to indicate that our difficult dichotic listening task (71% mean accuracy) was effective at distracting subjects from the task-irrelevant phoneme-category information.
In conclusion, we have provided novel evidence in support of the hypothesis that dorsal-stream regions, specifically left PMC, are recruited to support speech categorization tasks. Whereas previous evidence was circumstantial, we provide evidence that PMC exhibits neural tuning properties useful for a phoneme-categorization task, that this processing may take place automatically (making the information available for recruitment on demand), and that tuning in this representation can account for differences in behavioral performance. It would be useful for future experiments to consider whether the role of PMC in categorization extends to speech sounds that are judged less categorically than stop consonants, such as vowels (Schouten et al., 2003), and perhaps even nonspeech stimuli. Indeed, evidence suggests that the same process of matching sensory inputs with their motor causes may also occur for other kinds of “doable sounds” outside the speech domain (Rauschecker and Scott, 2009). Experimental evidence suggests that this operation extends to musical sounds, tool sounds, or, more generally, “action sounds” (Hauk et al., 2004; Lewis et al., 2005; Zatorre et al., 2007).
Footnotes
This work was supported by National Science Foundation Grant 0749986 to M.R. and J.P.R. and PIRE Grant OISE-0730255. We thank Dr. Sheila Blumstein for discussions of results and interpretations, Dr. John VanMeter for suggesting the structural equation modeling analyses, and Dr. Hideki Kawahara for providing comments and suggestions regarding sound morphing as well as access to the STRAIGHT toolbox for MATLAB.
The authors declare no competing financial interests.
Correspondence should be addressed to Dr. Maximilian Riesenhuber, Department of Neuroscience, Georgetown University Medical Center, Research Building Room WP-12, 3970 Reservoir Road NW, Washington, DC 20007. mr287@georgetown.edu