Abstract
The brain's circuitry for perceiving and producing speech may show considerable overlap that is crucial for normal development and behavior. The extent to which sensorimotor integration plays a role in speech perception remains highly controversial, however. Methodological constraints related to experimental designs and analysis methods have so far prevented the disentanglement of neural responses to acoustic versus articulatory speech features. Using a passive listening paradigm and multivariate decoding of single-trial fMRI responses to spoken syllables, we investigated brain-based generalization of articulatory features (place and manner of articulation, and voicing) beyond their acoustic (surface) form in adult human listeners. For example, we trained a classifier to discriminate place of articulation within stop syllables (e.g., /pa/ vs /ta/) and tested whether this training generalizes to fricatives (e.g., /fa/ vs /sa/). This novel approach revealed generalization of place and manner of articulation at multiple cortical levels within the dorsal auditory pathway, including auditory, sensorimotor, motor, and somatosensory regions, suggesting the representation of sensorimotor information. Additionally, generalization of voicing included the right anterior superior temporal sulcus, associated with the perception of human voices, as well as somatosensory regions bilaterally. Our findings highlight the close connection between brain systems for speech perception and production and, in particular, indicate the availability of articulatory codes during passive speech perception.
SIGNIFICANCE STATEMENT Sensorimotor integration is central to verbal communication and provides a link between auditory signals of speech perception and motor programs of speech production. It remains highly controversial, however, to what extent the brain's speech perception system actively uses articulatory (motor), in addition to acoustic/phonetic, representations. In this study, we examine the role of articulatory representations during passive listening using carefully controlled stimuli (spoken syllables) in combination with multivariate fMRI decoding. Our approach enabled us to disentangle brain responses to acoustic and articulatory speech properties. In particular, it revealed articulatory-specific brain responses to speech at multiple cortical levels, including auditory, sensorimotor, and motor regions, suggesting the representation of sensorimotor information during passive speech perception.
Introduction
Speech perception and production are closely linked during verbal communication in everyday life. Correspondingly, the neural processes responsible for both faculties are inherently connected (Hickok and Poeppel, 2007; Glasser and Rilling, 2008), with sensorimotor integration subserving transformations between acoustic (perceptual) and articulatory (motoric) representations (Hickok et al., 2011). Although sensorimotor integration mediates motor speech development and articulatory control during speech production (Guenther and Vladusich, 2012), its role in speech perception is less established. Indeed, it remains unknown whether articulatory speech representations play an active role in speech perception and/or whether this is dependent on task-specific sensorimotor goals. Tasks explicitly requiring sensory-to-motor control, such as speech repetition (Caplan and Waters, 1995; Hickok et al., 2009), humming (Hickok et al., 2003), and verbal rehearsal in working memory (Baddeley et al., 1998; Jacquemot and Scott, 2006; Buchsbaum et al., 2011), activate the dorsal auditory pathway, including sensorimotor regions at the border of the posterior temporal and parietal lobes, the sylvian-parieto-temporal region and supramarginal gyrus (SMG) (Hickok and Poeppel, 2007). Also, in experimental paradigms that do not explicitly require sensorimotor integration, the perception of speech may involve a coactivation of motor speech regions (Zatorre et al., 1992; Wilson et al., 2004). These activations may follow a topographic organization, such as when listening to syllables involving the lips (e.g., /ba/) versus the tongue (e.g., /da/) (Pulvermüller et al., 2006), and may be selectively disrupted by transcranial magnetic stimulation (D'Ausilio et al., 2009). Whether this coinvolvement of motor areas in speech perception reflects an epiphenomenal effect due to an interconnected network for speech and language (Hickok, 2009), a compensatory effect invoked in case of noisy and/or ambiguous speech signals (Hervais-Adelman et al., 2012; Du et al., 2014), or neural computations used for an articulatory-based segmentation of speech input in everyday life situations remains unknown (Meister et al., 2007; Pulvermüller and Fadiga, 2010).
Beyond regional modulations of averaged activity across different experimental conditions, fMRI in combination with multivoxel pattern analysis (MVPA) allows investigating the acoustic and/or articulatory representation of individual speech sounds. This approach has been successful in demonstrating auditory cortical representations of speech (Formisano et al., 2008; Kilian-Hütten et al., 2011; Lee et al., 2012; Bonte et al., 2014; Arsenault and Buchsbaum, 2015; Evans and Davis, 2015). Crucially, MVPA enables isolating neural representations of stimulus classes from variation across other stimulus dimensions, such as the representation of vowels independent of acoustic variation across speakers' pronunciations (Formisano et al., 2008) or the representation of semantic concepts independent of the input language in bilingual listeners (Correia et al., 2014).
In this high-resolution fMRI study, we used a passive listening paradigm and an MVPA-based generalization approach to examine neural representations of articulatory features during speech perception with minimal sensorimotor demands. A balanced set of spoken syllables and MVPA generalization using a surface-based searchlight procedure (Kriegeskorte et al., 2006; Chen et al., 2011) allowed unraveling these representations in distinct auditory, sensorimotor, and motor regions. Stimuli consisted of 24 consonant-vowel syllables constructed from 8 consonants (/b/, /d/, /f/, /p/, /s/, /t/, /v/, and /z/) and 3 vowels (/a/, /i/, and /u/), forming two features for each of three articulatory dimensions: place and manner of articulation and voicing (Fig. 1A). A slow event-related design with an intertrial interval of 12–16 s assured a maximally independent single-trial fMRI acquisition. Our MVPA generalization approach consisted of training, for example, a classifier to discriminate between two places of articulation (e.g., /pa/ vs /ta/) for stop consonants and testing whether this training generalizes to fricatives (e.g., /fa/ vs /sa/), thereby decoding fMRI responses to speech gestures beyond individual stimulus characteristics specific to, for example, abrupt sounds such as stop consonants or sounds characterized by a noise component such as fricatives. This decoding procedure was performed for (1) place of articulation across manner of articulation, (2) manner of articulation across place of articulation, and (3) voicing across manner of articulation.
Materials and Methods
Participants.
Ten Dutch-speaking participants (5 males, 5 females; mean ± SD age, 28.2 ± 2.35 years; 1 left-handed) took part in the study. All participants were undergraduate or postgraduate students of Maastricht University, reported normal hearing abilities, and were neurologically healthy. The study was approved by the Ethical Committee of the Faculty of Psychology and Neuroscience at Maastricht University, Maastricht, The Netherlands.
Stimuli.
Stimuli consisted of 24 consonant-vowel (CV) syllables pronounced by 3 female Dutch speakers, generating a total of 72 sounds. The syllables were constructed based on all possible CV combinations of 8 consonants (/b/, /d/, /f/, /p/, /s/, /t/, /v/, and /z/) and 3 vowels (/a/, /i/, and /u/). The 8 consonants were selected to cover two articulatory features per articulatory dimension (Fig. 1A): for place of articulation, bilabial/labio-dental (/b/, /p/, /f/, /v/) and alveolar (/t/, /d/, /s/, /z/); for manner of articulation, stop (/p/, /b/, /t/, /d/) and fricative (/f/, /v/, /s/, /z/); and for voicing, unvoiced (/p/, /t/, /f/, /s/) and voiced (/b/, /d/, /v/, /z/). The different vowels and the three speakers introduced acoustic variability. Stimuli were recorded in a soundproof chamber at a sampling rate of 44.1 kHz (16-bit resolution). Postprocessing of the recorded stimuli was performed in PRAAT software (Boersma and Weenink, 2001) and included bandpass filtering (80–10,500 Hz), manual removal of acoustic transients (clicks), length equalization, removal of sharp onsets and offsets using 30 ms ramp envelopes, and amplitude equalization (average RMS). Stimulus length was equalized to 340 ms using PSOLA (pitch-synchronous overlap and add) with 75–400 Hz as extrema of the F0 contour. Length changes were small (mean ± SD, 61 ± 47 ms), and subjects reported that the stimuli were unambiguously comprehended during a stimulus familiarization phase before the experiment. We further checked our stimuli for possible alterations in F0 after length equalization and found no significant changes of maximum F0 (p = 0.69) or minimum F0 (p = 0.76) with respect to the original recordings.
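As an illustration of these postprocessing steps, the following is a minimal Python sketch of the bandpass filtering, onset/offset ramping, and RMS equalization. The filter order, ramp shape (linear), and target RMS value are assumptions of this sketch and are not reported above; the PSOLA length equalization performed in PRAAT is not reproduced.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 44100  # sampling rate of the recordings (Hz)

def postprocess(waveform, fs=FS, ramp_ms=30, target_rms=0.05):
    """Bandpass filter (80-10,500 Hz), apply 30 ms onset/offset ramps,
    and equalize RMS amplitude (PSOLA duration equalization omitted)."""
    # Zero-phase Butterworth bandpass, 80-10,500 Hz (filter order assumed)
    sos = butter(4, [80, 10500], btype='bandpass', fs=fs, output='sos')
    filtered = sosfiltfilt(sos, waveform)

    # Linear 30 ms ramps to remove sharp onsets and offsets (ramp shape assumed)
    n_ramp = int(round(ramp_ms / 1000 * fs))
    ramp = np.linspace(0.0, 1.0, n_ramp)
    filtered[:n_ramp] *= ramp
    filtered[-n_ramp:] *= ramp[::-1]

    # Equalize average RMS amplitude across stimuli
    rms = np.sqrt(np.mean(filtered ** 2))
    return filtered * (target_rms / rms)
```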
Acoustic characteristics of the 24 syllables were determined using PRAAT software and included the first three formants (F1, F2, and F3) extracted from a 100 ms window centered at the midpoint of the vowel segment of each syllable (Fig. 2A). Because place of articulation of consonants is known to influence F2/F1 values of subsequent vowels in CV syllables due to coarticulation (Rietveld and van Heuven, 2001; Ladefoged and Johnson, 2010; Bouchard and Chang, 2014), we additionally calculated the logarithmic ratio of F2/F1 for the vowels in each of the syllables and assessed statistical differences between articulatory features (Fig. 2B–D). As expected, place of articulation led to significant log(F2/F1) differences for the vowels /u/ and /a/ (p < 0.05), with smaller log(F2/F1) values for vowels preceded by bilabial/labio-dental than by alveolar consonants. No significant log(F2/F1) differences were found for the vowel /i/ or for any of the vowels when syllables were grouped along manner of articulation or voicing. Importantly, together with pronunciations from three different speakers, we used three different vowels to increase acoustic variability and to weaken stimulus-specific coarticulation effects in the analyses.
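A short Python sketch of the log(F2/F1) comparison is given below. Formant values would be extracted beforehand (e.g., with PRAAT) from the 100 ms window at each vowel midpoint; the independent-samples t-test is an assumption of this sketch, as the specific test is not stated above.

```python
import numpy as np
from scipy import stats

def compare_log_f2f1(f1_a, f2_a, f1_b, f2_b):
    """Compare log(F2/F1) between two groups of syllables, e.g., syllables
    whose consonant is bilabial/labio-dental (group a) versus alveolar
    (group b). f1_*/f2_* are formant values (Hz) measured in a 100 ms
    window centered at the vowel midpoint."""
    log_a = np.log(np.asarray(f2_a) / np.asarray(f1_a))
    log_b = np.log(np.asarray(f2_b) / np.asarray(f1_b))
    # Test choice (independent-samples t-test) is assumed for illustration
    t_value, p_value = stats.ttest_ind(log_a, log_b)
    return log_a.mean(), log_b.mean(), p_value
```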
Experimental procedures.
The main experiment was divided into three slow event-related runs (Fig. 1B). In each run, each of the 72 sounds was presented once in random order, separated by an intertrial interval (ITI) of 12–16 s (corresponding to 6–8 TRs), while participants were asked to attend to the spoken syllables. Stimulus presentation was pseudorandomized such that consecutive presentations of the same syllable were avoided. Before starting the measurements, examples of the syllables were presented binaurally (using MR-compatible in-ear headphones; Sensimetrics, model S14; www.sens.com) at a comfortable intensity level, which was then adjusted individually based on each participant's feedback to equalize perceived loudness. During scanning, stimuli were presented in silent periods (1 s) between two acquisition volumes. Participants were asked to fixate on a gray fixation cross against a black background to keep visual stimulation constant during the entire duration of a run. Run transitions were marked with written instructions. Although we did not monitor possible subvocal/unconscious rehearsal accompanying the perception of the spoken syllables, none of the participants reported using subvocal rehearsal strategies.
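A minimal sketch of such a pseudorandomized presentation order is shown below (rejection sampling under the no-immediate-repeat constraint, with ITIs jittered in steps of one TR); the exact randomization procedure used in the study is not described beyond this constraint, so the implementation is illustrative.

```python
import random

def pseudorandom_order(syllables, speakers=('sp1', 'sp2', 'sp3'), max_tries=1000):
    """Return an order of all syllable-speaker tokens in which the same
    syllable never occurs on two consecutive trials (rejection sampling)."""
    trials = [(syl, spk) for syl in syllables for spk in speakers]
    for _ in range(max_tries):
        random.shuffle(trials)
        if all(trials[i][0] != trials[i + 1][0] for i in range(len(trials) - 1)):
            return trials
    raise RuntimeError("No valid order found; increase max_tries.")

def jittered_itis(n_trials):
    """Jittered intertrial intervals of 12-16 s, i.e., 6-8 TRs of 2 s each."""
    return [random.choice([12, 14, 16]) for _ in range(n_trials)]
```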
fMRI acquisition.
Functional and anatomical image acquisition was performed on a Siemens TRIO 3 tesla scanner (Scannexus) at the Maastricht Brain Imaging Center. Functional runs used in the main experiment were collected per subject with a spatial resolution of 2 mm isotropic using a standard echo-planar imaging sequence [repetition time (TR) = 2.0 s; acquisition time (TA) = 1.0 s; field of view = 192 × 192 mm; matrix size = 64 × 64; echo time (TE) = 30 ms; multiband factor 2]. Each volume consisted of 25 slices aligned and centered along the Sylvian fissures of the participants. The duration difference between the TA and TR introduced a silent period used for the presentation of the auditory stimuli. High-resolution (voxel size 1 mm3 isotropic) anatomical images covering the whole brain were acquired after the second functional run using a T1-weighted 3D Alzheimer's Disease Neuroimaging Initiative sequence (TR = 2050 ms; TE = 2.6 ms; 192 sagittal slices).
Two additional localizer runs presented at the end of the experimental session were used to identify fMRI activations related to listening and repeating the spoken syllables and to guide the multivariate decoding analysis conducted in the main experiment. For the first localizer run, participants were instructed to attentively listen to the spoken syllables. For the second localizer run, participants were instructed to listen and repeat the spoken syllables. This run was presented at the end of the scanning session to prevent priming for vocalizations during the main experiment. Both localizer runs consisted of 9 blocks of 8 syllables, with one syllable presented per TR. The blocks were separated by an ITI of 12.5–17.5 s (5–7 TRs). The scanning parameters were the same as used in the main experiment, with the exception of a longer TR (2.5 s) that assured that participants were able to listen and repeat the syllables in the absence of scanner noise (silent period = 1.5 s). Figure 3, A and B, shows the overall BOLD activation evoked during the localizer runs. Listening to the spoken syllables elicited activation in the superior temporal lobe in both hemispheres (Fig. 3A), as well as in inferior frontal cortex/anterior insula in the left hemisphere. Repeating the spoken syllables additionally evoked activation in premotor (anterior inferior precentral gyrus and posterior inferior frontal gyrus), motor (precentral gyrus), and somatosensory (postcentral gyrus) regions (Fig. 3B). BOLD activation in these regions was statistically assessed using random-effects GLM statistics (p < 0.05) and corrected for multiple comparisons using cluster size threshold correction (α = 5%). We defined an ROI per hemisphere that was used for the decoding analysis of the main experiment (Fig. 3C). The ROI included parts of the temporal lobes, inferior frontal cortices and parietal lobes, which are typically activated in speech perception and production tasks. By comparing the ROI with the activity obtained with the localizers, we made sure that areas activated during the perception and repetition of spoken syllables were all included.
fMRI data preprocessing.
fMRI data were preprocessed and analyzed using BrainVoyager QX version 2.8 (Brain Innovation) and custom-made MATLAB (The MathWorks) routines. Functional data were 3D motion-corrected (trilinear sinc interpolation), corrected for slice scan time differences, and temporally filtered by removing frequency components of ≤5 cycles per time course (Goebel et al., 2006). According to the standard analysis scheme in BrainVoyager QX (Goebel et al., 2006), anatomical data were corrected for intensity inhomogeneity and transformed into Talairach space (Talairach and Tournoux, 1988). Individual cortical surfaces were reconstructed from gray-white matter segmentations of the anatomical acquisitions and aligned using a moving target-group average approach based on curvature information (cortex-based alignment) to obtain an anatomically aligned group-averaged 3D surface representation (Goebel et al., 2006; Frost and Goebel, 2012). Functional data were projected onto the individual cortical surfaces, creating surface-based time courses. All statistical analyses were then conducted on the group-averaged surface, making use of cortex-based alignment.
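For illustration, the temporal filtering step can be approximated by removing the lowest Fourier components of each vertex (or voxel) time course. The sketch below is a simplified stand-in for BrainVoyager's temporal high-pass filter, not the implementation used in the study.

```python
import numpy as np

def highpass_cycles(timecourse, n_cycles=5):
    """Remove slow drifts by zeroing Fourier components with n_cycles or
    fewer cycles per time course (the mean/DC component is retained)."""
    spectrum = np.fft.rfft(timecourse)
    spectrum[1:n_cycles + 1] = 0.0  # drop components with 1..n_cycles cycles
    return np.fft.irfft(spectrum, n=len(timecourse))
```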
MVPA classification (generalization of articulatory features).
To investigate the local representation of spoken syllables based on their articulatory features, we used multivariate classification in combination with a moving searchlight procedure that selected cortical vertices based on their spatial (geodesic) proximity. The specific aim of the MVPA was to decode articulatory features of the syllables beyond their phoneme-specific acoustic signatures (Fig. 1C). Hence, we used a classification strategy based on the generalization of articulatory features across different types of phonemes. For example, we trained a classifier to decode place of articulation features (bilabial/labio-dental vs alveolar) from stop syllables ([/b/ and /p/] vs [/t/ and /d/]) and tested whether this learning is transferable to fricative syllables ([/f/ and /v/] vs [/s/ and /z/]), thus decoding place of articulation features across phonemes differing in manner of articulation. We applied this generalization strategy to investigate the neural representation of place of articulation (bilabial/labio-dental vs alveolar) across manner of articulation (stops and fricatives); manner of articulation (stops vs fricatives) across place of articulation (bilabial/labio-dental and alveolar); and voicing (voiced vs unvoiced) across manner of articulation (stops and fricatives). Additional methodological steps encompassing the construction of the fMRI feature space (fMRI feature extraction and surface-based searchlight procedure), as well as the computational strategy to validate (cross-validation) and display (generalization maps) the classification results, are described in detail below.
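The core of this generalization scheme can be sketched in Python as follows, for the example of place of articulation generalized across manner of articulation within one searchlight patch. The classifier type is not specified in this section, so a linear support vector machine is assumed; X denotes the trials × vertices matrix of single-trial response estimates described under fMRI feature extraction below, and place/manner are per-trial label arrays.

```python
import numpy as np
from sklearn.svm import SVC

def generalization_accuracy(X, place, manner,
                            train_manner='stop', test_manner='fricative'):
    """Train a classifier to discriminate place of articulation
    (bilabial/labio-dental vs alveolar) on syllables of one manner of
    articulation and test it on syllables of the other manner.
    A linear SVM is assumed here for illustration."""
    train_idx = np.asarray(manner) == train_manner
    test_idx = np.asarray(manner) == test_manner
    clf = SVC(kernel='linear', C=1.0)
    clf.fit(X[train_idx], np.asarray(place)[train_idx])
    return clf.score(X[test_idx], np.asarray(place)[test_idx])
```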
fMRI feature extraction.
Before classification, BOLD responses of each fMRI trial and each cortical vertex were estimated by fitting a hemodynamic response function using a GLM. To account for the temporal variability of single-trial BOLD responses, multiple hemodynamic response function fittings were produced by shifting their onset time (lag) with respect to the stimulus event time (number of lags = 21, interval between consecutive lags = 0.1 s) (De Martino et al., 2008; Ley et al., 2012). For each trial, the GLM coefficient β resulting from the best-fitting hemodynamic response function across lags in the whole brain was used to construct an fMRI feature space of dimensions number of trials × number of cortical vertices, which was thereafter used in the multivariate decoding.
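A simplified single-vertex, single-trial version of this estimation is sketched below. A canonical double-gamma HRF and a forward lag range of 0–2 s are assumptions of this sketch (the HRF model and lag range are not fully specified above, and in the study the best-fitting lag was selected taking the whole brain into account).

```python
import numpy as np
from scipy.stats import gamma

def double_gamma_hrf(t):
    """Canonical double-gamma HRF (assumed; zero for t < 0)."""
    return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

def best_lag_beta(timecourse, onset, tr=2.0, n_lags=21, lag_step=0.1):
    """Fit a single-trial HRF regressor at 21 onset lags (0.1 s apart) and
    return the GLM beta of the best-fitting lag for one vertex."""
    n = len(timecourse)
    scan_times = np.arange(n) * tr
    best_rss, best_beta = np.inf, 0.0
    for lag in np.arange(n_lags) * lag_step:
        regressor = double_gamma_hrf(scan_times - (onset + lag))
        design = np.column_stack([regressor, np.ones(n)])  # HRF + constant
        coef, rss, _, _ = np.linalg.lstsq(design, timecourse, rcond=None)
        rss = rss[0] if len(rss) else np.sum((timecourse - design @ coef) ** 2)
        if rss < best_rss:
            best_rss, best_beta = rss, coef[0]
    return best_beta
```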
Surface-based searchlight procedure (selecting cortical vertices for classification).
To avoid degraded performance of the classification algorithm due to the high dimensionality of the feature space (model overfitting; for a description, see Norman et al., 2006), a reduction of the number of fMRI features is usually performed. The moving searchlight approach (Kriegeskorte et al., 2006) restricts fMRI features using focal selections centered at all voxels within spherical patches of the gray matter volume. Here, we used a searchlight procedure on the gray-white matter segmentation surface (Chen et al., 2011), which selected cortical vertices for decoding based on their spatial (geodesic) distance within circular surface patches with a radius of 10 mm (Fig. 3D). The surface-based searchlight procedure reduces the concurrent inclusion of voxels across different gyri that are geodesically distant but nearby in 3D volume space, and has been shown to be reliable for fMRI MVPA (Chen et al., 2011; Oosterhof et al., 2011). Crucially, the surface-based searchlight procedure ensures an independent analysis of superior temporal and ventral frontal cortical regions that may be involved in the articulatory representation of speech. The primary searchlight analysis was conducted in the predefined ROI comprising perisylvian speech and language regions (Fig. 3C). An additional exploratory analysis of the remaining cortical surface covered by the functional scans was also conducted. Furthermore, next to the searchlight analysis using a radius similar to that used in previous speech decoding studies (Lee et al., 2012; Correia et al., 2014; Evans and Davis, 2015), we performed a further analysis using a larger searchlight radius of 20 mm (Fig. 3D). Different radius sizes in the searchlight method may exploit different spatial spreads of fMRI response patterns (Nestor et al., 2011; Etzel et al., 2013).
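In Python, the surface patch selection and the per-patch generalization analysis can be sketched as follows, assuming a precomputed vertices × vertices geodesic distance matrix derived from the reconstructed surface mesh (how that matrix is computed is not part of this sketch) and reusing the generalization_accuracy function from the sketch above.

```python
import numpy as np

def searchlight_map(X, place, manner, geodesic_dist, radius=10.0):
    """For every vertex, select all vertices within `radius` mm geodesic
    distance (a circular surface patch) and compute the generalization
    accuracy on that patch. geodesic_dist: vertices x vertices matrix in mm
    (a full matrix is used here only for clarity; it is memory-hungry)."""
    n_vertices = X.shape[1]
    accuracy = np.full(n_vertices, np.nan)
    for center in range(n_vertices):
        patch = np.where(geodesic_dist[center] <= radius)[0]
        accuracy[center] = generalization_accuracy(X[:, patch], place, manner)
    return accuracy
```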
Cross-validation.
Cross-validation was based on the generalization of articulatory features of syllables independent of acoustic variation across other articulatory features. For each classification strategy, two cross-validation folds were created, comprising generalization in one direction and in the opposite direction (e.g., generalization of place of articulation from stop to fricative syllables and from fricative to stop syllables). Cross-validation based on generalization strategies is attractive because it enables detecting activation patterns that are robust to variation across other stimulus dimensions (Formisano et al., 2008; Buchweitz et al., 2012; Correia et al., 2014). Because we aimed to maximize the acoustic variance of our decoding scheme, generalization of place of articulation and of voicing was calculated across manner of articulation, the dimension that is acoustically most distinguishable (Rietveld and van Heuven, 2001; Ladefoged and Johnson, 2010). Generalization of manner of articulation was performed across place of articulation.
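The two generalization directions can be combined as in the short sketch below, which averages the accuracies of both folds (consistent with the averaged accuracy maps described in the next section) using the generalization_accuracy function sketched earlier.

```python
def two_fold_generalization(X, place, manner):
    """Average the accuracies of the two generalization directions
    (train on stops / test on fricatives, and the reverse)."""
    forward = generalization_accuracy(X, place, manner,
                                      train_manner='stop',
                                      test_manner='fricative')
    backward = generalization_accuracy(X, place, manner,
                                       train_manner='fricative',
                                       test_manner='stop')
    return (forward + backward) / 2.0
```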
Generalization maps.
At the end of the searchlight decoding procedure, individual averaged accuracy maps for place of articulation, manner of articulation, and voicing were constructed, projected onto the group-averaged cortical surface, and anatomically aligned using cortex-based alignment. To assess group-averaged statistical significance of cortical vertices (chance level is 50%), exact permutation tests were used (n = 1022). The resulting statistical maps were then corrected for multiple comparisons by applying a cluster size threshold with a false-positive rate (α = 5%) after setting an initial vertex-level threshold (p < 0.05, uncorrected) and submitting the maps to a correction criterion based on the estimate of the spatial smoothness of the map (Forman et al., 1995; Goebel et al., 2006; Hagler et al., 2006).
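One common way to implement such an exact group-level permutation test is to enumerate all sign flips of the subject-wise deviations of accuracy from chance. The sketch below uses this scheme purely for illustration; the exact permutation construction (and its relation to the reported n = 1022) is not detailed above.

```python
import itertools
import numpy as np

def signflip_permutation_p(subject_accuracies, chance=0.5):
    """Exact one-sided permutation p-value at a single vertex, obtained by
    enumerating every sign flip of the subject-wise deviations from chance
    (2**n_subjects permutations; an illustrative stand-in)."""
    dev = np.asarray(subject_accuracies) - chance
    observed = dev.mean()
    null = [np.mean(dev * np.asarray(signs))
            for signs in itertools.product([1, -1], repeat=len(dev))]
    return np.mean(np.asarray(null) >= observed)

# Hypothetical usage with placeholder per-subject accuracies:
# p = signflip_permutation_p([0.54, 0.58, 0.51, 0.62, 0.55,
#                             0.49, 0.57, 0.53, 0.60, 0.52])
```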
Results
To investigate the neural representation of articulatory features during speech perception, we used a classification strategy that relied on generalizing the discriminability of three articulatory features across different syllable pairs. That is, using a moving searchlight procedure, we tested which cortical regions allow the generalization of (1) place of articulation across two distinct manners of articulation (Figs. 4A, 5A), (2) manner of articulation across two distinct places of articulation (Figs. 4B, 5B), and (3) voicing across two distinct manners of articulation (Figs. 4C, 5C). For each type of generalization analysis, half of the available trials were used to train a classifier to discriminate the neural responses evoked by the articulatory features of interest, and the other half were used to test the generalization of this learning. The same stimuli were hence included in all generalization analyses in a counterbalanced manner.
Figure 4A–C depicts the generalization accuracy maps for the three types of decoding obtained from the primary analysis, statistically masked (p < 0.05), with black circled clusters indicating regions that survived cluster size multiple comparisons correction. The generalization maps revealed successful decoding of each of the three articulatory features within distinct but partially overlapping regions of the brain's speech perception network. We observed generalization foci within both the left and right hemispheres, suggesting the participation of bilateral language and speech regions in the representation of spoken syllables based on their articulatory/motoric properties. Cortical regions within the ROI (Fig. 3C) enabling the generalization of place and manner of articulation were most widely distributed, including superior temporal, premotor, motor, and somatosensory regions bilaterally, as well as sensorimotor areas at the border between the parietal and temporal lobes in the left hemisphere for place of articulation and in the right hemisphere for manner of articulation. Specific regions leading to significant generalization of place of articulation included localized clusters in the left medial and posterior superior temporal gyrus (STG), left posterior inferior postcentral gyrus (post-CG), and left anterior SMG, as well as the right middle and anterior superior temporal sulcus (STS), right posterior inferior post-CG, inferior precentral gyrus (pre-CG), and right posterior inferior frontal gyrus (IFG). Significant generalization of manner of articulation was supported by clusters within the left posterior inferior post-CG, right mid STS, right posterior mid/inferior post-CG, inferior anterior pre-CG/posterior IFG, right anterior SMG, and right anterior insula. In contrast to the contribution of multiple, distributed brain activity clusters to the representation of place and manner of articulation, generalization of voicing across different syllables was more confined, specifically involving the right anterior STS. Finally, a visual comparison of the three types of generalization maps (Fig. 4D) demonstrates spatially overlapping clusters of significant generalization. Overlap between place and manner of articulation was observed within the inferior post-CG bilaterally as well as in the right anterior pre-CG/posterior IFG and the right anterior insula. Overlap between place of articulation and voicing was observed within the right anterior STS.
An exploratory analysis beyond our speech-related perisylvian ROI showed additional clusters with the capacity to generalize articulatory properties (Fig. 4; Table 1). Specifically, place of articulation was significantly decodable in the right posterior middle temporal gyrus, left cuneus and precuneus, as well as the caudate and cerebellum bilaterally. Clusters allowing the generalization of manner of articulation were found in the right inferior angular gyrus, as well as in the right parahippocampal gyrus and cerebellum. Finally, additional clusters for voicing included the left intraparietal sulcus, the right inferior angular gyrus as well as the left anterior cingulate, right cuneus and precuneus, and the right parahippocampal gyrus.
To investigate the contribution of cortical representations with a larger spatial extent, we performed an additional analysis using a searchlight radius of 20 mm (Fig. 5; Table 2). The larger searchlight radius yielded broader clusters, as expected, and the overall pattern of results was comparable with that obtained with the 10 mm searchlight radius (Fig. 4). It also yielded significant effects in additional clusters within the ROI. In particular, generalization of manner of articulation led to significant clusters along the superior temporal lobe bilaterally and in the left inferior frontal gyrus. Significant decoding of place of articulation was found in the right SMG, and the generalization of voicing led to additional significant clusters in the inferior post-CG bilaterally. Finally, decreases of generalization for the larger searchlight radius were found, especially within the left superior temporal lobe, posterior inferior post-CG, and anterior SMG for place of articulation.
Discussion
The present study aimed to investigate whether focal patterns of fMRI responses to speech input contain information regarding articulatory features when participants are attentively listening to spoken syllables in the absence of task demands that direct their attention to speech production or monitoring. Using high spatial resolution fMRI in combination with an MVPA generalization approach, we were able to identify specific foci of brain activity that discriminate articulatory features of spoken syllables independent of their individual acoustic variation (surface form) across other articulatory dimensions. These results provide compelling evidence for interlinked brain circuitry of speech perception and production within the dorsal speech regions, and in particular, for the availability of articulatory codes during online perception of spoken syllables within premotor and motor, somatosensory, auditory, and/or sensorimotor integration areas.
Our generalization analysis suggests the involvement of premotor and motor areas in the neural representation of two important articulatory features during passive speech perception: manner and place of articulation. These findings are compelling because the role of premotor and motor areas during speech perception remains controversial (Liberman and Mattingly, 1985; Galantucci et al., 2006; Pulvermüller and Fadiga, 2010). Left hemispheric motor speech areas have been found to be involved both in the subvocal rehearsal and perception of the syllables /ba/ and /da/ (Pulvermüller et al., 2006) and of spoken words (Schomers et al., 2014). Furthermore, speech motor regions may bias the perception of ambiguous speech syllables under noisy conditions (D'Ausilio et al., 2009; Du et al., 2014) and have been suggested to be specifically important for the performance of tasks requiring subvocal rehearsal (Hickok and Poeppel, 2007; Krieger-Redwood et al., 2013). However, the involvement of (pre)motor cortex in speech perception may also reflect an epiphenomenal consequence of interconnected networks for speech perception and production (Hickok, 2009). Importantly, the observed capacity of activation patterns in (pre)motor regions to generalize place of articulation across variation in manner, and manner across variation in place, indicates the representation of articulatory information beyond mere activation spread, even during passive listening to clearly spoken syllables. Further investigations exploiting correlations of articulatory MVPA representations with behavioral measures of speech perception and their modulation by task difficulty (Raizada and Poldrack, 2007) may permit mapping aspects related to speech intelligibility and may lead to a further understanding of the functional relevance of such articulatory representations.
Our results also show decoding of articulatory features in bilateral somatosensory areas. In particular, areas comprising the inferior posterior banks of the postcentral gyri were sensitive to the generalization of place and manner of articulation (Fig. 4), and of voicing when using a larger searchlight radius (Fig. 5). Somatosensory and motoric regions are intrinsically connected, allowing for the online control of speech gestures and proprioceptive feedback (Hickok et al., 2011; Bouchard et al., 2013). Together with feedback from auditory cortex, this somatosensory feedback may form a state feedback control system for speech production (SFC) (Houde and Nagarajan, 2011). The involvement of somatosensory areas in the representation of articulatory features during passive speech perception extends recent findings showing the involvement of these regions in the neural decoding of place and manner of articulation during speech production (Bouchard et al., 2013), and of place of articulation during an active perceptual task in English listeners (Arsenault and Buchsbaum, 2015). In particular, they may indicate automatic information transfer from auditory to somatosensory representations during speech perception (Cogan et al., 2014) similar to their integration as part of SFC systems for speech production.
Especially in the auditory cortex, it is essential to disentangle brain activity indicative of articulatory versus acoustic features. So far, methodological constraints related to experimental designs and analysis methods have often prevented this differentiation. Moreover, it is likely that multiple, different types of syllable representations are encoded in different brain systems responsible for auditory- and articulatory-based analysis (Cogan et al., 2014; Mesgarani et al., 2014; Evans and Davis, 2015). In a recent study, intracranial EEG responses to English phonemes indicated a phonetic organization in the left superior temporal cortex, especially in terms of manner of articulation and, to a lesser extent, place of articulation and voice onset time (Mesgarani et al., 2014). Our fMRI decoding findings confirm the expected encoding of manner and place in bilateral superior temporal cortex. Most relevantly, our generalization analysis suggests the encoding of articulatory similarities across sets of acoustically different syllables. In particular, auditory (STG) response patterns distinguishing place of articulation in one set of speech sounds (e.g., stop consonants) predicted place of articulation in another set of speech sounds (e.g., fricatives). In connected speech, such as our natural consonant-vowel syllables, place of articulation also induces specific coarticulatory cues that may contribute to its neural representation (Bouchard and Chang, 2014). Importantly, however, although our stimuli showed a smaller log(F2/F1) ratio for the vowels /u/ and /a/ due to coarticulation following consonants with a bilabial/labio-dental versus alveolar place of articulation, this coarticulatory cue was not present for the vowel /i/. Thus, the success of our cross-generalization analysis in the presence of the balanced variance of pronunciations per consonant (vowels /a/, /i/, and /u/ and pronunciations from three different speakers) suggests that it is possible to study the representation of articulatory, in addition to auditory, features in auditory cortex. Finally, the finding that auditory cortical generalization of the most acoustically distinguishable feature, manner of articulation, was mainly present in the analysis using a larger searchlight radius is consistent with a distributed phonetic representation of speech in these regions (Formisano et al., 2008; Bonte et al., 2014). Although these findings also suggest a different spatial extent of auditory cortical representations for place versus manner features, the exact nature of the underlying representations remains to be determined in future studies.
Our cortical generalization maps show that it was possible not only to predict manner and place of articulation from activation patterns in bilateral speech-related auditory areas (STG), but also to predict place of articulation and voicing from patterns within the right anterior superior temporal lobe (STS). Articulation- and especially voicing-related representations within the right anterior STS may relate to its involvement in the processing of human voices (Belin et al., 2000), and possibly to the proposed specialization of this area in perceiving vocal-tract properties of speakers (e.g., shape and characteristics of the vocal folds). A role of this phonological feature in the processing of human voices is also compatible with the previous finding that voicing was more robustly classified than either place or manner of articulation when subjects performed an active gender discrimination task (Arsenault and Buchsbaum, 2015).
Decoding of articulatory representations that may relate to sensorimotor integration mechanisms, thus possibly involving the translation between auditory and articulatory codes, included regions within the inferior parietal lobes. Specifically, the anterior SMG was found to generalize manner of articulation in the right hemisphere and place of articulation in the left (10 mm searchlight radius) and right (20 mm searchlight radius) hemisphere. Nearby regions involving the inferior parietal lobe (Raizada and Poldrack, 2007; Moser et al., 2009; Kilian-Hütten et al., 2011) and sylvian-parietal-temporal regions (Caplan and Waters, 1995; Hickok et al., 2003, 2009; Buchsbaum et al., 2011) have been implicated in sensorimotor integration during speech perception as well as in mapping auditory targets of speech sounds before the initiation of speech production (Hickok et al., 2009; Guenther and Vladusich, 2012). Here, we show the sensitivity of SMG to represent articulatory features of spoken syllables during speech perception in the absence of an explicit and active task, such as repetition (Caplan and Waters, 1995; Hickok et al., 2009) or music humming (Hickok et al., 2003). Furthermore, our results suggest the involvement of inferior parietal lobe regions in the perception of clear speech, extending previous findings showing a significant role in the perception of ambiguous spoken syllables (Raizada and Poldrack, 2007) and the integration of ambiguous spoken syllables with lip-read speech (Kilian-Hütten et al., 2011).
Beyond regions in speech-related perisylvian cortex (the ROI used), our exploratory whole-brain analysis suggests the involvement of additional clusters in parietal, occipital, and medial brain regions (Table 1). In particular, regions such as the angular gyrus and intraparietal sulcus are structurally connected to superior temporal speech regions via the middle longitudinal fasciculus (Seltzer and Pandya, 1984; Makris et al., 2013; Dick et al., 2014) and may be involved in the processing of vocal features (Petkov et al., 2008; Merril et al., 2012). Their functional role in the sensorimotor representation of speech needs to be determined in future studies, for example, using different perceptual and sensorimotor tasks. Moreover, it is possible that different distributions of informative response patterns are best captured by searchlights of different radii and shapes (Nestor et al., 2011; Etzel et al., 2013). Our additional searchlight results obtained with an increased radius of 20 mm (Fig. 5) validated the results from the primary analysis using a radius of 10 mm (Fig. 4). They also showed significant decoding accuracy in the bilateral superior temporal lobe for manner of articulation. Furthermore, decreases of generalization for the larger searchlight radius were also found, especially in left sensorimotor and auditory regions for place of articulation. A similar pattern of results was previously reported in a study on the perception of ambiguous speech syllables, with a smaller radius allowing decoding in premotor regions and a larger radius allowing decoding in auditory regions (Lee et al., 2012). Whereas increases of generalization capability at given locations with a larger searchlight may indicate a broader spatial spread of the underlying neural representations, decreases of generalization at other locations may reflect the inability of more localized multivariate models to detect these representations. This also points to the important issue of how fMRI features are selected for MVPA. Alternative feature selection methods, such as recursive feature elimination (De Martino et al., 2008), optimize the number of features recursively starting from large ROIs. However, these methods are computationally expensive and require a much larger number of training trials. Furthermore, the searchlight approach provides a more direct link between classification performance and cortical localization, which was a major goal in this study. Another relevant aspect for improving MVPA in general, and our generalization strategy in particular, is the inclusion of additional variance in the stimulus set used for learning the multivariate model. For instance, future studies aiming to specifically target place of articulation distinctions could, in addition to the variation along manner of articulation used here (stops and fricatives), include nasal consonants, such as the bilabial ‘m’ and alveolar ‘n.’
Overall, the combination of the searchlight method with our generalization strategy was crucial to disentangle the neural representation of articulatory and acoustic differences between individual spoken syllables during passive speech perception. In particular, it allowed localizing the representation of bilabial/labio-dental versus alveolar (place), stop versus fricative (manner), and voiced versus unvoiced (voicing) articulatory features in multiple cortical regions within the dorsal auditory pathway that are relevant for speech processing and control (Hickok and Poeppel, 2007). Similar generalization strategies capable of transferring representation patterns across different stimulus classes have been adopted in MVPA studies, for example, in isolating the identity of vowels and speakers independent of acoustic variation (Formisano et al., 2008), in isolating concepts independent of language presentation in bilinguals (Buchweitz et al., 2012; Correia et al., 2014, 2015), and in isolating concepts independent of presentation modality (Shinkareva et al., 2011; Simanova et al., 2014). Together, these findings suggest that the neural representation of language consists of specialized bilateral subnetworks (Cogan et al., 2014) that tune to certain feature characteristics independent of other features within the signal. Crucially, our findings provide evidence for the interaction of auditory, sensorimotor, and somatosensory brain circuitries during speech perception, consistent with the behavioral link between perception and production faculties in everyday life. The application of fMRI decoding and generalization methods also holds promise for investigating similarities of acoustic and articulatory speech representations across the perception and production of speech.
Footnotes
This work was supported by European Union Marie Curie Initial Training Network Grant PITN-GA-2009-238593. We thank Elia Formisano for thorough discussions and advice on the study design and analysis and Matt Davis for valuable discussions.
The authors declare no competing financial interests.
Correspondence should be addressed to Dr. Joao M. Correia, Maastricht University, Oxfordlaan 55, 6229 EV Maastricht, The Netherlands. joao.correia@maastrichtuniversity.nl