Abstract
Speech is a critical form of human communication and is central to our daily lives. Yet, despite decades of study, an understanding of the fundamental neural control of speech production remains incomplete. Current theories model speech production as a hierarchy from sentences and phrases down to words, syllables, speech sounds (phonemes), and the actions of vocal tract articulators used to produce speech sounds (articulatory gestures). Here, we investigate the cortical representation of articulatory gestures and phonemes in ventral precentral and inferior frontal gyri in men and women. Our results indicate that ventral precentral cortex represents gestures to a greater extent than phonemes, while inferior frontal cortex represents both gestures and phonemes. These findings suggest that speech production shares a common cortical representation with that of other types of movement, such as arm and hand movements. This has important implications both for our understanding of speech production and for the design of brain–machine interfaces to restore communication to people who cannot speak.
SIGNIFICANCE STATEMENT Despite being studied for decades, the production of speech by the brain is not fully understood. In particular, the most elemental parts of speech, speech sounds (phonemes) and the movements of vocal tract articulators used to produce these sounds (articulatory gestures), have both been hypothesized to be encoded in motor cortex. Using direct cortical recordings, we found evidence that primary motor and premotor cortices represent gestures to a greater extent than phonemes. Inferior frontal cortex (part of Broca's area) appears to represent both gestures and phonemes. These findings suggest that speech production shares a similar cortical organizational structure with the movement of other body parts.
Introduction
Speech is composed of individual sounds, called segments or (hereafter) phonemes (Bakovic, 2014), that are produced by coordinated movements of the vocal tract (e.g., lips, tongue, velum, and larynx). However, it is not certain exactly how these movements are planned. For example, during speech planning, phonemes are coarticulated—the vocal tract actions (constrictions or releases), or articulatory gestures, that comprise a given phoneme change based on neighboring phonemes in the uttered word or phrase (Whalen, 1990). While the dynamic properties of these gestures, which are similar to articulator kinematics, have been extensively studied (Westbury et al., 1990; Nam et al., 2012; Bocquelet et al., 2016; Bouchard et al., 2016; Carey and McGettigan, 2017), there is no direct evidence of gestural representations in the brain.
Recent models of speech production propose that articulatory gestures combine to create acoustic outputs (phonemes and phoneme groupings such as syllables; Browman and Goldstein, 1992; Guenther et al., 2006). Guenther et al. (2006) hypothesized that ventral premotor cortex (PMv) and inferior frontal gyrus (IFG; part of Broca's area) preferentially represent (groupings of) phonemes and that ventral motor cortex (M1v) preferentially represents gestures. This hypothesis is analogous to limb motor control, in which premotor cortices preferentially encode reach targets and M1 encodes reaching details (Hocherman and Wise, 1991; Shen and Alexander, 1997; Hatsopoulos et al., 2004; Pesaran et al., 2006). However, the model's hypothesized localizations of speech motor control were based on indirect evidence from behavioral studies (Ballard et al., 2000), nonspeech articulator movements (Penfield and Roberts, 1959; Fesl et al., 2003), and fMRI studies of syllables (Riecker et al., 2000; Guenther et al., 2006; Ghosh et al., 2008; Tourville et al., 2008). None of the modalities used in these studies had a sufficient combination of temporal and spatial resolution to provide definitive information about where and how gestures and phonemes are encoded.
Electrocorticography (ECoG) has enabled the identification of neural activity with high spatial and temporal resolution during speech production (Kellis et al., 2010; Pei et al., 2011b; Bouchard et al., 2013; Mugler et al., 2014b; Slutzky, 2018). High gamma activity (70–200 Hz) in ECoG from ventral precentral gyrus (PCG; encompassing M1v and PMv) corroborated Penfield's original somatotopic mappings of the articulators (Penfield and Boldrey, 1937) and approximately correlated with phoneme production (Bouchard et al., 2013; Lotte et al., 2015; Ramsey et al., 2018), as well as the manner and place of articulation (Bouchard et al., 2013; Lotte et al., 2015). Mugler et al. (2014b) demonstrated that single instances of phonemes can be identified during word production using ECoG from PCG. However, the ability to decode phonemes from these areas was rather limited, which suggests that phonemes may not optimally characterize the representation of these cortical areas. Some evidence exists that cortical activations producing phonemes differ depending on the context of neighboring phonemes (Bouchard and Chang, 2014; Mugler et al., 2014a). Moreover, incorporating probabilistic information of neighboring phonemes improves the ability to decode phonemes from PCG (Herff et al., 2015). Therefore, these areas might demonstrate predominant representation for gestures, not phonemes. However, no direct evidence of gestural representation in the brain has yet been demonstrated.
Here, we used ECoG from PCG and IFG to classify phonemes and gestures during spoken word production. We hypothesized that posterior PCG (approximate M1v) represents the movements, and hence the gestures, of speech articulators. We first examined the ability to determine the positions of phonemes and gestures within words using ECoG. We next compared the relative performances of gesture and phoneme classification in each cortical area. Finally, we used a special case of contextual variance—allophones, in which the same phoneme is produced with different combinations of gestures—to highlight more distinctly the gestural versus phonemic predominance in each area. The results indicate that gestures are the predominant elemental unit of speech production represented in PCG, while both phonemes and gestures appear to be more weakly represented in IFG, with gestures still slightly more predominant.
Materials and Methods
Subject pool.
Nine adults (mean age, 42 years; five females) who required intraoperative ECoG monitoring during awake craniotomies for glioma removal volunteered to participate in a research protocol during surgery. We excluded subjects with tumor-related symptoms affecting speech production (as determined by neuropsychological assessment) and non-native English speakers from the study. All tumors were located at least two gyri (∼2–3 cm) away from the recording electrodes. As per the standard of care, subjects were first anesthetized with low doses of propofol and remifentanil, then awakened for direct cortical stimulation mapping. All experiments were performed after cortical stimulation mapping; hence, no general anesthesia had been administered for at least 45 min before the experiments began, and no effects on speech articulation were detected. Subjects provided informed consent for research, and the Institutional Review Board at Northwestern University approved the experimental protocols.
Electrode grid placement was determined using both anatomical landmarks and functional responses to direct cortical stimulation. Electrode grids were placed to ensure the coverage of areas that produced movements of the articulators when stimulated. ECoG grid placement varied slightly with anatomy but consistently covered targeted areas of ventral posterior PCG (pPCG; the posterior half of the gyrus, approximately equivalent to M1v), ventral anterior PCG (aPCG; the anterior half of the gyrus, approximately equivalent to PMv), and IFG pars opercularis, usually aligning along the Sylvian fissure ventrally. We defined our locations purely by anatomy to be conservative, since it was impossible to define them functionally in vivo, but with the intention of estimating M1v and PMv. We confirmed grid location with stereotactic procedure planning, anatomical mapping software (Brainlab), and intraoperative photography (Hermes et al., 2010).
Data acquisition.
A 64-electrode, 8 × 8 ECoG grid (4 mm spacing; Integra) was placed over the cortex and connected to a Neuroport data acquisition system (Blackrock Microsystems). Both stimulus presentation and data acquisition were facilitated through a quad-core computer running a customized version of BCI2000 software (Schalk et al., 2004). Acoustic energy from speech was measured with a unidirectional lapel microphone (Sennheiser) placed near the patient's mouth. The microphone signal was transmitted wirelessly (Califone) to the recording computer, sampled at 48 kHz, and synchronized to the neural signal recording.
All ECoG signals were bandpass filtered from 0.5 to 300 Hz and sampled at 2 kHz. Differential cortical recordings compared with a reference ECoG electrode were exported for analysis with an applied bandpass filter (0.53–300 Hz) with 75 μV sensitivity. Based on intraoperative photographs and Brainlab reconstructions of array coordinates, electrodes in the posterior and anterior halves of the precentral gyrus were assigned to pPCG and aPCG, respectively, while those anterior to the precentral sulcus and ventral to the middle frontal sulcus were assigned to IFG. Data will be made available upon request to the senior author.
Experimental protocol.
We presented words in randomized order on a screen at a rate of 1 every 2 s, in blocks of 4.5 min. Subjects were instructed to read each word aloud as soon as it appeared. Subjects were surveyed regarding accent and language history, and all subjects included here were native English speakers. All subjects completed at least two blocks, and up to three blocks.
All word sets consisted of monosyllabic words and varied depending on subject and anatomical grid coverage. Stimulus words were chosen for their simple phonological structure, phoneme frequency, and phoneme variety. Many words in the set were selected from the modified rhyme test (MRT), consisting of monosyllabic words with primarily consonant–vowel–consonant (CVC) structure (House et al., 1963). The frequency of phonemes within the MRT set roughly approximates the phonemic frequency in American English (Mines et al., 1978). The MRT set was then supplemented with additional CVC words to incorporate all General American English phonemes into the word set and to achieve a more uniform phoneme incidence. The mean word duration was 520 ms. Consonant cluster allophone words contained initial stop consonants; each allophone example included a voiced, a voiceless, and a consonant cluster allophone word (e.g., "bat," "pat," and "spat"; Buchwald and Miozzo, 2011).
Signal processing.
We examined normalized activity in the high gamma band (70–290 Hz), since this band is highly informative about limb motor (Crone et al., 2001; Mehring et al., 2004; Chao et al., 2010; Flint et al., 2012a,b, 2017), speech (Crone et al., 2001; Pei et al., 2011a; Bouchard et al., 2013; Ramsey et al., 2018), and somatosensory activity (Ray et al., 2008), and correlates with ensemble spiking activity (Ray and Maunsell, 2011) and blood oxygenation level-dependent activity (Logothetis et al., 2001; Hermes et al., 2012). ECoG signals were first rereferenced to a common average of all electrodes in the time domain. We used the Hilbert transform to isolate band power in eight linearly distributed 20-Hz-wide sub-bands within the high gamma band that avoided the 60 Hz noise harmonics and averaged them to obtain the high gamma power. We then normalized the high gamma band power changes of each electrode, by subtracting the median and dividing by the interquartile range, to create frequency features for each electrode.
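As a concrete illustration (not the authors' actual analysis code), the following Python/SciPy sketch extracts and normalizes high gamma power along these lines; the specific sub-band edges and the fourth-order Butterworth filters are our assumptions, chosen to tile 70–290 Hz while skipping the 60 Hz noise harmonics.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

FS = 2000  # ECoG sampling rate (Hz)

# Eight 20-Hz-wide sub-bands spanning 70-290 Hz that skip the 60 Hz
# noise harmonics (120, 180, 240 Hz); exact edges are an assumption.
SUB_BANDS = [(70, 90), (90, 110), (130, 150), (150, 170),
             (190, 210), (210, 230), (250, 270), (270, 290)]

def high_gamma_power(ecog, fs=FS):
    """ecog: (n_samples, n_channels) ECoG. Returns normalized high gamma
    power of the same shape."""
    # Common average reference across electrodes (time domain)
    ecog = ecog - ecog.mean(axis=1, keepdims=True)

    band_power = np.zeros_like(ecog, dtype=float)
    for lo, hi in SUB_BANDS:
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        analytic = hilbert(filtfilt(b, a, ecog, axis=0), axis=0)
        band_power += np.abs(analytic) ** 2
    band_power /= len(SUB_BANDS)        # average across sub-bands

    # Robust normalization per electrode: subtract median, divide by IQR
    med = np.median(band_power, axis=0, keepdims=True)
    q75, q25 = np.percentile(band_power, [75, 25], axis=0)
    return (band_power - med) / (q75 - q25)
```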
To create features in the time domain, we segmented normalized high gamma values for each electrode into 50 ms time bins from 300 ms before to 300 ms after the onset of each event (phoneme or gesture). This was far enough in advance of event onset to capture most relevant information in IFG, which starts ∼300 ms before word onset (Flinker et al., 2015). This created discrete, event-based trials that summarized the time-varying neural signal directly preceding and throughout the production of each phoneme or gesture. Time windows for allophone feature creation were shorter (−300 to 100 ms) to further reduce the effect of coarticulation on the allophone classification results. The phonemes that were classified in the allophone analysis (/p/, /b/, /t/, /d/, /k/, and /g/) were all plosives (stop consonants) and had durations of <100 ms, so we were able to use this shorter window without losing information about the phonemes. This is in contrast to the direct classification of phonemes and gestures, which included phonemes such as /m/ and /n/ that were longer in duration; hence, we used activity up to 300 ms after onset to capture this information.
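A sketch of this event-based segmentation, under the same illustrative assumptions (function and variable names are hypothetical), might look like the following; for the allophone analysis, the same routine would be called with t_post = 0.1 to implement the shorter −300 to 100 ms window.

```python
import numpy as np

def event_features(hg, onsets_s, fs=2000, t_pre=0.3, t_post=0.3, bin_s=0.05):
    """hg: (n_samples, n_channels) normalized high gamma.
    onsets_s: event (phoneme or gesture) onset times in seconds.
    Returns an (n_events, n_bins * n_channels) feature matrix of
    bin-averaged high gamma from t_pre before to t_post after each onset."""
    n_bins = int(round((t_pre + t_post) / bin_s))      # e.g., 12 bins of 50 ms
    bin_len = int(round(bin_s * fs))
    feats = []
    for t0 in onsets_s:
        start = int(round((t0 - t_pre) * fs))
        if start < 0 or start + n_bins * bin_len > hg.shape[0]:
            continue                                   # skip truncated events
        win = hg[start:start + n_bins * bin_len]
        # average within each bin, then flatten bins x channels
        binned = win.reshape(n_bins, bin_len, -1).mean(axis=1)
        feats.append(binned.ravel())
    return np.asarray(feats)
```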
Phoneme and gesture labeling.
Following standard practices, we used visual and auditory inspection of auditory spectral changes to manually label the onset of each phoneme in the speech signal (Mugler et al., 2014b). For plosives, phoneme onset was marked by acoustic release. For fricatives, phoneme onset was marked by the onset of aperiodic noise. For sonorants and vowels, onset was marked by changes to spectral properties. To label gesture onset times, acoustic–articulatory inversion was used on the audio recordings of subjects. This technique estimates articulator trajectories from acoustic data, using a model that accounts for subject- and utterance-specific differences in production. We used an acoustic–articulatory inversion (AAI) model, described by Wang et al. (2015), based on a deep neural network trained on data from the University of Wisconsin x-ray Microbeam corpus (Westbury et al., 1990), with missing articulatory data filled in using the data imputation model of Wang et al. (2014). This model performed well in predicting articulator positions for corpus data not used in training (i.e., in cross-validation), with a root-mean-square error of only 1.96 mm averaged over all articulators. This error was smaller than that reported in similar studies, including a study in which AAI with an error of 2.5 mm drove a speech synthesizer that still produced recognizable speech a high percentage of the time (Bocquelet et al., 2016). Moreover, we simulated this error by adding Gaussian noise with a mean of 0 and an SD of 1.96 mm to the position and velocity estimates from AAI, and computed the error in gestural time estimates in two subjects. We found that this amount of noise translated to a mean ± SD error of 5.2 ± 9.8 ms in time, which was far smaller than the time bins we used for decoding. While there could be some discrepancies in applying this model to patients in an operating room (possibly with dry mouths and lying on their side), even an error of 5 mm per articulator translated in simulation to timing errors of only 5.3 ± 13 ms. Even if the errors were larger than this, the resulting timing errors would bias the decoding performance to be poorer for gestures, rather than better. Thus, any discrepancies in gestural timing due to the limits of AAI would not affect our conclusions.
We used AAI to generate articulator positions of the lips, tongue tip, and tongue body at a time resolution of 10 ms (Fig. 1). The lip aperture was defined as the Euclidean combination of vertical and horizontal positions in the sagittal plane, and tongue apertures were defined using vertical position. Position trajectories were smoothed with a Gaussian kernel of 50 ms. The onsets of each gesture (closure, critical closure, and release) were defined from the position and velocity traces, as in the study by Pouplier and Goldstein (2010). In brief, gesture onset time was defined as the moment the articulator velocity surpassed 20% of the difference between the minimum velocity preceding movement and the maximum velocity during gesture formation. For plosives, the onset of gesture release (e.g., tongue tip release) was set to phoneme onset time. Since AAI does not provide laryngeal or velar information, the Task Dynamic (TADA) model of interarticulator coordination was used to generate expected velar gesture onset times (Saltzman and Munhall, 1989; Nam et al., 2012). Because TADA is not speaker specific, the onset times were scaled proportionally by the ratio of the default word duration (from TADA) to the actual duration of each word. We used these onset times for each event in the speech signal to segment ECoG features.
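The 20% velocity-threshold rule can be summarized in a short sketch (illustrative only; the smoothing parameterization and helper name are our assumptions, and the same routine could be rerun on noise-perturbed traces to reproduce the timing-error simulation described above).

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def gesture_onset(aperture, fs=100, smooth_ms=50):
    """Estimate gesture onset time (s) from an articulator aperture trace
    sampled at fs Hz (AAI output at 10 ms resolution -> fs = 100).
    Onset = first time the speed exceeds 20% of the range between the
    minimum speed preceding movement and the peak speed during gesture
    formation."""
    sigma = smooth_ms / 1000.0 * fs           # ~50 ms Gaussian kernel (assumed as sigma)
    pos = gaussian_filter1d(np.asarray(aperture, float), sigma)
    speed = np.abs(np.gradient(pos)) * fs     # |velocity| in aperture units/s

    i_peak = int(np.argmax(speed))            # peak speed during gesture formation
    i_min = int(np.argmin(speed[:i_peak])) if i_peak > 0 else 0  # pre-movement minimum
    threshold = speed[i_min] + 0.2 * (speed[i_peak] - speed[i_min])

    # first sample between the pre-movement minimum and the peak to cross threshold
    crossing = np.nonzero(speed[i_min:i_peak + 1] >= threshold)[0]
    return (i_min + crossing[0]) / fs
```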
Intraword position classification.
We analyzed how cortical high gamma activity varies with the context of phonemic and gestural events (i.e., coarticulation) in two subjects producing consonant–vowel–consonant words. We used the high gamma activity on each electrode individually to classify whether each consonant phoneme or gesture was the initial or final consonant in each word. The coarticulation of speech sounds means that phonemes are not consistently associated with one set of gestures across intraword positions. Therefore, we predicted that if gestures characterize the representational structure of a cortical area, the cortical activity associated with a phoneme should vary across word positions. In contrast, because gestures characterize speech movements that do not vary with context, the cortical activity associated with a gesture should itself be context invariant. Therefore, we did not expect to be able to classify the position of a gesture with better than chance accuracy. For this analysis, we included three types of gestures (closures of the tongue tip, tongue body, or lips) and their associated phonemes. To reduce the likelihood of including cortical activity related to the production of neighboring events (e.g., vowel-related phonemes or gestures) in our classification, we only used the high gamma activity immediately surrounding event onset (from 100 ms before to 50 ms after, in 25 ms time bins) to classify intraword position from individual electrodes. We classified initial versus final position using linear discriminant analysis (LDA; 10-fold cross-validation, repeated 10 times), since there were only six features for each classifier.
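A per-electrode classifier of intraword position along these lines could be sketched as follows (scikit-learn is assumed here purely for illustration; the shuffle-based chance estimate parallels the label-shuffling procedure described below for event classification).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def position_accuracy(X, y, n_splits=10, n_repeats=10, seed=0):
    """X: (n_events, 6) high gamma bins (-100 to +50 ms, 25 ms bins) from a
    single electrode; y: 0 = word-initial, 1 = word-final consonant.
    Returns mean cross-validated accuracy of initial-vs-final classification."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    return cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv).mean()

def chance_accuracy(X, y, n_shuffles=200, seed=0):
    """Empirical chance level: shuffle position labels and reclassify."""
    rng = np.random.default_rng(seed)
    return np.mean([position_accuracy(X, rng.permutation(y), n_repeats=1)
                    for _ in range(n_shuffles)])
```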
To quantify the significance (and effect size) of our results, we examined the discriminability index d′ between the accuracy (percentage correct) of phonemic or gestural position classification and chance accuracy. The d′ between two groups is defined as the difference of their means divided by their pooled SD. For example, d′ = (μg − μp)/√[(ngσg² + npσp²)/(ng + np)], where μg is the mean of gestural position accuracy, ng is the number of gesture instances minus one, σg is the SD of gesture instances, and the same symbols with subscript p stand for phonemes. Mean values of d′ were computed from electrodes that were related to the corresponding gesture type. This was determined by classifying all gestures (except larynx) from each individual electrode, using high gamma activity in 25 ms time bins from 100 ms before to 50 ms after gesture or phoneme onset as features, with classwise principal component analysis (PCA; see below). Each electrode was designated as being related to the gesture that was classified most accurately.
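A minimal helper implementing this pooled-SD definition of d′ (illustrative only):

```python
import numpy as np

def d_prime(a, b):
    """Discriminability index between two sets of accuracy values:
    difference of means divided by the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a) - 1, len(b) - 1            # instance counts minus one
    pooled_var = (na * a.var(ddof=1) + nb * b.var(ddof=1)) / (na + nb)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```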
Event classification and statistical analysis.
To obtain more detailed information about the encoding of each cortical area, we also used ECoG high gamma activity to classify which phoneme or gesture was being uttered at each event onset. We classified consonant phonemes and all gestures except for the larynx. We limited our phoneme/gesture classification analysis to consonant phonemes for two reasons. First, the TADA model assumes that the larynx (or glottis) is closed by default (Browman and Goldstein, 1992), which makes it very difficult, if not impossible, to assign meaningful onset (closure) times to this gesture, which is present in all vowels. In addition, we wished to avoid the influence of coarticulation of neighboring phonemes. Therefore, we removed vowels and /s/ phonemes, as well as the larynx-closing gesture, from the analysis. To ensure sufficient accuracy of our classification models, we included only phonemes with at least 15 instances, resulting in approximately the same number of phoneme classes as gesture classes (average of 15.2 phonemes across subjects). The phonemes most commonly included were {/p/,/b/,/m/,/f/,/d/,/t/,/n/,/l/,/r/,/g/,/k/,/v/,/j/}. In all subjects, we classified 12 gestures: lips (open, close, critical), tongue tip (open, close, critical), tongue body (open, close, critical), and velum (open, close, critical).
Due to the large number of potential features and the relatively small number of trials, we used classwise PCA (CPCA) to reduce the dimensionality of the input feature space and hence to reduce the risk of overfitting. CPCA performs PCA on each class separately, which enables dimensionality reduction while preserving class-specific information (Das and Nenadic, 2009; Das et al., 2009). For each class, the procedure chose a feature subspace consisting of all components with eigenvalues larger than the mean of the nonzero eigenvalues (Das and Nenadic, 2009). LDA was then used to determine the feature subspace with the most information about the classes. The high gamma features were then projected into this subspace, and LDA was used to classify the data (Slutzky et al., 2011; Flint et al., 2012b). We used one-versus-the-rest classification, in which one event class was specified and all events not in that class were combined into a "rest" group. We reported only the accuracy of classifying the given class (e.g., in /p/ vs the rest, we reported the accuracy of classifying the /p/ class, but not the rest class), to avoid bias in accuracy due to the imbalance in "one" and rest class sizes. We used 10-fold cross-validation with randomly selected test sets (ensuring that at least some of the target events were in each test set) to compute classification performance. We repeated the 10-fold cross-validation 10 times (i.e., reselected random test sets 10 times), for a total of 100 folds. Chance classification accuracies were determined by randomly shuffling event labels 200 times and reclassifying. We computed an overall performance for each subject as a weighted average over all events, weighting the performance of each phoneme or gesture by the probability of that phoneme or gesture in the dataset. The Wilcoxon signed-rank test was used for all statistical comparisons reported.
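The sketch below outlines this pipeline in Python/scikit-learn under our illustrative assumptions; for brevity it collapses the two LDA stages (discriminative-subspace selection and classification) into a single LDA on the pooled classwise PCA subspace, so it is a simplification of the CPCA procedure of Das and Nenadic (2009) rather than a faithful reimplementation. Under these assumptions, chance_level() mirrors the 200-shuffle estimate of chance accuracy.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold

def cpca_basis(X, y):
    """Classwise PCA: run PCA within each class and keep the components whose
    eigenvalues exceed the mean of that class's nonzero eigenvalues.
    Returns an (n_features, n_kept) projection basis (simplified sketch)."""
    comps = []
    for c in np.unique(y):
        pca = PCA().fit(X[y == c])
        ev = pca.explained_variance_
        keep = ev > ev[ev > 1e-12].mean()
        comps.append(pca.components_[keep])
    return np.vstack(comps).T

def one_vs_rest_accuracy(X, y, target, n_splits=10, n_repeats=10, seed=0):
    """Accuracy of detecting `target` events (the 'one' class only) with CPCA
    dimensionality reduction followed by LDA, using repeated 10-fold CV."""
    y_bin = (y == target).astype(int)
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    hits = []
    for train, test in cv.split(X, y_bin):
        W = cpca_basis(X[train], y_bin[train])          # class-specific subspace
        lda = LinearDiscriminantAnalysis().fit(X[train] @ W, y_bin[train])
        pred = lda.predict(X[test] @ W)
        target_idx = y_bin[test] == 1
        if target_idx.any():                            # score the 'one' class only
            hits.append((pred[target_idx] == 1).mean())
    return np.mean(hits)

def chance_level(X, y, target, n_shuffles=200, seed=0):
    """Empirical chance: shuffle event labels and re-run the classifier."""
    rng = np.random.default_rng(seed)
    return np.mean([one_vs_rest_accuracy(X, rng.permutation(y), target,
                                         n_repeats=1)
                    for _ in range(n_shuffles)])
```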
Allophone classification.
Four participants read aloud a specific set of spoken, monosyllabic words from the speech control literature that included allophones to amplify the distinction between phonemic and gestural representation in specific cortical areas (Buchwald and Miozzo, 2011). Allophones are different pronunciations of the same phoneme in different contexts within words, which reflect the different gestures being used to produce that phoneme (Browman and Goldstein, 1992). For example, consonant phonemes are produced differently when isolated at the beginning of a word (e.g., the /t/ in “tab,” which is voiceless) compared with when they are part of a cluster at the beginning of a word (e.g., the /t/ in “stab,” which is acoustically more similar to a voiced /d/; see Fig. 5A). Using word sets with differing initial consonant allophones (either CVC or consonant–consonant–vowel–consonant in organization) enabled us to dissociate more directly the production of phonemes from the production of gestures. This can be thought of as changing the mapping between groups of gestures and an allophone, somewhat analogous to limb motor control studies that used artificial visual rotations to change the mapping between reach target and kinematics to assess cortical representation (Wise et al., 1998; Paz et al., 2003).
We trained separate classifiers (CPCA with LDA, as in the prior section) for voiceless consonants (VLCs) and voiced consonants (VCs), and tested their performance in decoding both the corresponding isolated allophone (VLC or VC) and the corresponding consonant cluster allophone (CClA). For example, we built classifiers of /t/ (vs all other consonants) and /d/ (vs all other consonants) and tested them in classifying the /t/ in words starting with “st.”
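Schematically, this transfer test could be sketched as below (illustrative only; it reuses the hypothetical cpca_basis() helper from the event-classification sketch above, and the variable names are assumptions). A high score for the voiced classifier relative to the voiceless one on cluster trials would then be read as evidence of gestural rather than phonemic representation, as described above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def allophone_transfer(X, y, X_cluster, voiceless="t", voiced="d"):
    """X, y: features and phoneme labels for isolated-allophone trials.
    X_cluster: features for consonant cluster allophone trials (e.g., the /t/
    in 'st' words). Train one-vs-rest classifiers for the isolated voiceless
    and voiced allophones, then test each on the cluster trials; return the
    fraction of cluster trials each classifier labels as its target class."""
    scores = {}
    for target in (voiceless, voiced):
        y_bin = (y == target).astype(int)
        W = cpca_basis(X, y_bin)                        # fit on isolated allophones
        lda = LinearDiscriminantAnalysis().fit(X @ W, y_bin)
        scores[target] = (lda.predict(X_cluster @ W) == 1).mean()
    return scores   # e.g., {'t': VLC-classifier score, 'd': VC-classifier score}
```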
Results
We simultaneously recorded ECoG from PCG and IFG (pars opercularis) and speech audio during single-word, monosyllabic utterances by nine human participants (eight with left hemispheric recordings) undergoing functional mapping during awake craniotomies for the resection of brain tumors (Fig. 2).
Phoneme-related, but not gesture-related, cortical activity varies with intraword position
We first analyzed how high gamma activity varies with the position of phonemes and gestures within words. We found that the high gamma activity in pPCG and aPCG did not change with the intraword position of the gesture (Fig. 3A, right, examples). In contrast, when aligned to phoneme onset, high gamma activity in pPCG and aPCG did vary with intraword position (Fig. 3A, left). Figure 3B shows an example of the classification of tongue body and tongue tip closure position from all electrodes that predominantly encoded those gestures (based on single-electrode decoding of all gesture types; see Materials and Methods). Gesture classification accuracies were not larger than chance, while accuracies of classifying associated phonemes ({/k/,/g/} for tongue body and {/t/,/d/,/l/,/n/,/s/} for tongue tip) were indeed larger than chance. To quantify the accuracy of classification compared with chance over electrodes, we computed the d′ value on each electrode (Fig. 3C, examples). d′ is the difference of means (in this case, between phoneme or gesture position and chance accuracy) divided by the pooled SD (see Materials and Methods); a d′ value of >1 is considered large. We computed the mean d′ value over all electrodes in pPCG and aPCG that were modulated with lip, tongue tip, or tongue body gestures (see Materials and Methods). We found that, over all of these electrodes in both subjects, d′ was large for the associated phonemes (2.3 ± 0.6; mean ± SEM) and no different from zero for gestures (−0.06 ± 0.6). We also examined all electrodes in pPCG and aPCG, regardless of modulation, and found similar results: d′ was large for phonemes (2.7 ± 0.3) and no different from zero for gestures (0.2 ± 0.3). Thus, cortical activity for gestures did not vary with context, while cortical activity for phonemes varied substantially across contexts.
pPCG, aPCG, and IFG more accurately represent gestures than phonemes
To further investigate sublexical representation in the cortex, we used high gamma activity from eight participants to classify which phoneme or gesture was being uttered at each event onset. We first classified phonemes and gestures separately using recordings combining all precentral gyrus electrodes (pPCG/aPCG). Combined pPCG/aPCG (PCG for short) activity classified gestures with significantly higher accuracy than phonemes (63.7 ± 3.4% vs 41.6 ± 2.2%; mean ± SEM across subjects; p = 0.01; Fig. 4A). Gestural representations remained significantly dominant over phonemes after subtracting the chance decoding accuracy for each type (34.3 ± 3.4% vs 17.5 ± 2.2% above chance; p = 0.008; Fig. 4B).
M1v, PMv, and IFG have been theorized to contribute differently to speech production, movements, and preparation for speech. We therefore investigated the representation of each individual area by performing gesture and phoneme classification using the ensemble of electrodes from each cortical area, in each subject, separately. Classification performance for both types increased as the area used moved from anterior to posterior. In each area, gestures were classified with greater accuracy than phonemes (IFG: 48.8 ± 6.8% vs 39.1 ± 5.6%, p = 0.03; aPCG: 58.3 ± 3.6% vs 40.7 ± 2.1%, p = 0.016; pPCG: 62.6 ± 2.2% vs 47.3 ± 2.0%, p = 0.008; Fig. 4C). This predominance remained after subtracting chance accuracy across subjects (IFG: 17.9 ± 6.4%, p = 0.016; aPCG: 25.3 ± 12.0%, p = 0.008; pPCG: 27.7 ± 16.4%, p = 0.016; Fig. 4D). The difference was significant in pPCG and aPCG, but not in IFG, when using Bonferroni's correction for multiple comparisons. The difference in accuracy was not due to gestures having a greater incidence than phonemes (mean ± SEM, 61 ± 13 instances per phoneme vs 147 ± 44 per gesture), as significant differences remained when we performed decoding on a dataset with the maximum numbers of gesture and phoneme instances matched (data not shown). To quantify the difference further, we computed the d′ values between the accuracies of gestures and phonemes in each area. The d′ values in pPCG and aPCG were both very high (3.6 and 2.9), while that in IFG was somewhat lower (2.0), suggesting a weaker gestural predominance in IFG than in pPCG or aPCG.
Allophone classification supports predominance of gestural representations
In four participants, we used word sets emphasizing consonant allophones (voiced, voiceless, and clustered with /s/) to amplify the distinction between phonemic and gestural representations. The /t/ in "st" words was acoustically more similar to a /d/, and was produced with high gamma activity more like a /d/ in pPCG electrodes and more like a solitary initial /t/ in aPCG and IFG (Fig. 5A,B). We investigated the extent to which CClAs behaved more similarly to VLCs or to VCs in each area. If CClAs were classified with high performance by the voiceless classifier (Fig. 5C, blue rectangle), we would infer that phonemes were the dominant representation. If CClAs were classified with high performance by the voiced classifier, we would infer that gestures were the dominant representation (Fig. 5C, orange rectangle). If CClAs were classified with low performance by both classifiers (Fig. 5C, green rectangle), it would suggest that the CClAs formed a distinct category, produced differently from both the voiced and the voiceless allophones.
Cluster consonants behaved less like the phoneme and more like the corresponding gesture when moving from anterior to posterior in the cortex (Fig. 5D,E). For example, in IFG and aPCG, the CClAs behaved much more like the VLC phonemes than they did in pPCG (p = 0.6, 0.5, and 0.008, and d′ = 0.1, 0.2, and 0.4 in IFG, aPCG, and pPCG, respectively, for performance of the VLC classifier on VLCs vs CClAs). The CClAs behaved more like the VC phonemes in pPCG than in aPCG and IFG (d′ = 0.4, 0.7, and 0.3 in IFG, aPCG, and pPCG, respectively), although there was still some difference in pPCG between CClA performance and VC performance. The CClAs were produced substantially more like VC phonemes than like VLC phonemes in pPCG, which implies that pPCG predominantly represents gestures. The difference between CClAs and VC phonemes suggests that the cluster allophones may represent another distinct speech sound category.
Discussion
We investigated the representation of articulatory gestures and phonemes in precentral and inferior frontal cortices during speech production. Activity in these areas revealed the intraword position of phonemes but not the position of gestures. This suggests that gestures provide a more parsimonious, and more accurate, description of what is encoded in these cortices. Gesture classification significantly outperformed phoneme classification in pPCG and aPCG, and in combined PCG, and trended toward better performance in IFG. In more posterior areas, consonants in clusters behaved more like the consonant with which they shared gestures (voiced) than like the consonant with which they shared the same phoneme (voiceless); this relationship tended to reverse in more anterior areas. Together, these results indicate that cortical activity in PCG (M1v and PMv), but not in IFG, represents gestures to a greater extent than phonemes during production.
This is the most direct evidence of gesture encoding in speech motor cortices. This evidence supports theoretical models incorporating gestures in speech production, such as the TADA model of interarticulator coordination and the Directions-Into-Velocities of Articulators (DIVA) model (Saltzman and Munhall, 1989; Guenther et al., 2006; Hickok et al., 2011). DIVA, in particular, hypothesizes that gestures are encoded in M1v. These results also suggest that models not incorporating gestures, instead proposing that phonemes are the immediate output from motor cortex to brainstem motor nuclei, may be incomplete (Levelt, 1999; Levelt et al., 1999; Hickok, 2012b).
The phenomenon of coarticulation (i.e., phoneme production is affected by planning and production of neighboring phonemes) has long been established using kinematic, physiologic (EMG), and acoustic methods (Ohman, 1966; Kent and Minifie, 1977; Whalen, 1990; Magen, 1997; Denby et al., 2010; Schultz and Wand, 2010). Our results showing the discrimination of intraword phoneme position and differences in allophone encoding confirm the existence of phoneme coarticulation in cortical activity as well. Bouchard and Chang (2014) first demonstrated evidence of PCG representation of coarticulation during vowel production. Our results demonstrate cortical representation of coarticulation during consonant production. Some have suggested that coarticulation can be explained by the different gestures that are used when phonemes are in different contexts (Browman and Goldstein, 1992; Buchwald, 2014). Since gestures can be thought of as a rough estimate of articulator movements, our results demonstrating gesture encoding corroborate the findings of a recent study (Conant et al., 2018) of isolated vowel production showing that PCG encodes the kinematics of articulators to a greater extent than the acoustic outputs.
The use of allophones enabled us to dissociate the correlation between phonemes and gestures, since a single consonant phoneme is produced differently in the different allophones. In pPCG, the CClAs did not behave like either the VLC phonemes or the VC phonemes, though they were more similar to the VC phonemes. This suggests that the CClAs are produced differently than either VCs or VLCs. It is also possible that some features of the CClAs related to /s/ production, in the period from 300 to 200 ms before plosive onset, affected the results. Overall, these results are consistent with previous findings: before the release of the laryngeal constriction, the CClAs are hypothesized to be associated with a laryngeal gesture that is absent in VC phonemes (Browman and Goldstein, 1992; Cho et al., 2014). Thus, it is not surprising that we observed this difference in classification between CClAs and VCs (Fig. 5D). These results, therefore, still support a gestural representation in M1v as well as in PMv and IFG.
This study provides a deeper look into IFG activity during speech production. The role of IFG in speech production has to date been unclear. Classically, based on lesion studies and electrical stimulation, the neural control of speech production was described as starting in the inferior frontal gyrus, with low-level, nonspeech movements elicited in M1v (Broca, 1861; Penfield and Rasmussen, 1949). The classical view that IFG was involved in word generation (Broca, 1861) has been contradicted by more recent studies. Electrical stimulation sites causing speech arrest were located almost exclusively in the ventral PCG (Tate et al., 2014). Other recent studies have provided conflicting imaging evidence for IFG involvement in phoneme production (Wise et al., 1999), syllables (Indefrey and Levelt, 2004), and syllable-to-phoneme sequencing and timing (Gelfand and Bookheimer, 2003; Papoutsi et al., 2009; Flinker et al., 2015; Flinker and Knight, 2016; Long et al., 2016). Flinker et al. (2015) showed that IFG was involved in articulatory sequencing. The comparable classification performance for gestures and phonemes using IFG activity suggests that there is at least some information in IFG related to gesture production. While our results cannot completely address the function of IFG, due to somewhat limited electrode coverage (mainly pars opercularis) and experimental design (monosyllabic words likely limited IFG activation and classification performance somewhat), they do provide evidence for gesture representation in IFG.
These results imply that speech production cortices share a similar organization to limb-related motor cortices, despite clear differences between the neuroanatomy of articulator and limb innervation (e.g., cranial nerve compared with spinal cord innervation). In this analogy, gestures represent articulator positions at discrete times (Guenther et al., 2006), while phonemes can be considered speech targets. Premotor and posterior parietal cortices preferentially encode the targets of reaching movements (Hocherman and Wise, 1991; Shen and Alexander, 1997; Pesaran et al., 2002, 2006; Hatsopoulos et al., 2004), while M1 preferentially encodes reach trajectories (Georgopoulos et al., 1986; Moran and Schwartz, 1999), force (Evarts, 1968; Scott and Kalaska, 1997; Flint et al., 2014), or muscle activity (Kakei et al., 1999; Morrow and Miller, 2003; Cherian et al., 2013; Oby et al., 2013). This suggests that M1v predominantly represents articulator kinematics and/or muscle activity; detailed measurements of articulator positions are starting to demonstrate this (Bouchard et al., 2016; Conant et al., 2018). Although we found that gesture representations predominated over phonemic representations in all three areas, there was progressively less predominance in aPCG and IFG, which could suggest a rough hierarchy of movement-related information in the cortex (although phonemic representations can also be distributed throughout the cortex; Cogan et al., 2014). We also found evidence for the encoding of gestures and phonemes in both dominant and nondominant hemispheres, which corroborates prior evidence of bilateral encoding of sublexical speech production (Bouchard et al., 2013; Cogan et al., 2014). The homology with limb motor areas is perhaps not surprising, since Broca's area is thought to be homologous to premotor areas in apes (Mendoza and Merchant, 2014). This analogous organization suggests that observations from studies of limb motor control may be extrapolated to other parts of motor and premotor cortices.
As in limb movements, sensory feedback is important in speech production (Hickok, 2012a). However, it is unlikely that auditory or somatosensory feedback accounts for the relative representations of gestures and phonemes observed here. Motor cortical activity during listening is organized based on acoustics, rather than on articulators (Cheung et al., 2016); thus, any effect of auditory feedback would be to improve phoneme classification performance. The contribution of somatosensory feedback to this activity should be limited by the very short post-event time included in the position and allophone analyses. Overall, consistent findings across multiple types of analyses strongly favor gestural predominance. Possible sensory contributions to speech production representations are an important area for future research.
Brain–machine interfaces (BMIs) could substantially improve the quality of life of individuals who are paralyzed from neurological disorders. Just as understanding the cortical control of limb movements has led to advances in motor BMIs, a better understanding of the cortical control of speech will likely improve the ability to decode speech directly from the motor cortex. A speech BMI that could directly decode attempted speech would be more efficient than, and could dramatically increase the communication rate over, current slow and often tedious methods for this patient population (e.g., eye trackers, gaze communication boards, and even the most recent spelling-based BMIs; Brumberg et al., 2010; Chen et al., 2015; Pandarinath et al., 2017). Although we can use ECoG to identify words via phonemes (Mugler et al., 2014b), these results suggest that gestural decoding would outperform phoneme decoding in BMIs using M1v/PMv activity. The decoding techniques used here would require modification for closed-loop implementation, although signatures related to phoneme production have been used for real-time control of simple speech sound-based BMIs (Leuthardt et al., 2011; Brumberg et al., 2013). Also, the analysis of preparatory (premotor) neural activity of speech production, which our study was not designed to examine, would be important to investigate for speech BMI control. Overall, improving our understanding of the cortical control of articulatory movements advances us toward viable BMIs that can decode intended speech movements in real time.
Understanding the cortical encoding of sublexical speech production could also improve the identification of functional speech motor areas. More rapid and/or accurate identification of these areas using ECoG could help to make surgeries for epilepsy or brain tumors more efficient, and possibly safer, by reducing operative time and the number of stimuli and better defining areas to avoid resecting (Schalk et al., 2008; Roland et al., 2010; Korostenskaja et al., 2014). These results therefore guide future investigations into the development of neurotechnology for speech communication and functional mapping.
Footnotes
This work was supported in part by the Doris Duke Charitable Foundation (Clinical Scientist Development Award, Grant #2011039), a Northwestern Memorial Foundation Dixon Translational Research Award (including partial funding from National Institutes of Health (NIH)/National Center for Advancing Translational Sciences Grants UL1-TR-000150 and UL1-TR-001422), NIH Grants F32-DC-015708 and R01-NS-094748, and National Science Foundation Grant #1321015. We thank Robert D. Flint, Griffin Milsap, Weiran Wang, our EEG technologists, and our participants.
We declare no competing financial interests.
- Correspondence should be addressed to Dr. Marc W. Slutzky, Northwestern University, Department of Neurology, 303 East Superior Avenue, Lurie 8-121, Chicago, IL 60611. mslutzky{at}northwestern.edu