Abstract
To produce a word, speakers need to decide which concept to express, select an appropriate item from the mental lexicon, and spell out its phonological form. The temporal dynamics of these processes remain a subject of debate. We investigated the time course of lexical access in picture naming with electroencephalography (EEG). Thirty participants (23 female) named pictures using simple nouns. The pictures varied in conceptual category (animate or inanimate), stress pattern (first or second syllable), and the structure of the first syllable (open or closed). Using time-resolved multivariate pattern analysis (MVPA), we decoded when each of these dimensions became available during speech preparation. The results demonstrated above-chance decoding of animacy within 100 ms after picture onset, confirming early access to conceptual information. This was followed by stress pattern and syllable structure, at approximately 150 and 250 ms after picture onset, respectively. These results suggest that a word's stress pattern can be retrieved before syllable structure information becomes available. An exploratory analysis demonstrated the availability of the word-initial phoneme within 100 ms after picture onset. This result hints at the possibility that during picture naming, conceptual, phonological, and phonetic information may be accessed rapidly and in parallel.
Significance Statement
Producing spoken words is an effortless yet complex process. The mechanisms through which we retrieve and assemble the sound structure of words remain largely unknown. So far, speech production theories have mostly relied on behavioral experiments. We investigated the time course of phonological encoding during spoken word production using EEG. Results show that after rapid processing of word meaning, speakers access a word's stress pattern, followed by the composition of individual syllables. We show that subtle linguistic characteristics can be predicted before a speaker produces a word. Importantly, this study demonstrates the successful application of MVPA to pre-articulation data from EEG, a widely available method, offering an accessible approach to address novel questions in speech production research.
Introduction
Current psycholinguistic theories largely agree that the retrieval of a single word from the mental lexicon minimally requires the selection of the concept to be expressed, the selection of a unique lexical item (the lemma), and the retrieval of the corresponding phonological form (Levelt et al., 1999). However, despite extensive research, the temporal dynamics of these processes remain debated (Indefrey, 2016; Strijkers and Costa, 2016). Some behavioral and neurophysiological studies point to a cascade of processes, with the onset of lemma selection preceding the onset of phonological encoding (van Turennout et al., 1997, 1998; Maess et al., 2002; Hultén et al., 2009; Laganaro et al., 2009; Sahin et al., 2009; Carota et al., 2022; for reviews, see Indefrey, 2011 and Indefrey and Levelt, 2004). Other studies found evidence for near-simultaneous retrieval of semantic and phonological knowledge during the earliest stages of speech production planning (Strijkers et al., 2010, 2017; Miozzo et al., 2015; Riès et al., 2017; Fairs et al., 2021; Feng et al., 2021; Carota et al., 2023). These latter findings can be explained by theories assuming that semantic and phonological information is activated rapidly and simultaneously because, due to their co-occurrence, they have been bound into integrated cell assemblies during the course of language acquisition (Strijkers and Costa, 2016; Kerr et al., 2023; Pickering and Strijkers, 2024). After this parallel ignition of the “word web”, activity reverberates in local parts of the assembly, producing the distinct spatiotemporal sequences reported in previous studies (Indefrey, 2011).
In the present study, we used multivariate pattern analysis (MVPA) of EEG data to study the time course of conceptual and phonological encoding during picture naming in adult native speakers of Dutch. We presented pictures representing target words (Fig. 1) that depicted animate or inanimate concepts and varied in their phonological properties, specifically their metrical structure (first or second syllable stress) and the internal structure of the first syllable (open, CV, vs closed, CVC). Inspired by Carota et al. (2023), we also conducted an exploratory analysis decoding the manner of articulation of the word-initial phoneme (plosive vs fricative).
Target picture names used in the experiment divided over the phonological dimensions “lexical stress” (word-initial vs word-final) and “syllable structure” (open vs closed). The dimension “animacy” is indicated with bold or regular font. Words marked with an asterisk were not used in Schiller et al. (2003).
One goal of the current study was methodological, namely, to establish whether EEG data collected during a speech production task allow for time-resolved MVPA, which had previously only been performed on MEG data collected during picture naming (Carota et al., 2022, 2023). A second goal was to examine the time course of conceptual and phonological encoding, for which cascaded and parallel models make different predictions. The cascade view predicts that conceptual information should be decodable before metrical structure and phoneme information, whereas word web models allow for early, simultaneous availability of all stored information about a word.
Our third and most important goal was to study the timing of the processes within the phonological component, which has not been done with MVPA before (and rarely with other methods, but see Schiller et al., 2003). In languages with lexical stress, such as English or Dutch, the segmental structure of words (e.g., ɡ-ɪ-t-ɑ-ɹ) and their stress pattern (σ'σ) must be stored in the mental lexicon (or at least information on whether the word's stress pattern is regular or irregular). It is often assumed that segmental and metrical information are represented as separate tiers, which are retrieved in parallel and subsequently combined in the process of syllabification (Levelt et al., 1999; Cholin et al., 2006; Goldrick and Rapp, 2007; Alderete and O'Séaghdha, 2022). This combination, which leads to the grouping of strings of segments into syllable chunks, is not stored in the mental lexicon. Rather, it is context dependent and often straddles morpheme and word boundaries (e.g., se-lect vs se-lec-ting). Importantly, both traditional cascaded and word web models predict that such context-sensitive processes should follow the retrieval of stored conceptual and phonological information. Time-resolved MVPA of EEG data, recorded during speech planning, should shed light on the relative time courses of these processes.
Materials and Methods
Participants
Thirty-eight native Dutch speakers participated in the experiment after providing written, informed consent. Five participants were excluded due to technical issues and three for miscellaneous reasons (response accuracy below 75%, wrong EEG setup, and bad signal quality), leaving a final sample of 30 participants (23 females; mean age, 23.4 years; SD = 4.2 years). All participants were right-handed, had normal or corrected-to-normal vision, and reported no history of neurologic, developmental, or language deficits. Participants received €15 for participation. The study was approved by the Ethics Committee of the Radboud University Faculty of Social Sciences under registration number ECSW-2019-019.
Materials
The stimuli consisted of 48 line drawings taken from the picture database of the Max Planck Institute for Psycholinguistics or public domain repositories. Each picture corresponded to a disyllabic Dutch noun indicating either an object, an animal, or a professional figure (see Fig. 1 for a full list). All targets but the three indicated with an asterisk were selected from the 96 disyllabic items used by Schiller et al. (2003). The items were balanced across the dimensions of animacy (animate vs inanimate), lexical stress (word-initial vs word-final), and structure of the first syllable (open vs closed). We chose to investigate lexical stress and syllable structure by contrasting first versus second syllable stress and open versus closed syllables, respectively, because these features have been used to study the processing of lexical stress and syllable structure in previous work (Schiller et al., 2003). The conceptual dimension was included to enable a comparison to the time course of conceptual processing in Carota et al. (2022). The animacy feature, specifically, was selected because the Schiller et al. (2003) stimulus set contained a selection of both animate and inanimate target words. The mean lexical frequency of each item was obtained from SUBTLEX-NL (Keuleers et al., 2010), a database that provides a standard measure of word frequency independent of corpus size. Table 1 summarizes the mean Zipf frequencies per condition.
Mean word frequency and standard deviation per condition
Procedure
Participants were tested individually while seated in front of a computer screen located inside an electrically shielded, sound-attenuated booth. After being fitted with the EEG cap, participants completed a practice round in which they named eight monosyllabic targets not included in the experiment. They first reviewed PowerPoint slides containing line drawings of the practice targets with their designated names (e.g., the Dutch word “bloem” and the picture of a flower) and then produced these words in a short practice run that mirrored the timing of the main experiment, to familiarize themselves with the task. Next, they reviewed a PowerPoint presentation containing the 48 experimental pictures with their names. After reviewing the experimental targets, participants performed a round of self-paced naming with feedback from the experimenter to confirm that they had memorized the target items correctly.
Each trial began with the picture presented at the center of the screen for 450 ms, followed by a blank interval and a fixation cross of 600 ms each. Stimuli were presented using Presentation (Neurobehavioral Systems). The pictures were displayed at a size of 250 × 250 pixels on a light gray background (screen resolution, 1,920 × 1,080; refresh rate, 60 Hz). Vocal responses were recorded using a microphone placed in front of the participant and digitized at a sampling rate of 44.1 kHz. Every target picture was presented 10 times, amounting to a total of 480 trials divided into 10 blocks of 48 trials each. Since previous work has shown that naming latencies stabilize from the second repetition onward (van Turennout et al., 2003; Corps and Meyer, 2025), the first two naming cycles were excluded from the analyses. Each block contained all 48 items in randomized order. Participants took short self-paced breaks in the middle and at the end of each block. The breaks typically lasted less than 15 s, and no participant paused for more than 1 min. At the end of the second naming block (i.e., during the fourth break), the experimenter verbally corrected non-target productions (e.g., please use “portrait” instead of “painting”). After this, participants received no further feedback for the remainder of the experiment. Altogether, the training and the actual experiment lasted approximately 20 min.
EEG recordings
Participants were instructed to focus on the center of the screen and to avoid blinking and body movements. The EEG signal was recorded at 1,000 Hz from 62 scalp electrodes using a 64-channel BrainAmp amplifier. The two remaining channels were connected to electrodes placed on the left mastoid (channel 10) and below the left eye (channel 64). During acquisition, all channels were referenced to the right mastoid, and an electrode placed on the forehead served as the ground. To ensure symmetry, channel 21 was placed at position FCz instead of position TP10.
Data analysis
As part of the picture naming paradigm, onset latencies and accuracy data were acquired. These behavioral variables mainly served as control measures, to ascertain that any EEG effects were not driven by differences in processing speed or accuracy across conditions. We expected slightly faster naming latencies for animate than for inanimate words (Sá-Leite et al., 2021). In addition, we expected that words with first-syllable stress would be named faster than those with second-syllable stress (Schiller, 2006).
Behavioral data
The data from the first two blocks were considered warming-up trials and were excluded from the analyses. The remaining trials were manually tagged as either correct or incorrect. Incorrect trials were any trials where participants produced words that did not correspond exactly to the target name introduced during the familiarization phase, trials including filled pauses and/or repairs, trials where no verbal response was produced, and trials where the target word was not completed by the end of the trial (i.e., within 1,650 ms after picture onset). When a response straddled a trial boundary, the following trial was also excluded. Participants with response accuracy below 75% were excluded from further analyses (1 participant). Response latencies for correct responses were calculated offline by subtracting the time of picture onset from the time of speech onset. The onset and offset of articulation were identified semiautomatically in Praat (Boersma and Weenink, 2024) and manually corrected where needed.
The effect of the experimental variables on response accuracy and onset latencies (log-transformed) was assessed with (generalized) linear mixed models (GLMMs) conducted using the lme4 package (Bates et al., 2015) in R (R Core Team, 2024). The experimental factors were sum-coded, and we aimed to include a maximal random effects structure (Barr et al., 2013). In case of convergence issues, the correlation between random intercepts and slopes was removed first, followed by the random slope that accounted for the least amount of variance, until the model converged. As a result of this procedure, the random effects structures differed slightly across models, yet each model contained the maximal random effects structure that still converged. For both dependent variables, the full model included fixed effects for animacy, lexical stress, and syllable structure. For response accuracy, the random effects structure included by-participant and by-item random intercepts as well as by-participant random slopes for animacy and syllable structure, with the correlations between random intercepts and slopes omitted. For onset latencies, the random effects structure included by-participant random slopes for animacy, lexical stress, and syllable structure.
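To make the model specifications concrete, the sketch below restates the two confirmatory models in Wilkinson formula notation. The authors fit their models with lme4 in R; for consistency with the other code sketches in this article, we show an equivalent specification using MATLAB's fitglme and fitlme, whose formula syntax mirrors lme4's. The table and variable names (tbl, correct, logRT) are hypothetical, and this is a reconstruction of the reported random effects structures, not the authors' code.

```matlab
% Hedged sketch: equivalents of the reported (G)LMMs in MATLAB's
% fitglme/fitlme (Statistics and Machine Learning Toolbox). tbl is a
% hypothetical table with sum-coded factors and one row per trial.

% Response accuracy: logistic GLMM with by-participant and by-item
% intercepts plus uncorrelated by-participant slopes for animacy and
% syllable structure (the "- 1" terms drop the intercept-slope correlation).
glmeAcc = fitglme(tbl, ...
    ['correct ~ animacy + stress + structure + (1|participant)' ...
     ' + (animacy - 1|participant) + (structure - 1|participant) + (1|item)'], ...
    'Distribution', 'Binomial', 'Link', 'logit');

% Onset latencies: LMM on log-transformed latencies with by-participant
% slopes for all three factors and a by-item intercept.
lmeRT = fitlme(tbl, ...
    ['logRT ~ animacy + stress + structure' ...
     ' + (1 + animacy + stress + structure|participant) + (1|item)']);
```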
To assess the effect of manner of articulation, we ran further, exploratory models that included animacy, lexical stress, and syllable structure, with manner of articulation as an additional fixed effect. We analyzed manner of articulation in separate models so that the main, confirmatory model contained only the predictors that had been originally planned. In the exploratory model of response accuracy, the random effects structure included by-participant and by-item random intercepts as well as by-participant random slopes for all predictors, with the correlations between random intercepts and slopes omitted. In the exploratory model of onset latencies, the random effects structure included by-item and by-participant random intercepts and by-participant random slopes for animacy, lexical stress, and manner of articulation. Statistical inference was performed according to the z-value (response accuracy) or t-value (onset latencies) associated with the fixed effects parameter estimates. Values larger than 1.96 or smaller than −1.96 were considered statistically significant.
EEG data preprocessing
The EEG data were preprocessed in MATLAB (The MathWorks Inc., 2024) using the FieldTrip toolbox (Oostenveld et al., 2011). Prior to segmentation, the continuous EEG signals were cleaned of line noise using a discrete Fourier transform filter and bandpass filtered in the 0.1–200 Hz range to minimize temporal distortions (Tanner et al., 2015). The data were then epoched into segments from −100 to 1,500 ms relative to picture onset. Following the rejection of trials with non-target productions, non-EEG channels were discarded and the data were re-referenced to the average of the 62 scalp electrodes. Because MVPA is more robust to artifacts than ERP analyses (Carlson et al., 2020), standard artifact rejection (e.g., ICA or manual rejection of trials contaminated by eyeblinks or muscle artifacts) was not performed.
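As a concrete illustration, this pipeline can be sketched in FieldTrip as follows. The dataset name, trigger codes, and variable names are placeholders rather than the authors' actual values; this is a minimal sketch, not the published analysis code.

```matlab
% Hedged FieldTrip sketch of the preprocessing described above.
cfg           = [];
cfg.dataset   = 'sub01.vhdr';       % placeholder BrainVision header file
cfg.dftfilter = 'yes';              % discrete Fourier transform filter for line noise
cfg.bpfilter  = 'yes';
cfg.bpfreq    = [0.1 200];          % bandpass range reported above
data_cont     = ft_preprocessing(cfg);

% Epoch from -100 to 1,500 ms relative to picture onset.
cfg                     = [];
cfg.dataset             = 'sub01.vhdr';
cfg.trialdef.eventtype  = 'Stimulus';
cfg.trialdef.eventvalue = 'S  1';   % placeholder picture-onset marker
cfg.trialdef.prestim    = 0.1;
cfg.trialdef.poststim   = 1.5;
cfg        = ft_definetrial(cfg);
data_epoch = ft_redefinetrial(cfg, data_cont);

% Discard non-EEG channels and re-reference to the common average
% of the 62 scalp electrodes.
cfg            = [];
cfg.channel    = 'EEG';             % keep scalp channels only
cfg.reref      = 'yes';
cfg.refchannel = 'all';
data_reref     = ft_preprocessing(cfg, data_epoch);
```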
Multivariate EEG analyses
To assess whether the experimental manipulations could be decoded from the EEG signal, we performed within-participant MVPA using the linear discriminant analysis (LDA) implemented in the MVPA-Light toolbox (Treder, 2020). Decoding performance was assessed using the fraction of class labels predicted correctly (i.e., classification accuracy). Trials were classified according to their animacy, lexical stress, syllable structure, or word-initial phoneme. Because each variable of interest defined a binary classification task (animate/inanimate, word-initial/word-final, open/closed, plosive/fricative), the theoretical chance accuracy was always 0.5.
In the first set of analyses, we performed (1) time-resolved MVPAs to investigate classification performance across time and (2) searchlights across time and channels to identify which electrodes drove classification performance at different times in the trial. In the second set, we performed time × time generalizations to investigate whether the neural code supporting above-chance decoding in the first set of analyses was stable or dynamically evolving (King and Dehaene, 2014). The resulting temporal generalization matrix represents the decoding performance of a classifier trained at one time point of the trial and tested at all the other time points of the trial. Square generalization matrices indicate that the neural code supporting above-chance decoding is stable (i.e., the neural representations are sustained over time), while diagonal matrices indicate that the classifier can generalize only over a transient period of time, suggesting the neural representations are dynamically evolving (for a review, see King and Dehaene, 2014).
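A minimal MVPA-Light sketch of both analysis types is given below. X and clabel are hypothetical variable names for the [trials × channels × time points] data array and the vector of class labels; the cfg fields follow the toolbox's documented interface.

```matlab
% Hedged MVPA-Light sketch (Treder, 2020). X: [trials x channels x time];
% clabel: vector of class labels (e.g., 1 = animate, 2 = inanimate).
cfg            = [];
cfg.classifier = 'lda';
cfg.metric     = 'accuracy';
cfg.cv         = 'kfold';
cfg.k          = 5;      % 5-fold cross-validation (detailed below)
cfg.repeat     = 5;      % 5 repetitions with fresh random folds

% (1) Time-resolved decoding: a separate classifier per time point.
[acc, result] = mv_classify_across_time(cfg, X, clabel);

% (2) Temporal generalization: train at each time point, test at all
% others, yielding a time x time accuracy matrix.
[accGen, resultGen] = mv_classify_timextime(cfg, X, clabel);
```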
In both sets of analyses, the data points included in a 75 ms window centered around a given time point were considered “time” neighbors of that specific time point. In the searchlights across time and channels, we also calculated the pairwise distances between electrodes using the information contained in layout.pos (see FieldTrip's acticap-64ch-standard2.mat). All electrodes closer to each other than 0.0001 units in the resulting distance matrix were considered neighbors (∼2–3 neighbors per electrode; see the sketch below). Including neighbors increases the dimensionality of the feature space, thus providing more information to the classifier.
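The distance-based neighbor definition can be sketched in a few lines of MATLAB. Here layout is assumed to be the structure loaded from acticap-64ch-standard2.mat, and pdist requires the Statistics and Machine Learning Toolbox; the threshold mirrors the value reported above.

```matlab
% Hedged sketch of the channel-neighbor definition. layout.pos holds the
% 2-D electrode positions from FieldTrip's acticap-64ch-standard2.mat.
D  = squareform(pdist(layout.pos));   % pairwise inter-electrode distances
nb = D < 0.0001;                      % neighbors: closer than 0.0001 units
% nb is a [channels x channels] logical matrix (each electrode counts as
% its own neighbor, plus ~2-3 others) that is passed to the searchlight,
% so each classifier sees a small cluster of adjacent electrodes.
```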
Increasing the dimensionality of the feature space, however, also makes the classifier more susceptible to overfitting (Treder, 2020). To control for this risk, classification performance was evaluated using k-fold cross-validation with k = 5. In this technique, the data are randomly split into k equal folds; the model is trained on k − 1 folds and tested on the remaining fold, and this is repeated k times so that each fold serves as the test set exactly once. The resulting k performance metrics are then averaged to obtain a stable estimate of model performance. The entire cross-validation process was repeated five times using newly assigned random folds to reduce bias caused by the way the data were partitioned. This procedure yields a robust estimate of model performance, computed as the average of the accuracies obtained across the rounds of cross-validation (Carota et al., 2022).
Prior to training, randomly selected samples from the majority class were discarded until each class was represented by an equal number of samples; this avoided classification bias due to class imbalances caused by trial rejection. The data were then z-scored using parameters obtained solely from the training data to avoid information transfer between the training and test sets (nested preprocessing). Although z-score normalization may not always be necessary or helpful, we included it to guard against the possibility that relatively noisy channels would drive the multivariate findings (Ashton et al., 2022). In the final step of the MVPA preprocessing pipeline, we binned 16 z-scored trials from the same class into a single pseudo-trial to downsample the data and simultaneously increase the signal-to-noise ratio (Treder, 2020). Statistical significance of classification accuracies against chance (i.e., right-tailed tests against the chance level of 0.5) was established using non-parametric permutation tests with 1,000 repetitions and cluster-based correction for multiple comparisons (cluster-defining threshold = 1.96, α = 0.01; Maris and Oostenveld, 2007). The analysis scripts and the results of the MVPA analyses have been deposited on the Open Science Framework (https://osf.io/nvgwx/).
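In MVPA-Light, these steps can be expressed as a nested preprocessing pipeline followed by a cluster-corrected permutation test. The sketch below is our reconstruction under the assumption that the toolbox's documented parameter names apply; it should be checked against the deposited analysis scripts rather than taken as the authors' verified code.

```matlab
% Hedged sketch of the MVPA preprocessing pipeline and statistics in
% MVPA-Light; parameter names follow the toolbox documentation but are
% assumptions here. X and clabel as in the earlier sketch.
cfg            = [];
cfg.classifier = 'lda';
cfg.metric     = 'accuracy';
cfg.cv         = 'kfold';
cfg.k          = 5;
cfg.repeat     = 5;
% Nested preprocessing, estimated on the training folds and applied to
% the test fold: undersample the majority class, z-score, then bin 16
% same-class trials into one pseudo-trial.
cfg.preprocess       = {'undersample', 'zscore', 'average_samples'};
cfg.preprocess_param = {struct(), struct(), struct('group_size', 16)};
[acc, result] = mv_classify_across_time(cfg, X, clabel);

% Cluster-based permutation test against the chance level of 0.5.
scfg                = [];
scfg.test           = 'permutation';
scfg.n_permutations = 1000;
scfg.correctm       = 'cluster';
scfg.clustercritval = 1.96;   % cluster-defining threshold
scfg.alpha          = 0.01;
scfg.tail           = 1;      % right-tailed test against chance
stat = mv_statistics(scfg, result, X, clabel);
```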
Results
Behavioral results
The 30 participants included in the analyses produced the target word fluently and within the allotted time frame in 89.8% of the analyzed trials (Table 2). In the remaining trials, they produced verbal disfluencies (3.2%), trial overlaps (2.6%), incomplete responses (2%), incorrect targets (1%), or failed to produce a response (1.4%). In the analysis of response accuracy, there was a main effect of lexical stress (β = −0.472, SE = 0.151, z = −3.129; Table 3). On average, participants named targets with word-initial stress more accurately than targets with word-final stress (92% vs 87.6%). The mean speech onset latency across conditions was 636 ms after picture onset (SD = 54 ms). The average word duration was 530 ms (SD = 61 ms). There was a main effect of animacy on speech onset latencies (β = 0.068, SE = 0.026, t = 2.615; Table 4): on average, participants produced inanimate targets 33 ms earlier than animate ones (620 ms vs 653 ms after picture onset). Finally, the exploratory analysis revealed no significant effect of manner of articulation on response accuracy. The onset latency analysis showed that targets with word-initial fricatives were produced faster than targets starting with plosives (606 ms vs 653 ms after picture onset; β = −0.079, SE = 0.027, t = −2.916). The model results from the analyses that included manner of articulation are presented in Tables S1 and S2.
Accuracies and onset latencies per category
Fixed effects from generalized linear mixed effects model of response accuracy
Fixed effects from linear mixed effects models of response latencies
Multivariate pattern analysis
Time-resolved MVPAs
The classifier distinguished between animate and inanimate targets significantly above chance from 78 ms after picture onset onward (Fig. 2A). At ∼150 ms post stimulus, classification accuracy increased from below 60% to over 70%. Between 600 and 900 ms post picture onset, decoding accuracies gradually decreased before plateauing at ∼60% for the remainder of the trial. The lexical stress (Fig. 2B) and syllable structure (Fig. 2C) of the target word started to be differentiated significantly above chance at 158 and 253 ms after picture onset, respectively. The decoding of lexical stress peaked around the average speech onset (rising from 55 to 75% between 400 and 700 ms post stimulus) and remained above chance throughout word production as well as after the average speech offset. The decoding accuracy of syllable structure remained above chance throughout word production (at ∼65% accuracy). Unlike lexical stress, however, classification of syllable structure showed neither a peak at the average speech onset nor stable above-chance accuracy after the average speech offset. Lastly, in the exploratory analysis, the manner of articulation of the first phoneme was differentiated significantly above chance between 91 and 139 ms after picture onset as well as between 428–1,151 ms and 1,219–1,313 ms post stimulus (Fig. 2D).
Time courses of decoding accuracy for the experimental variables (A) animacy, (B) lexical stress, (C) syllable structure, and (D) manner of articulation. Lines depict mean classification accuracy across participants, with bold segments marking above-chance decoding accuracy; the shaded areas around the lines correspond to the standard error. The dashed horizontal line indicates chance-level decoding accuracy (0.5). Solid vertical lines indicate picture onset. Dashed vertical lines indicate the average onset and offset of speech, with shaded areas representing ±1 standard deviation around these onset and offset times.
Searchlights across time and channels
In the interest of completeness, we report the results of the searchlights across time and channels, but, given the limited spatial resolution of EEG, our discussion will focus on the analyses in the time domain. Above-chance decoding of animacy was driven by electrodes located over occipital and right centro-parietal areas during the earliest stages of speech production planning (0–100 ms post picture onset) and by all the electrodes considered in the analysis in subsequent time windows (Fig. 3A). In the same early time window (0–100 ms after picture onset), the classification of lexical stress was driven by one left frontal and several left occipito-parietal electrodes (Fig. 3B). Between 100 and 300 ms post stimulus, classification performance was driven by electrodes located bilaterally over occipito-parietal and frontal areas, and between 300 and 500 ms after picture onset by the vast majority of the electrodes considered in the analysis. Classification of syllable structure was driven by a single right fronto-temporal electrode between 0 and 100 ms post stimulus (Fig. 3C) and by electrodes located over centro-parietal and fronto-temporal regions in later time windows. Above-chance decoding of manner of articulation was driven by a single occipital electrode between 0 and 100 ms after picture onset, by electrodes located over occipito-parietal areas between 100 and 300 ms post stimulus, and by electrodes located over occipito-parietal and left fronto-temporal areas between 300 and 500 ms after picture onset (Fig. 3D).
Topographic maps displaying the electrodes driving decoding accuracy for the experimental variables (A) animacy, (B) lexical stress, (C) syllable structure, and (D) manner of articulation across three time windows (0–100 ms, 100–300 ms, and 300–500 ms after picture onset). The color bar indicates decoding accuracy (red, above-chance; blue, below-chance). Statistically significant points are marked with an asterisk.
Time × time generalizations
During the earliest stages of speech production planning (0–150 ms post stimulus), the temporal generalization matrix of the animacy classifier (Fig. 4A) was extremely narrow, suggesting the neural code supporting above-chance decoding in this time window is dynamically evolving (King and Dehaene, 2014). Between 150 and 900 ms after picture onset, the animacy classifier generalized across time windows of ∼300 ms, suggesting the neural representation of animacy stabilizes around the same time as decoding accuracies showed a steep increase (Fig. 2A). From 900 ms post stimulus onward, the temporal generalization matrix revealed widespread generalization across a time window approximately twice as large as the ones observed between 150 and 900 ms post picture onset. The transition from diagonal to sustained indicates that the neural code supporting above-chance classification during this time window is different from the one observed between 150 and 900 ms after picture onset. This pattern appears to be confirmed by the lower decoding accuracies observed from 900 ms after picture onset onward (Fig. 2A).
Generalizations across time for the experimental variables (A) animacy, (B) lexical stress, (C) syllable structure, and (D) manner of articulation. Solid lines indicate picture onset. Dashed lines indicate the average onset and offset of speech. The color bar indicates decoding accuracy (red, above-chance; blue, below-chance). Accuracy values are masked by statistical significance (significant decoding accuracies fall within the bold line). Temporal generalization matrices represent the decoding performance of a classifier trained at one time point of the trial and tested at all the other time points of the trial. Square generalization matrices indicate that the neural code supporting above-chance decoding is stable (i.e., the neural representations are sustained over time), while diagonal matrices indicate that the classifier can generalize only over a transient period of time, suggesting the neural representations are dynamically evolving (King and Dehaene, 2014).
The temporal generalization of lexical stress revealed three distinct areas of generalization (Fig. 4B). The first square matrix emerged between 200 and 600 ms after picture onset, suggesting that the neural code supporting above-chance decoding during this time window is stable. The second generalization window spanned from 400 to 800 ms post stimulus, partially overlapping with the first generalization window, but extending further in time. This transitional phase marks a shift in the neural representation of lexical stress, as corroborated by the gradual increase in decoding accuracies starting at 400 ms after picture onset (Fig. 2B). This shift may reflect the transition from speech planning to the execution of actual speech, with the earliest speech onset occurring 396 ms post picture onset. The third and most extensive generalization window occurred between 1,000 and 1,500 ms post picture onset.
The temporal generalization of syllable structure revealed two distinct patterns of neural activity (Fig. 4C). Between 200 and 600 ms after picture onset, the generalization matrix was predominantly diagonal, suggesting that the neural code supporting above-chance decoding during this time window was dynamically evolving. At 600 ms post picture onset, the generalization matrix transitioned from diagonal to sustained. This shift coincides with the average speech onset and may reflect the transition from the neural processes associated with syllabification to those supporting overt speech production.
Finally, an exploratory analysis showed that the classifier decoding manner of articulation of the first phoneme generalized significantly above chance from 300 ms after picture onset (Fig. 4D). The generalization matrix was predominantly diagonal, suggesting that the neural code supporting above-chance decoding in this time window was dynamically evolving.
Discussion
We investigated the temporal dynamics of lexical retrieval in picture naming using MVPA of EEG data. We had three main aims. The first and second were to examine the application of EEG-based MVPA for speech production research and to use this method to compare the time course of conceptual and phonological processing. Our third and most important goal was to use EEG-MVPA to investigate the timing of two processes within the phonological domain: the retrieval of a word's stress pattern and syllable structure. Conceptual preparation for picture naming was indexed by an animacy distinction, while the phonological processes of metrical encoding and syllabification were linked to the manipulation of lexical stress and syllable structure, respectively. Inspired by recent work, we also conducted an exploratory analysis decoding the phonetic characteristics of the word-initial phoneme.
Concerning the first goal, our results clearly demonstrate that time-resolved MVPA of EEG data can be used to obtain information about the time course of lexical access in speech planning. This is important because purely behavioral methods do not suffice to study the time course of the fast processes occurring during speech planning and because researchers generally have easier access to EEG than MEG facilities. Thus, we see EEG-MVPA as a promising tool for language production research and an exciting addition to the methodological repertoire available in the field (van der Burght et al., 2023).
Our second goal was to compare the time course of conceptual and phonological processing during word production planning. A comparison of the decoding time courses of the conceptual variable and the two phonological variables (metrical stress and syllable structure) showed that conceptual processing preceded phonological processing. Specifically, classification of animacy demonstrated two early peaks, first around 75–100 ms after picture onset and then again at 150 ms post stimulus onset. Further support for this two-stage process comes from the temporal generalization matrix of the animacy classifier (see Materials and Methods for the interpretation of temporal generalization matrices; King and Dehaene, 2014). Generalizability became statistically significant between 75 and 150 ms after picture onset. During this period, however, the classifier could not generalize to other time points of the speech preparation time course (i.e., the matrix is diagonal), suggesting that classification performance was driven by multiple, dynamically evolving neural representations. At around 150 ms after picture onset, the temporal generalization matrix sharply transitioned from diagonal to sustained, likely reflecting the selection of the specific lexical item associated with the input. In sum, the processing of animacy occurred dynamically, with its onset preceding the earliest decodability of either phonological variable.
Our third goal was to examine the relative time course of metrical encoding and syllabification. As explained in the Introduction, Levelt et al. (1999) proposed that syllabification should follow metrical retrieval. We found that the decodability of lexical stress became statistically significant at around 160 ms after picture onset, shortly after the decodability of conceptual information but before the decodability of syllable structure at around 250 ms after picture onset. In the temporal generalization matrix, the decodability of lexical stress transitioned from diagonal to sustained around the time the decodability of syllable structure became statistically significant (i.e., ∼200 ms after picture onset). This result indicates that the neural representation supporting metrical encoding stabilizes around the onset of above-chance decoding of the structure of the first syllable, suggesting that information regarding a word's lexical stress is available before the onset of syllabification. The results support the notion that metrical encoding precedes syllabification as predicted by Levelt et al. The time course of metrical encoding and syllabification had, to our knowledge, only been directly assessed in an earlier event-related potential (ERP) study using a complex, metalinguistic judgement task (Schiller et al., 2003). The authors found no evidence for the sequentiality of metrical encoding and syllabification, but, as they note, their paradigm might not have been sensitive enough to detect small time differences in the onset of the two processes. As a further explanation for the discrepancy between the Schiller et al. results and ours, time-resolved MVPA may offer increased sensitivity as compared with ERPs (Carlson et al., 2020).
Our results invite further research along a number of lines. For instance, the two stress patterns used in the current study differ in frequency: the majority of Dutch disyllabic nouns have word-initial stress, and so, effectively, the manipulation of first versus second syllable stress simultaneously varied regular (default) versus irregular stress patterns. It is therefore unclear whether the classifier distinguished the actual stress patterns or merely frequent versus infrequent ones. Alternatively, decoding accuracy could reflect discrimination between a stress pattern that can be derived by rule (as the default pattern) and one that needs to be stored lexically (for a discussion of the storage versus computation of Dutch stress patterns, see Schiller et al., 2004; Schiller, 2006).
Finally, inspired by recent results (Strijkers et al., 2017; Fairs et al., 2021; Carota et al., 2023), we also conducted exploratory analyses to examine whether we could decode phonetic properties of the initial phoneme of the target words (plosive vs fricative). We found that the decodability of word-initial phonetic information peaked at approximately 125 ms after picture onset, returned to chance-level accuracy until 420 ms post stimulus, and then peaked again around the average speech onset. The late effect can be easily explained within cascaded models of lexical access, which predict that phonetic information should become available only after phonological encoding (Indefrey, 2011). The early effect, however, does not sit well with these models. Yet, similar patterns of results have recently been reported and have been seen as supporting “word web” models (Strijkers and Costa, 2016; Kerr et al., 2023; Pickering and Strijkers, 2024). Specifically, a previous MEG study showed that word frequency and the place of articulation of the word-initial phoneme (labial vs coronal) elicited stimulus-specific fronto-temporal activity between 160 and 240 ms after picture onset (Strijkers et al., 2017). Similarly, Fairs et al. (2021) found that lexical and phonetic variables (lexical frequency and phonotactic frequency, respectively) elicited early, parallel ERP modulations. In an exploratory analysis, Carota et al. (2023) showed that manner of articulation could be decoded from MEG source data in early time windows.
Note, however, that our results concerning the early decoding of phonetic information stem from post hoc analyses and therefore need to be replicated before any firm conclusions are drawn. There are two additional reasons that warrant a replication of this result. First, unlike for the planned contrasts, the significant classification accuracy of the phonemic contrast was not corroborated by a significant temporal generalization matrix during the earliest stages of speech production planning. We note, however, that the early decodability of manner of articulation was rather short-lived (∼50 ms) and the training dataset was smaller (240 trials vs 384 in the other classifiers), so the early effect observed in the time-resolved MVPA might simply be too weak or short-lived to form a statistically significant cluster in the time × time plot (Maris and Oostenveld, 2007). Second, to ensure that naming was accurate and to achieve sufficient power, our training and experiment involved a considerable number of repetitions of each target. It is conceivable that after a certain number of repetitions, the response was automatically evoked, and this may have contributed to the early phonetic effect.
To summarize, our results suggest that syllabification is initiated after the onset of conceptual and metrical encoding. Both cascading and parallel activation models can account for this result. As discussed in the Introduction, Levelt et al. argued that the syllabification of a string of segments is a context-dependent operation that is computed online after metrical retrieval rather than stored in the mental lexicon. Parallel activation models, too, suggest that context-dependent processes are computed after the initial activation of lexical information (Pickering and Strijkers, 2024). Our exploratory analysis, on the other hand, tentatively suggests that some phonetic information about picture names may become activated rapidly after picture onset. This result, while in need of replication, hints at the possibility that conceptual and phonetic information might be accessed in parallel during the initial activation of lexical information, as suggested by parallel activation models (Strijkers and Costa, 2016; Pickering and Strijkers, 2024).
In conclusion, our findings demonstrate that EEG-MVPA can decode distinct conceptual and phonological processes during speech production planning. Our results showed that, following conceptual information (i.e., animacy), a word's stress pattern can be retrieved before a complete representation of the syllable structure is computed. Finally, we obtained preliminary evidence suggesting that some phonetic information about picture names may become rapidly available, in parallel with conceptual information.
Footnotes
We thank Birgit Knudsen for help with data collection and Geert van der Meulen for help with the figure design. The work was supported by the Max Planck Society.
The authors declare no competing financial interests.
This paper contains supplemental material available at: https://doi.org/10.1523/JNEUROSCI.0546-25.2025
Correspondence should be addressed to Giulia Li Calzi at giulia.licalzi@uzh.ch.










