Abstract
To understand language, we need to recognize words and combine them into phrases and sentences. During this process, responses to the words themselves are changed. In a step toward understanding how the brain builds sentence structure, the present study concerns the neural readout of this adaptation. We ask whether low-frequency neural readouts associated with words change as a function of being in a sentence. To this end, we analyzed an MEG dataset by Schoffelen et al. (2019) of 102 human participants (51 women) listening to sentences and word lists, the latter lacking any syntactic structure and combinatorial meaning. Using temporal response functions and a cumulative model-fitting approach, we disentangled delta- and theta-band responses to lexical information (word frequency) from responses to sensory and distributional variables. The results suggest that delta-band responses to words are affected by sentence context in time and space, over and above entropy and surprisal. In both conditions, the word frequency response spanned left temporal and posterior frontal areas; however, the response appeared later in word lists than in sentences. In addition, sentence context determined whether inferior frontal areas were responsive to lexical information. In the theta band, the amplitude was larger in the word list condition at ∼100 ms after word onset in right frontal areas. We conclude that low-frequency responses to words are changed by sentential context. The results of this study show how the neural representation of words is affected by structural context and as such provide insight into how the brain instantiates compositionality in language.
SIGNIFICANCE STATEMENT Human language is unprecedented in its combinatorial capacity: we are capable of producing and understanding sentences we have never heard before. Although the mechanisms underlying this capacity have been described in formal linguistics and cognitive science, how they are implemented in the brain remains to a large extent unknown. A large body of earlier work from the cognitive neuroscientific literature implies a role for delta-band neural activity in the representation of linguistic structure and meaning. In this work, we combine these insights and techniques with findings from psycholinguistics to show that meaning is more than the sum of its parts; the delta-band MEG signal differentially reflects lexical information inside and outside sentence structures.
- combinatorial processing
- lexical processing
- sentence comprehension
- surprisal
- temporal response functions
- word frequency
Introduction
During language comprehension, listeners recognize words, retrieve stored information about them, and use this knowledge to combine the words into phrases and sentences. Psycholinguistic experiments have long shown that the behavioral responses to words change under the influence of the syntactic and sentential context that the words appear in (Marslen-Wilson and Welsh, 1978; Tyler and Wessels, 1983; Katz et al., 1987). In a step toward understanding how the brain builds sentence structure, the present study concerns the neural readout of this process. We ask (1) whether low-frequency neural readouts associated with words systematically change as a function of being or not being in a sentence context and (2) whether neural readouts are modulated by purely lexical properties over and above sensory and distributional variables. We do this by contrasting MEG responses to words in sentences with word lists, the latter lacking any syntactic structure or coherent lexical and combinatorial meaning.
In psycholinguistic models, language comprehension is instantiated as a cascaded process in which information can flow bidirectionally (Marslen-Wilson and Welsh, 1978; Martin, 2016, 2020). Put simply, this means that speech sounds cue stored representations of words, and while the next words are being recognized, the retrieved information about words cues representations of phrase and sentence structure. At the same time, the already formed representations of sentences, phrases, and words cue lower-level representations: the information flows in two directions (Schoffelen et al., 2017).
As words are being combined into phrases and sentences, then, responses to words change as a consequence of the top-down information flow. Indeed, a long tradition of research in psycholinguistics has shown that words in sentences are recognized faster than those same words appearing in isolation (Marslen-Wilson and Welsh, 1978; Tyler and Wessels, 1983). This effect is so powerful that it reduces effects of properties of the words themselves, such as word frequency. In isolation, highly frequent words are recognized faster than low-frequency words. In sentence context, this effect tends to be reduced: low-frequency words are recognized faster in sentence context than in isolation, although there is little change in recognition times for the high-frequency words (Schuberth and Eimas, 1977; Simpson et al., 1989).
To gain a full understanding of human sentence comprehension, those in the field currently face the challenge of integrating these findings with knowledge of neural processing. Although previous studies provide insight into the neural correlates of sentence structure (Ding et al., 2016; Meyer et al., 2017; Nelson et al., 2017; Ding et al., 2018; Brennan and Martin, 2020; Kaufeld et al., 2020; Bai et al., 2022; Coopmans et al., 2022; Tavano et al., 2022; ten Oever et al., 2022), much about the process of building these structures remains unknown (ten Oever et al., 2022). Furthermore, although we know that the neural signal is sensitive to lexical information (Brodbeck et al., 2018a,b; Armeni et al., 2019; Weissbart et al., 2020; Heilbron et al., 2021) we do not know how neural responses to words are transformed in the process of comprehension.
In this study, therefore, we aim to add to our understanding of how the brain leverages linguistic information when building sentence structure by finding a neural readout of the context effect on responses to words above and beyond statistical predictability effects as quantified through entropy and surprisal. To this end, we analyzed a published MEG dataset by Schoffelen et al. (2019) of participants listening to sentences and word lists. Despite these conditions being the main experimental manipulation in this open dataset, they have not previously been directly compared. Using temporal response functions (TRFs), we disentangled delta- and theta-band responses to individual words from responses to the speech envelope and word onsets, as well as entropy and surprisal. This method allowed us to model any differences between the conditions that go beyond our difference of interest (structured/unstructured), and, as such, control for them. We compared the responses to individual words between word lists and sentences. Any differences between the lexical responses in these conditions reflect the effect of structure building on the processing of words.
The lexical response was modeled using word frequency. We chose this feature because word frequency is a proxy for the likely familiarity of the listener with the word and relatedly of ease of processing. Any modulation as a consequence of word frequency, therefore, captures the presence of word identity information in the signal. Furthermore, word frequency is unigram; in other words, it does not depend on the context. Therefore, the value corresponding to a given word is the same in a sentence and a word list. Differences between the neural readout of both conditions will therefore be because of the sentence context supplying structure and meaning and not the predictor itself.
We hypothesized that the delta-band responses to word frequency would be different in word lists and sentences as a consequence of the (in)availability of sentence context (Huizeling et al., 2022; Meyer, 2018; Meyer et al., 2020a,b). Studies that investigated the presence of lower-level features in the neural signal as a function of the availability of linguistic information suggest that lower-level features are represented by the delta-band neural signal more reliably when higher-level information is available. For example, mutual information between the speech signal and the neural signal is higher in the presence of structure and meaning (Kaufeld et al., 2020; Coopmans et al., 2022; ten Oever et al., 2022), and the strength of speech tracking is dependent on the listener's knowledge of the language (Molinaro and Lizarazu, 2018; Blanco-Elorrieta et al., 2020) and general comprehension (Keitel et al., 2018). Following these results, we expected a stronger presence of the word frequency response (the lower-level feature) in the sentence condition than in the word list condition (the higher-level information) in the delta band specifically. Theta-band effects tend to be found as a function of acoustic rather than abstract linguistic manipulations (Sohoglu et al., 2012; Molinaro and Lizarazu, 2018; Etard and Reichenbach, 2019; Blanco-Elorrieta et al., 2020). In this study, we expected to observe this distinction between delta- and theta-band activity through an absence of effects in the theta band.
Materials and Methods
To answer our research question, we analyzed a part of the open-access large multimodal MEG dataset (N = 204) Mother of all Unification Studies published by Schoffelen et al. (2019). In addition, we performed two types of control analyses, an analysis of a dataset published by ten Oever et al. (2022) and a set of simulations. Methods for all analyses are described below.
Participants
A total of 102 native speakers of Dutch (51 men, 51 women) with a mean age of 22 years (range, 18–33) were included in this analysis. In this half of the dataset, participants were presented with the stimuli auditorily (as opposed to the other half, where stimuli were presented visually). All participants were right-handed, reported normal hearing, had normal or corrected-to-normal vision, and had no history of neurologic, developmental, or linguistic deficits. All participants provided informed consent, and the study was approved by the local ethics committee (Committee on Research Involving Human Subjects in the Arnhem-Nijmegen region, The Netherlands) and followed guidelines of the Helsinki Declaration. Participants took part in an fMRI and an MEG session, during which they listened to sentences and word lists. Only the MEG data are included in the present study.
Materials
The complete set of stimuli consisted of 360 natural Dutch sentences of 9–15 words (mean, 11.6), with varying syntactic structures, and 360 word lists. To create the word lists, the words from the sentences were scrambled such that no more than two consecutive words formed a coherent fragment. The stimuli were recorded by a female native speaker of Dutch. The sentences were pronounced naturally. The word lists were pronounced with neutral prosody and with a clear pause between each word. The files were recorded in stereo at 44,100 Hz. The sentences had an average duration of 4.27 s (SD 0.61), and the word lists of 7.67 s (SD 1.04). During postprocessing, the audio files were low-pass filtered at 8500 Hz and normalized so that all files had the same peak amplitude and peak intensity. In the word list condition, the individual words were spliced together with variable silence between them. This created conditions with different acoustic properties; we address this issue in the sections below, beginning with MEG preprocessing. In both conditions, the transition from silence to speech was ramped at onset and offset with a rise/fall time of 10 ms. Word onsets and offsets were determined manually for each audio file using Praat software (Boersma and Weenink, 2018).
The stimuli were divided over two sets, A and B. During the MEG session, participants were presented with 120 sentences from set A and 120 word lists from set B (or the reverse). Across participants, all stimuli were presented the same number of times in the sentence and word list condition.
Procedure
Before the task, participants read written instructions and were allowed to ask clarification questions. The experimenter emphasized that the sentences and word lists should be attended to carefully and discouraged attempts to integrate the words in the word list condition. To familiarize the participants with the task, all participants performed a practice block with stimuli not included in the study. During the MEG measurement, the stimuli were presented in 24 blocks, alternating between sentence blocks (each containing five sentences) and word list blocks (each containing five word lists). The starting block type (either sentences or word lists) was randomized across participants. At the start of each block there was a 1500 ms presentation of the block type: zinnen (sentences in Dutch) or woorden (words in Dutch). The intertrial interval was jittered between 3200 and 4200 ms. During this period, an empty screen was presented, followed by a fixation cross.
To ensure participants paid attention to the stimuli, 20% of the trials were followed by a Yes/No question about the content of the preceding sentence/word list. Half the questions on the sentences addressed the content of the sentence (e.g., Did grandma give a cookie to the girl?), whereas the other half and all the questions about the word lists addressed one of the main content words (e.g., Was a grandma mentioned?). Participants answered the question by pressing a button for Yes/No with their left index and middle finger, respectively. Although the tasks were not identical between the conditions, the randomized order of appearance of question types ensured that participants could not approach the sentences any differently from the word lists; any sentence or list trial could be followed by the word monitoring task.
The stimuli were presented via plastic tubes and ear pieces in both ears. The hearing threshold was determined individually for each participant before the experiment, and the stimuli were presented at an intensity of 50 dB above the hearing threshold.
The experiment was run using Presentation software (version 16.0, Neurobehavioral Systems, www.neurobs.com). MEG was continuously recorded with a 275-channel axial gradiometer system (CTF) at a sampling frequency of 1200 Hz (cutoff frequency of the analog antialiasing low-pass filter was 300 Hz). Three head localizer coils were attached to the participant's head (nasion, left and right ear canals) to determine the position of the head relative to the MEG sensors. The head position was monitored throughout the measurement. If needed, the participant was asked to reposition during breaks to correct for changes in head position. The audio signal of the stimuli presented in the scanner was recorded along with the MEG data using an analog-to-digital converter channel.
Structural MRI images for source reconstruction were acquired using a T1-weighted magnetization-prepared rapid gradient echo pulse sequence with the following acquisition parameters: volume TR = 2300 ms, TE = 3.03 ms, flip angle = 8 degrees, 1 slab, slice matrix size = 256 × 256, slice thickness = 1 mm, field of view = 256 mm, isotropic voxel size = 1.0 × 1.0 × 1.0 mm. A vitamin E capsule was placed as a fiducial behind the right ear to allow visual confirmation of left–right consistency.
MEG preprocessing
The MEG data were preprocessed with custom-written MATLAB scripts using the FieldTrip toolbox (Oostenveld et al., 2011; Donders Institute for Brain, Cognition and Behavior, Radboud University, The Netherlands; http://fieldtriptoolbox.org). Before filtering, the data were epoched from audio onset to audio offset. The epochs were baseline corrected and bandpass filtered into the designated frequency band using a windowed-sinc finite impulse response (FIR) filter (with 15 s of data padding), after which they were resampled to 120 Hz for TRF estimation.
The frequency band of interest was defined on the basis of the rate of occurrence of words in the stimuli, the differences in speech–brain coherence between conditions, and the literature (Blanco-Elorrieta et al., 2020; Donhauser and Baillet, 2020; Molinaro and Lizarazu, 2018; Weissbart et al., 2020). The word rate in the word lists was 1.5 Hz (SD 0.1), and in the sentences 2.7 Hz (SD 0.3). To compute speech–brain coherence, we first computed the broadband speech envelope by taking the absolute value of the Hilbert transform of the speech signal, low-pass filtering it at 20 Hz, and scaling the output between zero and one. We then computed the magnitude-squared coherence estimate of the broadband speech envelope and the MEG signal using Welch's method. The differences between word lists and sentences were estimated using a cluster-based permutation test. This revealed three peaks in the low-frequency signal: one between 1 and 3 Hz, one between 4.5 and 7 Hz, and one between 9.5 and 12 Hz (Fig. 1; Lam et al., 2018). On the basis of these clusters and the frequency bands analyzed in the literature (Donhauser and Baillet, 2020), we analyzed two frequency windows, delta (0.5–4 Hz) and theta (4–10 Hz). To account for differences in speech–brain coherence that were exclusively because of acoustic differences between the conditions, we included the speech envelope as a predictor in all the models of the data (Fig. 1B, modulation spectra). Details of the models are given below in Temporal response functions and Stimulus representation.
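For illustration, the envelope and coherence computation can be sketched in a few lines of Python. This is a minimal sketch, not the original pipeline; the function names, the Butterworth filter order, and the Welch window length are our own choices, and the audio and MEG traces are assumed to be aligned and sampled at the same rate:

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, coherence

def broadband_envelope(audio, fs, cutoff=20.0):
    # Absolute value of the Hilbert transform gives the broadband envelope
    env = np.abs(hilbert(audio))
    # Low-pass the envelope at 20 Hz (zero-phase Butterworth; order is our choice)
    b, a = butter(4, cutoff / (fs / 2), btype="low")
    env = filtfilt(b, a, env)
    # Scale the output between zero and one
    return (env - env.min()) / (env.max() - env.min())

def speech_brain_coherence(audio, meg_channel, fs, nperseg=1024):
    # Welch-based magnitude-squared coherence between envelope and one MEG channel
    env = broadband_envelope(audio, fs)
    freqs, cxy = coherence(env, meg_channel, fs=fs, nperseg=nperseg)
    return freqs, cxy
```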
Source reconstruction
MRI images were coregistered to the MEG headspace coordinate system by aligning the positions of the preauricular points and the nasion MEG coil to the MRI images using the MNE-Python coregistration GUI (Gramfort et al., 2013). For each participant, we reconstructed the cortical surface using the watershed algorithm from FreeSurfer. We created a surface-based source space with oct6 spacing (∼5 mm between source points), yielding 4098 sources per hemisphere. We created a single-layer boundary element model (BEM) with a surface downsampling of 5120, from which the lead field was computed. The sources were reconstructed using a scalar linearly constrained minimum variance (LCMV) beamformer with unit-noise gain to counteract the depth bias. The data covariance used for computing the LCMV filters was whitened using the covariance matrix of resting-state data. The resting-state data were bandpass filtered into the appropriate frequency band (i.e., 0.5–4 Hz for the delta band and 4–10 Hz for the theta band). After application of the LCMV beamformer filters to the epoched MEG data, the source-localized epochs were morphed to fsaverage for group statistics. These source-localized, morphed epochs were then entered into the pipeline for temporal response function estimation. Source localization failed for 11 participants because of convergence issues for the noise covariance matrix or missing resting-state data (Nsource = 91).
Temporal response functions
To characterize the effect of linguistic structure and meaning on the neural response, we estimated TRFs with different acoustic and linguistic features. This approach has been used to determine responses to different linguistic features, ranging from the speech envelope and phonemic information (di Liberto et al., 2015; Donhauser and Baillet, 2020) to lexical information (Broderick et al., 2018; Weissbart et al., 2020) and even syntactic embedding (Nelson et al., 2017). The response function of interest here is the response to word frequency as this is a unigram feature and therefore has the same per-word values in both conditions.
The TRFs were estimated using linear regression. We modeled the neural response as the convolution of the TRF kernel with the stimulus representation signal. In summary, this method reduces to a multivariate multiple linear regression in which lagged time series of the stimulus features serve as predictors. The model equation reads as follows:

$$r(t) = \sum_{f} \sum_{\tau} w_f(\tau)\, s_f(t - \tau) + \varepsilon(t) \tag{1}$$

In Equation 1, $r(t)$ is the neural signal at time $t$, $s_f$ is the time series of stimulus feature $f$, and $\varepsilon(t)$ is a noise term; the vector of weights $w_f(\tau)$ across time lags $\tau$ constitutes the TRF of feature $f$.
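To make this concrete, regularized TRF estimation reduces to ridge regression on a time-lagged design matrix. The following is a minimal sketch under the assumption that features and MEG share one sampling rate; the array shapes and helper names are ours:

```python
import numpy as np

def lagged_design_matrix(features, lags):
    # features: (n_times, n_features); lags: sample offsets (e.g., -12..96)
    n_times, n_feat = features.shape
    X = np.zeros((n_times, n_feat * len(lags)))
    for j, lag in enumerate(lags):
        shifted = np.roll(features, lag, axis=0)
        # Zero out the samples that wrapped around the edges
        if lag > 0:
            shifted[:lag] = 0
        elif lag < 0:
            shifted[lag:] = 0
        X[:, j * n_feat:(j + 1) * n_feat] = shifted
    return X

def fit_trf(features, meg, lags, lam):
    # Ridge solution W = (X'X + lam * I)^-1 X'Y: one TRF per feature and channel
    X = lagged_design_matrix(features, lags)
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ meg)  # (n_feat * n_lags, n_channels)
```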
To evaluate how our models perform at reconstructing the neural data, we computed the Pearson's correlation coefficient between the true data and data reconstructed using the estimated TRFs. The correlation between the reconstruction and the original MEG indicates how much of the variance in the neural signal is explained by the features. The TRFs were not estimated on the same portion of data used to score the model. As further explained (see below, Model fitting), we used a nested cross-validation procedure to tune the regularization parameter, estimate the TRF coefficients, and finally score the resulting model. Unless specified otherwise, all analyses described below were done with custom-made Python scripts using MNE-Python (Gramfort et al., 2013). The whole analysis was conducted both in sensor and in source space.
Stimulus representation
Its multivariate character makes the TRF approach especially suitable for the current analysis: differences between conditions that are not of interest here can be controlled for by modeling them explicitly. To characterize the speech signal and part of its linguistic content, we constructed five different features: word frequency (the feature of interest) and four control features consisting of the speech envelope, word onsets, entropy, and surprisal.
The speech envelope feature was computed for each stimulus by taking the absolute value of the Hilbert transform and downsampling it to 120 Hz to match the downsampled MEG sampling rate. The envelope feature was added to represent the acoustic response and as such captures the difference between conditions observed in the cerebro-acoustic coherence that was caused by differences in the acoustic input (Fig. 1A,B).
The word onset feature was added to broadly capture any time-locked response to word onset for which the variance is not already explained by other features. As such, this feature can also capture any effects of segmentation that differed between the conditions. The word onsets and offsets were transcribed manually for each stimulus. We used a train of unit impulses, where the feature signal is one at the word onset sample and zero otherwise, as follows:

$$x_{\mathrm{onset}}(t) = \sum_{i} \delta(t - t_i)$$

where $t_i$ is the onset sample of word $i$.
These impulse trains were convolved with a Gaussian kernel with an SD of 15 ms. Such temporal smoothing inflates the autocorrelation of the signal. We chose the width of the smoothing so that the smoothed impulses end up with energy spanning a frequency band comparable to that of our continuous regressor (the envelope). The Fourier transform of a Gaussian is also a Gaussian, and the 15 ms SD of the temporal smoothing kernel equates to a spectral SD of 21.22 Hz. This ensured that all features required a similar degree of regularization in the regression analysis and made it possible to include impulse-like features such as word onsets and the envelope in the same regularized regression. Notably, the smoothing also encodes some uncertainty about the exact word onset timings.
Like the word onset feature, the word frequency feature was constructed as an impulse train that is zero everywhere except at word onsets. Here, we used the respective word frequency value to modulate the height of the impulses. We used the log-transformed value of occurrence per million words, obtained from the SUBTLEX-NL corpus (Keuleers et al., 2010), as follows:

$$x_{\mathrm{freq}}(t) = \sum_{i} \mathrm{wf}(w_i)\, \delta(t - t_i)$$

where $\mathrm{wf}(w_i)$ is the log-transformed frequency per million of word $w_i$.
If a word did not exist in the corpus, the fallback value of 0.301 (log/million) was used, corresponding to the lowest word frequency in the corpus. The values were z-scored across all stimuli. The resulting signal was convolved with the same Gaussian kernel as the word onset feature.
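A minimal sketch of how such impulse features can be constructed (the 15 ms SD follows the text; the kernel length, sampling-rate constant, and example values are our own, for illustration only):

```python
import numpy as np
from scipy.signal.windows import gaussian

FS = 120  # sampling rate after resampling, in Hz

def impulse_feature(onset_samples, values, n_samples, sigma_ms=15.0):
    # Place the per-word values at the word onset samples ...
    x = np.zeros(n_samples)
    x[onset_samples] = values
    # ... and smooth with a Gaussian kernel (SD of 15 ms = 1.8 samples at 120 Hz)
    sigma = sigma_ms / 1000.0 * FS
    kernel = gaussian(int(8 * sigma) | 1, sigma)  # odd-length window
    return np.convolve(x, kernel, mode="same")

# Word onsets carry unit impulses; word frequency carries z-scored
# log-transformed per-million values at the same onsets (example numbers only)
onsets = np.array([12, 60, 130])
onset_feat = impulse_feature(onsets, np.ones(3), 200)
freq_feat = impulse_feature(onsets, np.array([0.4, -1.2, 0.9]), 200)
```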
The entropy feature consists of lexical entropy, a weighted probability measure that quantifies the uncertainty about the upcoming word on the basis of the previous words. It provides a numeric answer to the question, Given the n previous words, with what degree of certainty can we predict the upcoming word? It is defined as follows:

$$H(w_n) = -\sum_{w \in W} P(w \mid w_{n-2}, w_{n-1}) \log P(w \mid w_{n-2}, w_{n-1})$$

where $W$ is the vocabulary and the conditional probabilities are taken from a trigram model, so that the prediction is conditioned on the two preceding words.
The value was derived from a trigram model trained on the NLCOW2012 corpus using WOPR (van den Bosch and Berck, 2009). If a value was missing, the average of all entropy values was used. Like the word frequency feature, the entropy values were z-scored relative to all stimuli and inserted in a stick function, after which the stick function was convolved with the same Gaussian window. This feature was added to ensure that any effects on the word frequency feature were of a compositional semantic and structural nature rather than a probabilistic one.
The surprisal feature reflects how surprising a given word is in its immediate context. From an information-theoretic perspective, this reflects the information content, or self-information, of a word. It was calculated as the log10 transformation of the conditional probability of a word, which was taken from the same trigram model as the entropy values. This means that surprisal is always based on the two preceding words: given the two preceding words, how high was the chance that the observed word would indeed appear? If the chance was low, surprisal is high. The feature was constructed in the same way as the word frequency and entropy features; the values were z-scored across all stimuli, inserted in a stick function at word onsets, and convolved with the Gaussian window, as follows:

$$S(w_n) = -\log_{10} P(w_n \mid w_{n-2}, w_{n-1})$$
Because the three numerical lexical features (frequency, entropy, surprisal) may be correlated to some extent, we needed to ensure that the degree of multicollinearity in our stimulus representation would not hinder the interpretation of the TRF coefficients. We checked whether the variance inflation factor (VIF) was below five (a relatively conservative criterion for multicollinearity; Sheather, 2009; Tomaschek et al., 2018). The VIF was computed by correlating the z-scored entropy, surprisal, and word frequency values and taking the diagonal of the inverted correlation matrix. This was done across all stimuli and for both conditions separately. The VIF never exceeded five; the highest value was 4.8, for surprisal in the word list condition.
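The VIF computation described above is compact enough to show directly; a sketch with our own function name:

```python
import numpy as np

def variance_inflation_factors(z_features):
    # z_features: (n_words, 3) z-scored entropy, surprisal, and word frequency
    corr = np.corrcoef(z_features, rowvar=False)
    # Diagonal of the inverted correlation matrix gives one VIF per feature
    return np.diag(np.linalg.inv(corr))
```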
Model fitting
The features were fitted in a cumulative manner to assess the contribution of each feature. This led to a total of seven models per frequency band: an Envelope model, consisting of only the speech envelope; an Onset model, consisting of the speech envelope and word onset features; a Frequency model, consisting of the speech envelope, word onset, and word frequency features; an Entropy model, containing the speech envelope, word onset, and entropy features; a Surprisal model, consisting of the speech envelope, word onset, and surprisal features; and cross-combinations of those with and without the word frequency feature. An overview of all models and the corresponding features is provided in Table 1.
Before model fitting, the data were split pseudorandomly into a training and testing set at an 80/20 ratio. Care was taken that the sentences and word lists were evenly divided across the training and test sets. The sentence and word list models were each trained on 96 of 120 trials. The regularization parameter was optimized individually per participant, frequency band, and model (but not per condition) using an eightfold cross-validation procedure with 20 log-spaced values around the eigenvalue of the covariance matrix of the lagged speech envelope (λ = 60,470.9), ranging from λ × 10−3 to λ × 103. The best regularization parameter was determined as the value for which the average (across sensors) reconstruction accuracy was highest. Occasionally, reconstruction accuracies would not increase with a higher degree of regularization; instead, increasing the regularization would leave the reconstruction accuracy at the same value until overregularization occurred and reconstruction accuracy went down. In this case, the highest lambda value before the drop in accuracy was chosen to ensure some degree of regularization. Each model was then fitted on the complete training set using the regularization parameter from the cross-validation procedure, yielding the TRFs.
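The grid of candidate regularization values and the plateau rule can be sketched as follows (the tolerance is our own choice):

```python
import numpy as np

lam0 = 60470.9  # eigenvalue-based reference value reported above
# 20 log-spaced candidates spanning lam0 * 10^-3 to lam0 * 10^3
lambdas = lam0 * np.logspace(-3.0, 3.0, num=20)

def pick_lambda(mean_scores, tol=1e-6):
    # mean_scores: cross-validated reconstruction accuracy per candidate.
    # If scores plateau, walk right to the largest lambda before they drop,
    # mirroring the selection rule described above.
    best = int(np.argmax(mean_scores))
    while best + 1 < len(mean_scores) and mean_scores[best + 1] >= mean_scores[best] - tol:
        best += 1
    return lambdas[best]
```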
In the analysis of the source-localized MEG data, the manipulations were simplified because of computational limitations. The two maximal models were fitted, differing only in the word frequency feature: the Entropy/Surprisal model, consisting of the speech envelope, word onset, entropy, and surprisal features, and the Full model, consisting of all features. The cross-validation procedure was reduced to fivefold with 10 log-spaced values around the eigenvalue of the stimuli (60,470.9), ranging from λ × 10−2 to λ × 102.
Model evaluation
Each model was evaluated by convolving the estimated TRFs with the unseen stimuli from the test dataset. This yields, in essence, a prediction of the neural signal according to the model. The predicted neural signal was then correlated with the original neural signal from the test set using the Pearson product-moment correlation on a sensor-by-sensor or source-by-source basis. For every individual participant, this yielded a set of sensor- or source-based reconstruction accuracies for each model.
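A sketch of this scoring step (names are ours), yielding one Pearson coefficient per sensor or source:

```python
import numpy as np

def reconstruction_accuracy(predicted, observed):
    # predicted, observed: (n_times, n_channels) test-set signals
    r = np.empty(observed.shape[1])
    for ch in range(observed.shape[1]):
        r[ch] = np.corrcoef(predicted[:, ch], observed[:, ch])[0, 1]
    return r  # reconstruction accuracy per channel
```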
Statistical analysis
The TRF analysis has two deliverables. First, the TRF (the development of the estimated coefficients across time) is an ERP-like waveform that captures how the neural signal changes as a function of, for example, word frequency, and, second, the reconstruction accuracy, which is a metric of model fit. Here, we wanted to know (1) whether the responses to word frequency differ between sentences and word lists in time and space, so we compared the TRFs between conditions, and (2) whether the presence of the word frequency response differed between sentences and word lists, so we tested whether the word frequency predictor contributed differently to the reconstruction accuracy of a model in the two conditions.
Throughout, evaluation for statistical significance of the difference between TRFs was done using cluster-based permutation tests. Cluster-based permutation tests address the null hypothesis of exchangeability across conditions by a Monte Carlo estimate of the randomization distribution of a cluster-based test statistic, optimizing statistical sensitivity while controlling the false alarm rate. Here, we used the t statistic as the test statistic. In these tests, we create matrices of all sensors and samples. Then, we compute the difference between two conditions and express it as a t statistic for each of these data points. The t values are thresholded at an a priori threshold, and the thresholded t values are summed across clusters on the basis of spatial and temporal adjacency. The test statistic of the resulting largest cluster is compared with a distribution of 1024 similarly obtained test statistics, computed after random permutation of the condition labels. We used the function spatio_temporal_cluster_test from the MNE-Python library (Gramfort et al., 2013) with the t statistic as the test statistic and 1024 permutations.
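In MNE-Python, such a test corresponds to a call along the following lines. The data, threshold, and array shapes below are placeholders, and in practice the sensor adjacency would come from mne.channels.find_ch_adjacency rather than the regular lattice assumed when adjacency is None:

```python
import numpy as np
from scipy import stats
from mne.stats import spatio_temporal_cluster_test

# TRFs per condition: (n_participants, n_lags, n_sensors); placeholder data
rng = np.random.default_rng(0)
X_sent = rng.standard_normal((102, 96, 275))
X_list = rng.standard_normal((102, 96, 275))

def t_stat(a, b):
    # Cluster-forming statistic: one t value per (lag, sensor) data point
    return stats.ttest_ind(a, b, axis=0).statistic

t_obs, clusters, cluster_pv, h0 = spatio_temporal_cluster_test(
    [X_sent, X_list], stat_fun=t_stat, threshold=6.0,  # placeholder threshold
    n_permutations=1024, adjacency=None)
```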
To assess whether the responses to word frequency differed qualitatively between conditions in sensor space, the difference between the word frequency TRFs for the sentence and word list conditions was evaluated using a cluster-based permutation test. In addition, to characterize the response in each condition separately, we performed two cluster-based permutation tests with the same methods in which we contrasted the response against zero in each condition. In total, we performed three cluster-based permutation tests on the sensor TRFs: one on the difference between conditions and one on the TRF of each of the two conditions separately (against zero). In all cases, we calculated the cluster-forming threshold on the basis of the t distribution with a significance level of 5 × 10−8 and 101 (number of participants minus one) degrees of freedom. This equals three times the recommended threshold for this number of participants. The threshold was increased to yield the most informative results (i.e., to ensure not every sensor and time lag would be significant). Subsequent comparisons were done with a threshold calculated using a Bonferroni-adjusted significance level (i.e., divided by two) to correct for multiple comparisons; everything else was the same.
In addition, we wanted to evaluate whether there was a latency difference between the responses in the two conditions. To this end, we cross-correlated the responses from the sentence and word list conditions. The cross-correlation was computed on the grand-average TRF waveforms of the sensors shared between conditions in the clusters resulting from the one-sample tests. We cross-correlated each sensor separately and normalized the values by dividing them by the maximal value of the cross-correlation for that sensor. We then obtained the peak lag for every sensor; this lag is the delay at which the two signals correlate most strongly and indicates how much the responses differ in time. Subsequently, we shifted the sentence response in time by the number of samples of the peak lag and correlated the shifted sentence response with the original word list response. To assess significance, we performed the same procedure for randomly selected channels and repeated this process 10,000 times.
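A sketch of the per-sensor cross-correlation step (our own names; with this argument order, a positive peak lag means the word list response trails the sentence response):

```python
import numpy as np

def peak_lag_ms(trf_sent, trf_list, fs=120):
    # Grand-average TRF waveforms of one sensor, mean-centered
    a = trf_sent - trf_sent.mean()
    b = trf_list - trf_list.mean()
    xcorr = np.correlate(b, a, mode="full")
    xcorr /= np.abs(xcorr).max()             # normalize by the maximum
    lags = np.arange(-len(a) + 1, len(b))    # lags in samples
    return lags[int(np.argmax(xcorr))] / fs * 1000.0
```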
In source space, we compared the TRFs for word lists and sentences using a cluster-based permutation test in two time windows chosen on the basis of the results from the analysis in sensor space, 200–400 and 500–700 ms post stimulus onset (PSO). We did this to get a more reliable estimate of the spatial distribution of the effects, although cluster-based permutation tests only license the inference of a difference between conditions overall; any spatial or temporal localization of the effect is therefore approximate and inconclusive (Maris and Oostenveld, 2007; Sassenhagen and Draschkow, 2019). The threshold was set on the basis of the t distribution with an alpha of 0.025 (98.75th and 1.25th percentiles) to correct for multiple comparisons, with 90 (number of participants minus one) degrees of freedom. Sources along the medial wall were excluded.
In the sensor space analysis, the reconstruction accuracies were averaged over sensors and submitted to a linear mixed model using lme4 in R software (Bates et al., 2015). The model had the factor condition (two levels, sentence and word list) and a random intercept for participant. In addition, the model contained three binary factors, frequency, entropy, and surprisal, encoding whether a feature was (1) or was not (0) included in the model, so as to estimate a slope for each feature separately, as follows:

accuracies ~ condition * (frequency + entropy + surprisal) + (1|participant)
We used a stepwise variable selection to evaluate the contribution of each of these factors. To evaluate the contribution of a given factor (or interaction), a model with the factor was compared with a model without it, and the goodness-of-fit statistics were compared using a chi-square test. If the removal of a factor did not decrease goodness of fit, the next factor was removed. When the removal of a given feature or interaction significantly decreased model fit, the removal of features was stopped. The prefinal model should then describe the data best. As a final check, the Akaike information criterion (AIC) of the models was compared using the R package AICcmodavg (Mazerolle, 2020). Post hoc t tests were done between the Entropy/Surprisal and Full model to evaluate whether the effects held between the largest models.
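The mixed models were fitted in R with lme4; purely for illustration, a rough Python analogue of the full model using statsmodels could look as follows (the data frame, file name, and column names are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant x condition x model, with binary columns coding
# which lexical features the model contained (hypothetical file and columns)
df = pd.read_csv("reconstruction_accuracies.csv")

mixed = smf.mixedlm(
    "accuracy ~ condition * (frequency + entropy + surprisal)",
    data=df,
    groups=df["participant"],  # random intercept per participant
)
print(mixed.fit().summary())
```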
In source space, a cluster-based permutation test was done to localize the interaction effect using the function permutation_cluster_test from the MNE-Python library. The test statistic was an F statistic from a two-way ANOVA with factors Condition (levels: word list, sentence) and Model (levels: Entropy/Surprisal, Full). The data were permuted 1024 times.
Control analysis I: data
The word lists were presented with variable silences between words. The sentences, on the other hand, were natural, with pauses occurring sparingly. This caused differences in word rate and signal length between the conditions that may affect our results. To examine potential effects of the pauses in the word list condition, we analyzed a second dataset of 16 participants listening to word lists and sentences using the same methods. Importantly, the word lists in this dataset were naturally spoken, as were the sentences. This means that there were no pauses between the words in the word list condition and that there was coarticulation between words (Kaufeld et al., 2020). The data were supplied by ten Oever et al. (2022).
Control analysis I: Participants.
A total of 20 native speakers of Dutch (4 men, 16 women; mean age, 39.5 years) participated in the experiment. Four participants were excluded from this analysis for a variety of reasons (e.g., an incomplete session). All participants were right-handed, reported normal hearing, had normal or corrected-to-normal vision, and had no history of neurologic, developmental, or linguistic deficits. All participants provided informed consent. The study was approved by the ethical Commission for Human Research Arnhem/Nijmegen (project number CMO2014/288). Participants were remunerated for their participation.
Control analysis I: Materials.
The stimuli were identical to the stimuli used in Kaufeld et al. (2020). The experiment consisted of three conditions in total, sentences, jabberwocky, and word lists. Only the sentences and the word lists are analyzed here. The stimuli consisted of 10 words, which were all disyllabic except for de (the in Dutch) and en (and in Dutch). Sentences had a fixed syntactic structure of two coordinate clauses: [Adj N V N conj Det Adj N V N], for example, timid heroes pluck flowers and the brown birds gather branches. The word lists were scrambled versions of these sentences, and care was taken so there were no plausible internal combinations of words. The stimuli were recorded by a female native speaker of Dutch at a sampling rate of 44.1 kHz (monophonic). After recording, any pauses were normalized to ∼150 ms in all stimuli, and the intensity was scaled to 70 dB using Praat voice analysis software (Boersma and Weenink, 2018).
Participants were asked to perform four different tasks on these stimuli: a passive listening task, a syllable recognition task, a word recognition task, and a word combination recognition task. In this analysis, we did not distinguish among tasks. The tasks are described in detail in ten Oever et al. (2022).
Control analysis I: Procedure.
At the beginning of each trial, participants were instructed to look at a fixation cross presented at the middle of the screen on a gray background. The audio was presented binaurally through tubes after an interval randomly jittered between 1.5 and 3 s. One second after audio offset, the task prompt (e.g., the syllables or words for recognition) was presented, which required participants to press a button on a button box. There were eight blocks of ∼8 min. After each block, participants could take a break, during which the head position was corrected. MEG was recorded using a 275-channel axial gradiometer CTF MEG system at a sampling rate of 1200 Hz. After the session, head shape was collected using the Polhemus digitizer (using as fiducials the nasion and the entrance of the ear canals as positioned with ear molds).
Control analysis I: MEG preprocessing
The MEG data were processed with custom-written Python scripts using MNE-Python (Gramfort et al., 2013). As in the main analysis, the raw MEG data were filtered using a windowed-sinc FIR filter between 0.5 and 4 Hz for the delta band, and 4 and 10 Hz for the theta band, after which the data were epoched from audio onset to audio offset and resampled to 120 Hz for TRF estimation.
Control analysis I: Stimulus representation
In this analysis, we used the envelope, word onset, and word frequency representations from the main analysis (see above, Stimulus representation).
Control analysis I: Model fitting
We used the model-fitting approach described earlier (see above, Model fitting). We fit three models: Envelope (with only the envelope feature), Onset (envelope and word onset features), and Frequency (envelope, word onset, and word frequency features). The data were split pseudorandomly into a training and a testing set at an 80/20 ratio, ensuring that the sets contained 50% of items from each condition. The regularization parameter was optimized individually per participant and model using an eightfold nested cross-validation procedure with 20 log-spaced values around λ = 60,000, ranging from λ × 10−2 to λ × 102.
Control analysis I: Model evaluation
For model evaluation, we used the procedure described earlier (see above, Model evaluation).
Control analysis I: Statistical analysis
As in the main analysis, we assessed whether the responses to word frequency qualitatively differed between conditions by evaluating the difference between the word frequency TRFs for the sentence and word list conditions using a cluster-based permutation test. In addition, to characterize the response in each of the conditions separately, we performed two additional cluster-based permutation tests with the same methods in which we contrasted the response against zero in each condition. In total, we performed three cluster-based permutation tests on the TRFs: one on the difference between conditions and one on the TRF of each condition separately (against zero). In all tests, we calculated the threshold on the basis of the t distribution with a significance level of 0.05 and 15 (number of participants minus one) degrees of freedom. Only clusters with a p value smaller than 0.01 were considered. Subsequent comparisons were done with a threshold calculated using a Bonferroni-adjusted significance level to correct for multiple comparisons; everything else was the same. For comparison with the main analysis, we also compared the word onset response between conditions with the methods described above.
To evaluate the effect of word frequency in each condition, we compared the reconstruction accuracies from the Onset and Frequency models in interaction with condition. The reconstruction accuracies were averaged over all sensors (conservative measure). After checking for normality and sphericity through (1) visual inspection of Q-Q plots and histograms; (2) statistical testing using the Shapiro–Wilk test, Anderson–Darling test, and D'Agostino's K2 test for kurtosis and skewness as implemented in SciPy algorithms; and (3) the Mauchly test for sphericity as implemented in the Pingouin package (Vallat, 2018), the averaged reconstruction accuracy values were submitted to a repeated-measures ANOVA using the Statsmodels package.
Control analysis II: simulations
Using simulations, we evaluated whether the interword interval has an impact on TRF model evaluation. We did this by simulating raw MEG data consisting of a signal (different impulse responses) and a variable amount of noise.
The simulated response was equivalent to the forward model, namely a noisy output of a convolution between a predefined kernel (the ground truth for the TRF estimate) and an impulse train (the input signal). We generated those data with a variable amount of noise (i.e., explicitly manipulating the broadband signal-to-noise ratio) and with a varying interstimulus interval (ISI), while keeping the signal length and the number of impulses, or events, constant (in which case a shorter interstimulus interval results in the end portion of the output signal containing only noise).
We then scored the forward model by computing both the R2 score and the Pearson's correlation coefficient between the reconstruction and the simulated signal.
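A minimal sketch of this simulation setup (the durations, kernel shape, and noise levels are illustrative choices, not the exact values used):

```python
import numpy as np

rng = np.random.default_rng(1)
fs, n_samples, n_events = 120, 7200, 100  # 60 s of simulated data

def simulate(isi_samples, noise_sd, kernel):
    # Impulse train with a fixed number of events and a given ISI
    x = np.zeros(n_samples)
    x[np.arange(n_events) * isi_samples] = 1.0
    # Ground truth: kernel convolved with the impulse train; shorter ISIs
    # leave the tail of the output signal containing only noise
    clean = np.convolve(x, kernel, mode="full")[:n_samples]
    return x, clean + rng.normal(0.0, noise_sd, n_samples)

kernel = np.hanning(60)  # arbitrary ground-truth TRF shape (500 ms at 120 Hz)
x, y = simulate(isi_samples=48, noise_sd=0.5, kernel=kernel)
```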
Data availability
The code is available at https://osf.io/ky9bj/, with the exception of the preprocessing scripts. The preprocessed data are available on request. The raw data can be downloaded from the Donders Institute repository at https://data.donders.ru.nl/collections/di/dccn/DSC_3011020.09_236?0.
Results
Behavioral results
We compared participants' responses to the task that was present in both conditions, which targeted one of the main content words (e.g., Was a grandma mentioned?). To balance the number of trials included in the accuracy scores, we took a random subset of questions from the word lists (12 or 13 trials). The average proportion of correct responses was higher in the sentence condition (mean = 0.88, SD = 0.08) than in the word list condition (mean = 0.72, SD = 0.14; t = 10.08, p < 0.001), meaning that participants remembered the words from the sentences better than the words from the word lists (Fig. 2).
Delta band
Sensor-level analysis
The cluster-based permutation test revealed differences between word lists and sentences in three clusters between 0 and 700 ms. Figure 3A suggests that the peak of the response to word frequency was delayed by ∼300 ms in the word list condition. To evaluate whether this was the case, we conducted one-sample cluster-based permutation tests and computed the cross-correlation between the two conditions for overlapping sensors from the clusters in both conditions. The one-sample cluster-based permutation tests revealed a response in temporal areas in both conditions that peaks at ∼250 ms in the sentence condition and at ∼600 ms in the word list condition (Fig. 3B,C).
The cross-correlation on overlapping sensors between the two conditions (time courses and sensors; Fig. 4A) revealed a high correlation between the word list and the sentence responses at a delay of 330 ms (mean r = 0.9). Random sampling of sensors and lags revealed the distribution shown in Figure 4D; the observed values fall in the upper 0.05% of this distribution, indicating that the observed correlation is unlikely to be caused by chance.
Because we wondered whether the delay could be because of the differences in presentation rate, we examined differences between the TRFs for the other word-level feature that was numerically identical between conditions, word onsets (a unit spike train in both conditions). We compared the word onset response from a model with only the envelope and word onset features. This model is equivalent to an ERP analysis that corrects for overlapping event windows (as is the case in the sentence condition) and controls for acoustic differences. A small delay of ∼100 ms appears in this model. This delay is in accordance with findings from an ERP analysis of high- versus low-constraining contexts (Liu et al., 2006; León-Cabrera et al., 2017). Importantly, this model collapses over variance caused by the lexical features included in the full model (word frequency, entropy, and surprisal). In other words, this underspecified model attributes variance that is in fact because of word frequency, entropy, or surprisal to the word onset predictor. When we included the other lexical predictors in the model and compared the conditions again, no such difference between the word onset responses was observed (Fig. 3D). In this response, there were some differences around time point zero, both slightly before and after; these differences may indicate differences in the temporal expectancy of word onset between conditions.
The reconstruction accuracies were evaluated with the model accuracies ~ condition * (frequency + entropy + surprisal) + (1|participant). The explanatory value of the interaction between condition and each of the lexical factors was evaluated; each interaction significantly improved model fit [frequency, χ2(1) = 6.88, p < 0.01; entropy, χ2(1) = 4.48, p < 0.05; surprisal, χ2(1) = 7.24, p < 0.01], so the full model was interpreted. The results of this model are summarized in Table 2.
Reconstruction accuracies were higher in the word list condition than in the sentence condition (β = 1.67 * 10−2, SE = 9.43 * 10−4, t(1530) = 17.69, p < 0.01). As can be seen in Figure 5A, each feature contributed positively to the reconstruction of the neural signal in the sentence condition, less so in the word list condition, hinting at an interaction effect. Indeed, the factor frequency interacted with condition (β = 2.47 * 10−3, SE = 9.43 * 10−4, t(1530) = 2.63, p < 0.01), showing that reconstruction accuracies improved more from the addition of the word frequency predictor in the sentence condition than in the list condition (Fig. 5B). Further, although we do not discuss these effects, entropy and surprisal interacted with condition as well (entropy, β = 2.00 * 10−3, SE = 9.43 * 10−4, t(1530) = 2.12, p < 0.05; surprisal, β = 2.54 * 10−3, SE = 9.43 * 10−4, t(1530) = 2.69, p < 0.01).
To gain more insight into the effect of frequency, we performed a post hoc t test comparing the two largest models (Entropy/Surprisal and Full). These tests confirmed that the word frequency predictor enhanced reconstruction accuracy in the sentence condition (t(101) = 5.35; p < 0.01), but not in the word list condition (t(101) = −0.15, p = 1; Bonferroni corrected).
Finally, we hypothesized that the higher reconstruction accuracy in the word list condition was because of the salience of isolated words, possibly evoking a larger auditory response. If this is true, a model with only the envelope predictor, and no word-level feature, should also fit the word list condition better. To evaluate this hypothesis, we compared the reconstruction accuracies (averaged over all sensors) for the Envelope model between conditions; this model was not included in the analyses of the word frequency effect. This was indeed the case: reconstruction accuracies were higher for word lists than for sentences when using only the envelope as predictor (t(101) = 13.40, p < 0.01).
In sum, the response to word frequency differed between word lists and sentences. The TRFs in sensor space revealed a left-lateralized frontotemporal response to the feature that peaked at ∼250 ms after word onset in the sentence condition and at ∼600 ms in the word list condition. The sentence effect is in line with other studies that used word frequency as a feature in TRF models of natural language comprehension (Brennan and Hale, 2019; Weissbart et al., 2020). A cross-correlation analysis between a set of left (and one right) temporal and frontal sensors that were involved in the response in both conditions suggested that the word list response peaks ∼300 ms later. The reconstruction accuracies in sensor space suggest that the word frequency predictor explains more variance over and above acoustics, entropy, and surprisal in the sentence condition, but not in the word list condition.
Source reconstruction
In source space, we compared the TRFs for word lists and sentences using a cluster-based permutation test in two time windows on the basis of the results from the analysis in sensor space, 200–400 and 500–700 ms post stimulus onset. The cluster-based permutation test on the TRFs from the source-reconstructed MEG revealed two clusters in the early time bin and four clusters in the late time bin. In line with the analysis in sensor space, coefficients were higher in the sentence condition than in the word list condition in the early time bin (200–400 ms PSO). These differences appeared bilaterally in the posterior superior and middle frontal gyrus (dorsolateral and dorsomedial prefrontal cortex) and cingulate gyrus (Fig. 6A). In the right hemisphere, the cluster extended to the inferior frontal gyrus (Fig. 6A).
In the late time bin (500–700 ms PSO; Fig. 6B), coefficients were higher in the word list condition than in the sentence condition in three of four clusters. Those clusters appeared in the left hemisphere in the posterior temporal lobe across the superior, middle, and inferior gyri/sulci, the temporal pole, and the parahippocampal gyrus. In the right hemisphere, the effects appeared in superior temporal, inferior parietal, and caudal frontal areas, as well as cingulate gyrus. In a final cluster in the late time bin, the coefficients were higher in the sentence than in the word list condition. This cluster spanned left inferior frontal areas, orbital cortex, as well as a small portion of the middle frontal gyrus.
In addition, we observed a difference between the responses in left orbitofrontal and ventrolateral prefrontal cortex, including the inferior frontal gyrus. In this area, the response peaked in the late time bin in the sentence condition only. That this area is where we found a difference in late time lags is not surprising given the large literature implicating the left inferior frontal cortex, or Broca's area, in syntactic processes (Friederici, 2011, 2012, 2015; Hagoort, 2013, 2016; Matchin and Hickok, 2020).
Given our finding that the word list response appeared delayed in comparison to the response in the sentence condition, we also considered responses in the sentence and word list conditions separately through one-sample cluster-based permutation tests. Here, we observed a widespread response in both conditions; and indeed, this response appears in the early time window in the sentence condition (Fig. 6C) and in the late time window in the word list condition (Fig. 6F).
As we already observed in the contrast, in the late time window the response to word-internal information encompasses the left posterior superior, middle, and inferior temporal gyri (including parahippocampal gyrus) and the temporal poles, as well as bilateral somatosensory areas in both conditions. These areas are traditionally associated with lexical and semantic memory (Binder and Desai, 2011; Hagoort, 2013, 2016). Furthermore, as we observed in the early time window, this response includes the bilateral dorsolateral prefrontal cortex. These areas are part of the dorsal attention network and have been implicated in controlling the activation and selection of information stored in temporoparietal cortices (Binder and Desai, 2011). In addition, as we observed in the contrast between conditions, in the sentence condition a late response appears in the left inferior frontal gyrus (Fig. 6E). This response was absent in the word list condition. We compared the reconstruction accuracies using a cluster-based two-way ANOVA with factors Condition (levels: word list, sentence) and Model (levels: Entropy/Surprisal, Full). There were no significant differences (all p values > 0.1).
Together, these findings indicate that (1) much, but not all, of the response to word-internal information is shared between conditions in space; (2) the response develops differently in time, with a delay in the word list condition; and (3) word-internal information modulates activity in the left inferior frontal gyrus only in the presence of a coherent context.
Theta band
Sensor-level analysis
In the theta band, the cluster-based permutation test revealed no differences between the word list and sentence TRFs for the word frequency feature (Fig. 7). The one-sample tests indicated, however, a response between 100 and 200 ms in the word list condition that was absent in the sentence condition.
As in the delta band, the full model was accuracies ~ condition * (frequency + entropy + surprisal) + (1|participant). Removing the interaction between frequency and condition, or the interaction between surprisal and condition, decreased model fit [frequency, marginally, χ2(1) = 3.80, p = 0.051; surprisal, χ2(1) = 3.95, p < 0.05], but removing the interaction between entropy and condition did not [χ2(1) = 0.47, p = 0.49]. We continued with the model accuracies ~ condition * (frequency + surprisal) + entropy + (1|participant). The AIC comparison confirmed that this model was the best descriptor of the data. The results of this model are summarized in Table 3.
In theta, too, there was a main effect of condition (β = 2.09 * 10−3, SE = 6.90 * 10−4, t(1530) = 3.02, p < 0.01), with reconstruction accuracies being higher in the word list condition than in the sentence condition; see Figure 8. In addition, there was a main effect of frequency (β = 1.17 * 10−3, SE = 5.64 * 10−4, t(1530) = 2.07, p < 0.05) indicating that generally the addition of word frequency improved reconstruction accuracy. The interaction between frequency and condition approached but did not reach significance (β = 1.56 * 10−3, SE = 7.97 * 10−4, t(1530) = 1.95, p = 0.051), indicating a potential trend for the frequency effect to be larger in the sentence condition than in the word list condition (Fig. 7).
With respect to the other predictors, there was a positive effect of entropy (β = 2.43 * 10−3, SE = 3.99 * 10−4, t(1530) = 1.95, p < 0.01) and an interaction between condition and surprisal (β = 1.55 * 10−3, SE = 7.92 * 10−4, t(1530) = 1.99, p < 0.05), indicating that surprisal enhanced reconstruction accuracies more in the sentence condition than in the word list condition.
Again, we performed post hoc t tests comparing the two largest models (Entropy/Surprisal and Full) to gain more insight into the effect of word frequency on the reconstruction accuracies. These showed that the word frequency predictor enhanced reconstruction accuracies in the sentence condition (t(101) = 5.67; p < 0.01) but not in the word list condition (t(101) = 1.48; p = 0.57). There were no effects of condition for these two models (all p values = 1).
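A minimal sketch of these post hoc tests in Python follows; the arrays of per-participant accuracies are hypothetical, and the Bonferroni-style correction (with an assumed four comparisons) is inferred from the corrected p values of 1 reported above.

```python
# Hypothetical sketch of the post hoc paired t tests; acc_full_sent and
# acc_es_sent are assumed arrays (length 102) of per-participant mean
# reconstruction accuracies for the Full and Entropy/Surprisal models.
from scipy.stats import ttest_rel

def paired_bonferroni(a, b, n_comparisons=4):
    """Paired t test with a Bonferroni correction (n_comparisons assumed)."""
    t, p = ttest_rel(a, b)
    return t, min(p * n_comparisons, 1.0)

t_sent, p_sent = paired_bonferroni(acc_full_sent, acc_es_sent)
print(f"sentence condition: t(101) = {t_sent:.2f}, corrected p = {p_sent:.2f}")
```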
Source reconstruction
Given that the permutation test in the sensor-based analysis did not reveal any effects in the theta band, and we could not select time bins a priori, we performed a cluster-based permutation test on the full TRF. This revealed two clusters in the right hemisphere between 100 and 250 ms. Both clusters reflect a larger TRF amplitude across right frontal and temporal areas in the word list condition than in the sentence condition, as can be seen in the time courses of the clusters in Figure 9. These effects, although visible in Figure 7, A–C, did not reach significance in the sensor analysis, potentially because of the more stringent cluster-forming threshold (three times the recommended value) chosen there.
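In MNE-Python terms, such a spatiotemporal cluster-based permutation test over the full TRF could look like the sketch below; the arrays, the adjacency matrix, and all parameter values are assumptions, and the within-subject design is handled by testing the per-participant condition difference against zero.

```python
# Sketch of a cluster-based permutation test on the full source-level TRF.
# trf_wordlist and trf_sentence are assumed arrays of shape
# (n_participants, n_times, n_sources); adjacency defines neighboring sources.
from mne.stats import spatio_temporal_cluster_1samp_test

diff = trf_wordlist - trf_sentence          # within-subject difference
t_obs, clusters, cluster_pv, h0 = spatio_temporal_cluster_1samp_test(
    diff,
    adjacency=adjacency,                    # source-space neighbor structure
    n_permutations=1000,
    tail=0,                                 # two-sided test
    seed=42,
)
significant = [c for c, p in zip(clusters, cluster_pv) if p < 0.05]
```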
Control analysis I: data from ten Oever et al. (2022)
In the delta band, the cluster-based permutation test revealed no significant differences between the word frequency responses in word lists and sentences. To evaluate whether this was because there were no detectable responses or no difference between conditions, we performed one-sample cluster-based permutation tests. Here we observed a response in the sentence condition over a large array of left-posterior sensors that was significant from word onset to ∼400 ms, peaking at ∼200 ms (Fig. 10A). Although Figure 10B suggests a potential response at ∼400 ms in the word list condition, there were no significant clusters. As in the main analysis, there were no significant differences between conditions in the responses to word onset.
The absence of a difference between the conditions and the lack of a detectable response in the word list condition alone make the results from this analysis difficult to interpret in relation to the main analysis. The large difference in sample size (N = 16 vs N = 102) may play a role here. We performed a power analysis on the difference between the conditions in the control analysis, using the average t values from the time points and sensors of the significant clusters from the same contrast in the main analysis. This showed that power would increase on average by 30.7% with a sample of 102 participants, with three clusters reaching a power above 96%. This suggests that the control analysis did not have enough power to confirm or reject the hypothesis that the delayed word list response observed in the main analysis was caused by the different temporal dynamics of the stimuli. We therefore refrain from drawing conclusions on the basis of this finding.
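The power computation can be illustrated as follows; converting a one-sample t value to Cohen's d via d = t/√N and the use of statsmodels are assumptions about the procedure, and `cluster_t` is a hypothetical cluster-average t value.

```python
# Illustrative power analysis: effect sizes are derived from the main
# analysis (N = 102) and power is evaluated at the control sample size
# (N = 16) and at N = 102.
from statsmodels.stats.power import TTestPower

def power_at(t_value, n_est=102, nobs=16, alpha=0.05):
    d = t_value / n_est ** 0.5        # Cohen's d for a one-sample/paired t
    return TTestPower().power(effect_size=d, nobs=nobs, alpha=alpha)

gain = power_at(cluster_t, nobs=102) - power_at(cluster_t, nobs=16)
print(f"power gain from N = 16 to N = 102: {gain:.1%}")
```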
Nevertheless, the ANOVA on the reconstruction accuracies revealed a main effect of model (F(1,15) = 38.01; p < 0.01), indicating that the word frequency predictor enhanced reconstruction accuracy, and an interaction between condition and model (F(1,15) = 6.79; p < 0.05), suggesting that this effect was larger in the sentence condition than in the word list condition (Fig. 10C). There was no main effect of condition (p = 0.16). In the theta band, there were no significant effects on either the TRF waveforms or the accuracy values (Fig. 11).
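This repeated-measures ANOVA can be expressed compactly with statsmodels' AnovaRM; the long-format data frame and its column names are hypothetical.

```python
# Sketch of the 2 (condition) x 2 (model) repeated-measures ANOVA on the
# reconstruction accuracies; df is an assumed long-format DataFrame with
# one row per participant x condition x model cell.
from statsmodels.stats.anova import AnovaRM

res = AnovaRM(
    data=df,
    depvar="accuracy",
    subject="participant",
    within=["condition", "model"],
).fit()
print(res)  # F and p values for both main effects and the interaction
```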
Control analysis II: simulations
To evaluate the effect of differences in interstimulus intervals (i.e., pauses), we simulated raw MEG data consisting of a signal (different impulse responses) and optional noise. Strikingly, the interstimulus interval has no direct influence on the reconstruction score, although the length of the segment on which we estimate the score does (Fig. 12). The difference in interstimulus interval eventually leads to a difference in data length, and the bias in the score observed between conditions is solely because of this difference in duration. The bias, however, is constant and is therefore controlled for when models are compared directly within a condition. Moreover, we actually observe the opposite effect in our MEG analysis: the absolute scores for the longer segment of data (the word lists) are higher than those for the shorter segment (the sentences). This means that our score differences exist above and beyond any bias generated by the stimulus difference.
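The logic of these simulations can be reproduced with a toy example along the following lines; the sampling rate, kernel shape, noise level, and ridge regularization are illustrative choices rather than the parameters of the original simulations, and the in-sample scoring is a simplification of a cross-validated procedure.

```python
# Toy simulation of the ISI effect on TRF reconstruction scores: impulses
# ("words") spaced isi seconds apart are convolved with a fixed kernel,
# noise is added, a TRF is fit by ridge regression, and the fit is scored.
import numpy as np

rng = np.random.default_rng(0)
fs = 120                                   # illustrative sampling rate (Hz)
kernel = np.hanning(int(0.4 * fs))         # assumed word-evoked response

def lagmat(stim, n_lags):
    """Matrix of lagged copies of the stimulus (time x lags)."""
    X = np.zeros((len(stim), n_lags))
    for lag in range(n_lags):
        X[lag:, lag] = stim[:len(stim) - lag]
    return X

def simulate_and_score(isi, n_words=200, noise_sd=1.0):
    n = int(n_words * isi * fs) + len(kernel)
    stim = np.zeros(n)
    stim[(np.arange(n_words) * isi * fs).astype(int)] = 1.0
    meg = np.convolve(stim, kernel)[:n] + rng.normal(0, noise_sd, n)
    X = lagmat(stim, len(kernel))
    w = np.linalg.solve(X.T @ X + 1e2 * np.eye(X.shape[1]), X.T @ meg)
    return np.corrcoef(X @ w, meg)[0, 1]   # in-sample score for brevity

for isi in (0.4, 0.8, 1.6):                # longer ISI -> longer data segment
    print(f"ISI = {isi:.1f} s: score = {simulate_and_score(isi):.3f}")
```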
Discussion
In this study, we asked whether low-frequency neural readouts associated with words systematically changed as a function of being in a sentence context and whether neural readouts were modulated by purely lexical properties over and above sensory and contextual distributional variables. We contrasted responses to word frequency for words in sentences and in word lists, the latter lacking any syntactic structure and combinatorial lexical meaning. We hypothesized that the delta-band, but not theta-band, responses to word frequency would differ between word lists and sentences as a consequence of the (in)availability of sentence context. Specifically, following findings from speech tracking, we expected a stronger presence of the word frequency response in the sentence condition.
Our findings showed that the delta-band response to word frequency differs between word lists and sentences in time and, albeit minimally, in space. In both conditions, word-internal information modulates a response across the left temporal lobe and the frontal cortex. However, this response occurred ∼300 ms earlier in the presence of a coherent sentence context. In addition, in a sentence context, word-internal information modulated activity in the left inferior frontal gyrus at ∼600 ms after word onset, a response that was absent when a word was not embedded in a sentence. Furthermore, the word frequency feature explained more variance over and above the other features in the sentence condition than in the word list condition. In the theta band, there were only minimal differences between the conditions. We discuss these results in more detail below.
In psycholinguistic theories of word recognition, word frequency is often modeled as the baseline activation or the prior probability of a word, for example, in the Logogen model (Morton, 1969), the Cohort model (Marslen-Wilson, 1987), and Shortlist A and B (Norris, 1994; Norris and McQueen, 2008). We therefore assume that the neural readout associated with word frequency reflects neural activity during the process of word recognition. Our results provide direct evidence that this process unfolds differently depending on whether the structure building of sentence comprehension is also occurring. Words are recognized faster when they are embedded in a coherent sentence context (Marslen-Wilson and Welsh, 1978; Tyler and Wessels, 1983); this is reflected in the delayed response to word frequency in the word list condition (Lam et al., 2016).
Furthermore, the reconstruction accuracies in sensor space suggest that the response to word frequency explains more variance in the sentence condition than in the word list condition. This may seem contradictory to findings from psycholinguistics. Indeed, the behavioral effect of word frequency, when assessed with reaction time measures, diminishes in a sentence context (Schuberth and Eimas, 1977; Tyler and Wessels, 1983; Simpson et al., 1989): low-frequency words are recognized more slowly than high-frequency words, but this difference shrinks when the words appear in sentences. This does not necessarily mean, however, that lexical information explains less variance in the neural signal. In fact, studies that consider metrics like mutual information between the brain and the speech signal find that the brain represents aspects of the speech signal more reliably when more linguistic information is present (Kaufeld et al., 2020; ten Oever et al., 2022), whereas the acoustic information in speech matters less for word recognition when the word is embedded in a sentence (Boothroyd and Nittrouer, 1988; Mattys et al., 2012). In general terms, these findings suggest that the brain represents lower-level features more reliably when higher-level information can be inferred, whereas the lower-level information itself becomes less important for the outcome of the task. Indeed, that words are represented more robustly when sentence context is provided is reflected in the accuracy scores on the word-monitoring task performed in this study: participants were more likely to correctly remember whether a word had been mentioned when they had been presented with a sentence than when they had heard a word list.
There are two potential explanations for this finding. First, the perceptual salience of the words in the word list condition leads to a large response to the speech envelope; the responses to lexical features are then of relatively lower power and explain less of the variance in the signal than the lower-level features. Second, as a consequence of words being embedded in larger structures (phrases and sentences), word frequency is likely represented in a larger neural network in the sentence condition than in the word list condition (Martin, 2020). The signal is therefore reconstructed better in a wider array of sensors, leading to an overall larger increase in reconstruction accuracies. As discussed below, the presence of the effect in the control analysis favors the latter interpretation. The propagation of lexical information to a wider network is additionally reflected in the differences between conditions in the inferior frontal gyrus at ∼600 ms. This interpretation is consistent with findings showing that sentence structure influences the dynamics and distribution of neural signals (Blank et al., 2016; Schell et al., 2017; Matchin et al., 2019a,b; Grodzinsky et al., 2021; Bai et al., 2022; Coopmans et al., 2022; ten Oever et al., 2022).
Importantly, both the TRF and the reconstruction accuracy effects of sentence context on the representation of word-internal information are independent of (1) the contextual probability predictors surprisal and entropy and (2) sensory information in the speech envelope. Each of these predictors is undoubtedly important for how the neural signal represents lexical information (e.g., sensory, Doelling et al., 2014; probability, Weissbart et al., 2020). Given that these influences were accounted for by the encoding model, the differences that remain imply a role for abstract structure and meaning in the transformation of low-frequency neural readouts associated with words (or, more minimally, associated with purely lexical features). These conclusions are in line with findings on the visual part of the dataset, not analyzed here (Huizeling et al., 2022).
Also striking is the difference between the effects in the delta and theta bands. In the theta band, the responses to word frequency differed between conditions only slightly: the amplitude of the response was larger in the word list condition than in the sentence condition over right frontal and temporal areas at ∼100 ms, possibly indicating that word frequency, in interaction with contextual information, tunes sensory sampling. The addition of the word frequency predictor had a small effect on the reconstruction accuracies, which was present only in the post hoc analysis. In general, theta-band activity appears to be more sensitive to perceptual than to linguistic aspects of the stimulus. For example, tracking of sound by theta-band activity persists even in the absence of linguistic information (Molinaro and Lizarazu, 2018), whereas it is affected when acoustic edges in the stimulus are experimentally manipulated (Doelling et al., 2014). However, in line with the differences that we do see, Donhauser and Baillet (2020) showed that the gain of early theta responses varies with the contextual uncertainty of speech. The results of the present analysis are consistent with an account in which the theta band is important for speech processing but less central for the representation of higher-level features such as word-internal information. At the same time, the process reflected by theta modulations during language comprehension is likely to be influenced by linguistic context.
In addition to the linguistic differences, there was a variable pause between the words in the word list condition only. To examine the potential effect of this additional difference between the conditions on our results, we ran several simulations. The simulations showed that the ISI between events modeling word-like responses has no effect on model evaluation and TRF estimation. However, there will be a constant bias in the model score that is proportional to the broadband signal-to-noise ratio (where the noise is the additive contribution beyond the variance explained by the linear model). This bias is not directly because of the differences in ISI, but rather because we integrate a larger portion of data in the word list condition, so that more noise contributes to the score. As such, any model comparison contrasting scores within a condition will eliminate the constant bias. Furthermore, this bias leans toward deflating the score of the model evaluated on the longer segment of data (the word list condition). We found that, with the envelope alone, the scores in the word list condition were higher than those in the sentence condition, in direct contrast with the expectations from the simulations. From these simulations we therefore conclude that the delay in the TRF waveform and the interaction effect in the reconstruction of the neural signal are not just because of the difference in signal length between the word list and sentence conditions.
The next question, then, is what the potential cognitive effects of silence between the words are. There are three potential effects: (1) higher perceptual salience of each word, already mentioned above; (2) decreased word rate; and (3) absence of phonological cues between words, such as prosody and coarticulation. (A reviewer suggested we add a prosody predictor. We constructed one by extracting the pitch contour using Parselmouth, a Praat wrapper for Python; see the sketch below. Running the analysis with this extra predictor did not qualitatively change the results.) We consider phonological cues to be consequences of, as well as cues to, the sentence context; they would also differ between word lists and sentences under naturalistic conditions. The first two effects, however, need some consideration.
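Such a prosody predictor can be sketched as follows; the file name, the time step, and the interpolation over unvoiced frames are assumptions about implementation details not reported above.

```python
# Sketch of the prosody predictor: extract the pitch (F0) contour of a
# stimulus audio file with Parselmouth and interpolate over unvoiced frames.
import numpy as np
import parselmouth

snd = parselmouth.Sound("stimulus.wav")        # hypothetical file name
pitch = snd.to_pitch(time_step=1 / 120)        # assumed feature sampling rate
f0 = pitch.selected_array["frequency"]         # F0 in Hz; 0 where unvoiced
f0 = np.where(f0 > 0, f0, np.nan)

# Linear interpolation across unvoiced stretches gives a continuous contour
idx = np.arange(len(f0))
voiced = ~np.isnan(f0)
prosody_predictor = np.interp(idx, idx[voiced], f0[voiced])
```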
As mentioned above, the perceptual difference between two consecutive words is much smaller than the difference between silence and a word. This effect was visible in the speech–brain coherence for both conditions (Fig. 1; coherence was much higher in the word list condition in the delta band) and caused overall higher reconstruction accuracy in the word list condition. Importantly, in the analysis of a second dataset in which this difference between conditions did not exist, the interaction effect between word lists and sentences was replicated: the word frequency feature explained more variance over and above the envelope and word onset predictors in the sentence condition than in the word list condition. Furthermore, we reasoned that a general delaying effect on word processing caused by the decreased word rate in the word list condition should be visible for other features as well. Yet the word onset feature, the only feature besides word frequency that was numerically identical between conditions, did not show such a delay. These findings indicate that only the response to word-internal information was delayed, suggesting that the brain processes lexical information later in the absence of a coherent sentence context. Taken together, this indicates that the effects described in this work are unlikely to be driven by silence.
In summary, this study suggests that delta-band and, to a lesser extent, theta-band responses to word-internal information are affected by sentence context in time and in space. Given that a difference in the encoding of a strictly lexical feature persists when context-driven lexical features like entropy and surprisal are added, we conclude that low-frequency responses to word-internal information are changed by sentential structure and meaning, not by probabilistic differences alone. In the delta band, a lexical response across the posterior and anterior left temporal lobe and the bilateral parietal lobe is delayed in the absence of sentence context. In addition, whether a word is embedded in a sentence context determines whether inferior frontal areas are responsive to lexical information. In the theta band, a larger amplitude in the word lists at ∼100 ms across right frontal and parietal areas suggests that linguistic information can tune sensory sampling. This study also shows that the TRF can be used to model acoustic differences between stimuli when measuring higher-level linguistic effects (Bai et al., 2022). Together, the results show how the neural representation of words is affected by the linguistic structure of sentence context and as such provide initial insight into how the brain instantiates compositionality in language processing.
Footnotes
A.E.M. was supported by an Independent Max Planck Research Group and a Lise Meitner Research Group "Language and Computation in Neural Systems," by NWO Vidi Grant 016.Vidi.188.029 to A.E.M., and by Big Question 5 (to Prof. Dr. Roshan Cools and Dr. Andrea E. Martin) of the Language in Interaction Consortium funded by NWO Gravitation Grant 024.001.006 to Prof. Dr. Peter Hagoort. H.W. was supported by NWO Vidi Grant 016.Vidi.188.029 to A.E.M. We thank Laurel Brehm for statistical advice; Inge Pasman, Esther de Kerf, Carlijn Herpt, and Dennis Joosen for research assistance; and the members of the Psychology of Language Department at the Max Planck Institute for valuable input on earlier versions of this project.
The authors declare no competing financial interests.
- Correspondence should be addressed to Sophie Slaats at sophie.slaats@mpi.nl