Abstract
Neural oscillations track linguistic information during speech comprehension (Ding et al., 2016; Keitel et al., 2018), and are known to be modulated by acoustic landmarks and speech intelligibility (Doelling et al., 2014; Zoefel and VanRullen, 2015). However, studies investigating linguistic tracking have either relied on non-naturalistic isochronous stimuli or failed to fully control for prosody. Therefore, it is still unclear whether low-frequency activity tracks linguistic structure during natural speech, where linguistic structure does not follow such a palpable temporal pattern. Here, we manipulated the presence of semantic and syntactic information apart from the timescale of its occurrence, while carefully controlling for the acoustic-prosodic and lexical-semantic information in the signal. Electroencephalography (EEG) was recorded while 29 adult native speakers (22 women, 7 men) listened to naturally spoken Dutch sentences, jabberwocky controls with morphemes and sentential prosody, word lists with lexical content but no phrase structure, and acoustically matched backward controls. Mutual information (MI) analysis revealed sensitivity to linguistic content: MI was highest for sentences at the phrasal (0.8–1.1 Hz) and lexical (1.9–2.8 Hz) timescales, suggesting that delta-band activity is modulated by lexically driven combinatorial processing beyond prosody, and that linguistic content (i.e., structure and meaning) organizes neural oscillations beyond the timescale and rhythmicity of the stimulus. This pattern is consistent with neurophysiologically inspired models of language comprehension (Martin, 2016, 2020; Martin and Doumas, 2017) where oscillations encode endogenously generated linguistic content over and above exogenous or stimulus-driven timing and rhythm information.
SIGNIFICANCE STATEMENT Biological systems like the brain encode their environment not only by reacting in a series of stimulus-driven responses, but by combining stimulus-driven information with endogenous, internally generated, inferential knowledge and meaning. Understanding language from speech is the human benchmark for this. Much research focuses on the purely stimulus-driven response, but here, we focus on the goal of language behavior: conveying structure and meaning. To that end, we use naturalistic stimuli that contrast acoustic-prosodic and lexical-semantic information to show that, during spoken language comprehension, oscillatory modulations reflect computations related to inferring structure and meaning from the acoustic signal. Our experiment provides the first evidence to date that compositional structure and meaning organize the oscillatory response, above and beyond prosodic and lexical controls.
- combinatorial processing
- lexical semantics
- mutual information
- neural oscillations
- prosody
- sentence comprehension
Introduction
How the brain maps the acoustics of speech onto abstract structure and meaning during spoken language comprehension remains a core question across cognitive science and neuroscience. A large body of research has shown that neural populations closely track the envelope of the speech signal, which correlates with the syllable rate (Peelle and Davis, 2012; Zoefel and VanRullen, 2015; Kösem et al., 2018), yet much less is known about the degree to which neural responses encode higher-level linguistic information such as words, phrases, and clauses. While previous studies suggest a crucial role for delta-band oscillations in the top-down generation of hierarchically structured linguistic representations (Ding et al., 2016; Keitel et al., 2018), they have so far either relied on non-naturalistic stimuli or failed to fully control for prosody. Here, we use a novel experimental design that allows us to investigate how structure and meaning shape the tracking of higher-level linguistic units, while using naturalistic stimuli and carefully controlling for prosodic fluctuations.
The strongest evidence for tracking of linguistic information so far comes from studies by Ding et al. (2016, 2017a), who found enhanced activity in the delta frequency range for sentences compared with word lists. They investigated this using isochronous, synthesized stimuli devoid of prosodic information. Yet phrases, clauses, and sentences usually do have acoustic-prosodic correlates (e.g., pauses, intonational contours, final lengthening, fundamental frequency reset; Eisner and McQueen, 2018). These correlates might not be as prominent in the modulation spectrum of speech as syllables (Ding et al., 2017b), but listeners draw on them during language comprehension and learning (Soderstrom et al., 2003). As such, the results of Ding et al. (2017a) cannot clearly distinguish between the generation of linguistic structure and meaning and the generation of inferred prosody, and it is unclear whether they generalize to naturalistic stimuli, where the timing of linguistic units is more variable.
Taking an almost orthogonal approach to Ding et al. (2016, 2017a), Keitel et al. (2018) used naturalistic stimuli and found enhanced tracking (compared with reversed controls) in the delta-theta frequency range. However, because they did not include a systematic control for linguistic content, it is unclear whether their results were driven by tracking of prosodic information in the acoustic signal rather than of linguistic information.
In the current study, we bridge this gap by contrasting two core sources of linguistic representations: prosodic structure, which can, but does not always, correlate with syntactic and information structure, and lexical semantics, which is present even in isolated words and concepts. Participants listened to naturally spoken, structurally homogeneous sentences, jabberwocky items (containing sentence-like prosody, but no lexical semantics), and word lists (containing lexical semantics, but no sentence-like structure or prosody; see Table 1 for examples). Additionally, we used reversed speech as the core control of our experiment, because each backward stimulus has a modulation spectrum identical to that of its forward counterpart.
Example items in Sentence, Jabberwocky, and Wordlist conditions
Using electroencephalography (EEG), we analyzed tracking at linguistically relevant timescales as quantified by mutual information (MI)—a typical measure of neural tracking that captures the informational similarity between two signals (Cogan and Poeppel, 2011; Gross et al., 2013; Kayser et al., 2015; Keitel et al., 2017, 2018). Figure 1 shows an overview of the experimental design and analysis pipeline.
Experimental design and analysis pipeline. Participants listened to sentences, jabberwocky items, and word lists while their brain response was recorded using EEG. Step 1: Speech Processing: 1.1, the speech signal is annotated for the occurrence of phrases, words, and syllables in the stimuli, and, based on this, frequency bands of interest for each of the linguistic units can be identified; 1.2, a cochlear filter is applied to the speech stimuli and the amplitude envelope is extracted. Step 2: further processing is identical for both speech and EEG modalities: 2.1, band-pass filters are applied in the previously identified frequency bands of interest; 2.2, Hilbert transforms are computed on each filtered signal, and the real and imaginary parts of the Hilbert transform output are used for further analysis. Step 3: MI computation; mutual information is computed between the preprocessed speech and EEG signals in each of the three conditions and their respective backward controls.
We hypothesize that neural tracking (“entrainment in the broad sense,” as defined by Obleser and Kayser, 2019) will be stronger for stimuli containing higher-level linguistic structure and meaning, above and beyond the acoustic-prosodic (jabberwocky) and lexical-semantic (word list) controls. This may reflect a process of perceptual inference (Martin, 2016, 2020), whereby biological systems like the brain encode their environment not only by reacting in a series of stimulus-driven responses, but by combining stimulus-driven information with endogenous, internally generated, inferential knowledge and meaning (Meyer et al., 2019). In sum, our study offers novel insights into how structure and meaning influence the neural response to natural speech above and beyond prosodic modulations and word-level meaning.
Materials and Methods
Participants
Thirty-five native Dutch speakers (26 females, 9 males; age range, 19–32 years; mean age, 23 years) participated in the experiment. They were recruited from the Max Planck Institute for Psycholinguistics (MPI) participant database and gave written informed consent under a protocol approved by the Ethics Committee of the Social Sciences Department of Radboud University (project code: ECSW2014-1003-196a). Six participants were excluded from the analysis because of excessive artifact contamination, leaving 29 participants in the final sample. All participants reported normal hearing and were remunerated for their participation.
Materials
The experiment used the following three conditions: Sentence, Jabberwocky, and Wordlist. Eighty sets (triplets) of the three conditions (Sentence, Jabberwocky, Wordlist) were created, resulting in 240 stimuli. In addition to one “standard” forward presentation of each stimulus, participants also listened to a version of each of the stimuli played backward, thus resulting in a total of 480 stimuli.
Each stimulus consisted of 10 Dutch words, all disyllabic except for "de" (the) and "en" (and), resulting in 18 syllables in total. Sentences all consisted of two coordinate clauses, which followed the structure [Adj N V N Conj Det Adj N V N]. Word lists consisted of the same 10 words as in the Sentence condition, but scrambled in syntactically implausible ways (either [V V Adj Adj Det Conj N N N N] or [N N N N Det Conj V V Adj Adj], to avoid any plausible internal combinations of words). Jabberwocky items were created using the Wuggy pseudoword generator (Keuleers and Brysbaert, 2010), following the same syntactic structure as the Sentences. Specifically, standard Wuggy parameters were set to match two of three subsyllabic segments wherever possible, as well as letter length, transition frequencies, and length of subsyllabic segments. The lexicality feature of Wuggy was used to ensure that none of the generated pseudowords were existing lexical items in Dutch. In addition, all pseudowords were proofread by native Dutch speakers to ensure that none of their phonetic forms matched that of an existing word in Dutch. Inflectional morphemes (e.g., plural morphemes) as well as function words ("de" - the and "en" - and) were kept unchanged. Table 1 shows an example of stimuli in each condition. (Please see https://osf.io/rv5y7/ for a list of all 480 stimuli and their translations.)
Forward stimuli were recorded by a female native speaker of Dutch in a sound-attenuating recording booth. All stimuli were recorded at a sampling rate of 44.1 kHz (mono), using the Audacity sound recording and analysis software (Audacity Team, 2014). After recording, pauses were normalized to ∼150 ms in all stimuli, and the intensity was scaled to 70 dB using the Praat voice analysis software (Boersma and Weenink, 2020). Stimuli from all three conditions were then reversed using Praat. Figure 2 shows modulation spectra for forward and backward conditions.
Modulation spectra of forward and backward stimuli. Green, Sentence; orange, Jabberwocky; purple, Wordlist. Modulation spectra were calculated following the procedure and MATLAB script described in the study by Ding et al. (2017b). Note that a cochlear filter is applied to the acoustic stimuli, but not the brain data. Small deviations between the modulation spectrum of each forward condition and its backward counterpart are because of numerical inaccuracy; mathematically, the frequency components of forward and backward stimuli are identical.
Procedure
Participants were tested individually in a sound-attenuating, Faraday cage-enclosed booth. They first completed a practice session with four trials (one from each forward condition and one backward example) to become familiarized with the experiment. All 80 stimuli from each condition were presented to the participants in separate blocks. The order of the blocks was pseudorandomized across listeners, and the order of the items within each block was randomized. During each trial, participants were instructed to look at a fixation cross, which was displayed at the center of the screen (to minimize eye movements during the trial), and listen to the audio, which was presented to them at a comfortable loudness level. The audio recording was presented 500 ms after the fixation cross appeared on the screen, and the fixation cross remained on the screen for the entire duration of the recording. Fifty milliseconds after the end of each recording, the screen changed to a transition screen [a series of hash symbols (#####) indicating that participants could blink and briefly rest their eyes], after which participants could advance to the next item via a button press. After each block, participants were allowed to take a self-paced break. The experiment was run using the Presentation software (Neurobehavioral Systems) and took ∼50–60 min to complete.
EEG was continuously recorded with a 64-channel EEG system (MPI equidistant montage) connected to a BrainAmp amplifier using BrainVision Recorder software, digitized at a sampling rate of 500 Hz, and referenced online to the left mastoid. The time constant for the hardware high-pass filter was 10 s (0.016 Hz; first-order Butterworth filter with 6 dB/octave), and the high-cutoff frequency was 249 Hz. Electrode impedances were kept below 25 kΩ. Data were rereferenced offline to the average reference.
EEG data preprocessing
The analysis steps were conducted using the FieldTrip Analysis Toolkit, revision 20180320 (Oostenveld et al., 2011), on MATLAB version 2016a (MathWorks). The raw EEG signal was segmented into a series of variable-length epochs, starting 200 ms before the onset of the utterance and lasting until 200 ms after its end. The signal was low-pass filtered at 70 Hz, and a bandstop filter centered at ∼50 Hz (±2 Hz) was applied to each epoch to exclude line noise [both zero-phase FIR (finite impulse response) filters using Hamming windows]. All data were visually inspected, and channels contaminated with excessive noise were excluded from the analysis. Independent component analysis was performed on the remaining channels, and components related to eye movements, blinking, or motion artifacts were subtracted from the signal. Epochs containing voltage fluctuations exceeding ±100 μV or exceeding a range of 150 μV were excluded from further analysis. We selected a cluster of 22 electrodes for all further analyses based on previous studies that found broadly distributed effects related to sentence processing (Kutas and Federmeier, 2000; Kutas et al., 2006; see also Ding et al., 2017a). Specifically, the electrode selection included the following electrodes: 1, 2, 3, 4, 5, 8, 9, 10, 11, 28, 29, 30, 31, 33, 34, 35, 36, 37, 40, 41, 42, and 43 (electrode names based on the MPI equidistant layout). We note that our results also hold for all electrodes, as described in the Results section below.
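A minimal FieldTrip-style sketch of these preprocessing steps is given below for illustration; it is not the pipeline used for the reported analyses. The file name, trial-definition matrix, filter-type settings, and component indices are placeholders or assumptions.

```matlab
% Illustrative FieldTrip (MATLAB) sketch of the EEG preprocessing described
% above; settings not stated in the text are assumptions.
cfg             = [];
cfg.dataset     = 'subject01.vhdr';  % hypothetical BrainVision recording
cfg.trl         = trl;               % assumed [begsample endsample offset] per utterance (-200/+200 ms)
cfg.lpfilter    = 'yes';             % low-pass at 70 Hz
cfg.lpfreq      = 70;
cfg.lpfilttype  = 'fir';             % zero-phase FIR (Hamming window)
cfg.bsfilter    = 'yes';             % band-stop around line noise (~50 +/- 2 Hz)
cfg.bsfreq      = [48 52];
cfg.bsfilttype  = 'fir';
cfg.reref       = 'yes';
cfg.refchannel  = 'all';             % offline average reference
data            = ft_preprocessing(cfg);

% Independent component analysis; ocular and motion components are
% identified by visual inspection and then removed.
cfg             = [];
cfg.method      = 'runica';
comp            = ft_componentanalysis(cfg, data);

cfg             = [];
cfg.component   = [1 3];             % hypothetical artifact components
dataClean       = ft_rejectcomponent(cfg, comp, data);
```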
Speech preprocessing
For each stimulus, we computed the wideband speech envelope at a sampling rate of 150 Hz following the procedure reported by Keitel et al. (2018) and others (Gross et al., 2013; Keitel et al., 2017). We first filtered the acoustic waveforms into eight frequency bands (100–8000 Hz; third-order Butterworth filter, forward and reverse), equidistant on the cochlear frequency map (Smith et al., 2002). We then estimated the wideband speech envelope by computing the magnitude of the Hilbert transformed signal in each band and averaging across bands.
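For illustration, a MATLAB sketch of this envelope computation is shown below. The ERB-rate spacing of the eight band edges and the variable names are assumptions; only the general procedure (band-pass filtering, Hilbert magnitude, averaging, downsampling) follows the description above.

```matlab
% Sketch of the wideband envelope computation (illustrative; 'wav' is the
% mono waveform of one stimulus, and the ERB-rate band spacing is assumed).
fs     = 44100;                      % stimulus sampling rate (Hz)
fsOut  = 150;                        % target envelope sampling rate (Hz)
nBands = 8;

erb    = @(f) 21.4 * log10(0.00437 * f + 1);    % Hz -> ERB-rate
erbInv = @(e) (10.^(e / 21.4) - 1) / 0.00437;   % ERB-rate -> Hz
edges  = erbInv(linspace(erb(100), erb(8000), nBands + 1));

env = zeros(size(wav));
for b = 1:nBands
    [bb, aa] = butter(3, edges(b:b+1) / (fs/2), 'bandpass');  % third-order Butterworth
    bandSig  = filtfilt(bb, aa, wav);                         % forward and reverse
    env      = env + abs(hilbert(bandSig));                   % band-limited envelope
end
env    = env / nBands;               % average across bands
env150 = resample(env, fsOut, fs);   % downsample to 150 Hz
```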
The timescales of interest for further mutual information analysis were identified in a fashion similar to that described in the study by Keitel et al. (2018). We first annotated the occurrence of linguistic units (phrases, words, and syllables) in the speech stimuli. Here, phrases were defined as adjective-noun/noun-verb combinations (e.g., in the Sentence condition: “bange helden” – timid heroes; “plukken bloemen” – pluck flowers, and so on; in the Jabberwocky condition: “garge ralden” – flimid lerops etc.; in the Wordlist condition, a “pseudo-phrase” corresponds to adjacent noun–noun, verb–verb, and adjective–adjective pairs; e.g., “helden bloemen” – heroes flowers). Unit-specific bands of interest were then identified by converting each of the rates into frequency ranges across conditions. This resulted in the following bands: 0.8–1.1 Hz (phrases); 1.9–2.8 Hz (words); and 3.5–5.0 Hz (syllables). Note that the problem the brain faces during spoken language comprehension is even more complex than this, because the timescales of linguistic units can highly overlap, even within a single sentence (Obleser et al., 2012). Populations of neurons that “entrain” to words will thus also have to be sensitive to information that occurs outside of these—rather narrow—frequency bands.
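A small sketch of how unit-specific bands can be derived from such annotations is given below; how exactly the rate distributions were summarized into the reported ranges is not restated here, so the summary statistics are assumptions.

```matlab
% Illustrative derivation of a unit-specific frequency band from annotations:
% the per-stimulus rate of a unit (e.g., phrases) is its count divided by the
% stimulus duration, and the band spans the range of rates across stimuli.
% 'nPhrases' and 'stimDur' are assumed per-stimulus vectors.
phraseRate = nPhrases ./ stimDur;                  % phrases per second
bandPhrase = [min(phraseRate), max(phraseRate)];   % e.g., approximately 0.8-1.1 Hz
```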
For an additional, exploratory annotation-based MI analysis (see Results, subsection Tracking of abstract linguistic units), we further created linguistically abstracted versions of our stimuli. Specifically, our aim was to create annotations that captured linguistic information at the phrase frequency entirely independently of the acoustic signal. Based on the word-level annotations of our stimuli, we created dimensionality-reduced arrays for further analysis (see the "Semantic composition" analyses reported by Brodbeck et al., 2018). Specifically, we identified all time points in the spoken materials where words could be integrated into phrases and marked each of the words associated with phrase composition [e.g., in a sentence such as "bange helden plukken bloemen en de bruine vogels halen takken" (timid heroes pluck flowers and the brown birds gather branches), the words "helden" (heroes), "bloemen" (flowers), "vogels" (birds), and "takken" (branches) were marked]. These critical words were coded as 1 for their entire duration, while all other time points (samples) were coded as 0 (Brodbeck et al., 2018). This resulted in an abstract "spike train" array of phrase-level structure building that is independent of the acoustic envelope. We repeated this procedure for all items individually in all three conditions, since our stimuli were naturally spoken and thus differed slightly in duration and time course. Note that, consequently, this "phrase-level composition array" is somewhat arbitrary for the Wordlist condition, as there are, by definition, no phrases in a word list; "pseudo-phrases" were annotated in the same way, as shown in Table 1. The procedure is visualized in Figure 3.
Visualization of the phrase-level annotations (inspired by Brodbeck et al., 2018, their Fig. 2). Across time, the response array takes value 0 for words that cannot (yet) be integrated into phrases, and value 1 for words that can, resulting in a “pulse train” array.
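The construction of these annotation arrays can be sketched as follows (illustrative MATLAB; the word-level timing variables and the phrase-closure flag are assumed inputs derived from the annotations described above).

```matlab
% Build a 0/1 "pulse train" of phrase-level composition at the envelope
% sampling rate. 'wordOnsets'/'wordOffsets' are word times in seconds and
% 'closesPhrase' marks words at which a phrase can be composed (assumed inputs).
fsOut     = 150;                              % samples per second
nSamples  = round(stimDur * fsOut);           % 'stimDur' = stimulus duration (s)
phraseVec = zeros(nSamples, 1);

for w = 1:numel(wordOnsets)
    if closesPhrase(w)                        % e.g., "helden" in "bange helden"
        i1 = max(1, round(wordOnsets(w)  * fsOut));
        i2 = min(nSamples, round(wordOffsets(w) * fsOut));
        phraseVec(i1:i2) = 1;                 % 1 for the word's entire duration
    end
end
```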
Mutual information analysis
We used MI to quantify the statistical dependency between the speech envelopes and the EEG recordings, following the procedure described in the study by Keitel et al. (2018; see also Gross et al., 2013; Kayser et al., 2015; Keitel et al., 2017). Based on the previously identified frequency bands of interest (see subsection "Speech preprocessing" above), we filtered both speech envelopes and EEG signals in each band (third-order Butterworth filter, forward and reverse). We then computed the Hilbert transform in each band, which resulted in two sets of two-dimensional variables (one for the speech signals and one for the EEG responses) in each condition (forward and backward; see Ince et al., 2017, for a more in-depth description). To take the brain–stimulus lag into account, we computed MI at five different lags, ranging from 60 to 140 ms in steps of 20 ms. To exclude strong auditory-evoked responses to the onset of auditory stimulation in each trial, we excluded the first 200 ms of each stimulus–signal pair. MI values from all five lags were averaged for subsequent statistical evaluation. We further concatenated all trials from the speech and brain signals to increase the robustness of the MI computation (Keitel et al., 2018). In addition to computing "general" MI (containing information about both phase and power), we also isolated the part of the Hilbert transform corresponding to phase and computed "phase MI" values separately.
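To make the MI computation concrete, the sketch below implements a hand-rolled Gaussian-copula MI estimator in the spirit of Ince et al. (2017) for one frequency band, one EEG channel, and one condition. The bias correction of the original estimator is omitted and the variable names are placeholders, so this is illustrative rather than the analysis code.

```matlab
% Illustrative Gaussian-copula MI between a band-pass-filtered speech
% envelope ('spEnv') and a band-pass-filtered EEG channel ('eeg'),
% both column vectors concatenated over trials at a common sampling rate.
fsig = 150;
lags = round((0.060:0.020:0.140) * fsig);   % 60-140 ms in 20-ms steps
skip = round(0.200 * fsig);                 % drop the first 200 ms

% Copula-normalization of a vector: rank-transform, then map to a Gaussian.
copnorm = @(x) norminv(tiedrank(x) ./ (numel(x) + 1));

spA  = hilbert(spEnv);                      % analytic signal of the envelope
eegA = hilbert(eeg);                        % analytic signal of the EEG

mi = zeros(numel(lags), 1);
for k = 1:numel(lags)
    s = spA(skip+1 : end-lags(k));          % speech at time t
    r = eegA(skip+1+lags(k) : end);         % brain response at time t + lag
    X = [copnorm(real(s)), copnorm(imag(s))];   % 2D speech variable
    Y = [copnorm(real(r)), copnorm(imag(r))];   % 2D EEG variable
    C = cov([X, Y]);
    % MI (in bits) between jointly Gaussian variables:
    % I(X;Y) = 0.5 * log2( det(Cxx) * det(Cyy) / det(C) )
    mi(k) = 0.5 * log2(det(C(1:2,1:2)) * det(C(3:4,3:4)) / det(C));
end
miAvg = mean(mi);                           % average over the five lags
```

For the phase-only analysis, the real and imaginary parts above would be replaced by the cosine and sine of the instantaneous phase (the angle of the analytic signal).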
Statistical analysis
To test whether the statistical dependency between the speech envelope and the EEG data as captured by MI was modulated by the linguistic structure and content of the stimulus, we compared MI values in all three frequency bands separately. Linear mixed models were fitted to the log-transformed, trimmed (5% on each end of the distribution) MI values in each frequency band using lme4 (Bates et al., 2015) in R (R Core Team, 2018). Models included main effects of Condition (three levels: Sentence, Wordlist, Jabberwocky) and Direction (two levels: Forward, Backward), as well as their interaction. All models included by-participant random intercepts and random slopes for the Condition * Direction interaction. For model coefficients, degrees of freedom were approximated using Satterthwaite's method, as implemented in the package lmerTest (Kuznetsova et al., 2017). We used treatment coding in all models, with Sentence being the reference level for Condition and Forward the reference level for Direction. We then computed all pairwise comparisons within each direction using estimated marginal means (Tukey's correction for multiple comparisons) with emmeans (Lenth, 2018) in R (i.e., comparing Sentence Forward to Jabberwocky Forward and Wordlist Forward, but never Sentence Forward to Jabberwocky Backward, because we had no hypotheses about these comparisons). The same statistical analyses, including identical model structures, were further applied to MI values computed on the isolated phase coefficients.
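The model specification can be illustrated with MATLAB's fitlme as a rough stand-in (the reported analyses used lme4, lmerTest, and emmeans in R); the table construction below is assumed, and the Satterthwaite degrees of freedom and Tukey-corrected pairwise contrasts are not reproduced here.

```matlab
% Rough MATLAB stand-in for the mixed model described above (illustrative).
% 'tbl' is assumed to contain one row per participant x Condition x Direction
% cell, with the log-transformed, trimmed MI value in 'logMI'.
tbl.participant = categorical(tbl.participant);
tbl.Condition   = categorical(tbl.Condition, {'Sentence','Jabberwocky','Wordlist'});
tbl.Direction   = categorical(tbl.Direction, {'Forward','Backward'});

lme = fitlme(tbl, ...
    'logMI ~ Condition*Direction + (1 + Condition*Direction | participant)', ...
    'DummyVarCoding', 'reference');   % treatment coding; first level = reference
disp(lme.Coefficients)                % fixed effects (Sentence/Forward as reference)
```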
For the exploratory dimensionality-reduced MI analysis, we performed the same set of statistical analyses (but in only a single frequency band). Specifically, we fitted a linear mixed model to the log-transformed, trimmed MI values, including main effects of Condition (three levels: Sentence, Wordlist, Jabberwocky) and Direction (two levels: Forward, Backward), their interaction, and by-participant random intercepts and random slopes for the Condition * Direction interaction. We then computed estimated marginal means exactly as described in the previous section.
Results
Speech tracking
We computed MI between the Hilbert-transformed EEG time series and the Hilbert-transformed speech envelopes within three frequency bands of interest that corresponded to the occurrence rates of phrases (0.8–1.1 Hz), words (1.9–2.8 Hz), and syllables (3.5–5.0 Hz) in a cluster of central electrodes.
Specifically, we designed our experiment to assess whether the brain response is driven by the (quasi-)periodic temporal occurrence of linguistic structures and prosody, or whether it is modulated as a function of the linguistic content of those structures. Using MI allowed us to quantify and compare the degree of speech tracking across sentences, word lists, and jabberwocky items.
Our analyses revealed condition-dependent enhanced MI at distinct timescales for the forward conditions (Fig. 4). In the phrase frequency band (0.8–1.1 Hz), the mixed-effects model revealed a significant effect of Condition (Sentence = treatment level; Jabberwocky: β = −0.452, SE = 0.096, p < 0.001; Wordlist: β = −0.491, SE = 0.116, p < 0.001) and Direction (Forward = treatment level; Backward: β = −0.885, SE = 0.117, p < 0.001), as well as Condition * Direction interactions (Jabberwocky * Backward: β = 0.429, SE = 0.152, p = 0.008; Wordlist * Backward: β = 0.523, SE = 0.185, p = 0.009). The estimated marginal means corroborated these results, revealing significant pairwise effects only between the Forward conditions (Sentence–Jabberwocky: Δ = 0.452, SE = 0.098, p < 0.001; Sentence–Wordlist: Δ = 0.491, SE = 0.118, p < 0.001; all results corrected with Tukey's test for multiple comparisons), but not the backward controls. The observation that none of the effects was present in the backward speech controls demonstrates that they were not driven by the acoustic properties of the stimuli (Table 2).
Estimated marginal means for mutual information (log-transformed) in the phrase and word frequency bands over a subset of central electrodes
MI between speech signal and brain response. a, MI for Sentences (green), Jabberwocky items (orange), and Wordlists (purple) for phrase, word, and syllable timescales across central electrodes (each dot represents one participant's mean MI response averaged across electrodes). b, Average scalp distribution of MI per condition and band, averaged across participants. Raincloud plots were made using the Raincloud package in R (Allen et al., 2019).
In the word frequency band (1.9–2.8 Hz), the mixed-effects model revealed a significant effect of Condition (Sentence = treatment level; Jabberwocky: β = −0.484, SE = 0.121, p < 0.001) and Direction (Forward = treatment level; Backward: β = −0.499, SE = 0.136, p < 0.001). The pairwise contrasts further revealed that this Sentence–Jabberwocky difference was only significant for the forward conditions (Δ = 0.484, SE = 0.123, p = 0.001), not for the backward controls (Table 2). Again, this finding indicates that the differences we observed were not driven by differences in the acoustic signals themselves.
In the syllable frequency range (3.5–5.0 Hz), the mixed-effects model revealed no significant effects of Condition (Sentence = treatment level; Jabberwocky: β = 0.001, SE = 0.121, p = 0.994; Wordlist: β = 0.104, SE = 0.109, p = 0.348) or Direction (Forward = treatment level; Backward: β = 0.034, SE = 0.120, p = 0.779), and no interaction between the two (Jabberwocky * Backward: β = 0.144, SE = 0.166, p = 0.392; Wordlist * Backward: β = −0.069, SE = 0.144, p = 0.637).
Together, these findings indicate that neural tracking is enhanced for linguistic structures at timescales specific to the role of that structure in the unfolding meaning of the sentence, consistent with neurophysiologically inspired models of language comprehension (Martin, 2016, 2020; Martin and Doumas, 2017).
An almost identical pattern of results emerged when computing MI over all electrodes (rather than a cluster of central ones). In the phrase frequency range, the mixed-effects model revealed significant effects of Condition (Jabberwocky: β = −0.401, SE = 0.075, p < 0.001; Wordlist: β = −0.418, SE = 0.088, p < 0.001) and Direction (Backward: β = −0.743, SE = 0.087, p < 0.001), as well as significant Condition * Direction interactions (Jabberwocky * Backward: β = 0.296, SE = 0.099, p = 0.006; Wordlist * Backward: β = 0.332, SE = 0.134, p = 0.019). In the word frequency range, the model revealed significant effects of Condition (Jabberwocky: β = −0.407, SE = 0.093, p < 0.001; Wordlist: β = −0.179, SE = 0.052, p = 0.002) and Direction (β = −0.316, SE = 0.090, p = 0.002), but not their interaction. For the forward conditions, the pairwise comparisons further confirmed significantly higher MI for sentences compared with jabberwocky items (Sentence Forward–Jabberwocky Forward: Δ = 0.407, SE = 0.095, p < 0.001) and sentences compared with word lists (Sentence Forward–Wordlist Forward: Δ = 0.179, SE = 0.053, p = 0.006). Surprisingly, we also found significantly enhanced MI for sentences compared with jabberwocky items in the backward conditions in the word frequency range (Sentence Backward–Jabberwocky Backward: Δ = 0.288, SE = 0.083, p = 0.005), so we cannot exclude the possibility that this effect is driven to some extent by differences in the acoustic signal. Note, however, that the estimate of this effect is smaller for the backward comparison than for the corresponding forward comparison.
Again, there were no significant effects in the syllable frequency range when computing MI over all electrodes (Condition: Sentence = treatment level; Jabberwocky: β = 0.035, SE = 0.106, p = 0.743; Wordlist: β = 0.037, SE = 0.090, p = 0.684; Direction: Forward = treatment level; Backward: β = 0.147, SE = 0.124, p = 0.246; Jabberwocky * Backward: β = −0.023, SE = 0.131, p = 0.859; Wordlist * Backward: β = −0.045, SE = 0.130, p = 0.733).
Phase MI
When computing MI on the isolated phase values from the Hilbert transform, we again found condition-dependent differences at distinct timescales (Fig. 5, Table 3).
Estimated marginal means for phase MI (log-transformed) in the phrase and word frequency bands over a subset of central electrodes
MI between the isolated phase of speech signals and brain responses for Sentences (green), Jabberwocky items (orange), and Wordlists (purple) for phrase, word and syllable timescales across central electrodes (each dot represents one participant's mean MI response averaged across electrodes).
In the phrase frequency band (0.8–1.1 Hz), the models revealed significant effects of Condition (Sentence = treatment level; Jabberwocky: β = −0.497, SE = 0.097, p < 0.001; Wordlist: β = −0.402, SE = 0.118, p = 0.002) and Direction (β = −0.805, SE = 0.106, p < 0.001), as well as their interaction (Jabberwocky * Backward: β = 0.368, SE = 0.150, p = 0.020). For the forward conditions, the pairwise contrasts further corroborated these results, with sentences eliciting higher phase MI than jabberwocky items (Sentence Forward–Jabberwocky Forward: Δ = 0.497, SE = 0.099, p < 0.001) and sentences eliciting higher phase MI than word lists (Sentence Forward–Wordlist Forward: Δ = 0.402, SE = 0.120, p = 0.006; again, all results were corrected by Tukey's test for multiple comparisons).
In the word frequency band (1.9–2.8 Hz), the mixed-effects model revealed a significant effect of Condition (Sentence = treatment level; Jabberwocky: β = −0.380, SE = 0.121, p = 0.004) and Direction (β = −0.474, SE = 0.126, p < 0.001). The pairwise contrasts further revealed significantly higher MI for forward sentences compared with forward jabberwocky items (Sentence Forward–Jabberwocky Forward: Δ = 0.380, SE = 0.123, p = 0.012), but not their backward controls. Again, this result demonstrates that the effect is not driven by the acoustic properties of the stimuli (see Table 3 for all pairwise contrasts).
Computing phase MI over all electrodes (rather than a cluster of central ones) revealed a similar pattern of results. In the phrase frequency range, the mixed model revealed significant effects of Condition (Sentence = treatment level; Jabberwocky: β = −0.356, SE = 0.075, p < 0.001; Wordlist: β = −0.309, SE = 0.089, p = 0.002), Direction (Forward = treatment level; Backward: β = −0.662, SE = 0.076, p < 0.001), and their interaction (Jabberwocky * Backward: β = 0.185, SE = 0.089, p = 0.047). The estimated marginal means showed significant pairwise comparisons only in the forward conditions, with forward sentences showing higher phase MI than forward jabberwocky items and forward word lists (Sentence Forward–Jabberwocky Forward: Δ = 0.356, SE = 0.076, p < 0.001; Sentence Forward–Wordlist Forward: Δ = 0.309, SE = 0.091, p = 0.005), and no significant effects for the backward comparisons (Sentence Backward–Jabberwocky Backward: Δ = 0.171, SE = 0.102, p = 0.227; Sentence Backward–Wordlist Backward: Δ = 0.099, SE = 0.110, p = 0.644; Jabberwocky Backward–Wordlist Backward: Δ = −0.072, SE = 0.125, p = 0.833).
In the word frequency band, the mixed-effects model revealed significant effects of Condition (Sentence = treatment level; Jabberwocky: β = −0.329, SE = 0.089, p < 0.001; Wordlist: β = −0.139, SE = 0.045, p = 0.005) and Direction (β = −0.351, SE = 0.091, p < 0.001). The estimated marginal means further corroborated this finding only in the forward conditions (Sentence Forward–Jabberwocky Forward: Δ = 0.329, SE = 0.091, p = 0.003; Sentence Forward–Wordlist Forward: Δ = 0.139, SE = 0.046, p = 0.014). In contrast to the “general” MI values, we found no significant differences between the backward controls when computing the isolated phase MI over the entire head (Sentence Backward–Jabberwocky Backward: Δ = 0.209, SE = 0.101, p = 0.112; Sentence Backward–Wordlist Backward: Δ = 0.088, SE = 0.087, p = 0.577; Jabberwocky Backward–Wordlist Backward: Δ = −0.121, SE = 0.085, p = 0.343). Again, these findings are consistent with neurophysiologically inspired models of language comprehension (Martin, 2016, 2020; Martin and Doumas, 2017).
Tracking of abstract linguistic units
Inspection of the modulation spectra of our stimuli (Fig. 2) shows that, although the stimuli were carefully designed, the acoustic signals are not entirely indistinguishable between conditions on the basis of their spectral properties. Most notably, Sentence stimuli appear to exhibit a small peak at ∼0.5 Hz (roughly corresponding to the phrase timescale in our stimuli) compared with the other two conditions. It is important to note that (1) differences between conditions are not surprising, given that our stimuli were naturally spoken; and (2) we specifically designed our experiment to include backward versions of all conditions to control for slight differences between the acoustic envelopes of the forward stimuli. That being said, we conducted an additional, exploratory analysis to further reduce the potential confound of differences between the acoustic modulation spectra and to disentangle the distribution of linguistic phrase representations from the acoustic stimulus even further. Specifically, we computed MI in the delta–theta range (0.8–5 Hz) between the brain response and abstracted, dimensionality-reduced annotations of all stimuli, containing only information about when words could be integrated into phrases (Brodbeck et al., 2018; see Materials and Methods for a detailed description of how these annotations were created).
These annotation-based analyses revealed significant effects of Condition (Sentence = treatment level; Jabberwocky: β = −0.326, SE = 0.112, p = 0.007; Wordlist: β = −0.521, SE = 0.120, p < 0.001), Direction (β = −0.754, SE = 0.115, p < 0.001), and their interaction (Jabberwocky * Backward: β = 0.352, SE = 0.164, p = 0.040; Wordlist * Backward: β = 0.621, SE = 0.156, p < 0.001). The estimated marginal means further revealed increased MI for forward sentences compared with forward jabberwocky items and forward word lists (Sentence Forward–Jabberwocky Forward: Δ = 0.326, SE = 0.114, p = 0.021; Sentence Forward–Wordlist Forward: Δ = 0.521, SE = 0.123, p < 0.001; all results were corrected with Tukey's for multiple comparisons) and no significant difference among the backward controls (Table 4).
Estimated marginal means for MI (log-transformed) calculated over abstract phrase representations
Again, the same pattern of results also emerged when computing MI over all electrodes: the mixed-effects model revealed significant effects of Condition (Jabberwocky: β = −0.365, SE = 0.087, p < 0.001; Wordlist: β = −0.611, SE = 0.098, p < 0.001), Direction (β = −0.813, SE = 0.090, p < 0.001), and their interaction (Jabberwocky * Backward: β = 0.390, SE = 0.148, p = 0.014; Wordlist * Backward: β = 0.678, SE = 0.131, p < 0.001). The pairwise contrasts were, again, only significant between the forward conditions (Sentence Forward–Jabberwocky Forward: Δ = 0.365, SE = 0.088, p < 0.001; Sentence Forward–Wordlist Forward: Δ = 0.611, SE = 0.100, p < 0.001), but not the backward controls (Sentence Backward–Jabberwocky Backward: Δ = −0.024, SE = 0.109, p = 0.973; Sentence Backward–Wordlist Backward: Δ = −0.067, SE = 0.097, p = 0.770; Jabberwocky Backward–Wordlist Backward: Δ = −0.043, SE = 0.100, p = 0.905). These results support our previously reported findings, showing that neural tracking is influenced by the presence of abstract linguistic information. In other words, this exploratory analysis supports our earlier finding that the “sensitivity” of the brain to linguistic structure and meaning goes above and beyond the acoustic signal and both word-level semantic and prosodic controls.
Discussion
The current experiment tested how the brain attunes to linguistic information. Contrasting sentences, word lists, and jabberwocky items, we analyzed, by proxy, how the brain response is modulated by sentence-level prosody, lexical semantics, and compositional structure and meaning. Our findings show that (1) the neural response is driven by compositional structure and meaning, beyond both acoustic-prosodic and lexical information; and (2) the brain most closely tracks the most structured representations on the timescales we analyzed. To our knowledge, this is the first study to systematically disentangle the contribution of linguistic content from its timing and rhythm in natural speech by using linguistically informed controls. Additionally, our data demonstrate cortical tracking of naturalistic language in the absence of a nonlinguistic task such as syllable counting, outlier-trial detection, or target detection. We show that oscillatory activity attunes to structured and meaningful content, suggesting that neural tracking reflects computations related to inferring linguistic representations from speech, and not merely tracking of rhythmicity or timing. We discuss these findings in more detail below.
Using mutual information analysis, we quantified the degree of speech tracking in frequency bands corresponding to the timescales at which linguistic structures (phrases, words, and syllables) could be inferred from our stimuli. On the phrase timescale, we found that sentences had the most shared information between stimulus and response. Crucially, this is not merely a chunking mechanism (Bonhage et al., 2017; Ghitza, 2017)—participants could have “chunked” the word lists (which have their own naturally produced nonsentential prosody) into units of adjacent words, and the jabberwocky items into prosodic units. This is especially interesting given recent work by Jin et al. (2020), showing that enhanced delta-band activity can be “induced” in listeners by teaching them to chunk a sequence of (synthesized) words according to different sets of artificial grammar rules. Conversely, the observed patterns of activity cannot exclusively be driven by the lexico-semantic content of our stimuli (Frank and Yang, 2018)—sentences and word lists contained the same lexical items, yet MI was enhanced for Sentence stimuli, where words could be combined into phrases and higher-level representations. As such, we argue that the dominating process we observe appears to be processing compositional semantic structure, above and beyond prosodic chunking and word-level meaning. We show that the brain aligns more to periodically occurring units when they contain meaningful information and are thus relevant for linguistic processing.
On the word timescale, the emerging picture is somewhat more diverse than on the phrase timescale. Specifically, we found enhanced tracking for sentences compared with jabberwocky items. We tentatively take this finding to indicate that, at the word timescale, the dominant process appears to be context-dependent word recognition—perhaps based in perceptual inference. This is further corroborated by the results of computing MI over all electrodes, rather than a subset, with sentences eliciting higher MI than both jabberwocky items and word lists. Note, however, that we also found enhanced MI on the word timescale for word lists compared with jabberwocky items in the backward controls when computing MI over all electrodes. Here, listeners could not have processed words within the context of phrases or sentences, which makes it somewhat difficult to integrate these results. One possible explanation for this surprising finding might be that there is still some acoustic-prosodic information available in the backward controls that distinguishes word lists from jabberwocky items. Future research could address this in detail, for example by including a control condition with entirely flat prosody (Ding et al., 2016, 2017a).
There continues to be a vibrant debate about whether language-related cortical activity in the delta–theta range is truly oscillatory in nature or whether the observed patterns of neural activity arise as a series of evoked responses (Haegens and Golumbic, 2018; Rimmele et al., 2018; Zoefel et al., 2018; Obleser and Kayser, 2019). Our current results cannot speak to this question; in fact, we have been careful to refer to our results as “tracking” rather than “entrainment” throughout this article. To be clear, we do not take the observed increased MI for sentences compared with jabberwocky items and word lists as evidence for an intrinsic “phrase-level oscillator” or “word-level oscillator.” Rather, we interpret our findings as a manifestation of the cortical computations that may occur during language comprehension. Here, we observe them in the delta frequency range because that is the timescale on which higher-level linguistic units occur in our stimuli.
Many previous studies have shown that attention can modulate neural entrainment (Haegens et al., 2011; Ding and Simon, 2012; Golumbic et al., 2013; Lakatos et al., 2013; Calderone et al., 2014). Importantly, Ding et al. (2018) found that tracking beyond the syllable envelope requires attention to the speech stimulus. In our current experiment, participants were instructed to attentively listen to the audio recordings in all conditions, but it is possible that “attending to sentences” might be easier than “attending to jabberwocky items,” and that listeners pay closer attention to higher-level structures in intelligible and meaningful speech. Additionally, our study used a block design, which could, in principle, have encouraged participants to use different attentional resources during the different blocks. As such, we cannot rule out the possibility that our effects might be influenced by a mechanism based on attentional control. It is, however, difficult to disentangle “attention” from “comprehension” in this kind of argument: meaningful information within a stimulus can arguably only lead to increased attention if it is comprehensible. We plan to investigate these questions in future experiments.
Overall, the pattern of results is consistent with cue integration-based models of language processing (Martin, 2016, 2020), where the activation profile of different populations of neurons over time encodes linguistic structure as it is inferred from sensory correlates in real time (Martin and Doumas, 2017). The model of language processing of Martin (2016, 2020) builds on and extends neurophysiological models of cue integration, where percepts are inferred from sensory cues through summation and normalization, both of which have been proposed as canonical neural computations (Carandini and Heeger, 1994, 2011; Ernst and Bülthoff, 2004; Landy et al., 2011; Fetsch et al., 2013; for cue integration-based models of speech and word recognition, see Norris and McQueen, 2008; Toscano and McMurray, 2010; McMurray and Jongman, 2011). Martin (2016, 2020) proposed that, during all stages of language processing, the brain might draw on these same neurophysiological computations.
Crucially, inferring linguistic representations from speech sounds requires not only bottom-up sensory information, but also top-down memory-based cues (Marslen-Wilson, 1987; Kaufeld et al., 2020). Martin (2016, 2020) therefore suggested that cue integration during language comprehension is an iterative process, where cues that have been inferred from the acoustic signal can, in turn, become cues for higher levels of processing. The pattern of findings in our current experiment strongly speaks to cue integration-based models of language comprehension: we observe that tracking of the speech signal is enhanced when meaningful linguistic units can be inferred, suggesting that the alignment of populations of neurons might, indeed, encode the generation of inference-based linguistic representations (Martin and Doumas, 2017).
Our results also speak to analysis-by-synthesis-based accounts of speech processing, more generally (Halle and Stevens, 1962; Bever and Poeppel, 2010; Poeppel and Monahan, 2011). In an analysis-by-synthesis model of speech perception, speech recognition is achieved by internally generating (synthesizing) patterns according to internal rules, and matching (analyzing) them against the acoustic input signal. Similarly, our findings are in line with the notion of hierarchical temporal receptive windows from early sensory to higher-level perceptual and cognitive brain areas (Hasson et al., 2008; Lerner et al., 2011).
There are, of course, many open questions that arise from our results. Perhaps most obviously (although presumably limited by the resolution of time–frequency analysis), it would be interesting to investigate how “far” cue integration can be traced during even more natural language comprehension situations (Alday, 2019; Alexandrou et al., 2020). To what degree are higher-level linguistic cues, such as sentential, contextual, or pragmatic information, encoded in the neural response? Another interesting avenue for future research would be to investigate whether similar patterns can be observed during language production. Martin (2016, 2020) suggested that not only language comprehension, but also language production draws on principles of cue integration. Finally—and consequentially, if cue integration underlies both comprehension and production processes—we would be curious to learn more about cue integration “in action,” specifically during dialogue settings, where interlocutors comprehend and plan utterances nearly simultaneously.
In summary, this study showed that speech tracking is sensitive to linguistic structure and meaning, above and beyond prosodic and lexical-semantic controls. In other words, content determines tracking, not just timescale. This extends previous findings and advances our understanding of spoken language comprehension in general, because our experimental manipulation allows us, for the first time, to disentangle the influence of linguistic structure and meaning on the neural response from word-level meaning and prosodic regularities occurring in naturalistic stimuli.
Footnotes
The authors declare no competing financial interests.
A.E.M. was supported by the Max Planck Research Group “Language and Computation in Neural Systems” and by the Netherlands Organization for Scientific Research (Grant 016.Vidi.188.029). We thank Anne Keitel for sharing her expertise and scripts for the Mutual Information analysis; Laurel Brehm for statistical advice; Joe Rodd and Merel Maslowski for help with the experiment visualization (Fig. 1); Annelies van Wijngaarden for lending her voice for stimulus recording; Karthikeya Kaushik, Zina al-Jibouri, and Dylan Opdam for research assistance; and Micha Heilbron for feedback on an earlier draft of this manuscript.
- Correspondence should be addressed to Andrea E. Martin at andrea.martin@mpi.nl