Abstract
Adaptive behavior rests on predictions based on statistical regularities in the environment. Such regularities pertain to stimulus contents (“what”) and timing (“when”), and both interactively modulate sensory processing. In speech streams, predictions can be formed at multiple hierarchical levels of contents (e.g., syllables vs words) and timing (faster vs slower time scales). Whether and how these hierarchies map onto each other remains unknown. Under one hypothesis, neural hierarchies may link “what” and “when” predictions within sensory processing areas, with lower versus higher cortical regions mediating interactions for smaller versus larger units (syllables vs words). Alternatively, interactions between “what” and “when” regularities might rest on a generic, sensory-independent mechanism. To address these questions, we manipulated “what” and “when” regularities at two levels—single syllables and disyllabic pseudowords—while recording neural activity using magnetoencephalography (MEG) in healthy volunteers (N = 22). We studied how neural responses to syllable and/or pseudoword deviants are modulated by “when” regularity. “When” regularity modulated “what” mismatch responses with hierarchical specificity, such that responses to deviant pseudowords (vs syllables) were amplified by temporal regularity at slower (vs faster) time scales. However, both of these interactive effects were source-localized to the same regions, including frontal and parietal cortices. Effective connectivity analysis showed that the integration of “what” and “when” regularity selectively modulated connectivity within regions, consistent with gain effects. This suggests that the brain integrates “what” and “when” predictions that are congruent with respect to their hierarchical level, but this integration is mediated by a shared and distributed cortical network.
- audition
- effective connectivity
- magnetoencephalography
- predictive processing
- source reconstruction
- speech processing
Significance Statement
This study investigates how the brain integrates predictions about the content (“what”) and timing (“when”) of sensory stimuli, particularly in speech. Using magnetoencephalography (MEG) to record neural activity, researchers found that temporal regularities at slower (vs faster) time scales enhance neural responses to unexpected disyllabic pseudowords (vs single syllables), indicating a hierarchical specificity in processing. Despite this specificity, the involved brain regions were common across different hierarchical levels of regularities and included frontal and parietal areas. This suggests that the brain uses a distributed and shared network to integrate “what” and “when” predictions across hierarchical levels, refining our understanding of speech processing mechanisms.
Introduction
Speech comprehension relies on auditory predictions, enabling the brain to anticipate and efficiently process sounds (Hickok, 2012; Hovsepyan et al., 2020; Poeppel and Assaneo, 2020; Caucheteux et al., 2023; Mai and Wang, 2023). Accurate predictions reduce bottom-up processing, optimizing other aspects of language comprehension (Federmeier, 2007; Sohoglu and Davis, 2016; Hakonen et al., 2017; Ryskin and Nieuwland, 2023). Inaccurate predictions draw cognitive resources, causing error signals (increases in amplitude) or delays in the responses of brain regions including the superior temporal gyrus (STG), superior parietal lobule (SPL), and prefrontal cortex (Shain et al., 2020; Caucheteux et al., 2023). The timing of their activity suggests that predictions span multiple levels, from low-level acoustic to high-level semantic representations.
During speech processing both the content (“what”) and the timing (“when”) of speech can be predicted (Gómez Varela et al., 2024). For instance, when listening to a familiar speaker, our brain not only anticipates what words might come next but also when they will be spoken, allowing us to follow rapid conversations effortlessly. This predictive mechanism explains why we can still understand speech even in noisy environments—our brain fills in the gaps based on expected sounds and meanings.
In turn, unexpected contents, even in artificial speech, trigger error signals which typically lead to further processing, increasing activity in the inferior frontal gyrus (IFG; Petersson et al., 2012; Wilson et al., 2015) and superior temporal cortex (Ling et al., 2022). “What” predictions span multiple levels, ranging from phonemes to phrases (Su et al., 2023), suggesting a hierarchical organization where higher-level predictions guide lower-level processing (Heilbron et al., 2022). Similarly, “when” predictions operate across several time scales, from phoneme onsets to phrase boundaries (Donhauser and Baillet, 2020; Schmitt et al., 2021). The rhythmic patterns of speech synchronize neural oscillations (Ding et al., 2016), with speech tracking linked to phase locking and increased activity (Obleser and Kayser, 2019) in auditory regions like the STG (Keitel et al., 2017), as well as (pre)motor regions (Morillon et al., 2019) and subcortical structures including the basal ganglia (Merchant et al., 2015) and the cerebellum (Kotz et al., 2014). Increased synchronization is thought to aid speech processing, particularly with predictable temporal structures (Kösem et al., 2018; Riecke et al., 2018; Zoefel et al., 2018). In contrast, irregular or unpredictable speech sequences show weaker neural tracking (Klimovich-Gray et al., 2021). However, the extent to which speech tracking reflects oscillatory mechanisms as opposed to evoked responses to each new sound is still debated (Doelling et al., 2019; Zou et al., 2021; Oganian et al., 2023).
While “what” and “when” predictions are both crucial for speech comprehension, they likely rely on different mechanisms (Arnal and Giraud, 2012; Auksztulewicz et al., 2018). Theoretically, it has been suggested that the brain performs separate predictive computations for “what” and “when” information, allowing it to both process them independently (factorization) and integrate them when needed (conjunction; Friston and Buzsáki, 2016). This separation allows cognitive resources to be allocated efficiently. Empirically, a study using invasive electrophysiological measurements and computational modeling suggested that “what” predictability relies on modulated connectivity between sensory, prefrontal, and premotor regions, while “when” predictability mainly involves gain modulation in sensory regions (Auksztulewicz et al., 2018). Increased gain due to rhythmic “when” regularity (Auksztulewicz et al., 2019) may also amplify mismatch responses to violated “what” predictions (Todd et al., 2018; Lumaca et al., 2019; Jalewa et al., 2021).
Given the interactive effects of “what” and “when” predictability on neural activity, how do they modulate the processing of hierarchically organized stimulus sequences such as speech streams? One hypothesis suggests that hierarchies of predictions map onto neural hierarchies, such that lower cortical regions like the STG (Oganian and Chang, 2019) handle interactions between “what” and “when” regularities for single chunks (e.g., syllables) and faster time scales (syllable onsets), while higher (e.g., frontal) regions (Rimmele et al., 2023) handle longer segments (e.g., words) and slower time scales (word onsets). Alternatively, interactions between “what” and “when” regularities might occur through sensory-independent mechanisms involving attention-related regions, such as the left parietal cortex, which integrates content- and time-based speech information (Orpella et al., 2020).
This study examines the neural correlates of “what” and “when” regularities across levels of artificial speech processing. We independently manipulated “what” and “when” regularity of syllables and disyllabic pseudowords, while recording neural responses using magnetoencephalography (MEG). We first quantified the phase locking of MEG responses to different time scales of “when” regularities. Then, we used source reconstruction of evoked responses to test if “what” and “when” regularities at different levels interactively modulate activity in hierarchically distinct regions or in shared networks. Finally, computational modeling of evoked responses allowed us to infer network connectivity patterns mediating the effects across the cortical hierarchy.
Materials and Methods
Participant sample
A total of 24 participants took part in the study upon written informed consent. Two participants did not complete the study, resulting in data from 22 participants being included in the analysis (13 females, 9 males; median age, 28; range, 21–35 years; all right-handed). The study adhered to protocols approved by the Ethics Board of the Goethe-University Frankfurt am Main. All participants confirmed normal hearing in their self-reports, and none reported any current or past neurological or psychiatric disorders.
Stimulus and task description
The experimental paradigm used auditory sequences which were manipulated with respect to “what” and “when” regularity at two levels each. “What” regularity was established by presenting sequences of pseudowords drawn from a set of six items and violated at the level of single syllables versus disyllabic pseudowords. “When” regularity was established by presenting sequences with isochronous timing and violated at a faster time scale of 4 Hz versus a slower time scale of 2 Hz. The details of both of these manipulations will be described below. By independently manipulating the regularity of stimulus identity and stimulus timing, we were able to analyze how “what” and “when” regularities interacted at each level of processing speech sequences.
Prior to the experimental task, participants implicitly learned six disyllabic pseudowords (“tupi,” “robi,” “daku,” “gola,” “zufa,” “somi”; Fig. 1A). The syllables were taken from a database of Consonant-Vowel syllables (Ives et al., 2005) and resynthesized using an open-source vocoder, STRAIGHT (Kawahara, 2006) for MATLAB R2018b (MathWorks) to match their duration (166 ms), fundamental frequency (F0 = 150 Hz), and sound intensity.
Experimental paradigm. A, Participants listened to sequences composed of pseudowords, drawn from a set of six items to which they had been passively exposed in a training session. B, After participants had implicitly learned the six pseudowords, they engaged in a syllable repetition detection task. During the task, “what” regularity was manipulated across three experimental conditions (at a single-trial level): pseudoword sequences could be composed of only legal pseudowords (“standard” trials), or they could contain a pseudoword with a deviant word-final syllable (“deviant syllable” trials, whereby the pseudoword starts in an expected manner but ends with a violation), or they could contain a pseudoword with a deviant word-initial syllable substituted from a final syllable of another pseudoword (“deviant pseudoword” trials, whereby the pseudoword starts with a violation). C, Sequences were blocked into three temporal conditions: an “isochronous” condition, in which the stimulus onset asynchrony (SOA) between all syllables was fixed at 0.25 s; a “beat-based” condition, in which the SOA between pseudoword onsets was fixed at 0.5 s but the SOA between the initial and final syllables of each pseudoword was jittered, such that only the timing of the deviant pseudoword (but not of the deviant syllable) could be predicted; and a “single-interval” condition, in which the SOA between pseudoword onsets was jittered but the SOA between the initial and final syllable of each pseudoword was fixed at 0.25 s, such that only the timing of the deviant syllable (but not of the pseudoword) could be predicted.
In the implicit learning task (administered outside of the MEG scanner but immediately prior to the MEG recording session), participants were exposed to continuous auditory streams of the six pseudowords presented in a random order. The stimulus onset asynchrony (SOA) between consecutive syllables was set to 250 ms, resulting in an isochronous syllable rate of 4 Hz. The stream was 120 s long, amounting to 80 occurrences of each pseudoword. Following exposure to the continuous stream, participants listened to pairs of pseudowords and were asked to discriminate “correct” pseudowords (e.g., “tupi”) from “incorrect” pseudowords (e.g., “pitu,” “turo,” “tuku”). Each participant performed 60 trials of the pseudoword discrimination task.
Following the implicit learning task, participants were exposed to the main experimental paradigm in the MEG scanner. They were asked to listen to continuous sequences of four unique pseudowords (e.g., “tupirobidakugola”). As a cover task, to monitor participants' attention, we asked them to detect immediate syllable repetitions (e.g., “tupirobidakugogo”), present in 6.6% of sequences, by responding as quickly as possible after repetition onset. These trials were rejected from subsequent analysis of neural data.
The sequences presented in the MEG scanner were independently manipulated with respect to “what” and “when” regularity. “What” regularity manipulations had three levels (Fig. 1B): (1) in standard sequences (66.6%), only correct (regular) pseudowords were used (e.g., “tupirobidakugola”); (2) in “deviant syllable” sequences (13.3%), the final syllable of the sequence was replaced with a syllable belonging to a different pseudoword (e.g., “tupirobidakugofa”), such that upon hearing the syllable “go,” the prediction of the subsequent syllable “la” is violated by the syllable “fa”; (3) in “deviant pseudoword” sequences (13.3%), the penultimate syllable of the sequence was replaced with a syllable which should be a final syllable of a different pseudoword (e.g., “tupirobidakufala”), creating an irregular disyllabic pseudoword (“fala”). “What” regularities were manipulated at the sequence level, such that the three different types of sequences were presented in a random order. Each trial consisted of seven seamlessly concatenated sequences, amounting to 14 s per trial. Trials were separated by an intertrial interval of 1 s. To ensure that differences between deviants and standards were not confounded by differences in the stimulus positions and/or timing in the sequences, each deviant syllable/word was matched with one designated standard syllable/word. As such, the standards were drawn from the same positions (i.e., penultimate or final syllable) and had the same timing as their matched deviants.
Independently, “when” regularity manipulations also had three levels (Fig. 1C): (1) in isochronous sequences (33.3% of the trials), the SOA between consecutive syllables was fixed at 250 ms; (2) in “beat-based” sequences (33.3% of the trials), the SOA between pseudoword-initial syllables was fixed at 500 ms, but the SOA between the initial and the final syllable of the pseudoword was jittered between 167 and 333 ms, resulting in irregular timing of final syllables of each pseudoword (i.e., at a faster time scale, corresponding to syllable rate) but regular timing of pseudoword onsets (i.e., at a slower time scale, corresponding to pseudoword rate); (3) in “single-interval” sequences (33.3% of the trials), the SOA between the initial and the final syllable of the pseudowords was fixed at 250 ms, but the SOA between the final syllable of one pseudoword and the initial syllable of the next was jittered between 167 and 333 ms, resulting in irregular timing of pseudoword onsets (i.e., at the slower time scale) but regular timing of final syllables of each pseudoword (i.e., at the faster time scale). Here, we use the term “single-interval” in line with previous literature using similar temporal manipulations (Breska and Ivry, 2018), although we acknowledge that other authors refer to similar manipulations with terms such as “interval-based” (Merchant and Honing, 2013), “memory-based” (Bouwer et al., 2020), and “duration-based” (Teki et al., 2011) timing. “What” regularity was manipulated at the trial level (seven repetitions of a four-pseudoword sequence), while “when” regularity was manipulated at a block level (20 trials per block). Each “when” condition was administered in four blocks, resulting in 12 blocks in total. Blocks were presented in a pseudorandom order, such that no immediate repetitions of the same “when” condition were allowed. A sketch of the three timing schemes is shown below.
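To make the three timing schemes concrete, the sketch below generates the SOA vector for one four-pseudoword sequence under each “when” condition. This is an illustrative MATLAB reconstruction rather than the original stimulus code; all variable names are hypothetical, and the additional constraint described below (a fixed 250 ms SOA immediately preceding deviants and designated standards) is omitted.

```matlab
% Illustrative reconstruction of the three SOA schemes (hypothetical code).
condition    = 'beat-based';          % 'isochronous' | 'beat-based' | 'single-interval'
nPseudowords = 4;                     % pseudowords per sequence (8 syllables)
baseSOA      = 0.250;                 % isochronous syllable SOA (s)
jitterSOA    = @(n) 0.167 + (0.333 - 0.167) .* rand(n, 1);   % 167-333 ms jitter

switch condition
    case 'isochronous'                % all syllable onsets 250 ms apart
        soa = repmat(baseSOA, 2 * nPseudowords, 1);
    case 'beat-based'                 % word onsets every 500 ms; within-word SOA jittered
        withinWord  = jitterSOA(nPseudowords);
        betweenWord = 2 * baseSOA - withinWord;   % pads each pseudoword to 500 ms
        soa = reshape([withinWord, betweenWord]', [], 1);
    case 'single-interval'            % within-word SOA fixed; word onsets jittered
        withinWord  = repmat(baseSOA, nPseudowords, 1);
        betweenWord = jitterSOA(nPseudowords);
        soa = reshape([withinWord, betweenWord]', [], 1);
end
onsets = cumsum([0; soa(1:end - 1)]); % syllable onset times within the sequence (s)
```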
To prevent differences in baseline duration from affecting the MEG analysis of stimulus-evoked responses, all syllables preceding deviant syllables and the designated standard syllables were set to a fixed 250 ms SOA. This adjustment was necessary because comparing stimuli with fixed SOA versus random SOA could introduce baseline contamination, potentially leading to the rejection of a large number of trials. Therefore, temporal regularity was manipulated at the sequence level, affecting only syllables surrounding the analyzed syllables (deviants and designated standards), but not the syllables immediately preceding them. The global deviant probability was 4.16% (including repetitions) or 3.32% (excluding repetitions) of all syllables, resulting in up to 75 deviant stimuli (12 blocks × 20 trials × 7 sequences × 0.133 deviant sequence probability × 0.333 temporal condition probability) per combination of “what” regularity violation (deviant pseudoword vs syllable) and “when” regularity (isochronous, beat-based, single-interval).
Behavioral analysis
Behavioral analysis focused on accuracy and response time (RT) data derived from participant responses in the syllable repetition detection task. Single-trial RTs exceeding each participant's median + 3 standard deviations were removed from the analysis. The remaining RTs, derived exclusively from correct trials, underwent a log transformation to achieve an approximately normal distribution and were subsequently averaged. To test for the effect of “when” regularity on behavioral performance in the repetition detection task, accuracy and mean log RTs were separately subjected to repeated-measures ANOVAs, incorporating the within-subjects factor of “when” regularity (isochronous, beat-based, single-interval). Since the behavioral task was limited to detecting immediate syllable repetitions (i.e., did not differentiate between repeated pseudowords or syllables), “what” regularities were not included as a factor in these analyses. Post hoc comparisons were executed through paired t tests in MATLAB, and corrections were applied for multiple comparisons—specifically, three for accuracy and three for RTs—using a false discovery rate of 0.05.
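As a minimal illustration of this RT pipeline (with simulated data and hypothetical variable names, not the original analysis script):

```matlab
% Minimal sketch of the RT preprocessing described above (simulated data).
rng(1);
rtAll         = 0.35 + 0.25 * abs(randn(100, 1)); % stand-in single-trial RTs (s)
correctTrials = rand(100, 1) > 0.1;               % stand-in correctness mask

rt        = rtAll(correctTrials);                 % keep correct trials only
cutoff    = median(rt) + 3 * std(rt);             % outlier criterion
rtClean   = rt(rt <= cutoff);                     % remove slow outliers
meanLogRT = mean(log(rtClean));                   % log-transform, then average
meanRTsec = exp(meanLogRT);                       % back-transformed value for reporting
```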
MEG data acquisition and preprocessing
Participants were seated in a 275-channel whole-head CTF MEG system with axial gradiometers (Omega 2005, VSM MedTech). The data were acquired at a sampling rate of 1200 Hz with synthetic third-order gradient noise reduction (Vrba and Robinson, 2001). For monitoring eyeblinks and heart rate, four electrooculogram (EOG) and two electrocardiogram (ECG) electrodes were placed on the participant's face and clavicles. Continuous head localization was recorded throughout the session.
Auditory stimuli were generated by an external sound card (RME) and transmitted into the MEG chamber through sound-conducting tubes which were linked to plastic ear molds (Promolds, Doc's Proplugs). The sound pressure level was adjusted to ∼70 dB SPL. Visual stimuli, consisting of instructions between blocks and a fixation cross during acoustic stimulation, were presented using a PROPIX projector (VPixx ProPixx) and back-projected onto a semitransparent screen positioned 60 cm from the participant’s head. Participants responded to stimuli by operating a MEG-compatible button response box (Cambridge Research Systems) with their right hand. Short breaks were administered between runs.
The continuous MEG recordings were high-pass filtered at 0.1 Hz and notch filtered between 48 and 52 Hz, down-sampled to 300 Hz, and further subjected to a low-pass filter at 90 Hz (including antialiasing). All filters were fifth-order zero-phase Butterworth and implemented in the SPM12 toolbox for MATLAB. Based on continuous head position measurement inside the MEG scanner, we calculated six movement parameters (three translations and three rotations; Stolk et al., 2013), which were regressed out from each MEG channel using linear regression. Eyeblink artifacts were automatically detected based on the vertical EOG and removed by subtracting the top two spatiotemporal principal components of eyeblink-evoked responses from all MEG channels (Ille et al., 2002). Heartbeat artifacts were automatically detected based on ECG and removed in the same manner. The cleaned signals were subsequently subjected to separate analyses in the frequency domain (phase locking) and the time domain (event-related fields).
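The movement-parameter regression step can be sketched as follows (simulated data and hypothetical names; the actual pipeline operated on the continuous SPM12 data objects):

```matlab
% Minimal sketch of regressing head-movement parameters out of MEG channels.
rng(0);
nSamples = 3000; nChannels = 4;                   % stand-in dimensions
X   = randn(nSamples, 6);                         % 3 translations + 3 rotations
meg = randn(nSamples, nChannels);                 % continuous MEG data

Xc       = [X - mean(X), ones(nSamples, 1)];      % demeaned regressors + intercept
beta     = Xc \ meg;                              % least-squares fit per channel
megClean = meg - Xc * beta;                       % residuals = movement-corrected data
```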
MEG analysis: phase locking
To investigate whether speech sequences exhibit distinct spectral peaks in neural responses at both the syllable rate (4 Hz) and the pseudoword rate (2 Hz), we conducted a frequency domain analysis. Following previous research (Ding and Simon, 2013), we chose the intertrial phase coherence (ITPC) as a metric of tracking temporal regularity and assessing frequencies characteristic of different speech units (syllables and pseudowords). The continuous data were segmented into epochs spanning from the onset to the offset of each trial (speech sequence). For each participant, channel, and sequence, we computed the Fourier spectrum of MEG signals recorded during that specific sequence. To assess phase consistency within each condition, we computed ITPC for each temporal condition (isochronous, beat-based, single-interval) using the following equation:

$$\mathrm{ITPC}(f) = \left| \frac{1}{N} \sum_{n=1}^{N} e^{i\varphi_n(f)} \right|,$$

where $\varphi_n(f)$ denotes the phase of the Fourier coefficient at frequency $f$ in sequence $n$, and $N$ is the number of sequences per condition.
We computed ITPC for the aperiodic conditions as well, based on previous findings (Breska and Ivry, 2018) that temporally consistent slow ramping neural activity (such as the contingent negative variation) can produce significant ITPC values even in aperiodic (single-interval) sequences. We also assessed the phase consistency of the stimulation itself, for both periodic and aperiodic conditions, by calculating the ITPC based on the raw stimulus waveform.
In the initial analysis, ITPC estimates were averaged across MEG channels. To assess the presence of statistically significant spectral peaks, ITPC values at the syllable rate (4 Hz) and pseudoword rate (2 Hz) were compared against the mean of ITPC values at their respective neighboring frequencies (syllable rate: 3.929 and 4.071 Hz; pseudoword rate: 1.929 and 2.071 Hz) using paired t tests.
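For illustration, the ITPC computation and the peak-versus-neighbor comparison can be sketched as follows (simulated single-channel data, hypothetical names). The 0.071 Hz spacing of the neighboring frequencies follows from the 14 s trial duration (frequency resolution of 1/14 Hz):

```matlab
% Minimal sketch of the ITPC computation (see equation above; simulated data).
rng(0);
fs       = 300;                                   % sampling rate (Hz)
nTrials  = 80;
nSamples = 14 * fs;                               % 14 s sequences
epochs   = randn(nTrials, nSamples);              % [trials x samples], one channel

spec  = fft(epochs, [], 2);                       % Fourier transform of each trial
itpc  = abs(mean(spec ./ abs(spec), 1));          % length of the mean unit phase vector
freqs = (0:nSamples - 1) * fs / nSamples;         % frequency axis (1/14 Hz resolution)

% Syllable-rate peak vs its immediate spectral neighbors (3.929 and 4.071 Hz):
[~, i4] = min(abs(freqs - 4));
peakMinusNeighbors = itpc(i4) - mean(itpc([i4 - 1, i4 + 1]));
```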
Additionally, to examine whether spectral peaks at the syllable rate and pseudoword rate observed at individual MEG channels exhibited modulations due to temporal regularity, spatial topography maps of single-channel ITPC estimates were transformed into 2D images. These images were then smoothed with a 5 × 5 mm full-width at half-maximum (FWHM) Gaussian kernel to ensure that the data conform to the assumptions of the statistical inference approach, namely, cluster-based correction based on random field theory (Litvak et al., 2011). The smoothed images were subjected to repeated-measures ANOVAs, separately for syllable-rate and pseudoword-rate ITPC estimates, incorporating a within-subjects factor of Time (isochronous, beat-based, single-interval). This analysis was implemented in SPM12 as a general linear model (GLM). To address multiple comparisons and ITPC correlations across neighboring channels, statistical parametric maps were thresholded at p < 0.005, a conservative cluster-forming threshold chosen to avoid inflating the false-positive ratio (Eklund et al., 2016; Flandin and Friston, 2019; Henson et al., 2019) and corrected for multiple comparisons over space at a cluster-level pFWE < 0.05, following random field theory assumptions (Kilner et al., 2005). Repeated-measures parametric tests were selected based on previous literature using ITPC (Sokoliuk et al., 2021), assuming that differences in ITPC values between conditions follow a normal distribution. Post hoc tests were conducted at a Bonferroni-corrected FWE threshold (0.05/3 pairwise comparisons per rate).
MEG analysis: event-related fields
For the analysis in the time domain, the data underwent segmentation into epochs spanning from −50 ms before to 250 ms after the onset of deviant/standard syllables. To prevent contamination from the temporally structured presentation, baseline correction was applied from −25 ms to 25 ms, following a previously published approach (Fitzgerald et al., 2021). The data were then denoised using the “Dynamic Separation of Sources” algorithm (de Cheveigné and Simon, 2008), aimed at minimizing the impact of noisy channels. Condition-specific event-related fields (ERFs) corresponding to syllable and pseudoword deviants and the respective standards, presented in each of the three temporal conditions, were computed using robust averaging, a standard method of obtaining ERFs in SPM which iteratively calculates the weighted average across trials based on the deviation of each trial from the median (Litvak et al., 2011). A main advantage of robust averaging is that it down-weights outlier trials, thereby minimizing the impact of artifacts and improving the accuracy of the averaged signal. This analysis was conducted with the SPM12 toolbox and included a low-pass filter at 48 Hz (fifth-order zero-phase Butterworth) to derive the final ERFs. The ERFs were then subjected to univariate analysis to assess the effects of “what” and “when” regularity on evoked responses.
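The logic of robust averaging can be illustrated with a minimal iterative reweighting scheme in the spirit of SPM's spm_robust_average (a simplified sketch, not the actual SPM implementation; the bisquare weighting and the constant 0.6745 are standard choices in robust statistics):

```matlab
% Minimal sketch of iterative robust averaging (simplified; not SPM's code).
rng(0);
trials = sin(linspace(0, 2 * pi, 90)) + randn(100, 90);  % [trials x samples]

w = ones(size(trials));                           % initial trial weights
for iter = 1:10
    erf = sum(w .* trials, 1) ./ sum(w, 1);       % weighted average across trials
    res = trials - erf;                           % residual of each trial
    s   = median(abs(res), 1) / 0.6745;           % robust scale per time point
    u   = res ./ (3 * s + eps);                   % standardized residuals
    w   = (abs(u) < 1) .* (1 - u.^2).^2;          % bisquare down-weighting of outliers
end
```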
The ERF data were transformed into 3D images (2D for spatial topography; 1D for time) which underwent spatial smoothing using a 5 × 5 mm FWHM Gaussian kernel. Subsequently, the smoothed images were entered into a GLM, which implemented a 3 × 3 repeated-measures ANOVA with within-subject factors “what” (standard, deviant syllable, deviant pseudoword) and “when” regularity (isochronous, beat-based, single-interval). In addition to testing for the two main effects and a general 3 × 3 interaction, we also tested for the following planned contrasts: (1) deviant versus standard “what” conditions, (2) isochronous versus nonisochronous “when” conditions, and (3) an interaction contrast isolating the congruence effect. This last contrast aimed to investigate whether “when” regularity specifically influenced the amplitude of mismatch responses evoked by “what” deviants presented at a congruent time scale—i.e., deviant syllables in the single-interval condition and deviant pseudowords in the beat-based condition. This involved testing for a 2 × 2 interaction between “what” regularity violation (deviant syllable, deviant pseudoword) and “when” manipulation (single-interval, beat-based).
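Written out over the nine design cells, the three planned contrasts could take the following form (an illustrative sketch of the contrast weights, not the exact vectors used in SPM):

```matlab
% Illustrative contrast weights; rows = "what" (standard, deviant syllable,
% deviant pseudoword), columns = "when" (isochronous, single-interval, beat-based).
cDeviantVsStandard = [-2 -2 -2;  1  1  1;  1  1  1];  % (1) deviants vs standards
cIsoVsNonIso       = [ 2 -1 -1;  2 -1 -1;  2 -1 -1];  % (2) isochronous vs nonisochronous
cCongruence        = [ 0  0  0;  0  1 -1;  0 -1  1];  % (3) 2x2 congruence interaction
% In cCongruence, the positive cells are the temporally congruent pairings
% (deviant syllable/single-interval and deviant pseudoword/beat-based).
```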
To address multiple comparisons and ERF amplitude correlations across neighboring channels and time points, statistical parametric maps were thresholded at p < 0.005 and corrected for multiple comparisons over space and time at a cluster-level pFWE < 0.05, following random field theory assumptions (Kilner et al., 2005).
MEG analysis: source reconstruction
Source reconstruction was conducted under group constraints (Litvak and Friston, 2008), enabling the estimation of source activity at the individual participant level by assuming that activity is reconstructed within the same subset of sources for each participant, thereby reducing the impact of outliers. An empirical Bayesian beamformer (Wipf and Nagarajan, 2009; Belardinelli et al., 2012; Little et al., 2018) was employed for estimating sources based on the entire poststimulus time window (0–250 ms). Given the principal findings identified in the ERF analysis—specifically, a difference between ERFs elicited by deviants and standards; a difference between ERFs elicited in isochronous and nonisochronous temporal conditions; and an interaction between deviant type and temporal condition—we focused on comparing source estimates related to these effects.
For the analysis of the difference between deviants and standards, as well as the difference between stimuli presented in isochronous and nonisochronous sequences, source estimates were extracted for the 33–250 ms time window (based on the results of the ERF analysis; see below). These estimates were converted into 3D images with three spatial dimensions and then smoothed using a 5 × 5 × 5 mm full-width at half-maximum (FWHM) Gaussian kernel. The smoothed images were entered into a GLM that implemented a 3 × 3 repeated-measures ANOVA with within-subjects factors of “what” (standard, deviant syllable, deviant pseudoword) and “when” regularity (isochronous, beat-based, single-interval). Parametric tests based on a GLM were employed as an established method for analyzing MEG source reconstruction maps (Litvak et al., 2011).
For the analysis of the interaction between deviant type and temporal condition, source estimates were extracted for the 127–250 ms time window (based on the results of the ERF analysis) and processed as described above. Smoothed images were then entered into a GLM implementing a 2 × 2 repeated-measures ANOVA with within-subjects factors of content (deviant syllable, deviant pseudoword) and time (single-interval, beat-based). To address multiple comparisons and source estimate correlations across neighboring voxels, statistical parametric maps were thresholded at p < 0.005 (minimum voxel extent: 64 voxels) and corrected for multiple comparisons over space at a cluster-level pFWE < 0.05, adhering to random field theory assumptions (Kilner et al., 2005). Source labels were assigned using the Neuromorphometrics probabilistic atlas, implemented in SPM12.
MEG analysis: dynamic causal modeling
Dynamic causal modeling (DCM) was employed to estimate connectivity parameters at the source level, specifically related to the general processing of mismatch responses (deviant vs standard) and the contextual interaction between “what” and “when” regularity (deviant syllable in the single-interval condition and deviant pseudoword in the beat-based condition vs deviant syllable in the beat-based condition and deviant pseudoword in the single-interval condition). DCM, a form of effective connectivity analysis, utilizes a generative model to map sensor-level data (in this case, ERF time series across MEG channels) to source-level activity (David et al., 2005). The generative model encompasses several sources representing distinct cortical regions, forming a sparsely interconnected network. DCM is an explanatory method designed to investigate the underlying neural connectivity that mediates observed effects in sensor space (here, ERFs) and/or source space (here, source reconstruction). As such, DCM takes the observed significant effects as a starting point (Stephan et al., 2010) and aims to disambiguate between alternative hypotheses regarding the connectivity patterns mediating those effects. Since no sources were identified in the source reconstruction of the main effect of “when” regularity (see Results), this effect was not included in the model. Therefore, the analysis focused on disambiguating neural activity patterns mediating the observed significant effects, namely, the main effect of “what” regularity violation (deviants vs standards) and the interactive congruence effect of “what” and “when” regularity (deviant syllables in single-interval vs beat-based temporal sequences; deviant pseudowords in beat-based vs single-interval sequences).
Each source’s activity is explained by neural populations based on a canonical microcircuit (Bastos et al., 2012), modeled using coupled differential equations describing changes in postsynaptic voltage and current in each population. In our study, the microcircuit consisted of four populations (superficial and deep pyramidal cells, spiny stellate cells, and inhibitory interneurons), each with a unique connectivity profile, including ascending, descending, and lateral extrinsic connectivity (connecting different sources) as well as intrinsic connectivity (connecting different populations within each source). The canonical microcircuit’s form and connectivity profile followed procedures established in the previous literature on the subject (Auksztulewicz and Friston, 2015; Auksztulewicz et al., 2018; Rosch et al., 2019; Fitzgerald et al., 2021; Todorovic and Auksztulewicz, 2021).
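In generic form, the postsynaptic dynamics of each population in DCM for evoked responses follow second-order equations of roughly the following form (simplified notation; the full canonical-microcircuit equations couple four such populations via the connectivity profiles described above):

$$\ddot{v}_j(t) = \kappa_j H_j \, \sigma\big(u_j(t)\big) - 2\kappa_j \dot{v}_j(t) - \kappa_j^2 v_j(t),$$

where $v_j$ is the mean postsynaptic potential of population $j$, $u_j$ its weighted presynaptic input, $\sigma(\cdot)$ a sigmoid transforming potentials into firing rates, and $H_j$ and $\kappa_j$ the amplitude and rate constant of the synaptic kernel.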
Crucially, a subset of intrinsic connections represented self-connectivity parameters, describing the neural gain of each region. Both extrinsic connectivity and gain parameters were allowed to undergo condition-specific changes to model differences between experimental conditions (deviants vs standards and the congruence between “what” and “when” regularity). The canonical microcircuit models prior connection weights for all ascending and lateral connections as excitatory, and for descending and intrinsic connections as inhibitory, based on the previous literature linking descending connections to predictive suppression, and intrinsic connections to self-inhibitory gain control (Bastos et al., 2012). However, these priors can be overridden by the data likelihood, possibly resulting in ascending/lateral inhibition and descending/intrinsic excitation if this maximizes model evidence.
In this study, we employed DCM to fit the individual participants’ ERFs specific to each condition within the 0–250 ms time window. DCM was applied to the responses evoked by the same stimuli as those analyzed in the ERF comparisons and source reconstruction. These stimuli correspond to deviant syllables, deviant disyllabic pseudowords, and designated standards, each embedded in the three temporal conditions (isochronous, beat-based, single-interval). The 0–250 ms time window was chosen to analyze the entire time course of the syllable/pseudoword-evoked response while avoiding contamination by the subsequent stimulus.
Drawing on the results of source reconstruction (refer to the Results section) and prior research (Garrido et al., 2008), we integrated 10 sources into the cortical network. These 10 sources included eight regions identified in the source reconstruction (Fig. 5) based on their peak MNI coordinates: bilateral superior temporal gyrus (STG; left, [−60 −18 −16]; right, [62 −32 8]), left angular gyrus ([−44 −70 34]), right supramarginal gyrus ([44 −26 48]), bilateral superior parietal lobule (left, [−30 −64 54]; right, [22 −50 66]), and bilateral inferior frontal gyrus (left, [−48 28 8]; right, [42 34 14]). Additionally, we included two regions corresponding to bilateral primary auditory cortex for anatomical plausibility (Garrido et al., 2008; A1; MNI coordinates: left, [−56 −12 −2]; right, [60 −14 18]). The A1 coordinates were based on local maxima in the source reconstruction contrast maps for which probabilistic labeling returned “planum polare” (left A1 coordinates) or “planum temporale” (right A1 coordinates). To evaluate model fits, we utilized the free-energy approximation to model evidence, which penalizes model complexity.
The analysis followed a sequential approach: initially, model parameters encompassing only extrinsic connections were estimated based on all experimental conditions, without modeling differences between conditions. The aim of this initial step was to find the optimal connectivity pattern between sources, providing the best fit of the data. In a second step, condition-specific changes in both extrinsic and intrinsic connections were optimized at the individual participant level. In both steps, models were fitted to individual participants’ data. Significant parameters (connection weights) were inferred at the group level using parametric empirical Bayes (PEB) and models were optimized using Bayesian model reduction (BMR; Friston et al., 2016), as described below.
At the individual participant level, models were fitted to ERF data considering two factors: “what” regularities (all deviants vs standards) and the interaction between “what” and “when” regularity (deviant syllable in the single-interval condition and deviant pseudoword in the beat-based condition vs deviant syllable in the beat-based condition and deviant pseudoword in the single-interval condition). These two effects were modeled in parallel, such that the model space consisted of models where these two effects could independently and factorially influence different subsets of connections. At this stage, all extrinsic and intrinsic connections were included in the network, representing a “full” model. Due to the potential for local maxima in model inversion within DCM, the group level analysis implemented PEB. This involved inferring group-level parameters by (re)fitting the same “full” models to individual participants’ data. The assumption underlying this step was that model parameters should be normally distributed in the participant sample, helping to mitigate the impact of outlier participants. For model comparison, BMR was applied, contrasting the “full” models against a range of “reduced” models where certain parameters were fixed to zero. Therefore, the model space also included a “null” model in which neither the “what” main effect nor the “what”/“when” interactive effect could modulate connectivity. This null model served as a baseline against which the other models were compared, and it would be favored if the remaining models were penalized for complexity or overfitting the data. This approach led to the creation of a model space encompassing different combinations of parameters.
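In SPM12, this two-level procedure is typically run through the PEB interface; the following is a minimal sketch under the assumption that the standard spm_dcm_peb and spm_dcm_peb_bmc routines were used (GCM is a placeholder for the fitted single-subject models):

```matlab
% Minimal sketch of the group-level PEB/BMR analysis (assumes SPM12 on the
% path; GCM is a placeholder [nSubjects x 1] cell array of fitted DCMs).
M.X = ones(numel(GCM), 1);                        % group design: mean effect only
PEB = spm_dcm_peb(GCM, M, {'B'});                 % group effects on modulatory (B) parameters
BMA = spm_dcm_peb_bmc(PEB);                       % Bayesian model reduction and averaging
```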
In the first step (optimizing extrinsic connections independent of conditions), we used BMR to prune the extrinsic connectivity matrix. The free-energy approximation to log-model evidence was employed to score each model with a given extrinsic connectivity parameter set to 0, relative to the full model. This approach resulted in Bayesian confidence intervals for each parameter, indicating the uncertainty of parameter estimates. Parameters whose 99.9% confidence intervals fell entirely on one side of zero (equivalent to p < 0.001) were deemed statistically significant.
In the second step (optimizing the modulation of extrinsic and intrinsic connections by experimental conditions), a total of 64 models were generated. The model space was designed in a factorial manner, such that the following six groups of connections were set as free parameters or fixed to zero independently of each other: (1) ascending connectivity modulation by “what” regularities; (2) descending connectivity modulation by “what” regularities; (3) intrinsic connectivity modulation by “what” regularities; (4) ascending connectivity modulation by “what” and “when” congruence; (5) descending connectivity modulation by “what” and “when” congruence; and (6) intrinsic connectivity modulation by “what” and “when” congruence. The resulting 64 (2⁶) models were fitted using BMR (switching off subsets of parameters of the full model) and compared using Bayesian model selection. Since a single model was identified as winning (see Results), its posterior parameters were inspected. Parameters whose 99.9% confidence intervals excluded zero were treated as statistically significant.
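The factorial model space itself is straightforward to enumerate; a sketch with hypothetical labels:

```matlab
% Sketch of the 2^6 factorial model space over six groups of modulatory
% parameters (hypothetical labels).
groups     = {'asc-what', 'desc-what', 'intr-what', ...
              'asc-congr', 'desc-congr', 'intr-congr'};
modelSpace = dec2bin(0:2^numel(groups) - 1) - '0'; % [64 x 6] matrix of 0/1 flags
% Row 1 is the null model (no modulation anywhere); the last row is the
% full model, in which all six groups of connections are free parameters.
```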
Results
Prior to MEG measurements, participants (N = 22) implicitly learned six disyllabic pseudowords (“tupi,” “robi,” “daku,” “gola,” “zufa,” “somi”; Fig. 1A) in a 2 min block of passive exposure (see Materials and Methods for more details). After participants had learned the set of pseudowords, they listened to a continuous stream of syllables, engaging in a syllable repetition detection task while MEG was measured. The stimulus sequences were independently manipulated with respect to “what” and “when” regularity at two different levels: single syllables and disyllabic pseudowords. “What” regularity manipulations had three levels (Fig. 1B): (1) standard sequences (66.6%), in which only “correct” pseudowords were used; (2) “deviant syllable” sequences (13.3%), in which the final syllable of the sequence was replaced with a syllable belonging to a different pseudoword; (3) “deviant pseudoword” sequences (13.3%), in which the penultimate syllable of the sequence was replaced with a syllable which should be a final syllable of a different pseudoword, resulting in an irregular disyllabic pseudoword. The remaining 7% of trials contained task-related repetitions that were discarded from MEG analysis. Independently, “when” regularity manipulations also had three levels (Fig. 1C): (1) isochronous sequences (33.3%), in which the SOA between consecutive syllables was fixed at 250 ms; (2) “beat-based” sequences (33.3%), in which the timing of pseudoword onsets was regular but the timing of word-final syllables was irregular; and (3) “single-interval” sequences (33.3%), in which the timing of pseudoword onsets was irregular but the timing of word-final syllables was regular.
The aim of the analysis was twofold: first, to examine the impact of “when” regularity on responses to syllables and on neural tracking at lower versus higher time scales and, second, to assess how these “when” regularity manipulations modulate neural signatures associated with lower and higher-level “what” regularities, specifically the mismatch responses (MMRs). The latter analysis focused on testing interactions between “what” and “when” regularity, with a specific emphasis on testing whether MMRs exhibit contextually specific modulation based on temporal regularity. For instance, the analysis probed whether faster (vs slower) “when” regularity selectively influenced MMRs in response to violations of “what” regularities concerning syllables (vs pseudowords). To draw inferences on the putative mechanisms underlying interactions between “what” and “when” regularity observed at the sensor level, we applied source reconstruction and model-driven data analysis techniques (dynamic causal modeling).
Behavioral results
We first tested if participants implicitly learned the pseudowords. In line with that hypothesis, we observed that pseudoword discrimination following initial training far exceeded chance level, reaching 71.35% accuracy (SEM: 1.81%; one-sample t test against chance level: t(20) = 11.77, p < 0.001; Fig. 2A) and a mean d’ of 1.49 (SEM: 0.25; one-sample t test against chance level: t(20) = 5.88, p < 0.001). This indicates that participants could learn the pseudowords after short passive exposure to a continuous syllable stream.
Behavioral results. A, Accuracy in the training session in the pseudoword discrimination task. B, Left panel, Reaction times in the main task, in which participants detected syllable repetitions in continuous sequences; right panel, accuracy during the repetition detection task. Rain cloud plots denote individual participants’ data points. Outliers are shown in red. Box plots show median values and interquartile ranges. Whiskers show data variability outside of the interquartile ranges, excluding outliers. Violin plots show the data density across participants. Asterisks denote p < 0.05.
Participants performed well and comparably across conditions in the syllable detection task (F(2,40) = 1.730, p = 0.190; Fig. 2B). Yet, we observed a clear effect of temporal regularity on the RTs (F(2,40) = 5.188, p = 0.009): in the isochronous condition, participants were faster (mean ± SEM: 488 ± 26 ms, after exponentiating log RTs) than in both the single-interval (mean ± SEM: 550 ± 34 ms; t(20) = −2.475, p = 0.022, FDR-corrected) and beat-based conditions (mean ± SEM: 544 ± 38 ms; t(20) = −3.429, p = 0.003, FDR-corrected; Fig. 2B). To quantify the prevalence of these effects at the level of single participants, we calculated two-sample t tests on single-trial RTs for each condition (median df = 48, accounting for trials removed from the analysis; see Materials and Methods). For the comparison between isochronous and single-interval conditions, the median t statistic was equal to −0.283 (interquartile range between −1.420 and −0.001; 76% of participants below 0), while for the comparison between isochronous and beat-based conditions, it was equal to −0.564 (interquartile range between −1.246 and 0.161; 62% of participants below 0), suggesting that the majority of participants showed results consistent with the grand average but with considerable variability at the single-trial level.
MEG results: phase locking
The stimulus spectrum, quantified as intertrial phase coherence (ITPC) of the sound waveform across 80 unique sequences per condition, showed pronounced differences between the three “when” conditions (Fig. 3A,C). Specifically, (1) in isochronous sequences, both a prominent syllable-rate (4 Hz) and a pseudoword-rate (2 Hz) peak were found; (2) in beat-based sequences, the pseudoword-rate (2 Hz) peak was largely preserved, and a syllable-rate (4 Hz) peak was relatively weaker but still present; (3) in single-interval sequences, both peaks were relatively weaker compared with the other conditions. All pairwise differences for the 2 Hz rate as well as for the 4 Hz rate were significant (all p < 0.001, all t(21) > 4.95).
Phase-locking results. A, Stimulus spectrum. Prominent peaks are observed at 2 Hz (the pseudoword rate) and 4 Hz (the syllable rate). B, MEG spectrum, averaged across channels (see panels E and F for channel topographies). Similar peaks are observed as in the stimulus spectrum. C, Pseudoword-rate (2 Hz) and syllable-rate (4 Hz) peaks based on the stimulus spectrum, showing differences between conditions. Rain cloud plots denote individual participants’ data points. Outliers are shown in red. Box plots show median values and interquartile ranges. Whiskers show data variability outside of the interquartile ranges, excluding outliers. Violin plots show the data density across participants. All pairwise comparisons for the 2 Hz peaks as well as for the 4 Hz peaks were significant (see Results). Since stimuli were generated pseudorandomly for each subject, and to facilitate comparisons with MEG spectra, error bars denote SEM across participants. D, Rain cloud, box, and violin plots of 2 Hz and 4 Hz peaks based on the MEG spectrum, showing differences between conditions. Plot legend as in C. All pairwise comparisons for the 2 Hz peaks as well as for the 4 Hz peaks were significant (see Results). Error bars denote SEM across participants. E, MEG channel topography of significant differences in the pseudoword-rate 2 Hz peak between conditions. Color bar denotes F statistic. The transparency mask shows significant topography clusters (p < 0.05, FWE-corrected). Significant clusters are also outlined on the topography maps. F, MEG channel topography of significant differences in the syllable-rate 4 Hz peak between conditions. Figure legend as in E. Please note the difference in colormap scales between panels E and F. In F, nearly all channels show significant effects.
The MEG spectrum (ITPC averaged across channels) also showed prominent syllable-rate (4 Hz) and pseudoword-rate (2 Hz) peaks for all three temporal conditions (isochronous, beat-based, single-interval; paired t tests of the peaks of interest vs neighboring frequencies; syllable-rate: all t(21) > 2.87, all p < 0.009; pseudoword-rate: all t(21) > 7.68, all p < 0.001; Fig. 3B). The syllable-rate ITPC was stronger than the pseudoword-rate ITPC when averaging across channels and conditions (paired t test: t(21) = 5.98, p < 0.001). “When” regularity had a significant main effect on both the syllable-rate peaks (Fmax = 83.57, Zmax = 7.82, pFWE < 0.001) and on pseudoword-rate peaks (Fmax = 13.29, Zmax = 3.98, pFWE < 0.001; Fig. 3D). However, the syllable-rate differences showed a broad and distributed MEG topography and were significant for virtually all channels (Fig. 3F), while the pseudoword-rate differences were only significant over left-lateralized anterior and posterior channels (Fig. 3E). Post hoc tests revealed that, at the syllable rate, ITPC was higher in the isochronous condition than in the beat-based condition (t(21) = 9.62, p < 0.001) and in the beat-based condition than in the single-interval condition (t(21) = 4.16, p = 0.004). The same pattern of results was found for the pseudoword-rate ITPC (pairwise comparisons: isochronous vs beat-based, t(21) = 2.83, p = 0.009; beat-based vs single-interval, t(21) = 5.37, p < 0.001). Thus, the phase-locking analysis showed a close correspondence between spectral characteristics of the stimulus waveform and the MEG responses; however, in sensor MEG data, sensitivity to pseudoword-rate peaks was largely limited to left-lateralized channels. This suggests that neural activity did not merely follow the stimulus spectrum but was sensitive to the temporal structure of the syllable streams with a degree of topographic specificity.
MEG results: event-related fields and source reconstruction
To test for modulations of stimulus-evoked activity by “what” and “when” regularity, we analyzed MEG data in the time domain and subjected ERFs to a general linear model with fixed effects of “what” (standard, deviant syllable, deviant pseudoword) and “when” regularity (isochronous, beat-based, single-interval). First, we found that the main effect of “what” regularity across its three levels (Fmax = 12.13, Zmax = 4.22, pFWE = 0.03) corresponded to a significant difference between deviants (pooled across deviant syllables and pseudowords) and standards (133–250 ms over left central/posterior channels; Fmax = 24.18, Zmax = 4.60, pFWE = 0.003; Fig. 4A), such that the ERFs evoked by deviant stimuli were stronger than ERFs evoked by standard stimuli (Tmax = 4.92). To quantify the prevalence of this effect among single participants, for each participant we calculated a two-sample t test between single-trial evoked response amplitudes following deviants versus standards, measured at the channel and time point where the peak group effect was observed. The median t statistic was equal to 1.218 (median df: 944; interquartile t statistic range between 0.381 and 3.245; 77% of participants above 0), indicating that a robust majority of participants showed results consistent with the grand average. No significant differences were found between ERFs evoked by deviant syllables and deviant pseudowords, indicating that “what” regularities alone have a relatively coarse effect on neural response amplitude, differentiating only between deviants and standards.
Event-related fields. A, Main effect of “what” regularity violation (deviant vs standard). Left panels, Time courses of ERFs averaged over the spatial topography clusters shown in the right panels. For each participant and condition, the ERF was based on an average of up to 450 stimuli (before discarding trials with artifacts). Shaded area denotes SEM across N = 22 participants. Black horizontal bar denotes pFWE < 0.05. Right panels, spatial distribution of the main effect. Color bar: F values. B, Main effect of “when” regularity (isochronous vs single-interval vs beat-based). For each participant and condition, the ERF was based on an average of up to 300 stimuli (before discarding trials with artifacts). Figure legend as in A. C, Congruency/interaction between “what” (deviant syllable vs deviant pseudoword) and “when” regularity (single-interval vs beat-based). For each participant and condition, the ERF was based on an average of up to 75 stimuli (before discarding trials with artifacts).
To source-localize the ERF effect found for deviants versus standards, we reconstructed the MEG topography of evoked responses in source space and tested for differences in 3D source maps using a GLM (see Materials and Methods, MEG analysis: source reconstruction, for details). Overall, averaged across stimulus types, source reconstruction could explain 88.34 ± 3.40% of sensor-space variance (mean ± SEM across participants). We identified stronger activity estimates for deviant stimuli (collapsed across deviants) versus standard stimuli in a range of sources (Table 1; Fig. 5A), including bilateral STG, SPL, and IFG, the left angular gyrus (ANG), and the right supramarginal gyrus (SMG), reflecting a distributed network sensitive to auditory deviance.
Source reconstruction. A, Regions showing a significant main effect of “what” regularities (deviant vs standard) after applying a binary significance mask. Insets show unthresholded Z-maps. B, Regions showing a significant congruency effect between “what” (deviant syllable vs deviant pseudoword) and “when” regularity (single-interval vs beat-based). Legend as in A. STG, superior temporal gyrus; ANG, angular gyrus; SPL, superior parietal lobule; SMG, supramarginal gyrus; IFG, inferior frontal gyrus. Left and right hemispheres are shown in separate columns.
Source reconstruction results. Summary statistics of all clusters showing significant differences between conditions (pFWE < 0.05)
Second, we tested for the effect of “when” regularity on ERF amplitude. While the omnibus difference among the three “when” conditions was not significant (all pFWE > 0.05), in a planned contrast we identified a significant difference between isochronous and nonisochronous (pooled over beat-based and single-interval) conditions (Fig. 4B; Fmax = 16.92, Zmax = 3.84, pFWE < 0.025). Here, ERF amplitudes were stronger for isochronous versus nonisochronous conditions (Tmax = 4.11) between 33 and 250 ms over right posterior channels. A prevalence analysis of this effect (conducted in the same way as for deviants vs standards) showed that the median t statistic across participants was equal to 1.014 (median df: 944; interquartile t statistic range between 0.143 and 3.359; 77% of participants above 0), indicating that a robust majority of participants showed results consistent with the grand average. Source reconstruction of this ERF effect, however, did not reveal any significant source-level clusters after correcting for multiple comparisons (all pFWE > 0.05), suggesting that the ERF effect did not systematically map onto underlying sources.
Finally, we tested for the ERF interaction effect between “what” regularities (i.e., the type of deviant stimulus) and “when” regularity (i.e., the type of nonisochronous temporal regularity). We observed a significant interaction, such that deviant syllables and pseudowords were associated with stronger ERFs when their timing was regular (i.e., in the single-interval and beat-based conditions, respectively), relative to when their timing was irregular (i.e., in the beat-based and single-interval conditions, respectively). This effect localized to left posterior channels (Fig. 4C; time extent: 127–250 ms, Fmax = 13.10, Zmax = 3.36, pFWE = 0.036). Post hoc tests on the interaction effect found that it was driven primarily by deviant pseudowords presented in the beat-based versus single-interval conditions (t(21) = 2.899, p = 0.009). A prevalence analysis of this effect (conducted in the same way as for the main effects) showed that the median t statistic across participants was equal to 0.229 (median df: 156; interquartile t statistic range between −0.961 and 2.169; 59% of participants above 0), indicating that a majority of participants showed results consistent with the grand average albeit with considerable variance at the level of single trials. The remaining pairwise comparisons did not reach significance after correcting for multiple comparisons (deviant syllables presented in the beat-based vs single-interval condition: t(21) = −1.552, p = 0.136, uncorrected; deviant syllables vs pseudowords presented in the beat-based condition: t(21) = −1.197, p = 0.244; deviant syllables vs pseudowords presented in the single-interval condition: t(21) = 2.505, p = 0.021, uncorrected). Taken together, this interaction indicates that the effects of “what” regularities were stronger when they were congruent with “when” regularity.
To source-localize the interaction of “what” and “when” regularity, we compared sources of activity evoked by deviants whose timing was regular (deviant syllables presented in single-interval blocks and deviant pseudowords presented in beat-based blocks) against deviants whose timing was irregular (deviant syllables presented in beat-based blocks and deviant pseudowords presented in single-interval blocks). This contrast revealed significant differences in two regions (Fig. 5B): the left SPL and right IFG (see Table 1 for region coordinates and statistical information). A third cluster with the most probable anatomical label being “unknown” (peak MNI [−64 −4 22], voxel extent 833, lying in the vicinity of the left postcentral/precentral gyrus) was excluded from further analysis. All sources showed weaker activity for deviant stimuli presented in temporally congruent versus incongruent conditions (Tmin = −8.44). In summary, the interaction of “what” and “when” regularity was mapped to a more limited set of brain regions than the main effect of “what” regularities.
MEG analysis: dynamic causal modeling
To infer the connectivity patterns mediating the effects of “what” and “when” regularity on speech processing, we used dynamic causal modeling (DCM)—a Bayesian model of effective connectivity fitted to individual participants' spatiotemporal patterns of syllable-evoked ERFs (David et al., 2005; Auksztulewicz and Friston, 2015). DCM models ERFs as arising in a network of sources. Network structure is quantified by extrinsic connections (linking distinct sources) and intrinsic connections (linking distinct populations within the same source and amounting to neural gain modulation at each source). The analysis consisted of two steps. In the first step, we created a fully interconnected model based on the eight regions identified in the source reconstruction (see above) as well as bilateral AC (see Materials and Methods) and optimized its extrinsic connectivity using Bayesian model reduction (BMR). This procedure pruned 75% of the connections, retaining 19 of the 76 connections of the fully interconnected model (p < 0.001; Fig. 6A). This finding indicates that the model space accommodated both overly complex and overly simple models, which were appropriately penalized.
Dynamic causal modeling. A, Anatomical model including the eight regions identified in the source reconstruction analysis (bilateral superior temporal gyrus, STG; bilateral superior parietal lobule, SPL; bilateral inferior frontal gyrus, IFG; left angular gyrus, ANG; right supramarginal gyrus, SMG) as well as bilateral auditory cortex (AC). The figure depicts a model with reduced anatomical connectivity based on Bayesian model reduction, used for subsequent modeling of condition-specific effects. Black arrows, excitatory connections; red arrows, inhibitory connections. Intrinsic (self-inhibitory) connections not shown. B, Top panel, Model space showing models on the horizontal axis and groups of connections included as free parameters (gray) or switched off (black) in each model on the vertical axis. The winning model, allowing “what” regularities to modulate ascending and descending (but not intrinsic) connections, and congruence between “what” and “when” regularity to modulate intrinsic (but not ascending or descending) connections, is shown as a white column. Bottom panel, Bayesian model comparison; posterior model probability for each model. C, Posterior model parameters sensitive to “what” regularity violations. Only significant parameters (posterior probability >99.9%) are shown. Black, excitatory; red, inhibitory; solid lines, stronger connectivity; dashed lines, weaker connectivity for deviants versus standards. D, Posterior model parameters sensitive to congruence between “what” and “when” regularity. Self-inhibitory intrinsic connections showed region-dependent increases (solid) or decreases (dashed) for deviants temporally predicted at congruent versus incongruent time scales.
In a second step, we took the reduced model and allowed its extrinsic and intrinsic connections to vary systematically with the experimental conditions. Specifically, since the source reconstruction only identified significant differences in source maps related to (1) all deviants versus standards and (2) congruent versus incongruent “what” and “when” regularities, we considered these two factors as possible modulators of extrinsic and/or intrinsic connectivity. Bayesian model comparison of the 64 resulting models revealed a single winning model, in which “what” regularities modulated only extrinsic connections, while the congruence between “what” and “when” regularity modulated only intrinsic connections (Fig. 6B). The difference in the free-energy approximation to log-model evidence between the winning model and the next-best model (log Bayes factor) was 3.701, amounting to a 97.53% probability that the winning model outperforms the next-best model. Therefore, the winning model provides strong evidence for distinct effective connectivity patterns sensitive to “what” regularity and its interaction with “when” regularity.
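For reference, converting free-energy differences into model probabilities amounts to a softmax over approximate log evidences under a uniform model prior. The helper below is a sketch; the exact percentage reported above may derive from the full 64-model comparison rather than the pairwise Bayes factor alone.

```python
import numpy as np

def model_posteriors(free_energies):
    """Posterior model probabilities from free-energy approximations to
    log evidence, assuming a uniform prior over models (softmax)."""
    F = np.asarray(free_energies, dtype=float)
    F = F - F.max()          # subtract the maximum for numerical stability
    p = np.exp(F)
    return p / p.sum()

# A log Bayes factor of 3.701 between two models yields approximately
# model_posteriors([3.701, 0.0]) -> [0.976, 0.024]
```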
Based on the winning model, we then inferred which connectivity parameters were significantly modulated. “What” regularity significantly modulated a subset of extrinsic connections (Fig. 6C; see Table 2 for statistical information). Specifically, deviants were linked to increased ascending connectivity bilaterally at multiple levels of the auditory hierarchy, from AC via STG to SPL. Conversely, ascending connectivity decreased for cross-hemispheric connections at higher levels of the hierarchy, from SPL to IFG and from the left STG via ANG to the right IFG. Deviants also modulated descending connectivity, such that top-down inhibition increased at higher levels of the hierarchy (from bilateral IFG to SPL, ANG, and SMG) and decreased at lower levels of the hierarchy (from STG to AC). In summary, deviant processing differentially affected ascending and descending connections, leading to a net ascending drive, especially within hemispheres and at lower levels of the hierarchy.
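As a schematic of how such condition-specific effects are commonly parameterized in DCM for evoked responses (an illustrative assumption, not a quote of the implementation used here), modulatory parameters enter as log-scaling factors on baseline connection strengths:

```python
import numpy as np

# Schematic: a positive modulatory parameter b strengthens, and a negative
# b weakens, a connection multiplicatively relative to its baseline. This
# is an assumed simplification of the DCM parameterization.
def modulated_strength(baseline, b):
    return baseline * np.exp(b)

# modulated_strength(1.0, 0.3)  -> ~1.35 (stronger for deviants)
# modulated_strength(1.0, -0.3) -> ~0.74 (weaker for deviants)
```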
Dynamic causal modeling results. Summary of significant condition-specific effects on connectivity estimates (p < 0.001)
Finally, the congruence of “what” and “when” regularity exclusively modulated a subset of intrinsic connections (Fig. 6D; see Table 2 for statistical information). Specifically, deviant syllables and pseudowords that were predictable in time were linked to increased gain (decreased self-inhibition) in bilateral AC and the right SMG, and to decreased gain (increased self-inhibition) in most other sources in the network, with the exception of the right STG and left IFG, for which no significant self-connectivity modulation was found. This pattern indicates that “what” regularities congruent with “when” regularity increased gain at low levels of the hierarchy and decreased gain at higher levels. Overall, the model explained 81.44 ± 1.62% of the variance of the spatiotemporal ERF patterns (mean ± SEM across participants).
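The variance-explained figure can be computed per participant as the proportion of variance in the observed spatiotemporal ERF pattern captured by the model prediction; a minimal sketch with hypothetical channel-by-time arrays:

```python
import numpy as np
from scipy import stats

def variance_explained(observed, predicted):
    """Percent variance of a spatiotemporal ERF pattern (channels x time)
    explained by the model prediction; inputs are hypothetical arrays."""
    resid = observed - predicted
    return 100.0 * (1.0 - resid.var() / observed.var())

# Group summary across participants (mean +/- SEM, as reported), assuming
# `obs_list` and `pred_list` hold per-participant arrays:
# r2 = [variance_explained(o, p) for o, p in zip(obs_list, pred_list)]
# np.mean(r2), stats.sem(r2)
```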
Discussion
In the current study, we observed a contextual modulation of neural responses to deviations from predicted speech contents, dependent on their temporal regularity. Specifically, faster “when” (single-interval) regularity amplified responses to deviant syllables, whereas slower “when” (beat-based) regularity amplified responses to deviant disyllabic pseudowords. This implies a congruence effect in the processing of “what” and “when” regularity across hierarchical levels of speech processing, whereby smaller processing units (syllables) are modulated at faster rates and larger processing units (words) at slower rates. However, the interactive effects of “what” and “when” regularities on evoked neural responses did not differentiate between hierarchical levels and were instead linked to a shared network of sources including the left SPL and right IFG. In the connectivity analysis, these modulatory effects of “when” regularity on “what” mismatch responses were best explained by widespread gain modulations, including stronger sensitivity of bilateral early auditory regions and the right SMG, as well as weaker sensitivity of most of the temporo-fronto-parietal network. Thus, our analyses of evoked responses, source reconstruction, and computational modeling suggest that the interactions between “what” and “when” regularities, while contextually specific in their congruence effects, are facilitated by a common and distributed cortical network, independent of the hierarchical level of these regularities.
Mismatch responses to unpredicted speech contents are well documented in neuroimaging studies and have been found in a range of cortical regions including the STG (Rothermich and Kotz, 2013), the SMG (Celsis et al., 1999), and—in the case of nonsense words—the bilateral IFG, largely matching our findings (Wilson et al., 2015). Beyond these regions, we also found sensitivity to “what” regularity violations in the angular gyrus, consistent with its role in phonological processing and novel word acquisition (Seghier, 2023), and in the SPL, part of the dorsal attentional network previously linked to statistical learning of artificial speech streams (Sengupta et al., 2019). Our dynamic causal modeling of mismatch responses suggested a consistent pattern of connectivity modulations, with qualitative differences between lower and higher levels of the cortical hierarchy. At the lower levels (from AC to STG and from STG to SPL), speech deviants were linked to stronger ascending excitation and weaker descending inhibition, consistent with increased forward prediction errors signaled in response to short-term violations of speech predictions by auditory regions in the temporal lobe (Caucheteux et al., 2023). Conversely, at the higher levels of the hierarchy (in the frontoparietal network), speech deviants were associated with weaker ascending and stronger descending connectivity, possibly reflecting internal attentional orienting to unpredicted speech contents (Reiche et al., 2013; Lückmann et al., 2014).
Temporal regularity of speech sounds modulated behavior in an incidental task (with RTs being shorter following isochronous vs nonisochronous sequences), consistent with previous findings (Morillon et al., 2016). We also found that temporal regularity (in isochronous sequences) increased sound-evoked ERF amplitude (Bouwer et al., 2016), to a similar extent relative to both beat-based and single-interval nonisochronous sequences. A previous EEG study comparing these two types of temporal regularities found largely comparable behavioral and neural effects, with differences limited to sounds presented at unexpected times (but not at expected times) in beat-based sequences (Bouwer et al., 2020). In our study, differences between these two types of sequences emerged in the analysis of frequency-domain effects of “when” regularity. Frequency-domain analyses generally indicated a close alignment between the MEG spectrum and the stimulus spectrum, although neural tracking of syllables (as quantified using intertrial phase coherence, ITPC) was stronger than that of longer chunks, consistent with previous results (Har-Shai Yahav and Zion Golumbic, 2021). However, while the syllable-rate effect was distributed over a large number of channels, the pseudoword-rate effect was predominantly observed over the left hemisphere, suggesting that the statistical learning of speech sequences can exert asymmetric effects. This is consistent with previous reports of left-hemispheric contributions to speech segmentation based on statistical regularities (Cunillera et al., 2009; López-Barroso et al., 2013). Similarly, studies using phase-locking measures to quantify tracking at suprasyllabic time scales showed more pronounced differences in the left hemisphere (Ding et al., 2017; Har-Shai Yahav and Zion Golumbic, 2021), although these pertained to multiword phrases rather than single words and were based on familiar disyllabic words rather than recently learned pseudowords. Interestingly, the left-hemispheric dominance found for word-rate tracking in the current study contrasts with the right-hemispheric dominance found for tone-pair tracking in a similar study based on nonspeech musical sequences (Cappotto et al., 2023), suggesting differential tracking of speech and nonspeech sequences.
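For reference, ITPC at a given channel and frequency is the length of the mean resultant vector of single-trial Fourier phases; a minimal sketch assuming a trials-by-frequencies array of phases in radians:

```python
import numpy as np

def itpc(phases):
    """Intertrial phase coherence: resultant vector length of single-trial
    phases (array of shape n_trials x n_freqs, in radians). Values near 1
    indicate consistent phase alignment across trials; near 0, none."""
    return np.abs(np.mean(np.exp(1j * phases), axis=0))
```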
Besides their distinct individual effects on neural activity, regularities related to speech contents and timing demonstrated interactive effects, such that temporally regular deviant sounds yielded stronger ERF amplitudes than temporally irregular deviant sounds. This interaction was specific with respect to the hierarchical level of speech organization and its respective time scale, such that deviant pseudowords evoked stronger amplitudes under beat-based regularities, whereas deviant syllables evoked stronger amplitudes under single-interval regularities. These results build upon prior research indicating that temporal regularities increase mismatch responses (MMRs; Yabe et al., 1997; Takegata and Morotomi, 1999; Todd et al., 2018; Lumaca et al., 2019; Jalewa et al., 2021) and demonstrate that these modulatory effects align consistently with expected time points, irrespective of the specific nature of the “when” regularity (whether single-interval or beat-based). However, there were no significant differences between the interactive effects observed at the lower versus higher level of the hierarchy. Instead, the interactive ERF effects were source-localized to shared regions, the right IFG and left SPL, where deviants presented at regular latencies were linked to lower source activity estimates than deviants presented at irregular latencies. Identifying these two regions as mediating interactions between “what” and “when” regularities extends the results of previous studies, in which the left parietal cortex was found to be involved in integrating “what” and “when” information in speech processing (Orpella et al., 2020), while the right IFG was found to be inhibited by regular speech timing (metrical context; Rothermich and Kotz, 2013). Our DCM results suggest that the interactive effects of “what” and “when” regularities were subserved primarily by gain modulation, including weaker gain in the hierarchically higher regions (the right IFG and left SPL) identified in the source reconstruction. This relative attenuation at hierarchically higher levels was accompanied by stronger gain at the lowest levels of the network, including bilateral AC, consistent with previously reported gain-amplifying effects of temporal orienting on auditory processing (Auksztulewicz and Friston, 2015; Morillon et al., 2016; Auksztulewicz et al., 2019), as well as in the SMG, previously shown to be involved in processing temporal features of speech (Geiser et al., 2008).
This MEG study of speech sequences aligns closely with the findings of Cappotto et al. (2023), who used similar methods to examine EEG responses to musical sequences. In both cases, faster and slower “when” regularity intensified deviant responses to unexpected elements at the matching hierarchical levels: for speech, syllables versus disyllabic pseudowords; for music, single tones versus tone pairs. Additionally, both studies found that the interactive effects of “what” and “when” regularities are associated with the left superior parietal lobule (SPL). However, our study also implicates the right inferior frontal gyrus (IFG) as a potential source of these sensor-level effects. Furthermore, both studies reported that the sensor-level effects could be attributed to increased gain in bilateral auditory cortices and reduced gain in other network nodes; the EEG study further linked these effects to changes in forward connectivity. Collectively, these findings suggest that the interactions between “what” and “when” regularities are consistent across stimulus domains (speech and music) and data modalities (MEG and EEG), revealing largely overlapping cortical sources and connectivity modulations. Therefore, despite the inherent differences in temporal modulation patterns and acoustic features between music and speech (Siegel et al., 2012; Ding et al., 2017; Albouy et al., 2020; Zatorre, 2022), these common underlying mechanisms support the hypothesis that the interactions between “what” and “when” regularities generalize broadly across stimulus characteristics.
Taken together, our study complements recent model-based reports of cortical hierarchies aligning with speech processing hierarchies (Schmitt et al., 2021; Caucheteux et al., 2023) and suggests that while “what” and “when” predictions may jointly modulate speech processing, their interactions are not necessarily expressed at different levels of the cortical hierarchy. Instead, the effects of temporal regularities on unexpected speech sounds may be subserved by a common set of frontoparietal regions, reflecting attention-like amplification of mismatch responses by temporal predictions (Auksztulewicz and Friston, 2015; Auksztulewicz et al., 2019), irrespective of the contents of the mispredicted stimuli. Rather than requiring dedicated resources to integrate “what” and “when” features separately at each stage of hierarchical speech processing, such a generic mechanism may help integrate streams of information across hierarchical levels.
Footnotes
This work was supported by the European Commission’s Marie Skłodowska-Curie Global Fellowship (750459 to R.A.); a grant from the European Commission/Hong Kong Research Grants Council Joint Research Scheme (9051402 to R.A. and J.S.); and a grant from the German Science Foundation (AU 423/2-1 to R.A.).
The authors declare no competing financial interests.
Correspondence should be addressed to Ryszard Auksztulewicz at ryszard.auksztulewicz@maastrichtuniversity.nl.