Abstract
The key assumption of the predictive coding framework is that internal representations are used to generate predictions of how the sensory input will look in the immediate future. These predictions are tested against the actual input by so-called prediction error units, which encode the residuals of the predictions. What happens to prediction errors, however, if predictions drawn by different stages of the sensory hierarchy contradict each other? To answer this question, we conducted two fMRI experiments while female and male human participants listened to sequences of sounds: pure tones in the first experiment and frequency-modulated sweeps in the second experiment. In both experiments, we used repetition to induce predictions based on stimulus statistics (stats-informed predictions) and abstract rules disclosed in the task instructions to induce an orthogonal set of (task-informed) predictions. We tested three alternative scenarios: neural responses in the auditory sensory pathway encode prediction error with respect to (1) the stats-informed predictions, (2) the task-informed predictions, or (3) a combination of both. Results showed that neural populations in all recorded regions (bilateral inferior colliculus, medial geniculate body, and primary and secondary auditory cortices) encode prediction error with respect to a combination of the two orthogonal sets of predictions. The findings suggest that predictive coding exploits the non-linear architecture of the auditory pathway for the transmission of predictions. Such non-linear transmission of predictions might be crucial for the predictive coding of complex auditory signals like speech.
Significance Statement Sensory systems exploit our subjective expectations to make sense of an overwhelming influx of sensory signals. It is still unclear how expectations at each stage of the processing pipeline are used to predict the representations at the other stages. The current view is that this transmission is hierarchical and linear. Here we measured fMRI responses in auditory cortex, sensory thalamus, and midbrain while we induced two sets of mutually inconsistent expectations about the sensory input, each putatively encoded at a different stage. We show that responses at all stages are concurrently shaped by both sets of expectations. The results challenge the hypothesis that expectations are transmitted linearly and provide a normative explanation of the non-linear physiology of the corticofugal sensory system.
- auditory midbrain
- auditory pathway
- cortico-thalamic interactions
- predictive coding
- sensory processing
- sensory thalamus
Introduction
Predictive coding (Rao and Ballard, 1999; Friston, 2003b, 2005) is the leading theoretical framework for understanding how expectations are integrated in our experience of reality (Keller and Mrsic-Flogel, 2018). Its central assumption is that sensory processing is mediated by the computation of prediction error: the residual between expectations and the sensory input (Spratling, 2017; Keller and Mrsic-Flogel, 2018).
Predictive coding follows the hierarchical organization of sensory systems (Keller and Mrsic-Flogel, 2018). Units computing prediction error at each processing stage are generally assumed to test the predictions drawn by the level above (Spratling, 2017; Keller and Mrsic-Flogel, 2018; e.g., a pitch processing stage tests predictions drawn from a stage encoding melodic phrases). However, it is unclear how prediction error at each processing stage depends on predictions drawn at even higher levels of the hierarchy, and what happens when predictions from different levels are inconsistent.
One line of research suggests that predictions drawn by one level are only tested by the level immediately below. This is the converging conclusion of the studies using the so-called local–global paradigm (Bekinschtein et al., 2009). In this paradigm, participants hear successive repetitions of a melodic phrase of tones
Another line of research, however, suggests that predictions drawn by one level are tested by more than one lower-level processing stage: task instructions drive the encoding of prediction error in human AC (Stein et al., 2022) and two stages of the subcortical pathway (Tabas et al., 2020, 2021): the auditory thalamus [medial geniculate body (MGB)] and midbrain [inferior colliculus (IC)]. High-resolution studies in non-human primates also reported that the global predictions from the local–global literature are tested by prediction error units in AC (Uhrig et al., 2014, 2016; Jiang et al., 2022) and MGB (Jiang et al., 2022). Unlike the first line of research, these findings imply that predictions are further transmitted downstream in the hierarchy, all the way to the subcortical pathways.
How can one reconcile the results of the two lines of research? A likely possibility is that the nature of the predictions plays a role in determining whether or not they are transmitted further downstream in the hierarchy. Indeed, while global predictions are based on local statistics of the stimulus sequences, task-induced predictions stem from a holistic understanding of the structure of the sensory input. The brain may use the better-informed task-induced predictions across the entire sensory pathway, while restricting predictions based on the statistics of a representation, like the global predictions, to the immediately lower level. To understand the hierarchical interplay of predictions in predictive coding, it is thus crucial to study scenarios beyond the local–global paradigm and consider the interplay of predictions of different natures.
In the present study, we investigate how stats-informed and task-informed predictions are used to compute prediction error in the human AC, MGB, and IC. We consider three alternative scenarios: first, that prediction error is computed only with respect to the stats-informed predictions, consistent with the first line of research (Bekinschtein et al., 2009; Wacongne et al., 2011; Chennu et al., 2013; Recasens et al., 2014; El Karoui et al., 2015; Dürschmid et al., 2016; Nourski et al., 2018); second, that prediction error is computed only with respect to the task-informed prediction, consistent with the second line of research (Tabas et al., 2020, 2021; Stein et al., 2022); and third, that prediction error is computed with respect to a combination of both sets of predictions. The third scenario is the one that best reflects the non-linear anatomy of the descending auditory pathway: for instance, the IC receives feedback connections from both the AC and the MGB (Hackett and Kaas, 2004; Lee and Sherman, 2011; Schofield, 2011).
Methods
Experimental design and statistical analysis
Experimental paradigm
In this study, we reanalyze data from two previous experiments (Tabas et al., 2020, 2021). Both experiments used the same task. A trial consisted of a sequence of eight sounds: seven repetitions of a standard and one deviant (Fig. 1A). Participants were instructed to monitor the sequences and to report, as accurately and as fast as possible, the position of the deviant within the sequence.
The experimental paradigm was designed to elicit two sets of independent predictions on the sensory input. One set of predictions (stats-informed) would be drawn by a population
To elicit the task-informed set of predictions, we introduced two abstract rules: (1) there will always be a deviant and (2) the deviant could only be located at positions 4, 5, or 6. These rules were disclosed to the participants at the beginning of the experiment, who could use them to infer the likelihood of the position of the deviant during each trial. The rule renders
The inter-trial-interval (ITI) was jittered so that deviants were separated by an average of 5 s, up to a maximum of 11 s, with a minimum ITI of 1500 ms. This maximized the efficiency of the response estimation of the deviants (Friston et al., 1999) while keeping a sufficiently long ITI to ensure that the sequences belonging to separate trials were not confounded.
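For illustration, such a jitter can be sampled as in the sketch below; the specific distribution (here a clipped exponential) is an assumption, since the text only states the mean, maximum, and minimum intervals.

```python
import random


def sample_itis(n_trials, mean_gap=5.0, max_gap=11.0, min_iti=1.5, seed=0):
    """Draw jittered inter-trial intervals (in seconds).

    Assumption: a clipped exponential whose offset and rate roughly
    reproduce the stated average gap of ~5 s; the paper does not
    specify the actual jitter distribution.
    """
    rng = random.Random(seed)
    itis = []
    while len(itis) < n_trials:
        # exponential jitter above the minimum ITI, clipped at the maximum
        iti = min_iti + rng.expovariate(1.0 / (mean_gap - min_iti))
        if iti <= max_gap:
            itis.append(iti)
    return itis
```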
The experiment consisted of several runs of the same task. Each run contained 6 blocks of 10 trials. The 10 trials in each block contained the same standard–deviant combination, so that within a block only the position of the deviant was unknown, while the identity of the deviant was known. Each of the blocks in a run used one of the six standard–deviant combinations. The order of the blocks within the experiment was randomized. The position of the deviant was pseudorandomized across all trials in each run so that each deviant position occurred 60 (pure tone experiment) or 180 (sweep experiment) times per subject but an unknown number of times per run. This constraint allowed us to keep the same prior probability for all deviant positions in each block (i.e.,
Stimuli
All stimuli were 50 ms long, including 5 ms ramp-in and ramp-out Hanning windows. Stimuli were arranged in each sequence with a fixed inter-stimulus-interval of
There were two sets of stimuli, one based on pure tones (experiment 1) and one based on frequency-modulated (FM)-sweeps (experiment 2). Pure tones and FM-sweeps are two of the three information-bearing elements (IBEs) (Suga, 2012) into which meaningful acoustic signals can be linearly decomposed. We used these two sets to test whether the same principles operate across different IBE types and thus generalize to information-bearing auditory signals.
The pure tone set consisted of three pure tones of frequencies
The FM-sweep set consisted of three linear FM-sweeps, one with a descending FM (down) and two with ascending FM (up), with modulation rates
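For illustration, stimuli of this kind (short tones or linear FM-sweeps gated by Hanning ramps) can be synthesized as below; the frequency, modulation-rate, and sampling-rate values are placeholders, not the values used in the experiments.

```python
import math


def hanning_ramp_tone(freq_hz=1000.0, dur_s=0.05, ramp_s=0.005,
                      fs=44100, sweep_rate_hz_per_s=0.0):
    """Synthesize a 50 ms tone (or linear FM-sweep) with 5 ms
    Hanning ramp-in and ramp-out, as described in the Methods.

    All parameter values are illustrative placeholders; the actual
    stimulus parameters are detailed in the original papers.
    """
    n = int(dur_s * fs)
    n_ramp = int(ramp_s * fs)
    samples = []
    for i in range(n):
        t = i / fs
        # instantaneous phase of a linear FM-sweep (rate 0 -> pure tone)
        phase = 2 * math.pi * (freq_hz * t + 0.5 * sweep_rate_hz_per_s * t * t)
        amp = 1.0
        if i < n_ramp:                       # Hanning ramp-in
            amp = 0.5 * (1 - math.cos(math.pi * i / n_ramp))
        elif i >= n - n_ramp:                # Hanning ramp-out
            amp = 0.5 * (1 - math.cos(math.pi * (n - 1 - i) / n_ramp))
        samples.append(amp * math.sin(phase))
    return samples
```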
Description of the data
Data for each experiment were acquired with different MRI scanners and different participant cohorts. Here we briefly describe the key characteristics of each data-set; full descriptions are detailed in Tabas et al. (2020, experiment 1, pure tones) and Tabas et al. (2021, experiment 2, FM-sweeps). Data collection of the pure tone data-set was approved by the Ethics Committee of the Medical Faculty of the University of Leipzig, Germany (ethics approval number 273/14-ff). Data collection of the FM-sweep data-set was approved by the Ethics Committee of the Technische Universität Dresden, Germany (ethics approval number EK 315062019). All listeners provided written informed consent and received monetary compensation for their participation.
Data from 19 (12 female) and 18 participants (12 female) were included in the pure tone and FM-sweep data-sets, respectively. All participants had normal hearing (thresholds equal to or below 25 dB in the range 250 Hz to 8 kHz, as measured by pure tone audiometry) and scores within the neurotypical range in screenings for developmental dyslexia (rapid automatized naming test of letters, numbers, and objects; Denckla and Rudel, 1976) and autism spectrum disorder (AQ; Baron-Cohen et al., 2001; all screenings conducted in German).
Stimuli were presented using MATLAB (The Mathworks) with the Psychophysics Toolbox extensions (Brainard, 1997). Loudness was adjusted independently for each subject to a comfortable level before starting the data acquisition. In the pure tone experiment, stimuli were delivered through an MrConfon amplifier and headphones (MrConfon GmbH). In the FM-sweep experiment, stimuli were delivered through an Optoacoustics (Or Yehuda, Israel) amplifier and headphones equipped with active noise cancellation.
Data from the pure tone data-set were collected using a 7-Tesla Magnetom (Siemens Healthineers) with a spatial resolution of 1.5 mm isotropic and temporal resolution of
Participants from the pure tone data-set completed 4 runs in a single session (240 trials in total, 80 per deviant position). All but one participant from the FM-sweep data-set completed 9 runs of the main experiment across 3 sessions (540 trials in total, 180 per deviant position); subject 18 completed only 8 runs due to technical reasons. Due to an undetected bug in the presentation code, the first three runs of subjects 1, 2, 4, and 5 and the first six runs of subject 3 were discarded.
During fMRI data acquisition, we also recorded the respiration (in the pure tone data-set) and heart rate (in both the pure tone and FM-sweep data-sets) of the participants. We recorded structural MR-images of each participant using either MP2RAGE (Marques et al., 2010; pure tone data-set; parameters:
All data were preprocessed using Nipype (Gorgolewski et al., 2011), and analyses were carried out using tools of the Statistical Parametric Mapping toolbox, version 12 (SPM); Freesurfer, version 6 (Fischl et al., 2002); the FMRIB Software Library, version 5 (FSL; Jenkinson et al., 2012); and the Advanced Normalization Tools, version 2.2.0 (ANTs; Avants et al., 2011). All second-level analyses were performed in the MNI152 1 mm isotropic asymmetric template space (Montreal Neurological Institute, MNI).
Data were first realigned and unwarped with SPM. The transformation between the functional runs and the structural image was computed with Freesurfer’s BBregister, which fits the boundaries between gray and white matter of the structural data to the functional images using the whole-brain EPI as an intermediate step. We computed the transformation between the structural image and the MNI template by fitting a concatenation of rigid, affine, and B-spline non-linear volume-based mappings with ANTs. ANTs simultaneously fits the direct and inverse transforms between the two spaces: we used the direct transform to map the data to MNI space, and the inverse transform to map the regions of interest (ROIs) to the subject space to validate the registrations (see below).
Physiological (respiration and/or heart rate) data were processed with the PhysIO Toolbox (Kasper et al., 2017), which computes the Fourier expansion of each component along time and adds the coefficients as covariates of no interest in the model’s design matrix. All preprocessing parameters, including the smoothing kernel size, were fixed before we started fitting the general linear model and remained unchanged during the subsequent steps of the data analysis.
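The kind of expansion the PhysIO toolbox computes can be sketched as below, under the assumption of a RETROICOR-style Fourier expansion of the physiological phase; the toolbox's exact defaults and options may differ.

```python
import math


def physio_regressors(phase, order=3):
    """RETROICOR-style Fourier expansion of a physiological phase
    trace (one phase value per scan volume): cosine and sine terms
    up to harmonic `order`, to be added as nuisance covariates.

    This is a sketch of the general technique; the PhysIO toolbox's
    actual implementation and defaults may differ.
    """
    return [[f(k * p) for p in phase]
            for k in range(1, order + 1)
            for f in (math.cos, math.sin)]
```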
Regions of interest
We used two atlases to identify which voxels belonged to each subcortical and cortical ROI. For the subcortical ROIs, we used an in vivo atlas (Sitek et al., 2019) that identifies which voxels of the MNI space most likely cover the bilateral IC and MGB. To test for potential functional specializations of the subdivisions of the MGB, we used masks calculated in Mihai et al. (2019) detailing the average location of the ventral tonotopic axis of the nucleus across 28 participants. This is, to date, the best existing approximation of the location of the primary (ventral) MGB (Mihai et al., 2019).
For the cerebral cortex ROIs, we used the Morosan atlas (Morosan et al., 2001), which subdivides the AC into four bilateral cortical areas based on cytoarchitectural considerations. The cortical areas are labeled Te1.0, Te1.1, Te1.2, and Te3. Areas Te1.0, Te1.1, and Te1.2 are mostly located on Heschl’s gyrus (Te1.1 most postero-medial, Te1.2 most antero-lateral, and Te1.0 in between), and Te3 is located on the lateral surface of the superior temporal gyrus (Morosan et al., 2001). Te1.0 includes areas analogous to the core of the AC, i.e., primary AC (Moerel et al., 2014).
To empirically validate the registration procedure, we used the inverse transforms provided by ANTs to project the subcortical (Fig. 2) and cortical (Figs. 3, 4) ROIs to the native space of the structural images. The plots confirm that our registration pipeline successfully mapped the anatomical ROIs of each participant into the MNI space, with an orientation that resembles that of the different AC regions.
Bayesian model comparison
To evaluate whether neural responses in each of the ROIs corresponded to prediction error with respect to the stats-informed predictions, the task-informed predictions, or both (Fig. 1), we used Bayesian model comparison. Bayesian model comparison allows us to calculate the evidence for a given model of the response profile in each voxel of the ROI. We used three models that capture three different hypotheses. Stat: neural responses encode prediction error with respect to
All regressors corresponding to each of the models were normalized to have a mean of zero and variance of one across each run before convolution and model fitting. Note, however, that SPM orthogonalizes the regressors before fitting them to the data. Moreover, since we applied the same procedure to all models, preprocessing of the regressors cannot bias the model comparison.
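The per-run normalization applied to each regressor amounts to a z-scoring, which can be sketched as:

```python
def zscore_per_run(regressor):
    """Normalize a regressor to zero mean and unit variance across a
    run, as applied before HRF convolution and model fitting."""
    n = len(regressor)
    mean = sum(regressor) / n
    var = sum((v - mean) ** 2 for v in regressor) / n
    sd = var ** 0.5
    return [(v - mean) / sd for v in regressor]
```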
We first computed the log-evidence for each of the three models in each voxel of the ROIs for each participant using SPM via Nipype. Given the model amplitude(s)
Log-evidence maps were then combined across participants for each stimulus set using custom scripts (see Data and code availability) following a two-step procedure: first, we combined the log-evidences across sessions for each individual subject assuming fixed effects (i.e., summing the log-evidences for each subject) and then computed a posterior distribution at the group level using the random-effects procedure described in Rosa et al. (2010) and Stephan et al. (2009). Uninformative priors (i.e., a uniform distribution across models) were used for the second-level Bayesian analysis. This procedure resulted in an estimation of the log-evidence of each model for each voxel. Group-level log-evidence maps were then subtracted to compute the Bayes factor of the comparison of any two models
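A minimal sketch of the per-voxel arithmetic involved (the group-level random-effects step of Stephan et al., 2009, is beyond this sketch and omitted):

```python
import math


def fixed_effects_log_evidence(session_log_evs):
    """Combine a subject's per-session log-evidences for one model in
    one voxel under fixed effects: log-evidences simply add."""
    return sum(session_log_evs)


def bayes_factor(log_ev_model_a, log_ev_model_b):
    """Bayes factor comparing model A against model B in one voxel:
    the exponential of the difference of their log-evidences."""
    return math.exp(log_ev_model_a - log_ev_model_b)
```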
Definition of the models
Modeling prediction error
In all models, we assume that neural responses encode prediction error with respect to a set of predictions. The models disregard contributions of other factors (e.g., neural habituation) that, although expected to influence the signal, would not differ across the sets of predictions.
We defined prediction error as the mismatch between the expected stimulus and the actual stimulus weighted by the likelihood of encountering the stimulus in each position
We modeled the prediction error responses using two regressors:
Regressors in Equations 2 and 3 can capture the prediction error to stimuli in positions 2–8; however, they cannot capture the responses to the first standard in each sequence. The first standard elicits prediction error with respect to the task-informed predictions not because its identity is unknown, but because its onset time is unknown. It also elicits prediction error with respect to the stats-informed predictions because it interrupts the silence that precedes it in the local stimulus history. To take into account the contributions of the first standard without tweaking the definition of ξ from Equation 1, we added another regressor
While the models corresponding to the stats- and task-informed predictions had three regressors, the model that incorporates both sets of predictions had five regressors. The Bayesian log-evidence penalizes the addition of extra regressors: the evidence for a model of higher complexity is greater than the evidence for a model of lower complexity only if the additional regressors explain the data beyond what would be expected from overfitting of the extra free parameters.
Prediction error with respect to the stats-informed predictions
The stats-informed predictions were
Prediction error with respect to the task-informed predictions
Task-informed predictions depended on the position of the incoming sound n (Fig. 5B). For positions
With the exception of the expected responses to the first standard, the regressor
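For illustration, the conditional probability that the incoming sound is the deviant, given that no deviant has occurred yet, follows directly from the task rules (a deviant is always present and restricted to positions 4–6, with a uniform prior); the sketch below computes this hazard-like quantity, without reproducing the exact regressor definition of Equation 1.

```python
def deviant_hazard(prior=None):
    """Conditional probability that sound n is the deviant, given that
    no deviant has occurred yet, derived from the task rules: a deviant
    is always present and can only occupy positions 4, 5, or 6
    (uniform prior of 1/3 each)."""
    if prior is None:
        prior = {4: 1 / 3, 5: 1 / 3, 6: 1 / 3}
    hazard, remaining = {}, 1.0
    for pos in sorted(prior):
        # renormalize the prior by the probability mass still unspent
        hazard[pos] = prior[pos] / remaining
        remaining -= prior[pos]
    return hazard
```

Under these rules the deviant becomes fully predictable at position 6: if it has not occurred at positions 4 or 5, it must occur next.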
Prediction error with respect to a combination of both predictions
To test whether both stats- and task-informed predictions contribute to the computation of prediction error, we assumed that the predictions would be a linear combination of the predictions of both models (Fig. 5C); or, equivalently, that the neural responses would be a linear combination of the prediction errors expected under each set of predictions (note that the dependence of ξ on Π in Eq. 1 is linear). We modeled this scenario by adding the regressors
Control model
The control model includes three regressors, one per stimulus identity. Before normalization, a regressor corresponding to, for example, the pure tone with the highest pitch had a value of 1 when the tone was played and zero otherwise.
Measuring the tSNR
To test whether the results were influenced by the tSNR of the data, we used Nipype’s native confound toolbox. We computed the tSNR on the exact same preprocessed data we used as input for the Bayesian model comparison analysis. A limitation of measuring the tSNR on raw data is that the analysis effectively interprets task-induced fluctuations, which are part of the signal, as noise. A popular alternative is the contrast-to-noise ratio (CNR) (Welvaert and Rosseel, 2013); however, the CNR is only defined with respect to a model of the data and cannot be used to measure the relative noise across different models. Moreover, given that fMRI is characterized by low signal changes in high-noise regimes (Welvaert and Rosseel, 2013), that we only use the tSNR estimations to compare model performances, and that both data-sets included in this study used the same task, the tSNR is a reasonable and unbiased estimation of the relative levels of noise in the data. Nevertheless, the results of the control analyses involving the tSNR estimations should be considered with caution.
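The tSNR referred to here is the conventional voxelwise measure: the mean of a voxel's time course divided by its standard deviation over time, as sketched below.

```python
def compute_tsnr(timeseries):
    """Voxelwise temporal signal-to-noise ratio: the temporal mean of
    the voxel's time course divided by its standard deviation over
    time (computed independently for each voxel)."""
    n = len(timeseries)
    mean = sum(timeseries) / n
    sd = (sum((v - mean) ** 2 for v in timeseries) / n) ** 0.5
    return mean / sd
```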
Correlational analyses
To measure whether results were consistent across the two experiments and whether spatial variations in the winning models could be explained by tSNR heterogeneities, we computed Pearson’s correlations between statistical maps across the voxels in each ROI. This means that the number of samples in each correlation is the number of voxels in the ROI. All p-values were Holm–Bonferroni corrected for the number of ROIs (
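A minimal sketch of this procedure (Pearson's r across the voxels of two maps, followed by a Holm–Bonferroni correction over the family of ROI tests):

```python
def pearson_r(x, y):
    """Pearson correlation between two statistical maps, computed
    across voxels (one sample per voxel)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def holm_bonferroni(pvals, alpha=0.05):
    """Holm-Bonferroni correction: compare sorted p-values against
    increasingly lenient thresholds alpha/(m - rank); stop rejecting
    at the first failure."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    rejected = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            rejected[i] = True
        else:
            break
    return rejected
```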
Data and code availability
All code and derivatives needed to reproduce the analyses and figures are openly available in osf.io/f5tsy.
Results
Multiple predictions are combined to compute prediction error in the subcortical auditory pathway
Large sections of the IC and MGB displayed responses that were best explained by the task and the combined models, both for pure tones (Fig. 6A) and FM-sweeps (Fig. 6C). To rigorously establish the prevalence of each model in each ROI, we computed the Bayes factors between each target model and the control model (Fig. 6B,D). The prevalence of a model in a ROI was determined as the fraction of voxels for which the model provides a substantially better explanation of the data than the control model (i.e.,
We also computed the minimum Bayes factor K between each model and the remaining models (Table 3) to determine whether the responses in each voxel of each ROI were substantially better explained by any of the four models. The combined model provided a substantially better explanation of the data than the remaining models in 36%, 25%, 49%, and 60% of voxels of the IC-L, IC-R, MGB-L, and MGB-R, respectively. Results were less clear in the FM-sweep data, where the combined and task models seemed to perform similarly well. The control and stats models had no substantial explanatory power in any of the ROIs.
In summary, both combined and task models were extremely prevalent in all ROIs for both experiments. While the combined model dominated the responses in the pure tone experiment, both combined and task models dominated the responses in the FM-sweep experiment.
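For illustration, the prevalence measure can be sketched as below; the Bayes-factor threshold for "substantial" evidence (here Jeffreys' K ≥ 3) is an assumption of this sketch, not necessarily the exact criterion used in the analysis.

```python
import math


def model_prevalence(log_ev_model, log_ev_control, bf_threshold=3.0):
    """Fraction of ROI voxels in which the model's Bayes factor
    against the control model exceeds a 'substantial' threshold.

    Assumption: BF >= 3 (Jeffreys' scale); inputs are per-voxel
    group-level log-evidence maps for the two models.
    """
    log_thresh = math.log(bf_threshold)
    wins = sum(1 for a, b in zip(log_ev_model, log_ev_control)
               if a - b >= log_thresh)
    return wins / len(log_ev_model)
```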
Prediction error in the MGB is consistent across physiological subdivisions
The auditory pathway is subdivided into primary (central section of the IC and ventral MGB) and secondary (cortex of the IC, and medial and dorsal MGB) subdivisions (Hu, 2003). Neurons in primary subdivisions have narrowly tuned frequency responses and are responsible for the transmission of information between the periphery and the cerebral cortex; neurons in secondary subdivisions present more broadly tuned frequency responses and are thought to be involved in multisensory integration (Hu, 2003). One possibility is that the functional parcellations described in Figure 6 correspond to this physiological arrangement. Neural populations responding according to the combined model do indeed seem to be located toward the cortex of the ICs, although lower tSNRs are generally expected at the borders of the nuclei, whose signal has contributions from the adjacent cerebrospinal fluid.
Imaging the subdivisions of the IC and MGB in humans is remarkably challenging (Moerel et al., 2015; Mihai et al., 2019). To date, there is no available parcellation of the human IC into primary and secondary subdivisions; however, Mihai et al. (2019) managed to identify a ventral tonotopic gradient in the MGB that putatively corresponds to its primary subdivision. Here, we used this parcellation to assess whether neural populations in the primary and secondary subdivisions of the MGB signal prediction errors according to the task or the combined model (Fig. 7). Results show that both models are similarly prevalent in both subdivisions, indicating that, at least in the MGB, the functional parcellation described in Figure 6 does not correspond to the physiological parcellation of the nuclei.
tSNR heterogeneity correlates with model performance
The prevalence of the task and combined models in different sections of the MGB and IC may indicate that there is not a unique strategy to propagate predictions on the sensory input downstream, but that different strategies are used in different neural populations. However, Figure 7 indicates that these neural populations do not correspond to specific subdivisions of the nuclei. Another possibility is that the voxels best explained by the task model are those in which the BOLD responses are noisier in comparison to those in other regions. Since the combined model has two more free parameters than the task model, the former needs to provide a better explanation of the data than the latter to yield a similar log-evidence. Voxels with poorer tSNR would present higher mean-square errors with respect to the model fits, which might result in the task model winning.
To test whether that was the case, we computed the correlation between the tSNR and the posterior density of the combined model across the ROIs. We found a significant positive correlation in three of the four ROIs for the pure tone data (
Similar distribution of model performances for pure tones and FM-sweeps
Although the general prevalence of the task and combined models was different for the responses elicited by pure tones and FM-sweeps, they seem to follow a similar topographic organization on visual inspection: populations best explained by the combined model are located more centrally in the ICs and more dorsally in the MGBs. To quantify whether the occurrence of the task model was consistent across the two stimulus families, we compared the distribution of the
Multiple predictions are combined to compute prediction error in AC
Most of the AC responses (Te1.0, Te1.1, Te1.2, and Te3; Morosan et al., 2001) were best explained by the combined model (Table 2, Fig. 10): it was the best explanation of the data in more than half of the voxels across fields for pure tones (Fig. 8) and FM-sweeps (Fig. 9). The task model explained the responses of most of the remaining voxels, while the stats and control models were present only minimally.
To study whether the presence of the task model could also be related to the variations of the tSNR across the ROIs, we again computed the correlation between the tSNR and the posterior density of the combined model across the cerebral cortex ROIs. In the pure tone data, the posterior density was positively correlated with the tSNR in Te1.0-R, Te1.1-R, Te1.2-L, and bilateral Te3 (
To quantify whether, as in the subcortical nuclei, the cortical organization of the combined and task models was consistent for both stimulus families across cortical fields, we computed the correlation between the Bayes factor
Discussion
Hierarchical processing is the cornerstone of predictive coding (Friston, 2003a). Here we addressed the question of whether inconsistent predictions derived from the task instructions and from the stimulus statistics are combined to compute prediction error. The main result is the robust presence of regions of the IC, MGB, and AC that compute prediction error with respect to both sets of predictions. This result was consistent for pure tones and frequency sweeps. The relative size of these regions varied between the two families of stimuli: most voxels of bilateral IC, MGB, and AC encoded prediction error with respect to both sets of predictions in the pure tone experiment; this was also the case for the AC, but not for the IC and MGB, in the FM-sweep experiment, where a majority of voxels in bilateral IC and MGB seemed to compute prediction error only with respect to the task-informed predictions. The different results in IC and MGB with the two sets of stimuli are, however, possibly driven by differences in the tSNR across studies. Independently of these differences, the presence of regions computing prediction error with respect to both sets of predictions in both experiments demonstrates that, at least in the auditory modality, predictive processing is powered by a complex system of transmission of predictions that escapes the linearity often assumed in the predictive coding literature (Spratling, 2017; Keller and Mrsic-Flogel, 2018). The corticofugal bundles that directly connect the AC with the MGB, IC, and superior olivary complex (Hackett and Kaas, 2004; Lee and Sherman, 2011; Schofield, 2011) might be responsible for the non-linear transmission of the task-informed predictions to the nuclei of the subcortical pathways.
Our previous (Tabas et al., 2020, 2021; Stein et al., 2022) and the present results contradict conclusions drawn by studies using the local–global paradigm in humans, which assume that predictions are only communicated to the immediately lower processing level (Bekinschtein et al., 2009). In these paradigms, local predictions are based on the repetition of the tone A, while the predictions referred to as global are based on the repetition of the melodic phrase
It is tempting to hypothesize that the stats model reflects habituation: if the receptive fields of deviant and standard overlapped significantly, then a model assuming habituation to the standard would show similar properties to the stats model, which assumes that the responses to the deviant will be stronger the larger the mismatch Δ between deviant and standard. However, in the pure tone experiment, with frequencies around
It could also be hypothesized that the task-informed and combined models could provide a good explanation of the data if the responses were modulated by attention-driven gain modulation (e.g., by mediation of the pulvinar; Kanai et al., 2015). Tones in positions 4 and 5 are indeed the most relevant for the task: if the responses were simply modulated by attention, the task model, where responses to positions 4 and 5 are higher than to the remaining tones, would explain the data better than the stats model, where all deviants 4–6 elicit the same response. However, previous analyses (Tabas et al., 2020, 2021) showed that (1) responses to deviants in positions 4 and 5, where participants were expected to show the same attentional engagement, were significantly different and scaled by predictability (Tabas et al., 2020, 2021) and (2) the magnitudes of the responses to deviants in position 6 and standards in positions 7 and 8 were statistically indistinguishable, even though, under non-attended listening, responses to deviants are always higher than to standards (Cacciaglia et al., 2015). The only explanation compatible with the different responses observed to the different deviant positions is that the activity encodes prediction error with respect to the task-informed predictions.
Our results show a generally higher prevalence of the combined model for pure tones than for FM-sweeps. One possibility is that these differences are driven by our decision to encode Δ for FM-sweeps as differences in modulation rate only. FM-direction and FM-rate are typically studied as independent features (Lui and Mendelson, 2003; Hsieh et al., 2012; Geis and Borst, 2013; Altmann and Gaese, 2014; Issa et al., 2016), and single neurons in the auditory pathway are usually selective either to FM-direction or to FM-rate. Therefore, FM might be encoded in a two-dimensional feature space in the brain. We could have incorporated a contribution of FM-direction to Δ as an extra parameter in the models used to analyze the FM-sweep data, but adding another regressor would have increased the dependence of the log-evidence of the combined model on the tSNR even further.
Our study did not address whether subcortical pathways can adaptively track changes in the local statistics of the stimuli: in our paradigm, stimulus regularity is kept constant across the experiment, which arguably hampers the interpretability of the stats model. Future work could address whether the auditory pathway dynamically adapts to the local statistics using paradigms with varying stimulus regularities.
Another possible limitation of our study is the potential anatomical imprecision in localizing the subdivisions of the AC and the MGB. Due to the macroanatomical variability of the superior temporal plane in human subjects, it is possible that the mappings reported in Figures 3 and 4 do not exactly correspond to the microstructural boundaries of the auditory regions. Similarly, our 1.5 mm and 1.75 mm isotropic voxels might have been too coarse to precisely differentiate between primary and secondary subdivisions of the MGB. Therefore, the relatively homogeneous results we reported across subdivisions of the MGB (Fig. 7) and fields of the AC (e.g., Fig. 10) should be considered with caution.
Predictive coding has come a long way since it was first explicitly theorized in the 1990s (Mumford, 1992), evolving from a theory explaining extra-classical receptive field properties in visual cortex (Rao and Ballard, 1999) to a full hierarchical theory of sensory processing (Friston and Kiebel, 2009; Keller and Mrsic-Flogel, 2018). Here we have taken a step forward by questioning the assumed linearity (Friston, 2003b; Friston and Kiebel, 2009; Spratling, 2017; Keller and Mrsic-Flogel, 2018) of its hierarchical architecture. Understanding the interplay between multi-level predictions is crucial to understanding how natural sensory processing occurs. For instance, predictive speech processing involves contextual, semantic, grammatical, phonetic, and vocal predictions (Kuperberg and Jaeger, 2016; Heilbron et al., 2020; Choi et al., 2021). To extract meaningful messages from noisy and ambiguous speech signals, the human brain should be able to compute independent prediction errors with respect to all those independent predictions. Our findings suggest that, at sensory stages of the processing hierarchy, prediction error units are indeed capable of testing multiple predictions. The auditory pathway might exploit the corticofugal lines directly connecting the AC with the MGB, IC, and superior olivary nucleus for the direct transmission of predictions, bypassing the linear hierarchy often assumed in the literature (Keller and Mrsic-Flogel, 2018). This intricate system of descending connections might be responsible for our exquisite capacity to decode predictable information from noisy sensory inputs.
Footnotes
The work was funded by the European Research Council [Grant SENSOCOM (647051); Grant Beneficiary: K.v.K] and the Sächsisches Staatsministerium für Wissenschaft, Kultur und Tourismus (Grant Beneficiary: K.v.K).
The authors declare no competing financial interests.
- Correspondence should be addressed to Alejandro Tabas at at2045{at}cam.ac.uk.