Neural encoding of auditory statistics

The human brain extracts statistical regularities embedded in real-world scenes to sift through the complexity stemming from changing dynamics and entwined uncertainty along multiple perceptual dimensions (e.g., pitch, timbre, location). While there is evidence that sensory dynamics along different auditory dimensions are tracked independently by separate cortical networks, how these statistics are integrated to give rise to unified objects remains unknown, particularly in dynamic scenes that lack conspicuous coupling between features. Using tone sequences with stochastic regularities along spectral and spatial dimensions, this study examines behavioral and electrophysiological responses from human listeners (male and female) to changing statistics in auditory sequences, and employs a computational model of predictive Bayesian inference to formulate multiple hypotheses for statistical integration across features. Neural responses reveal multiplexed brain responses reflecting both local statistics along individual features in fronto-central networks, together with global (object-level) processing in centro-parietal networks. Independent tracking of local surprisal along each acoustic feature reveals linear modulation of neural responses, while global melody-level statistics follow a nonlinear integration of statistical beliefs across features to guide perception. Near-identical results are obtained in separate experiments along spectral and spatial acoustic dimensions, suggesting a common mechanism for statistical inference in the brain. Potential variations in statistical integration strategies and memory deployment shed light on individual variability between listeners in terms of behavioral efficacy and fidelity of neural encoding of stochastic change in acoustic sequences.


Introduction
Chernyshev2016. What is not clear is the manner in which independently tracked sensory dimensions are joined into a unified statistical representation that reflects the complexity and non-deterministic nature of natural listening scenarios.

To address the limitations of quasi-predictable regularities often employed in previous studies, we focus on the perception of stochastic regularities that exist in the continuum between perfectly predictable and completely random. We utilize stimuli exhibiting random fractal structure (also known as 1/f or power-law noise) along multiple features, both spectral and spatial. Random fractals occur in natural sounds, including music and speech Pickover1986, Attias1997, Geffen2011, Levitin2012, and previous work has shown the brain is sensitive to these types of structures Schmuckler1993, Garcia-Lazaro2006, Overath2007, Maniscalco2018, Skerritt2018. Using a change detection paradigm, we task listeners with detecting changes in the entropy of sound sequences along multiple features. With this paradigm, we probe the ability of the brain to abstract statistical properties from complex sound sequences in a manner that has not been addressed by previous work. Importantly, the statistical structure of the sequences used in this study carries no particular coupling or correlation across features, hence restricting the brain's ability to leverage this correspondence in line with previously reported feature-fusion mechanisms observed within and between visual, somatosensory, vestibular, and auditory sensory modalities Treisman1980, Angelaki2009, Fetsch2010, Parise2012, Ernst2013.

The use of an experimental paradigm involving uncertainty raises the specific challenge of interpreting responses to each stochastic stimulus, as changes in underlying statistics need not align with behavioral and neural responses to the instantiations of these statistics.
This complexity is further compounded in multidimensional feature spaces, raising the question of how the brain deals with this uncertainty, especially with dynamic sensory inputs that lack a priori dependencies across dimensions. The current study develops a computational model to guide our analysis through simulation and make predictions about how statistics are integrated across features.

Participants

In experiment SP, sixteen participants (8 female) were recruited from the general population (mean age: 25.1 years); one participant was excluded from further analysis because their task performance was near chance (d′ < 0.05). In experiment TP, eighteen participants (12 female) were recruited (mean age: 21.5 years); three participants were excluded due to chance performance. In experiment nSP, twenty participants (9 female) were recruited (mean age: 23.4 years); two participants were excluded due to chance performance. In experiment nTP, twenty-two participants (13 female) were recruited (mean age: 22.5 years); four participants were excluded due to chance performance. Sample sizes were estimated based on similar experiments previously reported Skerritt2018.

All participants reported no history of hearing loss or neurological problems. Participants gave informed consent prior to the experiment and were paid for their participation. All experimental procedures were approved by the Johns Hopkins IRB.

Stimuli

Random fractals are stochastic processes with a power spectrum inversely proportional to frequency with log-slope β (i.e., 1/f^β), where β parameterizes the entropy of the sequence. Fractals at three levels of entropy were used as seed sequences to generate the stimuli: low (β = 2.5), mid (β = 2), and high (β = 0, white noise).
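As an illustration of how such seed sequences can be produced, the sketch below generates 1/f^β noise by spectral synthesis (random phases with power-law amplitudes). This is a hypothetical reconstruction, not the study's stimulus code; the function name and the use of NumPy are our own.

```python
import numpy as np

def power_law_noise(n, beta, rng=None):
    """Generate a length-n sequence with power spectrum ~ 1/f^beta.

    beta = 0 gives white noise (high entropy); larger beta gives
    smoother, more predictable sequences (lower entropy).
    """
    rng = np.random.default_rng(rng)
    freqs = np.fft.rfftfreq(n)
    # Amplitude ~ f^(-beta/2) so that power ~ f^(-beta); skip the DC bin.
    amp = np.zeros_like(freqs)
    amp[1:] = freqs[1:] ** (-beta / 2.0)
    phases = rng.uniform(0, 2 * np.pi, len(freqs))
    spectrum = amp * np.exp(1j * phases)
    x = np.fft.irfft(spectrum, n)
    # Standardize (zero mean, unit variance) before mapping to feature space.
    return (x - x.mean()) / x.std()
```

Standardizing the seed mirrors the procedure described below, where fractals are normalized before being mapped onto pitch, timbre, or spatial location.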

In all experiments, stimuli began with both features at lower entropy, and halfway through the melody, one or both features increased to high entropy. Stimuli with decreasing entropy were not used in this work for two reasons: 1) to keep the duration of the multi-feature paradigm within a single experimental session, and 2) our previous work shows no difference in behavioral responses based on the direction of entropy change Skerritt2018. We chose to use increasing entropy because it affords the brain more of a chance to initiate tracking of statistics jointly across features at the onset of the stimulus.

In the psychophysics experiments (SP and TP), for stimulus conditions with a single feature changing, the non-changing feature could have either low or mid entropy. In the EEG experiments (nSP and nTP), the non-changing feature had low entropy only (mid-entropy conditions were excluded).

Each complex tone in the melody sequence was synthesized from a harmonic stack of sinusoids with frequencies at integer multiples of the fundamental frequency, then high- and low-pass filtered at the same cutoff frequency using fourth-order Butterworth filters. Pitch was manipulated through the fundamental frequency of the complex tone, and timbre was manipulated through the cutoff frequencies of the high- and low-pass filters (i.e., the spectral centroid) Allen2017. Spatial location was simulated by convolving the resulting tone with interpolated head-related impulse responses for the left and right ears at the desired azimuthal position Algazi2001. Seed fractals were generated independently for each feature and each stimulus, standardized (i.e., zero mean and unit variance), and then mapped to the corresponding feature space.

Procedure

Stimuli were presented in randomized order; thus listeners did not know a priori which feature was informative for the task. The experiment contained four blocks with self-paced breaks between blocks. During each trial, participants were instructed to listen for a change in the melody. After the melody finished, participants responded via keyboard whether or not they heard a change. Immediate feedback was given after each response.

Stimuli were synthesized on the fly at a 44.1 kHz sampling rate and presented at a comfortable listening level using PsychToolbox (psychtoolbox.org) and custom scripts in MATLAB (The MathWorks, Natick, MA).
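To make the tone synthesis concrete, here is a minimal sketch (ours, not the authors' MATLAB code). Rather than filtering the waveform, it weights each harmonic by the closed-form magnitude response of cascaded fourth-order high- and low-pass Butterworth filters sharing the cutoff `centroid`, which places the spectral peak at the desired centroid; the function name and defaults are assumptions.

```python
import numpy as np

def complex_tone(f0, centroid, dur=0.2, fs=44100, n_harm=40, order=4):
    """Harmonic stack at multiples of f0, spectrally shaped so that
    energy is concentrated near `centroid` (the timbre cue).

    Each harmonic is weighted by the combined magnitude of a high-pass
    and a low-pass Butterworth filter with the same cutoff fc:
    |H(f)| = (f/fc)^order / (1 + (f/fc)^(2*order)), which peaks at f = fc.
    """
    t = np.arange(int(round(dur * fs))) / fs
    tone = np.zeros_like(t)
    for k in range(1, n_harm + 1):
        f = k * f0
        if f >= fs / 2:  # stay below Nyquist
            break
        r = (f / centroid) ** (2 * order)
        gain = np.sqrt(r) / (1 + r)  # high-pass * low-pass magnitude
        tone += gain * np.sin(2 * np.pi * f * t)
    return tone / np.max(np.abs(tone))  # normalize peak amplitude
```

Spatialization (convolution with head-related impulse responses) is omitted here, as it requires the measured HRTF data set.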

Participants were seated in an anechoic chamber in front of the presentation screen.

In experiments SP and TP, stimuli were presented via over-ear headphones (Sennheiser HD 595) and participants responded via keyboard. The experiment duration was approximately 50 minutes. In experiments nSP and nTP, stimuli were presented via in-ear headphones (Etymotic ER-2) and participants responded via a response box. Additionally, before each melody trial, a fixation cross appeared on the screen to reduce eye movement during EEG acquisition. The experiment duration, including EEG setup, was approximately 120 minutes.

Computational model

The D-REX model builds sequential predictions of the next input x_{t+1} given all previously observed inputs x_1, x_2, ..., x_t. In the present study, the input {x_t}, t ∈ Z+, is a sequence of pitches, spatial locations, or spectral centroids (timbre). This sensory input is assumed to be successively drawn from a multivariate Gaussian distribution. If the generating distribution were stationary, prediction would reduce to a single predictive distribution estimated from all past observations (Eq (1)).

When a new input x_{t+1} is observed, the model produces a predictive probability of this input for each context hypothesis (Eq (2)), where p_{i,t} is the context-specific predictive probability of x_{t+1} given the statistics estimated over the i-th context hypothesis. Alongside these predictive probabilities, the model maintains a set of beliefs for each context hypothesis (Eq (3)), where b_{i,t} is the belief in (or, equivalently, the posterior probability of) the i-th context given previously observed inputs. The overall predictive probability of x_{t+1} (Eq (4)) combines the context-specific predictions (Eq (2)) weighted by their beliefs (Eq (3)), where the unknown dynamics of the input are treated in a Bayesian fashion by "integrating out" the unknown context.
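The sketch below is a heavily simplified, hypothetical rendering of this prediction-and-belief machinery (Eqs (2)-(4)). It assumes a univariate Gaussian with known variance and a fixed context-change hazard; it is not the published D-REX implementation, and all names are ours.

```python
import math

def drex_step(x_new, history, beliefs, sigma=1.0, hazard=0.05):
    """One update of a simplified D-REX-like model (univariate Gaussian,
    known sigma). beliefs[i] is the posterior probability that the current
    context began at observation index i. Returns the marginal predictive
    probability of x_new (Eq 4) and the updated beliefs (Eq 3)."""
    # Eq (2): context-specific predictive probability of x_new, using the
    # mean estimated over each hypothesized context.
    p = []
    for start in range(len(history) + 1):  # last hypothesis: context starts now
        window = history[start:]
        mu = sum(window) / len(window) if window else 0.0  # empty context: prior mean 0
        z = (x_new - mu) / sigma
        p.append(math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi)))

    # Prior over contexts before seeing x_new: existing hypotheses keep
    # (1 - hazard) of their mass; a brand-new context receives the hazard mass.
    prior = [b * (1 - hazard) for b in beliefs] + [hazard]
    total = sum(prior)
    prior = [q / total for q in prior]

    # Eq (4): marginal prediction, "integrating out" the unknown context.
    p_marginal = sum(bi * pi for bi, pi in zip(prior, p))

    # Eq (3): posterior beliefs over contexts after observing x_new.
    posterior = [bi * pi / p_marginal for bi, pi in zip(prior, p)]
    return p_marginal, posterior

# Usage: iterate over a sequence, appending each observation to the history.
# history, beliefs = [], []
# for x in sequence:
#     p_x, beliefs = drex_step(x, history, beliefs)
#     history.append(x)
```

The marginal predictive probabilities produced at each step are the quantities from which the surprisal and belief-change measures described in the text are derived.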
(i) Surprisal is a local measure of statistical change along each feature: S_{t+1} = −log p(x_{t+1} | x_1, ..., x_t) (Eq (5)), where S_{t+1} is the surprisal at time t + 1, based on the predictive probability of x_{t+1} from Eq (4).

Observations with low predictive probability have high surprisal, observations with high predictive probability have low surprisal, and observations predicted with probability 1 (i.e., completely predictable) have zero surprisal. Relating to concepts from information theory, this measure reflects the information gained from observing x_{t+1} given its context Samson1953.
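In code, surprisal (Eq (5)), together with the belief-change measure introduced next (Eq (6)) and the thresholded detection rule (Eq (7)), might look like the following sketch over discrete belief vectors. This is our own minimal rendering, not the model's actual implementation; the choice of natural logarithm is an arbitrary convention.

```python
import math

def surprisal(p):
    """Eq (5): S = -log p; probability-1 predictions give zero surprisal."""
    return -math.log(p)

def js_divergence(b_old, b_new):
    """Jensen-Shannon divergence between two discrete belief vectors.

    Pads the shorter vector with zeros, since a new context hypothesis is
    added at each time step and B_t, B_t+1 may differ in length.
    """
    n = max(len(b_old), len(b_new))
    p = list(b_old) + [0.0] * (n - len(b_old))
    q = list(b_new) + [0.0] * (n - len(b_new))
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, with the 0*log(0) = 0 convention.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def detect_change(belief_changes, tau):
    """Eq (7): report a change if the maximal belief change exceeds tau."""
    return max(belief_changes) > tau
```

With natural logarithms, the Jensen-Shannon divergence of two fully disjoint belief vectors is log 2, its maximum value, so belief change is bounded.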

(ii) Belief Change is a global measure of statistical change in the input sequence. If the new input x_{t+1} is no longer well predicted using the beliefs B_t (e.g., after a change in underlying statistics), the updated beliefs B_{t+1} shift to reflect the change in context inferred by the model. The belief change δ_t measures the distance between these two posterior distributions before and after x_{t+1} is observed: δ_t = D_JS(B_t || B_{t+1}) (Eq (6)), where D_JS(·||·) is the Jensen-Shannon divergence. The belief change ultimately reflects dynamics in the global statistics of the observed sequence.

We derived a change detection response from the model analogous to listener behavioral responses by applying a detection threshold τ to the maximal belief change δ_t (Eq (7)). We use this response to compare the model to listeners' behavioral responses. In addition, we use the moment when this maximal belief change occurs, along with surprisal, to examine the neural response related to different dynamics in the stimuli.

Now, let the input sequence x_t be multidimensional, with two components along separate dimensions, e.g., pitch and spatial location.
• Integration operator f(·, ·). We test four different operators for how predictive information is combined across features. For each model variant, we then fit the model parameters (e.g., memory, observation noise, and decision threshold) that best replicated listener behavior.

The model detection rate (i.e., the percentage of trials wherein a change was detected, using Eq (7)) was compared to listeners' detection rates. For cross-validation, one partition of trials was used to fit the model as described above, and the model was evaluated using the held-out partition.

292
Cross-validation results were averaged over 10 iterations to reduce noise from fitting to behavior estimated over a smaller sample size. In the exploration of fitted model parameters, multiple linear regression was used to establish any linear relationship between subject detection performance and model parameters.

High- and low-surprisal tones were defined as tones with overall surprisal above the 95th and below the 5th percentile, respectively. Tone epochs within each surprisal bin were averaged, and the low-surprisal response was subtracted from the high-surprisal response to yield a difference wave (high − low). Significance of the neural response relative to baseline was determined using t-tests.

conditions, checkered bars in Fig 4).
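Detection performance throughout is summarized with d′ (used above as the exclusion criterion). A minimal computation using only the standard library's inverse normal CDF is sketched below; the boundary correction for perfect rates is one common convention and an assumption on our part, not a procedure stated in the text.

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate, n_trials=None):
    """d' = z(hit rate) - z(false-alarm rate).

    If n_trials is given, rates of exactly 0 or 1 are pulled off the
    boundary by half a trial (a common correction), since z(0) and z(1)
    are undefined.
    """
    if n_trials:
        lo, hi = 0.5 / n_trials, 1 - 0.5 / n_trials
        hit_rate = min(max(hit_rate, lo), hi)
        fa_rate = min(max(fa_rate, lo), hi)
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(fa_rate)
```

A listener responding at chance (equal hit and false-alarm rates) gets d′ = 0, matching the near-chance exclusion criterion described in the Participants section.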

In experiment SP, an ANOVA with 1 within-subject factor (3 conditions) showed a strongly significant main effect of condition.

We replicated the same analysis for behavioral responses in the EEG experiments nSP and nTP (not shown in figure). Listeners performed the same change-detection task, with the only difference being the exclusion of the mid-entropy conditions (checkered bars in Fig 1). We observed the same behavioral effects as in the psychophysics experiments.

The uninformative (non-changing) feature did in fact affect overall detection performance, where higher entropy led to increased FAs; meanwhile, detection of changes in the informative feature (i.e., hit-rates) was not affected. Because stimulus conditions were randomized from trial to trial, listeners did not know a priori which feature(s) might change. If statistics were collected jointly across features, we would expect higher entropy in any feature to yield poorer statistical estimates, leading to higher FAs and lower hit-rates. However, differences in the uninformative feature did not disrupt listeners' ability to track statistics in the informative feature. This result suggests statistics are collected independently along each feature rather than jointly between features, and integration across features occurs after statistical estimates have been formed.

The model captures individual subjects quite well (for example, in Experiment nSP, Fig 5-left). We can also examine how individual differences are explained by the model parameters fit to each subject. Using the Late D22 MAX model, the "best" overall model, we tested for correspondence between the fitted model parameters and subjects' detection performance.

The classic oddball paradigm typically relies on deterministic patterns to define "deviant" and "standard" events; without such structure, we use surprisal from the model to guide identification of tones that fit predictions well and those that do not. First, we use an overall measure of surprisal to define "deviant" and "standard" by summing surprisal across features, e.g., S_t = S_t^P + S_t^S, where S_t^P and S_t^S are the tone-by-tone surprisal from pitch and spatial location, respectively (see Eq (5)). We compared the neural response time-locked to high-surprisal tones to the response time-locked to low-surprisal tones.

However, non-parametric tests using Spearman correlation partially contradicted the significance of these results for experiment nTP (p = 0.14), while the effect in experiment nSP remained highly significant (p = 0.0006).
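The non-parametric check can be reproduced by computing Spearman's rho as the Pearson correlation of ranks. A dependency-free sketch is below (ours; the average-rank handling of ties that scipy.stats.spearmanr applies is omitted for brevity):

```python
def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation applied to ranks.
    Assumes no tied values (average-rank tie handling omitted)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because it depends only on ranks, the statistic is insensitive to the exact spacing of EEG magnitudes across surprisal bins, which is what makes it a useful robustness check on the linear-regression result.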

We further examined this linear relationship in an extended analysis using the feature-specific surprisal (e.g., S_t^P and S_t^S). For each subject, tone epochs were binned into 128 equal-sized bins in the 2-D space spanned by surprisal along each feature, and the neural response was averaged within each bin over epochs.

High-surprisal tones are those with overall surprisal (e.g., S_t = S_t^P + S_t^S) in the top 5%. Maximal belief change is the moment when the belief change (δ_t) reaches its maximum across the melody trial (see Eq (6)). Note that by aligning to model responses before trial-averaging, the temporal position of the motor response relative to time = 0 is shuffled, thereby reducing confounds due to motor preparation.

We would like to thank Audrey Chang for her help with data collection for the psychophysics experiments.

a) Oddball-like analysis contrasting the neural response to high-surprisal tones (top 5%) with the response to low-surprisal tones (bottom 5%), where overall surprisal is summed across features (e.g., S_t^P + S_t^T). Difference wave (high − low) shows the 95% confidence interval across subjects. b) EEG magnitude (80-150 ms) in sub-averages of tone epochs binned by overall surprisal (abscissa). R^2 from linear regression. c) EEG magnitude (80-150 ms) binned by feature-specific surprisal in both features (horizontal axes). Gray points on the horizontal axis show the position of each point in surprisal-space. R^2 from multiple linear regression.

Extended Data legends

Audio S1. Example stimulus from experiment SP/nSP with increasing entropy in both spatial location and pitch. Headphones required for spatialization.

Audio S2. Example stimulus from experiment SP/nSP with low entropy in spatial location and increasing entropy in pitch. Headphones required for spatialization.

Audio S3. Example stimulus from experiment SP/nSP with increasing entropy in spatial location and mid-entropy in pitch. Headphones required for spatialization.