## Abstract

In bistable perception, observers experience alternations between two interpretations of an unchanging stimulus. Neurophysiological studies of bistable perception typically partition neural measurements into stimulus-based epochs and assess neuronal differences between epochs based on subjects' perceptual reports. Computational studies replicate statistical properties of percept durations with modeling principles like competitive attractors or Bayesian inference. However, bridging neuro-behavioral findings with modeling theory requires the analysis of single-trial dynamic data. Here, we propose an algorithm for extracting nonstationary timeseries features from single-trial electrocorticography (ECoG) data. We applied the proposed algorithm to 5-min ECoG recordings from human primary auditory cortex obtained during perceptual alternations in an auditory triplet streaming task (six subjects: four male, two female). We report two ensembles of emergent neuronal features in all trial blocks. One ensemble consists of periodic functions that encode a stereotypical response to the stimulus. The other comprises more transient features and encodes dynamics associated with bistable perception at multiple time scales: minutes (within-trial alternations), seconds (duration of individual percepts), and milliseconds (switches between percepts). Within the second ensemble, we identified a slowly drifting rhythm that correlates with the perceptual states and several oscillators with phase shifts near perceptual switches. Projections of single-trial ECoG data onto these features establish low-dimensional attractor-like geometric structures invariant across subjects and stimulus types. These findings provide supporting neural evidence for computational models with oscillatory-driven attractor-based principles. 
The feature extraction techniques described here generalize across recording modality and are appropriate when hypothesized low-dimensional dynamics characterize an underlying neural system.

**SIGNIFICANCE STATEMENT** Irrespective of the sensory modality, neurophysiological studies of multistable perception have typically investigated events time-locked to the perceptual switching rather than the time course of the perceptual states per se. Here, we propose an algorithm that extracts neuronal features of bistable auditory perception from large-scale single-trial data while remaining agnostic to the subject's perceptual reports. The algorithm captures the dynamics of perception at multiple timescales: minutes (within-trial alternations), seconds (durations of individual percepts), and milliseconds (timing of switches). It also distinguishes attributes of neural encoding of the stimulus from those encoding the perceptual states. Finally, our analysis identifies a set of latent variables that exhibit alternating dynamics along a low-dimensional manifold, similar to trajectories in attractor-based models for perceptual bistability.

## Introduction

Multistable perception, a phenomenon in which an ambiguous, unchanging stimulus gives rise to more than one perceptual interpretation, has been found in various sensory modalities: visual (Blake, 1989; Hupé and Rubin, 2003), auditory (van Noorden, 1975; Pressnitzer and Hupé, 2006), tactile (Carter et al., 2008), and olfactory (Zhou and Chen, 2009). Visual and auditory research has also reported neural correlates of mutually exclusive percepts (Blake and Logothetis, 2002; Micheyl et al., 2007) and proposed several theories of bistable perceptual organization. Nonetheless, computational principles of bistable perception are yet to be reconciled with experimentally identified percept-specific changes in neural activity. This is because modeling simulates long-time neuronal dynamics during single trials with functional principles such as competitive attractors (Moreno-Bote et al., 2007; Curtu et al., 2008; Rankin et al., 2015), evidence accumulation (Barniv and Nelken, 2015; Nguyen et al., 2020), algorithmic signal detection (Micheyl et al., 2005; Pressnitzer et al., 2008; Krishnan et al., 2014), predictive coding (Denham and Winkler, 2006), or probabilistic processes (Barniv and Nelken, 2015). In contrast, conventional data analyses primarily rely on statistical measures within perceptual groupings, like mean and variance, and are thus limited when applied to nonstationary data. Accordingly, brain studies of bistable perception have focused on differentiating short-time neural responses near perceptual switches (Basirat et al., 2008; Higgins et al., 2020) or at fixed latencies in stimulus-locked epochs, like milliseconds from the onset of individual stimulus occurrences (Gutschalk et al., 2005; Snyder et al., 2006; Dykstra et al., 2011; Hill et al., 2012; Billig et al., 2018; Curtu et al., 2019; Higgins et al., 2020).

To unravel dynamical properties of neural activity in bistable perception and link them to functional principles proposed by theory and modeling, one must exploit the time dependency of the recorded data. Key components of a comprehensive analysis should include extraction of neural features with data-driven algorithms agnostic to the behavioral data (as opposed to prescribed measures such as averaged evoked potential) and identification of feature attributes that correlate with perception over prolonged percept durations as well as near reported switches or in other short-time windows. Here, we propose an algorithm that successfully addresses both problems. It inputs single-trial, minutes-long recordings of neuronal activity and outputs a manifold built on features that distinguish the two perceptual states. The algorithm predicts within-trial ongoing perceptual alternations (admittedly, without uncovering their neural underpinnings) by extracting time-varying latent states from neural activity compatible with trajectories in competition models for bistable perception.

Our study examined the dynamics of bistable perception in auditory streaming of triplets. Six neurosurgical patients listened to a sequence of tones, *A* and *B*, organized in repeating *ABA*– triplet patterns (Fig. 1). Subjects reported alternations between two percepts, a galloping-like rhythm (the “one-stream” percept) and a Morse-code-like rhythm of two simultaneous distinct streams (the “two-stream” percept). Electrocorticography (ECoG) recordings were collected from the subjects' core auditory cortex as they performed the behavioral task. Nonstationary percept-related features of the ECoG data were extracted with an algorithm built on recent advances in dynamical systems (Schmid, 2010; Williams et al., 2015; Giannakis, 2019), manifold learning (Takens, 1981; Berry et al., 2013), and dimensionality reduction of large-scale datasets (Coifman and Lafon, 2006; Nadler et al., 2006; Berry et al., 2013). The dynamics of neural activity were analyzed at multiple time scales: minutes, for the within-trial alternating process; seconds, for durations of individual percepts; and milliseconds, for the timing of switches. A collection of data-driven Fourier-like neuronal components robustly constructed the triplet-based averaged auditory evoked potential for the one-stream and two-stream percepts. An additional slowly-evolving extracted rhythm correlated with the perceptual states. Changes in the phase (rather than amplitude) of an oscillatory feature time-locked to the slow rhythm predicted the perceptual switches identified by subject-reported button presses. Low-dimensional projections of single-trial ECoG auditory cortical data revealed geometric structures common across subjects and stimulus types. These projections exhibited dynamic properties similar to trajectories generated by attractor-based computational models.

## Results

Six epileptic neurosurgical patients (identified here as B335, L357, R369, L372, R376, and L409) listened to 5-min-long sequences of tones *A* and *B* grouped in 500 repeating *ABA*– triplets. Participants indicated changes in perception by pressing a button on a response box. They reported either a single coherent auditory stream (the one-stream percept) or two simultaneous, distinct streams (the two-stream percept). Intracranial ECoG recordings from core auditory cortex were obtained concurrently with the behavioral data. The recordings were obtained from electrodes placed in posteromedial Heschl's gyrus (HGPM), with a total of six (B335), five (L357), eight (R369), six (L372), seven (R376), and one (L409) contacts (Fig. 1*C* for B335).

Two stimulus protocols were employed. In the control block, the stimulus alternated between a low (*df* = 2) and a high (*df* = 12) semitone difference between tones *A* and *B*, which biased listeners toward stable one-stream and two-stream percepts, respectively. Changes in perception were primarily aligned with the changes in *df* (Curtu et al., 2019). In the bistable blocks *df*6 and *df*8, the stimulus consisted of triplets with a fixed semitone difference between *A* and *B* throughout the entire task (either *df* = 6 or *df* = 8). Although the stimulus did not change, the subjects reported spontaneous alternations between one-stream and two-stream percepts.

The datasets analyzed in this paper were previously published by Curtu et al. (2019).

### An algorithm for extracting nonstationary features from ECoG data

Studies of perceptual bistability aim to identify changes in neural activity that correlate with changes in (simultaneously recorded) behavioral responses. In auditory streaming of triplets, several studies have reported differences in the averaged auditory evoked potential calculated for the one-stream and two-stream percepts in certain auditory-related brain areas. But these analyses used a common methodology: (1) obtained high-resolution temporal recordings with either electroencephalography (EEG; Hill et al., 2012), magnetoencephalography (MEG; Gutschalk et al., 2005; Billig et al., 2018), or ECoG (Curtu et al., 2019); (2) partitioned the data into triplet-locked epochs; then (3) performed univariate or multivariate statistics over triplets belonging to each percept. More recently, nonlinear measures derived from entropy principles were also used to identify neuronal differences between the percepts (Canales-Johnson et al., 2020), but they were based on short-time windows near switches and were reliant on the subjects' behavioral reports. We present here an algorithm that successfully processes minutes-long nonstationary neural data (as seen in the streaming task) and extracts key features across three different timescales. The algorithm identifies perception-related events at times measured in milliseconds (the occurrence of switches between percepts), seconds (the sequence of triplets linked into a stable percept, either one-stream or two-stream), and minutes (the perceptual dynamics over the entire 5-min auditory block). Moreover, this algorithm was applied exclusively to the neural data without using any prior knowledge about the perception.

The fundamental assumption guiding this feature-extraction algorithm is that a potentially nonlinear system of differential equations, *dx*/*dt* = *F*(*x*), governs the response of an underlying neural state, *x*, to auditory input (here we consider that the brain activity could be described by the dynamics of *n* variables). We further assume that the time-dependent trajectory *x*(*t*) evolves along a manifold M and is accessed only indirectly, through a small number (*d*) of observation functions of *x*. In the streaming context, local field potential (LFP) recordings from HGPM contacts represent the output of unknown observation functions of the unknown underlying neural activity. Thus, the number of HGPM contacts determines the observation dimension *d*.

We hypothesize that the manifold M contains regions, or almost-invariant attracting sets, that correspond with perception (S.L. Brunton et al., 2017; also, for the terminology, see Materials and Methods). Under this hypothesis, a trajectory along M would dwell near one almost-invariant set for the duration of a stable percept and move toward the other set near a perceptual switch.

Recently, the Koopman operator (Koopman, 1931; Koopman and Neumann, 1932; Mezić and Banaszuk, 2004; Mezić, 2005; Rowley et al., 2009; S.L. Brunton et al., 2017, 2022) has been used to study nonlinear dynamics through system observations, for example in fluid dynamics (Rowley et al., 2009; Mezić, 2013; Peitz and Klus, 2019), computational chemistry (Narasingam and Kwon, 2019), and neuroscience (B.W. Brunton et al., 2016; Cura and Akan, 2020; Marrouch et al., 2020). For time *t* fixed, the Koopman operator advances observation functions along the flow of the system: it maps each observation function to its composition with the time-*t* flow map.

In many scientific applications, observations are made at discrete time points, in which case the Koopman operator is replaced by its discrete-time counterpart, the composition with the map that advances the system by one sampling step.

By the definition of function composition, the Koopman operator is linear; however, since it acts on the functional space of system observations, it is infinite-dimensional. Rather than identify the full infinite-dimensional operator, common approaches characterize the operator by approximating a finite collection of its leading eigenvalues, modes, and eigenfunctions.

When the observation dimension is large, the dynamic mode decomposition (DMD; Rowley et al., 2009; Schmid, 2010; Tu et al., 2014) is an effective algorithm for approximating Koopman spectral quantities.
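For orientation, the exact DMD of Tu et al. (2014) can be sketched in a few lines of linear algebra. The sketch below is a generic textbook formulation, not the authors' implementation; variable names and the optional rank truncation are our own choices.

```python
import numpy as np

def dmd(Y, rank=None):
    """Exact dynamic mode decomposition (Tu et al., 2014), minimal sketch.

    Y : (d, T) array of observations; columns are snapshots in time.
    Approximates the linear map A in X2 ~= A @ X1, where X1 = Y[:, :-1]
    and X2 = Y[:, 1:], and returns its eigenvalues and DMD modes.
    """
    X1, X2 = Y[:, :-1], Y[:, 1:]
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    if rank is not None:                      # optional truncation
        U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    # project A = X2 X1^+ onto the leading left singular vectors
    A_tilde = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / s)
    eigvals, W = np.linalg.eig(A_tilde)
    modes = X2 @ Vh.conj().T @ np.diag(1.0 / s) @ W   # exact DMD modes
    return eigvals, modes
```

Applied to data generated by a linear system, the returned eigenvalues coincide with those of the generating matrix.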

For the data presented in this paper, the number of HGPM contacts was subject dependent and varied between one and eight, yielding few observations relative to the total number of sampled time points. For low-dimensional observations, the extended dynamic mode decomposition (eDMD; Williams et al., 2015) augments the approximation in Equation 2 to act on features of the observation data. Since we were interested in the geometry of the system's underlying state space, we used manifold learning techniques to derive a collection of basis-like features (or functions) adapted to the geometry along M. These data-driven features served as a dictionary of functions for Koopman eigenfunction discovery with the eDMD algorithm.
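The lifting step of eDMD can be illustrated schematically: observations are passed through a dictionary of functions, and the Koopman matrix is fit by least squares in the lifted space. This is a generic sketch of the Williams et al. (2015) construction with an arbitrary dictionary, not the diffusion-map dictionary used in the paper.

```python
import numpy as np

def edmd(Y, dictionary):
    """Extended DMD (Williams et al., 2015), minimal sketch.

    Y          : (d, T) observation snapshots in time order.
    dictionary : maps a (d, T) array to a (k, T) array of k features.
    Returns Koopman eigenvalues and the eigenfunction values evaluated
    along the trajectory (one row per eigenfunction).
    """
    Psi = dictionary(Y)                 # lift data into feature space
    P1, P2 = Psi[:, :-1], Psi[:, 1:]
    K = P2 @ np.linalg.pinv(P1)         # least-squares Koopman matrix
    eigvals, Xi = np.linalg.eig(K.T)    # left eigenvectors of K
    eigfuns = Xi.T @ Psi                # eigenfunctions along the data
    return eigvals, eigfuns
```

For a scalar linear system *x*ₜ₊₁ = 0.5 *x*ₜ with the dictionary {*x*, *x*²}, eDMD recovers the eigenvalues 0.5 and 0.25.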

To derive a function dictionary for the eDMD algorithm, we first employed time-delay coordinates (Takens, 1981; Sauer et al., 1991) to reconstruct and embed the state-space dynamics in a high-dimensional ambient space by appending temporal lags to a time series of ECoG recordings (Fig. 2). As suggested previously (Berry et al., 2013), we then applied the diffusion map algorithm (Coifman et al., 2005; Coifman and Lafon, 2006; Nadler et al., 2006) to the augmented high-dimensional data to provide low-dimensional timeseries representations that relate the dynamics of the observations *y* to the dynamics of the unknown state *x*. Figure 2 depicts the end-to-end algorithm (for more details, see Materials and Methods).

The Koopman quantities were derived from ECoG data alone, independently of perception-reports, and were subsequently studied for characteristics relevant to the triplet streaming task. See Figure 1*D* for an exemplar feature which correlated with perception.

### Eigenvalues organize into two branches

The Koopman eigenvalues organized into two distinct collections (Fig. 3*A* for eigenvalues from B335). We refer to these collections as branches and collected the indices for branch one and branch two into the indexed sets *J*₁ and *J*₂ (Fig. 3*A*). Thus, from a linear dynamical system perspective, the modes associated with eigenvalues in each branch define oscillatory components at the corresponding frequencies.
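A discrete-time eigenvalue near the unit circle corresponds to an oscillation whose frequency is set by the eigenvalue's complex argument. A minimal sketch of this standard conversion follows; the sampling interval and example frequency below are hypothetical, not taken from the recordings.

```python
import numpy as np

def eigenvalue_to_frequency(lam, dt):
    """Map a discrete-time Koopman/DMD eigenvalue to an oscillation
    frequency (Hz) and a growth/decay rate (1/s).

    lam : complex eigenvalue of the discrete-time operator.
    dt  : sampling interval in seconds.
    """
    freq = np.angle(lam) / (2 * np.pi * dt)   # cycles per second
    decay = np.log(np.abs(lam)) / dt          # 0 for unit-modulus eigenvalues
    return freq, decay
```

An eigenvalue of the form exp(2πi·*f*·Δt) maps back to the frequency *f* with zero decay.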

The accompanying Koopman eigenfunctions inherited this two-branch organization (Fig. 3*B* illustrates prototypical differences observed between eigenfunctions from the two branches; also, as discussed below, *J*₂-features show phase changes aligned with switches in perception).

The eigenvalue organization on branches *J*₁ and *J*₂ followed the *ABA*– triplet repetition rate and its harmonics. Thus, we called the eigenfunctions on branch *J*₁, whose frequencies matched these harmonics, the *J*₁-features. The eigenvalues on branch *J*₂ were interleaved with those on *J*₁ (Fig. 3*A*, light vs dark circles). They defined an oscillatory process with a relatively faster-changing envelope, found to be entrained to the perceptual states (see next sections). Thus, we called the corresponding eigenfunctions the *J*₂-features.

### A slowly-evolving feature encodes the perceptual states

The eigenfunction associated with the slowest frequency in branch *J*₂ evolved gradually over the course of each block and tracked the subjects' perceptual states (Figs. 3*B*, 4*B*).

The values of this slow eigenfunction separated according to the reported percept (Fig. 4*B*). We performed a permutation of triplet labels to test for a difference in the means of the eigenfunction between percepts. The *p*-value estimates in all control blocks indicated a significant difference, whereas the *p*-values in the bistable blocks (B335, *df*6 and *df*8; R369, *df*6) were larger. To quantify the overlap between the distributions of values for the two percepts (Fig. 4*B*), we calculated the Kullback–Leibler divergence (KLdiv; see function relativeEntropy in MATLAB). Briefly, the amount of shared information between two probability distributions is given by the nonnegative scalar quantity KLdiv. While KLdiv is zero for two identical distributions, it becomes larger when separation between distributions increases (Joyce, 2011). For the histograms shown in Figure 4*B*, we found KLdiv of 33.86, 15.68, 74.02, 47.41, 14.92, and 3.96, with a mean of 31.64, for the control blocks of B335, L357, R369, L372, R376, and L409, respectively. We found KLdiv of 4.46, 1.75, and 2.58, with a mean of 2.93, for the bistable blocks of B335 (*df*6, *df*8) and R369 (*df*6). The KLdiv values indicated a stronger separation between the almost-invariant states identified by the slow eigenfunction in control blocks than in bistable blocks.
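A histogram-based KL divergence of the kind computed by MATLAB's relativeEntropy can be sketched in a few lines. The binning scheme and the small regularization constant below are our own illustrative choices, not the settings used in the analysis.

```python
import numpy as np

def kl_divergence(x, y, bins=50, eps=1e-12):
    """Discrete Kullback-Leibler divergence D(P||Q) between histogram
    estimates of the distributions of samples x and y on shared bins.

    eps keeps empty bins from producing log(0); this is a simple
    sketch, not MATLAB's relativeEntropy implementation.
    """
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))
```

As expected, the divergence vanishes for identical samples and grows as the two distributions separate.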

### *J*₂-features show phase changes aligned with switches in perception

The emergence of percept-dependent phases was assessed for each eigenfunction in branch *J*₂ (Figs. 3*A* and 4*A*). For each eigenfunction with fundamental frequency *f*, we computed its instantaneous phase (see Materials and Methods, Eq. 6) and drew its circular histograms over timepoints separated according to subject-reported percepts. Exemplar histograms and 5-min duration time plots of two instantaneous phases are shown in Figure 5.
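Eq. 6 is not reproduced here; a common route to an instantaneous phase is the analytic signal, which the FFT-based Hilbert transform below computes with numpy alone. This is a generic construction and not necessarily the authors' exact Eq. 6.

```python
import numpy as np

def instantaneous_phase(x):
    """Instantaneous phase of a real signal via the analytic signal,
    computed with an FFT-based Hilbert transform (numpy only).
    """
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0       # double positive frequencies
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(X * h)
    return np.angle(analytic)
```

Percept-conditioned circular histograms can then be built by binning the unwrapped or wrapped phase values within each reported percept.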

### The phase of the leading oscillator in branch *J*₂ defines a predictor for perceptual switches

As opposed to *A*, 4*A*). At any time point *B*). The difference in average phases was small for the majority of timepoints, which kept *B*, gray boxes). Then these windows were assessed for their alignment with the subject-reported perceptual switches (Fig. 5*B*, blue and red vertical lines). For details about the calculation, see Materials and Methods. Behavioral predictions based on variable *B* shows predicted button-presses and their alignment with the recorded perceptual switching data in *df*6 blocks for subject B335. Button press predictions calculated for the control block of all other subjects are shown in Figure 6.

To test whether a button press (“observed” value) was more likely to either precede or succeed a predicted switch (“expected” value) based on changes in phase, we applied a two-sided sign test. From *N* = 126 button presses associated with phase changes in the control blocks (Figs. 5*B*, 6), there were bp = 71 instances of a button press preceding a marked phase change and pb = 55 instances of a phase change preceding a reported button press: *p* = 0.154, not statistically significant. We also performed the analysis over the bistable blocks (bp = 37, pb = 31) and all blocks together (bp = 108, pb = 86) and obtained *p* = 0.467 and *p* = 0.114, respectively, not statistically significant. The majority of button presses occurring before the predicted phase shifts (78 out of 108) were found in subjects B335 and L357, whose reaction times were much shorter than those of the other subjects.
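The reported *p*-values are consistent with a two-sided sign test in its normal approximation, sketched below (the authors' exact implementation is not specified in the text, so this is a reconstruction).

```python
import math

def sign_test_p(n_pos, n_neg):
    """Two-sided sign test, normal approximation: under the null the
    two orderings are equally likely, so z = (n_pos - n_neg) / sqrt(N).
    """
    n = n_pos + n_neg
    z = (n_pos - n_neg) / math.sqrt(n)
    # two-sided p-value from the standard normal survival function
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
```

For example, `sign_test_p(71, 55)` reproduces the control-block value of 0.154.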

### Triplet-based mean-LFPs are well approximated by *J*₁ ∪ *J*₂-features

Recall that the feature-extraction algorithm processed the timeseries from all contacts *c* in HGPM and produced a finite collection of Koopman eigenvalues, modes, and eigenfunctions.

While the eigenfunctions were common across the HGPM contacts, the Koopman modes were specific to each contact *c*; together they defined, for each contact, the approximation of the recorded signal at every time *t*. Each coefficient in this approximation weighted the contribution of one eigenfunction to the reconstructed LFP.

We found the average differences between perceptual streams to be limited along branch *B*, upper panel), while eigenfunctions in *B*, lower panel; also, Fig. 5). The *ABA*– triplets. The

Similarities between triplet-averaged reconstructions were measured with the Pearson correlation coefficient *r* (see Materials and Methods). We found them to be highly correlated, with a median *r* = 0.9996 for reconstructions in bistable blocks and a median *r* = 0.9411 for reconstructions in control blocks (*r* significantly larger in bistable blocks; one-sided Wilcoxon rank-sum test).

Next, we assessed the accuracy of the triplet-averaged overall approximation from Equation 3. We calculated the coefficient of determination, *R*², between each recorded LFP and its reconstruction.
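The coefficient of determination is the standard goodness-of-fit ratio; a minimal numpy sketch (not the authors' code) is:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination between a signal y and its
    reconstruction y_hat: 1 - SS_res / SS_tot.
    """
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot
```

A perfect reconstruction yields 1, and predicting the mean of the signal yields 0.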

### Low-dimensional manifolds extracted from HGPM recordings have a similar geometric structure across group data

For each subject and experimental block, ECoG recordings from all HGPM contacts were introduced as input to the feature-extraction algorithm (Fig. 2). The output was the Koopman collection *c* were projected (see Eq. 3). To compare the geometric structures of the Koopman-based spaces across all nine experimental blocks, we plotted the trajectories defined by the first three eigenfunctions in branch

The double-cone represented an intrinsic invariant structure of the large-scale ECoG data, and it was found to be stereotypical across subjects and stimulus type. The oscillations resembled the standard cosine-sine parametrization of the circle, with a rotation frequency approximating the repetition rate of the *ABA*– triplets in the auditory sequence. We found that the timing of each triplet component (the pure tones *A*, *B*, and *A*, as well as the 200-ms interval of silence) was encoded at a distinct location on the circle, aligned along the three-dimensional embedding. Trajectories shifted in phase when they drifted to the opposite side of the cone (Fig. 9*A*,*B*, left column). Moreover, the double-cone formed a “perceptual manifold,” consisting of two almost-invariant sets corresponding to each percept type (Fig. 9*A*,*B*, middle column: one-stream in blue, two-stream in red). The two-sets intrinsic structure of the underlying manifold was common across subjects.

Low-dimensional dynamics of select HGPM contacts were visualized with projections onto eigenfunctions from branch *c* we considered *A*). The dynamics derived at three different HGPM contacts for each subject but L409 are shown in Figure 9*A*,*B*, right column. L409 had only one HGPM contact so we used temporal lags to visualize its three-dimensional neural dynamics. Although the shape of the underlying manifold varied among experimental blocks, its two-attractor-like structure linked to perception was preserved.

## Discussion

We investigated dynamic attributes and underlying geometric structures of neural correlates of auditory streaming of triplets in minutes-long nonstationary ECoG recordings from human core auditory cortex. These datasets were previously studied by Curtu et al. (2019) with univariate and multivariate statistics, which revealed significant differences in averaged evoked potentials associated with one-stream and two-stream percepts. Analyses by Curtu et al. (2019) and other studies of auditory streaming (Gutschalk et al., 2005; Cusack, 2005; Kondo and Kashino, 2007, 2009; Hill et al., 2011; Higgins et al., 2020) examined triplet-locked epochs and their percept-related means. Our paper presents a novel analysis of triplet streaming data that exploits their temporal dependencies under prolonged stimulus presentations. The algorithm combines emerging methods for dimensionality reduction, manifold learning, and dynamic discovery. We report a collection of neuronal features that encode changes in auditory bistable perception in their frequency components and instantaneous phases. The extracted features organized into two subsets, *J*₁ and *J*₂, encoding attributes of the stimulus and of the perceptual states, respectively.

In auditory streaming of triplets, alternations between percepts were either induced by stimulus modifications (Sussman et al., 1999; Fishman et al., 2001; Micheyl et al., 2005; Snyder et al., 2006; Pressnitzer et al., 2008; also, our control block) or occurred spontaneously under a fixed, ambiguous stimulus (our bistable blocks *df*6, *df*8). Differences in neural correlates of one-stream and two-stream were identified in MEG (Gutschalk et al., 2005; Billig et al., 2018; Sanders et al., 2018), EEG (Hill et al., 2012; Higgins et al., 2020), fMRI (Cusack, 2005; Kondo and Kashino, 2009; Hill et al., 2011), and ECoG (Curtu et al., 2019) recordings. When changes in low-level acoustic stimulus properties altered perception, it has proven difficult to distinguish stimulus-driven neural signatures from perception-only driven activity. Very few attempts were successful in dissociating such effects (Hill et al., 2012). Our proposed feature-extraction algorithm (Fig. 2) provided a robust solution to the stimulus versus perception separation challenge. The eigenfunctions from the decomposition (Figs. 3*A*, 4*A*) exhibited distinct properties: those in *J*₁ followed the *ABA*– triplet repetition; those in *J*₂ aligned with the repetition rate of the *A*-tones. The features in *J*₂ separated the percepts in their value distributions (Figs. 3*B*, 4*B*). The stronger separation of distributions observed in control blocks suggests that the *df*2, *df*12 sequences significantly modulated the quasi-states of the slow percept-related feature.

Phase modulation of cortical oscillations as a potential mechanism for neural encoding of auditory streaming has been studied in rodents (Noda et al., 2013). Under stimulus modifications used as proxies for the perceptual states, differences in “percept”-related phase coherences were reported in the γ frequency-band. In contrast, our results showed that perceptual phase modulations (for features in *J*₂) occurred at much lower frequencies.

Features in *A*, 4*A*). We hypothesize that the leading frequency in *A* tone, or the *B* tone, in the two-stream percept; Gutschalk et al., 2005; Snyder et al., 2006; Thompson et al., 2011; Billig et al., 2018), we provided no such instruction. The subjects may have unconsciously directed their attention to the *A* tone (presented every *B* tone (presented every

Like classic signal processing methods, our feature-extraction algorithm identified fundamental frequencies in recordings and derived linear decompositions in frequency-selected components. There are, however, key differences between these approaches. First, Figure 2-algorithm processed data from all HGPM contacts together and produced a common collection of features for the decomposition of individual LFPs. The frequencies (harmonics of

Computational models have proposed several neural mechanisms for perceptual bistability. These include oscillatory and noise-driven attractor dynamics (Laing and Chow, 2002; Moreno-Bote et al., 2007; Curtu et al., 2008; Rankin et al., 2015), evidence accumulation (Nguyen et al., 2020), predictive coding (Denham and Winkler, 2006), and probabilistic rule-based classification processes (Barniv and Nelken, 2015; Steele et al., 2015). To our knowledge, our study is the first to: (1) report neural evidence for within-trial ongoing competing percepts and (2) report neural activity compatible with the dynamics of models for bistable perception. We extracted a feature of the neural activity in core auditory cortex whose slow dynamics tracked the alternating perceptual states.

Oscillatory attractor dynamics rely on three key computational principles: bistability between two stimulus-induced attractors (typically, two stable steady states), processes of accumulation then recovery for drifting activity between attractors, and noise to ensure trial-by-trial variability. The first two principles underlie a deterministic periodic trajectory that alternates between attractors. The noise introduces stochasticity in the switching times and creates more realistic, irregular percept durations. The model can be augmented by coupling it to a chaotic system to account for external factors that influence perception (e.g., attention, medication). The Figure 2-algorithm approximated the quasi-periodic solution of such an attractor-like flow through features in *J*₂, while features in *J*₁ tracked the *ABA*– triplet sequence. The algorithm likely captured the time-dependent structure of the stimulus before extracting the intrinsic system dynamics.
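For illustration, the three principles above can be realized in a minimal two-population competition model with slow adaptation, in the spirit of Laing and Chow (2002). All parameter values below are illustrative choices, not fitted to the data; with noise set to zero the deterministic alternations remain.

```python
import numpy as np

def simulate_competition(T=3000.0, dt=0.1, I=0.5, beta=1.0, g=0.6,
                         tau_u=1.0, tau_a=100.0, noise=0.0, seed=0):
    """Minimal two-population competition model with slow adaptation.

    u1, u2 : activities of the populations encoding the two percepts
    a1, a2 : slow adaptation variables driving alternations
    Returns the (n, 2) activity time course (illustrative parameters).
    """
    rng = np.random.default_rng(seed)
    f = lambda x: 1.0 / (1.0 + np.exp(-10.0 * (x - 0.2)))   # steep gain
    n = int(T / dt)
    u = np.zeros((n, 2))
    u1, u2, a1, a2 = 0.6, 0.1, 0.0, 0.0
    for t in range(n):
        du1 = (-u1 + f(I - beta * u2 - g * a1)) / tau_u
        du2 = (-u2 + f(I - beta * u1 - g * a2)) / tau_u
        u1 += dt * du1 + noise * np.sqrt(dt) * rng.standard_normal()
        u2 += dt * du2 + noise * np.sqrt(dt) * rng.standard_normal()
        a1 += dt * (-a1 + u1) / tau_a        # slow adaptation
        a2 += dt * (-a2 + u2) / tau_a
        u[t] = (u1, u2)
    return u
```

Mutual inhibition produces winner-take-all dominance, adaptation slowly undermines the winner, and the dominance alternates, mimicking the reported perceptual switches.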

Bridging modeling with large-scale neural recordings has recently received scientific interest because of new methodologies like manifold learning and Koopman decomposition. Theoretical results found that DMD alone (Tu et al., 2014) was not well-suited for identifying slow or transient dynamics as seen in perceptual switching. This shortcoming was addressed by eDMD (Williams et al., 2015) together with time-delay coordinates (Takens, 1981) and diffusion maps (Berry et al., 2013). The Koopman operator was then shown to accurately describe quasi-periodic solutions of oscillatory systems with “weak” chaotic components (Giannakis, 2019), although it seemed unable to track chaotic dynamics per se (Mezić, 2005, 2013; Giannakis, 2019). We combined these methods into a single algorithm to extract neuronal features that captured single-trial behavioral responses in auditory streaming of triplets. Our results do not suggest that bistable perception might be resolved in primary auditory area HGPM. Rather, they show that, under certain circumstances, one could make predictions of ongoing perceptual alternations by simply recording from HGPM.

## Materials and Methods

### Participants

Six neurosurgical patients treated for pharmaco-resistant epilepsy participated in the streaming task: four males and two females, identified here as B335, L357, R369, L372, R376, and L409 (age range 29–47; median age 33 years). The subjects listened to sequences of repeated triplets of tones and reported their perception, either one-stream or two-stream, by pressing a button. Electrocorticographic (ECoG) data from multicontact depth electrodes and subdural electrode arrays and subject behavioral responses were recorded simultaneously. Research procedures were approved by The University of Iowa Institutional Review Board and the National Institutes of Health. Participation in the task was voluntary, and each patient had the option to cease participation at any time without disruption in their clinical treatment. These datasets were previously published by Curtu et al. (2019).

### Auditory stimuli and bistable perception

The *ABA*– stimulus structure followed the description by Curtu et al. (2019). Tones *A* and *B* lasted *A* and *B* tone, and *A* tone, with a total duration of

For all participants except R369, and for all stimulus blocks, the *B*-tone frequency was fixed while the *A*-tone frequency varied to yield semitone differences *df* = 2, 6, 8, and 12. Three stimulus blocks were considered: the control block, *df*6, and *df*8. The control block alternated between epochs of *df* = 2 and *df* = 12 semitone differences (12 epochs for each, with durations ranging between 5–28 triplets for *df*2 and 9–45 triplets for *df*12). The timing of the onset and offset of *df*2 and *df*12 epochs was identical across subjects. We called this the *control* block since it elicited stable percepts, with listeners generally reporting the *df*2 semitone difference as one-stream (integrated) and the *df*12 semitone difference as two-stream (segregated into separate *A* and *B* streams). The stimuli in the *df*6 and *df*8 blocks did not change throughout the behavioral task. The frequencies of the *A* and *B* tones remained fixed, separated by *df* = 6 semitones (*df* = 8, respectively) in all 500 triplets, but subjects reported spontaneous switches between one-stream and two-stream percepts. We named *df*6 and *df*8 the (perceptually) bistable blocks.

All subjects were instructed to report changes in perception from one-stream to two-stream or from two-stream to one-stream by pressing an appropriate button on a response box. Reaction times (RTs) were computed from the latency of their behavioral responses to stimulus changes during the control block (0.65, 0.36, 3.22, 1.24, 1.29, and 1.48 s for B335, L357, R369, L372, R376, and L409, respectively). To account for individual reaction times, we assigned in our analysis a percept-neutral label (neither one-stream nor two-stream) to a number of triplets immediately preceding each button press: in bistable blocks, two triplets for B335 and six triplets for R369; in control blocks, all triplets following a stimulus change and until the button press was recorded (see below; also Curtu et al., 2019).

### ECoG recordings and data preprocessing

The ECoG data that we analyzed in this paper were a subset of the datasets published by Curtu et al. (2019), and were preprocessed following the same procedure. Briefly, ECoG recordings were obtained simultaneously from multicontact depth electrodes and subdural electrode arrays placed over the temporal, parietal, and frontal lobes. Data were acquired using an RZ2 real-time processor from Tucker-Davis Technologies for subjects B335 and L357, and the Neuralynx ATLAS system for R369, L372, R376, and L409. Depth electrode arrays (8–12 macro contacts; Fig. 1*C*) targeted Heschl's gyrus. Neural signals were also obtained from contacts targeting other noncore auditory areas within the superior temporal plane (planum temporale, planum polare), insula and superior temporal sulcus (using depth electrodes), and from surrounding auditory-related temporoparietal cortex and frontal areas (subdural grid arrays). The ECoG recordings were amplified, filtered (0.7- to 800-Hz bandpass), and digitized. Preprocessing included a parameter *N* in the spatial filter that determines how many components of the singular value decomposition (SVD) of the normalized spatial correlation matrix, taken over all contacts in the narrowly defined frequency band, were to be discarded. Here, the value of *N* was derived independently for each individual subject by applying a previously described criterion (Gavish and Donoho, 2014) to their ECoG data (as opposed to choosing a common value for all subjects).

### Experimental design and statistical analysis

#### Input to the feature extraction algorithm

Our analyses were performed on a subset of the complete ECoG data set. We retained as input into the proposed algorithm only the ECoG recordings obtained from contacts in posteromedial Heschl's gyrus (HGPM). The number of HGPM contacts, *C* for B335). Nine perceptual blocks were analyzed independently: the control block *df*6 of B335 and R369; and bistable block *df*8 of B335 (Figs. 4, 9). At each contact *c*, the local field potential

#### Time-delayed coordinates and diffusion maps

The first step in the analysis was to perform the state-space reconstruction and nonlinear dimensionality reduction of HGPM data per experimental block. This was done using the diffusion-mapped delay coordinate (DMDC) method outlined by Berry et al. (2013), with the following variables and hyperparameter selection. We considered delay coordinates formed by appending to each observation its *s* previous measurements. As proposed by Berry et al. (2013), an exponential factor in the delay coordinates was introduced to improve regularity in the embedded coordinates. Berry et al. (2013) advised choosing a small nonzero constant for the decay α to balance information loss (at large α the tail of delay coordinates decays rapidly to zero) with reduced regularization (at α = 0). We set α = 0.001 for all datasets and appended each observation with *s* = 799 past measurements. Thus, a single delay coordinate
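The exponentially weighted delay coordinates described above can be sketched in a few lines of NumPy. Only *s* = 799 and α = 0.001 come from the text; the function name, array layout, and loop structure below are illustrative, not the authors' implementation.

```python
import numpy as np

def delay_embed(x, s=799, alpha=0.001):
    """Exponentially weighted delay coordinates (after Berry et al., 2013).

    Each row pairs observation x[t] with its s previous samples,
    with a lag-k sample scaled by exp(-alpha * k).  The first s
    samples of x have no complete history and are dropped.
    """
    T = len(x)
    weights = np.exp(-alpha * np.arange(s + 1))  # lags 0 .. s
    coords = np.empty((T - s, s + 1))
    for i, t in enumerate(range(s, T)):
        # current sample first, then progressively older samples
        coords[i] = weights * x[t::-1][:s + 1]
    return coords
```

Note that, as in the text, the embedded trajectory is shorter than the raw recording by *s* samples.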

Note that the first 799

The delay coordinates are hypothesized to encapsulate key dynamical properties of the original data. To generate low-dimensional representations of these dynamics, we introduced them as input into a diffusion map algorithm (Coifman et al., 2005; Coifman and Lafon, 2006). This algorithm performs nonlinear dimensionality reduction by calculating eigenvalues and eigenvectors of a stochastic matrix P derived from a nonlinear kernel. The kernel is rotationally invariant and chosen to emphasize local neighborhoods and geometry intrinsic to the data. Such an emphasis is an alternative to linear directions of maximal variance, as identified in principal component analysis (PCA). Here, we used the Gaussian kernel as a similarity measure between all delay coordinates. The leading *N* eigenvectors identified as output of the DMDC algorithm would form a dictionary of functions in the extended dynamic mode decomposition (eDMD) algorithm proposed by Williams et al. (2015).
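A minimal sketch of the diffusion map step follows, assuming a Gaussian kernel with a median-distance bandwidth heuristic; that bandwidth choice is our assumption for illustration, not the scale used in the paper. The eigendecomposition is performed on the symmetric conjugate of the row-stochastic matrix P, a standard numerical device.

```python
import numpy as np

def diffusion_map(X, eps=None, n_eig=10):
    """Diffusion map on the rows of X (Coifman and Lafon, 2006): a sketch."""
    # pairwise squared Euclidean distances between delay coordinates
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if eps is None:
        eps = np.median(D2)                    # illustrative bandwidth
    K = np.exp(-D2 / eps)                      # rotation-invariant Gaussian kernel
    d = K.sum(axis=1)                          # degrees; P = K / d is row-stochastic
    S = K / np.sqrt(np.outer(d, d))            # symmetric conjugate of P
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1][:n_eig]       # leading eigenpairs
    vals, vecs = vals[idx], vecs[:, idx]
    psi = vecs / np.sqrt(d)[:, None]           # right eigenvectors of P
    return vals, psi
```

The top eigenvalue of P is 1 with a constant eigenvector; the subsequent eigenvectors supply the low-dimensional coordinates.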

### Extended dynamic mode decomposition

The eDMD algorithm requires the specification of a dictionary of basis-like functions, used to approximate the eigenfunctions of the Koopman operator associated with the system's dynamics as linear combinations of dictionary elements. We developed a procedure to select the input dictionary for the eDMD algorithm by computing the fast Fourier transform (FFT) of the previously identified DMDC eigenvectors

We took the starting index of the third subset of vectors as a hard threshold for function selection: the eDMD dictionary was defined by all DMDC eigenvectors before this index. Note that while the exact cutoff index varied, the qualitative organization of the eDMD dictionary was similar across subjects.

With the selected DMDC functions, we performed the eDMD algorithm as described by Williams et al. (2015) to obtain approximations to Koopman eigenvalues, modes, and eigenfunctions: *A*). Eigenfunctions *c* recording sites.
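The core linear algebra of eDMD (Williams et al., 2015) can be sketched as follows: a least-squares Koopman matrix is fit between consecutive snapshots of the dictionary functions, its eigendecomposition yields approximate Koopman eigenvalues and eigenfunctions, and a second least-squares fit recovers the Koopman modes that map eigenfunctions back to the observed signals. Variable names are illustrative, not those of the original code.

```python
import numpy as np

def edmd(Psi, Y):
    """Extended dynamic mode decomposition: a minimal sketch.

    Psi : (T, N) dictionary functions evaluated along the trajectory
    Y   : (T, C) observed signals (e.g., LFPs at C contacts)
    Returns approximate Koopman eigenvalues, eigenfunction time
    series, and Koopman modes.
    """
    PsiX, PsiY = Psi[:-1], Psi[1:]                 # consecutive snapshots
    # least-squares Koopman matrix K: PsiX @ K ~ PsiY
    K, *_ = np.linalg.lstsq(PsiX, PsiY, rcond=None)
    mu, Xi = np.linalg.eig(K)                      # eigenvalues / eigenvectors
    Phi = Psi @ Xi                                 # eigenfunctions on the data
    # Koopman modes: coefficients reconstructing Y from eigenfunctions
    Modes, *_ = np.linalg.lstsq(Phi, Y.astype(complex), rcond=None)
    return mu, Phi, Modes
```

On a linear system observed through its own state, this recovers the system's eigenvalues exactly, which is a useful sanity check before applying it to a DMDC dictionary.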

In this paper, we identified changes in the temporal dynamics of neural features

### Almost-invariant dynamical regions

Our analysis showed that trajectories along the manifold M evolved within two distinct, well-separated regions for long periods of time, with fast transitions between them. We adopted the terminology of Froyland and Padberg (2009) and called these regions “almost-invariant” (Fig. 9, blue vs red regions). While such subsets of M were not dynamically invariant (trajectories eventually left them, often in a chaotic or probabilistic way), they were consistent with the notion of almost-invariant sets. In accord with the definition by Froyland and Padberg (2009), a trajectory originating in either almost-invariant attracting set of M was unlikely to leave that region for some nontrivial amount of time; the wandering time in the complementary set, before the return, was also nontrivial.
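Almost-invariance can be checked empirically from a labeled trajectory by tabulating dwell times: long runs in each region, separated by rare switches, are the signature described above. The helper below is a sketch under the assumption that each time point has already been assigned a region label (e.g., 0/1 for the blue/red regions).

```python
import numpy as np

def dwell_times(labels):
    """Run lengths of consecutive identical region labels.

    Returns a list of (region, run_length) pairs in temporal order.
    Long runs in both regions, with few switches, indicate
    almost-invariance in the sense of Froyland and Padberg (2009).
    """
    labels = np.asarray(labels)
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1  # switch indices
    bounds = np.concatenate(([0], change, [len(labels)]))
    runs = np.diff(bounds)                                  # run lengths
    regions = labels[bounds[:-1]]                           # label of each run
    return list(zip(regions.tolist(), runs.tolist()))
```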

### The eigenvalues of the Koopman operator

The values

### Permutation test

To study the statistical significance of the correspondence between eigenfunction

The null hypothesis of the permutation test (to be rejected) states that there is no difference in the mean of

For each permutation of labels, we calculated the difference in means *m* and took its absolute value. We repeated this over *n* = 10,000 permutations and derived the Monte Carlo *p*-value estimate

### Instantaneous phase

Any complex number *f* in Hz by

Recall that the imaginary components of eDMD-derived eigenvalues *m* of the triplet presentation rate *f* to the harmonic *f* then the instantaneous phase shift computed according to Equation 6 would be constant throughout the entire 5 min experimental block. In this paper we refer to the instantaneous phase shift *f* in Equation 6 as simply *the instantaneous phase*. In some examples, when we wanted to emphasize the dependence of the phase on the frequency component *f*, as opposed to the eDMD index *j*, we used the short notation
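The idea behind the instantaneous phase can be sketched numerically: the complex eigenfunction is demodulated by the reference harmonic *f*, and the angle of the result is the phase shift. A pure oscillation at *f* then yields a constant shift, as stated above. The function signature and sampling-rate argument below are illustrative and do not reproduce the paper's Equation 6 symbol for symbol.

```python
import numpy as np

def instantaneous_phase(phi, f, fs):
    """Instantaneous phase shift of a complex eigenfunction phi
    relative to a reference frequency f (Hz), sampled at fs (Hz).

    Demodulating by exp(-i 2 pi f t) removes the reference rotation;
    the remaining angle, wrapped to (-pi, pi], is the phase shift.
    """
    t = np.arange(len(phi)) / fs
    return np.angle(phi * np.exp(-1j * 2 * np.pi * f * t))
```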

As a timeseries, the instantaneous phase shift *df*6, *df*8 for B335 and *df*6 for R369; see Fig. 5), or through the opposite angle

### Button press prediction

We used the instantaneous phase *j* in decreasing order of real part for eigenvalues, and since all complex eigenvalues come in conjugate pairs, *f* was selected to be the closest harmonic of *f* = *f* = *f* were renamed

We computed

These were calculated with the “CircStat” toolbox for MATLAB (Berens, 2009). The average phases, as opposed to the instantaneous phase, had the advantage of reducing the effects of rapid but unsustained fluctuations in
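For readers working outside MATLAB, the circular mean computed by CircStat's `circ_mean` has a direct NumPy equivalent; the sketch below averages unit vectors rather than raw angles, which avoids wrap-around artifacts near ±π and, applied over windows, damps rapid but unsustained phase fluctuations.

```python
import numpy as np

def circular_mean(angles):
    """Circular mean of phase angles (radians).

    Maps angles to unit vectors on the circle, averages them, and
    returns the angle of the resultant vector.
    """
    z = np.exp(1j * np.asarray(angles))
    return np.angle(z.mean())
```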

The instantaneous phase

### Signal reconstruction

System observations (LFPs) were approximated as linear combinations of the 5-min-long Koopman eigenfunctions, using the elements of the Koopman modes

Triplet-based reconstruction profiles

Similarities between triplet-averaged reconstructions *r*. These measures were calculated per individual subject, block, and contact, for a total of 53 comparisons. The median *r* values were calculated for the group data in control blocks (median of 33 *r* values obtained from 33 contacts) and for the group data in bistable blocks (median of 20 *r* values from 20 contacts). Then the *r* values of the bistable group data versus the control group data were tested for significance with one-sided Wilcoxon rank-sum exact tests at the
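Assuming *r* denotes a Pearson correlation coefficient (the definition is elided above, so this is an assumption), the per-contact similarity measure reduces to a one-liner over the triplet-averaged traces; the group-level one-sided rank-sum comparison could then be run with, e.g., `scipy.stats.ranksums` or `scipy.stats.mannwhitneyu` on the two sets of *r* values.

```python
import numpy as np

def triplet_r(recon, lfp):
    """Correlation between a triplet-averaged reconstruction and the
    triplet-averaged LFP at one contact (Pearson r via np.corrcoef).
    """
    return np.corrcoef(recon, lfp)[0, 1]
```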

### Assessing the effects of noise on time-delay embeddings

Time-delay embeddings, particularly those constructed from real data, have been shown to be sensitive to noise. Establishing a methodology for attenuating the effects of noise on embeddings is still an open area of research (Ignacio et al., 2019, in topological data analysis; Pan and Duraisamy, 2020, in nonlinear dynamical systems). We relied on these references, and on others (see below), to build several components of the feature extraction algorithm that mitigate the impact of noise on the results. First, we scaled the delay coordinates by an exponentially decaying coefficient *s* leads to a reduction of the level of noise. This scaling controls the impact of datapoints further away in time from *s*, and when we did or did not apply the aforementioned preprocessing steps. In our hyperparameter exploration, we focused on three measures: (1) the emergence of the two eigenvalue branches *A*), increase delays to *B*), *C*; this is the parameter set we used in the manuscript for all subjects), and increase delays to *D*). To summarize, increasing the number of delays led to reduced noise artifacts in the uncovered features but also smeared the jumps between near-steady states in *C*.

### Code accessibility

All code, data, and analysis files for this report are available at https://osf.io/7w9qh/.

## Footnotes

This work was supported by the National Science Foundation (NSF) Grant CRCNS-1515678 (to R.C.) and partly by the NSF-RTF Award DMS-1840260 (to P.M.) and by the National Institutes of Health Grant NIDCD R01 DC004290. We thank Haiming Chen, Phillip Gander, Matthew Howard, Hiroto Kawasaki, Christopher Kovach, Kirill Nourski, Ariane Rhone, Beau Snoad, and Xiayi Wang for help with data acquisition and preprocessing.

The authors declare no competing financial interests.

- Correspondence should be addressed to Rodica Curtu at rodica-curtu{at}uiowa.edu

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.