Abstract
The extraction and analysis of pitch underpin speech and music recognition, sound segregation, and other auditory tasks. Perceptually, pitch can be represented as a helix composed of two factors: height increases monotonically with frequency, while chroma repeats cyclically with each doubling of frequency (octave). Although the early perceptual and neurophysiological mechanisms for extracting pitch from acoustic signals have been extensively investigated, the equally essential subsequent stages that bridge to high-level auditory cognition remain less well understood. How does the brain represent the perceptual attributes of pitch at higher-order processing stages, and how are these neural representations formed over time? We used a machine learning approach to decode, across different pitches, the time-resolved neural responses of human listeners (10 females and 7 males) measured by magnetoencephalography, hypothesizing that pitches sharing similar neural representations would yield reduced decoding performance. We show that pitch can be decoded from lower-frequency neural responses within auditory-frontal cortical regions. Specifically, linear mixed-effects modeling reveals that height and chroma explain the decoding performance of delta band (0.5–4 Hz) neural activity at distinct latencies: a long-lasting height effect precedes a transient chroma effect and recurs after it, indicating sequential processing stages with distinct perceptual and neural characteristics. Furthermore, the localization analyses of the decoder demonstrate that height and chroma are associated with overlapping cortical regions, with differences observed in the right orbital and polar frontal cortex. The data provide a perspective motivating new hypotheses on the mechanisms of pitch representation.
- auditory perception
- decoding
- machine learning
- magnetoencephalography (MEG)
- octave equivalence
- representational similarity analysis
Significance Statement
Pitch is fundamental to various facets of human hearing, including music appreciation, speech comprehension, vocal learning, and sound source differentiation. How does the brain encode the perceptual features of pitch? By applying machine learning techniques to time-resolved neuroimaging data of individuals listening to different pitches, our findings reveal that pitch height and chroma—two distinct features of pitch—are associated with different neural dynamics within the auditory-frontal cortical network, with height playing a more prominent role. This offers a unified theoretical framework for understanding the perceptual and neural characteristics of pitch perception and opens new avenues for noninvasively decoding human auditory perception to develop brain–computer interfaces.
Introduction
Pitch is central to many aspects of human auditory cognition, including the appreciation of music, understanding speech, vocal learning, segregating sound sources, and forming auditory objects. In both psychoacoustics and auditory neuroscience, the functionally early mechanisms of extracting pitch from sounds have been extensively studied, yielding a range of foundational insights (Licklider, 1951; Meddis and O’Mard, 1997; de Cheveigné, 2010; Oxenham, 2023). However, one question that is less typically addressed is this: once pitch has been extracted, how does the brain represent its perceptual attributes, given that pitch is not a monolithic concept? Understanding the neural representation of pitch will, this study conjectures, bridge the gap between low-level auditory processing and high-level auditory cognition.
Many behavioral experiments demonstrate that certain (often musical) tonal stimuli are perceived as more similar to each other than others. For example, the pitches of adjacent piano keys and of keys an octave apart are perceptually more similar than other pairs. This perceptual pattern can be effectively represented by a helix model (Fig. 1A) incorporating height and chroma factors, where a shorter distance between two pitches represents a higher perceptual similarity (Shepard, 1982; Ueda and Ohgushi, 1987). The height dimension corresponds log-linearly to the fundamental frequency (F0) of the tone: two pitches with a smaller height difference are perceived as more similar. Height information is crucial for separating sound sources, such as distinguishing individuals' voices according to their different frequency ranges.
Helical pitch representation, experimental design, and neural signal analyses. A, The helix model of pitch representation includes the vertical height (log-linear) and horizontal chroma (circular) factors. We use musical notation to provide a clear and intuitive understanding of pitch height and chroma differences. B, The matrix of pairwise height differences among the pitches used, in units of octaves. C, The matrix indicating whether paired pitches are equivalent in chroma. D, Cochleagrams of the piano, violin, and flute tones across eight pitches. E, Event-related fields and topographies of neural responses to the tones (onset time at 0 s), recorded by MEG sensors and grand-averaged across trials and participants. The waveforms are bandpass filtered at 0.1–40 Hz for visualization (but not for analyses). F, We extract the source signals from the early auditory cortex (red), auditory association cortex (orange), insular and frontal opercular cortex (brown), inferior frontal cortex (purple), orbital and polar frontal cortex (green), and dorsolateral prefrontal cortex (blue) for decoding analyses. G, Within each participant, we train a machine learning classifier at each time point of the source neural responses for each pitch pair. Decoding performance is quantified by ROC-AUC, where a larger value represents more separable neural responses, indicating greater neural dissimilarity. The decoding analysis returns a time series of neural dissimilarities of pitch pairs for each participant (Fig. 2). The goal is to explain these dissimilarities by pitch height and chroma (Fig. 3).
Chroma, on the other hand, signifies the cyclic and recurring pattern of perceptual similarities among pitches separated by an octave (a doubling of F0), a phenomenon commonly referred to as octave equivalence (Révész, 1954; Shepard, 1964). Chroma plays an important role in recognizing acoustic patterns independent of the sound sources. In speech, for example, octave equivalence facilitates human vocal learning by equalizing the vocal range difference across ages and genders (Wagner and Hoeschele, 2022). In music, a melody shifted by an octave is recognized as being similar or even the same (Demany and Armand, 1984; Wright et al., 2000). Although the way one divides an octave into pitch classes (e.g., C, D, E or Do, Re, Mi) can vary across cultures, the recurrence of pitch classes per octave appears to be a commonly shared feature (Burns, 1999; but see Jacoby et al., 2019). Octave equivalence cannot be simply explained by low-level acoustic matching, even though half of the harmonic frequencies inevitably overlap between pitches an octave apart (Jacoby et al., 2019; Wagner et al., 2022).
Despite important characterizations of the perceptual representation of pitch over many years, its neural representation remains incompletely understood. Pitch is primarily represented in the auditory cortex (Zatorre, 1988; Pantev et al., 1989; Zatorre et al., 2002; Penagos et al., 2004; Bendor and Wang, 2005; Allen et al., 2022). Specifically, the posterior subregion (posterior planum temporale) is associated with height, while the anterior subregion (extending from Heschl's gyrus into the planum polare) is associated with chroma (Warren et al., 2003; Briley et al., 2013; Moerel et al., 2015). Although these basic attributes of pitch seem spatially separable in the cortex, several neurobiological questions remain unaddressed. The present experiment is designed with two questions in mind, motivated by issues of timing and neural dynamics. When do the representations of height and chroma emerge? Are they formed at the same time or in different processing stages? And how is the neural activity associated with perceptual similarity?
We apply a decoding analysis approach to the neural responses of pitch, investigated at a high temporal resolution and recorded by magnetoencephalography (MEG). If two pitches are perceptually more dissimilar, their neural representation should also be more dissimilar, and therefore the decoding performance is expected to be higher. Therefore, by analyzing how the time course of the dissimilarity pattern among pitch pairs resembles the perceptual dissimilarity explained by pitch height and chroma factors, we can reveal the temporal neural dynamics of pitch representation.
Materials and Methods
Resource availability
All stimuli, processed data, and analysis codes have been deposited at an Open Science Framework repository (https://doi.org/10.17605/OSF.IO/NHCGJ) and are publicly available.
Participants
This study incorporated data from 17 right-handed, normal-hearing, nonmusician participants (self-reported; age range, 18–35 years; 10 females). Individual MRI scans were obtained for 14 participants. For the remaining three, who were unable to undergo MRI scanning, a standardized template was used. All participants provided written informed consent prior to their involvement and were compensated monetarily. The study was approved by the local ethics committee of the University Hospital Frankfurt.
Stimuli
Eight signals spanning three octaves in the pitch classes C, E, and G# were used: G#6 (415.3 Hz), C7 (523.3 Hz), E7 (659.3 Hz), G#7 (830.6 Hz), C8 (1,046.5 Hz), E8 (1,318.5 Hz), G#8 (1,661.2 Hz), and C9 (2,093.0 Hz). The fundamental frequencies of the pitches are logarithmically equidistant, with a step corresponding to one-third of an octave (four semitones). Each pitch was presented in three instrumental timbres (flute, violin, and piano), resulting in 24 individually recorded tones, retrieved from the NSynth Dataset (Engel et al., 2017). The duration of each recording was 0.4 s, and the sampling rate was 44.1 kHz. Rather than sinusoidally synthesized tones or a single timbre (as has been typical in previous studies), we used tones from various instruments to improve the generalizability of our findings. The amplitude of the tones was pseudorandomly set between 67 and 73 dB SPL for each presentation.
According to the Western music scale (i.e., 12-tone equal temperament, which divides an octave into 12 logarithmically equidistant pitch classes), we quantified the pairwise pitch relationships as height difference (measured in octaves, Fig. 1B) and chroma equivalence (Fig. 1C, where “True” indicates the same pitch class).
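As a concrete illustration, both matrices (Fig. 1B,C) follow directly from the pitches' semitone spacing. Below is a minimal NumPy sketch, assuming the eight pitches are indexed by their distance in semitones from the lowest tone (G#6); all variable names are ours, not from the original analysis code.

```python
import numpy as np

# Eight pitches in 4-semitone steps: G#6, C7, E7, G#7, C8, E8, G#8, C9
semitones = np.arange(8) * 4

# Pairwise height difference in octaves (Fig. 1B): |delta semitones| / 12
height_diff = np.abs(semitones[:, None] - semitones[None, :]) / 12.0

# Chroma equivalence (Fig. 1C): True when two pitches share the same pitch class,
# i.e., their semitone difference is a whole number of octaves
chroma_equivalent = (semitones[:, None] - semitones[None, :]) % 12 == 0

print(height_diff)        # e.g., pitches one octave apart -> 1.0
print(chroma_equivalent)  # True along the diagonal and at multiples of an octave
```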
A cochleagram representation was calculated for each stimulus (Fig. 1D), with 128 filters between 50 Hz and 20 kHz and sample factor 2, constructed by the pycochleagram package (https://github.com/mcdermottLab/pycochleagram), and pairwise Spearman's correlations were calculated to assess cochleagram similarity (Fig. 4B, where higher values indicate greater similarity).
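For the similarity matrix in Figure 4B, the pairwise correlations can be computed as sketched below, assuming each stimulus's cochleagram has already been obtained (e.g., with pycochleagram) as a 2D array; the random arrays and array sizes here are placeholders, not the actual stimuli.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder cochleagrams: one 2D array (filters x time) per pitch
rng = np.random.default_rng(0)
cochleagrams = [rng.random((128, 400)) for _ in range(8)]

# Pairwise Spearman correlation of flattened cochleagrams (cf. Fig. 4B)
n = len(cochleagrams)
similarity = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        rho, _ = spearmanr(cochleagrams[i].ravel(), cochleagrams[j].ravel())
        similarity[i, j] = rho
```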
Procedure
Each participant was exposed to 2,400 stimuli (8 frequencies multiplied by 3 instrumental timbres multiplied by 100 repetitions per stimulus). The materials were presented in 100 runs, such that each run contained each stimulus, presented in a pseudorandom order, with half of the runs having consecutive tones in the same pitch (regardless of instrument). The interonset intervals within a run roved between 0.9 and 1.4 s. To ensure that participants were attending to the stimuli, after each run, they were asked to report whether there were any consecutive tones (i.e., stimuli of the same pitch) by making a yes/no button-press response using their right hand. The next run started 1.5–2 s after the response. (Although the behavioral data were unavailable due to technical failure, the postblock, offline behavioral responses are not of critical relevance to the goals and interpretation of this study.) In sum, each stimulus was presented 100 times, and each pitch was presented 300 times.
MEG recording and preprocessing
Each participant sat in a semidark room while their brain responses were recorded by a 275-channel CTF MEG system at a 1,200 Hz sampling rate (Fig. 1E). They were required to look at a white cross in the center of a black screen throughout each trial.
We used the Python package MNE (Gramfort et al., 2013) to implement the following signal processing. Independent component analysis (ICA) was used to remove artifact signals (Delorme et al., 2007). To speed up the ICA, the raw sensor-space MEG data were first high-pass filtered at 1 Hz (zero-phase, two-pass), downsampled to 200 Hz, and segmented into −0.5 to 1 s epochs (time-locked to stimulus onset). The ICs reflecting artifacts, including eye blinks, eye movements, electrocardiogram, and 50 Hz powerline noise, were identified by visual inspection. We then returned to the raw, unfiltered sensor-space continuous data and removed the artifact ICs.
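A minimal MNE-Python sketch of this artifact-removal step is given below; the trigger-channel detection, number of ICA components, and the excluded component indices are illustrative assumptions rather than the exact values used.

```python
import mne

# `raw` is assumed to be the loaded continuous CTF recording (1,200 Hz)
raw_filt = raw.copy().load_data().filter(l_freq=1.0, h_freq=None)   # 1 Hz high-pass for ICA
events = mne.find_events(raw_filt)                                  # assumes a readable trigger channel
epochs_for_ica = mne.Epochs(raw_filt, events, tmin=-0.5, tmax=1.0,
                            baseline=None, decim=6, preload=True)   # 1,200 Hz -> 200 Hz

ica = mne.preprocessing.ICA(n_components=40, random_state=0)        # n_components is illustrative
ica.fit(epochs_for_ica)
ica.exclude = [0, 3, 7]                                              # e.g., blink, eye-movement, cardiac ICs
raw_clean = ica.apply(raw.copy().load_data())                        # remove artifact ICs from the unfiltered data
```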
The artifact-cleaned, unfiltered sensor-space continuous signal was then bandpass filtered (zero-phase, two-pass) into the delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), beta (13–25 Hz), and low gamma (25–40 Hz) bands, downsampled to 200 Hz, segmented into −0.2 to 0.5 s epochs (tone onset at 0 s), and baseline corrected to the prestimulus period; epochs with a maximum peak-to-peak signal amplitude exceeding 4 × 10⁻¹² were rejected.
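The band-specific filtering and epoching can be sketched as follows, assuming `raw_clean` and `events` from the previous step; only the delta band is shown, and the other bands follow the same pattern.

```python
import mne

bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 25), "low_gamma": (25, 40)}

l_freq, h_freq = bands["delta"]
raw_band = raw_clean.copy().filter(l_freq=l_freq, h_freq=h_freq)    # zero-phase bandpass
epochs = mne.Epochs(raw_band, events, tmin=-0.2, tmax=0.5,
                    baseline=(None, 0),            # baseline correct to the prestimulus period
                    reject=dict(mag=4e-12),        # peak-to-peak rejection threshold from the text
                    decim=6, preload=True)         # 1,200 Hz -> 200 Hz
```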
We converted signals from sensor space to source space in the following steps. (1) The T1 images available for 14 participants were used for participant-specific surface reconstruction and to create a BEM model and solution with the FreeSurfer watershed algorithm (FreeSurfer 7.1.1; Fischl, 2012). For the three participants without a T1 image, the fsaverage template brain was used. (2) The noise covariance matrices were empirically computed from the prestimulus epoched data. (3) The forward solutions were then estimated. (4) The inverse operator was applied to the epochs using the dSPM method with regularization parameter λ² = 1/9. (5) The source-space signals were morphed and downsampled to the fsaverage ico3 template space. This step enhances statistical power and mitigates overfitting in the decoding analysis by reducing the dimensionality of the data and lowering multicollinearity among the source signals. (6) To further mitigate overfitting, we reduced the dimensionality by including only signals from the bilateral early auditory cortex, auditory association cortex, and frontal lobe (inferior frontal cortex, insular and frontal opercular cortex, orbital and polar frontal cortex, and dorsolateral prefrontal cortex), totaling 316 vertices (Fig. 1F), defined according to the HCP-MMP1 atlas (Glasser et al., 2016); these were used for the subsequent decoding analysis, as these regions roughly cover the auditory ventral “what” pathway or have been reported to be associated with pitch processing (Rauschecker and Tian, 2000; Hall and Plack, 2009; Rauschecker and Scott, 2009). In addition to the auditory cortex, we included these broad frontal regions for several reasons. First, certain frontal areas are commonly involved in pitch processing (e.g., the inferior frontal area; Zatorre et al., 1994; Griffiths et al., 1999), whereas others are less clearly associated (e.g., the dorsolateral prefrontal cortex, which relates to pitch labeling, and the insular cortex, involved in processing pitch patterns; Wong et al., 2004; Bermudez and Zatorre, 2005). Consequently, there is no definitive guidance on which regions should be specifically included or excluded. Additionally, using decoding analyses instead of traditional fMRI contrasts may reveal more distributed and potentially unexpected regions involved in the task. In any case, our machine learning model was designed to automatically shrink the weights of locations with minimal contribution, effectively excluding those regions.
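A condensed MNE-Python sketch of this source-localization pipeline is shown below; the subject name, file paths, and source-space spacing are illustrative assumptions, whereas the dSPM method, λ² = 1/9, and the fsaverage ico3 target follow the text.

```python
import mne

subjects_dir = "/path/to/freesurfer/subjects"        # illustrative path

noise_cov = mne.compute_covariance(epochs, tmax=0.0)  # empirical prestimulus noise covariance
src = mne.setup_source_space("sub-01", spacing="oct6", subjects_dir=subjects_dir)
bem_model = mne.make_bem_model("sub-01", conductivity=(0.3,), subjects_dir=subjects_dir)
bem = mne.make_bem_solution(bem_model)
fwd = mne.make_forward_solution(epochs.info, trans="sub-01-trans.fif", src=src, bem=bem)
inv = mne.minimum_norm.make_inverse_operator(epochs.info, fwd, noise_cov)
stcs = mne.minimum_norm.apply_inverse_epochs(epochs, inv, lambda2=1.0 / 9.0, method="dSPM")

# Morph each participant's source estimates to the fsaverage ico3 template
morph = mne.compute_source_morph(stcs[0], subject_from="sub-01", subject_to="fsaverage",
                                 spacing=3, subjects_dir=subjects_dir)
stcs_fsavg = [morph.apply(stc) for stc in stcs]
```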
Multivariate pattern decoding analysis
To quantify the representational similarity of each pitch pair, we used a ridge classifier as the decoder, implemented with the Python packages MNE and scikit-learn (Pedregosa et al., 2011). The function RidgeClassifierCV() implements a cross-validation procedure to automatically tune the regularization (complexity) parameter (α) over 41 logarithmically spaced grid points between 10⁻² and 10³. Compared with the least-squares method, the ridge method avoids overfitting and multicollinearity by shrinking the coefficients (w) of the less important sources toward 0 (McDonald, 2009). It is very similar to logistic regression with L2 regularization, except that it uses mean squared error with an L2 penalty, rather than cross-entropy, as the loss function (X, design matrix; y, dependent variable):

L(w) = ‖y − Xw‖² + α‖w‖²
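Putting these pieces together, the time-resolved pairwise decoding can be sketched as below, assuming a source-level data matrix of shape trials × sources × time samples for one pitch pair; the synthetic data and dimensions are illustrative placeholders.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from mne.decoding import SlidingEstimator, cross_val_multiscore

# Illustrative data: trials x source vertices x time samples for one pitch pair
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 316, 141))   # 141 samples = -0.2 to 0.5 s at 200 Hz
y = rng.integers(0, 2, size=120)           # binary pitch labels

# Ridge classifier with the regularization grid described above (10^-2 to 10^3, 41 points)
alphas = np.logspace(-2, 3, 41)
clf = make_pipeline(StandardScaler(), RidgeClassifierCV(alphas=alphas))

# Train and score one classifier per time point; ROC-AUC serves as neural dissimilarity
time_decoder = SlidingEstimator(clf, scoring="roc_auc", n_jobs=-1)
scores = cross_val_multiscore(time_decoder, X, y, cv=5, n_jobs=-1)
auc_timecourse = scores.mean(axis=0)       # average over cross-validation folds
```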
Statistical analyses
To explain the representational dissimilarity matrix (Fig. 1G) at each time point by pitch height difference (Fig. 1B) and chroma equivalence (Fig. 1C), we fitted a linear mixed-effects model (LMEM) at each time point using statsmodels in Python (Seabold and Perktold, 2010). The LMEM, an extension of linear regression, estimates the predictors of interest (fixed effects), including height difference, chroma equivalence, and their interaction, while accounting for variability across participants (random effect) by including both random intercept and slope components (Harrison et al., 2018). We report marginal and conditional R² as goodness-of-fit metrics, which are commonly used for LMEMs since the standard linear-regression R² does not apply (Nakagawa and Schielzeth, 2013). Though not directly comparable to the traditional R², they offer a similar indication of model fit.
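A minimal statsmodels sketch of the per-time-point LMEM is given below; the synthetic data and column names (auc, height_diff, chroma_eq, participant) are our own illustrations of the model structure, not the actual data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative long-format data: one row per participant x pitch pair at one time point
rng = np.random.default_rng(0)
n_participants, n_pairs = 17, 28
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_participants), n_pairs),
    "height_diff": np.tile(rng.uniform(1 / 3, 7 / 3, n_pairs), n_participants),
    "chroma_eq": np.tile(rng.integers(0, 2, n_pairs), n_participants).astype(float),
})
df["auc"] = 0.5 + 0.02 * df["height_diff"] + rng.normal(0, 0.01, len(df))

# Fixed effects: height difference, chroma equivalence, and their interaction;
# random intercept and slopes grouped by participant
model = smf.mixedlm("auc ~ height_diff * chroma_eq", data=df,
                    groups=df["participant"],
                    re_formula="~height_diff + chroma_eq")
result = model.fit()
print(result.summary())
```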
We assessed multicollinearity among the independent variables in the LMEMs using variance inflation factor (VIF) analysis. For the model involving height difference, chroma equivalence, and their interaction, the VIFs were 1.3, 7.7, and 8.5, respectively—all below the common cutoff of 10 (Thompson et al., 2017), indicating low multicollinearity and a robust model fit. Conversely, the model involving height difference, cochleagram similarity, and their interaction yielded VIFs of 58.6, 6.7, and 65.4, respectively, indicating multicollinearity; these models should therefore be interpreted with caution. This multicollinearity is not unexpected, as both factors essentially refer to conceptually similar features—one defined in musicological terms (height) and the other in acoustical terms (cochleagram)—and it is particularly unavoidable given the realistically recorded instrument tones used in this study. Nonetheless, we report the models including cochleagram for transparency and hope this information will aid future studies in synthesizing tones with dissociated cochleagram similarity and height difference.
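The VIF check can be reproduced as sketched below with statsmodels; the predictor values are random stand-ins, so the printed VIFs will not match those reported above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative stand-ins for the pairwise height-difference and chroma variables
rng = np.random.default_rng(0)
height_diff = rng.uniform(1 / 3, 7 / 3, 28 * 17)
chroma_eq = rng.integers(0, 2, 28 * 17).astype(float)

X = pd.DataFrame({
    "intercept": 1.0,                            # VIFs are computed on a design matrix with an intercept
    "height_diff": height_diff,
    "chroma_eq": chroma_eq,
    "interaction": height_diff * chroma_eq,
})
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "intercept"}
print(vifs)   # values below ~10 are conventionally taken to indicate low multicollinearity
```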
To analyze ROC-AUC time series data (Fig. 2), we used a one-sample cluster-based permutation test (Maris and Oostenveld, 2007), flipping the signs based on whether values were above or below 0.5 within each participant. To conduct the cluster-based permutation test on LMEM (Figs. 3A, 4A,C), the dependent and independent variables were randomly shuffled within each participant before fitting an LMEM for each iteration. (Additionally, a small uniform-random number between −0.0001 and 0.0001 was added to the permuted ROC-AUC in cases where the LMEM failed to converge, which occurred in a very small number of iterations.) For all these tests, the number of permutations was set to 10,000, the clustering threshold was set at an absolute t-value of 2.5 with a minimum width of 20 ms (five consecutive time samples), and two-tailed cluster-level p-values were reported.
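For the one-sample test on the ROC-AUC time series, a minimal sketch using MNE's cluster-based permutation routine is shown below; the data are synthetic, and the minimum-cluster-width criterion described above would be applied when summarizing the resulting clusters.

```python
import numpy as np
from mne.stats import permutation_cluster_1samp_test

# Illustrative ROC-AUC time series: participants x time samples
rng = np.random.default_rng(0)
auc = 0.5 + 0.01 * rng.standard_normal((17, 141))

t_obs, clusters, cluster_pvals, _ = permutation_cluster_1samp_test(
    auc - 0.5,                 # test against chance (0.5); signs are flipped within participant
    threshold=2.5,             # clustering threshold (absolute t-value)
    n_permutations=10000,
    tail=0,                    # two-tailed cluster-level p-values
    seed=0,
)
```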
Decoding performance in each neural frequency band. Each plot summarizes the ROC-AUC time series either (A) averaged per participant or (B) averaged per pitch pair. The curves in B are color-coded according to the pitch height difference. The temporal clusters which were significantly above chance are marked by shading. The horizontal dashed line indicates chance performance (ROC-AUC = 0.5).
Pitch height and chroma explain delta band (0.5–4 Hz) decoding performance. A, An LMEM was used to explain the ROC-AUC at each time point with pitch height and chroma factors. The fitted coefficients were converted to t-values. The significant temporal clusters are marked by shading, and the cluster-level p-values are indicated. Pitch height difference positively predicts ROC-AUC in two time windows (blue), suggesting that neural responses are more dissimilar among pitches with larger height differences. Chroma equivalence negatively predicts ROC-AUC in the 300–320 ms time window (red), suggesting that neural responses are more similar among chromatically equivalent pitches (same pitch class). Height difference and chroma equivalence have an interaction effect in the 300–320 ms time window (purple). The marginal and conditional R² values are reported at each time point as goodness-of-fit metrics (black and gray). B, Neural dissimilarity as a function of pitch pair, as predicted by the fitted LMEM at each time point (t). The height difference factor was included at all time points, while the chroma equivalence and interaction factors were only included at their significant time point (0.3 s). At 0 s, neural dissimilarities of pitch pairs were not associated with height difference. Neural dissimilarity was positively associated with height difference, with the linear slope increasing from 0 to 0.2 s and then decreasing from 0.4 to 0.5 s. At 0.3 s, the chroma equivalence and height difference factors jointly influenced neural dissimilarities: the neural responses were more similar when pitch pairs were one octave apart and less similar when they were two octaves apart. Note that, as there were fewer instances of pitch pairs with larger height differences (top panel), their estimation precision will inevitably be lower. C, The 3D visualization illustrates neural dissimilarities (distances) among pitches. The top panel illustrates the interpretation of the dimensions of the 3D visualization. The data are identical to those in B but presented using an alternative visualization method. Pairwise ROC-AUC values, derived from the fitted LMEM at each time point, were mapped into a three-dimensional Euclidean space (A.U., arbitrary unit, with axes oriented to resemble the helix model in the top panel), where pitches that are closer in distance exhibit more similar neural responses. At time points 0.1, 0.2, 0.4, and 0.5 s, the neural responses to pitches can be represented on a one-dimensional line, resembling the height dimension, with the maximum separation occurring at ∼0.2 s. At 0.3 s, the neural responses to pitches roughly formed a three-dimensional helix-like shape, resembling both the height and chroma dimensions of the pitch helix model (Fig. 1A).
Additional LMEM analyses and cochleagram similarity. A, Explaining theta (4–8 Hz) and alpha (8–13 Hz) band decoding performances by an LMEM with pitch height and chroma factors. The format is the same as in Figure 3A. There was no significant cluster. B, The matrix of pairwise cochleagram similarity among pitches across timbres. C, Explaining decoding performance at each frequency band by an LMEM with pitch height difference and cochleagram similarity variables. The format is the same as in Figure 3A. There was no significant cluster.
Results
In the first step, we examined which frequency bands carry neural information that distinguishes pitches in this paradigm. In the by-subject analysis, we collapsed across all pitch pairs per participant (n = 17 participants), and one-sample cluster-based permutation tests showed that the ROC-AUC was significantly above chance in the delta, theta, and alpha bands (Fig. 2A). Orthogonally, in the by-item analysis, we collapsed across all participants per pitch pair (n = 28 pairs), and the ROC-AUC was also significantly above chance in the delta, theta, and alpha bands (Fig. 2B). These findings suggest that the neural encoding of pitch profiles is predominantly reflected in the lower-frequency bands. Therefore, only the delta, theta, and alpha bands are included in the subsequent analyses.
The ROC-AUC values were in the range of 0.50–0.55 (Fig. 2). These relatively modest decoding values are likely caused by the diversity of sound types (different instruments) in our experimental design, which introduces additional variability in the neural responses and consequently impacts the classifier's performance. Despite this experiment-inherent challenge, the magnitude of our ROC-AUC aligns with the findings from comparable pitch decoding MEG and fMRI studies that used single timbres (Sankaran et al., 2018, 2020; Czoschke et al., 2021), and our findings offer better generalizability. Importantly, the identification of statistically significant above-chance clusters implies that the signals are sufficiently robust for further analyses.
We next used LMEM to examine whether the neural representational dissimilarity (ROC-AUC in Fig. 2) at each time point can be explained by pitch height difference (Fig. 1B), pitch chroma equivalence (Fig. 1C), and their interaction. In the delta band, the time series of t-values for each coefficient of each independent variable are shown in Figure 3A. The cluster-based permutation test showed that the height difference was positively associated with ROC-AUC at approximately 95–285 and 335–415 ms. This suggests that neural responses were more different when listening to two pitches with a larger height difference, consistent with the height dimension of the helix model (Fig. 1A). Chroma equivalence was negatively associated with ROC-AUC, at approximately 300–320 ms. This suggests that neural responses were harder to distinguish when two pitches belonged to the same pitch class, consistent with the chroma dimension of the helix model (Fig. 1A). Finally, there was a significant interaction effect at approximately 300–320 ms. The fitted LMEM coefficients at each time point in the delta band are visualized in Figure 3B,C.
We also conducted, using the same approach, several control analyses. First, we applied the same LMEM to the ROC-AUC from the theta and alpha bands, but none of the factors were significant (Fig. 4A). Second, we tested an alternative hypothesis: was the effect of chroma equivalence driven by the higher acoustic similarity between pitches that are an octave apart? We calculated the cochleagram of each tone, obtained the pairwise similarities (Fig. 4B), replaced the chroma equivalence factor in the LMEM with this similarity matrix, and redid the analyses. However, the LMEM did not reveal any significant findings in any frequency band (Fig. 4C), consistent with previous studies indicating that octave equivalence cannot be explained by acoustic similarity (Jacoby et al., 2019; Wagner et al., 2022). Note that models incorporating cochleagram similarity should be interpreted with caution due to intrinsic multicollinearity among the independent variables (see Materials and Methods, Statistical analyses), which likely leads to imprecise parameter estimates and thus fails to replicate the height effect observed in Figure 3A.
We next conducted a temporal generalization analysis with the same cross-validation method to examine whether the two significant height difference clusters in Figure 3A represent similar pitch information. The rationale is that if a classifier trained at one time point exhibits above-chance performance at another time point, the neural responses at these two time points represent similar information. We used a one-sided t test to examine which paired train–test time points were above chance (Fig. 5), and the false discovery rate (FDR) was used to control for multiple comparisons (α = 0.05; Benjamini and Hochberg, 1995). We did not use a cluster-based permutation test here to control for multiple comparisons because the performance on the diagonal (training time equals testing time) will certainly be higher than at other locations. The analyses showed that, unsurprisingly, there was a significant region on the diagonal, representing each classifier performing best at the trained time point with some generalizability to temporally adjacent points. More interestingly, the classifier trained on the approximately 140–190 ms window could explain the neural responses in the 400–500 ms window, and this effect was stronger than that of a cluster in the diagonally symmetrical position (the late window explaining the early window). Although both effects were weaker than the effect on the diagonal, and the time window was not perfectly aligned with the clusters shown in Figure 3A, this finding invites the interpretation that delta band neural activity reflects similar pitch height information peaking at two distinct latencies. Moreover, activity in the early window is more robust for training a classifier that generalizes to the late window than vice versa.
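A sketch of the temporal generalization analysis is given below, using MNE's GeneralizingEstimator with the same classifier as in the decoding analysis; the data are synthetic placeholders, and the across-participant one-sided t tests with FDR correction (e.g., mne.stats.fdr_correction) would then be applied to the resulting train × test AUC matrices.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from mne.decoding import GeneralizingEstimator, cross_val_multiscore

# Illustrative per-pair data, as in the decoding sketch: trials x sources x times
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 316, 141))
y = rng.integers(0, 2, size=120)

clf = make_pipeline(StandardScaler(), RidgeClassifierCV(alphas=np.logspace(-2, 3, 41)))
gen = GeneralizingEstimator(clf, scoring="roc_auc", n_jobs=-1)

# Train at each time point and test at every other time point;
# the result (after averaging folds) is a train-time x test-time AUC matrix
scores = cross_val_multiscore(gen, X, y, cv=5, n_jobs=-1).mean(axis=0)
```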
Temporal generalization analysis. The decoder trained at each time point (y-axis) was applied to all other time points (x-axis). The black and gray contours indicate training–testing time pairs that are significantly above chance at various FDR-corrected thresholds. In addition to the high decoding performance on the diagonal, where training time is equal to testing time, the model trained in the approximately 140–190 ms window could explain the neural responses in the 400–500 ms window, suggesting similar neural representations in these two windows.
To explore which brain regions are involved in representing pitch height and chroma, we mapped the weights of the classifier onto the cortex (Fig. 6A). We further averaged the weights within each atlas label and normalized the mean weight as a percentage to equalize the magnitude of the sum of weights across time (Fig. 6B), with the baseline level at 8.33% (100% divided by 12 regions). We specifically focused on the time windows representing height (180–230 ms, the peak range of the height effect) and chroma (300–320 ms). First, the weights in the right early auditory region were above baseline in both the height (t(16) = 3.92, p = 0.001, Cohen's d = 0.98) and chroma (t(16) = 3.79, p = 0.002, Cohen's d = 0.95) time windows, as were the weights in the right insular and frontal opercular cortex for both the height (t(16) = 6.44, p < 0.001, Cohen's d = 1.63) and chroma (t(16) = 6.07, p < 0.001, Cohen's d = 1.52) windows. Second, we tested the difference between windows within each region, and the weight in the right orbital and polar frontal cortex was higher in the chroma than in the height window (t(16) = 4.47, p < 0.001, Cohen's d = 1.12). Finally, the lateralization analyses (Fig. 6C) showed rightward lateralization in multiple regions and time windows, including the early auditory cortex during the chroma window (t(16) = 3.02, p = 0.008, Cohen's d = 0.76), the auditory association cortex during the height window (t(16) = 3.20, p = 0.006, Cohen's d = 0.80), and the inferior frontal cortex during the height window (t(16) = 3.53, p = 0.003, Cohen's d = 0.88). Together, both height and chroma seem to engage the right hemisphere more than the left and involve shared cortical regions across auditory and frontal areas; however, the classifier weights in the right orbital and polar frontal cortex reflected their differences. Note that interpreting the map of the classifier's weights is not straightforward, as a higher weight does not necessarily imply a larger influence on the classification result (Kriegeskorte and Douglas, 2019). Also, our MEG decoding analyses should be considered complementary to previous fMRI studies from an encoding perspective (e.g., Warren et al., 2003), although they are not directly comparable. Nevertheless, we hope these exploratory findings, despite not controlling for multiple comparisons (Bender and Lange, 2001), can contribute to shaping future localization studies on this topic.
Cortical maps and time series of the classifier weights. A, The cortical maps of the mean absolute weights of the classifier at 180–230 and 300–320 ms time windows, corresponding to the temporal clusters of pitch height and chroma of Figure 3A, respectively. B, The time series of the mean proportional absolute weight by cortical regions, with the black horizontal line representing the baseline (BL). C, The distribution of the mean proportional absolute weight of time windows within each cortical region, with the black dashed line representing the baseline. L, left hemisphere; R, right hemisphere; **p < 0.01; ***p < 0.001, uncorrected for multiple comparisons.
Discussion
The study applied machine learning-based decoding analyses to MEG data, revealing the neural dynamics of pitch representation. Three novel findings are reported. First, low-frequency (0.5–13 Hz) neural activity measured in the auditory and frontal cortex carries information that distinguishes pitch profiles, with the delta band (0.5–4 Hz) specifically encoding the relative relationships among the pitches (height difference and chroma equivalence). Second, height and chroma representations are processed at different stages, with height preceding chroma. The height representation emerges as early as ∼100 ms, is sustained for ∼200 ms, and reappears at ∼400 ms. The chroma representation arises later, at ∼300 ms poststimulus onset and for a brief duration, and its neural representations are consistent with the pitch helix model below two octaves (Figs. 1A, 3C). Third, exploratory analyses showed that the neural representations of height and chroma are both right lateralized, convergent with many reports in the literature (Zatorre et al., 2002; Albouy et al., 2020; Abrams et al., 2024), while the orbital and polar frontal cortices appear to be involved in height and chroma representation in different ways.
Height is arguably the most salient dimension of a pitch representation. The observation that pitch pairs with larger height differences have lower neural representational similarity (better decoding performance) is consistent with previous behavioral findings that pitches with larger fundamental frequency differences on a logarithmic scale are perceived as more different (Ueda and Ohgushi, 1987). In the brain, a larger pitch height difference elicits stronger event-related potential (ERP) responses (Näätänen et al., 2007; He et al., 2009), and sounds with closer pitch-eliciting frequencies activate spatially adjacent neural ensembles in the tonotopically organized auditory cortex (Allen et al., 2022). The height representation, spanning from ∼100 to 300 ms poststimulus onset, is well aligned with the latencies of the auditory ERP components associated with pitch perception, such as the N1 and the pitch onset response (de Cheveigné, 2010). This suggests that the height representation is likely formed at a preattentive, automatic processing stage (Regev et al., 2019) and could be more or less universal across listeners (Kallman, 1982; Jacoby et al., 2019). Consistent with our findings, a recent MEG study analyzed the neural representational similarity of pitches in a Western tonal music context and showed that broadband neural dissimilarity at 150–300 ms is associated with coarse-grained pitch height difference (Sankaran et al., 2020). Our fine-grained data further pinpoint that this association is specific to the delta band and describe its log-linear mathematical relationship.
The reappearance of a pitch height representation at ~400 ms may reflect a feedback or recurrent communication between the auditory and high-level frontal regions and possibly be associated with the awareness of pitch height. Recurrent processing is often associated with the conscious awareness of the sensory information (Förster et al., 2020). Consistent with this hypothesis, individuals with congenital amusia, a lifelong deficit in musical pitch and its awareness, feature atypical recurrent processing in the right frontotemporal network (Peretz, 2016). Targeted studies are needed to investigate the neural mechanism and the perceptual function of this reappearance.
The chroma representation of pitch, in our study, is elicited at a later latency and is only visible in these analyses for a short duration. This is consistent with the behavioral studies showing that the perceptual effects of chroma are less robust, as they can only be observed in specific task designs which require using chroma information and/or engaging attention (Kallman, 1982; Hoeschele et al., 2012; Regev et al., 2019; Wagner et al., 2022). Yet our findings demonstrate that the neural representation of pitch chroma can automatically occur without a task that explicitly taxes chroma. Note that the weaker observable behavioral or neural effects of chroma should not be interpreted as indicating that chroma is perceptually less important, as this remains untested. Interestingly, the latency of the chroma effect overlaps with the latency of the P3a, an ERP component associated with the attentional processing of sensory input (Polich, 2007). Together, these properties suggest a hypothesis that chroma information is likely implicit and only accessible at the attentional processing stage for a limited time, which can account for the difficulty of observing its perceptual effects.
An unexpected and intriguing finding is that neural representations are more similar when pitches are one octave apart, reflecting the effect of octave equivalence and aligning with the helix model; however, they appear more dissimilar when two octaves apart, which is inconsistent with the helix model. This finding provides a minor but noteworthy counterexample to the helix model, aligning with empirical behavioral findings suggesting that the octave equivalence effect may be limited to immediately neighboring octaves (Wagner et al., 2022). Note that, as there were fewer instances of pitch pairs with larger height differences in our experiment, their estimation precision will inevitably be lower. Therefore, the model estimation at two octaves should be taken with caution. Together, our findings suggest a need to reexamine the helix model across varying numbers of octaves.
The height and chroma representations differ in engaging the right orbital and polar frontal cortex. The role of the orbital and polar frontal cortex is less clear in the literature, although this region is involved in some pitch- or music-related tasks (Zatorre et al., 1996; Limb, 2006; Fasano et al., 2023). One fMRI study showed that these regions are associated with tracking Western tonal structure (Janata et al., 2002). This may be driven by the encoding of relationships among pitches (i.e., height and chroma), as these are fundamental elements in forming musical tonality. Other neuroanatomical studies have reported that the orbitofrontal cortex is connected to the auditory cortex via the ventral pathway (Rauschecker and Tian, 2000) and can modulate sensory coding and delta band activity in a top-down manner (Keitel et al., 2017; Winkowski et al., 2018; Mittelstadt and Kanold, 2023). Therefore, this connection could account for the recurrent processing in the delta band for height at ∼400 ms and reflect the difference between height and chroma. Also, gray matter volume in the orbitofrontal cortex is greater among musicians (Fauvel et al., 2014), and its activity can be associated with musical expertise (Bücher et al., 2023), which may relate to the fact that the effect of octave equivalence is stronger among musicians (Allen, 1967; Krumhansl and Shepard, 1979). Together, these findings suggest a testable hypothesis: the perception of music is formed through the auditory-frontal neural network, progressing from midlevel pitch height and chroma representations to high-level tonal structure. This process may involve recurrent processing and can be modulated by musical expertise.
The specific involvement of the delta band suggests that the formation of a pitch representation could engage a large cortical network. As a general principle, more widespread neural networks tend to recruit lower-frequency neural activity (Buzsaki and Draguhn, 2004). While recent studies of auditory delta band activity have focused predominantly on its function in temporal processing and entrainment (Henry and Obleser, 2012; Arnal et al., 2015; Chang et al., 2019), it also reflects long-range communication between auditory regions and motor and frontal regions (Keitel et al., 2017; Morillon and Baillet, 2017; Gourévitch et al., 2020). On the other hand, one study showed that delta band activity is associated with the processing of pitch in the context of speech (Teoh et al., 2019). Further studies are needed to understand the function of the delta band in nontemporal aspects of auditory perception, including pitch.
The decoding performance in the theta and alpha bands was above baseline (Fig. 2) but was not explained by height and chroma (Fig. 4A). While this null effect could be due to insufficient statistical power, it is also possible that these bands encode individual pitch profiles rather than pairwise relationships (e.g., height differences or chroma equivalence), which remains to be investigated.
Together, the combined MEG recording and decoding approach reveals the neural dynamics of pitch representations, which are neither monolithic nor undifferentiated.
Footnotes
A.C. was supported by a Ruth L. Kirschstein Postdoctoral Individual National Research Service Award, National Institute on Deafness and Other Communication Disorders/National Institutes of Health (F32DC018205), and Leon Levy Scholarships in Neuroscience, Leon Levy Foundation/New York Academy of Sciences. X.T. was supported by the Improvement on Competitiveness in Hiring New Faculties Funding Scheme of the Chinese University of Hong Kong (4937113). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank the Poeppel Lab members at New York University (NYU), the Max Planck Institute for Empirical Aesthetics, and the Ernst Struengmann Institute for Neuroscience for their comments and support. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise. During the preparation of this work, we used ChatGPT for language editing. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
The authors declare no competing financial interests.
- Correspondence should be addressed to Andrew Chang at ac8888@nyu.edu or Xiangbin Teng at xiangbinteng@cuhk.edu.hk.