Abstract
Pitch and timbre are two primary features of auditory perception that are generally considered independent. However, an increase in pitch (produced by a change in fundamental frequency) can be confused with an increase in brightness (an attribute of timbre related to spectral centroid) and vice versa. Previous work indicates that pitch and timbre are processed in overlapping regions of the auditory cortex, but are separable to some extent via multivoxel pattern analysis. Here, we tested whether attention to one or other feature increases the spatial separation of their cortical representations and whether attention can enhance the cortical representation of these features in the absence of any physical change in the stimulus. Ten human subjects (four female, six male) listened to pairs of tone triplets varying in pitch, timbre, or both and judged which tone triplet had the higher pitch or brighter timbre. Variations in each feature engaged common auditory regions with no clear distinctions at a univariate level. Attending to one did not improve the separability of the neural representations of pitch and timbre at the univariate level. At the multivariate level, the classifier performed above chance in distinguishing between conditions in which pitch or timbre was discriminated. The results confirm that the computations underlying pitch and timbre perception are subserved by strongly overlapping cortical regions, but reveal that attention to one or other feature leads to distinguishable activation patterns even in the absence of physical differences in the stimuli.
SIGNIFICANCE STATEMENT Although pitch and timbre are generally thought of as independent auditory features of a sound, pitch height and timbral brightness can be confused for one another. This study shows that pitch and timbre variations are represented in overlapping regions of auditory cortex, but that they produce distinguishable patterns of activation. Most importantly, the patterns of activation can be distinguished based on whether subjects attended to pitch or timbre even when the stimuli remained physically identical. The results therefore show that variations in pitch and timbre are represented by overlapping neural networks, but that attention to different features of the same sound can lead to distinguishable patterns of activation.
Introduction
Pitch and timbre are two fundamental perceptual dimensions of sound. Variations in pitch carry information about intonation and melody, whereas timbre is closely related to sound quality and identity. Despite the importance of pitch and timbre in auditory and speech perception, it remains unclear how they are represented in the cortex. Although these dimensions are often studied independently, several studies have shown that they can interact with one another perceptually (Beal, 1985; Melara and Marks, 1990; Moore and Glasberg, 1990; Marozeau and de Cheveigné, 2007; Borchert et al., 2011; Allen and Oxenham, 2014). Given their perceptual interactions, it is plausible that they are represented in overlapping cortical regions. A recent fMRI study found that pitch and timbre variations were in fact represented in largely overlapping regions of the auditory cortex, although the patterns of activation could be distinguished using multivoxel pattern analysis (MVPA) (Allen et al., 2017). However, this conclusion was based on a passive listening task. In an earlier fMRI study, Warren et al. (2005) also found overlapping bilateral regions in the temporal lobes that were active for sounds varying in either fundamental frequency (F0) or spectral envelope shape. They found additional activation when alternating between harmonic and noise stimuli and inferred that the midportion of the right superior temporal sulcus (STS) may process information specifically related to spectral envelopes (i.e., "brightness"). However, although subjects were instructed to attend to the stimuli, there was no explicit auditory task and thus no manipulation of attention, and the stimuli were not equated for perceptual salience.
Auditory attention has been found to modulate activity in wide regions of the superior temporal gyrus (STG) (Jäncke et al., 1999; Degerman et al., 2006). A recent meta-analysis by Alho et al. (2014) compared neural representations of several sound dimensions and categories (pitch, spatial location, speech, and voice processing) during active and passive fMRI measurements. Although speech or voice processing loci were not found to change with attention, pitch was found to activate more posterior and lateral areas in STG during active tasks, whereas the passive listening loci were shifted more anteriorly, toward the lateral end of Heschl's gyrus, the macroanatomical landmark that corresponds most closely to primary auditory cortex. Although some studies have examined cortical representations of timbral dimensions (Menon et al., 2002; Reiterer et al., 2008; Allen et al., 2018), none has yet examined the effects of modulating attention to timbre. It thus remains possible that an attentionally demanding task may enhance the spatial separability of the cortical representations of pitch and timbre.
In addition to possible differences between active and passive listening conditions, subjects' attention can be directed to a particular sound feature. Recent studies have shown that attention can enhance the representation of a specific sound within a mixture (Mesgarani and Chang, 2012; Ding and Simon, 2012a,b) and there is some evidence for neuronal modulation in visual cortex as a function of attention to different features within a visual stimulus (Saenz et al., 2002). However, it is unknown whether attention to a specific auditory feature selectively enhances the representation of that feature over others rather than just enhancing the representation of the entire object.
This study investigated whether task-based attention enhances the separation of the neural correlates of pitch and timbre relative to that found in a passive-listening task (Allen et al., 2017) and whether neural correlates of attention to either pitch or timbre emerge even when the physical stimulus remains identical. The results suggest that the representations of pitch and timbre variation are subserved by strongly overlapping cortical regions even in the active task conditions, but reveal that attention to one or other dimension can lead to activation patterns that are distinguishable via MVPA even in the absence of physical differences between the stimuli.
Materials and Methods
Ethics statement.
The experimental procedures were approved by the Institutional Review Board for human subject research at the University of Minnesota. Written informed consent was obtained from each subject before starting the measurements.
Participants.
Ten right-handed subjects (mean age: 27.5, SD: 3.2; 4 females, 6 males) from the University of Minnesota community participated in the experiment. All subjects had normal hearing, defined as audiometric pure tone thresholds of 20 dB hearing level or better at octave frequencies between 250 Hz and 8 kHz. The musical experience of the subjects ranged from 0 to 20 years, with the majority of subjects (n = 7) reporting 8 or more years of formal musical training, 1 reporting between 3 and 7 years, and the remaining 2 reporting 2 years or less.
Prescan experimental design.
Before being scanned, each subject's difference limens (DLs) were measured for pitch and timbre discrimination. For pitch discrimination, we measured the DL for F0, the periodicity of a sound and the physical variable most closely associated with pitch. For timbre, we measured the DL for the spectral centroid or center frequency (CF), a physical variable whose manipulation leads to changes in the timbral dimension of "brightness" or "sharpness" (Fastl and Zwicker, 2007). The paradigm was similar to that used by Allen and Oxenham (2014) and Allen et al. (2017). Band-pass-filtered, resolved harmonic complex tones (Houtsma and Smurzynski, 1990; Shackleton and Carlyon, 1994) eliciting robust pitch and timbre percepts were generated in MATLAB (The MathWorks) and presented using the AFC toolbox for auditory psychophysics (Ewert, 2013). Pairs of successive tone triplets (each consisting of three identical tones presented consecutively) were played diotically through HD 650 headphones (Sennheiser) at a sampling rate of 44,100 Hz. Each tone was 200 ms in duration for a total triplet duration of 600 ms, with a 300 ms interstimulus interval between the two triplets. The tones had 20 ms raised-cosine onset and offset ramps. Harmonics up to 10,000 Hz were added in sine phase and scaled to produce 24 dB/octave slopes around the center frequency (i.e., the spectral centroid). The 3 dB bandwidth of the stimulus around the center frequency was one-quarter octave, with the tone complexes having no flat band-pass region. Sounds were presented at an overall level of 70 dB sound pressure level (SPL).
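For concreteness, a minimal MATLAB sketch of one such tone at the nominal values (F0 = 200 Hz, spectral centroid = 900 Hz) is given below. The study's exact synthesis code is not reproduced here; details not specified above, such as the final normalization step, are assumptions.

```matlab
% Sketch of one band-pass-filtered harmonic complex tone using the
% parameters reported above (normalization and calibration to 70 dB SPL
% are assumptions; they depend on the playback hardware).
fs  = 44100;                 % sampling rate (Hz)
f0  = 200;                   % fundamental frequency (Hz)
cf  = 900;                   % spectral centroid / filter center (Hz)
dur = 0.2;                   % single-tone duration (s)
t   = (0:round(dur*fs)-1)'/fs;
harmonics = f0:f0:10000;                  % harmonics up to 10 kHz
gains_dB  = -24*abs(log2(harmonics/cf));  % 24 dB/octave slopes around cf
gains     = 10.^(gains_dB/20);            % (3 dB bandwidth = 1/4 octave)
tone = sin(2*pi*t*harmonics) * gains';    % sine-phase sum of harmonics
nr   = round(0.02*fs);                    % 20 ms raised-cosine ramps
ramp = 0.5*(1 - cos(pi*(0:nr-1)'/nr));
tone(1:nr)         = tone(1:nr) .* ramp;
tone(end-nr+1:end) = tone(end-nr+1:end) .* flipud(ramp);
tone = tone / max(abs(tone));             % normalize before level scaling
```

A triplet is then simply three such tones concatenated back to back, with the two triplets of a trial separated by 300 ms of silence.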
Subjects listened to pairs of tone triplets presented sequentially and on each trial selected the triplet with the higher pitch or brighter timbre (i.e., a standard two-alternative forced-choice procedure). Triplets were chosen with the aim of increasing the perceptual salience of the stimuli and obtaining a more robust cortical response (Thomas et al., 2015). Stimuli were paired with boxes on the screen that lit up with each triplet, along with the question, "Which pitch was higher?" or "Which timbre was brighter?" depending on the task. Feedback was provided after each response, indicating whether the subject was correct or incorrect. For the pitch condition, the CF of the stimulus remained unchanged at 900 Hz and the nominal F0 was roved ±10% uniformly around 200 Hz across trials. For the timbre task, the F0 remained unchanged at 200 Hz and the nominal CF was roved ±10% uniformly around 900 Hz across trials (Fig. 1). A two-down, one-up adaptive tracking rule was used to vary the difference in F0 (ΔF0) or the difference in CF (ΔCF) between the two triplets to track the DL corresponding to performance of 70.7% correct (Levitt, 1971). On each trial, the two values of F0 or CF were geometrically centered around the nominal value and the order of presentation (low-high or high-low) was randomized with equal a priori probability. The starting value of ΔF0 or ΔCF was 20%, defined relative to the lower of the two F0s or CFs, and was initially increased or decreased by a factor of 2. Following the second staircase reversal (i.e., the first direction change in the tracking variable from "up" to "down"), this factor was decreased to 1.26 and was then further decreased to the final factor of 1.12 after two more reversals. After six reversals at the smallest step size, the run was terminated and the DL in each run was calculated as the geometric mean of the ΔF0 or ΔCF values at the last six reversal points. Each subject's final DL for each dimension was based on the geometric mean DL across six task blocks. All blocks of one dimension were completed before beginning measurements in the next dimension and this ordering was counterbalanced across subjects.
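The tracking logic can be summarized in the following MATLAB sketch; runTrial is a hypothetical stand-in for presenting one triplet pair at the current difference and scoring the response, and the bookkeeping details are assumptions consistent with the description above.

```matlab
% Sketch of the 2-down/1-up adaptive track (converging on 70.7% correct;
% Levitt, 1971). runTrial is hypothetical: it presents one triplet pair
% with the current delta (%) and returns true for a correct response.
delta     = 20;                 % starting DeltaF0 or DeltaCF (%)
factors   = [2 1.26 1.12];      % step-size factors per stage
stage     = 1; nCorrect = 0; lastDir = 0; reversals = [];
while numel(reversals) < 10     % 4 early + 6 at the smallest step size
    if runTrial(delta)                       % correct response
        nCorrect = nCorrect + 1;
        if nCorrect == 2                     % two correct -> step down
            nCorrect = 0; dir = -1;
            delta = delta / factors(stage);
        else
            dir = lastDir;                   % no step yet
        end
    else                                     % incorrect -> step up
        nCorrect = 0; dir = 1;
        delta = delta * factors(stage);
    end
    if dir ~= lastDir && lastDir ~= 0        % direction change = reversal
        reversals(end+1) = delta;            %#ok<AGROW>
        if any(numel(reversals) == [2 4])    % shrink step size after
            stage = stage + 1;               % reversals 2 and 4
        end
    end
    lastDir = dir;
end
DL = exp(mean(log(reversals(end-5:end))));   % geometric mean of last 6
```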
After DLs were calculated, discrimination performance was measured using a method of constant stimuli, with the difference in the attended dimension (F0 or spectral centroid) set to five times the DL measured for each individual subject (5DL) and the unattended dimension either held constant or also varying by 5DL. The reason for multiplying the DL by 5 was threefold: (1) to allow subjects to perform near ceiling, confirming that they were attending to the correct dimension on each task; (2) to allow for the fact that the task was presented in the acoustically noisy MRI scanner environment (because the DLs were originally measured in silence); and (3) to ensure that the changes in pitch and timbre were approximately equally salient (Allen and Oxenham, 2014). Performance was based on 100 trials of each condition: pitch alone (PA) comparisons, in which only F0 varied; timbre alone (TA) comparisons, in which only the spectral centroid varied; both pitch and timbre varying, but with subjects attending only to pitch (PwT); and both pitch and timbre varying, but with subjects attending only to timbre (TwP). In all cases, the subjects were instructed to select the tone with either the higher pitch or the brighter timbre in the tone pair (Fig. 2).
Experimental design during scan.
The stimuli were presented at an overall level of approximately 75 dB SPL via MRI-compatible Sensimetrics S14 earphones with custom filters designed to compensate for the frequency response of the hardware. Other stimulus parameters were the same as those for prescanner behavioral testing. During each task scan, subjects completed 10 trials of each of the four discrimination tasks using differences that were set to be five times the DL for each subject. The direction of change for each dimension (up or down) was selected randomly and independently on each trial with equal a priori probability. These trials were evenly divided into eight blocks (with five trials in each block), two for each condition (Fig. 3). Stimulus onset during the silent period was randomly jittered to begin between 250 and 1000 ms after the previous image acquisition. Blocks were separated by rest periods of between 10.75 and 12.25 s (due to stimulus presentation jitter) to measure the baseline signal. Trials were presented during inter-acquisition intervals to reduce acoustic contamination from the scanner. Following each trial, subjects had 2 s to respond. Subjects' responses were collected via a button box. Stimuli were paired with boxes projected onto a screen and viewed through a mirror mounted on top of the head coil. The words "pitch," "timbre," "rest," and "end" appeared on the screen as task cues. The order of the conditions in a run was randomized and there was a total of 60 trials per condition across the six task scans. Feedback for correct and incorrect responses appeared in the form of happy and sad emoticons, respectively. Missed responses were followed by the presentation of an asterisk on the screen. Due to some technical difficulties with the button box, occasional subject responses were missed despite subjects having pressed the button during the allotted response window (mean number of missed responses across subjects: 1.9 of 240, SD: 2.1). These missed responses were not included in the calculation of task performance.
MRI.
The data were acquired on a 3 T Siemens Prisma MRI scanner. To minimize the contamination of the functional data with scanner noise, we used a pulse sequence with sparse temporal acquisition (Hall et al., 1999). The pulse sequence used slice-accelerated multiband (factor 2) echo planar imaging (EPI) (Xu et al., 2013) with a repetition time (TR) of 7 s (acquisition time of 2 s and an inter-acquisition silent interval of 5 s), providing a voxel resolution of 2 mm isotropic. Each functional volume had 48 slices angled upward to avoid the eyes in an effort to reduce eye movement artifacts while covering most of the brain. However, in many subjects, the posterior portion of the parietal cortex was not included. A total of six functional scans were acquired for each subject, each of which took approximately 6 min to complete and consisted of 48 volumes. To correct the spatial distortions from inhomogeneity in the B0 magnetic field, reversed phase-encode EPI volumes were also acquired. To localize functional activations, we additionally collected anatomical T1-weighted images that were coregistered with the EPI data.
Statistical analysis.
Data were preprocessed using the Analysis of Functional NeuroImages (AFNI) software package (Cox, 1996) and FSL 5.0.9 (http://fsl.fmrib.ox.ac.uk/). Statistical analyses and visualization were performed with AFNI. Preprocessing included skull stripping, slice time correction, distortion correction, six-parameter motion correction, spatial smoothing (4 mm FWHM Gaussian blur), and prewhitening.
For each subject, an event-related general linear model (GLM) analysis was performed that included regressors for each of the four experimental conditions, six motion parameters, and Legendre polynomials up to the fourth order to account for baseline drift (modeled separately for each run). Each subject's brain was transformed into Montreal Neurological Institute (MNI) space (Mazziotta et al., 1995). Beta weights (regression coefficients) for individual voxels were estimated by the GLM for each condition for each subject, as were contrasts comparing conditions within pitch, within timbre, between pitch and timbre, and a contrast comparing all sounds to baseline. Cortical surface-based visualization was done in AFNI's SUMA (SUrface MApping; https://afni.nimh.nih.gov/Suma). Group-level statistics were mapped onto the FreeSurfer MNI N27 brain surface.
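The structure of the model can be sketched as follows; the estimation itself was performed by AFNI's tools, so the variable names (Y, condRegs, motion) are hypothetical, and only the composition of the design matrix is intended to be informative.

```matlab
% Minimal sketch of the per-subject GLM for one run. Y: nTRs x nVoxels
% time series; condRegs: nTRs x 4 HRF-convolved condition regressors
% (PA, TA, PwT, TwP); motion: nTRs x 6 motion estimates. All are
% hypothetical placeholders for AFNI's internal representation.
x     = linspace(-1, 1, nTRs)';               % per-run time axis
drift = [ones(nTRs,1), x, (3*x.^2 - 1)/2, ... % Legendre polynomials
         (5*x.^3 - 3*x)/2, (35*x.^4 - 30*x.^2 + 3)/8];   % P0 to P4
X     = [condRegs, motion, drift];            % full design matrix
beta  = X \ Y;                                % least-squares beta weights
pitchVsTimbre = beta(1,:) - beta(2,:);        % e.g., PA-minus-TA contrast
```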
Group-level analyses with subject as a random effect included paired-sample t tests performed on the unmasked, unthresholded β weights for each contrast using the AFNI program 3dttest++. Voxels were thresholded at p < 0.01, uncorrected. Correction for multiple comparisons was achieved by determining the minimum significant cluster size. Taking into account increasing concerns over the risk of inflated false-positives with this method, as reported by Eklund et al. (2016), a nonparametric permutation test was used. This permutation test randomized the signs of the residuals of the model among subjects and then performed a t test, with these steps iterated 10,000 times to determine nearest-neighbor, faces-touching, two-sided cluster thresholds. This method, implemented within 3dttest++ using the "clustsim" option, determined the probability of clusters of a given size occurring by chance, given that each voxel had a 1% chance of displaying a false-positive. Based on these probabilities, clusters of a size that would occur by chance more than 5% of the time were filtered out of the results, achieving a cluster-level α = 0.05. The t tests were conducted within a gray matter mask containing anatomically defined auditory cortices and frontal lobe regions (Fig. 4). The mask, which was created on the cortical surface, was made up of the following gyri and sulci in the left and right hemispheres: superior temporal (including banks), Heschl's, supramarginal, precentral, superior frontal, middle frontal (caudal and rostral), inferior frontal (opercularis, triangularis, orbitalis), orbitofrontal (lateral and medial), and anterior cingulate (caudal and rostral), as well as the insulae, temporal poles, and frontal poles. These regions were defined by the Desikan–Killiany atlas (Desikan et al., 2006). Frontal regions were included in the mask because of the incorporation of an attentional task, which made activation in frontal regions seem plausible. In addition, in our previous study (Allen et al., 2018), which included an active task with subjects listening to different musical instrument sounds (i.e., natural variations in timbre), we found robust responses in several frontal regions.
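Conceptually, the permutation procedure can be sketched as follows. For a one-sample test against zero, sign-flipping the per-subject contrast values is equivalent to sign-flipping the residuals; clusterSizes is a hypothetical helper that groups suprathreshold voxels into nearest-neighbor clusters and returns their sizes, simplifying what 3dttest++ does internally.

```matlab
% Sketch of the sign-flip permutation used to derive the cluster-size
% threshold (a simplified stand-in for 3dttest++ with -Clustsim).
% betas: X x Y x Z x nSubj per-subject contrast values.
nPerm = 10000; nSubj = size(betas, 4);
maxClust = zeros(nPerm, 1);
thr = tinv(0.995, nSubj - 1);        % two-sided voxelwise p < 0.01
for p = 1:nPerm
    signs   = sign(rand(1, 1, 1, nSubj) - 0.5);  % random +/-1 per subject
    flipped = betas .* signs;                    % implicit expansion
    tmap = mean(flipped, 4) ./ (std(flipped, 0, 4) / sqrt(nSubj));
    sz   = clusterSizes(abs(tmap) > thr);        % hypothetical helper
    maxClust(p) = max([sz(:); 0]);
end
clustThresh = prctile(maxClust, 95); % cluster-level alpha = 0.05
```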
MVPA.
In addition to univariate analyses, we used MVPA, which has the advantage of being more sensitive to differences between conditions than the univariate approach because it examines differential patterns of activity across voxels rather than averaging across them (Norman et al., 2006) and may therefore reveal differences at the voxel level that are not apparent via standard univariate analyses. MVPA was performed using Princeton's MVPA toolbox for MATLAB with a backpropagation classifier algorithm (http://code.google.com/p/princeton-mvpa-toolbox/). A network with no hidden layers, implemented in MATLAB's Neural Network Toolbox, was used for classifier training and testing. To restrict the number of voxels in our analyses, we added a functionally defined mask based on our univariate analysis results. This mask contained the voxels that were most active during the auditory tasks (all sound conditions contrasted with the silence baseline), thresholded to the 2000 most active (positive) voxels across both hemispheres for each subject. This cutoff was chosen so that the number of voxels in each mask was the same across subjects and to limit the number of voxels used for classification (De Martino et al., 2008; Schindler et al., 2013).
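The voxel selection itself amounts to a simple ranking, sketched below under the assumption that tmap is a variable holding each voxel's all-sounds-versus-baseline statistic within the anatomical mask:

```matlab
% Keep the 2000 voxels with the largest positive sound-vs-silence
% statistics (tmap is a hypothetical per-voxel statistic map).
[~, order] = sort(tmap(:), 'descend');
mask = false(size(tmap));
mask(order(1:2000)) = true;          % functionally defined 2000-voxel mask
```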
The preprocessed data were detrended using a third-order polynomial via AFNI's processing pipeline. Trials whose time courses in the preprocessed data contained spikes exceeding 4 SDs from the mean of the run (excluding silence) were excluded from analysis. The remaining trials within each block were then averaged together, resulting in one value per block per voxel. The runs were z-scored and subjected to sixfold cross-validated MVPA, in which each training set consisted of five runs, totaling 40 patterns, whereas the remaining run (eight patterns) was used as the testing set (Fig. 5). Voxels that did not vary significantly in activation across the conditions, based on an ANOVA performed on the training set with a threshold of p < 0.05, were considered uninformative. In each cross-validation fold, these uninformative voxels were removed from the initial 2000-voxel mask. Classifier performance was computed for each fold separately and then averaged for a final prediction accuracy score.
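The cross-validation scheme can be sketched as follows; trainClassifier and testClassifier are hypothetical stand-ins for the backpropagation network provided by the toolbox, and the data layout is an assumption consistent with the description above.

```matlab
% Sketch of sixfold leave-one-run-out MVPA with per-fold ANOVA feature
% selection. patterns: 48 x nVoxels block-averaged, z-scored data
% (8 blocks x 6 runs); labels: condition index per pattern; runIdx:
% run index per pattern. trainClassifier/testClassifier are hypothetical.
acc = zeros(6, 1);
for fold = 1:6
    test  = (runIdx == fold);                 % 8 held-out patterns
    train = ~test;                            % 40 training patterns
    pvals = zeros(1, size(patterns, 2));
    for v = 1:size(patterns, 2)               % ANOVA on training set only
        pvals(v) = anova1(patterns(train, v), labels(train), 'off');
    end
    keep = pvals < 0.05;                      % drop uninformative voxels
    net  = trainClassifier(patterns(train, keep), labels(train));
    pred = testClassifier(net, patterns(test, keep));
    acc(fold) = mean(pred == labels(test));
end
finalAccuracy = mean(acc);                    % averaged across folds
```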
Results
Prescan behavioral task performance
The geometric mean F0 DL across subjects was 0.50% [95% confidence interval (CI) = 0.3–0.9] and the average spectral-centroid DL was 2.7% (95% CI = 2.1–3.6), in good agreement with musicians' DLs from an earlier study using similar stimuli (Allen and Oxenham, 2014). This agreement was expected because most of the current subjects would be classified as musicians, having had >8 years of musical training. As anticipated, performance on the constant-stimuli task using the scaled 5DL variation was high, with a mean proportion of correct responses of 97.0% (SD: 3.7%) across all conditions. Average performance in each condition is reported in Table 1. Due to the near-ceiling performance in these tasks, a nonparametric Friedman test was used on the four conditions, which indicated a significant effect of condition (χ2 = 11.7, p < 0.008). We then ran Wilcoxon signed-rank tests to compare conditions. As reported in Table 1, after a Bonferroni correction for multiple comparisons with an adjusted α of 0.008, no significant difference in performance between the pitch and timbre conditions was found. This was true when comparing the PA and TA conditions as well as the PwT and TwP conditions. There was, however, a significant difference between the both-alone (BA; PA and TA) and both-varying (BV; PwT and TwP) conditions, suggesting that the conditions in which both dimensions varied were more challenging than the conditions in which only one dimension varied. However, this difference was only marginally significant between PA and PwT after correction and was not significant between TA and TwP.
Behavioral task performance during scanning
Similar to performance in the prescanner session, performance in the scanner was near ceiling (mean: 99.3%, SD: 1.8). Average performance within conditions is reported in Table 2. This time, a nonparametric Friedman test did not reveal a significant effect of condition (χ2 = 7.77, p = 0.051).
Group-level analysis
For all conditions, we found robust activation of the auditory cortices when comparing sound stimulation with baseline silence (Fig. 6). However, when comparing activation between pairs of conditions that involved sound stimulation (P-T, BA-BV, PA-TA, and PwT-TwP), no significant voxels remained after thresholding.
MVPA results
As shown in Figure 7, average four-way classifier performance for distinguishing among the PA, TA, PwT, and TwP conditions was 74.4% (SD = 9.7 percentage points), which was significantly above chance (25%) based on a one-tailed t test (t(9) = 16.1, p < 0.0001). Average classifier performance for pitch conditions versus timbre conditions was 82.1% (SD = 6.5), which was also significantly above chance (50%) (t(9) = 15.5, p < 0.0001). Average classifier performance for the both-alone (BA) conditions versus the both-varying (BV) conditions was 80.0% (SD = 7.0), which, again, was significantly above chance (50%) (t(9) = 13.5, p < 0.0001).
Exploratory MVPA and univariate analyses
We also performed MVPA on subsets of the conditions: PA versus TA and PwT versus TwP (Fig. 8, left), as well as PA versus PwT and TA versus TwP. In all cases, classifier performance was significantly above chance. Additionally, we tested how well the MVPA classifier performed when trained on the BA conditions (PA vs TA) but tested on the BV conditions (PwT vs TwP) and vice versa (Fig. 8, right). In neither case was average performance across subjects significantly above chance. All results are reported in Table 3.
We analyzed the errors in classifier performance for the four-way classifier to determine whether, despite its high performance, there were any conditions that were consistently confused with one another (e.g., when the correct condition was PA, was it classified more often as PwT than TA or TwP?). Results are shown in Figure 9. It is difficult to draw any strong conclusions because there are no obvious indications of the classifier consistently confusing one condition for another.
Next, as a sanity check, we randomized the condition labels to determine whether classifier performance would drop to chance, as would be expected. Indeed, with random labels, no classifier performed significantly above chance (Table 3).
Finally, we reran the MVPA using two additional masks. The first mask was an intersection of our original 2000-voxel mask for each subject and anatomically defined auditory regions; the second mask was an intersection between the same 2000-voxel mask and the anatomically defined frontal regions. This separation of the 2000-voxel masks into the frontal versus auditory regions was done to test whether the high MVPA classifier performance was primarily driven by voxels in a particular region. The number of voxels included in each mask for each subject is shown in Table 4. MVPA performance within these masks is shown in Figure 10 and Table 5.
When using the 2000-voxel mask intersected with auditory regions, the four-way classifier achieved an average performance of 70.4% (SD = 8.3), which, again, was significantly above chance (t(9) = 17.3, p < 0.0001). When using the 2000-voxel mask intersected with frontal regions, the four-way classifier achieved an average performance of 61.9% (SD = 15.4), which was also significantly above chance (t(9) = 7.6, p < 0.0001). The difference between the two masks was significant based on a paired-samples t test (t(9) = 2.28, p = 0.048), suggesting that activity in auditory regions was somewhat more informative than activity in frontal regions in distinguishing between stimuli and tasks.
Discussion
The present study aimed to determine whether task-related attention to one dimension when listening to sounds varying in pitch and/or brightness would lead to more spatially distinct representations of pitch and timbre. Univariate analyses suggested that this was not the case, either at the group level or for the majority of subjects at the individual level. No significant differences in activation were observed between the pitch-varying and timbre-varying conditions regardless of whether variations in the other dimension were present. Therefore, it seems that the spatial overlap between representations of pitch and brightness is observed under both active and passive (Allen et al., 2017) listening conditions. This similarity in the spatial distribution of cortical processing of the two dimensions may not be entirely surprising given the perceptual interactions and dependencies that have been identified between these dimensions (Melara and Marks, 1990; Borchert et al., 2011; Allen and Oxenham, 2014).
Although the univariate analyses could not differentiate between condition pairs, high MVPA classification performance was obtained with the four-way classifier (PA vs PwT vs TA vs TwP), as well as the two-way classifiers tested (BA vs BV, pitch vs timbre, PA vs TA, and PwT vs TwP). Despite there being no significant overall spatial differences between pitch and timbre representation, it is clear that the patterns of activation within these regions are distinct. This outcome is consistent with earlier findings obtained under passive listening conditions (Allen et al., 2017) and extends them by showing that pattern classification remains possible both in the presence and absence of an attentional task. The only classifiers that did not exceed chance performance on average were those trained on PA and TA and tested on PwT and TwP, or vice versa. It is possible that performance suffered with these two classifiers because the representations of pitch and timbre interact in a nonlinear fashion (e.g., via divisive normalization) and these nonlinearities are stronger in the conditions in which both dimensions are varying. As a result, features that aid classification when only one dimension is varying may not be helpful when both dimensions are varying. Therefore, when both types of stimuli are not present in the training dataset, the classifier may struggle to generalize.
Particularly striking is the fact that, in the absence of physical differences in the stimuli and without significant differences in the behavioral performance of the subjects, the classifier performed well above chance for identifying the pitch and timbre conditions, successfully distinguishing between the PwT and TwP conditions. These results suggest that representations within auditory cortex may reflect the attention to particular sound features within a single auditory object. In this way, the outcomes extend earlier work using EEG, MEG, and ECoG, which has shown that attention to entire auditory objects or streams (e.g., one voice in the presence of another) can profoundly alter cortical activity (Hillyard et al., 1973; Ding and Simon, 2012a; Zion Golumbic et al., 2013). There is evidence to suggest that such feature-based attentional modulation also exists in visual cortex for visual objects (Saenz et al., 2002).
It is important to note that timbre is well established as a multidimensional attribute (Grey, 1977; Elliott et al., 2013; Allen et al., 2018). The present study manipulated timbre along just one dimension by shifting the spectral centroid, which affects a sound's perceived "brightness." Therefore, we are unable to draw conclusions regarding other manipulations of timbre, such as temporal modulations. It may be that cortical representations of the spectral properties of timbre, such as spectral centroid, are more similar to the cortical representations of F0 than are representations of temporal properties, such as variations in attack and decay.
It is also worth noting that average performance in the behavioral tasks was very high, exceeding 90% correct in all conditions. Although this provides a strong indication that subjects attended to the correct dimension, it may also imply that the attentional demands of the task were not high. A possible follow-up study might involve increasing the task difficulty to determine whether spatial differences would emerge that were not found in the present study. However, in such a design it is important that task difficulty be equated across dimensions, so that any differences that emerge cannot simply be explained by the task in one dimension being more challenging than the other.
Another future direction would be to use encoding models to explicitly characterize pitch and timbre selectivity throughout the auditory cortex and thus explore in more depth how these populations are modulated by attention. These approaches have been successfully used to characterize suppressive stimulus interactions in the visual system (Brouwer and Heeger, 2011) and the somatosensory system (Brouwer et al., 2015) and could be similarly useful in understanding the interactions between representations of pitch and timbre.
Overall, these results show that attending to either pitch or brightness does not result in a spatial separation of their representations that is detectable via conventional univariate analyses. Nevertheless, the patterns of activation produced when attending to pitch or timbre features within a sound appear to be distinct even when the stimuli remain physically identical. This finding reveals that activation in both auditory and frontal cortical regions can reflect attention to specific sound features within single auditory objects.
Footnotes
This work was supported by the National Institutes of Health (Grant R01 DC005216), the High Performance Connectome Upgrade for Human 3T MR Scanner (Grant 1S10OD017974-01), and the Brain Imaging Initiative of the College of Liberal Arts, University of Minnesota. We thank Andrea Grant, Anahita Mehta, and Jordan Beim for helpful assistance.
The authors declare no competing financial interests.
Correspondence should be addressed to Emily J. Allen at prac0010@umn.edu