Musicians have lifelong experience parsing melodies from background harmonies, which can be considered a process analogous to speech perception in noise. To investigate the effect of musical experience on the neural representation of speech-in-noise, we compared subcortical neurophysiological responses to speech in quiet and noise in a group of highly trained musicians and nonmusician controls. Musicians were found to have a more robust subcortical representation of the acoustic stimulus in the presence of noise. Specifically, musicians demonstrated faster neural timing, enhanced representation of speech harmonics, and less degraded response morphology in noise. Neural measures were associated with better behavioral performance on the Hearing in Noise Test (HINT) for which musicians outperformed the nonmusician controls. These findings suggest that musical experience limits the negative effects of competing background noise, thereby providing the first biological evidence for musicians' perceptual advantage for speech-in-noise.
Musical performance is one of the most complex and cognitively demanding tasks that humans undertake (Parsons et al., 2005). By the age of 21, professional musicians have spent ∼10,000 h practicing their instruments (Ericsson et al., 1993). This long-term sensory exposure may account for their enhanced auditory perceptual skills (Micheyl et al., 2006; Rammsayer and Altenmuller, 2006) as well as functional and structural adaptations seen at subcortical and cortical levels for speech and music (Pantev et al., 2003; Peretz and Zatorre, 2003; Trainor et al., 2003; Shahin et al., 2004; Besson et al., 2007; Musacchia et al., 2007; Wong et al., 2007; Lee et al., 2009; Strait et al., 2009). One critical aspect of musicianship is the ability to parse concurrently presented instruments or voices. Given this, we hypothesized that a musician's lifelong experience with musical stream segregation would transfer to its linguistic homolog, speech-in-noise (SIN) perception. To test this, we recorded speech-evoked auditory brainstem responses (ABRs) in both quiet and noise and tested SIN perception in a group of musicians and nonmusicians.
Speech perception in noise is a complex task requiring the segregation of the target signal from competing background noise. This task is further complicated by the degradation of the acoustic signal, with noise particularly disrupting the perception of fast spectrotemporal features (e.g., stop consonants) (Brandt and Rosen, 1980). Whereas children with language-based learning disabilities (Bradlow et al., 2003; Ziegler et al., 2005) and hearing-impaired adults (Gordon-Salant and Fitzgibbons, 1995) are especially susceptible to the negative effects of background noise, musicians are less affected and demonstrate better performance for SIN when compared with nonmusicians (Parbery-Clark et al., 2009).
Recent work points to a relationship between brainstem timing and SIN perception (Hornickel et al., 2009). The ABR, which reflects the activity of subcortical nuclei (Jewett et al., 1970; Lev and Sohmer, 1972; Smith et al., 1975; Chandrasekaran and Kraus, 2009), is widely used to assess the integrity of auditory function (Hall, 1992). The speech-evoked ABR represents the neural encoding of stimulus features with considerable fidelity (Kraus and Nicol, 2005). Nonetheless, the ABR is not hardwired; rather, it is experience dependent and varies with musical and linguistic experience (Krishnan et al., 2005; Song et al., 2008; for review, see Tzounopoulos and Kraus, 2009). Compared with nonmusicians, musicians exhibit enhanced subcortical encoding of sounds with both faster responses and greater frequency encoding. These enhancements are not simple gain effects. Rather, musical experience selectively strengthens the underlying neural representation of sounds reflecting the interaction between cognitive and sensory factors (Kraus et al., 2009), with musicians demonstrating better encoding of complex stimuli (Wong et al., 2007; Strait et al., 2009) as well as behaviorally relevant acoustic features (Lee et al., 2009). We hypothesized that, despite the well documented disruptive effects of noise (Don and Eggermont, 1978; Cunningham et al., 2001; Russo et al., 2004), musicians have enhanced encoding of the noise-vulnerable temporal stimulus events (onset and consonant–vowel formant transition) and increased neural synchrony in the presence of background noise resulting in a more precise temporal and spectral representation of the signal.
Materials and Methods
Sixteen musicians (10 females) and 15 nonmusicians (9 females) participated in this study. Participants' ages ranged from 19 to 30 years (mean age, 23 ± 3 years). Participants categorized as musicians started instrumental training before the age of 7 and practiced consistently for at least 10 years before enrolling in the study. Nonmusicians were required to have had <3 years of musical training, which must have occurred >7 years before their participation in the study. All participants were right-handed and had normal hearing thresholds (≤20 dB from 125 to 8000 Hz), no conductive hearing loss, and normal ABRs to a click and speech syllable as measured by BioMARK (Biological Marker of Auditory Processing) (Natus Medical). No participant reported any cognitive or neurological deficits.
The speech syllable /da/ was a 170 ms six-formant speech sound synthesized using a Klatt synthesizer (Klatt, 1980) at a 20 kHz sampling rate. Except for the initial 5 ms stop burst, this syllable is voiced throughout with a steady fundamental frequency (f0 = 100 Hz). This consonant–vowel syllable is characterized by a 50 ms formant transition (transition between /d/ and /a/) followed by a 120 ms steady-state (unchanging formants) portion corresponding to /a/ (see Fig. 2A). During the formant transition period, the first formant rises linearly from 400 to 720 Hz, and the second and third formants fall linearly from 1700 to 1240 Hz and 2580 to 2500 Hz, respectively. The fourth, fifth, and sixth formants remain constant at 3330, 3750, and 4900 Hz for the entire syllable.
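The formant trajectories described above can be written as a piecewise-linear function of time. The following is only an illustrative sketch of those trajectories, not the Klatt synthesizer itself. The names `FORMANTS_HZ` and `formant_hz` are hypothetical, and the sketch assumes that the 50 ms transition starts at 0 ms.

```python
# Start and end frequencies (Hz) of each formant, taken from the description above.
# F4-F6 are constant, so their start and end values are equal.
FORMANTS_HZ = {
    "F1": (400.0, 720.0),
    "F2": (1700.0, 1240.0),
    "F3": (2580.0, 2500.0),
    "F4": (3330.0, 3330.0),
    "F5": (3750.0, 3750.0),
    "F6": (4900.0, 4900.0),
}
TRANSITION_MS = 50.0  # assumed to begin at stimulus onset (0 ms)


def formant_hz(name, t_ms):
    """Formant frequency at t_ms: linear during the transition, constant afterwards."""
    start, end = FORMANTS_HZ[name]
    frac = min(max(t_ms / TRANSITION_MS, 0.0), 1.0)
    return start + frac * (end - start)


# Frequencies halfway through the transition (25 ms).
midpoint = {name: formant_hz(name, 25.0) for name in ("F1", "F2", "F3")}
```

Under these assumptions, F1 rises while F2 and F3 fall during the first 50 ms, and all six formants are flat during the 120 ms steady state.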
The background noise consisted of multitalker babble created by the superimposition of grammatically correct but semantically anomalous sentences spoken by six different speakers (two males and four females). These sentences were recorded for a previous experiment and the specific recording parameters can be found in Smiljanic and Bradlow (2005). The noise file was 45 s in duration.
The speech syllable /da/ was presented in alternating polarities at 80 dB sound pressure level (SPL) binaurally with an interstimulus interval of 83 ms (Neuro Scan Stim 2; Compumedics) through insert earphones (ER-3; Etymotic Research). In the noise condition, both the /da/ and the multitalker babble were presented simultaneously to both ears. The /da/ was presented at a +10 dB signal-to-noise ratio over the background babble, which was looped for the duration of the condition. The responses to two background conditions, quiet and noise, were collected using the NeuroScan Acquire 4.3 recording system (Compumedics) with four Ag–AgCl scalp electrodes. Responses were differentially recorded at a 20 kHz sampling rate with a vertical montage (Cz active, forehead ground, and linked-earlobe references), an optimal montage for recording brainstem activity (Galbraith et al., 1995; Chandrasekaran and Kraus, 2009). Contact impedance was 2 kΩ or less between electrodes. Six thousand artifact-free sweeps were recorded for each condition, with each condition lasting between 23 and 25 min. Participants watched a silent, captioned movie of their choice to facilitate a wakeful, yet still state for the recording session.
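As a rough illustration of what presenting a target at a fixed signal-to-noise ratio involves, the sketch below scales a noise track so that the target exceeds it by a chosen number of decibels. It assumes that the ratio is defined on root-mean-square (RMS) levels in dB, which the text does not state. The function names are hypothetical, and the random noise is only a stand-in for the multitalker babble.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def measure_snr_db(signal, noise):
    """Signal-to-noise ratio in dB, based on RMS levels (an assumption)."""
    return 20 * math.log10(rms(signal) / rms(noise))


def scale_noise_to_snr(signal, noise, target_snr_db):
    """Return a scaled copy of `noise` that sits target_snr_db below `signal`."""
    gain = rms(signal) / (rms(noise) * 10 ** (target_snr_db / 20))
    return [gain * v for v in noise]


fs = 20000  # Hz, matching the sampling rate given in the text
signal = [math.sin(2 * math.pi * 100 * i / fs) for i in range(fs // 10)]
rng = random.Random(0)  # fixed seed so the example is reproducible
babble_stand_in = [rng.uniform(-1, 1) for _ in range(fs // 10)]
scaled = scale_noise_to_snr(signal, babble_stand_in, 10.0)
```

After scaling, `measure_snr_db(signal, scaled)` returns 10 dB to within floating-point error.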
To limit the inclusion of low-frequency cortical activity, brainstem responses were off-line bandpass filtered from 70 to 2000 Hz (12 dB/octave, zero phase-shift) using NeuroScan Edit 4.3. The filtered recordings were then epoched using a −40 to 213 ms time window with the stimulus onset occurring at 0 ms. Any sweep with activity greater than ±35 μV was considered artifact and rejected. The responses to the two polarities were added together to minimize the presence of the cochlear microphonic and stimulus artifact on the neural response (Gorga et al., 1985; Aiken and Picton, 2008). Last, responses were amplitude-baselined to the prestimulus period.
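The post-filtering steps above (artifact rejection, combination of the two polarities, and baselining) can be sketched as follows. This is a simplified illustration and not the NeuroScan Edit procedure: the bandpass filter is omitted, sweeps are assumed to be already epoched lists of microvolt values, the function names are hypothetical, and the two polarity averages are averaged, which equals their sum up to a factor of two.

```python
def reject_artifacts(sweeps, limit_uv=35.0):
    """Keep only sweeps whose absolute amplitude never exceeds limit_uv."""
    return [s for s in sweeps if max(abs(v) for v in s) <= limit_uv]


def average(sweeps):
    """Point-by-point mean across equally long sweeps."""
    n = len(sweeps)
    return [sum(column) / n for column in zip(*sweeps)]


def combine_polarities(avg_pos, avg_neg):
    """Average the two polarity responses (their sum, scaled by one half)."""
    return [(p + q) / 2 for p, q in zip(avg_pos, avg_neg)]


def baseline(response, n_prestim):
    """Subtract the mean of the first n_prestim (prestimulus) samples."""
    offset = sum(response[:n_prestim]) / n_prestim
    return [v - offset for v in response]


# Toy epochs with 2 prestimulus samples; the third positive sweep exceeds 35 uV.
pos_sweeps = [[1.0, 1.0, 3.0, 5.0], [1.0, 1.0, 5.0, 7.0], [0.0, 0.0, 40.0, 0.0]]
neg_sweeps = [[1.0, 1.0, -1.0, 1.0]]

avg_pos = average(reject_artifacts(pos_sweeps))
avg_neg = average(reject_artifacts(neg_sweeps))
final_response = baseline(combine_polarities(avg_pos, avg_neg), n_prestim=2)
```

With the recording parameters in the text (20 kHz sampling, −40 to 213 ms epochs), the prestimulus period would correspond to the first 800 samples of each epoch.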
Timing and amplitude of onset and transition peaks.
The neural response to the onset of sound (“onset peak”) and the formant transition (“transition peak”) are represented by large positive peaks occurring between 9–11 and 43–45 ms poststimulus onset (0 ms), respectively. These peaks were independently identified using NeuroScan Edit 4.3 (Compumedics) by the primary author and a second peak picker, who was blind to the participants' group. In the case of disagreement with peak identification, the advice of a third peak picker was sought. All participants had distinct onset peaks in the quiet condition, but three participants (two nonmusicians and one musician) had nonobservable onset peaks in the noise condition. Statistical analyses for onset peak latency and amplitude only included those participants who had clearly discriminable peaks in both quiet and noise (n = 28). The transition peak was the most reliable peak in the transition response for both the quiet and noise condition and was clearly identifiable in all participants (n = 31).
To measure the effect of noise on the response morphology, the degree of correlation between each participant's response in quiet and in noise was calculated. Correlation coefficients were calculated by shifting, in the time domain, the response waveform in noise relative to the response waveform in quiet until a maximum correlation was found. This calculation resulted in a Pearson's r value, with smaller values indicating a more degraded response. Because the presence of noise typically causes the response to be delayed, the response in noise was shifted in time by up to 2 ms, and the maximum correlation over the 0–2 ms shift was recorded. The response time region used for this analysis was from 5 to 180 ms, which encompassed the complete neural response (onset, transition, and steady state).
To gauge the effect of noise on the neural response to the steady-state vowel, the stimulus and response waveforms were compared via cross-correlation. The degree of similarity was calculated by shifting the stimulus waveform in time by 8–12 ms relative to the response, until a maximum correlation was found between the stimulus and the region of the response corresponding to the vowel. This time lag (8–12 ms) was chosen because it encompassed the stimulus transmission delay (from the ER-3 transducer and ear insert ∼1.1 ms) and the neural lag between the cochlea and the rostral brainstem. This calculation resulted in a Pearson's r value for both the quiet and noise conditions.
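Both correlation measures described above follow the same pattern: one waveform is shifted over a range of lags, and the maximum Pearson's r is kept. The sketch below illustrates this under stated assumptions. It is not the authors' Matlab routine, the function names are hypothetical, both waveforms are assumed to be sampled at 20 kHz and indexed from stimulus onset (0 ms), and the 50–170 ms stimulus window for the vowel is inferred from the stimulus durations given earlier.

```python
import math

FS_HZ = 20000  # sampling rate shared by stimulus and responses


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ms_to_samples(ms):
    return round(ms * FS_HZ / 1000)


def max_lagged_r(reference, later, start_ms, stop_ms, min_lag_ms, max_lag_ms):
    """Correlate reference[start:stop] with `later` delayed by each lag in the
    range and return (best_r, best_lag_ms)."""
    start, stop = ms_to_samples(start_ms), ms_to_samples(stop_ms)
    ref = reference[start:stop]
    best_r, best_lag = -2.0, None
    for lag in range(ms_to_samples(min_lag_ms), ms_to_samples(max_lag_ms) + 1):
        r = pearson_r(ref, later[start + lag:stop + lag])
        if r > best_r:
            best_r, best_lag = r, lag
    return best_r, best_lag * 1000 / FS_HZ


def toy_wave(n):
    """Synthetic 100 Hz waveform with a third harmonic (a stand-in, not real data)."""
    return [math.sin(2 * math.pi * 100 * i / FS_HZ)
            + 0.5 * math.sin(2 * math.pi * 300 * i / FS_HZ) for i in range(n)]


def delayed(x, lag_ms):
    """Delay a waveform by lag_ms, padding the start with zeros."""
    k = ms_to_samples(lag_ms)
    return [0.0] * k + x[:len(x) - k]


wave = toy_wave(ms_to_samples(220))
# Quiet-to-noise: 5-180 ms window, response in noise shifted by 0-2 ms.
qn_r, qn_lag = max_lagged_r(wave, delayed(wave, 0.5), 5, 180, 0, 2)
# Stimulus-to-response: vowel window, response lagging the stimulus by 8-12 ms.
sr_r, sr_lag = max_lagged_r(wave, delayed(wave, 10), 50, 170, 8, 12)
```

In this toy case the delayed copies are recovered at their true lags (0.5 ms and 10 ms) with r close to 1. Real responses would yield lower values, with smaller r indicating a more degraded response.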
To assess the impact of background noise on the neural encoding of the stimulus spectrum, a fast Fourier transform was performed on the steady-state portion of the response (60–180 ms). From the resulting amplitude spectrum, average spectral amplitudes of specific frequency bins were calculated. Each bin was 60 Hz wide and centered on the stimulus f0 (100 Hz) and the subsequent harmonics H2–H10 (200–1000 Hz; whole-integer multiples of the f0). To create a composite score representing the strength of the overall harmonic encoding, the average amplitudes of the H2 to H10 bins were summed.
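The bin-averaging step can be illustrated with a short sketch. To keep the example free of dependencies, it evaluates a direct discrete Fourier transform only at the frequencies of interest instead of running a full fast Fourier transform; the amplitudes at those frequencies are the same. The amplitude scaling (2|X|/N) and all function names are assumptions made for illustration.

```python
import cmath
import math


def dft_amplitude(x, k):
    """One-sided amplitude of DFT bin k (k cycles per record) of a real signal."""
    n = len(x)
    w = -2j * math.pi * k / n
    acc = sum(v * cmath.exp(w * i) for i, v in enumerate(x))
    return 2 * abs(acc) / n


def band_mean_amplitude(x, fs, center_hz, width_hz=60.0):
    """Mean spectral amplitude over a band of width_hz centered on center_hz."""
    df = fs / len(x)  # frequency resolution of the record
    k_lo = math.ceil((center_hz - width_hz / 2) / df)
    k_hi = math.floor((center_hz + width_hz / 2) / df)
    amps = [dft_amplitude(x, k) for k in range(k_lo, k_hi + 1)]
    return sum(amps) / len(amps)


def harmonic_composite(x, fs, f0=100.0, harmonics=range(2, 11)):
    """Sum of the band-averaged amplitudes of H2-H10."""
    return sum(band_mean_amplitude(x, fs, h * f0) for h in harmonics)


fs = 20000
n = 2400  # 120 ms at 20 kHz, the length of the 60-180 ms analysis window
toy = [math.sin(2 * math.pi * 100 * i / fs)
       + 0.5 * math.sin(2 * math.pi * 200 * i / fs) for i in range(n)]

f0_amp = band_mean_amplitude(toy, fs, 100.0)
composite = harmonic_composite(toy, fs)
```

For this synthetic signal each 60 Hz band spans seven DFT bins, so a single spectral line of amplitude 1.0 at 100 Hz averages to about 0.143 and the composite reflects only the 200 Hz component. Real responses would spread energy across bins.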
SIN perceptual measures
The behavioral data used for this study are reported in Parbery-Clark et al. (2009) and are used here for correlative purposes with the brainstem measures. Two commonly used clinical tests for speech-in-noise were administered: Hearing in Noise Test (HINT) and Quick Speech-in-Noise Test (QuickSIN).
Hearing in Noise Test.
Hearing in Noise Test (HINT; Bio-logic Systems) (Nilsson et al., 1994) is an adaptive test of speech recognition that measures speech perception ability in speech-shaped white noise. The full test administration protocol is described by Parbery-Clark et al. (2009). For the purpose of this study, we restricted our analyses to only include the condition in which the speech and the noise originated from the same location because it most closely mirrored the stimulus presentation setup for the electrophysiological recordings. During HINT, participants repeated short semantically and syntactically simple sentences spoken by a male (e.g., “She stood near the window”) presented in a speech-shaped noise background. Participants sat 1 m in front of the speaker from which the target sentences and the background noise were delivered. The noise presentation level was fixed at 65 dB SPL with the target sentence intensity level increasing or decreasing depending on performance. A final threshold signal-to-noise ratio (SNR)—defined as the difference in decibels between the speech and noise presentation levels for which 50% of sentences are correctly repeated—was calculated with a negative threshold SNR indicating better performance on the task.
Quick Speech-in-Noise Test.
Quick Speech-in-Noise Test (QuickSIN; Etymotic Research) (Killion et al., 2004) is a nonadaptive test of speech perception in four-talker babble that is presented binaurally through insert earphones (ER-2; Etymotic Research). Sentences were presented at 70 dB SPL, with the first sentence starting at a SNR of 25 dB and with each subsequent sentence being presented with a 5 dB SNR reduction down to 0 dB SNR. The sentences, which are spoken by a female, are syntactically correct yet have minimal semantic or contextual cues (Wilson et al., 2007). Participants repeated each sentence (e.g., “The square peg will settle in the round hole”), and their SNR score was based on the number of correctly repeated key words (underlined). For each participant, four lists were selected, with each list consisting of six sentences with five target words per sentence. Each participant's final score, termed “SNR loss,” was calculated as the average score across each of the four administered lists. A more negative SNR loss is indicative of better performance on the task. For more information about SNR loss and its calculation, see the study by Killion et al. (2004).
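For orientation, the sketch below shows how a per-list score and the four-list average could be computed. It assumes the scoring rule commonly given for QuickSIN (SNR loss = 25.5 minus the total number of key words correctly repeated in a list of 30); the text defers to Killion et al. (2004) for the exact calculation, so this rule should be checked against that source. The function names and the example responses are hypothetical.

```python
def snr_loss_for_list(words_correct_per_sentence):
    """SNR loss (dB) for one six-sentence list, under the assumed 25.5 - total rule."""
    return 25.5 - sum(words_correct_per_sentence)


def mean_snr_loss(lists):
    """Average SNR loss across the administered lists."""
    return sum(snr_loss_for_list(lst) for lst in lists) / len(lists)


# Hypothetical key-word counts (0-5) for sentences presented at 25 dB down to 0 dB SNR.
example_lists = [
    [5, 5, 5, 5, 4, 2],
    [5, 5, 5, 4, 4, 3],
    [5, 5, 5, 5, 3, 2],
    [5, 5, 5, 5, 5, 1],
]
example_score = mean_snr_loss(example_lists)
```

Here the four lists give losses of −0.5, −0.5, 0.5, and −0.5 dB, for a final score of −0.25 dB.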
Analytical and statistical methods
Correlations [quiet-to-noise and stimulus-to-response (SR)] and fast Fourier transforms were conducted with Matlab 7.5.0 routines (Mathworks).
All statistical analyses were conducted with SPSS. For all between- and within-group comparisons, a mixed-model repeated-measures ANOVA was conducted, followed by planned post hoc tests when appropriate. Assumptions of normality, linearity, outliers, and multicollinearity were met for all analyses. In the case of group comparisons on a single variable, one-way ANOVAs were used. To investigate the relationship between frequency encoding and the stimulus-to-response correlation, a series of Pearson's r correlations using all subjects, regardless of group, were used. Bonferroni's corrections were applied when required.
Musicians exhibited more robust speech-evoked auditory brainstem responses in background noise (Fig. 1). Musicians had earlier response onset timing, as well as greater phase-locking to the temporal waveform and stimulus harmonics, than nonmusicians. We also found that earlier response timing and more robust brainstem responses to speech in background noise were both related to better speech-in-noise perception as measured through HINT.
Brainstem response: quiet-to-noise correlations
Musicians' brainstem responses in noise were more similar to their own responses in quiet than was the case for nonmusicians (one-way ANOVA: F(1,30) = 6.082, p = 0.02; musicians: mean = 0.79, σ = 0.07; nonmusicians: mean = 0.7, σ = 0.14). This suggests that the addition of background noise does not degrade the musician brainstem response to speech, relative to their response in quiet, to the same degree as in nonmusicians.
Brainstem response: timing of onset and transition peaks
Musical experience was found to limit the degradative effect of background noise on the peaks in the brainstem response corresponding to important temporal events in the stimulus. Typically, the addition of background noise delays the timing of the brainstem response, yet musicians exhibited smaller delays in timing than nonmusicians in noise. To investigate the effect of noise on the timing of the brainstem response to speech, we looked at the latencies of the onset peak (response to the onset of the stimulus, 9–11 ms) and the transition peak (response to the formant transition, 43–45 ms) for both the quiet and noise conditions. A mixed-model repeated-measures ANOVA, with group (musician/nonmusician) and background condition (quiet/noise) as the independent variables, and onset peak latency as the dependent variable, was performed. There was a significant main effect for background (F(1,26) = 307.841, p < 0.0005), with noise resulting in delayed onset peaks, and a trend for group (F(1,26) = 3.219, p = 0.084) with the musicians having earlier latencies in noise. There was also a significant interaction between group and background (F(1,26) = 4.936, p = 0.035). Independent-samples t tests for each background condition revealed that the two groups had equivalent onset peak latencies in quiet (t(26) = 0.001, p = 0.976; musicians: mean = 8.98 ms, σ = 0.38; nonmusicians: mean = 8.98 ms, σ = 0.21) but that, in noise, musicians had significantly earlier onset responses (t(26) = 14.889, p = 0.001; musicians: mean = 9.99 ms, σ = 0.15; nonmusicians: mean = 10.24 ms, σ = 0.13) (Fig. 2B). A similar relationship was found for the transition peak with significant main effects for both background and group, with the peak latencies being later in noise than in quiet (F(1,29) = 34.173, p < 0.0005) and musicians having earlier transition latencies (F(1,29) = 8.937, p = 0.006). There was a significant interaction between group and background (F(1,29) = 4.57, p = 0.041). 
Again, post hoc comparisons revealed that the groups were not significantly different in quiet (t(29) = 1.43, p = 0.242; musicians: mean = 43.03 ms, σ = 0.41; nonmusicians: mean = 43.22 ms, σ = 0.46), but musicians had earlier responses in the presence of noise (t(29) = 15.2, p = 0.001; musicians: mean = 43.34 ms, σ = 0.35; nonmusicians: mean = 43.90 ms, σ = 0.43) (Fig. 2C). Thus, for both groups, the addition of noise resulted in a delay in brainstem timing, but the latency shifts were smaller for the musicians, suggesting that their responses were less susceptible to the degradative effects of the background noise.
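The size of the noise-induced delay in each group can be read directly from the reported group means. The short sketch below does that arithmetic. It uses only the mean latencies given in the text, so the differences are between group means rather than participant-level shifts, and the names are hypothetical.

```python
# Reported group mean peak latencies in ms: (quiet, noise).
REPORTED_MEANS_MS = {
    ("onset", "musicians"): (8.98, 9.99),
    ("onset", "nonmusicians"): (8.98, 10.24),
    ("transition", "musicians"): (43.03, 43.34),
    ("transition", "nonmusicians"): (43.22, 43.90),
}


def noise_delay_ms(quiet_ms, noise_ms):
    """Latency increase from quiet to noise, rounded to the reported precision."""
    return round(noise_ms - quiet_ms, 2)


delays = {key: noise_delay_ms(*means) for key, means in REPORTED_MEANS_MS.items()}
```

This gives onset delays of 1.01 ms for musicians and 1.26 ms for nonmusicians, and transition delays of 0.31 ms and 0.68 ms, respectively.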
Brainstem response: amplitude of onset and transition peaks
Amplitudes of onset responses are known to be variable (Starr and Don, 1988; Hood, 1998), and previous research has found that when a speech stimulus is presented in noise the onset response is greatly reduced or eliminated (Russo et al., 2004). Consistent with these findings, the addition of background noise significantly reduced the onset peak amplitude for both groups with a significant main effect for background (F(1,26) = 179.715, p < 0.0005; quiet: musicians: mean = 0.448 μV, σ = 0.112; nonmusicians: mean = 0.390 μV, σ = 0.117; noise: musicians: mean = 0.219 μV, σ = 0.796; nonmusicians: mean = 0.172 μV, σ = 0.749), and in the case of three participants (two nonmusicians and one musician) the onset response was completely eliminated. There were no significant group differences (F(1,26) = 2.014, p = 0.168) nor a significant interaction between group and background (F(1,26) = 0.002, p = 0.961). Although the transition response was more robust to the effects of noise than the onset response, in that all participants had reliably distinguishable peaks, the addition of background noise significantly reduced the amplitude of the transition peak (F(1,29) = 6.068, p < 0.02; quiet: musicians: mean = 0.668 μV, σ = 0.249; nonmusicians: mean = 0.578 μV, σ = 0.242; noise: musicians: mean = 0.601 μV, σ = 0.202; nonmusicians: mean = 0.498 μV, σ = 0.211). There was no effect of group (F(1,29) = 1.476, p = 0.234) nor an interaction between group and background (F(1,29) = 0.120, p = 0.732), suggesting that noise had a similar effect on the peak amplitude of both groups.
Brainstem response: stimulus-to-response correlations and harmonic encoding
The presence of noise had a smaller degradative effect on the response to the vowel in musicians than nonmusicians. To quantify the effect of noise on the steady-state portion of the response, the degree of similarity between the stimulus and the corresponding brainstem response was calculated (SR correlation) for the quiet and noise conditions. A mixed-model ANOVA showed significant main effects of noise (F(1,29) = 17.49, p < 0.005) and of group (F(1,29) = 6.01, p = 0.02) and a marginally significant interaction (F(1,29) = 4.08, p = 0.052). Subsequent independent-samples t tests indicated that the two groups had equivalent SR correlations in quiet (t(29) = 1.543, p = 0.134; musicians: mean = 0.32, σ = 0.04; nonmusicians: mean = 0.28, σ = 0.07) but the musicians had significantly better SR correlations than the nonmusicians in noise (t(29) = 2.836, p = 0.008; musicians: mean = 0.29, σ = 0.05; nonmusicians: mean = 0.22, σ = 0.08) (Fig. 3A). This suggests that the introduction of noise resulted in less degradation of the musician's response.
In the presence of background noise, musicians also showed significantly greater encoding of the harmonics (H2–H10) (Fig. 3C). This was determined by spectrally analyzing the response to the stimulus steady state, the same time period used for calculating the SR correlations. Again, a repeated-measures ANOVA showed a main effect of noise (F(1,29) = 25.293, p < 0.005) and group (F(1,29) = 6.255, p = 0.018), but no interaction (F(1,29) = 0.004, p = 0.949). Post hoc comparisons indicated that, although there was only a trend in quiet for the musicians to have larger harmonic amplitudes (t(29) = 1.961, p = 0.06; musicians: mean = 1.049 μV, σ = 0.302; nonmusicians: mean = 0.859 μV, σ = 0.227) (Fig. 3B), there was a significant difference in noise (t(29) = 2.871, p = 0.008; musicians: mean = 0.886 μV, σ = 0.192; nonmusicians: mean = 0.692 μV, σ = 0.182). Conversely, for the f0, a repeated-measures ANOVA found no main effect of noise (F(1,29) = 0.555, p = 0.462) or group (F(1,29) = 0.009, p = 0.924) and no interaction (F(1,29) = 0.015, p = 0.904; quiet: musicians: mean = 0.399 μV, σ = 0.201; nonmusicians: mean = 0.395 μV, σ = 0.185; noise: musicians: mean = 0.418 μV, σ = 0.200; nonmusicians: mean = 0.409 μV, σ = 0.168).
To elucidate the relationship between the SR correlation and the neural representation of the frequency components, Pearson's r correlations were conducted. We found that the summed harmonic representation (H2–H10) in quiet was positively correlated with the SR correlation in quiet (r = 0.581, p = 0.001) and the summed harmonic representation (H2–H10) in noise was positively correlated with the SR correlation in noise (r = 0.648, p < 0.005) (Fig. 3D,E). These results demonstrate that a greater harmonic representation is associated with a higher degree of correlation between the stimulus and the response. In quiet, the SR correlation of both groups was similar, and although the musicians tended to have greater harmonic representation, there was no significant group difference on either measure. However, in noise, musicians had greater harmonic encoding and also greater SR correlations. Therefore, it appears that musicians are better able to represent the stimulus harmonics in noise than nonmusicians and that this enhanced spectral representation contributes to their higher SR correlation. No group differences (one-way ANOVA) were found for the representation of the f0 in either the quiet (F(1,29) = 0.511, p = 0.481) or the noise condition (F(1,29) = 0.181, p = 0.673) nor was there a relationship between f0 and SR correlations (quiet: r = 0.288, p = 0.116; noise: r = 0.275, p < 0.135).
The brainstem measures in quiet (onset peak, transition peak, SR correlation) were not related to either behavioral test of speech-in-noise perception (HINT or QuickSIN). For the brainstem responses in background noise, better HINT scores were related to earlier peak latencies for the onset and transition peaks (onset: r = 0.551, p = 0.002; transition: r = 0.481, p = 0.006) (Fig. 4A,B, respectively). In a similar vein, a greater SR correlation corresponded to better HINT scores (r = −0.445, p = 0.01) (Fig. 4C). These relationships suggest that better behavioral speech perception in noise, as measured by a lower HINT score, is associated with greater precision of brainstem timing in the presence of background noise (i.e., earlier peaks and higher SR correlations) (Fig. 4). QuickSIN, which was previously found to be related to working memory and frequency discrimination ability (Parbery-Clark et al., 2009), showed no relationship with peak timing measures or the SR correlation for either the quiet or noise condition (all p > 0.1). Last, neither the representation of the f0 nor that of the harmonics was related to performance on either speech-in-noise test (all p > 0.1).
The present data show that, in background noise, musicians demonstrate earlier onset and transition response timing, better stimulus-to-response and quiet-to-noise correlations, and greater neural representation of the stimulus harmonics than nonmusicians. Earlier response timing as well as a better SR correlation in the noise condition were associated with better speech perception in noise as measured by HINT but not QuickSIN. Together, these findings indicate that musical experience results in a more robust subcortical representation of speech in the presence of background noise, which may contribute to the musician behavioral advantage for speech-in-noise perception.
The subcortical representation of important stimulus temporal features was equivalent for musicians and nonmusicians in quiet, but the musicians' responses were less degraded by the background noise. The well documented increase in onset timing in background noise (Don and Eggermont, 1978; Burkard and Hecox, 1983a,b, 1987; Cunningham et al., 2001; Wible et al., 2005) was significantly smaller in the musicians. Previous work found no change in onset response timing with short-term auditory training (Hayes et al., 2003; Russo et al., 2005). In light of their results, Russo et al. (2005) postulated that the onset response, which originates from the primary afferent volley, may be more resistant to training effects. However, other studies, including ours, found earlier onset timing in musicians (Musacchia et al., 2007; Strait et al., 2009). It is therefore possible that extended auditory training, such as that experienced by musicians, is required for experience-related modulation of onset response timing.
Musicians also exhibited more robust responses to the steady-state portion of the stimulus in the presence of background noise. By calculating the degree of similarity between the stimulus waveform and the subcortical representation of the speech sound, we found that musicians had higher SR correlations in noise than nonmusicians. A greater SR correlation is indicative of more precise neural transcription of stimulus features. One possible explanation for this musician enhancement in noise may be based on the Hebbian principle, which posits that the associations between neurons that are simultaneously active are strengthened and those that are not are subsequently weakened (Hebb, 1949). Given the present results, we can speculate that extensive musical training may lead to greater neural coherence. This strengthening of the underlying neural circuitry would lead to a better bottom-up, feedforward representation of the signal. We can also interpret these data within the framework of corticofugal modulation in which cortical processes shape the afferent auditory encoding via top-down processes. It is well documented that the auditory cortex sharpens the subcortical sensory representations of sounds through the enhancement of the target signal and the suppression of irrelevant competing background noise via the efferent system (Suga et al., 1997; Zhang et al., 1997; Luo et al., 2008). The musician's use of fine-grained acoustic information and lifelong experience with parsing simultaneously occurring melodic lines may refine the neural code in a top-down manner such that relevant acoustic features are enhanced early in the sensory system. This enhanced encoding improves the subcortical signal quality, resulting in a more robust representation of the target acoustic signal in noise. Although our data and experimental paradigm cannot tease apart the specific contributions of top-down or bottom-up processing, they are not mutually exclusive explanations. 
In all likelihood, top-down and bottom-up processes are reciprocally interactive with both contributing to the subcortical changes observed with musical training.
Interestingly, the improved stimulus-to-response correlation in the noise condition was related to greater neural representation of the stimulus harmonics (H2–H10) but not the fundamental frequency in noise. Musicians, through the course of their training, spend thousands of hours producing, manipulating, and attending to musical sounds that are spectrally rich. The spectral complexity of music is partially attributable to the presence and relative strength of harmonics as well as the change in harmonics over time. Harmonics, which also underlie the perception of timbre or “sound color,” enable us to differentiate between two musical instruments producing the same note. Musicians have enhanced cortical responses to their primary instrument, suggesting that their listening and training experience modulates the neural responses to specific timbres (Pantev et al., 2001; Margulis et al., 2009). Likewise, musicians demonstrate greater sensitivity to timbral differences and harmonic changes within a complex tone (Koelsch et al., 1999; Musacchia et al., 2008; Zendel and Alain, 2009). Within the realm of speech, timbral features provide important auditory cues for speaker and phonemic identification and contribute to auditory object formation (Griffiths and Warren, 2004; Shinn-Cunningham and Best, 2008). A potential benefit of heightened neural representation of timbral features would be the increased availability of harmonic cues, which can then be used to generate an accurate perceptual template of the target voice. An accurate template or perceptual anchor is considered a key element for improving signal perception (Best et al., 2008) and facilitates the segregation of the target voice from background noise (Mullennix et al., 1989; Ahissar, 2007). 
Zendel and Alain (2009) showed that musicians were more sensitive to subtle harmonic changes both behaviorally and cortically, which they interpret as a musician advantage for concurrent stream segregation—a skill considered important for speech perception in noise. In interpreting their results, Zendel and Alain (2009) postulate that the behavioral advantage and its corresponding cortical index may be attributable to a better representation of the stimulus at the level of the brainstem. Our findings, along with previous studies documenting enhanced subcortical representation of harmonics in musicians' responses, support this claim.
Limitations and future directions
These results provide biological evidence for the positive effect of lifelong musical training on speech-in-noise encoding. Nevertheless, we cannot determine the extent to which this enhancement is mediated directly by musical training, group genetic differences, or a combination of the two. Longitudinal studies, akin to the large-scale design recently described by Forgeard et al. (2008) and Hyde et al. (2009), could not only elucidate the developmental time course and/or genetic disposition for the musician neural advantage for speech-in-noise, but may also help disentangle the relative influences of top-down and bottom-up processes on the neural encoding of speech-in-noise. Other important lines of research include the impact that the choice of musical instrument and musical genre, as well as extensive musical listening experience in the absence of active playing, have on the subcortical encoding of speech-in-noise.
Previous research has indicated that musical training may serve as a useful remediation strategy for children with language impairments (Overy et al., 2003; Besson et al., 2007; Jentschke et al., 2008; Jentschke and Koelsch, 2009). Our results imply that clinical populations known to have problems with speech perception in noise, such as children with language-based learning disabilities (e.g., poor readers) (Cunningham et al., 2001), may also benefit from musical training. More specifically, the subcortical deficits in sound processing seen in this population (e.g., timing and harmonics) (Wible et al., 2004; Banai et al., 2009; Hornickel et al., 2009) occur for the very elements that are enhanced in musicians. Moreover, for f0 encoding, no group differences have been found between normal and learning-impaired children nor in the present study between musicians and nonmusicians; this is consistent with the previously described dissociation between the neural encoding of f0 and the faster elements of speech (e.g., timing and harmonics) (Fant, 1960; Kraus and Nicol, 2005). By studying an expert population, we can investigate which factors contribute to an enhanced ability for speech perception in noise, providing future avenues for the investigation of speech perception deficits in noise as experienced by older adults and hearing-impaired and language-impaired children. By providing an objective biological index of speech perception in noise, brainstem activity may be a useful measure for evaluating the effectiveness of SIN-based auditory training programs.
Overall, our results offer evidence of musical expertise contributing to an enhanced subcortical representation of speech sounds in noise. Musicians had more robust temporal and spectral encoding of the eliciting speech stimulus, thus offsetting the deleterious effects of background noise. Faster neural timing and enhanced harmonic encoding in musicians suggest that musical experience confers an advantage resulting in more precise neural synchrony in the auditory system. These findings provide a biological explanation for musicians' perceptual enhancement for speech-in-noise.
This work was supported by National Science Foundation Grant 0842376. We thank everyone who participated in this study. We also thank Dr. Samira Anderson and Carrie Lam for their help with peak picking and Dr. Beverly Wright, Dr. Frederic Marmel, Dr. Samira Anderson, Trent Nicol, and Judy Song for suggestions made on a previous version of this manuscript.
Correspondence should be addressed to Dr. Nina Kraus, Department of Communication Sciences, Northwestern University, 2240 Campus Drive, Evanston, IL 60208.