Abstract
Pitch is one of the primary auditory sensations and plays a defining role in music, speech, and auditory scene analysis. Although the main physical correlate of pitch is acoustic periodicity, or repetition rate, there are many interactions that complicate the relationship between the physical stimulus and the perception of pitch. In particular, the effects of other acoustic parameters on pitch judgments, and the complex interactions between perceptual organization and pitch, have uncovered interesting perceptual phenomena that should help to reveal the underlying neural mechanisms.
The what and why of pitch
Pitch is one of the primary auditory sensations, along with loudness and timbre. In music, sequences of pitch define melody, and simultaneous combinations of pitch define harmony. In speech, rising and falling pitch contours help define prosody, and in tone languages, such as Mandarin and Cantonese, pitch contours help determine the meaning of words. In complex acoustic environments, differences in pitch can help listeners to segregate and make sense of competing sound sources.
Put simply, pitch is the perceptual correlate of the periodicity, or repetition rate, of an acoustic waveform. The most commonly considered pitch-evoking sound is a harmonic complex tone. This periodic waveform repeats at a rate corresponding to the fundamental frequency (F0) and can be decomposed into sinusoidal harmonics, or overtones, with frequencies at integer multiples of the F0 (Fig. 1A,B). The relative amplitudes of the harmonics within a complex tone play an important role in determining its sound quality, or timbre. Despite differences in timbre and loudness, two tones generally have the same pitch if they share the same F0. Although young humans with normal hearing can hear sounds with frequencies between ∼20 and 20,000 Hz, only repetition rates between ∼30 and 4000 Hz elicit a pitch sensation salient enough to carry melodic information (Attneave and Olson, 1971; Pressnitzer et al., 2001).
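To make these definitions concrete, the following toy sketch synthesizes a harmonic complex tone in Python with NumPy; the F0, number of harmonics, and amplitude values are illustrative and not drawn from any particular experiment:

```python
import numpy as np

def harmonic_complex(f0=440.0, amps=None, dur=0.5, fs=44100):
    """Synthesize a harmonic complex tone: a sum of sinusoids at
    integer multiples of the fundamental frequency f0 (in Hz)."""
    t = np.arange(int(dur * fs)) / fs
    if amps is None:
        amps = np.ones(10)                    # ten equal-amplitude harmonics
    tone = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
               for k, a in enumerate(amps))
    return tone / np.max(np.abs(tone))        # normalize peak amplitude

# Same F0 (hence same pitch), different harmonic amplitudes (hence
# different timbre):
bright = harmonic_complex(amps=np.ones(10))
mellow = harmonic_complex(amps=1.0 / np.arange(1, 11))  # 6 dB/octave rolloff
```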
Scientific disputes on how we perceive the F0 arose in the mid-19th Century (Seebeck, 1841; Ohm, 1843; Helmholtz, 1885/1954), but it was only firmly established in the mid-20th Century that a tone retains the same pitch, even if all the energy at the F0 is removed or masked by noise (Schouten, 1940; Licklider, 1954). This phenomenon of the “pitch of the missing fundamental” provides an important benchmark in the search for physiological correlates of pitch (Plack et al., 2005; Griffiths and Hall, 2012; Wang and Walker, 2012). From a perceptual standpoint, it makes sense that the pitch of a sound remains constant after the lowest harmonic components are removed or masked (occluded), so that some degree of perceptual invariance of a sound source can be maintained in a cluttered acoustic environment (McDermott and Oxenham, 2008).
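The missing-fundamental phenomenon can be illustrated numerically. The sketch below (a signal-level toy demonstration, not a model of auditory processing) builds a complex from harmonics 2-10 only and shows that the waveform's autocorrelation still peaks at the period of the absent F0:

```python
import numpy as np

fs, f0 = 44100, 200.0
t = np.arange(int(0.1 * fs)) / fs

# Complex tone containing only harmonics 2-10: no energy at the F0 itself.
missing_fund = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(2, 11))

# The waveform still repeats every 1/F0 = 5 ms, so its autocorrelation
# peaks at the F0 period even though the spectrum contains no F0 component.
ac = np.correlate(missing_fund, missing_fund, mode="full")
ac = ac[len(ac) // 2:]                        # keep non-negative lags
min_lag = int(0.002 * fs)                     # skip the trivial lag-0 peak
period = np.argmax(ac[min_lag:]) + min_lag
print(fs / period)                            # ~200 Hz: the "missing" F0
```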
Relationship between pitch and early auditory transformations
Figure 1A shows the time waveform of a musical tone with an F0 of 440 Hz. As shown in Figure 1B, when a time segment of the waveform is analyzed, its power spectrum (the distribution of sound intensity across frequency) can be extracted.
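As a simple illustration, the power spectrum of a windowed segment can be computed with a discrete Fourier transform; the parameter choices below (Hann window, 100 ms segment, 440 Hz tone with a 1/n amplitude rolloff) are arbitrary:

```python
import numpy as np

fs, f0 = 44100, 440.0
t = np.arange(int(0.1 * fs)) / fs
x = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 6))

# Power spectrum of a Hann-windowed segment: squared magnitude of the
# discrete Fourier transform of the windowed waveform.
win = np.hanning(len(x))
power = np.abs(np.fft.rfft(x * win)) ** 2
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
print(freqs[np.argmax(power)])                # ~440 Hz, the strongest harmonic
```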
Cochlear filtering
When sound enters the cochlea, different frequencies selectively stimulate different regions of the cochlea. This frequency-to-place mapping, or tonotopy, is maintained throughout the auditory pathways up to at least primary auditory cortex (AI) and forms a major organizational principle of auditory neural processing. The perceptual consequences of tonotopic organization are manifold and can be measured using behavioral experiments in a variety of ways, often involving masking (Oxenham and Wojtczak, 2010). The results of such experiments are explained in terms of the frequency selectivity of the “auditory filters” (Fig. 1C). These behaviorally defined filters are thought to have their basis in cochlear filtering (Shera et al., 2002, 2010). The output of the auditory filters can be represented in terms of its long-term average, referred to as the “excitation pattern” (Fig. 1D), which can be thought of as a schematic representation of the mechanical activation of the cochlear partition, or of neural activity, as a function of characteristic frequency (CF) (Glasberg and Moore, 1990). Alternatively, the outputs of the auditory filters can be considered in terms of their time waveforms (Fig. 1E). Because filter bandwidths increase with increasing CF, regions of the cochlea tuned to the frequencies of low-numbered harmonics respond almost exclusively to a single harmonic, whereas regions tuned to the frequencies of high-numbered harmonics respond to several harmonics. Harmonics that are represented exclusively within single filters are referred to as “resolved,” whereas harmonics that interact with others within auditory filters are referred to as “unresolved.”
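A crude excitation-pattern calculation can be sketched as follows, using the ERB formula of Glasberg and Moore (1990) and a symmetric rounded-exponential (roex) filter shape. Real auditory filters are level-dependent and asymmetric, so this is only a schematic:

```python
import numpy as np

def erb_hz(f):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter
    centered at f (Hz), after Glasberg and Moore (1990)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def excitation_pattern(freqs, powers, cfs):
    """Excitation at each characteristic frequency (CF): the summed
    component power passed by a symmetric roex filter at that CF."""
    excitation = np.zeros(len(cfs))
    for i, cf in enumerate(cfs):
        p = 4.0 * cf / erb_hz(cf)              # filter slope parameter
        g = np.abs(freqs - cf) / cf            # normalized frequency deviation
        w = (1.0 + p * g) * np.exp(-p * g)     # roex(p) filter weighting
        excitation[i] = np.sum(w * powers)
    return excitation

# Ten equal-power harmonics of a 440 Hz F0, evaluated at 500 CFs:
harmonics = 440.0 * np.arange(1, 11)
cfs = np.geomspace(100.0, 10000.0, 500)
ep = excitation_pattern(harmonics, np.ones(10), cfs)
# Low-numbered harmonics appear as separate peaks; high-numbered ones merge.
```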
Resolved harmonics produce peaks in the excitation pattern (Fig. 1D), and should elicit filtered waveforms that are similar to single pure tones at that frequency (Fig. 1E), whereas unresolved harmonics produce no distinct peaks, and elicit complex waveforms that reflect the interaction between multiple harmonics. The point of transition between resolved and unresolved harmonics is somewhat fuzzy and depends on many factors, including sound level and F0, as well as on how resolvability is defined and measured (Bernstein and Oxenham, 2003; Moore and Gockel, 2011). Nevertheless, a number of phenomena related to pitch can be explained in terms of harmonic resolvability (Flanagan and Guttman, 1960; Houtsma and Smurzynski, 1990; Shackleton and Carlyon, 1994; Bernstein and Oxenham, 2006a,b).
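One common rule of thumb treats a harmonic as resolved when adjacent harmonics fall more than about one ERB apart at that harmonic's frequency. The sketch below applies that heuristic, with the caveat noted above that no single criterion is definitive:

```python
import numpy as np

def erb_hz(f):
    return 24.7 * (4.37 * f / 1000.0 + 1.0)   # Glasberg and Moore (1990)

def roughly_resolved(f0, harmonic_numbers):
    """Treat a harmonic as 'resolved' when the harmonic spacing (= F0)
    exceeds one ERB at that harmonic's frequency. A rule of thumb only."""
    freqs = f0 * np.asarray(list(harmonic_numbers), dtype=float)
    return f0 > erb_hz(freqs)

print(roughly_resolved(200.0, range(1, 16)))
# For a 200 Hz F0, roughly the first 8 harmonics come out "resolved" here,
# consistent with commonly cited estimates of about 5-10.
```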
Relationship of auditory-nerve and brainstem responses to pitch
Neurons in the auditory nerve are more likely to fire at one phase within the cycle of a waveform than at other phases. This precise stimulus-driven spike timing, or phase locking, is known to underlie our ability to perceive minute differences in the time of arrival of sound at the two ears, down to as little as 10 μs, which in turn helps us to localize sounds in space (Blauert, 1997). It is entirely plausible that the same precise timing is used to help encode stimulus periodicity and hence pitch. In species such as cat and guinea pig, phase locking is known to extend to ∼1–2 kHz and then to degrade at higher frequencies (Palmer and Russell, 1986). Because of the invasive nature of the measurements, little is known about phase locking in the human auditory nerve. Nevertheless, behavioral data showing that the ability to discriminate frequency and to recognize melodies degrades for pure tones above ∼4–5 kHz (Attneave and Olson, 1971; Moore, 1973) have been interpreted as reflecting the degradation of phase locking at high frequencies, suggesting that accurate pitch perception relies on timing information in the auditory nerve. There are some indications that timing information alone may not be sufficient to provide accurate pitch: Oxenham et al. (2004) found that when the timing information from low-frequency harmonics was “transposed” to a higher-frequency cochlear location, listeners were not able to use that information to extract the F0. This result suggests that complex pitch is based either on tonotopic or “place” information, or on timing information that must be presented to the “correct” place along the cochlea, as required by spatio-temporal models of periodicity coding (Loeb et al., 1983; Shamma and Klein, 2000; Cedolin and Delgutte, 2010; Carlyon et al., 2012).
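A stimulus in the spirit of those “transposed” tones can be generated as follows. This is a simplified sketch: the published stimuli also low-pass filtered the modulator, and the frequencies used here are illustrative rather than taken from the original study:

```python
import numpy as np

def transposed_tone(f_env, f_carrier, dur=0.5, fs=44100):
    """High-frequency carrier amplitude-modulated by a half-wave
    rectified low-frequency sinusoid, so that a low-frequency temporal
    pattern is delivered to a high-frequency cochlear place. (The
    published stimuli also low-pass filtered the modulator; that step
    is omitted here for brevity.)"""
    t = np.arange(int(dur * fs)) / fs
    modulator = np.maximum(np.sin(2 * np.pi * f_env * t), 0.0)
    carrier = np.sin(2 * np.pi * f_carrier * t)
    return modulator * carrier

# A 125 Hz temporal pattern presented at a 4 kHz place (values illustrative):
x = transposed_tone(125.0, 4000.0)
```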
How is pitch extracted? One route might be via the temporal patterns generated by unresolved harmonics in the auditory periphery (Schouten et al., 1962). These patterns provide a direct estimate of the F0, assuming that the brain can measure the timing between peaks of neural activity corresponding to peaks in the time waveform. In contrast, extracting pitch from the resolved harmonics may require coding of the individual frequencies and then combining that information to estimate the F0 (Schroeder, 1968; Goldstein, 1973). At face value, therefore, the unresolved harmonics provide the easier route to coding F0. On the other hand, phase distortions produced by room acoustics and reverberation can change the waveform of complex tones, affecting the representation of unresolved (but not resolved) harmonics and making the pitch of unresolved harmonics more susceptible to interference (Qin and Oxenham, 2005; Sayles and Winter, 2008). In fact, many behavioral studies have shown that low-numbered resolved harmonics elicit a much more salient, robust, and accurate pitch than do high-numbered unresolved harmonics (Houtsma and Smurzynski, 1990; Carlyon, 1996; Bernstein and Oxenham, 2003; Micheyl et al., 2010).
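The idea that interacting unresolved harmonics carry F0 information in their temporal envelope, rather than in resolved spectral peaks, can be illustrated as follows. This is a signal-level toy demonstration, not a model of neural processing; it uses SciPy's Hilbert transform to extract the envelope:

```python
import numpy as np
from scipy.signal import hilbert

fs, f0 = 44100, 200.0
t = np.arange(int(0.1 * fs)) / fs

# High-numbered (unresolved) harmonics only: numbers 15-25 of a 200 Hz F0.
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(15, 26))

# The interacting harmonics beat at the F0 rate, so the temporal envelope
# (here via the Hilbert transform) carries the periodicity directly.
env = np.abs(hilbert(x))
env -= env.mean()
ac = np.correlate(env, env, mode="full")[len(t) - 1:]  # non-negative lags
min_lag = int(fs / 1000)                   # ignore rates above 1 kHz
period = np.argmax(ac[min_lag:]) + min_lag
print(fs / period)                         # ~200 Hz, recovered from the envelope
```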
Pitch-matching and melody discrimination experiments have recently revealed that the pitch of the missing fundamental can be extracted even when all the harmonics present are well above 5 kHz (Oxenham et al., 2011). Because the pitch was extracted from the individual harmonics, and not from the temporal envelope produced by unresolved harmonics (Kaernbach and Bering, 2001), the results suggest either that temporal information is not necessary for complex pitch perception (i.e., harmonics can be represented via a place code), or that phase locking in the auditory nerve extends to much higher frequencies than is generally believed (see also Heinz et al., 2001; Recio-Spinoso et al., 2005; Moore and Sȩk, 2009).
The frequency-following response (FFR) is a measure of phase-locked brainstem activity that can be recorded from the scalp (Skoe and Kraus, 2010). It has been used as a measure of pitch-encoding accuracy, and several intriguing findings have been reported recently, including stronger FFRs in musicians than in people without musical training (Wong et al., 2007), stronger FFRs in speakers of tonal languages (such as Mandarin) than in those without such experience (Krishnan et al., 2005), increased FFR amplitudes after training in a speech-related task (de Boer and Thornton, 2008; Tzounopoulos and Kraus, 2009), and correlations between FFR amplitudes and the ability to learn new pitch contours in a linguistic context (Chandrasekaran et al., 2012). However, we are only beginning to understand the relationship between the FFR and pitch perception (Gockel et al., 2011).
Interactions between periodicity and other acoustic variables on pitch
Although pitch is often treated as orthogonal to other perceptual dimensions, such as loudness and timbre, some interactions occur. For instance, small effects of stimulus intensity on pitch have been reported (Verschuure and van Meeteren, 1975). More commonly, many people find it difficult to ignore changes in brightness (produced by changes in the spectral content of a stimulus) when making judgments about pitch (produced by changes in the F0) (Moore and Glasberg, 1990). In fact, a recent study reported that even many musically trained listeners find it difficult to detect small pitch differences between sounds with very different timbres (Borchert et al., 2011).
The fact that listeners find it difficult to make pitch judgments in the face of large timbral differences raises the question of whether we should expect to find a cortical neural representation of pitch that is highly localized and invariant to changes in other dimensions. As discussed later in these mini-reviews (Wang and Walker, 2012), both localized, pitch-specific, invariant representations (Bendor and Wang, 2005) and more distributed population codes (Walker et al., 2011) have been proposed.
Other forms of interaction can aid comparisons between sound sequences. For instance, contours (the patterns of rising and falling pitch in melodies) have traditionally been considered specific to pitch, but recent findings suggest that listeners are not only able to perceive contours in dimensions other than pitch (i.e., loudness and timbre) but are also able to compare contours across perceptual dimensions, suggesting a common underlying representation (McDermott et al., 2008).
Future directions
There are many indications that perceptual organization and object formation are affected by harmonicity and pitch, but also that pitch can be influenced by perceptual organization (Darwin, 2005). Thus, the search for neural correlates of pitch in cortical regions seems entirely reasonable and consistent with behavioral data. In addition, paradigms in which pitch can be altered by changes in perceptual grouping, in the absence of changes in stimulus periodicity, could provide a fruitful approach to dissociating neural correlates of pitch from those related strictly to physical stimulus properties.
People with hearing loss, and especially those with cochlear implants, often suffer from deficits in pitch perception (McDermott, 2004; Oxenham, 2008). A better understanding of the neural transformations involved in pitch extraction should help in designing more effective neural and acoustic prostheses.
Although the tonotopic representation of frequency is established early, it is relative, rather than absolute, pitch that is most salient and important for human acoustic communication in speech and music. Some efforts have been made to study the neural correlates of pitch relations and contours (Patel and Balaban, 2000; Warren et al., 2003), but this remains a field ripe for further study.
Footnotes
This work on pitch perception is supported by NIH Grant R01 DC 05216.
Correspondence should be addressed to Andrew J. Oxenham at the above address, oxenham@umn.edu.