Abstract
Neural coding of the pitch of complex sounds is vital for animals' ability to communicate and to perceptually organize natural acoustic scenes. Harmonic complex sounds typically have a well defined pitch corresponding to their fundamental frequency, whereas inharmonic sounds can exhibit pitch ambiguity: their pitch can have more than one value. Iterated rippled noise (IRN), a common “pitch stimulus,” is generated from broadband noise by a cascade of delay-and-add steps, with the delayed noise phase-shifted by φ degrees. By varying φ, the (in)harmonicity, and therefore the pitch ambiguity, of IRN can be manipulated. Recordings were made from single-units in the ventral cochlear nucleus of anesthetized guinea pigs in response to IRN and complex tones, systematically varying the inharmonicity. In their all-order interspike interval distributions, primary-like and chopper units tuned within the phase-locking range of best frequencies represent the waveform temporal fine structure (which varies with φ). In contrast, those units tuned to higher frequencies represent the temporal-envelope modulation (independent of φ). We show a temporal representation of ambiguous pitch for IRN and complex tones based on responses to the stimulus fine structure. Within the dominance region for pitch this representation follows the predictions of classic human behavioral experiments and provides a unifying contribution to possible neuro-temporal explanations for the pitch shift and pitch ambiguity associated with many inharmonic sounds.
- pitch ambiguity
- complex tone
- inharmonic sounds
- iterated rippled noise
- cochlear nucleus
- interspike intervals
- auditory brainstem
Introduction
Vocal communication, musical melody recognition, and the perceptual organization of sounds into meaningful “auditory objects” rely on accurate neural coding of pitch (Bregman, 1990; Plack et al., 2005). Whether the brain uses temporal (Schouten, 1940; Licklider, 1951) or spectral (Goldstein, 1973; Wightman, 1973a; Terhardt, 1974) information to determine a sound's pitch remains an open question (Cedolin and Delgutte, 2005; de Cheveigné, 2005); however, recent evidence favors temporal processing with some dependence on spectral place (Oxenham et al., 2004; Bernstein and Oxenham, 2005). Temporal information reaches the central auditory system by auditory nerve fibers (ANFs) “phase-locking” to basilar-membrane vibrations; the product of a rapid fine-structure, and a slower temporal-envelope vibration. The fundamental frequency (F0) of harmonic complex sounds may be extracted by neural processing of either one, or both of these temporal cues. Applying a frequency shift to each harmonic of a complex sound, making the spectrum inharmonic, alters the fine structure but not the envelope. Inharmonic sounds often evoke more than one pitch (pitch ambiguity), none of which corresponds to the envelope periodicity (pitch shift), and therefore provide evidence for fine-structure based pitch (Schouten, 1940; de Boer, 1956b; Schouten et al., 1962).
Neurophysiological studies have examined the processing of harmonic and inharmonic complex sounds from the auditory nerve (Javel, 1980; Palmer and Winter, 1993; Simmons and Ferragamo, 1993; Rhode, 1995; Cariani and Delgutte, 1996a,b; Cedolin and Delgutte, 2005, 2007) to the auditory cortex (Bendor and Wang, 2005). However, studies of inharmonic sounds are limited to narrowband amplitude-modulated (AM) tones (Javel, 1980; Simmons and Ferragamo, 1993; Rhode, 1995; Cariani and Delgutte, 1996a). Broadband signals known as iterated rippled noises (IRNs) are popular pitch stimuli (Patterson et al., 2002; Bendor and Wang, 2005; Schönwiesner and Zatorre, 2008), and fine-structure detection in low-frequency “dominance” regions is thought to be the basis of IRN pitch (Bilsen and Ritsma, 1967/68, 1969/70; Bilsen, 1966; Yost and Hill, 1979; Yost, 1996; Yost et al., 1996; Shofner and Yost, 1997). The unambiguous pitch of harmonic IRN is represented in the firing patterns of ANFs (Fay et al., 1983; ten Kate and van Bekkum, 1988) and cochlear-nucleus neurons (Bilsen et al., 1975; Shofner, 1991, 1999; Winter et al., 2001; Sayles and Winter, 2007) by action potentials locked to either the fine-structure or envelope periodicity. The pitch of inharmonic IRN can be ambiguous (Bilsen, 1966; Raatgever and Bilsen, 1992; Yost, 1997), and the (in)harmonicity (and pitch ambiguity) can be manipulated by varying a single parameter.
Here we examine the temporal representation of the pitch-shift and ambiguity of IRN in the responses of anesthetized guinea-pig ventral cochlear nucleus (VCN) units. In their interspike-interval distributions, VCN primary-like units and chopper units tuned to the dominance region represent the pitch-shift and ambiguity of IRN and inharmonic complex tones predicted by theoretical and psychophysical studies (Fourcin, 1965; Bilsen, 1966; Yost and Hill, 1979; Yost, 1996, 1997). The aim of the present work is to provide neurophysiological evidence for fine-structure detection by VCN units underlying the pitch-shift and ambiguity of broadband inharmonic signals.
This work has been presented in abstract form at “Acoustics '08”, Paris, France (Sayles and Winter, 2008a).
Materials and Methods
The preparation.
Experiments were performed on 20 pigmented guinea pigs (Cavia porcellus), weighing between 320 and 600 g. Animals were anesthetized with urethane (1.0 g/kg, i.p.). Hypnorm (fentanyl citrate, 0.315 mg/ml; fluanisone, 10 mg/ml; Janssen) was administered as supplementary analgesia (1 ml/kg, i.m.). Anesthesia and analgesia were maintained at a depth sufficient to abolish the pedal withdrawal reflex (front paw). Additional doses of Hypnorm (1 ml/kg, i.m.) or urethane (0.5 g/kg, i.p.) were administered on indication. Core temperature was monitored with a rectal probe and maintained at 38°C using a thermostatically controlled heating blanket (Harvard Apparatus). The trachea was cannulated and, on signs of suppressed respiration, the animal was ventilated with a pump (Bioscience). Surgical preparation and recordings took place in a sound-attenuated chamber (IAC). The animal was placed in a stereotaxic frame, which had ear bars coupled to hollow specula designed for the guinea-pig ear. A mid-sagittal scalp incision was made and the periosteum and the muscles attached to the temporal and occipital bones were removed. The bone overlying the left bulla was fenestrated and a silver-coated wire was inserted into the bulla to contact the round window of the cochlea for monitoring compound action potentials (CAP). The hole was resealed with Vaseline. The CAP threshold was determined at selected frequencies at the start of the experiment and thereafter on indication. If thresholds deteriorated by >10 dB and were nonrecoverable (e.g., by removing fluid from the bulla or by artificially ventilating the animal) the experiment was terminated. A craniotomy was performed exposing the left cerebellum. The overlying dura was removed and the exposed cerebellum was partially aspirated to reveal the underlying cochlear nucleus. The hole left from the aspiration was then filled with 1.5% agar in saline to prevent desiccation.
The experiments in this study were performed under the terms and conditions of the project license issued by the United Kingdom Home Office to the second author.
Neural recordings.
Single units were recorded extracellularly with glass-coated tungsten microelectrodes (Merrill and Ainsworth, 1972). Electrodes were advanced in the sagittal plane by a hydraulic microdrive (650 W; David Kopf Instruments) at an angle of 45°. Single units were isolated using broadband noise as a search stimulus. All stimuli were digitally synthesized in real-time with a PC equipped with a DIGI 9636 PCI card that was connected optically to an AD/DA converter (ADI-8 DS; RME audio products). The AD/DA converter was used for digital-to-analog conversion of the stimuli as well as for analog-to-digital conversion of the amplified (×1000) neural activity. The sample rate was 96 kHz. The AD/DA converter was driven using ASIO (Audio Streaming Input Output) and SDK (Software Developer Kit) from Steinberg (Lloyd, 2002).
After digital-to-analog conversion, the stimuli were equalized (phonic graphic equalizer, model EQ 3600; Apple Sound) to compensate for the speaker and coupler frequency response and fed into a power amplifier (Rotel RB971) and a programmable end attenuator (0–75 dB in 5 dB steps, custom built) before being presented over a speaker (Radio Shack 30-1777 tweeter assembled by Mike Ravicz, Massachusetts Institute of Technology, Cambridge, MA) mounted in the coupler designed for the ear of a guinea pig. The stimuli were monitored acoustically using a condenser microphone (Brüel & Kjær 4134) attached to a calibrated 1 mm diameter probe tube that was inserted into the speculum close to the eardrum. Neural spikes were discriminated in software, stored as spike times on a PC, and analyzed off-line using custom-written Matlab programs (The MathWorks).
Unit classification.
After isolation of a unit, its best frequency (BF) and excitatory threshold were determined using audio-visual criteria. Spontaneous activity was measured over a 10-second period. Single units were classified based on their peri-stimulus time histograms (PSTH), the first-order interspike-interval distribution and the coefficient of variation (CV) of the discharge regularity. The CV was calculated by averaging the ratios of the SD of the ISI (interspike interval) to its mean between 12 and 20 ms after onset (Young et al., 1988). PSTHs were generated from spike-times collected in response to 250 sweeps of a 50 ms tone at the unit's BF at 20 and 50 dB above threshold. Tones had 1 ms sin²-on and cos²-off gates, their starting phase was randomized, and they were repeated with a 250 ms period. PSTHs were classified as primary-like (PL), primary-like with a notch (PN), chopper-sustained (CS), chopper-transient (CT), and onset-chopper (OC). Some units with very low BFs (<∼0.5 kHz) could not be assigned to one of the above categories. In the absence of a definitive classification these are grouped together as “low frequency” (LF) units.
Complex stimuli.
IRN was generated from a noise waveform with a Gaussian distribution of instantaneous amplitudes. This waveform is delayed by time d, phase shifted by φ degrees (independent of frequency), and added back to the input waveform. This process is repeated for n iterations. The phase shift was implemented in the frequency domain, with φ varied in 30° steps between 0 and 330°. Delay d was varied in octave steps between 2 and 16 ms, corresponding to F0 s between 500 and 62.5 Hz. For some units, values of d were chosen to place harmonics 1–10 of the IRN signal at unit BF when φ = 0°. Stimuli were generated with 16 iterations of the delay-and-add circuit, with the output of each iteration step serving as the input to the next [“add same,” (Yost, 1996)]. IRN signals were low-pass filtered in the frequency domain at 10 kHz. In contrast to previous studies which described IRN signals with the parameters d, gain g, and n, we describe the IRN stimuli used in this study as IRN[d, φ, n] throughout.
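The delay-and-add procedure described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the real-time synthesis code used in the experiments: the function names `irn_gain` and `make_irn`, the unit-RMS scaling, and the circular (FFT-based) delay are choices made here for brevity. The sign of the phase term is chosen so that increasing φ shifts the spectral peaks upward in frequency.

```python
import numpy as np

def irn_gain(freqs_hz, d_s, phi_deg, n_iter):
    """Complex gain of n_iter cascaded delay-and-add stages ("add same").

    One stage adds the input to a copy delayed by d_s seconds and
    phase-shifted by phi_deg degrees at every (positive) frequency.
    """
    phi = np.deg2rad(phi_deg)
    stage = 1.0 + np.exp(1j * (phi - 2.0 * np.pi * np.asarray(freqs_hz) * d_s))
    return stage ** n_iter

def make_irn(d_ms, phi_deg, n_iter, dur_s=0.5, fs=96000,
             cutoff_hz=10000.0, seed=None):
    """Return IRN[d, phi, n] built from fresh Gaussian noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = int(round(dur_s * fs))
    noise = rng.standard_normal(n)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Apply all iterations at once in the frequency domain.
    # Assumption: the delay is circular because it is applied via the FFT.
    spec = np.fft.rfft(noise) * irn_gain(freqs, d_ms / 1000.0, phi_deg, n_iter)
    spec[freqs > cutoff_hz] = 0.0          # low-pass filter at the cutoff
    y = np.fft.irfft(spec, n=n)
    return y / np.sqrt(np.mean(y ** 2))    # scale to unit RMS (assumption)
```

For example, `make_irn(4.0, 180.0, 16)` would correspond to IRN[4, 180, 16]; a single stage has gain 2 at the spectral peaks and 0 at the troughs.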
Complex tones were generated in the time domain from a sum of sinusoids added either in cosine (COS) or random (RAND) phase. Each complex contained all (equal-amplitude) components of a series between 0 and 10 kHz. The frequency spacing (f) between components was varied in octave steps between 500 and 62.5 Hz. The frequency shift Δf applied to each component, moving it upward in frequency from the harmonic condition, in which components are all integer multiples of f, was calculated as Δf = φf/360, with φ varied in 30° steps between 0 and 330°, as for IRN signals. When Δf >0 the lowest frequency component present in the physical stimulus was at Δf. The complex tone stimuli are described as COS[d, φ] and RAND[d, φ] throughout, to facilitate comparison with the equivalent IRN[d, φ, n] signals.
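As a rough sketch of the complex-tone construction, the following Python code builds the list of component frequencies for a given spacing f and phase parameter φ, then sums the components in cosine or random phase. The function names, the inclusive 10 kHz upper limit, and the absence of any level scaling are assumptions made for illustration.

```python
import numpy as np

def component_freqs(f_hz, phi_deg, fmax_hz=10000.0):
    """Component frequencies k*f + delta_f that lie in (0, fmax_hz]."""
    delta_f = phi_deg * f_hz / 360.0        # delta_f = phi * f / 360
    ranks = np.arange(0, int(fmax_hz // f_hz) + 1)
    freqs = ranks * f_hz + delta_f
    # When delta_f > 0 the lowest component sits at delta_f (rank 0).
    return freqs[(freqs > 0.0) & (freqs <= fmax_hz)]

def complex_tone(f_hz, phi_deg, phase="cos", dur_s=0.5, fs=96000, seed=None):
    """Sum of equal-amplitude sinusoids in cosine ("cos") or random ("rand") phase."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(round(dur_s * fs))) / fs
    x = np.zeros_like(t)
    for fc in component_freqs(f_hz, phi_deg):
        start = 0.0 if phase == "cos" else rng.uniform(0.0, 2.0 * np.pi)
        x += np.cos(2.0 * np.pi * fc * t + start)
    return x
```

With f = 250 Hz (d = 4 ms) and φ = 180°, the components fall at 125, 375, 625 Hz and so on, that is, at odd-integer multiples of f/2.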
We presented a control stimulus of low-pass filtered (at 10 kHz) Gaussian noise (GN) at the same level as the pitch stimuli. All stimuli (IRN, COS, RAND, and GN) were 0.5 s in duration, presented with a 1 s repetition period, gated with 5 ms sin²-on and cos²-off ramps, and were part of a single array presented in an interleaved manner in random order for 25 repetitions. Each stimulus repetition was generated in real time from a new noise waveform (IRN and GN), or with a new random set of starting phases (RAND complex tones). Before the presentation of the complex stimuli, we collected a rate-level function in response to GN of 0.5 s duration. The complex stimuli were then presented at a sound level corresponding to the ∼50% point on the noise rate-level function. Across all units, this corresponded to an overall sound level of between 27 and 65 dB sound pressure level (SPL) (mean, 44.5 dB SPL).
Figure 1 shows example magnitude spectra, waveform autocorrelation functions (ACFs), and Hilbert-envelope ACFs for IRN signals with d = 1 ms (F0 = 1 kHz). The spectral representations (Fig. 1A,B) demonstrate that with increasing φ, the stimulus spectrum drifts upward along the frequency axis. When φ = 0°, there are peaks in the spectrum at integer multiples of 1/d Hz, by φ = 180° the spectral peaks are at odd-integer multiples of 1/2d Hz, and by φ = 360° the spectrum has returned to the “harmonic” condition. The data plotted on polar coordinates and interpolated in Figure 1B demonstrate that varying the parameter φ (along the circumferential axis) results in a continuous shift in the spectrum along the (radial) frequency axis. For display purposes, the spectrum is only shown between 0 and 4 kHz. The dashed black lines in Figure 1B are at integer multiples of 1/d Hz, corresponding to the position of the harmonic spectral peaks (red) when φ = 0°, and corresponding to the spectral troughs (blue) when the signal is perfectly inharmonic (φ = 180°). The waveform ACFs (Fig. 1C,D) show peaks (red) at d and 2d ms when φ = 0°, a null (blue) at d and a peak at 2d when φ = 180°, a transition from a peak to a null centered at d with a null at 2d when φ = 90°, and a transition from a null to a peak centered at d with a null at 2d when φ = 270°. These differences in the waveform ACF reflect the effect of φ on the temporal fine-structure. In contrast, for all values of φ the Hilbert envelope ACF shows peaks at all integer multiples of d (Fig. 1E,F). Therefore, if a neuron responds to the temporal fine structure, its response will be modulated by changes in φ, whereas if it responds to the temporal envelope modulation, the response will be independent of changes in φ.
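The contrast between the waveform ACF and the Hilbert-envelope ACF can be checked numerically. The sketch below uses a deterministic frequency-shifted complex tone rather than IRN, with a fixed set of component ranks (an assumption made so that the envelope is exactly the same for every φ). For such a signal the normalized waveform ACF at lag d equals cos φ, whereas the envelope and its ACF do not depend on φ. The helper names and the circular, FFT-based ACF are illustrative choices.

```python
import numpy as np

def shifted_tone(f_hz, phi_deg, ranks, dur_s, fs):
    """Cosine-phase components at k*f + phi*f/360 for each rank k in ranks."""
    t = np.arange(int(round(dur_s * fs))) / fs
    delta_f = phi_deg * f_hz / 360.0
    return sum(np.cos(2.0 * np.pi * (k * f_hz + delta_f) * t) for k in ranks)

def analytic_signal(x):
    """FFT-based analytic signal; its magnitude is the Hilbert envelope."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0                       # keep DC once
    if n % 2 == 0:
        weights[n // 2] = 1.0              # keep Nyquist once
        weights[1:n // 2] = 2.0            # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def circular_acf(x):
    """Normalized circular autocorrelation of the mean-removed signal."""
    x = x - np.mean(x)
    acf = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real
    return acf / acf[0]

fs, f0 = 48000, 250.0                      # d = 4 ms, i.e. 192 samples
lag = int(round(fs / f0))
for phi in (0.0, 90.0, 180.0, 270.0):
    x = shifted_tone(f0, phi, range(1, 9), 0.16, fs)
    wave_acf = circular_acf(x)[lag]                       # follows cos(phi)
    env_acf = circular_acf(np.abs(analytic_signal(x)))[lag]  # stays at its peak
    print(phi, round(float(wave_acf), 3), round(float(env_acf), 3))
```

The 0.16 s duration is chosen so that every component completes a whole number of cycles for these four values of φ, which makes the circular ACF exact.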
Analysis.
We calculated the all-order interspike-interval distribution for each unit's response to each stimulus condition and constructed interspike-interval histograms (ISIHs) with 50-μs wide bins between 0 and 2.5d. ISIHs are expressed as firing rate by dividing raw bin-counts by the product of bin-width and the total number of spikes (Abeles, 1982; Shofner, 1999) and then smoothed with a sliding 0.45 ms Hanning-window. To determine which interspike intervals (if any) occur significantly more often in response to the complex pitch-evoking stimuli than in response to GN, we performed a Fisher-Pitman permutation test (Berry et al., 2002) on the interspike-interval distributions with the null hypothesis that at each bin the difference between the signal and noise responses is zero. First the difference between the signal and noise response (the observed interval-difference function) is calculated. Next, the interspike-interval distributions from the noise and the signal are pooled and re-sampled without replacement to give two populations of interspike intervals corresponding in size to the signal and noise interval distributions: the permutation distributions. The observed data are resampled in this way for a large number of replications (here, 5000). On each replication, the data are binned in the same way as for the observed data, and the difference between the two populations is calculated. The p value associated with each bin in the observed interval-difference function is then the proportion of replications on which the difference in the permutation distribution exceeds that in the observed data. We considered bins in the interval difference histogram with p < 0.02 to be significant.
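The analysis steps above can be sketched as follows. This is an illustrative Python version, not the custom Matlab code used for the study. Assumptions: spike times are in seconds, one array per stimulus repetition; within the permutation test each histogram is normalized by its number of intervals (the normalization applied to the resampled populations is not specified in the text); and the comparison uses "greater than or equal to" so that empty bins are not spuriously flagged as significant.

```python
import numpy as np

def all_order_intervals(spike_trains, max_interval):
    """Intervals (s) between each spike and every later spike in the same train."""
    out = []
    for train in spike_trains:
        t = np.sort(np.asarray(train, dtype=float))
        diffs = t[None, :] - t[:, None]            # diffs[i, j] = t[j] - t[i]
        out.append(diffs[(diffs > 0) & (diffs <= max_interval)])
    return np.concatenate(out) if out else np.empty(0)

def isih_rate(spike_trains, max_interval, bin_width=50e-6, smooth_s=0.45e-3):
    """All-order ISIH in spikes/s, smoothed with a Hanning window."""
    n_bins = int(round(max_interval / bin_width))
    edges = np.arange(n_bins + 1) * bin_width
    intervals = all_order_intervals(spike_trains, max_interval)
    counts, _ = np.histogram(intervals, bins=edges)
    n_spikes = sum(len(t) for t in spike_trains)
    rate = counts / (bin_width * n_spikes)         # counts / (bin width * spikes)
    window = np.hanning(int(round(smooth_s / bin_width)))
    return edges, np.convolve(rate, window / window.sum(), mode="same")

def permutation_pvalues(sig_intervals, noise_intervals, edges,
                        n_rep=5000, seed=None):
    """Per-bin p values for (signal - noise) under random relabeling."""
    rng = np.random.default_rng(seed)

    def density(v):
        return np.histogram(v, bins=edges)[0] / max(len(v), 1)

    observed = density(sig_intervals) - density(noise_intervals)
    pooled = np.concatenate([sig_intervals, noise_intervals])
    n_sig = len(sig_intervals)
    exceed = np.zeros(len(edges) - 1)
    for _ in range(n_rep):
        perm = rng.permutation(pooled)             # resample without replacement
        diff = density(perm[:n_sig]) - density(perm[n_sig:])
        exceed += diff >= observed
    return observed, exceed / n_rep
```

For a stimulus with delay d, `max_interval` would be 2.5d, and bins with p < 0.02 would be treated as significant.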
Results
We recorded responses to harmonic and inharmonic IRN and complex tones from 84 isolated single units (33 PL, 6 PN, 25 CT, 2 CS, 11 LF, and 7 OC) in the VCN of 20 urethane-anesthetized guinea pigs. The majority of units included in this study had BFs in the range of phase locking. For the PL/PN group, 36 had BFs between 0.5 and 3 kHz, and three had BFs between 5 and 6.25 kHz. The CT/CS group had BFs between 0.5 and 5 kHz, with the majority of units (15 of 27) having BFs in the phase-locking range <1.25 kHz. Low frequency units had BFs <0.5 kHz. The OC units had BFs between 2.7 and 8 kHz. To analyze the temporal responses to the complex pitch stimuli we calculated the all-order interspike-interval distribution for each unit and each stimulus condition by measuring the time between each spike and all subsequent spikes in the same spike train and tallying the intervals in an ISIH. The bin-values are expressed as firing rate (spikes per second) by dividing the raw bin count by the product of bin-width (50 μs) and the total number of spikes in response to the stimulus (Abeles, 1982; Shofner, 1999). When spikes are locked to a particular periodicity there is a peak in the ISIH at the corresponding interval. To estimate which peaks in the ISIH are related to the periodicity of the pitch-evoking signals we express the histograms as the change in firing rate relative to the response to GN presented at the same level as the pitch stimuli.
Temporal fine-structure representation
IRN is constructed by adding a broadband noise, delayed by time d and phase-shifted by φ°, to the original noise and repeating this process for n iterations. The resulting signal has a series of spectral peaks with 1/d-Hz (F0) spacing, and evokes the sensation of “repetition pitch” (Bilsen, 1966). By varying φ, the spectral peaks shift in frequency by Δf = φ/(360d), but the interpeak spacing remains 1/d Hz; therefore altering the waveform temporal fine structure, but not the temporal envelope. Most studies of IRN have used signals with φ = 0° and/or 180°, described by a gain parameter g, because φ = 180° is equivalent to inverting the delayed waveform (g=−1) and φ = 0° is equivalent to a delayed signal gain of 1. Previous studies have used the notation IRN[d,g,n], whereas here we use the alternative IRN[d, φ, n] (with unity gain of the delayed, phase-shifted noise). In addition to IRN stimuli, we also examine responses of VCN single units to inharmonic complex tones with their frequency components matched to the position of the spectral peaks in IRN. A frequency shift Δf is applied to each component of a harmonic complex tone; Δf = φ/(360d). When φ = 0°, the spectral peaks are positioned at integer multiples of 1/d Hz and the signal has a single, unambiguous, pitch of 1/d Hz. In contrast, when φ = 180°, the spectral peaks are at odd-integer multiples of 1/2d Hz, and the pitch is either well defined at 1/2d Hz, or it is ambiguous at 0.88/d, 1.14/d, and 1/2d Hz, depending on the signal's spectral content and, in the case of IRN, n (Bilsen and Ritsma, 1969/70, 1970; Fourcin, 1965; Bilsen, 1966; Yost et al., 1978; Raatgever and Bilsen, 1992; Yost, 1996, 1997).
In the time domain, varying φ changes the stimulus waveform autocorrelation function (ACF). When φ = 0°, there are peaks in the waveform ACF at integer multiples of d milliseconds, with the number of ACF peaks equal to the number of iterations. In contrast, when φ = 180° there are nulls in the waveform ACF (i.e., negative correlation) at odd-integer multiples of d milliseconds, and peaks at even-integer multiples of d milliseconds. When the IRN[d,180,n] signal is bandpass filtered, the ACF nulls at odd-integer multiples of d milliseconds are flanked by ACF peaks, at either side of the null. The position of these flanking peaks (i.e., their distance from the null) is determined by the center frequency of the filter pass band. The distance from the null decreases with increasing center frequency, with the position of the peaks being given by d ± 1/2fc, where fc is the filter center frequency. Similar relationships between d, the filter center frequency, and the position of the waveform ACF peaks exist for other values of φ. For example, when φ = 90°, the first peak in the ACF is at d − 1/4fc, and the first null is at d + 1/4fc, with zero correlation at d milliseconds. In contrast to the waveform ACF, the ACF of the Hilbert envelope of IRN signals is independent of changes in φ (provided d remains constant), with low-amplitude peaks at integer multiples of d milliseconds. The same is true for complex tones. These features of signal processing are of importance when considering the responses of auditory neurons for two reasons. First, neurons are frequency tuned, so that they respond to a band-limited frequency range centered on the unit's BF. Second, the ability to encode the waveform fine structure decreases with increasing BF (decreasing phase-locking strength). Therefore, temporal coding of stimulus periodicity becomes gradually more dominated by the stimulus temporal-envelope periodicity (and therefore independent of φ) with increasing BF.
As predicted from the description of the physical acoustic signals, neurons representing the fine structure of IRN[d,0,n] (or a harmonic complex tone with F0 = 1/d Hz) in their discharge patterns show peaks at integer multiples of d milliseconds in their ISIHs. In response to IRN[d,180,n] (or an inharmonic complex tone with all components shifted by F0/2 Hz) neurons representing the fine structure show nulls (flanked by a pair of peaks) at odd-integer multiples of d milliseconds, and peaks at even-integer multiples of d milliseconds in their ISIHs (Rhode, 1995; Cariani and Delgutte, 1996a; Shofner, 1999; Verhey and Winter, 2006; Sayles and Winter, 2007). This illustrates the equivalence between the waveform ACF and the all-order interspike-interval distribution of a neuron in the phase-locking range of BFs. The position of the ISIH peaks at either side of d milliseconds in response to IRN[d,180,n] depends on the interaction of unit BF and d (Sayles and Winter, 2007), and in modeling studies the position of the corresponding ACF peaks for a given value of d depends on the bandpass filter center frequency (Bilsen and Ritsma, 1969/70; Yost et al., 1978; Yost and Hill, 1979).
The interspike-interval representation of temporal fine-structure for IRN signals with d = 4 ms and d = 8 ms is illustrated for a relatively low-BF primary-like unit in Figure 2. This unit is tuned to 0.86 kHz; therefore it exhibits strong phase-locking to the fine structure at the output of the 0.86 kHz place along the basilar membrane (BM) [vector strength of ∼0.8 in response to pure-tone stimulation at BF (Palmer and Russell, 1986; Winter and Palmer, 1990)]. Conventional ISIHs constructed from the responses to GN (gray shading) and IRN (black) with φ = 0, 90, 180, and 270° show a series of peaks at interspike intervals related to d (but not necessarily at d ms). The position of the peaks varies with φ (Fig. 2A,C). When φ = 0° (top panels), the main peaks in the IRN response are at d and 2d, with the largest peak at d milliseconds. If the largest peak in the all-order interval distribution were the pitch cue, as hypothesized by several models, this unit would indicate a pitch corresponding to the “correct” psychophysical value of 1/d Hz. The smaller peaks at either side of the large peaks are related to the unit's BF, so that in temporal models of pitch processing that sum temporal information (either autocorrelation magnitude or number of interspike intervals) across the BF axis, these small peaks “average out,” leaving a common peak at d milliseconds in the population response, and, thus, represent the unambiguous pitch of 1/d Hz (Meddis and Hewitt, 1991; Cariani and Delgutte, 1996b). By changing φ, and thereby altering the fine structure of the IRN signal, the positions of the ISIH peaks change so that the largest peak is no longer at d milliseconds. The position of the largest peak(s) in each ISIH is well predicted by Bilsen and Ritsma's (1969/70) equations relating the pitch of band-filtered rippled noise to the time interval between major peaks in the fine structure (blue lines). 
When φ = 180°, there is a null in the ISIH at d milliseconds flanked by two approximately equal-amplitude peaks, the position of which is matched by d ± 1/2BF. On the basis of the predominant-interval hypothesis, these peaks would indicate an ambiguous pitch sensation, with one pitch just <1/d Hz and the other just >1/d Hz. Similar predictions can be made for the positions of the ISIH peaks when φ = 90 and 270°. In these conditions, the major peak shifts away from d milliseconds by a small amount, indicating a small upward or downward shift in the perceived pitch, which is predicted by d − 1/4BF and d + 1/4BF, for φ = 90 and 270° respectively (blue lines). The circular plots (Fig. 2B,D) show the representation of stimulus fine-structure as a continuous function of φ. We measured the responses at φ = [0:30:330], and interpolated between these points. Data are shown relative to the unit's response to GN; therefore, the color scale represents a change in firing rate as a function of interspike interval and φ. Starting at φ = 0°, as φ increases, the main peak (red) shifts toward shorter interspike intervals and the next peak at a slightly longer interspike interval becomes gradually more prominent until at φ = 180° both peaks are of approximately equal amplitude. This suggests a gradual increase in ambiguity with increasing φ up to 180°. The opposite is true when φ>180°, with the pattern of peaks suggesting a gradual shift back to the well defined pitch of 1/d Hz at φ = 360°.
Multipolar cells (chopper units) exhibit poorer phase-locking to pure tones than do primary-like units (Blackburn and Sachs, 1989; Winter and Palmer, 1990). Although this limits the ability of chopper units to represent stimulus fine structure in their temporal discharge pattern, if the unit BF is sufficiently low there can still be a salient fine-structure representation in these units (Louage et al., 2005; Verhey and Winter, 2006; Sayles and Winter, 2007). Responses of a low-BF chopper-T unit show a clear representation of the gradually shifting stimulus fine structure of IRN[4, φ, 16] (Fig. 3A,B). Another chopper-T unit with a BF outside the range of phase-locking for chopper units shows a relatively broad peak at d milliseconds independent of φ, representing the stimulus temporal-envelope modulation (Fig. 3C,D). Onset-chopper units typically have very broad receptive fields and have previously been hypothesized as having a role in the detection of common envelope modulation across peripheral frequency channels to enhance signal detection against a modulated background (Pressnitzer et al., 2001b; Verhey et al., 2003). Responses of an onset-chopper unit (BF = 7.35 kHz) to broadband IRN, lowpass-filtered IRN and broadband COS stimuli are shown in Figure 4. If an onset-chopper unit receives only input from low-frequency channels then it is capable of following the fine-structure (Fig. 4A), but if it receives inputs across frequency the unit responds (weakly) to the relatively low-amplitude envelope modulation of IRN (Fig. 4B). The phase of modulation in IRN varies across frequency bands. Therefore it is not surprising that onset-chopper units show a relatively poor temporal representation of the modulation in broadband IRN (Evans and Zhao, 1998). In response to COS complex tones, in which there is common modulation across frequency channels, the onset-chopper unit provides a strong representation of the stimulus temporal-envelope modulation (Fig. 4C).
To determine which interspike intervals in the all-order interval ISIH were represented significantly more in response to the complex pitch-evoking signals than in the response to GN we performed a Fisher-Pitman permutation test on the interspike-interval distributions (see Materials and Methods). This non-parametric approach allows the calculation of a p value for each bin in the ISIH. We considered any bin with p < 0.02 to be significantly different from the response to GN. Figure 5 shows the distribution of significant peaks (localized maxima with p < 0.02) in the interval-difference function (the difference between the ISIHs in response to IRN and GN) as a function of harmonic number (the equivalent harmonic rank of unit BF when φ = 0°) for four values of φ (0, 90, 180, and 270°). The data presented in Figure 5 are calculated from the responses of 45 units (34 PL/PN and 11 LF units) with BFs ≤3.5 kHz. The data are pooled from responses to IRN stimuli with d equal to 2, 4, 8, and 16 ms. Therefore, any individual unit may contribute data points to (at most) four points along the horizontal axis in each plot, with an unlimited number of points (i.e., interspike-interval distribution peaks) along the vertical axis for each point along the horizontal axis (i.e., the harmonic number corresponding to the closest integer multiple of 1/d to the unit's BF for any given value of d). The size of the data point is proportional to the amplitude of the peak in the interval-difference function, as indicated on the scale at the bottom of Figure 5. The dashed red lines show the predicted (normalized for d) position of the largest peak in the waveform autocorrelation function, based on the data of Bilsen and Ritsma (1969/70). When φ = 0° (Fig. 5A), the predicted peak is at a normalized interval of 1, because the largest peak in the fine structure occurs at d milliseconds. 
Shifting the phase of the delayed noise by 180° results in the two largest peaks in the IRN waveform fine structure being at either side of d milliseconds; their normalized position is given by 1±(1/(2h)), where h is the harmonic number at the unit (or filter) center-frequency (Fig. 5C). For φ of 90 and 270°, the normalized positions of the largest fine-structure peaks are given by 1−[1/(4h)] and by 1+[1/(4h)], respectively. The largest peaks in the single-unit all-order interspike-interval responses follow these predictions closely. Smaller, but nevertheless significant, peaks in the ISIHs represent action potentials phase-locked to other (smaller) peaks in the temporal fine structure (or measurement noise).
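The predicted peak positions quoted above for φ = 0, 90, 180, and 270° can be reproduced with a few lines of Python. The general expression used here, in which peaks fall at d + (k − φ/360)/BF for integer k and only those within half a BF period of d are kept, is an assumption that matches the four stated cases; its use for intermediate values of φ is an extrapolation. The function names are hypothetical.

```python
def predicted_peaks_ms(d_ms, bf_khz, phi_deg):
    """Predicted ISIH peak positions (ms) closest to the delay d.

    d_ms is the IRN delay in ms and bf_khz the unit BF in kHz, so that
    1 / bf_khz is one BF period in ms.
    """
    frac = (phi_deg % 360.0) / 360.0
    offsets = [k - frac for k in (0, 1)]            # in units of BF periods
    nearest = [o for o in offsets if abs(o) <= 0.5]
    return [d_ms + o / bf_khz for o in nearest]

def normalized_peaks(d_ms, bf_khz, phi_deg):
    """Peak positions divided by d, i.e. 1 + offset / h with h = BF * d."""
    return [p / d_ms for p in predicted_peaks_ms(d_ms, bf_khz, phi_deg)]

# Example: BF = 0.86 kHz and d = 4 ms, so h = 3.44.
for phi in (0.0, 90.0, 180.0, 270.0):
    peaks = predicted_peaks_ms(4.0, 0.86, phi)
    pitches_hz = [1000.0 / p for p in peaks]        # reciprocal of each interval
    print(phi, [round(p, 3) for p in peaks], [round(f, 1) for f in pitches_hz])
```

At φ = 180° the function returns two equidistant peaks, d − 1/(2BF) and d + 1/(2BF), which is the two-valued case described in the text.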
Influence of temporal-envelope modulation
When components of a complex tone are summed in cosine (COS) phase the temporal envelope of the waveform is highly modulated with a periodicity corresponding to the intercomponent spacing, which, in the case of a harmonic complex tone, is also F0. In contrast, the temporal envelope of IRN is much less “peaky,” and resembles that of a RAND phase complex tone (Winter et al., 2001; de Cheveigné, 2007). We now examine the effects of temporal-envelope modulation on the representation of fine structure by performing the same analyses on responses to COS, RAND, and IRN stimuli as a function of φ (see Materials and Methods). Typical results from a relatively low-BF PL unit show that the peaky envelope of the COS stimulus leads to a stronger representation of the fine-structure compared with RAND and IRN stimuli when the stimulus spectral peaks are less resolved (Fig. 6). This unit has a BF of 1.2 kHz, corresponding to the 2.4th harmonic (“resolved”) when d = 2 ms (Fig. 6A–C) and the 4.8th harmonic (“partially resolved”) when d = 4 ms (Fig. 6D–F). Resolvability is defined according to the rule of Shackleton and Carlyon (1994), in which the number of harmonics in the 10-dB bandwidth of the filter is estimated as 1.8 times the equivalent rectangular bandwidth (ERB) divided by F0. Guinea-pig ERB is given by 0.29 × BF^0.56 (Evans, 2001). When <2 harmonics are present in the 10-dB bandwidth of the unit's receptive field the components are said to be resolved, when >2 but <3.25 harmonics are within the receptive field the components are partially resolved, and when >3.25 harmonics are present the components are “unresolved.” In the resolved condition there is little appreciable difference between the responses to COS and RAND stimuli (Fig. 6A,B). The fine-structure of both stimuli is represented in the interspike-interval distributions and there is no effect of relative component phase.
The main difference between the responses to resolved stimuli is the smaller peak at intervals close to 2d ms in the IRN condition compared with the COS and RAND conditions. This reflects the noise component of IRN making the correlation in the waveform fine structure weaker with increasing delay relative to a random phase harmonic complex (de Cheveigné, 2007). When a unit responds to several components of a complex tone (or IRN) the modulation depth and modulation rate of the temporal envelope at the output of the peripheral filter depend on the phase relationship between the components. For a COS complex tone the envelope is typically peaky, with a period corresponding to the frequency difference between the components, whereas for a RAND complex or IRN (which are physically similar) the envelope is much flatter. A peaky envelope results in the unit responding more precisely to the fine-structure periodicity in the vicinity of envelope maxima, whereas weaker envelope modulation weakens the representation of the fine structure. This influence of component phase (and therefore temporal-envelope modulation) is clear in the single-unit responses; e.g., a peak of ∼750 spikes/s at d milliseconds in response to COS[8,0] is reduced to ∼550 spikes/s in response to RAND[8,0] and ∼450 spikes/s in response to IRN[8,0,16] (Fig. 6D–F).
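The resolvability classification used in this section can be sketched as follows. The sketch takes the number of harmonics in the 10-dB bandwidth as 1.8 × ERB divided by F0, which is the reading that reproduces the “resolved” and “partially resolved” labels quoted for the 1.2 kHz unit. The function names are hypothetical, BF and ERB are assumed to be in kHz, and the treatment of values exactly at 2 or 3.25 (which the rule leaves unspecified) is an arbitrary choice.

```python
def guinea_pig_erb_khz(bf_khz):
    # ERB = 0.29 * BF**0.56 (Evans, 2001); BF and ERB assumed to be in kHz.
    return 0.29 * bf_khz ** 0.56


def harmonics_in_10db_bandwidth(bf_khz, delay_ms):
    # F0 = 1/d; a delay in ms gives F0 in kHz.
    f0_khz = 1.0 / delay_ms
    # 10-dB bandwidth approximated as 1.8 * ERB (Shackleton and Carlyon, 1994).
    return 1.8 * guinea_pig_erb_khz(bf_khz) / f0_khz


def resolvability(bf_khz, delay_ms):
    n = harmonics_in_10db_bandwidth(bf_khz, delay_ms)
    # Assumption: the boundary values 2 and 3.25 fall into the upper class.
    if n < 2:
        return "resolved"
    if n < 3.25:
        return "partially resolved"
    return "unresolved"


for d in (2.0, 4.0):
    n = harmonics_in_10db_bandwidth(1.2, d)
    print(f"d = {d} ms: {n:.2f} harmonics -> {resolvability(1.2, d)}")
```

For the 1.2 kHz unit this gives roughly 1.2 harmonics in the 10-dB bandwidth at d = 2 ms and roughly 2.3 at d = 4 ms.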
Relation to the “first-effect of pitch shift”
When a harmonic complex tone is made inharmonic by the addition of a frequency shift Δf to each component, the perceived pitch shifts (Δp) in the direction of, and by an amount proportional to, Δf. This is known as “the first effect of pitch shift” and is classic evidence against the low “residue” pitch of a series of high-numbered harmonics being the result of a simple difference tone generated by cochlear distortion, because by applying an equal Δf to each component, the difference tone (and presumably any pitch perception based on the detection of it) would remain unchanged (Schouten, 1940; de Boer, 1956a,b, 1976; Schouten et al., 1962; Smoorenburg, 1970; van den Brink, 1970; Goldstein, 1973; Wightman, 1973b; Patterson and Wightman, 1976; Gerson and Goldstein, 1978; Moore and Moore, 2003). Both spectral-pattern matching models and temporal models based on the detection of fine structure have been shown to account for the pitch-shift of inharmonic complex tones. By varying d so as to place the second through 10th harmonic of IRN at the BF of a unit when φ = 0° and then varying φ over 360° (in 30° steps), we show that the single-unit pitch matches based on the first peak in the all-order ISIH are predicted by the linear relation Δp = Δf/N (“de Boer's rule” for the first effect of pitch shift), where N is the harmonic rank of the component centered on BF (Fig. 7). We consider the effect of changing φ over a full cycle (360°) to be equivalent to changing the harmonic number (N) centered at the unit's BF. In the harmonic condition (when φ = 0°) N is the harmonic rank of the nearest integer-multiple of 1/d Hz to BF (Nh). When φ ≤ 180°, N = Nh + (φ/360), and when φ ≥ 180°, N = (Nh − 1) + (φ/360).
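The mapping from phase shift to harmonic number described above, together with de Boer's rule, can be written as a minimal sketch. The function names are hypothetical. Because φ = 180° satisfies both conditions in the text, it is assigned to the first branch here, which is an assumption.

```python
def effective_harmonic_number(n_h, phi_deg):
    """Harmonic number N at BF for integer rank n_h and phase shift phi_deg."""
    if not 0 <= phi_deg <= 360:
        raise ValueError("phi_deg must lie between 0 and 360 degrees")
    # Assumption: phi = 180 degrees, which satisfies both conditions in the
    # text, is handled by the first branch.
    if phi_deg <= 180:
        return n_h + phi_deg / 360
    return (n_h - 1) + phi_deg / 360


def first_effect_pitch_shift(delta_f_hz, n):
    # de Boer's rule for the first effect of pitch shift: delta_p = delta_f / N
    return delta_f_hz / n


# Phase shifts in 30-degree steps, as used for the stimuli, with n_h = 3.
for phi in range(0, 361, 30):
    print(phi, round(effective_harmonic_number(3, phi), 3))
```

At φ = 360° the function returns n_h again, consistent with a full cycle of φ restoring the harmonic condition.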
The responses of a 1.2 kHz BF primary-like unit (not the same unit as in Fig. 6) to IRN, with the third (Fig. 7A) and seventh (Fig. 7B) harmonics positioned at BF, demonstrate the dependence of Δp on N. As the harmonic number is increased the relative deviation of the ISIH peaks from the dashed lines at d and 2d is decreased. By considering peaks in the all-order ISIH as “pitch matches,” we show a representation of the pitch-shift of IRN as a function of harmonic number (N) for the same single unit (Fig. 7C). Both single-unit responses to 200% amplitude-modulated tones (Rhode, 1995) (black circles), and predictions based on human behavioral data (Schouten et al., 1962) (black dashed lines) closely follow the single-unit responses to IRN. The general trend in the data is for a large peak in the ISIH (strong pitch representation) near to a normalized pitch of 1 at each harmonic condition, and for the peaks to become smaller (weaker pitch representation) as the complex becomes increasingly inharmonic. At many values of N, this unit predicts multiple pitch ambiguities (i.e., multiple peaks in the response for any vertical line through Fig. 7C). The data from Rhode (1995) in response to 200% AM tones deviate more from the linear prediction than do the data from this single unit in response to IRN. In particular, at the low harmonic numbers the pitch changes more rapidly with increasing N for the AM tone data than it does for the responses to IRN.
To compare the relative strengths of the predicted pitch matches across stimulus conditions (COS, RAND, IRN) as a function of N, we normalized the significant portions (peaks and nulls) of the interval-difference function by dividing by the maximum significant (peak) value for the COS condition for each unit and each delay condition across all values of φ. Thus the amplitudes of the ISIH peaks are expressed relative to the response of the same unit to the COS stimulus. Typically the maximum interval-difference occurred in response to the COS stimulus at a normalized interspike-interval of 1 when φ = 0° (the harmonic condition). This normalization procedure results in a normalized change in firing rate which varies between −1 and +1. As a function of harmonic number this analysis, for a population of 34 PL/PN units with BFs ≤3.5 kHz, shows a similar pattern of results for COS, RAND, and IRN signals (Fig. 8). In each case, the pitch matches based on the all-order ISIH distribution closely follow the predictions of the first-effect (dashed lines), with the major significant peaks in the interval distribution centered at a normalized pitch of 1 when the signal is harmonic, and shifting away from 1 when the signal becomes inharmonic. When the signal is maximally inharmonic (halfway between two harmonic conditions, when φ = 180°), there are two approximately equal-amplitude peaks, one indicating an upward shift in pitch and one indicating a downward shift. There is also a series of peaks in the response centered on a normalized-pitch value of 0.5. When the stimulus is maximally inharmonic these “lower-octave” peaks correspond to the true F0 of the signal, because the spectral peaks are at odd-integer multiples of 1/(2d) Hz.
Lower-octave matches have been shown in human psychophysical studies using similar inharmonic signals (Gerson and Goldstein, 1978); indeed, “octave errors” are often reported in behavioral pitch-matching studies, with the authors choosing to “correct” these matches. Comparing the responses to COS, RAND, and IRN stimuli, there is little appreciable difference between the three representations in the region of the first through fourth harmonic, because the components are resolved by the cochlear filters. In higher harmonic regions the response to COS stimuli is stronger than the response to RAND and to IRN, an effect which is consistent with decreasing harmonic resolution and therefore greater influence of the temporal envelope of the COS stimulus on the response. For example, in the region of the eighth through 10th harmonic the response to RAND and IRN stimuli is approximately half as strong as the response to COS stimuli. Although IRN (with a large iteration number) is physically similar to a RAND complex tone (de Cheveigné, 2007), the effect of the noise component of IRN is clearly visible in the responses at interspike intervals corresponding to the lower-octave pitch matches. In this region, the response to IRN is weaker than the response to RAND complex tones, because of the cumulative effects of the (uncorrelated) noise with increasing autocorrelation delay.
The dominance region for pitch
It is well known that the lower harmonics of a complex tone are more effective at conveying pitch than are high-numbered harmonics, leading to the notion of a “spectral dominance region” for pitch (Plack and Oxenham, 2005). Although the dominance region appears to be quite variable across individuals and between experiments, and shows some dependence on F0, there is general agreement that a region around the fourth harmonic (spanning the second through fifth) dominates the pitch percept (Plomp, 1967; Ritsma, 1967; Moore et al., 1985; Dai, 2000). The existence of a dominance region for the repetition pitch of IRN has been established in human listeners (Bilsen and Ritsma, 1970; Yost et al., 1978; Leek and Summers, 2001) and in animal behavioral experiments (Shofner and Yost, 1997), and recently a physiological correlate of this has been proposed in the same species (Shofner, 2008a,b). Temporal models of pitch processing have shown that the pitch of inharmonic complex sounds can be predicted from the autocorrelation of the stimulus waveform bandpass filtered in the dominance region (Bilsen and Ritsma, 1967/68, 1969/70; Yost et al., 1978; Yost and Hill, 1979). Because the responses of auditory neurons are driven by a narrowband filtered version of the broadband stimulus waveform (by virtue of basilar-membrane filtering), the temporal responses of single neurons to inharmonic complex sounds can also be predicted from the autocorrelation of the band-filtered waveform (Sayles and Winter, 2007). Therefore there are two aspects to consider when examining the temporal representation of pitch in terms of the dominance region: the mechanism by which greater perceptual weight is applied to the dominance region, and the accuracy with which the perceived pitches are represented by neurons tuned to the dominance region.
Here we present analyses examining both the preferential weighting of information in the dominance region, and the accurate temporal representation of the perceived pitch(es) by neurons in the dominance region.
Using a measure based on the relative fourth moment of IRN and GN all-order ISIHs, Shofner (2008a,b) provided evidence for the existence of a “dominance region” in the responses of chinchilla VCN primary-like neurons to infinitely iterated rippled noise with a delay of 4 ms (IRN[4,0,∞] in the notation used here), but did not find a similar region in the responses of chopper neurons. The aim of Shofner's study was to examine the mechanism by which greater perceptual weight is applied to the dominance region, and to establish a neurophysiological correlate of the weighting function previously identified in chinchilla behavioral experiments. The fourth moment of a waveform is related to the variance in instantaneous power (Hartmann and Pumplin, 1988), and as used in Shofner's analysis, indicates the relative power in the ISIH measured in response to IRN, in comparison with the ISIH measured in response to GN. The ISIHs in response to IRN were renormalized, with the firing rate $R_\tau$ expressed relative to the mean firing rate $\bar{R}$, such that the normalized firing rate $\lambda_\tau = (R_\tau - \bar{R})/\bar{R}$. The average fourth moment, calculated over a 50 ms window, is given by $\bar{\lambda}^4 = \sum_\tau (\lambda_\tau - \bar{\lambda})^4 / N$, and the relative fourth moment (dB) by $\mathrm{dB} = 10\log_{10}\left[\bar{\lambda}^4_{\mathrm{IRN}} / \bar{\lambda}^4_{\mathrm{GN}}\right]$.
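The relative fourth-moment measure described above can be illustrated with a minimal sketch. It assumes that each ISIH is supplied as a list of firing rates per interval bin covering the 50 ms window, that N is the number of bins, and that the GN histogram is normalized in the same way as the IRN histogram. The function names and the toy histograms are hypothetical and are not data from this study.

```python
import math


def normalized_rate(isih):
    # lambda_tau = (R_tau - mean R) / mean R
    mean_rate = sum(isih) / len(isih)
    if mean_rate == 0:
        raise ValueError("mean firing rate is zero; cannot normalize")
    return [(r - mean_rate) / mean_rate for r in isih]


def fourth_moment(isih):
    # Average fourth moment of the normalized histogram over its N bins.
    lam = normalized_rate(isih)
    lam_mean = sum(lam) / len(lam)  # zero up to rounding error
    return sum((x - lam_mean) ** 4 for x in lam) / len(lam)


def relative_fourth_moment_db(isih_irn, isih_gn):
    m_irn = fourth_moment(isih_irn)
    m_gn = fourth_moment(isih_gn)
    if m_irn == 0 or m_gn == 0:
        raise ValueError("a flat histogram has zero fourth moment")
    return 10 * math.log10(m_irn / m_gn)


# Toy histograms (spikes/s per bin): a peaky "IRN" response and a flatter "GN" one.
toy_irn = [10, 10, 50, 10, 10, 10, 50, 10]
toy_gn = [18, 22, 20, 19, 21, 20, 22, 18]
print(relative_fourth_moment_db(toy_irn, toy_gn))
```

A peakier histogram yields a larger relative fourth moment, whether the peaks arise from fine structure or from envelope modulation.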
The dominance region in Shofner's single-unit responses is correlated with the behavioral dominance region previously reported in the same species (Shofner and Yost, 1997). However, Shofner's data were limited to responses to IRN with a 4 ms delay. Therefore, it is not clear whether the physiological dominance region identified around the fourth harmonic (1 kHz) is a harmonic dominance region (i.e., the dominant harmonics would be independent of changes in d), or whether the region identified simply reflects the strong phase-locking of neurons tuned to the region of 1 kHz. If the latter is true, the dominant harmonics would change with d as the absolute frequency region of dominance would remain fixed at ∼1 kHz.
We have applied Shofner's ISIH fourth-moment analysis to a population of primary-like and chopper unit responses to IRN with a range of delays (Fig. 9). In addition to the data recorded in response to IRN[d,0,16] (39 PL/PN, 27 CT/CS, 11 LF units) we included data recorded in response to IRN(+) from an additional 3 PL units, 9 CT units, and 8 LF units in this analysis. The analysis is shown for four different values of d (i.e., four different F0s), as indicated by the color legend. In general, the relative fourth moment decreases with increasing BF for both primary-like and chopper units in a manner consistent with the difference in phase-locking between the two unit types. It is important to realize that the relative fourth moment measures the combined response to temporal fine structure and envelope modulation; i.e., it is simply a measure of how peaky the histogram in response to IRN is, and does not distinguish between fine-structure and modulation peaks. As a function of unit BF (Fig. 9A,C), the response of primary-like neurons decreases monotonically for all IRN delays. In the region around 0.7–1.5 kHz the primary-like units show a stronger temporal response than the chopper population, consistent with this being the region in which their phase-locking ability differs. This enhanced temporal response at ∼1 kHz may correspond to the dominance region identified in Shofner's analysis. Indeed, previous studies have identified neurophysiological correlates of the dominance region in the temporal responses of cat ANFs and related these to the strong phase locking at ∼1 kHz (Cariani and Delgutte, 1996a,b). The fall-off in phase-locking ability with increasing BF has often been associated with the fall-off in pitch salience with increasing center frequency for narrowband SAM tones, with the upper limit of phase locking imposing an upper limit to the existence region for tonal pitch (Ritsma, 1962; Cariani and Delgutte, 1996a,b). For the chopper population (Fig. 9C), there is a small peak in the response between ∼2 and 5 kHz, which is especially marked for IRN with d = 2 ms. This is likely because of a response to the envelope modulation of IRN which, at 500 Hz, is close to the intrinsic chopping frequency of many chopper units. When the data are plotted as a function of harmonic number, there seems to be no evidence of a spectral dominance region tuned specifically to the fourth harmonic from a simple analysis of the relative power in the ISIH for either primary-like or chopper units.
Autocorrelation models of pitch processing are able to account for the pitch matches obtained in response to inharmonic complex sounds by assuming that the brain applies the greatest perceptual weight to the autocorrelation functions (or, equivalently, all-order interspike-interval distributions) calculated on the output of filters centered ∼4/d Hz (Bilsen and Ritsma, 1967/68, 1969/70; Yost et al., 1978; Yost and Hill, 1979). The population analyses presented in Figure 10 examine the neural pitch-matches obtained by pooling the normalized interspike-interval distributions from all primary-like units with BFs <3.5 kHz, and from all chopper units with BFs <1.25 kHz. The unit BFs were all in the region of the second through fifth harmonics, for COS, RAND, and IRN stimuli. Data from 28 PL/PN units and from 12 CT/CS units are included in this analysis. In each plot the solid lines represent the population mean pitch-match profile for each phase-shift condition (indicated by the color code), and the shaded area around the solid lines represents the 95% confidence limits. In general, the neural pitch-matches are qualitatively similar to the human behavioral pitch-matches obtained with similar IRN stimuli [Fig. 10, compare with the data of Raatgever and Bilsen (1992), their Fig. 3]. The neural pitch matches are also similar to those predicted on the basis of the largest peaks in the autocorrelation function of the stimulus bandpass filtered in the region around the fourth harmonic.
Despite the limitations of comparing the guinea-pig neural pitch matches to human psychophysical performance, the data in Figure 10 indicate a qualitative correspondence between the overall patterns of pitch matches obtained here and in previous human psychophysical experiments such as those of Raatgever and Bilsen (1992). The primary-like unit responses to COS[d,0], RAND[d,0], and IRN[d,0,16] (Fig. 10A–C, blue lines) each show a large peak at a normalized-pitch value of 1 (i.e., 1/d Hz). This corresponds to the well defined unambiguous pitch of these harmonic stimuli. When the spectral peaks are shifted by 1/(2d) Hz (red lines), the neural responses show peaks at normalized-pitch matches of ∼0.88, ∼1.14, and ∼0.5, corresponding to the perception for narrowband IRN[d,180,n] (Raatgever and Bilsen, 1992). The green and yellow lines indicate the population responses in the region of the second through fifth harmonic to stimuli with φ = 90° and 270°, respectively. In both these conditions the main peak in the response is shifted away from a normalized pitch of 1, toward either a slightly higher pitch of ∼1.07 (green line, φ = 90°), or a slightly lower pitch of ∼0.94 (yellow line, φ = 270°). Again, these follow the predicted pitch matches based on human behavioral experiments with remarkable accuracy (Bilsen, 1966; Bilsen and Ritsma, 1969/70). The main difference between the responses to COS, RAND, and IRN is the size of the peak at ∼0.5 (i.e., the lower-octave matches). When φ = 180°, this peak is largest in both the COS and RAND conditions; however, in response to IRN it is approximately equal in size to the peaks at ∼0.88 and ∼1.14. This suggests that when listening to band-filtered COS or RAND complex tones with all components shifted by F0/2 Hz, the pitch will be matched to the true F0 value more often than when listening to the equivalent IRN[d,180,n].
The neural pitch matches calculated from the responses of the chopper-unit population follow the predicted matches less accurately. By calculating the distance between the predicted pitch matches and the nearest major peaks in the neural pitch-match profiles we estimate the “error” of the neural pitch matches. The mean percentage error (±SEM) for the PL/PN population is 0.42% (±0.13%), and for the CT/CS population 1.58% (±0.37%). For a fundamental frequency of 250 Hz (d = 4 ms) this corresponds to a mean time-domain error of 16 and 63 μs for PL/PN and CT/CS groups respectively.
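The conversion from percentage error to time-domain error quoted above can be checked with a short sketch. It assumes that a small percentage error in the pitch match is treated as the same percentage of the period d; the function name is hypothetical.

```python
def time_domain_error_us(percent_error, delay_ms):
    # Assumption: a small percentage error in pitch is treated as the same
    # percentage of the period d, converted from ms to microseconds.
    return percent_error / 100.0 * delay_ms * 1000.0


# Mean percentage errors reported for the two populations, with d = 4 ms.
for label, pct in (("PL/PN", 0.42), ("CT/CS", 1.58)):
    print(label, round(time_domain_error_us(pct, 4.0), 1), "us")
```

With d = 4 ms this gives approximately 16.8 and 63.2 μs, in line with the values of 16 and 63 μs given in the text.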
Discussion
Temporal fine-structure
Temporal-envelope information is important for speech understanding in quiet (Shannon et al., 1995; Smith et al., 2002) and can support rudimentary pitch perception in cochlear-implant listeners (Moore and Carlyon, 2005). However, temporal-envelope periodicity is degraded by noise and by reverberation (Moore and Carlyon, 2005; Qin and Oxenham, 2005; Sayles and Winter, 2008b), and recent evidence suggests pitch perception in real environments relies heavily on fine-structure periodicity (Smith et al., 2002; Sayles and Winter, 2008b). Individuals with cochlear hearing impairment show specific deficits in their ability to make use of fine-structure cues (Lorenzi et al., 2006; Hopkins et al., 2008), and cochlear implants provide only temporal-envelope information (Wilson et al., 1991; Moore and Carlyon, 2005). Current interest in fine-structure information is therefore based on understanding the extent to which it is required for normal auditory function, the extent to which its use is limited by hearing impairment, and the development of technology capable of restoring fine-structure sensitivity.
Reports of IRN fine-structure representation in the VCN have demonstrated that primary-like units provide a more robust representation than do chopper units, consistent with phase-locking in chopper units being limited to lower frequencies (Shofner, 1999; Verhey and Winter, 2006; Sayles and Winter, 2007). The present results show an accurate temporal representation of IRN fine structure, with significant peaks in all-order ISIHs at intervals predicted by autocorrelation models of monaural and dichotic repetition-pitch perception acting on narrowband IRN (Bilsen and Ritsma, 1969/70; Bilsen and Goldstein, 1974; Yost et al., 1978). Thus, at the level of the VCN the fine-structure representation of the pitch of inharmonic complex sounds established at the level of the auditory nerve is preserved. Because the upper limit of phase-locking decreases as the auditory pathway is ascended, it is commonly believed that the temporal representation of pitch is transformed to a more stable rate-based representation at higher levels, probably in the inferior colliculus (Winter, 2005). It has been suggested that a key component in this transformation is the bandpass periodicity tuning of VCN chopper units in response to AM tones and to IRN (Keilson et al., 1997; Winter et al., 2001; Wiegrebe and Meddis, 2004). However, it is important to realize that this “chopper model” of pitch requires chopping periods as long as 30 ms (to account for the lower limit of pitch), whereas most VCN chopper units have an intrinsic periodicity in the range of 2–5 ms. Therefore, the contribution of VCN primary-like units in conveying the pitch-related information to higher levels should not be ignored.
Pitch-shift and pitch-ambiguity
We have shown that temporal representations of the pitch-shift and pitch-ambiguity of three classes of broadband inharmonic complex sounds (COS, RAND, and IRN) rely on the use of fine-structure information in the phase-locked discharge patterns of VCN neurons. In human psychoacoustic experiments (Schouten et al., 1962) and in ANF recordings in the cat (Rhode, 1995) the pitch-shift of inharmonic SAM tones varies faster than the linear relation Δp = Δf/N predicts. This deviation from linearity is known as the second effect of pitch shift, and is thought to be attributable to combination tones (Smoorenburg, 1972; Buunen et al., 1974). There appears to be no second effect in the present data. It is important to consider which factors may account for this. When AM of frequency fm is applied to a tone, or to a narrowband noise, cochlear distortion generates a combination tone on the BM at the fm place (Wiegrebe and Patterson, 1999). This provides a possible confound in experiments examining neural responses to AM tones, because a response at fm could arise either by the detection of AM at the output of a high-frequency filter, or it could arise from the detection of the (relatively intense) combination tone. This is a particular criticism of studies showing neurons with BFs at or near to fm responding to a group of high-numbered harmonics well outside of the unit's pure-tone response area (Biebel and Langner, 2002; McAlpine, 2004). The generation of combination tones in response to harmonic complex tones is dependent on simple phase relationships between components (Pressnitzer et al., 2001a; Pressnitzer and Patterson, 2001). Such phase relationships are weak or absent in IRN, and evidence suggests the distortion generated by IRN is relatively low in sound level (Yost et al., 1998). This could explain the lack of a second effect in the neural responses to IRN, but not in response to COS.
Here, stimuli were presented at a relatively low sound level (∼10 dB above threshold), which may account for the apparent lack of distortion. An alternative explanation is that because the data shown in Figure 8 are averaged across a population of units, any small effects of distortion may have been “averaged out,” although in the single-unit data (Fig. 7) the interspike-interval distribution peaks follow the linear predictions very closely with no systematic deviation in the direction expected for the “second effect.” The data from Rhode (1995) replotted in Figure 7C deviate from the linear predictions, especially at low harmonic numbers.
The temporal representation of pitch ambiguity at the level of the VCN may be viewed similarly to low-level representations of perceptual ambiguity for other stimulus parameters in audition, such as the ambiguity between percepts of one sound source and two sources when listening to pure-tone sequences (Pressnitzer and Hupé, 2006; Pressnitzer et al., 2008), and representations of perceptual ambiguity in the visual system (Tong et al., 2006). Attention and context have strong influences on both visual (Kanwisher and Wojciulik, 2000) and auditory (Fritz et al., 2007) perception. The pitch heard at any one instant when listening to inharmonic complex sounds may be the result of a process akin to the central switching mechanisms proposed for resolving perceptual ambiguity in other systems (Leopold and Logothetis, 1999) and may involve descending interaction between auditory cortex and brainstem structures such as the inferior colliculus (Winer, 2005; Nakamoto et al., 2008).
Spectral dominance
Recently, a neural correlate of the spectral dominance region for pitch has been proposed based on an analysis of the fourth moment of all-order ISIHs from the responses of VCN primary-like units to IRN (Shofner, 2008a,b). The responses of chopper units did not show a dominance region. Performing a similar analysis on the population of primary-like and chopper units here we found no evidence for a dominance region which remains fixed in terms of harmonic number for a range of fundamental frequencies. Instead the current data suggest that the dominance region identified in earlier studies may reflect a difference in phase-locking ability between primary-like and chopper units in the region of ∼1 kHz.
The psychophysical pitch matches obtained when using inharmonic IRN stimuli have been successfully predicted on the basis of an autocorrelation mechanism operating in the region of the fourth harmonic (Bilsen and Ritsma, 1967/68, 1969/70; Yost and Hill, 1979; Yost, 1997) and by spectral pattern-matching models operating within the same spectral region (Raatgever and Bilsen, 1992; Cohen et al., 1995). The analyses presented here show that the temporal fine-structure representation in a population of VCN units represents the correct pitch matches when restricted to a similar spectral region. Comparing the neural data (Fig. 10) to the psychophysical data presented by Raatgever and Bilsen (1992) indicates a close correspondence between the neural responses to IRN and psychophysical responses to inharmonic comb-filtered noise when both are filtered in the dominance region, with approximately equal probability of matching the pitch to ∼0.88/d, ∼1.14/d, and ∼1/(2d) Hz in both cases. The mechanism by which the “central processor” applies greater weight to the region around the fourth harmonic when computing pitch may be simply related to the fall-off in phase-locking, or may involve some other more sophisticated process such as a lateral inhibitory network (Yost and Hill, 1979; Yost, 1982).
Conclusions
The temporal discharge patterns of guinea-pig VCN units provide a representation of the stimulus-waveform fine structure for inharmonic IRN and complex tones. Despite some differences between the peripheral auditory system in guinea-pigs and humans [e.g., cochlear filters may be narrower in humans (Shera et al., 2002), and the upper limit of phase-locking may differ between the two species (Palmer and Russell, 1986; Moore, 2003)], these low-level stimulus representations provide important insights into the processing of pitch-related information for inharmonic complex sounds by the mammalian auditory system. Further processing, and ultimately the formation of the “pitch percept” by higher levels of the auditory system, are likely to differ more across species than these peripheral representations. The fine-structure representation, based on the all-order interspike-interval distribution, predicts the pitch-shift and pitch-ambiguity of inharmonic complex sounds in line with classic theoretical and behavioral studies. Within the dominance region for pitch, the ambiguous neural pitch matches are similar to the ambiguous pitch-matches found in human behavioral experiments using similar stimuli. We conclude, tentatively, that these aspects of human (and other animals') pitch perception are mediated by similar neuro-temporal mechanisms, with an unknown contribution from higher-level processing.
Footnotes
- This work was supported by a grant from the Biotechnology and Biological Sciences Research Council (I.M.W.). M.S. receives financial support from the Frank Edward Elmore fund of the Cambridge MB/PhD program, and from the Leatherseller's Company, London, UK. We thank Daniel Pressnitzer and Adrian Fourcin for helpful comments on an earlier version of this manuscript and Lowel P. O'Mard for programming assistance.
- Correspondence should be addressed to Mark Sayles. ms417{at}cam.ac.uk or sayles.m{at}gmail.com
This article is freely available online through the J Neurosci Open Choice option.