
The Journal of Neuroscience, May 15, 1998, 18(10):3786-3802

Temporal and Spectral Sensitivity of Complex Auditory Neurons in the Nucleus HVc of Male Zebra Finches

Frédéric E. Theunissen and Allison J. Doupe

Sloan Center for Theoretical Neuroscience and Keck Center for Integrative Neuroscience, Departments of Physiology and Psychiatry, University of California, San Francisco, San Francisco, California 94143-0444

    ABSTRACT

Complex vocalizations, such as human speech and birdsong, are characterized by their elaborate spectral and temporal structure. Because auditory neurons of the zebra finch forebrain nucleus HVc respond extremely selectively to a particular complex sound, the bird's own song (BOS), we analyzed the spectral and temporal requirements of these neurons by measuring their responses to systematically degraded versions of the BOS. These synthetic songs were based exclusively on the set of amplitude envelopes obtained from a decomposition of the original sound into frequency bands and preserved the acoustical structure present in the original song with varying degrees of spectral versus temporal resolution, which depended on the width of the frequency bands. Although excessive temporal or spectral degradation eliminated responses, HVc neurons responded well to degraded synthetic songs with time-frequency resolutions of ~5 msec or 200 Hz. By comparing this neuronal time-frequency tuning with the time-frequency scales that best represented the acoustical structure in zebra finch song, we concluded that HVc neurons are more sensitive to temporal than to spectral cues. Furthermore, neuronal responses to synthetic songs were indistinguishable from those to the original BOS only when the amplitude envelopes of these songs were represented with 98% accuracy. That level of precision was equivalent to preserving the relative time-varying phase across frequency bands with resolutions finer than 2 msec. Spectral and temporal information are well known to be extracted by the peripheral auditory system, but this study demonstrates how precisely these cues must be preserved for the full response of high-level auditory neurons sensitive to learned vocalizations.

Key words: birdsong; song system; zebra finch; HVc; complex sound; natural sound; time-frequency; temporal-spectral; modulation transfer function; auditory cortex; speech

    INTRODUCTION

Temporal and spectral cues are critical for the identification of complex vocalizations such as speech, as shown in psychophysical experiments that use systematic degradations of the speech signal along these parameters (Liberman et al., 1967; Drullman, 1995; Drullman et al., 1995; Shannon et al., 1995). Moreover, temporal processing is thought to be critically involved in disorders of speech and language learning (Merzenich et al., 1996; Tallal et al., 1996). Very little is known, however, about the spectral and temporal sensitivity of the high-level central neurons that must mediate complex sound processing. In recent studies, researchers have described the response properties of neurons in the auditory cortex of cats and primates that are tuned to certain characteristics of natural sounds (Ohlemiller et al., 1994; Schreiner and Calhoun, 1994; Rauschecker et al., 1995; Wang et al., 1995). Birdsong provides a particularly useful model for studying the neural basis of complex vocalizations, however, because, like speech, song is a learned behavior and depends on auditory experience (Marler, 1970; Konishi, 1985). Moreover, song acquisition and production are mediated by a specialized set of forebrain sensorimotor areas unique to species that learn their vocalizations (Nottebohm et al., 1976; Kroodsma and Konishi, 1991). Electrophysiological experiments have shown that the brain areas for song contain some of the most complex auditory neurons known. These "song-selective" neurons respond more strongly to the sound of the bird's own song (BOS) than to almost any other sounds, including simple stimuli such as pure tone or broadband noise bursts, and complex stimuli such as closely related songs of other individuals of the same species (conspecifics) (Margoliash, 1983, 1986; Margoliash and Fortune, 1992; Margoliash et al., 1994; Lewicki, 1996; Volman, 1996). 
These neurons are also sensitive to the temporal context of the sounds within the BOS, because both the BOS played in reverse and isolated sections of the BOS, which elicit strong responses in their natural context, are ineffective stimuli (Margoliash, 1983; Margoliash and Fortune, 1992; Lewicki and Arthur, 1996). Moreover, systematic modification of some of the parameters of white-crowned sparrow songs demonstrated the dependence of HVc neural responses on both spectral and temporal features of song (Margoliash, 1983, 1986). The highly selective auditory properties of these neurons and the fact that these features emerge during song learning suggest that these neurons play an important role in vocal learning and in the discrimination of adult vocalizations (Margoliash, 1983; Margoliash and Fortune, 1992; Volman, 1993; Doupe, 1997).

Although the general importance of spectral and temporal context for the response of HVc neurons was clear, in this study we developed a systematic and broadly applicable methodology, based on a time-frequency decomposition that is commonly used in speech analysis (Flanagan, 1980), to describe any song completely with a relatively simple set of parameters. This parametrization allowed us to define explicitly the spectral and temporal structure of these complex natural sounds. We then systematically modified the parameters in the decomposition to generate a series of synthetic versions of the BOS that preserved varying degrees of the temporal and spectral structure present in the original song. By comparing the response of HVc song-selective neurons to these synthetic songs with their response to the original BOS, we were able to characterize features of the temporal and spectral structure in the BOS that were essential for HVc neurons, and to quantify the sensitivity of the neuronal responses to the exact preservation of these features. This characterization also revealed the striking precision with which the temporal and spectral structure present in these learned vocalizations needs to be preserved from the auditory periphery to higher order auditory centers.

    MATERIALS AND METHODS

Song selection and recording

Two to three days before the experiment, an adult male zebra finch was placed in a sound-attenuated chamber (Acoustic Systems, Austin, TX) to obtain clear audio recordings of its mature, crystallized song (this species usually sings only one song type as adults). An automatically triggered audio system was used to record ~90 min of bird sounds, containing many samples of the song of the bird. The tape was scanned, and 10 loud, clear songs were digitized at 32 kHz and stored on a computer. Those songs were assessed further by calculating their spectrograms and by examining them visually. A representative version was then chosen from those 10 renditions and analyzed by a custom-made computer program to obtain a parametric representation based on the spectral and temporal components of the song (see below).

Zebra finch songs are organized into simple elements often called syllables. These syllables are in turn organized into a set sequence that is called a song phrase or motif. The motif is repeated multiple times in a song (Zann, 1996, pp 214-215). We chose songs that varied in length between 1.1 and 2.3 sec and consisted of two or three motifs. The length of the song is important because it has been reported that HVc neurons integrate over long periods of time and that the maximal responses are not necessarily found in the first motif (Margoliash and Fortune, 1992; Sutter and Margoliash, 1994).

Parametric representation of song

The analysis consisted of decomposing the original song into a set of narrowband signals by filtering the song through a bank of overlapping filters (Fig. 1A). The narrowband signals could then be represented by two parameters, one that describes the amplitude envelope and one that describes the time-varying phase of the carrier frequency. The set of time-varying amplitude envelopes characterizes the time-varying power in each frequency band and therefore represents both the spectral and temporal structure of the song. The time-varying phase carries additional spectral and temporal information for each band, but as we will describe in detail in Synthetic songs, this information can become redundant with the information embedded in the joint consideration of the amplitude envelopes. In this section, we describe the mathematics involved in the original decomposition. The next section describes what aspects of the spectral and temporal structure are actually represented in the amplitude and phase components, how the two are related, and how we used variations of these parameters to generate songs with specific spectral and temporal properties.


Figure 1. A, Schematic showing the decomposition of a complex sound into a set of narrowband signals, each described by an amplitude envelope and a frequency-modulated carrier. The complex sound is the input to a filter bank composed of a set of adjoining, and in this case overlapping, filters that cover the frequency range of interest. The narrowband output signals of two of the filters in the bank are shown. The envelope that was obtained with the analytical signal is drawn. The carrier frequency is centered at the frequency corresponding to the peak of the filter and has slow frequency modulations that are not easily discernible in this figure. B, Overall filter transform (thick line) obtained from a set of overlapping Gaussian filters (thin lines), the center frequencies of which are separated by one bandwidth (1 SD). The overall filter transform is almost perfectly flat for a large frequency range. In this example, we used 15 Gaussian filters with a bandwidth of 500 Hz and center frequencies between 500 and 4000 Hz.

The decomposition of each narrowband signal into its amplitude and phase constituents was obtained using the analytical signal (Cohen, 1995). As will be emphasized below, this particular decomposition generates an amplitude envelope function that is identical to the one obtained by calculating the short time Fourier transform of the signal, just as is done when a spectrogram is generated. In addition, this operation generates the phase of the short time-window Fourier transform in a form that is continuous with time and that can be interpreted as an instantaneous frequency modulation. A detailed mathematical description of this parametric representation can be found in Flanagan (1980). The decomposition is briefly summarized here.

The original signal $s(t)$ is first divided into $n$ bandpassed component signals $s_n(t)$. To be able to resynthesize the original signal from the bandpassed components, we must choose the filters in the filter banks so that the overall filter transform (obtained by summing the transforms from each filter) is flat over all frequencies occupied by $s(t)$. In addition, the phase distortion of each filter must be insignificant. If these requirements are satisfied, we can recreate the original sound by summing all of the bandpassed signals:
$$s(t) = \sum_{n} s_n(t).$$

In our decomposition, we used Gaussian filters that were separated along the frequency axis by exactly 1 SD. It can be shown analytically (and verified numerically) that the deviations from a flat amplitude transform in that case are of the order of 1 in $10^9$ (Fig. 1B). Enough filters were used to cover the frequency range from 500 to 8000 Hz. The filtering was performed digitally in the frequency domain, resulting in no phase distortions.
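
The flatness of the summed Gaussian transform can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the code used in the study: the helper name `gaussian_bank`, the 500 Hz bandwidth, and the range of center frequencies are illustrative choices. The bank deliberately extends well beyond the evaluated frequencies so that edge effects do not enter the measurement.

```python
import numpy as np

def gaussian_bank(freqs, centers, sd):
    """Gaussian transfer functions: one row per filter, one column per frequency."""
    return np.exp(-0.5 * ((freqs[None, :] - centers[:, None]) / sd) ** 2)

sd = 500.0                                       # filter bandwidth (1 SD) in Hz, assumed
centers = np.arange(-5000.0, 15000.0 + sd, sd)   # center frequencies spaced by exactly 1 SD
freqs = np.linspace(2000.0, 8000.0, 6001)        # interior region, far from the bank edges

# Overall filter transform: sum of the individual transfer functions.
overall = gaussian_bank(freqs, centers, sd).sum(axis=0)

# Peak-to-peak deviation from flatness, relative to the mean level.
ripple = (overall.max() - overall.min()) / overall.mean()
print(f"relative ripple of the summed transform: {ripple:.1e}")
```

Because the ripple is tiny only when the spacing equals the SD (or less), changing the spacing in `centers` to a multiple of `sd` is a quick way to see how the overall transform departs from flat.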

In the next step, to extract the instantaneous amplitude envelope $A_n(t)$ and the instantaneous phase $\theta(t)$ of each narrowband signal, we calculated the analytical signal of each $s_n(t)$:
$$s_n(t) = A_n(t)\cos[\theta(t)].$$
The analytical signal decomposition of $s_n(t)$ guarantees that the frequency components of $A_n(t)$ are all below those of $\cos[\theta(t)]$ (Cohen, 1995). In particular, it can be shown that for a bandpassed signal of bandwidth $\sigma_w$, all of the frequency components of $|A_n(t)|^2$ are below $\sigma_w$ (Flanagan, 1980). In general, $A(t)$ is what one would intuitively call the amplitude envelope of the signal. For example, the $A(t)$ calculated for a beat signal made of two pure tones with amplitudes $A_1$ and $A_2$, frequencies $w_1$ and $w_2$, and absolute phases $\theta_1$ and $\theta_2$ is given by: $|A(t)|^2 = A_1^2 + A_2^2 + 2A_1A_2\cos[(w_1 - w_2)t + (\theta_1 - \theta_2)]$. The calculation of $A(t)$ for a complex signal is just the extension of this simple vector sum to include all frequency components of the signal. Finally, it can be shown that $A_n(t)$ corresponds to the amplitude at the center frequency $w_n$ of the Fourier transform of $s(t)$ as seen by a window centered around $t$ and with shape given by the inverse Fourier transform of the filter transform function expressed in the frequency domain. In other words, $|A_n(t)|^2$ is the running power at frequency $w_n$ calculated with a window centered around $t$. The width and shape of the window are related to the shape and width of the frequency filter. This is exactly the value obtained when one calculates a spectrogram of a signal.
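
The beat-signal identity can be checked numerically. The sketch below is a minimal NumPy illustration, not the analysis code used in the study: `analytic_signal` is a hypothetical helper that builds the analytical signal with an FFT, and the sampling rate, tone frequencies, amplitudes, and phases are arbitrary choices made so that both tones are exactly periodic in the analysis window.

```python
import numpy as np

def analytic_signal(x):
    """Analytical signal via the FFT: keep DC, double positive frequencies, zero negative ones."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    weights[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        weights[n // 2] = 1.0            # Nyquist bin is shared, keep it once
    return np.fft.ifft(np.fft.fft(x) * weights)

fs = 8000                                 # sampling rate in Hz (illustrative)
t = np.arange(fs) / fs                    # exactly 1 s, so integer-Hz tones are periodic
a1, a2 = 1.0, 0.6                         # amplitudes A1 and A2
w1, w2 = 2 * np.pi * 1000, 2 * np.pi * 1100   # angular frequencies w1 and w2
th1, th2 = 0.3, -1.1                      # absolute phases theta1 and theta2

beat = a1 * np.cos(w1 * t + th1) + a2 * np.cos(w2 * t + th2)

# Squared envelope from the analytical signal versus the closed-form expression.
envelope_sq = np.abs(analytic_signal(beat)) ** 2
predicted = a1**2 + a2**2 + 2 * a1 * a2 * np.cos((w1 - w2) * t + (th1 - th2))
print("largest discrepancy:", np.max(np.abs(envelope_sq - predicted)))
```

The envelope oscillates at the 100 Hz difference frequency, which is the sense in which the frequency content of the squared envelope stays below the bandwidth occupied by the signal.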

The part of the signal that is not described by the amplitude envelope (and therefore not shown explicitly in a spectrogram) is often called the fine structure of the signal and is given in the analytical signal by an instantaneous phase, $\theta(t)$. The instantaneous phase can in turn be expressed in terms of its derivative and an absolute phase. The derivative of the instantaneous phase is taken as the instantaneous frequency:
$$s_n(t) = A_n(t)\cos\left[\int_0^t w(\tau)\,d\tau + \theta_n\right].$$
We further expressed the instantaneous frequency as a modulation around the center frequency of the band $w_n$:
$$s_n(t) = A_n(t)\cos\left[w_n t + \int_0^t w_{\mathrm{FM}n}(\tau)\,d\tau + \theta_n\right].$$
In this final form, $A_n(t)$ will be referred to as the amplitude modulation or AM component of the signal, $w_{\mathrm{FM}n}(t)$ is the frequency modulation or FM around the center frequency $w_n$, and $\theta_n$ is the absolute phase.
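
To make the three parameters concrete, the following minimal NumPy sketch builds a single narrowband signal from a known AM, FM, and absolute phase and then recovers them from its analytical signal. It is an illustration under stated assumptions, not the original analysis code: `analytic_signal` is a hypothetical FFT-based helper, and the test signal (1 kHz center, 3 Hz AM, 5 Hz FM) is arbitrary and chosen to be periodic in the window.

```python
import numpy as np

def analytic_signal(x):
    """Analytical signal via the FFT (negative frequencies removed)."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    weights[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
    return np.fft.ifft(np.fft.fft(x) * weights)

fs = 8000                                           # sampling rate in Hz (illustrative)
t = np.arange(fs) / fs
f_center = 1000.0                                   # band center, w_n / (2*pi), in Hz
theta_n_true = 0.7                                  # absolute phase in radians
am_true = 1.0 + 0.5 * np.cos(2 * np.pi * 3 * t)     # A_n(t)
fm_true = 10.0 * np.cos(2 * np.pi * 5 * t)          # w_FMn(t) / (2*pi), in Hz

# Closed-form integral of fm_true: (10 / 5) * sin(2*pi*5*t) cycles -> 2*sin(...) radians.
phase = 2 * np.pi * f_center * t + 2.0 * np.sin(2 * np.pi * 5 * t) + theta_n_true
signal = am_true * np.cos(phase)

z = analytic_signal(signal)
am = np.abs(z)                                      # amplitude envelope
theta = np.unwrap(np.angle(z))                      # continuous instantaneous phase
inst_freq = np.gradient(theta, 1 / fs) / (2 * np.pi)    # instantaneous frequency in Hz
fm = inst_freq - f_center                           # modulation around the band center
theta_n = theta[0]                                  # absolute phase (phase at t = 0)
```

The recovery is clean here because the envelope varies much more slowly than the carrier; for envelopes whose spectrum overlaps the carrier, the separation into AM and FM is no longer unique.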

Synthetic songs

Four synthetic song families were generated using systematic degradations of the parametric representation described above. Each family of songs preserved some aspect of the original signal. In the following description, the song families are organized approximately in terms of increasing similarity with the original signal. The first set of songs was generated by preserving only the AM components in the decomposition. This resulted in synthetic songs with amplitude envelopes similar to those in the original song. The second set of songs progressively restored the relative instantaneous phase across frequency bands, improving both the FM and AM quality of the synthetic song. The third and fourth sets distorted the FM component by additive FM noise. This distortion was done in two ways, one that randomized (third set) and one that preserved (fourth set) the original relative phase. Finally, as a control, we also created the single synthetic song that preserved all of the original parameters. This song is referred to as Syn in Results. The Syn song is identical to the original song filtered by the combined filter transform function obtained from our filter bank.

Synthetic AM songs and the time-frequency scale. The first set of songs was generated by preserving the AM components obtained in the decomposition but by generating a new and random instantaneous phase for each bandpassed signal. The instantaneous phase was chosen to be random so that the new component bandpassed signals of song become effectively noise, band-limited to the same frequency band as the original bandpassed signal and modulated by the same amplitude envelope. The full degraded synthetic song is the sum of these narrowband signals. A family of such AM songs can be generated by increasing or decreasing the width of the filters in the filter bank used to extract the AM waveforms of the original song.

When the filter bandwidth is very wide, the entire song will fit in the band of a single filter, and the resulting AM song will be similar to white noise modulated by the overall amplitude envelope of the signal (see Figs. 2, 6, AM-1 panel). As one narrows the bandwidths of the filters, more filters are needed to cover the entire song, and the amplitude envelopes from each filter characterize the spectral structure more precisely. However, because of the time-frequency resolution trade-off, the amplitude envelopes in each band will now be limited to coarser time resolutions [An(t) is band-limited by the width of the filter]. Normally, when the full song is resynthesized by summing the signals in each band and by preserving all parameters, the fine temporal aspect of the overall song envelope is recovered because the phase in each band interacts with that in the other bands in a specific manner to recreate the overall temporal structure of the signal. However, by randomizing the phase, we eliminated this particular relationship between the phases in each band and affected the overall temporal structure. Because our phase is random, the overall time resolution is effectively the time resolution of the amplitude envelopes in each band. This time resolution is given by the inverse of the bandwidth of the filters. Just as the songs modified after filtering through wideband filters have good temporal but poor spectral resolution, songs created at the very narrowband filter extreme characterize the frequency content of the song well but have poor temporal resolution; the amplitude envelopes in each band are effectively flat, and the resulting song is a colored-noise signal with a flat amplitude and overall frequency spectrum identical to that of the original signal (see Fig. 6, AM-256 panel). 
At intermediate time-frequency resolutions, the synthetic AM signals can capture both the spectral and temporal structure of the original signal but always with a particular trade-off between time and frequency (see Fig. 6, intermediate panels). We subsequently denote the width of the filters used in generating an AM song as the time-frequency scale of the synthesized signal. The AM songs are labeled with "AM scale," in which the scale is a number specifying the time scale in milliseconds.

The time-frequency scale trade-off of the AM songs is illustrated in Figures 2 and 6. These figures show spectrograms for synthetic AM songs generated with progressively narrower frequency filters. In Figure 2, we also show the spectrograms of the original signal calculated with the same windows that were used to obtain the AM songs. This allows for direct comparison between the amplitude envelopes of the AM songs and those of the original song. In Figure 6, the full range of AM songs is displayed in spectrograms all calculated with the time-frequency scale that was best at representing the original song. These spectrograms illustrate how, as one goes from AM-1 to AM-256, spectral resolution is gained at the cost of temporal resolution.


Figure 2.   Wideband (W-1), middleband (W-16), and narrowband (W-256) spectrograms generated with different time windows for a representative section of a zebra finch song motif (BOS) and three synthetic AM songs derived from that particular song (AM-1, AM-16, and AM-256). The time windows used to generate the spectrograms had a Gaussian shape and a width of 1, 16, or 256 msec, respectively. The three AM songs were generated by preserving the AM waveforms of the frequency decomposition of the original BOS obtained with a bank of Gaussian-shaped frequency filters, as explained in Materials and Methods. The filters also had widths of 1, 16, or 256 msec expressed in the time domain (1 kHz, 62.5 Hz, or 3.9 Hz, respectively, in the frequency domain). Therefore, the W-1 (W-16 and W-256) spectrogram for the AM-1 (AM-16 and AM-256) song approximately matches the W-1 (W-16 and W-256, respectively) spectrogram for the BOS. At other time-frequency scales, the spectrograms of the AM songs do not match that of the BOS, illustrating the information that is lost in the AM songs. The AM-1 song preserves the fine temporal modulations but does not have the frequency resolution of the BOS. The AM-256 has good frequency discrimination calculated at longer time scales (notice the finer frequency bands for the last harmonic stack in the song) but has smeared the temporal structure present in the BOS. The AM-16 shows good time-frequency compromise.

In our experiments, we used a range of frequency filters by varying their width from 2 kHz to 2 Hz in logarithmic steps. The corresponding width of these filters in the time domain ranged from 0.5 to 512 msec. To cover the frequency range from 500 to 8000 Hz, the number of filters ranged from 4 for the 2-kHz-wide filters to 3840 for the 2-Hz-wide filters. For each time-frequency value describing the filter width, we generated a synthetic AM song.
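
The correspondence between time scale, bandwidth, and number of filters can be tabulated with a few lines of Python. This is a sketch under assumptions that the text does not state explicitly: the logarithmic steps are taken to be factors of two (consistent with the AM-1, AM-16, and AM-256 labels), and the filter count is taken to be the frequency range divided by the bandwidth, rounded up. That counting rule is a guess that happens to reproduce the two stated endpoints (4 and 3840 filters); the helper names are hypothetical.

```python
import math

F_MIN, F_MAX = 500.0, 8000.0      # frequency range covered by the filter bank, in Hz

def bandwidth_hz(width_ms):
    """Frequency-domain width of a filter whose time-domain width is width_ms."""
    return 1000.0 / width_ms

def n_filters(width_ms):
    """Assumed counting rule: filters spaced one bandwidth apart over the range."""
    return math.ceil((F_MAX - F_MIN) / bandwidth_hz(width_ms))

# Assumed set of scales: 0.5, 1, 2, ..., 512 msec in factor-of-two steps.
scales_ms = [0.5 * 2**k for k in range(11)]

for width in scales_ms:
    print(f"AM-{width:g}: {bandwidth_hz(width):9.3f} Hz wide, {n_filters(width)} filters")
```

Under these assumptions the nominal "2 Hz" filters are 1000/512, or about 1.95 Hz, wide, which is why the widest time window is 512 rather than 500 msec.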

Mathematically, the synthesis went as follows. The $A_n(t)$ in the synthesis was calculated from the original song, but the $w_{\mathrm{FM}n}(t)$ and $\theta_n$ were random. The random $w_{\mathrm{FM}n}(t)$ was generated so that $w_{\mathrm{FM}}$ had a Gaussian distribution of zero mean and SD equal to the bandwidth of the filters in the filter bank, $\sigma_w$. In addition, we required that $w_{\mathrm{FM}n}(t)$ be band-limited to frequencies below $\sigma_w$. These two requirements guarantee that the function:
$$\cos\left[w_n t + \int w_{\mathrm{FM}n}(\tau)\,d\tau + \theta_n\right]$$
is the analytical representation of a bandpassed signal centered at $w_n$, with bandwidth $\sigma_w$ and unit amplitude (i.e., flat bandpassed noise). Finally, these unit amplitude signals from each frequency band were multiplied by the original $A_n(t)$ and were summed together. The result was a synthetic song with an amplitude envelope in each band similar to that in the original song but with significantly different fine structure.
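
The synthesis steps above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions rather than the program used in the study: the function names are hypothetical, the band-limited Gaussian FM is produced by low-pass filtering white Gaussian noise and rescaling it (one of several ways to meet the two requirements), frequencies are handled in Hz, and the "song" is a toy harmonic stack rather than a recorded zebra finch song.

```python
import numpy as np

def band_envelopes(song, fs, centers_hz, sd_hz):
    """A_n(t): magnitude of the analytical signal in each Gaussian frequency band."""
    freqs = np.fft.fftfreq(len(song), 1 / fs)
    spectrum = np.fft.fft(song)
    envelopes = []
    for fc in centers_hz:
        # Gaussian gain on positive frequencies only (doubled), zero elsewhere.
        gain = np.where(freqs > 0, 2.0 * np.exp(-0.5 * ((freqs - fc) / sd_hz) ** 2), 0.0)
        envelopes.append(np.abs(np.fft.ifft(spectrum * gain)))
    return np.array(envelopes)

def random_fm(n, fs, sd_hz, rng):
    """Zero-mean Gaussian FM with SD sd_hz, band-limited to frequencies below sd_hz."""
    spec = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, 1 / fs)
    spec[(freqs == 0) | (freqs >= sd_hz)] = 0.0     # assumes some bins lie below sd_hz
    noise = np.fft.irfft(spec, n)
    return sd_hz * noise / noise.std()

def am_song(song, fs, sd_hz, f_min=500.0, f_max=8000.0, seed=0):
    """Keep the band envelopes of `song`, replace the fine structure by random carriers."""
    rng = np.random.default_rng(seed)
    n = len(song)
    t = np.arange(n) / fs
    centers = np.arange(f_min, f_max + sd_hz, sd_hz)    # centers spaced by 1 SD
    envelopes = band_envelopes(song, fs, centers, sd_hz)
    out = np.zeros(n)
    for fc, env in zip(centers, envelopes):
        fm = random_fm(n, fs, sd_hz, rng)               # random w_FMn(t), in Hz
        theta_n = rng.uniform(0.0, 2 * np.pi)           # random absolute phase
        phase = 2 * np.pi * (fc * t + np.cumsum(fm) / fs) + theta_n
        out += env * np.cos(phase)                      # unit-amplitude carrier times A_n(t)
    return out

# Toy input: amplitude-modulated harmonic stack, 0.5 s at 32 kHz.
fs = 32000
t = np.arange(fs // 2) / fs
toy = np.sin(2 * np.pi * 4 * t) ** 2 * sum(np.cos(2 * np.pi * f * t) for f in (700.0, 1400.0, 2100.0))
synth = am_song(toy, fs, sd_hz=62.5)                    # analogue of an AM-16 song
```

Calling `am_song` with a larger `sd_hz` gives fewer, wider bands and hence finer temporal but coarser spectral structure; a smaller `sd_hz` does the opposite.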

The resulting synthetic songs have an amplitude envelope in each of their component bands similar to but not exactly the same as that in the original signal because, in the AM songs, the phase relationship between each band and its neighboring "overlapping" frequency bands was altered. Just as randomizing the phase altered the overall amplitude envelope in the AM songs, it will also alter the amplitude envelopes in each band when all the bands are summed together in the synthesis. In other words, in fully parameterized song, there exists redundant information in the time-varying amplitude envelopes and in the relative phase across overlapping frequency bands. One cannot therefore be scrambled without affecting the other. Under certain conditions (of enough overlap between the frequency bands), the amplitude envelopes can completely determine the value of the relative phase across frequency bands. In those cases, one can say that the spectrogram (i.e., the set of amplitude envelopes) is invertible in the sense that the original signal (except for an absolute phase) can be recovered solely from the set of amplitude envelopes. The relative instantaneous phase and therefore the exact representation of the amplitude envelopes will be restored in the family of synthetic songs described in the next section.

To estimate the degree of distortion of the $A_n$ components of the AM synthetic songs, we calculated the normalized cross-correlation between the $A_n$ of the original song and the $A_n$ of the synthetic songs (see below for the definition of cross-correlation). We found that the average cross-correlation (± SEM) was 0.737 ± 0.003 (range, 0.634-0.798) for all 74 AM songs used in these experiments. Significantly for the interpretation of our results, this value was independent of the width of the filters used in generating the songs. Our AM synthetic songs can therefore be thought of as the typical signal that would be estimated in an inverting operation (done, e.g., by the high-level auditory areas) from a noisy representation of the complex sound by its amplitude envelopes (e.g., noisy neural encoding of these envelopes at the auditory periphery). The amount of noise is equal to ~26% of the signal. The noise in the representation is more detrimental to temporal information when many frequency bands are used, because in those cases the temporal information is present in the fine differences in amplitude across bands. Similarly, the noise is more detrimental to spectral resolution when few frequency bands are used. To eliminate completely the noise in the amplitude envelopes, we had to restore the relative phase across bands perfectly. Therefore in our particular decomposition using overlapping Gaussian frequency bands, the amplitude envelopes can fully characterize the signal (except for an absolute phase that can shift the phase by the same amount in each band).

Note that any other mathematical representation of a signal in terms of sums of amplitude envelopes [including those used in Shannon et al. (1995)] is also affected by the fact that one cannot independently change time-varying spectral and temporal information. For example, decreasing the overlap between the filters would reduce the contamination in the amplitude envelope attributable to the interaction with the neighboring bands but would result in an increase in spectral fluctuations caused by a nonuniform sampling of the frequency range covered by the overall filter transform of the filter bank (as shown in Fig. 1B). Both errors in the synthesis could apparently be eliminated by using nonoverlapping boxcar filters, but in reality the amplitude envelope of a synthetic song made from such boxcar filters would only match the amplitude envelopes of the original song extracted with the exact same set of filters that was used to obtain the An(t) waveforms for the synthesis. The amplitude envelopes of the synthetic and the original song extracted with differently shaped filters or with filters of the same shape but shifted along the frequency axis would be different, again because of different interference terms. For example, for boxcar filters, the error would be the greatest for amplitudes extracted when the filters were shifted by exactly one-half the bandwidth. On the other hand, our formulation, using Gaussian overlapping filters, would result in similar errors for amplitude envelopes extracted with filters (of equivalent bandwidth) of any shape and centered at any arbitrary point along the frequency axis. This uniformity of representation of the amplitude envelopes is physiologically more realistic, just as the shape and the overlap of our overlapping Gaussian filters constitute a better model of the auditory periphery than does a set of nonoverlapping boxcar filters. 
These were important factors, because we wanted to analyze our results in light of the encoding occurring at the different stages of the auditory system. Finally, we wanted to use a formulation that was completely symmetric along the time and frequency dimensions, so that we could interchangeably quantify the scale of our AM synthetic song (given by the width of the filters) in the time domain or in the frequency domain. The choice of Gaussian filters separated by 1 SD was the result of all of these considerations.

Songs that preserve the relative instantaneous phase. In the second and third set of synthetic songs, we progressively restored the fine structure components of the signal that had been eliminated from the AM songs. Our starting point was the AM synthetic song generated for the time-frequency scale of 16 msec or 62.5 Hz (AM-16). This particular time-frequency scale was chosen both because AM songs generated at this scale elicited good responses from HVc neurons and because the amplitude waveforms calculated at this scale were the most informative for discriminating among zebra finch songs from different birds (see Results).

In the second set of synthetic songs, we progressively restored the instantaneous relative phase across adjoining frequency bands. In practice, we generated a set of songs with Gaussian noise added to the values of the instantaneous phase waveforms $\theta_n(t)$, obtained from the original song. This differs from the situation for the AM songs, in which the instantaneous phase waveforms from the original song were ignored and new random instantaneous phase waveforms were generated. The amount of noise was specified to preserve the relative phase across adjoining frequency bands to within a given temporal resolution. The temporal resolution was implemented by allowing Gaussian deviations from the original relative phase at each time point. The value of the temporal resolution was varied by changing the width of the Gaussian noise. The width of the Gaussian was expressed in radians, which were translated into time units by dividing by $2\pi \times 62.5$. The value of 62.5 Hz corresponds to the interval between the centers of two adjoining frequency bands for the time-frequency scale of 16 msec.
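
The conversion between a temporal resolution and the width of the phase noise is simple arithmetic, sketched below in NumPy. The function names are hypothetical, and drawing the noise independently for every band and every sample is an assumption made for brevity; the text does not specify the temporal or cross-band correlation of the noise that was actually used.

```python
import numpy as np

BAND_SPACING_HZ = 62.5    # interval between adjoining band centers at the 16 msec scale

def resolution_to_radians(resolution_ms):
    """SD of the phase noise (radians) for a given temporal resolution (msec)."""
    return 2 * np.pi * BAND_SPACING_HZ * resolution_ms / 1000.0

def jitter_phase(theta, resolution_ms, rng):
    """Add Gaussian noise to instantaneous phase waveforms theta (bands x time)."""
    sd_rad = resolution_to_radians(resolution_ms)
    # Assumption: independent noise per band and per sample.
    return theta + rng.normal(0.0, sd_rad, size=theta.shape)

rng = np.random.default_rng(0)
theta = np.zeros((4, 1000))                  # placeholder phase waveforms
jittered = jitter_phase(theta, 2.0, rng)     # 2 msec resolution
print(resolution_to_radians(2.0), np.pi / 4)
```

A 2 msec resolution therefore corresponds to a phase SD of a quarter of pi radians, whereas 10 msec corresponds to more than half a cycle, at which point the relative phase is close to fully scrambled.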

Songs with temporal resolutions in the relative phase ranging from 10 to 0 msec were generated. Gradually restoring the relative instantaneous phase had two effects: it progressively restored the FM in each band and improved the quality of the AM component. The FM component is the derivative of the instantaneous phase, which is independent of the absolute phase and will therefore be preserved when the relative phase is preserved. The quality of the AM component also depends on the relative phase in order to obtain the same interference terms as those of the original song (see Synthetic AM songs and the time-frequency scale). To indicate the accuracy of the representation of the AM component, we also calculated for each temporal resolution the cross-correlation value between the AM component of the synthetic songs and that of the original song. The 0 msec temporal resolution resulted in a synthetic song that was similar to the original song except for an identical shift in absolute phase in all frequency bands. That particular synthetic song was called RAP for random absolute phase. The RAP song has the same AM and the same FM that the BOS has.

Songs that preserve various amounts of the FM component. For the third and fourth sets of songs, we added noise to the original FM component in each band. In addition, the fourth set preserved the relative phase exactly across all bands. To preserve the relative phase, we added the same FM noise to every frequency band. For the third set, independent frequency noise was added to the FM component in each band. We generated a set of songs by varying the SD of the FM Gaussian noise from 0 to 30 Hz. As in the previous set of songs, the FM noise was band-limited to frequencies below 62.5 Hz. Recall that an AM-16 song is generated with random Gaussian FM with 62.5 Hz SD; thus 30 Hz corresponds to approximately half that amount of noise. For the synthetic songs designed to preserve various amounts of FM, however, the noise was added to the FM components of the original song. The absolute phase was also random, one random shift for all absolute phases for the set that preserved the relative phase and an independent phase shift in each band for the set that did not preserve the relative phase. The randomness of the absolute phase is only significant in the 0 Hz case. The 0 Hz noise value corresponds to synthetic songs with the original FM. For the cases in which the relative phase was preserved (fourth set), the 0 Hz song was identical to the RAP song. For the cases in which we generated a different absolute phase in each band (third set), the 0 Hz song will be called RP for random phase.

Measure of song similarity based on the amplitude envelope

We estimated the degree of similarity between zebra finch songs from different birds or between our synthetic songs and the original song by calculating the normalized cross-correlation coefficient of their respective AM components:
$$C_A = \frac{\left\langle \sum_n A_{1n}(t)\,A_{2n}(t) \right\rangle^2}{\left\langle \sum_n A_{1n}(t)^2 \right\rangle \left\langle \sum_n A_{2n}(t)^2 \right\rangle}.$$
The angle brackets ⟨ ⟩ indicate averages over the length of the song. C_A was calculated for a range of time delays between the two signals, and the largest correlation was taken as the measure of song similarity. Because different songs could vary in duration, the time averages were performed only for the duration of the shortest song.
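As a minimal sketch of this similarity measure (not the original analysis code), the function below computes the normalized cross-correlation of two sets of band envelopes and keeps the largest value over a range of integer sample delays. The function name, the list-of-bands data layout, and the choice to average only over the samples that overlap at each delay are assumptions made for illustration.

```python
def amplitude_correlation(env1, env2, max_lag=0):
    """Largest normalized envelope correlation over lags in [-max_lag, max_lag].

    env1, env2: lists of frequency bands; each band is a list of
    amplitude-envelope samples. Both songs need the same number of bands.
    """
    if len(env1) != len(env2):
        raise ValueError("songs must be decomposed into the same number of bands")
    len1, len2 = len(env1[0]), len(env2[0])
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        # Sample t of song 1 is paired with sample t + lag of song 2;
        # only the overlapping samples are used (assumption).
        start = max(0, -lag)
        stop = min(len1, len2 - lag)
        if stop <= start:
            continue
        cross = power1 = power2 = 0.0
        for band1, band2 in zip(env1, env2):
            for t in range(start, stop):
                a1, a2 = band1[t], band2[t + lag]
                cross += a1 * a2
                power1 += a1 * a1
                power2 += a2 * a2
        if power1 > 0.0 and power2 > 0.0:
            # The 1/T factors of the three time averages cancel in the ratio.
            best = max(best, cross * cross / (power1 * power2))
    return best

# Toy example: a two-band "song" and a copy delayed by one sample.
song = [[0.0, 1.0, 2.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0, 1.0]]
delayed = [[0.0] + band[:-1] for band in song]
print(amplitude_correlation(song, delayed, max_lag=0))  # below 1
print(amplitude_correlation(song, delayed, max_lag=2))  # the delay is recovered
```

The delay range should stay small relative to the song duration, because a very short overlap can produce a spuriously high correlation.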

The correlation measures were repeated for a range of time-frequency scales to allow the study of the effect of time-frequency scale on the discriminability of songs based on their amplitude waveforms. We also performed this calculation on a syllable-by-syllable basis by comparing syllables from one song with syllables from the other songs. By looking only at syllables, we could separate the effect of the temporal scale given by the rhythmic succession of silences and syllables from the temporal scale of the sounds within a syllable. The syllable cross-correlation reported in this work was limited to the pair of syllables that were the most similar between the two songs. These particular pairs are presumably the most difficult to differentiate.

The syllable decomposition was done by a computer program that automatically divided the song into sections of sounds and silences based on the waveform profile of the overall power envelope calculated with an 8 msec Hanning window. The peaks and troughs of this amplitude envelope that were a factor of 10 apart were used to define sections of silence and sections of sound in the song. The sections of sound could be separated by very short silences (one point or 4 msec) and vice versa. The temporally discrete sections of sounds obtained with this algorithm are a particular implementation of what is usually defined somewhat subjectively as a syllable in the zebra finch song by human experts (Sutter and Margoliash, 1994; Zann, 1996, pp 214-215). Figure 3 shows the syllable decomposition obtained for one of the songs used in this work. Our algorithm efficiently divides the song into syllables, with the limitation that for syllables separated by longer periods of silence, the boundaries between sound and silence are not necessarily at the same threshold levels of sound intensity that would be used by a human expert. Because the measurement of the length of the syllables or of the interval between syllables was not part of our work, these differences are not important.


Figure 3.   Spectrogram (top) and overall power envelope (bottom) of one of the representative songs used in these experiments. The vertical lines are the divisions obtained from a computer program that automatically divides the song into syllable-like elements based on the peaks and troughs of the overall power (see Materials and Methods). Syllables 9-14 were chosen for the color spectrograms (see Figs. 2, 6).

The cross-correlation analysis was performed for 16 songs from our zebra finch colony, including the songs from the seven birds used in this experiment. The songs belonged to birds from different families and had different temporal and spectral structure. This ensemble of songs was not necessarily representative of all song sounds that zebra finches can produce but was more than sufficient to characterize the time-frequency scale of zebra finch sounds, as evidenced by the small error bars that we obtained for the C_A measure.

Electrophysiology

All physiological recordings were done in anesthetized adult male zebra finches in acute experiments. Two days before the recording session, a small surgical procedure was performed to prepare the bird for the recording session. The bird was anesthetized with 20-30 µl of Equithesin intramuscularly (0.85 gm of chloral hydrate, 0.21 gm of pentobarbital, 0.42 gm of MgSO4, 2.2 ml of 100% ethanol, and 8.6 ml of propylene glycol to a total volume of 20 ml with H2O), and a small patch of skin on the head was removed to expose the skull. The top bony layer of the skull was removed around the dorsal part of the midsagittal sinus and in an area a few millimeters lateral to the sinus. An ink mark was made 2.4 mm lateral from the dorsal bifurcation point of the sinus to be used as a reference point for electrode penetration. Finally, a metallic stereotaxic pin was glued to the skull with dental cement.

For the recordings, the bird was slowly anesthetized with urethane (75 µl of a 20% solution administered in three doses over a 1.5 hr period) and immobilized with the stereotaxic pin. A very small patch of the lower layer of the skull and the dura was removed at the marked location, exposing the brain. Extracellular electrodes were inserted through this opening at and around the location originally marked by the ink dot. The stimuli were presented inside a sound-attenuated chamber (Acoustic Systems) with a calibrated speaker 20 cm away from the bird. The volume of the speaker was adjusted to deliver peak levels of ~85 dB. The rate-intensity function of HVc neurons quickly plateaus above threshold values. The 85 dB value was chosen so that the sound level of the song was in the range at which the rate-intensity function of the neurons is flat (Margoliash and Fortune, 1992). We did not investigate what effect low sound levels would have on our results. Data were collected when the baseline activity and auditory responses were characteristic of the nucleus HVc. As reported in other studies both for zebra finches (Margoliash and Fortune, 1992) and for white-crowned sparrows (Margoliash, 1986), HVc responses, in both the urethane-anesthetized and the awake-restrained animal, are characterized by bursting spontaneous activity and by auditory responses that show a strong preference for the BOS in comparison with the responses to other complex auditory stimuli, such as the BOS played in reverse or the song of conspecifics. These characteristic properties can be used to distinguish the neural responses of HVc neurons proper from those of the neighboring neostriatal areas. Our experience is in complete agreement with this phenomenology. The exact location of the recording sites was also verified postmortem by finding the electrode tracks and lesion reference points in Nissl-stained sections of the brain of the bird [for detailed histological methods, see Doupe (1997)]. The data in this paper consist solely of recordings from within the nucleus HVc [for a detailed anatomical description of the HVc, see Fortune and Margoliash (1995)].

The data consisted of neural responses obtained in 77 distinct recording sites in seven birds. In any particular bird, the recording sites were at least 75 µm apart. This distance was sufficient to guarantee that the neural activity recorded from two successive sites originated from different units. A window discriminator was used to translate the neural activity at each recording site into spike arrival times of small clusters of one to five neurons. The single-cell spike arrival times were obtained when a stereotyped spike shape was easily selected with a window discriminator. The multiunit recordings consisted of spikes of various shapes that could easily be discriminated from the background activity with a window discriminator but not from each other. We assessed that responses from small clusters of two to five neurons were obtained in such recordings. Additional single units were isolated from the clusters of neurons using the spike-sorting algorithm of Lewicki (1994) and showed very similar results to the small clusters of neurons and to the single units isolated with a window discriminator but were not used in the analysis presented here.

Stimulus repertoire and presentation

Stimulus repertoire consisted of the BOS, all of the synthetic versions of the BOS, the BOS played in reverse, the BOS played in reverse order, two conspecific songs, and broadband noise bursts. In addition, in some experiments, we used pieces of songs to test for temporal combination-sensitive neurons. The BOS, the BOS played in reverse, the broadband noise bursts, and the conspecific songs were used both as search stimuli and to initially characterize the selectivity of the recording sites for the BOS. The synthetic stimuli were then presented in subgroups that consisted of most of the synthetic songs from one of the four families and the BOS. Ten interleaved trials were collected for all of the stimuli in the subgroup. The stimulus presentation order was randomized for each trial number. The interstimulus interval was between 7 and 8 sec. Two seconds of background activity was recorded before each stimulus, and between 4 and 5 sec was recorded after the stimulus. An additional time interval between 1 and 3 sec (uniform random distribution) was added between collections. When a single stimulus such as the BOS was presented with this interstimulus interval, no measurable adaptation in the responses was found.
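The interleaved presentation scheme can be sketched as follows. This is an illustration only, not the original stimulus-delivery software: the function name and the stimulus labels are hypothetical, and drawing one extra 1 to 3 sec gap per presentation is an assumption about how the additional interval was applied.

```python
import random

def interleaved_schedule(stimuli, n_trials=10, seed=None):
    """Return a list of (trial, stimulus, extra_gap_sec) presentations.

    Every stimulus appears once per trial number, and the order is
    reshuffled independently for each trial number.
    """
    rng = random.Random(seed)
    schedule = []
    for trial in range(n_trials):
        order = list(stimuli)
        rng.shuffle(order)
        for name in order:
            # Extra interval drawn from a uniform distribution on 1-3 sec.
            extra_gap_sec = rng.uniform(1.0, 3.0)
            schedule.append((trial, name, extra_gap_sec))
    return schedule

# Example subgroup: the BOS plus two synthetic songs.
for trial, name, gap in interleaved_schedule(["BOS", "AM-4", "AM-16"], n_trials=2, seed=1):
    print(trial, name, round(gap, 2))
```

Because each trial number contains every stimulus exactly once, slow drifts in responsiveness affect all stimuli in the subgroup roughly equally.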

In the analysis, the response to the synthetic songs was compared with the response to the BOS obtained during the same collection trials. This was a precaution used in case the response properties were not stationary during the long period of time that was required for the presentation of all synthetic stimuli. Because the set of collection trials was repeated for a subgroup of stimuli to assess stationarity, we obtained between 10 and 40 trials for each stimulus. However, 10 trials were used most of the time to be able to record the responses to the largest ensemble of synthetic stimuli at each recording site. This small number of trials was sufficient to characterize single recording sites in terms of their classic selectivity properties (i.e., BOS vs conspecific song) because the magnitude of the response is clearly different in those cases. For stimuli that gave similar responses (such as AM-16 vs AM-8), more trials would be required in particular cases to obtain significant differences for single recording sites (although some neurons showed statistically significant differences). Our conclusions for those stimuli are based on the population study.

Not all synthetic stimuli were presented at each recording site. The total number of recording sites for each stimulus is specified in each of the figure legends when the results are presented. Table 1 summarizes the number of recording sites per bird and the number of sites where data were obtained for each of the four synthetic stimuli ensembles.

                              
Table 1.   Distribution of recording sites per bird and per stimulus ensemble

Neural response characterization

The neural response to any given stimulus was expressed as a Z score. The Z score is given by the difference between the firing rate during the stimulus and that during the background divided by the SD of this difference quantity:
$$Z = \frac{\mu_S - \mu_{BG}}{\sqrt{\mathrm{Var}(S) + \mathrm{Var}(BG) - 2\,\mathrm{Covar}(S, BG)}},$$
where μ_S is the mean response during the stimulus (S) and μ_BG is the mean response during the background (BG). The denominator is the equation for the SD of S − BG. The background was estimated by averaging the firing rate during the 2 sec period before the stimulus. For each unit, the Z score for the response to any stimulus was then compared with the Z score for the response to the BOS by calculating the ratio of these two values. The Z score was in most cases larger for the BOS than for any other stimuli, so that the fraction of the BOS Z score is also an estimate of the response relative to the maximal response. The fractions for different units were then averaged to generate a result for the entire data set.
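A minimal sketch of the Z score calculation is given below, assuming that the stimulus and background firing rates are paired trial by trial and that sample (n − 1) variances are used; neither detail is stated in the text, and the function names and example rates are hypothetical. The second function spells out the denominator term by term to show that it equals the SD of the trial-by-trial difference S − BG.

```python
import math
import statistics

def z_score(stim_rates, background_rates):
    """Mean of (S - BG) across trials divided by the SD of (S - BG)."""
    diffs = [s - b for s, b in zip(stim_rates, background_rates)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def z_score_explicit(stim_rates, background_rates):
    """Same quantity, using Var(S) + Var(BG) - 2 Covar(S, BG) explicitly."""
    n = len(stim_rates)
    mean_s = statistics.mean(stim_rates)
    mean_bg = statistics.mean(background_rates)
    covar = sum((s - mean_s) * (b - mean_bg)
                for s, b in zip(stim_rates, background_rates)) / (n - 1)
    var_diff = (statistics.variance(stim_rates)
                + statistics.variance(background_rates) - 2.0 * covar)
    return (mean_s - mean_bg) / math.sqrt(var_diff)

# Hypothetical firing rates (spikes/sec) on four paired trials.
stim = [12.0, 15.0, 11.0, 14.0]
background = [4.0, 5.0, 3.0, 6.0]
print(z_score(stim, background))           # 8.5
print(z_score_explicit(stim, background))  # same value up to rounding
```

The relative response used in the figures would then be the ratio of the Z score for a given stimulus to the Z score for the BOS at the same recording site.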

We also used the psychophysical measure d' (Green and Swets, 1966) to estimate the strength of the selectivity of the recorded neurons. The selectivity of HVc neurons is determined by their response to the BOS in comparison with their response to conspecific songs or to the BOS played in reverse. We calculated the d' to estimate the difference between such responses. The d' value for the discriminability between stimuli i and j is calculated as:
$$d' = \frac{2(\bar{R}_i - \bar{R}_j)}{\sqrt{\sigma_i^2 + \sigma_j^2}},$$
where R is the response to a given stimulus, R̄ is the mean value of R, and σ is its SD. We took R to be S − BG. The d' value for neuronal responses can be compared with psychophysical or behavioral responses in a forced-choice paradigm [see, for example, Delgutte (1996)]. For our purposes, d' is the simplest measure of selectivity that takes into account not only the estimate of mean responses but also their variance.
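The d' calculation can be sketched in a few lines. The function name and the example responses are hypothetical, and the use of sample (n − 1) variances is an assumption; the formula itself follows the definition above, with each response R taken as S − BG on a single trial.

```python
import math
import statistics

def d_prime(responses_i, responses_j):
    """Discriminability between stimuli i and j from per-trial responses R = S - BG."""
    mean_difference = statistics.mean(responses_i) - statistics.mean(responses_j)
    summed_variance = statistics.variance(responses_i) + statistics.variance(responses_j)
    return 2.0 * mean_difference / math.sqrt(summed_variance)

# Hypothetical per-trial responses (spikes/sec above background).
bos_responses = [9.0, 11.0, 10.0, 10.0]
conspecific_responses = [1.0, 3.0, 2.0, 2.0]
print(d_prime(bos_responses, conspecific_responses))  # about 13.86
```

Swapping the two stimuli changes only the sign of d', so a positive value here indicates a stronger response to the first stimulus.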

    RESULTS

Song selectivity

In this paper, we were interested in quantifying the selectivity seen in HVc neurons. Our goal was to find what aspects of the acoustical structure inherent to all songs are essential to obtain neural responses and to measure the sensitivity of the neurons to systematic degradation of the necessary structure. The first step in our analysis involved measuring the classic song selectivity of the auditory neurons recorded in the experiments. A rigorous quantification of the selectivity was needed to compare our responses with those found in previous work and to select a group of neurons from our data set that we determined to be highly song selective. The song selectivity in HVc neurons has been characterized by a much stronger mean response to the BOS than to songs from conspecifics or to the BOS played in reverse (Margoliash, 1986; Margoliash et al., 1994; Lewicki and Arthur, 1996; Volman, 1996). To also take into account the variance seen in the responses, we chose to quantify the degree of selectivity by calculating the psychophysical measure of discrimination d' (see Materials and Methods).

Figure 4A shows the cumulative probability distribution of the d' measure for BOS versus conspecific song for all the recording sites in our data set. The mean d' value is 2.3 with 86% of the recording sites showing a selectivity greater than d' = 1.0. These values are not necessarily representative of all auditory neurons in HVc because we did not attempt to map the nucleus in a systematic manner. For certain recording sites, the responses to conspecific songs were missing, but we had data to characterize the selectivity of BOS versus BOS played in reverse. In either case, all neural recordings for which d' was >1 were classified as song selective.


Figure 4.   A, Cumulative probability distribution of the measure d' from signal detection theory for the discriminability between the BOS and conspecific songs (Con), calculated from the neural responses obtained at 54 recording sites. B, Response, measured as a percent of the response to the BOS, for the synthetic song that preserved all of the parameters obtained in our decomposition (Syn), for the song played in reverse (Rev), and for conspecific songs (Con). The data are obtained from n = 30 for Syn, n = 39 for Rev, and n = 47 for Con (n refers to the number of recording sites). The error bars show 1 SEM.

Figure 4B shows the mean relative Z score of all song-selective responses to conspecific songs and to the BOS played in reverse. The response to these stimuli was close to zero. Figure 4B also shows the mean response to the synthesized BOS in which all parameters in the decomposition have been preserved (Syn). As expected, the response to Syn is statistically indistinguishable from the response to the BOS because the two stimuli are identical except for the overall bandpass filtering from 500 to 8000 Hz. Figure 5 shows the peristimulus spike time histogram (PSTH) and the single-spike train records from a single-unit recording for these four stimuli.


Figure 5.   Individual spike rasters and peristimulus time histograms (top) for the response of a particular single unit in the HVc to the BOS, Syn, Rev, and Con stimuli (see Fig. 4). Oscillograms (waveform representations of the sound pressure) of the stimuli are shown below each histogram. The d' for this particular single unit was 1.5. As shown in Figure 4, ~75% of the recording sites showed greater selectivity than did this particular neuron, and this neuron, despite its evident selectivity, is among the less selective members of the population that was used for the studies involving the synthetic stimuli (d' > 1).

Time-frequency scale tuning

To investigate the spectral and temporal requirements of the song-selective neurons, we first generated synthetic songs that, when decomposed into defined frequency bands, had amplitude envelopes similar to that of the BOS in each frequency band. However, the time-varying phase of the signal in each band (which can also be expressed as a frequency modulation and absolute phase) was set to be different from the one in the original BOS and was randomized. By varying the width (and correspondingly the number) of the frequency bands, we generated a set of AM songs with systematically varying degradations of temporal versus spectral resolution (see Materials and Methods for more details). Figure 6 shows the spectrograms for a set of such AM songs, illustrating the trade-off between synthetic songs that preserve the time structure of the original song (AM-1 to AM-16) and synthetic songs that preserve the spectral structure of the original song (AM-16 to AM-256). Based on visual inspection, the synthetic songs in the middle values of the time-frequency scale (AM-16 or AM-32) show a good compromise, achieving seemingly minimal temporal and spectral degradation. We will come back to this issue in the next section.


Figure 6.   Spectrograms of a representative section of an original song and its corresponding degraded AM synthetic songs. The spectrograms of the AM-1 to AM-256 songs are shown. The songs generated with small time windows (1-4 msec) preserve the temporal modulations seen in the original song but have poor frequency resolution. For long time windows (such as 256 msec), the spectral resolution calculated at longer time scales is good, but the temporal structure present in the original signal is smeared. The symbols (*, ◆) indicate the time-frequency scale that gave the best neural response (*) and the best discrimination among songs (◆) (see Fig. 10 and the corresponding text). The same symbols are also used below (see Figs. 7, 8). All spectrograms displayed in this figure were generated with 16 msec Gaussian windows.

Figure 7 shows the PSTHs for a representative single unit obtained in response to eight AM songs generated with time windows ranging from 0.5 to 64 msec. The PSTHs in this figure can be compared with the ones obtained from the same unit in response to the original BOS and other songs shown in Figure 5. As the time window was increased, the responses of this particular neuron to the AM songs increased, peaked at ~16 msec, and then decreased. The response to AM songs synthesized with time windows of <2 or >32 msec was indistinguishable from background activity. The maximal response obtained at 16 msec was slightly less than was the response to the BOS. Thus, this neuron showed good responses to synthetic songs based only on the amplitude envelopes, as long as these were obtained in a particular time-frequency range. This time-frequency scale included songs that, by visual inspection of the spectrograms of Figure 6, were good at representing both the spectral and temporal structure in song (AM-16 in Figs. 6, 7) but also included songs that showed substantial spectral degradation (AM-4 and AM-8 in Figs. 6, 7).


Figure 7.   Peristimulus histograms for a single-unit recording in response to the set of AM songs spanning the range of time-frequency scales between 0.5 and 64 msec. The responses to AM songs generated with time windows of >64 msec were similar to those obtained at 64 msec. The stimuli started at t = 2 sec and lasted ~1 sec. This single unit and song were from bird zfa_18. This neuron is the same as that of Figure 5. The symbols (*, ◆) indicate the time-frequency scale that gave the best neural response (*) and the best discrimination among songs (◆) (see Figs. 6, 8, 10).

The mean relative Z score of all song-selective neuronal responses in our data set for the entire range of AM songs is shown in Figure 8A. All individual song-selective recording sites exhibited similar tuning, with responses that peaked at time-frequency scales between 2 and 16 msec. The exact location of the peak as well as the width of the tuning curve varied slightly across units, as exemplified in the three single-unit response traces shown in Figure 8B and in the example shown in Figure 7. This variability was present both in neuronal responses from the same bird (as shown here) and across birds.


Figure 8.   A, Time-frequency tuning curve of HVc in response to AM song stimuli. The x-axis shows the time (bottom) or frequency (top) scale that was used to generate the AM song stimuli. The response is measured as a percent of the response to the BOS. The error bars show 1 SEM. The number of recording sites for each stimulus was n = 31 for t = 0.5 msec, n = 31 for t = 1.0 msec, n = 42 for t = 2.0 msec, n = 35 for t = 4.0 msec, n = 41 for t = 8.0 msec, n = 37 for t = 16 msec, n = 42 for t = 32 msec, n = 33 for t = 64 msec, n = 40 for t = 128 msec, and n = 25 for t = 256 msec. The symbols (*, ◆) indicate the time-frequency scale that gave the best neural response (*) and the best discrimination among songs (◆) (see Figs. 6, 7, 10). B, Time-frequency tuning curves for three different single units from an individual bird. The x- and y-axes are identical to those in A.

In all cases, the response at the extreme time-frequency scales was similar to or below background. The stimulus at the extreme time scale of 0.5 msec is similar to a broadband white noise stimulus modulated by the overall amplitude envelope of the BOS calculated with a 0.5 msec window. This stimulus is analogous to the noise stimulus defined in Margoliash and Fortune (1992). For that particular stimulus, our results are similar to theirs; they reported a weak response to noise syllables, and we found a weak or inhibitory response for the AM-0.5 synthetic song. At the other end of the time-frequency scale, the synthetic song preserves the overall spectrum of the BOS, but because we randomized the phase of its frequency components, it has lost almost all of the original temporal modulations (Figs. 2, 6); such a stimulus is often referred to as colored noise, as opposed to white noise, which is characterized by a flat spectrum. As seen in Figure 8, this stimulus also elicited a weak or inhibitory response. The song played in reverse is another example of a synthetic stimulus that preserves the spectral quality of the song on the time scale of the song duration but distorts the temporal structure. In the song played in reverse, the temporal distortion is a very particular one, whereas the distortion from the random phase in the AM songs generates a systematically degraded version of the original temporal envelope (see Materials and Methods).

Neither AM songs with a very precise temporal profile nor AM songs with a highly precise spectral profile elicited a positive response, and in some cases these stimuli even inhibited the cells. However, as we moved from one time-frequency extreme to the other, the response to the AM synthetic songs traced a smooth tuning curve, reflecting a graded sensitivity to the temporal-spectral precision trade-off inherent in these synthetic AM songs. The responses were the largest for time-frequency scales between 4 and 16 msec. The mean response at the peak time-frequency scale of 4 msec was 77 ± 13% (± SEM) of the response to the BOS. Most individual neuronal responses also showed a response at the peak of their tuning that was high but still below that of the BOS; 83% of the recording sites had Z scores below the value of their Z score to the BOS. A one-tailed paired t test comparing the mean Z score obtained for the AM-4 songs and the mean Z score for the BOS shows that the mean for AM-4 is clearly below the mean for BOS (n = 34; t = -4.264; p = 0.0001).

In summary, HVc neurons show a strong response to synthetic songs that preserve only the amplitude envelopes of a filter bank decomposition of the original song, but only do so for a range of time-frequency scales between 4 and 16 msec (250-62 Hz). On visual inspection, it appears that some of the synthetic songs that gave the best neural responses were also the ones that were in some sense most like the original signal. On the other hand, we also found large responses to synthetic songs that apparently had large amounts of spectral degradation. In the next section, we will attempt to quantify these observations about how the different AM songs characterize the acoustical structure in the song and how this compares with neural responses. It is also true that even the responses for the optimal time-frequency scales were still below those of the original song, reflecting the fact that HVc song-selective neurons are sensitive to additional temporal and spectral structure of the original song that was not reflected in these AM songs (see Materials and Methods). The nature of the missing structure and the sensitivity of the neurons to the gradual restoration of this structure will be addressed in a subsequent section.

Relative preference for temporal cues

The next goal in our analysis was therefore to compare the time-frequency scale tuning of HVc neurons with the time-frequency scale that would best characterize the acoustical structure in song obtained in an independent manner. For instance, it is well known that a given sound is best represented in a spectrogram when this spectrogram is calculated at a particular time-frequency scale. For example, on visual inspection, the acoustical structure of the zebra finch song shown in Figure 2 seems best represented by the spectrogram calculated with a 16 msec window. Similarly, when we look at the spectrograms of the synthetic songs generated from amplitude envelopes of the BOS obtained from a range of time windows such as those shown in Figure 6, we can visually decide on a time window that seems to be the best at characterizing the structure present in the original song. In general, the optimal time window depends on the properties both of the acoustical signal and of the particular acoustical features that are of interest. Optimally, we would like to base our criteria for the "best representation" not necessarily on all the information that could be extracted from the spectrogram (or the set of amplitude envelopes), but on the aspects of that information that represent the behaviorally relevant bioacoustical structure of zebra finch song. To do so, one might want to evaluate the quality of AM songs generated at different time-frequency scales by testing the efficacy of the songs in eliciting the appropriate natural behaviors.

Short of this, we estimated the time window at which a simple measure of discrimination based on the spectrogram would give us the most information and enable us to distinguish songs from different zebra finches. Our measure of discrimination was based on the cross-correlation between the amplitude envelopes of the different songs (i.e., C_A). We calculated the pairwise correlations between 16 zebra finch songs from our colony (120 comparisons) and between the syllables in each of the songs that were the most similar (n = 2031). The correlations were calculated for a range of time scales from 1 to 256 msec and are shown in Figure 9.


Figure 9.   Cross-correlation between amplitude envelopes calculated at different time-frequency scales for songs (Song) and syllables (Syll) from different birds. Sixteen different songs were used, resulting in 120 pairwise correlation measures for songs and over 2000 pairwise comparisons for syllables. Low values of cross-correlation indicate large differences between signals and therefore show the time-frequency scales that are best at discriminating among zebra finch songs. The error bars showing 1 SEM are smaller than the size of the markers.

The cross-correlation measure shows a tuning with a minimum at 32 msec. This minimum point corresponds to the time-frequency scale at which the amplitude envelopes of the two songs are the most different according to the cross-correlation measure (other measures based on higher order statistics might give slightly different answers). This quantitative measure matches our visual estimates of the "best" spectrograms in Figures 2 and 6. Note that, in contrast to the AM songs used in the physiology, in this calculation the amplitude envelopes are not distorted because new synthetic songs were not generated in the process. Moreover, the amplitude envelopes (or the spectrogram) obtained from the original decomposition using our overlapping filters completely characterize the original songs, except for an absolute phase. The same information about the identity of each song is therefore present in the amplitude envelopes at any time-frequency scale but in a different form. At particular (optimal) time-frequency scales, the temporal and spectral structure that is most useful in distinguishing between songs is encoded in large fluctuations in each of the envelopes. At other time-frequency scales, the same temporal and spectral structure can only be recovered by examining the joint small fluctuations in the envelopes from multiple bands (see Materials and Methods). This is the effect that we are quantifying with the measure of cross-correlation between amplitude envelopes. For the same reason, the noise added to the amplitude envelopes of the AM songs at those optimal time-frequency scales by randomizing the phase has the smallest effect in altering the time-frequency structure of the signal, as illustrated in Figures 2 and 6.

One might expect the time-frequency scale of individual syllables to be different from that of an entire song, because a large fraction of the temporal complexity of a full song is attributable to the more or less precise sequence of syllables and silences. In fact, the curves for song and syllable shown in Figure 9 reach their minima at approximately the same point. The effect of the overall temporal pattern in the entire song is nevertheless reflected in the relatively larger width of the curve for discriminations based on the entire song, particularly at finer time resolutions; there the overall temporal pattern given by the sequence of syllables still allows one to discriminate across songs. In contrast, as the time resolution is made finer, all individual syllables begin to resemble each other, being described by a few Gaussian-shaped amplitude envelopes.

The time-frequency tuning of the correlation measure can now be compared with the time-frequency tuning of HVc neurons to the AM songs shown in Figure 8. The two curves are plotted together in Figure 10 to facilitate the comparison. The symbols (*, ◆) in Figures 6-8 and 10 indicate the time scales for the peak of the averaged neural responses (*) and for the peak of the average discrimination based on the cross-correlation measure (◆). The symbols are used to facilitate further the comparison between all of the figures, but note that the strength of the neural response at the 4 msec peak is not significantly different from those at 8 and 16 msec. However, it is clear that the two curves are shifted along the time-frequency axis. To test whether this shift was significant, we compared the distribution of peaks in neuronal responses with the distribution of minima in the cross-correlation values. The distribution of neural sites with peak responses at the different time scales was as follows: 4, 13, 16, and 8 sites at 2, 4, 8, and 16 msec, respectively (total = 41 neurons). The distribution of minima in the cross-correlation was 5, 42, 66, 6, and 1 song pairs at 8, 16, 32, 64, and 128 msec, respectively (total = 120). A Kolmogorov-Smirnov test (insensitive to the logarithmic scale) shows that these two distributions are different from each other with high statistical significance (p < 0.0001). The mean time-frequency value for neuronal peak is 7.7 msec, whereas the mean time-frequency value for minimum cross-correlation is 27.8 msec. A one-tail t test done both with and without a log transform shows that these two means are statistically different (p < 0.0001 in both cases). The difference is striking when one compares the spectrograms for the AM-4 song that elicited a maximal response in 13 of 41 recording sites with the spectrogram for the AM-32 song that elicited no peak responses and small average responses overall (Fig. 6).
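The two means quoted above can be checked directly from the reported counts. The sketch below uses only the counts given in the text; the helper names (`weighted_mean`, `ecdf`, `ks_statistic`) are hypothetical, and it computes the two-sample Kolmogorov-Smirnov statistic only, not the p value or the t test reported by the authors.

```python
# Counts transcribed from the text: time scale (msec) -> number of cases.
neural_counts = {2: 4, 4: 13, 8: 16, 16: 8}              # peak responses, 41 sites
xcorr_counts = {8: 5, 16: 42, 32: 66, 64: 6, 128: 1}     # minima, 120 song pairs


def weighted_mean(counts):
    """Mean time scale, weighting each scale by its number of cases."""
    n = sum(counts.values())
    return sum(scale * k for scale, k in counts.items()) / n


def ecdf(counts, x):
    """Empirical cumulative distribution: fraction of cases at scales <= x."""
    n = sum(counts.values())
    return sum(k for scale, k in counts.items() if scale <= x) / n


def ks_statistic(counts_1, counts_2):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    points = sorted(set(counts_1) | set(counts_2))
    return max(abs(ecdf(counts_1, x) - ecdf(counts_2, x)) for x in points)


mean_neural = weighted_mean(neural_counts)
mean_xcorr = weighted_mean(xcorr_counts)
d_stat = ks_statistic(neural_counts, xcorr_counts)

print(round(mean_neural, 1), round(mean_xcorr, 1), round(d_stat, 2))
```

The computed means round to 7.7 and 27.8 msec, matching the values in the text. Because the KS statistic depends only on the ordering of the time scales, it is unchanged by a log transform of the axis, which is the sense in which the test is insensitive to the logarithmic scale.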


Figure 10.   Comparison of the cross-correlation measure for song similarity and of the response of HVc neurons as a function of the time-frequency scale. The data in Figures 8A and 9 are plotted together to facilitate the comparison. Note that the ri