The auditory sense of humans transforms intrinsically senseless pressure waveforms into spectacularly rich perceptual phenomena: the music of Bach or the Beatles, the poetry of Li Bai or Omar Khayyam, or more prosaically the sense of the world filled with objects emitting sounds that is so important for those of us lucky enough to have hearing. Whereas the early representations of sounds in the auditory system are based on their physical structure, higher auditory centers are thought to represent sounds in terms of their perceptual attributes. In this symposium, we will illustrate the current research into this process, using four case studies. We will illustrate how the spectral and temporal properties of sounds are used to bind together, segregate, categorize, and interpret sound patterns on their way to acquiring meaning, with important lessons for other sensory systems as well.
The auditory system extracts an astounding amount of information about the world from seemingly simple signals: the sound waves reaching the two ears. The initial sound representations, at the auditory nerve that connects the inner ear with the CNS, as well as in brainstem stations, such as the cochlear nucleus and the superior olivary complex, are well described. Indeed, the cochlear nucleus can compete with the retina for being the best understood CNS structure, and technological standards, such as the MP3 sound coding scheme, are based in part on the deep understanding that already exists regarding these initial sound representations in the early stages of the auditory system.
There is, however, a large gap between these early representations, which are centered on the physical structure of the sound waveform, and perceptual representations, which are related to the actual “things” that occur in the world. Following the ground-breaking work of Bregman and his collaborators (Bregman, 1990), this process is referred to as auditory scene analysis. Deviating from Bregman's original proposal, we tend to call the resulting “things” auditory objects (Griffiths and Warren, 2004; Winkler et al., 2009; McDermott et al., 2011; Schnupp et al., 2011; Cervantes Constantino et al., 2012). The formation of auditory objects is considered a crucial step in auditory processing. Stations from the midbrain up to auditory cortex are thought to be involved in this computation. We think that the representation of auditory objects forms the main output of the auditory system and that it is these that are used by the rest of the brain to guide behavior (e.g., listening to music, responding to someone's voice, or localizing a sound source in space around us).
Interestingly, although many laboratories work on the perception of auditory objects, a commonly accepted definition of what they are is lacking; instead, different groups concentrate on different clusters of properties that auditory objects are likely to have, and use them as a handle for accessing the more general concept. In a symposium at this year's Society for Neuroscience annual meeting, we will illustrate the richness and vividness of this research area by addressing four different computational problems that the auditory system has to solve to create auditory objects, as summarized below.
Humans and animals can attend to a sound source and segregate it rapidly from a background of many other sound sources, often with little learning or prior exposure to the specific sounds. For humans, this is the essence of the well-known “cocktail party problem” in which a person can effortlessly conduct a conversation with a new acquaintance in a crowded and noisy environment (Cherry, 1953; Bregman, 1990). For many animals, including frogs, songbirds, and penguins, this ability is vital for locating a mate or an offspring in the midst of a loud chorus (Aubin and Jouventin, 2002; Singh and Theunissen, 2003; Bee and Micheyl, 2008; Velez et al., 2012). This capacity is matched by comparable object segregation feats in vision and other senses (Ison and Quiroga, 2008; Henderson et al., 2009; Rust and Stocker, 2010), and hence understanding auditory object segregation will shed light on the neural mechanisms that are fundamental and ubiquitous across all sensory systems.
In contrast to visual scenes, in which nearby bits of the scene are likely to belong to the same object, elements of a given auditory object are not necessarily local in the basic auditory representation, which is based on frequency. Instead, bits of the same object may occupy different frequency bands. For example, harmonic relationships are extremely important for deciding which frequency bands are bound together to form an object, and indeed sounds that have harmonic structure produce pitch: one can play melodies with them (Schnupp et al., 2011). The study of pitch has a long history (Turner, 1977), going back to the mid-nineteenth century with Helmholtz and Ohm (who mistakenly claimed that pitch perception requires the physical existence of the fundamental frequency of the harmonic complex) and Seebeck (who actually proved that pitch can be evoked by stimuli missing the fundamental).
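The missing-fundamental observation can be illustrated with a small numerical sketch. The Python code below is only an illustration under stated assumptions: the function names (`harmonic_complex`, `autocorr_pitch`, `amplitude_at`), the 200 Hz fundamental, the choice of harmonics 3 to 5, and the 8 kHz sampling rate are all hypothetical, and autocorrelation is used here simply as a convenient periodicity estimator, not as a claim about the mechanism used by the auditory system.

```python
import math

def harmonic_complex(f0, harmonics, fs=8000, dur=0.1):
    """Sum of equal-amplitude sine harmonics of f0 (Hz), sampled at fs (Hz)."""
    n = int(fs * dur)
    return [sum(math.sin(2 * math.pi * h * f0 * t / fs) for h in harmonics)
            for t in range(n)]

def autocorr_pitch(x, fs, fmin=50.0, fmax=500.0):
    """Periodicity estimate: the lag (within fmin..fmax) with the largest
    unnormalized autocorrelation, converted back to a frequency in Hz."""
    lo, hi = int(fs / fmax), int(fs / fmin)

    def acf(lag):
        return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

    best_lag = max(range(lo, hi + 1), key=acf)
    return fs / best_lag

def amplitude_at(x, freq, fs):
    """Amplitude of the sinusoidal component of x at freq (single DFT probe)."""
    re = sum(v * math.cos(2 * math.pi * freq * i / fs) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * i / fs) for i, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)

fs = 8000
# Harmonics 3, 4, and 5 of 200 Hz (600, 800, 1000 Hz); the fundamental is absent.
x = harmonic_complex(200.0, [3, 4, 5], fs=fs)

pitch = autocorr_pitch(x, fs)          # periodicity of the waveform
a_f0 = amplitude_at(x, 200.0, fs)      # physical energy at the fundamental
a_h3 = amplitude_at(x, 600.0, fs)      # physical energy at the third harmonic
print(pitch, a_f0, a_h3)
```

In this toy stimulus the waveform repeats every 5 ms, so the estimated periodicity is 200 Hz even though the component at 200 Hz has essentially zero amplitude, whereas the 600 Hz component is physically present.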
Harmonically related frequency components are produced by musical instruments of many types whose designs result in modes with harmonic frequencies. Harmonic sounds are also produced by the vocal apparatuses of humans and many animal species (ranging from birds to rodents and primates). Any periodic sound is harmonic, and sounds emitted by animals tend to be periodic. Thus, it is not too much of an exaggeration to say that we live in, and have evolved in, an acoustic environment full of sounds with harmonically related frequency components. The perception of auditory objects based on harmonic relationships among component frequencies is essential for both speech and music perception.
In addition to harmonically related frequency components encountered in the acoustic environment around us, the auditory system also produces harmonics internally (Pickles, 1988). The cochlea generates nonlinear distortion products that may contain harmonics of frequencies included in the physical stimulus. Furthermore, nonlinear processing in the auditory nerve and subsequent brainstem and midbrain structures leading to auditory cortex also generates harmonic byproducts. The combination of exogenous and endogenous harmonics may have led to the formation of neural circuitry in the central auditory system, and in particular in auditory cortex, to process harmonic sounds. It has been proposed that a fundamental organizational principle of auditory cortex is based on harmonic structures of sounds (Wang, 2013). Such an organization has important implications for understanding how the brain processes speech and music.
A typical auditory neuron throughout the ascending auditory system is most sensitive to one particular frequency (the characteristic frequency or best frequency) within the hearing range of a species. In auditory cortex, however, a number of studies have shown that many neurons are “multipeaked”: they are sensitive to multiple frequencies, and these frequencies are often harmonically related. Multipeaked cortical neurons have been found in a variety of mammalian species, from bats (Suga et al., 1983) and cats (Sutter and Schreiner, 1991) to nonhuman primates (Kadia and Wang, 2003; Sadagopan and Wang, 2009). Such neurons may be components of an underlying neural circuitry in auditory cortex that processes harmonic patterns embedded in natural sounds. Recent neuroimaging studies in humans (Patterson et al., 2002; Penagos et al., 2004; Norman-Haignere et al., 2013) and neurophysiology experiments in marmoset monkeys (Bendor and Wang, 2005; Bendor et al., 2012; Osmanski et al., 2013) have identified regions of nonprimary auditory cortex that have selective responses to the fundamental frequency of harmonic complex sounds that evoke the perception of pitch in humans and marmosets (referred to hereafter as “pitch-selective neurons”). The pitch-selective neurons identified in nonprimary auditory cortex of marmosets are not only tuned to low-frequency pure tones, but also to missing fundamental harmonic complex sounds with a pitch near a neuron's characteristic frequency (Bendor and Wang, 2005). These pitch-selective neurons do not, however, respond to individual components in a harmonic complex tone that are outside a neuron's tone-derived excitatory frequency response area, suggesting that such neurons extract pitch of an auditory object from spectrally separated but harmonically related frequency components (Bendor and Wang, 2006; Wang and Walker, 2012).
New experimental evidence to be discussed in this symposium has revealed more widespread harmonic pattern processing in auditory cortex beyond the range of pitch.
Although harmonicity is an important auditory processing primitive for fusing the percept of multiple acoustic components, the separation of ongoing sound sources into separate “streams” involves primarily sequential relationships in time. Temporal coincidence is thought to play a key role in the streaming of sources such that a unified sound source is perceived only when all of its attributes or features are bound together by being temporally coherent with each other, and also incoherent with the attributes of all other concurrent sources.
Three cortical mechanisms may underlie the use of temporal coincidence for auditory scene analysis. The first is the rich and diverse nature of sound representation in the auditory cortex in which multiscale spectral and dynamic features, as well as location and pitch cues are extracted and encoded explicitly by the primary auditory cortical responses. These representations, however, are shown to be highly plastic, rapidly adapting within fractions of a second, to modulate their sensitivity and saliency according to the objectives and target of attention during behavior. The second cortical mechanism critical for source segregation and formation is coherence analysis through which all temporally coincident features of a single source are identified, grouped, and eventually segregated away from other sources. Finally, the third set of critical mechanisms reviewed involves the attentional influences that target specific features as anchors so as to bind all other elements of a source.
To illustrate the versatility of the temporal coincidence principle in sound segregation, a computational implementation of these ideas has been developed in which sound is first transformed by a model of the early auditory stages to its cortical representation (Chi et al., 2005). A subsequent stage computes a coincidence matrix that summarizes the coincidences between all pairs of responses making up the cortical representation. A final auto-encoder network is then used to decompose the coincidence matrix into its different streams. The use of the cortical representation here is critical as it provides a multiresolution view of spectral and temporal features of the incoming sounds, and these in turn endow the model with its robust character.
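The pairwise-coincidence step can be sketched in a few lines of Python. This is a toy illustration, not the published model: the four hand-built response channels stand in for the cortical representation, the zero-lag correlation coefficient stands in for the coincidence measure, and a simple threshold around an attended "anchor" channel stands in for the auto-encoder decomposition. The names `coherence_matrix` and `stream_for_anchor`, and the threshold of 0.5, are hypothetical.

```python
import math

def coherence_matrix(responses):
    """responses[i][t] is the activity of channel i at time step t.
    Returns the matrix of zero-lag correlation coefficients between all
    pairs of channels (assumes no channel is constant over time)."""
    centered = []
    for r in responses:
        mean = sum(r) / len(r)
        centered.append([v - mean for v in r])
    norms = [math.sqrt(sum(v * v for v in z)) for z in centered]
    return [[sum(a * b for a, b in zip(zi, zj)) / (ni * nj)
             for zj, nj in zip(centered, norms)]
            for zi, ni in zip(centered, norms)]

def stream_for_anchor(C, anchor, threshold=0.5):
    """Channels whose coherence with the anchor channel exceeds the threshold."""
    return [j for j, c in enumerate(C[anchor]) if c > threshold]

# Two alternating sources: source A is active on even time steps, B on odd ones.
env_a = [1.0, 0.0] * 8
env_b = [0.0, 1.0] * 8
responses = [
    env_a,                      # channel 0: driven by source A
    [0.5 * v for v in env_a],   # channel 1: source A, different gain
    env_b,                      # channel 2: driven by source B
    [2.0 * v for v in env_b],   # channel 3: source B, different gain
]

C = coherence_matrix(responses)
stream_a = stream_for_anchor(C, anchor=0)
stream_b = stream_for_anchor(C, anchor=2)
print(stream_a, stream_b)
```

Channels driven by the same source are temporally coherent and end up in the same group, whereas channels driven by the alternating source are anticorrelated with the anchor and are excluded. Choosing the anchor channel plays the role that attention to a specific feature plays in the account above.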
The auditory objects that are being generated by the mechanisms described above have properties: for example, pitch (related to harmonicity), vowel identity (related to the multiscale representations of the spectral envelope), location in space (related to both binaural response properties and to some monaural spectral features), and so on. Each of these perceptual qualities is invariant to variations in many other sound qualities. For example, we can identify the vowel ‘a’ across different voice pitches and whether it is spoken, whispered, or sung. Being able to extract certain stimulus attributes while generalizing across others is what underpins our ability to identify and categorize sound sources in our environment. Although some of the computations that might support perceptual invariance have been elucidated in higher visual areas (Rust and Stocker, 2010), relatively little is known about whether similar computations underlie our ability to identify sound sources (Bizley and Cohen, 2013).
The potential for formation of invariant representations within auditory cortex has been demonstrated in previous studies of neural activity in anesthetized animals. For example, the identity of a bird song is represented by a subset of single neurons independently of its sound level (Billimoria et al., 2008). Other work has explored the representation of perceptual features such as pitch, timbre, and spatial location, and observed that neurons throughout auditory cortex are modulated by multiple stimulus features (Bizley et al., 2009). Although such an observation makes it seem unlikely that any one of these perceptual features is represented in an invariant manner, these same neural responses can provide a robust representation of any one stimulus feature if the neural response is considered within a distinct time window (Walker et al., 2011). These studies demonstrate the potential for neurons in early auditory cortex to contribute to perceptual invariance, but drawing stronger conclusions requires that we measure both neural coding and stimulus perception simultaneously.
A behavioral model was developed to study perceptual invariance by training ferrets to identify artificial vowels in a two-alternative forced choice task. Ferrets were able to generalize vowel identity over a range of voice pitches and sound intensities and were also able to accurately classify whispered vowels (Bizley et al., 2013; Town et al., 2013). To explore the neural mechanisms that support this perceptual invariance, multielectrode recording arrays were implanted into auditory cortex and neural activity recorded during behavioral discrimination. Neuronal responses were decoded to quantify how unit activity discriminated the following: (1) the identity of the target vowel across variation in pitch or voicing, (2) pitch across the two vowel classes, and (3) the associated behavioral response. Many neurons were informative about the vowel identity, and the classification performance of a subpopulation of neurons matched behavioral performance. Neuronal responses were also informative about the pitch of the target vowel, as well as (perhaps surprisingly) the behavioral decision that the animal took. However, decoding performance was highest for sound identity early in the sound, whereas pitch decoding was best achieved when considering time periods throughout the duration of the sound, and decoding the behavioral decision was best achieved toward the end of the sound. These data suggest that, during behavior, neurons in auditory cortex provide a robust estimation of the vowel identity across variation in other stimulus features.
Finally, as already alluded to above, the analysis of auditory scenes requires figuring out sound properties within their temporal context. The same bit of sound can be perceived differently when embedded in different contexts: for example, a harmonic that is part of a vowel may be “captured” by a different auditory stream, causing vowel identity to change (Roberts and Holmes, 2006). Indeed, neuronal responses in the auditory system show a rich context sensitivity. Importantly for computations involving auditory objects, context sensitivity can span surprisingly large time scales. In cortex, sequential effects may span the range from seconds (Asari and Zador, 2009) up to minutes (Yaron et al., 2012), even under anesthesia.
This type of context sensitivity has been studied extensively under the name “stimulus-specific adaptation” (SSA). The basic paradigm that is used for evoking SSA is the oddball sequence: a sequence composed of two stimuli (one common and one rare). Almost invariably, a second sequence, inverting the roles of the two stimuli, is used as well, and the responses to the same stimulus in the two sequences are compared. Very often, the response to a stimulus when rare is larger than the response to the same stimulus when common. Such effects may be thought of as a form of predictive coding, with responses related to the “prediction error,” here related to the probability of the sound that is currently presented, with that probability estimated from the recent past. Indeed, predictive coding has been suggested to play an important role in auditory scene analysis and the formation of auditory objects (Winkler et al., 2009). SSA has been shown in rodents (Anderson et al., 2009; Malmierca et al., 2009; Bäuerle et al., 2011; Zhao et al., 2011), carnivores (Ulanovsky et al., 2003), and primates (Fishman and Steinschneider, 2012), as well as in nonmammalian species (Reches and Gutfreund, 2008). SSA is present as early as the inferior colliculus, although it is weak in the core, lemniscal pathway leading to primary auditory cortex (Antunes et al., 2010; Duque et al., 2012). In primary auditory cortex, SSA is strong and robust (Taaseh et al., 2011; Hershenhoren et al., 2014; Nelken, 2014), leading to the hypothesis that it is computed at least twice in the auditory system: once in the nonlemniscal subdivisions of the inferior colliculus and a second time in primary auditory cortex.
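The oddball paradigm and its prediction-error reading can be made concrete with a minimal Python sketch. Everything here is an illustrative assumption rather than a model taken from the studies cited above: the stimulus labels `f1` and `f2`, the sequence length and deviant probability, the exponentially forgetting probability estimate with rate `alpha`, and the use of surprise (negative log probability, in bits) as a stand-in for the prediction error.

```python
import math
import random

def oddball_sequence(common, rare, n=400, p_rare=0.1, seed=0):
    """Shuffled sequence containing round(n * p_rare) rare stimuli."""
    n_rare = round(n * p_rare)
    seq = [rare] * n_rare + [common] * (n - n_rare)
    random.Random(seed).shuffle(seq)
    return seq

def mean_surprise(seq, stim, alpha=0.1):
    """Average surprise, -log2(p), over presentations of `stim`, where p is
    the probability of the stimulus estimated from the recent past with an
    exponentially forgetting running average (rate alpha)."""
    labels = set(seq)
    p = {s: 1.0 / len(labels) for s in labels}   # uninformed starting estimate
    total, count = 0.0, 0
    for s in seq:
        if s == stim:
            # Surprise is evaluated before the estimate sees the current sound.
            total += -math.log2(max(p[s], 1e-12))
            count += 1
        for k in p:
            p[k] = (1 - alpha) * p[k] + (alpha if k == s else 0.0)
    return total / count

# Two sequences with the roles of the two stimuli inverted.
seq_f2_rare = oddball_sequence(common="f1", rare="f2")
seq_f2_common = oddball_sequence(common="f2", rare="f1")

# Compare the same stimulus (f2) when it is rare and when it is common.
surprise_when_rare = mean_surprise(seq_f2_rare, "f2")
surprise_when_common = mean_surprise(seq_f2_common, "f2")
print(surprise_when_rare, surprise_when_common)
```

Under these assumptions the same stimulus carries a larger average surprise when it is rare than when it is common, which parallels the larger responses to rare stimuli described above. The forgetting rate `alpha` sets how much of the recent past enters the probability estimate.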
Cortical SSA has true deviance sensitivity: neurons in auditory cortex (although not in inferior colliculus) respond more strongly than expected to rare sounds embedded in oddball sequences, where their occurrence breaks down an expected regularity, but do not show a similar amplification of their responses to the same sounds with the same probability when embedded in a multitone sequence (the “control” condition) (Jacobsen and Schröger, 2001). Cortical neurons have been shown to be sensitive to sequence regularity in other ways as well; for example, when the deviants occur at fixed intervals rather than randomly, responses are smaller than those evoked by the same sounds with the same probability but occurring randomly (Yaron et al., 2012). Although most studies of SSA used pure tones and tested sensitivity to tone frequency, SSA is also produced by wideband stimuli (Nelken et al., 2013). When finely balanced in frequency, broadband sounds do not produce SSA in the inferior colliculus, even in subdivisions in which pure tones do show SSA, but they do produce SSA in auditory cortex. Thus, SSA in auditory cortex seems to be ideally placed for adapting cortical responses to the statistics of the incoming sound sequence, reducing the sensitivity to common, irrelevant sounds and increasing the sensitivity to deviations from the expected regularity.
This short and, admittedly, highly selective review demonstrates the large variety of perceptual questions, experimental approaches, and theoretical considerations that underlie research into sound processing in auditory cortex today. The field is much larger than what this review can cover, and we think that a great deal can be learned generally about the brain, behavior, sensory processing, and everything in between, by studying how we listen to sounds in the real-world environment.
This work was supported in part by the United States–Israel Binational Science Foundation Grant to I.N., Royal Society/Wellcome Trust Award WT098418MA and BBSRC Awards BB/D009758/1 and BB/H016813/1 to J.B., Advanced ERC and National Institutes of Health Grant R01 DC005779 to S.A.S., and National Institutes of Health Grants DC03180 and DC005808 to X.W.
The authors declare no competing financial interests.
Correspondence should be addressed to Dr. Israel Nelken, Department of Neurobiology, Hebrew University, Edmond J. Safra Campus, 91904 Jerusalem, Israel.