Abstract
A striking feature of many sensory processing problems is that there appear to be many more neurons engaged in the internal representations of the signal than in its transduction. For example, humans have ∼30,000 cochlear neurons, but at least 1000 times as many neurons in the auditory cortex. Such apparently redundant internal representations have sometimes been proposed as necessary to overcome neuronal noise. We instead posit that they directly subserve computations of interest. Here we provide an example of how sparse overcomplete linear representations can directly solve difficult acoustic signal processing problems, using as an example monaural source separation using solely the cues provided by the differential filtering imposed on a source by its path from its origin to the cochlea [the headrelated transfer function (HRTF)]. In contrast to much previous work, the HRTF is used here to separate auditory streams rather than to localize them in space. The experimentally testable predictions that arise from this model, including a novel method for estimating the optimal stimulus of a neuron using data from a multineuron recording experiment, are generic and apply to a wide range of sensory computations.
Introduction
Animals in nature confront an acoustic environment consisting of sounds from a rich, indeed often bewildering, combination of sources. Survival depends on responding appropriately to potential threats, food sources, and mates (e.g., at a cocktail party), while at the same time ignoring the many irrelevant sound sources that may constitute the majority of the acoustic energy received. Source separation, or “stream segregation,” is therefore one of the central problems in acoustic processing that organisms must solve. Animals must confront many of the same challenges in solving this problem as do artificial systems, and the insights gained from the one can be applied to the other. However, little is currently known about how animals solve this problem (but see Fishman et al., 2004; Micheyl et al., 2005), and no artificial system can solve it in a general setting.
Animals exploit a variety of binaural and monaural cues to separate acoustic sources (Bregman, 1990). For example, two tones occurring simultaneously are more likely to be grouped together perceptually (i.e., perceived as arising from the same source) than the same notes occurring sequentially. Such grouping makes sense under the assumption that the auditory system is trying to discover the statistically independent causes of the acoustic signals received at the ears (Bell and Sejnowski, 1995, 1997; Lewicki and Sejnowski, 2000; Simoncelli and Olshausen, 2001); simultaneous onset of two tones is unlikely to arise purely by chance; thus, it is more parsimonious to assume that the tones were caused by a single source (e.g., as harmonics of a single fundamental frequency.) Many of the spectral, temporal, and spatial cues used for stream segregation can be interpreted in this context.
A striking feature of this and many other sensory processing problems is that there appear to be many more neurons engaged in the internal representations of the signal than in its transduction. For example, humans have only ∼30,000 cochlear neurons, but at least 1000 times as many neurons in the auditory cortex. Although such apparently redundant internal representations have sometimes been proposed as necessary to overcome neuronal noise, here we posit that they contribute to computation.
To extract the behaviorally relevant information embedded in natural acoustic environments, animals must be able to separate auditory streams originating from distinct acoustic sources (“cocktail party problem”). The auditory cortex has orders of magnitude more neurons than the cochlea, thus many different patterns of cortical activity may faithfully represent any given pattern of cochlear activity. We propose that the cortex exploits this excess “representational bandwidth” (DeWeese et al., 2005), or the excess degrees of freedom, by selecting the sparsest representation within an overcomplete set of features. This model suggests how this excess representational bandwidth can be used for computation, instead of merely to overcome neuronal noise as is usually assumed. We illustrate this model by showing how sparseness can be used to separate sources perceived monaurally. The model makes testable predictions about the dynamic nature of representations in the auditory cortex. Our results support the idea that sparse representations may underlie efficient computations in the auditory cortex.
Our approach is to adopt a practical computational framework for the cocktail party problem and then explore the testable implications that follow. Here we describe a model of how the auditory system can exploit one particular sort of monaural segregation cue, namely the spectral cues introduced by the differential filtering imposed by the headrelated transfer function (HRTF). Note that in contrast to much previous work, the HRTF is used here to separate auditory streams rather than to localize them in space; our model assumes that the locations of the sources have already been determined by other mechanisms.
The model posits that the neural representation of an acoustic stimulus is overcomplete in the sense that there are many more neurons available than are needed to represent the stimulus with high fidelity (Olshausen and Field, 1997; Lee et al., 1999; Lewicki and Sejnowski, 2000; Zibulevsky and Pearlmutter, 2001). Because the representation is overcomplete, there are many patterns of neural activity that all faithfully encode any given stimulus. We show that constraining neural activity to be sparse selects one of these representations and that the resulting pattern of neural activity solves the source separation problem, even when multiple sources are audible to only a single ear. The framework is quite general and can serve as a starting point for understanding how cortical circuits might exploit other sensory cues as well.
Materials and Methods
All programming was done in Matlab (The MathWorks, Natick, MA).
HRTF.
The HRTF is the filter imposed by the head and the detailed shape of the ear on sounds received at the cochlea. The HRTF depends on the spatial position (both the relative azimuth and elevation) of the source (Yost et al., 1996). At some frequencies, the HRTF can attenuate sound from one location by as much as 40 dB more than from another.
Although every individual has his or her own HRTF, the basic characteristics of HRTFs are similar across individuals. We used a representative left human pinna HRTF downloaded from http://www.itakura.nuee.nagoyau.ac.jp/HRTF/ (Nishino et al., 2001).
Spectral basis for sources via nonnegative matrix factorization.
We tested our algorithm on mixtures of musical sources. We used nonnegative matrix factorization (NMF) to obtain basis elements for each source.
NMF is an algorithm for factorizing a data matrix under an elementwise nonnegativity constraint (Lee and Seung, 1999). The original data matrix is given as an n × m matrix V, each column of which here contains the n data values for one of m spectrogram segments. The data matrix V is approximated by NMF as V ≈ QH, where the dimension of the factors Q and H are n × r and r × m, respectively. The rank of factorization, r, is chosen so nr + rm < nm, so as to compress the original nm elements in the matrix V into a smaller number of elements, nr in Q plus rm in H. Each column of Q contains one of the basis spectrograms, and the matrix H represents the coefficients for reconstructing the columns of the original data matrix V in this basis. The Q matrices obtained by NMF for each individual source were concatenated to form an overcomplete sourcespace basis matrix D̃ = [Q_{1}  Q_{2}  … ]. Each column of D̃ was then filtered through each HRTF and the results concatenated to form the feature matrix: D = [h_{1}(t) ∗ D̃  …  h_{N}(t) ∗ D̃].
In our experiments, the spectrograms in the data matrix V were obtained from music sounds, natural sounds, or speech sounds: commercial audio CDs (instrumental solos, classical and jazz, one each on cello, clarinet, trumpet, harp, and harpsichord, for a total of five), the audio CDs The Diversity of Animal Sounds and Sounds of Neotropical Rainforest Mammals (Cornell Laboratory of Ornithology, Ithaca, NY), and spoken poetry (Dylan Thomas, T. S. Eliot, Frank O’Hara, and William Butler Yeats on the commercial audio CD Poetry speaks: hear great poets read their work from Tennyson to Plath, Sourcebooks Inc., 2001), respectively. Samples of 100–150 s were taken, stereo channels averaged, and the signal downsampled from the original 44.1 kHz to 8 kHz. Logscaled spectrograms were generated using a custom Matlab routine (available upon request) with a bin size of 5 ms and 75 frequency bands ranging from 55 to 3951 Hz in steps of 1/12 octave. Each column of V held a strip of spectrogram, yielding a dimensionality of n = 75, and m = 5000 samples were used for the training. Note that the training samples were distinct from those used for the testing. Specifically, we used 10,000 samples to assess the representational sparseness achieved by the NMF basis (see Fig. 2) and 20,000 random combinations of three sources (using the 10,000 samples in Fig. 2) to assess separation performance (see Fig. 4).
Each NMF run consisted of 500 iterations with 10 restarts from random initial conditions, with the restart that yielded the minimum total error chosen. The factorization rank was r = 15. Concatenating the five Q matrices for the five instruments yielded a dictionary of 75 basis elements, each of which was filtered by each of three different HRTFs, resulting in a feature matrix D with 225 columns. The source locations were randomly chosen but 90° apart from each other in the simulations (in Fig. 3, the three sources were located on your left, center and right, corresponding to the HRTFs for azimuth −90°, 0°, and 90°, respectively, with zero elevation). The analyses on the natural sound and speech sound were performed in a similar manner, with 5000 training samples for each data matrix V.
Minimization.
Pseudoinverses (L_{2}norm minimization) were computed with the pinv routine in Matlab, which uses an algorithm based on singular value decomposition. The linear programming problem (Eq. 14) was solved using the Matlab Optimization Toolbox linprog routine. We did not impose a nonnegativity constraint on the coefficients. As a result, the dense solution consists of negative coefficients as well as positive ones, whereas all of the substantially nonzero elements are positive for the sparse solution. Figure 6 shows the absolute values of the dense solution coefficients.
The linear programming problem given in Equation 14 can be sensitive to noise. We therefore solved an augmented version of this that included a noise model. In particular, we assumed that the total amount of noise was bounded. Thus, Equation 14 was reformulated as the following: where β is proportional to the noise level and with p = 1, 2, or ∞. Letting β → 0 is equivalent to assuming that the noise is very small, and the solution converges to the zeronoise solution, Equation 14.
The Gaussian noise case, p = 2, can be solved by semidefinite programming methods (Fletcher, 1985). Both p = 1 and p = ∞ can be solved using linear programming. All approaches yield qualitatively similar results.
The solutions presented here all used p = 1. For this case, noise vectors e^{+} and e^{−} are introduced and included in the optimization, allowing Equation 1 to be rewritten in standard form:
We typically examined four different noise levels (log_{10}‖y‖_{1}/β = 1, 2, 3, 4) and selected the one with the best separation performance on average as the result (see Figs. 4, 5).
Signaltonoise ratios (SNRs) were calculated as the reciprocal of the average across sources of 〈(x_{i}(t) − x̂_{i}(t))^{2}〉/〈x_{i}(t)^{2}〉, where x_{i}(t) is the original spectrogram of the ith source, x̂_{i}(t) is its estimate when recovered from the mixture, and the average 〈·〉 is over time.
To measure the degree of sparse representations by NMF basis elements (see Fig. 2) and its relationship to the separation performance (see Fig. 4), we introduced a “sparseness index” defined as the number of nonzero elements in the presence of a single source divided by the dimension size. This index is unity for a dense representation and approaches zero as the representation becomes sparser. The noise level was log_{10}‖y‖_{1}/β = 1 in Figure 2B, resulting in the reconstruction SNR of 18.3 ± 3.8, 16.0 ± 3.0, and 18.0 ± 3.6 (median ± interquartile range in decibels) for music, natural sound, and speech ensembles, respectively.
Estimation of linear encoders and decoders.
Given a set of stimuli y_{k} (for k = 1, 2, …) and the corresponding responses c_{k} generated using Equation 14, the optimal linear decoding filter D̂ was estimated by solving the following regression problem: where ‖·‖_{2} denotes the L_{2} (Euclidean) norm. Similarly, the optimal linear encoder Ê was obtained by solving the following equation:
Note that we used a fraction of the elements in c_{k} for the linear filter estimation and showed the average results in Figures 7 and 8 over 200 random samplings of neurons. Also note that the ith column of D̂ and the ith row of Ê correspond to the optimal linear decoder and encoder for the ith neuron, respectively.
In Figure 7, we used a 1168 × 3600 feature matrix D, each column of which held a feature spanning over 16 time bins (96 ms), with a bin size of 6 ms and 73 frequency bands ranging between 55 and 3520 Hz in steps of 1/12 octave. As the original feature of a target neuron, we chose the one obtained from cello ensembles, and thus we used cello sounds as input stimuli in the simulation.
Asymmetry of sparse representations.
To illustrate the asymmetry of linear encoding and decoding in the framework of our model, we ran simulations in 25 dimensions with 75 neurons. In the simulations, the threefold overcomplete features (a 25 × 75 feature matrix D) were first generated randomly on the unit hypersphere. Neural activities for sample stimuli drawn from a Gaussian distribution were then determined by Equation 14.
For simulated single unit data (see Fig. 8A), we computed the mutual information I(c, s) between the simulated neural responses c and stimulus s using I(c, s) = H(c) − H(cs) = H(c), where H(c) is the response entropy and H(cs), the conditional of the response given the stimulus, is zero because the relationship between stimuli and responses was deterministic. Thus the mutual information between the single neuron and the stimulus was just equal to the response entropy, which we estimated by direct binning from the histogram of neural responses. We compared this information to either the mutual information between the optimal linear estimate of the response given the stimulus and the actual stimulus (encoding), or between the optimal linear estimate of the stimulus given the response and the actual response (decoding). For these information estimations, we used the Gaussian approximation to bound the entropy of the reconstruction error (Bialek et al., 1991). We then normalized these linear information estimates to full mutual information to obtain the “reconstruction quality”:
For multiunit data (see Fig. 8B), the computation of the full mutual information (rather than the linear approximation) was computationally intractable. We therefore computed the following simpler measure of the reconstruction quality of the models: where ‖·‖_{2} denotes the L_{2} norm, and 〈·〉 denotes the mean over data. Note that the measure is based on the relative length of the model errors and that it gives zero for pure noise and one for perfect reconstruction.
Context dependence of spatial temporal receptive fields.
In Figure 10, for demonstration purposes, we used a 1168 × 3600 feature matrix D (the same one as in Fig. 7) and used two different sets of 300 active features (i.e., 1168 × 300 packed matrices D_{k}; see Appendix A.3) to estimate the spectrotemporal receptive fields (STRFs) for the two different contexts. Note that some of the features were active in both contexts (including the one shown in Fig. 10A), whereas others were active only in either context.
Results
Our main goals are to explore a model of computation with sparse representations and to generate new experimentally testable predictions from this model. To make our model concrete, we consider a specific computation: a special case of the monaural cocktail party problem in which the HRTF provides the critical cue for disentangling sources. We focus on this special case not because it is of central importance from a psychophysical perspective (in a general setting, the HRTF is typically just one of many cues, and often not the most important) but rather because this problem provides a convenient way to illustrate the key predictions. The same sparse framework can be generalized to exploit other cues for source separation and to other sensory processing problems (e.g., vision) as well.
The presentation is organized as follows. First, we define the particular source separation problem we consider, in which there are several sources and a single ear. Next, we show how a sparse overcomplete representation, such as that seen at the cortical level in the auditory system, can be used to separate the sources. Finally, we identify experimentally testable predictions of the model.
Problem formulation
We have all experienced the basic cocktail party problem as a part of everyday life: we stand in a room full of people chatting, chairs scraping, fans humming, and so forth, and strain to understand the words of a single interlocutor. This familiar but challenging scenario is interesting precisely because it tests the limits of what we humans can achieve. The cocktail party is, however, just an extreme example of a more general problem that the auditory system constantly confronts. It is rare that we can listen to an acoustic source without interference from other sources, yet our auditory system filters the interfering sources out of our conscious perception so effectively that we are often almost unaware of them. The apparent effortlessness with which we solve the cocktail party problem is deceptive and is a testament to the effectiveness of our auditory system. Indeed, the problem of background noise represents one of the main factors limiting the widespread practical adoption of artificial speech recognition systems.
The auditory system uses a wide variety of psychophysical cues to segregate auditory streams (Bregman, 1990), including both binaural and monaural cues. Many monaural cues have been identified, such as common onset time or comodulation of stimulus power in different parts of the spectrum.
For simplicity, we focus here on just one set of cues: those provided by the differential filtering imposed on a source by its path from its origin in space to the cochlea. This filtering is caused both by the head and the detailed shape of the ear (the HRTF) and by the environment on sources at different positions in space (Yost et al., 1996). The HRTF is important for generating a threedimensional experience of sound, so that acoustic sources that bypass the HRTF (e.g., those presented with headphones) are typically perceived unnaturally, as though arising inside the head (Wightman and Kistler, 1989; Kulkarni and Colburn, 1998). Whereas the importance of the HRTF in sound localization has been studied extensively (Knudsen and Konishi, 1979; Wightman and Kistler, 1989; Wenzel et al., 1993; Hofman and Opstal, 2002), its role in source separation as such has not. In contrast to much previous work, the HRTF is used here to separate auditory streams rather than to localize them in space.
It is often reasonable to assume that sound arriving from different locations should be treated as arising from distinct sources. For the purposes of the present study, all sounds from a given position are defined to belong to the same source, and any sounds from a different position are defined to belong to different sources. We emphasize that although sound localization (the process by which an animal determines where in space a source is located) is related to source separation (the process by which an animal extracts different auditory streams from a single waveform), the two computations are distinct; neither is necessary nor sufficient for the other. Here we focus on the separation problem and assume that source localization occurs by other mechanisms.
The particular source separation problem we consider is as follows. Suppose there are N acoustic sources located at known distinct positions in space, with x_{i}(t) being the time course of the stimulus sound pressure of the ith source at its point of origin. Associated with each position is a known filter given by h_{i}(t). In what follows we will refer to h_{i}(t) as the HRTF, but in general, h_{i}(t) will include not just the filtering of the head and external ear but also the filter function of the acoustic environment (reverberation, etc.).
The signal y(t) at the ear is the sum of the filtered signals: where ∗ indicates convolution and x̃_{i}(t) = h_{i}(t) ∗ x_{i}(t) is the ith source in isolation after filtering. (We can say that x_{i}(t) is the ith source measured in source space, whereas x̃_{i}(t) is the same source measured in sensor space.) The organism’s goal in source separation is to recover the underlying sources x_{i}(t) from the signal y(t), using knowledge of the directional filters h_{i}(t). For example, if x_{alice}(t) and x_{bob}(t) are speech streams generated by two speakers (sources), Alice and Bob, at a cocktail party, the goal is to disentangle these two streams using the only signal available, the sum y(t) = h_{alice}(t) ∗ x_{alice}(t) + h_{bob}(t) ∗ x_{bob}(t). Note that the actual spatial locations of the sources are not computed during the separation; we do not address the localization problem in this paper.
The particular monaural version of this problem that we consider here is a special, more difficult, case of the binaural (or, in artificial systems, the multiple microphone) problem.
Neural representation for source separation
How might a neural system solve the source separation problem described above? We begin by assuming that each short segment (e.g., 5 ms) of each acoustic source x̃_{i}(t) (as it sounds at the cochlea) is represented in the activities c_{ij} of a population of neurons indexed by components j and source positions i: where d_{ij}(t) are stimulus “features,” i.e., elements of a (not necessarily orthogonal, and possibly overcomplete) linear basis. We will interpret the neural activities c_{ij} as the spike rate of the corresponding neurons during each segment. The signal y(t) is then given by the following: We have introduced Equation 9 as an analytic model: given a stimulus y(t), find a set of features d_{ij}(t) and neural activities c_{ij} that represent that stimulus; if the features span the stimulus space, such a representation will always exist. Below we will focus on the case in which the feature set permits a sparse representation, i.e., where only a few of the neural activities c_{ij} are significantly nonzero. (Although “sparse” might colloquially refer to the case in which most of the activities c_{ij} are exactly zero, here we use a generalized notion of sparseness, common in the literature, which requires only that most activities be close to zero.)
Equation 9 assumes a linear relationship between an auditory stimulus y(t) and its neural representation in terms of features d_{ij}(t). The assumption of linearity is common in both visual and auditory physiology. For example, it is often assumed that a population of neurons in cortical area V1 represents a visual scene in terms of a collection of oriented edges; in this case, the scene and the features in Equation 9 would be rewritten as functions of spatial rather than temporal coordinates, but the formulation would be otherwise identical. Similarly, in auditory physiology, stimuli are sometimes represented as a weighted sum of basis elements such as moving ripples (Kowalski et al., 1996; Klein et al., 2000); in the context of Equation 9, this implies assuming a onetoone correspondence between a basis element (derived from the ripple basis) d_{ij}(t) and the firing rate c_{ij} of a corresponding neuron.
To relate the neural representation of the signal y(t) in Equation 9 to the sources x_{i}(t), we further assume that each source can be expressed as a linear combination of (not necessarily orthogonal) basis elements q_{j}(t): where the basis elements q_{j}(t) are related to the features d_{ij} by convolution with each filter h_{i}(t):
Combining these expressions, the signal y(t) received at the ear is related to the sum of the filtered sources by the following:
There are thus more features d_{ij}(t) in the neural representation than there are basis elements q_{j}(t). In particular, if there are known to be N sources, then there are Nfold more features d_{ij}(t) than basis elements q_{j}(t). As before, Equations 9–11 represent an analytic model: given a set of features d_{ij}(t) [or equivalently a set of basis elements q_{j}(t) and positiondependent filters h_{i}(t)] and an input y(t), find an appropriate set of neural activities c_{ij}.
The basis elements q_{j}(t) reflect statistical correlations within sources; each source typically consists of several such elements. These basis elements can be thought of as an internal model of the components of acoustic sources, in the same way that edges might be thought of as components of visual sources (objects). Because the neural representation involves prefiltering with the HRTF (Eq. 11), the coefficient c_{ij} associated with feature d_{ij}(t) is then better thought of as representing the hypothesis that an element q_{j}(t) is present at position i. In the same way, neurons in the primary visual cortex can be thought of as representing the hypothesis (d_{ij}) that an oriented edge (q_{i}) is present at a particular position (i) in the visual field. In other words, the elements q_{j}(t) reflect only the properties of the stimulus, whereas the features d_{ij}(t) arise from the interaction of these elements with the sense organs.
A population of neural activities satisfying Equations 9–11 has effectively solved the source separation problem, because a given source i can be reconstructed merely by summing over all neurons associated with position i. This formulation therefore recasts the source separation into a new problem: finding the appropriate neural activities c_{ij}. Such a representation, if it could be found, is especially appealing because it permits the sources to be reconstructed (by Eq. 10) directly in terms of the stimulus elements q_{j}(t) as they sound at the source (i.e., before filtering); the reconstruction is therefore invariant to changes in stimulus position. In the next section, we will show that sparseness provides the key to specifying the appropriate representation.
For notational and computational convenience we discretize time and rewrite Equation 9 in matrix form (using boldface to indicate vectors and matrices): where y is a column vector whose N_{row} elements correspond to the discretetime sampled elements y(t_{k}), c is a column vector of length N_{col} representing the complete neural activity pattern c_{ij}, and D is an N_{row} × N_{col} matrix whose columns d_{ij} hold the features with elements d_{ij}(t_{k}).
Sparse neural representation of sources
Source separation thus requires finding the neural activities c such that the neural representation represents the sources x_{i} as closely as possible. We assume that the neural representation is overcomplete (Riesenhuber and Poggio, 2000; Olshausen and Field, 1997), i.e., that the number of neurons (features) is large (N_{col} > N_{row}). In this case, many different neural activity patterns c could represent the stimulus y equally well (Fig. 1A). However, the goal is not merely to represent the stimulus y, but to find a representation in which the underlying sources x_{i} are apparent and from which they can be readily recovered.
Because the neuronal population does not have access to the sources themselves, but only to their sum y, not enough information is available to recover the sources uniquely. The source separation problem is thus ill posed. (In the same way, knowing that the sum of two scalars a and b is 12 is not sufficient to recover a and b, and any choice for a and b that satisfies a + b = 12 is a possible solution.) The problem can be made well posed by adding additional constraints (regularizers) on the responses, as is often done in computational vision (Poggio et al., 1985). Here we consider a sparseness regularizer on the neural representation (Olshausen and Field, 1996, 1997; Bell and Sejnowski, 1997; Chen et al., 1998; Lee et al., 1999; Lewicki and Sejnowski, 2000; Vinje and Gallant, 2000; Simoncelli and Olshausen, 2001; Zibulevsky and Pearlmutter, 2001; Hahnloser et al., 2002; Olshausen and O’Connor, 2002). In neural terms, this sparseness assumption corresponds to representing the acoustic stimulus y in terms of the minimum number of spikes (Fig. 1C), a biologically appealing constraint which leads to an energyefficient representation (Levy and Baxter, 1996; Laughlin and Sejnowski, 2003). Thus we assume that the neural representation c satisfies (see Appendix A.1):
Equation 14 specifies a linear programming problem with a single global optimum. Formally, the solution minimizes the L_{1} norm ‖c‖_{1} = Σ_{ij}c_{ij} of the solution vector. In practice, the problem we consider allows for reconstruction noise (see Eq. 1).
Dense neural representation of sources
An alternative regularizer is that implicit in the pseudoinverse (Strang, 1988), corresponding to the usual leastsquares solution (see Appendix A.1),
The pseudoinverse finds the solution c that minimizes the L_{2} norm, i.e., the squared neural activity Σ_{ij}c_{ij}^{2} (Fig. 1B). However, it is not obvious why it would be useful for the brain to minimize this quantity, which has units of spikes squared, rather than some other quantity (such as spikes; see below). Moreover, we show below that it fails in practice to separate the sources successfully.
Separation of harmonic sources
Successful source separation based on Equation 14 requires that two conditions be satisfied. First, the sources must be sparsely representable, as is the case with natural auditory stimuli (Attias and Schreiner, 1997; Lewicki, 2002; Klein et al., 2003; Smith and Lewicki, 2006). Second, the sources must have spectral correlations matched to the HRTF. We found that the model was able to separate acoustic sources consisting of mixtures of music, natural sounds, and speech.
Finding a feature set
We used NMF to generate a set of basis features from spectrograms obtained from samples of solo instrumental music, natural sounds, and speech (Fig. 2). NMF is an algorithm for factorizing a data matrix (a matrix whose columns contain the snippets of solos) under nonnegativity constraints (Lee and Seung, 1999). In contrast to some other decomposition approaches, such as principal component analysis, NMF often yields representations in which the elements are fairly local, and which can be interpreted as “parts.”
When applied to music, NMF typically yielded elements suggestive of musical notes, each with a strong fundamental frequency and weaker harmonics at higher frequencies. In many cases, listeners could easily use timbre to identify the instrument from which a particular element was derived. When applied to sounds from other ensembles (natural sounds and speech), NMF yielded elements that had rich harmonic structure, but it was not in general easy to “interpret” the elements (e.g., as vowels). Nonetheless, these elements captured aspects of the statistical structure of the underlying ensemble of sounds and led to sparse representations of the ensembles (Fig. 2B).
The choice of NMF in this context was merely a matter of convenience; we could have used any basis that captured the spectral correlations in the sources and permitted a sparse representation. Finding good overcomplete dictionaries from samples of a stimulus ensemble is a subject of ongoing research (KreutzDelgado et al., 2003). We do not imagine that NMF is the “algorithm” by which features are established in real neural circuits; such features must surely arise through a complex interaction of genetic and environmental cues. We need not, therefore, expect to find a precise correspondence between the features obtained by NMF and those observed in the auditory cortex. In this respect, our results complement previous work on finding the features underlying auditory or visual scenes (Bell and Sejnowski, 1997; Olshausen and Field, 1997, 1996; Schwartz and Simoncelli, 2001; Lewicki, 2002); the emphasis here is not on the elements themselves but rather on how they work together to form a representation that separates sources.
Separation
To test the ability of the model to separate sources, we generated digital mixtures of three sources positioned at three distinct positions in space (Fig. 3). In the left column are the spectrograms of the sources at their origin. Two of the sources (a harp playing the note “D”; center and bottom) were chosen to be identical; this example is thus particularly challenging, because the only cue for separating the sources is the filtering imposed by the HRTF.
Separation was nevertheless quite successful (Fig. 3, compare left and right columns). These results were typical: whenever the underlying assumptions about the sparseness of the stimulus were satisfied, sources consisting of mixtures of music, natural sounds, or speech were all separated well (Fig. 4). Separation worked particularly well for mixtures of sparsely representable sources (i.e., smaller sparseness index values), whereas it did not work for sources that were not sparsely represented (i.e., larger sparseness index values). Figure 5 shows that separation without differential prefiltering by the HRTF was unsuccessful, as was separation using the Gaussian prior instead of the sparseness prior (dense representation).
The neural representations underlying separation provide insight into these results. Figure 6A shows the representations of each of the three sources (the same as in Fig. 3) presented in isolation. In each panel, the activity in a population of 225 neurons (corresponding to the 225 features d_{ij} = h_{i} ∗ q_{j}) is indicated by the intensity of points on a 15 × 15 grid. Because the sources occupy three positions i, there are three copies of the basis q_{j} in each panel (corresponding to the three filters h_{i}.) The activity patterns are sparse; only a relatively small number of units are active in each representation. Note that because the middle and the right sources (source 2 and 3, respectively) in this example were chosen to be identical, the middle and right neural representations differ only by a shift.
The procedure for recovering a source from such a representation is straightforward: the estimate of the left source (source 1) is simply the summed activity of the left third of the neurons (those representing features prefiltered by the HTRF corresponding to the leftmost position in space); and likewise for the middle and right thirds. The HRTF can thus be seen as a kind of “tag” for grouping together elements from a single source. This suggests dividing source separation into two conceptually distinct steps (although in practice the steps occur simultaneously.) In the first step, the stimuli are decomposed into the appropriate features. In the second step, the features are tagged and bundled together with other features from the same source. It is for this bundling step that the HRTF along with the prior knowledge of source locations is essential.
The failure of the dense representation to separate sources (Fig. 4) results from a failure of the first step. Instead of decomposing the sources into a small number of features, the dense representation (Fig. 6C) assumes that each instrument contributed about equally to the received signal and thus finds a representation in which a large fraction of neurons are active. That is, instead of “explaining” the sources in terms of two harps and a trumpet, the dense representations also finds some clarinet, some cello, etc., at all positions. This is intrinsic to the dense solution, because it finds the “minimum power” solution in which neural activity is spread among the population (Fig. 1B).
The failure of even the sparse approach when the spectral cues induced by the HRTF are absent (Fig. 5, leftmost point showing 0° separation) results from a failure at the second step. That is, the sparse approach finds a useful decomposition at the first step even without the HRTF, but in the absence of HRTF cues the active features are not tagged, and thus the features cannot be assigned appropriately to distinct sources. Other psychophysical cues relevant for source separation, such as common onset time, might provide alternative or additional tags in this same framework. A more general formulation of source separation might allow tagging on longer time scales, so that a feature active at one moment might be more (or less) likely to be active the next, reflecting the fact that sources tend to persist, but we do not pursue that approach further here.
Experimental predictions
Our model of sparse representations makes at least three experimentally testable predictions.
Optimal feature estimation requires multineuron recording
In this model, the firing rate of a given neuron {ij} is maximized when there is a perfect match between the stimulus and the feature of that neuron, i.e., when y = d_{ij}. Because the feature d_{ij} is used in the linear reconstruction of the stimulus from the neural activities (Eq. 9), one might imagine that the optimal stimulus (i.e., the stimulus that maximizes the firing rate) can be obtained by estimating the optimal linear decoder of the target neuron considered alone. Experiments based on this idea have shown that the optimal linear decoder can sometimes drive neurons in the auditory cortex to fire vigorously (deCharms et al., 1998).
Surprisingly, this model predicts that the linear estimate of the decoder obtained in this way is not the optimal stimulus, although the optimal decoder is linear. Instead, finding the optimal stimulus requires recording from all of the neurons involved in the representation. This follows from the fact that we have assumed that the features are not orthogonal (see also Appendix A.2). Note that in this model, optimal decoding (Eq. 9) need not take neural correlations into account, even when they are present.
This first prediction is illustrated by a simulation (Fig. 7). The yaxis shows the firing rate of a target neuron (normalized to its maximum firing rate) in response to the presentation of the stimulus that matches the optimal linear decoder constructed by recording the activity of a target neuron and a variable number of other neurons. When the optimal linear decoder is estimated from only the target neuron, the firing rate is submaximal. As the number of neurons used to estimate the optimal linear decoder is increased (xaxis), the response of the target neuron converges to unity, indicating that the optimal decoder has converged to the feature of the target neuron.
Figure 7 represents a novel and testable prediction of the model: jointly estimating the optimal linear decoder from a population of neurons should yield a stimulus that is closer to optimal. Moreover, it also leads to a novel experimental approach for finding the optimal stimulus. Note that although in principle the activity of all neurons involved in the representation must be recorded, in practice the activity of even a few can be useful. With modern techniques (e.g., tetrodes) for isolating the activity of several nearby neurons, this approach might be practical.
Linear decoding and nonlinear encoding
A second testable prediction of the model is that there should be an asymmetry between encoding and decoding: the optimal encoding function is nonlinear, but the optimal decoding function is linear. Here “decoding” refers to the process of “reading out” a neural representation (e.g., by forming an estimate or reconstruction of the stimulus), whereas “encoding” refers to the process by which the nervous system constructs a pattern of neural activities from a stimulus. Surprisingly, however, this asymmetry emerges only for populations of neurons; the optimal linear encoder and decoder of an isolated neuron perform about equally, and both underperform the optimal nonlinear decoder (Fig. 8).
The fact that optimal decoding of a neuronal population is linear (i.e., that the optimal linear decoder of the neuronal population response provides perfect reconstruction of the stimulus under the model, so that no nonlinear model can do better) is a direct consequence of our fundamental assumption (Eq. 9) that the neural representation is a linear combination of features. The linearity of neural decoding does not imply that the neural encoding function (the inverse transformation from the stimulus to the response) need be linear, and in general it is not.
Sparseness induces a nonlinear encoding function; more precisely, it induces a piecewise linear encoding function (Fig. 9). Sparseness implies that only at most N_{row} out of the possible N_{col} features d_{ij} are active in the representation of a particular stimulus; the precise subset of active neurons changes for different stimuli. Piecewise linearity arises because the encoding function is linear for all stimuli that activate the same subset of features, but changes for different subsets (see also Appendix A.2). Note that not just any nonlinear function can be implemented. For example, any saturating nonlinearities must be introduced by a preprocessor, because doubling the stimulus y necessarily doubles the neural representation c, i.e., y = Dc implies 2y = D(2c).
The prediction that there is an asymmetry between the linearity of the decoding function and the nonlinearity of the encoding function can be tested experimentally (Fig. 8). Given an ensemble of stimulus–response pairs (i.e., the neural responses to an ensemble of sounds) obtained from a population of neurons, the model predicts that a linear stimulus reconstruction approach (i.e., a decoding model) will outperform a linear “forward” (i.e., encoding) model, but only if the optimal linear reconstructors are estimated from a population of neurons.
The idea that a linear approximation is better suited for the neural decoding than encoding function was first exploited to estimate the information rate of fly visual neurons (Bialek et al., 1991). In contrast, our model predicts that, if the neural representation is sparse and overcomplete, then the asymmetry should emerge only in multineuron recordings. To our knowledge, this asymmetry has not been tested for highlevel auditory representations. Our model thus makes a strong prediction: linear decoding does not provide an advantage over linear encoding for single neuron experiments, whereas the former outperforms the latter for multineuron experiments.
Context dependence of STRFs
A third prediction that follows from the piecewise linearity of the encoding function is that the linear component of receptive fields should depend on the acoustic context. Following conventional usage in auditory physiology, we will use the term STRF to refer only to the linear component of the encoding function, although the encoding function itself may be highly nonlinear (Kowalski et al., 1996; Theunissen et al., 2000, 2001). (In visual physiology, STRF is used to refer to the “spatial temporal receptive field,” but the quantities are analogous.) The STRF is the analog (in a highdimensional input space) of the slope of the tuning curve of a neuron in one dimension.
In an experimental setting, piecewise linearity predicts that the STRF should depend on the acoustic context. We define the acoustic context of a feature d_{ij} with respect to a stimulus y as the collection of other features activated simultaneously by that stimulus. In music, for example, the features tend to resemble musical notes, and the acoustic context can be thought of as the set of notes (e.g., in a chord) that accompany a given note. Figure 10 shows the STRF of the same neuron (a trumpet feature) in two different contexts (either clarinet or flute.) The gross features of the STRF (e.g., the excitatory band around 880 Hz) are preserved in both contexts, but the secondary features (e.g., the addition of an inhibitory sideband) is context sensitive. Changes in the STRF for different features and different contexts can be larger or smaller than in this example. Stimulus context thus changes the neural encoding function, suggestive of the nonclassical receptive field modulation observed in visual and auditory cortexes (David et al., 2004; Valentine and Eggermont, 2004).
Context dependence as defined here is stronger than simple nonlinearity. Specifically, the prediction is that there should exist extended subregions of stimulus space in which the encoding function of a given target neuron is one linear function and across some boundary in stimulus space switch to a second linear function. These boundaries are demarcated by the activation of another (nontarget) neuron in the population and the deactivation of a second (nontarget) neuron (Fig. 9). This prediction could be tested using a multineuron recording technique.
The locally linear encoding induced by sparseness may help reconcile some of the apparent contradictions in the auditory literature. STRFs obtained using a “moving ripple” basis can predict responses to linear combinations of basis elements (Kowalski et al., 1996). However, linear encoding (STRF) models fail to predict neural responses when the stimulus domain is extended to include a wide selection of complex sounds (Linden et al., 2003; Machens et al., 2004), consistent with the idea that ripples represent a subspace within which encoding is linear. Context sensitivity may also provide an explanation for a proposed neural correlate of comodulation masking release in which the addition of a pure tone can suppress the response to temporally modulated noise (Nelken et al., 1999); this form of contextual modulation cannot be explained by any purely linear encoding model.
Discussion
Our main result is that the appropriate “sparse” neural representation implicitly separates a mixture of sound sources into its constituent auditory streams. In this model, sources at different positions in space were separated with only monaural information by exploiting the differential filtering imposed by the HRTF, under the assumption that the source locations have already been identified by other mechanisms. This model provides a possible explanation for an important question about cortical organization: Why are there so many more neurons in the auditory (or visual) cortex than in the cochlea (or retina)? The answer we provide, motivated by the ability of an overcomplete sparse representation to separate sources, is potentially quite general and may be applicable to other brain regions as well.
This model was motivated foremost by the computational demands of source separation. Source separation is a complex computation, and we could no more expect to solve the whole problem in its entirety here than we could expect to solve completely its visual analog (scene segmentation) or any of the many other challenging problems in computational vision. We have instead concentrated on a restricted form of the problem involving only the spatial cues introduced by the HRTF, with the expectation that the same framework can be generalized to understand how some other cues might be used.
Sparseness provides a powerful and useful constraint on neural activities. Our results complement and extend previous work on finding features that permit an efficient representation of auditory or visual scenes (Bell and Sejnowski, 1997; Olshausen and Field, 1997, 1996; Schwartz and Simoncelli, 2001; Lewicki, 2002; Klein et al., 2003; Smith and Lewicki, 2006). In our framework, the HRTF “tags” these features so that they can be assigned to the appropriate source. Other psychophysical cues important for acoustic stream segregation, such as common onset time, could be used in a similar way.
Efficient coding and sparseness
Our approach is compatible with the “efficient coding hypothesis” (Barlow, 1961), according to which the goal of sensory processing is to construct an efficient representation of the sensory environment. Building on these results, we used NMF to derive a set of basis elements with which auditory stimuli could be represented sparsely. Although we did not test bases obtained using other approaches, we do not expect that our results would be sensitive to the particular method used to find the basis; any sparse basis would likely have worked.
The principle of efficient, or sparse, coding has been used to predict receptive field properties of both auditory and visual neurons (Olshausen and Field, 1996; Bell and Sejnowski, 1997; Simoncelli and Olshausen, 2001; Lewicki, 2002; Smith and Lewicki, 2006). However, the focus of the present work is not on the receptive field properties themselves but rather on how the resulting sparse representation can subserve a computation. Our predictions are therefore not about the detailed structure of receptive fields but rather about how receptive fields interact.
The motivation for sparseness here is not coding [or metabolic (Levy and Baxter, 1996; Laughlin and Sejnowski, 2003)] efficiency per se, but rather performance on a particular computational problem: source separation. There are contexts in which coding efficiency imposes important constraints on representation; for example, the retina compresses visual information collected at 10^{8} photoreceptors into a signal that is carried by only ∼10^{6} fibers in the optic nerve. For source separation, however, sparseness provides a mathematical instantiation of Occam’s Razor: it allows a search for the most likely interpretation to be conducted by searching for the sparsest interpretation.
Sparse encoding implies that most stimuli should elicit only modest firing in most neurons, as has been observed experimentally for both simple and complex auditory and visual stimuli (Vinje and Gallant, 2000; DeWeese et al., 2003; Machens et al., 2004), but it does not imply that responses must be weak for all stimuli. Sparseness implies merely that stimuli elicit only a small number of spikes across the neuronal population; indeed, a neuron encoding some particular feature d will fire maximally when the stimulus d is presented. Sparseness in this model is therefore a constraint on the activity of the population of neurons involved in a representation, rather than on the activity of any single neuron. Our results are thus fully consistent with experiments indicating that it is sometimes possible to optimize stimuli online to obtain high firing rates (deCharms et al., 1998; Barbour and Wang, 2003), because in our framework such a stimulus is the feature associated with the neuron.
Directly assessing the sparseness of a neuronal representation experimentally is difficult. The key issue is how many neurons (or spikes) participate in the representation of a typical stimulus. Ideally this would be measured by recording all spikes from all neurons simultaneously, but this is not possible using the experimental techniques currently available. Nevertheless, there is growing evidence that natural auditory (DeWeese et al., 2003; Machens et al., 2004) and visual (Baddeley et al., 1997; Vinje and Gallant, 2000) stimuli activate only a relatively small number of neurons (Olshausen and Field, 2004). Thus the representational sparseness assumed by this model should be viewed as at least provisionally consistent with the current experimental evidence about cortical representations.
Overcomplete representations as a model of cortex
Although there is nothing in the model that explicitly ties it to one or another brain area, we think it most likely that, at least in mammals, the operations we describe occur in the cortex, rather than at subcortical stations. First, receptive fields in auditory cortex are heterogeneous and often have the broad and complex spectrotemporal structure (Sutter, 2000) required to exploit the HRTF.
Second, and more significantly, auditory cortex has the characteristics expected from an overcomplete representation. There are ∼30,000 auditory nerve fibers but more than 1000 times as many auditory cortex neurons. Assuming that the “representational fidelity” (the amount of information that can be represented by a single spike) of neurons in cortex is comparable with that at the periphery, the “representational capacity” of cortex is far in excess of what is needed to form merely a complete representation. This model suggests a way in which this excess representational bandwidth can be used for computation, instead of merely to overcome neuronal noise, as has sometimes been proposed.
Although we do not know how sparse cortical representations are achieved, it seems likely that the underlying circuitry involves lateral interactions. Indeed, “sparsification” is similar to divisive normalization approaches, which are motivated by both circuit and computational considerations (Schwartz and Simoncelli, 2001). Explicit circuit dynamics and connection weights can be obtained using gradient descent to minimize the total neural activity in Equation 14.
Model predictions
We have identified three clear experimental predictions of the model. First, the optimal linear decoder estimated from an experiment in which the activity of multiple neurons are recorded should maximize the firing rate of a target neuron. Second, there should be an asymmetry between the performance of the optimal linear encoder and decoder, but this asymmetry should become evident only in interpreting multineuron recording experiments; the model predicts that the optimal linear encoder and decoder in a single neuron experiment both underperform the “true” optimal (i.e., nonlinear) decoder. Finally, the STRF should be dynamically influenced by acoustic context. These predictions can be used to test (and falsify) the model.
Perhaps the most surprising prediction of this model is how stimulus optimization for a target neuron can improve by recording from other neurons involved in the representation. To our knowledge, online stimulus optimization using data from more than one neuron has not been previously proposed but could be practical using modern multineuron recording techniques.
We have assumed that the neural decoding function (the transformation from the neural response to the stimulus) is linear. However, we have shown that sparseness implies that the neural encoding function (the inverse transformation from the stimulus to the response) is in general nonlinear (piecewise linear). This asymmetry, which emerges only in multineuron recording experiments, is a strong and testable prediction of the model.
This asymmetry further implies the context dependence of the STRF. We speculate that this may explain why linear (STRF) encoding models work well in restricted domains (Kowalski et al., 1996) but fail for richer stimulus ensembles (Linden et al., 2003; Machens et al., 2004). However, context dependence has not yet been tested directly.
Relationship to independent component analysis
Our formulation of the source separation problem (Eq. 7) differs in two respects from the one usually considered in the independent component analysis (ICA) literature. First, we have assumed prefiltering of each source by a known filter h, whereas in the usual formulation, the weighting of each source is given by an unknown scalar. Second, in most formulations the receiver is assumed to have access to the sources via several sensors, each of which is exposed to a different (linear) combination of the sources, whereas here we assume only a single sensor with input y.
Most approaches to solving such multisensor formulations focus on recovering an “unmixing matrix” which inverts the mixing matrix governing the weighting of each source at each sensor. In such cases, it is generally sufficient to assume simply that the sources are statistically independent (Comon et al., 1991; Comon, 1994; Bell and Sejnowski, 1995; Belouchrani et al., 1997; Amari and Cichocki, 1998). If, however, there is only a single sensor, the problem is degenerate and such approaches fail: separating multiple sources from a single sensor requires assumptions stronger than simple independence (Cauwenberghs, 1999; Hochreiter and Mozer, 2001; Roweis, 2001; Jang and Lee, 2003; Smaragdis, 2004). The intermediate case (at least two sensors but more sources than sensors) simplifies the problem considerably (Rickard and Dietrich, 2000; Bofill and Zibulevsky, 2001; Linsker, 2001), because binaural cues as well as monaural ones can be used.
Recent advances in ICA have emphasized the utility of sparse overcomplete representations for source separation problems in acoustic, visual, and other domains (Farid and Adelson, 1999; Lee et al., 1999; Lewicki and Sejnowski, 2000; Zibulevsky and Pearlmutter, 2001; Levin and Weiss, 2004; Li et al., 2004). Here we have built on these ideas and developed a novel approach to separating multiple “augmented” (prefiltered) signals combined at a single sensor.
Our framework can be generalized to exploit other cues used in single sensor separation, such as common onset time. It can also readily be extended to make use of binaural information. Each HRTF function is made singleinput twooutput, and the lengths of the column vectors corresponding to the postHRTF dictionary elements and the observation vector are doubled. Interaural time and level disparity can then be used to separate sources. Information from two (or more) sensors can thus be naturally incorporated.
HRTF and source separation
The model assumes that sources have a statistical structure consisting of spectral correlations that can be exploited by filtering by the HRTF. One novel contribution of this work is its specific proposal for how the HRTF can be used for source separation, a process related to but distinct from sound localization. Spectral cues are not strictly required for sound localization: binaural cues can provide robust cues even in the absence of spectral cues. Conversely, source separation can proceed when spectral cues are weak, or indeed, even when spatial cues are completely absent, as for example when picking out a violin from within a concerto played over a single speaker. This illustrates a general principle: no single cue is essential to source separation, and the auditory system will promiscuously exploit any cues that are available.
Nevertheless, it is clear that HRTF cues, when present, help in source separation (Yost et al., 1996). We have shown how neural systems can exploit these cues using sparse representations. We speculate that sparseness may represent a general adaptation used by the nervous system to separate acoustic sources, and that similar principles may be also involved in source separation in other modalities, such as olfaction.
Appendix
A.1. Notes on regularizers
L_{0} minimization
Minimizing sparseness in the L_{1} sense is not the only possible choice. One natural alternative is the L_{0} norm, which minimizes the total number of active neurons (the total number of nonzero activities c_{ij}) rather than the total number of spikes. Although this constraint also seems biologically sensible, it leads to a computationally intractable (NPcomplete) combinatorial problem (Donoho and Elad, 2003); moreover, in many cases it leads to the same solution as the minimum L_{1} solution (Li et al., 2004), particularly in the presence of a noise model. We therefore consider only the L_{1} solution here.
Conditioning of dense representation
From Equation 15 it is clear that the uniqueness of the solution depends on the invertibility and condition number (ratio of largest to smallest singular values) of D, whose columns d_{ij} = h_{i} ∗ q_{j} depend in turn on both the filters h_{i} and the source elements q_{j}. In particular, if there is no filter h_{i}, as in the usual formulation of source separation, the columns of D are identical, and the solution is therefore degenerate. The greater the difference between the filtered copies of the sources (the columns of D), the smaller the condition number of D and the more numerically stable the problem.
Probabilistic interpretation of regularizers
Interpreted probabilistically, the regularizers on neural representations (Eqs. 14 and 15) correspond to maximum likelihood estimates using different a priori assumptions about the processes generating the stimuli, whose estimates are represented as the neural activities participating in a representation (Fig. 1D). The pseudoinverse assumes that the underlying causes represented by the activities c_{ij} were drawn from a Gaussian distribution, whereas the sparseness regularizer assumes instead a Laplacian distribution: p(c_{ij}) ∝ e^{−cij}. Because a Laplacian distribution has more elements very close to (and very far from) zero than does a Gaussian with the same variance, it corresponds to a sparser description in terms of c_{ij}. Note that without a noise term, the maximum likelihood estimates using any prior yield perfect (zero reconstruction error) representations of the stimulus; the prior here is on the distribution of the underlying causes represented by the coefficients c_{ij}, rather than on the distribution of reconstruction errors (as for example in robust fitting methods). Only when a noise term is added (Eq. 1) do the neuronal activities c_{ij} cease to represent the stimulus perfectly.
A.2. Asymmetry of sparse representations
Why does an overcomplete sparse representation predict an asymmetry between encoding and decoding (see Results, Linear decoding and nonlinear encoding)? To understand this, let us begin by considering the more familiar case of a complete (but not overcomplete) orthonormal representation, such as a Fourier or ripple basis. In this case, there is perfect symmetry between the encoding and decoding transformations. (These transformations are referred to as “analysis” and “synthesis” in the wavelet literature.) That is, if the columns of D in the decoding equation y = Dc (Eq. 13) are orthonormal, the inverse (encoding) transformation is given by c = D^{T}y, where we have used the fact that the inverse of an orthonormal matrix is its transpose, D^{T} = D^{−1}. In this case, the encoding and decoding filters (the rows and columns of D) are identical. This explains the familiar symmetry between the forward and inverse Fourier transform, in which the rows and columns of D are sinusoids.
If the columns of a square matrix D are not orthonormal, its inverse (if it exists) is not equal to its transpose. However, the encoding transformation c = D^{−1}y is still linear, because it can be expressed as a linear combination of the columns of D^{−1}. Even in the overcomplete case (Eq. 15), linearity of encoding is preserved if the dense representation is used, because in that case the encoding filters are the columns of the pseudoinverse D^{⋄}.
In the sparse overcomplete case, asymmetry arises because the inverse transformation is in general nonlinear. Here, the representation is found by solving the optimization problem of Eq. 14. Note that this in turn gives a reason that stimulus optimization requires multineuron recordings (see Results, Optimal feature estimation requires multineuron recording). Specifically, because the neural responses cannot be explained by a single (or a global) linear encoding function, the feature (or the stimulus that elicits the maximum response) cannot be estimated correctly by a linear regression using single unit data.
A.3. Context dependence of STRFs
To understand the context dependence of STRFs (see Results, Context dependence of STRFs), we consider the encoding function associated with the sparse representation of a particular stimulus y_{k}. Consider the “packed matrix” D_{k}, whose columns are the subset of features involved in the sparse representation of the stimulus y_{k}, i.e., only the columns corresponding to the nonzero elements of c_{k}. This matrix satisfies the following decoding relationship: where c̄_{k} consists of c_{k} without zero elements. However, because of the sparseness (L_{1}) prior, the matrix D_{k} is at most fullrank and is constructed from only at most N_{row} features. It is thus not overcomplete, and thus the encoding can be specified by a matrix using the pseudoinverse:
Thus, the encoding function is continuous and piecewise linear, with the linear segments defined by pseudoinverses D_{k}^{⋄}, and the discontinuities in the first derivative occurring whenever the stimulus activates (or deactivates) a new feature (i.e., an element of c becomes nonzero or zero) and thereby changes D_{k}.
The STRF for the ith neuron is then obtained from the ith row of the pseudoinverse of the packed matrix D_{k}. (Recall that, following convention, we use the term STRF to refer only to the linear component of the encoding function from stimulus to response.)
Three comments on the STRF estimation are in order (see Materials and Methods, Context dependence of STRFs). First, we note that constructing the matrix D_{k} requires knowledge of the solution c_{k}, so that this does not actually constitute an algorithm for finding c_{k}. Second, in the special case of the dense representation, both encoding and decoding are linear. In this case, the encoding function for any stimulus (Eq. 11) is simply the pseudoinverse D^{⋄} of the full matrix D.
Finally, we note the fact that even if two STRFs obtained from the response of a single neuron to two different stimulus ensembles differ, and each accounts perfectly for the data to which it was fit, this does not rule out the possibility that there might exist some third STRF that perfectly fits the amalgamation of the two datasets. Consider two stimuli y_{k} and y_{k}′, which activate only features in c_{k} and c_{k}′, respectively. The pseudoinverse D_{k}^{⋄} corresponding to the first stimulus will be valid (Eq. 17) for any stimulus composed of any subset of features that are nonzero in c_{k} but will not be valid for any features that were not active. Therefore, any stimulus consisting of sums of features in the union of the feature sets c_{k} and c_{k}′ will yield incorrect results if either of the corresponding packed pseudoinverse matrices are used. However, a new pseudoinverse matrix constructed from features in the union of the feature sets c_{k} and c_{k}′ could yield correct values for a superset of the stimulus ensembles for which either of the original two STRFs were valid.
Such supersets can only be constructed from a number of features that is limited by the rank of D. That is, there does not exist a matrix D_{k}^{⋄} that can be used in (Eq. 17) to generate the correct c for all stimuli. For the example in Figure 10, we confirmed that there does not exist an STRF that includes both as special cases.
Footnotes

This work was supported by the Higher Education Authority of Ireland (An tÚdarás Um ArdOideachas) and Science Foundation Ireland Grant 00/PI.1/C067 (B.A.P.), a FarishGerry Fellowship (H.A.), and the Sloan Foundation, the Mathers Foundation, the National Institutes of Health, the Packard Foundation, and the Redwood Neuroscience Institute (A.M.Z.).
 Correspondence should be addressed to Anthony M. Zador, Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724. Email: zador{at}cshl.edu