Abstract
The relationship between a sound and its neural representation in the auditory cortex remains elusive. Simple measures such as the frequency response area or frequency tuning curve provide little insight into the function of the auditory cortex in complex sound environments. Spectrotemporal receptive field (STRF) models, despite their descriptive potential, perform poorly when used to predict auditory cortical responses, showing that nonlinear features of cortical response functions, which are not captured by STRFs, are functionally important. We introduce a new approach to the description of auditory cortical responses, using multilinear modeling methods. These descriptions simultaneously account for several nonlinearities in the stimulus–response functions of auditory cortical neurons, including adaptation, spectral interactions, and nonlinear sensitivity to sound level. The models reveal multiple inseparabilities in cortical processing of time lag, frequency, and sound level, and suggest functional mechanisms by which auditory cortical neurons are sensitive to stimulus context. By explicitly modeling these contextual influences, the models are able to predict auditory cortical responses more accurately than are STRF models. In addition, they can explain some forms of stimulus dependence in STRFs that were previously poorly understood.
Keywords: neural coding, stimulus–response function, complex sounds, hearing, spectrotemporal receptive field, auditory cortex
Introduction
The spectrotemporal receptive field (STRF) (Aertsen et al., 1980) is widely used to characterize responses of primary auditory cortex (A1) neurons (deCharms et al., 1998; Depireux et al., 2001; Miller et al., 2002; Linden et al., 2003; Tomita and Eggermont, 2005). Previously, however, it has been shown that the STRF, when interpreted as a predictive model, is insufficient to account for auditory cortical responses (Sahani and Linden, 2003a; Machens et al., 2004), indicating that neurons in A1 respond nonlinearly to complex sounds. Nonlinearities have profound consequences for the description of response properties. STRFs often change with the choice of stimulus (Theunissen et al., 2000; Blake and Merzenich, 2002; Valentine and Eggermont, 2004). Although this might reflect adaptation to stimulus statistics, the apparent lability of STRF structure could also arise from the approximation of a nonlinear stimulus–response function with a linear function within different regions of stimulus space (Borst et al., 2005; Christianson et al., 2008).
Thus, nonlinear models are required to gain a better understanding of auditory cortical responses to complex sounds. Here we introduce a new class of such models, which are compact and easy to estimate from limited experimental data. We show how these models can capture and quantify, in neuronal responses to complex stimuli, three distinct nonlinear phenomena documented in experiments with isolated stimuli.
First, although standard STRF models assume that firing rate is related linearly to the sound level at every point in spectrotemporal space, this is not generally true (Phillips and Irvine, 1981). Our "input nonlinearity" model captures neuron-specific nonlinear mappings between spectrotemporal energy and neuronal responses to a complex sound.
Second, standard STRF models can describe time–frequency inseparabilities (i.e., interdependencies in temporal and spectral tuning properties), but not time–level or frequency–level inseparabilities, such as loudness-dependent latency (Phillips, 1989), or frequency-dependent tuning to loudness (Sutter, 2000). The input nonlinearity model allows us to estimate inseparabilities in all pairs of the three stimulus parameters, and thus to analyze how these inseparabilities affect neuronal responses to dynamic complex sounds.
Third, suppressive phenomena such as two-tone and forward suppression are observed at all stages of the auditory pathway, including the auditory cortex (Brosch and Schreiner, 1997; Bartlett and Wang, 2005). Such phenomena are not modeled well with STRFs, and have therefore rarely been studied using complex sounds (but see Bar-Yosef et al., 2002). Extending the input nonlinearity model to the "context model," we represent local contextual effects as the modulation of the effective sound level at a point in the spectrogram by the energy that falls within a short spectrographic window preceding that point. Thus, this spectrographic window is used to provide a local adaptive context for each element of the sonogram.
Each of these nonlinear models can be formulated and efficiently estimated using a common flexible framework based on multilinear analysis. These models improve on the predictive performance of STRF models by a factor of ∼1.4. Moreover, for each neuron, the nonlinear models produce quantitative descriptions of the three phenomena described above. In addition, the context model sheds some light on the origin of the suppressive regions commonly observed in STRFs, and provides an explanation for certain systematic changes observed in STRFs measured under different stimulus conditions (Blake and Merzenich, 2002). Thus, the models offer far more insight into auditory cortical response properties than has previously been available from STRF analysis.
Materials and Methods
The DRC stimulus
The dynamic random chord (DRC) stimuli used in these experiments have been described previously (Linden et al., 2003). Each stimulus was a series of 20-ms-long random chords (i.e., combinations of randomly selected tone pulses) played without intervening gaps. The center frequencies of the tone pulses were chosen from 24 or 48 different possibilities (in the range of 25–100 kHz or 2–32 kHz, respectively, depending on the approximate tuning of the neuron), spaced 1/12 octave apart. The number of tones constituting each chord was random, with an average of two tone pulses per octave per chord. Tone pulses were gated on and off with 5 ms cosine ramps to reduce spectral splatter. The peak level of each pulse was chosen randomly from 10 different intensity levels, 5 dB sound pressure level (SPL) apart in the range 25–70 dB SPL. The frequency and intensity of each tone pulse were selected independently. Each of the two stimuli was 60 s long, comprising 3000 different random chords. Exactly the same stimulus was repeated 10 (high frequency) or 20 (low frequency) times, with no intervening gap. All analyses are based on the final 9 or 19 repetitions, so as to avoid artifacts caused by strong adaptation at the onset of stimulation.
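As a concrete illustration, the stimulus construction described above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the original stimulus-generation code; in particular, drawing each time–frequency bin independently with probability 1/6 (two pulses per octave per chord, at 1/12-octave spacing) is an assumption consistent with, but not dictated by, the description.

```python
import numpy as np

rng = np.random.default_rng(0)

n_chords = 3000                    # 60 s of 20-ms chords
n_freqs = 48                       # 1/12-octave spacing over 2-32 kHz
levels_db = np.arange(25, 71, 5)   # 10 levels, 25-70 dB SPL in 5 dB steps
p_tone = 2.0 / 12.0                # ~2 tone pulses per octave per chord

# Each (chord, frequency) cell independently contains a tone pulse with
# probability p_tone; its peak level is drawn uniformly from levels_db.
present = rng.random((n_chords, n_freqs)) < p_tone
levels = rng.choice(levels_db, size=(n_chords, n_freqs))
spectrogram = np.where(present, levels, 0.0)   # 0 marks a silent bin
```

Each row of `spectrogram` is one 20-ms chord; with 48 bins spanning 4 octaves, a chord contains 8 tone pulses on average.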
Experimental methods
The experimental methods were similar to those described by Linden et al. (2003). Surgical procedures conformed to protocols approved by the University of California at San Francisco's Committee on Animal Research and were in accordance with United States federal guidelines for care and use of animals in research. Mice and rats were anesthetized and maintained at a surgical plane of anesthesia with ketamine and medetomidine or pentobarbital, and extracellular recordings were made in early auditory cortical areas [A1 of rats, and A1 and anterior auditory field (AAF) of mice]. Recordings targeted thalamorecipient layers III/IV (Smith and Populin, 2001) by cortical depth (350–600 μm below the dural surface) and by the polarity and size of stimulus-evoked local field potentials. The recordings were later bandpass filtered and then analyzed using Bayesian spike-sorting techniques (Lewicki, 1994) (user interface software by M. Kvale, University of California San Francisco, CA) to extract responses from single units or small clusters of neurons.
Notation
In this study, bold letters such as w represent arrays: these may be vectors, matrices, or higher-order Cartesian tensors. Bold superscripts are used with arrays of weights and stimuli to indicate the corresponding dimensions. Thus w^{tf} represents an array of STRF weights that span time lag and frequency. Elements in the array are italicized and subscripted. Thus, w_{jk}^{tf} is the element at time lag j and frequency bin k in the matrix w^{tf}. To simplify notation, we have adopted modified versions of operators from multilinear algebra. The symbol ⊗ generalizes the vector outer product, such that, for example, if b, c, and d are all vectors, then a = b ⊗ c ⊗ d is a three-dimensional array, with elements a_{ijk} = b_{i}c_{j}d_{k}. The symbol · generalizes the inner product, so that all indices shared between arguments on the left and right sides of the operator are summed (or contracted); which indices are shared will be clear from the context. Standard matrix multiplication is shown without an explicit operator.
Model estimation
To capture the response properties of auditory cortical neurons, predictive models may be used to approximate the neural stimulus–response function (i.e., the transformation from stimulus to neuronal firing rates). A widely used model is based on the STRF, here denoted w^{tf}, where the superscripts denote time-lag and frequency, respectively. Such models predict the time-varying firing rate of a neuron, r̂(i) (where i indexes time bins), from the stimulus s(i, k) (denoting the power in time bin i and frequency bin k of the spectrographic representation of the sound) through the following formula:

r̂(i) = c + Σ_{jk} w_{jk}^{tf} s(i − j + 1, k).

The parameters of the model, i.e., the constant c and the elements of the STRF matrix w^{tf}, should be chosen so as to make the predicted firing rate r̂(i) as close as possible to the observed firing rate r(i). One approach is to present the same stimulus s multiple times; the mean firing rate over repeated trials will then be approximately Gaussian distributed around the "true" mean stimulus-dependent firing rate. The parameters c and w^{tf} can then be found by least-squares linear regression, minimizing the squared error ε = Σ_{i}(r̂(i) − r(i))^{2}, which is equivalent to maximum-likelihood estimation under the Gaussian assumption.
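For concreteness, this least-squares estimation of c and w^{tf} amounts to an ordinary linear regression on a lagged copy of the spectrogram plus a constant column. The numpy sketch below is illustrative (unregularized, unlike the estimators used later in the paper); the function names are ours.

```python
import numpy as np

def lagged_design(s, n_lags):
    """Design matrix with columns s(i - j + 1, k) for j = 1..n_lags, all k.

    s: (T, K) spectrogram; rows of the result index time bins i."""
    T, K = s.shape
    X = np.zeros((T, n_lags * K))
    for j in range(n_lags):                 # j = 0 corresponds to zero lag
        X[j:, j * K:(j + 1) * K] = s[:T - j, :]
    return X

def fit_strf(s, r, n_lags):
    """Least-squares STRF with constant term: r_hat = c + X w."""
    X = np.column_stack([np.ones(len(r)), lagged_design(s, n_lags)])
    coef, *_ = np.linalg.lstsq(X, r, rcond=None)
    c, w = coef[0], coef[1:].reshape(n_lags, s.shape[1])
    return c, w
```

For the DRC stimuli used here, `s` would be the chord-by-chord level spectrogram and `r` the trial-averaged firing rate per 20-ms bin.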
One danger of maximum-likelihood regression is overfitting. That is, the parameters that minimize the squared error on one set of training data may do so by capturing stimulus-unrelated variability in those measurements, and therefore perform poorly when predicting responses to a novel set of test data. If this happens, the features exhibited by the best-fit model cannot be presumed to reflect actual structure in the neural response function.
Overfitting may be contained by regularization methods, in which extra terms are introduced to the objective function (or equivalently, a prior belief about the parameter properties is expressed) so as to penalize solutions that have, for example, many large parameter values, or large variation in parameter values. These extra terms, for example, describing the scale of the smoothness, are often set by hand. Here, we used a data-driven method called automatic smoothness determination (ASD) to estimate optimal values for the scale of the smoothness and the magnitude of the model parameters (Sahani and Linden, 2003b).
Bilinear models
Separable STRF models.
A linear spectrotemporal receptive field w^{tf} is called separable if, in its matrix form, it can be written as the outer product of two vectors, w^{tf} = w^{t}w^{f T}; or, in terms of the components, w_{jk}^{tf} = w_{j}^{t}w_{k}^{f}. Functionally, this implies that the relative response of the neuron to tones of different frequencies, given by a vector w^{f}, is preserved across time lag, with the absolute responses scaled by a different (possibly negative) factor at each time, given by the vector w^{t}; or, equivalently, that the temporal response profile at all frequencies is identical up to a per-frequency scale factor. Mathematically, the predicted firing rate in time bin i is, up to an additive constant (discussed later), as follows:

r̂(i) = Σ_{jk} w_{j}^{t} w_{k}^{f} s(i − j + 1, k).

Thus, although the separable model is nonlinear in the parameters taken all together, it is linear in each of the vectors w^{t} and w^{f} alone. Such a model is called bilinear. A graphical illustration of a bilinear model is shown in Figure 1A.
Many STRFs in the auditory system appear to be well modeled as separable (Linden et al., 2003; Simon et al., 2007), but even when this is not the case, a bilinear model may be used to approximate the full inseparable linear (in w^{tf}) relationship. Here, we discuss this bilinear approximation to the STRF as a preliminary to the more extensive multilinear models introduced later.
One way to fit a bilinear model is to estimate the full receptive field w^{tf}, and then examine the singular value decomposition (Strang, 1988) of the resulting matrix. The component vectors corresponding to the largest singular value give the best separable approximation to the full STRF matrix in the least-squares sense; that is, they minimize the term Σ_{jk} (w_{jk}^{tf} − w_{j}^{t}w_{k}^{f})^{2}. However, even when w^{tf} is itself found by least-squares regression, this cost function is different from the following bilinear model error:

ε = Σ_{i} (r(i) − Σ_{jk} w_{j}^{t} w_{k}^{f} s(i − j + 1, k))^{2}.
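The SVD route is straightforward to compute; a short numpy sketch (function name ours). By the Eckart–Young theorem, the rank-1 matrix built from the leading singular vectors minimizes the Frobenius-norm error among all separable approximations.

```python
import numpy as np

def best_separable(w_tf):
    """Best rank-1 (separable) least-squares approximation of an STRF
    matrix, via its singular value decomposition."""
    U, sv, Vt = np.linalg.svd(w_tf, full_matrices=False)
    w_t = U[:, 0] * sv[0]      # temporal profile (carries the scale)
    w_f = Vt[0, :]             # spectral profile (unit norm)
    return w_t, w_f
```

`np.outer(w_t, w_f)` is then the separable STRF closest to `w_tf` in the least-squares sense, though, as noted above, not the minimizer of the bilinear model error.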
We derive algorithms to minimize this model error (possibly with additional regularization terms) directly. It is helpful to simplify notation as follows. Consider an expanded three-dimensional stimulus array, augmented by the addition of a time-lag dimension: M_{ijk}^{itf} = s(i − j + 1, k). This change of notation allows us to write the bilinear spike rate prediction as

r̂(i) = Σ_{jk} M_{ijk}^{itf} w_{j}^{t} w_{k}^{f},

or

r̂ = (w^{t} ⊗ w^{f}) · M^{itf},

where the second form adapts the tensor notation of multilinear algebra as described above. Here, because M^{itf} shares time-lag and frequency indices (j and k) with (w^{t} ⊗ w^{f}), the sum implied by the · operator is over those two dimensions, with the third index i remaining, corresponding to the bin-by-bin predictions collected in the vector r̂. If we now write the vector norm as ‖v‖ = √(Σ_{i} v_{i}^{2}), the squared error to be minimized becomes ε = ‖r − (w^{t} ⊗ w^{f}) · M^{itf}‖^{2}. In the following we will use component notation, tensor notation, or both as needed to aid clarity.
Parameter estimation.
At a local minimum of the error, we have the following:

∂ε/∂w^{t} = 0 and ∂ε/∂w^{f} = 0.

Differentiating and rearranging yields the following fixed-point conditions:

(w^{f} · M^{itf})^{T} r = (w^{f} · M^{itf})^{T} (w^{f} · M^{itf}) w^{t}

and

(w^{t} · M^{itf})^{T} r = (w^{t} · M^{itf})^{T} (w^{t} · M^{itf}) w^{f}.

Defining A = w^{f} · M^{itf} and B = w^{t} · M^{itf}, and noting that both A and B are matrices, we obtain the matrix equations A^{T}r = A^{T}Aw^{t} and B^{T}r = B^{T}Bw^{f}, which can be solved to give w^{t} = (A^{T}A)^{−1}A^{T}r and w^{f} = (B^{T}B)^{−1}B^{T}r. It should come as no surprise that these each resemble the solution to a linear regression problem, as the bilinear model is linear in each parameter vector separately. However, the matrix A (B) is itself a function of w^{f} (w^{t}), and so these equations do not give a closed-form solution to the optimization problem. Instead, the equations are applied iteratively, alternating between updates for w^{t} and for w^{f}. Each iteration decreases the squared error and, because ε cannot drop below zero, the iterations must converge to an optimum in the parameter space.
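The alternating updates follow directly from these equations; a minimal numpy sketch (unregularized, with an arbitrary random initialization; function names ours). Each `lstsq` call solves one of the two normal-equation systems above.

```python
import numpy as np

def build_M(s, J):
    """M[i, j, k] = s(i - j, k), zero before stimulus onset (0-based j)."""
    T, K = s.shape
    M = np.zeros((T, J, K))
    for j in range(J):
        M[j:, j, :] = s[:T - j, :]
    return M

def bilinear_als(M, r, n_iter=50, seed=0):
    """Alternating least squares for r_hat = (w_t ⊗ w_f) · M."""
    T, J, K = M.shape
    rng = np.random.default_rng(seed)
    w_t, w_f = rng.standard_normal(J), rng.standard_normal(K)
    for _ in range(n_iter):
        A = M @ w_f                              # A[i, j] = Σ_k M[i,j,k] w_f[k]
        w_t = np.linalg.lstsq(A, r, rcond=None)[0]
        B = np.einsum('ijk,j->ik', M, w_t)       # B[i, k] = Σ_j M[i,j,k] w_t[j]
        w_f = np.linalg.lstsq(B, r, rcond=None)[0]
    return w_t, w_f
```

Note that `w_t` and `w_f` are recovered only up to a reciprocal scale factor; it is the product (w^{t} ⊗ w^{f}) that is identifiable.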
The background rate.
Because neurons often fire in the absence of a stimulus, a complete predictive model must include a constant term. As in the linear case, this can be introduced cleanly into the multilinear framework by augmenting the stimulus appropriately. If there are T time points in the stimulus, and the time-lag index ranges from 1 … J and the frequency index from 1 … K, so that M^{itf} is T × J × K, we consider a T × (J + 1) × (K + 1) array Q^{itf} with

Q_{ijk}^{itf} = M_{ijk}^{itf} for j ≤ J and k ≤ K; Q_{i(J+1)(K+1)}^{itf} = 1; and Q_{ijk}^{itf} = 0 otherwise.

Then, with w^{t} and w^{f} also augmented to contain J + 1 and K + 1 elements, respectively, the model r̂ = (w^{t} ⊗ w^{f}) · Q^{itf} becomes equivalent to a model with background rate, r̂ = c + (w^{t} ⊗ w^{f}) · M^{itf}, with c = w_{J+1}^{t}w_{K+1}^{f}.
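This augmentation embeds M^{itf} in a larger array whose single extra nonzero element carries the constant; a sketch (function name ours):

```python
import numpy as np

def augment(M):
    """Embed M (T x J x K) in Q (T x (J+1) x (K+1)) whose only extra
    nonzero element, Q[i, J, K] = 1, carries the constant term: with
    augmented weights, (w_t ⊗ w_f) · Q = c + (w_t ⊗ w_f) · M, where
    c = w_t[J] * w_f[K] (0-based indices)."""
    T, J, K = M.shape
    Q = np.zeros((T, J + 1, K + 1))
    Q[:, :J, :K] = M
    Q[:, J, K] = 1.0       # the remaining augmented entries stay zero
    return Q
```

Because the other augmented entries are zero, the cross terms involving only one of the two extra weight elements contribute nothing to the prediction.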
Multilinear models
We have seen that a separable STRF may be viewed as a bilinear predictive model. Shortly, we will see that more complex models, incorporating nonlinear level sensitivities and acoustic-context-dependent sensitivities, can also be expressed in a multilinear form. However, before going on to develop these models, it is useful to discuss some general aspects of parameter estimation in the multilinear setting (Ahrens et al., 2008).
The general predictive form of a multilinear model can be written, in components, as

r̂(i) = Σ_{jk…} Q_{ijk…} a_{j} b_{k} …,

or r̂ = (a ⊗ b ⊗ … ⊗ z) · Q, where a, b, …, z are parameter vectors and Q is a fixed multidimensional array that depends only on the stimulus.
Alternating least squares.
As in the case of the bilinear model discussed above, the squared error ε = ‖r − (a ⊗ b ⊗ … ⊗ z) · Q‖^{2} can be minimized by cycling through a set of update equations, each of which resembles the solution to a linear regression problem:

a = (A^{T}A)^{−1}A^{T}r with A = (b ⊗ c ⊗ … ⊗ z) · Q,
b = (B^{T}B)^{−1}B^{T}r with B = (a ⊗ c ⊗ … ⊗ z) · Q,

and so on for the remaining parameter vectors. Each of these updates minimizes the squared error with respect to one parameter vector, while holding the others fixed. Thus, the algorithm, a variant of alternating least squares, is guaranteed not to increase the squared error at any step and, hence (as the error is bounded below by zero), to converge.
Control of overfitting.
As the number and length of the parameter vectors in any model increases, so does the danger of overfitting. As described above, regularization is the practice of adding terms to the cost function to be optimized, so as to discourage overfitting. Here, we adopt a Bayesian perspective, in which the simple cost function corresponds to the (log) likelihood of the parameters given the data, and the regularization terms express our prior beliefs about their forms or values.
The objective function used is of the form ε_{r} = ε + a^{T}D_{a}a + b^{T}D_{b}b + … . Viewed probabilistically, the squared-error term ε corresponds to a Gaussian likelihood. Although a likelihood based more directly on a mean of Poisson terms may be a better model for empirical firing rate data, the Gaussian assumption makes fitting algorithms more tractable and is appropriate in the regime of many observations or moderate to high firing rates, where the trial-averaged Poisson-based likelihood approaches the Gaussian form. The additional terms of the objective function express zero-mean Gaussian priors on each parameter vector, with covariances D_{a}^{−1}, etc. The priors are zero-mean because we have no reason a priori to expect parameters of any particular sign. The covariance matrices are chosen to favor smaller parameter values (commonly referred to as "shrinkage") as well as smoothness in the parameter vectors.
The alternating least squares algorithm of Equation 14 can be adapted to minimize ε_{r} through a small modification. Given a matrix A as defined in Equation 14, the update for a becomes a = (A^{T}A + D_{a})^{−1}A^{T}r, and similarly for the other parameter vectors. For fixed prior covariances, these updates are guaranteed to converge, as before.
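The regularized update is an ordinary ridge-style solve. The sketch below is illustrative: it uses a generic shrinkage-plus-first-difference penalty as a stand-in for the ASD-derived inverse covariance (the actual ASD parameterization of Sahani and Linden, 2003b, differs), and the function names are ours.

```python
import numpy as np

def ridge_update(A, r, D):
    """Regularized ALS update: a = (A^T A + D)^(-1) A^T r, where D is the
    inverse prior covariance (penalty) matrix for this parameter vector."""
    return np.linalg.solve(A.T @ A + D, A.T @ r)

def smoothness_penalty(n, shrink, smooth):
    """A simple stand-in penalty matrix: diagonal shrinkage plus a
    squared-first-difference smoothness term (NOT the ASD covariance)."""
    diff = np.diff(np.eye(n), axis=0)        # (n-1) x n first-difference operator
    return shrink * np.eye(n) + smooth * diff.T @ diff
```

With `D` set to zero, `ridge_update` reduces to the unregularized least-squares update of the previous section.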
To obtain appropriate prior covariance matrices D_{(·)}, we adapted the ASD scheme of Sahani and Linden (2003b). In this scheme, each prior covariance matrix is parametrized by two parameters θ_{1,2}^{(·)}, describing the degree of shrinkage and smoothing; optimal values of the parameters are found through an automatic procedure. Here, we used this procedure within the first few iterations of the alternating least squares framework. If ASD updates are used at every step, the objective function varies between iterations (because the D matrices change), and convergence of the fitting procedure is no longer guaranteed. For this reason, the covariance parameters were only updated during the first three iterations. Once these covariance parameters were fixed, the remaining iterations were again guaranteed to converge.
Estimation error.
Error bars for the parameters were obtained by a bootstrap procedure (Efron and Tibshirani, 1993). Parameters were refit 10 times, in each case using a training set of the same size as the original, randomly redrawn, with replacement, from the available data. The error bars in all figures indicate the SDs of these 10 estimates.
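The bootstrap itself is generic; a sketch, assuming for simplicity that individual time bins can be resampled independently (the original procedure is specified only as redrawing from the available data with replacement; the function name is ours):

```python
import numpy as np

def bootstrap_sd(fit, X, r, n_boot=10, seed=0):
    """Refit on datasets resampled with replacement (same size as the
    original) and return the SD of the estimates across refits.

    fit: callable (X, r) -> parameter array."""
    rng = np.random.default_rng(seed)
    n = len(r)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample observations
        estimates.append(fit(X[idx], r[idx]))
    return np.std(np.stack(estimates), axis=0)
```

For temporally correlated responses, a block bootstrap (resampling contiguous segments) would be a more conservative alternative.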
The input nonlinearity model
In this section and the next, we develop predictive auditory models that express various nonlinear stimulus dependencies in multilinear form, allowing them to be fit using the methods described above.
Stimulus representation.
An auditory STRF model is linear in the sonogram. Consequently, the predictions of the model, and the parameter values obtained, depend on whether the input stimulus s(i, k) represents the intensity, the power, or the log intensity of the stimulus at the given time and frequency. Indeed, the best predictions may be obtained not from any of these external representations, but rather from something closer to the rate-level function of the cell, or some other cell-specific mapping. Here, we consider a generalized model in which a suitable nonlinear transform of the stimulus can be found directly from the data. Based on the STRF model of Equation 1, such a model has the following form:

r̂(i) = c + Σ_{jk} w_{jk}^{tf} g(s(i − j + 1, k)),

where g is a function to be learned. The mapping g is an "input nonlinearity," which transforms the representation of sound level in the spectrogram before it is spectrotemporally filtered by w^{tf}. This form is linear in w^{tf}, but g has not yet been parameterized so as to make estimation possible. One natural parameterization, common in the regression literature, is as a linear combination of a fixed set of basis functions {g_{l}}, so that g(x) = Σ_{l} w_{l}^{l}g_{l}(x), for some parameter vector w^{l}.
For the stimulus considered here, the level of each tone pulse was drawn from a set of 10 distinct possible values. Thus, a natural set of basis functions comprises 10 indicator functions, each taking the value 1 for a single possible input sound level, and 0 otherwise. This leads to a simple interpretation of w_{l}^{l} as the net effective input corresponding to a pulse at the lth intensity level. However, the mathematical development that follows is applicable to any choice of basis and, thus, similar methods could be used with other stimuli, even if not discretized in level (Ahrens et al., 2008).
Given this basis parameterization, Equation 15 can be rewritten as follows:

r̂(i) = c + Σ_{jkl} w_{jk}^{tf} w_{l}^{l} g_{l}(s(i − j + 1, k)).

If we now define a four-index array, M_{ijkl}^{itfl} = g_{l}(s(i − j + 1, k)), and consider a separable STRF model w_{jk}^{tf} = w_{j}^{t}w_{k}^{f}, the model can be written in the following multilinear form:

r̂ = (w^{t} ⊗ w^{f} ⊗ w^{l}) · M^{itfl},

or

r̂ = (w^{t} ⊗ w^{f} ⊗ w^{l}) · Q^{itfl},

with a four-dimensional array Q^{itfl} defined by augmenting M^{itfl} in a manner analogous to Equation 12. This parameter grouping is illustrated in Figure 1B.
Grouping stimulus dimensions.
The unseparated form Equation 16 is itself a bilinear model, with linear dependence on each of the STRF parameters w_{jk}^{tf}; that is, it can be written r̂ = (w^{tf} ⊗ w^{l}) · Q^{itfl}, as illustrated in Figure 1C. It is therefore possible to apply the multilinear estimation framework directly, without forcing the time–frequency component of the model to be separable. Formally, this may be done by considering the "rasterized" or "unrolled" vector form of the matrix w^{tf}, often written vec(w^{tf}), while similarly unrolling the corresponding dimensions of the array Q^{itfl}.
This model has many more parameters than the fully separated, trilinear version (Eqs. 17, 18), and is thus considerably more prone to overfitting, making regularization yet more crucial. In this case, the automatic smoothness determination procedure (Sahani and Linden, 2003b) obtains separate smoothing parameters for the time-lag and frequency dimensions, even though they are combined into a single vector.
Although this model corresponds well to the standard STRF representation, it is clearly not the only way in which parameters of the trilinear model may be grouped. Instead, a neuron might be better described as having an inseparable frequency–level receptive field, described by a matrix w^{fl}. In this case, the model is r̂ = (w^{t} ⊗ w^{fl}) · Q^{itfl}. This model is able to capture inseparable frequency–level components of the response function, as are often seen in static receptive fields, but are ignored in linear STRF models. However, the model assumes that the time component of the response function is separable from the frequency–level component.
A neuron might also have an inseparable time–sound-level component, in which case the appropriate model might be r̂ = (w^{tl} ⊗ w^{f}) · Q^{itfl}, which is able to capture inseparable time–level properties of the response function (i.e., changes in sound level tuning at different time lags).
For any given neuron, we can consider each of these possible models, evaluating the performance of each by cross-validation and choosing the most predictively successful model to characterize the cell.
Indeed, it is possible, in principle, to dispense with separability altogether, fitting a linear model with a three-dimensional parameter array w^{tfl}. In practice, however, the number of parameters implied by this model (the product of the numbers of time lags, of frequency bins, and of level basis functions) makes it impractical to estimate without considerable overfitting.
Adaptation and two-tone interaction: the context model
Documented nonlinear effects in the auditory pathway include two-tone suppression, where the response evoked by a tone is affected by a simultaneously presented tone at a different frequency (Sutter and Schreiner, 1991), and forward suppression, where the tone-evoked response is modulated by earlier tones at the same frequency (Brosch and Schreiner, 1997).
These and other contextual effects are partly modeled by the standard STRF; however, this framework forces the contextual influence to be additive. Here, we consider multilinear models that also describe multiplicative contextual effects. In particular, we consider models in which the context of a tone pulse in the DRC, or, more generally, of the energy at a point in the spectrogram, which we call a time–frequency element, may be viewed as multiplicatively modulating its effective "level"; that is, the stimulus in time bin i at frequency k has an effective strength given by the following:

g(s(i, k)) · Context(i, k).

Here, g is a nonlinear function of the type considered previously (Eq. 15). Context(i, k) is a function mapping the stimulus context around the time–frequency point under consideration to a multiplicative factor.
We assume that the contextual modulator Context(i, k) is itself a multilinear function of the same form as the input nonlinearity model, but indexed relative to the time and frequency being modulated, as sketched in Figure 2. That is, we write the contextual modulation as follows:

Context(i, k) = c_{2} + Σ_{(m,n) ≠ (1, Φ+1)} Σ_{p} w_{m}^{τ} w_{n}^{φ} w_{p}^{λ} h_{p}(s(i − m + 1, k + n − 1 − Φ)),

where Φ = (N − 1)/2 is the maximal absolute frequency difference considered between the contextual and modulated time–frequency elements, and the exclusion of (m, n) = (1, Φ + 1) is needed because a time–frequency element cannot be in its own context [that is, s(i, k) is not part of the expression for Context(i, k)]. The vectors w^{τ} and w^{φ} contain weights that depend on the relative time differences m = 1 … M and frequency differences n = 1 … N, respectively. The vector w^{λ} transforms the contextual sound energy in terms of a set of P basis functions h_{p}(s), which are here chosen to be identical to the g_{l}(s) used in the input nonlinearity model, although they may differ in general.
In the tensor notation used previously, Context(i, k) = c_{2} + (w^{τ} ⊗ w^{φ} ⊗ w^{λ}) · M^{τφλ}(i, k), where M^{τφλ}(i, k) is a stimulus array that depends on the (i, k) position of the time–frequency element being modulated, and (if the bases {g_{l}} and {h_{p}} are the same)

[M^{τφλ}(i, k)]_{mnp} = M_{i m (k+n−1−Φ) p}^{itfl}

for 1 ≤ m ≤ M; 1 ≤ n ≤ N; 1 ≤ p ≤ L, with M^{itfl} defined as in Equations 17 and 18. M^{τφλ}(i, k) will be referred to as the "contextual subunit" of the time–frequency element at time i and frequency k (Fig. 2).
The contextual influence modulates the effectiveness of the stimulus s(i, k) according to Equation 19. The modulated sound strengths are then combined according to the noncontextual multilinear form. Thus, the full model becomes the following:

r̂(i) = c_{1} + c_{2} Σ_{jkl} w_{j}^{t} w_{k}^{f} w_{l}^{l} g_{l}(s(i − j + 1, k)) + Σ_{jkl} w_{j}^{t} w_{k}^{f} w_{l}^{l} g_{l}(s(i − j + 1, k)) [(w^{τ} ⊗ w^{φ} ⊗ w^{λ}) · M^{τφλ}(i − j + 1, k)].

The first term in the sum reflects the input nonlinearity model, and the second term reflects the contextual modulation of the input to that model. To render this in the standard multilinear form, we define a seven-dimensional array Q^{itflτφλ} whose core elements are

Q_{ijklmnp}^{itflτφλ} = M_{ijkl}^{itfl} [M^{τφλ}(i − j + 1, k)]_{mnp},

augmented (in the manner of Equation 12) with constant elements along the additional rows of each dimension. Then, with w^{t} … w^{λ} correspondingly augmented with one or two additional elements each, we obtain the expression r̂ = (w^{t} ⊗ w^{f} ⊗ w^{l} ⊗ w^{τ} ⊗ w^{φ} ⊗ w^{λ}) · Q^{itflτφλ}, with the constants in Equation 22 recovered as

c_{1} = w_{J+1}^{t} w_{K+1}^{f} w_{L+1}^{l} w_{M+1}^{τ} w_{N+1}^{φ} w_{P+1}^{λ} and c_{2} = w_{M+1}^{τ} w_{N+1}^{φ} w_{P+1}^{λ}.

Note that the parameter vectors w^{τ}, w^{φ}, and w^{λ} associated with the contextual subunit are independent of the time lag j and the frequency k. As such, the action of the contextual subunit may be interpreted as an adjustment of the stimulus that happens before the action of an input nonlinearity model, and independently of frequency. However, the parameters determining the contextual effects are estimated simultaneously with the others; algorithmically, there is no hierarchy.
We may regroup some of the parameter vectors as before; for example, we may assume all components are separable, or group time and sound level together so that the parameters are w^{tl} ⊗ w^{f} ⊗ w^{τ} ⊗ w^{φ} ⊗ w^{λ}.
Because the different components of the models interact multiplicatively, there is an ambiguity of scale in the parameters. For example, if w^{l} were multiplied by a constant factor and w^{t} divided by that same factor, an identical model would result. We resolve this ambiguity in the input nonlinearity part of the model by rescaling all parameter vectors but the first so that the maximal absolute value of the vector components is 1. The first parameter vector then has an unconstrained, but well-defined y-axis with units of spikes per second. The context part of the model can be separately rescaled relative to c_{2}; we resolve this by dividing by this constant so that c_{2} = 1 and then scaling w^{φ} and w^{λ} to have maximal absolute values of 1, thereby assigning the scale of the context terms to w^{τ}.
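This rescaling convention is easy to apply after fitting; a sketch for the input-nonlinearity part (function name ours), which leaves the model's predictions unchanged while absorbing all scale factors into the first vector:

```python
import numpy as np

def normalize_scales(w_first, *rest):
    """Resolve the multiplicative scale ambiguity: rescale every vector
    but the first to unit maximal absolute value, absorbing the scale
    factors into the first vector (which keeps units of spikes/s)."""
    w_first = np.asarray(w_first, dtype=float).copy()
    out = []
    for w in rest:
        scale = np.abs(w).max()
        out.append(w / scale)
        w_first *= scale
    return (w_first, *out)
```

The outer product of the returned vectors equals that of the inputs, so the fitted stimulus–response function is unaffected.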
The dimensionalities used in this study are as follows: for w^{t}, time lags ranged from 0 to 10 (in units of 20 ms time bins; i.e., J = 11); for w^{f}, there were either 24 or 48 frequency bins in the stimuli, so that K = 24 or 48; for w^{l}, the stimuli contained 10 sound levels, so L = 10; for w^{τ}, the contextual subunit was defined to stretch from 0 to 10 time bins into the past, so that M = 11; for w^{φ}, the contextual subunit was defined to stretch 5/12 octaves to either side of a target tone (or less, if the target tone appeared near the spectral boundary of the stimulus), and because the frequency bands in the stimulus were 1/12 octave apart, N = 11; for w^{λ}, the stimuli contained 10 sound levels, so P = 10.
Results
Neural recordings
The neuronal population consisted of 147 units, 97 from mice and 50 from rats. Of these, 39 recordings (21 in rats and 18 in mice) were made under stimulation by higher-frequency sounds (25–100 kHz), and 108 (76 in mice and 32 in rats) were made under stimulation by lower-frequency sounds (2–32 kHz). In mice, about half of the recordings came from A1, and the other half came from the AAF; all of the recordings in rats were from A1. In all of the recordings used for analysis, an estimate of the repeatable stimulus-locked signal power (Sahani and Linden, 2003a) was at least one SD greater than zero. Results based on subsets of these data have been reported previously (Linden et al., 2003; Sahani and Linden, 2003b).
Performance of the models
The predictive performance of the STRF and multilinear models fit to each recording in the population was evaluated by cross-validation. The 60 s stimulus presentation time was first divided into 10 segments of equal size. The models were then trained using data from nine of these segments and tested on the remaining one; this procedure was repeated 10 times, once for each of the 10 possible test segments. The predictive power of a model was defined to be the average predictive power on the 10 cross-validation datasets (Sahani and Linden, 2003a). Predictive power is a measure of performance that takes into account trial-to-trial variability of the neuronal response; its expected value for the "perfect" model is 1, and for a model predicting only the mean firing rate it is 0. Figure 3 shows these predictive powers for the fully separated input nonlinearity and context models, compared with a well-fit STRF model (estimated by automatic smoothness determination) (Sahani and Linden, 2003b).
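The signal/noise decomposition underlying the predictive power measure can be sketched as follows. This follows the general construction of Sahani and Linden (2003a), though the exact estimators used in the paper may differ in detail, and the function names are ours.

```python
import numpy as np

def signal_noise_power(responses):
    """Split trial-to-trial response power into signal and noise parts.

    responses: (n_trials, T) array of per-bin firing rates."""
    n = responses.shape[0]
    power_of_mean = responses.mean(axis=0).var()     # power of the mean response
    mean_power = responses.var(axis=1).mean()        # mean per-trial power
    signal = (n * power_of_mean - mean_power) / (n - 1)
    noise = mean_power - signal
    return signal, noise

def predictive_power(prediction, responses):
    """Fraction of the predictable (signal) power captured by a model:
    ~1 for a perfect model, 0 for a constant mean-rate prediction."""
    signal, _ = signal_noise_power(responses)
    mean_resp = responses.mean(axis=0)
    residual = ((mean_resp - prediction) ** 2).mean()
    return (mean_resp.var() - residual) / signal
```

A negative value on test data indicates, as noted below, a prediction worse than simply predicting the mean rate.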
The predictive power obtained by cross-validation is a lower bound on the true predictive power achievable by the model. As finite data are used for training, the estimated parameters are inevitably overfit to the training data, which leads to suboptimal predictive performance on the test segment. (Negative predictive powers on test data are also possible; this means that the model prediction is less accurate than a constant prediction of the mean firing rate would be.) A complementary upper bound on the predictive power can be obtained by fitting the unregularized model to the entire dataset and computing the resultant predictive power directly on the training data. This overestimates the true value, because the modeling of noise in the training data cannot be distinguished from accurate prediction. The "true" predictive power of the model, i.e., the predictive power it would have when trained on an infinitely large dataset, lies between these upper and lower bounds (Sahani and Linden, 2003a). These bounds become tighter as the trial-by-trial variability of the response (the noise power) decreases. Thus, following Sahani and Linden (2003a), we consider the population as a whole, and extrapolate both the upper and lower bounds to the point of zero noise using polynomial fits, whose order is chosen by leave-one-out cross-validation. This yields relatively tight bounds on the performance expected of the model if it were applied to a hypothetical neuron drawn from a similar population but with a completely stimulus-determined response. The extrapolation is shown in Figure 4.
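The zero-noise extrapolation can be sketched as below. The synthetic population, the near-linear noise dependence of the bounds, and the maximum polynomial order are all hypothetical choices, not values from the study; the sketch only illustrates the leave-one-out order selection and evaluation at zero noise.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic population: predictive-power bounds as a function of noise power.
# Assumed toy relationship: both bounds converge to ~0.4 as noise -> 0.
noise = rng.uniform(0.1, 2.0, size=80)
lower = 0.4 - 0.12 * noise + 0.01 * rng.standard_normal(80)
upper = 0.4 + 0.15 * noise + 0.01 * rng.standard_normal(80)

def loo_poly_order(x, y, max_order=4):
    """Pick a polynomial order by leave-one-out cross-validation."""
    best_order, best_err = 1, np.inf
    for order in range(1, max_order + 1):
        err = 0.0
        for i in range(len(x)):
            mask = np.arange(len(x)) != i
            coef = np.polyfit(x[mask], y[mask], order)
            err += (np.polyval(coef, x[i]) - y[i]) ** 2
        if err < best_err:
            best_order, best_err = order, err
    return best_order

def extrapolate_to_zero(x, y):
    coef = np.polyfit(x, y, loo_poly_order(x, y))
    return np.polyval(coef, 0.0)        # bound value at zero noise

lo0 = extrapolate_to_zero(noise, lower)
hi0 = extrapolate_to_zero(noise, upper)
print(lo0, hi0)
```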
Using this analysis, we find that the predictive power of the input nonlinearity model w^{tf} ⊗ w^{l} (0.27–0.40, expressed as a fraction of the predictable signal power) is modestly larger than that of the STRF model (0.23–0.37). The predictive power of the context model is substantially higher (0.32–0.52), with the midpoint between the bounds being ∼1.4 times that of the STRF model. This finding demonstrates that taking into account local acoustic context (based on a spectrographic window stretching over recent history and nearby frequencies) through the structure of the context model allows the multilinear description to capture more of the dynamic behavior of auditory cortical neurons than linear STRF-based descriptions do.
Figure 5 shows four examples of how the predictions for the STRF and the context models differ. The context model generally predicts the PSTH more accurately, especially at stimulus-evoked peaks in the firing rate.
Input nonlinearity model
Sensitivity to sound level
We used the input nonlinearity model w^{tf} ⊗ w^{l} to study the sensitivity of neuronal responses to sound level within the DRC spectrogram. Figure 6A shows the model parameters and predictive performance for one such model. The inferred level-sensitivity parameter, w^{l}, exhibits a mild threshold, and remains fairly linear (in logarithmic terms) above ∼45 dB SPL. Note that this inferred level dependence may not be the same as the usual rate-level function, for two reasons. First, the rate-level function is conventionally measured using isolated tones presented in silence. Level sensitivity in the context of a complex sound may be quite different (Dean et al., 2005). Second, the estimate of w^{l} is made in the context of subsequent integration through the time-frequency weights w^{tf}. Estimates of the two parameter vectors are interdependent. Similarly, the weights w^{tf} learned through this model are sometimes different from those of the STRF (Fig. 6E).
Figure 6C shows more examples of level-sensitivity profiles w^{l} learned from different neuronal responses. These illustrate the variety of shapes seen in the population. To summarize the observed distribution, five prototypical profiles were designed by hand, and each learned profile was classified according to the prototype with which its dot product was greatest. The results are shown in Figure 6D. Note that the prototypes were chosen to illustrate the range of shapes observed; we did not see evidence of clustering in the data.
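The prototype-matching classification can be sketched as follows. The five prototype shapes here are invented for illustration (the paper's hand-designed prototypes appear in its Fig. 6D); normalizing before taking dot products is one reasonable way to compare shapes rather than magnitudes.

```python
import numpy as np

# Hand-designed prototype level-sensitivity profiles over 10 sound levels
# (illustrative shapes: monotonic, thresholded, saturating, peaked, flat)
levels = np.linspace(0, 1, 10)
prototypes = np.stack([
    levels,                                   # monotonic increase
    np.clip(2 * (levels - 0.5), 0, None),     # high threshold
    np.minimum(2 * levels, 1.0),              # early saturation
    np.exp(-((levels - 0.5) ** 2) / 0.02),    # peaked (nonmonotonic)
    np.ones_like(levels),                     # flat
])

def classify_profile(w_l, prototypes):
    """Assign a learned profile w^l to the prototype with the largest dot product."""
    # Normalize so the comparison reflects shape rather than magnitude.
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    v = w_l / np.linalg.norm(w_l)
    return int(np.argmax(p @ v))

print(classify_profile(np.linspace(0, 1, 10), prototypes))
```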
Inseparabilities
As described in Materials and Methods, the parameters of the input nonlinearity model may be grouped in several different ways. These groupings can be used to assess several forms of inseparability that may be present in the stimulus-response functions. Time-frequency inseparabilities in the STRF have been studied extensively (Kowalski et al., 1996a,b; Depireux et al., 2001; Linden et al., 2003) and related properties (such as temporal symmetry) (Simon et al., 2007) have been documented previously. The opportunity exists to extend such studies to new forms of interaction.
The optimal form of inseparability for each neuron was assessed by comparing the cross-validation predictive powers of the variously grouped models. Thus, a cell would be classified as having a frequency-level inseparable receptive field if the w^{fl} ⊗ w^{t} model performed better on cross-validation data than the w^{tf} ⊗ w^{l}, w^{tl} ⊗ w^{f}, w^{t} ⊗ w^{f} ⊗ w^{l}, and STRF models. There are two caveats to this classification. First, more than one inseparable model could perform better than the fully separated one, suggesting that all three stimulus dimensions might interact in that neuron's response function. As the full model (with no separated parameters) could not be fit reliably with the data available, we did not explore this three-way interaction further. The second caveat is that the different models have different numbers of parameters, and are thus differently prone to overfitting. Thus, although the fully separated w^{t} ⊗ w^{f} ⊗ w^{l} model performs best on cross-validation data for many neurons (Fig. 7C), it is not possible to say whether this reflects genuine separability, or simply greater reliability in the estimates of a smaller number of parameters.
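The classification logic reduces to an argmax over held-out predictive powers. A minimal sketch, with invented numbers (none of these values come from the paper):

```python
# Cross-validation predictive powers for one hypothetical neuron, one entry
# per model grouping (values are illustrative, not from the study)
cv_pp = {
    'STRF':              0.21,
    'w^t ⊗ w^f ⊗ w^l':   0.24,   # fully separated
    'w^tf ⊗ w^l':        0.23,   # time-frequency inseparable
    'w^fl ⊗ w^t':        0.26,   # frequency-level inseparable
    'w^tl ⊗ w^f':        0.22,   # time-level inseparable
}

# Classify the neuron by the grouping with the best held-out performance.
# Caveat (as in the text): groupings differ in parameter count, so this
# comparison confounds genuine separability with estimation reliability.
best = max(cv_pp, key=cv_pp.get)
print(best)
```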
With these caveats, Figure 7C shows the proportion of neurons best described by each of the configurations of the input nonlinearity model. These results suggest that frequency-level and time-level interactions, as well as the more widely studied time-frequency interactions, are present in the responses of many neurons to complex sounds.
The w^{tf} ⊗ w^{l} model
Inseparabilities in time and frequency have been extensively documented. The input nonlinearity model also recovers such inseparabilities in a number of neurons in the population (Fig. 7C). For these neurons, then, the form of frequency integration varies with lag time.
The w^{fl} ⊗ w^{t} model
We also observed several cells with frequency-level inseparabilities (Fig. 7C). Such inseparabilities are consistent with previous reports of frequency response areas derived using isolated tone pips, which can have quite complex and asymmetric shapes. A frequency-level inseparability indicates that sound level is processed differently for different frequencies; or alternatively, that frequency integration depends on sound level. Figure 7A shows an example of such a cell.
The w^{tl} ⊗ w^{f} model
Finally, we found a substantial number of cells whose optimal inseparability was in time and sound level (Fig. 7C). This form of inseparability indicates that during the processing of a complex stimulus, sound level integration depends on lag time. Figure 7B shows an example of such a neuron. The time-level receptive field is clearly inseparable, showing a fast response to loud sounds, without any suppression. Prolonged suppression occurs only for sounds of intermediate loudness. In other words, the response to sound level is monotonic at short lag times, but nonmonotonic at longer lag times. Thus, for this cell, the temporal processing of sounds is critically dependent on sound level.
The context model
The context model (see Materials and Methods) (Fig. 2) can be viewed as combining two input nonlinearity models. One (Fig. 2, A) determines the modulation of the effective sound level of a time-frequency element of the stimulus by its local context; the other (Fig. 2, B) determines the predicted firing rate using these modulated values. The algorithm used to estimate the parameters fits both models simultaneously, allowing the parameters (A) to affect the estimate of the parameters (B) and vice versa.
More specifically, the context model extends the input nonlinearity model through the term w^{τ} ⊗ w^{φ} ⊗ w^{λ}. Thus, the effective intensity of a time-frequency element (i.e., the stimulus power s(i, k) in some time bin i and frequency bin k) is modulated by each time-frequency element that precedes it (within a window of times and frequencies) in a way that depends on their separation in time through w^{τ}, their separation in frequency through w^{φ}, and on the level of the modulating time-frequency element through w^{λ}.
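A minimal sketch of this modulation, with several simplifications: the learned level term w^{λ} is collapsed into the raw stimulus power, constant offsets are omitted, and the window sizes follow the dimensions quoted earlier (M = 11 lags, N = 11 frequency offsets). The sign choices (w^{τ} positive, w^{φ} negative) are chosen to mirror the typical fitted shapes described in the next section, so the net contextual effect here is suppressive; fitted models determine the signs from data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stimulus spectrogram s(i, k) and context-window dimensions from the text
I, K = 60, 24
M_lags, N_freqs = 11, 11            # tau: 0..10 bins back; phi: +/-5 bins
s = rng.uniform(0, 1, size=(I, K))

w_tau = np.exp(-np.arange(M_lags) / 3.0)                  # positive, decaying in lag
w_phi = -np.exp(-np.abs(np.arange(N_freqs) - 5) / 2.0)    # negative, centered on target

s_eff = np.zeros_like(s)
for i in range(I):
    for k in range(K):
        context = 0.0
        for m in range(M_lags):
            if i - m < 0:
                continue
            for n in range(N_freqs):
                kk = k + n - 5          # frequency offset -5..+5
                if not (0 <= kk < K):
                    continue
                if m == 0 and n == 5:
                    continue            # an element does not modulate itself
                context += w_tau[m] * w_phi[n] * s[i - m, kk]
        # Effective intensity: the element's own power, gated by its context
        # (context <= 0 here, so the modulation is suppressive)
        s_eff[i, k] = s[i, k] * (1.0 + context)

print(s.mean(), s_eff.mean())
```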
The context effect
Because the emphasis of this section is on the form of the context terms, only fully separated context models are shown in Figure 8. As illustrated in that figure, w^{φ} is negative across most frequency differences, whereas w^{τ} and w^{λ} are generally positive. Thus, the context term w^{τ} ⊗ w^{φ} ⊗ w^{λ} is generally suppressive over most time-frequency differences.
The w^{τ} parameters illustrated show their greatest suppression at lags of 20–100 ms, depending on the neuron. The w^{φ} values show greatest suppression when the modulating tone is at the same frequency as its target; this suppression falls off with greater frequency separation, vanishing at a separation of about half an octave (or more in some cells; data not shown). The sound-level dependence of the context, w^{λ}, is generally monotonic, and is sometimes different from w^{l}.
Figure 8C demonstrates the context model for one neuron for which context tones just before the target tone can have an excitatory effect (w^{τ} is negative at τ = 20 ms; because w^{φ} is also generally negative, the product of the two is positive, implying excitation). Figure 8A, which displays the context model parameters for a different neuron, shows two slightly excitatory humps at the edges of w^{φ}. This structure of w^{φ} was also observed for other neurons, especially in a few cases when the range of φ was extended (data not shown).
The context model and STRF models
Suppressive stimulus effects can be captured by linear models through negative regions in their temporal or spectrotemporal receptive fields. The context terms of the context model provide an additional way to create suppressive effects. We therefore asked whether the suppressive regions seen in STRFs might be better accounted for by the contextual component of the context model than by negative weights in w^{tf}. To test this, we fit multilinear models with context terms (that is, the fully separated context model) and without context terms (the fully separated input nonlinearity model) to the data, and examined the relative amount of suppression reflected in w^{t} for each neuron in the population. Figure 9 shows the relative suppression min(w^{t})/(max(w^{t}) − min(w^{t})) for both models. There is a marked decrease in the relative amount of observed suppression when the contextual terms are present, and in some cells the suppression disappears entirely. This suggests that the suppressive regions seen in the STRFs are, at least in part, attributable to contextual effects, rather than to additive suppression.
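The suppression index itself is simple to compute. A small self-contained sketch, using invented temporal kernels rather than fitted ones:

```python
import numpy as np

def relative_suppression(w_t):
    """min(w^t) / (max(w^t) - min(w^t)): 0 when there is no trough, and
    increasingly negative as the suppressive trough deepens relative to
    the full excitatory-to-suppressive range."""
    w_t = np.asarray(w_t, dtype=float)
    return w_t.min() / (w_t.max() - w_t.min())

# Illustrative temporal kernels: one with a suppressive trough, one without
with_trough    = np.array([0.0, 1.0, 0.4, -0.5, -0.2, 0.0])
without_trough = np.array([0.0, 1.0, 0.4, 0.1, 0.05, 0.0])

print(relative_suppression(with_trough))     # -0.5 / 1.5 = -0.333...
print(relative_suppression(without_trough))  # 0.0
```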
Figure 10 illustrates a mechanism that may be responsible for this effect. The PSTH of a hypothetical neuron, modeled by a context model (Fig. 10A) responding to a stimulus similar to the DRC used in the experiments, was fit with both an STRF model (Fig. 10B, top) and a context model (Fig. 10A, dashed line; B, bottom). The original context model contained no additive suppression, as there is no trough in w^{t}, but it did contain multiplicative stimulus interactions (through w^{τ} and w^{φ}). The fitted STRF model, however, contains a clearly visible suppressive region following the excitatory peak. This again shows that the suppressive regions in STRFs fit to experimental data may reflect multiplicative stimulus interactions (forward suppression, two-tone interactions) rather than genuine additive suppression.
Simulation of STRF features changing with spectral density
STRFs fit to data collected using different stimulus classes are generally not the same (Theunissen et al., 2000; Rotman et al., 2001; Bar-Yosef et al., 2002; Valentine and Eggermont, 2004; Woolley et al., 2006). We used the context model to help understand certain systematic stimulus-dependent changes in STRFs related to the spectral density of a sound. The dependence of the STRF on the spectral density of a DRC can be summarized as follows (Blake and Merzenich, 2002): a higher spectral density leads, on average, to a lower excitatory peak in the STRF, and in some cells, the suppressive region only appears at higher spectral densities. Could these experimental results be explained by the context model? We used the context model of Figure 8B (fit on real data at a single spectral density) with five stimuli of varying spectral density, to generate five simulated PSTHs. Then we fit STRF models for these five stimuli and simulated PSTHs. This process is summarized in Figure 11A.
The five STRFs show phenomena similar to those observed experimentally, as shown in Figure 11B. First, there is spectral narrowing at higher spectral densities. Second, the excitatory peak becomes lower at higher spectral densities, whereas the suppressive depth stays roughly constant, as summarized in Figure 11C. This graph is qualitatively the same (a monotonic decrease in peak value, and a roughly constant suppression) as the experimental observations of Blake and Merzenich (2002). Thus, the context model is a candidate for explaining the changes in STRFs at various stimulus densities.
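This simulation pipeline can be sketched end to end. Everything below is a toy stand-in: the divisive-gain response model is only loosely analogous to the fitted context model, the smoothing windows are invented, and the least-squares STRF estimator is unregularized. The point is simply that a fixed multiplicative model yields density-dependent STRFs.

```python
import numpy as np

rng = np.random.default_rng(4)

I, K, J = 2000, 12, 8          # time bins, frequency bins, STRF lags

def simulate_psth(s):
    """Toy context-model-like response: a linear drive divided down by
    recent overall stimulus power (a multiplicative, not additive, effect).
    The 'recent' smoother uses a centered window for brevity."""
    w_f = np.exp(-((np.arange(K) - 6) ** 2) / 4.0)          # spectral tuning
    drive = np.convolve(s @ w_f, [0.0, 1.0, 0.5], mode='same')
    recent = np.convolve(s.sum(axis=1), np.ones(5) / 5, mode='same')
    return drive / (1.0 + recent)        # divisive contextual suppression

def fit_strf(s, r):
    """Least-squares STRF: regress r on lagged copies of the spectrogram."""
    X = np.zeros((I, J * K))
    for j in range(J):
        X[j:, j * K:(j + 1) * K] = s[:I - j]
    w, *_ = np.linalg.lstsq(X, r, rcond=None)
    return w.reshape(J, K)

peaks = []
for density in [0.05, 0.1, 0.2, 0.4, 0.8]:   # fraction of bins with a tone
    s = (rng.uniform(size=(I, K)) < density).astype(float)
    strf = fit_strf(s, simulate_psth(s))
    peaks.append(strf.max())

print(peaks)   # the excitatory peak tends to shrink as spectral density grows
```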
In addition to the behavior shown in Figure 11, there are context models (both fit to real data and constructed by hand) that can be used to generate data for which STRFs would have other forms of dependence on the spectral density of the stimulus. For example, the STRFs may show no suppressive region at low spectral density, and a deepening suppressive region at higher spectral density, which is similar to the experimentally observed appearance of suppressive regions at higher spectral densities. There are also context models for which STRF fits would change very little with spectral density. Cells with correspondingly stable response properties have also been reported experimentally.
Discussion
We have constructed a class of models that describe the transformation between a sound and the firing rate of a neuron in the auditory cortex. These models were used here to analyze responses to DRC stimuli, but are applicable to the analysis of responses to arbitrary sounds. The models make possible novel analyses of auditory cortical responses to complex sounds, and also unify measurement of many different response properties that were previously analyzed separately, such as the ratelevel function, the STRF, and contextual effects.
What is gained by this unification? Why should we not, for instance, measure the rate-level function and the STRF separately instead of estimating the input nonlinearity model? First, the stimulus conditions under which rate-level functions are typically measured (tones in silence) are different from those used for STRF estimation (continuous complex sounds); the state of the auditory system is thus likely to be different in the two cases. The present approach allows us to estimate sensitivities to different stimulus dimensions simultaneously, from a single (complex) stimulus, thus obtaining a more coherent picture of the function of the auditory system in a single context. Second, the terms in the models depend on one another. For example, a complex sound contains energy at different sound levels. An STRF fit to such data carries with it an implicit model for sound level sensitivity; this model is typically the identity function, and so entries of the STRF are weighted proportional to sound level (in, say, decibels SPL). To the extent that this assumption is not the most appropriate, the STRF estimate will be affected. Estimating neuronal sound level sensitivity directly, using the input nonlinearity model, produces models that describe the system more accurately, as demonstrated by the improvement in model predictive power, and the weights w^{tf} in the input nonlinearity model are sometimes not equal to those in the STRF. Similarly, for time-level inseparable cells, fitting a time-level receptive field would be impossible without a model for frequency integration, poor with a suboptimal model of it, and not general if one made frequency integration irrelevant by using a fixed spectral content in the experiment.
A third and final example of the benefits of a unified model, as demonstrated in Results, is that the context terms of the context model have important effects on the inhibitory region of the spectrotemporal receptive field, something that would not be evident if an STRF and a two-tone receptive field were measured separately.
The input nonlinearity
Finding an appropriate single input nonlinearity for all neurons can raise the average predictive power of an STRF model (Gill et al., 2006). However, our data revealed various shapes of w^{l}. Thus, estimating the input nonlinearity adds neuronspecific detail and provides a better understanding of how populations of neurons respond to level in complex sounds.
Separability of receptive fields
Using the input nonlinearity framework, we found frequency-level and time-level inseparabilities during processing of complex sounds in significant proportions of our neuronal population. Attention has previously been focused on time-frequency inseparabilities (Kowalski et al., 1996a; Depireux et al., 2001; Miller et al., 2002; Linden et al., 2003; Simon et al., 2007). The present models open up the possibility of exploring the properties of the other two types of inseparable receptive field.
Time-frequency inseparable receptive fields may underlie frequency sweep direction selectivity (deCharms et al., 1998). The inseparabilities in the other pairs of variables may reflect other functions. For example, cells with a time-level inseparable receptive field may respond to sounds with increasing or decreasing levels, which feature, for instance, in vocalizations (particularly speech). Without multilinear models, such effects are very hard to study in the context of complex sounds.
Stimulus context
Including local stimulus context allows us to predict auditory cortical responses more accurately than before. The context model should also be applicable to precortical auditory structures, where contextual influences of the sort described first arise.
The context terms reveal forward suppression in auditory cortical responses to complex sounds, lasting ∼200 ms, with a frequency width of about half an octave in most cells. Some neurons also show forward excitation through positive humps at the edges of the frequency-difference term of the context model. These results are consistent with those of previous physiological studies using two-tone paradigms in A1 (Calford and Semple, 1995; Brosch and Schreiner, 1997), and with previous psychophysical studies (Moore, 1980, 1997; Jesteadt et al., 1982). The dominance of forward suppression over forward excitation is consistent with recent subthreshold studies (Wehr and Zador, 2005). Again, the current models show the impact of such effects in the responses to complex sounds. Longer time scales of adaptation also exist in A1 (McKenna et al., 1989; Ulanovsky et al., 2003; Bartlett and Wang, 2005), but were not studied here.
We also find that the context model has implications for the interpretation of results of previous STRF studies. Most notably, (1) the suppressive region of w^{t} is reduced when contextual effects are modeled explicitly (Fig. 9), and (2) the context model can account qualitatively for previous observations of stimulus-density-dependent STRFs, such as spectral narrowing and the changing ratio between the maximum and minimum values of the STRF (Fig. 11). These two observations are linked, as the average stimulus power in the context of a time-frequency element increases with spectral density. Thus, with increasing spectral density of the stimulus, the impact of the contextual subunit on the PSTH increases, and therefore the difference between a fitted STRF and the w^{t} ⊗ w^{f} component of a context model will also increase. One way in which the STRF may depart from w^{t} ⊗ w^{f} is through the appearance of the deepening suppressive region (Fig. 10), in line with the first observation above; another way may be through an idealized version of the systematic changes found by Blake and Merzenich (2002), such as spectral narrowing, in line with the second observation.
Interestingly, Blake and Merzenich found that STRF models with a more pronounced suppressive region were on average less predictive than STRF models where no suppressive region was observed. If, as our results here suggest, suppressive regions in the STRF are often signatures of significant contextual effects, then this observation may simply reflect the greater nonlinearity of cells that show them. An explicitly nonlinear approach (such as the context model) may well not show such an effect. (Alternatively, deeper suppressive regions may be associated with lower firing rates, leading to noisier STRFs and lower predictive power.) Thus, we suggest that the previously reported changes in STRF structure with stimulus spectral density arise as a result of linear approximation of nonlinear cortical response functions (Christianson et al., 2008) that are better described by the context model.
The models
The input nonlinearity model is a form of Hammerstein cascade (Narendra and Gallman, 1966; Hunter and Korenberg, 1986); however, its development in the multilinear setting, and related advances in estimation, are more recent (Ahrens et al., 2008). We are unaware of previous discussions of the alternative groupings of stimulus dimensions facilitated by the multilinear view. The higher-order nonlinear models (e.g., the context model) are novel. Multilinear regression models have not received extensive attention in the statistics literature (but see Paatero, 1999). The fully inseparable model r̂ = w^{tfl} · M is a generalized additive model (Breiman and Friedman, 1985; Hastie and Tibshirani, 1999), but its parameters are too numerous for it to be usable (Aertsen and Johannesma, 1980). Other multilinear methods have been useful in various settings in statistics and machine learning (Harshman and Lundy, 1994; Tenenbaum and Freeman, 2000; Vasilescu and Terzopoulos, 2005).
Previously proposed nonlinear models for auditory processing (Fishbach et al., 2001) tend to be designed to mimic the known physiology of the auditory system and often do not allow for straightforward parameter estimation. The statistical ease of estimating multilinear models directly from neuronal data makes them more similar to models based on Volterra-Wiener expansions (Marmarelis and Marmarelis, 1978). However, prohibitive amounts of data are typically needed to estimate the large number of parameters of the latter class of models, so that in general they can only be fit by restricting their parameter space or focusing on a single stimulus dimension (Young and Calhoun, 2005). The data requirements of other nonlinear approaches (Brenner et al., 2000) also make them difficult to apply to auditory cortical data.
Other forms of the multilinear model, perhaps similar to the context model, should allow for the study of within-stimulus interactions and stimulus-specific adaptation in a variety of sensory systems (Ahrens et al., 2006). In the present case, multilinear methods provide a highly efficient description of auditory cortical response functions; note that the fully separated context model, although better able to predict neural responses, has fewer parameters than an inseparable linear STRF model. Multilinear models provide a firm statistical foundation for analyzing nonlinear neuronal response properties; allow response parameters to be learned directly from the data rather than being hand-tuned; and reveal a rich variety of input nonlinearities, inseparabilities, and context effects in auditory cortical responses to complex sounds, thereby introducing new possibilities for data analysis and extending existing understanding of auditory processing of complex sounds.
Footnotes
 Received July 25, 2007.
 Revision received November 16, 2007.
 Accepted December 15, 2007.

This work was supported by the National Institutes of Health and the Gatsby Charitable Foundation. M.B.A. and M.S. developed the models; J.F.L. collected the experimental data; all three authors contributed to the writing of this manuscript. J.F.L. thanks Michael M. Merzenich and Christoph E. Schreiner for support at the University of California at San Francisco, and Robert C. Liu for assistance with data collection.
 Correspondence should be addressed to Maneesh Sahani, Gatsby Computational Neuroscience Unit, 17 Queen Square, London WC1N 3AR, UK. maneesh{at}gatsby.ucl.ac.uk
 Copyright © 2008 Society for Neuroscience 0270-6474/08/281929-14$15.00/0