Abstract
The relationship between a sound and its neural representation in the auditory cortex remains elusive. Simple measures such as the frequency response area or frequency tuning curve provide little insight into the function of the auditory cortex in complex sound environments. Spectrotemporal receptive field (STRF) models, despite their descriptive potential, perform poorly when used to predict auditory cortical responses, showing that nonlinear features of cortical response functions, which are not captured by STRFs, are functionally important. We introduce a new approach to the description of auditory cortical responses, using multilinear modeling methods. These descriptions simultaneously account for several nonlinearities in the stimulus–response functions of auditory cortical neurons, including adaptation, spectral interactions, and nonlinear sensitivity to sound level. The models reveal multiple inseparabilities in cortical processing of time lag, frequency, and sound level, and suggest functional mechanisms by which auditory cortical neurons are sensitive to stimulus context. By explicitly modeling these contextual influences, the models are able to predict auditory cortical responses more accurately than are STRF models. In addition, they can explain some forms of stimulus dependence in STRFs that were previously poorly understood.
- neural coding
- stimulus–response function
- complex sounds
- hearing
- spectrotemporal receptive field
- auditory cortex
Introduction
The spectrotemporal receptive field (STRF) (Aertsen et al., 1980) is widely used to characterize responses of primary auditory cortex (A1) neurons (deCharms et al., 1998; Depireux et al., 2001; Miller et al., 2002; Linden et al., 2003; Tomita and Eggermont, 2005). Previously, however, it has been shown that the STRF, when interpreted as a predictive model, is insufficient to account for auditory cortical responses (Sahani and Linden, 2003a; Machens et al., 2004), indicating that neurons in A1 respond nonlinearly to complex sounds. Nonlinearities have profound consequences for the description of response properties. STRFs often change with the choice of stimulus (Theunissen et al., 2000; Blake and Merzenich, 2002; Valentine and Eggermont, 2004). Although this might reflect adaptation to stimulus statistics, the apparent lability of STRF structure could also arise from the approximation of a nonlinear stimulus–response function with a linear function within different regions of stimulus space (Borst et al., 2005; Christianson et al., 2008).
Thus, nonlinear models are required to gain a better understanding of auditory cortical responses to complex sounds. Here we introduce a new class of such models, which are compact and easy to estimate from limited experimental data. We show how these models can capture and quantify, in neuronal responses to complex stimuli, three distinct nonlinear phenomena documented in experiments with isolated stimuli.
First, although standard STRF models assume that firing rate is related linearly to the sound level at every point in spectrotemporal space, this is not generally true (Phillips and Irvine, 1981). Our “input nonlinearity” model captures neuron-specific nonlinear mappings between spectrotemporal energy and neuronal responses to a complex sound.
Second, standard STRF models can describe time-frequency inseparabilities (i.e., interdependencies in temporal and spectral tuning properties), but not time-level or frequency-level inseparabilities, such as loudness-dependent latency (Phillips, 1989), or frequency-dependent tuning to loudness (Sutter, 2000). The input nonlinearity model allows us to estimate inseparabilities in all pairs of the three stimulus parameters, and thus to analyze how these inseparabilities affect neuronal responses to dynamic complex sounds.
Third, suppressive phenomena such as two-tone and forward suppression are observed at all stages of the auditory pathway, including the auditory cortex (Brosch and Schreiner, 1997; Bartlett and Wang, 2005). Such phenomena are not modeled well with STRFs, and have therefore rarely been studied using complex sounds (but see Bar-Yosef et al., 2002). Extending the input nonlinearity model to the "context model," we represent local contextual effects as the modulation of the effective sound level at a point in the spectrogram by the energy that falls within a short spectrographic window preceding that point. Thus, this spectrographic window provides a local adaptive context for each element of the spectrogram.
Each of these nonlinear models can be formulated and efficiently estimated using a common flexible framework based on multilinear analysis. These models improve on the predictive performance of STRF models by a factor of ∼1.4. Moreover, for each neuron, the nonlinear models produce quantitative descriptions of the three phenomena described above. In addition, the context model sheds some light on the origin of the suppressive regions commonly observed in STRFs, and provides an explanation for certain systematic changes observed in STRFs measured under different stimulus conditions (Blake and Merzenich, 2002). Thus, the models offer far more insight into auditory cortical response properties than has previously been available from STRF analysis.
Materials and Methods
The DRC stimulus
The dynamic random chord (DRC) stimuli used in these experiments have been described previously (Linden et al., 2003). Each stimulus was a series of 20-ms-long random chords (i.e., combinations of randomly selected tone pulses) played without intervening gaps. The center frequencies of the tone pulses were chosen from 24 or 48 different possibilities (in the range of 25–100 kHz or 2–32 kHz, respectively, depending on the approximate tuning of the neuron), spaced 1/12 octave apart. The number of tones constituting each chord was random, with an average of two tone pulses per octave per chord. Tone pulses were gated on and off with 5 ms cosine ramps to reduce spectral splatter. The peak level of each pulse was chosen randomly from 10 different intensity levels, 5 dB sound pressure level (SPL) apart in the range 25–70 dB SPL. The frequency and intensity of each tone pulse were selected independently. Each of the two stimuli was 60 s long, comprising 3000 different random chords. Exactly the same stimulus was repeated 10 (high frequency) or 20 (low frequency) times, with no intervening gap. All analyses are based on the final 9 or 19 repetitions, so as to avoid artifacts caused by strong adaptation at the onset of stimulation.
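For readers who wish to experiment with such stimuli, the following minimal Python/NumPy sketch generates a DRC-like level spectrogram under the statistics described above. The function name, defaults, and the use of 0 to mark silent bins are our illustrative choices rather than specifications from the original experiments, and the 5 ms cosine gating is omitted because it does not enter the spectrographic representation used by the models.

```python
import numpy as np

def make_drc(n_chords=3000, n_freqs=48, levels_db=np.arange(25, 71, 5),
             tones_per_oct_per_chord=2, bins_per_octave=12, seed=0):
    """Generate a DRC level spectrogram (chords x frequency bins) in dB SPL.

    Each 20 ms chord holds a random set of tone pulses: every frequency bin
    is occupied independently with probability p, giving on average
    `tones_per_oct_per_chord` tones per octave per chord; occupied bins get
    a level drawn uniformly from `levels_db`. A value of 0 marks silence.
    """
    rng = np.random.default_rng(seed)
    p = tones_per_oct_per_chord / bins_per_octave
    present = rng.random((n_chords, n_freqs)) < p
    levels = rng.choice(levels_db, size=(n_chords, n_freqs))
    return np.where(present, levels, 0.0)

s = make_drc()  # shape (3000, 48): one 60 s stimulus at 20 ms resolution
```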
Experimental methods
The experimental methods were similar to those described by Linden et al. (2003). Surgical procedures conformed to protocols approved by the University of California at San Francisco's Committee on Animal Research and were in accordance with United States federal guidelines for care and use of animals in research. Mice and rats were anesthetized and maintained at a surgical plane of anesthesia with ketamine and medetomidine or pentobarbital, and extracellular recordings were made in early auditory cortical areas [A1 of rats, and A1 and anterior auditory field (AAF) of mice]. Recordings targeted thalamorecipient layers III/IV (Smith and Populin, 2001) by cortical depth (350–600 μm below the dural surface) and by the polarity and size of stimulus-evoked local field potentials. The recordings were later bandpass filtered and then analyzed using Bayesian spike-sorting techniques (Lewicki, 1994) (user interface software by M. Kvale, University of California San Francisco, CA) to extract responses from single units or small clusters of neurons.
Notation
In this study, bold letters such as w represent arrays: these may be vectors, matrices, or higher-order Cartesian tensors. Superscript labels on arrays of weights and stimuli indicate the corresponding dimensions. Thus w^{tf} represents an array of STRF weights that spans time lag and frequency. Elements of an array are italicized and subscripted: w^{tf}_{jk} is the element at time lag j and frequency bin k in the matrix w^{tf}. To simplify notation, we have adopted modified versions of operators from multilinear algebra. The symbol ⊗ generalizes the vector outer product, such that, for example, if b, c, and d are all vectors, then a = b ⊗ c ⊗ d is a three-dimensional array, with elements a_{ijk} = b_i c_j d_k. The symbol · generalizes the inner product, so that all indices shared between arguments on the left and right sides of the operator are summed (or contracted); which indices are shared will be clear from the context. Standard matrix multiplication is shown without an explicit operator.
Model estimation
To capture the response properties of auditory cortical neurons, predictive models may be used to approximate the neural stimulus–response function (i.e., the transformation from stimulus to neuronal firing rates). A widely used model is based on the STRF, here denoted w^{tf}, where the superscripts denote time lag and frequency, respectively. Such models predict the time-varying firing rate of a neuron, r̂(i) (where i indexes time bins), from the stimulus s(i, k) (denoting the power in time bin i and frequency bin k of the spectrographic representation of the sound) through the following formula:

r̂(i) = c + Σ_{jk} w^{tf}_{jk} s(i − j + 1, k).    (1)
The parameters of the model, i.e., the constant c and the elements of the STRF matrix w^{tf}, should be chosen so as to make the predicted firing rate r̂(i) as close as possible to the observed firing rate r(i). One approach is to present the same stimulus s multiple times; the mean firing rate over repeated trials will then be approximately Gaussian distributed around the "true" mean stimulus-dependent firing rate. The parameters c and w^{tf} can then be found by least-squares linear regression, minimizing the squared error ε = Σ_i (r(i) − r̂(i))², which is equivalent to maximum-likelihood estimation under the Gaussian assumption.
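As a concrete illustration, the following sketch fits c and the STRF by unregularized least squares, using a design matrix of time-lagged copies of the spectrogram (continuing the NumPy example above; r stands for a trial-averaged firing rate vector, which is assumed here rather than computed):

```python
import numpy as np

def lagged_design(s, n_lags=11):
    """Stack time-lagged copies of the spectrogram into a design matrix.

    Row i holds s(i - j + 1, k) for lags j = 1..n_lags and all frequency
    bins k, plus a trailing 1 that carries the constant term c.
    """
    T, K = s.shape
    X = np.zeros((T, n_lags * K + 1))
    for j in range(n_lags):
        X[j:, j * K:(j + 1) * K] = s[:T - j, :]
    X[:, -1] = 1.0
    return X

X = lagged_design(s)                        # (T, J*K + 1)
w, *_ = np.linalg.lstsq(X, r, rcond=None)   # least squares = Gaussian ML fit
strf, c = w[:-1].reshape(11, -1), w[-1]     # w^tf (lags x frequencies) and c
```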
One danger of maximum likelihood regression is overfitting. That is, the parameters that minimize the squared error on one set of training data may do so by capturing stimulus-unrelated variability in those measurements, and therefore perform poorly when predicting responses to a novel set of test data. If this happens, the features exhibited by the best-fit model cannot be presumed to reflect actual structure in the neural response function.
Overfitting may be contained by regularization methods, in which extra terms are introduced into the objective function (or equivalently, a prior belief about the parameter properties is expressed) so as to penalize solutions that have, for example, many large parameter values, or large variation in parameter values. The hyperparameters of these extra terms, for example, the scale of the smoothness, are often set by hand. Here, we used a data-driven method called automatic smoothness determination (ASD) to estimate optimal values for the scale of the smoothness and the magnitude of the model parameters (Sahani and Linden, 2003b).
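The sketch below conveys the flavor of such regularization with fixed, hand-set hyperparameters: a penalty matrix combining shrinkage with a second-difference smoothness term, used in a penalized least-squares solve. It is not the ASD procedure itself, which instead optimizes these hyperparameters from the data (and, for a two-dimensional STRF, assigns separate smoothness scales to the lag and frequency dimensions); `shrink` and `smooth` are illustrative values only.

```python
import numpy as np

def smoothness_penalty(n, shrink=1.0, smooth=10.0):
    """Penalty matrix D for a length-n parameter vector: `shrink` penalizes
    large weights, `smooth` penalizes curvature (second differences)."""
    D2 = np.diff(np.eye(n), n=2, axis=0)  # (n-2, n) second-difference operator
    return shrink * np.eye(n) + smooth * (D2.T @ D2)

def penalized_fit(X, r, D):
    """Minimize ||r - X w||^2 + w^T D w in closed form."""
    return np.linalg.solve(X.T @ X + D, X.T @ r)
```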
Bilinear models
Separable STRF models.
A linear spectrotemporal receptive field w^{tf} is called separable if, in its matrix form, it can be written as the outer product of two vectors, w^{tf} = w^t (w^f)^T; or, in terms of the components, w^{tf}_{jk} = w^t_j w^f_k. Functionally, this implies that the relative response of the neuron to tones of different frequencies, given by a vector w^f, is preserved across time lag, with the absolute responses scaled by a different (possibly negative) factor at each time, given by the vector w^t; or, equivalently, that the temporal response profile at all frequencies is identical up to a per-frequency scale factor. Mathematically, the predicted firing rate in time bin i is, up to an additive constant (discussed later), as follows:

r̂(i) = Σ_{jk} w^t_j w^f_k s(i − j + 1, k).
Thus, although the separable model is nonlinear in the parameters taken all together, it is linear in each of the vectors w^t and w^f alone. Such a model is called bilinear. A graphical illustration of a bilinear model is shown in Figure 1A.
A, A bilinear STRF. The STRF (right) is equal to the outer product of two vectors, one describing the temporal response (left) and the other describing the spectral response (middle). B, Schematic of the fully separated input nonlinearity model w^t ⊗ w^f ⊗ w^l. The response to latency, frequency, and sound level is assumed to be fully separable. C, Schematic of the input nonlinearity model w^{tf} ⊗ w^l. The full spectrotemporal component (left) and the sound-level component (middle) produce a response function (right) that is separable in sound level and time, and in sound level and frequency, but not in time and frequency. Note that there are weights (not shown) in the interior of the cube.
Many STRFs in the auditory system appear to be well modeled as separable (Linden et al., 2003; Simon et al., 2007), but even when this is not the case, a bilinear model may be used to approximate the full inseparable linear (in w^{tf}) relationship. Here, we discuss this bilinear approximation to the STRF as a preliminary to the more extensive multilinear models introduced later.
One way to fit a bilinear model is to estimate the full receptive field w^{tf}, and then examine the singular value decomposition (Strang, 1988) of the resulting matrix. The component vectors corresponding to the largest singular value give the best separable approximation to the full STRF matrix in the least-squares sense; that is, they minimize the term Σ_{jk} |w^{tf}_{jk} − w^t_j w^f_k|². However, even when w^{tf} is itself found by least-squares regression, this cost function is different from the following bilinear model error:

ε = Σ_i ( r(i) − Σ_{jk} w^t_j w^f_k s(i − j + 1, k) )².
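A minimal NumPy sketch of the SVD route just described; note that it minimizes the matrix-approximation error, not the bilinear model error above:

```python
import numpy as np

def best_separable(strf):
    """Best rank-1 (separable) approximation of an STRF matrix, via the SVD.

    Returns w_t, w_f minimizing sum_jk |strf[j, k] - w_t[j] * w_f[k]|^2.
    """
    U, svals, Vt = np.linalg.svd(strf)
    w_t = U[:, 0] * svals[0]   # temporal profile (carries the overall scale)
    w_f = Vt[0, :]             # spectral profile (unit norm)
    return w_t, w_f
```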
We derive algorithms to minimize this model error (possibly with additional regularization terms) directly. It is helpful to simplify notation as follows. Consider an expanded three-dimensional stimulus array, augmented by the addition of a time-lag dimension: M^{itf}_{ijk} = s(i − j + 1, k). This change of notation allows us to write the bilinear spike rate prediction as follows:

r̂(i) = Σ_{jk} M^{itf}_{ijk} w^t_j w^f_k

or

r̂ = (w^t ⊗ w^f) · M^{itf},

where the second form adapts the tensor notation of multilinear algebra as described above. Here, because M^{itf} shares time-lag and frequency indices (j and k) with (w^t ⊗ w^f), the sum implied by the · operator is over those two dimensions, with the third index i remaining, corresponding to the bin-by-bin predictions collected in the vector r̂. If we now write the vector norm as ‖v‖ = √(Σ_i v_i²), the bilinear model error becomes

ε = ‖r − (w^t ⊗ w^f) · M^{itf}‖².

In the following, we will use component notation, tensor notation, or both, as needed to aid clarity.
Parameter estimation.
At a local minimum of the error, we have the following:

∂ε/∂w^t = 0

and

∂ε/∂w^f = 0.

Differentiating and rearranging yields the following fixed-point conditions:

(w^f · M^{itf})^T r = (w^f · M^{itf})^T (w^f · M^{itf}) w^t

and

(w^t · M^{itf})^T r = (w^t · M^{itf})^T (w^t · M^{itf}) w^f.
Defining A = w^f · M^{itf} and B = w^t · M^{itf}, and noting that both A and B are matrices, we obtain the matrix equations A^T r = A^T A w^t and B^T r = B^T B w^f, which can be solved to give w^t = (A^T A)^{−1} A^T r and w^f = (B^T B)^{−1} B^T r. It should come as no surprise that these each resemble the solution to a linear regression problem, as the bilinear model is linear in each parameter vector separately. However, the matrix A (respectively B) is itself a function of w^f (respectively w^t), and so these equations do not give a closed-form solution to the optimization problem. Instead, the equations are applied iteratively, alternating between updates for w^t and for w^f. Each iteration decreases the squared error and, because ε cannot drop below zero, the iterations must converge to an optimum in the parameter space.
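A minimal sketch of this alternating scheme (NumPy; M is the lagged stimulus array M^{itf} of shape (T, J, K) and r the observed rate vector; the regularization and background-rate augmentation discussed below are omitted):

```python
import numpy as np

def bilinear_als(M, r, n_iter=50):
    """Alternating least squares for r(i) ≈ sum_jk M[i,j,k] w_t[j] w_f[k]."""
    T, J, K = M.shape
    w_t, w_f = np.ones(J), np.ones(K)
    for _ in range(n_iter):
        A = M @ w_f                                  # (T, J): contract freq k
        w_t = np.linalg.lstsq(A, r, rcond=None)[0]   # w_t = (A'A)^-1 A'r
        B = np.einsum('ijk,j->ik', M, w_t)           # (T, K): contract lag j
        w_f = np.linalg.lstsq(B, r, rcond=None)[0]   # w_f = (B'B)^-1 B'r
    return w_t, w_f
```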
The background rate.
Because neurons often fire in the absence of a stimulus, a complete predictive model must include a constant term. As in the linear case, this can be introduced cleanly into the multilinear framework by augmenting the stimulus appropriately. If there are T time points in the stimulus, and the time-lag index ranges from 1 … J and the frequency index from 1 … K, so that M^{itf} is T × J × K, we consider a T × (J + 1) × (K + 1) array Q^{itf} with

Q^{itf}_{ijk} = M^{itf}_{ijk} for j ≤ J and k ≤ K;  Q^{itf}_{i(J+1)(K+1)} = 1;  and Q^{itf}_{ijk} = 0 otherwise.    (12)
Then, with w^t and w^f also augmented to contain J + 1 and K + 1 elements, respectively, the model r̂ = (w^t ⊗ w^f) · Q^{itf} becomes equivalent to a model with background rate, r̂ = c + (w^t ⊗ w^f) · M^{itf}, with c = w^t_{J+1} w^f_{K+1}.
Multilinear models
We have seen that a separable STRF may be viewed as a bilinear predictive model. Shortly, we will see that more complex models, incorporating nonlinear level sensitivities and acoustic-context-dependent sensitivities, can also be expressed in a multilinear form. However, before going on to develop these models, it is useful to discuss some general aspects of parameter estimation in the multilinear setting (Ahrens et al., 2008).
The general predictive form of a multilinear model can be written in components as

r̂(i) = Σ_{jk…m} a_j b_k … z_m Q_{ijk…m},

or r̂ = (a ⊗ b ⊗ … ⊗ z) · Q, where a, b, …, z are parameter vectors and Q is a fixed multidimensional array that depends only on the stimulus.
Alternating least squares.
As in the case of the bilinear model discussed above, the squared error ε = ‖r − (a ⊗ b ⊗ … ⊗ z) · Q‖² can be minimized by cycling through a set of update equations, each of which resembles the solution to a linear regression problem:

a ← (A^T A)^{−1} A^T r  with  A = (b ⊗ c ⊗ … ⊗ z) · Q,
b ← (B^T B)^{−1} B^T r  with  B = (a ⊗ c ⊗ … ⊗ z) · Q,
  ⋮
z ← (Z^T Z)^{−1} Z^T r  with  Z = (a ⊗ b ⊗ … ⊗ y) · Q.    (14)

In each case, the contraction leaves free the time index i and the index of the parameter vector being updated, so that A, B, …, Z are matrices.
Each of these updates minimizes the squared error with respect to one parameter vector, while holding the others fixed. Thus, the algorithm, a variant of alternating least squares, is guaranteed not to increase the squared error at any step and, hence (as the error is bounded below by zero), to converge.
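The sketch below implements these updates for a three-factor model using einsum contractions (unregularized; with the penalties described next, each plain least-squares solve would become a = (AᵀA + D_a)⁻¹Aᵀr):

```python
import numpy as np

def trilinear_als(Q, r, n_iter=50):
    """ALS for the fully separated model r ≈ (w_t ⊗ w_f ⊗ w_l) · Q,
    where Q has shape (T, J, K, L) as in the input nonlinearity model."""
    T, J, K, L = Q.shape
    w_t, w_f, w_l = np.ones(J), np.ones(K), np.ones(L)
    for _ in range(n_iter):
        A = np.einsum('ijkl,k,l->ij', Q, w_f, w_l)   # design matrix for w_t
        w_t = np.linalg.lstsq(A, r, rcond=None)[0]
        A = np.einsum('ijkl,j,l->ik', Q, w_t, w_l)   # design matrix for w_f
        w_f = np.linalg.lstsq(A, r, rcond=None)[0]
        A = np.einsum('ijkl,j,k->il', Q, w_t, w_f)   # design matrix for w_l
        w_l = np.linalg.lstsq(A, r, rcond=None)[0]
    return w_t, w_f, w_l
```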
Control of overfitting.
As the number and length of the parameter vectors in any model increases, so does the danger of overfitting. As described above, regularization is the practice of adding terms to the cost function to be optimized, so as to discourage overfitting. Here, we adopt a Bayesian perspective, in which the simple cost function corresponds to the (log) likelihood of the parameters given the data, and the regularization terms express our prior beliefs about their forms or values.
The objective function used is of the form ε_r = ε + a^T D_a a + b^T D_b b + … . Viewed probabilistically, the squared-error term ε corresponds to a Gaussian likelihood. Although a likelihood based more directly on a mean of Poisson terms may be a better model for empirical firing rate data, the Gaussian assumption makes fitting algorithms more tractable and is appropriate in the regime of many observations or moderate to high firing rates, where the trial-averaged Poisson-based likelihood approaches the Gaussian form. The additional terms of the objective function express zero-mean Gaussian priors on each parameter vector, with covariances D_a^{−1}, etc. The priors are zero-mean because we have no a priori reason to expect parameters of any particular sign. The covariance matrices are chosen to favor smaller parameter values (commonly referred to as "shrinkage") as well as smoothness in the parameter vectors.
The alternating least squares algorithm of Equation 14 can be adapted to minimize ε_r through a small modification. Given a matrix A as defined in Equation 14, the update for a becomes a = (A^T A + D_a)^{−1} A^T r, and similarly for the other parameter vectors. For fixed prior covariances, these updates are guaranteed to converge, as before.
To obtain appropriate prior covariance matrices D_{(·)}, we adapted the ASD scheme of Sahani and Linden (2003b). In this scheme, each prior covariance matrix is parametrized by two hyperparameters, θ_1^{(·)} and θ_2^{(·)}, describing the degree of shrinkage and smoothing; optimal values of these hyperparameters are found through an automatic procedure. Here, we used this procedure within the first few iterations of the alternating least squares framework. If ASD updates are used at every step, the objective function varies between iterations (because the D matrices change), and convergence of the fitting procedure is no longer guaranteed. For this reason, the covariance parameters were only updated during the first three iterations. Once these covariance parameters were fixed, the remaining iterations were again guaranteed to converge.
Estimation error.
Error bars for the parameters were obtained by a bootstrap procedure (Efron and Tibshirani, 1993). Parameters were refit 10 times, in each case using a training set of the same size as the original, redrawn at random with replacement from the available data. The error bars in all figures indicate the SDs of these 10 estimates.
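A sketch of such a bootstrap (NumPy): here `fit_fn` stands in for whichever fitting routine is in use, and resampling is over individual time bins, which is one of several reasonable choices that the description above leaves open:

```python
import numpy as np

def bootstrap_sd(fit_fn, X, r, n_boot=10, seed=0):
    """SDs of fitted parameters over bootstrap resamples of the data.

    Each resample has the same size as the original and is drawn with
    replacement; `fit_fn` maps (X, r) to a parameter vector.
    """
    rng = np.random.default_rng(seed)
    T = len(r)
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, T, size=T)   # resample time bins w/ replacement
        fits.append(fit_fn(X[idx], r[idx]))
    return np.std(fits, axis=0)
```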
The input nonlinearity model
In this section and the next, we develop predictive auditory models that express various nonlinear stimulus dependencies in multilinear form, allowing them to be fit using the methods described above.
Stimulus representation.
An auditory STRF model is linear in the spectrogram. Consequently, the predictions of the model, and the parameter values obtained, depend on whether the input stimulus s(i, k) represents the intensity, the power, or the log intensity of the stimulus at the given time and frequency. Indeed, the best predictions may be obtained not from any of these external representations, but rather from something closer to the rate-level function of the cell, or some other cell-specific mapping. Here, we consider a generalized model in which a suitable nonlinear transform of the stimulus can be found directly from the data. Based on the STRF model of Equation 1, such a model has the following form:

r̂(i) = c + Σ_{jk} w^{tf}_{jk} g(s(i − j + 1, k)),    (15)

where g is a function to be learned. The mapping g is an "input nonlinearity," which transforms the representation of sound level in the spectrogram before it is spectrotemporally filtered by w^{tf}. This form is linear in w^{tf}, but g has not yet been parameterized so as to make estimation possible. One natural parameterization, common in the regression literature, is as a linear combination of a fixed set of basis functions {g_l}, so that g(x) = Σ_l w^l_l g_l(x), for some parameter vector w^l.
For the stimulus considered here, the level of each tone pulse was drawn from a set of 10 distinct possible values. Thus, a natural set of basis functions comprises 10 indicator functions, each taking the value 1 for a single possible input sound level, and 0 otherwise. This leads to a simple interpretation of w^l_l as the net effective input corresponding to a pulse at the lth intensity level. However, the mathematical development that follows is applicable to any choice of basis and, thus, similar methods could be used with other stimuli, even if not discretized in level (Ahrens et al., 2008).
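In code, the indicator expansion and the corresponding lagged array M^{itfl} might be built as follows (a sketch continuing the earlier NumPy examples; the resulting array, once augmented for the constant term, is what the alternating least squares routine above consumes):

```python
import numpy as np

def level_basis(s, levels_db=np.arange(25, 71, 5)):
    """Indicator basis: expand a level spectrogram s (time x freq, in dB,
    0 = silence) into g of shape (T, K, L), where g[i, k, l] = 1 iff the
    bin holds a tone at the l-th level. Silent bins map to all zeros."""
    return (s[:, :, None] == levels_db[None, None, :]).astype(float)

def lagged_tensor(g, n_lags=11):
    """Build M^itfl with M[i, j, k, l] = g_l(s(i - j + 1, k))."""
    T, K, L = g.shape
    M = np.zeros((T, n_lags, K, L))
    for j in range(n_lags):
        M[j:, j] = g[:T - j]
    return M

M_itfl = lagged_tensor(level_basis(s))   # feeds trilinear_als(M_itfl, r)
```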
Given this basis parameterization, Equation 15 can be rewritten as follows:

r̂(i) = c + Σ_{jkl} w^{tf}_{jk} w^l_l g_l(s(i − j + 1, k)).    (16)

If we now define a four-index array M^{itfl}_{ijkl} = g_l(s(i − j + 1, k)) and consider a separable STRF model w^{tf}_{jk} = w^t_j w^f_k, the model can be written in the following multilinear form:

r̂(i) = Σ_{jkl} w^t_j w^f_k w^l_l Q^{itfl}_{ijkl}    (17)

or

r̂ = (w^t ⊗ w^f ⊗ w^l) · Q^{itfl},    (18)

with the four-dimensional array Q^{itfl} defined by augmenting M^{itfl} in a manner analogous to Equation 12. This parameter grouping is illustrated in Figure 1B.
Grouping stimulus dimensions.
The unseparated form (Eq. 16) is itself a bilinear model, with linear dependence on each of the STRF parameters w^{tf}_{jk}; that is, it can be written r̂ = (w^{tf} ⊗ w^l) · Q^{itfl}, as illustrated in Figure 1C. It is therefore possible to apply the multilinear estimation framework directly, without forcing the time-frequency component of the model to be separable. Formally, this may be done by considering the "rasterized" or "unrolled" vector form of the matrix w^{tf}, often written vec(w^{tf}), while similarly unrolling the corresponding dimensions of the array Q^{itfl}.
This model has many more parameters than the fully separated, trilinear version (Eqs. 17, 18), and is thus considerably more prone to overfitting, making regularization yet more crucial. In this case, the automatic smoothness determination procedure (Sahani and Linden, 2003b) obtains separate smoothing parameters for the time-lag and frequency dimensions, even though they are combined into a single vector.
Although this model corresponds well to the standard STRF representation, it is clearly not the only way in which parameters of the trilinear model may be grouped. Instead, a neuron might be better described as having an inseparable frequency-level receptive field, described by a matrix w^{fl}. In this case, the model is r̂ = (w^t ⊗ w^{fl}) · Q^{itfl}. This model is able to capture inseparable frequency-level components of the response function, which are often seen in static receptive fields but are ignored in linear STRF models. However, the model assumes that the time component of the response function is separable from the frequency-level component.
A neuron might also have an inseparable time-sound level component, in which case the appropriate model might be r̂ = (w^{tl} ⊗ w^f) · Q^{itfl}, which is able to capture inseparable time-level properties of the response function (i.e., changes in sound level tuning at different lag times).
For any given neuron, we can consider each of these possible models, evaluating the performance of each by cross-validation and choosing the most predictively successful model to characterize the cell.
Indeed, it is possible, in principle, to dispense with separability altogether, fitting a linear model with a three-dimensional parameter array w^{tfl}. In practice, however, the number of parameters implied by this model (the product of the numbers of time lags, frequency bins, and level basis functions) makes it impractical to estimate without considerable overfitting.
Adaptation and two-tone interaction: the context model
Documented nonlinear effects in the auditory pathway include two-tone suppression, where the response evoked by a tone is affected by a simultaneously presented tone at a different frequency (Sutter and Schreiner, 1991), and forward suppression, where the tone-evoked response is modulated by earlier tones at the same frequency (Brosch and Schreiner, 1997).
These and other contextual effects are partly modeled by the standard STRF; however, this framework forces the contextual influence to be additive. Here, we consider multilinear models that also describe multiplicative contextual effects. In particular, we consider models in which the context of a tone pulse in the DRC, or, more generally, of the energy at a point in the spectrogram, which we call a time-frequency element, may be viewed as multiplicatively modulating its effective "level"; that is, the stimulus in time bin i at frequency k has an effective strength given by the following:

ŝ(i, k) = g(s(i, k)) [c2 + Context(i, k)].    (19)

Here, g is a nonlinear function of the type considered previously (Eq. 15), and Context(i, k) is a function mapping the stimulus context around the time-frequency point under consideration to a multiplicative factor.
We assume that the contextual modulator Context(i, k) is itself a multilinear function of the same form as the input nonlinearity model, but indexed relative to the time and frequency being modulated, as sketched in Figure 2. That is, we write the contextual modulation as follows:

Context(i, k) = Σ_{(m,n) ≠ (1, Φ+1)} Σ_p w^τ_m w^φ_n w^λ_p h_p(s(i − m + 1, k + n − Φ − 1)),

where m = 1 … M, n = 1 … N, p = 1 … P, and Φ = (N − 1)/2 is the maximal absolute frequency difference considered between the contextual and modulated time-frequency elements. The exclusion of (m, n) = (1, Φ + 1) is needed because a time-frequency element cannot be in its own context [that is, s(i, k) is not part of the expression for Context(i, k)]. The vectors w^τ and w^φ contain weights that depend on the relative time differences m = 1 … M and frequency differences n = 1 … N, respectively. The vector w^λ transforms the contextual sound energy in terms of a set of P basis functions h_p(s), which are here chosen to be identical to the g_l(s) used in the input nonlinearity model, although they may differ in general.
Schematic of the context model for a DRC stimulus. The sound level components are not shown. A, The context part of the model, dependent on time-frequency (i, f) and parametrized by (τ, φ, λ). A weighted sum of the tone pulses (or, for a general sound, time-frequency elements) in the box acts to modulate the effective sound level of the target tone, indicated by the arrow. B, The time-frequency part of the model, equivalent to the domain of action of the STRF. A weighted sum of the amplitudes of the tone pulses in the box after modulation by context (A) predicts the spike rate, indicated by the downward pointing arrow. Inset, The axes labels for A and B.
In the tensor notation used previously, Context(i, k) = (w^τ ⊗ w^φ ⊗ w^λ) · M^{τφλ}(i, k), where M^{τφλ}(i, k) is a stimulus array that depends on the (i, k) position of the time-frequency element being modulated, and (if the bases {g_l} and {h_p} are the same)

M^{τφλ}_{mnp}(i, k) = g_p(s(i − m + 1, k + n − Φ − 1))

for 1 ≤ m ≤ M, 1 ≤ n ≤ N, 1 ≤ p ≤ P (= L here), with the elements drawn from M^{itfl} as defined in Equations 17 and 18. M^{τφλ}(i, k) will be referred to as the "contextual subunit" of the time-frequency element at time i and frequency k (Fig. 2).
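As a concrete sketch, the contextual modulator can be evaluated over the whole spectrogram by correlating the level-expanded stimulus with the kernel w^τ ⊗ w^φ (NumPy, reusing level_basis from above; the zero-padding at the boundaries and the zero-based indexing are our simplifications):

```python
import numpy as np

def context_field(g, w_tau, w_phi, w_lam):
    """Context(i, k): weighted sum of preceding nearby time-freq elements.

    g is the (T, K, L) indicator expansion of the spectrogram; w_tau spans
    time differences 0..M-1 (0 = simultaneous) and w_phi spans frequency
    offsets -Phi..Phi. The element's own bin is excluded from its context.
    """
    n_tau, n_phi = len(w_tau), len(w_phi)
    Phi = (n_phi - 1) // 2
    u = g @ w_lam                      # (T, K): level-weighted stimulus
    kern = np.outer(w_tau, w_phi)
    kern[0, Phi] = 0.0                 # an element is not its own context
    up = np.pad(u, ((n_tau - 1, 0), (Phi, Phi)))  # pad past time, freq edges
    T, K = u.shape
    ctx = np.zeros_like(u)
    for m in range(n_tau):
        for n in range(n_phi):
            ctx += kern[m, n] * up[n_tau - 1 - m:n_tau - 1 - m + T, n:n + K]
    return ctx  # effective level is then g(s(i,k)) * (c2 + ctx[i,k])
```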
The contextual influence modulates the effectiveness of the stimulus s(i, k) according to Equation 19. The modulated sound strengths are then combined according to the noncontextual multilinear form. Thus, the full model becomes the following:

r̂(i) = c1 + Σ_{jkl} w^t_j w^f_k w^l_l M^{itfl}_{ijkl} [c2 + (w^τ ⊗ w^φ ⊗ w^λ) · M^{τφλ}(i − j + 1, k)].    (22)

The first term in the sum reflects the input nonlinearity model, and the second term reflects the contextual modulation of the input to that model. To render this in the standard multilinear form, we define a seven-dimensional array Q^{itflτφλ} by augmenting the stimulus arrays M^{itfl} and M^{τφλ} in a manner analogous to Equation 12. Then, with w^t … w^λ correspondingly augmented with one or two additional elements each, we obtain the expression r̂ = (w^t ⊗ w^f ⊗ w^l ⊗ w^τ ⊗ w^φ ⊗ w^λ) · Q^{itflτφλ}, with the constants c1 and c2 in Equation 22 recovered as products of the augmented parameter elements, just as the background rate c was recovered in Equation 12.
Note that the parameter vectors w^τ, w^φ, and w^λ associated with the contextual subunit are independent of the time lag j and the frequency k. As such, the action of the contextual subunit may be interpreted as an adjustment of the stimulus that happens before the action of the input nonlinearity model, applied identically at every time and frequency. However, the parameters determining the contextual effects are estimated simultaneously with the others; algorithmically, there is no hierarchy.
We may regroup some of the parameter vectors as before; for example, we may assume all components are separable, or group time and sound level together so that the parameters are w^{tl} ⊗ w^f ⊗ w^τ ⊗ w^φ ⊗ w^λ.
Because the different components of the models interact multiplicatively, there is an ambiguity of scale in the parameters. For example, if w^l were multiplied by a constant factor and w^t divided by that same factor, an identical model would result. We resolve this ambiguity in the input nonlinearity part of the model by rescaling all parameter vectors but the first so that the maximal absolute value of the vector components is 1. The first parameter vector then has an unconstrained, but well defined, y-axis with units of spikes per second. The context part of the model can be separately rescaled relative to c2; we resolve this by dividing by this constant so that c2 = 1, and then scaling w^φ and w^λ to have maximal absolute values of 1, thereby assigning the scale of the context terms to w^τ.
The dimensionalities used in this study are as follows. For w^t, the time lag ranged from 0 to 10 (in units of 20 ms time bins), i.e., J = 11. For w^f, there were either 24 or 48 frequency bins in the stimuli, so that K = 24 or 48. For w^l, the stimuli contained 10 sound levels, so L = 10. For w^τ, the contextual subunit was defined to stretch from 0 to 10 time bins into the past, so that M = 11. For w^φ, the contextual subunit was defined to stretch 5/12 octave to either side of a target tone (or less, if the target tone appeared near the spectral boundary of the stimulus); because the frequency bands in the stimulus were 1/12 octave apart, N = 11. For w^λ, the stimuli contained 10 sound levels, so P = 10.
Results
Neural recordings
The neuronal population consisted of 147 units, 97 from mice and 50 from rats. Of these, 39 recordings (21 in rats and 18 in mice) were made under stimulation by higher-frequency sounds (25–100 kHz), and 108 (76 in mice and 32 in rats) were made under stimulation by lower-frequency sounds (2–32 kHz). In mice, about half of the recordings came from A1, and the other half came from the AAF; all of the recordings in rats were from A1. In all of the recordings used for analysis, an estimate of the repeatable stimulus-locked signal power (Sahani and Linden, 2003a) was at least one SD greater than zero. Results based on subsets of these data have been reported previously (Linden et al., 2003; Sahani and Linden, 2003b).
Performance of the models
The predictive performance of the STRF and multilinear models fit to each recording in the population was evaluated by cross-validation. The 60 s stimulus presentation time was first divided into 10 segments of equal size. The models were then trained using data from nine of these segments and tested on the remaining one; this procedure was repeated 10 times, once for each of the 10 possible test segments. The predictive power of a model was defined to be the average predictive power on the 10 cross-validation datasets (Sahani and Linden, 2003a). Predictive power is a measure of performance that takes into account trial-to-trial variability of the neuronal response; its expected value for the “perfect” model is 1, and for a model predicting only the mean firing rate it is 0. Figure 3 shows these predictive powers for the fully separated input nonlinearity and context models, compared with a well-fit STRF model (estimated by automatic smoothness determination) (Sahani and Linden, 2003b).
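One common estimator of this quantity, following Sahani and Linden (2003a), can be sketched as follows (the estimator details used in the paper may differ; `trials` is a trials × time array of binned spike counts and `pred` the model prediction):

```python
import numpy as np

def predictive_power(trials, pred):
    """Normalized predictive power: 1 for a perfect model, 0 for a constant
    mean-rate prediction, negative if worse than predicting the mean."""
    N = trials.shape[0]
    power = lambda x: np.mean((x - x.mean()) ** 2)
    ybar = trials.mean(axis=0)                      # trial-averaged PSTH
    # unbiased estimate of the repeatable (signal) power across trials
    p_signal = (N * power(ybar)
                - np.mean([power(y) for y in trials])) / (N - 1)
    return (power(ybar) - power(ybar - pred)) / p_signal
```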
Lower bounds on predictive power for neurons in rodent auditory cortex. A, Predictive power of the w^{tf} ⊗ w^l input nonlinearity model versus that of a well fit linear (STRF) model. B, Predictive power of the w^{tf} ⊗ w^l ⊗ w^τ ⊗ w^φ ⊗ w^λ context model versus that of the linear model. C, Predictive power of the w^{tf} ⊗ w^l ⊗ w^τ ⊗ w^φ ⊗ w^λ context model versus that of the w^{tf} ⊗ w^l input nonlinearity model. Negative predictive power means that the prediction is worse than predicting a constant mean firing rate. The input nonlinearity model performs better on most neurons than the STRF model, and the context model performs better than both on the majority of neurons, despite having the largest number of parameters. D, Predictive power of the w^t ⊗ w^f ⊗ w^l ⊗ w^τ ⊗ w^φ ⊗ w^λ context model versus that of the STRF model. The cross-validation performance of the fully separated context model is higher than that of all the other models.
The predictive power obtained by cross-validation is a lower bound on the true predictive power achievable by the model. As finite data are used for training, the estimated parameters are inevitably overfit to the training data, which leads to suboptimal predictive performance on the test segment. (Negative predictive powers on test data are also possible; this means that the model prediction is less accurate than a constant prediction of the mean firing rate would be.) A complementary upper bound on the predictive power can be obtained by fitting the unregularized model to the entire dataset and computing the resultant predictive power directly on the training data. This overestimates the true value, because the modeling of noise in the training data cannot be distinguished from accurate prediction. The “true” predictive power of the model, i.e., the predictive power it would have when trained on an infinitely large dataset, lies between these upper and lower bounds (Sahani and Linden, 2003a). These bounds become tighter as the trial-by-trial variability of the response, the noise power, decreases. Thus, following Sahani and Linden (2003a), we consider the population as a whole, and extrapolate both the upper and lower bounds to the point of zero noise using polynomial fits, whose order is chosen by leave-one-out cross-validation. This yields relatively tight bounds on the performance expected of the model if it were applied to a hypothetical neuron drawn from a similar population but with a completely stimulus-determined response. The extrapolation is shown in Figure 4.
Extrapolation of the predictive power to the zero-noise limit for the pooled population. Open circles and white line, Upper bound predictive powers. Closed circles and black line, Lower bound predictive powers. The input nonlinearity model is the w^{tf} ⊗ w^l model and the context model is the w^{tf} ⊗ w^l ⊗ w^τ ⊗ w^φ ⊗ w^λ model (see Materials and Methods). Note that both bounds for the multilinear models might be underestimated because their objective function may have multiple local minima. Nevertheless, the input nonlinearity model performs better than the STRF model and the context model performs better than both. All three models sometimes predict negative firing rates; when those negative rates are set to zero, the predictive power bounds of the models increase by between 1 and 2%.
Using this analysis, we find that the predictive power of the input nonlinearity model w^{tf} ⊗ w^l (0.27–0.40, expressed as a fraction of the predictable signal power) is modestly larger than that of the STRF model (0.23–0.37). The predictive power of the context model is substantially higher (0.32–0.52), with the midpoint between the bounds being ∼1.4 times that of the STRF model. This finding demonstrates that taking into account local acoustic context (based on a spectrographic window stretching over recent history and nearby frequencies) through the structure of the context model allows the multilinear description to capture more of the dynamic behavior of auditory cortical neurons than the linear STRF-based descriptions can.
Figure 5 shows four examples of how the predictions of the STRF and context models differ. The context model generally predicts the peristimulus time histogram (PSTH) more accurately, especially at stimulus-evoked peaks in the firing rate.
PSTH and cross-validation predictions for four example cells which illustrate a range of predictive power increments of the context model over the STRF model. Within each panel: top, real PSTH (gray), STRF, and context model predictions (white and black); bottom, spike rate during each trial. Grayscale indicates number of spikes in each 20 ms bin. On the right the cross-validation predictive powers are shown. The context model generally predicts the PSTH more accurately than the STRF model, especially at the peaks of the PSTH (e.g., bottom). The top shows a case in which the context model's prediction quality is roughly the same as that of the STRF model.
Input nonlinearity model
Sensitivity to sound level
We used the input nonlinearity model w^{tf} ⊗ w^l to study the sensitivity of neuronal responses to sound level within the DRC spectrogram. Figure 6A shows the model parameters and predictive performance for one such model. The inferred level-sensitivity parameter, w^l, exhibits a mild threshold, and remains fairly linear (in logarithmic terms) above ∼45 dB SPL. Note that this inferred level dependence may not be the same as the usual rate-level function, for two reasons. First, the rate-level function is conventionally measured using isolated tones presented in silence. Level sensitivity in the context of a complex sound may be quite different (Dean et al., 2005). Second, the estimate of w^l is made in the context of subsequent integration through the time-frequency weights w^{tf}. Estimates of the two parameter vectors are interdependent. Similarly, the weights w^{tf} learned through this model are sometimes different from those of the STRF (Fig. 6E).
A, Input nonlinearity model w^{tf} ⊗ w^l for a neuron in mouse auditory cortex. Left, The time-frequency component w^{tf} of the model. The average SE of the weights of w^{tf} across time-frequency bins is shown by the error bar at the 0 position of the color bar. The SE of the peak weight is shown at the top of the bar; the error at the minimum is shown at the bottom of the bar. Units are spikes/s. Center, The sound level component w^l of the model. The 1 SD error bars are shown as a gray area. All error bars were calculated by bootstrap methods. This graph has been normalized and thus has no units. Right, Predictive power of the STRF model and all the possible configurations of the input nonlinearity model. The white bars show the predictive power of the regularized models on the training data and the black bars show the predictive power on cross-validation data. For this neuron, the w^{tf} ⊗ w^l model has the highest predictive power, and indeed, the w^{tf} receptive field can be seen to be inseparable, because the peak excitation (at a 20 ms lag time) occurs at a lower frequency than the peak suppression (at a 60 ms lag time). B, For comparison, the STRF estimated on the same data, which in this case differs little from w^{tf}. C, Various w^l from the population. Gray areas indicate 1 SE. D, Number of cells whose w^l fits are best approximated by each of five idealized shapes. The shapes were chosen by hand to reflect the variety of shapes of w^l derived from the data. Similarity between the actual w^l and the idealized shape is defined to be the normalized dot product between the graphs. No nonmonotonic shapes were observed, but this may be a result of the sound level range used in the experiments, or of the inequivalence between w^l and the rate-level function. The smoothed ramp shape was the most common. E, Example of a cell for which w^{tf} of the input nonlinearity model is different from the STRF. The difference is shown in the fourth panel; the color code is the same in all panels. The third panel shows w^l for the input nonlinearity model (solid), and the fixed input scaling function that is assumed for all STRF models (dashed). (This function, which is linear in sound pressure, was found to produce better STRF results across the population of cells than a function linear in dB.) The w^{tf} receptive field shows a peak response at a larger lag time than suggested by the STRF. The predictive power of the input nonlinearity model was 1.6 times that of the STRF model, on both training and cross-validation data.
Figure 6C shows further examples of level-sensitivity profiles w^l learned from different neuronal responses. These are illustrative of the variety of shapes seen in the population. To summarize the observed distribution, five prototypical profiles were designed by hand, and each learned profile was classified according to the prototype with which its normalized dot product was greatest. The results are shown in Figure 6D. Note that the profiles were chosen to illustrate the range of shapes observed; we did not see evidence of clustering in the data.
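This classification step amounts to a nearest-prototype assignment under cosine similarity, as in the following sketch (the prototype shapes shown are hypothetical stand-ins; the five shapes actually used were hand-drawn by the authors):

```python
import numpy as np

def classify_level_profile(w_l, prototypes):
    """Assign a fitted level profile w^l to the most similar prototype,
    using the normalized dot product (cosine) as the similarity measure."""
    cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(prototypes, key=lambda name: cosine(w_l, prototypes[name]))

x = np.linspace(0, 1, 10)                      # the 10-level axis
prototypes = {                                 # hypothetical idealized shapes
    'linear ramp': x,
    'smoothed ramp': x ** 2,
    'step': (x > 0.5).astype(float),
}
```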
Inseparabilities
As described in Materials and Methods, the parameters of the input nonlinearity model may be grouped in several different ways. These groupings can be used to assess several forms of inseparability that may be present in the stimulus-response functions. Time-frequency inseparabilities in the STRF have been studied extensively (Kowalski et al., 1996a,b; Depireux et al., 2001; Linden et al., 2003) and related properties (such as temporal symmetry) (Simon et al., 2007) have been documented previously. The opportunity exists to extend such studies to new forms of interaction.
The optimal form of inseparability for each neuron was assessed by comparing the cross-validation predictive powers of the variously grouped models. Thus, a cell would be classified as having a frequency-level inseparable receptive field if the w^{fl} ⊗ w^t model performed better on cross-validation data than the w^{tf} ⊗ w^l, w^{tl} ⊗ w^f, w^t ⊗ w^f ⊗ w^l, and STRF models. There are two caveats to this classification. First, more than one inseparable model could perform better than the fully separated one, suggesting that all three stimulus dimensions might interact in that neuron's response function. As the full model (with no separated parameters) could not be fit reliably with the data available, we did not explore this three-way interaction further. The second caveat is that the different models have different numbers of parameters, and are thus differently prone to overfitting. Thus, although the fully separated w^t ⊗ w^f ⊗ w^l model performs best on cross-validation data for many neurons (Fig. 7C), it is not possible to say whether this reflects genuine separability, or simply greater reliability in the estimates of a smaller number of parameters.
A, Input nonlinearity model w^{fl} ⊗ w^t for a neuron in mouse auditory cortex. Conventions are as in Figure 6. This unit was classified as having a frequency-level inseparable receptive field. B, Input nonlinearity model w^{tl} ⊗ w^f for another auditory cortical neuron in the mouse. Note that the cross-validation predictive power (second black bar) can be negative, meaning that the model does worse at predicting the response of the cell than a constant average firing rate prediction would. This unit was classified as time-level inseparable on cross-validation data. C, Proportions of the population of rodent auditory cortical neurons best described by each of the possible configurations of the input nonlinearity model. Left, Proportions for training data. Right, Proportions for testing data. The discrepancy between the two pie charts illustrates one of the difficulties in classifying the neurons into groups of optimal inseparability, which is that models with more parameters are more likely to overfit to noise in training data, and thereby underfit testing data. Importantly, however, all of the time-frequency, time-level, and frequency-level groupings occur in substantial proportions in both pie charts, showing that all three forms of inseparability exist in auditory cortex.
With these caveats, Figure 7C shows the proportion of neurons best described by each of the configurations of the input nonlinearity model. These results suggest that frequency-level and time-level interactions, as well as the more widely studied time-frequency interactions, are present in the responses of many neurons to complex sounds.
The w^{tf} ⊗ w^l model
Inseparabilities in time and frequency have been extensively documented. The input nonlinearity model also recovers such inseparabilities in a number of neurons in the population (Fig. 7C). For these neurons, then, the form of frequency integration varies with lag time.
The w^{fl} ⊗ w^t model
We also observed several cells with frequency-level inseparabilities (Fig. 7C). Such inseparabilities are consistent with previous reports of frequency response areas derived using isolated tone pips, which can have quite complex and asymmetric shapes. A frequency-level inseparability indicates that sound level is processed differently for different frequencies; or alternatively, that frequency integration depends on sound level. Figure 7A shows an example of such a cell.
The w^{tl} ⊗ w^f model
Finally, we found a substantial number of cells whose optimal inseparability was in time and sound level (Fig. 7C). This form of inseparability indicates that during the processing of a complex stimulus, sound level integration depends on lag time. Figure 7B shows an example of such a neuron. The time-level receptive field is clearly inseparable, showing a fast response to loud sounds, without any suppression. Prolonged suppression occurs only for sounds of intermediate loudness. In other words, the response to sound is monotonic at short lag times, but nonmonotonic at larger lag times. Thus, for this cell, the temporal processing of sounds is critically dependent on the sound level.
The context model
The context model (see Materials and Methods) (Fig. 2) can be viewed as combining two input nonlinearity models. One (Fig. 2, A) determines the modulation of the effective sound level of a time-frequency element of the stimulus by its local context; the other (Fig. 2, B) determines the predicted firing rate using these modulated values. The algorithm used to estimate the parameters fits both models simultaneously, allowing the parameters (A) to affect the estimate of the parameters (B) and vice versa.
More specifically, the context model extends the input nonlinearity model through the term w^τ ⊗ w^φ ⊗ w^λ. Thus, the effective intensity of a time-frequency element (i.e., the stimulus power s(i, k) in some time bin i and frequency bin k) is modulated by each time-frequency element that precedes it (within a window of times and frequencies) in a way that depends on their separation in time through w^τ, on their separation in frequency through w^φ, and on the level of the modulating time-frequency element through w^λ.
The context effect
Because the emphasis of this section is on the form of the context terms, only fully separated context models are shown in Figure 8. As illustrated in that figure, w^φ is negative across most frequency differences, whereas w^τ and w^λ are generally positive. Thus, the context term w^τ ⊗ w^φ ⊗ w^λ is generally suppressive over most time-frequency differences.
A, A fully separated context model for a rat auditory cortical neuron. Shaded regions indicate one SE. The x-axes of the top row are as in the input nonlinearity model: lag time t, frequency f, and sound level l. Units of w^t are spikes per second; w^f and w^l have no units as they have been normalized. The bottom row shows the context components with the following axes: time difference between context and target tones τ, frequency difference between the tones φ, and sound level of the context tone λ. Because c2, w^φ, and w^λ have been normalized, the scale of the context terms is contained in w^τ; this scale reflects the fractional change in the effective sound level induced by a tone in the context (after weighting by the other contextual terms). The bottom right displays the predictive powers of the STRF model, the fully separated input nonlinearity model, and the context model shown. The white and black bars show the predictive power on training data and cross-validation data, respectively. B, A context model for a mouse auditory cortical neuron. The context effect follows a faster time course than the example in A. C, A context model for a rat auditory cortical neuron. In this case, the context could have a facilitatory effect on the target time-frequency element. Such behavior was relatively rare in our population. IN, Input nonlinearity model; ctxt, context model.
The w^τ parameters illustrated show their greatest suppression for lags of 20–100 ms, depending on the neuron. The w^φ values show greatest suppression when the modulating tone is at the same frequency as its target. This suppression falls off with greater frequency separation, vanishing at a separation of about half an octave (or more in some cells; data not shown). The sound level representation of the context, w^λ, is generally monotonic, and sometimes different from w^l.
Figure 8C demonstrates the context model for one neuron for which context tones just before the target tone can have an excitatory effect (w^τ is negative at τ = 20 ms; with w^φ generally negative, the product implies excitation). Figure 8A, which displays the context model parameters for a different neuron, shows two slightly excitatory humps at the edges of w^φ. This structure of w^φ was also observed for other neurons, especially in a few cases when the range of φ was extended (data not shown).
The context model and STRF models
Suppressive stimulus effects can be captured by linear models through negative regions in their temporal or spectrotemporal receptive fields. The context terms of the context model provide an additional way to create suppressive effects. We wondered whether the suppressive regions seen in STRFs might be better accounted for by the contextual component of the context model than by negative weights in w^{tf}. To measure this, we fit multilinear models with context terms (that is, the fully separated context model) and without context terms (the fully separated input nonlinearity model) to the data, and examined the relative amount of suppression, as reflected by w^t, for each neuron in the population. Figure 9 shows the relative suppression min(w^t)/(max(w^t) − min(w^t)) for both models. There is a marked decrease in the relative amount of observed suppression when the contextual terms are present, and sometimes the suppression disappears entirely. This suggests that the suppressive regions seen in STRFs are, at least in part, attributable to contextual effects, rather than to additive suppression.
Depth of the trough of w^t for the fully separated input nonlinearity and context models. Values on the axes are min(w^t)/(max(w^t) − min(w^t)) for the input nonlinearity model and for the context model.
Figure 10 illustrates a mechanism that may be responsible for this effect. The PSTH of a hypothetical neuron, modeled by a context model (Fig. 10A) responding to a stimulus similar to the DRC used in the experiments, was fit with both an STRF model (Fig. 10B, top) and a context model (A, dashed line; B, bottom). The original context model contained no additive suppression, as there is no trough in w^t, but it did contain multiplicative stimulus interactions (through w^τ and w^φ). The fitted STRF model, however, contains a clearly visible suppressive region following the excitatory peak. This again shows that the suppressive regions in STRFs fit to experimental data may reflect multiplicative stimulus interactions (forward suppression, two-tone interactions) rather than genuine additive suppression.
Fitting a linear STRF model to a hypothetical nonlinear neuron described by a context model. A, The original context model (black) and the context model fit to the simulated PSTH (dashed). The fitted model matches the original very well. The fitting was done without regularization; the simulated PSTH was taken to be the noiseless output of the original context model, thresholded to avoid negative firing rates. The parameters w^l and w^λ were set to be linear and, for brevity, were not fit. B, Top, An STRF model fit to the simulated PSTH. Bottom, The w^{tf} receptive field of the hypothetical neuron, given by w^t ⊗ w^f. The additive part w^t ⊗ w^f of the original context model does not show additive suppression, whereas the STRF does show a suppressive region. This suppressive region in the STRF is therefore purely attributable to the contextual effects of w^τ ⊗ w^φ.
Simulation of STRF features changing with spectral density
STRFs fit to data collected using different stimulus classes are generally not the same (Theunissen et al., 2000; Rotman et al., 2001; Bar-Yosef et al., 2002; Valentine and Eggermont, 2004; Woolley et al., 2006). We used the context model to help understand certain systematic stimulus-dependent changes in STRFs related to the spectral density of a sound. The dependence of the STRF on the spectral density of a DRC can be summarized as follows (Blake and Merzenich, 2002): a higher spectral density leads, on average, to a lower excitatory peak in the STRF, and in some cells, the suppressive region only appears at higher spectral densities. Could these experimental results be explained by the context model? We used the context model of Figure 8B (fit on real data at a single spectral density) with five stimuli of varying spectral density, to generate five simulated PSTHs. Then we fit STRF models for these five stimuli and simulated PSTHs. This process is summarized in Figure 11A.
A, Summary of how STRFs were fit to the outputs of a context model. The context model was trained on experimental data that was gathered using a stimulus at a single spectral density, then run on stimuli of various spectral densities to generate simulated PSTHs. B, STRF fits to simulated PSTHs from context models that were run on stimuli of increasing spectral density. The peak excitation decreases monotonically, and so does bandwidth. Both of these observations were also made experimentally. C, The peak excitation and the magnitude of the maximum suppression of the STRFs shown in B. The peak excitation decreases monotonically whereas the suppression stays roughly constant, in accordance with experimental observations (Blake and Merzenich, 2002).
The five STRFs show phenomena similar to those observed experimentally, as shown in Figure 11B. First, there is spectral narrowing at higher spectral densities. Second, the excitatory peak becomes lower at higher spectral densities, whereas the suppressive depth stays roughly constant, as summarized in Figure 11C. This graph is qualitatively the same (a monotonic decrease in peak value, and a roughly constant suppression) as the experimental observations of Blake and Merzenich (2002). Thus, the context model is a candidate for explaining the changes in STRFs at various stimulus densities.
In addition to the behavior shown in Figure 11, there are context models (both fit to real data and constructed by hand) that can be used to generate data for which STRFs would show other forms of dependence on the spectral density of the stimulus. For example, the STRFs may show no suppressive region at low spectral density, and a deepening suppressive region at higher spectral density, which is similar to the experimentally observed appearance of suppressive regions at higher spectral densities. There are also context models for which STRF fits would change very little with spectral density. Cells with similar response properties have also been reported experimentally.
Discussion
We have constructed a class of models that describe the transformation between a sound and the firing rate of a neuron in the auditory cortex. These models were used here to analyze responses to DRC stimuli, but are applicable to the analysis of responses to arbitrary sounds. The models make possible novel analyses of auditory cortical responses to complex sounds, and also unify measurement of many different response properties that were previously analyzed separately, such as the rate-level function, the STRF, and contextual effects.
What is gained by this unification? Why should we not, for instance, measure the rate-level function and the STRF separately instead of estimating the input nonlinearity model? First, the stimulus conditions under which rate-level functions are typically measured (tones in silence) differ from those used for STRF estimation (continuous complex sounds); the state of the auditory system is thus likely to differ between the two cases. The present approach allows us to estimate sensitivities to different stimulus dimensions simultaneously, from a single (complex) stimulus, thus obtaining a more coherent picture of the function of the auditory system in a single context. Second, the terms in the models depend on one another. For example, a complex sound contains energy at different sound levels. An STRF fit to such data carries with it an implicit model of sound level sensitivity; this model is typically the identity function, so that entries of the STRF are weighted in proportion to sound level (in, say, decibels SPL). To the extent that this assumption is inappropriate, the STRF estimate will be distorted. Estimating neuronal sound level sensitivity directly, using the input nonlinearity model, produces models that describe the system more accurately, as demonstrated by the improvement in predictive power; indeed, the weights wtf in the input nonlinearity model sometimes differ from those in the STRF. Similarly, for time-level inseparable cells, fitting a time-level receptive field would be impossible without a model of frequency integration, poor with a suboptimal model of it, and not general if frequency integration were made irrelevant by using a fixed spectral content in the experiment. A third and final example of the benefits of a unified model, demonstrated in Results, is that the context terms of the context model have important effects on the inhibitory region of the spectrotemporal receptive field, something that would not be evident if an STRF and a two-tone receptive field were measured separately.
The input nonlinearity
Finding a single input nonlinearity appropriate for all neurons can raise the average predictive power of an STRF model (Gill et al., 2006). However, our data revealed a variety of wl shapes across neurons. Thus, estimating the input nonlinearity for each neuron adds neuron-specific detail and provides a better understanding of how populations of neurons respond to level in complex sounds.
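As a schematic contrast (a paraphrase in the notation used here, with s(t, f) denoting the sound level of the stimulus at time t and frequency f; binning details and constant terms are omitted), the STRF and input nonlinearity models differ only in how level enters the response:

\[
\hat{r}(t) \;=\; \sum_{\tau,f} w_{tf}(\tau,f)\, s(t-\tau,f)
\qquad \text{(STRF: level enters through the identity)}
\]
\[
\hat{r}(t) \;=\; \sum_{\tau,f} w_{tf}(\tau,f)\, w_l\!\left(s(t-\tau,f)\right)
\qquad \text{(input nonlinearity: } w_l \text{ estimated from data)}
\]

In the first case a fixed identity mapping of level is built in; in the second, wl is free to take the varied shapes observed across our population.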
Separability of receptive fields
Using the input nonlinearity framework, we found frequency-level and time-level inseparabilities during processing of complex sounds in a significant proportion of our neuronal population. Attention has previously focused on time-frequency inseparabilities (Kowalski et al., 1996a; Depireux et al., 2001; Miller et al., 2002; Linden et al., 2003; Simon et al., 2007). The present models open up the possibility of exploring the properties of the other two types of inseparable receptive field.
Time-frequency inseparable receptive fields may underlie frequency sweep direction selectivity (deCharms et al., 1998). Inseparabilities in the other pairs of variables may reflect other functions. For example, cells with time-level inseparable receptive fields may respond to sounds with increasing or decreasing levels, which occur, for instance, in vocalizations (particularly speech). Without multilinear models, such effects are very hard to study in the context of complex sounds.
Stimulus context
Including local stimulus context allows us to predict auditory cortical responses more accurately than before. The context model should also be applicable to precortical auditory structures, where contextual influences of the sort described here first arise.
The context terms reveal forward suppression in auditory cortical responses to complex sounds, lasting ∼200 ms, with a frequency width of about half an octave in most cells. Some neurons also show forward excitation, visible as positive humps at the edges of the frequency-difference term of the context model. These results are consistent with those of previous physiological studies using two-tone paradigms in A1 (Calford and Semple, 1995; Brosch and Schreiner, 1997), and with previous psychophysical studies (Moore, 1980, 1997; Jesteadt et al., 1982). The dominance of forward suppression over forward excitation is consistent with recent subthreshold studies (Wehr and Zador, 2005). Again, the current models reveal the impact of such effects on responses to complex sounds. Longer time scales of adaptation also exist in A1 (McKenna et al., 1989; Ulanovsky et al., 2003; Bartlett and Wang, 2005), but were not studied here.
We also find that the context model has implications for the interpretation of previous STRF studies. Most notably, (1) the suppressive region of wt is reduced when contextual effects are modeled explicitly (Fig. 9), and (2) the context model can account qualitatively for previous observations of stimulus-density-dependent STRFs, such as spectral narrowing and the changing ratio between the maximum and minimum values of the STRF (Fig. 11). These two observations are linked, because the average stimulus power in the context of a time-frequency element increases with spectral density. Thus, as the spectral density of the stimulus increases, the impact of the contextual subunit on the PSTH increases, and therefore so does the difference between a fitted STRF and the wt ⊗ wf component of a context model. One way in which the STRF may depart from wt ⊗ wf is through the appearance of a deepening suppressive region (Fig. 10), in line with the first observation above; another is through an idealized version of the systematic changes found by Blake and Merzenich (2002), such as spectral narrowing, in line with the second.
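A back-of-envelope calculation (with illustrative notation, not taken from the model definitions) makes the link explicit: if each time-frequency bin of a DRC contains a tone with probability p, which grows with spectral density, and the context window spans K bins with mean per-tone energy \(\bar{e}\), then the expected contextual energy is

\[
\mathbb{E}[\text{context energy}] \;=\; p\,K\,\bar{e},
\]

so the contextual subunit's average drive grows in proportion to density.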
Interestingly, Blake and Merzenich found that STRF models with a more pronounced suppressive region were on average less predictive than STRF models in which no suppressive region was observed. If, as our results here suggest, suppressive regions in the STRF are often signatures of significant contextual effects, then this observation may simply reflect the greater nonlinearity of cells that show them. An explicitly nonlinear approach (such as the context model) may well not show such an effect. (Alternatively, deeper suppressive regions may be associated with lower firing rates, leading to noisier STRFs and lower predictive power.) Thus, we suggest that the previously reported changes in STRF structure with stimulus spectral density arise from linear approximation of nonlinear cortical response functions (Christianson et al., 2008) that are better described by the context model.
The models
The input nonlinearity model is a form of Hammerstein cascade (Narendra and Gallman, 1966; Hunter and Korenberg, 1986); however, its development in the multilinear setting, and related advances in estimation, are more recent (Ahrens et al., 2008). We are unaware of previous discussions of the alternative groupings of stimulus dimensions facilitated by the multilinear view. The higher-order nonlinear models (e.g., the context model) are novel. Multilinear regression models have not received extensive attention in the statistics literature (but see Paatero, 1999). The fully inseparable model r̂ = wtfl · M is a generalized additive model (Breiman and Friedman, 1985; Hastie and Tibshirani, 1999), but its parameters are too numerous for it to be usable (Aertsen and Johannesma, 1980). Other multilinear methods have proven useful in a variety of settings in statistics and machine learning (Harshman and Lundy, 1994; Tenenbaum and Freeman, 2000; Vasilescu and Terzopoulos, 2005).
Previously proposed nonlinear models for auditory processing (Fishbach et al., 2001) tend to be designed to mimic the known physiology of the auditory system and often do not allow for straightforward parameter estimation. The statistical ease of estimating multilinear models directly from neuronal data makes them more similar to models based on Volterra-Wiener expansions (Marmarelis and Marmarelis, 1978). However, prohibitive amounts of data are typically needed to estimate the large number of parameters of the latter class of models, so that in general they can only be fit by restricting their parameter space or focusing on a single stimulus dimension (Young and Calhoun, 2005). The data requirements of other nonlinear approaches (Brenner et al., 2000) also make them difficult to apply to auditory cortical data.
Other forms of the multilinear model, perhaps similar to the context model, should allow for the study of within-stimulus interactions and stimulus-specific adaptation in a variety of sensory systems (Ahrens et al., 2006). In the present case, multilinear methods provide a highly efficient description of auditory cortical response functions; note that the fully separated context model, although better able to predict neural responses, has fewer parameters than an inseparable linear STRF model. Multilinear models provide a firm statistical foundation for analyzing nonlinear neuronal response properties; allow response parameters to be learned directly from the data rather than being hand-tuned; and reveal a rich variety of input nonlinearities, inseparabilities, and context effects in auditory cortical responses to complex sounds, thereby introducing new possibilities for data analysis and extending existing understanding of auditory processing of complex sounds.
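The parameter-count comparison is easy to illustrate with rough numbers. The sketch below uses hypothetical grid sizes and assumes the fully separated context model comprises one-dimensional weight vectors for the principal subunit and for the context subunit; the exact components in any given fit may differ.

```python
# Illustrative parameter counts (all sizes hypothetical).
n_lag, n_freq, n_level = 15, 33, 25   # assumed time-lag, frequency, and level grids
n_ctx_lag, n_ctx_freq = 20, 13        # assumed context-window extents

inseparable_strf = n_lag * n_freq     # one weight per (lag, frequency) pair: 495
fully_separated_context = (
    n_lag + n_freq + n_level            # principal weights w_t, w_f, w_l
    + n_ctx_lag + n_ctx_freq + n_level  # one-dimensional context components
)                                     # total: 131

print(inseparable_strf, fully_separated_context)
```

Even with generous context windows, the sum of the one-dimensional components stays far below the product of grid sizes that an inseparable field requires.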
Footnotes
- This work was supported by the National Institutes of Health and the Gatsby Charitable Foundation. M.B.A. and M.S. developed the models; J.F.L. collected the experimental data; all three authors contributed to the writing of this manuscript. J.F.L. thanks Michael M. Merzenich and Christoph E. Schreiner for support at the University of California at San Francisco, and Robert C. Liu for assistance with data collection.
- Correspondence should be addressed to Maneesh Sahani, Gatsby Computational Neuroscience Unit, 17 Queen Square, London WC1N 3AR, UK. maneesh@gatsby.ucl.ac.uk