Studies of neuronal activity throughout the brain predominantly rely on extracellular recordings of spiking activity. These records often contain action potentials from more than one neuron, so it is imperative to discriminate the spikes that originate from separate neurons. The process of discrimination is based on differences in each neuron's extracellular waveform that originate from cell-specific biophysical properties. We briefly review the electronics of extracellular recording, the detection of spike events, and the clustering of similar spike waveforms as a means to sort spikes according to their putative underlying sources. When spike sorting succeeds, it transforms a fundamental weakness of extracellular recording, namely the inability to isolate single neurons, into one of its greatest strengths, the simultaneous measurement from multiple cells.
Special emphasis is placed on visualization schemes for spike data as well as on a set of metrics to estimate the number of incorrectly categorized spikes. False-negative contributions to a cluster lead to a suppression of inferred spike rates, while false-positive contributions lead to a distortion in the inferred receptive field for the cell. Both errors reduce the estimated information carried by the cell. A matrix of values for these metrics allows readers to assess claims, e.g., the size and reliability of multiple peaks in a receptive field, relative to the level of contamination of the data.
Extracellular action potential waveforms appear roughly biphasic in the vicinity of a cell soma and dendrite, as the capacitive current flow across the cell membrane dominates and the voltage approximates the derivative of the intracellular waveform. The amplitude of the extracellular signals is small compared with that of the intracellular potential, on the order of hundreds of microvolts, but still significant in terms of signal-to-noise ratio (Henze et al., 2000) (Fig. 1A). The resistive nature of the extracellular space ensures that currents summate without preferential attenuation of their high-frequency components (Logothetis et al., 2007), so that the signals from different cells may be recorded with a single or multiwire electrode at a fixed location (Buzsáki, 2004).
The most important factors for successful spike sorting are the choice of appropriate microelectrodes, the precision of the placement of the electrodes, and quality of the acquisition electronics. When spiking activity is synchronous between neighboring cells, there is considerable overlap of the spike waveforms, and sorting becomes very difficult. In this case, it appears best to use a fine individual electrode to isolate a single cell. These electrodes typically have 1- to 5-μm-diameter tips and an impedance whose magnitude is typically 1–20 MΩ measured at a frequency of 1 kHz. Multiwire electrodes, such as stereotrodes (McNaughton et al., 1983) and tetrodes (Gray et al., 1995), are appropriate for simultaneous recording from two or more units when neuronal activity is largely asynchronous (Ainsworth and O'Keefe, 1977; Krüger and Bach, 1981; Reitboeck, 1983; Venkatachalam et al., 1999; Fee and Leonardo, 2001; Yamamoto and Wilson, 2008; Battaglia et al., 2009). These electrodes are made of 12–25 μm microwire with exposed ends that have impedances of ∼100 kΩ. While the extracellular waveform of two neurons may appear similar on a single microwire, they are unlikely to be similar on multiple electrodes at different albeit nearby locations. Last, the impedance of electrodes exhibits equal reactive and resistive components (Humphrey and Schmidt, 1991); the magnitude of these components decreases as a power law with increasing frequency (Blum et al., 1991) and the measured noise decreases with increasing frequency, in a manner consistent with thermal Johnson noise (Fig. 1B).
The electrical signal recorded by the electrode must be conditioned for acquisition (Fig. 1A). In general, the signal is impedance buffered as close to the electrode tip as possible to minimize contamination by noise from environmental sources. Spectral filtering is performed to isolate frequency components relevant for spike waveforms. The signal from the amplifier is high-pass filtered near 300 Hz solely for spike recording or at frequencies down to 0.1 Hz when field potential data are simultaneously acquired. The signal is typically low-pass filtered at 10 kHz, which is where the amplifier noise and the spectral power of the spike are comparable (Fig. 1C). The signal is oversampled, and the entire digitized data stream is stored for analysis. Acquisition schemes where only putative spike events are stored do not allow for the characterization of background noise or for different spike detection criteria to be applied post hoc. Before offline sorting, the digitized signals are additionally bandpass filtered to specifically isolate those frequencies present in the extracellular waveform that exceed the noise (Fig. 1C).
Algorithmic spike sorting
We summarize the key issues of spike sorting and refer the reader to the work by Lewicki (1998) for a thorough review. The first step in automated spike sorting is the extraction of spike waveforms from extracellular data, and the second is the clustering of these waveforms into groups that represent the activity of single neurons. The simplest method for detecting a spike is to select a threshold value of voltage for the extracellular signal on each microwire (Fig. 2A). An epoch of data, typically 1.0–2.0 ms in extent, is extracted and time-stamped whenever a signal crosses threshold (Fig. 2B). This approach considers the voltage only at a particular time point, whereas extracellular waveforms have characteristic shapes that extend over many data samples. If the extracellular waveform is known a priori, the matched filter provides an optimal way of linearly filtering the recording to use the full shape of the waveform in detection (Fig. 2A). As the waveform is not typically known a priori, a nonlinear filter can increase the signal-to-noise ratio by emphasizing voltage deflections that are both large in amplitude and high in frequency content (Kim and Kim, 2000) (Fig. 2A).
The choice of threshold for detection of a putative spike waveform is a trade-off between false-positive and false-negative errors. On the one hand, it is favorable to err on the side of a low threshold since noise events that are erroneously included as spikes can be eliminated when waveforms are clustered. On the other hand, each detected event is typically followed by a censored period that momentarily prevents another spike from being detected. This censor period is included to prevent a single spike waveform from triggering multiple events. If a spike detection threshold is set too low, noise events may cause a high fraction of the data to be censored and true events will go undetected. The number of expected noise events for a particular threshold can be calculated from the background noise and used to set detection criteria (Fig. 2C).
The set of waveforms that cross threshold must be aligned to facilitate their comparison. Waveforms are typically aligned on either their peak or center-of-mass, as either of these points is less susceptible to noise than the threshold crossing. Further, the true peak of the waveform may occur in between samples, so splines (de Boor, 2001) can be used to interpolate the waveform about its true peak. With multiwire electrodes, the alignment can be performed on the data channel with the largest event, with the resulting alignment point applied to the other channels.
A range of different approaches are used to separate the set of waveforms into groups that are based on similarity of their shape and possible other factors, such as the state of the animal and the statistics of neuronal spiking (Abeles and Goldstein, 1977; Lewicki, 1994; Fee et al., 1996b; Nguyen et al., 2003; Quiroga et al., 2004; Delescluse and Pouzat, 2006). Since the clustering of waveforms into common sources involves statistical inference, a given cluster of waveforms is referred to as a unit rather than a neuron.
Multiwire electrode signals are sorted as one composite waveform whose dimensionality is the number of samples times the number of channels. For example, stereotrode voltages sampled at 30 kHz for a 1.5 ms window have a dimension of 90. As it is difficult to visualize data in greater than even three dimensions, sorting relies on automated algorithms to cluster data. A useful parametric method is to assume that the data consist of multiple Gaussian distributions, where each distribution corresponds to the spikes of a single neuron (Pouzat et al., 2002). While extracellular noise typically follows a Gaussian distribution, non-Gaussian variability in waveforms can occur during bursts of action potentials or as a result of electrode drift (Fee et al., 1996a; Harris et al., 2000; Shoham et al., 2003; Delescluse and Pouzat, 2006). It is therefore advisable to use nonparametric clustering algorithms, which typically involve hierarchical aggregation, where waveforms that are the most similar are progressively grouped until some criteria are reached (Fee et al., 1996b).
A final issue concerns on-line sorting, such as by matched filtering with members of a library of spike waveforms (Thakur et al., 2007; Calabrese and Paninski, 2011). On-line sorting is unavoidable for certain engineering applications, such as neuroprosthetic devices (Tillery and Taylor, 2004), and may provide useful feedback on the placement of electrodes in the course of experimentation. However, on-line sorting is at best approximate and cannot supplant post hoc sorting based on the entire collection of spike waveforms and the statistics of the underlying noise (Fee et al., 1996a,b).
We consider a series of visual tests on the output of automated spike sorting routines that address whether a single cluster of waveforms is self-consistent with a single neuron.
Inspect the waveforms
The clustered waveforms should be inspected for a nonphysiological shape or obvious contamination by multiple neurons. This can be done by overlaying all clustered waveforms together, which has the advantage of clearly indicating any outlier events, or with a heat map of time-voltage values (Fig. 2D), which has the advantage of indicating the variability of the waveform.
Specialized visualization tools may be required for atypical cases. For example, a systematic change in the shape of the recorded waveforms may occur during epochs of synchronous spiking and may be resolved only by the use of electrodes with a smaller exposed tip.
Inspect for stationarity
The temporal stability of the waveforms in each cluster can be determined by plotting both the instantaneous spike rate and the amplitude of the waveforms as a function of time (Fig. 2E). The electrodes may have physically drifted or the state of the neuron may have changed if either of these values systematically varies over the course of an experiment.
Inspect for refractory violations
A histogram of the interspike interval (ISI) for the spike times of each waveform in a cluster is expected to show a refractory period, i.e., a dearth of spikes that occur within milliseconds of each other (Fig. 2F). An alternate graph is a histogram of the autocorrelation of the spike times, which preferentially highlights patterns of spikes. Note that the value of the ISI asymptotes to zero at large time lags, while that of the autocorrelation asymptotes to the mean firing rate.
Verification of waveform threshold
Independent of what metric was used to detect spikes, e.g., the peak voltage, a histogram of the value of that metric for each waveform can be inspected for a sharp cutoff at the threshold level (Fig. 2G). In general, the vast majority of waveforms in a cluster must lie well above the threshold for detection.
The residuals of a cluster, i.e., the standard deviation at each sample of the waveform, can indicate whether multiple neurons contributed to a cluster (Pouzat et al., 2002) (Fig. 2D). Ideally, the variability of a cluster arises only from background noise and matches the statistics of background noise, which appears to be stationary over the duration of the waveform (Fee et al., 1996a). The presence of strong temporal structure in the background noise may indicate contamination from other neurons.
We now turn to comparisons that involve pairs of potential single-unit clusters as a means to identify cross-contamination (Fig. 3A,B).
Projection onto Fisher's linear discriminant
The optimal one-dimensional projection to separate two known Gaussian distributions, given by Fisher's linear discriminant (Fisher, 1936), may be calculated from the mean and covariance matrix of two clusters, even if the data are not strictly Gaussian. As noted by Pouzat et al. (2002), the projection of both clusters onto the discriminant provides a rapid means to visualize potential overlap in their distribution of waveforms (Fig. 3C). A caveat is that an apparent overlap in one dimension may hide a separation in higher dimensions.
The correlation between the spike times of two clusters reveals whether the two clusters are independent (Fig. 3D). A correlation function that appears constant indicates two autonomous spike trains. However, temporal structure in the cross-correlation can arise from many sources, including functional connectivity, receptive field similarity, or cross-contamination of clusters.
The final step of spike sorting is the removal of outlier data points from each cluster. An outlier is a waveform that does not resemble any of the mean waveforms. A principled way to remove outliers is to use the χ2 probability distribution. For a random variable v drawn from a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ̄, the distribution of (v − μ)TΣ̄−1 (v − μ) follows a χ2 distribution. One could remove from a cluster all waveforms that have a probability of <1/N, where N is the number of events in the cluster. This process should be performed last so that the mean waveform is defined for each unit represented in a dataset.
Manual inspection provides a qualitative means to evaluate the success of spike sorting. Yet an estimate of false-positive and false-negative errors should be reported as a means to evaluate a given level of modulation of a unit's firing rate relative to the contamination of the cluster. We propose that the following four estimates of false-positive and false-negative errors be included in publications that make use of spike sorting, independent of the algorithm used to sort.
False positives based on refractory period violations, denoted f1p
We assume that the majority of spikes in a cluster arise from a single neuron. We consider these the true events and recall that the absolute refractory period of a neuron implies that the ISI distribution should drop to zero below a minimum time. Thus, ISIs that are shorter than the absolute refractory period represent contamination by a rogue unit. One can compute the fractional level of contamination, f1p, with the assumption that refractory period violations arise from a point process that is uncorrelated with the true spikes of the cluster (Meunier et al., 2003).
We denote the number of spike events in a cluster as N, the width of the refractory period as τR, the width chosen for the censored period as τC, and the duration of the experiment as T. In terms of these definitions, the number of true spikes in the cluster is N(1 − f1p) and the number of rogue spikes is Nf1p. The total duration during which refractory period violations could occur around true spikes is 2(τR − τC)N(1 − f1p); the factor of two arises since refractory period violations occur whether a rogue spike appears immediately before or after a true spike. The expected number of refractory period violations, r, is given by the total duration of refractory periods around true spikes multiplied by the rate of rogue spikes, or r = 2 (τR − τC) N2(1 − f1p)f1p/T, which can be solved for f1p. For example, under the typical conditions of mean firing rate, N/T = 10 Hz, τR = 3 ms, τC = 1 ms, and T = 1000 s, an observation of r = 20 refractory period violations yields an estimate of f1p = 0.05. These values, particularly the refractory period, are typical for measurements from neocortical pyramidal cells in awake rat (Curtis and Kleinfeld, 2009) and have to be adjusted for other cell types and experimental conditions.
False negatives from the threshold for detection, denoted f1n
The histogram of the values of the spike detection metric (Fig. 2G) may be used to estimate the number of spikes whose waveforms were below the threshold for detection. For the case of a voltage threshold, as used here, the histogram may follow a Gaussian distribution. However, as the tail of the data is missing, the estimate of the mean and variance of this distribution may require a special fitting procedure. Once the parameters are fit, the percentage of missing spikes is the value of the cumulative Gaussian distribution up to the detection threshold.
False positives and false negatives from the overlap between pairs of clusters of sorted spike waveforms, denoted f2p and f2n, respectively
We estimate the overlap between a pair of clusters in terms of fits with multivariate Gaussian waveforms to each cluster. Let V represent the voltage waveform of an extracellular spike that is assigned to either cluster 1, with mean waveform μ1, or cluster 2, with mean waveform μ2. We use a model in which the variability of the waveforms in either cluster are characterized by Gaussian noise with a covariance matrix Σ̄1 for units assigned to cluster 1 or Σ̄2 for those assigned to cluster 2; the dimensionality of the Σ̄ values is the number of samples in the composite waveform. All of these parameters are estimated from the data with the expectation-maximization (EM) algorithm (gmdistribution.fit in MATLAB) to a mixture model of two multivariate Gaussians; as a technical aside, one must calculate pseudoinverses of the Σ̄ values as they are not typically full rank. The relative number of spikes in each cluster is given by the prior distributions, i.e., P(C = 1) and P(C = 2), where C labels the cluster, also found by the EM algorithm, where P(C = 1) + P(C = 2) = 1.
The probability that a cluster generates a particular waveform, v, is thus P(V= v|C = c) = , where c indexes the cluster. The inverse problem of finding that a particular waveform belongs to cluster c, denoted P(C = c|V = v), is found using Bayes' rule, i.e., P(C = c|V = v) = P(V = v|C = c)P(C = c)/P(V = v), where P(V = v) = Σi = 1,2 P(V = v|C = i)P(C = i). To calculate the fraction of false positives in cluster 1, we calculate the probability that each waveform assigned to cluster 1 was generated by cluster 2, i.e., false positives assigned to cluster 1, as fp(1;2) = (1/N1)Σv ∈ cluster 1 P(C = 2|V = v), where N1 is the number of spikes in cluster 1. The expressions for the remaining fractions of false negatives and positives are fn(1;2) = (1/N1)Σv ∈ cluster 2 P(C = 1|V = v), fp(2;1) = (1/N2)Σv ∈ cluster 2 P(C = 1|V = v), and fn(2;1) = (1/N2) Σv ∈ cluster 1 P(C = 2|V = v), where N2 is the number of spikes in cluster 2. This pairwise analysis can be generalized for n clusters for which the (n − 1) fractions of false positives for a given cluster, k, sum to form f2p=Σi≠kn fp(k; i). A similar summation of the (n − 1) fractions of false-negative events yields f2n for a given cluster.
False negatives from censored events, denoted f3n
All detected events outside of a cluster create a brief period of time that potentially censors the detection of true spikes for that cluster. One can estimate the total time that the dataset was censored by multiplying the duration of the censored period, denoted τC, by the total number of detected events minus the number of events in the cluster of interest, i.e., Mk = Σi≠kall clusters Ni, where Ni denotes the number of spikes in the ith cluster. The fraction of time that was censored contributes to the estimate of false-negative errors, i.e., f3n = Mk τC/T.
The summary statistics from the above analysis form two matrices, one for the fraction of false-positive events and the other for the fraction of false-negative events. The ISI violations should include false positives from the fractional overlaps of a cluster with all other clusters, i.e., we expect f1p ≥ f2p. As a hedge against the possibility that ISI violations were underestimated, the composite fraction of false-positive events is taken as the larger of the false-positive estimates based on ISI violations or cluster overlaps (Table 1). The fraction of undetected spikes and the fraction of censored spikes are independent, so these fractions are combined as a product of their complements (Table 2). However, these detection errors are mutually exclusive of false-negative errors that result from the overlap of clusters, so these fractions are added to form a composite fraction of false-negative events (Table 2).
Extracellular recording is the oldest and most common method of recording electrical activity across populations of neurons in awake behaving animals, from invertebrates to human primates. Yet simple criteria for acceptable data, particularly with regard to claims of single unit responses, are largely missing. Such criteria are critical since interpretations of spike trains that are based on inadequately sorted units can lead to erroneous claims on neural coding. For the case of false-positive errors, contamination can result in sorted units that exhibit false multi-modal responses (Fig. 4A). This occurs when spike trains from two differently tuned cells are combined into a single spike train. For the case of false-negative errors, the true firing rate is underestimated (Fig. 4B) and thus the signal-to-noise ratio is decreased for statistics that are calculated with the spike train. In addition, false-negative errors may include temporal structure, such as when spikes are hidden by stimulation artifacts or are misassigned because of a reduction in amplitude and an increase in width for the trailing spikes of a burst.
We suggest that investigators include a matrix of criteria that define the levels of errors for their units (Tables 1, 2) (see http://physics.ucsd.edu/neurophysics/links.html for accompanying MATLAB code), or at the minimum report a matrix of upper bounds for these errors across all units. Further, any analysis of neuronal modulation must be made in light of the estimated contamination of the unit; e.g., a feature based on a 20% modulation of the firing rate is not significant for that unit if the measured count contains a 20% false-positive error rate (Fig. 4A). Following Joshua et al. (2007), we note that while calculation of errors involves statistical assumptions about the dataset, even if these assumptions do not hold true, it is more useful to report estimated false-negative and false-positive events for a dataset rather than declare that “only well-isolated units were included in our study.”
Our analysis is practical for even particularly large datasets, such as those that originate from arrays of electrodes (Blum et al., 1991; Churchland et al., 2007). Continuous storage of signals from a large number of channels may be streamed to disk in real time and is compatible with the bandwidth of radio telemetry systems (Tillery and Taylor, 2004); e.g., recordings from 100 stereotrodes leads to a data rate of only 10 Mbytes/s. For well separated electrodes, or sets of stereotrodes or tetrodes, one can cluster the waveform data in parallel. The computational methods we discuss may be scaled up and the results may be subject to rapid visual inspection (Figs. 2, 3). For situations with hundreds to thousands of electrodes in which continuous data storage is currently impractical, we suggest that the threshold for detection be biased low and a time series that is at least twice the period of the spike waveform plus censored period around each event be saved. Overlapping epochs may need to be saved when the spike is high. This procedure allows the threshold to be raised post hoc to minimize the number of censored events (Fig. 2C).
Supplemental material for this article is available at http://physics.ucsd.edu/neurophysics/links.html. This material has not been peer reviewed.
Editor's Note: Toolboxes are intended to describe and evaluate methods that are becoming widely relevant to the neuroscience community or to provide a critical analysis of established techniques. For more information, see http://www.jneurosci.org/misc/ifa_minireviews.dtl.
This work was supported by the National Institutes of Health (Grant NS051177 to D.K., Grant FNS054393A to D.N.H.) and the US-Israeli Binational Science Foundation (Grant 2007121). We thank the two reviewers for insightful comments, Ehud Ahissar, Elad Assa, John Curtis, Michale Fee, Kenneth Harris, Stephen Lisberger, and Jeffrey Moore for useful discussions, Gyorgy Buzsáki and Darrel Henze for supplying data for Figure 1A, and E. J. Chichilnisky, Vicky Lam, Jill Leutgeb, and Emily Mankin for supplying test data for analysis during the development of our sorting criteria. The material in this article has been presented as part of the NIH-sponsored school on “Neuroinformatics” held at the Marine Biological Laboratories. We apologize to all engineers and scientists whose work could not be cited as a consequence of space constraints.
- Correspondence should be addressed to David Kleinfeld, University of California 0374, 9500 Gilman Drive, La Jolla, CA 92093-0374.