Abstract
It remains unclear how single neurons in the human brain represent whole-object visual stimuli. While recordings in both human and nonhuman primates have shown distributed representations of objects (many neurons encoding multiple objects), recordings of single neurons in the human medial temporal lobe, taken as subjects discriminated objects during multiple presentations, have shown gnostic representations (single neurons encoding one object). Because some studies suggest that repeated viewing may enhance neural selectivity for objects, we had human subjects discriminate objects in a single, more naturalistic viewing session. We found that, across 432 well isolated neurons recorded in the hippocampus and amygdala, the average fraction of objects encoded was 26%. We also found that more neurons encoded several objects versus only one object in the hippocampus (28 vs 18%, p < 0.001) and in the amygdala (30 vs 19%, p < 0.001). Thus, during realistic viewing experiences, typical neurons in the human medial temporal lobe code for a considerable range of objects, across multiple semantic categories.
Introduction
A key question in neuroscience is whether neuronal representation is distributed across populations of neurons or more localized to stimulus-selective neurons (Bowers, 2009). Distributed coding (i.e., individual neurons encoding multiple stimuli; Thorpe, 1995) may offer many advantages, such as high coding capacity, resistance to noise, and generalization to similar stimuli (Rolls and Treves, 1998; Rolls and Deco, 2002). In contrast, localized (or gnostic) coding (i.e., individual neurons encoding singular stimuli unequivocally; Barlow, 1972; Thorpe, 1989) may provide metabolic efficiency and a simple relation between single neuron activity and different instantiations of objects (Hummel, 2000; Lennie, 2003). While the question of distributed object representation has been thoroughly studied in nonhuman primates (Baylis et al., 1985; Rolls and Tovee, 1995; Baddeley et al., 1997; Treves et al., 1999; Franco et al., 2007), there is less consensus among the few human studies that have been performed. Due to clinical constraints, these studies have been largely confined to recordings from medial temporal lobe (MTL) brain areas (i.e., hippocampus and amygdala).
Several studies, wherein images were shown to epilepsy patients, have revealed neural selectivity for individual objects, consistent with distributed representation (Kawasaki et al., 2005; Rutishauser et al., 2006; Viskontas et al., 2006; Steinmetz et al., 2011). Because these experiments did not show multiple views of each object, they are limited to explaining the neural representation of the specific exemplars shown. To date, only one series of studies has explicitly tested object encoding by single neurons in the human MTL (Quiroga et al., 2005; Quian Quiroga et al., 2009). In these experiments, subjects viewed several different views of each object, with multiple presentations of each view, yielding results suggesting that individual neurons in the MTL are strongly selective for a small number of individual or related objects.
Given the contrast between these results, an obvious question remains: What is the neural representation of objects when multiple views are shown in one viewing session with limited presentations? This distinction is critical because it is well established that the MTL is involved in both recognition and recollection of prior experiences (Scoville and Milner, 1957; Squire et al., 2004). Many imaging studies reveal hemodynamic changes in the MTL with greater memory strength specifically after stimulus repetition (Law et al., 2005; Daselaar et al., 2006; for review, see Gonsalves et al., 2005; Yassa and Stark, 2008); some have even theorized that repetition contributes to increased representational sparsity (Desimone, 1996; Wiggs and Martin, 1998).
Thus to better understand single-neuron responses to objects as people might encounter them naturally, and as a first step in understanding the effects of initial and multiple viewing, we had human epilepsy patients discriminate objects in a single session of an oft-used visual discrimination task (Kreiman et al., 2000), where each view of an object was presented only six times. In contrast to findings with higher numbers of presentations, our results reveal that object representation in the human MTL during initial viewing is notably distributed, with the activity of many recorded neurons predicting the presence of multiple, unrelated objects.
Materials and Methods
Subjects.
We recorded single-neuron activity from 21 patients at the Barrow Neurological Institute (14 female, 18 right-handed, ages 20–56, mean age = 40). All patients had drug-resistant epilepsy and were evaluated for possible resection of an epileptogenic focus. Each patient granted his/her consent to participate in the experiments per a protocol approved by the Institutional Review Board of Saint Joseph's Hospital and Medical Center. Data were recorded from clinically mandated brain areas, including the hippocampus, amygdala, anterior cingulate cortex (ACC), and ventral prefrontal cortex (vPFC).
Microwire bundles and implantation.
The extracellular action potentials corresponding to single-neuron activity and local field potentials were recorded from the tips of 38 μm diameter platinum-iridium microwires implanted along with the depth electrodes used to record clinical field potentials (Dymond et al., 1972; Fried et al., 1999). Each bundle of nine identical microwires was manufactured in the laboratory and implanted using previously described techniques (Thorp and Steinmetz, 2009; Wixted et al., 2014, their supplement) and typically had an impedance of 450 kΩ at 1000 Hz. Each anatomical recording site received one bundle of nine microwires. Given the eight sites typically implanted per patient, this resulted in 72 microwires implanted in each patient. Electrodes were placed through a skull bolt with a custom frame to align the depth electrode along the chosen trajectory. The error in tip placement using this technique is estimated to be ±2 mm (Mehta et al., 2005). Note that this resolution is insufficient to determine subfields within the hippocampus or nuclei within the amygdala.
Amplification and digitization of microwire signals.
After the patient recovered from surgery (typically within 6 h), the microwire bundles were connected to the headstage amplifiers and the amplification and digitization system as previously described (Steinmetz et al., 2013; Wixted et al., 2014, their supplement). The complete recording system has a 4.1 μV RMS noise floor that permits recovery of single-neuron activity signals on the order of 20 μV (Thorp and Steinmetz, 2009).
Filtering and event detection.
Spike sorting was performed using methods previously described (Valdez et al., 2013). In brief, possible action potentials (events) were detected by filtering with a bandpass filter, 300–3000 Hz, followed by a two-sided threshold detector (threshold = 2.8 times each channel's SD) to identify event times. The signal was then high-pass filtered (100 Hz, single-pole Butterworth) to capture waveform shape, with the event time aligned at the ninth of 32 samples (Viskontas et al., 2006; Thorp and Steinmetz, 2009). All events captured from a particular channel were then separated into groups of similar waveform shape (clusters) using the open-source clustering program KlustaKwik (klustakwik.sourceforge.net), a modified version of the Celeux–Govaert expectation maximization algorithm (Celeux and Govaert, 1992, 1995). The first principal component of all event shapes recorded from a channel was the waveform feature used for sorting. After sorting, each cluster was graded as noise, multi-unit activity (MUA), or single-unit activity (SUA) using the criteria described previously (Valdez et al., 2013, their Table 2). Figure 1 shows an example of a cluster representing SUA. We recorded from a total of 3239 neurons (SUA and MUA) during these experiments (Table 1). The average number of clusters per channel of recording was 0.55 for SUA and 1.3 for MUA. This is larger than the 0.4 SUA per channel recently reported by Misra et al. (2014). Based on our prior work (Thorp and Steinmetz, 2009), we expect such differences are due to the different noise characteristics of the clinical recording environment, though they could also be due to different spike-sorting techniques, as Misra et al. (2014) used a manual spike-sorting process. In our experience, the technique reported in Valdez et al. (2013) produces results comparable to prior reports from other laboratories (Viskontas et al., 2006) in terms of recorded waveform shapes, interspike intervals, and firing rates (Wild et al., 2012, regarding variability in spike sorting depending on the particular waveform shapes being detected). While it is important to note that these and other reports of human single-unit recordings (Kreiman et al., 2000; Steinmetz, 2009) do not achieve the quality of unit separation achievable in animal recordings (Hill et al., 2011), they nonetheless represent neural activity at a much finer spatial and temporal scale than otherwise achievable.
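To make the detection steps concrete, the sketch below illustrates the filtering and thresholding procedure described above in Python (scipy). It is a hedged reimplementation, not the code used for these analyses; the sampling rate and all variable names are assumptions.

```python
# Illustrative sketch of the event-detection steps described above; the
# sampling rate and names are assumptions, not values from the study.
import numpy as np
from scipy import signal

FS = 28000        # assumed sampling rate (Hz); not specified in the text
SNIP_LEN = 32     # samples captured per event
ALIGN_IDX = 8     # event time aligned at the ninth of 32 samples

def detect_events(raw, fs=FS, thresh_sd=2.8):
    """Bandpass 300-3000 Hz, then two-sided threshold at 2.8 x channel SD."""
    b, a = signal.butter(2, [300, 3000], btype="bandpass", fs=fs)
    bp = signal.filtfilt(b, a, raw)
    thresh = thresh_sd * np.std(bp)
    crossings = np.flatnonzero(np.abs(bp) > thresh)
    # keep only the first sample of each contiguous threshold crossing
    return crossings[np.insert(np.diff(crossings) > 1, 0, True)]

def capture_waveforms(raw, event_times, fs=FS):
    """High-pass at 100 Hz (single-pole Butterworth); cut 32-sample snippets."""
    b, a = signal.butter(1, 100, btype="highpass", fs=fs)
    hp = signal.lfilter(b, a, raw)
    starts = event_times - ALIGN_IDX
    return np.array([hp[s:s + SNIP_LEN] for s in starts
                     if s >= 0 and s + SNIP_LEN <= len(hp)])
```

In this sketch, the first principal component of the captured snippets would then serve as the feature passed to KlustaKwik for clustering, as described above.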
Figure 1. Events in a cluster identified as SUA after sorting. Channel recorded from the left amygdala. a, Average waveform shape of events in the cluster. y-axis: waveform shape with dashed lines indicating ±1 SD at each sample point. b, Distribution of interspike intervals (ISIs) for two duration scales. y-axis: probability of interval; x-axis: duration of interval shown on two scales, blue for the broader range 0–0.5 s on bottom and black for the narrower range 0–0.035 s on top. c, Power spectral density of event times. y-axis: power spectral density in events2/Hz; x-axis: frequency in Hz, with magenta lines indicating the primary and harmonics of the power line frequency (60 Hz).
Table 1. Number of recorded neurons by brain area
The results reported here exclude neurons from areas other than the hippocampus and amygdala and from recording sessions where the subject did not complete sufficient trials to test for object-selective neural responses. We focus this report on 432 clusters of SUA. Results for primary effects of object selectivity on MUA in the hippocampus and amygdala were in all cases similar and statistically significant, though with a smaller fraction of neurons showing significant effects. This is consistent with MUA comprising the same activity reported as SUA mixed with additional noise.
Experimental stimuli and task.
Subjects viewed images of 11 objects from each of three categories (animals, landmarks, and people) during each experimental session. The objects were chosen to match those used by Quiroga et al. (2005), with the exception of several images of laboratory personnel for the people category. None of the images were personally significant to the subjects, as in Viskontas et al. (2006). All images were chosen to have approximately similar illuminance and contrast to reduce potential confounds of these factors (Steinmetz et al., 2011). We showed the images in random order, with each appearing individually in the center of a computer screen (subtending ∼9.5° of visual angle) for 1 s. During each session, we showed four representations (three color pictures and the printed name) of each object six times each, for a total of 792 trials per experimental session. Subjects who participated in more than one session did so on different days with different stimulus sets, and any neurons recorded on the same channels during different sessions were regarded as independent. We downloaded the pictures from the World Wide Web (Fig. 2), and the printed words were in English, in 30 point Helvetica font. A total of 33 objects were depicted in each of two stimulus sets. The task was to press a button on a trackball (Kensington Expert Mouse) when an image (or name) represented a person, and a different button when an image (or name) represented a landmark or animal. Button assignment was randomized across experiments.
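For reference, the trial count follows directly from this design (33 objects × 4 representations × 6 presentations per session); the brief sketch below is purely illustrative and simply verifies the arithmetic and mimics the randomized ordering.

```python
# Purely illustrative check of the trial structure described above.
import itertools
import random

objects = range(33)                              # 11 objects x 3 categories
representations = ["picture_1", "picture_2", "picture_3", "name"]

trials = list(itertools.product(objects, representations)) * 6   # 6 presentations of each view
random.shuffle(trials)                           # images shown in random order

assert len(trials) == 792                        # trials per experimental session
```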
Figure 2. Sample images from our stimulus sets. Subjects viewed three color images of each object. Objects were clearly visible in each image, with variation in object position and background: bird and lizard (a), Gateway Arch and Eiffel Tower (b), Bill Clinton and David Letterman (c). Readers interested in requesting images in our sets should contact the corresponding author.
Data analysis.
We initially analyzed the influence of several factors—object identity, object luminance, and object contrast—on neuronal responses. We included the latter two factors to account for their recently demonstrated effect on neuronal responses in the human hippocampus and amygdala (Steinmetz et al., 2011). We constructed a set of nested generalized linear models for each neuron (Maindonald and Braun, 2003, their Chap. 8), with these factors as independent variables and the firing rate of a single neuron (in a temporal window between 200 and 1000 ms after stimulus onset) as the dependent variable. More precisely, model 1 contained only a constant term; model 2, constant + luminance terms; model 3, constant + luminance + contrast + luminance × contrast interaction terms; and model 4, constant + luminance + contrast + luminance × contrast interaction + object identity terms. There were 10 indicator variables for object identity in the model for a single experiment. We computed the improvement of fit for each successive model using the χ2 statistic (Maindonald and Braun, 2003). The comparison of model 4 to model 3 thus identifies neurons whose responses distinguish among the different objects presented, after accounting for differences in image luminance and contrast.
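A minimal sketch of this model comparison is given below. It assumes Poisson-distributed spike counts (the text does not state the error family used) and hypothetical column names, and it illustrates the comparison of model 4 to model 3 through the drop in deviance evaluated against the χ2 distribution.

```python
# Sketch of the nested GLM comparison (model 4 vs model 3); not the original code.
# 'trials' is assumed to hold one row per trial with columns: count (spikes in the
# 200-1000 ms window), lum, con, and a categorical object label.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

def object_identity_test(trials: pd.DataFrame):
    fam = sm.families.Poisson()                # assumed error family
    m3 = smf.glm("count ~ lum * con", data=trials, family=fam).fit()
    m4 = smf.glm("count ~ lum * con + C(object)", data=trials, family=fam).fit()
    chi2 = m3.deviance - m4.deviance           # improvement of fit from adding object terms
    df = m3.df_resid - m4.df_resid
    p = stats.chi2.sf(chi2, df)
    return chi2, df, p
```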
While prior studies have often restricted analysis of the effects of independent factors, such as object identity, to neurons with responses that differ from background firing, we do not do so, because this form of preselection can lead to erroneous conclusions (Steinmetz and Thorp, 2013). To study changes from background firing, we used two techniques: multinomial logistic regression and a bootstrapped test for changes from background firing.
Multinomial logistic regression predicted the presence of objects from our stimulus sets, based on the firing rates of neurons compared with their background firing rates. In logistic models, the "input" comprises a set of theoretically meaningful predictors, while the "output" is a predicted grouping (Maindonald and Braun, 2003, their Chap. 8; Dobson and Barnett, 2008, their Chap. 7), in our case extended to multiple categorical outcomes (Hosmer and Lemeshow, 2006). More specifically, the ratio of firing during presentation of images of an object relative to background firing is used to predict the odds that each object may have been presented, by determining the closest-fitting coefficient in the logistic function. A coefficient of zero implies no change in odds due to changes in neuronal firing, whereas coefficient values other than zero signal different odds of one object being present versus no object being present (i.e., background neuronal activity). Statistically reliable changes in coefficient values from zero were determined using multivariate t tests (Hosmer and Lemeshow, 2006, their Chap. 2), one for each neuron. This approach is similar to a simplified version of the point-process framework proposed by Truccolo et al. (2005). The α-level was 0.05 in all t tests for nonzero coefficients.
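The sketch below illustrates a per-neuron model of this form under an assumed data layout: the outcome is which object (if any, with 0 standing for background) was shown on a trial, and the predictor is the trial firing rate relative to background. It is a simplified stand-in for the analysis, not the original code.

```python
# Illustrative per-neuron multinomial logistic regression; data layout is assumed.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def fit_object_prediction(rate_ratio: np.ndarray, object_id: np.ndarray):
    """rate_ratio: firing rate / background rate on each trial.
    object_id: integer label per trial, with 0 denoting background (no object)."""
    X = sm.add_constant(pd.DataFrame({"rate_ratio": rate_ratio}))
    fit = sm.MNLogit(object_id, X).fit(disp=False)
    # fit.params holds one rate_ratio coefficient per object (vs the background
    # reference); fit.pvalues tests each coefficient against zero (alpha = 0.05).
    return fit
```

Objects whose rate-ratio coefficient reliably differs from zero are those whose presence the neuron's firing predicts.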
As an independent test of whether neuronal firing is different from background firing, we used the bootstrapped changes from background test (CBT) described in detail previously (Steinmetz and Thorp, 2013). In brief, this test determines whether the observed responses, grouped by object, were likely to have arisen from the observed background firing. Together, our analyses first determined whether responses of the neurons distinguished between the different objects presented (linear models), then determined whether particular objects could be predicted based on neural firing relative to background firing (multinomial logistic regression), and finally, as an additional check, tested whether firing in response to different objects differed from background (CBT).
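As an illustration of this style of test only (a simplified resampling comparison, not the published CBT procedure), the sketch below asks whether the mean response to one object is unlikely under counts resampled from background firing.

```python
# Simplified bootstrap comparison against background firing; this is an
# illustration only, not the CBT of Steinmetz and Thorp (2013).
import numpy as np

def bootstrap_background_test(responses, background, n_boot=10000, seed=None):
    """responses: spike counts for one object; background: prestimulus counts."""
    rng = np.random.default_rng(seed)
    responses = np.asarray(responses)
    background = np.asarray(background)
    observed = responses.mean() - background.mean()
    boot = np.array([
        rng.choice(background, size=responses.size, replace=True).mean()
        - background.mean()
        for _ in range(n_boot)
    ])
    # two-sided p-value: fraction of resampled differences at least as extreme
    p = np.mean(np.abs(boot) >= np.abs(observed))
    return observed, p
```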
Results
The linear models applied to response counts showed that object identity reliably affected firing rates (p < 0.05) in many MTL neurons: 17% in the left hippocampus (LH), 25% in the right hippocampus (RH), 24% in the left amygdala (LA), and 23% in the right amygdala (RA). Table 2 shows the proportions of object-selective neurons. Neither luminance nor contrast reliably influenced firing rates (p > 0.05).
Table 2. Summary of results from the linear models applied to response counts
The multinomial logistic regression models were used to determine whether the firing rates of single neurons could predict the presence of particular objects. These models revealed that, in many object-selective neurons across MTL areas and sides, firing rates predicted the presence of more than one object (Table 3). Bilaterally, more neurons encoded several objects versus only one object in the hippocampus (28 vs 18%, p < 0.001) and in the amygdala (30 vs 19%, p < 0.001). For neurons coding for two or more objects in this analysis, 49% coded for objects drawn from two or three categories. Figures 3 and 4 illustrate the distributed response of an object-selective neuron in the right hippocampus.
Table 3. Summary of results from the regression models used to predict the presence of objects
Figure 3. Raster plots of responses of a neuron in the right hippocampus to the presentation of eight objects. Each line shows the responses on one trial, where an image of the object was presented at time 0. x-axis: time in seconds relative to stimulus onset; y-axis: index of object presentation in random order of appearance in the experiment. Objects in the top row are those for which multinomial logistic regression, based on the firing rate relative to background firing, permits prediction of the object at a level above chance (p < 0.05). Objects in the bottom row could not be so predicted.
Figure 4. Modified box plot of the distribution of responses to all objects shown, for the same right hippocampal neuron shown in Figure 3. For each object, the solid dot shows the median response to presentation of an image of the object. Vertical lines extend from median ± 1.58 × IQR/√n (where IQR is the interquartile range, n is the number of observations, and the factor 1.58 provides the equivalent of a 95% confidence interval for differences between medians; Chambers et al., 1983, their p. 62) to the data point furthest from the median, which is no more than 1.5 × IQR beyond the first or third quartiles. Open circles show responses outside that range. The solid gray line shows the mean of background firing; dashed gray lines lie at mean ± 1.58 × IQR/√n̄ of the background firing (where n̄ is the mean number of presentations of an object), representing a 95% confidence interval for the median of background firing. Thus median values above this line show strong responses relative to background firing for that specific object.
On average, the proportion of objects whose presence was predicted by single-neuron firing rates was 22% in the LH, 27% in the RH, 24% in the LA, and 23% in the RA. That is, the tuning of these neurons was much broader and the sparsity of their responses much lower (Rolls, 2007) than previously reported, where only ∼1–3% of stimuli elicited statistically reliable single-neuron responses (Quiroga et al., 2005; Mormann et al., 2008), though at a stricter level of statistical significance. Note that we refer to lifetime sparsity, or the proportion of stimuli that evoke statistically reliable neuronal responses (Bowers, 2009). Our results thus suggest that many human MTL neurons become active during the initial session of a visual discrimination task and that many respond to a range of stimuli from different semantic categories. A post hoc power analysis further showed that our sample size of 432 well isolated MTL neurons provided 95% power to detect a change of at least 5% in the fraction of objects that these neurons encoded (Erdfelder et al., 1996), or a difference of approximately two objects, indicating that our models were sensitive to small differences in the fraction of objects encoded.
While the multinomial logistic regression models show that differences of neural firing from background reliably predict the presence of different objects, we also separately tested whether these responses differ reliably from background firing, to confirm these results. Such differences can be difficult to observe when visually comparing responses to presentation of a single object to the immediately preceding background activity, so we used the recently described changes from background test (Steinmetz and Thorp, 2013), which provides a single test for each neuron.
Table 4 summarizes the number of neurons in each brain area that had responses differing reliably from background (p < 0.05), as well as the number of those neurons that also had a reliable response to different objects (Table 2). As shown in that table, a substantial proportion of neurons in both the hippocampus and amygdala (20–30%) responded to the presentation of objects at a rate that differed reliably from background, and approximately three-quarters of those also had responses that differed depending on the particular objects shown.
Table 4. Summary of results of the CBT
Last, to assess whether other neurons or brain areas could decode the observed neuronal responses, we calculated the amount of information that neuronal firing rates provided about the presented objects. This is the mutual information between firing rate and the presented object, expressed mathematically as I(X; Y) = H(X) − H(X|Y), where X is the object presented, Y is the firing rate, H(X) is the entropy of X, and H(X|Y) is the conditional entropy of X given Y (Cover and Thomas, 2006, their Chap. 2). In MTL neurons classified as object selective, firing rates contained on average 0.16 bits of information distinguishing the presented object, lower than but generally congruent with estimates in nonhuman primates (e.g., 0.30 bits in monkey hippocampal neurons; Abbott et al., 1996). Table 5 shows the mutual information values for the MTL (values in parentheses indicate 95% confidence intervals). Note that distinguishing the 33 objects in our stimulus set would require log2(33) ≈ 5.04 bits, indicating that the firing rates of ∼30 average MTL neurons, if their firing is independent of one another, could provide full information about the presence of any given object.
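This quantity can be estimated from a joint histogram of object identity and binned firing rate, as in the sketch below; the binning choice is illustrative rather than the one used in the original analysis, and no bias correction is applied.

```python
# Plug-in estimate of I(X;Y) = H(X) - H(X|Y) from a joint histogram; the binning
# and the absence of bias correction are simplifications for illustration.
import numpy as np

def mutual_information_bits(objects, rates, n_rate_bins=8):
    """objects: integer object label per trial; rates: firing rate per trial."""
    objects = np.asarray(objects)
    rates = np.asarray(rates)
    edges = np.quantile(rates, np.linspace(0, 1, n_rate_bins + 1)[1:-1])
    rate_bins = np.digitize(rates, edges)        # bin index per trial, 0..n_rate_bins-1
    labels = np.unique(objects)
    joint = np.zeros((len(labels), n_rate_bins))
    for i, obj in enumerate(labels):
        for b in range(n_rate_bins):
            joint[i, b] = np.sum((objects == obj) & (rate_bins == b))
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

# With 33 equiprobable objects, H(X) = log2(33) ≈ 5.04 bits, the figure quoted above.
```

For independent neurons, information adds, which is the basis of the ∼30-neuron estimate above (5.04 / 0.16 ≈ 31).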
Table 5. Average mutual information (encoded per neuron) by brain area
Discussion
The present observations cast new light on the debate regarding distributed versus localized representation of objects by single neurons in the human brain. In the first session of a visual discrimination task with only six presentations of each image, one that emphasizes initial object encoding, many neurons recorded in the MTL encoded the identities of multiple objects. We also discovered a smaller proportion of neurons that encoded only one object, supporting the existence of distinct encoding populations—one distributed and the other gnostic—and confirming the plausibility of long-held notions about dual representation schemes (Konorski, 1967, his p. 200). Our results, however, differ from prior models that sought to explicitly decode object identity from a few invariant MTL neurons (Quiroga et al., 2008). In such models, the number of neuronal spikes within certain time intervals (300–600, 300–1000, and 300–2000 ms) per trial was used as input to a decoding algorithm that predicted object identity, given the spike distributions of excluded trials. Compared with our statistical models—which predicted the identities of specific stimuli among many encoded by a single neuron—the leave-one-out decoding algorithm of Quiroga et al. (2007) would potentially confuse multiple objects whenever neurons encoded more than one stimulus, an ambiguity in prediction that we acknowledge.
While the single-unit activity reported here is not as well isolated as that achieved in animal recordings (Hill et al., 2011), these techniques provide the highest spatial resolution yet achievable in the conscious human brain. Even if the SUA reported here is contaminated by noise, it is difficult to see how such contamination could create the appearance of a distributed code across a larger fraction of neurons; one would expect noise to decrease the number of neurons with apparent responses.
We note that the gnostic neurons described here are distinct from previously described neurons with invariant representations (Quiroga et al., 2005; Quian Quiroga et al., 2009). Although gnostic neurons encoded information about single objects in our linear models, these neurons failed to meet the previously applied criteria for invariant responses that far exceed baseline to a single object (Quiroga et al., 2005). Applying those criteria to both single- and multi-unit activity, as combined in prior reports (Quiroga et al., 2005; Quian Quiroga et al., 2009), we found one cluster of multi-unit activity (from a grand total of 1573, or 0.06%) that met the criteria for invariant single-neuron representations, whereas these same prior reports found 5% of neurons with invariant representations (Quiroga et al., 2005; Quian Quiroga et al., 2009). What is the cause of this 78-fold difference in the frequency of single-neuron invariant representations?
One possibility is that a greater number of separate objects were shown in the screening sessions of prior work, 71–114 (mean = 93.9; Quiroga et al., 2005), compared with the 33 objects shown in our single-session design. Assuming, however, that the objects for which recorded neurons are selective would be drawn randomly from a set of possible objects with which the subject is familiar, this hypothesis accounts for only a factor of 93.9/33 ≈ 3. An additional factor of 26 remains.
While there are several possible technical and experimental factors that could explain this remaining difference (among them a higher percentage of faces in prior experiments [Mormann et al., 2011] or differences in the fraction of principal cells recorded at different medical centers [Ison et al., 2011]), one intriguing hypothesis is that it may reflect differences in how many times objects were recently viewed. The present study involved recently unseen views of objects (i.e., initial encoding during six presentations of each image), whereas prior human experiments included higher presentation counts over several sessions (Quiroga et al., 2005; Quian Quiroga et al., 2009; at least 12 presentations—6 in screening and 6 in testing—and often up to 50 presentations when the same stimuli were used in other experiments in the same subjects) that may have produced more sustained neuronal selectivity for frequently viewed images (i.e., visual learning; Logothetis et al., 1995; Freedman et al., 2006). In similar fashion, the observed image selectivity in MTL neurons likely also reflects contributions from recognition memory, as each new incarnation of a previously seen "concept" (e.g., different photographs of the same person) will likely evoke both familiarity, which has been documented in previous single-unit recording studies (Rutishauser et al., 2006; Jutras and Buffalo, 2010), and episodic memory (Wixted et al., 2014). Thus, the divergent results may reflect different stages of episodic representation, of when and where an object was viewed, an assumed primary coding function of the MTL (Squire et al., 2004). Given that repetition has been reported to suppress neural responses in the amygdala (Pedreira et al., 2010), this hypothesis clearly requires further testing in experiments designed to observe changes in neural responses, and the sharpening of neural representation in particular, as objects are presented an increasing number of times.
Finally, our results broadly agree with those of nonhuman primate studies, wherein a substantial body of work suggests that neurons, or groups of neurons, with diverse response profiles form the substrate of object representation (e.g., in monkey temporal cortex; Baylis et al., 1985; Rolls and Tovee, 1995; Baddeley et al., 1997; Treves et al., 1999; Franco et al., 2007). Some researchers have postulated that when such distributed responses reach a sufficient level of independence, there may be an exponential increase in representational capacity (Rolls, 2007), making it possible for a comparatively meager fraction of neurons to encode a large number of diverse stimuli. Although our findings in the hippocampus differ from those of other animal studies, e.g., those in which rat hippocampal place cells show a greater degree of sparsity during spatial encoding (O'Keefe, 1976; Wilson and McNaughton, 1993; Moser et al., 2008), our results overall support an initial object coding scheme in the human MTL, which is broadly selective and which reflects a lesser degree of sparsity.
Footnotes
This research was funded by National Institutes of Health Grant 1R21DC009871-0, the Barrow Neurological Foundation, and the Arizona Biomedical Research Institute (09084092). We thank the patients at the Barrow Neurological Institute who volunteered for these experiments and E. Cabrales for technical assistance. We also thank E. Niebur and S. Macknik for early comments on this manuscript.
The authors declare no competing financial interests.
Correspondence should be addressed to Peter N. Steinmetz, 350 West Thomas Road, Phoenix, AZ 85013. PeterNSteinmetz@steinmetz.org