Abstract
Area TE is required for normal learning of visual categories based on perceptual similarity. To evaluate whether category learning changes neural activity in area TE, we trained two monkeys (both male) implanted with multielectrode arrays to categorize natural images of cats and dogs. Neural activity during a passive viewing task was compared pre- and post-training. After the category training, the accuracy of abstract category decoding improved. Single units became more category selective, the proportion of single units with category selectivity increased, and units sustained their category-specific responses for longer. Visual category learning thus appears to enhance category separability in area TE by driving changes in the stimulus selectivity of individual neurons and by recruiting more units to the active network.
Significance Statement
Neurons in area TE are known to respond selectively to a small number of visual stimuli. Here we demonstrate that the neural activity in area TE is modulated by category learning of natural images (cats and dogs), thus demonstrating that this region is capable of undergoing rapid plastic changes in adult primates.
Introduction
Area TE (TE) is the most rostral brain area of the ventral visual stream (Gross et al., 1972; Mishkin et al., 1983). Several lines of evidence support the hypothesis that TE is a potential substrate for supporting categorization based on visual perceptual similarity: single-unit recording and fMRI studies have revealed evidence for spatially discrete clusters of category-preferring responses (e.g., those that have greater responses to faces, bodies, objects, or color) in TE (Perrett et al., 1982; Desimone et al., 1984; Kobatake et al., 1998; Sugase et al., 1999; Tsao et al., 2003; Afraz et al., 2006; Lafer-Sousa and Conway, 2013; Bao et al., 2020); the strength of the category-selective responses in TE is modulated by task demands (Fuster and Jervey, 1981; Sigala and Logothetis, 2002; Koida and Komatsu, 2007; Emadi and Esteky, 2014); and lesion studies have demonstrated that TE is essential for the accurate categorization of perceptually ambiguous stimuli (Matsumoto et al., 2016; Eldridge et al., 2018), whereas lesions of the cortical region immediately upstream of TE, area TEO, cause only a mild and transient impairment (Setogawa et al., 2021), and lesions of the cortical region immediately downstream of TE, rhinal cortex, produce no discernable impairments in categorization accuracy (Eldridge et al., 2018). Previous studies of TE's role in visual categorization have demonstrated that, after extensive behavioral training, neurons in TE become tuned to diagnostic features of parameterized stimuli (Sigala and Logothetis, 2002; Freedman et al., 2006; Baene et al., 2008; Meyers et al., 2008; Meyer et al., 2014; Sasikumar et al., 2018). However, monkeys—like humans—can rapidly learn new category groupings and subsequently generalize those categories to new exemplars, without the need for extensive training (Minamimoto et al., 2010).
Here, we sought to avoid the potential confound of “overtraining”—i.e., that monkeys may develop a strategy to solve the lab-based task that does not reflect the mechanism(s) via which categorization based on perceptual similarity is performed innately—by examining the neural correlates of rapid visual category learning in a two-interval forced-choice paradigm. We did so by recording from hundreds of single units in TE while monkeys passively viewed unparameterized natural images before and after category learning. We posit that this rapid learning more accurately approximates natural behavior, and hence that the interpretation of the corresponding changes in neural activity is less confounded by task parameters and more ethologically relevant.
Materials and Methods
Experimental subjects/housing/care
Experiments were performed with two Japanese monkeys (Macaca fuscata) that were provided by the NBRP-Nihonzaru, which is part of the National Bio-Resource Project of the Ministry of Education, Culture, Sports, Science and Technology (MEXT, Japan). Monkey R was a 12-year-old male weighing 9 kg, and Monkey X was a 13-year-old male weighing 11 kg. Monkeys were housed in adjoining, individual primate cages that allowed social interaction. The monkeys had access to food daily and earned their liquid during, and additionally after, neural recording experiments on testing days. Monkeys were tested 5 d per week. All surgical and experimental procedures were approved by the Animal Care and Use Committee of the National Institute of Advanced Industrial Science and Technology (Japan) and were implemented in accordance with the Guide for the Care and Use of Laboratory Animals (eighth ed., National Research Council of the National Academies).
Surgery
Each monkey was first implanted with a titanium head holder ∼4 months (Monkey R) or 2 months (Monkey X) prior to electrode implantations. Three microelectrode arrays (Utah arrays, iridium oxide, 96 electrodes, 10 × 10 layout, 400 µm pitch, 1.5 mm depth, Blackrock Microsystems) were surgically implanted in the anterior, middle, and posterior parts of area TE in the left hemisphere for Monkey R, and four arrays, one additionally in area TEO (data not shown), were implanted in the left hemisphere for Monkey X. Surgical procedures were similar to those described previously (Mitz et al., 2017), and the procedures for Monkey X have been previously published (Endo et al., 2021). For both monkeys, a bone flap located over the temporal cortex was temporarily removed from the skull, and a CILUX chamber was placed onto the anterior part of the skull to protect connectors of the arrays.
Experimental design
Behavioral testing
All behavioral tests were carried out in a shielded room. Monkeys were seated in a primate chair and responded with a touch-sensitive bar that was mounted on the chair at the level of the monkey's hands. The display was a 21 inch color CRT monitor (GDM-F520, SONY), and the center of the monitor was placed 56.6 cm in front of the monkey's eyes. The total reward delivered in each session was ∼120 ml of juice for Monkey R and ∼420 ml of water for Monkey X. The monkeys were head-fixed for all behavioral testing, and eye-tracking was performed with an infrared pupil-position monitoring system (iRecHS2, Matsuda; http://staff.aist.go.jp/k.matsuda/iRecHS2/). The visual stimuli were presented using the Matlab (MathWorks) Psychtoolbox (Kleiner et al., 2007) on the Windows operating system (Microsoft). Task control was performed by the REX real-time data acquisition program adapted to the QNX operating system (Hays et al., 1982).
Pre-training
The monkeys were first trained to use the touch bar to receive a reward. Then, a red/green color discrimination task was introduced (Bowman et al., 1996). Each trial began with a bar touch, and 100 ms later a small red target square (0.5 × 0.5 degrees visual angle) was presented at the center of the display. Monkeys were required to continue touching the bar until the color of the target square changed from red to green. Color changes occurred randomly 500–1,500 ms after bar touch. Rewards were delivered if the bar was released between 200 and 1,000 ms after the color change; releases occurring either before or after this epoch were counted as errors.
Category pre-training
The category pre-training paradigm was similar to the basic red/green task described above, except that now the initial “no-go” period became the second option in a cued two-interval forced-choice task. Each trial began when the monkey pressed the bar, and one of two Walsh patterns was presented. After 350–400 ms, the red target square appeared, and after 1–3 s, it turned green. Release in red ended the trial, whereas release in green led to the trial's associated outcome. The trial outcome was either a timeout (4–6 s) or a reward, depending on which Walsh pattern was shown. This outcome pattern led monkeys to release in red for one pattern, to avoid the associated timeout, and to release in green for the other, to collect the reward. We have previously shown that avoiding delays to reward has subjectively similar value to monkeys as reward itself (Minamimoto et al., 2009, 2012). Thus, on each trial, the monkeys had to choose between releasing in the red or green period—a two-interval forced-choice task. All trials were followed by an intertrial interval of 1–1.1 s.
Cat–dog training and transfer test
The experimental paradigm and timings for the cat–dog training task were similar to that described for the category pre-training task, except that the two Walsh patterns were replaced with images of cats and dogs. Further, monkeys were required to fixate on a central fixation point to begin the trial, at which point an image appeared. Then, 350–400 ms later, the red target square appeared, and monkeys were allowed to break fixation (Fig. 1B). Images of cats were associated with the timeout, whereas images of dogs were associated with rewards. Monkeys therefore learned to release in red for cats and release in green for dogs.
Monkeys were tested on this task with the small training image set (20 cats and 20 dogs; Fig. 1D) until they reached a performance target (80% correct). Then a transfer test was performed with a larger, held-out image set (240 cats and 240 dogs; Fig. 1E) to assess their ability to generalize. We note that the 520 total images used during cat–dog training are the same 520 images used in all the passive viewing experiments. Thus, the 480 images in the transfer test were not strictly novel but had never before been categorized by the monkeys.
Passive viewing task
Monkeys were trained to keep their gaze at the center of the CRT monitor while a series of images was presented. Monkeys viewed five images on each trial and received a reward at the end (Fig. 1A). Each trial began with a fixation point (0.4 degrees of visual angle) that served as an invitation for the monkey to look at the center of the screen. When the monkey's gaze entered the fixation window (Monkey R, 6 × 6 degrees; Monkey X, 4 × 4 degrees), five images (12 × 12 degrees) were presented serially, each appearing for 350–400 ms, with 350–400 ms of blank screen between images. The fixation point remained present during the entire trial. If the monkey's gaze exited the fixation window during the trial, the trial was immediately terminated, and the trial with the identical image sequence was repeated. After all five images, a liquid reward was delivered, the monkey was allowed to break fixation, and there was an intertrial interval of 1 s. Five hundred twenty images were used in total, with 260 cats and 260 dogs. Images were presented in randomized blocks such that each image was presented once in each block.
We first recorded 12 d (Monkey R) or 3 d (Monkey X) of baseline passive viewing data with the 40 training images. This was followed by 6 d (Monkey R) or 2 d (Monkey X) of baseline passive viewing data with the full set of 520 images; 3 extra days of baseline data were added for Monkey X ∼1 month later. Separate, unrelated passive viewing experiments were then performed, during which the neural activity recorded by the arrays qualitatively changed. Thus, we recorded one additional pre-training passive viewing session on the day immediately before cat/dog training for both monkeys. After training, we recorded 2 d of passive viewing for Monkey R (because the first day had too low a trial count; only 3–4 presentations per cue, compared with 5–8 for good days—only data from the second day was included in our analyses) and 1 d for Monkey X (20 presentations per cue).
Neural recording
Recording and spike sorting
Neural data, task events, and eye position were recorded using the Cerebus system (Blackrock Microsystems). The extracellular signal was bandpass filtered (250 Hz to 7.5 kHz) and digitized (30 kHz). Units were sorted online before each recording session for the extracellular signal of each electrode using a threshold and time–amplitude windows. The spike times of the units were stored using Cerebus Central Suite (Blackrock Microsystems). Single units for Monkey R were refined offline by hand using principal-component-analysis projection of the spike waveforms in Offline Sorter (Plexon).
Data exclusion
Inspection of the data using standard raster plots revealed Day 4 of Monkey R's full image-set baseline data to be contaminated by noise, so it was excluded from the analyses. Similarly, our first attempt at recording a post-training passive viewing session for Monkey R only had 3–4 presentations per image due to the monkey not being as motivated as usual, so we performed a second session the following day, which was used for analysis.
For all analyses of neural data during passive viewing, we used data exclusively from completed trials, that is, trials without fixation errors.
Neurons with <20 spikes in a given session were removed from the analysis. Only 10 neurons were excluded across all sessions between both monkeys for this reason. Otherwise, all units recorded from TE were used in a given analysis, except where noted (Fig. 4E), or where units were subsetted to match total unit number across days.
Statistical analysis
Behavioral analyses
To compute the fraction of trials correct by session sextile, sessions were split into six equal blocks of trials, including incomplete trials (i.e., including fixation errors and precue bar releases), and the fraction of correct trials was computed within each block. Reaction times for correct cat trials (Fig. 1F,iv) were calculated as the time between when the red cue appeared and when the bar was released.
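This blocking is straightforward; a minimal Python sketch (the original analyses were performed in MATLAB):

```python
import numpy as np

def fraction_correct_by_sextile(outcomes):
    """Split one session's trial outcomes (1 = correct, 0 = error; incomplete
    trials are included and scored as errors) into six equal blocks and
    return the fraction correct within each block."""
    blocks = np.array_split(np.asarray(outcomes, dtype=float), 6)
    return [float(b.mean()) for b in blocks]
```

Note that np.array_split also handles sessions whose trial counts are not divisible by six.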
Image analysis
To extract the foreground of images, we took advantage of the fact that they were cropped onto white backgrounds. We used k-means clustering to group the RGB values into four clusters, which consistently reported the background as one of the colors. We subsequently set the pixels belonging to any other cluster to black and then performed an image fill operation to fill holes. The number of foreground pixels was the number of black pixels after this operation. HSV distributions were computed from these foreground pixels. The image-mean luminance was calculated using a standard RGB-to-luminance conversion. Spatial frequency energy was calculated using a 2-D Fourier transform of the image and summing the energies across all orientations at each frequency.
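A simplified Python sketch of the foreground segmentation (a hand-rolled k-means with a deterministic, brightness-based initialization of our own choosing; the hole-filling step is omitted here):

```python
import numpy as np

def kmeans(pixels, k=4, iters=20):
    """Tiny Lloyd's-algorithm k-means on an (N, 3) float array of RGB values.
    Deterministic init: centers spread along the brightness ordering."""
    order = np.argsort(pixels.sum(axis=1))
    centers = pixels[order[np.linspace(0, len(pixels) - 1, k).astype(int)]].copy()
    for _ in range(iters):
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels, centers

def foreground_mask(img, k=4):
    """Cluster RGB values and call the cluster closest to white the background;
    everything else is foreground. (The paper additionally filled holes.)"""
    pix = img.reshape(-1, 3).astype(float)
    labels, centers = kmeans(pix, k)
    bg = np.linalg.norm(centers - 255.0, axis=1).argmin()  # whitest cluster
    return (labels != bg).reshape(img.shape[:2])
```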
Initial neural analysis
To select the most informative analysis window, we took an unbiased approach using Monkey R's data (which was acquired first) and an SVM classifier with an “abstract category” approach (see below; Meyers et al., 2008). The classifier's accuracy was assessed at 25 ms intervals, with window starting points ranging from 50 ms before image onset to 300 ms after image onset. Window sizes varied from 50 to 300 ms in 25 ms increments. Longer windows were more informative without exception, and the accuracy of the decoders using the smallest window peaked at 150 ms. As the latency of visual information arriving in TE is roughly 75 ms, we chose two spike-count windows: 75–175 ms, corresponding roughly to the first wave of activity in TE, and 175–275 ms, a nonoverlapping window with better decoding accuracy. In analyses of Monkey X's data, similar trends with respect to these analysis windows were observed, so we kept the windows the same.
Neural response statistics
To calculate the Fano factor for each single unit, we calculated the variance divided by the mean of all spike counts in the 175–275 ms window after image onset. To calculate the maximum-image-averaged firing rate for each single unit, we first calculated the average response of the unit to each of the 520 images in the 175–275 ms window and then took the maximum of those averages.
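Both statistics are simple to compute; a Python sketch, assuming a 100-ms-wide count window for the rate conversion:

```python
import numpy as np

def fano_factor(spike_counts):
    """Variance of spike counts divided by their mean, across all image
    presentations (counts taken 175-275 ms after image onset)."""
    c = np.asarray(spike_counts, dtype=float)
    return c.var() / c.mean()  # population variance; conventions on ddof vary

def max_image_averaged_rate(spike_counts, image_ids, window_s=0.100):
    """Average a unit's response to each image, then take the maximum of
    those per-image averages, converted to spikes per second."""
    c = np.asarray(spike_counts, dtype=float)
    ids = np.asarray(image_ids)
    per_image = [c[ids == i].mean() for i in np.unique(ids)]
    return max(per_image) / window_s
```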
Repetition suppression analysis
In the passive viewing task, a trial, i.e., a sequence of 5 image presentations + reward, was aborted if the monkey broke fixation at any point during the trial; the trial was subsequently repeated. This could potentially give rise to repetition suppression effects. To look for repetition suppression, we first Z-scored each neuron's spike counts in the window 175–275 ms after image onset, across all image presentations, using only completed trials that were not preceded by a fixation error, i.e., trials on which there was no possibility of image repetition. We then took the spike counts from completed trials that immediately followed a fixation error (only considering images that were, in fact, repeated) and applied the same Z scaling.
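A Python sketch of this scaling, assuming spike counts arranged as presentations × neurons:

```python
import numpy as np

def repetition_z(clean_counts, repeat_counts):
    """Z-score each neuron using the mean/SD of its spike counts on
    repetition-free presentations, then apply that same transform to the
    counts from presentations that followed a fixation error."""
    clean = np.asarray(clean_counts, dtype=float)
    rep = np.asarray(repeat_counts, dtype=float)
    mu, sd = clean.mean(axis=0), clean.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # guard against silent neurons
    return (rep - mu) / sd
```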
Decoding analysis
For the SVM analyses (Fig. 4), we asked how well the neural population response (spike counts) could decode the category (cat or dog) of the presented image. N neurons’ single-presentation spike counts (predictors) provided a response vector for each of the P single image presentations (observations), so the full data matrix available to the SVMs had dimension P × N. These data were subsetted in the following ways: in order to more accurately assess category information in the population, we used “abstract category decoding” (Meyers et al., 2008), which prevents overfitting to particular images by using different images for training and test data. Images were divided into five subsets, and the presentations corresponding to each subset were used once as a held-out test set. Further, the total number of presentations per category was matched across sessions, in order to balance the total amount of training data available to each classifier. Using the MATLAB function “fitclinear” with lasso regularization, we discovered that the “sparsa” solver delivered better abstract category decoding than the “sgd” solver and that the decoding accuracy plateaued when using the top ∼100 category-coding single units (see below) and began to decrease thereafter. Thus, in order to maximize our estimates of decoding accuracy, we limited our decoding to the top 100 category-coding single units. To rank units by their category coding, we ran a 1-D SVM decoder for each unit (i.e., P × 1 inputs to the decoder) and ranked the units based on the resulting cross-validated accuracies. Population decoding was repeated using different 100-ms-wide time bins (Fig. 4A); or within individual arrays (Fig. 4B); or across baseline sessions (Fig. 4C,D); or adding the top 100 units one at a time (Fig. 4E).
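The key idea—cross-validation folds defined over images rather than presentations—can be sketched as follows. In this Python illustration, a nearest-class-centroid linear readout stands in for the lasso-regularized SVM, purely to keep the example self-contained:

```python
import numpy as np

def abstract_category_decode(X, image_ids, categories, n_folds=5, seed=0):
    """'Abstract category' decoding: folds are defined over IMAGES, not
    presentations, so the decoder must generalize to images it has never
    seen rather than memorizing them. X is (presentations x units);
    a nearest-class-centroid readout stands in for the paper's lasso SVM."""
    X = np.asarray(X, dtype=float)
    image_ids = np.asarray(image_ids)
    categories = np.asarray(categories)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(np.unique(image_ids)), n_folds)
    cats = np.unique(categories)
    accs = []
    for held_out in folds:
        test = np.isin(image_ids, held_out)          # presentations of held-out images
        cents = np.stack([X[~test & (categories == c)].mean(axis=0) for c in cats])
        d = ((X[test][:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        pred = cats[d.argmin(axis=1)]
        accs.append(float((pred == categories[test]).mean()))
    return float(np.mean(accs))
```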
To assess statistical significance in the time-course analyses (Fig. 4A,B), we used a t statistic clustering method, described in detail in Wittig et al. (2018). Briefly, adjacent timepoints with the same-signed t statistic (i.e., pre > post, vs post > pre) are grouped into a cluster; the t statistics are summed and ranked; and these are compared, rank-wise, to the corresponding summed and ranked values pooled across all shuffle runs. In this way, the p value for each cluster (i.e., for each run of significant decoding difference) measures the probability of observing an equal or more extreme cluster anywhere in the shuffled analyses.
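A minimal Python sketch of the cluster statistic, following our reading of the rank-wise comparison described above:

```python
import numpy as np

def cluster_masses(t_stats):
    """Group adjacent timepoints whose t statistics share a sign; the 'mass'
    of each cluster is the absolute sum of its t values, sorted descending."""
    t = np.asarray(t_stats, dtype=float)
    masses, cur = [], t[0]
    for prev, v in zip(t[:-1], t[1:]):
        if np.sign(v) != np.sign(prev):
            masses.append(abs(cur))
            cur = 0.0
        cur += v
    masses.append(abs(cur))
    return sorted(masses, reverse=True)

def cluster_p_values(real_t, shuffled_t):
    """Compare each real cluster mass, rank-wise, against the distribution
    of same-rank cluster masses pooled across shuffle runs."""
    real = cluster_masses(real_t)
    shuf = np.array([(cluster_masses(s) + [0.0] * len(real))[:len(real)]
                     for s in shuffled_t])
    return [float((shuf[:, r] >= m).mean()) for r, m in enumerate(real)]
```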
To fit the sigmoidal curves in Figure 4E, we used the MATLAB function “fittype” with the sigmoidal equation “a1 / (1 + exp(−b1 * (x − c1))) + d”. The first three points (i.e., x = 1, 2, 3) were excluded to achieve a stable fit for the sigmoidal curve (i.e., positive amplitudes, half-max x-values inside the domain of the data).
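A Python sketch of the fit, with scipy.optimize.curve_fit standing in for MATLAB's fittype; the starting-value heuristics are our own:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, a1, b1, c1, d):
    # a1 / (1 + exp(-b1 * (x - c1))) + d, matching the paper's fittype string
    return a1 / (1.0 + np.exp(-b1 * (x - c1))) + d

def fit_accuracy_curve(n_units, accuracy):
    """Fit accuracy vs. number of units, excluding x = 1, 2, 3 as in the text."""
    x = np.asarray(n_units, dtype=float)
    y = np.asarray(accuracy, dtype=float)
    keep = x > 3
    x, y = x[keep], y[keep]
    half = x[np.argmin(np.abs(y - (y.min() + y.max()) / 2))]  # rough half-max x
    p0 = (y.max() - y.min(), 0.1, half, y.min())
    params, _ = curve_fit(sigmoid, x, y, p0=p0, maxfev=10000)
    return params  # (a1, b1, c1, d)
```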
For the time-swap analysis (Fig. 7A), decoders were trained as above but then evaluated on spike count data from all other bins.
Mahalanobis distance
To calculate the Mahalanobis distance (MD) between neural representations of the images (Fig. 6D–F), we used the MATLAB function “pdist” to calculate the pairwise distances between all single-trial neural responses (Fig. 6D–F). Units were randomly subselected from the full set in order to match the number of units (i.e., the dimensionality of the response space) across days. Due to the sparse nature of responses in TE (i.e., low overall firing rates), and in order to avoid poorly conditioned covariance matrices, it was necessary to use a spike count window of 175–350 ms for all MD analyses. Bootstrapping was used in order to compare the set of within-category MDs to the set of between-category MDs between the pre- and post-training days. Specifically, for each bootstrap iteration, units were randomly resubsampled, all MDs were calculated, individual trials (and their corresponding pairwise distances) were subsampled from the results, and finally the relevant statistics were calculated on the resulting sets of MDs. D-prime was calculated using the MATLAB File Exchange function computeCohen_d.
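The pairwise distance computation can be sketched in Python (an analogue of MATLAB's pdist with the 'mahalanobis' option, using the covariance of the full response set):

```python
import numpy as np

def pairwise_mahalanobis(X):
    """Pairwise Mahalanobis distances between rows of X (trials x units),
    using the inverse covariance of the whole set. With sparse TE responses,
    the count window must be wide enough for this covariance to be
    well conditioned (hence the 175-350 ms window in the text)."""
    X = np.asarray(X, dtype=float)
    VI = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diff, VI, diff)
    return np.sqrt(np.maximum(d2, 0.0))
```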
Single-unit category analyses
For the GLM analysis (Fig. 5), for each neuron, we asked if that neuron fired at significantly different rates for cats and dogs. Spike counts for each single image presentation were regressed against image category with a generalized linear model, spike count ∼ image category (Poisson family, log link). The total number of presentations per category was again matched across sessions. Units were counted as having a cat/dog difference if they had a significant category coefficient in the model (p < 0.05).
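Because the model has a single binary regressor, the Poisson fit has a closed form, which the following Python sketch exploits (a Wald test; the test statistic used by the original MATLAB fit may differ):

```python
import numpy as np
from scipy.stats import norm

def category_glm(counts, is_dog):
    """Poisson GLM of spike counts on image category. With one binary
    predictor, the MLE coefficient is log(mean_dog / mean_cat) and its Wald
    standard error is sqrt(1/sum_dog + 1/sum_cat)."""
    counts = np.asarray(counts, dtype=float)
    is_dog = np.asarray(is_dog, dtype=bool)
    beta = np.log(counts[is_dog].mean() / counts[~is_dog].mean())
    se = np.sqrt(1.0 / counts[is_dog].sum() + 1.0 / counts[~is_dog].sum())
    p = 2.0 * norm.sf(abs(beta / se))
    return beta, p
```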
For the image responsiveness analysis (Fig. 6A), for each neuron, we asked if that neuron fired above its baseline rate for each image. A one-sided t test was used to compare spike counts in the window 175–275 ms after image onset to those in a preonset window of the same size (−150 to −50 ms), and t tests with p < 0.05 were considered significant. The total number of presentations per category was again matched across sessions, for each single unit. The fraction of images of each category to which each neuron significantly responded was plotted. Each neuron's category selectivity was then quantified as the absolute difference between the fraction of cat and dog images to which it responded (Fig. 6B).
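A per-unit Python sketch using scipy's one-sided paired t test (the original analysis was performed in MATLAB):

```python
import numpy as np
from scipy.stats import ttest_rel

def responsive_images(evoked, baseline, alpha=0.05):
    """evoked, baseline: (n_images, n_presentations) spike counts in the
    175-275 ms and -150 to -50 ms windows. Returns, per image, whether the
    one-sided paired t test (evoked > baseline) is significant."""
    flags = []
    for ev, bl in zip(np.asarray(evoked, dtype=float), np.asarray(baseline, dtype=float)):
        _, p = ttest_rel(ev, bl, alternative='greater')
        flags.append(bool(p < alpha))
    return np.array(flags)
```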
To fit the slope of the population's responsiveness (Fig. 6B, inset), a line was fit using major-axis regression (Sokal, 1995). Major-axis regression is suitable when there is no independent/dependent variable pairing, as is the case here; the two variables are equivalent and we do not a priori expect a causal link in one direction or the other. Major-axis regression was performed analytically (i.e., with an explicit formula, using the MATLAB function maregress), returning the line's slope and the corresponding 95% confidence interval.
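The analytical slope has a compact closed form; a Python sketch of the standard major-axis formula:

```python
import numpy as np

def major_axis_slope(x, y):
    """Analytical major-axis regression slope (after Sokal & Rohlf);
    symmetric in x and y, appropriate when neither variable is the
    'dependent' one. (Confidence intervals are omitted from this sketch.)"""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx, syy = x.var(ddof=1), y.var(ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    return (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
```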
Single-unit sparseness (Fig. 6C) was calculated as in Vogels (1999).
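As a hedged reconstruction (the exact expression should be checked against Vogels, 1999), sparseness indices of this kind are conventionally written in the Rolls–Tovée form, where $r_i$ is the unit's mean response to image $i$ of $n$ images:

```latex
% Rolls--Tov\'ee-style sparseness index (conventional form; shown here as a
% plausible reconstruction, since conventions vary across studies)
S = \frac{\left( \sum_{i=1}^{n} r_i / n \right)^{2}}{\sum_{i=1}^{n} r_i^{2} / n}
```

Smaller values indicate sparser, more selective responses: a unit driven by a single image gives $S = 1/n$, whereas uniform responses give $S = 1$.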
For the time-course analysis of category differences (Fig. 7), smoothed, normalized category-level firing rates were first estimated for each neuron using an asymmetric kernel (Gaussian kernels of bandwidth 15 ms on the “causal”/positive side, 5 ms on the “acausal”/negative side; Thompson et al., 1996; Brincat and Connor, 2006; Sasikumar et al., 2018) applied to the collated peristimulus spike times from all the cat or all the dog image presentations. The densities (units of 1/s) were then converted to average rates by multiplying them by the total number of considered spikes and dividing them by the number of considered trials (units of spikes/s per trial). Examples of the resulting firing rates are shown in Figure 3D. The two category-averaged spike rates were then subtracted, and the absolute value was taken as the category difference over time. To assess the statistical significance of this difference, the same analysis was repeated 500 times with shuffled data, and timepoints where the real difference exceeded the 99th percentile of the shuffled differences (at that timepoint) were considered to have a significant category difference (Fig. 7B). We then calculated the fraction of units with a significant category difference at each timepoint and compared pre- versus post-training sessions (Fig. 7D). Significance was assessed with a chi-squared test at 50 ms intervals. Finally, the durations of periods of continuous significant category difference were calculated for each session, and the distributions of run lengths pre versus post were compared with a two-sided rank-sum test (Fig. 7C).
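The kernel construction and the conversion to average rates can be sketched in Python (bin width and edge handling here are our choices):

```python
import numpy as np

def asymmetric_gaussian_kernel(sigma_causal=0.015, sigma_acausal=0.005, dt=0.001):
    """Gaussian kernel with a 15 ms SD on the causal (positive-lag) side and
    a 5 ms SD on the acausal side, normalized to unit area."""
    half = int(np.ceil(4 * sigma_causal / dt))
    t = dt * np.arange(-half, half + 1)
    sd = np.where(t >= 0, sigma_causal, sigma_acausal)
    k = np.exp(-0.5 * (t / sd) ** 2)
    return t, k / (k.sum() * dt)

def smoothed_rate(spike_times, n_trials, t_grid, dt=0.001):
    """Collated spike times from all presentations of one category, converted
    to a smoothed average firing rate (spikes/s per trial) on t_grid."""
    _, k = asymmetric_gaussian_kernel(dt=dt)
    hist, _ = np.histogram(spike_times, bins=np.append(t_grid, t_grid[-1] + dt))
    return np.convolve(hist, k, mode='same') / n_trials
```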
Results
Categorization training
Two Japanese monkeys (Monkeys “R” and “X”) were trained to categorize natural images of cats and dogs (Fig. 1; see Materials and Methods for training details). We selected cats and dogs as the testing categories because we have previously demonstrated that they are more perceptually challenging to discriminate than alternative pairs of categories, such as cars versus trucks, or human faces versus monkey faces (Matsumoto et al., 2016). Monkey X took one session and Monkey R took three sessions to learn to categorize 40 training images presented one at a time (10–50 trials per image per session). The monkeys were then tested on a larger, trial-unique set of cats and dogs, which they categorized with ∼70% accuracy, above chance, although markedly below their peak accuracy of ∼90% during the training phase. The accurate responding to trial-unique stimuli is evidence that the monkeys learned a generalization strategy (i.e., categorization) during training, as opposed to a memorization strategy. Images were cropped onto white backgrounds and had similar low-level visual statistics such as hue and saturation (Fig. 2). The dog images had, on average, ∼1 more leg visible than the cat images, but we did not observe any clear trend for the monkeys to more accurately categorize dogs with more legs, nor to more often miscategorize cats with more legs as dogs (data not shown). Based on these rough analyses of the image sets, we conclude that it is unlikely that the monkeys distinguished the cats and dogs by any method other than integrating across the multiple visual dimensions that, in combination, allow for reliable discrimination between the two species.
Neural recordings during passive viewing
To investigate whether, and if so, how, the category training changed the stimulus-evoked activity of neurons in TE, we recorded neural activity in TE while the monkeys passively viewed images of cats and dogs either before or after the category training. In total, we recorded 348 single units before training (148 from Monkey R and 200 from Monkey X) and 333 after training (139 and 194, respectively), from three Utah electrode arrays (96 electrodes per array) chronically implanted in TE in each subject (Fig. 3A). On each trial, monkeys fixated on a central point and five cat and/or dog images were presented sequentially (350–400 ms stimulus duration and interstimulus interval). During these passive viewing sessions, monkeys were rewarded following the presentation of all five images if they maintained fixation throughout the trial. Passive viewing recordings were performed at three times relative to the category training: for multiple days in the period a few months before the training (“baseline”); for 1 d directly before the training (“pre-training”); and for 1 d directly after (“post-training”).
The monkeys’ behavior was similar between the pre- and post-training passive viewing sessions, with similar numbers of completed trials (550 pre vs 736 post for Monkey R; 2,000 both pre and post for Monkey X). Monkey R showed a 25% decrease in the number of fixation errors on dog images after category training (5.7% of presentations pre-, 4.3% of presentations post-training; p < 0.05, chi-squared test for two proportions). Fixation accuracy was unchanged on cat images for Monkey R, and on both categories of image for Monkey X. Although we did observe some repetition suppression on trials immediately following fixation errors, the overall distribution of Z-scored stimulus-evoked spike counts on these trials did not significantly differ pre- and post-training for either monkey (p = 0.60 and p = 0.092, two-sided two-sample t test).
The properties of the stimulus-evoked responses did not differ consistently between pre- and post-training days, with no significant change in the distribution of Fano factors (two-sided rank-sum test, p = 0.69 for Monkey R, p = 0.57 for Monkey X; Fig. 3B). Visual responsiveness was also similar across days (Fig. 3C), albeit not precisely matched; Monkey R showed a nonsignificant decrease in the median maximum-image-average firing rates per single unit (two-sided rank-sum, 14.0 sp/s pre vs 11.4 sp/s post, p = 0.49), whereas Monkey X showed a significant increase (p = 0.012, 14.7 sp/s pre vs 20.5 sp/s post). Neither of these changes in image-evoked firing rates was specific for one category or the other (Monkey R, mean image-evoked rate of 3.3 ± 5.4 sp/s pre- vs 3.2 ± 5.1 sp/s post-training for cats, and 3.1 ± 4.9 sp/s pre- vs 3.3 ± 4.7 sp/s post-training for dogs; Monkey X, 7.3 ± 9.8 sp/s pre- vs 9.2 ± 12.1 sp/s post-training for cats, and 6.8 ± 9.1 sp/s pre- vs 9.0 ± 12.0 sp/s post-training for dogs). Units with a variety of response characteristics (e.g., excitatory, suppressed, ramping; Fig. 3D) were observed across all sessions.
Neural representations of category at the population level
When comparing neural activity from the pre- and post-training days, linear support vector machines (SVMs) trained to decode image category from population vectors of spike counts (“abstract category SVMs,” 100 ms window size, see Materials and Methods) more accurately predicted category on the post-training day, from both the full neural populations combined across all three electrode arrays (Fig. 4A) and all individual array subpopulations (Fig. 4B; p < 0.05, cluster-based permutation corrected for multiple comparisons; see Materials and Methods). The “abstract category” decoding method holds out a set of images from training, forcing the decoder to rely on category-level information to subsequently predict the category of those images, rather than learning the category of each individual image. Decoding accuracy plateaued 175–275 ms after stimulus onset (Fig. 4A), and accuracy at this plateau increased between pre- and post-training days by 8.7 and 3.8 percentage points for Monkey R and Monkey X, respectively, with the post-training accuracy exceeding any of the observed baseline accuracies (Fig. 4C). The magnitude of this effect is comparable with previous studies of visual learning in TE (Baene et al., 2008). Similar effects were seen when the SVM input was restricted to only the 480 trial-unique images, which the monkeys had only 1–5 opportunities to see on the transfer test day (as opposed to 100+ opportunities across multiple days with the 40 training images; increases in accuracy of 9.6 and 3.5 percentage points, respectively). Similar effects were also seen when we trained decoders either on only the twenty training images (11.3 and 1.5 percentage points), or on random subsets of twenty of the 480 trial-unique images (10.1 and 3.2 percentage points), and decoding accuracies were marginally improved for the random subsets of the trial-unique images as compared with the training images.
These control analyses demonstrate that our decoding results do not rely on the monkeys having memorized the categories of the training images. Further, it seems unlikely that the monkeys would have memorized 480 initially trial-unique images and their reward associations in one exposure to the set. Finally, mere passive exposure to the training stimuli in the training-stimuli-only baseline sessions (Fig. 4D) did not improve category coding in TE on its own. Thus, this category training enhanced visual category coding in area TE, independently of memorized reward associations and familiarity effects.
Single unit correlates of category learning
To determine what fraction of single units contributed to the observed population-level category coding, we retrained the decoders on different-sized subpopulations, adding units in order of their decoding accuracy in a one-dimensional decoder (Fig. 4E). As single units were added, half-maximal accuracy was reached with the top 10–15 units, and accuracy continued to increase to an asymptote by 100 units. Sigmoidal fits showed that the amplitude of the sigmoid was increased for Monkey R post-training, whereas the slope was increased for Monkey X (Table 1). This result suggests that increased category coding in a subset of units in TE was sufficient to drive the observed population-level increase in category coding.
Using a GLM to model the effect of category on the neural responses for each single unit confirmed that a larger proportion of units responded selectively to category post-training when compared with pre-training (175–275 ms window; Monkey R, 37% vs 50% of neurons, p = 0.025, chisq = 5.05, df = 1; Monkey X, 32% vs 45% of neurons, p = 4.6 × 10−3, chisq = 8.0, df = 1; Fig. 5A). This increase was observed in all three array locations, though not all increases reached significance (Fig. 5B). As in the decoding analysis, the fractions of single units coding for category in the GLM post-training were larger than any of those observed on baseline days (Fig. 5C), arguing against a role for familiarity effects in driving the observed results. We noted a shift toward dog-preferring coefficients after training (Fig. 5D), but this effect only approached statistical significance (Monkey R, p = 0.065; Monkey X, p = 0.080, two-sided rank-sum test).
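The comparison of selective-unit proportions is a standard 2 × 2 chi-squared test with df = 1, which can be computed directly; the counts below are hypothetical placeholders, not the recorded unit counts:

```python
import numpy as np

def chi2_proportions(k1, n1, k2, n2):
    """Chi-squared statistic for a 2x2 contingency table comparing two
    proportions (df = 1, no Yates continuity correction)."""
    table = np.array([[k1, n1 - k1], [k2, n2 - k2]], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()          # expected counts under independence
    return float(((table - expected) ** 2 / expected).sum())

# e.g., 37 of 100 units selective pre-training vs 50 of 100 post (hypothetical)
chisq = chi2_proportions(37, 100, 50, 100)
```

The statistic is then compared against the chi-squared distribution with one degree of freedom to obtain the p-values reported in the text.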
To determine to what extent changes in single-unit image selectivity drove these effects, we calculated the proportion of cat and dog images to which each unit was significantly visually responsive (spike counts from 175 to 275 ms after stimulus onset vs −150 to −50 ms before stimulus onset, paired t tests, p < 0.05). Image selectivity before training was sparse, with 21 and 24% of units responding to >10% of images, but only 5 and 12% of single units responding to >25% of the images, for Monkeys R and X, respectively (Fig. 6A). This proportion increased moderately after training, though the increases did not reach statistical significance, with 28 and 28% responding to >10% of images (p = 0.14, chisq = 2.2, df = 1; and p = 0.38, chisq = 0.75, df = 1, respectively); and 6 and 18% of units responding to >25% of images (p = 0.87, chisq = 0.025, df = 1; and p = 0.12, chisq = 2.4, df = 1, respectively). These increases, as well as the total post-training proportions, were larger than any observed during baseline testing (21% ± 3% and 13% ± 1% units responding to >10% of images on baseline days, respectively). Single units were almost equally responsive to cats and dogs, with a mean absolute between-category difference of ∼2%. This difference, a rough proxy for single-unit category selectivity, increased slightly but significantly in both monkeys after training (p = 0.016 and p = 0.013, respectively, unpaired t test; Fig. 6B), reaching higher mean selectivity than any observed during baseline testing (2.8% mean difference post-training vs 1.8% pre-training and 2.4% baseline for Monkey R; and 3.0% mean difference post-training vs 2.1% pre-training and 1.6% baseline for Monkey X). A greater number of neurons shifted their responsiveness toward dogs—the rewarded category—in both monkeys, although the shift toward dogs only reached statistical significance in Monkey X (overlap and no overlap, respectively, of the analytical 95% confidence intervals for the slope of the major-axis regression; Fig. 6B, inset).
This shift in responsiveness for Monkey X was greater than any changes observed during baseline testing (data not shown). Consistent with this finding of an increase in the number of images to which each unit responded after training, we also measured a decrease in single-unit sparseness after category training (p = 0.02 and p = 0.05, respectively, one-sided rank-sum test; Fig. 6C); that is, the neurons responded to a larger number of images, and with more evenly distributed spike counts, after training. We did not observe a difference between the sparseness of responses to cats versus that to dogs either before or after training (p > 0.05, two-sided rank-sum test). Taken together, these results show that the proportion of single units in TE with category selectivity increased after category training and suggest that single units likely became more category selective in response to the training by broadening their responses to one category or the other.
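Sparseness of this kind is commonly quantified with the Treves–Rolls index (an assumption for illustration; the paper's exact measure may differ), which approaches 1/n for a unit driven by a single image and 1 for perfectly uniform responses, so broader responding after training corresponds to a higher index:

```python
import numpy as np

def treves_rolls_sparseness(rates):
    """Treves-Rolls index a = (mean r)^2 / mean(r^2) over a unit's mean
    firing rates to each image; higher values = broader (less sparse)."""
    r = np.asarray(rates, dtype=float)
    return float(r.mean() ** 2 / (r ** 2).mean())

# a unit driven by one of ten images vs a unit responding uniformly
narrow = treves_rolls_sparseness([9, 0, 0, 0, 0, 0, 0, 0, 0, 0])   # -> 1/10
broad = treves_rolls_sparseness([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])    # -> 1
```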
In order to assess how these changes in single-unit image selectivity affected population-level category representations, we measured the Mahalanobis distance (MD) between pairs of single-trial neural response vectors. Overall, the variance of MDs decreased after category training, and there were fewer high-MD outlier pairs, in line with the decreased sparseness of neural responses (Fig. 6D,E). Bootstrapped MD distributions (see Materials and Methods) revealed that the median between-category MD was uniformly larger than the median within-category MD, and the difference in the medians nearly doubled following category training (Fig. 6F). Nonetheless, the post-training between- and within-category MD distributions were still extremely similar, with d-prime values of ∼0.04 and ∼0.008 after training. We conclude that category training led to a small but reliable increase in intercategory separation in neural response space.
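The within- versus between-category distance comparison can be sketched as follows; the covariance estimate, category shift, and trial counts are synthetic stand-ins, and the bootstrap over MD distributions described in Materials and Methods is omitted for brevity:

```python
import numpy as np

def md(u, v, icov):
    """Mahalanobis distance between two single-trial response vectors."""
    d = u - v
    return np.sqrt(d @ icov @ d)

# synthetic trials (trials x units) for two categories with a small mean shift
rng = np.random.default_rng(0)
cats = rng.normal(0.0, 1.0, (30, 5))
dogs = rng.normal(0.8, 1.0, (30, 5))

# shared covariance from within-category residuals
resid = np.vstack([cats - cats.mean(0), dogs - dogs.mean(0)])
icov = np.linalg.inv(np.cov(resid.T))

within = np.array([md(a, b, icov) for i, a in enumerate(cats)
                   for b in cats[i + 1:]])
between = np.array([md(a, b, icov) for a in cats for b in dogs])

# d-prime: difference of means over pooled standard deviation
dp = (between.mean() - within.mean()) / np.sqrt(
    (between.var(ddof=1) + within.var(ddof=1)) / 2)
```

A positive d-prime indicates that trial pairs straddling the category boundary are, on average, farther apart in neural response space than pairs within a category, the quantity whose small post-training increase is reported in the text.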
Stability of category representation over time
To assess whether the population encoding of category identity is stable or dynamic within a given trial, we applied a decoding analysis in which we trained a classifier using data from one time period and tested the classifier using data from a different time period (Fig. 7A). Category information could be decoded from the population almost equally well at all time points from 75 to 450 ms after stimulus onset, regardless of the data used for training within the same monkey, indicating a relatively static (stable) representation of category identity over this timeframe. To better understand how this encoding is supported at the single unit level, we visualized the time course over which significant category selectivity emerged for each unit (see Materials and Methods; Fig. 7B). Category selectivity in single neuron responses developed over the course of ∼400 ms from stimulus onset, with the majority of category-selective neurons beginning to show selectivity from 100 to 200 ms. After training, units displayed longer periods of continuous category coding (p ≈ 0 and p = 0.014, respectively, two-sided rank-sum test; Fig. 7C). Additionally, the proportion of units with significant category selectivity increased on four of the six arrays (the anterior array of Monkey X also showed an increase but did not reach statistical significance, likely because the small amount of data left the test underpowered; p < 0.05 for each marked timepoint, chi-squared test with df = 1; Fig. 7D).
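The cross-temporal (train at one time, test at another) analysis can be sketched as below; a nearest-centroid readout stands in for the actual classifier, and the synthetic data carry a deliberately stable category signal, so accuracy stays high off the diagonal as it does for a static code:

```python
import numpy as np

def cross_temporal_accuracy(Xt, y, train_mask):
    """Train a nearest-centroid readout at each time bin and test it at
    every bin (sketch). Xt: time x trials x units; train_mask splits trials."""
    T = Xt.shape[0]
    acc = np.zeros((T, T))
    test = ~train_mask
    for t_tr in range(T):
        mu0 = Xt[t_tr][train_mask & (y == 0)].mean(0)
        mu1 = Xt[t_tr][train_mask & (y == 1)].mean(0)
        for t_te in range(T):
            d0 = np.linalg.norm(Xt[t_te][test] - mu0, axis=1)
            d1 = np.linalg.norm(Xt[t_te][test] - mu1, axis=1)
            acc[t_tr, t_te] = np.mean((d1 < d0) == (y[test] == 1))
    return acc

# synthetic data with the same (static) category signal in all 4 time bins
rng = np.random.default_rng(4)
y = np.repeat([0, 1], 100)
signal = rng.normal(0, 1, 30)
Xt = rng.normal(0, 1, (4, 200, 30)) + 0.8 * (y[None, :, None] * signal[None, None, :])
train_mask = np.arange(200) % 2 == 0
acc = cross_temporal_accuracy(Xt, y, train_mask)
```

If the code were dynamic instead, with the informative units or their tuning changing across bins, the off-diagonal cells of this matrix would drop toward chance while the diagonal stayed high.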
Latency of category coding
Monkey R's units showed a decrease in median category coding latency of roughly 10–20 ms after training, but the difference only approached significance (135 vs 116 ms for the anterior array, p = 0.17; 150 vs 141 ms, middle array, p = 0.58; 121 vs 111 ms, posterior array, p = 0.13; 131 vs 120 ms, pooled across arrays, p = 0.065). Monkey X's units showed either no change or a small increase in median category coding latency (147 vs 164 ms, anterior array, p = 0.43; 168 vs 167 ms, middle, p = 0.99; 114 vs 129 ms, posterior, p = 0.042; 142 vs 145 ms, pooled across arrays, p = 0.34). To assess the latency of category coding at the population level, we found the first time at which the fraction of significant category coding units became significantly different from zero (Fig. 7D, vertical dotted lines). In line with the single unit latencies, Monkey R showed a slight decrease in latency (166 vs 118 ms, anterior array; 94 vs 74 ms, posterior array; first timepoint where p < 0.05, two-sample chi-square test for a difference from zero) whereas Monkey X did not.
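Latency estimation of this kind is often operationalized as the start of the first sustained run of significant time bins; the run-length criterion, bin width, and alpha below are illustrative assumptions, not the paper's exact definition:

```python
def coding_latency(p_values, alpha=0.05, n_consecutive=3, bin_ms=10, t0_ms=0):
    """Return the latency (ms) at which the first run of at least
    n_consecutive significant bins begins, or None if no such run occurs.
    p_values: per-bin significance of category coding for one unit."""
    run = 0
    for i, p in enumerate(p_values):
        run = run + 1 if p < alpha else 0
        if run == n_consecutive:
            # run started n_consecutive - 1 bins before the current bin
            return t0_ms + (i - n_consecutive + 1) * bin_ms
    return None
```

Requiring several consecutive significant bins guards against a single spuriously significant bin setting an implausibly early latency.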
Discussion
In this study, category training with natural images (Fig. 1) enhanced neural category coding in area TE (Fig. 4). Three factors might have supported enhanced category coding. First, a higher proportion of units coded for category after training (Figs. 5A, 7D), suggesting that new units were likely recruited that had not previously demonstrated category selectivity. Second, single units broadened their responsiveness (Fig. 6A,B) and showed a concomitant decrease in sparseness (Fig. 6C,E), responding to larger numbers of cats or dogs in a manner that increased population-level category information and increased intercategory representational distance. Third, some units increased the duration of their category-selective responses (Fig. 7B,C). In line with previous studies, no units strongly responded to a majority of images within a category, with most units responding nearly equally to cats and dogs (Vogels, 1999). Since the most informative category coding emerged 150–200 ms into TE neurons’ visually evoked responses, and given that the latency of visual activity in TE is ∼75 ms, lateral inhibition, feedback processing, or both are candidate mechanisms for this category-level selectivity. In sum, the improved category coding in TE after learning was likely driven by a larger fraction of category-responsive neurons, a broadening of single-unit response profiles, and increased duration of category-coding responses.
It has been reported that passive viewing alone is sufficient to enhance category-selective coding in visual areas in monkeys and to drive plasticity in visual areas in mice (Freedman et al., 2006; Meyer et al., 2014; Schmid et al., 2024; Zhong et al., 2024). However, others have demonstrated that the strength of the neural representation in inferotemporal cortex depends on task demands (Fuster and Jervey, 1981) and is modulated by active categorization versus passive viewing (Emadi and Esteky, 2014), and that the representation of category-diagnostic features is enhanced more than that of nondiagnostic features (Sigala and Logothetis, 2002). In the present study, although there was a weak trend toward improved decoding accuracy across baseline passive viewing sessions involving the full image set (Fig. 4B), it was nowhere near as substantial as the increase from pre- to post-training sessions. Furthermore, this trend was not observed during the other repeated passive viewing session the monkeys underwent with the smaller 20/20 training set (Fig. 4E). Additionally, whereas passive viewing of stimuli has been associated with sharpened tuning to category identity, we also observed an increase in the number of units coding category identity and an increase in the duration of category representation after stimulus presentation.
Did category learning, in and of itself, cause the observed changes in area TE? The evidence suggests that it did: (1) whereas previous studies have relied on extensive periods of training to study neural representations of category or value in TE, we have shown similar effects after ∼1 week of training, after which the monkeys were not expert at this task (as evidenced by the reduction in accuracy and increase in reaction times from training to transfer test), arguing against an effect of overtraining or familiarity. (2) The effects of pre-exposure to the stimuli in the baseline period were generally null or smaller than the effects of the category training, again arguing against an effect of familiarity. (3) The image-evoked neural responses were similar on pre- and post-training days (Fig. 3), with any nonspecific changes (i.e., increase in overall evoked firing rates) not matching across subjects. (4) Although we cannot rule out the possibility that monkeys covertly categorized the stimuli during the passive viewing task after category training, they were not overtly performing the categorization task when these data were recorded (no response manipulanda available); thus we infer that it is the learned association, rather than immediate task salience, that likely drove these differences in neural responses to the categories. (5) Due to the asymmetric reward structure of our training task, we cannot fully rule out a role for reward associations in these changes—i.e., the monkeys may have learned to associate dogs with the liquid reward, and cats with the different “reward” of avoiding a delay in the task (see Materials and Methods). In line with this asymmetry between cats and dogs, we observed a mild bias toward dogs in our linear models (Fig. 5D) and in one monkey for overall image selectivity (Fig. 6A). However, we observed no nonspecific decrease in the sparseness of neural response toward dogs (Fig. 6C), which might be expected if single units were simply responding nonspecifically to every dog image with a reward signal. Further, since it is unlikely that the monkeys would have memorized the 480 test image associations, the monkeys would still have had to visually distinguish cats from dogs via a process of generalization to expect a reward. In conclusion, we propose that visual category learning is, at least partially, supported by enhanced neural representations of category in area TE.
Footnotes
We thank Drs Fernando Ramirez, John H Wittig, Bing Li, Wenliang Wang, and Reona Yamaguchi for helpful feedback while drafting this manuscript. Funding was provided by the Intramural Research Program, National Institute of Mental Health, National Institutes of Health, Department of Health and Human Services (annual report number ZIAMH002032), and by KAKENHI Grants JP15K21742 and JP20H05955 (to T.M.), 19K12149 and 22K12189 (to N.M.), and 23H04374 (to Y.S.-M.) from the Japan Society for the Promotion of Science (JSPS).
*J.E.P. and N.Ma. contributed equally to this work.
The authors declare no competing financial interests.
Correspondence should be addressed to Mark A. G. Eldridge at mark.eldridge@newcastle.ac.uk.