Abstract
The ventrolateral prefrontal cortex (VLPFC) shows robust activation during the perception of faces and voices. However, little is known about which categorical features of social stimuli drive neural activity in this region. Since perception of identity and expression are critical social functions, we examined whether neural responses to naturalistic stimuli were driven by these two categorical features in the prefrontal cortex. We recorded single neurons in the VLPFC while two male rhesus macaques (Macaca mulatta) viewed short audiovisual videos of unfamiliar conspecifics making expressions of aggressive, affiliative, and neutral valence. Of the 285 neurons responsive to the audiovisual stimuli, 111 neurons had a main effect (two-way ANOVA) of identity, expression, or their interaction in their stimulus-related firing rates; however, decoding of expression and identity from single-unit firing rates yielded poor accuracy. Interestingly, when decoding from pseudo-populations of recorded neurons, the accuracy for both expression and identity increased with population size, suggesting that the population transmitted information relevant to both variables. Principal components analysis of mean population activity across time revealed that population responses to the same identity followed similar trajectories in the response space, facilitating segregation from other identities. Our results suggest that identity is a critical feature of social stimuli that dictates the structure of population activity in the VLPFC during the perception of vocalizations and their corresponding facial expressions. These findings enhance our understanding of the role of the VLPFC in social behavior.
Significance Statement
Primates are unique in their ability to process and utilize complex, multisensory social information. The brain networks that support this are distributed across the temporal and frontal lobes. In this study, we characterize how social variables like identity and expression are encoded in the neural activity of the ventrolateral prefrontal cortex, a prefrontal region of the macaque brain. We found that most single neurons do not appear to encode these variables, but populations of neurons display similar activity patterns that are primarily differentiated by the identity of the conspecific that a macaque is attending to. Furthermore, by employing dynamic, audiovisual, multisensory stimuli, our experiment better approximates real-world conditions, making our findings more generalizable.
Introduction
Facial and vocal expressions are a fundamental component of communication in humans and nonhuman primates (Darwin, 1965; Ekman and Friesen, 1971; Partan, 2002; Parr and Heintz, 2009). In macaques, viewing facial expressions produces robust activity through the occipital, temporal, and ventrolateral prefrontal cortices (Russ and Leopold, 2015; Shepherd and Freiwald, 2018). The ventrolateral prefrontal cortex (VLPFC) exhibits a high degree of responsivity to social stimuli, containing face cells (O Scalaidhe, 1999, 1997), face patches (Tsao, Schweers, et al., 2008), neurons responsive to the angle of a viewed face (Romanski and Diehl, 2011), and neurons selective for species-specific vocalizations (Romanski and Goldman-Rakic, 2002; Romanski et al., 2005; Hage and Nieder, 2015).
While these findings suggest that the VLPFC serves a critical social function, little is known about what social variables are represented by neural activity in the region. This is in stark contrast to the temporal face patch network (Tsao, Moeller, et al., 2008), where a variety of features that drive neural responses have been identified. Anterior face patches contain neurons with selectivity for identities, whereas middle face patch activity correlates with head orientation of the viewed stimulus (Freiwald and Tsao, 2010). Patches in the fundus of the superior temporal sulcus are sensitive to changes in expression (Taubert et al., 2020), while face patches outside the traditional network display joint encoding of multiple variables, including expression, identity, gaze, and head orientation (Yang and Freiwald, 2021). Notably, information about identity is broadly distributed in the face patch network because it can be reliably decoded from small neural populations across many face patches based on representation of neurons as axes in a high-dimensional face space (Chang and Tsao, 2017). While neurons with differential responses for the angle of a viewed face have been identified in VLPFC (Romanski and Diehl, 2011), a systematic exploration of their selectivity for identity or type of expression has yet to be conducted.
A further complication in assessing the social function of the VLPFC arises from its role as a multisensory integrator. A VLPFC neuron's multisensory response to an audiovisual face–vocalization stimulus is frequently a nonlinear combination of its component responses to the face or vocalization alone (Sugihara et al., 2006; Diehl and Romanski, 2014). While this has the advantage of drastically increasing the possible set of responses a neuron or population of neurons can have (Rigotti et al., 2013), it also illustrates how response characteristics to a single sensory modality can fail to represent ethologically valid stimuli. Simply put, nonlinear multisensory integration provides direct evidence that conclusions drawn from neural responses to a static image or isolated vocalization will not scale to natural versions of these inherently multisensory stimuli.
To glean more insight into the categorical social features that drive VLPFC neural activity during social communication, we recorded single-unit activity in the VLPFC while macaques perceived naturalistic audiovisual expressions made by unfamiliar conspecifics. Given the primacy of identity and expression in similar studies (Gothard et al., 2007; Kuraoka and Nakamura, 2007; Kuraoka et al., 2015; Yang and Freiwald, 2021), we varied stimuli by conspecific identity and expression. We presented ethologically valid audiovisual movie clips from three different conspecifics, each making three discrete types of facial and vocal expressions with distinct valences (aggressive, affiliative, and neutral, respectively; Hauser and Marler, 1993; Partan, 2002). We recorded neurons across the VLPFC and analyzed the subset that displayed consistent firing rate changes in response to at least one of the nine stimuli presented. We found that single neurons had low selectivity for single stimuli and even lower selectivity for expression or identity as a category. Rather, they responded to various stimuli over the stimulus period. Despite heterogeneous firing patterns, both expression and identity could be accurately decoded from populations of neurons. Furthermore, we found that population responses to the same identity followed coherent trajectories over time that separated them from other identities in the population response space.
Materials and Methods
Subjects and surgical procedures
Extracellular recordings were performed in two rhesus monkeys (Macaca mulatta): Subject 1 (male, 12.0 kg, 7 years old) and Subject 2 (male, 16.5 kg, 12 years old). All procedures were in accordance with the National Institutes of Health's Guide for the Care and Use of Laboratory Animals and the University of Rochester Care and Use of Animals in Research committee. For Subject 1, prior to recordings, a titanium head post was surgically implanted to allow for head restraint, and a titanium recording cylinder was placed over the VLPFC as defined anatomically (Preuss and Goldman-Rakic, 1991) and physiologically (Romanski and Goldman-Rakic, 2002). For Subject 2, several floating microarrays (FMAs, Microprobes for Life Science) were designed specifically for the subject’s prefrontal cortex and surgically implanted into VLPFC (inferior to the principal sulcus, anterior to the inferior limb of the arcuate sulcus).
Experimental design
Stimuli and apparatus
In our stimulus presentation paradigm, subjects attended to naturalistic, audiovisual communication calls with their accompanying facial expressions. These vocalization movies were extracted from recordings of unfamiliar conspecifics in our home colony obtained by L.R. Examples of some stimuli can be seen on https://www.urmc.rochester.edu/labs/romanski.aspx. Two stimulus lists were designed such that they each contained three identities (unfamiliar conspecifics) making three ethologically distinct expressions (Fig. 1B), which differed at three levels of valence: aggressive (pant threat, bark), affiliative (coo), and neutral (grunt; Hauser and Marler, 1993; Partan, 2002; Parr and Heintz, 2009). As such, our selection of stimuli renders the type of expression (e.g., coo) synonymous with the valence (e.g., affiliative), and expression types are henceforth referred to by their valence. Both lists contained the same identity conspecifics, which made different exemplars of the same vocalization and facial expression. Lists alternated between recording sessions, such that the population of neurons recorded contains a mix of neural responses to one list or the other; however, the abstracted categories of identity and expression were preserved in the stimuli from day to day (i.e., the second list was a copy of the first with different exemplars of the same identity–expression pairing). The movie stimuli varied in length, based on the natural length of the vocalization and associated facial movement (average length, 900 ms). Additional lists of visual stimuli including objects, patterns, and scenes were used during the preparatory phase of a recording session.
Naturalistic expression presentation task. Two macaque subjects viewed movie clips (audio and video) of vocalization–expressions from unknown conspecifics. A, Stimulus presentation was initiated by the subject by fixating on a central dot for 500 ms. An audiovisual expression was subsequently presented (900 ms mean length) and a reward was given if fixation was maintained on the stimulus through its length. B, Matrix of stimuli showing that movie clips of three conspecifics (PH, BK, and ST) making aggressive (AGG; pant threat, bark), affiliative (AFL; coo), or neutral (NEU; grunt) expressions were presented pseudo-randomly. Two lists were constructed with these parameters and alternated between recording sessions.
The audiovisual movie stimuli were processed using Adobe Premiere (Adobe Systems), Jasc Animation Studio (Corel), and several custom and shareware programs. Auditory and visual components of the movies were separated into wav and mpeg streams for processing. The visual mpeg stream was edited to remove extraneous and distracting elements in the viewing frame. The audio track was also filtered to eliminate background noise if present using MATLAB (MathWorks), Goldwave (GoldWave), and SIGNAL (Engineering Design). The two streams were then recombined for presentation during recordings. Stimulus audio was presented at 65–75 dB SPL at the level of the subject's ear. Movies subtended 8–10° of visual angle and were presented at eye level in the center of the computer monitor.
Neurophysiology recordings were performed in a sound-attenuated room lined with Sonex (Acoustical Solutions). Stimuli were presented through a video monitor flanked by speakers with ±3 dB and 75–20,000 Hz frequency response (PH5-vs, Audix). The monitor and flanking speakers were located 29 inches from the subject's head and centered at the level of the subject's eyes. During all neural recordings, eye position was monitored using an infrared pupil monitoring system (ISCAN).
Experimental procedures and presentation task
Animals were acclimated to the laboratory and testing conditions. Subjects were then trained on a version of the audiovisual stimulus presentation task that did not include the stimuli used for recording (though they did encounter some of the stimuli previously, in other tasks). For Subject 1, the head was restrained by means of the surgically implanted post and a custom-built fixation system mounted to the primate chair. Then, either a single recording electrode (tungsten, FHC) or linear array (16 ch or 32 ch, V-Probe, Plexon) was lowered into the VLPFC each day, using a hydraulic microdrive and XY positioner (MO-95C, Narishige) that mounted onto the recording cylinder. For Subject 2, the head was temporarily restrained with a moldable plastic headpiece while the recording head stage was connected to the FMA Omnetics connectors. Once connected, the plastic headpiece was removed, and the subject was free to move through a limited range in the chair during recordings.
During the preparatory phase of the recordings and cylinder preparation, subjects viewed visual objects, patterns, and scenes. For Subject 1, an electrode or linear array was lowered into the VLPFC while the subject viewed these preliminary lists. The linear array was lowered until all of the contacts were below the dural surface. Single electrodes were lowered past layer 1 until single cells could be easily isolated. For Subject 2, the implanted arrays required no adjustments, and 64 channels of data were recorded by default.
Wideband neural data and behavioral event codes were recorded, digitized at 40,000 Hz, and saved to disk using a neural data acquisition system that featured analog to digital conversion at the head stage (OmniPlex, Plexon). For Subject 1, the wideband signal recorded from Plexon V-Probes or tungsten electrodes was bandpass filtered to 600–6,000 Hz and played back through a pre-amplifier and speaker in order to optimize the position of the electrode, prior to recording. In certain instances, the 600–6,000 Hz “spike band” was also saved alongside the wideband data.
At the start of each trial, a red fixation point (0.15 × 0.15°) was presented at the center of the monitor. Subjects were given a 5 s period in which a 500 ms sustained fixation on the point would trigger the beginning of the perceptual task and a vocalization movie was presented. During the movie, subjects had to maintain their gaze within the viewing window corresponding to the boundaries of the stimulus (10° × 10° window) for the entire duration of the movie stimulus. Eye position was monitored continuously through the trial. Successful completion of the trial resulted in a juice or water reward 250 ms after the end of the stimulus (Fig. 1). Failure to establish 500 ms of fixation within the given time or failure to maintain gaze within the boundaries of the stimulus was considered a failed trial and immediately terminated the presentation of the stimulus. All successful trials were equally rewarded and unsuccessful trials were stopped upon loss of fixation. Trials were presented pseudo-randomly for 10–12 repetitions (Subject 1) or 15 repetitions (Subject 2) of each stimulus. The intertrial interval (ITI) was 2 s. The timing of behavioral contingencies, presentation of stimuli, delivery of reward, and monitoring of eye position were controlled by custom software running on a dedicated computer.
Data analysis
Overview of recording and analytic approach
We recorded a total of 435 neurons in the VLPFC of two nonhuman primate subjects. Recordings spanned 74 successful sessions. Subjects viewed one of two stimulus lists on any given day, which alternated between days. Both lists contained three types of vocalization/facial gestures from the same three macaque conspecifics (e.g., Stimulus 1 on List 1 was monkey PH exhibiting an affiliative coo and Stimulus 1 on List 2 was a different, unique instance of monkey PH exhibiting an affiliative coo). Each session yielded a different number of simultaneously recorded neurons, and the average simultaneous population size was 3.4. Spike sorting for single units was conducted on all recorded channels. Prior to analysis, 150 neurons which were not responsive in our task were removed (see Identification of responsive units). A total of 172 responsive neurons were recorded in Subject 1 using Plexon V-Probes (16 channels) in the VLPFC, and 113 responsive neurons were recorded in Subject 2 using the chronically implanted floating microarrays (Microprobes for Life Science; 4 arrays × 16 channels each). Single-unit analyses (selectivity index, ANOVA, decoding) were conducted to characterize the capacity of single units to encode information regarding expression or identity. Population-level analyses [decoding, principal component analysis (PCA)] were performed with the same goal, and neurons were randomly pooled as a pseudo-population to extrapolate the encoding capabilities of this region to populations beyond the sizes gleaned within a recording session.
Spike sorting
Spike sorting was conducted manually, off-line, after recording sessions were completed using a dedicated computer and software (Offline Sorter, Plexon). Spikes were either sorted from the wideband signal that was high-pass filtered at 600 Hz (four-pole Butterworth) or from saved spike band data (see Experimental procedures and presentation task). The generalized process for spike sorting started with thresholding the filtered wideband or spike band at a level where clusters of spikes were visually distinct in principal component (PC) 1/2, 2/3, or 1/3 spaces. Spikes were then realigned to their global minimum after the first threshold crossing. Clusters were preliminarily designated as a single unit within one of the PC spaces and then corroborated through clustering in another PC space. In rare instances, a cluster that could not be corroborated in another PC space would be found in a PC × nonlinear energy or PC × peak–valley space. Uncorroborated clusters had their unit designation removed and were thus not identified as single units in our analysis. Clusters that had >0.2% 1 ms inter-spike interval (ISI) violations were reduced in size, typically in the corroborating space, to meet the threshold. If such a threshold could not be met, the unit designation was removed. As such, the single units analyzed in this study have, at most, 0.2% ISI violations and can be visually identified as clusters in at least two combinations of PC 1, 2, 3, nonlinear energy, or peak–valley spaces.
Identification of responsive units
Because our stimuli varied in duration (range, 400–1,600 ms; average, 900 ms) due to the natural length of the associated vocalization, any period of time used to identify responsive units represented a trade-off between including time after a stimulus had terminated (for shorter stimuli) or cutting off a response before a stimulus had terminated (for longer stimuli). We found units that had stimulus-dependent response latencies (Fig. 2). Previous VLPFC studies indicate that response latencies may vary between 50 and 450 ms. Therefore, we decided to look for responses over an inclusive time period and assessed responsivity to stimuli between 0 and 1,500 ms after stimulus onset. This period of time is henceforth referred to as the "stimulus period." While this was 100 ms shorter than our longest stimulus, the majority of cells responsive at late time points were also responsive at earlier ones for shorter stimuli, which facilitated their identification using the analyses described below:

$$P(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}. \quad (1)$$
Responsive units were identified using a sliding bin analysis to determine time bins in which the average spike count for a given condition was improbable, under the Poisson probability (Eq. 1), given the mean baseline spike count. For each unit, a mean spike count for the 200 ms prior to every stimulus presentation was calculated ($\lambda$). Then a sliding bin analysis was conducted in 200 ms time bins from 0 to 1,500 ms post-stimulus onset, shifting by 100 ms, to derive the mean spike count in each 200 ms bin for each stimulus presentation (10–15 repetitions, spike count rounded to the nearest whole number). Equation 1 was used to calculate the Poisson probability of spiking within each bin, where $\lambda$ was the mean spike count in the 200 ms prior to stimulus onset across all stimuli, $x$ was the mean spike count in a particular 200 ms bin for a single stimulus of interest, and $P(x)$ was the probability that the spike count $x$ would be observed within the condition and bin being analyzed given the average spontaneous spike rate $\lambda$. This calculation was repeated for all bins and for each stimulus. Units with at least one bin in one stimulus condition where the probability of the spike count was <0.1 were deemed responsive in this task. This analysis was conducted using the R programming language.
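The sketch below illustrates this responsiveness criterion in R, assuming spike times are stored per trial and aligned to stimulus onset; the object and argument names (spike_times, stim_id, p_crit) are illustrative rather than taken from the original analysis code.

```r
# Minimal sketch of the sliding-bin Poisson responsiveness test (assumed data layout).
# `spike_times`: list of numeric vectors of spike times (ms, aligned to stimulus onset).
# `stim_id`: vector labeling each trial's stimulus.
is_responsive <- function(spike_times, stim_id,
                          baseline = c(-200, 0),
                          bin_width = 200, step = 100,
                          window = c(0, 1500), p_crit = 0.1) {
  # Mean baseline spike count (lambda) across all trials
  lambda <- mean(sapply(spike_times, function(st)
    sum(st >= baseline[1] & st < baseline[2])))

  bin_starts <- seq(window[1], window[2] - bin_width, by = step)

  for (stim in unique(stim_id)) {
    trials <- spike_times[stim_id == stim]
    for (b in bin_starts) {
      # Mean spike count in this bin for this stimulus, rounded to an integer
      x <- round(mean(sapply(trials, function(st)
        sum(st >= b & st < b + bin_width))))
      # Poisson probability of observing x given baseline count lambda (Eq. 1)
      if (dpois(x, lambda) < p_crit) return(TRUE)
    }
  }
  FALSE
}
```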
Single units exhibit heterogeneous responses. Representative responses to multiple repetitions of all nine expressions are grouped in columns by expression. Responses to identity PH are in gold, BK in purple, and ST in teal. Raster plots for each trial are presented above with corresponding spike density functions below, color-coded by identity. 0 ms represents the onset time of the stimuli. Because these stimuli were naturalistic expressions, consisting of a video and audio component, the duration of the stimuli varied from 400 to 1,500 ms. A, Identity was a significant modulator of the firing rate for this unit, as well as the interaction between identity and expression (two-way ANOVA identity × expression; p < 0.005 for identity; p = 0.154 for expression; p < 0.005 for interaction). B, Another responsive unit. In this case, identity, expression, and their interaction significantly modulated the firing rate (two-way ANOVA identity × expression; p < 0.005 for identity; p = 0.025 for expression; p < 0.005 for interaction).
Since the audiovisual movie stimuli we employed are the natural co-occurrence of a macaque vocalization with the corresponding facial gesture, either component or the combination of the two could be responsible for the neural response. Separate presentations of the vocalization and facial gesture components have been previously analyzed to demonstrate multisensory responses in VLPFC cells (Sugihara et al., 2006; Romanski and Hwang, 2012; Diehl and Romanski, 2014) so this was not performed in the current study.
Selectivity analysis
For each of the 285 responsive units, a selectivity index (SI) was calculated with regard to the selectivity for specific stimuli within our stimulus set, the identities in the set, and the expressions in the set. The index provides a 0–1 scale for each unit, through which the selectivity for a particular stimulus or category of stimuli can be quantified. In this analysis, a unit that selectively fired for a single audiovisual stimulus, expression, or identity would have an SI = 1, and a unit that fired uniformly to all instances would have an SI = 0. Equation 2 shows the calculation:

$$\mathrm{SI} = \frac{n - \sum_{i=1}^{n} (\bar{r}_i / \bar{r}_m)}{n - 1} \quad (2)$$

where $\bar{r}_i$ is the mean of the absolute value of the difference in firing rate between the baseline (−200 to 0 ms) and response period (0–1,500 ms) for a given stimulus, identity, or expression type $i$; $\bar{r}_m$ is the maximum of $\bar{r}_i$ over all $i$; and $n$ is the number of stimuli (9), identities (3), or expression types (3) for which an SI was calculated.
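As a concrete illustration, a minimal R implementation of Equation 2 might look like the following; the input is assumed to be a vector of baseline-subtracted mean response magnitudes, one entry per stimulus, identity, or expression, and the function name is illustrative.

```r
# Selectivity index (Eq. 2).
# `rates`: numeric vector of mean |response - baseline| firing rates,
# one entry per stimulus (n = 9), identity (n = 3), or expression (n = 3).
selectivity_index <- function(rates) {
  n <- length(rates)
  (n - sum(rates / max(rates))) / (n - 1)
}

# Example: a unit responding to only one of three identities gives SI = 1,
# and a unit responding equally to all three gives SI = 0.
selectivity_index(c(10, 0, 0))   # 1
selectivity_index(c(5, 5, 5))    # 0
```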
We also determined the optimal stimulus for VLPFC responsive neurons. A one-way ANOVA by stimulus (n = 9) was performed on each neuron, followed by post hoc Tukey's Honestly Significant Difference (Tukey's HSD) tests. For each significantly responsive cell (p < 0.05), we identified the stimulus with the highest mean firing rate and, using the post hoc Tukey's test, required that it be significantly different from at least two other stimuli to be considered the "optimal stimulus." For each of the nine stimuli in the two testing lists, we computed the number of cells that ranked each stimulus as the optimal stimulus and used a chi-square goodness-of-fit test to assess the distribution.
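A hedged sketch of this procedure for a single unit is shown below, using base R (aov, TukeyHSD, chisq.test); the data frame layout and column names (rate, stimulus) are assumptions for illustration, not the original code.

```r
# Optimal-stimulus analysis for one unit (assumed per-trial data frame `df`
# with columns `rate` and `stimulus`, a factor with 9 levels).
fit <- aov(rate ~ stimulus, data = df)                 # one-way ANOVA by stimulus
optimal <- NA
if (summary(fit)[[1]][["Pr(>F)"]][1] < 0.05) {
  hsd  <- TukeyHSD(fit)$stimulus                       # post hoc pairwise comparisons
  best <- names(which.max(tapply(df$rate, df$stimulus, mean)))
  # Count significant pairwise differences involving the highest-rate stimulus;
  # at least two are required for it to count as the "optimal stimulus".
  involves_best <- grepl(best, rownames(hsd), fixed = TRUE)
  if (sum(hsd[involves_best, "p adj"] < 0.05) >= 2) optimal <- best
}

# Across units, the counts of optimal stimuli can then be compared with a
# uniform expectation, e.g., chisq.test(table_of_optimal_counts).
```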
Two-way ANOVA for expression and identity
To assess the influence of expression type and identity on the firing rates of responsive units, we fit the firing rate of each unit during the response period (0–1,500 ms) with a two-way ANOVA model of the form in Equation 3:

$$r_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \sigma \quad (3)$$

where $r_{ij}$ is the firing rate over all expression type ($i = 1, 2, 3$) and identity ($j = 1, 2, 3$) levels, $\mu$ is the mean firing rate, $\alpha$ is the effect due to the type of expression, $\beta$ is the effect due to identity, $(\alpha\beta)$ is the effect due to the interaction of expression type and identity, and $\sigma$ is Gaussian noise. This ANOVA was performed for each responsive unit, yielding p values for the statistical significance of the effect of expression type ($\alpha$), identity ($\beta$), and interaction ($\alpha\beta$) on each unit's firing rate. Units were considered positive for an effect at the p = 0.05 level for that effect; as such, units could be significant for multiple effects. Two-way ANOVA analysis and plotting were completed using R and the stats, ggplot2, and ggVennDiagram packages.
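In R, the per-unit model in Equation 3 reduces to a single aov call; the sketch below assumes a per-trial data frame with illustrative column names (rate, expression, identity).

```r
# Two-way ANOVA (Eq. 3) for one unit. `df` is assumed to hold one row per
# trial with columns `rate`, `expression`, and `identity` (3-level factors).
fit <- aov(rate ~ expression * identity, data = df)
p   <- summary(fit)[[1]][["Pr(>F)"]]

# p[1], p[2], p[3]: p values for expression, identity, and their interaction.
# A unit counts as "positive" for any effect with p < 0.05.
effects <- c(expression = p[1], identity = p[2], interaction = p[3]) < 0.05
```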
Single-unit k nearest neighbor decoding
For all 285 responsive units, a k nearest neighbors classifier (KNN) was used to decode the expression type or identity of a stimulus in a single trial based on the firing rate observed in that trial. For each trial that was classified, the Euclidean distance between the firing rate in that trial and the firing rate in every trial of a training set was calculated. The trial was then classified based on the majority classification among an odd number of nearest neighbors within the training set. Classifications were made and accuracy was calculated using 5–21 nearest neighbors. The maximum accuracy across this set of nearest neighbors was used as the accuracy for the unit being analyzed. This analysis was conducted with threefold cross-validation (CV), with five repetitions within each fold averaged to produce that fold's accuracy metrics. The matrix provided to the decoder was n rows by 1 column plus a column of labels, with n = number of trials (typically 90 or 135 for 10 or 15 repetitions of each stimulus, respectively) and the column containing the 0–1,500 ms firing rate for a single unit. The chance level was empirically determined by repeating the aforementioned analysis, decoding for expression using shuffled data. The chance level was set at the mean value for the 285 iterations of the shuffled analysis. This analysis was implemented in R using additional packages caret, knn, and tidyverse.
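A minimal sketch of this single-unit decoder is given below. It substitutes class::knn for the caret workflow named above, so it should be read as an approximation of the described procedure under assumed inputs; all function and variable names are illustrative.

```r
# Single-unit KNN decoding sketch. `rates`: numeric vector of 0-1,500 ms
# firing rates (one per trial); `labels`: expression or identity of each trial.
library(class)

decode_single_unit <- function(rates, labels, ks = seq(5, 21, by = 2),
                               n_folds = 3, n_reps = 5) {
  labels <- factor(labels)
  x <- matrix(rates, ncol = 1)               # one feature: the trial firing rate
  acc <- sapply(ks, function(k) {
    mean(replicate(n_reps, {
      folds <- sample(rep(1:n_folds, length.out = length(labels)))
      mean(sapply(1:n_folds, function(f) {
        test <- folds == f
        pred <- knn(train = x[!test, , drop = FALSE],
                    test  = x[test, , drop = FALSE],
                    cl = labels[!test], k = k)
        mean(pred == labels[test])           # fold accuracy
      }))
    }))
  })
  max(acc)                                   # maximum accuracy across tested k
}
```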
Pseudo-population k nearest neighbor decoding
Pseudo-populations of sizes 2–280 neurons were constructed from the responsive units. Expression type and identity were then decoded from the pseudo-population using a KNN model, similar to that used for single-unit decoding. In this case, classifications were made and accuracy was calculated using 5, 7, or 9 nearest neighbors and twofold CV. Fifty repetitions were conducted at each pseudo-population level, and each repetition included a random selection of units to include in the population and randomization of trials within the pseudo-population, in addition to standard randomization of the data for CV. The chance level was empirically determined by repeating the aforementioned analysis, decoding for expression using shuffled data. The chance level was set at the mean value of all iterations (50 repetitions at each population size) of the shuffled analysis.
For each pseudo-population size, a matrix of 135 rows by u columns was created, containing the z scored firing rates of units to 135 trials as rows (15 repetitions of nine stimuli). In cases where 135 trials were not recorded for a unit, additional data were produced by random sampling from existing data for that unit. In the five cases where there was an unbalanced number of completed trials for a unit (e.g., a unit had 10 repetitions of one stimulus and nine of another), the unit was removed from this analysis. The order of trials was shuffled prior to insertion in the matrix. The pseudo-population size was represented by u, and units from the 280 responsive units appropriate for this analysis were sampled at random to populate u columns on each iteration of the decoding analysis.
In summary, a 135 × u data matrix plus a column of labels was the input data for the KNN decoder. Euclidean distance was used, twofold CV was completed on each iteration, and 50 iterations were completed for each pseudo-population size u. The data matrix was rebuilt, starting with random selection of units, for each iteration. This analysis was implemented in R using the previously mentioned packages.
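The sketch below illustrates one iteration of this pseudo-population decoding, again using class::knn in place of the caret pipeline; the list-of-units data layout and names (rate_list, labels, u) are assumptions for illustration.

```r
# One iteration of pseudo-population KNN decoding. `rate_list`: list with one
# 135-trial z-scored rate vector per unit, all ordered by the same 135
# stimulus labels in `labels`; `u`: pseudo-population size.
library(class)

decode_pseudo_pop <- function(rate_list, labels, u, ks = c(5, 7, 9)) {
  labels <- factor(labels)
  units  <- sample(rate_list, u)             # random selection of u units
  X <- sapply(units, function(r) {
    # Shuffle trials within each stimulus so row i keeps the label labels[i]
    perm <- seq_along(labels)
    for (s in levels(labels)) {
      idx <- which(labels == s)
      perm[idx] <- sample(idx)
    }
    r[perm]
  })
  # Twofold CV; report the best accuracy across the tested values of k
  fold <- sample(rep(1:2, length.out = nrow(X)))
  max(sapply(ks, function(k) {
    mean(sapply(1:2, function(f) {
      test <- fold == f
      pred <- knn(X[!test, , drop = FALSE], X[test, , drop = FALSE],
                  labels[!test], k = k)
      mean(pred == labels[test])
    }))
  }))
}
# Averaging 50 such iterations at each u gives the accuracy curves in Figure 7.
```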
Decoding of identity and expression over time
A pseudo-population of 285 was created from all responsive units by creating a 135 × 285 data matrix plus a column of labels (expression type or identity) for 200 ms bins (200 ms bin size, 100 ms step size, 16 total bins) across the response period. A single column represented the z scored firing rate in a 200 ms period of interest for a single unit in response to each of 135 trials. Random sampling of existing data was used to produce 135 trial column vectors for units that did not have 135 recorded trials (same as previous). The order of trials was shuffled for each unit prior to insertion into the matrix (same as previous). Decoding was performed on firing rates in 200 ms time bins, sliding forward by 100 ms, starting at 250 ms before stimulus onset, and ending with a bin starting at 1,350 ms after stimulus onset.
For each 200 ms bin, the decoder generated mean vectors for each classification (either expression type or identity) from the training data and classified each trial in the test data based on the mean vector with which it was maximally correlated (Pearson's correlation coefficient). This was done using twofold CV and repeated for 50 iterations for each 200 ms bin. Mean accuracy and standard error were derived from the 50 iterations. The chance level was empirically determined by repeating the aforementioned analysis, decoding for expression using shuffled data. The chance level was set at the mean value of all iterations (50 repetitions at each time point) of the shuffled analysis. This analysis was conducted in MATLAB using the Neural Decoding Toolbox (Meyers, 2013) and visualized in R using packages previously mentioned.
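Because the original analysis used the MATLAB Neural Decoding Toolbox, the following is only an R approximation of the maximum-correlation classifier applied to a single 200 ms bin; the matrix layout and names are illustrative.

```r
# Maximum-correlation decoding for one time bin (R approximation).
# `X`: trials x units matrix of z-scored firing rates in the bin;
# `labels`: class (expression type or identity) of each trial.
max_corr_decode <- function(X, labels, n_folds = 2) {
  labels <- factor(labels)
  fold <- sample(rep(1:n_folds, length.out = nrow(X)))
  mean(sapply(1:n_folds, function(f) {
    test <- fold == f
    # Class templates: mean population vector per class in the training set
    templates <- sapply(levels(labels), function(l)
      colMeans(X[!test & labels == l, , drop = FALSE]))
    # Assign each test trial to the template it is most correlated with
    r    <- cor(t(X[test, , drop = FALSE]), templates)   # test trials x classes
    pred <- levels(labels)[max.col(r)]
    mean(pred == labels[test])
  }))
}
# Running this for each 200 ms bin (100 ms steps) and averaging 50 iterations
# per bin yields time courses analogous to Figure 8.
```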
Principal component, group distance, and trajectory analysis
The mean firing rate of each of the 285 responsive units was calculated for each of the nine stimuli using the 0–1,500 ms response window. The rates were then z scored and compiled in a matrix of 9 rows × 285 columns (stimuli by units). A PCA was conducted on this matrix to project the population response to each of the nine stimuli into eight retained dimensions. When visualized, the coordinates of the nine population responses were taken from their respective PCs.
Distances between population responses to stimuli were calculated using the Euclidean distance between the coordinates in the preserved number of PCs. For example, the distances in a population space preserving two PCs were calculated for each pair of points using

$$\sqrt{(\mathrm{PC1}_{S1} - \mathrm{PC1}_{S2})^2 + (\mathrm{PC2}_{S1} - \mathrm{PC2}_{S2})^2}$$

where $\mathrm{PC1}_{S1}$ is the coordinate along PC1 of the population response to stimulus 1. These distances were grouped according to whether they were a distance between the same identity (ID), between the same expression type (EXP), or between mismatches of identity and expression (OTHER). A two-way ANOVA of the form in Equation 4 was fitted to the distances calculated in spaces retaining 1–8 PCs. A post hoc test, Tukey's HSD, was used to specify which groups had statistically significant differences from each other:

$$d_{ij} = \mu + \alpha_i + \beta_j + \sigma. \quad (4)$$

The trajectory of the neural population response to each stimulus was visualized by projecting the position of the population response at each time point of interest into the eight-dimensional PCA space of the population at the final time point of 1,500 ms. First, the rotation matrix mapping the original space to the eight-PC space ($\mathbf{R}$) was calculated from the PCA covering the full time range of 1,500 ms. For each time point of interest ($i$, 50 ms increments from 0 to 1,500 ms), a new population matrix ($\mathbf{P}_i$) of 9 rows × 285 columns (stimuli × units) was produced from the mean firing rate to each stimulus from 0 to $i$ ms post-stimulus onset. This matrix was rotated into the final PC space ($\mathbf{P}_i\mathbf{R}$) and visualized by retaining the respective PCs. This was repeated for all intervals of interest to produce a visualization of the path that the population response to a single stimulus took over time. These analyses were conducted using R with the FactoMineR (PCA) and factoextra packages.
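The following sketch approximates these steps with base R (prcomp) rather than FactoMineR; the 9 × 285 matrix layout, label vectors, and function names are assumptions for illustration.

```r
# Population PCA, group distances, and trajectory projection (sketch).
# `M`: assumed 9 x 285 matrix of z-scored mean firing rates (stimuli x units).
pca    <- prcomp(M, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:8]                      # 9 stimuli x 8 retained PCs

# Pairwise Euclidean distances between population responses, keeping k PCs,
# grouped as same identity (ID), same expression (EXP), or mismatch (OTHER).
# `identity` and `expression` are assumed length-9 label vectors.
group_distances <- function(scores, k, identity, expression) {
  d     <- as.matrix(dist(scores[, 1:k, drop = FALSE]))
  pairs <- t(combn(nrow(scores), 2))
  grp <- apply(pairs, 1, function(p) {
    if (identity[p[1]] == identity[p[2]]) "ID"
    else if (expression[p[1]] == expression[p[2]]) "EXP"
    else "OTHER"
  })
  data.frame(distance = d[pairs], group = grp, dims = k)
}

# Trajectories: project cumulative-window firing rates (`M_t`, 9 x 285 matrix
# of 0-i ms mean rates) into the final PC space via the full-window rotation.
project_timepoint <- function(M_t, pca) {
  (scale(M_t, center = pca$center, scale = FALSE) %*% pca$rotation)[, 1:8]
}
```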
Results
Single units responsive to naturalistic stimuli exhibit complex and nonselective responses
We recorded and isolated 466 units from the VLPFC of two macaques while they viewed audiovisual movie clips of three different conspecifics making three different types of vocalization–expressions: either an affiliative, aggressive, or neutral expression. In the subsequent analysis, "expression" and "expression type" refer to the type of vocalization–expression (coo/affiliative, pant threat or bark/aggressive, or grunt/neutral), whereas "stimulus" or "stimuli" refers to the particular identity–expression combinations, that is, one or more of the nine audiovisual movie clips seen and heard. Due to the variability of responses to the stimuli found across time (Fig. 2), a custom criterion for evaluating responsivity, using a sliding bin analysis of Poisson probability, was constructed to identify units that were responsive in this task (see Materials and Methods, Identification of responsive units). From this analysis, all units identified as responsive had at least one 200 ms period, during the stimulus period, in which the firing rate was significantly elevated or reduced compared with baseline, in response to at least one stimulus. Additionally, units that had fewer than 10 repetitions of all stimuli were removed from this analysis.
A total of 285 units met the aforementioned criteria (n = 172, Subject 1; n = 113, Subject 2). Figure 2 depicts the typical activity patterns found across the population of responsive units. Units had peak responses that varied in time or had multiple peaks across the response period and displayed varying activity patterns relative to the specific stimulus presented. The unit in Figure 2A exhibited an increased response to the aggressive expression from BK and the neutral expression from ST. In the example single unit shown in Figure 2B, the magnitudes of the responses to the neutral expressions by BK and ST were increased, as was the response to the affiliative expression by ST.
The observed dynamic firing patterns prompted the need for a quantitative understanding of the selectivity of single units for specific stimuli, identities, and expression types. An SI was calculated for the firing rate across 0–1,500 ms for each unit, with regard to selectivity for each stimulus, each identity, and each expression (see Materials and Methods, Selectivity analysis). Using this index, each neuron's response to a condition can be expressed along a 0–1 scale, with 1 signifying exclusive responsivity to a single condition and 0 signifying uniform responsivity to each condition. Neurons responsive to our task were only moderately selective for specific stimuli (mean SI, 0.41 ± 0.13) and less selective for identity (0.22 ± 0.13) or expression type (0.23 ± 0.13; Fig. 3). The increased selectivity for stimuli compared with the category of identity or expression was statistically significant (Kruskal–Wallis followed by pairwise Wilcoxon rank sum; p < 0.001; Bonferroni corrected). This suggests that the neural firing of single units in response to naturalistic expressions correlates more with combinations of stimulus features than categorical variables like identity or expression type.
Selectivity index distributions. A selectivity index was calculated for all 285 units with regard to selectivity for one of nine stimuli, one of three identities, or one of three expressions. The plots above depict the counts of cells at each tier of selectivity index (in bins of size 0.1), with regard to selectivity for stimuli (A), identity (B), and expression (C). The mean of each distribution was significantly different from 0 (three Wilcoxon tests; p < 0.005 for stimuli, identity, and expression means compared with 0). Boxplots in D depict a comparison between means of SI for each factor of interest. The population was more selective for specific stimuli (0.41 ± 0.13) compared with identity (0.22 ± 0.13) and expression (0.23 ± 0.13) (Kruskal–Wallis; df = 2; p < 0.001; followed by pairwise Wilcoxon rank sum; p < 0.001; Bonferroni corrected). While these units are not uniformly responsive, they are also not highly selective for specific stimuli, identities, or expressions.
We also investigated whether VLPFC cells exhibited a preference for a particular stimulus in our testing set and identified the "optimal stimulus" for each cell in a manner similar to Kuraoka et al. (2015). We performed a one-way ANOVA on all cells which had stimulus-related activity (N = 285). Of these, 114 cells had a significant effect of stimulus (p < 0.05). For each cell, we identified the stimulus with the highest mean firing rate and, using a post hoc Tukey test, required that it be significantly different from at least two other stimuli to be considered the "optimal stimulus." For each of the nine stimuli in the two testing lists, we computed the number of cells which ranked each stimulus as the optimal stimulus. A chi-square goodness-of-fit test was performed to determine whether the distribution of optimal responses to the nine stimuli, across the population of responsive cells, differed significantly from chance. Across 72 responsive cells tested with Set 1 and 42 responsive cells tested with Set 2, the chi-square analysis on stimulus was not significantly different from chance [Set 1, χ2 (8, N = 72) = 8.0; p = 0.4335; Set 2, χ2 (8, N = 42) = 9.87; p = 0.274]. This demonstrated that there was no bias toward a particular stimulus being optimal across the neurons tested.
Both the selectivity index and the optimal stimulus analysis indicated that while VLPFC neurons did not respond uniformly or indiscriminately to all identities or expressions, the majority of cells were also not selective for a specific stimulus or either categorical variable. Thus, it was possible that these categorical variables influenced firing rates in a more subtle way. In order to further characterize the potential influence of these factors on single-unit firing rates, we conducted a two-way ANOVA analysis modeling expression type, identity, and the interaction of both factors as predictors of firing rate in the 0–1,500 ms stimulus period. A two-way ANOVA was conducted for each unit, and we found that 111 units from our population had a significant effect of expression type, identity, or their interaction (Fig. 4A). Within the set of 111 units, 80 units had a significant effect of identity, 56 had a significant effect of expression, and 66 had a significant effect of their interaction (Fig. 4B; p < 0.05). It should be noted that these counts represent overlapping subsets, because a single unit could have multiple significant factors in the two-way ANOVA; this is portrayed in the Venn diagram in Figure 4C. The classification with the highest number of units was those for which there was a significant effect of identity, expression, and their interaction (30 units, 27% of the 111 significant units). The next highest group was those with a significant effect of identity only (28, 25%). Firing rates and SI for two units that showed ANOVA effects of expression type, conspecific identity, and interaction are shown in Figure 5. Overall, these results provided evidence that expression type and conspecific identity influenced the firing rates of a subset of single units.
Two-way ANOVA of single-cell firing rates. Two-way ANOVA of expression (EXP), identity (ID), or their interaction (INT). A, Of the 285 responsive units, 111 were statistically significant (p < 0.05) for at least one ANOVA factor (in orange), indicating an influence of EXP, ID, or INT. B, The counts of units with significant effects for each factor or interaction in this analysis. Eighty units were significant for ID, 56 for EXP, and 66 for INT. C, The units from B represent an overlapping population that can have multiple significant factors. Thus, there are seven possible classifications for each unit, and the distribution of these classifications is shown in the Venn diagram. Lighter shades indicate higher counts. Percentages represent a proportion of the 111 two-way ANOVA positive units in A. Overall, a co-occurrence of significant effects of ID, EXP, and INT accounted for the largest portion of the two-way ANOVA positive population, indicating influences of both ID and EXP as well as specific stimuli. Examples of such units with these effects are shown in Figure 5.
Firing rate by stimulus for two units with effects of ID, EXP, and INT. Two units with box and whisker representations of firing rates for each stimulus [horizontal bar = median; box = interquartile range (IQR); stems = 1.5 × IQR]. Boxplot line borders correspond to expression, and boxplot fill colors correspond to identity. In both A and B, the unit had significant effects of expression, identity, and interaction in the two-way ANOVA analysis. An SI calculated by stimulus is provided for each unit, and the y-axis is scaled to a relevant range for each unit. A, A unit with firing rate trends for expressions and identity that also has a single stimulus (BK-NEU) with a markedly different firing rate distribution than all others. B, A similar trend as in A for a unit with a higher SI. In this case, stimulus ST-AFL had a markedly higher firing rate than others.
Accurate decoding of expression and identity requires a neural population
Since single units paradoxically exhibited both generally low selectivity and main effects of expression type and identity, we assessed whether the effects of these variables were sufficient to differentiate expressions or identities based on the firing rates of single units. We used a decoding approach in which single trials were classified by conspecific identity or expression type based on their Euclidean distance to k nearest neighbors (KNN model, threefold CV with five repeats, maximum decoding accuracy across k = odd integers from 5 to 21; see Materials and Methods, Single-unit k nearest neighbor decoding). Figure 6 depicts the results for this decoding model. The mean expression decoding accuracy of responsive units was 0.37 ± 0.05, and the mean decoding accuracy for identity was 0.38 ± 0.06, which were not significantly different from the empirically derived chance value of 0.37 or each other (N = 285; p > 0.016; Wilcoxon signed-rank; Bonferroni-corrected p threshold). These results suggest that single cells are, on average, unreliable encoders of categorical social information in the VLPFC.
Weak decoding of identity and expression from single-unit firing rates. For each single unit, we classified single trials based on the firing rate similarity (Euclidean distance) to 5–21 nearest neighbors in training sets (KNN model, threefold CV, 5 repeats). The maximum decoding accuracy across 5–21 nearest neighbors was used as the single-unit decoding accuracy. The mean decoding accuracies of expression (0.37 ± 0.05 SD) and identity (0.38 ± 0.06) were not significantly different from the empirical chance value of 0.37, nor were they significantly different from each other (p > 0.016; Wilcoxon signed-rank; Bonferroni-corrected p-threshold).
Although decoding of categorical social variables using single-unit firing rates was largely inaccurate, the diversity of responses found in these units may facilitate accurate decoding at the population level. To understand whether decoding of expression type or conspecific identity could be achieved by utilizing populations of units, we conducted a decoding analysis on increasing sizes of pseudo-populations of responsive units (280 units used, 5 removed due to an unbalanced number of trials). Briefly, a pseudo-population was constructed by randomly combining units to meet a target population size and randomly matching trials of the same stimuli within that population. In cases where a unit had fewer than 15 repetitions per stimulus, firing rates for the remaining repetitions were randomly sampled from the existing data for that stimulus until 15 repetitions of each stimulus were represented in the pseudo-population for that unit. Decoding accuracy for a pseudo-population was derived from the maximum accuracy of a KNN model (see Materials and Methods, Pseudo-population k nearest neighbor decoding) across all k (k = 5, 7, 9; twofold CV; 50 repetitions with randomization of units in the population at each repetition). It should be noted that decoding for expression type or conspecific identity in this analysis is made more difficult by two factors inherent to the methodology of our experiment. First, single units were recorded during viewing of one of two possible sets of stimuli that contained slightly different exemplars of the same identity–expression pairings as stimuli (i.e., different exemplars of the same identity making the same expression). Thus, the decoding of conspecific identity or expression type at the population level requires a generalization of that variable across two exemplars of each stimulus. Second, KNN classifiers suffer from "the curse of dimensionality" (Weber et al., 1998), in which data become increasingly sparse and distances less informative as dimensionality grows, particularly beyond 10 dimensions.
Despite these factors, the decoding accuracy for both expression and identity increased as the population size increased (Fig. 7). Expression decoding increased from 0.40 at a pseudo-population size of 2 to 0.68 at 280. Identity increased from 0.41 at size 2 to 0.80 at 280. Inspection of Figure 7 shows a non-monotonic but consistent increase in decoding accuracy as a function of population size. Additionally, the accuracy of identity was consistently above that of expression for all population sizes. This evidence supports the assertion that the decoding of expression and identity requires population-level activity from single units in the VLPFC. Additionally, it suggests that information about the identity of a conspecific is abundant at the population level.
Pseudo-populations of single units facilitate better decoding of identity and expression. Pseudo-populations of sizes 2–280 were constructed from random samples of the 280 units. A KNN decoder was used to classify the single trials by their nearest neighbors (Euclidean distance). The maximum accuracy from classification using five, seven, or nine neighbors was used as the accuracy metric. At each pseudo-population size, 50 repetitions of twofold CV were conducted (each repetition with a random selection of units) and averaged to produce the mean and standard error for the population size. The data was randomized and decoding was re-run to derive an empirical chance value (black line). Dots represent the mean decoding accuracy at a given population size for a particular variable, error bars represent the 95% confidence interval of the mean (±1.96 × standard error). Decoding accuracies for both variables increase as population size increases, with identity increasing faster and ultimately having higher accuracy at larger populations than expression.
Single-unit responses, like our naturalistic stimuli, were dynamic across time. Considering that accurate decoding could be achieved at the population level, we sought to understand the time course of information about expression type and conspecific identity across the population. To do so, we employed a time-binned decoding model, where pseudo-population firing rates were calculated for discrete, 200 ms time bins across the response period. The period from 250 ms before the stimulus to 1,450 ms after the stimulus was analyzed in 200 ms bins, overlapping by 100 ms. A decoding model using the maximum correlation coefficient between a single trial and mean templates of training trials was used to classify single trials based on the 285-unit pseudo-population response (fivefold CV, 50 iterations). This model was run for each 200 ms bin in the stimulus period and for each variable of interest (expression and identity; see Materials and Methods, Decoding of identity and expression over time). The mean and 95% confidence interval of the decoding accuracy were computed and plotted for each time bin in Figure 8.
Sliding bin decoding analysis reveals early peak of identity decoding. A decoding analysis was performed on 285 unit pseudo-population firing rates in 200 ms bins, across the response period. The data was randomized and decoding was re-run to derive an empirical chance value (black line). Dots represent the mean decoding accuracy for the variable of interest across 50 iterations of fivefold CV. Dots represent the center of the bin (i.e., a dot at 50 ms represents a bin from −50 to 150 ms). Error bars represent the 95% confidence interval of the mean (1.96 * standard error). Peak decoding of identity was higher and occurred earlier than that for expression. Additionally, the mean decoding accuracy for identity was greater than expression for all bins after stimulus onset.
The mean decoding accuracy for identity was higher than that of expression for all time bins after stimulus onset, reaching a peak of 0.61 at the bin centered on 250 ms post-stimulus onset. Expression decoding reached a peak of 0.51 at 450 ms after stimulus onset. The 95% confidence intervals for both variables were nonoverlapping for bins centered on 250, 350, 550, 650, 750, and 1,350 ms. Overall, identity decoding increased earlier and to a higher level than that for expression. Decoding for both variables remained above chance for multiple bins after the peak of their decoding accuracy without a clear return to chance over our response period.
This suggests that viewing expressive stimuli generates sustained population activity in the VLPFC that, in some cases, continues to carry information about conspecific identity and expression type after a stimulus has terminated. The early peak of identity decoding followed by expression decoding is consistent with the availability of this information over time in our naturalistic stimuli.
Identity shapes the population response to naturalistic stimuli
Across our decoding analyses, we observed earlier and persistently robust decoding of identity across the stimulus period compared with the decoding of expression (Fig. 8). Given these observations, we explored the relationships of both of these categorical variables within the population activity space. In particular, we sought to understand if there was an inherent structure within this space that would facilitate accurate decoding of conspecific identity. In order to do this, we conducted a PCA on the mean firing rates of our 285 responsive units for each of the nine stimuli (see Materials and Methods, Principal component, group distance, and trajectory analysis). Once again, because individual units were recorded during one of two examples of an identity–expression combination, this population analysis assumes a generalization of conspecific identity and expression type across two exemplars.
The variance explained by the first three PCs was 47.4% (PC1, 18.8%; PC2, 15.1%; PC3, 13.5%) without a clear inflection point across the eight PCs, suggesting the population response space is inherently high dimensional. Nonetheless, visualization of population responses in PC1 and PC2 displayed a trend found across all eight retained dimensions; within PCs 1 and 2, population responses to the same identity appeared to occupy nonoverlapping subspaces within the response space (Fig. 9A). In other words, although a population response might be closer to that of a different conspecific identity than to one of its own (as is the case for BK-AFL and ST-AGG), responses to the same identities could be connected together without overlapping the space occupied by other connected identities. This suggested that population responses to the same identity might be closer together than those for expression types or those for mismatching stimuli in the population space.
Population responses to the same identity are closer together in PCA space than responses to the same expression. The mean firing rates over 0–1,500 ms for each stimulus, for all 285 responsive units, were analyzed using PCA. A, The population responses to each of the 9 stimuli are reconstructed in PC1 and PC2. Population responses are colored by identity, to highlight that identities appear to occupy nonoverlapping portions of the overall response space. B, The distances between population responses to similar identities (ID), expressions (EXP), or mismatching stimuli (OTHER) were measured and grouped together in reconstructed spaces preserving PCs 1–8. There were significant effects of PCs preserved (p < 0.005) and group (p < 0.005) in a two-way ANOVA (adjusted for unbalanced groups). Tukey's HSD testing showed significant differences between ID-EXP (p < 0.005) and ID-OTHER (p < 0.005). The difference between EXP-OTHER was not significant (p = 0.95).
To quantitatively characterize this phenomenon, we measured the Euclidean distances between all population responses and grouped them according to whether the distance was between population responses to the same conspecific identity (ID), the same expression type (EXP), or between responses to mismatches of both expression and identity (OTHER). We conducted this analysis for population spaces ranging from PC1 to PC8 (Fig. 9B). In each population space, the median distance for the ID group was lower than the median distances for EXP or OTHER. A two-way ANOVA model fitting distance as a function of dimensions and group was statistically significant for both factors (dimension p < 0.005; group p < 0.005; ANOVA model, distance = dimension + group; see Eq. 4). Post hoc testing with Tukey's HSD revealed significant differences between ID and EXP as well as ID and OTHER (p < 0.005 for both). EXP and OTHER were not significantly different (p = 0.95). Similar trends were seen when data were analyzed for each subject independently (data not shown).
Given that population responses to the same identity tended to be closer together in the population space, we pursued an understanding of the relationships between population responses across time. We particularly sought to assess whether population responses had a consistent direction of movement toward their final positions in the population response space, based on the identity of the response. To do so, we projected the mean firing rates of the 285 neuron population responses at various time points along the 0–1,500 ms stimulus period into a common population response space (see Materials and Methods, Principal component, group distance, and trajectory analysis). By plotting each population response's trajectory to its final position, we were able to visualize the average response of the population, over time, to each of the stimuli in our set (Fig. 10). Viewed from multiple PC angles, a population response to a particular conspecific identity appears to emanate from a central region, common to all responses, and follow a trajectory, through time, that is similar to its identity counterparts. In views where responses to the same identity appear to splay in separate directions, orthogonal views corroborate that the appearance of splay is typically an artifact of the viewing angle (e.g., ST in teal; Fig. 10C shows splaying while A and B do not). This consistency of trajectory is not readily seen with population responses to the same expression type (Fig. 10D).
Figure 10. Population responses to naturalistic stimuli follow identity-based trajectories in PCA space. The positions of the population response to each stimulus through time were projected onto the 0–1,500 ms firing rate PCA space from Figure 9. Identities are colored as in the previous figure; marker shapes also indicate identity. Each marker represents the position of the population response at a point in time for a given stimulus. Marker transparency is proportional to the time point: highly transparent markers represent firing rates over the early response period (e.g., 0–150 ms), and less transparent markers represent firing rates over a longer response period (e.g., 0–1,200 ms). Trajectories are connected by lines for ease of following. A, The population mean trajectories viewed in PC1 and PC2. B, PC1 and PC3. C, PC2 and PC3. D, The same as B, recolored to group expressions rather than identities (red, aggressive; blue, affiliative; green, neutral). Trajectories emanate from a central region, and trajectories of the same identity appear to move toward a common subspace.
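The trajectory projection can be sketched in a few lines, again under stated assumptions: cumulative-window firing rates (e.g., 0–150 ms, 0–300 ms, and so on) are projected through the same PCA model fit on the full 0–1,500 ms rates, so that each stimulus traces a path in the common response space. File and variable names are hypothetical.

```python
# Sketch of the trajectory projection into a common PC space (illustrative assumptions).
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical inputs: full-window rates (9 stimuli x 285 neurons) and cumulative-window
# rates (n_windows x 9 stimuli x 285 neurons), e.g., 0-150 ms, 0-300 ms, ..., 0-1,500 ms.
rates_full = np.load("mean_rates_9x285.npy")
rates_by_t = np.load("cumulative_rates.npy")

pca = PCA(n_components=8).fit(rates_full)                         # common space from the full window
trajectories = np.stack([pca.transform(r) for r in rates_by_t])   # (n_windows, 9, 8)

# trajectories[:, s, :2] traces the PC1/PC2 path of stimulus s over time, as in Fig. 10A;
# in the figure, marker transparency corresponds to the length of the cumulative window.
```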
Discussion
Our findings indicate that, despite the diversity of responses exhibited at the single-cell level, population activity in the macaque VLPFC in response to the audiovisual expressions of conspecifics is primarily organized around the identity of the conspecific attended to. Single neurons responsive to audiovisual expression stimuli in the VLPFC exhibited complex firing patterns that varied over time (Fig. 2) and showed weak selectivity for specific stimuli, expression type, or conspecific identity (Figs. 3, 4). Despite this low selectivity, both conspecific identity and expression type exerted a systematic influence on neuronal firing rates (Fig. 4), suggesting that information about both factors is transmitted by VLPFC neurons during perception of a vocalization–expression. Information from a neural population, rather than from single neurons, was necessary to decode identity and expression type with appreciable accuracy (Fig. 7). Specifically, the higher decoding accuracy for identity at all population sizes, as well as its earlier and higher peak in decoding accuracy compared with expression type (Fig. 8), support a model in which identity is the primary factor determining the population activity of the VLPFC. Our principal components analysis showed that the structure of the aggregate activity of responsive neurons segregates population responses to the same conspecific identity into subspaces of the total response space (Fig. 9) and that population activity over the response period evolves cohesively, such that responses to the same identity follow a similar trajectory across time (Fig. 10). Together, these findings support the primacy of identity in determining the population activity of the VLPFC during perception of naturalistic expressions. Furthermore, this activity is information rich, allowing accurate decoding of expression type at the population level, and potentially of many more variables, given the diversity of responses seen at the single-neuron level. These findings have direct implications for the study of circuits supporting social communication.
VLPFC population activity reflects the identity of conspecifics
Studies of the macaque VLPFC have consistently provided evidence for its critical role in social function. Functional neuroimaging during the viewing of social stimuli reveals robust VLPFC activation within a network of regions classically involved in perception (Russ and Leopold, 2015; Shepherd and Freiwald, 2018), and face patches have been identified in macaque VLPFC (Tsao, Schweers, et al., 2008). In the auditory domain, VLPFC neurons show little responsivity to simple auditory features but respond robustly to species-specific vocalizations, particularly when these are combined with accompanying facial gestures, as in the present experiment (Romanski and Goldman-Rakic, 2002; Romanski et al., 2005; Sugihara et al., 2006; Romanski and Averbeck, 2009; Romanski and Hwang, 2012; Diehl and Romanski, 2014; Kuraoka et al., 2015).
Our combination of single-unit and population-level analyses illustrates the emergence of a cohesive population structure from a variety of relatively nonselective neurons. Although we found virtually no neurons robustly selective for a single identity, time-integrated population activity in the VLPFC supported the separation of dynamic multimodal stimuli based on identity. Importantly, we recorded and analyzed neurons responsive to naturalistic stimuli broadly across the VLPFC, without regard to facial or vocal selectivity, and showed that population-level activity could be used to accurately decode both expression type and conspecific identity and that the population response followed a time-evolving structure aligned with identity. As such, our results are perhaps more aligned with studies of the functional nature of the entire VLPFC under naturalistic conditions (Russ and Leopold, 2015; Shepherd and Freiwald, 2018) than with those specifically targeting the face patches with a large variety of static face stimuli (Barat et al., 2018). Our findings, combined with the well-established anatomical connectivity between the VLPFC and temporal cortex (Barbas, 1988; Felleman and Van Essen, 1991; Romanski, Bates, et al., 1999; Romanski, Tian, et al., 1999; Kravitz et al., 2013) and the prevalence of identity information across the face-patch network (Chang and Tsao, 2017; Yang and Freiwald, 2021), support the idea that the VLPFC is a critical region for social communication.
One drawback of our pseudo-population analysis must be noted: the technique assumes that the pooled neurons would fire independently of one another, in real time, during a single trial, thereby neglecting noise correlations that may reshape the population response space (Abbott and Dayan, 1999; Averbeck et al., 2006). Whether this assumption enhances or degrades the ability to decode stimuli from our population is difficult to predict a priori (Wu et al., 2001; Averbeck and Lee, 2006; Ecker et al., 2011), and generalization from other empirical studies is equally difficult because of task- and stimulus-specific effects on noise correlations (Cohen and Newsome, 2008; Qi and Constantinidis, 2012). As such, noise correlations in the VLPFC during expression perception should be the subject of further studies to determine the generalizability of the population characteristics reported here.
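The following toy example (not the study's code; names and shapes are assumptions) illustrates the assumption in question: because pseudo-population trials are assembled by resampling each neuron's trials independently, any trial-to-trial covariance that would exist between simultaneously recorded neurons is removed.

```python
# Toy demonstration that pseudo-population construction discards noise correlations.
import numpy as np

rng = np.random.default_rng(0)

def build_pseudo_trials(per_neuron_trials, n_pseudo):
    """per_neuron_trials: list of (n_trials_i,) firing-rate arrays, one per neuron,
    all for the same stimulus. Returns an (n_pseudo, n_neurons) pseudo-trial matrix."""
    cols = [rng.choice(tr, size=n_pseudo, replace=True)    # independent resampling per
            for tr in per_neuron_trials]                   # neuron removes shared variability
    return np.column_stack(cols)

# Two hypothetical neurons whose real trials covary; the pseudo-population does not.
shared = rng.normal(size=200)
n1 = 10 + shared + rng.normal(size=200)
n2 = 12 + shared + rng.normal(size=200)
print(np.corrcoef(n1, n2)[0, 1])                           # substantial correlation (~0.5)
pseudo = build_pseudo_trials([n1, n2], n_pseudo=200)
print(np.corrcoef(pseudo[:, 0], pseudo[:, 1])[0, 1])       # near zero after resampling
```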
Decoding of expression is a feature of a mixed-selective neural population and naturalistic stimuli
Even though single neurons could not be categorized as “identity” or “expression” responsive, a population decoding approach found evidence for the representation of both variables. This finding is consistent with previous studies of mixed-selective neurons in the prefrontal cortex and amygdala (Rigotti et al., 2013; Saez et al., 2015), as well as with population coding of abstract variables in the prefrontal cortex (Meyers et al., 2008; Mendoza-Halliday and Martinez-Trujillo, 2017; Murray et al., 2017; Parthasarathy et al., 2017; Fine et al., 2023). Neural responses in our study resembled those of mixed-selective neurons (Fig. 5), in that they had dynamic, robust responses to one or a few stimuli and responded, to a lesser degree, to other stimuli. Previous studies have shown that VLPFC neurons exhibit nonlinear multisensory integration and context-dependent firing in response to naturalistic expression stimuli (Sugihara et al., 2006; Diehl and Romanski, 2014; Diehl et al., 2022). As such, it is highly likely that the multiple streams of sensory input converging in the VLPFC (Romanski, Tian, et al., 1999; Kravitz et al., 2013) are combined via nonlinear mixed selectivity, fundamentally increasing the dimensionality of the population response space (Rigotti et al., 2013). Computational studies have shown that such an encoding strategy is a highly efficient method for creating population neural representations (Johnston et al., 2020) that balances the need to discriminate between correlated sensory inputs and to generalize across noisy permutations of the same stimulus (Barak et al., 2013). Our ability to decode expression with increasing accuracy (Fig. 7), despite the dominance of identity in the population space, suggests that expression information may be represented in higher dimensions or within a linked orthogonal subspace (Elsayed et al., 2016).
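As a toy illustration of this last point (not an analysis from the present study), nonlinear mixing of identity and expression inputs expands the dimensionality of the stimulus representation: a purely linear mixture of two one-hot codes spans at most five dimensions for nine stimuli, whereas a simple rectifying nonlinearity typically yields a full-rank, nine-dimensional representation in which identity-expression conjunctions become linearly separable.

```python
# Toy sketch of nonlinear mixed selectivity increasing representational dimensionality.
import numpy as np

rng = np.random.default_rng(1)

# One-hot identity (3 levels) and expression (3 levels) codes for the 9 stimuli.
identity = np.repeat(np.eye(3), 3, axis=0)
expression = np.tile(np.eye(3), (3, 1))
inputs = np.hstack([identity, expression])        # (9, 6) stimulus features

W = rng.normal(size=(6, 285))                     # random mixing weights onto 285 model "neurons"
linear = inputs @ W                               # linear mixed selectivity
nonlinear = np.maximum(linear, 0)                 # rectification: a simple nonlinear mixing

print(np.linalg.matrix_rank(linear))              # 5: the combined one-hot codes are not full rank
print(np.linalg.matrix_rank(nonlinear))           # typically 9: every stimulus conjunction separable
```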
Ultimately, our findings provide clarity regarding the presence of categorical representations of facial (Kuraoka et al., 2015) and vocal (Romanski et al., 2005; Romanski and Averbeck, 2009) stimuli in the nonhuman primate VLPFC. Whereas these previous studies relied on single-unit analyses, the population-based analyses employed here provide a clearer picture of the social information represented in, and the computational strategy employed by, this brain region. These findings support the viewpoint that the neural population of the VLPFC is the “essential unit of computation” (Saxena and Cunningham, 2019) and encourage multiscale analysis of VLPFC neural activity (Panzeri et al., 2015) in the context of studying social function.
An additional factor facilitating the decoding of expression is our use of naturalistic stimuli with synchronous, dynamic auditory and visual information, which affords additional features in the time domain. For example, the expansion and contraction of the mouth, presented with a simultaneous rapid increase in the intensity of a vocalization, may be a feature encoded by these neurons. Most neurophysiology studies of expression and identity in the macaque employ static visual stimuli (Gothard et al., 2007; Tsao, Schweers, et al., 2008; Barat et al., 2018), present an auditory stimulus with a static image (Barat et al., 2018), and/or separate the visual movie from the vocalization (Kuraoka and Nakamura, 2007; Kuraoka et al., 2015). These paradigms afford greater control of stimulus timing and increase the variety of stimuli, but at the cost of failing to recapitulate naturalistic multisensory integration and dynamic multisensory features.
As a result of our approach, our stimuli were constrained by the natural duration of the audiovisual expression and by what exemplars could be elicited from live subjects, such that subtle features (direction of gaze or the amplitude of the mouth movement) could not be controlled for without degrading the natural quality or timing of the expressions. Additionally, independent sensory modalities were not tested, and thus, it is not possible to compare whether visual or auditory information was the major contributor of expression or identity information in this study. Fortunately, recent studies of macaque social perception (Khandhadia et al., 2021) have started employing natural audiovisual stimuli and naturalistic 3D macaque avatars for which specific features of expression, gaze, and facial action can be parametrically manipulated within a single identity (Murphy and Leopold, 2019; Wilson et al., 2020). Further single-unit and population-level studies employing these methods are needed to understand the specific correlates of expression that are encoded in the activity of neurons in the social communication network.
Footnotes
We thank Oishee Raman and John Housel for their technical assistance and expertise in spike sorting, animal training, and histology. We also thank Marc Scheiber for his guidance and assistance with the FMA surgical procedures. This study was supported by NIH DC 04845 (L.M.R.), The Schmitt Program for Integrative Neuroscience (L.M.R.), the Center for Visual Science (P30 EY001319), F30 MH122048 (K.K.S.), and the University of Rochester–Medical Scientist Training Program (T32 GM007356).
The authors declare no competing financial interests.
T.L.'s present address: Astrobotic Technology Inc., Pittsburgh, Pennsylvania.
E.R.A.'s present address: University of Miami School of Medicine, Miami, Florida.
Correspondence should be addressed to Lizabeth M. Romanski at Liz_romanski@urmc.rochester.edu.