Abstract
Vocal communication is a complex social behavior that entails the integration of auditory perception and vocal production. Both anatomical and functional evidence have implicated the anterior cingulate cortex (ACC), including area 32, in these processes, but the dynamics of neural responses in area 32 during naturalistic vocal interactions remain poorly understood. Here, we addressed this by recording the activity of single area 32 neurons using chronically implanted ultrahigh-density Neuropixels probes in freely moving male common marmosets (Callithrix jacchus) engaged in an antiphonal calling paradigm in which they exchanged long-distance “phee” calls with a virtual conspecific. We found that many neurons exhibited complex modulations in discharge rates in response to presented calls, prior to and following self-generated calls, and during the interval between presented and produced vocalizations. These findings are consistent with the conceptualization of area 32 as an audiovocal interface integrating auditory information, cognitive processes, and motor outputs in the service of vocal communication.
Significance Statement
Vocal communication is indispensable in the daily lives of social animals, including primates. This sophisticated ability requires the processing and production of vocalizations across fluid social contexts. Vocal behavior is controlled by a large network of brain areas. Area 32 within the anterior cingulate cortex may be a linchpin of this network, as it is interconnected with both auditory cortical areas and subcortical structures engaged in vocal control. This position is ideal for integrating the auditory, motor, and cognitive signals serving vocal behavior. We show for the first time that neural correlates of these three signal types are present in area 32 neurons recorded in freely moving marmosets during naturalistic vocal exchanges.
Introduction
Vocal communication is a complex behavior that entails interactions between a sender, an auditory signal, and a receiver, all of which operate under the influence of fluidly changing contexts (Bradbury and Vehrencamp, 2000). In primates, this complexity is reflected in the fact that the neural circuits instantiating the linked processes of auditory perception and vocal production of social signals range broadly across the neuraxis, involving subcortical structures and cortical areas including the anterior cingulate cortex (ACC; Jürgens, 2009; Grijseels et al., 2023).
A large body of evidence supports a general role of the ACC in the integration of sensory, motor, and cognitive processes (Paus, 2001; Amodio and Frith, 2006; Kolling et al., 2016), extending to those underpinning social interactions (Hadland et al., 2003; Rudebeck et al., 2006). This includes critical aspects of vocal communication such as the auditory processing of conspecific vocalizations and vocal production. ACC areas 32 and 10m/32v are interconnected extensively with auditory cortices including higher-order areas responsive to conspecific vocalizations (Tian et al., 2001; Medalla et al., 2007; Medalla and Barbas, 2014) and indeed both ultrahigh-field fMRI (Jafari et al., 2023; Dureux et al., 2024) and electrophysiological (Gilliland et al., 2024) investigations have revealed robust responses to these stimuli within area 32. Pregenual and subgenual ACC, which encompass area 32, have also been associated with vocal production. These areas send substantial anatomical projections to the periaqueductal gray (PAG; Müller-Preuss and Jürgens, 1976; Mantyh, 1983), a midbrain region critically involved in vocal control (Jürgens, 1994), and lesions or microstimulation of these areas impair (Sutton et al., 1974) or evoke (Jürgens and Ploog, 1970) vocalizations, respectively. Altogether, the combined weight of anatomical and functional evidence suggests that area 32 has properties consistent with those of an “audiovocal interface” linking perceived and produced vocalizations (Jürgens, 2009).
Beyond the studies noted above, little is known regarding the specific role of ACC area 32 in vocal communication either in or outside of a social context. This is due in part to the relatively limited vocal repertoire of rhesus macaques and the technical difficulties inherent in conducting electrophysiological studies in freely moving animals with naturalistic tasks that encourage the production of species-typical vocal exchanges. The common marmoset (Callithrix jacchus) is a model species within which these challenges can be addressed. As an arboreal species, this small New World primate relies on vocal communication to facilitate social cohesion and survival in the tree canopy (Bezerra and Souto, 2008). One aspect of this communication is the species-typical long-distance “Phee” call, which is used to maintain contact between group members and typically expressed in a call-and-response pattern (Takahashi et al., 2013). These antiphonal exchanges are reliably produced in the laboratory, can be modified by a number of contextual factors, and have been exceptionally well characterized (Miller and Wang, 2006; Miller et al., 2009, 2010; Roy et al., 2011). The small size of these animals also makes them ideal candidates for freely moving wireless recordings (Miller et al., 2015; Walker et al., 2021; Wong et al., 2023).
Here, we investigated the role of area 32 neurons in vocal communication by chronically implanting an ultrahigh-density Neuropixels probe (Jun et al., 2017) in area 32 and recording single neurons in freely moving marmosets engaged in bouts of antiphonal calling behavior. Consistent with the notion that this area acts as an audiovocal interface, we found that a large proportion of neurons exhibited robust modulations of discharge rates in response to externally presented vocalizations as well as before, during, and after self-generated phee calls. We additionally observed varying dynamics of excitation and suppression between presented and produced calls, suggestive of an involvement in differentiating perceived and produced vocalizations in social contexts.
Materials and Methods
Subjects
Two adult male common marmosets (C. jacchus) participated in this study (Marmoset R, age 33 months, weight 463 g; Marmoset A, age 49 months, weight 374 g). Both animals were under the close supervision of veterinarians throughout their participation. All experiments were performed in compliance with the Canadian Council on Animal Care policy on the care and use of laboratory animals, and the experimental protocol was approved by the Animal Care Committee of the University of Western Ontario Council on Animal Care.
Surgical preparation of animals for electrophysiological recordings
In preparation for electrophysiological recordings, each marmoset underwent an aseptic surgical procedure with the dual purpose of creating trephinations in the skull to allow access to the cortex and fixing a recording chamber to the skull. A microdrill was used to create an ∼2 mm trephination in the right hemisphere above area 32 and lateral to the sagittal sinus, based on subject-specific preoperative anatomical MRI scans and stereotaxic coordinates (Paxinos et al., 2012). The trephination was then sealed with silicone (Kwik-Sil, WPI International). A second trephination was made roughly 10 mm from this location within which a gold amphenol contact (FST 19003) was secured with dental adhesive for use as an electrical ground in the electrophysiological recordings. A recording chamber was fixed to the skull using a combination of universal dental adhesive (All-Bond Universal, Curion) and UV-cured dental resin cement (Core-Flo DC, Curion) and covered with a protective cap. This chamber provided controlled access to the trephinations and allowed for stabilization of the head during electrode insertion and implantation. Animals received analgesics (buprenorphine for 2 d and acetaminophen for 5–7 d) and an anti-inflammatory steroid (dexamethasone for 5–7 d) postsurgically. Detailed descriptions of these procedures have been reported previously (Johnston et al., 2018; Schaeffer et al., 2019).
Vocal recording and preparation of auditory stimuli
To prepare exemplar calls for use as auditory stimuli in this experiment, we recorded bouts of antiphonal calling between pair-housed marmosets that were temporarily placed in separate cages in the recording room. Auditory recordings were made using a Sennheiser MKH 8050 microphone (supercardioid pickup pattern) positioned at one of the cages and connected to a MacBook Pro (2021, macOS Ventura 13) via an 18 V phantom power supply (NEEWER NW-100, NEEWER). Four phee calls from an adult male marmoset unfamiliar to the two experimental animals (Fig. 1A) were selected as stimuli.
Experimental setup, vocal response behavior, neuronal yield, and implant location in marmosets. A, Schematic of the experimental setup. Neural recordings from two freely moving marmosets were captured using high-density Neuropixels probes implanted in the anterior cingulate cortex, with data logged via a SpikeGadgets datalogger. The marmosets were presented with conspecific phee calls while their own vocalizations were also tracked. Created in BioRender. Everling, S. (2025) https://BioRender.com/x06y352. B, Illustration of precise detection of vocalization onsets and offsets using Raven Pro. C, Histogram of response times from presented to produced phee calls in marmosets R and A. D, Number of neurons recorded in four sessions postimplantation for each marmoset. E, F, Ex vivo MRI scans at 15.2T showing electrode tracts (yellow arrows) in the anterior cingulate cortex for Marmosets R (E) and A (F). The overlaid color-shaded brain areas (Paxinos et al., 2012) illustrate targeted and adjacent cytoarchitectonic areas. AC, anterior commissure; CC, corpus callosum; ON, optic nerve.
Electrophysiological localization of area 32 and implantation of chronic Neuropixels probes
Prior to electrode implantation and neurophysiological recordings, Marmosets R and A were acclimated to the recording room. To allow unrestricted movement during electrophysiological recordings, enabling marmosets to engage in more naturalistic vocal communication, we recorded the activity of single neurons using Neuropixels 1.0 short probes (Jun et al., 2017) implanted chronically within area 32. Prior to implantation, we first identified area 32 based on location and depth from the cortical surface in separate recording sessions in which the animals were restrained as in traditional electrophysiological recording experiments. Detailed descriptions of these procedures have been published previously (Gilliland et al., 2024). These locations were confirmed at the end of the experiments by ex vivo ultrahigh-field structural MRI (see below, Confirmation of recording sites with ex vivo MRI).
After localizing area 32, we conducted a single chronic implantation session in each animal, in which we lowered a single Neuropixels 1.0 NHP probe into place at the optimal location determined during the localization sessions, permitting datalogger-based neural and vocal recordings in the freely moving animal. These implantation sessions were conducted identically to localization sessions. Neural activity was monitored while advancing the electrode as noted above, and once it had reached a location within area 32 at which we observed well-isolated neurons, we allowed it to settle for 30 min. Following this, we fixed the electrode in place using a two-step process. We first carefully flowed silicone (Kwik-Sil, WPI International) around the electrode shank within the trephination until it was completely sealed and the shank was fully encased. We then covered this with UV-cured dental resin cement (Core-Flo DC, Curion) until the probe was secured up to roughly 5 mm above the skull surface. Once the cement was cured, we carefully detached the custom electrode holder from the probe and slowly raised it using the stereotaxic micromanipulator. Once this was done, we added additional layers of cement to fully encase the probe base. Following this, we cemented a custom-designed, 3D-printed protective cone/headstage holder in place around the electrode. This served the dual purpose of protecting the site of electrode implantation and providing an anchor point for the SpikeGadgets headstage and datalogger (SpikeGadgets).
Simultaneous vocal and neural recordings in freely moving marmosets during phee call broadcasts
For each recording session, the animal was transported to the recording room and prepared for untethered datalogger recordings. To do this, we attached the SpikeGadgets Neuropixels datalogger headstage to the previously implanted headstage holder and commenced recording. Neural data were recorded on a microSD card inserted into the headstage (Samsung PRO Plus Micro SD 256 GB). The SpikeGadgets system was configured to receive both analog inputs from a small lavalier microphone used to record vocalizations (Hollyland Technology Lark M1) and digital sync pulses from a Raspberry Pi 3 Model B, which was used to broadcast the previously recorded calls via a Bose SoundLink III speaker (Bose Corp) placed at a distance of 1 m from the animal. These inputs were directed to the SpikeGadgets environmental control system (ECU), which served as an interface to the SpikeGadgets MCU and enabled synchronization of played and recorded calls with neural data. Neural recordings were controlled by the SpikeGadgets Trodes software package.
After attaching the headstage and initiating wireless recording, the animal was released into a 30 cm × 20 cm × 30 cm transfer box with clear sides allowing full visibility of the recording lab, within which it was allowed unrestricted movement. The lavalier microphone was positioned at a small opening at the top of the box to sample vocalizations. Auditory stimuli consisting of the previously recorded phee calls were played manually under the control of the experimenter, triggered by keystrokes via custom-written Python code on the Raspberry Pi. The dynamics of stimulus presentation were intended to mimic those of natural antiphonal calling. In some cases, call sequences were initiated by calls broadcast by the experimenter, while in others, spontaneous calls produced by the animals were “answered” by the experimenter. Response calls were initiated within 6 s of calls produced by the animal, based on natural antiphonal calling dynamics (Miller and Wang, 2006). Each session lasted ∼30−45 min. The duration of each session was dictated by the animal’s willingness to respond to played calls or generate spontaneous calls and was ended by the experimenter.
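For illustration, the logic of keystroke-triggered playback with a digital sync pulse can be sketched as follows. This is a hypothetical reconstruction, not the script used in the experiments: the GPIO pin number, stimulus file names, use of aplay for playback, and console-based key entry are all assumptions.

```python
# Hypothetical sketch of keystroke-triggered call playback with a TTL sync
# pulse on a Raspberry Pi; not the script used in the experiments.
import subprocess
import RPi.GPIO as GPIO

SYNC_PIN = 17                                   # assumed BCM pin wired to the SpikeGadgets ECU
STIMULI = {"1": "phee_01.wav", "2": "phee_02.wav",
           "3": "phee_03.wav", "4": "phee_04.wav"}

GPIO.setmode(GPIO.BCM)
GPIO.setup(SYNC_PIN, GPIO.OUT, initial=GPIO.LOW)

try:
    while True:
        key = input("Press 1-4 to broadcast a phee exemplar (q to quit): ").strip()
        if key == "q":
            break
        if key in STIMULI:
            GPIO.output(SYNC_PIN, GPIO.HIGH)           # mark stimulus onset for the ECU
            subprocess.run(["aplay", STIMULI[key]])    # blocking playback through the speaker
            GPIO.output(SYNC_PIN, GPIO.LOW)            # mark stimulus offset
finally:
    GPIO.cleanup()
```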
Automated spike sorting with manual curation
Spike sorting was performed using Kilosort4 (Pachitariu et al., 2016) followed by manual curation with Phy. Preprocessing involved median filtering and whitening, followed by adaptive template matching. Single-unit classification was based on waveform consistency, amplitude distribution, autocorrelogram characteristics, and unit stability. Detailed descriptions of this process have been reported previously (Gilliland et al., 2024; Selvanayagam et al., 2024).
Automated detection of phee calls
Times of call onset and offset were determined using a band-limited entropy detector in the Raven Pro software program (Cornell Lab of Ornithology; https://www.ravensoundsoftware.com/software/raven-pro/) and were additionally visually inspected and manually corrected as needed.
Confirmation of recording sites with ex vivo MRI
Following data collection, the marmosets were perfused, and electrode tracts were reconstructed using ex vivo ultrahigh-field MRI. Animals were deeply anesthetized, perfused transcardially, and brains were postfixed in 10% buffered formalin. MRI scans were conducted at 15.2 T (Bruker BioSpec Avance Neo) using a 35 mm quadrature detection volume coil. T2*-weighted images (75 × 75 × 75 µm resolution) were acquired and registered to the NIH ex vivo marmoset brain atlas (Liu et al., 2018). These procedures have been described in detail previously (Gilliland et al., 2024).
Experimental design and statistical analysis
All analyses were performed using scripts custom-written in Matlab (MathWorks). The analysis pipeline was designed to extract, process, and interpret neural data aligned to both the presentation of phee calls and the onset of produced phee calls. The first vocalization occurring after each presented phee call was identified. Matching trials, along with the time differences between stimulus and vocalization onsets, were stored for further analysis. Neural spike data were loaded from Kilosort-generated files. Spike times were converted to milliseconds and aligned to two reference points: presented phee call onset and produced vocalization onset. For each neuron, a binary spike matrix was generated, representing spike occurrence across trials and time.
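The alignment step can be illustrated with a minimal Python sketch operating on standard Kilosort output files (spike_times.npy, spike_clusters.npy). The original pipeline was written in MATLAB; the 30 kHz sampling rate and the alignment window below are assumptions for illustration.

```python
# Minimal sketch: convert Kilosort spike times to ms and build a trials x time
# binary spike matrix aligned to event onsets (presented or produced calls).
import numpy as np

FS = 30000.0                                    # assumed sampling rate (Hz)
spike_samples = np.load("spike_times.npy").ravel()
spike_clusters = np.load("spike_clusters.npy").ravel()
spike_ms = spike_samples / FS * 1000.0          # spike times in milliseconds

def binary_spike_matrix(unit_id, event_ms, window=(-1000, 1000)):
    """Return a trials x milliseconds 0/1 matrix aligned to each event time (ms)."""
    unit_spikes = spike_ms[spike_clusters == unit_id]
    n_bins = int(window[1] - window[0])
    mat = np.zeros((len(event_ms), n_bins), dtype=np.uint8)
    for i, t0 in enumerate(event_ms):
        rel = unit_spikes[(unit_spikes >= t0 + window[0]) &
                          (unit_spikes < t0 + window[1])] - (t0 + window[0])
        mat[i, rel.astype(int)] = 1
    return mat
```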
To assess neuronal responsiveness, spike counts in three phases—stimulus (0–1,000 ms after phee call presentation), prevocalization (−2,000 to 0 ms prior to produced vocalization), and postvocalization (0–1,000 ms after produced vocalization)—were compared to baseline activity (1,000 ms prior to phee call presentation) using Wilcoxon signed-rank tests at p < 0.05. Neurons were classified based on significant activity changes in one or more phases. Categories included stimulus-only, prevocalization-only, postvocalization-only, combinations of these phases, and nonresponsive neurons. The proportions of neurons in each category were calculated and visualized in pie charts for each of the two sessions, providing an overview of the neural population's functional diversity.
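A minimal sketch of this per-neuron responsiveness test, assuming per-trial spike counts have already been extracted for each epoch (the original analysis was implemented in MATLAB):

```python
# Paired Wilcoxon signed-rank tests of each phase against baseline. All inputs
# are 1-D arrays of per-trial spike counts in matching trial order.
import numpy as np
from scipy.stats import wilcoxon

def classify_neuron(baseline, stimulus, pre, post, alpha=0.05):
    responsive = set()
    for label, counts in [("stimulus", stimulus), ("pre", pre), ("post", post)]:
        if np.any(counts != baseline):            # wilcoxon requires nonzero differences
            _, p = wilcoxon(counts, baseline)
            if p < alpha:
                responsive.add(label)
    return responsive                             # e.g., {"stimulus", "post"} or an empty set
```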
Spike counts were aggregated and smoothed using a 100 ms Gaussian kernel to generate PSTHs illustrating the temporal firing patterns of neurons relative to call presentation and call production. Principal component analysis (PCA) was applied to z-normalized neural activity aligned to call presentation (−1,000 to 1,000 ms) and call production (−2,000 to 1,500 ms) to reduce data dimensionality and identify dominant patterns of activity. The first principal component of the activity aligned to call presentation and the second principal component of the activity aligned to call production were used as input features for k-means clustering, which was run 1,000 times. The optimal number of clusters was identified using silhouette analysis, which evaluated the consistency and separation of clusters. Neurons in each cluster were visualized in scatterplots of PC1 (activity aligned to call presentation) against PC2 (activity aligned to call production). Average z-normalized activity time courses for each cluster, aligned to both call presentation and call production, were plotted with standard error of the mean (SEM). To visualize the activity pattern of all single neurons in the identified clusters, we generated heatmaps of z-normalized neural activity binned into 25 ms intervals and aligned to call presentation and call production. Neurons were sorted by the time to reach 50% of their maximal activity for call presentation or call production.
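The smoothing, dimensionality reduction, and clustering steps can be sketched as below, assuming z-normalized neurons × time-bin matrices for the two alignments. Interpreting "run 1,000 times" as 1,000 k-means initializations, treating the 100 ms kernel width as the Gaussian sigma, and the candidate range of cluster numbers are assumptions; the original analysis used custom MATLAB scripts.

```python
# Illustrative sketch of PSTH smoothing, PCA, and k-means clustering with
# silhouette-based selection of the number of clusters.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def smooth_psth(rates, bin_ms=25, kernel_ms=100):
    """Gaussian-smooth firing rates along the time axis (sigma expressed in bins)."""
    return gaussian_filter1d(rates, sigma=kernel_ms / bin_ms, axis=-1)

def cluster_response_types(pres, prod, k_range=range(2, 9)):
    # Per-neuron scores on the dominant components of each alignment
    pc1_pres = PCA(n_components=2).fit_transform(pres)[:, 0]   # PC1, presented calls
    pc2_prod = PCA(n_components=2).fit_transform(prod)[:, 1]   # PC2, produced calls
    features = np.column_stack([pc1_pres, pc2_prod])

    # Choose k at the peak average silhouette value
    best_k = max(k_range, key=lambda k: silhouette_score(
        features,
        KMeans(n_clusters=k, n_init=1000, random_state=0).fit_predict(features)))
    return KMeans(n_clusters=best_k, n_init=1000, random_state=0).fit_predict(features)
```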
To assess whether neuronal responses differed between conditions such as antiphonal and non-antiphonal trials, we extracted z-normalized firing rate traces aligned to call presentation and call production for all neurons within a cluster. Neurons were grouped according to their cluster assignment, and their activity was separated into conditions. For each neuron and condition, we computed the mean response and SEM across trials. To determine whether the activity differed significantly across conditions at each time point, we performed paired Wilcoxon signed-rank tests across neurons within each cluster. This yielded a time series of p values, which were corrected for multiple comparisons using the Benjamini and Hochberg (1995) false discovery rate (FDR) procedure (q = 0.05). Time points showing significant differences (p < 0.05, FDR corrected) are marked by black horizontal bars in the figures.
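A sketch of this time-resolved comparison with Benjamini–Hochberg correction, assuming condition-wise activity matrices (neurons × time points) for one cluster; this mirrors the description above rather than reproducing the original MATLAB code.

```python
# Paired Wilcoxon signed-rank test at each time point across neurons, followed
# by Benjamini-Hochberg FDR correction across time points.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def significant_timepoints(cond_a, cond_b, q=0.05):
    n_time = cond_a.shape[1]
    pvals = np.ones(n_time)
    for t in range(n_time):
        if np.any(cond_a[:, t] != cond_b[:, t]):      # skip degenerate time points
            _, pvals[t] = wilcoxon(cond_a[:, t], cond_b[:, t])
    reject, _, _, _ = multipletests(pvals, alpha=q, method="fdr_bh")
    return reject                                     # boolean mask of significant time points
```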
To investigate neural response differences between presented and generated vocalizations, we performed a classification analysis using raw spiking data in 1 s windows immediately after the presentation of calls and in the 1 s after the production of calls. Data were split into training (50%) and testing (50%) sets. A support vector machine (SVM) classifier with a linear kernel was trained on the training set and tested on the held-out data. This process was repeated 1,000 times to assess classification accuracy under varying splits of the data. Classification accuracy was calculated for each iteration as the proportion of correctly predicted labels in the test set.
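The decoding analysis can be sketched as below, with X a trials × neurons matrix of spike counts in the 1 s windows and y a binary label for call source. The use of scikit-learn and an unstratified 50/50 split are illustrative choices, not the authors' MATLAB implementation.

```python
# Linear SVM decoding of call source (presented vs produced), repeated over
# 1,000 random 50/50 train/test splits.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def decode_call_source(X, y, n_repeats=1000):
    accuracies = np.empty(n_repeats)
    for i in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=i)
        clf = SVC(kernel="linear").fit(X_tr, y_tr)
        accuracies[i] = clf.score(X_te, y_te)   # proportion correct on held-out trials
    return accuracies.mean()
```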
Results
We recorded neural activity from the ACC of two freely moving marmoset monkeys using high-density Neuropixels probes implanted in each animal coupled with a SpikeGadgets datalogger. During these sessions, we presented each marmoset with previously recorded conspecific phee calls while also tracking their self-generated phee calls (Fig. 1A). To ensure precise identification of vocalization times, we utilized the Raven Pro 1.6 software developed by the Cornell Laboratory of Ornithology, which allowed us to automatically detect and subsequently confirm by visual inspection the onset and offset times of phee calls produced by the animals from which neural recordings were obtained (Fig. 1B). Marmoset R typically responded with a phee call ∼2 s after hearing the presented call, displaying a relatively consistent response time. In contrast, Marmoset A exhibited much greater variability in call onset times following the presented vocalizations, indicating individual differences in response patterns (Fig. 1C). The yield of recorded neurons from the implanted Neuropixels probes differed between the two animals (Fig. 1D). Although both animals initially exhibited a similar number of isolated neurons—125 in Marmoset R and 129 in Marmoset A, recorded a few hours postimplantation—the neuron count declined in Marmoset R to 17 on Day 1 and 16 on Day 2. In contrast, the neuron count in Marmoset A increased to over 200 for the subsequent three days. To prevent reanalysis of the same neurons across sessions, we included only one session from Marmoset R (Day 0, n = 125 neurons) and Marmoset A (Day 1, n = 204 neurons) in our analysis. We deliberately chose this conservative approach because there is currently no universally accepted method for reliably identifying and matching the same neurons across days in chronic high-density recording experiments. By restricting the analysis to a single session per animal, we aimed to avoid potential duplication of neuronal data.
Ex vivo ultrahigh-field magnetic resonance imaging at 15.2 T confirmed that the probes primarily targeted areas 32 and 32v of the anterior cingulate cortex, with a few neurons recorded in area 14R in Marmoset A (Fig. 1E,F).
We observed a variety of neural response profiles in the ACC. Consistent with our previous findings that many area 32 neurons in marmosets respond to conspecific vocalizations, we found neurons in this naturalistic vocalization paradigm that responded to the presented phee calls. Figure 2A illustrates a neuron that exhibited a transient increase in activity following the presented phee calls (blue triangles), which returned to baseline levels before the marmoset produced its own phee calls (magenta circles). A similar pattern is seen in Figure 2B, where the neuron also shows some activity following the onset of the produced calls (magenta circle, right panel).
Neural activity patterns of significantly modulated neurons in response to presented and produced vocalizations in Marmosets R and A. A–F, Raster plots and peristimulus time histograms (PSTHs) showing activity of single neurons aligned to presented (stimulus) and self-generated (vocalization) phee calls. Each panel displays activity during four time periods: baseline (Base), stimulus presentation (Stimulus), prevocalization (Pre), and postvocalization (Post). Rasters depict spike times (gray dots) across trials, with color-coded markers indicating specific event timings: stimulus start (blue triangles), vocal start (magenta circles), and vocal end (red squares). Black lines indicate the average spike density across trials. G, H, Proportion of neurons with significant responses to presented calls, produced calls, or both, in Marmoset R (G) and Marmoset A (H). Pie charts categorize neurons as responsive to stimulus only, prevocalization only, postvocalization only, combinations of these periods (Stimulus & Pre, Stimulus & Post, Pre & Post), all three periods, or nonresponsive.
Some neurons displayed more sustained responses to presented phee calls, only decreasing just before the onset of the produced vocalization (Fig. 2C), and these neurons did not show increased activity during the produced calls above baseline levels. Other neurons, such as the one shown in Figure 2D, increased their activity following the presented phee calls but reached peak activity at a later time. Additionally, many neurons showed increased activity just before and during the produced phee calls, peaking at or after call onset (Fig. 2E). In contrast, some neurons reduced their activity following the presented call and showed an increase immediately prior to the produced call (Fig. 2F).
To quantify the total number of neurons modulated during this free vocalization paradigm, we identified neurons with significant activity changes across different task epochs, compared with baseline activity evaluated in a window from 1,000–0 ms before phee call presentation. We used nonparametric Wilcoxon signed-rank tests (p < 0.05) to test for significance. Modulated neurons were found in all three task periods analyzed: the stimulus period (0–1,000 ms after phee call presentation), the prevocalization period (2,000–0 ms before the produced call), and the postvocalization period (0–1,000 ms after the produced call; Fig. 2G,H). Note that the stimulus and prevocal periods overlapped somewhat for neurons recorded from Marmoset R, which typically responded with a call 2–3 s after phee call presentation. Overall, 78% of neurons in Marmoset R and 40% of neurons in Marmoset A showed significant modulation across these broad task periods.
To examine the neural response dynamics during both presented and self-generated vocalizations, we performed principal component analysis (PCA) on these 179 task-modulated neurons from the two monkeys. Figure 3 illustrates the temporal evolution of the top three principal components (PC1, PC2, and PC3) during presented (Fig. 3, left panels) and produced (Fig. 3, right panels) phee calls, capturing key aspects of neural response variance in each condition.
Principal component analysis of neural responses to presented and produced vocalizations. Principal component (PC) analysis of neural activity during presented (left panel) and produced (right panel) vocalizations. The time courses of the first three principal components (PC1, PC2, and PC3) are shown. The percentage next to each component indicates the variance explained by that component.
In response to the presented phee calls (Fig. 3, left panels), the first principal component (PC1) exhibited a sharp increase in activity immediately following call onset, explaining 45% of the variance in the neural response. This strong activation suggests PC1 captures a dominant neural dynamic in area 32 related to auditory input. PC2 (26.6% variance explained) showed a rapid, phasic decrease following presented call onset, while PC3 (10.6%) exhibited a biphasic modulation pattern. During self-generated phee calls (Fig. 3, right panels), neural dynamics differed noticeably. PC1 peaked ∼1 s before the produced call and decreased before and during the call, capturing 34.2% of the variance. PC2, which explained 25.2% of the variance, exhibited an increase starting before call onset, consistent with activation during vocal production. Similar to its response profile during presented calls, PC3 displayed an oscillatory pattern, explaining 9.7% of the variance, suggesting this component may capture a more general response pattern present across both presented and produced vocalization conditions.
To classify distinct neural response patterns during presented and produced vocalizations, we used k-means clustering analysis on the first principal component of the activity for presented vocalizations and the second principal component of the activity for produced vocalizations of the 179 significantly modulated neurons. Using silhouette analysis to determine the optimal number of clusters, we observed a peak average silhouette value at three clusters (Fig. 4A), indicating that the modulated neurons can be grouped into three distinct response types based on their activity patterns during presented and produced vocalizations. We selected the second principal component for clustering in the produced condition because PC2 more clearly captured time-locked phasic modulations relative to call onset, whereas PC1 primarily reflected a gradual decrease in activity, resembling the evoked response profile observed in PC1 for presented calls. As a result, PC2 provided a more informative basis for distinguishing functionally distinct neural response types associated with vocal production.
Clustering analysis of neural response patterns to presented and produced vocalizations. A, Silhouette analysis showing the average silhouette value as a function of the number of clusters (k) for k-means clustering. A peak at three clusters indicates that the neural activity can be optimally divided into three distinct response types. B, Scatterplot of principal component 1 (PC1) for presented versus principal component 2 (PC2) for produced vocalizations, with each point representing a neuron. Colors denote the three identified clusters. Each cluster contains neurons from both Marmosets R (filled circles) and A (open circles). C, Dendrogram showing the Euclidean distances between neurons based on their projections in principal component space (PC1 and PC2). Hierarchical clustering using Ward’s method revealed three distinct neuronal groups (Clusters 1–3), indicated by branch color. Each leaf represents an individual neuron, and branch heights correspond to pairwise distances between neurons in the reduced PCA space. D, Average z-normalized neural activity of each identified cluster during presented (left) and produced (right) phee calls. Shaded regions indicate ±SEM. The dotted line marks the onset of the presented call (left) and the produced call (right). E, Heat maps showing the z-normalized neuronal activity for neurons recorded from the two marmosets in the three clusters during call presentation and call production. Each heat map represents neuronal activity sorted by the time at which activity reaches 50% of the maximal response. Data are aligned to the onset of the presented calls (left) and the onset of call production (right).
In Figure 4B, we plot each neuron’s PC1 score for presented vocalizations against its PC2 score for produced vocalizations. The scatterplot shows the three identified clusters, represented by different colors, with each cluster containing neurons from both marmosets (R and A, filled and open circles, respectively). The separation of clusters for presented and produced vocalizations suggests that distinct neural populations are selectively tuned to either vocalization source or exhibit different response magnitudes and directions based on the vocalization type.
To validate the results of the k-means clustering, we also performed hierarchical clustering using Euclidean distances across the same feature space (Fig. 4C). This approach independently recovered the same three cluster structure, supporting the robustness of the classification.
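This validation step can be sketched as follows, assuming the same neurons × 2 feature matrix (PC1 for presented calls, PC2 for produced calls) used for k-means; the linkage matrix can be passed to scipy.cluster.hierarchy.dendrogram to produce a plot in the style of Figure 4C. The original analysis was performed in MATLAB.

```python
# Hierarchical clustering with Ward linkage (Euclidean distances) as a check
# on the k-means solution.
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_clusters(features, n_clusters=3):
    Z = linkage(features, method="ward")                      # Ward linkage, Euclidean distance
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")  # cut the tree into three groups
    return Z, labels
```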
Figure 4D shows the average z-normalized neural activity for each of the three clusters during presented (left) and produced (right) phee calls. Figure 4E shows the z-normalized neural activity of single neurons in the clusters. The dotted lines indicate the onset of the presented phee call (left panels) and the onset of the produced phee call (right panels). Neurons in Cluster 1 (blue, n = 67) exhibited a marked increase in activity following the onset of the presented phee call and decreased their activity before the onset of the produced call. This pattern suggests that these neurons respond strongly to external vocalizations but are suppressed during self-generated vocalizations, indicating a possible role in distinguishing between external auditory inputs and self-generated actions. Cluster 2 (red, n = 62) showed a phasic increase in activity after the presented call, followed by a gradual increase that began ∼1,500 ms before the onset of the produced call and peaked ∼500 ms before the call. Cluster 3 (yellow, n = 50) displayed a decrease in activity after the presented calls but increased its activity during produced vocalizations.
To investigate whether distinct functional response types were associated with putative excitatory or inhibitory neurons, we examined the distribution of narrow-spiking (NS; likely inhibitory) and broad-spiking (BS; likely excitatory) neurons across the three response clusters (Fig. 5). Spike waveform classification was based on trough-to-peak duration, with neurons classified as NS if their duration was <300 μs and as BS if ≥300 μs. Consistent with previous reports that NS neurons tend to exhibit higher firing rates, NS neurons in our dataset showed significantly stronger responses than BS neurons across multiple time points (horizontal black bars indicate p < 0.05, Wilcoxon rank-sum test).
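The waveform-width classification can be sketched as below, assuming one mean action-potential waveform per neuron sampled at 30 kHz (an assumed Neuropixels AP-band rate); the original analysis was performed in MATLAB.

```python
# Classify a neuron as narrow- or broad-spiking from the trough-to-peak
# duration of its mean waveform.
import numpy as np

FS = 30000.0                                           # assumed sampling rate (Hz)

def classify_spike_width(mean_waveform, threshold_us=300.0):
    """Label a neuron 'NS' (narrow) or 'BS' (broad) by trough-to-peak duration."""
    trough = np.argmin(mean_waveform)
    peak = trough + np.argmax(mean_waveform[trough:])  # first peak after the trough
    width_us = (peak - trough) / FS * 1e6
    return "NS" if width_us < threshold_us else "BS"
```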
Neuronal response profiles for narrow- and broad-spiking neurons. Left, Histogram distributions of neurons classified as narrow-spiking (NS, light gray) or broad-spiking (BS, dark gray) based on trough-to-peak duration of average waveforms, shown for the entire population (“All”) and for each of the three hierarchical clusters. Middle, Mean z-normalized firing rate (±SEM) aligned to the onset of call presentation (vertical dashed line at time 0) for NS (dashed lines) and BS neurons (solid lines), plotted separately for all neurons (top) and each cluster (bottom three rows). Black horizontal bars indicate time periods with significant elevation in firing rate between NS and BS neurons. Right, Mean normalized firing rate (±SEM) aligned to the onset of self-produced phee calls, plotted with the same conventions.
Although BS neurons were more prevalent overall (147 BS vs 32 NS), all three functional clusters contained both cell types. A chi-square test revealed no significant difference in the proportion of NS and BS neurons across clusters (χ2 = 0.21, p = 0.90), indicating that putative inhibitory and excitatory neurons were similarly distributed among the identified neural response categories.
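For illustration, this test can be sketched with a 2 × 3 contingency table. The within-cluster cell counts below are placeholders chosen only to be consistent with the reported marginals (32 NS and 147 BS neurons; clusters of 67, 62, and 50); the actual per-cluster splits are not reported in the text.

```python
# Chi-square test of independence between cell type (NS/BS) and cluster
# membership, using a placeholder contingency table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[12, 11, 9],       # NS counts per cluster (placeholder values)
                  [55, 51, 41]])     # BS counts per cluster (placeholder values)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.2f}")
```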
To further investigate the functional role of ACC neurons in vocal communication, we compared neural responses on antiphonal versus non-antiphonal trials in Marmoset A, the animal that exhibited variable latencies and frequent failures to respond to presented calls (Fig. 1C). We defined antiphonal trials as those in which the marmoset produced a phee call within 10 s of a presented call, consistent with previous work characterizing natural antiphonal calling behavior in this species (Miller et al., 2010). Trials on which no vocalization occurred within this window were classified as non-antiphonal.
Figure 6 shows the average z-normalized activity for each of the three functional clusters (identified in Fig. 4) separately for antiphonal (solid lines) and non-antiphonal (dashed lines) trials from Marmoset A, aligned to the presented call (left) and to the onset of the self-produced phee call (right). Cluster 2 neurons showed stronger responses to the presented phee calls on antiphonal compared to non-antiphonal trials (see horizontal black bars indicating significant differences, p < 0.05, Wilcoxon signed-rank test). In contrast, the activity in this cluster was higher prior to call production on non-antiphonal trials. These results indicate that activity in area 32 is sensitive not only to the presence of vocal stimuli but also to the behavioral context—specifically, whether or not a stimulus will be followed by a vocal reply. This suggests that sensory-evoked activity in the ACC may help set the stage for vocal production and that diminished activation in this network may relate to the absence of a behavioral response.
Neuronal response on antiphonal and non-antiphonal trials for Marmoset A. Z-normalized average firing rates (±SEM) of neurons from each of the three hierarchical clusters during presented (left panels) and produced phee calls (right panels). Solid lines represent trials on which the marmoset produced a phee call within 10 s of the presented call (antiphonal trials), while dashed lines represent trials with longer-latency or absent responses (non-antiphonal trials). Black horizontal bars indicate time points with significant differences between antiphonal and non-antiphonal trials. Cluster 2 exhibited stronger responses to presented phee calls during antiphonal trials and enhanced prevocal activity during non-antiphonal trials.
To explore whether ACC neural responses were modulated by vocal effort or acoustic structure, we examined neural activity as a function of the number of phee pulses produced. In Figure 7, we compare the z-normalized average firing rates of neurons in each of the three response clusters during trials in which the marmoset produced a single phee pulse (solid lines) versus trials with double or triple phee pulses (dashed lines). All three clusters showed higher prestimulus activity on trials with subsequent multi-pulse phee calls than on single-pulse trials, a difference that reached significance in Cluster 2. Cluster 1 also showed stronger activity prior to produced multi-pulse phee calls (black bars indicate significant time points, p < 0.05, Wilcoxon signed-rank test). Although we did not perform a full acoustic feature analysis, the increased activity observed in association with the more effortful multi-pulse phee calls suggests that area 32 integrates motivational and affective components into vocal motor control.
Neuronal responses for single and multiple phee pulses. Z-normalized average firing rates (±SEM) of neurons from each of the three hierarchical clusters during presented (left panels) and self-produced phee calls (right panels). Solid lines represent trials on which marmosets produced a single phee pulse, while dashed lines represent trials with double or triple phee pulses. Black horizontal bars indicate time points with significant differences between single- and multi-pulse responses.
To further evaluate the ability to distinguish between neural activity patterns associated with presented versus produced calls, we trained a linear support vector machine (SVM) classifier using the activity of all recorded neurons within each of the two sessions. For each trial, neural responses were represented by the summed spike counts of each neuron within the first second following the onset of either a presented or generated call, creating a high-dimensional feature space that captured the collective neural response dynamics for each vocalization type. This analysis aimed to quantify the separability of neural responses on a trial-by-trial basis, leveraging the simultaneous activity patterns across the neural population to enhance classification accuracy.
The classifier demonstrated a mean correct classification rate of 96% for Marmoset R, indicating strong separability between the neural responses associated with presented and generated calls, with a distinct pattern of activity following each call type. For Marmoset A, the classifier achieved a mean correct classification rate of 82%, suggesting effective, albeit slightly lower, discriminability in neural response patterns compared with Marmoset R. These findings highlight the high degree of specificity in neural encoding for presented versus self-generated calls within each marmoset’s neural population in area 32, indicating robust neural differentiation based on the source of the vocalization.
Discussion
Convergent lines of anatomical and functional evidence have linked area 32 with auditory processing and vocal production, suggesting that it may act as an audiovocal interface integrating auditory inputs with vocal outputs. Given the well-established links between pregenual ACC and social cognition (see for review Amodio and Frith, 2006), we reasoned that this may be particularly the case in contexts requiring voluntary control of vocal initiation on the basis of cognitive factors, such as during communication with a conspecific (Jürgens, 2009; Fichtel and Manser, 2010; Bradbury and Vehrencamp, 2011; Grijseels et al., 2023). Here, we investigated this by carrying out electrophysiological recordings in freely moving marmosets engaged in an antiphonal calling paradigm in which they responded to broadcasts of long-distance phee calls (Miller and Wang, 2006; Eliades and Miller, 2017). We found that the response properties of single neurons in marmoset area 32 met the criteria for an area integrating auditory information with vocal output. First, consistent with our prior finding of complex dynamics in auditory responses to biological sounds and conspecific vocalizations (Gilliland et al., 2024), a large proportion of neurons (58.9% of responsive neurons across both animals) were modulated by presented phee calls. Second, roughly 80% of responsive neurons were modulated around the time at which marmosets generated calls, consistent with prior findings suggestive of a role in call production (Jürgens and Ploog, 1970; Sutton et al., 1974). Third, nearly all responsive neurons exhibited either a sustained increase or suppression of activity in the interval between presented and produced vocalizations. Such modulations, which bridge the temporal gap between sensory responses and motor output, have been associated previously with varied cognitive processes including working memory (Pasternak and Greenlee, 2005) and decision-making (Shadlen and Kiani, 2013), as well as motor preparation (Jonikaitis et al., 2023).
In addition to the broad response profiles we observed across neurons in area 32, a more nuanced pattern emerged when examining vocal responses based on behavioral context. In Marmoset A, the only animal to exhibit sufficient trial-to-trial variability, we found that neural responses differed significantly between antiphonal and non-antiphonal trials. Specifically, neurons in Cluster 2 showed stronger responses to presented phee calls on trials that were followed by a vocal reply, suggesting that sensory-evoked activity in area 32 may contribute to the facilitation of vocal production. Interestingly, this cluster also showed elevated activity prior to vocalizations on non-antiphonal trials—those with longer or absent responses—raising the possibility that anticipatory activity in this population may reflect an internal decision process or motivational state that does not ultimately culminate in call production. Together, these results underscore the context sensitivity of area 32 responses, indicating that its activity is shaped not only by vocal stimuli and motor output, but also by the likelihood and timing of communicative engagement.
To further probe the functional role of area 32 in vocal behavior, we investigated whether activity was modulated by vocal effort, using the number of phee pulses as a proxy for acoustic complexity or motivational drive. We found that neural activity preceding multi-pulse phee calls was elevated compared to that preceding single-pulse calls, particularly in Clusters 1 and 2. This suggests that activity in area 32 reflects not only the timing of vocal production but may also encode internal variables such as arousal or vocal effort. These findings are consistent with prior evidence that affective and motivational states influence call structure in marmosets (Borjon et al., 2016) and support the notion that area 32 contributes to integrating these internal states into the control of vocal output.
We also examined whether distinct functional response types could be attributed to putative excitatory or inhibitory cell classes by classifying neurons based on spike waveform. Although narrow-spiking neurons—presumed to be inhibitory—exhibited stronger firing rate modulations than broad-spiking neurons, we found no significant differences in the distribution of these cell types across the functional response clusters. This suggests that both excitatory and inhibitory neurons contribute to the diverse patterns of sensory and response-related activity in area 32, reinforcing the view that vocal communication dynamics in this region emerge from a heterogeneous but integrative local network.
Our findings of robust responses of area 32 neurons to presented and produced vocalizations are consistent with prior conceptualizations of the role of the ACC in vocal control. On the basis of stimulation and lesion studies targeting ACC and the interconnected PAG (Müller-Preuss and Jürgens, 1976), a structure with direct connections to the brainstem vocal pattern generator (Mantyh, 1983), Jürgens (2009) proposed that the ACC, including an area corresponding to area 32, sits at the apex of a hierarchical cingulo-periaqueductal pathway involved in voluntary control of vocalizations. In a similar vein, the dual-network model of Hage and Nieder (2016) places the ACC, including area 32, at the highest level of a limbic vocal-initiating network that includes the PAG, drives the vocal pattern generator in the brainstem, and additionally receives top-down inputs from a suite of lateral prefrontal, premotor, and motor cortical areas responsible for high-level vocal control. In this conceptualization, the ACC is in a sense a linchpin connecting evolutionarily old structures critical for emotional vocalizations with relatively newer cortical areas involved in cognitive control. Here, we found that area 32 neurons exhibited neural responses during vocal communication that resembled those observed in other areas with respect to vocal production but differed with respect to auditory responsiveness to conspecific calls. Numerous studies in both macaque monkeys performing trained vocalization tasks (Hage and Nieder, 2013, 2016; Gavrilov et al., 2017; Hage, 2018) and marmosets engaged in naturalistic antiphonal calling (Miller et al., 2015; Roy et al., 2016; Nummela et al., 2017; Jovanovic et al., 2022; Zhao and Wang, 2023) have demonstrated modulations in discharge rates of neurons in prefrontal and premotor cortex in advance of and during the production of vocalizations. The complex excitatory and inhibitory discharge dynamics around the time of phee call production during antiphonal calling in marmosets (Miller et al., 2015; Roy et al., 2016) closely resemble those we observed here. In contrast, auditory responses to presented calls in the antiphonal calling paradigm appear to be relatively weak and present in only roughly 10% of neurons in cytoarchitectonic areas corresponding to marmoset premotor cortex (Miller et al., 2015; Roy et al., 2016; Nummela et al., 2017), though there is some evidence that auditory responsiveness may be greater in more anterior frontal cytoarchitectonic subfields including areas 8aD, 8aV, 46, 47, 9, and 10 (Nummela et al., 2017; Jovanovic et al., 2022; Wong et al., 2024). This differs considerably from the robust auditory responses we observed in area 32 here and in our previous work (Gilliland et al., 2024), suggesting that this area is related more directly to sensory processing of auditory inputs in the context of vocal communication. In the PAG, which, as noted above, is related more directly to vocal production, auditory responses are almost completely absent (Düsterhöft et al., 2004). Single neurons exhibit exclusively excitatory dynamics at various times prior to and during vocalizations, which may be involved in coordinating various motor processes related to vocal production (Larson and Kistler, 1984, 1986; Larson, 1991). Overall, considered in the context of networks responsible for vocal production, our findings are broadly consistent with the conceptualization of area 32 as an “auditory field” (Medalla and Barbas, 2014).
This relative bias toward sensory input, observed together with activity changes preceding and following vocalization onset, additionally suggests that area 32 may implement an early stage in the transformation of auditory inputs into motor or bias signals that are shared with other cortical and subcortical regions with more direct access to the vocal pattern generator in the brainstem. Indeed, it has been shown that the PAG, as well as cortical motor, premotor, and cingulate motor areas, but not area 32, have disynaptic connections to laryngeal motoneurons involved in vocal control (Cerkevich et al., 2022).
The present study also highlights technical limitations associated with the chronic use of Neuropixels probes in marmosets. Although the probes enabled high-yield recordings for several days in Marmoset A, the yield declined sharply within 24 h in Marmoset R. This stands in contrast to rodent studies, where chronically implanted Neuropixels probes—especially the thinner NP1.0 (Jun et al., 2017) and NP2.0 versions (Steinmetz et al., 2021)—often yield stable recordings for weeks or even months. One possible contributing factor in our case is the use of the thicker Neuropixels probes designed for nonhuman primates, which may induce more tissue disruption. Additionally, prior to implantation, we had performed multiple acute recordings with Neuropixels probes in the same animals, which could have influenced tissue integrity at the implant site. In the current study, we prioritized precise post hoc localization and limited our analyses to a single session per animal to avoid re-sampling the same neurons across days. While this approach yielded robust data and uncovered novel functional properties of area 32 neurons, it also underscores the need for further methodological development to improve the longevity and reliability of chronic high-density recordings in marmosets.
Footnotes
We thank Cheryl Vander Tuin, Whitney Froese, and Hannah Pettypiece for animal preparation and care and Joseph Gati for scanning assistance. This work was supported by the Canadian Institutes of Health Research (S.E.) and the Natural Sciences and Engineering Research Council of Canada (S.E.). We also acknowledge the support of the Government of Canada’s New Frontiers in Research Fund (NFRF-T-2022-00051).
The authors declare no competing financial interests.
Correspondence should be addressed to Kevin D. Johnston at kjohnst9@uwo.ca.