Abstract
The common marmoset (Callithrix jacchus) is known for its highly vocal nature, displaying a diverse range of calls. Functional imaging in marmosets has shown that the processing of conspecific calls activates a brain network that includes fronto-temporal areas. It is currently unknown whether different call types activate the same or different networks. In this study, nine adult marmosets (four females) were exposed to four common vocalizations (phee, chatter, trill, and twitter), and their brain responses were recorded using event-related functional magnetic resonance imaging at 9.4 T. We found robust activations in the auditory cortices, encompassing core, belt, and parabelt regions, and in subcortical areas like the inferior colliculus, medial geniculate nucleus, and amygdala in response to these calls. Although a common network was engaged, distinct activity patterns were evident for different vocalizations that could be distinguished by a 3D convolutional neural network, indicating unique neural processing for each vocalization. Our findings also indicate the involvement of the cerebellum and medial prefrontal cortex in distinguishing particular vocalizations from others.
Significance Statement
This study investigates the neural processing of vocal communications in the common marmoset (Callithrix jacchus). Utilizing event-related functional magnetic resonance imaging at 9.4 T, we demonstrate that different calls (phee, chatter, trill, and twitter) elicit distinct brain activation patterns, challenging the notion of a uniform neural network for all vocalizations. Each call type distinctly engages various regions within the auditory cortices and subcortical areas. These findings offer insights into the evolutionary mechanisms of primate vocal perception and provide a foundation for understanding the origins of human speech and language processing.
Introduction
Most primate species live in groups which provide them with socially complex structures. Vocal communication plays a key role in such groups because it allows individuals to avoid predators, interact with other group members, and promote cohesion within social groups during daily activities (Swedell, 2012). Vocalization in nonhuman primates (NHPs) is considered a precursor for human language, and speech perception in humans likely evolved in our ancestors by using pre-existing neural pathways that were responsible for extracting behaviorally relevant information from the vocalizations of conspecifics (Belin, 2006).
Neuroimaging studies in Old World macaque monkeys suggest that responses to conspecific calls are structured bilaterally in a gradient form along the superior temporal lobe wherein the rostral parts are predominantly activated by integrated vocalizations, while the caudal parts are responsive to the acoustic features of these calls (Zhang et al., 2021). Several studies examined the activations of neurons within the auditory cortex in response to conspecific vocalizations in macaque monkeys. Some level of selectivity for three or fewer out of seven calls was reported in all three lateral belt regions (Tian et al., 2001; Belin, 2006). Beyond the auditory cortex, single-unit recordings also demonstrate that the macaque ventrolateral prefrontal cortex is involved in assessing acoustic features unique to conspecific vocalizations. The majority of the neurons within this area demonstrated some level of selectivity when the monkeys were presented with diverse vocalizations (Romanski et al., 2005).
Given the challenges of studying vocal and cognitive auditory processing in general in macaques (Joly et al., 2017; Cléry et al., 2021), New World common marmosets (Callithrix jacchus) have emerged as a powerful additional NHP model for vocal studies (Takahashi et al., 2021). Potentially because of their densely foliated arboreal habitat and family structure, marmosets possess a diverse array of calls that are dependent on social contexts and ecological factors (Bezerra and Souto, 2008; Landman et al., 2020). This rich vocal repertoire underlies their consistent engagement in acoustic communication which is also characterized by vocal turn-taking (Eliades and Miller, 2017), a critical feature shared with humans (Levinson, 2016).
While electrophysiological recordings (Wang et al., 1995; Zeng et al., 2021) and more recently two-photon calcium imaging (Zeng et al., 2021) have revealed a subset of neurons tuned to specific call types in auditory cortices, it is unknown whether different neural circuits are activated by different vocalizations in marmosets. If this is the case, it is possible that vocalizations produced in different contexts are processed by different brain networks. Another possibility is the involvement of different brain networks based on the acoustic features of vocalizations, with a more similar neural substrate for calls sharing more acoustic characteristics (Takahashi et al., 2021).
Recently, we developed a technique to obtain whole-brain functional magnetic resonance imaging (fMRI) in awake marmosets at 9.4 T in response to auditory stimuli and used it to map the marmosets’ vocalization-processing network. We found that blocks of mixed conspecific vocalizations evoked stronger activations than scrambled vocalizations or nonvocal sounds in a fronto-temporal network, including auditory and cingulate cortices (Jafari et al., 2023). Here we followed up on this approach and utilized event-related fMRI to test whether single phee, chatter, trill, or twitter calls evoked similar or distinct patterns of activation in awake marmosets. Our data show that although all calls activated a core network of cortical and subcortical areas, they were also associated with distinct activation patterns that could be reliably classified by a trained 3D convolutional neural network (CNN).
Materials and Methods
Animal subjects
All experimental procedures complied with the guidelines of the Canadian Council on Animal Care and were conducted in accordance with a protocol approved by the Animal Care Committee of the University of Western Ontario. In this study, we conducted whole-brain fMRI scans on nine adult common marmosets (Callithrix jacchus; four females; two left-handed; age, 48–74 months; weight, 380–464 g). All animals were housed in pairs in a colony at the University of Western Ontario, where environmental conditions included a humidity level of 40–70%, a diurnal 12 h light cycle, and a temperature maintained between 24 and 26°C.
Surgical procedure
Animals underwent aseptic surgery under anesthesia to implant an MRI-compatible machined PEEK (polyetheretherketone) head post that prevented head motion during scanning. During the surgical procedure, the marmosets were sedated and intubated so that they could be maintained under gas anesthesia (a mixture of O2 and air, with isoflurane levels ranging from 0.5 to 3%). The skull surface was initially prepared by applying several coats of adhesive resin (All-Bond Universal; Bisco, Schaumburg, Illinois, USA) after a midline incision of the skin along the skull. The resin-coated area was air-dried and then cured with an ultraviolet dental curing light (King Dental). The PEEK head post was then secured to the skull using a resin composite (Core-Flo DC Lite; Bisco, Schaumburg, Illinois, USA). Throughout the surgery, heart rate, blood pressure, oxygen saturation, and body temperature were continuously monitored. More details regarding the animal surgery protocol are included in Johnston et al. (2018) and Zanini et al. (2023).
Animal habituation
Following a recovery period of 2 weeks after the surgery, the monkeys were gradually acclimatized to the head-fixation system and the MRI environment using a mock scanner. Specific details concerning our training protocol are found in Zanini et al. (2023).
Marmoset contact calls and their characteristics
Previous studies in marmosets, whether in captivity or moving ad libitum, indicate that their vocal repertoire comprises ∼13 distinct call types, each capable of eliciting varied behavioral responses from conspecific listeners (Bezerra and Souto, 2008). These vocalizations are influenced by social contexts and can serve as indicators of the animal's overall welfare (Wang and Kadia, 2001; Bezerra and Souto, 2008; Takahashi et al., 2021). Acoustic factors such as bandwidth, harmonic ratio, and duration vary among individual marmosets and even across calls of the same animal for a given call type (Lui et al., 2015). A fundamental aspect of call exchange among marmosets is the temporal adjustment of contact calls, also known as turn-taking. Within the marmoset vocal repertoire, the trill and phee calls serve antiphonal communication and are considered key to facilitating turn-taking (Landman et al., 2020). Trill calls are short-distance calls and the most dominant vocalizations within the marmoset repertoire; they are produced when individuals are in close vicinity to each other. Previous vocal interaction studies demonstrated that high-frequency trill calls predominantly occur between partners and are less frequently exchanged with other members of the group. Additionally, trills emitted by one animal are often followed by trills from other animals within <1 s. Conversely, high-frequency phee calls are emitted when a marmoset is distant from its peers and lacks visual contact (Bezerra and Souto, 2008; Eliades and Miller, 2017; Landman et al., 2020). Phee calls differ in their acoustic features and can be classified by the caller's distance, longer or shorter, from conspecifics (Kato et al., 2018; Grijseels et al., 2023). Twitter calls are loud sounds that share a structural resemblance with warbles. These high-frequency calls consist of a sequence of several brief, rapidly frequency-modulated phrases with relatively consistent inter-syllable intervals, typically produced during intergroup agonistic interactions (Wang et al., 1995; Bezerra and Souto, 2008; Watson and Caldwell, 2010; Eliades and Miller, 2017; Zhao and Wang, 2021). Marmosets also produce low-frequency chatter calls, which reflect distinct emotional contexts including intergroup or outgroup aggression (Grijseels et al., 2023).
Stimuli generation
The natural vocalizations used in this study were recorded in a small colony that accommodated the marmosets in this study and their companions in five groups. Monkeys were pair-housed in five different cages (2–6 individuals per cage) at the University of Western Ontario. Conspecific vocalizations frequently produced during vocal exchanges between members of the colony were recorded over a period of 2 h using a microphone (MKH 805, Sennheiser, combined with an NW-100 phantom power supply, NEEWER) connected to a laptop (MacBook Pro, Apple) and stored as WAV files in the Audacity software (v3.2.5; Audacity, 2017). No manipulation such as filtering or background noise removal was performed on the archived data, to preserve the integrity of the vocal features. The data were then inspected via their spectrograms using the Audacity software package (Audacity, 2017), leading to the identification of three distinct categories of social calls: phee, twitter, and trill calls. Among the recorded data, twitter and trill were the most prevalent calls. Calls with low amplitude or intensity, which likely originated from animals housed in distant cages, were excluded from our selected dataset. For trill and twitter calls, we selected two samples each from our recordings. For phee and chatter calls (n = 2 each), we instead used pre-recorded samples from other sources (Landman et al., 2020) because of the suboptimal quality of our own recordings, primarily resulting from overlap of individual calls with background noise or vocalizations from multiple monkeys, or their absence from our recordings. The spectral power of the candidate samples was subsequently normalized using a custom MATLAB script, and their duration was matched to 600 ms using the online tool AudioTrimmer. In total, we used eight samples across the four call categories (two samples each of trill, chatter, twitter, and phee calls) as vocal stimuli in this study. The spectrograms of these call types are displayed in Figure 4C.
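For illustration, the normalization and duration-matching step could be implemented roughly as in the following MATLAB sketch; the file names and the RMS-based normalization are assumptions for this example and do not reproduce the original custom script or the AudioTrimmer settings:

% Minimal sketch of power normalization and duration matching for the call
% samples; file names and RMS normalization are illustrative assumptions.
files = {'phee1.wav','phee2.wav','trill1.wav','trill2.wav', ...
         'twitter1.wav','twitter2.wav','chatter1.wav','chatter2.wav'};
targetDur = 0.6;                                  % target duration (s)

for k = 1:numel(files)
    [y, fs] = audioread(files{k});                % load the recorded call
    y = mean(y, 2);                               % collapse to mono if stereo
    y = y ./ sqrt(mean(y.^2));                    % equalize RMS power across samples
    nTarget = round(targetDur * fs);
    if numel(y) >= nTarget
        y = y(1:nTarget);                         % trim to 600 ms ...
    else
        y = [y; zeros(nTarget - numel(y), 1)];    % ... or zero-pad shorter calls
    end
    y = 0.9 * y ./ max(abs(y));                   % rescale to avoid clipping on write
    audiowrite(['norm_' files{k}], y, fs);        % save the normalized sample
end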
fMRI data parameters
Each fMRI session lasted 40–55 min, depending on whether marmosets were scanned for two or three functional runs, using a gradient-echo single-shot echo-planar imaging (EPI) sequence. To minimize the impact of scanner noise on the auditory stimuli, we adopted a "bunched" acquisition approach. In this method, the acquisition of image slices occurred within a period (acquisition time, TA) that was shorter than the repetition time (TR), leaving a silent interval (silent period, TS) during which the auditory stimuli were presented to the animals (TR = TA + TS). The parameters used for data collection in this experiment were as follows: TR, 3 s; TA, 1.5 s; TS, 1.5 s; TE, 15 ms; flip angle, 40°; field of view, 64 × 48 mm; matrix size, 96 × 128; voxel size, 0.5 mm isotropic; number of axial slices, 42; bandwidth, 400 kHz; and GRAPPA acceleration factor (left–right), 2. Furthermore, during each fMRI session, an additional series of EPIs was acquired with the opposite phase-encoding direction (right–left); these were used to correct spatial distortions with topup in FSL (Smith et al., 2004) in our subsequent analysis. For each animal, we also collected a T2-weighted anatomical image in a separate session to facilitate anatomical registration of the fMRI data. The T2-weighted anatomical image was obtained using the following imaging parameters: TR, 7 s; TE, 52 ms; field of view, 51.2 × 51.2 mm; voxel size, 0.133 × 0.133 × 0.5 mm; number of axial slices, 45; bandwidth, 50 kHz; and GRAPPA acceleration factor, 2 (Zanini et al., 2023).
fMRI setup
Functional imaging data were acquired at the Centre for Functional and Metabolic Mapping at the University of Western Ontario, using an ultrahigh-field magnetic resonance system operating at 9.4 T, featuring a 31 cm horizontal bore magnet and a Bruker BioSpec Avance III console running the Paravision 7 software package. A custom-designed gradient coil (Handler et al., 2020) with an inner diameter of 15 cm and a maximum gradient strength of 400 mT/m was employed. Additionally, an eight-channel receive coil (Gilbert et al., 2023) was positioned inside a home-built transmit quadrature birdcage coil with an inner diameter of 120 mm. A comprehensive description of the animal preparation can be found in Jafari et al. (2023) and Zanini et al. (2023). Vocal stimuli were presented passively during each run using a custom-built MATLAB script (R2021b, MathWorks). A transistor–transistor logic (TTL) pulse box was used to synchronize the onset of the EPI sequences with the onset of the auditory stimuli and video recordings.
fMRI task design and sound presentation
The fMRI task involved passively presenting the marmosets with the vocal stimuli described above. During the sessions, each marmoset was scanned for two to three runs, each containing 300 functional volumes (TR, 3 s). This resulted in a total of 86 functional runs, of which 4 were excluded from our final dataset due to technical issues during scanning sessions in which the vocal stimuli were not properly presented.
To present the vocal stimuli, a customized MATLAB script implementing the event-related imaging paradigm was employed. This script triggered sound presentation via a TTL pulse received from the MR scanner. Each scanning session started and ended with a 1.5 s baseline period during which no auditory stimuli were presented. Using a TS of 1.5 s within the TR enabled us to present stimuli without interference from the scanner noise. On each trial, a call was chosen in random order and played back during TS, with interstimulus intervals selected in a pseudorandom sequence from a predefined set of 3, 6, 9, and 12 s.
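A minimal MATLAB sketch of how such a pseudorandom schedule could be generated is shown below; the trial count, variable names, and onset bookkeeping are illustrative assumptions, and the actual script additionally synchronized playback with the TTL pulses from the scanner:

% Minimal sketch of the pseudorandom stimulus schedule: each trial draws one
% of the eight call samples and an interstimulus interval from {3, 6, 9, 12} s.
% Trial count and variable names are illustrative assumptions.
rng('shuffle');
nTrials   = 70;                                   % assumed number of trials per run
isiSet    = [3 6 9 12];                           % predefined ISIs (s), multiples of the TR
callFiles = {'phee1','phee2','trill1','trill2', ...
             'twitter1','twitter2','chatter1','chatter2'};

stimOrder = callFiles(randi(numel(callFiles), 1, nTrials));  % random call per trial
isiOrder  = isiSet(randi(numel(isiSet), 1, nTrials));        % pseudorandom ISIs
onsets    = 1.5 + [0 cumsum(isiOrder(1:end-1))];             % nominal call onsets (s),
                                                             % falling within the silent periods (TS)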
Preprocessing of fMRI data
The data were preprocessed using AFNI (Cox, 1996) and FSL (Smith et al., 2004). Prior to initiating the preprocessing of the functional data, the T2-weighted anatomical data were reoriented. Subsequently, a manual skull-strip mask was created for each individual subject using the FSLeyes application (Smith et al., 2004) and then converted into a binary mask. This binary mask was then multiplied with the anatomical data to produce the T2 template mask and ultimately registered to the 3D NIH marmoset brain atlas (NIH-MBA; Liu et al., 2018) through Advanced Normalization Tools (ApplyTransforms; Avants et al., 2009) using a nonlinear registration. Functional data preprocessing briefly included the following steps: (1) conversion of the raw functional data from DICOM to NIfTI format (dcm2niix), (2) reorientation of the functional data (fslswapdim), (3) estimation of susceptibility-induced distortions from the reverse phase-encoded images (fslroi and topup), (4) distortion correction of each run (applytopup), (5) removal of spikes (3dToutcount and 3dDespike), (6) slice-time correction (3dTshift), (7) registration of the entire dataset to a reference volume, typically the middle volume (3dvolreg), (8) spatial smoothing by convolving the BOLD images with a three-dimensional Gaussian kernel with a full-width at half-maximum of 1.5 mm (3dmerge), (9) bandpass filtering within the frequency range of 0.01–0.1 Hz (3dBandpass), and (10) calculation of a mean functional image for each run, which was linearly aligned with the corresponding T2-weighted anatomical image of each animal (FSL's FLIRT). The resulting transformation matrix from the realignment process was then applied to the 4D time series data.
Statistical analysis of fMRI data
We used a univariate general linear model approach by convolving the event onsets obtained during the scans with a hemodynamic response function [AFNI's GAM(4,0.7)]. A regression model was then generated for each condition using 3dDeconvolve, incorporating polynomial detrending regressors and the motion parameters from the previous preprocessing steps. The resultant regression coefficient maps were then registered to template space using the transformation matrices obtained from the registration of the anatomical images to the template. One T value map for each call per run was thus obtained after registration to the NIH marmoset brain atlas (Liu et al., 2018). The obtained maps were subsequently subjected to group-level comparisons through paired t tests (3dttest++), yielding Z value maps. These functional maps were then visualized on fiducial maps using Connectome Workbench v1.5.0 (Marcus et al., 2011) as well as on coronal and sagittal sections using FSLeyes (Smith et al., 2004). To determine the locations of activated brain regions, we aligned these functional maps with a high-resolution (100 × 100 × 100 µm) ex vivo marmoset brain (Schaeffer et al., 2022), registered on the NIH marmoset brain template (Liu et al., 2018), and employed the Paxinos parcellation for region identification (Paxinos et al., 2012).
Quantification of local responses
To evaluate how the brain responds to different types of vocalizations, we analyzed changes in neural activation across 24 regions of interest (ROIs), including auditory areas and peripheral regions. Beta values from the resultant regression coefficient maps were extracted in these ROIs using AFNI's 3dmaskave tool for each type of vocalization and each of the 82 experimental runs across the nine monkeys. We then calculated the average beta value and standard error of the mean (SEM) for each call across all runs for each ROI using a custom MATLAB script. Additionally, we conducted a repeated-measures analysis of variance (rmANOVA) on the beta values for each ROI in MATLAB to examine differences between vocalization types (ranova, multcompare). This analysis helped us determine the significance of the observed neural activity patterns (Macey et al., 2016).
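For a single ROI, this analysis could be set up along the following lines, assuming betaROI is an 82-by-4 matrix of beta values (runs by call types); the table and factor names are illustrative, and the paper's custom script is not reproduced here:

% Minimal sketch of the per-ROI repeated-measures ANOVA on beta values
% (runs x calls); variable and factor names are illustrative assumptions.
tbl    = array2table(betaROI, 'VariableNames', {'phee','chatter','trill','twitter'});
within = table(categorical({'phee'; 'chatter'; 'trill'; 'twitter'}), ...
               'VariableNames', {'Call'});

rm        = fitrm(tbl, 'phee-twitter ~ 1', 'WithinDesign', within);  % within-subject model
ranovaTbl = ranova(rm);                    % main effect of call type for this ROI
posthoc   = multcompare(rm, 'Call');       % pairwise comparisons between call types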
Decoding
To determine whether the four call types could be distinguished based on their activation patterns, we employed a decoding approach. Although support vector machine (SVM) algorithms are among the most commonly used decoding methods owing to their simplicity, they face significant limitations when applied to fMRI data (Du et al., 2022). One of the primary drawbacks of SVMs is their reliance on linear separation, which restricts their ability to capture the complex, nonlinear relationships often present in functional maps. SVMs also struggle with the high-dimensional nature of fMRI data, in which the number of voxels (i.e., features) in each brain volume can vastly exceed the number of samples. To overcome these limitations, AI-driven approaches such as deep neural networks (DNNs) have emerged, providing solutions to the shortcomings of SVMs while also yielding higher decoding accuracies. A convolutional neural network is a form of DNN that employs a convolutional kernel that slides across predefined areas of the input data to extract topographically organized features in a way that is biologically plausible and meaningful (Cabral et al., 2012; Tong and Pratte, 2012; Vu et al., 2020).
Because anatomical regions and functional patterns might not align perfectly across sessions or subjects, classifying fMRI volumes using vectorized 1D patterns (flattened functional maps) in CNNs has inherent limitations. These limitations arise both from shifts in neuronal activation across sessions or subjects due to spatial misalignment and from variations in activation intensity across brain regions. This variability complicates the accurate representation of functional networks using simple vectorized patterns. Previous studies have demonstrated that 3D-CNNs can better accommodate such potential spatial misalignment than 1D vectorized approaches (Vu et al., 2020; Du et al., 2022), and fine-tuning of the selected 3D-CNN model can significantly enhance accuracy and prediction performance (Plis et al., 2014). In this study, a 3D convolutional neural network was therefore chosen as the primary classifier to assess the decodability of the obtained functional maps.
Architecture of the convolutional neural network model
We implemented a 3D-CNN using a custom-built MATLAB script (R2024b, MathWorks), running on two systems: (1) a MacBook Pro (14 inch, 2021) with an Apple M1 Pro chip and 32 GB memory, running macOS Ventura 13.2.1, and (2) a Lenovo laptop powered by a 13th-generation Intel Core i9-13980HX processor, an NVIDIA RTX 4000 Ada Generation Laptop GPU (12 GB), and 64 GB of DDR5-4000 MHz memory. The network starts with an input layer that accepts 3D whole-brain fMRI volumes, thresholded above 1.15 to retain the more significant activations. This layer is followed by three 3D convolutional layers (kernel sizes of 5, 7, and 7, with 8, 16, and 32 filters, respectively), each using a Rectified Linear Unit (ReLU) activation function. Each convolutional layer is followed by a max-pooling layer (filter size, 2; stride, 1) to reduce dimensionality. The feature maps are then flattened and passed through a fully connected layer with 128 neurons, followed by a dropout layer (dropout rate, 0.25) to prevent overfitting. The output layer consists of four neurons representing the four classification categories (phee, chatter, trill, and twitter) and uses a softmax function for the final classification. The network uses a cross-entropy loss function to measure the difference between the predicted output and the actual labels (loss). The 3D-CNN was trained using stochastic gradient descent with momentum (SGDM) over 40 epochs with a mini-batch size of 25 and an initial learning rate of 1 × 10−4. The classification process for a single run took ∼5–8 h to complete, depending on the specifications of the system and the data being used.
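A sketch of this architecture using MATLAB's Deep Learning Toolbox is given below; the input volume size is an assumption (the text does not specify the volume dimensions fed to the network), whereas the kernel sizes, filter counts, pooling, dropout rate, and training options follow the description above:

% Minimal sketch of the described 3D-CNN; the input size is an assumed value.
inputSize = [128 96 42 1];                       % [x y z channels], illustrative dimensions

layers = [
    image3dInputLayer(inputSize, 'Name', 'input')
    convolution3dLayer(5,  8, 'Name', 'conv1')   % kernel 5, 8 filters
    reluLayer('Name', 'relu1')
    maxPooling3dLayer(2, 'Stride', 1, 'Name', 'pool1')
    convolution3dLayer(7, 16, 'Name', 'conv2')   % kernel 7, 16 filters
    reluLayer('Name', 'relu2')
    maxPooling3dLayer(2, 'Stride', 1, 'Name', 'pool2')
    convolution3dLayer(7, 32, 'Name', 'conv3')   % kernel 7, 32 filters
    reluLayer('Name', 'relu3')
    maxPooling3dLayer(2, 'Stride', 1, 'Name', 'pool3')
    fullyConnectedLayer(128, 'Name', 'fc1')
    dropoutLayer(0.25, 'Name', 'drop')
    fullyConnectedLayer(4, 'Name', 'fc2')        % four call categories
    softmaxLayer('Name', 'softmax')
    classificationLayer('Name', 'output')];      % cross-entropy loss

options = trainingOptions('sgdm', ...
    'MaxEpochs', 40, ...
    'MiniBatchSize', 25, ...
    'InitialLearnRate', 1e-4, ...
    'Plots', 'training-progress');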
Because data from a single subject are insufficient to train a robust CNN model, single-trial whole-brain fMRI maps from eight marmosets were combined to train the CNN model. The model was then tested on the data from the remaining subject using leave-one-subject-out cross-validation (LOOCV). This procedure was repeated nine times (n = 9 subjects), leaving out a different subject each time.
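The leave-one-subject-out procedure could be organized as in the following sketch, assuming vols is a cell array holding each subject's single-trial volumes (stacked along the fifth dimension) and lbls the matching categorical call labels; layers and options refer to the architecture sketched above:

% Minimal sketch of leave-one-subject-out cross-validation; vols, lbls,
% layers, and options are assumed variables (see the architecture sketch).
nSubj = 9;
acc   = zeros(nSubj, 1);

for s = 1:nSubj
    trainIdx = setdiff(1:nSubj, s);                       % eight subjects for training
    XTrain   = cat(5, vols{trainIdx});                    % stack training trials along the 5th dim
    YTrain   = vertcat(lbls{trainIdx});                   % matching categorical labels
    net      = trainNetwork(XTrain, YTrain, layers, options);
    YPred    = classify(net, vols{s});                    % predict the held-out subject's trials
    acc(s)   = mean(YPred == lbls{s});                    % per-subject test accuracy
end
meanAcc = mean(acc);                                      % overall accuracy across subjects
semAcc  = std(acc) / sqrt(nSubj);                         % standard error of the mean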
Statistical significance of classification
The classification accuracy for each subject was calculated by dividing the number of correctly predicted trials by the total number of trials tested for that subject in each iteration. The overall accuracy was then obtained by averaging the accuracy scores across all nine subjects, and the SEM was calculated. The training and testing accuracy, along with the loss, were continuously monitored throughout the training process, as depicted in Figure 5C.
To further validate the significance of the model’s performance, we conducted a shuffled-label control analysis as well. The purpose of the shuffled-label control was to establish a baseline, indicating the performance of the model when no real relationship exists between the data and labels (Ossmy et al., 2015). In this analysis, the labels of the training data were randomly shuffled, breaking the relationship between the input data and their corresponding labels. The network was retrained using the shuffled labels under the same conditions as the original experiment, including the same LOOCV procedure. This process was repeated for each subject, similar to the nonshuffled condition, and the test accuracy was averaged across all nine test subjects. The accuracy obtained from this control was expected to be near the theoretical chance level (25% for four classes), as no meaningful decoding is possible with randomly assigned labels.
A two-tailed paired t test was conducted to statistically compare the accuracies from the original model (with correct labels) against the accuracies from the shuffled-label control (Ossmy et al., 2015). Finally, we performed an rmANOVA on the test accuracies for each class across nine runs to investigate differences among the experimental conditions.
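Assuming acc and accShuffled are 9-by-1 vectors of per-subject test accuracies from the true-label and shuffled-label models, and classAcc is a 9-by-4 matrix of per-class accuracies, these two tests could be run as in the following sketch (variable names are hypothetical):

% Minimal sketch of the significance tests; acc, accShuffled, and classAcc
% are assumed to hold the accuracies described in the text.
[~, pShuffle] = ttest(acc, accShuffled);          % two-tailed paired t test vs shuffled control

tblAcc = array2table(classAcc, 'VariableNames', {'phee','chatter','trill','twitter'});
within = table(categorical({'phee'; 'chatter'; 'trill'; 'twitter'}), ...
               'VariableNames', {'Call'});
rmAcc  = fitrm(tblAcc, 'phee-twitter ~ 1', 'WithinDesign', within);
ranova(rmAcc)                                     % rmANOVA on per-class test accuracies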
Results
We performed event-related fMRI at 9.4 T in nine awake marmosets. Utilizing a "bunched slice acquisition sequence" (Fig. 1A), we collected each EPI volume in 1.5 s, while the effective TR was 3 s (with 1.5 s of silent time between volumes). We presented one of the four types of marmoset vocalizations (phee, chatter, trill, or twitter) during the silent periods every 3–12 s in a pseudorandom order (Fig. 1B). Figure 1C illustrates the combined group activation in response to all four vocal stimuli compared with the baseline periods, when no auditory stimuli were presented to the monkeys. The findings show robust activations in all regions of the auditory cortices, including the core, belt, and parabelt, at the cortical level. Subcortically, the calls activated the inferior colliculus (IC), medial geniculate nucleus (MG), amygdala (Amy), the reticular nucleus of the thalamus (RN), and the brainstem reticular formation (RF).
fMRI study overview. A, The sequential process of the bunched slice acquisition paradigm utilized in the study. Each acquisition cycle comprises a TA of 1.5 s followed by a TS of equal duration, collectively constituting a TR of 3 s. B, Graphical representation of the experimental task paradigm employed in the current study. Auditory stimuli with a duration of 0.6 s are randomly presented to marmoset subjects during the silent periods depicted in A, with interstimulus intervals varying between 3 and 12 s, chosen pseudorandomly. C, Representation of the group brain activation comparison (n = 9 marmosets) for the overall auditory task versus the baseline. The top panels depict surface maps, providing a topographical view of cortical activations. White lines delineate regions based on the atlas of Paxinos et al. (2012). The bottom panels show volumetric representations at different interaural (IA) levels, overlaid onto coronal slices of anatomical MR images. All surface maps are thresholded at z-scores below −5 and above 5, while volumetric maps are thresholded at z-scores below −2.3 and above 2.3, for deactivation and activation, respectively. Cold color gradients indicate deactivation (negative values), while hot color gradients signify activation (positive values), representing the spatial distribution and intensity of neural responses during the auditory task. LH, left hemisphere; RH, right hemisphere; Aud, auditory cortex; MG, medial geniculate nucleus; IC, inferior colliculus; Amy, amygdala; RN, reticular nucleus of the thalamus; RF, the brainstem reticular formation; Cd, caudate; A1, primary auditory cortex area; ML, auditory cortex middle lateral area; AL, auditory cortex anterolateral area; CPB, caudal parabelt area; RPB, rostral parabelt area; TPO, temporoparietal–occipital area.
Moreover, our findings in Figure 1C show strong, widespread deactivation in response to the brief vocalizations in several motor regions, including primary motor cortex 4ab and premotor cortex areas 6DC, 6DR, 6Va, and 6M, as well as in cingulate cortices 32, 24a–d, and 23a–c; somatosensory cortex areas 1/2, 3a, and 3b; prefrontal area 8Av; and parietal areas PE and PFG. Additionally, Figure 2 illustrates bilateral deactivation in the cerebrocerebellum (Crll) for all vocal stimuli versus the baseline.
Bilateral deactivation of the cerebellum. Volumetric representation illustrating the deactivation of the cerebellum in response to the overall auditory tasks. All volumetric maps are set to a threshold of z-scores below −2.3 and above 2.3, for deactivation and activation, respectively. Cold color gradients with negative values show deactivation, while hot color gradients with positive values denote activation, representing the spatial distribution and intensity of neural responses during the auditory task. LH, left hemisphere; RH, right hemisphere; MG, medial geniculate nucleus; Crll, cerebellum.
To directly identify the activation pattern associated with a particular call, we subsequently conducted group comparisons (n = 82 runs) between the functional response for each of the four vocalizations and the baseline in Figure 3. Overall, each of the four vocalizations activated a relatively similar network in marmosets. This shared call-specific network predominantly encompassed the auditory cortices including core [primary area (A1), rostral field (R), and rostral temporal (RT)], belt [caudomedial (CM), caudolateral (CL), mediolateral (ML), anterolateral (AL), rostromedial (RM), rostrotemporal medial (RTM), rostrotemporal lateral (RTL)], and parabelt [rostral parabelt (RPB) and caudal parabelt (CPB)] regions.
Brain activations for the different conspecific calls versus the baseline. Group functional topologies (n = 9 marmosets) for phee (A), trill (B), twitter (C), and chatter (D) against the baseline are presented on the lateral and medial views of the right fiducial marmoset cortical surfaces. Volumetric activations at various IA levels are superimposed onto coronal slices of anatomical MR images. All activation maps are thresholded within the range of z-scores below −2.3 and above 2.3. Cool color gradients with negative values denote neural deactivation, whereas warm color gradients with positive values represent neural activation, illustrating both the spatial distribution and intensity of neural responses elicited by the auditory task. LH, left hemisphere; RH, right hemisphere; Aud, auditory cortex; MG, medial geniculate nucleus; IC, inferior colliculus; Amy, amygdala; RN, reticular nucleus of the thalamus.
At the subcortical level, the IC and MG were activated by each of the four call types relative to the baseline. Amygdala activation was found for the presentation of phee, twitter, and chatter calls, whereas the comparison between trill calls and the baseline failed to show significant activation in this region.
To better illustrate the activations for the four calls, we displayed them on flat maps of the right hemisphere in Figure 4A. The figure shows that, in addition to the auditory cortices, a few other cortical areas such as the temporoparietal–occipital area (TPO), insular proisocortex (Ipro), temporal proisocortex (Tpro), superior rostral temporal area (STR), and the retroinsular area (ReI) were also significantly activated by each of the conspecific vocalizations compared with the baseline. Moreover, the figure demonstrates that cingulate areas 32, 24a–d, and 23a–c; primary motor cortex 4ab; premotor areas 6DC, 6M, 6V, and 6DR; somatosensory areas 3a, 3b, and 1/2; parietal area PE; and frontal areas 8aD, 8Av, and 8b were deactivated by the presentation of each of these single calls.
ROI analysis. A, Representation of the neural activity patterns for each call versus the baseline on flat maps of the right hemisphere. The z-score maps are thresholded at below −2.3 and above 2.3. Warm color gradients indicate activation, while cold color gradients signify deactivation. B, Beta value analysis of the activity for each call versus the baseline across 24 ROIs. Each ROI is represented by four bars, each corresponding to the functional activity of a specific call as indicated in the left column of the plot, with the bar height reflecting the magnitude of activity in that ROI in response to that call. Significance levels of the differences between calls in each ROI are displayed beneath each graph using asterisks (*p ≤ 0.05; **p < 0.01; ***p < 0.001). Where pairwise differences between conditions are significant, asterisks indicating the significance levels are displayed above the corresponding bars (same convention). The vertical line on each bar indicates the SEM. CM, auditory cortex caudomedial area; A1, auditory cortex primary area; R, auditory cortex rostral area; RT, auditory cortex rostrotemporal area; CL, auditory cortex caudolateral area; ML, auditory cortex middle lateral area; AL, auditory cortex anterolateral area; RTL, auditory cortex rostrotemporal lateral area; RTM, auditory cortex rostrotemporal medial area; RM, auditory cortex rostromedial area; CPB, auditory cortex caudal parabelt area; RPB, auditory cortex rostral parabelt area; TPO, temporoparietal–occipital area; Ipro, insular proisocortex; Tpro, temporal proisocortex; STR, superior rostral temporal area; ReI, retroinsular area.
Despite these overall similarities, Figure 4A also shows some differences in activation between the four calls. To quantify the variations in response magnitude and the effect size of the experimental conditions on brain activity within this call-specific network, we extracted the beta value for each call for 24 ROIs and performed rmANOVA on these values. Significant p values are displayed by asterisks beneath each ROI in Figure 4B. We found differences between the calls in CL (F(3,243) = 6.66; p < 0.001), ML, CPB, and Tpro (F(3,243) = 4.95; F(3,243) = 5.13; F(3,243) = 5.56, respectively; all p < 0.01), as well as in RM, Ipro, and A1 (F(3,243) = 2.82; F(3,243) = 3.02; F(3,243) = 2.63, respectively; all p ≤ 0.05). Between-call differences are indicated above corresponding bars in each of these ROIs, where significant differences were found between twitter and chatter in A1 (p < 0.05) and Tpro (p < 0.01); between twitter and trill in CL (p < 0.05) and ML (p < 0.01); between phee and trill in CL (p < 0.001), ML (p < 0.01), and CPB (p < 0.01); and between phee and chatter in CL (p < 0.05), CPB (p < 0.05), and Tpro (p < 0.01).
Figure 4B also shows that A1, CM, CL, ML, Tpro, and Ipro exhibited higher activations compared with other regions in response to all types of vocal stimuli. In addition, we found that the response patterns for the twitter and phee calls were more similar in RM, ML, and Tpro, whereas the pattern was more alike in CL, CPB, ML, and Ipro for trill and chatter. Some regions such as R, CM, RTM, RPB, and STR showed relatively similar responses to all four call types.
To test whether the activation patterns could distinguish between the four calls, a decoding analysis was performed using a 3D-CNN. We applied the LOOCV approach, whereby the model was retrained and tested nine times; each time, a different subject’s data were used for testing, while the remaining eight were used for training the classifier. The results from the CNN yielded a significant above-chance classification performance [mean accuracy across nine subjects, 45.07% (with a maximum mean accuracy of 62.16%); mean shuffled data accuracy, 25.43%; chance, 25%; two-tailed paired t test compared with shuffle p < 0.001]. The predicted labels from the model were compared against the actual labels to generate a confusion matrix, shown in Figure 5A. The confusion matrix represents the model's overall performance in classifying the four conditions. The x-axis represents the predicted labels, while the y-axis shows the true (actual) labels. Also, the diagonal elements indicate the correct identification rate for each condition, with off-diagonal cells indicating misclassifications.
Classification results from the convolutional neural network. A, After training, the CNN processes unlabeled functional maps (test data), predicting the corresponding labels. The predicted labels are then compared with the actual (true) labels to generate the confusion matrix shown here. The diagonal elements of the confusion matrix represent the correct classifications for each vocalization, while the off-diagonal elements indicate misclassifications. The chance level for this study is 25%, as there are four possible conditions. B, This figure shows the accuracy of classifying each class correctly. The accuracy for all classes is above the chance level. The vertical line of each bar demonstrates the SEM. C, Test and training accuracies during a single run of the training process are monitored. The alternating white and shaded regions represent epochs. In the top panel, the dark blue line represents the smoothed training accuracy, while the light blue line shows the actual training accuracy for each iteration. The dashed black line depicts the validation accuracy (or test accuracy) evaluated at regular intervals (frequency, 5) during training. The accuracy trends upward as the model learns from the data, with fluctuations reflecting the model's adjustments during the training. In the bottom panel, the loss (a measure of model error) during both training and validation is plotted over the same iterations. The orange line represents the smoothed training loss, the solid red line shows the actual training loss per iteration, and the dashed black line illustrates the validation loss. The overall decreasing trend in loss signifies an improvement in the model's performance, although some fluctuations occur due to the dynamic nature of training. The figure was generated using MATLAB's deep learning toolbox (R2024b).
Our findings demonstrate that the 3D-CNN model classified all four conditions above the chance level (25%), as shown separately in Figure 5B. Among the conditions, phee calls were most accurately identified, with an average accuracy of 51.28%. Chatter and twitter followed, with similar average accuracies of 45.78% and 43.42%, respectively. Trill calls were the least accurately classified, with an accuracy of 39.81%. However, the rmANOVA suggests that this trend between calls is not statistically significant (F(3,24) = 0.64; p = 0.59). The confusion matrix further suggests that phee and twitter calls were misclassified as each other more often than as other call types. Twitter calls were also often confused with chatter, with a misclassification rate close to that for phee. Chatter calls, on the other hand, were most commonly misclassified as twitter, while trill calls were predominantly misclassified as chatter. Figure 5C shows the training and test accuracy, which were continuously monitored throughout the network's training process. The upward trends in both training and test accuracies, as seen in the top panel, along with the decreasing loss in the bottom panel, indicate that the 3D-CNN was effectively trained with the selected parameters and dataset in this sample run.
To identify call-specific activations, we next compared the responses to each call with those of the other three calls. Figure 6 depicts these results both on flat maps and in volume space (coronal and sagittal views). Phee calls (first row) were characterized by more robust activations in the auditory area CPB and area TPO. In the cerebellum, phee calls elicited stronger responses in the deep cerebellar nuclei (DCN) and cerebellar lobules VIIB and VIIIA. For trill calls (second row), larger responses were present in orbitofrontal area 11 and anterior cingulate cortex area 32 compared with the other calls. Twitter calls (third row) exhibited stronger activation in area CM of the auditory cortex and the MG. Finally, chatter calls (fourth row) were associated with more intense activations in the superior colliculus (SC), parietal area PE, and premotor area 6M in comparison with all other vocalizations.
Brain activity in response to each call versus all other calls. Each row illustrates the neural response of brain regions in response to phee, trill, twitter, and chatter calls, respectively, compared with all other calls. The neural patterns are displayed on flat cortical maps for both the right and left hemispheres, along with volumetric representations at different interaural (IA) levels. All z-score maps are thresholded to display values less than −2 and >2. LH, left hemisphere; RH, right hemisphere; CPB, auditory cortex caudal parabelt area; TPO, temporoparietal–occipital area; DCN, deep cerebellar nuclei; LVIIB/LVIIIA, cerebellar lobules VIIB and VIIIA; CM, auditory cortex caudomedial area; MG, medial geniculate nucleus; SC, superior colliculus; PE, parietal area.
We previously showed that blocks of vocalizations, compared with scrambled vocalizations or nonvocal sounds, activate the medial prefrontal cortex (mPFC) area 32 (Jafari et al., 2023). Therefore, we compared activations in the mPFC between the calls (Fig. 7). The calls on the x-axis were compared with those listed on the y-axis (calls x-axis > calls y-axis), and the results were thresholded at z-scores higher than 2. The results show that area 32 was significantly more activated by phee and trill calls than by twitter calls. Area 9 was more activated by trill and phee calls than by chatter calls, and area 14R was more active for chatter than for phee or twitter calls. Overall, the results show more activation in the mPFC for phee and trill calls compared with each of the other calls.
Neural response of the mPFC to each call versus each of the other calls. The volumetric representation of mPFC activity for calls on the x-axis compared with those listed on the y-axis (calls x-axis > calls y-axis). Results are thresholded at z-scores >2.
Discussion
Our recent research revealed that conspecific vocalizations activate a fronto-temporal network in marmosets (Jafari et al., 2023; Dureux et al., 2024). Here, we investigated the specificity of this network in processing various calls by conducting whole-brain event-related fMRI in nine adult marmosets. The sounds encompassed four prevalent call types: phee, chatter, twitter, and trill. Consistent with our prior findings (Jafari et al., 2023; Dureux et al., 2024) and existing studies on the marmoset auditory pathway (Eliades and Tsunada, 2019), we found that the calls activated a network encompassing both cortical and subcortical structures, involving primarily the auditory cortices, including their core, belt, and parabelt areas, as well as adjacent areas such as TPO, Ipro, Tpro, STR, and ReI.
When comparing all vocalizations against the baseline, we observed that many cortical regions, including premotor cortical areas 6DC, 6DR, 6Va, and 6M; primary motor cortex 4ab; cingulate cortices 24a–d, 32, 23a–c, and 23v; prefrontal area 8Av; and somatosensory cortex areas 1/2 and 3, were deactivated in response to short conspecific calls. These areas are known for their roles in controlling motor functions, including vocalization, such as the voluntary initiation of vocalizations in nonhuman primates and speech in humans (Jürgens, 2002; Loh et al., 2017). Similar deactivation was observed in the Crll, which is known to play an important role in motor preparation (Strick, 1983). These deactivations suggest that the marmosets likely reduced minor limb and body movements, present during the long baseline periods (lasting 3–12 s), when exposed to the shorter vocalizations (0.6 s). It is possible that the animals entered a state of stillness, suppressing motor activity to focus on perceiving the auditory stimuli. This finding aligns with a functional ultrasound study that reported reduced blood flow in the medial sensorimotor cortex during the presentation of conspecific vocalizations in marmosets (Takahashi et al., 2021).
Several fMRI experiments with awake marmosets have revealed a tonotopic organization in their auditory cortex. This tonotopic arrangement is consistent with the general pattern observed in the auditory cortex of other primates, corroborating findings from optical recordings (Song et al., 2022) and electrophysiological studies (Bendor and Wang, 2008). Our data are in line with this tonotopic structure, identifying a caudal–rostral gradient for vocalization selectivity in the auditory cortex.
Although all four calls activated a core network with substantial overlapping activation, our analysis of the ROIs, combined with the results of the 3D-CNN classifier, revealed distinct activation patterns for each call type. Twitter and phee calls, commonly used for long-distance communication, showed the greatest variation in activation patterns across the ROIs. While these two calls were often correctly classified, they were also frequently misclassified as each other, suggesting that both calls may activate an overlapping network. These observations align with a previous functional ultrasound imaging study that found no statistically significant differences in cerebral blood volume rates in medial brain regions, particularly the mPFC, for twitter and phee calls (Takahashi et al., 2021). In contrast, chatter calls evoked the most uniform activation pattern among the four call types. Despite this uniformity, the 3D-CNN accurately classified chatter calls.
Similar to chatter, trill calls also showed a relatively uniform activation pattern. Both call types exhibited the strongest activation in the CM and ML areas of the auditory cortex. Our decoding analysis demonstrated that trill calls were frequently misclassified as chatter, suggesting an overlap in their activation patterns.
The presence of call-specific selectivity in CL, ML, and AL regions of the macaque auditory cortex (Tian et al., 2001; Belin, 2006) prompted the question of whether a similar hierarchical organization exists in marmosets. We found notable group distinctions in CL, ML, CPB, RM, and A1, indicating discrimination of conspecific calls in these regions.
A prominent activation was seen in the anterior cingulate area 32 and medial orbitofrontal cortex 11 (mOFC 11) for trill calls in comparison with all other conditions in our study. Trill calls are typically exchanged at a close distance between social partners during relaxed social states like foraging and resting and are usually regarded as positive welfare indicators (Bezerra and Souto, 2008; Liao et al., 2018; Zürcher et al., 2021). Indeed, mOFC 11 is known for its responsiveness to pleasant stimuli in macaques, and it is connected to the pregenual anterior cingulate cortex (Rolls et al., 2020).
Caudal belt regions, including CL and CM, have been recognized as components of a proposed pathway involved in spatial information processing (Recanzone et al., 2011). This observation is consistent with our findings, showing more pronounced CM activation in response to twitter calls compared with that in response to other calls, as long-distance calls typically rely on accurate spatial localization for effectiveness. Additionally, twitter calls contain rapid upward shifts in frequency over short time intervals, indicating a high level of temporal modulation (Pannese et al., 2016). This observation supports recent research which proposes that CM neurons are integral to temporal processing, especially in the precise discrimination of temporal envelopes (Kajikawa et al., 2008). Notably, a study using single-unit recordings in macaques performing an auditory discrimination task also found that neurons in the CM region had the broadest frequency tuning range (Recanzone et al., 2000).
Lastly, for chatter calls, which are often exchanged during intergroup or outgroup aggression (Grijseels et al., 2023), our findings revealed significantly higher activity in the SC, parietal area PE, and premotor area 6M compared with all other calls. This neural activity pattern suggests that these regions are specifically engaged in processing the distinct features of chatter calls, possibly reflecting orienting responses to these signals as marmosets may perceive a threat in these situations. The SC integrates sensory information from vision, hearing, and touch and generates motor signals controlling movements of the eyes, ears, head, and body (Burnett et al., 2004). As outlined by Mountcastle (1997), neurons in the parietal area PE, part of the posterior parietal cortex, are organized in modular sets that are activated during tasks like gaze fixation, object manipulation, and pursuit tracking. Also, it has been shown that neurons in the ventral intraparietal area are multimodal, responding to different stimuli, particularly when objects are near or approaching the body. Stimulation of these areas triggers defensive-like reactions, such as closing the eyes, grimacing, pulling the head back, raising the shoulder, and moving the hand to protect the area near the head or shoulder, indicating their role in coordinating protective responses to potential threats (Cooke et al., 2003; Graziano and Cooke, 2006). This suggests that in response to chatter calls, the parietal area PE may be involved in coordinating attention and movement-related behaviors necessary for reacting to perceived threats.
We previously found strong activations for longer periods of vocalizations (12 s) in mPFC area 32 (Jafari et al., 2023). Although we show here that all brief call types (0.6 s) suppressed activations in the mPFC in comparison with the long baseline periods (3–12 s), we found significantly higher activity in response to phee and trill calls when compared with twitter and chatter calls. These two calls belong to the category of communicative calls, often grouped together (Agamaite et al., 2015). In addition to its role in vocalization processing, the mPFC is also implicated in the production of vocalizations (Jürgens, 2009). Studies in both New World squirrel monkeys (Jürgens et al., 1967) and Old World macaques (Robinson, 1967) have demonstrated that long-train electrical microstimulation of the pregenual mPFC can evoke vocalizations, likely via multisynaptic pathways through the periaqueductal gray to laryngeal motoneurons (Jürgens and Pratt, 1979). This suggests that the mPFC may serve as a critical interface between the neural circuits governing vocal perception and production.
In conclusion, our event-related fMRI study has revealed several key aspects of vocalization processing in marmosets. In addition to the finding of a core cortical and subcortical network activated by phee, trill, twitter, and chatter vocalizations, the observed deactivations in certain cortical and subcortical areas during vocalization align with the notion of reduced motor activity to facilitate enhanced auditory perception. In addition, the cerebellum’s significant role in processing phee calls, the activation patterns in the anterior cingulate and orbitofrontal cortices for trill calls, and the involvement of the CM belt and medial geniculate in processing twitter calls all point to a highly specialized and context-dependent vocalization-processing system in marmosets.
Footnotes
We thank Cheryl Vander Tuin, Whitney Froese, Miranda Bellyou, and Hannah Pettypiece for animal preparation and care and Dr. Alex Li for scanning assistance. This work was supported by the Canadian Institutes of Health Research (FRN 148365, S.E.), the Natural Sciences and Engineering Research Council of Canada (S.E.), and the Canada First Research Excellence Fund to BrainsCAN. We also acknowledge the support of the Government of Canada's New Frontiers in Research Fund (NFRF, NFRF-T-2022-00051).
The authors declare no competing financial interests.
Correspondence should be addressed to Azadeh Jafari at ajafar@uwo.ca.