Abstract
Mouse ultrasonic vocalizations (USVs) contain predictable sequential structures like bird songs and speech. Neural representation of USVs in the mouse primary auditory cortex (Au1) and its plasticity with experience have largely been studied with single syllables or dyads, without using the predictability in USV sequences. Studies using playback of USV sequences have used randomly selected sequences from numerous possibilities. The current study uses mutual information to obtain context-specific natural sequences (NSeqs) of USV syllables capturing the observed predictability in male USVs in different contexts of social interaction with females. The behavioral and physiological significance of NSeqs over random sequences (RSeqs) lacking predictability was examined. Female mice that had never had the social experience of exposure to males showed higher selectivity for NSeqs behaviorally and at the cellular level, probed by expression of the immediate early gene c-fos in Au1. Au1 supragranular single units also showed higher selectivity for NSeqs over RSeqs. Social-experience-driven plasticity in encoding NSeqs and RSeqs in adult females was probed by examining neural selectivities to the same sequences before and after the above social experience. Single units showed enhanced selectivity for NSeqs over RSeqs after the social experience. Further, using two-photon Ca2+ imaging, we observed social-experience-dependent changes in the sequence selectivity of excitatory and somatostatin-positive inhibitory neurons, but not parvalbumin-positive inhibitory neurons, of Au1. Using optogenetics, somatostatin-positive neurons were identified as a possible mediator of the observed social-experience-driven plasticity. Our study uncovers the importance of predictive sequences and introduces mouse USVs as a promising model to study context-dependent, speech-like communication.
SIGNIFICANCE STATEMENT Humans need to detect patterns in the sensory world. For instance, speech consists of meaningful sequences of acoustic tokens that are easily differentiated from randomly ordered tokens. The structure derives from the predictability of the tokens. Similarly, mouse vocalization sequences have predictability and undergo context-dependent modulation. Our work investigated whether mice differentiate such informative predictable sequences (NSeqs) of communicative significance from random sequences (RSeqs) at the behavioral, molecular, and neuronal levels. Following a social experience in which NSeqs occur as a crucial component, mouse auditory cortical neurons become more sensitive to differences between NSeqs and RSeqs, although preference for individual tokens is unchanged. Thus, speech-like communication and its dysfunction may be studied at the circuit, cellular, and molecular levels in mice.
Introduction
Humans produce and respond to behaviorally relevant sound sequences with specific temporal structures and predictiveness (Shannon, 1951). Many species, ranging from songbirds to mammals, emit sound sequences whose structural complexity may contain information beneficial to the receiver. Parallels between birdsong and speech have been a matter of interest for a long time (Marler, 1970; Doupe and Kuhl, 1999) because of similarities in vocal learning (Konishi, 1965), production (Brainard and Doupe, 2002), and presence of order selective auditory circuits (Doupe, 1997) in songbirds. Further, a rich literature is present on the neural representation in the auditory cortex for intraspecies communication in various mammals like nonhuman primates (Wang et al., 1995), guinea pigs (Grimsley et al., 2012), and cats (Gehr et al., 2000).
Mouse ultrasonic vocalizations (USVs) have characteristics of predictability similar to those of songbird songs (Holy and Guo, 2005) and human speech (Shannon, 1951). They contain multiple syllables, units of sound separated by silence, arranged in a nonrandom manner and often repeated in phrases. Male USVs emitted in mating contexts were shown to have predictiveness using a Markov model (Holy and Guo, 2005). Although mouse USVs are believed to be innate (Fischer and Hammerschmidt, 2011), they undergo contextual modulation (Grimsley et al., 2016; Portfors and Perkel, 2014; Matsumoto and Okanoya, 2018) and hold communicative significance (Ehret and Haack, 1984; Egnor and Seagraves, 2016). Preference for male USVs in females (Hammerschmidt et al., 2009) and for pup calls in mothers (Ehret and Haack, 1982) over other stimuli has been reported. Similar analyses of USVs produced by pups and adults, using Zipf's statistics, show predictiveness in sequences (Grimsley et al., 2011). We recently used mutual information (MI) to quantify the predictiveness among sequential syllables in pup USVs and derived highly predictive sequences (Agarwalla et al., 2020). The sequences had common characteristics in the calls of individual male and female mouse pups, showing consistency of sequences across individuals. Despite the predictable nature of mouse USVs, very few studies have taken advantage of it. For instance, Takahashi et al. (2016) quantified the predominant USV sequences of pup calls using a Markovian model and found that highly probable pup sequences, but not the least probable ones, could elicit maternal behavior. Niemczura et al. (2020) used the virtual mouse vocal organ, a probabilistic Markovian model proposed by Grimsley et al. (2011), to generate USV sequences and evaluated their significance behaviorally, quantifying corticosterone levels, the amount of central ambulation, and the number of syllables emitted by mice during playback. In addition, neurons in the auditory cortex (ACX) of female mice also respond to the playback of single syllables from pup calls (Liu and Schreiner, 2007; Galindo-Leon et al., 2009). Carruthers et al. (2013) presented a long USV sequence constructed by concatenating 350 syllables, which does not correspond to an original USV sequence, as bouts cannot contain syllables in the hundreds, and recorded from rat primary auditory cortex (Au1). Tasaka et al. (2018) genetically tagged the neurons actively involved in encoding USV sequences in the mouse auditory cortex using the Fos promoter and performed in vivo two-photon loose-patch recordings. However, none of the studies mentioned above have explored the significance of predictiveness in USVs. Therefore, there is a gap in knowledge regarding whether USV sequences with predictive power have distinct behavioral as well as neuronal signatures compared with sequences lacking predictiveness.
Our study evaluates the importance of highly predictive sequences at the behavioral, molecular, and neuronal levels and thus introduces mouse vocalizations as a model to study speech-like communication. Here, we first investigated whether highly predictive natural sequences (NSeqs) of USVs hold behavioral relevance over less predictive random sequences (RSeqs). Second, we probed the relevance of NSeqs versus RSeqs at the neuronal level using c-fos immediate early gene expression immunohistochemistry. As plasticity studies have shown that behavioral relevance can change the neuronal representation of sounds in birds (Menardy et al., 2012) and mice (Liu and Schreiner, 2007), our third goal was to understand whether social-experience-driven plasticity can alter the auditory cortical representations of the sequences. Finally, we identified the cell types involved in this plasticity using two-photon imaging and optogenetics.
Materials and Methods
Animals
All mice used in the study were 8–12 weeks old at the time of experiments. All procedures pertaining to animals used in the study were approved by the Institutional Animal Ethics Committee of the Indian Institute of Technology Kharagpur. Mice (Mus musculus; age and sex identified in individual cases) were reared under a 12 h light/dark cycle in individually ventilated cages, thermoregulated at 22–25°C with ad libitum access to food and water. All vocalization and extracellular single-unit recordings were performed in the C57BL/6J mouse strain. For two-photon imaging, the parvalbumin (PV)-ires-Cre (strain #08069, The Jackson Laboratory) and somatostatin (SOM)-ires-Cre (strain #013044, The Jackson Laboratory) driver lines were crossbred with Ai95(RCL-GCaMP6f)-D (strain #024105, The Jackson Laboratory) to label inhibitory neuron types with green fluorescence expression in the filial (F)1 progeny. Recordings from excitatory neurons were done using C57BL/6J-Tg (Thy1-GCaMP6f; strain #024339, The Jackson Laboratory). The optogenetic recordings were performed from positive F1 progeny mice, generated by crossbreeding the SOM-ires-Cre (strain #013044, The Jackson Laboratory) driver line with Ai40(RCL-ArchT/EGFP)-D (strain #021188, The Jackson Laboratory) to label the somatostatin-positive neurons with archaerhodopsin. The vocalization recordings were done in different contexts with either the male alone or the male interacting with a female. The extracellular recordings were conducted on animals of either sex. The behavior, two-photon imaging, and optogenetics studies were conducted on females.
USVs recording
To record adult male mouse USVs, initially a male mouse was kept in a wooden recording cage (12 × 18 × 15 cm) and placed in a sound isolation booth (Industrial Acoustics) alone for 5–10 min. No vocalizations were observed in this condition. A female mouse was introduced in the same setup with a separator between them for 5–10 min before the mesh separator was removed. USVs were emitted by the male in both of the latter cases. After 7–10 d, when the male was placed alone in the recording setup, it vocalized in the absence of the female. At that point, final recordings were started and analyzed further. USV recordings were collected using the following protocol: Context 1, the male mouse alone (Alone); Context 2, the male mouse with a female present in view, separated by a mesh between them (Separated); and Context 3, same as Context 2 without the separator; that is, the mice are in direct contact with each other (Together; see below, Results, Male mice produce context-dependent syllable sequences). The above context-specific USV recordings were made for at least 5 d. Acoustic signals were recorded with a free-field microphone and amplifier (one-fourth-inch microphone, Model 4939, Bruel and Kjaer) with flat frequency response up to 100 kHz and slightly diminishing sensitivity at higher frequencies. The acoustic signals were digitized at 250 kHz with 16-bit resolution collected with National Instruments Data Acquisition Systems. Recorded signal time waveforms and spectrograms were displayed in real time on a computer with open access bioacoustics software Ishmael.
Vocalization recording analyses
All USV analyses were performed in MATLAB (MathWorks) and have been presented in detail in our previous work (Agarwalla et al., 2020). In brief, each WAV file was divided into 5 s epochs. Background noise was eliminated by bandpass filtering with a Butterworth filter of order 8, removing frequencies outside the 20–120 kHz range. Syllable segmentation was performed by first calculating the short-term Fourier transform of each epoch with a 1024-sample Hamming window and 75% overlap. Syllables were identified by calculating the power concentrated in each frame, normalized by the average power across all frames and median filtered over 30 ms windows; syllables were then detected as peaks in this power trace. Syllables were classified into five categories (Agarwalla et al., 2020): Noisy (N-type) syllables have broad energy content, and syllables with harmonic content are H-type; the remaining syllables are classified, based on the presence or absence of discontinuities, into three classes, namely S type (continuous contour of spectral energy), Jump (J type, a single discontinuity), and Other (O type, more than one discontinuity). All subtypes of syllables are also present, as observed in other studies (Portfors, 2007; Grimsley et al., 2016; Matsumoto and Okanoya, 2018). However, we used only pitch jump as the primary classification criterion, based on Holy and Guo (2005), to restrict the number of broad classes, enabling analyses that require large sample sizes. Moreover, sudden discontinuities in pitch are also inherently tied to the vocalization production machinery.
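A minimal sketch of this segmentation step is given below, assuming a single 5 s epoch loaded from a hypothetical file; the peak-detection threshold is illustrative rather than taken from the original pipeline.

```matlab
% Sketch of the syllable-segmentation step (illustrative parameters).
fs = 250e3;                              % sampling rate (Hz)
x  = audioread('epoch.wav');             % one 5 s epoch (hypothetical file)

% Band-pass 20-120 kHz (4th-order design -> order-8 Butterworth bandpass)
[b, a] = butter(4, [20e3 120e3] / (fs/2), 'bandpass');
xf = filtfilt(b, a, x);

% Short-term Fourier transform: 1024-sample Hamming window, 75% overlap
[S, ~, t] = spectrogram(xf, hamming(1024), round(0.75 * 1024), 1024, fs);

% Power per frame, normalized by mean power and median filtered over ~30 ms
p = sum(abs(S).^2, 1);
p = p / mean(p);
p = medfilt1(p, max(1, round(0.03 / (t(2) - t(1)))));

% Candidate syllables as peaks of the power trace (threshold is an assumption)
[~, pkIdx] = findpeaks(p, 'MinPeakHeight', 2);
syllableTimes = t(pkIdx);                % candidate syllable times (s)
```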
Identification of high-probability sequences in different contexts
The significance of occurrence of each syllable at different positions, given the previous syllable, was computed for each context. For the syllable in the first position in a bout, the probability of occurrence of each syllable type at the beginning of the bout was obtained and compared with the equally likely probability of occurrence of each syllable type. Syllables with a higher (90% confidence) probability of occurrence than chance were considered significant. The process was continued for each of the subsequent positions, keeping the previous syllable types fixed, until no further significant syllables were observed. In this manner we identified sequences that occur above chance and impart structure in the different contexts.
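A minimal sketch of this test for the first bout position is shown below, with placeholder syllable labels (1–5), a uniform chance level, and bootstrap 90% confidence intervals standing in for the original significance computation.

```matlab
% Sketch: is a syllable type over-represented at the first bout position?
firstSyll = randi(5, 500, 1);            % placeholder labels at position 1
nTypes = 5;  chanceP = 1 / nTypes;       % equally likely occurrence
nBoot = 1000;  K = numel(firstSyll);
bootP = zeros(nBoot, nTypes);
for b = 1:nBoot
    resamp = firstSyll(randi(K, K, 1));  % resample bouts with replacement
    bootP(b, :) = histcounts(resamp, 0.5:1:nTypes + 0.5) / K;
end
loCI = prctile(bootP, 5);                % lower bound of the 90% CI per type
significantTypes = find(loCI > chanceP); % types occurring above chance
```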
Surprise analysis
Surprise was computed by calculating the dissimilarity between the posterior P(D) and prior P(M) distributions of occurrence of syllables in sequences using the Kullback–Leibler divergence (KLD; Cover and Thomas, 2006). As the sequences used as stimuli had a minimum of three and a maximum of seven syllables, the sequences considered in the calculations are of length three to seven. Surprise was defined by the average of the log-odds ratio (Itti and Baldi, 2009) as follows:
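Following the KLD form of surprise (Itti and Baldi, 2009), with k indexing the possible sequences, this corresponds to

$$\mathrm{Surprise} = \mathrm{KLD}\big(P(D)\,\|\,P(M)\big) = \sum_{k} P(D = k)\,\log_2\frac{P(D = k)}{P(M = k)}.$$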
The sum is over all the possible values the random variable M can take, denoted by k, or the space of the sequences; D also takes on the same values.
Sequence construction and auditory stimulus delivery
The stimulus consisted of 12 RSeqs drawn randomly based on the probability distributions of syllables obtained in the three contexts (four sequences from each context). The remaining eight sequences were the high-probability NSeqs emitted in the different contexts. Between each syllable, a silence of 90 ms was inserted based on the mean of the intersyllable silence (ISS) distribution (Fig. 1E). All stimuli were generated using custom-written software in MATLAB (MathWorks) and a data acquisition board (National Instruments), attenuated using a TDT PA5 attenuator (Tucker-Davis Technologies), generated using calibrated TDT ES1 speakers (driven with TDT ED1 drivers), and delivered through a speaker kept 10 cm away from the contralateral ear. The acoustic calibrations of the ES1 speakers (Tucker-Davis Technologies) in the sound chamber, performed with microphone 4939 (Brüel & Kjær), showed a typical flat (95 dB) calibration curve from 4 to 60 kHz at 0 dB attenuation on the PA5. Each syllable in a sequence had 5 ms onset and offset ramps and was root mean square (rms) matched. All awake and anesthetized recordings were done at 65–75 dB SPL and 75–85 dB SPL, respectively. Stimuli, usually with five (rarely four or six) repetitions, were presented in randomized order and used for further analysis. Each stimulus onset was preceded by a baseline of 500 ms, and there was an intertrial interval of 5 s.
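A minimal sketch of the sequence assembly (90 ms intersyllable silences, 5 ms ramps, rms matching) is shown below; the file names, target rms, and ramp shape are assumptions for illustration.

```matlab
% Sketch: assemble one sequence from syllable waveforms.
fs = 250e3;
issSamp  = round(0.090 * fs);            % 90 ms intersyllable silence
rampSamp = round(0.005 * fs);            % 5 ms onset/offset ramps
targetRms = 0.05;                        % arbitrary target rms
syllFiles = {'syll_S.wav', 'syll_J.wav', 'syll_H.wav'};   % hypothetical files
seq = [];
for i = 1:numel(syllFiles)
    s = audioread(syllFiles{i});
    s = s * (targetRms / sqrt(mean(s.^2)));              % rms matching
    ramp = 0.5 * (1 - cos(pi * (0:rampSamp - 1)' / rampSamp));
    s(1:rampSamp) = s(1:rampSamp) .* ramp;               % onset ramp
    s(end - rampSamp + 1:end) = s(end - rampSamp + 1:end) .* flipud(ramp);
    seq = [seq; s; zeros(issSamp, 1)];                   % syllable + silence
end
seq = [zeros(round(0.5 * fs), 1); seq];                  % 500 ms baseline
```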
Figure 1. Naive female mice prefer NSeqs emitted by male mice during social exposure. A, Schematic depicting the three social contexts (Alone, Separated, and Together) in which mouse vocalizations were recorded. B, Examples of spectrograms of representative syllable types. Different syllable types are depicted with different color bars. C, Probability distribution of the different syllable types in the three contexts shown as bars with percentage of syllables. Right, Diagonal matrix quantifies the KLD distance among the distributions at 95% CIs (in bits). D, Joint probability distributions of syllable-to-syllable transitions, considering the starting two syllables in bouts, are depicted in each of the first three matrices in the row for the three contexts of the adult male (Alone, Separated, Together). Right, The diagonal matrix quantifies the KLD among the joint distributions. E, The three distributions depict the ISSs observed in each context of the adult male. The vertical dashed line (at 250 ms) marks mean + 2 × SD of the overall data. F, The distribution of the percentage of bouts of a particular length present in each of the contexts is shown. G, Plots of MI, I(S1; Si) with i = 1, 2, ..., 10, in the three different contexts in blue with 95% CIs. Red plots, with 95% CIs, show the expected extent of the zero-MI estimate from the data after scrambling the order of syllables. Lack of overlap of the CIs (red and blue) indicates significant MI. H, The three matrices represent the MI calculated as in D, with each row showing the MI of the nth syllable with the first (row 1), second (row 2), third (row 3), and so on. The diagonal elements show the entropy of the syllable in the corresponding position from the bout start. Asterisks indicate significance at 95% confidence.
Two-choice free access/preference task, NSeqs versus RSeqs
All behavioral tests were performed inside a soundproof anechoic chamber under dim red light. The test cage was an acrylic rectangular box (30 cm long × 20.5 cm wide × 20 cm tall). Two tubular ports (9 cm long), terminated with a mesh on one end and each 5 cm in diameter, were attached to the test cage diagonally opposite each other, one to the right-side (RS) corner and the other to the left-side (LS) corner (Fig. 2C, setup schematic), with a speaker beyond the mesh end of each port. The mice had to enter the tubes to explore the inside of the tubular ports, the contents of which were otherwise not visible from the box. Before starting a session, the entire test cage was wiped with 70% ethanol. There were four sessions (Ss), denoted S1 to S4, each of 5 min. In S1 the test female mouse was allowed to explore the test cage; in S2 NSeqs (say, from RS) and RSeqs (from LS) were played alternately with an interstimulus gap of 5 s; in S3 there was no stimulus; and S4 was the same as S2 with the corners swapped for NSeqs (from LS) and RSeqs (from RS). A webcam (Logitech C925e) was fitted 35 cm above the center of the test cage from the base, and the entire duration of the four sessions was recorded at 15 frames/s, 1411 kbps, using Logitech software.
Figure 2. Naive adult female mice prefer NSeqs over RSeqs. A, Spectrograms of the syllables used for stimulus design. B, The set of sequences created for RSeq and extracted for NSeq are depicted with the color bars as shown in A. Sequences numbered 01–12 are the RSeqs. The light blue background in a subset of the RSeqs indicates the sequences chosen for the same-length case. The NSeqs in three different contexts (numbered 13–20) are identified above each NSeq as Alone, Separated (SEP), and Together (TOG). C, Top row, S1–S4 show the four sessions recorded, with no sounds played in S1 and S3. S2 and S4 have sounds played from speakers as indicated. Bottom row, The two bar plots to the left of the dashed line show bars indicating time spent on the side of NSeqs and RSeqs in S2 and S4 in the equal-length case. The same for the RSeqs with seven syllables is shown to the right of the dashed line. Each animal is assigned the same color in S2 and S4. The black line represents the overall mean ± SEM of the population data; *p < 0.05, **p < 0.01, ***p < 0.001; NS, Not significant.
Calculation of joint distributions and MI-based dependence
Transition probabilities were computed by estimating the joint probability distribution. The Krichevsky–Trofimov correction was applied to handle combinations with zero counts. MI between two random variables X and Y quantifies the total dependence between them (Cover and Thomas, 2006) and can be computed from the joint distribution P(X, Y) and its marginal distributions P(X) and P(Y) as follows:
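With the sums running over all syllable types x and y, the standard expression (Cover and Thomas, 2006) is

$$I(X;Y) = \sum_{x}\sum_{y} P(x, y)\,\log_2\frac{P(x, y)}{P(x)\,P(y)}.$$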
By considering the syllables at a particular position as the random variable X and the syllables k steps later as the random variable Y, each taking on values 1, 2, ..., 5 (the five possible syllable types), we computed the dependence, that is, the MI between syllables at these two positions in a bout of syllables.
As MI estimates are sensitive to bias (Bandyopadhyay and Young, 2004; Chase and Young, 2007), the results can be erroneous given the limited data size. This problem was circumvented through bootstrap removal of bias and by comparing the MI estimates with those obtained from scrambled syllable orders in the vocalization sequences, to retain only significant estimates of MI.
Removal of bias in mutual information estimates and significance analysis
Bootstrap debiasing was used to remove bias in MI estimates (Agarwalla et al., 2020; Bandyopadhyay and Young, 2004). In computing MI between two random variables X and Y (e.g., the syllable occurring in the first position of a bout and in the nth position of a bout) from samples of paired data xi, yi (i = 1, 2, …, K), a large number (N = 1000) of bootstrap datasets were created using sampling with replacement. Each bootstrap dataset is represented as xb,i, yb,i (i = 1, 2, …, K), where b varies from 1 through N for the different sets. MI estimated from the raw data [xi, yi (i = 1, 2, …, K)] is called the RawMI. Each bootstrap dataset gives an estimate of MI, called the BootstrapMI, of which there are N estimated values. The mean of the N BootstrapMI values, Bootstrap_MeanMI, represents the population mean (Efron and Tibshirani, 1994). From the population mean and the sample mean (RawMI), the bias is estimated as the difference (Bootstrap_MeanMI − RawMI) and is subtracted from the raw estimate for bias removal. Thus the debiased MI value is 2 × RawMI − Bootstrap_MeanMI. Confidence intervals on the estimates can be obtained from the variation of the N bootstrap estimates. We compare the lower confidence interval with the upper confidence interval of the estimate of zero MI, obtained similarly as above but now by randomly scrambling the x–y paired relation. In the case of sequences of syllables, this is done by not keeping the transitions intact, that is, by randomly scrambling the sequences, which leads to estimates of zero MI and its confidence interval from the same dataset with the same number of syllables and other statistics intact, except for the order of the syllables in a sequence. When the confidence intervals of the MI of the data and of the scrambled data did not overlap, the estimated MI was considered significant. Thus, we minimize the possibility (<5%) of spurious MI because of limited data size and variability.
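A minimal sketch of this procedure is given below, with placeholder syllable labels and a simple plug-in MI estimator standing in for the full pipeline (no Krichevsky–Trofimov correction); the exact confidence-interval construction is an assumption.

```matlab
% Sketch: bootstrap debiasing and significance test for MI between two
% discrete variables (placeholder data; plugInMI is a simple plug-in estimate).
x = randi(5, 300, 1);  y = randi(5, 300, 1);   % paired syllable labels
N = 1000;  K = numel(x);
rawMI  = plugInMI(x, y);
bootMI = zeros(N, 1);  scramMI = zeros(N, 1);
for b = 1:N
    idx = randi(K, K, 1);                      % resample pairs with replacement
    bootMI(b)  = plugInMI(x(idx), y(idx));
    scramMI(b) = plugInMI(x, y(randperm(K)));  % scrambled order: zero-MI reference
end
debiasedMI = 2 * rawMI - mean(bootMI);         % RawMI - (Bootstrap_MeanMI - RawMI)
ciLow  = prctile(2 * rawMI - bootMI, 2.5);     % lower CI of debiased estimate (assumed form)
zeroHi = prctile(scramMI, 97.5);               % upper CI of the zero-MI estimate
significant = ciLow > zeroHi;                  % non-overlapping CIs -> significant MI

function m = plugInMI(a, b)
% Plug-in MI (bits) from the joint histogram of two discrete variables.
J  = accumarray([a b], 1, [max(a) max(b)]) / numel(a);   % joint distribution
pA = sum(J, 2);  pB = sum(J, 1);
terms = J .* log2(J ./ (pA * pB));
m = sum(terms(~isnan(terms) & ~isinf(terms)));
end
```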
Calculation of KLD between distributions
The proximity among probability distributions was quantified using KLD (Cover and Thomas, 2006; Bandyopadhyay and Young, 2004), an information theoretic distance metric that makes no assumptions about the statistics of the data. The KLD between distributions P and Q taking on values over the same set (in our case, syllables produced in the three different contexts, taking values of the different syllable types with probabilities P(x) and Q(x), x being a syllable type, or the syllable-to-syllable transitions produced by the two groups) was computed as follows:
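The standard expression (Cover and Thomas, 2006), summing over all values x of the common support, is

$$\mathrm{KLD}(P\,\|\,Q) = \sum_{x} P(x)\,\log_2\frac{P(x)}{Q(x)}.$$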
We debiased the KLD using bootstrap resampling and calculated significance in the same way, with 95% confidence intervals, as for MI (above).
Tracking mouse movement
The videos of the four sessions monitoring the behavior of mice in the syllable sequence preference test were analyzed using custom code written in MATLAB. We used the MATLAB Computer Vision Toolbox function kalmanFilterForTracking for tracking the position of the mouse in the test chamber in every video frame during the sessions. When the mouse was inside a port and not visible in the chamber, the position of the mouse was assigned to the last observed location, always near the entry point of the port. If the mouse was not visible in the initial frames, the mouse location for all such frames was assigned to the coordinates of the position where the mouse was first detected. Bounded tracking was done by selecting a region of interest (ROI) covering the entire test floor, and the area outside it was never assigned as a possible mouse location. Based on the manually selected coordinates of the four corners of the test chamber, the area was then divided either into two equal halves corresponding to RS and LS, respectively, or into three equal parts corresponding to RS, LS, and a central neutral region. Based on the tracked coordinates of the mouse, the percentage of time spent toward each corner (RS or LS) was quantified for the different sessions and used for statistical comparisons. Two outliers among 9 mice and one outlier among 7 mice were excluded, as they were >10 SD and >11 SD (p ≈ 0) away and 3.72 SD (p < 0.0001) away, respectively, from the mean of the rest in side selectivity in the behavior. Exclusion or inclusion did not alter the median values of side selectivity and stimulus selectivity significantly.
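Given per-frame coordinates, the time-spent quantification can be sketched as follows; the placeholder track and the simple split of the floor along one axis into three equal parts are assumptions (the actual analysis used the manually marked chamber corners).

```matlab
% Sketch: percentage of time spent toward each side from tracked positions.
xy = [randi(600, 900, 1), randi(400, 900, 1)]; % placeholder track (15 fps, 60 s)
xLeft = 1;  xRight = 600;                      % floor limits along one axis
edges = linspace(xLeft, xRight, 4);            % three equal parts: LS, neutral, RS
inLS = xy(:, 1) < edges(2);
inRS = xy(:, 1) >= edges(3);
pctLS = 100 * mean(inLS);                      % percentage of frames toward LS
pctRS = 100 * mean(inRS);                      % percentage of frames toward RS
```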
c-fos Immunohistochemistry
The animals were reared in the in-house animal facility and were familiarized with the experimental environment at least 24 h before the experiments. The experiments were conducted in a sound-attenuated anechoic chamber. The mice were exposed to three different exposure paradigms to investigate the activation of c-fos+ cells. The first, control, condition, in the absence of auditory stimulus, started with 60 min of habituation, followed by a 15 min silent period equivalent to the duration of stimulus presentation in the two other conditions, and ended with an additional 60 min equivalent to the consolidation period. The second condition, Random (Ran), started with 60 min of habituation, followed by 15 min of exposure to RSeqs played at 75 dB SPL by an electrostatic speaker at a height of 15 cm from the surface, followed by a 60 min silent consolidation period. The third condition, Natural (Nat), exposure to NSeqs, started with 60 min of habituation, followed by 15 min of exposure to NSeqs played at 75 dB SPL at the same height as for RSeqs, followed by a 60 min silent consolidation period.
Postexposure, the mice were given a hypnotic dose of anesthesia in a transparent anesthesia chamber using 5% isoflurane. The depth of anesthesia was assessed by the toe pinch withdrawal reflex. The mice were perfused transcardially with 0.1 M chilled PBS, pH 7.4, followed by 4% chilled paraformaldehyde (PFA) fixative in 0.1 M PBS. The brains were postfixed for 18 h in 4% PFA in 0.1 M PBS. Serial coronal sections (30 μm thick) were cut using a vibratome (Leica VT1000 S), keeping the perfused brain tissue submerged in 0.1 M PBS and surrounded by an ice chamber.
The brain sections were preserved in cryoprotectant solution at −20°C for long-term storage. Free-floating sections were washed 5 times with 0.1 M PBS (10 min per wash), followed by incubation in 3% hydrogen peroxide (H2O2) solution for 15 min. The sections were further washed three times with 0.1 M PBS (10 min per wash). The free-floating sections were preincubated in a blocking solution [6% normal goat serum (NGS)] in 0.1 M PBS with 0.3% Triton X-100 (PBST) for 1 h, then incubated at 4°C with rabbit polyclonal anti-c-fos (1:1000; catalog #ABE 457, Sigma-Aldrich) in PBST with 2% NGS for 72 h. Following primary antibody incubation, the sections were washed four times with 0.1 M PBS and further incubated in goat anti-rabbit secondary antibody (1:1000; Alexa Fluor 594) for 2 h at room temperature. Following incubation, the sections were washed five to six times with 0.1 M PBS, mounted onto subbed slides (using Fluoroshield with DAPI, catalog #1003411675, Sigma-Aldrich), and sealed with clean coverslips.
Stereological quantification of c-fos
Immunofluorescence images were obtained using a Leica DM 2500 fluorescence microscope with 20× and 40× objective lenses. c-fos-immunoreactive cells were examined in sections containing subregions of the ACX, namely Au1; secondary auditory cortex, dorsal area (AuD); and secondary auditory cortex, ventral area (AuV). The above-mentioned cortical structures were delineated according to the reference anatomic atlases (Paxinos and Franklin, 2019). For the quantification of c-fos-immunoreactive cells, the coronal sections were sampled at distances of −1.70 to −3.40 mm from bregma. The temporal association cortex (TEA) and auditory thalamic nuclei, the medial geniculate nucleus ventral part (MGV) and medial geniculate nucleus dorsal part (MGD), were sampled from the same distances from bregma. The inferior colliculus (IC) sections were sampled between −4.96 and −5.20 mm from bregma, and quantification was performed in the central nucleus of the IC (CIC). Sampled sections were collected along the rostrocaudal axis matching the anatomic symmetry for all three exposure conditions, that is, Control (Ctrl), Ran, and Nat. The subcortical regions mentioned above were sampled from all five animals of each exposure group.
Quantification of the c-fos-positive cells was performed using ImageJ software (Fiji) with the Cell Counter plug-in. The imaged sections were matched with the anatomic positioning of the reference atlas and scaled, and regions of interest for the cortical and subcortical areas were marked out. c-fos-immunoreactive cells/mm2 were counted with positional marking. Differential counts/mm2 were measured for the selected cortical regions (mentioned above) for quantitative analysis. To illustrate the differential quantitative and spatial distribution of c-fos-immunoreactive nuclei in the Au1, AuD, and AuV areas, covering the whole ACX, along with the subcortical regions and auditory thalamus, representative images were made using ImageJ and Adobe Illustrator Creative Cloud 2015 software.
Surgical procedure for in vivo recordings
The mouse was anesthetized in an induction chamber with isoflurane. It was placed on a temperature-controlled heating pad (to maintain body temperature at ∼37°C), and isoflurane inhalation (5% for induction and 0.8–1% for maintenance) was delivered via a nose mask. An incision was made in the scalp along the midline. A craniotomy was drilled over the region of interest using a micro drill, and an electrode was placed targeting supragranular layer 2/3; the implant was cemented to the skull using Super-Bond. The animal was habituated for 5–7 d. The recording was done in a single-walled sound-attenuating chamber. On experimental days, the animal was placed securely into a foam body mold, and the head post was attached to a custom-made stereotaxic apparatus. For imaging, the mouse was first injected with 0.1 ml dexamethasone (5 mg/kg body weight, intramuscular) into the thigh muscle. Once the animal was stable under anesthesia, hair over the area of interest was removed using hair-removal cream. After a few minutes, the region was cleaned, followed by the procedures described above. For two-photon imaging, the craniotomy was sealed with a coverslip, keeping a 3-mm-diameter transparent imaging window. All the mice used for chronic recordings received postsurgical treatment with the antibiotic ceftriaxone (50 mg/kg, i.m.), a single dose daily for 3–5 d.
In vivo extracellular recordings
Extracellular recordings were performed using tungsten microelectrode arrays (MEAs) with impedances of 3–5 MΩ (MicroProbes); 4 × 4 custom-designed metal MEAs with an interelectrode spacing of 125 μm were used. The array was advanced slowly to a depth of 200–350 μm from the surface into the ACX using a micromanipulator (MP-285, Sutter Instrument). The electrodes were allowed to settle for 15–20 min before stimulus presentation. Signals were acquired after passing through a unity-gain head stage (Plexon HST16o25) followed by a PBX3 (Plexon) preamplifier with a gain of 1000 to obtain, in parallel, the wideband signal (0.7 Hz to 6 kHz) used to extract the local field potential and the spike signal (150 Hz to 8 kHz); both were acquired through a National Instruments data acquisition card (NI-PCI-6259) at a 20 kHz sampling rate, controlled through custom-written MATLAB (MathWorks) routines. Data acquisition lasted for <2 weeks, with ∼1–2 h sessions every day, for the electrode-implanted animals in the awake state. Units collected on each day from the implanted recording electrodes were considered separate units. All the sound tokens presented in all kinds of stimuli had 5 ms rise and fall times.
Analysis of in vivo extracellular recordings
Spike sorting was done off-line in custom-written MATLAB scripts. Data were baseline corrected and notch filtered (fourth-order Butterworth) to reject any remnant of 50 Hz power-supply oscillations. Single-unit spike times were obtained from the acquired spike channel data using threshold crossing and spike sorting with custom-written software in MATLAB. The mean spike rate was calculated by taking the mean of the neuronal firing over the stimulus duration + 20 ms. A neuron was considered to respond significantly if the spontaneous activity (200 ms before stimulus onset) was significantly different from the evoked activity (unpaired t test, p < 0.05).
All units that responded significantly to any syllable were considered for calculating neuronal selectivity. Mean spike rates were calculated by averaging the spike rate over trials for each of the common monosyllables and disyllables (excluding the ones at the starting position) in NSeqs and RSeqs, and for each of the 20 sequences (NSeqs and RSeqs) over their corresponding durations. For each neuron, selectivity was calculated for individual stimuli by the following:
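One normalized-rate form consistent with these definitions is assumed here (a sketch, not necessarily the exact expression used):

$$\mathrm{Selectivity}(S_i) = \frac{R_i}{\sum_{k} R_k},$$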
where Si is the ith stimulus and Rk represents the mean spike rate to the kth stimulus; k varies from 1 to the total number of stimuli under consideration.
Wide-field calcium imaging
After animal preparation for the chronic window implant, wide-field cortical images were acquired using X-Cite Q120 blue light (450–490 nm) excitation. A 4×, 0.13 NA objective (Olympus) was focused 200 μm below the cortical surface. The green emission fluorescence (500–550 nm) was collected onto a charge-coupled device camera at ∼20 Hz.
To identify the ACX regions, a significant change in fluorescence (df/f) was computed by subtracting from each frame, and normalizing by, the mean response of the baseline frames (the three prestimulus frames). Pixels with significant df/f retained their values (unpaired t test). A Gaussian filter with an SD of two was used to smooth the df/f map, which was then normalized by its absolute maximum. A binary image was generated from the pixels with values >0.5. Using the built-in MATLAB function bwboundaries, boundaries with eight-connected neighborhoods were drawn for the different frequencies.
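The mapping step can be sketched as follows, with a placeholder movie and illustrative frame indices; the per-pixel significance test described above is omitted for brevity.

```matlab
% Sketch: wide-field df/f map, smoothing, thresholding, and boundary drawing.
frames  = rand(128, 128, 20);                  % placeholder movie (H x W x frames)
baseIdx = 1:3;  respIdx = 4:8;                 % illustrative frame indices
f0  = mean(frames(:, :, baseIdx), 3);          % baseline image
dff = (mean(frames(:, :, respIdx), 3) - f0) ./ f0;
dffS = imgaussfilt(dff, 2);                    % Gaussian smoothing, SD = 2
dffS = dffS / max(abs(dffS(:)));               % normalize by absolute maximum
bw   = dffS > 0.5;                             % binary map of responsive pixels
B    = bwboundaries(bw, 8);                    % 8-connected region boundaries
```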
Two-photon calcium imaging
For two-photon imaging, Ultima IV Prairie Technologies laser scanning microscope with a Spectra Physics Insight Ti:Sapphire mode-locked femtosecond laser was used. Cells were imaged using a 20×, 0.8 NA (Olympus) water immersion objective at depths of usually 200–350 μm from the cortical surface at an excitation wavelength of 860–920 nm. Full-frame images were acquired at a resolution of 512 × 512 pixels. The laser power was adjusted from 50 to 80 mW. Frames in the region of interest (120–150 μm × 300–350 μm with 1.16 μm pixel size) were imaged at ∼4–10 Hz (∼250 ms frame period, 4 μs dwell time).
Analysis of two-photon imaging data
Two-photon imaging analysis was performed using custom code written in MATLAB (MathWorks). Imaging sequences were aligned by performing X–Y drift correction. Cells were selected manually by marking the center point of each cell on the motion-corrected mean images. ROIs (5 µm radius) were drawn around the cell centers. Raw fluorescence signals over time (f) of the selected ROIs across all frames were extracted. For each trial, relative fluorescence was computed as df/f0 = (f − f0)/f0, where f0 corresponds to the baseline fluorescence. Baseline fluorescence was estimated as the mean of the fluorescence values over all frames preceding the stimulus except the first three frames, which amounted to either four or six frames.
A neuron was considered responsive to a stimulus if the mean of the three frames of the mean fluorescence trace (calculated over stimulus repetitions) before stimulus onset was significantly different from moving three-frame averages after stimulus onset (unpaired t test, p < 0.05). The baseline distribution was obtained from all the prestimulus three-frame windows (all stimuli and all repetitions, usually 100). The total number of frames considered for each stimulus encompassed the stimulus duration and an additional 0.5 s after the stimulus. We only considered significant positive-going responses, and thus the false detection rate is <2.5%. The response of a neuron to a sequence was calculated as the mean df/f of all frames from stimulus start to 0.5 s after stimulus end. Selectivity of a neuron to each of the 20 stimuli was based on the above response and calculated as for single units.
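A simplified sketch of the df/f0 extraction and responsiveness test for one ROI and one stimulus is given below; the placeholder data, frame indices, and the use of only this stimulus's prestimulus frames (rather than the pooled baseline distribution) are assumptions.

```matlab
% Sketch: df/f0 per trial and a responsiveness test for one ROI.
traces = 100 + randn(5, 20);                   % placeholder: trials x frames
baseFrames = 4:6;                              % prestimulus frames, skipping first 3
f0  = mean(traces(:, baseFrames), 2);          % per-trial baseline fluorescence
dff = (traces - f0) ./ f0;                     % df/f0 for each trial
meanTrace = mean(dff, 1);                      % mean over repetitions

pre  = meanTrace(baseFrames);                  % prestimulus frames of mean trace
post = movmean(meanTrace(7:end), 3);           % moving 3-frame averages after onset
responsive = ttest2(pre, post, 'Alpha', 0.05) == 1 && max(post) > mean(pre);
% only positive-going responses are counted as responsive
```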
Calculation of selectivity based on spontaneous activity for optogenetics
For photoinhibition of SOM neurons (strain #013044 crossed with strain #021188, The Jackson Laboratory), we activated ArchT via 589 nm light pulses (Ryan et al., 2015) from a laser through a 200 μm, 0.5 NA optical fiber. The power at the fiber tip was 30 mW. Light pulses started 100 ms before auditory stimulus onset and lasted for the stimulus duration plus an additional poststimulus span of 100 ms.
Reversibility of the SOM photoinhibition was tested by comparing the mean spontaneous activity with (W) and without (WO) the laser on. The formula used for calculating selectivity is the following:
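A contrast-index form consistent with this W/WO comparison is assumed here (a sketch, not necessarily the exact expression used):

$$\mathrm{Selectivity} = \frac{R_{W} - R_{WO}}{R_{W} + R_{WO}},$$

where R_W and R_WO denote the mean rates with and without the laser on.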
Experimental design and statistical analysis
The experimental design and statistical analysis corresponding to each experiment are discussed previously (see above, Materials and Methods). For vocalization recording (n = 4), male mice alone and interacting with females were used. KLD was used as a metric to quantify differences between distributions. The confidence intervals for KLD and for the presence of MI in the USV sequences were estimated using bootstrap. Behavioral studies were conducted on females with two different lengths of the stimuli [n = 8 (actual length); n = 7 (length 7)]. c-fos experiments were conducted in three conditions with five female mice in each condition. Extracellular recordings were performed on awake naive females (n = 6), anesthetized naive females (n = 12), naive anesthetized males (n = 8), and experienced anesthetized females (n = 8). Two-photon imaging studies were done in females from genetically marked excitatory neurons (EXNs; n = 3 mice; strain #024339, The Jackson Laboratory) and genetically identified individual inhibitory neuron types, SOM positive (n = 4 mice; strain #013044 crossed with strain #024105, The Jackson Laboratory) and PV positive (n = 438 neurons from n = 4 mice; strain #08069 crossed with strain #024105, The Jackson Laboratory). For optogenetics, the mice expressed ArchT-EGFP specifically in SOM neurons, obtained by crossbreeding strain #021188 (Ai40D, The Jackson Laboratory) and strain #013044 (SST-IRES-Cre, The Jackson Laboratory). To optimize the power level of the laser, n = 6 mice (three female, three male) were used. Finally, SOM neurons were inactivated using optogenetics, and recordings were made from anesthetized naive females (n = 5). Statistical significance was tested using one-way ANOVA in all the above experiments except vocalization.
Results
Male mice produce context-dependent syllable sequences
We developed a protocol for mouse vocalization production in three different contexts, referred to as Alone, Separated, and Together (Fig. 1A). When lone naive adult male mice [n = 4, postnatal day (P)56–P90] with no previous social exposure to a female were placed in a cage within a sound chamber (Alone, 5–10 min), they emitted no vocalizations (Fig. 1A, top left). Following introduction of a naive female mouse (P56–P90) into the cage, separated from the male by a mesh (Fig. 1A, middle row), vocalizations were produced by the male (Separated, 5–10 min). Finally, the mesh was removed, the female and male were exposed to each other, and vocalizations were recorded (Together, 5–10 min). We repeated the process with the same pair of mice for at least 5 d, until the male mouse emitted vocalizations in the Alone condition (see above, Materials and Methods). Ultrasonic frequency (UF) vocalization recordings (20–120 kHz) made from this stage onward for the following 5–7 d in four pairs of mice were further analyzed and grouped into the three contexts above.
Vocalization sequences were automatically annotated by detecting UF syllables based on an energy threshold (see above, Materials and Methods), and syllables were categorized into five types (see above, Materials and Methods; Fig. 1B; Agarwalla et al., 2020) based on their spectrograms, with pitch jumps as the distinguishing feature (Holy and Guo, 2005). The syllable types were Noisy (N); Single (S), a single frequency contour; Jump (J), with a single jump in frequency; Harmonics (H); and Others (O), consisting of multiple jumps. The relative percentages of the different kinds of syllables produced by male mice in each context (Fig. 1C) showed distinct differences. The overall difference in the distributions of syllables in each pair of contexts was quantified using KLD (see above, Materials and Methods; Cover and Thomas, 2006; Bandyopadhyay and Young, 2004; Agarwalla et al., 2020). All pairs of distributions were significantly different at 95% CI (see above, Materials and Methods). However, the Together condition showed the most distinct distribution, having large distances from the others, with the Alone and Separated cases being similar in nature. Despite differences in distributions, the order of syllables could still be random in the three contexts. To probe whether the syllable order is random or nonrandom, we analyzed syllable-to-syllable transition probability matrices, as done in multiple studies (Holy and Guo, 2005; Chabout et al., 2015, 2016; see above, Materials and Methods), with two successive syllables (Fig. 1D) or disyllables. We found distinct differences between the different contexts, quantified with KLD at 95% CI (Fig. 1D, right), suggesting differences in the structure of syllable order.
To probe higher-order structure in syllable sequences, analyses as above with three or more successive syllables require large amounts of data for reliable estimation of transition probabilities. Thus, to analyze structure in the order of syllables emitted (Holy and Guo, 2005; Grimsley et al., 2011; Chabout et al., 2015; Agarwalla et al., 2020), the entire period of vocalization recordings of each session was divided into bouts based on the ISS distribution. Silence intervals >250 ms in duration were used to mark the end of a bout of vocalizations; the ISSs (mean ± SD) were 80.7 ± 79.2 ms (Alone), 99.4 ± 81.4 ms (Separated), and 88.9 ± 80.6 ms (Together). The ISS distribution obtained by merging the distributions from all contexts had a mean + 2 × SD of (90 + 2 × 80) = 250 ms (Fig. 1E). The syllable following the immediately preceding end of a bout was considered the start of the next bout, and successive syllables in a bout were considered a vocalization sequence of syllables. Thus, bouts of vocalizations could have one or many syllables (up to 30 syllables observed; Fig. 1F; in total 948 Alone, 3320 Separated, and 6656 Together bouts recorded), with bouts of length 1 being dominant, independent of the context. Bouts of vocalization with three or more syllables are referred to as sequences (see above, Materials and Methods) hereafter. With Si denoting the ith syllable of a bout, the MI between S1 and Si, denoted I(S1;Si), was computed, debiased, and tested for significance based on CIs (see above, Materials and Methods). Significant MI between syllables at different positions in a bout shows dependence between the syllables, or predictability, the basis of structure in sequences (Shannon, 1951; Agarwalla et al., 2020). We found that in all three contexts significant dependence existed between the first and other syllables in bouts of vocalizations (Fig. 1G). The dependence was strongest and longest ranging in the Together context and weakest and shortest in the Alone condition. Thus, the three contexts had different degrees of dependence among syllables of sequences. Similar results were observed for I(Sj;Sj+k), j = 1, 2, 3, and k = 1, 2 (Fig. 1H), showing that dependence exists within the sequence beyond that on the first syllable. Thus, context-dependent sequences are produced by mice, with the degree of structure varying with context.
To further understand what order of syllables, if any, provided such dependence in the sequences, we found the highly probable sequences produced in the three contexts (Fig. 2B, sequences 13–20; see above, Materials and Methods). The spectrograms of the syllables used for the generation of the stimulus are shown in Figure 2A. The eight sequences obtained are the NSeqs. The NSeqs or their subsequences (from the starting syllable, with at least three syllables) constituted 36% (Together) and 24% (Separated) of all the recorded sequences of length three or more in the two contexts. The percentages of such sequences occurring by chance based on the Together and Separated syllable probability distributions (Fig. 1C) were only 11% and 8%, respectively. The NSeqs were also 8 of the 10 sequences with the highest accumulated surprise (Itti and Baldi, 2009; see above, Materials and Methods), showing that they were the most informative among the repertoire of syllable sequences produced in the three contexts. As expected from the results of MI (Fig. 1G), the longest NSeqs were in the Together condition, and the Together and Separated conditions had distinct sets of NSeqs, indicating context-dependent, specific sequence production in social encounters of male and female mice. The only highly probable sequence in the Alone context was also present in the Separated context and is possibly a search sequence before finding a female mouse. However, all the NSeqs were present in the Together condition, and thus during the Together condition the female was exposed to all the NSeqs. To study the importance of the NSeqs, if any, from both behavioral and coding perspectives, 12 other sequences were designed as controls. Four randomly ordered sequences were created from the probability distribution of syllables in each context (Fig. 1C). The above sequences, each with seven syllables, the length of the longest NSeqs, are considered RSeqs (Fig. 2B, sequences 1–12; see above, Materials and Methods).
Naive adult female mice prefer NSeqs over RSeqs
Although NSeqs were obtained from vocalizations produced during social exposure of males to females, the sequences need not be behaviorally relevant to naive females. We tested the relevance of the specific sequences obtained with a two-sided free access/choice test (Musolf et al., 2015; Chabout et al., 2015; see above, Materials and Methods). Two sets of experiments were conducted, one in which 8 of the 12 RSeqs were truncated to match the number of syllables in NSeqs (Fig. 2B, shaded RSeqs, same length) and the other in which all RSeqs had seven syllables, as in the longest NSeqs. The latter case was used to maximize the possibility that syllable-to-syllable transitions occurring in NSeqs also occurred within the RSeqs. The above allowed us to probe the relevance of the full sequence as opposed to NSeq transitions or disyllables occurring randomly. Naive adult female mice (females with no prior exposure to adult male mice; P56–P90; n = 8, two outliers, same length; n = 7, one outlier, length 7) were placed in an open cage for four distinct sessions (S1–S4), with sounds presented in S2 and S4 (see above, Materials and Methods). Sequences from the RSeqs were played back from one corner and those from the NSeqs from the opposite corner, alternately every 5 s, in S2. The RSeq and NSeq sides were switched in S4 to remove any side bias (Fig. 2C, top row). Female mice had a significantly higher preference (p < 0.001, both cases) for NSeqs, independent of which side NSeqs were played from, both for the equal-length and the length-7 sequences (Fig. 2C, bottom row, quantified by dividing the cage floor into three equal parts denoting NSeq, Neutral, and RSeq; see above, Materials and Methods). Thus, the extracted NSeqs are relevant to naive female mice and are preferred over the designed RSeqs.
Higher c-fos expression in the mouse ACX for the NSeqs over RSeqs
Because the theoretically derived informative sequences have behavioral relevance over random sequences for naive adult female mice (Fig. 2D), we further investigated the relative effect of NSeqs over RSeqs at the neuronal level, particularly in the ACX. For this, we used c-fos immunohistochemistry (see above, Materials and Methods) as a marker for immediate early gene expression and hence neuronal activity (Clayton, 2000). The presence of c-fos in a neuron is one of the methods to identify the specific neurons that participate in a given function without losing the ability to precisely locate such neurons (Clayton, 2000). c-fos expression has been known to mark synaptic plasticity occurring in neurons because of novel experience, memory trace formation, and learning, among other such phenomena (Liu et al., 2012; Moreno et al., 2018; Minatohara et al., 2016). Three different sound environments were used to expose three groups of female mice (n = 5 per group) aged between P56 and P90 (see above, Materials and Methods). First, the control condition, in the absence of auditory stimulus, started with 60 min of habituation followed by a 15 min silent period, the same as the duration of stimulus presentation in the two other conditions, and ended with an additional 60 min as a consolidation period (Fig. 3A). In the second and third conditions, the 15 min silent period of the control condition was replaced by RSeq and NSeq presentations, respectively (Fig. 3A), keeping everything else identical to the control condition. Coronal sections (Fig. 3B) across the rostrocaudal extent of the ACX were used to mark the boundaries of the ACX (Paxinos and Franklin, 2019). The neuronal activation of the different regions involved in the three different conditions (Fig. 3B, left) was quantified by the number of c-fos-positive neurons per unit area (Fig. 3B, right column). To further assess how much of these effects are because of general arousal rather than specific auditory cortical processing, we considered the primary (V1) and secondary (V2L) visual cortices as control brain regions. As shown in the bar plot (Fig. 3C, top row), the results of the stereological analysis (see above, Materials and Methods) revealed no differential expression across conditions for visual cortex regions V1 and V2L. A higher number of c-fos+ neurons was present in the NSeq condition compared with the RSeq and control conditions in the ACX. The overall observation did not change when the different subregions of ACX, namely Au1, AuD, and AuV, were considered separately (Paxinos and Franklin, 2019; Fig. 3C, middle and bottom). Thus, our observations indicate that exposure to NSeqs has a very different effect in terms of c-fos expression, which can indicate salience of NSeqs over RSeqs, significant behavioral relevance of NSeqs over RSeqs, or initiation of synaptic plasticity in the ACX with NSeq exposure. The results of behavioral relevance (Fig. 2D) together with those of c-fos expression (Fig. 3) warrant investigation of the coding of NSeqs as opposed to RSeqs. As NSeqs and RSeqs are constructed from the same syllable distributions, whether the predictiveness present in NSeqs but lacking in RSeqs matters is important to understand, as it may relate to perceptual learning of the specific sequences.
Figure 3. Naive adult female mice show higher activation of c-fos+ cells for NSeqs over RSeqs. A, Schematic depicting the three stages followed to investigate activation of c-fos+ cells. Throughout the experiment, up to killing, the mouse remained in a sound-attenuating chamber; Stage 1, 60 min of habituation; Stage 2, 15 min of exposure to auditory stimulus (RSeqs or NSeqs, or silence for Ctrl); Stage 3, 60 min of consolidation in silence. B, Representative coronal brain sections for c-fos expression from the different groups, Ctrl (top), Ran (middle), and Nat (bottom), with ACX subregions Au1, AuV, and AuD and visual cortex areas (V1 and V2L as control) demarcated with dotted lines. Right, Enlarged images show sampled Au1 regions with c-fos+ cells (top, white arrow, red), corresponding DAPI-stained nuclei (middle, white arrows, blue), and the overlay of the two (bottom). C, Bar plots show the average count of c-fos+ cells/mm2 in the three conditions, Ctrl, Ran, and Nat, for V1, V2L, and total ACX, and for the subregions of ACX (Au1, AuD, AuV). Quantitative differences in c-fos+ cells for each of the cortical subregions are marked with statistical significance at the 5% significance level using a one-way ANOVA; *p < 0.05, **p < 0.01, ***p < 0.001; NS, Not significant.
Single units in Au1 code single syllables and disyllables differentially for NSeqs and RSeqs
To investigate coding of behaviorally relevant NSeqs of vocalizations relative to RSeqs, we first performed extracellular single-unit recordings from layer 2/3 (200–350 μm from the surface) of mouse Au1 using multielectrode arrays (Sharma and Bandyopadhyay, 2020; Srivastava and Bandyopadhyay, 2020). Three groups of mice were used, passively listening awake naive females (Awk_F; 328 units from 6 animals, chronic recordings, over 10–14 d), anesthetized naive females (Anes_F; 266 units from 12 animals), and naive anesthetized males (Anes_M; 195 units from 8 animals).
Typically, five (between four and six) repetitions of each stimulus of RSeqs (1–12) and NSeqs (13–20) were presented in pseudorandom order (Fig. 4A, left), and single-unit spiking (Fig. 4A, inset) was recorded with a 500 ms baseline. The stimuli varied in length, from 0.389 s to 1.233 s, based on their component syllables. Peristimulus time histograms (PSTHs) were constructed, and responses locked to single tokens were observed throughout the RSeqs and NSeqs (Fig. 4A, right).
Figure 4. Differential coding of single syllables in Au1 for NSeqs and RSeqs. A, Representative dot raster plot of single-unit spiking responses to NSeqs and RSeqs presented in pseudorandom order (right, spike shape). Smoothed PSTHs (10 ms binning) of the same unit for each stimulus sequence are shown, with stimulus start (tall vertical line) followed by lines marking the start and end of each subsequent syllable. B, Schematic for calculation of responses to common syllables within the sequences in NSeqs and RSeqs, not considering the syllables in the starting position. Different color bars represent different syllable types, and the width of the color bars is indicative of syllable duration. The black dots are the spikes corresponding to each iteration (REP1, REP2 … REPN). The mean responses to the syllables with the same color (indicated by colored space on the schematic of spikes over N repetitions) were calculated for RSeqs (Baseline, left) and NSeqs (Baseline, right) for each of the different common syllable types. C, Scatter plots show comparison of mean response rates for all common first syllables in NSeqs and RSeqs in three groups of mice, Awk_F, Anes_F, and Anes_M. D, Scatter plots show comparison of mean response rates of all common first syllable types (identified by solid symbols: S, large circle; H, square; O, triangle) in NSeqs and RSeqs in the three groups of mice, Awk_F, Anes_F, and Anes_M. E, Scatter plots comparing mean response rates of common syllables (identified by solid symbols: S, large circle; J, small circle; H, square; O, triangle) in NSeqs and RSeqs, excluding occurrences in the first position, for the same groups as in C. F, Profile of changes in the normalized response strength for each position of the sequences, NSeq (blue) and RSeq (red), over time, normalized by the response to the syllable in the starting position. None of the profiles at any position showed any significant difference (one-way ANOVA). Each dot in the scatter plots (C, D, E) represents an individual neuron; *p < 0.05, **p < 0.01, ***p < 0.001; NS, Not significant.
Syllables S, J, H, and O and disyllables or transitions S–S, J–J, H–H, and O–O were present in both RSeqs and NSeqs, excluding the first syllable of any sequence. The first syllable was excluded as its occurrence does not contain any sequence information (Fig. 4B). Thus, as a control, in all three groups of animals the average rate responses to each syllable type when occurring in the first position were compared (first syllable after the dashed vertical line, Fig. 4B; see above, Materials and Methods). As expected, none of the responses to the syllable in the first position showed any difference between NSeqs and RSeqs (Fig. 4C,D, for each syllable). Next, we compared the single-unit rate responses to syllables present in both RSeqs and NSeqs (Fig. 4E; see above, Materials and Methods). Syllables J and O, across all three groups of animals, elicited higher average rates in Au1 single units when occurring in NSeqs than when they occurred in RSeqs (paired t test, p < 0.01 in all cases, except syllable O in Anes_F with p < 0.05; Fig. 4E). However, syllable H across all the animal groups evoked lower average rates for NSeqs than for RSeqs (paired t test, p < 0.01 in all cases). Syllable S produced significantly higher rates when in NSeqs than when in RSeqs for the Awk_F group (paired t test, p < 0.01), and the response strength of S was not significantly different between NSeqs and RSeqs in the other two groups of animals.
Thus, single syllables were encoded differently in an NSeq context than in an RSeq context; specifically, a higher response was generally elicited by syllables in NSeqs than in RSeqs, except for the H syllable. We rule out the possibility that the above differences could be an effect of stimulus-specific adaptation (Carbajal and Malmierca, 2018; Srivastava and Bandyopadhyay, 2020; Mehra et al., 2022b). It may appear that the repetitions of syllables in NSeqs (e.g., H in NSeq 15 and NSeq 16) and the absence of such long repeats of H in RSeqs could explain the lower response rates to H in NSeqs compared with RSeqs (Fig. 4E). However, repetitions of up to four successive syllables are present for S, J, and O as well as H in NSeqs, with such repeated occurrences absent in RSeqs. Yet, unlike H, the S, J, and O syllables evoked higher rates in NSeqs than RSeqs. Thus, it is unlikely that adaptation can explain the observed context-dependent responses to tokens. To more conclusively rule out adaptation as the cause of such differences, we compared the average adaptation profiles (Fig. 4F) in NSeqs and RSeqs by normalizing each single-unit mean response to the first token as one, the first-token response itself not being significantly different between NSeqs and RSeqs for any token (Fig. 4C,D). No significant differences were found in the adaptation profiles based on average token-wise responses over NSeqs and RSeqs across animal groups.
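The adaptation-profile control can be sketched as follows, assuming hypothetical (units × positions) rate matrices; the normalization by the first-token response and the position-wise comparison are analogous to the analysis above, but the arrays and test call are illustrative only.

```python
# Minimal sketch of the position-wise adaptation profile: each unit's mean
# response at every syllable position is divided by its response to the
# first syllable, then profiles are averaged across units. Data are synthetic.
import numpy as np
from scipy.stats import f_oneway

def adaptation_profile(rates_by_position):
    """rates_by_position: (n_units, n_positions) mean rates; returns normalized profile."""
    norm = rates_by_position / rates_by_position[:, [0]]   # first position -> 1
    return norm.mean(axis=0)

rng = np.random.default_rng(1)
nseq_rates = rng.uniform(5, 15, size=(40, 8))
rseq_rates = rng.uniform(5, 15, size=(40, 8))
prof_n = adaptation_profile(nseq_rates)
prof_r = adaptation_profile(rseq_rates)

# Position-wise comparison of normalized responses across the two contexts
p_by_position = [f_oneway(nseq_rates[:, k] / nseq_rates[:, 0],
                          rseq_rates[:, k] / rseq_rates[:, 0]).pvalue
                 for k in range(1, 8)]
```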
The above results suggest that the predictiveness binding the syllables is likely an important determinant of the difference in responses to individual syllables in NSeqs and RSeqs. We thus considered coding of the common disyllables in NSeqs and RSeqs, considering only the response to the second syllable of the transition (Fig. 5A; see above, Materials and Methods). Comparison of rate responses to transitions in NSeqs and RSeqs showed results similar to those for single syllables, with the S–S, J–J, and O–O transitions generally showing stronger responses when present in NSeqs than when present in RSeqs across all animal groups (except S–S in Anes_F and Anes_M; Fig. 5B–D, right). As with responses to the single syllable H, the transition H–H had higher responses in the RSeq context than in the NSeq context across all groups of animals (Fig. 5B–D, right). Representative example PSTHs are shown for the groups under consideration (Fig. 5B–D, left). Syllable transitions of other kinds (Fig. 1D) were not common to both NSeqs and RSeqs. All the other kinds of disyllables in NSeqs were present as the first transition, which was excluded, as the first syllable does not carry context information (Fig. 4C,D). However, the other disyllables occurring as the first transition in NSeqs (S–J, S–H, S–O, and O–H) were significant in the transition matrices (Fig. 1D). Thus, we compared responses to the above transitions, based on the response to the second token of the transition, between NSeqs and RSeqs (Fig. 5E–G). In these remaining cases as well, we generally found higher responses to the transitions occurring in NSeqs than to those occurring in RSeqs.
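The transition analysis can be sketched similarly; the helper below (hypothetical data structures, not the authors' code) scores only the second token of a matching disyllable and skips the first transition of each sequence, as described above.

```python
# Minimal sketch: locate a given disyllable transition within a sequence and
# score only the response to its second token, skipping the first transition.
import numpy as np

def transition_rate(spikes, syllables, onsets, offsets, pair=("H", "H")):
    """syllables/onsets/offsets: per-syllable labels and times for one sequence."""
    rates = []
    for i in range(1, len(syllables) - 1):            # i >= 1 skips the first transition
        if (syllables[i], syllables[i + 1]) == pair:
            on, off = onsets[i + 1], offsets[i + 1]    # second token of the pair
            rates.append(np.sum((spikes >= on) & (spikes < off)) / (off - on))
    return np.mean(rates) if rates else np.nan
```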
Differential coding of disyllables in Au1 of mice for NSeqs and RSeqs. A, Schematic for calculation of mean rate responses to transitions/disyllables based on the response to the second component, excluding the first transition. Each color bar represents a different syllable type; the width is representative of syllable duration. The black dots stand for the spikes corresponding to each iteration (REP1, REP2…REPN). Right, The common transitions. Left, The colored lines marking the transitions, namely, green, blue, and cyan, are correspondingly used to denote the spikes over iterations for those transitions. Mean spike rates were computed for each of the same colored areas indicated in the schematic of the spikes over N repetitions for RSeqs (Baseline, left) and NSeqs (Baseline, right) for each of the common transitions. B–D, Representative example PSTHs (vertical gray bar indicates start of a sequence, red line along the x-axis represents the stimulus duration) and scatter plots comparing mean response rates of common disyllables in NSeqs and RSeqs for the three groups, Awk_F (B), Anes_F (C), and Anes_M (D). E–G, Scatter plots of the comparison between mean rate responses to the first transitions in NSeqs (S–J, S–H, S–O, and O–H) and the same transitions present at any position in RSeqs, based on the response to the second component and excluding the first transition, for the groups Awk_F (E), Anes_F (F), and Anes_M (G); *p < 0.05, **p < 0.01, ***p < 0.001; NS, Not significant.
Social exposure of adult females to adult males alters coding of entire NSeq sequences but not their component syllables in female A1
Our results show that naive males and females exhibit the same differential coding of single syllables in the two sets of stimuli, RSeqs and NSeqs (Fig. 5). Further, the NSeq stimuli used are preferred by naive female mice over RSeqs (Fig. 2C). Social exposure between adult male and female mice (>P56) is a significant event during which the USVs constituting the NSeqs are produced (Egnor and Seagraves, 2016; Chabout et al., 2015; Yang et al., 2013; Figs. 1, 2). Further, H syllables and sequences of Hs were a major component of the USVs, especially in the Together context (Fig. 1C,D). Thus, we hypothesized that the context-specific coding of syllables and disyllables would change, especially for the syllable H and the disyllable H–H, both of which showed lower responses for NSeqs.
We first tested the above hypothesis with single-unit recordings from layer 2/3 of Au1 of experienced female mice (Fig. 6A; see above, Materials and Methods) with different numbers of days of exposure to male mice (1–5 d of exposure, 421 units from 8 animals, Anes_F-Aft_Expo). Because the differential coding observed for syllables and disyllables was the same in Awk_F and Anes_F (Figs. 4E, 5B–D), except for S, we considered rate responses in anesthetized mice. First, as a control for context independence, we confirmed that responses to all first syllables did not differ between NSeq and RSeq sequences (Fig. 6B,C, each syllable type). Example PSTHs of a unit for all the sequences from an experienced female mouse are shown in Figure 6D. Contrary to our hypothesis, we found that experienced females had the same differential representation of single syllables (Fig. 6E) and disyllables (Fig. 6F,G) as the naive females before exposure (Awk_F-Bef_Expo and Anes_F-Bef_Expo; Figs. 4E, 5B–D), except for S and S–S in the latter case. Moreover, the selectively stronger responses to S in NSeqs than in RSeqs in Awk_F-Bef_Expo and Anes_F-Aft_Expo suggest that the exposure likely had little effect on context-specific coding of single syllables and transitions. As in the other groups, adaptation could not explain the context-dependent coding of single syllables in the experienced females (Fig. 6H).
Coding of syllables in Au1 remains unaltered after social experience. A, Schematic of the social exposure protocol of female mice with male mice over days. B, Scatter plot for comparison of mean response rates to the common first syllables in NSeqs and RSeqs following exposure in anesthetized females (Anes_F-Aft_Expo). C, Scatter plots show comparison of mean response rates of all common first syllable types (solid symbols: S, large circle; H, square; O, triangle) in NSeqs and RSeqs for Anes_F-Aft_Expo. D, Representative PSTH for an experienced female (vertical gray bar represents the onset of the stimulus, red line along the x-axis represents the duration of the sequence). E, F, Scatter plots comparing mean rate responses to common syllables (E) and disyllables (F) for anesthetized females (Anes_F-Aft_Expo), as in Figure 4E and Figure 5B–D, respectively. G, Scatter plot of mean rate responses to the first transitions in NSeqs (S–J, S–H, S–O, and O–H) based on the response to the second component, excluding the first transition. H, Profile of changes in the normalized response strength for each position of the sequences NSeq (blue) and RSeq (red) over time with respect to the syllable in the starting position; not significant across positions, one-way ANOVA; *p < 0.05, **p < 0.01, ***p < 0.001; NS, Not significant.
We observed that the NSeqs as a whole, rather than their components, may carry the behaviorally relevant information in the male–female exposure. Further, the NSeqs appeared as the most informative set of sequences in the exposure event over multiple days. Thus, we computed rate-based selectivity of individual neurons for monosyllables, disyllables, and whole sequences (see above, Materials and Methods) and compared the mean selectivity to RSeqs and NSeqs in the population of neurons in each group (Fig. 7). No difference in neuronal selectivity was observed for monosyllables and disyllables among the different groups under consideration (Fig. 7A,B). However, differential encoding was well captured at the sequence level (Fig. 7C). Interestingly, both Awk_F-Bef-Expo and Anes_F-Bef-Expo showed higher selectivity to NSeqs than to RSeqs (paired t test, p < 0.001), with no difference between the two groups of females (one-way ANOVA, p > 0.05, NSeqs; p > 0.05, RSeqs), reiterating the lack of difference in such selectivity between the awake and anesthetized conditions. Anes_M-Bef-Expo also showed increased selectivity to NSeqs compared with RSeqs and, surprisingly, had higher selectivity for NSeqs than naive females (one-way ANOVA, p < 0.001, Awk_F-Bef-Expo; p < 0.001, Anes_F-Bef-Expo). The Anes_F-Aft_Expo group of mice showed the largest difference between selectivity to NSeqs and RSeqs. Interestingly, females after exposure had lower selectivity to RSeqs compared with all other groups (one-way ANOVA, p < 0.001, Awk_F-Bef-Expo; p < 0.001, Anes_F-Bef-Expo; p < 0.001, Anes_M-Bef-Expo). Also, the selectivity to NSeqs of Anes_F-Aft_Expo was similar to that of Anes_M-Bef-Expo (one-way ANOVA, p > 0.05) but significantly higher than that of naive females (one-way ANOVA, p < 0.001, Anes_F-Bef-Expo; p < 0.001, Awk_F-Bef-Expo).
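The exact selectivity measure is defined in Materials and Methods; as a stand-in only, the sketch below uses a generic activity-fraction (sparseness-style) index, which is 0 when a neuron responds equally to all sequences in a set and approaches 1 when it responds to a single sequence. The function name and values are hypothetical.

```python
# Minimal sketch of a sequence-selectivity measure (generic sparseness-style
# index used as a stand-in for the paper's definition in Materials and Methods).
import numpy as np

def selectivity(responses):
    """responses: non-negative mean rates of one neuron to a set of sequences."""
    r = np.asarray(responses, dtype=float)
    n = len(r)
    activity_fraction = (r.sum() / n) ** 2 / np.mean(r ** 2)   # mean(r)^2 / mean(r^2)
    return (1 - activity_fraction) / (1 - 1 / n)               # 0 = unselective, 1 = one sequence

sel_nseq = selectivity([2.0, 0.5, 0.3, 0.1, 0.0, 0.0, 0.2, 0.1])   # 8 NSeqs
sel_rseq = selectivity(np.ones(12) * 0.8)                          # 12 RSeqs, uniform -> 0
```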
Plasticity observed in single-unit selectivity to the entire NSeq and not its components. A–C, Comparison of mean overall selectivity to NSeqs with that to RSeqs and comparisons across groups before exposure (Awk_F-Bef-Expo, Anes_F-Bef-Expo, and Anes_M-Bef-Expo) and after exposure (Anes_F-Aft_Expo) for monosyllables (A), disyllables (B), and overall sequences (C). D, The mean selectivity to NSeqs and RSeqs in anesthetized females before exposure (Anes_F-Bef_Expo) is compared with selectivity to NSeqs and RSeqs over and within days of exposure (Anes_F-Aft_Expo); *p < 0.05, **p < 0.01, ***p < 0.001; NS, Not significant.
To further test that the increased difference in selectivity in experienced female mice was because of the exposure to male mice, we considered the effect of multiple days of exposure (Day1, Day2, Day3, and Day > 3; Fig. 7D). First, we observed that at every time point considered, selectivity to NSeqs was significantly higher than selectivity to RSeqs during the exposure period (Fig. 7D; paired t test, p < 0.001 at each time point). The effect of exposure over days shows that the selectivity to NSeqs in Anes_F-Aft_Expo was significantly higher than that in Anes_F-Bef_Expo on all the days (one-way ANOVA, p < 0.001, Day1; p < 0.001, Day2; p < 0.001, Day3; p < 0.001, Day > 3). Moreover, we found that selectivity to NSeqs increased significantly over the first two days (one-way ANOVA, Day1 > Bef_Expo, p < 0.001; Day2 > Day1, p < 0.001). Following Day2, there was no significant change in selectivity to NSeq sequences, which appeared to saturate. On the other hand, selectivity to RSeqs was unchanged on Day1 (one-way ANOVA, p > 0.05), decreased on Day2 (one-way ANOVA, p < 0.001), and then remained unchanged over further days (one-way ANOVA, p > 0.05, Day2/Day3; p > 0.05, Day3/Day > 3) while being significantly less than that of naive females and that of Anes_F-Aft_Expo Day1 (one-way ANOVA, p > 0.05). These observations are consistent with the idea that exposure- or experience-dependent plasticity acts to increase selectivity to the entire sequence but not to its component syllables and disyllables. As responses to RSeqs and NSeqs are from the same neurons collected in pseudorandom order, this suggests that neurons with higher selectivity to NSeqs also lose selectivity to RSeqs over days, maximizing the difference of the relevant sequences from any others. Thus, although the Aft_Expo Day2 group of mice was never exposed to the RSeqs, we find a decrease in selectivity to RSeqs. The saturation of the difference in selectivity to NSeqs and RSeqs after Day2 is explained by the fraction of units in layer 2/3 of A1 that have higher selectivity to NSeqs than to RSeqs (χ2 test, p > 0.01). The fraction of units with higher selectivity for NSeqs than RSeqs significantly increased on Day2 from Day1 (χ2 test, p < 0.05), coincident with the decrease in mean selectivity to RSeqs.
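The fraction-based comparison across days can be sketched as below; the contingency counts are hypothetical, not the recorded unit counts, and the chi-square call only illustrates the type of test reported above.

```python
# Minimal sketch (hypothetical counts): compare the fraction of units with
# higher selectivity to NSeqs than RSeqs on Day1 vs Day2 with a chi-square test.
import numpy as np
from scipy.stats import chi2_contingency

# rows: Day1, Day2; columns: units preferring NSeqs, units preferring RSeqs
table = np.array([[60, 40],
                  [80, 25]])
chi2, p, dof, expected = chi2_contingency(table)
```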
SOM-positive interneurons and Thy-1 excitatory neurons exhibit different experience-dependent plasticity in NSeq selectivity
To elucidate the mechanisms underlying the observed plasticity, we hypothesized the involvement of inhibitory interneurons, as observed in multiple studies (Schreiner and Polley, 2014). Our single-unit recordings showed mostly regular spiking, and hence extracting putative fast-spiking inhibitory neurons was not possible. Our final conclusions above are based on responses to entire sequences (0.389–1.232 s) and not on fine-timescale activity. Thus, Ca2+-dependent fluorescence imaging was performed in naive and experienced female mice in the awake state through a cranial window (see above, Materials and Methods). We used three types of mice to image activity of Thy-1-positive EXNs (naive n = 3360 neurons, experienced n = 2853 neurons from n = 3 mice, strain #024339, The Jackson Laboratory) and genetically identified inhibitory neurons (SOM, naive n = 517 neurons, experienced n = 460 neurons from n = 4 mice, strain #013044 crossed with strain #024105; and PV-positive, naive n = 395 neurons, experienced n = 438 neurons from n = 4 mice, strain #08069 crossed with strain #024105; all from The Jackson Laboratory).
We first performed wide-field Ca2+ imaging (Mehra et al., 2022a) (see above, Materials and Methods) to identify the location of Au1 along with UF (Stiebler et al., 1997; Bandyopadhyay et al., 2010) and other auditory areas (Issa et al., 2014; Fig. 8A). Following identification of Au1 (see above, Materials and Methods), fine-scale two-photon imaging (see above, Materials and Methods) with single-neuron resolution was performed by restricting individual ROIs (120–150 × 300–350 μm) within Au1. Single ROIs (Fig. 8B; two to three ROIs per day) were imaged in each group of mice, female before exposure (Fig. 8B, left, Thy1-GCamp; Fig. 8B, middle, SOM-GCamp; Fig. 8B, right, PV-GCamp) and after exposure. Chronic recordings allowed collecting data from the same animals over different days of exposure (5 d); however, the same neurons could not always be tracked reliably over days. As expected from the literature, we obtained data from 100 ± 20 neurons in each ROI of Thy1-GCamp mice and 6 ± 4 and 8 ± 4 neurons in each ROI of SOM-GCamp and PV-GCamp mice, respectively. Within an ROI, many neurons produced significant responses (mean df/f compared with mean baseline df/f, t test, p < 0.05; see above, Materials and Methods) to one or more stimuli in all the cases (Fig. 8C).
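The responsiveness criterion can be sketched as follows; the traces, frame indices, and choice of a paired t test are illustrative stand-ins (the exact test is described in Materials and Methods), not the authors' pipeline.

```python
# Minimal sketch: a cell is called responsive to a stimulus if its mean
# stimulus-period df/f exceeds the mean baseline df/f across repetitions
# (paired t test used here as a stand-in, p < 0.05).
import numpy as np
from scipy.stats import ttest_rel

def is_responsive(dff_trials, base_frames, stim_frames, alpha=0.05):
    """dff_trials: (n_trials, n_frames) df/f traces aligned to stimulus onset."""
    base = dff_trials[:, base_frames].mean(axis=1)   # per-trial baseline mean
    stim = dff_trials[:, stim_frames].mean(axis=1)   # per-trial stimulus mean
    t, p = ttest_rel(stim, base)
    return bool(p < alpha and stim.mean() > base.mean())

rng = np.random.default_rng(2)
dff = rng.normal(0, 0.02, size=(5, 60))              # 5 repetitions, 60 frames (synthetic)
dff[:, 20:40] += 0.15                                # hypothetical evoked response
responsive = is_responsive(dff, base_frames=slice(0, 20), stim_frames=slice(20, 40))
```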
Responses to sequences obtained with two-photon Ca2+ imaging of Thy1, SOM, and PV neurons. A, Representative examples of tonotopy in Au1 and other auditory areas obtained in all three groups of mice (Thy1-GCamp, SOM-GCamp, and PV-GCamp, respectively, in three columns separated by dashed lines). White box marks the area shown in B. B, Sample two-photon image of an ROI in A1 in each of the three groups of mice. C, Average df/f plots obtained with two-photon imaging in response to each of the 20 stimuli for two cells in each ROI (Cells A1, B1, C1 and Cells A2, B2, C2, respectively). D, Left, Bar graphs show the percentage of single units (blue) with significant rate responses to each stimulus (1–20; Fig. 2B) in Awk_F-Bef-Expo and the percentage of single Thy-1-positive EXNs (brown) with significant Ca2+ responses in EXN-Bef-Expo. Right, Bar graphs (same color representation as left) show the percentage of neurons responding to a given number of stimuli (1, 2, …, or all 20), irrespective of stimulus identity.
Distinct differences in response behavior to sequences existed between the population of single units (primarily EXNs) and Thy-1 neuron Ca2+ responses. In general, responses were sparser with two-photon imaging, which underlies some of the discrepancies observed between imaging and extracellular recordings (Bandyopadhyay et al., 2010; Siegle et al., 2021; Rothschild et al., 2010). The fractions of neurons responding to each of the 20 sequences (Fig. 2B) in the awake state were lower with imaging than with single-unit recordings in the awake state (Fig. 8D). Similarly, the fraction of neurons in the above two groups responding to a given number of stimuli (independent of the identities of the stimuli) showed that the single-unit population responded less selectively than the population of neurons observed with imaging (mean 2.9 and 5.6 stimuli, respectively, with Ca2+ imaging and single units; Fig. 8D; χ2 test, p < 0.001). Thus, direct comparisons with single-unit data may not be possible when using Ca2+-based responses.
We considered the selectivity of the three different types of neurons to NSeqs and RSeqs and the effect of exposure in two ways. First, we considered the relative selectivity to NSeqs and RSeqs by grouping the neurons of each type according to how many of the RSeqs and how many of the NSeqs they responded to. We first observe that before exposure, Au1 single Thy-1 EXNs and SOM respond to subsets of both NSeqs and RSeqs similarly, with neurons responding to few or none of the RSeqs also responding to few or none of the NSeq sequences. Neurons responding to more of the RSeqs are more likely to respond to more of the NSeqs (Fig. 9A). However, with exposure, EXNs reduce the number of NSeqs they respond to (Fig. 9A, histograms). The same is true of SOM but to a lesser degree (Fig. 9B). PVs, on the other hand (Fig. 9C), show a similar degree of simultaneous responding to NSeqs and RSeqs before and after exposure. Thus, it is likely that PV neurons are less involved in the observed exposure-based change in selectivity to NSeq sequences. We quantified this compactness and its change before and after exposure in the three groups of neurons by the bootstrap (n = 1000) mean of the number of nonzero elements in the matrix representations in Figure 9, A–C. EXNs and SOM showed significant changes (EX-Bef_Expo, 63.3 and EX-Aft_Expo, 33.2; SOM-Bef_Expo, 56.8 and SOM-Aft_Expo, 38.7; nonoverlapping 95% confidence intervals), as expected from the above, whereas the PV neuron population did not show any change (PV-Bef_Expo, 24.3 and PV-Aft_Expo, 21.1; NS).
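A minimal sketch of this compactness measure, assuming hypothetical boolean response matrices (neurons × sequences); the bin ranges follow the 8 NSeqs and 12 RSeqs described above, but the resampling details are illustrative only.

```python
# Minimal sketch: bootstrap (n = 1000) mean of the number of nonzero cells in
# the joint histogram of how many NSeqs (0-8) and how many RSeqs (0-12) each
# neuron responds to.
import numpy as np

def compactness(resp_nseq, resp_rseq, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_cells = resp_nseq.shape[0]
    counts_n = resp_nseq.sum(axis=1)              # NSeqs each neuron responds to (0-8)
    counts_r = resp_rseq.sum(axis=1)              # RSeqs each neuron responds to (0-12)
    nonzero = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_cells, n_cells)   # resample neurons with replacement
        hist, _, _ = np.histogram2d(counts_n[idx], counts_r[idx],
                                    bins=[np.arange(10), np.arange(14)])
        nonzero.append(np.count_nonzero(hist))
    return np.mean(nonzero), np.percentile(nonzero, [2.5, 97.5])

rng = np.random.default_rng(4)
resp_n = rng.random((200, 8)) < 0.3               # hypothetical responses to 8 NSeqs
resp_r = rng.random((200, 12)) < 0.3              # hypothetical responses to 12 RSeqs
mean_nonzero, ci95 = compactness(resp_n, resp_r)
```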
Differential effects of social-experience-driven plasticity in EXNs and SOM but not in PV. A–C, Population data with two-photon imaging in Thy1-GCamp (A), SOM-GCamp (B), and PV-GCamp (C) mice, with two matrix plots for before (left) and after (right) exposure. Each matrix shows the percentage of neurons in the group and condition that respond to none (0), one, two, …, or all (8) of the NSeqs (x-axis) and to none (0), one, two, …, or all (12) of the RSeqs (y-axis). Right, Marginal distributions show the number of neurons responding to the different numbers of RSeqs. The distribution at the bottom of each matrix plot shows the average of the rows. D–F, Comparison of average selectivity to NSeqs and RSeqs in each condition (Bef_Expo and Aft_Expo) and between the conditions for each neuronal type, Thy-1 (D), SOM (E), and PV (F); *p < 0.05, **p < 0.01, ***p < 0.001; NS, Not significant.
Next, we compared the mean selectivity to NSeqs and RSeqs, based on responses quantified with df/f (see above, Materials and Methods), before and after exposure in Thy1-GCamp female mice. We found that EXNs had increased selectivity for NSeqs compared with RSeqs (one-way ANOVA, p < 0.001) following exposure. Thus, the experience-dependent plasticity for the entire NSeqs observed with single units was also observed with Ca2+ responses, relative to responses to RSeqs (Fig. 9D). However, there was an overall decrease in response selectivity to both RSeqs and NSeqs that was not observed in single units. This deviation from single-unit population behavior seen with Ca2+ imaging likely reflects the inherent differences between the two measures, as stated previously (Fig. 8D). Thus, with the Ca2+ imaging data we consider changes with respect to RSeqs in each condition. Unlike EXNs, SOM had significantly lower selectivity for NSeqs than for RSeqs before exposure, a difference that was abolished following exposure (Fig. 9E; one-way ANOVA, p < 0.001). As with Thy-1 EXNs, exposure caused SOM to also have decreased selectivity to both NSeqs and RSeqs. On the contrary, PV showed neither a difference in selectivity to NSeqs compared with RSeqs before or after exposure (paired t test, p > 0.05) nor a change in overall selectivity with exposure (Fig. 9F; one-way ANOVA, p > 0.05, NSeqs; p > 0.05, RSeqs). Thus, among the two inhibitory neuron types tested, we hypothesize SOM to be involved in the observed experience-dependent plasticity. Also, overall, SOM had the highest selectivity both before and after exposure and thus is likely capable of mediating changes to specific stimuli.
Optogenetic silencing of SOM combined with sequence presentation induces plasticity in selectivity to sequences but not sequence components
To test the hypothesis of the involvement of SOM in the above experience-dependent plasticity of entire sequences, we performed experiments in naive anesthetized female mice while silencing SOM neurons (Fig. 10A, schematic). Because Awk_F-Bef_Expo and Anes_F-Bef-Expo mice did not show any difference in sequence selectivity between them for either NSeqs or RSeqs, the use of anesthetized mice was justified. The mice used (P56–P90) expressed ArchT-EGFP specifically in SOM neurons, obtained by crossbreeding strain #021188 (Ai40D) and strain #013044 (SST-IRES-Cre; both from The Jackson Laboratory; Fig. 10A). First, we determined the power level of the 589 nm laser for ArchT activation to silence SOM. We used the highest power level at which the spontaneous activity after light off recovered to the initial spontaneous activity before light on, while producing an average increase in spontaneous rate with light on in the prestimulus period (n = 6 mice, 3 female, 3 male; n = 51 units; noise at five intensities, 55–95 dB SPL; see above, Materials and Methods; Fig. 10B). There was no significant difference in spontaneous activity between the before-light-on and after-light-off periods (Fig. 10C, bottom; paired t test, p > 0.05), with increased spontaneous activity during light on (Fig. 10B, green box; Fig. 10C, top and middle).
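The window-based check of light modulation and recovery can be sketched as below, using the 100 ms OFF1/ON/OFF2 periods described in the figure legend; spike times, light timings, and the example call are hypothetical.

```python
# Minimal sketch (hypothetical spike times): spontaneous rates in the OFF1, ON,
# and OFF2 windows (each 100 ms) around the light epoch, with a paired t test
# checking recovery of spontaneous activity after light off.
import numpy as np
from scipy.stats import ttest_rel

def window_rate(spikes, t0, t1):
    return np.sum((spikes >= t0) & (spikes < t1)) / (t1 - t0)

def light_modulation(trial_spikes, light_on, light_off, w=0.100):
    off1 = np.array([window_rate(s, light_on - w, light_on) for s in trial_spikes])
    on   = np.array([window_rate(s, light_on, light_on + w) for s in trial_spikes])
    off2 = np.array([window_rate(s, light_off, light_off + w) for s in trial_spikes])
    t, p_recovery = ttest_rel(off1, off2)          # OFF1 vs OFF2: should not differ
    return off1.mean(), on.mean(), off2.mean(), p_recovery

rng = np.random.default_rng(3)
trials = [np.sort(rng.uniform(0, 2.0, size=40)) for _ in range(10)]
off1_rate, on_rate, off2_rate, p_rec = light_modulation(trials, light_on=0.5, light_off=1.5)
```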
Reversible silencing of SOM paired with sequence presentation mimics plasticity in sequence selectivity without altering syllable selectivity. A, Schematic for optogenetic silencing of SOM using a 589 nm laser (middle). Recordings were made in Au1 using a multielectrode array with an optical fiber in a mouse model expressing ArchT-EGFP only in SOM neurons. Right, Representative image of a brain section collected postrecording, with some SOM-ArchT-EGFP-positive neurons marked with white arrows. B, Determining laser power. For all such experiments the laser was turned on 100 ms before stimulus onset and turned off 100 ms after stimulus offset (orange shading). Spontaneous or non-auditory-driven activity in three periods, OFF1, ON, and OFF2, each 100 ms long, was used, as depicted. The PSTH of a neuron is shown below the schematic. For sequences, stimulus onset and offset were the onset of the first syllable and the offset of the last syllable of the sequence. C, Histograms of modulation of spontaneous activity of all cases show significant modulation of spontaneous spiking by light (left and middle, mean, red arrow), and the histogram on the right shows the comparison of spontaneous activity before light on and after light off. D, Representative example of a neuron with optogenetics for the sequences. E, Scatter plots comparing single-syllable mean responses in NSeqs and RSeqs during SOM silencing paired with sequences (Anes_F-Opto) and after the period of pairing (Anes_F-Aft_Opto), as in Figures 4E and 6E. F, G, Scatter plots comparing mean response rates of common disyllables, excluding occurrence in the first position, for the Anes_F-Opto and Anes_F-Aft_Opto groups; similar plots as in Figure 5C (Anes_F-Bef_Expo) and Figure 6F (Anes_F-Aft_Expo). H, Comparison of mean overall selectivity in NSeqs to that in RSeqs and comparisons across groups, similar to Figure 7, C and D; *p < 0.05, **p < 0.01, ***p < 0.001; NS, Not significant.
We performed optogenetic silencing of SOM at 30 mW output power over a period encompassing the presentation of the 20 sequences (both NSeqs and RSeqs, 100 ms before to 100 ms after the stimulus, as in Fig. 10B; see above, Materials and Methods) for a total of 100 presentations (five for each NSeq and RSeq) to test our hypothesis of the involvement of SOM neurons (n = 5 female mice, n = 90 single units). A representative example of the PSTH of a neuron with optogenetics is shown in Figure 10D. We found that the relationship of responses to single syllables in NSeqs and RSeqs during SOM silencing (Anes_F-Opto; Fig. 10E, left) was similar to that in Anes_F-Bef_Expo (Fig. 10E, right), except that the reduced responses to H in NSeqs compared with RSeqs (Fig. 4E, middle) were now equal. We also probed for longer-term plastic changes induced by pairing SOM silencing (effectively disinhibiting the EXNs to which they primarily project; Pi et al., 2013) with simultaneous sound stimulus (NSeqs and RSeqs) presentations. We found that the comparative responses to each of the single syllables in NSeqs and RSeqs following the above pairing (Anes_F-Aft_Opto; see above, Materials and Methods) were largely the same as in Anes_F-Bef-Expo, Awk_F-Bef-Expo, and Anes_F-Aft_Expo (Figs. 4E, 6E, 10E). No distinct switch in preference between NSeqs and RSeqs was observed. Similar observations were made for transitions common to NSeqs and RSeqs in female mice before and after exposure (Fig. 5C, Bef_Expo; Fig. 6F, Aft_Expo) and during and after optogenetic silencing of SOM (Fig. 10F,G), with no distinct switching between NSeqs and RSeqs. We then tested the idea of plasticity of selectivity to sequences. The above experiments can also be considered a pairing of sound stimuli with silencing of SOM neurons. For both groups, Anes_F-Opto and Anes_F-Aft_Opto, there was a significant increase in selectivity to NSeqs compared with the Anes_F-Bef-Expo group, and it was the same as that of the Anes_F-Aft_Expo group of mice (Fig. 10H; one-way ANOVA, p < 0.001, p < 0.001, p < 0.001). The mean selectivity to RSeqs remained the same as in the Anes_F-Bef-Expo group. In the subgroup of units from Anes_F-Bef_Expo mice in which SOM-silencing-based pairing was performed, the mean selectivity for NSeqs was the same as in Anes_F-Bef_Expo; however, there was a small reduction in selectivity to RSeqs in the same population (median 4% lower), suggesting that reduced selectivity to RSeqs also contributes to the relative rise in selectivity to NSeqs during and after pairing. Thus, SOM inhibition during sequence presentation is capable of inducing rapid plastic changes in the coding of entire sequences without changing the coding of the component syllables and disyllables. We thus hypothesize that during the activity-dependent plasticity occurring in naive females, VIP-mediated inhibition of SOM could underlie the plasticity in sequence selectivity.
Discussion
Our study evaluates the importance of predictive sequences in mouse USVs using a combination of behavioral, molecular, electrophysiological, and in vivo imaging techniques. We found that mathematically derived informative sequences obtained during male–female interaction were preferred by naive adult female mice over random sequences. Differential expression of c-fos was observed in the auditory cortex, with more c-fos+ cells for NSeqs than for RSeqs. Auditory cortical neurons exhibited higher selectivity for NSeqs, which was further enhanced in adult females following social exposure to adult males. Finally, with in vivo two-photon imaging and optogenetics, we identified a possible role of SOM neurons in driving the above plasticity.
Previous studies have shown that the USVs produced by mice are highly variable in their acoustic features, suggesting that they may convey different types of information (Portfors and Perkel, 2014; Musolf et al., 2015; Chabout et al., 2015). Various studies have examined the behavioral relevance of USVs in terms of the preference of female mice for USVs over synthetic stimuli (Hammerschmidt et al., 2009; Ehret and Haack, 1982), for vocalizing over devocalized males (Pomerantz et al., 1983), and for complex over simple syllables (Chabout et al., 2015), among other preferences. Along with the behavioral aspect, various studies have also explored the neuronal representation of USVs in the auditory cortex using extracellular recordings (Liu and Schreiner, 2007; Galindo-Leon et al., 2009), in vivo two-photon loose-patch recordings with genetic tagging of the neurons actively involved (Tasaka et al., 2018), and expression of immediate early genes like c-fos as a proxy for neuronal activity (Moreno et al., 2018), among others. However, all the above behavior and physiology studies were performed using playback of USVs selected randomly from large databases of recordings. Although the presence of structure in call sequences of adult male mice (Holy and Guo, 2005; Grimsley et al., 2011; Chabout et al., 2015, 2016) and pups (Grimsley et al., 2011; Agarwalla et al., 2020) has long been known, studies exploring and using this structure are limited. Playback of highly probable dyads of pup calls could elicit maternal behavior, whereas the least probable dyads failed to do so (Takahashi et al., 2016). This suggests that predictive USV sequences may contain more information of benefit than the least predictive random sequences, and that ACX neuronal response properties may have adapted accordingly.
In our current study, we tried to understand whether mathematically derived informative sequences hold any relevance by evaluating them at three levels, behavioral, molecular (using c-fos), and physiological, before and after social experience. The first step was to extract the predictive sequences. Multiple methods have been used to evoke adult male USVs relevant to social interaction with females (Arriaga and Jarvis, 2013). We developed a protocol to record mouse USVs during male–female interaction over days (Fig. 1A), from which we extracted USV sequences that are highly informative, using mutual information, termed NSeqs. Although in the above interactions females may also vocalize, their contribution is minimal (Maggio and Whitney, 1985; White et al., 1998) and primarily consists of low-frequency harmonics (Grimsley et al., 2013; Lupanova and Egorova, 2015), and it was therefore not considered in our data. As a control, random sequences were designed to have the same frequency of occurrence of each syllable type as found in our male–female interaction protocol (Figs. 1, 2). Both NSeqs and RSeqs were constructed with exactly the same syllables and ISS, removing any spectrotemporal cues other than the presence or absence of predictability among the syllables to differentiate NSeqs from RSeqs. Any possible influence of variation in silence duration between syllables was removed by using equal gaps between syllables (90 ms, approximately the mean ISS duration). Examining the impact of temporal regularity by varying the ISS, and finer categorization of syllable types (Portfors, 2007; Grimsley et al., 2011; Matsumoto and Okanoya, 2018), might lead to exciting insights but was beyond the scope of the current work.
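The syllable-to-syllable predictability that underlies the extraction of NSeqs can be sketched as below: mutual information (in bits) between consecutive syllables of a recorded syllable string. The syllable string and the way candidate sequences would then be selected are simplified placeholders; the full procedure is described in Materials and Methods.

```python
# Minimal sketch: mutual information between consecutive syllables of a
# syllable string, as a measure of sequence predictability. Data are synthetic.
import numpy as np
from collections import Counter

def pairwise_mutual_information(syllable_string):
    pairs = list(zip(syllable_string[:-1], syllable_string[1:]))
    n = len(pairs)
    p_first = Counter(a for a, _ in pairs)      # marginal counts of the first syllable
    p_second = Counter(b for _, b in pairs)     # marginal counts of the second syllable
    p_pair = Counter(pairs)                     # joint counts of consecutive pairs
    mi = 0.0
    for (a, b), c in p_pair.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((p_first[a] / n) * (p_second[b] / n)))
    return mi

usv = list("SSJJJHHHHOOSSJHHOOOSSS")            # hypothetical syllable string (S, J, H, O)
mi_bits = pairwise_mutual_information(usv)
```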
To evaluate the behavioral relevance, we performed a two-choice task with four sessions, in two of which stimuli were presented. NSeqs and RSeqs were played alternately from two speakers at diagonally opposite corners, and the stimulus sides were switched between the two sessions to remove any bias in side preference. Interestingly, we found a higher preference for NSeqs over RSeqs in naive adult female mice (Fig. 2). Having found that females give more weight to mathematically derived informative sequences, we explored further using the well-studied cellular immediate early gene marker c-fos. We exposed naive females to either silence, NSeqs, or RSeqs and quantified the resulting expression. Higher activation of c-fos+ cells for NSeq sequences in Au1 further strengthened our hypothesis that sequences with high predictiveness are important (Fig. 3C). Similar results were observed for AuD, AuV, and ACX as a whole (Fig. 3C). Thus, the behavioral and molecular assay results warranted further investigation of the coding of different sequence types in Au1. We also investigated the activation of c-fos in the earlier auditory stations, the medial geniculate body (MGB) and CIC, as well as in a higher-order auditory cortical association area, TEA (Fig. 11). MGB, both ventral and dorsal, exhibited differential encoding for NSeqs and RSeqs (one-way ANOVA, p < 0.05). However, we cannot conclude whether the preferential selectivity for NSeqs in Au1 is inherited from MGB. The feedback connection from the ACX to MGB (Bartlett, 2013) might have a role in shaping MGB responses and requires further investigation. In the CIC (bilateral), there was no differential encoding for RSeqs and NSeqs. However, the c-fos+ cell counts for both sequence types differed significantly from the control condition (silence; one-way ANOVA, p < 0.05; Fig. 11), as expected. Differences in encoding are thus observed at stations above the CIC (Fig. 11). Further, in TEA, significant differential encoding was observed for RSeqs and NSeqs (one-way ANOVA, p < 0.05; Fig. 11E). However, in a comparative study among the different regions, after subtracting the activation in the control condition from the Nat and Ran conditions, we find that ACX overall has the highest activation in the Nat condition (Fig. 11F).
Differential c-fos+ cell activation in MGV, MGD, IC, and TEA. A, Representative coronal section for c-fos+ cells in the auditory thalamus (MGD and MGV) and TEA (dotted lines). B, Representative images of the sampled locations from MGV for c-fos+ cells in three different contexts, Control, RSeq, and NSeq. White arrows mark some c-fos+ cells in red, cell nuclei (DAPI stained) in blue, and overlaid cells in purple in the sampled images. C, Sample images from TEA for the above three experimental conditions show c-fos+ cells in red, cell nuclei in blue, and overlaid cells in purple. White arrows indicate c-fos+ cells. D, Example section from the IC region with cortical and subcortical areas marked in dotted lines; DCIC, dorsal cortex of the IC; ECIC, external cortex of the IC; 2Cb, second cerebellar lobule. c-fos+ cells were quantified from CIC (bilateral) from animals exposed in the above experimental contexts. Example c-fos+ cells are marked with arrows (c-fos+ in red, DAPI in blue, and overlaid in purple). E, Quantification of c-fos expression in the three different conditions, presented as bar diagrams. F, Comparative quantification of c-fos+ cells in Ran and Nat conditions for different regions after subtracting the activation for each region in silence; *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001; NS, Not significant.
The sequences extracted and used in our study are based on syllable-to-syllable predictability. A set of such syllables combined into structured sequences as above can provide more information than the same syllables produced in isolation (Shannon, 1951). In many species, such as songbirds, acoustic communication takes the form of ordered sequences, especially with more flexible vocal production. Sequences contribute to information-rich communication, as is the case for human speech as studied through the prediction of letters in English text (Shannon, 1951). Previous work with machine learning has connected single mouse USV syllables with particular behaviors and extracted information about identity, sex, and context (Ivanenko et al., 2020). However, similar approaches have yet to be taken for mouse USV sequences of syllables. Although we do not claim that each sequence conveys a different meaning or different information, the results do provide the fundamental bases required for sequence-based communication. Future studies are needed to combine more refined behavior to link sequences with behavior and to understand their representation, encoding, and plasticity at the single-neuron and network levels.
Long-term as well as rapid alteration of the representation of sounds in mouse Au1 with experience, such as maternity (Tasaka et al., 2018; Galindo-Leon et al., 2009), learning based on reward or fear (Fritz et al., 2003; Wang et al., 2020), developmental environment (Mehra et al., 2022a; Bhumika et al., 2020; Kral, 2013), and artificial involvement of subcortical and higher-order structures (Winkowski et al., 2013; Chavez et al., 2009), has been shown for single tokens of NSeqs as well as artificial sounds.
However, our study shows that social-experience-dependent changes in selectivity to behaviorally relevant sequences are captured by the sequences as a whole and not by their components (single syllables and disyllables; Figs. 4–7). Our results show that Au1 responses to single tokens and dyads do depend on sequence type, NSeqs or RSeqs, but their relative selectivity does not change with the above experience (Figs. 4–7). In contrast, the selectivity of single neurons for the whole sequence is altered based on the degree of the experience (Fig. 7D). Integration of acoustic components into a single percept across frequency, with components at least partially overlapping in time, is well known (Wang et al., 2020; Kline et al., 2021). However, feature integration to obtain a holistic representation of sound tokens nonoverlapping in time is surprising and needs to be better understood. Such integration likely involves the very long timescales of adaptation known in A1 (Pérez-González and Malmierca, 2014; Ulanovsky et al., 2004), recursive connections with long time constants, and inhibitory inputs (Kim and Sejnowski, 2021). Previous work (Mehra et al., 2022b) has looked into very long timescale adaptation (Ulanovsky et al., 2004) of entire sound sequences and changes in their representation over time with repeated presentations and has shown that recurrence in EXNs and SOM plays a role. Using two-photon Ca2+ imaging, we find that SOM and EXNs have a role to play in encoding experience-dependent plasticity, consistent with the literature (Fig. 9). Further, optogenetic silencing of SOM paired with sequence presentations alters sequence selectivity as with social experience, with essentially no change in the relative selectivity of single neurons to syllables (Fig. 10).
SOM are known to be inhibited by VIP neurons, whose activation thereby disinhibits EXNs (Pi et al., 2013). SOM neurons respond less selectively to NSeqs than to RSeqs before exposure, and EXNs behave oppositely. Thus, higher responses of EXNs to NSeqs than RSeqs, coupled with disinhibition through suppression of SOM, can drive the observed plasticity. The higher selectivity of SOM to RSeqs before exposure would reduce the disinhibition when RSeqs are presented and thus produce weaker plastic effects. We hypothesize that the above silencing of SOM in naive female mice during the social experience occurs through VIP neuron activation triggered by the interaction with the male. However, VIP activation during the said social experience has not been demonstrated and needs to be explored. The likely candidates are inputs from the basolateral amygdala, hypothalamus, ventral tegmental area, or prefrontal cortex because of their involvement in mating behavior (Nakajima et al., 2014; Hashikawa et al., 2016; Zhang et al., 2021).
Our study is a step toward treating sequences of vocalizations as structured and exploiting the predictiveness in mouse USVs through a mathematical framework that yields informative sequences. Our results take us a step further in establishing mouse models of vocal communication. Our study also opens ways to reinvestigate many aspects of mouse communication and even vocal learning (Arriaga et al., 2012; Arriaga and Jarvis, 2013) and to revisit ideas about the innateness of mouse USVs (Mahrt et al., 2013) by considering sequences of syllables.
Footnotes
This work was supported by the DBT/Wellcome Trust India Alliance Grant IA/I/11/2500270 to S.B.; Department of Science and Technology (DST) Ministry of Science and Technology, India Grant DST/INT/CZ/P-04/2020; DST Science and Engineering Research Board Grant SERB-CRG/2021/005653; and Indian Institute of Technology Kharagpur Grant SGIGC 2015. We thank Ministry of Human Resource Development for the institute fellowship, India Alliance, Indian Institute of Technology Kharagpur for startup and Challenge grants, and Sponsored Research & Industrial Consultancy Cell.
The authors declare no competing financial interests.
Correspondence should be addressed to Sharba Bandyopadhyay at sharba@ece.iitkgp.ac.in