Cascaded Processing of Amplitude Modulation for Natural Sound Recognition

Temporal variation of the sound envelope, or amplitude modulation (AM), is essential for auditory perception of natural sounds. The neural representation of stimulus AM is successively transformed as it is processed by the cascade of brain regions in the auditory system. Here we sought the functional significance of this cascaded transformation of AM representation. We modelled the function of the auditory system with a deep neural network (DNN) optimized for natural sound recognition. Neurophysiology-style analysis of the DNN revealed that an AM representation similar to that of the auditory system emerged during the optimization. DNNs that recognized sounds better exhibited greater similarity to the auditory system. Control experiments suggest that the cascading architecture, the structure of the data, and the optimization objective may be the essential factors for the lower, middle and higher regions, respectively. The results were consistently observed across independent datasets. These findings suggest that the AM representation in the auditory system may itself have emerged through optimization for natural sound recognition.


Natural sounds such as speech and environmental sounds exhibit rich patterns of amplitude envelope (Fig. 1a). Temporal variation of the amplitude envelope, called amplitude modulation (AM), is one of the most important physical dimensions for auditory perception 1,2. Humans can recognize speech content and identify everyday sounds on the basis of their AM patterns even when the temporal fine structure is substantially deteriorated 3,4. The AM pattern of a sound is usually characterized by its frequency components.

Emerging selectivity to AM frequency

The aim of the present study is to understand the functional significance of the empirically revealed AM coding scheme in the auditory system, by comparing the AM representation in the trained DNN with that in the auditory system. To enable direct comparison, we simulated the experimental approaches of typical neurophysiological studies. Specifically, we conducted "single unit recording" on each unit in the DNN while presenting sinusoidally amplitude-modulated sound stimuli (Fig. 2a, b). A single unit responded differently to stimuli with different AM frequencies (Fig. 2c shows examples). From the recorded unit activity, we calculated the degree of synchrony to the stimulus AM frequency and the average magnitude of the activity. The synchrony and the average activity as functions of AM frequency, called a temporal modulation transfer function (tMTF; Fig. 2d, top panel) and a rate modulation transfer function (rMTF; Fig. 2d, bottom panel), characterize tuning to AM frequency in terms of temporal and rate coding, respectively 7.
Fig. 3a shows MTFs of representative units in the 1st (i.e., closest to the input), 5th, 9th and 13th (i.e., closest to the output) layers. As in typical physiological experiments, we classified the MTFs into low-pass, band-pass, high-pass or flat types according to certain criteria (see the Methods). Most units exhibited low-pass, band-pass, or flat MTFs, and a negligible number of units exhibited the high-pass type (Fig. 3b). All MTFs in the 1st layer were flat, indicating that the 1st layer was not tuned to AM frequency. In the 5th layer, units with low-pass or band-pass tMTFs appeared, and a very small number of units with low-pass rMTFs were observed. In the 9th and higher layers, the magnitude of the tMTFs generally increased and the number of units with low-pass or band-pass rMTFs also increased. Heatmaps of all tMTFs normalized by their peaks reveal a downward shift of the distribution of preferred AM frequencies from the 5th layer to the highest layer, and distinct tuning in the rMTFs appeared from the 9th layer upwards (Fig. 3c).

Comparison with the auditory system

As in typical neurophysiological studies, the MTF of a unit was characterized by its best modulation frequency (BMF).

Generality across datasets

It can be asked whether the obtained results were specific to our choice of dataset, animal vocalizations and environmental sounds. Previous studies provide positive evidence for generality across datasets. A DNN trained on one dataset can be transferred to another task with only small modifications 56. Also, an efficient-code model trained on substantially different sound datasets, one consisting of human speech and the other of animal vocalizations and environmental sounds, exhibits quantitatively similar representations of carrier frequency 57.
To test the generality of the findings of the present study across datasets, we conducted the "physiology" in a DNN optimized for phoneme classification of speech sounds. Each segment of speech in the dataset was labelled with the corresponding phoneme, an element of vocalization in speech.
The DNN trained on speech led to essentially the same conclusions as the DNN trained on animal and environmental sounds. The layer-region pairwise similarity matrix exhibited the diagonal pattern (Fig. 6d): lower layers were similar to peripheral regions, middle layers to middle regions, and higher layers to central regions. The similarity emerged during the optimization and was weak in the control conditions (Extended Data Fig. 11a, b). The similarities in DNNs with various architectures correlated with the classification accuracy (Extended Data Fig. 11c; Spearman's rank correlation coefficient ρ = 0.33, p = 3.91×10−2).

Tuning to carrier frequency

Apart from tuning to AM frequency, one of the most frequently measured characteristics of auditory neurons is tuning to carrier frequency 58,59. We calculated the temporal average of the activity of each unit in response to sinusoids with various frequencies and amplitudes (Extended Data Fig. 12a). The responses generally increased as the amplitude of the input increased, but some units in higher layers showed non-monotonic responses to the input amplitude. For instance, in the 13th layer, the unit shown in the right panel of Extended Data Fig. 12a responded strongly to a ~30 dB, 400 Hz tone, but its response was smaller for the same tone at larger amplitudes. As in neurophysiological studies, a unit was characterized by a frequency tuning curve, defined for each frequency as the minimum stimulus amplitude that elicits a response larger than a certain threshold (Extended Data Fig. 12a, grey lines, and Extended Data Fig. 12b). Frequency tuning curves in the lower (1st to 3rd) layers appeared to exhibit many peaks. Those in the middle layers (around the 5th layer) appeared to exhibit a single large peak and multiple small peaks. As a population, the large peaks appeared to span a wide range of carrier frequencies (Extended Data Fig. 12b), which may be interpreted as a band-pass filter bank. Frequency tuning curves in the higher (8th to 13th) layers appeared to be more complex, without clear band-pass-like tuning even as a population.
The results stand in contrast to the auditory system, where neurons usually exhibit frequency tuning with a sharp single peak, which likely originates from the frequency decomposition performed in the cochlea. We did not explicitly conduct spectral decomposition of the input sound but fed raw waveforms directly to the DNN. The results suggest that frequency decomposition in the cochlea may be essential for auditory-system-like carrier frequency tuning but not for auditory-system-like AM tuning.
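As an illustration of the tuning-curve procedure described above, the following sketch extracts a frequency tuning curve from a matrix of time-averaged responses to tones of various frequencies and amplitudes; the function name, array layout, and threshold handling are our own assumptions rather than the study's actual code.

import numpy as np

def frequency_tuning_curve(mean_responses, amplitudes_db, threshold):
    """Per-frequency minimum stimulus amplitude giving a response above threshold.

    mean_responses: (n_frequencies, n_amplitudes) array of time-averaged unit
    activity; amplitudes_db: stimulus amplitudes in ascending order.
    """
    n_freqs = mean_responses.shape[0]
    curve = np.full(n_freqs, np.nan)  # NaN where the threshold is never exceeded
    for f in range(n_freqs):
        above = np.where(mean_responses[f] > threshold)[0]
        if above.size > 0:
            curve[f] = amplitudes_db[above[0]]  # smallest amplitude above threshold
    return curve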

Discussion
We found that a DNN optimized for sound classification exhibits an AM representation similar to that of the auditory system throughout the entire cascade of signal processing. The lower layers of the DNN were similar to the peripheral regions, the middle layers to the middle regions, and the higher layers to the central regions. This representation emerged gradually during the optimization and correlated with the classification accuracy. The control experiments suggest that the essential factors for AM representation in the lower, middle, and higher layers are the cascading architecture, the naturalness of the data, and the optimization objective, respectively. The same kind of representation was consistently observed in DNNs trained on different datasets. The similarity of the entire cascade could be demonstrated because our DNN performs sound recognition from a raw sound waveform. Since our DNN was not designed or trained to reproduce any physiological or anatomical property of the auditory system, including cochlear frequency decomposition, the results should reflect only the nature of the task and the data, implying that AM representations, which are essential for auditory perception, are common to the DNN and the auditory system. These results suggest that the AM representation in the auditory system might also be the consequence of optimization for sound recognition in the real world, which could take place during evolution and development.

AM representation in the lower, middle, and higher layers

Our results suggest that the AM representation in the lower layers is due to the cascading nature of the system. A DNN performs a highly nonlinear operation by cascading close-to-linear operations. Perhaps this is also what happens in the auditory system: neurons at each stage perform a relatively simple operation, which may lead to little sensitivity to AM frequency in the peripheral regions.
The representation in the middle and higher layers, however, depended on the optimization conditions. The representation in the middle layers was similar to that of the auditory system only in the DNNs with high classification accuracy, but not in the DNNs with poor classification accuracy, the DNN halfway through the optimization process, or the DNN trained with unrealistic data. This suggests that mid-level AM representation is essential for effective representation of natural sounds. On the other hand, the AM representation in the higher layers was similar to the auditory system in all of these conditions except the waveform following task. This suggests that the nature of the task is the determinant factor in forming high-level AM representation, perhaps because higher representations are used for the final decision more directly than middle or lower representations. In other words, whatever the lower representation is, the role of the higher layers is to derive appropriate outputs for the specific task from the lower representation.

Decreasing temporal resolution for sound classification

Both of the two prominent characteristics of auditory AM coding, the decrease of synchronizing AM frequency and the time-to-rate conversion, involve a decrease in the temporal resolution of the transmitted signals. The discussion above regarding representation in the higher layers suggests that encoding the information of sound categories with low temporal resolution may be beneficial for classification tasks.
The next question is why such a coding scheme is beneficial. The following consideration might explain the reason. In our setting, as in typical classification tasks with a DNN, the larger the value of a unit in the classification layer (the layer above the 13th layer), the larger the score for the corresponding category. The final output category is the one assigned to the unit with the maximum value. If the units synchronized to the amplitude envelope of the input sound, which waxes and wanes over time, the output category would be temporally unstable. On the other hand, if the activity of an output unit is large all the time, the score for its category remains large. The latter case is preferable for classification tasks.
In the real world, recognizing the stimulus category would be more important than synchronizing to the stimulus, and animals might be better at sound classification than at synchronizing to a sound. This notion is supported by the well-known phenomenon that, in a synchronization tapping task, humans tend to respond slightly earlier than the correct timing 60, suggesting that we tap according to an internally generated rhythm rather than reacting after hearing the ongoing sound. Other animals that have the ability to act in synchrony with a stimulus exhibit similar behaviour 61. These animals (including humans) might first recognize the frequency of the stimulus envelope and then generate a rhythm at the recognized AM frequency. Such behaviour might also be observed if a DNN optimized for sound classification were forced to perform a synchronization tapping task.
A reader who is familiar with convolutional DNNs may think that low temporal resolution in the higher layers is trivial if each layer performs a pooling operation, which temporally downsamples its input. However, this is not the case for our DNN, in which no pooling was performed; the layers in our DNN therefore do not necessarily downsample the input. Indeed, the DNN trained for the waveform following task did not decrease the temporal resolution very much.
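The argument above about the temporal stability of the classification output can be illustrated with a toy simulation (all numbers are illustrative and not taken from the study): a score that synchronizes to the envelope lets the winning category alternate over time, whereas a sustained score keeps the decision stable.

import numpy as np

t = np.linspace(0.0, 1.0, 1000)
envelope = 0.5 * (1.0 + np.sin(2 * np.pi * 4.0 * t))   # 4 Hz amplitude envelope

score_synced = envelope                   # output unit synchronized to the envelope
score_sustained = np.full_like(t, 0.6)    # output unit with sustained activity

# Fraction of timeframes at which the synchronized unit wins the argmax
wins = np.mean(score_synced > score_sustained)
print(f"synchronized unit wins the argmax at {wins:.0%} of timeframes")
# The winning category flips with the envelope, so the output is temporally unstable;
# a unit whose activity stays large all the time keeps the output category constant.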

Our DNN did not exhibit the sharp single peaks in frequency tuning curves that are widely found in the auditory system, although some studies report auditory-like frequency tuning emerging in DNNs with architectures different from ours 64,65. In the auditory system, the frequency tuning of a neuron is largely affected by the mechanical and physical properties of the cochlea 59. Although investigating what determines the shape of a frequency tuning curve in a DNN is beyond the scope of this study, some architectural constraints might be necessary to induce similarity to the auditory system in the carrier frequency domain.
Several other modelling studies try to explain AM coding in the auditory system with anatomical and physiological assumptions, including frequency decomposition in the cochlea 23,66,67. A message from the present study, which did not incorporate cochlear frequency decomposition, is that sharp frequency tuning may not be necessary for an AM representation that is effective for natural sound recognition.

Our results suggest the effectiveness of analysing computational models using physiological methods. To date, various methods have been proposed for analysing the representations in a DNN 63. Most of them rely on the differentiability of the DNN, using backpropagation to estimate the optimal input for each unit, assuming such an input exists. In contrast, there is a long history of developing physiological methods to elucidate brain functions. Physiologists rely on parametric searches over the stimulus space, since backpropagation cannot be applied to biological neurons 5. One advantage of our method is that the results are directly comparable with those reported in physiology experiments. By taking advantage of the vast number of previously conducted neurophysiological studies, we could show the relationship between the layers in the DNN and the regions in the entire cascade of the auditory system. Although DNNs have been used to explain sensory representations in several modalities 20-24, to the best of our knowledge this is the first report of similarity throughout the entire cascade of sensory processing. The success of our method indicates the future possibility of applying well-established physiological paradigms to explore the functions and mechanisms of DNNs and other complex machine learning models.
From a physiological perspective, this study implies that a DNN may become a useful tool for testing new hypotheses. Although this study focused on the representation of sound envelopes, for which a large amount of physiological data is already available, any domain of stimulus parameters can be explored in the same paradigm as ours. As long as the model takes raw data as input, as in this study, physiologists can test their hypotheses on any sensory domain with any kind of stimulus at a much lower cost than actually conducting a pilot physiology experiment.

Methods

Task

The task of the DNN was sound classification. Specifically, the task was to estimate the sound category at the last timeframe of a sound segment of a certain duration (0.19 s for natural sounds and 0.26 s for speech). The classification accuracy is defined as the average over categories of the correct classification rate for each category, which is the number of timeframes correctly estimated as that category divided by the total number of timeframes of the category.
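The classification accuracy defined above (the mean over categories of the per-category correct rate) can be written, for example, as the following NumPy sketch; the function and variable names are placeholders and not from the original code.

import numpy as np

def classification_accuracy(true_labels, predicted_labels, n_categories):
    """Mean over categories of the per-category correct classification rate."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    rates = []
    for c in range(n_categories):
        mask = (true_labels == c)                 # timeframes belonging to category c
        if mask.sum() == 0:
            continue                              # skip categories absent from this set
        rates.append(np.mean(predicted_labels[mask] == c))
    return float(np.mean(rates))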

Datasets
The following two datasets were used to train the DNNs. The first consists of non-human natural sounds and is a subset of ESC-50 68. The original dataset contains 50 sound categories with 40 sounds per category. From the original dataset we used the 18 categories that are not produced by human activities. Each entry in the original dataset contains a sound waveform of less than 5 s and the category of the sound. In this study we excluded silent intervals, resulting in a total length of 53.9 minutes. The original dataset is divided into 5 folds for cross-validation. We used fold #5 for validation and the other folds for training. The sound format was 44.1 kHz 16 bit linear PCM.
The second dataset consists of speech sounds 69. Each entry in the dataset contains a sound waveform of a single spoken sentence, phoneme categories, and the time interval of each phoneme. The original number of phoneme categories is 61. We merged some categories in accordance with previous studies 70,71, resulting in 39 categories. The average duration and the total duration of the sounds are 3.1 s and 3.3 hours, respectively. The data are originally divided into a validation set and a training set, and we followed this division. The validation set and the training set contain speech from 24 and 462 speakers, respectively. The speakers and the sentences in the two sets did not overlap.

Architecture

The number of layers and the number of units in each layer were determined in a pilot study and fixed to those values throughout the study. In the pilot study, DNNs with various numbers of layers and units were trained on a random portion of the training set. The filter length was 2, and the dilation length was 2 to the power of the layer number 62. The numbers of layers and of units per layer that gave the best classification accuracy on the remaining portion of the training set were used in the subsequent study.
We then tested multiple architectures with random filter and dilation lengths in each convolutional layer and selected the DNN that achieved the best classification accuracy on the novel dataset (Extended Data Table 1). The filter size and dilation length were randomly chosen for each layer under the constraints that the filter size does not exceed 8 and that the total input length of the whole DNN, which equals the length of the input time window of the highest layer, does not exceed 8192 samples (~0.19 s) for non-human sounds and 4096 samples (~0.26 s) for speech. The number of layers and the number of units in each layer were fixed as mentioned in the previous paragraph.

Optimization

The DNNs were trained on the training set, and the classification accuracy was calculated on the validation set. The initial filter weights were randomly sampled and the biases were set to 0, in accordance with a previous study 73. The filter weights and biases were updated using the Eve algorithm 74 with softmax cross entropy as the cost function. The number of iterations for parameter updates was set to the value that gave the best classification accuracy on a random portion of the training set when trained on the remaining portion.

Stimuli

The stimulus was 8 s of sinusoidally amplitude-modulated white noise (Fig. 2b). In physiological studies, tuning to AM frequency is measured with sinusoidally amplitude-modulated tones with carriers at the neurons' best frequencies, sinusoidally amplitude-modulated white noise, or click trains.
We did not use tones as carriers because many units showed multiple peaks in their carrier-frequency tuning curves or non-monotonic responses to the input amplitude (Extended Data Fig. 12), which made it difficult to define best carrier frequencies.
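For reference, a sinusoidally amplitude-modulated white-noise stimulus of the kind described above could be generated as in the following sketch; the sampling rate, modulation depth, and normalization are illustrative assumptions (the text specifies only the 8 s duration and the use of 16 carrier instances).

import numpy as np

def sam_noise(am_freq, duration=8.0, fs=44100, depth=1.0, rng=None):
    """Sinusoidally amplitude-modulated white noise."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(int(duration * fs)) / fs
    carrier = rng.standard_normal(t.size)                      # white-noise carrier
    envelope = 1.0 + depth * np.sin(2 * np.pi * am_freq * t)   # sinusoidal AM
    return envelope * carrier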

Physiological analysis of a DNN
From the values of each unit, the synchrony to the stimulus and the average activity were calculated. The synchrony to the stimulus was quantified by the vector strength 75. When dealing with spike timing data recorded from biological neurons, each spike is represented as a unit vector with its angle corresponding to the modulator phase at that time, and the vector strength is defined as the length of the average of these unit vectors. Equivalent operations were applied to the continuous output of the DNN unit to derive a value of vector strength (equation 1):

VS = | Σ_t a(t) exp(i 2π f_m t / f_s) | / Σ_t a(t),    (1)

where t is the index of the timeframe, a(t) is the unit activation, f_s is the sampling rate, and f_m is the stimulus AM frequency. The vector strength takes a value between 0, indicating no synchrony, and 1, indicating perfect synchrony. The average activity was defined as the temporal average of the unit values, which can be considered the DNN version of an average spike rate. The synchrony and the average activity were averaged over 16 instances of the carrier white noise to reduce the effect of stimulus variability. A tMTF and an rMTF were defined as the synchrony and the average activity, respectively, as functions of AM frequency. In physiology, an MTF is usually defined only at the frequencies at which the unit shows statistically significant synchrony or spike rate. Since a statistical test on the output of a deterministic model such as our DNN does not make sense, we considered synchrony or average activity below a certain threshold as "non-significant" and excluded it from the following analysis. The threshold was arbitrarily set to 0.01 for the synchrony and, for the average activity, to 0.01 above the average activity in response to unmodulated white noise.
An MTF was classified into one of the following four types: low-pass, high-pass, band-pass, or flat. A low-pass MTF was defined as one having no values smaller than 80% of its maximum at frequencies below the peak frequency. A high-pass MTF was defined as one having no values smaller than 80% of its maximum at frequencies above the peak frequency. A flat MTF was defined as one having no values smaller than 80% of its maximum at any frequency, or one with a peak-to-peak range of less than 0.1. A band-pass MTF was defined as any other MTF.
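A minimal sketch of the vector-strength computation (equation 1) and the MTF-type classification described above is given below; the 80% and 0.1 criteria are taken from the text, while the function names and the order of the checks are our own assumptions.

import numpy as np

def vector_strength(activity, fs, fm):
    """Length of the activity-weighted mean phase vector (equation 1)."""
    t = np.arange(activity.size)
    phase = 2 * np.pi * fm * t / fs
    return np.abs(np.sum(activity * np.exp(1j * phase))) / np.sum(activity)

def classify_mtf(mtf_values):
    """Classify an MTF (values ordered by AM frequency) as flat, low-pass, high-pass, or band-pass."""
    values = np.asarray(mtf_values, dtype=float)
    threshold = 0.8 * values.max()
    if np.all(values >= threshold) or (values.max() - values.min()) < 0.1:
        return "flat"
    peak = int(np.argmax(values))
    if np.all(values[:peak + 1] >= threshold):
        return "low-pass"    # no values below 80% of the maximum below the peak
    if np.all(values[peak:] >= threshold):
        return "high-pass"   # no values below 80% of the maximum above the peak
    return "band-pass"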
BMFs were calculated from band-pass MTFs, and UCFs were calculated from low-pass and band-pass MTFs. BMFs of low-pass, high-pass, or flat MTFs and UCFs of high-pass or flat MTFs were considered undefinable. The BMF was defined as the modulation frequency at the peak of the MTF. If multiple peaks of the same height existed, the geometric mean of their frequencies was taken. The UCF was calculated in two different ways: one for the qualitative visualization in Fig. 4a and the other for quantitative comparison with specific physiological data reported in the literature. The UCF for visualization was defined as the frequency at which the MTF crosses 80% of its maximum. If an MTF had multiple such frequencies, the geometric mean of these frequencies was used. The threshold of the UCF for quantitative comparison with the auditory system varied according to the reference physiology study: 50% 35,49, 80% 26,32, or 70% (−3 dB) 27-29 of the maximum, the 90%:10% interior division between the minimum and maximum 36, an absolute value of 0.1 26,31, or the highest frequency that gives significant responses 32,33,36,38,42-44,47,50. If the MTF did not cross the threshold at any frequency, the UCF was considered undefinable.
The stimuli for calculating tuning to carrier frequency were tones with various frequencies and amplitudes. The values of each unit were temporally averaged to obtain the response to a particular stimulus. The tuning curve was defined, for each frequency, as the smallest amplitude inducing a response larger than a certain threshold. In physiological studies such thresholds are usually determined arbitrarily, and the threshold used in Extended Data Fig. 12 was likewise set arbitrarily.
To compare the DNN with the auditory system, the similarity of each distribution extracted from the physiological literature to the corresponding distribution in each layer of the DNN was calculated. As the measure of similarity we employed one minus the Kolmogorov-Smirnov statistic, since it is nonparametric and does not depend strongly on the bin widths of the histograms. For each of the BMF and UCF, and for each of rate and temporal coding, the similarities for the same region within a single paper were averaged, and then the similarities for the same region across different papers were averaged (Extended Data Fig. 4). Averaging the four pairwise similarities (tBMF, tUCF, rBMF, and rUCF) yielded the final layer-region pairwise similarity matrix. Since no distribution of tBMF has been reported for AN, no distribution of rBMF for AN, CN, or SOC, and no distribution of rUCF for AN or CN, the corresponding similarities were set to 1 if the layer had no unit with a definable BMF or UCF and to 0 otherwise. For the other regions, the similarity was set to 0 if the layer had no unit with a definable BMF or UCF.
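The distribution similarity used above (one minus the Kolmogorov-Smirnov statistic) can be computed, for example, with SciPy, assuming that the BMF or UCF values of a layer and the values extracted from a physiological report are available as two samples; the handling of empty samples below is a simplification of the region-specific rules described in the text.

import numpy as np
from scipy.stats import ks_2samp

def distribution_similarity(dnn_values, physiology_values):
    """1 minus the two-sample Kolmogorov-Smirnov statistic."""
    if len(dnn_values) == 0 or len(physiology_values) == 0:
        return 0.0  # no definable BMF/UCF; see text for the region-specific rules
    result = ks_2samp(np.asarray(dnn_values), np.asarray(physiology_values))
    return 1.0 - result.statistic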

Evaluation of a pairwise similarity matrix
From the matrix of pairwise similarities, the similarity of the entire cascade and that of each layer were calculated. We wished to evaluate the pairwise similarity matrix in such a way that a DNN whose lower layers are similar to the peripheral regions, whose middle layers are similar to the middle regions, and whose higher layers are similar to the central regions receives a high score. To realize this, we defined the similarity of the entire cascade, which we call the cascade similarity, as a weighted mean of the pairwise similarity matrix (Extended Data Fig. 6). The weight at position (i, j) was determined by a function of i/Ni and j/Nj, where Ni and Nj are the number of brain regions (= 7) and the number of DNN layers, respectively. The weight matrix was scaled so that its squared mean was 1. The weight was maximal on the diagonal and minimal at the top-left and bottom-right corners. The similarity of each layer was calculated in an analogous manner.
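Because the exact weighting expression could not be recovered here, the following sketch uses an assumed diagonal weighting (one minus the distance from the diagonal in normalized coordinates) purely for illustration; only the weighted-mean structure and the unit squared-mean scaling are taken from the text.

import numpy as np

def cascade_similarity(similarity_matrix):
    """Weighted mean of a region-by-layer similarity matrix, emphasizing the diagonal.

    The diagonal weighting below is an assumption; the original weight was some
    function of i/Ni and j/Nj that is maximal on the diagonal.
    """
    n_regions, n_layers = similarity_matrix.shape
    i = np.arange(n_regions)[:, None] / (n_regions - 1)
    j = np.arange(n_layers)[None, :] / (n_layers - 1)
    weights = 1.0 - np.abs(i - j)                 # largest on the diagonal, smallest at the off-diagonal corners
    weights /= np.sqrt(np.mean(weights ** 2))     # scale so that the squared mean of the weights is 1
    return float(np.mean(weights * similarity_matrix))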

Control experiments
In the first control experiment, the weights and biases were shuffled across units within each layer; the weights and the biases were shuffled independently. In the second control experiment, the category labels of the sounds in the training set were randomly shuffled; the validation set was not modified, and the parameter updates were conducted for the same number of iterations as in the original, non-shuffled condition. In the third control experiment, the order of the waveform samples in each sound was randomly shuffled, resulting in a noise-like input waveform that maintains only the marginal distribution of the amplitudes. The fourth control experiment, the waveform following task, was to copy the amplitude value of the last timeframe of the input sound segment. To make the results directly comparable with those of the classification tasks, the target amplitude was quantized and the cost function was softmax cross entropy 62. The waveform was nonlinearly transformed with a μ-law companding transformation before quantization 62. The number of bins was equal to the number of sound categories in the original classification task.

Figure legends

Fig. 1. The modulation spectrum was calculated as the root mean square of the envelope filtered with a logarithmically spaced bandpass filter bank. Each modulation spectrum is normalized by its maximum. The lower and the upper peaks in the modulation spectrum of speech (top) probably contain information about the speech content and the speaker, respectively. The modulation spectrum of the rain sound (bottom) appears different from that of the speech.

Fig. 2. (b) Examples of AM stimuli with 1, 10, 100, and 1000 Hz AM frequency. The carrier was white noise. (c) Examples of responses to the AM stimuli in (b) in a single unit. A unit in the 8th layer is chosen as an example. The responses to stimuli with different AM frequencies appear different. (d) An example of a tMTF (top) and an rMTF (bottom) of the same unit as in (c). A tMTF and an rMTF are defined as the synchrony to the stimulus AM frequency and the average activity, respectively, as functions of AM frequency. The unit exhibited a low-pass tMTF and a band-pass rMTF.

Fig. 3. (a) In the 1st layer all MTFs were flat. In the 5th layer significant synchrony to the stimulus AM was observed. In the 9th and 13th layers the synchrony at lower AM frequencies increased. The magnitude of the rate-based responses, shown as the heights of the rMTFs, appeared to increase gradually with ascending layers.
(b) The numbers of units with low-pass (solid green lines with circles), band-pass (dashed red lines with crosses), high-pass (dotted black lines with triangles), and flat (dash-dotted grey lines with squares) tMTFs (left panel) and rMTFs (right panel). Most MTFs were of the low-pass, band-pass, or flat type. With ascending layers, the number of low-pass and band-pass MTFs increased; the increase started at a higher layer for rate coding than for temporal coding. (c) Heatmaps of all tMTFs (left) and rMTFs (right) in layers 1, 5, 9, and 13. MTFs are normalized by their peak values for better visualization. The units are sorted vertically by their peak AM frequencies. Ascending from layer 5, the effective AM frequency for inducing synchrony appeared to decrease, and the distinction between darker and brighter areas in the rMTFs appeared to become clearer. In some layers, distinct peaks and notches appeared at particular AM frequencies commonly across different units (seen as vertical lines in the tMTFs). We have no clear explanation for this, but it is perhaps an artefact of the discrete convolutional operation.

Fig. 4. Similarity to the auditory system throughout the entire cascade.

(a) Histograms of BMF (filled blue bars) and UCF (hatched orange bars) for temporal (left panels) and rate (right panels) coding in each layer. The layers are sorted vertically from bottom to top. In the 1st and 2nd layers, no units exhibited a definable tBMF or tUCF. In the 3rd and 4th layers, the tBMFs and tUCFs covered a wide range of AM frequencies, the majority of them being low. Ascending from the 5th layer, the tBMFs and tUCFs appeared to decrease. As for rate coding, in the 1st to 4th layers no units exhibited a definable rBMF or rUCF. In the 5th layer a small number of high rBMFs and rUCFs appeared, and ascending from the 5th layer the number of units with definable rBMFs and rUCFs increased. (b) The number of units with a definable BMF (filled blue circles) and UCF (open orange triangles) for temporal (solid lines) and rate (dashed lines) coding. With ascending layers, units with definable rBMFs and rUCFs appeared at higher layers than those with definable tBMFs and tUCFs; in other words, rate coding is performed in higher layers than temporal coding. (c) Distributions of BMF (filled blue areas) and UCF (hatched orange areas) for temporal (left panels) and rate (right panels) coding in each region of the auditory system. Regions are sorted vertically from the peripheral regions (bottom panels) to the central regions (top panels). No distribution of tBMF is reported for AN. The tBMFs and tUCFs gradually decrease from the periphery to the centre. No distributions are reported for rate coding in the peripheral regions, probably because the peripheral regions do not code AM frequency by spike rate. (d) Layer-region pairwise similarity between the AM representation in the DNN layers (horizontal axis) and that in the regions of the auditory system (vertical axis). Pairs of layers and regions with large similarity appear on the diagonal. (e) Layer-region pairwise similarity normalized by the maximum value within each brain region. The diagonal pairs with large similarity are observed more clearly.

Table 1. Major factors for AM representation in different regions.

Schematic illustration of the classification task and the waveform following task. In both tasks the DNN operated on a short sound segment. The sound classification task was to estimate the category of the input sound. The waveform following task was to copy the amplitude value of the last timeframe of the input segment.