Abstract
Electrical neuroimaging in humans identified the speed and spatiotemporal brain mechanism whereby sounds of living and man-made objects are discriminated. Subjects performed an “oddball” target detection task, selectively responding to sounds of either living or man-made objects on alternating blocks; the two sound categories were controlled for their spectrograms and harmonics-to-noise ratios. Analyses were conducted on 64-channel auditory evoked potentials (AEPs) from nontarget trials. Comparing responses to sounds of living versus man-made objects, these analyses tested for modulations in local AEP waveforms, global response strength, and the topography of the electric field at the scalp. In addition, the local autoregressive average distributed linear inverse solution was applied to periods of observed modulations. Just 70 ms after stimulus onset, a common network of brain regions within the auditory “what” processing stream responded more strongly to sounds of man-made versus living objects, with differential activity within the right temporal and left inferior frontal cortices. Over the 155–257 ms period, the duration of activity of a brain network, including bilateral temporal and premotor cortices, differed between categories of sounds. Responses to sounds of living objects peaked ∼12 ms later, and the activity of the brain network active over this period was prolonged relative to that in response to sounds of man-made objects. The earliest task-related effects were observed at ∼100 ms poststimulus onset, placing an upper limit on the speed of cortical auditory object discrimination. These results provide critical temporal constraints on human auditory object recognition and semantic discrimination processes.
- auditory evoked potential
- AEP
- object recognition
- event-related potential
- sound
- what and where pathways
- electrical neuroimaging
- LAURA source estimation
Introduction
Just how fast the human brain can discriminate sounds of different objects remains an unresolved, yet critical issue for understanding auditory functions, including speech and language. Related studies of the speed of visual sensory-cognitive processes indicate that the recognition and categorization of faces and objects can be achieved within ∼150 ms after stimulus onset (Thorpe et al., 1996; Mouchetant-Rostaing et al., 2000; VanRullen and Thorpe, 2001; Michel et al., 2004a; Murray et al., 2004a). These abilities of the visual system are thought to rely on specialized brain networks within a ventral, “what” processing pathway (for review, see Ungerleider and Mishkin, 1982; Malach et al., 2002).
A sound object recognition network within the superior and middle temporal cortices as well as the inferior frontal cortex has similarly been proposed based on evidence from neuropsychology (Engelien et al., 1995; Clarke et al., 2000, 2002), electrophysiology (Romanski et al., 1999; Alain et al., 2001; Tian et al., 2001), and hemodynamic brain imaging (Engelien et al., 1995; Maeder et al., 2001; Arnott et al., 2004; Bergerbest et al., 2004; Binder et al., 2004; Lewis et al., 2004, 2005; Zatorre et al., 2004). Part of this network, in particular the upper bank of the superior temporal sulcus bilaterally, has been shown to be selectively involved in speech/voice recognition (Belin et al., 2000). More recent evidence indicates that functional specialization within this auditory what network might also differentiate categories of sounds of objects, including tools versus animals (Lewis et al., 2005). Such categorical sensitivity, in the case of sounds of tools, appears to involve a distributed network that extends into motor-related cortices of the so-called mirror neuron system (Rizzolatti et al., 2002), which may be related to higher-level recognition and association processes concerning how sounds of tools might have been produced (Johnson-Frey, 2003; Kellenbach et al., 2003).
Despite such evidence concerning the brain regions involved in auditory object processing, there is comparatively sparse evidence regarding their temporal dynamics. Such information is essential for determining when during sound recognition and during which processing steps different brain areas become active (i.e., for differentiating feedforward from feedback as well as sequential from parallel activity) (Schroeder et al., 1998; Michel et al., 2004a). Temporal information is likewise thought to play a major role in language acquisition and proficiency (Tallal, 2004), as well as in the association of signals from different senses (Stein and Meredith, 1993). Determining the speed and locus of auditory object discrimination is critical for the development of accurate models of these processes. The present study therefore applied electrical neuroimaging (Michel et al., 2004b) to the issue of auditory object processing. In particular, we examined the speed with which and likely neurophysiological mechanism by which sounds of living and man-made objects are first differentiated.
Materials and Methods
Subjects.
Nine healthy subjects (six females), 21–34 years of age (mean ± SD = 26.3 ± 4.3) participated. All subjects provided written, informed consent to participate in the study, the procedures of which were approved by the Ethics Committee of the University of Geneva. All subjects were right-handed (Oldfield, 1971). No subject had a history of neurological or psychiatric illnesses, and all reported normal hearing.
Stimuli.
Auditory stimuli were complex, meaningful sounds (16 bit stereo; 22,500 Hz digitization) obtained from an on-line library (http://www.cofc.edu/~marcellm/confrontation%20sound%20naming/zipped.htm) [normative data concerning these stimuli are published in the study by Marcell et al. (2000)]. We used this set of 120 sounds as a database for selecting the sounds of living and man-made objects used in the EEG portion of this study. This was achieved in the following manner.
In a pretest session, a separate group of 18 individuals listened to each sound. In addition to identifying each sound, they gave both a confidence as well as a familiarity rating of their identification using a 1–7 Likert scale. The 20 sounds of living objects and 20 sounds of man-made objects that were most often correctly identified were selected for use in the EEG portion of this study. These sounds were correctly identified on average by 14.7 and 14.9 of the 18 subjects, respectively, with no significant performance difference between these categories (t(38) = 0.25; p = 0.80). The complete stimulus list and performance data can be found in Table 1. Although we did not explicitly control for the emotive aspects of the sounds, we would note that both categories include stimuli with strong emotional associations (e.g., a baby crying, a police siren) (Table 1). For each of these 40 sounds, two additional exemplars were obtained from an on-line sound search engine. All of these 120 sounds were then modified using audio editing software (Adobe Audition 1.0; Adobe Systems, San Jose, CA) so as to be 500 ms in duration. An envelope of 50 ms decay time was applied to the end of the sound file to minimize clicks at sound offset. All sounds were further normalized according to the root mean square of their amplitude. As a final step, sound files from the living and man-made categories were compared for acoustic differences in the following manner.
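Before turning to that comparison, the editing steps just described (trimming to 500 ms, a decay envelope at sound offset, and root-mean-square normalization) can be illustrated with a minimal sketch. The use of the soundfile library, the linear ramp shape, the target RMS value, and the collapse to mono are illustrative assumptions rather than details of the original procedure.

```python
# Sketch of the stimulus editing described above: trim to 500 ms, apply a
# decay envelope over the final 50 ms, and normalize to a common RMS value.
# The soundfile library, the linear ramp, the target RMS, and the collapse
# to mono are illustrative assumptions.
import numpy as np
import soundfile as sf

def preprocess_sound(path, duration_s=0.5, decay_s=0.05, target_rms=0.1):
    data, sr = sf.read(path)                 # waveform and sampling rate
    if data.ndim > 1:
        data = data.mean(axis=1)             # collapse to mono for simplicity
    data = data[:int(duration_s * sr)]       # trim to 500 ms
    n_decay = int(decay_s * sr)
    data[-n_decay:] *= np.linspace(1.0, 0.0, n_decay)  # offset ramp to avoid clicks
    rms = np.sqrt(np.mean(data ** 2))
    return data * (target_rms / rms), sr     # RMS-normalized waveform
```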
Stimulus list and corresponding results from pretest session
A time-frequency analysis was conducted on the spectrogram (Rabiner and Schafer, 1978) of each sound file, using a frequency bin-width of 86 Hz and temporal bin-width of 5.8 ms. A Kolmogorov–Smirnov test was used to assess whether significant differences existed for a given time-frequency pair between sound categories in terms of their distributions of values. Differences in the spectrogram at a given time and frequency reflect differences in the power of the given pitch (frequency) for the corresponding interval (time). The results of this test indicate that significant differences between these sound categories appeared only after ∼125 ms of sound onset, were only present for frequencies above ∼4000 Hz, and were temporally short-lived (see Fig. 1a). Because an additional 15–20 ms is required for signal transmission into human auditory cortex (Liegeois-Chauvel et al., 1994), this analysis thus indicates that differences in brain responses before ∼140–145 ms cannot be explained by differences between the spectrograms of sound categories. Stimuli were likewise analyzed in terms of their mean harmonics-to-noise ratio (HNR), which was calculated using PRAAT software (http://www.fon.hum.uva.nl/praat/). This measure has been presented recently as one method to quantify and compare dynamic acoustic properties of sounds (Lewis et al., 2005). Briefly, the HNR provides an index of the relative periodicity of a sound. The mean (±SEM) HNR for the 60 sounds of living objects was 9.4 ± 0.9 (range, −0.1 to 29.1), and that for the 60 sounds of man-made objects was 11.1 ± 1.2 (range, −3.0 to 33.5). The HNR did not significantly differ between categories (t(59) = 1.15; p > 0.25).
Analysis of stimulus spectrograms. Time-frequency distribution of the results of the Kolmogorov–Smirnov test (p < 0.05; adjusted for the number of frequency bins) comparing spectrograms from living and man-made sound categories. The z-axis indicates 1 minus the p value. Note that effects were only present after ∼125 ms of sound onset.
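The logic of this time-frequency comparison can be sketched as follows: compute one spectrogram per sound with ∼86 Hz frequency bins and ∼5.8 ms time bins, then run a two-sample Kolmogorov–Smirnov test across categories at each time-frequency point. The window length, overlap, sampling rate, and function names below are illustrative assumptions chosen to approximate the stated bin widths, not the exact analysis code used in the study.

```python
# Sketch of the time-frequency comparison: one spectrogram per sound, then a
# two-sample Kolmogorov-Smirnov test across categories at every
# time-frequency point. Window sizes approximate the stated ~86 Hz / ~5.8 ms
# bins; the sampling rate and all names are illustrative assumptions.
import numpy as np
from scipy.signal import spectrogram
from scipy.stats import ks_2samp

def category_spectrograms(waves, sr):
    # waves: list of equal-length 1-D arrays (the 500 ms sound files)
    return np.stack([spectrogram(w, fs=sr, nperseg=256, noverlap=128)[2]
                     for w in waves])        # (n_sounds, n_freq, n_time)

def compare_categories(living_waves, manmade_waves, sr=22050):
    living = category_spectrograms(living_waves, sr)
    manmade = category_spectrograms(manmade_waves, sr)
    n_freq, n_time = living.shape[1:]
    pvals = np.ones((n_freq, n_time))
    for i in range(n_freq):
        for j in range(n_time):
            # Compare the distributions of power values across sounds
            pvals[i, j] = ks_2samp(living[:, i, j], manmade[:, i, j]).pvalue
    return pvals   # e.g., threshold at 0.05 / n_freq (adjusted over frequencies)
```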
As an additional psychophysical control, an additional and separate cohort of 10 healthy subjects (five females), 20–32 years of age (mean ± SD = 27.3 ± 3.5 years; seven right-handed and three left-handed), listened to the final set of 120 sounds. Each sound was presented once via insert earphones (Etymotic model ER-4P; Etymotic Research, Elk Grove Village, IL) at a comfortable volume, and the onset of each sound presentation was controlled by the subjects. Subjects were asked to identify, categorize as living versus man-made, indicate their confidence in their identification (1–7 Likert scale), and indicate their familiarity with the sound (1–7 Likert scale). Sounds of each category were equally well identified (mean ± SD = 93.0 ± 6.6% vs 90.5 ± 5.8% for living versus man-made sounds, respectively; t(9) = 1.26; p = 0.24) and classified (95.7 ± 5.7% vs 97.8 ± 3.5%; t(9) = 1.04; p = 0.33). Likewise, neither the confidence ratings (6.0 ± 0.6 vs 6.0 ± 0.5; t(9) = 0.09; p = 0.93) nor familiarity ratings (5.4 ± 0.9 vs 5.3 ± 1.0; t(9) = 1.74; p = 0.12) significantly differed between categories.
Procedure and task.
For the EEG portion of this study, a living versus man-made oddball paradigm was performed, such that on a given block of trials “target” stimuli to which subjects pressed a response button occurred 10% of the time. The use of sounds of living and man-made objects was counterbalanced across blocks. The remaining 90% of stimuli (“distracters”) consisted of the other (i.e., nontarget) sound category. Stimuli were blocked into series of 300 trials with an interstimulus interval of 3.4 s. Each EEG participant completed four blocks of trials (two in which man-made sounds were targets and two in which living sounds were targets). Both behavioral and EEG data were collected from all conditions throughout the length of the experiment, and STIM (Neuroscan, El Paso, TX) was used to control stimulus delivery and to record behavioral responses. Audiometric quality insert earphones (supplied by Neuroscan) were used for stimulus delivery. As noted below in the Results, no behavioral differences were observed between blocks of trials in which man-made sounds were targets and those in which living sounds were targets, thereby excluding accounts of the present findings in terms of attention differences.
EEG acquisition and preprocessing.
Continuous 64-channel EEG was acquired through Neuroscan Synamps (impedances <5 kΩ), referenced to the nose, bandpass filtered 0.05–200 Hz, and digitized at 1000 Hz. Peristimulus epochs of continuous EEG (−100 to 500 ms) were averaged from each subject separately for each condition to compute auditory evoked potentials (AEPs). For the contrast of living and man-made stimulus categories, only distracter trials (i.e., trials not requiring an overt behavioral response) were included in analyses. The average number of accepted sweeps in response to sounds of living objects was 385 ± 85 (range, 255–515) and for sounds of man-made objects was 373 ± 102 (range, 184–539). These values did not significantly differ (t(8) = 0.904; p > 0.35).
We also conducted an analysis of the EEG data that involved the comparison of responses to the same sounds when they served as targets versus when they served as distracters [for a similar approach with a visual categorization task, see VanRullen and Thorpe (2001)]. Differences between these conditions reveal the time course within which the brain initiates the discrimination of sound categories and, furthermore, cannot be attributed to acoustic differences between stimuli. That is, differences between responses to targets and distracters can only be explained by the capacity of the brain to perform the living/man-made discrimination, because these AEPs are derived from the same sounds. Because distracters outnumbered targets 9:1, it was further important to select the same number of trials to contribute to the distracter AEP as to the target AEP to ensure equivalent signal-to-noise ratios for these AEPs. The number of trials contributing to each subject’s AEP for target stimuli was therefore used as the determinant for the number of trials selected to contribute to each subject’s AEP for distracter stimuli. One of the nine subjects was excluded from this analysis because of poor signal quality. For the remaining eight subjects, the average number of accepted sweeps for each condition in this analysis was 110 ± 3 (range, 97–120). For this contrast, only the time course of differential responses was assessed, because our objective with this analysis was to situate living/man-made differences within the framework of task-related differences.
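As a sketch of this trial-count matching, one might randomly subsample distracter epochs down to the number of target epochs before averaging; the array shapes, the random-selection strategy, and the function name below are illustrative assumptions.

```python
# Sketch of equating signal-to-noise ratios: average as many randomly chosen
# distracter epochs as there are target epochs. Shapes and names are
# illustrative assumptions.
import numpy as np

def matched_averages(target_epochs, distracter_epochs, seed=0):
    # epochs: arrays of shape (n_trials, n_channels, n_timepoints)
    rng = np.random.default_rng(seed)
    n_targets = target_epochs.shape[0]
    idx = rng.choice(distracter_epochs.shape[0], size=n_targets, replace=False)
    return target_epochs.mean(axis=0), distracter_epochs[idx].mean(axis=0)
```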
For all analyses, baseline was defined as the 100 ms prestimulus period. Trials with blinks or eye movements were rejected off-line, using horizontal and vertical electro-oculograms. An artifact criterion of ±100 μV was applied at all other electrodes, and each EEG epoch was also visually evaluated before its inclusion in the AEP. Data from artifact electrodes from each subject and condition were interpolated (Perrin et al., 1987). After this procedure and before group averaging, each subject’s data were 40 Hz low-pass filtered, down-sampled to a common 61-channel montage (see Fig. 2, inset), and recalculated against the average reference.
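A minimal sketch of the epoch-level steps described here (baseline correction against the 100 ms prestimulus period, rejection of epochs exceeding ±100 μV, and recalculation against the average reference) is given below; the array shapes, the ordering of steps, and the omission of interpolation and filtering are simplifying assumptions.

```python
# Sketch of the epoch-level preprocessing: baseline correction over the 100 ms
# prestimulus period, rejection of epochs exceeding ±100 μV, and
# average-reference recalculation. Shapes, sampling rate, and step ordering
# are illustrative assumptions; interpolation and filtering are omitted.
import numpy as np

def preprocess_epochs(epochs, sfreq=1000, baseline_ms=100, reject_uv=100.0):
    # epochs: (n_trials, n_channels, n_timepoints), each epoch starting 100 ms prestimulus
    n_base = int(baseline_ms * sfreq / 1000)
    epochs = epochs - epochs[:, :, :n_base].mean(axis=2, keepdims=True)
    keep = np.abs(epochs).max(axis=(1, 2)) <= reject_uv   # amplitude criterion
    epochs = epochs[keep]
    return epochs - epochs.mean(axis=1, keepdims=True)    # average reference
```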
Electrical neuroimaging results for the contrast of living versus man-made sounds. a, The topographic pattern analyses identified seven stable topographies for both conditions over the 500 ms poststimulus period. The time period when each map was observed is indicated. Over the initial 155 ms poststimulus period, the same series of maps was observed in response to both sounds of living and man-made objects. At the group-average level, different maps were observed for each condition over the 155–257 ms period (framed in green and blue). b, The results of the individual subject fitting procedure revealed that these maps differentially accounted for the responses to sounds of living and man-made objects. c, The time periods when stable topographies were observed served as the basis for the time windows from which area measures were calculated at specific scalp sites and from the global field power. Bar graphs display the mean (±SEM) area from these midline electrodes and the global field power over the 70–119 ms period (*p < 0.01). d, A more precise determination of the time course of differential processing was obtained with a point-wise t test for each electrode and for the GFP. See Results for details.
EEG analyses and source estimation.
In addition to prototypical event-related potential analyses entailing area measures from selected scalp sites, the AEPs from living and man-made conditions were submitted to two independent analyses of the electric field at the scalp. The methods applied here have been described in detail previously (Murray et al., 2004a, b, 2005; Foxe et al., 2005). The first was a topographic pattern (map) analysis, which was additionally used for defining time periods over which the abovementioned area measures were calculated. Maps were compared over time within and between conditions, because topographic changes indicate differences in the configuration of the brain’s active generators (Fender, 1987). This method is independent of the reference electrode (Michel et al., 2004b) and is insensitive to pure amplitude modulations across conditions (topographies of normalized maps are compared). A modified cross-validation criterion determined the number of maps that explained the whole group-averaged data set (Pascual-Marqui et al., 1995). The pattern of maps observed in the group-averaged data was statistically tested by comparing each of these maps with the moment-by-moment scalp topography of individual subjects’ AEPs from each condition. Each time point was labeled according to the map with which it best correlated spatially, yielding a measure of map presence that was in turn submitted to an ANOVA with factors of condition and map (hereafter referred to as “fitting”). This fitting procedure revealed whether a given experimental condition was more often described by one map versus another, and therefore whether different generator configurations better accounted for particular experimental conditions (i.e., whether there is a significant interaction between factors of condition and map).
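The fitting step can be sketched as labeling each time point of a subject’s AEP with the group-level template map it best matches by spatial correlation, and then counting map presence per condition. The array layout and function name below are illustrative assumptions; the cluster analysis that derives the template maps (and the cross-validation criterion) is not reproduced here.

```python
# Sketch of the "fitting" step: label each time point of a subject's AEP with
# the template map it correlates with best spatially, then count map presence.
# Template maps are assumed to come from the group-level cluster analysis,
# which is not reproduced here; shapes and names are illustrative assumptions.
import numpy as np

def fit_template_maps(aep, templates):
    # aep: (n_channels, n_timepoints); templates: (n_maps, n_channels)
    t = templates - templates.mean(axis=1, keepdims=True)
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    a = aep - aep.mean(axis=0, keepdims=True)
    a /= np.linalg.norm(a, axis=0, keepdims=True)
    corr = t @ a                              # spatial correlation, maps x time
    labels = corr.argmax(axis=0)              # best-fitting map per time point
    presence = np.bincount(labels, minlength=templates.shape[0])
    return labels, presence                   # presence enters the condition x map ANOVA
```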
The second analysis used the instantaneous global field power (GFP) for each subject and stimulus condition to identify changes in electric field strength. GFP is equivalent to the spatial standard deviation of the scalp electric field (Lehmann and Skrandies, 1980). The observation of a GFP modulation does not exclude the possibility of a contemporaneous change in the electric field topography, nor of topographic modulations that nonetheless yield statistically indistinguishable GFP values. However, observation of a GFP modulation without simultaneous topographic changes is most parsimoniously interpreted as an amplitude modulation of statistically indistinguishable generators across experimental conditions. The analysis of a global waveform measure of the AEP was performed to minimize the observer bias that can follow from analyses restricted to specific selected electrodes, although we also include such measures to facilitate comparison with other analysis approaches. GFP area measures were calculated (vs the 0 μV baseline) using time periods of stable scalp topography defined as described above and were statistically tested with a paired t test.
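Because GFP is simply the spatial standard deviation of the average-referenced field at each time point, the measure and the corresponding area computation can be sketched as follows; the sampling rate, window bounds, and function names are illustrative assumptions.

```python
# Sketch of GFP as the spatial standard deviation of the average-referenced
# field, plus an area measure over a window of stable topography. Sampling
# rate, window bounds, and names are illustrative assumptions.
import numpy as np

def global_field_power(aep):
    # aep: (n_channels, n_timepoints), average-referenced
    return aep.std(axis=0)                    # one GFP value per time point

def gfp_area(aep, sfreq=1000, window_ms=(70, 119)):
    gfp = global_field_power(aep)
    i0, i1 = (int(ms * sfreq / 1000) for ms in window_ms)
    return gfp[i0:i1 + 1].sum() / sfreq       # area vs the 0 uV baseline (uV*s)
```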
To specify the onset of differential responses, we calculated point-wise paired t tests between AEP responses. For each electrode as well as for the GFP, the first time point in which the t test exceeded the 0.05 α-criterion for at least 11 consecutive data points was labeled as onset of an AEP modulation (for similar approaches, see Guthrie and Buchwald, 1991; Murray et al., 2002, 2004a). In determining the onset of an AEP modulation, we additionally required that at least three electrodes exhibited differential responses at a given latency. The results of the point-wise t tests from the entire electrode montage are displayed as an intensity plot (see Figs. 2d, 3b).
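A sketch of this onset criterion, assuming one subjects × channels × time array per condition, is given below; beyond the stated thresholds (p < 0.05, at least 11 consecutive time points, at least 3 electrodes), all names and shapes are illustrative assumptions.

```python
# Sketch of the onset criterion: point-wise paired t tests per electrode,
# requiring p < 0.05 for at least 11 consecutive time points and at least 3
# electrodes differing at a given latency. Shapes and names are illustrative
# assumptions.
import numpy as np
from scipy.stats import ttest_rel

def differential_onset(cond_a, cond_b, alpha=0.05, min_run=11, min_electrodes=3):
    # cond_a, cond_b: (n_subjects, n_channels, n_timepoints)
    _, p = ttest_rel(cond_a, cond_b, axis=0)          # p: (n_channels, n_timepoints)
    sig = p < alpha
    stable = np.zeros_like(sig)
    for ch in range(sig.shape[0]):
        run = 0
        for t in range(sig.shape[1]):
            run = run + 1 if sig[ch, t] else 0
            if run >= min_run:                        # mark the qualifying run
                stable[ch, t - min_run + 1:t + 1] = True
    onsets = np.flatnonzero(stable.sum(axis=0) >= min_electrodes)
    return onsets[0] if onsets.size else None         # earliest sample index, if any
```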
Finally, we estimated the sources in the brain underlying the AEPs in response to living and man-made sounds, using the local autoregressive average (LAURA) distributed linear inverse solution (Grave de Peralta Menendez et al., 2001, 2004) (for a comparison of inverse solution methods, see Michel et al., 2004b). LAURA selects the source configuration that better mimics the biophysical behavior of electric vector fields (i.e., activity at one point depends on the activity at neighboring points according to electromagnetic laws). The solution space was calculated on a realistic head model that included 4024 nodes, selected from a 6 × 6 × 6 mm grid equally distributed within the gray matter of the Montreal Neurological Institute’s average brain. We emphasize that these estimations provide visualization, rather than a statistical analysis, of the likely underlying sources.
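LAURA’s local autoregressive constraint is not reproduced here, but the general form of a distributed linear inverse can be conveyed with a generic, Tikhonov-regularized minimum-norm sketch that projects a scalp potential map onto a fixed grid of solution points. The lead field matrix, regularization value, and function name are illustrative assumptions and should not be taken as the LAURA algorithm itself.

```python
# Generic distributed linear inverse (Tikhonov-regularized minimum norm),
# shown only to convey how a scalp map is projected onto a grid of solution
# points; this is NOT the LAURA algorithm, whose local autoregressive
# constraint is omitted. Lead field, regularization, and names are
# illustrative assumptions.
import numpy as np

def minimum_norm_inverse(lead_field, scalp_map, lam=1e-2):
    # lead_field: (n_channels, n_sources); scalp_map: (n_channels,)
    L = lead_field
    gram = L @ L.T + lam * np.eye(L.shape[0])        # regularized channel-space Gram matrix
    return L.T @ np.linalg.solve(gram, scalp_map)    # current estimate per solution point
```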
Results
Behavioral results
Subjects accurately performed the task, with no significant difference between sensitivity measures based on signal detection theory (d′) (Green and Swets, 1966) when either sounds of living or man-made objects served as targets (5.1 ± 0.5 vs 5.1 ± 0.6; t(8) = 0.02; p = 0.99). Likewise, mean reaction times for living and man-made targets did not significantly differ (947 ± 194 ms vs 916 ± 148 ms; t(8) = 0.74; p = 0.48). These reaction times are consistent with previous studies of environmental sound recognition in which reaction times on the order of ∼1 s were obtained (Lebrun et al., 1998; Saygin et al., 2003; Bergerbest et al., 2004). Thus, behavioral differences cannot account for any AEP modulations, a notion that is similarly supported by our psychophysical study of an additional 10 subjects (see Materials and Methods). Similarly, time-frequency analysis of the sounds’ spectrograms showed significant living versus man-made differences only after ∼125 ms, and no difference was observed between the mean harmonics-to-noise ratios of the sound categories (Fig. 1), indicating that such features cannot account for AEP effects before ∼145 ms (for details of these analyses, see Materials and Methods).
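For reference, d′ is the difference between the z-transformed hit and false-alarm rates; a minimal sketch follows, with the example counts being purely illustrative (each block comprised 300 trials, 10% of them targets).

```python
# Minimal sketch of d' from signal detection theory: difference of
# z-transformed hit and false-alarm rates. The example counts are purely
# illustrative (each block comprised 300 trials, 10% of them targets).
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# e.g., d_prime(29, 1, 2, 268) for one block of 30 targets and 270 distracters
```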
Electrophysiological results
In terms of AEPs in response to sounds of living and man-made objects, the topographic pattern analysis identified periods of stable electric field configurations at the scalp and determined whether different configurations of brain generators accounted for responses to sounds of living and man-made objects. These analyses additionally provided objective means for defining AEP components, rather than arbitrarily selecting time windows (Michel et al., 2004b). Seven different topographies accounted for the whole data set (i.e., the cumulative 500 ms poststimulus period from both experimental conditions) with a global explained variance of 98.02%. Across the two experimental conditions, identical electric field topographies were observed over the 0–69, 70–119, 120–154, 258–355, and 356–500 ms periods. In contrast, two topographies were identified over the 155–257 ms period (Fig. 2a). The fitting procedure statistically confirmed this observation, yielding an interaction between condition and map over the 155–257 ms time period (F(1, 8) = 5.90; p < 0.05), with neither main effect reaching our 0.05 significance criterion. That is, different maps (and by extension different configurations of intracranial generators) predominated the responses to living sounds and man-made sounds during this time period (Fig. 2b).
These time periods of stable scalp topography were also used to define time windows for the analysis of individual electrodes and GFP waveforms (Fig. 2). Visual inspection of these waveforms revealed a difference between experimental conditions over the period encompassing the peak of the N1 component at frontocentral scalp sites. This evidence of auditory object discrimination was statistically tested using area measures (vs the 0 μV baseline) over the 70–119 ms period from four midline electrodes (AFz, FCz, CPz, and POz). These values from FCz and CPz are shown in the bar graphs of Figure 2 and were submitted to a two-condition (living vs man-made sounds) × 4 electrode repeated-measures ANOVA (reported p values reflect Greenhouse–Geisser correction for nonsphericity when necessary). There was a main effect of condition (F(1, 8) = 9.77; p < 0.015), indicative of the generally larger magnitude of responses to man-made sounds. There was also a main effect of electrode (F(2.3, 18.3) = 15.69; p < 0.001). However, the interaction between these factors did not reach our p < 0.05 significance criterion. As above, this analysis provides no evidence of topographic variation between conditions. Follow-up comparisons indicated that responses significantly differed only at electrode FCz (t(8) = 3.636; p < 0.007). Analysis of the GFP, a global measure of response strength across the entire electrode montage, revealed stronger responses to man-made than living objects over the 70–119 ms period (t(8) = 3.38; p < 0.01). In addition to GFP area, we likewise tested GFP peak latency over the 70–119 ms and 155–257 ms periods. Only the latter period exhibited a significant difference, with man-made sounds having a ∼12 ms earlier peak response (199 vs 211 ms; t(8) = 2.37; p < 0.05). These findings are summarized in Table 2.
Summary of global field power analyses
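The GFP peak-latency measure summarized in Table 2 can be sketched as locating the maximum of the GFP within a window of stable topography for each subject and condition, with the resulting latencies then compared by paired t test; the sampling rate, window bounds, and array layout below are illustrative assumptions.

```python
# Sketch of the GFP peak-latency measure: the time of maximal GFP within a
# window of stable topography, computed per subject and condition and then
# compared with a paired t test (scipy.stats.ttest_rel). Sampling rate,
# window bounds, and the assumption that time 0 falls at index 0 are
# illustrative.
import numpy as np

def gfp_peak_latency_ms(aep, sfreq=1000, window_ms=(155, 257)):
    # aep: (n_channels, n_timepoints), average-referenced, time 0 at index 0
    gfp = aep.std(axis=0)
    i0, i1 = (int(ms * sfreq / 1000) for ms in window_ms)
    return window_ms[0] + gfp[i0:i1 + 1].argmax() * 1000.0 / sfreq
```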
To more precisely identify the timing of differential processing, we likewise calculated point-wise t tests at each electrode as well as for the GFP waveforms (see Materials and Methods). The results of these tests are displayed in Figure 2d. Temporally stable differences between responses to living and man-made categories of sounds began at the level of single electrodes at 62 ms, and in the case of the GFP at 79 ms. In addition, we compared AEPs elicited by the same sounds when they served as distracters versus when they served as targets. A similar approach has been applied in studies of visual categorization to place an upper temporal limit on the speed by which categorical brain processes initiate and also as a control for undetected differences in low-level visual features between stimulus categories (VanRullen and Thorpe, 2001). Temporally stable differential responses to targets and distracters began at the level of single electrodes at 95 ms, and in the case of GFP at 100 ms (Fig. 3). It is important to note that these task-related effects follow the onset of category-related effects but precede differences in the spectrograms of the same sound categories. This timing therefore lends additional support to our contention that categorical effects do not follow from low-level acoustic differences. It is also worth noting that these AEP effects, in particular the task-related effects, substantially precede mean reaction times. One possibility is that our participants emphasized accuracy more than speed and that faster reaction times could be obtained either with extensive training or alternative paradigms. Alternatively, it may be the case that the observed AEP effects represent only an initial phase of processing necessary for accurate sound categorization. Additional investigation on such topics will undoubtedly prove fruitful.
Electrophysiological results for the contrast of target and distracter trials elicited by the same sounds. a, AEP and GFP waveforms show differential responses beginning at ∼100 ms poststimulus onset. b, The timing of such effects was statistically tested with point-wise t tests at each electrode and for the GFP. See Results for details.
To this point, analyses at global and local levels revealed differential activity to distinct categories of environmental sounds at ∼70 ms poststimulus onset. Over the 70–119 ms period, there was no evidence that this modulation followed from a change in the scalp topography. Rather, a common stable scalp topography was observed for both conditions. In contrast, there was a robust GFP modulation, suggestive of a change in the response magnitude of an indistinguishable underlying network of active brain areas. Over the 155–257 ms period, different maps predominated the responses to living and man-made sounds, and there was additionally a GFP peak latency difference over this time period, although no significant modulation of GFP amplitude (Table 2). The mean GFP peak latency difference of ∼12 ms likewise corresponds well with the observed mean difference of 13.8 ms in the frequency of presence of each map in the responses to each condition (Fig. 2b). Together, these results are most parsimoniously interpreted as a prolongation in the activity of a brain network in the case of responses to living sounds relative to man-made sounds.
LAURA-distributed source estimations were calculated over the 70–119 and 155–257 ms periods. To do this, AEPs for each subject and each experimental condition separately were averaged across each of the abovementioned time periods in which stable topographies were identified. Source estimations were then calculated and subsequently averaged across subjects. Figure 4 shows the mean LAURA estimations over the 70–119 ms period. Both conditions exhibited prominent sources within the right middle temporal cortex, right inferior frontal cortex, and bilateral prefrontal cortices. Weaker sources were observed in the right inferior parietal cortex and left inferior frontal cortex. The group-averaged difference in LAURA source estimations for these conditions was also calculated and revealed stronger responses to sounds of man-made objects within the right posterior temporal cortex [maximal difference at 59, −29, 0 mm using the coordinate system of Talairach and Tournoux (1988)], as well as a difference within the left inferior frontal cortex (−43, 35, −2 mm). These maxima correspond to Brodmann’s areas (BA) 21/22 and 47, respectively.
LAURA source estimations over the 70–119 ms period. a and b show group-averaged (n = 9) source estimations for each stimulus condition. c depicts the mean (n = 9) difference of these source estimations.
Figure 5 shows the mean LAURA estimations over the 155–257 ms period. Because different electric fields predominated the AEPs to each condition over different latencies, estimations were calculated for specific segments of this time period. For sounds of living objects, these were 155–211 and 212–257 ms. For sounds of man-made objects, these were 155–199 and 200–257 ms. Both conditions included prominent bilateral sources within the posterior portion of the superior and middle temporal cortices as well as premotor cortices, with weaker activity in the left inferior frontal cortex. Stronger activity was observed within premotor cortices in response to sounds of man-made objects.
LAURA source estimations over the 155–257 ms period. a, Group-averaged (n = 9) source estimations for sounds of living objects over time periods in which different scalp topographies were identified. b, Group-averaged (n = 9) source estimations for sounds of man-made objects over time periods in which different scalp topographies were identified.
Discussion
This is the first demonstration that the differential processing of categories of complex sounds already begins within 70 ms poststimulus onset. The observed effects were not linked to behavioral differences, nor were they attributable to differences in either time-frequency analysis of spectrograms or mean harmonics-to-noise ratios between stimulus categories. These data add critical temporal information to recent hemodynamic imaging investigations of auditory object processing (Arnott et al., 2004; Binder et al., 2004; Lewis et al., 2004, 2005; Zatorre et al., 2004). We additionally show that a common network of brain areas associated with the auditory what pathway initially responds to both categories of sounds, although more strongly to those of man-made objects within regions of the right posterior superior and middle temporal cortices (BA21/22) and left inferior frontal cortex (BA47). Bilateral sources within the posterior portion of the superior and middle temporal cortices as well as premotor cortices were subsequently active in response to both categories of sounds, although with different durations. The initial discrimination of sounds of living and man-made objects occurs rapidly and through differential recruitment of the same bilateral brain network.
Differential processing of sounds of objects occurred at 70 ms via a strength modulation of statistically indistinguishable generator configurations, with no evidence for differences in the topography of the electric field at the scalp (and by extension, the configuration of active brain areas). In addition, the earliest differential activity was localized to the posterior superior and middle temporal cortices of the right hemisphere (BA21/22). According to the model proposed by Kaas and Hackett (2000), these cortices are approximately two to three synapses from core auditory regions and likely represent intermediary hierarchical levels in auditory processing (see below for temporal considerations). Several lines of evidence indicate that such regions within the right hemisphere play a critical role in nonlinguistic auditory object processing, and in particular the fine discrimination of pitch (for review, see Tervaniemi and Hugdahl, 2003). Auditory agnosia has been reported by several groups following right hemispheric damage (Vignolo, 1982, 2003; Fujii et al., 1990; Schnider et al., 1994; Clarke et al., 1996, 2002). Right-lateralized activations of this region (and/or others in its vicinity) have been reported in healthy individuals during the discrimination of environmental sounds, musical features and sounds of musical instruments, and voices using hemodynamic (Hugdahl et al., 1999; Belin et al., 2000, 2004; Bergerbest et al., 2004; Zatorre et al., 2004) and electromagnetic techniques (Tervaniemi et al., 2000; Tervaniemi and Hugdahl, 2003). Other studies suggest that auditory object processing involves a more bilateral, distributed network of brain regions (Maeder et al., 2001; Lewis et al., 2004, 2005; Lattner et al., 2005). However, even when bilateral activity is observed in these studies, the possibility that effects occur first within right hemisphere regions cannot be ruled out on account of the poor temporal resolution of the hemodynamic response. Indeed, the present findings provide one measure of support for the notion that differential processing of categories of sounds initiates predominantly within structures of the right hemisphere and is followed shortly thereafter by left-hemispheric and bilateral activity (compare Figs. 4 and 5). The present findings thus support the general conclusion that early activity in auditory cortices within the superior and middle temporal gyri of the right hemisphere may be functionally specialized for the categorization of nonlinguistic stimuli.
It is also important to briefly consider these results in terms of the evidence suggesting that categories of visual and auditory objects, including words, rely on distinct and widely distributed brain networks (Caramazza and Mahon, 2003; Noppeney et al., 2006). One proposition is that these networks vary in a domain-specific manner according to both perceptual attributes as well as the higher-order associations formed over time and experience. To date, the categorical processing of sounds of objects has only been investigated recently with brain imaging methods. Using functional magnetic resonance imaging, Lewis et al. (2005) found that responses to sounds of man-made objects (specifically tools) versus living objects (animals) significantly differed within a network of areas that included inferior frontal, premotor, parietal, and posterior middle temporal cortices (predominantly within the left hemisphere). The contrast of animals versus tools yielded stronger responses within the middle superior temporal gyrus (bilaterally). The differential network active in the case of tools was taken as evidence that the successful recognition of sounds of tools might rely on determining how such sounds were produced, perhaps including mental imagery of actions, spatial representations for actions, and visualization of the visual form of the tools themselves (for discussion, see Lewis et al., 2005). That is, representations of tools might include richer multisensory and action-related associations than their animal counterparts (although this will require additional experimental data). Although the specific networks involved in categorical processing were not the primary focus of this study (particularly given the relative limitations of localizing scalp-recorded data), the temporal dimension of our data set and source estimation thereof do provide a suggestion as to when those brain networks considered involved in associating incoming sensory information with more abstract representations can aid sound discrimination.
It is therefore important to situate the timing of the present effects within the current understanding of the time course of sensory transmission within the auditory system. Responses from primary auditory cortex have been recorded intracranially in macaque monkeys with onsets at ∼10–12 ms (Steinschneider et al., 1992; Lakatos et al., 2005) and in humans with onsets at ∼15–20 ms poststimulus (Liegeois-Chauvel et al., 1994; Howard et al., 2000; Godey et al., 2001; Brugge et al., 2003). This interspecies timing difference is in keeping with a general “3:5 rule” that describes the correspondence between responses in monkeys and humans (Schroeder et al., 1998). Recent evoked magnetic field recordings from humans listening to monaural clicks further indicate that response propagation within the initial ∼50 ms poststimulus includes regions of the anterolateral part of Heschl’s gyrus, the posterior parietal cortex, and posterior and anterior portions of the superior temporal gyrus, as well as the planum temporale (Inui et al., 2005). These latencies are in keeping with predictions based on anatomical studies in humans (Rivier and Clarke, 1997; Tardif and Clarke, 2001) as well as nonhuman primates (Romanski et al., 1999; Kaas and Hackett, 2000), which place regions of the superior and middle temporal cortex at approximately two to three synapses from primary cortices. Additional evidence indicates that the speed of sensory transmission within the auditory system of macaque monkeys may depend on stimulus complexity. Lakatos et al. (2005) have shown that broadband noises versus pure tones can lead to significantly earlier responses within belt regions and, moreover, that responses to such noises in belt regions were significantly earlier than responses to the same stimuli within core cortex. This latter finding suggests that responses to complex sounds might have a processing advantage over pure tones (in terms of speed) and perhaps also engage distinct parallel pathways. Although this issue awaits additional experimentation, such data emphasize the rapidity with which responses to auditory stimuli arrive and propagate in cortical structures, as well as the importance of establishing temporal response profiles for specific classes of stimuli (e.g., complex versus simple sounds). In light of such information, the widespread network observed in the present study 70 ms poststimulus onset is well within physiological plausibility.
The present data demonstrate that the speed of auditory object processing is within the same timeframe as that of the visual modality (Thorpe et al., 1996; Mouchetant-Rostaing et al., 2000; Michel et al., 2004a). As in the case of visual object processing, access to semantic attributes of auditory objects thus occurs rapidly and via distributed activation of higher-level cortical regions. This timeframe carries implications for our understanding of multisensory integration of meaningful stimuli, which may be mechanistically distinct from integration of more rudimentary stimulus features (Laurienti et al., 2005; Lehmann and Murray, 2005). Recent evidence in the macaque has shown that multisensory integration of specific face and voice signals peaks at ∼85–95 ms within core and lateral belt cortices (Ghazanfar et al., 2005). The selectivity of these integration effects suggests that categorization of voices occurred within this latency. However, the temporal dynamics of vocalization discrimination was not specifically assessed in this study or in others in which microelectrode recordings were made along the rostral and caudal portions of belt cortex in response to a variety of monkey calls at different azimuthal locations (Tian et al., 2001). Similar findings of multisensory integration, albeit with corresponding delays, have been observed in human subjects in response to videos and sounds of syllabic vocalizations (Raij et al., 2000) as well as in response to images and animal vocalizations (Molholm et al., 2004). The latency of our effects, considered alongside the multisensory effects observed in these studies, suggests that unisensory and multisensory object processes might proceed in parallel, rather than serially.
In conclusion, sensory-cognitive processing within the auditory system, like its visual and somatosensory counterparts, is substantially faster than traditionally believed. Here, we extend this notion to reveal the timing and likely neurophysiological mechanism by which the initial categorical discrimination of complex sounds can occur.
Footnotes
- This work was supported by Swiss National Science Foundation Grants 3100A0-103895/1 and 3200BO-105680/1. We thank Denis Brunet for the development of Cartool ERP analysis software (http://brainmapping.unige.ch), Rolando Grave de Peralta Menendez for the development of the LAURA inverse solution, Raphaël Meylan for technical assistance, and Christoph Michel and Charles Schroeder for advice and comments on this manuscript.
- *M.M.M. and C.C. contributed equally to this work.
- Correspondence should be addressed to either Micah M. Murray or Stephanie Clarke, Neuropsychology Division, Hôpital Nestlé, Centre Hospitalier Universitaire Vaudois, Avenue Pierre Decker 5, 1011 Lausanne, Switzerland. Email: micah.murray@chuv.ch or stephanie.clarke@chuv.ch