Abstract
Human listeners can effortlessly categorize a wide range of environmental sounds. Whereas categorizing visual object classes (e.g., faces, tools, houses, etc.) preferentially activates different regions of visually sensitive cortex, it is not known whether the auditory system exhibits a similar organization for different types or categories of complex sounds outside of human speech. Using functional magnetic resonance imaging, we show that hearing and correctly or incorrectly categorizing animal vocalizations (as opposed to hand-manipulated tool sounds) preferentially activated middle portions of the left and right superior temporal gyri (mSTG). On average, the vocalization sounds had much greater harmonic and phase-coupling content (acoustically similar to human speech sounds), which may represent some of the signal attributes that preferentially activate the mSTG regions. In contrast, correctly categorized tool sounds (and even animal sounds that were miscategorized as being tool-related sounds) preferentially activated a widespread, predominantly left hemisphere cortical “mirror network.” This network directly overlapped substantial portions of motor-related cortices that were independently activated when participants pantomimed tool manipulations with their right (dominant) hand. These data suggest that the recognition processing for some sounds involves a causal reasoning mechanism (a high-level auditory “how” pathway), automatically evoked when attending to hand-manipulated tool sounds, that effectively associates the dynamic motor actions likely to have produced the sound(s).
Introduction
At an early age, we begin to learn to categorize different objects and their associated sounds into different semantic or conceptual categories. The idea that constructive processes operate to create conceptual categories in our minds has been around since the days of the early Greek (Medin and Coley, 1998) and Buddhist (Lusthaus, 2002) philosophers. In general, our ability to recognize different types of environmental (nonverbal) sounds is thought to be accomplished, in part, by abstracting the physical properties of a sound and matching them to the known characteristics of a given sound category (Komatsu, 1992; Medin and Coley, 1998). For instance, animal vocalizations often contain frequencies and harmonics that covary in time, which may serve as signal attributes or compound cues for recognition (Nelken et al., 1999; Reide et al., 2001; Ehret and Riecke, 2002). Although less well studied, some sounds produced by tools may share common acoustical features, such as a metallic twang or high-pitched ringing sound, which could conceivably serve as “low-level” signal attributes that aid in our ability to recognize them as tools. Of course, the degree to which a person can specifically identify a given sound (such as recognizing the distinctive clicking sounds of different sizes of ratchet wrench) depends primarily on the listener's experience or expertise. Thus, cognitive and other “high-level” associations are also likely to play a key role in the process of sound recognition (Kéri, 2003; Thompson-Schill, 2003).
More recently, human lesion and neuroimaging studies have indicated that the retrieval of words and other aspects of conceptual knowledge pertaining to distinct object categories are represented, at least in part, along separate brain regions (De Renzi and Lucchelli, 1988; Gainotti et al., 1995; Damasio et al., 1996; Martin et al., 1996; Chao et al., 1999; Moore and Price, 1999; Chao and Martin, 2000; Martin, 2001; Grossman et al., 2002; Ilmberger et al., 2002). Using visual or verbal input, the conceptual categories most consistently revealing segregated cortical pathways or networks include “man-made items” versus “living things.” A few studies have examined brain responses to different categories of sound, suggesting, for instance, that human speech sounds are processed differently from other sound categories, such as musical instruments, tones, environmental sounds, and animal cries (Belin et al., 2000; Binder et al., 2000; Levy et al., 2001; Fecteau et al., 2004). However, to our knowledge, no studies have systematically contrasted a wide array of sounds representing the conceptual categories of man-made items (tools in use) versus living things (animal sounds). Thus, using functional magnetic resonance imaging (fMRI), we sought to determine whether hearing and categorizing “tool” versus “animal” sounds would selectively activate different cortical pathways, thereby revealing a high-level architectural and functional organization of the human auditory system for sound recognition (Lewis and DeYoe, 2003).
Materials and Methods
Participants and main paradigm. We tested 20 right-handed adult participants (aged 21-52 years; 10 women) with no history of neurologic, psychiatric, or audiologic symptoms. Informed consent was obtained following guidelines approved by the Medical College of Wisconsin Human Research Review Committee. Participants (with eyes closed) were presented with 306 stimulus trials (108 animal sounds, 108 tool sounds, and 90 silent trials, over six separate fMRI scans) in a random order. They were informed that they would hear a large selection of animal and tool sounds, together with silent periods. Their task was to listen to the sounds and silently judge whether the likely sound source was a tool or an animal. Participants were explicitly instructed not to make any overt motor or vocal responses while being scanned.
Tool- and animal-related sound samples (44.1 kHz, 16-bit, stereo) were compiled from compact disk collections of professionally recorded sounds (Sound Ideas, Richmond Hill, Ontario, Canada). Sounds were typically recorded using stereo microphones, which contained two directional monaural microphones with 90° to 120° separation. The tool-related sounds were all from items typically manipulated by one hand (excluding motorized power tools; see Appendix, Table A), and the animal sounds were predominantly vocalizations (Appendix, Table B). The subsets of tool versus animal sounds were carefully matched overall (Cool Edit Pro; Syntrillium Software, Phoenix, AZ) for average root mean squared (RMS) power (loudness) and duration (1.9-2.1 s), and both stimulus sets contained a wide range of frequencies, tempos, and cadences, including sounds that were nearly continuous and those containing 1-14 distinct epochs (e.g., a dog producing four asynchronous barks). An envelope of 20 ms rise and fall times was applied to all sounds to minimize clicks at sound onset and offset. Human speech sounds (conversations, calls, exclamations, etc.) were also compiled as described above (Sound Ideas) but were only used for the purpose of comparing acoustic signal properties, as addressed below.
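The loudness matching and onset/offset enveloping described above can be sketched as follows (a minimal Python/NumPy sketch; the target RMS value and the linear ramp shape are illustrative assumptions, not the settings used in Cool Edit Pro):

```python
import numpy as np

def match_rms_and_ramp(x, fs, target_rms=0.05, ramp_ms=20.0):
    """Scale a waveform to a target RMS level and apply onset/offset ramps.

    A sketch of the stimulus matching described above; the target RMS value
    and the linear ramp shape are illustrative choices, not the original
    editing settings.
    """
    x = np.asarray(x, dtype=float)
    x = x * (target_rms / np.sqrt(np.mean(x ** 2)))   # RMS ("loudness") match
    n_ramp = int(ramp_ms * 1e-3 * fs)                 # 20 ms rise/fall
    env = np.ones(x.size)
    env[:n_ramp] = np.linspace(0.0, 1.0, n_ramp)
    env[-n_ramp:] = np.linspace(1.0, 0.0, n_ramp)
    return x * env
```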
Imaging and data analysis. Scanning was conducted at 1.5 T on a General Electric (GE Medical Systems, Milwaukee, WI) Signa scanner, equipped with a commercial head coil (Medical Advances, Milwaukee, WI) suited for whole-head, echo planar imaging of blood-oxygenation level-dependent (BOLD) signals (Bandettini et al., 1993). We used a “silent” clustered-acquisition fMRI design that allowed sound stimuli (typically 70-85 dB, L-weighted through ear plugs) to be presented during scanner silence, over electrostatic headphones (Koss Inc, Milwaukee, WI). The scanning cycle has been described previously (Lewis et al., 2004). Briefly, every 10 s a sound (or silent event) was presented, and BOLD signals were collected from axial brain slices beginning 7.5 s after stimulus onset (echo time, 40 ms; repetition time, 1.8 s; 52 gradient-recalled image volumes). Image volumes included 16-18 axial slices of 6 mm thickness, with in-plane voxel dimensions of 3.75 × 3.75 mm. T1-weighted anatomical MR images were collected using a spoiled GRASS (gradient-recalled acquisition in steady state) pulse sequence (1.1 mm slices, with 0.9375 × 0.9375 mm in-plane resolution). Immediately after the scanning session, participants heard all of the sounds again and indicated whether they judged each sound (when initially heard in the scanner) as either (1) an animal or (2) a tool, or if they were (3) uncertain as to the category of the sound source.
Data were viewed and analyzed using AFNI and related software plugins (http://afni.nimh.nih.gov/) (Cox, 1996). For each participant, the six scans were concatenated into one time series. Brain volume images were motion corrected for global head translations and rotations by reregistering them to the 20th brain volume of the last scan (closest to the time of anatomical image acquisition). Multiple linear regression analyses compared the BOLD responses to tool versus animal sounds, both relative to silence. Only correctly categorized sounds were retained for the main analysis (see Fig. 1): BOLD responses to sounds that a given participant later miscategorized or judged as uncertain were censored (excluded) from that individual's deconvolution model (see parentheses in Appendix, Tables A, B). Additionally, after data collection, 14 animal sounds that were not clearly vocalizations and 14 tool sounds that were not strongly associated with use by the dominant hand were censored from all of the data analyses.
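For illustration, the logic of censoring miscategorized or uncertain trials from a per-voxel regression can be sketched as follows (a minimal Python/NumPy sketch, not the AFNI analysis itself; all array names and values are placeholders):

```python
import numpy as np

# Minimal sketch (not the AFNI pipeline): per-voxel multiple regression with
# censoring of trials that were miscategorized or judged "uncertain".
# All arrays below are placeholders for illustration.
n_trials = 306                        # 108 tool + 108 animal + 90 silent trials
bold = np.random.randn(n_trials)      # one value per clustered-acquisition trial
trial_type = np.random.choice(["tool", "animal", "silent"], size=n_trials)
correct = np.random.rand(n_trials) > 0.08   # placeholder post-scan judgments

# Design matrix: baseline column plus indicator regressors for correctly
# categorized tool and animal trials (silent trials form the baseline).
X = np.column_stack([
    np.ones(n_trials),
    (trial_type == "tool").astype(float),
    (trial_type == "animal").astype(float),
])

# Censor (drop) sound trials that were miscategorized or uncertain.
keep = (trial_type == "silent") | correct
beta, *_ = np.linalg.lstsq(X[keep], bold[keep], rcond=None)
tool_vs_animal = beta[1] - beta[2]    # differential response of interest
```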
Individual anatomical and functional brain maps were transformed into the standardized Talairach coordinate space (Talairach and Tournoux, 1988). Functional data (multiple regression coefficients) were spatially low-pass filtered (5 mm box filter) and then merged by combining coefficient values for each interpolated voxel across all participants. The combination of individual voxel probability threshold (t test; p < 0.005) and the cluster size threshold (8 voxel minimum) yielded the equivalent of a whole-brain corrected significance level of α < 0.05. A split-half correlation test yielded a voxel-by-voxel correlation between two subgroups (matched for gender and age) of 0.79, resulting in a Spearman-Brown estimated reliability coefficient of ρXY = 0.88 for the entire sample of 20 participants (Binder et al., 1997; Lewis et al., 2004). This indicates the level of correlation that would be expected between the activation pattern of our sample of 20 participants and activation patterns from other random samples of 20 participants matched for gender, age, and handedness. Public domain software packages SureFit and Caret (http://brainmap.wustl.edu) were used to project data onto the Colin Atlas brain (AFNI-tlrc) and to display the data (Van Essen et al., 2001; Van Essen, 2003). Portions of these data can be viewed at http://sumsdb.wustl.edu:8081/sums/directory.do?dirid=707082, which contains a database of surface-related data from other brain mapping studies.
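The Spearman-Brown estimate above follows from the standard prophecy formula for doubled sample size (shown here as a worked check, not part of the original analysis): ρXY = 2r / (1 + r) = 2(0.79) / (1 + 0.79) ≈ 0.88.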
Virtual tool manipulation task. Twelve of the 20 participants (six female) also performed a “virtual” tool manipulation task. They alternated (five cycles, 20 s blocks, eyes closed) between resting their hand versus making hand and arm movements (distal to the elbow) as if manipulating a variety of different tools, although no tools were actually being held. This included three to five different virtual tools from the list in Appendix (Table A), typically including hammering, sawing, scissoring, sanding, and ratcheting. However, other pantomimes were included for some participants after the first scan if deemed necessary to minimize incidental head movements, which were closely monitored in real time. Blocks were cued by different tone pips, and no other sound stimuli were presented. Using methods described previously (Lewis et al., 2000), three to four scan repetitions were averaged, and the time series data were then cross-correlated with “ideal” sinusoid reference waveforms. Response magnitudes were calculated as the amplitude of the best-fit reference waveform. Individual maps were transformed into Talairach space coordinates, spatially low-pass filtered (5 mm box filter), and then merged across participants (t test; p < 0.005; whole-brain corrected significance level of α < 0.05).
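As an illustration of this block-design analysis, the following minimal Python/NumPy sketch correlates a single voxel time series with ideal sinusoids at the task frequency and takes the amplitude of the best-fitting (phase-shifted) reference waveform; the 2 s repetition time assumed here is an illustrative value, whereas the 40 s cycle (five cycles of 20 s rest and 20 s manipulation) follows the task description above:

```python
import numpy as np

# Minimal sketch of the block-design analysis: correlate a voxel time series
# with ideal sinusoids at the task frequency and take the amplitude of the
# best-fitting (phase-shifted) reference waveform.  The 2 s repetition time
# is an assumed value for illustration.
tr, cycle_s, n_cycles = 2.0, 40.0, 5
n_vols = int(n_cycles * cycle_s / tr)
t = np.arange(n_vols) * tr
f_task = 1.0 / cycle_s

ts = np.random.randn(n_vols)                  # placeholder voxel time series
ts = ts - ts.mean()

# Project onto sine and cosine at the task frequency (equivalent to finding
# the best-fitting phase of an ideal sinusoidal reference waveform).
s = np.sin(2 * np.pi * f_task * t)
c = np.cos(2 * np.pi * f_task * t)
a = 2.0 * np.dot(ts, s) / n_vols
b = 2.0 * np.dot(ts, c) / n_vols
amplitude = np.hypot(a, b)                    # response magnitude
best_fit = a * s + b * c
r = np.corrcoef(ts, best_fit)[0, 1]           # cross-correlation with reference
```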
Spectral and temporal analyses of sounds. Spectrographs, amplitude plots, individual power plots, and harmonics-to-noise-ratio (HNR) analyses of the tool, animal, and speech sounds were generated using freely available phonetic software (Praat; http://www.fon.hum.uva.nl/praat/). The HNR algorithm (below) determined the degree of periodicity within a sound signal, x(t), based on finding a maximum autocorrelation, r′x(τmax), of the signal at a time lag (τ) greater than zero (Boersma, 1993):
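HNR (in dB) = 10 · log10 [ r′x(τmax) / (1 - r′x(τmax)) ]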
This measure quantified the acoustic energy of the harmonics that were present within a sound over time, r′x(τmax), relative to that of the remaining “noise,” 1 - r′x(τmax), which represents nonharmonic, irregular, or chaotic acoustic energy. As extreme examples, a 2 s sample of white noise yielded an HNR value of -7.6 dB (using standard parameters: 10 ms time step; 75 Hz minimum pitch; 1 period per window), whereas a sample consisting of 2 and 4 kHz sine-wave tones produced an HNR value of 65.4 dB.
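A simplified illustration of this autocorrelation-based measure is sketched below in Python/NumPy (it approximates, but will not reproduce, the Praat values reported above, because Praat's implementation includes additional windowing and normalization steps):

```python
import numpy as np

def hnr_db(x, fs, min_pitch=75.0):
    """Simplified harmonics-to-noise ratio (dB) of a sound segment.

    Finds the maximum of the normalized autocorrelation at a nonzero lag
    (restricted to lags consistent with pitches >= min_pitch) and converts
    it with HNR = 10*log10(r / (1 - r)).  This is an approximation of the
    Boersma (1993) method, not Praat's full implementation.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    spec = np.fft.rfft(x, 2 * n)                       # zero-padded FFT
    ac = np.fft.irfft(spec * np.conj(spec))[:n]        # autocorrelation
    ac = ac / ac[0]
    max_lag = int(fs / min_pitch)
    r = np.clip(ac[2:max_lag].max(), 1e-6, 1 - 1e-6)
    return 10.0 * np.log10(r / (1.0 - r))

fs = 44100
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
print(hnr_db(rng.standard_normal(t.size), fs))     # white noise: negative HNR
print(hnr_db(np.sin(2 * np.pi * 2000 * t)
             + np.sin(2 * np.pi * 4000 * t), fs))  # harmonic tones: large HNR
```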
For the averaged power spectra plots, we analyzed the power of the frequency spectrum from 0 to 22 kHz (in 1 Hz increments), which was estimated using the Thomson multi-taper method with a sampling frequency of 44.1 kHz (Matlab; MathWorks, Natick, MA). The power spectrum of each sound was normalized to the total power of the respective sound. The normalized power spectra were summed (94 per sound category), and each frequency was presented as a proportion of the total power.
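The following Python/SciPy sketch illustrates the idea of a multitaper power estimate normalized to total power (the original analysis used Matlab; the time-bandwidth product, taper count, and function names here are illustrative assumptions):

```python
import numpy as np
from scipy.signal import windows

def multitaper_psd(x, fs, nw=4):
    """Thomson multitaper power spectrum, normalized to total power.

    The time-bandwidth product (nw) and taper count are illustrative choices;
    the original analysis used Matlab's multitaper estimator.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    tapers = windows.dpss(n, NW=nw, Kmax=2 * nw - 1)    # Slepian tapers
    spectra = np.abs(np.fft.rfft(tapers * x, axis=-1)) ** 2
    psd = spectra.mean(axis=0)                          # average over tapers
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd / psd.sum()                       # proportion of total power

def category_profile(sounds):
    """Average the normalized spectra within a category (e.g., 94 sounds).

    `sounds` is a hypothetical list of (waveform, fs) pairs of equal length.
    """
    return np.mean([multitaper_psd(x, fs)[1] for x, fs in sounds], axis=0)
```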
The bicoherence analysis (a form of bispectral analysis; Matlab) quantified the degree of phase coupling between all frequency pairs (f1 and f2, at 1 Hz increments), examining linear and nonlinear second-order interactions of acoustic components (Sigl and Chamoun, 1994). In the bispectral equation below, Xi(f1) represents the complex spectral value (Fourier transform) at f1, which contains a real and an imaginary component, and Xi*(f1 + f2) represents the complex conjugate of the spectral value at the sum frequency:
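B(f1, f2) = | Σi Xi(f1) · Xi(f2) · Xi*(f1 + f2) |,

where Σi denotes summation over successive segments i of the signal (following the form given by Sigl and Chamoun, 1994).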
The bispectrum was normalized by the real triple product to create a bicoherence measure, thereby yielding a measure of phase coupling that is independent of amplitude. Larger values reflected related, as opposed to random, associations between frequency pairs over time. The bicoherence analysis yielded two-dimensional (2-D) renderings with a fourfold symmetry about the cardinal axes (data not shown), which were averaged across all 94 sounds within each category. To aid in the visualization of differences in the 2-D plots, the real component (the first octant triangle) was collapsed across one frequency dimension (f2) to create a one-dimensional bicoherence “profile” for each category of sound.
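A minimal Python/NumPy sketch of such a segment-based bicoherence estimate is given below (the segment length, Hann window, and frequency range are illustrative assumptions, not the parameters of the original Matlab analysis):

```python
import numpy as np

def bicoherence(x, fs, seg_len=1024, f_max=4000.0):
    """Bicoherence estimate from successive signal segments.

    A sketch following the normalization referenced above (bispectrum divided
    by the square root of the real triple product); the segment length,
    Hann window, and frequency range are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    n_seg = x.size // seg_len
    segs = x[: n_seg * seg_len].reshape(n_seg, seg_len)
    segs = segs - segs.mean(axis=1, keepdims=True)
    X = np.fft.rfft(segs * np.hanning(seg_len), axis=1)     # per-segment spectra

    k_max = int(f_max * seg_len / fs)                       # highest bin analyzed
    B = np.zeros((k_max, k_max))
    for k1 in range(1, k_max):
        for k2 in range(1, k_max):
            triple = X[:, k1] * X[:, k2] * np.conj(X[:, k1 + k2])
            num = np.abs(triple.sum())                      # bispectrum
            den = np.sqrt(np.sum(np.abs(triple) ** 2))      # sqrt(real triple product)
            B[k1, k2] = num / den if den > 0 else 0.0
    freqs = np.arange(k_max) * fs / seg_len
    return freqs, B

# One-dimensional "profile" as in Figure 4f: collapse across one frequency axis.
# freqs, B = bicoherence(left_channel, 44100); profile = B.mean(axis=1)
```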
Results
Tool versus animal sound paradigm
We scanned 20 right-handed participants while they listened to a random presentation of tool sounds (Appendix, Table A; hand-manipulated tools in use), animal sounds (Appendix, Table B; predominantly vocalizations), and silent events. Participants were instructed to categorize each stimulus as either a tool or animal sound silently in their heads. Participants were explicitly instructed not to produce any motor responses (e.g., button presses, subvocal naming of sounds) during the scan session so as to avoid confounding any activation attributable to sound processing with that attributable to motor output responses. We examined focal brain activation (changes in fMRI signals) to correctly and incorrectly categorized sound stimuli based on response data collected immediately after the scanning session (on average, 92% correct, 4% miscategorized, and 4% uncertain).
The group-averaged activation attributable to hearing and correctly categorizing both tool and animal sounds relative to silence was widespread in both hemispheres (Fig. 1a, yellow to orange). As expected, the strongest activation (yellow; p < 0.00005 threshold) included primary auditory cortex (overlapping Heschl's gyrus) plus other auditory regions along the planum temporale and superior temporal plane (collectively termed “PAC+”), located within the lateral sulci (LaS) (Wessinger et al., 2001). Moderate activation (orange-yellow, p < 0.0001; and orange, p < 0.001; α < 0.05) was present along portions of the superior temporal gyri (STG) and superior temporal sulci (STS), the left inferior frontal region (precentral sulcus), the superior frontal gyri on the medial wall (not visible in lateral views), and in subcortical structures, including bilateral thalamic and caudate nuclei (data not shown). Hearing and correctly categorizing tool sounds (Fig. 1b, red) and animal sounds (Fig. 1c, blue), each relative to silence, revealed roughly similar overall activation patterns, although the tool sounds activated a more expansive network.
In contrast, some regions of cortex (Fig. 1a-c, light green) showed either a depression below baseline in response to the sound stimuli or a relatively greater activation during the silent periods, which may in part be attributable to task-unrelated effects, such as semantic monitoring, daydreaming, or other thought processes during silent trials (Binder et al., 1999; Calvert and Lewis, 2004). These regions included much of the dorsal occipital cortex and portions of dorsal frontal cortex in both hemispheres.
To reveal brain regions preferentially involved in processing tool-related sounds relative to animal sounds, and vice versa, we effectively subtracted (via multiple linear regression analysis on an individual basis) the activation pattern for the animal sounds versus silence (Fig. 1c) from that for the tool sounds versus silence (Fig. 1b), both at zero threshold. The resulting pattern of activation is illustrated on lateral views (Fig. 1d) and the corresponding flat-map models (Fig. 1e) of the Colin Atlas brain and on select axial slices (Fig. 2) from one of the participants. In these maps, red depicts cortex that was preferentially activated by the tool sounds and blue depicts regions preferentially activated by the animal sounds. No significant differential activity to tool versus animal sounds was observed in subcortical brain regions, and no significant differences were observed across gender (α < 0.05 in a two-sample t test for means).
The main finding from the tool versus animal sound comparison (Figs. 1d,e, 2) was that several cortical regions were differentially activated by the two different categories of sound. Animal sounds evoked significantly stronger activity (blue) in both the left and right hemisphere (bilaterally) along middle portions of the STG (mSTG). Importantly, this bias was present in all 20 individuals. In contrast, tool sounds evoked activity (red) mostly in the left hemisphere. This included nine major cortical foci (Fig. 1): (1) the middle portion of the left inferior frontal sulcus (mIFS); (2) the left ventral premotor cortex (VPMC); (3) the left inferior postcentral plus temporo-parietal junction and parietal opercular cortex (collectively referred to as IPoCeC); (4, 5) posterior portions of the left and right lateral sulci (pLaS); (6) the left anterior intraparietal regions, overlapping previously defined area “AIP” (Binkofski et al., 1999; Buccino et al., 2001; Grèzes et al., 2003); (7) portions of the left posterior parietal cortex; and (8, 9) the left and right posterior middle temporal gyri (pMTG) (including posterior portions of the STS).
Note that, in our previous study using 15 of the 20 same participants, the bilateral mSTG foci (blue) for animal sounds were found to be relatively insensitive to the “perceived recognition” of natural sounds per se: these foci were comparably activated by both recognizable environmental sounds (a more diverse range of sound categories) as well as by the corresponding backward-played versions of the same sounds, which were judged as unrecognizable (Lewis et al., 2004). In contrast, the bilateral pMTG foci (red) plus portions of the left IPoCeC, left VPMC and mIFS foci preferential for tool sounds directly overlapped cortex implicated in environmental sound “recognition” in our previous study.
Region of interest analysis and miscategorized sounds
To further assess the preferential activations attributable to tool versus animal sounds and their representation as distinct “conceptual” categories, we charted the averaged BOLD response magnitudes (Fig. 2, charts) within 11 cortical regions-of-interest (ROIs) from Figure 1, d and e. The first column depicts responses to correctly categorized tool sounds (“T,” red) relative to silence, and the second column depicts the responses to correctly categorized animal sounds (“A,” dark blue) relative to silence. Four of the ROIs showed increased responses to tool sounds but showed decreased responses to animal sounds, including the left and right pMTG, left posterior parietal, and left AIP cortex. These response characteristics may have been indicative of processing that was selective for tools as opposed to animal vocalizations, although the interpretation of these negative BOLD signals remains unclear. The other seven ROIs showed positive responses to both tool and animal sounds relative to silence but were significantly preferential for one or the other sound category. This included the bilateral mSTG foci for animal sounds and the bilateral pLaS, left mIFS, left VPMC, and left IPoCeC for tool sounds. Together, these ROI data demonstrated that a range of differential responses were evoked by listening to and correctly categorizing tool versus animal sounds.
We additionally charted brain responses in the ROIs when sounds were incorrectly categorized, including animal sounds miscategorized as tools (Fig. 2, “T̄”, pink) and tool sounds miscategorized as animals (“Ā”, light blue), both relative to silence. The mSTG foci were more strongly activated by animal vocalization sounds (middle two columns) than tool sounds, regardless of whether or not they were correctly categorized. These findings were consistent with the placement of the mSTG at early to intermediate auditory processing stages as opposed to more high-level conceptual stages, as is addressed further below.
In striking contrast, all of the ROIs that were preferential for correctly categorized tool sounds relative to animal sounds (red > dark blue) were also preferentially activated by animal sounds judged to be tool sounds relative to tool sounds judged to be animal sounds (pink > light blue). The miscategorization analysis thus revealed a clear trend: the “tool-related” network (red foci) was preferentially activated whenever sounds, whether correctly or incorrectly categorized, were perceived as being produced by tools.
Manipulating virtual tools
One possible explanation for the strong left-lateralized activation evoked by the tool-related sounds (red) was that these regions may be associating dominant right-hand and arm motor actions typically correlated with tool use. To explore this possibility, 12 of the 20 participants additionally performed a separate motor task, usually during the same scanning session. In a block paradigm, they alternated between making right-hand and arm movements as if manipulating a variety of virtual tools (simulating hammering, sanding, sawing, ratcheting, etc.) versus resting the hand motionless. No sound stimuli were presented during these scans except for tone pips to cue when the task periods began and ended.
Making hand manipulations as if using tools evoked the strongest activation (Fig. 3, green) in the hand and arm representations of the primary motor and somatosensory cortices (“M1” and “S1,” respectively) of the left hemisphere and in the right cerebellar cortex (data not shown). Other cortical regions were also significantly activated by manipulating virtual tools and were essentially consistent with previous studies involving actual tool or object manipulations in contrast to other movements (Binkofski et al., 1999; Moll et al., 2000; Amedi et al., 2001; Choi et al., 2001). However, activity in the left and right pMTG regions was more pronounced in, if not unique to, the present study. This may have been attributable to the participants imagining the sounds and/or visualizing the pantomimed tools, in contrast to the motionless rest condition. Thus, some of the activity (green) may reflect visualizations or other forms of mental imagery in addition to, or instead of, overt motor processing. The intermediate colors yellow and cyan depict regions of overlap with the tool versus animal sound recognition data, respectively.
The main result of the virtual tool manipulation experiment was that most of the cortical foci showing preferential activity to hearing and categorizing tool-related sounds (as opposed to animal sounds) were also activated, at least in part, by actual right-hand movements mimicking the use of tools in the absence of sound (Fig. 3, yellow). This included the left mIFS, left VPMC, left IPoCeC, left AIP, bilateral pLaS, and bilateral pMTG foci. The overlap between the tool sound processing foci and the hand manipulation foci was also evident on a voxelwise basis in most individuals.
Spectral and phase analysis of sound signals
Some of the differences in cortical activity evoked by categorizing tool versus animal sounds described above appeared to be attributable, in part, to relatively high-level associative processing with other sensory modalities (multimodal or supramodal processing), notably including motor-related cortices. However, some of the differential brain activation may have been attributable to differences in the acoustic signal properties (low-level auditory processing) between these two conceptual categories of sound (Nelken et al., 1999; Reide et al., 2001; Ehret and Riecke, 2002; Lewicki, 2002). To explore this possibility, we quantitatively compared (Fig. 4) the acoustical features of the tool and animal sound stimuli we presented. Additionally, because the left and right mSTG foci for animal sounds directly overlapped cortex found previously to be preferentially activated by speech versus nonspeech sounds (Belin et al., 2000; Binder et al., 2000; Fecteau et al., 2004; Lewis et al., 2004), we also analyzed a wide variety of human speech sounds, expressions, and utterances for quantitative comparison. This included samples from American English dialogs and monologs, which were processed (and matched in duration and loudness) using exactly the same techniques as for the tool and animal sounds.
Spectrographs (Fig. 4a-c) illustrate the energy of specific frequencies present over the duration of each sound. To quantify some of these complex signal attributes, we first examined the overall power spectra from 0 to 20 kHz (shown to the right of each spectrograph). On average (Fig. 4d), the animal and speech sounds (vocalizations) contained greater similarity in their overall normalized power spectra, including relatively greater power in the ∼500-3500 Hz range. However, the power spectrum did not capture or characterize any of the potential interactions or relationships between frequency components in the acoustic signals or the evolution of spectral features over time. Presumably, some of these more complex spectral and temporal relationships could distinguish the different categories of sound (Nelken et al., 1999; Reide et al., 2001; Ehret and Riecke, 2002). Thus, we explored two advanced signal processing techniques in an attempt to distinguish these three conceptually distinct categories of sound.
Because vocalization sounds are known to have a large degree of harmonic content (Reide et al., 2001; Ehret and Riecke, 2002), we examined the HNR of all of the sounds. This measure compared the acoustic energy of the harmonic components (periodic signals) over time with that of noise, in which noise was defined as the remaining nonharmonic, irregular, or chaotic acoustic energy in the signal. The HNR values correlated well with the spectrographs. For instance, the spectrographs of the animal sound (Fig. 4b; rooster call) and speech sound (Fig. 4c; woman answering “hello” on a telephone) both showed clear bands of energy in frequency (dark horizontal banding patterns), which were consistent with their having large HNR values or “harmonic content.” In contrast, the spectrograph of the tool sound (Fig. 4a; a saw cutting wood with an up, down, up, down stroke cycle) showed a broad and nearly continuous range of frequencies within each stroke and consequently had a very low (negative) HNR value. On average (Fig. 4e), the speech sounds had greater HNR values than the animal sounds, and both the animal and human sounds (vocalizations) had much greater harmonic content than the tool sounds.
Interestingly, the tool sounds that were miscategorized as animal sounds contained relatively higher HNR values (average, -0.2 dB; 21 sounds) than the correctly categorized tool sounds (-1.1 dB), and the miscategorized animal sounds had lower HNR values (+6.1 dB; 24 sounds) than the correctly categorized animal sounds (+9.9 dB). This error trend supported the notion that HNR signal content (or lack thereof) may have influenced the perception of the sounds.
To examine temporal dynamics of the sounds, we adopted a bicoherence analysis (Fig. 4f) that quantified the degree of phase coupling of frequencies within the different sounds (Sigl and Chamoun, 1994). Nerve fibers in the cochlea can respond to pure sine-wave tones by synchronizing (spiking) with a particular phase angle of the sound (Warren, 1999). Such “phase locking” can occur with frequencies up to ∼4000 Hz, thereby serving as one means for the brain to encode or neurally represent tones. Phase information is known to be used by the auditory system, such as when hearing binaural beats (akin to hearing beats when tuning guitar strings) and for spatial hearing (localizing a sound based on phase angle differences at each ear) (Blauert, 1997). The bicoherence analysis, unlike the power spectra, quantified the degree of phase coupling (monaurally; in only the left ear channel of a given sound) that existed between different frequency pairs independent of amplitude. Frequency components that were coupled in-phase (having a nonrandom relationship) produced a greater bicoherence value, which is reflected in the one-dimensional profiles illustrated. The bicoherence analysis revealed a greater degree of similarity between the animal and human vocalizations relative to the tool sounds, especially in the 250-1400 Hz range. Whether phase-coupling attributes can significantly influence sound recognition processes remains to be explored, but such coupling appears to be a potential cue for distinguishing vocalizations from tool-related sounds. Together, the above quantitative analyses suggest that the signal attributes of the animal versus tool sounds could have accounted for some of the differential fMRI activation, quite likely including the activity in the bilateral mSTG foci.
Discussion
Animal sounds and the bilateral mSTG
Together with previous fMRI studies, the present data suggest that the bilateral mSTG foci, preferential for animal sounds, represent “intermediate” or pre-representational sound processing stages of a cortical sound recognition system. The bilateral mSTG foci directly overlapped a progression of cortex reported to be more responsive to passively heard spoken words versus tones, and tones versus white noise (Binder et al., 2000; Wessinger et al., 2001; Lewis et al., 2004). This progression, extending multidirectionally from the primary auditory cortex out to and including the mSTG, appears to reflect a hierarchy of increasing responsiveness to increasing acoustic structure. Additionally, previous neuroimaging and electrophysiological studies have reported preferential activity to human voices or vocalizations along cortex overlapping (or near) the bilateral mSTG, in contrast to a variety of complex control sounds, including scrambled voices or musical instruments (Belin et al., 2000; Levy et al., 2001; Fecteau et al., 2004). Our ROI analysis indicated that the animal sounds (Fig. 2, dark blue and pink), in contrast to the tool sounds, produced greater activation in the mSTG foci regardless of whether the sounds were correctly categorized. Similarly, these foci were found to be relatively insensitive to environmental sound recognition per se (Lewis et al., 2004). Thus, the mSTG foci appear to represent stages that are primarily before high-level semantic, lexical, or other representational processing (Damasio et al., 1996; Binder et al., 1997; Näätänen and Winkler, 1999; Tranel et al., 2003).
The above progression of auditory cortex shares both anatomical and functional homology with some of the macaque lateral and anterior “belt” and “parabelt” auditory areas that surround primary auditory cortex (Rauschecker, 1998; Kaas et al., 1999). These areas include increasing proportions of neurons selective for components of species-specific monkey calls. Thus, they are thought to subserve the pre-processing of vocalization and communication sounds (Rauschecker, 1998).
We hypothesize that the preferential activity in the mSTG for animal sounds reflects the preferential processing of certain signal attributes, such as harmonic or phase-coupling content (Fig. 4), which are characteristic of vocalization sounds and tend to be less pronounced or less consistent in tool-related sounds. Because most human listeners are exceedingly experienced with processing and recognizing human speech (from birth and perhaps in utero), the bilateral mSTG foci may represent stages that are, or can become, optimally organized to process the prevailing signal components of vocalizations (predominantly human speech but also animal vocalizations). Thus, the mSTG regions may represent rudimentary stages of a sound recognition system that are optimized for spoken language processing.
Tool sounds and action knowledge
Right-handed people typically learn of characteristic tool sounds while manipulating a tool with their right hand and viewing the complex motions of the hand and tool. The cortex that was preferential for processing tool sounds (Fig. 3, yellow and red) may be involved in associating or matching motor ideas regarding right arm and hand manipulations (and possibly visualizing the manipulations) that could have been correlated with sound production. Ostensibly, the collection of left-lateralized regions preferential for tool sounds was consistent with representing part of a “mirror system,” as reported in both humans (Iacoboni et al., 1999; Rizzolatti et al., 2002; Grèzes et al., 2003) and monkeys (Kohler et al., 2002). In particular, the left VPMC and/or mIFS foci may share homology with the macaque area F5, which contains “audiovisual mirror neurons” that discharge when the animal performs a specific action, views the related action, or hears the related sound in isolation (Kohler et al., 2002). Similarly, portions of the left IPoCeC may share homology with the macaque area 7b, which is thought to be involved in the perception of the space within grasping range and in the organization of contralateral arm movements toward stimuli presented in that space (Rushworth et al., 2001; Burton, 2002).
The bilateral pLaS foci may share homology with auditory-somatosensory convergence sites in the retroinsular cortex of the macaque (Hackett et al., 2001; Schroeder et al., 2001; Schroeder and Foxe, 2002; Fu et al., 2003). This includes the caudomedial auditory belt area, which in humans is a region maximally activated, for instance, by the combined presentation of sandpaper-like sounds together with feeling sandpaper being rolled over the right hand (Foxe et al., 2002). The pLaS foci may also overlap the somatosensory area S2, which has a role in active hand manipulations and integrated sequential touching, such as for identifying objects based on touch (Burton, 2002; Grèzes et al., 2003). Together, the preferential activation to tool sounds in the left VPMC and mIFS, IPoCeC, AIP, and bilateral pLaS was consistent with the mirroring of motor actions that were likely to have led to the tool-related sounds.
The left and right pMTG activation for tool sounds may have reflected the processing of dynamic motion associations (motor or visual) or more abstract “action knowledge” representations associated with the sounds (Martin et al., 1996; Grossman et al., 2002; Phillips et al., 2002; Kellenbach et al., 2003). In our previous study, these foci were activated by a much wider range of recognizable environmental sounds (depicting actions) than just tool-related sounds (Lewis et al., 2004). Moreover, pMTG activity has been reported in response to viewing tools or objects in motion (Beauchamp et al., 2002, 2004; Manthey et al., 2003) or pictures of tools associated with action (Chao et al., 1999; Grèzes and Decety, 2002). Thus, the bilateral pMTG activity may be reflecting some form of visual or mental imagery of the motion dynamics associated with the sound.
Together, the network of regions preferential for tool sounds may serve to match or associate multisensory knowledge to reason as to how the tool sounds might have been produced, revealing a high-level auditory “how” pathway for purposes of recognition (Johnson-Frey, 2003; Kellenbach et al., 2003). Thus, what we perceive when we hear tool sounds may not be determined so much by the signal characteristics themselves but rather by the relationship between the sound and our experiences with the probable actions that produced them (Schwartz et al., 2003; Körding and Wolpert, 2004).
From multisensory perception to conception and language
Strikingly, the left pMTG focus for tool sounds and the left mSTG focus for animal sounds (Fig. 1d) complemented, by partially overlapping, the reported large-scale cortical architecture for tool versus animal word-form (lexical/semantic) knowledge (Damasio et al., 1996). In particular, lesions to the left mSTG region (and cortex farther ventral) can lead to deficits in naming pictures of animals, whereas lesions to the left pMTG produce deficits more specific to naming tools. These regions may thus represent early stages of a language-specific pathway that leads to the verbal representation and identification of the perceived sounds (Tranel et al., 2003). However, why would word-form knowledge for the category of tools be represented in the posterior temporal lobe and that for animals be situated farther anterior, and not, say, vice versa?
The present data are suggestive of a multimodal sensory-driven mechanism that may explain this large-scale architecture. We speculate that tool sounds typically have a greater degree of intermodal invariant associations with the other sensory modalities than do most vocalization sounds (Lewkowicz, 2000; Calvert and Lewis, 2004). For instance, pounding with a hammer will typically expose the perceiver to strongly correlated temporal information from the auditory, motor, tactile, and visual modalities. The pMTG represents cortex that is well situated, for compactness of cortical “wiring,” to receive and associate multiple sensory inputs, because it is located approximately between the primary sensory domains for hearing, motor/touch, and vision (Van Essen, 1997; Amedi et al., 2001; Beauchamp et al., 2002, 2004; Lewis et al., 2004).
In contrast, vocalization sounds may evoke a relatively greater degree of processing specific to the auditory modality, such as extracting harmonic or phase-correlated information within the sound (e.g., leading to phoneme and word recognition). Such processing might ideally be situated closer to the primary and nearby auditory cortices (such as the mSTG foci), again for compactness of cortical wiring, because there would be relatively fewer dynamic multimodal associations to be made. A notable exception would include lip reading while listening to speech, which involves audiovisual integration that can greatly aid in speech perception. Although lip reading can activate the mSTG foci (Calvert et al., 1997), the regions most sensitive to the integration of audio and visual speech overlap cortex near the pMTG foci (Calvert et al., 2000; Calvert and Lewis, 2004). These findings support the proposed role of the pMTG in processing dynamic multimodal “actions,” which more typically may apply to tool sounds, but in some circumstances would apply to observing (hearing and seeing) the production of vocalizations with correlated mouth, face, and body gesticulations.
Curiously, there was a moderate degree of overlap between the classically defined Broca's area for language production (Fig. 3, compare near pars opercularis and pars triangularis) and the left VPMC and mIFS foci, which were preferential for tool sounds and virtual tool manipulations. Some theories of the evolution of speech perception (e.g., the “motor theory”) posit that the auditory representation of speech sounds would ideally be situated near motor cortices (Liberman and Mattingly, 1985; Iacoboni et al., 1999; Holden, 2004). In particular, speech production and other forms of communication are primarily learned through mimicry, involving the sequential coordination of muscle movements of the vocal cords and/or the arms and hands. Aspects of tool use are also learned through mimicry, and the resulting sounds (their “meaning”) may thus be associated with such motor actions. Thus, the present data appear to lend support to a motor theory behind the evolution of communication and language in humans, reflecting a possible link between sound recognition, imitation, gesturing, and language articulation.
Together, the present findings are consistent with the existence of sensory- and multisensory-driven mechanisms that may establish the gross cortical organization of the human auditory system, wherein animal vocalization sounds evoke greater processing demands on the mSTG bilaterally, whereas tool sounds impose greater demands on mostly left hemisphere audiomotor association cortices and the bilateral pMTG (possibly audiovisual association cortices). This organization, via multisensory “bootstrapping,” may then lead to the reported large-scale cortical architecture for word-form knowledge, including the distinct conceptual categories of man-made artifacts (tools) and living things (animals).
Appendix
Please visit www.jneurosci.org to view on-line supplemental material.
Footnotes
This work was supported by National Institutes of Health (NIH) Grants R03 DC04642 (J.W.L.) and EY10244 and MH51358 (E.A.D.) and General Clinical Research Center/NIH Grant MO1-RR00058 (Medical College of Wisconsin). We thank B. Doug Ward for assistance with paradigm design and statistical analyses, Nathan Walker for assistance with error analyses and figures, Dr. Harold Burton for comments on the paradigm, Dr. Kristina Ropella for acoustical analysis suggestions, and Dr. Aina Puce for constructive comments on previous versions of this manuscript. We thank Dr. Robert Cox for continual development of AFNI and Dr. David Van Essen, Donna Hanlon, and John Harwell for continual development of cortical data presentation with CARET.
Correspondence should be addressed to Dr. James W. Lewis, Department of Physiology and Pharmacology, P.O. Box 9229, West Virginia University, Morgantown, WV 26506. E-mail: jlewis@hsc.wvu.edu.