Abstract
Natural environments convey information through multiple sensory modalities, all of which contribute to people's percepts. Although it has been shown that visual or auditory content of scene categories can be decoded from brain activity, it remains unclear how humans represent scene information beyond a specific sensory modality domain. To address this question, we investigated how categories of scene images and sounds are represented in several brain regions. A group of healthy human subjects (both sexes) participated in the present study, in which their brain activity was measured with fMRI while they viewed images or listened to sounds of different real-world environments. We found that both visual and auditory scene categories can be decoded not only from modality-specific areas, but also from several brain regions in the temporal, parietal, and prefrontal cortex (PFC). Intriguingly, only in the PFC, but not in any other regions, categories of scene images and sounds appear to be represented in similar activation patterns, suggesting that scene representations in PFC are modality-independent. Furthermore, the error patterns of neural decoders indicate that category-specific neural activity patterns in the middle and superior frontal gyri are tightly linked to categorization behavior. Our findings demonstrate that complex scene information is represented at an abstract level in the PFC, regardless of the sensory modality of the stimulus.
SIGNIFICANCE STATEMENT Our experience in daily life includes multiple sensory inputs, such as images, sounds, or scents from the surroundings, which all contribute to our understanding of the environment. Here, for the first time, we investigated where and how in the brain information about the natural environment from multiple senses is merged to form modality-independent representations of scene categories. We show direct decoding of scene categories across sensory modalities from patterns of neural activity in the prefrontal cortex (PFC). We also conclusively tie these neural representations to human categorization behavior by comparing patterns of errors between a neural decoder and behavior. Our findings suggest that PFC is a central hub for integrating sensory information and computing modality-independent representations of scene categories.
Introduction
Imagine taking a walk on the beach. Your sensory experience would include the sparkle of the sun's reflection on the water, the sound of the crashing waves, and the smell of ocean air. Even though the brain has clearly delineated processing channels for all of these sensory modalities, we still have the integral concept of the beach, which is not tied to particular sensory modalities. What are the neural systems underlying this convergence, which allows our brain to represent the world beyond sensory modalities? Here we identify neural representations of scene information that generalize across different sensory modalities.
Neural mechanisms underlying the perception of visual scenes have been studied extensively for the last two decades, showing a hierarchical structure from posterior to anterior parts of visual cortex. Starting from low-level features, such as orientation, represented in primary visual cortex, the level of representation becomes more abstract, through intermediate-level features represented in V3 and V4 (Nishimoto et al., 2011), to various aspects of scenes represented in high-level visual areas: local elements of a scene in the occipital place area (OPA) (MacEvoy and Epstein, 2007; Dilks et al., 2013), scene geometry and scene content in the parahippocampal place area (PPA) (Epstein and Kanwisher, 1998; Walther et al., 2009), and the embedding of a specific scene into real-world topography in the retrosplenial cortex (RSC) and hippocampus (Morgan et al., 2011). Does this abstraction continue beyond the visual domain? To identify representations of scene content beyond this visual processing hierarchy, we here investigate neural activation patterns elicited by visual and auditory scene stimuli.
Previous work has identified neural representations that are not confined to a particular sense. Several brain areas have been shown to integrate signals from more than one sense (Calvert, 2001; Driver and Noesselt, 2008), such as posterior superior temporal sulcus (STS) (Beauchamp et al., 2004), the posterior parietal cortex (Cohen and Anderson, 2004; Molholm et al., 2006; Sereno and Huang, 2006), and the prefrontal cortex (PFC) (Sugihara et al., 2006; Romanski, 2007). Some of these areas show similar neural activity patterns when the same information is delivered from different senses for various stimuli, such as objects (Man et al., 2012), emotions (Müller et al., 2012), or face/voice identities (Park et al., 2010). Despite these observations, little is known about how scene information is processed beyond the sensory modality domain.
In real-world settings, our perception of scenes typically relies on multiple senses. Therefore, we postulate that there should exist a stage of modality-independent representation of scenes, which generalizes information across different modality channels. We hypothesize that PFC may play a role in representing scene categories beyond the modality domain based on previous research showing that PFC shows categorical representations of visual information (Freedman et al., 2001; Walther et al., 2009).
The present study investigates modality-independent scene representations using multivoxel pattern analysis (MVPA) of fMRI data. Two different types of MVPA were performed to define modality-independent representations in the brain. First, we identified brain areas that process both visual and auditory scene information by decoding neural representations of scene categories elicited by scene images and sounds separately. Second, we tested whether these areas represent visual and auditory information with similar neural codes by performing cross-decoding analysis between the two modalities.
After identifying modality-independent representations of scene categories in the brain, we further explored the characteristics of these representations with two additional types of analysis. We first examined whether the neural representations of scene categories in one modality are degraded by conflicting information from the other modality. Second, we tested to what extent scene category representations are related to human behavior and to the physical structure of the stimuli by comparing error patterns. Among the multisensory brain regions that we investigated, only the regions in PFC contain modality-independent representations of scene categories showing both visual and auditory scene representations in similar neural activity patterns.
Materials and Methods
We posit four idealized models of how visual and auditory information can be processed within a brain region: a purely visual model, a purely auditory model, a multimodal model with separate but intermixed neural populations for processing visual and auditory information, and a cross-modal model with neural populations for representing scene category information regardless of sensory modalities (Fig. 1C). Experimental conditions and analysis protocols were designed to discriminate between these models (Fig. 1A,B).
Figure 1D shows predicted results for each of the four models. Specifically, we expect that primarily visual and auditory regions will contain neurons dedicated to processing their respective modality exclusively. In these regions, scene categories should be decodable from the corresponding modality condition only, but not across modalities. In multimodal regions, both visual and auditory information should be processed in anatomically collocated but functionally separate neural populations. Therefore, we expect that both image and sound categories can be decoded, but decoding across modalities should not be possible. In cross-modal regions, both image and sound categories should be decodable, and scene category decoding should generalize across modalities.
Participants
Thirteen subjects (18–25 years old; 6 females, 7 males) participated in the fMRI experiment. All participants were in good health with no past history of psychiatric or neurological disorders and reported having normal hearing and normal or corrected-to-normal vision. They gave written informed consent before the start of the experiment in accordance with procedures approved by the Institutional Review Board of The Ohio State University.
A separate group of 25 undergraduate students from the University of Toronto (18–21 years old; 16 females, 9 males) participated in the behavioral experiment for course credit. All participants had normal hearing and normal or corrected-to-normal vision and gave written informed consent. The experiment was approved by the Research Ethics Board of the University of Toronto.
Stimuli
In the fMRI experiment, 640 color photographs of four scene categories (beaches, forests, cities, and offices) were used. The images had previously been rated as the best exemplars of their categories from a database of ∼4000 images that were downloaded from the internet (Torralbo et al., 2013). Images were presented at a resolution of 800 × 600 pixels using a Christie DS+6K-M projector operating at a refresh rate of 60 Hz. Images subtended ∼21 × 17 degrees of visual angle.
Sixty-four sound clips representing the same four scene categories (beaches, forests, cities, or offices) were used as auditory stimuli. The sound clips were purchased from various commercial sound libraries and edited to 15 s in length. They include auditory scenes from real-world environments (e.g., the sound of waves and seagulls for a beach scene; the sound of office machines and indistinct murmur from human conversations for an office scene). Because of this relatively long presentation time for each auditory exemplar, fewer exemplars were used than in the image condition. Perceived loudness was equated using Replay Gain as implemented in Audacity software (version 2.1.0). In a pilot experiment, the sound clips were correctly identified and rated as highly typical for their categories by 14 naive subjects. Both visual and auditory stimuli are available at the Open Science Framework repository (OSF): DOI https://doi.org/10.17605/OSF.IO/HWXQV.
The same visual and auditory stimuli were used in the behavioral experiment. In the visual part of the experiment, 400 images were used for practice blocks (key-category association and staircasing), and the other 240 images were used in the main testing blocks. Images were presented at a resolution of 800 × 600 pixels on a CRT monitor (resolution 1024 × 768, refresh rate 150 Hz) and subtended ∼29 × 22 degrees of visual angle. Images were followed by a perceptual mask, which was generated by synthesizing a mixture of textures reflecting all four scene categories (Portilla and Simoncelli, 2000).
Procedure and experimental design
fMRI experiment.
The fMRI experiment consisted of three conditions: the image condition, the sound condition, and the mixed condition, in which both images and sounds were presented concurrently. Participants' brains were scanned during 12 experimental runs, four runs for each condition. Each run started with an instruction asking participants to attend, for the duration of the run, to either images (image runs and half of the mixed runs) or sounds (sound runs and the other half of the mixed runs). In the analysis, we combined the data across the two attention conditions because the attention manipulation was not the main purpose of the present study and did not influence decoding accuracy in most ROIs (except for visual categories in the lateral occipital complex [LOC], middle temporal gyrus [MTG], and superior parietal gyrus [SPG]).
Runs contained eight blocks, two for each scene category, interleaved with 12.5 s fixation periods to allow for the hemodynamic response to return to baseline levels. The beginning and the end of a run also included a fixation period of 12.5 s. The order of blocks within runs and the order of runs were counterbalanced across participants. Mixed runs were only presented after at least two pure image and sound runs. Stimuli were arranged into eight blocks of 15 s duration. During image blocks, participants were shown 10 color photographs of the same scene category for 1.5 s each. During sound blocks, they were shown a blank screen with a fixation cross and a 15 s sound clip was played using Sensimetrics S14 MR-compatible in-ear noise-canceling headphones at ∼70 dB. During mixed blocks, participants were shown images and played a sound clip of a different scene category at the same time. A fixation cross was presented throughout each block, and subjects were instructed to maintain fixation. Each run lasted 3 min 52.5 s.
fMRI data acquisition and preprocessing.
Imaging data were recorded on a 3 Tesla Siemens MAGNETOM Trio MRI scanner with a 12-channel head coil at the Center for Cognitive and Behavioral Brain Imaging at Ohio State University. High-resolution anatomical images were acquired with a 3D-MPRAGE sequence with sagittal slices covering the whole brain; inversion time = 930 ms, TR = 1900 ms, TE = 4.68 ms, flip angle = 9°, voxel size = 1 × 1 × 1 mm, matrix size = 224 × 256 × 160. Functional images for the main experiment were recorded with a gradient-echo EPI sequence with a volume TR of 2.5 s, a TE of 28 ms, and a flip angle of 78°; 48 axial slices with 3 mm thickness were recorded without gap, resulting in an isotropic voxel size of 3 × 3 × 3 mm.
fMRI data were motion corrected to one EPI image (the 72nd volume of the 10th run), followed by spatial smoothing with a Gaussian kernel with 2 mm FWHM and temporal filtering with a high-pass filter at 1/400 Hz. Data were normalized to percentage signal change by subtracting the mean of the first fixation period in each run and dividing by the mean across all runs. The effects of head motion (6 motion parameters) and scanner drift (second degree polynomial) were regressed out using a GLM. The residuals of this GLM analysis were averaged over the duration of individual blocks, resulting in 96 brain volumes that were used as input for MVPA. Preprocessing was performed using AFNI (Cox, 1996).
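As an illustration of the last two steps (normalization and block averaging), the following sketch assumes the residual time series from the nuisance GLM have been exported as NumPy arrays; the variable names, indices, and helper functions are illustrative assumptions rather than the actual analysis code.

```python
import numpy as np

def percent_signal_change(run_data, fixation_idx, grand_mean):
    """run_data: (n_volumes, n_voxels) time series of one run;
    grand_mean: per-voxel mean signal computed across all runs."""
    baseline = run_data[fixation_idx].mean(axis=0)      # mean of the run's first fixation period
    return (run_data - baseline) / grand_mean * 100.0   # percent signal change

def block_patterns(residuals, block_onsets, block_len):
    """Average the GLM residuals over each stimulus block -> one multivoxel pattern per block."""
    return np.stack([residuals[o:o + block_len].mean(axis=0) for o in block_onsets])

# e.g., 8 blocks per run and 6 volumes per block (15 s blocks / 2.5 s TR),
# yielding 12 runs x 8 blocks = 96 patterns per participant for the MVPA
```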
Behavioral experiment.
The behavioral experiment consisted of two parts: a visual and an auditory part. The order of the two parts was randomized for each participant.
The visual part consisted of two phases: practice and testing. Participants performed two types of practice: one for the key-category association and the other for the fast image presentation. During the first block of practice, photographs of natural scenes were presented for 200 ms (stimulus onset asynchrony [SOA]), immediately followed by a perceptual mask for 500 ms. Participants were asked to press one of four keys ('a,' 's,' 'k,' 'l') within 2 s of stimulus onset. They received acoustic feedback (a short beep) when they made an error. Assignment of the keys to the four scene categories (beaches, forests, cities, offices) was randomized for each participant. After participants achieved 90% accuracy in this key practice phase, they completed four additional practice blocks (40 trials each), during which the SOA was linearly decreased to 26.7 ms. The main testing phase consisted of six blocks (40 trials each) of four alternative-forced-choice (4AFC) scene categorization with a fixed SOA of 26.7 ms and without feedback.
In the auditory part, participants listened to scene soundscapes of 15 s in length. They indicated their categorization decision by pressing one of four keys (same key assignment as in the visual part). To make the task difficulty comparable with that of the image categorization task, we overlaid pure-tone noise onto the original sound clips. Noise consisted of 30 ms snippets of pure tones, whose frequency was randomly chosen between 50 and 2000 Hz, with 3 ms of fade-in and fade-out (linear ramp). Based on a pilot experiment, we set the relative volume of the noise stimulus to 9 times the volume of the scene sounds. To familiarize themselves with the task, participants first performed four practice trials. Participants were asked to respond with the key corresponding to the correct category, starting from 8 s after the onset of the sound and without an upper time limit. Participants were encouraged not to deliberate on the answer but to respond as quickly and as accurately as possible.
Data analysis and statistical analysis
Defining ROIs.
ROIs in visual cortex were defined using functional localizer scans, which were performed at the end of the same session as the main experiment. Retinotopic areas in early visual cortex were identified using the meridian stimulation method (Kastner et al., 1998). The vertical and horizontal meridians of the visual field were stimulated in alternation with wedges containing a flickering checkerboard pattern. Boundaries between visual areas were outlined on the computationally flattened cortical surface. The boundary between V1 and V2 was identified as the first vertical meridian activity, the boundary between V2 and V3 as the first horizontal meridian, and the boundary between V3 and V4 (lower bank of the calcarine fissure only) as the second vertical meridian. To establish the anterior boundary of V4, we stimulated the upper and lower visual fields in alternation with flickering checkerboard patterns. The anterior boundary of V4 was determined by ensuring that both the upper and the lower visual field are represented in V4. Participants maintained central fixation during the localizer scan.
To define high-level visual areas, we presented participants with blocks of images of faces, scenes, objects, and scrambled images of objects. fMRI data from this localizer were preprocessed in the same way as the main experiment data, but spatially smoothed with a 4 mm FWHM Gaussian filter. Data were further processed using a GLM (3dDeconvolve in AFNI) with regressors for all four image types. ROIs were defined as contiguous clusters of voxels with significant contrast (q < 0.05, corrected using false discovery rate [FDR]) of the following: scenes > (faces and objects) for PPA, RSC (Epstein and Kanwisher, 1998), and OPA (Dilks et al., 2013); and objects > (scrambled objects) for LOC (Malach et al., 1995). Voxels that could not be uniquely assigned to one of the functional ROIs were excluded from the analysis.
Anatomically defined ROIs were extracted using a probability atlas in AFNI (DD Desai MPM) (Destrieux et al., 2010): MTG, superior temporal gyrus (STG), STS, angular gyrus (AG), SPG, intraparietal sulcus (IPS), middle frontal gyrus (MFG), superior frontal gyrus (SFG), and inferior frontal gyrus (IFG) with pars opercularis, pars orbitalis, and pars triangularis. Anatomical masks for auditory cortex (ACX) and its subdivisions (planum temporale [PT], posteromedial Heschl's gyrus [TE1.1], middle Heschl's gyrus [TE1.0], anterolateral Heschl's gyrus [TE1.2], and planum polare [PP]) were made available by Sam Norman-Haignere (Norman-Haignere et al., 2013). After nonlinear alignment of each participant's anatomical image to MNI space using AFNI's 3dQwarp function, the inverse of the alignment was used to project anatomical ROI masks back into original subject space using 3dNwarpApply. All decoding analyses, including those for the anatomically defined ROIs, were performed in original subject space.
MVPA.
For each participant, we trained a linear support vector machine (SVM; using LIBSVM) (Chang and Lin, 2011) to assign the correct scene category labels to the voxel activations inside an ROI based on the fMRI data from all runs except one. The SVM decoder then produced predictions for the labels of the data in the left-out run. This leave-one-run-out (LORO) cross-validation procedure was repeated with each run being left out in turn, thus producing predicted scene category labels for all runs. Decoding accuracy was assessed as the fraction of blocks with correct category labels. Group-level statistics were computed over all 13 participants using one-tailed t tests, determining whether decoding accuracy was significantly above chance level (0.25). Significance of the t test was adjusted for multiple comparisons using FDR (Westfall and Young, 1993).
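For concreteness, the decoding scheme can be sketched as follows; scikit-learn's LinearSVC stands in for the LIBSVM classifier used in the actual analysis, and the array names (block patterns, category labels, run indices) are assumptions for illustration only.

```python
import numpy as np
from scipy import stats
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut

def loro_accuracy(patterns, labels, runs):
    """patterns: (n_blocks, n_voxels); labels and runs: (n_blocks,) arrays."""
    correct = []
    for train, test in LeaveOneGroupOut().split(patterns, labels, groups=runs):
        clf = LinearSVC().fit(patterns[train], labels[train])   # linear SVM decoder
        correct.append(clf.predict(patterns[test]) == labels[test])
    return np.concatenate(correct).mean()    # fraction of blocks decoded correctly

# Group level: one-tailed t test of per-subject accuracies against chance (0.25)
# accs = np.array([loro_accuracy(X, y, r) for X, y, r in subjects])
# t, p_two = stats.ttest_1samp(accs, 0.25)
# p_one = p_two / 2 if t > 0 else 1 - p_two / 2
```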
To curb overfitting of the classifier to the training data, we reduced the dimensionality of the neural data by selecting a subset of voxels in each ROI. Voxel selection was performed by ranking voxels in the training data according to the F statistics of a one-way ANOVA of each voxel's activity with scene category as the main factor (Pereira et al., 2009). We determined the optimal number of voxels by performing a separate LORO cross-validation within the training data. For the pure image and sound conditions, the training data of each cross-validation fold were used (nested cross-validation). In the cross-decoding and mixed conditions, the entire training data were used for voxel selection because training and test data were completely separate in these conditions. Within the training data, we performed LORO cross-validation analyses with the number of selected voxels varying from 100 to 1000 (step size of 100), including voxels in decreasing rank order of their F statistics. The number of voxels that yielded the best decoding performance within the training data was chosen as optimal, and the classifier was then trained on the entire training set using that number of voxels. Optimal voxel numbers varied by ROI and participant, with an overall average of 107.125 across all ROIs and participants.
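A sketch of this voxel selection scheme, under the same illustrative assumptions as above (the candidate grid matches the 100–1000 range described; the helper names are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut

def rank_voxels(X, y):
    """Rank voxels by the F statistic of a one-way ANOVA with category as factor."""
    F, _ = f_classif(X, y)
    return np.argsort(F)[::-1]                     # descending F

def pick_n_voxels(train_X, train_y, train_runs, grid=range(100, 1100, 100)):
    """Nested LORO cross-validation within the training data only."""
    best_n, best_acc = grid[0], -1.0
    for n in grid:
        fold_accs = []
        for tr, te in LeaveOneGroupOut().split(train_X, train_y, groups=train_runs):
            keep = rank_voxels(train_X[tr], train_y[tr])[:n]
            clf = LinearSVC().fit(train_X[tr][:, keep], train_y[tr])
            fold_accs.append((clf.predict(train_X[te][:, keep]) == train_y[te]).mean())
        if np.mean(fold_accs) > best_acc:
            best_n, best_acc = n, np.mean(fold_accs)
    return best_n   # the final classifier is then trained on all training runs with best_n voxels
```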
Error correlations.
Category label predictions of the decoder were recorded in a confusion matrix, whose rows indicate the category of the stimulus, and whose columns represent the category predictions by the decoder. Diagonal elements indicate correct predictions, and off-diagonal elements represent decoding errors. Neural representations of scene categories were compared with human behavior by correlating the error patterns (the off-diagonal elements of the confusion matrices) between neural decoding and behavioral responses (Walther et al., 2012). Statistical significance of the correlations was established nonparametrically against the null distribution of all error correlations that were obtained by jointly permuting the rows and columns of one of the confusion matrices in question (24 possible permutations of four labels). Error correlations were considered as significant when none of the correlations in the null set exceeded the correlation for the correct ordering of category labels (p < 0.0417).
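The error correlation and its permutation test can be sketched as follows for 4 × 4 confusion matrices; the function names are illustrative.

```python
import numpy as np
from itertools import permutations

def offdiag(cm):
    """Return the off-diagonal (error) entries of a confusion matrix."""
    return cm[~np.eye(cm.shape[0], dtype=bool)]

def error_correlation_test(cm_a, cm_b):
    """Correlate error patterns; p is the rank of the observed correlation within the
    null distribution obtained by jointly permuting rows and columns of cm_b."""
    observed = np.corrcoef(offdiag(cm_a), offdiag(cm_b))[0, 1]
    null = [np.corrcoef(offdiag(cm_a), offdiag(cm_b[np.ix_(p, p)]))[0, 1]
            for p in permutations(range(cm_b.shape[0]))]      # 24 permutations of 4 labels
    p_value = np.mean(np.array(null) >= observed)              # identity permutation included
    return observed, p_value   # significant when no other ordering exceeds the observed value
```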
To assess the similarity between neural representations and the physical characteristics of the stimuli, we constructed simple computational models of scene categorization based on low-level stimulus features. Scene images were filtered with a bank of Gabor filters with four different orientations at four scales, averaged in a 3 × 3 grid. Images were categorized based on the resulting 144-dimensional feature vector in a 16-fold cross-validation, using a linear SVM, resulting in a classification accuracy of 85.8% (chance: 25%).
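A sketch of this image model is given below; the Gabor filter frequencies are illustrative assumptions, since the exact filter-bank parameters are not listed here. The resulting 144-dimensional vectors are then classified with a linear SVM in 16-fold cross-validation, as described above.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import gabor

def gabor_features(image, frequencies=(0.05, 0.1, 0.2, 0.4)):
    """Gabor energy at 4 scales x 4 orientations, averaged in a 3 x 3 grid -> 144 features."""
    gray = rgb2gray(image)
    h, w = gray.shape
    feats = []
    for f in frequencies:                              # four scales
        for theta in np.arange(4) * np.pi / 4:         # four orientations
            real, imag = gabor(gray, frequency=f, theta=theta)
            mag = np.hypot(real, imag)                 # filter energy
            for i in range(3):                         # 3 x 3 spatial grid
                for j in range(3):
                    cell = mag[i * h // 3:(i + 1) * h // 3,
                               j * w // 3:(j + 1) * w // 3]
                    feats.append(cell.mean())
    return np.asarray(feats)                           # length 4 * 4 * 9 = 144
```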
Physical properties of the sounds were assessed using a cochleagram, which mimics the biomechanics of the human ear (Meddis et al., 1990; Wang and Brown, 2006). The cochleagrams of individual sound clips were integrated over their duration and subsampled to 128 frequency bands, resulting in a biomechanically realistic frequency spectrum. The activation of the frequency bands was used as input to a linear SVM, which predicted scene categories of sounds in a 16-fold cross-validation. The classifier accurately categorized 57.8% of the scene sounds (chance: 25%). Error patterns from the computational analyses of images and sounds were correlated with those obtained from the neural decoder.
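The auditory model can be sketched analogously; the cochleagram itself is assumed to have been computed elsewhere (e.g., with a gammatone filterbank model of the ear and at least 128 channels), and the averaging scheme shown here is an illustrative assumption.

```python
import numpy as np

def cochlear_features(cochleagram, n_bands=128):
    """cochleagram: (n_channels, n_timeframes) array for one 15 s sound clip."""
    spectrum = cochleagram.mean(axis=1)                       # integrate over clip duration
    edges = np.linspace(0, spectrum.size, n_bands + 1).astype(int)
    return np.array([spectrum[a:b].mean()                     # subsample to 128 frequency bands
                     for a, b in zip(edges[:-1], edges[1:])])

# The 128-dimensional vectors are then classified with a linear SVM in a
# 16-fold cross-validation, exactly as for the image features.
```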
Error patterns of human observers were obtained from the behavioral experiment. Mean accuracy was 76.6% for the visual task (SD 12.35%, mean reaction time = 885.6 ms) and 78.1% for the auditory task (SD 6.86%, mean reaction time = 7.84 s). Behavioral errors were recorded in confusion matrices, separately for images and sounds. Rows of the confusion matrix indicate the true category of the stimulus, and columns indicate participants' responses. Individual cells contain the relative frequency of the responses indicated by the columns to stimuli indicated by the rows. Group-mean confusion matrices were compared with confusion matrices derived from neural decoding.
Searchlight analysis.
To explore representations of scene categories outside of predefined ROIs, we performed a searchlight analysis. We defined a cubic “searchlight” of 7 × 7 × 7 voxels (21 × 21 × 21 mm). The searchlight was centered on each voxel in turn (Kriegeskorte et al., 2006), and LORO cross-validation analysis was performed within each searchlight location using a linear SVM classifier (CoSMoMVPA Toolbox) (Oosterhof et al., 2016). Decoding accuracy as well as the full confusion matrix at a given searchlight location were assigned to the central voxel.
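A simplified sketch of the searchlight loop is shown below (the actual analysis used the CoSMoMVPA Toolbox); the decoding function is assumed to be the LORO helper sketched in the MVPA section, and edge handling is simplified.

```python
import numpy as np

def searchlight_map(volumes, labels, runs, mask, decode_fn, radius=3):
    """volumes: (n_blocks, X, Y, Z) block-averaged data; mask: (X, Y, Z) boolean;
    decode_fn: e.g., loro_accuracy from the MVPA sketch. radius=3 gives a 7 x 7 x 7 cube."""
    acc_map = np.full(mask.shape, np.nan)
    for x, y, z in zip(*np.nonzero(mask)):
        cube = volumes[:,
                       max(x - radius, 0):x + radius + 1,
                       max(y - radius, 0):y + radius + 1,
                       max(z - radius, 0):z + radius + 1]
        feats = cube.reshape(cube.shape[0], -1)          # voxels in the cube as features
        acc_map[x, y, z] = decode_fn(feats, labels, runs)
        # the full confusion matrix at this location can be stored analogously
    return acc_map
```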
For the group analysis, we first coregistered each participant's anatomical brain to the MNI 152 template using a diffeomorphic transformation as calculated by AFNI's 3dQwarp. We then used the same transformation parameters to register individual decoding accuracy maps to MNI space using 3dNwarpApply, followed by spatial smoothing with a 4 mm FWHM Gaussian filter. To identify voxels with decodable categorical information at the group level, we performed one-tailed t tests to test whether decoding accuracy at each searchlight location was above chance (0.25). After thresholding at p < 0.01 (one-tailed) from the t test, we conducted a cluster-level correction for multiple comparisons. We used AFNI's 3dClustSim to conduct α probability simulations for each participant. The estimated smoothness parameters computed by 3dFWHMx were used to conduct the cluster simulation. In the simulations, a p value of 0.01 was used to threshold the simulated data before clustering and a corrected α of 0.001 was used to determine the minimum cluster size. The average of the minimum cluster sizes across all the participants was 35 voxels.
Results
Decoding scene categories of images and sounds
To assess neural representations of scene categories from images and sounds, we performed MVPA for each ROI. A linear SVM using LIBSVM (Chang and Lin, 2011) was trained to associate neural activity patterns with category labels and then tested to determine whether a trained classifier can predict the scene category in a LORO cross-validation.
Figure 2 illustrates decoding accuracy in each condition for various brain areas (for the results of statistical tests, see Table 1). As shown in previous studies (Walther et al., 2009, 2011; Kravitz et al., 2011; Park et al., 2011; Choo and Walther, 2016), both early visual areas V1-V4 and high-level visual areas, including the PPA, RSC, OPA, and LOC, show category-specific scene representations. We were also able to decode the scene categories from activity elicited by sounds of the respective natural environments in auditory cortex (ACX) as well as its anatomical subdivisions (Figs. 2, 3).
Unlike previous reports showing that auditory content can be decoded from early visual cortex (Vetter et al., 2014; Paton et al., 2016), we did not find representations of auditory scene information in V1-V4. However, we were able to decode auditory scene categories in higher visual areas: the OPA (30.5%, t(12) = 1.966, q = 0.036, d = 1.36) and the RSC (31.3%, t(12) = 1.803, q = 0.048, d = 0.5). Intriguingly, we could also decode scene categories from images in ACX with a decoding accuracy of 29.8% (t(12) = 1.910, q = 1.91, d = 0.53).
Having found modality-specific representations of scene categories in visual and auditory cortices, we aimed to identify scene representations in areas that are not limited to a specific sensory modality. We could decode categories of both visual and auditory scenes in several temporal and parietal regions (Fig. 2): the MTG, the STG, the SPG, and the AG. In the STS, we could decode scene categories only from images, not from sounds. Although previous studies have suggested that the IPS is involved in audiovisual processing (Calvert et al., 2001), we could not decode visual or auditory scene categories in the IPS.
Next, we examined whether PFC showed category-specific representations for both visual and auditory scene information. Previous studies have found strong hemispheric specialization in PFC (Gaffan and Harrison, 1991; Slotnick and Moo, 2006; Goel et al., 2007). We therefore analyzed functional activity in PFC separately by hemisphere. We were able to decode visual scene categories significantly above chance from the left IFG, pars opercularis, the right IFG, pars triangularis, and in both hemispheres from the MFG and the SFG (Figs. 2, 3; Table 1). The categories of scene sounds were decodable in the right IFG, pars triangularis, as well as the MFG and SFG in both hemispheres (Figs. 2, 3; Table 1).
Although the temporal, parietal, and PFC all showed both visual and auditory scene representations, this does not necessarily imply that these areas process scene information beyond the sensory modality domain. Neural representations of scene categories in cross-modal regions should not merely consist of coexisting populations of neurons with visually and auditorily triggered activation patterns (Fig. 1C, the multimodal model); the voxels in these ROIs should be activated equally by visual and auditory inputs if they represent the same category. In other words, if the neural activity pattern elicited by watching a picture of a forest reflects the scene category of forest, then this neural representation should be similar to that elicited by listening to forest sounds (Fig. 1C, the cross-modal model). We aimed to explicitly examine whether scene category information in the prefrontal areas transcends sensory modalities using cross-decoding analysis between the image and sound conditions.
Cross-modal decoding
For the cross-decoding analysis, we trained the decoder using the image labels from the image runs and then tested whether it could correctly predict the categories of scenes presented as sounds in the sound runs. We also performed the reverse analysis, training the decoder on the sound runs and testing it on the image runs.
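Because the training and test sets come from different runs, this reduces to a single train/test split per direction; a minimal sketch (array names are illustrative):

```python
from sklearn.svm import LinearSVC

def cross_decode(train_X, train_y, test_X, test_y):
    """Train on one modality, test on the other; returns decoding accuracy."""
    clf = LinearSVC().fit(train_X, train_y)
    return (clf.predict(test_X) == test_y).mean()

# image -> sound:  cross_decode(image_patterns, image_labels, sound_patterns, sound_labels)
# sound -> image:  cross_decode(sound_patterns, sound_labels, image_patterns, image_labels)
```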
Cross-decoding from images to sounds succeeded in the MFG in both hemispheres and in the right IFG, pars orbitalis. The right MFG and the right IFG, pars triangularis, showed significant decoding accuracy for cross-decoding from sounds to images (Figs. 2, 3). However, cross-decoding was not possible in either direction in sensory cortices or in the temporal and parietal regions that showed significant decoding of both image and sound categories. Although V3 and the PT showed significant decoding in the cross-decoding analysis from images to sounds (Figs. 2, 3; Table 1), it is hard to interpret these findings as equivalent to those in prefrontal regions because these ROIs showed significant decoding of only image (V3) or only sound categories (PT) in the within-modality decoding analysis. These results therefore suggest that only prefrontal areas contain modality-independent representations of scene categories, with similar neural activity patterns for visual and auditory scene information.
Presenting images and sounds concurrently
We explored the characteristics of cross-modal regions using an interference condition, in which images and sounds from incongruous categories were simultaneously presented. If a population of neurons encodes a scene category independently of sensory modality, then we should see a degradation of the category representation in the presence of a competing signal from the other modality. If, on the other hand, two separate but intermixed populations of neurons encode the visual and auditory categories, respectively, then we should be able to still decode the category from at least one of the two senses.
To decode scene categories from this mixed condition, we created an image and a sound decoder by training separate classifiers with data from the image-only and the sound-only conditions. We then tested these decoders with neural activity patterns from the mixed condition, using either image or sound labels as ground truth. As the training and the test data are from separate sets of runs, cross-validation was not needed for this analysis.
We were able to decode visual and auditory scene categories from the respective sensory brain areas, even in the presence of conflicting information from the other modality. In temporal and parietal ROIs, we could decode scene categories, but those ROIs were no longer multimodal; they only represented scene categories in either the visual or auditory domain but no longer both (Fig. 3E,F; Table 1). These findings suggest that these ROIs contain separate but intermixed neural populations for visual and auditory information. For ROIs in PFC, we found that conflicting audiovisual stimuli interfered heavily with representations of scene categories (Fig. 3E,F). Scene categories could no longer be decoded in PFC from either modality, except for visual scenes in the right MFG. Presumably, this breakdown of the decoding of scene categories is due to the conflicting information from the two sensory modalities arriving at the same cross-modal populations of neurons.
Error correlations
To further explore the characteristics of the neural representations of scene categories, we compared the patterns of errors from the neural decoders with those from human behavior as well as with the physical attributes of the stimuli. If representations of scenes in a certain brain region are used directly for categorical decisions, then error patterns from this ROI should be similar to errors made in behavioral categorization (Walther et al., 2009). However, in early stages of neural processing, scene representations might reflect the physical properties of the scene images or sounds. In this case, the error patterns of the decoders should resemble the errors that a classifier would make solely based on low-level physical properties.
To assess similarity of representations, we correlated the patterns of errors (off-diagonal elements of the confusion matrices; see Materials and Methods) between the neural decoders, physical structure of the stimuli, and human behavior. Statistical significance of the correlations was established with nonparametric permutation tests. Here we considered error correlations to be significant when none of the correlations in the null set exceeded the correlation of the correct ordering of the categories (p < 0.0417).
Behavioral errors from image categorization were not correlated with the errors derived from image properties (r = −0.458, p = 0.083), suggesting that behavioral judgment of scene categories was not directly driven by low-level physical differences between the images. There was, however, a positive error correlation between the auditory task and physical properties of sounds (r = 0.407, p < 0.0417).
For the image condition, errors from neural decoders were similar to those from image structure in early visual cortex and similar to human behavioral errors in high-level visual areas (Fig. 4A); in early visual cortex, decoding errors were positively correlated with image structure (V1: r = 0.746, p < 0.0417; V2: r = 0.451, p = 0.083) but not with behavioral errors (V1: r = −0.272, p = 0.929; V2: r = −0.132, p = 0.333). Negative error correlations with image behavior in these areas are due to image behavior being negatively correlated with image structure. V3 and V4 showed no significant correlation with stimulus structure (V3: r = −0.171, p = 0.292; V4: r = 0.076, p = 0.667) or behavior (V3: r = 0.356, p = 0.125; V4: r = 0.250, p = 0.125). In high-level scene-selective areas, decoding errors were positively correlated with image behavior (PPA: r = 0.570, p < 0.0417; RSC: r = 0.637, p < 0.0417; not in OPA: r = 0.183, p = 0.292), but not with the error patterns representing image structure (PPA: r = 0.0838, p = 0.333; RSC: r = −0.230, p = 0.292; OPA: r = 0.099, p = 0.458).
Errors from the neural decoders in SPG were positively correlated with image behavior (r = 0.404, p < 0.0417) but not with image structure (r = 0.220, p = 0.167). The errors from MTG, STS, and AG also showed relatively high correlations with image behavior errors but did not reach significance in the permutation test (MTG: r = 0.360, p = 0.083; STS: r = 0.348, p = 0.083; AG: r = 0.552, p = 0.083; Fig. 4A). Finally, in PFC, errors from the image decoders in the right MFG and SFG showed positive correlations with image behavior (right MFG: r = 0.377, p < 0.0417; right SFG: r = 0.338, p < 0.0417). The left-hemisphere counterparts of these ROIs also showed positive correlations, but these did not reach significance (left MFG: r = 0.309, p = 0.083; left SFG: r = 0.212, p = 0.208). On the other hand, the left IFG, pars opercularis, and the right IFG, pars triangularis, showed no error correlation at all with either image behavior or structure (Fig. 4A).
In the sound condition, error patterns from sound structure as well as sound behavior were positively correlated with decoding errors from ACX, although these correlations did not reach significance in the permutation test (with sound structure: r = 0.438, p = 0.083; with sound behavior: r = 0.46, p = 0.125). Four of the five anatomical subdivisions of ACX showed positive correlations with sound structure (TE1.1: r = 0.478, p < 0.0417; TE1.0: r = 0.443, p < 0.0417; PT: r = 0.419, p = 0.167; PP: r = 0.219, p = 0.125). These areas also showed positive error correlations with human behavior for sounds (TE1.1: r = 0.50, p = 0.083; TE1.0: r = 0.83, p < 0.0417; PT: r = 0.347, p = 0.167; PP: r = 0.419, p = 0.125; Fig. 4B).
None of the temporal or parietal ROIs showed significant error correlations of decoding sounds with sound structure or behavior (Fig. 4B). In PFC, we found that errors from the right SFG positively correlate with both sound structure (r = 0.397, p < 0.0417) and sound behavior (r = 0.721, p < 0.0417). The right MFG also showed high correlations with sound behavior but not significantly (right MFG: r = 0.486, p = 0.125; Fig. 4B). In the left hemisphere, neither MFG nor SFG showed significant correlation with sound behavior (left MFG: r = −0.279, p = 0.375; left SFG: r = 0.22, p = 0.208).
Whole-brain analysis
To explore representations of scene categories beyond the predefined ROIs, we performed a whole-brain searchlight analysis using a cubic searchlight of 7 × 7 × 7 voxels (21 × 21 × 21 mm). The same LORO cross-validation analysis for the image and sound conditions, as well as the same two cross-decoding analyses as in the ROI-based analysis, were performed at each searchlight location using a linear SVM classifier, followed by a cluster-level correction for multiple comparisons. For each decoding condition, we found several spatial clusters with significant decoding accuracy. Some of these clusters confirmed the predefined ROIs; others revealed scene representations in unexpected regions beyond the ROIs.
For the decoding of image categories, we found a large cluster of 14,599 voxels with significant decoding accuracy, spanning most of visual cortex. In accordance with the ROI-based analysis, we also found clusters in PFC, overlapping with the MFG in both hemispheres as well as the right SFG and the right IFG. For a complete list of clusters, see Figure 5A and Table 2. Decoding of sound categories produced a large cluster in each hemisphere, which overlapped with auditory cortices (Fig. 5A; Table 2).
Even though we were able to find ROIs that allowed for decoding of both images and sounds, we could not find any searchlight locations where this was possible. This may be due to spatial smoothing introduced by the spatial extent of the searchlight volume as well as the alignment to the standard brain.
We found several significant clusters in the right PFC that allowed for cross-decoding between images and sounds (Fig. 5B). The image-to-sound condition produced clusters with significant decoding accuracy in the right MFG, IFG, and SFG, as well as the right MTG. The sound-to-image condition resulted in clusters in the right MFG, SFG, and MTG (Fig. 5B; Table 3). We found four compact clusters that allowed for cross-decoding in both directions in the right PFC, overlapping with the right MFG and cingulate gyrus (102 voxels), right MFG (95 voxels; 22 voxels), and right SFG (80 voxels).
We compared error patterns from the neural decoders to stimulus properties and human behavior using the same searchlight cube of 7 × 7 × 7 voxels (21 × 21 × 21 mm). At each searchlight location, the error patterns of the decoder were recorded in a confusion matrix for image and sound categories separately and compared with the error patterns derived from the corresponding stimulus structure and from behavior (for details, see Materials and Methods). In this error pattern analysis, we only included searchlight locations with significant decoding accuracy for the corresponding modality condition: the image condition in comparisons with image structure/behavior and the sound condition with sound structure/behavior (thresholded at p < 0.01; Fig. 5). Clusters in visual cortex, overlapping with V1-V4, showed significant error correlations with image properties (Fig. 6A; Table 4). Error patterns from searchlight locations in the SPG and the parahippocampal gyrus, overlapping with the PPA, correlated with errors from human behavior for image categorization. In general, we observed a posterior-to-anterior trend, with voxels in posterior (low-level) visual regions more closely matched to stimulus properties and voxels in more anterior (high-level) visual regions more closely related to behavior.
Clusters in bilateral STG, overlapping with ACX, showed significant error correlations with sound properties and behavioral errors for sound categorization. Within this cluster, we see the same posterior-to-anterior trend, with posterior voxels being more closely related to sound properties and more anterior voxels being more closely related to behavioral categorization of scene sounds (Fig. 6B; Table 4).
Discussion
The present study investigated where and how scene information from different sensory domains forms modality-independent representations of scene categories. We have found that both visual and auditory stimuli of the natural environment elicit representations of scene categories in subregions of PFC. These neural representations of scene categories generalize across sensory modalities, suggesting that scene representations in PFC reflect scene categories not constrained to a specific sensory domain. To our knowledge, our study is the first to demonstrate a neural representation of scenes at such an abstract level.
Three distinct characteristics support the idea that neural representations of scene categories in PFC are distinct from those in modality-specific areas, such as the visual or the auditory cortices, as well as from those in other multisensory areas. First, both image and sound categories could be decoded from the same areas in PFC. Thus, it can be inferred that neural representations of scene categories in PFC are not limited to a specific sensory modality channel. Second, the representations in PFC could be cross-decoded from one modality to the other, showing that the category-specific neural activity patterns were similar across the sensory modalities. Third, when subjects were presented with incongruous visual and auditory scene information simultaneously, it was no longer possible to decode scene categories in PFC, whereas modality-specific areas as well as multimodal areas still carried the category-specific neural activity patterns. This result shows that inconsistent information entering through the two sensory channels in the mixed condition interferes with the formation of scene category representations in PFC.
Although scene categories could be decoded from both images and sounds in several ROIs in the temporal and parietal lobes, cross-decoding across sensory modalities was not possible there, suggesting that neural representations elicited by visual stimuli were not similar to those elicited by auditory stimuli. Further supporting the idea that visual and auditory representations are separate but intermixed in these regions, decoding of scene categories from the visual or auditory domain was still possible in the presence of a conflicting signal from the other domain. These findings suggest that, even though information from both visual and auditory stimuli is present in these regions (Calvert et al., 2001; Beauchamp et al., 2004), scene information is computed separately for each sensory modality, unlike in PFC. The discrimination between multimodal and truly cross-modal representations is not possible with the univariate analysis techniques used in those previous studies.
Analysis of decoding errors demonstrated that the category representations in the visual areas have a hierarchical organization. In the early stage of processing, categorical representations are formed based on the physical properties of visual inputs, whereas in the later stage, the errors of neural decoders correlate with human behavior, confirming previous findings, which mainly focused on scene-selective areas (Walther et al., 2009, 2011; Choo and Walther, 2016). Significant error correlation between human behavior and the neural decoders in prefrontal areas confirms that this hierarchical organization is extended to PFC, beyond the modality-specific areas PPA, OPA, and RSC.
Intriguingly, no similar hierarchical structure of category representations was found in the auditory domain. Both types of errors, those representing the physical properties of the stimuli and those from human behavior, were correlated with the errors of the neural decoder in ACX. This difference between the visual and the auditory domain might reflect the fact that much auditory processing occurs in subcortical regions, before the information arrives in ACX. Thus, if auditory scene processing relies on a hierarchical neural architecture, it might not be easily detectable with fMRI. A recent study by Teng et al. (2017) showed evidence suggesting a potential structure for auditory scene processing, finding that different types of auditory features in a scene, such as reverberant space and source identity, are processed at different times. Further investigation with time-resolved recording techniques, such as MEG/EEG in combination with fMRI, as well as with computational modeling (Cichy and Teng, 2016), is needed for a better understanding of the neural mechanisms of auditory scene processing.
Our findings show a distinction of cross-modal versus multimodal neural representations of real-world scenes in prefrontal areas versus temporal and parietal areas. However, this classification of brain areas might not necessarily hold for all situations and all types of stimuli. Indeed, previous fMRI studies as well as our findings show that brain regions traditionally considered to be sensory-specific process information from other modalities (Vetter et al., 2014; Smith and Goodale, 2015; Paton et al., 2016). For instance, Vetter et al. (2014) showed that auditory content of objects can be decoded from early visual cortex, suggesting cross-modal interactions in modality-specific areas. Using more complex stimuli at the level of scene category, our data show that auditory content can be decoded from high-level scene-selective areas (RSC and OPA). Furthermore, we also found that ACX represents visual information. In particular, one subregion of ACX, the PT, showed significant decoding accuracy in the cross-modal decoding analysis as well as relatively high decoding accuracy in the image decoding analysis. These findings lead to a host of further questions for future research, such as how these visual and auditory areas are functionally connected, whether the multisensory areas mediate this interaction between the visual and auditory areas by sending feedback signals, or whether these cross-modal representations can influence or interfere with perceptual sensitivity in each sensory domain.
The whole-brain searchlight analysis confirmed the findings of our ROI-based analysis. In the image and sound decoding analyses, we found clusters with significant decoding accuracy in the visual and auditory areas as well as in the temporal, parietal, and prefrontal regions. Furthermore, the clusters in the prefrontal areas showed significant accuracy in the cross-decoding analysis, whereas the clusters in other modality-specific or multimodal areas did not, supporting the view that only representations in the PFC transcend sensory modalities. In the analysis of decoding errors, we observed that the errors of the image decoders were significantly correlated with human categorization behavior in scene-selective areas PPA and RSC as well as in the SPG, consistent with previous work by our group (Walther et al., 2009, 2011; Choo and Walther, 2016).
Previous studies addressing the integration of audiovisual information to form modality-independent representations have used univariate analysis (Downar et al., 2000; Beauchamp et al., 2004) or correlations of content-specific visual and auditory information in the brain (Hsieh et al., 2012). These methods do not distinguish between coactivation from multiple senses and modality-independent processing. Recent studies using MVPA have shown that visual and auditory information about objects (Man et al., 2012) or emotions (Peelen et al., 2010; Müller et al., 2012) evokes similar neural activity patterns across different senses, suggesting that stimulus content is represented independently of sensory modality at later stages of sensory processing. Unlike the present study, however, these studies report that areas in temporal or parietal cortex are involved in this multimodal integration. One reason for this difference could be that real-world scenes are more variable in their detailed sensory representation, typically including multiple visual and auditory cues. Furthermore, we here consider representations of scene categories as opposed to object identity (Man et al., 2012). Our results indicate that generalization across sensory modalities at the level of scene categories occurs only in PFC. The same brain regions have been found to be involved in purely visual categorization and category learning (Freedman et al., 2001; Miller and Cohen, 2001; Wood and Grafman, 2003; Meyers et al., 2008; Mack et al., 2013). This discrepancy between object identity and scene categorization might also explain different findings of cross-modal representations in visual areas.
In a recent review, Grill-Spector and Weiner (2014) suggested that the ventral temporal cortex contains a hierarchical structure for visual categorization, with more exemplar-specific representations in posterior areas and more abstract representations in anterior areas of the ventral temporal cortex. Several studies have suggested that such abstraction is tightly related to how we represent concepts in the brain by showing amodal representations across words and images (Devereux et al., 2013; Fairhall and Caramazza, 2013). Adding to this growing body of literature, we found that the posterior-to-anterior hierarchy of levels of abstraction extends to the PFC, which represents scene categories beyond the sensory modality domain. The abstraction and generalization across sensory modalities are likely to contribute to the efficiency of cognition by representing similar concepts in a consistent manner, even when the physical signal is delivered via different sensory channels (Huth et al., 2016).
Footnotes
This work was supported by Natural Sciences and Engineering Research Council Discovery Grant 498390 and Canadian Foundation for Innovation 32896. We thank Michael Mack and Heeyoung Choo for helpful comments on an earlier version of this manuscript.
The authors declare no competing financial interests.
- Correspondence should be addressed to Dr. Yaelan Jung, 100 St. George Street, Toronto, Ontario M5S 3G3, Canada. yaelan.jung{at}mail.utoronto.ca