Abstract
In humans, the superior temporal sulcus (STS) combines auditory and visual information. However, the extent to which it relies on visual information from the ventral or dorsal stream remains uncertain. To address this, we analyzed open-source functional magnetic resonance imaging data collected from 15 participants (6 females and 9 males) as they watched a movie. We used artificial neural networks to investigate the relationship between multivariate response patterns in auditory cortex, the two visual streams, and the rest of the brain, finding that distinct portions of the STS combine information from the two visual streams with auditory information.
Significance Statement
The superior temporal sulcus (STS) combines auditory and visual inputs. However, visual information is processed along a ventral and a dorsal stream, and the extent to which these streams contribute to audio-visual combination is poorly understood. Is auditory information combined with visual information from both streams in a single centralized hub? Or do separate regions combine auditory information with ventral visual regions on one hand and with dorsal visual regions on the other? To address this question, we employed a multivariate connectivity method based on artificial neural networks. Our findings reveal that information from the two visual streams is combined with auditory information in distinct portions of STS, offering new insights into the neural architecture underlying multisensory perception.
Introduction
The human brain is adept at integrating visual and auditory information to create a coherent perception of the external world. Audio-visual integration contributes to sound localization (Zwiers et al., 2003) and plays a key role in emotion recognition (Piwek et al., 2015) as well as in speech perception (Gentilucci and Cattaneo, 2005). Several phenomena demonstrate that the integration of visual and auditory cues shapes perceptual experience. In the McGurk effect, simultaneous presentation of a phoneme with a mismatched face video results in a distorted perception of the phoneme (McGurk and MacDonald, 1976). Similarly, presentation of mismatched auditory and visual stimuli can alter emotion recognition (Fagel, 2006), even when participants are explicitly instructed to focus on only one stimulus modality and ignore the other (Collignon et al., 2008), suggesting that audio-visual integration is automatic.
Audio-visual integration requires combining auditory information represented in the superior temporal gyrus with visual information encoded in occipitotemporal areas. Therefore, identifying brain regions that combine auditory and visual information is key for understanding the neural bases of audio-visual integration. Previous work found that the presentation of congruent audio-visual stimuli leads to supra-additive responses in the superior temporal sulcus (STS) compared with unimodal visual and auditory stimuli, whereas incongruent audio-visual stimuli lead to sub-additive responses (Calvert et al., 2000). In addition, participants’ susceptibility to the McGurk effect correlates with the strength of STS responses (Nath and Beauchamp, 2012). Furthermore, response patterns in the STS encode information about emotions and identity that generalizes across visual and auditory modalities (Peelen et al., 2010; Anzellotti and Caramazza, 2017). These studies indicate that the STS plays a pivotal role in combining auditory and visual information.
However, little is known about the precise visual representations that are involved. Visual information is processed by multiple streams: a ventral and a dorsal stream (Ungerleider, 1982). The ventral stream originates in ventral area V3 (V3v) and area V4, and the dorsal stream in dorsal area V3 (V3d) and area V5 (Felleman and Van Essen, 1987; Fig. 1a). Area V5 is associated with motion perception, featuring a large number of direction-selective neurons (Born and Bradley, 2005). In contrast, many neurons in V4 show sensitivity to color (Schein and Desimone, 1990). Correspondingly, a large number of neurons in the dorsal part of V3 respond to motion, and a large number of neurons in the ventral portion of V3 are tuned for color processing (Felleman and Van Essen, 1987). The existence of these different visual streams prompts questions about their relative contributions to the combination of visual and auditory information.
Figure 1. a, Visual and auditory regions of interest (ROIs). b, Responses in a combination of visual (e.g., early dorsal visual stream; a, middle panel) and auditory regions were used to predict responses in the rest of the brain using MVPN. c, In order to identify brain regions that combine responses from auditory and visual regions, we identified voxels where predictions generated using the combined patterns from auditory regions and one set of visual regions jointly (as shown in b) are significantly more accurate than predictions generated using only auditory regions or only that set of visual regions.
Auditory information could be combined with visual information from both streams or from only one of them. If both streams contribute, the combination could occur within a single hub, or distinct regions could combine auditory information with each visual stream separately. To investigate this, we used artificial neural networks to model the relationship between patterns of response in auditory brain regions, in the initial segments of the ventral and dorsal visual streams, and in the rest of the brain (Fig. 1b), following a strategy that has been recently adopted to investigate the combination of information from multiple category-selective regions (Fang et al., 2023). Functional magnetic resonance imaging (fMRI) data collected while participants viewed rich audio-visual stimuli (Hanke et al., 2016) were analyzed with multivariate pattern dependence networks (MVPNs; Anzellotti et al., 2017; Fang et al., 2022). Searching for brain regions where responses are better predicted using a combination of auditory responses and responses in different visual streams than using auditory or visual responses in isolation revealed two distinct portions of STS that combine information between auditory regions and the two visual streams.
Materials and Methods
Experimental design and statistical analyses
Experimental paradigm
The blood oxygen level-dependent (BOLD) functional magnetic resonance imaging (fMRI) data was obtained from the StudyForrest dataset (https://www.studyforrest.org; Hanke et al., 2016; Sengupta et al., 2016). FMRI data was acquired while participants watched the movie Forrest Gump. The movie was divided into eight segments, each of which was ∼15 min long. These segments were presented to subjects in chronological order in eight separate scanner runs.
Data acquisition parameters
Fifteen right-handed subjects (6 females; age range, 21–39 years; mean, 29.4 years), whose native language was German, were scanned in a 3 T Philips Achieva dStream MRI scanner equipped with a 32-channel head coil. Functional MRI data was acquired with a T2*-weighted echoplanar imaging sequence [gradient echo, 2 s repetition time (TR), 30 ms echo time, 90° flip angle, 1,943 Hz/Px bandwidth, parallel acquisition with sensitivity encoding (SENSE) reduction factor]. Scans captured 35 axial slices in ascending order with a 10% interslice gap, an in-plane matrix of 80 × 80 voxels (3.0 × 3.0 mm), a 240 mm field of view, and an anterior-to-posterior phase-encoding direction. The dataset also includes root mean square (RMS) annotations, which measure the loudness of the film.
Preprocessing
Data was first preprocessed using fMRIPrep (https://fmriprep.readthedocs.io/en/latest/index.html; Esteban et al., 2019), a robust pipeline for preprocessing a wide range of fMRI data. Anatomical MRI images were skull-stripped using ANTs (http://stnava.github.io/ANTs/; Avants et al., 2009), and FSL FAST was used for tissue segmentation. Functional MRI images were corrected for head movement using FSL MCFLIRT (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/MCFLIRT; Greve and Fischl, 2009) and were then coregistered with anatomical scans using FSL FLIRT (Jenkinson et al., 2002). Data was denoised with CompCor using five principal components extracted from the union of cerebrospinal fluid and white matter (Behzadi et al., 2007). The raw data of one subject could not be preprocessed with the fMRIPrep pipeline. The remaining 14 subjects’ data were used for the rest of the study.
ROI definition
Two sets of visual regions were identified by creating anatomical masks using Probabilistic Maps of Visual Topography in Human Cortex (Wang et al., 2015). This atlas provides probabilistic maps in MNI space of the likelihood that a voxel is a part of a certain brain region. The early ventral stream ROI was created by choosing the 80 voxels with the highest probability to be in the ventral parts of V3 (V3v) and V4 (Fig. 1a, top panel), and the early dorsal stream ROI was created by choosing the 80 voxels with the highest probability to be in the dorsal parts of V3 (V3d) and V5 (Fig. 1a, middle panel).
Since the anatomical location of auditory brain regions is more variable across subjects than that of visual brain regions (Rademacher et al., 2001), auditory ROIs were defined individually for each subject by identifying voxels whose responses are parametrically modulated by the loudness of the auditory stimuli. To this end, standard univariate GLM analyses were conducted using FSL FEAT (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FEAT; Woolrich et al., 2001), with root mean square (RMS) loudness levels as the predictor. The 80 voxels with the highest t-scores were selected individually for each subject (an example auditory ROI mask is shown in Fig. 1a, bottom panel). To ensure that the remaining analyses were independent of the ROI selection, only data from the first fMRI run were used for auditory ROI selection, and this run was not used in the remaining analyses (which were therefore conducted on the remaining seven runs). There were no overlapping voxels between the ROIs.
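As an illustration of the voxel selection step shared by the visual and auditory ROI definitions, the following is a minimal sketch; the file names are hypothetical, and we assume the atlas probability maps and the run-1 RMS t-map have already been resampled to the functional space.

```python
import numpy as np
import nibabel as nib

def top_k_mask(stat_img_path, k=80):
    """Return a boolean mask selecting the k voxels with the highest values
    in a statistic image (e.g., an atlas probability map or a GLM t-map)."""
    img = nib.load(stat_img_path)
    data = img.get_fdata()
    flat = data.ravel()
    top_idx = np.argsort(flat)[-k:]            # indices of the k largest values
    mask = np.zeros(flat.shape, dtype=bool)
    mask[top_idx] = True
    return mask.reshape(data.shape), img.affine

# early dorsal stream ROI from the summed V3d + V5 probability map (hypothetical path)
dorsal_mask, affine = top_k_mask("wang2015_V3d_plus_V5_prob.nii.gz", k=80)
# subject-specific auditory ROI from the run-1 RMS-loudness t-map (hypothetical path)
auditory_mask, _ = top_k_mask("sub-01_run-01_rms_tstat.nii.gz", k=80)
```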
Additionally, a group-average gray matter mask was created using the gray matter probability maps that were generated during preprocessing. This gray matter mask had a total of 53,539 voxels and was used as the target of prediction in the multivariate pattern dependence analyses, explained in the following section.
MVPN: multivariate pattern dependence network
Recent research has taken advantage of the flexibility and computational power of artificial neural networks (ANNs) in order to analyze brain connectivity (Fang et al., 2022, 2023). The multivariate pattern dependence network (MVPN) method—an extension of MVPD (Anzellotti et al., 2017)—uses ANNs to analyze the multivariate relationships between neural response patterns. It is important to note that MVPN measures the statistical relationship between response patterns in different regions, but it cannot detect the direction of information flow. We implemented MVPN in PyTorch, and the neural networks were trained on Tesla V100 graphics processing units (GPUs). In this study, we used five-layer dense neural networks with 100 nodes per hidden layer. This architecture was selected based on prior work (Fang et al., 2022), which systematically compared different network architectures and found the five-layer dense network to yield the highest overall predictive accuracy when using two different seed regions (FFA and PPA) to predict responses across the rest of the brain. The networks were optimized using stochastic gradient descent (SGD) with a mean squared error (MSE) loss function, a learning rate of 0.001, and a momentum of 0.9. The models were trained for 5,000 epochs. We used a batch size of 32, and batch normalization was applied to each layer's inputs. The ANNs were given as input the multivariate response patterns in one or more sets of brain regions (Fig. 1): auditory regions, ventral visual regions (V3v and V4), dorsal visual regions (V3d and V5), and all pairwise combinations. ANNs were trained to predict the patterns of responses in all gray matter voxels.
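To make the architecture concrete, below is a minimal PyTorch sketch under stated assumptions: we interpret "five-layer" as five linear layers (four hidden layers of 100 units plus an output layer) and use ReLU nonlinearities; the exact layer count, ordering, and activation function of the published pipeline may differ (see the PyMVPD code referenced under Code/software accessibility).

```python
import torch
import torch.nn as nn

class MVPNDense(nn.Module):
    """Dense network mapping predictor-ROI response patterns to whole-brain responses.
    The layer ordering and the ReLU nonlinearity are illustrative assumptions."""
    def __init__(self, n_in, n_out, n_hidden=100):
        super().__init__()
        layers = []
        sizes = [n_in] + [n_hidden] * 4            # four hidden layers of 100 units
        for s_in, s_out in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(s_in, s_out), nn.BatchNorm1d(s_out), nn.ReLU()]
        layers.append(nn.Linear(n_hidden, n_out))  # fifth (output) linear layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# training configuration reported in the text:
# n_in = 160 (two predictor ROIs of 80 voxels each), n_out = 53,539 gray matter voxels
model = MVPNDense(n_in=160, n_out=53539)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
loss_fn = nn.MSELoss()
```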
More precisely, the MVPN method works as follows. Consider an fMRI experiment with $m$ experimental runs. We label the multivariate time courses in a predictor region as $X_i \in \mathbb{R}^{T_i \times n_X}$ and the time courses in the target gray matter voxels as $Y_i \in \mathbb{R}^{T_i \times n_Y}$, where $i = 1, \dots, m$ indexes the runs, $T_i$ is the number of time points in run $i$, and $n_X$ and $n_Y$ are the numbers of predictor and target voxels, respectively.

The neural networks were trained with a leave-one-run-out procedure to learn a function $f$ such that

$$\hat{Y}_i = f(X_i),$$

where the parameters of $f$ are estimated using the remaining $m - 1$ runs and the predictions $\hat{Y}_i$ are evaluated on the left-out run $i$.

We used the proportion of variance explained between the predictor region and all other voxels in the gray matter mask in order to measure multivariate statistical dependence. For each target region voxel $j$, the variance explained in the left-out run $i$ was computed as

$$\mathrm{VE}_{ij} = 1 - \frac{\sum_t \left( y_{tj} - \hat{y}_{tj} \right)^2}{\sum_t \left( y_{tj} - \bar{y}_j \right)^2},$$

where $y_{tj}$ is the observed response of voxel $j$ at time $t$, $\hat{y}_{tj}$ is the corresponding model prediction, and $\bar{y}_j$ is the mean response of voxel $j$ in that run; variance explained values were then averaged across the left-out runs.
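A minimal numpy sketch of the per-voxel variance explained computation for one left-out run (variable names are hypothetical):

```python
import numpy as np

def variance_explained(y_true, y_pred):
    """Proportion of variance explained per target voxel for one left-out run.
    y_true, y_pred: arrays of shape (n_timepoints, n_voxels)."""
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot               # shape: (n_voxels,)
```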
Combined-minus-max whole-brain analysis
In order to identify brain regions that depend on the combination of auditory and visual response patterns, we analyzed the StudyForrest dataset with a novel approach we introduced in a recent study (Fang et al., 2023): the “combined-minus-max” approach, described in the following paragraphs. Since run 1 was used to functionally localize auditory regions (see above, ROI definition), to prevent circularity in the analysis, we used experimental runs 2 through 8 for the combined-minus-max analysis (a total of 7 runs).
In the combined-minus-max approach, first, we used MVPN to calculate the variance explained in each gray matter voxel using individual ROIs as predictors (early dorsal stream, early ventral stream, auditory stream). Then, we used pairs of these ROIs as joint inputs of the MVPN model in order to predict the neural responses of each gray matter voxel (Fig. 1b). We tested all pairs of the three streams: (1) posterior dorsal stream and auditory stream, (2) posterior ventral stream and auditory stream, and (3) posterior ventral stream and posterior dorsal stream.
If a voxel only encodes information from one of the streams, using the responses from multiple streams as predictors should not improve the variance explained. Conversely, if the responses in the voxel are better predicted by a neural network that uses multiple streams combined than by one using a single stream, we can conclude that the voxel combines information from multiple streams. Therefore, we searched for voxels that combine information from multiple streams by computing an index given by the difference between the proportion of variance explained by a model using two streams jointly (the “combined” model) and the proportion of variance explained by a model using the best predicting stream among the two (the “max” model). This procedure is illustrated in Figure 1c.
Formally, for each voxel $j$, we can compute the variance explained by MVPN using as input the responses from a pair of ROIs $A$ and $B$ jointly, $\mathrm{VE}_j^{A+B}$, as well as the variance explained using each ROI in isolation, $\mathrm{VE}_j^{A}$ and $\mathrm{VE}_j^{B}$. The multivariate statistical dependence (MSD) index is then defined as

$$\mathrm{MSD}_j = \mathrm{VE}_j^{A+B} - \max\!\left( \mathrm{VE}_j^{A}, \mathrm{VE}_j^{B} \right).$$
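A minimal sketch of the combined-minus-max index, assuming per-voxel variance explained maps for the two single-ROI models and the combined model have already been computed (variable names are hypothetical):

```python
import numpy as np

def msd_index(ve_combined, ve_roi_a, ve_roi_b):
    """Combined-minus-max (MSD) index per gray matter voxel.
    All inputs are arrays of shape (n_voxels,) of variance explained."""
    return ve_combined - np.maximum(ve_roi_a, ve_roi_b)

# e.g., the auditory + dorsal pairing: msd_index(ve_aud_dorsal, ve_aud, ve_dorsal)
```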
Control analysis
When using the combined-minus-max approach, there is still the possibility that the better predictive accuracy of the combined model might be due to the larger number of predictor voxels in the combined analysis. To control for this possibility, we conducted a control analysis using voxels from the primary motor cortex (M1) as predictors (see Fang et al., 2023 for an analogous approach). In this analysis, we randomly selected three nonoverlapping groups of 80 voxels in M1 (this number was chosen to match the number of voxels selected from the three streams: the posterior ventral, posterior dorsal, and auditory). We then used the responses from the three groups of M1 voxels to run a control analysis following the same procedure as the combined-minus-max analysis, and we computed the statistical significance of the resulting combined-minus-max index for each gray matter voxel across subjects.
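A minimal sketch of the random grouping step, assuming an M1 mask image in functional space (the file name and random seed are hypothetical):

```python
import numpy as np
import nibabel as nib

rng = np.random.default_rng(seed=0)                        # hypothetical seed
m1_mask = nib.load("m1_mask.nii.gz").get_fdata() > 0       # hypothetical M1 mask image
m1_idx = np.flatnonzero(m1_mask)
chosen = rng.choice(m1_idx, size=3 * 80, replace=False)    # 240 voxels, sampled without replacement
groups = chosen.reshape(3, 80)                             # three nonoverlapping groups of 80 voxels
```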
Face-selective ROI analysis
Face perception requires the combination of both static and dynamic information (Dobs et al., 2014). In addition, some face-selective regions have been found to represent identity during the perception of both visual and auditory stimuli (Anzellotti and Caramazza, 2017). Therefore, we applied the combined-minus-max approach to investigate the MSD effect in face-selective regions (Kanwisher et al., 2002; Yovel, 2016).
We used the first run of the category localizer to identify three face-selective ROIs: the occipital face area (OFA), the fusiform face area (FFA), and the face-selective posterior superior temporal sulcus (STS). Data were modeled with a standard GLM using FSL FEAT (Woolrich et al., 2001). Each seed ROI was defined as a sphere with a 9 mm radius centered on the peak for the contrast faces > bodies, artifacts, scenes, and scrambled images. Data from the left and right hemispheres were combined for each ROI, and the 80 voxels showing the highest z-values for the contrast were selected. Visualizations of these ROIs can be found in Figure 3a. We then analyzed the variance explained for each voxel in these face-selective ROIs across our three pairings (posterior dorsal stream and auditory stream, posterior ventral stream and auditory stream, and posterior dorsal stream and posterior ventral stream).
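A minimal sketch of the sphere-plus-top-voxel selection for one face-selective ROI and one hemisphere, assuming a z-statistic map for the faces > other categories contrast and a peak location expressed in voxel indices (file name and peak coordinates are hypothetical):

```python
import numpy as np
import nibabel as nib

z_img = nib.load("sub-01_faces_vs_other_zstat.nii.gz")     # hypothetical contrast map
z = z_img.get_fdata()
voxel_size = np.array(z_img.header.get_zooms()[:3])        # voxel dimensions in mm

peak_vox = np.array([30, 20, 15])                          # hypothetical peak, in voxel indices
grid = np.indices(z.shape).reshape(3, -1).T                # all voxel coordinates
dist_mm = np.linalg.norm((grid - peak_vox) * voxel_size, axis=1)
in_sphere = dist_mm <= 9.0                                 # 9 mm radius sphere around the peak

z_flat = np.where(in_sphere, z.ravel(), -np.inf)
top80 = np.argsort(z_flat)[-80:]                           # 80 voxels with the highest z within the sphere
roi_mask = np.zeros(z.size, dtype=bool)
roi_mask[top80] = True
roi_mask = roi_mask.reshape(z.shape)
```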
Code/software accessibility
The code to implement the analysis can be obtained at https://github.com/sccnlab/PyMVPD. A description of the code can be found in Fang et al. (2022).
Results
STS combines information from auditory regions with information from different visual streams
To identify brain regions that jointly encode information from different streams, we calculated the MSD index for each voxel. This index was computed as the difference between the proportion of variance explained by the combined model and that explained by the max model (see Materials and Methods for a detailed explanation of the “combined-minus-max” approach). Group-level analyses were used to identify voxels with MSD indices significantly greater than zero. These voxels were considered candidate MSD brain regions. Clusters with peaks having p < 0.05 (FWE corrected) were included.
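As a simplified illustration of this group-level step, a voxelwise one-sample t-test across subjects could be computed as below; note that the actual analysis additionally applied cluster-level FWE correction, which is not shown here, and the file name is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_1samp

# per-subject MSD maps, shape (n_subjects, n_voxels); hypothetical file name
msd_maps = np.load("msd_aud_dorsal_all_subjects.npy")
t_vals, p_vals = ttest_1samp(msd_maps, popmean=0.0, axis=0, alternative="greater")
candidate_voxels = p_vals < 0.05                           # uncorrected; FWE correction not shown
```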
To ensure that the combined model's predictive accuracy was not merely due to the larger number of voxels used in comparison with the max analysis, we conducted a control analysis. In the control analysis, we used three nonoverlapping groups of 80 voxels from the primary motor cortex (M1) as predictors, matching the number of voxels used from the auditory cortex and two visual streams in the main analyses. We then ran the combined-minus-max analysis with these M1 voxel groups and obtained statistical significance for each gray matter voxel across subjects.
The control analysis showed significant effects in the sensorimotor cortex (peak MNI coordinates = [0, −21, 64], [33, −42, 67], [−39, −18, 41]), premotor cortex (peak MNI coordinates = [−57, −9, 44], [57, 12, 31]), the bilateral intraparietal sulcus (peak MNI coordinates = [30, −69, 54], [−24, −72, 50]), and the angular gyrus (peak MNI coordinates = [−45, −69, 37]). Importantly, the control analyses did not show significant effects in ventral and lateral occipitotemporal regions. Therefore, significant findings in these regions in the main analysis could not be explained just by a difference between the number of predictor voxels in the combined analysis and the max analysis. Voxels that yielded significant effects in the control analysis (p < 0.05, FWE corrected) were excluded before calculating the MSD indices in the main analysis.
Combining response patterns from auditory regions and the early dorsal stream revealed significant effects in the bilateral STS (peak MNI coordinates = [−66, −42, 4], [45, −57, 18]) and within the posterior cingulate cortex (peak MNI coordinates = [15, −27, 41]; Table 1; p < 0.05, FWE corrected). Combining responses from auditory regions and the early ventral stream also revealed effects in the right STS (Table 2; p < 0.05, FWE corrected), but in a more posterior portion (peak MNI coordinates = [48, −57, 8]), at the boundary with the occipital lobe (Fig. 2a).
Figure 2. a, Voxels showing significant effects (p < 0.05, FWE corrected) for the combination of auditory responses with responses in V3d and V5 (red) and auditory responses with responses in V3v and V4 (green). b, Voxels showing significant effects for the combination of responses in V3v and V4 with responses in V3d and V5 (blue). c, Fisher transformed Pearson’s correlation values between the auditory + dorsal and auditory + ventral combined-minus-max models, computed across the top 50 voxels in the STS (left) and the top 100 voxels across the whole brain (right) showing the greatest change in variance explained across both models. d, Pearson’s correlation values between combined-minus-max effect patterns from the auditory + dorsal and auditory + ventral models within an STS ROI. We computed these correlations across 500 splits of the participants into two equal groups, comparing pattern similarity within the same model across splits (e.g., AUD + dorsal and AUD + dorsal) to the similarity of patterns between different models across splits (e.g., AUD + dorsal in split 1 to AUD + ventral in split 2: “AD1/AV2”).
Table 1. Regions combining responses between auditory regions and V3d and V5 showing significant t values (p < 0.01, FWE corrected) computed from the combined-minus-max analysis
Table 2. Regions combining responses between auditory regions and V3v and V4 showing significant t values (p < 0.01, FWE corrected) computed from the combined-minus-max analysis
These findings indicate that auditory information is not combined with information from both visual streams within one single STS hub. Instead, distinct portions of STS combine information from auditory regions and information from ventral and dorsal visual regions, respectively.
Robustness of the results across different data splits
In order to further evaluate the robustness of the results, we defined a broad bilateral STS region of interest using the “Superior Temporal Gyrus” map from the WFU PickAtlas. Participants were then randomly split into two equal groups (500 splits; Fig. 2d), and within each half we extracted the patterns of the combined-minus-max effects across voxels as vectors. For each split of the data, this procedure yielded a vector for the auditory + dorsal combined-minus-max results and another vector for the auditory + ventral combined-minus-max results in each half. The robustness of the patterns was assessed by computing Pearson’s correlation between the vectors for the two halves. The correlation for vectors from the same analysis (e.g., between the first and second halves of the auditory + dorsal analysis) was compared with the correlation for vectors from different analyses (e.g., between the first half of the auditory + dorsal analysis and the second half of the auditory + ventral analysis), following a procedure inspired by prior work (Haxby et al., 2001). If the results are robust across different splits of the data, we would expect higher correlations between the patterns for the same analysis across the two halves than between the patterns for two different analyses. The results were in line with this prediction: correlations between the vectors for the same analysis were higher than correlations between vectors for different analyses across splits (Fig. 2d).
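A minimal sketch of this split-half comparison, assuming per-subject combined-minus-max vectors over the STS voxels for the two analyses; averaging the maps within each half of a split before correlating them is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def split_half_correlations(aud_dorsal, aud_ventral, n_splits=500):
    """aud_dorsal, aud_ventral: arrays of shape (n_subjects, n_sts_voxels)
    containing per-subject combined-minus-max values within the STS ROI."""
    n_sub = aud_dorsal.shape[0]
    same, different = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n_sub)
        g1, g2 = perm[: n_sub // 2], perm[n_sub // 2 :]
        ad1, ad2 = aud_dorsal[g1].mean(0), aud_dorsal[g2].mean(0)
        av2 = aud_ventral[g2].mean(0)
        same.append(np.corrcoef(ad1, ad2)[0, 1])        # same analysis across the two halves
        different.append(np.corrcoef(ad1, av2)[0, 1])   # different analyses across the two halves
    return np.array(same), np.array(different)
```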
Quantifying distinct spatial distributions of auditory + ventral and auditory + dorsal effects
Using the STS ROI introduced in the previous section, for each subject individually, we retrieved the 50 voxels showing the greatest change in variance explained across the auditory + dorsal and auditory + ventral combined-minus-max models, and computed the Fisher-transformed Pearson’s correlation between the two models’ combined-minus-max values across these voxels (Fig. 2c, left). An analogous analysis was conducted using the 100 voxels across the whole brain showing the greatest change in variance explained across both models (Fig. 2c, right). Low correlations would indicate that the auditory + dorsal and auditory + ventral effects have distinct spatial distributions.
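A minimal sketch of this computation for a single subject; the rule used here to pick the voxels showing the greatest change across both models is one possible interpretation, and the whole-brain version is analogous with the top 100 voxels.

```python
import numpy as np

def fisher_z_top_voxel_corr(msd_aud_dorsal_sts, msd_aud_ventral_sts, k=50):
    """Inputs: per-voxel combined-minus-max values within the STS ROI for one subject."""
    # voxels with the largest combined-minus-max effect in either analysis (one interpretation)
    top = np.argsort(np.maximum(msd_aud_dorsal_sts, msd_aud_ventral_sts))[-k:]
    r = np.corrcoef(msd_aud_dorsal_sts[top], msd_aud_ventral_sts[top])[0, 1]
    return np.arctanh(r)                                   # Fisher z transform
```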
Ventral temporal cortex combines information from different visual streams
These results raise the question of whether and where information from early dorsal (V3d and V5) and ventral (V3v and V4) visual regions is combined. We adopted the same strategy to test this, searching for voxels that are better predicted by both visual streams jointly than by either stream in isolation. This analysis identified regions in the calcarine sulcus (V1 and V2), located upstream of V3, V4, and V5, as well as regions in ventral occipitotemporal cortex, located downstream (peak MNI coordinates = [21, −102, 1]; Table 3; p < 0.05, FWE corrected; Fig. 2b). Notably, no effects for the combination of the two visual streams were observed in the STS. This is consistent with the conclusion that the combination of auditory information with the different visual streams involves distinct cortical regions: if auditory information were combined with both visual streams in a single STS subregion, we would also expect to observe effects in that subregion for the combination of the two visual streams.
Table 3. Regions combining responses between V3v and V4, and V3d and V5, showing significant t values (p < 0.01, FWE corrected) computed from the combined-minus-max analysis
Combination of visual and auditory information outside the STS
Our results also suggest the involvement of brain regions outside of the STS in combining audio-visual information. The combined-minus-max analysis for the combination of auditory and the early dorsal visual stream responses also identified brain regions in the anterior temporal lobe (ATL; peak MNI coordinates = [−54, −6, −15]), the primary somatosensory cortex (S1; peak MNI coordinates = [3, −42, 61]), the supramarginal gyrus (peak MNI coordinates = [−54, −6, −15]), and the retrosplenial cortex (peak MNI coordinates = [30, −54, 4]).
The combined-minus-max analysis of auditory and early ventral visual stream responses revealed brain regions in the intraparietal sulcus (IPS; peak MNI coordinates = [−39, −51, 57]), retrosplenial cortex (peak MNI coordinates = [6, −42, 4]), caudate nucleus (peak MNI coordinates = [15, −9, 24]), and the lingual gyrus (peak MNI coordinates = [−27, −57, 4]).
The combined-minus-max analysis for the posterior dorsal and posterior ventral visual stream response pairings identified a distinct set of brain regions compared with the previous two analyses. The largest cluster size was located in V1 (peak MNI coordinates = [21, −102, 1]; Fig. 2b). Other brain regions included the bilateral parahippocampal place area (PPA; peak MNI coordinates = [−30, −48, −9], [30, −51, −9]) and the cerebellum (peak MNI coordinates = [−24, −78, −25]).
Inspecting the combined-minus-max maps of individual participants in search of other regions that might show these effects, and that might not appear in the second-level analyses due to greater topographic variability across individuals, did not reveal other clear candidate regions. This does not rule out that additional regions might be identified in the future using more powerful data acquisition and analysis methods.
Overlap between the auditory + ventral and auditory + dorsal effects was observed in the posterior cingulate and in the pulvinar in some individual participants, but these effects were variable across participants; further work will be needed to establish whether these regions combine auditory information with both ventral and dorsal representations. To probe for three-way combination effects across auditory, ventral, and dorsal regions, we conducted a combined-minus-max analysis using the three regions combined minus the maximum across the three different pairs (auditory + dorsal combined, auditory + ventral combined, dorsal + ventral combined). Statistical nonparametric analysis did not reveal any effects surpassing the threshold (FWE corrected, p < 0.05); future work with more sensitive methods or greater statistical power might reveal such effects.
Combination of information from auditory regions and different visual streams within face-selective ROIs
Considering the importance of combining facial information with auditory information for the recognition of speech and emotions (Gentilucci and Cattaneo, 2005; Piwek et al., 2015), we studied the combination of auditory and visual representations from different streams within functionally localized face-selective regions (Fig. 3a). In the face-selective STS, the effect of combining auditory and dorsal responses was significantly greater than that of combining auditory and ventral responses (t(13) = 3.82, p < 0.05) and than that of combining ventral and dorsal responses (t(13) = 4.55, p < 0.01; Fig. 3b, top panel). This finding could be due to the type of visual information encoded in V3d and V5: previous work has shown that these regions respond to motion (Felleman and Van Essen, 1987; Born and Bradley, 2005). Combining information about visual motion with auditory information might support audio-visual integration during speech perception and emotion recognition.
Figure 3. a, Face-selective ROIs: STS, FFA, and OFA. b, Box plots depicting the difference in variance explained between the “combined” and “max” analyses across subjects in different face-selective ROI voxels. * signifies p < 0.05, ** signifies p < 0.01, and *** signifies p < 0.001. Significantly higher combined-minus-max effects were observed in the face-selective STS for the combination of the auditory and posterior dorsal stream than for the other pairings. No significant differences were observed in the FFA across the different pairings. Significantly higher combined-minus-max effects were observed in the OFA for the combination of the posterior dorsal and posterior ventral streams than for the other pairings.
Unlike the face-selective STS, the fusiform face area (FFA) did not show significant differences between the pairwise combinations (Fig. 3b, middle panel). In the occipital face area (OFA), the effect of combining information from the two visual streams was significantly stronger than combining auditory and dorsal visual responses (t(13) = 5.11, p < 0.01) and than combining auditory and ventral visual responses (t(13) = 6.73, p < 0.001; Fig. 3b, bottom panel).
Discussion
Audio-visual integration is a fundamental process that allows for the unified perception of everyday experiences. Given that distinct visual streams encode different kinds of representations, this study sought to uncover what visual representations are combined with auditory information when engaging in audio-visual integration and what brain regions support the combination of responses from auditory regions and the different visual streams. The results demonstrate that both ventral and dorsal visual information is combined with auditory information but that distinct portions of posterior STS combine auditory information with visual information encoded in the two streams. The topography of combined-minus-max effects observed in the STS could be related to the types of features encoded in dorsal and ventral visual regions. Importantly, however, these results are only possible in the presence of audio-visual combination effects. If posterior STS encoded visual features that are well predicted by dorsal visual regions in isolation, and anterior STS encoded visual features that are well predicted by ventral visual regions in isolation, subtracting the max in the combined-minus-max analysis would remove these effects.
Which specific factors drive the observed topography of STS effects remains an open question. Meta-analyses suggest that different portions of posterior STS play different functional roles, including audio-visual integration, biological motion perception, theory of mind, and face processing (Hein and Knight, 2008). Meta-analyses, however, make it difficult to assess the degree of overlap between areas engaged in different functions: since different functions are probed in different participants, variability in response locations due to different functions is confounded with variability arising from individual differences. More recently, the investigation of multiple stimulus types within the same participants led to a more precise characterization of the distinct portions of the STS responsible for processing language, theory of mind, faces, voices, and biological motion (Deen et al., 2015). Relevant to the present results, Deen et al. (2015) analyzed posterior-to-anterior changes in functional specialization in posterior STS, observing greater responses for theory of mind tasks in more posterior portions, followed by biological motion, and ultimately by greater responses to faces and voices in anterior portions. The posterior-to-anterior organization observed in the present study, therefore, could indicate that different visual inputs are combined with auditory representations to serve the needs of distinct functional subsystems that occupy adjacent areas within STS. In order to study the relationship between the topography of the effects we identified in the present work and other functional subdivisions of STS, it will be necessary to perform both sets of analyses within the same group of participants.
Previous research on ventral stream representations suggests a possible functional role for the more posterior of the two STS hubs identified in this study. Effects for the combination of auditory information and the ventral visual stream were observed in a more posterior portion of the STS, and previous research has implicated the ventral visual stream in the recognition of the identity of objects (Ungerleider, 1982). Posterior portions of the STS that combine information from ventral visual regions and auditory regions might contribute to encoding the typical sounds produced by different kinds of objects, associating dogs with barking, cars with vrooming, and so on. In contrast, more anterior portions might encode the way different movements are associated with sounds—even when the identity or category of an object is held constant. For example, in face perception, the relationship between lip movements and phonemes is known to involve audio-visual integration mechanisms that lead to phenomena such as the McGurk effect (McGurk and MacDonald, 1976). In many other instances, sounds are produced by the dynamic interactions between multiple objects. Experiments with tailored designs, which include distinct conditions that separate between these different kinds of audio-visual information, will be needed to test this hypothesis. As an alternative hypothesis, the organization of the combination of auditory and visual information into two distinct portions of posterior STS might not be due to their engagement in supporting different functions, but to unique computational requirements of integrating auditory representations with different kinds of visual representations.
Focusing on face-selective regions of interest, we found that the combination of audio-visual information in the face-selective STS relies disproportionately on visual information encoded in dorsal visual regions. This is consistent with the observation that effects for the combination of auditory information with visual information from dorsal regions were located in more anterior portions of posterior STS in our whole-brain analyses and with the previous studies indicating that face responses also peak in more anterior portions of posterior STS (Deen et al., 2015). The latter finding could be due to the type of visual information encoded in V3d and V5: previous work has shown that these regions contain neurons that respond to motion (Felleman and Van Essen, 1987; Born and Bradley, 2005). Combining information about visual motion with auditory information might support audio-visual integration during speech perception. It will be interesting to test whether the effects for the combination of auditory information and dorsal visual representations reported here are localized to the same voxels showing an association with individual differences in susceptibility to the McGurk effect reported in previous work (Nath and Beauchamp, 2012).
Finally, the combination of visual information from the two visual streams was observed in ventral occipitotemporal cortex, and ROI analyses showed that the extent of these effects includes the OFA. Classical work has emphasized the importance of motion for identifying and segmenting objects (Spelke, 1990), leading to recent computational models of motion-based segmentation (Chen et al., 2022). We hypothesize that the combination of information from the two visual streams within occipitotemporal cortex could support motion-based segmentation. Considering that these effects are colocalized with the earliest stages of category selectivity (e.g., the OFA), we further hypothesize that motion-based segmentation might provide a basis for category selectivity.
Our findings also implicate brain regions beyond the STS. Regarding the candidate MSD sites that were statistically dependent on information from the auditory and posterior ventral streams, the intraparietal sulcus (IPS) was the region with the highest t value. This region has been implicated in audio-visual integration in prior work (Lewis et al., 2000; Calvert et al., 2001).
Methodologically, it is worth noting that the results obtained from the MVPN combined-minus-max analyses only establish correlational relationships. To establish causality between the joint responses from the auditory and different visual streams in MSD sites, future research could employ techniques that support causal inference, such as transcranial magnetic stimulation-fMRI (TMS-fMRI). Further, our method shows that two regions jointly contribute to predicting responses in a third region (i.e., statistical dependence), but we cannot determine precisely whether and how this information is integrated into a multimodal representation. In addition, we used a five-layer dense neural network to model multivariate pattern dependence across all ROI sets tested in this study. However, it is possible that the optimal model architecture for capturing brain interactions may differ depending on the specific set of predictor regions. Future work using different neural network architectures may uncover additional effects. Despite these limitations, the results reveal a novel aspect of the large-scale topography of STS and provide insights into the neural architecture that supports our unified perception of the world.
The present work provides evidence for distinct portions of the multisensory posterior STS: a more posterior portion characterized by the combination of auditory and ventral representations and a more anterior portion characterized by the combination of auditory and dorsal representations. Clarifying the functional and causal contributions of these subdivisions of STS to behavior will require additional work, including importantly studies with causal methodologies.
Footnotes
We thank Wei Qiu for technical support. We also thank the StudyForrest researchers for sharing their data. This work was supported by a startup grant from Boston College and by National Science Foundation grant 19438672 to S.A.
*G.F. and M.F. contributed equally to this work.
The authors declare no competing financial interests.
Correspondence should be addressed to Stefano Anzellotti at stefano.anzellotti@bc.edu.