Abstract
Faces and bodies are processed in separate but adjacent regions in the primate visual cortex. Yet, the functional significance of dividing the whole person into areas dedicated to its face and body components, and of their neighboring locations, remains unknown. Here we hypothesized that this separation and proximity, together with a normalization mechanism, generate clutter-tolerant representations of the face, body, and whole person when presented in complex multi-category scenes. To test this hypothesis, we conducted an fMRI study in which we presented images of a person within a multi-category scene to human male and female participants and assessed the contribution of each component to the response to the scene. Our results revealed a clutter-tolerant representation of the whole person in areas selective for both faces and bodies, typically located at the border between the two category-selective regions. Regions exclusively selective for faces or bodies demonstrated clutter-tolerant representations of their preferred category, corroborating earlier findings. Thus, the adjacent locations of face- and body-selective areas provide hardwired machinery for decluttering the whole person, without the need for a dedicated population of person-selective neurons. This distinct yet proximal functional organization of category-selective brain regions enhances the representation of the socially significant whole person, along with its face and body components, within multi-category scenes.
Significance Statement
It is well established that faces and bodies are processed by dedicated brain areas that reside in nearby locations in the primate high-level visual cortex. However, the functional significance of the division of the whole person into its face and body components, of their neighboring locations, and of the absence of a distinct neuronal population selective for the meaningful whole person has remained puzzling. Here we propose a unified solution to these fundamental open questions. We show that, consistent with the predictions of a normalization mechanism, this functional organization provides hardwired machinery for decluttering the face, the body, and the whole person. This generates enhanced processing of the socially meaningful whole person and its significant face and body components in multi-category scenes.
Introduction
Intact processing of faces and bodies is critical for effective social interactions. The functional separation and anatomical proximity of face- and body-selective brain areas in the human and monkey high-level visual cortex are well established (Pinsk et al., 2005, 2009; Schwarzlose et al., 2005; Weiner and Grill-Spector, 2013; Harry et al., 2016; Premereur et al., 2016; Foster et al., 2021; Zafirova et al., 2022). Yet, the functional significance of this anatomical organization has remained unclear (for recent reviews, see Hu et al., 2020; Taubert et al., 2022). Why are faces and bodies processed by dedicated, distinct mechanisms, despite their natural co-occurrence in the whole person? Why, despite the significance of the whole person for social perception, has a distinct population of person-selective neurons or brain areas not been reported so far? Why do face- and body-selective regions reside in nearby locations? Here we propose a unified account for these questions. In particular, we test the hypothesis that the division of the whole person into distinct but proximally located face- and body-selective areas supports the generation of clutter-tolerant representations of the face alone, the body alone, or the whole person when presented in multi-category scenes (Fig. 1). This mechanism eliminates the need for an additional population of person-selective neurons.
Predicted response of single neurons to a multi-category scene in face- and body-selective cortex. a, The functional organization of face- and body-selective areas in the VTC in a representative subject in the MNI space. Colors indicate category-selective voxels: voxels selective only to faces (red), only to bodies (blue), or to both faces and bodies in the border between them (purple). b, A multi-category scene composed of a face, a body, a chair, and a room and the normalization equation representing the response of a single neuron to that scene. c, A mathematical equivalent of (b), which shows the predicted contribution (weight, β) of each category to the response to the multi-category scene. According to this, the contribution of each category is determined by the sum of responses of the surrounding neurons (i.e., the normalization pool) to that category relative to the sum of responses of the surrounding neurons to all categories. Equation is shown for
A central challenge for the visual system is to generate a veridical representation of objects in multi-category scenes. The effect of clutter is reflected in a reduced neural response to an object when it is presented with other objects relative to when it is presented alone (Miller et al., 1993; Rolls and Tovee, 1995; Zoccolan et al., 2005; MacEvoy and Epstein, 2009, 2011; Bao and Tsao, 2018). Interestingly, this reduced response was not found in category-selective areas, where the response to the preferred category remained unaffected when it was presented with other categories (Reddy and Kanwisher, 2007; Bao and Tsao, 2018). A normalization mechanism was suggested to account for these findings (Reynolds and Heeger, 2009; Heeger, 2011; Carandini and Heeger, 2012; Bao and Tsao, 2018). This mechanism posits that the response of a neuron is normalized by the response of its neighboring neurons (i.e., the normalization pool; Fig. 1b,c). Therefore, when a neuron is surrounded by neurons that are selective for its nonpreferred categories, its response to the simultaneous presentation of the two categories is reduced relative to its response to each object alone (Zoccolan et al., 2005). However, when the surrounding neurons are selective for the same category (i.e., a homogeneous normalization pool), as typically found in category-selective regions, the response to a preferred and a nonpreferred stimulus presented together is similar to the response to the preferred stimulus presented alone (i.e., a max response). This essentially generates a clutter-tolerant representation of the preferred category (Fig. 1d,e; Reddy and Kanwisher, 2007; Bao and Tsao, 2018; Kliger and Yovel, 2020). These findings offer a mechanistic account for the advantage of clustering neurons that are selective for significant categories, such as faces or bodies.
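The intuition behind this clutter-tolerance prediction can be illustrated with a toy simulation of divisive normalization. This is a minimal sketch under hypothetical drive values, not the authors' model code: a neuron's response is its summed stimulus drive divided by the pooled drive of its neighbors, so a homogeneous pool yields a max-like (clutter-tolerant) response, whereas a mixed pool yields a reduced response to the pair.

```python
# Toy sketch of divisive normalization (Reynolds and Heeger, 2009).
# All drive values are hypothetical, chosen to illustrate the qualitative effect.

def normalized_response(drive, pool_drives, sigma=1.0):
    """One neuron's response: its own drive divided by the pooled drive."""
    return drive / (sigma + sum(pool_drives))

FACE, CHAIR = 10.0, 1.0  # drive of a face-selective neuron to each stimulus

# Homogeneous pool: 10 face-selective neighbors.
homog_face = normalized_response(FACE, [FACE] * 10)
homog_pair = normalized_response(FACE + CHAIR, [FACE + CHAIR] * 10)

# Mixed pool: 5 face-selective and 5 chair-selective neighbors.
# For the pair, every neighbor's drive sums to FACE + CHAIR regardless of type.
mixed_face = normalized_response(FACE, [FACE] * 5 + [CHAIR] * 5)
mixed_pair = normalized_response(FACE + CHAIR, [FACE + CHAIR] * 10)

# Homogeneous pool: adding the chair barely changes the response (clutter tolerance).
# Mixed pool: the pair response drops well below the face-alone response.
```

With these values the homogeneous-pool responses to the face alone and to the face-plus-chair pair are nearly identical, while in the mixed pool the pair response falls clearly below the face-alone response, mirroring the single-unit findings described above.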
Here we propose that the proximity of clusters of face- and body-selective neurons together with the same normalization mechanism further enables a hardwired clutter-tolerant representation of the whole person (Fig. 1f). This is enabled by the presence of both face-selective neurons and body-selective neurons in the normalization pool (see derivations of the normalization equations in Appendix), as typically found in the border between face- and body-selective areas (Kliger and Yovel, 2020). To test this prediction, we presented the whole person in a multi-category scene and assessed whether the representations of the multi-category scene are biased to the whole person in areas that are selective to both the face and the body (Fig. 1f). These findings would indicate that the adjacent locations of face- and body-selective clusters of neurons generate a clutter-tolerant representation for the face alone, the body alone, or the whole person when presented in multi-category scenes.
Materials and Methods
Participants
Fifteen healthy volunteers (three women, ages 21–31, one left-handed) with normal or corrected-to-normal vision participated in both experiments. Participants were paid $15/h. All participants provided written informed consent to take part in the study, which was approved by the ethics committees of the Sheba Medical Center and Tel Aviv University and performed in accordance with relevant guidelines and regulations. The sample size for each experiment (N = 15) chosen for this study was similar to the sample size of other fMRI studies that examined the representation of multiple objects in the high-level visual cortex (10–15 subjects per experiment; MacEvoy and Epstein, 2009, 2011; Reddy et al., 2009; Baeck et al., 2013; Song et al., 2013; Kaiser et al., 2014; Baldassano et al., 2016; Kaiser and Peelen, 2018; Kliger and Yovel, 2020).
Stimuli
Main experiment
The stimulus set consisted of grayscale images of a multi-category scene as well as its isolated parts: face, body, chair, and room (Fig. 2a). The face and body stimuli were created from seven grayscale images, downloaded from the internet, of a whole person standing in an upright frontal posture with the background removed (Kliger and Yovel, 2020). Each image of a person was cut into two parts at the neck, resulting in a face stimulus and a headless body stimulus for each identity. The chair stimuli included seven images of chairs downloaded from the internet and scaled to match the common proportions between a standing person and a chair. The face, body, and chair stimuli were presented on a gray background. The room stimuli included seven empty rooms created using the website https://roomstyler.com/3dplanner and converted to grayscale. The contrast and luminance of the rooms were scaled to match those of a chair including its gray background. The multi-category scene stimuli included seven images of a person inside a room, with the person standing behind a chair (preserving real-life proportions and composition between the person and the chair) at the center of the room. The single-category stimuli were presented at the exact same locations on the screen as they appeared within the multi-category scene. A fixation point was presented in the upper central part of all stimuli, at a location corresponding to the lower part of the neck of the standing person. The size of the multi-category scene images was 13.6 × 13.6° of visual angle.
Example of the stimuli used in the experiment. a, Stimuli of the main experiment, including isolated faces, bodies, chairs, and rooms and multi-category scenes composed of all of the isolated categories. The isolated categories were placed in the same location in the visual field as they appeared in the multi-category scene, and subjects were instructed to maintain fixation on the blue fixation dot throughout the experiment. These stimuli were used to estimate the contribution of each of the isolated categories to the response to the multi-category scene. b, Stimuli of the functional localizer. Stimuli included images of faces, bodies, outdoor scenes, nonliving objects, and scrambled objects. These stimuli were used to assess the magnitude of category selectivity of each voxel to each category.
Functional localizer stimuli
Functional localizer stimuli were grayscale images of faces, headless bodies, outdoor scenes, nonliving objects, and scrambled images of these objects (Fig. 2b). Each category consisted of 80 different images. The size of the stimuli was ∼5.5 × 5.5° of visual angle.
Apparatus and procedure
fMRI acquisition parameters
fMRI data were acquired in a 3 T Siemens MAGNETOM Prisma MRI scanner at Tel Aviv University, using a 64-channel head coil. Echo-planar volumes were acquired with the following parameters: repetition time (TR) = 1 s, echo time = 34 ms, flip angle = 60°, 66 slices per TR, multiband acceleration factor = 6, slice thickness = 2 mm, field of view = 20 cm, and 100 × 100 matrix, resulting in a voxel size of 2 × 2 × 2 mm. Stimuli were presented with MATLAB (MathWorks) and Psychtoolbox (Brainard, 1997; Kleiner et al., 2007) and displayed on a 32″ high-definition LCD screen (NordicNeuroLab) viewed by the participants at a distance of 155 cm through a mirror located in the scanner. Anatomical MPRAGE images were collected with 1 × 1 × 1 mm resolution, echo time = 2.45 ms, and TR = 2.53 s.
Experimental procedure
The study included a single recording session with six runs of the main experiment and three runs of the functional localizer. Each of the six main experiment runs included 15 pseudorandomized miniblocks, three of each of the following experimental conditions: face, body, chair, room, and the multi-category stimulus, as described in the stimuli section (Fig. 2a). Each miniblock included eight stimuli, of which seven were of different images and one image repeated for the 1-back task. Each miniblock lasted 6 s and was followed by 12 s of fixation. A single stimulus display time was 0.375 s, and the interstimulus interval was 0.375 s. Subjects performed a 1-back task (one repeated stimulus in each block). Each run began with a 6 s (6 TRs) fixation (dummy scan) and lasted a total of 276 s (276 TRs). Subjects were instructed to maintain fixation throughout the run, and their eye movements were recorded with an eye tracker (EyeLink®).
Each functional localizer run included 21 blocks: five baseline fixation blocks and four blocks for each of the five experimental conditions: faces, bodies, nature scenes, objects, and scrambled objects. Each block presented 20 stimuli drawn from 18 different images, of which two were repeated for the 1-back task. Each stimulus was presented for 0.4 s with a 0.4 s interstimulus interval. Each block lasted 16 s. Each run began with a 6 s fixation (6 TRs) and lasted a total of 342 s (342 TRs).
Data analyses
fMRI data analysis and preprocessing
fMRI analysis was performed using SPM12 software, custom MATLAB (MathWorks) and R (R Development Core Team, 2011) scripts, FreeSurfer (Dale et al., 1999), pySurfer (https://pysurfer.github.io), and custom Python (http://www.python.org) scripts for surface generation and presentation. The code used for data analyses is available at https://github.com/gylab-TAU/Multi_Category_Scenes_fMRI_analysis. The first six volumes of each run were acquired during a blank-screen display and were discarded from the analysis as "dummy scans." The data were then preprocessed using realignment to the mean of the functional volumes and coregistration to the anatomical image (rigid-body transformation), followed by spatial normalization to the MNI space. Spatial smoothing (5 mm) was applied to the localizer data only. A GLM was fitted with separate regressors for each run and each condition, including 24 nuisance motion regressors per run (the six rigid-body motion parameters, their six derivatives, the six squared motion parameters, and the six derivatives of the squared parameters) and a baseline regressor for each run. In addition, a "scrubbing" method (Power et al., 2012) was applied to every volume with framewise displacement >0.9 by adding a nuisance regressor with a value of 1 for that specific volume and zeros for all other volumes. The percent signal change (PSC) for each voxel was calculated for each experimental condition in each run by dividing the beta weight of that regressor by the beta weight of the baseline for that run.
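The scrubbing step can be sketched as follows. This is our own hedged illustration (the function name and the default threshold argument are ours, not from the released code): one spike regressor is built per high-motion volume, exactly as described above.

```python
import numpy as np

def scrubbing_regressors(framewise_displacement, threshold=0.9):
    """One column per flagged volume: 1 at that volume, 0 elsewhere (Power et al., 2012)."""
    fd = np.asarray(framewise_displacement, dtype=float)
    bad = np.flatnonzero(fd > threshold)        # indices of high-motion volumes
    regs = np.zeros((fd.size, bad.size))
    regs[bad, np.arange(bad.size)] = 1.0        # one spike regressor per flagged volume
    return regs
```

Each returned column would be appended to the run's GLM design matrix as an additional nuisance regressor.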
Linear model fitting
The mean PSC across runs for the face, body, room, chair, and multi-category scene conditions in the main experiment was extracted for each voxel of each subject. For each subject, we defined a moving mask of a sphere of 27 voxels (i.e., a 3 × 3 × 3 grid). We used a relatively small sphere of 27 voxels to ensure that the voxels within each sphere are homogeneous in terms of their category selectivity. For each sphere, we fitted a linear model with its voxel data as features (i.e., the PSC in each of these voxels) to predict the response to the multi-category scene based on the responses to the isolated categories:

PSC_scene = β_face · PSC_face + β_body · PSC_body + β_chair · PSC_chair + β_room · PSC_room + ε. (1)
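The sphere-wise regression described above can be sketched with simulated data (this is not the authors' pipeline; the PSC values and contribution weights below are made up): within one 27-voxel sphere, the voxels' PSCs to the isolated categories serve as predictors of their PSC to the multi-category scene, and the fitted betas are the per-category contributions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows: the 27 voxels of one sphere. Columns: simulated PSC to the isolated
# face, body, chair, and room stimuli.
X = rng.uniform(0.0, 2.0, size=(27, 4))
true_beta = np.array([0.8, 0.7, 0.1, 0.1])   # hypothetical contribution weights
scene = X @ true_beta                         # noiseless multi-category response

# Least-squares fit recovers the contribution (beta) of each category.
beta, *_ = np.linalg.lstsq(X, scene, rcond=None)
```

In the noiseless case the fit recovers the generating weights exactly; with real fMRI noise the betas are estimates whose quality is summarized by the per-sphere R² reported in the Results.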
Anatomical regions of interest (anatomical ROI) definition
We defined voxels that belong to the ventrotemporal cortex (VTC) and lateral occipital cortex in the right hemisphere by using a mask based on the Harvard-Oxford Atlas (Frazier et al., 2005; Desikan et al., 2006; Makris et al., 2006; Goldstein et al., 2007). We used the max-probability mask (threshold = 0) with a voxel size of 2 × 2 × 2 mm. The ventrotemporal mask included the following areas from the Harvard-Oxford Atlas: inferior temporal gyrus, posterior division; inferior temporal gyrus, temporo-occipital part; parahippocampal gyrus, anterior division; parahippocampal gyrus, posterior division; temporal fusiform cortex, anterior division; temporal fusiform cortex, posterior division; temporal occipital fusiform cortex; and occipital fusiform gyrus (Labels 14–15, 33–34, and 36–39, respectively). The lateral occipital mask included the following areas from the Harvard-Oxford Atlas: middle temporal gyrus, temporooccipital part; lateral occipital cortex, superior division; and lateral occipital cortex, inferior division (Labels 12 and 21–22, respectively).
We selected the area labeled frontal pole (Label 0) as a control nonvisual area. The number of spheres randomly selected for inclusion in this control area matched, for each participant, the average number of spheres across the category-selective areas.
Voxels definition by category selectivity
Based on the functional localizer data, we estimated the selectivity of each voxel of individual subjects for faces, bodies, and scenes by using the contrast t-maps face > object, body > object, and outdoor scenes > object, respectively. We used only these three categories since their definitions share the same baseline (i.e., they are all compared with objects). We excluded the general object-selective region since the common definition of these areas (objects > scrambled objects) would result in areas that are not category specific but are similarly responsive to all object categories. Within each anatomical ROI (i.e., VTC and lateral occipital cortex), we defined several types of voxels based on their selectivity for these three categories. Face-selective voxels were defined as voxels that are selective for faces over objects (p < 0.0001) and for faces over bodies (p < 0.0001) and not selective for bodies or places (p > 0.01). These criteria ensured that the majority of the neurons within these voxels are face selective. One subject did not have face-selective voxels (i.e., voxels selective only for faces and not for other categories). Similarly, we defined body-selective voxels as voxels selective for bodies over objects (p < 0.0001) and for bodies over faces (p < 0.0001) but not selective for faces or places over objects (p > 0.01). In addition, we defined face- and body-selective voxels by selecting voxels that are selective for both faces and bodies (but not for places). These voxels contain clusters of face-selective neurons and body-selective neurons. All participants showed voxels that were selective to both the face and the body in the VTC. In the lateral occipital cortex, only 5 out of 15 participants showed voxels that are selective to both the face and the body.
Because the main novel contribution of this work is the response of the ROI selective to both faces and bodies, we focused only on the ventral–temporal ROIs and did not include the lateral occipital ROIs in our analyses.
The contribution of each category to the multi-category scene representation
For each subject, we calculated the betas of the model in Equation 1 for spheres of 27 voxels in the category-selective areas described above (see the model-fitting description above). To reduce statistical dependency resulting from the overlapping moving mask, we calculated the mean using an interleaved mask, taking only spheres whose centers are not immediately adjacent to one another. We computed the mean beta across all spheres in an ROI for each participant and across participants. We calculated the variance inflation factor (VIF), which provides a measure of multicollinearity among the predictors, and removed spheres in which the VIF was larger than 10. We then performed repeated-measures ANOVAs to examine the contribution of the isolated categories to the multi-category scene for voxels selective for pairs of categories within the VTC. We used category (face, body, room, and chair) and ROI selectivity (face selective, body selective, and face and body selective) as within-subject factors and the beta coefficients of the multi-category response model as the dependent variable. To test our specific hypothesis with respect to the representation of faces and bodies relative to the other categories in the different ROIs, we used paired t tests. One subject did not have face-selective voxels (i.e., voxels selective only for faces and not for other categories), and one subject did not have body-selective voxels (i.e., voxels selective only for bodies and not for other categories) based on the abovementioned criteria; these subjects were therefore not included in the statistical analyses that compared these ROIs but were included in the analyses of the other category-selective areas.
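The VIF screen can be sketched as below. This is our own minimal implementation, not the authors' code: each predictor column is regressed on the remaining columns (plus an intercept), and VIF_k = 1 / (1 − R²_k); spheres with any VIF above 10 are excluded.

```python
import numpy as np

def vifs(X):
    """Variance inflation factor of each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    out = []
    for k in range(X.shape[1]):
        y = X[:, k]
        # Regress column k on the other columns plus an intercept.
        A = np.column_stack([np.delete(X, k, axis=1), np.ones(X.shape[0])])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - ((y - A @ coef) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Spheres with any VIF > 10 would be dropped before averaging the betas.
```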
Defining nonsaturated voxels
To test whether the BOLD response to the multi-category scene was saturated, we conducted the following analysis. For each voxel, we compared the maximum response (PSC) to a single category with the response to the multi-category scene; a voxel was defined as nonsaturated when

max(PSC_face, PSC_body, PSC_chair, PSC_room) > PSC_scene.
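The voxel-wise comparison described above reduces to a simple check. A minimal sketch with simulated arrays (the function name is ours): a voxel is flagged as nonsaturated when its largest single-category PSC exceeds its multi-category PSC.

```python
import numpy as np

def is_nonsaturated(psc_single, psc_scene):
    """True for voxels whose largest single-category PSC exceeds the scene PSC.

    psc_single: (n_voxels, n_categories) PSCs to face, body, chair, room.
    psc_scene:  (n_voxels,) PSC to the multi-category scene.
    """
    return np.max(psc_single, axis=1) > np.asarray(psc_scene)
```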
Predictions
We measured the fMRI response to a multi-category scene of a person in a room with a chair and to each of its isolated categories (see Fig. 2 and Materials and Methods). The predictions of the response to the multi-category scene according to the normalization model in the different category-selective voxels are specified in Figure 1 (see Appendix for complete mathematical derivations and predictions). The normalization model predicts that in voxels that are selective for either the face or the body—therefore containing one homogeneous population of face- or body-selective neurons—the representation of multi-category scenes will be biased to the preferred category, decluttering nonpreferred stimuli in the scene. In addition, in voxels that are selective to both the face and the body—therefore containing two homogeneous populations of face- and body-selective neurons—the representation of multi-category scenes will be biased to both preferred face and body categories, decluttering nonperson stimuli (chair and room) within the scene.
Results
A clutter-tolerant representation of the whole person in face- and body-selective areas
We defined three types of voxels in the ventrotemporal area based on their selectivity for the isolated categories (see Materials and Methods): (1) face voxels, voxels selective for faces but not for bodies, nonliving objects, or places; (2) body voxels, voxels selective for bodies but not for faces, nonliving objects, or places; and (3) face–body-selective voxels, voxels selective for both faces and bodies but not for nonliving objects and places (usually located at the border between face and body areas; see Materials and Methods and Fig. 1). To assess the contribution of each category to the representation of the multi-category scene, subjects viewed a different, independent set of stimuli containing a multi-category visual scene of a whole person standing next to a chair located inside a room, as well as stimuli of each of the components of the scene shown separately (Fig. 2a). We extracted the mean PSC response from each voxel for each of the isolated-component stimuli. We then used these voxel-level PSCs of each component of the multi-category scene as predictors for a linear model, with the PSC response to the multi-category scene as the predicted variable (see Equation 1).
The beta coefficients of the above model represent the contribution of each of the isolated categories to the response to the multi-category scene. For each subject, we defined a moving mask of a sphere of 27 (3 × 3 × 3) voxels. For each sphere, we fitted a linear model with its voxel data as features to predict the response to the multi-category scene from the responses to each of its components. We included only interleaved spheres to avoid statistical dependency between overlapping spheres. Note that the beta coefficients of the multi-category response model indicate the predicted contribution of each category to the fMRI response to the multi-category scene; they are not the betas derived from the standard fMRI GLM analysis.
Figure 3a–c depicts the contribution of each of the isolated categories to the response to the multi-category scene [i.e., the beta coefficients of the above linear model in (1) for the face-selective, body-selective, and face-body selective voxels of each participant and averaged across participants]. We performed a repeated measures ANOVA with category (face, body, chair, and room) and voxel selectivity (face-selective voxels, body-selective voxels, and face- and body-selective voxels) as within-subject factors and the contribution to the complex scene representation (i.e., the beta coefficients of the linear model) as the dependent variable. We found a significant effect for category [F(3,36) = 35.363; p < 0.0001;
The contribution of single categories to the representation of a multi-category scene in face- and body-selective voxels. a–c, The contribution of each isolated category to the representation of the multi-category scene in the right ventrotemporal face- and body-selective voxels as depicted by the β coefficients of the linear model (Eq. 1) in (a) face-selective voxels, (b) body-selective voxels, and (c) face- and body-selective voxels. Each dot indicates the mean β of a single subject. Gray lines connect the β's of the same subject. Diamonds and error bars indicate the group mean and SEM, respectively. Note that the β's of the linear model are not the betas extracted from the GLM that evaluates the correspondence between the fMRI hemodynamic response and stimulus presentation but indicate the contribution of each category to the response to the multi-category scene.
We performed paired t tests to compare the contribution of the preferred relative to the nonpreferred categories. In line with the predictions of the normalization model (Fig. 1), the contribution of the face to the representation of the multi-category scene in face-selective voxels was higher than the contribution of each of the other categories [
To assess the goodness of fit of the normalization model to the response to the multi-category scene, we computed the R2 for each sphere. Figure 4a–c shows the distribution of the R2 values in face- and body-selective areas. The overall median R2 = 0.817 indicates a good fit of the proposed model to the data from these areas. For comparison, we defined a control area in the frontal lobe that does not show visual category selectivity (the frontal pole; see Materials and Methods). The R2 in this region was much lower (median R2 = 0.333; Fig. 4d), indicating that, consistent with our predictions, the normalization model specifically accounts for the response in the visual category-selective cortex.
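The per-sphere goodness-of-fit measure is the ordinary coefficient of determination, computed from the residuals of the Equation 1 fit; a minimal sketch:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 of the Equation 1 fit within one sphere, from its residuals."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = ((y - y_hat) ** 2).sum()        # residual sum of squares
    ss_tot = ((y - y.mean()) ** 2).sum()     # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives R² = 1, while a model no better than the mean response gives R² = 0, which frames the contrast between the category-selective areas (median 0.817) and the frontal-pole control (median 0.333).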
The distribution and median of
Results reported so far show that the response of voxels that are selective to both faces and bodies shows a different pattern of response than voxels that are either face or body selective. To further demonstrate that the three ROIs show distinct response characteristics, we plotted a scatterplot of the difference between
The contribution of the face and the body to the multi-category scene covaries with face–body selectivity. The scatterplot shows the covariability of the difference in the contribution of the face and the body to the multi-category scene (y-axis) and the difference between face and body selectivity that was independently measured by the functional localizer (x-axis) for each sphere of all subjects.
We performed a mixed-model linear regression with a random intercept for subjects and found a positive association between
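The mixed-model analysis can be illustrated with a simplified numpy stand-in: demeaning both variables within each subject removes the subject-specific (random) intercepts, and an ordinary least-squares slope on the centered data then approximates the fixed-effect slope of a random-intercept model. The data and the generating effect size (0.5) below are simulated, not the study's values.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_spheres = 10, 40

# Simulated per-sphere data: rows are subjects, columns are spheres.
sel = rng.normal(0.0, 1.0, (n_subjects, n_spheres))           # face-minus-body selectivity
offsets = rng.normal(0.0, 0.3, (n_subjects, 1))               # subject random intercepts
diff = 0.5 * sel + offsets + rng.normal(0.0, 0.1, sel.shape)  # beta_face - beta_body

# Within-subject centering removes the random intercepts; the OLS slope on the
# centered data approximates the mixed model's fixed-effect slope.
sel_c = sel - sel.mean(axis=1, keepdims=True)
diff_c = diff - diff.mean(axis=1, keepdims=True)
slope = (sel_c * diff_c).sum() / (sel_c ** 2).sum()
```

A full analysis would instead fit a random-intercept model (e.g., `MixedLM` in statsmodels, or `lmer` in R) to obtain standard errors that respect the grouping; the centering trick only recovers the point estimate of the slope.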
Taken together, these findings are consistent with predictions of the normalization model (Fig. 1), indicating that voxels that are selective for both the face and the body generate a representation that is biased to the whole person, while voxels that are selective for either of the single categories generate a representation that is biased to their preferred category.
Testing an alternative account to the normalization model: a summation model
An alternative summation account for our findings, which does not rely on the assumption of normalization, suggests that if a mixture of category-selective neurons responds independently, their combined response would be the sum of their individual responses to their preferred categories. This would lead each beta value in the model (Eq. 1) to equal 1. Our results do not support this summation prediction (Fig. 3). Still, a response lower than the sum of the single-category responses may reflect saturation of the BOLD signal rather than evidence against a summation model. However, under the summation model, the response to a single category should never be higher than the response to the multi-category scene. We therefore examined whether such a pattern of activation exists. A voxel-wise analysis revealed a large proportion of voxels that show this pattern of response. In contrast to the summation account, 65.74% of the voxels in the VTC that are selective to faces or bodies (face selective, 61.76%; body selective, 77.95%; face–body selective, 55.81%) showed a higher response to a single category than to the multi-category scene (Fig. 6a–c). Subjects who did not have enough voxels in specific areas were excluded from the ANOVA that examined all ROIs, but whenever available, their data are shown in Figure 6d–i and tested with planned t tests (two additional subjects with no face-only voxels, one additional subject with no body-only voxels, and three subjects with no face–body voxels).
The contribution of single categories to the representation of a multi-category scene in nonsaturated face- and body-selective voxels. a–c, The proportion of voxels that show a higher response to single-category than to multi-category scenes (nonsaturated voxels) in (a) face-selective, (b) body-selective, and (c) face- and body-selective areas. d–f, The contribution of each isolated category to the representation of the multi-category scene in face- and body-selective voxels as depicted by the β's of the linear model (Eq. 1) in (d) nonsaturated face-selective voxels, (e) nonsaturated body-selective voxels, and (f) nonsaturated face- and body-selective voxels. Each dot indicates the mean β of a single subject. Gray lines connect the β's of the same subject. Diamonds and error bars indicate the group mean and SEM, respectively. g–i, The distribution and median of
We performed an analysis similar to the one described above (Fig. 3), only for the nonsaturated voxels—voxels that showed a higher response to single-category than to multi-category stimuli. Figure 6d–f depicts the contribution of each of the isolated categories to the response to the multi-category scene [i.e., the beta coefficients of the linear model in (1) for only the nonsaturated voxels that are face selective, body selective, and face–body selective, for each participant and averaged across participants]. Results are similar to those of the previous analysis (Fig. 3). Figure 6g–i depicts the distribution of the
Finally, we examined behavioral measures during the fMRI scanning for the different stimulus categories. We measured performance on the 1-back task across the different categories. Performance was at ceiling for all categories. The mean accuracy was the following: multi-category scene = 0.99 (SD = 0.010); face = 0.98 (SD = 0.018); body = 0.98 (SD = 0.014); room = 0.97 (SD = 0.019); and chair = 0.98 (SD = 0.015). In addition, we examined the eye fixation patterns for each category. Figure 7 shows an overall similar pattern of fixations across the different stimuli, indicating that participants followed the instructions to fixate on the fixation dot, which was presented at the same location on the screen across the different conditions throughout the experiment.
Eye tracker fixation patterns. Heat map of fixation durations across all experimental trials of all subjects. The background image is the picture in its original location and size, with the fixation point, as presented in the experiment. The data are displayed for the different experimental conditions: (a) face, (b) body, (c) room, (d) chair, and (e) multi-category scene.
Discussion
The functional properties of face- and body-selective areas have been extensively investigated in numerous neuroimaging and neurophysiological studies in humans and monkeys over the past two and a half decades. Still, the functional significance of their separation and adjacent locations has remained unclear (for recent reviews, see Hu et al., 2020; Taubert et al., 2022). Our study provides a mechanistic account for this long-standing puzzle by considering the operation of a well-established normalization model on distinct face- and body-selective regions that reside in proximal anatomical locations. Consistent with predictions of the normalization model (see Fig. 1 and mathematical derivations in Appendix), we found that the representation of the multi-category scene was dominated by the face in face-selective areas, by the body in body-selective areas, and by both the face and body (i.e., the whole person) at the border between the face-selective and body-selective areas, which is selective to both categories, filtering out the nonpreferred categories. To reveal this pattern of response, our study presented the whole person within a multi-category scene (Fig. 1), unlike previous studies that presented an isolated face, body, or whole person (Song et al., 2013; Bernstein et al., 2014; Kaiser et al., 2014; Fisher and Freiwald, 2015; Kliger and Yovel, 2020; Zafirova et al., 2022). This enabled us to measure the relative contribution of both the preferred face and body categories and the nonpreferred categories to the response to the multi-category scene.
Consistent with predictions of the normalization model (Fig. 1), we found that the proximal location of face- and body-selective areas enables the generation of a clutter-tolerant representation of the meaningful combination of the face and body (the whole person). This machinery eliminates the need for a dedicated population of neurons that are selective to the combined whole-person stimulus. The generation of a clutter-tolerant representation of the whole person in neighboring face and body areas is accomplished through the same normalization mechanism that declutters the preferred single categories within their category-selective cortex (Reddy et al., 2009; Bao and Tsao, 2018; Kliger and Yovel, 2020). According to the normalization model, when the normalization pool of a face-selective neuron is selective to bodies (or when the normalization pool of a body-selective neuron is selective to faces), the response to the multi-category scene will be a weighted mean of the responses to the two categories, with a reduced contribution of the nonpreferred categories (see Appendix for mathematical derivations), essentially generating a clutter-tolerant representation of the whole person. Furthermore, we demonstrated that an alternative model, which predicts summation instead of normalization, was not supported by the data.
Voxels that are selective to both the face and body typically reside at the border between the face- and body-selective areas. Although our design does not allow us to determine whether these voxels contain two populations of face- and body-selective neurons or one population of person-selective neurons, previous studies suggest that the former might be the case. Kaiser et al. (2014) used multivoxel pattern analysis to ask whether a person-selective region (a region that shows a higher response to person than to object stimuli) is composed of a single population of person-selective neurons or of nearby face-selective and body-selective neurons. Their results support the latter conclusion, though as noted by the authors, they do not provide conclusive evidence for the absence of person-selective neurons in this region. It should be noted that neurons responsive to the whole person have been reported in the upper bank of the monkey STS (Wachsmuth et al., 1994; see also fMRI findings by Fisher and Freiwald, 2015). This upper bank of the monkey STS may be the homolog of the human STS (Yovel and Freiwald, 2013), which is known to show selectivity to the biological motion of the whole person (Thompson et al., 2005). Another recent study found neurons responding to the face and body in a patch in the anterior IT that is selective for the natural configuration of the whole person (Zafirova et al., 2024). Our findings specifically concern the border between the face- and body-selective areas in the posterior VTC, where, to the best of our knowledge, such person-selective neurons have not been reported. Future studies recording from neurons at the border between face- and body-selective regions are required to further support this claim.
In a recent extensive review of the literature on the responses of the visual cortex to faces, bodies, and the whole person, Taubert et al. (2022) proposed four hypotheses regarding the organization of face- and body-selective regions: separate networks, weakly integrated networks, strongly integrated networks, or a single network. They concluded that current data do not fully support any of these hypotheses and called for future studies that combine the face and body to address these open questions. Our study goes beyond their hypotheses by predicting that the significance of this functional organization becomes evident when the whole person is presented in multi-category scenes. Thus, our findings show the benefit of processing faces and bodies both by separate networks, which enhance the representations of the face or the body, and by integrated networks, which enhance the representation of the whole person when presented in multi-category scenes.
The question of whether faces and bodies are processed by separate or integrated systems has also been discussed in a recent review by Hu et al. (2020). According to their suggested model, faces and bodies are processed separately in posterior brain areas (OFA and EBA) but are integrated into the whole person in more anterior regions, the FFA and FBA (Song et al., 2013; Bernstein et al., 2014; Fisher and Freiwald, 2015). This model is consistent with our findings, as the adjacent face and body voxels are primarily located in the fusiform gyrus (Fig. 1), whereas the more lateral and posterior face- and body-selective voxels (i.e., OFA and EBA) are located farther from each other. Indeed, we found that the face- and body-selective regions were proximal in only a third of the participants in the lateral occipital cortex, whereas in the VTC the face and body regions resided adjacently in all our participants (Weiner and Grill-Spector, 2010, 2013). We suggest that this functional organization enables such independent and integrated processing of the face and body, either by clusters of neurons that are located more remotely from one another (mostly posteriorly) or by nearby face and body regions (mostly ventrally), respectively.
The well-established organization of the category-selective visual cortex has generated different hypotheses with respect to its emergence in particular locations in the high-level visual cortex (Saygin et al., 2016; Deen et al., 2017; van den Hurk et al., 2017). Recently, Op de Beeck et al. (2019) proposed three main factors that determine where category-selective areas emerge in the visual cortex: (1) preexisting feature selectivity, (2) computational hierarchy, and (3) domain-specific connectivity to areas outside the visual stream. The current study goes beyond the representation of single categories and highlights the benefit of positioning different category-selective regions in proximity, in particular for the representation of multi-category scenes. Theories that attempt to account for the predetermined locations of category-selective areas should therefore also consider the functional significance of their relative proximity for resolving the computational challenges of representing objects in clutter.
The present study proposes a bottom-up mechanism that can bias the response to certain significant categories by clustering homogeneous category-selective neurons. Yet other mechanisms have been suggested to bias the response to specific stimuli. For example, bottom-up, stimulus-driven mechanisms based on stimulus saliency can allocate resources toward a specific target (Beck and Kastner, 2005). Furthermore, a normalization operation has also been shown to account for top-down mechanisms of selective attention that resolve competition among multiple stimuli (Desimone and Duncan, 1995; Reddy et al., 2009; Reynolds and Heeger, 2009). Thus, the proposed hardwired mechanism acts in concert with other bottom-up and top-down mechanisms to resolve the challenge of processing rich, multi-category scenes (Pessoa et al., 2003; McMains and Kastner, 2011).
The biased representation of the whole person that we revealed is in line with behavioral studies that reported evidence for preferred processing of the whole person. For example, Mayer et al. (2015) showed that whole-person stimuli pop out in cluttered scenes relative to other, nonhuman stimuli. Downing et al. (2004) showed that such stimuli capture attention even when they are unattended. Privileged detection of whole-person and face stimuli relative to objects was also found in continuous flash suppression tasks (Stein et al., 2012). The clutter-tolerant representation that we revealed here for the whole person may underlie these behavioral effects.
To summarize, our study offers a unified mechanistic account for long-standing questions about the neural representations of the face, the body, and the whole person in the high-level visual cortex. We explain how the same normalization mechanism enables the generation of a clutter-tolerant representation of each socially significant component (face or body) and their meaningful combination (whole person), thanks to the neighboring cortical locations of distinct clusters of face- and body-selective neurons. More generally, our study reveals a new mechanism that is used by the visual system to resolve the challenging task of processing socially meaningful stimuli in cluttered scenes.
Data Availability Statement
Data that were collected in this study will be available at https://openneuro.org. Tables of preprocessed data as well as the code that was used to generate the analysis, figures, and statistics are available at https://github.com/gylab-TAU/Multi_Category_Scenes_fMRI_analysis.
Appendix: Mathematical derivations of a model predicting the representation of a multi-category scene composed of four categories
According to the normalization model (Reynolds and Heeger, 2009), the measured response of a specific neuron (i.e., neuron j) to a multi-category stimulus is its driving response divided by the sum of the responses of the surrounding neurons (its normalization pool). For a multi-category stimulus composed of four categories, denoted A–D, such as the stimulus shown in Figure 1b, this is described by the following equation:
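One plausible form of this equation, following the standard Reynolds and Heeger (2009) formulation, with $L_j(S)$ denoting the driving input of neuron $j$ to stimulus $S$, $k$ indexing the neurons in the normalization pool, and $\sigma$ a semi-saturation constant (a reconstruction of the intended equation, not necessarily the authors' exact notation):

$$R_j(A+B+C+D)=\frac{L_j(A)+L_j(B)+L_j(C)+L_j(D)}{\sigma+\sum_k\left[L_k(A)+L_k(B)+L_k(C)+L_k(D)\right]}$$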
We apply this equation to the multi-category scene used in our study, which is composed of a person, a chair, and a room. Since the person comprises a face and a body, four categories are presented in this multi-category scene, denoted as follows: a face (F), a body (B), a chair (C), and a room (i.e., a place, P):
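Using the same notation ($L_j$ for the driving input, $k$ for the normalization pool, $\sigma$ for the semi-saturation constant), a plausible rendering of this equation is:

$$R_j(F+B+C+P)=\frac{L_j(F)+L_j(B)+L_j(C)+L_j(P)}{\sigma+\sum_k\left[L_k(F)+L_k(B)+L_k(C)+L_k(P)\right]}$$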
We can separate the right side of the equation into two parts, yielding the following:
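One plausible way to carry out this separation for the face term uses the identity $L_j(F)=R_j(F)\left[\sigma+\sum_k L_k(F)\right]$, obtained by applying the normalization equation to the isolated face:

$$R_j(F+B+C+P)=\underbrace{\frac{\sigma+\sum_k L_k(F)}{\sigma+\sum_k\left[L_k(F)+L_k(B)+L_k(C)+L_k(P)\right]}}_{\beta_F}\,R_j(F)+\frac{L_j(B)+L_j(C)+L_j(P)}{\sigma+\sum_k\left[L_k(F)+L_k(B)+L_k(C)+L_k(P)\right]}$$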
For simplicity, we showed only the derivations for the face weight, but similar derivations can be performed for all other categories, yielding the following:
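A plausible form of the resulting weighted-sum expression, with $\beta_X$ denoting the weight of category $X$:

$$R_j(F+B+C+P)=\beta_F R_j(F)+\beta_B R_j(B)+\beta_C R_j(C)+\beta_P R_j(P),\qquad \beta_X=\frac{\sigma+\sum_k L_k(X)}{\sigma+\sum_k\left[L_k(F)+L_k(B)+L_k(C)+L_k(P)\right]}$$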
A face-selective neuron that resides in a face-selective area responds more to faces than to each of the other categories, that is,
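A plausible statement of this condition, which also holds when summed over a normalization pool residing in the same face-selective cortex:

$$L_j(F)>L_j(X)\quad\text{and}\quad\sum_k L_k(F)>\sum_k L_k(X)\quad\text{for }X\in\{B,C,P\},$$

implying that the face weight is the largest of the four coefficients.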
We can further see that the difference between the coefficients of two categories, for example, the face and the body, is given by the following:
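A plausible form of this difference, with $\beta_F$ and $\beta_B$ the face and body coefficients:

$$\beta_F-\beta_B=\frac{\sum_k\left[L_k(F)-L_k(B)\right]}{\sigma+\sum_k\left[L_k(F)+L_k(B)+L_k(C)+L_k(P)\right]},$$

which is positive whenever the normalization pool is driven more strongly by faces than by bodies.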
When using fMRI to measure the response to a multi-category stimulus, we measure the BOLD signal, which is an estimate of the response of thousands of neurons in a small patch of cortex (e.g., a 2 × 2 × 2 mm voxel). The response of all the neurons in a voxel can be written as follows:
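A plausible form, summing the normalized responses of the $N$ neurons in the voxel:

$$V(F+B+C+P)=\sum_{j=1}^{N}R_j(F+B+C+P)=\sum_{j=1}^{N}\sum_{X\in\{F,B,C,P\}}\beta_X^{(j)}R_j(X),$$

where $\beta_X^{(j)}$ denotes the weight of category $X$ for neuron $j$, which in general depends on that neuron's own normalization pool.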
Assuming that all neurons in a given voxel have a similar normalization pool, that is, a similar surrounding, we can rewrite the equation such that k is no longer a function of j:
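A plausible form of the resulting voxel-level equation:

$$V(F+B+C+P)=\sum_{X\in\{F,B,C,P\}}\beta_X\sum_{j=1}^{N}R_j(X)=\beta_F V(F)+\beta_B V(B)+\beta_C V(C)+\beta_P V(P),$$

that is, the voxel response to the multi-category scene is a weighted sum of the voxel responses to the isolated categories, which is the linear model fitted in the main text.

This identity can also be checked numerically. The sketch below (hypothetical parameter values; not the authors' analysis code) simulates a face-selective voxel whose neurons share a single normalization pool and verifies that the response to the compound stimulus equals the β-weighted sum of the single-category responses:

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 200
cats = ["F", "B", "C", "P"]
sigma = 1.0

# Hypothetical driving inputs L_j(X): a face-selective voxel,
# so the face drive is largest on average.
scale = {"F": 5.0, "B": 2.0, "C": 1.0, "P": 1.0}
L = {X: rng.gamma(2.0, scale[X], n_neurons) for X in cats}

pool = {X: L[X].sum() for X in cats}   # shared normalization pool per category
D = sigma + sum(pool.values())         # denominator for the compound stimulus

# Voxel responses to each isolated category and to the compound
V_single = {X: (L[X] / (sigma + pool[X])).sum() for X in cats}
V_multi = sum(L[X] for X in cats).sum() / D

# Predicted weights beta_X and the weighted-sum prediction
beta = {X: (sigma + pool[X]) / D for X in cats}
V_pred = sum(beta[X] * V_single[X] for X in cats)

assert np.isclose(V_multi, V_pred)     # weighted-sum identity holds
assert max(beta, key=beta.get) == "F"  # face weight dominates
```

The exact agreement reflects the algebra above; with heterogeneous normalization pools across neurons the identity would hold only approximately, which is why the similar-pool assumption is needed at the voxel level.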
Footnotes
This work was supported by grants from the Israel Science Foundation (446/16, 917/21).
The authors declare no competing financial interests.
Correspondence should be addressed to Libi Kliger at libi.kliger@gmail.com or Galit Yovel at gality@tauex.tau.ac.il.