Abstract
In everyday life, we have no trouble categorizing objects varying in position, size, and orientation. Previous fMRI research shows that higher-level object processing regions in the human lateral occipital cortex may link object responses from different affine states (i.e., size and viewpoint) through a general linear mapping function capable of predicting responses to novel objects. In this study, we extended this approach to examine the mapping for both Euclidean (e.g., position and size) and non-Euclidean (e.g., image statistics and spatial frequency) transformations across the human ventral visual processing hierarchy, including areas V1, V2, V3, V4, ventral occipitotemporal cortex, and lateral occipitotemporal cortex. The predicted pattern generated from a linear mapping function could capture a significant amount of the changes associated with the transformations throughout the ventral visual stream. The derived linear mapping functions were not category independent as performance was better for the categories included than those not included in training and better between two similar versus two dissimilar categories in both lower and higher visual regions. Consistent with object representations being stronger in higher than in lower visual regions, pattern selectivity and object category representational structure were somewhat better preserved in the predicted patterns in higher than in lower visual regions. There were no notable differences between Euclidean and non-Euclidean transformations. These findings demonstrate a near-orthogonal representation of object identity and these nonidentity features throughout the human ventral visual processing pathway with these nonidentity features largely untangled from the identity features early in visual processing.
SIGNIFICANCE STATEMENT Presently we still do not fully understand how object identity and nonidentity (e.g., position, size) information are simultaneously represented in the primate ventral visual system to form invariant representations. Previous work suggests that the human lateral occipital cortex may be linking different affine states of object representations through general linear mapping functions. Here, we show that across the entire human ventral processing pathway, we could link object responses in different states of nonidentity transformations through linear mapping functions for both Euclidean and non-Euclidean transformations. These mapping functions are not identity independent, suggesting that object identity and nonidentity features are represented in a near rather than a completely orthogonal manner.
Introduction
Although objects are constantly changing in position, size, and orientation in everyday life, we can keep track of their identities. Such invariant object identity representations are likely achieved through visual processing in the primate ventral visual pathway, where tangled object identity representations become linearly separable and transformation tolerant from lower to higher visual regions (Rolls, 2000; DiCarlo and Cox, 2007; Rust and DiCarlo, 2010; Isik et al., 2014; Fig. 1A).
At the same time, we can keep track of nonidentity features such as size, orientation, and position, which are important for tasks like object grasping. Different states of a nonidentity transformation, such as two positions in a position transformation, can also be decoded across object identities in both lower and higher ventral visual processing regions (Hong et al., 2016; Vaziri-Pashkam et al., 2019). This suggests object identity and nonidentity information may be represented in an orthogonal manner, which would facilitate independent access to these two types of information in different tasks. If so, we should be able to learn a linear mapping between the neural responses to two states of one object and apply that mapping to a new object's neural response in one state to predict its neural response in the other state. Indeed, in the human lateral occipital cortex, a higher ventral visual processing region, a linear mapping function has been shown to link different affine states of object representations, even for objects not included in training (Ward et al., 2018).
Presently, it is unknown whether linear mapping can successfully predict object responses across transformations only in higher-level visual processing regions or also in lower-level regions. An object feature untangling view may predict the existence of a stronger linear mapping in higher than lower visual regions as different object features become more separated throughout the ventral visual processing pathway (DiCarlo and Cox, 2007). However, the decoding success of nonidentity information independent of identity information in both lower and higher visual regions as described earlier suggests that even in lower visual regions, some untangling of identity and nonidentity information exists.
An orthogonal representation would predict that a linear mapping function derived from one object would equally well predict the neural response patterns of that object and other objects after transformation (Fig. 1A, left). However, transformation could be object specific, leading to an incomplete generalization and a near-orthogonal representational structure. Consequently, both an object-independent and an object-dependent component would be needed to fully capture object responses between two states of a nonidentity transformation (Fig. 1A, right).
In the present study, we attempted to provide answers to these questions. We used existing data from two studies (Vaziri-Pashkam et al., 2019; Vaziri-Pashkam and Xu, 2019) in our analysis. We adapted the methodology of Ward et al. (2018), examining the linear mapping between fMRI representations of objects in different states of nonidentity transformations in human early visual areas V1–V4 and higher object processing regions in the occipitotemporal cortex (OTC). We aimed to determine how identity and nonidentity features are simultaneously represented throughout the human ventral visual hierarchy. Taking into account the response reliability of different visual regions, we examined neural responses to two Euclidean transformations (i.e., position and size) and two non-Euclidean transformations [i.e., changes in image statistics and spatial frequency (SF)]. In addition to testing the success of the predicted fMRI response patterns, we also tested how well object representational structure was preserved through linear mapping using representational similarity analysis (RSA; Kriegeskorte and Kievit, 2013) and whether similarity among objects plays a role in predicting response patterns across objects.
We found that a linear mapping could successfully link object responses in different states of nonidentity transformations throughout the human ventral visual stream for both Euclidean and non-Euclidean transformations. However, these mapping functions were not entirely identity independent, suggesting that object identity and nonidentity features are represented in a near rather than a completely orthogonal manner.
Materials and Methods
In this study, we applied the representational transformation analysis developed by Ward et al. (2018). This analysis method was originally used to examine fMRI responses from objects undergoing affine transformation in the human LOC. For the current study, we used this analysis to examine such responses throughout the entire human ventral processing pathway. In addition to the affine/Euclidean transformations studied before, here we documented responses to two Euclidean transformations (position and size) and two non-Euclidean transformations [image statistics and spatial frequency (SF)]. To increase the signal-to-noise ratio, instead of using an event-related design as in Ward et al., we used a block design and examined responses from the average of multiple exemplars of an object category (i.e., category response) rather than individual objects. In Ward et al. (2018), the two states of an object were presented one after another within the same trial, raising the possibility that noise correlation between the two states of the object could contribute to the prediction success. That is, noise from one state of the object could be carried forward to the predicted pattern after linear mapping, making the predicted pattern more similar, because of temporal proximity, to the true pattern of the other state of the same object than it would otherwise be. By using a block design, we removed such temporal response correlations in the present study. The overall correlations between the predicted and the true patterns were fairly low in Ward et al. (with the mean correlations being <0.15 for Fisher z-transformed correlation coefficients), raising the possibility that a linear mapping function may capture only a very small amount of the variance associated with response changes between two states of an object. Such a low correlation, however, could be because of the event-related design used. By taking into account the response reliability of a brain region in this study, we could evaluate how good the predicted patterns were compared with the true patterns.
Because the signal of a block design is collapsed across different exemplars, it may be argued that interstimulus variance could reduce the ability to employ a linear transformation from one category to another. However, if there is an identity-independent orthogonal representation of nonidentity transformations, then the average of several exemplars should still show the same amount of shift across transformations (Fig. 1A shows such a representation across categories; the same would hold for exemplars within a category as well). On the other hand, if the representation is not completely orthogonal, with the shift being greater for some exemplars than others within a category, then averaging across the exemplars may actually reduce the near-orthogonal effect among categories, as averaging would even out the different amounts of shift from the different exemplars. Thus, averaging across exemplars would make it less likely to find a near-orthogonal effect. If such an effect is found (as the results would show), then it suggests that the effect is fairly robust.
During each run of the experiments, human participants viewed blocks of images, each containing 10 exemplars from one of six or eight real-world object categories (faces, houses, bodies, cats, elephants, cars, chairs, and scissors), and performed a one-back repetition detection task. During our data analysis, we first selected the 75 most reliable voxels from each ROI to equate the number of voxels across ROIs and to increase power (Tarhan and Konkle, 2020). It is important to note, however, that the results remained very similar in all experiments when we included all voxels from each ROI in the analysis. Thus the pattern of results we obtained was fairly stable and did not depend on the inclusion of the most reliable voxels. Nevertheless, we report results from the most reliable voxels to equate the number of voxels selected from each ROI and to ensure that the results obtained were not because of noise and a lack of power.
We then extracted the fMRI response pattern corresponding to each image block in each run and used that as the fMRI response pattern for the specific object category shown in that block of that run. We used a split-half approach in our analysis and divided the runs into odd and even halves. Within each half, using a leave-one-run-out cross-validation procedure, we derived a linear transformation matrix following Ward et al. (2018) to link the fMRI response patterns of two states of an object category in a given transformation in the training data. Using this transformation matrix, in the test data we generated the predicted pattern of an object category in one state using its true pattern from the other state.
We evaluated the predicted patterns in the following three different ways: how well they correlated with the true patterns, whether they showed category selectivity and were more similar to the true patterns of the same rather than different categories, and whether the category representational structure was preserved among the predicted patterns. In all three analyses, we compared the performance of the predicted patterns to that of the true patterns to evaluate whether a linear mapping can capture a significant amount of the changes associated with the transformations, using two measures of reliability (explained in more detail below). We also examined the effect of training category (i.e., whether or not a category was included in the training data), the effect of training set size (i.e., the number of categories included in the training data), the effect of ROI (i.e., whether the effect differed among the different brain regions), and their interactions. We additionally examined how the ability of using the response of one category to predict that of another category was determined by the similarity of these two categories in a given brain region.
The details of the four fMRI experiments included are described previously (Vaziri-Pashkam et al., 2019; Vaziri-Pashkam and Xu, 2019) and summarized here.
Participants
Seven (four females), seven (four females), six (four females), and 10 (five females) healthy human participants with normal or corrected-to-normal visual acuity, all right-handed, and between the ages of 18 and 35 years took part in experiments 1–4. All participants gave their informed consent before the experiments and received payment for their participation. The experiments were approved by the Committee on the Use of Human Subjects at Harvard University.
Experimental design and procedures
Main experiments
In all four experiments, participants performed a one-back object repetition detection task while viewing blocks of grayscale images of real-world object categories (Fig. 1B–D). Experiments 1–3 included the following eight real-world object categories: body, car, cat, chair, elephant, face, house, and scissors. Experiment 4 included the following six real-world object categories: body, car, chair, elephant, face, and house. These sets of categories cover a broad range of real-world objects and include small/large, animate/inanimate, and natural/man-made objects. Similar sets have been used in previous investigations of object category representations in the OTC (Haxby et al., 2001; Kriegeskorte et al., 2008). Each category contained 10 exemplars that varied in identity, pose (for the animal and body categories only), and orientation/viewing angle to minimize the low-level similarities among them. All images were placed on a dark gray square and displayed on a light gray background. Although experiment 4 included only six of the eight object categories, the exemplars used for each category were identical to those used in the other experiments.
We analyzed two Euclidean and two non-Euclidean transformations. The two Euclidean transformations were the following: (1) experiment 1, position: object images (subtended 5.7° × 5.7°) appearing 3.08° above versus below fixation, and (2) experiment 2, size: object images shown in a small (4.6° × 4.6°) versus a large (11.4° × 11.4°) size. The two non-Euclidean transformations were the following: (1) experiment 3, image stats: object images (subtended 9.13° × 9.13°) shown in the original versus controlled format, and (2) experiment 4, SF: object images (subtended 7.8° × 7.8°) shown with the high [finite impulse response (FIR) filter with a cutoff of 4.40 cycles per degree] versus low [FIR filter with a cutoff of 0.62 cycles per degree] SF content of the image (Fig. 1C). Controlled images were generated using the SHINE technique to achieve spectrum, histogram, and intensity normalization and equalization (Willenbockel et al., 2010). Controlled images also appeared in experiments 1 and 2 to better equate low-level differences among the images from the different categories.
During the experiment, blocks of images were shown. Each block contained a random sequential presentation of 10 exemplars from the same object category and the same transformation condition (e.g., for experiment 1, in one block of a run all the images were of cats positioned above fixation). Each image was presented for 200 ms followed by a 600 ms blank interval between the images (Fig. 1D). Two image repetitions occurred randomly in each block. Participants were asked to view the images and report the repetitions by pressing a key on an MR-compatible button box. To ensure proper fixation, participants fixated on a central dot throughout the experiment, and eye movements were monitored in all four experiments using an EyeLink 1000 Plus eye tracker (SR Research). Each block lasted 8 s followed by an 8 s fixation period. In experiments 1–3, there was an additional fixation period of 8 s at the beginning of each run. In experiment 4, there was an additional fixation period of 12 s at the beginning of each run.
Each run in experiments 1–3 contained 16 blocks, 1 for each of the 8 object categories in each of the 2 states of the transformation. Each run in experiment 4 contained 18 blocks, 1 for each of the 6 object categories in the low SF condition, the high SF condition, and the full SF condition. The full SF condition blocks were not used in the transformation analysis. Experiments 1–3 included 16 runs, with each run lasting 4 min 24 s. Experiment 4 included 18 runs, with each run lasting 5 min. The order of the object categories and the two states of the transformation were counterbalanced across runs and participants.
Localizer experiments
Topographic visual regions
We examined ROIs from topographically localized areas within the occipital cortex, including V1, V2, V3, and V4 in each participant (Fig. 1E, left). These regions were mapped with flashing checkerboards using standard techniques (Sereno et al., 1995; Swisher et al., 2007) with parameters optimized following Swisher et al. (2007). Specifically, a polar angle wedge with an arc of 72° swept across the entire screen (23.4° × 17.5° of visual angle). The wedge had a sweep period of 55.467 s, flashed at 4 Hz, and swept for 12 cycles in each run (Swisher et al., 2007). Participants completed 4–6 runs, each lasting 11 min 56 s. The task varied slightly across participants. All participants were asked to detect a dimming in the visual display. For some participants, the dimming occurred only at fixation, for some it occurred only within the polar angle wedge, and for others, it could occur in both locations, commensurate with the various methodologies used in the literature (Swisher et al., 2007; Bressler and Silver, 2010). No differences were observed in the maps obtained through each of these methods.
Lateral occipitotemporal cortex and ventral occipitotemporal cortex
We also examined the functionally localized regions of the ventral occipitotemporal cortex (VOT) and lateral occipitotemporal cortex (LOT; Fig. 1E, middle and right). LOT and VOT loosely correspond to the location of lateral occipital and posterior fusiform (pFs) areas (Malach et al., 1995; Grill-Spector et al., 1998; Kourtzi and Kanwisher, 2000) but extend further into the temporal cortex in an effort to include as many object-selective voxels as possible in the OTC. To identify LOT and VOT ROIs, following Kourtzi and Kanwisher (2000), participants viewed blocks of face, scene, object, and scrambled object images (all subtended ∼12.0° × 12.0°). Only the contrast between objects and scrambled objects was used to localize LOT and VOT. The other object categories were included to localize other brain regions not examined in the present study. The images were photographs of grayscaled male and female faces, common objects (e.g., cars, tools, and chairs), indoor and outdoor scenes, and phase-scrambled versions of the common objects. Participants monitored for a slight spatial jitter, which occurred randomly once in every 10 images. Each run contained four blocks each of faces, scenes, objects, and phase-scrambled objects. Each block lasted 16 s and contained 20 unique images, with each appearing for 750 ms and followed by a 50 ms blank display. In addition to the stimulus blocks, 8 s fixation blocks were included at the beginning, middle, and end of each run. Each participant was tested with two or three runs, each lasting 4 min 40 s.
MRI methods
MRI data were collected using a Siemens MAGNETOM Trio, A Tim System 3T scanner, with a 32-channel receiver array head coil. In experiment 4, data from the last four participants were collected after the scanner was upgraded to a Prisma system. Participants lay on their back inside the MRI scanner and viewed the back-projected LCD with a mirror mounted inside the head coil. The display had a refresh rate of 60 Hz and spatial resolution of 1024 × 768. An Apple MacBook Pro laptop was used to present the stimuli and collect the motor responses. For topographic mapping, the stimuli were presented using VisionEgg (Straw, 2008). All other stimuli were presented with MATLAB running Psychtoolbox extensions (Brainard, 1997).
Each participant completed 3–6 MRI scan sessions for us to obtain data for the high-resolution anatomic scans, topographic maps, functional ROIs, and experimental scans. Out of these MRI scan sessions, 1–4 sessions were experimental sessions for each participant. Using standard parameters, a T1-weighted high-resolution (1.0 × 1.0 × 1.3 mm3) anatomic image was obtained for surface reconstruction. For all the fMRI scans, a T2*-weighted gradient-echo pulse sequence was used. For the experimental scans, 33 axial slices parallel to the anterior commissure–posterior commissure (AC–PC) line (3 mm thick, 3 × 3 mm2 in-plane resolution with 20% skip) were used to cover the whole brain [repetition time (TR) = 2 s, echo time (TE) = 29 ms, flip angle = 90°, matrix = 64 × 64]. For the LOT/VOT localizer scans, 30–31 axial slices parallel to the AC–PC line (3 mm thick, 3 × 3 mm2 in-plane resolution with no skip) were used to cover the occipital and temporal lobes (TR = 2 s, TE = 30 ms, flip angle = 90°, matrix = 72 × 72). For topographic mapping, 42 slices (3 mm thick, 3.125 × 3.125 mm2 in-plane resolution with no skip) just off parallel to the AC–PC line were collected to cover the whole brain (TR = 2.6 s, TE = 30 ms, flip angle = 90°, matrix = 64 × 64). Different slice prescriptions were used here for the different localizers to be consistent with the parameters used in our previous studies. Because the localizer data were projected into the volume view and then onto individual participants' flattened cortical surface, the exact slice prescriptions used had minimal impact on the final results.
Data analysis
FMRI data were analyzed using FreeSurfer (https://surfer.nmr.mgh.harvard.edu), FreeSurfer Functional Analysis Stream (Dale et al., 1999), and in-house MATLAB and Python code. FMRI data preprocessing included 3D motion correction, slice timing correction, and linear and quadratic trend removal. No smoothing was applied to the data. All analysis for the main experiment was performed in the volume. The ROIs were selected on the surface and then projected back to the volume for further analysis.
ROI definitions
Topographic maps
Following the procedures described in Swisher et al. (2007) and by examining phase reversals in the polar angle maps, we identified topographic areas in occipital cortices including V1, V2, V3, and V4 in each participant (Fig. 1E).
LOT and VOT
These two ROIs (Fig. 1E) were defined as a cluster of continuous voxels in the LOC and VOT, respectively, that responded more (p < 0.001 uncorrected) to the original than to the scrambled object images. LOT and VOT loosely correspond to the location of the lateral occipital and pFs areas (Malach et al., 1995; Grill-Spector et al., 1998; Kourtzi and Kanwisher, 2000) but extend further into the temporal cortex in an effort to include as many object-selective voxels as possible in OTC regions. In experiment 4, for two participants for LOT and for one participant for VOT, the threshold of p < 0.001 resulted in too few voxels, so the threshold was relaxed to p < 0.01 to have at least 100 voxels across the two hemispheres.
MVPA analysis
In the main experiments, to generate fMRI response patterns for each condition in each run, we first convolved the 8 s stimulus presentation boxcars with a hemodynamic response function. Then, for experiments 1–3, we conducted a general linear model analysis with 16 factors (2 states of the transformation × 8 object categories) to extract the β value for each condition in each voxel in each ROI. This was done separately for each run. For experiment 4, we performed a general linear model analysis with 18 factors (3 SF conditions × 6 object categories) to extract the β value in a similar way, again separately for each run. Although the full SF condition was used in the analyses of Vaziri-Pashkam et al. (2019), it was not used in the analyses here. We z normalized the β values across all voxels for each condition in a given ROI in each run to remove amplitude differences among conditions, ROIs, and runs.
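To make the normalization step concrete, the following is a minimal sketch (not the authors' code) of how the β values could be z normalized across voxels for each condition in each run; the array layout and variable names are assumptions for illustration.

```python
import numpy as np

def z_normalize_betas(betas):
    """Z normalize beta values across voxels, separately for each
    condition in each run, removing amplitude differences among
    conditions, ROIs, and runs.
    betas: array of shape (n_runs, n_conditions, n_voxels) for one ROI."""
    mean = betas.mean(axis=-1, keepdims=True)  # mean across voxels
    std = betas.std(axis=-1, keepdims=True)    # SD across voxels
    return (betas - mean) / std

# Example with experiment 1-3 dimensions: 16 runs, 16 conditions
# (2 states x 8 categories), and a hypothetical 200-voxel ROI.
patterns = z_normalize_betas(np.random.default_rng(0).normal(size=(16, 16, 200)))
```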
Reliability-based voxel selection
As pattern decoding to a large extent depends on the total number of voxels in an ROI, to equate the number of voxels in different ROIs to facilitate comparisons across ROIs and to increase power, we selected the 75 most reliable voxels in each ROI using reliability-based voxel selection (Tarhan and Konkle, 2020). Across the experiments, the ROIs ranged from 115 to 901 voxels before voxel selection. This method selects voxels whose response profiles are consistent across odd and even halves of the runs and works well when there are ∼15 conditions. To implement this method, for each voxel, we calculated the split-half reliability by first averaging the runs within the odd and even halves and then correlating the resulting averaged responses for all conditions (12 or 16 in total for the 6 or 8 image categories and 2 states of a transformation) across the even and odd halves. We then selected the top 75 voxels with the highest correlations. To avoid circularity in analysis, these 75 voxels were selected using only the training runs in each iteration of a leave-one-out cross-validation procedure (which is explained in more detail in the next section). Within each ROI, the selected voxels showed an overlap of 52–61 voxels across the different training-testing iterations for the position transformation, 49–56 voxels for the size transformation, 46–55 voxels for the image stats transformation, and 41–52 voxels for the SF transformation. The 75 voxels chosen maintained a high split-half reliability of at least r = 0.70 for each participant and ROI while providing an optimal number of features for subsequent ridge regression analysis.
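The voxel selection step could be implemented along the following lines. This is a sketch under the assumption that the patterns are arranged as an (n_runs, n_conditions, n_voxels) array as above; in the actual procedure, only the training runs of each cross-validation fold would be passed in to avoid circularity.

```python
import numpy as np

def select_reliable_voxels(betas, n_keep=75):
    """Reliability-based voxel selection (after Tarhan and Konkle, 2020).
    For each voxel, correlate its condition response profile averaged
    over odd runs with its profile averaged over even runs, then keep
    the n_keep voxels with the highest split-half correlations."""
    odd = betas[0::2].mean(axis=0)   # (n_conditions, n_voxels), runs 1, 3, ...
    even = betas[1::2].mean(axis=0)  # (n_conditions, n_voxels), runs 2, 4, ...
    reliability = np.array([np.corrcoef(odd[:, v], even[:, v])[0, 1]
                            for v in range(betas.shape[-1])])
    return np.argsort(reliability)[::-1][:n_keep]
```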
Deriving linear mapping between two states of a transformation
We adapted the linear mapping analysis used by Ward et al. (2018) with a split-half analysis on the 75 most reliable voxels for each training iteration in each ROI. Training and testing of the linear mapping were computed separately for each participant, each ROI, and each state of the transformation in each experiment. For each ROI, the data were first split into even and odd runs. Within each half of the data, a leave-one-out cross-validation procedure was conducted where one run served as the testing run, and the remaining runs served as training runs. For experiments 1–3, which had 16 runs, this meant that first the data were split into 2 groups with 8 runs each, and within each group there were 7 training runs and 1 testing run. For experiment 4, which had 18 runs, this meant that first the data were split into 2 groups with 9 runs each, and within each group there were 8 training runs and 1 testing run. The training runs of the even and odd runs were used to select the top 75 most reliable voxels of an ROI as described in the previous section. These 75 most reliable voxels were then applied to the testing runs. Additionally, for experiments 1–3, the number of object categories included in training ranged from one to seven as these experiments included eight categories in total. For experiment 4, the number of object categories included in training ranged from one to five as this experiment included six categories in total. All possible combinations of training categories were used.
During training of each leave-one-out cross-validation fold, we derived linear mapping matrices to link the two states of a transformation in both directions (e.g., from small to large and from large to small) using ridge regression. The end result was that responses from the 75 voxels in one state, after multiplying with the trained linear mapping matrix, would predict the responses of the 75 voxels in the other state. This was accomplished in the following way. For each object category, we first constructed two 75 [voxels] × (7 or 8) [runs] matrices corresponding to the original and the transformed states, respectively. If more than one object category was included in training, we concatenated each object's matrix so that the full-pattern matrix was 75 [voxels] × (7 or 8) [runs] × (1–7) [training categories]. So, for example, if we had seven training runs and trained on one category, we used a 75 [voxels] × 7 [1 category × 7 runs] matrix, but when we trained with seven categories, we used a 75 [voxels] × 49 [7 categories × 7 runs] matrix. Using ridge regression, we derived the linear mapping β between the original state (matrix X) and the transformed state (matrix Y) as follows:

β = YXᵀ(XXᵀ + λI)⁻¹,

where λ is the ridge regularization parameter, I is the 75 × 75 identity matrix, and the predicted pattern in the transformed state is given by Ŷ = βX.
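To illustrate, a minimal sketch of this closed-form ridge solution is given below; the penalty value lam and the helper names are assumptions, as the regularization setting is not specified in this excerpt. In use, one mapping would be fit per direction of the transformation within each training fold.

```python
import numpy as np

def fit_linear_mapping(X, Y, lam=1.0):
    """Ridge-regression mapping beta such that Y ~ beta @ X.
    X, Y: 75 [voxels] x n [runs x training categories] matrices holding
    the original and transformed states, respectively."""
    n_vox = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n_vox))

def predict_pattern(beta, x):
    """Predict the 75-voxel pattern in the other state from pattern x."""
    return beta @ x
```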
Experimental design and statistical analyses
Seven, seven, six, and 10 human participants took part in experiments 1–4, respectively. These numbers were chosen based on prior published studies (e.g., Haxby et al., 2001; Kamitani and Tong, 2005). The factors described in the remaining sections of Materials and Methods were evaluated at the group level using repeated-measures ANOVA and post hoc t tests. We corrected for multiple comparisons in all post hoc analyses using the Benjamini–Hochberg method (Benjamini and Hochberg, 1995). Additionally, for the ANOVA results, we calculated effect size using a partial η-squared measure (Cohen, 1973). Partial eta squared is a less biased measure of effect size for ANOVAs that is comparable across study designs. It is defined as follows:

partial η² = SS_effect / (SS_effect + SS_error),

where SS_effect is the sum of squares of the effect of interest and SS_error is the sum of squares of its associated error term.
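For concreteness, the two statistical utilities named above might look as follows; this is an illustrative sketch, with multipletests from statsmodels performing the Benjamini–Hochberg correction and partial eta squared computed from the ANOVA sums of squares.

```python
from statsmodels.stats.multitest import multipletests

def bh_correct(pvals, alpha=0.05):
    """Benjamini-Hochberg FDR correction across a family of post hoc tests."""
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method='fdr_bh')
    return reject, p_adj

def partial_eta_squared(ss_effect, ss_error):
    """Partial eta squared: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)
```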
Evaluating the predictions of the learned linear mapping
To test how well the learned linear mapping could predict fMRI response patterns between the two states of a given transformation, we first generated predicted patterns using the learned linear mapping. We then compared the predicted and true patterns using Pearson's correlation. Specifically, within each half of the data, for each leave-one-out cross-validation fold, we first generated two predicted patterns (one for each state) for each object category from the left-out data by using the pattern from one state to predict the other state using the learned linear mapping matrices (one for each direction of transformation). We then correlated each predicted pattern from one-half of the data with the corresponding true pattern in the other half of the data (Fig. 1F).
Because of the presence of measurement noise, even patterns of the exact same condition may not correlate perfectly across odd and even runs. Consequently, how well the predicted patterns correlate with the true patterns across odd and even runs should be assessed relative to how well the true patterns correlate across odd and even runs (a split-half reliability measure). Because noise may differ across brain regions, reliability may also differ. Thus, obtaining the reliability measures in different brain regions additionally allows us to correct for these differences in the response patterns and facilitate cross-region comparisons. We calculated reliability in two different ways. We derived the first reliability measure, which we called Averaged-run Ceiling, following the method used in Kietzmann et al. (2019). Specifically, within each ROI, we averaged all but one run within one-half of the data (similar to how we used all but one run to train a linear function to generate a prediction), and averaged all runs in the other half of the data. We then correlated these two averaged patterns for each condition and transformation state, and averaged all the resulting correlations as our measure of split-half reliability for a given ROI. This measure of reliability involves the same amount of data that go into generating the predicted pattern and the true pattern. This reliability measure allows us to ask the following: Is the predicted pattern derived from data not included in training using the trained linear function as good as the average of the data included in training (i.e., the average of the true patterns used for training)? We derived the second reliability measure, which we called Single-run Ceiling, as follows: Instead of correlating the average of all but one run within one-half of the data with the average of all the runs in the other half of the data, we correlated each single run in one-half of the data with the average of all the runs in the other half of the data. We then averaged all the resulting correlation coefficients as our reliability measure for a given ROI. This reliability measure allows us to ask the following: Is the predicted pattern derived from data not included in training using the trained linear function as good as the data not included in training (i.e., the left-out true pattern)?
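The two ceilings could be computed roughly as follows; this sketch assumes each half of the data is an (n_runs, n_voxels) array of patterns for a single condition, with function names chosen for illustration.

```python
import numpy as np

def _corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def averaged_run_ceiling(half1, half2):
    """Correlate the average of all-but-one run in one half (mirroring
    the amount of data used for training) with the average of all runs
    in the other half, then average over left-out runs."""
    return np.mean([_corr(np.delete(half1, i, axis=0).mean(axis=0),
                          half2.mean(axis=0))
                    for i in range(half1.shape[0])])

def single_run_ceiling(half1, half2):
    """Correlate each single run in one half with the average of all
    runs in the other half, then average the correlations."""
    return np.mean([_corr(run, half2.mean(axis=0)) for run in half1])
```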
In addition to examining the correlation between the predicted pattern with the corresponding true pattern, we also examined whether such a correlation was higher for the same than the different category pairs. This allowed us to further evaluate category selectivity of the predicted pattern.
Within each transformation type and each ROI, we also assessed the effect of category (i.e., prediction performance for categories included or not included in training) and the effect of training set size (i.e., how prediction performance depended on the number of categories included in the training data) and their interaction. These analyses allowed us to test the generalizability of the learned linear mapping, that is, how well a linear mapping learned from one set of categories could successfully predict the patterns of categories not included in training.
To test whether linear mapping better predicts the neural response pattern in higher rather than in lower visual regions, we compared performance among the ROIs after normalizing the data by Averaged-run Ceiling, which accounts for reliability differences across ROIs. To account for the effect of training set size and to streamline the analysis, we only included the lowest and the highest training set sizes from each transformation and tested the effects of ROI, training category, training set size, and their interactions. For each of the two training set sizes and two training categories, we further examined whether there was a positive linear trend in performance from lower-level to higher-level visual regions. This was done by fitting a regression line for each participant and then testing whether the resulting slopes were >0 at the group level using a one-tailed t test.
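The linear-trend test could be sketched as below, assuming a participants-by-ROIs array of ceiling-normalized performance with ROIs ordered from V1 to VOT; the helper name is hypothetical.

```python
import numpy as np
from scipy import stats

def positive_trend_test(perf):
    """Fit a regression line across ROIs for each participant, then test
    whether the slopes are > 0 at the group level (one-tailed t test).
    perf: (n_participants, n_rois), ROIs ordered lower- to higher-level."""
    x = np.arange(perf.shape[1])
    slopes = [np.polyfit(x, row, 1)[0] for row in perf]
    t, p_two = stats.ttest_1samp(slopes, 0.0)
    p_one = p_two / 2 if t > 0 else 1 - p_two / 2
    return t, p_one
```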
Evaluating the predicted representational structure
Using RSA (Fig. 1G; Kriegeskorte and Kievit, 2013), we tested whether the representational structure of category information is preserved for the predicted patterns. To do so, for each transformation state, within each half of the data, we constructed representational dissimilarity matrices (RDMs) for each leave-one-out cross-validation fold of the predicted patterns. We also constructed RDMs based on the average true pattern for each half of the data. Each cell of the RDM corresponds to one minus the Pearson's correlation coefficient of a pair of category patterns. The RDM captures the relative similarity among the different categories. To test whether the RDM of the predicted patterns is similar to that of the true patterns, we first vectorized the upper triangle of each RDM to generate a category similarity vector. We then calculated the Pearson's correlation of the category similarity vector from the predicted pattern in one-half of the data to the true category similarity vector in the other half of the data. The correlations were then averaged across all cross-validation folds and each half of the data. This analysis was performed separately for the predicted patterns of trained categories and the predicted patterns of untrained categories across each training set size. To account for differences in measurement reliability across the different ROIs, these correlations were further compared with two measures of split-half reliability of each ROI, as before. For the first measure, which we call Averaged-run Ceiling, for each ROI, we performed a cross-validation measure, where for one-half of the data, we created an RDM based on the average true pattern of all but one of the runs (similar to the training procedure) and then correlated this with the full average true pattern RDM in the other half of the data. The resulting correlations were averaged between the two states of the transformation to produce the split-half RDM reliability measure for that ROI. As before, this reliability measure allows us to ask the following: Is the predicted RDM derived from data not included in training using the trained linear function as good as the RDM from the average of the data included in training? For the second measure, which we call Single-run Ceiling, for each ROI, we again performed a cross-validation measure, but this time, for each run from half of the data, we created an RDM and correlated it with an RDM derived from the average pattern from the other half of the data. These correlations were averaged between the two states of the transformation. This reliability measure allows us to ask the following: Is the predicted RDM derived from data not included in training using the trained linear function as good as the RDM from the data not included in training?
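The RDM construction and comparison could look like the following sketch; the input layout and function names are illustrative assumptions.

```python
import numpy as np

def build_rdm(patterns):
    """patterns: (n_categories, n_voxels). Each RDM cell is one minus
    the Pearson correlation between a pair of category patterns."""
    return 1 - np.corrcoef(patterns)

def rdm_correlation(rdm_pred, rdm_true):
    """Vectorize the upper triangles of the two RDMs and return their
    Pearson correlation."""
    iu = np.triu_indices_from(rdm_pred, k=1)
    return np.corrcoef(rdm_pred[iu], rdm_true[iu])[0, 1]
```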
Relating category similarity to pattern prediction generalization
In this analysis, we examined how the generalizability of linear mapping may depend on the similarity among the different categories in a given brain region. To do so, for each ROI, using the cross-validated split-half analysis described earlier, we first obtained the linear mapping between the two states of a given transformation using data from only one category as the training data to predict the response of all other categories from one state to the other state. We then correlated the predicted pattern and the true pattern of the same category across the two halves of the data as described earlier. The resulting correlation coefficient was used as the prediction score for how well the training data of a given category could successfully predict the pattern of a different category. We repeated this analysis by including the data from each category as the training data to predict the response patterns of all other categories from one state to the other state. The results were averaged between the two states of a transformation and the two directions of predictions (i.e., using X to predict Y and using Y to predict X) and were used to construct a prediction similarity matrix in which each cell of the matrix reflects how well two categories may predict each other. For example, after training was performed using data from the elephant category, the predicted response pattern for the cat category was generated. The predicted and actual patterns for the cat category were then correlated to assess prediction accuracy. This was done for all possible pairings of categories to create a prediction similarity matrix.
To obtain the similarity among the object categories, we constructed a category similarity matrix with each cell of the matrix being the pairwise correlation of the true patterns of two categories across the two halves of the data. To match with how the predicted patterns were generated as described above, we averaged all but one run from one-half of the data and correlated this with the full average pattern from the other half of the data and then averaged all such correlations to generate the category similarity matrix. We vectorized the off-diagonal elements of these matrices and correlated them. If the generalizability of transformations depends on the similarity among the different categories in a given brain region, then we expected to see a high correlation between prediction similarity and category similarity. We did not correct the prediction similarity matrix and the category similarity matrix by the split-half reliability of each ROI here, as such a correction would not affect the final correlation results (because values were z normalized during correlation calculation, and thus any scaling factor would have no effect).
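This final comparison reduces to correlating the off-diagonal cells of the two matrices, as in the sketch below (the matrix names are assumptions).

```python
import numpy as np

def prediction_vs_category_similarity(pred_sim, cat_sim):
    """pred_sim: matrix of how well each category's training data
    predicts each other category's pattern; cat_sim: pairwise
    correlations of the true category patterns across data halves.
    Returns the correlation of their off-diagonal elements."""
    mask = ~np.eye(pred_sim.shape[0], dtype=bool)
    return np.corrcoef(pred_sim[mask], cat_sim[mask])[0, 1]
```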
Results
Previous research has shown that we can derive linear mappings within the human LOC for affine changes of objects that are generalizable to objects not included in training (Ward et al., 2018). We evaluated and extended this work using data from four existing fMRI experiments to test how Euclidean and non-Euclidean object transformations are represented throughout the human ventral visual hierarchy, which included V1 to V4, LOT, and VOT. We evaluated the predicted patterns in the following three different ways: how well they correlated with the true patterns, whether they showed category selectivity and were more similar to the true patterns of the same rather than different categories, and whether the category representational structure was preserved among the predicted patterns. We also examined the effect of training category (i.e., whether a category was included in the training data), the effect of training set size (i.e., the number of categories included in the training data), the effect of ROI (i.e., whether the effect differed among the different brain regions), and their interactions. We additionally examined how the ability of using the response of one category to predict that of another category was determined by the similarity of these two categories in a given brain region.
Evaluating the predicted fMRI response patterns
To understand how well a linear function can capture fMRI pattern differences between two states of an object in a given transformation, we evaluated how close the predicted pattern was to the true pattern by correlating the predicted patterns derived from one-half of the data with the corresponding true patterns from the other half. Overall, we found that a linear mapping was able to capture a significant amount of the changes associated with the transformations throughout the ventral visual stream, but predictions were tailored to categories included in training, suggesting that identity and nonidentity information are represented in a near-orthogonal manner. We did not see any evidence that predictions are better in higher visual regions than in lower visual regions. The full details of the results are described below.
To compare across ROIs and to account for differences in data reliability across the different ROIs, we calculated two split-half reliability measures, Averaged-run Ceiling and Single-run Ceiling, which involved the correlation of the training runs in one-half of the runs with the average of all the runs in the other half (see above, Materials and Methods). Comparison with Averaged-run Ceiling allowed us to evaluate whether the predicted patterns derived from data not included in training using the trained linear function were as good as the average of the data included in training (i.e., the average of the true patterns used for training); whereas comparison with Single-run Ceiling allowed us to evaluate whether such predicted patterns were as good as the data not included in training (i.e., the left-out true pattern). The results were evaluated at the participant group level using statistical tests.
If identity and nonidentity information are completely orthogonal to each other, then we would expect that a linear mapping would explain a significant amount of variance of the true data (i.e., the correlation between the predicted and true patterns is significantly >0) and that this linear mapping would predict the patterns of categories included and not included in training equally well. If, instead, identity and nonidentity information are near orthogonal, then we would still expect that the linear mapping would explain a significant amount of variance of the true data, but predictions would be significantly better for categories included than those not included in training. Finally, if identity and nonidentity information are nonorthogonal, then the learned mapping would not generalize to categories not included in training (i.e., the correlation between the predicted and true patterns is no different from zero for categories not included in training).
For all four transformation types and across both the training category and training set size manipulations, we found that predicted and true pattern correlations were overall quite high, ranging between 0.65 and 0.9, and were all significantly above 0 (Fig. 2, significance level for each condition indicated by the asterisks; all pairwise comparisons reported here and below were corrected for multiple comparisons using the Benjamini and Hochberg (1995) method). However, for all transformation types, ROIs, and training set sizes, all correlations were significantly less than Averaged-run Ceiling (Fig. 2A). These results indicate that although a linear mapping could capture a significant amount of the changes associated with these transformations, the predicted patterns were not as good as the average of the true patterns used for training. This shows that if the goal is to predict the best true pattern from the data, then it is better to use the average of the true patterns than to predict it through training and a linear mapping function. When we compared the correlations with Single-run Ceiling, for all transformations and nearly all ROIs, the correlations were above Single-run Ceiling for smaller training set sizes but roughly equal to Single-run Ceiling for larger training set sizes (there were a few exceptions in which the correlations for untrained categories went below Single-run Ceiling; Fig. 2A). Because we used a large number of training runs to predict the responses of the left-out run, noise from the training runs would cancel out, resulting in the predicted pattern for the left-out run being much less noisy than the actual pattern for the left-out run. This could explain why the predicted pattern showed a stronger correlation with the average of the other half of the runs than the actual pattern for the left-out run did. These results showed that the predicted patterns benefited from the inclusion of a large amount of training data to reduce noise and were overall better than or as good as the left-out true patterns but were never as good as the average of the true patterns included in training in predicting the best true pattern.
To understand the generalizability of linear mapping in pattern prediction, within each transformation we next examined in each ROI the effect of training category, training set size, and their interactions. The significance levels and effect sizes (partial eta squared; Cohen, 1973; see above, Materials and Methods) of these effects are reported in Table 1. Overall, the main effect of training category was significant in all ROIs across all four types of transformations in that including a category in the training data significantly improved prediction. Post hoc tests confirmed this and revealed that the effect of training category was present across training set sizes and transformation types (Fig. 2A). Meanwhile, increasing the training set size produced no benefit in any ROI or transformation type; rather, including more categories in the training data tended to decrease overall prediction performance (Fig. 3A). There were interaction effects between training category and training set size in all cases, in which the difference between trained and untrained categories decreased with larger training set sizes.
To understand whether linear mapping better predicts the neural response pattern in higher than in lower visual regions, we examined performance among the ROIs after normalizing all correlations by Averaged-run Ceiling to account for differences in noise across the different ROIs (Fig. 2B). To examine the effect of training set size and to streamline the analysis, we only included the lowest and the highest training set sizes from each transformation and tested the effect of ROI and how it may interact with the effects of training category and training set size. The position transformation showed a significant main effect of ROI (F(5,30) = 3.56, p < 0.01), with pattern prediction being lower for V4 and VOT than the other ROIs (post hoc paired t tests, ts > 2.14, ps < 0.05). Additionally, the size and SF transformations showed a marginally significant effect of ROI (Fs > 2.12, 0.06 < ps < 0.10). In the size transformation, this was mainly driven by better pattern predictions in V3 and LOT than the other ROIs (post hoc paired t tests, ts > 3.32, ps < 0.001), whereas in the SF transformation, this was mainly driven by worse pattern predictions in V1 and V4 than the other ROIs (post hoc paired t tests, ts > 2.84, ps < 0.01). The image stats transformation did not show a significant effect of ROI (F(5,25) = 0.80, p = 0.56). There were additionally interactions between ROI and training category, between ROI and training set size, and among all three factors for all transformations (Fs > 2.49, ps < 0.05). These interactions seemed to be mainly because of a greater difference in predictions for trained and untrained categories in LOT and VOT for the lowest training set size compared with V1–V4, so although performance for categories included in training was better for the lowest than for the highest training set size, performance was comparable for categories not included in training across the lowest and highest training set sizes. Overall, these results do not show that linear mapping consistently better predicts the neural response pattern in higher than in lower visual regions.
As another measure to evaluate whether prediction performance increased from lower to higher visual regions, we also tested the existence of a positive linear trend across the ROIs by fitting a regression line across the ROIs in each participant and testing whether the slope was positive at the group level. Across the different transformations and the different training set size and train category combinations, no slope was significantly >0 (ts < 1.24, ps > 0.29). This suggests that overall, linear mapping did not better predict the neural responses in higher than in lower visual regions across all four types of transformations.
Evaluating the selectivity of predicted fMRI response patterns
A successful linear mapping for a transformation would produce predicted patterns that are not only similar to the true patterns but also more similar to the true patterns of the same rather than different categories, thereby showing high category selectivity. To evaluate this, we tested whether the difference in correlation between the same and different categories was >0. Overall, we found significant selectivity in all ROIs for categories included in training but only in higher-level visual regions for categories not included in training. The full details of the results are described below.
For the categories included in training, correlation differences for all transformation types, ROIs, and training set sizes were significantly >0 (Fig. 4). For categories not included in training, for all transformation types, correlation differences in LOT and VOT tended to be significantly (or marginally significantly) >0 for large training set sizes (with an exception for VOT in the position transformation); the effect was less consistent for small training set sizes. Correlation differences in V1–V4 in general did not exceed 0, except in a few cases for the SF, image stats, and size transformations (Fig. 4A). These results showed that when a category was included in the training data, category selectivity was preserved in the predicted patterns across all ROIs, training set sizes, and transformations. However, when a category was not included in the training data, although category selectivity was still preserved in the predicted patterns for higher visual regions especially when more categories were included in the training, its preservation in lower visual regions was less consistent.
In further analysis, to better understand the generalizability of linear mapping in preserving category selectivity, within each transformation we examined in each ROI the effect of training category, training set size, and their interactions. The significance levels and effect sizes (partial eta squared; Cohen, 1973) of the effects are reported in Table 1. Overall, there was a main effect of training category in all ROIs and all transformation types in that including a category in training significantly improved category selectivity. Post hoc tests revealed that the effect of training category was present for all of the training set sizes across ROIs and transformations (Fig. 4A). There was also a main effect of training set size in almost all cases. Furthermore, in all cases, there was an interaction between training category and training set size in that the difference between categories included and those not included in the training data tended to decrease with increasing training set size. For the categories included in the training data, the main effect of training set size was significant in all ROIs for all the transformation types (Fs > 6.5, ps < 0.001), reflecting the fact that performance tended to decrease with increasing training set sizes; the opposite, however, was found for the categories not included in the training data: the main effect of training set size was significant in all ROIs for the size, image stats, and SF transformations and in four of the six ROIs for the position transformation (Fs > 5.72, ps < 0.001), with performance increasing with increasing training set sizes. These results are consistent with the linear mapping functions being more tailored to the categories included in training rather than being independent of these categories.
To understand whether category selectivity from the predicted patterns was better preserved in higher than in lower visual regions, we examined performance among the ROIs after normalizing by Averaged-run Ceiling to account for differences in noise in the different ROIs (Fig. 4B). As before, we only included the lowest and highest possible training set sizes and tested the main effect of ROI and its interaction with the main effects of training category and training set size. Overall, all transformations showed a main effect of ROI (Fs > 7.31, ps < 0.001), with LOT and VOT consistently showing higher category selectivity than the other regions in each transformation (post hoc pairwise tests, ts > 2.42, ps < 0.05). There were significant interactions between training category and training set size, between training category and ROI, between training set size and ROI, and among all three factors for all transformations (Fs > 3.30, ps < 0.05). These results showed that the effect of training category was smaller for lower than for higher visual regions and smaller for the largest than for the smallest training set size. Fitting a regression line across the ROIs for each participant revealed a significantly positive linear trend of increasing performance from lower-level to higher-level visual regions for three of the four training category and training set size combinations across all the transformations (ts > 2.20, ps < 0.05; Fig. 4B). The exception was for categories not included in training in the lowest training set size, where the slopes of the lines were ≤0 for all the transformation types (ts < −0.71, ps > 0.74). Note that in this condition, category selectivities were largely not significantly >0, showing a lack of success in pattern prediction, likely because the linear mapping function was much more tailored to the training category when only one category was included in the training data.
These results show that, on the whole, for predicted patterns that did exhibit category selectivity, selectivity was greater in higher than in lower visual regions. Category selectivity for categories not included in training also improved with more training categories, particularly in higher visual regions. Nevertheless, these results should not be taken to indicate that the linear mapping function predicted responses better in higher than in lower visual regions. This is because previous fMRI decoding studies have shown object category representation to be much stronger in higher than in lower visual regions (Vaziri-Pashkam and Xu, 2017, 2019; Vaziri-Pashkam et al., 2019). As such, greater category selectivity would be expected in higher than in lower visual regions even if the linear mapping function predicted responses equally well in both. Figure 3B provides a schematic explanation of how pattern predictability can be roughly equal in lower- and higher-level regions while pattern selectivity is better in higher-level regions. Although this analysis does not clearly differentiate among brain regions, we included it here for completeness, as it was the main analysis used in Ward et al. (2018).
Evaluating the prediction of representational structure
In this analysis, we tested whether the category representational structure was preserved in the patterns predicted by a linear mapping function. Although the analyses performed so far demonstrated that the predicted patterns correlated highly with the true patterns and showed high category selectivity, whether they retained the original category representational structure remained unknown. To test this, we generated a category RDM that included all pairwise correlations of the predicted patterns for each state of a given transformation for each half of the data. We then correlated the category RDM of the predicted patterns from one half of the data with the true category RDM from the other half to test how well category representational structures were preserved. Overall, we found that category structure was mostly preserved throughout the ventral visual stream for categories included in training but was only preserved in higher-level visual regions for categories not included in training. There was some evidence of better preservation in higher- than in lower-level visual regions. The full details of the results are described below.
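To make the procedure concrete, a minimal sketch of the RDM construction and split-half comparison might look as follows; we assume correlation distance (1 − r) as the RDM metric, and the function names are ours:

```python
import numpy as np
from scipy.spatial.distance import squareform

def category_rdm(patterns: np.ndarray) -> np.ndarray:
    """patterns: categories x voxels. Returns the pairwise
    1 - Pearson correlation matrix among category patterns."""
    return 1.0 - np.corrcoef(patterns)

def rdm_correlation(predicted_half1: np.ndarray,
                    true_half2: np.ndarray) -> float:
    """Correlate the off-diagonal elements of the predicted-pattern RDM
    from one half of the data with those of the true-pattern RDM from
    the other half."""
    v_pred = squareform(category_rdm(predicted_half1), checks=False)
    v_true = squareform(category_rdm(true_half2), checks=False)
    return np.corrcoef(v_pred, v_true)[0, 1]
```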
To compare across ROIs and account for differences in data reliability among them, as was done earlier, we calculated two reliability measures in each ROI, Averaged-run Ceiling and Single-run Ceiling, each involving the correlation of an RDM from one half of the runs with the RDM of the average of all the runs in the other half (see above, Materials and Methods). Comparison with Averaged-run Ceiling allowed us to evaluate whether the RDM of the predicted patterns, derived using the trained linear function from data not included in training, was as good as the RDM of the average of the true patterns used for training; comparison with Single-run Ceiling allowed us to evaluate whether the RDM of the predicted patterns was as good as the RDM of the data not included in training (i.e., the RDM of the left-out true pattern). The results were then evaluated at the participant group level using statistical tests.
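Under our reading of the Methods (the exact run partitioning is described there, and the variable layout below is an assumption), the two ceilings reuse rdm_correlation from the preceding sketch, differing only in which true patterns enter the RDM:

```python
import numpy as np

def averaged_run_ceiling(train_runs_half1: np.ndarray,
                         all_runs_half2: np.ndarray) -> float:
    """Both inputs: runs x categories x voxels. Correlates the RDM of
    the averaged training runs in one half with the RDM of the averaged
    runs in the other half."""
    return rdm_correlation(train_runs_half1.mean(axis=0),
                           all_runs_half2.mean(axis=0))

def single_run_ceiling(single_run_half1: np.ndarray,
                       all_runs_half2: np.ndarray) -> float:
    """single_run_half1: categories x voxels from one left-out run,
    compared against the averaged runs of the other half."""
    return rdm_correlation(single_run_half1,
                           all_runs_half2.mean(axis=0))
```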
For categories included in training, all correlations were significantly above zero (ts > 2.64, ps < 0.05). However, for categories not included in training, across all four transformations, only LOT and VOT had RDM correlations that were significantly above zero (ts > 2.71, ps < 0.05). The correlations for V1–V4 were only consistently above 0 for the image stats transformation (ts > 2.68, ps < 0.05) but not for the other transformations (ts < 1.74, ps > 0.11). These results showed that for categories included in training, regardless of the transformation type, category structure was at least partially preserved in the predicted patterns in both lower and higher visual regions; however, for categories not included in training, such preservation was only found in higher visual regions and was largely absent in lower visual regions.
Further, for categories included in training, RDM correlations were not significantly different from Averaged-run Ceiling for V1–V4 in the position transformation and for LOT and VOT in the position, size, and image stats transformations, particularly for smaller training set sizes (Fig. 5A). These results indicate that, in general, the predicted RDMs in lower-level visual areas were not as good as the RDMs of the average of the true patterns used for training, whereas in higher-level visual areas (and at low training set sizes), the predicted RDMs were as good.
Similar to the category predictability results, many of the RDM correlations for categories included in training were greater than Single-run Ceiling, demonstrating that RDMs generated from response patterns via the linear mapping procedure were less noisy than those from the actual patterns of single runs. In the size, image stats, and SF transformations, the correlations for categories not included in training were typically below Single-run Ceiling, whereas in the position transformation, these correlations were roughly equal to Single-run Ceiling. Thus, the linear mapping procedure may not be as successful in generalizing the relative similarities among categories to categories not included in training as it is for categories included in training.
In further analysis, to better understand the generalizability of linear mapping in preserving category representational structure, within each transformation and each ROI we examined the effects of training category, training set size, and their interaction. The significance levels and effect sizes of these effects are reported in Table 1 (partial eta squared; Cohen, 1973). There was a main effect of training category in all ROIs and all transformations, with categories included in training showing higher RDM correlations than those not included. There was also a main effect of training set size in all cases, with RDM correlations tending to decrease at larger training set sizes. Additionally, there were interactions between training category and training set size, with the difference between categories included and not included in training decreasing at larger training set sizes.
To understand whether category representational structure in the predicted patterns was better preserved in higher than in lower visual regions, we examined performance across the ROIs after normalizing the RDM correlations by Averaged-run Ceiling to account for differences in noise among the ROIs (Fig. 5B). As before, we only included the lowest and highest possible training set sizes to streamline the analysis and tested the main effect of ROI and its interactions with training category and training set size. A main effect of ROI was found for the size, image stats, and SF transformations (Fs > 5.36, ps < 0.01), with LOT and VOT showing consistently greater RDM predictions than the other ROIs (post hoc pairwise t tests, ts > 2.89, ps < 0.05). The main effect of ROI was absent for the position transformation (F(5,30) = 1.80, p = 0.14). To quantify whether performance increased from lower to higher visual regions, we tested for a positive linear trend across the ROIs in each of the four conditions of each transformation. Of the 16 total conditions, 10 showed a significant or marginally significant positive trend (ts > 2.19, ps < 0.053). The exceptions were in the position transformation, in which no condition showed a significant trend (ts < 0.51, ps > 0.31); in the image stats transformation, at the lowest training set size for categories included in training (t(5) = 1.24, p = 0.13); and in the SF transformation, at the lowest training set size for categories not included in training (t(5) = 0.83, p = 0.21). Overall, a little over half of the conditions showed better preservation of the representational structure in higher than in lower visual regions. There were no interactions between the main effect of ROI and that of either training category or training set size. The better preservation of the representational structure in higher than in lower visual regions in some of the conditions may be explained by better object representations, and thus a more distinctive object representational structure, in higher than in lower visual regions: a more distinctive representational structure would be more resilient to distortions caused by inaccuracies in the predicted pattern, even when such inaccuracies are similar across lower and higher visual regions.
Relating category similarity to pattern prediction generalization
The results obtained so far showed that although linear mapping could, to a great extent, predict fMRI response patterns for object categories in one state of a transformation using responses from another state, the prediction depended to a significant degree on the specific categories included in training rather than being completely category independent. Such category dependency predicts that categories that are similarly represented in a given ROI should better predict each other's response patterns with linear mapping. That is, if category A is more similar to category B than to category C in a brain region, then A and B should show similar linear separation across a transformation, so that the linear mapping function derived from the data of category A would make a better prediction for category B than for category C (Fig. 6A). We found evidence supporting this prediction, with larger effects in higher- than in lower-level visual regions.
To test this prediction, in each ROI, using the same linear mapping procedure, we used data from one category to predict the responses of all other categories from one transformation state to the other. We then correlated the predicted pattern and the true pattern of the same category across the two halves of the data. This analysis was rotated across all the categories, with each serving as the training category to predict the others. The results were averaged between the two states of a transformation and the two directions of prediction (i.e., using A to predict B and using B to predict A) and were used to construct a prediction similarity matrix in which each cell reflects how well two categories predict each other. To obtain the similarity among the object categories, we constructed a category similarity matrix with each cell being the pairwise correlation of the true patterns of two categories across the two halves of the data within the same state of the transformation, averaged over the two states. We then vectorized the off-diagonal elements of the two matrices and correlated them.
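The final comparison between the two matrices reduces to vectorizing and correlating their off-diagonal cells; a brief sketch (function names ours):

```python
import numpy as np

def upper_offdiag(m: np.ndarray) -> np.ndarray:
    """Vectorize the above-diagonal elements of a symmetric matrix."""
    return m[np.triu_indices_from(m, k=1)]

def similarity_vs_prediction(pred_sim: np.ndarray,
                             cat_sim: np.ndarray) -> float:
    """Correlate the off-diagonal cells of the prediction similarity
    matrix with those of the category similarity matrix."""
    return np.corrcoef(upper_offdiag(pred_sim),
                       upper_offdiag(cat_sim))[0, 1]
```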
To test whether category similarity increases pattern prediction, we first examined whether there was a positive correlation between the two matrices. The correlations were very high, between 0.69 and 0.96 across ROIs and transformations, and t tests confirmed that they were significantly >0 (ts > 12.34, ps < 0.001). Although the correlations were extremely high, they were all still significantly <1 (Fig. 6B). All the transformations showed a significant main effect of ROI (Fs > 4.85, ps < 0.001), with LOT and VOT showing greater correlations than V1–V4 (ts > 2.47, ps < 0.05). Across the different ROIs, a positive linear trend from lower to higher visual regions was found for all transformations (ts > 2.86, ps < 0.05). Overall, these results showed that category similarity played a large role in pattern prediction, with larger effects in higher-level visual regions. They provide additional evidence against a completely orthogonal representation of identity and nonidentity information in the human ventral visual system.
Discussion
Presently we still do not fully understand how object identity and nonidentity information are simultaneously represented in the primate visual system. Building on previous research by Ward et al. (2018), we analyzed data from two existing studies (Vaziri-Pashkam et al., 2019; Vaziri-Pashkam and Xu, 2019) and conducted a comprehensive investigation to examine the existence of general mapping functions among fMRI responses to object categories in different states of nonidentity transformations in the human brain. We examined responses from human early visual areas V1 to V4 and two higher object processing regions, LOT and VOT, to both Euclidean and non-Euclidean transformations of objects. For each transformation type and ROI, we derived a linear transformation matrix to link the fMRI response patterns of two states of an object category. Using this transformation matrix, we generated the predicted pattern of an object category in one state using its true pattern from the other state.
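For illustration, assuming ordinary least squares as the fitting procedure (the exact estimation details are given in the Materials and Methods and not reproduced here), the core representational transformation step can be sketched as:

```python
import numpy as np

def fit_linear_mapping(state1: np.ndarray, state2: np.ndarray) -> np.ndarray:
    """state1, state2: training categories x voxels response patterns in
    the two states of a transformation. Solves state1 @ W ~= state2 for
    the voxel-to-voxel transformation matrix W by least squares (with
    fewer categories than voxels, lstsq returns the minimum-norm
    solution)."""
    W, *_ = np.linalg.lstsq(state1, state2, rcond=None)
    return W

def predict_other_state(W: np.ndarray, pattern_state1: np.ndarray) -> np.ndarray:
    """Predict a held-out category's state-2 pattern from its state-1
    pattern using the learned mapping."""
    return pattern_state1 @ W
```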
We first evaluated how well the predicted pattern of a category correlated with the true pattern of the same category. By comparing the correlations of the predicted patterns to two split-half reliability measures of the true data (see above, Materials and Methods), we found that the predicted patterns were (1) not as good as the average of the true patterns used for training and (2) as good as or better than the left-out true patterns. Finding 2 likely arose because the noise across training runs canceled out during the linear mapping procedure, leading to a less noisy prediction than the actual pattern. We also found that pattern predictions were significantly better for categories included than for those not included in the training data and did not benefit from including more categories in the training data. The linear mapping functions derived during training are thus not entirely category independent but interact with the specific categories included in training. However, performance for categories not included in training was still high, with predicted patterns mostly as good as or better than the left-out true patterns, providing evidence for a near-orthogonal rather than a nonorthogonal representation of identity and nonidentity information. Higher visual regions did not exhibit better performance than lower visual regions.
In our second analysis, we evaluated whether the predicted patterns were closer to the true pattern of the same than different categories (i.e., pattern selectivity). Overall, the linear-mapping-predicted response patterns showed significant category selectivity, particularly for categories included in the training data across all ROIs and all transformations. For categories not included in the training data, only higher visual regions exhibited strong and consistent category selectivity when large numbers of categories were included in training. The linear mapping functions appeared to be tailored to a significant extent to the categories included in training.
As a third evaluation, we examined whether the representational structure among categories was preserved in the predicted patterns. In general, for lower-level visual areas, the predicted RDMs were not as good as the RDMs of the average of the true patterns used for training, whereas for higher-level visual areas (and low training set sizes), the predicted RDMs were as good. Moreover, for three of the four transformation types, the predicted RDMs for categories not included in training were not as good as the RDMs derived from the left-out true patterns, demonstrating that the linear mapping procedure may not be as successful in generalizing the relative similarities among categories to categories not included in training. Across the different transformation types, a positive trend of better performance from lower to higher visual regions was found in only a little over half of the conditions tested, indicating that prediction was not always better in higher than in lower regions.
To provide a further, independent test of category dependence in pattern prediction, we tested whether the ability of one category to predict another was determined by the similarity of the two categories in a given brain region. Category similarity played a large role in pattern prediction, with the effect being larger in higher- than in lower-level visual regions. These results provide additional evidence against a completely orthogonal representation of identity and nonidentity information in the human ventral visual system.
The linear mapping functions we derived were not entirely category independent, which could be because of either the near-orthogonal structure of the object representational space, as depicted in Figure 1A, or data overfitting during training. In a near-orthogonal structure, identity and nonidentity features can each be linearly decoded, but a learned linear mapping between two states of a transformation for one object category may not fully predict the neural response of a new object from one state to another.
We do not believe data overfitting played a significant role here. Had the object representational space contained category-independent, orthogonal representations of the nonidentity features, we would expect prediction performance to improve as more training data were included with increasing training set size, whether or not a category was included in the training data. However, larger training set sizes benefited neither pattern prediction nor the preservation of representational structure. For category selectivity, performance tended to decrease with increasing training set size for categories included in the training data but tended to increase for categories not included. Moreover, how well one category predicted another was significantly correlated with the similarity of the two categories. Together, these results argue against a data overfitting account and are more consistent with the object representational space being near orthogonal.
It may be argued that, because the true pattern of an object category always fluctuates somewhat owing to noise, the measured feature space would appear near orthogonal even if the true feature representational space were completely orthogonal. To account for such noise-driven pattern fluctuations, we used a split-half approach in the present study. In all our measures, we correlated the predicted patterns from one half of the data with the true patterns from the other half and then assessed whether the correlation was comparable to the correlation of the true patterns between the two halves. Any fluctuation in response patterns would thus be captured and accounted for by the less-than-perfect correlation between the true patterns across the two halves of the data. Additionally, if the feature representational space were indeed completely orthogonal, then, because we tested on left-out data not included in training, predictions should have been equally good for categories included in training and those that were not. However, we consistently found that predictions were better for categories included in training.
A feature-untangling view of object representation might have predicted a stronger linear mapping in higher than in lower visual regions, as identity features become progressively untangled from nonidentity features over the course of visual information processing (e.g., Rolls, 2000; DiCarlo and Cox, 2007; Rust and DiCarlo, 2010; Isik et al., 2014). However, we did not find significant differences in prediction performance across ROIs. Previous work has shown that nonidentity features such as the position and spatial frequency content of an image can be decoded independently of object identity throughout the ventral visual processing pathway (Hong et al., 2016; Vaziri-Pashkam et al., 2019; Xu and Vaziri-Pashkam, 2021). This indicates that certain nonidentity features can be represented somewhat independently of object identities and thus that there is some untangling of these nonidentity and identity features already in the early visual cortex. Consequently, it is possible for a general-purpose linear mapping function to link nonidentity information across different object identities throughout the ventral visual cortex. In this regard, our results provide a revision to the depiction of the object representational structure presented in DiCarlo and Cox (2007) and DiCarlo et al. (2012). Here, we show that there is some separation of identity and certain nonidentity features early in visual processing, with a near-orthogonal representation of object identity and these nonidentity features throughout the ventral visual regions. Meanwhile, it remains possible that for other real-world nonidentity features, such as viewpoint or pose, for larger-magnitude transformations even of the features tested here, or for combinations of different features, a linear mapping may successfully predict responses only in higher visual regions and not in the early visual cortex. Indeed, Pinto et al. (2008) emphasized the importance of probing a wide array of real-world image variation to accurately assess models of vision. Further research is needed to test these possibilities.
In summary, using a representational transformation analysis, we show that across the entire human ventral visual system, object responses in different states of nonidentity transformations can be linked through linear mapping functions for both Euclidean and non-Euclidean transformations. These mapping functions are not entirely identity independent, suggesting that object identity and nonidentity features are represented in a nearly rather than a completely orthogonal manner. Our study provides a useful framework to more precisely characterize how different types of object features may be represented together during visual processing in the human brain.
Footnotes
This research was supported by the National Institutes of Health Grants 1R01EY030854 and 1R01EY022355 to Y.X. and Grant MH108591 to M.M.C. We thank members of the Visual Cognitive Neuroscience Lab, Turk-Browne Lab, Holmes Lab, and Yale Cognitive and Neural Computation Lab for feedback on this project.
The authors declare no competing financial interests.
Correspondence should be addressed to Viola Mocz at viola.mocz@yale.edu or Yaoda Xu at xucogneuro@gmail.com.