Abstract
Deep neural networks (DNNs) are promising models of the cortical computations supporting human object recognition. However, despite their ability to explain a significant portion of variance in neural data, the agreement between models and brain representational dynamics is far from perfect. We address this issue by asking which representational features are currently unaccounted for in neural time series data, estimated for multiple areas of the ventral stream via source-reconstructed magnetoencephalography data acquired in human participants (nine females, six males) during object viewing. We focus on the ability of visuo-semantic models, consisting of human-generated labels of object features and categories, to explain variance beyond the explanatory power of DNNs alone. We report a gradual reversal in the relative importance of DNN versus visuo-semantic features as ventral-stream object representations unfold over space and time. Although lower-level visual areas are better explained by DNN features starting early in time (at 66 ms after stimulus onset), higher-level cortical dynamics are best accounted for by visuo-semantic features starting later in time (at 146 ms after stimulus onset). Among the visuo-semantic features, object parts and basic categories drive the advantage over DNNs. These results show that a significant component of the variance unexplained by DNNs in higher-level cortical dynamics is structured and can be explained by readily nameable aspects of the objects. We conclude that current DNNs fail to fully capture dynamic representations in higher-level human visual cortex and suggest a path toward more accurate models of ventral-stream computations.
SIGNIFICANCE STATEMENT When we view objects such as faces and cars in our visual environment, their neural representations dynamically unfold over time at a millisecond scale. These dynamics reflect the cortical computations that support fast and robust object recognition. Deep neural networks (DNNs) have emerged as a promising framework for modeling these computations but cannot yet fully account for the neural dynamics. Using magnetoencephalography data acquired in human observers during object viewing, we show that readily nameable aspects of objects, such as 'eye', 'wheel', and 'face', can account for variance in the neural dynamics over and above DNNs. These findings suggest that DNNs and humans may in part rely on different object features for visual recognition and provide guidelines for model improvement.
- categories
- features
- object recognition
- recurrent deep neural networks
- source-reconstructed MEG data
- vision
Introduction
When we view objects in our visual environment, the neural representation of these objects dynamically unfolds over time across the cortical hierarchy of the ventral visual stream. In brain recordings from both humans and nonhuman primates, this dynamic representational unfolding can be quantified from neural population activity, showing a staggered emergence of ecologically relevant object information such as facial features, followed by object categories, and then the individuation of these inputs into specific exemplars (Sugase et al., 1999; Hung et al., 2005; Meyers et al., 2008; Carlson et al., 2013; Clarke et al., 2013; Cichy et al., 2014; Ghuman et al., 2014; Isik et al., 2014; Hebart et al., 2018; Kietzmann et al., 2019b). These neural reverberations are thought to reflect the cortical computations that support object recognition.
Deep neural networks (DNNs) have recently emerged as a promising computational framework for modeling these cortical computations (Kietzmann et al., 2019a; Doerig et al., 2022). DNNs explain significant amounts of variance in neural data obtained from visual cortex in both humans and nonhuman primates (Khaligh-Razavi and Kriegeskorte, 2014; Yamins et al., 2014; Güçlü and van Gerven, 2015; Cichy et al., 2017; Bankson et al., 2018; Bonner and Epstein, 2018; Groen et al., 2018; Jozwik et al., 2018; Schrimpf et al., 2018; Zeman et al., 2020; Storrs et al., 2020b). The transformation of object representations from shallower to deeper layers of feedforward DNNs roughly matches the transformation of object representations observed in visual cortex as neural responses unfold over space and time (Khaligh-Razavi and Kriegeskorte, 2014; Güçlü and van Gerven, 2015; Cichy et al., 2017; Bankson et al., 2018; Jozwik et al., 2018; Zeman et al., 2020). Furthermore, DNNs that incorporate dynamics through recurrent processing provide additional explanatory power, possibly by better approximating the dynamic computations that the brain relies on for perceptual inference (O'Reilly et al., 2013; Liao and Poggio, 2016; Spoerer et al., 2017; Kubilius et al., 2018; Tang et al., 2018; Kar et al., 2019; Kietzmann et al., 2019a, b; Rajaei et al., 2019; Spoerer et al., 2020). However, DNNs still leave substantial amounts of variance in brain responses unexplained (Groen et al., 2018; Schrimpf et al., 2018; Bracci et al., 2019; Kietzmann et al., 2019b), and differences among feedforward architectures are small (Jozwik et al., 2019a, b), even after training and fitting (Storrs et al., 2020b). This raises the question of what representational features are left unaccounted for in the dynamic neural data.
To address this question, we enriched our modeling strategy with visuo-semantic object information. By “visuo-semantic”, we mean nameable properties of visual objects. Our visuo-semantic models consist of object labels generated by human observers, describing lower-level object features such as 'green', higher-level object features such as 'eye', and categories such as 'face'. The visuo-semantic labels can be interpreted as vectors in a space defined by humans at the behavioral level. In contrast to DNNs, our visuo-semantic models are not image computable. However, they provide unique benchmarks for comparison with image-computable models. Prior work indicates that visuo-semantic labels explain significant amounts of response variance in higher-level primate visual cortex (Tanaka, 1996; Kanwisher et al., 1997; Epstein and Kanwisher, 1998; Downing et al., 2001; Haxby et al., 2001; Kriegeskorte et al., 2008; Yamane et al., 2008; Freiwald et al., 2009; Huth et al., 2012; Issa and DiCarlo, 2012; Mur et al., 2012; Jozwik et al., 2016, 2018). Moreover, visuo-semantic models outperform DNN architectures (AlexNet, Krizhevsky et al., 2012; VGG, Simonyan and Zisserman, 2014) at predicting perceived object similarity in humans (Jozwik et al., 2017). In addition, a recent functional magnetic resonance imaging (fMRI) study showed that combining DNNs with a semantic feature model is beneficial for explaining visual object representations at advanced processing stages of the ventral visual stream (Devereux et al., 2018). Given these findings, we hypothesized that visuo-semantic models capture representational features in ventral-stream neural dynamics that DNNs fail to account for.
We tested this hypothesis on temporally resolved magnetoencephalography (MEG) data, which can capture representational dynamics at a millisecond timescale. Human brain data acquired at this rapid sampling rate provide rich information about temporal dynamics and, by extension, about the underlying neural computations. For example, in an MEG study that used source reconstruction to localize time series to distinct areas of the ventral visual stream, time series analyses revealed temporal interdependencies between areas suggestive of recurrent information processing (Kietzmann et al., 2019b).
In this work, we used representational similarity analysis (RSA) to test both DNNs and visuo-semantic models for their ability to explain representational dynamics observed across multiple ventral-stream areas in the human brain. As DNNs, we used feedforward CORnet-Z (Zero) and locally recurrent CORnet-R (Recurrent) models, which are inspired by the anatomy of monkey visual cortex (Kubilius et al., 2018). As visuo-semantic models, we used existing human-generated labels of object features and categories (Jozwik et al., 2016). We analyzed previously published source-reconstructed MEG data acquired in healthy human participants while they were viewing object images from a range of categories (Cichy et al., 2014; Kietzmann et al., 2019b). We investigated three distinct stages of processing in the ventral cortical hierarchy: lower-level visual areas (V1–V3), intermediate visual areas (V4t/lateral occipital cortex [LO]), and higher-level visual areas (inferior temporal cortex [IT]/parahippocampal cortex [PHC]) (Glasser et al., 2016). At each stage of processing, we tested both model classes for their ability to explain variance in the temporally evolving representations. This strategy allowed us to test what visuo-semantic object information is unaccounted for by DNNs as ventral-stream processing unfolds over space and time.
Materials and Methods
Stimuli
Stimuli were 92 colored images of real-world objects spanning a range of categories, including humans, nonhuman animals, natural objects, and manmade objects (12 human body parts, 12 human faces, 12 animal bodies, 12 animal faces, 23 natural objects, and 21 manmade objects). Objects were segmented from their backgrounds (Fig. 1a) and presented to human participants and models on a gray background.
Visuo-semantic models
Visuo-semantic models have been described in Jozwik et al. (2016, 2017), where further details can be found.
Definition of visuo-semantic models
To create visuo-semantic models, human observers generated feature labels (e.g., eye) and category labels (e.g., animal) for the 92 images (Jozwik et al., 2016). The visuo-semantic models are schematically represented in Figure 1, b and c. Feature labels were divided into colors, textures, shapes, and object parts, whereas category labels were divided into subordinate categories, basic categories, and superordinate categories. Labels were obtained in two experiments.

In experiment 1, a group of 15 human observers (mean age, 26 years; 11 females) generated feature and category labels for the object images. Human observers were native English speakers and had normal or corrected-to-normal vision. In the instruction, we defined features as visible elements of the shown object, including colors, textures, shapes, and object parts. We defined a category as a group of objects that the shown object is an example of. The instruction contained two example images, not part of the 92 object-image set, with feature and category descriptions. We asked human volunteers to list a minimum of five descriptions, both for features and for categories. The 92 images were shown, in random order, on a computer screen using a Web-based implementation, with text boxes next to each image for human observers to type feature or category descriptions. We subsequently selected, for features and categories separately, those descriptions that were generated by at least three of the 15 human observers. This threshold corresponds to the number of human observers who, on average, mentioned a particular feature or category for a particular image. The threshold is relatively lenient, but it allows the inclusion of a rich set of descriptions, which were further pruned in experiment 2. We subsequently removed descriptions that were either inconsistent with the instructions or redundant. Observers generated 212 feature labels and 197 category labels. These labels are the model dimensions.

In experiment 2, a separate group of 14 human observers (mean age, 28 years; 7 females) judged the applicability of each model dimension to each image, thereby validating the dimensions generated in experiment 1 and providing, for each image, its value (present or absent) on each of the dimensions. Human observers were native English speakers and had normal or corrected-to-normal vision. During the experiment, the object images and the descriptions, each in random order, were shown on a computer screen using a Web-based implementation. The object images formed a column, whereas the descriptions formed a row; together they defined a matrix with one entry, or checkbox, for each possible image-description pair. We asked the human observers to judge for each description whether it correctly described each object image and, if so, to tick the associated checkbox. A description was considered to apply to an image if at least 75% of the human observers from experiment 2 agreed; the resulting image values on the validated model dimensions define the model. To increase the stability of the models during subsequent fitting, we iteratively merged binary vectors that were highly correlated (r > 0.9), alternately computing pairwise correlations between the vectors and averaging highly correlated vector pairs, until all pairwise correlations were below threshold. The final feature and category models consisted of 119 and 110 dimensions, respectively.
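For concreteness, the core of this merging step can be sketched as follows. This is a minimal Python sketch, assuming the labels are stored as a binary images × dimensions matrix; the function and variable names are illustrative, and the sketch merges one highly correlated pair per iteration rather than reproducing the exact procedure used in the study.

```python
import numpy as np

def merge_correlated_dimensions(label_matrix, threshold=0.9):
    """Iteratively average pairs of label vectors whose Pearson correlation
    exceeds the threshold, until no pair remains above threshold.

    label_matrix : (n_images, n_dimensions) array of 0/1 label values.
    Returns an (n_images, n_merged_dimensions) array.
    """
    dims = [label_matrix[:, i].astype(float) for i in range(label_matrix.shape[1])]
    while True:
        corr = np.corrcoef(np.stack(dims, axis=0))  # pairwise correlations between dimensions
        np.fill_diagonal(corr, 0.0)
        i, j = np.unravel_index(np.nanargmax(corr), corr.shape)
        if corr[i, j] <= threshold:
            break  # all remaining pairwise correlations are at or below threshold
        merged = (dims[i] + dims[j]) / 2.0  # average the most highly correlated pair
        dims = [d for k, d in enumerate(dims) if k not in (i, j)] + [merged]
    return np.stack(dims, axis=1)
```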
Construction of the visuo-semantic representational dissimilarity matrices
To compare the models to the measured brain representations, the models and the data should reside in the same representational space. This motivates transforming our models to representational dissimilarity matrix (RDM) space. For each model dimension, we computed for each pair of images the squared difference between their values on that dimension. The squared difference reflects the dissimilarity between the two images in a pair. Given that a specific feature or category can either be present or absent in a particular image, image dissimilarities along a single model dimension are binary; they are zero if a feature or category is present or absent in both images, and one if a feature or category is present in one image but absent in the other. The dissimilarities were stored in an RDM, yielding as many RDMs as model dimensions. The full visuo-semantic model consists of 229 RDM predictors (119 feature predictors and 110 category predictors).
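The construction of the single-dimension RDMs can be sketched as follows. This is a minimal Python sketch, assuming the same binary images × dimensions matrix as above; names are illustrative.

```python
import numpy as np

def label_rdms(label_matrix):
    """Build one RDM per model dimension from binary image labels.

    label_matrix : (n_images, n_dimensions) array of 0/1 values.
    Returns an (n_dimensions, n_images, n_images) array of RDMs, where each
    entry is the squared difference between the two images on that dimension
    (0 if the label is present or absent in both images, 1 otherwise).
    """
    vals = label_matrix.astype(float)
    # (n_images, 1, n_dims) minus (1, n_images, n_dims), squared, dimensions moved to the front
    diffs = (vals[:, None, :] - vals[None, :, :]) ** 2
    return np.moveaxis(diffs, -1, 0)
```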
Deep neural networks
CORnet-Z and CORnet-R architectures have been described in Kubilius et al. (2018), where further details can be found.
Architecture and training
We used feedforward (CORnet-Z) and locally recurrent (CORnet-R; Kubilius et al., 2018) models in our analyses. The architectures of the two DNNs are schematically represented in Figure 1b. The architecture of CORnets is inspired by the anatomy of monkey visual cortex. Each processing stage in the model is thought to correspond to a cortical visual area, so the four model layers correspond to areas V1, V2, V4, and IT, respectively (Kubilius et al., 2018). The output of the last model layer is mapped to the behavioral choices of the model using a linear decoder. We chose the two CORnets because they have similar architectures, but one is purely feedforward and the other is feedforward plus locally recurrent; they are among the best models for predicting visual responses in monkey and human IT (Schrimpf et al., 2018; Jozwik et al., 2019a, b); and their architectures are relatively simple compared with other DNNs.

Each visual area in CORnet-Z consists of a single convolution, followed by a rectified linear unit (ReLU) nonlinearity and max pooling. CORnet-R introduces local recurrent dynamics within an area. The recurrence occurs only within an area; there are no bypass or feedback connections between areas. For each area, the input is downscaled twofold, and the number of channels is increased twofold by passing the input through a convolution, followed by group normalization (Wu and He, 2018) and a ReLU nonlinearity. The internal state of the area (initially zero) is added to the result and passed through another convolution, again followed by group normalization and a ReLU nonlinearity, resulting in the new internal state of the area. At time step t0, there is no input to V2 and beyond, and as a consequence no image-elicited activity is present beyond V1. From time step t1 onward, image-elicited activity is present in all visual areas, as the output of the previous area is immediately propagated forward. CORnet-R was trained using five time steps (t0–t4). Both DNNs were trained on 1.2 million images from the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) database (Russakovsky et al., 2015). The ILSVRC database provides annotations that contain a category label for each image, assigning the object in an image to one of 1000 categories, for example, 'daisy', 'macaque', and 'speedboat'. The task of the networks is to classify each object image into one of the 1000 categories.
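To make the CORnet-Z architecture concrete, the sketch below shows a simplified PyTorch stand-in for one processing stage (a single convolution, a ReLU nonlinearity, and max pooling), stacked into a four-area backbone with a linear decoder. This is an illustrative sketch only: the channel counts, kernel sizes, and pooling parameters are assumptions and do not reproduce the published CORnet-Z implementation.

```python
import torch.nn as nn

class CORblockZ(nn.Module):
    """Simplified sketch of one CORnet-Z-style processing stage:
    convolution -> ReLU nonlinearity -> max pooling."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=1, padding=kernel_size // 2)
        self.nonlin = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.nonlin(self.conv(x)))

# Four areas standing in for V1, V2, V4, and IT, followed by a linear decoder
# that maps the IT output to the 1000 ImageNet categories.
model = nn.Sequential(
    CORblockZ(3, 64), CORblockZ(64, 128), CORblockZ(128, 256), CORblockZ(256, 512),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 1000),
)
```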
Construction of the DNN representational dissimilarity matrices
DNN representations of the 92 images were computed from the layer activations of CORnet-Z and CORnet-R. For CORnet-Z, we included the decoder layer and the final processing stage (output) from each visual area layer, which resulted in five layers. For CORnet-R, we included the decoder layer and the final processing stage from each visual area layer for each time step, which resulted in 21 layers. For each layer of CORnet-Z and CORnet-R, we extracted the unit activations in response to the images and converted these into one activation vector per image. For each pair of images, we computed the dissimilarity (1 minus Spearman's correlation) between the activation vectors. This yielded an RDM for each DNN layer. The resulting RDMs capture which stimulus information is emphasized and which is de-emphasized by the DNNs at different stages of processing. The full DNN model consists of 26 RDM predictors (five CORnet-Z predictors and 21 CORnet-R predictors).
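The computation of a layer RDM can be sketched as follows. This is a minimal Python sketch; the function name is illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def layer_rdm(activations):
    """Compute an RDM (1 minus Spearman's correlation) from layer activations.

    activations : (n_images, n_units) array, one flattened activation vector per image.
    Returns an (n_images, n_images) dissimilarity matrix.
    """
    rho, _ = spearmanr(activations, axis=1)  # rank correlations between image pairs
    return 1.0 - rho
```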
MEG source-reconstructed data
Acquisition and analysis of the MEG data have been described in Cichy et al. (2014), where further details can be found. The source reconstruction of the MEG data has been described in Kietzmann et al. (2019b), where further details can be found.
Participants
Sixteen healthy human volunteers participated in the MEG experiment (mean age, 26 years; 10 females). MEG source reconstruction analyses were performed for the subset of 15 participants for whom structural and functional MRI data were acquired. Participants had normal or corrected-to-normal vision. Before scanning, the participants received information about the procedure of the experiment and gave their written informed consent to participate. The experiment was approved by the Massachusetts Institute of Technology Institutional Review Board and conducted in accordance with the Declaration of Helsinki.
Experimental design and task
Stimuli were presented at the center of the screen for 500 ms while participants performed a paper clip detection task. Stimuli were overlaid with a light gray fixation cross and displayed at a width of 2.9° visual angle. Participants completed 10–14 runs. Each image was presented twice in every run in random order. Participants were asked to press a button and blink their eyes in response to a paper clip image shown randomly every three to five trials. These trials were excluded from further analyses. Each participant completed two MEG sessions.
MEG data acquisition and preprocessing
MEG signals were acquired from 306 channels (204 planar gradiometers, 102 magnetometers) using an Elekta Neuromag TRIUX system at a sampling rate of 1000 Hz. The data were bandpass filtered between 0.03 and 330 Hz, cleaned using spatiotemporal filtering, and downsampled to 500 Hz. Baseline correction was performed using a time window of 100 ms before stimulus onset.
MEG source reconstruction
The source reconstructions were performed using the minimum norm estimation (MNE) Python toolbox (Gramfort, 2013). We used structural T1 scans of individual participants to obtain volume conduction estimates using single-layer boundary element models (BEMs) based on the inner skull boundary. Rather than deriving the BEMs from the FreeSurfer watershed algorithm, as originally implemented in the MNE Python toolbox, we extracted them using FieldTrip software because the original method yielded poor reconstruction results. The source space consisted of 10,242 source points per hemisphere. The source points were positioned along the gray/white matter boundary, as estimated via FreeSurfer. We defined source orientations as surface normals with a loose orientation constraint. For MEG/MRI alignment, we first performed an initial alignment based on fiducials and then used an iterative closest point procedure based on fiducials and digitizer points along the head surface. We estimated the sensor noise covariance matrix from the baseline period (100–0 ms before stimulus onset) and regularized it according to the Ledoit–Wolf procedure (Ledoit and Wolf, 2004). We projected source activations onto the surface normal, obtaining one activation estimate per point in source space and time. Source reconstruction allowed us to estimate temporal dynamics in specific brain regions. Note, however, that source reconstruction provides an estimate of which brain regions the signal is coming from rather than a direct measurement of representations in different brain regions (Hauk et al., 2022).
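The main steps of this pipeline can be sketched with the MNE Python toolbox as follows. This is an illustrative sketch only: the specific parameter values (source spacing, loose constraint, regularization) are assumptions, and, unlike in the study, the BEM surface here would come from MNE's own pipeline rather than from FieldTrip.

```python
import mne

def reconstruct_sources(epochs, evoked, subject, subjects_dir, trans):
    """Minimum norm source reconstruction (illustrative parameter choices)."""
    # Surface source space with 10,242 points per hemisphere (ico5),
    # positioned along the gray/white matter boundary estimated by FreeSurfer.
    src = mne.setup_source_space(subject, spacing="ico5", subjects_dir=subjects_dir)
    # Single-layer BEM based on the inner skull boundary. NOTE: the study
    # extracted BEM surfaces with FieldTrip; MNE's own BEM pipeline is used
    # here purely for illustration.
    bem = mne.make_bem_solution(
        mne.make_bem_model(subject, conductivity=(0.3,), subjects_dir=subjects_dir))
    fwd = mne.make_forward_solution(epochs.info, trans=trans, src=src, bem=bem)
    # Sensor noise covariance from the prestimulus baseline, Ledoit-Wolf regularized.
    noise_cov = mne.compute_covariance(epochs, tmax=0.0, method="ledoit_wolf")
    # Minimum norm inverse with a loose orientation constraint; project source
    # activity onto the surface normal to get one estimate per source point and time.
    inv = mne.minimum_norm.make_inverse_operator(epochs.info, fwd, noise_cov, loose=0.2)
    return mne.minimum_norm.apply_inverse(evoked, inv, lambda2=1.0 / 9.0,
                                          method="MNE", pick_ori="normal")
```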
Definition of regions of interest
We used a multimodal brain atlas (Glasser et al., 2016) to define regions of interest (ROIs). We defined three ROIs covering lower-level (V1–V3), intermediate (V4t/LO1–3), and higher-level visual areas (IT/PHC, consisting of TE1-2p, fusiform face complex [FFC], ventral visual complex [VVC], ventromedial visual area [VMV]2–3, parahippocampal area [PHA]1–3). We converted the atlas annotation files to fsaverage coordinates (Fischl et al., 1999) and mapped them to each individual participant using spherical averaging.
Construction of the MEG representational dissimilarity matrices
We computed temporally changing RDM movies from the source-reconstructed MEG data for each participant, ROI, hemisphere, and session. We first extracted a trial-average multivariate source time series for each stimulus. We then computed an RDM at each time point by estimating the pattern distance between all pairs of images using correlation distance (1 minus Pearson's correlation). The RDM movies were averaged across hemispheres and sessions, resulting in one RDM movie for each participant and ROI.
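The computation of an RDM movie can be sketched as follows. This is a minimal Python sketch; names are illustrative.

```python
import numpy as np

def rdm_movie(source_data):
    """Compute an RDM at each time point from source-reconstructed responses.

    source_data : (n_images, n_sources, n_times) array of trial-averaged
                  source time series for one participant, ROI, hemisphere, and session.
    Returns an (n_times, n_images, n_images) array of correlation-distance
    (1 minus Pearson's r) RDMs.
    """
    n_images, _, n_times = source_data.shape
    movie = np.empty((n_times, n_images, n_images))
    for t in range(n_times):
        patterns = source_data[:, :, t]          # (n_images, n_sources) at time t
        movie[t] = 1.0 - np.corrcoef(patterns)   # pairwise Pearson correlation distance
    return movie
```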
Evaluating and comparing model performance
To assess performance of the models at explaining variance in the source-reconstructed MEG data, we performed first- and second-level model fitting as described below. Model fitting within the RSA framework has been described in Khaligh-Razavi and Kriegeskorte (2014), Jozwik et al. (2016, 2017), Kietzmann et al. (2019b), Storrs et al. (2020a), and Kaniuth and Hebart (2021), where further details can be found.
First-level model fitting: obtaining cross-validated model predictions
One way to predict the brain representations is to assume that each model dimension, that is, each visuo-semantic object label or each DNN layer, contributes equally to the representation. Our visuo-semantic models use the squared Euclidean distance as the representational dissimilarity measure, which is the sum across dimensions of the squared response difference for a given pair of stimuli. The squared differences simply sum across dimensions, so the model prediction would be the sum of the single-dimension model RDMs. A similar reasoning applies to our DNN model, which uses the correlation distance as the representational dissimilarity measure. The correlation distance is proportional to the squared Euclidean distance between normalized patterns. However, we expect that not all model dimensions contribute equally to brain representations. To improve model performance, we linearly combined the different model dimensions to yield an object representation that best predicts the source-reconstructed MEG data. Because the squared differences sum across dimensions in the squared Euclidean distance, weighting the dimensions and computing the RDM is equivalent to a weighted sum of the single-dimension RDMs. When a dimension is multiplied by weight w, the squared differences along that dimension are multiplied by w². We can therefore perform the fitting on the RDMs.

We performed model fitting for the DNN model (26 predictors), the visuo-semantic model (229 predictors), and for the following visuo-semantic submodels: color (10 predictors), texture (12 predictors), shape (15 predictors), object parts (82 predictors), subordinate categories (38 predictors), basic categories (67 predictors), and superordinate categories (5 predictors). We included a constant term in each model to account for homogeneous changes in dissimilarity across the whole RDM. For each model, we estimated the model weights using regularized (L2) linear regression, implemented in MATLAB using the Glmnet package (https://hastie.su.domains/glmnet_matlab/). We standardized the predictors before fitting and constrained the weights to be non-negative. To prevent biased model predictions because of overfitting to the images, model predictions were estimated by cross-validation to a subset of the images held out during fitting. For each cross-validation fold, we randomly selected 84 of the 92 images as the training set and eight images as the test set, with the constraint that the test images had to contain four animate objects (two faces and two body parts) and four inanimate objects. We used the pairwise dissimilarities of the training images to estimate the model weights. The model weights were then used to predict the pairwise dissimilarities of the eight held-out images. This procedure was repeated many times until predictions were obtained for all pairwise dissimilarities. For each cross-validation fold, we determined the best regularization parameter (i.e., the one with the minimum squared error between prediction and data) using nested cross-validation to held-out images within the training set. We performed the first-level fitting procedure for each participant, ROI, and time point.
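The study performed this fitting with the Glmnet package in MATLAB; the core of the fitting step can be sketched in Python as follows. This is an illustrative sketch that implements non-negative L2-regularized regression via non-negative least squares on an augmented design matrix, with a fixed regularization value for brevity rather than the nested cross-validation and held-out-image prediction used in the study; names are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def fit_nonneg_ridge(predictor_rdms, data_rdm, lam=1.0):
    """Non-negative L2-regularized fit of single-dimension RDM predictors to a data RDM.

    predictor_rdms : (n_predictors, n_images, n_images) array.
    data_rdm       : (n_images, n_images) array.
    lam            : regularization strength (selected by nested cross-validation
                     in the study; fixed here for brevity).
    Returns the non-negative weights, including a constant term (last entry).
    """
    n_images = data_rdm.shape[-1]
    iu = np.triu_indices(n_images, k=1)                  # upper-triangular image pairs
    X = np.stack([rdm[iu] for rdm in predictor_rdms], axis=1)
    X = (X - X.mean(axis=0)) / X.std(axis=0)             # standardize predictors
    X = np.column_stack([X, np.ones(X.shape[0])])        # constant term
    y = data_rdm[iu]
    # Ridge with non-negativity: NNLS on the design matrix augmented with
    # sqrt(lam) * identity (the constant term is left unpenalized here).
    penalty = np.sqrt(lam) * np.eye(X.shape[1])
    penalty[-1, -1] = 0.0
    weights, _ = nnls(np.vstack([X, penalty]),
                      np.concatenate([y, np.zeros(X.shape[1])]))
    return weights
```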
Second-level model fitting: estimating model performance
We estimated model performance using a second-level general linear model (GLM) approach. We used the cross-validated RDM predictions from the first-level model fitting as GLM predictors. We included a constant term in the GLM to account for homogeneous changes in dissimilarity across the whole RDM. We fit the GLM predictors to the source-reconstructed MEG data using non-negative least squares. We first estimated the variance explained by each individual model when fit in isolation (reduced GLM). We next estimated the variance explained by the visuo-semantic and DNN models when fit simultaneously (full GLM). We then computed the unique variance explained by each model by subtracting the variance explained by the reduced GLMs from the variance explained by the full GLM. For example, to compute the unique variance explained by the visuo-semantic model, we subtracted the variance explained by the DNN model from the variance explained by the full GLM. This approach allowed us to address whether visuo-semantic models capture representational features in ventral-stream dynamics that DNNs fail to account for and vice versa. We also estimated the unique variance explained in the source-reconstructed MEG data for visuo-semantic submodels in the presence of the DNN model, again by fitting a full GLM (all models included) and reduced GLMs (excluding the model of interest). We performed the second-level GLM fitting procedure for each participant, ROI, and time point.
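The variance-partitioning logic of the second-level GLM can be sketched as follows. This is a minimal Python sketch operating on vectorized RDMs (the cross-validated model predictions and the data); names are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def explained_variance(predictors, y):
    """R^2 of a non-negative least-squares fit (with a constant term)."""
    X = np.column_stack(list(predictors) + [np.ones(len(y))])
    w, _ = nnls(X, y)
    resid = y - X @ w
    return 1.0 - resid.var() / y.var()

def unique_variance(pred_a, pred_b, y):
    """Unique variance explained by predictor A over predictor B:
    R^2 of the full GLM (A + B) minus R^2 of the reduced GLM (B only).

    pred_a, pred_b : vectorized cross-validated RDM predictions; y : vectorized data RDM.
    """
    full = explained_variance([pred_a, pred_b], y)
    reduced = explained_variance([pred_b], y)
    return full - reduced
```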
Statistical inference on model performance
To evaluate the significance of the (unique) variance explained by each model across participants, we first subtracted an estimate of the prestimulus baseline in each participant and then performed a one-sided Wilcoxon signed-rank test against zero. The prestimulus baseline was defined as the average (unique) variance explained between 200 and 0 ms before stimulus onset. We also tested whether and when the (unique) variance explained differed between the visuo-semantic and DNN models using a two-sided Wilcoxon signed-rank test. We controlled the expected false discovery rate at 0.05 across time points for each model evaluation, model comparison, and ROI. We used a continuity criterion (minimally 10 consecutive significant time points sampled every 2 ms = 20 ms) to report significant time points in the manuscript text. For completeness, Figures 2 and 3 show significant time points both before and after applying the continuity criterion. Lines shown in Figures 2, 3, and 4 were low-pass filtered at 80 Hz (Butterworth IIR filter, order 6) for better visibility. Statistical inference is based on unsmoothed data.
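The per-time-point inference can be sketched as follows. This is a minimal Python sketch, assuming a participants × time points array of (unique) variance-explained estimates; the function name, the handling of ties, and the run-length bookkeeping are illustrative.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def significant_timepoints(r2, baseline_window, min_run=10):
    """Baseline-correct, test against zero across participants (one-sided Wilcoxon
    signed-rank test), control the expected FDR at 0.05 across time points, and keep
    only runs of at least `min_run` consecutive significant time points
    (20 ms at a 2 ms sampling step).

    r2 : (n_participants, n_times) array of (unique) variance explained.
    baseline_window : boolean mask over time points marking the prestimulus baseline.
    """
    corrected = r2 - r2[:, baseline_window].mean(axis=1, keepdims=True)
    pvals = np.array([wilcoxon(corrected[:, t], alternative="greater").pvalue
                      for t in range(corrected.shape[1])])
    sig = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
    # Continuity criterion: discard significant runs shorter than min_run time points.
    keep = np.zeros_like(sig)
    start = None
    for t, s in enumerate(np.append(sig, False)):
        if s and start is None:
            start = t
        elif not s and start is not None:
            if t - start >= min_run:
                keep[start:t] = True
            start = None
    return keep
```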
Data availability
The datasets and code generated during the current study are available from the corresponding authors on request.
Results
DNNs better explain lower-level visual representations, visuo-semantic models better explain higher-level visual representations
We first evaluated the overall ability of the DNN and visuo-semantic models to explain the time course of information processing along the human ventral visual stream. We hypothesized that visuo-semantic models capture representational features in neural data that DNNs may fail to account for. Figure 1 shows an overview of our approach. We computed RDM movies from the source-reconstructed MEG data to characterize how the ventral-stream object representations evolved over time in each participant. We computed an RDM movie for each participant and ROI and explained variance in the movies using a DNN model and a visuo-semantic model. The DNN model consisted of internal object representations in layers of CORnet-Z, a purely feedforward model, and CORnet-R, a locally recurrent variant (Kubilius et al., 2018), to account for both feedforward and locally recurrent computations. The visuo-semantic model consisted of human-generated labels of object features (e.g., brown, furry, round, ear; 119 labels) and categories (e.g., Great Dane, dog, organism; 110 labels) for the object images presented during the MEG experiment (Jozwik et al., 2016). We computed model predictions by linearly combining either all DNN layers or all visuo-semantic labels to best explain variance in the RDM movies across time. We evaluated the model predictions on data for images left out during fitting. For each model, we tested whether and when the variance explained in the RDM movies exceeded the prestimulus baseline using a one-sided Wilcoxon signed-rank test. We also tested whether and when the amounts of explained variance differed between the two models using a two-sided Wilcoxon signed-rank test. We controlled the expected false discovery rate at 0.05 across time points. We applied a continuity criterion (20 ms) for reporting results in the text.
For lower-level visual cortex (V1–V3), the DNN model explained significant amounts of variance between 60 and 638 ms and between 818 and 884 ms after stimulus onset, whereas the visuo-semantic model did so between 118 and 660 ms after stimulus onset (118–142, 146–178, 194–256, 264–414, 430–458, 486–520, 570–598, 608–660 ms; Fig. 2a). The DNN model explained more variance than the visuo-semantic model during the early (66–128 ms) as well as the late (422–516, 520–544, 820–844 ms) phases of the response. For intermediate visual cortex (V4t/LO), the DNN model explained variance predominantly between 62 and 610 ms after stimulus onset (62–562, 590–610, 820–848, 854–874, 952–976 ms), whereas the visuo-semantic model explained variance predominantly between 110 and 562 ms after stimulus onset (110–478, 482–562, 832–854 ms; Fig. 2a). The amount of explained variance did not significantly differ between the two models. Together, the results for lower-level visual cortex indicate that the DNN model outperformed the visuo-semantic model at explaining object representations during the early phase of the response (<128 ms after stimulus onset) as well as the late phase of the response (>422 ms after stimulus onset). In contrast, for higher-level visual cortex (IT/PHC), the visuo-semantic model outperformed the DNN model. The DNN model explained variance only between 182 and 270 ms after stimulus onset (Fig. 2a). The visuo-semantic model explained variance during a longer time window, between 96 and 658 ms after stimulus onset (96–464, 468–500, 542–578, 606–658 ms; Fig. 2a). Furthermore, the visuo-semantic model explained more variance than the DNN model between 146 and 488 ms after stimulus onset (146–188, 196–234, 326–344, 348–402, 412–464, 468–488 ms). In summary, the results across the ventral-stream regions show a reversal in which model best explains variance in the RDM movies, from the DNN model in lower-level visual cortex, starting at 66 ms after stimulus onset, to the visuo-semantic model in higher-level visual cortex, starting at 146 ms after stimulus onset.
Visuo-semantic models explain unique variance in higher-level visual representations
Our results suggest that DNNs and visuo-semantic models explain complementary components of human ventral-stream representational dynamics. To explicitly test this hypothesis, we assessed the unique contributions of the two models. For this, we first computed the best RDM predictions for each model class and then used the resulting cross-validated RDM predictions in a second-level GLM in which we combined the two model classes. We computed the unique contribution of a model class by subtracting the variance explained by the reduced model (i.e., the GLM without the model class of interest) from the variance explained by the full model (including both model classes). For lower-level visual cortex (V1–V3), the DNN model explained unique variance between 60 and 638 ms and between 818 and 884 ms after stimulus onset, whereas the visuo-semantic model did so between 124 and 654 ms after stimulus onset (124–142, 148–170, 228–246, 298–364, 368–412, 612–654 ms; Fig. 2b). For intermediate visual cortex (V4t/LO), the DNN model explained unique variance predominantly between 62 and 610 ms after stimulus onset (62–558, 590–610, 820–848, 952–976 ms), whereas the visuo-semantic model did so predominantly between 118 and 546 ms after stimulus onset (118–478, 490–546, 832–854 ms; Fig. 2b). These results indicate that the DNN and visuo-semantic models each explained a significant amount of unique variance in lower-level and intermediate visual cortex compared with the baseline period. However, for lower-level visual cortex, the DNN model explained more unique variance than the visuo-semantic model during the early (66–128 ms) as well as the late phases of the response (422–516, 520–544, 820–844 ms). For intermediate visual cortex, the unique variance explained did not significantly differ between the two models. For higher-level visual cortex (IT/PHC), only the visuo-semantic model explained unique variance, between 104 and 640 ms after stimulus onset (104–464, 468–500, 542–578, 608–640 ms). Furthermore, the visuo-semantic model explained significantly more unique variance than the DNN model between 146 and 488 ms after stimulus onset (146–188, 196–234, 326–344, 348–402, 412–464, 468–488 ms; Fig. 2b). These results indicate that, in the context of a visuo-semantic predictor, the tested DNNs explain unique variance at lower-level but not higher-level stages of visual processing, which instead show a unique contribution of visuo-semantic models. Visuo-semantic models appear to explain components of the higher-level visual representations that DNNs fail to fully capture, starting at 146 ms after stimulus onset.
Object parts and basic categories contribute to the unique variance explained by visuo-semantic models in higher-level visual representations
To better understand which components of the visuo-semantic model contribute to explaining unique variance in the higher-level visual representations, we repeated our analyses separately for subsets of object features and subsets of categories. We grouped the visuo-semantic labels into the following subsets: color, texture, shape, and object parts, as well as subordinate, basic, and superordinate categories (Fig. 1b). The dimensionality of the submodels was naturally smaller than that of the full visuo-semantic model, which consisted of 229 object labels. The number of dimensions for the submodels was as follows: color (10), texture (12), shape (15), object parts (82), subordinate categories (38), basic categories (67), and superordinate categories (5). Some of the submodels explained a similar amount of variance as the full visuo-semantic model (Fig. 3a), which indicates that including fewer dimensions did not necessarily reduce model performance. A more in-depth understanding of the relationship between model dimensionality and performance remains an important objective for future study. Here, we found that among the object features, only object parts explained variance in higher-level visual cortex (IT/PHC; Fig. 3a). Furthermore, object parts explained unique variance in higher-level visual cortex, whereas the DNN model did not (Fig. 3b). Among the categories, subordinate and basic categories explained variance in higher-level visual cortex (Fig. 3a). Furthermore, each of these models explained unique variance in higher-level visual cortex, whereas the DNN model did not (Fig. 3b). We next evaluated the three best predictors among the object features and categories (object parts, subordinate categories, and basic categories) together in the context of the DNN predictor. Although object parts, subordinate categories, basic categories, and DNNs all explained variance in higher-level visual cortex, only object parts and basic categories explained unique variance (Fig. 3a,b).
Discussion
Neural representations of visual objects dynamically unfold over time as we are making sense of the visual world around us. These representational dynamics are thought to reflect the cortical computations that support human object recognition. Here, we show that DNNs and human-derived visuo-semantic models explain complementary components of representational dynamics in the human ventral visual stream, estimated via source-reconstructed MEG data. We report a gradual reversal in the importance of DNN and visuo-semantic features from lower- to higher-level visual areas. DNN features explain variance over and above visuo-semantic features in lower-level visual areas V1–V3 starting early in time (at 66 ms after stimulus onset). In contrast, visuo-semantic features explain variance over and above DNN features in higher-level visual areas IT/PHC starting later in time (at 146 ms after stimulus onset). Among the visuo-semantic features, object parts and basic categories drive the advantage over DNNs. Our results suggest that a significant component of the variance unexplained by DNNs in higher-level visual areas is structured and can be explained by relatively simple, readily nameable aspects of the images. Figure 4 shows a visual summary of our results. Consistent with our hypothesis, our findings suggest that current DNNs fail to fully capture the visuo-semantic features represented in higher-level human visual cortex and suggest a path toward more accurate models of ventral-stream computations.
Our finding that DNNs outperform visuo-semantic models at explaining lower-level cortical dynamics replicates and extends prior fMRI work, which showed that DNNs explain response variance across all stages of the ventral stream, whereas visuo-semantic models predominantly explain response variance in higher-level visual cortex (Huth et al., 2012; Khaligh-Razavi and Kriegeskorte, 2014; Güçlü and van Gerven, 2015; Devereux et al., 2018; Jozwik et al., 2018). Using source-reconstructed MEG data, we show that the advantage of DNNs over visuo-semantic models in V1–V3 emerges early in time, starting within 70 ms after stimulus onset. The early advantage lasts for ∼60 ms. During this early time window, the response is likely dominated by feedforward and local recurrent processing as opposed to top-down feedback signals from higher-level areas (Isik et al., 2014; Kietzmann et al., 2019b). DNNs also outperform visuo-semantic models in V1–V3 late in time, starting around 420 ms after stimulus onset. The late advantage lasts for ∼120 ms. Prior analysis of the same source-reconstructed MEG data showed a relative increase in the explanatory power of lower-level visual features (GIST model; Oliva and Torralba, 2006) and face clustering in V1–V3 during this late time window (Kietzmann et al., 2019b). These effects were observed in the presence of a slightly elevated noise ceiling. During the late time window, the response may reflect an interplay between bottom-up stimulus processing and top-down feedback signals. Our results show the importance of analyzing temporally resolved neuroimaging data for revealing when in time competing models account for the rapid dynamic unfolding of human ventral-stream representations.
Our findings show that DNNs, despite reaching human-level performance on large-scale object recognition tasks (Schrimpf et al., 2018), fail to fully capture visuo-semantic features represented in higher-level human visual cortex, in particular object parts and basic categories. Higher-level visual representations in dynamic MEG data instead more closely resemble human perceptual judgments of object properties. In line with our results, prior fMRI work showed that DNNs only adequately accounted for higher-level visual representations after adding new representational features (Khaligh-Razavi and Kriegeskorte, 2014; Devereux et al., 2018; Storrs et al., 2020a, b). The new features were either explicit semantic features (Devereux et al., 2018) or were created by linearly combining DNN features to emphasize categorical divisions observed in the higher-level visual representations, including the division between faces and nonfaces and between animate and inanimate objects (Khaligh-Razavi and Kriegeskorte, 2014; Storrs et al., 2020a). Our results show that visuo-semantic models start outperforming DNNs in higher-level visual areas ∼150 ms after stimulus onset. This timeline coincides with the emergence of animate clustering in these areas (Kietzmann et al., 2019b) as well as with the emergence of conceptual object representations as reported in prior MEG work (Bankson et al., 2018). Our results are also consistent with an earlier MEG study showing that adding semantic features to a simpler HMAX (Hierarchical Model and X) model was beneficial for modeling object representations in visual cortex starting ∼200 ms after stimulus onset (Clarke et al., 2015). DNNs may, at least in part, use different object features for object recognition than humans do. This conclusion is consistent with prior reports that DNNs rely more strongly on lower-level image features such as texture for object categorization (Geirhos et al., 2019).
Although we refer to both DNNs and visuo-semantic object labels as models, there are substantial differences between the two. DNNs are image computable, which means that they can compute a representation for any image. In contrast, visuo-semantic object labels are generated by human observers. How the human brain computes these labels remains unknown. This can be considered a disadvantage relative to DNNs, which are computationally explicit. We have full knowledge of their computational units and of the transformations applied to the image at each processing stage. However, it is challenging to pinpoint what these processing stages represent and how they may differ from those in humans. Visuo-semantic object labels, on the other hand, are easy to interpret. By comparing DNNs and visuo-semantic models in their ability to capture human ventral-stream representational dynamics, we can identify features in the data that DNNs fail to account for and use the outcomes to guide model improvement.
Our results are consistent with theories that propose an integral role for feedback in visual perception (Rao and Ballard, 1999; Bar, 2003; Ahissar and Hochstein, 2004). As summarized in Figure 4, within the first 120 ms of stimulus processing, we observe a peak in the relative contribution of DNNs in lower-level and intermediate visual cortex, followed by a peak in the relative contribution of visuo-semantic models in higher-level visual cortex. These peaks may reflect a feedforward sweep of initial stimulus processing, which is thought to support perception of the gist of the visual scene and initial analysis of category information (Oliva and Torralba, 2006; Lowe et al., 2018; Kirchner and Thorpe, 2006; Liu et al., 2009). The initial peaks are followed by a visuo-semantic peak in intermediate visual cortex ∼150 ms after stimulus onset, which appears after a period of possible feedback information flow from higher-level to intermediate visual cortex (Kietzmann et al., 2019b), and additional fluctuations in relative model performance as time unfolds. These fluctuations include a reappearance of the advantage of DNNs over visuo-semantic models in lower-level visual cortex ∼420 ms after stimulus onset. The observed sequence of events is consistent with the reverse hierarchy theory of visual perception, which proposes an initial feedforward analysis for vision at a glance followed by explicit feedback signaling for vision with scrutiny (Ahissar and Hochstein, 2004). Future research should study visual perception under challenging viewing conditions, including occlusion and clutter, which are expected to strongly engage feedback signals and recurrent computation (Lamme and Roelfsema, 2000; O'Reilly et al., 2013; Spoerer et al., 2017; Tang et al., 2018; Kar et al., 2019; Rajaei et al., 2019; Kietzmann et al., 2019b).
Our study makes several important contributions to the existing body of work on modeling ventral-stream computations with DNNs. First, our results suggest that introducing locally recurrent connections to DNNs, to more closely match the architecture of the ventral visual stream, is not sufficient to fully capture the representational dynamics observed in higher-level human visual cortex. Second, our results tie together space and time through analysis of source-reconstructed MEG data. We show that DNNs outperform visuo-semantic models in lower-level visual areas V1–V3 starting at 66 ms after image onset, whereas visuo-semantic models outperform DNNs in higher-level visual areas IT/PHC starting at 146 ms after image onset. Third, we show that a significant component of the unexplained variance in higher-level cortical dynamics is structured and can be explained by readily nameable aspects of object images, specifically object parts and basic categories. In prior behavioral work using the same image set and visuo-semantic labels, we showed that category labels, but not object parts, outperformed DNNs at explaining object similarity judgments (Jozwik et al., 2017). These results suggest that, compared with responses in ventral visual cortex, behavioral similarity judgments may more strongly emphasize semantic object information (Mur et al., 2013; Jozwik et al., 2017; Groen et al., 2018). Future studies should extend this work to richer stimulus and model sets.
To build more accurate models of human ventral-stream computations, we need to provide DNNs with a more human-like learning experience. Two important areas for improvement are visual diet and learning objectives. Each of these shapes the internal object representations that develop during visual learning. Humans have a rich visual diet and learn to distinguish between ecologically relevant categories at multiple levels of abstraction, including faces, humans, and animals (Mur et al., 2013; Jozwik et al., 2016). DNNs have a more constrained visual diet and are trained on category divisions that do not entirely match the ones that humans learn in the real world. For example, the most common large-scale image dataset for training DNNs with category supervision (Russakovsky et al., 2015; Khaligh-Razavi and Kriegeskorte, 2014; Güçlü and van Gerven, 2015; Cichy et al., 2017; Kubilius et al., 2018; Schrimpf et al., 2018; Jozwik et al., 2019b; Storrs et al., 2020a, b), the ILSVRC 2012 dataset (Russakovsky et al., 2015), contains subordinate categories that most humans would not be able to distinguish, including dog breeds such as Schipperke and Groenendael, and lacks some higher-level categories relevant to humans, including face and animal. The path forward is unfolding along two main directions. The first is enrichment of the visual diet of DNNs by better matching the visual variability present in the real world, for example, by increasing variability in viewpoint or by training on videos instead of static images (Barbu et al., 2019; Zhuang et al., 2019). The second is to more closely match human learning objectives, for example, by introducing more human-like category objectives or unsupervised objectives (Higgins et al., 2020; Konkle and Alvarez, 2022; Mehrer et al., 2021; Zhuang et al., 2021). Training DNNs on more human-like visual diets and learning objectives may give rise to representational features that more closely match the visuo-semantic features represented in human higher-level visual cortex.
Footnotes
This work was supported by Wellcome Trust Grant 206521/Z/17/Z to K.M.J.; European Research Council Grant ERC-StG-2022-101039524 (TIME) to T.C.K.; German Research Council Grants CI241/1-1, CI241/3-1, and CI241/7-1 to R.M.C.; European Research Council Grant ERC-StG-2018-803370 to R.M.C.; and Natural Sciences and Engineering Research Council of Canada Discovery Grant RGPIN-2019-06741 to M.M. We thank Martin Schrimpf for input on the CORnets methods. We thank Taylor Schmitz for helpful comments on the manuscript.
The authors declare no competing financial interests.
Correspondence should be addressed to Kamila M. Jozwik at jozwik.kamila@gmail.com or Marieke Mur at mmur@uwo.ca