The capacity to process information in conceptual form is a fundamental aspect of human cognition, yet little is known about how this type of information is encoded in the brain. Although the role of sensory and motor cortical areas has been a focus of recent debate, neuroimaging studies of concept representation consistently implicate a network of heteromodal areas that seem to support concept retrieval in general rather than knowledge related to any particular sensory-motor content. We used predictive machine learning on fMRI data to investigate the hypothesis that cortical areas in this “general semantic network” (GSN) encode multimodal information derived from basic sensory-motor processes, possibly functioning as convergence–divergence zones for distributed concept representation. An encoding model based on five conceptual attributes directly related to sensory-motor experience (sound, color, shape, manipulability, and visual motion) was used to predict brain activation patterns associated with individual lexical concepts in a semantic decision task. When the analysis was restricted to voxels in the GSN, the model was able to identify the activation patterns corresponding to individual concrete concepts significantly above chance. In contrast, a model based on five perceptual attributes of the word form performed at chance level. This pattern was reversed when the analysis was restricted to areas involved in the perceptual analysis of written word forms. These results indicate that heteromodal areas involved in semantic processing encode information about the relative importance of different sensory-motor attributes of concepts, possibly by storing particular combinations of sensory and motor features.
Significance Statement
The present study used a predictive encoding model of word semantics to decode conceptual information from neural activity in heteromodal cortical areas. The model is based on five sensory-motor attributes of word meaning (color, shape, sound, visual motion, and manipulability) and encodes the relative importance of each attribute to the meaning of a word. This is the first demonstration that heteromodal areas involved in semantic processing can discriminate between different concepts based on sensory-motor information alone. This finding indicates that the brain represents concepts as multimodal combinations of sensory and motor representations.
The capacity to encode and retrieve conceptual information is an essential aspect of human cognition, but little is known about how these processes are implemented in the brain. Neuroimaging studies of conceptual processing have implicated areas at various levels of the cortical hierarchy, including sensory and motor areas (Hauk et al., 2004; Hsu et al., 2012) and multimodal (Fernandino et al., 2016) and heteromodal regions (Binder et al., 2009). Binder et al. (2009) referred to the latter as a “general semantic network” (GSN) because it responds more to meaningful input (words and sentences) than to meaningless input (nonwords and scrambled sentences) regardless of the particular sensory-motor content of the items. The GSN consists of portions of the inferior parietal lobule (IPL), lateral temporal cortex (LTC), lateral prefrontal cortex (LPFC), precuneus/posterior cingulate gyrus (Pc/PCG), parahippocampal gyrus (PHG), and medial prefrontal cortex (MPFC), all of which are bilaterally activated, with stronger activations in the left hemisphere. According to embodied models of semantics, lower-level sensory and motor areas contribute to concept representation by encoding the sensory-motor features of phenomenal experience that characterize each concept, presumably derived from the experiences that led to the formation of the concept. However, the role of the GSN remains obscure. We propose that this network encodes high-level representations of the coactivation patterns exhibited by lower-level, sensory-motor cortical areas during concept retrieval, which is consistent with the idea of convergence–divergence zones originally proposed by Damasio (1989) and further developed by Simmons and Barsalou (2003). Alternatively, it is possible that the GSN encodes conceptual representations in a qualitatively distinct format that does not rely on sensory-motor information. The existence of such a disembodied code for concept representation has been endorsed by some investigators (Mahon and Caramazza, 2008).
We set out to investigate whether the heteromodal areas comprising the GSN encode sensory-motor information about concepts during word-cued concept retrieval. We used a forward-encoding model based on five sensory-motor attributes of word meaning (sound, color, visual motion, shape, and manipulability) to decode the distributed fMRI activation patterns associated with the meanings of 300 common nouns. We anticipated that this “semantic model” would successfully identify individual concrete concepts from neural activity in the GSN. Because sensory-motor attributes are less relevant for abstract nouns, we expected decoding accuracy to be lower for these words, if at all above chance. As a control, we predicted that an alternative model based on five orthographic and phonologic attributes of the word form (the “word-form model”) would not decode activation patterns in the GSN above chance levels.
As an additional control, we also evaluated both encoding models in a different set of cortical regions, namely, those involved in the perceptual analysis of written word forms (the “word form network,” or WFN; Cohen et al., 2004). Therefore, we expected the decoding accuracy of the two encoding models in these areas to show the opposite pattern relative to the GSN; that is, successful decoding for the word-form model, but not for the semantic model.
In most previous fMRI decoding studies, several exemplars of particular categories (e.g., faces, houses, and animals) were presented and a classifier was trained to discriminate between the categories. The classifier was then used to predict the category of a new (untrained) item. Successful decoding, in that context, indicates that the voxels included in the analysis encode information about stimulus category (although the nature of that information is often difficult or impossible to characterize). Our analysis has a different goal: each stimulus is treated as a unique item, the neural representation of which is hypothesized to rely on a specific, explicit representational system determined by a set of sensory-motor features and their relative weights (obtained from participants' ratings). What is predicted is not the item's category, but rather its unique identity. Successful decoding indicates that the voxels used in the analysis encode information about the specific attributes hypothesized to underlie the representational system.
Materials and Methods
The semantic model was based on five semantic attributes directly related to sensory-motor processes: sound, color, shape, manipulability, and visual motion. Ratings for these attributes were available for a set of 900 words (for details, see Fernandino et al., 2016). The ratings reflect the relevance of each attribute to the meaning of the word on a 7-point Likert scale ranging from “not at all important” to “very important.” Approximately 30 participants rated each attribute for each word.
We direct the reader to Fernandino et al. (2016) for details on the stimuli and data collection procedures, which are summarized below.
Participants were 21 healthy, right-handed, native speakers of English with no history of neurological or psychiatric disorders (7 females; mean age 29.9 years, range 20–47). All participants gave informed consent as approved by the Medical College of Wisconsin Institutional Review Board and were compensated for participation.
Stimuli consisted of the 900 nouns for which attribute ratings were available and 300 pseudowords. Six hundred nouns were relatively concrete and 300 were relatively abstract, as determined by either published imageability ratings or consensus judgment of the authors. Pseudowords were matched to the words on length, orthographic neighborhood density, and bigram and trigram metrics.
The stimuli were back-projected on a screen that was viewed by the participant through a mirror attached to the head coil. Participants performed 1200 trials (900 words, 300 pseudowords) distributed over 10 runs. Each stimulus was presented for 1000 ms and was followed by a fixation cross for a jittered interval of 1–13 s.
Participants performed a speeded semantic decision task (“can it be directly experienced with the senses?”) and responded by pressing one of two response keys with their right hand. They were instructed to press the button for “no” in the case of pseudowords.
fMRI acquisition and preprocessing
Gradient-echo EPI images were collected in 10 runs of 196 volumes each on a GE 3T Excite MRI scanner (TR = 2000 ms, TE = 25 ms, 40 axial slices, 3 × 3 × 3 mm voxels). T1-weighted anatomical images were obtained using a 3D SPGR sequence with voxel dimensions of 1 mm isotropic.
EPI volumes were corrected for slice acquisition time and head motion. They were aligned to the T1-weighted volume and normalized into Talairach space (AFNI's TT_N27 template) using affine transformations implemented by the AFNI program @auto_tlrc. Images were smoothed with a 6 mm FWHM Gaussian kernel. Each voxel time series was rescaled to percentage of mean signal level so that subsequent regression parameter estimates reflected percentage signal change.
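The rescaling step described above can be illustrated with a minimal sketch. It assumes the time series are held in a NumPy array of shape (time points, voxels); the function name `to_percent_of_mean` and the handling of zero-mean voxels are illustrative assumptions, not part of the AFNI pipeline used in the study.

```python
import numpy as np

def to_percent_of_mean(ts):
    """Rescale each voxel time series to percentage of its own temporal mean.

    ts: array of shape (n_timepoints, n_voxels) holding raw EPI signal.
    Returns an array of the same shape in which 100 means "at the mean",
    so regression coefficients fitted to it are in percent signal change.
    """
    ts = np.asarray(ts, dtype=float)
    mean = ts.mean(axis=0, keepdims=True)
    out = np.zeros_like(ts)
    # Assumption: voxels with a zero mean (e.g., outside the brain) are
    # left at zero instead of being divided by zero.
    np.divide(100.0 * ts, mean, out=out, where=mean != 0)
    return out
```

After this rescaling, every in-brain voxel has a temporal mean of 100, so a coefficient of 0.5 corresponds to a change of 0.5% of the mean signal.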
The semantic model was designed to predict the activation pattern corresponding to a given word based on the ratings of the five semantic attributes for that word (see “Attribute ratings” section above). The word-form model was designed to predict activation patterns based on perceptual properties of the word form, regardless of meaning, thus serving as a control for the semantic model. It was based on five orthographic and phonologic attributes of the word form: number of letters, number of syllables, orthographic neighborhood density, phonologic neighborhood density, and bigram frequency.
For the decoding procedure, we split the 900 word stimuli into a modeling set, consisting of 850 items, and a test set, consisting of 50 items. The decoding algorithm was repeated six times, each time with a different set of test words, for a total of 300 test words. Three test sets consisted of concrete nouns and the others consisted of abstract nouns. Test words were selected randomly with the constraint that the concrete and abstract subsets were matched in word frequency, number of letters, number of phonemes, number of syllables, orthographic and phonologic neighborhood densities, and bigram frequency (Table 1).
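The partitioning scheme can be sketched as follows. This is a simplified illustration: it draws the six nonoverlapping test sets at random and omits the lexical matching constraint described above. The function names and the assumption that concrete and abstract words are identified by integer indices are hypothetical.

```python
import numpy as np

def make_test_sets(concrete_idx, abstract_idx, n_sets_each=3, set_size=50, seed=0):
    """Draw nonoverlapping test sets: n_sets_each concrete, then n_sets_each abstract.

    Simplification: words are drawn purely at random; the matching of
    concrete and abstract subsets on lexical variables is not implemented.
    """
    rng = np.random.default_rng(seed)
    test_sets = []
    for pool in (concrete_idx, abstract_idx):
        # Sample without replacement so that no word appears in two test sets.
        chosen = rng.choice(pool, size=n_sets_each * set_size, replace=False)
        test_sets.extend(np.split(chosen, n_sets_each))
    return test_sets

def modeling_set(all_idx, test_idx):
    """All words except the current test set (850 of 900 in the study)."""
    return np.setdiff1d(all_idx, test_idx)

# Hypothetical indexing: words 0-599 are concrete, 600-899 are abstract.
sets = make_test_sets(np.arange(600), np.arange(600, 900))
first_modeling_set = modeling_set(np.arange(900), sets[0])
```

Each of the six iterations then fits the encoding model on the 850 words in the modeling set and evaluates it on the 50 held-out words.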
The decoding algorithm consisted of four steps: (1) generating attribute maps (AMs), (2) computing predicted maps (PMs), (3) generating observed maps (OMs), and (4) testing the PMs against the OMs. The steps are described below.
AMs for each attribute in the encoding model were generated for each participant based exclusively on the words in the modeling set (Fig. 1). This was done by including the z-transformed attribute values (sensory-motor ratings in the case of the semantic model, orthographic and phonologic measures in the case of the word-form model) as simultaneous predictor variables in a generalized least-squares (GLS) regression. For the semantic model, nuisance regressors included lexical variables unrelated to word meaning (word length, number of phonemes, number of syllables, printed word frequency, bigram frequency, and orthographic and phonological neighborhood density) and the participant's reaction time (RT) for each trial (all z-transformed). For the word-form model, nuisance regressors included the five sensory-motor ratings of word meaning and the participant's RT for each trial (all z-transformed). Two binary regressors, one coding for “word” events and the other for “pseudoword” events, were included to account for residual activity associated with early visual processing of the stimulus, as well as the subsequent motor response. Signal drift was modeled with linear, second-order, and third-order trends and residual movement artifacts were modeled with the estimates of the motion parameters. A group-level AM for each attribute was created by averaging the individual AMs (β values) across participants.
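As a rough illustration of how AMs are obtained, the sketch below fits attribute and nuisance predictors simultaneously at every voxel. It makes two simplifying assumptions that differ from the analysis described above: it uses ordinary least squares in place of GLS, and it operates on a hypothetical matrix of per-word response amplitudes instead of the full voxel time series (so drift, motion, and event regressors are omitted). All function names are illustrative.

```python
import numpy as np

def zscore(x):
    """Column-wise z-transform."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

def attribute_maps(responses, attributes, nuisance):
    """Estimate one map of regression coefficients per attribute.

    responses:  (n_words, n_voxels) response amplitude per modeling-set word
    attributes: (n_words, n_attr)   attribute values of interest
    nuisance:   (n_words, n_nuis)   covariates of no interest (e.g., RT)
    Returns an (n_attr, n_voxels) array of beta values.
    """
    responses = np.asarray(responses, dtype=float)
    attributes = np.asarray(attributes, dtype=float)
    n_words, n_attr = attributes.shape
    # Intercept, z-scored attributes, and z-scored nuisance terms are
    # entered as simultaneous predictors.
    X = np.column_stack([np.ones(n_words), zscore(attributes), zscore(nuisance)])
    betas, *_ = np.linalg.lstsq(X, responses, rcond=None)
    return betas[1:1 + n_attr]

def group_attribute_maps(per_participant_maps):
    """Average the individual AMs (beta values) across participants."""
    return np.mean(np.stack(per_participant_maps), axis=0)
```

Because the predictors are entered simultaneously, each AM reflects the contribution of one attribute with the other attributes and the nuisance variables held constant.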
For each of the 50 words in the test set, PMs were computed as linear combinations of the group-level AMs, whereby each AM was weighted by the test word's corresponding attribute value (Fig. 2A). In the case of the semantic model, the PM for a given test word corresponded to the hypothetical activation pattern that would be associated with the meaning of that word if the word's meaning were completely captured by the five attribute ratings (i.e., sound, color, manipulability, visual motion, and shape). For the word-form model, it corresponded to the hypothetical activation pattern associated with the orthographic and phonological properties of the written word.
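The linear combination that yields the PMs amounts to a single matrix product. The sketch below assumes that the test words' attribute values are expressed on the same scale as the predictors used to estimate the AMs; the function name `predicted_maps` is illustrative.

```python
import numpy as np

def predicted_maps(test_attributes, group_ams):
    """Compute one predicted map per test word.

    test_attributes: (n_test_words, n_attr) attribute values of the test words
    group_ams:       (n_attr, n_voxels)     group-level attribute maps
    Returns an (n_test_words, n_voxels) array in which row i is the sum of
    the AMs, each weighted by word i's value on the corresponding attribute.
    """
    test_attributes = np.asarray(test_attributes, dtype=float)
    group_ams = np.asarray(group_ams, dtype=float)
    return test_attributes @ group_ams

# Toy case: two attributes, three voxels, one test word.
toy_ams = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, -1.0]])
toy_pm = predicted_maps([[2.0, 3.0]], toy_ams)  # 2 * AM_1 + 3 * AM_2
```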
From the imaging data, we extracted the unique activation pattern induced by each word in the test set (OMs). For each participant, a separate GLS regression was conducted for each word in the test set, with the following explanatory variables: a binary regressor coding for the presentation of the selected test word; a binary regressor coding for presentation of all the nonselected words (i.e., the other 899 words in the stimulus set); a binary regressor coding for presentation of the pseudowords; five continuous regressors coding for each of the five attribute values for all nonselected words; and a continuous regressor coding the response time for each trial. The OM was obtained from the contrast “selected word > pseudowords.” We chose the pseudoword condition (instead of “rest”) as the baseline to make the activation maps more comparable across participants because “rest” is an unspecified condition that likely varies within and across participants. For each test word, a group-level OM was obtained by averaging the individual OMs (β values) across participants.
Testing the PMs against the OMs
The decoding accuracy of the model was evaluated separately for each word based on the similarity between the PM and its corresponding OM relative to the similarity between the PM and all the other OMs in the test set (Fig. 2B). Similarity was defined as the voxel-by-voxel pairwise correlation between maps, and accuracy was defined as the percentile rank of the correlation strength between the PM and the corresponding OM. This percentile rank, scaled to a 0–1 range, was assigned to the PM as its accuracy score. Therefore, each PM received an accuracy score corresponding to how similar it was to its respective OM relative to the other 49 OMs, with 0 corresponding to least similar and 1 corresponding to most similar. For example, if the OM for the word “tomato” were the most highly correlated to the PM for the same word, then that PM would receive an accuracy score of 49/49 = 1. If, instead, it were the second most highly correlated to its respective PM, then the accuracy score would be 48/49 = 0.98. The Shapiro–Wilk normality test showed that the model's accuracy scores for the test words were not normally distributed, so we used nonparametric binomial tests to verify whether model performance (i.e., decoding accuracy across the 50 test words) was significantly higher than chance (0.5).
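The accuracy score can be made concrete with a short sketch. It assumes that the PMs and OMs for one test set are stored as arrays with one row per word and one column per voxel, in the same word order; the function name and the treatment of ties (a tied OM does not count as less similar) are assumptions.

```python
import numpy as np

def decoding_accuracy(pms, oms):
    """Percentile-rank accuracy score (0 to 1) for each predicted map.

    pms, oms: arrays of shape (n_words, n_voxels); row i of each refers
    to the same test word.
    """
    pms = np.asarray(pms, dtype=float)
    oms = np.asarray(oms, dtype=float)
    n_words = pms.shape[0]

    def standardize(maps):
        # Center and scale each map so that a dot product of two rows
        # equals their Pearson correlation across voxels.
        centered = maps - maps.mean(axis=1, keepdims=True)
        return centered / np.linalg.norm(centered, axis=1, keepdims=True)

    # corr[i, j] is the correlation between PM i and OM j.
    corr = standardize(pms) @ standardize(oms).T
    own = np.diag(corr)
    # Count the other OMs that are less similar to PM i than its own OM.
    n_lower = (corr < own[:, None]).sum(axis=1)
    return n_lower / (n_words - 1)
```

With 50 test words, a PM whose own OM is the best match scores 49/49 = 1, and one whose own OM is the second-best match scores 48/49, in line with the example above.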
Voxel selection masks
Our hypothesis concerned the role of the GSN in representing sensory-motor information about concepts. Therefore, we created a voxel-selection mask based on the activation-likelihood estimation (ALE) meta-analysis by Binder et al. (2009), encompassing the cortical areas that were reliably associated with “general” semantic processing (Fig. 3A). The map from Binder et al. (2009) was thresholded at p < 0.05 and converted into a binary mask. To investigate which portions of the GSN contributed the most to decoding accuracy, we performed the analysis separately on each of its five regions: lateral temporoparietal, medial parietal, medial temporal, lateral prefrontal, and medial prefrontal.
As a control, the models were also evaluated in a mask corresponding to the WFN obtained from the contrast pseudowords > rest in the present dataset, thresholded at p < 0.05 (corrected). This mask included visual, somatosensory, and motor/premotor areas, as well as the thalamus, and had minimal overlap with the GSN mask (Fig. 3A). Because these regions are more strongly activated during bottom-up perceptual processing than during top-down processing (Goebel et al., 1998; O'Craven and Kanwisher, 2000), we expected their activation patterns to encode information primarily about word form rather than semantic content. Voxels displaying a low temporal signal-to-noise ratio (<200) were excluded from both masks.
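The exclusion of low-tSNR voxels can be sketched as follows. The text above does not specify how tSNR was computed, so the sketch assumes the conventional definition (temporal mean divided by temporal standard deviation); the function names and the exclusion of constant voxels are likewise assumptions.

```python
import numpy as np

def tsnr_mask(ts, threshold=200.0):
    """Boolean mask of voxels whose temporal SNR is at or above threshold.

    ts: array of shape (n_timepoints, n_voxels).
    tSNR is taken here as temporal mean / temporal standard deviation.
    """
    ts = np.asarray(ts, dtype=float)
    mean = ts.mean(axis=0)
    sd = ts.std(axis=0)
    tsnr = np.zeros_like(mean)
    # Assumption: voxels with zero variance get a tSNR of 0 and are excluded.
    np.divide(mean, sd, out=tsnr, where=sd > 0)
    return tsnr >= threshold

def apply_exclusion(network_mask, ts, threshold=200.0):
    """Keep only voxels that belong to the network mask and pass the tSNR cutoff."""
    return np.asarray(network_mask, dtype=bool) & tsnr_mask(ts, threshold)
```

The same function would be applied to both the GSN and the WFN masks, so that the two networks are compared on voxels of comparable signal quality.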
Concrete versus abstract words
In total, six nonoverlapping test sets were evaluated in cross-validation. The three concrete and the three abstract sets were matched on all lexical attributes except for concreteness. As shown in Table 1, the variance of the sensory-motor attribute ratings was much smaller among abstract than among concrete words, indicating that abstract word meanings contained much less information about the sensory-motor features included in the semantic model. Therefore, if the accuracy of the semantic model were indeed driven by the sensory-motor aspects of word meaning, then decoding performance should be high for concrete but low for abstract words.
Mean RT on the semantic decision task was 942 ms (SD = 98 ms). Mean accuracy was 0.84 for words and 0.96 for pseudowords, indicating that participants attended closely to the task. As expected, RT was negatively correlated with concreteness (r = −0.66, p < 0.0001) and with each of the five semantic attributes (sound: r = −0.15; color: r = −0.64; manipulability: r = −0.41; motion: r = −0.34; shape: r = −0.71; all p < 0.0001).
Model performance in the GSN mask
Decoding accuracy for the two encoding models in the GSN mask is shown in Figure 3B. Consistent with our hypothesis, the semantic model was successful in decoding concrete [median = 0.61; 99% confidence interval (CI) = 0.53–0.67; p = 0.0004], but not abstract words (median = 0.55; 99% CI = 0.44–0.65; p = 0.06), whereas the word-form model failed to decode both types of words (concrete: median = 0.55; 99% CI = 0.47–0.65; p = 0.08; abstract: median = 0.52; 99% CI = 0.39–0.61; p = 0.28). The regions that showed the highest decoding accuracies for concrete nouns in the GSN were the lateral temporoparietal (median = 0.58) and the lateral prefrontal (median = 0.61) regions, although only decoding accuracy in the former remained significant after correction for multiple tests (p = 0.005, Bonferroni corrected). Pairwise comparisons showed no significant differences between GSN regions (all p > 0.19, uncorrected for multiple comparisons).
Model performance in the WFN mask
Decoding accuracy for the two encoding models in the WFN mask is shown in Figure 3C. The semantic model failed to decode activation patterns in this network (concrete: median = 0.55; 99% CI = 0.41–0.69; p = 0.14; abstract: median = 0.53; 99% CI = 0.41–0.59; p = 0.14), whereas the word-form model was successful for both word types (concrete: median = 0.73; 99% CI = 0.67–0.78; p < 0.000001; abstract: median = 0.71; 99% CI = 0.65–0.79; p < 0.000001).
We evaluated two forward-encoding models on their capacity to predict fMRI activity patterns for specific words. The semantic model was based on five sensory-motor attributes of word meaning and the word-form model was based on five orthographic and phonologic attributes of the word form. Each model was evaluated in two different sets of cortical areas: the GSN, a set of highly interconnected heteromodal areas that has been consistently implicated in semantic processing, and the WFN, which is involved in the perceptual analysis of word forms comprising mainly visual and motor/somatosensory areas. The semantic model successfully decoded fMRI activation patterns elicited by individual words in the GSN, but not in the WFN. As expected, decoding of GSN activity was successful for concrete but not for abstract words when the two sets were analyzed separately. The word-form model was successful in the WFN for concrete and abstract words alike, but failed to decode activity in the GSN. This pattern of results strongly indicates that the GSN encodes information about sensory-motor attributes of concepts.
The GSN was identified by Binder et al. (2009) in an ALE meta-analysis of 120 neuroimaging studies of semantic word processing. It overlaps considerably with the “default mode network,” a set of cortical areas typically deactivated during attention-demanding tasks relative to rest (Buckner et al., 2008). Resting-state connectivity and MRI tractography studies show that the core nodes of the network (IPL, LTC, Pc/PCG, and MPFC) are strongly interconnected (Greicius et al., 2009; Horn et al., 2014) and graph theoretical analyses have identified these regions as central connector hubs for more specialized, modular cortical networks (Hagmann et al., 2008; Sepulcre et al., 2012). Based on these findings, we have argued that the GSN supports multimodal conceptual representations by encoding patterns of coactivation across lower-level, modality-specific areas (Fernandino et al., 2016). The present results show that the GSN can discriminate between individual concrete concepts based exclusively on sensory-motor information, thus providing substantial support for this proposal.
The WFN mask included sensory-motor areas that have been found previously to encode information about word semantics (Hauk et al., 2004; Hsu et al., 2012; Fernandino et al., 2016). Why, then, did the semantic model fail to decode neural activation in this mask? We believe the answer lies in the nature of the task. Because perceptual word processing and concept retrieval took place virtually simultaneously in the present study (due to the low temporal resolution of the BOLD signal, the two processes were modeled as a single event in the GLM estimation of β values), activity in the WFN was likely driven much more strongly by the perceptual features of the stimuli (bottom-up activation) than by their semantic attributes (top-down activation), thus greatly reducing the signal-to-noise ratio of the semantic activation patterns in those areas. Future studies should investigate this issue by dissociating concept retrieval from complex sensory stimulation.
It should also be noted that the failure of the semantic model to decode activation patterns for abstract words does not necessarily imply that concrete and abstract concepts are based on qualitatively different codes; rather, it could reflect the fact that the relationship between the meaning of an abstract word and specific features of sensory-motor experience is much more complex and context dependent than that of concrete words (Badre and Wagner, 2002; Barsalou and Wiemer-Hastings, 2005; Hoffman, 2015). Low decoding accuracy was expected for abstract words given the relatively low variance of the sensory-motor ratings across these words (Table 1). The paradigm used here can be extended to investigate the cortical representation of abstract words by incorporating attributes that are more relevant for their characterization, such as affective, causal, and intentional attributes, and by adopting a task that is more neutral with respect to particular semantic features.
Finally, we note that our semantic task (“can it be directly experienced with the senses?”) directs attention to sensory-motor aspects of the concepts. It is possible that a task lacking this attentional focus (e.g., lexical decision) would induce relatively weaker activations related to sensory-motor information, thus producing different results (for further discussion of this issue, see Fernandino et al., 2016).
In sum, our results provide the first demonstration that heteromodal areas involved in semantic processing can discriminate between individual concepts based on sensory-motor information alone. They provide strong support for the view that conceptual representations are grounded, at least in part, in elementary sensory-motor attributes of phenomenal experience. Furthermore, they indicate that the neural architecture of these representations is hierarchically organized, with higher-level heteromodal areas encoding information about the activation patterns exhibited by lower-level sensory-motor areas, patterns that are presumably established during concept formation and partially reinstated during retrieval.
This work was supported by the National Institutes of Health (National Institute of Neurological Disorders and Stroke Grant R01-NS33576 and General Clinical Research Center Grant M01-RR00058).
The authors declare no competing financial interests.
Correspondence should be addressed to Leonardo Fernandino, Department of Neurology, Medical College of Wisconsin, 8701 Watertown Plank Road, MEB 4671, Milwaukee, WI 53226.