Abstract
How neural specificity for distinct conceptual knowledge categories arises is central to understanding the organization of semantic memory in the human brain. Although there is a large body of research on the neural processing of distinct object categories, the organization of action categories remains largely unknown. In particular, it is unclear whether different action categories follow a specific topographical organization on the cortical surface analogous to the category-specific organization of object knowledge. Here, we tested whether the neural representation of action knowledge is organized in terms of nonsocial versus social and object-unrelated versus object-related actions (sociality and transitivity, respectively, hereafter). We hypothesized a major distinction of sociality and transitivity along dorsal and ventral lateral occipitotemporal cortex (LOTC), respectively. Using fMRI-based multivoxel pattern analysis, we identified neural representations of action information associated with sociality and transitivity in bilateral LOTC. Representational similarity analysis revealed a dissociation between dorsal and ventral LOTC. We found that action representations in dorsal LOTC are segregated along features of sociality, whereas action representations in ventral LOTC are segregated along features of transitivity. In addition, representations of sociality and transitivity features were found more anteriorly in LOTC than representations of specific subtypes of actions, suggesting a posterior–anterior gradient from concrete to abstract action features. These findings elucidate how the neural representations of perceptually and conceptually diverse actions are organized in distinct subsystems in the LOTC.
SIGNIFICANCE STATEMENT The lateral occipitotemporal cortex (LOTC) is critically involved in the recognition of objects and actions, but our knowledge about the underlying organizing principles is limited. Here, we discovered a dorsal–ventral distinction of actions in LOTC: dorsal LOTC represents actions based on sociality (how much an action is directed at another person) in proximity to person knowledge, whereas ventral LOTC represents actions based on transitivity (how much an action involves interaction with inanimate objects) in proximity to tools/artifacts, suggesting a mutually dependent organization of actions and objects. In addition, we found a posterior-to-anterior organization of the LOTC for concrete and abstract representations, respectively. Our findings provide important insights into the organization of actions in LOTC.
Introduction
To be able to interact with our environment, we need to recognize objects and understand the actions of others. How the brain achieves this task has been researched intensively over the last decades. This research has demonstrated that distinct object categories are represented in a systematic topographical organization in occipitotemporal cortex (OTC) (Chao et al., 1999; Konkle and Caramazza, 2013). By contrast, the representation of action categories is less well understood (Pillon and d'Honincthun, 2011). In particular, it is unclear whether actions are organized topographically along certain salient dimensions.
Two arguments support this assumption. First, according to the domain-specific hypothesis, distinct neural substrates became evolutionarily adapted to selectively process knowledge categories for which perceptual and conceptual distinctions lead to behavioral benefits (Caramazza and Shelton, 1998). Neuropsychological distinctions have been identified among evolutionarily salient object categories such as animals, conspecifics, plant life, and tools (for review, see Caramazza and Mahon, 2003). In the action knowledge domain, a similar specialization might have occurred as certain behavioral “inventions” emerged and recognition of these behaviors became relevant for survival; for example, the distinction between social versus nonsocial and object-related (transitive) versus object-unrelated (intransitive) actions. Following this account, the neural processing of action knowledge along these dimensions (sociality and transitivity, hereafter) should have been exposed to evolutionary pressure, resulting in category-specific adaptation, and thus segregation, of the respective neural substrates. The second argument proposes that the neuroanatomical organization of action knowledge is determined by constraints from associated object categories. Action recognition comprises object recognition, specifically the recognition of the acting agent and other agents or inanimate objects that might be involved in the respective action. Strikingly, there are systematic links between certain action and object categories: Social actions (e.g., teach, compete, sell) are linked to knowledge about animate objects (e.g., conspecifics and interpersonal relations), whereas transitive actions (e.g., cut, sew, peel) are linked to knowledge about tools and other inanimate objects. The neural representations of actions and objects might therefore determine each other based on connectivity-based constraints. Because the most salient distinction of object knowledge is observed between animate and inanimate objects (Caramazza and Mahon, 2003; Martin, 2007; Kriegeskorte et al., 2008b), it is tempting to assume a similarly prominent distinction in the action domain along sociality and transitivity.
How could the neural organization of object and action knowledge be related to each other? Animate objects activate dorsolateral OTC (DLOTC), as well as lateral fusiform gyrus in ventral OTC (VOTC), whereas inanimate objects activate ventrolateral OTC (VLOTC), as well as medial fusiform/parahippocampal cortex in VOTC (Chao et al., 1999; Downing et al., 2006; Konkle and Caramazza, 2013). Likewise, human motion preferentially activates DLOTC, whereas tool motion preferentially activates VLOTC (Beauchamp et al., 2002, 2003). Consistent with this mapping, processing of socially relevant cues draws on the superior temporal sulcus (STS; Allison et al., 2000; Carter and Huettel, 2013). In turn, during action recognition, the lateral OTC (LOTC) is predominantly activated (in addition to prefrontal and parietal areas that are not the main focus of the present study; Van Overwalle and Baetens, 2009; Caspers et al., 2010). However, the precise organization of actions in LOTC remains unclear and is a matter of current debate (Lingnau and Downing, 2015). Here, we hypothesize that social action knowledge is represented in the vicinity of animate and/or socially relevant information in DLOTC, whereas transitive action knowledge is represented in the vicinity of inanimate object information (i.e., artifacts) in VLOTC. Because the VOTC also reveals a distinction along animacy and is, albeit less often, found to be activated during action observation (Gobbini et al., 2007; Caspers et al., 2010; Shultz and McCarthy, 2012), it is possible that VOTC, too, reveals a distinction between social (fusiform gyrus) and transitive (parahippocampal cortex) actions. To test these predictions, we used fMRI-based multivoxel pattern analysis (MVPA) and representational similarity analysis (RSA) to investigate the neural organization of actions from four categories spanning a 2D semantic space along sociality and transitivity.
Materials and Methods
Participants.
Twenty-eight healthy adults (8 females; mean age, 27 years; age range, 19–42 years) volunteered to participate in the experiment. All participants were right-handed with normal or corrected-to-normal vision and no history of neurological or psychiatric disease. Participants gave written informed consent before participation in the study. The experimental procedures were approved by the ethics committee for research involving human participants at the University of Trento, Italy.
Stimuli.
The stimulus set consisted of 24 exemplars of eight actions (192 action videos in total). Actions were selected from four categories: (1) change of possession (transitive/social): give, take; (2) object manipulation (transitive/nonsocial): open, close; (3) communication (intransitive/social): agree, disagree; and (4) body/contact action (intransitive/nonsocial): stroke, scratch. The criteria for this selection were the following: only manual actions, actions that take place in the same context, and actions that are performed without physically interacting with, but in the presence of, another person. We thereby ensured that between-category analyses capture category-specific differences while eliminating feature differences that are not essential for an action category. Furthermore, we ensured that, within each category, actions are perceptually similar with regard to movement kinematics and complexity. We thereby guaranteed that within-category MVPA relied on conceptual but not perceptual differences between the two actions of a category. In addition, by using 24 different exemplars for each action (see Fig. 1B), we increased the perceptual variance of the stimuli to ensure that MVPA relied on abstract action representations that generalize across perceptual information (Wurm and Lingnau, 2015; Wurm et al., 2016). Variance was induced by using various stimulus factors; that is, two different contexts (kitchen, office), three perspectives (right, center, left; relative to the table orientation), two different actors (female, male), and six different objects that were present or involved in the actions (kitchen context: sugar cup, honey jar, coffee jar; office context: bottle, pen box, aluminum box). The actress/actor sat on either the left or the right side and used her/his right or left hand for the action. Stimulus factors were balanced for each action.
The concrete action instantiations were implemented as follows: Give: the actor moved an object from her/his peripersonal space into the peripersonal space of the passive person; Take: the reverse of give, i.e., the actor moved an object from the passive person's peripersonal space into her/his own peripersonal space; Open: the actor changed an object's state from closed into open; Close: the reverse of open, i.e., the actor changed an object's state from open into closed (both actions required various kinematics depending on the lid/cap type: screw, push/pull, flip); Agree: the actor made a gesture in the direction of the passive person that signals agreement with the passive person (thumbs up, forming a ring with index finger and thumb); Disagree: the actor made a gesture in the direction of the passive person that signals disagreement with the passive person (thumbs down, waving with index finger; note that the heterogeneity of gestures ensured that MVPA could not rely on concrete hand postures, but only on the associated communicative meaning); Stroke: using the palm of the hand, the actor touched her/his other arm or hand lightly and repeatedly, as with brushing movements; and Scratch: using the fingertips, the actor scraped or rubbed her/his other arm or hand as if to relieve itching.
Catch trials consisted of six exemplars of each of the eight actions that deviated from the original action (e.g., tilting or lifting an object, making a meaningless gesture, incomplete actions, etc.; 48 catch trial videos in total). Action videos were filmed using a Canon 5D Mark II camera and edited in iMovie (Apple) and MATLAB (The MathWorks, RRID:SCR_006826). All 240 videos were identical in terms of action timing; that is, the videos started with hands on the table, followed by the action, and ended with hands moving back to the same position on the table. Object states (open, closed) and positions (in front of the actress/actor or the passive person) were balanced in such a way that actions could not be predicted from the setting before the action started. Edited videos were grayscale and had a length of 2 s (30 frames per second) and a resolution of 400 × 225 pixels.
In the scanner, stimuli were back-projected onto a screen (60 Hz frame rate, 1024 × 768 pixels screen resolution) via a liquid crystal projector (OC EMP 7900, Epson) and viewed through a mirror mounted on the head coil (video presentation 6.9° × 3.9° visual angle). Stimulus presentation, response collection, and synchronization with the scanner were controlled with ASF (Schwarzbach, 2011) and the MATLAB Psychtoolbox-3 for Windows (Brainard, 1997).
Design of the fMRI experiment.
Stimuli were presented in a mixed event-related design. In each trial, videos (2 s) were followed by a 1 s fixation period. Eighteen trials were shown per block. Each of the nine conditions (eight action conditions plus one catch trial condition) was presented twice per block. Six blocks were presented per run, separated by 10 s fixation periods. Each run started with a 10 s fixation period and ended with a 16 s fixation period. In each run, the order of conditions was first-order counterbalanced (Aguirre, 2007). Each participant was scanned in a single session consisting of eight functional scans and one anatomical scan. Each of the nine conditions thus comprised a total of 2 (trials per block) × 6 (blocks per run) × 8 (runs per session) = 96 trials. Each of the 24 exemplars per action condition was presented four times in the experiment.
Task.
Participants were instructed to watch the movies attentively. They were asked to press a button with the right index finger on a response button box whenever an action was meaningless or performed incompletely or incorrectly (i.e., in catch trials). Participants could respond either during the movie or during the fixation phase after the movie. To ensure that participants followed the instructions correctly, they completed a practice run outside of the scanner. Participants were not informed about the exact purpose of the study and the organization of the actions into social/nonsocial and transitive/intransitive before the experiment.
After the fMRI session, participants judged the degree of sociality and transitivity of the actions seen in the experiment. To this end, responses to the following questions were collected on 6-point Likert scales (from 1 = not at all to 6 = very much): for transitivity, "How much does the action involve the interaction with a physical, inanimate object?"; for sociality, "How much is the action relevant for the nonacting person?" and "How much does the acting person consider possible consequences of the action for the nonacting person?" Ratings were used to ensure that the actions differed significantly along the two dimensions and were categorized as transitive/intransitive and social/nonsocial as intended. In addition, participants were asked to judge the similarity of the actions with regard to movement kinematics. We thereby ensured that sociality and transitivity were not confounded by covarying movement differences between the actions. For each combination of the action conditions, participants judged on a 6-point Likert scale how similar the hand and arm movements of the respective actions were. Because different action instantiations were shown in the experiment, they were asked to focus on coarse-grained movements that were similar across the different instantiations. To test for covariance between sociality, transitivity, and movement similarity, we computed representational dissimilarity matrices (RDMs) by computing the absolute difference between each pair of rating values (Euclidean distance) and used the vectorized triangle below the matrix diagonal for a correlation analysis for each participant. These RDM vectors were z-scored and correlated with each other to obtain one correlation coefficient (r) per correlation (sociality–transitivity, sociality–movement similarity, transitivity–movement similarity) and participant. We then used the r values in one-sample t tests to detect systematic correlations across participants. Averaged RDMs were used for RSA.
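For illustration, the rating-based RDM construction and the covariance check can be sketched in MATLAB as follows; the variable names and example values are hypothetical placeholders (the actual analysis used custom in-house code), and zscore/corr require the Statistics Toolbox.

    % Sketch of the rating-based RDM analysis for one participant (illustrative only).
    nActions   = 8;
    socRating  = rand(nActions,1)*5 + 1;      % sociality ratings, 1-6 Likert scale (hypothetical)
    tranRating = rand(nActions,1)*5 + 1;      % transitivity ratings (hypothetical)
    movSim     = randi([1 6], nActions);      % pairwise movement-similarity ratings (hypothetical)
    movSim     = (movSim + movSim')/2;        % symmetrize the pairwise judgments

    % RDMs: absolute rating differences for the two dimensions;
    % the similarity judgments are inverted into a dissimilarity
    socRDM  = abs(socRating - socRating');    % 8 x 8 dissimilarity matrix
    tranRDM = abs(tranRating - tranRating');
    movRDM  = 6 - movSim;

    % Vectorize the triangle below the diagonal, z-score, and correlate
    lowTri = find(tril(true(nActions), -1));
    v = @(rdm) zscore(rdm(lowTri));
    r_soc_tran = corr(v(socRDM), v(tranRDM)); % one r per pair and participant;
    r_soc_mov  = corr(v(socRDM), v(movRDM));  % the r values are then entered into
    r_tran_mov = corr(v(tranRDM), v(movRDM)); % one-sample t tests across participants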
Data acquisition.
Functional and structural data were collected using a 4 T Bruker MedSpec Biospin MR scanner and an 8-channel birdcage head coil. Functional images were acquired with a T2*-weighted gradient echoplanar imaging (EPI) sequence with fat suppression. Acquisition parameters were a repetition time of 2.2 s, an echo time of 33 ms, a flip angle of 75°, a field of view of 192 mm, a matrix size of 64 × 64, and a voxel resolution of 3 × 3 × 3 mm. We used 31 slices, acquired in ascending interleaved order, with a thickness of 3 mm and 15% gap (0.45 mm). Slices were tilted to run parallel to the superior temporal sulcus. In each functional run, 176 images were acquired. Before each run, we performed an additional scan to measure the point-spread function (PSF) of the acquired sequence to correct the distortion expected with high-field imaging (Zaitsev et al., 2004).
Structural T1-weighted images were acquired with an MPRAGE sequence (176 sagittal slices, TR = 2.7 s, inversion time = 1020 ms, FA = 7°, 256 × 224 mm FOV, 1 × 1 × 1 mm resolution).
Preprocessing.
Data were analyzed using BrainVoyager QX 2.8 (BrainInnovation, RRID:SCR_013057) in combination with the BVQXTools (RRID:SCR_009532) and NeuroElf (RRID:SCR_014147) toolboxes and custom software written in MATLAB (MathWorks).
Distortions in geometry and intensity in the EPIs were corrected on the basis of the PSF data acquired before each EPI scan (Zeng and Constable, 2002). The first four volumes were removed to avoid T1 saturation. The first volume of the first run was aligned to the high-resolution anatomy (six parameters). Data were 3D motion corrected (trilinear interpolation, with the first volume of the first run of each participant as reference), followed by slice time correction and high-pass filtering (cutoff frequency of three cycles per run). Spatial smoothing was applied with a Gaussian kernel of 8 mm FWHM for univariate analysis and 3 mm FWHM for MVPA (Wurm and Lingnau, 2015). For group analysis, both anatomical and functional data were transformed into Talairach space using trilinear interpolation.
Univariate fMRI analysis.
A group random-effects general linear model was computed using design matrices containing predictors for the eight action conditions, the catch trials, and the six parameters resulting from 3D motion correction (x, y, z translation and rotation). Each predictor was convolved with a dual-gamma hemodynamic impulse response function (Friston et al., 1998). Each trial was modeled as an epoch lasting from video onset to offset (2 s). The resulting reference time courses were used to fit the signal time courses of each voxel. Statistical maps were thresholded using Threshold-Free Cluster Enhancement (TFCE; Smith and Nichols, 2009) as implemented in the CoSMoMVPA Toolbox (Oosterhof et al., 2016). A total of 10,000 Monte Carlo simulations and a corrected cluster threshold of p = 0.05 were used. Conjunctions were computed by outputting the minimum t value for each voxel of the input maps (Nichols et al., 2005). Maps were projected on a cortex-based aligned group surface for visualization.
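For illustration, the construction of a single reference time course (a 2 s boxcar per trial convolved with a dual-gamma HRF) can be sketched as follows. The dual-gamma parameters shown are the commonly used canonical values and are an assumption; the exact BrainVoyager implementation may differ, and the onsets here are hypothetical.

    % Sketch: build one predictor of the design matrix (illustrative only).
    TR     = 2.2;                        % repetition time in s
    nVols  = 172;                        % volumes per run after removing 4 dummies
    onsets = [10 13 16];                 % hypothetical trial onsets in s
    dur    = 2;                          % epoch length: video onset to offset

    % Dual-gamma HRF sampled at 0.1 s resolution (canonical parameters assumed)
    dt  = 0.1;
    t   = 0:dt:30;
    hrf = gampdf(t,6,1) - gampdf(t,16,1)/6;
    hrf = hrf / sum(hrf);

    % Boxcar at high temporal resolution, convolved with the HRF,
    % then downsampled to the TR grid
    boxcar = zeros(1, round(nVols*TR/dt));
    for on = onsets
        boxcar(round(on/dt)+1 : round((on+dur)/dt)) = 1;
    end
    pred = conv(boxcar, hrf);
    pred = pred(1:length(boxcar));
    predictor = pred(round((0:nVols-1)*TR/dt) + 1);   % one column of the design matrix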
MVPA.
MVPA was performed using the CoSMoMVPA toolbox (Oosterhof et al., 2016). Design matrices contained 16 predictors reflecting the action conditions (8 actions × 2 exemplars), 2 catch trial predictors, and 6 predictors resulting from 3D motion correction. Beta weights of experimental conditions were estimated on the basis of six trials per condition and run, resulting in two beta estimates per action condition and run. The six trials were selected from either the first half (blocks 1–3) or the second half (blocks 4–6) of each run. Because the six trials showed different instantiations of the same action (different contexts, perspectives, objects, actors, and hands), the MVPA targeted action representations that generalize across these factors. In total, this procedure resulted in 16 β maps (eight runs × two exemplars, hereinafter referred to as "patterns") per action condition. Searchlight-based (Kriegeskorte et al., 2006) and ROI-based MVPA were performed in volume space using spherical ROIs with a radius of 12 mm. For searchlight analyses, individual accuracy maps were entered into a one-sample t test to identify voxels yielding classification significantly above chance. Statistical maps were corrected for multiple comparisons using TFCE (see "Univariate fMRI analysis" for details). For ease of comparison, the mean accuracy maps and the outlines of the corrected clusters were projected on the same cortex-based aligned group surface. Decoding analyses were performed using a linear discriminant analysis classifier.
Multiclass decoding.
For multiclass searchlight MVPA, all eight actions were fed into the classification. In eight iterations, each action was discriminated from the remaining seven actions. The decoding accuracy at chance was therefore 12.5%. For within-category MVPA, only the two actions of the same category were decoded (e.g., open vs close for the transitive/nonsocial action category). The decoding accuracy at chance was 50%. For both analyses, classification accuracies were computed using leave-one-out cross validation; that is, the classifier was trained using the data of 15 patterns and tested on its accuracy at classifying the unseen data from the remaining pattern. This procedure was performed in 16 iterations using all possible combinations of training and test patterns. The classification accuracies from the 16 iterations were averaged to give a mean accuracy score, which was assigned to the central voxel.
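For illustration, the leave-one-pattern-out scheme for a single sphere can be sketched as follows. This is a simplified stand-in for the CoSMoMVPA routines actually used; the random data are placeholders, and the diagonal-covariance LDA is an assumption made to keep the example self-contained and runnable with few training patterns.

    % Sketch: leave-one-pattern-out multiclass decoding in one sphere
    % (illustrative stand-in for the CoSMoMVPA-based analysis).
    nCond = 8; nPat = 16; nVox = 100;            % 16 beta patterns per condition
    X    = randn(nCond*nPat, nVox);              % hypothetical beta patterns
    y    = repelem((1:nCond)', nPat);            % condition labels
    fold = repmat((1:nPat)', nCond, 1);          % pattern index = cross-validation fold

    acc = zeros(nPat,1);
    for f = 1:nPat
        train = fold ~= f;  test = fold == f;
        % diagLinear = LDA with diagonal covariance (avoids a singular covariance
        % estimate when voxels outnumber training patterns)
        mdl  = fitcdiscr(X(train,:), y(train), 'DiscrimType','diagLinear');
        pred = predict(mdl, X(test,:));
        acc(f) = mean(pred == y(test));
    end
    meanAccuracy = mean(acc);   % assigned to the sphere's central voxel; chance = 1/8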
Across-category decoding.
For the decoding of sociality and transitivity, the β values of the two actions within each category were collapsed. A cross-decoding scheme was used: To decode actions along transitivity, we trained the classifier to discriminate transitive versus intransitive actions using the social action pairs (give/take vs agree/disagree) and tested the classifier on the nonsocial pairs (open/close vs stroke/scratch). To decode actions along sociality, we trained the classifier to discriminate social versus nonsocial actions using the transitive pairs (give/take vs open/close) and tested the classifier on the intransitive pairs (agree/disagree vs stroke/scratch). Both tests were also performed vice versa (i.e., training on the intransitive and testing on the transitive pairs, and training on the nonsocial and testing on the social pairs), and the resulting accuracies were averaged across the two generalization directions. As described above, classification accuracies were computed using leave-one-out cross validation.
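The cross-decoding logic can be illustrated with the following sketch; the data and classifier settings are hypothetical placeholders, and the additional leave-one-out cross-validation over patterns used in the actual analysis is omitted here for brevity.

    % Sketch: across-category (cross-)decoding of transitivity (illustrative only).
    % Xsoc:    collapsed patterns of the social actions
    %          (rows 1:16 = change of possession, 17:32 = communication)
    % Xnonsoc: collapsed patterns of the nonsocial actions
    %          (rows 1:16 = object manipulation, 17:32 = body/contact)
    Xsoc    = randn(32, 100);                   % hypothetical data
    Xnonsoc = randn(32, 100);
    yTrans  = [ones(16,1); 2*ones(16,1)];       % 1 = transitive, 2 = intransitive

    % Direction 1: train on the social pairs, test on the nonsocial pairs
    mdl1 = fitcdiscr(Xsoc, yTrans, 'DiscrimType','diagLinear');
    acc1 = mean(predict(mdl1, Xnonsoc) == yTrans);

    % Direction 2: train on the nonsocial pairs, test on the social pairs
    mdl2 = fitcdiscr(Xnonsoc, yTrans, 'DiscrimType','diagLinear');
    acc2 = mean(predict(mdl2, Xsoc) == yTrans);

    transitivityAccuracy = (acc1 + acc2)/2;     % averaged over generalization directions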
RSA.
For RSA (Kriegeskorte et al., 2008a), we averaged the 16 β values of each action condition for each participant and voxel. For each searchlight/ROI sphere, we extracted the mean β values to obtain one multivoxel pattern per action. For each pattern, we normalized the β values by subtracting the mean β value from each individual β value (demeaning). Next, we correlated the patterns with each other, resulting in an 8 × 8 correlation matrix (the neural RDM) per sphere and participant. Then, neural RDMs were correlated with the RDMs for sociality and transitivity derived from the behavioral ratings. Resulting correlation coefficients were Fisher transformed and entered into one-sample t tests. Statistical maps were corrected for multiple comparisons using TFCE (see "Univariate fMRI analysis" for details).
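For one sphere, the RSA pipeline can be sketched as follows. The data are hypothetical, and the 1 − r dissimilarity used in the sketch is a common RSA convention that we adopt here as an assumption about the exact distance measure.

    % Sketch: RSA within one searchlight/ROI sphere (illustrative only).
    nVox = 100;
    patterns = randn(8, nVox);                 % condition-averaged beta patterns (hypothetical)
    patterns = patterns - mean(patterns, 2);   % demean each pattern

    % Neural dissimilarity: 1 - Pearson correlation between action patterns
    neuralRDM = 1 - corr(patterns');           % 8 x 8

    % Model RDM derived from the behavioral ratings (hypothetical example values)
    socRating = rand(8,1)*5 + 1;
    modelRDM  = abs(socRating - socRating');

    % Correlate the lower triangles and Fisher transform
    lowTri = find(tril(true(8), -1));
    rRSA   = corr(neuralRDM(lowTri), modelRDM(lowTri));
    zRSA   = atanh(rRSA);                      % entered into one-sample t tests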
Vector-of-ROI analysis.
To analyze the topographical organization in LOTC and VOTC with respect to the different analyses (multiclass, across- and within-category decoding, RSA, univariate effects), we conducted a vector-of-ROI analysis (Konkle and Caramazza, 2013). To this end, we defined dorsal and ventral anchor points [posterior superior temporal sulcus (pSTS) and parahippocampal cortex (PHC)] in each hemisphere based on the peak coordinates of the univariate conjunctions of sociality and transitivity (see Fig. 6). The anchor points were connected with a straight vector on the flattened cortical surface. This vector thus fully spanned LOTC and VOTC along the dorsal–ventral axis from pSTS (expected to be sensitive to person-related information) to PHC (expected to be sensitive to inanimate objects). Along this vector, we defined a series of partially overlapping spherical ROIs (12 mm radius, centers spaced 3 mm apart). In each ROI, we conducted all analyses as reported above using identical parameters as in the whole-brain analysis. For each analysis and hemisphere, responses were plotted as a function of position along the dorsal–ventral axis. Notably, as we focus on multivariate effects and use the univariate responses for comparison purposes only, the definition of the ROI vector is independent of the main analyses of interest.
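The construction of the ROI series can be sketched as follows. The anchor coordinates and the voxel grid are placeholders; in the actual analysis the vector was drawn on the flattened cortical surface, whereas this sketch interpolates between anchor points in volume space for simplicity.

    % Sketch: series of overlapping spherical ROIs between two anchor points
    % (anchor coordinates and voxel grid are hypothetical placeholders).
    pSTS = [-50 -55  12];          % dorsal anchor (placeholder coordinates)
    PHC  = [-28 -40 -12];          % ventral anchor (placeholder coordinates)
    radius = 12; spacing = 3;      % mm

    vecLen  = norm(PHC - pSTS);
    nROIs   = floor(vecLen/spacing) + 1;
    centers = pSTS + (0:nROIs-1)' * spacing * (PHC - pSTS)/vecLen;

    % For each ROI, select voxels within the 12 mm sphere and rerun the analyses
    voxXYZ = 60*rand(5000,3) - 30;                        % hypothetical voxel coordinates
    for i = 1:nROIs
        inROI = sqrt(sum((voxXYZ - centers(i,:)).^2, 2)) <= radius;
        % ... decoding / RSA / univariate estimates are computed on voxXYZ(inROI,:)
    end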
For a second vector-of-ROI analysis along the posterior–anterior axis, the anchor points were based on anatomical landmarks: The posterior end was defined as the early visual cortex at the occipital pole; the anterior end was defined as the middle portion of the middle temporal gyrus (MTG). As described above, the anchor points were connected with a straight vector on the flattened cortical surface. Along this vector, we defined a series of partially overlapping spherical ROIs (12 mm radius, centers spaced 3 mm apart). In each ROI, we conducted MVP decoding as reported above using identical parameters as in the whole-brain analysis.
Hierarchical cluster analysis.
For additional visualization, we computed dendrograms of the mean neural RDMs of DLOTC and VLOTC using hierarchical cluster analysis. DLOTC and VLOTC RDMs were extracted from the vector-of-ROI analysis. To this end, we first defined the center of action-sensitive LOTC as the peak of the multiclass decoding (see Fig. 7) and then defined DLOTC and VLOTC ROIs dorsally and ventrally of that peak; that is, DLOTC was defined as the eight adjacent ROIs dorsal of that peak and VLOTC was defined as the eight adjacent ROIs ventral of that peak. For each hemisphere and ROI, RDMs were extracted and averaged across ROIs and participants. Hierarchical cluster analysis was performed using average linkage.
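The dendrogram computation can be reproduced with standard MATLAB functions; the RDM below is random placeholder data rather than the measured neural RDM.

    % Sketch: dendrogram of a mean neural RDM (placeholder data).
    actionNames = {'give','take','open','close','agree','disagree','stroke','scratch'};
    meanRDM = squareform(pdist(randn(8,5)));     % hypothetical symmetric 8 x 8 RDM

    % Average-linkage hierarchical clustering on the pairwise dissimilarities
    Z = linkage(squareform(meanRDM), 'average'); % squareform converts the RDM to a vector
    figure;
    dendrogram(Z, 'Labels', actionNames);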
Results
Behavioral results
All participants identified catch trials with sufficient accuracy (mean error rate = 10.4 ± 1.3% SEM), indicating that they paid attention to the action videos.
Behavioral ratings for sociality and transitivity revealed that actions were clearly categorized into transitive versus intransitive and social versus nonsocial, respectively (see Fig. 1C for the corresponding RDMs derived from the ratings). The two ratings for sociality, which assessed sociality with respect to the passive person and the actor, respectively, were strongly correlated with each other (t(27) = 11.3, p < 0.001; mean r = 0.99, p < 0.001). We therefore collapsed the two ratings for subsequent analyses. Sociality and transitivity did not correlate significantly (t(27) = −0.082, p = 0.935; mean r = −0.066, p = 0.737), which suggests that the two experimental dimensions were independent of each other. In addition, the two dimensions did not correlate with movement similarity (transitivity–movement similarity: t(27) = −0.144, p = 0.887; mean r = 0.088, p = 0.681; sociality–movement similarity: t(27) = −0.935, p = 0.358; mean r = −0.291, p = 0.168).
Brain regions sensitive to action discrimination
To get an overview of brain regions that are generally capable of discriminating actions of distinct categories, we performed a multiclass searchlight MVPA using all actions of the four categories. This analysis was sensitive to conceptual characteristics of the action categories (including sociality and transitivity) as well as to general movement types characteristic of the different categories (e.g., reaching/grasping, wrist rotation, hand gestures). Importantly, the high stimulus variance minimized the sensitivity to low-level perceptual differences between actions and maximized the sensitivity to action representations that generalize across features such as effector (right or left hand), perspective (view from left, right, or center positions; actor on the left or right side), and concrete movement (grasping/manipulating different objects; stroking/scratching different body parts; different gestures for agreement and disagreement, respectively).
The analysis revealed highly robust above-chance decoding accuracies in lateral occipitotemporal and parietal regions that were strongest in left and right LOTC and in the middle intraparietal sulcus (IPS)/superior parietal lobe (SPL), as well as in the left posterior postcentral sulcus (PoCS)/anterior IPS (Fig. 2A). Decoding accuracies in frontal and medial temporal regions were substantially weaker than in the aforementioned regions (Table 1), supporting previous studies that found LOTC and PoCS/IPS, but less so premotor/prefrontal regions, to encode feature-general action representations (Oosterhof et al., 2012; Tucciarelli et al., 2015; Wurm and Lingnau, 2015; Wurm et al., 2016).
Brain regions sensitive to sociality and transitivity distinctions
Next, we investigated the functional organization of action representations with respect to sociality and transitivity features. In a first step, we searched for representations that are sensitive to sociality and transitivity independently of the concrete action subcategory. To this end, we performed an across-category decoding searchlight analysis. The general logic was the following: we trained a classifier to decode category A versus B and tested the same classifier using categories C versus D (and vice versa). To decode sociality-specific features, we trained a classifier to decode change of possession (trans/social) versus object manipulation (trans/nonsocial) and tested the classifier using communication (intrans/social) versus body/contact actions (intrans/nonsocial). Likewise, to decode transitivity-specific features, we trained a classifier to decode change of possession (trans/social) versus communication (intrans/social) and tested the classifier using object manipulation (trans/nonsocial) versus body/contact actions (intrans/nonsocial).
Both searchlight analyses revealed strong above-chance accuracies in bilateral LOTC and, strikingly, far weaker effects in parietal regions (Fig. 2B). Decoding of sociality and transitivity features differed mostly with respect to overall decoding strength; that is, there were higher decoding accuracies for transitivity compared with sociality. This difference is not surprising because transitivity distinguishes actions based on salient perceptual features such as reaching and grasping of objects, whereas sociality distinguishes actions based on more subtle, probably less perceptual, features. However, there were also anatomical differences. Although decoding of transitivity comprised regions in dorsal and ventral LOTC as well as in VOTC, decoding of sociality was mostly restricted to dorsal LOTC. Critically, in both hemispheres, decoding peaks of transitivity were in ventral LOTC, whereas peaks of sociality were in dorsal LOTC (see Fig. 5, Table 1).
In a second step, we characterized the representational organization of brain regions with respect to sociality and transitivity. To this end, we performed a searchlight-based RSA using RDMs obtained from behavioral ratings for sociality and transitivity (Fig. 3). The RSA for sociality revealed significant effects in bilateral posterior MTG (pMTG), as well as in left postcentral gyrus. The RSA for transitivity revealed significant effects throughout lateral OTC and VOTC [peaking in the fusiform gyrus (FG)/PHC], as well as in the posterior operculum, IPS, and dorsal premotor cortex (PMd). Within LOTC, which has been suggested to be bounded approximately by the middle portion of MTG (anteriorly), the lateral occipital sulcus (posteriorly), the STS (dorsally), and the inferior temporal gyrus (ITG, ventrally; Lingnau and Downing, 2015), the clusters found for sociality were located more dorsally than those for transitivity (Table 2). At a larger topographical scale, however, the dorsal–ventral gradient from transitivity to sociality was less strict because there were nearby regions fitting the transitivity model in regions other than ITG/FG/PHC (left posterior operculum/supramarginal gyrus and right posterior operculum/superior temporal gyrus), which were dorsal and anterior to the sociality clusters.
Brain regions representing category-specific subtypes of actions
The previous analysis focused on the abstract representation of sociality and transitivity features; that is, information that generalizes across category-specific actions. It is unclear, however, how these abstract dimensions relate neuroanatomically to more specific representations of action subtypes. To address this question, we decoded the actions for each category separately using a within-category searchlight MVPA (i.e., give vs take, open vs close, agree vs disagree, and stroke vs scratch). The critical difference between across-category and within-category MVPA is that the former relied on action-general differences between social versus nonsocial and transitive versus intransitive actions, respectively, whereas the within-category MVPA relied on action-specific differences between two actions of the same category. A notable feature of the within-category MVPA is that the decoded classes are perceptually similar so that the classifier exploits more subtle differences between actions: For example, videos of give and take contained highly similar reaching and grasping movements and differed only with respect to start and end location of the object relative to the actor [note that, due to the variance of actor position (left or right side of the table) and perspective, it is impossible that the decoding relied on absolute object positions]. Therefore, in the within-category MVPA, classification due to perceptual differences was minimized by keeping category-specific features such as reaching and grasping constant. In contrast, in the across-category MVPA, classification due to perceptual differences was minimized by generalizing across category-specific features. In addition, for both approaches, the high stimulus variance ensured that decoding relies on abstract representations that generalize across features such as effector and perspective.
In a first step, we performed searchlight analyses for each category separately. For each category, we obtained mean accuracy and t-maps to reveal regions where decoding accuracy was consistently above chance (50%) across participants (Fig. 4A, Table 3). Decoding accuracies were generally highest in occipitotemporal and parietal regions; however, not all four searchlight analyses revealed statistically robust effects surviving TFCE correction. Decoding open versus close (object manipulation) revealed significant clusters in left LOTC and left postcentral sulcus. These clusters overlapped well with the clusters found in a previous study that decoded open versus close actions using different stimuli (Wurm and Lingnau, 2015). Decoding agree versus disagree (communication) revealed similar clusters in left and right LOTC and left postcentral sulcus (p < 0.001, uncorrected), but only the cluster in right LOTC survived TFCE correction. Decoding give versus take (change of possession) revealed a cluster in right LOTC (p < 0.005, uncorrected). Decoding stroke versus scratch (body/contact action) revealed a cluster in right precentral gyrus/sulcus (p < 0.0025, uncorrected). A comparison of the maps revealed no systematic segregation in LOTC along transitivity and/or sociality. This is perhaps not surprising because any higher-level information such as sociality and transitivity is constant between the decoded actions in the within-category MVPA and was thus canceled out. However, because the division into four separate searchlight analyses naturally reduced the power for each analysis, we cannot rule out that the representation of more concrete features of category-specific actions also reflects distinctions along transitivity and/or sociality.
To investigate the general relationship between representations of the more abstract dimensions sociality and transitivity and the more concrete representations of specific action subtypes, regardless of the four categories, we collapsed the accuracy maps of the within-category MVPA for each participant and computed a t test across the averaged maps. We reasoned that this analysis should reveal areas containing representations of specific action subtypes regardless of the overarching action category. This analysis revealed significant clusters in left and right LOTC and left PoCS at the junction to the IPS (Fig. 4B). Clusters in left and right LOTC were posterior to the clusters of the across-category decoding (Fig. 5). This finding points to a distinction between action-general and action-specific concept features along the anterior–posterior axis.
Although not the focus of the current study, the results obtained in parietal regions are worth mentioning. The clusters in PoCS partly overlapped with the anterior inferior parietal peak of the multiclass decoding. Interestingly, the anterior inferior parietal lobule (IPL) was found only to a weak extent in the across-category decoding. Consistent with previous findings (Oosterhof et al., 2010; Oosterhof et al., 2012; Leshinskaya and Caramazza, 2015; Wurm and Lingnau, 2015; Wurm et al., 2016), this pattern of results suggests that left anterior IPL represents action-specific information of a high degree of generality, but is less likely to represent higher-order dimensions such as sociality and transitivity. Anterior IPL thus reveals a functional profile that is different from the profile of LOTC and, notably, of posterior/superior parietal cortex (IPS/SPL). In IPS/SPL, effects were found only in the multiclass decoding and the RSA for transitivity but far less in the within- and across-category decoding. In other words, neural populations in IPS/SPL differentiate actions of one category from actions of other categories without generalizing across properties such as sociality and transitivity. At the same time, IPS/SPL did not differentiate actions of the same category when they were perceptually very similar; that is, when they had similar movement trajectories. Together, these findings suggest that IPS/SPL codes coarse-grained spatial action features specific for each of the categories. Consistent with studies on the role of IPS/SPL in action observation (Caspers et al., 2010; Binkofski and Buxbaum, 2013), it is likely that these features are related to body part motion in space that, in our study, was similar within category but different between categories.
Univariate (activation-based) effects of sociality and transitivity
Both the across-category decoding and the RSA suggest distinct functional profiles in DLOTC and VLOTC regarding the action dimensions sociality and transitivity. Could this distinction be driven by increased activation of associated object information? For example, it is possible that the observation of social actions increased attention toward the nonacting person and thereby enhanced the processing of body and face information that could serve as socially relevant cues. Observation of social actions is also likely to induce mentalizing about the other person's feelings and reactions (Saxe and Kanwisher, 2003; Wurm et al., 2011). In contrast, observation of transitive actions is likely to direct attention toward the object involved in the action and thereby enhance the perceptual and semantic processing of that object. Following this logic, enhanced processing of person and inanimate object information should be reflected in enhanced activation in brain regions representing person and inanimate object information, respectively. To identify regions showing such activation differences, we computed univariate contrast conjunctions for social versus nonsocial (give/take vs open/close and agree/disagree vs stroke/scratch) and transitive versus intransitive actions (give/take vs agree/disagree and open/close vs stroke/scratch), respectively (Fig. 6). The contrast conjunction social versus nonsocial revealed bilateral pSTS; that is, a region typically associated with the processing of socially relevant body and face information (Allison et al., 2000). Critically, in both hemispheres, the clusters in pSTS were dorsal to the DLOTC clusters identified in the sociality RSA. The contrast conjunction for transitive versus intransitive revealed bilateral FG/PHC, which can be associated with the processing of object information (Mahon et al., 2007), as well as bilateral PMd and SPL; that is, regions recruited during the observation, planning, and execution of reaching and grasping movements (Binkofski and Buxbaum, 2013; Turella and Lingnau, 2014). These clusters overlapped with some of the clusters identified in the transitivity RSA, which suggests that, in these regions, multivariate effects might be affected by activation of inanimate object knowledge and kinematic representations. However, in LOTC, we did not find considerable activation differences overlapping with clusters identified in the transitivity RSA. Overall, the representational similarity effects seem to be rather independent of the univariate effects in LOTC (see also the results of the vector-of-ROI analysis).
Vector-of-ROI analysis
To provide an integrated picture of the responses with respect to sociality and transitivity, we plotted multivariate effects, along with the univariate effects for each action category for reference, as a function of the position on a dorsal–ventral axis from the dorsal end of the LOTC (pSTS) to the ventral end of the VOTC (PHC). To this end, we defined anchor points based on the univariate contrast conjunctions for social versus nonsocial and transitive versus intransitive, respectively. These anchor points were chosen because we expected a putative segregation between transitive and social actions to be most evident between regions sensitive to person-related information and inanimate objects, respectively. Between these anchor points, we defined a vector of adjacent ROIs. From each ROI, we extracted decoding accuracies (multiclass decoding, across- and within-category decoding), RSA correlations, and univariate β estimates and plotted these responses as a function of the position on the dorsal–ventral axis. For univariate effects, we computed β estimates for each of the four action categories separately. For a better visualization of the relative differences between categories, we normalized the β values (Konkle and Caramazza, 2013): for each ROI and category, we subtracted the mean of all four categories of that ROI. Results are shown in Figure 7. There were two major findings. First, the multiclass decoding (and, less clearly, the within-category decoding) peaked in the LOTC at the level of MTG/ITG. This suggests that this region is generally most sensitive to action information. Second, the dorsal and ventral sides of this peak in LOTC showed preferences toward sociality and transitivity, respectively: across-category decoding and RSA revealed stronger effects of sociality on the dorsal compared with the ventral side of this peak. In contrast, effects of transitivity were stronger on the ventral compared with the dorsal side of this peak. These sociality and transitivity peaks were located between pSTS (the dorsal end of the LOTC) and the multiclass decoding peak (the middle of LOTC), and between this peak and ITG (the ventral end of LOTC, at the border to VOTC), respectively. Consistent with the univariate conjunction analysis (Fig. 6), pSTS and PHC showed univariate preferences for the two social (give/take and agree/disagree) and the two transitive (give/take and open/close) action categories, respectively. In addition, we observed a univariate preference for the nonsocial action categories (open/close and stroke/scratch) in MTG/ITG, which could be due to increased processing of complex hand kinematics (Bracci et al., 2010; Orlov et al., 2014) that were specific to the two nonsocial action categories.
With regard to an additional segregation of sociality and transitivity in VOTC, the findings are less clear. The across-category decoding did not show systematic peak positions in FG and PHC that would point to a distinction between sociality and transitivity. However, as expected, the RSA revealed a better fit of the sociality model in FG than in PHC, whereas the opposite effect was found for the transitivity model. It is questionable, though, whether this distinction reflects differences in the representational organization of action knowledge because we did not observe a secondary peak of the multiclass decoding in VOTC, as would be expected if this region represented an additional hub of action processing.
Notably, along the whole dorsal–ventral axis, the across-category decoding of transitivity revealed higher accuracies than that of sociality, whereas the RSA showed higher correlations for sociality than for transitivity in DLOTC (see also the respective searchlight analyses). This apparent discrepancy can be explained by the different methods underlying MVPA and RSA: using MVPA, the classifier might have picked up different (and possibly more subtle but highly reliable) information than the RSA, which is based on correlations of whole voxel patterns and does not give greater weight to single, more reliable voxels.
The differential organization of action information along sociality and transitivity in LOTC was further illustrated by a hierarchical cluster analysis. In DLOTC, social and nonsocial actions formed superordinate clusters; in VLOTC, transitive and intransitive actions formed superordinate clusters (Fig. 8).
Finally, to investigate the gradient from action-specific to more general action features along the posterior–anterior axis, we conducted a second vector-of-ROI analysis. For investigation of this action-specificity gradient, only the within- and across-category decoding is informative. We therefore performed only these decoding analyses (Fig. 9). Consistent with the whole-brain analysis (Fig. 5), in both hemispheres, the peaks of the across-category decoding were located more anteriorly relative to the within-category decoding. Note, however, that the within-category decoding revealed only subtle variations along the posterior–anterior axis; that is, there was no clearly distinct peak. This analysis therefore provided only moderate evidence for an action-specific to general gradient along the posterior–anterior axis.
Discussion
The present study investigated the neural organization of actions along the dimensions of sociality and transitivity. We report three major findings:
First, features associated with social versus nonsocial and transitive versus intransitive actions could be decoded in LOTC independently of the specific action category. For example, a classifier that was trained to distinguish between change of possession (social/transitive) and object manipulation (nonsocial/transitive) actions was able to distinguish between communicative (social/intransitive) and body/contact (nonsocial/intransitive) actions. This finding suggests that LOTC represents features of sociality and transitivity at a level that is independent of specific action subtypes.
Second, dorsal and ventral subregions of LOTC were organized preferentially along sociality and transitivity, respectively. The representational similarity of actions in DLOTC was better explained by the sociality model than by the transitivity model, whereas in VLOTC, the opposite pattern was found. This suggests that DLOTC represents social and nonsocial action features distinctly, whereas VLOTC represents transitive and intransitive action features distinctly.
Finally, information about specific actions of the same category could be decoded in regions of LOTC that were posterior to the regions coding sociality and transitivity. This finding suggests a second organizational principle in LOTC: a posterior-to-anterior gradient from the coding of action-specific features to the coding of more general category features that are independent of specific actions.
Dorsal and ventral LOTC/MTG differentiate social versus nonsocial and transitive versus intransitive action features, respectively
Using RSA, we demonstrated that DLOTC preferentially represents actions as predicted by the sociality model, whereas VLOTC preferentially represents actions as predicted by the transitivity model. In addition, in both hemispheres, the peak location of the social versus nonsocial action decoding was dorsal to the peak location of the transitive versus intransitive action decoding. Together, these findings show that action information along these dimensions is represented differentially in DLOTC and VLOTC. Overall, action decoding was highest at the level of MTG/ITG (Figs. 2, 7). In contrast, univariate effects of sociality and transitivity were found in pSTS and FG/PHC (Fig. 6), regions involved in the processing of person-related information (Allison et al., 2000) and inanimate objects (Chao et al., 1999; Mahon et al., 2007), respectively. Actions, even from distinct action categories such as those in our study, have structural similarities (they typically involve the dynamic processing of motion and change, are typically intentional, etc.) and are therefore likely to be represented by neural substrates with similar computational properties (Kaas and Catania, 2002; Rosa and Tweedale, 2005). In other words, actions such as open, give, agree, and scratch are more similar to each other than to other, structurally different kinds of information such as persons and inanimate objects, even if these kinds of information are important (albeit not constitutive) for action recognition. Following this reasoning, our finding that action information was encoded in proximity to, but not overlapping with, person-related and inanimate object knowledge is plausible. The subdivision within action-processing neural substrates along the dorsal–ventral axis (i.e., DLOTC being more sensitive to sociality features, whereas VLOTC is more sensitive to transitivity features) can be explained under the assumption that the neuroanatomical organization of action knowledge is shaped by systematic connections between object and action representations: Socially relevant person information in dorsal areas such as the STS should be more strongly connected to social action representations in LOTC. In contrast, inanimate object information in ventral areas such as the ITG and FG should be more strongly connected to object-directed action representations in LOTC. The connections to person-related and inanimate object information thus might exert opposing constraints on the representations of social and transitive actions, which could drive the anatomical segregation in the observed way. This interpretation is supported by recent studies demonstrating enhanced functional connectivity specific for inanimate objects (artifacts and tools) between FG and a region in LOTC overlapping with the region that we found to be sensitive to the transitive versus intransitive action distinction (Hutchison et al., 2014; Stevens et al., 2015). Likewise, effective connectivity between pSTS and LOTC is modulated by socially relevant cues such as facial expressions (Furl et al., 2015).
What remains unspecified is the kind of information that drives the observed distinctions in DLOTC and VLOTC as revealed by the RSA and the across-category decoding. Do the distinctions reflect semantic categorizations or are they driven by structural properties of the observed actions? Transitive actions can indeed be differentiated from intransitive actions based on intrinsic structural properties such as the reaching and grasping of objects. It is reasonable to assume that neural systems important for the recognition of reaching and grasping as well as hand–object interaction would be located in proximity to regions coding tools and other graspable objects (Bracci et al., 2012). The high structural similarities of actions within the transitive and intransitive categories are also reflected in the overall higher accuracies of the across-category decoding for transitivity. For social actions, perceptual commonalities are less evident. Give and take are perceptually different from agree and disagree gestures and, likewise, open and close are perceptually different from stroke and scratch actions. Consistent with this view, multivariate effects of sociality were generally more subtle than effects of transitivity. Furthermore, in both social and nonsocial actions, a second, passive person was present, ruling out that social actions could be distinguished from nonsocial actions based on perceptual cues of the other person. Increased processing of the passive person for social actions is unlikely to drive the distinction because, in that case, one should also have observed univariate activation differences between social and nonsocial actions in LOTC, which was not the case. However, the social actions were directed to the other person and can thus be interpreted as interpersonal actions even if there was no observable reaction of the passive person. For the social actions, the acting and the passive person therefore defined a common social space, which was less the case for the nonsocial actions. The distinction between social and nonsocial actions could therefore be explained by more general underlying dimensions, such as social space or whether or not an action is directed toward another person. Another possibility is that general differences in the complexity of fine hand/finger movements, independently of the concrete movements themselves, drove the distinction between social versus nonsocial actions. Indeed, we found stronger univariate responses for the nonsocial versus social actions at the level of the pMTG (Fig. 7). Note, however, that the univariate response profile of the nonsocial actions differed from the profile of the sociality RSA, which suggests that the two analyses picked up different kinds of information. Finally, it is possible that the across-category decoding relied on semantic representations of action primitives (Schank, 1973; Schank and Abelson, 1977) that were similar for the social actions and the transitive actions, respectively. In fact, the social actions used in our study (give, take, agree, disagree) involved a transfer of (physical or mental) objects. At the same time, the transitive actions (give, take, open, close) involved a change (of location or configuration) of objects. Action concepts that are composed of similar action primitives would therefore be close to each other in representational space, consistent with our findings. Future studies should investigate the extent to which such decompositional models (Jackendoff, 1972; Gruber, 1976; Pinker, 1989) can explain the neural organization of action knowledge.
Posterior to anterior LOTC is organized along a gradient from action-specific to general action information
A secondary finding of our study is that the cluster peaks of the within-category decoding were located more posteriorly in LOTC than the peaks of the across-category decoding. Note, however, that the accuracies of the within-category decoding varied within a relatively narrow range and that the clusters of both decoding analyses overlapped substantially. The analysis therefore suggests only subtle, preferential differences of representational content along the posterior–anterior axis.
Compared with the across-category decoding, the within-category decoding relied on more subtle differences between actions of the same category (e.g., give vs take or agree vs disagree). These differences were either at a higher visual level (e.g., the position change of an object away vs toward the body of the acting person in the case of give vs take) or at the conceptual level (e.g., making different gestures for agreement vs disagreement in the case of agree vs disagree). The stimulus variance minimized the chance of decoding perceptual aspects of the actions such as perspective, agent, or concrete action instantiation. Because the actions were from the same category, it is not possible that decoding relied on more general features characteristic for an action category (e.g., transitivity for open vs close because both actions are transitive). In summary, the within-category decoding probably identified representations of specific action subtypes at a higher visual and/or conceptual level (Wurm and Lingnau, 2015).
In contrast, the across-category decoding was not suited to detect information specific for action subtypes because the classifier was trained and tested on actions of different categories (e.g., trained on change of possession vs object manipulation and tested on communication vs body/contact actions). As elaborated above, the across-category decoding was most sensitive to action features that generalize across categories along the dimensions sociality and transitivity. The different peak locations of within- and across-category decoding suggest that abstract action-general features are represented more anteriorly than concrete action-specific features, which is consistent with recent proposals on the functional organization of LOTC from concrete to abstract and from visual to amodal action representations (see also Thompson-Schill, 2003; Martin, 2007; Watson and Chatterjee, 2011; Lingnau and Downing, 2015).
Conclusions
Our results suggest a topographic organization of LOTC along two major axes: a dorsal versus ventral distinction that segregates social versus object-related action information, respectively, and a posterior-to-anterior gradient from specific action subtypes to broader action categories that generalize across concrete action subtypes. This action topography gains plausibility from the documented object topography, which distinguishes faces/bodies from artifacts, and from the connectivity of the regions representing these object classes. Our results help to establish a clearer and theoretically motivated picture of the representational organization of LOTC.
Footnotes
This work was supported by the German Research Foundation (DFG Research Grant WU 767/1-1), the Provincia Autonoma di Trento, and the Fondazione Cassa di Risparmio di Trento e Rovereto. We thank Paola Menapace for assistance in preparing the stimulus material and Gilles Vannuscorps for helpful comments on the manuscript.
The authors declare no competing financial interests.
Correspondence should be addressed to Moritz F. Wurm, Department of Psychology, Harvard University, 33 Kirkland Street, Cambridge, MA 02138. E-mail: mwurm@fas.harvard.edu