Abstract
The neural basis of action understanding in humans remains disputed, with some research implicating the putative mirror neuron system (MNS) and some a mentalizing system (MZS) for inferring mental states. The basis for this dispute may be that action understanding is a heterogeneous construct: actions can be understood from sensory information about body movements or from language about action, and with the goal of understanding the implementation (“how”) or motive (“why”) of an action. Although extant research implicates the MNS in understanding implementation and the MZS in understanding motive, it remains unknown to what extent these systems subserve modality-specific or supramodal functions in action understanding. While undergoing fMRI, 21 volunteers considered the implementation (“How is she doing it?”) and motive (“Why is she doing it?”) for actions presented in video or text. Bilateral parietal and right frontal areas of the MNS showed a modality-specific association with perceiving actions in videos, while left-hemisphere MNS showed a supramodal association with understanding implementation. Largely left-hemisphere MZS showed a supramodal association with understanding motive; however, connectivity between the MZS and MNS during the inference of motive was modality specific, being significantly stronger when motive was understood from actions in videos compared to text. These results support a tripartite model of MNS and MZS contributions to action understanding, where distinct areas of the MNS contribute to action perception (“perceiving what”) and the representation of action implementation (“knowing how”), while the MZS supports an abstract, modality-independent representation of the mental states that explain action performance (“knowing why”).
Introduction
The neural basis of action understanding in humans remains disputed (Gallese et al., 2011). The debate has centered on clarifying the role of the mirror neuron system (MNS), which in humans refers to brain regions that activate during both the observation and execution of actions (Van Overwalle and Baetens, 2009; Rizzolatti and Sinigaglia, 2010; Keysers et al., 2011). Although the human MNS is reliably active during action observation, several studies have shown that it is not sensitive to the demand to explain observed actions; rather, a separate brain system known as the theory-of-mind or mentalizing system (MZS) appears to be critical (Grèzes et al., 2004; Brass et al., 2007; de Lange et al., 2008; Spunt et al., 2011).
Importantly, there remains little consensus in the literature on what is involved in the act of understanding an action (Gallese et al., 2011; Kilner, 2011; Uithol et al., 2011). This may be because action understanding is a heterogeneous psychological construct that encompasses not one but many ecological contexts in which humans consider the actions of other humans (Vallacher and Wegner, 1987; Kilner, 2011; Uithol et al., 2011). One ecological variable is the observer's comprehension goal. In some cases, action understanding entails understanding how an action is, was, or will be implemented, for instance when attempting to imitate the action of another. In other cases, actions are represented with the goal of understanding the actor's motive—that is, why the action is being performed. A second important ecological variable is the modality through which the action becomes an object of cognition. Specifically, actions are abstract conceptual objects that can be understood through language just as readily as by watching others act. Given these distinct types of inputs, the modality through which an action is apprehended is likely a critical variable in determining the neural systems that support a given instance of action understanding.
These ecological variables are orthogonal: actions, regardless of the modality through which they are apprehended, can be considered with the goal of understanding either how or why. Hence, the two factors can be factorially combined to produce a matrix of four contexts of action understanding (see Fig. 1A). Research groups investigating the neural bases of action understanding have typically employed paradigms that capture only one or two of these contexts (Van Overwalle and Baetens, 2009; Zaki and Ochsner, 2009). Several studies have manipulated the observer's goal during action understanding (Decety et al., 1997; de Lange et al., 2008; Spunt et al., 2010, 2011), and one study presented actions in both the sensory and linguistic modalities, although it was not designed to directly contrast the two (Aziz-Zadeh et al., 2006). To date, no study has simultaneously manipulated goal and modality while attempting to hold action content constant across conditions. Therefore, we designed a novel paradigm for investigating action understanding that faithfully reproduces these four contexts as they might occur in daily life. Within this paradigm, we aimed to determine the neural systems independently sensitive to stimulus modality (sensory vs linguistic) and comprehension goal (how vs why).
Materials and Methods
Participants
Twenty-one right-handed participants (12 females, mean age = 21.71 years, range = 19–32 years) were recruited from the University of California, Los Angeles (UCLA) subject pool and provided written informed consent according to the procedures of the UCLA Institutional Review Board. All participants were native English speakers and were not taking psychoactive medications at the time of the study.
Experimental stimuli
Action understanding task.
The complete set of stimuli used in the action understanding task consisted of 48 video–text pairs. These stimulus pairs were produced using the following procedure. First, we filmed a female actor performing 78 common object-directed actions in natural scenes. During filming, the actor was instructed to maintain neutral affect, and each clip was framed to make salient at least one object-directed hand action. After filming, all clips were edited to be silent and 5 s in duration. Our next step was to produce empirically matched text descriptions of the actions in each clip. To do so, we had 26 UCLA undergraduates view the 78 edited clips while seated at a computer station. Participants identified the action in each clip by typing a response to the question “What is she doing?” The only constraint put on responses was that they begin with the string “She is.” We then selected the 48 clips that produced the highest interobserver agreement in these identifications, and the modal response for each clip was then used to generate the matched text stimulus. For the final set of 48 clips, each of the paired text stimuli was provided by at least 65% of the pilot participants, and the average percentage of observers who displayed agreement was 85%.
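As an illustration of this selection step, the following MATLAB sketch computes the modal description and interobserver agreement for each clip and retains the 48 best-agreed clips. This is not the authors' code; the variable responsesByClip, the placeholder data, and the text normalization are assumptions introduced for illustration.

```matlab
% Hypothetical sketch: responsesByClip{c} is a cell array of the raters'
% typed descriptions for clip c, e.g., 'She is brushing her teeth'.
responsesByClip = repmat({repmat({'she is brushing her teeth'}, 26, 1)}, 78, 1);  % placeholder data
nClips = 78;
modalText = cell(nClips, 1);
agreement = zeros(nClips, 1);
for c = 1:nClips
    resp = lower(strtrim(responsesByClip{c}));   % simple normalization (assumption)
    [uniq, ~, idx] = unique(resp);               % collapse identical responses
    counts = accumarray(idx, 1);                 % frequency of each unique response
    [nModal, iModal] = max(counts);
    modalText{c} = uniq{iModal};                 % modal description used as the text stimulus
    agreement(c) = 100 * nModal / numel(resp);   % percentage of raters giving the modal response
end
[~, order] = sort(agreement, 'descend');
keep = order(1:48);                              % clips retained for the scanner task
fprintf('Mean agreement of retained clips: %.1f%%\n', mean(agreement(keep)));
```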
MNS localizer task.
In addition to the primary action understanding task, all participants performed an additional task that allowed us to independently define the MNS in our sample. The stimulus set consisted of 12 videos of a male actor performing button press sequences on a four-button box. Each clip was filmed from a first-person perspective and narrowly framed on the hand action. After filming, each clip was edited to be silent and 4–5 s in duration.
Experimental design
Action understanding task.
The task followed a 2 (Stimulus Modality: Video vs Text) × 2 (Comprehension Goal: Implementation vs Motive) within-subjects factorial design, resulting in four experimental conditions (Fig. 1A). To manipulate stimulus modality, the 48 video–text pairs described above were evenly divided into two sets. Each participant was randomly assigned to receive one set in video format and the other in text format. For example, approximately half of the participants received the action “She is brushing her teeth” as a video and “She is playing a guitar” as a text stimulus, while the other half received “She is brushing her teeth” as a text and “She is playing a guitar” as a video stimulus (Fig. 1B). This was done to counterbalance the action identities across the two stimulus modalities, so that across the group, the video and text conditions featured the same actions.
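This counterbalancing scheme can be sketched as follows; the set assignments and variable names are hypothetical, and this is a reconstruction under stated assumptions rather than the authors' code.

```matlab
% Hypothetical sketch: assign each participant one half of the 48 video-text
% pairs in video format and the other half in text format, alternating the
% assignment so that the two sets are balanced across the group.
nSubjects = 21;
setA = 1:24;  setB = 25:48;                 % fixed split of the 48 stimulus pairs (assumption)
videoItems = cell(nSubjects, 1);
textItems  = cell(nSubjects, 1);
order = randperm(nSubjects);                % randomize which participants get which assignment
for i = 1:nSubjects
    s = order(i);
    if mod(i, 2) == 1
        videoItems{s} = setA;  textItems{s} = setB;
    else
        videoItems{s} = setB;  textItems{s} = setA;
    end
end
```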
To manipulate comprehension goal, participants were cued before each trial to answer one of two questions. When the goal was to understand implementation, participants were presented with the question “How is she doing it?” and were asked to silently think of one important part of performing the action. When the goal was to understand the actor's motive, participants were presented with the question “Why is she doing it?” and were asked to silently think of one plausible motive the actor has for performing the action.
MNS localizer task.
Participants underwent two conditions (Fig. 1C). In the Observation condition, participants were instructed to passively observe a video clip of a hand performing a sequence of button presses on the same MR-compatible four-button box that they were holding in their right hand. In the Execution condition, participants were cued to repeat the sequence with their right hand.
Experimental procedure
Before scanning, participants were introduced to and trained in both experimental tasks. For the action understanding task, this included performing three trials from each condition (using stimuli not featured in the scanner task) while the experimenter monitored performance. During debriefing, all participants reported complete comprehension of the experimental tasks.
The structure of the action understanding task is depicted in Figure 1B. Before each trial, participants were cued to understand either the implementation or the motive of the upcoming action. Once the stimulus appeared, participants were instructed to silently think of their response as quickly as possible and to make a right index finger button press once they had their response in mind. Response time (RT) was recorded at this button press. Each stimulus remained onscreen for a maximum duration of 5 s; if the participant responded before 5 s, the stimulus was replaced with a fixation cross for the remainder of the trial. The order of trials was optimized for both estimation and contrast efficiency using a genetic algorithm (Wager and Nichols, 2003). Trials were separated by a fixation screen of variable duration (sampled from an exponential distribution; range = 2000–6000 ms, mean = 3000 ms). Following the scan, participants performed the task a second time and typed their answers to the Why and How questions on a keyboard.
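As a minimal sketch of the interstimulus jitter (not the genetic-algorithm optimization of Wager and Nichols, 2003, which is not reproduced here), fixation durations in the stated range can be drawn from a truncated exponential. The trial count and rate parameter below are assumptions chosen so that the truncated mean lands near 3000 ms.

```matlab
% Hypothetical sketch of drawing jittered fixation durations (ms) from a
% truncated exponential with range 2000-6000 ms and mean near 3000 ms.
nTrials = 96;                                  % example trial count (assumption)
iti = zeros(nTrials, 1);
for t = 1:nTrials
    d = Inf;
    while d < 2000 || d > 6000                 % rejection sampling into the allowed range
        d = 2000 + exprnd(1100);               % offset plus exponential tail; mean of the
    end                                        % truncated draw is approximately 3000 ms
    iti(t) = d;
end
fprintf('Mean fixation duration = %.0f ms\n', mean(iti));
```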
The structure of the MNS localizer task is depicted in Figure 1C. Each trial always began with the participant passively observing a button press sequence. Participants were then presented with a screen displaying the words “HOLD STILL.” Next, participants were given 5 s to accurately repeat the sequence observed in the clip either once or twice. To facilitate task engagement, feedback was provided (with the words “Correct” or “Incorrect” centered onscreen) following each execution period. Each trial ended with a screen displaying the words “HOLD STILL.”
Stimuli were presented and responses recorded using the Psychophysics Toolbox (Brainard, 1997) running in MATLAB version 7.4 (The MathWorks). Participants viewed the task through MR-compatible LCD goggles.
Image acquisition
Imaging data were acquired using a Siemens Trio 3.0 tesla MRI scanner at the UCLA Ahmanson-Lovelace Brain Mapping Center (Los Angeles, CA). For each participant we acquired 1216 functional T2*-weighted echo planar image volumes (EPIs; slice thickness = 3 mm, gap = 1 mm, 36 slices, TR = 2000 ms, TE = 25 ms, flip angle = 90°, matrix = 64 × 64, FOV = 200 mm). The action understanding task was performed in two runs (each collecting 290 volumes). The MNS localizer task was performed in a single functional run (152 measurements). The final 484 volumes were collected for the purposes of another investigation. We also acquired a T2-weighted, matched-bandwidth anatomical scan (same parameters as EPIs, except: TR = 5000 ms, TE = 34 ms, flip angle = 90°, matrix = 128 × 128) and a T1-weighted, magnetization-prepared, rapid-acquisition, gradient echo anatomical scan (slice thickness = 1 mm, 176 slices, TR = 2530 ms, TE = 3.31 ms, flip angle = 7°, matrix = 256 × 256, FOV = 256 mm).
Behavior analysis
MATLAB was used to analyze all behavioral data. For each participant, we computed the mean RT for each of the four conditions and used a repeated-measures ANOVA to test the main effects of modality and goal and their interaction on RT. In addition, we used custom MATLAB software to examine the verbs used in the four conditions. First, we combined all participants' post-scan responses and computed the frequency of each verb (summing across different forms of each verb). For each goal, we selected the five most frequently used verbs (Table 1). Next, we used paired-samples t tests to compare the frequency of use of each verb across the two modalities.
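A hedged sketch of the verb comparison is given below. The count array, its name, and the placeholder data are assumptions introduced for illustration; this is not the authors' analysis code.

```matlab
% Hypothetical sketch: verbCount(s, v, m) is the number of times participant s
% used verb v in modality m (1 = video, 2 = text) for a given comprehension goal.
nSubjects = 21;  nVerbs = 5;
verbCount = randi([0 10], nSubjects, nVerbs, 2);   % placeholder data for illustration
pVals = zeros(nVerbs, 1);
for v = 1:nVerbs
    % paired-samples t test comparing use of this verb in video vs text trials
    [~, pVals(v)] = ttest(verbCount(:, v, 1), verbCount(:, v, 2));
end
disp(pVals');
```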
Image analysis
Functional data were analyzed using Statistical Parametric Mapping (SPM8, Wellcome Department of Cognitive Neurology, London, UK) operating in MATLAB. Within each functional run, image volumes were realigned to correct for head motion, normalized into Montreal Neurological Institute space (resampled at 3 × 3 × 3 mm) using the SPM segmentation routine, and smoothed with an 8 mm Gaussian kernel, full width at half-maximum.
For both tasks, we defined a general linear model for each participant separately. For the MNS localizer, the model included two regressors of interest: Observe and Execute. Trials were modeled as a variable epoch (Grinband et al., 2008) spanning stimulus onset to offset (Observe) or final button press (Execute) and convolved with the canonical (double-gamma) hemodynamic response function (HRF). We included the six motion parameters as covariates of no interest. For the action understanding task, the model included four regressors of interest: WhyVideo (WV), WhyText (WT), HowVideo (HV), and HowText (HT). Trials were modeled as a variable epoch spanning stimulus onset to button press and convolved with the canonical HRF. Additional covariates of no interest included regressors modeling skipped trials (defined as the absence of a button press) and the six motion parameters. For both tasks, the time series was high-pass filtered using a cutoff period of 128 s, and serial autocorrelations were modeled as an AR(1) process.
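To make the variable-epoch approach concrete, the following sketch builds one such regressor outside of SPM. It is a simplified illustration under stated assumptions (the double-gamma parameters, onsets, and durations are examples), not the model code actually used.

```matlab
% Hypothetical sketch of a variable-epoch regressor: each trial is a boxcar
% lasting from stimulus onset to the button press, convolved with a canonical
% double-gamma HRF and then sampled once per scan (TR).
TR = 2;  dt = 0.1;  nScans = 290;               % seconds; dt = high-resolution time step
t   = (0:dt:32)';                               % HRF support in seconds
hrf = gampdf(t, 6, 1) - gampdf(t, 16, 1) / 6;   % standard double-gamma shape (assumption)
hrf = hrf / sum(hrf);
onsets = [10.5 28.0 47.5];                      % example onsets (s)
durs   = [3.4  2.9  3.8];                       % example per-trial RTs (s)
u = zeros(round(nScans * TR / dt), 1);
for k = 1:numel(onsets)
    i1 = round(onsets(k) / dt) + 1;
    i2 = round((onsets(k) + durs(k)) / dt);
    u(i1:i2) = 1;                               % boxcar spanning onset to response
end
x = conv(u, hrf);                               % convolve boxcar with the HRF
regressor = x(1:round(TR / dt):round(nScans * TR / dt));   % one value per scan
```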
Following estimation, we first sought to define an MNS mask based on the MNS localizer. To investigate group-level effects for the MNS localizer, we entered contrast images of the effects of each regressor of interest for each participant into a random-effects analysis using a flexible factorial repeated-measures ANOVA (within-subject factor: task; blocking factor: subject). Within this model, we tested the conjunction null (Nichols et al., 2005) of Observe and Execute. The resulting SPM was conservatively thresholded using a voxel-level familywise error (FWE) rate of 0.05 combined with a cluster extent of 20 voxels. This revealed bilateral activity in regions associated with the MNS, namely, areas in the premotor cortex including posterior inferior frontal gyrus (pIFG), ventral premotor cortex (vPMC) and dorsal premotor cortex (dPMC), supplementary motor area (SMA), and clusters spanning the rostral inferior parietal lobule (rIPL) and anterior intraparietal sulcus (aIPS) (Fig. 1D and Table 2). This map will henceforth be referred to as the MNS mask.
To investigate group-level effects for the action understanding task, we entered participants' contrast images for the effects of each regressor of interest into a random-effects analysis using a flexible factorial repeated-measures ANOVA (within-subject factors: goal, modality; blocking factor: subject). Within this model, we tested the following four effects against the conjunction null (Nichols et al., 2005): (1) goal-independent effect of Video versus Text (WV > WT & HV > HT); (2) goal-independent effect of Text versus Video (WT > WV & HT > HV); (3) modality-independent effect of How versus Why (HV > WV & HT > WT); and (4) modality-independent effect of Why versus How (WV > HV & WT > HT). The resulting SPMs were interrogated using a two-pass procedure. In the first pass, we examined the whole brain using a cluster-level FWE rate of 0.05, with clusters of activation defined using a voxel-level p value of 0.0001 (uncorrected). In the second pass, we restricted our examination to the MNS mask, using a cluster-level FWE rate of 0.05 with a cluster-defining threshold of p < 0.001 (uncorrected).
The conjunction analyses reported above test for regions in which the presence of the effect of one factor (e.g., modality) does not depend on the level of the other factor (e.g., goal). However, they do not test whether the magnitude of the effect of one factor depends on the level of the other factor; hence, it remains possible that regions identified in the conjunction analyses will show a modality-by-goal interaction. Therefore, we interrogated an F contrast coding the modality-by-goal interaction within a mask of all regions that demonstrated conjunction effects by using a cluster-level FWE rate of 0.05 with a cluster-defining threshold of p < 0.001 (uncorrected).
We used psychophysiological interactions (PPIs) (Friston et al., 1997) to test the hypothesis that the functional relationship between supramodal areas for understanding motive and modality-specific areas for action perception would depend on stimulus modality. PPI enables determination of brain regions whose activity shows a change in correlation with a seed region (the “physiological” component of the PPI) as a function of a change in the participants' psychological state (the “psychological” component of the PPI). The analysis was performed using the SPM generalized PPI toolbox (http://www.martinos.org/~mclaren/ftp/Utilities_DGM). As seed, we used the area of dorsomedial prefrontal cortex (dmPFC) that showed the strongest supramodal association with understanding motive from actions (see Fig. 3). The dmPFC seeds were defined using a two-step procedure. First, we defined a binary mask of dmPFC (see Fig. 4A) by thresholding the group-level, supramodal effect of understanding motive [WV > HV & WT > HT] at p < 0.00001 (a more conservative threshold was used to restrict the mask to the dorsal mPFC; the resulting area spanned 276 voxels). Then, for each participant we used an automated algorithm to define regions within the dmPFC mask, with an extent of at least 20 voxels, that demonstrated the supramodal effect of understanding motive at p < 0.05 (uncorrected). This yielded 16 participants with valid seed regions.
We set up one PPI model for each participant, which included four PPI regressors, one for the effect of each condition. These regressors were created in the following way: (1) for each participant, we first defined the time series of their seed region as the first eigenvariate (adjusted for effects of interest); (2) the time series was deconvolved to estimate the underlying neural activity using the deconvolution algorithm in SPM8 (Gitelman et al., 2003); (3) the deconvolved time series was multiplied by the predicted time series of each condition (prior to HRF convolution), resulting in one “neural” PPI for each condition; and (4) each neural PPI was then convolved with the canonical HRF, yielding the four PPI regressors. As covariates of no interest, these models also included the time series of the seed region, the time series of each condition convolved with the canonical HRF, and the six motion parameters.
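The core of steps (3) and (4) can be illustrated in a few lines. This is a conceptual sketch with hypothetical variable names; the deconvolution of step (2) (Gitelman et al., 2003) is not reproduced, and hrf, TR, dt, and u are taken from the regressor sketch above.

```matlab
% Hypothetical sketch: seedNeural stands in for the deconvolved seed time
% series at high temporal resolution, and condBox is the 0/1 condition boxcar
% at the same resolution (both column vectors of equal length).
seedNeural = randn(round(nScans * TR / dt), 1);           % placeholder for the deconvolved seed series
condBox    = u;                                           % reuse the example boxcar from the sketch above
ppiNeural = seedNeural .* condBox;                        % "neural" psychophysiological interaction
ppiBold   = conv(ppiNeural, hrf);                         % re-convolve with the canonical HRF
ppiReg    = ppiBold(1:round(TR / dt):numel(seedNeural));  % resample to one value per scan
```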
To investigate group-level PPI effects, we entered contrast images of the PPI effects for each participant into a random-effects analysis using a flexible factorial repeated-measures ANOVA (within-subject factors: goal, modality; blocking factor: subject). In interrogating this model, we focused on comparisons involving the two conditions demanding the inference of motive: WV > Fixation Baseline, WT > Fixation Baseline, and WV > WT. We restricted our examination to a mask consisting of the three MNS regions observed to be associated with action perception (see Fig. 2) by using a cluster-level FWE rate of 0.05 with a cluster-defining threshold of p < 0.001.
For all analyses, regions of activation were labeled based on a combination of visual comparison to functional regions identified in existing meta-analyses (Van Overwalle and Baetens, 2009; Caspers et al., 2010) and by reference to probabilistic cytoarchitectonic maps of the human brain using the SPM Anatomy toolbox (Eickhoff et al., 2005). For visual presentation, thresholded t statistic maps were either: (1) surface rendered using the Surfrend toolbox version 1.0.2 (http://spmsurfrend.sourceforge.net); or (2) overlaid on the average of the participants' T1-weighted anatomical images. Percent signal change for regions of interest (ROIs) was calculated using the MarsBaR software (http://marsbar.sourceforge.net). ROIs from clusters that included multiple subregions were defined by growing 4 mm spheres around local peaks.
The rfxplot toolbox (Gläscher, 2009) was used to compute peristimulus time histograms (PSTHs) of the event-related response to WV and WT trials in the subsample included in the PPI analysis (see Fig. 4C). This was performed for the single-subject ROIs of dmPFC used as seeds, and for the group-level area of right pIFG/vPMC that was found to be functionally coupled to dmPFC. The extracted PSTHs spanned the peristimulus period −2 to 12 s, and data were split into 2 s time bins corresponding to the TR. To investigate whether the peak response occurred significantly later in dmPFC during WV compared to WT trials, as well as in dmPFC compared to right pIFG/vPMC during WV, we defined the time to peak in each region on a subject-by-subject basis as the bin containing the maximum value in the peristimulus period 2 to 10 s. Paired-samples t tests were then used to compare the identified times to peak.
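A sketch of this time-to-peak comparison is given below; the matrix names, layout, and placeholder data are assumptions for illustration and do not reproduce the rfxplot output format.

```matlab
% Hypothetical sketch: psthWV and psthWT are subjects-by-bins matrices of the
% event-related response in one region, with bins at 2 s spacing spanning
% -2 to 12 s peristimulus time; peaks are sought within the 2-10 s window.
binTimes = -2:2:12;                            % bin times in seconds
psthWV = randn(16, numel(binTimes));           % placeholder data for the 16 subjects (assumption)
psthWT = randn(16, numel(binTimes));
win = binTimes >= 2 & binTimes <= 10;          % restrict the peak search to 2-10 s
tWin = binTimes(win);
[~, iWV] = max(psthWV(:, win), [], 2);         % per-subject peak bin, WhyVideo
[~, iWT] = max(psthWT(:, win), [], 2);         % per-subject peak bin, WhyText
peakWV = tWin(iWV);  peakWT = tWin(iWT);       % convert bin indices to times (s)
[~, p, ~, stats] = ttest(peakWV(:), peakWT(:));% paired t test on time to peak
fprintf('t(%d) = %.2f, p = %.3f\n', stats.df, stats.tstat, p);
```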
Results
Behavioral results
A repeated-measures ANOVA examining the effects of goal and modality on response time revealed a nonsignificant main effect of goal, F(1,20) = 0.650, p = 0.430, and marginally significant effects of both modality, F(1,20) = 4.278, p = 0.052, and the interaction of goal and modality, F(1,20) = 4.329, p = 0.051. Examination of the simple effects revealed that these marginal effects were driven by increased RT to HowText trials (M = 3.57 s, SD = 0.28) compared to the three other trial types (HowVideo: M = 3.34 s, SD = 0.28; WhyText: M = 3.44 s, SD = 0.24; WhyVideo: M = 3.36 s, SD = 0.26). As described above, we accounted for RT differences using a variable-epoch model of the neural response to each trial (Grinband et al., 2008).
Table 1 displays the five most commonly used verbs for each of the two goals along with examples of their use by participants. When comparing frequency of use of each verb across the modalities, no significant differences were observed, all t values <1.41, p values >0.15. These data confirm the face validity of the goal manipulation and suggest that the content of responses across the two modalities likely showed minimal differences.
Modality-specific effects
We first sought to determine modality-specific effects on the BOLD response during action understanding, that is, effects of presentation modality that were independent of the observer's comprehension goal. As displayed in Figure 2 and listed in Table 3, observing videos compared to text descriptions of actions was associated with a widespread network, including occipitotemporal and occipitoparietal areas known to be associated with the perception of human form and motion (Allison et al., 2000; Grèzes and Decety, 2001; Peelen and Downing, 2007); the amygdala bilaterally, which is critically involved in social perception (Adolphs, 2009); and three areas believed to be part of the human MNS (Van Overwalle and Baetens, 2009; Keysers et al., 2011): bilateral anterior intraparietal sulcus (aIPS) and an area of right frontal cortex spanning the posterior inferior frontal gyrus (pIFG) and ventral premotor cortex (vPMC). When performing the analysis within the MNS mask, we also observed the same areas in bilateral aIPS and right pIFG/vPMC. These results demonstrate that bilateral parietal and right frontal MNS areas are associated with action perception regardless of the observer's explicit comprehension goal.
When examining the modality-specific effect of encoding text descriptions compared to videos of actions, no regions were observed, even at a more liberal voxelwise threshold of p < 0.001.
Supramodal effects
Next, we sought to determine brain areas that show a supramodal association with understanding the implementation compared to the motive of actions. As displayed in Figure 3A and listed in Table 4, we observed robust and exclusively left hemispheric activation in the superior parietal lobule, posterior middle temporal gyrus, and multiple areas believed to be part of the human MNS, namely pIFG/vPMC, aIPS, rostral inferior parietal lobule (rIPL), and dorsal premotor cortex (dPMC). The MNS regions were also observed when restricting the analysis to the MNS mask. This demonstrates that left-lateralized areas of the MNS form a supramodal system for explicitly representing the implementation of actions (Frey, 2008).
We then determined brain areas showing a supramodal association with understanding the motive compared to the implementation of actions. As displayed in Figure 3B and listed in Table 4, this revealed robust and primarily left-lateralized activation of core areas of the MZS as indicated by reviews and meta-analyses (Frith and Frith, 1999; Van Overwalle and Baetens, 2009; Mar, 2011), namely dorsomedial prefrontal cortex (dmPFC), ventromedial prefrontal cortex (vmPFC), posterior cingulate cortex (PCC) extending into precuneus (PC), left temporoparietal junction (TPJ), and bilateral anterior temporal cortex (aTC). This demonstrates that largely left-lateralized areas of the MZS form a supramodal system for understanding motive from actions.
Modality-by-goal interaction effects
Next, we tested the modality-by-goal interaction within the group of regions that demonstrated either modality-specific or supramodal effects. This analysis revealed no significant interaction effects in these regions. Even in light of this null result, it is important to emphasize that the conjunction analyses reported above demonstrate only that the presence—and not the magnitude—of the effect of one factor (e.g., modality) is not dependent on the level of the other factor (e.g., goal). For instance, the effect in right pIFG/vPMC is goal independent only insofar as the presence of a significant effect in the contrast Video > Text did not depend on whether the observer possessed the goal to understand implementation or motive.
Dissociating left and right pIFG/vPMC
The results thus far suggest a clear dissociation of the function of similar areas of the pIFG/vPMC in the left and right hemispheres. As the bar graph in Figure 2 demonstrates, right pIFG/vPMC was selectively associated with the perception of actions in videos, regardless of the observer's explicit comprehension goal. Conversely, the bar graphs in Figure 3 demonstrate that left pIFG/vPMC was selectively associated with the explicit representation of action implementation, regardless of the presentation modality. To garner further evidence for this dissociation, we directly contrasted WhyVideo and HowText trials within the bilateral frontal areas of the MNS mask. Consistent with the proposed dissociation, right pIFG/vPMC showed a robust response in the contrast WhyVideo > HowText (peak: x = 45, y = 17, z = 22; voxel extent = 40; t = 4.69; cluster-level pFWE = 0.005), while left pIFG/vPMC showed a robust response in the opposite contrast (peak: x = −48, y = 5, z = 25; voxel extent = 20; t = 3.78; cluster-level pFWE = 0.013).
Modality-specific connectivity between the MNS and MZS during the inference of motive
Finally, we investigated the proposition that, in some contexts but not others, the MNS and MZS may collaborate to enable action understanding. Based on attribution theories from social psychology (Gilbert, 1998), we have elsewhere proposed an Identification–Attribution (I–A) model of MNS and MZS contributions to causal attribution during social perception wherein the MNS contributes to the (perceptual) identification of motor behaviors that are subsequently attributed to social causes, such as motives and beliefs, in the MZS (Spunt and Lieberman, 2012). This model predicts that the MNS and MZS collaborate during the inference of motive only when actions must be decoded from sensory information about body motion. Thus, we conducted a PPI analysis (Friston et al., 1997) to test the hypothesis that functional connectivity between the MZS and MNS would be greater when motive is inferred from videos compared to text descriptions of actions. As a seed, we used the region observed to be most reliably associated with understanding motive, the dmPFC. We tested the hypothesis that this area would be functionally coupled with the bilateral parietal and right frontal MNS areas found to be selective for action perception (Fig. 2). As displayed in Figure 4A, dmPFC demonstrated significantly increased functional coupling with a cluster in right pIFG/vPMC (peak: x = 48, y = 14, z = 31; voxel extent = 12; t = 3.77; cluster-level pFWE = 0.016) when motive was inferred from videos compared to text descriptions of actions. This demonstrates that the context of action understanding, in addition to modulating activation of the MZS and MNS, also modulates connectivity between the two systems.
Time course analyses
The I–A model described above not only predicts modality-specific connectivity between the MNS and MZS, but also predicts an ordered sequence of mental operations, with MZS-mediated attribution occurring only after motor behaviors are encoded in the MNS. In the present study, this generates two predictions. First, the time to peak response in the MZS should occur later when motive is inferred from videos compared to text descriptions of actions. In addition, when motive is inferred from videos, the time to peak response in the MZS should be later than the same response in the MNS. We indeed observed that in the dmPFC, time to peak was significantly later during WhyVideo (M = 7.63 s, SD = 1.82) than during WhyText (M = 6.00 s, SD = 1.79), t(15) = 2.93, p = 0.010, and compared to the time to peak in right pIFG/vPMC during WhyVideo (M = 4.38 s, SD = 1.50), t(15) = 6.34, p < 0.001 (Fig. 4C). These time course analyses are consistent with the ordered sequence of operations predicted by the I–A model.
Discussion
The present results help resolve the ongoing debate over the neural bases of action understanding. When varying task demands are systematically considered, as in the present study, the roles of the MNS and MZS in action understanding emerge with greater clarity. We observed that bilateral parietal and right frontal MNS showed a modality-specific association with perceiving actions (“what is being done”), left hemisphere MNS showed a supramodal association with understanding action implementation (“how it is being done”), and the MZS showed a supramodal association with understanding motive from actions (“why it is being done”). These results support the tripartite model proposed by Thioux et al. (2008) wherein the MNS supports understanding actions at low (how) and intermediate (what) levels of abstraction, whereas the MZS supports understanding actions at high (why) levels of abstraction (Vallacher and Wegner, 1987). Their claims were based on studies that manipulated either the observer's goal or the content of a perceived action (Grèzes et al., 2004; Brass et al., 2007; de Lange et al., 2008). However, these studies could only provide indirect support, as none independently manipulated the level of understanding (goal) and the demand for action perception (modality) while holding action content constant across conditions. By doing so in the present study, we have provided unambiguous support for a tripartite model of the brain systems supporting action understanding.
Right pIFG/vPMC, in the MNS, was selectively associated with the perception of action regardless of the observer's comprehension goal. Additionally, this region was functionally coupled with dmPFC, a core area of the MZS, when participants were prompted to understand the motives driving actions presented in videos but not in text. We recently found that these two areas are functionally coupled when individuals are prompted to infer the cause of observed emotional facial expressions (Spunt and Lieberman, 2012). This functional coupling is consistent with social psychological theories holding that before making attributions about the causes of behavior, sensory input about body movements must be translated into attribution-relevant events such as goals, intentions, and emotions (Gilbert, 1998). The right pIFG/vPMC is a strong candidate for being critically involved in this translation function, as previous neuroimaging studies of action perception demonstrate that this region is associated with spontaneous event segmentation (Zacks et al., 2001) and with the encoding of features of the motor context that indicate what the actor is intending to do (Iacoboni et al., 2005; Hamilton and Grafton, 2008). Collectively, this suggests an Identification–Attribution model of MNS and MZS function during social perception, wherein the MNS translates sensory input about motor behavior into a format that is relevant to attributional processes carried out in the MZS. In line with this ordered sequence of operations, we found that for trials in which participants inferred motive, dmPFC exhibited a significantly delayed response to videos of actions, both compared to its own response to text descriptions of actions and to the response of right pIFG/vPMC to videos of actions. Although the results of these time course analyses are consistent with the Identification–Attribution model, it should be noted that RT to video trials was not longer than RT to text trials, as might be predicted by a two-stage model of inferring motive during action perception. However, the lack of an RT difference does not necessarily rule out this two-stage model. It is possible that the kind of information gathered from the visual analysis of a social scene, including (but not limited to) the decoding of motor intention putatively based in the MNS, facilitates more efficient inferential processing in the MZS. The videos were all filmed in contextually rich natural scenes, and it is plausible that this additional contextual information—not captured by the text descriptions—provides constraints that make the inference of motive easier. This would explain a lack of difference in total processing speed across the two conditions despite the additional processing stage for videos. Future research is needed to clarify the temporal dynamics of MNS and MZS function during action perception.
In contrast to the right pIFG/vPMC, left pIFG/vPMC was selectively sensitive to the explicit goal to understand the implementation of actions. In fact, we found that left pIFG/vPMC was more active when understanding implementation from text descriptions of actions than when understanding motive from videos of actions. This left-lateralized effect cannot be explained by the fact that this comparison is of verbal to nonverbal stimuli because, as is clear from the bar graphs in Figure 3C, left pIFG/vPMC shows a greater response when understanding implementation from videos of actions than when understanding motive from text descriptions of actions. This demonstrates that left pIFG/vPMC is not selective for visual input regarding actions, and that the reliable association of left pIFG/vPMC with action observation in prior work may be due to either spontaneous (e.g., during passive observation) or goal-related (e.g., during active imitation) cognition about action implementation. Indeed, a specific role for this region in the explicit identification of motor events is consistent with neuroimaging studies showing that linguistic material describing actions preferentially activates left pIFG compared to linguistic material not describing actions (Frey, 2008).
The goal to understand motive from action robustly activated the MZS in a modality-independent fashion. Although previous studies have observed an MZS association with explaining actions in terms of mental states (Grèzes et al., 2004; Brass et al., 2007; de Lange et al., 2008; Spunt et al., 2010, 2011), the present study provides the first empirical demonstration that the MZS is a supramodal system for inferring motive from actions, whether seen or read. This demonstration is important given that the majority of studies mapping the MZS have relied on verbal materials (Van Overwalle and Baetens, 2009). With this said, the observation of supramodal function is implied by studies showing that the MZS comes online during mental state inference across a wide variety of experimental paradigms (Gallagher et al., 2000; Gobbini et al., 2007; Carrington and Bailey, 2009). In addition, our findings add to recent evidence that two areas of the MZS, the mPFC and left pSTS, encode the identity of emotions in a supramodal fashion (Peelen et al., 2010). Hence, the MZS may support an abstract, modality-independent representation of multiple categories of mental state (e.g., emotion vs motive) on the basis of multiple categories of behavior (e.g., emotional expression vs goal-directed action).
Importantly, although we demonstrated that the MZS plays a supramodal role in mental state inference, its functional connectivity with the MNS was modality specific. This suggests that the MNS may indeed play a role in mental state inference, but only under two conditions: (1) the observer possesses the goal to infer mental states on the basis of the target's actions; and (2) the actions must be decoded from sensory information, rather than conveyed verbally. Generally, this underscores the conditional, or context-dependent, function of the neural systems that support social cognition, where the MNS, the MZS, and MNS/MZS connectivity may each make independent contributions to the human ability to represent the mental states associated with another's behavior.
We observed an interesting dissociation between posterior and anterior regions of the IFG in their contribution to action representation. Namely, we observed that posterior regions of IFG were associated with action perception (Fig. 2) and understanding action implementation (Fig. 3A), whereas an anterior region of IFG in the ventrolateral PFC was associated with understanding action motive (Fig. 3B). This anterior–posterior distinction in IFG mirrors the rostrocaudal axis believed to exist in the function of IFG (Badre and D'Esposito, 2009; Kilner, 2011), wherein anterior regions encode abstract representations of actions (“why”) while more posterior regions encode more concrete representations (“what” and “how”).
Finally, these findings converge with two recent studies that used transcranial magnetic stimulation to show that, for both photographs of objects (Pobric et al., 2010) and object-words (Ishibashi et al., 2011), a region of the left IPL participates in the representation of the manipulability of the object—that is, how it is used—while a region of the left aTC contains more abstract, modality-nonspecific information about object function—that is, why it is used (see also Canessa et al., 2008). These regions correspond well with the left IPL and left aTC regions observed to be associated with representing action implementation (“how”) and motive (“why”), respectively, in the present study.
In conclusion, it is possible that, in our evolutionary past, action understanding encompassed a single mental process, perhaps corresponding to the decoding of motor intention from observed behavior (Frith and Frith, 1999). However, the present study demonstrates that attempts to isolate a single brain system that supports action understanding in humans are likely doomed to failure. This is because action understanding can refer to numerous mental processes that are deployed to meet the demands of a complex social world, where social interactions are not always mediated by the perception of moving bodies and where the aims of the observer can and do vary. Moving forward, research and theory on the neural bases of action understanding will benefit by embracing this variability.
Footnotes
We thank Edward Kobayashi and John Mezzanotte for their assistance, Marco Iacoboni for comments on a previous version of the manuscript, and Donald McLaren for advice on statistical analysis. For generous support we also thank the Brain Mapping Medical Research Organization, Brain Mapping Support Foundation, Pierson-Lovelace Foundation, The Ahmanson Foundation, William M. and Linda R. Dietel Philanthropic Fund at the Northern Piedmont Community Foundation, Tamkin Foundation, Jennifer Jones-Simon Foundation, Capital Group Companies Charitable Foundation, Robson Family and Northstar Fund.
Correspondence should be addressed to Robert P. Spunt, Department of Psychology, Franz Hall, University of California, Los Angeles, Los Angeles, CA 90095-1563. E-mail: bobspunt@gmail.com