Abstract
Understanding other people's actions is a fundamental prerequisite for social interactions. Whether action understanding relies on simulating the actions of others in the observers' motor system or on access to conceptual knowledge stored in nonmotor areas is strongly debated. It has been argued previously that areas that play a crucial role in action understanding should (1) distinguish between different actions, (2) generalize across the ways in which actions are performed (Dinstein et al., 2008; Oosterhof et al., 2013; Caramazza et al., 2014), and (3) have access to action information around the time of action recognition (Hauk et al., 2008). Whereas previous studies focused on the first two criteria, little is known about the dynamics underlying action understanding. We examined which human brain regions are able to distinguish between pointing and grasping, regardless of reach direction (left or right) and effector (left or right hand), using multivariate pattern analysis of magnetoencephalography data. We show that the lateral occipitotemporal cortex (LOTC) has the earliest access to abstract action representations, coinciding with the earliest time point at which there was enough information to discriminate between the two actions. By contrast, precentral regions, though recruited early, gain access to such abstract representations substantially later. Our results demonstrate that, in contrast to the LOTC, precentral regions, despite their early recruitment, do not carry the detailed information that is required to recognize an action. We discuss why these findings are incompatible with the theoretical claims of motor theories of action understanding.
SIGNIFICANCE STATEMENT It is debated whether our ability to understand other people's actions relies on the simulation of actions in the observers' motor system, or is based on access to conceptual knowledge stored in nonmotor areas. Here, using magnetoencephalography in combination with machine learning, we examined where in the brain and at which point in time it is possible to distinguish between pointing and grasping actions regardless of the way in which they are performed (effector, reach direction). We show that, in contrast to the predictions of motor theories of action understanding, the lateral occipitotemporal cortex has access to abstract action representations substantially earlier than precentral regions.
Introduction
How do we assign meaning to actions performed by other people? One of the most dominant views in the literature is the idea that action concepts are grounded in the motor system (Rizzolatti et al., 2001; Kiefer and Pulvermüller, 2012). By contrast, according to classical cognitive theories (Mahon and Caramazza, 2008; Caramazza et al., 2014), the ability to understand the meaning of other people's actions draws on conceptual representations stored outside the motor system, such as posterior temporal regions (Hickok, 2009).
A region involved in action understanding should be able (1) to discriminate between different actions (action specificity) and (2) to generalize across different possible instances of a particular action (Dinstein et al., 2008; Oosterhof et al., 2013; Caramazza et al., 2014). For example, grasping has the same meaning for an observer regardless of whether the movement is performed with the left or right hand, or toward the left or right side of visual space. In other words, a region important for action understanding should represent the action while generalizing across concrete instantiations such as the underlying effector or reach direction. Previous fMRI and transcranial magnetic stimulation studies in humans reported abstract action representations in parietal, frontal, and occipital regions (Hamilton and Grafton, 2006, 2008; Cattaneo et al., 2010; Oosterhof et al., 2010, 2013), making it difficult to draw firm conclusions regarding the ongoing debate between motor and cognitive theories. One important factor not well understood so far is the underlying temporal profile of action representations. Such information is crucial since the two theories lead to opposite predictions: according to motor theories, motor areas should have the earliest access to abstract action representations (Pulvermüller, 2005); by contrast, according to cognitive theories, areas outside the motor system should have the earliest access to such abstract action representations.
Here we use multivariate pattern analysis (MVPA) of magnetoencephalography (MEG) data to examine where in the brain and at which point in time it is possible to distinguish between observed pointing and grasping regardless of reach direction (left or right) or effector (left or right hand). In contrast to motor theories of action understanding, we show that abstract action representations are encoded in the lateral occipitotemporal cortex (LOTC) earlier than in precentral regions.
Materials and Methods
We performed two separate experiments with two different groups of participants: one behavioral experiment to identify the time point at which the videos contained enough information to allow participants to discriminate between pointing and grasping, and an MEG experiment. The same stimuli were used for the two experiments.
Participants
Fourteen students (seven females; mean age: 23.13 years; SD: 2.253 years; all right handed) from the University of Trento took part in the behavioral experiment and received a reimbursement of €6 at the end of the session. A different group of 17 students (11 females; mean age: 23.3 years; SD: 2.1 years; all right handed) from the University of Trento with normal or corrected-to-normal visual acuity and with no neurological disorders took part in the MEG experiment. All participants received a reimbursement of €25 at the end of the MEG session. All of them gave informed consent in accordance with the Declaration of Helsinki. The experimental procedures were approved by the Ethics Committee for research involving human participants at the University of Trento.
Stimuli
Stimuli consisted of short video clips (833 ms) depicting simple center-out hand movements (Fig. 1A). Each clip started with the hand of the actor touching the central object (a polystyrene semisphere) with the index finger resting in the same position. After a variable amount of time (median: 183 ms; range: 67–367 ms), a center-out movement toward one of the other semispheres started. Movement onset was defined as the time point at which the rest position was released and hand preshaping was initiated, i.e., when the fingers and palm started moving into position for grasping. The video ended as soon as the hand reached one of the peripheral semispheres (for an example trial sequence, see Fig. 1A). The actions were recorded from four different actors (one male) using a digital video camera. Only the hands (and part of the forearm) of the actors were visible in the field of view. We instructed the actors to keep the velocity and kinematics of the movements as similar as possible across the two different movements. We discarded, based on our perceptual judgment, videos in which the velocity or kinematics were too dissimilar from the others and videos in which the preshaping of the hands before movement onset could give information regarding the upcoming action, keeping a total of 80 videos (five exemplars for each combination of actor × movement type × direction). Movements performed with the left hand were obtained by creating specular (mirror-reversed) copies of the right-hand movement videos via software (Matlab, MathWorks), for a total of 160 videos. On each video, we superimposed a small white cross (0.88 × 0.88°) above the central semisphere to enable fixation and thus avoid possible noise in the MEG signal due to eye movements.
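For illustration, the mirroring step described above can be sketched in Matlab as follows; the file names are hypothetical placeholders, and this is a minimal sketch rather than the original stimulus-preparation script.

```matlab
% Minimal sketch of the video mirroring step (file names are hypothetical).
vin  = VideoReader('right_hand_grasp_left_01.avi');   % original right-hand video
vout = VideoWriter('left_hand_grasp_right_01.avi');   % mirror-reversed copy
vout.FrameRate = vin.FrameRate;
open(vout);
while hasFrame(vin)
    frame = readFrame(vin);            % height x width x 3 RGB frame
    writeVideo(vout, flip(frame, 2));  % flip along the horizontal axis (specular copy)
end
close(vout);
```

Note that mirroring reverses both the depicted hand and the reach direction, hence the change in the hypothetical output file name.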
Example of a trial sequence and experimental design. A, During MEG recording, N = 17 participants watched video clips of simple reach-to-point or reach-to-grasp movements (duration: 833 ms). Participants were instructed to fixate on a central fixation cross while attentively observing the entire video without performing any movements. To ensure that participants paid attention to the videos, different types of questions were asked during occasional catch trials, which were later discarded from the analysis (see Materials and Methods). The green fixation cross indicated the period during which participants were told to blink. Eye movements were recorded using an MEG-compatible eye tracker. B, We used a 2 × 2 × 2 design, manipulating the type of movement (pointing/grasping), reach direction (left/right), and effector (left/right hand).
Behavioral experiment
Procedure.
To identify the minimum video duration required to be able to distinguish between observed pointing and grasping, we presented participants with videos depicting pointing or grasping movements directed toward the left or right side, performed with the left or right hand. The duration of the videos was parametrically varied (167, 200, 233, or 333 ms). Participants had to classify the type of observed movement by pressing one of two possible buttons while ignoring the other two dimensions (reach direction, effector). A trial started with a fixation period (white cross) of 2 s. Then the video appeared for a variable duration. As soon as the video ended, the fixation cross appeared again, and participants had to indicate by button press which movement they had observed. Participants were instructed to respond as accurately as possible. Video duration, type of movement, effector, and reach direction were randomized. Each participant completed four experimental runs of ∼5.5 min, for a total of 512 trials (64 repetitions per condition). Stimuli were presented on a CRT monitor (ViewSonic Graphic Series G90fB; screen resolution: 1280 × 1024; refresh rate: 60 Hz) placed ∼64 cm in front of the participant.
Statistical analysis.
The aim of the behavioral experiment was to identify the point in time at which the two actions started to diverge perceptually. To compute the accuracy for discriminating between the two observed actions as a function of video duration, we divided the number of correct classifications by the total number of trials, separately for each video duration and each participant, collapsing across effector (left, right hand) and reach direction (left, right). We then used a χ2 test to assess at which video duration the accuracy was higher than chance level (50%).
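The sketch below illustrates the accuracy computation and a test against chance; the precise form of the χ2 test is not fully specified above, so the goodness-of-fit variant used here (chosen to yield df = N − 1, as reported in Results) and the simulated counts are assumptions for illustration only.

```matlab
% Sketch of the behavioral analysis (simulated counts; the exact form of the
% chi-square test is an assumption, chosen to yield df = N - 1).
nSubj     = 14;
durations = [167 200 233 333];                              % video durations (ms)
total     = repmat(64, nSubj, numel(durations));            % trials per cell
rng(1);
correct   = binornd(total, repmat([0.50 0.55 0.75 0.90], nSubj, 1));  % simulated counts

acc = correct ./ total;                      % accuracy per participant and duration
for d = 1:numel(durations)
    expected = total(:, d) / 2;              % expected number correct under chance (50%)
    chi2 = sum((correct(:, d) - expected).^2 ./ expected + ...
               ((total(:, d) - correct(:, d)) - expected).^2 ./ expected);
    df   = nSubj - 1;
    p    = 1 - chi2cdf(chi2, df);            % Statistics and Machine Learning Toolbox
    fprintf('%d ms: mean accuracy = %.2f, chi2(%d) = %.2f, p = %.4g\n', ...
            durations(d), mean(acc(:, d)), df, chi2, p);
end
```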
MEG experiment
Procedure.
We presented participants (N = 17) with short videos (833 ms) of reach-to-point and reach-to-grasp movements performed with either the left or right hand toward peripheral targets on the left or right side (Fig. 1A) while measuring their brain oscillatory activity. We used a 2 × 2 × 2 factorial design (Fig. 1B), varying the type of movement (pointing/grasping), the effector (left/right hand), and reach direction (left/right). Each trial consisted of the following events (Fig. 1A): a green fixation cross (blink phase: 800 ms), a white fixation cross (fixation phase: randomly jittered within 2000–2500 ms), the video (video phase: 833 ms), and a white fixation cross (resting phase: 1000 ms). Trial duration varied from 4633 to 5133 ms, depending on the duration of the fixation phase. The blink phase at the beginning of each trial provided time for participants to blink during a controlled time window and thus reduced the probability of blinking during the fixation phase or during video presentation. Participants were instructed to blink every time they saw the green cross. During the fixation phase, participants had to maintain fixation on the white cross. We jittered the fixation phase to reduce the chances that participants would trigger an anticipatory neural response by predicting the appearance of the video. When the video appeared, participants were asked to keep fixating on the cross and to pay attention to the ongoing movement as a whole. In contrast to the task used in the behavioral experiment, we specifically asked them to attend to all three dimensions we manipulated, i.e., movement type, effector, and reach direction. During the resting phase, participants had to keep fixating and wait for the green cross that indicated the beginning of a new trial.
To ensure that participants were paying attention to the video, we introduced catch trials (10% of all trials), during which we presented a question regarding one of the three dimensions (e.g., “Was the direction to the left?”). Catch trials were presented occasionally with the following constraints: (1) if trial N was a catch trial, trial N + 1 could not be a catch trial; (2) the first trial of a run could not be a catch trial. A catch trial was identical to an experimental trial except for the question that appeared at the end of the catch trial (1 s after video offset). Since participants did not know when a catch trial would appear and what the question would be, they had to pay attention to each video and to each of the three dimensions to perform the task correctly. The answer was always binary (yes or no), and participants used MEG-compatible buttons to answer the questions. The assignment of responses to the two buttons changed randomly for each question to avoid potential confounds related to motor preparation. Eye movements were monitored using the OEM system (OEM eye tracker, SensoMotoric Instruments; 60 Hz sampling rate). After each response, feedback was provided (a cartoon smiling or sad face).
Each participant performed 10 runs, each consisting of 64 experimental trials plus 6 catch trials, for a total of 640 experimental trials and 60 catch trials. The number of repetitions for each factorial combination (movement type × effector × reach direction) per participant was 80. Before entering the shielded room, participants familiarized themselves with the stimuli and the task. Each run lasted from 4.9 to 5.5 min, depending on the duration of the fixation phase, for a total session duration of ∼52 min. At the end of each run, participants were allowed to rest for a few minutes before a new acquisition started.
Stimuli were projected on a screen (screen resolution: 1280 × 1024 pixels; refresh rate: 60 Hz) that was placed ∼130 cm in front of the participant. The screen was visible as a rectangular aperture of ∼21.7 × 13.16°. We controlled visual stimulation during the behavioral and the MEG sessions using ASF (Schwarzbach, 2011), a toolbox for Matlab (Mathworks) based on the Psychtoolbox (Brainard, 1997).
MEG data acquisition and analysis.
At the beginning of the MEG session, the head shape of each participant was digitally acquired using the Polhemus system (Polhemus). Moreover, we placed three coils at the participant's forehead and two behind the ears to acquire the head position of each participant within the MEG helmet at the beginning of each run. Before entering the shielded room containing the MEG system, participants were asked to remove all magnetic materials that could distort the measurement.
We measured neuromagnetic brain activity using a 306-channel whole-head MEG system (Neuromag, Elekta) at a sampling rate of 1000 Hz. The system consists of 204 planar gradiometers and 102 magnetometers arranged in a helmet configuration. Here, we report results from the gradiometers only. Triggers were sent at video onset to synchronize stimulus presentation with neural activity. To check the timing of the stimuli, and to take into account possible delays of the stimulus presentation with respect to the triggers, we used a photodiode on the stimulation screen inside the shielded room.
MEG data preprocessing.
We analyzed data using the open source Matlab-based FieldTrip toolbox (Oostenveld et al., 2011). Continuous data were cut into epochs from −1 to 1.3 s relative to video onset. Epochs were high-pass filtered at 1 Hz to remove very slow frequencies and direct current offset. Frequencies due to the electrical power line were also filtered out using a band-stop filter (Butterworth IIR filter) at 50 Hz and its harmonics (100 and 150 Hz). Trials with blinks or eye movements during the presentation of the video or during the baseline period were discarded on the basis of the information from the eye tracker. In addition, we visually inspected trials for artifacts, blind to the condition, and rejected trials that were clearly affected by external noise or current spikes. On average, we rejected 13% of the trials per participant. If a sensor was very noisy for the entire experimental session, it was rejected. To have the same number of sensors for each participant, missing sensors were reconstructed by interpolation of neighboring sensors.
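The following FieldTrip sketch illustrates the preprocessing steps described above (epoching, 1 Hz high-pass filtering, band-stop filtering at the line frequency and its harmonics, visual artifact rejection, and repair of noisy sensors); file and channel names are hypothetical, and eye-tracker-based trial rejection is only indicated schematically.

```matlab
% Sketch of the preprocessing pipeline (FieldTrip); names are placeholders.
cfg                     = [];
cfg.dataset             = 'subject01.fif';        % hypothetical raw data file
cfg.trialdef.eventtype  = 'STI101';               % assumed trigger channel
cfg.trialdef.prestim    = 1.0;                    % 1 s before video onset
cfg.trialdef.poststim   = 1.3;                    % 1.3 s after video onset
cfg                     = ft_definetrial(cfg);

cfg.channel             = 'MEGGRAD';              % planar gradiometers only
cfg.hpfilter            = 'yes'; cfg.hpfreq = 1;  % remove slow drifts and DC offset
cfg.bsfilter            = 'yes';                  % Butterworth band-stop (line noise)
cfg.bsfreq              = [49 51; 99 101; 149 151];
data                    = ft_preprocessing(cfg);

% visual artifact rejection (blind to condition), e.g., in summary mode
cfg        = [];
cfg.method = 'summary';
data_clean = ft_rejectvisual(cfg, data);

% reconstruct a rejected sensor by interpolation of its neighbors
cfg            = [];
cfg.method     = 'weighted';
cfg.badchannel = {'MEG2033'};                     % hypothetical noisy sensor
cfg.neighbours = ft_prepare_neighbours(struct('method', 'triangulation'), data_clean);
data_repaired  = ft_channelrepair(cfg, data_clean);
```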
Time–frequency analysis
To obtain a time–frequency representation of the oscillatory activity associated with movement observation, we applied Fourier transformation to sliding time windows with a fixed length of 500 ms. The sliding window moved in steps of 50 ms; power was calculated for frequencies in a range from 2 to 40 Hz in steps of 2 Hz. To avoid spectral leakage and to control for frequency smoothing, a Hanning taper was applied before Fourier transformation. Subsequently, for the univariate analysis only, power was averaged across effector and reach direction, and the spectral power was normalized relative to baseline (−0.5 to −0.3 s with respect to the onset of the video, i.e., during a subperiod of the fixation phase).
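A minimal FieldTrip sketch of this time–frequency decomposition is given below; it assumes the preprocessed single-trial data from the previous sketch and uses the 'relchange' baseline type to implement the (activation − baseline)/baseline normalization used for the univariate analysis.

```matlab
% Sketch of the time-frequency decomposition (FieldTrip).
cfg            = [];
cfg.method     = 'mtmconvol';
cfg.taper      = 'hanning';                   % Hanning taper, fixed window length
cfg.output     = 'pow';
cfg.foi        = 2:2:40;                      % 2-40 Hz in steps of 2 Hz
cfg.t_ftimwin  = 0.5 * ones(size(cfg.foi));   % 500 ms sliding window
cfg.toi        = -1:0.05:1.3;                 % window moved in steps of 50 ms
cfg.keeptrials = 'yes';                       % single trials needed for MVPA
tfr            = ft_freqanalysis(cfg, data_repaired);

% univariate analysis only: normalize power relative to baseline
cfg              = [];
cfg.baseline     = [-0.5 -0.3];
cfg.baselinetype = 'relchange';               % (activation - baseline)/baseline
tfr_baselined    = ft_freqbaseline(cfg, tfr);
```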
Source analysis
Neural sources were identified using dynamic imaging of coherent sources (DICS), a frequency-domain beamforming technique (Gross et al., 2001). We chose the frequencies and times of interest based on the sensor-level analysis. Specifically, we considered the sensor at which the classifier achieved the greatest accuracy (multivariate analysis) in distinguishing between pointing and grasping, generalizing across effector and reach direction, within the frequency bands that survived the multiple-comparison tests. Note that, given the way the sensors were selected, source analysis merely served as a visualization of the sources.
For each participant, we used a volume conductor model created with the single-shell method (Nolte, 2003). Source grids were built by warping a dipole grid based on an MNI template brain to fit the individual head shape of each participant. We computed DICS solutions for each separate condition using a common spatial filter computed from the combination of the two conditions. In this way, any difference between the two conditions cannot be ascribed to differences between the filters.
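The DICS step with a common spatial filter can be sketched as follows in FieldTrip; headmodel and sourcemodel construction are omitted, variable names are placeholders, and the 8 Hz band with 4 Hz spectral smoothing and the 212–587 ms window are taken from the source analyses reported in Results.

```matlab
% Sketch of DICS beamforming with a common spatial filter (FieldTrip).
cfgsel        = [];
cfgsel.toilim = [0.212 0.587];                   % time window of interest
data_condA    = ft_redefinetrial(cfgsel, data_condA);
data_condB    = ft_redefinetrial(cfgsel, data_condB);

cfg           = [];
cfg.method    = 'mtmfft';
cfg.taper     = 'dpss';
cfg.output    = 'powandcsd';                     % cross-spectral density needed for DICS
cfg.foilim    = [8 8];                           % frequency of interest
cfg.tapsmofrq = 4;                               % 4 Hz spectral smoothing
freq_all      = ft_freqanalysis(cfg, ft_appenddata([], data_condA, data_condB));
freq_condA    = ft_freqanalysis(cfg, data_condA);
freq_condB    = ft_freqanalysis(cfg, data_condB);

% estimate the common filter on the combined data ...
cfg                 = [];
cfg.method          = 'dics';
cfg.frequency       = 8;
cfg.headmodel       = headmodel;                 % single-shell model (Nolte, 2003)
cfg.sourcemodel     = sourcemodel;               % MNI-template grid warped to the subject
cfg.dics.keepfilter = 'yes';
src_all             = ft_sourceanalysis(cfg, freq_all);

% ... and apply it to each condition separately
cfg.sourcemodel.filter = src_all.avg.filter;
src_condA              = ft_sourceanalysis(cfg, freq_condA);
src_condB              = ft_sourceanalysis(cfg, freq_condB);
```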
MEG statistical analysis (sensor level)
We performed both univariate and multivariate analyses in sensor space, followed by a beamforming analysis (Gross et al., 2001) to identify the sources explaining any observed effects. The univariate analysis was conducted to observe the classical decrease in power in the alpha and beta bands (Cochin et al., 1999; Pineda, 2005; Hari, 2006). Importantly, to identify at which sensors and at which point in time it is possible to distinguish between the two movements on the basis of the MEG signal, we used multivariate analysis on the computed power and on the source estimates, adapting an algorithm developed for the analysis of fMRI data (Oosterhof et al., 2012a).
Behavioral analysis (MEG experiment)
Participants' accuracy in answering the questions in the catch trials during the MEG experiment was evaluated on-line by monitoring the feedback provided after each catch trial. All participants were able to answer the questions and typically made two or three mistakes within the entire session (mostly at the beginning of the experiment). We are thus confident that participants were attending to the videos.
Univariate analysis
Note that in contrast to the multivariate analysis, in which we specifically targeted regions that show movement selectivity generalizing across effector and reach direction, the purpose of the univariate analysis was to identify areas with less specific properties. In particular, as a quality control, we examined whether we obtained the typical power decrease in the alpha and beta bands during action observation (Cochin et al., 1999; Pineda, 2005; Hari, 2006). Furthermore, we aimed to determine which frequency bands and which sensors were modulated differently during observation of pointing and grasping when collapsing across effector and reach direction.
All experimental conditions were baseline corrected by subtracting the fixation period (from −0.5 to −0.3 s) from the poststimulus period (from 0 to 1.3 s). To assess the difference between pointing and grasping, we used a nonparametric permutation test with a cluster-based method for multiple-comparison correction (Maris and Oostenveld, 2007), with participants as units of observation. In brief, we computed t scores between the two movements for each sensor–frequency–time bin. The observed cluster-level statistic was obtained by summing the t scores of neighboring bins (in time, frequency, and sensors) exceeding an a priori defined critical value (p < 0.05). We repeated this procedure 1000 times, swapping the condition labels, to obtain the distribution of permuted cluster-level statistics. At each iteration, the maximum cluster-level statistic was considered to control for type I error. The p value was the proportion of permuted cluster-level statistics that exceeded the observed cluster-level statistic. If the p value was <0.05, the cluster was considered significant.
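A FieldTrip sketch of this cluster-based permutation test is shown below; it assumes that tfr_grasp and tfr_point hold the participant-level time–frequency averages (e.g., from ft_freqgrandaverage with cfg.keepindividual = 'yes') and that data_clean is the cleaned sensor-level data from the preprocessing sketch above.

```matlab
% Sketch of the cluster-based permutation test (FieldTrip), grasping vs pointing.
nSubj                = 17;
cfg                  = [];
cfg.method           = 'montecarlo';
cfg.statistic        = 'ft_statfun_depsamplesT';   % within-participant t scores
cfg.correctm         = 'cluster';
cfg.clusteralpha     = 0.05;                       % a priori threshold for cluster formation
cfg.clusterstatistic = 'maxsum';                   % sum of t scores within each cluster
cfg.numrandomization = 1000;                       % label permutations
cfg.alpha            = 0.05;
cfg.neighbours       = ft_prepare_neighbours(struct('method', 'triangulation'), data_clean);
cfg.design           = [1:nSubj 1:nSubj; ones(1, nSubj) 2 * ones(1, nSubj)];
cfg.uvar             = 1;                          % unit of observation: participant
cfg.ivar             = 2;                          % independent variable: movement type
stat                 = ft_freqstatistics(cfg, tfr_grasp, tfr_point);
```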
Multivariate analysis
The assumption behind multivariate analysis of MEG data is that the processing of each stimulus category is associated with specific neural activity that induces an oscillatory signal (or neural pattern) consisting of a unique combination of sensors, time points, and/or frequencies. Multivariate analyses exploit differences in terms of these patterns of activation. By contrast, univariate analyses do not consider such patterns, but address whether two conditions differ in terms of the average response of a single variable (e.g., averaged frequency over time). This is why multivariate analyses can be more sensitive than univariate analyses (Haxby et al., 2001; Kriegeskorte et al., 2006). Importantly, multivariate analysis makes it possible to examine whether the representational content of an area (examined via the underlying neural pattern) generalizes across low-level features. In our case, we aimed to identify regions in which the unique neural patterns associated with pointing and grasping generalized across effector (left or right hand) and reach direction (left or right; for a schematic overview, see Fig. 2). We trained a classifier to discriminate between the two types of movements using the spatiospectral-temporal MEG signal (for details, see next paragraph) related to movements performed with one of the two effectors and toward one of the two directions. We then tested the classifier on the opposite combination of effector and direction. For example, we trained a classifier to distinguish between observed grasping and pointing actions performed with the left hand toward the left, and tested its ability to distinguish between observed grasping and pointing performed with the right hand toward the right. In this way, above-chance classification could only be due to information related to the type of movement, and not to low-level perceptual features.
Feature selection. Schematic representation of the method we adopted for selecting the features used for the multivariate analysis. Here we show one specific step of the algorithm with the selected central sensor (black dotted circle) and one neighboring sensor only (gray dotted circle) for illustrative purposes. A, Time–frequency representations (colors indicate power intensity) in the posterior sensors of the MEG helmet in two conditions of interest (conditions A and B). The arrows starting from the circles indicate the corresponding magnified sensors. B, Enlarged views of the two example sensors for conditions A and B. The dotted rectangles illustrate an example time–frequency bin (2 neighboring bins per side for the time dimension; 4 neighboring bins per side for the frequency dimension; see Materials and Methods). For feature selection, for each time–frequency bin, we scanned each individual sensor with its 10 neighboring sensors. B shows a matrix representation of the specific sensor/frequency/time bins. C, We then rearranged the dimensions of the matrix from 3D to 1D to obtain the corresponding feature vectors for conditions A and B. The feature vectors were used as input for the decoding analysis over sensors, frequency, and time. Specifically, the feature vectors were partitioned into independent chunks and used for training and testing the classifier. In the depicted example, each feature within the matrices was assigned a number to indicate the corresponding feature within the feature vectors, for visualization purposes.
Analyses were performed using CoSMoMVPA, an MVPA toolbox in Matlab [N.N. Oosterhof, A.C. Connolly, and J.V. Haxby, in preparation (“CoSMoMVPA: Multi-Modal Multivariate Pattern Analysis of Neuroimaging Data in Matlab/GNU Octave”; toolbox available from http://cosmomvpa.org)]. The toolbox provides an adapted version of the multivariate searchlight approach (Kriegeskorte et al., 2006), an information-based algorithm that applies multivariate analysis at each location to characterize the local representational content. In this analysis, we used local “neighborhoods” of features in channel–time–frequency space. We used a sensor radius of 1, a time radius of 100 ms, and a frequency radius of 8 Hz. For a given “center” feature (a sensor–time–frequency triple), its neighbors consisted of all features whose sensor, time, and frequency were within the corresponding radii.
The main steps used in the multivariate analysis (for a schematic illustration, see Fig. 2) were as follows: (1) compute the time–frequency representation separately for each sensor and each trial (Fig. 2A); (2) select the “central” feature and its neighbors in time–frequency–sensor space (Fig. 2A, insets, dashed rectangles; for an enlarged view, see Fig. 2B); (3) create a feature vector for each trial by selecting all features in its neighborhood (Fig. 2C) and normalize (z transform) the data; (4) create independent partitions for training and testing the classifier (Table 1); (5) train the classifier; and (6) test the classifier. We repeated steps 2–6 for each sensor and for each time and frequency bin, and the classification result for each center feature was assigned to its corresponding location in time–frequency–sensor space. For classification, we used a support vector machine algorithm, a classifier that finds a linear combination of features defining a decision boundary that discriminates between two classes of stimuli (Mur et al., 2009; Pereira et al., 2009).
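The following CoSMoMVPA sketch illustrates these steps for the sensor–time–frequency searchlight. It assumes that the single-trial FieldTrip time–frequency structure has been converted to a CoSMoMVPA dataset, with targets coding movement type and chunks coding independent subsets of trials, and that the partitions variable implements the cross-condition scheme of Table 1 (see the sketch in the next subsection). The sensor neighborhood is approximated here by each sensor's 10 nearest neighbors (cf. Fig. 2); variable names are placeholders.

```matlab
% Sketch of the sensor-time-frequency searchlight decoding (CoSMoMVPA).
ds = cosmo_meeg_dataset(tfr, ...                       % single-trial time-frequency data
                        'targets', movement_type, ...  % 1 = grasping, 2 = pointing
                        'chunks',  chunk_id);          % independent chunks

% neighborhoods: sensors, time (radius 100 ms = 2 bins), frequency (radius 8 Hz = 4 bins)
chan_nbrhood = cosmo_meeg_chan_neighborhood(ds, 'count', 10);
time_nbrhood = cosmo_interval_neighborhood(ds, 'time', 'radius', 2);
freq_nbrhood = cosmo_interval_neighborhood(ds, 'freq', 'radius', 4);
nbrhood      = cosmo_cross_neighborhood(ds, {chan_nbrhood, time_nbrhood, freq_nbrhood});

% cross-validated classification with a linear SVM and z-scored features
measure            = @cosmo_crossvalidation_measure;
args               = struct();
args.classifier    = @cosmo_classify_svm;
args.partitions    = partitions;                 % cross-condition partitions (Table 1)
args.normalization = 'zscore';

accuracy_map = cosmo_searchlight(ds, nbrhood, measure, args);
```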
Cross-comparisons used for training and testing
To create subsets of trials to feed the classifier with the aim of differentiating between neural responses related to the observation of grasping and pointing actions regardless of effector and reach direction, we divided each participant's dataset into two independent halves, each containing only movements with a complementary combination of effector and reach direction. The first half contained left-hand movements to the right and right-hand movements to the left, and the second half contained left-hand movements to the left and right-hand movements to the right. We further divided the data into independent “chunks,” each of which contained ≥136 trials (depending on the number of trials remaining after artifact rejection) of a specific condition of interest. Then, for each half, we adopted a leave-one-chunk-out cross-validation method. We used three chunks associated with a specific condition for training, and a corresponding chunk with the complementary effector and direction for testing (cross-condition classification). This procedure was repeated for all chunks. Note that within a chunk the only dimension that differed across trials was the type of movement: grasping versus pointing. Thus, we assumed that the classifier learned to discriminate between these two classes of stimuli. For example, if the training dataset contained the conditions grasping to the right with the right hand and pointing to the right with the right hand, the testing dataset contained the conditions grasping to the left with the left hand and pointing to the left with the left hand. For this type of classification, the classifier had to rely on differences between the two types of movements. If the model was able to discriminate between the two movements in the independent subset, this indicated that it had learned the difference between the two types of movements from the training subset, generalizing across effector and reach direction. We adopted this approach for each possible factorial combination (for a complete list, see Table 1).
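As an illustration, one of the cross-condition train/test splits listed in Table 1 can be constructed as follows; ds is the dataset from the searchlight sketch above, the sample attribute names (effector, direction) are hypothetical, and the leave-one-chunk-out loop over chunks is omitted for brevity.

```matlab
% Sketch of one cross-condition partition (train: left hand/leftward;
% test: right hand/rightward). Attribute names are hypothetical.
train_msk = ds.sa.effector == 1 & ds.sa.direction == 1;   % left hand, leftward
test_msk  = ds.sa.effector == 2 & ds.sa.direction == 2;   % right hand, rightward

partitions               = struct();
partitions.train_indices = {find(train_msk)};
partitions.test_indices  = {find(test_msk)};

% equate the number of grasping and pointing trials within each partition
partitions = cosmo_balance_partitions(partitions, ds);
cosmo_check_partitions(partitions, ds);   % train and test chunks must not overlap
```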
The testing phase provided accuracy maps for each participant, reflecting the classifier's performance in discriminating between the two observed movements regardless of effector and reach direction [in a similar way as traditional fMRI searchlights (Kriegeskorte et al., 2006), except that the features consist of sensor–time–frequency triples rather than voxels]. We thus had information regarding where, when, and in which frequency band it was possible to distinguish between the abstract neural representations of the two movements.
To assess the reliability of the performance of the classifier, we used a nonparametric method (permutation test, similar to the procedure described above for the univariate analysis; Maris and Oostenveld, 2007). In this case, we used the difference between the obtained classification accuracy and chance-level accuracy (the accuracy expected under the null hypothesis of no difference between the two conditions, meaning 50%) to compute the test statistic used in the permutation steps (see Univariate analysis, above).
Any effect observed at the sensor level has to be generated by neural sources. To visualize the sources underlying the cross-decoding effects for the frequency bands and time windows observed at the sensor level, we conducted a multivariate analysis at the source level, using the same searchlight approach as before (Kriegeskorte et al., 2006). Note that multivariate analysis was necessary here to identify which regions of the brain represented actions at an abstract level (i.e., generalizing across effector and reach direction). We reconstructed the source activity for the frequency bands and time windows that were significant at the sensor level to identify the regions of the brain in which it is possible to distinguish between grasping and pointing regardless of effector and reach direction. We obtained estimates of frequency power at each grid point using a beamformer algorithm (see previous section) on a single-trial basis. A searchlight was defined by taking the power values at each grid point together with its neighbors within a sphere of 2 cm radius. For each participant, we obtained accuracy maps indicating the performance of the classifier in discriminating between the two observed movements (regardless of effector and reach direction). For descriptive purposes, we report the clusters showing the greatest classification accuracy.
Results
Behavioral experiment
We computed a χ2 test to evaluate at which time point participants' performance was significantly higher than chance level (50%; Fig. 3). We found that performance of the participants was not different from chance level at 167 ms (χ2 = 11.7307; df = 13; p = 0.5498) and at 200 ms (χ2 = 21.4835; df = 13; p = 0.0639). Performance was significantly higher than chance level from 233 ms onwards (χ2 = 58.0318; df = 13; p = 1.178e-07). This means that participants were unable to distinguish the two actions if videos were shorter than 233 ms. Since mean movement onset in the videos (defined as the time point at which the rest position was released and hand preshaping was initiated; see Stimuli) was 191 ms (SD: 90 ms; median: 183 ms), this indicates that the two actions were perceptually indistinguishable before movement onset.
Behavioral results. Behavioral performance (percentage correct) for categorizing the two observed movements (grasping, pointing) as a function of video duration, collapsed across effector and reach direction. As expected, participants responded more accurately with increasing video duration. Statistical analysis confirmed that participants reached above-chance performance in classifying the two movements from 233 ms onwards (see Materials and Methods, Statistical analysis, Behavioral experiment). Each dot represents data from a single participant. The continuous line indicates the linear model that best fits the data.
MEG experiment
Univariate analysis
We first analyzed the MEG signal using classical univariate methods to assess whether the stimuli modulated the ongoing oscillations relative to rest. Low-frequency bands, such as the alpha and beta bands, are typically characterized by a decrease in power, presumably reflecting desynchronization of neuronal activity in specific brain regions (Pfurtscheller and Lopes da Silva, 1999) and indicating neural processing of the stimulus. Univariate analyses comparing the activation period (after video onset) with the baseline (before video onset) demonstrated that passive observation of pointing and grasping modulates alpha-band (8–12 Hz) and beta-band (15–25 Hz) power, as well as theta-band (4–7 Hz) power, over posterior, parietal, and frontal sensors. Figure 4E shows one central sensor for illustrative purposes. In the depicted sensor, the power decrease during movement observation and the subsequent alpha and beta rebounds related to post-observation processes are evident. Dotted lines approximately indicate the different stages of the movement (see figure legends for details).
Theta, alpha, and beta band activity during action observation and univariate contrast. A, Time–frequency representation of the difference (expressed in t scores) between grasping and pointing (collapsed across effector and reach direction) for the sensor highlighted in the head model. The four dotted lines indicate the following events, from left to right: (1) video onset, (2) median movement onset, (3) approximate time at which the hand touches the object (∼550 ms), and (4) video offset (833 ms). B, Same as A, but time–frequency bins that did not survive the permutation test with Monte Carlo and cluster-based correction for multiple comparisons were set to zero. C, D, Topographic representations of the two frequency bands shown in B. E, Power change during action observation relative to baseline (fixation cross) over a representative sensor. The power change was calculated as (activation − baseline)/baseline, such that 1 indicates a 100% increase relative to baseline and −1 indicates a 100% decrease relative to baseline. The classical power decrease in the alpha and beta bands following observed movement onset (at t = 0 s) is evident.
The decrease in power that we observed in the alpha and beta bands is in line with previous studies (Pineda, 2005; Hari, 2006) and has been suggested to reflect sensorimotor system activity. Furthermore, an increase in power in the theta and low-alpha (4–8 Hz) range has been observed during memory tasks (Jensen and Tesche, 2002). In addition, these low frequencies have been reported to be modulated during action observation both in humans (Frenkel-Toledo et al., 2013; Pavlidou et al., 2014a,b) and in monkeys (Kilner et al., 2014; Caggiano et al., 2015).
A direct comparison of grasping and pointing movements (collapsing over effector and reach direction; see Materials and Methods, Univariate analysis) showed significant differential modulation of beta-band (central frequency: 24 Hz) and alpha-band (central frequency: 16 Hz) power over sensorimotor sensors at late latencies only (from ∼750 to 1100 ms and from ∼500 to 750 ms, respectively). Figure 4B illustrates this effect for the same representative significant sensor over central regions as in Figure 4A. Bluish colors indicate that the power decrease is greater for grasping than for pointing; reddish colors indicate the opposite. Figure 4, C and D, shows the topographic representations of the significant sensors in two selected subsets of frequency bands and time windows, which were all located over central and right-central sensors. These results show that (1) the brain processes the two actions as being different, and (2) sensorimotor areas might be involved. The fact that grasping induces a greater decrease than pointing could be due to the higher complexity of this movement, which in turn is likely to recruit more neural sources. However, this differential activity occurs quite late (at ∼600 ms), long after the time point at which the two movements became perceptually distinguishable. Thus, there must be another, earlier process, which the univariate analysis did not reveal, that enables the brain to discriminate between the two movements.
Multivariate analysis
Figure 5A–C shows the results of the multivariate analysis at the sensor level. Two types of representations are provided: (1) a time–frequency representation, to show the dynamics of all the considered frequencies at each time point in a specific subset of sensors (A); and (2) a topographical representation, to show the spatial information at specific time points and frequency bands (B and C). The inset in Figure 5A shows the two time–frequency clusters that survived the multiple-comparisons correction. The lateral plots show the averaged t values over the sensors highlighted on the two topoplots in the middle. We observed that the classifier was able to significantly (p < 0.05, corrected for multiple comparisons using a cluster-based method; maximum accuracy: 53.46%) discriminate between the two observed movements, generalizing across effector (left and right hand) and reach direction (left and right), over posterior sensors as early as 150 ms and lasting until 550 ms in the low alpha/theta range (Fig. 5A, left). By contrast, significant discrimination over more anterior sensors was possible only within a window of 550–1200 ms (i.e., at a late stage of the video, when the hand interacts with the object; Fig. 5A, right). Figure 5, B and C, shows the topographies at different times and frequencies, selected according to the following criteria: (1) as times of interest, we selected the central points of the time windows defined by the clusters that survived the significance test [i.e., 400 ms (200−600 ms) for the cluster obtained in the earlier time window, and 900 ms (600−1200 ms) for the cluster obtained in the later time window]; and (2) frequency bands were chosen based on previous studies showing a modulation of the low alpha (8–10 Hz) and high theta (6–8 Hz) bands (Frenkel-Toledo et al., 2013) and the high alpha (8–14 Hz) and beta (15–25 Hz) bands during action observation (Pineda, 2005). For each time of interest (400 and 900 ms), we selected the peak frequency within each considered frequency band (i.e., 6, 8, 10, and 18 Hz).
Results of the neural spatiotemporal decoding. To identify abstract representations of the observed actions (e.g., observing “grasping” regardless of whether it was performed with the left or the right hand), we trained the MVPA classifier to discriminate between pointing and grasping using one effector (e.g., the left hand) and one reach direction (e.g., toward the left), and tested the performance of the classifier on an independent dataset containing pointing and grasping movements performed with the other hand toward the opposite reach direction. We decoded the observed movements over time bins, frequency bins, and sensors using a time–frequency–channel searchlight analysis. A, The lateral plots show the time–frequency representation of the decoding in the sensors depicted in the inset topoplots. Reddish colors indicate higher classification accuracy. Sensors were selected on the basis of the highest decoding accuracy at the frequency of interest. The central inset shows the two clusters that survived the correction for multiple comparisons (cluster obtained at the early time point: 200–600 ms; cluster obtained at the late time point: 600–1200 ms). B, Topography of the decoding at 400 ms and low frequencies (6 and 8 Hz; smoothing: 4 Hz). C, Topography of the decoding at 900 ms and higher frequencies (10 and 18 Hz; smoothing: 3 Hz). D, E, Sources accounting for the decoding effect found at the sensor level, thresholded to retain only those voxels with the 10% highest decoding accuracies. For the sensor-level analysis only, significant differences were computed using permutation analysis and Monte Carlo methods, and results are cluster-corrected for multiple comparisons. Maps were projected on the population-average, landmark-based, and surface-based atlas (Van Essen, 2005), using Caret software (Van Essen et al., 2001).
To examine the cortical sources of the effects shown in Figure 5A–C, we performed another multivariate analysis at the source level, using the same cross-comparisons as for the sensor analysis (for details, see Materials and Methods). To find the sources at 400 ms for the frequencies 6 and 8 Hz, we used spectral smoothing of 4 Hz and time windows of 150–650 ms and 212–587 ms, respectively. Figure 5, D and E, shows the decoding accuracies of all the sources projected on surface template MNI brains, thresholded to retain only those voxels with the 10% highest accuracies (for the corresponding mean and individual decoding accuracies, see Fig. 6; for a direct comparison with the univariate analysis, see Fig. 7). For the 6 Hz signal, the highest decoding accuracies were found bilaterally in the LOTC, extending into the inferior temporal gyrus and the superior temporal gyrus in the right hemisphere, and in the left superior parietal cortex, extending into the inferior parietal cortex (Fig. 5D, left; for MNI coordinates of the peak voxel in each cluster, see Table 2). The highest decoding accuracies for the 8 Hz signal were located in the left LOTC (Fig. 5D, right), slightly anterior to the source identified at 6 Hz.
Maximum accuracy within each region. Within each identified source, the voxel with the maximum mean accuracy was selected and plotted with individual accuracies (black dots). Left MTG, Middle temporal gyrus (MNI: −50, −64, 12); Left SPL, superior parietal lobule (MNI: −20, −56, 48); Right PCG, precentral gyrus (MNI: 28, −6, 28); Right IFG, inferior frontal gyrus (MNI: 20, 24, 28; Table 2).
Comparison between univariate and multivariate analyses. Comparison between univariate (top row) and multivariate (bottom row) analyses in two time windows (200–600 and 600–1200 ms). The upper topoplots show the sensors that survived the permutation test when comparing grasping versus pointing (collapsing across effector and reach direction). The lower topoplots show the sensors that survived the permutation test when comparing the observed accuracy of the classifier to distinguish between pointing and grasping (generalizing across effector and reach direction) against chance level (50%). Multivariate analysis was more sensitive in detecting the subtle differences between the neural signals induced by observation of the two movement types in the earlier time window. All shown clusters are corrected for multiple comparisons (p < 0.05).
MNI coordinates of the sources
Regarding the sources related to the decoding obtained in the late time window, we chose 900 ms as the time of interest for the frequencies 10 and 18 Hz (time windows: 600–1200 and 678–1222 ms, respectively; smoothing: 3 Hz). For the 10 Hz signal, we obtained the highest decoding accuracies in the right precentral gyrus (Fig. 5E, left). For the 18 Hz signal, we obtained the highest decoding accuracies in the right inferior frontal gyrus (IFG; Fig. 5E, right).
To provide a complete overview of the temporal dynamics of the neural decoding in sensor space, we plotted the decoding accuracy (expressed in t values) for separate time bins (50–150, 150–250, 250–350, 350–450, 450–550, and 550–650 ms for the early observed decoding, Fig. 8A; 350–450, 450–550, 550–650, 650–750, 750–850, 850–950, and 950–1050 ms for the late observed decoding, Fig. 8B), averaged across frequency bands (theta: 2–6 Hz; low alpha: 7–9 Hz; alpha: 9–11 Hz; beta: 17–19 Hz). Figure 8 shows how the effect over posterior sensors evolves over time, and that anterior sensors do not show above-chance decoding before ∼700 ms.
Neural decoding over time. The topoplots show the dynamics of above-chance accuracy (expressed as t scores) of the classifier in discriminating observed grasping and pointing (generalizing across effector and reach direction) for specific frequency bands (theta: 5–7 Hz; low alpha: 7–9 Hz; alpha: 9–11 Hz; beta: 17–19 Hz). A, The earliest significant decoding occurred in the posterior part of the helmet configuration in the lower-frequency bands. B, Decoding in the higher frequency bands was significant at a later latency.
To further evaluate the reliability of the classifier, we also used a simulation approach. Specifically, we ran a Monte Carlo simulation to estimate the probability of finding an accuracy of 53.46% under the null hypothesis of chance accuracy. The cross-validation partitioning scheme divided the data into two independent halves (Table 1; see Materials and Methods), with the first half containing left-hand rightward and right-hand leftward trials, and the second half containing right-hand rightward and left-hand leftward trials. In each independent half, there were two folds, with a minimum of 136 trials (across participants and halves) after rejecting trials with artifacts and balancing the partitions so that each of the two actions occurred equally often. For each participant separately, we computed the correlation of classification accuracies for the test sets in the two folds, which was r = 0.3289 (median across participants and the two independent halves). We used this value in the simulation as follows. For each permutation, uniformly distributed [on the interval (0, 1)] random data were generated for two independent halves, two folds, 136 samples, and 17 participants. To assess the effect of dependency, we used three sets of independently and normally distributed data, i1, i2, and icommon. To match the correlation between accuracies, for each independent half of the data, data were made dependent through d1 = i1 * γ + icommon * (1 − γ) and d2 = i2 * γ + icommon * (1 − γ), with γ = 0.415 found through binary search to match the correlation (r = 0.3289) across dependent folds, as observed in the original data. For each iteration, classification accuracy was simulated by dividing the number of samples that exceeded 0.5 in d1 and d2 by the number of samples. To obtain classification accuracies relative to chance, 0.5 was subtracted.
To assess the effect of independence, we also ran the same analysis setting γ = 0 (corresponding to r = 0, i.e., full independence between folds) and γ = 1 (corresponding to r = 1, i.e., full dependence between folds).
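A minimal Matlab sketch of this simulation is given below. It assumes zero-mean normal variates (the description above mentions both uniform and normal variates) with a threshold of 0, so that chance accuracy is exactly 50%, and it summarizes each iteration by the group-mean accuracy.

```matlab
% Sketch of the Monte Carlo simulation of classification accuracy under the null.
nIter = 100000; nSubj = 17; nTrials = 136; nHalves = 2;
gamma = 0.415;                       % mixing weight matching r = 0.3289 across folds
obs   = 0.5346 - 0.5;                % observed maximum accuracy relative to chance

sim_mean = zeros(nIter, 1);
for it = 1:nIter
    i1      = randn(nTrials, nSubj, nHalves);
    i2      = randn(nTrials, nSubj, nHalves);
    icommon = randn(nTrials, nSubj, nHalves);
    d1      = i1 * gamma + icommon * (1 - gamma);     % fold 1 (dependent via icommon)
    d2      = i2 * gamma + icommon * (1 - gamma);     % fold 2 (dependent via icommon)
    acc1    = squeeze(mean(d1 > 0, 1)) - 0.5;         % accuracy relative to chance
    acc2    = squeeze(mean(d2 > 0, 1)) - 0.5;
    sim_mean(it) = mean([acc1(:); acc2(:)]);          % group-level accuracy, this iteration
end
p_mc = mean(sim_mean >= obs);        % proportion of iterations reaching the observed value
```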
We used 100,000 iterations and found that the maximum classification accuracy observed in the data (using r = 0.3289 for the fold correlation) was significant (pMC,sensor, r = 0.3289 < 0.00001); no iteration showed a higher mean than that observed in the data (Fig. 9). We obtained similar results for the additionally simulated cases of fully independent folds (r = 0; psensor, r = 0.00 < 0.00001) and fully dependent folds (r = 1; psensor, r = 1.00 < 0.00001).
Simulation analysis. Illustration of how a “low” classification accuracy (53.46% for the sensor data; 50% is chance level) can be highly significant, using normal distribution probability plots of the Monte Carlo-simulated classification accuracy distribution (relative to chance, 50%). The simulation uses the same parameters as those used in the study (17 participants; minimum number of trials after trial rejection: 544 per participant; same cross-validation scheme as used for the original data). Dependency across cross-validation folds was set to r = 0.3289 (green crosses) to match the value observed in the original data; for comparison, results are also shown for the cases of no dependence (r = 0.00; blue) and full dependence (r = 1.00; orange). The maximum classification accuracy above chance observed in the original data is indicated by a black line.
Discussion
Using MVPA of MEG data, we found that the LOTC has the earliest access to abstract action representations. By contrast, precentral regions, though recruited relatively early, have access to abstract action representations substantially later than the LOTC. Behavioral data indicated that participants were not able to distinguish between the two actions before 233 ms, and this latency is comparable to the one observed in the LOTC.
Early abstract action representations in occipitotemporal and parietal regions
Although MEG has a lower spatial resolution than fMRI, we can confidently say, based on the topographical results and source analysis, that the source that accounted for the decoding effect we found at the early stage was located within the left and right LOTC. The LOTC hosts regions sensitive to body parts, kinematics, body postures, manipulable objects, and observed movements (Valyear and Culham, 2010; Downing and Peelen, 2011; Buxbaum et al., 2014; Pavlidou et al., 2014a,b; Lingnau and Downing, 2015). The LOTC has been shown to be modulated when participants are required to process the meaning of an action, as compared with the effector involved in it (Lingnau and Petris, 2013). Moreover, the LOTC is recruited during the semantic processing of verbs (Papeo et al., 2015), and lesions to this region are associated with impairments in action recognition (Kalénine et al., 2010; Urgesi et al., 2014). In line with this view, a recent lesion study demonstrated that lesions to primary motor and somatosensory cortices and to the inferior parietal lobule were accompanied by impaired action performance. By contrast, lesions to the posterior LOTC were associated with impaired action recognition, whereas lesions to the anterior LOTC were accompanied by impairments in both tasks (Tarhan et al., 2015). Together, these studies suggest that the LOTC is well suited to integrate various sources of information crucial for action understanding.
Neuroimaging studies using MVPA of fMRI data have recently shown that the LOTC also contains abstract representations of observed actions, e.g., action representations that generalize from action execution to action observation and vice versa (Oosterhof et al., 2010), that generalize across viewpoint (first person, third person; Oosterhof et al., 2012a), that generalize across kinematics (Wurm and Lingnau, 2015), and that generalize across the object involved in the action (Wurm and Lingnau, 2015; Wurm et al., 2015). Importantly, our study shows that such abstract representations are available in the LOTC before they can be observed in precentral regions, around the time at which there is enough information in the stimuli to distinguish between the two types of actions. Our findings are compatible with cognitive theories of action understanding that predict the earliest encoding of the meaning of an action outside the motor system. By contrast, our results are not compatible with motor theories of action understanding, which would predict the earliest access to abstract action representations in precentral regions.
The fact that we observed abstract action representations in the LOTC earlier than in precentral regions is compatible with a framework suggested by Kilner (2011). According to this view, the middle temporal gyrus in the LOTC and the anterior portion of the IFG encode the most likely goal or intention of an action (e.g., grasping an object), which is communicated to the posterior portion of the IFG, where the most likely action is selected. In this framework, the role of the posterior IFG would be to generate a concrete instance of the action (e.g., grasping an object on the left using the right hand) through motor simulation. In contrast to motor theories of action understanding, the role of this motor simulation would not be to provide access to the meaning of the action, but rather to contribute to the generation of the predicted sensory consequences of the most likely action.
We observed abstract action representations at ∼400 ms in the left superior parietal lobule as well, extending into the inferior parietal lobule. This result is in line with previous monkey (Fogassi et al., 2005; Rizzolatti et al., 2014) and human fMRI studies (Grafton and Hamilton, 2007; Oosterhof et al., 2010, 2012b; Leshinskaya and Caramazza, 2015; Wurm and Lingnau, 2015; Wurm et al., 2015), suggesting that, similar to the LOTC, this region contains abstract action representations. The observation that the superior parietal and the inferior parietal lobule have access to abstract action representations earlier than precentral regions raises the possibility that these regions might play an intermediate role between the LOTC and precentral regions (Wurm et al., 2015). In line with this view, Pavlidou et al. (2014b) demonstrated that the difference between plausible and implausible actions is first obtained over left temporal sensors, followed by parieto-occipital and sensorimotor sensors.
Late abstract action representations in precentral regions
The contrast between observation and baseline showed a modulation of the high-alpha and beta frequency bands over central sensors during passive action observation (Fig. 4E), an effect that has been suggested to be related to sensorimotor processing in motor and premotor regions (Pineda, 2005). Although we observed an early modulation of high-alpha and beta frequencies in precentral regions for observation versus baseline, these regions had access to abstract representations of the observed actions substantially later than the time at which the actions were distinguishable. This finding makes a decisive role of precentral regions in action understanding implausible. In line with this view, damage to precentral regions does not necessarily impair the ability to understand actions (Negri et al., 2007; Kalénine et al., 2010; but see Pazzaglia et al., 2008). If precentral regions do not play a decisive role in action understanding, what could be the alternative role of the late abstract action representations we obtained in these regions? Since the LOTC and precentral regions are functionally interconnected (Kilner, 2011; Nelissen et al., 2011; Turken and Dronkers, 2011; Engel et al., 2013; Papeo et al., 2015), higher-level representations in precentral regions have been suggested to be a result of information spreading throughout the network (Mahon and Caramazza, 2008). Instead of providing access to the meaning of an action, precentral regions thus might be recruited to plan an appropriate movement in response to the observed action, as a consequence of, or in parallel with, the process of action understanding.
Potential caveats
One potential limitation regarding the interpretation of our results is related to the fact that one of the main distinctions between pointing and grasping, next to the preshaping of the hand, is the number of fingers involved. It is therefore difficult to disentangle whether our classification is based on the number of fingers involved in the movement, the preshaping of the hand while approaching the target, or a combination of the two. Note that pointing and grasping movements are defined both by the number of fingers involved and by the hand configuration; in other words, understanding actions could be based on the number of fingers observed as well as on the shape of the hand.
Another possible criticism could be that we were able to distinguish between the two movements based on the MEG signal as early as 150 ms, which seems counterintuitive given that the mean movement onset in the videos was ∼191 ms. There are several, not mutually exclusive, explanations for this observation. First, movements started before 150 ms in 43.8% of the videos (see Materials and Methods). Moreover, the peak of decoding from the MEG signal was obtained at ∼300 ms. Second, we had to apply a certain amount of temporal smoothing during the time–frequency computation and during the searchlight analysis (see Materials and Methods). Consequently, when the algorithm analyzes the time bin at 150 ms, it also considers information present at 200 and 250 ms, which contained more information about movement type. This means that the absolute latency at which the two actions can be distinguished based on the MEG signal has to be interpreted with caution. Importantly, we do not aim to draw strong conclusions regarding the absolute onset at which movements can be decoded in the different regions, but rather regarding the relative difference between putative regions involved in action understanding. Thus, our conclusion still holds: the LOTC encodes abstract representations of actions earlier than precentral regions.
One might argue that although we observed the strongest source in the early time window within the LOTC, the source analysis also revealed a small left frontal region. This frontal source is very likely generated by a single temporal source, in line with the observation that no frontal sensor showed significant decoding in this early time window (Fig. 8). Note that the absence of a frontal source in the early time window does not prove that such a source does not exist. What we can state with a certain confidence, though, is that the same analysis that revealed a strong and reliable source in the LOTC did not reveal any frontal source in the early time window.
Conclusion
Our results demonstrate that the LOTC has access to abstract action representations substantially earlier than precentral regions, in line with the idea that action understanding occurs outside the motor system, with subsequent activation of precentral regions due to information provided from the LOTC. Our results therefore provide important constraints for biologically plausible models of action understanding.
Footnotes
This work was supported by the Provincia Autonoma di Trento and the Fondazione Cassa di Risparmio di Trento e Rovereto.
The authors declare no competing financial interests.
- Correspondence should be addressed to Angelika Lingnau, Department of Psychology, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK. angelika.lingnau{at}unitn.it or angelika.lingnau{at}rhul.ac.uk