Abstract
During face-to-face communication, the perception and recognition of facial movements can facilitate individuals' understanding of what is said. Facial movements are a form of complex biological motion. Separate neural pathways are thought to process (1) simple, nonbiological motion with an obligatory waypoint in the motion-sensitive visual middle temporal area (V5/MT); and (2) complex biological motion. Here, we present findings that challenge this dichotomy. Neuronavigated offline transcranial magnetic stimulation (TMS) over V5/MT in 24 participants (17 females and 7 males) led to increased response times in the recognition of simple, nonbiological motion as well as visual speech recognition compared with TMS over the vertex, an active control region. TMS of area V5/MT also reduced the practice effects on response times that are typically observed in both visual speech and motion recognition tasks over time. Our findings provide the first indication that area V5/MT causally influences the recognition of visual speech.
SIGNIFICANCE STATEMENT In everyday face-to-face communication, speech comprehension is often facilitated by viewing a speaker's facial movements. Several brain areas contribute to the recognition of visual speech. One area of interest is the motion-sensitive visual middle temporal area (V5/MT), which has been associated with the perception of simple, nonbiological motion such as moving dots, as well as more complex, biological motion such as visual speech. Here, we demonstrate using noninvasive brain stimulation that area V5/MT is causally relevant for recognizing visual speech. This finding provides new insights into the neural mechanisms that support the perception of human communication signals, which will help guide future research in typically developed individuals and populations with communication difficulties.
Introduction
Humans gain information about what a speaker is saying from their visible articulator movements. In everyday life, visual speech recognition, also called speech-reading or lip-reading (Campbell, 2008), is especially important in face-to-face conversations with high levels of auditory background noise, as it can facilitate comprehension in noisy environments or even enable comprehension when auditory signals are absent (Sumby and Pollack, 1954; Grant and Seitz, 2000; Chandrasekaran and Ghazanfar, 2011). Visual speech recognition is also beneficial for speech comprehension in hearing impairment (Auer and Bernstein, 2007; Rosemann et al., 2020) and is altered in neurodevelopmental disorders (Schelinski et al., 2014; van Laarhoven et al., 2016).
There has been a great deal of debate surrounding how and where in the brain visual speech recognition is conducted. Previous research has demonstrated that visual speech recognition relies on the perception of dynamic and configural facial features, linked to the dorsal and ventral visual neural pathways, respectively (Campbell et al., 2001; O'Toole et al., 2002; Bernstein and Liebenthal, 2014). One cortical region consistently implicated in visual speech processing is the posterior superior temporal sulcus (pSTS) and posterior superior temporal gyrus (pSTG), including a particularly speech-sensitive subregion termed the temporal visual speech area (TVSA; Schultz et al., 2005; Bernstein et al., 2011; Riedel et al., 2015). Further, functional MRI studies have shown that the motion-sensitive visual middle temporal area (V5/MT) responds to biological motion, including the articulator movements that constitute visual speech (Calvert and Campbell, 2003; Paulesu et al., 2003; Peuskens et al., 2005; Borowiak et al., 2018).
In addition to complex, biological motion, there is strong evidence that V5/MT is involved in perceiving simple, nonbiological motion such as linearly or circularly moving shapes (Grèzes et al., 2001; Antal et al., 2004; Silvanto et al., 2006; Koivisto et al., 2010; Cai et al., 2014). While many studies using transcranial magnetic stimulation (TMS) have demonstrated that inhibition of V5/MT impairs the recognition of simple, nonbiological motion (Beckers and Zeki, 1995; Laycock et al., 2007; McKeefry et al., 2008; Grasso et al., 2018), its causal contribution to visual speech recognition remains unclear. Separate mechanisms for biological versus nonbiological motion have been suggested, and it remains unresolved whether V5/MT contributes to both (Miller et al., 2018). One account holds that V5/MT is an obligatory waypoint only for nonbiological motion, and not for biological motion such as visual speech. For instance, Mather et al. (2016) demonstrated that inhibitory TMS over V5/MT decreased nonbiological motion discrimination accuracy compared with stimulation of a control site, but had no effect on the identification of point-light walkers. However, no study so far has investigated the effects of TMS over V5/MT on visual speech recognition in particular, despite fMRI findings showing a V5/MT response to visual speech, but not to walking stimuli (Santi et al., 2003). There may in fact be separate processing streams that subserve different types or characteristics of biological motion.
In contrast to the previous approach, O'Toole et al. (2002) argue that facial motion information transits through V5/MT and is then passed to more specialized regions such as the TVSA. V5/MT is also functionally and structurally connected to both the TVSA and visual ventral stream regions (Ethofer et al., 2011; Furl, 2015; Bernstein et al., 2018). Moreover, a case report on a patient with V5/MT lesions showed impaired visual speech recognition but intact biological motion recognition (Campbell et al., 1997). Note, however, that the lesion extended also into the left STS and left STG; hence, it is unclear whether this impairment was caused by the V5/MT and/or the STS/STG lesion. In summary, it is an open question whether V5/MT is causally involved in processing visual speech information.
The objective of the present study was to investigate the functional relevance of area V5/MT for visual speech recognition. We applied offline inhibitory TMS over V5/MT and an active control brain region and examined participants' subsequent visual speech recognition performance, as well as their recognition of simple nonbiological motion.
Materials and Methods
The Ethics Committee at Technische Universität Dresden approved the study (Aktenzeichen EK1701042019). We conducted the experiments at the Neuroimaging Center of Technische Universität Dresden. All participants gave their written informed consent and were reimbursed for their participation.
Participants
Twenty-four right-handed, native German speakers between the ages of 19 and 36 years (mean = 25.2, SD = 5.49) were included in the data analysis (7 males, 17 females). Of the 27 participants who were recruited for the study, 3 were excluded because they experienced mild head or jaw pain during stimulation. None of the participants reported a history of neurologic or psychiatric disorders, or any contraindications for MRI or TMS. During recruitment, all participants underwent an online screening for clinical conditions associated with altered visual speech recognition, including developmental dyslexia and autism spectrum disorder (Schelinski et al., 2014; van Laarhoven et al., 2016). The screening comprised an inquiry about an existing diagnosis of developmental dyslexia, in the participants themselves or in their relatives, and the German version of the autism-spectrum quotient (AQ; mean AQ = 15.03, SD = 5.87; Freitag et al., 2007). Before the fMRI testing, we administered the Freiburg Visual Acuity Test (Bach, 1996). All participants passed this test (normal or corrected decimal visual acuity > 0.8).
Design
The study consisted of four separate sessions for each participant (Fig. 1a). In the first session, we acquired structural and functional MRI data. The MRI data were used to determine each participant's individual V5/MT coordinates in both hemispheres. The second and third sessions consisted of an offline TMS protocol, which included the stimulation of bilateral V5/MT and two sites in the vertex region as an active control. We used a 2 × 2 repeated-measures design with the within-participant factors Task (visual speech task vs motion direction task) and Stimulation (V5/MT vs vertex). Additionally, we acquired baseline performance measurements in the behavioral tasks before the administration of stimulation in Sessions 2 and 3. The behavioral tasks were completed again in a fourth and final session during which no stimulation occurred. The behavioral data collected in Session 4 were used for exploratory purposes and to assess the effects of TMS on practice effects across more sessions. Sessions 2 and 3 were separated by 8 d on average (SD = 3.82 d; range, 3–14 d), and Session 4 followed 1–3 d later (mean = 1.91, SD = 0.81). The order of V5/MT and vertex stimulation across Sessions 2 and 3 was counterbalanced between participants.
Localization of V5/MT
Design and stimuli.
A well established fMRI paradigm was used to localize V5/MT in each individual participant (Bridge et al., 2008). We chose a conventional "moving versus static" paradigm to target typical V5/MT coordinates. The stimuli were videos of random-dot kinematograms (RDKs) that consisted either of a single frame of static dots (static condition) or of dots moving coherently inward or outward (moving condition). White dots (0.1° diameter) and a gray fixation point (0.2°) in the center of the screen were presented on a black background. In the moving condition, 250 dots were generated overall and moved at a speed of 4.7°/s. We created the stimuli using MATLAB R2019b (version 9.6.0.1047502) and Psychtoolbox version 3.0.15 (Brainard, 1997; Pelli, 1997; Kleiner et al., 2007). Participants viewed the stimuli via a mirror on a screen 187 cm away (screen diagonal = 65 cm). The procedure consisted of 32 blocks, randomly alternating between the static and moving conditions (8× inward moving, 8× outward moving, 16× static).
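For illustration, a single frame update of such coherent radial motion can be sketched in R as follows. This is a minimal sketch, not the original MATLAB/Psychtoolbox code; the field radius and a 60 Hz refresh rate are assumptions, whereas the dot number and speed follow the text.

```r
n_dots <- 250                     # number of dots (as in the text)
speed  <- 4.7 / 60                # deg per frame, assuming a 60 Hz refresh rate
radius <- 5                       # field radius in deg (assumed, not specified)

r     <- runif(n_dots, 0, radius) # radial position of each dot (deg)
theta <- runif(n_dots, 0, 2 * pi) # angular position of each dot
dir   <- +1                       # +1 = outward motion, -1 = inward motion

# One frame update: shift all dots coherently along the radius and
# wrap dots that leave the field back to the opposite edge
r <- r + dir * speed
r[r > radius] <- r[r > radius] - radius
r[r < 0]      <- r[r < 0] + radius
x <- r * cos(theta)
y <- r * sin(theta)               # Cartesian coordinates for drawing
```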
Data acquisition parameters.
Structural and functional images were acquired on a 3 T Tim Trio scanner with a 32-channel head coil (Siemens Healthcare).
Structural T1-weighted images were acquired (TR = 1.9 s; TE = 2.26 ms; flip angle = 9°; bandwidth = 200 Hz/pixel) with a voxel size of 1 mm³ (176 sagittal slices).
Functional images were acquired with a gradient echoplanar imaging sequence (TR = 2.36 s; TE = 25 ms; flip angle = 80°; bandwidth = 2232 Hz/pixel; whole-brain coverage) with a voxel size of 3 mm³ (slice thickness = 2.5 mm; spacing between slices = 0.5 mm).
Data preprocessing and analysis.
The fMRI data were analyzed with SPM12 (Wellcome Trust Center for Neuroimaging, UCL, London, UK; www.fil.ion.ucl.ac.uk/spm) in a MATLAB environment (R2019b; version 9.6.0.1047502). Preprocessing of the images included calculating voxel displacement maps to correct for motion-related distortions, realigning, unwarping, normalizing to Montreal Neurological Institute (MNI) standard stereotactic space, and smoothing with a Gaussian filter of 8 mm full-width at half-maximum. We generated statistical parametric maps by modeling the evoked hemodynamic response for the static and moving conditions using the general linear model approach. The target V5/MT coordinates were retrieved with the contrast "moving" versus "static," computed for each participant at the first level by means of a t-contrast, with a threshold of p < 0.001, uncorrected. Each participant's individual V5/MT coordinates were determined within a cluster from a group contrast of the same localizer in a previous experiment (https://osf.io/9b4fn/). Within this cluster, we selected each participant's voxel with the highest t-value. The MNI coordinates of left V5/MT (mean coordinates: x = −45.75 ± 2.78, y = −72.86 ± 5.06, z = 5.71 ± 3.58) and right V5/MT (mean coordinates: x = 47.25 ± 3.62, y = −67.32 ± 4.35, z = 4.75 ± 3.95) of each individual participant were transformed back to individual space for the neuronavigated TMS. The average induced electric field at the mean coordinates is visualized in Figure 2.
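The peak-selection step could look as follows in R. This is a hedged sketch only: the actual pipeline used SPM12 in MATLAB, the RNifti package is an assumed helper, and the file names are hypothetical.

```r
library(RNifti)  # assumed helper package; the original analysis used SPM12

tmap <- readNifti("spmT_moving_vs_static.nii")  # first-level t-map (hypothetical name)
mask <- readNifti("group_V5MT_cluster.nii")     # group-level cluster mask (hypothetical)

vals <- as.array(tmap)
vals[as.array(mask) <= 0] <- -Inf               # restrict the search to the group cluster

peak_vox <- which(vals == max(vals), arr.ind = TRUE)[1, ]  # voxel with highest t-value
peak_mni <- voxelToWorld(peak_vox, tmap)        # convert voxel indices to MNI mm
```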
Transcranial magnetic stimulation
TMS procedure and neuronavigation.
Offline TMS was administered in each participant's second and third experimental sessions. We decided on offline stimulation for several reasons. First, online TMS protocols were ill suited for a visual speech experiment, as we planned to disrupt motion processing for the whole stimulus length: single pulses would not have covered the whole stimulus duration, while a sufficiently long pulse train would have required a long intertrain interval incompatible with n-back and similar tasks. Second, we planned to inhibit area V5/MT bilaterally. Because the stimuli were presented in both visual hemifields, online TMS was not feasible, as two coils placed over both scalp sites would have physically interfered with each other. Third, in a planned future study, we aimed to transfer the design to an fMRI experiment following TMS, for which offline stimulation of area V5/MT would be needed. Each participant's active motor threshold (AMT) was measured at the beginning of the first TMS session to determine their individual stimulation intensity. Single TMS pulses, starting at an intensity of 50% of maximum stimulator output, were applied to the left primary motor cortex (M1) while the participant lightly contracted the right abductor pollicis brevis (thumb) muscle by pressing the thumb and index finger together; the stimulation intensity was then decreased stepwise to find the lowest intensity at which 5 of 10 pulses resulted in a motor evoked potential (MEP) above a predetermined threshold of 50 μV. Motor thresholds are a stable parameter within individuals across several days (Kimiskidis et al., 2004; Rossi et al., 2009; Perellón-Alfonso et al., 2018). The M1 target coordinates were converted from a standard site (x = −37, y = −21, z = 58) from a meta-analysis (Mayka et al., 2006).
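A minimal sketch of this descending threshold logic is given below. The step size is an assumption (the text does not specify it), and `deliver_pulse` is a hypothetical stand-in for the stimulator/EMG interaction.

```r
# Hedged sketch of the descending AMT search. 'deliver_pulse' is a hypothetical
# function returning the peak-to-peak MEP amplitude (in microvolts) evoked by
# one pulse at the given intensity; a 1% step size is assumed.
find_amt <- function(deliver_pulse, start = 50, step = 1, mep_uv = 50) {
  intensity <- start
  repeat {
    meps <- replicate(10, deliver_pulse(intensity))  # 10 test pulses
    if (sum(meps >= mep_uv) < 5)                     # fewer than 5/10 MEPs:
      return(intensity + step)                       # previous level was the AMT
    intensity <- intensity - step                    # otherwise keep descending
  }
}
```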
To stimulate the vertex area, two sites 1.5 cm anterior and posterior of the computed vertex coordinate were determined. Based on the individual structural MR images, the vertex coordinates were transformed into individual space from MNI coordinates corresponding to the location of the Cz electrode in the 10–20 EEG system (x = 0.8, y = −14.7, z = 73.9) from a study by Okamoto et al. (2004) who projected 10–20 standard cranial positions over the MNI cortical surface.
We used stereotactic neuronavigation (BrainSight, Rogue Research) to precisely position the TMS coil over each individual's V5/MT, vertex, and M1 coordinates. The coil was placed over the BrainSight-indicated entry points of the respective coordinates, that is, the sites on the participant's scalp with the shortest distance to the target coordinates. We counterbalanced whether participants received stimulation over the anterior or the posterior vertex site first and whether the left or the right V5/MT was stimulated first. Immediately after stimulation of both hemispheres, the participant started the behavioral experiment.
TMS parameters.
Participants received continuous theta-burst stimulation (cTBS) with a MagPro X100 stimulator (MagVenture) and a figure-of-eight MCF-B65 coil (double-circle diameter = 75 mm). For each site, 200 trains of 60 ms duration were applied, each containing three pulses separated by 15 ms intervals (600 pulses in total). Each train was followed by a 140 ms intertrain interval, resulting in an overall stimulation duration of 40 s per site. The stimulation was applied at an intensity of 100% of the AMT.
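The reported pulse count and duration follow directly from these parameters, as this small R consistency check shows:

```r
n_trains         <- 200
pulses_per_train <- 3
train_ms         <- 60     # duration of one three-pulse train
iti_ms           <- 140    # intertrain interval

n_trains * pulses_per_train            # 600 pulses in total per site
n_trains * (train_ms + iti_ms) / 1000  # 40 s of stimulation per site
```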
We used this stimulation protocol and control design because a pilot study with a similar design was neither effective nor interpretable: 40% of pilot participants reported the effective stimulation as feeling more intense than the sham stimulation. Because of the potentially confounding effects of such differences in perceived stimulation intensity between effective and sham TMS conditions, we opted in the current study to use active stimulation of the vertex, rather than sham stimulation, as the experimental control condition. Moreover, with a sham control, a potential main effect of stimulation could be driven by the presence versus absence of electromagnetic pulses alone and would not allow any conclusion about whether the stimulation effect is specific to the stimulated area. The current study did not include any of the pilot experiment participants. The pilot experiment methods and results are summarized at https://osf.io/9b4fn/.
Behavioral experiment
Stimuli.
RDKs were used as stimuli in the motion direction task. The RDKs consisted of a fixation point in the center of the screen surrounded by white dots (∼0.05° diameter; density = 0.35 dots/°²) moving at a speed of 3.15°/s. We created videos for 48 directions that were allocated equally over the 360° space. Since articulator movements in the visual speech videos started 10 frames (200 ms) after video onset on average (Díaz et al., 2018), we constructed the RDKs so that the dots also remained static for the first 200 ms of each video. To ensure that the pattern of the static dots would not predict the subsequent motion direction, we generated seven videos per direction (i.e., 336 videos in total). Fifty percent of the dots moved coherently in the respective direction, and the remaining dots moved randomly. All videos had a duration of 1900 ms. Videos were generated using MATLAB R2019b (version 9.6.0.1047502) and Psychtoolbox (version 3.0.15; Brainard, 1997; Pelli, 1997; Kleiner et al., 2007) using the "limited lifetime" algorithm (Pilly and Seitz, 2009).
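The following R sketch illustrates one frame update of such a limited-lifetime RDK. It is an illustration only, not the original MATLAB/Psychtoolbox code; the frame rate, field size, and dot lifetime are assumptions, whereas the coherence, speed, and density follow the text.

```r
fps    <- 60                      # refresh rate (assumed)
field  <- 10                      # square field width in deg (assumed)
n_dots <- round(0.35 * field^2)   # density of 0.35 dots per square degree
step   <- 3.15 / fps              # 3.15 deg/s converted to deg per frame
dir    <- 135 * pi / 180          # one of the 48 directions (7.5 deg apart)
coh    <- 0.5                     # 50% of dots move coherently
life   <- 10                      # dot lifetime in frames (assumed)

x   <- runif(n_dots, 0, field)
y   <- runif(n_dots, 0, field)
age <- sample(0:(life - 1), n_dots, replace = TRUE)

update_frame <- function(x, y, age) {
  signal <- runif(n_dots) < coh                   # coherent dots, re-drawn each frame
  ang <- ifelse(signal, dir, runif(n_dots, 0, 2 * pi))
  x <- (x + step * cos(ang)) %% field             # wrap at field edges
  y <- (y + step * sin(ang)) %% field
  age <- age + 1
  dead <- age >= life                             # expired dots are replotted
  x[dead] <- runif(sum(dead), 0, field)
  y[dead] <- runif(sum(dead), 0, field)
  age[dead] <- 0
  list(x = x, y = y, age = age)
}
```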
For the visual speech task, stimuli were taken from an in-house stimulus database of video recordings of German native speakers uttering short syllables (Borowiak et al., 2018; Díaz et al., 2018). We selected 48 muted syllable videos of one female speaker in front of a black background. The syllables had a vowel-consonant-vowel structure combining the vowels /a/, /e/, and /u/ with the consonants /f/, /p/, /n/, /r/, /s/, and /t/. The vowels and consonants corresponded to discriminable viseme classes in German (see Aschenberner and Weiss, "Phoneme-viseme mapping for German video-realistic audio-visual speech synthesis," IKP Working Paper NF 11; available at: https://www.semanticscholar.org/paper/1-Phoneme-Viseme-Mapping-for-German-Video-Realistic-Aschenberner/c1535ec43b8d0c261aaa9ea4a85915a9f7fe9850). Videos were on average 1.89 ± 0.19 s long and started and ended with the speaker's mouth closed.
Task procedure.
Each block of the behavioral experiment started with a 1000-ms-long instruction screen presenting the word "direction" (German: "RICHTUNG") or "syllable" (German: "SILBE"), which informed participants which task to perform, followed by a 500-ms-long black screen (Fig. 1b). During the motion direction task, participants watched the RDKs and indicated by right-hand button press whether the dots moved in the same or a different direction than those in the previous video. Similarly, in every "syllable" block, the visual speech videos were displayed, and participants indicated whether the uttered syllable in the muted video was the same as or different from the one in the preceding video. Each block contained 28 trials separated by a 300-ms-long intertrial interval. The match/mismatch rate (i.e., same/different syllable/direction) was 50% overall, jittered from 40% to 60% across blocks, corresponding to a chance accuracy level of 50%. A further aim of our randomization was to keep both tasks similar in design and to ensure an equivalent level of difficulty between tasks as well as between blocks. Therefore, in every block of the motion direction task, all presented directions belonged to a single common quadrant within the field of view (e.g., between 0° and 90°) and had to be at least 15° apart; a sketch of how such direction sets could be drawn follows below. For each block of the visual speech task, four different syllables were chosen that all started with the same viseme. Thus, each block of the two tasks contained either four different syllables or four different directions.
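A minimal R sketch of the direction sampling constraints described above; the exact randomization routine is not specified in the text, so the rejection-sampling loop is an assumption.

```r
# Draw four motion directions for one block: all within one quadrant
# and at least 15 deg apart. The 48 possible directions are spaced
# 7.5 deg apart (360 / 48).
sample_block_dirs <- function() {
  quadrant <- sample(c(0, 90, 180, 270), 1)         # e.g., 0-90 deg
  dirs <- seq(quadrant, quadrant + 82.5, by = 7.5)  # 12 candidate directions
  repeat {
    pick <- sort(sample(dirs, 4))
    if (all(diff(pick) >= 15)) return(pick)
  }
}

sample_block_dirs()  # e.g., 7.5 30.0 52.5 82.5
```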
The behavioral experiments (i.e., poststimulation) consisted of 24 blocks overall, while the baseline task for the two TMS sessions (i.e., prestimulation) consisted of 12 blocks. For each participant, syllables and motion directions that had not been presented in the baseline task at the start of the first TMS session were used in the baseline task during the second TMS session. Before the baseline task, participants completed four additional practice blocks, each containing eight trials with feedback, to become familiar with the task. The assignment of the left and right buttons to the match and mismatch conditions was counterbalanced between participants. The experiment was run using Labvanced (Finger et al., 2017) software. Stimuli were presented on a 53 × 30 cm computer screen at a 60 cm distance from participants. At the end of the final session (Session 4), participants completed a questionnaire on potential differences between the TMS sessions, TMS side effects, and strategies used in the behavioral experiment.
Hypotheses and data analysis
Our main hypothesis (Hypothesis 1) was that the application of TMS over area V5/MT would lead to an increase in response times in both tasks compared with the application of TMS over the vertex area. In addition (Hypothesis 2), we expected that TMS over V5/MT would lead to a decreased practice effect (i.e., a smaller decrease in response times over time) for both tasks in contrast to TMS over the vertex region. This prediction was based on prior findings that repeated testing of visual speech leads to an improvement of performance over time (Lander and Davies, 2008; Riedel et al., 2015). The same applies to motion perception within RDKs (Zanker, 1999).
The dependent variable in our analyses was response time, because TMS typically influences response times rather than accuracy; response time is therefore considered the standard measure for assessing effects of TMS (Pascual-Leone et al., 1996; Ashbridge et al., 1997; Sack et al., 2007; Hartwigsen et al., 2017). We tested our hypotheses with linear mixed-effects models in R version 3.6.3 (see R Foundation for Statistical Computing, "R: a language and environment for statistical computing"; available at: https://www.r-project.org/). The mixed-effects models for both tests included fixed effects of Task (motion direction vs visual speech) and Stimulation (V5/MT vs vertex). To test Hypothesis 1, we modeled response times in the behavioral tasks completed during the two TMS sessions. Response times were computed as the time from video onset until the participant's response. To rule out falsely assigned responses, we excluded response times <200 ms. Incorrect responses were also excluded from the response time analysis. To test Hypothesis 2, we compared performance in the baseline tasks and the behavioral experiment of the two TMS sessions. Specifically, we computed a ratio by dividing the response times of the behavioral experiment by the response times of the respective baseline for each participant, task, and stimulation condition. The random-effects structure of the mixed-effects models was determined using backward model selection, in which the random-effects terms that accounted for the least variance were removed one by one until the fitted model was no longer singular, that is, until no variance of any linear combination of random effects was estimated to be zero. Both models included a random intercept by Subject and a random slope by Subject for the Task factor, as well as a random intercept by Stimulus. For exploratory purposes, we also analyzed accuracy (percentage correct) in the behavioral tasks using the same linear mixed-effects modeling approach. The accuracy model contained a random intercept by Subject and a random slope by Subject for the Stimulation factor, as well as a random intercept by Stimulus.
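In lme4-style syntax, the models described above could be specified as follows. The text names R 3.6.3 but not the modeling package, so lme4 is an assumption, as are the data frame and column names.

```r
library(lme4)

# Hypothesis 1: response times after stimulation; 'trials' is a hypothetical
# long-format data frame with one row per trial
rt_data <- subset(trials, correct == 1 & rt_ms >= 200)  # exclusions as in the text
m1 <- lmer(rt_ms ~ task * stimulation +
             (1 + task | subject) + (1 | stimulus),
           data = rt_data)

# Hypothesis 2: post/pre response time ratio, computed per participant,
# task, and stimulation condition, modeled with the same structure
m2 <- lmer(rt_ratio ~ task * stimulation +
             (1 + task | subject) + (1 | stimulus),
           data = ratio_data)

# Exploratory accuracy model: random slope for Stimulation instead of Task
m3 <- lmer(accuracy ~ task * stimulation +
             (1 + stimulation | subject) + (1 | stimulus),
           data = acc_data)
```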
Results
Our primary hypothesis was that the application of inhibitory TMS over V5/MT would increase the time it took participants to recognize both visual speech and motion direction compared with inhibitory TMS over the vertex region. The linear mixed-effects model on response times revealed a significant main effect of Stimulation, confirming our hypothesis: stimulation of V5/MT increased recognition response times compared with stimulation of the vertex region (β = 16.42; t = 4.13; p < 0.001; 95% CI = 8.64, 24.20; d = 0.05; Fig. 3). Compared with vertex stimulation, V5/MT stimulation delayed the recognition of visual speech by 26 ms (SE = 4 ms) on average, and the recognition of motion direction by 16 ms (SE = 4 ms) on average (Fig. 3b).
In addition, the model revealed a significant main effect of Task (i.e., participants responded faster in the motion direction task than in the visual speech task; β = 560.98; t = 18.64; p < 0.001; 95% CI = 502.00, 619.96; d = 4.06). This was expected, as the critical viseme in the visual speech task occurred later than the motion onset in the motion task. The interaction of Stimulation and Task factors was not significant (β = 9.41; t = 1.684; p = 0.092; 95% CI = −1.55, 20.37; d = 0.02). The full mixed-model results are shown in Table 1.
Our second hypothesis was that V5/MT stimulation relative to vertex stimulation would diminish practice effects in both behavioral tasks (i.e., lessen the reduction in response times between baseline performance and the main behavioral performance). To account for such practice effects and for intersubject variability in response times, we computed the ratio of response times measured during the behavioral experiment (i.e., poststimulation) to response times measured during baseline (i.e., prestimulation) performance. The mean prestimulation and poststimulation response times for each task and stimulation condition are shown in Table 2. A more positive ratio indicated a greater TMS effect or a smaller practice effect over time. The mixed-effects model revealed a significant main effect of Stimulation: V5/MT stimulation resulted in a larger response time ratio than vertex stimulation (β = 0.02; t = 5.658; p < 0.001; 95% CI = 0.01, 0.03; d = 0.15; Fig. 4). Thus, participants showed less improvement between the prestimulation baseline and the poststimulation experiment when area V5/MT was stimulated than when the vertex was stimulated. The full model results are shown in Table 3.
An exploratory mixed-effects model of accuracy scores indicated no significant main effect of Stimulation (β = 0; t = −0.419; p = 0.676; 95% CI = −0.03, 0.02; d = 0.15). In contrast, we found a significant main effect of Task on accuracy scores (β = 0.03; t = 3.013; p = 0.003; 95% CI = 0.01, 0.05; d = 0.51). Participants were more accurate in recognizing visual speech (mean = 86.1% correct, SE = 0.01%) than motion direction (mean = 84.5% correct, SE = 0.01%). Finally, the model showed a significant interaction of Task and Stimulation (β = −0.03; t = −3.207; p = 0.001; 95% CI = −0.04, −0.01; d = 0.04). The interaction was driven by a greater difference in accuracy between tasks during vertex stimulation (β = 0.03; z = 3.01; p = 0.014) compared with V5/MT stimulation (β = 0.00; z = 0.35; p = 0.98), as well as a significant effect of stimulation on visual speech recognition accuracy (β = 0.03; z = 2.57; p = 0.0497), but not on motion direction accuracy (β = 0.00; z = 0.42; p = 0.98). The model results are shown in Table 4 and Figure 5.
We also calculated the ratio of poststimulation to prestimulation accuracy by dividing the accuracies following stimulation by the mean accuracies during baseline (i.e., prestimulation), analogous to the response time ratio. A more positive ratio indicated a smaller TMS effect or a greater practice effect over time. We found a significant main effect of Stimulation (β = 0.04; t = 5.223; p < 0.001; 95% CI = 0.03, 0.06; d = 0.06). Participants showed a larger practice effect in the V5/MT stimulation condition (mean = 1.06, SE = 0.03) than in the vertex stimulation condition (mean = 1.01, SE = 0.03). Moreover, there was a significant main effect of Task (β = 0.05; t = 2.227; p = 0.03; 95% CI = 0.01, 0.10; d = 0.73), revealing a larger practice effect for the visual speech task (mean = 1.07, SE = 0.04) than for the motion recognition task (mean = 1.01, SE = 0.02). The full model results can be found at https://osf.io/9b4fn/.
Based on the questionnaire completed following the TMS sessions, 67% of participants noticed no differences between the two TMS sessions, whereas ∼13% correctly perceived the V5/MT stimulation as more intense or as increasing task difficulty. Reported side effects were mild headache (12.5% of participants), fatigue (8.33%), and neck tension (8.33%). Supplementary material can be found at https://osf.io/9b4fn/.
Discussion
The current study used inhibitory TMS to investigate the causal relevance of cortical area V5/MT for the recognition of visual speech. Participants completed visual speech and nonbiological motion recognition tasks after undergoing TMS over V5/MT and a control site. We found that V5/MT stimulation slowed the recognition of both visual speech and simple, nonbiological motion relative to control TMS. Moreover, it reduced practice effects on response times. These findings support our main hypothesis that area V5/MT causally influences the recognition of visual speech.
To our knowledge, no previous study has investigated the causal contribution of area V5/MT to visual speech recognition. Such a causal contribution is consistent with neuroimaging studies that report V5/MT responses during visual speech and biological motion recognition (Calvert and Campbell, 2003; Paulesu et al., 2003; Peuskens et al., 2005; Borowiak et al., 2018). Visual speech recognition performance typically increases over the course of a task (Lander and Davies, 2008; Sánchez-Panchuelo et al., 2012). Our findings show that V5/MT stimulation also interferes with such improvement.
The stimulation effect on V5/MT raises the question of how and to what extent V5/MT interacts with other cerebral regions and their respective functions during visual speech recognition. Findings in macaque brains have suggested a dual neural processing route for biological motion: one route for motion information via V5/MT, and a second route for visual form that bypasses V5/MT (Bullier and Morel, 1990). Studies in humans corroborate this evidence (Miller et al., 2018). For instance, Mather et al. (2016) report reduced recognition of motion direction for dots displaying a general drift, but not for point-light walkers, following V5/MT stimulation, indicating that certain types of biological motion (i.e., walking) may not necessarily depend on V5/MT.
In contrast, our experiment showed that V5/MT is a critical component for recognizing visual speech, which points toward an interconnected network rather than separate routes. Prior fMRI findings corroborate this suggestion by revealing structures within both ventral and dorsal visual pathways that contribute to visual speech recognition (Campbell et al., 2001; Bernstein et al., 2011; Files et al., 2015). For instance, parts of the pSTS/pSTG respond specifically to multimodal speech processing (Beauchamp et al., 2002; Grossman et al., 2005; Hall et al., 2005; Schultz et al., 2005), and the functional integrity of the pSTS/pSTG is relevant for visual speech recognition (Riedel et al., 2015). It remains uncertain whether other areas can compensate for an inhibition of V5/MT such that visual speech recognition remains intact. Moreover, research indicates a flexible use of certain areas and their functions when processing visual speech. Viewing natural speech engages a network that codes for both visual form and movement (Campbell et al., 1997), and neuronal mechanisms might thus be recruited depending on their benefit for the specific task and stimulus (Giese and Poggio, 2003; Thirkettle et al., 2009), in line with the contributions of the dorsal and ventral pathways. Arnal et al. (2009) suggest an interaction of the STS and motion-sensitive cortical areas: signals from auditory and motion-sensitive areas synchronize when syllables are articulated, and the STS reflects a feedback pathway for incongruent signals. Motion-sensitive areas might be engaged less when recognition relies on form-based strategies than when it draws on dynamic visual cues; for instance, the TVSA could be recruited for the former, as opposed to V5/MT for the latter. Stimulation effects might therefore depend on the task design; TMS might affect visual speech recognition across multiple speaker identities differently than within-speaker speech recognition.
Our findings extend previous studies using inhibitory TMS over V5/MT in visual motion recognition tasks, such as speed and direction discrimination, motion detection, and implied motion perception (Beckers and Zeki, 1995; Kourtzi and Kanwisher, 2000; Schenk et al., 2005; Laycock et al., 2007; McKeefry et al., 2008; Silvanto and Cattaneo, 2010; Grasso et al., 2018). As for visual speech recognition, response times in the motion direction task decreased between prestimulation and poststimulation to a smaller degree following TMS over V5/MT compared with vertex stimulation. Given that simple visual tasks such as the recognition of random dot motion lead to rapid improvement in performance (Ahissar and Hochstein, 1997; Zanker, 1999), we have shown that V5/MT stimulation can interfere with such practice effects. However, our exploratory accuracy analyses indicate a speed–accuracy trade-off in which TMS over V5/MT led to slower, but more accurate, responses relative to baseline performance.
Several prior studies have administered TMS to V5/MT using online stimulation protocols that are time locked to motion onsets, thus potentially aligning with sensitive periods for stimulation (Schenk et al., 2005; Sack et al., 2006; Laycock et al., 2007; McKeefry et al., 2008; Stevens et al., 2009; Alexander et al., 2018; Grasso et al., 2018). Early periods are thought to reflect direct communication between V5/MT and the visual sensory thalamus (Laycock et al., 2007), especially for high-frequency movement components such as those contained in visual speech (Beckers and Zeki, 1995; Grasso et al., 2018). Since recent research has demonstrated an important role of the visual thalamus for visual speech recognition (Díaz et al., 2018), we speculate that feedback between V5/MT and subcortical structures may also contribute to visual speech recognition. Our results also show that V5/MT can be inhibited by longer-lasting offline TMS protocols, complementing the few existing studies of this kind (Cai et al., 2014; Chakraborty et al., 2019). It is noteworthy, however, that in contrast to the response times, TMS lowered accuracy only in the visual speech task and not in the motion direction task. This is consistent with the view that TMS generally has stronger effects on response times (Ashbridge et al., 1997; Sack et al., 2007).
Active TMS conditions are often preferred to sham conditions as experimental controls, depending on the investigated brain region and the somatosensory side effects of stimulation (Loo et al., 2000; Duecker and Sack, 2015). Nevertheless, TMS effects in active control site designs can be explained either by inhibition of the target region or by excitation of the control region. We argue that, in the current study, TMS had an inhibitory effect on V5/MT rather than an excitatory effect on the vertex, for two reasons. First, to our knowledge, there are no reports in which the stimulation protocol used here for V5/MT (Cai et al., 2014; Chakraborty et al., 2019) has shown an excitatory rather than an inhibitory effect on response times. Second, vertex stimulation does not induce increased BOLD responses at the stimulated site and, instead, even decreases activation of regions within the default mode network (Jung et al., 2016).
Relatedly, although precise site localization in the current study was achieved using fMRI-guided neuronavigation, one cannot rule out stimulation of immediately adjacent areas. Kolster et al. (2010) defined three regions within the V5/MT cluster that adjoin the core V5/MT region: the ventral part of the medial superior temporal area, the fundus of the superior temporal area, and the V4 transitional zone. These regions are located deeper within the sulci than V5/MT and, although considered motion sensitive, are thought to respond more strongly to shape than V5/MT does. It remains unknown whether, depending on interindividual differences in V5/MT depth and location, neighboring areas were costimulated in some participants. Regardless, potential stimulation of shape-related regions in the V5/MT vicinity cannot explain the motion-related TMS effects observed in the current study.
In summary, we consider our results a promising first indication that V5/MT causally affects visual speech recognition. While neuroimaging studies have already revealed multiple sites associated with visual speech recognition processes, including V5/MT (Campbell et al., 2001; Calvert and Campbell, 2003; Paulesu et al., 2003; Hall et al., 2005; Blank and von Kriegstein, 2013; Borowiak et al., 2018), here we extend those findings by showing that V5/MT is causally involved, which is not possible using fMRI. Concerning biological motion, this may also argue against a dichotomy of involvement versus noninvolvement of V5/MT and suggests instead that the contribution of V5/MT depends on subtypes of biological motion and their respective parameters, such as velocity. This has already been suggested by findings on low-level simple motion, where the recognition of rapid motion was more susceptible to V5/MT inhibition than slower motion, especially early after stimulus onset (Grasso et al., 2018). Since visual speech rarely occurs in isolation, but rather within complex audiovisual environments, our results also highlight the importance of V5/MT for speech recognition in natural settings. Visual signals are a crucial component of audiovisual speech perception: seeing a talker improves the comprehension of their auditory speech (Sumby and Pollack, 1954), and incongruent visual speech signals can even interfere with speech recognition (McGurk and Macdonald, 1976). Moreover, individuals on the autism spectrum, who are characterized by decreased visual speech recognition abilities, have displayed reduced responses in visual movement areas including V5/MT, but not in other speech regions (Borowiak et al., 2018). The finding that V5/MT can causally impact visual speech recognition is a step toward a better understanding of the neural mechanisms supporting the perception of human communication signals, both in typically developed individuals and in populations with communication difficulties.
Footnotes
The study was funded by the European Research Council-Consolidator Grant SENSOCOM 647051 to K.v.K. L.J. is also supported by the Deutsche Forschungsgemeinschaft Grant 178833530 (CRC-940). We thank Moana Beyer and Kira Eckert for help with organizing and conducting the experiments.
The authors declare no competing financial interests.
Correspondence should be addressed to Lisa Jeschke at lisa.jeschke@tu-dresden.de