Abstract
The integral capacity of human language together with semantic memory drives the linkage of words and their meaning, which theoretically is subject to cognitive control. However, it remains unknown whether, across different language modalities and input/output formats, there is a shared system in the human brain for word-meaning binding and how this system interacts with cognitive control. Here, we conducted a functional magnetic resonance imaging experiment based on a large cohort of subjects (50 females, 50 males) to comprehensively measure the brain responses evoked by semantic processing in spoken and written word comprehension and production tasks (listening, speaking, reading, and writing). We found that heteromodal word input and output tasks involved distributed brain regions within a frontal-parietal-temporal network and focally coactivated the anterior lateral visual word form area (VWFA), which is located in the basal occipitotemporal area. Directed connectivity analysis revealed that the VWFA was invariably under significant top-down modulation of the frontoparietal control network and interacts with regions related to attention and semantic representation. This study reveals that the VWFA is a key site subserving general semantic processes linking words and meaning, challenging the predominant emphasis on this area's specific role in reading or more general visual processes. Our findings also suggest that the dynamics between semantic memory and cognitive control mechanisms during word processing are largely independent of the modalities of input or output.
SIGNIFICANCE STATEMENT Binding words and their meaning into a coherent whole during retrieval requires accessing semantic memory and cognitive control, allowing our thoughts to be expressed and comprehended through mind-external tokens in multiple modalities, such as written or spoken forms. However, it is still unknown whether multimodal language comprehension and production share a common word-meaning binding system in human brains and how this system is connected to a cognitive control mechanism. By systematically measuring brain activity evoked by spoken and written verbal input and output tasks tagging word-meaning binding processes, we demonstrate a general word-meaning binding site within the visual word form area (VWFA) and how this site is modulated by the frontal-parietal control network.
- controlled semantic cognition
- word-meaning binding
- visual word form area
- word comprehension and production
- fMRI
Introduction
Binding words to their meanings, a central process in language, depends on semantic memory, wherein concepts are extracted from or externalized into forms in different modalities, such as spoken or written words. Semantic memory comprises conceptual associations abstracted from experiences without reference to specific instances, including word meanings, objects and facts (Martin, 2007; Patterson et al., 2007; Binder and Desai, 2011; Forseth et al., 2018). When in the service of language, multiple lead-in processes access the semantic memory repository and drive word-meaning binding to enable verbal comprehension and production (Indefrey and Levelt, 2004; Forseth et al., 2018), which should be modulated by cognitive control (Binder and Desai, 2011). Nevertheless, it is unclear (1) where the modality-invariant word-meaning binding system is located in the brain and (2) how this system is subject to semantic control.
Attempts have been made to identify the neural correlates underlying semantic representational systems across input modalities. A strong resemblance has been demonstrated between two sets of semantic representations recalled by spoken and written stories during language comprehension (Deniz et al., 2019), and the left anterodorsal pars triangularis (Brodmann area 45, BA 45) is significantly correlated with cross-modal semantic similarity encoding (Liuzzi et al., 2017). Furthermore, previous literature suggests that word storage and retrieval mainly rely on ventrolateral temporal lobes, particularly the middle temporal gyrus (MTG), inferior temporal gyrus (ITG), and fusiform gyrus (FG; Binder et al., 2009; Davis and Gaskell, 2009). The ventrolateral temporal lobes include several regions for phonology-semantics-orthography linkage, including the lexical interface (Hickok and Poeppel, 2007; Forseth et al., 2018), the left ventral anterior temporal lobe [ATLv; also referred to as basal temporal language area (BTLA); Binney et al., 2010; Purcell et al., 2014], and the visual word form area (VWFA; Dehaene and Cohen, 2007, 2011). These anatomic and functional areas have long been considered heteromodal integration regions, and recent findings corroborate that posterior parts of the lateral MTG, ITG, and middle FG are centrally involved in representing lexical items independently of modalities (Forseth et al., 2018; Evans et al., 2019; Mattioni et al., 2020). Recent studies of epilepsy patients that used naming tasks to tag word processing found that the middle occipitotemporal cortex (i.e., FG and ITG) may function as a lexical semantic hub and play a crucial part in associating words with their meaning (Forseth et al., 2018; Binder et al., 2020).
Recent research proposed the controlled semantic cognition (CSC) framework, wherein semantic cognition is dependent on two principal interacting neural systems: one for semantic representation and another for controlling the activity within the representational system according to specific contexts (Lambon Ralph et al., 2017). According to the CSC, concepts arise from multidimensional verbal and nonverbal experiences encoded in modality-specific brain regions and the multimodal hub situated in the bilateral ATLv. The ATLv necessarily mediates the interactions among modality-specific attributes (Patterson et al., 2007; Lambon Ralph et al., 2017) and is engaged across different modalities (verbal and nonverbal semantics, auditory and visual input presentation; Visser et al., 2012; Rice et al., 2015). Prior research on cognitive control in general (O'Reilly et al., 2002; Badre et al., 2005; Badre and D'Esposito, 2009; Cole et al., 2013), semantic retrieval specifically (Thompson-Schill et al., 1997; Wagner et al., 2001), and disordered semantic control (Jefferies and Lambon Ralph, 2006; Rogers et al., 2015; Thompson et al., 2015) jointly indicates that the semantic control network is composed of prefrontal and temporoparietal regions. Recently, a large-scale meta-analysis clearly demonstrated that semantic control is dependent on a left-dominant distributed network consisting of the inferior frontal gyrus (IFG), posterior MTG/ITG, and dorsomedial prefrontal cortex (PFC), with the inferior parietal lobe (IPL) being occasionally activated for its key part in more domain-general cognitive control (Jackson, 2021).
Here, we investigated the existence of a task/modality-invariant word-meaning binding site and illustrated its interaction with semantic control. We hypothesized that semantic hub areas (especially the ventral temporal area) should act as a general word-meaning binding system and activity within this system is modulated by the semantic control network (Fig. 1A).
Materials and Methods
To test our hypotheses, we conducted an fMRI experiment using a paradigm specifically tagging word-meaning association with varying input and output modalities (acoustic or visual; Fig. 1B). We presented concrete nouns referring to familiar objects in daily life as speech sounds, visual words, or pictures to investigate the cortical processing of single-word retrieval.
Experimental design
Subjects
Number of subjects (N = 100) was determined by a prestudy statistical power analysis with the empirical parameters (intersubject variability = 0.3%, intrasubject variability = 0.75%) and simulation procedure described in a previous study (Desmond and Glover, 2002). The number of time-points in the simulated time series was selected as 128 which corresponded to the shortest one in the four task sections (i.e., reading section, see more details below). The results showed that a sample size of 39 reached a typical 80% statistical power, while a sample size of 100 reached 100% power at the voxel level activation with a signal change of 0.75%. The power provided by a sample size of 100 was still sufficient even with a relatively small signal change (90% power with a signal change of 0.5%). A recent study on fMRI data power prediction also suggested that a sample size of 70 already approached 100% power for the random field theory family-wise error rate inference for peak effects (Durnez et al., 2016). To obtain available data of 100 subjects, 101 healthy native Chinese speakers (50 females; 23.0 ± 2.2 years old) were recruited in this study as paid volunteers. One male subject was excluded because of large head motion (>2.5 mm). None of the subjects reported a history of mental disorder or any kind of language impairment, and they were all right-handed with normal hearing and normal (or corrected) vision. All subjects provided written informed consent approved by the local Institutional Review Board at Peking University.
Task procedure
Our study consists of two fMRI sessions, four sections (i.e., listening, reading, speaking, and writing) in each session, and two tasks (target task and baseline task) in each section. All subjects in this study completed two scan sessions in an MRI scanner on separate days (with an interval of 1.74 ± 1.69 d). Each scan session was composed of four task sections of single-word processing. Each of the four task modalities comprised a target (Lanm, Ranm, Spic, and Wpic) task to evoke explicit semantic processing and a baseline task (Lgdr, Rchr, Ssyl, and Wchr) to control other semantically irrelevant processes such as primary sensory-motor responses or sublexical orthographic encoding. In the listening section, the speech sounds of words or their time-reversed equivalents were presented in the animacy judgment task (Lanm) or gender judgment task (Lgdr), in which subjects were asked to judge whether the word they heard was a creature (for normal speech sounds) or to judge the gender of speakers (for time-reversed speech sounds). Auditory stimuli were presented binaurally using a pair of MRI-compatible headphones, which provided 25-dB/sound pressure level (SPL) attenuation of noise. Subjects were allowed to adjust the sound volume in a short testing scan before formal sessions. The auditory stimuli were presented initially at an 80 dB/SPL. The reading section consisted of an animacy judgment task (Ranm) and character number judgment (Rchr), in which subjects were required to judge whether the picture or word presented referred to a creature or to judge the number of characters in a normal word. All of the judgments were made by pressing corresponding buttons on a handheld keyboard. In the speaking section, pictures and syllable strings were presented in the overt picture naming task (Spic) or overt syllable repeating task (Ssyl), and both names of objects in the pictures and syllable strings were matched in length. 
When asked to overtly name the picture or repeat the syllable strings, subjects were instructed to speak while keeping their heads as still as possible to avoid head motion. The writing section consisted of an overt picture naming task (Wpic) and a character copying task (Wchr). The subjects were asked to name the pictures by writing down the corresponding words or to copy normal words or a meaningless symbol. Subjects wrote on an MRI-compatible tablet placed on their abdomen, and they were told to write using only their right hand and keep their other body parts as still as possible. The trajectories of handwriting were recorded in real time. For detailed procedures and the timing of each section, see Figure 1B,C.
Stimuli
Concrete nouns denoting familiar objects in daily life were presented as speech sounds, visual words or pictures to investigate the cortical processing of single-word comprehension and production in different modalities. Chinese words consist of one or more monosyllabic characters, and each square-shaped Chinese character is a combination of radicals or strokes. The speech sounds of nouns were recorded by a male and a female native Chinese speaker in a soundproof studio. The recordings were further edited for length and amplitude, and half of the recordings were time-reversed to generate acoustic counterparts. There were two sets of auditory stimuli (i.e., one normal set and one time-reversed set) for the two tasks in the listening section. Each set comprised 64 stimuli (half female and half male) that were randomly and evenly assigned to eight blocks. Pictures of objects were chosen from a standardized picture set (Snodgrass and Vanderwart, 1980) of black-and-white line drawings. In the speaking and writing sections, 64 and 40 pictures were presented, respectively, for naming tasks separately. In the reading sections, two sets of 56 normal words were presented separately for the judgment of animacy or number of characters, with words in text form being balanced for word frequency and the number of strokes across tasks. In the speaking section, there were 64 syllable strings for the overt repeating task, and the syllable strings consisted of two or three basic syllables (i.e., do, re, mi, fa, sol, la, ti). In the reading section, each visually presented word was a noun with two or three characters, while single-character nouns were used in the writing section to minimize subjects' movements in the scanner. In the writing section, the copying task involved 40 normal characters.
MRI data acquisition
MRI data were collected using a Siemens Prisma 3T scanner (Siemens Healthineers) with a 20-channel head coil. Thirty-three continuous axial slices that covered the whole brain were acquired using a T2*-weighted gradient-echo EPI sequence with the following parameters: TR/TE/FA = 2 s/30 ms/90°, matrix size = 64 × 64, in-plane image resolution = 3.5 × 3.5 mm, slice thickness = 4.2 mm. A high-resolution T1-weighted image was acquired for anatomic details with isotropic 1 mm resolution using the MPRAGE sequence (TR/TE/FA = 2.53 s/2.98 ms/7°).
Statistical analysis
fMRI data preprocessing
In short, two fMRI scan sessions were conducted for each subject. There were four sections in each scan session, including listening, speaking, reading, and writing. Each of the four sections included two tasks, one explicit semantic processing task and one baseline task. First-level statistical analysis was conducted for each of the four sections separately with each subject's combined data from the two fMRI scan sessions. For each subject, functional data from the two scan sessions were segmented into four task sections that were separately processed in the following analyses. For each section, realignment, segmentation and normalization were performed using SPM12 (http://www.fil.ion.ucl.ac.uk/spm). Functional data were registered to the first volume of the first session through rigid-body realignment and coregistered to T1 images. T1 images were segmented using tissue probability maps, which also returned mutual deformation fields between individual space and Montreal Neurologic Institute (MNI) standard space. Deformation fields were used to normalize functional data to a 2-mm isotropic resolution. A multiple linear regression with nuisance variables was performed on the normalized data. The six head-motion parameters (three for translation and three for rotation) and their first derivatives were included to minimize head-motion effects. The first three principal components of signals from white matter (WM) and CSF and their first derivatives were adopted to regress out nuisance signal components (e.g., cardiac and respiratory effects) based on the CompCor method (Behzadi et al., 2007). Session-related effects were also regressed out in this step. Individual WM and CSF masks were generated by eroding corresponding masks obtained in the segment step (two voxels for WM masks and one voxel for CSF masks). 
It has been demonstrated that spurious functional connectivity (FC) arising from motion cannot be totally eliminated by realignment and regression of head-motion parameters (Power et al., 2012). Thus, functional data were further scrubbed by discarding volumes with framewise displacement (FD) larger than 0.5 mm. Isolated segments of fewer than four consecutive volumes remaining after scrubbing were also removed. The censored data were interpolated through a least-squares spectral decomposition of the uncensored volumes; that is, only the “good” data were used to restore signals at censored time points (Power et al., 2014). In total, 100 subjects were included in further data analyses after one male subject was excluded because of large head motion (>2.5 mm).
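For illustration only, the scrubbing rule described above can be sketched in Python. The sketch is not the pipeline used in this study: the function names, the assumption that rotations are given in radians, and the 50-mm sphere radius used to convert rotations to millimeters are all assumptions introduced here. The spectral interpolation of censored volumes is not covered.

```python
def framewise_displacement(motion, radius_mm=50.0):
    """FD per volume from six realignment parameters.

    motion: list of [tx, ty, tz, rx, ry, rz] rows, translations in mm and
    rotations in radians (assumed units). Rotations are converted to arc
    length on a sphere of radius_mm, an assumed head radius.
    """
    fd = [0.0]  # the first volume has no predecessor
    for prev, curr in zip(motion, motion[1:]):
        trans = sum(abs(c - p) for p, c in zip(prev[:3], curr[:3]))
        rot = sum(abs(c - p) * radius_mm for p, c in zip(prev[3:], curr[3:]))
        fd.append(trans + rot)
    return fd


def scrub_mask(fd, fd_thresh=0.5, min_run=4):
    """Return one keep (True) or discard (False) flag per volume.

    Volumes with FD above fd_thresh are discarded, and so is any run of
    surviving volumes shorter than min_run.
    """
    keep = [value <= fd_thresh for value in fd]
    start = None
    for i in range(len(keep) + 1):
        inside = i < len(keep) and keep[i]
        if inside and start is None:
            start = i  # a run of retained volumes begins
        elif not inside and start is not None:
            if i - start < min_run:  # run too short: discard it as well
                for j in range(start, i):
                    keep[j] = False
            start = None
    return keep
```

In this sketch a single motion spike flags two consecutive volumes, because FD is computed from volume-to-volume differences, and any retained segment shorter than four volumes next to the spike is dropped with it.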
Conjunction analysis
As mentioned before, the target (Lanm, Ranm, Spic, and Wpic) tasks evoked explicit semantic processing, and the baseline tasks (Lgdr, Rchr, Ssyl, and Wchr) controlled other semantically irrelevant processes, such as primary sensory-motor responses or sublexical orthographic encoding (Fig. 1B). Therefore, we first identified the brain activations critically underpinning word-meaning binding processing in the four sections (listening, speaking, reading, and writing) by contrasting the target task and baseline task of each section (Tables 2–5). A conjunction analysis (Price and Friston, 1997) was performed at the first level on each task contrast (target task>baseline task) of the four sections using the subject's individual statistical parametric maps of the minimum t statistic over the contrasts specified in each task. The conjunction analysis allowed us to demonstrate the task-invariant nature of regional responses because it preserved only voxels that were significant (thresholded) in all of the contributing SPM maps, thereby isolating responses that were commonly evoked by the contrasts of interest. This analysis informed us whether activations were jointly significant in a series of tasks. The results (Tables 6–10) were thresholded at voxel-level p < 0.001 with a cluster-level familywise error (FWE) corrected p < 0.05.
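The logic of a minimum-statistic conjunction can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the SPM12 procedure used here: the function name `conjunction_min_t`, the flat list representation of the maps, and the critical value passed as `t_crit` are hypothetical, and the cluster-level FWE correction is not implemented.

```python
def conjunction_min_t(t_maps, t_crit):
    """Minimum-statistic conjunction over several contrast maps.

    t_maps: list of equally long lists, one t value per voxel per contrast.
    t_crit: assumed critical t value for the voxel-level threshold.
    Returns the voxelwise minimum t and a mask of voxels where that
    minimum exceeds t_crit, i.e., every contrast is above threshold.
    """
    if len({len(m) for m in t_maps}) != 1:
        raise ValueError("all maps must cover the same voxels")
    min_t = [min(values) for values in zip(*t_maps)]
    mask = [value > t_crit for value in min_t]
    return min_t, mask


# Toy maps with three voxels for the four sections (made-up numbers).
listening = [4.0, 5.0, 1.0]
speaking = [3.5, 2.0, 6.0]
reading = [4.2, 5.5, 3.9]
writing = [3.8, 4.1, 4.4]
min_t, mask = conjunction_min_t([listening, speaking, reading, writing], 3.2)
```

Only the first toy voxel survives, because it is the only one whose smallest t value across all four contrasts exceeds the assumed threshold.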
Granger causality analysis (GCA)
GCA was used to investigate the causal interactions between regions of interest (ROIs) for each task, and differences between the total causal influence strength flowing into and out of a given ROI were also calculated (i.e., in-out degree; Blinowska et al., 2004). After subcortical areas were excluded, cortical areas were selected based on group-level SPM activation maps (target task>baseline task) as the ROIs used in GCA. As many ROIs as possible were selected to cover the entire activated regions (Tables 11, 12). ROIs were further classified into large-scale functional networks according to the maximal spatial overlap with Yeo's 7-network cortical parcellation (Yeo et al., 2011). Specifically, we extracted a 10-mm cube for each ROI around its group-level peak coordinates, calculated the proportion of voxels belonging to each network and assigned this ROI to the network with the largest proportion. For individual subjects, the location of each cortical node was defined by searching for the peak activation (uncorrected voxel-level p < 0.05) within a 10-mm-radius cube centered at the peak coordinates of group activation results, masked by the corresponding automated anatomic labeling templates (Tzourio-Mazoyer et al., 2002) to avoid contribution from adjacent areas. The time series of each node were extracted as the principal eigenvariate of an activated cluster within a 6-mm radius sphere centered around the individual peak coordinates. Because the specific location of activation may vary across subjects, this procedure guaranteed comparability between models via the application of functional and anatomic constraints in the extraction of time series (Harrison and Tong, 2009). All 100 subjects were included in GCA. For each subject, we extracted time series of eight ROIs for listening, 14 ROIs for speaking, nine ROIs for reading, and 15 ROIs for writing from the explicit semantic processing tasks only (Tables 11, 12).
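The rule for assigning an ROI to a large-scale network (largest proportion of cube voxels) can be illustrated with a minimal Python sketch. The function name `assign_network`, the use of `None` for voxels outside the parcellation, and the network label strings are assumptions made for this example; reading the parcellation image itself is not shown.

```python
from collections import Counter


def assign_network(voxel_labels):
    """Assign an ROI to the network covering most of its voxels.

    voxel_labels: network label of every voxel in the ROI cube; None marks
    voxels outside the parcellation (an assumed convention).
    Returns (network, proportion of all cube voxels in that network).
    Ties are resolved in favor of the label encountered first.
    """
    counts = Counter(label for label in voxel_labels if label is not None)
    if not counts:
        return None, 0.0
    network, n_voxels = counts.most_common(1)[0]
    return network, n_voxels / len(voxel_labels)
```

For a toy cube of ten voxels with six labeled "Frontoparietal", three labeled "DorsalAttention", and one unlabeled, the sketch assigns the ROI to "Frontoparietal".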
GCA was performed using the Multivariate Granger Causality Toolbox (Barnett and Seth, 2014). The order of the autoregressive model used to obtain the influence measure was determined using the Bayesian information criterion (Schwarz, 1978), and the GC values between pairwise ROIs were calculated. F tests were performed to obtain the statistical significance of inflow/outflow strength at a threshold of p < 0.05 with Bonferroni correction for multiple comparisons. We also calculated the sum of GC values flowing into and out of a region and computed the differences between them to measure the information net inflow for a given ROI (in-out degree). A 10,000-times permutation test (the original time series of each ROI was randomly shuffled, and the in-out degree was recalculated) was used to obtain the significance threshold of the in-out degree.
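To make the quantities above concrete, the following Python sketch computes a pairwise Granger causality value and the in-out degree of an ROI. It is a deliberately simplified illustration, not the Multivariate Granger Causality Toolbox procedure: it assumes a fixed lag-1, unconditional bivariate model instead of a BIC-selected multivariate model, omits the F tests and the permutation test, and all function names are hypothetical.

```python
import math
import random


def _residual_variance(target, predictors):
    """Least-squares residual variance of target on one or two predictors
    (no intercept; the inputs are assumed to be demeaned)."""
    if len(predictors) == 1:
        p = predictors[0]
        beta = sum(a * t for a, t in zip(p, target)) / sum(a * a for a in p)
        resid = [t - beta * a for t, a in zip(target, p)]
    else:
        p, q = predictors
        spp = sum(a * a for a in p)
        sqq = sum(b * b for b in q)
        spq = sum(a * b for a, b in zip(p, q))
        spt = sum(a * t for a, t in zip(p, target))
        sqt = sum(b * t for b, t in zip(q, target))
        det = spp * sqq - spq * spq  # 2 x 2 normal equations, Cramer's rule
        b1 = (spt * sqq - sqt * spq) / det
        b2 = (sqt * spp - spt * spq) / det
        resid = [t - b1 * a - b2 * b for t, a, b in zip(target, p, q)]
    return sum(r * r for r in resid) / len(target)


def _demean(series):
    mean = sum(series) / len(series)
    return [value - mean for value in series]


def granger_lag1(source, target):
    """GC from source to target: log ratio of the residual variance of the
    target predicted without versus with the source's past (lag 1 only)."""
    s, t = _demean(source), _demean(target)
    restricted = _residual_variance(t[1:], [t[:-1]])
    full = _residual_variance(t[1:], [t[:-1], s[:-1]])
    return math.log(restricted / full)


def in_out_degree(gc, roi):
    """Sum of GC flowing into roi minus the sum flowing out of it.
    gc[i][j] holds the GC value from ROI i to ROI j."""
    n = len(gc)
    inflow = sum(gc[i][roi] for i in range(n) if i != roi)
    outflow = sum(gc[roi][j] for j in range(n) if j != roi)
    return inflow - outflow


# Toy check: x drives y with a one-sample delay, so GC(x -> y) should dominate.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 500)]
gc_xy = granger_lag1(x, y)
gc_yx = granger_lag1(y, x)
```

In the toy data the GC value from x to y is large and the value from y to x is close to zero. A positive in-out degree means that an ROI receives more causal influence than it sends.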
Results
Activation patterns of spoken-/written-language comprehension and production tasks
The behavioral results during the fMRI scan are shown in Table 1. These results indicated that subjects successfully completed all tasks and that the task-evoked responses were reliable. Contrasted activations tagging the word-meaning binding for spoken and written input and output and their coactivations are shown in Figure 2 and summarized in Tables 2–5. Generally, the activations involved the ventrolateral and posterior medial PFC, posterior dorsal parietal lobe, middle/posterior ventral occipitotemporal regions, and ventromedial occipital lobe (OCC).
Specifically, as shown in Figure 2A, cortical activations (Table 2) for speech comprehension (listening) included the inferior frontal junction (IFJ) and the dorsal premotor cortex (PMd) that falls into BA6, the right anterior middle frontal gyrus (MFGa), the bilateral posterior intraparietal sulcus (IPSp), the precuneus (PCN), and the left middle occipitotemporal sulcus (OTSm). In the spoken-language production task (Fig. 2A; Table 3), more cortical activations were detected than in the comprehension task (i.e., listening). Specifically, apart from regions found for the listening task contrast, other frontal regions were involved, including the bilateral anterior IFG (IFGa), bilateral posterior MFG (MFGp), bilateral presupplementary motor area (preSMA), and left dorsomedial PFC (PFCdm), yet the left PMd, which was found in listening contrast, was absent. For the posterior brain regions found in the listening section, the bilateral PCN and left OTSm were also involved in the speech production task, while bilateral IPSp regions were not. Activations were also found in visual association areas, including the right OTSm and the bilateral OCC.
The written-language comprehension task (reading), as shown in Figure 2A and Table 4, evoked cortical activations overlapping with those found in the spoken-language comprehension and production tasks (listening and speaking). Specifically, similar to the regions found in spoken language tasks, the reading task contrast engaged the bilateral IFJ, the left IFGa, the left PMd, the bilateral preSMA, the left IPSp, the left OTSm and the bilateral OCC. Notably, reading specifically recruited the left Sylvian fissure at the parietal-temporal boundary (Sylvian parietal temporal, SPT), which is suggested to be a sensory-motor integration interface (Hickok and Poeppel, 2007). These activations contain areas previously reported to be involved in reading, namely, Broca's area, implicated in language comprehension and production; the ventral occipitotemporal cortex, which includes the VWFA (Cohen et al., 2000; Dehaene et al., 2005; Dehaene and Cohen, 2011); and IPS regions that may be a source of top-down modulation (Planton et al., 2013). Notably, the left IFJ cluster contains part of the left middle frontal region (BA 9/46), which has been identified as a region specific to Chinese reading processing and is correlated with Chinese dyslexia (Siok et al., 2004). For the written-language production (writing) task (Fig. 2A; Table 5), brain area activations also included the aforementioned areas in the other three sections. In terms of bilateral activations, the writing task in this study activated the bilateral MFGa, preSMA, IFJ, IFGa, IPS, OTSm, and OCC. The left MFGp and SPT were also activated, which conformed with the findings of a previous meta-analysis of writing (Planton et al., 2013). Concretely, cortical activations in our written-language production section included writing-specific areas (left PMd, left IFJ, and left IPS), a general motor area (preSMA) and areas for linguistic processes (IFGa and OTSm).
To further investigate the common brain substrates shared across comprehension tasks, production tasks, spoken-language tasks, and written-language tasks, conjunction analyses were conducted at the first level on each contrast tagging the association between words and meaning in comprehension tasks (listening, reading) and production tasks (speaking, writing). Comprehension tasks and spoken-language task coactivations were all left hemispheric and involved the left OTSm and the posterior dorsal parietal lobe, while spoken-language tasks additionally recruited the left medial temporal lobe (MTL; Fig. 2B; Tables 6, 8). Furthermore, the production tasks and written-language tasks (Fig. 2B; Tables 7, 9) only coactivated the left OTSm. In addition to activating the left OTSm, the production tasks coactivated the bilateral preSMA, bilateral PFCdm, bilateral IFGa, bilateral IFJ, bilateral OCC, and left MFGa; moreover, the written-language tasks coactivated the bilateral preSMA, right PFCdm, bilateral IFGa, bilateral IFJ, bilateral IPS, and bilateral OCC.
These findings lend support to the view that the semantic control system overlaps with general executive control and working memory (Lambon Ralph et al., 2017). Bilateral IFJs, especially the right IFJ, are proposed to engage in central cognitive control (i.e., to facilitate goal-directed actions and suppress inappropriate actions; Brass et al., 2005; Braver et al., 2009) and attention control (Zhang et al., 2018; Zhou et al., 2020). Importantly, the IFJ is part of the mid-posterior ventrolateral PFC, which has been repeatedly attested to support domain-general control of semantic selection (Badre et al., 2005). Furthermore, the storage and retrieval of lexical items are suggested mainly to involve the lateral and ventral temporal lobes to which the left OTSm belongs. In addition to the left IFGa, the right IFGa was activated (BA 45/47), and this area has been attested to enhance naming tasks by influencing phonological processing (Naeser et al., 2011). Additionally, the right IFGa is argued to implement response inhibition, a critical aspect of executive functions (Aron et al., 2003). The MFGa, as part of the semantic CON, is also closely related to working memory, enabling previously selected task sets to be held in a pending state for subsequent automatic retrieval and execution on completion of the ongoing task (i.e., cognitive branching; Koechlin and Hyafil, 2007). The posterior parietal parts found here, while being an important part of the human semantic system (Binder et al., 2009; Binder and Desai, 2011), were suggested to support attentional working memory maintenance (Christophel et al., 2018) and episodic memory retrieval (Cavanna and Trimble, 2006).
Spoken-language and written-language comprehension and production tasks coactivate VWFA
To investigate whether there exists a modality-independent word-meaning binding system, a conjunction analysis (Price and Friston, 1997) was conducted to find the region(s) that significantly engaged in the association of words and their meaning across spoken and written language comprehension and production tasks. As shown in Figure 3A, a modality/task overlap significantly coactivated by all four sections (i.e., listening, speaking, reading, and writing) was identified in the left OTSm (cluster size: 100 voxels, peak in MNI coordinates: –46, –46, –14, thresholded at voxel-level p < 0.001 with a cluster-level FWE corrected p < 0.05).
This result is consistent with previous findings that lexical item storage and retrieval are primarily pertinent to the ventrolateral temporal regions (Hickok and Poeppel, 2007; Forseth et al., 2018). Intriguingly, the left OTSm overlap fell within the anterior lateral portion of the VWFA (ranging from approximately Talairach y = −64 to −48 or MNI y = −66 to −48, as shown in Fig. 3; Dehaene et al., 2005). The VWFA is a brain site that is selectively attuned to reading and closely related to the visual word form (i.e., the abstract sequence of letters or characters of strokes that composes a written symbol), and it has a pivotal role in the orthography-semantics-phonology association (Dehaene et al., 2005; Dehaene and Cohen, 2011). Moreover, the left OTSm overlaps with several conceptual regions that commonly support the linkage of different modality-specific lexical information (Fig. 3; Table 10; Cohen et al., 2002; Hasson et al., 2002; Dehaene et al., 2005; Lacadie et al., 2008; Dehaene and Cohen, 2011; Zhao et al., 2016; Stevens et al., 2017). According to the dual-stream model for speech processing (Hickok and Poeppel, 2007), the left OTSm belongs to the lexical interface area (bilateral ventral temporal regions), which in theory links specific word entries to their lexical/semantic properties (Hickok and Poeppel, 2007), implying that it serves to gather and integrate different sorts of lexical/semantic information. Since lexical knowledge is by no means unimodal, the lexical interface must be capable of addressing multimodal information simultaneously to fulfil its role. Additionally, the overlap borders the ventral ATL, also referred to as the BTLA (ranging from approximately Talairach y = −46 to 17 or MNI y = −46 to 20), an area that was recently termed the orthography semantic interface region and that supports the linkage of modality-specific lexical information types during spoken/written comprehension and production (Purcell et al., 2014).
Previous literature on memory processing and biolinguistics provides indirect evidence that the left OTSm is an ideal site where various types of information converge to be maintained for further processing, with the particular advantage of bridging different language processing routes. The left occipitotemporal area is closely related to working memory maintenance in the absence of perceptual stimulation; it not only displays specificity to multiple categories, especially visual categories, but also retains low-level to high-level abstract representations (Ranganath and D'Esposito, 2001; Harrison and Tong, 2009; Christophel et al., 2017). The temporal association cortex is a key interface that communicates with long-term memory for episodic memory retrieval (Vaz et al., 2019), and lesions in this area are associated with certain recognition disorders that are closely related to the functions of polymodal information binding, such as pure alexia (Dehaene et al., 2005). Structurally, the left OTSm is located at the convergence zone of two anatomic pathways for language processing: the dorsal pathway (from meaning to sound) along the arcuate fascicle (AF) and superior longitudinal fascicle (SLF) connecting the temporal lobe and premotor cortices (BA 44, pars opercularis, and PMd) and the ventral pathway (from sound to meaning) along the extreme capsule (EmC) connecting the temporal lobe to the ventrolateral PFC (BA 45/47; Saur et al., 2008). On the one hand, ontogenetic research has demonstrated that the ventral pathway linking the ventrolateral IFG via the EmC to the temporal cortex is detectable in both adults and newborns (Perani et al., 2011). In terms of dorsal pathways, two are detectable in adult brains: one connects the temporal cortex via the AF/SLF to the posterior portion of the IFG (BA 44), and the other connects the temporal cortex to the PM. However, in the brains of newborns, only the pathway to the PM can be detected. On the other hand, phylogenetic research has demonstrated that the dorsal pathway with BA 44 as the destination is well developed in human adult brains but ill developed in macaques (Rilling et al., 2008). Thus, since the left OTSm is located within a region that has easy access to multidimensional signals, it is well placed to execute heteromodal integration.
The anterior VWFA is primarily modulated by the frontoparietal semantic control network
Having identified a heteromodal region for general word-meaning association, we examined how it is connected to semantic control across different verbal input and output modalities. We predicted that activity related to the retrieval of word semantics in this region could be modulated by frontoparietal control regions such as the bilateral IFG, an area that has been reported to play a vital part in both semantic control (Noonan et al., 2013; Rogers et al., 2015; Lambon Ralph et al., 2017; Chiou et al., 2018) and general cognitive control (Brass et al., 2005; Stokes et al., 2013).
GCA was applied to investigate the causal interactions between ROIs for each explicit semantic processing task. Briefly, the principle of GCA is based on the predictability of time-varying signals: one signal is said to Granger cause another if it contains information that helps to predict the future behavior of the other (Blinowska et al., 2004). Full connections among all ROIs are shown in Figure 4Ai. Figure 4B shows the ROIs that have a direct interaction with the left OTSm. In spoken-language tasks (i.e., listening and speaking), the left OTSm received causal influence largely from the inferior frontal regions. When performing the listening task, the left IFJ and left posterior inferior parietal region (i.e., IPSp) exerted top-down modulation over the left OTSm. For the speaking task, in addition to the left IFJ, the right hemispheric IFJ and ventromedial temporal cortex (i.e., right OTSm) also displayed top-down influence on the left OTSm, with the right OTSm bidirectionally connected to the left OTSm. Similarly, in written-language tasks (i.e., reading and writing), top-down modulation over the left OTSm primarily came from inferior frontal areas, including the left IFGa and bilateral IFJ, as well as dorsal prefrontal regions and the IPL. When performing the reading task, the bilateral IFJ and left IFGa exerted significant control over the left OTSm. For the writing task, in addition to the inferior frontal regions found in the reading section, the left preSMA, left MFGa, left MFGp and bilateral IPS were also involved; in contrast with the reading section, the bilateral IFJ was bidirectionally connected to the left OTSm, and the IFGa received input from the left OTSm. The left IPSp and right IPSa sent top-down information to and received bottom-up information from the left OTSm, respectively. The left MFGa also received input from the left OTSm, whereas the left MFGp was bidirectionally connected to the left OTSm.
In summary, across four task sections, the left OTSm (i.e., anterior VWFA) primarily received top-down information from prefrontal regions, especially the bilateral IFG. It is worth highlighting that only production tasks (speaking and writing) demonstrated bottom-up modulation from the left OTSm to other regions.
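The Granger principle described above can be illustrated with a minimal bivariate sketch. This is not the pipeline used in the study: the model order, the ordinary least-squares fit, the log variance-ratio statistic, and the simulated signals are all assumptions made for illustration, and the function names `lagged_design` and `granger_causality` are hypothetical.

```python
import numpy as np

def lagged_design(series_list, order):
    """Build a design matrix: an intercept plus `order` past values of each series."""
    n = len(series_list[0])
    cols = [np.ones(n - order)]
    for s in series_list:
        for lag in range(1, order + 1):
            # Row t (t = order..n-1) holds s[t - lag].
            cols.append(s[order - lag:n - lag])
    return np.column_stack(cols)

def granger_causality(source, target, order=1):
    """Log ratio of residual variances: target's own past vs. own past plus source's past."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    y = target[order:]
    x_restricted = lagged_design([target], order)
    x_full = lagged_design([target, source], order)
    res_r = y - x_restricted @ np.linalg.lstsq(x_restricted, y, rcond=None)[0]
    res_f = y - x_full @ np.linalg.lstsq(x_full, y, rcond=None)[0]
    return float(np.log(np.var(res_r) / np.var(res_f)))

# Simulated (not measured) signals: x drives y with a one-sample delay.
rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.standard_normal()

gc_xy = granger_causality(x, y)  # large: the past of x improves prediction of y
gc_yx = granger_causality(y, x)  # near zero: the past of y does not help predict x
```

In this toy case the statistic is large only in the direction in which the simulated influence actually runs, which is the asymmetry that the directed connections in Figure 4 are meant to capture.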
As shown in Figure 4Aii, we further situated the regions involved in explicit semantic processing (i.e., ROIs included in GCA) according to a general parcellation of functional large-scale networks (Yeo et al., 2011), which enabled us to assess these regions in terms of a few domain-general or domain-specific networks, some of which are consistently found in both resting and task states. With respect to all ROIs and their connections across the four sections, production tasks involved more networks, such as the default mode network (DMN), as well as more between-network interactions. In terms of ROIs that directly communicated with the left OTSm, the left OTSm, which was classified into the dorsal attention network (DAN), received top-down modulation only from nodes of the control network (CON) and/or nodes within the DAN itself when performing verbal comprehension tasks (listening and reading). During verbal production tasks (speaking and writing), significant top-down control also came from nodes of the CON and/or DAN. However, the speaking section involved bidirectional communication between the left OTSm and the visual network (VIS), and the writing section demonstrated bidirectional communication between the left OTSm and nodes of the ventral attention network (VAN) and DMN. Furthermore, the left OTSm had little significant information outflow, although it did output information to a VIS node (i.e., right OTSm) in speaking and to the bilateral IFJ in writing.
The anterior VWFA has more information input from multiple sources than output in production tasks
Conjunction analyses and GCA jointly indicated that the left OTSm (i.e., anterior VWFA) was primarily modulated by prefrontal regions when critically subserving the binding between words and their meaning, regardless of task modality and of whether the tasks were receptive or expressive. According to the connectivity pattern shown in Figure 4, as a word-meaning binding center, the left OTSm appeared to receive all necessary retrieved words and semantic attributes from various code-specific sensory-motor and linguistic resources in verbal comprehension tasks (listening and reading), whereas in verbal production tasks, it seemed to send more bottom-up feedback to other nodes. A possible reason could be that comprehension tasks mainly required mapping from linguistic stimuli to words/concepts, whereas production tasks triggered mapping from nonlinguistic stimuli to words/concepts, which were then mapped to external linguistic forms (speech sounds or written characters), particularly for the written naming task in which word form is explicitly required as the output.
To further explore the information input and output of the word-meaning binding site under different tasks and modalities, we calculated the difference between the sums of GC values flowing into and out of each ROI to measure the net information inflow (i.e., in-out degree). Given its assumed major role in integrating all necessary retrieved words and semantic attributes, the left OTSm should have more input than output and a greater net inflow than other regions in comprehension tasks, whereas in production tasks, the outflow from the left OTSm may outweigh its inflow when word form information is explicitly required in the task output.
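The in-out degree defined above (summed inflow minus summed outflow per ROI) can be sketched as follows. The matrix orientation (`gc[i, j]` is the influence from ROI i to ROI j), the function name `net_inflow`, and the numerical values are assumptions for illustration only; the values are invented and are not data from this study.

```python
import numpy as np

def net_inflow(gc):
    """Net inflow per ROI, where gc[i, j] is the GC value from ROI i to ROI j."""
    gc = np.asarray(gc, dtype=float).copy()
    np.fill_diagonal(gc, 0.0)      # ignore self-connections
    inflow = gc.sum(axis=0)        # column sums: everything arriving at each ROI
    outflow = gc.sum(axis=1)       # row sums: everything leaving each ROI
    return inflow - outflow

# Hypothetical three-ROI example with made-up GC values.
rois = ["OTSm_L", "IFJ_L", "IPSp_L"]
gc = np.array([
    [0.00, 0.01, 0.00],   # from OTSm_L
    [0.20, 0.00, 0.02],   # from IFJ_L
    [0.10, 0.00, 0.00],   # from IPSp_L
])
degree = dict(zip(rois, net_inflow(gc)))
```

A positive value marks a node that mainly receives information, and a negative value marks a node that mainly sends it. Because every connection is counted once as inflow and once as outflow, the values sum to zero across ROIs.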
The in-out degree results (Fig. 5) revealed that the left OTSm had significant net causal information inflow from multiple sources in the listening, speaking, and reading tasks but not in the writing task (permutation test, p < 0.05; Bonferroni corrected for multiple comparisons). Interestingly, the left OTSm had much greater information inflow than other ROIs in comprehension tasks (listening and reading), with a smaller difference revealed between the left OTSm and left IPSp for reading. In contrast, a quite different pattern was found for production tasks. For the speaking section, the left OTSm still had more input than output, yet its net inflow no longer remarkably outweighed that of the other regions. For the writing section, the left OTSm showed the reverse pattern, with more outflow than inflow. Notably, prefrontal regions assumed to exert semantic control over the left OTSm, such as the bilateral IFJ and left preSMA, generally had more output than input (negative or relatively low positive net inflow).
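One way such a test could be set up is sketched below. The text does not specify the permutation scheme, so the subject-level sign-flip procedure, the number of permutations, and the synthetic per-subject values are all assumptions; `sign_flip_pvalue` and `bonferroni` are hypothetical helper names.

```python
import numpy as np

def sign_flip_pvalue(values, n_perm=5000, seed=0):
    """Two-sided one-sample permutation test of mean net inflow against zero."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    observed = abs(values.mean())
    # Under the null, each subject's net inflow is equally likely to be + or -.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, values.size))
    null = np.abs((signs * values).mean(axis=1))
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

def bonferroni(pvalues):
    """Multiply each p value by the number of tests, capped at 1."""
    p = np.asarray(pvalues, dtype=float)
    return np.minimum(p * p.size, 1.0)

# Synthetic per-subject net inflow for two hypothetical ROIs (100 subjects each).
rng = np.random.default_rng(1)
roi_a = rng.normal(0.5, 1.0, 100)   # simulated with a true positive net inflow
roi_b = rng.normal(0.0, 1.0, 100)   # simulated with no net inflow
corrected = bonferroni([sign_flip_pvalue(roi_a), sign_flip_pvalue(roi_b)])
```

Sign flipping is a common choice for one-sample designs because it requires only that the null distribution be symmetric about zero.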
Discussion
Understanding how the integral capacity of the language faculty, along with semantic cognition, enables the binding between words and their meaning is a prominent question in cognitive neuroscience and the language sciences. To date, it is unknown whether there is a shared system in the human brain for word-meaning binding and how it interacts with semantic control. To address this issue, we conducted a comprehensive study that evaluated word retrieval brain activity across different language modalities and input/output formats. In summary, our findings reveal the following. (1) Word-meaning binding under receptive and expressive tasks of different modalities separately involves distributed brain regions within a frontal-parietal-temporal network. (2) The anterior VWFA plays a critical role in word retrieval and arguably is a key site subserving general semantic processes linking words and meaning, which challenges the predominant emphasis on this area's specific role in reading or other general visual processes. (3) The VWFA, as a general word-meaning binding site, receives direct top-down modulation from the frontoparietal CON across all tasks and displays task-specific differences, indicating that the dynamics between the semantic memory and cognitive control mechanisms during word processing are largely independent of the modalities of input or output. These results together contribute to a better understanding of how semantic cognition drives word retrieval.
Previous literature has suggested that the left ventral ATL and occipitotemporal regions may serve the multimodal binding involved in word-meaning association (Binney et al., 2010; Pulvermüller, 2013; Forseth et al., 2018; Binder et al., 2020). However, our results revealed that the left inferior occipitotemporal region (i.e., the left OTSm or anterior VWFA) acts as a heteromodal word-meaning binding site. For semantic control, a left-lateralized network comprising the IFG, posterior MTG, posterior ITG and PFCdm has been suggested as its core neural correlate (Jackson, 2021). Accordingly, we uncovered that at the single-word level, the left inferior frontal area, particularly the IFJ, was invariably recruited and modulated the inferior temporal region in a top-down manner, and the right IFJ was also engaged in all sections except the listening section. Importantly, the posterior part of the ITG may also function in word-meaning association. The left inferior parietal area (i.e., IPSp) was observed to exert influence over the word-meaning center only in listening and writing, instead of being constantly detected in all four sections. Such occasional involvement is consistent with prior findings that this region implements domain-general rather than semantics-specific control and therefore will only be occasionally identified during semantic tasks (Jefferies, 2013; Lambon Ralph et al., 2017; Jackson, 2021). Notably, the posterior MTG was not significantly activated in our study across the four tasks. Past literature has suggested that this area is inclined to boost the retrieval of weakly encoded semantic information (which requires a greater degree of online exploration of the semantic database) and to become active when control demands within receptive tasks vary (Noonan et al., 2013; Lambon Ralph et al., 2017).
Critically, the current study contributes to solving the VWFA controversy and clarifying its nature. For the particularly strong response of VWFA to written words, there have been two general hypotheses. The first is that VWFA becomes specialized for a single category, subtending visual word formation as prelexical and all visual processing; in line with this assumption, some claimed that this area should be better conceptualized as a region equipped with specific processing characteristics involved in more general visual processes (Price and Devlin, 2003; Gaillard et al., 2006; Glezer et al., 2009; Dehaene and Cohen, 2011; Striem-Amit et al., 2012; Vogel et al., 2014). Another hypothesis is that the VWFA is a special site in the visual cortex with dense interconnections to cortical areas related to language and/or other important higher level cognitive functions, and it is possible that this hypothesis and the first one are both true (Wandell et al., 2012; Wandell and Le, 2017). Furthermore, prior literature has demonstrated that occipitotemporal brain representations imbued with semantic processing are independent of perceptual modalities (e.g., spoken, written, braille, or sign language; Caramazza et al., 1990; Forseth et al., 2018; Evans et al., 2019; Mattioni et al., 2020). Our results provide strong cross-modality evidence for the view that VWFA is not predominantly or solely dedicated to reading but instead is critical for both spoken-language and written-language processes. More recent findings suggested that the VWFA serves as a link between language and attention (Chen et al., 2019), and the lateral portion of the VWFA may contribute to lexico-semantic access (Bouhali et al., 2019). 
As the heteromodal word-meaning binding site discovered in this study, the left OTSm (i.e., anterior VWFA) falls in the DAN (Fox et al., 2006), which echoes previous findings that the VWFA is functionally connected to the DAN (Vogel et al., 2012) and is a crucial region for language processing.
Another possible role of the left OTSm, or VWFA, can be identified from the perspective of working memory. By definition, the episodic buffer of working memory congregates and chunks modality-specific representations into a multimodal unity. On the one hand, the buffer temporarily holds relevant information so that these pieces can be processed collectively, given that different aspects and computations of language are suggested to be encoded on different time scales (Sahin et al., 2009; Ding et al., 2016; Sheng et al., 2019). On the other hand, the buffer bridges the gap between various modalities using a multidimensional code, which is required for language processing to gather and integrate information from diverse sources (Rudner and Rönnberg, 2008). Moreover, the episodic buffer is expected to demonstrate strong modulatory interaction primarily with the central executive system but not necessarily to be under its tight control (Baddeley and Hitch, 2019). While the brain signatures of the episodic buffer per se remain largely unknown, at its core, the buffer for working memory acts as a heteromodal processing center, which is theoretically a role similar to that of the lexical hub for language. Thus, based on cultural recycling theory (Dehaene and Cohen, 2007), we propose that beyond visual word recognition in reading, one of the essential functions of the long-debated VWFA could be cross-modal lexical processing, and the VWFA may be recycled from an episodic buffer system by the language faculty (Price and Devlin, 2003; Gaillard et al., 2006; Glezer et al., 2009; Dehaene and Cohen, 2011; Striem-Amit et al., 2012; Vogel et al., 2014).
It is also worth noting that while we hypothesized that the ATLv would be involved in the linkage of words and their meaning, this critical region was absent from our results. There is accumulated evidence showing that the left ATL is also one of the key regions responsible for combining basic linguistic units into more complex representations (e.g., meaning composition for words in a phrase; Bemis and Pylkkänen, 2011; Hagoort, 2019; Pylkkänen, 2019). However, in this study, the simple single-word processing tasks we adopted were supposed to place minimal (if any) demands on the basic linguistic composition attributed to the left ATLv, which may have led to this region being less involved in general. On the other hand, although we had a relatively large sample size that provided reliable statistical power, we cannot rule out the possibility that the absence of the ATLv was to some extent because of signal loss and distortion in ventral temporal regions (Binney et al., 2010; Visser et al., 2012; Lambon Ralph et al., 2017), which needs to be tested with newly developed techniques that address this issue (Halai et al., 2014; Kundu et al., 2018).
Future research is expected to address important questions pertinent to our findings. The tasks used in this study focused on a single word category (i.e., nouns) and were fairly simple (i.e., single-word level processing). To what extent the current findings hold true for other types of words (such as abstract words) or more complicated tasks (e.g., sentence-level processing) is not known. Additionally, the temporal resolution of fMRI limits the tracking of real-time neural dynamics. In particular, despite the large sample size (N = 100), the finding of top-down control is based on GCA, which is not an ideal method for fMRI data because of the confounding influence of regional variations in the hemodynamic response function (HRF). Thus, these findings should be verified using other imaging modalities with decent spatiotemporal resolution and a more fine-grained timescale (e.g., milliseconds), such as magnetoencephalography and intracranial electroencephalography. To tackle these questions, we should also more effectively bridge neuroscience and theories from psychology and linguistics to develop appropriate tasks.
Footnotes
This work was supported by National Natural Science Foundation of China Grants 81790650, 81790651, 81727808, 31421003, 81627901, and 31771253; the National Key Research and Development Program of China Grant 2017YFC0108900; the Beijing Municipal Science and Technology Commission Grant Z181100001518003; and the Capital's Funds for Health Improvement and Research Grant 2020-4-801. We thank the National Center for Protein Sciences at Peking University in Beijing, China, for assistance with data acquisition and analyses.
The authors declare no competing financial interests.
Correspondence should be addressed to Jia-Hong Gao at jgao{at}pku.edu.cn