Prevailing hierarchical models propose that temporal processing capacity—the amount of information that a brain region processes in a unit time—decreases at higher stages in the ventral stream regardless of domain. However, it is unknown if temporal processing capacities are domain general or domain specific in human high-level visual cortex. Using a novel fMRI paradigm, we measured temporal capacities of functional regions in high-level visual cortex. Contrary to hierarchical models, our data reveal domain-specific processing capacities as follows: (1) regions processing information from different domains have differential temporal capacities within each stage of the visual hierarchy and (2) domain-specific regions display the same temporal capacity regardless of their position in the processing hierarchy. In general, character-selective regions have the lowest capacity, face- and place-selective regions have an intermediate capacity, and body-selective regions have the highest capacity. Notably, domain-specific temporal processing capacities are not apparent in V1 and have perceptual implications. Behavioral testing revealed that the encoding capacity of body images is higher than that of characters, faces, and places, and there is a correspondence between peak encoding rates and cortical capacities for characters and bodies. The present evidence supports a model in which the natural statistics of temporal information in the visual world may affect domain-specific temporal processing and encoding capacities. These findings suggest that the functional organization of high-level visual cortex may be constrained by temporal characteristics of stimuli in the natural world, and this temporal capacity is a characteristic of domain-specific networks in high-level visual cortex.
SIGNIFICANCE STATEMENT Visual stimuli bombard us at different rates every day. For example, words and scenes are typically stationary and vary at slow rates. In contrast, bodies are dynamic and typically change at faster rates. Using a novel fMRI paradigm, we measured temporal processing capacities of functional regions in human high-level visual cortex. Contrary to prevailing theories, we find that different regions have different processing capacities, which have behavioral implications. In general, character-selective regions have the lowest capacity, face- and place-selective regions have an intermediate capacity, and body-selective regions have the highest capacity. These results suggest that temporal processing capacity is a characteristic of domain-specific networks in high-level visual cortex and contributes to the segregation of cortical regions.
- domain specificity
- extrastriate body area
- fusiform face area
- parahippocampal place area
- visual word form area
Visual stimuli bombard us at different rates in our daily life. For example, words and scenes are typically stationary and vary at slow rates. In contrast, bodies are dynamic and typically change at faster rates. While these effects may seem intuitive, little is known about how temporal processing capacity is implemented in high-level visual cortex and how it contributes to perception.
The prevailing hypothesis suggests that temporal processing capacity, or the amount of information that a particular brain region can process in a unit of time, is domain general and decreases at progressively higher stages of the ventral stream hierarchy (Singh et al., 2000; Mukamel et al., 2004; McKeeff et al., 2007; Hasson et al., 2008; Gauthier et al., 2012). This hypothesis is derived from investigations of the temporal characteristics of neurons in nonhuman primates providing evidence for lower capacity in higher stages of the ventral stream processing hierarchy (i.e., in inferotemporal cortex, IT) than primary visual cortex, namely V1. Three types of empirical evidence support this hierarchy hypothesis. First, the firing rate of V1 neurons decreases for presentation rates >10 Hz (Foster et al., 1985; Chance et al., 1998), while neuronal firing rates in IT decrease for slower rates >4 Hz (Keysers et al., 2001). Second, information accumulation in V1 is at least twice as fast as in IT (Optican and Richmond, 1987; Richmond and Optican, 1990; Heller et al., 1995). Third, response latencies increase across the visual hierarchy from V1 to IT (Schmolesky et al., 1998), as does the dispersion of latencies across anatomically adjacent neurons (Vogels and Orban, 1994). fMRI studies in humans also support this hypothesis. Several studies reported that early visual areas (V1–hV4) process 10–30 items per second, whereas high-level regions in ventral temporal cortex (VTC) process only 4–6 items per second (Mukamel et al., 2004; Liu and Wandell, 2005; McKeeff et al., 2007; Gauthier et al., 2012).
Contrary to the prevailing hierarchy hypothesis, recent studies found similar temporal processing in lateral occipital temporal cortex (LOTC) and VTC (Mukamel et al., 2004; Gentile and Rossion, 2014), even though LOTC is considered to precede VTC along this hierarchy (Haxby et al., 2000; Peelen and Downing, 2007; Schwarzlose et al., 2008; Dilks et al., 2013). These conflicting findings suggest a new, previously unconsidered hypothesis: the temporal processing capacity of high-level visual regions may be domain specific and related to the natural temporal regularities of the stimuli that they process, rather than determined by their stage in the visual hierarchy. This temporal hypothesis predicts that regions processing stimuli with fast dynamics in the natural world (such as bodies) will have a higher temporal processing capacity than regions processing stimuli with slower dynamics (such as words).
To test these hypotheses, we scanned subjects with fMRI while they viewed stimuli from multiple domains (Fig. 1A, characters, bodies, faces, and places) at various rates (Fig. 1B, 1–8 Hz). Crucially, the amount of information in a trial was maintained across rates by keeping the number of stimuli in a trial constant. We reasoned that selectivity in regions processing stimuli of a particular domain would be highest when the presentation rate is aligned with their temporal processing capacity and would substantially decrease when that capacity is surpassed. We considered two hypotheses: (1) a domain-general hierarchy hypothesis that predicts decreasing capacities from LOTC to VTC independent of category preference and (2) a domain-specific temporal hypothesis that predicts domain-varying temporal capacities independent of processing stage in LOTC or VTC.
Materials and Methods
Twelve right-handed subjects (five female, ages 19–44) with normal or corrected-to-normal vision were recruited from Stanford University for the main experiment (Experiment 1). Eight of these subjects (four female) returned for both a control study (Experiment 2) and a retinotopic mapping experiment to functionally define primary visual cortex (V1) using a population receptive field model (Dumoulin and Wandell, 2008). Participants gave their written informed consent. The Stanford Institutional Review Board on Human Subjects Research approved all procedures.
Experiment 1: fMRI temporal processing capacity experiment.
Subjects viewed gray-level stimuli comprising two different stimulus types from each of five domains: characters, bodies, faces, places, and objects (Fig. 1A, example stimuli in first row). Characters consisted of pseudowords (Glezer et al., 2009) and numbers. Bodies consisted of either whole bodies with obscured heads or limbs. Faces were either those of adults or children. Places were either indoor corridors or houses. Finally, objects were either cars or guitars. The view, size, and retinal position of the images varied. Each item was overlaid on a 10.5° phase-scrambled background generated from a randomly selected image from the entire set to minimize low-level differences across categories.
We introduce a novel paradigm in which the number of stimuli (information) in a trial is constant (eight stimuli per trial; Fig. 1B) and the presentation rate, and consequently trial duration, varies across conditions. Prior studies (Mukamel et al., 2004; McKeeff et al., 2007; Gentile and Rossion, 2014) used constant block duration, confounding the number of stimuli with rate: blocks with faster rates had more stimuli compared with blocks with slower rates. Consequently, with this design, it is impossible to disentangle two competing effects on fMRI responses: stimulus rate versus the number of stimuli. In contrast, our new design measures the effect of rate alone on cortical responses. In each trial, stimuli of one category were shown in rapid serial visual presentation (RSVP) at a single rate of 1, 2, 4, or 8 items per second (Hertz). Stimulus trials were counterbalanced with blank baseline trials. In each run, stimuli were shown at one rate. The order of runs at different rates was counterbalanced across subjects. The total duration of visual presentation at a particular rate was 11 min, which equated the statistical power across rates. The same set of 1440 images was shown across all rates, but images did not repeat within a rate.
Subjects fixated on a central dot and pressed a button when a phase-scrambled oddball image appeared randomly in a trial. Trials contained 0, 1, or 2 phase-scrambled images with probability of 0.25, 0.5, or 0.25, respectively.
Experiment 2: control study.
The design of the second experiment was identical to the main experiment, except that we modified the stimulus set (Fig. 1A, example stimuli in second row) to simultaneously control for multiple low-level differences between stimuli of different categories including their (1) contrast, (2) luminance, (3) similarity to other stimuli of the same category, (4) visual field coverage, and (5) spatial frequency power distributions. In doing so, we also considered the impact of image manipulations on high-level factors and made efforts to equate stimuli of different categories in terms of (6) familiarity and (7) memorability. Below, we detail the results from the quantification of these seven effects from Experiment 2.
Michelson contrast: Contrast, calculated as the ratio between the difference and the sum of the maximum and minimum pixel intensities, was matched across all images by scaling the grayscale values of each image such that 1% of pixels are saturated at the lowest intensity and 1% at the highest intensity. Further image processing by the SHINE toolbox (Willenbockel et al., 2010) and spatial filtering introduced a small amount of variability in contrast across images, but there is no significant difference in contrast level among images of different categories (no main effect of category, F(4,1435) = 0.73, p > 0.05; Fig. 1C, Contrast).
Luminance: We used the SHINE toolbox (Willenbockel et al., 2010) to match histograms of grayscale values across all stimuli such that the mean luminance of images from different categories does not differ (no main effect of category, F(4,1435) = 0.97, p > 0.05; Fig. 1C, Luminance).
Similarity: To estimate visual similarity among images of a category, we measured the mean Euclidean distance between normalized grayscale values (scaled to be between 0 and 1) of each stimulus and every other stimulus of the same category (Grill-Spector et al., 1999). The distance (D) between a pair of images was calculated using the following formula, in which p indexes the pth pixel in an image, n represents the total number of pixels in an image, and I_p and I'_p represent the pth pixel in two different images: $D = \sqrt{\frac{1}{n}\sum_{p=1}^{n}(I_p - I'_p)^2}$. Similarity (S) was defined as follows: S = 1 − D. Thus, pairwise similarity values range from 0 (images with inverted intensity difference at each pixel) to 1 (identical images). Although there are statistically significant differences in mean similarity across categories (main effect of category, F(4,101529) = 788.52, p < 0.001), the range of similarity values is similar across categories (Fig. 1C, Similarity), and differences in the mean values are not perceptually relevant because the size of the effect (0.0029, standard deviation (SD) of category means in normalized grayscale values) is below the resolution of the display (0.0039, minimum pixelwise distance resolved by the display in normalized values, corresponding to a change of one gray-level value out of a possible 255 increments). Boxplots in Figure 1C, Similarity, show distributions of similarities for all possible pairwise comparisons in each category.
Visual field coverage: While we varied the view, size, and position of stimuli used in Experiment 1 to maximize variation within a category and reduce variation across categories, we further refined the cropping and placement of stimuli in the visual field in Experiment 2. This refinement further reduced biases related to intrinsic regularities in the shape and aspect ratio of different categories. Figure 1C, Pixel Intensity, shows the mean luminance of each pixel location, averaged across all images in a category.
Spatial frequency: Spatial frequency content was controlled by filtering the images with a low-pass Gaussian filter with a cutoff frequency of 8 cycles per degree of visual angle. After filtering, the mean spatial frequency distribution across images of each category was matched. Figure 1C, Spatial Frequency, shows the mean proportion of total power at each spatial frequency, averaged across all images in a category.
Familiarity: To diminish familiarity effects across categories, all stimuli were selected to be equally unfamiliar to participants. None of the face or place stimuli were personally familiar to our participants, and pronounceable pseudowords and uncommon numbers were used as character stimuli instead of more familiar English words and frequently encountered numbers.
Memorability: We tested whether the memorability of stimuli belonging to different categories was matched by comparing recognition memory for stimuli of various categories presented at 1 Hz (see below, Behavioral testing). Performance did not significantly differ across categories at 1 Hz (no main effect of category, F(3,31) = 0.79, p > 0.05).
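The contrast and similarity measures described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: images are represented as flat lists of grayscale values, the function names are hypothetical, and the distance D is assumed to be the root-mean-square pixel difference, which is consistent with the stated 0 to 1 range of S but is not confirmed to be the exact computation used in the study.

```python
from math import sqrt


def michelson_contrast(pixels):
    """Michelson contrast: (Imax - Imin) / (Imax + Imin).

    `pixels` is a flat list of non-negative grayscale intensities.
    """
    hi, lo = max(pixels), min(pixels)
    if hi + lo == 0:
        return 0.0  # all-black image: contrast defined as zero (assumption)
    return (hi - lo) / (hi + lo)


def rms_distance(img_a, img_b):
    """Assumed distance D: root-mean-square difference between two images
    whose grayscale values are already scaled to the range [0, 1]."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    n = len(img_a)
    return sqrt(sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / n)


def similarity(img_a, img_b):
    """S = 1 - D: 1 for identical images, 0 when every pixel differs
    by the full intensity range."""
    return 1.0 - rms_distance(img_a, img_b)


def mean_within_category_similarity(images):
    """Average S over all unordered pairs of images in one category."""
    pairs = [(i, j)
             for i in range(len(images))
             for j in range(i + 1, len(images))]
    return sum(similarity(images[i], images[j]) for i, j in pairs) / len(pairs)
```

For example, `similarity(img, img)` returns 1.0 for any image, and an all-black image compared with an all-white image gives 0.0, matching the endpoints of the range described in the text.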
All subjects participated in four runs of an independent functional localizer experiment (6 min per run) collected on a different day using 16 s blocks, 1 Hz presentation, and an oddball task (Weiner and Grill-Spector, 2010, 2011).
Subjects were scanned on a GE 3 tesla Signa scanner at the Center for Cognitive and Neurobiological Imaging at Stanford University using a custom-built, phased-array 32-channel head coil. We acquired 34 slices covering occipitotemporal cortex (resolution: 2.4 × 2.4 × 2.4 mm; one-shot T2*-sensitive gradient echo acquisition sequence: FOV = 192 mm, TE = 30 ms, TR = 2000 ms, and flip angle = 77°). A whole-brain, anatomical volume was acquired (T1-weighted BRAVO pulse sequence; resolution: 1 × 1 × 1 mm, TI = 450 ms, flip angle = 12°, 1 NEX, FOV = 240 mm).
Functional ROIs were defined in individual subjects from the functional localizer using anatomical and functional criteria (Weiner and Grill-Spector, 2010, 2012; Grill-Spector and Weiner, 2014). A common threshold (t > 4, voxel level) was used to define all regions across all subjects, and subsequent analyses were conducted in ROIs identifiable in at least 9 of the 12 subjects (ROIs from a representative subject are shown in Fig. 3A).
Character-selective ROIs (characters > others) included a bilateral region in the inferior occipital sulcus (IOS, N = 12), a bilateral region in the posterior occipital temporal sulcus (pOTS; N = 12), corresponding to the visual word form area (Cohen et al., 2000; Ben-Shachar et al., 2011), and a more anterior patch often lateralized to the left hemisphere localized to mid OTS (mOTS; N = 9). In our prior studies, we use anatomical nomenclature with the corresponding selectivity to refer to face-, place-, and body-selective ROIs (e.g., mFus-faces). Here, we extend this nomenclature to character-selective regions. Thus, we use the label pOTS-characters for visual word form area 1 (VWFA-1) and mOTS-characters for VWFA-2.
Body-selective ROIs (bodies > others) were found bilaterally in the lateral occipital sulcus (LOS; N = 12), inferior temporal gyrus (ITG; N = 12), and middle temporal gyrus (MTG; N = 12) as in Weiner and Grill-Spector (2011), collectively corresponding to the extrastriate body area (Downing et al., 2001). An additional body-selective region was defined in the OTS (N = 9), referred to elsewhere as the fusiform body area (Peelen and Downing, 2005; Schwarzlose et al., 2005).
Face-selective ROIs (faces > others) were observed on the inferior occipital gyrus (IOG; N = 11), corresponding to the occipital face area (Gauthier et al., 2000), and on the posterior (pFus; N = 12) and mid (mFus; N = 12) lateral fusiform gyrus, collectively corresponding to the fusiform face area (Kanwisher et al., 1997).
Place-selective ROIs (places > others) were observed near the transverse occipital sulcus (TOS; N = 12) as described in previous studies (Hasson et al., 2003), collateral sulcus (CoS; N = 12), corresponding to the parahippocampal place area (Epstein and Kanwisher, 1998), and medial parietal/retrosplenial cortex (RSC; N = 10), corresponding to the retrosplenial complex (O'Craven and Kanwisher, 2000).
ROIs were considered in three anatomical sections arranged from posterior to anterior corresponding to putative stages of the visual hierarchy: LOTC (IOS-characters, LOS-bodies, IOG-faces, and TOS-places), posterior VTC (pOTS-characters, ITG-bodies, and pFus-faces), and mid VTC (mOTS-characters, OTS-bodies, mFus-faces, and CoS-places). Two additional regions were located outside these anatomical expanses (MTG-bodies and RSC-places).
Estimating responses to experimental conditions.
We ran a GLM using the experimental conditions to generate a design matrix that was then convolved with the HRF implemented in SPM (http://www.fil.ion.ucl.ac.uk/spm). This procedure was done on the time course of each voxel (units of percentage signal change) and separately for each rate. Response amplitudes (betas) and residual variance of each voxel were estimated from the GLM. The contrast effect size (CES) was measured as the difference between the response amplitude to a preferred category and the response amplitudes to all other categories (in units of percentage signal change). Selectivity (t-value) was measured as the ratio of the CES to its standard error. The standard error was estimated from the residual variance of the GLM without whitening (Worsley, 2001).
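In matrix notation, the quantities in this paragraph correspond to the standard ordinary least-squares contrast statistic. The following is a sketch under assumed notation rather than a description of the exact implementation: y is a voxel time course with T time points, X is the HRF-convolved design matrix, and c is a contrast vector assumed here to weight the preferred category by +1 and each of the other K − 1 categories by −1/(K − 1).

```latex
\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y,
\qquad
\hat{\sigma}^{2} = \frac{\lVert y - X\hat{\beta}\rVert^{2}}{T - \operatorname{rank}(X)}

\mathrm{CES} = c^{\top}\hat{\beta},
\qquad
t = \frac{\mathrm{CES}}{\sqrt{\hat{\sigma}^{2}\, c^{\top}(X^{\top}X)^{-1}c}}
```

Because no whitening is applied, the standard error in the denominator depends only on the residual variance and the design matrix.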
To examine the goodness of our model fits across rates, we compared the variance explained by the GLM fits in each voxel for each rate. Across both experiments, this analysis revealed no significant differences in the proportion of variance explained by models fit separately for 1 and 8 Hz data in any ROI (no significant differences, ts < 2.01, ps > 0.05, paired t test for each ROI comparing proportion of variance explained at 1 and 8 Hz; Fig. 2). Goodness of fit of the GLM tended to be highest at 4 Hz in most category-selective ROIs even though these regions were defined using a localizer with stimuli presented at 1 Hz. This fit at 4 Hz was significantly better than in other rates in some ROIs in Experiment 1 (main effect of rate, Fs > 2.99, ps < 0.05, one-way ANOVA on proportion of variance explained by GLM for each ROI; Fig. 2A, thick lines) but not in any ROI in Experiment 2 (no main effect of rate, Fs < 2.49, ps > 0.05; Fig. 2A,B, thin lines). These findings rule out the possibility that our results reflect differences in model fits between fast (8 Hz) and slow (1 Hz) presentation rates or between short (2 s) and long (8 s) trial durations.
Evaluating the effect of rate on category selectivity.
We measured the effect of rate on category selectivity using two different repeated-measures ANOVAs. Category selectivity was defined as the mean voxel t-value of preference to one category versus others, averaged across voxels in a given ROI. The first ANOVA was conducted to determine whether selectivity varies differentially as a function of rate across regions located in the same putative hierarchical tier. This ANOVA included factors of rate and preferred domain using data from subjects with all regions defined in a particular anatomical section for repeated-measures analyses (anatomical sections—LOTC: IOS-characters, LOS-bodies, IOG-faces, and TOS-places; posterior VTC: pOTS-characters, ITG-bodies, and pFus-faces; mid VTC: mOTS-characters, OTS-bodies, mFus-faces, and CoS-places; Fig. 3B). To determine the reproducibility of domain-specific effects of rate on selectivity after controlling low-level stimulus properties, we directly compared selectivity data from Experiments 1 and 2 for sets of regions in the same anatomical section using a three-way ANOVA with factors of rate, preferred domain, and experiment. The second ANOVA was conducted to determine whether selectivity varies differentially as a function of rate across regions with the same category preference, but in different hierarchical tiers (preferred categories—characters: IOS, pOTS, and mOTS; bodies: LOS, ITG, MTG, and OTS; faces: IOG, pFus, and mFus; places: TOS, CoS, and RSC; Fig. 4).
To gain a finer-grained estimate of the presentation rate producing maximal selectivity in a given region within the range of rates measured in the present study, we next fitted third-order polynomial functions to the temporal frequency tuning curve of each ROI in individual subjects as in previous studies (McKeeff et al., 2007; Gauthier et al., 2012). We then identified the rate corresponding to the local peak of the function within the 1–8 Hz range in which we presented our stimuli (Table 1, Selectivity). We compared the distributions of these local maxima across regions in the same anatomical section and across regions with the same category preference using a series of one-way ANOVAs and data from all subjects.
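The peak-finding step can be illustrated with a short Python sketch. With four presentation rates, a third-order polynomial has four coefficients, so the least-squares fit passes exactly through the four measured points. The sketch below assumes that the polynomial is fitted on a linear frequency axis in Hz and locates the peak by a dense grid search over 1–8 Hz; both choices, and the function names, are assumptions made for illustration only.

```python
def cubic_through_points(xs, ys, x):
    """Evaluate at x the unique polynomial of degree <= 3 that passes
    through four points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def peak_rate(rates, values, lo=1.0, hi=8.0, steps=7000):
    """Return the rate in [lo, hi] at which the fitted curve is largest.

    A grid search is used for simplicity; if the curve has no interior
    maximum, the result falls on one end of the range.
    """
    best_x = lo
    best_y = cubic_through_points(rates, values, lo)
    for k in range(1, steps + 1):
        x = lo + (hi - lo) * k / steps
        y = cubic_through_points(rates, values, x)
        if y > best_y:
            best_x, best_y = x, y
    return best_x


# Made-up tuning curve sampled from y = -(x - 3)^2 at 1, 2, 4, and 8 Hz.
rates = [1.0, 2.0, 4.0, 8.0]
values = [-4.0, -1.0, -1.0, -25.0]
estimated_peak = peak_rate(rates, values)  # close to 3 Hz
```

In this toy example, the largest measured value occurs at both 2 and 4 Hz, yet the fitted curve peaks between them, which is the kind of finer-grained estimate the procedure is meant to provide.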
Evaluating the effect of rate on response amplitude.
Selectivity is affected by (1) differences in response amplitudes across conditions and (2) the noise level. Consequently, we measured the effect of rate on response amplitudes to preferred and non-preferred categories in each ROI and the effect of rate on the CES or the difference in amplitude between preferred and non-preferred stimuli. We first determined whether the effect of presentation rate on response amplitude differs for preferred and non-preferred stimuli in sets of regions preferring the same category using a series of three-way repeated-measures ANOVAs with factors of rate, preference, and ROI separately for Experiment 1 (Fig. 5) and Experiment 2 (Fig. 7). We then tested whether this interaction significantly differed between experiments using a series of four-way ANOVAs with factors of rate, preference, ROI, and experiment. Likewise, we determined the consistency of rate modulations for preferred stimuli across experiments using a three-way ANOVA with factors of rate, ROI, and experiment. For Experiment 2, we also examined if the effect of rate on response amplitudes in a given ROI varied across each individual stimulus domain using a two-way repeated-measures ANOVA with factors of rate and domain for each ROI (Fig. 6).
The same curve-fitting procedures and statistical analyses used for selectivity data were also applied to amplitude data to estimate the rate eliciting peak responses to the preferred category (Table 1, Preferred) as well as peak CES (Table 1, CES) for each ROI. To further quantify these effects, we compared these additional metrics of temporal capacity across regions processing different domains in the same hierarchical tier and across regions processing the same domain in different hierarchical tiers using a series of one-way ANOVAs. Results using amplitudes to the preferred category or CES (Table 1, CES) were similar (no significant differences, t(12) = 0.50, p > 0.05, paired t test comparing estimated peak rates derived from amplitude to the preferred category and CES computed at the subject level and then averaged across subjects for each ROI in Experiment 1). Therefore, we focus our analysis on response amplitudes to the preferred category in the main text, but present all data in Table 1.
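The paired comparison reported above has 12 degrees of freedom, which corresponds to 13 paired values (one pair of peak-rate estimates per ROI). A minimal sketch of the paired t statistic is given below; the function name is hypothetical and the example numbers are placeholders, not data from the study.

```python
from math import sqrt


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two equal-length
    lists of measurements taken on the same units (here, ROIs)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the pairwise differences.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if var_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / sqrt(var_diff / n), n - 1


# Placeholder peak rates (Hz) from two metrics for three hypothetical ROIs.
t_value, dof = paired_t([2.0, 4.0, 6.0], [1.0, 2.0, 6.0])
```

The same function applied to 13 ROI-level pairs would yield the 12 degrees of freedom reported in the text.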
Finally, to assess the level of consistency between different neural measures of temporal capacity within and across experiments, we compared the capacity estimates derived from various metrics (peak amplitude to the preferred category, peak CES, and peak selectivity) in Experiments 1 and 2 among sets of regions located in the same anatomical section using a series of three-way ANOVAs with factors of preferred domain, experiment, and metric. Complementary analyses were then performed for sets of regions with the same category preference using three-way ANOVAs with factors of ROI, experiment, and metric.
Eight of the 12 subjects participated in a behavioral experiment conducted on a different day outside the scanner to estimate subjects' behavioral encoding capacities (Fig. 8). Encoding capacities were measured as a function of rate for a subset of the categories used in the fMRI experiments: characters, bodies, faces, and places. Each trial comprised three stages: (1) an encoding phase consisting of an RSVP sequence of eight stimuli from a single category presented at a single rate of 1 Hz, 2 Hz, 4 Hz, or 8 Hz; (2) a 3 s delay containing a blank screen; and (3) a probe image (Fig. 8A). Subjects were instructed to indicate whether or not the probe image had occurred in the preceding sequence (0.5 probability). Subjects participated in four runs, each consisting of 64 trials at a single rate with 16 trials per category. The order of runs at different rates was counterbalanced across subjects. In Figure 8B, we plot accuracy for each domain as a function of the presentation rate during encoding.
We chose this type of behavioral test because it requires the subject to process all eight stimuli during the encoding phase. In contrast, having a probe at the beginning of the trial would not require the subjects to process all images during encoding, and pilot behavioral testing with this alternative indicated that subjects' performance was at ceiling for all rates, consistent with prior findings (Potter and Levy, 1969). Ideally, we would have had subjects respond and report their percept for each and every stimulus. However, this is impractical for rates of 4 and 8 Hz, which are faster than the response times of typical subjects.
Are temporal processing capacities in high-level visual cortex domain general or domain specific?
We examined the effect of presentation rate on category selectivity and tested if (1) presentation rate affects the degree of category selectivity, (2) regions in the same anatomical location and processing stage of the ventral stream have the same processing capacity (domain-general hierarchy hypothesis), or (3) regions processing the same category in different anatomical locations have the same processing capacity (domain-specific temporal hypothesis). To address these questions, we independently defined functional ROIs selective to characters, bodies, faces, and places in each subject and measured the degree of selectivity in each ROI as a function of presentation rate. We divided the ventral stream into three anatomical sections: LOTC, posterior VTC, and mid VTC (Fig. 3A). Each section corresponds to a putative tier of the visual hierarchy and contains a cluster of character, body, face, and place ROIs (with the exception of posterior VTC that did not contain a place-selective ROI).
We find that presentation rate significantly modulates the degree of category selectivity in all ventral stream ROIs regardless of their location in the hierarchy or category preference (main effect of rate, Fs > 25.28, ps < 0.001, repeated-measures ANOVA for regions in each anatomical section with factors of rate and preferred domain). Thus, a property of category-selective regions is that they are more selective to their preferred category when stimuli are presented at a certain rate.
Contrary to predictions of the hierarchy hypothesis, data from Experiment 1 also reveal that each anatomical section contains regions with different temporal capacities (rate by domain interaction, Fs > 6.16, ps < 0.001, repeated-measures ANOVA for regions in each anatomical section with factors of rate and preferred domain; Fig. 3B). In LOTC, for example, regions selective to bodies, faces, and places show highest selectivity at 4 Hz, but the character-selective IOS exhibits maximal selectivity at 2 Hz (Fig. 3B; LOTC). Likewise in mid VTC, regions selective to characters, bodies, faces, and places are anatomically adjacent (Fig. 3A, mid VTC) yet show different peaks and troughs in their temporal processing characteristics (Fig. 3B, mid VTC). Moving from medial to lateral: CoS-places and mFus-faces have comparable selectivity at 2 and 4 Hz with a decline at 8 Hz, OTS-bodies reveals higher selectivity at 4 and 8 Hz than 1 and 2 Hz, and mOTS-characters shows peak selectivity at 2 Hz with a decline at 4 Hz.
Is temporal processing capacity organized hierarchically among sets of regions processing the same domain?
To further discount the hierarchy hypothesis, we must rule out the possibility that domain-specific temporal processing capacity systematically declines across the ventral stream hierarchy when considering sets of regions with the same category preference. Therefore, for each ROI and subject we estimated the presentation rate that produced maximal selectivity as a neural marker of temporal capacity. We then examined if this peak varied across regions in the same anatomical section selective to different categories and across regions at different stages of the hierarchy processing the same domain.
Our results indicate that the rates estimated to elicit maximal selectivity are similar across regions processing the same domain across anatomical locations (no main effect of ROI, Fs < 0.89, ps > 0.05, ANOVA on peak rates estimated from selectivity for sets of regions with the same category preference; Fig. 4, dashed horizontal bars), but vary considerably across regions processing different domains at each stage of the hierarchy (main effect of ROI, Fs > 15.03, ps < 0.001, ANOVA on peak rates estimated from selectivity for sets of regions in each anatomical section). As shown in Table 1, regions processing characters show maximal selectivity at a similar rate in Experiment 1 (IOS: 1.94 ± 0.15 Hz, average ± SEM; pOTS: 1.97 ± 0.14 Hz; mOTS: 2.07 ± 0.30 Hz). The same is true for regions processing bodies (LOS: 5.12 ± 0.14 Hz; ITG: 4.84 ± 0.41 Hz; MTG: 4.79 ± 0.37 Hz; OTS: 5.10 ± 0.55 Hz), faces (IOG: 2.98 ± 0.39 Hz; pFus: 3.20 ± 0.36 Hz; mFus: 3.09 ± 0.30 Hz), or places (TOS: 2.28 ± 0.35 Hz; CoS: 3.07 ± 0.28 Hz; RSC: 2.65 ± 0.35 Hz). Consistent with these findings, selectivity does not differ as a function of rate across regions processing the same domain (no rate by ROI interaction, Fs < 2.22, ps > 0.05, repeated-measures ANOVA for sets of regions with the same category preference with factors of rate and ROI; Fig. 4), providing further evidence against the hierarchy hypothesis and in support of domain-specific temporal processing capacities in high-level visual cortex.
How does presentation rate affect response amplitude?
In principle, selectivity for a preferred category (e.g., a t-value of face selectivity in face-selective regions) and response amplitudes to a preferred category (e.g., responses to faces in face-selective regions) are closely related. In practice, the presentation rate eliciting peak selectivity may differ from the presentation rate eliciting maximal amplitude if the effect of rate differs across responses to preferred and non-preferred stimuli in that region.
In fact, we find differential effects of rate on responses to preferred and non-preferred stimuli in all category-selective regions in Experiment 1 (Fig. 5). Comparing the effect of rate on response amplitudes to preferred and non-preferred stimuli among sets of regions selective to the same category, we find that rate-dependent modulation of amplitude differs between the preferred and non-preferred categories [rate by preference interaction, Fs > 4.55, ps < 0.05, repeated-measures ANOVA for sets of regions with the same category preference with factors of rate, preference (preferred/non-preferred stimuli), and ROI, but no significant three-way interaction, Fs < 0.93, ps > 0.05]. That is, presentation rate generally has a more prominent effect on response amplitudes to preferred than to non-preferred stimuli.
Closely replicating the selectivity data, the rate producing the peak response amplitude for the preferred category does not differ across regions processing the same domain (no main effect of ROI, Fs < 0.72, ps > 0.05, ANOVA on peak rates estimated from amplitude to the preferred category for sets of regions with the same category preference; Fig. 5, dashed horizontal bars). In contrast, the rate producing maximal response to the region's preferred category varies significantly across regions selective to different domains at each stage of the hierarchy (main effect of ROI, Fs > 6.03, ps < 0.01, ANOVA on peak rates estimated from amplitude for regions in each anatomical section). Likewise, the rank ordering of processing capacities associated with different domains observed in selectivity data (Fig. 4) is replicated in the amplitude of response for the preferred category and in the contrast effect size (CES; the difference in response amplitudes between preferred and non-preferred stimuli), as summarized in Table 1. That is, regardless of the metric used to estimate temporal capacity, we find that the presentation rates estimated to elicit peak responses are slowest for regions processing characters, intermediate among regions preferring places and faces, and fastest for regions processing bodies.
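To make the CES metric concrete, the sketch below computes it at each presentation rate as the response to the preferred category minus the response to non-preferred stimuli. Averaging across the non-preferred categories is an assumption made for illustration, as are the function name and all numerical values; the text defines CES only as a difference in response amplitudes.

```python
def contrast_effect_size(preferred, nonpreferred_by_category):
    """Return the per-rate CES: preferred minus mean non-preferred response.

    preferred: list of response amplitudes, one per presentation rate.
    nonpreferred_by_category: list of lists, one inner list per non-preferred
        category, each with one amplitude per rate.
    Averaging over non-preferred categories is an illustrative assumption.
    """
    n_categories = len(nonpreferred_by_category)
    ces = []
    for r, pref in enumerate(preferred):
        mean_nonpref = sum(cat[r] for cat in nonpreferred_by_category) / n_categories
        ces.append(pref - mean_nonpref)
    return ces


# Hypothetical amplitudes (% signal change) at four rates for one ROI.
rates_hz = [1, 2, 4, 8]
preferred = [1.0, 1.5, 2.0, 1.0]
nonpreferred = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 1.0, 0.5, 0.5],
    [0.5, 0.0, 0.5, 0.5],
]
ces = contrast_effect_size(preferred, nonpreferred)
best_rate = rates_hz[ces.index(max(ces))]  # sampled rate with the largest CES
```

Because CES subtracts the non-preferred response, it can peak at a different rate than raw amplitude whenever rate modulates the two responses differently.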
Are domain-specific temporal frequencies in high-level visual cortex driven by low-level differences among stimulus categories?
Low-level differences among images of different categories may contribute to the domain-specific temporal processing capacities that we observe in high-level visual cortex. To address this concern, we conducted a second experiment as a control in eight of the subjects who participated in Experiment 1, using a modified stimulus set designed to minimize categorical differences across multiple low-level properties including contrast, luminance, within-category image similarity, visual field coverage, and spatial frequency (see Materials and Methods; Fig. 1C). Analyses of Experiment 2 data indicate that it is unlikely that low-level image properties drive capacity differences across domains.
First, to validate the low-level controls implemented in Experiment 2, we examined V1 responses. There are no significant differences in the response amplitudes of V1 to different stimulus categories across rates (no main effect of domain, F(3,21) = 1.46, p > 0.05, repeated-measures ANOVA with factors of rate and domain; Fig. 6, V1), and we do not find variable effects of rate on V1 responses for different categories (no rate by domain interaction, F(9,63) = 1.79, p > 0.05). In contrast to V1, category-selective regions show higher response amplitudes to particular domains across rates (main effect of domain, Fs > 16.26, ps < 0.001, repeated-measures ANOVA with factors of rate and domain for each ROI; Fig. 6). Further, category-selective regions, even those in LOTC that are considered lower in the hierarchy compared with VTC, show a significant rate by domain interaction (Fs > 2.05, ps < 0.05; Fig. 6).
Second, despite changing our stimuli in Experiment 2, we find that presentation rate significantly modulates category selectivity in a domain-specific manner. Each hierarchical tier contains regions with different temporal capacities (rate by domain interaction, Fs > 2.10, ps < 0.03, ANOVA for regions in each anatomical section with factors of rate, preferred domain, and experiment), and these effects are consistent across experiments (no three-way interaction between rate, domain, and experiment, Fs < 0.92, ps > 0.05).
Third, examination of response amplitudes among regions selective to the same category shows that presentation rate differentially modulates responses to preferred and non-preferred stimuli (rate by preference interaction, Fs > 2.89, ps < 0.04, ANOVA for sets of regions with the same category preference with factors of rate, preference, ROI, and experiment). These effects are similar across experiments (no three-way interaction between rate, preference, and experiment, Fs < 1.44, ps > 0.05; compare Figs. 5 and 7). Similar to Experiment 1, we also find that responses to the preferred category show the same rate modulation among regions preferring the same category (main effect of rate, Fs > 9.32, ps < 0.001, ANOVA for sets of regions with the same category preference with factors of rate, ROI, and experiment, but there are no two-way or three-way interactions, Fs < 2.73, ps > 0.05; Fig. 7, Preferred).
These analyses validate that the domain-specific temporal processing capacities observed in Experiment 1 are reproducible and are not driven by low-level differences among stimuli or inherited from V1.
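The degrees of freedom reported for the V1 analysis above, F(3,21) for the main effect of domain and F(9,63) for the rate by domain interaction, follow from a fully within-subject design. The sketch below reproduces them under the assumption of eight subjects, four presentation rates, and four domains; the subject count is stated in the text, whereas the number of rate levels is inferred from the reported degrees of freedom. The function name is hypothetical.

```python
def rm_anova_dfs(n_subjects, n_rates, n_domains):
    """Degrees of freedom for a two-way repeated-measures ANOVA.

    Each effect is tested against its own effect-by-subject error term, so the
    error df equals the effect df multiplied by (n_subjects - 1).
    Returns a dict mapping effect name to (df_effect, df_error).
    """
    df_subjects = n_subjects - 1
    df_rate = n_rates - 1
    df_domain = n_domains - 1
    df_interaction = df_rate * df_domain
    return {
        "rate": (df_rate, df_rate * df_subjects),
        "domain": (df_domain, df_domain * df_subjects),
        "rate_x_domain": (df_interaction, df_interaction * df_subjects),
    }


# Eight subjects, four rates, four domains (characters, bodies, faces, places).
print(rm_anova_dfs(n_subjects=8, n_rates=4, n_domains=4))
# → {'rate': (3, 21), 'domain': (3, 21), 'rate_x_domain': (9, 63)}
```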
Are domain-specific temporal processing capacity estimates consistent across experiments and metrics?
To establish the overall impact of our stimulus manipulations, we compared capacity estimates across metrics and experiments (Table 1). Consistent with prior analyses, the largest effect is that temporal capacity estimates differ between regions selective to different domains at each stage of the processing hierarchy. This differentiation is consistent across experiments and metrics (main effect of domain, Fs > 22.60, ps < 0.001, ANOVA on all peak rate estimates from Experiments 1 and 2 for regions in the same anatomical section with factors of preferred domain, experiment, and metric, but no main effect of experiment or metric, Fs < 3.80, ps > 0.05, and no two-way or three-way interactions, Fs < 0.61, ps > 0.05).
Nevertheless, controlling for low-level differences among stimuli does produce some numerical variations of peak capacity estimates within a domain. For example, face-selective ROIs show slightly higher capacities in Experiment 2 compared with Experiment 1, while body-selective ROIs show the opposite trend (main effect of experiment, Fs > 5.87, ps < 0.05, ANOVA on all peak rate estimates from Experiments 1 and 2 for body- and face-selective regions separately with factors of ROI, experiment, and metric; Table 1). Also, some character- and place-selective regions exhibit shifts in capacity estimates across experiments (ROI by experiment interaction, Fs > 5.23, ps < 0.01, ANOVA on all peak rate estimates from Experiments 1 and 2 for character- and place-selective regions separately with factors of ROI, experiment, and metric, but no main effect of experiment, Fs < 1.63, ps > 0.05).
Critically, despite numerical shifts within domains, the ranking of processing capacities for regions processing different domains across experiments is consistent: on average and across all metrics, character-selective IOS, pOTS, and mOTS are the slowest three regions; body-selective LOS, ITG, MTG, and OTS are the fastest four regions; and all face-selective regions are intermediate, but faster than place-selective regions (Table 1).
Does domain-specific temporal processing capacity have a behavioral impact?
We hypothesized that if domain-specific temporal processing capacities serve as bottlenecks for perception, there should also be category-specific differences in temporal encoding capacity (i.e., behavioral measures of temporal capacity). To test this hypothesis, we conducted a behavioral experiment during which our subjects viewed encoding stimuli varying in both presentation rate and domain. As in the fMRI experiments, all encoding trials contained eight stimuli, but trials varied in their presentation rate (Fig. 8A; see Materials and Methods). To evaluate temporal encoding capacity, subjects were probed with an image 3 s after the end of the encoding interval and reported if the probe had appeared (or not) in the preceding encoding interval. We reasoned that accuracy should be high for stimuli presented at rates at or below encoding capacity, but would decline at rates exceeding the encoding capacity.
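The logic of the encoding task can be summarized in a short sketch. Because every trial contains eight stimuli, the encoding interval shortens as the rate increases; the duration computed below assumes that stimuli follow one another without gaps, which is a simplification (see Materials and Methods for the actual timing). Accuracy is scored as the proportion of probes correctly reported as shown or not shown. The trial record format and all values are hypothetical.

```python
from collections import defaultdict

STIMULI_PER_TRIAL = 8  # stated in the text


def encoding_duration_s(rate_hz, n_stimuli=STIMULI_PER_TRIAL):
    """Length of the encoding interval, assuming back-to-back stimuli."""
    return n_stimuli / rate_hz


def accuracy_by_rate(trials):
    """Proportion of correct old/new judgments at each presentation rate.

    trials: iterable of (rate_hz, probe_was_shown, subject_said_shown) tuples.
    A response is correct when the subject's report matches whether the probe
    actually appeared in the preceding encoding interval.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rate_hz, probe_was_shown, subject_said_shown in trials:
        total[rate_hz] += 1
        correct[rate_hz] += int(probe_was_shown == subject_said_shown)
    return {rate: correct[rate] / total[rate] for rate in sorted(total)}


# Hypothetical trials: two at 2 Hz (one correct), one at 8 Hz (incorrect).
trials = [(2, True, True), (2, False, True), (8, True, False)]
print(accuracy_by_rate(trials))
```

Under the stated reasoning, encoding capacity for a domain would correspond to the fastest rate at which accuracy remains high before it begins to decline.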
Notably, rate affects encoding performance both when considering accuracy for each category individually (main effect of rate, Fs > 5.77, ps < 0.05, ANOVA on accuracy for each domain separately; Fig. 8B) and when analyzing performance across all domains (main effect of rate, F(3,21) = 57.18, p < 0.001, repeated-measures ANOVA with factors of rate and domain). Interestingly, the effect of rate on encoding capacity differs across domains (rate by domain interaction, F(9,63) = 4.82, p < 0.001), and several aspects of behavioral encoding capacity mirror the temporal processing capacity of high-level visual cortex. These effects are apparent in three ways. First, behavioral and neural capacity measures for characters peak at presentation rates of ∼2 Hz and drop sharply at faster rates. Second, both accuracy and cortical capacity for bodies are highest at 4 Hz. Third, encoding of faces at 8 Hz declines less than encoding of places at 8 Hz, consistent with the observation of higher selectivity to faces than places at this rate in LOTC (Fig. 3).
Despite these striking consistencies, we also observe some deviations between behavioral encoding and neural processing capacities. For example, behavioral encoding capacity for faces and places is highest at ∼1 Hz, which is lower than the processing capacity of these stimuli in high-level visual cortex both in selectivity and amplitude measures. Overall, these behavioral data provide striking evidence for domain-specific encoding capacities that are largely consistent with our fMRI findings.
According to present hierarchical models, the entirety of high-level visual cortex has a single, domain-general temporal processing capacity that is slower than that of earlier visual regions. Contrary to this model, our results indicate that domain specificity is a better predictor of temporal processing capacity in high-level visual cortex than anatomical location or processing stage in the visual hierarchy. Specifically, body-selective regions show the highest capacity, face- and place-selective regions exhibit an intermediate capacity, and regions preferring characters manifest the lowest capacity. In comparison, neural responses within early visual areas like V1 do not differ across domains as a function of presentation rate. Crucially, our findings are not attributable to low-level image differences among exemplars of different categories, are behaviorally relevant, and provide evidence for parallel domain-specific encoding capacities.
Estimates of temporal capacity are consistent across amplitude and selectivity metrics
Our data reveal a domain-specific capacity across selectivity, amplitude, and contrast effect size metrics with a consistent rank ordering of capacities of different domains across metrics and experiments (Table 1). Across both experiments, we also observe differential effects of rate on response amplitudes to preferred and non-preferred categories in category-selective ROIs. This difference is a consequence of two factors. First, in some ROIs, especially those in mid VTC, responses to non-preferred categories are low. Thus, we do not find a rate effect modulating the response to non-preferred stimuli (Fig. 5), perhaps because of a floor effect. Second, in other ROIs, especially those in LOTC, responses to non-preferred stimuli are lower than to preferred stimuli, but show a differential rate modulation (Fig. 6). This suggests the intriguing possibility that processing capacity in high-level visual cortex is domain specific even within individual ROIs. Alternatively, as we are measuring fMRI signals with voxels of 2.4 mm on a side, and category-selective regions with different preferences also neighbor one another, it is possible that the differential rate effects across categories in an ROI arise from partial voluming effects. Future fMRI-guided neurophysiology experiments may shed light on these alternatives.
While we provide the first evidence for differences in temporal capacity between body- and character-selective regions, our estimates of similar capacity for face- and place-selective regions agree with prior studies (McKeeff et al., 2007; Gauthier et al., 2012; Gentile and Rossion, 2014). Our capacity estimates are slightly lower than those of these previous studies, perhaps because we controlled the number of stimuli across rates (these factors were previously confounded; see Materials and Methods), but the consistency of our findings across amplitude, CES, and selectivity metrics and their stability across experiments indicate that our measurements are reliable. Nevertheless, we are cognizant that multiple factors may impact capacity. Bottom-up factors such as reduced contrast or increased noise may reduce neural processing capacity. Top-down factors such as familiarity may enhance capacity. For example, high-frequency words might be processed faster than pseudowords. Future experiments can be used to evaluate the effects of additional bottom-up and top-down factors on neural processing capacity.
Visual encoding capacity is coupled with neural capacity in high-level visual cortex
The results of our behavioral experiment reveal that encoding capacities for complex visual stimuli deteriorate at distinct presentation rates for different domains, but we find that performance generally decreases for all categories at 8 Hz. Nevertheless, this should not be interpreted as a fixed capacity limitation on visual processing given the results of previous studies showing that behavioral capacity varies considerably across tasks (Potter, 1976; Potter and Fox, 2009) and is modulated by a range of high-level factors including stimulus complexity and familiarity (Näsänen et al., 2006).
Considering the variety of factors known to modulate temporal capacity, the consistencies we observe between behavioral and neural capacities are striking, but also not unreasonable given the actions we took to eliminate categorical biases in our stimulus set. Across categories, images were matched on low-level properties such as contrast, luminance, and spatial frequency, and on high-level parameters such as familiarity and memorability. When controlling for these factors, our results suggest that temporal capacities for encoding stimuli with enough detail to allow accurate within-category recognition are closely coupled with neural capacity estimates in regions preferring characters and bodies. Nevertheless, it is possible that neural capacity estimates reflecting the rate at which selectivity or response amplitudes fall to floor (instead of the rate at which they peak) are coupled with capacities for simpler perceptual operations. This hypothesis can be examined in future research comparing behavioral capacities for encoding with those for other tasks such as detection and categorization (Grill-Spector and Kanwisher, 2005; Fei-Fei et al., 2007).
Temporal encoding capacity in high-level visual cortex is domain specific
At the theoretical level, our results suggest a new view of high-level visual cortex in which regions processing related content have common temporal processing capacities that are not hierarchically organized from LOTC to VTC. In general, this finding is consistent with the notion that regions selective to particular categories operate as networks that are synchronized to avoid a processing bottleneck. While the underlying source of capacity differences across high-level visual cortex is uncertain, we hypothesize that two factors may contribute: (1) regularities in the temporal characteristics of exemplars of various domains in the real world and (2) the amount of time required to process domain-specific information.
For instance, lower temporal processing capacities of character-selective regions may be related to the stationary nature of characters. These lower capacities may also be due to the recruitment of additional cortical areas processing linguistic or semantic information when viewing written characters. On the other hand, the high capacity of body-selective regions may result from the necessity to perform fast computations of rapidly evolving, nonrigid body movements. Consistent with this speculation, computational models of biological motion perception posit the existence of “snapshot” neurons selective to brief frames of action sequences (Giese and Poggio, 2003). Interestingly, behavioral encoding performance for pseudowords and bodies peaked at the same rates that produced the highest selectivity in ROIs preferring these domains, respectively. This suggests that the temporal processing capacity of these regions may influence the rate of behavioral encoding specifically for items from these categories. Considering other examined domains, both faces and places contain more malleable features than characters, but their large-scale configurations are more rigid than those of bodies (e.g., eyes and clouds remain above most other features of faces and places, respectively). Thus, the present evidence supports the theory that the natural statistics of temporal information within the visual world may generate domain-specific temporal encoding capacities.
Temporal processing capacity contributes to the organization of high-level visual cortex
We propose that temporal processing capacity contributes to the segregation of category-selective regions in high-level visual cortex. It is interesting to speculate how this segregation may develop across the ventral stream hierarchy. From its origin in the retina and lateral geniculate nucleus, the visual system contains segregated temporal channels for faster processing in the magnocellular (M) pathway and slower processing in the parvocellular (P) pathway (Derrington and Lennie, 1984; Schneider et al., 2004). This segregation is further propagated into cortex in V1 (Sun et al., 2007) and continues downstream to V2 (Munk et al., 1995) and the middle temporal visual area (MT; Maunsell et al., 1990). Thus, it is possible that the propagation of channels with separable temporal processing capacity continues even further and influences the organization of high-level visual cortex.
Other fundamental properties that prevail across the visual system such as segregation of neurons processing foveal and peripheral stimuli have also been proposed to constrain the topology of high-level visual cortex. For example, viewing biases that are associated with particular domains (e.g., people tend to foveate on faces and characters) may generate the particular eccentricity biases associated with category-selective regions (Levy et al., 2001; Hasson et al., 2002, 2003; Kay et al., 2015). Similar to the eccentricity bias hypothesis (Malach et al., 2002), the topology of category-selective regions may also be determined by “temporal biases” of stimuli in the natural world that are mapped onto regions with differential contribution of M and P inputs. These are not mutually exclusive organizational principles. On the contrary, we propose that the functional organization of high-level visual cortex may be constrained both by spatial and temporal capacity limitations, the combination of which is domain specific. Thus, we suggest that both eccentricity and temporal biases may contribute to the functional organization of high-level visual cortex. For example, regions selective to faces and places have similar temporal processing capacity, but remain segregated as a result of their association with different eccentricity biases and viewing patterns. On the other hand, regions processing faces and characters have similar foveal biases but distinct temporal dynamics, which may guide segregation of associated neural processing in lateral VTC. Our results suggest that temporal processing capacity is one of multiple dimensions (Huth et al., 2012) contributing to segregation of functional regions, and that temporal capacity is a characteristic of domain-specific networks in high-level visual cortex.
Supplemental material for this article is available at vpnl.stanford.edu/fLoc/. This material contains stimuli and code for a functional localizer experiment to define category-selective visual regions. This material has not been peer reviewed.
This research was funded by National Institutes of Health Grant 1R01EY02391501A1. We thank Samantha Guiry for help generating stimuli.
The authors declare no competing financial interests.
Correspondence should be addressed to Anthony Stigliani, Department of Psychology, Jordan Hall, Stanford University, Stanford, CA 94305.