Abstract
Visual object recognition relies on elaborate sensory processes that transform retinal inputs to object representations, but it also requires decision-making processes that read out object representations and function over prolonged time scales. The computational properties of these decision-making processes remain underexplored for object recognition. Here, we study these computations by developing a stochastic multifeature face categorization task. Using quantitative models and tight control of spatiotemporal visual information, we demonstrate that human subjects (five males, eight females) categorize faces through an integration process that first linearly adds the evidence conferred by task-relevant features over space to create aggregated momentary evidence and then linearly integrates it over time with minimum information loss. Discrimination of stimuli along different category boundaries (e.g., identity or expression of a face) is implemented by adjusting feature weights of spatial integration. This linear but flexible integration process over space and time bridges past studies on simple perceptual decisions to complex object recognition behavior.
SIGNIFICANCE STATEMENT Although simple perceptual decision-making such as discrimination of random dot motion has been successfully explained as accumulation of sensory evidence, we lack rigorous experimental paradigms to study the mechanisms underlying complex perceptual decision-making such as discrimination of naturalistic faces. We develop a stochastic multifeature face categorization task as a systematic approach to quantify the properties and potential limitations of the decision-making processes during object recognition. We show that human face categorization could be modeled as a linear integration of sensory evidence over space and time. Our framework to study object recognition as a spatiotemporal integration process is broadly applicable to other object categories and bridges past studies of object recognition and perceptual decision-making.
- face recognition
- flexible decision making
- linear spatiotemporal integration
- feature combination
- bounded accumulation of evidence
- reverse correlation
- psychophysics
Introduction
Accurate and fast discrimination of visual objects is essential to guide our behavior in complex and dynamic environments. Previous studies largely focused on the elaborate sensory mechanisms that transform visual inputs to object-selective neural responses in the inferior temporal cortex of the primate brain through a set of representational changes along the ventral visual pathway (Riesenhuber and Poggio, 1999; DiCarlo and Cox, 2007; Freiwald and Tsao, 2010; Yamins et al., 2014). However, goal-directed behavior also requires decision-making processes that can flexibly read out sensory representations and guide actions based on them as well as information about the environment, behavioral goals, and expected costs and gains. Such processes have been extensively examined using simplified sensory stimuli that vary along a single dimension, for example, the direction of moving dots changing from left to right (Palmer et al., 2005). For those stimuli, subjects' behavior could be successfully accounted for by flexible mechanisms that accumulate sensory evidence and combine it with task-relevant information (Ratcliff and Rouder, 1998; Gold and Shadlen, 2007). However, more complex visual decisions based on stimuli defined by multiple features, such as object images, remain underexplored, although the need for such tests is gaining significance, and important steps are being taken in this direction (Heekeren et al., 2004; Philiastides and Sajda, 2006; Philiastides et al., 2014; Zhan et al., 2019).
Here, we apply the quantitative approach developed for studying simple perceptual decisions to investigate face recognition. We focus on face recognition because it is by far the most extensively studied among the subdomains of object vision (Kanwisher and Yovel, 2006; Tsao and Livingstone, 2008; Barraclough and Perrett, 2011; Rossion, 2014; Perrodin et al., 2015). Face stimuli are also convenient to use because they allow quantitative manipulation of sensory information pivotal for mechanistic characterization of the decision-making process (Waskom et al., 2019); images can be decomposed into local spatial parts (e.g., eyes, nose, mouth) and can be morphed between two instances (e.g., faces of two individuals) to create a parametric stimulus set. At the same time, human face perception is highly elaborate and embodies the central challenge of object recognition that must distinguish different identities from complex visual appearances (Tsao and Livingstone, 2008).
To quantitatively characterize the decision-making process, we investigate face recognition as a process of combining sensory evidence over both space and time. Faces are thought to be processed holistically (Maurer et al., 2002; Richler et al., 2012); breaking the configuration of facial images significantly affects face perception, indicating spatial interactions across facial parts. However, computational properties of the spatial integration remain elusive (Richler et al., 2012). One may consider that holistic recognition arises from nonlinear integration of facial features (Shen and Palmeri, 2015), but linear integration may also suffice to account for holistic effects (Gold et al., 2012). Furthermore, humans flexibly use different facial parts to categorize faces according to their behavioral needs (e.g., discrimination of identity vs expression; Schyns et al., 2002, 2007), but the underlying mechanisms of this flexibility also remain underexplored.
In addition to spatial properties, face and object recognition also include rich temporal dynamics. Although object identification and categorization are usually fast, reaction times (RTs) are often hundreds of milliseconds longer (Gauthier et al., 1998; Kampf et al., 2002; Ramon et al., 2011; Carlson et al., 2014; Witthoft et al., 2018) than the time required for a feedforward sweep along the ventral visual pathway (Thorpe et al., 1996; Hung et al., 2005). Furthermore, recognition performance follows a speed-accuracy trade-off, where additional time improves accuracy (Thorpe et al., 1996; Gauthier et al., 1997). Together, these observations suggest that the decision-making process in face and object recognition is not instantaneous but unfolds over time (Heitz and Schall, 2012; Hanks et al., 2014). However, the computational properties have scarcely been characterized.
Using our novel face categorization tasks that tightly control spatiotemporal sensory information (Okazawa et al., 2018, 2021), we show that human subjects categorize faces by linearly integrating visual information over space and time. Spatial features are weighted nonuniformly and integrated largely linearly to form momentary evidence, which is then accumulated over time to generate a decision variable that guides the behavior. The temporal accumulation is also linear, and the time constant is quite long, preventing significant loss of information (or leak) during the decision-making process. Between identity and expression categorizations, the weighting for spatial integration flexibly changes to accommodate task demands. Together, we offer a novel framework to study face recognition as a spatiotemporal integration process, which unifies two rich veins of visual research, namely, object recognition and perceptual decision-making.
Materials and Methods
Observers and experimental setup
Thirteen human observers (18–35 years of age, five males and eight females recruited from students and staff at New York University) participated in the experiments. Observers had normal or corrected-to-normal vision. They were naive to the purpose of the experiment, except for one observer who is an author (G.O.). They all provided informed written consent before participation. All experimental procedures were approved by the Institutional Review Board at New York University.
Throughout the experiment, subjects were seated in an adjustable chair in a semidark room with chin and forehead supported before a CRT display monitor (21-inch Sony GDM-5402; 75 Hz refresh rate; 1600 × 1200 pixels screen resolution; 52 cm viewing distance). Stimulus presentation was controlled with the Psychophysics Toolbox (Brainard, 1997) and MATLAB (MathWorks). Eye movements were monitored using a high-speed infrared camera (EyeLink, SR Research). Gaze position was recorded at 1 kHz.
Experimental design
Stochastic multifeature face categorization task
The task required the classification of faces into two categories, each defined by a prototype face (Fig. 1A,B). The subject initiated each trial by fixating a small red point at the center of the screen [fixation point (FP), 0.3° diameter]. After a short delay (200–500 ms, truncated exponential distribution), two targets appeared 5° above and below the FP to indicate the two possible face category choices (category 1 or 2). Simultaneously with the target onset, a face stimulus (2.18° × 2.83°, ∼83 × 108 pixels) appeared on the screen parafoveally (stimulus center 1.8° to the left or right of the FP, counterbalanced across subjects; results were similar for the two sides). We placed the stimuli parafoveally, aiming to present the informative facial features at comparable visual eccentricities and yet keep the stimuli close enough to the fovea to take advantage of the foveal bias for face perception (Levy et al., 2001; Kreichman et al., 2020). The parafoveal presentation also enabled us to control subjects' fixation so that small eye movements (e.g., microsaccades) within the acceptable fixation window did not substantially change the sensory inputs. Subjects reported the face category by making a saccade to one of the two targets as soon as they were ready. The stimulus was extinguished immediately after the saccade initiation. Reaction times were calculated as the time from the stimulus onset to the saccade initiation. If subjects failed to make a choice within 5 s, the trial was aborted (0.101% of trials). To manipulate task difficulty, we created a morph continuum between the two prototypes and presented intermediate morphed faces on different trials (see below). Distinct auditory feedback was delivered for correct and error choices. When the face was ambiguous (halfway between the two prototypes on the morph continuum), correct feedback was delivered on a random half of trials.
Subjects could perform two categorization tasks, identity categorization (Fig. 1B, top) and expression categorization (Fig. 1B, bottom). The prototype faces for each task were chosen from the photographs of MacBrain Face Stimulus Set (Tottenham et al., 2009). For the illustrations of identity stimuli in Figure 1, A, B, and C, we used morphed images of two authors' faces to avoid copyright issues. We developed a custom algorithm that morphed different facial features (regions of the stimulus) independently between the two prototype faces. Our algorithm started with 97–103 manually matched anchor points on the prototypes and morphed one face into another by linear interpolation of the positions of anchor points and textures inside the tessellated triangles defined by the anchor points. The result was a perceptually seamless transformation of the geometry and internal features from one face to another. Our method enabled us to morph different regions of the faces independently. We focused on three key regions (eyes, nose, and mouth) and created an independent series of morphs for each one of them. The faces that were used in the task were composed of different morph levels of these three informative features. Anything outside those features was set to the halfway morph between the prototypes and thus was uninformative. The informativeness of the three features (stimulus strength) was defined based on the mixture of prototypes, spanning from –100% when the feature was identical to prototype 1 to +100% when it was identical to prototype 2 (Fig. 1C). At the middle of the morph line (0% morph), the feature was equally shaped by the two prototypes.
By varying the three features independently, we could study spatial integration by creating ambiguous stimuli in which different features could support different choices (Fig. 1C). We could also study temporal integration of features by varying the three discriminating features every 106.7 ms within each trial (Fig. 1D). This frame duration provided us with sufficiently precise measurements of subjects' temporal integration in their ∼1 s decision times while ensuring a smooth, subliminal transition between frames (see below). The stimulus strengths of the three features in each trial were drawn randomly from independent Gaussian distributions. The mean and SD of these distributions were the same for the three features and fixed within each trial, but the mean varied randomly from trial to trial. For the identity task, we tested the following seven mean stimulus strengths: –50%, –30%, –14%, 0%, +14%, +30%, and +50%. For the expression task, we used –50%, –20%, –10%, 0%, +10%, +20%, and +50%, except for subject 13, who had a higher behavioral threshold and was also exposed to ±80% morph levels. The SD was 20% morph level. Sampled values that fell outside the range of –100% to +100% (0.18% of samples) were replaced with new samples inside the range. Using larger SDs would have allowed us to sample a wider stimulus space, but we limited the SD to 20% morph level to keep the stimulus fluctuations subliminal, avoiding potential changes of decision strategy for vividly varying stimuli.
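For illustration, the per-frame sampling can be sketched as follows. This is a minimal Python sketch, not the original experiment code (which ran in MATLAB with the Psychophysics Toolbox); the function name, frame count, and random-number generator are our own choices.

```python
import numpy as np

def sample_feature_strengths(mean_strength, sd=20.0, n_frames=9, rng=None):
    """Draw per-frame morph levels (% morph) for eyes, nose, and mouth.

    Each feature is sampled independently from a Gaussian with the trial's
    nominal mean and a fixed SD; samples outside [-100, 100] are redrawn.
    """
    rng = np.random.default_rng() if rng is None else rng
    strengths = rng.normal(mean_strength, sd, size=(n_frames, 3))
    out_of_range = np.abs(strengths) > 100.0
    while out_of_range.any():                       # redraw out-of-range samples
        strengths[out_of_range] = rng.normal(mean_strength, sd, size=out_of_range.sum())
        out_of_range = np.abs(strengths) > 100.0
    return strengths                                # shape: (n_frames, 3 features)

# Example: one trial at +14% nominal mean, ~1 s of 106.7 ms stimulus frames
frames = sample_feature_strengths(mean_strength=14.0, n_frames=9)
```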
Changes in the stimulus within a trial were implemented in a subliminal fashion so that subjects did not consciously perceive variation of facial features, and yet their choices were influenced by these variations. We achieved this goal using a sequence of stimuli and masks within each trial (Movie 1). The stimuli were morphed faces with a particular combination of the three discriminating features. The masks were created by phase randomization (Heekeren et al., 2004) of the 0% morph face and therefore had largely matching spatial frequency content with the stimuli shown in the trial. The masks ensured that subjects did not consciously perceive minor changes in informative features over time within a trial. In debriefings following each experiment, subjects noted that they saw one face in each trial, but the face was covered with time-varying cloudy patterns (i.e., masks) over time.
For the majority of subjects (9 of 13), each stimulus was shown without a mask for one monitor frame (13.3 ms). Then, it gradually faded out over the next seven frames as a mask stimulus faded in. For these frames, the mask and the stimulus were linearly combined, pixel by pixel, according to a half-cosine weighting function, so that in the last frame, the weight of the mask was 1 and the weight of the stimulus was 0. Immediately afterward, a new stimulus frame with a new combination of informative features was shown, followed by another cycle of masking, and so on. For a minority of subjects (4 of 13), we replaced the half-cosine function for the transition of stimulus and mask with a full-cosine function, where each eight-frame cycle started with a mask, transitioned to an unmasked stimulus in frame 5, and transitioned back to a full mask by the beginning of the next cycle. We did not observe any noticeable difference in the results of the two presentation methods and combined data across subjects.
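The two presentation schemes can be summarized by the per-frame weight of the face stimulus relative to the mask. The sketch below is our own Python illustration of these weighting functions; the exact functional forms are assumptions consistent with the description above.

```python
import numpy as np

def stimulus_weights(scheme="half-cosine", frames_per_cycle=8):
    """Weight of the face stimulus (vs the mask) on each 13.3 ms monitor frame.

    'half-cosine': unmasked stimulus on frame 1, cosine fade-out to a full
    mask by frame 8. 'full-cosine': full mask on frame 1, unmasked stimulus
    on frame 5, and back to a full mask at the start of the next cycle.
    The mask weight on each frame is 1 minus the returned value.
    """
    i = np.arange(frames_per_cycle)
    if scheme == "half-cosine":
        return 0.5 * (1 + np.cos(np.pi * i / (frames_per_cycle - 1)))
    if scheme == "full-cosine":
        return 0.5 * (1 - np.cos(2 * np.pi * i / frames_per_cycle))
    raise ValueError("unknown scheme")

print(np.round(stimulus_weights("half-cosine"), 2))
# [1.   0.95 0.81 0.61 0.39 0.19 0.05 0.  ]
```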
Twelve subjects participated in the identity categorization task (35,300 total trials; mean ± SD trials per subject, 2942 ± 252). Seven subjects participated in the expression categorization task in separate sessions (20,225 total trials; trials per subject, 2889 ± 285). Six of the subjects performed both tasks. Our subject counts are comparable to previous studies of perceptual decision-making tasks (Levi et al., 2018; Stine et al., 2020). Collecting a large number of trials from individual subjects enabled detailed quantification of decision behavior for each subject (Smith and Little, 2018). Our results were highly consistent across subjects. A part of the data for the identity categorization task was previously published (Okazawa et al., 2018).
Odd-one-out discrimination task
Our behavioral analyses and decision-making models establish that subjects' choices in the identity and expression categorization tasks were differentially informed by the three facial features; choices were most sensitive to changes in the morph level of eyes for identity discrimination and changes in the morph level of mouth for expression discrimination (Fig. 2E,F). This task-dependent sensitivity to features could arise from two sources: different visual discriminability for the same features in the two tasks and/or unequal decision weights for informative features in the two tasks (see Fig. 10A). To determine the relative contributions of these factors, we designed an odd-one-out discrimination task to measure visual discriminability of different morph levels of informative features in the two tasks (see Fig. 10B).
On each trial, subjects viewed three stimuli presented sequentially at 1.8° eccentricity (similar to the categorization tasks). The stimuli appeared after the subject fixated a central FP and were shown for 320 ms each, with 500 ms interstimulus intervals. The three stimuli in a trial were the same facial feature (eyes, nose, or mouth) but had distinct morph levels, chosen randomly from the following set: –100%, –66%, –34%, 0%, +34%, +66%, +100%. Facial regions outside the target feature were masked by the background. The target feature varied randomly across trials. Subjects were instructed to report the odd stimulus in the sequence (the stimulus most distinct from the other two) by pressing one of the three response buttons within 2 s from the offset of the last stimulus.
Nine of the 12 subjects who participated in the identity categorization task also performed the odd-one-out discrimination task using identity stimuli in separate blocks of the same sessions. Three of the seven subjects who participated in the expression task performed the odd-one-out task using expression stimuli. For the identity stimuli, 13,648 trials were collected across the three features (nine subjects, 1516 ± 420 trials per subject, mean ± SD). For the expression stimuli, 3570 trials were collected (three subjects, 1190 ± 121 trials per subject).
Single-feature categorization task
As an alternative method to quantify the visual discriminability for individual facial features, we also performed a single-feature categorization task with a subset of subjects (see Fig. 11A). In this task, the subjects categorized the facial identities as in the main identity categorization task but based their decisions on only one facial feature shown on each trial. Facial regions outside the target feature were replaced by the background. The task structure was the same as that of the main task. Trials of the three facial features were randomly interleaved. To capture the full extent of psychometric functions, we used morph levels ranging from –150% to +150% (see Fig. 11B). The stimuli beyond 100% indicate extrapolation from the prototypes, but the extrapolated images looked natural within the tested range.
Four of the 12 subjects who performed the main identity categorization task also performed the single-feature task in the same sessions. We collected in total 5571 trials (1393 ± 117 trials per subject, mean ± SD).
Statistical analysis
Psychometric and chronometric functions
We assessed the effects of stimulus strength on the subject's performance by using logistic regression (Fig. 2A,B) as follows:

$$\mathrm{logit}\big[P(\mathrm{choice\ 2})\big] = \alpha_0 + \alpha_1 s \qquad (1)$$

where s is the nominal mean stimulus strength of the trial (% morph, positive for stimuli closer to prototype 2), α1 quantifies the sensitivity of choices to the stimulus strength, and α0 quantifies the choice bias.
The relationship between the stimulus strength and the subject's mean RTs was assessed using a hyperbolic tangent function (Shadlen et al., 2006; Fig. 2C,D) as follows:

$$T = \frac{B}{ks}\tanh(kBs) + T_0 \qquad (2)$$

where T is the mean RT, s is the nominal mean stimulus strength, k and B are parameters analogous to the drift rate sensitivity and bound height of a drift diffusion process, and T0 is the nondecision time that captures sensory and motor delays.
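For illustration, Equation 2 can be fit with standard nonlinear least squares. The sketch below is our own Python example on synthetic data, not the original analysis code; the parameter values, noise level, and use of scipy's curve_fit are our assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def chronometric(s, k, B, T0):
    """Mean RT as a function of stimulus strength s (Eq. 2); the s -> 0 limit is B**2 + T0."""
    x = k * B * np.abs(s)
    return np.where(x < 1e-9, B**2 + T0,
                    B / (k * np.abs(s) + 1e-12) * np.tanh(x) + T0)

rng = np.random.default_rng(0)
s = np.array([-0.5, -0.3, -0.14, 0.0, 0.14, 0.3, 0.5])             # strength (fraction morph)
rt = chronometric(s, 12.0, 0.8, 0.3) + rng.normal(0, 0.02, s.size)  # synthetic mean RTs (s)
(k, B, T0), _ = curve_fit(chronometric, s, rt, p0=[10.0, 0.7, 0.3])
```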
Psychophysical reverse correlation
To quantify the effect of stimulus fluctuations over time and space (facial features) on choice (Fig. 1D), we performed psychophysical reverse correlation (Ahumada, 1996; Okazawa et al., 2018). Psychophysical kernels were computed for each facial feature f (eyes, nose, mouth) as the difference of the average feature fluctuations conditioned on the subject's choices:

$$K_f(t) = \big\langle s_f(t)\,\big|\,\mathrm{choice\ 2}\big\rangle - \big\langle s_f(t)\,\big|\,\mathrm{choice\ 1}\big\rangle \qquad (3)$$

where s_f(t) is the fluctuation of the morph level of feature f at time t, obtained after subtracting the mean morph level of the trial. This analysis was limited to trials with low stimulus strengths (nominal mean morph levels of 0–14%), where choices were most strongly influenced by the stimulus fluctuations.
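A minimal sketch of this computation (our own Python illustration; the array layout and variable names are assumptions):

```python
import numpy as np

def psychophysical_kernels(fluct, choice):
    """Choice-conditioned difference of feature fluctuations (Eq. 3).

    fluct  : array (n_trials, n_frames, 3) of morph-level fluctuations of
             eyes/nose/mouth; frames after the RT are set to NaN
    choice : array (n_trials,) with values 1 or 2
    Returns an (n_frames, 3) kernel, one time course per feature.
    """
    fluct = np.asarray(fluct, dtype=float)
    choice = np.asarray(choice)
    return (np.nanmean(fluct[choice == 2], axis=0)
            - np.nanmean(fluct[choice == 1], axis=0))
```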
Joint psychometric function
To quantify the effect that cofluctuations of feature strengths have on choice, we computed the probability of choices as a function of the joint distribution of the stimulus strengths across trials (Fig. 3A,B). We constructed the joint distribution of the three features by calculating the average strength of each feature in the trial. Thus, one trial corresponds to a point in a 3D feature space (Fig. 1C). In this space, the probability of choice was computed within a Gaussian window with an SD of 4%. Figure 3, A and B, shows 2D intersections of this 3D space. We visualized the probability of choice by drawing iso-probability contours at 0.1 intervals. The trials of all stimulus strengths were included in this analysis, but similar results were also obtained by restricting the analysis to the low morph levels (≤14%). We aggregated data across all subjects, but similar results were observed within individual subjects.
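A sketch of how one such 2D map can be computed (our own Python illustration; the names and grid choices are assumptions):

```python
import numpy as np

def choice_probability_map(s1, s2, choice, grid, sd=4.0):
    """P(choice 2) over a 2D slice of the feature space.

    s1, s2 : per-trial mean morph levels (%) of two features
    choice : per-trial choices (1 or 2)
    grid   : 1D array of morph levels at which to evaluate the map
    sd     : SD (% morph) of the Gaussian smoothing window
    """
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    is2 = (np.asarray(choice) == 2).astype(float)
    g1, g2 = np.meshgrid(grid, grid)
    p = np.full(g1.shape, np.nan)
    for idx in np.ndindex(g1.shape):
        w = np.exp(-((s1 - g1[idx]) ** 2 + (s2 - g2[idx]) ** 2) / (2 * sd ** 2))
        if w.sum() > 0:
            p[idx] = np.sum(w * is2) / np.sum(w)   # Gaussian-weighted fraction of choice 2
    return p  # iso-probability contours: plt.contour(grid, grid, p, levels=np.arange(0.1, 1, 0.1))
```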
To quantify linear and multiplicative effects on joint psychometric functions, we performed the following logistic regression:

$$\mathrm{logit}\big[P(\mathrm{choice\ 2})\big] = \beta_0 + \beta_1\bar{s}_e + \beta_2\bar{s}_n + \beta_3\bar{s}_m + \beta_4\bar{s}_e\bar{s}_n + \beta_5\bar{s}_e\bar{s}_m + \beta_6\bar{s}_n\bar{s}_m \qquad (4)$$

where the regressors are the mean strengths of the eyes, nose, and mouth in the trial and their pairwise products; β1–β3 quantify the linear effects of the features, and β4–β6 quantify multiplicative (nonlinear) interactions between pairs of features.
Relationship between stimulus strength and subjective evidence
To quantitatively predict behavioral responses from stimulus parameters, one must first know the mapping function between the physical stimulus strength (morph level) and the amount of evidence subjects acquired from the stimulus. This mapping could be assessed by performing a logistic regression that relates choice to different ranges of stimulus strength (Fig. 4), similar to those performed in previous studies (Yang and Shadlen, 2007; Waskom and Kiani, 2018). For this analysis, we used the following regression:

$$\mathrm{logit}\big[P(\mathrm{choice\ 2})\big] = \beta_0 + \sum_{f\in\{e,n,m\}}\sum_{b=1}^{10}\beta_{f,b}\,n_{f,b}$$

where the morph levels of each feature f were divided into 10 bins, n_{f,b} is the number of stimulus frames in the trial in which feature f fell in bin b, and the fitted coefficients β_{f,b} quantify the subjective evidence (in log-odds units) conferred by that bin of morph levels (Fig. 4).
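A sketch of this analysis in Python (our own illustration, not the original code; the use of scikit-learn's logistic regression with weak regularization is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def subjective_evidence(stim, choice, n_bins=10):
    """Log-odds of choice 2 contributed by each morph-level bin of each feature.

    stim   : (n_trials, n_frames, 3) per-frame morph levels (%); NaN after the RT
    choice : (n_trials,) choices (1 or 2)
    """
    edges = np.linspace(-100, 100, n_bins + 1)
    edges[-1] += 1e-9                               # make the last bin inclusive of +100%
    X = np.zeros((stim.shape[0], 3 * n_bins))
    for f in range(3):                              # eyes, nose, mouth
        for b in range(n_bins):
            in_bin = (stim[:, :, f] >= edges[b]) & (stim[:, :, f] < edges[b + 1])
            X[:, f * n_bins + b] = in_bin.sum(axis=1)    # frame counts per trial (NaNs drop out)
    model = LogisticRegression(C=1e6, max_iter=1000)     # large C ~ negligible regularization
    model.fit(X, np.asarray(choice) == 2)
    return model.coef_.reshape(3, n_bins)                # log-odds per feature x morph-level bin
```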
Model fit and evaluation
To quantitatively examine the properties of the decision-making process, we fit several competing models to the subject's choices and RTs. Based on our earlier analyses (Figs. 3, 4), these models commonly use a linear mapping between feature morph levels and the evidence acquired from each feature, as well as linear functions for spatial integration of informative features in each frame. The combined momentary evidence from each stimulus frame was then integrated over time. Our main models are therefore extensions of the drift diffusion model, where fluctuations of the three informative facial features are accumulated toward decision bounds, and reaching a bound triggers a response after a nondecision time. Our simplest model used linear integration over time, whereas our more complex alternatives allowed leaky integration or dynamic changes of sensitivity over time. We also examined models that independently accumulate the evidence from each informative feature (i.e., three competing drift diffusion processes), where the decision and RT were determined by the first process reaching a bound. Below, we first provide the equations and intuitions for the simplest model and explain our fitting and evaluation procedures. Afterward, we explain the alternative models.
Spatial integration in our models linearly combines the strength of features at each time to calculate the momentary evidence conferred by a stimulus frame as follows:

$$\mu(t) = k_e\,s_e(t) + k_n\,s_n(t) + k_m\,s_m(t)$$

where s_e(t), s_n(t), and s_m(t) are the morph levels of the eyes, nose, and mouth in the stimulus frame shown at time t, and k_e, k_n, and k_m are the sensitivity parameters for the three features.
Overall, this linear integration model had six degrees of freedom: decision bound height (B), sensitivity parameters for the three informative features (k_e, k_n, k_m), and the mean and SD of the nondecision time.
To generate the model psychophysical kernels, we created a large set of simulated trials using each subject's best-fitting parameters and the same stimulus statistics as in the experiment, and applied the same reverse correlation analysis (Eq. 3) to the simulated choices and RTs.
The same fitting procedure was used for the alternative models explained below.
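The structure of these simulations can be sketched as follows. This is a minimal Monte Carlo illustration in Python; the discretization, noise scaling, and function names are our own assumptions, and it is not the actual fitting code.

```python
import numpy as np

def simulate_trial(frames, k, B, ndt_mean, ndt_sd,
                   frame_dur=0.1067, dt=0.0133, noise_sd=1.0, rng=None):
    """Simulate one trial of the linear spatiotemporal integration model.

    frames : (n_frames, 3) morph levels of eyes/nose/mouth on each stimulus frame
    k      : (3,) sensitivities for the three features
    B      : height of the symmetric decision bounds (+/- B)
    Returns (choice, rt) with choice in {1, 2}, or (None, None) if no bound is
    reached before the stimulus ends.
    """
    rng = np.random.default_rng() if rng is None else rng
    frames = np.asarray(frames, float)
    v, t = 0.0, 0.0                                   # decision variable and elapsed time
    for f in frames:
        mu = float(np.dot(k, f))                      # momentary evidence (drift) for this frame
        for _ in range(int(round(frame_dur / dt))):
            v += mu * dt + rng.normal(0.0, noise_sd * np.sqrt(dt))
            t += dt
            if abs(v) >= B:                           # bound crossing terminates the decision
                choice = 2 if v > 0 else 1
                return choice, t + max(rng.normal(ndt_mean, ndt_sd), 0.0)
    return None, None
```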
Leaky integration
To test the degree of temporal integration, we added a memory loss (leak) in the decision-making process. This model is implemented as an Ornstein-Uhlenbeck process, whose Fokker-Planck equation is the following:

$$\frac{\partial P(v,t)}{\partial t} = -\frac{\partial}{\partial v}\Big[\big(\mu(t)-\lambda v\big)P(v,t)\Big] + \frac{\sigma^2}{2}\frac{\partial^2 P(v,t)}{\partial v^2}$$

where P(v,t) is the probability density of the decision variable v at time t, μ(t) is the momentary evidence defined above, λ is the leak rate, and σ² is the variance of the diffusion noise. A leak rate of zero corresponds to perfect (lossless) integration, whereas an infinitely large leak rate reduces the model to a memory-less extrema detection process; intermediate leak rates discount older evidence with an effective time constant of 1/λ. The leaky integration model therefore has one free parameter (λ) in addition to those of the linear integration model.
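In discrete time, the leak amounts to a decay term in the accumulation update; the sketch below (our own Python illustration, sharing the conventions of the simulation sketch above) shows the update rule.

```python
import numpy as np

def leaky_accumulate(mu, lam, dt=0.0133, noise_sd=1.0, rng=None):
    """Accumulate a time series of momentary evidence with leak rate lam (1/s).

    mu : (n_steps,) momentary evidence at each time step (e.g., k . s(t))
    Returns the decision-variable trajectory. lam = 0 gives perfect integration;
    large lam discounts older evidence with an effective time constant of 1/lam.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = np.zeros(len(mu) + 1)
    for i, m in enumerate(mu):
        v[i + 1] = v[i] + (m - lam * v[i]) * dt + rng.normal(0.0, noise_sd * np.sqrt(dt))
    return v[1:]
```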
Dynamic sensitivity
To test whether the effect of sensory evidence on choice is constant over time, we allowed sensitivity to features to be modulated dynamically. To capture both linear and nonlinear temporal changes, the modulation included linear (γ1) and quadratic (γ2) terms of time, such that the momentary evidence at time t was scaled by 1 + γ1 t + γ2 t². When γ1 = γ2 = 0, this model reduces to the main model with constant sensitivity; the extension therefore adds two free parameters.
Parallel accumulation of evidence from three facial features
The models above first integrated the evidence conferred by the three informative facial features (spatial integration) and then accumulated this aggregated momentary evidence over time. We also considered alternative models, in which evidence from each feature was accumulated independently over time. These models therefore included three competing accumulators. Each accumulator received momentary evidence from one feature with fixed sensitivity (k_e, k_n, or k_m) and integrated it over time until the first accumulator reached one of its decision bounds (±B), which terminated the decision-making process and triggered a response after a nondecision time. In the simplest variant (see Fig. 8A), the choice was determined by the sign of the bound crossed by the winning accumulator.
To further explore different decision rules, we constructed two variants of the parallel accumulation model (see Fig. 9A,B). In the first variant, the decision was based on the sign of the majority of the accumulators (i.e., two or more of three accumulators) at the moment when one accumulator reached the bound. In the second variant, the decision was based on the sign of the sum of the decision variables across the three accumulators at the time when one accumulator reached the bound. All model variants had six free parameters (B, k_e, k_n, k_m, and the mean and SD of the nondecision time), matching the degrees of freedom of the main model.
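The three decision rules can be illustrated with the following sketch (our own Python example, sharing the conventions of the earlier simulation sketch; it is not the fitting code).

```python
import numpy as np

def simulate_parallel_trial(frames, k, B, rule="first",
                            frame_dur=0.1067, dt=0.0133, noise_sd=1.0, rng=None):
    """Three feature-wise accumulators; the first to reach +/-B ends the trial.

    rule : 'first'    - choice follows the sign of the winning accumulator (Fig. 8A)
           'majority' - choice follows the sign of the majority of accumulators (Fig. 9A)
           'sum'      - choice follows the sign of the summed decision variables (Fig. 9B)
    Returns (choice, decision_time) or (None, None) if no accumulator reaches the bound.
    """
    rng = np.random.default_rng() if rng is None else rng
    frames, k = np.asarray(frames, float), np.asarray(k, float)
    v, t = np.zeros(3), 0.0                            # one accumulator per feature
    for f in frames:
        for _ in range(int(round(frame_dur / dt))):
            v += k * f * dt + rng.normal(0.0, noise_sd * np.sqrt(dt), 3)
            t += dt
            if (np.abs(v) >= B).any():
                if rule == "first":
                    s = np.sign(v[np.argmax(np.abs(v))])
                elif rule == "majority":
                    s = np.sign(np.sign(v).sum())      # ties resolve toward choice 1 here
                else:                                  # 'sum'
                    s = np.sign(v.sum())
                return (2 if s > 0 else 1), t          # nondecision time not included
    return None, None
```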
Analysis of odd-one-out discrimination task
We used subjects' choices in the odd-one-out task to estimate visual discriminability of different morph levels of the informative features. We adopted an ideal observer model developed by Maloney and Yang (2003), where the perception of morph level i of feature f is defined as a Gaussian distribution with mean μ_{f,i} (the average percept of that morph level) and SD σ_f (the representational noise of the feature, assumed to be constant across its morph levels). On each trial, the percepts of the three presented morph levels are independent draws from their respective distributions, and the ideal observer chooses the stimulus whose percept is most distant from the percepts of the other two. For three presented morph levels i, j, and k of feature f, the probability of choosing stimulus i as the odd one is therefore

$$P(\mathrm{choose}\ i) = P\Big(|\hat{x}_i-\hat{x}_j| > |\hat{x}_j-\hat{x}_k| \;\mathrm{and}\; |\hat{x}_i-\hat{x}_k| > |\hat{x}_j-\hat{x}_k|\Big)$$

where the hats denote the noisy percepts of the three stimuli.
The probabilities on the right side of the equation can be derived from the joint Gaussian distribution of the percepts (e.g., by numerical integration or Monte Carlo simulation), allowing us to compute the likelihood of the subject's choice on each trial for any setting of the model parameters.
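For example, the choice probabilities for a given triad can be approximated by Monte Carlo simulation of the percepts, as in the following sketch (our own Python illustration under the notational assumptions above; the example values are hypothetical).

```python
import numpy as np

def oddity_choice_probs(mu, sigma, n_samples=100_000, rng=None):
    """P(choosing each of three stimuli as the odd one) under the ideal observer.

    mu    : (3,) mean percepts of the three presented morph levels of a feature
    sigma : representational noise (SD) of that feature
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(mu, sigma, size=(n_samples, 3))     # noisy percepts on simulated trials
    d01 = np.abs(x[:, 0] - x[:, 1])                    # pairwise perceptual distances
    d02 = np.abs(x[:, 0] - x[:, 2])
    d12 = np.abs(x[:, 1] - x[:, 2])
    odd0 = (d01 > d12) & (d02 > d12)                   # stimulus 0 is farthest from the other two
    odd1 = (d01 > d02) & (d12 > d02)
    odd2 = (d02 > d01) & (d12 > d01)
    return np.array([odd0.mean(), odd1.mean(), odd2.mean()])

# Hypothetical example: morph levels -34%, 0%, +34% with percept means equal to
# the nominal levels and noise sigma = 30
print(oddity_choice_probs(mu=np.array([-34.0, 0.0, 34.0]), sigma=30.0))
```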
We fit the ideal observer model to the subject's choices using maximum likelihood estimation. As there were seven morph levels for each feature in our task, choices for each feature could be explained using eight parameters: the mean percepts of the seven morph levels (μ_{f,1}, ..., μ_{f,7}) and the representational noise of the feature (σ_f).
If the differences of feature sensitivity parameters (k_e, k_n, k_m) estimated in the categorization tasks were fully explained by differences in the visual discriminability of the features, dividing each sensitivity by the corresponding discriminability estimated in the odd-one-out task would equalize the resulting values across features. Any residual differences after this normalization therefore reflect unequal decision weights applied to the features by the decision-making process (see Fig. 10G).
Results
Spatial integration in face categorization
We developed stochastic multifeature face categorization tasks suitable for studying spatial and temporal properties of the computations underlying the decision-making process. Subjects classified naturalistic face stimuli into two categories. In each trial, subjects observed a face stimulus with subliminally varying features and, when ready, reported the category with a saccadic eye movement to one of the two targets (Fig. 1A). The targets were associated with the two prototypes that represented the discriminated categories—identities of two different people in the identity categorization task (Fig. 1B, top) or happy and sad expressions of a person in the expression categorization task (Fig. 1B, bottom). The stimulus changed dynamically in each trial. The dynamic stimulus stream consisted of a sequence of face stimuli interleaved by masks (Fig. 1A). Each face stimulus was engineered to have three informative features in the eyes, nose, and mouth regions, and sensory evidence conferred by the three informative features rapidly fluctuated over time as explained in the next paragraph. The masks between face stimuli kept the changes in facial features subliminal, creating the impression that a fixed face was covered periodically with varying noise patterns (see Materials and Methods; Movie 1).
Using a custom algorithm, we could independently morph the informative facial features (eyes, nose, mouth) between the two prototypes (Fig. 1B) to create a 3D stimulus space whose axes correspond to the morph level of the informative features (Fig. 1C). In this space, the prototypes are two opposite corners (specified as ±100% morph), and the diagonal connecting the prototypes corresponds to a continuum of faces where the three features of each face equally support one category versus the other. For the off-diagonal faces, however, the three features provide unequal or even opposing information for the subject's choice. In each trial, the nominal mean stimulus strength (% morph) was sampled from the diagonal line (Fig. 1C, black dots). The dynamic stimulus stream was created by independently sampling a stimulus every 106.7 ms from a 3D symmetric Gaussian distribution with the specified nominal mean (SD 20% morph; Fig. 1C,D). The presented stimuli were therefore frequently off diagonal in the stimulus space. The subtle fluctuations of features influenced subjects' choices as we show below, enabling us to determine how subjects combined sensory evidence over space and time for face categorization.
We first evaluated subjects' choices and RTs in both tasks (Fig. 2). The average correct rate excluding 0% morph level was 91.0% ± 0.7% (mean ± SEM across subjects) for the identity task and 89.2% ± 1.2% for the expression task. In both tasks, choice accuracy monotonically improved as a function of the nominal mean stimulus strength in the trial (Fig. 2A,B), and mean RTs decreased with increasing stimulus strength (Fig. 2C,D).
To test whether multiple facial features informed subjects' decisions, we used psychophysical reverse correlation to evaluate the effect of the fluctuations of individual features on choice. Psychophysical kernels were generated by calculating the difference between the feature fluctuations conditioned on choice (Eq. 3). We focused on trials with the lowest stimulus strengths where choices were most strongly influenced by the feature fluctuations (0–14%; mean morph level of each trial was subtracted from the fluctuations; see Materials and Methods). Figure 2, E and F, shows the kernel amplitude of the three facial features averaged over time from stimulus onset to median RT. These kernel amplitudes quantify the overall sensitivity of subjects' choices to equal fluctuations of the three features (in %morph units). The kernel amplitudes markedly differed across features in each task (identity task: F(2,33) = 55.4, p = 2.8 × 10−11, expression task: F(2,18) = 33.6, p = 8.5 × 10−7; one-way ANOVA), greatest for the eyes region in the identity task (p < 9.5 × 10−10 compared with nose and mouth, post hoc Bonferroni test) and for the mouth region in the expression task (p < 3.9 × 10−5 compared with eyes and nose).
Critically, the choice was influenced by more than one feature. In the identity task, all three features had significantly positive kernel amplitudes (eyes, t(11) = 15.4, p = 8.6 × 10−9; nose, t(11) = 4.8, p = 5.4 × 10−4; mouth, t(11) = 4.2, p < 0.005).
Although the amplitude of psychophysical kernels informs us about the overall sensitivity of choice to feature fluctuations in the face stimuli, it does not clarify the contribution of sensory and decision-making processes to this sensitivity. Specifically, subjects' choices may be more sensitive to changes in one feature because the visual system is better at discriminating the feature changes (visual discriminability) or because the decision-making process attributes a larger weight to the changes of that feature (decision weights; Schyns et al., 2002; Sigala and Logothetis, 2002). We dissociate these factors in the final section of Results, but for the intervening sections, we focus on the overall sensitivity of choice to different features.
Linearity of spatial integration of facial features
How do subjects integrate information from multiple spatial features? Could it be approximated as a linear process, or does it involve significant nonlinear effects, for example, synergistic interactions that magnify the effect of cofluctuations across features? Nonlinear effects can be empirically discovered by plotting joint psychometric functions that depict subjects' accuracy as a function of the strength of the facial features (Fig. 3A,B). Here, we define the true mean strength of each feature as the average of the feature morph levels over the stimulus frames shown on each trial (see Figs. 6, 7 for temporal effects). The plots visualize the three orthogonal 2D slices of the 3D stimulus space (Fig. 1C), and the contour lines show the probability of choosing the second target (choice 2) at the end of the trial.
These iso-performance contours (Fig. 3A,B, thin lines) were largely straight and parallel to each other, suggesting that a weighted linear integration across features underlies behavioral responses. The slope of contours in each 2D plot reflects the relative contribution of the two facial features to choice. For example, the nearly vertical contours in the eyes-versus-nose plot of the identity task indicate that eyes had a much greater influence on subjects' choices, consistent with the amplitudes of psychophysical kernels (Fig. 2E). Critically, the straight and parallel contour lines indicate that spatial integration does not involve substantial nonlinearity. A linear model, however, does not explain curved contours, which appear at the highest morph levels, especially in the 2D plots of the less informative pairs (e.g., the nose × mouth plot for the identity task). Multiple factors could give rise to the curved contours. First, subjects rarely make mistakes at the highest morph levels, reducing our ability to perfectly define the contour lines at those levels. Second, the 2D plots marginalize over the third informative feature, and this marginalization is imperfect because of finite trial counts in the dataset. Put together, we cannot readily attribute the presence of curved contours at the highest morph levels to nonlinear processes and should rely on statistical tests for discovery. As we explain below, statistical tests fail to detect nonlinearity in the integration of features.
To quantify the contributions of linear and nonlinear factors, we performed a logistic regression on the choices using both linear and nonlinear multiplicative combinations of the feature strengths (Eq. 4). The model accurately fit the contour lines in Figure 3, A and B (thick pale lines). The multiplicative terms were small relative to the linear terms, and including them did not appreciably alter the linear coefficients, indicating that spatial integration of the features is largely linear. The linear coefficients themselves differed across features and tasks, mirroring the psychophysical kernel amplitudes (Fig. 3C,D).
Linearity of the mapping between stimulus strength and subjective evidence
Quantitative prediction of behavior requires understanding the mapping between the stimulus strength as defined by the experimenter (morph level in our experiment) and the evidence conferred by the stimulus for the subject's decision. The parallel linear contours in Figure 3 demonstrate that the strength of one informative feature can be traded for another informative feature to maintain the same choice probability. They further show that this trade-off is largely stable across the stimulus space, strongly suggesting a linear mapping between morph levels and inferred evidence.
To formally test this hypothesis, we quantified the relationship between feature strengths and the effects of features on choice by estimating subjective evidence in log-odds units. Following the methods developed by Yang and Shadlen (2007), we split the feature strengths (% morph) of each stimulus frame into 10 levels and performed a logistic regression to explain subjects' choices based on the number of occurrences of different feature morph levels in a trial. The resulting regression coefficients correspond to the change of the log odds of choice furnished by a feature morph level. For both the identity and expression morphs, the stimulus strength mapped linearly onto subjective evidence (Fig. 4; identity task, R2 = 0.94; expression task, R2 = 0.96), with the exception of the highest stimulus strengths, which exerted slightly larger effects on choice than expected from a linear model. The linearity for a wide range of morph levels—especially for the middle range in which subjects frequently chose both targets—permits us to approximate the total evidence conferred by a stimulus as a weighted sum of the morph levels of the informative features.
Temporal integration mechanisms
The linearity of spatial integration significantly simplifies our approach to investigate integration of sensory evidence over time. We adopted a quantitative-model-based approach by testing a variety of models that have the same linear spatial integration process but differ in ways that use stimulus information over time. We leveraged stimulus fluctuations within and across trials to identify the mechanisms that shaped the behavior. We further validated these models by comparing predicted psychophysical kernels with the empirical ones.
In our main model, the momentary evidence from each stimulus frame is linearly integrated over time (Fig. 5A). The momentary evidence from a stimulus frame is a linear weighted sum of the morph levels of informative features in the stimulus, compatible with linear spatial integration shown in the previous sections. The model assumes that sensitivities for these informative features (k_e, k_n, k_m) are constant over time within a trial. The integrated evidence (decision variable) grows until it reaches one of the two decision bounds, which terminates the decision and dictates the choice; the RT is the time taken to reach the bound plus a nondecision time. Fit to the observed choices and RTs, this model accurately captured the psychometric and chronometric functions in both tasks (Fig. 5).
The same model also quantitatively explains the psychophysical kernels (Fig. 6; identity task, R2 = 0.86; expression task, R2 = 0.84). The observed kernels showed characteristic temporal dynamics in addition to the inhomogeneity of amplitudes across features, as described earlier (Fig. 2E,F). The temporal dynamics are explained by decision bounds and nondecision time in the model (Okazawa et al., 2018). When aligned to stimulus onset, the kernels decreased over time. This decline happens in the model because nondecision time creates a temporal gap between bound crossing and the report of the decision, making stimuli immediately before the report inconsequential for the decision. When aligned to the saccadic response, the kernels peaked several hundred milliseconds before the saccade. This peak emerges in the model because stopping is conditional on a stimulus fluctuation that takes the decision variable beyond the bound, whereas the drop near the response time happens again because of the nondecision time. Critically, the model assumptions about static sensitivity and linear integration matched the observed kernels. Further, the inequality of kernel amplitudes across facial features and tasks was adequately captured by the different sensitivity parameters for individual features (k_e, k_n, k_m) in the fitted model.
To further test properties of temporal integration, we challenged our model with two plausible extensions (Fig. 7). First, integration may be imperfect, and early information can be gradually lost over time (Usher and McClelland, 2001; Bogacz et al., 2006). Such a leaky integration process can be modeled by incorporating an exponential leak rate in the integration process (Fig. 7A). When this leak rate becomes infinitely large, the model reduces to a memory-less process that commits to a choice if the momentary sensory evidence exceeds a decision bound, that is, extrema detection (Waskom and Kiani, 2018; Stine et al., 2020). To examine these alternatives, we fit the leaky integration model to the behavioral data. Although the leak rate is difficult to assess in typical perceptual tasks (Stine et al., 2020), our temporally fluctuating stimuli provide a strong constraint on the range of the leak rate that matches behavioral data because increased leak rates lead to lower contribution of earlier stimulus fluctuations to choice. We found that although the fitted leak rate was statistically greater than zero (Fig. 7B; identity task, t(11) = 3.01, p = 0.012; expression task, t(6) = 2.99, p = 0.024), it was consistently small across subjects (identity task, mean ± SEM across subjects, 0.013 ± 0.004 s−1; expression task, 0.005 ± 0.002 s−1). These leak rates correspond to integration time constants on the order of 100 s, which is much longer than the duration of each trial (∼1 s), supporting near-perfect integration over time.
The second extension allows time-varying sensitivity to sensory evidence within a trial (Levi et al., 2018), as opposed to the constant sensitivity assumed in our main model. To capture a wide variety of plausible temporal dynamics, we added linear and quadratic temporal modulations of drift rate over time to the model (Fig. 7C; Eq. 10). However, the fitted modulation parameters were quite close to zero in both tasks (Fig. 7D), indicating that sensitivity to the facial features remained largely constant over the course of stimulus viewing. Together, these results support linear, near-perfect integration of evidence over time with static sensitivity.
Testing the sequence of spatiotemporal integration
In the models above, temporal integration operates on the momentary evidence generated from the spatial integration of features of each stimulus frame. But is it necessary for spatial integration to precede temporal integration? Although our data-driven analyses above suggest that subjects combined information across facial features (Figs. 3, 4), it might be plausible that spatial integration follows the temporal integration process instead of preceding it. Specifically, the evidence conferred by each informative facial feature may be independently integrated over time, and then a decision may be rendered based on the collective outcome of the three feature-wise integration processes (i.e., spatial integration following temporal integration). A variety of spatial pooling rules may be used in such a model. A choice can be determined by the first feature integrator that reaches the bound (Fig. 8A), by the majority of feature integrators (Fig. 9A), or by the sum of the decision variables of the integrators after the first bound crossing (Fig. 9B). In all of these model variants, the choice is shaped by multiple features because of the stochasticity of the stimuli and noise (Otto and Mamassian, 2012). For example, the eyes integrator would dictate the choice in many trials of the identity categorization task, but the other feature integrators would also have a smaller but tangible effect, compatible with the differential contribution of features to choice, as shown in the previous sections (Fig. 2E,F). Are such parallel integration models compatible with the empirical data?
Figure 8 demonstrates that models with late spatial integration fail to explain the behavior. Although these models could fit the psychometric and chronometric functions (Fig. 8B,C), they underperformed our main model in explaining the joint distribution of choices and RTs (lower model log likelihood). More important, they failed to reproduce the temporal dynamics and relative amplitudes of the observed psychophysical kernels (Fig. 8).
In general, late spatial integration causes a lower signal-to-noise ratio and is therefore more prone to wrong choices because it ignores part of the available sensory information by terminating the decision-making process based on only one feature or by suboptimally pooling across spatial features after the termination (Fig. 9, test of different spatial pooling rules). To match subjects' high performance, these models would therefore have to alter the speed accuracy trade-off by pushing the decision bound higher than those used by the subjects. However, this change leads to qualitative distortions in the psychophysical kernels. Our approach to augment standard choice and RT analyses with psychophysical reverse correlations was key to identify these qualitative differences (Okazawa et al., 2018), which can be used to reliably distinguish models with different orders of spatial and temporal integration.
What underlies differential contribution of facial features to choice: visual discriminability or decision weight?
The psychophysical kernels and decision-making models in the previous sections indicated that subjects' choices were differentially sensitive to fluctuations of the three informative features in each categorization task (Figs. 2E,F; 3C,D; 4) and across tasks (drift diffusion model sensitivity for features depicted in Fig. 10E; F(2,51) = 47.4, p = 2.3 × 10−12, two-way ANOVA interaction between features and tasks). However, as explained earlier, a higher overall sensitivity to a feature could arise from better visual discriminability of changes in the feature or a higher weight applied by the decision-making process to the feature (Fig. 10A). Both factors are likely present in our task. Task-dependent changes of feature sensitivities support the existence of flexible decision weights. Differential visual discriminability is a likely contributor too because of distinct facial features across faces in the identity task or expressions in the expression task. To determine the exact contribution of visual discriminability and decision weights to the overall sensitivity, we measured the discrimination performance of the same subjects for each facial feature using two tasks—odd-one-out discrimination (Fig. 10B) and categorization of single facial features (Fig. 11A).
In the odd-one-out task, subjects viewed three consecutive images of a facial feature (eyes, nose, or mouth) with different stimulus strengths and chose the one that was perceptually distinct from the other two (Fig. 10B). Subjects successfully identified the morph level that was distinct from the other two and had higher choice accuracy when the morph level differences were larger (Fig. 10C). However, the improvement of accuracy as a function of morph level difference was not identical across features. The rate of increase (slope of psychometric functions) was higher for the eyes of the identity-task stimuli and higher for the mouth of the expression-task stimuli, suggesting that the most sensitive features in those tasks were most discriminable too. We used a model based on signal detection theory (Maloney and Yang, 2003) to fit the psychometric functions and retrieve the effective representational noise for each feature, that is, the inverse of its visual discriminability.
Similar results were also obtained in the single-feature categorization tasks, where subjects performed categorizations similar to the main task while viewing only one facial feature (Fig. 11A). We derived the model sensitivity for each facial feature by fitting a drift diffusion model to the subjects' choices and RTs (Fig. 11B). Because subjects discriminated a single feature in this task, differential weighting of features could not play a role in shaping their behavior, and the model sensitivity for each feature was proportional to the feature discriminability. The order of feature discriminability was similar to that from the odd-one-out task, with eyes showing more discriminability for the stimuli of the identity task (Fig. 11C).
Although the results of both tasks support that visual discriminability was nonuniform across facial features, this contrast was less pronounced than that of the model sensitivities in the main task (Fig. 10E,F). Consequently, dividing the model sensitivities by the discriminability revealed residual differences reflecting nonuniform decision weights across features (Fig. 10G; F(2,30) = 6.1, p = 0.0059, two-way ANOVA, main effect of features) and between the tasks (F(2,30) = 10.9, p = 2.8 × 10−4, two-way ANOVA, interaction between features and tasks). In other words, context-dependent decision weights play a significant role in the differential contributions of facial features to decisions. Furthermore, these weights suggest that subjects rely more on more informative (less noisy) features. In fact, the decision weights were positively correlated with visual discriminability (Fig. 10H; R = 0.744, p = 2.0 × 10−7), akin to an optimal cue integration process (Ernst and Banks, 2002; Oruç et al., 2003; Drugowitsch et al., 2014). Together, the decision-making process in face categorization involves context-dependent adjustment of decision weights that improves behavioral performance.
Discussion
Successful categorization or identification of objects depends on elaborate sensory and decision-making processes that transmit and use sensory information to implement goal-directed behavior. The properties of the decision-making process remain underexplored for object vision. Existing models commonly assume instantaneous decoding mechanisms based on linear readout of population responses of sensory neurons (Hung et al., 2005; Majaj et al., 2015; Rajalingham et al., 2015; Chang and Tsao, 2017), but they are unable to account for aspects of behavior that are based on deliberation on temporally extended visual information common in our daily environments. By extending a quantitative framework developed for studying simpler perceptual decisions (Ratcliff and Rouder, 1998; Palmer et al., 2005; Gold and Shadlen, 2007; O'Connell et al., 2012; Waskom et al., 2019), we establish an experimental and modeling approach that quantitatively links sensory inputs and behavioral responses during face categorization. We show that human face categorization constitutes a spatiotemporal evidence integration process. A spatial integration process aggregates stimulus information into momentary evidence, which is then integrated over time by a temporal integration process. The temporal integration is largely linear and, because of its long time constant, incurs minimal or no loss of information over time. The spatial integration is also linear and accommodates flexible behavior across tasks by adjusting the weights applied to visual features. These weights remain stable over time in our task, providing no indication that the construction of momentary evidence or the informativeness of the features changes with stimulus viewing time.
Our approach bridges past studies on object recognition and perceptual decision-making by formulating face recognition as a process that integrates sensory evidence over space and time. Past research on object recognition focused largely on feedforward visual processing and instantaneous readout of the visual representations, leaving a conceptual gap for understanding the temporally extended processes that underlie perception and action planning based on visual object information. Several studies have attempted to fill this gap by using noisy object stimuli (Heekeren et al., 2004; Philiastides and Sajda, 2006; Ploran et al., 2007; Philiastides et al., 2014; Heidari-Gorji et al., 2021) or sequential presentation of object features (Ploran et al., 2007; Jack et al., 2014). However, the stimulus manipulations in these studies did not allow a comprehensive exploration of both spatial and temporal processes. They either created a one-dimensional stimulus axis that eroded the possibility to study spatial integration across features or created temporal sequences that eroded the possibility to study temporal integration jointly with spatial integration. Our success hinges on a novel stimulus design, namely, independent parametric morphing of individual facial features and subliminal spatiotemporal feature fluctuations within trials. Independent feature fluctuations were key to characterize the integration processes, and the subliminal sensory fluctuations ensured that our stimulus manipulations did not alter subjects' decision strategy, addressing a fundamental challenge (Murray and Gold, 2004) to alternative methods (e.g., covering face parts; Gosselin and Schyns, 2001; Schyns et al., 2002; but see Gosselin and Schyns, 2004).
We used three behavioral measures—choice, reaction time, and psychophysical reverse correlation—to assess the mechanisms underlying the behavior. Some key features of the decision-making process cannot be readily inferred solely from choice and reaction time, for example, the time constant of the integration process (Ditterich, 2006; Stine et al., 2020). However, the inclusion of psychophysical kernels provides a more powerful three-pronged approach (Okazawa et al., 2018) that enabled us to establish differential sensitivities for informative features (Fig. 2E,F), linearity of spatial integration (Fig. 3), long time constants (minimum information loss) for temporal integration (Fig. 7B), static feature sensitivities (Fig. 7D), and failure of late spatial integration in the parallel feature integration models (Figs. 8, 9). The precise agreement of psychophysical kernels between model and data (Fig. 6) reinforces our conclusion that face categorization arises from linear spatiotemporal integration of visual evidence.
Face perception is often construed as a holistic process because breaking the configuration of face images, for example, removing parts (Tanaka and Farah, 1993), shifting parts (Young et al., 1987), or inverting images (Yin, 1969), reduces performance for face discrimination (Taubert et al., 2011), categorization (Young et al., 1987), or recognition (Tanaka and Farah, 1993). However, the mechanistic underpinnings of these phenomena remain elusive (Richler et al., 2012). The linear spatial integration mechanism has the potential to provide mechanistic explanations for some of these holistic effects. For example, changes in the configuration of facial features could reduce visual discriminability of facial features (Murphy and Cook, 2017), disrupt spatial integration (Gold et al., 2012; Witthoft et al., 2016), or cause suboptimal weighting of informative features (Sekuler et al., 2004). Holistic effects can also be manifested as impairment in facial part recognition when placed together with other uninformative facial parts (composite face effect; Young et al., 1987). This might arise because face stimuli automatically trigger spatial integration that combines information from irrelevant parts. Our approach offers a quantitative path to test these possibilities using a unified modeling framework—a fruitful direction to pursue in the future.
The linearity of spatial integration over facial features has been a source of controversy in the past (Gold et al., 2012; Gold, 2014; Shen and Palmeri, 2015). The controversy partly stems from the ambiguity in what visual information contributes to face recognition. Some suggest that local shape information of facial parts accounts for holistic face processing (McKone and Yovel, 2009), whereas others suggest that configural information, such as distances between facial features, gives rise to nonlinearities (Shen and Palmeri, 2015) and holistic properties (Le Grand et al., 2001; Maurer et al., 2002). Our study does not directly address this question because feature locations in our stimuli were kept largely constant to facilitate morphing between faces. However, our approach can be generalized to include configural information and systematically tease apart spatial integration over feature contents from integration over the relative configuration of features. An ideal decision-making process would treat configural information similar to content information by linearly integrating independent pieces of information. Although our current results strongly suggest linear integration over feature contents, we remain open to emergent nonlinearities for configural information.
Another key finding in our experiments is flexible, task-dependent decision weights for informative features (Fig. 10). Past studies demonstrated the preferential use of more informative features over others during face and object categorization (Schyns et al., 2002; Sigala and Logothetis, 2002; De Baene et al., 2008). But it was not entirely clear whether and by how much subjects' enhanced sensitivity stemmed from visual discriminability of features or decision weights. We have shown that the differential model sensitivity for facial features in our tasks could not be fully explained by inhomogeneity of visual discriminability across features, thus confirming flexible decision weights for facial features. Importantly, the weights were proportional to the visual discriminability of features in each task (Fig. 10H), consistent with the idea of optimal cue integration that explains multisensory integration behavior (Ernst and Banks, 2002; Oruç et al., 2003; Drugowitsch et al., 2014). Our observation suggests that face recognition is compatible with Bayesian computations in cue combination paradigms (Gold et al., 2012; Fetsch et al., 2013). It is an important future direction to test whether the recognition of other object categories also conforms to such optimal computations (Kersten et al., 2004). Moreover, neural responses to object stimuli can dynamically change because of adaptation or expectation (Kaliukhovich et al., 2013), which can alter both the sensory and decision-making processes (Mather and Sharman, 2015; Witthoft et al., 2018). How decision-making processes adapt to dynamic inputs is another important direction to be explored in the future.
The quantitative characterization of behavior is pivotal for linking computational mechanisms and neural activity as it guides future research on where and how the spatiotemporal integration of sensory evidence is implemented in the brain. Face stimuli evoke activity in a wide network of regions in the temporal cortex with different levels of specialization for processing facial parts, view invariance, facial identity and emotions, as well as social interactions (Freiwald and Tsao, 2010; Freiwald et al., 2016; Sliwa and Freiwald, 2017; Hesse and Tsao, 2020; Hu et al., 2020). Although neural activity in these regions is known to causally alter face recognition behavior (Afraz et al., 2006; Parvizi et al., 2012; Moeller et al., 2017), the exact contribution to the decision-making process remains unresolved. Prevailing theories emphasize the role of these regions in sensory processing, commonly attributing rigid selectivities to the neurons that are invariant to behavioral goals. In these theories, flexible spatiotemporal integration of evidence, as we explain in our model, is left to downstream sensorimotor or association areas commonly implicated in decision-making (Ratcliff et al., 2003; Cisek and Kalaska, 2005; Gold and Shadlen, 2007; Schall, 2019; Okazawa et al., 2021). However, neurons in the inferior temporal cortex show response dynamics that can reflect temporally extended decisions (Akrami et al., 2009), and they may alter selectivity in a task-dependent manner (Koida and Komatsu, 2007; Tajima et al., 2017), challenging a purely sensory role for the inferior temporal neurons and hinting at the potential contribution of these neurons to flexible spatial and temporal integration. Future studies that focus on the interactions between temporal cortex and downstream areas implicated in decision-making will clarify the role of different brain regions. Our experimental framework provides a foundation for studying such interactions by determining the properties of spatiotemporal integration and making quantitative predictions about the underlying neural responses.
Footnotes
This work was supported by the Simons Collaboration on the Global Brain (Grant 542997), McKnight Scholar Award, Pew Scholars Program in the Biomedical Sciences Award, and National Institute of Mental Health (Grant R01 MH109180-01). G.O. was supported by postdoctoral fellowships from the Charles H. Revson Foundation and the Japan Society for the Promotion of Science. We thank Stanley J. Komban, Michael L. Waskom, and Koosha Khalvati for discussions.
The authors declare no competing financial interests.
- Correspondence should be addressed to Roozbeh Kiani at roozbeh@nyu.edu