Abstract
The primate visual system is organized into two parallel anatomical pathways, both originating in early visual areas but terminating in either posterior parietal or inferior temporal regions. Classically, these two pathways have been thought to subserve spatial vision and visually guided actions (dorsal pathway) and object identification (ventral pathway). However, evidence is accumulating that dorsal visual areas may also represent many aspects of object shape in the absence of demands for attention or action. Dorsal visual areas exhibit selectivity for three-dimensional cues of depth, are considered necessary for the extraction of surfaces from depth cues, and can carry out cognitive functions with such cues as well. These results suggest that dorsal visual areas may participate in object recognition, but it is unclear in what capacity. Here, we tested whether three-dimensional structure-from-motion (SFM) cues, thought to be computed exclusively by dorsal stream mechanisms, are sufficient to drive complex object recognition. We then tested whether recognition of such stimuli relies on dorsal stream mechanisms alone, or whether dorsal–ventral integration is invoked. Results suggest that such cues are sufficient to drive unfamiliar face recognition in normal participants and that ventral stream areas are necessary for both identification and learning of unfamiliar faces from SFM cues.
Introduction
The cortical visual areas of primates are broadly organized into two separate anatomical pathways, a dorsal pathway that includes areas in the posterior parietal cortex (PPC) and a ventral pathway that includes inferior temporal (IT) regions (Ungerleider and Mishkin, 1982; Goodale and Milner, 1992). The two pathways have been thought to represent different aspects of vision, the dorsal pathway representing spatial relations and visually guided actions and the ventral pathway being critical for object identification.
Although ventral visual areas are considered important for complex visual object recognition, many aspects of object recognition may also be carried out in parallel by visual areas in the PPC. Lehky and Sereno (2007) found that cells in area LIP of the monkey responded strongly and rapidly to two-dimensional forms with a pattern similar to IT cells recorded in the same study. Konen and Kastner (2008), using functional magnetic resonance (fMR)-adaptation in humans, reported two areas along the intraparietal sulcus (IPS) that showed adaptation to two-dimensional forms and three-dimensional shapes, regardless of the viewpoint or size of the object. Shape selectivity and invariance to size and viewpoint are important properties of an object recognition system, and regions in the PPC exhibit these properties.
Visual areas in the PPC in humans and monkeys exhibit selectivity for three-dimensional cues of shape such as structure-from-motion (SFM) (Vanduffel et al., 2002), stereopsis, and perspective (Shikata et al., 1996, 2001, 2003; Sugihara et al., 2002; Anderson and Siegel, 2005; Orban et al., 2006; Durand et al., 2007). Some of the three-dimensional cue-selective neurons in these regions exhibit properties that are suggestive of a role in “high-level” visual perception. Cells in the caudal IPS (CIP) exhibit orientation-selective and delay-sustained activity during delayed matching of two three-dimensionally oriented surfaces (Tsutsui et al., 2003). Furthermore, temporary deactivation of this area results in impairment on this discrimination task (Tsutsui et al., 2001). These results imply that dorsal visual areas are involved in certain cognitive aspects of shape processing from three-dimensional cues.
The processing of visual motion is commonly thought to depend on dorsal stream mechanisms as well. Dynamic aspects of a visual scene provide important cues for object segregation and identification. For example, gestures, emotional expressions, and idiosyncratic head movements can be used to drive identity and gender categorization in the absence of other shape cues (Hill and Johnston, 2001). However, three-dimensional SFM cues can be derived from all visual objects. These cues are highly informative of object shape and may be capable of driving complex recognition processes in the absence of other shape cues or idiosyncratic movements.
A number of attempts have been made to estimate the contribution of SFM to face recognition (O'Toole et al., 2002). However, previous studies had not separated the sole contribution of object motion from monocular cues (e.g., shading) or other motion cues (e.g., facial gestures and identity signatures). Although a specific role for SFM has been postulated by a model of face recognition (O'Toole et al., 2002), to date no direct evidence exists in support of this model.
Materials and Methods
We first sought to assess whether naive observers can use SFM cues to carry out a complex object recognition task, namely, recognize unfamiliar faces. We then attempted to distinguish between the two competing hypotheses outlined above, one postulating a role for dorsal visual areas in object recognition from three-dimensional cues and the other postulating the necessity of dorsal visual areas for the extraction of surfaces from depth and the ventral visual areas for the recognition and identification of the three-dimensional objects.
Our stimuli consisted of three-dimensional laser-scanned heads (Troje and Bülthoff, 1996) and three-dimensional models of chairs and other objects that were rendered using a unique texture mapping technique (three-dimensional procedural texture mapping). This approach eliminates sources of biological motion as well as monocular depth cues such as shading and texture gradients. The resulting images have no defining two-dimensional features that may be used to recognize the objects (Fig. 1). The motion-defined objects are invisible when the display is static. However, rotating the surfaces in depth yields a vivid three-dimensional percept from the SFM cues.
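The dependence of image motion on depth that underlies SFM can be illustrated with a minimal sketch. It is not the rendering method used here (the stimuli were procedurally textured surfaces, not isolated points); the point coordinates, the viewing distance, and the function names are hypothetical and chosen only for illustration. Two points that coincide in the static image at 0° separate on the image plane once the surface rotates about the vertical axis.

```python
import math

def rotate_about_vertical(point, angle_deg):
    """Rotate a 3D point (x, y, z) about the vertical (y) axis."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) + z * math.sin(a),
            y,
            -x * math.sin(a) + z * math.cos(a))

def project_perspective(point, viewing_distance=5.0):
    """Perspective projection for a viewer on the +z axis.

    viewing_distance is a hypothetical value in arbitrary units.
    """
    x, y, z = point
    scale = viewing_distance / (viewing_distance - z)
    return (x * scale, y * scale)

def image_track(point, angles_deg, viewing_distance=5.0):
    """Image-plane positions of one surface point across rotation angles."""
    return [project_perspective(rotate_about_vertical(point, a), viewing_distance)
            for a in angles_deg]

# Two hypothetical surface points that differ only in depth.
angles = [-22.5, 0.0, 22.5]
near_track = image_track((0.0, 0.0, 0.5), angles)
far_track = image_track((0.0, 0.0, -0.5), angles)

for angle, near, far in zip(angles, near_track, far_track):
    print(f"{angle:6.1f} deg  near x = {near[0]:+.3f}  far x = {far[0]:+.3f}")
```

In this toy case the two points are indistinguishable in the 0° frame, and only their opposite image displacements during rotation reveal that they lie at different depths.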
Stimuli and design.
Three-dimensional laser-scanned heads from the Max Planck database were used for these experiments (Troje and Bülthoff, 1996). The stimuli were rendered with three-dimensional procedural texture maps to ensure uniform textures, as described in detail in our previous study (Liu et al., 2005). The 20 heads rotated in depth from left to right, from −22.5 to 22.5° about the vertical axis at a rate of 27.3°/s, and were rendered with perspective transformation. The recognition targets were the same heads, but rendered with shading only, in orthographic projection to avoid simple metric matching. Thus, the subjects always viewed motion-defined stimuli and matched them to shaded targets. Twenty subjects participated in each of the first two experiments (mean age, 26.8; 15 females and 25 males), with 10 in each condition. Subjects viewed the rotating SFM faces that extended ∼30° of visual angle vertically and 21° horizontally and identified the face among eight gender-matched targets. All participants gave written informed consent before inclusion in the study, which had been approved by the Research Ethics Board of McGill University (Canada) and the Ethical Committee of the University Hospital of Geneva (Switzerland).
Patient studies.
Information on the patients is provided in Table 1. First, all patients viewed a series of 15 objects and three-dimensional geometric shapes defined by SFM and were asked to name them. Ten adult normal controls also completed this naming task. After this, all patients completed a series of additional 1:8 identification tasks as described above, consisting of rotating SFM faces and rotating SFM chairs (rendered in a manner identical to the faces) matched to shaded static targets. Finally, their ability to match static displays was tested on the same task but with static shaded faces and chairs.
Patient P.S., suffering from prosopagnosia, completed two additional tasks designed to probe her capacity to use SFM cues for face and object discrimination and recognition. She completed a face-learning task in which she was required to learn to name four faces (two male; two female) presented via SFM. Each of the faces was presented for 3.3 s only (one rotation), and she was encouraged to respond as fast as possible. The patient viewed each of the faces 80 times in the course of the study. Her residual ability at object recognition was also tested using chair-learning tasks that were carried out in the same manner as the face-learning task.
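The 3.3 s presentation time can be related to the rotation parameters given under Stimuli and design (−22.5 to 22.5° at 27.3°/s). The short calculation below is a sketch that assumes "one rotation" means one sweep out and back at a constant angular rate; that reading and the function name are our assumptions, not statements from the rendering procedure.

```python
def sweep_duration_s(start_deg, end_deg, rate_deg_per_s):
    """Time for one sweep between two angles at a constant angular rate."""
    return abs(end_deg - start_deg) / rate_deg_per_s

# Parameters taken from the stimulus description.
one_way_s = sweep_duration_s(-22.5, 22.5, 27.3)

# Assumption: one rotation is a sweep out and back.
out_and_back_s = 2 * one_way_s

print(f"one-way sweep: {one_way_s:.2f} s, out and back: {out_and_back_s:.2f} s")
```

Under that assumption the out-and-back duration agrees with the stated 3.3 s to within rounding.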
Controls.
We measured the performance on the 1:8 identification tasks of both patients with lesions that left their vision unaffected and normal subjects with no neurological damage. Two control patients, one suffering from damage to temporal and parietal cortices and exhibiting mild aphasia and the other suffering from damage to the parietal cortex, participated in the same tasks described above. In addition, eight subjects (five females; three males), aged 46–52 (mean, 50.1; SD, 2.1), with no neurological impairments and normal or corrected-to-normal vision participated in the matching tasks. Patient V.D. served as a negative control for the stimuli and paradigm used here. He suffered from severe motion blindness (for more information, see Results). A total of 11 normal age-matched controls underwent the additional face and chair-learning tasks that P.S. completed, with four in each object category condition; the three remaining participants completed both tasks (N = 7 for each task).
Results
A previous study had suggested that SFM cues may be of limited use in familiar face recognition, but are not sufficient for unfamiliar face recognition (Bruce and Valentine, 1988). It remains unclear whether facial movement in general (as in the case of continuous multiview video of a face) yields better recognition than a single photograph (Pike et al., 1997; Christie and Bruce, 1998). This type of rigid movement would include SFM cues along with other cues; thus, it would not speak directly to a role for SFM in face recognition. Although at least one model of cortical object processing suggests a role for SFM cues in face recognition (O'Toole et al., 2002), there is no direct evidence to validate this claim.
Recognition of unfamiliar faces from SFM
In the first experiment, normal subjects viewed motion-defined face stimuli on one screen while attempting to identify the face among eight choices (target faces) on another screen. The eight target faces were rendered as static shaded faces, similar to sculptures, and were matched for gender with the motion-defined face. One group of subjects viewed the dynamic faces, whereas another group viewed a single static frame. This latter condition served as a control to ensure that there were no contaminating factors in the stimuli that could aid face recognition in the absence of dynamic information. We found that subjects viewing the control stimuli performed at chance (Fig. 2, right bar), whereas subjects viewing the SFM faces performed approximately four times above chance (t(18) = 5.9916; p < 0.0001).
We next tested whether transient texture gradients formed while the face rotates in depth can be used for successful recognition. The same recognition task was used, but with textures that rotated incongruently with head rotation. These stimuli could therefore only be recognized if the transient texture gradients served as a reliable source of structural information, given that SFM cues were removed. Subjects in this condition performed slightly above chance (Fig. 2, middle bar) but significantly below the SFM group (t(18) = 4.5097; p < 0.001). Together, these results confirm the usefulness of purely dynamic cues of shape, devoid of other monocular depth cues or biological motion signals, in driving complex object recognition such as the recognition of unfamiliar faces.
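For orientation, the chance level in a 1:8 identification task and the degrees of freedom of the group comparisons can be worked through in a short sketch. It assumes an independent-samples t test with pooled variance, which is consistent with the reported 18 degrees of freedom for two groups of 10 but is not stated explicitly; the accuracy values below are invented toy numbers, not data from this study, and the function names are hypothetical.

```python
import math
import statistics

def chance_level(n_alternatives):
    """Expected proportion correct when guessing among n alternatives."""
    return 1.0 / n_alternatives

def pooled_t(group_a, group_b):
    """Independent-samples t statistic with pooled variance, and its df."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    return t, df

# Toy proportions correct (hypothetical, for illustration only).
toy_dynamic = [0.50, 0.60, 0.40, 0.50]
toy_static = [0.10, 0.20, 0.10, 0.20]

t_value, df = pooled_t(toy_dynamic, toy_static)
print(f"chance = {chance_level(8):.3f}, t({df}) = {t_value:.2f}")
```

With eight alternatives, chance is 12.5% correct; if "four times above chance" is read as four times the chance rate, it corresponds to roughly 50% correct.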
We next sought to distinguish between the two hypotheses outlined above, concerning the role of the dorsal three-dimensional representations in object recognition. If three-dimensional shape representations in dorsal visual areas were sufficient to carry out complex visual object recognition, then a patient with ventral stream impairment would have no difficulty on tasks requiring identification and object learning from three-dimensional cues such as SFM. If, however, dorsal three-dimensional shape representations must be relayed to ventral stream regions for object recognition, as postulated by O'Toole et al. (2002), then ventral stream impairment would be the limiting factor for successful recognition of shapes from three-dimensional cues such as SFM. We tested these contrasting possibilities in neuropsychological cases of akinetopsia (Zihl et al., 1983) and prosopagnosia (Damasio et al., 1982). The former represents an impairment of dorsal stream visual processing resulting in impaired motion perception, whereas the latter represents impairment in the ventral stream to produce a specific inability to recognize faces.
Prosopagnosic patient
To assess the necessity of ventral stream structures in the recognition of motion-defined stimuli, we examined patient P.S., whose clinical condition was previously studied in detail and reported by Rossion et al. (2003). P.S. is a 57-year-old right-handed woman who suffers from severe and chronic prosopagnosia. She exhibited no difficulty in perceiving SFM stimuli and performed perfectly on the object-naming task. On the 1:8 identification tasks, her performance replicated some of the previous reports using face and object photographs by Rossion et al. (2003). Her identification accuracy with the chairs, although not as good as normal controls, was well above chance and within 2 SDs of the normal performance (Fig. 3). However, she was impaired on face identification; her 1:8 matching performance with SFM faces was at chance and >2 SDs below the group average. With static shaded faces she was able to perform above chance, but still significantly worse than the normal controls. She has developed a strategy of using the lips to match faces, and this facial feature is difficult to identify in the right-to-left rotating SFM faces, but clear in the shaded stimuli. Thus, it is likely that her strategy of using the lips drove her performance on the shaded faces above chance, but her performance was still >2 SDs below the normal control group.
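The 2 SD criterion used to compare a single patient with the control group can be expressed as a simple calculation. This is a sketch only: it assumes the sample standard deviation of the control scores is used, which is not specified here, and the scores and function names are hypothetical.

```python
import statistics

def deviation_in_control_sds(patient_score, control_scores):
    """Patient score expressed in control-group standard deviations."""
    mean = statistics.mean(control_scores)
    sd = statistics.stdev(control_scores)  # sample SD (assumption)
    return (patient_score - mean) / sd

def is_impaired(patient_score, control_scores, criterion_sds=2.0):
    """True when the patient falls more than criterion_sds below the controls."""
    return deviation_in_control_sds(patient_score, control_scores) < -criterion_sds

# Toy proportions correct (hypothetical, for illustration only).
toy_controls = [0.4, 0.5, 0.6]
print(deviation_in_control_sds(0.2, toy_controls))
print(is_impaired(0.2, toy_controls), is_impaired(0.45, toy_controls))
```

With small control groups, the estimate of the control SD is itself noisy, so a fixed 2 SD cutoff should be read as a descriptive criterion.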
We additionally designed a task to test her capacity to learn unfamiliar motion-defined faces and motion-defined chairs. She was asked to learn the names of four faces (two male and two female), four office chairs, or four armchairs that were selected such that the set of armchairs was similar in homogeneity to the set of faces. P.S. was unable to learn the faces even after 80 repetitions of each face, whereas seven age-matched controls were able to reliably learn the task (Fig. 4A). Her raw performance for each face across the sessions is displayed in Figure 4B. In contrast to normal controls, her performance is unreliable over time; the occurrence of correct and incorrect responses for each face is random and she commits significantly more errors in the last 20 trials than normal controls. She reported facility at perceiving the face and all of the facial components but, similar to face photographs, she reported that she could not “put the face together.” Performance on a similar chair-naming task (Fig. 5A) remained unaffected. Her performance with motion-defined office chairs reached a ceiling after only 10 trials and was comparable with her performance with the shaded stimuli. When we used highly similar chairs (armchairs), her performance increased more slowly, but she was clearly able to learn the chairs as evidenced by her consecutively correct performance on the chairs and the similarity between her performance and that of age-matched normal controls (Fig. 5B); on average, she committed the same number of errors as the normal controls. In general, the face- and chair-naming tasks were similar in difficulty as evidenced by the performance of the normal controls. In fact, the chair-naming task was slightly more difficult, with normals committing on average fewer errors on the last 20 trials of the face-naming task than the chair-naming task.
Akinetopsic patient
Patient V.D. is a 47-year-old, right-handed man suffering from dementia affecting primarily visuospatial functions as revealed by extensive neuropsychological testing. He exhibited a severe impairment for direction discrimination from coherent motion and orientation discrimination of two-dimensional forms-from-motion (Blanke et al., 2007). However, neuropsychological testing did not reveal any object recognition deficits, and so we were interested to know whether he could use three-dimensional SFM cues via a system other than his impaired dorsal stream. Additionally, we were interested to know whether there was any other information in our three-dimensional SFM stimuli in addition to the motion-defined structure that could be used to drive discrimination performance, even though previous control studies had suggested that this was not the case. In effect, the performance of patient V.D. served as a negative control for the stimuli and paradigm used here.
The results from this patient, shown in Figure 3, suggest that he is unable to extract motion cues from the displays and thus unable to perceive motion-defined stimuli. It is unlikely that nonmotion cues were present in the stimuli because otherwise he would have used this information to drive his performance above chance. However, he can recognize stimuli if they are defined by other cues, such as shading, suggesting that he does not have a difficulty making fine discriminations. The fact that his near-normal performance with the shaded stimuli did not translate to any residual ability to perceive the three-dimensional SFM stimuli confirms that the extraction of surfaces from these dynamic cues requires putatively dorsal stream mechanisms.
Discussion
We have shown that ventral stream mechanisms are necessary for complex object recognition using SFM cues even though the bulk of evidence suggests that dorsal stream mechanisms are essential for extracting the surface structure from this depth cue. Our results lead to several conjectures.
First, the results from the naive subjects suggest that motion cues alone are sufficient to drive complex object recognition including the recognition of unfamiliar faces. This may at first stand at odds with studies that suggest head motion does not enhance face recognition, but note that here only three-dimensional SFM cues were available, not additional edge and shading cues. Thus, SFM cues may not improve face recognition if other reliable cues are present. Liu and Ward (2006) found that a three-dimensional cue such as stereopsis improved face recognition performance when perspective transformation degraded performance. Thus, head motion may also improve recognition, but only if three-dimensional perception is affected by a spatial transformation.
Second, the data from patient P.S. suggest that the ventral stream object representations are cue-invariant; that is, they may process a given object regardless of the three-dimensional cue used to define the shape. This is supported by the finding that P.S. displayed a specific impairment that was category-selective for faces, but not cue-selective; she performed significantly worse than normal controls on both matching tasks with faces defined by SFM and those defined by shading. Importantly, her results imply that the ventral face processing mechanisms that she lacks were also the recipient of a putative dorsal input.
Third, although there is evidence that neurons in the PPC (e.g., area CIP) may represent three-dimensional surface information during delay periods (Tsutsui et al., 2001, 2003), these mnemonic functions are insufficient for creating new memory associations for long-term reference. This is supported by P.S.'s inability to learn name associations to four faces from SFM. P.S. does not have a long-term or short-term memory deficit (Rossion et al., 2003); thus, her inability to learn the four faces from SFM-based stimuli is likely attributable to her category-selective impairment.
O'Toole et al. (2002) postulated a role for dorsal–ventral integration from SFM cues, although no direct evidence for this link had been provided until now. Kriegeskorte et al. (2003) found support for the model of O'Toole et al. (2002) in an event-related paradigm with a face detection task that used two SFM-defined faces. Although they reported increased fusiform face area (FFA) activity in response to faces compared with random surfaces, they found a similar category selective response even in the human homolog of MT (hMT+) as well as a differential response in IPS for faces defined by another type of motion cue (termed on-surface SFM). Although their results suggest a role for the FFA in perception of motion-defined faces, the same role can be equally attributed to the hMT+ and IPS peaks observed in their study. Recently, Konen and Kastner (2008) have demonstrated, using an fMR-adaptation paradigm, that PPC shape selectivity is comparable with that of the ventral stream, thus highlighting the need to clarify the role of the dorsal stream shape representations in object recognition.
There is a growing body of evidence to suggest that dynamic cues such as SFM are processed by dorsal stream areas (Andersen and Bradley, 1998; Vanduffel et al., 2002; Anderson and Siegel, 2005; Orban et al., 2006), whereas recognition of complex objects, such as faces, is dependent on ventral stream processing (Haxby et al., 1991; Kanwisher et al., 1997; Ishai et al., 1999). Interestingly, monkeys with lesions to a specific part of the ventral stream, the inferotemporal cortex (area IT), are unable to perform perceptual and memory-related tasks with luminance-defined patterns, but perform normally on perceptual tasks using motion-defined patterns (Britten et al., 1992). Thus, it appears that not all aspects of complex visual recognition depend on ventral stream mechanisms.
Ventral stream areas, such as the IT cortex of monkeys, are highly interconnected with parahippocampal areas (Seltzer and Pandya, 1991), leading to the conjecture that this cortical stream is important for memory formation and object recognition. Neural processes underlying perception of motion-defined patterns presumably remain undisturbed after ventral stream dysfunction. In humans, a ventral system impairment (agnosia) does not impair the ability to use motion-parallax cues for depth reach planning in a delayed-response task that requires retention of perceptual information (Dijkerman et al., 1999). Although dorsal stream areas may exhibit shape selectivity (Shikata et al., 1996; Nakamura et al., 2001; Lehky and Sereno, 2007), our results suggest that these regions may not be involved in object recognition per se, in the sense of allowing for comparisons with stored representations.
Our results have both neurobiological and clinical significance. It remains unclear whether dorsal–ventral integration requires synchronized activity between the two streams (Singer, 1999) and what exactly is the nature of the representation that is transmitted from dorsal stream areas to their ventral stream counterparts. The SFM-defined face recognition task also provides a novel probe of dorsal–ventral integration, allowing for studies on the role of attention in cortical integration or its disruption in neurological disorders.
Footnotes
- This work was supported by Fonds de la Recherche en Santé du Québec and Natural Sciences and Engineering Research Council fellowships (R.F.), Canadian Institutes of Health Research operating grants (A.C.), and grants from the Swiss National Science Foundation and Fondation de la Famille Sandoz (O.B.). Special thanks to Bruno Rossion, Eugene Mayer, Michela Adriani, and Karen Borrmann for their time and help.
- Correspondence should be addressed to Reza Farivar, Department of Psychology, McGill University, 1205 Doctor Penfield Avenue, Montreal, QC H3A 1B1, Canada. reza.farivar{at}mail.mcgill.ca