Research report
An electrophysiological study of scene effects on object identification

https://doi.org/10.1016/S0926-6410(02)00244-6

Abstract

The meaning of a visual scene influences the identification of visual objects embedded in it. We investigated the nature and time course of scene effects on object identification by recording event-related brain potentials (ERPs) and response times (RTs). In three experiments, participants identified objects within a scene that were either semantically congruous (e.g., a pot in a kitchen) or incongruous (e.g., a desk in a river). As expected, RTs were faster for congruous than incongruous objects. The earliest sign of reliable scene congruity effects in the ERPs (greater positivity for congruous pictures between 300 and 500 ms) was around 300 ms. Both the morphology and time course of the N390 scene congruity effect are reminiscent of the N400 sentence congruity effect typically observed in sentence context paradigms, suggesting a functional similarity of the neural processes involved. Overall, these results support theories postulating that visual scenes do not appreciably affect object identification processes before associated semantic information is activated. We speculate that the N390 scene congruity effect reflects the action of visual scene schemata stored in the anterior temporal lobe.

Introduction

Most empirical work on visual object identification has focused on isolated objects (see for review Refs. [41], [47], [78]). Yet, in our everyday visual environment, objects are embedded in meaningful visual scenes. In this work, we focus on how one specific high-level regularity in visual scenes, namely the long-term co-occurrence of object classes in certain scene contexts, affects visual object identification. Specifically, we compare the identification of an object in a setting in which it is often seen (congruous context) versus one in which it is rarely seen (incongruous context), such as a personal computer in an office versus a bathroom scene. In these studies, we do not distinguish between ‘associative’ and ‘semantic’ regularities: objects that are semantically related not only tend to be seen in similar scenes but also typically tend to co-occur in the same scene.

Recent studies by Chun and collaborators (see for review Ref. [16]) using visual search paradigms have shown that the cognitive system can indeed incidentally acquire information about the co-occurrence of visual shapes. These studies have demonstrated that seeing a target shape in the context of an array of other shapes facilitates later search for the target shape when it appears in the same context array, relative to a novel one. Furthermore, the results of electrophysiological studies in non-human primates have provided direct evidence that neurons in the anterior temporal cortex can encode such co-occurrence patterns by means of associations between elaborate representations of visual stimuli [55], [56]. Outside the laboratory, temporal associations of this type may occur routinely via systematic patterns of eye movements (e.g., Ref. [22]); after all, visual objects that tend to co-occur in a visual scene are likely to be fixated in close temporal proximity.

Cognitive psychological accounts have hypothesized the existence of specific memory representations for such co-occurrence patterns. Within a number of psychological accounts, cumulative interactions of the organism with the environment are presumed to lead to the formation of scene-specific schemata, or frames, that represent the “likelihoods, ranges and distributions of things and events” (Ref. [25], p. 321). Some accounts also include memory representations that are built from a network of dynamic associations among neuron-like nodes and have the advantage of providing a potential link with the neurobiology (e.g., Ref. [64]).

Once activated, scene-specific memory representations are assumed to influence processing of incoming visual information, although both the mode and time course of this influence are debated. Different accounts of object identification in scenes postulate different loci for context effects. Among the processes into which object identification has been decomposed, perceptual processes are believed to analyze the visual input and to transform it into ‘structural descriptions’ (i.e., high-level visual representations of the shape and structure of visual objects [63]). Subsequent processes are presumed to match the structural descriptions of the object to be identified with those of object models stored in the structural description system [63]. If a good match is found, relevant semantic knowledge is activated. From a computational vision perspective, object identification terminates with a successful match; thus, the processes involved in the activation of semantic knowledge (and beyond) are typically referred to as ‘post-identification’. In our view, the activation of semantic knowledge is instead an integral part of object identification because the purpose of object vision is precisely that of ascertaining semantic knowledge about visual stimuli.

The level of processing at which scene information can affect object analysis varies between accounts, among which there are three main subdivisions.

According to the first class of accounts, an activated schema can affect the early perceptual analysis of objects within the scene. Schema activation is thought to occur rapidly on the basis of global, low-resolution contextual information [43], [51]. Among the possible candidates for such low-resolution information are scene-emergent features, such as geon clusters typically associated with specific classes of scenes [9] or configurations of oriented ‘blobs’ [71]. An activated schema is assumed to facilitate, by means of feature-selective attention, the detection of perceptual features (color, texture, size, motion, etc.) that are associated with objects specified within the schema itself [4], [25]. Objects consistent with a scene would thus be identified more quickly on the basis of such features than objects that are inconsistent with a scene.

According to a second class of accounts, scene schemata are assumed to have their effect on the processes involved in matching the structural description with those of object models stored in memory, beyond early perceptual analysis. If we conceptualize this matching process as a selection process wherein the visual system must scan multitudes of representations in the structural description system to find a good match, scene constraints could serve to reduce the size of the search space by priming/biasing classes of the most likely object model representations (e.g., Ref. [78]).

A third class of accounts relegates scene effects to even later processes, such as during semantic knowledge activation or later (e.g., Refs. [21], [33], [37], [47]). According to these accounts, bottom-up visual analysis is sufficient to discriminate between entry-level object categories, after which context may have its effects.

It has proven difficult to infer the time course of context effects from behavioral measures alone because they reflect the ‘downstream’ effects of an experimental manipulation (from the earliest perceptual stages to the motor response). We, therefore, chose to examine the time course of scene effects on object identification via a brain measure with greater temporal precision, namely, recordings of event-related brain potentials (ERPs). Scalp ERPs enable continuous monitoring of the modulations of synchronous neural activity elicited by experimental manipulations in a relatively direct manner. As ERPs have provided crucial information regarding the time course of semantic context effects in language processing [42], [80], we aimed to use a similar approach to investigate the timing of context effects on nonlinguistic, visual processing.

To assess the time course of scene effects on object identification, we compared ERPs elicited by objects appearing in congruous versus incongruous scene contexts. In such an analysis, the timing of ERP differences between these two conditions provides an estimate of the time by which neural representations of scene information must have begun to interact with identification processes for the target object. Furthermore, the spatial distribution of these ERP congruity effects across the head can provide some clues about the nature of the processes involved. To our knowledge, this is the first ERP study that directly addresses scene effects on object identification.
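The onset-timing logic described above — the difference wave between congruous and incongruous conditions bounds when scene information first interacts with object processing — can be sketched as follows. This is a minimal illustration on toy data, not the analysis pipeline of the original study; the threshold, run length, and waveforms are all assumptions for demonstration.

```python
import numpy as np

def congruity_onset(congruous, incongruous, times, thresh=1.0, min_run=10):
    """Estimate the onset of an ERP congruity effect as the first time point
    at which the congruous-minus-incongruous difference wave exceeds
    `thresh` (in microvolts) for at least `min_run` consecutive samples."""
    diff = congruous - incongruous           # difference wave (uV)
    above = np.abs(diff) > thresh            # samples exceeding threshold
    run = 0
    for i, flag in enumerate(above):
        run = run + 1 if flag else 0
        if run == min_run:
            return times[i - min_run + 1]    # onset time (ms)
    return None                              # no reliable effect found

# Toy example: 1 ms sampling; a 2 uV positivity emerges at 300 ms
times = np.arange(0, 600)
congruous = np.zeros(600)
congruous[300:500] = 2.0
incongruous = np.zeros(600)
print(congruity_onset(congruous, incongruous, times))  # 300
```

Requiring a run of consecutive suprathreshold samples (rather than a single crossing) is a common guard against noise-driven spurious onsets in difference-wave analyses.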

Experiment 1 is a behavioral study to demonstrate the scene congruity effect with our stimulus set. Experiment 2 is an ERP study to determine the time course and spatial signature of the scene congruity effect. Experiment 3 is a replication of Experiment 2 with a modified paradigm not requiring explicit congruity judgments in order to allow a more direct comparison of the congruity effects obtained using scenes with those found in earlier studies with sentential contexts.

Note that throughout the paper we will use the traditional electrophysiological nomenclature to refer to most ERP components (i.e., voltage deflections): the first letter indicates whether the component is negative (‘N’) or positive (‘P’) relative to the chosen reference, whereas the subsequent number(s) indicates either the average latency of the component in ms (e.g., N200) or the ordinal position of the component (e.g., ‘P1’ refers to the first positive ERP component in the visual ERP).
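The naming convention just described can be made concrete with a small illustrative parser (a hypothetical helper, not part of the original study; the cutoff distinguishing ordinal labels from latency labels is an assumption):

```python
def describe_component(name):
    """Interpret a conventional ERP component label such as 'N390' or 'P1'.

    First letter: 'N' = negative, 'P' = positive (relative to reference).
    Following number: ordinal position if small (e.g., P1), otherwise the
    approximate peak latency in milliseconds (e.g., N400).
    """
    polarity = {"N": "negative", "P": "positive"}[name[0]]
    number = int(name[1:])
    if number < 10:  # assumed cutoff between ordinal and latency labels
        return f"{polarity} component, ordinal position {number}"
    return f"{polarity} component peaking near {number} ms"

print(describe_component("N390"))  # negative component peaking near 390 ms
print(describe_component("P1"))    # positive component, ordinal position 1
```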

Based on numerous behavioral studies that have reported scene effects on object identification (e.g., Refs. [10], [14]), we expected to obtain a reaction time advantage for congruous relative to incongruous objects. The timing of ERP congruity effects is thus the main focus of the present study. We also used ERP congruity effects to evaluate the three main accounts of scene effects on object cognition. On the one hand, we reasoned that if scene schemata affect the early perceptual analysis of objects, then ERP components indexing early perceptual processes should be modulated by congruity. Short-latency ERPs (onsetting during the first 200 ms poststimulus) are believed to index early perceptual processes because (a) they are modulated by variations in the perceptual attributes of a stimulus, such as spatial location (e.g., Ref. [17]), luminance [46], color [15], [46], spatial frequency [40], coherent motion [59], and shape [68], [77], and (b) they are modulated by task manipulations that affect the perceptual encoding of a stimulus: the P1 and the N1 components (which onset between 80 and 130 ms poststimulus) are modulated by spatial attention (e.g., Ref. [32]); furthermore, the selection negativity (SN, which onsets between 140 and 180 ms poststimulus) is elicited by selective attention to nonspatial features of a visual stimulus such as color [3], shape [72], direction of motion [2], spatial frequency [31], and orientation [39], [62]. The time course and estimated neuroanatomical location of these ERPs (early extrastriate cortex) suggest that they occur prior to stimulus identification.

On the other hand, if scene schemata act only later to affect the activation of semantic knowledge, then congruity should modulate only later components, such as the N400 [42]. N400 amplitude is typically reduced when the eliciting stimulus is preceded by an associatively or semantically related one (e.g., Refs. [5], [6], [7], [8], [12], [36], [50], [69], [70]). Indeed, one view of the N400 is that it reflects neural processes involved in the activation of semantic knowledge [61], [80] probably stored in the anterior temporal lobes.

Predictions for the ERP correlates of scene schemata effects on structural description matching processes can be estimated from prior ERP investigations of the time course of object identification. A class of negativities (‘N300/N350’, ‘Ncl’) peaking around 350 ms with a frontal scalp distribution (with mastoid reference site) has been proposed to reflect activation of structural description matching processes [24], [26], [36], [50], [67]; we refer to these as the structural description negativity, Nsd. By 200–300 ms, the Nsd is greater for unidentified real objects and non-objects (i.e., images of objects that do not exist) than for identified real objects. These effects can last for several hundred milliseconds, consistent with the idea that the Nsd reflects repeated but ultimately failed attempts to find a good match for unidentified objects and non-objects; in contrast, Nsd is rapidly reduced within 300 ms when structural description matching succeeds with identified real objects. Thus, if scene schemata affect object identification at the level of structural description matching processes, we might expect to find an ERP congruity effect that onsets around 200 ms with an anterior scalp distribution similar to Nsd effects.

Section snippets

Participants

All 42 participants were UCSD students and native speakers of English. They received course credit, or were paid $5.00/h for participating. Ten participants took part in Experiment 1 (four men, six women between 18 and 25 years of age, mean 20; nine right-handed). Seventeen participants, different from those employed in Experiment 1, took part in Experiment 2 (nine men, eight women between 18 and 25 years of age, mean 20.3; 14 right-handed); data from two participants were discarded due to

Behavioral data

In Experiment 1 (Table 1) participants correctly identified 93% of the target objects. Congruity affected neither the object identification rates (F(2,18)=1.14, P>0.1) nor the object confidence ratings (F(2,18)=0.02, P>0.1). In contrast, mean confidence ratings for the scenes were higher in the congruous (3.93) than incongruous (3.86) condition (F(1,9)=5.62, P<0.05). Finally, congruous stimuli were rated significantly higher in congruity than incongruous ones (F(1,9)=3156.84, P<0.00001). Median

N390 congruity effect

The first reliable effect of congruity, modulation of an N400-like component, begins ∼300 ms, peaks ∼390 ms, and lasts until ∼500 ms. Throughout this interval, congruous targets show more positivity than incongruous ones. The time course of this congruity effect is roughly similar to that reported for written words in sentences, suggesting that the neural processes underlying the interaction between a visual stimulus and the context in which it appears may operate under similar time constraints
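Effects of this kind are conventionally quantified as the mean voltage within a latency window (here, 300–500 ms), compared across conditions. The following sketch shows that computation on toy data; the waveforms and window are illustrative assumptions, not the study's actual recordings.

```python
import numpy as np

def mean_amplitude(erp, times, window=(300, 500)):
    """Mean voltage (uV) within a latency window (ms) — the standard way an
    N400-like effect such as the N390 congruity effect is quantified."""
    mask = (times >= window[0]) & (times < window[1])
    return erp[mask].mean()

times = np.arange(0, 600)          # 1 ms sampling, toy data
congruous = np.zeros(600)
congruous[300:500] = 2.0           # congruous: 2 uV positivity in window
incongruous = np.zeros(600)
incongruous[300:500] = -1.0        # incongruous: 1 uV negativity in window

effect = mean_amplitude(congruous, times) - mean_amplitude(incongruous, times)
print(effect)  # 3.0 — congruous more positive, the direction reported here
```

In practice, such per-condition mean amplitudes would be computed for each participant and electrode and submitted to a repeated-measures analysis of variance.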

Conclusions

The current experiments, together with prior findings, suggest the following picture of object identification in briefly flashed scenes. Within the first 250–300 ms, visual inputs are processed without any notable impact of top-down influences from the semantic content of a scene. The effects of a scene or a sentence context on object identification processes occur somewhat later, ∼300 ms, as reflected in the N390 scene congruity effect; here, information in memory accessed by the scene context

Acknowledgements

Work reported herein was supported by grants MH52893, HD22614, and AG08313 to M. Kutas. During the revision process of this article the first author was supported by a McDonnell-Pew Program in Cognitive Neuroscience Award and by grants NMA 201-01-C-0032 and 5R01 MH 60734-03 to Stephen M. Kosslyn. We would like to thank Haline E. Schendan and Stephen M. Kosslyn for helpful discussion.

References (80)

  • P.J. Holcomb et al.

    Event-related brain potentials reflect semantic priming in an object decision task

    Brain Cogn.

    (1994)
  • A. Hollingworth et al.

    Object identification is isolated from scene semantic constraint: evidence from object type and token discrimination

    Acta Psychol.

    (1999)
  • D. Karis et al.

    ‘P300’ and memory: Individual differences in the von Restorff effect

    Cogn. Psychol.

    (1984)
  • J.L. Kenemans et al.

    Event-related potentials to conjunctions of spatial frequency and orientation as a function of stimulus parameters and response requirements

    Electroencephalogr. Clin. Neurophysiol.

    (1993)
  • J.L. Kenemans et al.

    On the processing of spatial frequencies as revealed by evoked-potential source modeling

    Clin. Neurophysiol.

    (2000)
  • M.G. Manolas et al.

    Differences in human visual evoked potentials during the perception of colour as revealed by a bootstrap method to compare cortical activity. A prospective study

    Neurosci. Lett.

    (1999)
  • G. McCarthy et al.

    Scalp distributions of event-related potentials: An ambiguity associated with analysis of variance models

    Electroencephalogr. Clin. Neurophysiol.

    (1985)
  • B. Milner

    Visual recognition and recall after right temporal-lobe excision in man

    Neuropsychologia

    (1968)
  • Y. Miyashita et al.

    Feedback signal from medial temporal lobe mediates visual associative mnemonic codes of inferotemporal neurons

    Brain Res. Cogn. Brain Res.

    (1996)
  • E.A. Murray et al.

    Perceptual-mnemonic functions of the perirhinal cortex

    Trends Cogn. Sci.

    (1999)
  • M. Niedeggen et al.

    Characteristics of visual evoked potentials generated by motion coherence onset

    Brain Res. Cogn. Brain Res.

    (1999)
  • D.E. Rumelhart et al.

    Schemata and sequential thought processes in PDP models

  • R. Srebro

    A bootstrap method to compare the shapes of two scalp fields

    Electroencephalogr. Clin. Neurophysiol.

    (1996)
  • G.W. Van Hoesen

    Anatomy of the medial temporal lobe

    Magn. Reson. Imaging

    (1995)
  • A. Amir

    Uniqueness of the generators of brain evoked potential maps

    IEEE Trans. Biomed. Eng.

    (1994)
  • L. Anllo-Vento et al.

    Selective attention to the color and direction of moving stimuli: electrophysiological correlates of hierarchical feature selection

    Percept. Psychophys.

    (1996)
  • L. Anllo-Vento et al.

    Spatio-temporal dynamics of attention to color: evidence from human electrophysiology

    Hum. Brain Mapp.

    (1998)
  • J.R. Antes et al.

    Processing global information in briefly presented pictures

    Psychol. Res.

    (1981)
  • I. Biederman

    Aspects and extensions of a theory of human image understanding

  • S.J. Boyce et al.

    Identification of objects in scenes: the role of scene background in object naming

    J. Exp. Psychol. Learn. Mem. Cogn.

    (1992)
  • S.J. Boyce et al.

    Effect of background information on object identification

    J. Exp. Psychol. Hum. Percept. Perform.

    (1989)
  • H. Buchner et al.

    The timing of visual evoked potential activity in human area V4

    Proc. R. Soc. Lond. B Biol. Sci.

    (1994)
  • V.P. Clark et al.

    Identification of early visual evoked potential generators by retinotopic analyses

    Hum. Brain Mapp.

    (1995)
  • V.P. Clark et al.

    Identification of early visual evoked potential generators by retinotopic and topographic analyses

    Hum. Brain Mapp.

    (1995)
  • A.R. Damasio

    The brain binds entities and events by multiregional activation from convergence zones

    Neural Comput.

    (1989)
  • P. De Graef

    Scene-context effects and models of real-world perception

  • P. De Graef et al.

    Local and global contextual constraints on the identification of objects in scenes. Special Issue: Object perception and scene analysis

    Can. J. Psychol.

    (1992)
  • E. Donchin et al.

    Is the P300 component a manifestation of context updating?

    Behav. Brain Sci.

    (1988)
  • G.M. Doninger et al.

    Activation timecourse of ventral visual stream object-recognition areas: high density electrical mapping of perceptual closure processes

    J. Cogn. Neurosci.

    (2000)
  • A. Friedman

    Framing pictures: the role of knowledge in automatized encoding and memory for gist

    J. Exp. Psychol. Gen.

    (1979)