Review
Making Sense of Real-World Scenes

https://doi.org/10.1016/j.tics.2016.09.003

Trends

We are not simply passive viewers of scenes but active participants within them; understanding a scene entails processing what actions can be performed, where items are likely to be, and how we can move through it.

Research on scene understanding has branched into many parallel fields, examining recognition, search, navigation, and action affordance, among others.

The multitude of different properties present in scenes, and the inherent correlations among them, make establishing the critical features for any given goal or for any given brain region extremely challenging.

The research field has arrived at a stage where observer goals should be integrated, and the relationship between co-occurring properties explored, so as to start building a more comprehensive framework for scene understanding.

To interact with the world, we have to make sense of the continuous sensory input conveying information about our environment. A recent surge of studies has investigated the processes enabling scene understanding, using increasingly complex stimuli and sophisticated analyses to highlight the visual features and brain regions involved. However, there are two major challenges to producing a comprehensive framework for scene understanding. First, scene perception is highly dynamic, subserving multiple behavioral goals. Second, a multitude of different visual properties co-occur across scenes and may be correlated or independent. We synthesize the recent literature and argue that for a complete view of scene understanding, it is necessary to account for both differing observer goals and the contribution of diverse scene properties.

Interacting with Real-World Scenes

Making a cup of tea is an easy task that requires minimal concentration, yet the composition of behaviors involved is deceptively complex: recognizing the room next door as a kitchen, navigating to it while maneuvering around obstacles, locating and handling objects (e.g., teabag, kettle), and manipulating those objects until they are in a correct state (e.g., filling the kettle). In addition, it requires knowledge of relative locations and future destinations within the environment (e.g., take …

Toward a Comprehensive Framework for Scene Understanding

There are two major challenges to producing a comprehensive theoretical framework that would outline the cognitive and neural mechanisms of scene understanding. The first is that, while the physical characteristics of our surrounding environment are generally stable, our immediate goals are not. At any given moment, different visual aspects of an environment will be prioritized based on our current goal (Figure 1A, Key Figure; Box 2). However, the dynamic nature of scene understanding is often …

Mapping Properties to Goals

Based on the discussion of these four main goals of scene understanding, it should be clear that the goals themselves are not mutually exclusive. For example, recognition facilitates search and navigation processes; navigation sometimes requires searching for specific information (e.g., objects, boundaries); scene affordance must consider navigability within a space, and so forth. Similarly, informative properties overlap various goals: spatial layout facilitates the early stages of recognition …

The Neural Mechanisms of Scene Understanding

In general, visual scene processing in humans has been characterized by a trio of scene-selective regions: the occipital place area (OPA), parahippocampal place area (PPA), and retrosplenial complex (RSC), on the lateral occipital, ventral temporal, and medial parietal cortical surfaces, respectively [82]. Studies in non-human primates have also reported scene-selective regions [83, 84, 85], as well as regions responsive to spatial landscapes [86]. Much of the research on humans has focused on …

Concluding Remarks and Future Directions

Scene understanding entails representing information about the properties and arrangement of the world to facilitate the ongoing needs of the viewer. By focusing on four major goals of scene understanding – recognizing the environment, searching for information within the environment, moving through the environment, and determining what actions can be performed – we have demonstrated how different goals use similar properties and, conversely, how many properties can be used for different goals.

Acknowledgments

G.L.M., I.I.A.G., and C.I.B. are supported by the intramural research program of NIMH (ZIAMH002909). I.I.A.G. is supported by a Rubicon Fellowship from The Netherlands Organization of Scientific Research (NWO). We thank Wilma Bainbridge, Michelle Greene, Assaf Harel, Antje Nuthmann, and Edward Silson for commenting on earlier versions of this manuscript.

To what extent is a combination of properties that facilitates a single goal (e.g., edges for scene recognition) generalizable to other tasks?

Glossary

Convolutional neural network
a computer vision model with a multi-layer hierarchical architecture that can be trained to classify visual images.
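To make this definition concrete, below is a minimal sketch of the core operation such a network stacks into a hierarchy: sliding a small learned filter over an image, followed by a nonlinearity. This is illustrative only; real CNNs learn the filter weights from data and combine many such layers with pooling and a final classification stage. The edge-detecting kernel and toy image here are invented for the example.

```python
# Minimal sketch of one convolutional layer: a small kernel slid over a
# 2D image, followed by a ReLU nonlinearity. Real CNNs stack many such
# layers with learned weights, pooling, and a classification head.

def conv2d(image, kernel):
    """Valid 2D convolution (stride 1) over a 2D list-of-lists image."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            # Dot product of the kernel with the image patch at (y, x).
            row.append(sum(image[y + dy][x + dx] * kernel[dy][dx]
                           for dy in range(kh) for dx in range(kw)))
        out.append(row)
    return out

def relu(feature_map):
    """Nonlinearity applied between layers: clamp negatives to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

# A hypothetical vertical-edge kernel applied to a tiny image whose
# right half is bright; the resulting feature map responds only at the
# luminance boundary.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge_kernel = [[-1, 1],
               [-1, 1]]
feature_map = relu(conv2d(image, edge_kernel))  # peaks at the edge column
```

Early layers of a trained CNN tend to learn exactly this kind of oriented-edge filter, with later layers combining their outputs into progressively more complex scene and object features.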
Diagnosticity
the relative usefulness of a specific subset of perceptual information in facilitating an observer's goal. For example, an oven is highly diagnostic in helping to categorize a scene as a kitchen, while an apple is less so.
Environmental space
a physical space that is too large to be appreciated without locomotion, requiring …

References (147)

  • M.M. Hayhoe et al., Modeling task control of eye movements, Curr. Biol. (2014)
  • T. Foulsham, The where, what and when of gaze allocation in the lab and the natural environment, Vision Res. (2011)
  • D.S. Marigold et al., Gaze fixation patterns for negotiating complex ground terrain, Neuroscience (2007)
  • J.B. Julian, The occipital place area is causally involved in representing environmental boundaries during navigation, Curr. Biol. (2016)
  • W.A. Bainbridge et al., Interaction envelope: local spatial representations of objects at all scales in scene-selective regions, Neuroimage (2015)
  • A. Oliva et al., The role of context in object recognition, Trends Cogn. Sci. (2007)
  • M.E. Wokke, Conflict in the kitchen: contextual modulation of responsiveness to affordances, Conscious. Cogn. (2016)
  • S. Kornblith, A network for scene processing in the macaque temporal lobe, Neuron (2013)
  • S. Vaziri, A channel for 3D environmental shape in anterior inferotemporal cortex, Neuron (2014)
  • D.M. Watson, Patterns of response to visual scenes are linked to the low-level properties of the image, Neuroimage (2014)
  • D.M. Watson, Patterns of neural response in scene-selective regions of the human brain are affected by low-level manipulations of spatial frequency, Neuroimage (2016)
  • L. Kauffmann, Spatial frequency processing in scene-selective cortical regions, Neuroimage (2015)
  • H. Choo et al., Contour junctions underlie neural representations of scene categories in high-level human visual cortex, Neuroimage (2016)
  • T. Konkle et al., A real-world size organization of object responses in occipitotemporal cortex, Neuron (2012)
  • D.E. Stansbury, Natural scene statistics account for the representation of scene categories in human visual cortex, Neuron (2013)
  • A. Torralba et al., Statistics of natural image categories, Netw. Comput. Neural Syst. (2003)
  • G.L. Malcolm, Beyond gist: strategic and incremental information accumulation for scene categorization, Psychol. Sci. (2014)
  • M.R. Greene, What you see is what you expect: rapid scene understanding benefits from prior experience, Atten. Percept. Psychophys. (2015)
  • R. VanRullen, Four common conceptual fallacies in mapping the time course of recognition, Front. Psychol. (2011)
  • A. Oliva, Scene perception
  • M.C. Potter, Detecting meaning in RSVP at 13 ms per picture, Atten. Percept. Psychophys. (2014)
  • J.F. Maguire et al., Failure to detect meaning in RSVP at 27 ms per picture, Atten. Percept. Psychophys. (2016)
  • A. Oliva et al., Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. (2001)
  • H.S. Scholte, Brain responses strongly correlate with Weibull image statistics when processing natural images, J. Vis. (2009)
  • M.R. Greene et al., The briefest of glances: the time course of natural scene understanding, Psychol. Sci. (2009)
  • I.I.A. Groen, From image statistics to scene gist: evoked neural activity reveals transition from low-level natural image structure to scene category, J. Neurosci. (2013)
  • D.B. Walther et al., Nonaccidental properties underlie human categorization of complex natural scenes, Psychol. Sci. (2014)
  • V. Goffaux, Diagnostic colours contribute to the early stages of scene categorization: behavioural and neurophysiological evidence, Vis. Cogn. (2005)
  • M.S. Castelhano et al., The influence of color on the perception of scene gist, J. Exp. Psychol. Hum. Percept. Perform. (2008)
  • J.L. Davenport et al., Scene consistency in object and background perception, Psychol. Sci. (2004)
  • C.R. Gagne et al., Do simultaneously viewed objects influence scene recognition individually or as groups? Two perceptual studies, PLoS One (2014)
  • M.R. Greene, Visual noise from natural scene statistics reveals human scene category representations (2014)
  • R. Glanemann, Rapid apprehension of the coherence of action scenes, Psychon. Bull. Rev. (2016)
  • I. Kadar et al., A perceptual paradigm and psychophysical evidence for hierarchy in scene gist processing, J. Vis. (2012)
  • L. Fei-Fei, What do we perceive in a glance of a real-world scene?, J. Vis. (2007)
  • K. Rayner, Masking of foveal and parafoveal vision during eye fixations in reading, J. Exp. Psychol. Hum. Percept. Perform. (1981)
  • B.W. Tatler, Looking at domestic textiles: an eye-tracking experiment analysing influences on viewing behaviour at Owlpen Manor, Text. Hist. (2016)
  • A. Nuthmann, How do the regions of the visual field contribute to object search in real-world scenes? Evidence from eye movements, J. Exp. Psychol. Hum. Percept. Perform. (2014)
  • M.S. Castelhano, Viewing task influences on eye movements during scene perception, J. Vis. (2009)
  • B.W. Tatler et al., The prominence of behavioural biases in eye guidance, Vis. Cogn. (2009)