Elsevier

Cognitive Psychology

Volume 56, Issue 4, June 2008, Pages 237-283
Dynamic binding of identity and location information: A serial model of multiple identity tracking

https://doi.org/10.1016/j.cogpsych.2007.03.001

Abstract

Tracking of multiple moving objects is commonly assumed to be carried out by a fixed-capacity parallel mechanism. The present study proposes a serial model (MOMIT) to explain performance accuracy in the maintenance of multiple moving objects with distinct identities. A serial refresh mechanism is postulated, which makes recourse to continuous attention switching, a capacity-limited episodic buffer for identity-location bindings, indexed location information stored in the visuospatial short-term memory, and an active role of long-term memory. As identity-location bindings are refreshed serially, a location error is inherent for all other targets except the focally attended one. The magnitude of this location error is a key factor in predicting tracking accuracy. MOMIT’s predictions were supported by the data of five experiments: performance accuracy decreased as a function of target set-size, speed, and familiarity. A mathematical version of MOMIT fitted nicely to the observed data with plausible parameter estimates for the binding capacity and refresh time.

Introduction

Our perceptual-attentional system faces a challenging task when attempting to simultaneously track multiple moving elements in a visual scene. Tasks of this kind are common in many real-life visual environments, such as in traffic and sports, for example, when a player keeps track of other moving players during a soccer game, or when an air traffic controller monitors aircraft on a radar screen. In order to make quick and sensible decisions in tasks like these (e.g., passing the ball to a team mate, or giving ATC clearance to aircraft), observers need to keep track of where each visual element is located at any given time.

Computationally, in this kind of task observers must be able to dynamically bind correct identities to continuously changing spatiotemporal locations: observers face a dynamic what–where binding problem (for the binding problem, see e.g., Treisman, 1996; the special issue of Visual Cognition 3–5/8, 2001). In the real world, our perceptual–attentional system must solve this problem in very different environments and stimulus conditions. The target stimuli may vary in many ways with respect to, for example, their spatiotemporal and semantic properties. The number of objects-to-be-tracked may vary and so may speed, movement direction, and movement trajectory of the target objects, as well as the number of distracter items present. The objects may temporarily disappear due to external (occlusion) or internal (saccadic suppression) reasons. The objects may be familiar or unfamiliar to the observer. Finally, the tracked objects may be visually identical to or distinct from each other.

There is empirical evidence demonstrating that binding of ‘what’ to ‘where’ in a visual scene is not an easy task for the attentional-perceptual system even when the observed scenes are static. For example, data suggest that observers may be prone to illusory conjunctions, that is, to erroneously bind visual feature combinations (e.g., binding the red color with a triangle when it should have been attached to a circle; Treisman & Schmidt, 1982). Moreover, it seems that what–where bindings are not achieved automatically in an early stage of visual processing. Instead, neurophysiological evidence suggests that the visual input is functionally decomposed into separate dimensions in the ventral (what) and the dorsal (where) streams (e.g., Ungerleider & Mishkin, 1982). All this supports the idea that the dynamic binding problem described above is real and that it is a non-trivial endeavor for the perceptual-attentional system to accomplish the task. Clearly, a mechanism is needed to ensure that correct identity-location combinations are perceived and maintained.

Several fundamental questions can be asked about the workings of the binding mechanism. How efficiently does it carry out its task across different dynamic input situations? Is it vulnerable to changes in ‘where’ information but not in ‘what’ information, or vice versa, or both, or neither? How crucial is the availability of identity information for successful tracking? Does it matter whether the identities are known to the observer (the role of semantics)? How susceptible is the binding mechanism to changes in target velocity (the effect of speed)? Does the mechanism possess a large parallel capacity so that the observer is able to genuinely track several moving target identities simultaneously, or does it possess only a very limited capacity so that the observer needs to constantly shift the focal attention from one target to another to check out the identities (the effect of set-size)?

Previous research on dynamic visual attention suggests that the tracking mechanism carries out its task very efficiently without recourse to featural or semantic properties of the objects (e.g., Kahneman et al., 1992; Pylyshyn and Storm, 1988; see Section 9 for a more detailed discussion of different versions of the fixed-capacity parallel account). According to theories of this kind, tracking is parallel and semantic properties of the objects play no role in the process; dynamic what–where bindings are made solely on the basis of low-level physical or spatiotemporal properties of the moving objects.

In this article, we propose an alternative serial account where semantic properties can influence dynamic tracking. The starting point for the model was the set of findings reported by Oksama and Hyönä (2004). Among other things, they observed notable individual variation in the tracking capacity, which significantly correlated with visuospatial working memory and task-switching capacity. Oksama and Hyönä sketched a serial model based on temporary visuospatial memory in order to explain their results. This serial model is here developed further and also fully formalized. Its predictions are investigated in dynamic binding experiments where set-size, object speed, and object type were manipulated. We demonstrate that our serial model can quantitatively account for the observed set-size, speed, and semantic effects.

In what follows, we introduce our model of dynamic binding, called MOMIT (Model Of Multiple Identity Tracking), which makes recourse to continuous attention switching, episodic buffer, visuo-spatial short-term memory (VSTM), and an active role of long-term memory in dynamic binding. We first lay out and motivate the basic assumptions of the model and outline the general architecture of MOMIT. We then describe how the postulated mechanisms are assumed to function in practice (a functional model). The mathematical model and its fit to experimental data are described at the end of the Results section.

MOMIT is based on the following five tenets:

  • (1)

    Efficient maintenance of multiple moving objects requires continuous serial (re)activation and refreshing of the dynamic identity-location bindings. If bindings are not periodically refreshed, they are eventually lost. The refreshing of existing bindings is assumed to be a non-automatic and effortful process based on continuous shifts of attention between targets.

  • (2)

    There is a capacity limitation as to the number of bindings that can be simultaneously kept active in the episodic buffer where the identity-location bindings are assumed to be created and temporarily stored. Moreover, there is significant individual variation in the tracking capacity.

  • (3)

    Long-term memory (LTM) representations are utilized in creating temporary bindings. Thus, binding is more readily made for familiar than unfamiliar visual objects. As a result, tracking performance is better for familiar than unfamiliar targets.

  • (4)

    Spatial indexes (or location pointers) for the tracked targets are temporarily stored in VSTM. These indexes are then utilized by the mechanism that programs shifts of visual attention between targets. As targets move continuously, there is a location error involved in the spatial indexes.

  • (5)

    The system responsible for shifting of attention during tracking also obtains location information of moving objects in parallel. This information is provided by the peripheral vision. However, unlike the location information provided by VSTM (see Principle 4), this spatial information is not indexed; in other words, targets are not differentiated from distracters on the basis of this information.

Next, we motivate in more detail each of the five tenets of MOMIT.

The first assumption of MOMIT is that targets are tracked by continuously refreshing the identity-location bindings one at a time. Underlying this assumption is the view that focal attention is required (1) to identify a target and (2) to (re)bind the identity to a spatial index. Several parallels with this assumption can be found in the relevant prior literature. First, Moray (1984) and Logan (2002a) have argued that when task-relevant visual episodes are spread across a large visual scene, simultaneous tracking of multiple objects is highly likely to become impossible for reasons of visual acuity (in the present set of experiments, target objects moved wide apart from each other). Moray develops this idea further by introducing uncertainty as a critical factor in dynamic environments. Uncertainty can be either exogenous (due to characteristics of the visual input, such as poor visibility) or endogenous (due to psychological reasons, such as forgetting) in nature. Uncertainty increases as a function of time. To reduce uncertainty, the observer is assumed to periodically update the constantly changing situation by taking visual samples of the current state of the dynamic environment. Second, the first assumption of MOMIT is also in keeping with the studies of Treisman and colleagues, who have argued that focused attention is required both to create feature conjunctions (Treisman & Gelade, 1980) and to temporarily maintain bindings (Wheeler & Treisman, 2002). Third, it is in line with the view advocated by Logan and Zbrodoff (1999), according to which visuospatial attention is intimately involved in constructing compositional representations, such as identity-location bindings. Finally, the assumption bears analogy to the findings in reading research where it is shown that words (a specific type of visual object) are primarily identified only when they are focally attended (see Rayner, 1998, for a review). (We hasten to add that word recognition in reading undoubtedly differs from multiple identity tracking in many important respects.) The binding with the lowest activation (or highest uncertainty) is assumed to have the highest priority when a target object is chosen to be refreshed (see Moray, 1984). It should be noted that the first assumption of MOMIT stands in contrast to Cavanagh and Alvarez (2005), Pylyshyn and Storm (1988), and Yantis (1992), who all argue in favor of parallel models and against serial accounts.

By efficient maintenance of identity-location bindings we mean that they can be accessed rapidly and accurately if needed, for example, in order to make a rapid decision, such as giving clearance to an aircraft or passing the ball to a fellow player. More direct evidence supporting the assumption of continuous refreshing comes from Oksama and Hyönä (2004). They observed that tracking deteriorated linearly as a function of set-size (Pylyshyn & Storm, 1988 also observed that tracking deteriorated as a function of set-size but they interpret the finding differently). This implies that the tracking performance involves a serial component. Evidence supporting the quickly decaying nature of target location information is also provided by Oksama and Hyönä (2004). They conducted a signal detection analysis of the performance of tracking multiple identical objects and showed that the tracking performance deteriorated as a function of time (see also Pylyshyn, 2004). This is taken as evidence that visual tracking is not completely automatic and parallel but effortful serial attention is also needed. Finally, Oksama and Hyönä (2004) demonstrated that individual differences in the attention switching capacity significantly correlate with the tracking performance.

We assume that there exists a tight coupling between attention shifts and eye movements (see e.g., Deubel and Schneider, 1996; Findlay and Gilchrist, 2003): attention drives the eyes in that a shift of attention is followed by a saccade to the attended object location (Henderson, 1992). That eye movements are needed in tracking objects in real-life types of visual environments is also proposed by Logan (2002a) and Moray (1984). However, this specific assumption is not put to a test in the present study (see Landry, Sheridan, & Yufik, 2001, for an attempt to use eye movements to study tracking of multiple moving objects).

The second building block of MOMIT is a limited-capacity episodic buffer where identity-location bindings are assumed to be created and temporarily stored. For example, Luck and Vogel (1997) have shown that about four static identity-location bindings may be kept active at a time. The average capacity is about four even when the objects are moving (Horowitz et al., 2007; Oksama and Hyönä, 2004). The notion of episodic buffer is borrowed from Baddeley (2000), who posits such a temporary store for the service of integrating different fleeting representations into one unified whole (e.g., binding identity and location information together). We would like to add that we are not committed to the notion of episodic buffer; any temporary store would do. The fact that working memory is indeed intimately involved in multiple identity tracking (MIT) is demonstrated by Oksama and Hyönä (2004), who observed a strong correlation between MIT performance and working memory capacity measured using visuospatial stimuli. Finally, the claim that significant individual differences exist in the tracking capacity is also based on the results of Oksama and Hyönä (2004).

The third assumption of MOMIT is that LTM is involved in the creation of temporary identity-location bindings. If available, identity information is activated during MIT (i.e., when creating bindings). Thus, attention is assumed to be disengaged sooner from familiar than unfamiliar objects, and the latency of revisiting a target object is thus shorter for familiar than unfamiliar objects. Hence, MOMIT predicts better tracking performance for familiar than unfamiliar objects. This assumption is in line with theories that consider LTM representations to be pertinent in perception and short-term maintenance of visual objects and other types of stimuli (see e.g., Chun and Potter, 1995; Cowan, 1995; Kanwisher, 1991; Ruchkin et al., 2003; Shapiro et al., 1994). This principle is also in keeping with the spirit of the late selection theories of attention, which argue that all object properties, including semantic properties, are activated and utilized in the early stages of visual processing and thus influence the binding process. To our knowledge, this view has not yet been applied to dynamic visual attention.

The fourth assumption of MOMIT is that the location information necessary to create bindings is available in VSTM (perhaps in the form of the visuospatial sketchpad proposed by Baddeley, 1986; see also Logie, 1995) as spatial indexes (Logan and Zbrodoff, 1999; Pylyshyn and Storm, 1988). Spatial indexes are deictic pointers that specify the location of the target objects in space. They act as an ‘address’ for the perceptual object but they do not specify the properties of objects (see Logan & Zbrodoff, 1999). A key feature of this subsystem is that the spatial coordinates are not assumed to be perfectly up-to-date; they provide only an approximate location for the targets that are currently not focally attended. As target location information is updated only when the target is focally attended, in dynamic visual environments the location information provided by spatial indexes is always outdated to some degree. Thus, there is a difference between the stored VSTM coordinates and the present location of the targets (except for the focally attended one). We call this difference the VSTM coordinate error. In MOMIT, the magnitude of this VSTM error is one of the key factors determining the success of tracking multiple moving objects. The assumption that spatial indexes are provided by VSTM is at odds with the models of multiple object tracking that posit that no memory is needed to perform the task (e.g., Pylyshyn, 1989; Yantis, 1992). It is also inconsistent with the FINST theory of Pylyshyn (1989, 1994, 2001), which posits that spatial indexes are constantly updated; once the pointers are aligned with the targets, they are assumed to move along with the targets without the need of attention (they are like ‘sticky fingers’ placed on targets). To sum up, in the MOMIT architecture semantic identity-location bindings are assumed to be created and stored in the episodic buffer, but the targets’ indexed location coordinates are stored in VSTM.
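As a rough illustration of the VSTM coordinate error, the sketch below assumes that a target moves at constant speed and that its stored coordinates are as old as the time elapsed since it was last focally attended. The round-robin refresh order, the function names, and the numeric values are illustrative assumptions; they are not taken from the formal model.

```python
def vstm_error(speed, time_since_refresh):
    """Upper bound on how far a target has moved since its location was stored.

    Assumes constant speed and straight-line movement, so the result is
    simply speed * time (in whatever units the caller uses).
    """
    return speed * time_since_refresh


def round_robin_staleness(set_size, refresh_time):
    """Time a target waits between refreshes when targets are visited in turn.

    Assumption: every refresh takes the same time and no corrective
    attention shifts occur, so a target waits for the other
    (set_size - 1) targets to be refreshed.
    """
    return (set_size - 1) * refresh_time


# Hypothetical values: speed in deg/s, refresh time in seconds.
for n in (2, 4, 6):
    wait = round_robin_staleness(n, refresh_time=0.25)
    print(n, vstm_error(speed=8.0, time_since_refresh=wait))
```

Under these assumptions the error grows with both speed and set-size, and because the two factors multiply, the effect of speed is larger at larger set-sizes.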

The fifth assumption of MOMIT is that peripheral vision provides non-indexed location information (cf. the indexed location information of VSTM) about all moving objects in parallel. This information is accurate, but it does not differentiate targets from distracters. The serial shifting of attention is controlled partly endogenously with the help of VSTM (see above) and partly exogenously with the help of peripheral vision. VSTM provides a rough spatial location of the to-be-attended target; the object to be focally attended next is the closest object around the area determined by VSTM. Thus, the attended object is not necessarily the intended one (due to the VSTM error)—it could also be a distracter or a recently refreshed target. If the attended object is not the intended one, it is attended anyway; if it is a distracter (or another target), a search for the intended target item is initiated.
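The selection rule described here (attend the object closest to the stored VSTM location) can be sketched as follows. The scene, the labels, and the helper name `nearest_object` are hypothetical and serve only to show how an outdated stored location can lead attention to a distracter.

```python
import math


def nearest_object(stored_xy, objects):
    """Return the label of the object closest to a stored VSTM location.

    `objects` maps a label to its current (x, y) position. Targets and
    distracters are treated alike, because the peripheral location
    information is assumed to be non-indexed.
    """
    return min(objects, key=lambda label: math.dist(stored_xy, objects[label]))


# Hypothetical scene: T2 was last stored at (0, 0) but has since moved away.
scene = {"T2": (3.0, 0.0), "D1": (1.0, 1.0), "T1": (-5.0, 2.0)}
attended = nearest_object((0.0, 0.0), scene)
print(attended)  # the distracter D1 is now closer than the intended T2
```

With a smaller VSTM error (for example, a stored location of (3.0, 0.5)) the same rule would select the intended target T2.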

Here we outline the general architecture of MOMIT. A graphic description of the architecture of MOMIT is shown in Fig. 1. The model provides a functional description of the process of how what and where information are linked together and maintained in a dynamic visual environment. The model’s architecture consists of five components: (1) a component responsible for the analysis of what information, (2) a component responsible for the analysis of where information, (3) a temporary memory buffer (VSTM) for maintaining indexed location information, (4) a control system for attention switching, and (5) a temporary episodic buffer for maintaining what–where bindings. The model is designed to simulate a situation where multiple, constantly moving objects are tracked in a visual scene (possibly, but not necessarily, in the presence of distracters). Situations where objects are static or where only one object is moving are special cases of the kind of visual environment we are interested in here. In the following, we describe the different components of the model.

  • (1)

    Analysis of ‘what’. This component provides access to the semantic identity of a focally attended object. It is assumed that identity information is activated early and independently of the ‘where’ analysis. Identity information is transmitted to the episodic buffer (Component 5), where temporary bindings are formed and maintained. The model assumes that the faster identity information is accessed for the attended object, the more readily a binding will be formed and, consequently, the more efficiently multiple bindings will be maintained (i.e., the refresh rate is fast and the probability of losing previous bindings is small). Thus, factors that influence the speed of access to identity information will influence the efficiency of maintaining multiple bindings. One such factor is the familiarity of the identity information stored in LTM. Familiar objects will be identified faster and consequently tracked more efficiently. A second factor by which LTM may influence the binding process is related to discriminability among target objects (not tested in the present study). It may be presumed that low discriminability will slow down the what-analysis and thus hamper the dynamic maintenance of multiple bindings (cf. Chun and Potter, 1995; Duncan and Humphreys, 1989).

  • (2)

    Analysis of ‘where’. This component yields parallel location information for those moving objects (both targets and distracters) that are positioned within an area constrained by peripheral vision. This spatial information is transmitted to Component 4, which is responsible for serial shifting of attention. It is also transmitted to the episodic buffer (Component 5), where the target-relevant location information is used in creating or updating an identity-location binding for the currently attended target. The resulting indexed location information is then temporarily stored in VSTM (Component 3). Obviously, all factors that impoverish the peripheral perception of moving objects (e.g., occlusions) hamper the workings of this parallel where system.

  • (3)

    Temporary memory buffer for indexed location information. The core assumption here is that in order to maintain multiple bindings, temporary storage of the targets’ former locations is needed. This information is used when programming attention shifts between the tracked objects. Thus, MOMIT assumes that there is a component that saves temporal location information for the tracked objects when they are focally attended. This storage function is presumed to be carried out by VSTM. Because the target location information is updated only when targets are focally attended, with constantly moving targets this indexed location information is not completely accurate (cf. the VSTM coordinate error in the mathematical formulation of MOMIT presented below). This short-term memory component plays a central role in the predictions derived from MOMIT. The more accurate the location information is in the buffer, the more accurately attention will be shifted between the to-be-tracked targets. In dynamic visual environments, the accuracy of indexed location information is modulated by target speed and set-size, as discussed above. Apart from the indexed locations becoming inaccurate as a function of time, they are also vulnerable to endogenous uncertainty (Moray, 1984), such as forgetting and interference.

  • (4)

    Control system for attention switching. Any model that makes recourse to serial attention switching requires an attention control mechanism. This is because during tracking a decision of which target to select next is frequently made. A random selection does not work, as non-optimal selection results in inefficient maintenance of multiple identities. For example, by attending a recently refreshed target another target is put into danger (i.e., the binding may be lost). It is assumed that this control system receives input from two sources, from VSTM (an endogenous component) and from the parallel where system (an exogenous component), and the object to be attended is determined by the joint interplay between these two information sources. VSTM provides a rough spatial location of the to-be-attended target and the parallel component determines the specific object to be focally attended next, which is the closest object around the area determined by VSTM.

    The above argumentation leads to a prediction that attentional resources available for time-sharing and smooth serial allocation of attention between items-to-be-tracked contribute to the overall tracking capacity and to the effectiveness with which bindings are maintained in an active state. Individual differences in the executive capacity (an ability to serially allocate attention to multiple tasks) constitute one such factor as demonstrated by Oksama and Hyönä (2004), experience with the task another, as demonstrated by Allen, McGeorge, Pearson, and Milne (2004). MOMIT also predicts that when MIT is performed concurrently with another (non-visual) task that also calls for executive resources, the MIT performance will be hampered.

  • (5)

    Temporary episodic buffer for what–where bindings. MOMIT assumes that temporary episodic memory representations of the formed bindings are constructed and maintained during tracking. The obvious function of these representations is to retain bindings for a short period of time to overcome temporary disappearance of visual input (e.g., during saccades, blinks, or occlusions). These representations may then be consulted if needed (‘the lion is behind the tree’). As argued above, the episodic buffer holds four bindings on average.

Based on the principles described above, we next provide a functional description of the mechanisms assumed to be responsible for maintaining multiple dynamic identity-location bindings. The starting point in the description is that the system tries to maintain three moving targets: T1, T2, and T3.

  • 1.

    Target T1 is focally attended and a semantic identity-location binding is formed for it (or updated in case a binding has already been created during a previous cycle) in the episodic buffer. The present location of Target T1 is stored in VSTM.

  • 2.

    The next target-to-be-attended is endogenously selected among the alternatives (T2 or T3). Target T2 is selected on the basis of the activation level of the bindings (the binding with the lowest activation has the highest priority).

  • 3.

    When attention is disengaged from Target T1, focal attention is shifted to an exogenously determined object in the vicinity of the endogenously selected target (i.e., to an object nearest to the endogenously selected target location).

  • 4a.

    If the new attended object is the intended one (T2 in this case), then an identity-location binding is (re)constructed for T2 (see Cycle 1), and in Cycle 2 T3 is selected as the next to-be-attended moving object. Cycles 1–4 are repeated as long as necessary. Or,

  • 4b.

    if the attended object is not the intended one, a corrective attention shift is carried out from the wrong object to a new one located in the vicinity. When the right one is found, the process is analogous to what is described in (4a).
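The steps above can be sketched as a single refresh cycle over a frozen snapshot of the scene. This is a minimal illustration under stated assumptions, not the authors' implementation: activation values, the reset value of 1.0, the `max_shifts` limit, and the rule that an already inspected object is not revisited during the corrective search are all hypothetical, and decay of activation and object movement between cycles are left out.

```python
import math


def refresh_cycle(positions, vstm, activation, targets, max_shifts=10):
    """Run one refresh cycle and return (refreshed target, shifts used).

    positions  : label -> current (x, y) of every object in view
    vstm       : target label -> last stored (x, y)
    activation : target label -> activation of its binding
    """
    # Step 2: the binding with the lowest activation has the highest priority.
    intended = min(targets, key=lambda t: activation[t])
    candidates = dict(positions)
    here = vstm[intended]  # endogenously selected (possibly outdated) location
    shifts = 0
    while candidates and shifts < max_shifts:
        # Step 3: attend the object nearest to the selected location.
        attended = min(candidates, key=lambda o: math.dist(here, candidates[o]))
        shifts += 1
        if attended == intended:
            # Steps 4a and 1: rebind the identity and store the present location.
            vstm[intended] = positions[intended]
            activation[intended] = 1.0  # assumed reset value
            return intended, shifts
        # Step 4b: corrective shift from the wrong object to one in its vicinity.
        here = candidates.pop(attended)
    return None, shifts


# Hypothetical scene with three targets and one distracter.
positions = {"T1": (0.0, 0.0), "T2": (4.0, 0.0), "T3": (0.0, 6.0), "D1": (2.5, 0.5)}
vstm = {"T1": (0.0, 0.0), "T2": (2.0, 0.0), "T3": (0.0, 5.0)}
activation = {"T1": 1.0, "T2": 0.3, "T3": 0.6}
result = refresh_cycle(positions, vstm, activation, ["T1", "T2", "T3"])
print(result)
```

In this example T2 has the lowest activation and is selected first; the first shift lands on the distracter D1, which lies closer to the stored location, and one corrective shift then reaches T2. Calling the function again would select T3, whose binding is now the least active.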

In order to test MOMIT’s goodness of fit, we quantitatively fitted the mathematically formalized MOMIT to the data (see Section 8). In addition, we also derived MOMIT’s specific predictions as regards main effects and interactions in the analyses of variance (see Appendix A for a mathematical proof of the predictions and Appendix B for a detailed comparison of the observed and predicted effects). In short, MOMIT predicts that maintenance of dynamic bindings deteriorates as a function of target set-size, speed, and familiarity. It also predicts that all interactions between these three factors are significant. These predictions arise from the model’s serial architecture. As bindings have to be refreshed serially, the effects of these factors that influence the time it takes to refresh target bindings add up in a multiplicative fashion—hence the predictions for interactions.

To study dynamic identity tracking, i.e., the observer’s awareness of the location and identity of visual elements at any given time, we developed two multiple identity tracking (MIT) tasks. In the first one (used in Experiments 1 and 3A), several non-identical elements move around for a while, after which movement is stopped and elements are immediately masked (to eliminate iconic memory), one element is probed, and the participant is asked to identify the probed element (in the variant used in Experiment 2A objects continue to move also when masked and probed). In order to minimize verbalization and memory requirements during response selection, the probed item is chosen among all the moving objects that are presented in a separate response screen. We call this the partial report probe recognition (PRPR) task (it was also used by Oksama & Hyönä, 2004).

The second variant of MIT makes use of the change detection paradigm (e.g., Rensink et al., 1997; Simons and Levin, 1998). In this variant (used in Experiments 2B and 3B), the movement phase is followed by a short flicker, during which the moving objects are erased and the participants see only empty framed squares moving (each object is surrounded by a frame throughout the trial). When the objects reappear in the empty frames, they either appear in their original frames or, half of the time, two of the targets swap positions. The participant’s task is to respond whether or not a change took place (i.e., to respond either ‘yes’ or ‘no’). Thus, for the response, semantic identity information does not necessarily need to be consulted (for another variant of the change detection procedure in multiple object tracking, see Bahrami, 2003).

Notice that distracter items are not needed in MIT because the tracked objects are visually distinct from each other. Notice also that in order to be able to respond accurately to the masked probes, some kind of temporary, short-lived (but not iconic) memory representation of the what–where binding must be consulted. If there is no memory trace for the binding, the accuracy in responding to the masked probes is at chance level. Thus, in a sense we test whether an ‘occlusion-tolerant’ memory representation is created during dynamic identity tracking (the issue of temporary memory representations is discussed in more detail in Section 9).

Section snippets

Experiment 1

Effects of set-size, object type, and object speed on dynamic identity tracking were studied in Experiment 1. By including all these factors in the same experimental design it was possible to examine the interactions between the factors and thus test the predictions of MOMIT put forth in Section 1. Set-size was varied from 2 to 6. Three speed conditions were created that are called slow, medium, and fast. To study possible effects of semantic identity in MIT we compared highly familiar objects

Experiment 2A

In the PRPR technique used in Experiment 1, the movement of the elements stops at the time when objects are masked and one of the designated targets is probed by flashing a frame around it. It is possible that some specific phenomenon or a strategy related to static locations may emerge at this stage (e.g., binding is easier to maintain because there is no spatiotemporal processing cost). To eliminate this possibility we developed another variant of the probe recognition task for Experiment 2A.

Experiment 2B

In the PRPR task used in Experiments 1 and 2A, the participants have to keep active in short-term memory the identity of the probed object during the response selection stage in order to be able to respond accurately to the probe. While our partial report recognition procedure was designed to minimize memory requirements, it may still be argued that familiar stimuli may be easier than unfamiliar stimuli to keep active in short-term memory during this short response selection stage (the set-size and

Experiment 3A

In Experiment 3A, familiar faces and ‘pseudo-faces’ were used as stimuli. Pseudo-faces were created from familiar faces by deconstructing them and then rearranging and recombining the pieces. The face-like appearance of the pseudo-faces was preserved (pseudo-faces have eyes, nose, hair, etc.). Nevertheless, they have an odd appearance, which is why we named them ‘frankensteins’. Crucially in the present context, the pseudo-faces lack a known identity (analogously to

Experiment 3B

In Experiment 3B, tracking of familiar faces and pseudo-faces was examined using the change detection paradigm introduced in Experiment 2B. The facial stimuli were the same as those used in Experiment 3A.

A mathematical formulation of MOMIT

In the following we provide a mathematical formulation of MOMIT. The aim is to model performance accuracy of dynamic identity-location binding as a function of target set-size, object speed, and object familiarity. The mathematical model consists of three components: binding capacity, probability of guessing, and dynamic processing cost. We first describe each component separately, after which the three components are integrated into a single formula that allows us to fit the model to the

Model fitting

We fit MOMIT to the data of Experiment 1 using Eq. (6), which includes 7 parameters, two of which, s and m, are free parameters. The parameters that were not free were fixed on the basis of empirical data (a, x) or were derived from the structure of the experiment (n, v, p). Parameters a and x determine the probability function Pce in relation to visual acuity. Parameter a determines the curvature of the function, while parameter x determines the location where Pce reaches 0. For a, we aimed for a
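The fitting logic described above (two free parameters, s and m, estimated while the remaining parameters stay fixed) can be illustrated with a minimal sketch. Eq. (6) is not reproduced in this excerpt, so the function `stand_in_accuracy` below is a hypothetical stand-in that merely combines a capacity term, a speed-dependent error term that reaches 0 at x with curvature a, and a guessing term; it should not be read as the published formula. The grid values, the fixed values for p, a, and x, and the data are likewise assumptions: the "observations" are synthetic values generated from the stand-in itself, used only to show that the search recovers known parameters. Only the set-size range (2 to 6) and the use of three speed levels follow the description of Experiment 1.

```python
import itertools


def stand_in_accuracy(n, v, s, m, p, a, x):
    """Hypothetical stand-in for Eq. (6); NOT the published formula."""
    bound = min(1.0, m / n)                   # share of targets held as bindings
    drift = v * s * (n - 1)                   # assumed movement while other targets are refreshed
    p_ce = max(0.0, 1.0 - (drift / x) ** a)   # falls to 0 once drift reaches x
    tracked = bound * p_ce
    return tracked + (1.0 - tracked) * p      # untracked targets are guessed with probability p


def fit_free_parameters(observations, s_grid, m_grid, p, a, x):
    """Least-squares grid search over the two free parameters s and m."""
    best = None
    for s, m in itertools.product(s_grid, m_grid):
        sse = sum(
            (acc - stand_in_accuracy(n, v, s, m, p, a, x)) ** 2
            for n, v, acc in observations
        )
        if best is None or sse < best[0]:
            best = (sse, s, m)
    return best


# Fixed parameters (arbitrary illustrative values, not the paper's estimates).
P, A, X = 0.5, 2.0, 4.0
# "True" free parameters used only to generate synthetic data.
TRUE_S, TRUE_M = 0.1, 3.0

# Set-sizes 2 to 6 crossed with three speed levels (speed values are assumed).
conditions = [(n, v) for n in range(2, 7) for v in (2.0, 4.0, 6.0)]
synthetic = [
    (n, v, stand_in_accuracy(n, v, TRUE_S, TRUE_M, P, A, X)) for n, v in conditions
]

s_grid = [0.05, 0.1, 0.15, 0.2]
m_grid = [2.0, 3.0, 4.0, 5.0]
sse, s_hat, m_hat = fit_free_parameters(synthetic, s_grid, m_grid, P, A, X)
print(s_hat, m_hat, sse)
```

A grid search is used here only because it is transparent with two free parameters; the excerpt does not state which optimization procedure the authors used.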

General discussion

The present study investigated the observer’s ability to track and maintain multiple identities in a dynamic visual environment. In five experiments, we examined effects of target set-size, semantic identity information, and object speed on performance accuracy. Two experimental techniques (partial report probe recognition and change detection) were employed to study tracking of familiar objects, pseudo-objects, familiar faces, and pseudo-faces. The following findings were obtained: (1) a

Conclusions

In the present article, we have proposed a functional and formal model for the maintenance of multiple bindings in a dynamic visual environment. The proposed serial MOMIT model can account for most of the results reported in the present study; in addition, it is also consistent with the results of Oksama and Hyönä (2004). It is sufficiently detailed to generate empirically testable predictions (some of which are mentioned above). It provides a theoretical alternative to parallel models popular

Acknowledgment

The completion of this study was made possible by a grant provided by the Finnish Scientific Advisory Board of Defence to the first author. The second author acknowledges the support of Suomen Akatemia (the Academy of Finland). We are grateful to Gordon Logan and three anonymous referees for their highly useful comments on a previous version of this article. We also thank Juhani Sinivuo, Marja-Leena Haavisto and Krista Oinonen, who supported the work in many ways, and Maija Seppänen and Jari
