Taming the beast: extracting generalizable knowledge from computational models of cognition

https://doi.org/10.1016/j.cobeha.2016.04.003

Highlights

  • Computational models extract general principles from specific data.

  • However, such principles are conditional on modeling assumptions.

  • Faulty assumptions can bias findings and mislead data interpretation.

  • The modeler's toolkit includes several techniques to avoid this pitfall.

  • These techniques are critical to broadly interpret computational findings.

Generalizing knowledge from experimental data requires constructing theories capable of explaining observations and extending beyond them. Computational modeling offers formal quantitative methods for generating and testing theories of cognition and neural processing. These techniques can be used to extract general principles from specific experimental measurements, but they introduce dangers inherent to theory: model-based analyses are conditioned on a set of fixed assumptions that affect the interpretation of experimental data. When those assumptions do not hold, model-based results can be misleading or biased. Recent work in computational modeling has highlighted the implications of this problem and developed new methods for minimizing its negative impact. Here we discuss the issues that arise when data are interpreted through models and describe strategies for avoiding misinterpretation of data through model fitting.

Introduction

Behavioral and physiological data in systems and cognitive neuroscience are generally collected in reduced environments and constrained experimental conditions, often designed to be diagnostic of competing theories. The generalization of knowledge from simple experiments is crucial to advance our broader understanding of brain and behavior. However, interpreting data according to existing theories conditions our knowledge on the quality of those theories and their underlying assumptions. In many cases these conditions are met, and theory can drive scientific progress by reducing a dazzling array of neuroscientific data into simpler terms, yielding falsifiable predictions. In other cases a general overarching theory can lead to wasted resources and, at worst, impede scientific progress.

Both the advantages and potential dangers of theory are amplified for computational theories, which provide highly explicit predictions under a specific set of assumptions. Such theories offer an advantage over more abstract ones in that they make predictions about behavior or neurophysiology that are testable, falsifiable, and comparable across models. They do so by formalizing the fundamental definition of the model and linking it to experimental data through a set of assumptions (e.g., the particular form of the behavioral or neural likelihood distribution, conditional independencies in choice behavior, parameter stationarity, etc.). These assumptions can affect how we interpret evidence for or against a model, how we explain differences in the behavior of individuals, or how we ascribe functional roles to biological systems based on physiological measurements. Here we examine the theoretical and practical consequences of these assumptions, paying special attention to the factors that determine whether a given assumption will give rise to spurious interpretations of experimental data. In particular, we highlight and evaluate methods used to minimize the impact of modeling assumptions on interpretations of experimental results.

While we appreciate that assumptions implicit in experimental designs are critical for data interpretation, and that inappropriate model selection or fitting techniques can produce misleading results, here we focus specifically on issues that can arise when computational models are appropriately and quantitatively linked to meaningful empirical data [1•, 2, 3, 4] (see Box 1 for relevant definitions). Under such conditions, quantitative model fitting offers at least three key advantages: first, competing models can be compared and ranked according to their abilities to fit empirical data; second, differences across individuals or task conditions can be assessed through model parameter estimates, suggesting potential mechanisms that give rise to those differences; and third, neural computation can be evaluated by comparing physiological measurements to latent model variables. In this section we discuss recent work that highlights how each of these potential advantages can be negated by unmet assumptions.

The impact of modeling assumptions on the arbitration of competing models was recently highlighted in a factorial comparison of visual working memory models [5]. The study decomposed working memory into three distinct processes (item limitation, precision attribution, and feature binding) to construct a large model set through exhaustive combination of possible computational instantiations of each process. The model set contained archetypical models such as competing ‘slots’ and ‘resource’ models of working memory capacity limitations, but also contained novel combinations that had not previously been tested. This combinatorial model set was fit to data from 10 previously published studies that had come to different conclusions regarding the nature of capacity limitations (roughly half had concluded in favor of discrete slots [6, 7], whereas the other half had concluded in favor of continuous resource limitation [8, 9]). Despite the fact that the original studies had arrived at contradictory conclusions, rank-ordered fits from the combinatorial model set were remarkably consistent across datasets. The contradictory conclusions of the original studies were possible because each study compared only a small subset of models by making fixed assumptions (which differed across studies) regarding untested model processes, allowing essentially the same data to be used to support competing and mutually exclusive psychological theories.
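To make the combinatorial logic concrete, the sketch below enumerates a factorial model space by crossing hypothetical instantiations of three working memory processes. The option labels, their number, and the fit criterion mentioned in the comments are placeholder assumptions for illustration, not the actual components or procedures of the study in [5].

```python
from itertools import product

# Hypothetical options for each working memory process; in a real
# factorial comparison each label would correspond to a concrete
# likelihood function rather than a string.
ITEM_LIMIT = ["none", "fixed_slots", "variable_capacity"]
PRECISION = ["equal_precision", "resource_split", "stochastic_precision"]
BINDING = ["intact", "swap_errors"]

# Cross the instantiations to enumerate every candidate model.
model_set = [
    {"item_limit": il, "precision": pr, "binding": bi}
    for il, pr, bi in product(ITEM_LIMIT, PRECISION, BINDING)
]
print(f"{len(model_set)} candidate models")  # 3 x 3 x 2 = 18

# Each specification would then be fit to every dataset and the models
# ranked by a common fit criterion (e.g., AIC or BIC).
```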

Similar issues can arise when estimating parameters within a single model. Computational modeling provides a powerful framework for inferring individual differences in latent parameters governing behavior, such as the learning rate used to incorporate new information in supervised or reinforcement learning tasks [10, 11, 12, 13]. However, fairly subtle aspects of model specification can have major effects on this estimation process. One recent computational study showed that the common and seemingly innocuous assumption that learning rate is fixed over time can have drastic consequences for interpretations of behavioral differences: failure to model adjustments in learning rate led to the spurious conclusion that faster learners were best fit by lower learning rates [14, 15•]. That is to say, at least in some extreme cases, naïve reliance on parameter fits can give rise to an interpretation that is exactly opposite to the truth. A similar phenomenon has been noted in computational studies of reinforcement learning and working memory contributions to behavioral learning: failure to appropriately model the working memory process can lead to spurious identification of deficits in the reinforcement learning system [16, 17].

The impact that inappropriate assumptions can have on model selection and parameter estimation in turn corrupts the latent variables that are used to test theories of neural computation. Without valid computational targets, analyses of physiological measurements such as fMRI BOLD or EEG are more likely to yield null results or, worse, provide misleading information regarding the computational origins of behavior [18, 19]. Or, more simply put, if we do not have a clear understanding of the computations that govern behavior, what hope do we have of discovering the neural instantiations of those computations?

So how can computational models be used to generalize knowledge across cognitive tasks, contexts and species without falling prey to the risks described above? A prominent notion in statistics is that robust inference can be achieved by iterating between model estimation and model criticism [20]. In formal terms, the estimation step involves estimating parameters that allow for the best description of behavior given a candidate model, and the criticism step involves estimating the probability of all possible datasets that could have been observed given that model parameterization [20, 21]. This type of criticism is referred to as predictive checking and allows the modeler to ask whether the empirical data that were observed were ‘typical’ for a given model. If the observed data are atypical for the fit model, then model revisions are necessary.

In practice, researchers are generally focused on particular meaningful features of the data motivated by the experimental design, and the typicality of data is often assessed through analyses designed to probe these key features. Specifically, parameters are estimated through model fitting, and the parameterized models are then used to simulate data that are subjected to the same descriptive analyses originally applied to the empirical data (e.g., a learning study might be concerned with learning curves and/or asymptotic accuracy in one condition compared to another). This approach depends critically on the precision with which behavior can be characterized through these descriptive analyses: the more precisely a behavior of interest can be quantified through a descriptive analysis, the more diagnostic it will be of model sufficiency [15]. In some cases, the ability of simulated data to reproduce basic properties of the original dataset (e.g., conditional accuracy and reaction time) can provide rich information regarding why a given model fits the data better than other candidates [16, 22]. In other cases, the failure to adequately describe these key features, or some aspect of them, can reveal an inappropriate assumption or a missing component of the empirical data (see Figure 1). For example, distributional analyses of response times can reveal when a model captures empirical data and where it may miss (e.g., the tail or leading edge of the distribution and how they do or do not differ between task conditions) [23, 24], patterns that can be revealed via posterior predictive checks [25, 26]. In reward learning tasks, sequential choice behavior can be described as a linear function of outcome and reward history, which allows validation of (or reveals deviations from) the specific patterns of data expected by basic reinforcement learning models [27•, 28•]. This type of analysis was recently extended to directly test a model of how rewards can be misattributed in lesioned monkeys, in a manner that was relatively insensitive to modeling assumptions [29].
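As a minimal sketch of this estimation-and-criticism cycle, the code below fits a basic delta-rule learner with softmax choice to synthetic two-armed bandit data and then simulates from the fitted parameters to check whether the model reproduces the learning curve. The task structure, parameter values, trial counts, and bin width are illustrative assumptions rather than details of any study cited above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N_TRIALS, P_REWARD = 200, (0.8, 0.2)   # two-armed bandit with fixed reward probabilities


def simulate(alpha, beta, n_trials=N_TRIALS):
    """Generate choices and rewards from a delta-rule learner with softmax choice."""
    q = np.zeros(2)
    choices, rewards = np.zeros(n_trials, int), np.zeros(n_trials)
    for t in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))   # P(choose option 1)
        c = int(rng.random() < p1)
        r = float(rng.random() < P_REWARD[c])
        q[c] += alpha * (r - q[c])                          # delta-rule update
        choices[t], rewards[t] = c, r
    return choices, rewards


def neg_log_lik(params, choices, rewards):
    """Negative log-likelihood of observed choices under the same model."""
    alpha, beta = params
    q, nll = np.zeros(2), 0.0
    for c, r in zip(choices, rewards):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        p_c = p1 if c == 1 else 1.0 - p1
        nll -= np.log(max(p_c, 1e-12))
        q[c] += alpha * (r - q[c])
    return nll


# 1) Estimation: fit the model to (here, synthetic) behavioral data.
choices, rewards = simulate(alpha=0.15, beta=4.0)
fit = minimize(neg_log_lik, x0=[0.5, 1.0], args=(choices, rewards),
               bounds=[(0.001, 1.0), (0.01, 20.0)])
alpha_hat, beta_hat = fit.x

# 2) Criticism: simulate from the fitted model and check whether it reproduces
#    the descriptive feature of interest (choices of the better option, option 0,
#    across trial bins, i.e., the learning curve).
sim_curve = np.mean(
    [np.equal(simulate(alpha_hat, beta_hat)[0], 0) for _ in range(500)], axis=0)
obs_curve = np.equal(choices, 0).astype(float)
for b in range(0, N_TRIALS, 20):
    print(f"trials {b:3d}-{b + 19:3d}: observed {obs_curve[b:b + 20].mean():.2f}  "
          f"model {sim_curve[b:b + 20].mean():.2f}")
```

In a real application the observed choices would come from participants, and a systematic mismatch between observed and simulated bin means would motivate model revision.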

A related strategy for evaluating model sufficiency is through diagnostic techniques based on predicted likelihood functions. There is a rich statistical literature on the problems that can arise when residuals are not distributed according to the expected likelihood function [30]. Cognitive computational models can fall prey to similar issues: non-uniformity or heteroscedasticity of residuals can inflate some measures of goodness-of-fit and give undue influence to particular data points, leading to increased variance or even bias in parameter estimates [31, 32]. In some cases, differences between the model-predicted and actual likelihood functions can be observed directly by computing residuals, and systematic mismatches between the two can be corrected through changes to the existing model [33]. The appropriate likelihood function is particularly important if one considers the possibility that some small proportion of trials is generated by processes other than the model in question. For example, in value-based decision making, the commonly used softmax choice function assumes that choices are increasingly noisy when the differences between alternative choice values are small, and more deterministic when these value differences grow. There is ample evidence for this type of choice function [34], as opposed to alternative epsilon-greedy choice functions in which the level of choice stochasticity is uniform across value differences. But several studies have shown that choice likelihood can deviate substantially from either of these functions (for example, as a result of just a few choices driven by a different process altogether, such as attentional lapses), and the failure to take this irreducible noise into account can over-emphasize or under-emphasize particular data points and potentially bias results (see Figure 2) [17, 35, 36, 37].
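As one hedged illustration of how such irreducible noise can be built into the likelihood, the sketch below mixes a softmax choice rule with a small uniform lapse component; the parameter values are arbitrary and chosen only to show how a single anomalous choice dominates the log-likelihood under pure softmax but contributes only modestly once a lapse term is included.

```python
import numpy as np


def choice_prob(value_diff, beta, lapse=0.0):
    """P(choose option 1) given value difference q1 - q0.

    With probability `lapse` the choice is uniform (a process outside the
    model, e.g., an attentional lapse); otherwise it follows the softmax.
    """
    p_softmax = 1.0 / (1.0 + np.exp(-beta * value_diff))
    return lapse * 0.5 + (1.0 - lapse) * p_softmax


# A single 'wrong' choice made when the value difference is large is nearly
# impossible under pure softmax but only mildly surprising under the mixture.
value_diff, beta = 5.0, 3.0
for lapse in (0.0, 0.02):
    p_wrong = 1.0 - choice_prob(value_diff, beta, lapse)
    print(f"lapse={lapse:.2f}: P(anomalous choice)={p_wrong:.2e}, "
          f"log-likelihood contribution={np.log(p_wrong):.1f}")
```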

So if a model can simulate the key descriptive findings and links computational variables to experimental data through a suitable likelihood function, can concerns regarding untested assumptions be set aside? Not necessarily. While these are important criteria for evaluating inferences drawn from a computational theory, they are not typically exclusive to a single model. Thus the question always remains: could the data be better explained through a different set of mechanisms under a different set of assumptions? While this problem is endemic to all of science and not just computational models, one approach to answering this question is to explicitly validate the robustness of model-based findings across a broad range of assumptions. Typically, such validations are conducted using parameters that were originally fixed for simplicity without strong a priori rationale [31, 38]. A related strategy for assessing the impact of faulty assumptions is to simulate data from models that make a range of assumptions and attempt to recover information from these models with a separate set of models containing mismatched assumptions. In some cases this strategy has revealed problems, such as with the interpretability of softmax temperature in learning models [15], but in other cases it has highlighted the robustness of specific model-based strategies, such as the estimation of fMRI prediction error signals using reinforcement learning models under certain conditions [39].
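A generate-and-recover exercise of the kind described here might look like the following sketch: synthetic choices are produced by a delta-rule learner whose behavior includes occasional lapses, and the data are then fit by models that either share or omit that assumption, so the distortion of the recovered softmax temperature can be inspected directly. The specific models, parameter values, and trial counts are illustrative assumptions, not a reanalysis of the studies cited above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
P_REWARD = (0.8, 0.2)   # two-armed bandit with fixed reward probabilities


def simulate(alpha, beta, lapse, n_trials=300):
    """Generate (choice, reward) pairs from a delta-rule learner with lapses."""
    q, data = np.zeros(2), []
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        p1 = lapse * 0.5 + (1.0 - lapse) * p1          # occasional uniform lapses
        c = int(rng.random() < p1)
        r = float(rng.random() < P_REWARD[c])
        q[c] += alpha * (r - q[c])
        data.append((c, r))
    return data


def neg_log_lik(params, data, lapse):
    """Negative log-likelihood under an assumed (possibly zero) lapse rate."""
    alpha, beta = params
    q, nll = np.zeros(2), 0.0
    for c, r in data:
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        p1 = lapse * 0.5 + (1.0 - lapse) * p1
        nll -= np.log(max(p1 if c == 1 else 1.0 - p1, 1e-12))
        q[c] += alpha * (r - q[c])
    return nll


true_alpha, true_beta, true_lapse = 0.2, 6.0, 0.05
betas_no_lapse, betas_lapse = [], []
for _ in range(20):                                    # 20 synthetic subjects
    data = simulate(true_alpha, true_beta, true_lapse)
    for lapse_assumed, store in ((0.0, betas_no_lapse), (true_lapse, betas_lapse)):
        fit = minimize(neg_log_lik, x0=[0.5, 2.0], args=(data, lapse_assumed),
                       bounds=[(0.001, 1.0), (0.01, 50.0)])
        store.append(fit.x[1])

print(f"generative beta = {true_beta}")
print(f"recovered beta, model ignoring lapses: {np.mean(betas_no_lapse):.1f}")
print(f"recovered beta, lapse-aware model:     {np.mean(betas_lapse):.1f}")
```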

An alternative to explicitly testing the assumptions through which data are linked to a model is to derive and test axioms that should hold irrespective of those assumptions [40]. Just as Euclidean geometry postulates that all triangles have interior angles summing to 180°, the equations defining a computational model can often be rearranged to identify sets of equalities or inequalities that the entire class of models must obey. One notable example of this strategy comes from economics, where choice consistency, established as a fundamental prediction of utility maximization theory, was pivotal for both the falsification of the theory and the subsequent development of better behavioral models [41, 42, 43]. Recently, the same approach has been used to test the suitability of reward prediction error models for describing fMRI, electrophysiology, and voltammetry signals [44, 45, 46, 47]. While axioms are not mathematically tractable for all models, the basic approach can be followed by identifying testable predictions or boundary conditions through model simulation [48, 49]. On its face, the axiomatic approach seems to differ in philosophy from those discussed above: quantitative model fitting promotes inductive reasoning based on reverse inference across a fixed model space, whereas the axiomatic approach lends itself to rejection and revision of failed explanations.
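In the same spirit, an axiom-style check need not involve model fitting at all. The sketch below asks whether a hypothetical condition-averaged signal satisfies two ordinal properties expected of a reward prediction error (increasing in obtained reward at fixed expectation, decreasing in expectation at fixed reward); the matrix values are invented for illustration and the checks are deliberately simplified relative to the formal axiomatic treatments cited above.

```python
import numpy as np

# Hypothetical condition means of a measured signal (e.g., averaged BOLD or
# spiking): rows index expected value (low -> high), columns index obtained
# reward (low -> high). These numbers are invented for illustration.
signal = np.array([
    [-0.1, 0.6, 1.2],    # low expectation
    [-0.5, 0.1, 0.8],    # medium expectation
    [-0.9, -0.3, 0.2],   # high expectation
])

# Ordinal checks: the signal should increase across columns (reward) within
# each row, and decrease down rows (expectation) within each column.
increases_with_reward = bool(np.all(np.diff(signal, axis=1) > 0))
decreases_with_expectation = bool(np.all(np.diff(signal, axis=0) < 0))

print("consistent with an RPE interpretation:",
      increases_with_reward and decreases_with_expectation)
```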

However, this characterization of standard modeling approaches in cognitive science misses the concept of criticism. Fitting assumption-laden computational models allows us to induce knowledge regarding the general structure of our world from specific examples, but rigorous criticism ensures that the knowledge we gain in this way will generalize outside of our model set and experimental space [20, 21, 50]. Falsification of specific model predictions guides model revisions that make theories more robust and reduces the likelihood of misleading interpretations of experimental results. Through this lens, axiomatic methods can be considered a specific form of criticism tailored to the core assertions of a computational model. It has been noted that the process of careful model criticism can be thought of as obeying an epistemological theory of hypothetico-deductivism, whereby information is gained by rejecting unlikely models rather than by producing support for more likely ones [51].

While we essentially agree with this perspective, we believe that well specified computational theories and the models that instantiate them provide the best of both worlds in terms of philosophy of science: first, the ability to induce general knowledge from specific data points within a constrained modeling space [52] and second, the ability to test, reject, and improve upon existing models through a deductive hypothesis testing approach [53]. A balance of these two approaches should allow steady scientific progress through inductive reasoning kept in check by a commitment to falsification of invalid or inappropriate assumptions.

Conflict of interest

M.J.F. is a consultant for Hoffman La Roche pharmaceuticals using computational psychiatry methods.

References

  • W. Zhang et al., Discrete fixed-resolution representations in visual working memory, Nature (2008).

  • S.J. Luck et al., Visual working memory capacity: from psychophysics and neurobiology to individual differences, Trends Cogn Sci (2013).

  • P.M. Bays et al., Dynamic shifts of limited working memory resources in human vision, Science (2008).

  • T.V. Wiecki et al., Model-based cognitive neuroscience approaches to computational psychiatry: clustering and classification, Clin Psychol Sci (2015).

  • R.B. Mars et al., Model-based analyses: promises, pitfalls, and example applications to the study of cognitive control, Q J Exp Psychol (2012).

  • T.E.J. Behrens et al., Learning the value of information in an uncertain world, Nat Neurosci (2007).

  • M.X. Cohen, Individual differences and the neural representations of reward expectation and reward prediction error, Soc Cogn Affect Neurosci (2006).

  • M.R. Nassar et al., An approximately Bayesian delta-rule model explains the dynamics of belief updating in a changing environment, J Neurosci (2010).

  • M.R. Nassar et al., A healthy fear of the unknown: perspectives on the interpretation of parameter fits from computational models in neuroscience, PLoS Comput Biol (2013).

  • A.G.E. Collins et al., How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis, Eur J Neurosci (2012).

  • A.G.E. Collins et al., Working memory contributions to reinforcement learning impairments in schizophrenia, J Neurosci (2014).

  • J.P.A. Ioannidis, Why most published research findings are false, PLoS Med (2005).

  • A. Gelman et al., Beyond power calculations: assessing type S (sign) and type M (magnitude) errors, Perspect Psychol Sci (2014).

  • G. Box, Sampling and Bayes' inference in scientific modelling and robustness, J R Stat Soc Ser A (Gen) (1980).

  • A. Gelman et al., Posterior predictive assessment of model fitness via realized discrepancies, Stat Sin (1996).

  • L. Ding et al., Separate, causal roles of the caudate in saccadic choice and execution in a perceptual decision task, Neuron (2012).

  • R. Ratcliff et al., Reinforcement-based decision making in corticostriatal circuits: mutual constraints by neurocomputational and diffusion models, Neural Comput (2012).