How Bayes factors change scientific practice

https://doi.org/10.1016/j.jmp.2015.10.003

Highlights

  • Bayes factors would help science deal with the credibility crisis.

  • Bayes factors retain their meaning regardless of optional stopping.

  • Bayes factors retain their meaning despite other tests being conducted.

  • Bayes factors retain their meaning regardless of time of analysis.

  • The logic of Bayes helps illuminate the benefits of pre-registration.

Abstract

Bayes factors provide a symmetrical measure of evidence for one model versus another (e.g. H1 versus H0) in order to relate theory to data. These properties help solve some (but not all) of the problems underlying the credibility crisis in psychology. The symmetry of the measure of evidence means that there can be evidence for H0 just as much as for H1; or the Bayes factor may indicate insufficient evidence either way. P-values cannot make this three-way distinction. Thus, Bayes factors indicate when the data count against a theory (and when they count for nothing); and thus they indicate when replications actually support H0 or H1 (in ways that power cannot). There is as much reason to publish evidence supporting the null as evidence going against it, because the evidence can be just as strong either way (thus the published record can become more balanced). Bayes factors can be B-hacked, but they mitigate the problem because (a) they allow evidence in either direction, so people will be less tempted to hack in just one direction; (b) as a measure of evidence they are insensitive to the stopping rule; (c) families of tests cannot be arbitrarily defined; and (d) falsely implying a contrast is planned rather than post hoc becomes irrelevant (though the value of pre-registration is not thereby diminished).

Introduction

A Bayes factor is a form of statistical inference in which one model, say H1, is pitted against another, say H0. Both models need to be specified, even if in a default way. Significance testing (using only the p-value for inference, as per Fisher, 1935) involves setting up a model for H0 alone, and yet it is typically still used to pit H0 against H1. I will argue that significance testing is in this way flawed, with harmful consequences for the practice of science (Wagenmakers, 2007). Bayes factors, by specifying two models, resolve several key problems (though not all). After defining a Bayes factor, the introduction first indicates the general consequences of having two models (namely, the ability to obtain evidence for the null hypothesis, and the fact that the alternative has to be specified well enough to make predictions). Then the body of the paper explores four ways in which these consequences may change the practice of science for the better.

In order to define a Bayes factor, the following equation can be derived in a few steps from the axioms of probability (e.g. Stone, 2013): the normative posterior belief in one theory versus another in the light of the data equals the prior belief in one theory versus another, multiplied by a Bayes factor, B. In odds form,

$$\frac{P(H_1 \mid D)}{P(H_0 \mid D)} = B \times \frac{P(H_1)}{P(H_0)}, \qquad \text{where } B = \frac{P(D \mid H_1)}{P(D \mid H_0)}.$$

That is, whatever strength of belief one happened to have in different theories prior to the data (which will be different for different people), that belief should be updated by the same amount, B, for everyone. What this equation tells us is that if we measure the strength of evidence of the data as the amount by which anyone should change their strength of belief in the two theories in the light of the data, then the only relevant information is provided by the Bayes factor, B (cf. Birnbaum, 1962). Conventional approximate guidelines for strength of evidence were provided by Jeffreys (1939), though Bayes factors stand on their own as continuous measures of degrees of evidence. If B > 3 then there is substantial evidence for H1 rather than H0; if B < 1/3 then there is substantial evidence for H0 rather than H1; and if B is between 1/3 and 3 then the evidence is insensitive.
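As a minimal numerical sketch (with purely illustrative numbers, not taken from the paper), the same B updates everyone's prior odds by the same multiplicative amount:

```python
# Illustrative only: one Bayes factor B updates different researchers'
# prior odds in favour of H1 over H0 by the same factor.
B = 10.0  # hypothetical strength of evidence from the data

for prior_odds in (0.2, 1.0, 5.0):   # three researchers' differing prior odds
    posterior_odds = B * prior_odds  # posterior odds = B x prior odds
    print(f"prior odds {prior_odds:4.1f} -> posterior odds {posterior_odds:5.1f}")
```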

The term ‘prior’ has two meanings in the context of Bayes factors. P(H1) is the prior probability of H1, i.e. how much you believe in H1 before seeing the data. But the term ‘prior’ is also used to refer to setting up the model of H1, i.e. stating what the theory predicts, which is used for obtaining P(D|H1), the probability of obtaining the data given the theory. When measuring strength of evidence with Bayes factors, there is no need to specify priors in the first sense; but there is a need to specify a model (a prior in the second sense). To know how much evidence supports a theory one must know what the theory predicts; but one does not have to know how much one believes in the theory a priori. In this paper, specifying what a theory predicts will be called a ‘model’.
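To make the second sense of ‘prior’ concrete, consider a hypothetical binomial experiment in which H0 says the success probability theta is 0.5, and the model of H1 says theta is uniform on [0, 1]. P(D|H1) is then the likelihood averaged over everything the model of H1 allows. The data and the uniform model below are assumptions chosen purely for illustration:

```python
# A sketch of obtaining P(D|H1) from a model of H1 (theta uniform on [0, 1]),
# for hypothetical data of k = 65 successes in n = 100 trials.
from scipy import integrate, stats

k, n = 65, 100

p_d_h0 = stats.binom.pmf(k, n, 0.5)            # likelihood under the point null
p_d_h1, _ = integrate.quad(                    # likelihood averaged over H1's model
    lambda theta: stats.binom.pmf(k, n, theta), 0, 1)

B = p_d_h1 / p_d_h0
print(f"P(D|H0) = {p_d_h0:.4g}, P(D|H1) = {p_d_h1:.4g}, B = {B:.1f}")
# B is about 11: substantial evidence for H1 over H0 on Jeffreys' guidelines.
```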

The specification of two models in a Bayesian approach, rather than one in significance testing, has two direct consequences: first, Bayes factors are symmetric in a way that p-values are asymmetric; second, Bayes factors relate theory to data in a direct way that is not possible with p-values. Here I clarify what these two properties mean; then the paper will consider in detail how these properties are important for how we do science.

First, a Bayes factor, unlike a p-value, is a continuous degree of evidence that can symmetrically favour one model or another (e.g. Rouder, Speckman, Sun, Morey, & Iverson, 2009). Let us call the models H1 and H0. By using conventional criteria, the Bayes factor can indicate whether evidence is weak or strong. Thus, the Bayes factor may indicate (i) strong evidence for H1 and against H0; or (ii) strong evidence for H0 and against H1; or (iii) not much evidence either way. That is, a Bayes factor can make a three-way distinction. A p-value, by contrast, is asymmetric. A small p-value (often) indicates evidence against H0 and for the H1 of interest; but a large p-value does not distinguish evidence for H0 from not much evidence for anything. A p-value only tries to make a two-way distinction: evidence against H0 (i.e. (i)) versus anything else (i.e. (ii) or (iii), without distinguishing them), and even this it does not do very well (Lindley, 1957). A large p-value is, therefore, never in itself evidence for H0. The asymmetry of p-values leads to many problems that are part of the ‘credibility crisis’ in science (Pashler & Wagenmakers, 2012). The reason why p-values are asymmetric is that they specify only one model: H0. This is their simplicity and hence their beguiling beauty. But their simplicity is simplistic. This paper will argue that using Bayes factors will therefore help solve some (but not all) of the problems leading to the credibility crisis, by changing scientific practice. The symmetry is particularly important for determining support for the null hypothesis, for interpreting replications, and for dealing with p-hacking by optional stopping: all practical issues discussed below.
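The three-way verdict can be made explicit. Continuing the hypothetical binomial sketch above, and using Jeffreys' conventional cut-offs of 3 and 1/3:

```python
# Three hypothetical outcomes of 100 trials, classified three ways.
from scipy import integrate, stats

def binomial_bf(k, n):
    """B = P(D|H1) / P(D|H0), with H0: theta = 0.5 and H1: theta ~ Uniform(0, 1)."""
    p_d_h0 = stats.binom.pmf(k, n, 0.5)
    p_d_h1, _ = integrate.quad(lambda t: stats.binom.pmf(k, n, t), 0, 1)
    return p_d_h1 / p_d_h0

for k in (65, 52, 60):  # successes out of n = 100
    B = binomial_bf(k, 100)
    if B > 3:
        verdict = "substantial evidence for H1 over H0"
    elif B < 1 / 3:
        verdict = "substantial evidence for H0 over H1"
    else:
        verdict = "insensitive: not much evidence either way"
    print(f"k = {k:2d}: B = {B:5.2f} -> {verdict}")
```

On the same three data sets, a p-value could at best separate the first verdict from the other two.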

The strict use of only one model is Fisherian; Neyman and Pearson (1967) argued that two models should be used, and introduced the concept of power, which helps introduce symmetry into inference in that it provides grounds for asserting the null hypothesis. Unfortunately, power is a flawed solution (Dienes, 2014), which might explain why it is not always taken up. Power is not calculated from the data actually obtained, so it cannot assess their sensitivity; hence a high-powered non-significant result might not actually be evidence for the null hypothesis, as we shall see. Further, power involves (or should involve) specifying only the minimally interesting effect size, which is a rather incomplete specification of H1 (and the aspect of H1 that is most difficult to specify in many cases). In practice, psychologists are happy to assert null hypotheses even when power has not been calculated and inference is based on p-values alone (as we shall see).
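A sketch of this point, with purely hypothetical numbers: a design with roughly 90% power to detect the minimally interesting effect yields a non-significant result, yet the Bayes factor shows the data to be insensitive rather than supportive of H0. The half-normal model of H1, scaled by the predicted effect, is one simple choice in the spirit of Dienes (2014), assumed here only for illustration:

```python
# Hypothetical example: high power, non-significant p, yet an insensitive B.
import numpy as np
from scipy import integrate, stats

M, SE = 5.0, 5.0     # observed mean difference and its standard error
effect = 16.0        # predicted / minimally interesting effect (assumed)

# Orthodox side: power of the two-sided z-test against that effect, and p-value.
power = stats.norm.sf(1.96 - effect / SE)  # about 0.89
p_value = 2 * stats.norm.sf(abs(M / SE))   # about 0.32, non-significant

# Bayesian side: H0 says the effect is 0; the model of H1 is a half-normal
# scaled by the predicted effect (an assumption for illustration).
h1_model = stats.halfnorm(scale=effect)
p_d_h1, _ = integrate.quad(
    lambda theta: h1_model.pdf(theta) * stats.norm.pdf(M, theta, SE),
    0, np.inf)
p_d_h0 = stats.norm.pdf(M, 0, SE)
B = p_d_h1 / p_d_h0

print(f"power = {power:.2f}, p = {p_value:.2f}, B = {B:.2f}")
# power ~ 0.89 and p ~ 0.32, but 1/3 < B < 3: the data are insensitive,
# so this 'high-powered' non-significant result is not evidence for H0.
```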

The second consequence of having to specify H1 as well as H0 is that thought must be given to what one’s theory actually predicts (Vanpaemel, 2010). In this way, Bayes factors allow a more intimate connection between theory and data than p-values allow. This issue is particularly important for dealing with issues of multiple testing and the timing of theorizing versus collecting data. I conjecture that a Bayesian view of these issues will lead to a more probing exploration of theory than significance testing encourages, a point taken up at the end.

The paper now considers in detail the specific changes to scientific practice the use of Bayes factors may bring about. Specifically it considers, in order, issues of obtaining support for the null hypothesis; of the effect of stopping rules on error rates; of dealing with multiple comparisons in theory evaluation; and, finally, of planned versus post hoc tests and the role of timing of theory and data in scientific inference. I will argue that Bayesian inference compared to significance testing leads to a re-evaluation of all these issues.


Supporting the null hypothesis

Here we consider in turn the problem of providing support for the null hypothesis; how Bayes factors help; and why the orthodox solution of using power does not solve the problem, as illustrated by high-powered attempts to replicate studies.

The problem. The key problem created by the asymmetry of the p-value is that significance testing per se (i.e. inference by use of p-values) cannot provide evidence for the null hypothesis. Indeed, that is exactly how p-values are asymmetric. Despite that, a …

Discussion

Bayes factors provide a symmetrical measure of evidence for one model versus another (e.g. H1 versus H0) in order to relate theory to precisely the data relevant to it. These properties help solve some (but not all) of the problems underlying the credibility crisis in psychology. The symmetry of the measure of evidence means that there can be evidence for H0 just as much as for H1; or the Bayes factor may indicate insufficient evidence either way. P-values (even with power calculations) cannot make this three-way distinction.

References

  • Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association.

  • Carp, J. (2012). On the plurality of (methodological) worlds: Estimating the analytic flexibility of fMRI experiments. Frontiers in Neuroscience.

  • Chambers, C. D. (2015). Ten reasons why journals must review manuscripts before results are known. Addiction.

  • Chambers, C. D., et al. (2014). Instead of “playing the game” it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond. AIMS Neuroscience.

  • Cohen, J. (1988). Statistical power analysis for the behavioural sciences.

  • Correll, J. (2008). 1/f noise and effort on implicit measures of bias. Journal of Personality and Social Psychology.

  • Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science.

  • Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology.

  • Dienes, Z. How Bayesian statistics are needed to determine whether mental states are unconscious.

  • Estes, Z., et al. (2008). Head up, foot down: Object words orient attention to the objects’ typical location. Psychological Science.

  • Etz, A. (2015). Retrieved 30 September 2015. ...

  • Fanelli, D. (2010). Positive results increase down the hierarchy of the sciences. PLoS One.

  • Feynman, R. P. (1998). The meaning of it all.

  • Fisher, R. A. (1935). The design of experiments.

  • Gelman, A., et al. (2013). Bayesian data analysis.

  • Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there...

  • Gelman, A., et al. Avoiding model selection in Bayesian social research.

  • Goldacre, B. (2013). Bad pharma: How medicine is broken, and how we can fix it.

  • Good, I. J. (1983). Good thinking: The foundations of probability and its applications.

  • Hoijtink, H. (2011). Informative hypotheses: Theory and practice for behavioral and social scientists.

  • Howson, C., et al. (2006). Scientific reasoning: The Bayesian approach.

  • Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine.

  • Jaynes, E. T. (2003). Probability theory: The logic of science.

  • Jeffreys, H. (1939). The theory of probability.