Abstract
Null-hypothesis significance testing (NHST) has become the main tool of inference in neuroscience, and yet evidence suggests we do not use this tool well: tests are often planned poorly, conducted unfairly, and interpreted invalidly. This editorial makes the case that, in addition to reforms to increase rigor, we should test less, reserving NHST for clearly confirmatory contexts in which the researcher has derived a quantitative prediction, can provide the inputs needed to plan a quality test, and can specify the criteria not only for confirming their hypothesis but also for rejecting it. A reduction in testing would be accompanied by an expansion of the use of estimation [effect sizes and confidence intervals (CIs)]. Estimation is more suitable for exploratory research, provides the inputs needed to plan strong tests, and provides important context for properly interpreting tests.
When drawing conclusions from data, statistics offers two modes of inference: testing and estimation. Testing focuses on answering binary questions (Does this drug affect fear memory?) and is summarized with a test statistic and a decision (p = 0.04; reject the null hypothesis of exactly no effect). Estimation focuses on answering quantitative questions (How much did the drug affect fear memory?) and is summarized with an effect size and an expression of uncertainty (Freezing increased by 20%, 95% CI [1, 39]). The expression of uncertainty uses the expected sampling error for the study to estimate what might be true in general about the effect (While our experiment found an increase in freezing of 20%, the 95% confidence interval shows the data are compatible with the real effect of this drug being as small as 1% up to as large as 39%).
Estimates can be quantitatively synthesized (meta-analysis), so estimation is particularly suited for expressing the accumulation of knowledge across similar studies. Tests are also meant to be synthesized, with a clear rejection of the null requiring the regular occurrence of statistical significance across a series of tests (Fisher, 1926). These modes of inference cut across statistical philosophies: neuroscientists currently rely primarily on the frequentist approach to statistics, which can be used to test (with p values) or to estimate (with confidence intervals). There is increasing use of Bayesian statistics, but this can also be used to test (with Bayes factors) or to estimate (e.g., with credible intervals).
Estimation and testing are two sides of the same coin; in frequentist statistics, they are just algebraic rearrangements of the same statistical model. They differ markedly, though, in how they focus the researcher's attention: testing examines whether a specific hypothesis can be judged incompatible with the data; estimation summarizes the hypotheses that remain compatible with the data. If we think of the scientist as a detective, testing asks whether a particular suspect can be ruled out; estimation summarizes the suspects who should remain under investigation.
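To make the same-coin point concrete, here is a minimal sketch in R with invented freezing scores; a single call to a single model yields both the estimate with its confidence interval and the p value.

  # Hypothetical freezing scores (percent time freezing); numbers invented for illustration
  control <- c(31, 25, 38, 27, 34, 29, 36, 30)
  drug    <- c(52, 44, 58, 41, 55, 47, 60, 43)
  fit <- t.test(drug, control)  # one model underlies both modes of inference
  fit$estimate                  # estimation: the group means
  fit$conf.int                  # estimation: 95% CI for the difference in freezing
  fit$p.value                   # testing: p for the null of exactly no difference

The same fitted model can be read either way: as a decision about the point null or as the range of effect sizes compatible with the data.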
In neuroscience (and many other fields), testing has become the dominant approach to statistical inference, specifically the frequentist approach to testing known as null-hypothesis significance testing (NHST). It was not always this way. Much of the most enduring work in neuroscience was conducted without the use of NHST (Hodgkin and Huxley, 1952; Olds and Milner, 1954; Scoville and Milner, 1957; Katz and Miledi, 1968; Bliss and Lomo, 1973; Sherrington et al., 1995). At the inception of the Journal of Neuroscience in 1980, only 35% of papers in the first volume (50 of 142) reported p values; most instead relied on description and/or estimation, reporting effect sizes with standard errors. For those papers that did use NHST, it was often mixed with estimation, with an average of only seven p values reported per paper. By now, NHST has become ubiquitous (although sometimes supplemented with or supplanted by Bayesian testing). In 2020, 98% of papers (663 of 678) in the Journal of Neuroscience included NHST results, with an average of roughly 50 p values reported per paper (these figures are from a regular-expression search for p values from the pdf-extracted texts of every article in volumes 1 and 40 of the Journal of Neuroscience; R code for the analysis is posted at https://github.com/rcalinjageman/jneuro_p_values/).
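The full analysis code is available at the repository linked above; purely as an illustration of the approach, a bare-bones count of p value reports in extracted text might look like the sketch below (the regular expression here is my own simplification, not necessarily the pattern used in that repository).

  # Illustrative sketch only: count p value reports in a paper's extracted text
  count_p_values <- function(txt) {
    hits <- gregexpr("p\\s*[<>=]\\s*[01]?\\.\\d+", txt, ignore.case = TRUE, perl = TRUE)
    sum(sapply(hits, function(h) sum(h > 0)))
  }
  count_p_values("freezing increased (t(18) = 2.2, p = 0.04); locomotion did not change (p > 0.60)")  # returns 2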
Given the central role NHST has come to play in neuroscience, we should ask: are we testing well? That is: do we design our experiments to produce tests that can be informative, conduct our tests in an even-handed way, and interpret the results sensibly? On all counts, the answer is no. The evidence is overwhelming that norms for deploying NHST are badly broken at every stage of the research process:
Researchers test mindlessly (Gigerenzer, 2004), conducting tests even when they cannot clearly articulate their hypothesis.
Despite the regular usage of NHST, little attention is given to establishing the conditions for a quality test. Sample-size planning is rare (Szucs and Ioannidis, 2020) and poor (Goodhill, 2017), sample sizes in some subfields are demonstrably inadequate (Walum et al., 2016; Medina and Cason, 2017), and data analysis decisions are sometimes post hoc and overly flexible (Héroux et al., 2017).
Our current testing practices are unfair. A significant result is taken to confirm the researcher's (often unspecified) hypothesis, but no criterion is established for rejecting the researcher's hypothesis. The researcher cannot lose, and this is reflected in a neuroscience literature that shows, in the aggregate, a statistically implausible success rate for hypothesis tests given the sample sizes used (Button et al., 2013; Szucs and Ioannidis, 2017).
Interpretation of tests is often uncertainty-blind: even a single significant test can end up cited as an established fact (Calin-Jageman and Cumming, 2019).
This dour assessment of the use of NHST in neuroscience represents a bird's eye view of an extremely broad and diverse field. It does not mean that all subfields and specialties are equally afflicted; statistical practices in neuroscience are heterogeneous (Nord et al., 2017). That heterogeneity reinforces the need for reform, as our current practices intermix our best science with our most dubious conclusions under the same badge of statistical significance.
Pointing out the poor use of NHST has long been a cottage industry for curmudgeons, both in neuroscience and in other fields where NHST is ubiquitously but poorly used (Meehl, 1967; Cohen, 1994; Gigerenzer, 2004). This criticism has sometimes provoked rousing defenses of all that NHST could be if only it were used properly (Lakens, 2021). Nolo contendere. The question at hand, then, is: what would it take for neuroscientists to test well?
One part of the solution is to adopt reforms to promote the rigorous use of testing. There have recently been many promising steps in this direction. Another part of the solution, however, is to use NHST much less. Testing should be for testing hypotheses. That is, we should conduct only strong tests where the researcher has derived a clear prediction from their theory, has planned a high-powered test with a clear analysis plan, and has specified the criteria not only for confirming their hypothesis but also for rejecting it. Ideally, this restricted use of testing would come with publishing reforms to ensure that validly-conducted tests are published (e.g., preregistered review).
In place of testing, our default approach to summarizing research results should be estimation. Estimation is the most appropriate approach for the descriptive and exploratory work required to develop research hypotheses for testing. In addition, it is estimation that provides many of the inputs needed to plan and conduct a strong hypothesis test. Finally, estimation can help foster appropriately cautious interpretations of the tests we do conduct.
In what follows, I sketch the potential advantages of narrowing the use of NHST to clearly confirmatory contexts while expanding the use of estimation. This is not quite a call to “abandon” p values. But it would dramatically change the way we conduct inference in neuroscience: the occasions on which NHST is meaningful are far narrower than licensed under current norms. Although the change in practices would be substantial, there does not seem to be much controversy over the need to make it. Even staunch defenders of NHST concede that we need to test more rarely and more rigorously (Scheel et al., 2020), and an increased emphasis on estimation has been one of the most consistently recommended reforms for better inference in science (Cohen, 1994; Rothman, 2010; Cumming, 2012; Szucs and Ioannidis, 2017; Amrhein and Greenland, 2022).
Estimation for Exploration
As in other sciences, neuroscientists follow their projects where the data take them, with each new result spawning new questions. Forging a trail of discovery requires exploration. New manipulations and new assays may have to be brought into the lab, and the exact parameters for a sensitive experiment may not be immediately obvious. Moreover, we are often screening for important factors rather than testing well-formed hypotheses, cycling through a range of possibilities of about equal promise.
Exploration is essential for fruitful science, but we recognize that the hypotheses that emerge are especially tenuous because of the numerous opportunities exploration provides for capitalizing on chance. For example, a lab may screen inhibitors of five different signaling molecules before obtaining a statistically significant effect on a behavior of interest. While the significant result is intriguing, the multiple tests conducted provide multiple chances for spurious findings, and this risk would be further increased if the researcher has used the achievement of statistical significance to guide decisions about adding samples, selecting an analysis, or refining exclusion criteria.
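A quick calculation shows how fast those chances accumulate, assuming for simplicity that the five screens are independent and that none of the inhibitors truly has an effect:

  # Chance of at least one spuriously "significant" screen out of five,
  # at alpha = .05, before any analytic flexibility is added
  1 - (1 - 0.05)^5   # about 0.23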
Exploration needs to be capped off with strong tests or reported in ways that make the tentative nature of the conclusions clear. This is not always the practice in neuroscience. Instead, the first significant p value found during exploration is often what ends up published, presented as confirmatory and usually without mention of other factors that were screened but found nonsignificant. When exploration regularly masquerades as confirmation, the research literature produced will be unreliable: the practical significance of true findings will be exaggerated, and spurious claims will be unacceptably prevalent. It is hard to gauge the extent of this problem in neuroscience, but the clear excess of significant findings across the published literature is troubling (Button et al., 2013; Szucs and Ioannidis, 2017), suggesting that relevant nonsignificant results are either discarded or unduly coaxed under the threshold for statistical significance. More informally, most researchers in the field can share stories of frustration encountered trying to build from ostensibly solid findings from the published literature.
The solution is not to universally impose requirements for rigorous testing. At early stages of the research process, asking a researcher to preregister their hypothesis and sampling plan would only lead to frustration, evasion, or pro forma efforts that provide a veneer of rigor. Instead, the solution is to use estimation to guide exploration, and then to conduct strong tests as the proper capstone of the research hypotheses that emerge.
Given that current practices often obscure the distinction between exploratory and confirmatory research, would it be possible to adopt reforms that draw a clearer line between them? Yes. Confirmatory research is research that puts your hypothesis at risk, where a negative result is interpretable and would change your thinking in a meaningful way. Other sources discuss at length what is required to conduct what has been called a severe test (Popper, 1959; Mayo, 2018; Scheel et al., 2020), but for NHST some key requirements would be:
A clear and quantitative prediction derived from your research hypothesis.
A sample-size plan that will provide a high-powered test of your prediction.
An analysis plan with limited flexibility, with a priori specification of what data will be counted as valid, what analysis strategy will be used, and what standards will indicate confirmation and disconfirmation of the prediction.
A researcher about to conduct a confirmatory hypothesis test should find it straightforward to preregister their study and/or undergo preregistered review.
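To make the second requirement concrete, a quantitative prediction can be converted into a sample-size plan with a single line of base R; the predicted effect used here (d = 0.8) is an arbitrary assumption for the sake of the example.

  # Per-group sample size for a two-group test of an assumed predicted effect
  # of d = 0.8, alpha = .05 (two-sided), 90% power
  power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.90)
  # yields n of approximately 34 per group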
Properly understood, much of what neuroscientists currently report via NHST as though it were confirmatory should be reported via estimation as exploratory. For example, a researcher may read that PKM inhibitors extend fear memory in rats, and from that may hypothesize that she will observe the same effect in eye-blink conditioning in rabbits. Is this confirmatory research ready for NHST? Probably not. First, the researcher probably does not have a clear prediction yet (Would the effect be just as strong as what was reported for fear memory? Perhaps a smaller effect should be predicted based on uncertainty in the original finding or ceiling effects in the eye-blink assay?). Without a clear prediction, a power analysis cannot be conducted to determine an adequate sample size. Moreover, important procedural details will start off as informed hunches (What dose? What time point? What criteria for exclusion because of side effects?). Even if the researcher's guesswork does yield p < 0.05 on the first batch of animals, this is still an exploratory and tentative finding that should be reported via estimation. Why? Because this initial stage of “throwing stuff at the wall” does not put the researcher's (vague) hypothesis at risk: any nonsignificant result would be sensibly dismissed as a sign of bad luck in adapting the protocol rather than as evidence that some substantive aspect of the researcher's thinking is flawed. If a negative result would not convince you that you have got something wrong, then you are not testing your hypothesis.
Replacing our liberal use of testing with estimation could have several benefits. First, estimation more clearly focuses on uncertainty. Rather than categorical claims (Here we showed protein X enhances LTP, p = 0.04), researchers would emphasize the range of effect sizes compatible with their results (We estimate protein X enhances LTP, but it is not yet clear whether this is an infinitesimal, small, or moderate facilitation; mean increase = 20%, 95% CI [1, 40]). Although this is just a different way of summarizing the same data, the estimation frame more clearly indicates the need for additional research to confirm a meaningful effect. Under our current practices, we often create the illusion that this confirmatory step has already succeeded.
Another benefit of estimation is that estimates are not categorized as significant and nonsignificant. Estimation might therefore support fuller reporting of screening activity (We also screened protein Y, finding a mean increase of 5 ± 19%, and protein Z, finding a mean change of 0 ± 23%; we did not follow up further on these observations). Our current practice of underreporting nonsignificant results leads to waste and makes it impossible to extract accurate knowledge from the published literature.
A third benefit of estimation is that it supports sample-size planning that is suitable for exploration. Under current practices, sample-size planning for NHST seems to be rare (Szucs and Ioannidis, 2020) or poor (Goodhill, 2017), and sampling-to-significance may be common (John et al., 2012; Buchanan and Lohse, 2016). This is a sign that NHST is often being applied too early, before the researcher has developed a quantitative prediction and other inputs that are required to plan a strong hypothesis test. Estimation is more suitable for exploration, allowing researchers to plan for the precision of their estimates rather than the power to detect a predicted effect (Cumming and Calin-Jageman, 2017). With planning for precision, the researcher specifies a desired level of accuracy (say ±30% of the observed effect) and then either predetermines the sample size likely to be needed or collects data until the desired precision is obtained (Kelley et al., 2018). Planning for precision lets the researcher efficiently and iteratively invest resources suited to the value of an accurate answer.
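Here is a minimal sketch of planning for precision, with assumed numbers: suppose pilot work suggests a within-group SD of roughly 15 percentage points of freezing, and we would like the 95% CI on the group difference to have a half-width of about 10 points.

  # Planning for precision (assumed numbers): smallest per-group n whose
  # expected 95% CI half-width on a two-group mean difference is <= 10,
  # given a guessed within-group SD of 15
  target_halfwidth <- 10
  sd_guess <- 15
  n <- 2
  while (qt(0.975, df = 2 * n - 2) * sd_guess * sqrt(2 / n) > target_halfwidth) {
    n <- n + 1
  }
  n  # about 19 per group; the realized CI will vary with the observed SD

The cited resources offer more refined versions of this calculation, including planning with a stated probability (assurance) of actually achieving the target precision.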
Another benefit of estimation is that it can help us better calibrate our expectations for future research. It turns out that p values are surprisingly erratic (Cumming, 2008), and they elide much of the information needed to judge how surprising a new result is relative to a previous one. For example, imagine you recently discovered a statistically significant effect of an antioxidant on long-term memory (t(8) = 3.34, p = 0.01). You then repeat the experiment with some extension to probe mechanism (e.g., while also inactivating a signaling pathway you think might mediate the effect). In this follow-up study, however, the basic effect of the antioxidant on memory turns out nonsignificant (t(8) = 1.2, p = 0.25). Based on significance status alone, you might be tempted to judge the follow-up finding as surprising, potentially even as an indication that something in your protocol has gone wrong. The reality, however, is that this is an unexceptional sequence of results. From an initial finding of p = 0.01, we should expect exact replications to yield p values mostly in the range of 0.00001 to 0.41, with only 33% also turning out statistically significant at the 0.05 level (Cumming, 2012). If that seems surprising, you are not alone; researchers can underestimate the expected variation in p values (Lai et al., 2012). Thus, judging by p values alone risks conflating normal sampling variation with substantive differences in results. Estimation can help. Consider the same set of results expressed as estimates: d = 1.9, 95% CI [0.65, 4.38] for the first study; d = 0.70, 95% CI [–0.58, 2.46] for the follow-up. This way of expressing the results makes clear that the difference between studies is fairly small and unsurprising relative to expected sampling variation.
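The volatility of p is easy to see by simulation. The prediction-interval figures quoted above come from Cumming's calculations, which also account for uncertainty about the true effect; the quick simulation below, with assumed parameters (a fixed true effect of d = 1 and n = 5 per group, echoing the df in the example above), is only meant to show how widely p bounces around across exact replications of the same experiment.

  # How much does p vary across exact replications? Assumed true effect d = 1,
  # n = 5 per group, 10,000 simulated replications
  set.seed(1)
  p_rep <- replicate(10000, t.test(rnorm(5, mean = 1), rnorm(5))$p.value)
  quantile(p_rep, c(0.10, 0.50, 0.90))  # p spans orders of magnitude
  mean(p_rep < 0.05)                    # only a fraction reach significance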
Estimation for Planning Strong Tests
In addition to expanding the use of estimation, we should improve the use of testing. A meaningful test is a well-planned test: one where the researcher's hypothesis has been clearly articulated, a sample that will provide an interpretable result has been planned, and there is limited room for post hoc flexibility in the filtering and analysis of the data collected.
Estimation is helpful for planning strong tests. First, estimation yields effect sizes and uncertainty, the key inputs a researcher needs to ponder when determining a reasonable sample for NHST. Second, estimation makes it possible to adopt a fair testing procedure, one that puts both the null and the researcher's hypothesis at risk (Fig. 1). This requires one extra step before data collection: the researcher must define the range of effects they consider negligible for their research purposes. Then, when the data are collected, two tests are conducted: one to test the researcher's hypothesis that the effect is substantive (called a minimal-effect test), the other to test the alternative hypothesis that the effect is negligible (called an equivalence test). If both tests are nonsignificant, the test is deemed uninformative. This has been called “inference by interval” (Dienes, 2014), because it is easy to understand and communicate this mode of testing through estimation (see Fig. 1): for an α of 0.05, the result is deemed substantive if the entire 95% CI is outside the range of negligible effects, negligible if the 90% CI is entirely within that range, and ambiguous otherwise (those confidence levels are not a typo; a 90% CI is used to see whether an effect is negligible because both ends must be inside the null interval). There are excellent tutorials on inference by interval (Lakens et al., 2018) and accessible software resources for the actual computations (Lakens and Caldwell, 2022). Inference by interval might be especially useful for neuroscience projects that achieve large sample sizes, as this mode of testing is not overwhelmed by sample size the way NHST against a point null is.
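The cited tutorials and software handle these computations properly; purely to illustrate the decision rule described above, here is a bare-bones sketch in base R, where the negligible range (here, a raw difference within ±5 units of the outcome) is an assumption the researcher must set and justify in advance.

  # Sketch of inference by interval. null_lo/null_hi bound the effects deemed
  # negligible, on the raw scale of the outcome; they must be set a priori.
  interval_inference <- function(x, y, null_lo = -5, null_hi = 5) {
    ci95 <- t.test(x, y, conf.level = 0.95)$conf.int
    ci90 <- t.test(x, y, conf.level = 0.90)$conf.int
    if (ci95[1] > null_hi || ci95[2] < null_lo) {
      return("substantive: 95% CI lies entirely outside the negligible range")
    }
    if (ci90[1] > null_lo && ci90[2] < null_hi) {
      return("negligible: 90% CI lies entirely inside the negligible range")
    }
    "ambiguous: the study is uninformative at this sample size"
  }
  # e.g., interval_inference(drug, control) using the hypothetical freezing
  # scores from the earlier sketch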
Inference by interval is not the only way to conduct fair testing. Another option is the use of Bayesian statistics. The Bayesian approach quantifies the degree of support for the researcher's hypothesis and the degree of support for the null hypothesis. Results can be ambiguous between the two, in which case the study is deemed uninformative. There are multiple useful approaches to Bayesian hypothesis testing (Kruschke and Liddell, 2018), and the approach is increasingly accessible, with excellent tutorials and software resources available (Love et al., 2019; Keysers et al., 2020).
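For readers who prefer that route, here is a minimal sketch using the BayesFactor R package with its default prior; the data are simulated placeholders.

  # Bayes-factor version of a two-group comparison; default Cauchy prior on
  # the standardized effect size
  library(BayesFactor)
  set.seed(1)
  treated <- rnorm(12, mean = 0.8)  # simulated placeholder data
  control <- rnorm(12, mean = 0.0)
  ttestBF(x = treated, y = control)
  # By one common convention, BF10 > 3 is taken as support for an effect,
  # BF10 < 1/3 as support for the null, and values in between as ambiguous.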
There are costs to these solutions. First, strong testing requires the researcher to make more judgements when designing the test. In Bayesian statistics, the researcher must specify their priors (the probabilities they assign to the range of possible effect sizes). For inference by interval, the researcher must decide what range of effect sizes they consider negligible. A second cost is in sample size: compared with our current practice of using NHST against a point null of exactly 0, Bayesian statistics tends to require larger sample sizes to reach a clear decision, and inference by interval even more so.
These costs are not trivial, but there is no scientific alternative. A testing procedure that can both confirm and disconfirm hypotheses is required for a science that is parsimonious, self-correcting, and credible.
Estimation for Interpreting Tests
When we have conducted a strong test, it is important to then interpret that test well. Current practices suggest this can be difficult. For example, here is a lightly adapted and anonymized set of results from a recent issue of the Journal of Neuroscience:
There was not a significant difference between treated mice and their controls in distance covered in an open-field test (t(20) = 1.8, p = 0.08), suggesting that Treatment X does not affect general locomotion. There was, however, a significant increase in freezing 24-h after associative fear-conditioning (t(22) = 3.8, p = 0.001). Taken together, these results show a selective effect on memory function.
…
Treatment X enhances memory.
While the details have been elided, this approach to using NHST should feel familiar; it is a fairly typical example of how NHST results are interpreted in the neuroscience literature. Unfortunately, what is typical is also quite dubious:
A nonsignificant result is not a reliable indicator that there is no effect; instead, we need to conduct inference by interval or use Bayesian analysis to quantify support for the null (Lakens et al., 2018; Keysers et al., 2020).
Comparing significance status is not a reliable way to determine specificity (Gelman and Stern, 2006; Nieuwenhuis et al., 2011); instead, this requires a formal comparison of the results themselves (a test for an interaction; a bare-bones version is sketched after this list).
A single significant finding cannot support sweeping generalizations about an effect; instead, we need to calibrate claims to the remaining uncertainty in the finding and rely on the regular occurrence of statistical significance across multiple tests for the clear rejection of the null hypothesis.
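On the second of these points, the remedy is to compare the estimates themselves rather than their significance labels. For two independent estimates, a bare-bones normal-approximation sketch (summary numbers assumed for illustration, loosely echoing the example that follows) looks like this:

  # Direct comparison of two independent effects (assumed summary numbers)
  b1 <- 1.52; se1 <- 0.45   # e.g., estimated memory effect and its SE
  b2 <- 0.73; se2 <- 0.44   # e.g., estimated locomotion effect and its SE
  diff_est <- b1 - b2
  diff_se  <- sqrt(se1^2 + se2^2)
  diff_est + c(-1, 1) * qnorm(0.975) * diff_se  # approximate 95% CI for the difference
  2 * pnorm(-abs(diff_est / diff_se))           # approximate p for the interaction

Exact small-sample methods are preferable in practice; the point is that the inferential weight rests on the comparison itself, not on the pair of significance labels.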
Reporting estimates along with tests could help with each of these issues. Here are the same data reported and interpreted through an estimation lens:
The difference between treated and control mice in an open-field test was compatible with anything from a small decrease in locomotion, through no difference, to a very large increase in locomotion (d = 0.73, 95% CI [–0.12, 1.62]). For fear-conditioning, results were compatible with anywhere between a moderate and a very large increase in memory (d = 1.52, 95% CI [0.67, 2.43]). In our samples, Treatment X produced a larger effect on memory, but the estimated difference in effect is uncertain (d = 0.79, 95% CI [–0.39, 2.00]).
…
We are uncertain about the effect of Treatment X on locomotion, but conclude that it produces at least moderate to potentially startling levels of memory enhancement. Collecting data to refine this estimate will be important: if the true level of memory enhancement is only moderate (d = 0.67), replicating and extending this finding will require substantial resources (n = 48/group for 90% power) or a change in design.
With the estimation lens, these results provide a more clear-eyed appraisal of the knowledge that has been gained and of what still needs to be nailed down. That is of critical importance for fruitful science: we cannot leverage previous findings if the tenuous often masquerades as the confirmed.
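As a check on the sample-size figure quoted in that summary, the calculation can be reproduced, at least approximately, in base R:

  # Per-group n for 90% power to detect d = 0.67 in a two-group design,
  # alpha = .05 (two-sided)
  power.t.test(delta = 0.67, sd = 1, sig.level = 0.05, power = 0.90)
  # yields n of approximately 48 per group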
Conclusions
Science is characterized not only by inquiry but also by a constant drive to improve inquiry. Part of that drive is evidenced in a decades-long conversation over statistical practices and how we make claims from data. Although reformers might bemoan the conservative pace of change in scientific culture, it is clear that statistical practices in neuroscience are not fixed, but do change, in the long run quite dramatically.
We have a real need for additional changes in practice. NHST should be applied not only more rigorously but more judiciously; it is appropriate only for strong tests that put the researcher's hypothesis at real risk. This requires (at a minimum) a quantitative prediction, a sample-size plan to obtain high power, and an analysis plan of limited flexibility with standards for both confirming and disconfirming the prediction. Properly understood, most research in neuroscience is exploratory and would be better summarized through estimation, as this highlights uncertainty, provides the inputs needed to then plan strong tests, and provides the additional context needed to reach thoughtful conclusions about test results.
Footnotes
Dual Perspectives Companion Paper: Neuroscience Needs to Test Both Statistical and Scientific Hypotheses, by Bradley E. Alger et al.
I thank Bradley Alger for engaging in this debate, and the reviewers and Geoff Cumming for useful feedback on this editorial. The best arguments in this editorial are adapted from the great pantheon of thinkers who have shaped the ongoing debate over statistical inference in science; any errors are my own.
R.J.C.-J. is co-author of an undergraduate statistics textbook that emphasizes estimation.
Correspondence should be addressed to Robert J. Calin-Jageman at rcalinjageman@dom.edu