Dual Perspectives

Better Inference in Neuroscience: Test Less, Estimate More

Robert J. Calin-Jageman
Journal of Neuroscience 9 November 2022, 42 (45) 8427-8431; https://doi.org/10.1523/JNEUROSCI.1133-22.2022
Neuroscience Program, Dominican University, River Forest, Illinois 60305

Abstract

Null-hypothesis significance testing (NHST) has become the main tool of inference in neuroscience, and yet evidence suggests we do not use this tool well: tests are often planned poorly, conducted unfairly, and interpreted invalidly. This editorial makes the case that, in addition to reforms to increase rigor, we should test less, reserving NHST for clearly confirmatory contexts in which the researcher has derived a quantitative prediction, can provide the inputs needed to plan a quality test, and can specify the criteria not only for confirming their hypothesis but also for rejecting it. A reduction in testing would be accompanied by an expansion of the use of estimation [effect sizes and confidence intervals (CIs)]. Estimation is more suitable for exploratory research, provides the inputs needed to plan strong tests, and provides important context for properly interpreting tests.

When drawing conclusions from data, statistics offers two modes of inference: testing and estimation. Testing focuses on answering binary questions (Does this drug affect fear memory?) and is summarized with a test statistic and a decision (p = 0.04; reject the null hypothesis of exactly no effect). Estimation focuses on answering quantitative questions (How much did the drug affect fear memory?) and is summarized with an effect size and an expression of uncertainty (Freezing increased by 20%, 95% CI [1, 39]). The expression of uncertainty uses the expected sampling error for the study to estimate what might be true in general about the effect (While our experiment found an increase in freezing of 20%, the 95% confidence interval shows the data are compatible with the real effect of this drug being as small as 1% up to as large as 39%).
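
To make the contrast concrete, here is a minimal sketch in R using simulated data (not from any real study): the same analysis supports both modes of inference, with the test read off the p value and the estimate read off the group difference and its confidence interval.

```r
# Minimal sketch with simulated data (hypothetical freezing scores, % time).
set.seed(1)
control <- rnorm(12, mean = 30, sd = 15)
drug    <- rnorm(12, mean = 50, sd = 15)

fit <- t.test(drug, control)     # Welch two-sample t test
fit$p.value                      # testing: can "exactly no effect" be rejected?
mean(drug) - mean(control)       # estimation: how large is the effect?
fit$conf.int                     # estimation: 95% CI for the difference
```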

Estimates can be quantitatively synthesized (meta-analysis), so estimation is particularly suited for expressing the accumulation of knowledge across similar studies. Tests are also meant to be synthesized, with a clear rejection of the null requiring the regular occurrence of statistical significance across a series of tests (Fisher, 1926). These modes of inference cut across statistical philosophies: neuroscientists currently rely primarily on the frequentist approach to statistics, which can be used to test (with p values) or to estimate (with confidence intervals). There is increasing use of Bayesian statistics, but this can also be used to test (with Bayes factors) or to estimate (e.g., with credible intervals).

Estimation and testing are two sides of the same coin, and in frequentist statistics they are just algebraic rearrangements of the same statistical model. They differ markedly, however, in where they focus the researcher's attention: testing examines whether a specific hypothesis can be judged incompatible with the data; estimation summarizes the hypotheses that remain compatible with the data. If we think of the scientist as a detective, testing emphasizes whether a particular suspect can be ruled out; estimation summarizes the suspects who should remain under investigation.

In neuroscience (and many other fields), testing has become the dominant approach to statistical inference, specifically the frequentist approach to testing known as null-hypothesis significance testing (NHST). It was not always this way. Much of the most enduring work in neuroscience was conducted without the use of NHST (Hodgkin and Huxley, 1952; Olds and Milner, 1954; Scoville and Milner, 1957; Katz and Miledi, 1968; Bliss and Lomo, 1973; Sherrington et al., 1995). At the inception of the Journal of Neuroscience in 1980, only 35% of papers in the first volume (50 of 142) reported p values; most instead relied on description and/or estimation, reporting effect sizes with standard errors. For those papers that did use NHST, it was often mixed with estimation, with an average of only seven p values reported per paper. By now, NHST has become ubiquitous (although sometimes supplemented with or supplanted by Bayesian testing). In 2020, 98% of papers (663 of 678) in the Journal of Neuroscience included NHST results, with an average of roughly 50 p values reported per paper (these figures come from a regular-expression search for p values in the PDF-extracted text of every article in volumes 1 and 40 of the Journal of Neuroscience; R code for the analysis is posted at https://github.com/rcalinjageman/jneuro_p_values/).
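
The analysis script itself is available at the GitHub link above. Purely as an illustration of the general approach (this is not the posted code, and the pattern below is deliberately simplified), a regular-expression count of reported p values in extracted article text can be sketched in a few lines of R:

```r
# Simplified illustration only: count strings that look like reported p values
# (e.g., "p = 0.03", "p < .001", "P > 0.05") in plain text extracted from a PDF.
count_p_values <- function(txt) {
  hits <- gregexpr("[Pp]\\s*[=<>]\\s*\\.?[0-9]", txt)[[1]]
  sum(hits > 0)
}

count_p_values("Freezing increased (p = 0.03) but locomotion did not (p > 0.05).")
```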

Given the central role NHST has come to play in neuroscience, we should ask: are we testing well? That is: do we design our experiments to produce tests that can be informative, conduct our tests in an even-handed way, and interpret the results sensibly? On all counts, the answer is no. The evidence is overwhelming that norms for deploying NHST are badly broken at every stage of the research process:

  • Researchers test mindlessly (Gigerenzer, 2004), conducting tests even when they cannot clearly articulate their hypothesis.

  • Despite the regular usage of NHST, little attention is given to establishing the conditions for a quality test. Sample-size planning is rare (Szucs and Ioannidis, 2020) and poor (Goodhill, 2017), sample sizes in some subfields are demonstrably inadequate (Walum et al., 2016; Medina and Cason, 2017), and data analysis decisions are sometimes post hoc and overly flexible (Héroux et al., 2017).

  • Our current testing practices are unfair. A significant result is taken to confirm the researcher's (often unspecified) hypothesis, but no criterion is established for rejecting the researcher's hypothesis. The researcher cannot lose, and this is reflected in a neuroscience literature that shows, in the aggregate, a statistically implausible success rate for hypothesis tests given the sample sizes used (Button et al., 2013; Szucs and Ioannidis, 2017).

  • Interpretation of tests is often uncertainty-blind: even a single significant test can end up cited as an established fact (Calin-Jageman and Cumming, 2019).

This dour assessment of the use of NHST in neuroscience represents a bird's-eye view of an extremely broad and diverse field. This does not mean that all subfields and specialties are equally afflicted; statistical practices in neuroscience are heterogeneous (Nord et al., 2017). If anything, this heterogeneity reinforces the need for reform, as our current practices intermix our best science with our most dubious conclusions under the same badge of statistical significance.

Pointing out the poor use of NHST has long been a cottage industry for curmudgeons, both in neuroscience and in other fields where NHST is ubiquitously but poorly used (Meehl, 1967; Cohen, 1994; Gigerenzer, 2004). This criticism has sometimes provoked rousing defenses of all that NHST could be if only it were used properly (Lakens, 2021). Nolo contendere. The question at hand, then, is: what would it take for neuroscientists to test well?

One part of the solution is to adopt reforms to promote the rigorous use of testing. There have recently been many promising steps in this direction. Another part of the solution, however, is to use NHST much less. Testing should be for testing hypotheses. That is, we should conduct only strong tests where the researcher has derived a clear prediction from their theory, has planned a high-powered test with a clear analysis plan, and has specified the criteria not only for confirming their hypothesis but also for rejecting it. Ideally, this restricted use of testing would come with publishing reforms to ensure that validly-conducted tests are published (e.g., preregistered review).

In place of testing, our default approach to summarizing research results should be estimation. Estimation is the most appropriate approach for the descriptive and exploratory work required to develop research hypotheses for testing. In addition, it is estimation that provides many of the inputs needed to plan and conduct a strong hypothesis test. Finally, estimation can help foster appropriately cautious interpretations of the tests we do conduct.

In what follows, I sketch the potential advantages of narrowing the use of NHST to clearly confirmatory contexts while expanding the use of estimation. This is not quite a call to “abandon” p values. But it would dramatically change the way we conduct inference in neuroscience: the occasions on which NHST is meaningful are far narrower than licensed under current norms. Although the change in practices would be substantial, there does not seem to be much controversy over the need to make it. Even staunch defenders of NHST concede that we need to test more rarely and more rigorously (Scheel et al., 2020), and an increased emphasis on estimation has been one of the most consistently recommended reforms for better inference in science (Cohen, 1994; Rothman, 2010; Cumming, 2012; Szucs and Ioannidis, 2017; Amrhein and Greenland, 2022).

Estimation for Exploration

As in other sciences, neuroscientists pursue projects wherever the data take them, with each new result spawning new questions. Forging a trail of discovery requires exploration. New manipulations and new assays may have to be brought into the lab, and the exact parameters for a sensitive experiment may not be immediately obvious. Moreover, we are often screening for important factors rather than testing well-formed hypotheses, cycling through a range of possibilities of about equal promise.

Exploration is essential for fruitful science, but we recognize that the hypotheses that emerge are especially tenuous because of the numerous opportunities exploration provides for capitalizing on chance. For example, a lab may screen inhibitors of five different signaling molecules before obtaining a statistically significant effect on a behavior of interest. While the significant result is intriguing, the multiple tests conducted provide multiple chances for spurious findings, and this risk would be further increased if the researcher has used the achievement of statistical significance to guide decisions about adding samples, selecting an analysis, or refining exclusion criteria.

Exploration needs to be capped off with strong tests or reported in ways that make the tentative nature of the conclusions clear. This is not always the practice in neuroscience. Instead, the first significant p value found during exploration is often what ends up published, presented as confirmatory and usually without mention of other factors that were screened but found nonsignificant. When exploration regularly masquerades as confirmation, the research literature produced will be unreliable: the practical significance of true findings will be exaggerated, and spurious claims will be unacceptably prevalent. It is hard to gauge the extent of this problem in neuroscience, but the clear excess of significant findings across the published literature is troubling (Button et al., 2013; Szucs and Ioannidis, 2017), suggesting that relevant nonsignificant results are either discarded or unduly coaxed under the threshold for statistical significance. More informally, most researchers in the field can share stories of frustration encountered while trying to build on ostensibly solid findings from the published literature.

The solution is not to universally impose requirements for rigorous testing. At early stages of the research process, asking a researcher to preregister their hypothesis and sampling plan would only lead to frustration, evasion, or pro forma efforts that provide a veneer of rigor. Instead, the solution is to use estimation to guide exploration, and then to conduct strong tests as the proper capstone of the research hypotheses that emerge.

Given that current practices often obscure the distinction between exploratory and confirmatory research, would it be possible to adopt reforms that draw a clearer line between them? Yes. Confirmatory research is research that puts your hypothesis at risk, where a negative result is interpretable and would change your thinking in a meaningful way. Other sources discuss at length what is required to conduct what has been called a severe test (Popper, 1959; Mayo, 2018; Scheel et al., 2020), but for NHST some key requirements would be:

  • A clear and quantitative prediction derived from your research hypothesis.

  • A sample-size plan that will provide a high-powered test of your prediction.

  • An analysis plan with limited flexibility, with a priori specification of what data will be counted as valid, what analysis strategy will be used, and what standards will indicate confirmation and disconfirmation of the prediction.

A researcher about to conduct a confirmatory hypothesis test should find it straightforward to preregister their study and/or undergo preregistered review.
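
To make the sample-size requirement above concrete: once a quantitative prediction exists, base R's power.t.test() turns it into a defensible n. The values below (a predicted standardized effect of d = 0.8, two-sided α = 0.05, 90% power) are hypothetical and used only for illustration; the point is that without a quantitative prediction there is nothing principled to enter for delta.

```r
# Hypothetical planning values, for illustration only.
power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.90,
             type = "two.sample", alternative = "two.sided")
# Returns n of roughly 34 per group under these assumptions.
```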

Properly understood, much of what neuroscientists currently report via NHST as though it was confirmatory should be reported via estimation as exploratory. For example, a researcher may read that PKM inhibitors extend fear memory in rats, and from that may hypothesize that she will observe the same effect in eye-blink conditioning in rabbits. Is this confirmatory research ready for NHST? Probably not. First, the researcher probably does not have a clear prediction yet (Would the effect be just as strong as what was reported for fear memory? Perhaps a smaller effect should be predicted based on uncertainty in the original finding or ceiling effects in the eye-blink assay?). Without a clear prediction, a power analysis cannot be conducted to determine an adequate sample size. Moreover, important procedural details will start off as informed hunches (What dose? What time point? What criteria for exclusion because of side effects?). Even if the researcher's guesswork does yield p < 0.05 on the first batch of animals, this is still an exploratory and tentative finding that should be reported via estimation. Why? Because this initial stage of “throwing stuff at the wall” does not put the researcher's (vague) hypothesis at risk: any nonsignificant result would be sensibly dismissed as a sign of bad luck in adapting the protocol rather than as evidence that some substantive aspect of the researcher's thinking is flawed. If a negative result would not convince you that you have got something wrong, then you are not testing your hypothesis.

Replacing our indiscriminate use of testing with estimation could have several benefits. First, estimation more clearly focuses on uncertainty. Rather than categorical claims (Here we showed protein X enhances LTP, p = 0.04), researchers would emphasize the range of effect sizes compatible with their results (We estimate protein X enhances LTP, but it is not yet clear whether this is an infinitesimal, small, or moderate facilitation; mean increase = 20%, 95% CI [1, 40]). Although this is just a different way of summarizing the same data, the estimation frame more clearly indicates the need for additional research to confirm a meaningful effect. Under our current practices, we often create the illusion that this confirmatory step has already succeeded.

Another benefit of estimation is that estimates are not categorized as significant and nonsignificant. Estimation might therefore support more complete reporting of screening activity (We also screened protein Y, finding a mean increase of 5 ± 19%, and protein Z, finding a mean change of 0 ± 23%; we did not follow up further on these observations). Our current practice of underreporting nonsignificant results leads to waste and makes it impossible to extract accurate knowledge from the published literature.

A third benefit of estimation is that it supports sample-size planning that is suitable for exploration. Under current practices, sample-size planning for NHST seems to be rare (Szucs and Ioannidis, 2020) or poor (Goodhill, 2017), and sampling-to-significance may be common (John et al., 2012; Buchanan and Lohse, 2016). This is a sign that NHST is often being applied too early, before the researcher has developed a quantitative prediction and the other inputs required to plan a strong hypothesis test. Estimation is more suitable for exploration, allowing researchers to plan for the precision of their estimates rather than the power to detect a predicted effect (Cumming and Calin-Jageman, 2017). With planning for precision, the researcher specifies a desired level of accuracy (say, ±30% of the observed effect) and then either predetermines the sample size likely to be needed or collects data until the desired precision is obtained (Kelley et al., 2018). Planning for precision lets the researcher efficiently and iteratively invest resources suited to the value of an accurate answer.
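
As a rough sketch of precision planning (with assumed values chosen only for illustration), one can ask for the per-group n at which the expected 95% CI half-width for a two-group mean difference drops below a target margin of error; the sequential approach of Kelley et al. (2018) handles the additional uncertainty in the SD more formally.

```r
# Minimal precision-planning sketch; sd_guess and target_moe are assumed values.
sd_guess   <- 15   # assumed SD of the outcome (e.g., % freezing)
target_moe <- 10   # desired 95% CI half-width, in the same units

n <- 2
repeat {
  moe <- qt(0.975, df = 2 * n - 2) * sd_guess * sqrt(2 / n)
  if (moe <= target_moe) break
  n <- n + 1
}
n   # about 19 per group under these assumptions
```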

Another benefit of estimation is that it can help us better calibrate our expectations for future research. It turns out that p values are surprisingly erratic (Cumming, 2008), and they elide much of the information needed to judge how surprising a new result is relative to a previous one. For example, imagine you recently discovered a statistically significant effect of an antioxidant on long-term memory (t(8) = 3.34, p = 0.01). You then repeat the experiment with some extension to probe mechanism (e.g., while also inactivating a signaling pathway you think might mediate the effect). In this follow-up study, however, the basic effect of the antioxidant on memory turns out nonsignificant (t(8) = 1.2, p = 0.25). Based on significance status alone, you might be tempted to judge the follow-up finding as surprising, potentially even as an indication that something in your protocol has gone wrong. The reality, however, is that this is an unexceptional sequence of results. From an initial finding of p = 0.01, we should expect exact replications to yield p values mostly in the range of 0.00001 to 0.41, with only 33% also turning out statistically significant at the 0.05 level (Cumming, 2012). If that seems surprising, you are not alone; researchers can underestimate the expected variation in p values (Lai et al., 2012). Thus, judging by p values alone risks conflating normal sampling variation with substantive differences in results. Estimation can help. Consider the same set of results expressed as estimates: d = 1.9, 95% CI [0.65, 4.38] for the first study; d = 0.70, 95% CI [–0.58, 2.46] for the follow-up. This way of expressing the results makes clear that the difference between studies is fairly small and unsurprising relative to expected sampling variation.
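
For readers who want to see where such interval estimates come from, here is a hedged sketch (the helper name is mine, and it assumes two independent groups of n = 5 per group, which may not match the hypothetical design exactly): the standardized effect is recovered from t, given a small-sample correction, and its CI is obtained from the noncentral t distribution.

```r
# Hypothetical helper: standardized effect and 95% CI from a two-group t statistic.
d_from_t <- function(t, n_per_group) {
  df <- 2 * n_per_group - 2
  d  <- t * sqrt(2 / n_per_group)   # Cohen's d recovered from the t statistic
  J  <- 1 - 3 / (4 * df - 1)        # small-sample (Hedges) correction
  # CI limits: noncentrality parameters placing the observed t at the
  # 97.5th and 2.5th percentiles of the noncentral t distribution.
  lo <- uniroot(function(ncp) pt(t, df, ncp) - 0.975, c(-50, 50))$root
  hi <- uniroot(function(ncp) pt(t, df, ncp) - 0.025, c(-50, 50))$root
  c(g = J * d, lower = lo * sqrt(2 / n_per_group), upper = hi * sqrt(2 / n_per_group))
}

d_from_t(3.34, 5)   # initial study: point estimate near 1.9 with a wide CI
d_from_t(1.20, 5)   # follow-up: point estimate near 0.7, CI spanning zero
```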

Estimation for Planning Strong Tests

In addition to expanding the use of estimation, we should improve the use of testing. A meaningful test is a well-planned test: one where the researcher's hypothesis has been clearly articulated, a sample that will provide an interpretable result has been planned, and there is limited room for post hoc flexibility in the filtering and analysis of the data collected.

Estimation is helpful for planning strong tests. First, estimation yields effect sizes and uncertainty, the key inputs a researcher needs to ponder when determining a reasonable sample for NHST. Second, estimation makes it possible to adopt a fair testing procedure, one that puts both the null and the researcher's hypothesis at risk (Fig. 1). This requires one extra step before data collection: the researcher must define the range of effects they consider negligible for their research purposes. Then, when the data are collected, two tests are conducted: one to test the researcher's hypothesis that the effect is substantive (called a minimal-effect test), the other to test the alternative hypothesis that the effect is negligible (called an equivalence test). If both tests are nonsignificant, the test is deemed uninformative. This has been called “inference by interval” (Dienes, 2014), because it is easy to understand and communicate this mode of testing through estimation (see Fig. 1): for an α of 0.05, the result is deemed substantive if the entire 95% CI is outside the range of negligible effects, negligible if the 90% CI is entirely within that range, and ambiguous otherwise (those confidence levels are not a typo; a 90% CI is used to see whether an effect is negligible because both ends must be inside the null interval). There are excellent tutorials on inference by interval (Lakens et al., 2018) and accessible software resources for the actual computations (Lakens and Caldwell, 2022). Inference by interval might be especially useful for neuroscience projects that achieve large sample sizes, as this mode of testing is not overwhelmed by sample size the way NHST against a point null is.
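
Because the decision rule is defined directly on confidence intervals, it is easy to sketch from ordinary t.test() output. The code below uses simulated data and an arbitrary negligible zone of ±5 units purely for illustration; for real analyses, the TOSTER package cited above is the better route.

```r
# Inference-by-interval sketch: simulated data, arbitrary negligible zone.
set.seed(2)
a <- rnorm(20, mean = 0, sd = 10)
b <- rnorm(20, mean = 8, sd = 10)
negligible <- c(-5, 5)   # researcher-defined range of negligible effects

ci95 <- t.test(b, a, conf.level = 0.95)$conf.int
ci90 <- t.test(b, a, conf.level = 0.90)$conf.int

if (ci95[1] > negligible[2] || ci95[2] < negligible[1]) {
  "substantive: the entire 95% CI lies outside the negligible zone"
} else if (ci90[1] >= negligible[1] && ci90[2] <= negligible[2]) {
  "negligible: the entire 90% CI lies inside the negligible zone"
} else {
  "ambiguous: the data do not yet support a clear decision"
}
```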

Figure 1. Strong testing via inference by interval. In this approach to testing, the researcher defines the range of effect sizes that are negligible relative to the research question at hand (gray box with dashed border). Significance at α = 0.05 can then be seen graphically by plotting the effect size in the sample (triangle) with the 95% confidence interval (thin bar) and 90% confidence interval (thick bar). The effect is deemed substantive if the entire 95% confidence interval is outside the zone of negligible effects, negligible if the entire 90% confidence interval is inside this zone, and ambiguous otherwise.

Inference by interval is not the only way to conduct fair testing. Another option is the use of Bayesian statistics. The Bayesian approach quantifies the degree of support for the researcher's hypothesis and the degree of support for the null hypothesis. Results can be ambiguous between the two, in which case the study is deemed uninformative. There are multiple useful approaches to Bayesian hypothesis testing (Kruschke and Liddell, 2018), and the approach is increasingly accessible, with excellent tutorials and software resources available (Love et al., 2019; Keysers et al., 2020).
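
For a quick sense of what a Bayesian comparison involves, here is a deliberately rough sketch using the well-known BIC shortcut for approximating a Bayes factor (simulated data; the dedicated tools cited above, such as JASP, are preferable in practice because they make the prior explicit).

```r
# Rough BIC approximation: BF01 is roughly exp((BIC_alternative - BIC_null) / 2).
set.seed(3)
dat <- data.frame(
  y     = c(rnorm(15, mean = 0, sd = 1), rnorm(15, mean = 0.2, sd = 1)),
  group = rep(c("control", "treated"), each = 15)
)
bic0 <- BIC(lm(y ~ 1, data = dat))       # null model: one common mean
bic1 <- BIC(lm(y ~ group, data = dat))   # alternative model: a group effect
exp((bic1 - bic0) / 2)                   # >1 favors the null, <1 the alternative;
                                         # values near 1 are the "uninformative" case
```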

There are costs to these solutions. First, strong testing requires the researcher to make more judgments when designing the test. In Bayesian statistics, the researcher must specify their priors (the probabilities they assign to the range of possible effect sizes). For inference by interval, the researcher must decide what range of effect sizes they consider negligible. A second cost is in sample size: compared with our current practice of using NHST against a point null of exactly 0, Bayesian statistics tends to require larger sample sizes to reach a clear decision, and inference by interval even more so.

These costs are not trivial, but there is no scientific alternative. A testing procedure that can both confirm and disconfirm hypotheses is required for a science that is parsimonious, self-correcting, and credible.

Estimation for Interpreting Tests

When we have conducted a strong test, it is important to then interpret that test well. Current practices suggest this can be difficult. For example, here is a lightly-adapted and anonymized set of results from a recent issue of the Journal of Neuroscience: There was not a significant difference between treated mice and their controls in distance covered in an open-field test (t(20) = 1.8, p = 0.08), suggesting that Treatment X does not affect general locomotion. There was, however, a significant increase in freezing 24 h after associative fear-conditioning (t(22) = 3.8, p = 0.001). Taken together, these results show a selective effect on memory function. … Treatment X enhances memory.

While the details have been elided, this approach to using NHST should feel familiar; it is a fairly typical example of how NHST results are interpreted in the neuroscience literature. Unfortunately, what is typical is also quite dubious:

  • A nonsignificant result is not a reliable indicator that there is no effect; instead, we need to conduct inference by interval or use Bayesian analysis to quantify support for the null (Lakens et al., 2018; Keysers et al., 2020).

  • Comparing significance status is not a reliable way to determine specificity (Gelman and Stern, 2006; Nieuwenhuis et al., 2011); instead, this requires a formal comparison of results (a test for an interaction).

  • A single significant finding cannot support sweeping generalizations about an effect; instead, we need to calibrate claims to the remaining uncertainty in the finding and rely on the regular occurrence of statistical significance across multiple tests for the clear rejection of the null hypothesis.

Reporting estimates along with tests could help with each of these issues. Here are the same data reported and interpreted through an estimation lens: The difference between treated and control mice in an open-field test was compatible with anything from Treatment X producing a small decrease in locomotion, through no difference, up to a very large increase in locomotion (d = 0.73, 95% CI [–0.12, 1.62]). For fear-conditioning, results were compatible with anywhere between a moderate and a very large increase in memory (d = 1.52, 95% CI [0.67, 2.43]). In our samples, Treatment X produced a larger effect on memory, but the estimated difference in effect is uncertain (d = 0.79, 95% CI [–0.39, 2.00]). … We are uncertain about the effect of Treatment X on locomotion, but conclude that it produces at least moderate to potentially startling levels of memory enhancement. Collecting data to refine this estimated effect will be important, because if the true level of memory enhancement is only moderate (d = 0.67), replicating and extending this finding will require substantial resources (n = 48/group for 90% power) or a change in design.
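
The resource estimate quoted in the passage above can be checked with a one-line power calculation (assuming a standard two-group design with equal n per group):

```r
# d = 0.67, two-sided alpha = 0.05, 90% power.
power.t.test(delta = 0.67, sd = 1, sig.level = 0.05, power = 0.90)
# n is approximately 48 per group, matching the figure in the text.
```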

With the estimation lens, these results provide a more clear-eyed appraisal of the knowledge that has been gained and of what still remains to be nailed down. That is of critical importance for fruitful science: we cannot leverage previous findings if the tenuous often masquerades as the confirmed.

Conclusions

Science is characterized not only by inquiry but also by a constant drive to improve inquiry. Part of that drive is evidenced in a decades-long conversation over statistical practices and how we make claims from data. Although reformers might bemoan the conservative pace of change in scientific culture, it is clear that statistical practices in neuroscience are not fixed, but do change, in the long run quite dramatically.

We have a real need for additional changes in practice. NHST should be applied not only more rigorously but more judiciously; it is appropriate only for strong tests that put the researcher's hypothesis at real risk. This requires (at a minimum) a quantitative prediction, a sample-size plan to obtain high power, and an analysis plan of limited flexibility with standards for both confirming and disconfirming the prediction. Properly understood, most research in neuroscience is exploratory and would be better summarized through estimation, as this highlights uncertainty, provides the inputs needed to then plan strong tests, and provides the additional context needed to reach thoughtful conclusions about test results.

Footnotes

  • Dual Perspectives Companion Paper: Neuroscience Needs to Test Both Statistical and Scientific Hypotheses, by Bradley E. Alger et al.

  • I thank Bradley Alger for engaging in this debate, and the reviewers and Geoff Cumming for useful feedback on this editorial. The best arguments in this editorial are adapted from the great pantheon of thinkers who have shaped the ongoing debate over statistical inference in science; any errors are my own.

  • R.J.C.-J. is co-author of an undergraduate statistics textbook that emphasizes estimation.

  • Correspondence should be addressed to Robert J. Calin-Jageman at rcalinjageman@dom.edu

SfN exclusive license.

References

1. Amrhein V, Greenland S (2022) Discuss practical importance of results based on interval estimates and p-value functions, not only on point estimates and null p-values. J Inf Technol 37:316–320. doi:10.1177/02683962221105904
2. Bliss TV, Lomo T (1973) Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J Physiol 232:331–356. doi:10.1113/jphysiol.1973.sp010273 pmid:4727084
3. Buchanan TL, Lohse KR (2016) Researchers' perceptions of statistical significance contribute to bias in health and exercise science. Meas Phys Educ Exerc Sci 20:131–139. doi:10.1080/1091367X.2016.1166112
4. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14:365–376. doi:10.1038/nrn3475 pmid:23571845
5. Calin-Jageman RJ, Cumming G (2019) The new statistics for better science: ask how much, how uncertain, and what else is known. Am Stat 73:271–280. doi:10.1080/00031305.2018.1518266 pmid:31762475
6. Cohen J (1994) The earth is round (p < .05). Am Psychol 49:997–1003.
7. Cumming G (2008) Replication and p intervals. Perspect Psychol Sci 3:286–300. doi:10.1111/j.1745-6924.2008.00079.x
8. Cumming G (2012) Understanding the new statistics: effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
9. Cumming G, Calin-Jageman RJ (2017) Introduction to the new statistics: estimation, open science, and beyond. New York: Routledge.
10. Dienes Z (2014) Using Bayes to get the most out of non-significant results. Front Psychol 5:781. doi:10.3389/fpsyg.2014.00781
11. Fisher RA (1926) The arrangement of field experiments. J Minist Agric (G B) 33:503–513.
12. Gelman A, Stern H (2006) The difference between “significant” and “not significant” is not itself statistically significant. Am Stat 60:328–331. doi:10.1198/000313006X152649
13. Gigerenzer G (2004) Mindless statistics. J Socio Econ 33:587–606. doi:10.1016/j.socec.2004.09.033
14. Goodhill GJ (2017) Is neuroscience facing up to statistical power? arXiv 1–5.
15. Héroux ME, Loo CK, Taylor JL, Gandevia SC (2017) Questionable science and reproducibility in electrical brain stimulation research. PLoS One 12:e0175635. doi:10.1371/journal.pone.0175635 pmid:28445482
16. Hodgkin AL, Huxley AF (1952) A quantitative description of membrane current and its application to conduction and excitation in nerve. J Physiol 117:500–544. doi:10.1113/jphysiol.1952.sp004764 pmid:12991237
17. John LK, Loewenstein G, Prelec D (2012) Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol Sci 23:524–532. doi:10.1177/0956797611430953 pmid:22508865
18. Katz B, Miledi R (1968) The role of calcium in neuromuscular facilitation. J Physiol 195:481–492. doi:10.1113/jphysiol.1968.sp008469 pmid:4296699
19. Kelley K, Darku FB, Chattopadhyay B (2018) Accuracy in parameter estimation for a general class of effect sizes: a sequential approach. Psychol Methods 23:226–243. doi:10.1037/met0000127 pmid:28383948
20. Keysers C, Gazzola V, Wagenmakers EJ (2020) Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nat Neurosci 23:788–799. doi:10.1038/s41593-020-0660-4 pmid:32601411
21. Kruschke JK, Liddell TM (2018) The Bayesian new statistics: hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychon Bull Rev 25:178–206. doi:10.3758/s13423-016-1221-4 pmid:28176294
22. Lai J, Fidler F, Cumming G (2012) Subjective p intervals: researchers underestimate the variability of p values over replication. Methodology 8:51–62. doi:10.1027/1614-2241/a000037
23. Lakens D (2021) The practical alternative to the p value is the correctly used p value. Perspect Psychol Sci 16:639–648. doi:10.1177/1745691620958012 pmid:33560174
24. Lakens D, Caldwell A (2022) TOSTER: two one-sided tests (TOST) equivalence testing. Available at: https://cran.r-project.org/web/packages/TOSTER/index.html.
25. Lakens D, Dienes Z, Isager PM, Scheel AM, McLatchie N (2018) Improving inferences about null effects with Bayes factors and equivalence tests. J Gerontol B XX:1–13.
26. Love J, Selker R, Marsman M, Jamil T, Dropmann D, Verhagen J, Ly A, Gronau QF, Šmíra M, Epskamp S, Matzke D, Wild A, Knight P, Rouder JN, Morey RD, Wagenmakers EJ (2019) JASP: graphical statistical software for common statistical designs. J Stat Soft 88:1–17. doi:10.18637/jss.v088.i02
27. Mayo DG (2018) Statistical inference as severe testing: how to get beyond the statistics wars, Ed 1. Cambridge; New York: Cambridge University Press.
28. Medina J, Cason S (2017) No evidential value in samples of transcranial direct current stimulation (tDCS) studies of cognition and working memory in healthy populations. Cortex 94:131–141. doi:10.1016/j.cortex.2017.06.021 pmid:28759803
29. Meehl PE (1967) Theory-testing in psychology and physics: a methodological paradox. Philos Sci 34:103–115. doi:10.1086/288135
30. Nieuwenhuis S, Forstmann BU, Wagenmakers EJ (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci 14:1105–1107. doi:10.1038/nn.2886
31. Nord CL, Valton V, Wood J, Roiser JP (2017) Power-up: a reanalysis of “power failure” in neuroscience using mixture modeling. J Neurosci 37:8051–8061. doi:10.1523/JNEUROSCI.3592-16.2017 pmid:28706080
32. Olds J, Milner P (1954) Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. J Comp Physiol Psychol 47:419–427. doi:10.1037/h0058775 pmid:13233369
33. Popper KR (1959) The logic of scientific discovery. London: Julius Springer, Hutchinson and Co.
34. Rothman KJ (2010) A show of confidence. N Engl J Med 299:1362–1363.
35. Scheel AM, Tiokhin L, Isager PM, Lakens D (2020) Why hypothesis testers should spend less time testing hypotheses. Perspect Psychol Sci 16:744–755.
36. Scoville WB, Milner B (1957) Loss of recent memory after bilateral hippocampal lesions. J Neurol Neurosurg Psychiatry 20:11–21. doi:10.1136/jnnp.20.1.11 pmid:13406589
37. Sherrington R, et al. (1995) Cloning of a gene bearing missense mutations in early-onset familial Alzheimer's disease. Nature 375:754–760. doi:10.1038/375754a0 pmid:7596406
38. Szucs D, Ioannidis JPA (2017) Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biol 15:e2000797. doi:10.1371/journal.pbio.2000797
39. Szucs D, Ioannidis JP (2020) Sample size evolution in neuroimaging research: an evaluation of highly-cited studies (1990–2012) and of latest practices (2017–2018) in high-impact journals. Neuroimage 221:117164. doi:10.1016/j.neuroimage.2020.117164 pmid:32679253
40. Walum H, Waldman ID, Young LJ (2016) Statistical and methodological considerations for the interpretation of intranasal oxytocin studies. Biol Psychiatry 79:251–257. doi:10.1016/j.biopsych.2015.06.016 pmid:26210057
Responses to this article

RE: Author Response to Calin-Jageman et al.
Bradley Alger, Author, Univ of Maryland School of Medicine
Published on: 9 November 2022

    This Dual Perspective was set up to debate whether conventional significance testing methods should be replaced with “estimation statistics.” Calin-Jageman advocated replacement. I favored retaining and improving conventional methods. My main point was that, while the estimation approach can be useful, its philosophy seriously underestimates the vital importance of qualitative decision-making in science. Thus it ignores the reality and needs of many neuroscience subfields. In contrast, significance testing facilitates decision-making. In his Dual Perspective essay, Calin-Jageman modifies his original position and now agrees that we need not “abandon” p values and that, instead, we should strive to improve significance testing procedures. His candor and open-mindedness are marks of a true scholar.

     

We agree that neuroscience must tighten the mechanics of its statistical practices and will benefit from regularly reporting confidence intervals and effect sizes. Rather than dwelling on minor remaining disagreements, we should now focus on broader issues that continue to plague neuroscience. The following suggestions, intended for consideration by authors, editors, and reviewers, could markedly enhance the clarity and reliability of our communications.

     

    Clearly state the purpose and goals of a study.

    • Did it test a scientific hypothesis (or hypotheses)? If so, state each hypothesis explicitly, together with its predictions, especially potentially falsifying predictions, that were tested.
    • If no hypothesis was tested, then state the purpose of the project, e.g., to characterize a phenomenon quantitatively; to demonstrate a new technique, etc., and what was done.
    • Many projects use more than one approach, e.g., a gene-screen followed by tests of a hypothesis that was suggested by the screening data. Briefly outline the overall plan.

     

    Justify conclusions.

    • The conclusions of a paper should fit the scope of the project.
    • Scientific hypothesis-testing experiments test predictions that follow logically from the hypothesis. Conclusions refer directly to possible truth or falsehood of the hypothesis.
    • Non-hypothesis testing projects summarize and interpret results of collecting data or answering questions. Conclusions may refer to one or more scientific hypotheses compatible with the data, but not tested by it.  

     

    Recognize that neuroscience is not a unitary field.

    • P-valued tests supplemented by estimation statistics serve a variety of purposes. Testing predictions of scientific hypotheses is only one of them. For instance, significance testing can help define the capabilities of an experimental technique or the efficacy of a drug, even if no scientific hypothesis is present.
    • Concepts imported from other fields, e.g., psychology, are not universally applicable. For instance, the “exploratory-confirmatory” framework does not readily map onto certain subfields or modes of doing neuroscience.

     

    Revitalize reviewing and publishing policies.

    • Papers should be judged according to their own merits and stated goals, such as those sketched above. Were the goals achieved? Were they significant?
    • Recognize that rigorously falsifying an important scientific hypothesis is a valuable advance in knowledge, even if no better explanation is demonstrated.
    • Review the internal logic of the paper.

    Do the data justify the conclusions? Are there fallacies in the reasoning?

    • Accept that no paper can be completely comprehensive.

    Is the reviewer requesting a new experiment because of a truly critical omission, or because the reviewer thinks it would be interesting?

    Given what is known, are the conclusions reasonable?

    • Evaluate other obvious interpretations for the data.

    Call authors’ attention to egregious oversights, but do not reject a paper for failing to discuss every conceivable alternative.

     

    Competing Interests: None declared.
