In A/B testing, the p-value is commonly used to declare whether a tested hypothesis is a “winner” or not (assuming, of course, that you’ve reached the required sample size and have enough power to detect the effect size of interest).
The 0.05 threshold is widely accepted, but what is a p-value exactly in statistical significance testing? Why is 0.05 often used as a license for making claims?
We often take p-values at face value but rarely dig into what they really mean. The American Statistical Association’s statement on p-values might even surprise you.
What a P-Value Actually Measures
A p-value is the probability of observing data as extreme as (or more extreme than) what was actually observed, assuming that the null hypothesis is true. In other words, it’s a conditional probability that tells us the likelihood of our observed results under the null hypothesis.
So, a p-value only indicates how incompatible the data is with a specified statistical model—like a null hypothesis of no difference between two groups. The smaller the p-value, the more incompatible the data is with the null hypothesis.
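To make that definition concrete, here is a minimal simulation sketch in Python. The traffic and conversion numbers are made up for illustration; the idea is to estimate a two-sided p-value by repeatedly generating data under a null of no difference between the groups:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up A/B test results, purely for illustration:
# control: 10,000 visitors, 500 conversions; variant: 10,000 visitors, 560 conversions.
n_a, conv_a = 10_000, 500
n_b, conv_b = 10_000, 560
observed_diff = conv_b / n_b - conv_a / n_a

# Under the null hypothesis, both groups share a single conversion rate.
pooled_rate = (conv_a + conv_b) / (n_a + n_b)

# Simulate many experiments in which the null is true, and count how often the
# difference is at least as extreme as the one actually observed (two-sided).
n_sims = 100_000
sim_a = rng.binomial(n_a, pooled_rate, n_sims) / n_a
sim_b = rng.binomial(n_b, pooled_rate, n_sims) / n_b
p_value = np.mean(np.abs(sim_b - sim_a) >= abs(observed_diff))

print(f"Observed lift: {observed_diff:.4f}, simulated p-value: {p_value:.4f}")
```

The p-value here is literally the share of simulated null-hypothesis experiments that look at least as extreme as the observed one, and nothing more.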
Already, I bet you realize it tells you less than you thought.
So let’s clear up a few things p-values don’t tell you:
- P-values do not measure the probability that the hypothesis is true.
This is a crucial point. They don’t provide statements about the hypothesis itself, only about the data in relation to the hypothesis being tested. Depending on when you stop a test or peek at the results, you might get data that seems incompatible with the null hypothesis without truly disproving it.
- P-values do not measure the size or importance of an effect.
A small p-value does not imply a larger or more meaningful effect.
- A p-value by itself does not provide evidence regarding a hypothesis.
When the p-value is large, it simply means that the observed data is consistent with the null hypothesis. But this consistency doesn’t confirm the null hypothesis as true, because too much variability, a small sample size, or other limitations may prevent us from detecting an effect even if it exists. A non-significant result doesn’t guarantee that the effect isn’t there.
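To see why a non-significant result is not proof of “no effect,” here is a hedged simulation sketch. The 5.0% vs. 5.5% conversion rates and the 1,000-visitors-per-arm sample size are assumptions chosen for illustration; the lift is genuinely there, yet most runs of this underpowered test come back “non-significant”:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Assumed-for-illustration setup: a real effect exists (5.0% vs 5.5% conversion,
# a 10% relative lift), but only 1,000 visitors per arm.
p_a, p_b, n = 0.050, 0.055, 1_000

def two_sided_p(conv_a, conv_b, n):
    """Two-proportion z-test p-value using a pooled standard error."""
    rate_a, rate_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (rate_b - rate_a) / se
    return 2 * norm.sf(abs(z))

# Repeat the (underpowered) experiment many times with the effect genuinely present.
p_values = np.array([
    two_sided_p(rng.binomial(n, p_a), rng.binomial(n, p_b), n)
    for _ in range(5_000)
])

print(f"Share of runs with p >= 0.05: {np.mean(p_values >= 0.05):.1%}")
```

In this setup most runs fail to reject the null even though the lift is real; the “non-significant” label reflects low power, not the absence of an effect.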
You might realize that hypothesis testing is asymmetric—it’s designed to find evidence against the null hypothesis rather than to prove it right. Rejecting the null suggests that the alternative hypothesis may be true, but not rejecting it doesn’t provide strong support for the null.
Alternative Approaches to Address p-Value Limitations and Reduce Misinterpretations
- Estimation Over Testing: Use confidence intervals to show a range of plausible values, credible intervals (Bayesian) to express probabilities, or prediction intervals to estimate future data points (see the first sketch after this list).
- Bayesian Methods: Bayes Factors compare evidence for competing hypotheses, while posterior probabilities assess the likelihood that a hypothesis is true given the data, providing a probability statement about the hypothesis itself (see the second sketch after this list).
- Alternative Evidence Measures: Likelihood ratios provide relative evidence for hypotheses, and false discovery rate (FDR) control adjusts for multiple comparisons to reduce false positives.
- Decision-Theoretic Models: Incorporate cost-benefit analysis or expected loss minimization to guide decisions based on real-world consequences.
- Emphasize Practical Significance: Focus on effect size and real-world impact, rather than relying solely on statistical significance.
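As a sketch of the estimation approach, and reusing the same made-up conversion numbers from the first example, here is a normal-approximation (Wald) 95% confidence interval for the difference in conversion rates:

```python
import numpy as np
from scipy.stats import norm

# The same made-up results as above.
n_a, conv_a = 10_000, 500
n_b, conv_b = 10_000, 560

rate_a, rate_b = conv_a / n_a, conv_b / n_b
diff = rate_b - rate_a

# Wald (normal-approximation) 95% confidence interval for the difference in rates.
se = np.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
z = norm.ppf(0.975)
low, high = diff - z * se, diff + z * se

print(f"Estimated lift: {diff:.4f} (95% CI: {low:.4f} to {high:.4f})")
```

The interval reports a range of plausible lifts rather than a single accept/reject verdict, which keeps the size of the effect front and center.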
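And as a sketch of the Bayesian route, again with the same made-up numbers and an assumed flat Beta(1, 1) prior on each rate, a Beta-Binomial model yields a direct probability that the variant beats the control:

```python
import numpy as np

rng = np.random.default_rng(11)

# The same made-up results as above; Beta(1, 1) is a flat prior on each rate.
n_a, conv_a = 10_000, 500
n_b, conv_b = 10_000, 560

# Posterior draws for each conversion rate under a Beta-Binomial model.
posterior_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 200_000)
posterior_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 200_000)

# Unlike a p-value, these are direct probability statements about the hypothesis.
prob_b_better = np.mean(posterior_b > posterior_a)
expected_lift = np.mean(posterior_b - posterior_a)

print(f"P(variant > control | data) = {prob_b_better:.3f}, expected lift = {expected_lift:.4f}")
```

Note the contrast with the p-value: this answers “how likely is the variant to be better, given the data,” which is the question most stakeholders are actually asking.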