Why Doing Good Science is Hard and How to Do it Better
Doing good science is hard, and a lot of experiments fail. Although the scientific method helps reduce uncertainty and lead to discoveries, its path is full of potholes.
In this post, you’ll learn about common p-value misinterpretations, p-hacking, and the problem with performing multiple hypothesis tests. Of course, not only are the problems presented, but their potential solutions as well.
By the end of the post, you should have a good idea of some of the pitfalls of hypothesis testing, how to avoid them, and an appreciation for why doing good science is so hard.
P-value misinterpretations
There are many ways to misinterpret a p-value. By definition, a p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming the null hypothesis is true.
What the p-value is not:
- A measure of the size of the effect or the strength of the evidence
- The chance that the intervention is effective
- A statement that the null hypothesis is true or false
- A statement that the alternative hypothesis is true or false
If you want to measure the strength of the evidence or size of effect, then you need to calculate the effect size. This can be done with Pearson’s r correlation, standardized difference of means, or other methods.
Reporting the effect size in your research is recommended: a p-value tells you how unlikely your results would be if only chance were at work, but not the relative magnitude of the experimental treatment or the size of the experimental effect.
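As a rough sketch (in Python, using NumPy and SciPy, with made-up data), a standardized difference of means (Cohen’s d) and a Pearson’s r can be computed like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical samples, e.g. a treatment group and a control group.
treatment = rng.normal(loc=5.5, scale=2.0, size=100)
control   = rng.normal(loc=5.0, scale=2.0, size=100)

def cohens_d(a, b):
    """Standardized difference of means using the pooled standard deviation."""
    pooled_sd = np.sqrt(((len(a) - 1) * np.var(a, ddof=1) +
                         (len(b) - 1) * np.var(b, ddof=1)) /
                        (len(a) + len(b) - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

print(f"Cohen's d: {cohens_d(treatment, control):.2f}")

# Pearson's r for two continuous measurements, e.g. dose and response.
dose = np.linspace(0, 10, 100)
response = 2 * dose + rng.normal(scale=3.0, size=100)
r, p_value = stats.pearsonr(dose, response)
print(f"Pearson's r: {r:.2f} (p = {p_value:.3g})")
```

Either number travels well alongside the p-value: it tells a reader how big the effect is, not just whether it cleared a significance threshold.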
P-values also don’t tell you the chance that the intervention is effective; for that you need the precision (positive predictive value) of your testing procedure, and base rates drive this calculation. If the base rate of effective interventions is low, there are many opportunities for false positives, even when a hypothesis test shows a statistically significant result.
For example, if the precision works out to 65%, then even after a statistically significant result there is only a 65% chance that the intervention is actually effective, leaving a false discovery rate of 35%. Neglecting the impact of base rates is known as the base rate fallacy, and it happens more often than you might think.
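To see how the base rate enters the calculation, here is a minimal sketch in Python; the base rate, power, and significance level below are illustrative assumptions, not numbers from any particular study:

```python
# Precision (positive predictive value) of a significance-testing procedure,
# computed from an assumed base rate, power, and significance level.

base_rate = 0.10   # fraction of tested interventions that are truly effective
power     = 0.80   # P(significant result | intervention is effective)
alpha     = 0.05   # P(significant result | intervention is ineffective)

true_positives  = base_rate * power          # effective and detected
false_positives = (1 - base_rate) * alpha    # ineffective but "significant"

precision = true_positives / (true_positives + false_positives)

print(f"precision (PPV):      {precision:.0%}")      # ~64%
print(f"false discovery rate: {1 - precision:.0%}")  # ~36%
```

Even with a respectable 80% power and the usual 5% significance level, a low base rate leaves roughly a third of the “significant” findings as false positives.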
Lastly, p-values also can’t tell you whether a hypothesis is true or false. Statistics is an inferential framework and there’s no way to know for sure if some hypothesis is true or not. Remember, there’s no such thing as proof in science.
The p-hacking problem
As a scientist, one of your degrees of freedom when setting up a hypothesis test is deciding which variables to include in the data you test. Your hypothesis will, to a degree, influence which variables you might include in the data and after testing the hypothesis with those variables, you might get a p-value greater than 5%.
At this point, you might be tempted to try different variables in your data and retest. But if you try enough combinations of variables and test each scenario, you’re likely to get a p-value of 5% or less, as demonstrated by the interactive app in this FiveThirtyEight blog post. This practice is called p-hacking, and it can let you reach a p-value of 5% or less under competing alternative hypotheses.
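As a minimal sketch of why this works (a pure-noise simulation in Python; every variable below is random, so the null hypothesis is true for all of them), trying enough variables will produce “significant” results by chance alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_subjects  = 50
n_variables = 100  # candidate variables to "try" one at a time

# An outcome and a pile of candidate predictors, all pure noise:
# the null hypothesis is true for every single variable.
outcome    = rng.normal(size=n_subjects)
candidates = rng.normal(size=(n_variables, n_subjects))

significant = []
for i, variable in enumerate(candidates):
    r, p = stats.pearsonr(variable, outcome)
    if p <= 0.05:
        significant.append(i)

print(f"{len(significant)} of {n_variables} noise variables came out "
      f"'significant' at p <= 0.05")  # around 5, by construction
```

Report only the handful of lucky variables and the result looks like a discovery; report the full search and it looks like exactly what it is, noise.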
There are at least a few problems with this:
- Since you can get a statistically significant p-value under competing alternative hypotheses as a result of the data you choose to include in testing, p-hacking doesn’t help you get closer to the truth of the thing you’re studying. Even worse, if such results are published and the research makes its way into conventional wisdom, it’ll be difficult to remove.
- As the number of hypothesis tests performed increases, the rate of false positives (i.e. erroneously calling a null finding significant) increases. With 20 independent tests at $\alpha = 0.05$, the chance of at least one false positive is already $1 - 0.95^{20} \approx 64\%$.
- You might be falling victim to confirmation bias by ignoring the results of other hypothesis tests performed and only considering results of the tests that align with your beliefs.
- Since many journals require a p-value of 5% or less for publication, there is an incentive to p-hack your way to that 5% threshold, creating not only an ethical dilemma but also lower-quality research.
Addressing p-hacking
To help mitigate p-hacking, you should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed.
If you performed multiple hypothesis tests without a strong basis for expecting the result to be statistically significant, as can happen in genomics where genotypes for millions of genetic markers can be measured and tested, you should verify that there was some sort of control for the family-wise error rate or false discovery rate (as discussed in the next section). Otherwise, the study might not be meaningful.
It might also be a good idea to report the power of the hypothesis test, that is, the probability of correctly rejecting the null hypothesis when it is false. Keep in mind that power is influenced by the sample size, the significance level, the variability in the dataset, and how far the true parameter is from the parameter assumed by the null hypothesis.
In short, the greater the sample size, the greater the power. The greater the significance level, the greater the power. The lower the variability in the dataset, the greater the power. And the further away the true parameter is from the parameter assumed by the null hypothesis, the greater the power.
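As a sketch of those relationships (a one-sided, one-sample z-test with known standard deviation, chosen because its power has a simple closed form; all numbers are illustrative):

```python
import numpy as np
from scipy import stats

def z_test_power(effect, sigma, n, alpha):
    """Power of a one-sided, one-sample z-test.

    effect : true mean minus the mean assumed by the null hypothesis
    sigma  : known population standard deviation
    n      : sample size
    alpha  : significance level
    """
    z_crit = stats.norm.ppf(1 - alpha)
    return stats.norm.sf(z_crit - effect * np.sqrt(n) / sigma)

print(z_test_power(effect=0.5, sigma=2.0, n=50,  alpha=0.05))  # baseline: ~0.55
print(z_test_power(effect=0.5, sigma=2.0, n=200, alpha=0.05))  # larger sample  -> ~0.97
print(z_test_power(effect=0.5, sigma=2.0, n=50,  alpha=0.10))  # larger alpha   -> ~0.69
print(z_test_power(effect=0.5, sigma=1.0, n=50,  alpha=0.05))  # less variation -> ~0.97
print(z_test_power(effect=1.0, sigma=2.0, n=50,  alpha=0.05))  # bigger effect  -> ~0.97
```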
Testing multiple hypotheses with the Bonferroni Correction
Since the probability of false positives increases as the number of hypothesis tests performed increases, it is necessary to try and control this. As such, you might want to control the probability of one or more false positives out of all hypothesis tests conducted. This is sometimes called the family-wise error rate.
One way to control for this is to set the significance level to $\alpha/n$ where $n$ is the number of hypothesis tests. This kind of correction is called the Bonferroni correction and ensures that the family-wise error rate is less than or equal to $\alpha$.
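As a minimal sketch (Python, with a made-up list of p-values), applying the correction is just a change of threshold:

```python
import numpy as np

alpha = 0.05
p_values = np.array([0.001, 0.008, 0.012, 0.041, 0.049, 0.20])  # illustrative values
n_tests = len(p_values)

# Bonferroni: compare each p-value against alpha / n_tests instead of alpha.
bonferroni_significant = p_values <= alpha / n_tests
print(bonferroni_significant)  # [ True  True False False False False]
```

At the uncorrected 5% level, five of these six tests would look significant; after the correction only two survive.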
However, this correction can be too strict, especially if you’re performing many hypothesis tests. Because you’re controlling the probability of even one false positive across every test, you might miss some true positives that would have been detected at a higher significance level.
Clearly there’s a balance to be struck between increasing the power of the hypothesis test (i.e. increasing the probability of rejecting the null hypothesis when the alternative hypothesis is true) and controlling for false positives.
Testing multiple hypotheses with the Benjamini-Hochberg procedure
Instead of trying to control for the family-wise error rate, you can instead try to control for the false discovery rate, which is the proportion of the hypothesis tests declared statistically significant that are actually false positives. In other words, the false discovery rate is equal to FP/(FP + TP).
Controlling for the false discovery rate should help you identify as many hypothesis tests with statistically significant results as possible while still keeping the proportion of false positives relatively low. Just as $\alpha$ controls the false positive rate, another significance level, $\beta$, controls the false discovery rate.
The procedure you can use to control the false discovery rate is called the Benjamini-Hochberg procedure. First choose $\beta$, the significance level for the false discovery rate. Then calculate the p-values for all null hypothesis tests performed and sort them from lowest to highest, with $i$ being the rank of each p-value in the sorted list. Now find the largest rank $k$ such that the corresponding p-value is less than or equal to $\frac{k}{m} \beta$, where $m$ is the number of null hypothesis tests performed. All null hypothesis tests with rank $i \le k$ are considered statistically significant by the Benjamini-Hochberg procedure.
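Here is a minimal sketch of the procedure in Python, applied to the same made-up p-values as the Bonferroni example above; it follows the steps just described rather than any particular library’s implementation:

```python
import numpy as np

def benjamini_hochberg(p_values, beta=0.05):
    """Boolean mask of the tests declared significant by the
    Benjamini-Hochberg procedure at false discovery rate level `beta`."""
    p_values = np.asarray(p_values)
    m = len(p_values)
    order = np.argsort(p_values)                  # indices that sort p-values ascending
    ranks = np.arange(1, m + 1)                   # i = 1, ..., m
    below = p_values[order] <= ranks / m * beta   # p_(i) <= (i / m) * beta
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1      # largest rank meeting the threshold
        significant[order[:k]] = True             # everything ranked 1..k is significant
    return significant

p_values = [0.001, 0.008, 0.012, 0.041, 0.049, 0.20]  # same illustrative values as before
print(benjamini_hochberg(p_values, beta=0.05))
# [ True  True  True False False False]
```

With these p-values the procedure makes three discoveries where the Bonferroni correction made two, which is the power-versus-false-positives trade-off in action. In practice, statsmodels.stats.multitest.multipletests implements both corrections, so you rarely need to hand-roll them.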
Conclusion
As you can see, doing good science does not just involve performing a null hypothesis test and publishing your findings when you get a p-value less than or equal to 5%. There are ways to misinterpret p-values, to tweak data to get the right p-value for a hypothesis that you’re convinced of, and to perform enough tests with different samples of data until you get the desired p-value.
But now that you’re aware of the potholes and are armed with some ways to avoid them, I hope it helps you improve the quality of your research and get closer to the truth.
References
- 5 Tips For Avoiding p-Value Potholes
- Definition of Power
- The American Statistical Association’s Statement on p-Values
- An Investigation of the False Discovery Rate and the Misinterpretation of p-values
- False Positive Rate vs False Discovery Rate
- Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing
- Introduction to Power in Significance Tests - Khan Academy
- Sensitivity and Specificity - Wikipedia
- The p-Value and the Base Rate Fallacy — Statistics Done Wrong
- How to Calculate Effect Sizes from Published Research: A simplified Methodology
- Multiple Comparisons Problem - Wikipedia