Hypothesis testing is one of the most important tools in the sciences. It allows you to investigate a thing you’re interested in and tells you how surprised you should be about the results. It’s the detective that tells you whether you should continue investigating your theory or divert efforts elsewhere. Does that diet pill you’re taking actually work? How much sleep do you really need? Does that HR-mandated team-building exercise really help strengthen your relationship with your coworkers? (Spoiler alert, it doesn’t).

Social media and the news saturate us with “studies show this” and “studies show that,” but how do you know if any of those studies are valid? And what does it even mean for them to be valid? Although studies can definitely be affected by the data collection process, what we’re going to focus on here is the actual hypothesis test itself and why being familiar with its process will arm you with the necessary skills to perform replicable, reliable, and actionable tests, and to call bullshit on a study.

In any hypothesis test, you have a default hypothesis (the null hypothesis) and the theory you’re curious about (the alternative hypothesis). The null hypothesis is the hypothesis that whatever intervention/theory you’re studying has no effect. For example, if you’re testing whether a drug is effective, the null hypothesis would state that the drug has no effect while the alternative hypothesis would posit that it does. Or maybe you’d like to know if a redesign of your company’s website actually made a difference in sales — the null hypothesis is that the redesign had no effect on sales and the alternative hypothesis is that it did.

Hypothesis testing is a bit like playing devil’s advocate with a friend, but instead of just trolling, you both go out and collect data, run repeatable tests, and determine which of you is more likely to be right. In essence, having a null hypothesis ensures that the data you’re studying is not only consistent with whatever theory you have, but also inconsistent with the negation of your theory (i.e. the null hypothesis).

How the Null Hypothesis Test Works

Once you’ve identified your null and alternative hypotheses, you need to run the test. Skipping over a bunch of math formulas, it goes something like this (there’s a short code sketch after the list):

  1. Perform an experiment (this is where you collect your data).
  2. Assume that the null hypothesis is true and let the p-value be the probability of getting results at least as extreme as the ones you got.
  3. If the p-value is quite small (e.g. < 0.05), your results are statistically significant, which gives you evidence to reject the null hypothesis; otherwise, the null hypothesis can’t be ruled out just yet.
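
To make those three steps concrete, here’s a minimal sketch in Python using the website-redesign example from earlier. The data is randomly generated and the names are made up purely for illustration; a real test would use your actual experimental data.

```python
# A minimal sketch of the three steps above, using made-up data and SciPy's
# two-sample t-test. All numbers here are purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 1. "Perform an experiment": hypothetical sales figures before and after a
#    website redesign.
sales_old_design = rng.normal(loc=100, scale=15, size=50)
sales_new_design = rng.normal(loc=108, scale=15, size=50)

# 2. Assume the null hypothesis (the redesign had no effect) and compute the
#    p-value: the probability of a difference at least this extreme arising
#    if the null were true.
t_stat, p_value = stats.ttest_ind(sales_new_design, sales_old_design)

# 3. Compare the p-value to the conventional 0.05 threshold.
if p_value < 0.05:
    print(f"p = {p_value:.4f}: statistically significant; reject the null")
else:
    print(f"p = {p_value:.4f}: can't rule out the null hypothesis yet")
```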

You might be wondering why a p-value below 0.05 means that your results are statistically significant. Let’s say your null hypothesis is that condoms don’t have an effect on STD transmission and you assume this to be true. You run your experiment, collect some data, and it turns out you get results that would have been very unlikely under that assumption (meaning the probability of getting results at least that extreme was really small, < 0.05). This might cause you to doubt the assumption you made about condoms having no effect. Why? Because you got results that would rarely occur if the null hypothesis were true, meaning your results were surprising enough to cast doubt on your assumption that the null hypothesis was true.

Consider the ways you could be wrong when interpreting the results of a hypothesis test: you can either reject the null hypothesis when it’s actually true (a Type I error), or fail to reject it when it’s actually false (a Type II error). The testing procedure doesn’t directly control the Type II error, so instead we cap the probability of a Type I error at 0.05, which is our way of saying that if the null hypothesis were true, we’d stumble onto statistically significant results by flat-out coincidence only 5% of the time (or less). It’s damage control to make sure we don’t make utter fools of ourselves, and this restriction leaves us with 95% confidence when claiming statistically significant results and a 5% chance of a false alarm. The statistically significant results we got in the above condom example could have been a fluke, just coincidence, but if condoms truly had no effect, a fluke that extreme would happen only 5% of the time or less.
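
To see where that 5% figure comes from, here’s a rough simulation sketch (the setup and numbers are invented for illustration): we force the null hypothesis to be true by drawing both groups from the same distribution, run a t-test many times, and count how often we’d wrongly declare significance anyway.

```python
# Simulate many experiments where the null hypothesis is TRUE (both groups
# come from the same distribution) and count how often we still "reject" it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    group_a = rng.normal(loc=0, scale=1, size=30)
    group_b = rng.normal(loc=0, scale=1, size=30)  # same distribution: no real effect
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1  # a Type I error: "significant" purely by coincidence

print(f"Type I error rate ≈ {false_positives / n_experiments:.3f}")  # comes out near 0.05
```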

Even though there’s sound reason for having a small p-value threshold, the actual cutoff of 0.05 happens to be a convention popularized by Ronald Fisher, considered to be the father of modern statistics, so that when a scientist talks about how they achieved statistically significant results, other scientists know that those results would have come about by coincidence at most 5% of the time if the null hypothesis were true.

A Fuzzy Contradiction

For the mathematically literate, the null hypothesis test might resemble a fuzzy version of the scaffolding for a proof by contradiction, whose steps go as follows:

  1. Suppose hypothesis, $\mathrm{H}$, is true.
  2. Since $\mathrm{H}$ is true, some fact, $\mathrm{F}$, can’t be true.
  3. But $\mathrm{F}$ is true.
  4. Therefore, $\mathrm{H}$ is false.

Compared to our steps for a hypothesis test:

  1. Suppose the null hypothesis, $\mathrm{H_0}$, is true.
  2. Since $\mathrm{H_0}$ is true, it follows that a certain outcome, $\mathrm{O}$, is very unlikely.
  3. But $\mathrm{O}$ was actually observed.
  4. Therefore, $\mathrm{H_0}$ is very unlikely.

The difference between the proof by contradiction and the steps involved in performing a hypothesis test? Absolute mathematical certainty vs. likelihood. Many people might be tempted into thinking statistics shares the same certainty enjoyed by pure mathematics, but it doesn’t. Statistics is an inferential framework and, as such, depends on data that might be incomplete or tampered with, not to mention the data could have come from an improperly set up experiment that left room for a plethora of confounding variables. Uncertainty abounds in the field, and the best answer any statistician can ever give is in terms of likelihood, never certainty.

Tuning the Microscope

As mentioned before, hypothesis testing is a scientific instrument with a degree of precision, and as such, one must carefully decide what precision is needed for a given experiment. An underpowered hypothesis test isn’t sensitive enough to detect the effect you’re trying to observe. It’s analogous to using a magnifying glass to observe one of your cheek cells: a magnifying glass is too weak to resolve something so small, and you might as well not have bothered with the test at all. Typically, a test is underpowered when the sample size is too small, so that only an implausibly large difference between groups can clear the p-value threshold of 0.05.
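
As a rough sketch of what “underpowered” means in practice (the effect size and sample sizes below are invented for illustration), here’s a simulation where a real but modest effect exists and we check how often a t-test actually detects it:

```python
# A sketch of statistical power: a real (but modest) effect exists, and we
# measure how often the test detects it at two different sample sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.3      # true difference in means, in units of the standard deviation
n_experiments = 5_000

for sample_size in (20, 200):
    detections = 0
    for _ in range(n_experiments):
        control = rng.normal(loc=0.0, scale=1.0, size=sample_size)
        treated = rng.normal(loc=true_effect, scale=1.0, size=sample_size)
        _, p_value = stats.ttest_ind(treated, control)
        if p_value < 0.05:
            detections += 1
    print(f"n = {sample_size:>3} per group: power ≈ {detections / n_experiments:.2f}")

# With 20 people per group the real effect is missed most of the time (underpowered);
# with 200 per group it's detected far more reliably.
```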

Jordan Ellenberg, in his book “How Not to Be Wrong”, gives a great example of an underpowered test. He mentions a journal article in Psychological Science that found married women in the middle of their ovulatory cycle were more likely to vote for the presidential candidate Mitt Romney. A sample of 228 women was polled; of those polled during their peak fertility period, 40.4% said they’d support Romney, while only 23.4% of the other married women, who weren’t ovulating, showed support for Romney. With such a small sample, the difference between the two groups of women was big enough to clear the p-value threshold and reject the null hypothesis (i.e. that ovulation in married women has no effect on support for Mitt Romney). Ellenberg goes on to say on page 149:

“…the difference is too big. Is it really plausible that, among married women who dig Mitt Romney, nearly half spend a large part of each month supporting Barack Obama? Wouldn’t anyone notice? If there’s really a political swing to the right once ovulation kicks in, it seems likely to be substantially smaller. But the relatively small size of the study means a more realistic assessment of the strength of the effect would have been rejected, paradoxically, by the p-value filter.”

An overpowered study has the opposite problem. Let’s say such an overpowered study (i.e. a study with a very large sample size) is performed and shows that taking a new blood pressure medication doubles your chance of having a stroke. Now some people might choose to stop taking their blood pressure meds for fear of having a stroke; after all, you’re twice as likely. But if the likelihood of having a stroke in the first place was 1 in 8,000, a number very close to zero, then doubling that number, 2 in 8,000, is still really close to zero. Twice a very small number is still a very small number. And that’s the catch: an overpowered study is really sensitive to small effects, which might come out as statistically significant but might not even matter in practice. What if a patient with heart disease suffered an infarction because he or she decided to stop taking their blood pressure meds after reading the “twice as likely to stroke” headline? Our overpowered study took a microscope to a golf ball and missed the forest for the trees. Care must be taken when reading or hearing such headlines, and questions must be asked. That all being said, in the real world an overpowered study is preferred to an underpowered one; if the test has significant results, we just need to make sure we interpret those results in a practical manner.
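
To put numbers on “twice a very small number is still a very small number,” here’s the arithmetic from the hypothetical stroke example spelled out (the 1-in-8,000 baseline is just the made-up figure from above):

```python
# Relative vs. absolute risk, using the made-up numbers from the stroke example.
baseline_risk = 1 / 8_000      # chance of a stroke without the medication
relative_risk = 2.0            # the "twice as likely" headline
new_risk = baseline_risk * relative_risk

absolute_increase = new_risk - baseline_risk
print(f"Baseline risk:      {baseline_risk:.4%}")      # 0.0125%
print(f"Risk on medication: {new_risk:.4%}")           # 0.0250%
print(f"Absolute increase:  {absolute_increase:.4%}")  # 0.0125 percentage points
# The relative risk doubles, but the absolute increase is tiny compared with the
# risk of leaving high blood pressure untreated.
```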

If you’re hard-pressed to know exactly what sample size you need for your test, perhaps due to budget constraints, the formula (for estimating a proportion to within a desired margin of error) is simple:

$\mathrm{sample\ size} = \dfrac{(\mathrm{z\ score})^2 \cdot \hat{p}\,(1 - \hat{p})}{(\mathrm{margin\ of\ error})^2}$

where the z score corresponds to your desired confidence level (roughly 1.96 for 95% confidence), $\hat{p}$ is your best guess at the proportion you’re measuring (use 0.5 if you have no idea), and the margin of error is how far off you’re willing to let your estimate be.
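
And here’s that formula as a small helper function (a sketch; the function name, defaults, and use of SciPy for the z score are my own choices):

```python
# The sample-size formula above, wrapped in a small helper function.
import math
from scipy import stats

def required_sample_size(confidence_level=0.95, margin_of_error=0.05, estimated_proportion=0.5):
    """Sample size needed to estimate a proportion to within a given margin of error."""
    # z score for a two-sided interval at the desired confidence level (~1.96 for 95%)
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
    n = z**2 * estimated_proportion * (1 - estimated_proportion) / margin_of_error**2
    return math.ceil(n)

# 95% confidence, ±5% margin of error, worst-case proportion of 0.5:
print(required_sample_size())  # 385
```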

The Importance of Replicable Research

[xkcd’s “Significant” comic]

Toward the beginning of this post, we clarified that even if the null hypothesis were true, we’re still 5% likely to reject it in favor of the alternative. That’s why we can only say we’re 95% confident in the statistically significant results we get: about 1 out of 20 times, a result will look significant even when there’s no real effect behind it. And this is exactly what the xkcd comic above is referring to; test 20 different jelly bean colors for a link to acne, and it’s not surprising that 1 out of the 20 shows a link purely by chance.
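
Back-of-the-envelope: if none of the 20 jelly bean colors actually does anything, the chance that at least one of the 20 tests still comes up “significant” at the 0.05 level is $1 - 0.95^{20} \approx 0.64$. Here’s a sketch of that calculation plus a simulation to match (the experimental setup is invented purely to mirror the comic):

```python
# The jelly bean scenario: 20 independent tests where the null is true in every
# case. How often does at least one of them come out "significant" anyway?
import numpy as np
from scipy import stats

print(f"Analytical: {1 - 0.95**20:.2f}")  # chance of at least one fluke, about 0.64

rng = np.random.default_rng(2)
n_simulations = 5_000
at_least_one_fluke = 0

for _ in range(n_simulations):
    significant = False
    for _color in range(20):  # 20 jelly bean colors, none of which affect acne
        with_bean = rng.normal(loc=0, scale=1, size=30)
        without_bean = rng.normal(loc=0, scale=1, size=30)
        _, p = stats.ttest_ind(with_bean, without_bean)
        if p < 0.05:
            significant = True
    if significant:
        at_least_one_fluke += 1

print(f"Simulated:  {at_least_one_fluke / n_simulations:.2f}")  # also about 0.64
```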

This should hammer home the importance of replicable research, which entails following the same steps as the original study, but with new data. Repeating your research with new data helps ensure that you’re not that one lucky scientist who ran the study once and found that green jelly beans had a statistically significant effect on acne.

Closing Remarks

Hypothesis testing has been a godsend in scientific investigation; it’s allowed us to focus our efforts on more promising areas of research, to challenge commonly held beliefs, and to defend against harmful actions. Although this post didn’t go into the technical details of performing an actual hypothesis test (perhaps a topic for a later post), I hope it gave you a good overview of its use cases and gotchas.

This post was inspired by some sections of the book “How Not to Be Wrong” by Jordan Ellenberg. If you haven’t read it already, I highly recommend it; a background in mathematics or statistics is not required.