Why You Must Randomize Treatment and Control Groups
You have to randomize treatment and control groups if you want to design an experiment like an A/B test. Why? Randomizing minimizes systematic differences between two groups. This helps you minimize the number of confounding variables in your experiment. With confounding variables out of the way, you can be more confident an effect was due to the treatment itself.
Awesome, this sounds great. But how do you really know that random assignment minimizes systematic differences between two groups?
If any of the above sounds confusing, you might start with Hypothesis Testing.
Randomly splitting into two groups
Say you have some finite population from which to choose members of both your treatment and control groups:
$$ x_1, \dots, x_{2N} $$
You do what you’re told and randomly assign members into two different groups. The first group:
$$ u_1, \dots, u_N $$
And the second group:
$$ v_1, \dots, v_N $$
So the sum over your finite population is just the sum over these two groups:
$$ \sum_{i=1}^{2N} x_i = \sum_{i=1}^{N} u_i + \sum_{i=1}^{N} v_i $$
If you divide both sides by $2N$, you get:
$$ \begin{aligned} \mu &= \frac{1}{2}\bar{u} + \frac{1}{2}\bar{v} \\\ &= \frac{1}{2}(\bar{u} + \bar{v}) \end{aligned} $$
So the average of the two sample means is equal to the population mean. That’s interesting.
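To see this identity on concrete numbers, here is a minimal sketch in Python. The population values, the seed, and the helper name `random_split` are made up for illustration. It shuffles a population of $2N = 10$ values, cuts it in half, and compares the population mean with the average of the two group means.

```python
import random
from statistics import mean

def random_split(population, seed=None):
    """Shuffle a copy of the population and cut it into two equal halves."""
    rng = random.Random(seed)
    shuffled = list(population)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical finite population with 2N = 10 members
population = [3, 8, 1, 9, 4, 7, 2, 6, 5, 10]
u, v = random_split(population, seed=0)

mu = mean(population)                             # population mean
average_of_means = 0.5 * mean(u) + 0.5 * mean(v)  # (u_bar + v_bar) / 2

print(mu, average_of_means)
```

The two printed numbers agree for any seed, because the identity holds for every equal-sized split, not just on average.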
Discovering the differences
But what do you know about the difference between the two groups? Well,
$$ \begin{aligned} \bar{u} - \bar{v} &= \bar{u} - (2\mu - \bar{u}) \\\ &= \bar{u} - 2\mu + \bar{u} \\\ &= 2\bar{u} - 2\mu \\\ &= 2(\bar{u} - \mu) \end{aligned} $$
So the difference between the two groups is just twice the distance from $\bar{u}$ to $\mu$. And since $u_1, \dots, u_N$ is a random sample from your population, then:
$$ \E(\bar{u}) = \mu $$
and
$$ \begin{aligned} \var(\bar{u}) &= \frac{\sigma^2}{N} \cdot \frac{2N - N}{2N - 1} \\\ &\approx \frac{\sigma^2}{2N} \text{ since } 2N - 1 \approx 2N \text{ as } N \text{ gets larger} \end{aligned} $$
where $\frac{2N - N}{2N - 1}$ is the finite population correction factor for the variance (its square root is the factor for the standard error). You need it since the size of each group is greater than 5% of the finite population size. For a further detailed derivation of $\var(\bar{u})$, you should read my post about the Central Limit Theorem.
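As a sanity check, you can verify both results exactly on a small population by enumerating every possible first group. The sketch below is an illustration under assumptions: the population values are made up, and $\sigma^2$ is taken to be the population variance (dividing by $2N$). It compares the exact variance of $\bar{u}$ with $\frac{\sigma^2}{N} \cdot \frac{2N - N}{2N - 1}$, using exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite population with 2N = 8 members
population = [3, 8, 1, 9, 4, 7, 2, 6]
two_n = len(population)
n = two_n // 2

mu = Fraction(sum(population), two_n)
# Population variance: average squared deviation over all 2N members
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / two_n

# Mean of the first group for every possible choice of N members
group_means = [Fraction(sum(g), n) for g in combinations(population, n)]

expected_mean = sum(group_means) / len(group_means)
exact_var = sum((m - expected_mean) ** 2 for m in group_means) / len(group_means)

# Finite population correction applied to the variance (no square root)
fpc = Fraction(two_n - n, two_n - 1)
formula_var = sigma2 / n * fpc

print(expected_mean == mu, exact_var == formula_var)
```

Enumeration is only feasible for tiny populations, since the number of possible groups grows very quickly with $N$.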
Now that you have these two results, you can now answer the question “what is the expected difference between both groups?”:
$$ \begin{aligned} \E(\bar{u} - \bar{v}) &= 2\E(\bar{u} - \mu) \\\ &= 2 \cdot 0 \\\ &= 0 \end{aligned} $$
And what about the variance of the difference between both groups?
$$ \begin{aligned} \var(\bar{u} - \bar{v}) &= 2^2 \var(\bar{u} - \mu) \\\ &= 4 \var(\bar{u}) \\\ &\approx 4 \cdot \frac{\sigma^2}{2N} \\\ &= \frac{2\sigma^2}{N} \end{aligned} $$
So on average, the difference between $\bar{u}$ and $\bar{v}$ is 0, which tells you that random assignment does not systematically favor either group. But what about random volatility? Large chance differences become unlikely as the groups grow, since the variance of $\bar{u} - \bar{v}$ converges to 0 as $N$ approaches infinity. This, of course, is nothing new: the variance of a sample mean decreases as the sample size increases.
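A quick simulation can make this concrete. The sketch below is an illustration, not a proof: the standard normal population, the seed, the group sizes, and the 2,000 repetitions are all arbitrary choices. For each $N$ it repeatedly splits a fixed population in half at random, then compares the observed mean and variance of $\bar{u} - \bar{v}$ with 0, with the exact value $\frac{4\sigma^2}{2N - 1}$, and with the approximation $\frac{2\sigma^2}{N}$.

```python
import random
from statistics import fmean, pvariance

def difference_of_means(population, rng):
    """Randomly split the population in half and return u_bar - v_bar."""
    shuffled = list(population)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return fmean(shuffled[:half]) - fmean(shuffled[half:])

rng = random.Random(42)
results = {}
for n in (10, 50, 250):
    # Hypothetical finite population of 2N values
    population = [rng.gauss(0, 1) for _ in range(2 * n)]
    sigma2 = pvariance(population)
    diffs = [difference_of_means(population, rng) for _ in range(2000)]
    results[n] = {
        "mean_diff": fmean(diffs),          # should be close to 0
        "var_diff": pvariance(diffs),       # observed variance of u_bar - v_bar
        "exact": 4 * sigma2 / (2 * n - 1),  # 4 * var(u_bar) with the correction
        "approx": 2 * sigma2 / n,           # large-N approximation
    }

for n, r in results.items():
    print(n, r)
```

The observed variance should track the exact value closely and shrink roughly in proportion to $1/N$.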
Confidence about the differences
You just saw that the variance converges to 0 as you keep increasing the population size. In practice, however, increasing your population size might be impractical.
So let’s say you just want to be 95% confident that the difference between the means of the two randomly selected groups, with each group having $N$ members, is less than some small number. In mathematical terms, this is asking for you to solve:
$$ 2\sigma(\bar{u} - \bar{v}) = \epsilon $$
Here $\sigma(\cdot)$ denotes the standard deviation of a random variable. You write $2 \sigma$ since any normal random variable is within two standard deviations of its mean about 95% of the time. And you know $\bar{u} - \bar{v} = 2(\bar{u} - \mu)$ is approximately normal since $\bar{u}$ is approximately normal by the Central Limit Theorem.
Solving for $N$, you get:
$$ \begin{aligned} 2 \sigma(2(\bar{u} - \mu)) &= \epsilon \\\ 4 \sigma(\bar{u} - \mu) &= \epsilon \\\ \frac{4 \sigma}{\sqrt{2N}} &= \epsilon \\\ \frac{8 \sigma^2}{\epsilon^2} &= N \end{aligned} $$
Of course this further reinforces the fact that by increasing $N$, the two sample means become arbitrarily close.
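To put numbers on this, here is a small sketch. It starts from the variance of the difference, $\var(\bar{u} - \bar{v}) \approx \frac{2\sigma^2}{N}$, so two standard deviations of the difference is about $2\sigma\sqrt{2/N}$, and setting that equal to $\epsilon$ gives $N \approx \frac{8\sigma^2}{\epsilon^2}$. The function names and the values of `sigma` and `epsilon` are made up for illustration.

```python
import math

def two_sd_of_difference(sigma, n):
    """Two standard deviations of u_bar - v_bar, using var ≈ 2 * sigma^2 / N."""
    return 2 * sigma * math.sqrt(2 / n)

def required_group_size(sigma, epsilon):
    """Smallest N for which two_sd_of_difference(sigma, N) is at most epsilon."""
    return math.ceil(8 * sigma ** 2 / epsilon ** 2)

# Hypothetical values: population standard deviation 15, tolerance 2
sigma, epsilon = 15.0, 2.0
n = required_group_size(sigma, epsilon)

print(n, two_sd_of_difference(sigma, n))
```

Note that halving $\epsilon$ quadruples the required group size, since $N$ scales with $1/\epsilon^2$.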
Conclusion
You’ve shown that under random assignment the expected difference between the two group means is 0 and that its variance shrinks as the groups grow, so any differences between the two groups at the outset of the experiment are not systematic. This means that any observed differences between the groups at the end of the experiment can be more confidently attributed to the effects of the experiment itself, rather than underlying differences between groups.
If you’re interested in more content around hypothesis testing, check out How P-Hacking Increases False Positives or Why Doing Good Science is Hard and How to Do it Better.