Bias and Variance

The Bias-Variance trade-off
Every time you build a model, a trade-off exists between bias and variance. One of the first things you look at is the prediction error on the dev and test sets. You can decompose these errors into two components known as bias and variance.
Bias is error that results from incorrect assumptions the model makes about the training data. Variance is error that arises when the model is overly sensitive to small fluctuations in the training data.
If you build a model and observe that it has a hard time predicting data it’s already seen, then your model doesn’t fit the data well. We call this high bias.
If your model is too sensitive to changes in the training data, it will end up fitting random noise rather than the intended outputs, and it will overfit your data. We call this high variance.
Since you want a model that minimizes both bias and variance, a trade-off arises. You could have a model with high training accuracy that nevertheless performs poorly on the dev and test sets. In this case, it’s better to sacrifice some of that training accuracy in exchange for better performance on the dev and test sets. This trade-off is perhaps more intuitively understood through the image below.
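To make the trade-off concrete, here is a minimal sketch (assuming NumPy and scikit-learn, with a synthetic noisy sine dataset and arbitrary polynomial degrees) that fits models of increasing complexity: training error keeps shrinking as the model gets more flexible, while dev error eventually climbs back up once the model starts chasing noise.

```python
# Hedged sketch: training vs. dev error as model complexity grows.
# Assumes NumPy and scikit-learn; the data and degrees are illustrative choices.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy samples of an underlying function

X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    dev_err = mean_squared_error(y_dev, model.predict(X_dev))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  dev MSE={dev_err:.3f}")
```

A degree-1 fit underfits (high bias), while the degree-15 fit typically posts the lowest training error but a worse dev error (high variance); the sweet spot lies in between.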

Bias-Variance decomposition
Although the bias-variance trade-off might feel experimentally familiar, it can also be verified mathematically. It turns out we can decompose prediction error into bias and variance (plus an irreducible error term).
So the data we have comes from some underlying function plus zero-mean noise,

$$y = f(x) + \varepsilon,$$

where

$$\mathbb{E}[\varepsilon] = 0$$

and

$$\operatorname{Var}(\varepsilon) = \sigma^2.$$

The underlying function $f$ is what we’re trying to approximate with some model, $\hat{f}$. The quantity we want to decompose is the expected prediction error at a point $x$,

$$\operatorname{Err}(x) = \mathbb{E}\big[(y - \hat{f}(x))^2\big].$$

So, expanding the square and using linearity of expectation,

$$\operatorname{Err}(x) = \mathbb{E}[y^2] + \mathbb{E}[\hat{f}(x)^2] - 2\,\mathbb{E}[y\,\hat{f}(x)].$$

And since $\mathbb{E}[y] = f(x)$ and $\operatorname{Var}(y) = \sigma^2$,

$$\mathbb{E}[y^2] = \operatorname{Var}(y) + \mathbb{E}[y]^2 = \sigma^2 + f(x)^2.$$

So now, applying the same identity to the model’s prediction,

$$\mathbb{E}[\hat{f}(x)^2] = \operatorname{Var}(\hat{f}(x)) + \mathbb{E}[\hat{f}(x)]^2.$$

Since the noise $\varepsilon$ is independent of $\hat{f}(x)$,

$$\mathbb{E}[y\,\hat{f}(x)] = \mathbb{E}\big[(f(x) + \varepsilon)\,\hat{f}(x)\big] = f(x)\,\mathbb{E}[\hat{f}(x)].$$

And since substituting these three pieces back in lets us complete the square,

$$\operatorname{Err}(x) = \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \operatorname{Var}(\hat{f}(x)) + \sigma^2 = \operatorname{Bias}[\hat{f}(x)]^2 + \operatorname{Var}[\hat{f}(x)] + \sigma^2,$$

we recover the decomposition: squared bias, variance, and the irreducible error $\sigma^2$.
For further justification of each of these steps, refer to the derivation procedure posted on Berkeley’s machine learning blog.
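As a sanity check on the decomposition, here is a rough simulation sketch (my own, not the blog’s procedure; the sine ground truth, degree-3 polynomial fits, and sample sizes are illustrative choices) that estimates the squared bias, variance, and noise by refitting the model on many resampled training sets, then compares their sum to a Monte Carlo estimate of the expected error.

```python
# Hedged empirical check of Err(x) = Bias^2 + Variance + sigma^2.
# All specific choices (sine ground truth, degree-3 polyfit, sample sizes) are illustrative.
import numpy as np

rng = np.random.RandomState(0)
f = np.sin                       # the (normally unknown) underlying function
sigma = 0.3                      # noise standard deviation
x_test = np.linspace(-3, 3, 50)  # fixed query points

def fit_and_predict(degree=3, n=40):
    """Draw a fresh training set from y = f(x) + eps, fit a polynomial, predict at x_test."""
    x = rng.uniform(-3, 3, n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    return np.polyval(np.polyfit(x, y, degree), x_test)

preds = np.array([fit_and_predict() for _ in range(2000)])   # one row per training set

bias_sq = (preds.mean(axis=0) - f(x_test)) ** 2              # (E[f_hat] - f)^2
variance = preds.var(axis=0)                                 # Var(f_hat)

# Independent Monte Carlo estimate of E[(y - f_hat)^2] using fresh noisy labels
y_fresh = f(x_test) + rng.normal(scale=sigma, size=preds.shape)
expected_err = ((y_fresh - preds) ** 2).mean(axis=0)

print("bias^2 + variance + sigma^2:", (bias_sq + variance + sigma ** 2).mean())
print("measured expected error:    ", expected_err.mean())
```

The two printed numbers should agree up to Monte Carlo error, mirroring the identity derived above.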
Addressing high bias and high variance
To minimize bias and variance, we need a game plan for addressing each one when it shows up. In the case of high bias, we can:
- Choose a more complex model that can learn from the data.
- For neural networks, choose a better weight initialization method.
- Train the model longer.
- Choose a model that leverages boosting (see the sketch after this list).
- Ensemble multiple models.
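As one hedged illustration of the boosting bullet, the sketch below (assuming scikit-learn and a synthetic regression dataset; the hyperparameters are arbitrary) compares a single shallow decision tree, which tends to underfit, against a gradient-boosted ensemble of equally shallow trees, which usually cuts the bias substantially.

```python
# Hedged sketch: trading a single shallow (high-bias) tree for a boosted ensemble.
# Assumes scikit-learn; dataset and hyperparameters are illustrative choices.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

models = {
    "shallow tree": DecisionTreeRegressor(max_depth=2, random_state=0),  # likely underfits
    "gradient boosting": GradientBoostingRegressor(n_estimators=300, max_depth=2, random_state=0),
}

for name, model in models.items():
    score = cross_val_score(model, X, y, scoring="r2", cv=5).mean()
    print(f"{name:17s} mean R^2 = {score:.3f}")
```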
To address high variance, we have a few more options at our disposal:
- Get more data! Theoretically, variance approaches zero as the number of samples approaches infinity.
- If you can’t get more data, use data augmentation.
- Use regularization techniques to handicap your model so that it doesn’t become too complex (see the sketch after this list).
- Ensemble multiple models.
- Introduce some noise in the data if your data set is too clean.
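As a hedged illustration of the regularization bullet, the sketch below (assuming scikit-learn; the polynomial degree, penalty strength, and dataset are illustrative choices) fits the same over-flexible polynomial model with and without an L2 penalty (ridge regression). The penalized model usually gives up a little training accuracy but closes much of the gap to the dev set.

```python
# Hedged sketch: L2 regularization (ridge) as a handicap on an over-flexible model.
# Assumes NumPy and scikit-learn; all specific settings are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.5, random_state=1)

for name, reg in [("no regularization", LinearRegression()), ("ridge, alpha=1.0  ", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), reg)
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    dev_mse = mean_squared_error(y_dev, model.predict(X_dev))
    print(f"{name}  train MSE={train_mse:.3f}  dev MSE={dev_mse:.3f}")
```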