Bias-Variance Tradeoff Explained

Understanding the Bias-Variance tradeoff at three different levels: simple, intermediate and advanced.

Anton Muehlemann
Insight


In this post, we explain the bias-variance tradeoff in machine learning at three different levels: simple, intermediate and advanced. We then follow up with some illustrative examples and close with a discussion of practical implications.

If you can’t explain it simply, you don’t understand it well enough. — Albert Einstein

Understanding the bias-variance tradeoff can be a bit tricky at first as it involves several quantities that are never available in practice. Nonetheless, it provides insights into best practices for optimizing real-world machine learning applications.

Simple

The prediction error of a machine learning model, that is, the difference between the ground truth and the trained model, can be decomposed into the sum of two error terms: the bias term and the variance term. For a fixed amount of data, changes to the model that shrink one of these terms tend to grow the other, which is why we speak of a trade-off between the two.

Total Error = Bias + Variance

The bias term measures how well our choice of model class, for example linear models, can approximate the ground truth in a best-case scenario. The variance term measures how tightly clustered trained models are around this best-case scenario. Low variance means that all trained models are very close to the best-case scenario and high variance means that they are spread out.

The bias-variance tradeoff visualized. The bias is the distance between the ground truth f and the best possible approximation within our modeling space (ğ). The g{D}’s are the results of training on different datasets D. The variance measures how much each trained function fluctuates about the best possible approximation ğ.

Intermediate

The bias-variance tradeoff refers to a decomposition of the prediction error in machine learning as the sum of a bias and a variance term.
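Schematically, and using the two terms described further below, the decomposition can be written as follows. This is a sketch: the expectation symbol E_D (an average over training sets D) is notation assumed here, and ğ is typeset as \breve{g}.

```latex
\mathbb{E}_{D}\big[(f - g_D)^2\big]
  = (f - \breve{g})^2 + \mathbb{E}_{D}\big[(\breve{g} - g_D)^2\big]
```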

Here f is the ground truth function and g{D} is the learned function given a dataset D and given your choice of the modeling space. Examples of modeling spaces could be linear models, neural networks or random forests.

The term ğ is interpreted as the “best we can do” given our modeling choice. In other words, even if we had an infinite amount of training data, we would still end up with ğ because we can only find models that lie within our modeling space.

The first term (f - ğ)² is the bias term and measures the distance between the “best we can do” and the ground truth.

The second term (ğ - g{D})² is the variance term and measures how well we can “zoom in” on the “best we can do” given different training data sets D.

An example of the bias-variance tradeoff in practice. On the top left is the ground truth function f — the function we are trying to approximate. To fit a model we are only given two data points at a time (D’s). Even though f is not linear, given the limited amount of data, we decide to use linear models. On the bottom left, we see ğ — the best linear approximation to f. The error between f and ğ represents the bias. On the right, we see three different linear functions (g{D}’s) which were learned from the three different datasets (D’s). The fluctuation of the g{D}’s about ğ is responsible for the variance.
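The two-point experiment described above can be simulated. The sketch below is a minimal illustration, not the author's code. It assumes f(x) = sin(πx) on [-1, 1] (the original figure does not state which f is plotted), draws many two-point datasets, fits a line through each, and estimates ğ as the average of the fitted lines, which is an assumption about how ğ is obtained. All names (f, n_datasets, grid, g_bar) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed ground truth; the original figure does not specify f.
    return np.sin(np.pi * x)

n_datasets = 10_000
# Each row is one dataset D: two inputs drawn uniformly from [-1, 1].
xs = rng.uniform(-1.0, 1.0, size=(n_datasets, 2))
ys = f(xs)

# Fit the line through the two points of every dataset.
slopes = (ys[:, 1] - ys[:, 0]) / (xs[:, 1] - xs[:, 0])
intercepts = ys[:, 0] - slopes * xs[:, 0]

# Evaluate every learned line g_D on a grid over the domain.
grid = np.linspace(-1.0, 1.0, 201)
preds = slopes[:, None] * grid[None, :] + intercepts[:, None]

# Average learned function, used here as the estimate of the "best we can do".
g_bar = preds.mean(axis=0)

bias = np.mean((f(grid) - g_bar) ** 2)      # squared distance between f and g_bar
variance = np.mean((g_bar - preds) ** 2)    # fluctuation of the g_D's about g_bar
error = np.mean((f(grid) - preds) ** 2)     # average squared error of the g_D's

print(f"bias={bias:.3f} variance={variance:.3f} error={error:.3f}")
```

Because g_bar is the empirical mean of the fitted lines, the estimated error equals bias plus variance up to floating-point rounding. In this setup the variance term comes out larger than the bias term, which reflects how much a line through only two points moves from one dataset to the next.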

Advanced

The bias-variance tradeoff is a decomposition of the expected prediction error of a machine learning model into a bias term and a variance term. To fully understand this decomposition we need to first define three concepts.

First, there is the ground truth function f (“God’s function”). The ultimate goal is to recover God’s function purely from the outputs it produces. Think of “God’s function” as the function that returns the correct output for any given input. For instance, this could be a cat expert who tells us whether there is a cat in an image.

Second, there is the domain X on which God’s function is defined. With X, we can associate an (unknown) probability distribution P. By drawing samples x from X we can generate a dataset D. Each dataset D is sampled from the collection of all possible datasets Đ. Just as for X, there is an (unknown) probability distribution on the set of all datasets Đ. For example, in our cat example, X is the set of all possible pictures and there is some probability of encountering any particular image. Typically D would be a training or test dataset that contains a number of different images and the set Đ would be the set of all sets of images.

Third, there is the hypothesis set H, which specifies the class of functions (models) we would like to consider when approximating God’s function f. Unlike all previous notions, the hypothesis set is chosen by you, the modeler. Unsurprisingly, this choice is responsible for the bias term. Examples of hypothesis sets could be neural networks, linear models or random forests.

The goal of machine learning is to find an element g within our hypothesis set H that matches the given dataset D best. We make this dependence explicit by calling the learned function g{D}.

Now suppose that we had access to all possible datasets Đ. Computing the expectation of g{D} with respect to those datasets gives us an “average” hypothesis ğ. We can interpret ğ as the best possible function within our hypothesis set H. With ğ at hand, we are ready to write the bias-variance tradeoff (see pdf for details on the derivation):

Error = Bias + Variance
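Written out in symbols, the decomposition takes the following standard form for squared error. This is a sketch under assumed notation: E_x averages over inputs x drawn from P, E_D averages over datasets D, and ğ is typeset as \breve{g}.

```latex
\mathbb{E}_{x}\,\mathbb{E}_{D}\big[(f(x) - g_D(x))^2\big]
  = \underbrace{\mathbb{E}_{x}\big[(f(x) - \breve{g}(x))^2\big]}_{\text{bias}}
  + \underbrace{\mathbb{E}_{x}\,\mathbb{E}_{D}\big[(\breve{g}(x) - g_D(x))^2\big]}_{\text{variance}},
\qquad \breve{g}(x) = \mathbb{E}_{D}\big[g_D(x)\big]
```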

On the left, we have the expected prediction error as the squared difference between the ground truth function f and the learned function g{D}. The expectations are taken with respect to both the domain and the datasets.

Typically, the performance would only be measured on a test set, which is a subset of X. Instead of changing X, it is easier to think about changing the probability distribution P so that it puts weight only on the elements of X that are contained in the test set. It is worth noting, however, that the bias-variance decomposition holds irrespective of your model design and of whether you measure your error on the training, validation or test set.

The first summand is the bias term and tells us how far the ground truth f is from ğ. Recall that ğ can be interpreted as the best possible approximation of f within the hypothesis set H. Notably, neither the ground truth nor the best possible approximation depends on the dataset D.

The second summand is the variance term and measures the expected fluctuation of each hypothesis g{D} about ğ.
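The step that makes the two summands add up cleanly is short. A sketch at a fixed input x, assuming ğ(x) is the expectation of g{D}(x) over datasets as defined above (ğ is typeset as \breve{g}, and the argument x is dropped for brevity):

```latex
\begin{aligned}
\mathbb{E}_{D}\big[(f - g_D)^2\big]
  &= \mathbb{E}_{D}\big[\big((f - \breve{g}) + (\breve{g} - g_D)\big)^2\big] \\
  &= (f - \breve{g})^2
     + 2\,(f - \breve{g})\,\mathbb{E}_{D}\big[\breve{g} - g_D\big]
     + \mathbb{E}_{D}\big[(\breve{g} - g_D)^2\big] \\
  &= (f - \breve{g})^2 + \mathbb{E}_{D}\big[(\breve{g} - g_D)^2\big]
\end{aligned}
```

The cross term vanishes because the expectation of ğ - g{D} over datasets is zero by the definition of ğ. Averaging both sides over x then gives the bias and variance terms.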

As a side note, if God’s function were non-deterministic, that is, if an input x could have more than one label, there would be an additional noise term in the decomposition. To learn more about the role of noise, you can check out this article on noise and bias.
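For reference, a common way to write that extra term is shown below. It assumes labels of the form y = f(x) + ε, where ε is zero-mean noise with variance σ² that is independent of the training set. The symbols y, ε and σ² are assumptions introduced here, not notation from the original post.

```latex
\mathbb{E}_{x,\varepsilon}\,\mathbb{E}_{D}\big[(y - g_D(x))^2\big]
  = \sigma^2
  + \mathbb{E}_{x}\big[(f(x) - \breve{g}(x))^2\big]
  + \mathbb{E}_{x}\,\mathbb{E}_{D}\big[(\breve{g}(x) - g_D(x))^2\big]
```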

Examples

Let’s look at some examples of the bias-variance tradeoff.

No Bias and no Variance

Since the prediction error is the sum of bias and variance, the only certain way to have no error is if our modeling space only contains God’s function f, meaning H={f}. Since our modeling space contains God’s function f, there is no bias. Furthermore, since the modeling space only has one element, we don’t need to train our model and no matter the training data, the “learned” function is always f and thus there is no variance.

High Variance

Assume we only have 100 training examples. If we attempt to fit a neural network with 10000+ parameters to these few data points, even the slightest change in the training data is likely to lead to a completely different trained model. Thus, we have high variance.

It is tempting to say that in this case we automatically have low bias. This is false. Recall that the bias does not depend on data. In fact, we don’t know the complexity of the underlying ground truth function f and it may itself have one million parameters. In this case, we would also have high bias.

The truth is that we were given only very few training examples. Even though we don’t know the bias, we know that our total error is the sum of bias and variance. Considering that our goal is to reduce the prediction error, we need to find a model whose complexity matches the low number of training examples. In other words, choosing a very complex model for only 100 data points will, absent strong regularization, give a high variance and thus make our overall prediction poor. To maximize our chances of a low prediction error, we have to pick a simpler model.
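The effect of model complexity on variance can be illustrated on a much smaller scale. The sketch below is a stand-in, not the neural network setting from the text: it assumes f(x) = sin(πx), 10 training points with a small amount of label noise, and compares a degree-1 polynomial with a degree-8 polynomial. The function bias_variance and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def f(x):
    # Assumed ground truth, for illustration only.
    return np.sin(np.pi * x)

def bias_variance(degree, n_points, n_datasets=500, noise=0.2):
    """Estimate bias and variance of least-squares polynomial fits."""
    grid = np.linspace(-1.0, 1.0, 101)
    preds = np.empty((n_datasets, grid.size))
    for i in range(n_datasets):
        # Draw a fresh training set D (inputs plus assumed noisy labels).
        x = rng.uniform(-1.0, 1.0, n_points)
        y = f(x) + rng.normal(0.0, noise, n_points)
        # Fit a polynomial of the given degree and evaluate it on the grid.
        preds[i] = Polynomial.fit(x, y, degree)(grid)
    g_bar = preds.mean(axis=0)  # average learned function
    bias = np.mean((f(grid) - g_bar) ** 2)
    variance = np.mean((g_bar - preds) ** 2)
    return bias, variance

bias_simple, var_simple = bias_variance(degree=1, n_points=10)
bias_complex, var_complex = bias_variance(degree=8, n_points=10)

print(f"degree 1: bias={bias_simple:.3f} variance={var_simple:.3f}")
print(f"degree 8: bias={bias_complex:.3f} variance={var_complex:.3f}")
```

With so few points, the degree-8 fits swing far more from dataset to dataset than the straight lines do, so their estimated variance is much larger. The bias estimate for the complex model is itself noisy here, because it is computed from a finite number of highly variable fits.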

Low Bias?

Let’s consider an extreme example where our training points are of the form (k, f(k)) and f is a random number generator. If our goal was to have no bias, we would be forced to learn a new value for each training point and thus learn infinitely many parameters. Of course, in this case, the variance would be very high as the algorithm doesn’t learn any pattern and each change in input gives a new model.

In the above situation, learning is simply not feasible. The point of a (good) random number generator is that no pattern can be learned and thus the prediction error of any model must be high.
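A small experiment makes this concrete. The sketch below assumes that f is a fixed table of uniform random numbers indexed by integers k, and uses a hypothetical "memorizing" model that stores every training pair and falls back to the training mean on unseen inputs. These choices are illustrative assumptions, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(3)

# "God's function": one random number per integer input k (assumed setup).
n_inputs = 10_000
f_values = rng.uniform(0.0, 1.0, n_inputs)

# Split the inputs into training points (k, f(k)) and unseen test points.
train_k = rng.choice(n_inputs, size=1_000, replace=False)
test_k = np.setdiff1d(np.arange(n_inputs), train_k)

# Memorizing model: one stored value (one parameter) per training point.
table = dict(zip(train_k.tolist(), f_values[train_k].tolist()))
fallback = float(f_values[train_k].mean())

def predict(k):
    # Return the memorized value if k was seen, otherwise the training mean.
    return table.get(int(k), fallback)

train_mse = np.mean([(predict(k) - f_values[k]) ** 2 for k in train_k])
test_mse = np.mean([(predict(k) - f_values[k]) ** 2 for k in test_k])

print(f"train MSE={train_mse:.4f} test MSE={test_mse:.4f}")
```

The model reproduces the training points exactly, yet on unseen inputs it does no better than guessing the mean: the test error stays close to the variance of the random numbers themselves.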

The takeaway is that, since we never know f, we need to use our domain knowledge and the data to get an idea of a good model to use. Once we have decided on our model (for example a neural network, a random forest or linear regression), we can use the training data to fix its free parameters.

Practical Implications and Conclusions

Despite its abstract nature, the bias-variance tradeoff has important practical implications.

Just by looking at the equation, we readily conclude that the bias term does not depend on training data. Thus, in particular, collecting more data cannot reduce bias. Instead, the only way to reduce bias is by changing the modeling space H.

In contrast, the variance term depends on data and collecting more data will lead to a closer clustering of learned functions g{D} about the best possible approximation ğ and thus to lower variance. Also, using regularization restricts the fluctuations of each trained function and thus also helps to reduce variance.
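Both effects can be checked numerically. The sketch below is a minimal illustration under stated assumptions: f(x) = sin(πx), noisy labels, degree-5 polynomial models, and ridge regularization as one example of a regularizer. The helper names (fit_ridge, bias_variance) and all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
DEGREE = 5   # assumed hypothesis set: polynomials of degree 5
NOISE = 0.3  # assumed standard deviation of the label noise

def f(x):
    # Assumed ground truth, for illustration only.
    return np.sin(np.pi * x)

def fit_ridge(x, y, ridge):
    """Least squares with an L2 penalty, solved as an augmented system."""
    X = np.vander(x, DEGREE + 1)
    # Minimizes ||X w - y||^2 + ridge * ||w||^2; ridge=0 is plain least squares.
    X_aug = np.vstack([X, np.sqrt(ridge) * np.eye(DEGREE + 1)])
    y_aug = np.concatenate([y, np.zeros(DEGREE + 1)])
    weights, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return weights

def bias_variance(n_points, ridge, n_datasets=500):
    """Estimate bias and variance over many independently drawn datasets."""
    grid = np.linspace(-1.0, 1.0, 101)
    G = np.vander(grid, DEGREE + 1)
    preds = np.empty((n_datasets, grid.size))
    for i in range(n_datasets):
        x = rng.uniform(-1.0, 1.0, n_points)
        y = f(x) + rng.normal(0.0, NOISE, n_points)
        preds[i] = G @ fit_ridge(x, y, ridge)
    g_bar = preds.mean(axis=0)  # average learned function
    bias = np.mean((f(grid) - g_bar) ** 2)
    variance = np.mean((g_bar - preds) ** 2)
    return bias, variance

bias_few, var_few = bias_variance(n_points=12, ridge=0.0)
bias_many, var_many = bias_variance(n_points=120, ridge=0.0)
bias_reg, var_reg = bias_variance(n_points=12, ridge=1.0)

print(f"12 points, no ridge : bias={bias_few:.3f} variance={var_few:.3f}")
print(f"120 points, no ridge: bias={bias_many:.3f} variance={var_many:.3f}")
print(f"12 points, ridge=1  : bias={bias_reg:.3f} variance={var_reg:.3f}")
```

In this sketch, both collecting ten times more data and adding the ridge penalty lower the estimated variance relative to the unregularized fit on 12 points. Regularization typically achieves this at the cost of some additional bias, since it effectively restricts the modeling space.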

Ultimately, as the prediction error is the sum of bias and variance, a good model needs to strive for both low bias and low variance. Thus, as machine learning practitioners, our goal is to find a modeling space that has the right complexity for both the problem and the data we are given.
