Why an MSE Loss Might Make Your Self-Driving Car Crash
Towards the end of my PhD, I co-authored a paper on model misspecification (Cervera* et al., 2021). The idea struck me while reviewing the self-driving car literature. I noticed that almost everyone was training steering angle predictors using the mean squared error (MSE) loss. It’s an understandable choice: MSE is the default loss for regression in machine learning.
Here’s the catch: using MSE for steering angle prediction is a bad idea because it trains your model to average over options rather than select one.
Our paper explored this problem in depth, but the core insight — that MSE can cause a catastrophe — got buried under layers of cautious phrasing and technical detail. That’s how scientific writing works: every claim must be hedged, every conclusion defended. It’s the right approach for academics, but it often buries the simple, intuitive message.
A blog post is different. Here, I can say what I really think: if you train a self-driving car with MSE, you’re optimizing for the wrong thing, and it might crash.
In this post, I’ll unpack that message. First, I’ll explain the mathematical intuition behind MSE — it predicts the mean of the target predictive distribution. From there it’s easy to see why even a model with near-zero MSE can still fail dramatically, and why this matters if you care about staying on the road (literally or metaphorically).
The MSE Loss Forces the Model to Predict the Mean of a Gaussian
To understand why MSE behaves the way it does, we need to look at where it comes from.
For a deeper dive into the origins of common loss functions, see chapter 2 of my thesis: (Henning, 2022).
In practice, maximum likelihood training essentially boils down to minimizing a Kullback–Leibler divergence \(\text{KL}\left( p(y \mid x) ; q(y \mid w, x) \right)\) between an assumed ground-truth distribution \(p(y \mid x)\) and a parametrized model distribution \(q(y \mid w, x)\).
In our case, \(x\) is the image from the car’s front camera and \(y \in \mathbb{R}\) is the steering angle. In simple words, we assume that every image \(x\) induces a distribution over steering angles \(p(y \mid x)\). We also assume that human drivers sample from this distribution, generating data tuples \((x, y)\), where \(x\) is the image the driver sees and \(y\) is the steering angle they choose.
Now, since \(y\) is continuous, we are dealing with a regression problem. Next, we need to choose a model \(q(y \mid w, x)\) parametrized by \(w\) that can faithfully resemble the unknown ground-truth data generating process \(p(y \mid x)\). Unfortunately, continuous probability distributions are a bit cumbersome to work with, so for the sake of simplicity we fall back on a Gaussian approximation \(q(y \mid w, x) = \mathcal{N}(y; f(x; w), \sigma^2)\), where \(f(x; w)\) is (for us) a neural network with parameters \(w\) that predicts the mean of the Gaussian.
Taking all of these assumptions together, we can arrive at a tractable loss function for our optimization problem:
\[\min_w \mathbb{E}_{p(x)} \left[ \text{KL}\left( p(y \mid x) ; q(y \mid w, x) \right) \right] \Leftrightarrow \min_w - \mathbb{E}_{p(x,y)} \left[ \log q(y \mid w, x) \right] \Leftrightarrow \min_w \mathbb{E}_{p(x,y)} \left[ \left( y - f(x;w) \right)^2 \right]\]

Taking a Monte-Carlo estimate of the last expected value yields the classical MSE loss used for regression.
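To make that last step concrete, here is a minimal NumPy sketch (with dummy data, not an actual steering model) showing that the Monte-Carlo estimate of the last expectation is just the familiar MSE, and that for a fixed \(\sigma\) the Gaussian negative log-likelihood is simply a scaled and shifted version of it:

```python
import numpy as np

# Dummy data standing in for steering angles y and model predictions f(x; w).
rng = np.random.default_rng(0)
y = rng.normal(size=1000)   # observed steering angles
f = rng.normal(size=1000)   # model predictions f(x; w)
sigma = 1.0                 # fixed observation noise of the Gaussian q

# Monte-Carlo estimate of E[(y - f(x; w))^2]: the classical MSE loss.
mse = np.mean((y - f) ** 2)

# Negative log-likelihood of y under N(y; f, sigma^2), averaged over samples.
nll = np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - f) ** 2 / (2 * sigma**2))

# For fixed sigma: nll = mse / (2 * sigma^2) + const, so both share the same
# minimizer in w. The last printed value recovers the MSE exactly.
print(mse, nll, 2 * sigma**2 * (nll - 0.5 * np.log(2 * np.pi * sigma**2)))
```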
How the MSE Loss Can Make You Crash
Here’s the core problem: MSE ignores safety and plausibility. It simply averages over all possible outcomes — and that average can be the worst choice.
The previous section highlighted that many assumptions go into the derivation of the MSE loss. One crucial assumption was that we directly predict a mean which, at optimality, matches the mean of the ground truth \(p(y \mid x)\) (cf. SM C in (Cervera* et al., 2021)).
This is a problem, because we are not interested in the mean of \(p(y \mid x)\), but in the plausible choices it offers.
In the figure below, a highway lane splits into two: the driver can either turn left or right (\(p(y \mid x)\) is bimodal). The mean of these choices points right in between the two lanes.

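The same effect is easy to reproduce numerically. Here is a toy NumPy sketch of the fork-in-the-road situation (the angles and noise levels are made up for illustration): half of the drivers steer left, half steer right, and the MSE-optimal prediction lands in the middle, an angle no driver ever chose.

```python
import numpy as np

# Toy version of the figure: for one fixed image x, half of the drivers steer
# left (-30 degrees) and half steer right (+30 degrees), so p(y | x) is bimodal.
rng = np.random.default_rng(0)
left = rng.normal(-30.0, 1.0, size=500)
right = rng.normal(30.0, 1.0, size=500)
y = np.concatenate([left, right])

# Scan candidate predictions and find the one that minimizes the MSE.
candidates = np.linspace(-45, 45, 901)
mse = [np.mean((y - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]

# Both values are ~0 degrees: straight between the lanes, an angle that no
# driver in the data set ever chose.
print(best, y.mean())
```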
Beyond Cars: Why This Problem Matters
Any time you’re averaging over multiple plausible outcomes, MSE will happily produce a result that doesn’t correspond to any real-world action.
In self-driving, relying on a single steering-angle prediction would be unwise. Navigation goals, lane detection, and planning signals can all interact to mitigate risk. But these systems only reduce the chance of catastrophic failure — they don’t fix the root cause: the wrong modelling choice.
So here’s my advice: if you’re working on regression, don’t blindly pick MSE just because textbooks treat it like the universal default. There are alternatives, some of which we discuss in our paper.
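As a flavor of what such an alternative can look like, here is a short PyTorch sketch of a mixture-density-style output head, which lets the model represent several plausible steering angles instead of a single mean. This is generic illustrative code, not necessarily the approach from our paper, and the feature size and component count are arbitrary.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal, MixtureSameFamily

class MixtureHead(nn.Module):
    """Predicts a small Gaussian mixture over steering angles."""

    def __init__(self, in_features: int, n_components: int = 2):
        super().__init__()
        self.logits = nn.Linear(in_features, n_components)      # mixture weights
        self.means = nn.Linear(in_features, n_components)       # component means
        self.log_scales = nn.Linear(in_features, n_components)  # component spreads

    def forward(self, h: torch.Tensor) -> MixtureSameFamily:
        mix = Categorical(logits=self.logits(h))
        comp = Normal(self.means(h), self.log_scales(h).exp())
        return MixtureSameFamily(mix, comp)

# Training still maximizes the likelihood, but q(y | w, x) can now be bimodal,
# so the model is no longer forced to place its prediction between the lanes.
h = torch.randn(8, 64)   # dummy image features
y = torch.randn(8)       # dummy steering angles
loss = -MixtureHead(64)(h).log_prob(y).mean()
loss.backward()
```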
If you’re curious, take a look! 😉
References