The missing question in supervised learning

Vincenzo Coia

You all know the drill – you’re asked to make predictions of a continuous variable, so you turn to your favourite supervised learning method to do the trick. But have you ever suspected that you could be after the wrong type of output before you even begin?

Regression trees, loess, linear regression… you name it, they’re all in pursuit of the mean (well, almost all). But the true outcome is random. It has a distribution. Are you sure you want the mean of that distribution?

You might say “Yes! It ensures my prediction is as close as possible to the outcome!” If this is indeed what you want, the mean still might not be your best choice – it only ensures the mean squared error is minimized.

There is a suite of other options that might be more appropriate than the mean. The good thing is, your favourite supervised learning method probably has a natural extension for estimating these alternatives. Let’s investigate the quantities you might care about.

The Median

No, the median isn’t just an inferior version of the mean, to be used only in the unfortunate presence of outliers.

If I randomly pick a data scientist, what do you think their salary would be? This distribution has a right-skew, so chances are, your data scientist earns less than the mean. Predict the median, and you’ll have a 50% chance that your data scientist does earn at least what you predict.

In short, use the median when you want your prediction to be exceeded with the probability of a coin toss.

Minimize the mean absolute error to get this prediction.
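To see why the mean absolute error leads to the median, here is a minimal sketch that brute-forces the best constant prediction for a small right-skewed sample. The salary figures are made up for illustration, and `mean_absolute_error` and `best_prediction` are hypothetical names, not part of any library.

```python
import statistics

# Hypothetical right-skewed salaries (in thousands); invented for illustration.
salaries = [62, 70, 75, 80, 85, 90, 98, 110, 320]

def mean_absolute_error(prediction, outcomes):
    """Average absolute distance between one constant prediction and the outcomes."""
    return sum(abs(y - prediction) for y in outcomes) / len(outcomes)

# Brute-force search over whole-number candidate predictions.
candidates = range(50, 351)
best_prediction = min(candidates, key=lambda c: mean_absolute_error(c, salaries))

sample_median = statistics.median(salaries)
sample_mean = statistics.mean(salaries)

# Count how many salaries fall below each summary.
below_mean = sum(y < sample_mean for y in salaries)
below_median = sum(y < sample_median for y in salaries)

print("MAE-minimizing prediction:", best_prediction)
print("median:", sample_median, "mean:", sample_mean)
print("salaries below the mean:", below_mean, "of", len(salaries))
print("salaries below the median:", below_median, "of", len(salaries))
```

In this toy sample the search lands on the median, while most of the salaries sit below the mean, which is the right-skew effect described above.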

Higher (or lower) Quantiles

Want to make it to an interview on time? You add some “buffer time” to the expected travel time, right? What you’re after is a high quantile of travel time – something like the 0.99-quantile, so that there is only a small chance you’ll be late (1% in this case).

Use a high (or low) quantile if you want a conservative (or liberal) prediction – or both, if you want a prediction interval.

Minimize the mean rho function (the check function from quantile regression, also called the pinball loss) to get this prediction.
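Here is a minimal sketch of that loss at work, assuming the usual definition rho(u) = u * (tau - 1) when u < 0 and rho(u) = u * tau otherwise, where u is the outcome minus the prediction and tau is the quantile level. The travel times are invented, and with only 15 of them the sketch uses tau = 0.9 rather than 0.99. The names `rho`, `mean_rho`, and `best_quantile_prediction` are hypothetical.

```python
def rho(u, tau):
    """Check (pinball) function: tau * u if u >= 0, (tau - 1) * u if u < 0."""
    return u * (tau - (1 if u < 0 else 0))

def mean_rho(prediction, outcomes, tau):
    """Average check loss of one constant prediction against the outcomes."""
    return sum(rho(y - prediction, tau) for y in outcomes) / len(outcomes)

def best_quantile_prediction(outcomes, tau, candidates):
    """Brute-force the candidate that minimizes the mean check loss."""
    return min(candidates, key=lambda c: mean_rho(c, outcomes, tau))

# Hypothetical travel times in minutes; invented for illustration.
travel_times = [22, 24, 25, 25, 26, 27, 28, 30, 31, 33, 35, 38, 41, 47, 60]
candidates = range(20, 66)

high = best_quantile_prediction(travel_times, 0.9, candidates)
middle = best_quantile_prediction(travel_times, 0.5, candidates)

# Share of trips that took longer than the high-quantile prediction.
late_share = sum(t > high for t in travel_times) / len(travel_times)

print("tau = 0.9 prediction:", high)
print("tau = 0.5 prediction:", middle)
print("share of trips exceeding the high prediction:", late_share)
```

Setting tau = 0.5 weights over- and under-prediction equally, so the loss reduces to half the absolute error and the answer is the median. Pushing tau towards 1 makes under-prediction more costly, which drags the prediction up into the tail.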

The Mean

The mean is useful when we care about totals. Want to know how much gas a vehicle uses? You’re after the mean, because the total quantity drawn out over time is what matters.

Minimize the mean squared error to get this prediction.
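As a rough illustration of both points, the sketch below brute-forces the constant prediction with the smallest mean squared error and then checks which summary recovers the total. The fuel figures are made up, and `mean_squared_error` and `best_prediction` are hypothetical names.

```python
import statistics

# Hypothetical litres of fuel used on each of 8 trips; invented for illustration.
fuel_per_trip = [3, 4, 4, 5, 5, 6, 8, 21]

def mean_squared_error(prediction, outcomes):
    """Average squared distance between one constant prediction and the outcomes."""
    return sum((y - prediction) ** 2 for y in outcomes) / len(outcomes)

# Brute-force search over whole-number candidate predictions.
best_prediction = min(range(0, 26), key=lambda c: mean_squared_error(c, fuel_per_trip))

n_trips = len(fuel_per_trip)
actual_total = sum(fuel_per_trip)
total_from_mean = statistics.mean(fuel_per_trip) * n_trips
total_from_median = statistics.median(fuel_per_trip) * n_trips

print("MSE-minimizing prediction:", best_prediction)
print("actual total:", actual_total)
print("total implied by the mean:", total_from_mean)
print("total implied by the median:", total_from_median)
```

Scaling the mean by the number of trips gives back the total fuel used, whereas scaling the median falls short here because it ignores how large the one expensive trip was.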

Other Options

Do you really need to distill your prediction down to a single number? Consider looking at the entire distribution of the outcome as your prediction (typically conditional on predictors) – after all, this conveys the entire uncertainty about the outcome. This is known as probabilistic forecasting.
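One very simple way to picture a distributional prediction is the empirical distribution of past outcomes. The sketch below ignores predictors entirely, which a real probabilistic forecast typically would not, and uses invented travel times; `empirical_cdf` is a hypothetical helper, while `statistics.quantiles` is from the Python standard library (3.8 and later).

```python
import statistics

# Hypothetical travel times in minutes; invented for illustration.
travel_times = [22, 24, 25, 25, 26, 27, 28, 30, 31, 33, 35, 38, 41, 47, 60]

def empirical_cdf(x, outcomes):
    """Estimated probability that the outcome is at most x."""
    return sum(y <= x for y in outcomes) / len(outcomes)

# Nine cut points splitting the sample into ten equally likely groups.
deciles = statistics.quantiles(travel_times, n=10)

# With the whole distribution in hand, any question can be answered from it.
prob_within_30 = empirical_cdf(30, travel_times)
prob_over_45 = 1 - empirical_cdf(45, travel_times)

print("deciles:", deciles)
print("P(travel time <= 30 min):", prob_within_30)
print("P(travel time > 45 min):", prob_over_45)
```

The point of the sketch is that the mean, the median, and any quantile are all summaries you can read off afterwards, so keeping the distribution postpones the choice rather than forcing it up front.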

There are other measures, too. Expected shortfall is useful for risk analysis, as are expectiles. Maybe you care about variance or skewness for some reason. Whatever you want to get at, just make sure you ask yourself what you actually care about. You have an entire distribution to distill!
