‘Variance explained’. It’s so simple that it’s seductive for researchers and decision makers. You fit a model and compute its \(R^{2}\) – the proportion of variation in the response explained by the statistical model. With \(R^{2}\) in hand, you get to make wonderfully simple statements like ‘according to my model, 90% of the variation in Y is due to X’.
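To make the quantity concrete, here’s a minimal sketch on simulated data – the data, model, and seed are all made up for illustration:

```python
import numpy as np

# Toy least-squares fit (simulated data) showing what R^2 measures:
# R^2 = 1 - SS_residual / SS_total.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, size=x.size)  # Y driven by X plus noise

slope, intercept = np.polyfit(x, y, 1)   # ordinary least squares line
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot                 # 'proportion of variance explained'
print(round(r2, 3))
```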

I’ve got two issues with \(R^2\).

‘Variance explained’ sounds simple, but it doesn’t actually mean anything precise (there’s a bit more about this in this paper from Chris Achen). Explained variance is a function of both the data and the model, and the \(R^2\) statistic doesn’t tell you which one matters more: you can have a high \(R^2\) with bad data and a ‘good’ model implemented in a way that doesn’t actually answer any research question, or you can have good data but overfit the hell out of your model and end up with a high \(R^2\) (and a useless model).
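The overfitting half of that claim is easy to demonstrate. Below, the response is pure noise – there is literally nothing to ‘explain’ – yet a silly high-degree polynomial still reports a high in-sample \(R^2\) (the data and degrees here are my own invented illustration):

```python
import numpy as np

# Pure noise: no relationship exists, yet a flexible enough model
# still reports a high in-sample R^2.
rng = np.random.default_rng(1)
n = 20
x = np.linspace(0, 1, n)
y = rng.normal(size=n)  # response is noise, unrelated to x

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

lin = np.polynomial.Polynomial.fit(x, y, 1)(x)    # honest straight line
over = np.polynomial.Polynomial.fit(x, y, 15)(x)  # absurd degree-15 polynomial

print(round(r2(y, lin), 2), round(r2(y, over), 2))
```

The straight line correctly reports almost nothing explained; the overfit polynomial ‘explains’ most of the noise.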

‘Variance explained’ only makes sense in least squares regression models. Got binary data? You’re out of luck. Want to use the LASSO to get those predictors under control? No \(R^2\) for you. And there’s no \(R^2\) for mixed models or Bayesian models either.
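To be concrete about the binary case: what exists instead is a pile of competing ‘pseudo-\(R^2\)’s, none of which behaves like the least squares one. Here is a rough sketch of McFadden’s version on simulated data, with the logistic model fit by plain gradient ascent (an illustration, not a recommendation – in practice you’d use a proper GLM fitter):

```python
import numpy as np

# McFadden's pseudo-R^2 for a binary outcome:
#   1 - loglik(fitted model) / loglik(intercept-only model)
rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))
y = rng.binomial(1, p_true)

# Fit logistic regression by plain gradient ascent (illustration only)
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(2000):
    mu = 1 / (1 + np.exp(-X @ beta))
    beta += 0.5 * X.T @ (y - mu) / n   # step along the score

def loglik(mu):
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

mu_hat = 1 / (1 + np.exp(-X @ beta))
mcfadden = 1 - loglik(mu_hat) / loglik(np.full(n, y.mean()))
print(round(mcfadden, 2))
```

Note this is not a ‘proportion of variance’ in any sense – it’s a ratio of log-likelihoods, and it doesn’t share the OLS \(R^2\)’s interpretation.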

Well, I fib a little bit about Bayesian \(R^2\): there is this paper by Gelman and company that comes up with something, but I doubt it has all the properties that make the OLS \(R^2\) desirable. For mixed models, some have proposed extensions of \(R^2\) as well, but they aren’t comparable apples to apples with the OLS \(R^2\).
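For the curious, the quantity in that paper is roughly: for each posterior draw, the variance of the fitted values divided by the variance of the fitted values plus the residual variance – so you get a whole distribution of \(R^2\) values, not a single number. A sketch with simulated ‘posterior draws’ (made up for illustration, not from an actual MCMC fit):

```python
import numpy as np

# Sketch of a Bayesian R^2 in the spirit of Gelman et al.: per posterior
# draw s, R^2_s = Var(fit_s) / (Var(fit_s) + Var(residual_s)).
# The 'posterior draws' below are simulated, not from a real fitted model.
rng = np.random.default_rng(2)
n, n_draws = 100, 500
x = rng.normal(size=n)

# fake posterior: draws around slope 1.5, intercept 0, residual sd 1
slopes = rng.normal(1.5, 0.1, size=n_draws)
intercepts = rng.normal(0.0, 0.1, size=n_draws)
sigmas = np.abs(rng.normal(1.0, 0.05, size=n_draws))

y_fit = intercepts[:, None] + slopes[:, None] * x   # (n_draws, n) fitted values
var_fit = y_fit.var(axis=1)
bayes_r2 = var_fit / (var_fit + sigmas ** 2)        # one value per draw

print(round(float(np.median(bayes_r2)), 2))
```

The output is a posterior distribution of \(R^2\) values, which is already a different kind of object from the single number OLS gives you.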

I’ll end by pasting this response from Doug Bates when asked about \(R^2\) for (generalized) mixed models on the mixed models mailing list:

I must admit to getting a little twitchy when people speak of the “R2 for GLMMs”. R2 for a linear model is well-defined and has many desirable properties. For other models one can define different quantities that reflect some but not all of these properties. But this is not calculating an R2 in the sense of obtaining a number having all the properties that the R2 for linear models does. Usually there are several different ways that such a quantity could be defined. Especially for GLMs and GLMMs before you can define “proportion of response variance explained” you first need to define what you mean by “response variance”. The whole point of GLMs and GLMMs is that a simple sum of squares of deviations does not meaningfully reflect the variability in the response because the variance of an individual response depends on its mean.

Confusion about what constitutes R2 or degrees of freedom of any of the other quantities associated with linear models as applied to other models comes from confusing the formula with the concept. Although formulas are derived from models the derivation often involves quite sophisticated mathematics. To avoid a potentially confusing derivation and just “cut to the chase” it is easier to present the formulas. But the formula is not the concept. Generalizing a formula is not equivalent to generalizing the concept. And those formulas are almost never used in practice, especially for generalized linear models, analysis of variance and random effects. I have a “meta-theorem” that the only quantity actually calculated according to the formulas given in introductory texts is the sample mean.

It may seem that I am being a grumpy old man about this, and perhaps I am, but the danger is that people expect an “R2-like” quantity to have all the properties of an R2 for linear models. It can’t. There is no way to generalize all the properties to a much more complicated model like a GLMM.

I was once on the committee reviewing a thesis proposal for Ph.D. candidacy. The proposal was to examine I think 9 different formulas that could be considered ways of computing an R2 for a nonlinear regression model to decide which one was “best”. Of course, this would be done through a simulation study with only a couple of different models and only a few different sets of parameter values for each. My suggestion that this was an entirely meaningless exercise was not greeted warmly.

I must be a grumpy old man!