Variable Importance – Random Effects model

John Haman

2020/11/10

Categories: Statistics, R | Tags: Statistics, R

We talked last time about the nebulous concept of ‘variable importance’. I tried to make the point in the previous post that it’s not a concept that linear regression is well suited for. I feel that it does not make sense to discuss variable importance for these models because

  1. ‘Variable importance’ does not correspond to any estimate of a linear regression parameter, so it’s a concept external to the regression model. Yet there seems to be some thought that it can be made internal to the model.

  2. Variable importance is a gateway drug to model selection, which we should not spend energy on. (It’s better to just carefully shrink small effects than to spend energy worrying about whether they are truly non-zero.)¹

  3. Variable importance implies that the worth of a variable in a model can be scrutinized piecewise, one variable at a time. This is not so when variables are highly correlated.

Okay, so right out of the gate, I’m not a fan of throwing around ‘variable importance’ as if it applies in all situations (or even in the simple situation of the linear regression model).

But one exception to that rule is that I do think ‘variable importance’ makes sense in an ANOVA model, as long as the concept is taken to mean the standard deviation of batches of regression coefficients. Let me explain.

In ANOVA, the object of study is the ANOVA table, which has one row per variable. Although this is made out to be very convoluted in statistics classes, the main point of the ANOVA table is to identify the standard deviation of the regression coefficients of each variable. That is, the focus of the estimation is not on the means in all the experimental conditions, but on the variation of the means. This is the variance components perspective.

I think the ideal ANOVA table looks something like

| Variable  | DF | std. dev. of coefficients |
|-----------|----|---------------------------|
| A         | 2  | \(s_1\)                   |
| B         | 3  | \(s_2\)                   |
| A * B     | 6  | \(s_3\)                   |
| residuals | 20 | \(s_e\)                   |

The hope is that, if the model is a good fit, then the variance of the measurements \(y\) decomposes into something like

\[ \mathrm{Var}(y) = \mathrm{Var}(A) + \mathrm{Var}(B) + \mathrm{Var}(A \times B) + \mathrm{Var}(\mathrm{error}). \]

Or, in other words, the variation within and among the ANOVA variables ‘explains’ the variation in the measurements. I think it is very natural to identify the important variables as the sources of high variance.
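For concreteness, here is a minimal sketch of this calculation using lme4. The data, the factor names A and B, and the effect sizes are all hypothetical, simulated only to produce a table like the one above. Treating each batch of coefficients as a random effect makes the fitted standard deviations directly available.

```r
## Minimal sketch: variance components with lme4 on simulated data.
## The factors (A, B), replication, and effect sizes are hypothetical.
library(lme4)

set.seed(1)
dat <- expand.grid(A = factor(1:3), B = factor(1:4), rep = 1:3)
a  <- rnorm(3,  0, 1.00)  # batch of A coefficients
b  <- rnorm(4,  0, 0.50)  # batch of B coefficients
ab <- rnorm(12, 0, 0.25)  # batch of A:B coefficients
dat$y <- with(dat, a[A] + b[B] + ab[interaction(A, B)]) + rnorm(nrow(dat))

## Each batch of coefficients enters as a random effect, so VarCorr()
## prints the 'std. dev. of coefficients' column of the table above.
fit <- lmer(y ~ 1 + (1 | A) + (1 | B) + (1 | A:B), data = dat)
VarCorr(fit)
```

The VarCorr() printout is, in effect, the ideal ANOVA table: one standard deviation per batch of coefficients, plus the residual standard deviation.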

Now, one can go about this from a Bayesian POV or a Frequentist POV. Both are pretty good, but in a lot of situations we might find that the Bayesian version works a bit better. It’s somewhat traditional to use ‘non-informative’ priors for this kind of analysis, so we’ll use those as we proceed.

The absolute easiest way to do this is with rstanarm, or similar software that lets you fit a Bayesian model with one line of code, but rstanarm doesn’t permit improper priors, which I would like to use in this case.
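(For comparison, the one-liner would look something like the sketch below, fit to the same hypothetical dat as above, with the caveat that rstanarm uses its own proper, weakly informative default priors rather than the flat priors I have in mind.)

```r
## One-line Bayesian fit with rstanarm. Note: this uses rstanarm's
## proper default priors, not the flat priors adopted below.
library(rstanarm)
bfit <- stan_lmer(y ~ 1 + (1 | A) + (1 | B) + (1 | A:B), data = dat)
print(bfit)  # includes std. dev. estimates for each batch of coefficients
```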

I also want to practice my roll-your-own MCMC sampler chops, so here we go. In this situation, the Gibbs sampler is one of the most efficient methods one can use, but I don’t feel like working out the conditional distributions. It’s easier to code up a random-walk Metropolis-Hastings sampler; a sketch follows the model specification below.

```r
## Z     - design matrix, with one batch of columns per ANOVA row
## beta  - batched regression parameters
## sigma - standard deviations, one per batch plus one for the residuals
```

The model is:

\[ y \sim N\Big( \sum_m Z_m \beta_m, \, \sigma_\epsilon \Big) \]

\[ \beta_m \sim N(0, \sigma_m) \]

\[ \sigma_m \propto 1 \]

\[ \sigma_\epsilon \propto 1 \]

where \(Z_{m}\) is the design matrix for ‘batch’ \(m\) of the coefficients, i.e., the columns of the data that pertain to one row of the ANOVA table.
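Here is a minimal sketch of that sampler, written against the hypothetical dat simulated earlier. The single-block random-walk proposal and the step size are simplifications of my own choosing, not tuned recommendations; proposals that push a standard deviation below zero are rejected outright, since the flat priors live on the positive half-line.

```r
## Random-walk Metropolis-Hastings for the batched ANOVA model above.
## Assumes 'dat' from the earlier simulation sketch is in the workspace.

## Build the batched design matrices Z_m, one per ANOVA row.
Z <- list(A  = model.matrix(~ 0 + A, dat),
          B  = model.matrix(~ 0 + B, dat),
          AB = model.matrix(~ 0 + A:B, dat))
batch <- rep(seq_along(Z), times = sapply(Z, ncol))  # batch index per beta
Zfull <- do.call(cbind, Z)
y <- dat$y

## Log posterior: Gaussian likelihood, Gaussian batch priors on beta,
## flat (improper) priors on the sigmas over the positive half-line.
log_post <- function(beta, sigma, sigma_e) {
  if (sigma_e <= 0 || any(sigma <= 0)) return(-Inf)
  sum(dnorm(y, Zfull %*% beta, sigma_e, log = TRUE)) +
    sum(dnorm(beta, 0, sigma[batch], log = TRUE))
}

set.seed(2)
n_iter <- 20000
p <- ncol(Zfull)   # number of regression coefficients
M <- length(Z)     # number of batches
draws <- matrix(NA, n_iter, p + M + 1)
beta <- rep(0, p); sigma <- rep(1, M); sigma_e <- 1
lp <- log_post(beta, sigma, sigma_e)
step <- 0.1  # proposal scale; tune for a reasonable acceptance rate

for (i in 1:n_iter) {
  beta_p    <- beta    + rnorm(p, 0, step)
  sigma_p   <- sigma   + rnorm(M, 0, step)
  sigma_e_p <- sigma_e + rnorm(1, 0, step)
  lp_p <- log_post(beta_p, sigma_p, sigma_e_p)
  if (log(runif(1)) < lp_p - lp) {  # symmetric proposal: plain MH ratio
    beta <- beta_p; sigma <- sigma_p; sigma_e <- sigma_e_p; lp <- lp_p
  }
  draws[i, ] <- c(beta, sigma, sigma_e)
}

## Posterior means of the batch std. deviations (burn-in discarded):
## these estimate the 'std. dev. of coefficients' column of the table.
colMeans(draws[-(1:5000), p + 1:(M + 1)])
```

In practice one would update the batches and the scales separately, or tune a step size per parameter, to get a decent acceptance rate; the point here is only that the whole posterior fits in a couple dozen lines of R.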


  1. There is some evidence out there that LASSO doesn’t even do a good job at this. And variable selection is LASSO’s job!