Variable Importance - Linear Regression

John Haman

2020/11/01

Categories: R Statistics Tags: R Statistics

Sometimes, I hear people say that to determine the most important variables in a regression model, you have to standardize all the predictors. For example here.

The advice is good, in that raw model coefficients are computed on predictors of different scales, but it is also misguided, in that it seems to imply one should use the magnitude of the model coefficients to determine variable importance.

But standardizing variables is unnecessary. It’s probably better to use the \(t\) scores from the model to figure out what matters.

For example, the regression model

summary(fit_unscaled <- lm(mpg ~ ., data = mtcars))$coefficients
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am           2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

if believable, shows that wt is the most important variable. But after scaling the predictors,

dat_scaled <- as.data.frame(scale(mtcars))
dat_scaled$mpg <- mtcars$mpg
summary(fit_scaled <- lm(mpg ~ ., data = dat_scaled))$coefficients
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 20.0906250  0.4684931 42.8835050 6.185024e-22
## cyl         -0.1990240  1.8663298 -0.1066392 9.160874e-01
## disp         1.6527522  2.2132353  0.7467585 4.634887e-01
## hp          -1.4728757  1.4925162 -0.9868407 3.349553e-01
## drat         0.4208515  0.8743992  0.4813036 6.352779e-01
## wt          -3.6352668  1.8536038 -1.9611887 6.325215e-02
## qsec         1.4671532  1.3059782  1.1234133 2.739413e-01
## vs           0.1601576  1.0607063  0.1509915 8.814235e-01
## am           1.2575703  1.0262499  1.2254035 2.339897e-01
## gear         0.4835664  1.1017333  0.4389142 6.652064e-01
## carb        -0.3221020  1.3386010 -0.2406258 8.121787e-01

we see (again) that wt is the most important variable. (Sorry, my examples are boring.)

But under \(t\) scores, am is the second most important variable, while under the standardized regression coefficients, disp is the second most important.

(Never mind the fact that am is an indicator variable and disp is a continuous variable.) That’s another plot twist, addressed partially by this Gelman paper.
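For concreteness, one quick way to see the two rankings side by side is to sort the absolute \(t\) values and the absolute standardized coefficients from fit_scaled:

# Rank the predictors (excluding the intercept) two ways
coefs <- summary(fit_scaled)$coefficients[-1, ]
names(sort(abs(coefs[, "t value"]), decreasing = TRUE))   # by |t|: wt first, then am
names(sort(abs(coefs[, "Estimate"]), decreasing = TRUE))  # by |standardized coefficient|: wt first, then disp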

The \(t\) scores approach is in direct opposition to the standardized regression coefficients approach to variable importance. Using regression coefficients, the idea is that the most important variables have the largest effect sizes. Using \(t\) scores (or, equivalently, \(p\)-values), the idea is that the most important variables are the ones that most certainly have non-zero effects. This is what Fisher was thinking about when he thought up \(p\)-values: he was looking for a continuous measure of evidence against a single point-null hypothesis.

But this exercise is somewhat pointless, because the \(t\) scores are the same (after all, the scaling is just a linear transform of the data matrix). And it’s really the \(t\) scores that should be used to determine variable importance, because they take into account the uncertainty in the regression coefficients. At least that’s what I think for linear regression.
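A quick sanity check on that claim, using the two fits above (the intercepts differ, so drop them before comparing):

# The slope t values from the unscaled and scaled fits should agree
all.equal(summary(fit_unscaled)$coefficients[-1, "t value"],
          summary(fit_scaled)$coefficients[-1, "t value"])  # should be TRUE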

If we have groups of variables that are correlated, or a categorical variable, replace the \(t\) score with an \(F\) score.
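In base R that might look like the sketch below: drop1() gives one \(F\) statistic per term (collapsing a multi-column term such as a factor into a single test), and a nested-model anova() does the same for an arbitrary block of predictors. The gear/carb grouping here is just for illustration.

# One F test per term in the full model; a factor term gets a single multi-df F
drop1(fit_unscaled, test = "F")

# A joint F test for a block of predictors, e.g. gear and carb together
fit_reduced <- update(fit_unscaled, . ~ . - gear - carb)
anova(fit_reduced, fit_unscaled)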

For fancier models, we could start thinking about Gelman and Pardoe’s average predictive comparisons.

But what are we really doing here?

‘Variable importance’ is one of the top questions that a statistician is asked when she is presenting the results of a model. Unfortunately, it’s an exceedingly tricky question to answer, for at least a few reasons:

  1. ‘Variable importance’ does not correspond to the estimate of a parameter, so its answer is outside the scope of the model.

  2. The ‘variable importance’ question implies that variables can be examined piecewise to determine which are the most important. In practice, variables are at least partially collinear, so trying to tease out the marginal ‘importance’ of each variable leaves me scratching my head.

In our example above, wt is highly correlated with several other variables in the data frame:

cor(mtcars)[, "wt"] # Pearson correlation between wt and all other vars
##        mpg        cyl       disp         hp       drat         wt       qsec 
## -0.8676594  0.7824958  0.8879799  0.6587479 -0.7124406  1.0000000 -0.1747159 
##         vs         am       gear       carb 
## -0.5549157 -0.6924953 -0.5832870  0.4276059

So, even \(t\) scores can mislead in this example.

Variable importance is not just a function of \(x\) and \(y\), but of all the other \(x\)’s that are competing to explain \(y\) as well.

  3. ‘Variable importance’ is like a gateway drug to model selection, which is the enemy of predictive discrimination. It’s been suggested that we are better off throwing all relevant information at the model, followed by a careful application of shrinkage estimation (a small sketch of that idea follows below).
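As one sketch of that shrinkage idea (assuming the glmnet package is installed; ridge regression is just one of several ways to do this), a penalized fit keeps every predictor and shrinks the coefficients instead of discarding variables:

library(glmnet)

x <- as.matrix(mtcars[, -1])  # all predictors
y <- mtcars$mpg

# Ridge regression (alpha = 0); cross-validation chooses the penalty
cv_ridge <- cv.glmnet(x, y, alpha = 0)
coef(cv_ridge, s = "lambda.min")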

We will talk next time about (some version of) variable importance for random effects models, which makes more sense than \(t\) scores or whatever the en vogue machine learning solution is today.

(BTW, I do think there are some good reasons to standardize predictors in a regression model, but scaling for variable importance is probably not one of them, unless you are into this.)