Sometimes I hear people say that to determine the most important variables in a regression model, you have to standardize all the predictors. For example here.
The advice is partly good, because raw model coefficients are calculated on predictors with different scales and so aren’t directly comparable, but it’s also misguided, because it seems to imply that one should use the magnitude of the model coefficients to determine variable importance.
But standardizing variables is unnecessary for this purpose. It’s probably better to use the \(t\) scores from the model to figure out what matters.
For example, the regression model
summary(fit_unscaled <- lm(mpg ~ ., data = mtcars))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## am 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
if believable, shows that wt
is the most important variable. But after scaling
the data,
dat_scaled <- as.data.frame(scale(mtcars))
dat_scaled$mpg <- mtcars$mpg
summary(fit_scaled <- lm(mpg ~ ., data = dat_scaled))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0906250 0.4684931 42.8835050 6.185024e-22
## cyl -0.1990240 1.8663298 -0.1066392 9.160874e-01
## disp 1.6527522 2.2132353 0.7467585 4.634887e-01
## hp -1.4728757 1.4925162 -0.9868407 3.349553e-01
## drat 0.4208515 0.8743992 0.4813036 6.352779e-01
## wt -3.6352668 1.8536038 -1.9611887 6.325215e-02
## qsec 1.4671532 1.3059782 1.1234133 2.739413e-01
## vs 0.1601576 1.0607063 0.1509915 8.814235e-01
## am 1.2575703 1.0262499 1.2254035 2.339897e-01
## gear 0.4835664 1.1017333 0.4389142 6.652064e-01
## carb -0.3221020 1.3386010 -0.2406258 8.121787e-01
we see (again) that wt
is the most important variable. (Sorry, my examples are
boring.)
But by \(t\) score, am
is the second most important variable, while by standardized regression
coefficient, disp
is the second most important.
(Never mind the fact that am
is an indicator variable and disp
is a continuous variable.) That’s another plot twist addressed partially by this Gelman paper.
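A quick way to put the two rankings side by side, using the fits above (just a small sketch; it sorts the slopes by absolute value and ignores the intercept):
# rank the slopes by |t statistic| and by |standardized coefficient|
sort(abs(summary(fit_scaled)$coefficients[-1, "t value"]), decreasing = TRUE)
sort(abs(coef(fit_scaled)[-1]), decreasing = TRUE)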
The \(t\) score approach is in direct opposition to the standardized-regression-coefficient approach to variable importance. Using regression coefficients, the idea is that the most important variables have the largest effect sizes. Using \(t\) scores (or, equivalently, \(p\)-values), the idea is that the most important variables are the ones that most certainly have non-zero effects. This is what Fisher was thinking about when he thought up \(p\)-values: he was looking for a continuous measure of evidence against a single point null hypothesis.
But this exercise is somewhat pointless, because the \(t\) scores are the same (after all, scaling is just a linear transformation of the data matrix). And it’s really the \(t\) scores that should be used to determine variable importance, because they take into account the uncertainty in the regression coefficients. At least that’s what I think for linear regression.
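A one-line check of that claim with the two fits above (the intercept row is dropped because its \(t\) statistic does change):
# slope t statistics are identical between the unscaled and scaled fits
all.equal(summary(fit_unscaled)$coefficients[-1, "t value"],
          summary(fit_scaled)$coefficients[-1, "t value"])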
If we have groups of variables that are correlated, or a categorical variable, replace \(t\) score with \(F\) score.
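For example, a partial \(F\) test on nested models does this in base R; the grouping of disp, hp, and wt below is an arbitrary choice, just for illustration:
# F test for dropping a group of predictors from the full model
fit_reduced <- lm(mpg ~ . - disp - hp - wt, data = mtcars)
anova(fit_reduced, fit_unscaled)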
For fancier models, we could start thinking about Gelman and Pardoe’s average predictive comparisons.
But what are we really doing here?
‘Variable importance’ is one of the top questions that a statistician is asked when she is presenting the results of a model. Unfortunately, it’s an exceedingly tricky question to answer, for at least a couple of reasons:
- ‘Variable importance’ does not correspond to the estimate of a parameter, so its answer is outside the scope of the model.
- The ‘variable importance’ question implies that variables can be examined piecewise to determine which are the most important. In practice, variables are at least partially collinear, so trying to tease out the marginal ‘importance’ of each variable leaves me scratching my head.
In our example above, wt
is highly correlated with several other variables in
the data frame:
cor(mtcars)[, "wt"] # Pearson correlation between wt and all other vars
## mpg cyl disp hp drat wt qsec
## -0.8676594 0.7824958 0.8879799 0.6587479 -0.7124406 1.0000000 -0.1747159
## vs am gear carb
## -0.5549157 -0.6924953 -0.5832870 0.4276059
So, even \(t\) scores can mislead in this example.
Variable importance is not just a function of \(x\) and \(y\), but of all the other \(x\)’s that are competing to explain \(y\) as well (see the sketch after this list).
- ‘Variable importance’ is like a gateway drug to model selection, which is the enemy of predictive discrimination. It’s been suggested that we are better off throwing all relevant information at the model, followed by a careful application of shrinkage estimation.
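Here is the sketch promised above: wt’s \(t\) statistic in the full model versus its \(t\) statistic when it is the only predictor. The marginal fit is just for illustration; the difference between the two reflects how much of wt’s explanatory power is shared with the other predictors.
# wt's t statistic with all predictors in the model vs. on its own
summary(fit_unscaled)$coefficients["wt", ]
summary(lm(mpg ~ wt, data = mtcars))$coefficients["wt", ]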
We will talk next time about (some version of) variable importance for random effects models, which makes more sense than \(t\) scores or whatever the en vogue machine learning solution is today.
(BTW - I do think there are some good reasons to standardize predictors in a regression model, but scaling for variable importance is probably not one of them, unless you are into this.)