Does model selection bias p-values?

John Haman


Categories: Statistics R Tags: Statistics R


Model selections techniques have the potential to wreck inferential statistics.

My friend brought up an interesting practice at her work: Model builders were using step-wise AIC to select a linear model specification. There are dozens of possible variables, and there is little interest in trying to select a reasonable model based on subject matter expertise. After the best model is found using step-wise selection, \(p\)-values are further called in to pare down the model: predictors with large \(p\)-values get tossed out.

I quipped that this is kinda contradictory to step-wise selection with AIC since

  1. It’s possible that further refinements to the model after the step-wise selection can actually increase of the AIC.

  2. Step-wise model selection invalidates the \(p\)-values reported after the fact.

The second quip is something that doesn’t get reported enough in the stats literature, and it’s not unique to step-wise methods. Any sort of data-driven model selection is going to mess up the \(p\)-values (and the confidence intervals, and the prediction intervals, and probably some other stuff too.)

In a nutshell, by optimizing for AIC, we are minimizing the \(p\)-values, because among some classes of models, small AIC is “correlated”1 with a small set of small \(p\)-values.

I don’t mean to pick on step-wise regression in this note, it’s just one example of how modern data-mining techniques generally ruin the replicability of results. Inference after data-mining is double dipping the data: We first use the data to calculate the model that best fits the data, then we ask the model if it has found anything note worthy. But when we ask the model if it found anything good, we can forget that we already gave it the best chance to find something through data dredging!

When we do the second step (inference), virtually all statistical software forgets that we had the first (model selection) and happily reports back the \(p\)-values assuming that the final model was the first model we tested!2

Here is a small simulation to demonstrate the problem:

## Load libs
library(ggplot2); theme_set(theme_bw(base_size = 15))

To start, we use 6 uncorrelated and continuous covariates.

## Specify the true model
P <- 6
beta <- seq(0, 0, length.out = P)
N <- 100

X <- mvrnorm(n = N, mu = rep(0, P),
             Sigma = matrix(0.0, P, P) + diag(1.0, P))

## do a fake model fit to pull out some useful information
test_fit <- lm(rnorm(N) ~ X)

Now for the simulation. The basic idea is commented in the code. We’re going to keep the design matrix \(X\) constant, but in each of the \(2000\) replications, we generate a new \(y\). The true model is that none of the predictors have any effect on the outcome. That part is critical – no effect means the \(p\)-values from the true model should be uniform. At the end of the simulation, we tabulate the \(p\)-values for all the predictors into one data.frame, appending the source of the \(p\)-values, whether from a true model or from an AIC-selected model.

Now, I’m taking the easy way out by making the predictors meaningless in the true model. That’s just because I happen to know that the \(p\)-value distribution should be uniform in that case. We could cook similar examples where the true effect do matter, but I’m not sure how to work out the \(p\)-value distribution in that case.3. Either way, predictors effective or not, the model selection is going to pull the \(p\)-values closer to zero.

nsim <- 2000

## for each i
#### generate new y (X stays same, beta stays same, N same),
#### step wise regression to find best model
#### look at the p-value distr. of MEs 2, 4, and 6

out <- correct <- data.frame(matrix(vector(), nsim, P + 1))
colnames(out) <- colnames(correct) <- names(coef(test_fit))

## small helper function to pull out the p-values
get_pvals <- function(fit) summary(fit)[["coefficients"]][, 4]

for (i in 1 : nsim) {
  ## Make Data
  y <- rnorm(N, X %*% beta + 1, 1)
  DF <- data.frame(X = X, y = y)
  ## Record correct P-values
  fit <- lm(y ~ . , DF)
  correct[i, ] <- get_pvals(fit)
  ## Record AIC T-values
  fitAIC <- stepAIC(lm(y ~ . * ., DF), direction = "backward", trace = 0)
  coef_aic <- rep(1, P + 1); names(coef_aic) <- names(coef(fit))
  coef_aic <- get_pvals(fitAIC)[names(coef(fit))]
  out[i, ] <- coef_aic

## merge the simulation data for graphing

out$source <- "AIC Model"
correct$source <- "True Model"

pdat <- rbind(correct, out)

Now here is the real outcome: we graph the \(p\)-value distributions for all the main effects.

We see that the data-driven step-wise selection is able to pick up some signal where none is present: the \(p\)-value distributions after step-wise selection are not flat, they are biased toward 0. This means that \(p\)-values after model selection can appear smaller than they actually are.

ppval <- function(x) {
  ggplot(pdat, aes(x = get(x), color = source, fill = source)) +
    geom_histogram(alpha = 0.3, position = "identity", breaks = seq(0, 1, 0.1)) +
    xlab("p-value") +
    ggtitle(paste("P-value distribution for variable", x),
            "(Distribution should be uniform!)") +
    facet_grid( ~ source) +
    theme(legend.position = "bottom")

for (i in paste0("X", 1 : P)) {

So there you have it folks. If you use data driven model selection strategies, don’t take those \(p\)-values too seriously!

  1. I use the term colloquially.

  2. No, I don’t think this should be changed to correct for inference after model selection. The statistical software should not make assumptions about the intentions of the analyst.

  3. Perhaps something involving the non-central \(t\)-distribution.