More about correlated predictors

John Haman

2021/05/30

Categories: Statistics, Experimental Design

Tags: Correlation, Estimation, Simulation

The problem with correlated predictors is that there’s no way to pinpoint the unique effect of each individual predictor. Well, that’s a bit of a lie: we can estimate partial effects, provided the model is correctly specified.

Here’s the simulation: Let’s take two correlated predictors, \(x\) and \(z\), and say they have some joint effect on \(y\).

This code block simulates the correlated variates. \(x\) and \(z\) are drawn from a bivariate normal distribution with means of \(5\), unit variances, and a population Pearson correlation of \(0.9\).

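library(MASS)  # provides mvrnorm()
rho <- 0.9            # population correlation between x and z
beta <- c(3, -1, 1)   # true coefficients: intercept, x, z
# 100 bivariate normal draws: means 5, unit variances, covariance rho
dat <- data.frame(mvrnorm(100, c(5, 5), matrix(rho, 2, 2) + diag(1 - rho, 2)))
colnames(dat) <- c('x', 'z')
mat <- model.matrix( ~ x + z, dat)  # design matrix: (Intercept), x, z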
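As a quick sanity check on the setup, the sketch below rebuilds the covariance matrix and compares the sample correlation with the target of \(0.9\). The seed and the object names `Sigma` and `sim` are assumptions made for this illustration; the post's own draws were not seeded.

```r
library(MASS)

set.seed(1)  # assumed seed, only so this sketch is reproducible
rho <- 0.9
Sigma <- matrix(rho, 2, 2) + diag(1 - rho, 2)  # ones on the diagonal, rho off it
sim <- data.frame(mvrnorm(100, c(5, 5), Sigma))
colnames(sim) <- c("x", "z")

Sigma              # population covariance matrix (also the correlation matrix here)
cor(sim$x, sim$z)  # sample correlation: close to rho, but not exactly rho
```

Because the variances are both 1, the covariance matrix doubles as the correlation matrix. If an exact sample correlation is wanted, `mvrnorm()` has an `empirical = TRUE` argument that forces the sample moments to match `mu` and `Sigma`.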

We can create the response and tack on some random noise. This is a standard linear regression model \(y=X \beta + \varepsilon\).

# drop() stores y as a plain numeric vector rather than a one-column matrix
dat$y <- drop(mat %*% beta) + rnorm(100)

If we take a look at the model output, we will see that the effects are recovered up to sampling error (the true values are \(3\), \(-1\), and \(1\)):

lm(y ~ ., dat)
## 
## Call:
## lm(formula = y ~ ., data = dat)
## 
## Coefficients:
## (Intercept)            x            z  
##      3.4054      -0.7082       0.6497
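How close is "close"? The sketch below, on a fresh simulation with an assumed seed and hypothetical object names (`sim`, `fit`, `vif`), prints standard errors and confidence intervals for the full model. With a single other predictor, the variance of a slope is inflated by \(1 / (1 - r^2)\) relative to an uncorrelated design, which is about \(5.3\) when \(r = 0.9\), so the standard errors are roughly \(2.3\) times larger.

```r
library(MASS)

set.seed(1)  # assumed seed; the post's own draws are not reproducible
rho <- 0.9
sim <- data.frame(mvrnorm(100, c(5, 5), matrix(rho, 2, 2) + diag(1 - rho, 2)))
colnames(sim) <- c("x", "z")
sim$y <- drop(model.matrix(~ x + z, sim) %*% c(3, -1, 1)) + rnorm(100)

fit <- lm(y ~ x + z, sim)
summary(fit)$coefficients  # estimates with their standard errors
confint(fit)               # 95% intervals for (Intercept), x, and z

# Variance inflation factor for a slope when there is one other predictor
vif <- 1 / (1 - cor(sim$x, sim$z)^2)
vif
```

The partial effects are identifiable, just estimated less precisely than they would be with uncorrelated predictors.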

But the “issue”[1] is that correct effect identification depends on correct model specification.

Here are the “raw” effect estimates, obtained by regressing y on each predictor alone:

lm(y ~ x, dat)
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Coefficients:
## (Intercept)            x  
##     3.61735     -0.09406
lm(y ~ z, dat)
## 
## Call:
## lm(formula = y ~ z, data = dat)
## 
## Coefficients:
## (Intercept)            z  
##     2.82566      0.06294

What is going on here?

The key point is that marginal effects are not the same as conditional effects. The model y ~ x + z estimates the effect of x on y conditional on a value of z, and the effect of z on y conditional on a value of x. However, y ~ x estimates the effect of x on y unconditionally: it averages over whatever values of z tend to accompany each value of x.
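A short derivation makes this concrete under the simulation's assumptions (bivariate normal \(x\) and \(z\) with means \(5\), unit variances, and correlation \(\rho = 0.9\)). The symbols \(\beta_0\), \(\beta_x\), \(\beta_z\) for the three entries of beta are notation introduced here for illustration.

\[
\begin{aligned}
E[z \mid x] &= 5 + \rho\,(x - 5), \\
E[y \mid x] &= \beta_0 + \beta_x x + \beta_z\,E[z \mid x]
             = \bigl(\beta_0 + 5\,\beta_z (1 - \rho)\bigr) + (\beta_x + \rho\,\beta_z)\,x
             = 3.5 - 0.1\,x, \\
E[y \mid z] &= \bigl(\beta_0 + 5\,\beta_x (1 - \rho)\bigr) + (\beta_z + \rho\,\beta_x)\,z
             = 2.5 + 0.1\,z .
\end{aligned}
\]

So the marginal slopes are \(-0.1\) and \(0.1\), roughly what the single-predictor fits report (\(-0.094\) and \(0.063\)). The same relationship holds exactly within any one sample: the marginal slope equals the conditional slope plus the other predictor's conditional slope times the slope from regressing \(z\) on \(x\). The sketch below checks this, with an assumed seed and hypothetical names `sim`, `full`, `marg`, `aux`, and `implied`.

```r
library(MASS)

set.seed(1)  # assumed seed, for reproducibility of this sketch only
rho <- 0.9
sim <- data.frame(mvrnorm(100, c(5, 5), matrix(rho, 2, 2) + diag(1 - rho, 2)))
colnames(sim) <- c("x", "z")
sim$y <- drop(model.matrix(~ x + z, sim) %*% c(3, -1, 1)) + rnorm(100)

full <- coef(lm(y ~ x + z, sim))  # conditional (partial) slopes
marg <- coef(lm(y ~ x, sim))      # marginal slope of x
aux  <- coef(lm(z ~ x, sim))      # how z moves with x in this sample

# Marginal slope = conditional slope of x + (conditional slope of z) * (z-on-x slope)
implied <- unname(full["x"] + full["z"] * aux["x"])
c(marginal = unname(marg["x"]), implied = implied)
```

The two numbers agree up to floating-point error, which is the omitted-variable identity for least squares.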

When data are collected from an orthogonally designed experiment, the marginal and conditional effects coincide, and so model mis-specification isn’t such a big deal. In any other case, we need to have some good idea of what we wish to condition on. This is my “resolution” to Simpson’s “paradox”.
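The orthogonal case is easy to check. The sketch below uses a replicated \(2^2\) factorial in coded units as one example of an orthogonal design. The design, the seed, the reuse of the coefficients \(3\), \(-1\), \(1\), and the object names are all assumptions made for illustration.

```r
# A replicated 2^2 factorial: x and z each take the coded levels -1 and +1,
# and every combination appears equally often, so x and z are uncorrelated.
design <- expand.grid(x = c(-1, 1), z = c(-1, 1))
design <- design[rep(seq_len(nrow(design)), times = 25), ]

set.seed(1)  # assumed seed, for reproducibility of this sketch only
design$y <- 3 - design$x + design$z + rnorm(nrow(design))

slope_marginal    <- coef(lm(y ~ x, design))["x"]
slope_conditional <- coef(lm(y ~ x + z, design))["x"]

cor(design$x, design$z)                       # 0 by construction
all.equal(slope_marginal, slope_conditional)  # TRUE: dropping z leaves the x slope alone
```

Leaving z out still changes the residual variance, and therefore the standard errors, but the point estimate for x does not move.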

Thomas Lumley discussed this issue very clearly on his blog.


  1. For practitioners looking for causal effects.