Sometimes we use univariate data summaries to capture the information from a multi-factor experiment: for example, when we study the effect of correlated factors A and B on Y, but summarize the two effects separately. It is well known to statisticians that the marginal-effect interpretation of regression coefficients falls apart when factors are correlated. Nevertheless, researchers will probably always try to pinpoint marginal effects, even when their estimation is hazardous.
So what is lost when we ignore the correlations between our factors?
Here is an easy way to make a correlated design:

library(dplyr)

set.seed(4850)
design <- data.frame(
  A = rbinom(40, 1, 0.5),
  B = rbinom(40, 1, 0.5)
) %>% arrange(A, B)
cor(design)
##           A         B
## A 1.0000000 0.4505636
## B 0.4505636 1.0000000
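The outcome y analyzed below is never shown in the post. A self-contained sketch of one way it could have been generated; the coefficients (4, 3, 1.6) and the noise level are assumptions chosen to roughly match the lm fit shown later, not the post's actual code:

```r
set.seed(4850)
# Recreate the correlated design from above (same seed, same draws)
design <- data.frame(
  A = rbinom(40, 1, 0.5),
  B = rbinom(40, 1, 0.5)
)
# Hypothetical outcome: intercept 4, conditional effect ~3 for A, ~1.6 for B
design$y <- 4 + 3 * design$A + 1.6 * design$B + rnorm(40)
```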
One summary at a time
## # A tibble: 2 x 2
##       A `mean(y)`
##   <int>     <dbl>
## 1     0      4.50
## 2     1      8.32
## # A tibble: 2 x 2
##       B `mean(y)`
##   <int>     <dbl>
## 1     0      4.99
## 2     1      8.02
## [1] 3.820703
## [1] 3.02835
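The two numbers above are just the gaps between the group means; in base R they are one `diff(tapply(...))` each. A sketch (the outcome y is a stand-in, since the post's actual y is not shown):

```r
set.seed(4850)
# Stand-in data: correlated design plus a hypothetical outcome
design <- data.frame(A = rbinom(40, 1, 0.5), B = rbinom(40, 1, 0.5))
design$y <- 4 + 3 * design$A + 1.6 * design$B + rnorm(40)

# One-factor-at-a-time summaries: difference in group means, ignoring the other factor
marginal_A <- diff(tapply(design$y, design$A, mean))  # mean(y | A=1) - mean(y | A=0)
marginal_B <- diff(tapply(design$y, design$B, mean))  # mean(y | B=1) - mean(y | B=0)
```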
What is the problem? It's that both effects cannot be true simultaneously when the design covariates are correlated: there is only so much variation in the data.
Fixed effects
So the effects are about 4 and 3 respectively. Let’s pop open the old lm table to confirm that:
##
## Call:
## lm(formula = y ~ ., data = design)
##
## Coefficients:
## (Intercept) A B
## 4.068 3.080 1.642
Oh, this is quite a bit different. lm says the effect of A is about 3, and the effect of B is about 1.6, much smaller than what the one-at-a-time summaries (3.8 and 3.0) seem to imply.
What’s going on here? Is the effect of B 1.6 or 3? I don’t think this experiment is set up to measure the marginal effect of B at all. If the effect of B is \(E(y \mid B=1) - E(y \mid B=0)\), the design is only set up to measure differences that are conditional on A, and A is not some random variable that can simply be integrated out.
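One way to see the distinction is with a larger simulation where A and B are deliberately correlated: the marginal difference \(E(y \mid B=1) - E(y \mid B=0)\) absorbs part of the A effect, while the regression coefficient recovers the conditional effect. The coefficients and the correlation mechanism below are assumptions for illustration, not the post's setup:

```r
set.seed(1)
n <- 1e5
A <- rbinom(n, 1, 0.5)
# Make B correlated with A: B is more likely when A = 1
B <- rbinom(n, 1, ifelse(A == 1, 0.7, 0.3))
y <- 4 + 3 * A + 1.6 * B + rnorm(n)

# Marginal difference: mixes in the A effect because A and B move together.
# Here it is roughly 1.6 + 3 * (P(A=1 | B=1) - P(A=1 | B=0)) = 1.6 + 3 * 0.4 = 2.8
marginal_B <- mean(y[B == 1]) - mean(y[B == 0])

# Conditional (regression) effect: recovers the 1.6 used in the simulation
conditional_B <- coef(lm(y ~ A + B))["B"]
```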