# Univariate statistics for multivariate problems

## 2020/04/12

Categories: Statistics

Sometimes we use univariate data summaries to capture the information from a multi-factor experiment. For example, we study the effect of correlated factors A and B on Y, but summarize the effects separately. It is well known to statisticians that the marginal-effect interpretation of regression coefficients falls apart when factors are correlated. Nevertheless, researchers will probably always try to pinpoint marginal effects, even when their estimation is hazardous.

So what is lost when we ignore the correlations between our factors?

library(lme4)
library(dplyr)
library(magrittr)

This is just an easy way to make a correlated design: the two binary factors are drawn independently, but with this seed they happen to be substantially correlated in-sample.

set.seed(4850)

design <- data.frame(
  A = rbinom(40, 1, 0.5),
  B = rbinom(40, 1, 0.5)
) %>% arrange(A, B)

cor(design)
##           A         B
## A 1.0000000 0.4505636
## B 0.4505636 1.0000000
design$y <- rnorm(40, 4 + 3 * design$A + 2 * design$B, 1)

## One summary at a time

(y_by_a <- design %>% group_by(A) %>% summarize(mean(y)))
## # A tibble: 2 x 2
##       A mean(y)
##   <int>   <dbl>
## 1     0    4.50
## 2     1    8.32

(y_by_b <- design %>% group_by(B) %>% summarize(mean(y)))
## # A tibble: 2 x 2
##       B mean(y)
##   <int>   <dbl>
## 1     0    4.99
## 2     1    8.02

diff(y_by_a$`mean(y)`)
##  3.820703
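
This 3.82 does not estimate the coefficient 3 alone. Writing out the standard omitted-variable argument, using the simulation's own model $$y = 4 + 3A + 2B + \varepsilon,$$ the group-mean difference for A estimates

$$E(y \mid A = 1) - E(y \mid A = 0) = 3 + 2\,\big[E(B \mid A = 1) - E(B \mid A = 0)\big],$$

so whenever B is more common in the A = 1 group, part of B's effect leaks into the marginal summary for A.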
diff(y_by_b$`mean(y)`)
##  3.02835

What is the problem? It's that both effects cannot be true simultaneously when the design covariates are correlated! There is only so much variation in the data.

## Fixed effects

So the marginal effects are about 4 and 3, respectively. Let's pop open the old lm table to confirm that:

(fit <- lm(y ~ ., data = design))
##
## Call:
## lm(formula = y ~ ., data = design)
##
## Coefficients:
## (Intercept)            A            B
##       4.068        3.080        1.642

Oh, this is quite a bit different. lm says the effect of A is 3.1 and the effect of B is 1.6, much smaller than what the simple summaries seem to imply. What's going on here? Is the effect of B 1.6 or 3? I don't think this experiment is set up to measure the marginal effect of B. If the effect of B is $$E(y \mid B = 1) - E(y \mid B = 0),$$ the design is only set up to measure differences that are conditional on A. And A is not some random variable that can just be integrated out.

## Uncorrelated Designs

This problem goes away when we have uncorrelated designs, like a full factorial. In this regard, factorial designs are highly underappreciated. They let you estimate effects that do not depend on the rest of the design. This trait goes under the radar compared to things like optimality and power, but I hope practitioners will pay more attention to it. When you read a news story about the effect of a drug on blood pressure, you think you are reading about the marginal effect of that drug, but what you may actually be reading about is the effect of the drug given the experimental design, which goes unreported.

design_uncor <- data.frame(
  A = rep(c(0, 1), each = 20),
  B = rep(c(0, 1, 0, 1), each = 10)
) %>% arrange(A, B)

cor(design_uncor)
##   A B
## A 1 0
## B 0 1

design_uncor$y <- rnorm(40, 4 + 3 * design_uncor$A + 2 * design_uncor$B, 1)
(y_by_a <- design_uncor %>% group_by(A) %>% summarize(mean(y)))
## # A tibble: 2 x 2
##       A mean(y)
##   <dbl>     <dbl>
## 1     0      4.82
## 2     1      8.66
(y_by_b <- design_uncor %>% group_by(B) %>% summarize(mean(y)))
## # A tibble: 2 x 2
##       B mean(y)
##   <dbl>     <dbl>
## 1     0      6.33
## 2     1      7.15
diff(y_by_a$`mean(y)`)
##  3.841116

diff(y_by_b$`mean(y)`)
##  0.8228139
(fit_uncor <- lm(y ~., data = design_uncor))
##
## Call:
## lm(formula = y ~ ., data = design_uncor)
##
## Coefficients:
## (Intercept)            A            B
##      4.4082       3.8411       0.8228

The group-mean differences and the lm coefficients are identical. Orthogonal designs rule. I hope to see them used more often.
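
The relationship between the two sets of numbers can also be checked directly in the correlated case. In-sample, the group-mean difference for A equals the A coefficient plus the B coefficient times the shift in mean(B) across the A groups, exactly (OLS residuals are orthogonal to the intercept and to A, so they average to zero within each A group). A minimal, self-contained sketch of that identity; it simulates its own correlated data rather than reusing the objects above:

## Omitted-variable identity for a binary covariate:
## marginal difference in y across A
##   = beta_A + beta_B * (difference in mean(B) across A)
set.seed(1)
n <- 200
A <- rbinom(n, 1, 0.5)
B <- rbinom(n, 1, 0.3 + 0.4 * A)  # B correlated with A by construction
y <- rnorm(n, 4 + 3 * A + 2 * B, 1)

fit <- lm(y ~ A + B)
marginal_A <- mean(y[A == 1]) - mean(y[A == 0])
shift_in_B <- mean(B[A == 1]) - mean(B[A == 0])

## These two quantities match exactly, not just approximately:
marginal_A
coef(fit)["A"] + coef(fit)["B"] * shift_in_B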