Following another discussion of the 16x sample size rule for half-sized interactions on Andrew Gelman’s blog, I commented that I do not think the rule holds up under different coding schemes, but that I’d not done a simulation. In particular, I was concerned about the rule under the DOE ‘-1, +1’ coding scheme compared to the usual treatment coding scheme.
It’s actually an important distinction to make, because it’s easy to design an experiment with one coding in mind, and then analyze it under a different one. For example, lots of DOE software will use the ‘-1, +1’ coding, but R defaults to treatment coding.
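To make the distinction concrete, here is a quick illustration (mine, using only R's built-in contrast functions) of what the two codings look like for a two-level factor:

```r
# R's default treatment (dummy) coding: first level = 0, second level = 1
contr.treatment(2)
#   2
# 1 0
# 2 1

# Sum-to-zero coding, the DOE-style '+1, -1' scheme: first level = +1, second level = -1
contr.sum(2)
#   [,1]
# 1    1
# 2   -1
```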
Another commenter asked that I conduct the simulation and some discussion ensued about the effects of coding schemes on interaction sizes, and more fundamentally, what an interaction actually means.
So here is that simulation.
Function to calculate sample sizes
The function is unfortunately long, but I’ll give you the gist of what it does:
There are four inputs: the effect size `par`, the coding scheme of the design matrix `code_gen`, the coding scheme of the analysis `code`, and `main`, which toggles whether the sample size is determined by the \(p\)-value of a main effect or of the interaction.
Some common settings are made internal to the function: the significance level is set at \(5\%\) (equivalently, \(95\%\) confidence) and the target power at \(90\%\), with \(500\) simulated datasets at each candidate design size. Once the function gets going, it loops over progressively larger designs until it finds one whose estimated power is at least \(90\%\).
There are only two factors in the experiment, and as the sample size grows I increment the design by \(4\) runs at a time to keep it balanced. There is also some internal logic that sets up the right model formula and coefficient vector when toggling between main-effect power and interaction power.
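As a quick illustration (mine, not part of the original function), here is the balanced two-factor layout the function builds internally, for the smallest size \(n = 8\):

```r
n <- 8
data.frame(a = gl(2, 1, n),   # a alternates every run:   1 2 1 2 ...
           b = gl(2, 2, n))   # b alternates every 2 runs: 1 1 2 2 ...
#   a b
# 1 1 1
# 2 2 1
# 3 1 2
# 4 2 2
# 5 1 1
# 6 2 1
# 7 1 2
# 8 2 2
```

Each of the four factor combinations appears twice, so the design stays balanced whenever \(n\) is a multiple of \(4\).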
get_nsamp <- function(par = 1,
                      code = "contr.treatment",
                      code_gen = "contr.treatment",
                      main = TRUE) {
  #
  # par      : magnitude of the regression coefficients
  # code     : contrast coding used in the analysis ('contr.treatment' or 'contr.sum')
  # code_gen : contrast coding used to generate the data ('contr.treatment' or 'contr.sum')
  # main     : TRUE for main-effect power, FALSE for interaction power
  #
  alpha <- 0.05
  tgt_power <- 0.9
  nsim <- 500
  pval <- rep(NA, nsim)
  pow <- 0
  n <- 8
  #
  while (pow < tgt_power) {
    #
    # balanced two-factor design with n runs (n is a multiple of 4)
    dat <- data.frame(a = gl(2, 1, n),
                      b = gl(2, 2, n))
    if (main) {
      f1 <- formula( ~ .)
      reg_coefs <- c(0, par, par)
    } else {
      f1 <- formula( ~ . * .)
      reg_coefs <- c(0, par, par, par / 2)   # interaction is half the main effects
    }
    # generator matrix: this is where the data-generating coding enters
    X <- model.matrix(f1, data = dat,
                      contrasts.arg = list(a = code_gen,
                                           b = code_gen))
    for (i in 1 : nsim) {
      #
      dat$y <- X %*% reg_coefs + rnorm(n)
      #
      if (main) {
        fit <- lm(y ~ ., data = dat,
                  contrasts = list(a = code,
                                   b = code))
        pval[i] <- summary(fit)$coefficients[3, 4]   # p-value for the main effect of b
      } else {
        fit <- lm(y ~ . * ., data = dat,
                  contrasts = list(a = code,
                                   b = code))
        pval[i] <- summary(fit)$coefficients[4, 4]   # p-value for the a:b interaction
      }
    }
    pow <- mean(pval < alpha)
    if (pow < tgt_power) n <- n + 4
  }
  n
}
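To see what "generated under DOE coding" means in practice, here is a quick look (my illustration, not from the original post) at the generator matrix `X` for the interaction model under the two codings, using a single replicate of the \(2 \times 2\) design:

```r
dat <- data.frame(a = gl(2, 1, 4), b = gl(2, 2, 4))

# Treatment coding: 0/1 columns
model.matrix(~ a * b, dat)
#   (Intercept) a2 b2 a2:b2
# 1           1  0  0     0
# 2           1  1  0     0
# 3           1  0  1     0
# 4           1  1  1     1

# Sum (DOE) coding: +/-1 columns
model.matrix(~ a * b, dat,
             contrasts.arg = list(a = "contr.sum", b = "contr.sum"))
#   (Intercept) a1 b1 a1:b1
# 1           1  1  1     1
# 2           1 -1  1    -1
# 3           1  1 -1    -1
# 4           1 -1 -1     1
# (assign/contrast attributes omitted)
```

Multiplying the same `reg_coefs` vector by these two different matrices implies different sets of cell means, which is what makes the choice of generating coding consequential.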
In the experiment we vary both how we generate the data and how we fit the models, giving four scenarios:
- Data generated by treatment coding, Data analyzed with treatment coding
- Data generated by DOE coding, Data analyzed with treatment coding
- Data generated by treatment coding, Data analyzed with DOE coding
- Data generated by DOE coding, Data analyzed with DOE coding
Summary of experimental results:
How the data are analyzed hardly matters; the powers depend mostly on how we generate the data. This is pretty comforting, but it means we do need to be careful about how we generate the data when conducting power studies.
# data generated and analyzed with treatment coding
tm1 <- get_nsamp(main = TRUE)
ti1 <- get_nsamp(main = FALSE)
# data generated with sum (DOE) coding, analyzed with treatment coding
tm2 <- get_nsamp(par = 1 / 2, code_gen = "contr.sum", main = TRUE)
ti2 <- get_nsamp(par = 1 / 2, code_gen = "contr.sum", main = FALSE)
# data generated with treatment coding, analyzed with sum (DOE) coding
sm1 <- get_nsamp(code = "contr.sum", main = TRUE)
si1 <- get_nsamp(code = "contr.sum", main = FALSE)
# data generated and analyzed with sum (DOE) coding
sm2 <- get_nsamp(par = 1 / 2, code = "contr.sum",
                 code_gen = "contr.sum", main = TRUE)
si2 <- get_nsamp(par = 1 / 2, code = "contr.sum",
                 code_gen = "contr.sum", main = FALSE)

results <- data.frame(
  code_gen = gl(2, 2, 8, labels = c("treatment", "sum")),
  code_analysis = gl(2, 4, 8, labels = c("treatment", "sum")),
  power = gl(2, 1, 8, labels = c("main effect power", "interaction power")),
  nsamp = c(tm1, ti1, tm2, ti2, sm1, si1, sm2, si2))
knitr::kable(results)
| code_gen  | code_analysis | power             | nsamp |
|-----------|---------------|-------------------|------:|
| treatment | treatment     | main effect power |    48 |
| treatment | treatment     | interaction power |   632 |
| sum       | treatment     | main effect power |    48 |
| sum       | treatment     | interaction power |   160 |
| treatment | sum           | main effect power |    44 |
| treatment | sum           | interaction power |   636 |
| sum       | sum           | main effect power |    48 |
| sum       | sum           | interaction power |   156 |
So, why does the data generation affect the results so much? We’ll talk next time! But I find this to be a convincing argument in favor of calculating powers based on anticipated outcomes, rather than anticipated regression coefficients.