Following another discussion of the 16x sample size rule for half-sized interactions on Andrew Gelman’s blog, I commented that I do not think the rule holds up under different coding schemes, but that I’d not done a simulation. In particular, I was concerned about the rule under the DOE ‘-1, +1’ coding scheme compared to the usual treatment coding scheme.
It’s actually an important distinction to make, because it’s easy to design an experiment with one coding in mind, and then analyze it under a different one. For example, lots of DOE software will use the ‘-1, +1’ coding, but R defaults to treatment coding.
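To make the distinction concrete, here is a quick illustration (mine, using only R's built-in contrast functions) of what the two codings look like for a two-level factor:

```r
# R's default treatment (dummy) coding: first level = 0, second level = 1
contr.treatment(2)
#   2
# 1 0
# 2 1

# Sum-to-zero coding, the DOE-style '+1, -1' scheme: first level = +1, second level = -1
contr.sum(2)
#   [,1]
# 1    1
# 2   -1
```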
Another commenter asked that I conduct the simulation and some discussion ensued about the effects of coding schemes on interaction sizes, and more fundamentally, what an interaction actually means.
So here is that simulation.
Function to calculate sample sizes
The function is unfortunately long, but I’ll give you the gist of what it does:
There are four inputs: the effect size `par`, the coding scheme of the design matrix `code_gen`, the coding scheme of the analysis `code`, and `main`, which toggles whether the sample size is determined by the \(p\)-value of a main effect or of the interaction.
Some common settings are made internal to the function: the significance level is set at \(5\%\) (equivalently, \(95\%\) confidence) and the target power at \(90\%\), with \(500\) simulated datasets at each candidate design size. Once the function gets going, it loops over progressively larger designs until it finds one whose estimated power is at least \(90\%\).
There are only two factors in the experiment, and as the sample size grows I increment the design by \(4\) runs at a time to keep it balanced. There is also some internal logic that sets up the right model formula and coefficient vector when toggling between main-effect power and interaction power.
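As a quick illustration (mine, not part of the original function), here is the balanced two-factor layout the function builds internally, for the smallest size \(n = 8\):

```r
n <- 8
data.frame(a = gl(2, 1, n),   # a alternates every run:   1 2 1 2 ...
           b = gl(2, 2, n))   # b alternates every 2 runs: 1 1 2 2 ...
#   a b
# 1 1 1
# 2 2 1
# 3 1 2
# 4 2 2
# 5 1 1
# 6 2 1
# 7 1 2
# 8 2 2
```

Each of the four factor combinations appears twice, so the design stays balanced whenever \(n\) is a multiple of \(4\).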
get_nsamp <- function(par = 1,
                      code = "contr.treatment",
                      code_gen = "contr.treatment",
                      main = TRUE) {
  #
  # par      : magnitude of the regression coefficients
  # code     : contrast coding used in the analysis ('contr.treatment' or 'contr.sum')
  # code_gen : contrast coding used to generate the data ('contr.treatment' or 'contr.sum')
  # main     : TRUE for main-effect power, FALSE for interaction power
  #
  alpha <- 0.05
  tgt_power <- 0.9
  nsim <- 500
  pval <- rep(NA, nsim)
  pow <- 0
  n <- 8
  #
  while (pow < tgt_power) {
    #
    # balanced two-factor design with n runs (n is a multiple of 4)
    dat <- data.frame(a = gl(2, 1, n),
                      b = gl(2, 2, n))
    if (main) {
      f1 <- formula( ~ .)
      reg_coefs <- c(0, par, par)
    } else {
      f1 <- formula( ~ . * .)
      reg_coefs <- c(0, par, par, par / 2)   # interaction is half the main effects
    }
    # generator matrix: this is where the data-generating coding enters
    X <- model.matrix(f1, data = dat,
                      contrasts.arg = list(a = code_gen,
                                           b = code_gen))
    for (i in 1 : nsim) {
      #
      dat$y <- X %*% reg_coefs + rnorm(n)
      #
      if (main) {
        fit <- lm(y ~ ., data = dat,
                  contrasts = list(a = code,
                                   b = code))
        pval[i] <- summary(fit)$coefficients[3, 4]   # p-value for the main effect of b
      } else {
        fit <- lm(y ~ . * ., data = dat,
                  contrasts = list(a = code,
                                   b = code))
        pval[i] <- summary(fit)$coefficients[4, 4]   # p-value for the a:b interaction
      }
    }
    pow <- mean(pval < alpha)
    if (pow < tgt_power) n <- n + 4
  }
  n
}
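To see what "generated under DOE coding" means in practice, here is a quick look (my illustration, not from the original post) at the generator matrix `X` for the interaction model under the two codings, using a single replicate of the \(2 \times 2\) design:

```r
dat <- data.frame(a = gl(2, 1, 4), b = gl(2, 2, 4))

# Treatment coding: 0/1 columns
model.matrix(~ a * b, dat)
#   (Intercept) a2 b2 a2:b2
# 1           1  0  0     0
# 2           1  1  0     0
# 3           1  0  1     0
# 4           1  1  1     1

# Sum (DOE) coding: +/-1 columns
model.matrix(~ a * b, dat,
             contrasts.arg = list(a = "contr.sum", b = "contr.sum"))
#   (Intercept) a1 b1 a1:b1
# 1           1  1  1     1
# 2           1 -1  1    -1
# 3           1  1 -1    -1
# 4           1 -1 -1     1
# (assign/contrast attributes omitted)
```

Multiplying the same `reg_coefs` vector by these two different matrices implies different sets of cell means, which is what makes the choice of generating coding consequential.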
In the experiment we vary both how we generate the data and how we fit the models, giving four scenarios:
- Data generated by treatment coding, Data analyzed with treatment coding
- Data generated by DOE coding, Data analyzed with treatment coding
- Data generated by treatment coding, Data analyzed with DOE coding
- Data generated by DOE coding, Data analyzed with DOE coding
Summary of experimental results:
How the data are analyzed hardly matters; the powers depend mostly on how we generate the data. This is pretty comforting, but it means we do need to be careful about how we generate the data when conducting power studies.
# data generated and analyzed with treatment coding
tm1 <- get_nsamp(main = TRUE)
ti1 <- get_nsamp(main = FALSE)
# data generated with sum (DOE) coding, analyzed with treatment coding
tm2 <- get_nsamp(par = 1 / 2, code_gen = "contr.sum", main = TRUE)
ti2 <- get_nsamp(par = 1 / 2, code_gen = "contr.sum", main = FALSE)
# data generated with treatment coding, analyzed with sum (DOE) coding
sm1 <- get_nsamp(code = "contr.sum", main = TRUE)
si1 <- get_nsamp(code = "contr.sum", main = FALSE)
# data generated and analyzed with sum (DOE) coding
sm2 <- get_nsamp(par = 1 / 2, code = "contr.sum",
                 code_gen = "contr.sum", main = TRUE)
si2 <- get_nsamp(par = 1 / 2, code = "contr.sum",
                 code_gen = "contr.sum", main = FALSE)

results <- data.frame(
  code_gen = gl(2, 2, 8, labels = c("treatment", "sum")),
  code_analysis = gl(2, 4, 8, labels = c("treatment", "sum")),
  power = gl(2, 1, 8, labels = c("main effect power", "interaction power")),
  nsamp = c(tm1, ti1, tm2, ti2, sm1, si1, sm2, si2))
knitr::kable(results)
| code_gen  | code_analysis | power             | nsamp |
|-----------|---------------|-------------------|------:|
| treatment | treatment     | main effect power |    48 |
| treatment | treatment     | interaction power |   632 |
| sum       | treatment     | main effect power |    48 |
| sum       | treatment     | interaction power |   160 |
| treatment | sum           | main effect power |    44 |
| treatment | sum           | interaction power |   636 |
| sum       | sum           | main effect power |    48 |
| sum       | sum           | interaction power |   156 |
So, why does the data generation affect the results so much? We’ll talk next time! But I find this to be a convincing argument in favor of calculating powers based on anticipated outcomes, rather than anticipated regression coefficients.