# Estimating three or more things

## 2020/04/12

Categories: Statistics Tags: Bayes

## Some of the ways Bayes estimates out-perform

A very confusing and widely known fact in statistics is that the least squares estimator is not the most accurate way to estimate three or more parameters. This phenomenon is well studied for Normal populations, where it is called Stein's paradox.

Paradox is the wrong word for it: this is simply a mathematical fact that is not intuitive. The purpose of this post is to demonstrate the wide range of situations in which a Bayesian estimate has better frequentist properties than the least squares estimator (or the MLE, in the case of other models).
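Before the regression example, the Normal-means version of this fact is easy to check by simulation. Here is a minimal sketch of the James-Stein estimator (a quick aside of mine, not part of the main example below; all names are my own):

```r
# Stein's paradox for d Normal means: shrinking the raw observations toward
# zero beats the MLE (the observations themselves) whenever d >= 3.
set.seed(1)
d <- 10
theta_means <- rnorm(d)                       # true means
losses <- replicate(5000, {
  x <- rnorm(d, mean = theta_means, sd = 1)   # one draw per mean
  js <- (1 - (d - 2) / sum(x^2)) * x          # James-Stein shrinkage
  c(mle = sum((x - theta_means)^2),
    js = sum((js - theta_means)^2))
})
rowMeans(losses)   # the James-Stein risk is the smaller of the two
```

The MLE's risk here is about $$d = 10$$; the James-Stein estimator does strictly better even though it is biased toward zero.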

To make the comparisons as fair as possible, let's take mean squared error, $$L(\theta, \delta) = (\theta - \delta)^{\intercal} (\theta - \delta),$$ as the measure of accuracy. We will evaluate this accuracy from the frequentist perspective, that is, by performing many simulations and averaging the loss over the resulting $$\delta$$'s.

## Standard linear model

$$y = X\theta + e, \quad e \sim \text{Normal}(0, \sigma^2 I)$$

Take $$X$$ to have 6 columns, including the intercept term.

### Set loss function

```r
se <- function(theta, delta) {
  t(theta - delta) %*% (theta - delta)
}
```

### Set model and parameters

```r
theta <- c(1, -1, 5, -5, 2, -2)

samp_size <- 100

genXy <- function(theta) {
  X <- data.frame(
    a = seq(1, samp_size, by = 1),
    b = rnorm(samp_size),
    c = runif(samp_size),
    d = rbinom(samp_size, 12, 0.7),
    e = rnorm(samp_size, 1, 3))
  y <- rnorm(samp_size, cbind(1, as.matrix(X)) %*% theta, 2)

  return(cbind(X, y))
}
```

Here is some example output. The correlation matrix is printed to show that we are dealing with the general situation where the factors are correlated, as is common in observational studies (in an optimal experimental design, the factors could be made orthogonal).

```r
out <- genXy(theta)
cor(out)
##             a          b            c           d            e           y
## a  1.00000000 0.01250561  0.024083954  0.17150239  0.194238945 -0.96687335
## b  0.01250561 1.00000000  0.033283441  0.15265836  0.151599921  0.13110691
## c  0.02408395 0.03328344  1.000000000  0.02916633 -0.003716233 -0.06390328
## d  0.17150239 0.15265836  0.029166332  1.00000000 -0.039170096 -0.03259254
## e  0.19423895 0.15159992 -0.003716233 -0.03917010  1.000000000 -0.32999278
## y -0.96687335 0.13110691 -0.063903278 -0.03259254 -0.329992777  1.00000000
```

### Simulation function

In this example, I just use the conjugate model to get the Bayesian estimates (I'm still afraid to do any MCMC in blogdown). If you want to fit a Bayesian model in practice, you are better off using MCMC methods from Stan or the like.

We can use lm to get the frequentist estimates, which we will need anyway for the Bayesian estimates. I'm fitting the correctly specified model, so the least squares estimator gets the best chance it can get.
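Concretely, the conjugate estimate used below is the posterior mean under a $$\text{Normal}(0, \sigma^2 I)$$ prior on $$\theta$$ (my reading of the code; the post does not state the prior explicitly):

$$\hat{\theta}_{B} = (X^{\intercal}X + I)^{-1} X^{\intercal}X \, \hat{\theta}_{LS} = (X^{\intercal}X + I)^{-1} X^{\intercal} y,$$

which is a ridge-type shrinkage of the least squares coefficients toward zero, and is why the simulation can compute the Bayes estimate directly from the lm fit.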

```r
sim_fun <- function(theta) {
  dat <- genXy(theta)
  fit_lm <- lm(y ~ ., data = dat)
  mmat <- model.matrix(fit_lm)
  coef_b <- solve(t(mmat) %*% mmat + diag(6)) %*%
    (t(mmat) %*% mmat %*% fit_lm$coef)
  c(least_squares = se(theta, fit_lm$coef),
    bayes = se(theta, c(coef_b)))
}
```

### Do Sims

```r
set.seed(4850)
sims <- replicate(250, sim_fun(theta))
sims <- t(sims)
```

### Get MSEs

```r
apply(sims, 2, mean)
## least_squares         bayes
##      2.160195      1.723372
```

The Bayes estimates are more accurate, but that's not really surprising; this is just a bit of shrinkage. The Bayes estimates decreased the estimation error by about 20.2 percent.
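The percent reduction can be computed directly from the two simulated MSEs above:

```r
# MSEs copied from the simulation output above
mse <- c(least_squares = 2.160195, bayes = 1.723372)
reduction <- (mse[["least_squares"]] - mse[["bayes"]]) / mse[["least_squares"]]
round(100 * reduction, 1)
## [1] 20.2
```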

Why is the Bayes estimate more accurate? Because it is not unbiased. While unbiasedness appears at first glance to be a great property of an estimator, unbiased estimates tend to be highly variable. Luckily, there are only a handful of situations where unbiased estimators even exist for finite sample sizes, so most frequentists turn to other methodologies.
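The tradeoff behind this is the standard decomposition of mean squared error, stated here for a general estimator $$\delta$$:

$$\mathbb{E}\,\|\delta - \theta\|^2 = \|\mathbb{E}\,\delta - \theta\|^2 + \operatorname{tr}\operatorname{Var}(\delta).$$

An unbiased estimator zeroes the first term but can pay heavily in the second; a shrinkage estimator accepts a small bias in exchange for a larger drop in variance.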

## The point

The nice thing about the least squares estimator is that it is intuitive and unbiased.[^1] In science and industry, these properties carry a lot of weight, and I have witnessed much hand-wringing about whether Bayesian estimation (or any other shrinkage estimate) is the right thing to do, even when we have good prior information!

I distinguish between two kinds of “statistics”:

1. Statistics that estimate parameters, and hence generalize from sample to population

2. Statistics that summarize the data, and do not mean to generalize (summary or descriptive statistics)

In (1), if the generalization is valid (i.e., we have conducted a well-designed experiment, or considered many confounders in an observational study), then statisticians should serve other researchers by producing optimal (admissible) estimators.

The LSE is a valid way of generalizing to a population. I have seen much worse ways of generalizing from sample to population.

In (2), any intuitive estimator is permissible, but interval estimates are not. The trouble is that researchers often want to do (2) in order to get to (1), but the accuracy gap between a descriptive statistic and the LSE would be much greater than the gap shown in the calculation above!

[^1]: Some attribute the success of the LSE and MLEs to the fact that they are generalized Bayes estimates.