Larry Wasserman has a nice post on his blog about the rationality of statistical principles, in particular the conditionality principle. I know it’s old, but I just found it yesterday!

Statistics doesn’t have axioms like math; instead, we have principles that guide data
analysis.^{1} Different principles compete for mind share, but the likelihood
principle (LP) is a big one that Bayesians like to cling to. LP
is rather important in the philosophy of statistics because it follows logically
from two simpler and seemingly acceptable principles, sufficiency (SP) and
conditionality (CP) (Birnbaum, 1962), and LP itself seems to be a good argument
against the use of some frequentist techniques.^{2} We’ll talk about LP later; I want to focus on CP in this post.

CP is the idea that only the experiment that was actually performed should be
inferentially relevant. For example^{3}, suppose we have
the choice between two measuring devices. Device A measures our phenomenon with a
standard deviation (measurement error) of \(1\); device B measures with an error
of \(5\). If we flip a coin to choose our measuring device^{4}, then we should *condition* on the value of the coin
flip for all inferential purposes. In other words, that a different measuring
device could have been used is irrelevant. From a frequentist point of view, the
coin flip is part of the experimental design, so (for a fair coin) we measured the outcome with
a standard deviation of \(\sqrt{(1^{2} + 5^{2})/2} = \sqrt{13} \approx 3.6\). If we follow conditionality,
the error is either \(1\) or \(5\), depending on the outcome of the coin flip.
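
The two-device example is easy to check by simulation. The sketch below is only an illustration under assumptions that are mine, not Wasserman’s: the coin is fair, the measurement errors are zero-mean Gaussians, and the function name `simulate_two_devices` is made up for this post. Under those assumptions the law of total variance gives an unconditional variance of \((1^{2} + 5^{2})/2 = 13\), while conditioning on the coin recovers the device-specific errors.

```python
import numpy as np

def simulate_two_devices(n_trials=200_000, sd_a=1.0, sd_b=5.0, seed=0):
    """Simulate measurement errors when a fair coin picks device A or B.

    Assumptions (illustrative only): fair coin, zero-mean Gaussian errors.
    """
    rng = np.random.default_rng(seed)
    use_a = rng.random(n_trials) < 0.5      # one fair coin flip per trial
    sd = np.where(use_a, sd_a, sd_b)        # error scale of the chosen device
    errors = rng.normal(0.0, sd)            # one measurement error per trial
    return {
        "unconditional": errors.std(),      # averages over the coin flip
        "given_A": errors[use_a].std(),     # conditions on "device A was used"
        "given_B": errors[~use_a].std(),    # conditions on "device B was used"
    }

result = simulate_two_devices()
print(result)
print("sqrt(13) =", np.sqrt(13))
```

The unconditional spread describes the whole coin-plus-device procedure; the conditional spreads describe the measurement we actually took, which is what CP says we should report.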

I enjoyed Wasserman’s post, which is new to me. For one, it reminded me of the
work of Evans et al. that showed that LP follows from the conditionality
principle (CP) alone.^{5} That is,
sufficiency is not required.

However, the meat of the post is that CP is bogus: perhaps CP makes sense in
simple examples, but should not be followed strictly, because it leads to some
weird statistical practices^{6}.

Wasserman offers the following example: Suppose we collect data \((X, y)\), where \(X\) is a \(100 \times 100{,}000\) (big \(p\), small \(n\)) data matrix with independent covariates, and we want to fit the usual linear model \(y=X\beta + \varepsilon\), with one coefficient in \(\beta\) per covariate. Wasserman writes that conditioning on all the data, we are inclined to use the least squares estimate of \(\beta\), but if we are only interested in the coefficient of \(X_{1}\), we are much better off discarding most (almost all!) of the data and fitting the model with only \(X_{1}\). Thus, CP is bogus because we can do a better job by throwing away data, and CP (LP) can be easily rejected.
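
To get a feel for the example, here is a rough simulation sketch. Everything specific in it is my assumption rather than part of Wasserman’s post: a smaller problem (\(n = 100\), \(p = 1000\)) so it runs quickly, standard normal covariates, \(\beta_{1} = 2\), small remaining coefficients, unit-variance noise, and the made-up function name `compare_estimators`. Because \(p > n\) there is no unique least squares fit, so I use the minimum-norm solution from `np.linalg.lstsq` as a stand-in for “least squares on all the data”.

```python
import numpy as np

def compare_estimators(n=100, p=1000, beta1=2.0, reps=100, seed=0):
    """Average estimate of beta_1 from two strategies over repeated datasets.

    Assumptions (illustrative only): i.i.d. standard normal covariates,
    small coefficients on covariates 2..p, unit-variance Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    # Fixed true coefficients: beta_1 is of interest, the rest are small.
    beta = rng.normal(scale=1.0 / np.sqrt(p), size=p)
    beta[0] = beta1

    single_covariate, min_norm = [], []
    for _ in range(reps):
        X = rng.normal(size=(n, p))             # independent covariates
        y = X @ beta + rng.normal(size=n)       # linear model with noise
        x1 = X[:, 0]
        # Strategy 1: regress y on X_1 alone (no intercept), ignoring the rest.
        single_covariate.append(x1 @ y / (x1 @ x1))
        # Strategy 2: minimum-norm least squares using every column of X.
        full_fit = np.linalg.lstsq(X, y, rcond=None)[0]
        min_norm.append(full_fit[0])
    return float(np.mean(single_covariate)), float(np.mean(min_norm))

avg_single, avg_full = compare_estimators()
print("true beta_1: 2.0")
print("average single-covariate estimate:", avg_single)
print("average minimum-norm estimate:   ", avg_full)
```

In this setup the single-covariate estimate averages out close to the true \(\beta_{1}\) across replications, while the minimum-norm fit shrinks it heavily toward zero. Note that the averaging is over fresh draws of \(X\), so the good behavior is a statement about the whole sampling procedure, not about the particular \(X\) we observed.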

Here are some quick thoughts on the example:

It’s not clear that the example is actually an indictment of CP, though it may be an indictment of least squares and maximum likelihood. CP does not prescribe that one apply least squares.

Wasserman states that we can assume the covariates are independent. I’m not sure what this means: Does it mean that the columns of \(X\) are independent, or that \(X\) is generated from \(100{,}000\) mutually independent data-generating processes? These are not the same, but I believe Wasserman means the latter.

There is some sleight of hand in the setup: It’s not clear whether the likelihood has one parameter or \(100{,}000\). If the likelihood is multi-parameter and the covariates are independent, then doesn’t it just factorize over the components of \(\beta\)?

Wasserman’s proposed estimator is unbiased given the setup.

In Wasserman’s example, we are not throwing out data in the sense of reducing the sample size; rather, we are restricting the number of parameters we wish to estimate.

As is noted in the comments section of his post, a Bayesianly acceptable estimator can be constructed that is (in some sense) equivalent to throwing away data, but the estimator arguably violates LP. Of course, we know from BDA3 that we shouldn’t sweat violating LP because basically all model checking violates LP.

I like the example, and I like thinking about it, but is it a reason to discard CP? I’m not convinced. I would like to find a cleaner example that shows the bogusness of CP, one that does not rely on this “we have \(100{,}000\) parameters, just kidding, it’s only 1!” trick.

But here’s why Wasserman’s post is great: Although I do not think Wasserman’s example is effective, he has a good point that philosophers of statistics should not let themselves get suckered into some camp by simple and attractive examples. These examples are deceptive and nothing like a real-world data analysis, so statisticians should be critical of them, but maybe not take them too seriously!

^{1} Perhaps (Bayesian) statistical practice can be axiomatized like mathematics.

^{2} Possibly not, according to Mayo and others.

^{3} This is the standard example. I’m interested in collecting more elaborate, non-trivial examples.

^{4} Never mind how bad this experimental design is!

^{5} I’ve not read the proof of Evans et al., but the proof of Birnbaum is rather elegant, so I’m in the mindset of SP+CP=LP.

^{6} Never mind the fact that CP seemingly implies that randomization is an irrelevant practice. Though I think it must be the case that any strictly followed statistical principle leads to some weird edge cases, I don’t have a proof!