Larry Wasserman has a nice post on his blog about the rationality of statistical principles, in particular, the conditionality principle. Link. I know it’s old, but I just found it yesterday!
Statistics doesn’t have axioms the way math does; we have principles that guide data analysis.1 Different principles compete for mind share, but the likelihood principle (LP) is a big one that Bayesians like to cling to. LP is rather important in the philosophy of statistics because it follows logically from two simpler and seemingly acceptable principles, sufficiency and conditionality (Birnbaum, 1962), and LP itself seems to be a good argument against the use of some frequentist techniques.2 We’ll talk about LP later; in this post I want to focus on the conditionality principle (CP).
CP is the idea that only the experiment that actually occurred should be inferentially relevant. For example3, suppose we have the choice between two measuring devices. Device A measures our phenomenon with a standard deviation (measurement error) of \(1\); device B measures with an error of \(5\). If we flip a coin to choose our measuring device4, then we should condition on the value of the coin flip for all inferential purposes. In other words, the fact that a different measuring device could have been used is irrelevant. From a frequentist point of view, the coin flip is part of the experimental design, so the measurement has an unconditional standard deviation of \(\sqrt{(1^{2} + 5^{2})/2} \approx 3.6\) (averaging the two variances over the fair coin flip). If we follow conditionality, the error is either \(1\) or \(5\), depending on the outcome of the coin flip.
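To make the contrast concrete, here is a minimal simulation sketch of the two analyses; the fair coin, the true value, and the replication count are my own made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 200_000
theta = 0.0  # true value of the phenomenon being measured

# Flip a fair coin to pick device A (sd = 1) or device B (sd = 5),
# then take one measurement with the chosen device.
device_is_A = rng.random(n_sims) < 0.5
sd = np.where(device_is_A, 1.0, 5.0)
measurements = rng.normal(theta, sd)

# Unconditional (pre-experiment) spread: averages over the coin flip.
print("marginal sd:      ", measurements.std())               # ~ sqrt((1 + 25)/2) ~ 3.6
# Conditional spread: condition on which device was actually used.
print("sd given device A:", measurements[device_is_A].std())  # ~ 1
print("sd given device B:", measurements[~device_is_A].std()) # ~ 5
```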
I enjoyed Wasserman’s post. For one, it reminded me of the work of Evans et al. showing that LP follows from CP alone.5 That is, sufficiency is not required.
However, the meat of the post is that CP is bogus: perhaps CP makes sense in simple examples, but it should not be followed strictly, because it leads to some weird statistical practices.6
Wasserman offers the following example: Suppose we collect data \((X, y)\), where \(X\) is a \(100 \times 100,000\) (big p, small n) data matrix with independent covariates, and we want to fit the usual OLS model \(y = X\beta + \varepsilon\). Wasserman writes that, conditioning on all the data, we are inclined to use the least squares estimate of \(\beta\), but if we are only interested in the coefficient of \(X_{1}\), we are much better off discarding most (almost all!) of the data and fitting the model with only \(X_{1}\). Thus, CP is bogus because we can do a better job by throwing away data, and so CP (and with it LP) can be easily rejected.
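To see the setup in code, here is a scaled-down toy sketch (my own, with \(p = 500\) rather than \(100,000\), and a coefficient vector and noise level I made up): it compares estimating \(\beta_{1}\) by regressing \(y\) on \(X_{1}\) alone against a minimum-norm least-squares fit that uses every column (ordinary least squares is not unique when \(p > n\)). It illustrates only that the \(X_{1}\)-only estimator is roughly unbiased for \(\beta_{1}\), with the other covariates acting as extra noise, while the full-data coefficient is shrunk toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 500                 # scaled down from Wasserman's 100 x 100,000
beta = rng.normal(0.0, 1.0, p)  # made-up dense coefficients, held fixed across replications
n_reps = 300

err_x1_only, err_all_cols = [], []
for _ in range(n_reps):
    X = rng.normal(0.0, 1.0, (n, p))   # independent covariates
    y = X @ beta + rng.normal(0.0, 1.0, n)

    # Wasserman's suggestion: estimate beta_1 by simple regression of y on X_1 alone.
    x1 = X[:, 0]
    b1_only = (x1 @ y) / (x1 @ x1)

    # "Use all the data": the minimum-norm least-squares solution (OLS is not unique for p > n).
    b_all = np.linalg.lstsq(X, y, rcond=None)[0]

    err_x1_only.append(b1_only - beta[0])
    err_all_cols.append(b_all[0] - beta[0])

for name, err in [("X_1 only", err_x1_only), ("min-norm lstsq", err_all_cols)]:
    print(f"{name:15s} bias {np.mean(err):+.3f}, sd {np.std(err):.3f}")
```

Which strategy has the smaller overall error for \(\beta_{1}\) depends on details I had to invent (how dense \(\beta\) is, the noise level), so treat the numbers as an illustration of the setup rather than a verdict.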
Here are some quick thoughts on the example:
It’s not clear that the example is actually an indictment of CP; it may be an indictment of least squares and maximum likelihood, though. CP does not require one to use least squares.
Wasserman states that we can assume the covariates are independent. I’m not sure what this means: Does it mean that the columns of \(X\) are independent of one another, or that \(X\) is generated from \(100,000\) separate, independent data generating processes? These are not the same, but I believe Wasserman means the latter.
There is some sleight of hand in the setup: It’s not clear if the likelihood has one parameter or \(100,000\). If the likelihood is multi-parameter, and the covariates are independent, then doesn’t it just factorize through by \(\beta\)? (I try to spell out what that would require after these thoughts.)
Wasserman’s proposed estimator is unbiased given the setup.
In Wasserman’s example, we are not throwing out data in the sense of reducing the sample size; we are restricting the number of parameters we wish to estimate.
As is noted in the comments section of his post, a Bayesianly acceptable estimator can be constructed that is (in some sense) equivalent to throwing away data, but the estimator arguably violates LP. Of course, we know from BDA3 that we shouldn’t sweat violating LP because basically all model checking violates LP.
I like the example, and I like thinking about the example, but is it a reason to discard CP? I don’t think I’m convinced: I would like to find a cleaner example that shows the bogusness of CP, one that does not rely on this “we have \(100,000\) parameters, just kidding, it’s only 1!” trick.
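Returning to the factorization question above (my notation, just to spell out what it would require): in the usual Gaussian model the likelihood is
\[
L(\beta, \sigma^{2}) \propto \exp\left\{-\frac{1}{2\sigma^{2}}\left(\beta^{\top}X^{\top}X\beta - 2\,\beta^{\top}X^{\top}y\right)\right\},
\]
which splits into a separate factor for each \(\beta_{j}\) exactly when \(X^{\top}X\) is diagonal. Independent (mean-zero) covariates make the off-diagonal entries of \(X^{\top}X\) zero in expectation, but not exactly zero in any finite sample, so the factorization is at best approximate.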
But here’s why Wasserman’s post is great: Although I do not think Wasserman’s example is effective, he has a good point that philosophers of statistics should not let themselves get suckered into some camp by simple and attractive examples. These examples are deceptive and nothing like a real-world data analysis, so statisticians should be critical of them, but maybe not take them too seriously!
Perhaps (Bayesian) statistical practice can be axiomatized like mathematics.↩
Possibly not, according to Mayo and others.↩
This is the standard example. I’m interested in collecting more elaborate, non-trivial examples.↩
Never mind how bad this experimental design is!↩
I’ve not read Evans’s proof, but Birnbaum’s proof is rather elegant, so I’m in the mindset of SP + CP = LP.↩
Never mind the fact that CP seemingly implies that randomization is an irrelevant practice. I suspect that any strictly followed statistical principle must lead to some weird edge cases, but I don’t have a proof!↩