My two talks on validation and evidence

John Haman


Categories: Statistics Philosophy of Statistics Tags: Statistics Philosophy of Statistics

Last month I gave a talk about computer model validation. The point of the talk was to detail a couple of trends I have observed in the validation space. The trends are:

  1. Computer model validation methodologies have been held back by attempts to frame model validation projects as hypothesis testing problems. Validation research can better serve sponsor needs by providing estimates and uncertainties about important parameters.

  2. Because computer model validation strategies range from completely visual to null hypothesis significance tests, and because there are so many different and interesting ways that models can differ from reality, we have trouble comparing validation techniques, even empirically. This makes recommending one technique over another essentially a matter of personal opinion. In fact, some of the most successful model validation projects apply every validation technique available and report all data summaries from all techniques.

Other ideas permeated the seminar, but my main proposal was that we get in the business of answering model validation questions Bayesianly. From my perspective, sponsors need estimates of the bias between a computer model and reality, or a probability that a computer model differs from reality. Instead, we have been feeding them \(p\)-values, which may not address any relevant question.

One of the reasons I suggested that we attack computer model validation problems from a Bayesian point of view is that model validation questions all seem to be of the “What do I believe?” flavor that Richard Royall discusses in Chapter 1 of his excellent book Statistical Evidence: A Likelihood Paradigm.

The seminar material was too much to fit in one hour, so I gave an extended version of the talk in a different seminar two weeks later. This gave me the opportunity to give a talk that was primarily focused on statistical evidence, and to reflect on Royall’s three questions. The questions that Royall indicates should guide statistical practice are:

  1. What do I believe?

  2. What should I do?

  3. Is the data evidence in favor of claim A over claim B?

I’ve been enamored with these questions since I read Royall’s book during JSM 2019 in a hotel room in Denver. Royall’s idea is that questions of belief are best attacked using Bayesian probabilities. Questions of action are dealt with through a decision theoretic method (which could be Bayesian or Neyman-Pearson style). Finally, questions of evidence are handled using likelihood ratios. The distinctions between the three questions are laid out using an extremely effective example:

Royall’s example

Suppose a man takes a diagnostic test for a disease. The test has the following properties:

| Test \ Truth | Disease Present | Disease Absent |
|---|---|---|
| Test Positive | \(0.95\) | \(0.05\) |
| Test Negative | \(0.05\) | \(0.95\) |

The man’s test comes back positive. Upon seeing the positive test, the man’s doctor can reach three different conclusions.

  1. The man probably has the disease. (What do I believe?)

  2. The man should be treated for the disease. (What should I do?)

  3. The test is evidence in favor of the man having the disease. (Is the data evidence?)

Royall explains that conclusion (1) is only correct if the prevalence of the disease in the population is high enough to push the posterior probability \(\Pr(\text{Disease Present} \mid \text{Test Positive})\) above one half. Conclusion (2) is correct if conclusion (1) is correct, or if the costs of failing to treat the disease are high. But conclusion (3) is independent of prior probabilities and consequences, and therefore may be addressed from the data and model alone. Royall uses a likelihood ratio to conclude that the positive test is evidence in favor of the disease being present at the level of 19 to 1 (0.95 / 0.05).
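To make the contrast between conclusions (1) and (3) concrete, here is a minimal Python sketch of the arithmetic. The test properties come from the table above; the prevalence values in the loop are hypothetical numbers I picked for illustration, and the function names are my own, not anything from Royall’s book.

```python
def posterior_given_positive(prevalence, sensitivity=0.95, false_positive_rate=0.05):
    """Pr(disease present | test positive), computed with Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = false_positive_rate * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)


def likelihood_ratio(sensitivity=0.95, false_positive_rate=0.05):
    """Pr(positive | disease present) / Pr(positive | disease absent)."""
    return sensitivity / false_positive_rate


if __name__ == "__main__":
    # Hypothetical prevalences, chosen only to show how belief moves with the prior.
    for prevalence in (0.001, 0.05, 0.20, 0.50):
        posterior = posterior_given_positive(prevalence)
        print(f"prevalence={prevalence:.3f}  posterior={posterior:.3f}")

    # The likelihood ratio never looks at the prevalence.
    print(f"likelihood ratio = {likelihood_ratio():.1f}")
```

The posterior crosses one half exactly when the prior odds exceed 1/19, that is, when the prevalence is above 0.05. Below that, the doctor should not believe the man has the disease, even though the very same positive test remains roughly 19-to-1 evidence in favor of it.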

Do Royall’s questions apply to real world problems?

When working on statistical problems, I think it is extremely useful to break problems down to these fundamental questions, in hopes that this can guide the selection of a statistical method or data summaries that should be reported.

In my opinion, computer model validation is very much a “What do I believe?” question, and my goal is to summarize (possibly with a relevant posterior probability) the extent to which a computer model agrees with reality. However, I’m generally quite unsure of where that leaves all my other research questions.

I would like to provide ‘unbiased’ statistical summaries as advice to sponsors, but Royall’s questions leave me to summarize the evidence only in the form of a likelihood ratio. This is possibly even more mysterious than a \(p\)-value, and is strictly less applicable due to all the situations that involve nuisance parameters. On the other hand, I’m typically engaging in research that informs a decision or action. In this case, it is best to think of my research questions as “What should I do?” type analyses. However, I am not a decision maker, and I do not have access to the decision maker’s utility functions.

This internal struggle leaves me wondering whether Royall’s three questions are somehow incomplete or incompatible with general research. I would like to figure out whether there is a way to expand or modify them so that they are more broadly applicable to real problems.