The data do not speak for themselves

2022/08/14

Categories: Statistics | Tags: Statistics, Model, Philosophy of Statistics

Statistical modelling may seem like a very complex enterprise. Our correspondents often say “I’m not interested in modelling my data; I only want to analyse it.” However, any kind of data analysis or data manipulation is equivalent to imposing assumptions. In taking the average of some numbers, we implicitly assume that the numbers all come from the same population. If we conclude that something is ‘statistically significant’, we have implicitly assumed a model, because the p-value is a probability according to a model.

The purpose of statistical modelling is to make these implicit assumptions explicit. By doing so, we are able to determine the best and most powerful way to analyse data, we can subject the assumptions to criticism, and we are more aware of the potential pitfalls of analysis. If we “only want to do data analysis” without statistical models, our results will be less informative and more vulnerable to critique.

Using a model is not a scientific weakness: it is a strength. In statistical usage, a model is always tentative; it is assumed for the sake of argument. – Adrian Baddeley • Ege Rubak • Rolf Turner, Spatial Point Patterns: Methodology and Applications with R (2021)
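As a small illustration of the quoted point that a p-value is a probability computed under a model, here is a sketch with simulated data (the sample sizes, means, and spreads are arbitrary choices of mine): the same numbers yield different p-values depending on which model we assume for them.

```python
import numpy as np
from scipy import stats

# Simulated data; the sample sizes, means, and spreads are arbitrary choices.
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=10)
b = rng.normal(0.6, 3.0, size=40)

# Model 1: both groups normal with a common variance (pooled two-sample t-test).
print(stats.ttest_ind(a, b, equal_var=True).pvalue)

# Model 2: both groups normal, possibly with different variances (Welch's t-test).
print(stats.ttest_ind(a, b, equal_var=False).pvalue)
```

Same numbers, different p-values, because each p-value is a probability computed under a different model. Even a permutation test, which drops the normality assumption, still assumes the observations are exchangeable under the null hypothesis.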

I sometimes hear that the data speak for themselves. I disagree with this statement in general, but I think it comes from good intentions. Its purpose is to downplay the importance of modeling in data analysis and to emphasize what is seen as truly important: the data themselves. But I think you can’t learn from data without some mechanism for learning, and that’s my general definition of a model.1

There are at least two situations where you might hear someone say that the data speak for themselves.

  1. In an analysis of “simple” data.

The analysis is essentially based on descriptive techniques: the researcher tabulates or plots descriptive statistics. In some sense, an analysis by descriptive statistics cannot be wrong, so what do we gain from a model?

There are at least two problems with this type of thinking.

First, while descriptive statistics cannot be wrong in some technical modeling sense, they can mislead, so researchers still have to decide which data to present and how. This may involve conditioning on other variables, as in the sketch below. I think the choices researchers make at these points are modeling decisions.
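Here is a toy illustration of that first problem, with made-up numbers (the variable names and values are hypothetical): the raw comparison and the conditioned comparison tell opposite stories, and deciding which one to report is a modeling decision.

```python
import pandas as pd

# Hypothetical data: treatment A is concentrated in the high-outcome stratum,
# treatment B in the low-outcome stratum.
df = pd.DataFrame({
    "treatment": ["A"] * 100 + ["B"] * 100,
    "stratum":   ["s1"] * 90 + ["s2"] * 10 + ["s1"] * 10 + ["s2"] * 90,
    "y":         [8.0] * 90 + [2.0] * 10 + [9.0] * 10 + [3.0] * 90,
})

# Unconditional summary: A looks much better (A = 7.4, B = 3.6).
print(df.groupby("treatment")["y"].mean())

# Conditioning on the stratum: B is better in every stratum
# (s1: A = 8, B = 9; s2: A = 2, B = 3).
print(df.groupby(["stratum", "treatment"])["y"].mean())
```

Both summaries are “just descriptive statistics”; neither is technically wrong, yet they support opposite conclusions.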

Second, the moment the analyst wants to attach an uncertainty interval to any of these summaries, she needs a technical model for the data.

For example, consider a bar chart of group averages, where each average carries an uncertainty interval computed from the within-group standard deviation. This is an uncontroversial way to present grouped data, and it is also exactly the graph that corresponds to a generalized least squares model.
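Here is a minimal sketch of that correspondence, using simulated data and arbitrary group names of my choosing. The bar heights and error bars from the descriptive summary are exactly the estimates and standard errors of a model with a separate mean and variance for each group; fitting an intercept-only regression within each group is one simple way to see this, and it is the same mean-and-variance structure a generalized least squares formulation would encode.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 30),
    "y": np.concatenate([rng.normal(m, s, 30)
                         for m, s in [(1.0, 0.5), (2.0, 1.0), (1.5, 0.8)]]),
})

# Descriptive version: group means, with error bars from within-group SDs.
summary = df.groupby("group")["y"].agg(["mean", "std", "count"])
summary["se"] = summary["std"] / np.sqrt(summary["count"])

fig, ax = plt.subplots()
ax.bar(summary.index, summary["mean"], yerr=summary["se"], capsize=4)
ax.set_ylabel("mean of y (± 1 standard error)")

# Model version: one mean per group, each group with its own variance.
for g, sub in df.groupby("group"):
    fit = smf.ols("y ~ 1", data=sub).fit()
    # fit.params["Intercept"] equals summary.loc[g, "mean"];
    # fit.bse["Intercept"]    equals summary.loc[g, "se"].
    print(g, fit.params["Intercept"], fit.bse["Intercept"])
```

The “non-controversial” plot and the model are two presentations of the same assumptions: independent observations, a common mean within each group, and a within-group spread that justifies the s/√n error bars.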

  2. In machine learning.

Practitioners may say that all the relevant information is in the data, or that the algorithm is just “regurgitating the data”. This was a common complaint about image-generation models such as DALL·E. Also, stuff like this and this.

The underlying claim is that the data are what is truly important, and the machine learning method is just some way to extract patterns from the data.

This is a situation where I’m more accepting, but I think models matter in ML too. For example, none of the achievements in ML would have been possible just by feeding the data into the models of the old days: linear regression is not going to produce impressive image generations, and logistic regression is rarely going to beat deep learning.

The model unlocks the patterns in the data for general application.


  1. But my critique of “the data speak for themselves” still holds even if you take a more technical definition of a model.↩︎