Frank Harrell has a new keynote talk on YouTube. Watch it here.
While the talk is titled "Controversies in Predictive Modeling and Machine Learning," I would say it is also Frank Harrell's philosophy of statistics in a nutshell. Here are my notes on the talk; the same themes are sprinkled throughout his RMS book.
External Validation is Overrated
- Data splitting (using training and testing sets) is bad. Training and testing on a single split is not external validation. Dr. Harrell is a biostatistician, so this stance makes sense in his field; for machine learning it may apply less.
- It's better to use resampling to validate the entire process that produced the model, rather than only the final model under consideration.
- Feature selection is not reliable in most uses. The process is highly unstable.
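A sketch of what "validate the entire process" means in practice, using bootstrap optimism correction (my own minimal illustration, not Harrell's code; the screening threshold and pipeline are hypothetical). The key point is that the feature-selection step is re-run inside every bootstrap resample:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_pipeline(X, y):
    """The WHOLE process: univariate screening, then OLS on the survivors.
    (Hypothetical pipeline, for illustration only.)"""
    r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    keep = np.where(r > 0.1)[0]
    if keep.size == 0:
        keep = np.array([int(np.argmax(r))])
    Xk = np.column_stack([np.ones(len(y)), X[:, keep]])
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    return keep, beta

def r2(X, y, keep, beta):
    Xk = np.column_stack([np.ones(len(y)), X[:, keep]])
    resid = y - Xk @ beta
    return 1 - resid.var() / y.var()

n, p = 200, 20
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=n)

keep, beta = fit_pipeline(X, y)
apparent = r2(X, y, keep, beta)

# Bootstrap the ENTIRE pipeline (selection + fit), not just the final model.
optimism = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    kb, bb = fit_pipeline(X[idx], y[idx])
    optimism.append(r2(X[idx], y[idx], kb, bb) - r2(X, y, kb, bb))

corrected = apparent - np.mean(optimism)
print(f"apparent R^2 = {apparent:.3f}, optimism-corrected R^2 = {corrected:.3f}")
```

The corrected estimate is lower than the apparent one precisely because the noisy screening step is repeated on each resample; refitting only the chosen model would hide that source of optimism.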
Validate Researchers, Not Models
- The quality of the researchers and of the analysis methodology strongly influences the reliability and usefulness of the resulting research.
Bayesian modeling
- Frequentist penalized methods work well for prediction but not for inference: confidence intervals and \(p\)-values don't extend well to the case where estimates are biased by penalization.
- Horseshoe priors work better than LASSO and elastic net (I’m not sure why???)
- There is a tendency to use methods that are fast, but we don't stop to think about what the population of effects is. In Bayesian modeling you can envision that population and encode it as a prior.
- Ordinal predictors are easier in Bayes.
- Degrees of freedom for non-linear effects can be data-driven and still preserve operating characteristics.
- Imputation and modeling can be done in a single unified model (joint modeling)
- Validation is less necessary: overfitting does not occur in the usual sense; instead, the issue is that the analyst and the reader may hold different priors. The problem of overfitting is translated into a problem of selecting a prior.
- Use forward probabilities (e.g., the probability of disease given the test result) instead of backward probabilities (e.g., sensitivity and specificity, which condition on the unknown).
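On the horseshoe point, one intuition (my own gloss, not from the talk): the horseshoe has both a taller spike at zero and much heavier tails than the Laplace prior implied by the LASSO penalty, so it shrinks noise coefficients hard while leaving large coefficients nearly untouched. A quick Monte Carlo comparison of draws from the two priors:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100_000

# Horseshoe prior draws: beta = lambda * z, z ~ N(0, 1), lambda ~ Half-Cauchy(0, 1)
lam = np.abs(rng.standard_cauchy(m))
horseshoe = lam * rng.normal(size=m)

# Laplace (double-exponential) draws: the prior implied by the LASSO penalty
laplace = rng.laplace(0.0, 1.0, m)

hs_near = np.mean(np.abs(horseshoe) < 0.1)  # mass in the spike near zero
la_near = np.mean(np.abs(laplace) < 0.1)
hs_tail = np.mean(np.abs(horseshoe) > 10)   # mass in the heavy tails
la_tail = np.mean(np.abs(laplace) > 10)

print(f"P(|b| < 0.1): horseshoe {hs_near:.3f} vs Laplace {la_near:.3f}")
print(f"P(|b| > 10) : horseshoe {hs_tail:.4f} vs Laplace {la_tail:.5f}")
```

The horseshoe wins on both counts, which is the "shrink the noise, spare the signal" behavior that sparsity-agnostic shrinkage wants.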
Mirage of Variable Selection
- Parsimony is the enemy of predictive discrimination. You don’t want to spend data trying to figure out which features to use.
- Maxwell's demon and the second law of thermodynamics: feature selection requires spending information that could be better used for estimation and prediction.
- Pr(selecting the 'right' variables) = 0. You cannot use the data to tell you which elements of the data you should use; you don't have enough information unless you have millions of observations and only a few features.
- Researchers worry about the false discovery rate (FDR), but seldom worry about the false negative rate (FNR).
- Fraction of important features not selected >> 0.
- Fraction of unimportant features selected >> 0.
CI for variable importance quantifies difficulty of variable selection
- Example: \(n=300\), \(12\) features, \(\beta_i=i\), \(\sigma=9\). Rank the features using partial \(\chi^2\). Simulation shows the resulting ranking is too noisy to be trusted.
- The gold standard of variable selection: a full Bayesian model with carefully chosen shrinkage priors, which are not necessarily sparsity priors. Then project the full model onto simpler models, Piironen and Vehtari (2017).
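That instability is easy to reproduce. A sketch of the simulation (my own implementation, not Harrell's code; I use OLS Wald \(z^2\) as a stand-in for partial \(\chi^2\)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 300, 12, 9.0
beta = np.arange(1, p + 1, dtype=float)  # beta_i = i, so feature 12 matters most

def importance_ranks(X, y):
    """Rank predictors by Wald chi-square (z^2); rank 0 = most important."""
    Xd = np.column_stack([np.ones(n), X])
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    b = XtX_inv @ Xd.T @ y
    resid = y - Xd @ b
    s2 = resid @ resid / (n - p - 1)
    z2 = b[1:] ** 2 / (s2 * np.diag(XtX_inv)[1:])
    return np.argsort(np.argsort(-z2))

sims, hits = 200, 0
for _ in range(sims):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(scale=sigma, size=n)
    # true importance order: last feature strongest ... first feature weakest
    hits += np.array_equal(importance_ranks(X, y), np.arange(p - 1, -1, -1))

print(f"exact true importance ranking recovered in {hits}/{sims} simulations")
```

Even with every feature truly active and a known ordering, the estimated ranking frequently disagrees with the truth, which is the point of putting a confidence interval on variable importance.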
ML vs. Statistical Models
Stats Models
- Probability Distribution for the data.
- Favors Additivity
- Identified parameters of interest
- Inference, Estimation, and Prediction
- Most useful when the signal-to-noise ratio (SNR) is low
Machine Learning
- Algorithmic
- Equal opportunity for interactions and main effects
- Prediction only
- Most useful when SNR is high
Predictive Measures
Gold Standards - not popular anymore
- Smooth, flexible calibration curve
- log likelihood
- log likelihood + log prior
- explained outcome heterogeneity
- heterogeneity of the predictions (Kent & O’Quigley-type measures)
- Relative explained variance (relative \(R^2\)): ratio of variances of \(\hat y\) from a subset model to the full model
- The majority of ML papers do not demonstrate an adequate understanding of predictive accuracy.
Bad Measures
- Pr(classified correctly), sensitivity, specificity, precision, and recall: these are improper, discontinuous scoring rules.
- ROC curves are highly problematic and have a high ink-to-information ratio.
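A toy illustration of why accuracy-style measures are improper (my own example, not from the talk): a forecaster that exaggerates every probability toward 0 or 1 gets exactly the same classification accuracy as the true probabilities, but a proper score such as the Brier score exposes the miscalibration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
p_true = rng.uniform(0.0, 1.0, n)   # true event probabilities
y = rng.binomial(1, p_true)         # observed binary outcomes

# Overconfident forecaster: same side of 0.5, exaggerated to the extremes,
# so threshold-based classification cannot tell the two apart.
p_overconfident = np.where(p_true > 0.5, 0.99, 0.01)

def accuracy(p, y):
    return np.mean((p > 0.5) == (y == 1))

def brier(p, y):
    return np.mean((p - y) ** 2)

print("accuracy:", accuracy(p_true, y), accuracy(p_overconfident, y))
print("Brier   :", brier(p_true, y), brier(p_overconfident, y))
```

Accuracy is identical for both forecasters, while the Brier score is clearly worse for the overconfident one; only the proper score rewards honest probabilities.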
Decision Making
- The optimal Bayes decision maximizes expected utility: combine the posterior distribution with the utility function.
- ROC/AUROC play no role
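A minimal sketch of that idea (the treat/wait framing and all utility numbers are hypothetical): the decision depends only on the posterior probability and the utilities, and no ROC curve appears anywhere:

```python
# Hypothetical utilities for a treat / wait decision
U_treat_sick, U_treat_well = 90.0, -10.0   # benefit if sick, side-effect cost if well
U_wait_sick,  U_wait_well  = -100.0, 0.0   # missing real disease is very costly

def best_action(p_sick):
    """Choose the action with the higher posterior expected utility."""
    eu_treat = p_sick * U_treat_sick + (1 - p_sick) * U_treat_well
    eu_wait  = p_sick * U_wait_sick  + (1 - p_sick) * U_wait_well
    return "treat" if eu_treat > eu_wait else "wait"

# The implied decision threshold solves eu_treat = eu_wait:
p_star = (U_wait_well - U_treat_well) / (
    U_treat_sick - U_treat_well - U_wait_sick + U_wait_well)

print(f"treat when posterior P(sick) > {p_star:.3f}")
print(best_action(0.02), best_action(0.20))  # -> wait treat
```

Changing the utilities moves the threshold; nothing about sensitivity, specificity, or AUROC enters the calculation.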