Frank Harrell has a new keynote talk on YouTube. Watch it here.

While the talk is titled "Controversies in Predictive Modeling and Machine Learning," I would say it is also Frank Harrell's philosophy of statistics in a nutshell. Here are my notes on the talk; many of these points are also sprinkled throughout his RMS (Regression Modeling Strategies) book.

## External Validation is Overrated

- Data splitting (using training and testing sets) is bad. Training and testing is *not* external validation. Dr. Harrell is a biostatistician, so this makes sense; for machine learning, perhaps it makes less sense.
- It's better to use resampling to validate the entire process that produced the model than to validate only the current model under consideration.
- Feature selection is not reliable in most uses. The process is highly unstable.
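The resampling idea above can be sketched with an optimism bootstrap, in Harrell's style: refit the *entire* modeling process on each resample and subtract the estimated optimism from the apparent performance. This is a minimal sketch assuming a plain least-squares fit and \(R^2\) as the performance index; the specific data and settings are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X, y):
    # the *entire* modeling process goes here (including any selection steps)
    return np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]

def r2(X, y, beta):
    pred = np.c_[np.ones(len(X)), X] @ beta
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

n, p = 100, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)          # only one real predictor

beta_full = fit(X, y)
apparent = r2(X, y, beta_full)            # optimistic in-sample estimate

optimism, B = 0.0, 200
for _ in range(B):
    idx = rng.integers(0, n, n)           # resample rows with replacement
    b = fit(X[idx], y[idx])               # rerun the whole process
    # optimism = performance on the resample minus performance on the original data
    optimism += r2(X[idx], y[idx], b) - r2(X, y, b)
optimism /= B

corrected = apparent - optimism           # optimism-corrected R^2
print(round(apparent, 2), round(corrected, 2))
```

Note that any feature selection or tuning must live inside `fit`, otherwise the bootstrap validates the wrong process.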

## Validate Researchers, not models

- The quality of researcher and analysis methodology used highly influences the reliability and usefulness of the resulting research.

## Bayesian modeling

- Frequentist penalized methods work well for prediction, but not for inference. Confidence intervals and \(p\)-values don't extend well to the case where estimates are biased by penalization.
- Horseshoe priors work better than LASSO and elastic net (I’m not sure why???)
- There is a tendency to use methods that are fast, but we don’t stop to think about what the population of effects are. You can envision that population and encode it as a prior in Bayes.
- Ordinal predictors are easier in Bayes.
- Degrees of freedom for non-linear effects can be data-driven and still preserve operating characteristics.
- Imputation and modeling can be done in a single unified model (joint modeling)
- Validation is less necessary: overfitting does not occur in the usual sense; what looks like overfitting arises when the analyst and the reader hold different priors. The problem of overfitting is translated into a problem of selecting a prior.
- Forward probabilities (e.g., Pr(effect > 0 | data)) instead of backward probabilities (the probability of the data given a hypothesis, as in \(p\)-values).
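A forward probability is directly computable from a posterior. This is a hedged toy example (not from the talk), assuming a conjugate normal model with known variance and made-up numbers:

```python
import math

# known-variance normal model: y_i ~ N(mu, s^2), prior mu ~ N(0, tau^2)
ybar, n, s, tau = 0.4, 25, 1.0, 1.0        # assumed illustrative values
post_var = 1 / (1 / tau**2 + n / s**2)     # conjugate posterior variance
post_mean = post_var * (n * ybar / s**2)   # conjugate posterior mean

def phi(z):                                # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# forward probability: how plausible is a positive effect, given the data?
p_positive = 1 - phi((0 - post_mean) / math.sqrt(post_var))
print(round(p_positive, 3))
```

The answer is a direct statement about the effect, not about hypothetical repetitions of the data.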

## Mirage of Variable Selection

- Parsimony is the enemy of predictive discrimination. You don’t want to spend data trying to figure out which features to use.
- Maxwell's demon and the second law of thermodynamics: feature selection requires spending information that could be better used for estimation and prediction.
- Pr(selecting the 'right' variables) = 0. You cannot use the data to tell you which elements of the data you should use. You don't have enough information unless you have millions of observations and only a few features.
- Researchers worry about the FDR (false discovery rate), but seldom worry about the FNR (false negative rate).
- Fraction of important features not selected >> 0.
- Fraction of unimportant features selected >> 0.
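The instability of selection is easy to see by resampling. This is a minimal sketch under assumed settings (numpy only; "selection" here means keeping coefficients with \(|t| > 2\), a stand-in for stepwise selection):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.array([0.5, 0.35, 0.2] + [0] * 7)   # three real, seven null effects
y = X @ beta + rng.normal(size=n)

def selected(X, y):
    # OLS fit, then keep predictors with |t| > 2
    Xd = np.c_[np.ones(len(X)), X]
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    s2 = resid @ resid / (len(X) - Xd.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    return frozenset(np.flatnonzero(np.abs(b[1:] / se[1:]) > 2))

sets = []
for _ in range(200):
    idx = rng.integers(0, n, n)               # bootstrap resample
    sets.append(selected(X[idx], y[idx]))

truth = frozenset([0, 1, 2])
hit = np.mean([s == truth for s in sets])
print(f"distinct selected sets: {len(set(sets))}, "
      f"exact 'right' set in {hit:.0%} of resamples")
```

Across resamples the selected set bounces around: weak real effects are frequently missed and null effects frequently slip in.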

### CI for variable importance quantifies difficulty of variable selection

- Example: \(n=300\), \(12\) features, \(\beta_i=i\), \(\sigma=9\). Rank the features using the partial \(\chi^2\). Simulation shows the ranking is too noisy: confidence intervals for the importance ranks are very wide.
- Gold standard of variable selection: a full Bayesian model with carefully selected shrinkage priors that are not necessarily sparsity priors. Project the full model onto simpler models, Piironen and Vehtari (2017).
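The simulation above can be roughly re-created as follows (assumed details: standardized normal predictors, partial Wald \(\chi^2\) as the importance measure, bootstrap percentile intervals for the ranks):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 300, 12, 9
X = rng.normal(size=(n, p))
beta = np.arange(1, p + 1)                     # beta_i = i
y = X @ beta + rng.normal(scale=sigma, size=n)

def importance_ranks(X, y):
    # OLS fit, rank predictors by partial Wald chi-square (t^2)
    Xd = np.c_[np.ones(len(X)), X]
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    s2 = resid @ resid / (len(X) - Xd.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    chisq = (b[1:] / se[1:]) ** 2
    return chisq.argsort().argsort() + 1       # rank 1 = least important

R = np.array([importance_ranks(X[idx], y[idx])
              for idx in (rng.integers(0, n, n) for _ in range(300))])
lo, hi = np.percentile(R, [2.5, 97.5], axis=0)
for j in range(p):
    print(f"beta={beta[j]:2d}  rank 95% bootstrap CI: [{lo[j]:.0f}, {hi[j]:.0f}]")
```

Only the very strongest effects get stable ranks; intervals for the small and middling effects span several positions, quantifying how hard variable selection is here.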

## ML vs. Statistical Models

### Stats Models

- Probability distribution for the data
- Favors additivity
- Identified parameters of interest
- Inference, estimation, and prediction
- Most useful when the SNR (signal-to-noise ratio) is low

### Machine Learning

- Algorithmic
- Equal opportunity for interactions and main effects
- Prediction only
- Most useful when SNR is high

## Predictive Measures

### Gold Standards - not popular anymore

- Smooth, flexible calibration curve
- log likelihood
- log likelihood + log prior
- explained outcome heterogeneity
- heterogeneity of the predictions (Kent & O’Quigley-type measures)
- Relative explained variance (relative \(R^2\)): ratio of variances of \(\hat y\) from a subset model to the full model
- The majority of ML papers do not demonstrate an adequate understanding of predictive accuracy.
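A smooth calibration curve simply compares smoothed observed outcome frequencies to predicted probabilities. This is a minimal sketch under stated assumptions: simulated, deliberately miscalibrated predictions, and a sliding-window smoother standing in for loess:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
p_true = rng.uniform(0.05, 0.95, n)        # true event probabilities
y = rng.random(n) < p_true                 # binary outcomes
p_hat = p_true ** 1.3                      # deliberately miscalibrated predictions

order = np.argsort(p_hat)
p_s, y_s = p_hat[order], y[order].astype(float)

# smooth the observed frequency as a function of predicted probability
win = 200
obs = np.convolve(y_s, np.ones(win) / win, mode="valid")
pred = np.convolve(p_s, np.ones(win) / win, mode="valid")

# a perfectly calibrated model would track the 45-degree line (obs == pred)
miscal = np.max(np.abs(obs - pred))
print(f"max absolute calibration error ~ {miscal:.2f}")
```

Plotting `pred` against `obs` gives the calibration curve; deviation from the diagonal is the miscalibration the gold-standard measures are after.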

### Bad Measures

- Pr(Classified Correctly), sensitivity, specificity, precision, recall. These are improper, discontinuous scoring rules.
- ROC curves are highly problematic and have a high ink-to-information ratio.
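Why improperness matters can be shown with a toy example (assumed setup, not from the talk): classification accuracy cannot distinguish an honest probability model from a degenerate one, while a proper score such as the Brier score can.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10000
p_true = np.where(rng.random(n) < 0.5, 0.6, 0.9)   # two risk groups
y = rng.random(n) < p_true

honest = p_true                     # reports the true probabilities
degenerate = np.ones(n)             # always says "event, probability 1"

def accuracy(p, y):                 # improper: threshold at 0.5, count hits
    return np.mean((p > 0.5) == y)

def brier(p, y):                    # proper: mean squared probability error
    return np.mean((p - y) ** 2)

# accuracy gives both models the same score (both always classify "event")...
print(accuracy(honest, y), accuracy(degenerate, y))
# ...but the Brier score penalizes the degenerate model
print(brier(honest, y), brier(degenerate, y))
```

The degenerate model throws away all the risk information yet loses nothing by the improper measure, which is exactly the pathology Harrell warns about.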

### Decision Making

- The optimal Bayes decision maximizes expected utility, combining the posterior distribution with the utility function.
- ROC curves and AUROC play no role in this framework.
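The decision rule above fits in a few lines. This is a sketch with made-up utilities and a made-up posterior probability, just to show the mechanics:

```python
import numpy as np

p_disease = 0.12                       # posterior Pr(disease | data), assumed

# utility[action][state]: rows = treat / don't treat,
# columns = disease present / absent (illustrative numbers)
utility = np.array([[  90, -10],       # treat
                    [-100, 100]])      # don't treat

posterior = np.array([p_disease, 1 - p_disease])
expected = utility @ posterior         # expected utility of each action
best = ["treat", "don't treat"][int(expected.argmax())]
print(expected, best)
```

No threshold, no ROC curve: the posterior and the utilities alone determine the action, and changing the utilities changes the decision without refitting anything.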