Frank Harrell has a new keynote talk on YouTube. Watch it here.
While the title of the talk is "Controversies in Predictive Modeling and Machine Learning," I would say it is also Frank Harrell's philosophy of statistics in a nutshell. Here are my notes on the talk; the same themes are sprinkled throughout his RMS (Regression Modeling Strategies) book.
External Validation is Overrated:
- Data splitting (into training and test sets) is bad, and training/testing on a split is not external validation. Dr. Harrell is a biostatistician, so this advice makes sense for his field; for machine learning with very large datasets it may apply less.
- It’s better to use resampling to validate the entire process that produced the model than to validate only the current model under consideration.
- Feature selection is not reliable in most uses. The process is highly unstable.
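A minimal sketch (my own toy setup, not from the talk) of optimism-corrected bootstrap validation in the spirit of Harrell's `rms::validate`: the *entire* modeling process, here univariate feature screening plus a logistic fit, is repeated inside every resample, and the Brier score (a proper scoring rule) is used rather than classification accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

def whole_process(X, y):
    """Repeat every data-driven step: screen features, then fit."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.argsort(corr)[-3:]              # arbitrary choice: keep top 3
    return keep, LogisticRegression().fit(X[:, keep], y)

keep, model = whole_process(X, y)
apparent = brier_score_loss(y, model.predict_proba(X[:, keep])[:, 1])

optimism = []
for _ in range(200):
    idx = rng.integers(0, n, n)               # bootstrap resample
    kb, mb = whole_process(X[idx], y[idx])    # rerun the WHOLE process
    b_boot = brier_score_loss(y[idx], mb.predict_proba(X[idx][:, kb])[:, 1])
    b_orig = brier_score_loss(y, mb.predict_proba(X[:, kb])[:, 1])
    optimism.append(b_orig - b_boot)          # >= 0 on average (lower = better)

corrected = apparent + np.mean(optimism)
print(f"apparent Brier {apparent:.3f}, optimism-corrected {corrected:.3f}")
```

If the screening step were done once outside the loop, the optimism estimate would miss the instability that the screening itself introduces.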
Validate Researchers, not models
- The quality of researcher and analysis methodology used highly influences the reliability and usefulness of the resulting research.
- Frequentist penalized methods work well for prediction, but not for inference: confidence intervals and \(p\)-values don’t extend well to the case where estimates are biased by penalization.
- Horseshoe priors work better than LASSO and elastic net (I’m not sure why???)
- There is a tendency to use methods that are fast, but we don’t stop to think about what the population of effects are. You can envision that population and encode it as a prior in Bayes.
- Ordinal predictors are easier in Bayes.
- Degrees of freedom (DF) for non-linear effects can be chosen in a data-driven way and still preserve operating characteristics.
- Imputation and modeling can be done in a single unified model (joint modeling)
- Validation is less necessary: overfitting in the usual sense does not occur; what looks like overfitting arises when the analyst and the reader have different priors. The problem of overfitting becomes a problem of selecting a prior.
- Use forward probabilities, e.g. \(\Pr(\text{effect} > 0 \mid \text{data})\), instead of backward probabilities such as \(p\)-values, which condition on a hypothesis rather than on the data.
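For reference (my addition, not from the talk): the horseshoe prior of Carvalho, Polson, and Scott (2010) draws each coefficient from a normal whose scale gets a heavy-tailed half-Cauchy prior:

\[
\beta_j \mid \lambda_j, \tau \sim \mathcal{N}\!\left(0, \lambda_j^2 \tau^2\right), \qquad
\lambda_j \sim \mathrm{C}^{+}(0, 1), \qquad
\tau \sim \mathrm{C}^{+}(0, 1).
\]

The heavy tails leave truly large coefficients nearly unshrunk while the sharp peak at zero shrinks small ones aggressively, which is one commonly cited reason it can outperform the LASSO’s single global penalty.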
Mirage of Variable Selection
- Parsimony is the enemy of predictive discrimination. You don’t want to spend data trying to figure out which features to use.
- Maxwell’s demon and the second law of thermodynamics: feature selection requires spending information that could be better used for estimation and prediction.
- \(\Pr(\text{selecting the ‘right’ variables}) = 0\). You cannot use the data to tell you which elements of the data to use; there isn’t enough information unless you have millions of observations and only a few features.
- Researchers worry about the false discovery rate (FDR), but seldom worry about the false negative rate (FNR).
- Fraction of important features not selected \(\gg 0\).
- Fraction of unimportant features selected \(\gg 0\).
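A quick sketch (an assumed setup, not from the talk) of how unstable data-driven selection is: refit a lasso on bootstrap resamples and track how often each feature is selected. True features are missed some of the time and noise features sneak in, which is exactly the FNR/FDR point above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                           # only the first 5 features matter
y = X @ beta + rng.normal(scale=3.0, size=n)

B = 200
freq = np.zeros(p)                       # selection frequency per feature
for _ in range(B):
    idx = rng.integers(0, n, n)          # bootstrap resample
    fit = Lasso(alpha=0.3).fit(X[idx], y[idx])
    freq += (fit.coef_ != 0)
freq /= B

print("selection frequency, true features: ", np.round(freq[:5], 2))
print("selection frequency, noise features:", np.round(freq[5:], 2))
```

Neither group of frequencies sits at 0 or 1, so no single selected subset can be trusted.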
CI for variable importance quantifies difficulty of variable selection
- Example: \(n=300\), \(12\) features, \(\beta_i=i\), \(\sigma=9\). Rank the predictors using the partial \(\chi^2\). Simulation shows the resulting ranking is far too noisy to trust.
- Gold standard of variable selection: a full Bayesian model with carefully selected shrinkage priors that are not necessarily sparsity priors. Project the full model onto simpler models; see Piironen and Vehtari (2017).
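A sketch of the simulation described above (my implementation; details assumed): \(n=300\), 12 features with \(\beta_i=i\), \(\sigma=9\). Predictors are ranked by partial Wald \(\chi^2\) (the squared \(t\)-statistic, 1 df), and the bootstrap gives confidence intervals for each predictor's importance rank.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 300, 12
beta = np.arange(1, p + 1, dtype=float)   # beta_i = i
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=9.0, size=n)

def importance_ranks(X, y):
    """Ranks 1..p (p = most important) by Wald chi-square of each OLS coef."""
    Xd = np.column_stack([np.ones(len(y)), X])
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    b = XtX_inv @ Xd.T @ y
    resid = y - Xd @ b
    s2 = resid @ resid / (len(y) - Xd.shape[1])
    chisq = b[1:] ** 2 / (s2 * np.diag(XtX_inv)[1:])   # partial Wald chi-sq
    return chisq.argsort().argsort() + 1

ranks = []
for _ in range(500):
    idx = rng.integers(0, n, n)           # bootstrap resample
    ranks.append(importance_ranks(X[idx], y[idx]))
ranks = np.array(ranks)

lo, hi = np.percentile(ranks, [2.5, 97.5], axis=0)
for j in range(p):                        # wide intervals = ranking is noisy
    print(f"beta={j + 1:2d}: 95% rank CI [{lo[j]:.0f}, {hi[j]:.0f}]")
```

Even with the true coefficients neatly ordered \(1, 2, \ldots, 12\), most of the rank intervals span several positions.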
ML vs. Statistical Models
- Statistical models: assume a probability distribution for the data; favor additivity; have identified parameters of interest; support inference, estimation, and prediction; most useful when the signal-to-noise ratio (SNR) is low.
- Machine learning: gives equal opportunity to interactions and main effects; prediction only; most useful when the SNR is high.
Gold Standards - not popular anymore
- Smooth, flexible calibration curve
- log likelihood
- log likelihood + log prior (the log posterior, up to a constant)
- explained outcome heterogeneity
- heterogeneity of the predictions (Kent & O’Quigley-type measures)
- Relative explained variance (relative \(R^2\)): ratio of variances of \(\hat y\) from a subset model to the full model
- Majority of ML papers do not demonstrate adequate understanding of predictive accuracy.
- Pr(Classified Correctly), sensitivity, specificity, precision, recall. These are improper, discontinuous scoring rules.
- ROC curves are highly problematic and have a high ink-to-information ratio.
- Optimal Bayes Decision that maximizes the expected utility: use posterior distribution and the utility function.
- ROC curves and AUROC play no role in this decision.
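The last point can be made concrete with a small sketch (the utility numbers are my own illustration, not from the talk): given a posterior probability of disease and a utility for each action/state pair, the optimal action simply maximizes posterior expected utility, and the treatment threshold falls out of the utilities with no ROC curve in sight.

```python
import numpy as np

# utilities[action, state]; actions: 0 = don't treat, 1 = treat
# states:  0 = healthy, 1 = diseased      (numbers are illustrative)
utilities = np.array([[ 0.0, -10.0],     # don't treat: very bad if diseased
                      [-1.0,   5.0]])    # treat: small cost if healthy

def bayes_decision(p_diseased):
    """Choose the action that maximizes posterior expected utility."""
    p = np.array([1.0 - p_diseased, p_diseased])
    expected = utilities @ p             # expected utility of each action
    return int(np.argmax(expected))

# With these utilities the two actions tie at p = 1/16 = 0.0625,
# so treatment is chosen whenever p exceeds that implied threshold.
for prob in (0.02, 0.10, 0.50):
    print(prob, "->", "treat" if bayes_decision(prob) else "don't treat")
# 0.02 -> don't treat; 0.10 -> treat; 0.50 -> treat
```

Changing the utilities moves the threshold; no discrimination measure computed over a population of cut-offs is needed.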