Frank Harrell has a new keynote talk on YouTube. Watch it here.
While the title of the talk is "Controversies in Predictive Modeling and Machine Learning," I would say it is also Frank Harrell's philosophy of statistics in a nutshell. Here are my notes on the talk; the same themes are sprinkled throughout his RMS (Regression Modeling Strategies) book.
External Validation is Overrated:
- Data splitting (into training and test sets) is bad, and training/testing on a split is not external validation. Dr. Harrell is a biostatistician, so this advice makes sense for his field; for machine learning with very large datasets it may apply less.
- It’s better to use resampling to validate the entire process that produced the model than to validate only the current model under consideration.
- Feature selection is not reliable in most uses. The process is highly unstable.
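A minimal sketch (my own toy setup, not from the talk) of optimism-corrected bootstrap validation in the spirit of Harrell's `rms::validate`: the *entire* modeling process, here univariate feature screening plus a logistic fit, is repeated inside every resample, and the Brier score (a proper scoring rule) is used rather than classification accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

def whole_process(X, y):
    """Repeat every data-driven step: screen features, then fit."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.argsort(corr)[-3:]              # arbitrary choice: keep top 3
    return keep, LogisticRegression().fit(X[:, keep], y)

keep, model = whole_process(X, y)
apparent = brier_score_loss(y, model.predict_proba(X[:, keep])[:, 1])

optimism = []
for _ in range(200):
    idx = rng.integers(0, n, n)               # bootstrap resample
    kb, mb = whole_process(X[idx], y[idx])    # rerun the WHOLE process
    b_boot = brier_score_loss(y[idx], mb.predict_proba(X[idx][:, kb])[:, 1])
    b_orig = brier_score_loss(y, mb.predict_proba(X[:, kb])[:, 1])
    optimism.append(b_orig - b_boot)          # >= 0 on average (lower = better)

corrected = apparent + np.mean(optimism)
print(f"apparent Brier {apparent:.3f}, optimism-corrected {corrected:.3f}")
```

If the screening step were done once outside the loop, the optimism estimate would miss the instability that the screening itself introduces.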
Validate Researchers, not models
- The quality of researcher and analysis methodology used highly influences the reliability and usefulness of the resulting research.
- Frequentist penalized methods work well for prediction, but not for inference: confidence intervals and \(p\)-values don’t extend well to the case where estimates are biased by penalization.
- Horseshoe priors work better than LASSO and elastic net (I’m not sure why???)
- There is a tendency to use methods that are fast, but we don’t stop to think about what the population of effects are. You can envision that population and encode it as a prior in Bayes.
- Ordinal predictors are easier in Bayes.
- Degrees of freedom (DF) for non-linear effects can be chosen in a data-driven way and still preserve operating characteristics.
- Imputation and modeling can be done in a single unified model (joint modeling)
- Validation is less necessary: overfitting in the usual sense does not occur; what looks like overfitting arises when the analyst and the reader have different priors. The problem of overfitting becomes a problem of selecting a prior.
- Use forward probabilities, e.g. \(\Pr(\text{effect} > 0 \mid \text{data})\), instead of backward probabilities such as \(p\)-values, which condition on a hypothesis rather than on the data.
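For reference (my addition, not from the talk): the horseshoe prior of Carvalho, Polson, and Scott (2010) draws each coefficient from a normal whose scale gets a heavy-tailed half-Cauchy prior:

\[
\beta_j \mid \lambda_j, \tau \sim \mathcal{N}\!\left(0, \lambda_j^2 \tau^2\right), \qquad
\lambda_j \sim \mathrm{C}^{+}(0, 1), \qquad
\tau \sim \mathrm{C}^{+}(0, 1).
\]

The heavy tails leave truly large coefficients nearly unshrunk while the sharp peak at zero shrinks small ones aggressively, which is one commonly cited reason it can outperform the LASSO’s single global penalty.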
Mirage of Variable Selection
- Parsimony is the enemy of predictive discrimination. You don’t want to spend data trying to figure out which features to use.
- Maxwell’s demon and the second law of thermodynamics: feature selection requires spending information that could be better used for estimation and prediction.
- \(\Pr(\text{selecting the ‘right’ variables}) = 0\). You cannot use the data to tell you which elements of the data to use; there isn’t enough information unless you have millions of observations and only a few features.
- Researchers worry about the false discovery rate (FDR), but seldom worry about the false negative rate (FNR).
- Fraction of important features not selected \(\gg 0\).
- Fraction of unimportant features selected \(\gg 0\).
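A quick sketch (an assumed setup, not from the talk) of how unstable data-driven selection is: refit a lasso on bootstrap resamples and track how often each feature is selected. True features are missed some of the time and noise features sneak in, which is exactly the FNR/FDR point above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                           # only the first 5 features matter
y = X @ beta + rng.normal(scale=3.0, size=n)

B = 200
freq = np.zeros(p)                       # selection frequency per feature
for _ in range(B):
    idx = rng.integers(0, n, n)          # bootstrap resample
    fit = Lasso(alpha=0.3).fit(X[idx], y[idx])
    freq += (fit.coef_ != 0)
freq /= B

print("selection frequency, true features: ", np.round(freq[:5], 2))
print("selection frequency, noise features:", np.round(freq[5:], 2))
```

Neither group of frequencies sits at 0 or 1, so no single selected subset can be trusted.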
CI for variable importance quantifies difficulty of variable selection
- Example: \(n=300\), \(12\) features, \(\beta_i=i\), \(\sigma=9\). Rank the predictors using the partial \(\chi^2\). Simulation shows the resulting ranking is far too noisy to trust.
- Gold standard of variable selection: a full Bayesian model with carefully selected shrinkage priors that are not necessarily sparsity priors. Project the full model onto simpler models; see Piironen and Vehtari (2017).
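A sketch of the simulation described above (my implementation; details assumed): \(n=300\), 12 features with \(\beta_i=i\), \(\sigma=9\). Predictors are ranked by partial Wald \(\chi^2\) (the squared \(t\)-statistic, 1 df), and the bootstrap gives confidence intervals for each predictor's importance rank.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 300, 12
beta = np.arange(1, p + 1, dtype=float)   # beta_i = i
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=9.0, size=n)

def importance_ranks(X, y):
    """Ranks 1..p (p = most important) by Wald chi-square of each OLS coef."""
    Xd = np.column_stack([np.ones(len(y)), X])
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    b = XtX_inv @ Xd.T @ y
    resid = y - Xd @ b
    s2 = resid @ resid / (len(y) - Xd.shape[1])
    chisq = b[1:] ** 2 / (s2 * np.diag(XtX_inv)[1:])   # partial Wald chi-sq
    return chisq.argsort().argsort() + 1

ranks = []
for _ in range(500):
    idx = rng.integers(0, n, n)           # bootstrap resample
    ranks.append(importance_ranks(X[idx], y[idx]))
ranks = np.array(ranks)

lo, hi = np.percentile(ranks, [2.5, 97.5], axis=0)
for j in range(p):                        # wide intervals = ranking is noisy
    print(f"beta={j + 1:2d}: 95% rank CI [{lo[j]:.0f}, {hi[j]:.0f}]")
```

Even with the true coefficients neatly ordered \(1, 2, \ldots, 12\), most of the rank intervals span several positions.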
ML vs. Statistical Models
- Statistical models: assume a probability distribution for the data; favor additivity; have identified parameters of interest; support inference, estimation, and prediction; most useful when the signal-to-noise ratio (SNR) is low.
- Machine learning: gives equal opportunity to interactions and main effects; prediction only; most useful when the SNR is high.
Gold Standards - not popular anymore
- Smooth, flexible calibration curve
- log likelihood
- log likelihood + log prior (the log posterior, up to a constant)
- explained outcome heterogeneity
- heterogeneity of the predictions (Kent & O’Quigley-type measures)
- Relative explained variance (relative \(R^2\)): ratio of variances of \(\hat y\) from a subset model to the full model
- Majority of ML papers do not demonstrate adequate understanding of predictive accuracy.
- Pr(Classified Correctly), sensitivity, specificity, precision, recall. These are improper, discontinuous scoring rules.
- ROC curves are highly problematic and have a high ink-to-information ratio.
- Optimal Bayes Decision that maximizes the expected utility: use posterior distribution and the utility function.
- ROC curves and AUROC play no role in this decision.
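The last point can be made concrete with a small sketch (the utility numbers are my own illustration, not from the talk): given a posterior probability of disease and a utility for each action/state pair, the optimal action simply maximizes posterior expected utility, and the treatment threshold falls out of the utilities with no ROC curve in sight.

```python
import numpy as np

# utilities[action, state]; actions: 0 = don't treat, 1 = treat
# states:  0 = healthy, 1 = diseased      (numbers are illustrative)
utilities = np.array([[ 0.0, -10.0],     # don't treat: very bad if diseased
                      [-1.0,   5.0]])    # treat: small cost if healthy

def bayes_decision(p_diseased):
    """Choose the action that maximizes posterior expected utility."""
    p = np.array([1.0 - p_diseased, p_diseased])
    expected = utilities @ p             # expected utility of each action
    return int(np.argmax(expected))

# With these utilities the two actions tie at p = 1/16 = 0.0625,
# so treatment is chosen whenever p exceeds that implied threshold.
for prob in (0.02, 0.10, 0.50):
    print(prob, "->", "treat" if bayes_decision(prob) else "don't treat")
# 0.02 -> don't treat; 0.10 -> treat; 0.50 -> treat
```

Changing the utilities moves the threshold; no discrimination measure computed over a population of cut-offs is needed.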