# Controversies in Machine Learning

## 2020/10/06

Categories: Machine Learning Tags: Machine Learning

Frank Harrell has a new keynote talk on YouTube. Watch it here.

While the name of the talk is controversies in predictive modeling and machine learning, I would say it is also Frank Harrell’s philosophy of statistics in a nutshell. Here are my notes on the talk; many of the same points are sprinkled throughout his RMS book.

## External Validation is Overrated

• Data splitting (into training and testing sets) is bad. A train/test split is not external validation. Dr. Harrell is a biostatistician, so this makes sense; for machine learning, perhaps it makes less sense.
• It’s better to use resampling to validate the entire process that produced the model, rather than just the single model under consideration.
• Feature selection is not reliable in most uses. The process is highly unstable.
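The instability claim is easy to see by simulation. Below is a minimal sketch of my own (not Harrell's code) using numpy: a toy selector keeps the 3 features most correlated with the outcome, and rerunning it on bootstrap resamples of the same dataset shows that the "best" feature set keeps changing. The data-generating parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, B = 100, 10, 3, 200  # observations, candidate features, set size, bootstraps

# Mildly correlated predictors via a shared latent factor (invented example)
latent = rng.normal(size=(n, 1))
X = 0.7 * latent + rng.normal(size=(n, p))
beta = np.array([1.0, 0.8, 0.6] + [0.0] * (p - 3))  # only the first 3 matter
y = X @ beta + rng.normal(scale=2.0, size=n)

def select_top_k(Xb, yb, k):
    """Toy selector: keep the k features with the largest |correlation| with y."""
    corr = np.abs([np.corrcoef(Xb[:, j], yb)[0, 1] for j in range(Xb.shape[1])])
    return frozenset(np.argsort(corr)[-k:])

selected_sets = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)  # bootstrap resample of the rows
    selected_sets.append(select_top_k(X[idx], y[idx], k))

n_distinct = len(set(selected_sets))
print(f"{n_distinct} distinct 'best' feature sets across {B} resamples")
```

If selection were stable, every resample would return the same set; instead many distinct sets appear, which is exactly why the selection step must live inside the resampling loop when validating.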

## Validate Researchers, not models

• The quality of researcher and analysis methodology used highly influences the reliability and usefulness of the resulting research.

## Bayesian modeling

• Frequentist penalized methods work well for prediction, but not for inference. Confidence intervals and $$p$$-values don’t extend well to the case where estimates are shrunk and therefore biased.
• Horseshoe priors work better than LASSO and elastic net (I’m not sure why???)
• There is a tendency to use methods that are fast, but we don’t stop to think about what the population of effects are. You can envision that population and encode it as a prior in Bayes.
• Ordinal predictors are easier in Bayes.
• Degrees of freedom (DF) for non-linear effects can be data-driven and still preserve operating characteristics.
• Imputation and modeling can be done in a single unified model (joint modeling)
• Validation is less necessary: overfitting does not occur in the usual sense; apparent overfitting arises because the analyst and the reader have two different priors. The problem of overfitting is translated into a problem of selecting a prior.
• Forward probabilities (e.g., the probability of an effect given the data) instead of backward probabilities (e.g., the probability of the data given a null hypothesis).
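On the horseshoe-vs-LASSO point above: one plausible intuition is that the horseshoe prior puts more mass both very near zero and far out in the tails than the Laplace prior implicit in the LASSO penalty, so it crushes noise coefficients while barely shrinking genuinely large ones. A minimal sketch of my own comparing prior draws (sample sizes and thresholds are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200_000

# Horseshoe draws: beta = lambda * z, lambda ~ half-Cauchy(0, 1), z ~ N(0, 1)
lam = np.abs(rng.standard_cauchy(m))
horseshoe = lam * rng.standard_normal(m)

# Laplace draws (the prior implicit in the LASSO penalty), unit scale
laplace = rng.laplace(scale=1.0, size=m)

def mass(draws):
    """Share of prior mass near zero and far out in the tails."""
    return np.mean(np.abs(draws) < 0.1), np.mean(np.abs(draws) > 3.0)

hs_near, hs_tail = mass(horseshoe)
la_near, la_tail = mass(laplace)
print(f"horseshoe: P(|b|<0.1)={hs_near:.3f}  P(|b|>3)={hs_tail:.3f}")
print(f"laplace:   P(|b|<0.1)={la_near:.3f}  P(|b|>3)={la_tail:.3f}")
```

The horseshoe wins on both counts: more mass near zero (stronger shrinkage of noise) and heavier tails (weaker shrinkage of real signals).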

## Mirage of Variable Selection

• Parsimony is the enemy of predictive discrimination. You don’t want to spend data trying to figure out which features to use.
• Maxwell’s Demon and the second law of thermodynamics: acquiring information has a cost. Feature selection requires spending information that could be better used for estimation and prediction.
• Pr(selecting the ‘right’ variables) = 0. You cannot use the data to tell you which elements of the data you should use. You don’t have enough information unless you have millions of observations and only a few candidate features.
• Researchers worry about the false discovery rate (FDR), but seldom worry about the false negative rate (FNR).
• Fraction of important features not selected >> 0.
• Fraction of unimportant features selected >> 0.
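Both fractions are easy to check by simulation. Here is a minimal sketch of my own (all numbers invented): repeatedly generate data with 10 weak-but-real and 10 null features sharing a latent factor, select features with a univariate |t| > 2 rule, and average how many real features are missed and how many nulls are kept.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 100, 20, 500

# 10 weak-but-real effects, 10 nulls; predictors share a latent factor
beta = np.concatenate([np.full(10, 0.2), np.zeros(10)])

missed, falsely_kept = [], []
for _ in range(reps):
    latent = rng.standard_normal((n, 1))
    X = np.sqrt(0.1) * latent + np.sqrt(0.9) * rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    # Toy selector: univariate |t| > 2 (roughly p < 0.05)
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    t = r * np.sqrt((n - 2) / (1 - r**2))
    keep = np.abs(t) > 2
    missed.append(np.mean(~keep[:10]))       # real features dropped (FNR-like)
    falsely_kept.append(np.mean(keep[10:]))  # null features kept (FDR-like)

missed_avg, false_avg = np.mean(missed), np.mean(falsely_kept)
print(f"avg fraction of real features missed: {missed_avg:.2f}")
print(f"avg fraction of null features kept:   {false_avg:.2f}")
```

Both averages come out well above zero, matching the two bullets above: selection misses real signals and admits noise at the same time.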

### CI for variable importance quantifies difficulty of variable selection

• Example: $$n=300$$, $$12$$ features with $$\beta_j=j$$, $$\sigma=9$$. Rank the features by their partial $$\chi^2$$ statistics. Simulation shows the rankings are too noisy to rely on.
• Gold standard of variable selection: a full Bayesian model with carefully selected shrinkage priors that are not necessarily sparsity priors. Project the full model onto simpler models, Piironen and Vehtari (2017).
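A sketch of that example, my own reconstruction rather than Harrell's code: simulate one dataset with $$n=300$$, $$\beta_j=j$$, $$\sigma=9$$, rank features by squared $$t$$ statistics (used here as a stand-in for partial $$\chi^2$$), and bootstrap the ranks to get confidence intervals on each feature's importance rank.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 300, 12, 9.0
beta = np.arange(1, p + 1).astype(float)  # beta_j = j

X = rng.standard_normal((n, p))
y = X @ beta + rng.normal(scale=sigma, size=n)

def ranks(Xb, yb):
    """Rank features by squared t statistic (stand-in for partial chi^2)."""
    XtX_inv = np.linalg.inv(Xb.T @ Xb)
    bhat = XtX_inv @ Xb.T @ yb
    resid = yb - Xb @ bhat
    s2 = resid @ resid / (len(yb) - Xb.shape[1])
    t2 = bhat**2 / (s2 * np.diag(XtX_inv))
    return np.argsort(np.argsort(t2)) + 1  # rank 1 = least, rank p = most important

B = 400
rank_draws = np.empty((B, p), dtype=int)
for b in range(B):
    idx = rng.integers(0, n, size=n)  # bootstrap resample of the rows
    rank_draws[b] = ranks(X[idx], y[idx])

lo, hi = np.percentile(rank_draws, [2.5, 97.5], axis=0)
for j in range(p):
    print(f"beta={j + 1:2d}  0.95 rank interval: [{lo[j]:.0f}, {hi[j]:.0f}]")
```

The intervals for neighboring coefficients overlap heavily: even with a clean linear model, the data cannot say which of two adjacent features is "more important", which is the difficulty the CI quantifies.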

## ML vs. Stats Models

### Stats Models

• Probability Distribution for the data.
• Identified parameters of interest
• Inference, Estimation, and Prediction
• Most useful when the signal-to-noise ratio (SNR) is low

### Machine Learning

• Algorithmic
• Equal opportunity for interactions and main effects
• Prediction only
• Most useful when SNR is high

## Predictive Measures

• Pr(classified correctly), sensitivity, specificity, precision, and recall are improper, discontinuous scoring rules.
• ROC curves are highly problematic and have a high ink-to-information ratio.
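"Improper" has a concrete meaning: a proper scoring rule is optimized, in expectation, by forecasting the true probability, while a classification-accuracy rule is not. A minimal numeric sketch of my own, with an invented true event probability of 0.7 and three candidate forecasts:

```python
import numpy as np

p_true = 0.7                       # true event probability (invented for illustration)
qs = np.array([0.51, 0.70, 0.99])  # three forecasts, all on the same side of 0.5

# Expected Brier score: p*(1-q)^2 + (1-p)*q^2 -- proper, minimized at q = p_true
brier = p_true * (1 - qs) ** 2 + (1 - p_true) * qs**2

# Expected accuracy after thresholding at 0.5 -- improper, blind to q's actual value
acc = np.where(qs > 0.5, p_true, 1 - p_true)

for q, b, a in zip(qs, brier, acc):
    print(f"q={q:.2f}  E[Brier]={b:.3f}  E[accuracy]={a:.2f}")
```

The Brier score correctly singles out the calibrated forecast $$q=0.7$$; accuracy is identical for all three, so it cannot penalize the wildly overconfident $$q=0.99$$.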

### Decision Making

• The optimal Bayes decision maximizes expected utility, computed from the posterior distribution and the utility function.
• ROC/AUROC play no role
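A minimal sketch of that decision rule, with a hypothetical two-action, two-state utility matrix of my own invention: the action chosen is whichever maximizes expected utility under the posterior, and the implied decision threshold falls out of the utilities, with no ROC curve involved.

```python
import numpy as np

# Hypothetical utilities (invented numbers): rows = action, cols = true state
# Actions: treat / don't treat; states: diseased / healthy
utility = np.array([
    [90.0, -10.0],    # treat: big gain if diseased, small cost if healthy
    [-100.0, 0.0],    # don't treat: big loss if diseased, nothing if healthy
])

def best_action(p_disease):
    """Pick the action maximizing expected utility under the posterior."""
    posterior = np.array([p_disease, 1 - p_disease])
    return int(np.argmax(utility @ posterior))  # 0 = treat, 1 = don't treat

# The treat/don't-treat threshold (here p > 0.05) comes from the utilities alone
for p in (0.02, 0.30):
    print(f"P(disease)={p:.2f} -> action {best_action(p)}")
```

Change the utilities and the threshold moves; nothing about sensitivity, specificity, or AUROC enters the calculation.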