The {rms} validate function

John Haman

2021/05/02

Categories: Statistics R Tags: Statistics R model checking

This post is about the statistics reported by the rms function validate. I find it very hard to remember what each of these statistics is, and the Regression Modeling Strategies book does not lay them out as nicely as I would like.

I’m going to concentrate on the validation of logistic regression models, since that’s the most relevant to my job.

library(rms)
dat <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

We fit an example model to predict the probability of admission to a school based on GPA and the rank of the applicant's undergraduate institution.

fit <- lrm(admit ~ gpa + rank, dat, x = TRUE, y = TRUE)
fit
## Logistic Regression Model
##  
##  lrm(formula = admit ~ gpa + rank, data = dat, x = TRUE, y = TRUE)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs           400    LR chi2      36.04    R2       0.121    C       0.678    
##   0            273    d.f.             2    g        0.788    Dxy     0.355    
##   1            127    Pr(> chi2) <0.0001    gr       2.200    gamma   0.356    
##  max |deriv| 3e-09                          gp       0.157    tau-a   0.154    
##                                             Brier    0.197                     
##  
##            Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept -2.8826 1.0917 -2.64  0.0083  
##  gpa        1.0270 0.3064  3.35  0.0008  
##  rank      -0.5822 0.1263 -4.61  <0.0001 
## 

Now we validate the model, which is very easy using rms.

validate(fit, B = 200,
         method = "boot") # Default method
##           index.orig training    test optimism index.corrected   n
## Dxy           0.3551   0.3542  0.3497   0.0045          0.3506 200
## R2            0.1208   0.1222  0.1171   0.0051          0.1156 200
## Intercept     0.0000   0.0000  0.0187  -0.0187          0.0187 200
## Slope         1.0000   1.0000  1.0130  -0.0130          1.0130 200
## Emax          0.0000   0.0000  0.0062   0.0062          0.0062 200
## D             0.0876   0.0890  0.0847   0.0043          0.0834 200
## U            -0.0050  -0.0050 -0.0002  -0.0048         -0.0002 200
## Q             0.0926   0.0940  0.0849   0.0091          0.0835 200
## B             0.1971   0.1960  0.1986  -0.0026          0.1996 200
## g             0.7883   0.7898  0.7742   0.0156          0.7727 200
## gp            0.1570   0.1557  0.1546   0.0011          0.1559 200
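Under the hood, validate uses Efron’s optimism bootstrap: refit the model on a bootstrap resample, compute each index on that resample (“training”) and on the original data (“test”), and average the training-minus-test differences. Here is a conceptual sketch for a single index, Dxy, using somers2 from Hmisc (loaded with rms); this illustrates the idea rather than reproducing rms’s exact implementation.

B <- 200
optimism <- replicate(B, {
  i <- sample(nrow(dat), replace = TRUE)   # draw a bootstrap resample
  boot_fit <- lrm(admit ~ gpa + rank, dat[i, ])
  # index on the resample (training) minus index on the original data (test)
  train <- somers2(predict(boot_fit, dat[i, ], type = "fitted"), dat$admit[i])["Dxy"]
  test  <- somers2(predict(boot_fit, dat, type = "fitted"), dat$admit)["Dxy"]
  train - test
})
apparent <- somers2(predict(fit, type = "fitted"), dat$admit)["Dxy"]
apparent - mean(optimism)   # optimism-corrected Dxy, cf. index.corrected above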

Let’s describe each of these indices. Of course, you’ll notice that accuracy, sensitivity, specificity, etc. are nowhere to be found. Broadly, validation metrics fall into one of two camps: calibration and discrimination. Quoting from Harrell:

There are two other terms for describing the components of predictive accuracy: calibration and discrimination. Calibration refers to the extent of bias. For example, if the average predicted mortality for a group of similar patients is 0.3 and the actual proportion dying is 0.3, the predictions are well calibrated. Discrimination measures a predictor’s ability to separate patients with different responses. A weather forecaster who predicts a 15% chance of rain every day of the year may be well calibrated in a certain locality if the average number of days with rain is 55 per year, but the forecasts are uninformative. A discriminating forecaster would be one who assigns a wide distribution of predictions and whose predicted risks for days where rain actually occurred are larger than for dry days. If a predictive model has poor discrimination, no adjustment or calibration can correct the model. However, if discrimination is good, the predictor can be calibrated without sacrificing the discrimination.

Discrimination Indexes

R2

The Nagelkerke \(R^2\) index: a rescaled version of the Cox & Snell R-square that stretches the statistic to cover the full range from 0 to 1. This pseudo-\(R^2\) statistic measures the improvement of the fitted model over the null model.

\[ R^2 = \frac{1 - \{ L(0) / L(\hat{\beta})\} ^ {2/n}} { 1 - L(0)^{2/n}} \]

where

  • \(L(0)\) is the maximum likelihood of the null model,
  • \(L(\hat{\beta})\) is the maximum likelihood of the fitted model, and
  • \(n\) is the sample size.


A value close to 1 is desirable.

In particular, the Nagelkerke \(R^2\) is consistent with classical \(R^2\), is consistent with maximum likelihood estimation,¹ and is asymptotically independent of the sample size \(n\).
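The statistic can be checked by hand from the two deviances that lrm stores (a sketch; fit$deviance holds -2 times the maximum log-likelihood of the null and fitted models, in that order):

n   <- nrow(dat)
ll0 <- -fit$deviance[1] / 2   # max log-likelihood, null model
ll1 <- -fit$deviance[2] / 2   # max log-likelihood, fitted model
r2_cs  <- 1 - exp(2 * (ll0 - ll1) / n)   # Cox & Snell R-square
r2_max <- 1 - exp(2 * ll0 / n)           # its maximum attainable value
r2_cs / r2_max                           # Nagelkerke R2; should match 0.121 above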

g

The \(g\) index, also called Gini’s mean difference, calculated here on the log-odds (linear predictor) scale. It is a measure of the model’s predictive discrimination based only on the fitted values. A larger value of \(g\) implies greater discrimination among outcomes.

\[ g = \frac{1}{n(n-1)} \sum_{i \neq j} |x_i^T \hat{\beta} - x_j^T \hat{\beta}| \]

Its value is the same as GiniMd(predict(fit)).


Minimum is 0, maximum is unbounded.
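A quick hand check of the definition against GiniMd (from Hmisc, loaded with rms):

lp <- predict(fit)   # linear predictors, log-odds scale
n  <- length(lp)
sum(abs(outer(lp, lp, "-"))) / (n * (n - 1))   # mean |difference| over pairs
GiniMd(lp)                                     # same value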

gr

The \(g\) index on the odds-ratio scale. This is \(\exp\{g\}\). Its added value is dubious to me.

gp

The \(g\) index on the probability scale.

Same as GiniMd(predict(fit, type = "fitted")).

\[ g_p = \frac{1}{n(n-1)} \sum_{i \neq j} |\hat{p}_i - \hat{p}_j|. \]

Another measure of discrimination. Must be between 0 and 1.

Brier, B

The mean square error between predictions and observations. The Brier score is a strictly proper scoring rule.

\[ B = \frac{1}{n} \sum_i^n (\hat{p}_i - y_i)^2, \]

where \(y_i\) is the observed 0/1 outcome.

There are decompositions of the Brier score into discrimination and calibration components.

A score near 0 is desirable. The lower the score, the better the predictions are calibrated. A Brier score of 0.25 is associated with random guessing when dichotomous outcomes are equally likely.
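The hand computation is one line, using the fitted probabilities and the 0/1 outcomes:

p_hat <- predict(fit, type = "fitted")
mean((p_hat - dat$admit)^2)   # should match B = 0.1971 above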

The Brier score may be inadequate for very rare events.

The Brier score is not interpretable as a global measure of model performance, but it is very useful for model comparison.

Rank Discrimination Indexes

C (concordance)

The c-index is also called the AUC, the area under the ROC curve. A value close to 1 is best. A value close to 0.5 indicates the model is scarcely better than random guessing. A model having \(c\) greater than roughly 0.8 has some utility in predicting the responses of individual subjects.

The original reference is due to Harrell.

The c-index is related to the Wilcoxon rank test. The c-index is computed by taking all possible pairs of subjects such that one subject responded and the other did not. The index is the proportion of such pairs in which the responder has a higher predicted probability of response than the nonresponder.

The c-index has also been linked to the Brier score.

The c-index is not an improper scoring rule, because it does not condition on a single threshold; but it is not a proper scoring rule either, because it effectively applies all possible thresholds.

The c-index is scale invariant and threshold invariant.

In clinical studies, the c-statistic gives the probability that a randomly selected patient who experienced an event (e.g., a disease or condition) had a higher risk score than a patient who had not experienced the event.
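Here is the pair-counting definition computed directly (a sketch; pairs tied on the predicted probability count one half):

p_hat <- predict(fit, type = "fitted")
p1 <- p_hat[dat$admit == 1]        # predictions for responders
p0 <- p_hat[dat$admit == 0]        # predictions for nonresponders
d  <- outer(p1, p0, "-")           # all responder/nonresponder pairs
mean(d > 0) + 0.5 * mean(d == 0)   # should match C = 0.678 above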

Dxy

Somers’ \(D_{xy}\) index.

For binary data, \(D_{xy} = 2 \times (c - 1/2)\), where \(c\) is the c-index, or AUC.

Somers’ \(D\) is a rank measure between all predicted probabilities and all observed outcomes; it takes values between -1 and 1. When \(D_{xy} = 1\), the predictions are perfectly discriminating.

Rank-based discrimination measures have the advantage of being insensitive to the prevalence of positive responses.

\(D_{xy}\) is related to Kendall’s \(\tau_a\): \(D_{xy} = \tau(x, y) / \tau(y, y)\). In the case of logistic regression, we compute \(\tau(\hat{p}, y) / \tau(y, y)\).
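The c-index and \(D_{xy}\) are both returned by somers2 from Hmisc, which makes the \(D_{xy} = 2 \times (c - 1/2)\) identity easy to verify:

p_hat <- predict(fit, type = "fitted")
sm <- somers2(p_hat, dat$admit)   # reports C and Dxy together
sm
2 * (sm["C"] - 0.5)               # equals sm["Dxy"]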

gamma

The Goodman-Kruskal \(\gamma\) index, a different measure of rank correlation. Values range from -1 to 1, and a \(\gamma\) of 0 indicates no association. The \(\gamma\) statistic is, like \(c\) and \(D_{xy}\), based on the counts of concordant and discordant pairs. The formula is

\[ \gamma = \frac{N_c - N_d}{N_c + N_d}, \]

where

  • \(N_c\) is the number of concordant pairs in the data, and
  • \(N_d\) is the number of discordant pairs in the data.

tau-a

Kendall’s \(\tau\) rank correlation measure. \(\tau_a\) does not make an adjustment for ties. The correlation measure is strictly between -1 and 1.

\[ \tau_a = \frac{N_c - N_d}{n(n-1)/2}, \]

where \(N_c\) and \(N_d\) are defined above.

Not the same as cor(x, y, method = "kendall"), which calculates \(\tau_b\).

Not much different from \(D_{xy}\).

In this case, we calculate the rank correlation between predictions and observations.
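Both \(\gamma\) and \(\tau_a\) fall out of the same pair counts. A sketch (pairs with tied outcomes or tied predictions contribute to neither \(N_c\) nor \(N_d\)):

p_hat <- predict(fit, type = "fitted")
d  <- outer(p_hat[dat$admit == 1], p_hat[dat$admit == 0], "-")
Nc <- sum(d > 0)                 # concordant pairs
Nd <- sum(d < 0)                 # discordant pairs
(Nc - Nd) / (Nc + Nd)            # gamma; should match 0.356 above
n <- nrow(dat)
(Nc - Nd) / (n * (n - 1) / 2)    # tau-a; should match 0.154 above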

Validate-Specific Measures

Intercept and Slope

The estimated probabilities from a fitted logistic regression are \(\hat{p} = 1/(1+ \exp\{-X\hat{\beta}\})\). If we had out-of-sample test data, we could estimate the calibration parameters that correct the fitted logistic regression for new data. The idea is to calculate the calibrated probabilities,

\[ p_c = \frac{1}{1 + \exp\{- (\gamma_0 + \gamma_1 X \hat{\beta})\}} \]

Trying to estimate \(\gamma = (\gamma_{0}, \gamma_{1})\) from the training data will always result in \(\gamma = (0, 1)\), which is not very useful. However, we can use Efron’s bootstrap to estimate the over-optimism in \(\gamma\) and thus obtain calibrated predictions. Some simulations have shown that this produces an efficient estimate of \(\gamma\).


The corrected slope can be thought of as a shrinkage factor that takes overfitting into account. The corrected intercept is an overall prevalence correction.
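A sketch of the idea: regress the outcome on the linear predictor. On the training data this returns the uninteresting \((0, 1)\) by construction; on genuinely new data the coefficients would estimate \(\gamma_0\) and \(\gamma_1\).

lp <- predict(fit)   # linear predictor, X beta-hat
coef(glm(admit ~ lp, family = binomial, data = dat))   # approx. (0, 1) here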

Emax

Simply, this is \(E_{max} = \max |p_c - \hat{p}|\): the maximum difference between the raw predicted probabilities and the recalibrated probabilities.

Since Emax does not weight the discrepancies by the actual distribution of predictions, it may be preferable to compute the average absolute discrepancy over the actual distribution of predictions.
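Plugging the optimism-corrected intercept and slope from the validate output above into the calibration formula gives a rough hand check (validate may evaluate the maximum over a slightly different range of predictions):

lp    <- predict(fit)
p_hat <- plogis(lp)                     # raw predicted probabilities
p_cal <- plogis(0.0187 + 1.0130 * lp)   # corrected intercept and slope from above
max(abs(p_cal - p_hat))                 # should be close to Emax = 0.0062 above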

D

Discrimination index.

\[ D = \frac{L(\gamma_0, 0) - L(\gamma_0, \gamma_1) - 1}{n}, \]

where \(L(\cdot, \cdot)\) is -2 times the maximum log-likelihood of the calibrated logistic model, that is,

\[ L = -2 \sum_i^n \left[ y_i \log(p_{i,c}) + (1 - y_i) \log(1 - p_{i,c}) \right]. \]

Dividing by \(n\) makes the index \(D\) independent of the sample size.

The “-1” penalty is a necessary correction for estimating one parameter.

The discrimination index is the difference in quality between the best constant predictor (one that on average predicts the overall prevalence of an event) and the best calibrated predictor.
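On the training data, \(L(\gamma_0, 0) - L(\gamma_0, \gamma_1)\) is just the model likelihood ratio chi-square, so the apparent \(D\) can be checked directly:

(fit$stats["Model L.R."] - 1) / nrow(dat)   # 0.0876, the index.orig for D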

U

Unreliability index — unitless index of how far the logit calibration curve intercept and slope are from (0, 1). Values close to 0 are best.

\[ U = \frac{L(0, 1) - L(\gamma_0, \gamma_1) - 2}{n} \]

The penalization of “-2” is due to estimating the calibration parameters, \(\gamma\).

Unreliability is the difference in quality of the uncalibrated predictions and the quality of the slope/intercept corrected predictions.

The statistic can be further decomposed to calculate unreliability due to overall prevalence and unreliability due to the need for a slope correction.
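On the training data, the calibrated fit coincides with the original fit, so \(L(0, 1) = L(\gamma_0, \gamma_1)\) and the apparent \(U\) reduces to \(-2/n\):

-2 / nrow(dat)   # -0.005, the index.orig for U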

Q

Overall Quality. Q = D - U.

A logarithmic accuracy score: a scaled version of the log-likelihood achieved by the predictive model.

\[ Q = \frac{L(\gamma_0, 0) - L(0,1) + 1}{n}. \]

An overall summary index for the quality of predictions: the difference in quality between the best constant predictor and the predictions as they stand (with no calibration). It decomposes into the unreliability and discrimination measures, to assess lack of quality due to either issue.
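The decomposition is easy to verify with the apparent values from the validate output:

0.0876 - (-0.0050)   # D - U = 0.0926, the index.orig for Q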


  1. The maximum likelihood estimates of the model parameters maximize the statistic.↩︎