## 2021/06/01

Categories: Statistics R Tags: Application Survival analysis rms

## Data

The REIGN data set contains information on the tenure of most world leaders for the past several decades. I found the data while perusing the a vignette for the finite interval forecasting engine (FIFE), which is an interesting statistical analysis tool in its own right. The FIFE team used to the data to demonstrate the tool’s capability to generate continuation probabilities for the survival of current world leaders.

There is a lot of good stuff in this data set, and it strikes me as an interesting example for a survival analysis. We are very fortunate to have a codebook available that explains all the interesting variables in the data.

I downloaded the REIGN data and stored it in my blog directory.

For my exploration, I need to load the data and a few helper packages. Rms loads lots of other useful packages, like ggplot2 and Hmisc, so no need to load them manually.

library(rms)
library(coxme) # for mixed effects cox model, if necessary
library(data.table)
library(forcats)

theme_set(theme_bw(base_size = 15))

dat <- dat[order(leader, tenure_months)]

There are a few items that I wanted to accomplish with this dataset:

1. Practice my Cox modeling skills, they are underdeveloped.
2. Determine survival probabilities and mean survival times for various countries. I want to see how well Cox models can do without an explicit parametric assumption.
3. Analyze the factor that contribute the most to leader survival.

Unlike FIFE, I only need the most recent observation from each world leader to make my survival model. This is because FIFE is a blend of survival analysis and time series analysis, which depends on the history of an observation that led up to it’s termination. A Cox model or a parametric survival model takes as the experimental unit the failures times of the data, and information that can be summarized at that level.

# Just take last obs per leader for cox model
dat <- dat[, .SD[.N], by = leader]

The data requires minimal work to get it into shape1. Below you will notice that I created a new variable status to use as an indicator for leaders whose tenure is right censored – that is, they are still in power as of May 2021.

dat[, government := factor(government)]
dat[, militarycareer := factor(militarycareer)]
dat[, country := factor(country)]
dat[, ccode := NULL]
dat[, tenure_years := tenure_months / 12]
dat[, status := ifelse(year == 2021 & month == 5, 0, 1)]

I also prefer to remove many variables that I don’t believe I will need. These are essentially only of interest if we need to model the time process related to the hazard of leaving political office. Consult the codebook for information on variates that I’m deleting. Some of these are simply month-to-month government data, like whether a change in legislative power occurred, and other variates are outputs from machine learning models that predict the probability of a coup. The latter I’m definitely not interested in, but the former may be useful for a survival model with time-dependent covariates.

dat[, c("gov_democracy", "dem_duration", "anticipation",
"election_now", "election_recent", "leg_recent",
"direct_recent", "indirect_recent", "victory_recent",
"defeat_recent", "change_recent", "nochange_recent",
"delayed", "lastelection", "loss", "irregular",
"political_violence", "prev_conflict", "pt_suc",
"pt_attempt", "precip", "couprisk", "pctile_risk") := NULL]

Provide a complete description of the reduced REIGN data. In total, there are over 2000 leaders in the dataset, representing over 200 countries. Data collection dates back to 1950. The average leader age at the end of tenure is about 60 years old. About 16% of leaders have a military career.

The database authors partition government types into 16 types.

There is no missing data, but the most recent data suffer from censoring.

options(datadist = datadist(dat))
html(describe(dat))

Some of the plots in the description are rather tiny in html, but interesting. Let’s create larger versions:

ggplot(dat, aes(x = age)) +
geom_histogram(binwidth = 1)

ggplot(dat, aes(x = government)) +
geom_bar() +
coord_flip()

ggplot(dat, aes(x = tenure_months / 12)) +
geom_density() +
xlab("Years in power")

There is an interesting pop in the marginal age distribution. What could it be?

## Stratified analysis

Stratified analysis is mostly suboptimal compared to regression analysis, but it’s still an essential part of data exploration. I want to take a couple different looks at the data before I formulate a model.

First, let’s see a barchart of the governments by average length of tenure.

tenure_by_gov <- dat[, mean(tenure_years), by = government
][, government := fct_reorder(government, V1)]

ggplot(tenure_by_gov, aes(government, V1)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Ave. years in power")

Same as above, but with countries instead of governments.

tenure_by_country <- dat[, mean(tenure_years), by = country
][, country := fct_reorder(country, V1)]

ggplot(tenure_by_country, aes(country, V1)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Ave. years in power")
with(dat, plot(Surv(tenure_years, status),
main = "Marginal survival function for years in office"))
abline(v = 4, lty = 2, col = "gray")
text(4, 0.75, "4 Years", pos = 4)

## Cox regression, briefly

fitc <- cph(Surv(tenure_years, status) ~ government + country + militarycareer + male + rcs(year, 4),
dat, x = TRUE, y = TRUE, surv = TRUE)

The model fit has a likelihood ratio statistic of 1499.16 on 219 degrees of freedom, high significant, so some modeling is warranted.

## Investigate model fit

#print(fitw, coefs = FALSE)
#ggplot(Predict(fitw))

## Check proportionality assumption

zph <- cox.zph(fitc)
plot(zph, var = "government")
plot(zph, var = "country")
plot(zph, var = "militarycareer")

## Determine important factors

We have (at least) two choices when it comes to figuring out which factors matter most in the model: likelihood ratio tests or Wald tests. A likelihood ratio deviance test is the default test in survival::coxph whereas Wald tests are the default ANOVA-method for rms::cph. Here we use Wald tests, because they condition on the other factors in the model, however I might prefer a likelihood ratio test to judge overall model relevance. The overall model LR test is given in the output of print(fit1), and highly significant, indicating that some modeling is warranted.

anova(fitw)
plot(anova(fitc))
S <- Survival(fit1)
med <- Quantile(fit1)
m <- Mean(fit1)

## Model validation

#validate(fit1, B = 200)

## Model calibration

cal <- calibrate(fitw, B = 200, u = 1)
saveRDS(cal, "cal_fitw.rds")
plot(cal)

## Presentation of the model

1. Otherwise I would not be playing with it in my free time!↩︎