This came up in a conversation today. What’s the right way to handle time series data when the index variable is continuous?
I’m only aware of the Gaussian process model, general regression models with the CAR correlation function for residuals, and some of the usual moving-average models that you can find in any old time series book.
Gaussian processes are obviously the most popular (right now), due in part to the publicity that they get on Gelman’s blog, and in BDA3. But GP models make me a bit nervous for a few reasons:
The likelihood is not well identified (the marginal variance and the length-scale can trade off against one another), so strong priors are recommended
Fitting GP models is time consuming: exact inference requires factorizing an \(n \times n\) covariance matrix, which takes on the order of \(n^{3}\) operations and \(n^{2}\) memory (see the sketch after this list). This won’t work on my problems with \(100,000\) obs.
I’d rather not have to think about which correlation structure to use. The default for many people seems to be the squared exponential structure, but it’s unclear to me what’s so great about it, aside from the fact that it looks like a Normal kernel and may have some computational advantages. Maybe that’s enough to earn default status. (After all, statistics is the science of defaults, ha).
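For concreteness, here is a minimal sketch of exact GP regression in R with the squared exponential kernel. The data and the hyperparameters are made up for illustration (nothing here is tied to a particular application); the point is that the Cholesky factorization of the \(n \times n\) covariance matrix is where the \(n^{3}\) cost comes from.

```r
# Minimal sketch of exact GP regression with a squared exponential kernel.
# The data and hyperparameters (sig_f, ell, sig_n) are made up for
# illustration; in practice they would come from priors or optimization.
set.seed(1)
n <- 200
x <- sort(runif(n, 0, 10))          # continuous time index
y <- sin(x) + rnorm(n, sd = 0.3)    # toy observations

sig_f <- 1    # marginal (signal) standard deviation (assumed)
ell   <- 1    # length-scale (assumed)
sig_n <- 0.3  # noise standard deviation (assumed)

# Squared exponential covariance:
# k(x, x') = sig_f^2 * exp(-(x - x')^2 / (2 * ell^2))
K <- sig_f^2 * exp(-outer(x, x, "-")^2 / (2 * ell^2))

# The Cholesky factorization of the n x n matrix is the O(n^3) step
# that makes exact GPs painful at n = 100,000.
U <- chol(K + sig_n^2 * diag(n))              # upper-triangular factor
alpha <- backsolve(U, forwardsolve(t(U), y))  # solves (K + sig_n^2 I) alpha = y

f_hat <- K %*% alpha                          # posterior mean at the observed points
```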
I think generalized additive models are a great alternative to GP models. They are faster to fit than GP models and identified given some reasonable constraints (e.g., centering the smooth terms). I do not think they are as flexible as GP models, but they are good enough (at least in my experience).
As a small mathematical bonus, we have that GAMs are just a special case of GP models (Kimeldorf and Wahba, 1970).
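To make that concrete, here is a minimal GAM sketch using mgcv. The data frame `dat`, with a response `y` and a continuous `time` column, is hypothetical, and the basis size is just a starting point, not a recommendation.

```r
# Minimal sketch of the GAM alternative with mgcv.
# `dat` is a hypothetical data frame with a response y and a continuous
# time index; the basis dimension k = 20 is just a starting point.
library(mgcv)

fit <- gam(y ~ s(time, bs = "tp", k = 20), data = dat, method = "REML")
summary(fit)
plot(fit)   # the estimated smooth trend over time

# For something like 100,000 observations, bam() takes the same formula
# and is built for large data:
# fit_big <- bam(y ~ s(time, bs = "tp", k = 20), data = dat, method = "fREML")
```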
Modeling w/ time
I like to think about two ways to handle time series data:
Model the changing function directly through the linear predictor
Account for the correlated errors by imposing some correlation structure on the covariance matrix of the residuals.
Option 1 means doing better regressions. Option 2 is less common, but Pinheiro and Bates’ nlme book has great tips on how to approach it using their R package, which is my only experience with modeling the residual correlation matrix.
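Here is a minimal sketch of option 2 in that style, assuming a hypothetical data frame `dat` with a response `y`, a covariate `x`, and a continuous `time` variable: a plain regression for the mean, plus nlme’s continuous-time AR(1) structure (corCAR1) on the residuals.

```r
# Minimal sketch of option 2: regression for the mean, continuous-time AR(1)
# (corCAR1) correlation structure for the residuals. `dat` is a hypothetical
# data frame with columns y, x, and time.
library(nlme)

fit <- gls(y ~ x, data = dat,
           correlation = corCAR1(form = ~ time))
summary(fit)
intervals(fit)   # includes an interval for the correlation parameter Phi
```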
Research questions
As always, whatever you use to model time series data, I think the research question is the most important thing, and should come first. You want to pick a method that helps you answer it, and you hope that the model you use will produce results that are easy to communicate.
Unfortunately, with time series data it seems easy to lose track of the research question in favor of just finding a model that fits the data nicely. I don’t know why that is: perhaps the models get so complicated that they become practically useless for answering simple questions?
Of course, a nice fit doesn’t mean we can just generalize from sample to population willy-nilly, so that’s another potential source of trouble.