What are ARIMA models?

Published: 24 May 2024
Contributors: Joshua Noble

Introducing ARIMA models

ARIMA stands for Autoregressive Integrated Moving Average and it's a technique for time series analysis and for forecasting possible future values of a time series.

Autoregressive modeling and moving average modeling are two different approaches to forecasting time series data. ARIMA integrates these two approaches, hence the name. Forecasting is a branch of machine learning that uses the past behavior of a time series to predict one or more future values of that time series. Imagine that you're buying ice cream to stock a small shop. If you know that sales of ice cream have been rising steadily as the weather warms, you should probably predict that next week's order should be a little bigger than this week's order. How much bigger should depend on how much this week's sales differ from last week's sales. We can't forecast the future without a past to compare it to, so past time series data is very important for ARIMA and for all forecasting and time series analysis methods.


ARIMA is one of the most widely used approaches to time series forecasting, and it can be used in two different ways depending on the type of time series data that you're working with. In the first case, we create a non-seasonal ARIMA model, which doesn't require accounting for seasonality in your time series data. We predict the future simply based on patterns in the past data. In the second case, we account for seasonality: regular cycles that affect the time series. These cycles can be daily, weekly, or monthly, and they help define patterns in the past data of the time series that can be used to forecast future values. Like much of data science, the foundation of forecasting is having good time series data with which to train your models. A time series is an ordered sequence of measurements of a variable at equally spaced time intervals. It's important to remember that not every data set with a time element is actually time series data, because of this equally spaced time interval requirement.

The Box-Jenkins Method


In 1970, the statisticians George Box and Gwilym Jenkins proposed what has become known as the Box-Jenkins method for fitting any kind of time series model.1 The approach starts with the assumption that the process that generated the time series can be approximated using a model if the series is stationary. It consists of four steps:

Identification: Assess whether the time series is stationary and, if not, how many differences are required to make it stationary. Then generate differenced data for use in diagnostic plots. Identify the parameters of an ARMA model for the data from the autocorrelation and partial autocorrelation plots.

Estimation: Use the data to estimate the parameters of the model (that is, the coefficients).

Diagnostic Checking: Evaluate the fitted model in the context of the available data and check for areas where the model may be improved. In particular this involves checking for overfitting and calculating the residual errors.

Forecasting: Now that you have a fitted model, you can begin forecasting future values.

Once you’ve confirmed that your model fits your data correctly, you’re ready to begin ARIMA forecasting. We'll examine each of these steps in detail.

Characteristics of Time Series Data


A time series can be stationary or non-stationary. A stationary time series has statistical properties that are constant over time. This means that statistics like the mean, variance, and autocorrelation don't change over the data. Most statistical forecasting methods, including ARIMA, are based on the assumption that the time series can be made approximately stationary through one or more transformations. A stationary series is comparatively easy to predict because you can simply predict that the statistical properties will be about the same in the future as they were in the past. Working with non-stationary data is possible but difficult with an approach like ARIMA.
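One common way to check stationarity programmatically is the Augmented Dickey-Fuller test. Here is a minimal sketch using adfuller from statsmodels; the variable name series is illustrative, standing in for any pandas Series or array of equally spaced observations:

```python
from statsmodels.tsa.stattools import adfuller

def looks_stationary(series, alpha=0.05):
    # The ADF null hypothesis is that the series is non-stationary,
    # so a small p-value is evidence of stationarity.
    statistic, p_value = adfuller(series)[:2]
    print(f"ADF statistic: {statistic:.3f}, p-value: {p_value:.3f}")
    return p_value < alpha
```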

Another key feature of time series data is whether it has a trend present in the data. For instance, the prices of basic staples in a grocery store over the last 50 years would exhibit a trend because inflation would drive those prices higher. Predicting data that contains trends can be difficult because the trend obscures the other patterns in the data. If the data has a stable trend line to which it reverts consistently, it may be trend-stationary, in which case the trend can be removed by simply fitting a trend line and subtracting it from the data before fitting a model. If the data isn't trend-stationary, then it might be difference-stationary, in which case the trend can be removed by differencing. The simplest way of differencing is to subtract the previous value from each value to get a measure of how much change is present in the time series data. So, for instance, if $y_t$ is the value of time series Y at period t, then the first difference of Y at period t is $y_t - y_{t-1}$.

Here we can see a plot of a time series that's not stationary. It has an obvious upward trend and exhibits seasonality.

The seasonality here is a regular 12 month cycle. This could be addressed by differencing the time series by 12 units so that we difference April 1990 with April 1989. After we apply differencing with a 12 unit lag to the time series, we can see a more stationary time series. The variance of this time series still changes but an ARIMA model could be fit to this time series and forecasts made using it.
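As a minimal sketch of both kinds of differencing with pandas, here is a synthetic monthly series (the data below is illustrative, not the series plotted above):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with an upward trend, a 12-month cycle, and noise.
rng = np.random.default_rng(0)
index = pd.date_range("1985-01", periods=120, freq="MS")
values = 0.5 * np.arange(120) + 10 * np.sin(2 * np.pi * np.arange(120) / 12)
series = pd.Series(values + rng.normal(0, 1, size=120), index=index)

first_diff = series.diff(1).dropna()      # y_t - y_{t-1}: removes the linear trend
seasonal_diff = series.diff(12).dropna()  # y_t - y_{t-12}: removes the annual cycle
```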

Stationarity can be confusing. For instance, a time series that has cyclic behavior but no trend or seasonality is still stationary, as long as the cycles are not of a fixed length: when we observe the series, we can't know where the peaks and troughs of the cycles will occur. Generally, a stationary time series will have no predictable patterns in the long term. If you were to plot the time series data in a line chart, it would look roughly horizontal with a constant variance and no significant spikes or drops.

Autocorrelation

We can see the degree to which a time series is correlated with its past values by calculating the autocorrelation. Calculating the autocorrelation can answer questions about whether the data exhibits randomness and how related one observation is to an immediately adjacent observation. This can give us a sense of what sort of model might best represent the data. The autocorrelations are often plotted to show the correlation between points at each lag, up to some maximum number of lags.

The autocorrelation at lag k is defined as:

$$r_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}$$

where $r_k$ is the autocorrelation at lag k, T is the length of the time series, $y_t$ is the value of the time series at time t, and $\bar{y}$ is its mean. The autocorrelation coefficients make up the autocorrelation function, or ACF.

In an ACF plot, the number of lags (referred to as the lag order) is shown on the x-axis, whereas the correlation coefficient is shown on the y-axis. An autocorrelation plot can be created in Python using plot_acf from the statsmodels library, and in R using the acf function.
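As a sketch, the formula above can be computed directly with NumPy and checked against the plot_acf output; the synthetic series here is illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

def autocorrelation(y, k):
    # r_k from the formula above, for k >= 1.
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    return np.sum((y[k:] - y_bar) * (y[:-k] - y_bar)) / np.sum((y - y_bar) ** 2)

# Illustrative series with a 12-step cycle plus noise.
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * np.arange(120) / 12) + rng.normal(0, 0.3, size=120)

print(f"r_12 = {autocorrelation(y, 12):.3f}")  # strong positive correlation at lag 12
plot_acf(y, lags=36)  # lag order on the x-axis, correlation on the y-axis
plt.show()
```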

In this ACF plot of a time series differenced with a lag of 12 time units, the zero lag correlates perfectly with itself. The first lag is negative, the second lag is slightly positive, the third lag is negative, and so on. You'll notice that the 12th lag shows a strong positive correlation. Since we were looking at monthly data, this makes sense. We can see that the autocorrelations maintain roughly the same cycle throughout the time series, an indication that our time series still contains significant seasonality. ACF plots are also useful for helping to infer the parameters of the ARIMA model that will best fit the data.

Partial Autocorrelation Function (PACF)

Another important plot in preparing to use an ARIMA model on time series data is the partial autocorrelation function. An ACF plot shows the relationship between $y_t$ and $y_{t-k}$ for different values of k. If $y_t$ and $y_{t-1}$ are correlated, then $y_{t-1}$ and $y_{t-2}$ will also be correlated. But it's also possible for $y_t$ and $y_{t-2}$ to be correlated simply because they are both connected to $y_{t-1}$, rather than because of any new information contained in $y_{t-2}$ that could be used in forecasting $y_t$. To overcome this problem, we can use partial autocorrelations, which remove the effects of the intervening lag observations. These measure the relationship between $y_t$ and $y_{t-k}$ after removing the effects of lags 1 through k-1. So the first partial autocorrelation is identical to the first autocorrelation, because there is nothing between them to remove. Each partial autocorrelation can be estimated as the last coefficient in an autoregressive model.


Whether you're working in R or Python or another programming language or library, you'll have a way to calculate the PACF and create a PACF plot for easy inspection. A partial autocorrelation plot can be created in Python using plot_pacf from the statsmodels library, and in R using the pacf function.
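A minimal sketch, reusing the illustrative series y from the ACF example above:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

# The first partial autocorrelation equals the first autocorrelation;
# later lags have the effects of the intervening lags removed.
plot_pacf(y, lags=36, method="ywm")  # Yule-Walker estimation of each lag
plt.show()
```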

This PACF plot uses the same data as the ACF plot above. The PACF plot starts from lag 1 rather than lag 0 as in the ACF plot and shows strong correlations through the first year of lags; the lag at one year correlates with the same month of the previous year. After that first year, we see a decreasing amount of partial autocorrelation as the number of lags increases. Since we were looking at monthly data with a variance that changes year to year, this makes sense.

Autoregression and Moving Average

As its name indicates, ARIMA integrates autoregressive and moving average models into a single model, depending on the parameters passed. These two ways of modeling change throughout a time series are related but have some key differences. In an autoregressive model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself. This technique is similar to a linear regression model in how it uses past values as inputs to the regression. Autoregression is defined as:

$$y_t = C + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$


where $\varepsilon_t$ is white noise. This is like a multiple regression, but with lagged values of $y_t$ as predictors. We refer to this as an AR(p) model, an autoregressive model of order p.

A moving average model, on the other hand, uses past forecast errors rather than past values of the forecast variable in a regression. (This is distinct from moving average smoothing, which simply averages the k values in a sliding window, where k is the size of the window, and then advances the window.) The forecast values are evaluated against the actual values to determine the error at each step in the time series. A moving average model is defined as:

$$y_t = C + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$$


where $\varepsilon_t$ is white noise. We refer to this as an MA(q) model, a moving average model of order q. Of course, we do not observe the values of $\varepsilon_t$, so it is not really a regression in the usual sense. Notice that each value of $y_t$ can be thought of as a weighted moving average of the past few forecast errors.
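To make the distinction concrete, here is a sketch that simulates an AR(1) process and fits a pure AR model and a pure MA model with statsmodels; the simulated data and the chosen orders are illustrative:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate an AR(1) process: y_t = 0.7 * y_{t-1} + eps_t.
rng = np.random.default_rng(1)
eps = rng.normal(size=300)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + eps[t]

ar_fit = ARIMA(y, order=(2, 0, 0)).fit()  # pure AR(2): regression on past values
ma_fit = ARIMA(y, order=(0, 0, 2)).fit()  # pure MA(2): regression on past errors
print(ar_fit.params)  # constant plus estimated phi coefficients
print(ma_fit.params)  # constant plus estimated theta coefficients
```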

Typically in an ARIMA model you'll use either the autoregressive (AR) term or the moving average (MA) term. The ACF and PACF plots are often used to determine which of these terms is most appropriate.

Specifying an ARIMA model

Once the time series has been made stationary and the nature of the autocorrelations has been determined, it's possible to fit an ARIMA model. There are three key parameters for an ARIMA model, typically referred to as p, d, and q.

p: the order of the Autoregressive part of ARIMA 

d: the degree of differencing involved

q: the order of the Moving Average part

These are typically written in the following order: ARIMA(p, d, q). Many programming languages and packages provide an ARIMA function that can be called with the time series to be analyzed and these three parameters. Most often the data is split into a train set and a test set so that the accuracy of the model can be tested after it has been trained. It is usually not possible to tell just from looking at a time plot what values of p and q will be most appropriate for the data. However, it is often possible to use the ACF and PACF plots to determine appropriate values for p and q, which makes those plots important tools for working with ARIMA.
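As a minimal sketch of this workflow in Python with statsmodels; the synthetic data and the order (1, 1, 1) are illustrative, not a recommendation:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Illustrative data: a random walk with drift, which is non-stationary (so d=1 fits).
rng = np.random.default_rng(42)
y = np.cumsum(rng.normal(loc=0.2, scale=1.0, size=200))

train, test = y[:-24], y[-24:]                # hold back the last 24 points for testing
fitted = ARIMA(train, order=(1, 1, 1)).fit()  # p=1 AR term, d=1 difference, q=1 MA term
forecast = fitted.forecast(steps=len(test))   # multi-step-ahead forecast

rmse = np.sqrt(np.mean((forecast - test) ** 2))
print(f"Test RMSE: {rmse:.3f}")
```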

A rough rubric for when to use AR terms in the model:

  • The ACF plot shows autocorrelation decaying gradually towards zero
  • The PACF plot cuts off quickly towards zero
  • The ACF of the stationary series is positive at lag 1

A rough rubric for when to use MA terms in the model:

  • The series is negatively autocorrelated at lag 1
  • The ACF drops sharply after a few lags
  • The PACF decreases gradually rather than suddenly

There are a few classic ARIMA model types that you may encounter.

ARIMA(1,0,0) = first-order autoregressive model: if the series is stationary and autocorrelated, perhaps it can be predicted as a multiple of its own previous value, plus a constant. If the sales of ice cream for tomorrow can be directly predicted using only the sales of ice cream from today, then that is a first-order autoregressive model.

ARIMA(0,1,0) = random walk:  If the time series is not stationary, the simplest possible model for it is a random walk model. A random walk is different from a list of random numbers because the next value in the sequence is a modification of the previous value in the sequence. This is often how we model differenced values for stock prices.

ARIMA(1,1,0) = differenced first-order autoregressive model: If the errors of a random walk model are autocorrelated, perhaps the problem can be fixed by adding one lag of the dependent variable to the prediction equation, that is, by regressing the first difference of Y on itself lagged by one period.

ARIMA(0,1,1) without constant = simple exponential smoothing model: This is used for time series data with no seasonality or trend. It requires a single smoothing parameter (a coefficient value between 0 and 1) that controls the rate of influence from historical observations. In this technique, values closer to 1 mean that the model weights recent observations heavily and pays little attention to the distant past, while smaller values mean that more of the history is taken into account during predictions.

ARIMA(0,1,1) with constant = simple exponential smoothing with growth: This is the same as simple exponential smoothing except that there is an additive constant term that makes the value of the time series grow as it progresses.

There are many other ways that ARIMA models can be fit, of course, which is why we often calculate multiple models and compare them to see which one provides the best fit for our data. All of these are first-order models, which means that they capture linear processes. There are second-order models that capture quadratic processes, and higher-order models that capture more complex processes.

Comparing ARIMA models

Typically, multiple ARIMA models are fit to the data and compared with one another to find which one best predicts the patterns seen in the time series data. There are three key metrics to assess the accuracy of an ARIMA model:

Akaike’s Information Criterion, or AIC. This is widely used to select predictors for regression models, and it's also useful for determining the order of an ARIMA model. AIC quantifies both the goodness of fit of the model and the simplicity or parsimony of the model in a single statistic. A lower AIC score is better than a higher one, so we would prefer the model that has a lower score. AIC favors simpler models: if two models achieve roughly the same accuracy, the more complex one receives a higher score. There is also the corrected AIC, or AICc, which simply applies a small correction for the sample size.

Bayesian Information Criterion, or BIC. This is another criterion for model selection that penalizes complexity even more than the AIC. As with the AIC, models with lower BIC are generally preferred to those with higher scores. If your model is going to be used for longer-term forecasting, the BIC may be preferable, whereas for shorter-term forecasting the AIC may be preferable.

The sigma squared or sigma2 value is the variance of the model residuals. The sigma term describes the volatility of the hypothesized process. If you have highly volatile data but a very low sigma squared score, or conversely non-volatile data but a high sigma squared score, that is a sign that the model isn’t capturing the actual data generating process well.

If we have held back a test data set, then we can also compare accuracy metrics like RMSE for different forecast horizons. An ARIMA model can forecast values for a single time step in the future or for multiple steps at a time.
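A sketch of such a comparison, reusing the train/test split from the fitting example above; the candidate orders below are illustrative:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Fit several candidate orders and compare fit criteria and held-out accuracy.
candidates = [(1, 1, 0), (0, 1, 1), (1, 1, 1), (2, 1, 1)]
for order in candidates:
    fitted = ARIMA(train, order=order).fit()
    forecast = fitted.forecast(steps=len(test))
    rmse = np.sqrt(np.mean((forecast - test) ** 2))
    print(f"ARIMA{order}: AIC={fitted.aic:.1f}  BIC={fitted.bic:.1f}  RMSE={rmse:.3f}")
```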

Variations of ARIMA

One other approach to configuring and comparing ARIMA models is to use auto-ARIMA, which automates the tasks of generating and comparing ARIMA models. There are multiple ways to arrive at an optimal model. The algorithm generates multiple models and attempts to minimize the AICc and the error of the maximum likelihood estimation to obtain an ARIMA model.
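One common implementation in Python is the auto_arima function in the pmdarima package; a minimal sketch, reusing the illustrative training data from the fitting example above:

```python
import pmdarima as pm

# Stepwise search over candidate (p, d, q) orders; trace=True prints each model tried.
auto_model = pm.auto_arima(train, seasonal=False, stepwise=True, trace=True)
print(auto_model.summary())
```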

Seasonal Autoregressive Integrated Moving Average, SARIMA or Seasonal ARIMA, is an extension of ARIMA that supports time series data with a seasonal component. To do this, it adds three new hyperparameters to specify the autoregression, differencing, and moving average for the seasonal component of the series, as well as an additional parameter for the period of the seasonality. A SARIMA model is typically expressed as SARIMA(p,d,q)(P,D,Q)m, where the lowercase letters indicate the non-seasonal component of the time series, the uppercase letters indicate the seasonal component, and m is the number of periods in each season.
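In statsmodels, a seasonal model can be fit with the SARIMAX class; the orders and the 12-month period below are illustrative, and series stands in for a monthly pandas Series such as the one in the differencing sketch above:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# SARIMA(1,1,1)(1,1,1) with m=12, i.e. a 12-month seasonal cycle.
sarima_fit = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(sarima_fit.summary())
```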

Vector Autoregressive Models (or VAR Models) are used for multivariate time series. They are structured so that each variable is a linear function of past lags of itself and past lags of the other variables.
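A minimal sketch with the VAR class from statsmodels, using two synthetic related series (the data is illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Two synthetic random-walk series, differenced first to make each stationary.
rng = np.random.default_rng(2)
data = pd.DataFrame({
    "x": rng.normal(size=200).cumsum(),
    "z": rng.normal(size=200).cumsum(),
}).diff().dropna()

var_fit = VAR(data).fit(maxlags=4, ic="aic")  # choose the lag order by AIC
print(var_fit.summary())
```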

ARIMA models are a powerful tool for analyzing time series data to understand past processes as well as for forecasting future values of a time series. ARIMA models combine Autoregressive models and Moving Average models to give a forecaster a highly parameterizable tool that can be used with a wide variety of time series data.

Related resources

Forecasting with time series data using Autoregression models in R

Create and assess Autoregression models using R on watsonx.ai

Analyzing and forecasting with time series data using ARIMA models in Python

Create and assess ARIMA models using Python on watsonx.ai

Footnotes

1. George E. P. Box and Gwilym M. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day, 1970.