What is Autocorrelation?

Published: 24 May 2024
Contributors: Joshua Noble, Eda Kavlakoglu

Autocorrelation is a core technique for analyzing and modeling time series data. It’s widely used in econometrics, signal processing and demand prediction.

 

Autocorrelation, or serial correlation, analyzes time series data to look for correlations between values at different points in the series. This key method of analysis measures how a variable correlates with itself: instead of calculating the correlation coefficient between two different variables, such as X1 and X2, we calculate the degree of correlation between a variable and its own past values at time steps throughout the data set.

When building a linear regression model, one of the primary assumptions is that the errors in predicting the dependent variable are independent of one another. With time series data, however, you'll often find errors that are time dependent; that is, the dependency in the errors arises from a temporal component. Error terms that are correlated over time are called autocorrelated errors. These errors cause issues for common estimation methods such as ordinary least squares. The usual way to address them is to regress the dependent variable on itself using the time lags identified by an autocorrelation test. A 'lag' is simply a previous value of the dependent variable. If you have monthly data and want to predict the upcoming month, you might use the values of the previous two months as input, which means you are regressing the previous two lags on the current value.
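As a minimal sketch of this idea, the following Python snippet (assuming the statsmodels library and a hypothetical monthly series y) fits an autoregressive model that uses the previous two lags to predict the current value:

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Hypothetical monthly series: a drifting random walk as stand-in data
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=120))

# Regress the current value on its previous two lags (an AR(2) model)
model = AutoReg(y, lags=2).fit()
print(model.params)   # intercept plus coefficients for lag 1 and lag 2

# One-step-ahead forecast for the upcoming month
print(model.predict(start=len(y), end=len(y)))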

In the same way that correlation measures a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series. When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the autocorrelation function, often called the ACF, of a trended time series tends to have positive values that slowly decrease as the lags increase.

When data have seasonal fluctuations or patterns, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags. When data are both trended and seasonal, you see a combination of these effects. Time series that show no autocorrelation come from a truly random process and are called white noise. The ACF value at a given lag is the coefficient of correlation between values in the time series separated by that lag.

There are a few key ways to test for autocorrelation:

You can compute the residuals at time t, usually written as et, and plot them against t. Any clusters of residuals that sit on one side of the zero line may indicate where significant autocorrelation exists.
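As a rough illustration, assuming the statsmodels and matplotlib libraries and a hypothetical series y indexed by time t, such a residual plot might be produced like this:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
t = np.arange(120)
y = 0.3 * t + np.cumsum(rng.normal(size=120))   # trended, autocorrelated series

# Regress y on time and inspect the residuals e_t against t
ols = sm.OLS(y, sm.add_constant(t)).fit()
plt.scatter(t, ols.resid)
plt.axhline(0.0, color="gray")
plt.xlabel("t")
plt.ylabel("residual e_t")
plt.show()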

Running a Durbin-Watson test can help identify whether a time series contains autocorrelation. To do this in R, create a linear regression that regresses the dependent variable on time, then pass that model to the function that calculates the Durbin-Watson statistic. To do this in Python, you can pass the residuals from a fitted linear regression model to the test.
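In Python, for example, the durbin_watson function from statsmodels can be applied to the residuals of the regression fitted in the previous sketch (reusing the hypothetical ols model from above):

from statsmodels.stats.stattools import durbin_watson

# Values near 2 suggest no autocorrelation; values well below 2 suggest
# positive autocorrelation, values well above 2 suggest negative autocorrelation
print(durbin_watson(ols.resid))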

Another option is to use a Ljung-Box test and pass the values of the time series directly to the test. The Ljung-Box test has the null hypothesis that the residuals are independently distributed and the alternative hypothesis that the residuals are not independently distributed and exhibit autocorrelation. In practice, this means that p-values smaller than 0.05 indicate that autocorrelation exists in the time series. Both Python and R libraries provide methods to run this test.
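For instance, in Python the acorr_ljungbox function from statsmodels runs the test; here it is applied to the same hypothetical series y used above:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Test up to 10 lags; small p-values (< 0.05) suggest autocorrelation
result = acorr_ljungbox(y, lags=[10], return_df=True)
print(result[["lb_stat", "lb_pvalue"]])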

The most common option is to use a correlogram, a visualization generated from the correlations between specific lags in the time series. The plot shows how strongly values at different lags correlate with the current value, and a clear pattern in those values is an indication of autocorrelation.
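As an illustrative sketch, assuming the statsmodels and matplotlib libraries and the same hypothetical series y as above, the plot_acf function draws a correlogram directly:

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# Correlogram: bars outside the shaded confidence band indicate significant lags
plot_acf(y, lags=24)
plt.show()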

Non-random data have at least one significant lag. When the data are not random, it’s a good indication that you need to use a time series analysis or incorporate lags into a regression analysis to model the data appropriately.

There are fundamental features of a time series that can be identified through autocorrelation. 

  • Stationarity
  • Trends
  • Seasonality

 

Stationarity

A stationary time series has statistical properties that are constant over time. This means that statistics such as the mean, variance and autocorrelation don't change over the course of the data. Most statistical forecasting methods, including ARMA and ARIMA, are based on the assumption that the time series can be made approximately stationary through one or more transformations. A stationary series is comparatively easy to predict because you can simply predict that its statistical properties will be about the same in the future as they were in the past. Stationarity means that the time series does not have a trend, has a constant variance, a constant autocorrelation pattern, and no seasonal pattern. The ACF declines to near zero rapidly for a stationary time series, whereas it drops off slowly for a non-stationary time series.

Trend

A key feature of time series data is whether a trend is present in the data. For instance, the prices of basic staples in a grocery store over the last 50 years would exhibit a trend because inflation would drive those prices higher. Predicting data that contains trends can be difficult because the trend obscures the other patterns in the data. If the data reverts consistently to a stable trend line, it may be trend-stationary, in which case the trend can be removed by fitting a trend line and subtracting it from the data before fitting a model. If the data isn't trend-stationary, it might be difference-stationary, in which case the trend can be removed by differencing. The simplest way of differencing is to subtract the previous value from each value to get a measure of how much change is present in the time series. So, for instance, if Yt is the value of time series Y at period t, then the first difference of Y at period t is equal to Yt − Yt−1. When trends are present in a time series, shorter lags typically have strong positive or strong negative correlation values in the ACF because observations closer in time tend to have similar values. The correlations in the ACF taper off slowly as the lags increase.
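As a small sketch of first differencing, assuming the pandas library and a hypothetical trended series, the diff method computes Yt − Yt−1 directly:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
trended = pd.Series(np.arange(100) * 0.5 + rng.normal(size=100))

# First difference: each value minus the previous value (the first entry is NaN)
first_diff = trended.diff().dropna()
print(first_diff.head())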

Seasonality

Seasonality is when a time series contains recurring seasonal fluctuations or changes. We would expect ice cream sales to be higher in the summer months and lower in the winter months, while ski sales might reliably spike in late autumn and dip in early summer. Seasonality can occur at different time intervals such as days, weeks or months. The key for time series analysis is to understand how seasonality affects the series, so that we can produce better forecasts for the future. When seasonal patterns are present, the ACF values will show more positive autocorrelation for lags at multiples of the seasonal frequency than for other lags.


Partial Autocorrelation

The partial autocorrelation function, often called the PACF, is similar to the ACF except that it displays only the correlation between two observations that the shorter lags between those observations do not explain. An ACF plot shows the relationship between yt and yt−k for different values of k. If yt and yt−1 are correlated with one another, then we might assume that yt−1 and yt−2 will also be correlated because they are both connected by a lag of 1. However, it's also possible for yt and yt−2 to be correlated simply because they are both connected to yt−1, rather than because yt−2 contains new information that could be used in forecasting yt. To get around this problem, we use partial autocorrelations, which remove the effects of the intervening lag observations. The PACF measures only the relationship between yt and yt−k after removing the effects of lags 1 through k−1. The first partial autocorrelation is always identical to the first autocorrelation because there are no intervening lags to remove. Each subsequent value shows only the relationship between yt and yt−k after removing all the intervening lags. This can often give a more precise indication of which lags might contain signs of seasonality, by observing where there are larger positive or negative partial autocorrelation values.
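As a brief sketch, again assuming statsmodels and the same hypothetical series y as above, the plot_pacf function draws the partial correlogram alongside the ACF plot:

from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

# Partial correlogram: each bar shows the correlation at that lag after
# removing the effects of the shorter, intervening lags
plot_pacf(y, lags=24, method="ywm")
plt.show()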

In practice, the ACF helps assess the properties of a time series. The PACF, on the other hand, is more useful during the specification process for an autoregressive model. Data scientists and analysts use partial autocorrelation plots to specify regression models with time series data, such as autoregressive moving average (ARMA) or autoregressive integrated moving average (ARIMA) models.
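As a final hedged sketch, assuming statsmodels and the same hypothetical series y, an ARIMA model can be specified with an autoregressive order suggested by the PACF and a differencing order suggested by the ACF:

from statsmodels.tsa.arima.model import ARIMA

# order=(p, d, q): p autoregressive lags, d differences, q moving-average terms.
# Here p=2 and d=1 are illustrative choices one might read off the PACF and ACF.
arima = ARIMA(y, order=(2, 1, 0)).fit()
print(arima.summary())
print(arima.forecast(steps=3))   # forecast the next three periods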
