
Published: 21 November 2023
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu

What is multicollinearity?

Multicollinearity denotes when independent variables in a linear regression equation are correlated. Multicollinear variables can negatively affect model predictions on unseen data. Several diagnostic techniques can detect multicollinearity, and several regularization techniques can correct it.

Multicollinearity or collinearity?

Collinearity denotes when two independent variables in a regression analysis are themselves correlated; multicollinearity signifies when more than two independent variables are correlated.1 Their opposite is orthogonality, which designates when independent variables are not correlated. Multicollinearity prevents predictive models from producing accurate predictions by increasing model complexity and encouraging overfitting.

Context: regression analysis 

A standard multivariate linear regression equation is:

Y = B0 + B1X1 + B2X2 + … + BnXn + ε

Y is the predicted output (dependent variable), and each Xn is a predictor (independent or explanatory variable). Bn is the regression coefficient attached to that predictor and measures the change in Y for every one unit of change in the accompanying predictor (Xn), assuming all other predictors remain constant. B0 is the value of the response variable (Y) when every independent variable equals zero. This final value is also called the y-intercept.2

Of course, this regression equation aims to measure and map the relationship between Y and each Xn. In an ideal predictive model, none of the independent variables (Xn) are themselves correlated. Nevertheless, correlated predictors often appear in models built on real-world data, particularly when the models are designed with many independent variables.
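
As an illustration, here is a minimal sketch of fitting such a multiple regression on synthetic data, assuming Python with NumPy and statsmodels; the variable names and data are purely illustrative:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: x2 is deliberately correlated with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=200)
y = 3 + 2 * x1 - 1 * x2 + rng.normal(size=200)

# Build the design matrix (add_constant supplies the intercept term B0)
X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()
print(model.params)  # estimated B0, B1, B2
```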


Effects of multicollinearity

When creating a predictive model, we need to compute coefficients, as they are rarely known beforehand. To estimate regression coefficients, we use the standard ordinary least squares (OLS) matrix coefficient estimator:

B̂ = (XᵀX)⁻¹XᵀY

Understanding this formula’s operations requires familiarity with matrix notation. For now, all we need to understand is that the size and contents of the X matrix are determined by the independent variables chosen as the model’s parameters. Moreover, the degrees of correlation between predictor variables, known as correlation coefficients (r), are used in calculating the regression coefficients between X and Y.3
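
As a sketch of what this estimator computes, the following snippet implements (XᵀX)⁻¹XᵀY directly with NumPy on synthetic data; the function name and data are illustrative rather than taken from the cited sources:

```python
import numpy as np

def ols_coefficients(X, y):
    """Estimate regression coefficients via the OLS matrix formula (X'X)^-1 X'y."""
    # np.linalg.solve is equivalent to inverting X'X, but numerically more stable
    return np.linalg.solve(X.T @ X, X.T @ y)

# Design matrix: a column of ones (intercept) plus two predictors
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=100)
X = np.column_stack([np.ones(100), x1, x2])
print(ols_coefficients(X, y))  # approximately [1, 2, 3]
```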

As independent variables are included or excluded from the model, the estimated coefficients for any one predictor can change drastically, making coefficient estimates unreliable and imprecise. Correlation between two or more predictors creates difficulty in determining any one variable’s individual impact on the model output. Remember that a regression coefficient measures a given predictor variable’s effect on the output assuming other predictors remain constant. But if predictors are correlated, it may not be possible to isolate predictors. Thus, estimated regression coefficients for multicollinear variables do not reflect any one predictor’s effect on the output but rather the predictor’s partial effect, depending on which covariates are in the model.4

Additionally, different data samples, or even small changes in the data, with the same multicollinear variables can produce widely different regression coefficients. This leads to perhaps the most widely known problem of multicollinearity: overfitting. Overfitting denotes models with low training error and high generalization error. As mentioned, the statistical significance of any one multicollinear variable remains unclear amidst its relational noise with the others. This prevents precise calculation of any one variable’s statistical significance on the model’s output, which is what the coefficient estimate largely indicates. Because multicollinearity prevents calculating precise coefficient estimates, multicollinear models fail to generalize to unseen data. In this way, estimated coefficients for multicollinear variables possess a large variability, also known as a large standard error.5
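
A brief sketch of this instability, assuming NumPy and synthetic, nearly collinear predictors: refitting the same model on resampled data produces coefficient estimates that swing far from the data-generating value.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
y = 1 + 2 * x1 + 2 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Refit on bootstrap resamples and watch the estimate for B1 swing
for _ in range(5):
    idx = rng.integers(0, n, size=n)
    b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    print(round(b[1], 2))  # varies widely around the data-generating value of 2
```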

Types of multicollinearity
Degrees of multicollinearity

Statistics textbooks and articles sometimes distinguish between extreme and perfect multicollinearity. Perfect multicollinearity signifies when one independent variable has an exact linear relationship with one or more other independent variables. Extreme multicollinearity is when one predictor is highly, though not perfectly, correlated with one or more additional independent variables.6 These are the two principal degrees of multicollinearity.
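
To make the distinction concrete, here is a small NumPy sketch (with made-up data) showing how a perfectly collinear predictor leaves the OLS formula without a unique solution:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = 2 * x1  # perfect multicollinearity: x2 is an exact linear function of x1
X = np.column_stack([np.ones(n), x1, x2])

# X'X is rank deficient, so (X'X)^-1 X'Y cannot be computed uniquely
print(np.linalg.matrix_rank(X.T @ X))  # 2 rather than 3
print(np.linalg.cond(X.T @ X))         # enormous condition number (effectively infinite)
```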

Causes of multicollinearity

There are not so much discrete forms of multicollinearity as different potential causes. These causes can range from the nature of the data under consideration to poorly designed experiments. Some common causes are:

- Data collection. This data-based multicollinearity can result if one samples a non-representative subspace of the data in question. For instance, Montgomery et al. supply the example of a supply chain delivery dataset in which delivery distance and order size are independent variables of a predictive model. In the data they provide, order inventory size appears to increase with delivery distance. The solution to this correlation is straightforward: collect and include data samples for short-distance deliveries with large inventories, or vice versa.7

- Model constraints. This is similar to the data collection cause, albeit not identical. Multicollinearity can result from the nature of the data and of the predictive model's variables. Imagine we are creating a predictive model to measure employee satisfaction in the workplace, with hours worked per week and reported stress as two of several predictors. There may very well be a correlation between these predictors due to the nature of the data: people who work more will likely report higher stress. A similar situation may occur if education and salary are model predictors, since employees with more education will likely earn more. In this case, collecting more data may not alleviate the issue, as the multicollinearity is inherent to the data itself.

- Overdefined model. Multicollinearity can occur when there are more model predictors than data observation points. This issue can arise particularly in biostatistics or other biological studies. Resolving overdefined models requires eliminating select predictors from the model altogether. But how to determine which predictors to remove? One can conduct several preliminary studies using subsets of regressors (i.e. predictors) or use principal component analysis (PCA) to combine multicollinear variables, as in the sketch after this list.8
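
The PCA approach mentioned above can be sketched as follows, assuming Python with NumPy and scikit-learn (the library choice and data are illustrative, not prescribed by the cited sources):

```python
import numpy as np
from sklearn.decomposition import PCA

# Three highly correlated predictors, e.g. repeated measurements of the same trait
rng = np.random.default_rng(4)
base = rng.normal(size=(100, 1))
X = np.hstack([base + rng.normal(scale=0.1, size=(100, 1)) for _ in range(3)])

# Replace the correlated columns with a single combined component
pca = PCA(n_components=1)
combined_predictor = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # one component captures nearly all the variance
```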

Data-based and structural multicollinearity

Select types of data can especially lead to multicollinearity. Time series data is chief among these. Growth and trend factors, notably in economics, often move in the same direction over time, readily producing multicollinearity. Additionally, observational studies in the social sciences are readily conducive to multicollinearity, since many socioeconomic variables (e.g. income, education, political affiliation) are often interrelated and beyond researchers' control.9

Multicollinearity can also result from the manipulation of predictor variables. In some cases, one may use the squared or lagged values of independent variables as new model predictors. Of course, these new predictors will share a high correlation with the independent variables from which they were derived.10 This is structural multicollinearity.
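
A quick illustration of this effect, assuming NumPy and made-up data: a squared term correlates strongly with the variable it was derived from.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=200)  # strictly positive, so x and x**2 move together
x_squared = x ** 2                # a derived, "structural" predictor

print(np.corrcoef(x, x_squared)[0, 1])  # close to 1: structural multicollinearity
```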

How to detect multicollinearity

Large estimated coefficients can themselves indicate the presence of multicollinearity, as can large changes in estimated coefficients when a single predictor (or even a single data point) is added to or removed from the model. Coefficients with large confidence intervals are also indicative of multicollinearity. Occasionally, coefficients possessing signs or magnitudes contrary to expectations derived from preliminary data analysis can indicate multicollinearity. Of course, none of these definitively confirm multicollinearity nor provide quantitative measurements of it.11 Several diagnostic methods, however, help do so.

Two relatively simple tools for measuring multicollinearity are a scatter plot and a correlation matrix of the independent variables. When using a scatter plot, one plots independent variable values for each data point against one another. If the scatter plot reveals a linear correlation between the chosen variables, then some degree of multicollinearity may be present. In the Montgomery et al. delivery dataset, for example, a scatter plot of delivery distance against order size reveals such a linear pattern.

Another diagnostic method is to calculate a correlation matrix for all the independent variables. The elements of the matrix are the correlation coefficients between each pair of predictors in the model. The correlation coefficient is a value between -1 and 1 that measures the degree of correlation between two predictors. Note how the matrix contains a diagonal of 1s, because each variable has a perfect correlation with itself. The larger the magnitude of a given off-diagonal element, the greater the degree of correlation between the corresponding pair of predictors.12
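
Both diagnostics can be produced in a few lines, assuming Python with pandas (and matplotlib for the plot); the delivery-style data below is synthetic and only meant to echo the Montgomery et al. example:

```python
import numpy as np
import pandas as pd

# Synthetic delivery-style data: order size grows with delivery distance
rng = np.random.default_rng(6)
distance = rng.uniform(1, 30, size=100)
order_size = 5 + 0.8 * distance + rng.normal(scale=3, size=100)

df = pd.DataFrame({"distance": distance, "order_size": order_size})
print(df.corr())                               # correlation matrix with 1s on the diagonal
df.plot.scatter(x="distance", y="order_size")  # scatter plot (requires matplotlib)
```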

Variance inflation factor

Variance inflation factor (VIF) is the most common method for determining the degree of multicollinearity in linear regression models. Each model predictor has a VIF value, which measures how much the variance of that predictor’s estimated coefficient is inflated by its correlation with the model’s other predictors.

The VIF algorithm contains several steps, and a full explanation of it is beyond the scope of this article. Suffice it to say, VIF measures the proportion of a chosen variable’s variance that is explained by the model’s other independent variables. The equation representing VIF is:

VIF = 1 / (1 − R2)

R-squared (R2) signifies the coefficient of multiple determination obtained by regressing one independent variable against all the others.13 The bottom term of the VIF equation, 1 − R2, is called tolerance, a concept distinct from tolerance intervals. Tolerance is the reciprocal of VIF. Though much less discussed in the literature, it is nevertheless another viable means for measuring multicollinearity.14

The higher the VIF value, the greater the degree of multicollinearity. There is no single VIF cutoff that determines a “bad” or “good” model. Nevertheless, a widely repeated rule of thumb is that a VIF value greater than or equal to ten indicates severe multicollinearity.15

Note that R and Python contain functions for calculating VIF. Respectively, the vif() function in R’s car package and the variance_inflation_factor() function in Python’s statsmodels.stats.outliers_influence module can compute VIF for a designated model.16
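
As a brief sketch of the Python route mentioned above (the synthetic data and variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Two correlated predictors plus an intercept column
rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))

# One VIF per column of the design matrix (the constant's VIF is typically ignored)
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```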

How to fix multicollinearity

As mentioned, simple fixes for multicollinearity range from diversifying or enlarging the sample size of the training data to removing parameters altogether. Several regularization techniques also help correct the problem of multicollinearity. Ridge regression is one widely recommended method that penalizes high-value coefficients, thereby decreasing the impact of multicollinear predictors on the model’s output. Lasso regression similarly penalizes high-value coefficients. The primary difference between the two is that ridge merely shrinks coefficient values toward zero while lasso can shrink coefficients all the way to zero, effectively removing independent variables from the model altogether.
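
A minimal sketch of both approaches, assuming Python with scikit-learn and synthetic data (the penalty strengths shown are arbitrary and would normally be tuned):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Two correlated predictors
rng = np.random.default_rng(8)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=200)
X = np.column_stack([x1, x2])
y = 3 + 2 * x1 + 2 * x2 + rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks coefficients toward zero
lasso = Lasso(alpha=0.5).fit(X, y)  # can set some coefficients exactly to zero
print("ridge:", ridge.coef_)
print("lasso:", lasso.coef_)
```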

Example use cases
Finance

Because business and finance research cannot conduct controlled experiments and largely works with time series data, multicollinearity is a perennial issue. Recent research challenges predictor-dropping methods (e.g. PCA) for resolving collinearity on the grounds that doing so potentially removes important predictors.17 Elsewhere, researchers apply ridge regression, and novel shrinkage methods derived from it, to correct multicollinearity when analyzing investment management decisions.18

Criminal justice

Like many other subfields of the social sciences, criminology and criminal justice rely on observational studies, in which multicollinearity often arises. Researchers may use variable-combining methods (e.g. PCA)19 as well as variable-dropping methods to resolve multicollinearity.20 Note how, in the latter study, a VIF greater than three is treated as indicating excessive multicollinearity, illustrating that not all research follows the VIF ≥ 10 rule of thumb. Research also explores other diagnostic and resolution methods for multicollinearity, such as dominance analysis, which ranks predictors according to the portion of variance they contribute to the model.21

Footnotes

1 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

2 Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor, An Introduction to Statistical Learning with Applications in Python, Springer, 2023, https://doi.org/10.1007/978-3-031-38747-0 (link resides outside ibm.com)

3 Michael Patrick Allen, Understanding Regression Analysis, Springer, 1997. Michael Kutner, Christopher Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th Edition, McGraw-Hill, 2005.

4 Michael Kutner, Christopher Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th Edition, McGraw-Hill, 2005.

5 Michael Patrick Allen, Understanding Regression Analysis, Springer, 1997. Michael Kutner, Christopher Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th Edition, McGraw-Hill, 2005.

6 Michael Patrick Allen, Understanding Regression Analysis, Springer, 1997.

7 Douglas Montgomery, Elizabeth Peck, and G. Geoffrey Vining, Introduction to Linear Regression Analysis, John Wiley & Sons, 2012.

8 R.F. Gunst and J.T. Webster, "Regression analysis and problems of multicollinearity," Communications in Statistics, Vol. 4, No. 3, 1975, pp. 277-292, https://doi.org/10.1080/03610927308827246 (link resides outside ibm.com)

9 Larry Schroeder, David Sjoquist, and Paula Stephan, Understanding Regression Analysis: An Introductory Guide, 2nd Edition, SAGE, 2017.

10 R.F. Gunst and J.T. Webster, "Regression analysis and problems of multicollinearity," Communications in Statistics, Vol. 4, No. 3, 1975, pp. 277-292, https://doi.org/10.1080/03610927308827246 (link resides outside ibm.com)

11 Michael Patrick Allen, Understanding Regression Analysis, Springer, 1997. Michael Kutner, Christopher Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th Edition, McGraw-Hill, 2005.

12 Michael Kutner, Christopher Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th Edition, McGraw-Hill, 2005.

13 Raymond Myers, Classical and Modern Regression with Applications, Duxbury Press, 1986. Paul Allison, Multiple Regression: A Primer, Pine Forge Press, 1999. Joseph Hair, William Black, Barry Babin, Rolph E. Anderson, and Ronald Tatham, Multivariate Data Analysis, 6th Edition, Pearson, 2006.

14 Richard Darlington and Andrew Hayes, Regression Analysis and Linear Models: Concepts, Applications, and Implementation, Guilford Press, 2017.

15 Michael Kutner, Christopher Nachtsheim, John Neter, and William Li, Applied Linear Statistical Models, 5th Edition, McGraw-Hill, 2005.

16 Chantal Larose and Daniel Larose, Data Science Using Python and R, Wiley, 2019.

17 Thomas Lindner, Jonas Puck, and Alain Verbeke, "Misconceptions about multicollinearity in international business research: Identification, consequences, and remedies," Journal of International Business Studies, Vol. 51, 2020, pp. 283-298, https://doi.org/10.1057/s41267-019-00257-1 (link resides outside ibm.com)

18 Aquiles E.G. Kalatzis, Camila F. Bassetto, and Carlos R. Azzoni, "Multicollinearity and financial constraint in investment decisions: a Bayesian generalized ridge regression," Journal of Applied Statistics, Vol. 38, No. 2, 2011, pp. 287-299, https://www.tandfonline.com/doi/abs/10.1080/02664760903406462. Roberto Ortiz, Mauricio Contreras, and Cristhian Mellado, "Regression, multicollinearity and Markowitz," Finance Research Letters, Vol. 58, 2023, https://doi.org/10.1016/j.frl.2023.104550 (link resides outside ibm.com)

19 Kiseong Kuen, David Weisburd, Clair White, and Joshua Hinkle, "Examining impacts of street characteristics on residents' fear of crime: Evidence from a longitudinal study of crime hot spots," Journal of Criminal Justice, Vol. 82, 2022, https://doi.org/10.1016/j.jcrimjus.2022.101984 (link resides outside ibm.com)

20 Howard Henderson, Sven Smith, Christopher Ferguson, and Carley Fockler, "Ecological and social correlates of violent crime," SN Social Sciences, Vol. 3, 2023, https://doi.org/10.1007/s43545-023-00786-5 (link resides outside ibm.com)

21 Robert Peacock, "Dominance analysis of police legitimacy’s regressors: disentangling the effects of procedural justice, effectiveness, and corruption," Police Practice and Research, Vol. 22, No. 1, 2021, pp. 589-605, https://doi.org/10.1080/15614263.2020.1851229 (link resides outside ibm.com)