Effective Amazon Machine Learning

Addressing multicollinearity

The following is the definition of multicollinearity according to Wikipedia (https://en.wikipedia.org/wiki/Multicollinearity):

Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.

Basically, let's say you have a model with three predictors:

$$y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3$$

And one of the predictors is a linear combination (perfect multicollinearity) or is approximated by a linear combination (near multicollinearity) of the two other predictors.

For instance:

$$x_3 = \alpha x_1 + \beta x_2 + \epsilon$$

Here, $\epsilon$ is some noise variable.

In that case, changes in $x_1$ and $x_2$ will drive changes in $x_3$, and as a consequence, $x_3$ will be tied to $x_1$ and $x_2$. The information already contained in $x_1$ and $x_2$ will be shared with the third predictor $x_3$, which will cause high uncertainty and instability in the model. Small changes in the predictors' values will bring large variations in the coefficients, and the regression may no longer be reliable. In more technical terms, the standard errors of the coefficients would increase, which would lower the significance of otherwise important predictors.
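To see this instability concretely, consider the following minimal sketch (an illustration added here, not code from this book; it assumes NumPy and scikit-learn are available and uses purely simulated data). Two datasets drawn from the same process, differing only in their random noise, can produce very different coefficients for the near-collinear predictors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_once(seed, n=100):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    # x3 is almost an exact linear combination of x1 and x2
    x3 = x1 + x2 + rng.normal(scale=0.01, size=n)
    X = np.column_stack([x1, x2, x3])
    y = 1 + 0.5 * x1 - 0.3 * x2 + 0.2 * x3 + rng.normal(scale=0.5, size=n)
    return LinearRegression().fit(X, y).coef_

# Same data-generating process, different noise draws: the fitted
# coefficients swing widely because x1, x2, and x3 are near collinear.
print(fit_once(seed=1))
print(fit_once(seed=2))
```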

There are several ways to detect multicollinearity. Calculating the correlation matrix of the predictors is a first step, but that would only detect collinearity between pairs of predictors.
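For example, with pandas (a sketch with simulated data, not taken from the book; the column names x1 to x3 are illustrative), the correlation matrix is one method call away. Note that in this example no single pairwise correlation is extreme, even though x3 is almost an exact linear combination of x1 and x2, which is precisely the limitation mentioned above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    # x3 is near collinear with x1 and x2 jointly
    "x3": x1 + x2 + rng.normal(scale=0.1, size=100),
})

# Pairwise Pearson correlations; off-diagonal values close to +/-1
# flag collinear pairs, but joint collinearity can stay hidden.
print(df.corr())
```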

A widely used detection method for multicollinearity is to calculate the variance inflation factor, or VIF, for each of the predictors:

$$VIF_i = \frac{1}{1 - R_i^2}$$

Here, $R_i^2$ is the coefficient of determination of the regression with the predictor $x_i$ on the left-hand side and all the other predictor variables on the right-hand side.

A large value for one of the VIFs is an indication that the variance (the square of the standard error) of a particular coefficient is larger than it would be if that predictor were completely uncorrelated with all the other predictors. For instance, a VIF of 1.8 indicates that the variance of a predictor's coefficient is 80% larger than it would be in the uncorrelated case.
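The formula translates directly into code. The following sketch (illustrative, not from the book; statsmodels also ships a ready-made variance_inflation_factor helper) regresses each predictor on all the others and applies the VIF formula:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Return the VIF of each column of X, obtained by regressing
    that column on all the other columns."""
    vifs = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        r2 = LinearRegression().fit(others, target).score(others, target)
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.1, size=100)  # near collinear
print(vif(np.column_stack([x1, x2, x3])))  # all three VIFs come out very large
```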

Once the attributes with high collinearity have been identified, the following options will reduce collinearity:

  • Removing the highly collinear predictors from the dataset
  • Using Partial Least Squares (PLS) regression or Principal Component Analysis (PCA) (https://en.wikipedia.org/wiki/Principal_component_analysis); both methods reduce the predictors to a smaller set of uncorrelated variables, as sketched after this list
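As an illustration of the second option, here is a minimal PCA sketch (again assuming scikit-learn and the same kind of simulated collinear data as above; the book section itself shows no code). Keeping only the components needed to explain 99% of the variance collapses the near-collinear direction and yields uncorrelated features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.1, size=100)  # near collinear
X = np.column_stack([x1, x2, x3])

# Retain the principal components explaining 99% of the variance;
# the resulting columns are uncorrelated by construction.
pca = PCA(n_components=0.99)
X_uncorr = pca.fit_transform(X)
print(X_uncorr.shape)                  # (100, 2): one direction dropped
print(pca.explained_variance_ratio_)
```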

Unfortunately, detection and removal of multicollinear variables are not available on the Amazon ML platform.