Removing redundant variables using correlation matrices>
In this recipe we will remove redundant variables by building a correlation matrix that identifies highly correlated variables.
Getting ready
This recipe uses the datafile, nasadata.txt
and the stream file, recipe_variableselection_correlations.str
.
You will need a copy of Microsoft Excel to visualize the correlation matrix.
How to do it...
To remove redundant variables using correlation matrices:
- Open the stream,
recipe_variableselection_correlations.str
by navigating to File | Open Stream. - Make sure the datafile points to the correct path to the file
nasadata.txt
. - Open the Type node named
Correlation Types
. Notice that there are several variables of type continuous whose direction values have been set to Input, and a single continuous variable has its direction set to Target. The variable set to Target can be any variable that won't be an input to the model. If you don't have a good candidate, you can create a random variable and set that one to be the Target, as is done in this stream. - Open the generated model,
random_target
, and click on the Advanced options. Note that Descriptives is selected. This is the option that creates the correlation matrix for you. Note that this Linear Regression node is the old Modeler regression node. The new Regression node no longer provides a correlation matrix. - Build the linear regression model. Open the resulting generated model and click on the Advanced tab. You will see the advanced report that includes a Pearson correlation matrix similar to what appears in the following screenshot. However, with many variables, this report is difficult to browse to identify the correlations. Navigate to File | Export | Advanced and save the advanced report as an
html
file. Any name will work but save it asCorrelations.html
for this recipe. - Open Microsoft Excel and open the file you just created called
Correlations.html
. Note that, if you are using a version of Excel prior to Excel 2007, you can only import up to 255 variables. If you are using Excel 2007 or later, you can import 16,384 variables, but it may take some time for the file to load. Save the fileCorrelations.xls
orCorrelations.xlsx
if you would like to save the work done in Excel. - The only part of the table we need is the correlation matrix itself, labeled Pearson Correlation. It is helpful to delete non-correlation matrix rows and columns, but one can still proceed without editing the Excel document. The correlation matrix generated from the
nasadata.txt
data set begins at row 17. If you wish, color-code the values of the cells using conditional formatting so it is easier to see correlation values that have a large magnitude (close to 1 or -1). If you are using Excel 2007, one suggestion is to use the conditional formatting Format option as shown in the following screenshot.This will result in a correlation matrix that looks like the one shown in the following screenshot.
If you are using Excel 2003 or older, you can use the conditional formatting options shown in the following screenshot.
This will result in a correlation matrix such as the one shown in the following screenshot:
- When you see two variables that are highly correlated with each other, make a determination which variable you would like to keep and which one you would like to remove. If more than two variables are highly correlated with each other, select only one representative of the idea. In the nasadata example, the variables
Band4
andBand5
are correlated at greater than 0.9 withBand3
, and therefore can be safely removed from analysis. One can also argue thatBand9
,Band10
andBand 12
can be removed. - In the Modeler stream, connect a Type node to the right of the correlations Type node. Double-click on the Type node, and set the direction of the variables that were discarded based on the correlation matrix shown in Excel to None. One can also use a Filter node to remove
Band4
,Band5
,Band9
,Band10
, andBand12
.
How it works...
When you desire to identify variables that are highly correlated with each other so that you can remove redundant variables, there is no single node that will perform the task. Only the Regression node and the Discriminant nodes create a correlation matrix, with only the former allowing one to export the resulting matrix. This recipe provides a method to identify the redundant variables so they can be removed.
The first five steps load the data and build the regression model so that the correlation matrix can be exported and operated on in Excel. Steps 6 to 8 show how to identify highly correlated variables in Excel so that a list of redundant variables can be created. Step 9 shows how to apply that list to a Type node or Filter node to remove the redundant fields from further analysis.
There's more...
If there are more than a dozen variables removed from analysis, it can become quite tedious to set each of these individually in a Type node or Filter node. Running a script to set the remove variables to None in a Type node or de-selecting variables in a Filter node can speed up the process significantly and reduce the likelihood of errors made in the selection process.
See also
- The Using Neural Network for Feature Selection recipe in Chapter 6, Selecting and Building a Model
- Selecting variables using the CHAID Modeling node in this chapter