Introduction
This chapter focuses on just the first task, Select, of the data preparation phase:
Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.
Ideally, data mining empowers business people to discover valuable patterns in large quantities of data, to develop useful models and integrate them into the business quickly and easily. The name data mining suggests that large quantities of data will be involved, that the object is to extract rare and elusive bits of the data, and that data mining calls for working with data in bulk—no sampling.
New data miners are often struck by how much selection and sampling is actually done. In the popular stereotype, the data miner dives in and looks at everything, but it is unclear how such an unfocused search would yield any deployable results. Years ago, some Modeler documentation told the tale of the vanishing terabyte; the name alone communicates the basic idea. The data miner in the story, terrified that their systems cannot handle the volume, begins the actual work of choosing the relevant data, only to discover that there are just a few hundred instances of fraud.
One could argue that the fear of Big Data stems from a misunderstanding of selection and sampling. Large data warehouses filled to the brim with data are a reality, but one doesn't data-mine the undifferentiated whole. Some of the discussion about large data files assumes that all questions require all rows of data, as far back in time as they are stored. This is certainly not true. One might use only a small fraction of the data: the fraction that allows the problem, as defined during the business understanding phase, to be answered accurately and efficiently.
Also, a data miner does not select data in the way that a statistician does. Statisticians do much more heavy lifting during their variable selection phase. They emerge from that phase with perhaps just a handful of variables, possibly a dozen or two at the absolute most, but never hundreds. The data miner might very well start with a presumption that there will be dozens of inputs, with hundreds being common and thousands not unheard of. In statistics, hypotheses determine the independent variables from the outset. That is not the nature of the selection discussed here. If you are selecting a subset of rows, it is for relevance, balancing, speed, or a combination of these. Another way to summarize the difference: where the statistician favors parsimony at this stage, the data miner favors comprehensiveness. A statistician might lean towards variables that have proven to be valuable; the data miner excludes only those variables that are going to cause problems. (The recipe on decapitation is a prime example of avoiding problems.)
Despite the advantage of favoring comprehensiveness, in practice it is difficult to make discoveries and build models quickly when working with massive quantities of data. Although data mining tools may be designed to streamline the process, each operation still takes longer to complete on a large amount of data than it would on a smaller quantity. In the course of a day, the data miner will run many operations: importing, graphing, cleaning, restructuring, and so on. If each one takes an extra minute or two due to the quantity of data involved, the extra minutes add up to a large portion of the day. As the data set grows larger, the time required to run each step also increases, and the data miner spends more time waiting, leaving less time for critical thinking.
So, what's more important, working quickly or working with all the available data? The answer is not the same in every case. Some analyses really do focus on rare and elusive elements of the data. An example can be found in the network security field, where the object is to discover the tracks of a lone intruder among a sea of legitimate system users. In that case, handling a large mass of data is a practical necessity. Yet most data mining applications do not focus on such rare events. Buyers among prospects are a minority, but they are not rare. The same can be said for many other applications. Data miners are most often asked to focus on behavior that is relatively common.
If the pattern of interest happens frequently, perhaps once in a hundred cases, rather than once in a million, it is not necessary to use large masses of data at every step in order to uncover the pattern. Since that is a common situation, most data miners have the opportunity to improve their own productivity by using smaller quantities of data whenever possible. Judicious use of sampling allows the data miner to work with just enough data for any given purpose, reducing the time required to run each of many operations throughout the day.
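As a rough, tool-neutral illustration of the idea (Modeler itself would do this with a Sample node; the data and column names below are invented for the sketch), a few lines of Python with pandas show how little is lost when a "1 in 100" behavior is explored on a modest random sample:

```python
import numpy as np
import pandas as pd

# Invented stand-in for a large customer table; names are hypothetical.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "spend": rng.gamma(2.0, 50.0, size=1_000_000),
    "is_buyer": rng.binomial(1, 0.01, size=1_000_000),  # roughly 1 case in 100
})

# Explore on a 10% random sample; a fixed seed keeps the sample reproducible.
explore = df.sample(frac=0.10, random_state=42)

# Even a modest sample retains on the order of a thousand positive cases.
print(len(explore), int(explore["is_buyer"].sum()))
```

With roughly 100,000 sampled rows and around a thousand buyers, most exploratory graphs and audits behave much as they would on the full table, while every operation runs an order of magnitude faster.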
Having said all that, selection remains a terribly important set of decisions. Data miners, in principle, want all the data to have an opportunity to speak. However, the variables included have to have some possibility of relevance and must not interfere with other variables. One tries to keep subjectivity at bay, but it is a challenging phase. All of these recipes deal with deciding which rows to keep and which variables to keep as one begins to prepare a modeling data set. Modeling will likely be weeks away at this point, but this is the start of that ongoing process. In the end, the goal is to have every relevant phenomenon measured in some form, preferably in exactly one variable. Redundancy, while perhaps not causing the same problems that it causes in statistical techniques, does nonetheless cause problems. The correlation matrix recipe, among others, addresses this issue.
Although selection includes selecting rows (cases), some of the toughest choices involve variables. Variable selection is a key step in the data mining process. Reasons for filtering or removing variables include:
- Removing redundant variables. Redundant variables waste time and computational bandwidth needlessly, and they can introduce instabilities in some modeling algorithms, such as linear regression.
- Removing variables without any information (constants or near constants).
- Reducing the number of variables in the analysis because there are too many for efficient model building.
- Reducing the cost of deploying models. When a variable is expensive to collect, assessing whether the added benefit justifies its inclusion, or whether other, less expensive variables can provide the same or nearly the same accuracy.
Removal for the first and second reasons should happen during the select data step of the data preparation stage. Sometimes it is obvious which variables are essentially identical, though often highly correlated variables or near-zero variance variables are only discovered through explicit testing.
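The recipes in this chapter perform that testing within Modeler itself; purely as a tool-neutral sketch of what "explicit testing" means here (the thresholds and the demo data are arbitrary assumptions), one might flag near-constant columns and highly correlated numeric pairs like this:

```python
import numpy as np
import pandas as pd

def screen_inputs(df: pd.DataFrame, corr_threshold: float = 0.95):
    """Flag near-constant columns and highly correlated numeric pairs."""
    # Near-zero variance: columns dominated by a single value.
    top_share = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])
    near_constant = top_share[top_share > 0.99].index.tolist()

    # Redundancy: numeric pairs whose absolute correlation is very high.
    corr = df.select_dtypes(include=np.number).corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant_pairs = [
        (a, b, round(float(upper.loc[a, b]), 3))
        for a in upper.index for b in upper.columns
        if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > corr_threshold
    ]
    return near_constant, redundant_pairs

# Invented demo data: x2 is nearly constant, x3 is a near-copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
demo = pd.DataFrame({
    "x1": x1,
    "x2": np.repeat([0, 1], [498, 2]),   # 99.6% a single value
    "x3": x1 + rng.normal(scale=0.01, size=500),
})
print(screen_inputs(demo))
```

The particular cut-offs (99 percent for "near constant", 0.95 for "highly correlated") are conventions, not rules; the point is that both checks are cheap to run before any modeling begins.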
Reduction for the third reason can be done during data preparation or during modeling. Some modeling algorithms have variable selection built in, such as decision trees or stepwise regression. Others do not, such as nearest neighbor and neural networks. However, even if an algorithm has some form of variable selection built in, selecting variables prior to modeling can still be advantageous for efficiency, so that the same poor or redundant predictors aren't considered over and over again.
Removal for the fourth reason is usually done after models are built, when one can directly assess the value of the variables in the final models.
Five of the chapter's recipes focus on selecting variables prior to modeling, making modeling more efficient. The most common approach to removing variables is single-variable selection based upon the relationship of each variable with the target variable. The logic behind this kind of variable selection is that variables that don't have a strong relationship with the target on their own are unlikely to combine well with other variables in a final model. This is certainly the case with forward selection algorithms (decision trees and forward selection in regression models, to name two examples), but of course it isn't always the case.
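In Modeler this screening is done with the nodes covered in the recipes that follow; as a hedged, tool-neutral sketch of the underlying idea (the function, the demo data, and the 0.05 cut-off are all invented for illustration), numeric inputs can be ranked by a simple univariate association with a binary target:

```python
import numpy as np
import pandas as pd

def rank_by_target_association(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric inputs by absolute correlation with a 0/1 target."""
    inputs = df.select_dtypes(include=np.number).drop(columns=[target])
    assoc = inputs.apply(lambda col: col.corr(df[target])).abs()
    return assoc.sort_values(ascending=False)

# Invented data: x1 is related to the target, x2 is pure noise.
rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
y = (x1 + rng.normal(scale=1.0, size=1000) > 0).astype(int)
demo = pd.DataFrame({"x1": x1, "x2": rng.normal(size=1000), "buy": y})

ranking = rank_by_target_association(demo, target="buy")
print(ranking)                                  # x1 ranks well above x2
keep = ranking[ranking > 0.05].index.tolist()   # arbitrary screening cut-off
print(keep)
```

The point of the sketch is the ranking itself; as noted above, a weak univariate association does not guarantee that a variable is useless in combination with others, so such cut-offs should be applied with care.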
The Feature Selection node in Modeler is effective in removing variables with little or no variance, as well as variables with a weak relationship to the target variable. However, the Feature Selection node does not identify redundant variables. In addition, although it can select variables with a significant association to the target, the node does not make the strength of that association transparent; it focuses instead on the statistical significance of the relationship. The Feature Selection node can also remove variables too aggressively if you have not yet addressed issues with missing data.
Four of the variable recipes here (selecting variables using correlations, CHAID, the Means node, and Association Rules) rely on exporting reports from Modeler into Microsoft Excel to facilitate the selection process.