Evaluating the need to sample from the initial data>
One of the most compelling reasons to sample is that many data sources were never created with data analysis in mind. Many operational systems would suffer serious functional problems if a data miner extracted every bit of data from the system. Business intelligence systems are built for reporting purposes—typically a week's worth or a month's worth at a time. When a year's worth is requested, it is in summary form. When the data miner requests a year's worth (or more) of line item level transactions it is often unexpected, and can be disastrous if the IT unit is not forewarned.
Real life data mining rarely begins with perfectly clean data. It's not uncommon for 90 percent of a data miner's time to go to data preparation. This is a strong motivation to work with just enough data to fill a need and no more, because more data to analyze means more data to clean, more time spent cleaning data, and very little time left available for data exploration, modeling and other responsibilities. The question often is how large a time period to examine. Do we need 4 years to examine this? The answer would be yes if we are predicting university completion, but the answer would be no if we are predicting the next best offer for an online bookseller.
In this recipe we will run a series of calculations that will help us determine if we have: just enough data, too much data that we might want to consider random sampling, or so little data that we might have to go further back in our historical data to get enough.
Getting ready
We will be using the EvaluateSampleNeed.str
file.
How to do it...
To evaluate the need to sample from the initial data, perform the following steps:
- Force TARGET_B to be flag in the Type node.
- Run a Distribution node for TARGET_B. Verify that there are 4,883 donors and 90,569 non-donors.
- Run a Distribution node on the new derive field,
RFA3_FirstLetter
. - Examine the Select node and run a new Distribution node on TARGET_B downstream of the Select node. Confirm the numbers 88,290 and 4694 for the results.
- Generate using Balance Node (reduce).
Insert it in sequence before the Distribution node and then run it.
Confirm that the two groups are now roughly equal. This is a random process; your numbers will not match the screen exactly.
- Add a Partition node after the Type node. Purely for illustration, add a Select node that allows only data from Train data set to flow to the Distribution node. We want to assess our sample size, but the Select node would be removed before modeling.
- Do we have enough data if we remove Inactive or Lapsing donors? Add a Select node that removes the categories I or L from the field
RFA_3FirstLetter
. The downstream Distribution node of TARGET_B should result in approximately 2,300 in each group.
How it works...
Early in the process we determined that we have 4833 cases of the rarer of our two groups. It would seem, at first, that we have enough data and possibly we do. A good rule of thumb is that we would want at least 1,000 cases of the rarer group in our Train data set, and ideally the same amount in our Test data set. When you don't meet these requirements there are ways around it, but when you can meet them it is one less thing to worry about.
When we explore the balanced results we meet the 1000+ rule of thumb, but are we out of the woods? There are numerous issues left to consider. Two are especially important: is all of the data relevant and is our time period appropriate?
Note that when we rerun the Distribution node downstream of the Partition node, at first it seems to give us odd results. Partition nodes tells Modeling nodes to ignore Test data, but Distribution nodes show all the data. In addition, Balance nodes only balance data in the Training data set, not the Testing data set. In this recipe, we add the select node to make this clear. In a real project one could just cut the number of cases into half to determine the number in the Train half.
The exercise in removing 1995 donors or lapsed donors cannot be taken as guidance in all cases. There are numerous reasons to restrict data. We might be interested in only major donors (as defined in the data set). We might be interested only in new donors. The point is to always return to your business case and ensure that you are determining sample size for the same group that will be your deployment population for the given business question.
In this example, we ultimately can conclude we have enough data to meet the rule of thumb, but we certainly don't have the amount of data that we appeared to have at the start.
There's more...
What do you do when you don't have enough data? One option is to go further back in time, but that option might not be available to you on all projects. Another option is to change the percentages in the Partition node. The Train data set needs its 1000s of records more than the Test data. If you are experiencing scarcity, increase the percentage of records going to the Train data.
You could also manipulate the Balance node. One need not fully boost or fully reduce. For example, if you are low on data, but have almost enough data, try doubling the numbers in the balance node. This way you are partially boosting the rare group (by a factor of 2), and you are only partially reducing the common group.
What do you do if you have too much data? As long as there is no seasonality you might look at only one campaign, or one month. If you had a lot of data, but you had seasonality, then having only one month's worth of data would not be a good idea. Better to do a random sample from each of 12 months, and then combine the data. Don't be too quick to embrace too much uncritically and simply analyze all of it. The proof will be in the ability to validate against new unbalanced data. A clever sampler will often produce the better model because they are not drowning the algorithm with noise.
See also
- The Using an empty aggregate to evaluate sample size recipe in this chapter