Random sampling – splitting a dataset into training and testing datasets
Splitting the dataset into training and testing datasets is an operation every predictive modeller has to perform before applying a model, irrespective of the kind of data in hand or the predictive model being applied. Generally, a dataset is split into a training dataset and a testing dataset. The following is a description of the two:
- The training dataset is the one on which the model is built. The calculations are performed on it, and the model equations and parameters are estimated from it.
- The testing dataset is used to check the accuracy of the model. The model equations and parameters are applied to the inputs from the testing dataset to produce predicted outputs, which are then compared with the actual values present in the testing dataset to gauge how well the model performs.
This will become clearer from the following image:
Fig. 3.37: Concept of sampling: Training and Testing data
Generally, the training and testing datasets are split in the ratio of 75:25 or 80:20. There are various ways to split the data into two parts. The crudest way that comes to mind is to take the first 75 or 80 percent of the rows as the training dataset and the rest as the testing dataset (or, equivalently, the first 25 or 20 percent of the rows as the testing dataset and the rest as the training dataset). The problem with this approach is that it might bias the two datasets: the earlier rows might come from a different source or might have been observed under different scenarios, and such ordering can distort the results obtained from the two datasets. To avoid this bias, the rows should be selected at random, as the short sketch below illustrates. Let us then look at a few methods to divide a dataset into training and testing datasets.
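For instance, a minimal sketch of the idea, assuming a hypothetical pandas DataFrame df standing in for the data at hand, contrasts the crude positional split with a split over shuffled row positions:

import numpy as np

# df is a hypothetical DataFrame standing in for any dataset
cut = int(0.75 * len(df))

# Crude split: the first 75 percent of rows as-is; any ordering in the
# file (source, time period) is inherited by the two halves
naive_train, naive_test = df[:cut], df[cut:]

# Random split: shuffle the row positions first, then slice
shuffled = df.iloc[np.random.permutation(len(df))]
train, test = shuffled[:cut], shuffled[cut:]

Shuffling the row positions before slicing spreads any ordering present in the original file evenly across both halves.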
One way is to create as many standard normal random numbers as there are rows in the dataset and then test each of them against a cut-off value. The resulting Boolean filter is then used to partition the data into two parts. Let us see how it can be done.
Method 1 – using the Customer Churn Model
Let us use the same Customer Churn Model data that we have been using frequently. Let us go ahead and import it, as shown:
import pandas as pd
data = pd.read_csv('E:/Personal/Learning/Datasets/Book/Customer Churn Model.txt')
len(data)
There are 3333 rows in the dataset. Next, we will generate random numbers and create a filter on which to partition the data:
import numpy as np

# One standard normal random number per row, filtered against the cut-off
a = np.random.randn(len(data))
check = a < 0.8
training = data[check]
testing = data[~check]
The rows for which the random number is less than 0.8 become part of the training dataset, while those with a value of 0.8 or more become part of the testing dataset.
Let us check the lengths of the two datasets to see in what ratio the dataset has been divided. With a cut-off of 0.8 on standard normal random numbers, we should expect roughly 79 percent of the rows to end up in the training dataset, since that is the probability of a standard normal value falling below 0.8:
len(training)
len(testing)
The length of the training dataset is 2635 while that of the testing dataset is 698; thus, the split is roughly 79:21, very close to the expected 80:20.
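If a particular split ratio is desired, the cut-off maps directly to the expected split through the normal distribution. A small sketch, assuming scipy is available, checks the expected training fraction for a 0.8 cut-off and finds the cut-off that would target 75:25:

from scipy.stats import norm

norm.cdf(0.8)    # expected share of rows below 0.8, about 0.79
norm.ppf(0.75)   # cut-off giving an expected 75:25 split, about 0.67

Alternatively, uniform random numbers (np.random.rand) compared directly against 0.75 achieve the target fraction without any such conversion.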
Method 2 – using sklearn
Very soon we will be introduced to a very powerful Python library used extensively for the purpose of modelling, scikit-learn or sklearn. This sklearn library has inbuilt methods to split a dataset into training and testing datasets. Let's have a look at the procedure:
# train_test_split lives in sklearn.model_selection in current versions
# of scikit-learn (older versions exposed it in sklearn.cross_validation)
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
The test_size parameter specifies the size of the testing dataset: 0.2 means that 20 percent of the rows of the dataset should go to testing and the remaining 80 percent to training. If we check the lengths of the two (train and test), we can confirm that the split is indeed 80:20.
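As a quick check, a minimal sketch along these lines prints the two lengths; the random_state argument is not part of the original call and is shown here only as an optional way to make the split reproducible:

from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=42)
print(len(train))   # about 80 percent of the 3333 rows
print(len(test))    # about 20 percent of the 3333 rows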
Method 3 – using the shuffle function
Another method involves using the shuffle function in numpy's random module. The data is read line by line, the lines are shuffled randomly, and they are then assigned to the training and testing datasets in the designated proportions, as shown:
import numpy as np

with open('E:/Personal/Learning/Datasets/Book/Customer Churn Model.txt', 'r') as f:
    data = f.read().split('\n')

np.random.shuffle(data)        # shuffle the lines in place
cut = 3 * len(data) // 4       # integer index for a 75:25 split
train_data = data[:cut]
test_data = data[cut:]
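Note that this approach yields lists of raw text lines (including the header line) rather than DataFrames. If the data has already been read with pandas, as in Method 1, a pandas-native variant of the same shuffle-and-slice idea, sketched here as one possible alternative, keeps the column structure intact:

# Assuming `data` is the DataFrame read with pandas in Method 1
shuffled = data.sample(frac=1).reset_index(drop=True)   # shuffle all rows
cut = 3 * len(shuffled) // 4                            # 75:25 cut point
train_df = shuffled[:cut]
test_df = shuffled[cut:]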
In some cases, mostly in data science competitions such as those hosted on Kaggle, we are provided with separate training and testing datasets to start with.