Model Tuning
In this section, we will delve further into evaluating model performance and examine regularization techniques that help models generalize to new data. Putting a model's performance into context is extremely important. Our aim is to determine whether our model is performing well compared to trivial or obvious approaches. We do this by creating a baseline model against which we compare the machine learning models we train. It is important to stress that all model evaluation metrics are computed and reported on the test dataset, since that gives us an understanding of how the model will perform on new data.
Baseline Models
A baseline model should be a simple and well-understood procedure, and the performance of this model should be the lowest acceptable performance for any model we build. For classification models, a useful and easy baseline model is to always predict the modal (most common) outcome value. For example, if there are 60% false values, our baseline model would be to predict false for every value, which would give us an accuracy of 60%. For regression models, the mean or median of the target can be used as the baseline prediction.
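As a minimal sketch of both kinds of baseline, the following snippet uses only the Python standard library. The function names and the toy values are illustrative assumptions and are not part of the exercises in this chapter:

```python
from collections import Counter
from statistics import mean, median

def majority_class_baseline(y):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(y).most_common(1)[0]
    return label, count / len(y)

def constant_baseline_mae(y, constant):
    """Mean absolute error of predicting the same constant for every row."""
    return mean(abs(value - constant) for value in y)

# Toy classification target with 60% False values
y_class = [False] * 6 + [True] * 4
label, accuracy = majority_class_baseline(y_class)
print(label, accuracy)

# Toy regression target containing one outlier
y_reg = [1.0, 2.0, 3.0, 4.0, 100.0]
mae_mean = constant_baseline_mae(y_reg, mean(y_reg))
mae_median = constant_baseline_mae(y_reg, median(y_reg))
print(mae_mean, mae_median)
```

On the toy regression target, the median baseline has the lower mean absolute error because it is less sensitive to the outlier; which constant is the fairer baseline depends on the error metric being reported.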
Exercise 1.05: Determining a Baseline Model
In this exercise, we will put the model performance into context. The accuracy we attained from our model seemed good, but we need something to compare it to. Since machine learning model performance is relative, it is important to develop a robust baseline with which to compare models. Once again, we are using the online shoppers purchasing intention dataset, and our target variable is whether or not each user will purchase a product in their session. Follow these steps to complete this exercise:
- Import the pandas library and load in the target dataset:
import pandas as pd
target = pd.read_csv('../data/OSI_target_e2.csv')
- Next, calculate the relative proportion of each value of the target variable:
target['Revenue'].value_counts()/target.shape[0]*100
The following figure shows the output of the preceding code:
- Here, we can see that 0 is represented 84.525547% of the time—that is, there is no purchase by the user, and this is our baseline accuracy. Now, for the other model evaluation metrics:
from sklearn import metrics
y_baseline = pd.Series(data=[0]*target.shape[0])
precision, recall, \
fscore, _ = metrics.precision_recall_fscore_support\
(y_pred=y_baseline, \
y_true=target['Revenue'], average='macro')
Here, we've set the baseline model to predict 0 for every row by repeating the value so that there are as many predictions as there are rows in the target dataset.
Note
The average parameter in the precision_recall_fscore_support function has to be set to macro because when it is set to binary, as it was previously, the function reports scores only for the positive class (the true values), and our baseline model never predicts that class; it consists only of false values.
- Print the final output for precision, recall, and fscore:
print(f'Precision: {precision:.4f}\nRecall: {recall:.4f}\n'
      f'fscore: {fscore:.4f}')
The preceding code produces the following output:
Precision: 0.4226
Recall: 0.5000
fscore: 0.4581
Now, we have a baseline model that we can compare to our previous model, as well as any subsequent models. By doing this, we can tell that while the accuracy of our previous model seemed high, it did not score much better than this baseline model.
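To see where macro-averaged scores for an all-zero baseline come from, it can help to compute them by hand. The following is a minimal sketch in plain Python on a made-up target with 80% zeros; the helper names are hypothetical, and undefined ratios (such as the precision of a class that is never predicted) are counted as 0, which is our understanding of how scikit-learn treats them by default (with a warning):

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 for one class; undefined ratios count as 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def macro_scores(y_true, y_pred):
    """Unweighted mean of the per-class scores over the labels in y_true."""
    labels = sorted(set(y_true))
    per_class = [per_class_scores(y_true, y_pred, lab) for lab in labels]
    return tuple(sum(column) / len(labels) for column in zip(*per_class))

# Toy target: 80% zeros, and a baseline that always predicts 0
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
macro_precision, macro_recall, macro_f1 = macro_scores(y_true, y_pred)
print(f'{macro_precision:.4f} {macro_recall:.4f} {macro_f1:.4f}')
```

For this toy data, the snippet prints 0.4000 0.5000 0.4444. Class 0 has a precision of 0.8 and a recall of 1.0, class 1 scores 0 on everything, and the macro average is the plain mean of the two classes. This is why a majority-class baseline always has a macro recall of 0.5 in a binary problem.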
Note
To access the source code for this specific section, please refer to https://packt.live/31MD1jH.
You can also run this example online at https://packt.live/2VFFSXO.
Regularization
Earlier in this chapter, we learned about overfitting and what it looks like. The hallmark of overfitting is a model that performs extremely well on the data it was trained on yet performs poorly on test data. One reason for this could be that the model relies too heavily on certain features that lead to good performance on the training dataset but do not generalize well to new observations or to the test dataset.
One technique that can be used to avoid this is called regularization. Regularization constrains the values of the coefficients toward zero, which discourages a complex model. There are many different types of regularization techniques. For example, in linear and logistic regression, ridge and lasso regularization are most common. In tree-based models, limiting the maximum depth of the trees acts as regularization.
Two of the most common types of regularization are L1 and L2. Both add a penalty term to the loss function that the model minimizes. This term is either the squared L2 norm (the sum of the squared values) of the weights or the L1 norm (the sum of the absolute values) of the weights. Since L1 regularization is able to reduce the coefficients of features to exactly zero, it acts as a feature selector. We can use the output of this model to observe which features do not contribute much to the performance and remove them entirely if desired. L2 regularization shrinks the coefficients toward zero but generally does not reduce them to exactly zero, so we will typically observe that they all have non-zero values.
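The different behavior of the two penalties can be illustrated with a one-weight problem. The sketch below is an assumption-laden toy, not the optimization that scikit-learn performs: it takes an unpenalized weight a and returns the weight that minimizes 0.5*(w - a)**2 plus the penalty, for which closed-form solutions exist. All function names are hypothetical:

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)

def lasso_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w), known as soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def ridge_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*w**2."""
    return a / (1 + 2 * lam)

unpenalized = [3.0, -0.4, 0.2, -1.5]
lam = 0.5
lasso_weights = [lasso_shrink(a, lam) for a in unpenalized]
ridge_weights = [ridge_shrink(a, lam) for a in unpenalized]
print(lasso_weights)
print(ridge_weights)
```

With these toy values, the L1 solution sets the two small weights to exactly zero, while the L2 solution scales every weight down by the same factor and leaves all of them non-zero.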
The following code shows how to instantiate the models using these regularization techniques:
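from sklearn.linear_model import LogisticRegressionCV
# Cs holds the candidate inverse regularization strengths to search;
# the values below are illustrative assumptions
Cs = [0.01, 0.1, 1, 10, 100]
model_l1 = LogisticRegressionCV(Cs=Cs, penalty='l1', \
cv=10, solver='liblinear', \
random_state=42)
model_l2 = LogisticRegressionCV(Cs=Cs, penalty='l2', \
cv=10, random_state=42)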
The following code shows how to fit the models:
model_l1.fit(X_train, y_train['Revenue'])
model_l2.fit(X_train, y_train['Revenue'])
The same concepts in lasso and ridge regularization can be applied to ANNs. However, penalization occurs on the weight matrices rather than the coefficients. Dropout is another form of regularization that's used to prevent overfitting in ANNs. Dropout randomly selects nodes at each iteration and removes them, along with their connections, as shown in the following figure:
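In code, dropout applied to the outputs of a single layer might be sketched as follows. This is a plain-Python illustration under stated assumptions rather than how any particular deep learning library implements it: the function name is hypothetical, and the rescaling of the surviving activations by 1/(1 - rate) (often called inverted dropout) is a common convention that the text above does not describe:

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0 <= rate < 1:
        raise ValueError('rate must be in [0, 1)')
    keep_prob = 1 - rate
    # A fresh random mask is drawn every time the function is called
    mask = [rng.random() < keep_prob for _ in activations]
    return [a / keep_prob if keep else 0.0
            for a, keep in zip(activations, mask)]

rng = random.Random(42)
layer_output = [0.5, 1.2, -0.3, 0.8, 2.0, -1.1]
dropped = dropout(layer_output, rate=0.5, rng=rng)
print(dropped)
```

Because a new mask is drawn at each call, a different subset of nodes is removed at each training iteration. Dropout is applied during training only; at prediction time, all nodes are kept.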
Cross-Validation
Cross-validation is often used in conjunction with regularization to help tune hyperparameters. Take, for example, the penalization parameter in ridge and lasso regression, or the proportion of nodes to drop out at each iteration using the dropout technique with ANNs. How will you determine which parameter value to use? One way is to run models for each value of the regularization parameter and evaluate them on the test set; however, repeatedly using the test set to choose hyperparameters leaks information from it into the model, so the final test score becomes an overly optimistic estimate of performance on new data.
One popular example of cross-validation is called k-fold cross-validation. This technique gives us the ability to test our model on unseen data while retaining a test set that we will use to test at the end. Using this method, the data is divided into k subsets. In each of the k iterations, k-1 of the subsets are used as training data and the remaining subset is used as a validation set. This is repeated k times until all k subsets have been used as validation sets.
By using this technique, there is a significant reduction in bias, since most of the data is used for fitting. There is also a reduction in variance, since every observation is eventually used for validation. Typically, there are between 5 and 10 folds, and the technique can even be stratified, which is useful when there is a large imbalance of classes.
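The index bookkeeping behind k-fold cross-validation can be sketched in a few lines of plain Python. The helper names below are hypothetical, the 20% hold-out and k=5 are example settings, and the split is not stratified:

```python
import random

def train_test_split_indices(n_rows, test_fraction, seed):
    """Shuffle the row indices and hold out a fraction as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k):
    """Yield (train, validation) index lists; each row validates exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

train_idx, test_idx = train_test_split_indices(n_rows=100,
                                               test_fraction=0.2, seed=42)
splits = list(k_fold_indices(train_idx, k=5))
for fold_number, (train, validation) in enumerate(splits, start=1):
    print(fold_number, len(train), len(validation))
```

With 100 rows, each of the 5 iterations trains on 64 rows and validates on 16, while the 20 held-out test rows are never touched until the final evaluation.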
The following example shows 5-fold cross-validation with 20% of the data being held out as a test set. The remaining 80% is separated into 5 folds. Four of those folds comprise the training data, and the remaining fold is the validation data. This is repeated a total of five times until every fold has been used once for validation:
Activity 1.01: Adding Regularization to the Model
In this activity, we will utilize the same logistic regression model from the scikit-learn package. This time, however, we will add regularization to the model and search for the optimum regularization parameter—a process often called hyperparameter tuning. After training the models, we will test the predictions and compare the model evaluation metrics to those produced by the baseline model and the model without regularization.
The steps we will take are as follows:
- Load in the feature and target datasets of the online shoppers purchasing intention dataset from '../data/OSI_feats_e3.csv' and '../data/OSI_target_e2.csv'.
- Create training and test datasets for each of the feature and target datasets. The training datasets will be used to train on, and the models will be evaluated using the test datasets.
- Instantiate a model instance of the LogisticRegressionCV class of scikit-learn's linear_model package.
- Fit the model to the training data.
- Make predictions on the test dataset using the trained model.
- Evaluate the models by comparing how they scored against the true values using the evaluation metrics.
After implementing these steps, you should get the following expected output:
l1
Precision: 0.7300
Recall: 0.4078
fscore: 0.5233
l2
Precision: 0.7350
Recall: 0.4106
fscore: 0.5269
Note
The solution for this activity can be found on page 348.
This activity has taught us how to use regularization in conjunction with cross-validation to appropriately score a model. We have learned how to fit a model to data using regularization and cross-validation. Regularization is an important technique for reducing the risk that models overfit the training data. Models that have been trained with a well-chosen amount of regularization tend to perform better on new data, which is generally the goal of machine learning models: to predict a target when given new observations of the input data. Choosing the optimal regularization parameter may require iterating over a number of different choices.
Cross-validation is a technique that's used to determine which set of regularization parameters fits the data best. Cross-validation trains multiple models with different values for the regularization parameters on different cuts of the data. Because the test set is not used for this selection, the technique helps us choose a good set of regularization parameters without biasing the final evaluation, and averaging the scores over several folds reduces the variance of the performance estimate.
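The selection loop described above can be sketched without any libraries. To keep the example self-contained, it swaps the logistic regression model used in the activity for a one-feature ridge regression with a closed-form solution; the synthetic data, the candidate penalty values, and all function names are assumptions made for illustration:

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Slope minimizing sum((y - w*x)**2) + lam*w**2 (no intercept)."""
    return (sum(x * y for x, y in zip(xs, ys))
            / (sum(x * x for x in xs) + lam))

def mse(xs, ys, w):
    """Mean squared error of the predictions w*x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validated_mse(xs, ys, lam, k):
    """Mean validation MSE over k folds for one penalty strength."""
    scores = []
    for i in range(k):
        val = [j for j in range(len(xs)) if j % k == i]
        trn = [j for j in range(len(xs)) if j % k != i]
        w = fit_ridge_1d([xs[j] for j in trn], [ys[j] for j in trn], lam)
        scores.append(mse([xs[j] for j in val], [ys[j] for j in val], w))
    return sum(scores) / k

# Synthetic training data: y is roughly 2*x plus noise
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# Score every candidate penalty with 5-fold cross-validation, keep the best
candidate_lams = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: cross_validated_mse(xs, ys, lam, k=5)
             for lam in candidate_lams}
best_lam = min(cv_scores, key=cv_scores.get)
print(best_lam, cv_scores[best_lam])
```

The test set plays no part in this loop. Once the best penalty is chosen, the model would be refit on all of the training data and evaluated once on the held-out test set.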