Evaluating the performance of your model
Evaluating the predictive performance of a model requires defining a measure of the quality of its predictions. Several metrics are available for both regression and classification. Amazon ML uses the following ones:
- RMSE for regression: The root mean squared error is the square root of the mean of the squared differences between the true outcome values and their predictions:
RMSE = sqrt( (1/N) * Σ (y_i - ŷ_i)^2 )
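As a quick illustration (a minimal sketch using NumPy rather than Amazon ML itself, with made-up values), RMSE can be computed from arrays of true values and predictions as follows:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared
    difference between true values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical regression sample
print(rmse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # ~0.913
```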
- F1 score and ROC-AUC for classification: Amazon ML uses logistic regression for binary classification problems. For each prediction, logistic regression returns a value between 0 and 1, interpreted as the probability that the sample belongs to one of the two classes. A probability lower than 0.5 indicates that the sample belongs to the first class, while a probability higher than 0.5 indicates that it belongs to the second class. The decision therefore depends heavily on the threshold value, which we can modify.
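As a minimal sketch (the probabilities below are made up for illustration, not actual Amazon ML output), this is how a decision threshold turns predicted probabilities into class labels:

```python
import numpy as np

# Hypothetical probabilities returned by a logistic regression model
probabilities = np.array([0.12, 0.48, 0.51, 0.97])

# The default cut-off; moving it changes which samples are labelled positive
threshold = 0.5

# Probability above the threshold -> second class (1), otherwise first class (0)
predicted_classes = (probabilities > threshold).astype(int)
print(predicted_classes)  # [0 0 1 1]
```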
- Denoting one class positive and the other negative, we have four possibilities, depicted in the following table:

|                   | Predicted positive  | Predicted negative  |
|-------------------|---------------------|---------------------|
| Actually positive | True Positive (TP)  | False Negative (FN) |
| Actually negative | False Positive (FP) | True Negative (TN)  |

- This matrix is called a confusion matrix (https://en.wikipedia.org/wiki/Confusion_matrix). It defines four indicators of the performance of a classification model:
- TP: How many Yes were correctly predicted Yes
- FP: How many No were wrongly predicted Yes
- FN: How many Yes were wrongly predicted No
- TN: How many No were correctly predicted No
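As a small illustration (using scikit-learn rather than Amazon ML, with made-up labels), these four counts can be obtained from a set of true labels and thresholded predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and thresholded predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels [0, 1], scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, fp, fn, tn)  # 3 1 1 3
```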
- From these four indicators, we can define the following metrics:
- Recall: The fraction of actual positives that are correctly predicted as positive. Recall is also called the True Positive Rate (TPR) or sensitivity. It is the probability of detection:
Recall = TP / (TP + FN)
- Precision: The fraction of predicted positives that are actually positive:
Precision = TP / (TP + FP)
- False Positive Rate: The fraction of actual negatives that are wrongly predicted as positive. It is the probability of false alarm:
FPR = FP / (FP + TN)
- Finally, the F1-score is the harmonic mean of precision and recall, and is given by the following:
F1-score = 2TP / (2TP + FP + FN)
- An F1 score is always between 0 and 1, with 1 being the best value and 0 the worst.
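As a minimal sketch (reusing the hypothetical counts from the confusion matrix example above), all four metrics can be computed directly from TP, FP, FN, and TN:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute recall, precision, false positive rate and F1-score
    from the four confusion matrix counts."""
    recall = tp / (tp + fn)           # true positive rate / sensitivity
    precision = tp / (tp + fp)        # fraction of predicted positives that are correct
    fpr = fp / (fp + tn)              # probability of false alarm
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return recall, precision, fpr, f1

# Hypothetical counts: TP=3, FP=1, FN=1, TN=3
print(classification_metrics(tp=3, fp=1, fn=1, tn=3))  # (0.75, 0.75, 0.25, 0.75)
```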
As noted previously, these scores all depend on the threshold used to interpret the result of the logistic regression and decide which class a prediction belongs to. We can choose to vary that threshold. This is where the ROC-AUC comes in.
If you plot the True Positive Rate (Recall) against the False Positive Rate for different values of the decision threshold, you obtain a graph like the following, called the Receiver Operating Characteristic or ROC curve:
- The diagonal line indicates an equal probability of belonging to either class. The closer the curve is to the upper-left corner, the better your model performs.
- The ROC curve has been widely used since WWII, when it was first invented to detect enemy planes in radar signals.
- Once you have the ROC curve, you can calculate the Area Under the Curve or AUC.
- The AUC gives you a single score for your model that takes into account all possible values of the probability threshold from 0 to 1. The higher the AUC, the better.
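As a rough sketch (using scikit-learn and made-up labels and scores rather than Amazon ML output), the ROC curve and its AUC can be computed and plotted like this:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

# One (FPR, TPR) point per candidate decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print("AUC:", auc)  # 0.875 for this toy data

# Plot the ROC curve against the diagonal "random guess" line
plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```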