
Evaluating the performance of your model

Evaluating the predictive performance of a model requires defining a measure of the quality of its predictions. There are several metrics available for both regression and classification. The metrics used in the context of Amazon ML are the following:

  • RMSE for regression: The root mean squared error is the square root of the mean squared difference between the true outcome values and their predictions:
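
RMSE = sqrt( (1/n) * Σ (actual_i - predicted_i)² )
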
  • F1-score and ROC-AUC for classification: Amazon ML uses logistic regression for binary classification problems. For each prediction, logistic regression returns a value between 0 and 1, which is interpreted as the probability that the sample belongs to one of the two classes. A probability lower than 0.5 indicates that the sample belongs to the first class, while a probability higher than 0.5 indicates that it belongs to the second class. The decision therefore depends heavily on the value of the threshold, which we can modify.
  • Denoting one class positive and the other negative, we have four possibilities depicted in the following table:
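
                   Predicted positive      Predicted negative
  Actual positive  True Positive (TP)      False Negative (FN)
  Actual negative  False Positive (FP)     True Negative (TN)
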
  • This matrix is called a confusion matrix (https://en.wikipedia.org/wiki/Confusion_matrix). It defines four indicators of the performance of a classification model:
    • TP: How many Yes samples were correctly predicted as Yes
    • FP: How many No samples were wrongly predicted as Yes
    • FN: How many Yes samples were wrongly predicted as No
    • TN: How many No samples were correctly predicted as No
  • From these four indicators, we can define the following metrics; a short code sketch computing them follows this list:
    • Recall: This is the fraction of actual positives that are correctly predicted as positive. Recall is also called True Positive Rate (TPR) or sensitivity. It is the probability of detection:

Recall = TP / (TP + FN)

    • Precision: This is the fraction of predicted positives that are actually positive:

Precision = TP / (TP + FP)

    • False Positive Rate (FPR): This is the fraction of actual negatives that are wrongly predicted as positive. It is the probability of false alarm:

FPR = FP / (FP + TN)

    • Finally, the F1-score is defined as the harmonic mean of the recall and the precision, and is given by the following:

F1-score = 2TP / (2TP + FP + FN)

    • An F1-score is always between 0 and 1, with 1 being the best value and 0 the worst.
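
To make these definitions concrete, here is a minimal Python sketch (plain NumPy, outside of Amazon ML itself) that computes RMSE on a toy regression example and the four confusion-matrix counts and derived metrics on a toy binary classification example, using the default 0.5 threshold. All the array values are invented for illustration.

import numpy as np

# --- RMSE for a regression example (toy data) ---
y_true = np.array([3.0, 5.0, 2.5, 7.0])    # true outcome values
y_pred = np.array([2.8, 5.4, 2.0, 6.5])    # model predictions
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"RMSE: {rmse:.3f}")

# --- Confusion-matrix metrics for a binary classification example (toy data) ---
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # true classes (1 = positive)
scores = np.array([0.9, 0.4, 0.3, 0.8, 0.6, 0.1, 0.7, 0.2])   # predicted probabilities
threshold = 0.5                                               # default decision threshold
preds = (scores >= threshold).astype(int)

tp = int(np.sum((preds == 1) & (labels == 1)))   # correctly predicted positives
fp = int(np.sum((preds == 1) & (labels == 0)))   # wrongly predicted positives
fn = int(np.sum((preds == 0) & (labels == 1)))   # missed positives
tn = int(np.sum((preds == 0) & (labels == 0)))   # correctly predicted negatives

recall = tp / (tp + fn)           # true positive rate (sensitivity)
precision = tp / (tp + fp)        # fraction of predicted positives that are correct
fpr = fp / (fp + tn)              # false positive rate (probability of false alarm)
f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall

print(f"Recall: {recall:.2f}, Precision: {precision:.2f}, FPR: {fpr:.2f}, F1: {f1:.2f}")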

As noted previously, these scores all depend on the threshold used to interpret the output of the logistic regression and decide which class a prediction belongs to. We can choose to vary that threshold, and this is where the ROC-AUC comes in.

If you plot the True Positive Rate (Recall) against the False Positive Rate for different values of the decision threshold, you obtain a graph called the Receiver Operating Characteristic or ROC curve.
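
To make the construction of the curve concrete, here is a minimal Python sketch (plain NumPy, not Amazon ML's own evaluation code) that sweeps the decision threshold over invented predicted probabilities and computes the (FPR, TPR) point for each threshold value; plotting these points would trace the ROC curve.

import numpy as np

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # true classes (toy data)
scores = np.array([0.9, 0.4, 0.3, 0.8, 0.6, 0.1, 0.7, 0.2])   # predicted probabilities

def roc_point(threshold):
    """Return the (FPR, TPR) point for a given decision threshold."""
    preds = (scores >= threshold).astype(int)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    tn = np.sum((preds == 0) & (labels == 0))
    return fp / (fp + tn), tp / (tp + fn)

# Sweep the threshold from 0 to 1 and collect one ROC point per value
for t in np.linspace(0.0, 1.0, 11):
    fpr, tpr = roc_point(t)
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")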

  • The diagonal line corresponds to random guessing, that is, an equal probability of belonging to either class. The closer the curve is to the upper-left corner, the better your model performs.
  • The ROC curve has been widely used since World War II, when it was first developed to detect enemy planes in radar signals.
  • Once you have the ROC curve, you can calculate the Area Under the Curve or AUC.
  • The AUC gives you a single score for your model, taking into account all the possible values of the probability threshold from 0 to 1. The higher the AUC, the better.
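
To reduce the curve to a single number, libraries such as scikit-learn provide a ready-made AUC computation. The following minimal sketch applies it to the same kind of invented data; this is an illustration, not the computation Amazon ML performs internally.

import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # true classes (toy data)
scores = np.array([0.9, 0.4, 0.3, 0.8, 0.6, 0.1, 0.7, 0.2])   # predicted probabilities

# An AUC of 1.0 means the model always ranks positives above negatives;
# 0.5 corresponds to the diagonal line (random guessing).
print(f"AUC: {roc_auc_score(labels, scores):.3f}")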