Evaluating the performance of your model
Evaluating the predictive performance of a model requires defining a measure of the quality of its predictions. Several metrics are available for both regression and classification. Amazon ML uses the following ones:
- RMSE for regression: The root mean squared error is the square root of the mean of the squared differences between the true outcome values and their predictions:
RMSE = sqrt( (1/N) * Σ (y_i - ŷ_i)^2 )
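As a quick illustration (a minimal sketch using NumPy rather than Amazon ML itself, with made-up values), RMSE can be computed from arrays of true values and predictions as follows:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared
    difference between true values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical regression sample
print(rmse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # ~0.913
```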
- F1 score and ROC-AUC for classification: Amazon ML uses logistic regression for binary classification problems. For each prediction, logistic regression returns a value between 0 and 1, interpreted as the probability that the sample belongs to one of the two classes. A probability lower than 0.5 indicates that the sample belongs to the first class, while a probability higher than 0.5 indicates that it belongs to the second class. The decision therefore depends heavily on the threshold value, which we can modify.
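As a minimal sketch (the probabilities below are made up for illustration, not actual Amazon ML output), this is how a decision threshold turns predicted probabilities into class labels:

```python
import numpy as np

# Hypothetical probabilities returned by a logistic regression model
probabilities = np.array([0.12, 0.48, 0.51, 0.97])

# The default cut-off; moving it changes which samples are labelled positive
threshold = 0.5

# Probability above the threshold -> second class (1), otherwise first class (0)
predicted_classes = (probabilities > threshold).astype(int)
print(predicted_classes)  # [0 0 1 1]
```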
- Denoting one class positive and the other negative, we have four possibilities, depicted in the following table:

|                   | Predicted positive  | Predicted negative  |
|-------------------|---------------------|---------------------|
| Actually positive | True Positive (TP)  | False Negative (FN) |
| Actually negative | False Positive (FP) | True Negative (TN)  |

- This matrix is called a confusion matrix (https://en.wikipedia.org/wiki/Confusion_matrix). It defines four indicators of the performance of a classification model:
- TP: How many Yes were correctly predicted Yes
- FP: How many No were wrongly predicted Yes
- FN: How many Yes were wrongly predicted No
- TN: How many No were correctly predicted No
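As a small illustration (using scikit-learn rather than Amazon ML, with made-up labels), these four counts can be obtained from a set of true labels and thresholded predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and thresholded predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels [0, 1], scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, fp, fn, tn)  # 3 1 1 3
```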
- From these four indicators, we can define the following metrics:
- Recall: The fraction of actual positives that are correctly predicted as positive. Recall is also called the True Positive Rate (TPR) or sensitivity. It is the probability of detection:
Recall = TP / (TP + FN)
- Precision: The fraction of predicted positives that are actually positive:
Precision = TP / (TP + FP)
- False Positive Rate: The fraction of actual negatives that are wrongly predicted as positive. It is the probability of false alarm:
FPR = FP / (FP + TN)
- Finally, the F1-score is the harmonic mean of precision and recall, and is given by the following:
F1-score = 2TP / (2TP + FP + FN)
- An F1 score is always between 0 and 1, with 1 being the best value and 0 the worst.
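As a minimal sketch (reusing the hypothetical counts from the confusion matrix example above), all four metrics can be computed directly from TP, FP, FN, and TN:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute recall, precision, false positive rate and F1-score
    from the four confusion matrix counts."""
    recall = tp / (tp + fn)           # true positive rate / sensitivity
    precision = tp / (tp + fp)        # fraction of predicted positives that are correct
    fpr = fp / (fp + tn)              # probability of false alarm
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return recall, precision, fpr, f1

# Hypothetical counts: TP=3, FP=1, FN=1, TN=3
print(classification_metrics(tp=3, fp=1, fn=1, tn=3))  # (0.75, 0.75, 0.25, 0.75)
```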
As noted previously, these scores all depend on the threshold used to interpret the result of the logistic regression and decide which class a prediction belongs to. We can choose to vary that threshold. This is where the ROC-AUC comes in.
If you plot the True Positive Rate (Recall) against the False Positive Rate for different values of the decision threshold, you obtain a graph like the following, called the Receiver Operating Characteristic or ROC curve:
- The diagonal line indicates an equal probability of belonging to either class. The closer the curve is to the upper-left corner, the better your model performs.
- The ROC curve has been widely used since WWII, when it was first invented to detect enemy planes in radar signals.
- Once you have the ROC curve, you can calculate the Area Under the Curve or AUC.
- The AUC gives you a single score for your model that takes into account all possible values of the probability threshold from 0 to 1. The higher the AUC, the better.
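As a rough sketch (using scikit-learn and made-up labels and scores rather than Amazon ML output), the ROC curve and its AUC can be computed and plotted like this:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

# One (FPR, TPR) point per candidate decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print("AUC:", auc)  # 0.875 for this toy data

# Plot the ROC curve against the diagonal "random guess" line
plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```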