Overfitting
Overfitting occurs when a model fits the training data so closely that it fails to generalize to new data.
Say you have a single predictor of an outcome, and that the data follows a quadratic pattern:
- You fit a linear regression on that data, and the predictions are weak. Your model is underfitting the data: the error is high on both the training and the validation datasets.
- You add the square of the predictor to the model and find that it makes good predictions. The errors on the training and validation datasets are comparable and lower than for the simpler model.
- If you increase the number and power of the polynomial features so that the model is now a high-order polynomial, you end up fitting the training data too closely. The model has a very low prediction error on the training dataset but generalizes poorly: the prediction error on the validation dataset remains high.
This is a case of overfitting.
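To make this concrete, here is a minimal sketch of the three fits using scikit-learn. The quadratic data-generating process, the noise level, and the train/validation split below are illustrative assumptions, not details taken from the example above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following a quadratic pattern (an assumed ground truth).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=1.0, size=100)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Degree 1 underfits, degree 2 matches the pattern, degree 16 overfits.
for degree in (1, 2, 16):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  train MSE={train_mse:.2f}  validation MSE={val_mse:.2f}")
```

Typically the degree-1 model shows a high error on both sets, the degree-2 errors are low and close together, and the degree-16 fit opens a gap between the training and validation errors.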
The following graph shows an example of a model overfitting the previous quadratic dataset, obtained by setting a high order for the polynomial regression (n = 16). The polynomial regression fits the training data so closely that it would be incapable of making reliable predictions on new data, whereas the quadratic model (n = 2) would be more robust:
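A figure of this kind can be sketched with matplotlib; the synthetic dataset below is again an assumption, not the original data behind the graph:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# A small, noisy quadratic sample so the degree-16 fit visibly wiggles.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=1.0, size=30)

grid = np.linspace(-3, 3, 300).reshape(-1, 1)
plt.scatter(X, y, color="black", label="training data")
for degree, style in ((2, "-"), (16, "--")):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    plt.plot(grid.ravel(), model.predict(grid), style, label=f"degree {degree}")
plt.legend()
plt.show()
```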
The best way to detect overfitting is therefore to compare the prediction errors on the training and validation sets: a significant gap between the two implies overfitting. A way to prevent overfitting is to add constraints to the model; in machine learning, this is done through regularization.
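As a minimal sketch of that idea, ridge regression adds an L2 penalty that shrinks the coefficients of the degree-16 model; the `alpha` value below is an illustrative assumption that would normally be tuned on the validation set:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=1.0, size=100)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Same degree-16 features as before, but the L2 penalty (alpha) constrains the
# coefficients; scaling the features keeps the penalty comparable across terms.
model = make_pipeline(
    PolynomialFeatures(16, include_bias=False), StandardScaler(), Ridge(alpha=1.0)
)
model.fit(X_train, y_train)
print("train MSE:     ", mean_squared_error(y_train, model.predict(X_train)))
print("validation MSE:", mean_squared_error(y_val, model.predict(X_val)))
```

Increasing `alpha` constrains the model more strongly; too large a value pushes it back toward underfitting.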