The statistical approach versus the machine learning approach
In 2001, Leo Breiman published a paper titled Statistical Modeling: The Two Cultures (http://projecteuclid.org/euclid.ss/1009213726) that highlighted the differences between the statistical approach, which focuses on validating and explaining the process underlying the data, and the machine learning approach, which is primarily concerned with predictive results.
Roughly put, a classic statistical analysis follows steps such as the following:
- A hypothesis, called the null hypothesis, is stated. The null hypothesis usually asserts that the observation is due to randomness.
- The probability (or p-value) of the event under the null hypothesis is then calculated.
- If that probability is below a certain threshold (usually p < 0.05), then the null hypothesis is rejected, meaning the observation is unlikely to be a random fluke.
p > 0.05 does not imply that the null hypothesis is true. It only means that you cannot reject it, because the probability of the observation happening by chance is too large to rule chance out.
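The steps above can be sketched with a simple permutation test, which computes a p-value directly from the definition: how often would a difference at least as large as the observed one arise if the group labels were random? The data values here are made up for illustration.

```python
import random

# Two small, illustrative samples: did group B really outperform
# group A, or could the observed gap be a random fluke?
a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
b = [12.6, 12.9, 12.5, 13.0, 12.7, 12.8]

observed = sum(b) / len(b) - sum(a) / len(a)

# Null hypothesis: the group labels are arbitrary. Shuffle the
# pooled data many times and count how often a gap at least as
# large as the observed one appears by chance alone.
pooled = a + b
random.seed(0)
n_iter = 10_000
extreme = 0
for _ in range(n_iter):
    random.shuffle(pooled)
    diff = sum(pooled[len(a):]) / len(b) - sum(pooled[:len(a)]) / len(a)
    if diff >= observed:
        extreme += 1

p_value = extreme / n_iter
print(f"observed gap: {observed:.2f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("reject the null hypothesis: the gap is unlikely to be chance")
else:
    print("cannot reject the null hypothesis")
```

Because every value in `b` exceeds every value in `a`, almost no random relabeling reproduces a gap this large, so the p-value comes out well below 0.05 and the null hypothesis is rejected.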
This methodology is geared toward explaining the phenomenon and discovering its influencing factors. The goal is to establish a somewhat static, fully known model that fits the observations as well as possible and can therefore predict future patterns, behaviors, and observations.
In the machine learning approach to predictive analytics, an explicit representation of the model is not the focus. The goal is to build the model that predicts best, and the model builds itself from the observations. Because the internals of the model are not explicit, this approach is called a black box model.
By removing the need for explicit modeling of the data, the ML approach has stronger predictive potential. ML focuses on making the most accurate predictions possible by minimizing a model's prediction error, at the expense of explainability.
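As a minimal sketch of the black box idea, consider a 1-nearest-neighbour regressor: the "model" is nothing but the stored observations themselves, with no explicit formula to inspect, yet it can still predict new points. The data and function names below are illustrative assumptions, not any particular library's API.

```python
def fit(X, y):
    # "Training" just memorises the observations; there is no
    # explicit equation describing the underlying process.
    return list(zip(X, y))

def predict(model, x):
    # Predict the target of the closest stored observation.
    nearest = min(model, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Made-up data: hours studied -> exam score
X = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [52, 60, 71, 80, 88]

model = fit(X, y)
print(predict(model, 3.2))  # closest observation is x=3.0, so predicts 71
```

The prediction works, but asking *why* the model returned 71 has no answer beyond "that is what the nearest observation did", which is exactly the explainability trade-off described above.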