Data formats
In a supervised learning problem, there will always be a dataset, defined as a finite set of real vectors with m features each:

X = {x₁, x₂, ..., xₙ}, where each xᵢ ∈ ℝᵐ
Considering that our approach is always probabilistic, we need to consider the dataset X as drawn from a statistical multivariate distribution D. For our purposes, it's also useful to add a very important condition on the whole dataset X: we expect all samples to be independent and identically distributed (i.i.d.). This means all variables belong to the same distribution D, and, considering an arbitrary subset of m values, it happens that:

P(x₁, x₂, ..., xₘ) = P(x₁)P(x₂)...P(xₘ)
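As a minimal sketch of this setup (the three-dimensional Gaussian below is an arbitrary stand-in for D), a dataset of n i.i.d. samples with m features can be drawn and stored as an (n × m) matrix:

import numpy as np

# Arbitrary stand-in for the multivariate distribution D (m = 3 features)
mean = np.zeros(3)
cov = np.eye(3)

# n = 100 samples drawn independently from the same distribution (i.i.d.)
X = np.random.multivariate_normal(mean, cov, size=100)
print(X.shape)  # (100, 3): one row per sample, one column per feature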
The corresponding output values can be either numerical-continuous or categorical. In the first case, the process is called regression, while in the second, it is called classification. Examples of numerical outputs are:

yᵢ ∈ ℝ (for instance, a predicted price or a temperature)
Categorical examples are:

yᵢ ∈ {red, green, blue} or yᵢ ∈ {spam, not spam}
We define a generic regressor as a vector-valued function r(...) which associates an input vector with a continuous output, and a generic classifier as a vector-valued function c(...) whose predicted output is categorical (discrete). If they also depend on an internal parameter vector which determines the actual instance of a generic predictor, the approach is called parametric learning:

y = r(x; θ) for regression and y = c(x; θ) for classification
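As a purely illustrative sketch (the linear form and the thresholding rule below are arbitrary choices, not a prescribed model), a parametric regressor and classifier can be written as functions of both the input vector and the parameter vector:

import numpy as np

def r(x, theta):
    # Parametric regressor: a continuous output determined by the parameter vector theta
    return float(np.dot(theta, x))

def c(x, theta):
    # Parametric classifier: a discrete label obtained by thresholding the same quantity
    return 1 if np.dot(theta, x) > 0.0 else 0

# Changing theta selects a different instance of the same predictor family
theta = np.array([0.5, -1.0])
print(r(np.array([2.0, 1.0]), theta), c(np.array([2.0, 1.0]), theta))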
On the other hand, non-parametric learning doesn't make initial assumptions about the family of predictors (for example, defining a generic parameterized version of r(...) and c(...)). A very common non-parametric family is called instance-based learning and makes real-time predictions (without pre-computing parameter values) based on hypotheses determined only by the training samples (the instance set). A simple and widespread approach adopts the concept of neighborhoods (with a fixed radius). In a classification problem, a new sample is automatically surrounded by already-classified training elements, and the output class is determined by the preponderant one in its neighborhood. In this book, we're going to talk about another very important algorithm family belonging to this class: kernel-based support vector machines. More examples can be found in Russell S., Norvig P., Artificial Intelligence: A Modern Approach, Pearson.
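A minimal sketch of the fixed-radius neighborhood idea, using scikit-learn's RadiusNeighborsClassifier on a tiny synthetic instance set (the radius value is an arbitrary choice):

import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

# Instance set: a few already-classified training samples
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
Y_train = np.array([0, 0, 1, 1])

# A new sample is assigned the preponderant class among the
# training elements that fall inside a fixed radius around it
rnc = RadiusNeighborsClassifier(radius=0.5)
rnc.fit(X_train, Y_train)
print(rnc.predict([[0.1, 0.05], [1.05, 1.0]]))  # [0 1]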
The internal dynamics and the interpretation of all elements are peculiar to each algorithm, so for now we prefer not to talk about thresholds or probabilities, and to work instead with an abstract definition. A generic parametric training process must find the best parameter vector that minimizes the regression/classification error on a specific training dataset, and it should also generate a predictor that can generalize correctly when unknown samples are provided.
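As a minimal sketch of this abstract definition (assuming a hypothetical one-parameter linear regressor and a squared-error measure, with scipy.optimize.minimize standing in for whatever optimizer a real algorithm would use):

import numpy as np
from scipy.optimize import minimize

# Synthetic training dataset: y is roughly 3x plus a small Gaussian noise term
X_train = np.linspace(0.0, 1.0, 50)
Y_train = 3.0 * X_train + np.random.normal(0.0, 0.1, size=50)

def error(theta):
    # Mean squared regression error of the parametric model r(x; theta) = theta * x
    return np.mean((Y_train - theta[0] * X_train) ** 2)

# The training process searches for the parameter vector that minimizes the error
result = minimize(error, x0=np.array([0.0]))
print(result.x)  # close to [3.0]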
Another interpretation can be expressed in terms of additive noise:

y = r(x; θ) + n, where n is a noise term
For our purposes, we can expect zero-mean and low-variance Gaussian noise added to a perfect prediction. A training task must increase the signal-to-noise ratio by optimizing the parameters. Of course, whenever such a term doesn't have zero mean (independently of the other X values), it probably means that there's a hidden trend that must be taken into account (maybe a feature that has been prematurely discarded). On the other hand, a high noise variance means that X is dirty and its measurements are not reliable.
Until now we've assumed that both regression and classification operate on m-length vectors but produce a single value or single label (in other words, an input vector is always associated with only one output element). However, there are many strategies to handle multi-label classification and multi-output regression.
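For instance (a minimal sketch; the wrapped estimator and the synthetic labels below are arbitrary choices), scikit-learn provides meta-estimators such as MultiOutputClassifier, which fits one classifier per output label:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic dataset: 100 samples, 4 features, 3 binary labels per sample
X = np.random.uniform(-1.0, 1.0, size=(100, 4))
Y = (X[:, :3] > 0.0).astype(int)

# One LogisticRegression is trained independently for each of the three outputs
multi = MultiOutputClassifier(LogisticRegression())
multi.fit(X, Y)
print(multi.predict(X[:2]))  # a (2, 3) array: three labels per sample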
In unsupervised learning, we normally only have an input set X with m-length vectors, and we define a clustering function (with n target clusters) with the following expression:

kᵢ = c(xᵢ), where kᵢ ∈ {1, 2, ..., n}
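A minimal sketch of such a clustering function, using scikit-learn's KMeans on synthetic unlabeled data (the number of clusters is an arbitrary choice here):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled dataset: two well-separated groups of two-dimensional samples
X = np.vstack([
    np.random.normal(0.0, 0.2, size=(50, 2)),
    np.random.normal(3.0, 0.2, size=(50, 2)),
])

# The fitted model plays the role of the clustering function:
# it maps every input vector to one of the n target cluster indexes
km = KMeans(n_clusters=2)
km.fit(X)
print(km.predict([[0.1, -0.1], [2.9, 3.1]]))  # two different cluster labels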
In many scikit-learn models (the linear ones in particular), the fitted estimator exposes an instance variable coef_ which contains the trained coefficients (the intercept, when present, is stored separately in intercept_). For example, in a single-parameter linear regression (we're going to discuss it widely in the next chapters), the output will be:
>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression()
>>> model.fit(X, Y)
>>> model.coef_
array([ 9.10210898])