Hands-On Unsupervised Learning with Python

上QQ阅读APP看书，第一时间看更新

Semi-supervised learning algorithms

A semi-supervised scenario can be considered as a standard supervised one that exploits some features belonging to unsupervised learning techniques. A very common problem, in fact, arises when it's easy to obtain large unlabeled datasets but the cost of labeling is very high. Hence, it's reasonable to label only a fraction of the samples and to propagate the labels to all unlabeled ones whose distance from a labeled sample is below a predefined threshold. If the dataset has been drawn from a single data generating process and the labeled samples are uniformly distributed, a semi-supervised algorithm can achieve an accuracy comparable with a supervised one. In this book, we are not discussing these algorithms; however, it's helpful to briefly introduce two very important models:

Label propagation
Semi-supervised Support Vector Machines

The first one is called label propagation and its goal is to propagate the labels of a few samples to a larger population. This goal is achieved by considering a graph where each vertex represents a sample and every edge is weighted using a distance function. Through an iterative procedure, all labeled samples will send a fraction of their label values to all their neighbors and the process is repeated until the labels stop changing. This system has a stable point (that is, a configuration that cannot evolve anymore) and the algorithm can easily reach it with a limited number of iterations.

Label propagation is extremely helpful in all those contexts where some samples can be labeled according to a similarity measure. For example, an online store could have a large base of customers, but only 10% have disclosed their gender. If the feature vectors are rich enough to represent the common behavior of male and female users, it's possible to employ the label propagation algorithm to guess the gender of customers who haven't disclosed it. Of course, it's important to remember that all the assignments are based on the assumption that similar samples have the same label. This can be true in many situations, but it can also be misleading when the complexity of the feature vectors increases.

Another important family of semi-supervised algorithms is based on the extension of standard SVM, (short for Support Vector Machine) to datasets containing unlabeled samples. In this case, we don't want to propagate existing labels, but rather the classification criterion. In other words, we want to train the classifier using the labeled dataset and extend the discriminative rule to the unlabeled samples as well.

Contrary to the standard procedure that can only evaluate unlabeled samples, a semi-supervised SVM uses them to correct the separating hyperplane. The assumption is always based on the similarity: if A has label 1 and the unlabeled sample B has d(A, B) < ε (where ε is a predefined threshold), it's reasonable to assume that the label of B is also 1. In this way, the classifier can achieve high accuracy on the whole dataset even if only a subset has been manually labeled. Similar to label propagation, these kinds of model are reliable only when the structure of the dataset is not extremely complex and, in particular, when the similarity assumption holds (unfortunately there are some cases where it's extremely difficult to find a suitable distance metric, hence many similar samples are indeed dissimilar and vice versa).